Written by: Malika Shah
Peer Reviewer: Akhila Gundavelli
Professional Reviewer: David Witman, PhD.
Reinforcement learning algorithms, such as Q-Learning, have been shown in certain situations to mimic the behavior of biological systems. One example is ant-foraging models, in which each individual ant explores an environment with the purpose of providing food for its colony. Although traditional ant-foraging models have not included explicit reinforcement learning algorithms, they are well suited to this type of approach. In this work, we propose an ant-foraging model that uses Q-Learning to explore a constantly changing terrain governed by the rules of a general Abelian Sandpile Model (ASM). We consider a stochastic version of the ASM in order to show that the learned behavior does not depend on deterministic conditions. Additionally, we show that the agents (ants), given enough time, eventually learn the optimal path to the food; in some cases, however, the terrain prevented the agents from taking it. Our research provides an approach for understanding how Q-Learning algorithms explore a constantly changing, random environment and its associated complexities.
General Ant-Foraging Models
Ant-foraging models, part of the broader topic of optimal foraging theory, describe how ants locate and gather food (Pyke, 2000). Note that the ants will be referred to as agents in our simulation. One goal of optimal foraging theory is to understand multi-agent scenarios, where a set of agents has the ability to interact with an environment (Beckers, Holland, & Deneubourg, 1994; Deneubourg, et al., 1991). In typical ant-foraging models, pheromones are used to communicate information among the agents in the environment (Holldobler & Wilson, 1990). Generally speaking, two types of pheromones have been used in past research. The first represents the smell released by the food in the predefined environment, which the agents use to locate the food source; the second is released by each individual agent after it has successfully located the food, serving as a communication mechanism between agents (Panait & Luke, A pheromone-based utility model for collaborative foraging, 2004; Wolf & Wehner, 2000). This work models the typical characteristics of Cataglyphis fortis, more generally known as desert ants, which can smell food up to 3 meters away (Wolf & Wehner, 2000). Panait and Luke (Learning ant foraging behaviors, 2004) showed through a preliminary ant-foraging model that, given only pheromones as a guide and basic encoded logic, the agents could collectively determine an optimal path. Additionally, they included various obstacles in their environment, including multiple blocks and a rotating "clock" that had to be avoided in order to locate the food source, and showed that the majority of agents were able to find the optimal path (Panait & Luke, A pheromone-based utility model for collaborative foraging, 2004).
1 Pheromones are chemical signals that many species use to communicate a variety of messages, including territorial, alarm, and food-trailing signals (Silva-Junior, et al., 2018). Although not all of the chemicals in pheromones are known, the main one in ants is pyrazine (Silva-Junior, et al., 2018).
Abelian Sandpile Model
The Abelian Sandpile Model, also known as the Bak-Tang-Wiesenfeld Model, has a few general practical applications. We use this model, however, because it provides us with a simple random model on which we can build. The model represents a theoretical simulation of a dynamic environment where grains of sand are continuously added to a defined two-dimensional map, either randomly or systematically (Bak, Tang, & Wiesenfeld, 1987). After a certain amount of time, local "avalanches" occur when a site has exceeded its capacity and the grains topple onto adjacent sites. Eventually, these local avalanches can cascade into a global "avalanche" affecting the entire region.
The ASM we make use of requires a finite, nonnegative square lattice, Z, with sites z_{i,j}, where i and j represent the horizontal and vertical cell indices, respectively; a time-step, k, with 0 ≤ k ≤ T, where T represents the duration of the simulation; and a critical value, K, which denotes a critical number of sand grains (Bak, Tang, & Wiesenfeld, 1987). The model proposed by Bak, Tang, and Wiesenfeld has three steps (Bak, Tang, & Wiesenfeld, 1987):
If any site satisfies z_{i,j} ≥ K, then the site topples and the following occurs:
1. The site gives a grain to each of its four neighbors.
2. The site is left with zero grains.
3. If any other site equals or exceeds the critical value K, the steps are repeated.

If there is more than one "avalanche", neither the order in which the "avalanches" occur nor the order in which the sand grains are distributed matters (Bak, Tang, & Wiesenfeld, 1987). In our simulation, we do not restrict the sand grains to Z, so if a sand grain would be distributed to a site that is not on Z, it leaves the system.
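These toppling rules can be sketched in a few lines of Python (a minimal illustration assuming a list-of-lists lattice and K = 4; the authors' implementation is not reproduced in the text):

```python
import random

def topple(grid, K=4):
    """Relax the lattice: any site holding >= K grains gives one grain to each
    of its four neighbors and is left with zero grains, per the rules above.
    Grains pushed past the edge of the lattice leave the system."""
    n = len(grid)
    unstable = True
    while unstable:
        unstable = False
        for i in range(n):
            for j in range(n):
                if grid[i][j] >= K:
                    grid[i][j] = 0  # the site is left with zero grains
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < n and 0 <= nj < n:
                            grid[ni][nj] += 1  # off-lattice grains are lost
                    unstable = True
    return grid

def drop_grain(grid, K=4):
    """One stochastic ASM step: add a grain at a random site, then relax."""
    n = len(grid)
    grid[random.randrange(n)][random.randrange(n)] += 1
    return topple(grid, K)
```

Because grains at a boundary site are simply lost, repeated `drop_grain` calls reproduce the open-boundary behavior described above.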
Q-Learning
Q-Learning is a reinforcement-learning algorithm used in many current applications, such as robotics, modeling and simulation, and gaming (Watkins, 1989; Chen, Li, & Dong, 2008; Al-Tamimi, Lewis, & Abu-Khalaf, 2007; Mnih, et al., 2013). For example, in robotics, Chen, Li, and Dong (2008) showed how a robot can execute biped balance using Q-Learning. Another notable use of Q-Learning is teaching an agent to play the Atari game Space Invaders; the agent was able to learn how to play the game and earn a high score (Mnih, et al., 2013).
Q-Learning utilizes the Bellman equation, which relies on parameters that quantify states, actions, and rewards, along with two parameters that define learning rates (Watkins, 1989). The state (s) is a numerical quantization of the environment at a given point in time and space, depending on the scenario (Watkins, 1989). In our simulation, the state with respect to each agent is defined by its location on the grid and the corresponding number of sand grains present at that site. At every discrete state, there is a set of actions (a) available to the agent (Watkins, 1989). The actions available to each individual agent are the movements to an adjacent cell (diagonally, left, right, up, or down). Given a state-action pair, a reward (r), a scalar or function value, provides feedback to the agent quantifying how beneficial that action was at that given state (Watkins, 1989). For our study, the agent's reward is represented by the number of pheromones at the site and whether or not it reached the food. The learning rate, α, is a scalar between 0 and 1 that provides a configurable parameter for the feedback to the agent (Watkins, 1989). Theoretically, the greater the learning rate, the faster the agent will learn; however, this is not guaranteed. The discount rate, γ, also set between 0 and 1, allows future rewards to be worth less than immediate rewards (Watkins, 1989). If the discount rate is 0.9, the agent will care more about its long-term rewards. Contrastingly, if the discount rate is 0, the agent will only care about its short-term rewards; this is said to be "greedy" behavior because the agent is trying to maximize its immediate rewards (Russell & Norvig, 2010). The Bellman equation, named after Richard E. Bellman, is an optimization method used in dynamic programming (Dixit, 1990). The Q-Learning algorithm utilizes the following Bellman equation (Watkins, 1989):

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

where s' is the resulting state and the maximum is taken over the actions a' available in s'.
Although all Q-Learning processes use the same equation, Q-Learning can be implemented in various forms, such as a tabular approach or a function approximation via a neural network (Watkins & Dayan, 1992; Mnih, et al., 2013).
A tabular approach, the approach used in this work, implements a Q table: an array of real numbers storing the value of Q for each state and each available action (Watkins, 1989). As an agent explores an environment, it uses the Q table to determine an optimal state-action pair (Watkins, 1989). Given an action procured by the agent, the resultant reward and new state information are used to update the Q table (Watkins, 1989).
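Under the tabular approach, one Bellman update and an ε-greedy action choice can be sketched as follows (a minimal sketch; the eight-move action encoding, ε value, and dictionary-backed table are illustrative assumptions, not the authors' implementation):

```python
import random
from collections import defaultdict

# 8 possible moves: up, down, left, right, and the 4 diagonals
ACTIONS = list(range(8))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Bellman update of the tabular Q-values:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

def choose_action(Q, s, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise pick the best-known action for state s."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

# Each agent keeps its own table, keyed by (state, action) pairs
Q = defaultdict(float)
```

Here a state would be the (grid location, grain count) pair described above, so the table only grows with the states an agent actually visits.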
Using a function approximation (such as a neural network) in lieu of the Q table is generally known as deep Q-learning; this approach uses a neural network to map the state-action relationship instead of storing values in a Q table (Mnih, et al., 2013). As the agent explores the environment, the respective state-action pairs are stored, updated, and used to train the neural network (Mnih, et al., 2013). The neural network then has the ability to approximate the Q table in a more condensed form (Mnih, et al., 2013). These approximations are revised as necessary based on the agent's future experiences in the environment and the network's ability to approximate the given space (Mnih, et al., 2013).
The main goal of our research is to determine how a Q-Learning algorithm applied to an ant-foraging model behaves in a constantly changing, random environment. Previous ant-foraging research has not focused on the aspect of a random environment. The following sections correspond to various components of our simulation. The first section describes the environment that the agents traverse. Then we discuss the constraints imposed upon the agents in the simulation. Finally, we discuss the specific parameters associated with the implemented Q-Learning algorithm.
Our model builds on Panait and Luke's model with the rotating "clock" obstacle (Panait & Luke, A pheromone-based utility model for collaborative foraging, 2004). In our model, we use the ASM to represent the terrain that the agents must traverse. For this work, a stochastic definition of the ASM is used, meaning the grains are placed randomly at different locations over time. This creates a "mosaic" effect in which "hills" of different heights form over time, producing constantly changing terrain.
For this simulation we allow the ASM to create an initial environment before the agents are released onto the domain; we refer to the number of grains that the ASM places before the agents begin to move toward the food as the initial grain count. The nest is located in the top-left corner of the domain, while the food is located in the bottom-right corner, centered on a fixed coordinate pair. The food is not restricted to a single cell, so a radius is defined for the food placement such that multiple cells are contained. We also track the number of sand grains at each site at each time-step. The number of pheromones released is denoted as Np, each pheromone persists on a site for a fixed duration, and the pheromones associated with the food are tracked separately.
To define the agent behavior and logic, we utilize an ant-foraging model similar to that of (Panait & Luke, A pheromone-based utility model for collaborative foraging, 2004). However, our work implements a tabular Q-Learning approach, which completely determines the agent's actions, instead of basing the decision on the pheromones available at a given state. Generally, ants have an impressive memory as it relates to foraging; in fact, about 50% of their memory is solely dedicated to locating food sources (Hourcade, Muenz, Rossler, & Devaud, 2010). Using this information, we make two assumptions with regard to the agent logic. First, each individual agent has its own Q table, which represents its learned knowledge. Second, each agent has full memory of its initial starting point, i.e., the nest. This second assumption allows the agent to return to the nest immediately, via the most direct path, after collecting food.
The rules that we impose on the agents concerning their movement on the terrain are described as follows:
The first rule creates obstacles for the agents, since they can become blocked by the grains of sand on their way to and from the food. Our ASM is a stochastic implementation, seeing as the grains are placed randomly; thus, the obstacle configuration can never be fully known by the agents, no matter how much they explore the environment. The agents must take a holistic approach and be adaptable to different situations.
Once an agent collects food, it releases pheromones on its way back to the nest. Since we have assumed perfect knowledge of the original nest location, each agent makes its way back by calculating the minimum straight-line distance between neighboring sites and the nest. This assumption is supported by Collett, Dillmann, Giger, and Wehner (1992), who found that an ant's memory can accurately pinpoint the nest location during the foraging process. This process allows the agents to take the theoretically optimal path back to the nest; however, the agents still have to follow the rules determined by the terrain.
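The return-trip rule can be sketched as follows (a minimal illustration; the `blocked` set standing in for terrain obstructions is a hypothetical simplification of the terrain rules):

```python
import math

# All 8 neighboring offsets: the 4 cardinal moves plus the 4 diagonals
NEIGHBORS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
             if (di, dj) != (0, 0)]

def step_toward_nest(pos, nest, blocked=frozenset()):
    """Return the neighboring site with the minimum straight-line distance
    to the nest, skipping sites currently blocked by sand."""
    i, j = pos
    candidates = [(i + di, j + dj) for di, dj in NEIGHBORS
                  if (i + di, j + dj) not in blocked]
    return min(candidates, key=lambda c: math.dist(c, nest))
```

Repeating this greedy step traces the most direct unblocked path back to the nest, matching the perfect-nest-memory assumption above.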
The parameters of the Q-Learning algorithm and the simulation as a whole are set as follows:
We completed 50 trials of each experiment in order to obtain a 95% confidence interval. This value was found through an analysis of Monte Carlo sampling trials (Driels & Shin, 2004). The standard deviation, s, the mean, Y, and the estimated error were approximated using preliminary trials (Driels & Shin, 2004). We chose 50 trials in order to account for any error that may have entered the calculation of the minimum number of iterations.
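That trial-count estimate can be sketched as below, assuming the commonly quoted form of the Driels & Shin sample-size expression, n = ((100 · z · s) / (E · Y))², where E is the acceptable percent error; the input values shown are placeholders, not the preliminary-trial numbers from this study:

```python
import math

def required_trials(s, mean, pct_error, z=1.96):
    """Estimate the number of Monte Carlo trials needed so the sample mean
    falls within pct_error percent of the true mean at the given confidence
    level (z = 1.96 for 95%). s and mean come from preliminary trials."""
    return math.ceil(((100.0 * z * s) / (pct_error * mean)) ** 2)

# Placeholder preliminary estimates: s = 10 steps, mean = 100 steps, 5% error
n = required_trials(s=10, mean=100, pct_error=5)
```

Rounding up and then padding the result (as the 50-trial choice does here) guards against error in the preliminary estimates of s and Y.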
Our simulation parameters follow a setup similar to that of (Panait & Luke, A pheromone-based utility model for collaborative foraging, 2004), and our Q-Learning parameter values were chosen following the recommendations of (Even-Dar & Mansour, 2003).
The pseudo code for the simulation:
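As a hedged reconstruction from the components described above (the stochastic ASM terrain step, pheromone decay, ε-greedy movement, and the per-agent Q-table update), one plausible main loop is:

```
for each time-step k = 1, ..., T:
    add one grain at a random site and topple until no site holds K or more grains
    decay the pheromones on every site
    for each agent (each with its own Q table):
        if the agent is carrying food:
            release a pheromone at the current site
            move to the unblocked neighbor with minimum straight-line distance to the nest
            if the agent is at the nest: deposit the food
        else:
            s = (current site, number of sand grains at that site)
            choose action a from s by an epsilon-greedy rule over Q
            attempt the move; the terrain rules may block it
            observe reward r (pheromones present, food reached) and new state s'
            Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))
            if the agent is at a food site: pick up the food
```

The ordering of the terrain step relative to the agent moves is an assumption; the text does not specify it.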
Baseline Experiment: Preliminary observations
For our first experiment, we measured the average number of moves it took a single agent to reach the food across all trips (see Figure 6). If the number of steps decreases with respect to the overall number of trips, this implies "learned" behavior.
From Figure 6 we see that after 15 trips, the agents have decreased the total number of steps to less than 100 steps. This is an improvement from the more than 1500 steps that the agents took when they ventured out for the first time.
After 15 experiences of getting food, the agents were overall roughly 80% more efficient at getting food than on their first experience; however, not all of the agents were taking the optimal path. Although most of the agents were able to learn the environment, some could not: they became stuck when sand grains blocked their path and took thousands of steps to reach the food. This is why the average is considerably higher than it should be; however, as the next experiment shows, the minimum value is about 50 steps, which is consistent with the expected results.
Experiment 2: Dynamic and static environments
For our second experiment, we studied the effect of the dynamic environment on the minimum number of steps it took for an agent to locate the food. Two cases were analyzed: a static environment, where the terrain is fixed, and a dynamic environment, where sand is constantly added to the terrain. Our static environment was pre-populated with 2000 grains of sand and then fixed for the remainder of the scenario. The average minimum number of steps in the static environment was 47 steps, while the dynamic environment required 85 steps; the average minimum number of steps is also a measure of how quickly one agent reached the food. Figure 7 further compares the static and changing environments, showing the transition to fewer steps to the nest as the agents progress through the time-steps.
We went a step further in this experiment by letting the changing environment run for more time-steps. Given enough time-steps (10000), the changing environment is able to reach the level of the static environment, with a minimum of 45 steps, as seen in Figure 8.
Experiment 3: Number of ants on board
For our third experiment, we examined how the number of agents in the environment affects how well the agents are able to learn the environment. We observed the percentage of agents that have food at any given time-step with respect to the total number of agents in the environment. Figure 9 shows a violin plot of the resultant distributions for environments with 20, 60, and 100 agents. As the number of agents increases, the maximum number of agents that have food at any given time-step also increases (Table 3). However, the percentage of agents that have food at any given time-step stayed roughly the same, between 20 and 25%, in the simulation. Theoretically, assuming perfect learned behavior, the maximum percentage of agents that have food at any given time-step would be 50%; this would imply that the agents spend half their time traveling to the food source and half dropping the food off at the nest. Figure 9 was created using data from 2500 time-steps run with 50 repetitions for 100 agents.
Experiment 4: Variable starting sand grains
For our fourth experiment, we observed how the number of sand grains added to the board before the agents are placed affects the minimum number of steps the agents need to take to reach the food. From Figure 10, the general trend shows that the more grains placed on the board, the more steps the agents needed to reach the food, up to a point at which a plateau is reached. This means that, at least initially, more sand grains imply longer travel times to find the food. At around 3500 sand grains, the minimum number of steps plateaus at roughly 130 steps; this can be considered a steady-state terrain for this scenario. This result is expected because the ASM eventually becomes "organized" and there are no more obstacles than there were before.
Although the agents were placed in a dynamic environment with random configurations, generally speaking, the RL ant-foraging model was still able to locate the food source. In certain cases, the routes developed by the agents were suboptimal due to the dynamic nature of the terrain.
The Baseline Experiment reinforces the idea that the Q-Learning algorithm can quickly help optimize the agents' path in a changing environment, although it takes many more time-steps than initially expected for a Q-Learning agent to learn to navigate a changing, random environment.
Additionally, when comparing dynamic and static environments, we conclude that agents in a changing environment take longer to learn than agents in a static environment. This may be due to randomness as well as the changing nature of the environment. However, given enough time-steps, the agents in the changing environment were able to learn the environment with similar accuracy to the agents in the static environment, as evidenced in Figure 8.
By analyzing the effect of the number of ants on the board, we showed that adding more agents to the environment increases the overall complexity of the scenario. It also reveals that not all of the agents can have food at the same time; instead, they go through cycles of picking up and dropping off food. The maximum number of agents that had food at any given time was about 25% of the total number of agents in the environment, so our agents did not reach the theoretical efficiency of 50%. This shows how, in a changing, random environment, even though the agents know the optimal path, unexpected sand grains can hinder them from taking it.
Finally, by varying the initial number of sand grains, we were able to observe the effects of the dynamic nature of the terrain. The more sand grains there were, the longer the agents took to reach the food, because the board had more obstacles; and the more obstacles there are, the more complex the terrain becomes. At around 3500 sand grains, the number of steps the agents took plateaus because the complexity reaches a steady state and the grid configurations do not become more complex. In an even more complex environment, the agents may not perform as well, or may take a considerably long time to explore the environment and reach the goal.
Comparing our results with the past work of (Panait & Luke, A pheromone-based utility model for collaborative foraging, 2004), we note a few similarities as well as differences. The agents in both models were able to find the optimal path; however, our agents were not always able to take it due to the terrain. Additionally, pheromones played a large role in our simulations, allowing us to observe emergent properties in the behavior of the agents. Our agents initially took longer than those in Panait and Luke's model to find the food; however, over time their trips became shorter. Using a Q-Learning algorithm allows us to explore other types of environments, ones that cannot be explored using pheromones alone. If Q-Learning were applied to Panait and Luke's model, the agents would likely have learned much faster than by basing their entire logic on pheromones. For example, incorporating Q-Learning into our simulation allows us to observe different macro-level behaviors of the agents (Hourcade, Muenz, Rossler, & Devaud, 2010).
Our research had the goal of observing how the Q-Learning algorithm with the tabular approach behaves in a constantly changing, random environment. We investigated whether it is feasible and useful to apply this approach to future work involving a random, changing environment. Based on our results, we would recommend a tabular approach in future work when the environment is comparatively small and there are no real-time limitations. Our work points to implementing a function approximation via a neural network for large environment problems. As seen from our results, the agents were able to explore the environment, but they could not fully adapt to its changes. A function approximation via a neural network may be able to resolve these limitations; however, more experimental evidence is needed.
Al-Tamimi, A., Lewis, F. L., & Abu-Khalaf, M. (2007). Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica, 473-481.
Bak, P., Tang, C., & Wiesenfeld, K. (1987). Self-organized criticality: An explanation of 1/f noise. Physical Review Letters, 381-384.
Beckers, R., Holland, O. E., & Deneubourg, J. (1994). From local actions to global tasks: Stigmergy and collective robotics. Artificial Life IV: Proceedings of the International Workshop on the Synthesis and Simulation of Living Systems.
Chen, C., Li, H. X., & Dong, D. (2008). Hybrid control for robot navigation-a hierarchical Q -learning algorithm. IEEE Robotics and Automation Magazine.
Collett, T. S., Dillmann, E., Giger, A., & Wehner, R. (1992). Visual landmarks and route following in desert ants. Journal of Comparative Physiology A, 435-442.
Deneubourg, J. L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C., & Chretian, L. (1991). The dynamics of collective sorting: robot-like ants and ant-like robots. Animals to Animals: Proceedings of the First International Conference on Simulation of Adaptive Behavior, 356-363.
Dixit, A. K. (1990). Optimization in economic theory. Oxford University Press on Demand.
Driels, M. R., & Shin, Y. S. (2004). Determining the number of iterations for Monte Carlo simulations of weapon effectiveness. Naval Postgraduate School.
Even-Dar, E., & Mansour, Y. (2003). Learning rates for Q-learning. Journal of Machine Learning Research, 1-25.
Holldobler, B., & Wilson, E. (1990). The ants. Harvard University Press.
Hourcade, B., Muenz, T. S., Rossler, W., & Devaud, J. M. (2010). Long-term memory leads to synaptic reorganization in the mushroom bodies: a memory trace in the insect brain. Journal of Neuroscience, 6461-6465.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Oberle, W. (2015). Monte Carlo Simulations: Number of Iterations and Accuracy. US Army Research Laboratory.
Panait, L. A., & Luke, S. (2004). Learning ant foraging behaviors. Artificial Life IX: Ninth International Conference on the Simulation and Synthesis of Living Systems, 575-580.
Panait, L., & Luke, S. (2004). A pheromone-based utility model for collaborative foraging. Autonomous Agents and Multiagent Systems, 36-43.
Pyke, G. (2000). Optimal foraging theory: a critical review. Annual Reviews Ecological Systems, 523-575.
Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Prentice Hall.
Silva-Junior, E. A., Ruzzini, A. C., Paludo, C. R., Nascimento, F. S., Currie, C. R., Clardy, J., & Pupo, M. T. (2018). Pyrazines from bacteria and ants: convergent chemistry within an ecological niche. Scientific Reports.
Sornette, A., & Sornette, D. (1989). Self-organized criticality and earthquakes. Europhysics Letters, 197.
Watkins, C. J. (1989). Learning from delayed rewards. PhD thesis, University of Cambridge.
Watkins, C., & Dayan, P. (1992). Q-learning. Machine Learning, 279-292.
Wolf, H., & Wehner, R. (2000). Pinpointing food sources: olfactory and anemotactic orientation in desert ants, cataglyphis fortis. The Journal of Experimental Biology, 857-868.