What if the future of replenishment in Fashion Retail depended on decision-making in a simulator?
During my academic journey, I explored the concept of bridging the reality gap in Reinforcement Learning, developing an agent capable of making smart decisions in new environments.
Introduction
Replenishing clothes at retail stores is a crucial and complex problem. Wrong decisions can have damaging consequences for the company, such as low stock levels, which can reduce sales, or overstocking, which ties up capital. A sequence of poor decisions can significantly impact the profits of a small company.
To help with complex decision-making like this, researchers have turned to AI, namely Reinforcement Learning (RL). In RL, an agent learns to make smart choices by interacting with its surroundings, or environment. The agent gets feedback based on its decisions: rewards for good actions and penalties for bad ones. Over time, it improves by picking actions that maximize rewards.
For example, imagine a store manager who needs to decide how many t-shirts to restock for the next week. If they order exactly what is needed, say 3 units of a specific SKU (e.g., a blue t-shirt in size M), the store runs smoothly and profits are maximized. But if they order too many or too few, the store either loses sales or ties up money in excess stock. While the manager can’t predict exact sales, they can make an informed guess based on factors like past sales, weather, or upcoming holidays. Similarly, an RL agent learns through trial and error, slowly figuring out the best way to maximize store profits while avoiding costly mistakes.
To train a Reinforcement Learning agent, we can let it explore the environment by making random decisions and learning from the results. In the retail example, this would mean trying different stock levels to see what works best to increase profits. However, as explained before, a sequence of bad decisions can be very detrimental to the success of the store, making it dangerous to let the agent make random real-life decisions until it learns how the environment works.
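To make the trial-and-error idea concrete, here is a minimal sketch (not WAIR's actual system) of an agent learning how much of a single product to order in a toy simulator. All prices, costs, and demand ranges are made-up numbers for illustration only.

```python
import random

# Toy store simulator with hypothetical numbers, purely for illustration.
PRICE, COST, HOLDING_COST = 20.0, 8.0, 1.0

def simulate_week(order_qty):
    """Return one week's profit given how many units were ordered."""
    demand = random.randint(0, 10)        # unknown to the agent
    sold = min(order_qty, demand)         # can't sell more than we stocked
    leftover = order_qty - sold           # unsold units tie up capital
    return sold * PRICE - order_qty * COST - leftover * HOLDING_COST

# Trial and error: keep a running estimate of how well each order size works
# and mostly pick the best one so far (epsilon-greedy exploration).
value = {q: 0.0 for q in range(11)}
counts = {q: 0 for q in range(11)}

for week in range(5000):
    if random.random() < 0.1:
        action = random.choice(list(value))       # explore a random order size
    else:
        action = max(value, key=value.get)        # exploit the best known one
    reward = simulate_week(action)
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]  # running mean

print("Best order size found:", max(value, key=value.get))
```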
To avoid this, researchers use simulations where the agent can safely learn and practice without causing harm. However, these simulations don’t always perfectly match real-world conditions. This difference between the simulation and reality is known as the Reality Gap, and it is crucial to address this problem before applying the knowledge learned in a simulation to the real world.
Together with WAIR, we tackled this problem: creating an agent in simulation that can still make smart decisions when deployed in the real world.
The Reality Gap
Agents are often trained in simulated environments, where they can safely learn optimal behaviors. However, when we deploy these agents directly in the real world, the differences between simulation and reality can leave the agent unable to understand its surroundings, causing it to make poor decisions. This Reality Gap arises from various factors, such as unexpected variables and complex dynamics that are not fully captured in simulations.
In the retail example, an RL agent trained to manage a store’s inventory in a simulation learns to optimize stock by analyzing past sales, weather forecasts, and holidays. However, if a product suddenly becomes popular on social media, the agent might not know to increase the stock since it has never seen this scenario during its training. This could result in empty shelves and lost sales. This highlights the need to bridge the Reality Gap so the RL agent can make effective decisions in real-life situations.
Bridging the Reality Gap
Multiple techniques have been developed to bridge this gap. To understand the ones we used, it is important to first explain how a simulation is created. For any environment we want to simulate, we first define the elements that are relevant to the process we are studying, and then create mathematical equations that describe how these elements interact. These equations are central to the two techniques we use:
1. System Identification
System Identification (SI) focuses on finding the best equations and their parameters to accurately represent a specific problem. By creating a realistic representation of our environment, we reduce the Reality Gap, making our simulation closely resemble the real-world setting where we want to deploy our agent.
In our replenishment example, we need to consider important factors like past sales, product characteristics, and store location. For instance, one equation might predict future sales by averaging the sales from the last three weeks. By including these variables, we create a more realistic model for managing the stock.
System Identification involves continuously refining the parameters and equations based on what we already know about the system. This iterative process allows the RL agent to learn and adapt more effectively, leading to better decision-making in real-world scenarios.
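As a concrete illustration, the sketch below fits the kind of equation mentioned above: next week's sales as a weighted average of the last three weeks. The sales history is invented, and the simple least-squares fit stands in for the more elaborate identification procedures used in practice.

```python
import numpy as np

# Hypothetical weekly sales history (units per week), for illustration only.
sales = np.array([12, 15, 14, 18, 16, 17, 20, 19, 21, 23], dtype=float)

# Model assumption: next week's sales are a weighted average of the last three
# weeks. System Identification here means finding the weights that best fit
# the data we have observed.
X = np.column_stack([sales[2:-1], sales[1:-2], sales[:-3]])  # last 3 weeks
y = sales[3:]                                                # week to predict

weights, *_ = np.linalg.lstsq(X, y, rcond=None)              # fit the parameters
forecast = np.dot(weights, sales[-1:-4:-1])                  # predict next week

print("Fitted weights:", np.round(weights, 2))
print("Forecast for next week:", round(float(forecast), 1))
```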
2. Domain Randomization
For Domain Randomization (DR) we do not try to improve our knowledge about the real system. Instead, this technique is used to train the agent by exposing it to many different scenarios. The idea is that if the agent practices in various environments, it will become more flexible and better at adapting to new situations it hasn’t encountered before.
In this method, we change the values of some variables to create different training environments. Every time a goal is completed, the simulator resets with a new set of variables, allowing the agent to continue training in a fresh setting.
In the replenishment example, we could define the goal as the manager replenishing the store for five weeks. Once the five weeks are over, the store simulator would change the past sales data, product details, and store location. Training the agent across many instances of the simulation is like training a very experienced manager who has worked in multiple stores, each with different sales patterns.
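A minimal sketch of this reset-and-randomize loop might look like the following; the store parameters, their ranges, and the placeholder ordering policy are all hypothetical.

```python
import random

def sample_store_parameters():
    """Draw a new, randomly configured store for the next training episode.
    The ranges below are made up purely to illustrate the idea."""
    return {
        "mean_weekly_demand": random.uniform(5, 50),   # slow vs. busy store
        "demand_noise": random.uniform(1.0, 10.0),     # how erratic sales are
        "holding_cost": random.uniform(0.5, 2.0),      # cost of excess stock
    }

EPISODE_LENGTH_WEEKS = 5  # the "goal": replenish the store for five weeks

for episode in range(3):
    params = sample_store_parameters()                 # a fresh store each episode
    print(f"Episode {episode}: {params}")
    for week in range(EPISODE_LENGTH_WEEKS):
        demand = max(0.0, random.gauss(params["mean_weekly_demand"],
                                       params["demand_noise"]))
        order = params["mean_weekly_demand"]           # placeholder policy
        reward = (min(order, demand)
                  - params["holding_cost"] * max(0.0, order - demand))
        # ...update the agent with (state, order, reward) here...
```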
Pre-trained in-Context Reinforcement Learning
Normally, when doing Domain Randomization, the agent doesn’t know the real values of some variables. So, as the environment changes, the agent has to figure out the best actions to take by analyzing the rewards received after making a decision. The problem is that, even if the agent trains in many different scenarios, it can still fail catastrophically in a completely new situation, because it doesn’t have the information needed to make the best decisions.
Our new method, PiCRL, tries to overcome this problem by giving the agent more information upfront. We built and trained a System Identification model that uses a Transformer (a type of AI architecture) to process past interactions between the agent and its environment. This model helps the agent predict important details about the new environment. Transformers, which are often used in language models, are great at analyzing data over time, like sales trends.
We combine this with Domain Randomization by training the agent in different environments and using the System Identification model to give it crucial information about each new environment.
This way, the agent not only learns in varied situations (Domain Randomization) but also has extra help to understand the environment better (System Identification) and make smarter decisions. When it’s finally used in the real world, the agent can quickly adapt to its surroundings and make accurate choices.
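The sketch below illustrates the general idea, assuming a PyTorch implementation: a small Transformer encoder reads a short history of (state, action, reward) transitions, predicts the hidden environment parameters, and its prediction is appended to the agent's observation. The dimensions, layer sizes, and class names are illustrative choices, not the actual PiCRL architecture.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, CONTEXT_LEN, NUM_ENV_PARAMS = 4, 1, 32, 3
TRANSITION_DIM = STATE_DIM + ACTION_DIM + 1  # state + action + reward

class ContextEncoder(nn.Module):
    """Predicts hidden environment parameters from past interactions."""
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Linear(TRANSITION_DIM, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, NUM_ENV_PARAMS)

    def forward(self, history):               # history: (batch, CONTEXT_LEN, TRANSITION_DIM)
        tokens = self.embed(history)
        encoded = self.transformer(tokens)
        return self.head(encoded[:, -1])       # read the prediction off the last token

encoder = ContextEncoder()
history = torch.randn(1, CONTEXT_LEN, TRANSITION_DIM)   # dummy past interactions
predicted_params = encoder(history)                      # shape: (1, NUM_ENV_PARAMS)

# The agent's observation is augmented with the predicted environment details
# before being fed to the RL policy.
state = torch.randn(1, STATE_DIM)
augmented_obs = torch.cat([state, predicted_params], dim=-1)
```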
Results
In one of our experiments, we trained a Reinforcement Learning agent on a simple environment called Pendulum (https://gymnasium.farama.org). The results are shown in the graph above, where we tracked how well the agent performed over time. Each line on the graph represents a different approach used in training.
- The red line shows the agent that trained in a fixed environment without any randomization of its parameters.
The other agents faced frequent changes in the environment: every time the agent failed or completed its task successfully, some values of the environment were randomized and a new environment was created.
- The green line represents the agent that had access to the real changes in the environment.
- The orange line represents the agent that had access to the information provided by our method (PiCRL).
- The blue line represents an agent trained with Uniform Domain Randomization (UDR), which didn’t have any access to extra information and had to learn solely from the rewards it received.
The results were really exciting for this environment! The blue line (UDR) had the lowest performance, showing that when the agent doesn’t have enough information about its environment, it struggles to make good decisions. In contrast, our method (orange line) allowed the agent to perform almost as well as the agent with full access to environmental information (green line). This shows that PiCRL provides enough information for the agent to adapt and make smart choices, even in new and changing environments.
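For readers who want to picture the randomized training setup, here is an illustrative sketch of Uniform Domain Randomization on Gymnasium's Pendulum: before each episode, the physical parameters are re-sampled. The specific parameters and ranges shown are arbitrary choices for the example, not necessarily the ones used in our experiments, and the random policy is only a placeholder for the learning agent.

```python
import random
import gymnasium as gym

def make_randomized_pendulum():
    """Create a Pendulum environment with randomly sampled physics."""
    env = gym.make("Pendulum-v1", g=random.uniform(5.0, 15.0))  # gravity
    env.unwrapped.m = random.uniform(0.5, 2.0)                  # pole mass
    env.unwrapped.l = random.uniform(0.5, 1.5)                  # pole length
    return env

for episode in range(3):
    env = make_randomized_pendulum()        # a fresh environment each episode
    obs, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()  # placeholder random policy
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    env.close()
```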
Practical Applications and Benefits
Our approach generalized well to unseen environments, offering several benefits:
– Enhanced Performance: Agents achieve higher performance when transferred to unseen environments compared with traditional Domain Randomization.
– Reduced Training Costs: The System Identification model is pre-trained once and can represent each new environment the agent sees without being re-trained, lowering computational and time costs.
– Increased Robustness: Our models are more resilient to unexpected changes and challenges in real-world environments.
Conclusion
In conclusion, with this new method, PiCRL, we can train decision-making agents entirely in simulation and deploy them in real-world scenarios, where making wrong decisions can be very costly.