
A stable connection for Python-based reinforcement learning training on AnyLogic models

This is a guest post from Mingze Li. He is currently a second-year PhD student at the Physical Internet Center at Georgia Institute of Technology. His research interest is in applying reinforcement learning to in-facility decision-making in supply chain systems. In this post, he introduces a stable method for connecting a Python-based reinforcement learning agent to AnyLogic models for training. He also provides a complete example you can download from his Git repository.


Introduction and Motivation

The rapid development of deep reinforcement learning (DRL), the combination of deep learning and reinforcement learning, has drawn researchers from many fields to apply it to their own problems. Deep learning handles continuous or complicated state spaces, and reinforcement learning learns by trial and error in a complicated environment, so DRL is particularly good at problems in complex environments that lack good exact or heuristic methods. Because most reinforcement learning problems require an extremely large amount of data, most DRL (or RL) agents are trained in a simulated environment. With its diverse library of machine learning tools, Python has become the go-to choice for DRL training. However, building large-scale simulations of complicated environments in a general-purpose programming language such as Python is hard. AnyLogic, by contrast, is well suited to building simulation models for training DRL agents in complex environments. The newly developed Alpyne library is a Python library that lets users train DRL agents in Python by interacting with the AnyLogic model at run time. Unfortunately, it is still not stable enough to handle complicated simulation models. In this blog post, we introduce a new way to apply DRL to AnyLogic simulation models using the Pypeline library. The method can also be used for (non-deep) RL training, but environments simple enough to be solved with plain RL can usually be simulated directly in a programming language like Python.

The standard way of training a DRL agent is to interact with simulation models from Python. In our method, the direction is reversed: the DRL agent is called from the simulation model to observe and act on the model at each action step, and it saves all its critical components, for example the replay buffer and the neural networks, to the local disk at the end of each episode. This provides a stable way to implement DRL in AnyLogic models.

In the remaining sections, we will first provide a general walkthrough of the main components of this method. Specifically, we use the implementation of Deep Q-Learning for demonstration purposes, but this method can be applied to various RL algorithms. Then, we will show a simple small-scale example (simplified OpenAI Gym Taxi-v3) to demonstrate the implementation of this method.

General Walkthrough on Main Components

Components on AnyLogic (Environment) Side

To communicate with Python, we first need to add the Pypeline library to the AnyLogic model. Since the focus of this blog post is not on the Pypeline library, please refer to the Pypeline documentation for specific instructions on installing and using it.

After installing the Pypeline library, we import the Python module for our DRL training and create an instance of the DRL training class in the On Startup section of the main agent. At each action step during the simulation run, this instance is called: it receives the state information and the reward from the simulation environment and outputs an action.

To train an RL agent, the simulation environment must have four abilities:

  1. the ability to output the state information from the environment,

  2. the ability to output reward from the environment,

  3. the ability to receive and implement action from the RL agent, and

  4. the ability to tell the RL agent whether the episode is finished.

Thus, the simulation needs functions that provide these four abilities. In our implementation, one function handles (1), one handles (2), and a third handles both (3) and (4). The function for (1) simply returns the current state information as a list of doubles or integers. The function for (2) simply returns the current reward as a double or an integer. The function for (3) and (4) takes the action from the RL agent as input, applies it to the environment, and returns whether the episode is done after taking the action.

Finally, a function that communicates with the RL agent ties the four abilities together and is called at each action step.
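To make the division of labour concrete, here is a minimal Python sketch of the same call pattern. In the real setup the three environment functions live inside the AnyLogic model; the ToyEnvironment class, the RandomAgent class and the run_episode function below are hypothetical stand-ins invented for this illustration only.

```python
import random


class ToyEnvironment:
    """Hypothetical stand-in for the AnyLogic-side functions (1) to (4)."""

    def __init__(self, goal=3, max_steps=20):
        self.position = 0
        self.goal = goal
        self.max_steps = max_steps
        self.steps = 0
        self.last_reward = 0

    def get_state(self):
        # ability (1): current state as a list of numbers
        return [self.position]

    def get_reward(self):
        # ability (2): reward produced by the most recent action
        return self.last_reward

    def take_action(self, action):
        # abilities (3) and (4): apply the action, report whether the episode is done
        self.steps += 1
        self.position = max(0, self.position + (1 if action == 1 else -1))
        reached = self.position >= self.goal
        self.last_reward = 10 if reached else -1
        return reached or self.steps >= self.max_steps


class RandomAgent:
    """Hypothetical agent: ignores its inputs and picks a random action."""

    def act(self, state, reward, done):
        return random.choice([0, 1])


def run_episode(env, agent):
    """Repeat the communicating step until the episode ends; return the step count."""
    done = False
    steps = 0
    while True:
        # the agent always sees the latest state, reward and done flag
        action = agent.act(env.get_state(), env.get_reward(), done)
        if done:
            break  # the final call only lets the agent record the last transition
        done = env.take_action(action)
        steps += 1
    return steps
```

Note that the agent is called once more after the episode has finished, so that it can see the final reward and the done flag before the run ends.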

Components on Python (RL Agent) Side

As discussed above, a new instance of the RL agent is initialized at the beginning of each episode. Because every episode starts with a fresh instance, it is critical to record the important training information locally so that it is not lost when an episode ends. Here, we use JSON and the saving functions from libraries like PyTorch to save the information at the end of each training episode and to load it again at initialization. Using Deep Q-Learning as an example, the important information includes but is not limited to the replay buffer, policy network, target network, number of steps taken, reward buffer, loss history, and optimizer state (if a momentum-based optimizer such as Adam is used). To learn more about the Deep Q-Learning algorithm, please refer to [1].

Saving this information lets us train the RL agent continuously across episodes. One more issue remains: the simulation model only outputs the current state, the reward, and whether the episode is finished (we call this DONE from now on), but the RL agent also needs the previous state and action to form a transition to push into the replay buffer. We tackle this by initializing the previous state and previous action to null at the start of each episode. Whenever the agent receives the state, reward and DONE from the simulation and the previous state and action are not null, a transition consisting of the previous state, previous action, reward, DONE, and current state is appended to the replay buffer. The current state then becomes the new previous state, and the action chosen for it becomes the new previous action.
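The save-and-reload pattern can be sketched in a few lines. This is only an illustration under simplifying assumptions: the helper names load_or_default and save_json and the EpisodeCounter class are hypothetical, and the only thing persisted here is a step counter.

```python
import json
import os


def load_or_default(path, default):
    """Return the JSON content of `path`, or `default` if the file does not exist yet."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return default


def save_json(path, value):
    """Write `value` to `path` as JSON, overwriting any earlier content."""
    with open(path, "w") as f:
        json.dump(value, f)


class EpisodeCounter:
    """Hypothetical agent that only remembers how many steps it has taken."""

    def __init__(self, folder="."):
        self.path = os.path.join(folder, "step.json")
        # a fresh instance picks up where the previous episode stopped
        self.step = load_or_default(self.path, 0)

    def act(self, done):
        self.step += 1
        if done:
            # end of episode: persist the counter for the next instance
            save_json(self.path, self.step)
```

A second instance created after the first one has finished an episode starts from the saved step count rather than from zero, which is the behaviour the training agent relies on.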
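The bookkeeping for the previous state and action can be sketched as follows. The TransitionTracker class is a hypothetical name used only for this illustration; the tuple order (previous state, previous action, reward, DONE, current state) follows the full listing later in this post.

```python
class TransitionTracker:
    """Builds replay-buffer transitions from a stream of (state, reward, done) calls."""

    def __init__(self):
        self.prev_state = None   # nothing observed yet in this episode
        self.prev_action = None
        self.replay_buffer = []

    def observe(self, state, reward, done, action):
        """Store the transition completed by this observation, then remember (state, action)."""
        if self.prev_state is not None:
            self.replay_buffer.append(
                (self.prev_state, self.prev_action, reward, done, list(state))
            )
        # the current state and chosen action become the "previous" pair
        self.prev_state = list(state)
        self.prev_action = action
```

The first call of an episode stores nothing, because there is no previous state yet; every later call completes exactly one transition.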

Simple Demonstration – Simplified Taxi-v3

Without further ado, let’s dive immediately into the implementation. The AnyLogic model with Python files created for this demo can be accessed at:


For demonstration purposes, we use a simplified OpenAI Gym Taxi-v3 environment replicated in AnyLogic. Still, this method is stable enough to be applied to large-scale and much more complicated environments. It arguably fits complicated environments even better, because the extra cost of communication between AnyLogic and Python becomes negligible relative to the cost of the simulation itself.

The environment is a 4×4 grid world with an RL-controlled taxi and a passenger. A visualization of the grid world is shown in figure 1, where the green lines represent walls that the taxi cannot cross. The initial location of the passenger is G, and the destination of the passenger is Y. The taxi is initialized at a random location other than the passenger location. The goal of the taxi is to first pick up the passenger and then drop the passenger off at the destination. Once the passenger is dropped off or more than 200 action steps are taken, the episode ends. The action space in this environment is 0: move up, 1: move down, 2: move left, 3: move right, 4: pick up, and 5: drop off. The state consists of the position of the taxi on the x-axis, the position of the taxi on the y-axis, and whether the passenger has been picked up (0 or 1). When the taxi makes a failed pick-up or drop-off, it receives a reward of -10. When the taxi successfully drops off the passenger, it receives a reward of +20. In all other cases, the taxi receives a reward of -1.
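The reward and termination rules above can be written down compactly. The sketch below is not the AnyLogic implementation; the constant names, the function names and the `succeeded` flag (whether a pick-up or drop-off attempt worked) are assumptions made for this illustration.

```python
# Action codes as listed in the text
MOVE_UP, MOVE_DOWN, MOVE_LEFT, MOVE_RIGHT, PICK_UP, DROP_OFF = range(6)
MAX_STEPS = 200


def taxi_reward(action, succeeded):
    """Reward for one step; `succeeded` only matters for pick-up and drop-off."""
    if action in (PICK_UP, DROP_OFF) and not succeeded:
        return -10  # failed pick-up or drop-off
    if action == DROP_OFF and succeeded:
        return 20   # passenger delivered
    return -1       # every other step, including a successful pick-up


def episode_done(action, succeeded, steps_taken):
    """The episode ends on a successful drop-off or after more than 200 steps."""
    return (action == DROP_OFF and succeeded) or steps_taken > MAX_STEPS
```

Note that a successful pick-up still costs -1, so the only positive reward the taxi can ever receive is for completing the delivery.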

Figure 1: Visualization of the grid world

Implementation in AnyLogic

In this model, a few functions enable the training of the RL agent. The f_State function returns an integer list of length three representing the current state. The f_Reward function returns the reward resulting from the action. The f_TaxiAction function implements the action from the RL agent and returns whether the episode is finished after taking that action. If the model parameter deploy is set to true, the f_TaxiAction function also updates the visualization according to the action. The f_RLAction function calls the RL agent to select an action for the current state and passes it the information required for training, using the three functions above. During the simulation run, the f_RLAction function is called every 0.1 seconds by a cyclic event.

Implementation in Python

PyTorch, a deep learning library, is used to implement Deep Q-Learning in Python. Apart from some extra lines of code to save and load the information needed for training, this implementation is no different from any other standard implementation of Deep Q-Learning. Since the focus of this blog post is not on RL algorithms, only the parts of the code that relate to applying RL in AnyLogic are discussed in this section. The implementation consists of two Python files. One of them only constructs the neural network (the DQN class imported from DQNModel in the code below), so it is not discussed in this blog post.

One thing to notice here is that given that we are creating a connection between AnyLogic and Python, it is better to modularize the Python code to make the connection easy and clean. Here we created a class for the Deep Q-Learning training agent, called DQN_Main.

To initialize an instance of the DQN_Main class (this happens at the beginning of each episode), we first load the necessary information from the local disk using JSON and the load function from PyTorch, then set the previous state and previous action to null and the episode reward to 0. The files holding this information are marked in red in figure 2.

Figure 2: Files in the Model Folder (red: information necessary for training, yellow: plots for monitoring training)

At each action step, AnyLogic calls the act function, which pushes the new experience into the replay buffer, calls the train function to train the neural network, and, if the episode is done, saves some important information to the local disk.

After being called by the act function, the train function trains the neural network for one epoch (one gradient step on a mini-batch sampled from the replay buffer) and, if the episode is done, saves the information changed by training to the local disk.
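The target computed inside the train function of the listing below picks the next action with the policy network and evaluates it with the target network, a form often called Double DQN. A minimal pure-Python sketch for a single transition, with made-up Q-values and a hypothetical function name, might look like this:

```python
GAMMA = 0.99  # discount factor, same value as in the listing


def double_dqn_target(reward, done, next_q_policy, next_q_target, gamma=GAMMA):
    """Target value for one transition.

    next_q_policy / next_q_target are the Q-values of the next state under the
    policy network and the target network (plain lists, one entry per action).
    """
    # the policy network chooses the action ...
    best_action = max(range(len(next_q_policy)), key=lambda a: next_q_policy[a])
    # ... and the target network evaluates it
    bootstrap = next_q_target[best_action]
    # no bootstrapping beyond the end of an episode
    return reward + gamma * (1 - int(done)) * bootstrap


# Made-up example: the policy net prefers action 1, which the target net values at 2.0
example = double_dqn_target(-1, False, [0.2, 0.9, 0.1], [1.0, 2.0, 3.0])
```

Here the target is -1 + 0.99 × 2.0 = 0.98, even though the target network's own maximum is 3.0. For a terminal transition the target is just the reward.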

If desired, you can also add functions to generate reward and loss plots to the local disk, so that you can watch your RL agent getting better. The generated plots for this instance are marked in yellow in figure 2.

The full code of the DQN_Main class is shown below:

import os.path
import os
import torch
import json
from DQNModel import DQN
import random
import numpy as np
import torch.nn.functional as F
import matplotlib.pyplot as plt

class DQN_Main:
    def __init__(self):
        self.BUFFER_SIZE = 200000
        self.MIN_REPLAY_SIZE = 50000
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.GAMMA = 0.99
        self.BATCH_SIZE = 128
        self.EPSILON_START = 0.99
        self.EPSILON_END = 0.1
        self.EPSILON_DECAY = 0.000025
        self.TARGET_UPDATE_FREQ = 10000
        self.LR = 0.00025
        os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
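        # Restore whatever the previous episode saved; start fresh otherwise
        if os.path.exists('replay_buffer.json'):
            with open("replay_buffer.json", "r") as read_content:
                self.replay_buffer = json.load(read_content)
        else:
            self.replay_buffer = []

        if os.path.exists('reward_buffer.json'):
            with open("reward_buffer.json", "r") as read_content:
                self.reward_buffer = json.load(read_content)
        else:
            self.reward_buffer = []

        if os.path.exists('step.json'):
            with open("step.json", "r") as read_content:
                self.step = json.load(read_content)
        else:
            self.step = 0

        self.policy_net = DQN(device=self.device).to(self.device)
        self.target_net = DQN(device=self.device).to(self.device)

        if os.path.exists('policy_net.pth'):
            # assumed: the networks are stored as state dicts (the original lines are missing)
            self.policy_net.load_state_dict(torch.load('policy_net.pth', map_location=self.device))
            self.target_net.load_state_dict(torch.load('target_net.pth', map_location=self.device))
        else:
            # first episode: start with identical policy and target networks
            self.target_net.load_state_dict(self.policy_net.state_dict())

        self.optimizer = torch.optim.Adam(self.policy_net.parameters(), lr=self.LR)
        if os.path.exists('policy_net_adam.pth'):
            # assumed: restore the Adam moment estimates saved by the previous episode
            self.optimizer.load_state_dict(torch.load('policy_net_adam.pth', map_location=self.device))

        if os.path.exists('loss_hist.json'):
            with open("loss_hist.json", "r") as read_content:
                self.loss_hist = json.load(read_content)
        else:
            self.loss_hist = []

        if os.path.exists('loss_hist_show.json'):
            with open("loss_hist_show.json", "r") as read_content:
                self.loss_hist_show = json.load(read_content)
        else:
            self.loss_hist_show = []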

        self.episode_reward = 0
        self.prev_state = None
        self.prev_action = None

    def save_hyperparams(self):
        hyperparams_dict = {
            'BUFFER SIZE': self.BUFFER_SIZE,
            'MIN REPLAY SIZE': self.MIN_REPLAY_SIZE,
            'GAMMA': self.GAMMA,
            'BATCH SIZE': self.BATCH_SIZE,
            'EPSILON START': self.EPSILON_START,
            'EPSILON END': self.EPSILON_END,
            'EPSILON DECAY': self.EPSILON_DECAY,
            'LR': self.LR,
        }
        with open("hyperparameters.json", "w") as write:
            json.dump(hyperparams_dict, write)

    def train(self, done):
        # add training step here
        transitions = random.sample(self.replay_buffer, self.BATCH_SIZE)

        states = np.asarray([t[0] for t in transitions])
        actions = np.asarray([t[1] for t in transitions])
        rewards = np.asarray([t[2] for t in transitions])
        dones = np.asarray([t[3] for t in transitions])
        next_states = np.asarray([t[4] for t in transitions])

        states_t = torch.as_tensor(states, dtype=torch.float32).to(self.device)
        actions_t = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(-1).to(self.device)
        rewards_t = torch.as_tensor(rewards, dtype=torch.float32).unsqueeze(-1).to(self.device)
        dones_t = torch.as_tensor(dones, dtype=torch.float32).unsqueeze(-1).to(self.device)
        next_states_t = torch.as_tensor(next_states, dtype=torch.float32).to(self.device)

        # compute targets
        _, actions_target = self.policy_net(next_states_t).max(dim=1, keepdim=True)
        target_q_values_1 = self.target_net(next_states_t).gather(dim=1, index=actions_target)
        targets_1 = rewards_t + self.GAMMA * (1 - dones_t) * target_q_values_1

        # compute loss
        q_values = self.policy_net(states_t)
        action_q_values = torch.gather(input=q_values, dim=1, index=actions_t)

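        # Gradient Descent
        loss = F.mse_loss(action_q_values, targets_1)
        if self.step % 200 == 0:
            # assumed: record the loss every 200 steps (the original line is missing)
            self.loss_hist.append(loss.item())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Update Target Net
        if self.step % self.TARGET_UPDATE_FREQ == 0:
            self.target_net.load_state_dict(self.policy_net.state_dict())

        # we need to have a done parameter, since we need to save the neural nets if the episode is done
        if done:
            # assumed: networks and optimizer are saved as state dicts
            torch.save(self.policy_net.state_dict(), 'policy_net.pth')
            torch.save(self.target_net.state_dict(), 'target_net.pth')
            torch.save(self.optimizer.state_dict(), 'policy_net_adam.pth')
            with open("loss_hist.json", "w") as write:
                json.dump(self.loss_hist, write)
            with open("loss_hist_show.json", "w") as write:
                json.dump(self.loss_hist_show, write)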

    def random_action(self):
        return random.choice(self.policy_net.action_space)

    def act(self, state, reward, done, deploy):
        # deployment: act greedily and skip all training logic
        if deploy:
            with torch.no_grad():
                state_t = torch.tensor(state)
                action = self.policy_net.act(state_t)
            return action

        if len(self.replay_buffer) >= self.MIN_REPLAY_SIZE:
            # epsilon-greedy action selection with linearly decaying epsilon
            rnd = random.random()
            epsilon = self.EPSILON_START - self.EPSILON_DECAY * self.step
            self.step += 1
            if epsilon < self.EPSILON_END:
                epsilon = self.EPSILON_END
            if rnd <= epsilon:
                action = self.random_action()
            else:
                with torch.no_grad():
                    state_t = torch.tensor(state)
                    action = self.policy_net.act(state_t)
        else:
            # fill up the replay buffer
            action = self.random_action()

        if self.prev_state is None:
            # beginning of an episode, we just take the action, nothing to append to the replay buffer
            self.prev_state = state.copy()
            self.prev_action = action

            # we still need to train our neural net here
            if len(self.replay_buffer) >= self.MIN_REPLAY_SIZE:
                # if done neural nets will be saved in the train function
                self.train(done)
            return action
        else:
            # here we add the transitions to replay buffer
            self.episode_reward += reward
            transition = (self.prev_state, self.prev_action, reward, done, state)
            self.replay_buffer.append(transition)
            if len(self.replay_buffer) > self.BUFFER_SIZE:
                # assumed: drop the oldest transition once the buffer is full
                self.replay_buffer.pop(0)

            # adjust previous state and action
            self.prev_state = state.copy()
            self.prev_action = action

        if done:
            # assumed: the episode total is recorded here (the original line is missing)
            self.reward_buffer.append(self.episode_reward)
            # since we connect to AnyLogic, we have to save everything every episode
            with open("reward_buffer.json", "w") as write:
                json.dump(self.reward_buffer, write)
            with open("replay_buffer.json", "w") as write:
                json.dump(self.replay_buffer, write)
            with open("step.json", "w") as write:
                json.dump(self.step, write)
            if len(self.reward_buffer) % 100 == 0:
                # assumed: refresh the reward plot every 100 episodes
                self.plot_reward_buffer()

        if len(self.replay_buffer) >= self.MIN_REPLAY_SIZE:
            # if done neural nets will be saved in the train function
            self.train(done)

        return action

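    def plot_reward_buffer(self):
        # assumed: plot the total reward of every finished episode
        plt.figure()
        plt.plot(self.reward_buffer)
        plt.savefig('reward buffer.jpg')
        plt.close()

    def plot_loss_hist(self):
        # assumed: plot the recorded training losses
        plt.figure()
        plt.plot(self.loss_hist)
        plt.xlabel('100 Epoch')
        plt.savefig('Loss History.jpg')
        plt.close()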
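As a small worked example of the exploration schedule used in the act function, the sketch below reproduces the linear epsilon decay with the hyperparameter values from the listing. The function name epsilon_at is hypothetical. Keep in mind that in the listing the step counter only starts increasing once the replay buffer holds at least MIN_REPLAY_SIZE transitions, so the decay begins after that warm-up phase.

```python
EPSILON_START = 0.99
EPSILON_END = 0.1
EPSILON_DECAY = 0.000025


def epsilon_at(step):
    """Exploration rate after `step` counted action steps (linear decay with a floor)."""
    return max(EPSILON_END, EPSILON_START - EPSILON_DECAY * step)


# Number of counted steps until epsilon reaches its floor:
# (0.99 - 0.1) / 0.000025 = 35,600 (up to floating-point rounding)
steps_to_floor = (EPSILON_START - EPSILON_END) / EPSILON_DECAY
```

With these values the agent explores almost purely at random at first and settles at a 10% random-action rate after roughly 35,600 counted steps.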

Training Through AnyLogic Experiment

The RL agent is trained using the Monte Carlo experiment in AnyLogic. In the Monte Carlo experiment, we set the number of episodes we want the RL agent to be trained for in the Replications section. Then all we need to do is sit back and watch the RL agent improve as it gains simulated experience!

If you want to abort the current training and re-train a new RL agent, you can do so by removing all the files marked in yellow or red in figure 2.
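If you prefer not to delete the files by hand, a small helper along the following lines could do it. This is a sketch, not part of the original project: the function name reset_training is hypothetical, and the file list is taken from the file names used in the listing above on the assumption that these are the files marked in figure 2. The hyperparameters file is left alone.

```python
import os

# File names as they appear in the listing above
TRAINING_FILES = [
    "replay_buffer.json", "reward_buffer.json", "step.json",
    "policy_net.pth", "target_net.pth", "policy_net_adam.pth",
    "loss_hist.json", "loss_hist_show.json",
    "reward buffer.jpg", "Loss History.jpg",
]


def reset_training(folder="."):
    """Delete saved training state and plots in `folder`; return the names removed."""
    removed = []
    for name in TRAINING_FILES:
        path = os.path.join(folder, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed
```

Calling reset_training() from the model folder before starting the experiment makes the next episode initialize a brand-new agent, because none of the saved files can be found.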

Training Result

The training result confirms that our method works just like any other RL model training. Figure 3 shows that the reward improved steadily and that the model successfully converged.

Figure 3: Reward History

The success of training is further confirmed with a visualization in the AnyLogic simulation experiment (yellow: taxi, red: passenger, green: destination):

Thank you for reading this post! If you have any further questions, feel free to go to my GitHub page to post your questions in the discussions section! :)

Mingze Li


1. Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).


Mingze Li is a guest writer for the AnyLogic Modeler. Feel free to connect with him over LinkedIn.

