Why does using the right probability distribution in a simulation model matter?

This is a special guest blog post by Agustin Olivares.

After successfully completing our Java for AnyLogic course, Agustin reached out to ask whether he could use some of the course content on his YouTube channel. After seeing his work there, we offered him the opportunity to post on our blog and share some of what he has learned. Agustin walks through an example using distfit in Python and illustrates the effect of choosing different probability distributions in an AnyLogic model. The model is also available to download.

Background

Almost all real-world systems contain one or more sources of randomness. A manufacturing process, for example, has processing times, machine times to failure, and machine repair times; a transportation and distribution process has ship-loading times and inter-arrival times of trucks, among others. To run a simulation with random inputs, we need to specify their probability distributions. With these, the simulation model can generate random values from each distribution as simulated time advances.

Before defining the distribution we will use in the model, we must identify the kind of data we have. There are two groups of data, categorical and numerical. Categorical data describes something that is not numeric, such as colors, states, or locations; numerical data represents a numeric value, such as size, weight, or time. Within the numerical group there are two types, continuous and discrete. Continuous data can take any value within an interval, whereas discrete data can take only separate, countable values, such as a number of items.

The most commonly used distributions

Source: Law (2015) - Simulation Modeling and Analysis [5th ed.].

Continuous variables

For continuous variables, the following distributions are commonly used.

· Uniform U(a, b)

Possible applications: Used as a “first” model for a quantity that is felt to be randomly varying between a and b but about which little else is known. The U(0, 1) distribution is essential in generating random values from all other distributions.

Parameters: a and b are real numbers with a as the minimum and b as the maximum of the data.

· Exponential expo(β)

Possible applications: Inter-arrival times of “customers” to a system that occur at a constant rate; time to failure of a piece of equipment.

Parameter: β is the mean of the data.

· Normal N(μ, σ^2)

Possible applications: Errors of various types, for example, in the impact point of a bomb; quantities that are the sum of a large number of other quantities (by virtue of central limit theorems).

Parameters: μ is the mean and σ^2 is the variance of the data.

· Triangular triang(a, b, m)

Possible applications: Used as a rough model in the absence of data.

Parameters: a, b, and m are real numbers, with a as the minimum, b as the maximum, and m as the mode of the data.
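As a rough illustration of how these parameterizations translate into code, the sketch below draws values from the four continuous distributions using only Python's standard `random` module. The function names are ours, chosen for illustration, and the parameter values are arbitrary. Two details are easy to get wrong: `random.gauss` expects the standard deviation rather than the variance, and `random.triangular` takes its arguments in the order (low, high, mode). The exponential sampler uses the inverse-transform method on a U(0, 1) draw, which shows why U(0, 1) is so central to random variate generation.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable


def sample_uniform(a, b):
    """U(a, b): minimum a, maximum b."""
    return rng.uniform(a, b)


def sample_exponential(beta):
    """expo(beta) with mean beta, by inverse transform of a U(0, 1) draw."""
    u = rng.random()  # value in [0, 1), so 1 - u is in (0, 1]
    return -beta * math.log(1.0 - u)


def sample_normal(mu, sigma_squared):
    """N(mu, sigma^2): random.gauss expects the standard deviation."""
    return rng.gauss(mu, math.sqrt(sigma_squared))


def sample_triangular(a, b, m):
    """triang(a, b, m): random.triangular takes (low, high, mode)."""
    return rng.triangular(a, b, m)


n = 100_000
expo_mean = statistics.fmean(sample_exponential(5.0) for _ in range(n))
tri_mean = statistics.fmean(sample_triangular(2.0, 10.0, 4.0) for _ in range(n))
normal_draws = [sample_normal(10.0, 4.0) for _ in range(n)]
uniform_draws = [sample_uniform(1.0, 3.0) for _ in range(n)]

print("expo(5) sample mean:", round(expo_mean, 3))
print("triang(2, 10, 4) sample mean:", round(tri_mean, 3))
print("N(10, 4) sample std dev:", round(statistics.pstdev(normal_draws), 3))
```

The sample means should land close to the theoretical values: β for the exponential and (a + b + m) / 3 for the triangular.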

Discrete variables

For discrete variables, the following distributions can be used.

· Poisson Poisson(λ)

Possible applications: Number of events that occur in an interval of time when the events are occurring at a constant rate; number of items in a batch of random size; number of items demanded from an inventory; number of arrivals in a time interval.

Parameter: λ is the mean number of events within a given interval of time or space.

· Discrete Uniform DU(i,j)

Possible applications: Random occurrence with several possible outcomes, each of which is equally likely; used as a “first” model for a quantity that is varying among the integers i through j but about which little else is known.

Parameters: i and j are integers, with i as the minimum and j as the maximum of the data.
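The two discrete distributions can be sketched in the same way with the standard library. Python's `random` module has no Poisson sampler, so the illustrative `sample_poisson` below counts how many exponentially spaced events fall inside one unit of time; this is one simple way to generate Poisson values, not the only one, and it also shows the link between Poisson counts and exponential inter-arrival times. The function names and parameter values are assumptions made for this example.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable


def sample_poisson(lam):
    """Poisson(lam): count events in one time unit when the gaps
    between events are exponential with rate lam."""
    count = 0
    elapsed = rng.expovariate(lam)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(lam)
    return count


def sample_discrete_uniform(i, j):
    """DU(i, j): every integer from i to j (inclusive) is equally likely."""
    return rng.randint(i, j)


poisson_draws = [sample_poisson(3.0) for _ in range(50_000)]
du_draws = [sample_discrete_uniform(1, 6) for _ in range(50_000)]

print("Poisson(3) sample mean:", round(statistics.fmean(poisson_draws), 3))
print("Poisson(3) sample variance:", round(statistics.pvariance(poisson_draws), 3))
print("DU(1, 6) sample mean:", round(statistics.fmean(du_draws), 3))
```

For a Poisson distribution the mean and the variance are both equal to λ, which gives a quick sanity check on the sampler.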

How to use distributions

The way a distribution is used in a simulation model depends on the process being simulated. Law (2015) describes three approaches for turning collected data into a random input:

The data values themselves are used directly in the simulation. For example, if the data represent service times, then one of the data values is used whenever a service time is needed in the simulation. This is sometimes called a trace-driven simulation.
The data values themselves are used to define an empirical distribution function in some way. If these data represent service times, we would sample from this distribution when a service time is needed in the simulation.
Standard techniques of statistical inference are used to “fit” a theoretical distribution form, for example exponential or Poisson, to the data and to perform hypothesis tests to determine the goodness of fit. If a particular theoretical distribution with certain values for its parameters is a good model for the service-time data, then we would sample from this distribution when a service time is needed in the simulation.
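The three approaches above can be sketched in a few lines of Python. The `observed` values are made up for illustration, and the function names are ours. The empirical sampler shown is the simplest possible version (resampling the observed values with equal probability), and the fitted sampler assumes an exponential form with its mean estimated by the sample mean; the goodness-of-fit test that should accompany the third approach is left out here.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical observed service times in minutes (made up for illustration).
observed = [2.1, 0.4, 7.9, 3.3, 5.6, 1.2, 9.8, 4.4, 0.9, 6.0]


def trace_driven(data):
    """Approach 1: replay the recorded values in their original order."""
    for value in data:
        yield value


def empirical_sample(data):
    """Approach 2: sample from the empirical distribution by picking
    one of the observed values with equal probability."""
    return rng.choice(data)


# Approach 3: fit a theoretical distribution. For the exponential, the
# maximum-likelihood estimate of the mean is the sample mean.
beta_hat = statistics.fmean(observed)


def fitted_sample():
    """Draw from the fitted exponential with mean beta_hat."""
    return rng.expovariate(1.0 / beta_hat)


trace = trace_driven(observed)
print("trace-driven:", next(trace), next(trace))
print("empirical:   ", empirical_sample(observed))
print("fitted expo: ", round(fitted_sample(), 3), "(mean", round(beta_hat, 2), ")")
```

A trace-driven run can only reproduce what was already observed, while the fitted distribution can generate values outside the observed range.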

AnyLogic Probability Distribution Wizard

AnyLogic has a probability distribution wizard, where you can choose the distribution you are going to use and see a description of the distribution and its parameters.

To open the wizard, follow these steps.

Select the delay block.
Go to the delay time label in the properties of the block.
Select the "Choose Probability Distribution..." button at the top (marked with a red rectangle in the figure).

Let's do an experiment.

Once the data has been collected and its type identified, the next question is:

How to choose the right distribution for the model?

In statistics, there are different methods for choosing the best distribution. One of the easiest is to plot the probability density function (pdf) of the data and use statistical software to find the distribution that fits it best. We will take the service times of a call center and find the best distribution for them. For this example we used Python with the distfit library; you can find the code and the data in this link.

We choose the best distribution using the residual sum of squares (RSS): the lower a distribution's score, the better it fits the data. In this case, the gamma, Pareto, exponential, and beta distributions are good fits. For this model, we will use the exponential distribution and compare it with the uniform distribution, which the RSS indicates is a poorer fit. The indicator will be the mean time clients spend in the queue.
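To make the idea behind an RSS score concrete, here is a minimal sketch that does not use distfit and is not its implementation. It builds a histogram density from the data, evaluates each candidate pdf at the bin centers, and sums the squared differences. The data are synthetic (generated from an exponential with a mean of 5 minutes as a stand-in for the call center service times), and the helper names, the bin count, and the simple parameter estimates are all assumptions made for this example.

```python
import math
import random

rng = random.Random(2021)  # fixed seed so the run is repeatable

# Synthetic stand-in for the service times in minutes (an assumption;
# the real data set from the post is not reproduced here).
data = [rng.expovariate(1.0 / 5.0) for _ in range(5_000)]


def histogram_density(values, bins=50):
    """Return bin centers and the normalized histogram heights."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # put the max in the last bin
        counts[idx] += 1
    centers = [lo + (k + 0.5) * width for k in range(bins)]
    density = [c / (len(values) * width) for c in counts]
    return centers, density


def rss(centers, density, pdf):
    """Residual sum of squares between the histogram and a candidate pdf."""
    return sum((d - pdf(x)) ** 2 for x, d in zip(centers, density))


mean = sum(data) / len(data)
lo, hi = min(data), max(data)


def expon_pdf(x):
    """Exponential pdf with its mean estimated from the data."""
    return math.exp(-x / mean) / mean if x >= 0 else 0.0


def uniform_pdf(x):
    """Uniform pdf between the smallest and largest observation."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


centers, density = histogram_density(data)
rss_expon = rss(centers, density, expon_pdf)
rss_uniform = rss(centers, density, uniform_pdf)

print("RSS exponential:", rss_expon)
print("RSS uniform:    ", rss_uniform)
```

On data like these, the exponential candidate should score far lower than the uniform one, because the flat uniform pdf misses both the tall bars near zero and the long thin tail.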

The model has five staff members to attend to the clients, and the source generates agents from a database. The first experiment uses an exponential distribution with a mean service time of 5 minutes (4.985 in the fit results table, rounded), which gives a rate of λ = 1/5 = 0.2 clients per minute. The simulation runs from 1 January 2021 to 9 January 2021.

The mean time in the queue is 0.03 seconds, with exponential distribution.

Now, let’s see the impact of choosing the wrong distribution in the AnyLogic simulation model. The uniform distribution has a minimum of 0 and a maximum of 52 minutes (51.833 in the table, rounded).

The mean time in queue is 68.37 minutes, with uniform distribution.
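One reason the gap is so large is that a uniform distribution between 0 and 52 minutes has a mean of (0 + 52) / 2 = 26 minutes, more than five times the 5-minute mean of the fitted exponential. The sketch below reproduces the effect qualitatively in plain Python. It is not the AnyLogic model: arrivals here are assumed to follow a Poisson process with a hypothetical rate of one client every 2 minutes, whereas the original model reads arrivals from a database, so the numbers will not match the figures above. The function name and all default values are illustrative.

```python
import heapq
import random
import statistics


def mean_queue_time(service_sampler, servers=5, arrival_rate=0.5,
                    horizon=8 * 24 * 60, seed=123):
    """Mean waiting time (minutes) in a FIFO queue with several servers.

    arrival_rate is a hypothetical Poisson arrival rate in clients per
    minute; horizon is 8 days expressed in minutes.
    """
    rng = random.Random(seed)
    free_at = [0.0] * servers  # min-heap of times at which each server is free
    waits = []
    t = rng.expovariate(arrival_rate)
    while t < horizon:
        earliest = heapq.heappop(free_at)   # server that frees up first
        start = max(t, earliest)            # wait if nobody is free yet
        waits.append(start - t)
        heapq.heappush(free_at, start + service_sampler(rng))
        t += rng.expovariate(arrival_rate)
    return statistics.fmean(waits)


# Exponential service with rate 0.2 per minute (mean 5 minutes).
wait_expo = mean_queue_time(lambda rng: rng.expovariate(0.2))
# Uniform service between 0 and 52 minutes (mean 26 minutes).
wait_unif = mean_queue_time(lambda rng: rng.uniform(0.0, 52.0))

print("mean wait, exponential service:", round(wait_expo, 2), "minutes")
print("mean wait, uniform service:    ", round(wait_unif, 2), "minutes")
```

Under these assumptions, five servers keep up comfortably with 5-minute services but cannot keep up with 26-minute services, so the queue keeps growing for the whole run.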

In conclusion, distributions help us get close to the process we want to simulate and obtain information from it. With them, we can introduce randomness into the simulation, explore scenarios that could affect the process, and plan to reduce their impact or increase revenue. The choice of distribution has a significant effect on the outputs: in this example, replacing the exponential with the uniform distribution raised the mean time in queue from 0.03 seconds to 68.37 minutes. A poorly fitting distribution will harm the analysis we want to carry out.

Agustin Olivares

Agustin Olivares is an Industrial Engineer from Chile and a content creator for Simulation and Data Science. Agustin created the YouTube channel Agustin Olivares - Ingeniería to help people understand the world of analytics and simulation, drawing on his experience with machine learning algorithms and with discrete-event and agent-based simulation. You can find him on Twitter @aaosoto or on LinkedIn.

1 Comment

Augusto Cesar Pereira

Jun 01, 2023

Hi! I see you used a Python package for fitting the distribution, and the distributions are always defined with 'loc' and 'scale' parameters. Do you have any idea how to 'convert' these arguments to the ones AnyLogic uses? So far, we are converting each kind of distribution individually. Thank you!