Example 5: Working with NumPy

This example shows how to use NumPy to create a service duration pmf from empirical data. If NumPy is not already installed in your working environment, you may do so as follows:

pip3 install numpy

We don't actually have any empirical data on hand, but we can use NumPy to invent some:

import numpy as np

mu = 0.0
sigma_squared = 0.1
data_size = 100
rng = np.random.default_rng(1234)
empirical_data = rng.lognormal(mu, sigma_squared, data_size)
print(empirical_data)

This creates a dataset with 100 independent and identically distributed samples from a lognormal distribution with parameters \(\mu=0\) and \(\sigma^2=0.1\). The seed of the random number generator is set to 1234 for reproducibility. The output is:

[0.8518169  1.00643058 1.07690279 1.01537898 1.09021442 1.33817925
 0.8625326  1.09916115 0.84652669 1.03497209 0.95004648 1.14153734
 0.91756852 1.05332236 0.88116149 0.80580468 1.04443221 1.18925723
 1.05338987 0.90464147 1.02719784 1.07973698 1.1265132  0.89070581
 1.07210922 1.03576302 0.99676374 1.00131903 0.93433055 0.93983288
 1.1423887  1.02622175 0.95299236 0.77944047 0.91607561 0.95070552
 0.8795781  0.87543634 1.08610647 0.97558157 0.84368961 0.87501409
 0.97048058 1.11793216 0.86015654 1.17235109 0.95243586 0.84272868
 1.05264813 1.15454831 0.97806377 1.06703273 0.96871084 0.99890278
 1.18121273 1.09371334 0.88668976 1.32308048 0.90292153 1.08851099
 1.05106955 0.99159139 1.02045574 0.98375285 1.08730915 0.93123467
 0.8892161  1.0486743  1.18974557 0.98642857 1.18569805 0.99121696
 1.1685039  1.10113447 1.05366295 1.09824674 0.91971547 1.00985505
 0.85465676 0.83695264 1.09623725 0.98520378 1.10579402 1.01318799
 0.92560777 1.33566693 1.14764021 1.01729341 1.0022267  1.17970992
 0.96832498 1.16469175 1.0676506  0.87618099 1.07715633 1.11821993
 1.05616661 1.1012053  1.14737631 0.95205899]

These data are measured in units of real time, so they need to be converted to model time. It is necessary to specify a mesh size, the amount of real time associated with each unit of model time. The number of model time units is obtained by rounding up, since service durations must be strictly positive. In this example, the mesh size is taken as 0.1. Therefore, an empirical value of 0.27 would correspond to a service duration of 3 units of model time.

The NumPy histogram construction function associates each empirical value with a bin whose range is a half-open interval [0,0.1), [0.1,0.2), ... . The bin indices start from 0, so the empirical value 0.27 goes into bin 2. As the service duration in units of model time differs from the bin index by 1, this leads to a typical fenceposting challenge.

The procedure to create the service duration pmf from the empirical data consists of the following three steps:

  • Step 1: Determine the maximum value in the support of the service duration pmf to be constructed.

  • Step 2: Create a histogram of frequencies.

  • Step 3: Construct the desired pmf from the histogram.

Here is the code is to execute Step 1:

mesh_size = 0.1

max_empirical_value = np.max(empirical_data)
print(max_empirical_value)

max_bin_index = int(np.floor(max_empirical_value / mesh_size))
print(max_bin_index)

max_pmf_value = max_bin_index + 1
print(max_pmf_value)

The output is:

1.3381792512143813
13
14

The following code executes Step 2:

bin_edges = mesh_size * np.arange(max_pmf_value + 1)
print(bin_edges)

frequencies, _ = np.histogram(empirical_data, bins=bin_edges)
print(frequencies)

Here is the output:

[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.  1.1 1.2 1.3 1.4]
[ 0  0  0  0  0  0  0  1 17 26 34 19  0  3]

The last entry of the frequencies array is 3, and there are indeed 3 data points in the 14th interval [1.3, 1.4).

Finally, here is the code to execute Step 3:

pmf = {bin_index+1: frequencies[bin_index]/data_size
    for bin_index in range(max_pmf_value) if frequencies[bin_index] > 0}
print(pmf)

The output for this step is:

{8: 0.01, 9: 0.17, 10: 0.26, 11: 0.34, 12: 0.19, 14: 0.03}

Here is the complete code:

import numpy as np

mu = 0.0
sigma_squared = 0.1
data_size = 100
rng = np.random.default_rng(1234)
empirical_data = rng.lognormal(mu, sigma_squared, data_size)

mesh_size = 0.1
max_empirical_value = np.max(empirical_data)
max_bin_index = int(np.floor(max_empirical_value / mesh_size))
max_pmf_value = max_bin_index + 1

bin_edges = mesh_size * np.arange(max_pmf_value + 1)
frequencies, _ = np.histogram(empirical_data, bins=bin_edges)

pmf = {bin_index+1: frequencies[bin_index]/data_size
    for bin_index in range(max_pmf_value) if frequencies[bin_index] > 0}

print(pmf)

And here is the output:

{8: 0.01, 9: 0.17, 10: 0.26, 11: 0.34, 12: 0.19, 14: 0.03}