
Abstract

Simulation modeling and analysis is a technique for improving or investigating process performance. It is a cost-effective method for evaluating the performance of resource allocation and alternative operating policies, and it may also be used to evaluate the performance of capital equipment before investment. These benefits have led to simulation modeling and analysis projects in virtually every service and manufacturing sector. In a simulation project, the ultimate use of input data is to drive the simulation. This process involves the collection of input data, the analysis of the input data, and the use of that analysis in the simulation model. The input data may either be obtained from historical records or collected in real time as a task in the simulation project. The analysis involves identifying the theoretical distribution that represents the input data. Using the input data in the model involves specifying the theoretical distributions in the simulation program code.

ELECTRONICS’ 2005, 21 – 23 September, Sozopol, BULGARIA

SIMULATION MODELING. INPUT DATA COLLECTION AND ANALYSIS

Delia UNGUREANU, Francisc SISAK, Dominic Mircea KRISTALY, Sorin MORARU
Automatics Department, “Transilvania” University of Brasov, M. Viteazu street, no. 5, 500174, Brasov, Romania, phone: +40 0268 418836
delia@deltanet.ro, sisak@unitbv.ro, kdominic@vision-systems.ro, smoraru@vision-systems.ro

The steps of the process for conducting a simulation modeling and analysis project include: problem formulation, project planning, system definition, input data collection, model translation, verification, validation, experimental design, and analysis. A simulation project involves the collection of input data, the analysis of the input data, and the use of that analysis in the simulation model. The input data may either be obtained from historical records or collected in real time as a task in the simulation project. The analysis involves identifying the theoretical distribution that represents the input data. Using the input data in the model involves specifying the theoretical distributions in the simulation program code. If we successfully fit the observed data to a theoretical distribution, then any data value from the theoretical distribution may drive the simulation model.

Keywords: simulation modeling, data distribution, analysis project.

1. INTRODUCTION

Simulation modeling and analysis is a technique for improving or investigating process performance. It is a cost-effective method for evaluating the performance of resource allocation and alternative operating policies, and it may also be used to evaluate the performance of capital equipment before investment. These benefits have led to simulation modeling and analysis projects in virtually every service and manufacturing sector. In a simulation project, the ultimate use of input data is to drive the simulation.
This process involves the collection of input data, the analysis of the input data, and the use of that analysis in the simulation model. The input data may either be obtained from historical records or collected in real time as a task in the simulation project. The analysis involves identifying the theoretical distribution that represents the input data. Using the input data in the model involves specifying the theoretical distributions in the simulation program code. This process is represented in Fig. 1.

Figure 1 – Role of theoretical probability distributions in simulation.

The collection of input data is often considered the most difficult process involved in conducting a simulation modeling and analysis project. In the data collection and analysis process it is necessary to address several issues: sources of input data, collecting input data, deterministic versus probabilistic input data, discrete versus continuous input data, common input data distributions, and analyzing input data.

2. COLLECTING INPUT DATA

There are many sources from which we can acquire input data, such as: historical records, manufacturer specifications, vendor claims, operator estimates, management estimates, automatic data capture, and direct observation. The data collection phase is the most difficult part of the simulation process. If the operator is knowledgeable about the system, it may be possible to obtain performance estimates that can be used as input data. The most physically and mentally demanding form of data collection is direct observation. The input data may be collected either manually or with the assistance of electronic devices. An important issue for simulation input data concerning time intervals is the time unit that should be used. It is usually less labor intensive to collect the data correctly in the first place using a relative, interarrival time approach.
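The interarrival approach can be sketched in a few lines of Python; the arrival timestamps below are hypothetical values chosen for illustration, not data from the paper:

```python
# Convert absolute arrival timestamps (e.g., clock times in minutes)
# into the relative interarrival times preferred as simulation input.
def interarrival_times(arrivals):
    """Return the gaps between consecutive arrival timestamps."""
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

arrivals = [0.0, 2.0, 3.5, 7.5, 9.0]   # hypothetical observed arrival times
gaps = interarrival_times(arrivals)
print(gaps)  # [2.0, 1.5, 4.0, 1.5]
```

Collecting the gaps directly (rather than absolute times that must be differenced later) is what the text means by collecting the data "correctly in the first place."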
A second time collection issue is what type of units to use. The simulation practitioner should remember two things: we want unbiased data, and we do not want to disrupt the process being observed. If the data are biased in either respect, the result can be a model that yields inaccurate results. While collecting the input data, we should realize that there are different classifications of data. One method of classifying data is whether it is deterministic or probabilistic. Each individual project will call for a unique set or type of input data; some of the types may be deterministic, and others probabilistic. Deterministic data means that the event involving the data occurs in the same or a predictable manner each time. This type of data needs to be collected only once, because it never varies in value. A probabilistic process does not occur with the same regularity. In this case, the process follows some probability distribution, and it is not known with the same certainty that the process will exhibit an exactly known behavior. Another classification of input data is whether the data are discrete or continuous. Discrete data can take only certain values, usually integers. Continuous distributions can take any value in the observed range, which means that fractional numbers are a definite possibility.

3. INPUT DATA DISTRIBUTIONS

We present a few of the most common input data distributions. There are many more types of probability distributions that we may actually encounter. Sometimes we encounter these distributions only as a result of a computerized data-fitting program. These programs are geared toward returning the best mathematical fit among many possible theoretical distributions. In such cases, a particular result does not necessarily mean that there is a rational reason why the data best fit a specific distribution.
Sometimes a theoretical distribution that does make sense will be almost as good a fit. In these cases, we will have to decide whether it makes more sense to use the best mathematical fit or a very close fit that makes sense.

Bernoulli Distribution - is used to model a random occurrence with one of two possible outcomes, frequently referred to as a success or a failure. The Bernoulli distribution is illustrated in Fig. 2. The mean and variance of the Bernoulli distribution are:

mean = p
var = p(1 − p)

where p is the fraction of successes and (1 − p) is the fraction of failures.

Figure 2 – Bernoulli distribution.

Uniform Distribution (Fig. 3) - over the range of possible values, each individual value is equally likely to be observed. Uniform distributions can be used as a first cut for modeling the input data of a process when there is little knowledge of the process. The uniform distribution may be either discrete or continuous. The mean and variance of a uniform distribution are:

mean = (a + b) / 2
var = (b − a)² / 12

where a is the minimum value and b is the maximum value.

Figure 3 – Uniform distribution.

Exponential Distribution (Fig. 4) - is commonly utilized in conjunction with interarrival processes in simulation models, because the arrival of entities in many systems has been either proven or assumed to be a random, or Poisson, process. This means that a random number of entities will arrive within a specific unit of time, and the number of arrivals expected during the unit of time is randomly distributed around the average value. The mean and variance of the exponential distribution are:

mean = B
var = B²

The probability density is:

f(x) = (1/B) e^(−x/B)

and the corresponding inverse-transform relation is:

x = −B ln[1 − F(x)]

where B is the average of the data sample and x is the data value.
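The inverse-transform relation for the exponential distribution leads directly to a generator for simulation input. A minimal sketch in Python, where the mean B = 4.0 is an arbitrary illustrative value:

```python
import math
import random

def sample_exponential(B, rng=random.random):
    """Draw one exponential variate with mean B via inverse transform:
    F(x) = 1 - e^(-x/B)  =>  x = -B * ln(1 - U), with U uniform on [0, 1)."""
    u = rng()
    return -B * math.log(1.0 - u)

random.seed(42)                       # reproducible illustration
samples = [sample_exponential(4.0) for _ in range(10000)]
print(sum(samples) / len(samples))    # sample mean, close to B = 4.0
```

This is the same mechanism simulation packages use internally when an exponential interarrival distribution is specified in the model code.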
Figure 4 – Exponential distribution.

Triangular Distribution - may be used in situations where the practitioner does not have complete knowledge of the system but suspects that the data are not uniformly distributed. In particular, if the practitioner suspects that the data are normally distributed, the triangular distribution may be a good first approximation. The triangular distribution has only three parameters: the minimum possible value, the most common value, and the maximum possible value (Fig. 5). Because the most common value does not have to lie exactly midway between the minimum and the maximum, the triangular distribution is not necessarily symmetric. The mean and variance of the triangular distribution are:

mean = (a + m + b) / 3
var = (a² + m² + b² − ma − ab − mb) / 18

where a is the minimum value, m is the most common value, and b is the maximum value.

Figure 5 – Triangular distribution.

Normal Distribution (Fig. 6) - the time duration of many service processes follows the normal distribution. The reason is that many processes actually consist of a number of subprocesses. Regardless of the probability distribution of each individual subprocess, when the subprocess times are added together, the resulting time durations frequently become normally distributed. The normal distribution has two parameters, the mean and the standard deviation, and it is symmetric: there are equal numbers of observations less than and greater than the mean, and the pattern of the observations on each side is similar. The formula for the normal probability density is:

f(x) = [1 / (σ √(2π))] e^(−(x − µ)² / (2σ²))

where µ is the mean and σ is the standard deviation.

Figure 6 – Normal distribution.

The normal distribution is frequently discussed in terms of the standard normal, or Z, distribution.
This is a special form of the normal distribution in which the mean is 0 and the standard deviation is 1. For normally distributed data, a standard normal value can be converted to an actual data value with the following equation:

x = µ ± σZ

where µ is the true population mean and σ is the true population standard deviation. We can also manipulate this equation to find the values corresponding to specific cumulative percentages. This is performed with the use of the standard normal, or Z, table. A common Z table is a right-hand tail table: the top row and left column contain the Z values, and the interior of the table contains the cumulative percentage of observations for the standard normal distribution. Because the distribution is symmetric, percentages for the opposite tail are obtained by subtracting the tabulated cumulative percentages from 1.

Poisson Distribution (Fig. 7) - is used to model a random number of events occurring in an interval of time. The Poisson distribution has only one parameter, λ:

p(x) = e^(−λ) λ^x / x!

where λ is both the mean and the variance, and x is the value of the random variable.

Figure 7 – Poisson distribution.

Weibull Distribution (Fig. 8) - is often used to represent distributions that cannot take values less than zero. The Weibull distribution possesses two parameters: a shape parameter α and a scale parameter β. The probability density function of the Weibull is:

f(x) = α β^(−α) x^(α−1) e^(−(x/β)^α), for x > 0; 0 otherwise

The mean and variance are:

mean = (β/α) Γ(1/α)
var = (β²/α) { 2Γ(2/α) − (1/α) [Γ(1/α)]² }

where Γ is the gamma function, Γ(α) = ∫₀^∞ x^(α−1) e^(−x) dx.

Figure 8 – Weibull distribution.
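Like the exponential, the Weibull can be sampled by inverting its cumulative distribution function, F(x) = 1 − e^(−(x/β)^α). A short sketch in Python; the shape and scale values used below are arbitrary illustrations:

```python
import math
import random

def sample_weibull(alpha, beta, rng=random.random):
    """Draw one Weibull variate (shape alpha, scale beta) by inverse transform:
    F(x) = 1 - e^(-(x/beta)^alpha)  =>  x = beta * (-ln(1 - U))^(1/alpha)."""
    u = rng()
    return beta * (-math.log(1.0 - u)) ** (1.0 / alpha)

random.seed(1)
xs = [sample_weibull(2.0, 3.0) for _ in range(5)]  # illustrative shape/scale
print(xs)  # five nonnegative variates
```

With shape α = 1 the expression reduces to the exponential generator with mean β, which matches the Weibull's role as a generalization of the exponential.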
Gamma Distribution (Fig. 9) - is another distribution that may be less familiar to the practitioner. The probability density function of the gamma distribution is:

f(x) = [1 / (β^α Γ(α))] x^(α−1) e^(−x/β), for x > 0; 0 otherwise

where α, β, and Γ are defined as for the Weibull distribution.

Figure 9 – Gamma distribution.

The gamma distribution can degenerate to the same mathematical representation as the exponential distribution, and it cannot go below 0.

Beta Distribution (Fig. 10) - is a bit different from most of the previously presented distributions: it covers only the range between 0 and 1. The probability density of the beta distribution is:

f(x) = [Γ(α + β) / (Γ(α) Γ(β))] x^(α−1) (1 − x)^(β−1), for 0 < x < 1; 0 elsewhere

where α and β are shape parameters and Γ is defined as for the Weibull and gamma distributions. The mean and variance of the beta distribution are:

mean = α / (α + β)
var = αβ / [(α + β)² (α + β + 1)]

Figure 10 – Beta distribution.

Geometric Distribution (Fig. 11) - in contrast to the previous, less common distributions, the geometric distribution is discrete: the values it takes on must be whole numbers. The geometric distribution has one parameter, p, the probability of success on any given attempt; (1 − p) is the probability of failure on any given attempt. The probability of x − 1 failures followed by a success on the xth attempt is:

p(x) = p (1 − p)^(x−1), x = 1, 2, 3, …

For this form of the distribution, the mean and variance are:

mean = 1/p
var = (1 − p)/p²

Figure 11 – Geometric distribution.

Some types of input data are actually a combination of a deterministic component and a probabilistic component.
These types of processes generally have a minimum time, which constitutes the deterministic component; the remaining component of the time follows some sort of distribution.

4. ANALYZING INPUT DATA

The process of determining the type of distribution for a set of data usually involves what is known as a goodness-of-fit test. These tests are based on some sort of comparison between the observed data distribution and a corresponding theoretical distribution. If the difference between the two is small, then it may be stated with some level of certainty that the input data could have come from a set of data with the same parameters as the theoretical distribution. There are four different methods for conducting this comparison.

Graphic approach - The most fundamental approach to fitting input data is the graphic approach. It consists of a visual, qualitative comparison between the actual data distribution and a theoretical distribution from which the observed data may have come. The steps of the graphic approach are:
- create a histogram of the observed data;
- create a histogram of the theoretical distribution;
- visually compare the two histograms for similarity;
- make a qualitative decision as to the similarity of the two data sets.
There are two common approaches for determining how to handle the cell issue: the equal-interval approach, in which the width of each data cell range is set to the same value, and the equal-probability approach, a more statistically robust method for determining the number of cells.

Chi-square test - is commonly accepted as the preferred goodness-of-fit technique. Like the graphic comparison, the chi-square test is based on the comparison of the actual number of observations against the expected number of observations.
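The chi-square comparison just described can be sketched in pure Python. The observed and expected cell counts below are hypothetical values for illustration only:

```python
def chi_square_statistic(observed, expected):
    """Chi-square goodness-of-fit statistic: sum over cells of (O - E)^2 / E.
    The statistic is then compared against a chi-square critical value whose
    degrees of freedom depend on the number of cells and fitted parameters."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 31, 29]   # hypothetical counts per histogram cell
expected = [25, 25, 25, 25]   # expected counts under the candidate distribution
print(chi_square_statistic(observed, expected))  # 4.4
```

With the equal-probability cell approach, the expected counts are equal by construction, as in this example; a small statistic relative to the critical value means the fit cannot be rejected.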
Kolmogorov–Smirnov test - should be utilized only when the number of data points is extremely limited and the chi-square test cannot be properly applied. The reason is that the KS test is generally accepted to have less ability to properly fit data than techniques such as the chi-square test.

Square error - uses the summed total of the squared error between the observed and the theoretical distributions, where the error is defined as the difference between the two distributions for each individual data cell.

Two very common questions arise during data acquisition. How much data needs to be collected? It is necessary to observe the right data, to observe the different values that are likely to occur, and to have enough data to perform a goodness-of-fit test. Is it possible to fit the observed data to a theoretical distribution? Possible causes of difficulty include: not enough data were collected, or the data are a combination of a number of different distributions.

5. SOFTWARE IMPLEMENTATIONS FOR DATA FITTING

Fitting a significant number of observed data sets to theoretical distributions can become a very time-consuming task. In some cases, it may be mathematically very difficult to fit the observed data to some of the more exotic probability distributions. For this reason, most practitioners utilize some sort of data-fitting software. Two commonly available programs, among others, that perform this function are the Arena Input Analyzer and ExpertFit.

Figure 12 – Arena Input Analyzer.
Figure 13 – ExpertFit.

6. CONCLUSIONS

In this paper we covered the collection and analysis of input data. This phase of the simulation project is often considered the most difficult. Data may be collected from historical records or in real time. In order to use observed data in a simulation model, they should first be fit to a theoretical probability distribution.
The theoretical probability distribution is then used to generate values to drive the simulation model.

7. RESULTS

In order to determine the best theoretical distribution fit for the observed data, we can use one of several comparison methods: the graphic, chi-square, KS, and square-error methods. In some cases, it may not be possible to obtain a satisfactory theoretical fit. Additional data should then be collected and a new attempt to fit the data should be made. If this is not possible, it may be necessary to use the observed data themselves to drive the simulation model.

8. REFERENCES

[1] Hildebrand, D.K. and Ott, L., Statistical Thinking for Managers, PWS-Kent, Boston, 1991.
[2] Johnson, R.A., Freund, J.E., Miller, I., Miller and Freund's Probability and Statistics for Engineers, Pearson Education, New York, 1999.
[3] Law, A.M. and Kelton, W.D., Simulation Modeling and Analysis, 3rd ed., McGraw-Hill, New York, 2000.