
NetLogo Models Library: 
If you download the NetLogo application, this model is included. (You can also run this model in your browser, but we don't recommend it; details here.) 
This model demonstrates relations between population distributions and their sample mean distributions, as well as the effect of sample size on this relation. In this model, a population is distributed by some variable, for instance by their total assets in thousands of dollars. The population is distributed randomly (not necessarily 'normally'), but sample means from this population nevertheless accumulate in a distribution that approaches a normal curve. The program allows for repeated sampling of individual specimens in the population.
This model is a part of the ProbLab curriculum. The ProbLab curriculum is currently under development at the CCL. For more information about the ProbLab Curriculum please refer to http://ccl.northwestern.edu/curriculum/ProbLab/.
Either the program or the user creates a population that ranges along some dimension, such as total assets. In the View, we see this population arranged in a "picture bar chart." For instance, poorer people are farther to the left, and richer people are farther to the right. Next, a group of individual specimens from this population is selected as a sample (these sampled people are painted in a different color). The program calculates the mean value of this sample (their average assets) and plots this mean in the histogram below the view. We can set the program to sample repeatedly, and we can observe the emergence of the distribution of sample means. We can also look at the corresponding histogram of the sums of the samples. This allows us to study the relation between the sums and the means in terms of properties of their distributions.
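The sampling loop described above can be sketched outside NetLogo. The following Python sketch is not the model's own code; the population, the function name sample_means, and the assumption that no person appears twice within one sample are illustrative choices.

```python
import random
from collections import Counter
from statistics import mean

def sample_means(population, sample_size, num_samples, seed=None):
    """Draw num_samples samples (no person twice within one sample)
    and return the list of their means."""
    rng = random.Random(seed)
    return [mean(rng.sample(population, sample_size)) for _ in range(num_samples)]

# Hypothetical population: each number is one person's x-value.
population = [0] * 5 + [1] * 3 + [2] * 1 + [9] * 2 + [10] * 4
means = sample_means(population, sample_size=4, num_samples=2000, seed=1)

# Histogram of the sample means, rounded to the nearest column.
histogram = Counter(round(m) for m in means)
for column in sorted(histogram):
    print(f"{column:2d} {'#' * (histogram[column] // 20)}")
```

Although this population is concentrated at the two ends of the range, the printed histogram of sample means is expected to pile up toward the middle.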
Press SETUP, and then press either CREATE RANDOM PEOPLE or CREATE MY OWN PEOPLE, or just press one of the PRESET buttons. If you've created random people or pressed one of the presets, you are now ready to press either GO ONCE or GO. But if you've pressed CREATE MY OWN PEOPLE, you now need to click on the View to create these people. Only then should you press either GO ONCE or GO. You can also vary the setting of the sliders. Watch results in the plot and explore relations between the settings and the results. If you'd like to guess the mean value of the sample means before you begin sampling, you can use the RANGE slider to indicate the location of this mean.
Buttons:
SETUP: initializes variables.
CREATE RANDOM PEOPLE: activates SETUP and then creates a random population in the View.
CREATE MY OWN PEOPLE: allows the user to create persons by clicking on the View.
PRESET 1 through PRESET 6: create populations with special patterns of possible interest.
GO ONCE: marks a sample in the population, then calculates and plots its mean value.
GO: repeats GO ONCE indefinitely.
Sliders:
SAMPLESIZE: determines the number of specimens sampled at each run through the Go procedure.
MAXVALUE: determines the range of values the members of the population can take (a number on the interval 0 to MAXVALUE).
Switches:
ALSOSUMS?: if set to "On," the sums of the samples (and the mean of these sums) are plotted as well as the means of the samples (and the mean of these means). If set to "Off," only the means are plotted.
SHOWEV?: if set to "On," the EXPECTED VALUE (EV) monitor will display a value.
Monitors:
NUMSAMPLES: shows the total number of samples taken since the last 'setup.'
EXPECTED VALUE: calculates the mean x-value of the population. For each column, the program calculates the product of the number of turtles and the x-value of that column. Next, all these products are summed up and divided by the total number of turtles.
STDDEVMEANS: shows the standard deviation of the histogram of sample means. This is an index of how "diffuse" the distribution is. The smaller the number, the tighter (narrower, more clustered) is the distribution of sample means.
STDDEVSUMS: shows the standard deviation of the histogram of sample sums.
Plots:
SAMPLEMEAN DISTRIBUTION: distribution of the mean values of all samples taken and the mean sum of all samples taken.
The property we are looking at is indexed by the "x-value" of the people in the bar chart that is in the View. A person's x-value can be seen in the numerical label at the bottom of its column in the view. For instance, the x-value of people in the leftmost column is 0. There could, in principle, be no person with the x-value "0," there could be a single person with that value, or there could be two or more. They all share the same value, because they are all in the same column.
Members of the population that turn green when you press GO or GO ONCE are the 'sample.' Their mean x-value, for instance, their savings, is plotted in the histogram below the view. For example, if a sample of three people is taken and their x-values are 7, 8, and 12, then the histogram column "9" will bump up by one unit, because 9 is the mean of 7, 8, and 12.
For some settings of the population, the more samples you take the more likely you are to get a rare sample. So the distribution you get after only a few samples is not necessarily reflective of all possible mean values.
Can you see any connections between the distribution of the population (in the graphics display window) and the mean value of the histogram (in the plot window)? For instance, if there happen to be more population specimens ("people") on the left side of the range, where do you expect to see most of the sample means?
Try running the model with a SAMPLESIZE of just 1. What do you get? Now try with a SAMPLESIZE of 2. Has anything changed? How about a larger sample size?
Are there any connections between SAMPLE SIZE, RANGE, and STDDEV? One way to explore this question is to keep two of these variables constant and examine what happens when you change the third variable. You may want to take an equal number of samples for each of these trials.
If you set the model to a sample size that is larger than the total size of the population, you will receive a message telling you that you cannot do this. However, you may set a sample size that is larger than most columns. This means that the entire sample cannot fit into those columns. Is this a problem? What does this do to the distribution of sample means?
Using the CREATE MY OWN PEOPLE option, build some "unusual" populations. Some of these have already been put into the PRESET buttons. For instance, you could create people only in one or two columns, or you could make the population "U-shaped" (more on the outside and fewer and fewer as you go towards the middle). What are your findings?
Again, using the CREATE MY OWN PEOPLE option, build one very tall column off on the right side of the view (at about x-value 8) and build a few very short columns. Set the SAMPLESIZE to 10. Press GO ONCE. What can you say about the number of persons that happened to be chosen from the tall column? Try this again and then press GO. Look at the plot. Do you see any connection between the chance of getting samples from the tall column and the location of the mean in the plot?
Set ALSOSUMS? to "On," and activate the program. Can you explain the similarities and differences between the two histograms you get? For instance, you can look at their range, the total area they cover, their height, and their shape. Try to explain the transformation between the two histograms. For instance, why is the histogram on the left taller than the histogram on the right? Look at the standard deviation of the means and of the sums. What is the ratio between these two values? Does this ratio relate to any other value in the settings of this model?
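To inspect the two standard deviations outside the model, a small Python sketch such as the following can be used. It is an illustration only: the population, the seed, and the sample size are made-up values, and the sketch does not reproduce the model's NetLogo code.

```python
import random
from statistics import pstdev

rng = random.Random(7)
population = [rng.randint(0, 10) for _ in range(60)]  # hypothetical population
sample_size = 5

# Each sample contributes one sum and one mean, as in the two histograms.
samples = [rng.sample(population, sample_size) for _ in range(1000)]
sums = [sum(s) for s in samples]
means = [total / sample_size for total in sums]

ratio = pstdev(sums) / pstdev(means)
print("std dev of means:", pstdev(means))
print("std dev of sums: ", pstdev(sums))
print("ratio:           ", ratio)
```

Rerun it with different values of sample_size and compare the printed ratio with the settings.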
Relating to the two histograms, can you find a case in which the two histograms converge to a single histogram?
Press SETUP, press CREATE MY OWN PEOPLE, make 6 columns of height 2 (two persons each), set the sample size to 2, and set ALSOSUMS? to "On." Now press GO. It is interesting to compare this statistics activity with a probability activity in which you are rolling a pair of dice. For instance, how many possible columns are there in the sums histogram? In fact, such a comparison can help us think through similarities and differences between statistics and probability.
What is the relation between the number of samples you take, the size of each sample, and the resulting distribution of sample means? For instance, if you have a budget to sample 1,000 people, should you take 10 samples of size 100 each or 100 samples of size 10 each? What do you gain and what do you, perhaps, lose in each of these choices, for instance in terms of confidence or in terms of information about the population you are sampling from? To explore this question, you may want to extend the range of the sample size. You may also want to resize the view so as to allow for more specimens in your population. Finally, it may be helpful to have a slider and corresponding code for controlling the total number of samples you are taking.
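One way to start on the budget question before extending the model is a quick experiment in Python. This is a sketch under stated assumptions: the population of 5,000 values, the seed, and the helper spread_of_means are all hypothetical and are not part of the model.

```python
import random
from statistics import mean, pstdev

rng = random.Random(11)
population = [rng.randint(0, 100) for _ in range(5000)]  # hypothetical large population

def spread_of_means(sample_size, num_samples):
    """Standard deviation of the means of num_samples samples."""
    means = [mean(rng.sample(population, sample_size)) for _ in range(num_samples)]
    return pstdev(means)

# Both plans sample 1,000 people in total.
print("10 samples of 100:", spread_of_means(100, 10))
print("100 samples of 10:", spread_of_means(10, 100))
```

The first plan gives few, tightly clustered means; the second gives many, more widely scattered means. Which one is preferable depends on what you want to learn about the population.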
The first thing to remember is that in reality we do not know the distribution of the population from which we are sampling. We only have the plot, so to speak. So as you are interacting with this model, you should recall that in applied statistics the Graphic Display does not exist. In this model, however, we are simulating the population, as if we do know its distribution, in order to understand the relation between population metrics and their sample means distributions.
You may have noticed that, almost regardless of the shape of your population, the histogram always eventually takes on a certain shape. This shape is called a "normal curve" or "bell curve" or "bell-shaped curve." We say that the histogram "approaches" the normal curve as one takes more and more samples. For special population distributions, we may get special cases of this curve. For instance, if you have created a population that has all the people in the same column, your histogram will be an extreme case of a bell curve: it will itself consist only of a single column.
Often, people say that "a population is distributed...etc," but it could be that, sometimes, what they actually mean to say is that "the sample means of the population are distributed...etc." This does not imply that the second figure of speech is necessarily preferable, but only that we should understand the difference between these two ideas. In this model, the view shows how the population itself is distributed, whereas the plot shows how the sample means are distributed. Working with this model, one may be struck by the contrast between these two distributions.
The plot shows both the distribution of the sample means and the mean of these means. This mean of means converges on the expected value of sampling from this population. It can be calculated as the average x-value. That is, multiply the x and y values of each column (its x-value and the number of persons in it), add these products, and divide the sum by the total number of data points ("persons"). For instance, if there are 3 persons over the 0 value, 2 persons over the 1 value, and 5 persons over the 2 value, the sum of the three x and y products is: 3*0 + 2*1 + 5*2 = 12. We now divide 12 by 10 (the total number of persons): 12 / 10 = 1.2. So if we sample from this population, the mean of the sample means will converge on 1.2. Try this. You can use the above example or any other example you invent.
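The same calculation can be written as a short Python function. This is a sketch for checking your own examples, not the model's code; the name expected_value and the dictionary layout are assumptions made for illustration.

```python
def expected_value(column_counts):
    """column_counts maps each x-value to the number of persons in that column."""
    total_persons = sum(column_counts.values())
    weighted_sum = sum(x * count for x, count in column_counts.items())
    return weighted_sum / total_persons

# The example from the text: 3 persons at 0, 2 persons at 1, 5 persons at 2.
print(expected_value({0: 3, 1: 2, 2: 5}))  # 12 / 10 = 1.2
```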
The biggest challenge is to use this model so as to come up with an explanation of why we almost always get a bell curve when we take enough samples.
Please note the following point of potential confusion. In order to enable close examination of the sampling process, the populations in this model contain fewer specimens than most populations that are commonly studied by researchers using statistical analysis. For instance, there might be no more than 5 individual specimens in a population who share the same x-value (that is, who are all in the same column). Therefore, a sample that is larger in size than the number of specimens in that column, say a sample of size 8, can never include specimens exclusively from that column. In "real life," it could theoretically happen that an entire random sample is taken from a single column. This means that if you use large sample sizes you should expect to get narrower sample-mean distributions than what one would otherwise expect. For instance, if the leftmost column is not tall enough to contain the entire sample, you will never receive a sample mean that is equal in value to the x-value of that column. This is because some of the sample will "spill over" to the right of that column, resulting in a greater sample mean. Because the same logic holds for samples taken from the far right, the sample mean values will be closer to the center than they would be for small sample sizes.
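The "spill over" effect can be checked by listing every possible sample from a tiny population. The Python sketch below is illustrative only; the seven-person population and the helper mean_range are hypothetical, and the sketch assumes that no person appears twice within one sample.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population: 2 persons at x-value 0, 3 at x-value 5, 2 at x-value 10.
population = [0, 0, 5, 5, 5, 10, 10]

def mean_range(population, sample_size):
    """Smallest and largest mean over every possible sample of the given size,
    where no person appears twice in the same sample."""
    means = [mean(s) for s in combinations(population, sample_size)]
    return min(means), max(means)

for size in (1, 2, 3, 4):
    print(size, mean_range(population, size))
```

Once the sample size exceeds the height of the outermost columns (2 in this example), the extreme x-values 0 and 10 can no longer occur as sample means, and the range of possible means narrows.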
The current version of the model allows for repeated sampling of individual specimens. That means that if a person was selected randomly in the first sample, it can be sampled again in the second sample, etc. If we did not allow this repeated sampling, would the sample-mean distribution be affected at all? If so, how? Add code that allows the repeated sampling only as an option and compare the outcomes between the two options.
How do medians behave? The same way as means? Add an option to see the median of both the population and the sample data.
Add a monitor that shows the ratio between the two standard deviations represented in this model.
This model shows means and sums of sampled data. It may be interesting to look at other analyses of the samples. For instance, the product of all values of sampled people as well as the nth root of this product (for a sample of size n).
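As a starting point for this extension, the product and its nth root (the geometric mean) can be tried out in Python before adding them to the model. This is a sketch with made-up values; the function name geometric_mean is an assumption, and the x-values here start at 1 because a single 0 in the sample makes the whole product 0.

```python
import math
import random

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

rng = random.Random(3)
population = [rng.randint(1, 10) for _ in range(50)]  # hypothetical; x-values start at 1
sample = rng.sample(population, 4)
print(sample, math.prod(sample), geometric_mean(sample))
```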
How does the standard deviation change as we collect more and more samples? To examine this, you can add a plot of the standard deviation over "time" (over samples).
In the procedure draw we used a temporary variable, tempmousexcor. This variable assures that the program won't become "confused." Without this variable, the program might enter a clause in the ifelse code that no longer satisfies what the user actually meant when s/he clicked with the mouse. This could occur because the user is moving the mouse rapidly. So the program selects the ifelse clause that is correct at that moment, but meanwhile the mouse click would already be some place else, and so the clause selection would no longer be suitable. The temporary variable avoids this by going through with the instructions as though the mouse were still clicked down where it had been a moment ago.
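The idea of reading a changing value once and reusing that snapshot can be illustrated outside NetLogo. The Python sketch below is not the model's draw procedure; FakeMouse, column_action, and the two action names are hypothetical stand-ins invented for this illustration.

```python
class FakeMouse:
    """Stand-in for a moving mouse: every read returns the next position."""

    def __init__(self, positions):
        self._positions = iter(positions)

    def xcor(self):
        return next(self._positions)


def column_action(mouse, columns_with_people):
    # Read the position once and reuse it, so the test and the action
    # refer to the same column even if the mouse keeps moving.
    temp_x = mouse.xcor()
    if temp_x in columns_with_people:
        return ("add-to-column", temp_x)
    return ("start-column", temp_x)


# The mouse is at column 3 when read, then moves on to column 7.
print(column_action(FakeMouse([3, 7]), columns_with_people={3}))
```

Had the code called mouse.xcor() a second time inside the branch, it would have received 7 and acted on a column other than the one it had just tested.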
In the current version, you can create the population, but you cannot experiment with the sampling. Build a procedure that allows you to take your own samples.
Several ProbLab models look at the emergence of bell-shaped curves through the accumulation of sample means. See, for example, Prob Graphs Basic and Random Basic Advanced. To look closer at the idea of expected value, see the models Expected Value and Expected Value Advanced.
This model is a part of the ProbLab curriculum. The ProbLab Curriculum is currently under development at Northwestern's Center for Connected Learning and Computer-Based Modeling. For more information about the ProbLab Curriculum please refer to http://ccl.northwestern.edu/curriculum/ProbLab/.
If you mention this model or the NetLogo software in a publication, we ask that you include the citations below.
For the model itself:
Please cite the NetLogo software as:
Copyright 2005 Uri Wilensky.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
Commercial licenses are also available. To inquire about commercial licenses, please contact Uri Wilensky at uri@northwestern.edu.