NetLogo Models Library:
This demonstrates relations between population distributions and their sample mean distributions as well as the affect of sample size on this relation. In this model, a population is distributed by some variable, for instance by their total assets in thousands of dollars. The population is distributed randomly -- not necessarily 'normally' -- but sample means from this population nevertheless accumulate in a distribution that approaches a normal curve. The program allows for repeated sampling of individual specimens in the population.
This model is a part of the ProbLab curriculum. The ProbLab curriculum is currently under development at the CCL. For more information about the ProbLab Curriculum please refer to http://ccl.northwestern.edu/curriculum/ProbLab/.
Either the program or the user creates a population that ranges along some dimension, such as total assets. In the View, we see this population arranged in a "picture bar chart." For instance, poorer people are farther to the left, and richer people are farther to the right. Next, a group of individual specimens from this population is selected as a sample (these sampled people are painted in a different color). The program calculates the mean value of this sample -- their average assets -- and plots this mean in the histogram below the view. We can set the program to sample repeatedly, and we can observe the emergence of the distribution of sample means. We can also look at the corresponding histogram of the sums of the samples. This allows us to study the relation between the sums and the means in terms of properties of their distributions.
Press SETUP, and then press either CREATE RANDOM PEOPLE or CREATE MY OWN PEOPLE or just press one of the PRESET buttons. If you've created random people or if you have pressed one of the presets, you are now ready to press either GO ONCE or GO. But if you've pressed CREATE MY OWN PEOPLE, you now need to click on the View to create these people. Only then you should press either GO ONCE or GO. You can also vary the setting of the sliders. Watch results in the plot and explore relations between the settings and the results. If you'd like to guess the mean value of the sample means before you begin sampling, you can use the RANGE slider to indicate the location of this mean.
Buttons: SETUP -- initialize variables CREATE RANDOM PEOPLE -- activates SETUP and then creates a random population in the View CREATE MY OWN PEOPLE -- allows the user to create persons by clicking on the View PRESET 1 - PRESET 6 -- creates populations with special patterns of possible interest GO ONCE -- marks a sample in the population and calculates and plot their mean value. GO -- repeats GO ONCE indefinitely.
Sliders: SAMPLER SIZE -- determines the number of specimens sampled at each run through the Go procedure MAX-VALUE -- determines the range of values the members of the population can take (a number on the interval 0 to MAX-VALUE)
Switches: ALSO-SUMS? -- if set to "On," the sums of the samples (and the means of these sums) is plotted as well as the means of the samples (and the mean of these means). If set to "Off," only the means are plotted. SHOW-EV? -- if set to "On," the EXPECTED VALUE (EV) monitor will display a value
Monitors: NUM-SAMPLES -- shows the total number of samples taken this the last 'setup.' EXPECTED VALUE -- calculates the mean x-value of the population. For each column, the program calculates the product of the number of turtles and the x-value of that column. Next, all these products are summed up and divided by the total number of turtles. STD-DEV-MEANS -- shows the standard deviation of the histogram of sample means. This is an index of how "diffuse" the distribution is. The smaller the number, the tighter (narrower, more clustered) is the distribution of sample means. STD-DEV-MEANS -- shows the standard deviation of the histogram of sample sums.
Plots: SAMPLE-MEAN DISTRIBUTION -- distribution of the mean values of all samples taken and the mean sum of all samples taken.
The property we are looking at is indexed by the "x-value" of the people in the bar chart that is in the View. A person's x-value can be seen in the numerical label at the bottom of its column in the view. For instance, the x-value of people in the left-most column is 0. There could, in principle, be no person with the x-value "0," there could be a single person with that value, or there could be two or more. They all share the same value, because they are all in the same column.
Members of the population that turn green when you press GO or GO ONCE are the 'sample.' Their mean x-value, for instance, their savings, is plotted in the histogram below the view. For example, if a sample of three people is taken and their x-values are 7, 8, and 12, then the histogram column "9" will bump up by one unit, because 9 is the mean of 7, 8, and 12.
For some settings of the population, the more samples you take the more likely you are to get a rare sample. So the distribution you get after only a few samples is not necessarily reflective of all possible mean values.
Can you see any connections between the distribution of the population (in the graphics display window) and the mean value of the histogram (in the plot window)? For instance, if there happen to be more population specimens ("people") on the left side of the range, where do you expect to see most of the sample means?
Try running the model with a SAMPLE-SIZE of just 1. What do you get. Now try with a SAMPLE-SIZE of 2. Has anything changed? How about a larger sample size?
Are there any connections between SAMPLE SIZE, RANGE, and STD-DEV? One way to explore this question is to keep two of these variables constant and examine what happens when you change the third variable. You may want to take an equal number of samples for each of these trials.
If you set the model to a sample size that is larger than the total size of the population, you will receive a message telling you cannot do this. However, you may set a sample size that is larger than most columns. This means that the entire sample cannot fit into those columns. Is this a problem? What does this do to the distribution of sample means?
Using the CREATE MY OWN PEOPLE option, build some "unusual" populations. Some of these have already been put into the PRESET buttons. For instance, you could create people only in one or two columns, or you could make the population "U-shaped" (more on the outside and less and less as you go towards the middle). What are your findings?
Again, using the CREATE MY OWN PEOPLE option, build one very tall column off on the right side of the view (about at x-value 8) and build a few very short columns. Set the SAMPLE-SIZE to 10. Press GO ONCE. What can you say about the number of persons that happened to be chosen from the tall column? Try this again and then press GO. Look at the plot. Do you see any connection between the chance of getting samples from the tall column and the location of the mean in the plot?
Set ALSO-SUMS? to "On," and activate the program. Can you explain the similarities and differences between the two histograms you get? For instance, you can look at their range, the total area they cover, their height, and their shape. Try to explain the transformation between the two histograms. For instance, why is the histogram on the left taller than the histogram on the right? Look at the standard deviation of the means and of the sums. What is the ratio between these two values? Does this ratio relate to any other value in the settings of this model?
Relating to the two histograms, can you find a case in which the two histograms converge to a single histogram?
Press SETUP, press CREATE MY OWN PEOPLE and make 6 columns of the height 2 (two persons), set the sample size to 2, and set the ALSO-SUMS? to "On." Now press GO. It is interesting to compare between this statistics activity and a probability activity in which you are rolling a pair of dice. For instance, how many possible columns are there in the sums histogram? In fact, such a comparison can help us think through similarities and differences between statistics and probability.
What is the relation between the number of samples you take, the size of each sample, and the resulting distribution of sample means? For instance, if you have a budge to sample 1,000 people, should you take 10 samples of size 100 each or 100 samples of size 10 each? What do you gain and what do you, perhaps, lose, in each of these choices? For instance, in terms of confidence or in terms of information about the population you are sampling from. To explore this question, you may want to extend the range of the sample size. You may also want to resize the view so as to allow for more specimens in your population. Finally, it may be helpful to have a slider and corresponding code for controlling the total number of samples you are taking.
The first thing to remember is that in reality we do not know the distribution of the population from which we are sampling. We only have the plot, so to speak. So as you are interacting with this model, you should recall that in applied statistics the Graphic Display does not exist. In this model, however, we are simulating the population -- as if we do know its distribution -- in order to understand the relation between population metrics and their sample means distributions.
You may have noticed that, almost regardless of the shape of your population, the histogram always eventually takes on a certain shape. This shape is called a "normal curve" or "bell curve" or "bell-shaped curve." We say that the histogram "approaches" the normal curve as one takes more and more samples. For special population distributions, we may get special cases of this curve. For instance, if you have created a population that has all the people in the same column, your histogram will be an extreme case of a bell curve -- it itself will consist only of a single column.
Often, people say that, "a population is distributed...etc," but it could be that, sometimes, what they actually mean to say is that, "the sample-means of the population are distributed...etc." This does not imply that the second figure of speech is necessarily preferable, but only that we should understand the difference between these two ideas. In this model, the view shows how the population itself is distributed, whereas the plot shows how the sample means are distributed. Working with this model, one may be struck by the contrast between these two distributions.
The plot shows both the distribution of the sample means and the mean of these means. This mean of means converges on the expected value of sampling from this population. It can be calculated as the average x-value. That is, multiply the x and y values of each column, add these products, and divide the sum by the total number of data points ("persons"). For instance, if there are 3 persons over the 0 value, 2 persons over the 1 value, and 5 persons over the 2 value, the sum of the three x and y products is: 3*0 + 2*1 + 5*2 = 12. We now divide 12 by 10 (the total number of persons). 12 : 10 = 1.2. So if we sample from this population, the mean of the sample means will converge on 1.2. Try this. You can use the above example or any other example you invent.
The biggest challenge is to use this model so as to come up with an explanation of why, we almost always get a bell curve when we take enough samples.
Please note the following point of potential confusion. In order to enable close examination of the sampling process, the populations in this model contains fewer specimens than most populations that are commonly studied by researchers using statistical analysis. For instance, there might be no more than 5 individual specimens in a population who share the same x-value (that is, who are all in the same column). Therefore, a sample that is larger in size than the number of specimens in that column, say a sample of size 8, can never include specimens exclusively from that column. In "real life," it could theoretically happen that an entire random sample is taken from a single column. This means that if you use large sample sizes you should expect to get narrower sample-mean distributions than what one would otherwise expect. For instance, if the left-most column is not tall enough to contain the entire sample, you will never receive a sample mean that is equal in value to the x-value of that column. This is because some of the sample will "spill over" to the right of that column, resulting in a greater sample mean. Because the same logic holds for samples taken from the far right, the sample mean values will be closer to the center than they would be for small samples sizes.
The current version of the model allows for repeated sampling of individual specimens. That means that if a person was selected randomly in the first sample, it can be sampled again in the second sample, etc. If we did not allow this repeated sampling, would the sample-mean distribution be affected at all? If so, how? Add code that allows the repeated sampling only as an option and compare the outcomes between the two options.
How do medians behave? The same way as means? Add an option to see the median of both the population and the sample data.
Add a monitor that shows the ratio between the two standard deviations represented in this model.
This model shows means and sums of sampled data. It may be interesting to look at other analyses of the samples. For instance, the product of all values of sampled people as well as the n-th root of this product (for a sample of size n).
How does the standard deviation change as we collect more and more samples? To examine this, you can add a plot of the standard deviation over "time" (over samples).
In the procedure
draw we used a temporary variable,
temp-mouse-xcor. This variable assures that the program won't become "confused." Without this variable, the program might enter a clause in the
ifelse code that no longer satisfies what the user actually meant when s/he clicked with the mouse. This could occur, because the user is moving the mouse rapidly. So the program selects the
ifelse clause that is correct at that moment, but meanwhile the mouse click would already be some place else, and so the clause selection would no longer be suitable. The temporary variable avoids this by going through with the instructions as though the mouse were still clicked down where it had been a moment ago.
In the current version, you can create the population, but you cannot experiment with the sampling. Build a procedure that allows you to take your own samples.
Several ProbLab models look at the emergence of bell-shaped curves through the accumulation of sample means. See, for example, Prob Graphs Basic and Random Basic Advanced. To look closer at the idea of expected value, see the models Expected Value and Expected Value Advanced.
This model is a part of the ProbLab curriculum. The ProbLab Curriculum is currently under development at Northwestern's Center for Connected Learning and Computer-Based Modeling. . For more information about the ProbLab Curriculum please refer to http://ccl.northwestern.edu/curriculum/ProbLab/.
If you mention this model or the NetLogo software in a publication, we ask that you include the citations below.
For the model itself:
Please cite the NetLogo software as:
Copyright 2005 Uri Wilensky.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
Commercial licenses are also available. To inquire about commercial licenses, please contact Uri Wilensky at firstname.lastname@example.org.