# Histograms: Pictures of Distributions

Our goal in this first sequence of posts on “Real Statistics” is to understand the concept of a “distribution”. There are four different, closely related concepts. We are in the process of discussing the first concept, which I call a “real distribution”. The first post, “Understanding Statistical Distributions”, explains this in detail. As a brief summary, this refers to a population subdivided into categories. The percentages of members belonging to the categories form a ‘distribution’. There are many ways to categorize populations, and each categorization produces a different distribution. Note that so far we have not introduced any randomness or probabilities into the picture. Before proceeding to do so, it is useful to have a better handle on real distributions. This can be obtained by using histograms to produce pictures of distributions. This allows us to “see” the distribution, which permits greater intuitive understanding. The second post, “PP2: Building Confidence”, explains how concepts must be related to the life experience of students in order for them to learn with confidence. The reason for creating a new course on “Real Statistics” is to build the course on foundations of real-world concepts which lie within the experience of students. This is different from conventional courses, which define probability axiomatically and use unfamiliar mathematical concepts for foundations.

It is useful to introduce some mathematical notation for a systematic study of histograms. We always start with a FINITE population P which has N members. We can think of this as a population of people; this ties in closely to our past knowledge and experience, and makes concepts easy to understand. (However, theoretically, the population can be ANY collection of different objects.) Since the population must be finite, we can always count the number of objects, and assign a unique number to each object. Thus, every population can be considered as a set P_{N} = {1, 2, 3, …, N}. Here each member of the population has been assigned a unique number, and they have all been gathered into a set P_{N} which consists of all members of the population.

In conventional treatments of probability and statistics, the population is called the “sample space”. Also, in conventional treatments, any function defined on the sample space is called a “random variable.” We will explain why these words are used when we introduce probability. For the moment, it is easier to explain simple concepts without mentioning the more complex concept of probability. A function on the sample space simply measures some characteristic of the population. For example, we can have a Gender characteristic, or an Age characteristic, or a Height characteristic. These are all functions which assign to each member of the population a number which indicates the category of the member. For example, we can encode Male as 1 and Female as 2, and define a function G(j) by G(j)=1 if member j of population P_{N} is male, and G(j)=2 if member j is female. This function creates two categories, males and females, within the population. The distribution of G(.) is the proportion of each of the two types. For a concrete illustration, suppose the population is a class of 20 students, and 12 of the students are males, while 8 are females. Then the distribution of gender is 60% males and 40% females. The histogram, given below, provides a picture of this distribution:
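The percentages behind this picture can be reproduced in a few lines of Python. The following is only a minimal sketch: the ordering of the students (males numbered 1 to 12, females 13 to 20) and the helper name `distribution` are assumptions made for illustration, not part of the original example.

```python
from collections import Counter

# Hypothetical class of 20 students: G[j] = 1 for male, 2 for female.
# Which students are male is an assumption; only the counts (12 and 8)
# come from the example in the text.
G = {j: 1 if j <= 12 else 2 for j in range(1, 21)}

def distribution(values):
    """Return {category: proportion of the population in that category}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: counts[category] / total for category in sorted(counts)}

gender_distribution = distribution(G.values())
for category, share in gender_distribution.items():
    print(f"category {category}: {share:.0%}")
```

The function only counts members and divides by the population size; no randomness or probability is involved at this stage.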

Next, consider the Age distribution. Suppose, for simplicity, that the students have been numbered in sequence according to their age, and also that the age is measured in months. Suppose that A(1)=222, A(2)=224, and so on – each student has age 2 months more than the previous one. We can also create an EXCEL chart of the data, for any set of ages. This particular simple set of ages has been chosen for convenience.

Student | Age (months) | Student | Age (months) | Student | Age (months) | Student | Age (months) |
---|---|---|---|---|---|---|---|
1 | 222 | 6 | 232 | 11 | 242 | 16 | 252 |
2 | 224 | 7 | 234 | 12 | 244 | 17 | 254 |
3 | 226 | 8 | 236 | 13 | 246 | 18 | 256 |
4 | 228 | 9 | 238 | 14 | 248 | 19 | 258 |
5 | 230 | 10 | 240 | 15 | 250 | 20 | 260 |

Now, when we consider the Age function in months, every student belongs to his/her own unique category. No two students have the same age. In this case, the distribution of Age is very simple. With 20 students, every student is 5% of the total population. The following histogram provides a picture of this distribution:

It is worth noting that whenever the characteristic is such that every member of the population is unique, the histogram always looks like this. Since all categories consist of exactly one member of the population, each category is a fraction 1/N of the population (that is, 100/N percent), where N is the total population size. Also note that each bar on the histogram can be read as a probability: each one is non-negative, and the sum of all of them is one. These are the laws of probability. But this relationship will be clarified later; for the moment, we have not introduced the concepts of randomness and probabilities.
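As a quick check of this claim, the sketch below rebuilds the ages in months from the rule A(j) = 220 + 2j (which gives A(1)=222 through A(20)=260) and confirms that every category holds exactly one student. Exact fractions are used so the sum can be verified without rounding error; the variable names are illustrative choices.

```python
from collections import Counter
from fractions import Fraction

# Ages in months for students 1..20, following the pattern in the table.
ages_in_months = [220 + 2 * j for j in range(1, 21)]

counts = Counter(ages_in_months)
N = len(ages_in_months)

# Proportion of the population in each age category, as an exact fraction.
proportions = {age: Fraction(count, N) for age, count in counts.items()}

# True when every student sits alone in his or her own category.
all_unique = all(count == 1 for count in counts.values())
total = sum(proportions.values())

print(all_unique)
print(set(proportions.values()))
print(total)
```

Every bar has height 1/20, which is 5%, and the heights add up to exactly 1.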

We get more interesting distributions when we look at the Age in Years, rounded to the nearest year, with half-years rounded upwards (so 222 months, which is 18.5 years, counts as 19).

Student | Age (years) | Student | Age (years) | Student | Age (years) | Student | Age (years) |
---|---|---|---|---|---|---|---|
1 | 19 | 6 | 19 | 11 | 20 | 16 | 21 |
2 | 19 | 7 | 20 | 12 | 20 | 17 | 21 |
3 | 19 | 8 | 20 | 13 | 21 | 18 | 21 |
4 | 19 | 9 | 20 | 14 | 21 | 19 | 22 |
5 | 19 | 10 | 20 | 15 | 21 | 20 | 22 |

Now there are 6 students in each of the categories 19, 20, and 21, but only 2 students in category 22. This creates the following histogram:

Note that the picture of the distribution provides clear, direct, and intuitive information about how ages are distributed in the population. There are equal proportions of students aged 19, 20, and 21, while the proportion of students aged 22 is much smaller – only 10%, as opposed to 30% in each of the other categories.
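The grouping into years can be sketched in code as well. The rounding rule used here, nearest year with half-years rounded up, is an assumption inferred from the table (222 months becomes 19, 232 months also becomes 19), and the text bars are only a rough stand-in for a drawn chart.

```python
from collections import Counter

# Ages in months for students 1..20: 222, 224, ..., 260.
ages_in_months = [220 + 2 * j for j in range(1, 21)]

def age_in_years(months):
    """Round to the nearest year, rounding exact half-years upwards."""
    return (months + 6) // 12

ages_in_years = [age_in_years(m) for m in ages_in_months]
counts = Counter(ages_in_years)
N = len(ages_in_years)

# One row per category: a bar of '#' marks and the share of the population.
for age in sorted(counts):
    share = counts[age] / N
    print(f"{age}: {'#' * counts[age]:<6} {share:.0%}")
```

Changing only the categorization (months to years) turns 20 categories of one student each into 4 categories, which is what makes the histogram informative.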

**IMPORTANT TECHNICAL NOTE**: The word “distribution” is an ordinary English language word, which has an ordinary English language meaning, and is commonly used by native speakers. We have now introduced a TECHNICAL meaning for the same word. When we talk about the distribution of ages in the population, this can be understood in two different ways. One is the English language meaning, which can be understood by all. The other is the technical meaning, which can only be understood by those who have taken enough statistics to know it. Failure to differentiate between the two usages of the word can lead to many kinds of confusion.

We will end this post with one final illustration of the histogram. This is a set of 30 Households picked at random from the Household Income Expenditure Survey 2004. The households are listed by number, and the characteristic of interest is the household size – the number of people who belong to the household. For the 30 chosen households, this household size is listed in the data set below:

HH# | Size | HH# | Size | HH# | Size |
---|---|---|---|---|---|
1 | 2 | 11 | 4 | 21 | 8 |
2 | 2 | 12 | 4 | 22 | 8 |
3 | 2 | 13 | 4 | 23 | 8 |
4 | 2 | 14 | 4 | 24 | 8 |
5 | 2 | 15 | 6 | 25 | 8 |
6 | 3 | 16 | 6 | 26 | 8 |
7 | 3 | 17 | 6 | 27 | 8 |
8 | 3 | 18 | 7 | 28 | 9 |
9 | 3 | 19 | 7 | 29 | 9 |
10 | 4 | 20 | 7 | 30 | 9 |

We can see that there are 5 HH of size 2, 4 HH of size 3, 5 HH of size 4, 0 HH of size 5, 3 HH of size 6, 3 HH of size 7, 7 HH of size 8, and 3 HH of size 9. We can create a histogram, which provides a picture of this distribution, as follows:

There are 8 categories of Household Size, ranging from 2 to 9. Each of these has a percentage within the total population of 30 Households. All of the percentages are nonnegative. There are no households of size 5 within this sample of 30, so the percentage is 0% for this category. The sum of the percentages is 100%. These percentages satisfy the laws of probability. Looking at the histogram, we can see that the largest category is size 8, with about 23% of households (7 out of 30) belonging to this category. The graph gives us a direct visual understanding of the proportions of families of different sizes within the population of the 30 families within the sample.
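The same counting can be sketched for the household data. The sizes are copied from the table above; the variable names and the text-bar layout are illustrative choices. Note that the loop runs over every size from 2 to 9, so the empty category (size 5) shows up with 0% rather than being silently dropped.

```python
from collections import Counter

# Household sizes for HH 1..30, in the order listed in the table.
sizes = [2] * 5 + [3] * 4 + [4] * 5 + [6] * 3 + [7] * 3 + [8] * 7 + [9] * 3

counts = Counter(sizes)
N = len(sizes)

# Percentage of households in every category from the smallest to the
# largest size; Counter returns 0 for sizes that never occur.
percentages = {
    size: 100 * counts[size] / N
    for size in range(min(sizes), max(sizes) + 1)
}

for size, pct in percentages.items():
    bar = "#" * counts[size]
    print(f"size {size}: {bar:<7} {pct:4.1f}%")

# The category holding the largest share of households.
largest = max(percentages, key=percentages.get)
print("largest category:", largest)
```

The checks stated in the text (nonnegative percentages that sum to 100%) can be confirmed directly on the `percentages` dictionary.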

At this point, we will conclude the current post, about histograms as a way to PICTURE a distribution. This provides students with a direct, hands-on, intuitive way to understand the distribution of any characteristic of a population. This is COMPLETELY ADEQUATE background to understand the concept of probability in a direct and intuitive way, which is VERY DIFFERENT from what any current textbook of probability teaches. As we will see, the concept of probability is just another way of looking at the concept of a histogram (or a distribution). We will make the extension in the next post.

Hopefully, students will have understood this material fairly easily, because it is based on familiar concepts. At this point, I would recommend re-reading “PP2: Building Confidence”. This is how learning proceeds – by taking small and simple steps. This creates confidence in students that they can understand, and also gives them experience in taking small steps. This confidence and experience create the strength, energy, and motivation required for taking larger steps, by building upon this experience. The next post in this sequence is “Coarse and Fine Histograms”, which studies histograms in greater depth.