Understanding Statistical Distributions 1
In his Guide to Econometrics, Peter Kennedy writes that, even though the concept of the distribution of random variables is the most basic and the most important concept at the foundation of all of statistics and econometrics, very few students actually understand what it means. My own teaching experience with students who have taken statistics and econometrics courses confirms this observation: despite its fundamental importance, very few understand it. The goal of this sequence of posts is to explain this concept in the simplest cases, and in the simplest way, in the hope of fixing this problem. A Very Important reason for the confusion is that there are actually FOUR different concepts of distribution. The FIRST concept refers to the distribution of characteristics within a parent population. This is the only REAL distribution; that is, this distribution exists in external reality, and is objective and concrete, and the same for all. The SECOND concept refers to a THEORETICAL distribution, which can be used to provide mathematically convenient APPROXIMATIONS to real distributions. These theoretical distributions DO NOT EXIST in external reality. They are convenient simplifications and idealizations which exist in our minds, but not in the real world. A good example is a Euclidean line, which is perfectly straight from minus infinity to plus infinity. It is so infinitely long that it cannot fit in our finite universe, and can only exist in our imagination. The theoretical distributions are exactly of this nature: idealizations which exist in our imagination, but not in the real world. The THIRD and FOURTH concepts are statistical and relate to repeated independent random draws from parent populations. These create random variables which have two types of distributions.
Students find this hard to understand because teachers fail to differentiate between these different concepts, and fail to build up the complex concept from its basic and elementary parts. This FIRST post deals with the simplest case of a REAL DISTRIBUTION of characteristics within a REAL population, which does not involve any concepts of randomness or probability.
The first type of distribution, which is very easy to understand, is the only one which actually exists in the Real World. All the other three concepts are imaginary: they exist in a world of theory, which exists only within our imagination, and does not have a counterpart in reality. Our planned course in Real Statistics (see also RSIA 1, RSIA 2, RSIA 3, and RSIA 4) will differentiate strongly between real world concepts and imaginary concepts. In order to understand “real distributions” we start with the concept of a PARENT POPULATION. This is the target of our statistical inference procedures. That is, the goal of our statistical efforts is to learn about the parent population. It is convenient to think of “population” in literal terms as a population of human beings, although it can be any collection of real world objects. We are interested in studying the characteristics of the members of the population.
For example, consider the characteristic of gender. For each member j of the population, the gender of j is G(j), which can be either Male or Female. By looking at this characteristic, we can create two categories within the population. The distribution of the characteristic gender in the population is, by definition, the percentage of men and the percentage of women. So for example, if the population is 100,000 and there are 52,000 men and 48,000 women, then the distribution is G(j)=M: 52% and G(j)=W: 48%. We can proceed more formally as follows.
Suppose the Parent Population has N members. We use the index j=1,2,…,N to enumerate the members. A characteristic is a function C(.) which assigns some characteristic to each member of the population. For example, given candidates X1, X2, …, Xn in an election, a characteristic is the Vote V(j), which assigns to each member of the population the candidate that he or she voted for. So if V(j)=2, this means that member j voted for candidate 2. In this situation, we need to make sure that the function V(.) is defined for all members. If someone did not vote for anyone, then the value of V(j) cannot be defined. So it would be necessary to create an artificial category, such as a candidate 0. We can define V(j)=0 to mean that j did not vote for anyone.
Coming back to the simplest case of two categories, suppose that for all j=1,2,…,N, we can categorize the gender characteristic as G(j)=M or G(j)=W. Then each member of the population belongs to one of these two categories, M or W. Now the DISTRIBUTION of the characteristic function G(.) is defined by the percentages of each of the categories. To be specific, if the population consists of 100 people and 52 are Men and 48 are Women, then the distribution of G(.) is M: 52% and W: 48%. Each of the possible outcomes of the function is assigned the percentage of the population which has that outcome (characteristic).
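The two-category calculation can be sketched in a few lines of Python. This is only an illustration: the list `gender` is a made-up stand-in for the function G(.), built to match the 52/48 split in the text.

```python
from collections import Counter

# Hypothetical population of 100 members: 52 men ("M") and 48 women ("W").
# gender[j] plays the role of G(j) for member j.
gender = ["M"] * 52 + ["W"] * 48

counts = Counter(gender)   # number of members in each category
N = len(gender)            # population size

# The distribution: percentage of the population in each category.
gender_distribution = {
    category: 100 * count / N for category, count in counts.items()
}

print(gender_distribution)  # {'M': 52.0, 'W': 48.0}
```

Nothing here involves randomness: we simply count the members in each category and divide by the size of the population.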
Multiple Categories: It is easy to extend the concept of distribution to the case of multiple categories created by multiple outcomes. Suppose we have n candidates in an election, X1, X2, …, Xn. Introduce the categories C1, C2, …, Cn for the people in the population who vote for the corresponding candidates. Assuming each person can only vote for one candidate, these categories are mutually exclusive. If we add a category C0 of people who did not vote, then the n+1 categories 0,1,2,…,n would also become exhaustive. The function V(j) defines for each member j of the population the candidate for which j voted, with value 0 representing non-voting.
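As a rough sketch of the multi-category case, the following Python snippet computes the distribution of a vote function with a category 0 for non-voters. The list `votes` and the number of candidates are invented for illustration; they are not data from any real election.

```python
from collections import Counter

# Hypothetical votes V(j) for a population of 10 members and n = 3 candidates.
# The value 0 is the artificial category for members who did not vote.
votes = [1, 2, 2, 0, 3, 1, 2, 0, 2, 1]
n_candidates = 3

counts = Counter(votes)
population_size = len(votes)

# Report every category 0..n, even one that nobody happens to belong to,
# so that the categories are exhaustive by construction.
vote_distribution = {
    c: 100 * counts.get(c, 0) / population_size
    for c in range(n_candidates + 1)
}

print(vote_distribution)  # {0: 20.0, 1: 30.0, 2: 40.0, 3: 10.0}
```

Without category 0, the two non-voters would belong to no category at all, and the percentages would no longer account for the whole population.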
Technical Terminology: We say that the categories C1, C2, …, Cn are mutually exclusive if no member of the population can belong to more than one category – for example, “male” and “female” are mutually exclusive, but “Senior Citizen (age>65)” and “Tall (Height> 5’6”)” are not mutually exclusive. We say that categories C1, C2, …, Cn are exhaustive if every person in the population belongs to one of the categories. Again, the first two categories are exhaustive, but the second two are not.
GENERAL DEFINITION OF DISTRIBUTION: Suppose there are M members of the parent population P. Suppose V(j) is a characteristic function which assigns one of the characteristics 1,2,…,N to each member of the population j=1,2,…,M. Then the population can be divided into mutually exclusive and exhaustive categories C1, C2, …, CN by defining member j to belong to category Ci when V(j)=i. Then the distribution of the function V(.) is defined by the percentages of the population which fall into each category.
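One possible way to turn the general definition into code is a small helper that takes a population and a characteristic function and returns the percentages. The name `distribution` and the lookup table `V` below are assumptions made for this sketch, not part of the original text.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable


def distribution(
    members: Iterable, characteristic: Callable[[object], Hashable]
) -> Dict[Hashable, float]:
    """Return the percentage of members falling into each category."""
    # Count how many members the characteristic sends to each outcome.
    counts = Counter(characteristic(j) for j in members)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("population is empty")
    return {category: 100 * count / total for category, count in counts.items()}


# Hypothetical population of M = 8 members; V[j] is the characteristic of member j.
V = {1: 2, 2: 1, 3: 2, 4: 3, 5: 2, 6: 1, 7: 3, 8: 2}
dist = distribution(V.keys(), lambda j: V[j])

print(dist)  # {2: 50.0, 1: 25.0, 3: 25.0}
```

Because every member is sent to exactly one outcome, the categories produced this way are automatically mutually exclusive and exhaustive.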
Some further clarifications, examples, and explanations: Let p1, p2, …, pN be the percentages for each of the N categories in the definition above. Then each percentage is greater than or equal to zero, and the percentages must sum to 100%. As we will discuss more formally later, a percentage has a natural interpretation as a probability – when we make a random draw from the population, the probability of each category is exactly equal to the percentage of the population which belongs to that category. So the two laws of percentages are also ‘laws of probabilities’:
- For each of the categories Ci, the number of members belonging to the category must be greater than or equal to 0. It follows that the percentage of that category satisfies the same inequality: pi ≥ 0.
- Since the categories are mutually exclusive and exhaustive, the percentages sum to 100%: p1+p2+…+pN=100%.
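Both laws follow directly from counting. As a sketch, write |C_i| for the number of members in category C_i and M for the size of the population (the notation |C_i| is introduced here only for illustration):

```latex
p_i = 100\% \times \frac{|C_i|}{M} \ge 0,
\qquad
\sum_{i=1}^{N} p_i
  = \frac{100\%}{M} \sum_{i=1}^{N} |C_i|
  = \frac{100\%}{M} \times M
  = 100\%.
```

The middle step uses the fact that mutually exclusive and exhaustive categories count every member exactly once, so the category sizes add up to M.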
Example 1: Suppose the population of Pakistan is 200 million (200M), divided among four provinces: Punjab with 100M, Sind 50M, KPK 30M and Balochistan 20M. Consider the function P(j) which assigns to each member of the population the province to which he or she belongs. Then the distribution of the function P(j) corresponds to the categorization by province: (Punjab 50%, Sind 25%, KPK 15%, Balochistan 10%).
Size of Population Does Not Matter: It is important to note that the distribution only depends on the percentages, and not on the size of the population. So, if the population is only 200 people and the provincial numbers are proportionately reduced to be 100,50,30,20, the distribution by provincial category will be exactly the same as in the original population.
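A quick way to check this claim is to compute the percentages for both populations and compare them. The sketch below uses the illustrative round numbers from Example 1, not actual census figures; the helper name `percentages` is an assumption.

```python
def percentages(counts):
    """Convert category counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


# Illustrative provincial counts for a population of 200 million ...
large = {
    "Punjab": 100_000_000,
    "Sind": 50_000_000,
    "KPK": 30_000_000,
    "Balochistan": 20_000_000,
}
# ... and the same proportions in a population of only 200 people.
small = {"Punjab": 100, "Sind": 50, "KPK": 30, "Balochistan": 20}

print(percentages(large))
print(percentages(small))
print(percentages(large) == percentages(small))  # True
```

Both calls give Punjab 50%, Sind 25%, KPK 15% and Balochistan 10%, because only the ratios of the counts enter the calculation.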
Example 2: Suppose we have a population of 80 people, representing all ages from 1,2,3,…,80. So, person j is j years old, for j=1,2,…,80. For each member j, define the Age function to be A(j)= the age of the member j = j. What is the distribution of age in this population?
Here, we can consider each age from 1 to 80 as a separate category. In this case, the age distribution is that each age i is 1/80=0.0125=1.25% of the population. Depending on the purpose of our study, we can define the Age function differently, and thereby get different categories, and different distributions. For example, suppose we define A*(j)=5 for all j<=10, A*(j)=15 for 10<j<=20, A*(j)=25 for 20<j<=30, and so on. This creates 8 categories within the population. Category 5 is all ages from 1 up to 10, category 15 is ages from 11 to 20, and so on, up to category 75 which goes from 71 to 80. Then each of the 8 categories would contain 12.5% of the population, and this is the age distribution for this age function A*. We can also categorize age in different ways. For example, by defining the Age function differently, we could create categories like child (1-12), teenager (13-19), young adult (20-29), middle age (30-50), older (51-65), and senior citizen (66-80) for this population. The distribution would then change according to this categorization. It may be worth clarifying that it is not necessary to keep the ages in order. For example, the function E(j) could take the value 1 for all people with odd ages 1,3,5,… and the value 2 for all people with even ages 2,4,6,…. In this case we would have two categories of people, those with odd ages and those with even ages.
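To see how the choice of age function changes the distribution, here is a minimal Python sketch for the population of Example 2. The function names `a_star` and `parity` are hypothetical stand-ins for A* and E, and the arithmetic used to find the ten-year bin is just one convenient way to write the rule.

```python
from collections import Counter

ages = range(1, 81)  # one person of each age 1, 2, ..., 80


def a_star(age):
    """Map an age in 1..80 to the label of its ten-year bin: 5, 15, ..., 75."""
    return 10 * ((age - 1) // 10) + 5


def parity(age):
    """Return 1 for odd ages and 2 for even ages."""
    return 1 if age % 2 == 1 else 2


def distribution(values):
    """Percentage of the population in each category."""
    values = list(values)
    counts = Counter(values)
    return {k: 100 * c / len(values) for k, c in counts.items()}


by_decade = distribution(a_star(a) for a in ages)
by_parity = distribution(parity(a) for a in ages)

print(by_decade)  # eight categories, each with 12.5
print(by_parity)  # {1: 50.0, 2: 50.0}
```

The population is the same in both cases; only the characteristic function differs, and with it the categories and the distribution.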
Categorization can be chosen by the statistician: Depending on the purpose of the study, we can categorize the population in different ways. Some examples have already been given for the case of the age distribution. As another example, consider voters for candidates X1, …, Xn in a population of size N. One way to categorize was already given earlier, where voters for each candidate i are classified as category Ci, and non-voters get their own category C0. But we could also reduce the categories to only two by considering Voters and Non-Voters as the two categories. Or, we could focus on the first candidate, and create a category C of voters for X1, and a category C’ of all others, that is, all people who did not vote for X1. For each way of categorizing, we would get the corresponding distribution, according to the percentage of people who belong to each category.
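The same population of voters can be re-categorized in code by relabelling each outcome before counting. The sketch below assumes a small invented list of votes, with 0 standing for non-voters.

```python
from collections import Counter

# Hypothetical votes V(j); 0 means member j did not vote.
votes = [1, 2, 2, 0, 3, 1, 2, 0, 2, 1]


def distribution(values):
    """Percentage of the population in each category."""
    counts = Counter(values)
    return {k: 100 * c / len(values) for k, c in counts.items()}


# Categorization 1: Voters versus Non-Voters.
voted = ["Voter" if v != 0 else "Non-Voter" for v in votes]

# Categorization 2: voters for X1 (category C) versus everyone else (category C').
for_x1 = ["C" if v == 1 else "C'" for v in votes]

print(distribution(voted))   # {'Voter': 80.0, 'Non-Voter': 20.0}
print(distribution(for_x1))  # {'C': 30.0, "C'": 70.0}
```

Note that in the second categorization the non-voters are counted in C’, together with the voters for other candidates.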
Summary: The universe we live in is finite. All real populations are finite. We have only considered populations of people, as the word ‘population’ suggests. However, we can generalize the term to consider a population of all trees in a forest, for example. Any kind of object which can be counted can be considered, as long as each object is distinct, and can be counted separately. For example, the drops of water in the ocean would not be suitable as a target population, because of the difficulty of separating, identifying, and counting them. But drops of rainfall, or snowflakes, can be suitable objects for study as populations, in different contexts. The categorization we impose on the population will also depend on the reasons we have for studying the population, and can be arbitrarily chosen to suit our purpose.
Confidence Building Measures: There. Now we have understood the concept of a real distribution for a finite population. That wasn’t so difficult, was it? Every function F which defines a characteristic for each member of the population has a distribution. Note that the set of possible outcomes of F must be finite, because the population itself is finite. The distribution of the function F is just the percentage of members of the population which belong to each of the categories created by the outcomes of F. If all outcomes are different, then each member forms a category of size 1, and the distribution assigns 1/N to each category, where N is the size of the population. The more complex case arises if many members belong to the same category, for example, when F(j)=2 is true for all members of the population who voted for candidate 2. In this case, the percentage assigned to category 2 is the percentage of members who voted for candidate 2. The distribution of F is the collection of these percentages for all of the possible outcomes of F, each of which creates a category of people within the population P.
We learned how to calculate percentages, and to understand this concept, many years ago in school. Even though it is a very small step, understanding the concept of distribution of a category within a REAL and a FINITE population is a SOLID accomplishment. Understanding this is like understanding how to build a single brick. Once we know how to build bricks, we can create huge and fancy castles, even the Taj Mahal. We just have to put the bricks together. Just like this, all of the more complex concepts of distributions we have yet to learn are built from this simple concept that we have just mastered. This basic concept is the building block, and we just put together many of these building blocks to create more complex concepts. As we will discuss in the next post, the most serious obstacle to understanding distributions is created by self-doubt: the feeling in the student that these concepts are too complicated for me to understand. We will discuss these feelings in great detail in the next post. For now, I recommend watching my video lecture on “The Ways of the Eagles”. The linked post provides a detailed English outline of the lecture, while the video lecture itself is in Urdu. This was the first lecture in my course on Bayesian Econometrics, but the lecture has nothing to do with any kind of statistics. Instead it is all about building confidence, and learning to believe in our own abilities to master any kind of knowledge.
POSTSCRIPT: There are TWO next posts. The first followup is PP2: Building Confidence. This provides a brief summary of the concepts covered in this post, and then goes on to discuss PEDAGOGY: issues which arise in teaching and learning, especially of mathematics. One of the main barriers to learning mathematics is that it is taught in a way which is remote from any practical experience. This leads many students to feel frustrated, and lose confidence. This lack of confidence, the feeling that I cannot understand, is the greatest obstacle to learning. The second followup post is “Real Statistics: An Introduction”. This provides the background ideas which led to the creation of this course, which represents an entirely new approach to the subject of Probability and Statistics. These posts discuss pedagogy and philosophy respectively, but do not make progress on the statistics. The fourth post is “Histograms: Pictures of Distributions”, which actually takes the next step, explaining the concept of a distribution further, in a way that can be pictured and visualized, and that builds intuition.