Coarse and Fine Histograms
Our universe has amazing structure, which varies tremendously with the level at which it is examined. At the macroscopic level, the spiraling galaxies race outwards. Within a galaxy, spiral arms dance around the center. Sharpening the focus, planets circle our sun in regular orbits. At our human scale, the flora, fauna, forests, oceans, mountains, and glaciers dazzle our eyes. The microscope reveals amazing structures of miniature life forms, all bearing witness to the Creator, at whatever level of magnification we can achieve with our current technology. In this lecture, we will see that histograms can be made at different levels of magnification, and each level reveals different aspects of the data under examination.
Consider a population of 30 students, and their scores on the midterm and the final. This artificial data set, meant to illustrate certain concepts, is given below.
These are the scores on the midterm exam for 30 students, which range from 26 to 99. We start by making the finest histogram, where each student's score is directly represented. The students have been sorted and numbered in a sequence from 1 to 30, in order of increasing scores. Consider the function S(j), which assigns to student j his or her score on the midterm. The above table lists the value of the function (that is, the midterm score) for each student. A histogram for this function is as follows:
Note that there are a few scores which two students obtained. For example, students 4 and 5 both had 32, and this shows up as a spike with 2 points of data. Note also that EXCEL by default produces a histogram with a count of the data (1 and 2), and not with the percentages, which would be 1/30 = 3.3% and 2/30 = 6.7%, the values that would appear on the axis of a percentage histogram. To convert an EXCEL histogram to a percentage histogram, we must divide the labels on the vertical axis (which provide a COUNT of the population belonging to each category) by the total population. In the present example, since the total population of students is 30, the labels on the Y-axis must be divided by 30 to give the percentage of the population belonging to each category.
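The count-to-percentage conversion can be sketched in a few lines of Python. The scores below are illustrative stand-ins of my own; the lesson's actual 30-student data set is not reproduced here:

```python
from collections import Counter

# Illustrative scores only -- not the lesson's actual data set.
scores = [26, 32, 32, 47, 56, 99]

counts = Counter(scores)   # bin size 1: each distinct score is its own bin
total = len(scores)

# Divide each count by the population size to turn counts into percentages.
percentages = {score: 100 * n / total for score, n in counts.items()}

print(counts[32])        # 2 students scored 32
print(percentages[32])   # 2/6 of this small population, about 33.3%
```

This is exactly the division-by-the-population-size step described above, applied to every bar at once.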
Next we consider categorizing the scores by grades. Suppose that the grading is done as follows: F is 25 points or less. Above that, we have the grades E, D, C, B, A. Each of these grades has a 15-point range, with E going from 26 to 40, D from 41 to 55, C from 56 to 70, B from 71 to 85, and A from 86 to 100. Further subdivide each 15-point grade range into 5-point ranges, to create the grades A+, A, A- and so on. Each letter grade gets a + for the top five numbers and a - for the bottom five numbers in its 15-number range. The histogram for this way of categorizing the scores is given below. This histogram shows 3 students in the bottom categories of E- and E. Out of 30 students, this gives a percentage of 10%. One student has a score within 36-40, or E+, which is 1/30 = 3.33%. Combining the three categories E-, E, E+, we get 13.33% as the percentage of the population within them. The largest numbers of students fall within the categories D, D+, and C: these categories have 4 students each, for a percentage of 4/30 = 13.33%. The categories which have the largest percentage of the population belonging to them are called "modal" categories. In many cases, there is only one modal category, and in this case, this category is called the "mode" of the distribution. In the present example, the distribution is multi-modal, and has modes at D, D+, and C. Similarly, we can see that there are two students each in categories A+, A-, B, and B-, while there is one student each in categories E+, C-, B+, and A. The histogram provides us with a picture of the distribution of the grades. In the previous sentence, "distribution" is used simultaneously in both senses of the word, the everyday English sense and the statistical, technical sense; both meanings apply.
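The grading scheme just described can be written down precisely as a small function. This is a minimal sketch: the function name and layout are my own, but the cutoffs (F at 25 or below, then five 15-point bands split into -, plain, and + five-point sub-bands) follow the scheme above:

```python
def letter_grade(score):
    """Map an integer score to a letter grade with +/- modifiers:
    F is 25 or below; E through A each span 15 points, split into
    -, plain, and + five-point sub-ranges."""
    if score <= 25:
        return "F"
    letters = ["E", "D", "C", "B", "A"]
    offset = score - 26                 # 0 corresponds to the lowest E- score
    letter = letters[offset // 15]      # which 15-point band
    modifier = ["-", "", "+"][(offset % 15) // 5]  # which 5-point sub-band
    return letter + modifier

print(letter_grade(32))   # "E"  (31-35 is plain E)
print(letter_grade(86))   # "A-" (86-90)
print(letter_grade(100))  # "A+" (96-100)
```

Applying `letter_grade` to every score is exactly the 5-point categorization whose histogram is shown below.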
Before proceeding further, we introduce another technical term: “Bin” is just another name for what we have been calling category. For non-native speakers of English, the word may not be familiar. The British have an expression “Chuck it in the bin” meaning ‘throw it in the garbage’. The “Bin” is just a container (not necessarily for garbage), as in the picture below:
Each of the five garbage cans above is meant for a different category of garbage: paper, cans, glass, etc. Just like that, when we categorized the students' scores into A+, A, A-, B+ and so on, we created "bins" for the scores. The bin size in the above histogram is 5. The first histogram has bin size = 1. That is, each score is in a category by itself: the score 56 is one bin, and the score 57 is the next higher bin. We say that a categorization is coarse, or fine, according to the bin size. A size of 1 is the smallest possible bin size for integer grades (no half points). With this bin size we get the "finest" possible histogram. By increasing the bin size to 5, as in the second histogram, we get a histogram which is coarser. We can make it even coarser by forgetting about the distinctions between the + and -, and lumping all three types into one grade. That is, we consider B+, B, and B- together as just a "B". Then we have 5 categories, E, D, C, B, A, and the bin size for each category is 15. This creates the even coarser histogram pictured below:
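The coarsening operation itself can be sketched as a function of the bin size. The helper `bin_counts` and the sample scores are my own illustrative choices; I anchor the bins at 26, the lowest score in the lesson's data, and label each bin by its lowest score:

```python
from collections import Counter

def bin_counts(scores, bin_size, start=26):
    """Count how many scores fall in each bin of width `bin_size`,
    labelling each bin by its lowest score."""
    return Counter(start + ((s - start) // bin_size) * bin_size
                   for s in scores)

scores = [26, 32, 32, 47, 56, 71, 99]   # illustrative, not the lesson's data

print(bin_counts(scores, 1))    # finest histogram: each score its own bin
print(bin_counts(scores, 5))    # the +/- letter-grade bins
print(bin_counts(scores, 15))   # the coarse E, D, C, B, A bins
```

The same data passed through three different bin sizes produces the three histograms described in the text; only the grouping changes, never the underlying scores.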
This histogram provides us with a picture of the grade distribution of the class according to the grades E, D, C, B, A. Note that the modal grade is D, and this category (bin) has 8 students, so that the percentage is 8/30 = 26.7%, which is higher than the percentage of any other category. With this categorization, the distribution is unimodal. Note that the modes change as we change the bin size.
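Finding modal bins is just a matter of locating the largest count. This small helper (my own naming, with illustrative counts rather than the lesson's exact tallies) returns every bin tied for the maximum, so it handles both the unimodal and the multi-modal cases discussed above:

```python
from collections import Counter

def modes(counts):
    """Return every bin tied for the largest count (the modal bins)."""
    top = max(counts.values())
    return sorted(b for b, n in counts.items() if n == top)

# Illustrative grade counts, not the lesson's exact tallies.
coarse = Counter({"E": 4, "D": 8, "C": 6, "B": 5, "A": 4})
print(modes(coarse))   # ['D'] -- unimodal

fine = Counter({"D": 4, "D+": 4, "C": 4, "B": 2})
print(modes(fine))     # ['C', 'D', 'D+'] -- multi-modal
```

Running `modes` on the same data binned at different sizes makes the text's closing observation concrete: the modal bins depend on the binning, not just on the data.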
In previous versions of EXCEL, it used to be a difficult job to produce a histogram, but now this is a built-in chart type, which makes it very easy to produce. Here is a link to EXCEL Support which provides instructions on how to produce a histogram. A quick summary is as follows. Select the data for which you want the histogram. Then click on Insert Chart, and choose the chart type Histogram. A histogram with a default bin size, chosen by EXCEL according to a built-in algorithm, will be produced. To change the bin size, click on the X-axis. On the popup menu that appears, click on "Format Axis" (usually at the bottom of the menu choices). There you will see a set of options to choose either the bin size or the number of bins you want. Using this method, it is easy to produce histograms with different bin sizes. For the sake of illustration, here is a picture with bin size = 10; this is an intermediate bin size between the 5-point ranges, which categorize the scores into 5×3 = 15 categories (5 grades, with 3 types each), and the 15-point ranges based on the five letter grades. Note that the modal category here is (46,56]. As a parenthetical note, the round bracket (46 means that 46 is NOT included in this category, but all numbers ABOVE 46 are included. The square bracket 56] means that 56 IS INCLUDED in this bin.
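The half-open convention can be made precise with a small helper. This is a sketch under one assumption of mine: that the bin edges are anchored at 26, the lowest score, which is consistent with the (46,56] bin mentioned above:

```python
import math

def half_open_bin(score, bin_size, origin=26):
    """Return the half-open bin (lo, hi] containing `score`:
    the lower edge is excluded, the upper edge included."""
    k = math.ceil((score - origin) / bin_size)
    lo = origin + (k - 1) * bin_size
    return (lo, lo + bin_size)

print(half_open_bin(56, 10))   # (46, 56): 56 IS included in this bin
print(half_open_bin(46, 10))   # (36, 46): 46 falls in the bin below
```

The `math.ceil` call is what implements "lower edge excluded, upper edge included": a score sitting exactly on an edge is pushed down into the bin that ends there.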
The goal of this lesson was to learn how to build histograms going from fine bin size to coarse bin size, and to learn how to interpret these histograms. At each level of fineness or coarseness, the histogram picture provides us with new and different types of information about the data. It is worth noting that the finest histogram, which has a picture of all the data points separately on the graph, has ALL THE INFORMATION. But this is just like saying that the EXCEL table we started with already has all the information about the scores. The histograms cannot provide more information. The problem is that our brains are not built to process tables of numbers. So conversion to a graph is a device to help us get an intuitive grasp of the data, using pictures, which our brains are built to understand much better. The histogram is not an "objective" picture of external reality; it is an interface between our human brains and the complexities of real world data sets.
To complete our analysis, we add three more histograms which are of increasing levels of coarseness. The built-in histogram of EXCEL divides the data into three categories, with bin size = 25, as pictured below. This creates three bins, corresponding to the grade ranges of LOW (E-, E, E+, D-, D), MIDDLE (D+, C-, C, C+, B-) and HIGH (B, B+, A-, A, A+). Both LOW and MIDDLE are modal bins, since both contain the maximum number of students, with 11/30 = 36.67% of the population.
If we increase the bin-size to 36, we get only two bins, as pictured below. From this we learn that 18 students fell below the middle score of 63 while 12 students scored above 63.
The coarsest possible bin size is 75, where one bin covers ALL the scores. This gives us the following histogram, where all the data is lumped into a single category. This gives us just the RANGE of all the scores: all the scores fall within the range of 26 to 100.
This concludes our lesson on coarse and fine histograms. The goal of this lesson was to show how the same data can be sorted into different types of categories. Each way of sorting creates a different histogram. The finest histogram is the one with the largest number of categories, corresponding to bins of the smallest size. This provides a very sharp and accurate picture of the actual data. As we make the bin size larger, and the number of categories smaller, we get a broader perspective on the data, which reveals patterns not easily seen directly from looking at the data.
Concluding Remarks: We end this lesson with a discussion of two pedagogical principles. First, learning is by doing. The lesson above shows how I made the histograms. By writing up this lesson, I learnt how to make and interpret a histogram in EXCEL. If YOU want to learn this lesson, then you must do all the things that I did. Here is a link to the student scores data (artificial) that is discussed in the above lesson: 30Students.xlsx I have shared this file in Microsoft Office with edit permission. That means you can access it and edit it. Try not to destroy the data, and use your own workspace to make your own histograms, to match the ones above, so that you can learn how it is done. For greater interest and variety, I have also provided data on GDP per capita, in PPP (purchasing power parity) terms in constant 2005 US Dollars, for 188 countries. This data has been taken from the World Bank data set of World Development Indicators (WDI). Create histograms of this data with varying bin sizes, and study these histograms to learn about the distribution of this statistic for the 188 countries of the world for which data was available in 2010. Instructions on how to make histograms in EXCEL are given Here, and Here
The second principle is "The Forest and Tree Principle". Learning involves being able to see the micro-structure and the macro-structure TOGETHER. For example, each data point is a tree. The collectivity of all the data points is the forest. At the finest level, where each data point is separate, you can see all the trees, but it is difficult to see the forest. As you group the data into categories, a picture of how the collectivity of points behaves begins to emerge. Notice that the forest is an abstract concept: the collectivity exists in our minds, because we CREATE the groupings and the grades. To understand this grouping, we must look at how it functions in the real world. To understand any abstraction, we must translate it into a real world context. To understand a forest, we must look at the trees within the forest.

We are currently studying the abstract concept of the "distribution" of characteristics of a population. A distribution just counts the percentage of members within each category. This is still too abstract, but when we convert it to a histogram, it becomes a graph which is directly and intuitively understandable. Similarly, probability is normally explained with reference to an abstract set, a sample space, equipped with a measure. I have not mentioned these concepts, but our population is actually a sample space. The "FOREST" level abstract idea is being illustrated by a concrete example, scores of students, which is an object within the experience of students and very familiar and easy to understand. The probability measure on the sample space is just the percentage of elements within a subset. We have not formally introduced the abstract concepts, the forest level view. Instead, we have been working with the trees to get familiarity with how these abstract concepts work within a concrete and easy-to-understand framework.
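The "percentage of elements within a subset" idea can be made concrete in a couple of lines. The particular subset below is hypothetical, chosen only to illustrate the computation:

```python
# The 30 students form a finite sample space; any subset of them is an
# "event", and its probability is just the fraction of the population
# that the subset contains.
sample_space = set(range(1, 31))   # students numbered 1..30

event = set(range(5, 31))          # hypothetical subset: students 5..30
probability = len(event) / len(sample_space)

print(probability)                 # 26/30, about 0.867
```

Nothing abstract has been added: the "measure" of an event is the same percentage-of-population calculation used for every histogram bin in this lesson.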
When we later introduce abstract ways of looking at the same picture, the student will be able to understand the abstraction by referring it to the concrete case already known and familiar. This differs from standard practice, which defines probabilities on an abstract set first, making the subject impossible for most students to understand.
POSTSCRIPT: This is the FIFTH post in a sequence of posts explaining the concept of a "Distribution". The first post is Understanding Statistical Distributions, which gives the basic definitions and ideas. The second post is PP2: Building Confidence. This provides a brief summary of the concepts covered in this post, and then goes on to discuss PEDAGOGY: issues which arise in teaching and learning, especially in mathematics. One of the main barriers to learning mathematics is that it is taught in a way which is remote from any practical experience. This leads many students to feel frustrated and lose confidence. This lack of confidence, the feeling that "I cannot understand", is the greatest obstacle to learning. The third post is "Real Statistics: An Introduction". This provides the background ideas which led to the creation of this course, which represents an entirely new approach to the subject of Probability and Statistics. These posts discuss pedagogy and philosophy respectively, but do not make progress on the statistics. The fourth post is "Histograms: Pictures of Distributions", which actually takes the next step, explaining the concept of a distribution further, in a way that can be pictured and visualized, building intuition. This post explains the concept of histograms further, by showing how bin sizes can be varied to create different histograms for the same data set.