The average Joe (isn’t the average you are thinking of)

Pedro Prado
Jul 3, 2021


We often overlook the importance of knowing about the median and the mode. However, a great deal of the time these are the values we are actually thinking about instead of the mean.

I took the title from a talk by Cassie Kozyrkov, Chief Decision Scientist at Google. Cassie presents interesting insights about statistics now and then, and once she mentioned “the average Joe” (which, the way we understand the expression, should be “the modal Joe”), I felt it was the perfect setting to discuss the importance of concepts such as the median and the mode.

In the family of the most common “averages”, we have the mean, the median and the mode. Some people have used or heard about another one called the harmonic mean, but it deserves an article of its own. In the end, the mean is the most beloved and the other two are like forgotten twins.

Group Portrait of Three Brothers by Thomas de Keyser. People had weird proportions back then!

There are many reasons why “average” is so often assumed to be the mean rather than the median or the mode. Most come down to two things: (1) the median and the mode, along with related quantities such as confidence intervals, are harder to calculate by hand, and (2) they are less straightforward to understand than the mean. I have also heard too many times that “it’s a question of legacy”. That is understandable: studies published up to the mid-1950s didn’t have the computational power we have now to chew through data and get those different averages in a single click, so most statistical tools were developed around the mean (there are some interesting non-parametric statistics, though, which could branch into another article in the future).

The median suffers from both of the causes indicated above. It is overlooked because it is cumbersome to calculate by hand: “How will I put 6 billion people in line from youngest to oldest in order to calculate the median age of the world population?”. In fact, even with a computer, the task was not an easy one at first. Back in the 80s, when personal computers were in their infancy, running a sorting algorithm to put numbers in order was still not trivial. So, most of the time, education tends to overlook the proper use of the median, and it stays where it was first presented: in the books. However, it has an interesting feature: outliers have little influence on it. It can therefore represent data containing outliers fairly well, and you can still keep those outliers in your dataset.

Sidenote: Remember that outliers should not be removed unless you have a good reason to. They could potentially reveal something about your model down the road.
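To make the "put everyone in line" idea concrete, here is a minimal sketch of the sort-and-pick-the-middle procedure. The function name `median_by_sorting` and the ages are made up for illustration. In practice Python's built-in `statistics.median` does the same job.

```python
def median_by_sorting(values):
    """Return the median by sorting and picking the middle element(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element is the median.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


ages = [34, 29, 41, 25, 38]        # hypothetical ages
ages_with_outlier = ages + [112]   # same data plus one extreme value

median_ages = median_by_sorting(ages)
median_with_outlier = median_by_sorting(ages_with_outlier)

print(median_ages)          # 34
print(median_with_outlier)  # 36.0
```

Adding a 112-year-old to the line only nudges the median from 34 to 36, which is the "little influence" mentioned above.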

Am I left out? What am I? Where do we all go after this?!

The mode is another one whose meaning people tend to forget, even though it has the simplest concept of all the members of the “average” family. It is simply the most frequent value in a distribution. This means that you need to know what the distribution looks like, or count the frequency of every observation and do some kind of sorting to find the most frequent one. Hence it also fits in the first category: it is harder to calculate by hand than the mean. Outliers generally won’t be as frequent as “regular” data, but it’s good to check the distribution just in case (the possibility exists, even though that scenario would result in a weird distribution).
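The counting procedure described above can be sketched in a few lines of Python. This is only an illustration: the helper name `modes` and the list of votes are invented, and the helper returns every value tied for most frequent, since a distribution can have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest frequency, in first-seen order."""
    counts = Counter(values)
    if not counts:
        raise ValueError("mode of an empty sequence is undefined")
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]


# Hypothetical answers to "which map do you prefer?"
votes = ["dust2", "mirage", "dust2", "inferno", "dust2", "mirage"]

map_modes = modes(votes)
print(map_modes)  # ['dust2']
```

Note that the mode works on categories such as map names, where a mean or a median would not even be defined.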

If we asked CS players which map they prefer, dust2 would probably be the mode of that distribution of preferences.

The (arithmetic) mean is the mode (yep, we said that), i.e. the most frequent type of average, when people talk about averages. It’s easy to calculate: just sum everybody, divide by the number of dudes and voilà. Everybody knows how an average is calculated. We do this instinctively most of the time with everything, from estimating the milk we should buy in the grocery store to the amount of sleep we should get daily.

Danger alert: this kind of average is heavily influenced by outliers, but nobody seems to remember this very important feature.
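A quick way to see this sensitivity is to compute the mean and the median of the same data before and after adding a single extreme value. The salaries below are made-up numbers, chosen only to illustrate the effect. `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

salaries = [30, 32, 35, 38, 40]    # hypothetical salaries, in thousands
with_outlier = salaries + [1000]   # one extreme earner joins the group

mean_before = mean(salaries)
median_before = median(salaries)
mean_after = mean(with_outlier)
median_after = median(with_outlier)

print(mean_before, median_before)           # 35 35
print(round(mean_after, 1), median_after)   # 195.8 36.5
```

One extra observation drags the mean from 35 to almost 196, a value that describes nobody in the group, while the median barely moves.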

Mean’s dirty little secret. The All-American Rejects could have included this one; I am sure it would have kept them relevant to this day (not really).

After this brief but visual explanation showcasing the two missing brothers of the mean, I hope I have convinced you of two things. First, we should be a little more specific about which kind of average we are talking about when we say the “average Joe”. Second, the average we mean (sorry) is the mode, so we are actually saying the “modal Joe”: the most common/frequent person. Not the “median Joe”, nor the “mean Joe” (which could actually have a double meaning).

Pizza, TV and moustache. Are those fitting features of our modal Joe, or are we being biased?


Pedro Prado

Identity crisis between a Data Scientist and Data Engineer