Advice for my Brother: How to think about Statistics

In the fall semester of 2020, I was well on my way through the Information Science and Data Science program at the University of South Florida. At the same time, my twin brother, Stuart, was getting into some of the advanced classes in Mechanical Engineering and had to take college-level statistics for the first time. Over the course of the semester he came to me with questions, since I had taken a similar course in the spring and I am also a data scientist.

This blog was originally started as a place for my homework and as a portfolio of projects. However, I know that my brother will always have questions about statistics and data science in general, so the plan is to write a blog series giving him advice on various data science concepts. If he asks me any questions about what I do outside of statistics, I will write about that as well.

Statistics is not Math

To begin this series, I would like to discuss a key point that helped me and that I think will help you understand statistics.

Statistics is not math, it is language.

Someone, somewhere, sometime

I will admit that mathematics is the foundation of statistics, and every statistical model is based in math, but that is not the core of statistics. If you think about it, statistics is a tool used by scientists to describe, interpret, and communicate, in a formalized manner, the world that they observe. In a sense, that is just language.

Of course, just spouting out that statistics is a language will help no one. To me, at least as a data scientist, it is an easy thing to say but much harder to explain, because it is such an abstract concept that goes against the common teaching. While I may have found success in studying information and data science, I am a stubborn idiot and will attempt to break down the concept of statistics as a language anyway.

Vocabulary for Patterns

To begin, most people who refer to statistics as a language are focusing on the communication of the results of statistical analysis. The idea is that there are formulas one can use, and the work focuses on assigning some meaning to their results. This is what I would refer to as Vocabulary for Patterns.

The results from these formulas are all just ways of describing patterns. Each formula or concept in statistics has its own conditions and can be used to describe different patterns. For example, normally distributed data has a mean and a standard deviation. Alone, neither describes the data fully, but together they let one picture where the center of the data lies and how spread out the data is.
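As a small sketch of that idea, here is a Python example using the standard library's `statistics` module. The two samples and the `describe` helper are made up for illustration. Both samples have the same mean but very different spreads, which is why neither number is enough alone.

```python
import statistics

# Hypothetical samples, invented for illustration only
tight = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]


def describe(sample):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return statistics.mean(sample), statistics.stdev(sample)


tight_mean, tight_sd = describe(tight)
wide_mean, wide_sd = describe(wide)

# Same center, different spread: the mean alone cannot tell these apart
print(f"tight: mean={tight_mean}, sd={tight_sd:.2f}")
print(f"wide:  mean={wide_mean}, sd={wide_sd:.2f}")
```

The mean is the word for "where the center is" and the standard deviation is the word for "how spread out it is." You need both words to describe the pattern.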

Adding Meaning

Much of the discussion of statistics as a language revolves around the idea that analysis is performed and then communicated. However, that seems to be missing something that makes statistics a language instead of just math. Let us dig a little deeper.

Something that might help is a concept from information science: the hierarchy of information. Some people add their own spin to it, but it generally breaks down as follows:

  1. Data: At the bottom of the stack. This is any kind of unprocessed bit that is nearly meaningless. Examples: Red, 4, 27º 35.876’, rough, etc. On its own it is meaningless, but when combined with other data it becomes information.
  2. Information: Pieces of data gathered together to provide context for the other pieces of data. Examples: an intersection at a set of GPS coordinates; the stoplight is red; a yellow car is traveling north at 35 miles per hour.
  3. Knowledge: Pieces of information are given context by other pieces of information to create knowledge. Example: The stoplight ahead of me is red.
  4. Wisdom: Knowledge does not necessarily combine with other knowledge to create wisdom, and wisdom is often mistaken for knowledge. The key difference is that some additional meaning is added. Example: The stoplight ahead of me is red, so I should stop because I do not want to be hit by oncoming traffic.

One might think statistics sits at the bottom of this hierarchy. However, I argue that when performing statistical analysis, one goes through every level of it. Statistics in practice begins with data, which becomes information when gathered together. The analysis of that information can become knowledge, and when that knowledge is given further context, it turns into wisdom.

But how does that relate to language? Language is the tool to give something meaning, and every level of the information hierarchy is about adding meaning. It is impossible to climb the information hierarchy without the use of language.

Similarly, statistical analysis on its own only produces more data. If I take some raw data and find the mean, I just have a number. It can be turned into information or knowledge only through the addition of meaning. The main focus in a lot of statistics is on the communication of that meaning.
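To make that concrete, here is a minimal Python sketch. The speed readings and the `add_meaning` helper are hypothetical, invented for this example. The first step produces a bare number; the second step attaches the context that turns it into information.

```python
import statistics

# Hypothetical speed readings, invented for illustration only
readings = [33, 35, 36, 34, 37]

# Step 1: the analysis itself just produces another piece of data
average = statistics.mean(readings)


# Step 2: meaning is added by attaching context (what, where, which units)
def add_meaning(value, quantity, context, unit):
    """Wrap a bare number in the words that say what it measures."""
    return f"The {quantity} {context} is {value} {unit}."


sentence = add_meaning(
    average, "average speed", "of cars at the intersection", "miles per hour"
)

print(average)   # a bare number: data
print(sentence)  # the same number with context: information
```

The math stops at step 1. Everything after that is language.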

Closing

When working with statistics, the way you say things changes their meaning and can thus lead you to different results. Two ways of saying something may carry the same connotation, yet their meanings will be different. Some professors will try to catch you with how things are worded in their questions. The best way to overcome this is to break down what is being said or asked in a question.

I hope that this post is helpful. The plan is to write more about statistics soon; I will start with the basics and slowly work up to more advanced concepts. If you have any questions, feel free to leave a comment below.