Top 4 Statistics Pitfalls and One Super Easy Solution
In my wanderings from education, to finance, to password protection, I’ve uncovered a few common pitfalls that I have tried (and occasionally failed) to avoid when dealing with statistics. Here is my list of the top 4 mistakes and at the end, there is one super easy way to help you avoid the pitfalls!
1. Assuming you have an adequate sample size… when you don’t.
This is probably the biggest pitfall that I’ve witnessed (even among professional statisticians!). Let’s be real, we all make assumptions based on heuristics (you know, those rules of thumb that we internally create from our experiences), and one of those is that the numbers we are looking at are based on a lot of instances. But is this the case? How big is big enough? This is one of those hard-to-determine assumptions, and I don’t really have an answer for you. But with more data points, you have more ammunition to determine whether or not your data makes sense. If you have a sample of 100 students suggesting that 66% of them like to eat chocolate cake, and another sample of 10,000 students suggesting that only 32% do, the larger sample will tend to be closer to the truth (assuming both samples were drawn the same way), law of large numbers and all that.
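To put a rough number on "how big is big enough," here is a minimal sketch that estimates the 95% margin of error for the two chocolate cake samples above. It assumes both are simple random samples and uses the standard normal approximation for a proportion; the function name margin_of_error is made up for illustration.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

small = margin_of_error(0.66, 100)      # 100 students, 66% like cake
large = margin_of_error(0.32, 10_000)   # 10,000 students, 32% like cake

print(f"n=100:    66% +/- {small * 100:.1f} points")
print(f"n=10000:  32% +/- {large * 100:.1f} points")
```

Under those assumptions, the small sample is uncertain by roughly 9 percentage points either way, while the large one is uncertain by less than 1. That is the law of large numbers showing up in practice.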
2. Assuming your population is evenly distributed.
Hey everyone, I just built this awesome report that shows me that 85% of all our students who have 17 infractions have blue hair! Awesome! My question to you: how many of your students have blue hair? If 85% or so of your whole student population has blue hair, then you should expect just about any subgroup you pull to follow that trend pretty closely, and the report isn’t telling you anything new. Or another one: Hey, I just checked to see which zip code occurs the most often (the statistical mode) for students who have 5 or more absences! Great! But how many of your students live in that zip code? If a majority of your students live in that zip code, it is likely that the trend will follow suit, and it doesn’t mean much. Be aware of how your data is distributed because you don’t want to make an erroneous assumption. The exception? If you are EXPECTING the mode to follow the trend and then it doesn’t (I have fewer SPED students than non-SPED students, but the student population with the most absences IS SPED students). But then you get to the next pitfall…
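One quick way to check the blue hair question is to compare the share inside the subgroup against the share in the whole population. The sketch below does that with made-up counts (the 20, 850, and 100 are hypothetical, chosen only to match the 85% in the example); the function name over_representation is also just illustrative.

```python
def over_representation(group_with_trait, group_size, pop_with_trait, pop_size):
    """Ratio of a trait's share in a subgroup to its share in the whole population.

    A value near 1.0 means the subgroup simply mirrors the population.
    """
    group_share = group_with_trait / group_size
    base_rate = pop_with_trait / pop_size
    return group_share / base_rate

# Hypothetical: 20 students have 17 infractions, and 17 of them (85%) have blue hair.
# Case A: 850 of 1,000 students overall have blue hair.
ratio_a = over_representation(17, 20, 850, 1000)
# Case B: only 100 of 1,000 students overall have blue hair.
ratio_b = over_representation(17, 20, 100, 1000)

print(f"Case A ratio: {ratio_a:.1f}")  # subgroup mirrors the population
print(f"Case B ratio: {ratio_b:.1f}")  # subgroup is heavily over-represented
```

In Case A the 85% is just the population showing through. In Case B the same 85% is worth a second look.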
3. Assuming that a correlation exists without confirmation.
Humans love patterns. In fact, I even read somewhere (probably Wikipedia, so you may need to do your own research) that we are highly adapted to finding patterns because it helped us survive way back when. This is great and works well with data too! The problem? We tend to assume that a trend has meaning without confirming that the meaning is legitimate. Once we have found some type of pattern between 2 disparate data sets, many armchair statisticians (including me) jump to the conclusion that data set one predicts data set two. Using one data set to predict another is the idea behind predictive analytics, which is great, but it only works if a meaningful correlation has already been established within a certain margin of error. If the numbers merely move together with no real relationship behind them, it is labeled a “spurious correlation”. This is probably best illustrated with an example. Over the last decade, it was found that as more chicken was consumed per capita, the US imported more barrels of oil. Therefore, in order for the US to import fewer barrels of oil, we need to consume less chicken (stolen shamelessly from Tyler Vigen). Of course, this is nonsense. The two numbers may move together, but there is no meaningful relationship between them. We are presupposing that one exists, and so to predict the forward movement of oil, we assume that we need to eat less chicken. Bringing it back to education, just because the reading scale score on your benchmark is high for all 5th graders who scored in the top performance level on your state assessment (and the scale score decreases as the performance band decreases) doesn’t mean that the scale score is a good predictor of that level. You have to demonstrate that a meaningful correlation actually exists between the benchmark and the state assessment before you can use the data as a predictor. It could just be that students who take one test well take all tests well.
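Here is a small sketch of how two unrelated numbers can look tightly linked just because both rise over time. The figures are made up for illustration (they are NOT Tyler Vigen's actual chicken and oil data) and were constructed so that the year-to-year changes are unrelated. Comparing changes instead of raw levels is one common sanity check for trending series, not a proof of anything; the helper names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def yearly_changes(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

# Made-up yearly figures; both series simply trend upward.
chicken = [50, 51, 54, 55, 58, 59, 62, 63, 66]
oil = [3.0, 3.3, 3.6, 3.7, 3.8, 4.1, 4.4, 4.5, 4.6]

r_levels = pearson_r(chicken, oil)
r_changes = pearson_r(yearly_changes(chicken), yearly_changes(oil))

print(f"Correlation of raw levels:     {r_levels:.2f}")
print(f"Correlation of yearly changes: {r_changes:.2f}")
```

With these made-up numbers, the raw levels correlate very strongly while the yearly changes show essentially no correlation. The shared upward trend is doing all the work.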
4. Assuming that the curve is normal.
First of all, what kind of curve am I talking about? And I don’t mean a curve ball. I’m talking about a distribution curve. In statistics, there is a type of model called a Gaussian Distribution, or a Normal Distribution. This type of model creates the super famous “Bell Curve.” That one that all the college kids complained was skewed because you “messed it up” with your 95%, you overachiever you. But is data “Normal”? One of my favorite statisticians is Nassim Nicholas Taleb, who argued that while a few things may fit neatly on a bell curve (things like height, weight, calorie consumption), complex systems involving human action do not, and that those can have super extreme values at the ends of the distribution curve (called “tails”) that completely skew your average. Probably the easiest way to think of this is with income. If you take 1,000 people’s annual salaries, all of whom are principals, you would probably get a bell curve. Some would have a higher income, some a lower, but most would be in around the same spot, and you could find a nice average salary and add in a few Standard Deviations to get some maths accomplished. However, let’s have 999 principals and add in one Warren Buffett. All of a sudden, that one “extreme” event has skewed the entire population, and the average income of my group just shot up to $25+ million annually. That’s a pretty awesome salary for anyone, sign me up! So… when looking at student data, try to remember that there could be some data out there that has such a large impact that it skews the average of the whole set. This is especially true if your data set is limited by population size or some other factor.
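The principal example is easy to try for yourself. The sketch below uses hypothetical salaries for 999 principals plus one assumed extreme income of $25 billion (a figure picked only so the group average lands in the "$25+ million" range), then compares the mean with the median and the min/max.

```python
import statistics

# Hypothetical salaries for 999 principals, spread evenly from $80,000 to about $120,000.
principals = [80_000 + 40 * i for i in range(999)]

# One hypothetical extreme income (an assumed figure, not a real one).
outlier = 25_000_000_000

group = principals + [outlier]

mean_without = statistics.mean(principals)
mean_with = statistics.mean(group)
median_with = statistics.median(group)

print(f"Mean, principals only: ${mean_without:,.0f}")
print(f"Mean with outlier:     ${mean_with:,.0f}")
print(f"Median with outlier:   ${median_with:,.0f}")
print(f"Min / max:             ${min(group):,} / ${max(group):,}")
```

The mean jumps from about $100,000 to over $25 million, while the median barely moves. A big gap between mean and median, or a max that is wildly out of line with everything else, is a cheap hint that one extreme value is driving your average.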
There is a simple and easy remedy to all 4 pitfalls. Ready? The remedy is: Know Your Data (KYD). Exactly! Super easy, right? Knowing your data (the size, the distribution, the outliers, etc.) can really help you avoid drawing bad conclusions from the data set. Population size too small? Instead of using a metric like STAR PR (which is local only), consider using a metric on a broader scale (like NCE). Is your population not evenly distributed? Consider converting raw counts into percentages of the population in order to identify trends. Seeing an interesting pattern and want to confirm a correlation? Use Program Analysis to help you determine that. Averages for your metrics look a little off? Check the min/max and other data pieces to tell if an outlier is skewing the data. If it is, you may need to reconsider what you are testing!
That’s all for now, statisticians! Good luck, be safe, and watch out, Richard Walter pops up when you least expect him….