Few years ago an article on CNBC defined data analysis “The Sexiest Job of the 21st Century”. I wouldn’t be so optimistic, but certainly it is becoming more and more demanded by the day. There is high demand for big data professionals, ETL specialists, data architects and statisticians. Traditionally, and until some time ago, database developers and administrators were considered as far from “sexy” as it could possibly be; nowadays, as the article recalls, “There are a lot of folks who can spell Hadoop and put it on their résumé and call themselves data scientists”.
Don’t believe me, just watch: The Best Jobs of 2016, Statistician is ranked #2, only surpassed by Data Scientist at #1.
The first non-IT related job in the list is Audiologist, #4.
This article is an attempt to fill the gaps between what we historically know as data manipulation language (DML) and the basic concepts of Statistics. If you google “what does a statistician do”, the answer looks very similar to what I do day in and day out, except for the *statistical method* part.
So my suggestion to all the SQL people out there, is to become acquainted to the terms and functions mostly used in Statistics, so that you don’t need to install and learn R just to understand if a column in the table is skewed towards the left. I have friends and colleagues who are statisticians whom I respect but often misunderstand, and thus I regard them as some sort of alchemists. So this article aims to reduce the distance between the “Stats Guy” and the “DB Guy”, the target audience being a person who loves data, experienced in looking at it as a collection of tables and columns, and who wants to get the best out of it without having to learn a statistical software package.
More specifically, being a MS SQL Server developer, I’ll try to explain some concepts in MS SQL terms. Other RDBMS can have slightly different syntax or lack some functionalities but the concepts are applicable to most of them.
Having said that, imagine a dialogue like this in an unspecified company of an unspecified sector…
Statistician:Ok, so: how is the data distributed?
DB admin:I receive a CSV every week, then I email a PDF to all managers
Statistician:I know that, but I asked about the distribution of the data, is it left-skewed, right-skewed…
DB admin:Actually it’s…ehm…from the union workers…so I guess a little…leftist?
Statistician:Oh, please, be serious: did you make an histogram already?
DB admin:SQL Server does that automatically, with DBCC…so I don’t need to
Statistician:Ok. Guess I’d better ask Debbie C.C. then. What about deviation? Did you observe…
DB admin:Well that’s rather intimate…I have a…Batman outfit I like to wear but my gir…
oh! OH! NO,..sorry: you want STDEVP(), right? Oh! I just…sorry…for a second…
Statistician:You’re a mean person, you know it?
DB admin:No…I’m just an average guy. If you mean MEAN, ask Fred, the Mongo guy, next cubicle.
In order to smooth relationships, and make the world a better place, here is a quick and dirty series of functions and calculations to help the DB admin better serve the Statistician.
All formulas have been double checked against Excel and R, so I’m quite confident that calculations are reliable and ready to use.
Prepare the data
Let’s start creating a table of numbers. For the purpose of this article we don’t know where they came from or what they represent, from a DB perspective they are just another table with several numerical columns.
In all examples I will use the column Y1 and format the output as a DECIMAL(38,20). This is not strictly necessary but useful indeed when dealing with very little numbers or when you need to check the results against other tools like R.
Mean, Median, Mode
These are fairly simple and probably the most basic terms in Stats, won’t spend much time on them.
[Median](https://en.wikipedia.org/wiki/Median) is the number in the middle when you sort the results, if there are two numbers in the middle then the average of the two. ORDER the query BY the value ASC of the column, if the number of rows n is odd pick the value in row (n/2)+1, otherwise get the average of rows number (n/2) and (n/2)+1
COUNT() the occurrences of each value, ORDER BY the count DESC and pick the first row
‘Mode’ AS ‘Descriptive Statistics’
,CAST((SELECT TOP 1
GROUP BY Y1
ORDER BY COUNT(Y1) DESC)
AS DECIMAL(38, 20)) AS Y1
Variance, Standard Deviation, Standard Error
Having calculated the Mean (AVG) of a set (column), the next step is verifying if the numbers in the set are close to each other or they are very sparse.
[Variance](https://en.wikipedia.org/wiki/Variance) is a measure of the distance between the numbers in the set and the Mean. Technically this is done by subtracting the Mean from each value, obtaining the squares and averaging all of them. Easier done that said, there is an aggregate function for that.
Please note that there are two versions of the Variance and Standard Deviation functions: VAR, VARP and STDEV, STDEVP. See this article for details https://en.wikipedia.org/wiki/Bessel%27s_correction, but generally speaking, we use the P version when we are sure that the data corresponds to the entire “population” i.e. we are sure we are using all the numbers we can possibly have. This makes more sense is statistics, as most of the times the numbers represent only a fraction of the population (hence sample).
,CAST(STDEV(Y1) / (SQRT(COUNT(Y1))) AS DECIMAL(38, 20)) AS Y1
A nice way of visualizing how the data is distributed (spread), how close the numbers are and if there are any strange values too far from the rest (outliers) that can tamper the aggregate functions results. Given a set of numbers, I want to have n containers, say 20 bins. Each bin has a min and max boundaries and I put the numbers that fall between the boundaries in the corresponding bin. The difference between the max and min of each container is the same and the bins -goes without saying- must be contiguous.
Look at the picture above, it was produced with the R instruction:
hist(Y1, breaks=20, labels=TRUE))
By the way, R only plots 19 bins ignoring my proposal of 20, it probably realized I’m not a Stats Guy…
set a variable to the number of desired bins,
get the min and max of the entire column (had to round those to mimic R behaviour),
calculate the binsize with the (difference between max and min / number of bins), use the next integer value if decimal
subtract the min from each value and the divide the result by the binsize, use the next integer value as bin number
COUNT all values GROUP BY bin number
DECLARE @bins INT = 20;
CAST(FLOOR(MIN(Y1)) AS DECIMAL(38, 20)) AS minimum
,CAST(CEILING(MAX(Y1)) AS DECIMAL(38, 20)) AS maximum
Looking at the density of the values (imagine we draw a line chart instead of bars in the histogram) we can tell from the line if the values are concentrated more to the right or to the left of the Mean (blue vertical line).
According to this image, we say the distribution is skewed to the right, looks like the right slope is less steep than the left one.
[Kurtosis](https://en.wikipedia.org/wiki/Kurtosis) is the number indicating if the shape of the distribution line is fat and wide or tall and thin. It has a name that sounds like a skin imperfection but actually is a measure of the obesity of your chart (sorry…). Negative numbers = the curve of the line is low and wide, positive numbers = the curve is high and thin. When is zero ? Congratulations, you have a perfect Gaussian shape, whatever that means.
I’d rather not say this was an exhaustive article, nor an introduction to Stats, it was just a teaser…
I believe you have now the basis to resist a two and a half minutes conversation with a Statistician; but if you liked what I’ve written there’s plenty more where that came from. I suggest enrolling in any of the publicly available MOOC on the web, lots of good contents out there.
If you fancy buying a book, you may want to consider one of these two to begin with: