
Statistics 101 for DB Dummies
SQL Cheat sheet for The “AVERAGE” DB Guy
Few years ago an article on CNBC defined data analysis “The Sexiest Job of the 21st Century”. I wouldn’t be so optimistic, but certainly it is becoming more and more demanded by the day. There is high demand for big data professionals, ETL specialists, data architects and statisticians. Traditionally, and until some time ago, database developers and administrators were considered as far from “sexy” as it could possibly be; nowadays, as the article recalls, “There are a lot of folks who can spell Hadoop and put it on their résumé and call themselves data scientists”.
Don’t believe me, just watch: The Best Jobs of 2016, Statistician is ranked #2, only surpassed by Data Scientist at #1.
The first non-IT related job in the list is Audiologist, #4.
This article is an attempt to fill the gaps between what we historically know as data manipulation language (DML) and the basic concepts of Statistics. If you google “what does a statistician do”, the answer looks very similar to what I do day in and day out, except for the *statistical method* part.
So my suggestion to all the SQL people out there, is to become acquainted to the terms and functions mostly used in Statistics, so that you don’t need to install and learn R just to understand if a column in the table is skewed towards the left. I have friends and colleagues who are statisticians whom I respect but often misunderstand, and thus I regard them as some sort of alchemists. So this article aims to reduce the distance between the “Stats Guy” and the “DB Guy”, the target audience being a person who loves data, experienced in looking at it as a collection of tables and columns, and who wants to get the best out of it without having to learn a statistical software package.
More specifically, being a MS SQL Server developer, I’ll try to explain some concepts in MS SQL terms. Other RDBMS can have slightly different syntax or lack some functionalities but the concepts are applicable to most of them.
Having said that, imagine a dialogue like this in an unspecified company of an unspecified sector…
Statistician: Ok, so: how is the data distributed?
DB admin: I receive a CSV every week, then I email a PDF to all managers
Statistician: I know that, but I asked about the distribution of the data, is it left-skewed, right-skewed…
DB admin: Actually it’s…ehm…from the union workers…so I guess a little…leftist?
Statistician: Oh, please, be serious: did you make an histogram already?
DB admin: SQL Server does that automatically, with DBCC…so I don’t need to
Statistician: Ok. Guess I’d better ask Debbie C.C. then. What about deviation? Did you observe…
DB admin: Well that’s rather intimate…I have a…Batman outfit I like to wear but my gir…
oh! OH! NO,..sorry: you want STDEVP(), right? Oh! I just…sorry…for a second…
Statistician: You’re a mean person, you know it?
DB admin: No…I’m just an average guy. If you mean MEAN, ask Fred, the Mongo guy, next cubicle.
In order to smooth relationships, and make the world a better place, here is a quick and dirty series of functions and calculations to help the DB admin better serve the Statistician.
/*
All formulas have been double checked against Excel and R, so I’m quite confident that calculations are reliable and ready to use.
*/
Prepare the data
Let’s start creating a table of numbers. For the purpose of this article we don’t know where they came from or what they represent, from a DB perspective they are just another table with several numerical columns.
Download this script and execute it
In all examples I will use the column Y1 and format the output as a DECIMAL(38,20). This is not strictly necessary but useful indeed when dealing with very little numbers or when you need to check the results against other tools like R.
Mean, Median, Mode
These are fairly simple and probably the most basic terms in Stats, won’t spend much time on them.
Mean
[Mean](https://en.wikipedia.org/wiki/Arithmetic_mean) is what we call average, there is an aggregate function AVG() we know too well.
SELECT
‘Mean’ AS ‘Descriptive Statistics’
,CAST(AVG(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
Median
[Median](https://en.wikipedia.org/wiki/Median) is the number in the middle when you sort the results, if there are two numbers in the middle then the average of the two. ORDER the query BY the value ASC of the column, if the number of rows n is odd pick the value in row (n/2)+1, otherwise get the average of rows number (n/2) and (n/2)+1
WITH S1
AS
(SELECT
COUNT(Y1) / 2 c
,COUNT(Y1) % 2 r
FROM lab.ENB2012)
SELECT
‘Median’ AS ‘Descriptive Statistics’
,CAST(CASE
WHEN r = 0 THEN (SELECT
AVG(t1.Y1)
FROM (SELECT
ROW_NUMBER() OVER (ORDER BY Y1 ASC) N
,Y1
FROM lab.ENB2012) t1
WHERE n IN (c, c + 1))
WHEN r = 1 THEN (SELECT
t1.Y1
FROM (SELECT
ROW_NUMBER() OVER (ORDER BY Y1 ASC) N
,Y1
FROM lab.ENB2012) t1
WHERE n IN (c + 1))
END AS DECIMAL(38, 20)) AS Y1
FROM S1
Mode
[Mode](https://en.wikipedia.org/wiki/Mode_(statistics)) is the number that appears most frequently in a result set.
COUNT() the occurrences of each value, ORDER BY the count DESC and pick the first row
SELECT
‘Mode’ AS ‘Descriptive Statistics’
,CAST((SELECT TOP 1
Y1
FROM lab.ENB2012
GROUP BY Y1
ORDER BY COUNT(Y1) DESC)
AS DECIMAL(38, 20)) AS Y1
Variance, Standard Deviation, Standard Error
Having calculated the Mean (AVG) of a set (column), the next step is verifying if the numbers in the set are close to each other or they are very sparse.
Variance
[Variance](https://en.wikipedia.org/wiki/Variance) is a measure of the distance between the numbers in the set and the Mean. Technically this is done by subtracting the Mean from each value, obtaining the squares and averaging all of them. Easier done that said, there is an aggregate function for that.
SELECT
‘Sample Variance’ AS ‘Descriptive Statistics’
,CAST(VAR(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
Standard Deviation
[Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) is the square root of the Variance, as easy as it sounds. And, yes, there’s a function for it.
SELECT
‘Standard Deviation’ AS ‘Descriptive Statistics’
,CAST(STDEV(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
- Please note that there are two versions of the Variance and Standard Deviation functions: VAR, VARP and STDEV, STDEVP. See this article for details https://en.wikipedia.org/wiki/Bessel%27s_correction, but generally speaking, we use the P version when we are sure that the data corresponds to the entire “population” i.e. we are sure we are using all the numbers we can possibly have. This makes more sense is statistics, as most of the times the numbers represent only a fraction of the population (hence sample).
Standard Error of the Mean
[Standard Error of the Mean](https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_the_mean) is the Standard Deviation divided by the square root of the number of rows. It is an indicator of “how much I may be wrong with this…”
SELECT
‘Standard Error’ AS ‘Descriptive Statistics’
,CAST(STDEV(Y1) / (SQRT(COUNT(Y1))) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
Histograms
A nice way of visualizing how the data is distributed (spread), how close the numbers are and if there are any strange values too far from the rest (outliers) that can tamper the aggregate functions results. Given a set of numbers, I want to have n containers, say 20 bins. Each bin has a min and max boundaries and I put the numbers that fall between the boundaries in the corresponding bin. The difference between the max and min of each container is the same and the bins -goes without saying- must be contiguous.
Look at the picture above, it was produced with the R instruction:
hist(Y1, breaks=20, labels=TRUE))
By the way, R only plots 19 bins ignoring my proposal of 20, it probably realized I’m not a Stats Guy…
In SQL,
- set a variable to the number of desired bins,
- get the min and max of the entire column (had to round those to mimic R behaviour),
- calculate the binsize with the (difference between max and min / number of bins), use the next integer value if decimal
- subtract the min from each value and the divide the result by the binsize, use the next integer value as bin number
- COUNT all values GROUP BY bin number
DECLARE @bins INT = 20;
WITH S1
AS
(SELECT
CAST(FLOOR(MIN(Y1)) AS DECIMAL(38, 20)) AS minimum
,CAST(CEILING(MAX(Y1)) AS DECIMAL(38, 20)) AS maximum
,CASE
WHEN
FLOOR((CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins) <> CEILING((CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins) THEN ROUND((CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins, 0)
ELSE (CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins
END AS binsize
FROM lab.ENB2012)
SELECT
bin
,COUNT(*) frequency
FROM (SELECT
Y1
,CEILING(CAST(((Y1 – minimum) / (binsize)) AS DECIMAL(38, 20))) AS bin
FROM lab.ENB2012
CROSS JOIN S1) T1
GROUP BY T1.bin
ORDER BY T1.bin
Another way of creating histograms, if you already know the boundaries, would be a very big CASE statement, not very elegant but quick. Refer to the below articles for more ideas:
https://technet.microsoft.com/en-us/library/aa224898(v=sql.80).aspx
https://sqlsunday.com/2014/09/14/histogram-in-t-sql/
Skewness, Kurtosis
Looking at the density of the values (imagine we draw a line chart instead of bars in the histogram) we can tell from the line if the values are concentrated more to the right or to the left of the Mean (blue vertical line).
According to this image, we say the distribution is skewed to the right, looks like the right slope is less steep than the left one.
Skewness
[Skewness](https://en.wikipedia.org/wiki/Skewness) is the value indicating where the distribution is leaning. If it’s greater than zero = right, less than zero = left.
WITH S1
AS
(SELECT
AVG(Y1) AS mean
,STDEV(Y1) AS stdev
,COUNT(Y1) * 1.0 / (COUNT(Y1) – 1) / (COUNT(Y1) – 2) AS corr
FROM lab.ENB2012)
SELECT
‘Skewness’ AS ‘Descriptive Statistics’
,CAST(SUM(((Y1 – mean) / stdev) * ((Y1 – mean) / stdev) * ((Y1 – mean) / stdev)) * MIN(corr) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
CROSS JOIN S1
Kurtosis
[Kurtosis](https://en.wikipedia.org/wiki/Kurtosis) is the number indicating if the shape of the distribution line is fat and wide or tall and thin. It has a name that sounds like a skin imperfection but actually is a measure of the obesity of your chart (sorry…). Negative numbers = the curve of the line is low and wide, positive numbers = the curve is high and thin. When is zero ? Congratulations, you have a perfect Gaussian shape, whatever that means.
WITH S1
AS
(SELECT
AVG(Y1) AS mean
,STDEV(Y1) AS stdev
,COUNT(Y1) * 1.0 / (COUNT(Y1) – 1) / (COUNT(Y1) – 2) AS corr
,COUNT(Y1) * 1.0 * (COUNT(Y1) + 1) / (COUNT(Y1) – 1) / (COUNT(Y1) – 2) / (COUNT(Y1) – 3) AS corr2
,3.0 * SQUARE((COUNT(Y1) – 1)) / (COUNT(Y1) – 2) / (COUNT(Y1) – 3) AS subfact
FROM lab.ENB2012)
SELECT
‘Kurtosis’ AS ‘Descriptive Statistics’
,CAST(SUM(SQUARE(SQUARE(((Y1 * 1.0 – mean) / stdev)))) * MIN(corr2) – MIN(subfact) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
CROSS JOIN S1
Standing on the Shoulder of Giants
I must thank the following articles, and many more I don’t remember:
* https://technet.microsoft.com/en-us/library/aa224898(v=sql.80).aspx
* https://sqlsunday.com/2014/09/14/histogram-in-t-sql/
* https://www.mathsisfun.com/data/correlation.html
* https://ayadshammout.com/2013/11/30/t-sql-linear-regression-function/
I’d rather not say this was an exhaustive article, nor an introduction to Stats, it was just a teaser…
I believe you have now the basis to resist a two and a half minutes conversation with a Statistician; but if you liked what I’ve written there’s plenty more where that came from. I suggest enrolling in any of the publicly available MOOC on the web, lots of good contents out there.
If you fancy buying a book, you may want to consider one of these two to begin with:
Naked Statistics: Stripping the Dread from the Data
The Cartoon Guide to Statistics
Conclusion
Statistician: (walks away whispering) …”lazy Disk Jockey”
DB admin: (resumes Warcraft game and thinks) …”selfish Coin Flipper”