Statistics 101 for DB Dummies

SQL Cheat sheet for The “AVERAGE” DB Guy

A few years ago an article on CNBC defined data analysis as “The Sexiest Job of the 21st Century”. I wouldn’t be so optimistic, but it is certainly becoming more and more in demand by the day. There is high demand for big data professionals, ETL specialists, data architects and statisticians. Traditionally, and until some time ago, database developers and administrators were considered as far from “sexy” as one could possibly be; nowadays, as the article recalls, “There are a lot of folks who can spell Hadoop and put it on their résumé and call themselves data scientists”.

Don’t believe me, just watch: in The Best Jobs of 2016, Statistician is ranked #2, surpassed only by Data Scientist at #1.

The first non-IT related job in the list is Audiologist, #4.

This article is an attempt to fill the gap between what we historically know as data manipulation language (DML) and the basic concepts of Statistics. If you google “what does a statistician do”, the answer looks very similar to what I do day in and day out, except for the *statistical method* part.

So my suggestion to all the SQL people out there is to become acquainted with the terms and functions most used in Statistics, so that you don’t need to install and learn R just to understand whether a column in a table is skewed towards the left. I have friends and colleagues who are statisticians, whom I respect but often misunderstand, and thus I regard them as some sort of alchemists. So this article aims to reduce the distance between the “Stats Guy” and the “DB Guy”, the target audience being a person who loves data, is experienced in looking at it as a collection of tables and columns, and wants to get the best out of it without having to learn a statistical software package.

More specifically, being an MS SQL Server developer, I’ll try to explain some concepts in MS SQL terms. Other RDBMSs may have slightly different syntax or lack some functionality, but the concepts are applicable to most of them.

Having said that, imagine a dialogue like this in an unspecified company of an unspecified sector…

Statistician: Ok, so: how is the data distributed?

DB admin: I receive a CSV every week, then I email a PDF to all managers

Statistician: I know that, but I asked about the distribution of the data, is it left-skewed, right-skewed…

DB admin: Actually it’s…ehm…from the union workers…so I guess a little…leftist?

Statistician: Oh, please, be serious: did you make a histogram already?

DB admin: SQL Server does that automatically, with DBCC…so I don’t need to

Statistician: Ok. Guess I’d better ask Debbie C.C. then. What about deviation? Did you observe…

DB admin: Well that’s rather intimate…I have a…Batman outfit I like to wear but my gir…

oh! OH! NO,..sorry: you want STDEVP(), right? Oh! I just…sorry…for a second…

Statistician: You’re a mean person, you know it?

DB admin: No…I’m just an average guy. If you mean MEAN, ask Fred, the Mongo guy, next cubicle.

In order to smooth relationships, and make the world a better place, here is a quick and dirty series of functions and calculations to help the DB admin better serve the Statistician.


All formulas have been double-checked against Excel and R, so I’m quite confident that the calculations are reliable and ready to use.


Prepare the data

Let’s start by creating a table of numbers. For the purpose of this article we don’t know where they came from or what they represent; from a DB perspective they are just another table with several numerical columns.

Download this script and execute it
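If the script link ever goes missing, the examples only really need a table called lab.ENB2012 with a numeric Y1 column; a made-up stand-in like the one below is enough to follow along (the second column and the sample values are invented, not the real dataset).

-- Made-up stand-in for the real script: only lab.ENB2012 and Y1 are taken
-- from the article; the extra column and the values are purely illustrative
CREATE SCHEMA lab;
GO
CREATE TABLE lab.ENB2012 (
   Y1 DECIMAL(10, 2) NOT NULL
  ,Y2 DECIMAL(10, 2) NULL
);
GO
INSERT INTO lab.ENB2012 (Y1, Y2)
VALUES (15.50, 21.30), (20.80, 28.30), (28.50, 37.70), (24.80, 29.80);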


In all examples I will use the column Y1 and format the output as a DECIMAL(38,20). This is not strictly necessary, but it is indeed useful when dealing with very small numbers or when you need to check the results against other tools like R.

Mean, Median, Mode

These are fairly simple and probably the most basic terms in Stats, so I won’t spend much time on them.


Mean is what we call the average; there is an aggregate function, AVG(), that we know all too well.


SELECT
  'Mean' AS 'Descriptive Statistics'
 ,CAST(AVG(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012


Median is the number in the middle when you sort the results; if there are two numbers in the middle, it is the average of the two. ORDER the query BY the column value ASC: if the number of rows n is odd, pick the value in row (n/2)+1 (integer division); otherwise take the average of rows (n/2) and (n/2)+1.




WITH t AS (
 SELECT
   COUNT(Y1) / 2 c
  ,COUNT(Y1) % 2 r
 FROM lab.ENB2012)
SELECT
  'Median' AS 'Descriptive Statistics'
 ,CAST(CASE
    WHEN r = 0 THEN
     (SELECT AVG(Y1 * 1.0)
        FROM (SELECT Y1
                    ,ROW_NUMBER() OVER (ORDER BY Y1 ASC) AS n
                FROM lab.ENB2012) t1
       WHERE n IN (c, c + 1))
    ELSE
     (SELECT Y1
        FROM (SELECT Y1
                    ,ROW_NUMBER() OVER (ORDER BY Y1 ASC) AS n
                FROM lab.ENB2012) t1
       WHERE n IN (c + 1))
  END AS DECIMAL(38, 20)) AS Y1
FROM t



Mode is the number that appears most frequently in a result set.

COUNT() the occurrences of each value, ORDER BY the count DESC and pick the first row


SELECT
  'Mode' AS 'Descriptive Statistics'
 ,CAST((SELECT TOP 1 Y1
          FROM lab.ENB2012
         GROUP BY Y1
         ORDER BY COUNT(*) DESC)
       AS DECIMAL(38, 20)) AS Y1

Variance, Standard Deviation, Standard Error

Having calculated the Mean (AVG) of a set (column), the next step is checking whether the numbers in the set are close to each other or widely spread.


Variance is a measure of how far the numbers in the set are from the Mean. Technically, it is obtained by subtracting the Mean from each value, squaring the differences and averaging them. Easier done than said: there is an aggregate function for that.
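In formula form, this is the sample version that VAR() returns (it divides by n - 1; see the note on VARP further down):

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2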


SELECT
  'Sample Variance' AS 'Descriptive Statistics'
 ,CAST(VAR(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012

Standard Deviation

Standard Deviation is the square root of the Variance, as easy as it sounds. And, yes, there’s a function for it.


SELECT
  'Standard Deviation' AS 'Descriptive Statistics'
 ,CAST(STDEV(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012

  • Please note that there are two versions of the Variance and Standard Deviation functions: VAR, VARP and STDEV, STDEVP. See this article for details, but generally speaking, we use the P version when we are sure that the data corresponds to the entire “population”, i.e. we are sure we are using all the numbers we can possibly have. This makes more sense in statistics, as most of the time the numbers represent only a fraction of the population (hence “sample”). The quick comparison below shows all four side by side.
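Running them on our column makes the difference visible: the sample versions divide by n - 1, the P versions by n.

-- Sample vs population versions side by side
SELECT
  CAST(VAR(Y1)    AS DECIMAL(38, 20)) AS [Sample Variance]
 ,CAST(VARP(Y1)   AS DECIMAL(38, 20)) AS [Population Variance]
 ,CAST(STDEV(Y1)  AS DECIMAL(38, 20)) AS [Sample StDev]
 ,CAST(STDEVP(Y1) AS DECIMAL(38, 20)) AS [Population StDev]
FROM lab.ENB2012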

Standard Error of the Mean

Standard Error of the Mean is the Standard Deviation divided by the square root of the number of rows. It is an indicator of “how much I may be wrong with this…”


SELECT
  'Standard Error' AS 'Descriptive Statistics'
 ,CAST(STDEV(Y1) / SQRT(COUNT(Y1)) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012


Histogram

A histogram is a nice way of visualizing how the data is distributed (spread), how close the numbers are, and whether there are any strange values too far from the rest (outliers) that can distort the results of the aggregate functions. Given a set of numbers, I want to have n containers, say 20 bins. Each bin has a min and a max boundary, and I put in each bin the numbers that fall between its boundaries. The difference between the max and min of each container is the same and the bins -goes without saying- must be contiguous.

Look at the picture above; it was produced with the R instruction:

hist(Y1, breaks=20, labels=TRUE)

By the way, R only plots 19 bins, ignoring my proposal of 20; it probably realized I’m not a Stats Guy…


The SQL recipe goes like this:

  1. set a variable to the number of desired bins,
  2. get the min and max of the entire column (I had to round those to mimic R’s behaviour),
  3. calculate the binsize as (difference between max and min) / number of bins, rounding it to an integer if it is not a whole number,
  4. subtract the min from each value, divide the result by the binsize, and take the next integer as the bin number,
  5. COUNT all values GROUP BY bin number.

DECLARE @bins INT = 20;




WITH t AS (
 SELECT
   CAST(FLOOR(MIN(Y1)) AS DECIMAL(38, 20)) AS minimum
  ,CAST(CEILING(MAX(Y1)) AS DECIMAL(38, 20)) AS maximum
  ,CASE
     WHEN FLOOR((CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins) <> CEILING((CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins) THEN ROUND((CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins, 0)
     ELSE (CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins
   END AS binsize
 FROM lab.ENB2012)
SELECT
  bin
 ,COUNT(*) frequency
FROM (
 SELECT
   Y1
  ,CEILING(CAST(((Y1 - minimum) / (binsize)) AS DECIMAL(38, 20))) AS bin
 FROM lab.ENB2012
 CROSS JOIN t) t2
GROUP BY bin
ORDER BY bin

Another way of creating histograms, if you already know the boundaries, is a very big CASE statement: not very elegant, but quick; a sketch follows. Refer to the articles below for more ideas.
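The boundary values here are invented, so adjust them to your own data:

-- Hand-made bins: the cut-off values are arbitrary examples
SELECT
  CASE
    WHEN Y1 < 10 THEN '< 10'
    WHEN Y1 < 20 THEN '10 - 20'
    WHEN Y1 < 30 THEN '20 - 30'
    ELSE '>= 30'
  END AS bin
 ,COUNT(*) AS frequency
FROM lab.ENB2012
GROUP BY
  CASE
    WHEN Y1 < 10 THEN '< 10'
    WHEN Y1 < 20 THEN '10 - 20'
    WHEN Y1 < 30 THEN '20 - 30'
    ELSE '>= 30'
  END
ORDER BY MIN(Y1)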

Skewness, Kurtosis

Looking at the density of the values (imagine we draw a line chart instead of bars in the histogram) we can tell from the line if the values are concentrated more to the right or to the left of the Mean (blue vertical line).

According to this image, we say the distribution is skewed to the right: the right slope looks less steep than the left one.


Skewness is the value indicating which way the distribution leans: greater than zero means skewed to the right, less than zero means skewed to the left.
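For the record, the corr factor in the query below is the small-sample correction, so what we are computing should match the sample skewness documented for Excel’s SKEW():

skewness = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3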




WITH t AS (
 SELECT
   AVG(Y1) AS mean
  ,STDEV(Y1) AS stdev
  ,COUNT(Y1) * 1.0 / (COUNT(Y1) - 1) / (COUNT(Y1) - 2) AS corr
 FROM lab.ENB2012)
SELECT
  'Skewness' AS 'Descriptive Statistics'
 ,CAST(SUM(((Y1 - mean) / stdev) * ((Y1 - mean) / stdev) * ((Y1 - mean) / stdev)) * MIN(corr) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
CROSS JOIN t



Kurtosis is the number indicating whether the shape of the distribution line is fat and wide or tall and thin. It has a name that sounds like a skin imperfection, but it is actually a measure of the obesity of your chart (sorry…). Negative numbers = the curve of the line is low and wide, positive numbers = the curve is high and thin. And when it is zero? Congratulations, you have a perfect Gaussian shape, whatever that means.
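For the record, corr2 and subfact in the query below should reproduce the excess-kurtosis formula documented for Excel’s KURT():

kurtosis = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}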




WITH t AS (
 SELECT
   AVG(Y1) AS mean
  ,STDEV(Y1) AS stdev
  ,COUNT(Y1) * 1.0 / (COUNT(Y1) - 1) / (COUNT(Y1) - 2) AS corr
  ,COUNT(Y1) * 1.0 * (COUNT(Y1) + 1) / (COUNT(Y1) - 1) / (COUNT(Y1) - 2) / (COUNT(Y1) - 3) AS corr2
  ,3.0 * SQUARE((COUNT(Y1) - 1)) / (COUNT(Y1) - 2) / (COUNT(Y1) - 3) AS subfact
 FROM lab.ENB2012)
SELECT
  'Kurtosis' AS 'Descriptive Statistics'
 ,CAST(SUM(SQUARE(SQUARE(((Y1 * 1.0 - mean) / stdev)))) * MIN(corr2) - MIN(subfact) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
CROSS JOIN t


Standing on the Shoulders of Giants

I must thank the following articles, and many more I don’t remember:





I wouldn’t call this an exhaustive article, nor an introduction to Stats; it was just a teaser…

I believe you now have the basics to survive a two-and-a-half-minute conversation with a Statistician; and if you liked what I’ve written, there’s plenty more where that came from. I suggest enrolling in one of the many MOOCs publicly available on the web; there is a lot of good content out there.

If you fancy buying a book, you may want to consider one of these two to begin with:

Naked Statistics: Stripping the Dread from the Data

The Cartoon Guide to Statistics


Statistician: (walks away whispering) …”lazy Disk Jockey”

DB admin: (resumes Warcraft game and thinks) …”selfish Coin Flipper”

Free MicroStrategy training in Seville (students and unemployed only)


Espacio RES (@espacio_RES) is organizing a 2-week Business Intelligence course in Seville with MicroStrategy and SQL Server.

SELECT CONVERT(DATE,'20161114',112) AS [Start Date],
CONVERT(DATE,'20161127',112) AS [End Date]

The course is free for students and unemployed professionals. The classes will be held in English from Mon to Fri (9am – 2pm).

I will be the main teacher, with a little help from colleagues and friends in the BI sector.

Please see this document for more information, and get in touch to apply.

Twitter Trending Topics by Location, Date (Power BI WordCloud)

Twitter Trending Topics by Date, Location

One of the things I like the most about Power BI is the ability to use custom visuals. I have not tried building one myself, and honestly I think it’s beyond my skills, but there is a gallery with a growing number of pre-built visuals that you can download and use.

Go to the gallery and take a look.

I downloaded the WordCloud plus the Chiclet Slicer to create a report showing the trending topics in Twitter sliced by date and location.

Only a few locations are available from the Twitter API; here is a sample of the Spanish cities I am monitoring.

Happy tweets!

Microsoft Azure SQL Data Warehouse. How much does it cost?

As many of you know, Microsoft has opened its latest BI technology to public preview: Azure SQL Data Warehouse.
On the documentation page it is defined as: “an enterprise-class distributed database capable of processing petabyte volumes of relational and non-relational data.”
How much does it cost, is it worth it, and how does it differ from previous SQL offerings on Azure?
The main difference -in my opinion- is the billing of compute power and storage:
in Azure SQL DW you decide how fast your queries run and how big your data is. This generates a charge for DWUs (Data Warehouse Units), a unit of measure defined as the level of performance of your data warehouse, and a charge for RA-GRS Page Blobs (Read-Access Geo-Redundant Storage).
I suppose this will cause more than a headache for financial departments and SQL administrators, but it is indeed a good thing.

The main point I see here is that -wisely used- the DWU pricing can help tune the database responsiveness and number-crunching speed according to the time of day (or day of the week). I can set, for example, 100 DWU during low-traffic hours or weekends, and scale up to 2000 DWU on Monday morning or during the reporting season, when everybody wants to access the data. I can even choose to pause the database -for example during holidays- and incur no compute charge at all. It would be extremely useful to have PowerShell scripts to switch from one DWU level to another and schedule them, so that the database can auto-adjust the compute power to what is needed.
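PowerShell aside, if I am not mistaken the switch itself can also be scripted in plain T-SQL with ALTER DATABASE (MyDW below is a made-up database name; run the statements against the master database of the logical server):

-- Scale down for quiet hours, up again for the Monday morning rush
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW100');
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW2000');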

Doing some very simple math: assuming a 30-day month, at 100 DWU we will receive a $504 (USD) bill for compute + $62.5 (USD) for 1 TB of data, a total of $566.5* for a fully functioning SQL Data Warehouse. On the other end of the spectrum, 2000 DWU for 1 month would be $10,080 (USD)*.
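Working backwards from those figures, and assuming the price scales linearly with DWU:

$504 / (30 days x 24 hours) = $0.70 per hour for 100 DWU
$0.70 x 20 x 720 hours = $10,080 per month for 2000 DWU
plus $62.5 per month for the 1 TB of RA-GRS storage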

In addition to the grow, shrink, or pause capabilities, Azure SQL Data Warehouse features an MPP architecture, column-store indexes, and the ability to query relational and non-relational data (read: Hadoop) with PolyBase; more on this here.

So my .02: let’s give it a try…

*During the preview period the prices are discounted. I do not expect them to vary that much anyway, but at the moment it is not clear how big the discount is or what the price will be after the preview; see here for more details:

The Lahman Baseball Database – real-life sample data

Mr Lahman is (by his own words) a Watchdog reporter & data journalist, and an author.
But the collective wisdom of the Business Intelligence world remembers him as the creator of a wonderful public dataset: the Baseball Database.

The Lahman database is extremely useful and I am grateful to its author for several reasons:

  1. It is not a bike/sport/equipment shop sales database, which almost all examples in the BI world use (my book is no exception…)
  2. It is real-life data on an interesting subject, and it is frequently updated.
  3. And it has errors (very few), those nice little imperfections that come in very handy when you are testing/training/developing a test case or a proof of concept.
I don’t remember who said that PowerPoint is always right and nothing can go wrong with a slide presentation -as long as the projector is working and the USB drive is plugged in- but he would surely agree that it’s no fun to demo with a database that is perfectly crafted to return exact results.
I have never found a “squeaky clean” production DB in my life. As a general rule there is always a missing primary key, an unenforceable foreign key, missing records, outdated documentation, you name it… That’s why I like the Lahman DB: I can experiment -in a controlled environment- with how the software behaves in situations that are beyond the standard Show and Tell “look-ma-this-works” scenarios.
It is available for free and you can download it in several formats from here:
I converted it to Oracle and SQL Server, and you can download those versions too from here:
In Oracle, simply run IMPDP with the SCHEMAS=LAHMAN option; the password for the LAHMAN schema is (guess) LAHMAN.
In SQL Server, copy the MDF into the DATA folder and attach it to an existing SQL Server engine instance, as sketched below.
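Something like this should do; the database name, path and file name below are only an example, so point FILENAME at wherever your instance keeps its data files:

-- Attach the downloaded MDF; path and file name are illustrative, not the actual download
CREATE DATABASE lahman
ON (FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\lahman.mdf')
FOR ATTACH;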
Hope this helps; I will often use this data to showcase MicroStrategy and other products in my next posts.