MasterClass Big Data – Análisis predictivo con Azure Machine Learning Studio

## Statistics 101 for DB Dummies

*SQL Cheat sheet for The “AVERAGE” DB Guy*

Few years ago an article on CNBC defined data analysis “The Sexiest Job of the 21st Century”. I wouldn’t be so optimistic, but certainly it is becoming more and more demanded by the day. There is high demand for big data professionals, ETL specialists, data architects and statisticians. Traditionally, and until some time ago, database developers and administrators were considered as far from “sexy” as it could possibly be; nowadays, as the article recalls, “There are a lot of folks who can spell Hadoop and put it on their résumé and call themselves data scientists”.

Don’t believe me, just watch: The Best Jobs of 2016, Statistician is ranked #2, only surpassed by Data Scientist at #1.

The first non-IT related job in the list is Audiologist, #4.

This article is an attempt to fill the gaps between what we historically know as data manipulation language (DML) and the basic concepts of Statistics. If you google “what does a statistician do”, the answer looks very similar to what I do day in and day out, except for the *statistical method* part.

So my suggestion to all the SQL people out there, is to become acquainted to the terms and functions mostly used in Statistics, so that you don’t need to install and learn R just to understand if a column in the table is skewed towards the left. I have friends and colleagues who are statisticians whom I respect but often misunderstand, and thus I regard them as some sort of alchemists. So this article aims to reduce the distance between the “Stats Guy” and the “DB Guy”, the target audience being a person who loves data, experienced in looking at it as a collection of tables and columns, and who wants to get the best out of it without having to learn a statistical software package.

More specifically, being a MS SQL Server developer, I’ll try to explain some concepts in MS SQL terms. Other RDBMS can have slightly different syntax or lack some functionalities but the concepts are applicable to most of them.

Having said that, imagine a dialogue like this in an unspecified company of an unspecified sector…

**Statistician**: Ok, so: how is the data distributed?

**DB admin**: I receive a CSV every week, then I email a PDF to all managers

**Statistician**: I know that, but I asked about the distribution of the data, is it left-skewed, right-skewed…

**DB admin**: Actually it’s…ehm…from the union workers…so I guess a little…leftist?

**Statistician**: Oh, please, be serious: did you make an histogram already?

**DB admin**: SQL Server does that automatically, with DBCC…so I don’t need to

**Statistician**: Ok. Guess I’d better ask Debbie C.C. then. What about deviation? Did you observe…

**DB admin**: Well that’s rather intimate…I have a…Batman outfit I like to wear but my gir…

oh! OH! NO,..sorry: you want STDEVP(), right? Oh! I just…sorry…for a second…

**Statistician**: You’re a mean person, you know it?

**DB admin**: No…I’m just an average guy. If you mean MEAN, ask Fred, the Mongo guy, next cubicle.

In order to smooth relationships, and make the world a better place, here is a quick and dirty series of functions and calculations to help the DB admin better serve the Statistician.

/*

All formulas have been double checked against Excel and R, so I’m quite confident that calculations are reliable and ready to use.

*/

## Prepare the data

Let’s start creating a table of numbers. For the purpose of this article we don’t know where they came from or what they represent, from a DB perspective they are just another table with several numerical columns.

Download this script and execute it

*In all examples I will use the column Y1 and format the output as a DECIMAL(38,20). This is not strictly necessary but useful indeed when dealing with very little numbers or when you need to check the results against other tools like R.*

## Mean, Median, Mode

These are fairly simple and probably the most basic terms in Stats, won’t spend much time on them.

### Mean

[Mean](https://en.wikipedia.org/wiki/Arithmetic_mean) is what we call average, there is an aggregate function AVG() we know too well.

SELECT

‘Mean’ AS ‘Descriptive Statistics’

,CAST(AVG(Y1) AS DECIMAL(38, 20)) AS Y1

FROM lab.ENB2012

### Median

[Median](https://en.wikipedia.org/wiki/Median) is the number in the middle when you sort the results, if there are two numbers in the middle then the average of the two. ORDER the query BY the value ASC of the column, if the number of rows n is odd pick the value in row (n/2)+1, otherwise get the average of rows number (n/2) and (n/2)+1

WITH S1

AS

(SELECT

COUNT(Y1) / 2 c

,COUNT(Y1) % 2 r

FROM lab.ENB2012)

SELECT

‘Median’ AS ‘Descriptive Statistics’

,CAST(CASE

WHEN r = 0 THEN (SELECT

AVG(t1.Y1)

FROM (SELECT

ROW_NUMBER() OVER (ORDER BY Y1 ASC) N

,Y1

FROM lab.ENB2012) t1

WHERE n IN (c, c + 1))

WHEN r = 1 THEN (SELECT

t1.Y1

FROM (SELECT

ROW_NUMBER() OVER (ORDER BY Y1 ASC) N

,Y1

FROM lab.ENB2012) t1

WHERE n IN (c + 1))

END AS DECIMAL(38, 20)) AS Y1

FROM S1

### Mode

[Mode](https://en.wikipedia.org/wiki/Mode_(statistics)) is the number that appears most frequently in a result set.

COUNT() the occurrences of each value, ORDER BY the count DESC and pick the first row

SELECT

‘Mode’ AS ‘Descriptive Statistics’

,CAST((SELECT TOP 1

Y1

FROM lab.ENB2012

GROUP BY Y1

ORDER BY COUNT(Y1) DESC)

AS DECIMAL(38, 20)) AS Y1

## Variance, Standard Deviation, Standard Error

Having calculated the Mean (AVG) of a set (column), the next step is verifying if the numbers in the set are close to each other or they are very sparse.

### Variance

[Variance](https://en.wikipedia.org/wiki/Variance) is a measure of the distance between the numbers in the set and the Mean. Technically this is done by subtracting the Mean from each value, obtaining the squares and averaging all of them. Easier done that said, there is an aggregate function for that.

SELECT

‘Sample Variance’ AS ‘Descriptive Statistics’

,CAST(VAR(Y1) AS DECIMAL(38, 20)) AS Y1

FROM lab.ENB2012

### Standard Deviation

[Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) is the square root of the Variance, as easy as it sounds. And, yes, there’s a function for it.

SELECT

‘Standard Deviation’ AS ‘Descriptive Statistics’

,CAST(STDEV(Y1) AS DECIMAL(38, 20)) AS Y1

FROM lab.ENB2012

- Please note that there are two versions of the Variance and Standard Deviation functions: VAR, VARP and STDEV, STDEVP. See this article for details https://en.wikipedia.org/wiki/Bessel%27s_correction, but generally speaking, we use the P version when we are sure that the data corresponds to the entire “population” i.e. we are sure we are using all the numbers we can possibly have. This makes more sense is statistics, as most of the times the numbers represent only a fraction of the population (hence sample).

### Standard Error of the Mean

[Standard Error of the Mean](https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_the_mean) is the Standard Deviation divided by the square root of the number of rows. It is an indicator of “how much I may be wrong with this…”

SELECT

‘Standard Error’ AS ‘Descriptive Statistics’

,CAST(STDEV(Y1) / (SQRT(COUNT(Y1))) AS DECIMAL(38, 20)) AS Y1

FROM lab.ENB2012

## Histograms

A nice way of visualizing how the data is distributed (spread), how close the numbers are and if there are any strange values too far from the rest (outliers) that can tamper the aggregate functions results. Given a set of numbers, I want to have n containers, say 20 bins. Each bin has a min and max boundaries and I put the numbers that fall between the boundaries in the corresponding bin. The difference between the max and min of each container is the same and the bins -goes without saying- must be contiguous.

Look at the picture above, it was produced with the R instruction:

hist(Y1, breaks=20, labels=TRUE))

By the way, R only plots 19 bins ignoring my proposal of 20, it probably realized I’m not a Stats Guy…

In SQL,

- set a variable to the number of desired bins,
- get the min and max of the entire column (had to round those to mimic R behaviour),
- calculate the binsize with the (difference between max and min / number of bins), use the next integer value if decimal
- subtract the min from each value and the divide the result by the binsize, use the next integer value as bin number
- COUNT all values GROUP BY bin number

DECLARE @bins INT = 20;

WITH S1

AS

(SELECT

CAST(FLOOR(MIN(Y1)) AS DECIMAL(38, 20)) AS minimum

,CAST(CEILING(MAX(Y1)) AS DECIMAL(38, 20)) AS maximum

,CASE

WHEN

FLOOR((CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins) <> CEILING((CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins) THEN ROUND((CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins, 0)

ELSE (CEILING(MAX(Y1)) – FLOOR(MIN(Y1))) / @bins

END AS binsize

FROM lab.ENB2012)

SELECT

bin

,COUNT(*) frequency

FROM (SELECT

Y1

,CEILING(CAST(((Y1 – minimum) / (binsize)) AS DECIMAL(38, 20))) AS bin

FROM lab.ENB2012

CROSS JOIN S1) T1

GROUP BY T1.bin

ORDER BY T1.bin

Another way of creating histograms, if you already know the boundaries, would be a very big CASE statement, not very elegant but quick. Refer to the below articles for more ideas:

https://technet.microsoft.com/en-us/library/aa224898(v=sql.80).aspx

https://sqlsunday.com/2014/09/14/histogram-in-t-sql/

## Skewness, Kurtosis

Looking at the density of the values (imagine we draw a line chart instead of bars in the histogram) we can tell from the line if the values are concentrated more to the right or to the left of the Mean (blue vertical line).

According to this image, we say the distribution is skewed to the right, looks like the right slope is less steep than the left one.

### Skewness

[Skewness](https://en.wikipedia.org/wiki/Skewness) is the value indicating where the distribution is leaning. If it’s greater than zero = right, less than zero = left.

WITH S1

AS

(SELECT

AVG(Y1) AS mean

,STDEV(Y1) AS stdev

,COUNT(Y1) * 1.0 / (COUNT(Y1) – 1) / (COUNT(Y1) – 2) AS corr

FROM lab.ENB2012)

SELECT

‘Skewness’ AS ‘Descriptive Statistics’

,CAST(SUM(((Y1 – mean) / stdev) * ((Y1 – mean) / stdev) * ((Y1 – mean) / stdev)) * MIN(corr) AS DECIMAL(38, 20)) AS Y1

FROM lab.ENB2012

CROSS JOIN S1

### Kurtosis

[Kurtosis](https://en.wikipedia.org/wiki/Kurtosis) is the number indicating if the shape of the distribution line is fat and wide or tall and thin. It has a name that sounds like a skin imperfection but actually is a measure of the obesity of your chart (sorry…). Negative numbers = the curve of the line is low and wide, positive numbers = the curve is high and thin. When is zero ? Congratulations, you have a perfect Gaussian shape, whatever that means.

WITH S1

AS

(SELECT

AVG(Y1) AS mean

,STDEV(Y1) AS stdev

,COUNT(Y1) * 1.0 / (COUNT(Y1) – 1) / (COUNT(Y1) – 2) AS corr

,COUNT(Y1) * 1.0 * (COUNT(Y1) + 1) / (COUNT(Y1) – 1) / (COUNT(Y1) – 2) / (COUNT(Y1) – 3) AS corr2

,3.0 * SQUARE((COUNT(Y1) – 1)) / (COUNT(Y1) – 2) / (COUNT(Y1) – 3) AS subfact

FROM lab.ENB2012)

SELECT

‘Kurtosis’ AS ‘Descriptive Statistics’

,CAST(SUM(SQUARE(SQUARE(((Y1 * 1.0 – mean) / stdev)))) * MIN(corr2) – MIN(subfact) AS DECIMAL(38, 20)) AS Y1

FROM lab.ENB2012

CROSS JOIN S1

## Standing on the Shoulder of Giants

I must thank the following articles, and many more I don’t remember:

* https://technet.microsoft.com/en-us/library/aa224898(v=sql.80).aspx

* https://sqlsunday.com/2014/09/14/histogram-in-t-sql/

* https://www.mathsisfun.com/data/correlation.html

* https://ayadshammout.com/2013/11/30/t-sql-linear-regression-function/

I’d rather not say this was an exhaustive article, nor an introduction to Stats, it was just a teaser…

I believe you have now the basis to resist a two and a half minutes conversation with a Statistician; but if you liked what I’ve written there’s plenty more where that came from. I suggest enrolling in any of the publicly available MOOC on the web, lots of good contents out there.

If you fancy buying a book, you may want to consider one of these two to begin with:

Naked Statistics: Stripping the Dread from the Data

The Cartoon Guide to Statistics

## Conclusion

**Statistician:** (walks away whispering) …”lazy Disk Jockey”

**DB admin:** (resumes Warcraft game and thinks) …”selfish Coin Flipper”

## Free MicroStrategy training in Seville (students and unemployed only)

Espacio RES (@espacio_RES) is organizing in Seville a 2 week Business Intelligence course with MicroStrategy and SQL Server.

`SELECT CONVERT(DATE,'20161114',112) AS [Start Date],`

CONVERT(DATE,'20161127',112) AS [End Date]

The course is free for students and unemployed professionals. The classes will be held in English from Mon to Fri (9am – 2pm).

I will be the main teacher, with some little help from colleagues and friends in the BI sector.

Please see this document:

introduction-to-business-intelligence-with-microstrategy

for more information, and contact hola@espacio-res.org to apply

## A word(pad) about passwords

As any other professional we are concerned about security; data analysts and DBAs, we often use complex passwords to secure our data and our web accounts. One draw-back I often encounter is when I change the language or keyboard -depending the client I’m working with-, and some special characters may not be present at my fingertips.

Take this as an example: RIGHT SINGLE QUOTATION MARK

which is not available on Spanish keyboards. Its Unicode number is the hexadecimal 2019.

Thanks to this article: Three ways to enter Unicode characters in Windows

I learned a new way of typing special characters.

Open up Wordpad (write.exe) , type the hexadecimal number, in this case 2019, and then Alt+x.

It will convert the preceding numbers in the corresponding Unicode character. Easy, isn’t it? Next you only need to copy and paste it in the corresponding text box. Of course you should not write the entire password in Wordpad, just the special chars, because there are programs out there that can sniff your clipboard.

## La higiene de los datos se aprende desde niños (video)

Here you can find my latest chat about data hygiene (Spanish audio). And the Mind-map

used during the presentation (sorry no PPT available).

Thanks to @espacio_RES

## Power BI publish to web now in preview

Without JavaScript. No, seriously…

According to sources from Microsoft, this newest feature will allow to publish interactive reports to any web page supporting iframes. Did not try it yet, but looks very promising. It’s not clear if MS will eventually charge for this once out of preview.

Meanwhile, we can impress our customers and friends embedding self refreshing reports on our blogs and webs. For free.

See https://powerbi.microsoft.com/en-us/blog/announcing-power-bi-publish-to-web/ for details.

Happy embedding!

## Filter a Power BI tile on your web page

EDIT 2016-07-12: Microsoft has changed their API and policies and functionality, so this artcile is obsolete now. I will shut down the sample site. For details, please see: https://powerbi.microsoft.com/en-us/blog/what-s-new-and-what-s-next-for-power-bi-embedded-july-2016/

======================================================================================================================

I received a lot of feedback regarding my two recent posts: http://moraschi.com/2015/12/08/embed-multiple-power-bi-tiles-into-your-web/, and http://moraschi.com/2015/11/30/use-a-power-bi-tile-in-your-web-page/

So I decided to continue in my exploration of the Power BI API. The next natural step in creating customized dashboards is the ability to dynamically filter the tiles based on user interaction. As of today, this is something you cannot achieve in https://app.powerbi.com (you can only filter using selectors in Reports).

There is a post here, with very scarce information about a parameter of the embed API, namely **$filter**. We can use this parameter much like its oData cousin, with the same syntax.

&$filter={tablename/fieldname}+eq+'{literal constant}’

I tested it for string parameters and it works. Did not have such luck with dates. Everything will eventually come into place, as this API is still under development as I understand.

I created a sample web page trying to keep it very simple, to demonstrate how this works. You can find it here and you are welcome to help yourself reusing my code and adapting it to your needs. In the example you will see a list box, clicking on it will trigger a refresh of the tile with the appropriate results.

Happy coding!

## SKILL UP offer from PACKT Publishing, every title only 5$ (Offer lasts till 8th Jan 2016)

Pick any electronic book for just 5$.

Packt has also created their very first Year in Review – https://www.packtpub.com/packt/five-dollar-skillup/report

It will guide you through the biggest topics to come out of 2015 and show you what’s going to be defining 2016 and beyond.

Merry learning and happy new year!

## A map, a map! My kingdom for a Choropleth map! (Power BI)

One of the most fascinating aspect of Business Intelligence is the power to picture data in an easy to understand way. I personally like maps a lot, not the google or bing type with lots of circular points which give very low added value to the information, but the Choropleth maps. Those shaped areas with shaded colors intended to convey an idea at the first sight.

Choropleth maps are mostly used with regional data but they can picture any kind of information on any kind of shape, the idea behind is very simple: you have a SVG image (a vectorial one), with one id for each closed area. Remember those days when we were kids and used the coloring book to paint on pre-designed images?

That’s it. Then you load your data and assign to each id a numeric value to get a colored area with a increasing intensity of color.

This is exactly what the **Synoptic panel** does.

It’s a custom visual that you can use with Power BI, more on this here. You can download the current version from gitHub, and import it into your Power BI report (yes, it’s free).

The Synoptic panel itself doesn’t paint the picture, you need to have a coded SVG to use with it. And creating a coded SVG is not difficult, you may use any SVG editor, or just go to http://synoptic.design which is an **extremely useful** web build by the same people who did the custom visual.

I must thank @DanielePerilli for his very good job and his continuous support. He was very responsive and helped me through building my first map.

So in about an hour and a half here we have two maps with polls data from the Spanish elections in 2011. I downloaded the CSV from http://www.electionresources.org/es/data/index_es.html#PROVINCIAS and used the Synoptic panel to create two maps where I filtered respectively the votes obtained by PP and PSOE.

No rocket science, anybody can do that. Just a little caveat, it’s not very intuitive: if you want to have different shades of color, you should drag a field to the “Saturation Values” box in the visualization settings. And if you want to display names on the map, you need to enable “Category Labels” in the format tab of visualization settings. That took me more than 5 minutes to figure out.

Anyway, go download it, create your own map or use those available in the gallery and happy coloring!

## Embed multiple Power BI tiles into your Web

EDIT 2016-07-12: Microsoft has changed their API and policies and functionality, so this article is obsolete now. I will shut down the sample site. For details, please see: https://powerbi.microsoft.com/en-us/blog/what-s-new-and-what-s-next-for-power-bi-embedded-july-2016/

Code re-usability is at the core of programming best practices and Business Intelligence is no exception.

None wants to code multiple times just to refactor thousands of lines of similar code whenever there’s a change in business rules.

In my previous post Use a Power BI Tile in your web page we saw a technique that allows us to use a Power BI tile into a custom web page. More often than not, our HTML documents are composed by more than one grid or chart, so we need a flexible way to be able to insert multiple tiles on the same page without rewriting the same JavaScript procedures over and over.

I reworked the previous example to make it easier and created this simple function:

[code language=”javascript”]

<script type="text/javascript">

function EmbedTile(sTileUrl, sFrameName, iWidth, iHeight) {

var iframe = document.getElementById(sFrameName);

iframe.src = sTileUrl + "&width=" + iWidth + "&height=" + iHeight;

iframe.onload = function () {

var m = { action: "loadTile", accessToken: accessToken, height: iHeight, width: iWidth };

message = JSON.stringify(m);

iframe.contentWindow.postMessage(message, "*");

}

}

window.onload = function(){

EmbedTile(

‘https://app.powerbi.com/embed?dashboardId=9153d5…..365207&tileId=2aef8……913edc524353’,

‘iFrameEmbedTile1’,

550,

350);

EmbedTile(

‘https://app.powerbi.com/embed?dashboardId=9153d5……365207&tileId=c69f153c…….9336bb’,

‘iFrameEmbedTile2’,

550,

350);

};

</script>

[/code]

Feel free to reuse it or see it in action here: Two tiles in the same page