MasterClass Big Data – Análisis predictivo con Azure Machine Learning Studio

## Statistics 101 for DB Dummies

*SQL Cheat sheet for The “AVERAGE” DB Guy*

A few years ago an article on CNBC called data analysis “The Sexiest Job of the 21st Century”. I wouldn’t be so optimistic, but it is certainly more and more in demand by the day. There is high demand for big data professionals, ETL specialists, data architects and statisticians. Traditionally, and until not long ago, database developers and administrators were considered about as far from “sexy” as one could possibly be; nowadays, as the article recalls, “There are a lot of folks who can spell Hadoop and put it on their résumé and call themselves data scientists”.

Don’t believe me, just watch: The Best Jobs of 2016, Statistician is ranked #2, only surpassed by Data Scientist at #1.

The first non-IT related job in the list is Audiologist, #4.

This article is an attempt to fill the gaps between what we historically know as data manipulation language (DML) and the basic concepts of Statistics. If you google “what does a statistician do”, the answer looks very similar to what I do day in and day out, except for the *statistical method* part.

So my suggestion to all the SQL people out there is to become acquainted with the terms and functions most used in Statistics, so that you don’t need to install and learn R just to understand whether a column in a table is skewed to the left. I have friends and colleagues who are statisticians whom I respect but often misunderstand, and thus regard as some sort of alchemists. So this article aims to reduce the distance between the “Stats Guy” and the “DB Guy”, the target audience being a person who loves data, is experienced in looking at it as a collection of tables and columns, and wants to get the best out of it without having to learn a statistical software package.

More specifically, being a MS SQL Server developer, I’ll try to explain some concepts in MS SQL terms. Other RDBMSs can have slightly different syntax or lack some functionality, but the concepts are applicable to most of them.

Having said that, imagine a dialogue like this in an unspecified company of an unspecified sector…

**Statistician**: Ok, so: how is the data distributed?

**DB admin**: I receive a CSV every week, then I email a PDF to all managers

**Statistician**: I know that, but I asked about the distribution of the data, is it left-skewed, right-skewed…

**DB admin**: Actually it’s…ehm…from the union workers…so I guess a little…leftist?

**Statistician**: Oh, please, be serious: did you make a histogram already?

**DB admin**: SQL Server does that automatically, with DBCC…so I don’t need to

**Statistician**: Ok. Guess I’d better ask Debbie C.C. then. What about deviation? Did you observe…

**DB admin**: Well that’s rather intimate…I have a…Batman outfit I like to wear but my gir…

oh! OH! NO,..sorry: you want STDEVP(), right? Oh! I just…sorry…for a second…

**Statistician**: You’re a mean person, you know it?

**DB admin**: No…I’m just an average guy. If you mean MEAN, ask Fred, the Mongo guy, next cubicle.

In order to smooth relationships, and make the world a better place, here is a quick and dirty series of functions and calculations to help the DB admin better serve the Statistician.

/* All formulas have been double checked against Excel and R, so I’m quite confident that calculations are reliable and ready to use. */

## Prepare the data

Let’s start by creating a table of numbers. For the purpose of this article we don’t know where they came from or what they represent; from a DB perspective, they are just another table with several numerical columns.

Download this script and execute it

*In all examples I will use the column Y1 and format the output as a DECIMAL(38,20). This is not strictly necessary, but very useful when dealing with very small numbers or when you need to check the results against other tools like R.*

## Mean, Median, Mode

These are fairly simple and probably the most basic terms in Stats, so we won’t spend much time on them.

### Mean

[Mean](https://en.wikipedia.org/wiki/Arithmetic_mean) is what we call the average; there is an aggregate function for it, AVG(), that we know all too well.

[code language="sql"]
SELECT
    'Mean' AS 'Descriptive Statistics'
    ,CAST(AVG(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
[/code]

### Median

[Median](https://en.wikipedia.org/wiki/Median) is the number in the middle when you sort the results; if there are two numbers in the middle, it is the average of the two. ORDER the query BY the column value ASC; if the number of rows n is odd, pick the value in row (n/2)+1 (integer division), otherwise average rows n/2 and (n/2)+1.

[code language="sql"]
WITH S1 AS
(SELECT
    COUNT(Y1) / 2 c
    ,COUNT(Y1) % 2 r
 FROM lab.ENB2012)
SELECT
    'Median' AS 'Descriptive Statistics'
    ,CAST(CASE
        WHEN r = 0 THEN
            (SELECT AVG(t1.Y1)
             FROM (SELECT ROW_NUMBER() OVER (ORDER BY Y1 ASC) N, Y1
                   FROM lab.ENB2012) t1
             WHERE N IN (c, c + 1))
        WHEN r = 1 THEN
            (SELECT t1.Y1
             FROM (SELECT ROW_NUMBER() OVER (ORDER BY Y1 ASC) N, Y1
                   FROM lab.ENB2012) t1
             WHERE N IN (c + 1))
     END AS DECIMAL(38, 20)) AS Y1
FROM S1
[/code]
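In the spirit of double-checking against other tools, the same even/odd logic can be sketched in a few lines of Python; the sample values here are made up for illustration, not taken from the ENB2012 table:

```python
def median(values):
    """Median with the same logic as the T-SQL above: sort, then
    pick the middle row (odd n) or average the two middle rows (even n)."""
    s = sorted(values)
    n = len(s)
    c, r = n // 2, n % 2          # same as COUNT/2 and COUNT%2 in the CTE
    if r == 1:
        return s[c]               # row (n/2)+1, zero-based index c
    return (s[c - 1] + s[c]) / 2  # rows n/2 and (n/2)+1

print(median([3, 1, 2]))      # 2
print(median([4, 1, 3, 2]))   # 2.5
```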

### Mode

[Mode](https://en.wikipedia.org/wiki/Mode_(statistics)) is the number that appears most frequently in a result set.

COUNT() the occurrences of each value, ORDER BY the count DESC and pick the first row

[code language="sql"]
SELECT
    'Mode' AS 'Descriptive Statistics'
    ,CAST((SELECT TOP 1
               Y1
           FROM lab.ENB2012
           GROUP BY Y1
           ORDER BY COUNT(Y1) DESC)
     AS DECIMAL(38, 20)) AS Y1
[/code]
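The same GROUP BY / ORDER BY COUNT DESC / TOP 1 idea is a one-liner in Python with the standard library (again on a made-up sample):

```python
from collections import Counter

def mode(values):
    """Most frequent value: the equivalent of
    GROUP BY ... ORDER BY COUNT(*) DESC ... TOP 1."""
    return Counter(values).most_common(1)[0][0]

print(mode([1, 2, 2, 3, 2, 3]))  # 2
```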

## Variance, Standard Deviation, Standard Error

Having calculated the Mean (AVG) of a set (column), the next step is verifying whether the numbers in the set are close to each other or very spread out.

### Variance

[Variance](https://en.wikipedia.org/wiki/Variance) is a measure of the distance between the numbers in the set and the Mean. Technically it is obtained by subtracting the Mean from each value, squaring the differences and averaging the squares. Easier done than said: there is an aggregate function for that.

[code language="sql"]
SELECT
    'Sample Variance' AS 'Descriptive Statistics'
    ,CAST(VAR(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
[/code]

### Standard Deviation

[Standard Deviation](https://en.wikipedia.org/wiki/Standard_deviation) is the square root of the Variance, as easy as it sounds. And, yes, there’s a function for it.

[code language="sql"]
SELECT
    'Standard Deviation' AS 'Descriptive Statistics'
    ,CAST(STDEV(Y1) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
[/code]

- Please note that there are two versions of the Variance and Standard Deviation functions: VAR/VARP and STDEV/STDEVP. See https://en.wikipedia.org/wiki/Bessel%27s_correction for details; generally speaking, we use the P version when we are sure the data corresponds to the entire “population”, i.e. we are sure we are using all the numbers we can possibly have. This distinction matters in statistics because most of the time the numbers represent only a fraction of the population (hence “sample”).
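Python’s standard statistics module makes the same sample/population distinction, which is handy for checking which flavour a given tool is using (the data below is a made-up sample, not the ENB2012 column):

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# sample variance (divides by n-1), the counterpart of T-SQL VAR()
print(statistics.variance(data))    # ≈ 4.5714
# population variance (divides by n), the counterpart of T-SQL VARP()
print(statistics.pvariance(data))   # 4.0
```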

### Standard Error of the Mean

[Standard Error of the Mean](https://en.wikipedia.org/wiki/Standard_error#Standard_error_of_the_mean) is the Standard Deviation divided by the square root of the number of rows. It is an indicator of “how much I may be wrong with this…”

[code language="sql"]
SELECT
    'Standard Error' AS 'Descriptive Statistics'
    ,CAST(STDEV(Y1) / SQRT(COUNT(Y1)) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
[/code]
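The same one-line formula, sample standard deviation divided by the square root of the row count, cross-checked in Python on a made-up sample:

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# STDEV(Y1) / SQRT(COUNT(Y1)) from the query above
sem = statistics.stdev(data) / math.sqrt(len(data))
print(round(sem, 6))  # 0.755929
```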

## Histograms

Histograms are a nice way of visualizing how the data is distributed (spread), how close the numbers are and whether there are any strange values too far from the rest (outliers) that can distort the results of the aggregate functions. Given a set of numbers, I want to have n containers, say 20 bins. Each bin has a min and a max boundary, and I put the numbers that fall between the boundaries into the corresponding bin. The difference between the max and min of each container is the same and the bins -goes without saying- must be contiguous.

Look at the picture above; it was produced with the R instruction:

hist(Y1, breaks=20, labels=TRUE)

By the way, R only plots 19 bins, ignoring my proposal of 20; it probably realized I’m not a Stats Guy…

In SQL,

- set a variable to the number of desired bins,
- get the min and max of the entire column (I had to round those to mimic R’s behaviour),
- calculate the binsize as (max - min) / number of bins, using the next integer value if decimal,
- subtract the min from each value and divide the result by the binsize; use the next integer value as the bin number,
- COUNT all values GROUP BY bin number.

[code language="sql"]
DECLARE @bins INT = 20;

WITH S1 AS
(SELECT
    CAST(FLOOR(MIN(Y1)) AS DECIMAL(38, 20)) AS minimum
    ,CAST(CEILING(MAX(Y1)) AS DECIMAL(38, 20)) AS maximum
    ,CASE
        WHEN FLOOR((CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins)
          <> CEILING((CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins)
        THEN ROUND((CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins, 0)
        ELSE (CEILING(MAX(Y1)) - FLOOR(MIN(Y1))) / @bins
     END AS binsize
 FROM lab.ENB2012)
SELECT
    bin
    ,COUNT(*) frequency
FROM (SELECT
        Y1
        ,CEILING(CAST(((Y1 - minimum) / binsize) AS DECIMAL(38, 20))) AS bin
      FROM lab.ENB2012
      CROSS JOIN S1) T1
GROUP BY T1.bin
ORDER BY T1.bin
[/code]
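The binning steps listed above can be sketched in Python as well; this is a simplified version of the same idea (it rounds the binsize up, following the prose, rather than reproducing the T-SQL CASE exactly), on arbitrary sample data:

```python
import math
from collections import Counter

def histogram(values, bins=20):
    """Bin counts following the steps above: floor/ceil the column
    min and max, round the binsize up to an integer, then assign
    each value to bin ceil((v - min) / binsize).
    Note: the minimum itself lands in bin 0, mirroring the T-SQL."""
    lo = math.floor(min(values))
    hi = math.ceil(max(values))
    binsize = math.ceil((hi - lo) / bins)   # "next integer value if decimal"
    counts = Counter(math.ceil((v - lo) / binsize) for v in values)
    return dict(sorted(counts.items()))

print(histogram(range(11), bins=5))  # {0: 1, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2}
```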

Another way of creating histograms, if you already know the boundaries, would be one very big CASE statement: not very elegant, but quick. Refer to the articles below for more ideas:

https://technet.microsoft.com/en-us/library/aa224898(v=sql.80).aspx

https://sqlsunday.com/2014/09/14/histogram-in-t-sql/

## Skewness, Kurtosis

Looking at the density of the values (imagine we draw a line chart instead of bars in the histogram) we can tell from the line whether the values are concentrated more to the right or to the left of the Mean (blue vertical line).

According to this image, we say the distribution is skewed to the right: the right slope is less steep than the left one.

### Skewness

[Skewness](https://en.wikipedia.org/wiki/Skewness) is a value indicating which way the distribution is leaning: greater than zero = tail to the right, less than zero = tail to the left.

[code language="sql"]
WITH S1 AS
(SELECT
    AVG(Y1) AS mean
    ,STDEV(Y1) AS stdev
    ,COUNT(Y1) * 1.0 / (COUNT(Y1) - 1) / (COUNT(Y1) - 2) AS corr
 FROM lab.ENB2012)
SELECT
    'Skewness' AS 'Descriptive Statistics'
    ,CAST(SUM(((Y1 - mean) / stdev) * ((Y1 - mean) / stdev) * ((Y1 - mean) / stdev)) * MIN(corr) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
CROSS JOIN S1
[/code]
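The same calculation translates almost word for word into Python: the correction factor is the n/((n-1)(n-2)) computed in the CTE, which to my knowledge is the formula behind Excel's SKEW. The test values are arbitrary:

```python
import statistics

def skewness(values):
    """Sample skewness with the same n/((n-1)(n-2)) correction
    factor computed in the CTE above."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)            # sample stdev, like STDEV()
    corr = n / ((n - 1) * (n - 2))
    return corr * sum(((x - mean) / s) ** 3 for x in values)

print(skewness([1, 2, 3]))         # 0.0 (perfectly symmetric)
print(skewness([1, 2, 3, 4, 10]))  # positive: the tail is on the right
```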

### Kurtosis

[Kurtosis](https://en.wikipedia.org/wiki/Kurtosis) is the number indicating whether the shape of the distribution line is fat and wide or tall and thin. It has a name that sounds like a skin imperfection but is actually a measure of the obesity of your chart (sorry…). Negative numbers = the curve of the line is low and wide, positive numbers = the curve is high and thin. And when it is zero? Congratulations, you have a perfect Gaussian shape, whatever that means.

[code language="sql"]
WITH S1 AS
(SELECT
    AVG(Y1) AS mean
    ,STDEV(Y1) AS stdev
    ,COUNT(Y1) * 1.0 / (COUNT(Y1) - 1) / (COUNT(Y1) - 2) AS corr
    ,COUNT(Y1) * 1.0 * (COUNT(Y1) + 1) / (COUNT(Y1) - 1) / (COUNT(Y1) - 2) / (COUNT(Y1) - 3) AS corr2
    ,3.0 * SQUARE((COUNT(Y1) - 1)) / (COUNT(Y1) - 2) / (COUNT(Y1) - 3) AS subfact
 FROM lab.ENB2012)
SELECT
    'Kurtosis' AS 'Descriptive Statistics'
    ,CAST(SUM(SQUARE(SQUARE(((Y1 * 1.0 - mean) / stdev)))) * MIN(corr2) - MIN(subfact) AS DECIMAL(38, 20)) AS Y1
FROM lab.ENB2012
CROSS JOIN S1
[/code]
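And the Python counterpart, with the same corr2 and subfact corrections as the CTE; note the formula needs at least 4 values or the denominators hit zero. Again, the test values are arbitrary:

```python
import statistics

def kurtosis(values):
    """Excess kurtosis with the same corr2 and subfact
    corrections as the T-SQL above. Requires len(values) > 3."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    corr2 = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    subfact = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return corr2 * sum(((x - mean) / s) ** 4 for x in values) - subfact

print(kurtosis([1, 2, 3, 4, 5]))  # ≈ -1.2: low and wide (platykurtic)
```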

## Standing on the Shoulders of Giants

I must thank the following articles, and many more I don’t remember:

* https://technet.microsoft.com/en-us/library/aa224898(v=sql.80).aspx

* https://sqlsunday.com/2014/09/14/histogram-in-t-sql/

* https://www.mathsisfun.com/data/correlation.html

* https://ayadshammout.com/2013/11/30/t-sql-linear-regression-function/

I won’t claim this was an exhaustive article, nor an introduction to Stats; it was just a teaser…

I believe you now have the basics to survive a two-and-a-half-minute conversation with a Statistician; but if you liked what I’ve written, there’s plenty more where that came from. I suggest enrolling in one of the publicly available MOOCs on the web; there is a lot of good content out there.

If you fancy buying a book, you may want to consider one of these two to begin with:

Naked Statistics: Stripping the Dread from the Data

The Cartoon Guide to Statistics

## Conclusion

**Statistician:** (walks away whispering) …”lazy Disk Jockey”

**DB admin:** (resumes Warcraft game and thinks) …”selfish Coin Flipper”

## Power BI publish to web now in preview

Without JavaScript. No, seriously…

According to sources from Microsoft, this newest feature will allow you to publish interactive reports to any web page that supports iframes. I haven’t tried it yet, but it looks very promising. It’s not clear whether MS will eventually charge for this once it is out of preview.

Meanwhile, we can impress our customers and friends by embedding self-refreshing reports on our blogs and websites. For free.

See https://powerbi.microsoft.com/en-us/blog/announcing-power-bi-publish-to-web/ for details.

Happy embedding!

## Filter a Power BI tile on your web page

EDIT 2016-07-12: Microsoft has changed their API, policies and functionality, so this article is obsolete now. I will shut down the sample site. For details, please see: https://powerbi.microsoft.com/en-us/blog/what-s-new-and-what-s-next-for-power-bi-embedded-july-2016/

======================================================================================================================

I received a lot of feedback regarding my two recent posts: http://moraschi.com/2015/12/08/embed-multiple-power-bi-tiles-into-your-web/, and http://moraschi.com/2015/11/30/use-a-power-bi-tile-in-your-web-page/

So I decided to continue in my exploration of the Power BI API. The next natural step in creating customized dashboards is the ability to dynamically filter the tiles based on user interaction. As of today, this is something you cannot achieve in https://app.powerbi.com (you can only filter using selectors in Reports).

There is a post here with very little information about a parameter of the embed API, namely **$filter**. We can use this parameter much like its OData cousin, with the same syntax.

&$filter={tablename/fieldname}+eq+'{literal constant}'

I tested it with string parameters and it works. I did not have the same luck with dates. Everything will eventually fall into place, as this API is, as I understand it, still under development.

I created a sample web page trying to keep it very simple, to demonstrate how this works. You can find it here and you are welcome to help yourself reusing my code and adapting it to your needs. In the example you will see a list box, clicking on it will trigger a refresh of the tile with the appropriate results.

Happy coding!

## Embed multiple Power BI tiles into your Web

EDIT 2016-07-12: Microsoft has changed their API and policies and functionality, so this article is obsolete now. I will shut down the sample site. For details, please see: https://powerbi.microsoft.com/en-us/blog/what-s-new-and-what-s-next-for-power-bi-embedded-july-2016/

Code re-usability is at the core of programming best practices and Business Intelligence is no exception.

No one wants to refactor thousands of lines of similar code whenever there’s a change in business rules.

In my previous post Use a Power BI Tile in your web page we saw a technique that allows us to use a Power BI tile in a custom web page. More often than not, our HTML documents are composed of more than one grid or chart, so we need a flexible way to insert multiple tiles on the same page without rewriting the same JavaScript procedures over and over.

I reworked the previous example to make it easier and created this simple function:

[code language="javascript"]
<script type="text/javascript">
function EmbedTile(sTileUrl, sFrameName, iWidth, iHeight) {
    var iframe = document.getElementById(sFrameName);
    iframe.src = sTileUrl + "&width=" + iWidth + "&height=" + iHeight;
    iframe.onload = function () {
        var m = { action: "loadTile", accessToken: accessToken, height: iHeight, width: iWidth };
        message = JSON.stringify(m);
        iframe.contentWindow.postMessage(message, "*");
    };
}

window.onload = function () {
    EmbedTile(
        'https://app.powerbi.com/embed?dashboardId=9153d5…..365207&tileId=2aef8……913edc524353',
        'iFrameEmbedTile1',
        550,
        350);
    EmbedTile(
        'https://app.powerbi.com/embed?dashboardId=9153d5……365207&tileId=c69f153c…….9336bb',
        'iFrameEmbedTile2',
        550,
        350);
};
</script>
[/code]

Feel free to reuse it or see it in action here: Two tiles in the same page

## Use a Power BI Tile in your web page

EDIT 2016-07-12: Microsoft has changed their API, policies and functionality, so this article is obsolete now. I will shut down the sample site. For details, please see: https://powerbi.microsoft.com/en-us/blog/what-s-new-and-what-s-next-for-power-bi-embedded-july-2016/

Creating dashboards is more witchcraft than art and more art than science. Think about how many times you have banged your head to produce a useful chart according to the specs of your client. You have the numbers, you have the chart and everything in place but the customer keeps saying: “it’s just not the right shade of green…”

Many times we have to create custom web pages to accomplish what the BI solution can’t offer out of the box; and Power BI provides just that: the ability to use a tile from a dashboard without forcing the customer to open the entire Power BI user interface. With this technique you can embed just a piece of a bigger BI report into an HTML page. It’s not very straightforward (at least for me it wasn’t), and I must say that the documentation often makes it worse (there’s an error in the GET REST Uri of **Step 3 – Get user’s tile information**).

So you want to embed a Power BI Tile into a web page: to quickly see how it works go to this example page. If you want more details, just read on.

I assume you already have a Power BI dashboard and you know how to get an access token. This is probably the most difficult part and I covered that in another article.

I am using cURL from a headless EC2 server just for the sake of simplicity; you can do this with your favorite programming language.

The first step is to get the ID of your dashboard:

[code]

curl -k -X GET "https://api.powerbi.com/beta/myorg/dashboards" -H "$AUTH_HEADER_LOCAL"

[/code]

where $AUTH_HEADER_LOCAL is the “Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiI……..” you created when asking for an access token.

The server will respond something like this:

From here you need to extract the id element which represents your dashboard.

The second step is to get the tile ID:

[code]

curl -k -X GET https://api.powerbi.com/beta/myorg/dashboards/${DASHBOARD_ID}/tiles -H "$AUTH_HEADER_LOCAL"

[/code]

where $AUTH_HEADER_LOCAL is the same as above and ${DASHBOARD_ID} is the value you got in step 1.

The answer will include an element named “embedUrl”, which is the link to your tile.

Now that you have the embedUrl and the access token, you can build a simple HTML page:

- Create an **<iframe>** element that will host your tile
- Have your access token ready in a variable; I personally keep it in an external .js file that I load as a <script> inside the page
- Add a **<script>** element to activate the **<iframe>** with the access token and the dimensions of the tile

[code language="javascript"]
<script type="text/javascript">
var width = 800;
var height = 600;
var embedTileUrl = "https://app.powerbi.com/embed?dashboardId=1d555f40-…..2618e&tileId=8cd8bc5….7624a";

window.onload = function () {
    updateEmbedTile();
};

function updateEmbedTile() {
    if ("" === embedTileUrl)
        return;
    iframe = document.getElementById('iFrameEmbedTile');
    iframe.src = embedTileUrl + "&width=" + width + "&height=" + height;
    iframe.onload = postActionLoadTile;
}

function postActionLoadTile() {
    if ("" === accessToken)
        return;
    var h = height;
    var w = width;
    var m = { action: "loadTile", accessToken: accessToken, height: h, width: w };
    message = JSON.stringify(m);
    iframe = document.getElementById('iFrameEmbedTile');
    iframe.contentWindow.postMessage(message, "*");
}
</script>
[/code]

Note that the access token will expire after 1 hour, so you need to have a way to refresh it on the server side.

Feel free to use http://powerbitile.azurewebsites.net/ as a reference.

## Pin an Excel chart/pivot table to a Power BI dashboard

According to this post on the official Power BI blog it is now possible to bring your Excel file into a Power BI dataset as a whole (not only importing data); and use portions of it as tiles on your dashboard.

Wow: you can (re)use your carefully crafted charts and amazingly engineered pivot tables in a web-shareable software as a service, not bad.

This is a very interesting feature and adds a range of new possibilities for this recent Business Intelligence Saas.

Now, a few considerations.

- To accomplish this you need to link an Excel file from OneDrive and the pin-to-the-dashboard option is only available to OneDrive for Business subscribers. So no free beers, sorry.
- To pin a chart you actually need to select a range of cells where the chart resides and it will be pinned as an image, much like a print screen, so do not dream of interaction with the rest of the selectors/filters on the dashboard. See what I mean in this picture:

- Your chart/table/range will appear in the dashboard as a **<div>** with a **base-64 encoded data:image/png** background

I did not try to embed this tile in a web page, but it could be the next logical step to do.

So, overall, a very desirable addition. Prices for OneDrive for Business start at 5 USD per user/month, and if you are a non-profit/government/university you can have it even cheaper.

## Twitter Trending Topics by Location, Date (PowerBi WordCloud)

One of the things I like the most about Power BI is the ability to use custom visuals. I did not try building one myself, and honestly I think it’s beyond my skills, but there is a gallery with a growing number of pre-built viz that you can download and use.

Go to https://app.powerbi.com/visuals?WT.mc_id=Blog_Visuals and take a look.

I downloaded the WordCloud plus the Chiclet Slicer to create a report showing the trending topics in Twitter sliced by date and location.

Only a few locations are available from the Twitter API; here is a sample of Spanish cities I am monitoring.

Happy tweets!

## Microsoft Azure SQL Data Warehouse. How much does it cost?

As many of you know, Microsoft has opened its latest BI technology to public preview: Azure SQL Data Warehouse.

On the documentation page it is defined as: *“an enterprise-class distributed database capable of processing petabyte volumes of relational and non-relational data.”*

How much does it cost, is it worth it, and how does it differ from previous SQL offering on Azure?

The main difference -in my opinion- is the billing of compute power and storage:

in Azure SQL DW you decide how fast your queries run and how big your data is. This generates a charge for __DWU (Data Warehouse Unit)__, a unit of measure defined as the level of performance of your data warehouse, and a charge for __RA-GRS Page Blobs__ (Read-Access Geo-Redundant Storage).

I suppose this will cause more than one headache in finance departments and among SQL administrators, but it is indeed a good thing.

The main point I see here is that -wisely used- the DWU pricing can help tune the database responsiveness and number-crunching speed according to the time of day (or day of the week). I can set, for example, 100 DWU during low-traffic hours or weekends, and scale up to 2000 DWU on Monday morning or during the reporting season when everybody wants to access data. I can even choose to pause the database -for example during holidays- and incur no compute charge at all. It would be extremely useful to have PowerShell scripts to switch from one DWU level to another and schedule them so that the database auto-adjusts the compute power needed.

Doing some very simple math: assuming a 30-day month, at 100 DWU we will receive a $504 (USD) bill for compute + $62.5 (USD) for 1 TB of data, a total of $566.5* for a fully functioning SQL Data Warehouse. On the other end of the spectrum, 2000 DWU for 1 month would be $10,080 (USD)*.
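The arithmetic behind those figures can be reproduced in a few lines; the hourly rate variable below is an assumption derived from the $504/month figure (preview prices change, so check the pricing page for current numbers):

```python
# Assumed preview rates, back-calculated from the article's figures;
# check the Azure pricing page for the numbers in force.
DWU_RATE_PER_100_PER_HOUR = 0.70   # USD per 100 DWU per hour
STORAGE_PER_TB_PER_MONTH = 62.5    # USD, RA-GRS page blobs

def monthly_cost(dwu, tb, days=30):
    """Compute charge (per DWU-hour) plus storage charge (per TB-month)."""
    compute = (dwu / 100) * DWU_RATE_PER_100_PER_HOUR * 24 * days
    storage = tb * STORAGE_PER_TB_PER_MONTH
    return compute + storage

print(monthly_cost(100, 1))    # 566.5
print(monthly_cost(2000, 0))   # 10080.0
```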

In addition to the grow, shrink, or pause capabilities, Azure SQL Data Warehouse features an MPP architecture, columnstore indexes, and the ability to query relational and non-relational data (read: Hadoop) with PolyBase; more on this __here__.

So my .02: let’s give it a try…

*During the preview period the prices are discounted. I do not expect them to vary that much anyway, but at the moment it is not clear how much the discount is or what the price will be after the preview; see here for more details: https://azure.microsoft.com/en-us/pricing/details/sql-data-warehouse/

## See-through MicroStrategy (playing with transparencies)

One thing that I like about MicroStrategy is the ability to create Web documents. Nowadays self-service discovery technologies (being based on Flash or HTML5) tend to constrain the “creative” side of data analysis a little: trading off ease of use for a quite limited set of tools to picture your data. Boundaries are built and it is difficult to bend the rules.

Web documents do not have such restrictions and -in fact- are almost limitless in possibilities. Think about the HTML container that allows us to insert custom HTML/JavaScript, or the image container.

I believe that Business Intelligence is about information as much as about design, and -surely enough- a nice looking dashboard is always preferable to a boring dull one.

A fundamental aspect of modern day web pages is CSS Opacity / Transparency, so why not use it in a dashboard to spice it up? The components and widgets that come with MicroStrategy allow a very limited use of “No Fill” in the background: you can create a transparent Chart or Panel, but not a transparent rectangle. So I simply created a series of white PNGs with different *alpha* values that I can place on top of a background picture to simulate a see-through rectangle.

Then I downloaded a background picture from Flickr (careful with the CC license…) and removed the main panel of a blank dashboard.

You can see the result here, where a Web document turns to A Whiter Shade Of Pale.

You can download the transparent pngs here.