The Lahman's Baseball Database – real life sample data

Mr. Lahman is (in his own words) a watchdog reporter, data journalist, and author.
But the collective wisdom of the Business Intelligence world remembers him as the creator of a wonderful public dataset: the Baseball Database.

The Lahman database is extremely useful, and I am grateful to its author, for several reasons:

  1. It is not a bike/sport/equipment shop sales database, which almost all examples in the BI world use (my book is no exception…)
  2. It is real-life data with an interesting subject, and it is frequently updated.
  3. And it has errors (very few), those nice little imperfections that come in very handy when you are testing, training, or developing a test case or a proof of concept.
I don’t remember who said that PowerPoint is always right and nothing can go wrong with a slide presentation -as long as the projector is working and the USB drive is plugged in- but he would surely agree that it’s no fun to demo with a database that is perfectly crafted to return exact results.
I have never in my life found a “squeaky clean” production db. As a general rule there is always a missing primary key, an unenforceable foreign key, missing records, outdated documentation, you name it… that’s why I like the Lahman db: I can experiment -in a controlled environment- with how the software behaves in situations that are beyond the standard Show and Tell “look-ma-this-works” scenarios.
It is available for free and you can download it in several formats from here:
I converted it to Oracle and SQL Server, and you can download those versions too from here:
In Oracle, simply run IMPDP with the SCHEMAS=LAHMAN option; the password for the LAHMAN schema is (guess) LAHMAN.
In SQL Server, copy the MDF into the DATA folder and attach it to an existing SQL Server instance.
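The attach step can be sketched as a single T-SQL statement; the file path and database name below are placeholders, adjust them to your environment:

```sql
-- T-SQL sketch: attach the downloaded MDF (path and name are assumptions)
CREATE DATABASE LAHMAN
   ON (FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL\DATA\lahman.mdf')
   FOR ATTACH;
```

If the log file is missing, FOR ATTACH_REBUILD_LOG will let SQL Server rebuild it.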
Hope this helps; I will often use this data to showcase MicroStrategy and other products in my next posts.

Have you read the news today?

You may already know that Oracle has the capability to get data directly from the Web. Like in:

SELECT HTTPURITYPE.createuri ('http://www.economist.com/rss/leaders_rss.xml').getclob () RSS FROM DUAL;

This offers a lot of possibilities when you need to query external data and, combined with some XML magic, it can make a cute RSS view out of a well-formed XML feed:

CREATE VIEW RSS_READER
AS
   SELECT EXTRACTVALUE (VALUE (p), '/item/title') title,
          EXTRACTVALUE (VALUE (p), '/item/link') link_url,
          EXTRACTVALUE (VALUE (p), '/item/description') description
     FROM TABLE (
             XMLSEQUENCE (EXTRACT (xmltype (HTTPURITYPE.createuri ('http://www.economist.com/rss/leaders_rss.xml').getclob (),
                                            NULL,
                                            1,
                                            1),
                                   '/rss/channel/item',
                                   'xmlns:media="http://search.yahoo.com/mrss/"'))) p;

Nice and cheap: no need to store data in the RDBMS. Every time you access the view, the data is read from the Web. Just change the RSS URL according to your feed (you will probably need to rework the EXTRACT expressions as well).

Note: if you’re on Oracle 11g you may have to grant connection to the RSS feed domain through an ACL (ask your DBA for details…)
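For reference, a minimal sketch of such a grant using the 11g DBMS_NETWORK_ACL_ADMIN package; the ACL file name and the schema (SCOTT) are placeholder assumptions, substitute your own:

```sql
-- PL/SQL sketch: allow one schema to connect to the feed host (11g ACL model)
BEGIN
   DBMS_NETWORK_ACL_ADMIN.CREATE_ACL (
      acl         => 'rss_feeds.xml',          -- placeholder ACL name
      description => 'Outbound HTTP for RSS feeds',
      principal   => 'SCOTT',                  -- the querying schema (assumption)
      is_grant    => TRUE,
      privilege   => 'connect');
   DBMS_NETWORK_ACL_ADMIN.ASSIGN_ACL (
      acl  => 'rss_feeds.xml',
      host => 'www.economist.com');
   COMMIT;
END;
/
```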

Playing with DATE ranges

If you search the net for a trick to generate date ranges in Oracle, you will most probably end up here:

http://stackoverflow.com/questions/418318/generate-a-range-of-dates-using-sql

where you will find various interesting and mind-challenging ways to achieve the same result.

I found this way the most cost-effective and readable, yet flexible enough for my data warehouse ETL processes.

WITH LIMITS AS (SELECT TO_DATE ('01-JAN-2010', 'DD-MON-YYYY') BEGINDATE, SYSDATE + 10 ENDDATE FROM DUAL)
SELECT (BEGINDATE + LEVEL - 1) AS DATES
  FROM LIMITS
CONNECT BY LEVEL <= (ENDDATE - BEGINDATE)
/

You can modify the start and end dates and have a calendar in one SQL pass. Let’s go a little further: I always have reports which need to show the current week (WTD), for example:

WITH LIMITS AS (SELECT TRUNC (SYSDATE, 'DY') BEGINDATE, TRUNC (SYSDATE) ENDDATE FROM DUAL),
     CALENDAR AS (    SELECT (BEGINDATE + LEVEL - 1) AS DATES
                        FROM LIMITS
                  CONNECT BY LEVEL <= (ENDDATE - BEGINDATE))
SELECT DATES, TO_CHAR (DATES, 'Dy') DAYS_OF_WEEK
  FROM CALENDAR
/

An easy way to retrieve the list of days in the current week up until today. Fast, cost-effective, and easy to modify. And so on:

-- Last full week
WITH LIMITS AS (SELECT TRUNC (SYSDATE - 7, 'DY') BEGINDATE, TRUNC (SYSDATE, 'DY') ENDDATE FROM DUAL),
     CALENDAR AS (    SELECT (BEGINDATE + LEVEL - 1) AS DATES
                        FROM LIMITS
                  CONNECT BY LEVEL <= (ENDDATE - BEGINDATE))
SELECT DATES, TO_CHAR (DATES, 'Dy') DAYS_OF_WEEK
  FROM CALENDAR
/

-- Same day last week to date
WITH LIMITS AS (SELECT TRUNC (SYSDATE - 6) BEGINDATE, TRUNC (SYSDATE) ENDDATE FROM DUAL),
     CALENDAR AS (    SELECT (BEGINDATE + LEVEL - 1) AS DATES
                        FROM LIMITS
                  CONNECT BY LEVEL <= (ENDDATE - BEGINDATE + 1))
SELECT DATES, TO_CHAR (DATES, 'Dy') DAYS_OF_WEEK
  FROM CALENDAR
/

-- Last month
WITH LIMITS AS (SELECT TRUNC (ADD_MONTHS (SYSDATE, -1), 'MONTH') BEGINDATE, TRUNC (SYSDATE, 'MONTH') ENDDATE FROM DUAL),
     CALENDAR AS (    SELECT (BEGINDATE + LEVEL - 1) AS DATES
                        FROM LIMITS
                  CONNECT BY LEVEL <= (ENDDATE - BEGINDATE))
SELECT DATES, TO_CHAR (DATES, 'Dy') DAYS_OF_WEEK
  FROM CALENDAR
/

-- Year to date
WITH LIMITS AS (SELECT TRUNC (SYSDATE, 'YEAR') BEGINDATE, TRUNC (SYSDATE) ENDDATE FROM DUAL),
     CALENDAR AS (    SELECT (BEGINDATE + LEVEL - 1) AS DATES
                        FROM LIMITS
                  CONNECT BY LEVEL <= (ENDDATE - BEGINDATE + 1))
SELECT DATES, TO_CHAR (DATES, 'Dy') DAYS_OF_WEEK
  FROM CALENDAR
/

If you need to build a time dimension for your data warehouse, you will probably need more than that.

WITH LIMITS AS (SELECT TRUNC (SYSDATE, 'YEAR') BEGINDATE, TRUNC (SYSDATE) ENDDATE FROM DUAL),
     CALENDAR AS (    SELECT (BEGINDATE + LEVEL - 1) AS DATES
                        FROM LIMITS
                  CONNECT BY LEVEL <= (ENDDATE - BEGINDATE + 1))
SELECT DATES
      ,TO_CHAR (DATES, 'Dy') DAYS_OF_WEEK
      ,DATES - 7 ONE_WEEK_AGO
      ,ADD_MONTHS (DATES, -1) ONE_MONTH_AGO
      ,ADD_MONTHS (DATES, -12) ONE_YEAR_AGO
      ,TRUNC (DATES, 'Q') BEGIN_OF_QUARTER
      ,TRUNC (ADD_MONTHS (DATES, 3), 'Q') - 1 END_OF_QUARTER
  FROM CALENDAR
/

Check out http://programmerslounge.blogspot.com/2011/07/how-to-get-first-day-and-last-day-of.html for a list of more first/last attributes.

Chapter 2 – The Oracle VALUES table

An article I read some time ago gave me the idea of attaching an extra table to manage the variable number of columns in different data sets*. This article about dimensional modeling introduced the concept of a “helper table”. In a few words, I can create a single column and store a reference (a code, an ID) to another table which holds the real dimensions, no matter how many columns they are. I simply need to create a fake code for the dimension group and link the two tables on that code.
In the “helper table” I may have a row for every dimension, which would be the correct thing to do if we want to comply strictly with the normalization rules.
So the TSTRAD example from Chapter 1 and the helper table would be stored this way:
(Figures: the TSTRAD table and the corresponding helper table.)
But I am using Oracle 10g, and in this version there is no easy way to transform several rows into several columns later; say, for example, that I want to select all the dimensions in group 1 and display them as columns with their respective values. That would be a very complicated query involving DECODE and MAX and who knows what else. So I am taking advantage of an Oracle feature, the ability to store custom type objects inside table columns, and I create a data type to hold an array of 10 dimensions:

CREATE OR REPLACE TYPE T_VARCHAR2_25 AS VARRAY (10) OF VARCHAR2 (25);

This type will hold up to ten VARCHAR2 strings, and since varrays in Oracle are numbered sequentially I can always be sure that they are in the correct order. So the “helper table” becomes:

CREATE TABLE UNI_DIM_DIMENSIONS
(
   DIMENSION_ID       NUMBER,
   VA_ELEMENT_NAMES   T_VARCHAR2_25 NOT NULL,
   CONSTRAINT PK_UNI_DIM_DIMENSIONS PRIMARY KEY (DIMENSION_ID)
);

(Figure: the UNI_DIM_DIMENSIONS table.)

Why do I do this? Can’t I simply store the varray in the original VALUES table and get rid of the helper table? Yes, I can. I do this just in case two or more values happen to share the same exact dimension columns: in that case I will not need to duplicate the information, I will simply reuse the code of the proper dimension group and feel like an experienced database developer 🙂
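A quick sketch of how the varray is filled and read back; the dimension names below are made up for illustration:

```sql
-- store one dimension group as an ordered varray (element names are assumptions)
INSERT INTO UNI_DIM_DIMENSIONS (DIMENSION_ID, VA_ELEMENT_NAMES)
     VALUES (1, T_VARCHAR2_25 ('VARIABLE', 'COUNTRY', 'PARTNER', 'YEAR'));

-- unnest the varray back into rows, one per dimension, in stored order
SELECT d.DIMENSION_ID, t.COLUMN_VALUE AS ELEMENT_NAME
  FROM UNI_DIM_DIMENSIONS d, TABLE (d.VA_ELEMENT_NAMES) t
 WHERE d.DIMENSION_ID = 1;
```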

Chapter 1 – The beginning

The institution where I work needs to analyze data from different sources with different statistical and modeling packages.

The way the data is stored at the moment is baroque: tab- or comma-delimited ASCII data files, Excel sheets, and word of mouth. Multiple copies of the same data set may be stored on the network and -even worse- outdated and updated versions can coexist. Result data sets are stored on local hard drives and are not visible to everyone. This can lead to embarrassing situations where, at the same meeting, two or more pie charts of the same data set do not agree with each other.

So the aim of these posts is to find an elegant database solution to this: a way to build a common repository for different data sets from different software packages, so that data can be easily exchanged and the history of changes is visible to everyone involved in data analysis. And -much better- a new interface will help maintain, correct, and cleanse the various data sets.

The data sets can be about anything, from pollution data, to GDP, to fuel consumption or demographics. All data sets have both cross-sectional and time series dimensions (see Christopher Dougherty “Introduction to Econometrics 3e” book, for a good introduction to PANEL or LONGITUDINAL data sets)*.

There are different modeling packages on the market; in my job we use both GAMS® and VENSIM®. STATA® is also used sometimes for specific functions. Every package has its own terminology and gives different names to the same entity: for example, VENSIM® uses “Subscripts” while GAMS® has “Sets”, but they both refer to the same thing; this leads to confusion among people using different tools who need to share their data sets.

TSTRAD
mmpdr aus xoc Y1996 12
mmpdr aus xoc Y1999 2
mmpdr aus xoc Y2000 15
mmpdr aus xoc Y2001 2
mmpdr aus xoc Y2002 3
mmpdr aus hkg Y1979 1
mmpdr aus hkg Y1980 1
mmpdr aus hkg Y1981 3
mmpdr aus hkg Y1982 1
mmpdr aus jpn Y1994 133
mmpdr aus jpn Y1995 17
mmpdr aus jpn Y1996 47
mmpdr aus jpn Y1997 23
mmpdr aus jpn Y1998 13
mmpdr aus jpn Y1999 22

Take this for example: the last column is a value, and column number 4 looks like a time indicator. Now look at column 3; it can contain one of the following: xoc, hkg, or jpn. The list of all possible codes for column 3 is called a “Subscript Range” in VENSIM® and a “Set” in GAMS®.

Look at the first data row, it could be written this way:

TSTRAD[mmpdr, aus, xoc, Y1996]=12

So what do we call TSTRAD? We may call it a variable; and what is the name of the codes inside the brackets? Subscripts, dimensions, indexes? There is a need for a common terminology across different software tools. So from now on we will call them variable and dimension respectively, and domain the list of all possible codes inside a column.

Another issue we are facing is that different data sets have a different number of columns, and the number is not known “a priori”. If I wanted to store the example table above in a database I could easily create a table with 5 columns and set the primary key to the combination of the first four. But if I then need to store another data set with 3 or 7 columns, I have to modify my table or create another one, leading to poor design and confusion in the column names. Also, do I need to normalize this data set? For storage and speed’s sake I had better replace all descriptive columns with numerical codes. Giving, for example, xoc a code of 1, hkg = 2, and jpn = 3, I can save several bytes per row and make searches faster. I would then need an extra reference table to explain what the numerical codes mean.
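To make the code idea concrete, here is a minimal sketch of such a reference table; the table and column names are illustrative, not part of the actual design:

```sql
-- lookup table mapping numeric codes to descriptive values (names are assumptions)
CREATE TABLE DOMAIN_CODES
(
   CODE_ID     NUMBER        PRIMARY KEY,
   CODE_NAME   VARCHAR2 (25) NOT NULL
);

INSERT INTO DOMAIN_CODES VALUES (1, 'xoc');
INSERT INTO DOMAIN_CODES VALUES (2, 'hkg');
INSERT INTO DOMAIN_CODES VALUES (3, 'jpn');
```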

In order to facilitate table design and maintenance, I will need a flexible table that can store a variable number of descriptive columns and their respective value. VENSIM® and GAMS® have a limit on the number of columns for a given data set; this limit is 10 for both. We could then create a table with 11 fields and have a coffee break -end of story- but we wouldn’t be satisfied, would we?