The institution where I work needs to analyze data from different sources with different statistical and modeling packages.
The way that the data is stored at the moment is baroque: tab or comma delimited ASCII data files, Excel sheets, and word of mouth. Multiple copies of the same data set may be stored on the network and -even worse- different and updated version can coexist. Result data sets are stored on local hard drives and are not visible to everyone. This can lead to embarrassing situations were at the same meeting two or more pie-chart of the same data set do not agree with each other.
So the aim of these posts is trying to find an elegant database solution to this: a way to build a common repository for different data sets from different softwares so that data can be easily exchanged and the history of the changes can be visible to everyone involved in data analysis. And -much better- a new interface will help maintain, correct, cleanse the various data sets.
The data sets can be about anything, from pollution data, to GDP, to fuel consumption or demographics. All data sets have both cross-sectional and time series dimensions (see Christopher Dougherty “Introduction to Econometrics 3e” book, for a good introduction to PANEL or LONGITUDINAL data sets)*.
There are different modeling packages on the market; in my job we use both GAMS® and VENSIM®. Also STATA® is used sometimes for specific functions. Every package has its own terminology and give different names to the same entity: for example VENSIM® uses “Subscripts” while GAMS® has “Sets” but they both refer to the same thing; this lead to confusion among people using different tools who need to share their data sets.
Take this for example: the last column is a value, column number 4 looks like a time indicator. Now look at column 3; it can contain one of the following: (xoc, hkg or jpn). The list of all possible codes for column 3 is called “Subscript Range” in VENSIM® and “Set” in GAMS®.
Look at the first data row, it could be written this way:
TSTRAD[mmpdr, aus, xoc, Y1996]=12
So how do we call TSTRAD? We may call it a variable, and what’s the name of the codes inside parentheses? Subscripts, dimensions, indexes? There is a need for a common terminology across different software tools. So from now on we will call them variable and dimension respectively, and domain the list of all possible codes inside a column.
Another issue we are facing is that different data sets have different number of columns, and the number is not known “a priori”. If I want to store the example table above in a database I could easily create a table with 5 columns and set the primary key to the combination of the first four columns. But if I need to store another data set with 3 or 7 columns then I have to modify my table or create another table, leading to poor design and confusion in the names of the columns. Also, do I need to normalize this data set? For storage and speed sake I’d better replace all descriptive columns with numerical codes. Giving for example xoc a code of 1, hkg = 2, and jpn = 3, I can spare several bytes every row and make faster searches. Then I would need an extra reference table to explain what numerical codes mean.
In order to facilitate the table design and maintenance I will need a flexible table to store a variable number of descriptive columns and their respective value. VENSIM® and GAMS® have a limit on the number of columns for a given data set, this limit is 10 for both. We can then create a table with 11 fields and have a coffee break -end of the story- but we wouldn’t be satisfied, would we?