Issue #3: On the old figure and table "titles", data content of modern data cubes, and GSIM

Among the concepts available in GSIM, I found only a generic "measure" concept in Data Structures, consisting, according to the GSIM Specification document, of "Represented Variables with uncoded Value Domains (Described Value Domains)".

In my opinion, a generic data cube would benefit from having one measure dimension in its data structure organized like the title of a table or figure in a paper document, with the same attention to the possible users of the data. Table and figure titles are usually written so that the statistical content is immediately comprehensible to any reader, and they are essentially complete, so that a reader should not even need to look at the table metadata (column names, code lists, etc.) to understand the meaning of the data. However, tables and figures in a paper document contain a small amount of data, usually quite homogeneous, which is the opposite of what is found in data cubes. Hence a data cube title is not fit for this purpose, and there should be a place where the content of the data is described in a complete and comprehensible way.

I tried to summarize my point of view at the last METIS meeting in Geneva, May 2015. The idea is to organize a unique measure dimension in a data structure, consisting of a code list of data contents. The data contents should follow the usual rules that let a reader understand what is being measured. I identified two kinds of data content for data structures consisting of macrodata; together they cover all the data available in the Istat corporate DWH up to now.

1. Macrodata consisting of the computation of a parameter of a distribution on one microdata set (means, totals, percentages, Gini indexes, variances, maximums, minimums, medians, ..., all of these possibly multivariate and conditional on some categorical variables). In these cases the item of the measure dimension should be as precise as possible about the population of interest, the numerical variable under analysis, any conditioning categorical variables, and the statistical transformation used to pass from microdata to macrodata.
2. Macrodata consisting of (usually) comparisons between other macrodata, as in ratios, index numbers, balances, ... The data content should be clear about the comparison tool used, with a definition that illustrates what operation is performed.

Each data content should also be complemented by a list of attributes that make it (the output) linkable to its input(s) and to the statistical method used to transform inputs into outputs. These attributes depend on whether the macrodata are of the first or second type described above. For the first type, the attributes are: reference population, numerical variable, statistical operator, possible conditional categorical variables, unit of measure, unit multiplier. For the second type, they are: statistical operator, the list of data contents that are compared or aggregated (for instance, a ratio should show its numerator and denominator), unit of measure, unit multiplier, and base year (for index numbers). These attributes make searching much easier: for instance, it becomes easy to find all the outputs relating to a given population, or all the outputs that study a given statistical variable. Such search facilities are really useful when harmonizing metadata coming from different statistical processes.
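As an illustration only, the attribute lists above could be sketched as a small Python record, with the two search facilities written as simple filters. The class name `DataContent`, the field names, and the three sample entries are all hypothetical: they are not taken from GSIM, SDMX, or the Istat DWH, and a real code list would be much richer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DataContent:
    """One item of the measure dimension's code list (hypothetical layout)."""
    code: str
    label: str
    statistical_operator: str
    unit_of_measure: str
    unit_multiplier: int = 1
    # Attributes used by the first type (parameter of a distribution).
    reference_population: Optional[str] = None
    numerical_variable: Optional[str] = None
    conditional_variables: Tuple[str, ...] = ()
    # Attributes used by the second type (comparison of other data contents).
    inputs: Tuple[str, ...] = ()
    base_year: Optional[int] = None

    @property
    def kind(self) -> int:
        # A data content that lists input data contents is of the second type.
        return 2 if self.inputs else 1


# Invented sample entries, for illustration only.
CODE_LIST = [
    DataContent(code="POP_TOT", label="Total resident population",
                statistical_operator="total", unit_of_measure="persons",
                reference_population="resident population",
                numerical_variable="number of persons"),
    DataContent(code="EMP_TOT", label="Total employed residents by sex",
                statistical_operator="total", unit_of_measure="persons",
                reference_population="resident population",
                numerical_variable="number of employed persons",
                conditional_variables=("sex",)),
    DataContent(code="EMP_RATE", label="Employment rate",
                statistical_operator="ratio", unit_of_measure="percent",
                inputs=("EMP_TOT", "POP_TOT")),  # numerator, denominator
]


def find_by_population(code_list: List[DataContent], population: str) -> List[str]:
    """Codes of all outputs computed on the given reference population."""
    return [dc.code for dc in code_list if dc.reference_population == population]


def find_by_variable(code_list: List[DataContent], variable: str) -> List[str]:
    """Codes of all outputs that analyse or condition on the given variable."""
    return [dc.code for dc in code_list
            if dc.numerical_variable == variable or variable in dc.conditional_variables]
```

With these sample entries, `find_by_population(CODE_LIST, "resident population")` returns `['POP_TOT', 'EMP_TOT']`, while `EMP_RATE` is reachable through its `inputs`, which is what links an output of the second type back to the outputs it compares.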

Hence, a data content list allows users (including NSI managers) to find in a single place all the available outputs that characterize a data cube. Furthermore, search tools for comparing the metadata of different production processes can be built on it in order to tackle metadata harmonization. Finally, if all NSIs agreed to structure a data structure with a measure dimension and the attached data content list as described above, comparisons between data structures produced by different NSIs would be much easier, and the communication of data and metadata, e.g. by means of SDMX tools, would be dramatically simplified, because the mapping between two data structures would not have to be performed at the cell level but at the dimension level.
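The difference between mapping at the cell level and at the dimension level can be sketched as follows. This is a toy Python example under strong assumptions: the cube layout, the codes, and the values are invented, and no real SDMX tool or API is used.

```python
from typing import Dict, Tuple

# Hypothetical cube: cells keyed by (data content code, territory, year).
Cube = Dict[Tuple[str, str, int], float]

# Invented cube published by NSI "A"; the values are placeholders.
cube_a: Cube = {
    ("A_POP", "IT", 2014): 1.0,
    ("A_POP", "IT", 2015): 2.0,
    ("A_EMP", "IT", 2014): 3.0,
    ("A_EMP", "IT", 2015): 4.0,
}

# One entry per data content, not one entry per cell.
MEASURE_MAP: Dict[str, str] = {
    "A_POP": "B_POPULATION",
    "A_EMP": "B_EMPLOYED",
}


def remap_measure(cube: Cube, measure_map: Dict[str, str]) -> Cube:
    """Translate the measure dimension of every cell with one code-to-code map."""
    return {
        (measure_map[measure], territory, year): value
        for (measure, territory, year), value in cube.items()
    }


cube_b_view = remap_measure(cube_a, MEASURE_MAP)
```

Here two mapping entries are enough to translate four cells, and the mapping would not grow if more territories or years were added to the cube.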

More details can be found in the paper and slides presented at the METIS 2015 conference in Geneva, session (ii), contribution by Italy. If anyone is interested in our experience and would like to know more, I would be very happy to give more details. If anyone finds aggregate data or statistical outputs that do not fit the first and second types of data content roughly described above, I am extremely interested in these examples.

In fact, I intend to propose the data content as an additional concept to include in GSIM, in order to structure data structures appropriately. If anyone disagrees, this wiki is the right place to start a discussion.
