Among the concepts available in GSIM, I found only a generic "measure" concept in Data Structures, consisting of "Represented Variables with uncoded Value Domains (Described Value Domains)", according to the GSIM Specification document.

In my opinion, a generic data cube would benefit from structuring one measure dimension in its data structure the way the title of a table or figure is written in a paper document, with the same attention to the possible users of the data. Table and figure titles are usually organised so that the statistical content is immediately comprehensible to any reader, and they are essentially complete, so that a reader need not even look at the table metadata (column names, code lists, etc.) to understand the meaning of the data. However, tables and figures in a paper document contain a small amount of data, usually quite homogeneous, which is the opposite of what is available in data cubes. Hence a data cube title is not fit for the purpose, and there should be a place where the content of the data is described in a complete and comprehensible way.

I tried to summarize my point of view at the last METIS meeting in Geneva, May 2015. The idea is to organize a single measure dimension in a data structure, consisting of a code list of data contents. The data contents should follow the usual rules that let a reader understand the meaning of what is measured. I identified two kinds of data content for data structures consisting of macrodata, which so far fit all the data available in the Istat corporate DWH.

  1. Macrodata consisting of the computation of a parameter of a distribution on one microdata set (means, totals, percentages, Gini indexes, variances, maximums, minimums, medians, ..., all of these possibly multivariate and conditional on some categorical variables). In these cases the item of the measure dimension should be as precise as possible about the population of interest, the numerical variable under analysis, any conditioning categorical variables, and the statistical transformation used to pass from micro to macro data.
  2. Macrodata consisting of (usually) comparisons between other macrodata, as in ratios, index numbers, balances, ... The data content should be clear about the comparison tool used and include a definition that states which operation is performed.

Each data content should also be complemented by a list of attributes that link it (the output) to its input(s) and to the statistical method used to transform inputs into outputs. These attributes depend on whether the macrodata are of the first or second type described above. For the first type, the attributes are: reference population, numerical variable, statistical operator, possible conditional categorical variables, unit of measure, unit multiplier. For the second type, they are: statistical operator, the list of data contents that are compared/aggregated (for instance, ratios should show the numerator and denominator), unit of measure, unit multiplier, base year (for index numbers). These attributes make searching much easier: for instance, it is easy to find all the outputs relating to a population, or all the outputs that study a statistical variable. Such search facilities are really useful when harmonizing metadata coming from different statistical processes.
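As an illustration only, the two kinds of data content and their attributes could be sketched as follows in Python. This is a minimal sketch, not part of GSIM or of the Istat DWH: the class names, function names, codes (`EMP_TOT`, `POP_TOT`, `EMP_RATE`) and attribute values are all hypothetical, chosen just to show how the attributes support searching by population and tracing an output back to its inputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass(frozen=True)
class ParameterContent:
    """First type: a distribution parameter computed on one microdata set."""
    code: str
    reference_population: str
    numerical_variable: str
    statistical_operator: str
    conditional_variables: Tuple[str, ...] = ()
    unit_of_measure: str = ""
    unit_multiplier: int = 1


@dataclass(frozen=True)
class ComparisonContent:
    """Second type: a comparison between other data contents."""
    code: str
    statistical_operator: str
    inputs: Tuple[str, ...]  # codes of the compared data contents
    unit_of_measure: str = ""
    unit_multiplier: int = 1
    base_year: Optional[int] = None  # only meaningful for index numbers


DataContent = Union[ParameterContent, ComparisonContent]


def find_by_population(registry: Dict[str, DataContent], population: str) -> List[str]:
    """Codes of first-type contents computed on the given population."""
    return [
        c.code for c in registry.values()
        if isinstance(c, ParameterContent) and c.reference_population == population
    ]


def base_contents(code: str, registry: Dict[str, DataContent]) -> List[str]:
    """Trace a data content back to the first-type contents it is built from."""
    content = registry[code]
    if isinstance(content, ParameterContent):
        return [content.code]
    traced: List[str] = []
    for input_code in content.inputs:
        traced.extend(base_contents(input_code, registry))
    return traced


# Hypothetical code list for the measure dimension (illustrative values only).
registry: Dict[str, DataContent] = {
    c.code: c
    for c in (
        ParameterContent("EMP_TOT", "residents aged 15-64", "employed indicator",
                         "total", ("region",), "persons", 1000),
        ParameterContent("POP_TOT", "residents aged 15-64", "unit count",
                         "total", ("region",), "persons", 1000),
        ComparisonContent("EMP_RATE", "ratio", ("EMP_TOT", "POP_TOT"), "percent"),
    )
}

print(find_by_population(registry, "residents aged 15-64"))
print(base_contents("EMP_RATE", registry))
```

In this sketch a ratio carries only the codes of its numerator and denominator, so its population and variable are recovered by following those codes rather than being stored twice.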

Hence, a data content allows users (including NSI managers) to find in a single place all the available outputs that characterize a data cube. Furthermore, tools (search facilities) for comparing the metadata of different production processes can be built in order to tackle metadata harmonization. Finally, if all NSIs agreed to structure a data structure by means of a measure dimension with the attached data content list as described above, comparisons between data structures produced by different NSIs would be much easier, and the communication of data and metadata, e.g. by means of SDMX tools, would be dramatically simplified, because the mapping between two data structures would be performed at the dimension level rather than at the cell level.
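To make the last point concrete, here is a minimal Python sketch of a mapping done at the dimension level. It does not use any SDMX library; the cube layout, the function name `remap_measure_dimension`, the codes and the figures are all made up for illustration. One mapping entry per data content code is enough to translate every cell that uses that code.

```python
from typing import Dict, Tuple

# A cell key is (data content code, region, year); dimension names are illustrative.
CellKey = Tuple[str, str, int]


def remap_measure_dimension(
    cube: Dict[CellKey, float], code_map: Dict[str, str]
) -> Dict[CellKey, float]:
    """Translate the measure dimension of every cell using one entry per code."""
    return {
        (code_map[content], region, year): value
        for (content, region, year), value in cube.items()
    }


# Hypothetical cube of one NSI (made-up figures).
cube_a: Dict[CellKey, float] = {
    ("EMP_TOT", "North", 2014): 120.0,
    ("EMP_TOT", "South", 2014): 80.0,
    ("EMP_RATE", "North", 2014): 64.5,
    ("EMP_RATE", "South", 2014): 47.0,
}

# Hypothetical correspondence between the two NSIs' data content code lists.
code_map = {"EMP_TOT": "EMPLOYED", "EMP_RATE": "EMPLOYMENT_RATE"}

cube_b = remap_measure_dimension(cube_a, code_map)
print(len(code_map), "mapping entries cover", len(cube_b), "cells")
```

The number of mapping entries grows with the length of the data content code list, not with the number of cells in the cube.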

More details can be found in the paper and slides presented at the METIS conference, Geneva 2015, session (ii), contribution by Italy. If anyone is interested in our experience and wishes to know more, I would be very happy to give more details. If anyone finds aggregate data/statistical outputs that do not fit the first and second types of data content roughly described above, I am extremely interested in these examples.

In fact, I intend to propose the data content as an additional concept to include in GSIM in order to structure data structures appropriately. If anyone disagrees, this wiki can be the right place to start a discussion.

4 Comments

  1. user-8e470

    Mauro Scanu, Jenny Linnerud, Essi Kaukonen, Francine Kalonji - was this discussed and resolved in the context of LIM?

  2. This is being handled in the ABS implementation by large data cubes that may contain multiple measures (e.g. counts, volume/weight and $ values in one cube), as they are easier for consistent internal data management / governance / sign off (e.g. when going from unit record data to initial estimates).

    Such cubes do, however, need to be split in our dissemination environment - as Mauro proposes - to become closer to "bite size" structures for end user consumption, which sometimes does require clarifying labels / titles for the specific context. (We are looking at something similar to the GSIM "presentation" concept.)

    Also, sometimes data from two cubes (e.g. one related to persons and one related to some other unit) may be subset down to be based around common classificatory dimensions and then be brought together into a new comparative visualisation structure within the ABS.

    So aggregate data structures are pretty fluid for us - rather than one aggregate data structure directly serving all business needs.  Large, often multi-measure, cubes are, however, foundational to our internal data management.   

  3. Meeting 27 June, 2018

    • If we need to add the object Data Content, it seems the Exchange Group is the right place rather than the Structure Group.
    • It seems that the use case for the proposed new object Data Content is at the dissemination level (e.g. tables that end users access, so that users can easily find topics across many tables).
    • Then, can Presentation be used for this? It has a description attribute; do we need more than this? We might need more clarification about the use cases from Mauro Scanu.
  4. user-8e470

    Final thought - not enough evidence to suggest the need for a new object. Adding to documentation list to see if there is some text that could be added to the spec to address this.