Contact person* | |
---|---|
Job title | Senior Researcher |
Telephone | +39 06 4673 3357 |
Statistical Information Model
The information model adopted at Istat for the development of the Sistema Unitario dei Metadati for structural metadata (SUM-MS) relies heavily on the GSIM concepts in the Concepts and Structures groups. Some concepts of the Production group have also been used. The model adopted at Istat can be considered a customized, GSIM-compliant model.
The primary objective of the Istat information model under SUM-MS is to define the data held in Data Sets, which are organized in Data Structures. SUM-MS adopts essentially two data structures: hypercubes of macro data and data sets of micro data. The general plan is given in the next Figure, and the following sections give a general overview of the model.
SUM-MS has been built in line with sections 47-50 of the GSIM Specification, version 1.1 (December 2013). Each data item is the result of a Process Step in which a Process method is applied to the necessary Inputs. Hence, each data item is marked by the following elements:
1) The Statistical Program and Statistical Program Cycle under which it has been produced;
2) The Process Step (phase) under which the data has been produced;
3) The Process method that specifies the method that produced the data (consisting of a set of Rules);
4) The Inputs that are necessary in order to produce the data through the application of the Process method.
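The four elements above can be pictured as a small metadata record attached to every output. The following is only a minimal sketch in Python: the class names, field names and example values are hypothetical and are not taken from the SUM-MS implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProcessMethod:
    """A method, expressed as a set of rules (names are hypothetical)."""
    name: str
    rules: Tuple[str, ...]


@dataclass(frozen=True)
class OutputProvenance:
    """The four elements that mark each data item in the model."""
    statistical_program: str
    statistical_program_cycle: str
    process_step: str
    process_method: ProcessMethod
    inputs: Tuple[str, ...]

    def summary(self) -> str:
        """Return a one-line, human-readable description of the provenance."""
        inputs = ", ".join(self.inputs)
        return (
            f"{self.statistical_program} ({self.statistical_program_cycle}), "
            f"step '{self.process_step}': {self.process_method.name} "
            f"applied to {inputs}"
        )


# Hypothetical example values, for illustration only.
example = OutputProvenance(
    statistical_program="Labour Force Survey",
    statistical_program_cycle="2014",
    process_step="Process",
    process_method=ProcessMethod("Editing and imputation", ("rule-1", "rule-2")),
    inputs=("raw-microdata",),
)
print(example.summary())
```

Keeping the record immutable reflects the idea that provenance describes how an output was produced and should not change afterwards; this is a design choice of the sketch, not a stated property of SUM-MS.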
Micro and macro data along the statistical process are then described in terms of a set of concepts available in the Concepts and Structures GSIM groups. For this reason, the model is very similar to the one depicted in GSIM, although some modifications have been made, as described in the next sections.
Adoption of GSIM
GSIM has been adopted as a reference for the nomenclature and definitions of concepts related to data definitions. Some GSIM concepts have not been adopted and others will be included in future SUM-MS releases; in addition, SUM-MS introduces some objects that are new with respect to GSIM.
The concepts that have not been adopted are those that refer to actual instances (e.g. Instance Variable), given that SUM-MS contains only metadata, not data. For the moment, the SUM-MS system contains metadata without documenting many details: for example, SUM does not document the Unit Type for a Population. (SUM-MS declares only “analysis populations”, i.e. the reference population for the data; other concepts are left to the referential part of the system.) Statistical variables are organized under the Represented Variable concept, and their conceptual domain is described by means of the different kinds of variables (numerical or categorical). There is nonetheless room for including other concepts in SUM-MS.
System overview
Each (micro or macro) data item produced in any phase of a Statistical Program is described as an output (the product of a Process Step) and is assigned a code and a description. Its main characteristics are:
1) The Statistical Program and Statistical Program Cycle under which it has been produced
2) The Process Step (phase) under which the data has been produced
3) The Process method that specifies the method that produced the data (consisting of a set of Rules)
4) The Inputs that are necessary in order to produce the data through the application of the Process method
Given this scenario, the characteristics of micro and macro data sets are as follows.
Micro data set
The data set consists of as many Data Structures as there are reference populations in the data set. It details elements 1) to 4) described above, plus the following characteristics:
- Reference time of the micro data set
- Specification of the Reference Population(s) of the micro data set
- Specification of the relationships between the reference populations observed in the data set (e.g. the relationship between households and individuals)
- For each population, the list of the Represented Variables observed on the units of the reference population. The Represented Variables are organized in:
  - Numerical variables, also specifying whether they belong to a table of aggregate data at the unit level (following the rules given for the Data Content, as described for macro data);
  - Categorical variables;
  - Free-text variables.
Some numerical variables form a tabular sub-data set at a different unit level. Being able to read that sub-data set in all its meanings is important for reuse. For example, when each enterprise is asked for its number of employees by gender and age class, these data can be seen both as numerical variables observed on each enterprise (first kind of reference population) and as aggregate tabular data on the employees (second kind of reference population for the same data).
As far as categorical variables are concerned, each one is connected to a Classification and a classification variant (named data structure variant, see the section on classifications) containing the detail of the actual set of codes used in practice in the data set.
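The micro data set description above can be sketched as a small data model: one structure per reference population, typed variables, and relationships between populations. This is an illustrative Python sketch only; all class names, field names and example values are assumptions, not the SUM-MS schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

VARIABLE_KINDS = ("numerical", "categorical", "free-text")


@dataclass
class RepresentedVariable:
    name: str
    kind: str                              # one of VARIABLE_KINDS
    classification: Optional[str] = None   # categorical variables only
    variant: Optional[str] = None          # codes actually used in the data set
    aggregate_table: Optional[str] = None  # numerical variables in a unit-level table

    def problems(self) -> List[str]:
        """Return a list of consistency issues (empty if none)."""
        issues = []
        if self.kind not in VARIABLE_KINDS:
            issues.append(f"{self.name}: unknown kind '{self.kind}'")
        if self.kind == "categorical" and not (self.classification and self.variant):
            issues.append(
                f"{self.name}: categorical variable needs a classification and a variant"
            )
        return issues


@dataclass
class PopulationStructure:
    """One Data Structure per reference population."""
    population: str
    variables: List[RepresentedVariable] = field(default_factory=list)


@dataclass
class MicroDataSet:
    reference_time: str
    structures: List[PopulationStructure]
    # Each relationship is (population, relation, population).
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)

    def problems(self) -> List[str]:
        issues = []
        known = {s.population for s in self.structures}
        for source, _relation, target in self.relationships:
            for population in (source, target):
                if population not in known:
                    issues.append(f"unknown population in relationship: {population}")
        for structure in self.structures:
            for variable in structure.variables:
                issues.extend(variable.problems())
        return issues


# Hypothetical example: households and the individuals that belong to them.
households = PopulationStructure(
    "households", [RepresentedVariable("household size", "numerical")]
)
individuals = PopulationStructure(
    "individuals",
    [
        RepresentedVariable("gender", "categorical",
                            classification="Gender", variant="two-category variant"),
        RepresentedVariable("occupation description", "free-text"),
    ],
)
dataset = MicroDataSet(
    reference_time="2014",
    structures=[households, individuals],
    relationships=[("individuals", "belong to", "households")],
)
print(dataset.problems())  # prints []
```

The check on categorical variables mirrors the requirement that each one is connected to a Classification and to a variant listing the codes used in practice.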
Macro data set
Because a hypercube of macro data may contain different kinds of data, Statistical Program, Statistical Program Cycle, Process method and Inputs are described at different levels for micro and macro data. While these elements can be referred to at the data set level for micro data, the presence of data of a different nature in macro data sets led us to detail them at a more disaggregated level. Our model includes an additional concept, the Data Content, that details Statistical Program, Process method and Inputs for each macro data item (see the Data Content definition in Section 5).
The whole macro data hypercube is described by reporting the phase in which it has been produced (e.g. data dissemination) and the following components:
- One dimension with the Data Content;
- One dimension with the time of reference of the data;
- As many dimensions as the number of categorical variables that cross cut the data content;
- Any other dimensions that detail operational aspects not included in the data content (e.g. the adjustment or the edition for national accounts time series).
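The component list above suggests a simple structural check: a hypercube description carries exactly one Data Content dimension and one time dimension, plus any number of categorical and operational dimensions. The following Python sketch illustrates that check; the class names, the `kind` labels and the example dimensions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

DIMENSION_KINDS = ("data_content", "time", "categorical", "operational")


@dataclass(frozen=True)
class Dimension:
    name: str
    kind: str  # one of DIMENSION_KINDS


@dataclass(frozen=True)
class MacroHypercube:
    process_step: str                  # phase in which the cube was produced
    dimensions: Tuple[Dimension, ...]

    def problems(self) -> List[str]:
        """Return a list of structural issues (empty if the cube is well formed)."""
        issues = []
        for dimension in self.dimensions:
            if dimension.kind not in DIMENSION_KINDS:
                issues.append(f"{dimension.name}: unknown kind '{dimension.kind}'")
        for required in ("data_content", "time"):
            count = sum(1 for d in self.dimensions if d.kind == required)
            if count != 1:
                issues.append(
                    f"expected exactly one {required} dimension, found {count}"
                )
        return issues


# Hypothetical example: a dissemination cube of resident counts.
cube = MacroHypercube(
    process_step="data dissemination",
    dimensions=(
        Dimension("Data Content", "data_content"),
        Dimension("Reference year", "time"),
        Dimension("Gender", "categorical"),
        Dimension("Place of residence", "categorical"),
        Dimension("Edition", "operational"),
    ),
)
print(cube.problems())  # prints []
```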
Process description that this GSIM implementation supports
Business case
The Sistema Unitario dei Metadati (SUM) has two main goals:
- To store metadata, and to build search functionalities that may ease metadata reuse and harmonization;
- To trace the data production process, and the corresponding metadata changes.
The second goal is aligned with the GSIM description of a Statistical Program Cycle in terms of Process Steps: each data item is the result of a Process Step (usually in line with the corresponding phase of GSBPM) that applies a Process method to the necessary Inputs.
The first goal took advantage of the available concepts and definitions in the GSIM Structure and Concept groups.
Apart from the two main SUM goals, the system can potentially address other objectives:
- To enhance the possibilities of data integration, looking for data on the same population and on the variables observed on the different data sets: according to the kind of data sets, different data integration methods can be used (record linkage, statistical matching,…)
- To support data search facilities, so that data producers in our NSI or data users outside the NSI (under the necessary restrictions) can look for the available data according to different facets.
Relation to other Models
Once a data structure of macro data has been uploaded, SUM-MS is able to translate its content into SDMX. When building SUM-MS, the idea was not to compare SDMX and GSIM concepts. In fact, GSIM specifies what has been left vague in SDMX; hence SUM-MS uses GSIM to complement SDMX.
SDMX considers just two general repositories of metadata (code lists and concept schemes). Data Structure Definitions (DSDs) are defined by choosing a set of concepts from one or more concept schemes, assigning them a role and, if necessary, attaching a code list to them. The role of concepts (and of the corresponding code lists) becomes clear only after a DSD has been built: concepts assume the roles of dimensions, primary measures and attributes. Among dimensions it is possible to distinguish frequency and time dimensions and, from version 2.1 of SDMX in a different (and more appropriate) way, measure dimensions. Code lists are attached to concepts whenever necessary, without declaring levels when these are available.
SUM-MS uses GSIM in order to assign a proper role to concepts in concept schemes before their use in a data structure. For macro data, concepts in a concept scheme come from different repositories: categorical variables, data content of statistical aggregates, operational concepts, time concepts and frequency. According to the data structure model outlined in Section 1, it is possible to derive its corresponding SDMX translation:
a. The dimension with the Data Content plays the role of a measure dimension (SDMX version 2.1) or a general dimension (SDMX version 2.0). The presence of the Data Content is mandatory in our model.
b. The time and frequency concepts play the roles of Time Dimension and Frequency dimension respectively.
c. The categorical variables that cross cut the Data Content are just dimensions with a code list corresponding to a system variant of a statistical classification.
d. Any other dimensions that detail operational aspects not included in the data content (e.g. the adjustment or the edition for national accounts time series) are other dimensions associated with code lists (not classifications).
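The translation rules above amount to a lookup from the kind of SUM-MS dimension and the SDMX version to a role. A minimal Python sketch follows; the function name, the `kind` labels and the returned role strings are descriptive labels chosen for this illustration, not identifiers from the SDMX standard or from any SDMX library.

```python
from typing import Dict, Optional


def sdmx_role(kind: str, sdmx_version: str) -> Dict[str, Optional[str]]:
    """Map a SUM-MS dimension kind to its role in the SDMX translation.

    `kind` is one of: data_content, time, frequency, categorical, operational.
    The returned `code_list` entry describes where the codes come from,
    or is None where the text above does not specify it.
    """
    if sdmx_version not in ("2.0", "2.1"):
        raise ValueError(f"unsupported SDMX version: {sdmx_version}")

    if kind == "data_content":
        # Measure dimension in 2.1, a general dimension in 2.0.
        role = "measure dimension" if sdmx_version == "2.1" else "dimension"
        return {"role": role, "code_list": None}
    if kind == "time":
        return {"role": "time dimension", "code_list": None}
    if kind == "frequency":
        return {"role": "frequency dimension", "code_list": None}
    if kind == "categorical":
        return {
            "role": "dimension",
            "code_list": "system variant of a statistical classification",
        }
    if kind == "operational":
        return {"role": "dimension", "code_list": "code list (not a classification)"}
    raise ValueError(f"unknown dimension kind: {kind}")


for version in ("2.0", "2.1"):
    print(version, sdmx_role("data_content", version)["role"])
```

Only the Data Content dimension changes role between the two SDMX versions; the other kinds map the same way in both.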
Design
Licensing
New Information Objects and/or new specialisations of GSIM Information Objects
Although the model adopted in SUM makes extensive use of GSIM concepts, one more concept was needed for macro data: the Data Content. Its aim is to describe in a single place the macro data output of a Business Process related to data dissemination, in terms of the statistical program that produced the data, the necessary inputs and the method (in other words, a complete title of the slice of the hypercube with that Data Content). Other concepts, although available in GSIM, have been partially adapted or specified in more detail, and are described elsewhere (section 1 for the organization of micro and macro data structures, section 7 for statistical classifications).
Data Content is a concept used for macro data. It includes the references to the Statistical Program, Statistical Program Cycle, Process method and Inputs for each data item, and so defines the contents of a macro data set. Hypercubes of macro data may contain either homogeneous data (e.g. the “Number of residents per gender, place of residence and age”) or different kinds of data (as when different economic indicators are available in the same hypercube). In any case, a SUM-MS restriction (useful for practical purposes) is that the content and meaning of the data should be available in a single place: a table dimension. Other hypercube concepts (such as a table title), or the possibility of splitting the data content and meaning across more than one dimension of a table, are not considered in this model. In addition to the Statistical Program, Statistical Program Cycle, Process method and Inputs, the Data Content also declares all the elements needed for a full understanding of the data. These elements are organized differently according to the kind of inputs.
If the input is a data set of micro data (e.g. for means, medians, percentages, totals, variances, Gini indexes, …), then the data content is organized and described with the following items:
- Statistical program and Statistical Program Cycle;
- Reference population of the aggregate data (this item details the Input);
- Numerical variable used in the computation of the macrodata (in case only categorical variables are used, the counting variable; this item details the Input);
- Statistical operator that transforms the continuous variable observed on the units of the reference population into the macro data (this item details the Process method);
- (if needed) unit measure;
- (if needed) scale;
- (if needed) conditional categorical variable over which conditional percentages are given (and categories over which these conditional percentages are given).
If the inputs consist of one or more macro data, for instance for comparisons (ratios, balances, index numbers, percentage changes, …), the data content should declare:
- Statistical program and Statistical Program Cycle;
- Macro data used as Inputs;
- Statistical operator used to combine/compare the inputs (this item details the Process method);
- (if needed) unit measure;
- (if needed) scale;
- (only for index numbers) base year.
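The two forms of Data Content listed above can be sketched as two record types, each able to produce the "complete title" of its slice of the hypercube. This Python sketch is illustrative only: the class names, field names, title format and example values are assumptions, and the base-year check simply encodes the "only for index numbers" condition.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MicroBasedContent:
    """Data Content computed from a data set of micro data."""
    statistical_program: str
    statistical_program_cycle: str
    reference_population: str            # details the Input
    numerical_variable: str              # details the Input
    statistical_operator: str            # details the Process method
    unit_measure: Optional[str] = None
    scale: Optional[str] = None
    conditional_variable: Optional[str] = None

    def title(self) -> str:
        text = (f"{self.statistical_operator} of {self.numerical_variable} "
                f"({self.reference_population})")
        if self.unit_measure:
            text += f", unit: {self.unit_measure}"
        if self.scale:
            text += f", scale: {self.scale}"
        return text


@dataclass(frozen=True)
class MacroBasedContent:
    """Data Content computed from one or more macro data."""
    statistical_program: str
    statistical_program_cycle: str
    inputs: Tuple[str, ...]              # macro data used as Inputs
    statistical_operator: str            # details the Process method
    unit_measure: Optional[str] = None
    scale: Optional[str] = None
    base_year: Optional[int] = None      # only for index numbers

    def __post_init__(self) -> None:
        if self.base_year is not None and self.statistical_operator != "index number":
            raise ValueError("base_year is only meaningful for index numbers")

    def title(self) -> str:
        text = f"{self.statistical_operator} of {' and '.join(self.inputs)}"
        if self.base_year is not None:
            text += f", base year: {self.base_year}"
        return text


# Hypothetical examples, for illustration only.
mean_income = MicroBasedContent(
    "Household budget survey", "2014", "households",
    "household income", "Mean", unit_measure="EUR",
)
price_index = MacroBasedContent(
    "Consumer prices", "2015", ("price level 2015", "price level 2010"),
    "index number", base_year=2010,
)
print(mean_income.title())
print(price_index.title())
```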
This organization facilitates the search functionalities of the system, and is one of the major nodes where metadata connections are described.
Note that the Data Content feeds the GSIM concept Measure in a Data Structure. While GSIM declares that “measures correspond to Represented Variables with uncoded Value Domains (Described Value Domains)” (UNECE 2013, GSIM Specification, version 1.1, item 101), the proposed approach is to code each possible output, relate it to its statistical description (in terms of reference population, statistical variables, statistical operator and so on) and allow its use as a tool for easing metadata traceability and reuse.
Lessons learned
Suggestions for changes to GSIM
File | Modified | |
---|---|---|
Microsoft Excel Spreadsheet GSIM classification implementation.xlsx | 24 Apr, 2015 by Marco Silipo | |