Contact person*	Jenny Linnerud
Job title
Email
Telephone

Statistical Information Model

Our current information models in Statistics Norway are fragmented and mostly in Norwegian. However, we are in the process of adopting GSIM.

Adoption of GSIM

Statistics Norway has participated in the development of GSIM v1.0 from the initial idea within the Statistical Network through sprint 1 in Slovenia, sprint 2 in South Korea and the integration workshop at Statistics Netherlands, including reviews of v0.4 and v0.8. We participated in the GSIM Implementers Group and the MetaData Task Force looking at the flow of GSIM v1.0 information objects throughout the GSBPM v4.0. We contributed our experience with the Neuchâtel terminology for classifications to the GSIM Statistical Classification Model in GSIM v1.1. We will continue to work with GSIM and DDI in connection with the RAIRD (Remote Access Infrastructure for Data to researchers) project together with the Norwegian data archive. We intend to adopt v1.1 with extensions and give constructive feedback intended to strengthen GSIM in its fit-for-purpose role.

System overview

.

Process description that this GSIM implementation supports

RAIRD changes the traditional process of statistical production (see Annex B in 9. Links and attachments) by bringing the consumer of statistical products into the production flow, offering them the ability to better meet their own needs by assuming some of the tasks traditionally carried out within SSB. This change can be seen by showing where, in the traditional process as described by Statistics Norway’s Business Process Model, the end user either participates with or replaces SSB in performing some functions.

The end-users in question are more statistically literate than typical consumers of statistical products – because they are researchers associated with accredited research institutions, they have a level of skill in terms of working with input data that goes beyond the capabilities of most journalists or policy makers. This is a class of users that also has the most specific demands in terms of the statistical products they wish to see.

RAIRD – by offering the creation of specific statistical products as a service to end users – represents a different paradigm, in which both SSB and the end users benefit. SSB benefits because the end users assume some of the tasks performed today by SSB staff, so fewer resources are needed to meet the end users’ demands. The end users, in turn, will be able to get what they want more quickly, based on real-time interactions with a larger set of source data.

Business case

GSIM Coverage

Note that there are three very different – but related – types of data structures in this process. The data loaded into RAIRD has an event-history-based structure, with a single datum recorded for each available event. The Analysis Data Set has a different structure, organized into traditional rows and columns – what GSIM calls a Unit Data Set. The results of statistical operations may be tabular, but more typically take the form of estimates of statistical model parameters and graphical displays – what GSIM describes as a Presentation. The RAIRD Information Model describes all of these, using what GSIM offers, but extending it to cover event-history data.
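The difference between the first two structures can be illustrated with a minimal Python sketch. It is not part of RIM or of the RAIRD implementation: the `Datum` class, the `snapshot` function, the variable names and the sample values are all hypothetical, and treating the end of an event period as exclusive is an assumption made for the example. The sketch takes datum-based records (one value per unit, variable and event period) and derives a row-and-column data set holding the values valid on a chosen reference date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Datum:
    """One recorded value for one unit and variable, valid over an event period."""
    unit_id: str
    variable: str
    value: str
    start: date
    end: Optional[date] = None  # None means the status still persists


def snapshot(data: list[Datum], variables: list[str], on: date) -> list[dict]:
    """Build one row per unit with the values valid on the reference date."""
    rows: dict[str, dict] = {}
    for d in data:
        if d.variable not in variables:
            continue
        # Assumption: the end of an event period is exclusive.
        if d.start <= on and (d.end is None or on < d.end):
            rows.setdefault(d.unit_id, {})[d.variable] = d.value
    # Give every row the same columns; missing values become None.
    return [
        {"unit_id": uid, **{v: rows[uid].get(v) for v in variables}}
        for uid in sorted(rows)
    ]


# Hypothetical event-history data: one datum per event or status.
events = [
    Datum("p1", "marital_status", "single", date(2000, 1, 1), date(2010, 6, 1)),
    Datum("p1", "marital_status", "married", date(2010, 6, 1)),
    Datum("p1", "municipality", "0301", date(2005, 3, 1)),
    Datum("p2", "marital_status", "single", date(1995, 1, 1)),
]

analysis_data_set = snapshot(
    events, ["marital_status", "municipality"], date(2012, 1, 1)
)
```

Here `analysis_data_set` contains one row per unit, which is the shape the text calls a Unit Data Set; choosing a different reference date yields a different set of rows from the same event-history data.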

Further, we have many different examples of processing: the loading process, data catalogue browsing, data selection, data processing, data analysis, and the application of disclosure control. These processes will be modeled using the information objects supplied by GSIM.

Other functions (institutional accreditation, user management) are not as well-described in GSIM, and will require RAIRD-specific extensions.

Relation to other Models

Annex D Technical Considerations of Mapping RIM to DDI

In order to understand how the Data Documentation Initiative (DDI) standard can be used to implement RIM, we must first ask “Which parts of RIM require the functionality offered by DDI?” Typically, a standard such as DDI is most relevant where metadata is required in a structured, machine-actionable form. This occurs where metadata (and often the data it describes) are passed from one institution to another.

In the case of RAIRD, this occurs during the load process, where the data collected and cleaned within SSB is passed into the Event History Data Store. DDI documentation might also be useful for describing the Final Outputs delivered to the researchers. Everywhere else, the data and metadata within RAIRD exist within a single implementation environment, where a serialization in XML or a similar implementation syntax is generally not needed.

When we consider the requirements for the loading of metadata into the Event History Data Store (the Input Metadata Set) we find that several parts of the RIM could usefully be mapped against DDI:

Concepts and concept systems, with links to variables and external documentation
Variable descriptions with labels and definitions
Pre-assessment values for disclosure control, assigned to variables
Codelists and classifications
Links to external documentation
Unit types and associated logical record structures (which variables are associated with which units)
Relationships between Unit types

Given this set of information, we find that the XML syntax of the DDI Lifecycle standard can describe almost all of the needed information. The only requirement we see here that DDI Lifecycle does not support is the ability to capture pre-assessment values for disclosure control at the level of the variable. This could, however, be handled with some simple extensions to the standard, using the extension mechanism recommended in DDI Lifecycle. (DDI Codebook cannot describe the relationships between unit types, nor the pre-assessment values for disclosure control.)

In most cases, the use of DDI Lifecycle for describing this metadata is obvious: the DDI ConceptScheme can describe concept systems, and links with variables and external documentation are supported (DDI Variables reference Concepts, and DDI OtherMaterials can be linked to DDI Concepts as well). DDI offers a rich description for Variables – for the RIM, this will cover all of the different types of variables. GSIM Variable, Represented Variable, and Instance Variable map to DDI’s ConceptualVariable, RepresentedVariable, and Variable constructs, respectively; to support variable-level pre-assessment values, the Data Element construct would need to be extended.

Codelists and classifications are both present in DDI Lifecycle, and map cleanly to GSIM. Unit Types are associated with Variables in DDI, and the Variables are assembled into Logical Records – these constructs can be used to support the needed description of unit types and record structures. Further, the linkages between unit types can be described in DDI Lifecycle using the RecordRelationship construct.
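The mapping described in the preceding paragraphs can be summarised as a small lookup table. The Python sketch below is only an illustration: the construct names are those given in the text, while the dictionary keys, the `plan_load_mapping` function and its return shape are hypothetical. Codelists and classifications are left out because the text does not name the DDI constructs used for them.

```python
# RIM/GSIM item -> DDI Lifecycle construct, using the names given in the text.
RIM_TO_DDI = {
    "Concept system": "ConceptScheme",
    "Variable": "ConceptualVariable",
    "Represented Variable": "RepresentedVariable",
    "Instance Variable": "Variable",
    "External documentation": "OtherMaterial",
    "Logical record structure": "LogicalRecord",
    "Relationship between Unit types": "RecordRelationship",
}

# Items the text says DDI Lifecycle cannot express without an extension.
NEEDS_EXTENSION = {"Pre-assessment value for disclosure control"}


def plan_load_mapping(items: list[str]) -> dict[str, object]:
    """Sort metadata items into mapped, extension-needed and unknown groups."""
    mapped = {i: RIM_TO_DDI[i] for i in items if i in RIM_TO_DDI}
    extensions = [i for i in items if i in NEEDS_EXTENSION]
    unknown = [
        i for i in items if i not in RIM_TO_DDI and i not in NEEDS_EXTENSION
    ]
    return {"mapped": mapped, "extensions": extensions, "unknown": unknown}


plan = plan_load_mapping([
    "Instance Variable",
    "Relationship between Unit types",
    "Pre-assessment value for disclosure control",
])
```

A table of this kind makes the single gap explicit: everything in the Input Metadata Set maps to an existing construct except the variable-level pre-assessment values.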

It should be mentioned that the variable and record descriptions are needed only to the logical level – the load process for the Event History Data Store does not require physical descriptions of the data.

Some standard DDI Lifecycle profiles have been created by UN/ECE for representing GSIM information objects, and these standard profiles cover almost all of the metadata required for the load process. They can be found at: http://www1.unece.org/stat/platform/display/gsim/DDI+Profiles. They do not cover everything that is needed (especially when it comes to concept systems) but do cover most of the needed constructs from DDI Lifecycle.

For describing the Final Outputs using DDI, we would need to add the physical description of the data and use the DDI Lifecycle NCube construct. This is probably not of value to the researchers, although it might be desirable from the perspective of the organizations archiving the researchers’ work (such as NSD).

Design

.

RIM Design Principles

The design principles for RIM listed below are based on the design principles in GSIM v0.8 and the design principles for DDI.

Design Principle	Name	Statement
0	Change control	RAIRD IM (RIM) has change control i.e. the following principles for designing RIM apply to every revision of RIM.
1	Complete coverage	RIM supports the whole business process resulting in access to products for researchers.
2	Supports production of products for researchers	RIM supports the design, documentation, production and maintenance of products for researchers.
3	Supports access to products for researchers	RIM supports access to products for researchers.
4	Separation of production and access	RIM enables explicit separation of the production and access phases.
5	Linking processes	RIM represents the information inputs and outputs to the production and access process.
6	Common language	RIM provides a basis for a common understanding of information objects.
7	Agreed level of detail	RIM contains information objects only down to the level of agreement between key stakeholders.
8	Robustness, adaptability and extensibility	RIM is robust, but can be adapted and extended to meet users’ needs.
9	Simple presentation	RIM objects and their relationships are presented as simply as possible.
10	Reuse	RIM makes reuse of existing terms, definitions and standards.
11	Platform independence	RIM does not refer to any specific IT setting or tool.
12	Formal specification	RIM defines and classifies its information objects appropriately, including specification of attributes and relations.

Licensing

RAIRD Information Model v1.0

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it jointly to Statistics Norway and the Norwegian Social Science Data Services.

New Information Objects and/or new specialisations of GSIM Information Objects

Object	Definition	Explanatory text
Access Agreement	The terms and conditions agreed between the provider of confidential data and the Research Institution allowing access to data.
Access Request	A request to obtain access to confidential data.
Accreditation Request	An official statement asking to be approved as a Research Institution whose members may access confidential data in some form for research purposes.
Analysis Data Set	The data as selected and extracted from the Event History Data Store.
Corrective Action	An action taken to reduce the disclosure risk of a Data Set.	These actions may include cell suppression, aggregation, techniques for detecting differential disclosure, etc. Transformation of the data is typical of such actions.
Data Catalogue	The set of metadata describing the data contained in the Event History Data Store as presented to researchers so they can browse the content of the store.
Datum-Based Data Set	A Data Set which records a set of observations specific to time-bound events, for a range of variables, structured according to a Datum-Based Data Structure.
Datum-Based Data Structure	The description of how a Data Set is arranged, in which a set of time-bound events or statuses relative to events are recorded, for a known set of Units.
Disclosive Data Point	A Data Point which is potentially disclosive on its own.	Such aspects as the discoverability or sensitivity of the value may make it potentially disclosive.
Disclosive Data Point Combination	A combination of Data Points which, when taken together, pose a potential disclosure risk.	The Data Points by themselves could be disclosive or non-disclosive.
Disclosure Control Rule	A Rule based on a Process Method which determines if a Provisional Output is disclosive.
Disclosure Risk Assessment	The level of disclosure risk, determined by an assessment of both the assigned risk factors and those arising from the data taken in context.	Assigned risk factors are attributes of the data such as discoverability or sensitivity; contextual factors are attributes such as skewness or granularity.
Event History Data Store	A system which provides access to a set of Event History data, structured according to a Datum-Based Data Structure.
Event History Input Data Set	The data which is loaded into an Event History Data Store.	Typically, this will be structured according to a Datum-Based Data Structure; it is accompanied by an Input Metadata Set for load into the Event History Data Store.
Event Period	The point of time at which an event occurs, or the period of time during which the status of a Unit persists in regards to that event.	If an event or status is observed at a single point in time, then that point in time is the event period. When a change in status occurs, an end-time is recorded, and a new event status begins.
Final Output	Result of analysis on the Analysis Data Set that has been subject to disclosure control, and is deemed safe to be viewed and published by a researcher.
Input Metadata Set	The set of metadata needed to fully document and describe the structure of an Event History Input Data Set.
Nondisclosive Data Point	A Data Point which is not by itself potentially disclosive.
Project	A collaborative enterprise, involving research, which is carefully planned to achieve a particular aim.
Project Profile	A set of information used to administer a Project.
Provisional Output	The product of the analysis of an Analysis Data Set by a researcher, which is submitted for finalization.	If the Provisional Output passes disclosure control, it becomes a Final Output and may be published. Otherwise, it may be subjected to Corrective Actions to reduce the risk of disclosure.
Refusal Rationale	The official reason provided for denying a Research Institution accreditation or for denying a Project access to confidential data.
Relationship	The connection between two Units, of a specified type.	The Relationship object represents the connection of two units, for example marriage, employment, etc. It is sometimes seen in a real form (a civil contract in the case of marriage, or an employment contract in the case of an employer-employee relationship), but this is not always the case.
Research Institution	An institution which conducts research.
Research Publication	A published document which answers a research question based on an analysis of data.
Researcher	A post holder with a research title or other academic/scientific personnel at approved research units, and students studying for a master’s degree or PhD, and postgraduates, provided that these are under the supervision of a qualified researcher at such a research unit.
User Profile	A set of information for administering a user within a system supporting research.
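The relationship between Provisional Output, Disclosure Control Rule, Corrective Action and Final Output defined in the table above can be sketched in Python. This is an illustration only and does not describe RAIRD's actual disclosure control: the minimum-count rule, the threshold of 5, the suppression function and all class and function names are assumptions chosen for the example, and real disclosure control would need more than primary suppression of small cells.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProvisionalOutput:
    """A tabular result: cell label -> number of contributing units."""
    cells: dict[str, int]


# A rule returns True when it judges the output to be disclosive.
Rule = Callable[[ProvisionalOutput], bool]
Action = Callable[[ProvisionalOutput], ProvisionalOutput]


def min_count_rule(threshold: int) -> Rule:
    """Hypothetical rule: any non-empty cell below the threshold is disclosive."""
    def rule(output: ProvisionalOutput) -> bool:
        return any(0 < n < threshold for n in output.cells.values())
    return rule


def suppress_small_cells(threshold: int) -> Action:
    """Hypothetical Corrective Action: drop non-empty cells below the threshold."""
    def action(output: ProvisionalOutput) -> ProvisionalOutput:
        kept = {k: n for k, n in output.cells.items() if n == 0 or n >= threshold}
        return ProvisionalOutput(kept)
    return action


def review(
    output: ProvisionalOutput,
    rules: list[Rule],
    corrective: Optional[Action] = None,
) -> tuple[str, ProvisionalOutput]:
    """Apply the rules; try one corrective pass if the output is disclosive."""
    def disclosive(o: ProvisionalOutput) -> bool:
        return any(rule(o) for rule in rules)

    if disclosive(output) and corrective is not None:
        output = corrective(output)
    status = "Provisional Output" if disclosive(output) else "Final Output"
    return status, output


table = ProvisionalOutput({"A": 120, "B": 2, "C": 45})
rules = [min_count_rule(5)]

status_without_action, _ = review(table, rules)
status_with_action, released = review(table, rules, suppress_small_cells(5))
```

Without a corrective action the table stays a Provisional Output because cell "B" has only two contributing units; with suppression applied, the remaining cells pass the rule and the result is treated as a Final Output.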

Lessons learned

.

Suggestions for changes to GSIM

.

	File	Modified
	Microsoft Excel Spreadsheet GSIM classification implementation_ssb.xlsx	17 Apr, 2015 by Jenny Linnerud

Name of contact for statistical classifications model:

Anne Gro Hustoft

Email:

agt@ssb.no

Links:

Attachments
