
A. Definition

7. In order to provide guidance for statisticians on how data integration activities fit into the statistical business process, it is important to define the term "data integration".

8. According to the Generic Statistical Business Process Model (GSBPM), data integration is an activity in the statistical business process in which data from one or more sources are integrated. Although the GSBPM places data integration under the Process phase, it is important to remember that the GSBPM is not a linear model. Data integration can occur in the development, production and dissemination of official statistics, whenever a combined, integrated dataset is produced.

9. The data being integrated can come from a mixture of sources. In official statistics, these are usually primary data sources (more "traditional" sources such as statistical surveys) or secondary data sources (typically administrative datasets, big data or any other non-traditional source of information for official statistics). The result of the data integration activity is always an integrated dataset.

10. This document defines data integration as the activity in which at least two different sources of data are combined into a dataset. These sources can be datasets that already exist in the statistical system or external sources (e.g. an administrative dataset acquired from the owner of an administrative register, or web-scraped information from a publicly available website).

11. Some examples of data integration include:

  • an integrated dataset that serves as an input to produce official statistics
  • a statistical model developed and produced using different sources to produce model-based information
  • a dataset integrated for the purposes of micro-validation when some rules are defined to check the validity of the data in one dataset compared to another one
  • missing values imputed in a dataset using another dataset as the source for imputation
  • datasets combined to produce a sampling frame for a survey
  • data from several subject-matter domains combined into one dataset that is the basis for the production of statistics (example: national accounts)
  • datasets from different subject-matter domains compared to check the quality and the validity of information produced (macro-validation)
  • input from several sources integrated into one dataset to provide microdata files for the researchers for scientific purposes
  • different sources used to apply proper statistical disclosure control methods on a microdata set.
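Two of the examples above, imputation from another dataset and micro-validation, can be illustrated with a minimal Python sketch. Everything in it is a hypothetical assumption made for illustration: the record layout, the field names (unit_id, turnover), the helper functions and the 20% tolerance are not taken from the GSBPM or from any particular statistical system.

```python
# Hypothetical survey records; turnover is missing (None) for unit "B2".
survey = [
    {"unit_id": "A1", "turnover": 120.0},
    {"unit_id": "B2", "turnover": None},
    {"unit_id": "C3", "turnover": 300.0},
]

# Hypothetical administrative records keyed by the same unit identifier.
admin = {
    "A1": {"turnover": 118.0},
    "B2": {"turnover": 75.0},
    "C3": {"turnover": 150.0},
}


def impute_from_admin(survey_rows, admin_rows, variable):
    """Return copies of the survey rows with missing values filled from admin data."""
    result = []
    for row in survey_rows:
        new_row = dict(row)
        donor = admin_rows.get(row["unit_id"])
        can_impute = donor is not None and donor.get(variable) is not None
        if new_row[variable] is None and can_impute:
            new_row[variable] = donor[variable]
            new_row["imputed"] = True  # keep track of which values were imputed
        else:
            new_row["imputed"] = False
        result.append(new_row)
    return result


def micro_validate(survey_rows, admin_rows, variable, tolerance=0.2):
    """List units whose survey value deviates from the admin value
    by more than `tolerance` (relative difference; an assumed rule)."""
    flagged = []
    for row in survey_rows:
        donor = admin_rows.get(row["unit_id"])
        if row[variable] is None or donor is None:
            continue  # nothing to compare for this unit
        reference = donor[variable]
        if reference and abs(row[variable] - reference) / abs(reference) > tolerance:
            flagged.append(row["unit_id"])
    return flagged


integrated = impute_from_admin(survey, admin, "turnover")
suspicious = micro_validate(survey, admin, "turnover")
```

In this sketch the imputed records carry a flag so that later steps can distinguish reported values from imputed ones, and validation only lists units for review rather than changing any values. A real application would need an agreed validation rule and a reliable common identifier across the two sources.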

B. Types of data integration

12. There are many possible types of data integration. Five common types of integration are: administrative sources with survey and other traditional data; new data sources (such as big data) with traditional data sources; geospatial data with statistical information; micro level data with data at the macro level; and validating data from official sources with data from other sources. More information about these can be found in Annex 1.

13. Integration can be done at the micro level, at the level of a common denominator, at the aggregate (macro) levels, through modelling approaches or a mixture of these.
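The difference between integration at the micro level and at the aggregate (macro) level can be shown with a small Python sketch. It is illustrative only: the datasets, the field names (unit_id, region, employees, activity) and the functions link_micro and compare_macro are hypothetical, and exact matching on a shared identifier is assumed, which is often not available in practice.

```python
from collections import defaultdict

# Hypothetical unit-level records from two sources sharing a unit identifier.
survey = [
    {"unit_id": "A1", "region": "North", "employees": 10},
    {"unit_id": "B2", "region": "North", "employees": 25},
    {"unit_id": "C3", "region": "South", "employees": 40},
]
register = [
    {"unit_id": "A1", "activity": "retail"},
    {"unit_id": "C3", "activity": "transport"},
]
# Hypothetical aggregates from another source, by region.
admin_totals = {"North": 38, "South": 40}


def link_micro(left, right, key):
    """Micro-level integration: combine records that share the same key value."""
    lookup = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            linked.append({**row, **match})  # one combined record per matched unit
    return linked


def compare_macro(rows, group, variable, reference):
    """Macro-level integration: aggregate by group and place the totals
    beside the figures from a reference source."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group]] += row[variable]
    return {g: {"survey": totals[g], "reference": reference.get(g)} for g in totals}


linked = link_micro(survey, register, "unit_id")
by_region = compare_macro(survey, "region", "employees", admin_totals)
```

Here link_micro yields unit-level records only for units found in both sources, while compare_macro never touches individual records from the second source and works only with totals.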

14. The survey conducted in 2017 asked statistical organisations about their data integration experiences and practices. The results showed that the four most common types of data being integrated with other datasets were survey data, census data, commercial transactions and data from public administrations.

15. Data integration techniques can be applied for several reasons in the statistical business process. The 2017 survey found that data integration is commonly used to supplement survey data (e.g. for part of a population or for a set of variables), validate data, maintain registers and edit or impute data. It is also used as a source for sampling frames.

16. The results of the data integration survey show that data integration for the ongoing production of statistics is more commonly used in some statistical domains (for example, Business Statistics and Economic Accounts) than in others. There is a high level of experimentation and research in some of the other domains (for example, Education and Indicators related to the Millennium or Sustainable Development Goals).

17. A number of groups are working on developing strategies for global data agreements and methods in different statistical domains. Examples include the Task Force on Data Integration for Measuring Migration, and the Ottawa Group and the Group of Experts on Consumer Price Indices, which are working on issues such as the use of big data and global data agreements for consumer price indices.

 
