55. Data integration procedures differ in the types of data sources to be combined, in the characteristics of the data sets – such as their coverage, the extent to which they overlap, the micro or macro level of the data, and the existence and usability of unique identifiers – and in the purposes of combining the data.
56. Data integration methods can be divided into two main groups: a) Record linkage methods and b) Statistical matching methods, each of which can be further divided into several subgroups and categories from different perspectives.
57. There is a growing body of literature on these methods, their sub-processes, advantages and disadvantages, mathematical foundations and tools. Indeed, several international projects and training programmes related to data integration – such as the CENEX-ISAD (European Centres and Networks of Excellence, Integration of Surveys and Administrative Data), the ESSnet on Data Integration (ESSNET), the CULT project (CULT) and the European Statistical Training Programme course on Statistical Matching and Record Linkage in 2016 (ESTP) – have reviewed and contributed to the specialised literature on data integration with important reports, articles, lists of recommended bibliographies and other materials. The reader is strongly recommended to consult these projects, their work packages, reports and outputs at the web pages shown in the footnotes.
58. To avoid duplication, only a brief overview is given here of data integration methods, some of their most notable features, and some of the tools used in official statistics to carry out data integration.
A. Record Linkage
59. Record linkage refers to the identification and combination of records corresponding to the same entities – persons, enterprises, dwellings, households, etc. – across two or more data sources. Record linkage methods can be further classified into two branches:
a) Deterministic matching (or exact matching), in which a formal decision rule – usually the coincidence (or mismatch) of the unique identifiers that correspond to the same units in two or more data sources – is applied to decide whether a pair of records is a match or not;
b) Probabilistic matching, in which such strict decision rules are not applicable. Instead, probabilistic decision rules are established, based on a set of key variables common to the data sets to be integrated, so that matches can be accepted or rejected on a probabilistic basis.
60. Because both deterministic and probabilistic matching are based on the comparison of key variables, they share some features. First, both procedures can lead to linkage errors: false matches (or false positives), i.e. non-matches accepted as real matches, and false non-matches (or false negatives), i.e. real matches not recognised as such. Moreover, both consist of similar phases (for further details see, for example, the presentations of the ESTP course on Statistical Matching and Record Linkage, 2016, or the CENEX-ISAD WP1 report):
1) Pre-processing:
- Choice of the key variables,
- Data cleaning and quality improvement,
- Conversion of key variables into standard forms.
2) Linkage:
- Match (same entity);
- Unmatch (different entities);
- Uncertain match (unable to decide – possible match).
3) Post-linkage (Manual review of unlinked records)
4) Data analysis.
61. Deterministic matching is considered the ideal case of record linkage: the existence of a unique identifier – the social security number of persons, the fiscal code of enterprises, the geocodes of addresses, etc. – assures an error-free, one-to-one matching of records with the same identifier, that is, records that belong to the same entity. For this reason, there is considerably less literature on this method than on the others. Nevertheless, challenges can emerge in its application. One difficulty is that unique identifiers can themselves be affected by errors that occur, for instance, during data collection or data capture, and identifiers may be missing in some of the data sources. Identifying records in basic registers – the "spines of integration" – can be a suitable way to obtain or check unique identifiers.
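As a minimal sketch of deterministic matching, records from two sources can be joined on a unique identifier; the data, field names, and sources below are purely illustrative, not taken from any real register:

```python
# Deterministic (exact) matching on a unique identifier.
# All data and field names are hypothetical illustrations.

def deterministic_match(source_a, source_b, id_field):
    """Return pairs of records from the two sources that share the same identifier."""
    index_b = {rec[id_field]: rec for rec in source_b if rec.get(id_field) is not None}
    matches = []
    for rec_a in source_a:
        key = rec_a.get(id_field)
        if key is not None and key in index_b:
            matches.append((rec_a, index_b[key]))
    return matches

survey = [{"ssn": "123", "income": 40000}, {"ssn": None, "income": 25000}]
register = [{"ssn": "123", "employer": "ACME"}, {"ssn": "999", "employer": "Foo Ltd"}]

pairs = deterministic_match(survey, register, "ssn")
# Only the record with ssn "123" links; the survey record with a missing
# identifier stays unmatched, illustrating the missing-value problem noted above.
```

The sketch also shows why error-free identifiers matter: a single typo in the identifier silently turns a true match into a false non-match.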
62. By contrast, probabilistic matching is a more complex approach. Instead of unique identifiers, softer key variables are used, such as the name, date of birth, address, or other variables describing the units of the target population. These are more prone to data collection or data capture errors, and they are often recorded in different formats, which makes their comparison more complicated. In these cases the pre-processing phase plays a crucial role and can strongly affect the results of the record linkage exercise.
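The pre-processing of such soft key variables can be sketched as simple standardisation rules; the specific rules below (accent stripping, day-first dates) are illustrative assumptions, not a prescribed standard:

```python
# Illustrative standardisation of key variables before linkage.
import re
import unicodedata

def standardise_name(name):
    """Upper-case, strip accents, and collapse whitespace and punctuation."""
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    name = re.sub(r"[^A-Za-z ]", " ", name)  # punctuation becomes whitespace
    return " ".join(name.upper().split())

def standardise_date(date_string):
    """Convert dates like '1/2/1980' or '01-02-1980' to 'YYYY-MM-DD' (day-first assumed)."""
    day, month, year = re.split(r"[/\-.]", date_string)
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"

clean_name = standardise_name("Müller,  Jörg")   # -> "MULLER JORG"
clean_date = standardise_date("1/2/1980")        # -> "1980-02-01"
```

After such standardisation, the same person recorded as "Müller, Jörg" in one source and "MULLER JORG" in another compares as equal.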
63. The mathematical foundations of probabilistic record linkage and probabilistic decision rules go back to the ground-breaking works of Newcombe et al. (1959) and Fellegi and Sunter (1969), who formalised the theory of probabilistic matching based on the assumption of conditional independence. Even today, this method serves as the basis of record linkage applications. Other probabilistic record linkage techniques include that of Jaro (1989), further developed by Winkler (1995), and the distance-based record linkage method described by Pagliuca and Seri (1999).
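A minimal Fellegi–Sunter-style decision rule can be sketched as follows. The m-probabilities (agreement given a true match), u-probabilities (agreement given a true non-match), and thresholds are hypothetical numbers chosen for illustration; in practice they are estimated, for instance via the EM algorithm:

```python
import math

# Hypothetical m- and u-probabilities per key variable (illustrative values).
M_PROB = {"surname": 0.95, "birth_year": 0.98, "postcode": 0.90}
U_PROB = {"surname": 0.01, "birth_year": 0.05, "postcode": 0.02}

UPPER, LOWER = 5.0, -3.0  # illustrative decision thresholds

def match_weight(record_a, record_b):
    """Sum of log2 likelihood ratios over the key variables
    (conditional independence assumption)."""
    total = 0.0
    for var in M_PROB:
        if record_a[var] == record_b[var]:
            total += math.log2(M_PROB[var] / U_PROB[var])
        else:
            total += math.log2((1 - M_PROB[var]) / (1 - U_PROB[var]))
    return total

def classify(record_a, record_b):
    """Three-way decision: match, non-match, or possible match (clerical review)."""
    w = match_weight(record_a, record_b)
    if w >= UPPER:
        return "match"
    if w <= LOWER:
        return "non-match"
    return "possible match"

a = {"surname": "SMITH", "birth_year": 1980, "postcode": "1011"}
b = {"surname": "SMITH", "birth_year": 1980, "postcode": "1011"}
c = {"surname": "JONES", "birth_year": 1975, "postcode": "9999"}
```

Pairs with weights between the two thresholds fall into the "uncertain match" category listed in the linkage phase above and are sent to manual review.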
Tools for record linkage
64. An updated critical review of methods and software for record linkage is given in Tuoto et al. (2014)1, where the recent proliferation of methodologies and tools is assessed against a set of criteria: flexibility of the tools with respect to supported input/output formats, extensibility, maturity, supported languages, and coverage of the functionalities related to the sub-phases into which a record linkage process can be organised to reduce its recognised complexity.
65. The CENEX-ISAD WP3 report and the CULT project results offer a detailed discussion of software tools for data integration. These include:
- Automatch
- Febrl
- GRLS
- LinkageWiz
- RELAIS
- DataFlux
- Link King
- Trillium Software
- Link Plus
- RecordLinkage (R package)
- FRIL
- Fundy
- QualityStage
B. Statistical Matching
66. Statistical matching (or synthetic matching) involves the integration of data sources containing usually distinct samples from the same target population, in order to study and provide information on the relationship between variables not jointly observed in the data sets. The main difference from record linkage – as Leulescu and Agafiței (2013) put it – is that "record linkage deals with identical units, while statistical matching deals with 'similar' units. In practice, matching procedures can be regarded as an imputation problem of the target variables from a donor to a recipient survey." The statistical matching situation is usually described with a recipient data source A containing variables X and Y and a donor data source B containing variables X and Z. Statistical matching then consists of imputing the variable Z into data source A using the common variables X.
Figure 2. Statistical matching illustration by Eurostat (2014)
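Statistical matching as imputation can be sketched with a nearest-neighbour hot deck: for each recipient record in A, the donor record in B closest on the common variables X supplies its Z value. The files, variables, and Euclidean distance below are illustrative assumptions:

```python
# Nearest-neighbour hot-deck statistical matching (illustrative sketch).
# Recipient file A observes (X, Y); donor file B observes (X, Z).

def nearest_neighbour_match(recipients, donors, x_vars, z_var):
    """Impute z_var into each recipient from the donor closest on x_vars
    (squared Euclidean distance)."""
    completed = []
    for rec in recipients:
        donor = min(
            donors,
            key=lambda d: sum((rec[x] - d[x]) ** 2 for x in x_vars),
        )
        completed.append({**rec, z_var: donor[z_var]})
    return completed

# X = age, Y = income (recipient A); X = age, Z = expenditure (donor B).
file_a = [{"age": 34, "income": 41000}, {"age": 61, "income": 28000}]
file_b = [{"age": 35, "expenditure": 30000}, {"age": 60, "expenditure": 22000}]

synthetic = nearest_neighbour_match(file_a, file_b, ["age"], "expenditure")
# Each synthetic record now carries X, Y, and an imputed Z.
```

The output is a synthetic file in the sense of paragraph 67 a): complete in the variables of interest, but its (Y, Z) pairs were never directly observed on the same unit.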
67. Statistical matching methods are categorized in the specialised literature from different angles:
a) The Micro approach aims at constructing a complete (containing all variables of interest) and synthetic (that is, consisting of not directly observed units) micro-level data set.
b) The Macro approach seeks the integration of data sources in order to facilitate the estimation of the parameters of interest – such as correlation or regression coefficients, or contingency tables of variables not jointly observed – at the macro level.
68. At another level, the Micro and Macro approaches can each be parametric or non-parametric, and a mix of the two can also be applied in the Micro approach:
a) The Parametric approach is based on a distributional assumption about the data, usually normality. A specified model is needed for the joint distribution of the variables, which carries a risk of misspecification. (Estimation is usually by maximum likelihood.)
b) The Non-parametric approach is applied when the data do not satisfy the normality assumption. It is more flexible than the parametric approach when variables are of different types. (Hot-deck techniques are usually applied.)
c) A mix of the parametric and non-parametric approaches can be applied in the case of micro-level matching: "first a parametric model is assumed, and its parameters are estimated then a synthetic data set is derived through a nonparametric micro approach. In this manner the advantages of both parametric and nonparametric approach are maintained: the model is parsimonious while nonparametric techniques offer protection against model misspecification" (D'Orazio, 2017).
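The mixed approach can be sketched in two steps: fit a simple parametric model for Z given X on the donor file, then match each recipient to the donor with the closest predicted Z and take that donor's observed ("live") Z value. This is a deliberately simplified illustration using ordinary least squares with a single X variable; real applications use richer models:

```python
# Mixed parametric/non-parametric statistical matching (simplified sketch).

def ols_fit(xs, zs):
    """Ordinary least squares for Z = a + b*X with a single regressor."""
    n = len(xs)
    mean_x, mean_z = sum(xs) / n, sum(zs) / n
    b = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_z - b * mean_x
    return a, b

def mixed_match(recipients, donors, x_var, z_var):
    # Step 1 (parametric): estimate the model on the donor file.
    a, b = ols_fit([d[x_var] for d in donors], [d[z_var] for d in donors])
    # Step 2 (non-parametric): match each recipient to the donor whose
    # predicted Z is closest, and impute that donor's *observed* Z value,
    # which protects against model misspecification.
    completed = []
    for rec in recipients:
        pred = a + b * rec[x_var]
        donor = min(donors, key=lambda d: abs((a + b * d[x_var]) - pred))
        completed.append({**rec, z_var: donor[z_var]})
    return completed

# Hypothetical files: donor observes (age, spend), recipient observes (age, income).
donor_file = [{"age": 25, "spend": 15000}, {"age": 45, "spend": 25000},
              {"age": 65, "spend": 20000}]
recipient_file = [{"age": 44, "income": 39000}]

result = mixed_match(recipient_file, donor_file, "age", "spend")
```

Because the final imputed values are observed donor values rather than model predictions, implausible model-generated values cannot enter the synthetic file.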
69. Furthermore, approaches can be distinguished according to the availability of information on the variables not jointly observed:
a) Approaches that assume the conditional independence of the variables (originally, all the micro, macro, parametric, non-parametric and mixed methods were based on the conditional independence assumption).
b) Approaches where auxiliary information is available from a third data set in which variables are jointly observed.
c) Uncertainty approaches, where no assumptions are made and no joint information on the variables is available; uncertainty analysis techniques are then applied, usually at the macro level.
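Macro-level uncertainty analysis can be illustrated with the classical Fréchet bounds: with only the marginal probabilities of Y and Z and no joint information, a joint cell probability P(Y=y, Z=z) can only be bounded, not identified. The marginal values below are invented for illustration:

```python
# Fréchet bounds for one cell of the unobserved joint distribution of Y and Z.

def frechet_bounds(p_y, p_z):
    """Bounds on P(Y=y, Z=z) given only the two marginal probabilities."""
    lower = max(0.0, p_y + p_z - 1.0)
    upper = min(p_y, p_z)
    return lower, upper

# Illustrative marginals: P(employed) = 0.6 from source A,
# P(homeowner) = 0.7 from source B.
low, high = frechet_bounds(0.6, 0.7)
# Without joint observations: 0.3 <= P(employed, homeowner) <= 0.6.
```

The width of the interval (here 0.3) quantifies the uncertainty that remains when no conditional independence assumption is made and no auxiliary joint data exist.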
Tools for statistical matching
70. The list below is based on the CENEX-ISAD WP3 report, which offers a detailed discussion of these software tools:
- StatMatch (R package)
- SAMWIN
- S-Plus code
- SAS code.
C. Other methodological considerations
71. A common issue with linked datasets is inconsistency between the records linked. Where inconsistencies occur between records linked from two different data sources, it is important to know which of the two sources is more reliable. Sometimes even the order in which the datasets are linked matters in determining where an inconsistency arose. As the number of datasets being linked increases, the number of available variables grows with it, and so does the potential for detecting and treating inconsistencies in records. However, this may also increase the amount of editing required for the linked datasets.
72. An editing strategy for linked datasets should be able to: edit inconsistencies between records for the same unit from different sources; treat erroneous and missing variables in a record; and ensure consistency of the variables across a record both within a time period and over time.
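These editing abilities can be sketched as rule-based checks over a linked record. The source-priority ranking, variables, and values below are hypothetical; in practice the reliability ordering would come from a quality assessment of each source:

```python
# Rule-based consistency editing of a linked record (illustrative sketch).
# When two sources disagree on a variable, the value from the source with
# the higher assumed reliability wins; missing values fall back to any
# non-missing source.

SOURCE_PRIORITY = {"register": 2, "survey": 1}  # hypothetical reliability ranking

def edit_linked_record(values_by_source):
    """values_by_source: {variable: {source: value}} -> single edited record."""
    edited = {}
    for variable, readings in values_by_source.items():
        non_missing = {s: v for s, v in readings.items() if v is not None}
        if not non_missing:
            edited[variable] = None  # still missing after linkage
            continue
        best_source = max(non_missing, key=lambda s: SOURCE_PRIORITY[s])
        edited[variable] = non_missing[best_source]
    return edited

linked = {
    "birth_year": {"register": 1980, "survey": 1981},      # inconsistency
    "occupation": {"register": None, "survey": "teacher"}, # missing in one source
}
record = edit_linked_record(linked)
# The register's birth year wins; the survey fills the missing occupation.
```

A real strategy would also log which source supplied each value, so that later revisions to source reliability can be re-applied.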
73. Sources of potential bias have been identified with regard to integrating datasets. These include:
- Coverage and conceptual issues may apply only to some groups of a population, so care should be taken in generalising results.
- Some variables can affect the quality of linking and may be a source of bias in analyses carried out on the resulting datasets. Investigation of linkage rates across different subpopulations may be required.
- The use of linked datasets, even for validation purposes, may result in a break in the data series that needs to be managed.
74. Extreme care should be taken in backward and forward casting of linked data, especially longitudinal data. A person may link in one quarter but not in another for data quality reasons (or may link to a different record). A weight may be needed to adjust for missed links in linked datasets.
75. Better methods for estimating linkage errors are needed in order to determine models that appropriately account for them. Linkage errors contribute to potential coverage errors in the resulting target population. Care should also be taken when creating statistical units from integrated datasets in which one dataset comes from an external source, since the unit may be defined differently in the external dataset.
76. Data sourced externally may suffer from measurement errors, e.g. validity errors, and these errors propagate when the data are integrated with other data sources to produce a statistical output. Hence, the target concepts used in an externally sourced dataset should be well understood before it is used in the production of official statistics.
1 Tuoto, T., Gould, P., Seyb, A., Cibella, N., Scannapieco, M., Scanu, M. (2014), "Data Linking: A Common Project for Official Statistics", in Proceedings of the Conference of European Statistics Stakeholders, Rome, 24-25 November 2014.
