4.1 Introduction

51. Metadata can be defined as information needed to use and interpret statistics (Eurostat Glossary). This chapter describes the metadata needed to define and describe the data editing process.

52. Firstly, metadata are necessary to describe the input and output of the data editing process; the following sections of this chapter are therefore devoted to detailing process input and process output metadata. The same metadata elements are used to describe both the input and output of the whole data editing process and the input and output of each process step.

53. Secondly, in order to describe but also to drive the data editing process, metadata elements to structure the process steps and the process flow are needed. Process steps of the data editing process are characterised by functions and methods. As mentioned in Chapter 2, functions specify which actions are to be performed, while methods specify how they are performed. In addition, methods can be associated with rules. Functions, methods and rules have been extensively explained in the previous chapter, as they are the main focus of statistical data editing. In Chapter 4.4, some examples of input and output metadata for the three function types (i.e. review, selection and treatment) will be presented. Additional metadata objects need to be considered to specify a process flow, that is, a combination of process steps. The sequence of process steps in the process flow is governed by process controls, that is "a set of decision points which determine the flow between the process steps used to perform a business process" (GSIM). Further descriptions and applications of these metadata concepts to statistical data editing flows are reported in Chapter 5.

54. A summary table including an overview of all metadata elements for data editing processes with examples and correspondence with GSIM information objects is presented in the last section of the chapter.

4.2 Process Input Metadata

55. Process input metadata include all the information that describes the input of the data editing process, i.e.:

  • The data set that is the object of the data editing functions
  • Additional information needed to apply the functions, such as metadata describing the auxiliary data and the parameters necessary for the data editing process to run

4.2.1 Metadata Describing Input Data

56. Conceptual and structural metadata are needed to define and describe both the input data set and the auxiliary data. They convey the meaning of these data by describing the concepts that are being measured (concepts and definitions) and their practical implementation (value domains and data structure).

57. Concepts and Definitions. These metadata describe and define the concepts that the data are measuring (e.g. income, education, turnover). They also define the objects of these measurements that are the units of some specified population (e.g. persons, families, businesses).

58. Variables and Value Domains. A variable combines a concept with a unit, resulting, conceptually, in measurements of the concept for each unit (e.g. income of a person; income of a family; turnover of a business unit). Variables can have different roles, and these roles are also part of the descriptions of the concepts. An important role is that of unit identifier. Other roles that may be important for data editing functions are: classification variable (with classes that may be provided by a central classification server); stratification variable (defining strata for which some data editing functions are performed separately).

59. A unit data set, as is considered here, consists of the representation of the values of variables for a set of units. To describe this representation, the unit of measurement and value domain of the variables involved are specified. For quantitative variables, this could be, for example: thousands of Euros; non-negative real numbers. For categorical variables, this can be expressed by an enumeration of category codes and their meaning, e.g. 1 - male, 2 - female.
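The value domain metadata described above can be made operational as validation checks. The following is a minimal sketch; the variable names, codes and domains are invented for illustration:

```python
# Hypothetical value domain definitions (illustrative only): an enumerated
# domain for a categorical variable and a predicate for a quantitative one.
value_domains = {
    "sex": {"1", "2"},                 # enumerated codes: 1 - male, 2 - female
    "turnover": lambda v: v >= 0,      # non-negative reals, thousands of Euros
}

def check_value_domain(variable, value):
    """Return True if the value lies in the declared domain of the variable."""
    domain = value_domains[variable]
    if callable(domain):
        return domain(value)
    return value in domain

# Usage: values outside the declared domain fail the check.
assert check_value_domain("sex", "1")
assert not check_value_domain("sex", "3")
assert not check_value_domain("turnover", -5.0)
```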

60. Data Structure. A unit data set is an organised collection of values. This organisation is described by the data structure. The most common data structure is the record. A record is a collection of elements, typically in fixed number and sequence and indexed by serial numbers or identity numbers. The elements of records may also be called fields. Examples of other data structures are arrays, sets, trees and graphs.

61. A record data structure must always contain at least one variable that can be used as a unit identifier. A data set may contain units of different types, which might be hierarchically ordered, such as persons and households. Different types of units can have different record descriptions. A data structure can also have attributes that describe properties of the data set as a whole, such as the phases of the statistical process it has gone through, the time it was created, or the population and time it refers to.
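The record structure and the data-set-level attributes just described can be sketched as follows; the class and attribute names are invented for illustration:

```python
from dataclasses import dataclass, field

# Illustrative sketch: a record carrying a mandatory unit identifier, and a
# unit data set carrying attributes that describe the data set as a whole.
@dataclass
class Record:
    unit_id: str                          # unit identifier (mandatory)
    fields: dict = field(default_factory=dict)

@dataclass
class UnitDataSet:
    records: list
    reference_period: str                 # population/time the data refer to
    process_phase: str                    # e.g. "raw", "edited"

ds = UnitDataSet(
    records=[Record("B001", {"turnover": 120.0}),
             Record("B002", {"turnover": 85.5})],
    reference_period="2019",
    process_phase="raw",
)
assert all(r.unit_id for r in ds.records)   # every record is identifiable
```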

4.2.2 Additional Input

Auxiliary Data

62. Auxiliary data consist of data from sources other than the data being edited. Auxiliary data can be at the micro-level, when they are available for all, or some of, the same units as the data being edited. They can also be available at the macro-level, when they are aggregates, usually known totals of variables similar to or correlated with those being edited. The difference is that while the input data set is the object of the statistical data editing process (i.e. its plausibility is assessed and, if necessary, specified actions are taken), the auxiliary data only serve as reference information for one or more of the functions in the editing process and are not reviewed or modified themselves.

63. Micro-level auxiliary data sets are unit data sets consisting of measurements of variables on units. In this sense, they are similar to the input data set. Micro-level auxiliary variables can be used in review functions to assess the plausibility of the data. This practice includes the use of such variables in edit rules, score functions and outlier detection measures. Reference values from other sources can also aid in e.g. the detection of thousand errors. Macro-level auxiliary data can be used as input for review functions to evaluate the plausibility of aggregates in macro-editing.
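As an example of micro-level auxiliary data aiding a review function, the detection of thousand errors can be sketched by comparing reported values with reference values for the same units. The units, values and ratio bounds below are invented for illustration:

```python
# Sketch: flag potential "thousand errors" by comparing reported values with
# micro-level auxiliary reference values (e.g. t-1 or administrative data).
reported = {"B001": 1_250_000, "B002": 830, "B003": 412}
reference = {"B001": 1_210, "B002": 790, "B003": 405}

def flag_thousand_errors(reported, reference, low=500, high=2000):
    """Flag units whose reported/reference ratio is in the vicinity of 1000."""
    flags = {}
    for unit, value in reported.items():
        ref = reference.get(unit)
        if ref:
            ratio = value / ref
            flags[unit] = low <= ratio <= high
    return flags

flags = flag_thousand_errors(reported, reference)
# B001 is reported at roughly 1000 times its reference value and is flagged.
assert flags == {"B001": True, "B002": False, "B003": False}
```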

64. Micro-level auxiliary data sets can be used in treatment functions as predictors in imputation models or in distance functions for donor imputation. Macro-level auxiliary data such as totals, ratios between auxiliary totals or between auxiliary totals and totals of the statistical data can be used to determine parameters in imputation models. Where micro-level auxiliary data sets are used in data editing functions, the metadata of the auxiliary data are crucial in order to implement the functions correctly and efficiently. This implies that a thorough cross-check among the metadata is needed.

Parameters

65. Some methods need explicit values for one or more parameters. More generally, a parameter can be defined as an input used to specify which configuration should be used for a specific function. The assignment of fixed values to parameters is also part of the metadata that need to be specified before the process is started.

66. Imputation methods in treatment functions require specification of the variables to be used to obtain an imputation value. These can be the predictors in parametric imputation models, the variables in a distance function for nearest neighbour imputation, or the variables that define classes for hot-deck imputation within classes.

67. Selection of outlying values or combinations of values in review functions needs the specification of thresholds. Selection of influential suspicious units for manual editing in selection functions also needs the specification of thresholds.

68. Error localisation based on the generalised Fellegi-Holt paradigm needs the specification of reliability weights. Adjustment for consistency with hard edit rules likewise needs the specification of such weights.
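A much-simplified sketch can show how reliability weights enter error localisation. A necessary condition in the Fellegi-Holt approach is that the set of variables selected for change covers every failed edit; the brute-force search below approximates localisation as a minimum-weight covering problem. The reliability weights and failed edits are invented for illustration, and this is not a full Fellegi-Holt implementation (which would also generate implied edits):

```python
from itertools import combinations

# Simplified sketch: choose the minimum-reliability-weight set of variables
# such that every failed edit involves at least one selected variable.
reliability = {"turnover": 3.0, "costs": 2.0, "profit": 1.0}
failed_edits = [
    {"turnover", "costs", "profit"},   # e.g. profit = turnover - costs fails
    {"profit"},                        # e.g. profit >= 0 fails
]

def localise_errors(reliability, failed_edits):
    variables = sorted(reliability)
    best, best_weight = None, float("inf")
    for k in range(1, len(variables) + 1):
        for subset in combinations(variables, k):
            # The subset must intersect (cover) every failed edit.
            if all(edit & set(subset) for edit in failed_edits):
                weight = sum(reliability[v] for v in subset)
                if weight < best_weight:
                    best, best_weight = set(subset), weight
    return best

# Changing "profit" alone covers both failed edits at minimum total weight.
assert localise_errors(reliability, failed_edits) == {"profit"}
```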

Unstructured Metadata

69. Auxiliary metadata can also be gathered by domain specialists in a more or less unstructured way. Reference values for main variables may be available from, for example, annual reports of businesses. Also, information on the internet may be available about, for instance, a business’s current activities and products, or about statutory benefits.

70. Unstructured metadata can be used in interactive editing. Up-to-date information from websites can help in editing unit properties such as out-of-date NACE codes.

4.3 Process Output Metadata

71. The primary process output is the edited output data set. The metadata for this unit data set consist of the conceptual and structural metadata described earlier. Other metadata produced by the editing process are quality information for both the input data and the output data. Furthermore, information may be gathered about how the process has run that is not directly related to data quality (paradata).

4.3.1 Quality Measures

72. The review functions produce quality measures or indicators that are used by subsequent selection and treatment functions but are also of interest in their own right, since these measures reflect the quality of the input data. In particular, we mention the evaluated edit rules and the unit scores.

73. Failed edit matrix. The evaluated hard edit rules result in an N × K (number of units by number of edit rules) matrix of Boolean values. This matrix can be summarised in several ways. In particular, we can consider a unit view which gives the number of failed edits for each unit or an edit view which gives the number of failures for each edit. When each edit is linked to the variables involved in that edit, we can also obtain a variable view which gives the number of times a variable is involved in a failed edit.
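The three summary views of the failed edit matrix can be sketched as follows; the Boolean matrix and the links between edits and variables are invented for illustration:

```python
# Sketch: unit, edit and variable views of an N x K failed edit matrix
# (True = the unit fails the edit).
failed = [
    [True, False, True],    # unit 1
    [False, False, False],  # unit 2
    [True, True, False],    # unit 3
]
# Link each edit to the variables involved in it.
edit_variables = [{"turnover"}, {"costs"}, {"turnover", "costs"}]

# Unit view: number of failed edits per unit.
unit_view = [sum(row) for row in failed]
# Edit view: number of failing units per edit.
edit_view = [sum(col) for col in zip(*failed)]
# Variable view: number of times each variable is involved in a failed edit.
variable_view = {}
for row in failed:
    for fails, variables in zip(row, edit_variables):
        if fails:
            for v in variables:
                variable_view[v] = variable_view.get(v, 0) + 1

assert unit_view == [2, 0, 2]
assert edit_view == [2, 1, 1]
assert variable_view == {"turnover": 3, "costs": 2}
```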

74. Scores. The unit scores provide information on the unit quality and the influence of units.

75. Both the failed edit matrix and the unit scores should be evaluated after each process step to monitor the effects of each data editing step separately on these quality measures.

76. Imputation flags. Imputation flags typically indicate whether a variable has been imputed by way of a binary (0/1) indicator which may be appended to the output dataset or held in a separate file. They can be summarised to the unit level to indicate whether a unit has been imputed or not. Imputation flags act as a link between the input and output data and can facilitate quality assurance, calculation of imputation rates and the estimation of the imputation impact on results.
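The summarisation of imputation flags to the unit level and their use for imputation rates can be sketched as follows; the flags and variable names are invented for illustration:

```python
# Sketch: binary imputation flags per variable (1 = imputed, 0 = observed),
# summarised to the unit level and used to compute item imputation rates.
flags = {
    "B001": {"turnover": 0, "costs": 1},
    "B002": {"turnover": 1, "costs": 1},
    "B003": {"turnover": 0, "costs": 0},
}

# Unit-level flag: a unit counts as imputed if any of its variables was imputed.
unit_flags = {u: int(any(f.values())) for u, f in flags.items()}
assert unit_flags == {"B001": 1, "B002": 1, "B003": 0}

# Item imputation rate per variable.
n = len(flags)
rates = {v: sum(f[v] for f in flags.values()) / n for v in ["turnover", "costs"]}
assert rates == {"turnover": 1 / 3, "costs": 2 / 3}
```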

4.3.2 Paradata

77. Paradata can arise by monitoring the different kinds of actions that took place in the process steps. This can result in counts for these actions and the time involved.

78. The information from paradata can trigger a review of process parameters or adaptations to the process design to improve the efficiency and effectiveness of the process.

4.4 Input and Output Metadata by Function Types

79. As mentioned in Chapter 2, each function type is characterised by its core type of input and output metadata:

  • Review
    • Input metadata: metadata describing the input data set (e.g. data structure, concepts and definitions, variables and value domains), as well as parameters like valid values, or limits by variable in edit rules or thresholds in outlier detection.
    • Output metadata: quality measurements as evaluation of review functions (e.g. unit response rate, item response rates, number of units failing edit rules, the rate of edit failures).
  • Selection
    • Input metadata: selection criteria (e.g. Fellegi-Holt paradigm, deterministic rules, score function, threshold).
    • Output metadata: indicators defining subsets of units and/or variables of the input data set specified for further processing (e.g. critical units for selective editing, erroneous record).
  • Treatment
    • Input metadata: metadata from review and selection functions identifying subsets of units and variables for the application of the treatment function, parameters to apply the treatment method (e.g. predictors in parametric imputation models, the variables in a distance function for nearest neighbour imputation).
    • Output metadata: imputation flags and quality measurements as evaluation of treatment functions (e.g. imputation rate and impact).
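The interplay of the three function types can be sketched as a minimal review → selection → treatment flow. The edit rule, treatment method and data below are invented for illustration:

```python
# Sketch: a minimal flow combining the three function types.
data = [
    {"id": "B001", "turnover": 500.0, "costs": 300.0},
    {"id": "B002", "turnover": 100.0, "costs": 400.0},   # fails the edit rule
]

def review(record):
    """Review function: evaluate the edit rule turnover >= costs."""
    return record["turnover"] >= record["costs"]

def select(data):
    """Selection function: select the units failing review for treatment."""
    return [r for r in data if not review(r)]

def treat(record):
    """Treatment function: here, a deductive correction setting costs = turnover."""
    record["costs"] = record["turnover"]

for record in select(data):
    treat(record)
assert all(review(r) for r in data)   # the output data set passes the edit rule
```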

4.5 Metadata Summary Table

80. Table 5 summarises all the metadata concepts introduced that are relevant for the SDE process, with the corresponding GSIM information objects.

Table 5. Metadata Summary Table

| Metadata concepts in GSDEM | Examples | Corresponding GSIM information object |
|---|---|---|
| Process input | | Process input |
| - Input data | Survey data to which review, selection or treatment functions should be applied | Transformable input |
| - Auxiliary data | Frame, t-1 data for repetitive surveys, relevant administrative data | Process support input |
| - Parameter | Parameters for outlier detection, thresholds for score functions | Parameter input |
| Process step and process flow | | |
| - Process step | Domain editing, selective editing | Process step |
| -- Function | Review of eligible units, selection of units affected by influential errors, correction of systematic errors | Business function |
| -- Method | Edit rules by valid values, random error localization, regression imputation | Process method |
| - Process control | Presence of influential units, presence of suspicious aggregates | Process control |
| Process output | | Process output |
| - Output data set | Survey data transformed by treatment functions | Transformed output |
| - Quality measure and paradata | Number of edit violations, imputation flags | Process metric |
| Conceptual and structural metadata to describe input, output and auxiliary data | | |
| - Unit data set | Survey data set, administrative data set | Unit data set |
| - Concept | Income, turnover | Concept |
| - Unit | Household, individual, enterprise | Unit |
| - Population | Enterprises in Germany, 1 January 2019 | Population |
| - Variable | Income of households, turnover of manufacturing enterprises | Variable |
| - Value domain | Sex in {M, F}, Income > 0 | Value domain |
| - Data structure | Record, array | Data structure |
