2.1 Background

6. The idea of creating a generic process framework for statistical data editing was raised at the UNECE Work Session on Statistical Data Editing in 2014. The report of the work session identified, under future work, the need to develop a "common, generic process framework for statistical data editing", suggesting that "this could be done by a task team under the High-Level Group for the Modernisation of Statistical Production and Services¹, and presented at the next Work Session".

7. The UNECE launched a call for expressions of interest, and a task team was subsequently established. The output of this task team was the Generic Statistical Data Editing Model (GSDEM) version 1.0, which was launched at the Workshop on the Modernisation of Official Statistics in 2015.

8. The GSDEM is envisaged as a standard reference for statistical data editing, comparable, from a methodological point of view, to the suite of standard models and methods for survey estimation. By providing standard terminology and models, the GSDEM aims to facilitate understanding, communication, practice and development in the field of statistical data editing.

9. In 2018, at the Workshop on Statistical Data Editing, a team was established to review the GSDEM and update it if necessary. The current version of the GSDEM (version 2.0) is the result of that revision. Whilst this version is considered final at the time of release, future updates may prove necessary in the coming years, either to reflect further experience of using the model in practice, or to follow the evolving nature of statistical data editing. Readers are therefore invited to check the UNECE Statistical Data Editing Wiki² to be sure of having the latest version.

2.2 Statistical Data Editing Process

10. The statistical data editing (SDE) process can be represented as follows (Figure 1): data and metadata are provided as input; a series of activities is performed to assess data plausibility, identify potential problems and remedy them; and transformed data are produced as output. The process operates within the constraining factors shown in the figure.

Figure 1. Statistical Data Editing Process

11. Statistical data editing is often also referred to as editing and imputation, or “E&I” for short. Throughout this document, we will primarily use the former term (as well as “data editing”, or simply “editing”), but occasionally use the latter when referring to commonly used expressions such as “Initial E&I”.

12. The data editing process is chiefly composed of business functions that perform specific tasks with specific purposes. In the context of statistical data editing, we refer to these functions as “data editing functions” or “functions” for short. In terms of their purpose, these functions can be divided into three function types: "review", "selection" and "treatment", based on terms proposed by Pannekoek and Zhang (2012). These types of functions can also be viewed as high-level functions themselves, thereby characterizing the data editing process as performing three tasks: review, selection and/or treatment.

13. Statistical data editing either involves or affects all eight phases of the Generic Statistical Business Process Model (GSBPM)³. Under the GSBPM, statistical production can be viewed as a process of transforming the initial input data into statistical information. Data editing is part of this production process. Sometimes, all the data editing activities can be grouped to form a "fixed segment" in the chain, with one point of entry and one point of exit (for example, GSBPM Sub-processes 5.3 "Review and validate" and 5.4 "Edit and impute").

14. Generally, however, data editing services may be applied at different places in the data life-cycle, including cases where previously processed data are reused and combined with other data to generate new statistical outputs, such as editing for national accounts or other macro accounts. For instance, weighting of sample units is a process within GSBPM Phase 5. By convention, weighting is not considered a data editing activity, although it is a statistical function that may be relevant both to the input and to the output of data editing.

15. From this perspective, editing during data collection (GSBPM Phase 4), including within collection instruments, constitutes either a data editing process or sub-process, depending on the scope of the process and the interpretation of the input and output data. Traditionally, there has also been debate about the distinction between imputation from the editing perspective and imputation from the estimation perspective. Again, by clarifying the purposes and usages of the treatment functions in a data editing process, it becomes a matter of convention whether a particular imputation process is considered part of the data editing process.

16. In this document, the focus is mainly on the implementation of data editing processes. The design, development and evaluation of editing strategies are generally considered out of scope, though reference is made to process metrics (paradata), as this information may be used directly in the editing process. The articulation of the data editing functions provides the focus of this framework, while the service-orientated perspective helps to delimit the scope of what is described, keeping the framework sufficiently generic.

17. Finally, the current document is primarily orientated towards the treatment of the data to achieve fitness for use. Other important goals of statistical data editing, such as quality assessment and future error prevention, can take the results of the various data editing review and selection functions as their point of departure, but are not elaborated here.

2.3 Common Terminology

18. The overall statistical data editing process described in this document contains a number of activities or tasks that aim to assess the plausibility of the data, identify potential problems and perform selected actions intended to remedy the identified problems.

19. Below are proposals for definitions and descriptions of some of the main elements that can be distinguished in the data editing process and its inputs and outputs. The proposals are based on more general definitions from the Generic Statistical Information Model (GSIM)⁴, applied to the specific context of data editing. There are three sections:

  • Functions and methods. This section describes the elements associated with different data editing tasks.
  • Metadata types. This section describes the metadata that are needed to define and to describe the data editing process, as well as the outputs of the process.
  • Process flow, process step and control. This section describes the elements concerning the organisation of tasks within the data editing process.

2.3.1 Functions and Methods

Functions

20. The GSIM refers to a business function as "something an enterprise does, or needs to do, in order to achieve its objectives". A statistical data editing function is a business function that serves a specific purpose in the chain of activities defining the data editing process, and can be categorized into three broad function types:

  • Review. Functions that examine the data to identify potential problems.
  • Selection. Functions that select units or fields within units for specified further treatment.
  • Treatment. Functions that change the data in a way that is considered appropriate to improve the data quality. The modification of specific fields within a unit (i.e. filling in missing values or changing erroneous ones) is referred to as imputation.

21. Common examples of functions in each of the three categories are:

  • Review: measuring the plausibility of values or combinations thereof; assessing data for logical consistency; measuring plausibility of macro-level estimates.
  • Selection: selection of units for interactive treatment; selection of outlying units for specific treatment; selection of influential outlying values for specific treatment; selection of variables for treatment by specific imputation methods; localising erroneous values among those that are inconsistent.
  • Treatment: imputation of missing or discarded (erroneous) values; correction of systematic errors; adjustment for inconsistencies.

22. The different types of functions are often linked and ordered as follows: review functions lead to quality indicators or measures that can point out specific problems in the data; selection functions take quality indicators and/or selection criteria (thresholds) and data as input and produce indicators identifying records or fields within records for further treatment; finally, treatment functions change or impute the selected data values in order to resolve the problems detected earlier. The results may then be subject to another (or the next) review activity.
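As an illustration only (and not part of the GSDEM itself), the following minimal Python sketch shows how this review-selection-treatment cycle might look in code. The function names, the edit rule (turnover per employee should not exceed a fixed ratio) and the data are hypothetical assumptions made for the example.

```python
# Minimal sketch of a review -> selection -> treatment cycle.
# All names, the edit rule and the data are hypothetical illustrations,
# not part of the GSDEM specification.

records = [
    {"turnover": 1200, "employees": 10},
    {"turnover": None, "employees": 5},    # missing value
    {"turnover": 90000, "employees": 2},   # implausibly high ratio
]

def review(record):
    """Review: examine the data and produce quality indicators."""
    indicators = {"missing_turnover": record["turnover"] is None}
    # Hypothetical edit rule: turnover per employee should not exceed 5000.
    indicators["ratio_edit_failed"] = (
        record["turnover"] is not None
        and record["employees"] > 0
        and record["turnover"] / record["employees"] > 5000
    )
    return indicators

def select(record, indicators):
    """Selection: decide whether the record needs further treatment."""
    return indicators["missing_turnover"] or indicators["ratio_edit_failed"]

def treat(record):
    """Treatment: impute a value for the problematic field. A crude
    ratio imputation (5000 per employee) stands in here for a properly
    specified imputation method."""
    record["turnover"] = 5000 * record["employees"]
    return record

for record in records:
    indicators = review(record)        # review
    if select(record, indicators):     # selection
        record = treat(record)         # treatment
    # The treated record may then be subject to another review.
```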

23. Each function type is characterized by its core types of input and output data and metadata. All function types use the data as input, while only treatment functions produce new data as output. The input and output metadata for each function type are discussed in Chapter 4.

Methods

24. Data editing functions specify what action is to be performed in terms of its purpose, but not how it is performed. The latter is specified by the process method. Examples of methods for different function types are:

  • Review: evaluating a specified score-function or set of edit rules; calculating specific measures for outlier detection.
  • Selection: using a specified criterion for outlier selection; using a specified threshold on a specific score function for selective editing; selection of units within a specified percentage of the highest score values; application of Fellegi-Holt error localisation with specified weights.
  • Treatment: specific imputation methods and models for specified variables; adjustment for consistency of specific variables with a specific algorithm; amendment of values by subject matter specialists.
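To make the distinction between function and method concrete, the sketch below implements one possible method for a selection function: a simple score function for selective editing, in which units whose scores exceed a threshold are routed to interactive treatment. The score formula (design weight times the absolute deviation of the reported value from an anticipated value) is one common form in the selective editing literature; the unit data and the threshold are assumptions for illustration.

```python
# Sketch of one possible selection method: a score function for
# selective editing. The score formula, the unit data and the threshold
# are illustrative assumptions, not prescribed by the GSDEM.

def score(weight, reported, anticipated):
    """Design weight times the absolute deviation of the reported
    value from an anticipated value."""
    return weight * abs(reported - anticipated)

units = [
    {"id": "A", "weight": 10, "reported": 520, "anticipated": 500},
    {"id": "B", "weight": 50, "reported": 900, "anticipated": 450},
    {"id": "C", "weight": 5,  "reported": 470, "anticipated": 480},
]

THRESHOLD = 1000  # hypothetical threshold parameter of the method

selected = [
    u["id"] for u in units
    if score(u["weight"], u["reported"], u["anticipated"]) > THRESHOLD
]
print(selected)  # -> ['B']: only unit B goes to interactive treatment
```

The same selection function could instead be parameterized with a different method, for example selecting the units within a specified percentage of the highest score values, without changing the function's purpose.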

2.3.2 Metadata Types

25. The SDE process has input data and input metadata as its inputs. Input data are the data that are the object of the editing activities. Input metadata consist of all other information that is needed for the process to run. On the output side, there are transformed data, which correspond to the input data (with modifications), and output metadata, which contain further information produced by the process.

26. According to Figure 1, the metadata needed to define and describe the data editing process can be broadly categorized as follows:

  • Process input metadata. The information objects that describe the input of the statistical data editing process. Process input metadata include conceptual and structural metadata elements used to describe the input data (units, variables, value domains, data sets, records, …) and additional information needed to apply the functions, such as auxiliary data and parameters.
  • Process steps and process flow metadata. The information objects needed to describe the statistical data editing process itself. Each process step is detailed in terms of functions and methods while the routing among process steps is governed through process control defining the flow.
  • Process output metadata. The information objects that describe the output of the statistical data editing process. Process output metadata include conceptual and structural metadata elements used to describe the output data. Other metadata produced by the editing process include quality information for both the input data and the output data. Furthermore, information may be gathered about how the process has run that is not directly related to data quality (paradata).
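As a rough illustration of how these metadata categories might be recorded for a single process step, the sketch below uses a plain Python data structure. The field names are hypothetical and do not come from GSIM; they merely show process input metadata (parameters, auxiliary information) and process output metadata (quality indicators, paradata) side by side.

```python
# Hypothetical representation of the metadata around one process step;
# all field names are illustrative, not taken from GSIM or the GSDEM.
process_step = {
    "name": "Automatic E&I",
    "function_type": "treatment",
    "method": "ratio imputation",
    "input_metadata": {
        "data_set": "business_survey_2023",
        "variables": ["turnover"],
        "parameters": {"auxiliary_variable": "employees", "ratio": 5000},
    },
    "output_metadata": {
        "quality_indicators": ["imputation_rate"],
        "paradata": ["run_time", "records_processed"],
    },
}
```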

2.3.3 Process Step, Process Flow and Control

27. The elements required to design and describe a specific statistical data editing process consist of the following.

Process Step

28. An operational data editing process usually contains a considerable number of functions with specified methods that are executed in an organized way. To describe the organisation of the overall process in a comprehensible way, it is useful to subdivide the process into a limited number of process steps and to describe the organisation in terms of these process steps.

Process Flow and Control

29. The description of a process in terms of process steps must also include a specification of the routing among them. The process flow shows which process steps are performed and the sequence in which they are performed. A trivial sequence occurs when a step is followed by the same next step under all circumstances. When a step can be followed by several alternative steps, depending on some conditions, this is managed by a flow element called a control, which describes a branching in the process sequence.

30. Examples of generic high-level process steps include the following:

  • Initial E&I (or Domain Editing and Editing Systematic Errors)
  • Automatic E&I
  • Interactive E&I
  • Macro E&I
  • Linkage & alignment

31. Examples of controls include the following:

  • Selection of units with influential suspicious values for interactive treatment.
  • Selection of variables within units for specified treatment (e.g. imputation by some appropriate method, editing methods for categorical/continuous variables).
  • Finding the underlying causes of suspicious aggregates.
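A control can thus be thought of as a branching predicate in the process flow. The sketch below, with entirely hypothetical step names and selection criterion, routes each unit either to interactive or to automatic E&I depending on whether it carries an influential suspicious value (the first example control above).

```python
# Sketch of a process flow with one control (branching). The step names
# and the selection criterion are hypothetical illustrations.

def influential_and_suspicious(unit):
    """Control: route units with influential suspicious values to
    interactive treatment, and the rest to automatic E&I."""
    return unit["score"] > 1000  # hypothetical selection criterion

def interactive_ei(unit):
    print(f"unit {unit['id']}: sent to a subject matter specialist")

def automatic_ei(unit):
    print(f"unit {unit['id']}: treated by automatic E&I")

for unit in [{"id": "A", "score": 22500}, {"id": "C", "score": 50}]:
    if influential_and_suspicious(unit):   # control: branching point
        interactive_ei(unit)               # process step 1
    else:
        automatic_ei(unit)                 # process step 2
```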

2.4 Topics Covered in the Remainder of This Document

32. The remainder of this document considers different aspects of the GSDEM:

  • Chapter 3 considers data editing functions and methods in detail, provides further examples of each, and includes some comments on practical solutions.
  • Chapter 4 considers issues related to the different types of metadata associated with the statistical data editing process.
  • Chapter 5 considers the elements defining a process (process steps, flow and control) and gives examples of data editing flows for different statistical domains.

  1. Currently the High-Level Group for the Modernisation of Official Statistics (HLG-MOS)     
  2. UNECE Statistical Data Editing Wiki: https://statswiki.unece.org/display/sde     
  3. The GSBPM version used throughout this document is GSBPM v5.1. For more, see UNECE GSBPM Wiki: http://www1.unece.org/stat/platform/display/GSBPM     
  4. GSIM provides a set of standardized, consistently described information objects that are the inputs and outputs in the design and production of statistics. The GSIM version used throughout this document is GSIM v1.2. For more, see UNECE GSIM Wiki: https://statswiki.unece.org/display/gsim     
