2.1 Background

6. The idea of creating a generic process framework for statistical data editing was raised at the UNECE Work Session on Statistical Data Editing in 2014. The report of the work session identified, under future work, the need to develop a "common, generic process framework for statistical data editing", suggesting that "this could be done by a task team under the High-Level Group for the Modernisation of Statistical Production and Services [currently the High-Level Group for the Modernisation of Official Statistics (HLG-MOS)], and presented at the next Work Session".

7. The UNECE launched a call for expressions of interest, and a task team was subsequently established. The output of this task team was the Generic Statistical Data Editing Model (GSDEM) version 1.0, which was launched at the Workshop on the Modernisation of Official Statistics in 2015.

8. The GSDEM is envisaged as a standard reference for statistical data editing, in the same way, from a methodological point of view, as the suite of standard models and methods for survey estimation. By providing standard terminology and models, the GSDEM aims to facilitate understanding, communication, practice and development in the field of statistical data editing.

9. In 2018, at the Workshop on Statistical Data Editing, a team was established to review the GSDEM and update it if necessary. The current version of the GSDEM (version 2.0) is the result of that revision. Whilst this version is considered final at the time of release, future updates may be necessary in the coming years, either to reflect further experience from using the model in practice or because the nature of statistical data editing evolves. Readers are therefore invited to check the UNECE Statistical Data Editing Wiki (https://statswiki.unece.org/display/sde) to be sure of having the latest version.

2.2 Statistical Data Editing Process

10. The statistical data editing (SDE) process can be represented as follows (Figure 1): data and metadata are provided as an input, a series of activities are performed to assess data plausibility, identify potential problems and remedy the problems; and transformed data are produced as an output. The process is set according to constraining factors as shown.

Figure 1. Statistical Data Editing Process

11. Statistical data editing is often also referred to as editing and imputation, or “E&I” for short. Throughout this document, we will primarily use the former term (as well as “data editing”, or simply “editing”), but occasionally use the latter when referring to commonly used expressions such as “Initial E&I”.

12. The data editing process is chiefly composed of business functions that perform specific tasks with specific purposes. In the context of statistical data editing, we refer to these functions as “data editing functions” or “functions” for short. In terms of their purpose, these functions can be divided into three function types: "review", "selection" and "treatment", based on terms proposed by Pannekoek and Zhang (2012). These types of functions can also be viewed as high-level functions themselves, thereby characterizing the data editing process as performing three tasks: review, selection and/or treatment.

13. Statistical data editing either involves or affects all eight phases of the Generic Statistical Business Process Model (GSBPM). (The GSBPM version used throughout this document is GSBPM v5.1. For more, see UNECE GSBPM Wiki: http://www1.unece.org/stat/platform/display/GSBPM) Under the GSBPM, statistical production can be viewed as a process of transforming the initial input data into statistical information. Data editing is part of this production process. Sometimes, all the data editing activities can be grouped to form a "fixed segment" in the chain with one point of entry and one point of exit (for example by GSBPM Sub-processes 5.3 “review and validate” and 5.4 “edit and impute”).

14. Generally, however, data editing services may be applied at different places during the data life-cycle, including the case when previously processed data are reused and combined with other data to generate new statistical outputs, such as editing for national accounts or other macro accounts. For instance, weighting of sample units is a process within GSBPM Phase 5. By convention weighting is not considered a data editing activity, although it is a statistical function that may be relevant both to the input and the output of data editing.

15. From this perspective, editing during data collection (GSBPM Phase 4), including within collection instruments, constitutes either a data editing process or sub-process, depending on the scope of the process and the interpretation of the input and output data. Traditionally, there are also debates between imputation from the editing perspective and imputation from the estimation perspective. Again, by clarifying the purposes and usages of the treatment functions in a data editing process, we may or may not pay special attention to a particular imputation process as part of the data editing process and reach an agreement by convention.

16. In this document, the focus is mainly on the implementation of data editing processes. The design, development and evaluation of editing strategies are generally considered out of scope, though reference is made to process metrics (paradata), as this information may be used directly in the editing process. The articulation of the data editing functions provides the means of maintaining this focus, while the service-orientated perspective helps to delimit the scope of what is described, so that the framework remains sufficiently generic.

17. Finally, the current document is primarily orientated towards the treatment of the data to achieve fitness for use. Other important goals of statistical data editing, such as quality assessment and future error prevention, can take their points of departure from the results of the various data editing review and selection functions, but are not detailed or elaborated here.

2.3 Common Terminology

18. The overall statistical data editing process described in this document contains a number of activities or tasks that aim to assess the plausibility of the data, identify potential problems and perform selected actions intended to remedy the identified problems.

19. Below are proposals for definitions and descriptions of some of the main elements that can be distinguished in the data editing process and its inputs and outputs. The proposals are based on more general definitions from the Generic Statistical Information Model (GSIM), applied to the specific context of data editing. (GSIM provides a set of standardized, consistently described information objects that are the inputs and outputs in the design and production of statistics. The GSIM version used throughout this document is GSIM v1.2. For more, see UNECE GSIM Wiki: https://statswiki.unece.org/display/gsim) There are three sections:

2.3.1 Functions and Methods

Functions

20. The GSIM refers to a business function as “something an enterprise does, or needs to do, in order to achieve its objectives”. A statistical data editing function is a business function that serves a specific purpose in the chain of activities defining the data editing process, and can be categorized into three broad function types:

21. Common examples of functions in each of the three categories are:

22. The different types of functions are often linked and ordered as follows: review functions lead to quality indicators or measures that can point out specific problems in the data; selection functions take quality indicators and/or selection criteria (thresholds) and data as input and produce indicators identifying records or fields within records for further treatment; finally, treatment functions change or impute the selected data values in order to resolve the problems detected earlier. The results may then be subject to another (or the next) review activity.

23. Each function type is characterized by its core type of input and output data and metadata. All function types use the data as input, while only treatment functions produce new data as output. The input and output metadata for each function type is discussed in Chapter 4.
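The chaining of function types described in paragraphs 22 and 23 can be sketched in code. The following Python sketch is illustrative only and is not part of the GSDEM: the record layout, the `turnover` field, the non-negativity rule and the use of mean imputation are all assumptions chosen for brevity. It shows that review and selection read the data and emit only indicators, whereas treatment is the only function that returns new data.

```python
# Hypothetical data: each record is a dict; "turnover" is an assumed field name.
records = [
    {"id": 1, "turnover": 100.0},
    {"id": 2, "turnover": -5.0},   # fails the assumed non-negativity rule
    {"id": 3, "turnover": None},   # missing value
    {"id": 4, "turnover": 300.0},
]


def review(data):
    """Review: produce one quality indicator per record; the data are not changed."""
    # Assumed edit rule: turnover must be present and non-negative.
    return [r["turnover"] is None or r["turnover"] < 0 for r in data]


def select(indicators):
    """Selection: turn quality indicators into the positions of records to treat."""
    return [i for i, failed in enumerate(indicators) if failed]


def treat(data, selected):
    """Treatment: return NEW data in which the selected values are imputed."""
    chosen = set(selected)
    valid = [r["turnover"] for i, r in enumerate(data) if i not in chosen]
    # Mean imputation is used purely as an example method.
    fill = sum(valid) / len(valid) if valid else None
    return [
        {**r, "turnover": fill} if i in chosen else dict(r)
        for i, r in enumerate(data)
    ]


indicators = review(records)         # output metadata of the review function
selected = select(indicators)        # output metadata of the selection function
edited = treat(records, selected)    # only treatment yields new data
recheck = review(edited)             # the results go through the next review
```

Running `review` again on the treated data corresponds to the follow-up review activity mentioned in paragraph 22. Swapping mean imputation for another technique would change the method while leaving the treatment function, in the sense used here, unchanged.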

Methods

24. Data editing functions specify what action is to be performed in terms of its purpose, but not how it is performed. The latter is specified by the process method. Examples of methods for different function types are:

2.3.2 Metadata Types

25. The SDE process has input data and input metadata as its inputs. Input data are the data that are the object of the editing activities. Input metadata consist of all other information that is needed for the process to run. On the output side, there are transformed data, which correspond to the input data (with modifications), and output metadata, which contain further information produced by the process.

26. According to Figure 1, the metadata needed to define and describe the data editing process can be broadly categorized as follows:

2.3.3 Process Step, Process Flow and Control

27. The elements required to design and describe a specific statistical data editing process consist of the following.

Process Step

28. An operational data editing process usually contains a considerable number of functions with specified methods that are executed in an organized way. To describe the characteristics of the organisation of the overall process in a comprehensible way, it is useful to subdivide the process into a limited number of process steps and describe the organisation in terms of these process steps.

Process Flow and Control

29. The description of a process in terms of process steps must also include a specification of the routing among them. The process flow shows the process steps that are performed and the sequence in which they are performed. A trivial sequence is when a step is followed by the same step under all circumstances. When a step can be followed by several alternative steps, depending on some conditions, this is managed by a flow-element that is called a control, describing a branching in the process sequence.
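The distinction between a trivial sequence and a control can be sketched as a small routing table. This Python sketch is a minimal illustration under stated assumptions: the step names (`review`, `automatic`, `manual`, `final_review`), the `score` field and the threshold of 50 are hypothetical and are not taken from the GSDEM. Each step maps to a function that decides the next step; a trivial sequence always returns the same successor, while a control inspects the data and branches.

```python
def review_step(d):
    # Assumed indicator: absolute deviation from an expected value.
    return {**d, "score": abs(d["value"] - d["expected"])}


def automatic_step(d):
    # Assumed automatic treatment: replace the value by the expected one.
    return {**d, "value": d["expected"], "treated_by": "automatic"}


def manual_step(d):
    # Placeholder for interactive treatment; the value itself is left unchanged here.
    return {**d, "treated_by": "manual"}


STEPS = {
    "review": review_step,
    "automatic": automatic_step,
    "manual": manual_step,
    "final_review": review_step,
}

# Routing: every entry returns the name of the next step, or None to stop.
CONTROLS = {
    # A control: the next step depends on a condition on the data.
    "review": lambda d: "manual" if d["score"] > 50 else "automatic",
    # Trivial sequences: the same successor under all circumstances.
    "automatic": lambda d: "final_review",
    "manual": lambda d: "final_review",
    "final_review": lambda d: None,
}


def run_flow(data, steps, controls, start):
    """Execute process steps in the order given by the routing table."""
    current, trace = start, []
    while current is not None:
        trace.append(current)
        data = steps[current](data)
        current = controls[current](data)
    return data, trace


result, trace = run_flow({"value": 120.0, "expected": 100.0}, STEPS, CONTROLS, "review")
```

With the small deviation used above, the control routes the unit to the automatic step. A larger deviation, for example a value of 500.0, would route it to the manual step instead.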

30. Examples of generic high-level process steps include the following:

31. Examples of controls include the following:

2.4 Topics Covered in the Remainder of This Document

32. The remainder of this document considers different aspects of the GSDEM: