5.1 Introduction

81. This chapter describes the elements that can be used to design statistical data editing processes, taking into account the design elements and constraining factors that may determine the SDE process flow model. In GSIM terminology, the SDE process may be thought of as a “business process” composed of “process steps” and “process controls”, which can be combined in different ways according to different scenarios. More precisely, a SDE process flow can be defined as the representation of the sequencing and conditional logic among different process steps. The sequence of process steps in the process flow is governed by process controls. The SDE process flow models, or more briefly the SDE flow models, aim to clarify which activities are executed inside the SDE process and how these activities are linked and managed.

5.2 Elements of the Process Flow Models

82. The following SDE flow models focus on genuine data editing activities. However, a SDE process flow sometimes includes activities not genuinely related to SDE, the results of which may influence how some of the genuine data editing activities are executed. These required activities typically include coding, linkage, derivation of new variables (synthetic variables) and weighting. The outcomes of these activities then also undergo the review-selection-treatment functions. These activities are not represented in the following SDE process flow models in order to keep them as simple as possible. However, “linkage and alignment” is briefly described in paragraph 87.

83. A process step is a set of specific functions, each with its own implemented methods, that are executed in an organised way for a specific SDE purpose. Process steps allow different data states to be distinguished and therefore allow the monitoring of previous process steps as well as of process loops.

84. The navigation between process steps is managed by rule-based process controls. A process control is called trivial when a process step is followed by the same next process step under all circumstances, and non-trivial when a step can be followed by several alternative steps, depending on the conditions of the process control.
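
As a minimal illustration (the step names, unit fields and routing condition below are invented for the example, not prescribed here), a process control can be modelled as a function that maps the state of a unit to the name of the next process step:

```python
def trivial_control(unit):
    """Trivial control: the same next step under all circumstances."""
    return "automatic_editing"

def nontrivial_control(unit):
    """Non-trivial control: the next step depends on a condition."""
    if unit.get("influential", False):
        return "interactive_editing"  # influential units go to a human editor
    return "automatic_editing"        # all other units are treated automatically

units = [{"id": 1, "influential": True}, {"id": 2, "influential": False}]
routes = {u["id"]: nontrivial_control(u) for u in units}
# routes == {1: "interactive_editing", 2: "automatic_editing"}
```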

85. The delineation of process steps and controls is chosen to highlight the design considerations for the overall data editing according to the type of process under analysis. Common examples of such considerations are: “first treat errors that can be resolved with high reliability and at little cost”, “apply interactive editing only to units with influential suspect values” and “always apply macro-editing”. The first consideration leads to a process step “Initial E&I” that contains a number of functions involved in, e.g., the treatment of systematic errors. The second consideration leads to the process steps “Interactive E&I” and “Automatic E&I”, which contain a number of functions that are applied to different parts of the data. The third consideration leads to a check needed to finish the SDE process and may result in a loop back to the process step implementing the second consideration when macro-editing fails.

86. The main process steps and process controls which are commonly used to describe a SDE process flow are listed below.

Process Steps

  • Domain editing (in terms of units and variables). Check of the structural information objects defining the target population and the variables: e.g., verification and selection of eligible units and of classification variables (e.g. ISIC/NACE, legal status).
  • Editing systematic errors. This process step deals with obvious errors, which are easily detectable and treatable, and with systematic errors, which are less recognizable than the previous ones but whose treatment during this process step can be carried out with a high level of reliability.
  • Selective editing. Selective editing is a general approach for the detection of influential errors. It is based on the idea of looking for influential errors with respect to the main results in order to focus the most accurate treatment on the corresponding subset of units to limit the costs of interactive editing, while maintaining the desired level of quality of estimates (see MEMOBUST (2014), Selective editing).
  • Interactive editing. In interactive editing, micro-data are checked for errors and, if necessary, adjusted by a human editor using expert judgment (see MEMOBUST (2014), Manual editing). Interactive editing often follows selective editing, which may at least partially provide error localization, i.e. review and selection.
  • Automatic editing. The goal of automatic editing is to detect and treat errors and missing values in a data file in a fully automated manner, i.e. without human intervention (see MEMOBUST (2014), Automatic editing).
  • Macro editing (also known as output editing or selection at the macro level). It is a general approach to identify (select) the records in a data set that may contain potentially influential errors and outliers by analysing aggregates and/or quantities computed on (or extrapolated to) the whole population.
  • Variable reconciliation. This consists of aligning variable values observed at micro level in different sources. It also includes the procedures used for predicting the (latent) target variable given the observed ones.
  • Linkage and alignment. Linkage and alignment refers to the micro-data processing that is typically necessary when combining (linkage) and reconciling (alignment) the different units residing in multiple input sources. The common scenario is one where many relevant objects/units are present in the linked datasets, which can be potentially useful for the subsequent process step of deriving the statistical units of interest, such as persons, kinship relations, enterprises, etc. The alignment stage focuses on clarifying all the “links” that exist or are admissible, providing the basis for deriving the units afterwards.
  • Derivation of [Complex unit] structure. Derivation and check of the structure of complex unit (e.g. assignment of individuals to households, households to buildings). For instance, if the complex unit is the household: [complex unit] structure= “HH structure”.
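
To make the selective-editing idea above concrete, the following sketch computes one common form of local score (the design weight times the absolute deviation of the reported value from an anticipated value, scaled by an estimated total) and selects the critical set by a fixed threshold. The formula, field names and threshold are illustrative assumptions, not the prescribed method:

```python
def local_score(weight, reported, anticipated, estimated_total):
    """Influence-style score: weighted absolute deviation, scaled by the total."""
    return weight * abs(reported - anticipated) / estimated_total

def select_critical(units, threshold):
    """Return ids of units whose score exceeds a fixed threshold
    (the 'critical set' routed to interactive editing)."""
    total = sum(u["weight"] * u["anticipated"] for u in units)
    return [u["id"] for u in units
            if local_score(u["weight"], u["reported"], u["anticipated"], total) > threshold]

units = [
    {"id": 1, "weight": 10, "reported": 500, "anticipated": 100},  # large, influential deviation
    {"id": 2, "weight": 10, "reported": 105, "anticipated": 100},  # small, non-influential deviation
]
critical = select_critical(units, threshold=0.05)
# critical == [1]: only the first unit is sent to interactive editing
```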

87. It should be noted that the distinction between “linkage and alignment” and “derivation of complex unit structure” follows from the fact that these steps are generally applied sequentially: “linkage and alignment” always comes first, “derivation of complex unit structure” only afterwards. To understand the difference, it is useful to introduce an example. If, according to some input sources, a student has a different address from the parents, one may need to check the plausibility of this information by verifying whether or not the address is at the place of study. The result of this check, either positive or negative, is the outcome of what we call “alignment”, whereas actually assigning a dwelling or a household to that student is the construction of a statistical unit that comes only afterwards and is performed in the process step “Derivation of complex unit structure”.

88. As a further remark, note that the previously introduced process steps “linkage and alignment”, “derivation of complex unit structure” and “variable reconciliation” belong to the general group of activities named micro-integration, which aims at processing integrated data to make variables coherent and consistent at micro level (see MEMOBUST (2014), Microdata fusion).

Process Controls

  • Influential units. Selection of units with potentially influential values for interactive treatment.
  • Variable type (continuous, categorical, etc.). Selection of variables for specified treatment (e.g. imputation by some appropriate method, editing methods for categorical/continuous variables).
  • Suspicious aggregates. Selection of suspicious aggregates for detection of possibly important errors.
  • Unresolved micro-data. After the edit rules are reapplied following a treatment step, some micro-data may still be unresolved. This may be an indication that the set of edit rules for review and selection was not exhaustive, or that the treatment did not resolve all erroneous situations. In the first case, the set of edit rules has to be updated; in the second case, the unresolved units are selected for further treatment with alternative methods.
  • Hierarchical data. Verifying whether data have a hierarchical structure, that is, if there are units that can be grouped into more complex units (e.g. individuals in households, local units in enterprises).
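
The “unresolved micro-data” control can be sketched as re-applying the edit rules after treatment and selecting the units that still fail; the rules and record fields below are invented for illustration:

```python
# Hypothetical edit set: non-negativity and a balance edit between two variables.
edit_rules = [
    lambda r: r["turnover"] >= 0,            # rule 0: non-negativity
    lambda r: r["turnover"] >= r["profit"],  # rule 1: balance edit
]

def failed_rules(record):
    """Indices of the edit rules violated by a record."""
    return [i for i, rule in enumerate(edit_rules) if not rule(record)]

def unresolved(records):
    """Units still failing at least one edit rule after a treatment step."""
    return [r["id"] for r in records if failed_rules(r)]

treated = [
    {"id": 1, "turnover": 120, "profit": 30},  # consistent after treatment
    {"id": 2, "turnover": 50,  "profit": 80},  # still fails the balance edit
]
# unresolved(treated) == [2] -> select unit 2 for alternative treatment
```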

89. It is worth remarking that a process step or a process control may have the same name but use quite different methods or configurations from one SDE flow model to another. For instance, “automatic editing” can differ greatly from one situation to another, both in terms of methods and of difficulty. In some situations it could be performed using a deterministic approach based on IF-THEN rules, in others using the Fellegi-Holt paradigm. There are nevertheless at least two main reasons that justify the use of common names: (1) economy of exposition, and (2) emphasis of similarities or distinctions. For example, one may wish to emphasise that a key difference between two flow models is that there is no need for the process control “influential units” in one of them, while the same process control is of paramount importance in the other.

90. In Table 6, the main process steps and process controls of a SDE process are listed, and for each of them the relevant functions and methods introduced in the previous sections are reported.

Table 6. The Main Process Steps of SDE Process

| Process step | Function(s) (what) | Function type(s) | Method(s) (how) |
| --- | --- | --- | --- |
| Domain editing | Review and selection of eligible units | Review, Selection | IF-THEN |
| Domain editing | Review, selection and treatment of data properties (NACE, legal status etc.) | Review, Selection, Treatment | IF-THEN |
| Editing systematic errors | Review, selection and treatment of obvious errors | Review, Selection, Treatment | IF-THEN |
| Editing systematic errors | Review of systematic errors | Review | Cluster analysis, latent class analysis, edit rules, graphical editing (e.g. log for 1000 error) |
| Editing systematic errors | Identification of units affected by systematic errors (influential units) | Selection | IF-THEN, cluster analysis, latent class analysis |
| Editing systematic errors | Correction of systematic errors | Treatment | Deductive imputation, model-based imputation |
| Selective editing | Identification of units affected by influential errors | Review | Score calculation |
| Selective editing | Selection of units for interactive treatment, selection of units for non-interactive treatment, selection of units not to be treated | Selection | Selection by fixed threshold |
| Interactive editing | Treatment of units in the critical set | Review, Selection, Treatment | Re-contact, inspection of questionnaires |
| Automatic editing | Verification of data consistency with respect to the edit set | Review | Analysis of edit failures |
| Automatic editing | Localizing the variables affected by errors for each unit | Selection | IF-THEN, Fellegi-Holt paradigm, NIM (Nearest-Neighbour Imputation Method) |
| Automatic editing | Imputation of localized errors | Treatment | IF-THEN, deductive imputation, non-random imputation, random imputation, prorating, NIM |
| Automatic editing | Imputation of missing data | Treatment | IF-THEN, deductive imputation, non-random imputation, random imputation, NIM |
| Macro editing | Review and identification of suspicious aggregates and outliers (influential units) | Review, Selection | Outlier analysis, aggregate comparison within data set, aggregate comparison with external sources, aggregate comparison with results from history |
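
As one concrete example of the methods listed for systematic errors, a unit-of-measure (“factor 1000”) error can be detected by checking whether the log10-ratio of a reported value to a reference value is close to 3, and corrected deductively by dividing by 1000. The tolerance and function names below are illustrative assumptions, not a prescribed procedure:

```python
import math

def is_thousand_error(reported, reference, tol=0.5):
    """Flag a value whose log10-ratio to a reference value is close to 3,
    i.e. a value plausibly reported in units instead of thousands."""
    if reported <= 0 or reference <= 0:
        return False
    return abs(math.log10(reported / reference) - 3) < tol

def correct(reported, reference):
    """Deductive treatment: divide flagged values by 1000, leave others as-is."""
    return reported / 1000 if is_thousand_error(reported, reference) else reported

# A turnover of 250000 against a previous-period reference of 240 is flagged
# and corrected to 250.0; a plausible value of 260 is left unchanged.
```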

Data States

91. Process steps ruled by process controls have input data that are processed to produce output data. The main data states are:

  • Raw. Original data set that has not yet been edited. Note: this category includes data that may have been edited by the providing agency (for administrative data) or during collection (e.g. within an internet questionnaire or by field interviewers).
  • Edited DOS. Data set after the treatment of domain, obvious and systematic errors (DOS).
  • Edited LA. Data set after linking and aligning (LA) the different units residing in multiple input sources.
  • Critical. Data set containing potentially influential errors.
  • Non-critical. Data set without error or containing only non-influential errors.
  • Edited [name of the higher-level unit]-ST. Data set after editing the structure (ST) of the higher-level unit under analysis. For instance, when the higher-level unit is the household: Edited [name of the higher-level unit]-ST = “Edited HH-ST”.
  • Micro-edited [name of the unit]. Data set after editing of the variables referring to the specified units at micro level. For instance, when the unit is the household: Micro-edited [name of the unit]=“Micro-edited HH”.
  • Final. Data set at the end of the overall SDE process, after successful macro-editing.
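
In an implementation, the named data states above can be recorded as a simple enumeration (a sketch; the parameterised states such as “Edited HH-ST” and “Micro-edited HH” are omitted because their names depend on the unit under analysis):

```python
from enum import Enum

class DataState(Enum):
    RAW = "Raw"
    EDITED_DOS = "Edited DOS"
    EDITED_LA = "Edited LA"
    CRITICAL = "Critical"
    NON_CRITICAL = "Non-critical"
    FINAL = "Final"

def selective_editing_output(influential):
    """Selective editing splits one input state into two output states."""
    return DataState.CRITICAL if influential else DataState.NON_CRITICAL
```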

Design Elements

92. The design of a data editing business process, that is, which process steps and process controls to use and how to combine them, is determined by specific characteristics of the input and output data (referred to as “design input and output metadata”) and by constraining factors.

93. Design input elements

  • Input metadata.
      ◦ Units. Type of units: enterprises – large/small, individuals and/or households – hierarchical units, units from administrative sources, agricultural firms, macro/micro data.
      ◦ Variables. Types of variables: numerical, categorical. Statistical distributions: skewed, multimodal, zero-inflated. Relations between variables: edit rules.
      ◦ Survey. Type of survey: census/sample, structural surveys, short-term statistics, panel, register-based data, big data.
      ◦ Characteristics of auxiliary information. Reliability, timeliness, coverage, structured/unstructured, micro/macro.

94. Design output elements

  • Type of output to be disseminated (e.g. micro-data file, table of domain estimates, target parameters).
  • Quality requirements (e.g. required level of accuracy).

Constraining Factors

95. Constraining factors mainly refer to characteristics pertaining to organisational aspects that have a strong impact on the methodological choices. The most important constraining factors are:

  • Available resources (monetary budget, human resources, time).
  • Availability of auxiliary data (timeliness, quality, linkage procedure etc.).
  • Aim of SDE regarding completeness and coherence also with respect to external data sources and in view of further data integration.
  • Human competencies (knowledge and capacity).
  • IT (available software and hardware tools).
  • Legal constraints.
  • Policy decisions.

96. For instance, scarcity of people available for a manual review/follow-up of the observations may lead to the design of a completely automated data editing procedure. An example of a policy decision is the decision to limit re-contacts to reduce response burden. In Chapter 5.3, the influence of the above elements on the design of a business process will be clarified by the description of typical SDE flow models under different scenarios.

97. From a theoretical point of view, the design elements introduced above can be viewed as process controls, as they determine the choice of one SDE flow model over another. In fact, since a process step can be defined at different levels of granularity, the overall SDE process may itself be seen as a “process step” at a higher level, and hence the input-output characteristics and the constraining factors can be seen as “process controls” at this higher level.

5.3 SDE Process Flows under Different Scenarios

98. In this section, we provide some examples of “generic SDE flow models” for different types of statistical production processes (scenarios) in terms of type of investigated units (enterprises, households), variables (continuous, categorical), and sources (direct surveys, integrated sources). These models represent general proposals which might need to be adapted due to different production conditions mentioned in Chapter 5.2 as constraining factors.

99. In particular, we consider the following typical scenarios:

  • Structural business statistics
  • Short-term business statistics
  • Business census
  • Household statistics
  • Statistics through data integration.

100. The various types of processes are modelled starting from the one described in EDIMBUS (2007), which is represented in Figure 2 for structural business surveys. For each scenario, the elements conditioning the design will be highlighted.

101. Process steps are represented in a SDE flow model by rectangles. Data states are represented by ellipses, with names associated with the functions implemented in the preceding process steps. A trivial process control (i.e. a process step followed by the same next step under all circumstances) is represented by an arrow, and a non-trivial process control (i.e. several alternative next steps) is represented by a diamond, as it represents a branching in the process sequence. A dotted arrow is used for loops that are expected to stop after some cycles.
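
This notation can be mirrored in code: rectangles become step functions, diamonds (and trivial arrows) become control functions that return the name of the next step or None at the end of the flow, and the dotted loop arrows are bounded by an iteration limit. The three-step flow below is a loose, invented illustration, not the model of any specific figure:

```python
def run_flow(steps, controls, data, start, max_iter=20):
    """Execute a flow model: `steps` maps step names to functions data -> data,
    `controls` maps step names to functions data -> next step name (or None).
    `max_iter` bounds loops, mirroring dotted arrows that stop after some cycles."""
    trace = [start]
    step = start
    for _ in range(max_iter):
        data = steps[step](data)        # rectangle: execute the process step
        step = controls[step](data)     # diamond or arrow: choose the next step
        if step is None:                # end of the flow
            break
        trace.append(step)
    return trace, data

# Invented minimal flow: an initial step, an interactive step for suspicious
# data, and a macro-editing check that loops back while aggregates look bad.
steps = {
    "initial":     lambda d: {**d, "clean_pass": d["clean_pass"] + 1},
    "interactive": lambda d: {**d, "suspicious": False},
    "macro":       lambda d: d,
}
controls = {
    "initial":     lambda d: "interactive" if d["suspicious"] else "macro",
    "interactive": lambda d: "macro",
    "macro":       lambda d: "interactive" if d["suspicious"] else None,
}
trace, final = run_flow(steps, controls, {"suspicious": True, "clean_pass": 0}, "initial")
# trace == ["initial", "interactive", "macro"]
```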

Table 7. Elements of the SDE Flow Models

| Symbol | Meaning |
| --- | --- |
| Rectangle | Process step |
| Ellipse | Data state |
| Diamond | Non-trivial process control |
| Line | Trivial process control |

Scenario A. Structural Business Statistics

102. Structural business statistics are usually based on cross-sectional sample surveys in which a large number of variables, mostly quantitative, may be collected. Starting from the generic SDE process flow model and taking into account those key elements, the SDE flow model for structural business statistics is depicted in Figure 2.

Figure 2. SDE Flow Model for Structural Business Statistics


Scenario B. Short-Term Business Statistics

103. Short-term business statistics (STS) are usually based on panel surveys which are characterised by few variables and a short production process. The output is in the form of indices and variation values at aggregated levels.

104. The SDE flow model in this scenario mainly aims to deal with influential errors on the main target variable to ensure accurate aggregates/estimates in a short time. Due to time constraints, “automatic editing” is performed (e.g. if micro-data are to be released/published) only once the interactive verification of influential data has been completed.

105. The following model represents a general proposal for STS surveys; it has to be underlined that, for such processes, the constraints can strongly influence the way the flow is managed. The choice of a specific treatment strategy mostly depends on:

  • Available resources (e.g. time, human and financial)
  • Efficiency of automatic editing.

106. For example, it might not be necessary to loop back to selective editing after the detection of suspicious aggregates, for instance outliers, which might instead be treated during weighting (not represented in the flow model) or by automatically imputing them. In addition, the detection of suspicious aggregates might already identify the units responsible for those aggregates and, therefore, the loop could point directly to interactive editing. Furthermore, the flow model does not explicitly state whether interactive editing treats only the variable values involved in the detection of influential units or all variable values of these units.

Figure 3. SDE Flow Model for Short-Term Business Statistics


Scenario C. Business Census

107. In the case of business censuses, due to the large number of units and variables, more emphasis is given to automatic procedures.

108. Selective editing is performed only on those data that determine suspicious aggregates, in order to verify the possible presence of residual errors (i.e. errors that are not identified in previous process steps of the SDE process).

Figure 4. SDE Flow Model for Business Censuses


Scenario D. Household Statistics

109. The SDE flow model for household statistics mainly depends on two design elements:

  • Type of investigated units
  • Type of observed variables.

110. Concerning the first element, household statistics may be based on either hierarchical data (individuals belonging to households) or individual data. In case of hierarchical data, the SDE process can be structured in different ways:

  • Data editing activities of household (HH) variables and individual variables are performed separately. In this case, the SDE flow consists of two sequential sub-processes, where the data editing activities performed in the last sub-process depend on (are constrained to) the outputs of the first one (Figure 5).
  • HH variables and individual variables are edited and imputed jointly (this is allowed, for example, by using the NIM/Canceis methodology). In this case, the process steps relating to the HH structure, the HH variables and the individual variables are performed in a unique sub-process.

111. The model is complicated if mixed types of variables (both categorical and continuous) are collected for the population units (e.g. in case of economic variables like income, expenses, etc. observed in a household expenditure survey). In this case, the editing of categorical and continuous variables can be performed:

  • Separately: in this case, the SDE process will include different process steps, each dealing with a different type of variable. Note that in this case a hierarchy among the two SDE sub-processes has to be specified if the categorical and the continuous variables are related to each other;
  • Jointly: in this case, the automatic treatment of categorical and continuous variables can be performed in a unique step (as allowed, for example, by the NIM/Canceis methodology). However, a preliminary step for the identification of extreme values for continuous variables is generally performed.

112. A generic model representing the typical SDE flow is the one reported in Figure 5.

Figure 5. SDE Flow Model for Household Statistics


Scenario E. Statistics Through Data Integration

113. Data integration has developed strongly in recent years. Currently, many scenarios are feasible, although usually a number of administrative data sets from external sources are used and integrated. In other cases, the administrative data can be integrated with surveys as well. In the following, the SDE strategy is described as structured in MEMOBUST (2014), such that editing is performed on each source first, and then jointly after a linkage and alignment step. The scenario would change if one of the sources also contained survey data.

Figure 6. SDE Flow Model for Statistics Through Data Integration


