3.1 Introduction

33. Functions and methods are an essential part of describing the lower levels of the hierarchy in the construction of an SDE process. This chapter provides a more exact categorization and definition of the functions and methods in editing processes, with examples and explanations of their use. It uses concepts and structures which have appeared earlier in Camstra and Renssen (2011), Pannekoek and Zhang (2012), and Pannekoek et al. (2013), here presented with some modification.

34. The difference between functions and methods can be understood as follows: functions specify what data editing action is to be performed, while methods specify how these actions are to be performed. A function may be implemented by multiple methods, and one method can perform different functions.
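This distinction can be illustrated with a small Python sketch (not from the source; all names are hypothetical): one editing function, "impute a missing value", is performed by two different methods.

```python
# Illustrative sketch: one *function* (imputation), two *methods*.
# Variable names and data are invented for illustration.

def mean_imputation(values):
    """Method 1: replace missing values (None) by the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def historic_imputation(values, previous):
    """Method 2: replace missing values by the value from an earlier period."""
    return [p if v is None else v for v, p in zip(values, previous)]

# The same function (imputation) performed by two methods:
data = [10.0, None, 30.0]
print(mean_imputation(data))                         # mean of 10 and 30 is 20
print(historic_imputation(data, [9.0, 21.0, 29.0]))  # takes 21.0 from the earlier period
```

Conversely, one method (e.g. an edit rule) can serve several functions, such as review and selection.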

35. Chapters 3.2 and 3.3 define the terms “function” and “method”, split them into categories or types, and provide examples of different functions and methods. Chapter 3.4 considers the practical application of the concepts of functions and methods.

3.2 Functions

36. A statistical data editing function is a business function that serves a specific purpose in the chain of activities defining the data editing process. Functions can be categorized into three broad function types: review, selection and treatment.

37. The functions can be further categorized based on the task they are assigned to, the type of output they produce, and whether they apply to units or variables. Table 1 provides examples of these functions by function type and function category.

Table 1. Statistical Data Editing Functions (function type > function category > examples)

Review
  Review of data validity (by checking combinations of values)
    - Review of obvious errors
    - Assessing the logical consistency of combinations of values
    - Review of data properties
  Review of data plausibility (by analysis)
    - Measuring the (im)plausibility of values or combinations thereof
    - Measuring plausibility of macro-level estimates
    - Review and identification of suspicious aggregates
    - Presence review and identification of systematic errors
    - Macro-level review of combining units
  Review of units
    - Review of eligible units
    - Review of non-eligible units
    - Review by scores for influential or outlying units
    - Review of micro-level consistency of unit

Selection
  Selection of units
    - Selection of eligible units
    - Selection of units for interactive treatment, for non-interactive treatment and not to be treated
    - Selection of units affected by influential errors
    - Selection of outlying units to be treated by weight adjustment
    - Selection by structure of units
    - Selection of units by macro-level review
  Selection of variables
    - Selection of variables with obvious errors
    - Selection of variables with errors in unit properties
    - Selection of variables for treatment by specific imputation methods
    - Selection of influential outlying values for manual review
    - Localizing the erroneous values among those that are inconsistent
    - Localizing the variables affected by errors for each unit

Treatment
  Variable imputation
    - Correction of obvious errors
    - Correction of systematic errors
    - Correction of errors in unit properties
    - Imputation of localized errors
    - Imputation of missing or discarded (erroneous) values
    - Adjustment for inconsistency
  Unit treatment
    - Treatment of units in the critical set
    - Creation of statistical units
    - Treatment of unit linkage deficits

38. Beyond the data set that is reviewed and/or modified, the functions need additional metadata in order to be put into the process flow and process steps. This metadata can be classified as process input metadata, metadata for functions and process output metadata. Input metadata include auxiliary data, parameters and unstructured metadata. Metadata for functions specify methods and rules. Output metadata are quality measures and paradata. See Chapter 4 for more information on these metadata issues and their importance for functions.

3.3 Methods

39. A process method specifies how the data editing functions in a process flow are to be performed in real life situations. In this document, we call them methods for short.

40. Methods can be associated with rules. Rules are mathematical/logical functions of the variables in the data set and possibly also of auxiliary variables. Sometimes these rules may further be fine-tuned using parameters. The process of defining the right parameters in a certain context is called parameterization. We distinguish between edit rules, score functions and correction rules.

41. Edit rules describe the valid (hard edits) or plausible (soft edits) values of variables or combinations of variables. Especially in business statistics there are often large sets of hard and soft edit rules such as: linear equalities (balance edits), inequalities and ratio edits. Edit rules are used in review functions that assess the violation of hard edits or the amount of violation of soft edits. Hard edit rules are also used by methods for selection of values presumed to be in error, e.g. implementations of the Fellegi-Holt method. Imputation methods may also use edit rules, in particular, the adjustment for consistency of imputed values uses hard edit rules.
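As a minimal illustration of the rule types above, the following Python sketch evaluates one hard edit (a balance edit) and one soft edit (a ratio edit) on a single record. The variable names and thresholds are assumptions for illustration, not prescriptions of this document.

```python
# Hypothetical business-statistics record and edit rules.

def balance_edit(record, tol=1e-9):
    """Hard edit (linear equality): total == employees_cost + other_cost."""
    return abs(record["total"] - (record["employees_cost"] + record["other_cost"])) <= tol

def ratio_edit(record, low=0.1, high=10.0):
    """Soft edit: turnover per employee should lie in an assumed plausible interval."""
    ratio = record["turnover"] / record["employees"]
    return low <= ratio <= high

record = {"total": 100, "employees_cost": 60, "other_cost": 40,
          "turnover": 5000, "employees": 10}
print(balance_edit(record))  # True: 60 + 40 == 100, so the hard edit is satisfied
print(ratio_edit(record))    # False: 500 per employee falls outside [0.1, 10]
```

A review function would report the violation of the soft edit; a selection function could then decide whether the unit needs treatment.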

42. Score functions assess the plausibility and influence of the values in a unit as a whole. They are typically used by selection functions that select units for interactive editing.
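A common variant of such a score combines the influence of a unit (its raising weight times the size of the suspected error) with the magnitude of the estimated total. The sketch below assumes this particular form and invents the names; it is one of several possibilities, not the document's prescription.

```python
# Hedged sketch of a score function for selective editing.

def unit_score(weight, reported, anticipated, estimated_total):
    """Score = relative effect on the estimated total of correcting the unit."""
    return weight * abs(reported - anticipated) / estimated_total

scores = {
    "unit_a": unit_score(weight=10, reported=1200, anticipated=1000, estimated_total=50_000),
    "unit_b": unit_score(weight=2, reported=510, anticipated=500, estimated_total=50_000),
}
print(scores)  # unit_a has the larger score, so it would be reviewed first
```

Units are then ranked by score, and those above a threshold are routed to interactive editing (see the selection methods in Table 3).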

43. Correction rules combine the detection, selection and imputation of missing data or erroneous values. In particular, they are used for the correction of specific “obvious” errors and systematic errors or, more generally, of errors with a detectable cause and a known error mechanism. They can be formulated as IF-THEN rules of the following form: IF (condition) THEN OldValue = NewValue. This type of rule is usually applied during micro-editing. IF-THEN rules can also be used for automatic error detection, in the form: IF (condition) THEN FlagValue = ErrorCode.
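The two IF-THEN rule forms can be sketched in Python as follows; the conditions, field names and error code are invented for illustration.

```python
# Illustrative IF-THEN rules; "age", "flag_age" and "E01" are hypothetical.

def correction_rule(record):
    """IF (condition) THEN OldValue = NewValue:
    a negative age is an obvious error and is set to missing."""
    if record["age"] is not None and record["age"] < 0:
        record["age"] = None
    return record

def detection_rule(record):
    """IF (condition) THEN FlagValue = ErrorCode:
    an implausibly high age is flagged, not changed."""
    if record["age"] is not None and record["age"] > 120:
        record["flag_age"] = "E01"
    return record

print(correction_rule({"age": -5}))   # the value is discarded
print(detection_rule({"age": 130}))   # the value is kept but flagged
```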

44. Table 2, Table 3 and Table 4 are similarly structured: each function category has one or more corresponding method categories with examples. Some of the methods also appear in common process steps described in Table 6 in Chapter 5. Note that the subcategory classification is not meant to cover all possible alternatives, though it shows many familiar methods for each of the three function types.

Review

45. The methods for the review function vary from simple to complex. The most common review methods are edit rules in various forms. Methods that study data plausibility usually require specific analytical constructions to obtain indicators for selection. A score is a quality measure of a unit; review by unit scores comprises scores for selective editing and other types of scores for review. Micro-level consistency is studied in order to reveal problematic unit situations concerning linkage and alignment between multiple input sources. Table 2 presents examples of these review methods.

Table 2. Categories of Methods for Review Functions (function category > method category > examples)

Review of data validity
  Edit rules
    - Edit rules by valid values (a set of valid values defined for a variable)
    - Edit rules by limits (an interval of valid values defined for a variable)
    - Edit rules by historic comparisons (relations between variable values at different time points)
    - Edit rules by variable relations (constructing variable relations from prior knowledge)
    - Mixture of types of edit rules (a combination of different edit rules)

Review of data plausibility
  Analytical methods for review
    - Measures for outlier detection (e.g. calculating measures from the distribution of a variable)
    - Aggregates for macro-level studies (e.g. calculating totals for comparison with previous totals)
    - Coverage analysis (e.g. does a subpopulation have a high proportion of non-matching units?)
    - Population sizing (e.g. no. of register households ≈ no. of census households?)
    - Cluster analysis (recognizing erroneous values with mixture modelling)

Review of units
  Sufficiency study of unit
    - Sufficiency check of value content of unit (a study of value content and item nonresponse)
  Micro-level consistency
    - Edit rules by linkage status (e.g. check status: match, non-match, multiple matches)
    - Edit rules of misalignment (e.g. does a person have multiple addresses?)
  Score by auxiliary variable
    - Auxiliary variable as a criterion for importance (e.g. using turnover to assess the importance of an enterprise)
  Score calculation for selective editing
    - Score function for totals (quantifying the editing effect of a unit on an estimated total)
    - Score by parametric model for data with errors (a parametric model taking possible errors into account)
    - Edit-related score calculation (score calculation taking edit rules and estimates into account)
    - Score calculation by latent class analysis (a score related to the expected error based on modelling)
    - Score calculation by prediction model (predicting error probabilities based on previous well-edited data)
  Interactive review of unit
    - Inspection of the unit and the variable values as a whole (a clerical evaluation of the state of the unit)
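One of the analytical review methods above, calculating outlier-detection measures from the distribution of a variable, can be sketched as follows. The interquartile-range rule and its multiplier are assumptions chosen for illustration; the document does not prescribe a particular measure.

```python
# Hedged sketch: flag values outside an interval derived from the
# distribution of the variable (a simple interquartile-range rule).
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([10, 11, 12, 11, 10, 95]))  # the extreme value is flagged
```

The flagged values are not changed by this review method; they only become indicators for a subsequent selection function.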

Selection

46. The action of selection leads to a simple outcome: a unit, or the variables of a unit, is either marked as selected or not (binary, 0/1). The techniques for units use automation based on thresholds or tests of unit structure, or manual selection based on decisions by the editor. Correspondingly, the techniques for variables use various computational solutions to limit the set of variables in observations for further processing; again, manual inspection is an option. The rules for aggregates may resemble the principles used in the edit rules for observations. Table 3 presents methods familiar from both selection theory and practical solutions.

Table 3. Categories of Methods for Selection Functions (function category > method category > examples)

Selection of units
  Selection by scores
    - Selection by fixed threshold (a threshold based on experience or reasoning)
    - Selection by threshold from score distribution (a point from the score distribution as threshold)
    - Selection by threshold from pseudo-bias study (the percentage of units treated manually in a pseudo-bias study determines the threshold)
  Selection by structure
    - Complicated relations (e.g. unmarried couple with their child at one address and the man’s wife at a separate address)
    - Dubious structure (e.g. an address with a family nucleus, a grand aunt and an unrelated person)
  Macro-level selection
    - Selection by group statistics (e.g. postcodes with the highest linkage errors)
  Interactive unit selection
    - Units chosen interactively (a clerical selection of the unit)

Selection of variables
  Micro-level selection of variables
    - Selection of obvious errors (directing obvious errors to correction with selection)
    - Random error localisation (identifying the erroneous values with an algorithm)
    - Accepting multivariate error situation in unit (selecting all variables with an indicator in the unit)
  Macro-level selection of variables
    - Selection based on outlier calculations (method-specific selection rules for outliers)
    - Selection based on rules for aggregates (identifying a suspicious set of units based on an estimate)
  Interactive variable selection
    - Variables chosen interactively (a clerical selection of a variable for further treatment)
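The method "selection by threshold from score distribution" can be sketched as follows; the quantile level and the unit scores are assumptions for illustration.

```python
# Hedged sketch: select the units whose score exceeds a chosen quantile
# of the observed score distribution.
import statistics

def select_by_score_quantile(scores, q=0.8):
    """Return unit ids whose score exceeds the q-quantile of all scores."""
    cut = statistics.quantiles(scores.values(), n=100)[int(q * 100) - 1]
    return sorted(u for u, s in scores.items() if s > cut)

scores = {"u1": 0.01, "u2": 0.40, "u3": 0.02, "u4": 0.03, "u5": 0.90}
print(select_by_score_quantile(scores, q=0.8))  # only the highest-scoring unit is selected
```

In production the threshold would rather come from a pseudo-bias study or from experience, but the mechanism is the same: the score distribution yields a cut-off, and the selection outcome per unit is binary.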

Treatment

47. The treatment function type usually has many corresponding alternative methods, familiar from the literature as well as from editing practice in statistical editing processes. More general classes may be defined for some methods, for example dividing variable imputation methods into random and non-random imputation. Unit-level treatments are usually connected to the various operations needed when combining and reconciling the different units residing in multiple input sources. Table 4 presents several treatment methods.

Table 4. Categories of Methods for Treatment Functions (function category > method category > examples)

Variable imputation
  Interactive treatment of errors
    - Re-contact (obtaining the real value from the respondent or data provider)
    - Inspection of questionnaires (checking values from a questionnaire, e.g. for process errors)
    - Value replacement (substituting or adding a value from another variable/source)
    - Value creation (a value decided based on subject-matter knowledge)
  Deductive imputation
    - Imputation with a function (a value calculated as a function of other values)
    - Imputation with logical deduction (a value deduced with logical expressions)
    - Imputation with historic values (a value transferred from an earlier time point)
    - Proxy imputation (a value adopted from a related unit)
  Model-based imputation
    - Mean imputation (using the mean of a variable)
    - Median imputation (using the median of a variable)
    - Ratio imputation (using an auxiliary variable value through a ratio calculation)
    - Regression imputation (predicting a value with a regression model)
  Donor imputation
    - Random donor imputation (selecting a donor randomly, within a domain)
    - Sequential donor imputation (a sequential selection of donors)
    - Nearest neighbour imputation (selecting a donor based on a distance function)
    - Random nearest neighbour imputation (selecting a donor randomly in the neighbourhood)
  Consistency adjustment
    - Balance edit solution (a solution derived from consistency conditions)
    - Prorating (adjusting a block of existing values for consistency)
    - Ratio corrected donor imputation (donor imputation with ratio correction for consistency)
    - Partial variable adjustment (correcting variable values with prior knowledge)

Unit treatment
  Unit rejection
    - Deletion (rejecting a unit)
  Unit creation
    - Mass imputation (e.g. imputation of missing households in a one-number census)
    - Imputation of lower level units for an upper level unit (e.g. imputation of missing persons in responding households)
    - Creating upper level units from lower level units (e.g. grouping persons into households and deducing household variables)
  Unit linkage
    - Correcting linkage deficits (e.g. clerical review of linked pairs of units)
    - Matching different types of units (e.g. placing a household with unknown address in an ‘unoccupied’ dwelling)
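As one concrete treatment method from the table, nearest neighbour donor imputation can be sketched as follows. The distance function (absolute difference on auxiliary variables) and the field names are assumptions; in practice the distance function is a key design choice of the method.

```python
# Hedged sketch of nearest neighbour donor imputation: copy the target
# value from the complete record closest on the auxiliary variables.

def nearest_neighbour_impute(recipient, donors, aux, target):
    """Impute recipient[target] from the donor minimising the distance on aux."""
    def distance(donor):
        return sum(abs(donor[a] - recipient[a]) for a in aux)
    donor = min(donors, key=distance)
    recipient[target] = donor[target]
    return recipient

donors = [
    {"employees": 8, "turnover": 900},
    {"employees": 50, "turnover": 6000},
]
recipient = {"employees": 10, "turnover": None}
print(nearest_neighbour_impute(recipient, donors, aux=["employees"], target="turnover"))
```

The random nearest neighbour variant would instead draw randomly among the k closest donors, and a consistency adjustment could follow to restore any violated balance edits.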

3.4 Practical Solutions

48. Functions are often specialized and can be categorized with respect to specific properties such as error type (e.g. obvious errors, systematic errors), target (e.g. macro-level estimates) or forthcoming action (e.g. interactive treatment, imputation). Some functions combine two or more process steps, a common example being simultaneous review and selection, e.g. Fellegi-Holt error localization or outlier detection.

49. In practice, implementations in production often do not distinguish between all the methods presented in the previous sections. In some cases, due to computational challenges, the implemented solution might not fully reflect the conceptual functions and methods initially designed. However, the parameterization of a method is a task that should be performed carefully for an efficient and effective editing process: instead of spreading data editing parameters over several programs, it is better to maintain them in a metadata system that feeds them into the methods in a centralized manner.
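The centralized-parameter idea can be sketched as follows; the parameter structure and rule names are invented, and in production the parameters would come from a metadata system rather than a hard-coded dictionary.

```python
# Hedged sketch: editing parameters maintained in one central structure
# and fed into the methods, instead of being hard-coded per program.

EDIT_PARAMETERS = {
    "ratio_edit": {"low": 0.1, "high": 10.0},
    "selective_editing": {"score_threshold": 0.05},
}

def ratio_edit(record, params=EDIT_PARAMETERS["ratio_edit"]):
    """Soft edit parameterized from the central structure."""
    ratio = record["turnover"] / record["employees"]
    return params["low"] <= ratio <= params["high"]

def select_unit(score, params=EDIT_PARAMETERS["selective_editing"]):
    """Selection method parameterized from the central structure."""
    return score > params["score_threshold"]

print(select_unit(0.04))  # the centrally maintained threshold decides
```

Re-parameterizing the process then means editing one structure (or one metadata record), not hunting through several programs.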

50. A practical solution to the risk of spreading data editing parameters across different systems is to perform different functions at once, either in one action or as a string of actions. These special upper-level methods are called methods for a combination of functions. A very common case is an IF-THEN rule, which combines functions of all three function types in one operation: the IF part contains review in the form of evaluating an edit rule (e.g. the condition for a thousand error), selection lies in the decision that this rule should cause treatment of one or more variables (those specified in the THEN part), and the treatment is specified by the prescription that provides a new value. Other typical operations in this class are outlier analysis, with review and selection (and sometimes treatment) at once, and the Fellegi-Holt paradigm, which may include an edit rule mechanism and an algorithm for localizing errors with minimal value changes in the data.
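For the thousand-error example, such a combined IF-THEN rule could look like the sketch below. The comparison against a reference value, the tolerance and the field names are assumptions made for illustration.

```python
# Hedged sketch of an IF-THEN rule combining review, selection and
# treatment for a suspected thousand error (a value reported in the
# wrong unit of measurement by a factor of 1000).

def thousand_error_rule(record, reference, tol=0.05):
    """IF the value is about 1000 times a reference value (review + selection),
    THEN divide it by 1000 (treatment)."""
    value, ref = record["turnover"], reference["turnover"]
    if ref > 0 and abs(value / ref - 1000) <= 1000 * tol:  # IF: edit rule + decision
        record["turnover"] = value / 1000                   # THEN: new value
    return record

print(thousand_error_rule({"turnover": 2_000_000}, {"turnover": 2_050}))
```

All three function types are executed in one operation, which is exactly why such rules are attractive in production and why their parameters (here the tolerance) should live in the central metadata system of paragraph 49.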