3.1 Introduction

33. Functions and methods are an essential part of describing the lower levels of the hierarchy in the construction of an SDE process. This chapter provides a more exact categorization and definition of the functions and methods in editing processes, with examples and explanations of their use. It uses concepts and structures which have appeared earlier in Camstra and Renssen (2011), Pannekoek and Zhang (2012), and Pannekoek et al. (2013), here presented with some modification.

34. The difference between functions and methods can be understood as follows: functions specify what data editing action is to be performed, while methods specify how these actions are to be performed. A function may be implemented by multiple methods, and one method can perform different functions.
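This distinction can be illustrated with a small Python sketch (not from the source; all names are hypothetical): one editing function, "impute a missing value", is performed by two different methods.

```python
# Illustrative sketch: one *function* (imputation), two *methods*.
# Variable names and data are invented for illustration.

def mean_imputation(values):
    """Method 1: replace missing values (None) by the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def historic_imputation(values, previous):
    """Method 2: replace missing values by the value from an earlier period."""
    return [p if v is None else v for v, p in zip(values, previous)]

# The same function (imputation) performed by two methods:
data = [10.0, None, 30.0]
print(mean_imputation(data))                         # mean of 10 and 30 is 20
print(historic_imputation(data, [9.0, 21.0, 29.0]))  # takes 21.0 from the earlier period
```

Conversely, one method (e.g. an edit rule) can serve several functions, such as review and selection.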

35. Chapters 3.2 and 3.3 define the terms “function” and “method”, split them into categories or types, and provide examples of different functions and methods. Chapter 3.4 considers the practical application of the concepts of functions and methods.

3.2 Functions

36. A statistical data editing function is a business function that serves a specific purpose in the chain of activities defining the data editing process. Functions can be categorized into three broad function types: review, selection and treatment.

37. The functions can be further categorized based on the task they are assigned to, the type of output they produce, and whether they apply to units or variables. Table 1 provides examples of these functions by function type and function category.

Table 1. Statistical Data Editing Functions (function type > function category > examples)

Review
  Review of data validity (by checking combinations of values)
    - Review of obvious errors
    - Assessing the logical consistency of combinations of values
    - Review of data properties
  Review of data plausibility (by analysis)
    - Measuring the (im)plausibility of values or combinations thereof
    - Measuring plausibility of macro-level estimates
    - Review and identification of suspicious aggregates
    - Presence review and identification of systematic errors
    - Macro-level review of combining units
  Review of units
    - Review of eligible units
    - Review of non-eligible units
    - Review by scores for influential or outlying units
    - Review of micro-level consistency of unit

Selection
  Selection of units
    - Selection of eligible units
    - Selection of units for interactive treatment, for non-interactive treatment and not to be treated
    - Selection of units affected by influential errors
    - Selection of outlying units to be treated by weight adjustment
    - Selection by structure of units
    - Selection of units by macro-level review
  Selection of variables
    - Selection of variables with obvious errors
    - Selection of variables with errors in unit properties
    - Selection of variables for treatment by specific imputation methods
    - Selection of influential outlying values for manual review
    - Localizing the erroneous values among those that are inconsistent
    - Localizing the variables affected by errors for each unit

Treatment
  Variable imputation
    - Correction of obvious errors
    - Correction of systematic errors
    - Correction of errors in unit properties
    - Imputation of localized errors
    - Imputation of missing or discarded (erroneous) values
    - Adjustment for inconsistency
  Unit treatment
    - Treatment of units in the critical set
    - Creation of statistical units
    - Treatment of unit linkage deficits

38. Beyond the data set that is reviewed and/or modified, the functions need additional metadata in order to be put into the process flow and process steps. This metadata can be classified as process input metadata, metadata for functions and process output metadata. Input metadata include auxiliary data, parameters and unstructured metadata. Metadata for functions specify methods and rules. Output metadata are quality measures and paradata. See Chapter 4 for more information on these metadata issues and their importance for functions.

3.3 Methods

39. A process method specifies how the data editing functions in a process flow are to be performed in real life situations. In this document, we call them methods for short.

40. Methods can be associated with rules. Rules are mathematical/logical functions of the variables in the data set and possibly also of auxiliary variables. Sometimes these rules may further be fine-tuned using parameters. The process of defining the right parameters in a certain context is called parameterization. We distinguish between edit rules, score functions and correction rules.

41. Edit rules describe the valid (hard edits) or plausible (soft edits) values of variables or combinations of variables. Especially in business statistics there are often large sets of hard and soft edit rules such as: linear equalities (balance edits), inequalities and ratio edits. Edit rules are used in review functions that assess the violation of hard edits or the amount of violation of soft edits. Hard edit rules are also used by methods for selection of values presumed to be in error, e.g. implementations of the Fellegi-Holt method. Imputation methods may also use edit rules, in particular, the adjustment for consistency of imputed values uses hard edit rules.
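As a minimal illustration of the rule types above, the following Python sketch evaluates one hard edit (a balance edit) and one soft edit (a ratio edit) on a single record. The variable names and thresholds are assumptions for illustration, not prescriptions of this document.

```python
# Hypothetical business-statistics record and edit rules.

def balance_edit(record, tol=1e-9):
    """Hard edit (linear equality): total == employees_cost + other_cost."""
    return abs(record["total"] - (record["employees_cost"] + record["other_cost"])) <= tol

def ratio_edit(record, low=0.1, high=10.0):
    """Soft edit: turnover per employee should lie in an assumed plausible interval."""
    ratio = record["turnover"] / record["employees"]
    return low <= ratio <= high

record = {"total": 100, "employees_cost": 60, "other_cost": 40,
          "turnover": 5000, "employees": 10}
print(balance_edit(record))  # True: 60 + 40 == 100, so the hard edit is satisfied
print(ratio_edit(record))    # False: 500 per employee falls outside [0.1, 10]
```

A review function would report the violation of the soft edit; a selection function could then decide whether the unit needs treatment.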

42. Score functions assess the plausibility and influence of the values in a unit as a whole. They are typically used by selection functions that select units for interactive editing.
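A common variant of such a score combines the influence of a unit (its raising weight times the size of the suspected error) with the magnitude of the estimated total. The sketch below assumes this particular form and invents the names; it is one of several possibilities, not the document's prescription.

```python
# Hedged sketch of a score function for selective editing.

def unit_score(weight, reported, anticipated, estimated_total):
    """Score = relative effect on the estimated total of correcting the unit."""
    return weight * abs(reported - anticipated) / estimated_total

scores = {
    "unit_a": unit_score(weight=10, reported=1200, anticipated=1000, estimated_total=50_000),
    "unit_b": unit_score(weight=2, reported=510, anticipated=500, estimated_total=50_000),
}
print(scores)  # unit_a has the larger score, so it would be reviewed first
```

Units are then ranked by score, and those above a threshold are routed to interactive editing (see the selection methods in Table 3).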

43. Correction rules combine the detection, selection and imputation of missing data or erroneous values. In particular, they are used for the correction of specific “obvious” errors and systematic errors or, more generally, of errors with a detectable cause and a known error mechanism. They can be formulated as IF-THEN rules of the following form: IF (condition) THEN OldValue = NewValue. This type of rule is usually applied during micro-editing. IF-THEN rules can also be used for automatic error detection, in the form: IF (condition) THEN FlagValue = ErrorCode.
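The two IF-THEN rule forms can be sketched in Python as follows; the conditions, field names and error code are invented for illustration.

```python
# Illustrative IF-THEN rules; "age", "flag_age" and "E01" are hypothetical.

def correction_rule(record):
    """IF (condition) THEN OldValue = NewValue:
    a negative age is an obvious error and is set to missing."""
    if record["age"] is not None and record["age"] < 0:
        record["age"] = None
    return record

def detection_rule(record):
    """IF (condition) THEN FlagValue = ErrorCode:
    an implausibly high age is flagged, not changed."""
    if record["age"] is not None and record["age"] > 120:
        record["flag_age"] = "E01"
    return record

print(correction_rule({"age": -5}))   # the value is discarded
print(detection_rule({"age": 130}))   # the value is kept but flagged
```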

44. Table 2, Table 3 and Table 4 are similarly structured: each function category has one or more corresponding method categories with examples. Some of the methods also appear in common process steps described in Table 6 in Chapter 5. Note that the subcategory classification is not meant to cover all possible alternatives, though it shows many familiar methods for each of the three function types.

Review

45. The methods for the review function vary from simple to complex. The most common review methods are edit rules in various forms. Methods that study data plausibility usually require specific analytical constructions to obtain indicators for selection. A score is a quality measure of a unit; review by unit scores comprises scores for selective editing and other types of scores for review. Micro-level consistency is studied in order to reveal problematic unit situations concerning linkage and alignment between multiple input sources. Table 2 presents examples of these review methods.

Table 2. Categories of Methods for Review Functions (function category > method category > examples)

Review of data validity
  Edit rules
    - Edit rules by valid values (a set of valid values defined for a variable)
    - Edit rules by limits (an interval of valid values defined for a variable)
    - Edit rules by historic comparisons (relations between variable values at different time points)
    - Edit rules by variable relations (constructing variable relations from prior knowledge)
    - Mixture of types of edit rules (a combination of different edit rules)

Review of data plausibility
  Analytical methods for review
    - Measures for outlier detection (e.g. calculating measures from the distribution of a variable)
    - Aggregates for macro-level studies (e.g. calculating totals for comparison with previous totals)
    - Coverage analysis (e.g. does a subpopulation have a high proportion of non-matching units?)
    - Population sizing (e.g. no. of register households ≈ no. of census households?)
    - Cluster analysis (recognizing erroneous values with mixture modelling)

Review of units
  Sufficiency study of unit
    - Sufficiency check of value content of unit (a study of value content and item nonresponse)
  Micro-level consistency
    - Edit rules by linkage status (e.g. check status: match, non-match, multiple matches)
    - Edit rules of misalignment (e.g. does a person have multiple addresses?)
  Score by auxiliary variable
    - Auxiliary variable as a criterion for importance (e.g. using turnover to assess the importance of an enterprise)
  Score calculation for selective editing
    - Score function for totals (quantifying the editing effect of a unit on an estimated total)
    - Score by parametric model for data with errors (a parametric model taking possible errors into account)
    - Edit-related score calculation (score calculation taking edit rules and estimates into account)
    - Score calculation by latent class analysis (a score related to the expected error based on modelling)
    - Score calculation by prediction model (predicting error probabilities based on previous well-edited data)
  Interactive review of unit
    - Inspection of the unit and the variable values as a whole (a clerical evaluation of the state of the unit)
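One of the analytical review methods above, calculating outlier-detection measures from the distribution of a variable, can be sketched as follows. The interquartile-range rule and its multiplier are assumptions chosen for illustration; the document does not prescribe a particular measure.

```python
# Hedged sketch: flag values outside an interval derived from the
# distribution of the variable (a simple interquartile-range rule).
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

print(iqr_outliers([10, 11, 12, 11, 10, 95]))  # the extreme value is flagged
```

The flagged values are not changed by this review method; they only become indicators for a subsequent selection function.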

Selection

46. The action of selection leads to a simple outcome: a unit, or the variables of a unit, is either marked as selected or not (binary, 0/1). The techniques for units use automation based on thresholds or tests of unit structure, or manual selection based on decisions by the editor. Correspondingly, the techniques for variables use various computational solutions to limit the set of variables in observations for further processing; again, manual inspection is an option. The rules for aggregates may resemble the principles used in the edit rules for observations. Table 3 presents methods familiar from both selection theory and practical solutions.

Table 3. Categories of Methods for Selection Functions (function category > method category > examples)

Selection of units
  Selection by scores
    - Selection by fixed threshold (a threshold based on experience or reasoning)
    - Selection by threshold from score distribution (a point from the score distribution as threshold)
    - Selection by threshold from pseudo-bias study (the percentage of units treated manually in a pseudo-bias study determines the threshold)
  Selection by structure
    - Complicated relations (e.g. unmarried couple with their child at one address and the man’s wife at a separate address)
    - Dubious structure (e.g. an address with a family nucleus, a grand aunt and an unrelated person)
  Macro-level selection
    - Selection by group statistics (e.g. postcodes with the highest linkage errors)
  Interactive unit selection
    - Units chosen interactively (a clerical selection of the unit)

Selection of variables
  Micro-level selection of variables
    - Selection of obvious errors (directing obvious errors to correction with selection)
    - Random error localisation (identifying the erroneous values with an algorithm)
    - Accepting multivariate error situation in unit (selecting all variables with an indicator in the unit)
  Macro-level selection of variables
    - Selection based on outlier calculations (method-specific selection rules for outliers)
    - Selection based on rules for aggregates (identifying a suspicious set of units based on an estimate)
  Interactive variable selection
    - Variables chosen interactively (a clerical selection of a variable for further treatment)
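The method "selection by threshold from score distribution" can be sketched as follows; the quantile level and the unit scores are assumptions for illustration.

```python
# Hedged sketch: select the units whose score exceeds a chosen quantile
# of the observed score distribution.
import statistics

def select_by_score_quantile(scores, q=0.8):
    """Return unit ids whose score exceeds the q-quantile of all scores."""
    cut = statistics.quantiles(scores.values(), n=100)[int(q * 100) - 1]
    return sorted(u for u, s in scores.items() if s > cut)

scores = {"u1": 0.01, "u2": 0.40, "u3": 0.02, "u4": 0.03, "u5": 0.90}
print(select_by_score_quantile(scores, q=0.8))  # only the highest-scoring unit is selected
```

In production the threshold would rather come from a pseudo-bias study or from experience, but the mechanism is the same: the score distribution yields a cut-off, and the selection outcome per unit is binary.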

Treatment

47. The treatment function type usually has many corresponding alternative methods, familiar from the literature as well as from editing practice in statistical editing processes. More general classes may be defined for some methods, for example dividing variable imputation methods into random and non-random imputation. Unit-level treatments are usually connected to the various operations needed when combining and reconciling the different units residing in multiple input sources. Table 4 presents several treatment methods.

Table 4. Categories of Methods for Treatment Functions (function category > method category > examples)

Variable imputation
  Interactive treatment of errors
    - Re-contact (obtaining the real value from the respondent or data provider)
    - Inspection of questionnaires (checking values from a questionnaire, e.g. for process errors)
    - Value replacement (substituting or adding a value from another variable/source)
    - Value creation (a value decided based on subject-matter knowledge)
  Deductive imputation
    - Imputation with a function (a value calculated as a function of other values)
    - Imputation with logical deduction (a value deduced with logical expressions)
    - Imputation with historic values (a value transferred from an earlier time point)
    - Proxy imputation (a value adopted from a related unit)
  Model-based imputation
    - Mean imputation (using the mean of a variable)
    - Median imputation (using the median of a variable)
    - Ratio imputation (using an auxiliary variable value through a ratio calculation)
    - Regression imputation (predicting a value with a regression model)
  Donor imputation
    - Random donor imputation (selecting a donor randomly, within a domain)
    - Sequential donor imputation (a sequential selection of donors)
    - Nearest neighbour imputation (selecting a donor based on a distance function)
    - Random nearest neighbour imputation (selecting a donor randomly in the neighbourhood)
  Consistency adjustment
    - Balance edit solution (a solution derived from consistency conditions)
    - Prorating (adjusting a block of existing values for consistency)
    - Ratio corrected donor imputation (donor imputation with ratio correction for consistency)
    - Partial variable adjustment (correcting variable values with prior knowledge)

Unit treatment
  Unit rejection
    - Deletion (rejecting a unit)
  Unit creation
    - Mass imputation (e.g. imputation of missing households in a one-number census)
    - Imputation of lower level units for an upper level unit (e.g. imputation of missing persons in responding households)
    - Creating upper level units from lower level units (e.g. grouping persons into households and deducing household variables)
  Unit linkage
    - Correcting linkage deficits (e.g. clerical review of linked pairs of units)
    - Matching different types of units (e.g. placing a household with unknown address in an ‘unoccupied’ dwelling)
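As one concrete treatment method from the table, nearest neighbour donor imputation can be sketched as follows. The distance function (absolute difference on auxiliary variables) and the field names are assumptions; in practice the distance function is a key design choice of the method.

```python
# Hedged sketch of nearest neighbour donor imputation: copy the target
# value from the complete record closest on the auxiliary variables.

def nearest_neighbour_impute(recipient, donors, aux, target):
    """Impute recipient[target] from the donor minimising the distance on aux."""
    def distance(donor):
        return sum(abs(donor[a] - recipient[a]) for a in aux)
    donor = min(donors, key=distance)
    recipient[target] = donor[target]
    return recipient

donors = [
    {"employees": 8, "turnover": 900},
    {"employees": 50, "turnover": 6000},
]
recipient = {"employees": 10, "turnover": None}
print(nearest_neighbour_impute(recipient, donors, aux=["employees"], target="turnover"))
```

The random nearest neighbour variant would instead draw randomly among the k closest donors, and a consistency adjustment could follow to restore any violated balance edits.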

3.4 Practical Solutions

48. Functions are often specialized and can be categorized with respect to specific properties such as error type (e.g. obvious errors, systematic errors), target (e.g. macro-level estimates) or forthcoming action (e.g. interactive treatment, imputation). Some functions combine two or more process steps, a common example being simultaneous review and selection, e.g. Fellegi-Holt error localization or outlier detection.

49. In practice, implementations in production often do not distinguish between all the methods presented in the previous sections. In some cases, due to computational challenges, the implemented solution might not fully reflect the conceptual functions and methods initially designed. However, the parameterization of a method is a task that should be performed carefully for an efficient and effective editing process: instead of spreading data editing parameters over several programs, it is better to maintain them in a metadata system that feeds them into the methods in a centralized manner.
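The centralized-parameter idea can be sketched as follows; the parameter structure and rule names are invented, and in production the parameters would come from a metadata system rather than a hard-coded dictionary.

```python
# Hedged sketch: editing parameters maintained in one central structure
# and fed into the methods, instead of being hard-coded per program.

EDIT_PARAMETERS = {
    "ratio_edit": {"low": 0.1, "high": 10.0},
    "selective_editing": {"score_threshold": 0.05},
}

def ratio_edit(record, params=EDIT_PARAMETERS["ratio_edit"]):
    """Soft edit parameterized from the central structure."""
    ratio = record["turnover"] / record["employees"]
    return params["low"] <= ratio <= params["high"]

def select_unit(score, params=EDIT_PARAMETERS["selective_editing"]):
    """Selection method parameterized from the central structure."""
    return score > params["score_threshold"]

print(select_unit(0.04))  # the centrally maintained threshold decides
```

Re-parameterizing the process then means editing one structure (or one metadata record), not hunting through several programs.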

50. A practical solution to the risk of spreading data editing parameters across different systems is to perform different functions at once, either in one action or as a string of actions. These special upper-level methods are called methods for a combination of functions. A very common case is an IF-THEN rule, which combines functions of all three function types in one operation: the IF part contains review in the form of evaluating an edit rule (e.g. the condition for a thousand error), selection lies in the decision that this rule should cause treatment of one or more variables (those specified in the THEN part), and the treatment is specified by the prescription that provides a new value. Other typical operations in this class are outlier analysis, with review and selection (and sometimes treatment) at once, and the Fellegi-Holt paradigm, which may include an edit rule mechanism and an algorithm for localizing errors with minimal value changes in the data.
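For the thousand-error example, such a combined IF-THEN rule could look like the sketch below. The comparison against a reference value, the tolerance and the field names are assumptions made for illustration.

```python
# Hedged sketch of an IF-THEN rule combining review, selection and
# treatment for a suspected thousand error (a value reported in the
# wrong unit of measurement by a factor of 1000).

def thousand_error_rule(record, reference, tol=0.05):
    """IF the value is about 1000 times a reference value (review + selection),
    THEN divide it by 1000 (treatment)."""
    value, ref = record["turnover"], reference["turnover"]
    if ref > 0 and abs(value / ref - 1000) <= 1000 * tol:  # IF: edit rule + decision
        record["turnover"] = value / 1000                   # THEN: new value
    return record

print(thousand_error_rule({"turnover": 2_000_000}, {"turnover": 2_050}))
```

All three function types are executed in one operation, which is exactly why such rules are attractive in production and why their parameters (here the tolerance) should live in the central metadata system of paragraph 49.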