33. Functions and methods are an essential part of describing the lower levels of the hierarchy in the construction of an SDE process. This chapter provides a more exact categorization and definition of the functions and methods in editing processes, with examples and explanations of their use. It uses concepts and structures which have appeared earlier in Camstra and Renssen (2011), Pannekoek and Zhang (2012), and Pannekoek et al. (2013), here presented with some modification.
34. The difference between functions and methods can be understood as follows: functions specify what data editing action is to be performed, while methods specify how these actions are to be performed. A function may be implemented by multiple methods, and one method can perform different functions.
35. Chapters 3.2 and 3.3 define the terms “function” and “method” and split them into categories or types, providing examples of different functions and methods. In Chapter 3.5, the practical application of the concepts of functions and methods is considered.
36. A statistical data editing function is a business function that serves a specific purpose in the chain of activities defining the data editing process. Functions can be categorized into three broad types: review, selection and treatment.
37. The functions can be further categorized based on the task they are assigned to, the type of output they produce and whether they apply to units or variables. The function categories, with examples, are given in Table 1.
Table 1. Statistical Data Editing Functions
Function type | Function category | Examples |
Review | Review of data validity (by checking combinations of values) | Review of obvious errors; Assessing the logical consistency of combinations of values; Review of data properties |
| Review of data plausibility (by analysis) | Measuring the (im)plausibility of values or combinations thereof; Measuring the plausibility of macro-level estimates; Review and identification of suspicious aggregates; Review and identification of systematic errors; Macro-level review of combining units |
| Review of units | Review of eligible units; Review of non-eligible units; Review by scores for influential or outlying units; Review of micro-level consistency of unit |
Selection | Selection of units | Selection of eligible units; Selection of units for interactive treatment, for non-interactive treatment and not to be treated; Selection of units affected by influential errors; Selection of outlying units to be treated by weight adjustment; Selection by structure of units; Selection of units by macro-level review |
| Selection of variables | Selection of variables with obvious errors; Selection of variables with errors in unit properties; Selection of variables for treatment by specific imputation methods; Selection of influential outlying values for manual review; Localizing the erroneous values among those that are inconsistent; Localizing the variables affected by errors for each unit |
Treatment | Variable imputation | Correction of obvious errors; Correction of systematic errors; Correction of errors in unit properties; Imputation of localized errors; Imputation of missing or discarded (erroneous) values; Adjustment for inconsistency |
| Unit treatment | Treatment of units in the critical set; Creation of statistical units; Treatment of unit linkage deficits |
38. Beyond the data set that is reviewed and/or modified, the functions need additional metadata in order to be put into the process flow and process steps. This metadata can be classified as process input metadata, metadata for functions and process output metadata. Input metadata include auxiliary data, parameters and unstructured metadata. Metadata for functions specify methods and rules. Output metadata are quality measures and paradata. See Chapter 4 for more information on these metadata issues and their importance for functions.
39. A process method specifies how the data editing functions in a process flow are to be performed in real-life situations. In this document, we call them methods for short.
40. Methods can be associated with rules. Rules are mathematical/logical functions of the variables in the data set and possibly also of auxiliary variables. Sometimes these rules may further be fine-tuned using parameters. The process of defining the right parameters in a certain context is called parameterization. We distinguish between edit rules, score functions and correction rules.
41. Edit rules describe the valid (hard edits) or plausible (soft edits) values of variables or combinations of variables. Especially in business statistics there are often large sets of hard and soft edit rules such as: linear equalities (balance edits), inequalities and ratio edits. Edit rules are used in review functions that assess the violation of hard edits or the amount of violation of soft edits. Hard edit rules are also used by methods for selection of values presumed to be in error, e.g. implementations of the Fellegi-Holt method. Imputation methods may also use edit rules, in particular, the adjustment for consistency of imputed values uses hard edit rules.
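As a minimal sketch of how such rules can be evaluated in practice (the field names, bounds and balance relation below are illustrative assumptions, not taken from this document), a record can be checked against hard and soft edits as follows:

```python
def check_edits(record):
    """Return the lists of violated hard and soft edits for one record."""
    hard_failures = []
    soft_failures = []

    # Hard edit: balance edit (linear equality) -- components must sum up.
    if record["profit"] != record["turnover"] - record["costs"]:
        hard_failures.append("balance: profit = turnover - costs")

    # Hard edit: inequality -- costs cannot be negative.
    if record["costs"] < 0:
        hard_failures.append("inequality: costs >= 0")

    # Soft edit: ratio edit -- turnover per employee within plausible bounds.
    if record["employees"] > 0:
        ratio = record["turnover"] / record["employees"]
        if not (10 <= ratio <= 10_000):
            soft_failures.append("ratio: 10 <= turnover/employees <= 10000")

    return hard_failures, soft_failures

hard, soft = check_edits(
    {"turnover": 500, "costs": 200, "profit": 300, "employees": 4}
)
```

A review function would report these violations; a selection function could then use the hard failures to localize erroneous values.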
42. Score functions assess the plausibility and influence of the values in a unit as a whole. They are typically used by selection functions that select units for interactive editing.
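A widely used form of such a score weights the deviation between the raw value and an anticipated value by the design weight and relates it to a reference total; the sketch below is illustrative, with hypothetical names and numbers:

```python
def selective_editing_score(raw_value, anticipated_value, design_weight,
                            reference_total):
    """Score ~ potential impact of the unit's suspicious value on the total."""
    return design_weight * abs(raw_value - anticipated_value) / reference_total

# Units with a score above a chosen threshold are routed to interactive
# editing; the rest are treated automatically or left untouched.
score = selective_editing_score(raw_value=1200, anticipated_value=1000,
                                design_weight=50, reference_total=1_000_000)
```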
43. Correction rules combine detection, selection and imputation of missing data or erroneous values. In particular, they are used for the correction of specific “obvious” errors and systematic errors or, more generally, of errors with a detectable cause and a known error mechanism. They can be formulated as IF-THEN rules of the following form: IF (condition) THEN OldValue = NewValue. This type of rule is usually applied during micro-editing. IF-THEN rules can also be used for automatic error detection, in which case they take the form: IF (condition) THEN FlagValue = ErrorCode.
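For example, a correction rule for a so-called thousand error (a value reported in units instead of thousands) can be sketched as an IF-THEN rule; the field names and ratio bounds below are illustrative assumptions, not prescriptions from this document:

```python
def correct_thousand_error(record, reported_field="turnover_reported",
                           register_field="turnover_register"):
    """IF the reported value looks ~1000x too large THEN divide it by 1000."""
    reported = record[reported_field]
    reference = record[register_field]
    # IF (condition): reported value is 300x-3000x the reference value.
    if reference > 0 and 300 <= reported / reference <= 3000:
        # THEN OldValue = NewValue, plus an error flag recorded as paradata.
        record[reported_field] = reported / 1000
        record["flag"] = "THOUSAND_ERROR_CORRECTED"
    return record

rec = correct_thousand_error(
    {"turnover_reported": 2_500_000, "turnover_register": 2_400}
)
```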
44. Table 2, Table 3 and Table 4 are similarly structured: each function category has one or more corresponding method categories with examples. Some of the methods also appear in common process steps described in Table 6 in Chapter 5. Note that the subcategory classification is not meant to cover all possible alternatives, though it shows many familiar methods for each of the three function types.
45. The methods for the review functions vary from simple to complex. The most common review methods are edit rules in various forms. Methods targeted at studying data plausibility usually require specific analytical constructions to obtain indicators for selection. A score is a quality measure of a unit; review by unit scores has two main parts: scores for selective editing and other types of scores for review. Micro-level consistency is studied in order to reveal problematic unit situations concerning linkage and alignment between multiple input sources. Table 2 presents examples of these review methods.
Table 2. Categories of Methods for Review Functions
Function category | Method category | Examples |
Review of data validity | Edit rules | Edit rules by valid values (a set of valid values defined for a variable); Edit rules by limits (an interval of valid values defined for a variable); Edit rules by historic comparisons (variable value relations at different time points); Edit rules by variable relations (constructing variable relations from prior knowledge); Mixture of types of edit rules (a combination of different edit rules) |
Review of data plausibility | Analytical methods for review | Measures for outlier detection (e.g. calculating measures from the distribution of a variable); Aggregates for macro-level studies (e.g. calculating totals for comparison with previous totals); Coverage analysis (e.g. does a subpopulation have a high proportion of non-matching units?); Population sizing (e.g. no. of register households ≈ no. of census households?); Cluster analysis (recognizing erroneous values with mixture modelling) |
Review of units | Sufficiency study of unit | Sufficiency check of value content of unit (a study of value content and item nonresponse) |
| Micro-level consistency | Edit rules by linkage status (e.g. checking the status: match, non-match, multiple matches); Edit rules for misalignment (e.g. does a person have multiple addresses?) |
| Score by auxiliary variable | Auxiliary variable as a criterion for importance (e.g. using turnover to assess the importance of an enterprise) |
| Score calculation for selective editing | Score function for totals (quantifying the editing effect of a unit on an estimated total); Score by parametric model for data with errors (a parametric model taking possible errors into account); Edit-related score calculation (score calculation taking edit rules and estimates into account); Score calculation by latent class analysis (a score related to the expected error based on modelling); Score calculation by prediction model (predicting error probabilities based on previously well-edited data) |
| Interactive review of unit | Inspection of the unit and the variable values as a whole (a clerical evaluation of the state of the unit) |
46. The action of selection leads to a simple outcome: either we mark a unit, or variables of a unit, as selected or not (binary, 0/1). The techniques for units use automation based on thresholds or on testing the unit structure, or a manual selection based on decisions by the editor. Correspondingly, the techniques for variables use various computational solutions to limit the set of variables in observations for further processing; again, manual inspection is an option. The rules for aggregates may resemble the principles used in the edit rules for observations. Table 3 presents some methods familiar from both theoretical selection types and practical solutions.
Table 3. Categories of Methods for Selection Functions
Function category | Method category | Examples |
Selection of units | Selection by scores | Selection by fixed threshold (a threshold based on experience or reasoning); Selection by threshold from score distribution (a point from the score distribution used as threshold); Selection by threshold from pseudo-bias study (a percentage level of manual treatment from the pseudo-bias study is used to determine a threshold) |
| Selection by structure | Complicated relations (e.g. an unmarried couple with their child at one address and the man’s wife at a separate address); Dubious structure (e.g. an address with a family nucleus, a grand aunt and an unrelated person) |
| Macro-level selection | Selection by group statistics (e.g. postcodes with the highest linkage errors) |
| Interactive unit selection | Units chosen interactively (a clerical selection of the unit) |
Selection of variables | Micro-level selection of variables | Selection of obvious errors (directing obvious errors to correction); Random error localisation (identifying erroneous values with an algorithm); Accepting a multivariate error situation in a unit (selecting all variables with an indicator in the unit) |
| Macro-level selection of variables | Selection based on outlier calculations (method-specific selection rules for outliers); Selection based on rules for aggregates (identifying a suspicious set of units based on an estimate) |
| Interactive variable selection | Variables chosen interactively (a clerical selection of a variable for further treatment) |
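As an illustration of “selection by threshold from score distribution” in Table 3, the sketch below (scores and the selected share are hypothetical) selects the units whose scores fall in the top share of the observed distribution:

```python
def select_by_score_quantile(scores, top_share=0.10):
    """Return the indices of units selected for interactive treatment."""
    ordered = sorted(scores, reverse=True)
    # The threshold is the score at the chosen quantile of the distribution.
    n_selected = max(1, int(round(top_share * len(scores))))
    threshold = ordered[n_selected - 1]
    return [i for i, s in enumerate(scores) if s >= threshold]

selected = select_by_score_quantile(
    [0.02, 0.50, 0.01, 0.03, 0.90, 0.04, 0.02, 0.01, 0.05, 0.03]
)
```

In practice the threshold would be tuned, e.g. via a pseudo-bias study, rather than fixed at an arbitrary quantile.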
47. The treatment function type usually has many corresponding alternate methods, which are familiar from the literature as well as editing practices in statistical editing processes. Other more general classes may be defined for some methods, for example dividing variable imputation methods into random and non-random imputation. The unit level treatments are usually connected to various operations needed when combining and reconciling the different units residing in multiple input sources. Table 4 presents several treatment methods.
Table 4. Categories of Methods for Treatment Functions
Function category | Method category | Examples |
Variable imputation | Interactive treatment of errors | Re-contact (obtaining the real value from the respondent or data provider); Inspection of questionnaires (checking values from a questionnaire, e.g. for process errors); Value replacement (substituting or adding a value from another variable/source); Value creation (a value decided on the basis of subject-matter knowledge) |
| Deductive imputation | Imputation with a function (a value calculated as a function of other values); Imputation with logical deduction (a value deduced with logical expressions); Imputation with historic values (a value transferred from an earlier time point); Proxy imputation (a value adopted from a related unit) |
| Model-based imputation | Mean imputation (using the mean of a variable); Median imputation (using the median of a variable); Ratio imputation (using an auxiliary variable value through ratio calculation); Regression imputation (predicting a value with a regression model) |
| Donor imputation | Random donor imputation (selecting a donor randomly, within a domain); Sequential donor imputation (a sequential selection of donors); Nearest neighbour imputation (selecting a donor based on a distance function); Random nearest neighbour imputation (selecting a donor randomly within the neighbourhood) |
| Consistency adjustment | Balance edit solution (a solution derived from consistency conditions); Prorating (adjusting a block of existing values for consistency); Ratio-corrected donor imputation (donor imputation with a ratio correction for consistency); Partial variable adjustment (correcting variable values with prior knowledge) |
Unit treatment | Unit rejection | Deletion (rejecting a unit) |
| Unit creation | Mass imputation (e.g. imputation of missing households in a one-number census); Imputation of lower-level units for an upper-level unit (e.g. imputation of missing persons in responding households); Creating upper-level units from lower-level units (e.g. grouping persons into households and deducing household variables) |
| Unit linkage | Correcting linkage deficits (e.g. clerical review of linked pairs of units); Matching different types of units (e.g. placing a household with an unknown address in an ‘unoccupied’ dwelling) |
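As an illustration of the nearest neighbour imputation method in Table 4, the sketch below (the field names and donor pool are hypothetical) copies a missing value from the complete record closest on an auxiliary variable:

```python
def nearest_neighbour_impute(recipient, donors, aux_vars, target_var):
    """Impute recipient[target_var] from the donor nearest in aux_vars."""
    def distance(donor):
        # Squared Euclidean distance on the auxiliary variables.
        return sum((recipient[v] - donor[v]) ** 2 for v in aux_vars)

    best_donor = min(donors, key=distance)
    recipient = dict(recipient)  # do not mutate the caller's record
    recipient[target_var] = best_donor[target_var]
    return recipient

donors = [
    {"employees": 5, "turnover": 400},
    {"employees": 50, "turnover": 6000},
]
imputed = nearest_neighbour_impute(
    {"employees": 6, "turnover": None}, donors,
    aux_vars=["employees"], target_var="turnover",
)
```

The random variant would draw among the k closest donors instead of taking the single nearest one.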
48. Functions are often specialized and can be categorized with respect to specific properties such as error types (e.g. obvious errors, systematic errors), target (e.g. macro-level estimates) or forthcoming action (e.g. interactive treatment, imputation). Some functions combine two or more process steps. A common example is simultaneous review and selection, e.g. Fellegi-Holt error localization or outlier detection.
49. In practice, implementations in production often do not distinguish between all the methods presented in the previous sections. In some cases, due to computational challenges, the implemented solution might not fully reflect the conceptual functions and methods initially designed. However, the parameterization of a method is a task that should be performed carefully in practice for an efficient and effective editing process. Instead of spreading data editing parameters over several programs, it is better to maintain them in a metadata system that feeds them into the methods in a centralized manner.
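As a sketch of this idea (a plain dictionary stands in for the metadata system, and all parameter names and values are illustrative), the methods read their parameters from one central structure rather than hard-coding them:

```python
# Central store of editing parameters, fed to all methods from one place.
# In production this would live in a metadata system, not in the code.
EDITING_PARAMETERS = {
    "ratio_edit": {"lower": 10, "upper": 10_000},
    "selective_editing": {"score_threshold": 0.05},
    "thousand_error": {"min_ratio": 300, "max_ratio": 3000},
}

def passes_ratio_edit(turnover, employees, params=EDITING_PARAMETERS):
    """The method looks up its bounds in the central parameter store."""
    bounds = params["ratio_edit"]
    return bounds["lower"] <= turnover / employees <= bounds["upper"]

ok = passes_ratio_edit(turnover=500, employees=4)
```

Re-parameterizing the process then means updating one structure, leaving the method implementations untouched.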
50. A practical solution to the risk of spreading data editing parameters across different systems is to perform different functions at once, either in one action or as a string of actions. These special upper-level methods are called methods for a combination of functions. A very common case is an IF-THEN rule, which combines functions of all three function types in one operation: the IF part contains review in the form of evaluating an edit rule (e.g. the conditions for a thousand error), the selection lies in the decision that this rule should cause treatment of one or more variables (those specified in the THEN part), and the treatment is specified by the prescription that provides a new value. Other typical operations belonging to this class are outlier analysis, which combines review and selection (and sometimes treatment) in one operation, and the Fellegi-Holt paradigm, which may include an edit rule mechanism and an algorithm for localising errors with minimal value changes in the data.