Evaluated by T. de Waal, Statistics Netherlands, 2003
SYSTEM INFORMATION
Full name: 
CherryPie: Generalized error localization
Version: 
1 
Year: 
2003 
Developer: 
Statistics Netherlands 
DESCRIPTION
CherryPie was developed by Statistics Netherlands to solve the error localization problem for mixtures of quantitative and qualitative variables. It was developed in a Visual C++ environment based on the LEO prototype, but with no interface (see the evaluation of LEO). CherryPie is part of SLICE, which is a collection of editing and imputation modules.
Editing rules identify the conditions that a record must satisfy to be considered a good record. If one or more rules are violated, the fields that must be changed in order to satisfy the rules have to be identified. The edit rules are defined as general IF-THEN statements, as opposed to LEO, which expects the IF-statements to include categorical variables and the THEN-statements to include numerical variables.
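As an illustration of what a general IF-THEN edit rule looks like, the following minimal Python sketch represents each rule as a pair of predicates and lists the rules a record violates. It is not CherryPie's rule syntax or parser; the `Edit` class, the `violated_edits` function and the variable names (`sector`, `turnover`, `employees`, `legal_form`) are hypothetical and chosen only for the example. Rule E2 puts a numerical variable in the IF-condition and a categorical variable in the THEN-condition, the kind of rule the general form permits.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Edit:
    """A hypothetical IF-THEN edit rule: IF condition THEN consequence."""
    name: str
    condition: Callable[[Record], bool]    # the IF part
    consequence: Callable[[Record], bool]  # the THEN part

    def is_satisfied(self, record: Record) -> bool:
        # An IF-THEN rule is violated only when the IF part holds
        # and the THEN part does not.
        return (not self.condition(record)) or self.consequence(record)


def violated_edits(record: Record, edits: List[Edit]) -> List[str]:
    """Return the names of the edit rules that the record violates."""
    return [e.name for e in edits if not e.is_satisfied(record)]


# Illustrative rules mixing categorical and numerical variables.
edits = [
    Edit("E1",
         lambda r: r["sector"] == "retail",     # categorical IF
         lambda r: r["turnover"] >= 0),         # numerical THEN
    Edit("E2",
         lambda r: r["employees"] > 0,          # numerical IF
         lambda r: r["legal_form"] != "none"),  # categorical THEN
]

record = {"sector": "retail", "turnover": -5,
          "employees": 3, "legal_form": "none"}
print(violated_edits(record, edits))
```

A record that violates at least one rule is the input to the error localization problem: it tells us that something is wrong, but not which fields should be changed.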
The main idea behind the algorithm is to build a binary tree (branch-and-bound method). At each step (or node), a variable is selected for analysis and two branches are created: (a) impute the variable or (b) do not impute the variable. When a variable is not imputed, it is fixed to its initial value in the set of edit rules, which creates a new set of rules to be analyzed. When a variable must be imputed, it is removed from the rules by using a Fourier-Motzkin elimination technique. If the rules become inconsistent at a given step, the algorithm goes back to a previous step and resumes the analysis there.
CherryPie generates a list of all optimal solutions, that is, all solutions that minimize the number of variables to be imputed. Users who wish to do so can develop their own module to select one of these solutions; such a component may be based on subject-matter knowledge. The alternative is to select one of the solutions at random. CherryPie allows the use of weights for each variable when the user wishes to exert some influence on the identification of the fields to be imputed.
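The effect of weights and of the random selection can be illustrated with a short Python sketch. It assumes, as is common in error localization but not stated explicitly here, that the quantity minimized is the sum of the weights of the imputed variables, so that a variable with a higher weight is less likely to be changed. The functions `optimal_solutions` and `pick_solution` and the weights used are hypothetical.

```python
import random


def optimal_solutions(candidates, weights):
    """Keep the candidate solutions with the smallest total weight.

    Assumption: the cost of a solution is the sum of the weights of
    the variables it proposes to impute.
    """
    def cost(solution):
        return sum(weights[v] for v in solution)

    best = min(cost(s) for s in candidates)
    return [s for s in candidates if cost(s) == best]


def pick_solution(solutions, rng=None):
    """Default selection: choose one of the optimal solutions at random.

    A user-written module based on subject-matter knowledge could
    replace this function.
    """
    rng = rng or random.Random()
    return rng.choice(solutions)


# Three candidate solutions; turnover is considered more reliable.
candidates = [["turnover"], ["costs"], ["profit"]]
weights = {"turnover": 3, "costs": 1, "profit": 1}

best = optimal_solutions(candidates, weights)
print(best)
print(pick_solution(best, random.Random(0)))
```

With equal weights all three candidates would be optimal; giving turnover a higher weight removes it from the list, which is how the user can steer the identification of the fields to be imputed.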
In CherryPie, the maximum number of fields to be imputed may be around 10, compared with 5 in LEO. The system processes the errors and the missing values independently. Although the limit on the number of errors is relatively low (about 10, as mentioned above), the limit on the number of missing values can be as high as 50, or even more.
SLICE provides an independent module for finding an approximate solution in case the CherryPie branch-and-bound algorithm fails. CherryPie usually terminates quickly after it has found an optimal solution, because many branches of the tree can then be cut off. Therefore, visiting the entire tree is not always necessary. CherryPie offers a parser that reads general IF-THEN rules and converts them to rules that CherryPie can read. The general IF-THEN rules accept numerical or categorical variables in both the IF-conditions and the THEN-conditions. Finally, the user may specify a time limit for the processing of each record.
STRENGTHS
 The algorithm is not complex and is therefore easy to program. Furthermore, the approach allows the problem to be split into several components (or modules).
 It is possible to process categorical and numerical variables simultaneously. The edit rules can include both types together.
 It can process a large number of variables (more than 100) in one run.
 The system allows the processing of negative values.
 Several parameters can be defined by the user: the weights associated with the variables, a flag or value that indicates that the data are missing, etc.
 All the solutions with minimum change are identified and only one is randomly selected for the imputation. In such a case, the user can also develop a module to select a specific solution.
 The algorithm can be modified easily to add new components like the processing of integer values.
 It processes errors and missing values independently.
 The integrated parser makes the edit rules more flexible than in LEO.
WEAKNESSES
 The maximum number of fields to be imputed must not be too high in order to keep the binary tree at a reasonable size. It is recommended not to impute more than ten fields. Although this would satisfy most applications, it may represent a limitation for some large ones.
 CherryPie itself provides no approximate solution in the case where no optimal solution is found.
 The binary tree must be entirely visited to make sure the solutions are optimal, even when the optimal solutions are found early in the tree.
 The number of implicit rules that are kept at each step may become very large. This can bring CherryPie into an unstable state.
FUNCTIONAL EVALUATION
LEGEND
*** The implementation offers sub-functions or options required by a wide range of survey applications.
** The implementation has a less complete set of options.
* The implementation offers partial functionality. Options are too restrictive or not generalized enough.
No stars are assigned when the functionality is not offered at all.

TYPE OF DATA 


Quantitative data 


Qualitative data 


EDITING FUNCTIONS 


Data verification 
* 

Online correction 
 

Error localization 


Minimum changes 


User-defined changes
 

Outlier detection 


IMPUTATION FUNCTIONS 


Deterministic imputation 
 

Donor imputation 


Imputation by estimators 


Multiple imputation 
 

GENERAL FEATURES 


Graphical user interface 


User-friendliness


Online help 


Online tutorial 


Documentation 


Diagnostic reports 


Integration 


Reusable code 


Portability 


Flexibility 


User support 


Acquisition cost 
 

REFERENCES
 Quere, R. and De Waal, T. (2000). "Error Localization in Mixed Data Sets". Statistics Netherlands Technical Report.
 De Waal, T. (2000). "An Optimality Proof of Statistics Netherlands' New Algorithm for Automatic Editing of Mixed Data". Statistics Netherlands Technical Report.