This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community.
Table of Contents

- Expectations on machine learning
- Retrospective and lessons learnt
It is surely uncontroversial that NSIs need to identify and deal with suspicious and missing values in datasets. It is equally uncontroversial that there are several ways to do this. For example, it can be done in a rule-based way, where data items are checked against restrictions on their values. Another approach is to work with the distribution of (parts of) the data: implausible data should, in some sense, not belong to the rest of the data at hand. Domain knowledge is usually a necessary component of editing. Several editing procedures have been suggested (for example the internationally agreed Generic Statistical Data Editing Model, GSDEM, by UNECE), mostly based on identifying the most appropriate statistical method to detect and treat specific types of errors. Analogously, there are several ways of dealing with missing values and many approaches to, and goals of, imputing values. This theme report provides a summary of the activities of, and the experiences gained by, the members of the editing and imputation group of the UNECE HLG-MOS Machine Learning Project. The main goal of the editing and imputation group is to show to what extent machine learning algorithms can be used to efficiently improve editing and imputation processes in NSIs (by replacing, improving or complementing the methods used so far). Some of the paragraphs below appear, verbatim or paraphrased, in the cited pilot study reports or in a paper presented at the UNECE Statistical Data Editing Virtual Workshop 2020 (Dumpert 2020b). To make clear what the two parts (editing on the one hand, imputation on the other) consist of, it is necessary to introduce the following differentiation: for the machine learning project, we treated the detection of suspicious and missing values as editing, and the alteration of incorrect values as well as the insertion of values for missing items as imputation.
Other definitions (which are not used here) treat the process of altering incorrect values as part of the editing. The editing and imputation group has members from Belgium (imputation; Goyens & Buelens 2020), Germany (imputation; Dumpert 2020a), Italy (editing and imputation; Rocci 2020, Rocci & Varriale 2020, De Fausti et al. 2020), Poland (imputation; Wójcik 2020), the UK (editing; Sthamer 2020), and co-workers from Switzerland (editing; Ruiz 2018) and Australia (imputation; Buttsworth 2020).
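A rule-based check of the kind described above can be sketched as follows; the variables (turnover, employees), the thresholds, and the hard/soft distinction shown here are purely illustrative and not taken from any of the pilot studies:

```python
# Minimal sketch of rule-based editing: each record is checked against
# hard rules (constraints that must hold) and soft rules (violations only
# mark the record as suspicious for further review).

def check_record(record):
    """Return a list of edit-rule violations for one record."""
    flags = []
    # Hard rule: turnover must be present and non-negative.
    if record.get("turnover") is None or record["turnover"] < 0:
        flags.append("hard: turnover missing or negative")
    # Soft rule: turnover per employee outside a plausible range is
    # flagged as suspicious, not rejected outright.
    if record.get("employees", 0) > 0 and record.get("turnover") is not None:
        per_head = record["turnover"] / record["employees"]
        if not (5_000 <= per_head <= 5_000_000):
            flags.append("soft: implausible turnover per employee")
    return flags

records = [
    {"turnover": 120_000, "employees": 4},   # passes both rules
    {"turnover": -10, "employees": 2},       # violates the hard rule
    {"turnover": 80, "employees": 10},       # violates the soft rule
]
for r in records:
    print(r, check_record(r))
```

Flagged records would then move on to interactive or automatic treatment, depending on whether a hard or only a soft rule fired.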
This section describes the motivation of the participating NSIs to look into machine learning. Often, machine learning is not used exclusively but in addition to, or at least in comparison with, an already existing process, i.e. both other specific statistical methods and (in some cases) human interactive work. This is true for exploratory phases as well as for the production of official statistics, and it is the case for both survey-based and register-based work in NSIs. One of the goals is often to increase the proportion of records in a statistic that can be treated in a more automated way. Sometimes, the need for statistical methods also arises from the special situation that data from more than one source has to be combined. Another goal is an improvement in the process of reporting official numbers by delivering better (e.g. more accurate) or faster (shorter production process for the final report and its contents) predictions. Basic considerations and an embedding of the work into models like the Generic Statistical Data Editing Model (GSDEM) are provided by Rocci (2020).
Expectations on machine learning
Editing

As mentioned, editing is meant to detect “problematic” cells or items in the data, to be treated more carefully. Broadly speaking, editing methods can be classified according to two main criteria: (i) whether they are based on edit rules that the data are expected to respect; such rules can be either hard or soft, i.e. they represent constraints on the data or only expected values or relationships between variables; (ii) whether they are explorative methods aimed at identifying data that is anomalous with respect to some model thought to represent the data properly. Hence, the following aspects were mentioned as possible added value of the use of machine learning (ML) for editing: 1. ML may discover rules that have at first only been “known” intuitively, learnt by subject matter experts through previous interactive experience with the same process. This may help
2. More concretely: A supervised machine learning model could learn from former editing results which units (records or even cells) in a data set are problematic. This means:
3. ML (as well as model-based approaches) may offer a valid and efficient new instrument for the non-rule-based perspective on editing. This would help
4. More concretely: An unsupervised machine learning model could be used to analyse data with respect to its “hidden structure”, with less need for an a priori model of the data. At first glance, this means that it can help to gain efficiency to
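The supervised idea in point 2 can be illustrated with a small sketch: a classifier is trained on historical records together with a flag indicating whether an editor changed them, and new records are then ranked by their predicted probability of needing attention. The features, the `was_edited` flag, the synthetic data, and the choice of a random forest are assumptions for illustration only:

```python
# Hedged sketch: learn from former editing results which records are
# likely to be problematic, then route only the most suspicious new
# records to interactive editing.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Historical records: three numeric features plus a flag that says
# whether an editor changed the record (synthetic stand-in data).
X_hist = rng.normal(size=(500, 3))
was_edited = (X_hist[:, 0] + rng.normal(scale=0.5, size=500) > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_hist, was_edited)

# Score incoming records by their probability of needing an edit.
X_new = rng.normal(size=(10, 3))
suspicion = clf.predict_proba(X_new)[:, 1]
to_review = np.argsort(suspicion)[::-1][:3]  # top 3 most suspicious
```

The ranking step is what supports selective editing: the bulk of records can pass through automatically while human effort concentrates on the high-suspicion tail.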
It has also been expected that machine learning has the capacity to exploit a huge amount of information to support the design and the maintenance of the editing process features. However, this obviously requires the availability of a suitable amount of data.

Imputation

What follows shows the expectations of machine learning for imputation at the beginning of the project.
The first aspect here leads directly to the question of when an imputation job is done satisfactorily. At this point, there is an overlap with work package 2 of the machine learning project, which deals with quality aspects. However, there are different goals of imputation, which can be summarized by citing the EUREDIT project (Chambers 2001): 1. predictive accuracy, 2. ranking accuracy, 3. distributional accuracy, 4. estimation accuracy, and 5. imputation plausibility.
Although mentioned as number 5, imputation plausibility is a criterion which should be applied in addition to 1.–4. Obviously, these different goals require different metrics to measure success. Machine learning may offer additional value when there is either a regression or a classification (i.e. a prediction) step within the imputation process. If the focus is on predictive or ranking accuracy, this is obvious because machine learning is known to yield good predictions. If the focus is on distributional or estimation accuracy, a “prediction step” is very often involved, as in (stochastic) regression imputation or predictive mean matching. Machine learning may also add value in the task of building imputation classes; clustering algorithms might be useful in this situation.
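As a hedged illustration of such a prediction step, the following sketch imputes a numeric variable by fitting a model on the records where the variable is observed and predicting it for the records where it is missing. The data is synthetic and the random forest is only one possible model choice; a stochastic variant would add a residual draw, and predictive mean matching would donate the observed value closest to the prediction:

```python
# Sketch of regression-based imputation with an ML model, assuming the
# target variable is numeric and the auxiliary variables are complete.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))                       # auxiliary variables
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

# Mask roughly 20% of y as missing.
missing = rng.random(300) < 0.2
y_obs = y.copy()
y_obs[missing] = np.nan

# Fit on records with observed y, then predict for the missing ones.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[~missing], y_obs[~missing])
y_imputed = y_obs.copy()
y_imputed[missing] = model.predict(X[missing])
```

Which evaluation metric applies to the result depends on the goal: predictive accuracy compares imputed to true values record by record, while distributional or estimation accuracy compares aggregates computed on the completed data.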
The expectations above have been checked against the results of several pilot studies. The results of the pilot studies are written up separately as stand-alone papers. For this reason, this theme report provides only a short summary of the pilot studies. The first table offers insights into why the pilot studies were conducted. The tables are ordered: first the pilot studies on editing, then the pilot studies on imputation (in alphabetical order of the countries that ran them).
The second table gives an overview of the data used, the most important steps conducted, and the algorithms compared.
The third table shows some details of the software and hardware used, as well as the metrics used to assess the performance of the compared algorithms.
A fourth table contains the most important aspects of the individual conclusions from the projects.
Retrospective and lessons learnt
Editing

Editing, i.e. the task of finding missing and problematic data (e.g. implausible values, contradictions within records, and so on), is obviously very important in official statistics. Traditionally, rule-based comparisons of observed (or transmitted) values with (weak or strong) plausibility constraints, distributional investigations (e.g. for outlier detection), and comparisons with external and/or former data sets are applied. Every editing procedure can be designed as a different flow, according to the process features. Several steps are usually involved, in which both automation (through edit rules) and subject matter experts (through interactive editing) play an important role in detecting problematic data. The degree of automation usually depends on the types of errors identified as most common and on the possibility of finding edit rules that characterize them. However, complete automation should not be the most important goal of the use of machine learning in editing, and it should never be the only goal. Mainly the UK’s editing pilot study (Sthamer 2020), supplemented by the editing pilot study from Italy (Rocci & Varriale 2020), delivers first insights. The first study from this project (Sthamer 2020), where the aim was to analyse the capacity of ML to increase the automation of the editing phase as much as possible, i.e. to reduce interactive editing in favour of automation, showed:
A second experiment (Rocci & Varriale 2020) has been started (but not yet finished) to assess the use of ML to design a completely new editing process. According to the two studies so far, with machine learning the editing process can be completed much faster and more consistently (compared to manual editing). It may possibly even lead to higher quality of the data and allow for much earlier publication, but the effort required to maintain the training data, the machine learning model and the analysis of the results might, in the short term, not prove to be a cost saver. Hence, the gain so far seems to lie not so much in the efficiency of the results but in the efficiency of the statistical process: machine learning allows using huge amounts of data with much less a priori knowledge, fewer hypotheses and less data preparation (general underlying structure of the data, stratification, etc.).

Imputation

Imputation, i.e. the task of altering incorrect values and inserting missing values, is obviously very important in official statistics. From the pilot studies from Poland, Italy, Belgium, and Germany, we have seen:
From this it was possible to learn that machine learning belongs to a class of methods that are more powerful because fewer assumptions are needed (in comparison with fully parametric models); on the other hand, this flexibility means they can fit the training set perfectly but often perform poorly on new data. To avoid this, it is highly recommended to assess the performance of a machine learning model on a separate test set, for example by estimating population parameters based on the completed test set. Using machine learning successfully in production is possible only after a lot of (successful) experimentation on the topic of interest; substantial effort is needed to conduct a proper investigation into the usability of machine learning methods. Parametric models are the best, from every point of view, if the model hypothesis is correct. Unfortunately, mistakes are often made in specifying the underlying hypothesis, i.e. in modelling the phenomenon; the parametric model is then unable to provide good predictions. Non-parametric models run less risk in this respect but fit (in the finite-data situation) less well than the “true” parametric model. Furthermore, there is a need to shift the interest of stakeholders towards the accuracy and timeliness of results rather than the interpretation of parameters. There are no obvious quick wins to be made, and the uptake of machine learning methods in standard procedures requires substantial and continued effort and commitment. One should also always consider, and check against, a baseline method that is simple, well accepted, and reasonably performing; this avoids drowning in complexities with only marginal effects. For both editing and imputation, we have learnt that applying machine learning methods to statistical processes requires data science skills, in terms of both programming/coding and statistical training/testing principles.
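The recommendation to evaluate on a held-out test set and to check against a simple baseline can be sketched as follows; the data is synthetic, and mean imputation stands in for the “simple, well accepted” baseline method:

```python
# Sketch: compare an ML imputer against a simple baseline on a held-out
# test set instead of on the data it was trained on.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))                       # auxiliary variables
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: impute every missing value with the training mean.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(np.mean((baseline_pred - y_test) ** 2))

# ML model: predict the missing variable from the auxiliary variables.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
ml_rmse = np.sqrt(np.mean((model.predict(X_test) - y_test) ** 2))

# Only adopt the more complex method if it clearly beats the baseline.
print(f"baseline RMSE: {baseline_rmse:.2f}, ML RMSE: {ml_rmse:.2f}")
```

Evaluating on the test set guards against the overfitting risk described above, and the baseline comparison guards against adopting complexity with only marginal effect.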
It is also important that subject matter experts are involved. Programmers, statisticians, and subject matter experts have to work together intensively, and all of them need some data wrangling skills. This has already been expressed, for example, by Cao (2017), who wrote: “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) in order to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology.”
References
Buttsworth S. (2020). Imputation of Dwelling Occupancy for Census 2021. https://statswiki.unece.org/download/attachments/266142512/WP1_E%26I_Australia.pdf (last access: 10th Sept. 2020)

Cao L. (2017). Data science: a comprehensive overview. ACM Computing Surveys, 50(3), 1–42.

Chambers R. (2001). Evaluation Criteria for Statistical Editing and Imputation.

De Fausti F., Di Zio M., Filippini R., Toti S., & Zardetto D. (2020). Imputation of the variable "Attained Level of Education" in Base Register of Individuals. UNECE HLG-MOS Machine Learning Project, https://statswiki.unece.org/display/ML/Machine+Learning+for+Official+Statistics+Home

Dumpert F. (2020a). Machine learning for imputation. UNECE HLG-MOS Machine Learning Project, https://statswiki.unece.org/display/ML/Machine+Learning+for+Official+Statistics+Home

Dumpert F. (2020b). The UNECE High-Level-Group for the Modernization of Official Statistics Machine Learning Project: A report of the Editing & Imputation Group. Paper presented at the UNECE Statistical Data Editing Virtual Workshop 2020. https://statswiki.unece.org/download/attachments/282329136/SDE2020_T1-B_Germany_Dumpert_Paper.pdf (last access: 10th Sept. 2020)

Goyens A. & Buelens B. (2020). Early estimates of energy balance statistics using machine learning. UNECE HLG-MOS Machine Learning Project, https://statswiki.unece.org/display/ML/Machine+Learning+for+Official+Statistics+Home

Rocci F. (2020). Machine Learning for Data Editing Cleaning in NSI (Editing & Imputation): Some ideas and hints. UNECE HLG-MOS Machine Learning Project, https://statswiki.unece.org/display/ML/Machine+Learning+for+Official+Statistics+Home

Rocci F. & Varriale R. (2020). Machine Learning tool for editing in the Italian Register of the Public Administration, a proposal. UNECE HLG-MOS Machine Learning Project, https://statswiki.unece.org/display/ML/Machine+Learning+for+Official+Statistics+Home

Ruiz C. (2018). Improving Data Validation using Machine Learning. Paper presented at the UNECE Statistical Data Editing Workshop 2018. https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2018/T4_Switzerland_RUIZ_Paper.pdf (last access: 10th Sept. 2020)

Sthamer C. (2020). Editing of LCF (Living Cost and Food) Survey Income data with Machine Learning. UNECE HLG-MOS Machine Learning Project, https://statswiki.unece.org/display/ML/Machine+Learning+for+Official+Statistics+Home

Wójcik S. (2020). Imputation in the sample survey on participation of Polish residents in trips. UNECE HLG-MOS Machine Learning Project, https://statswiki.unece.org/display/ML/Machine+Learning+for+Official+Statistics+Home