| This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community. |
|
|
|
|
| Country | E/I | Title | Legacy System and Aims |
|---|---|---|---|
| Italy (Rocci & Varriale 2020) | E | Machine learning tool for editing in the Italian Register of the Public Administration | No legacy system; the task is new. Edit rules are the main focus, but it is also investigated whether the application of machine learning can add value to the traditional editing approaches. |
| UK (Sthamer 2020) | E | Classification of records of LCF (Living Costs and Food) survey income data that need editing | So far, there is only manual detection of spurious records. The goal was to replace the need for manual detection by learning a supervised model from former editing steps. |
| Belgium (Goyens & Buelens 2020) | I | Early estimates of energy balance statistics using machine learning | Old-fashioned working methods, such as large and complex Excel sheets, should be replaced. |
| Germany (Dumpert 2020a) | I | Machine learning methods for imputation | No legacy system. The study should show the principal behaviour of several ML methods in an imputation task. The aim was to investigate whether ML can replace other approaches in regression imputation. |
| Italy (De Fausti et al. 2020) | I | Imputation of the variable “Attained Level of Education” in the Base Register of Individuals | No legacy system; the task is new. The goal of the investigation was to determine how and where ML can give greater benefits in solving the imputation problems compared with classic statistical models. |
| Poland (Wójcik 2020) | I | Imputation in the sample survey on participation of Polish residents in trips | No legacy system. The goal was to achieve high predictive accuracy in imputation in order to avoid additional surveys. |
The second table gives an overview of the data used, the main steps conducted, and the algorithms compared.
| Country | E/I | Data | Steps | Algorithms |
|---|---|---|---|---|
| Italy (Rocci & Varriale 2020) | E | Public Administration Database (BDAP) and the Information System on the Operations of Public Bodies (SIOPE) | comparing several variables from the two sources; identifying different types of inconsistent data; list of units regarded as important for deeper analysis, delivered by subject-matter experts; identifying the edit rules behind such units | decision trees, random forests |
| UK (Sthamer 2020) | E | pre- and post-edited LCF data for one year | data preparation; calculation of the change vector; learning models to predict the change vector | decision trees, random forests, neural network |
| Belgium (Goyens & Buelens 2020) | I | quarterly data from Q1 2000 through Q1 2019 | z-standardization of the data; feature selection for linear regression; calculating and comparing predictions | linear regression, ridge regression, lasso, random forest, neural network, ensemble prediction |
| Germany (Dumpert 2020a) | I | German cost structure survey of enterprises in manufacturing, mining and quarrying | creating missing values (several proportions, several missingness mechanisms); calculating and comparing predictions | k-nearest-neighbours (weighted and unweighted), Bayesian networks, random forests, support vector machines |
| Italy (De Fausti et al. 2020) | I | administrative information from the Ministry of Education, University and Research; 2011 census data; sample survey data | focussing on one region and on incomplete records; some manual feature selection; calculating and comparing predictions | multi-layer perceptron, random forests, log-linear model |
| Poland (Wójcik 2020) | I | quarterly sample survey on participation of Polish residents in trips, 2016 to 2018, and some big data sources | learning different models for estimation and comparing their predictions by several measures | different kinds of (generalized) linear models, regression tree, random forest, nearest neighbour, different kinds of support vector machines |
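The evaluation pattern shared by the imputation studies (notably the German one) can be sketched as follows: delete observed values completely at random, impute the gaps, and compare distributional summaries of the imputed data with the truth. The sketch below uses synthetic data and simple stratum-mean imputation as a stand-in for the ML methods actually compared in the studies; all names and numbers are illustrative, not taken from the papers.

```python
import random
import statistics

random.seed(0)

# Toy data set: (stratum, value) pairs standing in for survey records.
records = [("A", random.gauss(100, 10)) for _ in range(200)] + \
          [("B", random.gauss(200, 20)) for _ in range(200)]

# Step 1: create missing values completely at random (about 30 % missingness),
# mirroring the studies' design of several proportions and mechanisms.
observed = [(g, v if random.random() > 0.3 else None) for g, v in records]

# Step 2: impute each gap with the mean of the observed values in its stratum
# (a deliberately simple stand-in for the compared ML methods).
means = {g: statistics.mean(v for s, v in observed if s == g and v is not None)
         for g in ("A", "B")}
imputed = [(g, v if v is not None else means[g]) for g, v in observed]

# Step 3: compare distributional summaries of the true vs. imputed variable,
# e.g. the mean (the German study also compared quantiles and correlations).
true_mean = statistics.mean(v for _, v in records)
imp_mean = statistics.mean(v for _, v in imputed)
```

In the real studies, step 2 would be replaced by the trained model (random forest, k-nearest-neighbours, MLP, etc.), and step 3 extended to the full list of measures in the third table below.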
The third table shows details of the software and hardware used, as well as the metrics used to assess the performance of the compared algorithms.
| Country | E/I | Software / Hardware | Measures |
|---|---|---|---|
| Italy (Rocci & Varriale 2020) | E | R; no special hardware | usefulness of the results in indicating whether a variable determines the presence of a dangerous error in the data; accuracy for model selection |
| UK (Sthamer 2020) | E | Python; Intel Core i5-8365U, 1.60 GHz, 8 GB RAM | recall, precision, F1 |
| Belgium (Goyens & Buelens 2020) | I | Python; Intel i7 CPU with 6 cores, 32 GB RAM | root mean squared error, mean error, mean absolute error, mean absolute percentage error |
| Germany (Dumpert 2020a) | I | R; Intel Core i5-6500, 3.2 GHz, 8 GB RAM | mean, standard deviation, skewness, kurtosis, minimum, maximum, 25 %-quantile, median, 75 %-quantile of the imputed variables; correlations between the variables |
| Italy (De Fausti et al. 2020) | I | Python; Azure cloud platform with Tesla V100-PCIE-16GB GPU | micro-level accuracy, macro-level accuracy |
| Poland (Wójcik 2020) | I | R; Intel Core i7-4770, 2×3.40 GHz, 64-bit, 16 GB RAM | mean absolute error, mean absolute percentage error, root mean squared error, R² |
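The measures named above have standard definitions; as a reference point, here is a minimal pure-Python sketch of the classification metrics used in the UK study (precision, recall, F1) and the regression errors used in the Belgian and Polish studies (MAE, MAPE, RMSE). The helper functions and the toy labels are illustrative, not taken from the studies' code.

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for a binary label (e.g. record needs editing)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def regression_errors(y_true, y_pred):
    """MAE, MAPE (in %) and RMSE for numeric predictions."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mape = 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return mae, mape, rmse

# Toy example: 1 = record flagged as needing editing, 0 = clean record.
p, r, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
```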
A fourth table contains the most important aspects of the individual conclusions from the projects.
| Country | E/I | Conclusion |
|---|---|---|
| Italy (Rocci & Varriale 2020) | E | • the first application of ML methods in this context has shown the possibility of using ML to support the design of an E&I scheme to make it more efficient • exploring hidden patterns in the data with ML tools can help to understand how to classify units more efficiently as erroneous/not erroneous in terms of different error types and, therefore, how to combine the different E&I process steps |
| UK (Sthamer 2020) | E | ML can be used for editing, but some points have to be borne in mind: • a ground-truth/gold-standard data set for retraining the model has to be created and enhanced periodically • ML expertise should be present within the survey team to monitor and retrain the model when required • editing will be far more efficient and faster with the ML solution compared to existing processes • survey data will be available sooner for further processing, allowing for more timely data and faster release • it remains open whether ML can save costs here, because some clerical editing resources have to be maintained, as well as the technical expertise to build, analyse and keep the ML solution in operation |
| Belgium (Goyens & Buelens 2020) | I | • think of a baseline method that is simple, common-sensical and reasonably performing • no single ML method worked best, or even better than a very simple method • in this study the ensemble method, averaging results from several ML methods, seems promising • manage expectations well; some people expect great results without effort or investment • substantial effort is needed to conduct a proper investigation into the usability of ML methods • making data and code publicly available has been well received by the community and can stimulate future joint work |
| Germany (Dumpert 2020a) | I | • it is too early to give general (not survey-specific) advice to use one of the investigated methods for imputation • random forest does the imputation faster than the other methods tested in the study • the use of weighted k-nearest-neighbours and random forest leads to more stable and “correct” estimations of the moments and quantiles; furthermore, the boxplots of these two methods are more symmetric than those of the other methods |
| Italy (De Fausti et al. 2020) | I | • the estimation results of the two approaches (MLP vs. log-linear model) are completely comparable • for particular sub-populations, such as extreme items (PhD), log-linear imputation is better • the micro accuracy of the MLP is slightly better than that of the log-linear model • the MLP approach does not require variable pre-treatment |
| Poland (Wójcik 2020) | I | • machine learning is much more powerful than traditional models and can easily overfit the data • estimating the out-of-sample error, by bootstrapping or cross-validation, is important to compare various methods • when k-fold cross-validation was run several times, it led to confusion about which model is optimal; bootstrapping seems to be a more reliable method for model selection, but at the same time it is more time-consuming • model selection cannot be based just on accuracy measures like MAPE, RMSE etc. without checking distributional accuracy, including bias • when data is imputed, it is hard to expect to impute data perfectly at the individual level; it may be expected to retrieve the true mean level of imputed data with respect to some strata; then, on average, totals can be calculated correctly |
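The ensemble prediction the Belgian study found promising is, in its simplest form, a plain average of the per-model predictions for each unit. The sketch below illustrates why this can beat every individual model: errors in opposite directions partly cancel. The three sets of predictions are made up for illustration; the real study averaged linear models, a random forest and a neural network on energy balance data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, one of the measures used in the study."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

# Hypothetical true values and predictions from three imperfect models
# whose errors point in different directions.
y_true = [10.0, 12.0, 9.0, 11.0]
preds = {
    "linear":  [11.0, 11.0, 10.0, 10.0],
    "forest":  [9.0, 13.0, 8.0, 12.0],
    "network": [10.5, 12.5, 9.5, 10.5],
}

# Ensemble prediction: the plain average across models for each unit.
ensemble = [sum(p[i] for p in preds.values()) / len(preds)
            for i in range(len(y_true))]
```

With these made-up numbers the ensemble's RMSE is lower than that of any single model, which is exactly the behaviour the Belgian conclusions describe as promising; of course, averaging only helps when the models' errors are not all in the same direction.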
|
|
|