This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community.
Introduction and background

This executive summary is intended for Heads of Statistical Organisations, Chief Statisticians, Chief Methodologists, Chief Data Scientists, CTOs/CIOs and senior leadership. The Machine Learning (ML) Project was proposed and approved at the HLG-MOS workshop in November 2018. Based on a paper by the Blue Skies Thinking Network (BSTN), the project aims to inform policymakers about the possibilities of using ML in the production of official statistics and to demystify ML for official statisticians unfamiliar with it.

Interest in the use of ML for official statistics is growing rapidly. For the processing of some secondary data sources (including administrative sources, big data and the Internet of Things) it seems essential that statistical organisations consider the opportunities offered by modern ML techniques. Although promising, there is so far only limited experience with concrete applications within the UNECE statistical community, and some issues have yet to be solved. The aim of this project is therefore to develop a proof of concept and to unearth any issues and challenges prior to full-scale development of any statistical outputs. The varied contexts of the NSOs are hugely helpful in developing a full understanding of the challenges and opportunities inherent in the use of ML in official statistics. The business proposal concludes with: “ML is a key modern technology that the worldwide statistical community should consider, and the methods, IT solutions and other related issues can be dealt with in a universal manner. Since, at this moment in time, basically all NSOs are in the same pioneering phase, this is an excellent opportunity for shared development and mutual collaboration.
The ML proposal seamlessly fits the HLG-MOS mission, all four elements of its vision are covered, and all five HLG-MOS values are addressed.”

Machine Learning

Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves. As a method of data analysis, machine learning automates analytical model building: systems learn from data, identify patterns and make decisions with minimal human intervention.
Table of Contents

- National Statistical Organisations, their role and strengths
- Pilot studies and Theme Reports
- What worked well and what worked even better
2.1 Work Packages and Themes
During its first project sprint in May 2019 at the ONS in Newport, UK, the group consulted the Generic Statistical Business Process Model (GSBPM). It was decided to choose three processes from the model’s ‘Process’ phase as the themes for the ML pilot studies that would fulfil the objectives of this project:
The pilot study investigations form Work Package 1 (WP1). The other two WPs are:
The given set of data referred to in the above quote is typically a text narrative provided by the respondent to describe their occupation, the industrial activity of a company, or injury information for an injured worker, or else product descriptions scraped from the internet or sentiment text obtained from Twitter.
There was only one pilot study with binary classification: the Twitter sentiment classification into positive and negative sentiments, with target classes [‘Positive’, ‘Negative’].
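As a toy illustration of a binary sentiment decision — not the method used in the pilot study, which trained an ML model on labelled tweets — a minimal lexicon-based baseline can be sketched in a few lines. The word lists and function below are invented for this sketch:

```python
# Hypothetical sentiment lexicons; a real study would learn weights from labelled data.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "sad", "hate", "poor", "terrible"}

def classify_sentiment(text: str) -> str:
    """Assign one of the two target classes by counting lexicon hits."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "Positive" if score >= 0 else "Negative"
```

Even this crude baseline makes the structure of the task visible: every message is forced into exactly one of the two target classes.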
4.1 Summary of C&C pilot studies
Most of these pilot studies are multi-class classification tasks. They aim to classify text descriptions against internationally accepted coding frames, e.g. SIC, SOC, NAICS, NOC or ECOICOP, which offer many target classes, and to assign an appropriate code from the coding frame. See Table 1 for a list of the coding frames used in the C&C pilot studies. Even though the aim of these studies is largely the same, their approaches and software solutions differ, as do their results. The sentiment analysis of Twitter data is a binary classification task in which each Twitter message is classified as either Positive or Negative.
Table 1 – Coding Frames
| Coding Frame | Description |
| SCIAN / NAICS | North American Industry Classification System (Spanish version: Sistema de Clasificación Industrial de América del Norte) |
| NOC | National Occupational Classification – Canada’s national system for describing occupations |
| SINCO | National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones) |
| NACE | European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne) |
| SIC | Standard Industrial Classification – established by the USA in 1937, replaced by NAICS in 1997 |
| SOC | Standard Occupational Classification |
| OIICS | Occupational Injury and Illness Classification System – developed by the BLS |
| ECOICOP | European Classification of Individual Consumption by Purpose |
| CTS | Catalogue of Time Series by the IMF |
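To make the multi-class coding task concrete, here is a minimal sketch of a bag-of-words Naive Bayes coder in pure Python. The descriptions and codes are invented stand-ins resembling a frame such as NACE; the pilot studies used far richer models and feature engineering:

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: free-text activity descriptions with made-up codes.
TRAIN = [
    ("retail sale of bread and cakes", "47.24"),
    ("bakery shop selling bread", "47.24"),
    ("software development services", "62.01"),
    ("custom software programming", "62.01"),
    ("road freight transport", "49.41"),
    ("transport of goods by road", "49.41"),
]

def fit(examples):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, code in examples:
        tokens = text.split()
        class_counts[code] += 1
        word_counts[code].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the code with the highest (add-one smoothed) log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for code in class_counts:
        n_words = sum(word_counts[code].values())
        logp = math.log(class_counts[code] / total)
        for tok in text.split():
            logp += math.log((word_counts[code][tok] + 1) / (n_words + len(vocab)))
        scores[code] = logp
    return max(scores, key=scores.get)
```

Usage: `predict(fit(TRAIN), "sale of bread")` returns the bakery-related code. The same shape — text in, one code out of many — underlies all the C&C pilot studies, whatever algorithm sits in the middle.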
Nine pilot studies were submitted; of these, three are in production, with one of them used as a supporting tool that helps human coders reach a faster decision:
- Canada – Statistics Canada: Industry and Occupation Coding (NAICS & NOC). In production; 13.3% of cases are auto-coded with an error rate < 5%.
- USA – BLS: Workplace Injury & Illness (SOC & OIICS). This study has shown that ML with Support Vector Machines (SVM) or Logistic Regression can outperform human coders. The use of a Neural Network has improved on this: it makes an estimated 39% fewer errors than the manual coding process. In production, predicted codes above a set threshold are auto-coded and the remaining ones are manually assigned; > 85% of codes are auto-coded.
- Norway – Statistics Norway: Standard Industrial Code (SIC). Uses ML as a supporting tool: up to five ML-predicted SIC codes are presented to coders in decreasing order of their prediction score; in 22% of cases the highest prediction score exceeds 95%.
The other six studies have advanced considerably during the 18 months since the project started.
Future plans for all of these include investigating other ML algorithms, increasing prediction accuracy or improving IT infrastructure before embedding the technology in production pipelines.
The three operationalised projects have shown that building a “Golden Data Set”, or ground truth, is essential. For this, all labels are manually assigned and deemed to be correct. Achieving this is very labour intensive, and Subject Matter Experts (SMEs) have to check and re-check the assigned codes. This serves two purposes:
Committing this resource is a challenge for most NSOs and can be a serious blocker to the development of ML solutions. However, such resources should already be committed to assessing the current processes.
Even though the success criterion for most of these studies was that ML has to perform at least as well as human coders, the main motivation for conducting them was gains in efficiency and timeliness; accuracy was stated as the objective in only two studies. To achieve this, ML is used to auto-code only predictions made above a set threshold and/or predictions for coding classes with a high prediction threshold. Predictions for minority classes, the ones rarely seen in the data, are then excluded from the ML solution and are still made by human coders. The BLS and Statistics Canada solutions use a mix of ML auto-coding and human coding to ensure that the overall accuracy is higher than human or machine coding alone. Instead of just showing the overall performance, this approach assesses and demonstrates if, how and where ML was at least as good as, or even better than, human coders. Setting prediction thresholds to identify the groups of records where ML works better, where it works well enough to assist humans and where it does not perform well enough (e.g. minority classes) is an important area to analyse. The ultimate objective is that NSOs end up with a better overall process. Please see
for individual pilot study results.
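The threshold-based division of labour described above — auto-code, assist or leave fully manual — can be sketched as a simple routing rule. The band boundaries below are illustrative and not taken from any pilot study:

```python
def route(prediction_score: float,
          auto_threshold: float = 0.9,
          assist_threshold: float = 0.5) -> str:
    """Route a record based on the model's confidence in its top code."""
    if prediction_score >= auto_threshold:
        return "auto-code"            # ML outperforms or matches human coders
    if prediction_score >= assist_threshold:
        return "assist"               # show ML suggestions to a human coder
    return "manual"                   # minority/hard classes stay human-coded
```

In practice the thresholds are set per coding class by analysing where ML beats, matches or trails human coders on the golden data set, so that the combined process is more accurate than either alone.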
The BLS solution, the most advanced in this group, has the highest proportion of auto-coding, > 85%. This has been made possible by using Neural Networks as the ML algorithm, running on 4 Graphical Processing Units (GPUs) with 3,584 cores.
The Norwegian solution was the only one using cloud computing, but it is anticipated that this technology will be used more often, as it offers very powerful IT capacity without capital expenditure; national privacy and data protection laws, however, have to be considered first. All other studies relied on standard desktop or laptop hardware.
There were 7 pilot studies in the E&I theme:
Editing
Possible value added aspects of ML for Editing as discussed by the group:
Imputation
All four pilot studies for Imputation have shown that ML can add value to the data processing pipeline:
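As a hypothetical sketch of one way ML-style imputation can work — not the method of any particular pilot study — here is a k-nearest-neighbour donor approach in pure Python, with invented field names and data:

```python
def knn_impute(records, target_key, feature_keys, k=3):
    """Fill missing target values with the mean of the k most similar
    complete records (squared Euclidean distance over the feature keys)."""
    complete = [r for r in records if r[target_key] is not None]

    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in feature_keys)

    for r in records:
        if r[target_key] is None:
            donors = sorted(complete, key=lambda c: dist(r, c))[:k]
            r[target_key] = sum(d[target_key] for d in donors) / len(donors)
    return records
```

The appeal over classical rule-based imputation is that the "rules" are implicit: similarity in the observed features decides which donors contribute, with no hand-written edit rules per variable.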
5.1 Summary of E&I pilot studies
The E&I Theme report concludes on Editing:
“According to the study so far, with machine learning the editing process can be completed much faster, more consistent, possibly even lead to higher quality data and allow for much sooner publication, but the effort required to maintain training data, the machine learning model and the analysis of the results in a short term might not prove to be a cost saver. I. e., the gain until now seems to be not so much in efficiency of the results but in the efficiency of the statistical process: machine learning methods allow using huge amount of data with much less a priori knowledge, hypotheses and data preparation (general underlying structure of the data, stratification, etc.).”
The E&I Theme Report points out: “Note that parametric models are always the best, from every point of view, if the hypothesis is good! Unfortunately, we often make mistakes in specifying the underlying hypothesis, i.e. in modelling the phenomena; hence the parametric model is not able to provide good predictions. Non-parametric models run less risk from this point of view but fit (in the finite data situation) less well than the “true” parametric model.”
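The quoted point — a mis-specified parametric model fails where a flexible non-parametric one does not — can be illustrated with a toy example. All values are synthetic: data follow a quadratic law, the "wrong hypothesis" is a straight line, and the non-parametric stand-in is a 1-nearest-neighbour predictor:

```python
# Synthetic data from a quadratic relationship on a grid over [-2, 2].
xs = [x / 10 for x in range(-20, 21)]
ys = [x * x for x in xs]

# Least-squares line y = a + b*x: the mis-specified parametric model.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def linear(x):
    """Parametric prediction under the (wrong) linear hypothesis."""
    return a + b * x

def one_nn(x):
    """Non-parametric prediction: copy the closest observed point."""
    return min(zip(xs, ys), key=lambda p: abs(p[0] - x))[1]

def mse(model):
    """Mean squared error over the observed points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / n
```

On these points the nearest-neighbour predictor fits far better than the line, illustrating the lower specification risk of non-parametric models; the flip side, as the Theme Report notes, is that a *correctly* specified parametric model would beat it, and in-sample fit says nothing about overfitting on new data.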
All the pilot studies in the E&I theme used non-parametric models. The ML algorithms were left to find the structure and relationships in the data, building their own rules and parameters to predict values for imputation and to predict whether a record needs human attention. The paper from Italy on Editing is not a pilot study but a framework on which to develop ML for Editing. This investigation, together with the other pilot studies, has shown that the E&I group has made a great deal of progress; things look generally promising, but there is still a way to go.
Pixel resolutions range from ~23 cm for aerial images to 30 m for Landsat 5, 7 and 8 images. In addition, complementary information such as ESRI shapefiles or the open GeoPackage format, which contain statistical or geographic information, can aid the imagery ML processes.
6.1 Summary of Imagery pilot studies
The main motivation for the pilot studies in this group was to reduce the cost and time required by manual inspection of all images or by the existing methods of collecting the data. Just like the pilot studies from the other themes, Imagery needs labelled data for the algorithms to learn and establish rules to recognise the required image features; labels can be assigned down to pixel level or to areas of an image. Even though this labelling process is very time consuming for Imagery, the long-term benefit can be significant.
All four pilot studies in this group used Convolutional Neural Networks (CNNs) as the ML algorithm. As the pilot study from Switzerland on land use and land coverage has shown, this can be augmented with other algorithms and data sources, e.g. a Random Forest, to increase the prediction capability of the pipeline. The skill level needed for imagery ML solutions is therefore higher than for other ML classification solutions.
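The core operation of a CNN can be illustrated without any framework: a small filter slid across an image produces strong responses where a feature (here, a vertical edge) is present. The kernel and 5x5 "image" below are toy values; the actual pilot studies relied on deep learning frameworks:

```python
# A 3x3 vertical-edge detector: negative weights left, positive weights right.
EDGE_FILTER = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

def convolve(image, kernel):
    """Valid 2-D convolution (no padding, stride 1), as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[di][dj] * image[i + di][j + dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# Dark left half, bright right half: the filter responds at the transition.
IMAGE = [[0, 0, 1, 1, 1]] * 5
```

A CNN stacks many such filters, learning their weights from labelled examples instead of hand-designing them, which is why labelled imagery down to pixel or area level is needed.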
shows the business processes required to develop ML solutions for Imagery.
This project concludes in November 2020, but it will carry on as ML 2021 under the guidance of the ONS’ Data Science Campus, continuing the aim of developing and integrating ML solutions in statistical processes at NSOs. It will also follow and support the development of the ML applications not yet operationalised.