
...


Table of Contents 

1. National Statistical Organisations, their role and strengths

2. Project Objectives

3. Pilot studies and Theme Reports

4. Classification and Coding

5. Edit and Imputation

6. Imagery

7. What is needed to do ML?

8. What worked well and what worked even better

9. Organisational and Skill requirements

10. Conclusion and what next

1. National Statistical Organisations, their role and strengths

National Statistical Organisations (NSOs) play a role of ever-increasing importance in giving government policy makers a true picture of their respective countries. They produce trusted statistics, building on the trust placed in them by the wider public, the respondents and the users of the statistical output. To stay relevant, NSOs have to adapt to and embrace new technologies and data sources to shorten the time between data collection and statistical output. The Covid-19 pandemic and the NSOs' responses to it are testimony to this.

...

3. Pilot studies and Theme Reports 

The three themes agreed by the project – C&C, E&I and Imagery – are all based on classification tasks, with the exception of Imputation. Editing is a classification of records into two classes: the ‘Change’ class, where the data are inconsistent, missing or suspicious-looking, and the ‘No-Change’ class, where the data need no further attention and are deemed to be correct or consistent. The sentiment analysis of web-based data is included in the C&C theme, as it is a classification task that assigns the data to the chosen sentiment categories. Imagery, an example of big data and alternative data sources, classifies satellite or aerial images or their components into classes like ‘Urban’ and ‘Non-Urban’.

All pilot studies used supervised ML, where the algorithm ‘learns’ from training data that have been labelled, e.g. where the correct code has been assigned manually. This can be an occupation code assigned to a data record with an occupational text description, or the type of object visible on a satellite image. During the learning phase, these labelled training data allow the algorithm to recognise rules or patterns in the data without those rules having to be formulated explicitly. New, unlabelled data can then be fed into the algorithm for it to categorise.
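To make this workflow concrete, it can be sketched as below. This is a minimal illustration assuming the scikit-learn library is available; the occupation descriptions and codes are invented toy examples, not data from any pilot study.

```python
# Minimal supervised text classification sketch (assumes scikit-learn).
# The descriptions and occupation codes below are invented toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labelled training data: free-text descriptions with manually assigned codes.
texts = [
    "repairs car engines in a garage",
    "teaches mathematics at a secondary school",
    "installs and services vehicle brakes",
    "gives physics lessons to pupils",
]
codes = ["7231", "2330", "7231", "2330"]  # hypothetical occupation codes

# The pipeline turns text into TF-IDF features and fits a classifier,
# i.e. it 'learns' patterns from the labelled examples.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, codes)

# New, unlabelled descriptions can then be coded automatically.
predicted = model.predict(["repairs vehicle engines"])  # ['7231'] on this toy data
```

In a real pilot study the training set would contain many thousands of manually coded records, and the choice of features and algorithm would be tuned to the coding frame at hand.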

The participants of this project have submitted reports on their respective pilot studies. These were then summarised into a Theme Report for each of the three themes. References to these reports point to the statswiki.unece.org web site, and all reports will be accessible to the public. Further information can be obtained by contacting unece.org or the authors of the reports. This summary report uses information provided in the pilot study reports as well as the three theme reports.

Footnote

https://statswiki.unece.org/display/ML/WP1+-+Pilot+Studies

4. Classification and Coding

“Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.”

Footnote

https://en.wikipedia.org/wiki/Statistical_classification

The given set of data referred to in the above quote is typically a text narrative provided by the respondent: a description of their occupation, the industrial activity of a company, injury information of an injured worker, product descriptions scraped from the internet, or sentiment text obtained from Twitter.

There was only one pilot study with binary classification: the Twitter sentiment classification into positive and negative sentiments, with the target classes [‘Positive’, ‘Negative’].

4.1 Summary of C&C pilot studies

Most of these pilot studies fall into the group of multi-class classification tasks. They aim to classify text descriptions into internationally accepted coding frames, e.g. SIC, SOC, NAICS, NOC or ECOICOP, which offer many target classes, and to assign an appropriate code from the coding frame. See Table 1 for a list of the coding frames used in the C&C pilot studies. Even though the aims of these studies are mostly the same, their approaches and software solutions differ, as do their results. The sentiment analysis of Twitter data is a binary classification task in which each Twitter message is classified as either Positive or Negative.

Table 1 – Coding Frames

Coding Frame | Description
SCIAN        | Spanish version: North American Industrial Classification (Sistema de Clasificación Industrial de América)
NAICS        | English version: North American Industry Classification
NOC          | National Occupational Classification – Canada’s national system for describing occupations
SINCO        | National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones)
NACE         | European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne)
SIC          | Standard Industrial Classification – Established by the USA in 1937, replaced by NAICS in 1997
SOC          | Standard Occupational Classification
OIICS        | Occupational Injury and Illness Classification System – Developed by the BLS
ECOICOP      | European Classification of Individual Consumption by Purpose
CTS          | Catalogue of Time Series by the IMF

Nine pilot studies were submitted; of these, three are in production, with one of them used as a supporting tool that helps human coders make faster decisions.

The other six studies have advanced considerably during the 18 months since the project started.

Future plans for all of these include investigating the use of other ML algorithms, increasing prediction accuracy, or improving the IT infrastructure before embedding this technology into production pipelines.

The three operationalised projects have shown that building a “Golden Data Set”, or ground truth, is essential. For this, all labels are manually assigned and deemed to be correct. Achieving this is very labour-intensive, and Subject Matter Experts (SMEs) have to check and re-check the assigned codes. This serves two purposes:

  1. The ground truth is used to train the ML algorithm. The better this data set is, the better the ML algorithm can establish rules and find patterns in the data.
  2. As importantly, this allows comparing ML prediction results with traditional work, by establishing how accurate the traditional manual coding process is.
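The second purpose can be illustrated with a small sketch: once a golden data set exists, the accuracy of both the manual codes and the ML predictions can be measured against the verified labels. The code lists below are invented toy data, not results from any pilot study.

```python
# Sketch: measuring manual and ML coding against a golden data set.
# All three code lists are invented toy data for illustration.
golden  = ["A", "B", "A", "C", "B", "A"]  # verified 'ground truth' codes
manual  = ["A", "B", "C", "C", "B", "B"]  # codes assigned by human coders
ml_pred = ["A", "B", "A", "C", "A", "A"]  # codes predicted by the ML model

def accuracy(predicted, truth):
    """Share of codes that agree with the golden data set."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

print(f"manual coding accuracy: {accuracy(manual, golden):.2f}")   # 0.67
print(f"ML coding accuracy:     {accuracy(ml_pred, golden):.2f}")  # 0.83
```

Measured this way, the same yardstick applies to both processes, which is what makes a like-for-like comparison of ML and human coding possible.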

Committing this resource is a challenge for most NSOs and can be a serious blocker in the development of ML solutions. However, such resources should already be committed to assessing the current processes.

Even though the success criterion for most of these studies was that ML has to perform at least as well as human coders, the main motivation for conducting them was gains in efficiency and timeliness; accuracy was stated as the objective in only two studies. To achieve this, ML is used to auto-code only predictions made above a set threshold and/or only for coding classes that clear a high prediction threshold. Predictions for minority classes, the ones that are rarely seen in the data, are then excluded from the ML solution and are still made by human coders. The BLS and Statistics Canada solutions use a mix of ML auto-coding and human coding to ensure that the overall accuracy is higher than human or machine coding alone. Instead of just showing the overall performance, this approach assesses and demonstrates if, how and where ML was at least as good as or even better than human coders. Setting prediction thresholds to identify the groups of records where ML works better, where it works well enough to assist humans, and where it does not perform well enough (e.g. minority classes) are important areas to analyse. The ultimate objective is that NSOs end up with a better overall process. Please see the individual pilot study reports for their results.

Footnote

WP1 - Pilot Studies
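The threshold-based routing described above can be sketched as follows; the probabilities, record contents and the 0.9 cut-off are illustrative assumptions, not values taken from any pilot study.

```python
# Sketch of routing predictions by confidence: auto-code a record only when
# the model's top predicted probability clears a threshold, otherwise send
# it to a human coder. Records and probabilities are invented toy data.
AUTO_CODE_THRESHOLD = 0.9  # illustrative cut-off; in practice tuned per class

records = [
    {"id": 1, "code": "7231", "prob": 0.97},
    {"id": 2, "code": "2330", "prob": 0.55},
    {"id": 3, "code": "7231", "prob": 0.92},
]

auto_coded = [r for r in records if r["prob"] >= AUTO_CODE_THRESHOLD]
to_human   = [r for r in records if r["prob"] < AUTO_CODE_THRESHOLD]

print(len(auto_coded), "auto-coded;", len(to_human), "routed to human coders")
```

In a production setting the threshold would be chosen per class from held-out evaluation data, so that minority classes with unreliable predictions are routed to human coders by default.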

The BLS solution, the most advanced in this group, has the highest proportion of auto-coding, > 85%. This has been made possible by using Neural Networks for the ML algorithm, run on 4 Graphics Processing Units (GPUs) with 3584 cores.

The Norwegian solution was the only one using Cloud Computing, but it is anticipated that this technology will be utilised more often, as it offers very powerful IT solutions without capital expenditure; national privacy and data protection laws have to be considered first, however. All other studies relied on standard desktop or laptop hardware.

...

10. Conclusion and what is next 

Digitisation has driven the growth of big data, and machine learning makes big data more usable. It's easy to see why NSOs are exploring different ways of augmenting their existing decision-making and business operations with alternative data sources and machine learning. For example, the Office for National Statistics (ONS) has invested in infrastructure capable of supporting frictionless access to and harnessing of alternative data sources (specifically, administrative data and satellite data). It has also sought to grow its in-house data science capacity through the establishment of the Data Science Campus.

The value of traditional approaches to data gathering (such as surveys) and analysis, however, is still high, especially in circumstances where big data isn’t (yet) readily available. Applied machine learning can add value to these traditional approaches by making them more operationally efficient.

The UNECE ML project has been successful in demonstrating that participating NSOs have made advances in the use of ML. It has shown the added value ML can bring to the production of accurate, timely and cost-efficient official statistics.

Summary of outputs

  • 21 reports, mostly demonstrations of added value of ML (pilot studies)
  • 3 summary analysis reports on the use of ML in Classification & Coding, Edit & Imputation and use of Imagery
  • An initial quality framework for statistical algorithms (WP2)
  • Integration challenges and practices (WP3)
  • Shared code from numerous studies and shared product description data to practice some ML
  • Links to learning and training material; references

In summary, our finding is that ML should be used for Classification & Coding: we have numerous positive demonstrations on a variety of data sources and in a variety of contexts, and currently a few applications in or near production.

ML shows great promise for Edit & Imputation. The demonstrations were positive to varying degrees, pilots are in advanced stages, and plans are in place to move these to production.

ML is essential in the use of imagery, especially in the context of increasing access to large amounts of imagery data. The developments in this theme group are more advanced, and the Generic Pipeline for Production of Official Statistics Using Satellite Data and Machine Learning shows the business processes required to develop ML solutions for Imagery.

Footnote

https://statswiki.unece.org/download/attachments/285216428/ML_WP1_Imagery_UNECE.pdf?version=1&modificationDate=1605171593842&api=v2

This project concludes in November 2020, but it will carry on as ML 2021 under the guidance of the ONS’ Data Science Campus, continuing with the aim of developing and integrating ML solutions in statistical processes at NSOs. It will also follow and support the development of the ML applications not yet operationalised.


...
