UNECE – HLG-MOS Machine Learning Project

Classification and Coding Theme Report

English version (Word file)









This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community.

Background

The UNECE Machine Learning project was recommended by the HLG-MOS Blue Sky Thinking network in the autumn of 2018, approved in early 2019 and launched in March 2019. The objective of the project is to advance the research, development and application of Machine Learning (ML) techniques to add value to the production of official statistics [1].

Three themes were investigated:

  • Classification and Coding
  • Editing and Imputation
  • Imagery

This report summarises the pilot studies carried out on Classification and Coding and investigates whether, and how, they have been shown to make official statistics better, where better can mean any one or more of:

  • Cheaper
  • Faster data release
  • More consistent data
  • Alternative data sources

Advances that the pilot studies brought to the participating organisations will also be investigated.

This report also shows the broad range of approaches and ML algorithms that were tested and used, and the different results associated with them. This richness is a testament to the commitment and contribution of all the National Statistics Offices (NSOs) involved in this project.

[1] https://statswiki.unece.org/display/MLP/Project+context+and+objectives

Overview of Classification and Coding

“Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.” [1]
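The supervised setting described in the quote can be illustrated with a minimal sketch: a small multinomial Naive Bayes text classifier written in plain Python and trained on the "spam" / "non-spam" example. The algorithm choice, the function names (`train_naive_bayes`, `predict`) and the four training messages are assumptions made up for illustration; they are not taken from any pilot study.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    class_counts = Counter()            # number of training texts per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for a new observation."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy training set: observations whose category membership is known.
TRAINING_SET = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("please review the project report", "non-spam"),
]

model = train_naive_bayes(TRAINING_SET)
```

With this toy model, `predict(model, "claim your free prize")` returns `"spam"` and `predict(model, "project meeting on monday")` returns `"non-spam"`. A clustering procedure, by contrast, would receive the same texts without the labels.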

As far as this theme report is concerned, the data referred to in the above quote is typically a text or narrative provided by a respondent to describe, for example, their occupation, the industrial activity of a company or the injury of a worker. It can also be product descriptions scraped from the internet or sentiment text obtained from Twitter. Most of the pilot studies are Multi-Class [2] classification tasks. These aim to classify text descriptions into internationally accepted coding frames that offer many target classes, e.g. SIC, SOC, NAICS, NOC or ECOICOP; see Table 2 for a list of the coding frames used in the pilot studies. Even though the aims of these studies are mostly the same, their approaches, the software solutions used and their results all differ.

There was only one pilot study with binary classification: the Twitter sentiment classification into positive and negative sentiments. The target classes for this are [‘Positive’, ‘Negative’].

For example:

“The UK Standard Occupational Classification (SOC) system is used to classify workers by their occupations. Jobs are classified by their skill level and content. The UK introduced this classification system in 1990 (SOC90); it was revised in 2000 (SOC2000) and again in 2010 (SOC2010).”

Table 1 – Some entries from the SOC2010 coding frame [3]

Code    Description
8221    Crane drivers
8222    Fork-lift truck drivers
8223    Agricultural machinery drivers
8229    Mobile machine drivers and operatives n.e.c.


Table 1 shows some entries from the SOC coding frame. A SOC classification task has to assign a code from the table to the textual narrative given by the respondent. That narrative could be: “I drive a fork-lift truck” or “I use fork-lift trucks in my job to load up lorries”.

These natural language narratives have to be assigned to the correct code or class.
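As a rough illustration of this coding task, the sketch below assigns a narrative to the Table 1 code whose description shares the most words with it (Jaccard overlap). This is a simple rule-based baseline and not the ML approach of any pilot study; the names `CODING_FRAME`, `tokens` and `assign_code` and the crude plural stripping are assumptions made for this example.

```python
import re

# Entries copied from Table 1 (SOC2010 coding frame).
CODING_FRAME = {
    "8221": "Crane drivers",
    "8222": "Fork-lift truck drivers",
    "8223": "Agricultural machinery drivers",
    "8229": "Mobile machine drivers and operatives n.e.c.",
}


def tokens(text):
    """Lowercase, split on non-letters and strip a trailing 's' (crude stemming)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}


def assign_code(narrative, frame=CODING_FRAME):
    """Return the code with the highest Jaccard overlap, or None if nothing matches."""
    narrative_tokens = tokens(narrative)
    best_code, best_score = None, 0.0
    for code, description in frame.items():
        description_tokens = tokens(description)
        shared = narrative_tokens & description_tokens
        combined = narrative_tokens | description_tokens
        score = len(shared) / len(combined)
        if score > best_score:
            best_code, best_score = code, score
    return best_code
```

Both narratives quoted above are assigned to code 8222 by this sketch, while an unrelated narrative such as “I am a nurse” returns `None`. Word overlap alone breaks down quickly on real responses (synonyms, misspellings, descriptions that never mention the job title), which is one reason supervised ML models trained on previously coded narratives are of interest.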

Table 2 – Coding Frames

Coding Frame    Description
SCIAN           Spanish version of NAICS (Sistema de Clasificación Industrial de América del Norte)
NAICS           North American Industry Classification System (English version)
NOC             National Occupational Classification, Canada’s national system for describing occupations
SINCO           National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones)
NACE            European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne)
SIC             Standard Industrial Classification – Established by the USA in 1937, replaced by NAICS in 1997
SOC             Standard Occupational Classification
OIICS           Occupational Injury and Illness Classification System – Developed by the BLS
ECOICOP         European Classification of Individual Consumption by Purpose
CTS             Catalogue of Time Series by the IMF


[1] https://en.wikipedia.org/wiki/Statistical_classification

[2] https://machinelearningmastery.com/types-of-classification-in-machine-learning/

[3] https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010/soc2010volume2thestructureandcodingindex#electronic-version-of-the-index
