UNECE – HLG-MOS Machine Learning Project

Classification and Coding Theme Report










This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community.

Background

The UNECE Machine Learning project was recommended by the HLG-MOS Blue Sky Thinking network in the autumn of 2018, approved in early 2019 and launched in March 2019. The objective of the project is to advance the research, development and application of Machine Learning (ML) techniques in order to add value to the production of official statistics [1].

Three themes were investigated:

  • Classification and Coding
  • Editing and Imputation
  • Imagery

This report summarises the pilot studies carried out on Classification and Coding and investigates whether, and how, they have made official statistics better, where "better" can mean any one or more of:

  • Cheaper
  • Faster data release
  • More consistent data
  • Alternative data sources

Advances made within the respective organisations will also be investigated.

This report also shows the broad range of approaches and ML algorithms tested and used, and the different results associated with them. This richness is a testament to the commitment and contributions of all the NSOs (National Statistical Offices) involved in this project.

[1] https://statswiki.unece.org/display/MLP/Project+context+and+objectives

Overview of Classification and Coding

“Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.”[1]

As far as this theme report is concerned, the given set of data referred to in the above quote is typically a text or narrative provided by the respondent to describe, for example, their occupation, the industrial activity of a company, the injury information of an injured worker, product descriptions scraped from the internet, or sentiment text obtained from Twitter. Most of the pilot studies fall into the group of multi-class[2] classification tasks. These aim to classify text descriptions into internationally accepted coding frames, e.g. SIC, SOC, NAICS, NOC or ECOICOP, each of which offers many target classes (see Table 2 for a list of the coding frames used in the pilot studies). Even though the aims of these studies are largely the same, their approaches and software solutions differ, as do their results.

There was only one pilot study with binary classification: the Twitter sentiment classification into positive and negative sentiments, with the target classes [‘Positive’, ‘Negative’].

For example:

“The UK Standard Occupational Classification (SOC) system is used to classify workers by their occupations. Jobs are classified by their skill level and content. The UK introduced this classification system in 1990 (SOC90); it was revised in 2000 (SOC2000) and again in 2010 (SOC2010).”

Table 1 – Some entries from the SOC-2010 coding frame[3]

Code   Description
8221   Crane drivers
8222   Fork-lift truck drivers
8223   Agricultural machinery drivers
8229   Mobile machine drivers and operatives n.e.c.


Table 1 shows some entries from the SOC coding frame. A SOC code classification task has to assign a code from the table to the textual narrative given by the respondent. That narrative could be: “I drive a fork-lift truck” or “I use fork-lift trucks in my job to load up lorries”.

These natural language narratives have to be assigned to the correct code or class.
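To make the task concrete, here is a minimal, purely illustrative sketch that assigns a SOC-2010 code from Table 1 to a narrative by scoring token overlap with the coding-frame descriptions. This is a toy keyword-matching baseline for illustration only, not the method used by any of the pilot studies, which trained ML models on labelled data.

```python
# Toy illustration: assign a SOC-2010 code to a free-text narrative by
# counting how many tokens the narrative shares with each description.
# The codes and descriptions are taken from Table 1.

soc_frame = {
    "8221": "crane drivers",
    "8222": "fork-lift truck drivers",
    "8223": "agricultural machinery drivers",
    "8229": "mobile machine drivers and operatives n.e.c.",
}

def classify(narrative: str) -> str:
    """Return the code whose description shares the most tokens with the narrative."""
    tokens = set(narrative.lower().replace("-", " ").split())
    def score(desc: str) -> int:
        return len(tokens & set(desc.replace("-", " ").split()))
    return max(soc_frame, key=lambda code: score(soc_frame[code]))

print(classify("I use fork-lift trucks in my job to load up lorries"))  # 8222
```

A real solution must cope with ties, misspellings and paraphrases ("I load lorries with a lifting vehicle"), which is exactly where the trained ML approaches described in this report come in.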

Table 2 – Coding Frames

Coding Frame    Description
SCIAN / NAICS   Spanish version: North American Industrial Classification (Sistema de Clasificación Industrial de América); English version: North American Industry Classification
NOC             National Occupational Classification – Canada’s national system for describing occupations
SINCO           National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones)
NACE            European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne)
SIC             Standard Industrial Classification – established by the USA in 1937, replaced by NAICS in 1997
SOC             Standard Occupational Classification
OIICS           Occupational Injury and Illness Classification System – developed by the BLS
ECOICOP         European Classification of Individual Consumption by Purpose
CTS             Catalogue of Time Series by the IMF


[1] https://en.wikipedia.org/wiki/Statistical_classification

[2] https://machinelearningmastery.com/types-of-classification-in-machine-learning/

[3] https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010/soc2010volume2thestructureandcodingindex#electronic-version-of-the-index

Data pre-processing

Data pre-processing is needed to some degree by all Machine Learning algorithms. For Natural Language Processing (NLP), a number of techniques can be used to remove elements from the text that at best add no information and at worst introduce noise into the data, reducing the ability of the ML algorithm to recognise words with the same meaning.

Most pilot studies have done substantial work in this area and have shown that these techniques can influence the predictive ability of the ML solution.

Removal of stop words: Stop words are words like “the”, “a”, “an” and “in”. These words are unimportant in Natural Language Processing (NLP), as they do not add meaning to the text to be analysed. By removing them, the algorithm can focus on the important words.
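A minimal sketch of stop-word removal follows; the tiny stop-word set here is illustrative only, whereas real pipelines use fuller lists (e.g. those shipped with NLTK or spaCy).

```python
# Illustrative stop-word list -- real lists contain over a hundred entries.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "is"}

def remove_stop_words(text: str) -> list:
    """Lower-case the text, split into words and drop the stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The fox jumps over the fence"))
# ['fox', 'jumps', 'over', 'fence']
```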

Stemming and lemmatisation are two approaches to handling inflections, the syntactic differences between word forms. Both can be achieved with commercial or open-source tools and libraries.

Stemming: This is a process where words are reduced to their stem or root by chopping off the end of the word, e.g. the word “flying” has the suffix “ing”; the stem is “fly”.

This is typically done with an algorithm. The aim is to reduce the inflectional forms of each word to a common base. This increases the number of occurrences of each base form and gives the algorithm more examples to learn from. The Porter stemmer is very popular for the English language; it chops both “apple” and “apples” down to “appl”. This shows that stemming might produce something that is not a real word, but if this is done to all the narratives to be classified and all target documents, the algorithm can still find matches.
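The idea can be sketched with a crude suffix-stripping stemmer. This toy is not the Porter algorithm (which applies several ordered rule phases), but it reproduces the examples above and shows that a stem need not be a dictionary word.

```python
# Toy suffix-stripping stemmer -- a crude stand-in for algorithms such as
# Porter's. Suffixes are tried longest-first; short words are left alone.
SUFFIXES = ("ing", "es", "ed", "e", "s")

def stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(stem("flying"))  # 'fly'  -- matches the example above
print(stem("apples"))  # 'appl' -- same stem as...
print(stem("apple"))   # 'appl' -- ...so the two forms now match
```

Because “apples” and “apple” map to the same (non-word) stem, a match between a narrative and a coding-frame entry is found regardless of which form appears.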

Lemmatisation: This also tries to remove inflections, but it does not simply chop them off. It uses lookup tables, e.g. WordNet, that contain all inflected forms of a word to find its base or dictionary form, known as the lemma; e.g. “geese” is lemmatised by WordNet to “goose”.

If a word is not included in the table, it is passed through unchanged as its own lemma.
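The lookup-table behaviour described above can be sketched in a few lines; the dictionary here is a tiny illustrative stand-in for a full resource such as WordNet.

```python
# Tiny stand-in for a lemma lookup table such as WordNet.
LEMMAS = {"geese": "goose", "mice": "mouse", "children": "child"}

def lemmatise(word: str) -> str:
    """Return the dictionary form (lemma); unknown words pass through unchanged."""
    return LEMMAS.get(word.lower(), word)

print(lemmatise("geese"))     # 'goose'
print(lemmatise("forklift"))  # 'forklift' -- not in the table, passed through
```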

Normalisation to lower case: Every word is converted to lower case.

n-grams: The text to be classified is chopped up into sequences of words or characters. Chopping it into syllables is much harder, as syllable boundaries depend on pronunciation and are harder to compute.

For the narrative “the fox jumps over the fence”

  • word 1-gram: the, fox, jumps, over, the, fence
  • word 2-gram: the fox, fox jumps, jumps over, over the, the fence
  • character 3-gram: ‘the’, ‘he_’, ‘e_f’, ‘_fo’, ‘fox’, ‘ox_’, ‘x_j’, ‘_ju’, …

Here the underscore ‘_’ stands for the space character.
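The n-grams listed above can be generated with a simple sliding window, sketched here for both the word and character variants:

```python
def word_ngrams(text: str, n: int) -> list:
    """Slide a window of n words over the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text: str, n: int) -> list:
    """Slide a window of n characters, with '_' standing in for the space."""
    padded = text.replace(" ", "_")
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

narrative = "the fox jumps over the fence"
print(word_ngrams(narrative, 1))      # ['the', 'fox', 'jumps', 'over', 'the', 'fence']
print(word_ngrams(narrative, 2))      # ['the fox', 'fox jumps', 'jumps over', 'over the', 'the fence']
print(char_ngrams(narrative, 3)[:8])  # ['the', 'he_', 'e_f', '_fo', 'fox', 'ox_', 'x_j', '_ju']
```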

The Mexican pilot study on occupation and economic activity classification describes in great detail their work on n-grams and what impact changes in the n-grams have on their prediction results.

However, the U.S. Bureau of Labor Statistics (BLS) stopped using stop-word removal and stemming after early experiments showed that these techniques were unhelpful for the kind of text narratives they need to classify. Instead they used CountVectorizer as a tool to create a “bag of features” representing the input: it shows which words (or sequences of words, or sequences of characters) occur in the input, but not the order in which those words or sequences appear. When they moved over to Neural Networks, they stopped this too. Preserving the original ordering of the sequence of letters allows the Neural Network to gain more insight into the text, while simpler algorithms are not capable of learning the intermediate representations, i.e. that letters form words, words form phrases and sentences, sentences form paragraphs, and so on.
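A minimal pure-Python sketch of the “bag of features” idea follows: word counts are kept, word order is discarded. Tools such as scikit-learn's CountVectorizer build the same kind of representation over a whole corpus, with a shared vocabulary and a count matrix.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count word occurrences -- the order of words is deliberately lost."""
    return Counter(text.lower().split())

print(bag_of_words("the fox jumps over the fence"))
# Counter({'the': 2, 'fox': 1, 'jumps': 1, 'over': 1, 'fence': 1})

# Two different orderings yield the same bag, illustrating the loss of order:
assert bag_of_words("fox the jumps") == bag_of_words("jumps the fox")
```

This loss of ordering is exactly the limitation mentioned above: a bag-of-features model cannot distinguish “drives a crane” from “a crane drives”, whereas a Neural Network fed the raw sequence can.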

Please see the column Data Pre-processing in Table 3 for steps carried out by all the pilot studies.
