
...

2. Overview of Classification and Coding

“Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.”

Footnote: https://en.wikipedia.org/wiki/Statistical_classification
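The supervised/unsupervised distinction in the quote above can be illustrated with a toy sketch in Python using scikit-learn, the library used by several of the pilot studies; the texts and labels here are invented for illustration only:

```python
# Toy illustration of supervised classification vs. unsupervised clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

texts = ["cheap pills buy now", "meeting at noon",
         "win money fast", "lunch tomorrow maybe"]
labels = ["spam", "non-spam", "spam", "non-spam"]  # known categories -> supervised

X = TfidfVectorizer().fit_transform(texts)

# Classification: learns from the known labels.
clf = LogisticRegression().fit(X, labels)

# Clustering: groups by similarity only, no labels involved.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(clf.predict(X)[0])  # a predicted class label
print(clusters)           # cluster ids with no predefined meaning
```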

As far as this theme report is concerned, the set of data referred to in the above quote is typically a text or narrative provided by the respondent to describe, for example, their occupation, the industrial activity of a company, the circumstances of a worker's injury, product descriptions scraped from the internet, or sentiment text obtained from Twitter. Most of the pilot studies fall into the group of multi-class classification tasks.

Footnote: https://machinelearningmastery.com/types-of-classification-in-machine-learning/

These aim to classify text descriptions into internationally accepted coding frames, e.g. SIC, SOC, NAICS, NOC or ECOICOP, each of which offers many target classes; see Table 1 for the coding frames used in the pilot studies. Although the aims of these studies are largely the same, their approaches and software solutions differ, as do their results.

Table 1 – Coding Frames

Coding Frame – Description

SCIAN / NAICS – North American Industry Classification System; SCIAN is the Spanish version (Sistema de Clasificación Industrial de América del Norte), NAICS the English version
NOC – National Occupational Classification, Canada’s national system for describing occupations
SINCO – National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones)
NACE – European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne)
SIC – Standard Industrial Classification, established by the USA in 1937 and replaced by NAICS in 1997
SOC – Standard Occupational Classification
OIICS – Occupational Injury and Illness Classification System, developed by the BLS
ECOICOP – European Classification of Individual Consumption by Purpose
CTS – Catalogue of Time Series, maintained by the IMF

There was only one pilot study with binary classification: the classification of Twitter data into positive and negative sentiment. The target classes for this are [‘Positive’, ‘Negative’].
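That study obtained its labels automatically from emoticons contained in the tweets, avoiding manual production of training data (see the Statistiek Vlaanderen entry in Table 3). A minimal sketch of the idea, with illustrative emoticon lists rather than the ones actually used:

```python
# Emoticon-based labelling: tweets with an unambiguous emoticon become
# automatically labelled training examples. Emoticon sets are illustrative.
POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-("}

def label_tweet(text):
    """Return 'Positive'/'Negative' from emoticons, or None if unusable."""
    tokens = text.split()
    pos = any(t in POSITIVE for t in tokens)
    neg = any(t in NEGATIVE for t in tokens)
    if pos and not neg:
        return "Positive"
    if neg and not pos:
        return "Negative"
    return None  # no emoticon, or conflicting signals -> not usable for training

print(label_tweet("great day at the coast :)"))  # Positive
print(label_tweet("train delayed again :("))     # Negative
```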

For example:

“The UK Standard Occupational Classification (SOC) system is used to classify workers by their occupations. Jobs are classified by their skill level and content. The UK introduced this classification system in 1990 (SOC90); it was revised in 2000 (SOC2000) and again in 2010 (SOC2010).”

Footnote: https://www.hesa.ac.uk/support/documentation/industrial-occupational#:~:text=The%20UK%20Standard%20Occupational%20Classification,again%20in%202010%20(SOC2010)

Table 2 – Some entries from the SOC 2010 coding frame

Footnote: https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010/soc2010volume2thestructureandcodingindex#electronic-version-of-the-index

Code – Description

8221 – Crane drivers
8222 – Fork-lift truck drivers
8223 – Agricultural machinery drivers
8229 – Mobile machine drivers and operatives n.e.c.

Table 2 shows some entries from the SOC coding frame. A SOC coding task has to assign a code from the table to the textual narrative given by the respondent. That narrative could be: “I drive a fork-lift truck” or “I use fork-lift trucks in my job to load up lorries”.

The natural language narrative, i.e. the survey variable or feature, has to be assigned to the correct code or class.
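As an illustration only, the sketch below trains a toy text classifier on invented narratives labelled with the SOC 2010 codes from Table 2, using scikit-learn; this is not the code of any pilot study:

```python
# Toy coding task: map free-text narratives to SOC 2010 codes (Table 2).
# Training narratives are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

narratives = [
    "I operate a tower crane on building sites",
    "I drive a fork-lift truck in a warehouse",
    "I use fork-lift trucks to load up lorries",
    "I drive a combine harvester on a farm",
]
codes = ["8221", "8222", "8222", "8223"]  # SOC 2010 codes from Table 2

# TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(narratives, codes)

print(model.predict(["I load lorries with a fork-lift truck"])[0])
```

With realistic coding frames there are hundreds of target classes and far more training narratives, but the pipeline shape is the same.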

The term ‘feature’ is used here synonymously with the term ‘independent variable’; both are common in Machine Learning terminology. Features in the data set are, for example, Age, Job Description, NetPay or Description of Injury. Feature selection is then the process of selecting the best subset of relevant input features to be used by the Machine Learning algorithm to build a model. These features are also called predictor features, as the model uses them to build its internal rules and structure to predict one or possibly several target features. The predictive ability of the ML algorithm can sometimes be improved with feature engineering, i.e. by transforming raw data into new features that better represent the underlying problem to the ML algorithm.
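A minimal sketch of feature engineering, applied to a hypothetical injury record; all field and feature names below are invented for illustration:

```python
# Feature engineering: transform raw fields into new predictor features.
def engineer_features(record):
    desc = record["injury_description"].lower()
    return {
        "desc_word_count": len(desc.split()),                  # narrative length
        "mentions_fall": int("fall" in desc or "fell" in desc),  # keyword flag
        "mentions_machine": int("machine" in desc),              # keyword flag
        "age_band": record["age"] // 10 * 10,                  # bucket raw age
    }

record = {"age": 47,
          "injury_description": "Worker fell from a ladder while cleaning a machine"}
features = engineer_features(record)
print(features)
# {'desc_word_count': 9, 'mentions_fall': 1, 'mentions_machine': 1, 'age_band': 40}
```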

...

6. Classification and Coding Pilot Studies

At the time of writing this report, reports on 8 ML projects had been submitted; see Table 3. Of these, 3 either went into production during this UNECE project or had already been operationalised when the project started: those from StatsCanada, Statistics Norway and the US Bureau of Labor Statistics (BLS).

Table 3 – List of Pilot Studies

Mexico – INEGI
Title: Occupation and Economic Activity Coding Using Natural Language Processing
Legacy System: Deterministic coding system assisted by manual coding, with an accuracy level over 95%
Data: Household Income and Expenditure survey: 74,600 households, 158,568 persons
Target Codes: SCIAN, SINCO

Canada – StatsCanada
Title: Industry and Occupation Coding
Legacy System: G-Code word matching with an accuracy level > 95%, higher than human coders
Data: CCHS (Canadian Community Household Survey, 89K records), LFS (Labour Force Survey, 440K records), CHMS (Canadian Health Measures Survey); NOC index entries: 114,161; NAICS index entries: 38,256
Target Codes: NAICS, NOC

Belgium/Flanders – Statistiek Vlaanderen
Title: Sentiment Analysis of Twitter Data
Legacy System: Life statistics via surveys
Data: Twitter data: tweets containing either a positive or a negative emoticon as a label, to avoid manual production of training data
Target Codes: Positive / Negative (binary sentiment)

Serbia – SORS
Title: Coding Economic Activity
Legacy System: Manual coding
Data: LFS, 20,000 cases
Target Codes: NACE, 2- and 3-digit classification

Norway – Statistics Norway
Title: Standard Industrial Code Classification Using Machine Learning
Legacy System: SIC classification of new companies for the Central Coordination Register, done manually from the descriptions provided
Data: Historical data, 1.5 million records: 1. descriptions of economic activities; 2. 'official' descriptions of codes and keywords
Target Codes: SIC, 821 labels

US – BLS
Title: Coding Workplace Injury and Illness
Legacy System: Manual coding
Data: Survey of Occupational Injuries and Illnesses; initially 261,000 records, later more than 2 million
Target Codes: SOC (occupation codes); OIICS (injury, part of body injured, event, source of injury, secondary source)

Poland – Statistics Poland
Title: Product Description to ECOICOP
Legacy System: N/A
Data: Web-scraped product names, 17,000 cases, manually coded to ECOICOP
Target Codes: ECOICOP, 1st group of products (Food and non-alcoholic beverages), 61 codes, all 5-digit

IMF
Title: Automated Coding of the IMF’s Catalogue of Time Series (CTS)
Legacy System: Manual coding
Data: Time series data sets from member countries
Target Codes: CTS (Catalogue of Time Series), 28,886 codes

Table 4 – Steps/Features/Algorithm/Technology

Mexico – INEGI
Data Pre-processing: article suppression, stemming, lemmatisation, upper-case conversion, synonym handling, TF-IDF
Features: 5 text variables, 10 categorical variables
Algorithms: ensemble of algorithms: SVM, Logistic Regression, Random Forest, neural networks, XGBoost, k-NN, Naïve Bayes, Decision Trees
Software/Hardware: Python, scikit-learn, Keras; 20 cores, 256 GB RAM, 4 TB drives

Canada – StatsCanada
Data Pre-processing: removal of stop words, lower-casing, merging of variables, Caesar cipher, addition of the LFS 440K records to the CCHS training data set (89K records)
Features: company, industry, job title, job description
Algorithms: mandated to use FastText or XGBoost, as both are already in G-Code
Software/Hardware: G-Code; 3 GHz Intel i5-3570, 16 GB RAM

Belgium/Flanders – Statistiek Vlaanderen
Data Pre-processing: lower-casing, stemming, removal of stop words, lemmatisation, removal of special characters, n-gramming
Features (engineered from the tweet narrative): count vectorisation, TF-IDF vectorisation, autoencoder neural network embedding, pretrained neural network
Algorithms: penalised logistic regression, random forest, gradient-boosted trees, multi-layer perceptrons
Software/Hardware: Python; 4 GHz Intel i7-6700K, 16 GB RAM

Serbia – SORS
Features: NACE activity code, activity name, description
Algorithms: Random Forest, SVM, Logistic Regression
Software/Hardware: Python, scikit-learn, pandas, Pyzo IDE

Norway – Statistics Norway
Data Pre-processing: removal of obviously unreliable activities/codes; removal of stop words, digits and punctuation; upper-case conversion
Features: 1. descriptions of economic activities; 2. 'official' descriptions of codes and keywords; 3. company names
Algorithms: FastText, Logistic Regression, Random Forest, Naïve Bayes, SVM, CNN (Convolutional Neural Network)
Software/Hardware: Python, Google Cloud

US – BLS
Data Pre-processing: very little data cleaning or normalisation
Features: injury description and circumstances, industry code, employer's name, worker's occupation
Algorithms: Logistic Regression, SVM, Naïve Bayes, Random Forest, multi-layer perceptrons, convolutional and recurrent neural networks
Software/Hardware: Python, scikit-learn; initially 2–4 cores with 8–16 GB RAM, finally 4 Titan X Pascal GPUs, each with 12 GB and 3,584 cores

Poland – Statistics Poland
Data Pre-processing: vectorisation
Features: product description
Algorithms: Naïve Bayes, Logistic Regression, Random Forest, SVM, neural networks (Ludwig library)
Software/Hardware: Python, scikit-learn; office PCs

IMF
Data Pre-processing: the differing country file formats make data processing error-prone and time-consuming, requiring manual intervention
Features (feature extraction): TF-IDF, Word2Vec
Algorithms: Logistic Regression with Word2Vec, Logistic Regression with TF-IDF, Nearest Neighbour with Word2Vec, Nearest Neighbour with TF-IDF
Software/Hardware: 2.4 GHz Intel Core i5-6300U

Table 4 shows that the number of features in the available data sets is typically rather small, with the exception of the Mexican pilot study, which used 15 features. It also shows that the processing power used for ML training and prediction is typically that of a desktop or laptop computer. The exceptions are the two already operationalised solutions from Statistics Norway and BLS, both of which use neural networks: Statistics Norway uses Google Cloud and BLS uses 4 GPUs with 3,584 cores each.
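A simplified sketch of the text pre-processing steps that recur in Table 4 (lower-casing, removal of punctuation, digits and stop words); the stop-word list here is a tiny illustrative one, not a production list:

```python
import re

# Illustrative stop-word list; real pipelines use language-specific lists.
STOP_WORDS = {"i", "a", "the", "in", "to", "of"}

def preprocess(text):
    text = text.lower()                   # lower-casing
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
    text = re.sub(r"\d+", " ", text)      # strip digits
    return [t for t in text.split() if t not in STOP_WORDS]

print(preprocess("I drive a Fork-Lift truck, No. 7!"))
# ['drive', 'fork', 'lift', 'truck', 'no']
```

The resulting tokens would then be vectorised (e.g. with TF-IDF) before being fed to a classifier.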

Results of the individual pilot studies are shown in Table 5.

Table 5 – Results/Status/Future

Mexico – INEGI
Results: Economic activity: accuracy 87.7%, precision 66%, recall 64.5%. Occupation: accuracy 83.1%, precision 57.8%, recall 57.3%
Current Status: Exploratory
Future Plans: analysis of results; new methodologies

Canada – StatsCanada
Results: accuracy > 95% when combined with clerical classification; up to 100% recall and precision on the QC sample
Current Status: In production for CCHS (coded cases: 12.6% in 2019 Q3, 13.3% in 2019 Q4) and for CHMS
Future Plans: TensorFlow, PyTorch, SVM

Belgium/Flanders – Statistiek Vlaanderen
Results: precision 80%, recall 81%
Current Status: PoC
Future Plans: use of other pretrained sentiment classifiers; gather more data; production IT infrastructure

Serbia – SORS
Results: Two-digit level: Random Forest ≈ 69% accuracy, SVM ≈ 75%, Logistic Regression ≈ 69%. Three-digit level: Random Forest ≈ 55%, SVM ≈ 63%, Logistic Regression ≈ 63%
Current Status: accuracy not yet sufficient for production; investigation continues
Future Plans: to achieve accuracy of over 90%

Norway – Statistics Norway
Results: FastText, SVM and CNN all give about the same results; FastText is faster to train; 22% of units predicted
Current Status: in production as a supporting tool: the 5 best codes are offered with their probabilities, allowing a faster choice
Future Plans: a new application is being delivered, with expected savings between 7 million and 17 million NOK

US – BLS
Results: compared with the 'Gold Standard' data, ML is more accurate than manual coding; neural network coding beats human coders on each of the 6 codes to be assigned, with accuracy between 69.8% and 91.9%
Current Status: in production; ML auto-codes only above a set threshold chosen to maximise the overall macro F1 score of the combined human/ML coding; > 85% of codes are assigned by a neural network
Future Plans: continued use of this ML application; expansion of ML to other coding tasks

Poland – Statistics Poland
Results:
Indicator   NB      SVM    LR      RF
Accuracy    90.5%   92%    91.6%   92.2%
Precision   90%     92%    92%     93%
Recall      90%     92%    92%     92%
F1-score    90%     92%    92%     92%
MCC         0.90    0.92   0.91    0.92
Current Status: PoC
Future Plans: an application to help classify products was developed and shared

IMF
Results: ML codes 80% of the time series data sets correctly
Current Status: PoC
Future Plans: move the ML solution into production; combine models to improve predictions

7. Value added by Classification and Coding using ML in the production of official statistics 

Machine Learning is expected to automate many of the traditional, manual processes inherent in the production of official statistics, and manual Classification and Coding is one of them. The manual process is lengthy, resource-hungry and prone to errors: even Subject Matter Experts (SMEs) can give different outcomes for the same C&C task, as shown by the BLS report.

The pilot studies from StatsCanada and BLS have shown that data can be auto-coded with ML to varying degrees. In these solutions, only predictions with a high level of confidence are allowed to auto-code. The more mature the ML solution, the more trust appears to be placed in it, letting ML take on more and more of the predictions.
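The confidence-threshold routing described above can be sketched as follows; the threshold value, codes and scores are illustrative, not those used by any pilot:

```python
# Route each prediction: auto-code above the threshold, else send to a human.
THRESHOLD = 0.85  # in practice tuned, e.g. to maximise the human/ML macro F1

def route(prediction, score):
    """Return ('auto', code) for confident predictions, ('human', None) otherwise."""
    if score >= THRESHOLD:
        return ("auto", prediction)
    return ("human", None)  # falls back to manual coding

print(route("8222", 0.93))  # ('auto', '8222')
print(route("8229", 0.40))  # ('human', None)
```

Raising the threshold increases the share of cases sent to human coders but reduces the risk of incorrect auto-codes, which is the trade-off the BLS solution tunes.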

As manual coding resources are freed up, possibly on an increasing scale over time, the cost of building, monitoring and keeping the ML solution current has to be accounted for, as does a possibly expensive IT infrastructure. Data consistency will increase as manual coding is reduced.

The ML solution in Norway acts as an advisor to the human coding process: the human coder is given the 5 best ML predictions with their scores and the option to accept one of them or reject them all. A similar approach is taken in BLS’ ML solution, where human coders can reject suggested codes. However, BLS has shown that a suggested ML prediction is too easily accepted by the human coder, even when it is incorrect. While building their Gold Standard data set, they therefore used only coding from human experts, and these experts were not shown how others, human or computer, had coded the case.
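The "5 best codes with probabilities" pattern used by Statistics Norway can be sketched as follows; the codes and probabilities below are invented, and in practice come from the trained model:

```python
# Offer the coder the k most probable codes with their scores.
def top_codes(probabilities, k=5):
    """Return the k most probable (code, probability) pairs, best first."""
    return sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)[:k]

probs = {"8221": 0.02, "8222": 0.61, "8223": 0.05,
         "8229": 0.18, "5223": 0.08, "9260": 0.06}

for code, p in top_codes(probs):
    print(f"{code}  {p:.2f}")
```

Showing scores alongside the suggestions lets the coder judge how much to trust each one, which matters given the BLS finding that suggestions are too easily accepted.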

As ML takes on the task of auto-coding large parts of the data sets, results become available more rapidly.

A financial gain was reported only by Statistics Norway; over a 10-year period it is expected to be between 7,000,000 NOK and 17,000,000 NOK (650,000 € to 1,600,000 €).

...
