...
| Panel | ||||||
|---|---|---|---|---|---|---|
|
...
...
UNECE – HLG-MOS Machine Learning |
...
WP1 - Theme 1 Coding and Classification Report
...
ProjectClassification and Coding Theme ReportEnglish version (Word file)
|
| Panel | ||||||
|---|---|---|---|---|---|---|
|
...
Table of Contents |
Section 2.1
....
VI. ...
VII. ...
| Panel | ||||||
|---|---|---|---|---|---|---|
| ||||||
The UNECE Machine Learning project was recommended by the HLG-MOS – Blue Sky Thinking network in the autumn of 2018, approved early 2019 and launched in March 2019. The objective of the project is to advance the research, development and application of Machine Learning (ML) techniques to add value to the production of official statistics [1]. Three themes were investigated:
This report attempts to summarise and investigates how and if the pilot studies carried out on Classification and Coding have shown to make official statistics better, where better can be any one or more aspects of:
Advances made to their respective organisation will also be investigated. This report will also show a broad range of approaches and ML algorithms tested and used and the different results associated with that. This richness is a testament of the commitment and contribution of all the NSO (National Statistics Offices) involved and their commitment to this project. [1] https://statswiki.unece.org/display/MLP/Project+context+and+objectives |
| Panel | ||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||
“Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning,[1] classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.”[1] As far as this theme report is concerned, the given set of data referred to in the above quote is typically a text or narrative provided by the respondent to describe, for example, their occupation, the industrial activity of a company, injury information of an injured worker, product descriptions scraped from the internet or sentiment text obtained from Twitter. Most of the pilot studies fall into the group of Multi-Class[2] classification tasks. These aim to classify text descriptions into internationally accepted coding frames, e.g. SIC, SOC, NAICS, NOC or ECOICOP that offer many target classes, see Table 2 for a list of the coding frames used in the pilot studies. Even though the aim of these studies appears to be mostly the same, their approach and software solutions used differ as their results do too. There was only one pilot study with binary classification, the twitter sentiment classification into positive and negative sentiments. The target classes for this are [‘Positive’, ‘Negative’]. For example: “The UK Standard Occupational Classification (SOC) system is used to classify workers by their occupations. Jobs are classified by their skill level and content. The UK introduced this classification system in 1990 (SOC90); it was revised in 2000 (SOC2000) and again in 2010 (SOC2010).” Table -1 Some entries from the SOC-2010 coding frame:[3]
These natural language narratives have to be assigned to the correct code or class. Table 2 – Coding Frames
[2] https://machinelearningmastery.com/types-of-classification-in-machine-learning/ [3]https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010/soc2010volume2thestructureandcodingindex#electronic-version-of-the-index |

