This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community.
Table 1 – Entries from the SOC Coding Frame
Code | Description |
8221 | Crane drivers |
8222 | Fork-lift truck drivers |
8223 | Agricultural machinery drivers |
8229 | Mobile machine drivers and operatives n.e.c. |
Table 1 shows some entries from the SOC coding frame. A SOC classification task has to assign a code from the table to the textual narrative given by the respondent. That narrative could be “I drive a fork-lift truck” or “I use fork-lift trucks in my job to load up lorries”.
These natural language narratives have to be assigned to the correct code or class.
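As a minimal sketch of what such a coding task involves, the following Python example assigns one of the Table 1 codes to a narrative by comparing word overlap (cosine similarity on word counts) between the narrative and each code description. This is an illustration only, not the method used in any of the pilot studies: the function names, the crude plural stripping, and the rule of returning `None` (meaning "refer to manual coding") when nothing overlaps are all assumptions made for this example.

```python
import re
from collections import Counter
from math import sqrt

# Entries copied from Table 1.
SOC_FRAME = {
    "8221": "Crane drivers",
    "8222": "Fork-lift truck drivers",
    "8223": "Agricultural machinery drivers",
    "8229": "Mobile machine drivers and operatives n.e.c.",
}


def tokenize(text):
    """Lowercase, split on non-letters, and crudely strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def assign_code(narrative, frame=SOC_FRAME):
    """Return the best-matching code, or None if no word overlaps at all.

    Returning None is an assumption of this sketch: it stands for
    'leave the record for manual coding'.
    """
    words = Counter(tokenize(narrative))
    scores = {code: cosine(words, Counter(tokenize(desc))) for code, desc in frame.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(assign_code("I drive a fork-lift truck"))                           # prints 8222
print(assign_code("I use fork-lift trucks in my job to load up lorries"))  # prints 8222
```

A word-overlap rule like this breaks down as soon as respondents use wording that does not appear in the code descriptions, which is one reason the studies summarised below train ML models on previously coded records instead.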
Table 2 – Coding Frames
Coding Frame | Description |
SCIAN / NAICS | North American Industry Classification System. Spanish version (SCIAN): Sistema de Clasificación Industrial de América del Norte. English version (NAICS): North American Industry Classification System |
NOC | National Occupational Classification is Canada’s national system for describing occupations |
SINCO | National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones) |
NACE | European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne) |
SIC | Standard Industrial Classification – Established by the USA in 1937, replaced by NAICS in 1997 |
SOC | Standard Occupational Classification |
OIICS | Occupational Injury and Illness Classification System – Developed by the BLS |
ECOICOP | European Classification of Individual Consumption by Purpose |
CTS | Catalogue of Time Series by the IMF |
[1] https://en.wikipedia.org/wiki/Statistical_classification
[2] https://machinelearningmastery.com/types-of-classification-in-machine-learning/
[3] https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010/soc2010volume2thestructureandcodingindex#electronic-version-of-the-index
Organisation | Title | Legacy System | Data | Target Codes |
Mexico - INEGI | Occupation and Economic activity coding using natural language processing | Deterministic coding system assisted with manual coding with accuracy level over 95% | Household Income and Expenditure: 74600 Households, 158568 persons | SCIAN |
Canada - | Industry and Occupation Coding | G-Code Word matching with | CCHS - Canadian Community Health Survey - 89K records | NAICS |
Belgium/ | Sentiment Analysis of Twitter data | Life Statistics via surveys | Twitter data: | |
Serbia - | Coding Economic Activity | Manual coding | LFS | NACE |
Norway – Statistics Norway | Standard Industrial Code Classification by Using Machine Learning | SIC classification of new companies for the Central Coordination Register is done manually from the description provided. | Historical data - 1.5 million: | SIC |
US - | Coding Workplace Injury and Illness | Manual coding | Survey of Occupational Injuries and Illnesses | SOC: Occupation codes |
Poland - | Production description to ECOICOP | N/A | Web scraped product names | ECOICOP |
IMF | Automated Coding of IMF’s Catalogue of Time Series (CTS) | Manual coding | Time Series data sets from member countries | CTS (Catalogue of Time Series) |
Mexico - INEGI | Article suppression, stemming, lemmatisation, uppercase, synonyms, Assembly of different algorithms, TF-IDF | 5 x text variables | Assembly of Algorithms: | Python, scikit-learn, Keras |
Canada - | Removal of Stop Words, Lowercasing character conversion, Merging of variables, Caesar Cipher, Addition of LFS 440K records to CCHS’s training datasets (89K records) | Company, Industry, Job Title, Job Description | Mandated to use | G-Code |
Belgium/ | lowercasing, stemming, removing stop words, lemmatization, removing special characters, n-gramming | Count vectorization | penalized logistic regression, random forest | Python |
Serbia - | NACE Activity Code | Random Forest, | Python, | |
Norway – Statistics Norway | removal of obvious unreliable activities/code, | 1. description of economic activities | FastText, Logistic Regression, | Python |
US - | very little data cleaning or normalisation | Injury description and circumstances | Logistic Regression, SVM, Naïve Bayes, Random Forest, Multilayer perceptrons, Convolutional and recurrent neural networks | Python, scikit-learn |
Poland - | Vectorisation | Product Description | Naive Bayes, Logistic Regression, | Python, scikit-learn |
IMF | Logistic Regression, | 2.4 GHz Intel Core i5-6300U |
Table 3 above shows that the number of features used in the available data sets is typically rather small, with the exception of the pilot study from Mexico, where 15 features were used. The table also shows that the processing power used for ML training and prediction is typically that of a desktop or laptop computer. The exceptions are the two solutions already in production, from Statistics Norway and the BLS, which use neural networks. Statistics Norway uses Google Cloud, and the BLS uses 4 GPUs with 3584 cores each.
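Several of the text preparation steps listed in Table 3 (lowercasing, removing special characters, removing stop words, n-gramming and count vectorisation) can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions rather than the code of any of the pilot studies: the stop-word list is a tiny hypothetical one, the function names are invented for this example, and stemming and lemmatisation are left out because they normally rely on an external library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"a", "an", "the", "i", "in", "my", "to", "of", "and", "up"}


def preprocess(text):
    """Lowercase, replace non-letter characters with spaces, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vectorize(docs, max_n=2):
    """Return (vocabulary, vectors): one count vector per document.

    The vocabulary holds every 1-gram up to max_n-gram seen in the documents,
    sorted alphabetically; each vector counts those terms in one document.
    """
    doc_terms = []
    for doc in docs:
        tokens = preprocess(doc)
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        doc_terms.append(Counter(terms))
    vocabulary = sorted(set().union(*doc_terms))
    vectors = [[counts[term] for term in vocabulary] for counts in doc_terms]
    return vocabulary, vectors


docs = [
    "I drive a fork-lift truck",
    "I use fork-lift trucks in my job to load up lorries",
]
vocabulary, vectors = count_vectorize(docs)
print(preprocess(docs[0]))  # prints ['drive', 'fork', 'lift', 'truck']
```

The resulting count vectors are the kind of numeric input that classifiers such as logistic regression or random forest expect. Without stemming, "truck" and "trucks" remain separate vocabulary entries, which shows why several organisations add stemming or lemmatisation to this step.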
Table 4 – Results/Status/Future
Organisation | Results | Current Status | Future Plans |
Mexico - INEGI | Economic Activity: | Exploratory | Analysis for results |
Canada - | Accuracy rate > 95% when combined with clerical classification | In production for CCHS, Coded Cases: | TensorFlow |
Belgium/ | Precision: 80% | PoC | Use of other pretrained sentiment classifiers |
Serbia - | Two digit level: | Accuracy not sufficient for production, investigation continues | Achieve accuracy of over 90% |
Norway – Statistics Norway | FastText, SVM and CNN all give about the same results | In production as a Supporting tool: | A new application is being delivered with expected savings between 7 million and 17 million NOK |
US - | From comparison with the 'Gold Standard' data: | In production: | Continued use of this ML app |
Poland - | Indicator NB SVM LR RF | PoC | Application to help classifying products was developed and shared |
IMF | ML codes 80% of Time Series data sets correctly | PoC | Moving the ML solution into production, |
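The measures reported in Table 4 (accuracy, precision, results at the two-digit level, comparison against "gold standard" data) can be computed as in the following sketch. The function names and the five example records are hypothetical and chosen only for illustration; they are not data from any of the studies.

```python
def accuracy(gold, predicted, digits=None):
    """Share of records whose predicted code equals the gold-standard code.

    If digits is given, only the leading `digits` characters of each code
    are compared (for example digits=2 for a two-digit level).
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    cut = (lambda code: code[:digits]) if digits else (lambda code: code)
    hits = sum(cut(g) == cut(p) for g, p in zip(gold, predicted))
    return hits / len(gold)


def precision(gold, predicted, code):
    """Of the records predicted as `code`, the share whose gold code is `code`.

    Returns None when the code was never predicted.
    """
    gold_of_predicted = [g for g, p in zip(gold, predicted) if p == code]
    if not gold_of_predicted:
        return None
    return sum(g == code for g in gold_of_predicted) / len(gold_of_predicted)


# Hypothetical example records using the SOC codes from Table 1.
gold = ["8221", "8222", "8222", "8223", "8229"]
predicted = ["8221", "8222", "8229", "8223", "8222"]

print(accuracy(gold, predicted))            # prints 0.6
print(accuracy(gold, predicted, digits=2))  # prints 1.0
print(precision(gold, predicted, "8222"))   # prints 0.5
```

Comparing at a coarser level of the coding frame always gives an accuracy at least as high as at the full code level, so results quoted at different levels are not directly comparable.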