| This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community. |
https://statswiki.unece.org/display/MLP/Project+context+and+objectives
Three themes were investigated:
This report summarises the pilot studies carried out on Classification and Coding and investigates whether and how they have made official statistics better, where better can be any one or more aspects of:
Advances made within the respective organisations will also be examined.
This report also presents the broad range of approaches and ML algorithms tested and the differing results obtained with them. This richness is a testament to the commitment and contributions of all the National Statistics Offices (NSOs) involved in this project.
As far as this theme report is concerned, the given set of data referred to in the above quote is typically a text or narrative provided by the respondent to describe, for example, their occupation, the industrial activity of a company, the injury details of an injured worker, product descriptions scraped from the internet or sentiment text obtained from Twitter. Most of the pilot studies fall into the group of multi-class classification tasks. These aim to classify text descriptions into internationally accepted coding frames, e.g. SIC, SOC, NAICS, NOC or ECOICOP, which offer many target classes; see Table 2 for a list of the coding frames used in the pilot studies. Even though the aims of these studies are mostly the same, their approaches and software solutions differ, as do their results.
https://machinelearningmastery.com/types-of-classification-in-machine-learning/
There was only one pilot study with binary classification: the Twitter sentiment classification into positive and negative sentiments. The target classes for this are [‘Positive’, ‘Negative’].
For example:
“The UK Standard Occupational Classification (SOC) system is used to classify workers by their occupations. Jobs are classified by their skill level and content. The UK introduced this classification system in 1990 (SOC90); it was revised in 2000 (SOC2000) and again in 2010 (SOC2010).”
Table 1 – Some entries from the SOC-2010 coding frame
Code | Description |
8221 | Crane drivers |
8222 | Fork-lift truck drivers |
8223 | Agricultural machinery drivers |
8229 | Mobile machine drivers and operatives n.e.c. |
Table 1 shows some entries from the SOC coding frame. A SOC coding task has to assign a code from the table to the textual narrative given by the respondent. That narrative could be: “I drive a fork-lift truck” or “I use fork-lift trucks in my job to load up lorries”.
These natural language narratives have to be assigned to the correct code or class.
Table 2 – Coding Frames
Coding Frame | Description |
SCIAN NAICS | North American Industry Classification System. Spanish version: Sistema de Clasificación Industrial de América del Norte (SCIAN); English version: NAICS |
NOC | National Occupational Classification is Canada’s national system for describing occupations |
SINCO | National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones) |
NACE | European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne) |
SIC | Standard Industrial Classification – Established by the USA in 1937, replaced by NAICS in 1997 |
SOC | Standard Occupational Classification |
OIICS | Occupational Injury and Illness Classification System – Developed by the BLS |
ECOICOP | European Classification of Individual Consumption by Purpose |
CTS | Catalogue of Time Series by the IMF |
4.1 Random Forest
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
For each class, a score is given for the data belonging to that class. Let us assume we have a classification task to classify animals into 4 classes (cat, dog, mouse, fish); these 4 classes are the targets for the data.
Training of this machine learning algorithm is done with labelled cases, that is, data records (in this case with details about animals) where each record has already been assigned the class it belongs to. The Random Forest algorithm is initialised with a set of hyperparameters that control its behaviour. One of them is the number of trees it builds for the prediction task; e.g. n_estimators = 1000 instructs the algorithm to build 1000 trees, each from a random selection of data set features and data records.
When the learned model is used to predict which class a record falls into, a score is given for each class, e.g. [0.15, 0.30, 0.45, 0.10]. In this case the highest score is 0.45, meaning that 450 of the 1000 trees voted for the 3rd class, which is ‘mouse’ in our example. The Random Forest model therefore predicts that ‘mouse’ is the most likely classification.
Random Forest is a flexible, easy-to-use machine learning algorithm that produces good results most of the time, even without hyperparameter tuning. It is also one of the most used algorithms because of its simplicity and versatility (it can be used for both classification and regression tasks).
This algorithm was used in 6 of the pilot studies, and in some cases it outperformed every other algorithm in their experiments. It also uses relatively few computing resources, as do Logistic Regression (see 4.2) and SVM (see 4.4).
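The scoring and voting mechanism described above can be sketched with scikit-learn, which several pilot studies used. The animal records and the two features below (weight and leg count) are invented purely for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy labelled records: [weight_kg, leg_count] for the 4 target classes
X = [[4.0, 4], [30.0, 4], [0.03, 4], [0.2, 0],
     [5.0, 4], [25.0, 4], [0.02, 4], [1.5, 0]]
y = ["cat", "dog", "mouse", "fish", "cat", "dog", "mouse", "fish"]

# n_estimators = 1000 instructs the algorithm to build 1000 trees
clf = RandomForestClassifier(n_estimators=1000, random_state=0)
clf.fit(X, y)

# predict_proba gives one score per class: the fraction of trees voting
# for that class. The highest-scoring class is the prediction.
scores = clf.predict_proba([[0.025, 4]])[0]
print(dict(zip(clf.classes_, scores)))
print(clf.predict([[0.025, 4]])[0])  # -> mouse
```

A record weighing 25 g with 4 legs is closest to the labelled mouse records, so most trees vote ‘mouse’.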
4.2 Logistic Regression
Contrary to Linear Regression, which predicts a continuous value, e.g. age or price, Logistic Regression predicts a categorical variable from a given set of independent variables. It is used to predict which discrete value (0/1, yes/no, true/false) a data record falls into, and it does this by fitting the data to the logistic (sigmoid) function, whose inverse is the logit function. For this reason, the algorithm is also called logit regression.
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
Logistic Regression has been used in 7 of the pilot studies.
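A minimal scikit-learn sketch of binary logistic regression follows; the single feature and labels here are invented, not taken from any pilot study:

```python
from sklearn.linear_model import LogisticRegression

# Invented toy data: one independent variable -> binary outcome (0/1)
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

# The model outputs a probability via the logistic (sigmoid) function,
# then thresholds it to a discrete 0/1 prediction.
print(clf.predict_proba([[2], [7]]))
print(clf.predict([[2], [7]]))  # -> [0 1]
```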
4.3 Nearest Neighbour (k-NN)
The k-NN algorithm treats the features as coordinates in a multidimensional feature space and then places unlabelled data into the same category or classes as similar labelled data or Nearest Neighbours (NN).
For a given data record, it locates a specified number k of records in the training data set that are “nearest” in similarity. The classes these neighbours belong to are counted, and the unlabelled test case is assigned the class held by the majority of the k nearest neighbours; with more than two possible classes, this is simply the most common class among them.
This algorithm is also called k-NN. It is known as a “lazy” machine learning algorithm, as it does not build a model and its understanding of how features relate to a class is limited. It simply stores the training data, calculates the Euclidean distance from a new record to the training records, and picks the k nearest to label the new record.
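The store-then-vote behaviour can be sketched with scikit-learn's KNeighborsClassifier; the two-dimensional toy points below are invented:

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented points in a 2-dimensional feature space
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # class "a"
     [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]   # class "b"
y = ["a", "a", "a", "b", "b", "b"]

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3
knn.fit(X, y)  # "lazy": the training data is simply stored

# Each new point gets the majority class of its 3 Euclidean-nearest
# neighbours in the stored training data.
print(knn.predict([[1.1, 0.9], [5.1, 5.0]]))  # -> ['a' 'b']
```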
4.4 SVM
With the Support Vector Machine (SVM) algorithm, significant results can be achieved with relatively little computing power. In this classification algorithm, each data item is a point in an n-dimensional feature space, where n is the number of features of the data set. If we only have two features, the SVM algorithm finds the line that is the maximum distance away from the nearest point of each class. With more features, a hyperplane is found that best separates the labelled data points of each class; this is called the Maximum Margin Hyperplane (MMH).
The points from each class that are closest to the MMH are called Support Vectors (SVs). Each class must have at least one SV but may have more. These SVs define the MMH and thus provide a very compact way to store a classification model.
To predict which class a new data record belongs to, the data point is plotted and the class region it falls into is the predicted class. In this way, SVM learning combines aspects of both the instance-based Nearest Neighbour method and regression methods. This combination is very powerful: SVMs can model highly complex data relationships, which accounts for the good results achieved with this classifier in 5 pilot studies.
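A linear SVM sketch with scikit-learn is shown below; the toy points are invented. After fitting, the support_vectors_ attribute exposes the compact stored model described above:

```python
from sklearn.svm import SVC

# Invented 2-feature points for two classes
X = [[0, 0], [0.5, 0.5], [1, 0],   # class 0
     [3, 3], [3.5, 2.5], [4, 3]]   # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

# The support vectors are the points closest to the MMH; they alone
# define the decision boundary.
print(clf.support_vectors_)
print(clf.predict([[0.2, 0.3], [3.8, 3.1]]))  # -> [0 1]
```

With non-linear kernels (e.g. kernel="rbf"), the same interface models the more complex relationships mentioned above.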
4.5 FastText
FastText was created by Facebook’s Artificial Intelligence (AI) lab. It is a library for efficient learning of word representations and sentence classification. It allows both unsupervised and supervised learning of vector representations for words. Facebook makes pretrained models available for 294 languages. FastText uses a shallow neural network for word embedding.
4.6 Naïve Bayes
It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes can outperform even highly sophisticated classification methods. This algorithm is mostly used in text classification and in problems with multiple classes.
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
It is also among the computationally cheapest algorithms.
This algorithm was used in 4 pilot studies.
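A minimal text-classification sketch with scikit-learn's multinomial Naive Bayes follows. The narratives reuse the fork-lift and crane examples and SOC-2010 codes from Table 1; the tiny training set and the bag-of-words pipeline are illustrative assumptions, not any study's actual setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented labelled narratives with SOC-2010 codes from Table 1
texts = ["I drive a fork-lift truck",
         "I use fork-lift trucks to load lorries",
         "I operate a crane on a building site",
         "I drive a crane at the docks"]
codes = ["8222", "8222", "8221", "8221"]

# Bag-of-words counts feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, codes)

print(model.predict(["I load lorries with a fork-lift"]))  # -> ['8222']
```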
4.7 Neural Networks
Neural Networks have been used in two pilot studies for text classification:
These algorithms require expensive and powerful hardware as shown by these two pilot study reports. But they are also very powerful in ‘learning’ patterns in the data that relate to the target, in these cases to complex coding frames.
The work done by BLS has shown that the extensive pre-processing typically carried out for other ML algorithms is not needed, and can even be detrimental, when neural networks are used, because information in the text that is useful for the learning process is lost.
4.7.1 Multi-Layer Perceptrons
A Multi-Layer Perceptron (MLP) is a type of feedforward Artificial Neural Network (ANN). It consists of input nodes that receive signals, nodes in one or more hidden layers, and one or more output nodes. Signals travel from the input, via the hidden layers, to the output. Each input node is responsible for processing a single feature of the data set. The value of this feature is transformed by the node’s activation function, and the result is then passed on, weighted, to each node in the next layer. The more complex the MLP, the more complex the relationships in the data it can recognise. The processing expense grows rapidly with the number of layers and neurons per layer. Not even the roughly 100,000 neurons of a fruit fly can yet be simulated with an ANN.
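The input, hidden and output layer structure can be sketched with scikit-learn's MLPClassifier; the two-feature toy data, the single 8-node hidden layer and the solver choice are illustrative assumptions:

```python
from sklearn.neural_network import MLPClassifier

# Invented 2-feature records: one input node per feature
X = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3],   # class 0
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]   # class 1
y = [0, 0, 0, 1, 1, 1]

# One hidden layer of 8 nodes; signals feed forward through the
# activation function at each node.
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    solver="lbfgs", max_iter=1000, random_state=0)
mlp.fit(X, y)

print(mlp.predict([[0.1, 0.1], [2.0, 2.0]]))  # -> [0 1]
```

Adding layers or nodes (e.g. hidden_layer_sizes=(64, 64)) lets the network capture more complex relationships, at rapidly growing processing cost.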
4.7.2 Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)
Convolutional and recurrent neural networks are types of neural networks designed to efficiently learn specific relationships in the data. Convolutional neural networks were originally designed for image processing and focus on efficiently learning spatial patterns.
In Recurrent Neural Networks, signals are allowed to travel backwards through loops. This ability mimics more closely how a biological neural network operates and allows extremely complex patterns to be learned. The addition of a short-term memory or delay further increases the power of RNNs, including the ability to learn sequential patterns. Both approaches have been found to be useful for a variety of language processing tasks and are therefore powerful classification tools.
4.8 XGBoost
XGBoost stands for eXtreme Gradient Boosting. The XGBoost library implements the gradient boosting decision tree algorithm and is one of the fastest implementations for gradient boosting. This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines. Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.
Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models. This approach supports both regression and classification predictive modelling problems.
https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
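The sequential error-correcting idea can be sketched with scikit-learn's GradientBoostingClassifier, used here so the example is self-contained; XGBoost exposes a comparable scikit-learn-style interface via xgboost.XGBClassifier. The toy data is invented:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Invented toy data: one feature -> binary class
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# n_estimators trees are added sequentially, each fitted to the errors
# of the ensemble built so far; learning_rate shrinks each tree's
# contribution, a form of regularisation.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gbm.fit(X, y)

print(gbm.predict([[2], [7]]))  # -> [0 1]
```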
Table 3 – Pilot Studies
Organisation | Title | Legacy System | Data | Target Codes |
Mexico - INEGI | Occupation and Economic activity coding using natural language processing | Deterministic coding system assisted with manual coding with accuracy level over 95% | Household Income and Expenditure: 74600 Households, 158568 persons | SCIAN |
Canada - | Industry and Occupation Coding | G-Code Word matching with | CCHS - Canadian Community Household Survey - 89K records | NAICS |
Belgium/ | Sentiment Analysis of twitter data | Life Statistics via surveys | Twitter data: | |
Serbia - | Coding Economic Activity | Manual coding | LFS | NACE |
Norway – Statistics Norway | Standard Industrial Code Classification by Using Machine Learning | SIC classification of new companies for the Central Coordination Register are made manually from the description provided. | Historical data - 1.5 million: | SIC |
US - | Coding Workplace Injury and Illness | Manual coding | Survey of Occupational Injuries and Illnesses | SOC: Occupation codes |
Poland - | Production description to ECOICOP | N/A | Web scraped product names | ECOICOP |
IMF | Automated Coding of IMF’s Catalogue of Time Series (CTS) | Manual coding | Time Series data sets from member countries | CTS (Catalogue of Time Series) |
Organisation | Pre-processing | Features | ML Algorithms | Tools |
Mexico - INEGI | Article suppression, stemming, lemmatisation, uppercase, synonyms, Assembly of different algorithms, TF-IDF | 5 x text variables | Assembly of Algorithms: | Python, sci-kit learn, keras |
Canada - | Removal of Stop Words, Lowercasing character conversion, Merging of variables, Caesar Cipher, Addition of LFS 440K records to CCHS’s training datasets (89K records) | Company, Industry, Job Title, Job Description” | Mandated to use | G-Code |
Belgium/ | lower casing, stemming, removing stop words, lemmatization, removing special characters, n-gramming | Count vectorization | penalized logistic regression, random forest | Python |
Serbia - | NACE Activity Code | Random Forest, | Python, | |
Norway – Statistics Norway | removal of obvious unreliable activities/code, | 1. description of economic activities | FastText, Logic Regression, | Python |
US - | very little data cleaning or normalisation | Injury description and circumstances | Logistic Regression, SVM, Naïve Bayes, Random Forest, Multilayer perceptrons, Convolutional and recurrent neural networks | Python, sci-kit learn |
Poland - | Vectorisation | Product Description | Naive Bayes, Logistic Regression, | Python, sci-kit learn |
IMF | Logistic Regression, | 2.4 GHz Intel Core i5-6300U |
Table 3 above shows that the number of features used in the available data sets is typically rather small, with the exception of the pilot study from Mexico, where 15 features were used. The table also shows that the processing power used for ML training and prediction is typically that of a desktop or laptop computer. The exceptions are the two already operationalised solutions from Norway and BLS that use neural networks: Statistics Norway use Google Cloud and BLS use 4 GPUs with 3584 cores each.
Table 4 – Results/Status/Future
Country | Results | Current Status | Future Plans |
Mexico INEGI | Economic Activity: | Exploratory | Analysis for results |
Canada - | Accuracy rate > 95% when combined with clerical classification | In production for CCHS, Coded Cases: | Tensor Flow |
Belgium/ | Precision: 80% | PoC | Use of other pretrained sentiment classifiers |
Serbia - | Two digit level: | Accuracy not sufficient for production, investigation carries on | to achieve accuracy of over 90% |
Norway – Statistics Norway | FastText, SVM and CNN all about same results | In production as a Supporting tool: | A new application is being delivered with expected savings between 7 million and 17 million NOK |
US - | From comparison with the 'Gold Standard' data: | In production: | Continued use of this ML app |
Poland - | Indicator NB SVM LR RF | PoC | Application to help classifying products was developed and shared |
IMF | ML codes 80% of Time Series data sets correctly | PoC | Moving the ML solution into production, |