This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community.
The UNECE Machine Learning project was recommended by the HLG-MOS Blue Sky Thinking network in the autumn of 2018, approved in early 2019 and launched in March 2019. The objective of the project is to advance the research, development and application of Machine Learning (ML) techniques to add value to the production of official statistics. Three themes were investigated:
This report summarises the pilot studies carried out on Classification and Coding and investigates whether, and how, they have shown that ML can make official statistics better, where better can be any one or more aspects of:
Advances made within the respective organisations will also be examined. This report also presents the broad range of approaches and ML algorithms tested and used, and the differing results associated with them. This richness is a testament to the commitment and contribution of all the NSOs (National Statistics Offices) involved in this project.
2. Overview of Classification and Coding
As far as this theme report is concerned, the given set of data referred to in the above quote is typically a text or narrative provided by the respondent to describe, for example, their occupation, the industrial activity of a company, injury information of an injured worker, product descriptions scraped from the internet or sentiment text obtained from Twitter. Most of the pilot studies are Multi-Class classification tasks (see https://machinelearningmastery.com/types-of-classification-in-machine-learning/). These aim to classify text descriptions into internationally accepted coding frames, e.g. SIC, SOC, NAICS, NOC or ECOICOP, which offer many target classes; see Table 2 for a list of the coding frames used in the pilot studies. Even though the aims of these studies appear to be mostly the same, their approaches and software solutions differ, as do their results.
There was only one pilot study with binary classification: the Twitter sentiment classification into positive and negative sentiments. The target classes for this are [‘Positive’, ‘Negative’].
For example:
“The UK Standard Occupational Classification (SOC) system is used to classify workers by their occupations. Jobs are classified by their skill level and content. The UK introduced this classification system in 1990 (SOC90); it was revised in 2000 (SOC2000) and again in 2010 (SOC2010).”
Table 1 – Some entries from the SOC-2010 coding frame
| Code | Description |
|---|---|
| 8221 | Crane drivers |
| 8222 | Fork-lift truck drivers |
| 8223 | Agricultural machinery drivers |
| 8229 | Mobile machine drivers and operatives n.e.c. |
Table 1 shows some entries from the SOC coding frame. A SOC code classification task has to assign a code from the table to the textual narrative given by the respondent. That narrative could be: “I drive a fork-lift truck” or “I use fork-lift trucks in my job to load up lorries”.
These natural language narratives have to be assigned to the correct code or class.
Table 2 – Coding Frames

| Coding Frame | Description |
|---|---|
| SCIAN / NAICS | Spanish version: Sistema de Clasificación Industrial de América del Norte; English version: North American Industry Classification System |
| NOC | National Occupational Classification, Canada’s national system for describing occupations |
| SINCO | National Classification System for Occupations (Sistema Nacional de Clasificación de Ocupaciones) |
| NACE | European Classification of Economic Activities (Nomenclature statistique des Activités économiques dans la Communauté Européenne) |
| SIC | Standard Industrial Classification, established by the USA in 1937, replaced by NAICS in 1997 |
| SOC | Standard Occupational Classification |
| OIICS | Occupational Injury and Illness Classification System, developed by the BLS |
| ECOICOP | European Classification of Individual Consumption by Purpose |
| CTS | Catalogue of Time Series by the IMF |
Currently, large volumes of satellite information are available. For example, in March 2015 NASA announced public access to the complete collection of LANDSAT 8 satellite images with 30-meter resolution in the Amazon cloud. This facilitates access to large volumes of satellite information covering the entire Earth: these satellites image the whole globe every 16 days, generating approximately 8 terabytes of information per year. NASA also offers access to images from MODIS satellites with a resolution of 500 meters, which generate a complete image of the entire Earth on a daily basis. It is also possible to access Sentinel-2 images, which have a resolution of 10 meters and cover the Earth every 5 days.
The availability of satellite information is growing steadily (Toth & Jóźków, 2016). Today there are private companies with constellations of nanosatellites capable of generating an image of the entire Earth at a resolution of 3-5 meters daily (Curzi, Modenini, & Tortora, 2020; Safyan, 2020). The wide availability of free and commercial satellite images opens opportunities to take advantage of these sources of information through Machine Learning methods. At the same time, the demand is growing for information on monitoring natural resources and on statistical variables that can be observed in images, such as the growth of urban areas. This demand is evident in international initiatives such as the United Nations document "Transforming our world: the 2030 Agenda for Sustainable Development", which establishes a broad action plan in favour of people, the planet and prosperity across the three dimensions of sustainable development: economic, social and environmental. This broad reach is pursued through 17 Sustainable Development Goals (SDGs) and their corresponding targets. In March 2016, the indicators that allow continuous monitoring of these goals were first defined by the member countries during the meeting of the Inter-agency and Expert Group on SDG Indicators (IAEG-SDGs) of the United Nations Statistical Division. Some of the indicators have significant potential to be estimated by processing large volumes of satellite images with computer vision and Machine Learning techniques (Holloway & Mengersen, 2018). Therefore, this report presents the results of four pilot projects, carried out by Australia, the Netherlands, Switzerland and Mexico.
The Imagery Theme team noted the lack of a generalised approach describing how satellite data can be used by NSOs. The issue is further complicated because the use of satellite data often requires ML techniques that are themselves still experimental and not yet integrated into the production process in many NSOs. The development of a generic process pipeline was therefore one of the first deliverables of the Imagery Theme team. A generic process model describes the high-level activities that need to be followed to achieve a certain objective or to deliver a specific output. This pipeline focuses on the specific use of satellite data to produce official information and aims to address the following issues:
The pipeline, shown in the diagram below, outlines the six main stages (business understanding, data collection and preparation, ML modelling, prediction, dissemination, evaluation) and the main specialised roles (thematic expert, EO scientist, data scientists, statisticians and computer scientists) involved in each of the steps. A more detailed description of this activity, as well as additional examples related to the pilot projects of the Imagery Theme team, can be found in the specific report.

Diagram of the pipeline
Data pre-processing is needed to some degree by all Machine Learning algorithms. For Natural Language Processing (NLP), several techniques can be used to remove parts of the text that at best add no information and at worst produce noise in the data, reducing the ability of the ML algorithm to recognise words with the same meaning. Most pilot studies have done a lot of work in this area and have shown that these techniques can influence the prediction ability of the ML solution.

Removal of stop-words: Stop words are words like “the”, “a”, “an”, “in”. These are unimportant words in NLP; they do not add meaning to the text to be analysed. By removing them, the algorithm can focus on the important words.

Stemming and Lemmatisation are two approaches to handling inflections, the syntactic differences between word forms. Both can be achieved with commercial or open source tools and libraries.

Stemming: A process where words are reduced to their stem or root by chopping off the end of the word, e.g. the word “flying” has the suffix “ing”; the stem is “fly”. This is typically done with an algorithm. The aim is to reduce the inflectional forms of each word to a common base. This increases the occurrence of the word and gives the algorithm more to learn from. The Porter Stemmer algorithm is very popular for the English language; it chops both “apple” and “apples” down to “appl”. This shows that stemming may produce something that is not a real word, but if it is applied to all the narratives to be classified and to all target documents, the algorithm can still find matches.

Lemmatisation: This also removes inflections, but it does not simply chop them off. It uses lookup tables, e.g. WordNet, that contain all inflected forms of a word to find its base or dictionary form, known as the lemma; e.g. “geese” is lemmatised by WordNet to “goose”. If the word is not included in the table, it is passed through as the lemma.

Normalisation to lower case: Every word is converted to lower case.

n-grams: The text to be classified is chopped up into sequences of words or characters. (Chopping text into syllables is much harder, as these depend on enunciation and are harder to calculate.) Consider the narrative “the fox jumps over the fence”; in the n-gram examples the space character is represented by the underscore ‘_’. The Mexican pilot study on occupation and economic activity classification describes in great detail their work on n-grams and the impact that changes to the n-grams have on their prediction results.

The U.S. Bureau of Labor Statistics (BLS), however, stopped using stop-word removal and stemming after early experiments showed these techniques to be unhelpful for the kind of text narratives they need to classify. They did use CountVectorizer as a tool to create a “bag of features” representing the input: it records which words (or sequences of words, or sequences of characters) occur in the input, but not the order in which they appear. When they moved to Neural Networks, they stopped this too. Preserving the original ordering of the sequence of letters allows a Neural Network to gain more insight into the text, while simpler algorithms are not capable of learning the intermediate representations, i.e. that letters form words, words form phrases and sentences, sentences form paragraphs, and so on. Please see the column Data Pre-processing in Table 3 for the steps carried out by each of the pilot studies.
4. Algorithms

4.1 Random Forest
(see https://towardsdatascience.com/understanding-random-forest-58381e0602d2)
For each class, a score is given for the likelihood that the data belongs to that class. Let us assume we have a classification task to classify animals into 4 classes (cat, dog, mouse, fish); these 4 classes are the target for the data.

Training of this machine learning algorithm is done with labelled cases, that is, data records (in this case with details about animals) where each record has already been assigned the class it belongs to. The Random Forest algorithm is initialised with a set of hyperparameters that control its behaviour. One of them is the number of trees it builds for the prediction task, e.g. n_estimators = 1000 instructs the algorithm to build 1000 trees out of random selections of data set features and data records.

When the learned model is used to predict the class of a record, a score is given for each class, e.g. [0.15, 0.30, 0.45, 0.10]. Here the highest score is 0.45, meaning that 450 out of the 1000 trees voted for the 3rd class, which is ‘mouse’ in our example. The Random Forest model predicts that ‘mouse’ is the most likely classification.
Random forest is a flexible, easy to use machine learning algorithm that produces, even without hyper-parameter tuning, a great result most of the time. It is also one of the most used algorithms, because of its simplicity and diversity (it can be used for both classification and regression tasks).
This algorithm was used in 6 of the pilot studies, and in some cases it outperformed every other algorithm in their experiments while using relatively little computing resource, just like Logistic Regression (see 4.2) and SVM (see 4.4).
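The voting example above can be reproduced with scikit-learn's RandomForestClassifier. The four animal classes come from the text; the two numeric features and their clustered values are invented purely so the sketch is runnable:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented two-feature records for the four classes; each class is a
# cluster of 30 points around a different centre.
rng = np.random.default_rng(0)
classes = ["cat", "dog", "mouse", "fish"]
X = np.vstack([rng.normal(loc=i * 5.0, scale=1.0, size=(30, 2))
               for i in range(len(classes))])
y = np.repeat(classes, 30)

# n_estimators = 1000 instructs the algorithm to build 1000 trees.
clf = RandomForestClassifier(n_estimators=1000, random_state=0)
clf.fit(X, y)

# predict_proba gives one score per class; the largest vote share wins.
scores = clf.predict_proba([[10.0, 10.0]])[0]
print(dict(zip(clf.classes_, scores.round(2))))
print(clf.predict([[10.0, 10.0]]))
```

The scores are the fractions of trees that voted for each class, so they always sum to 1.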
4.2 Logistic Regression
Contrary to Linear Regression, which predicts a continuous value, e.g. age or price, Logistic Regression predicts a categorical variable from a given set of independent variables. It is used to predict which discrete value (0/1, yes/no, true/false) a data record falls into, and it does this by fitting the data to a logistic, or logit, function. For this reason, the algorithm is also called logit regression (see https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/).
Logistic Regression has been used in 7 of the pilot studies.
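A minimal sketch of Logistic Regression on a coding task, assuming scikit-learn; the four training narratives and their SOC-2010 codes echo the fork-lift and crane entries of Table 1, but this tiny training set and the pipeline choices are illustrative, not any NSO's actual setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented miniature training set: narratives labelled with SOC-2010
# codes from Table 1 (8221 = crane drivers, 8222 = fork-lift truck drivers).
texts = [
    "I drive a fork-lift truck",
    "I load lorries with a fork-lift",
    "I operate a crane on a building site",
    "I drive a tower crane",
]
labels = ["8222", "8222", "8221", "8221"]

# TF-IDF features feed a logistic (logit) model over the two codes.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["I use fork-lift trucks in my job"]))
```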
4.3 Nearest Neighbour (k-NN)
The k-NN algorithm treats the features as coordinates in a multidimensional feature space and then places unlabelled data into the same category or classes as similar labelled data or Nearest Neighbours (NN).
For a given data record, it locates a specified number k of records in the training data set that are “nearest” in similarity. The classes these neighbours belong to are counted, and the unlabelled test case is assigned the class held by the majority of the k nearest neighbours. If there are more than two possible classes to pick from, the classifier still picks the most common class among the neighbours.

k-NN is known as a “lazy” machine learning algorithm, as it does not produce a model and its understanding of how features relate to a class is limited. It simply stores the training data, calculates the Euclidean distance from the new record to each training record, and then picks the k nearest to label the new record.
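The lazy procedure described above (store the training data, compute Euclidean distances, take the majority vote of the k nearest records) fits in a few lines of plain Python; the two-feature records and class labels below are invented for illustration:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, new_point, k=3):
    """Label new_point with the majority class of its k nearest neighbours."""
    neighbours = sorted(train, key=lambda rec: dist(rec[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Invented 2-feature training records: (features, class label).
train = [
    ((1.0, 1.0), "cat"), ((1.2, 0.8), "cat"),
    ((5.0, 5.0), "dog"), ((5.1, 4.9), "dog"), ((4.8, 5.2), "dog"),
]
print(knn_predict(train, (4.9, 5.0), k=3))
```

Note that no model is built: every prediction re-scans the stored training data, which is exactly what makes the algorithm "lazy".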
4.4 SVM
With the Support Vector Machine (SVM) algorithm, significant results can be achieved with relatively little computing power. In this classification algorithm, each data item is a point in an n-dimensional feature space, where n is the number of features in the data set. If there are only two features, the SVM algorithm finds the line that is the maximum distance away from the nearest point of each class. With more features, it finds the hyperplane that best separates the labelled data points belonging to each class; this is called the Maximum Margin Hyperplane (MMH).

The points from each class that are closest to the MMH are called Support Vectors (SVs). Each class must have at least one SV, but may have more. These SVs define the MMH and thus provide a very compact way to store a classification model.

To predict which class a new data record belongs to, the data point is plotted and the class region it falls into is the predicted class. In this way, SVM learning combines aspects of both the instance-based Nearest Neighbour and regression methods. As this combination is very powerful, SVMs can model highly complex data relationships, which accounts for the good results achieved with this classifier in 5 pilot studies.
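A two-feature sketch with scikit-learn's SVC, using a linear kernel so the maximum margin hyperplane is a straight line; the two small point clouds are invented. After fitting, the model keeps only the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two invented, linearly separable classes in a two-feature space.
X = np.array([[1.0, 1.0], [1.5, 0.5], [0.5, 1.5],
              [4.0, 4.0], [4.5, 3.5], [3.5, 4.5]])
y = ["a", "a", "a", "b", "b", "b"]

# A linear kernel fits the Maximum Margin Hyperplane directly.
clf = SVC(kernel="linear")
clf.fit(X, y)

# Only the support vectors, the points closest to the MMH, are stored.
print(clf.support_vectors_)
print(clf.predict([[1.0, 0.5], [4.0, 4.5]]))
```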
4.5 FastText
FastText was created by Facebook’s Artificial Intelligence (AI) lab. It is a library for efficient learning of word representations and sentence classification. It supports both unsupervised and supervised learning of vector representations for words. Facebook makes pretrained models available for 294 languages. FastText uses a neural network for word embedding.
4.6 Naïve Bayes
Naïve Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, all of these properties contribute independently to the probability that the fruit is an apple, and that is why it is known as ‘Naive’.

The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods (see https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/). This algorithm is mostly used in text classification and with problems having multiple classes. It is also computationally the cheapest of the algorithms discussed here.
This algorithm was used in 4 pilot studies.
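A minimal Naive Bayes text classifier, assuming scikit-learn; the product descriptions and the two classes are invented stand-ins for a real coding frame such as ECOICOP:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented product descriptions, loosely in the style of web-scraped
# product names, coded to two made-up classes.
texts = ["fresh white bread loaf", "wholemeal bread rolls",
         "sparkling mineral water", "still spring water bottle"]
labels = ["bread", "bread", "water", "water"]

# Word counts feed the per-class word probabilities, each word being
# treated as independent of the others (the "naive" assumption).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["rye bread", "mineral water"]))
```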
4.7 Neural Networks
Neural Networks have been used in two pilot studies for text classification:
- Mexico for the classification of occupation and Economic activity.
- USA – BLS for the classification of workplace injuries, using Multi-Layer Perceptrons, Convolutional and Recurrent Neural Networks.
These algorithms require expensive and powerful hardware as shown by these two pilot study reports. But they are also very powerful in ‘learning’ patterns in the data that relate to the target, in these cases to complex coding frames.
The work done by BLS has shown that extensive pre-processing typically carried out for other ML algorithms is not needed, or can even be detrimental when neural networks are used because information from the text is lost for the learning process.
4.7.1 Multi-Layer Perceptrons
A Multi-Layer Perceptron (MLP) is a type of feedforward Artificial Neural Network (ANN). It consists of input nodes that receive signals, nodes in one or more hidden layers, and one or more output nodes. Signals travel from the input, via the hidden layers, to the output. Each input node is responsible for processing a single feature of the data set. The value of this feature is transformed by the activation function of the node, and the result is then passed, with a weight, to each node in the next layer. The more complex the MLP, the more complex the relationships in the data that can be recognised. The processing expense grows rapidly with the number of layers and the number of neurons in each layer: not even the 100,000 neurons of a fruit fly can yet be simulated with an ANN.
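As a sketch, scikit-learn's MLPClassifier with a single hidden layer can learn the XOR pattern, which no model without a hidden layer can represent; the layer size, activation, solver and data here are illustrative choices, not from the pilot studies:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR-style data: class 1 exactly when the two input features differ.
# No network without a hidden layer can separate these points.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 50)
y = np.array([0, 1, 1, 0] * 50)

# One hidden layer of 16 nodes; each input node handles one feature and
# each node's output passes through its activation function.
clf = MLPClassifier(hidden_layer_sizes=(16,), activation="tanh",
                    solver="lbfgs", max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```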
4.7.2 Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)
Convolutional and recurrent neural networks are types of neural networks designed to efficiently learn specific relationships in the data. Convolutional neural networks were originally designed for image processing and focus on efficiently learning spatial patterns.
In Recurrent Neural Networks, signals are allowed to travel backwards through loops. This ability mimics more closely how a biological neural network operates and allows extremely complex patterns to be learned. The addition of short-term memory, or delay, increases the power of RNNs, including the ability to learn sequential patterns. Both approaches have been found useful for a variety of language processing tasks and are therefore powerful classification tools.
4.8 XGBoost
XGBoost stands for eXtreme Gradient Boosting. The XGBoost library implements the gradient boosting decision tree algorithm and is one of the fastest implementations for gradient boosting. This algorithm goes by lots of different names such as gradient boosting, multiple additive regression trees, stochastic gradient boosting or gradient boosting machines. Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the AdaBoost algorithm that weights data points that are hard to predict.
Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and are then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimise the loss when adding new models. This approach supports both regression and classification predictive modelling problems (see https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/).
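The sequential error-correcting idea can be sketched with scikit-learn's GradientBoostingClassifier, used here as a stand-in since the xgboost library itself may not be installed; the synthetic data set is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic two-class data set, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Trees are added sequentially; each new tree fits the residual errors
# left by the ensemble built so far (the gradient boosting idea).
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```

The learning_rate parameter shrinks each tree's contribution, trading more trees for better generalisation, which is the main tuning lever of gradient boosting.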
Most pilot studies use Accuracy and Recall to measure how well the ML solution predicts the right result. Precision was only used in two studies, and the F1-score in only one of them.
All of the pilot studies are classification tasks where each case is classified to belong to exactly one class.
Here is a brief description of these quality measures:
- Accuracy is the proportion of true results among the total number of cases examined.
- Precision is the proportion of the cases predicted as positive that really are positive, i.e. the proportion of positive predictions that are correct.
- Recall is the proportion of actual positives that are correctly classified, i.e. of the cases the classifier is meant to find, the share it actually finds.
- F1-score combines Precision and Recall into a single number. It is computed using a mean, but not the usual arithmetic mean: it uses the harmonic mean.
For each class of a classification task we can put the ML algorithm predictions into these 4 categories:
- True Positives (TP) = the model correctly predicts that the case belongs to the class
- True Negatives (TN) = the model predicts correctly that the case does not belong to the class
- False Positives (FP) = the model incorrectly predicts that the case belongs to the class
- False negatives (FN) = the model incorrectly predicts that the case does not belong to the class.
These categories can then be used to calculate the quality measures Accuracy, Precision, Recall and F-score for each class:
Accuracy
This is the ability of the algorithm to make a correct prediction, that is, to predict correctly whether or not a record belongs to the class we are looking at:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision
This expresses the ability of the algorithm to predict True Positives while avoiding False Positives:

Precision = TP / (TP + FP)

Recall
The ability to find the records we want, the ones labelled as belonging to the class we are looking at:

Recall = TP / (TP + FN)

F1-score
The F1-score is the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Its range is [0, 1]. It tells how precise the classifier is (how many of its positive predictions are correct), as well as how robust it is (whether it misses a significant number of instances).
The F1-score will be lower than the simple arithmetic mean if one of the two measures is much lower than the other, and this is the reason why the F1-score is such a useful measure.
The above quality measures are defined for each individual class.
Binary Classification:
For a binary classification, we typically calculate these for the positive class. We want to find the cases belonging to that class. For a classification task where a bank wants to predict Fraud and No-Fraud transactions, the positive class would be Fraud as these are the ones we want to predict and find.
Cases belonging to the TP counter are the ones that are truly Fraud and where the classifier has predicted Fraud and FP cases would be No-Fraud cases, but the classifier has predicted Fraud.
Getting good Precision OR Recall is usually very easy, but getting good Precision AND Recall is often very hard. In general, we prefer classifiers with higher Precision and Recall scores. However, there is a trade-off between Precision and Recall; when tuning a classifier, improving the Precision score often results in lowering the Recall score and vice versa.
If the number of cases per class is not evenly distributed (often referred to as not balanced), e.g. the number of cases labelled as Fraud is much smaller than the number labelled as Non-Fraud, then Accuracy, Precision and Recall can be misleading if seen in isolation. Let us assume Non-Fraud is the majority class, with, say, 90% of all cases, and the classifier always predicts Non-Fraud. It would appear to be predicting correctly in 90% of all cases, which might look like a good result if we only consider Accuracy. But Recall for the positive class, the cases we want to find, would be 0%, while for the Non-Fraud class it would be 100%. Precision would be 0% for the positive Fraud class, but 90% for the Non-Fraud class. For a bank this would mean that no fraudulent transactions are detected; this classifier would be rather useless. A far better approach is to combine these metrics into measures like the F1-score and the macro-F1-score (see below).
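The Fraud example can be checked numerically. This plain-Python sketch reproduces the 90/10 split and the degenerate classifier that always predicts Non-Fraud:

```python
# 90 Non-Fraud cases, 10 Fraud cases; the classifier always says "Non-Fraud".
actual = ["Fraud"] * 10 + ["Non-Fraud"] * 90
predicted = ["Non-Fraud"] * 100

def per_class_metrics(actual, predicted, positive):
    """Precision, Recall and F1 for one class, from the TP/FP/FN counts."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)                                       # 0.9 -- looks good
print(per_class_metrics(actual, predicted, "Fraud"))  # (0.0, 0.0, 0.0)
```

Accuracy of 90% looks impressive, while Precision, Recall and F1 for the Fraud class are all zero: the classifier finds nothing.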
Multi Class Classification:
For a multi class classification task, the target is a group of classes, like the one mentioned in 3.1, where we want to classify pictures of animals into the classes (cat, dog, mouse, fish), or a text narrative is classified into the SOC coding frame where the target has hundreds of classes. To be able to assess the overall quality of the classifier, we need to calculate the above measures for each class and combine them in either a Macro or Micro quality measure. This will give us a balanced view of the classifier’s ability to predict all classes.
Macro-Averaged:
The simplest way to combine the per-class measures into the respective Macro measures is to take the arithmetic mean of each of them, e.g. for N classes:

Macro-Precision = (Precision_1 + Precision_2 + … + Precision_N) / N

and likewise for Macro-Accuracy, Macro-Recall and Macro-F1.
As well as the above Macro-averaged measures, the averages might be weighted by the numbers of cases for each class. The quality measures are then called weighted-average or simply Weighted-Accuracy and so on.
Micro-Averaged:
The Micro-Averaged quality measures are also called:
- Micro-Accuracy
- Micro-Precision
- Micro-Recall
- Micro-F1-score
To calculate the Micro-Averages, we look at all the samples together. In a multi-class case, we consider all the correctly predicted cases to be True Positives, i.e. we add up TP = TP_cat + TP_dog + TP_mouse + TP_fish.
For the False Positives we have to count all the prediction errors. A photo of a cat that is predicted to be a fish, is a false prediction. It is a False Positive (FP) for the fish class but also a False Negative (FN) for the cat class, adding all these up, we end up with FP =FN. More broadly, each prediction error (X is misclassified as Y) is a False Positive for Y, and a False Negative for X.
And this implies that:
Micro-F1 = Micro-Precision = Micro-Recall = Micro-Accuracy
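This equality can be verified with scikit-learn's metric functions; the small set of animal labels below is invented for illustration:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Invented multi-class predictions over the four animal classes.
actual    = ["cat", "cat", "dog", "mouse", "fish", "dog", "mouse", "fish"]
predicted = ["cat", "dog", "dog", "mouse", "cat",  "dog", "fish",  "fish"]

acc = accuracy_score(actual, predicted)  # 5 of 8 correct = 0.625
print(acc)
print(precision_score(actual, predicted, average="micro"))
print(recall_score(actual, predicted, average="micro"))
print(f1_score(actual, predicted, average="micro"))
```

All four values are identical, because in a single-label multi-class task every misclassification is simultaneously one False Positive and one False Negative.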
Multi-Labelled Classification:
In a Multi-Labelled classification task, one case can be assigned multiple classes, e.g. a photo shows both a dog and a cat. In this situation (and in the multi-class case when some classes are excluded), the Micro-averaged metrics differ from the overall accuracy. None of the pilot studies falls into this category of classification tasks.
At the time of writing this report, reports on 8 ML projects were submitted, see Table 3. Of these, 3 went either into production during this UNECE project or had already been operationalised when the project started, these are from StatsCanada, Statistics Norway and the US Bureau of Labor Statistics (BLS).
Table 3 – List of Pilot Studies

| Organisation | Title | Legacy System | Data | Target Codes |
|---|---|---|---|---|
| Mexico - INEGI | Occupation and Economic activity coding using natural language processing | Deterministic coding system assisted with manual coding, accuracy level over 95% | Household Income and Expenditure: 74,600 households, 158,568 persons | SCIAN, SINCO |
| Canada - StatsCanada | Industry and Occupation Coding | G-Code word matching, accuracy level > 95% (higher than human coders) | CCHS (Canadian Community Household Survey), 89K records; LFS (Labour Force Survey), 440K records; CHMS (Canadian Health Measures Survey); NOC index entries: 114,161; NAICS index entries: 38,256 | NAICS, NOC |
| Belgium/Flanders - Statistiek Vlaanderen | Sentiment Analysis of Twitter data | Life statistics via surveys | Twitter data: tweets that contain either a positive or a negative emoticon as a label, to avoid manual production of training data | Positive / Negative |
| Serbia - SORS | Coding Economic Activity | Manual coding | LFS, 20,000 cases | NACE, 2- and 3-digit classification |
| Norway - Statistics Norway | Standard Industrial Code Classification by Using Machine Learning | SIC classification of new companies for the Central Coordination Register, made manually from the description provided | Historical data, 1.5 million records: 1. descriptions of economic activities; 2. 'official' descriptions of codes and keywords | SIC, 821 labels |
| US - BLS | Coding Workplace Injury and Illness | Manual coding | Survey of Occupational Injuries and Illnesses: initially 261,000 records, later > 2 million | SOC (occupation codes); OIICS (injury, part of body injured, event, source of injury, secondary source) |
| Poland - Statistics Poland | Product description to ECOICOP | N/A | Web-scraped product names, 17,000 cases manually coded to ECOICOP | ECOICOP, 1st group of products (Food and non-alcoholic beverages), 61 codes, all 5 digits |
| IMF | Automated Coding of IMF’s Catalogue of Time Series (CTS) | Manual coding | Time series data sets from member countries | CTS (Catalogue of Time Series), 28,886 codes |
Table 3 (continued) – Data Pre-processing, Features, Algorithms and Hardware/Software

| Organisation | Data Pre-processing | Features | Algorithms | Hardware/Software |
|---|---|---|---|---|
| Mexico - INEGI | Article suppression, stemming, lemmatisation, upper-casing, synonyms, assembly of different algorithms, TF-IDF | 5 text variables, 10 categorical variables | Assembly of algorithms: SVM, Logistic Regression, Random Forest, Neural networks, XGBoost, k-NN, Naïve Bayes, Decision Trees | 20 cores, 256 GB RAM, 4 TB drives |
| Canada - StatsCanada | Removal of stop words, lower-casing, merging of variables, Caesar cipher, addition of LFS 440K records to the CCHS training data set (89K records) | Company, Industry, Job Title, Job Description | Mandated to use FastText or XGBoost as they are already in G-Code | 3 GHz Intel i5-3570, 16 GB RAM |
| Belgium/Flanders - Statistiek Vlaanderen | Lower-casing, stemming, removing stop words, lemmatisation, removing special characters, n-gramming, count vectorisation, TF-IDF | | Autoencoder neural network, pretrained neural network, penalised logistic regression, random forest, gradient boosting trees, multi-layered perceptrons | 4 GHz Intel i7-6700K, 16 GB RAM |
| Serbia - SORS | | NACE activity code, activity name, description | Random Forest, SVM, Logistic Regression | Python, Sci-Kit Learn, Pandas, Pyzo IDE |
| Norway - Statistics Norway | Removal of obviously unreliable activities/codes; removal of stop words, digits and punctuation; upper-casing | 1. descriptions of economic activities; 2. 'official' descriptions of codes and keywords; 3. company names | FastText, Logistic Regression, Random Forest, Naïve Bayes, SVM, CNN (Convolutional Neural Network) | Python, Google Cloud |
| US - BLS | Very little data cleaning or normalisation | Injury description and circumstances, industry code, employer's name, worker's occupation | Logistic Regression, SVM, Naïve Bayes, Random Forest, Multilayer perceptrons, Convolutional and recurrent neural networks | Initially 2-4 cores, 8-16 GB RAM; finally 4 Titan X Pascal GPUs, each with 12 GB and 3,584 cores |
| Poland - Statistics Poland | Vectorisation | Product description | Naive Bayes, Logistic Regression, Random Forest, SVM, Neural networks (Ludwig library) | Office PCs |
| IMF | | | Logistic Regression, Nearest Neighbour | 2.4 GHz Intel Core i5-6300U |
Table 3 above shows that the number of features used in the available data sets is typically rather small, with the exception of the pilot study from Mexico, where 15 features were used. It also shows that the processing power used for ML training and prediction is typically that of a desktop or laptop computer. The exceptions are the two already operationalised solutions from Norway and BLS, which use neural networks: Statistics Norway uses Google Cloud, and BLS uses four GPUs with 3,584 cores each.
Table 4 – Results/Status/Future
| Country | Results | Current Status | Future Plans |
|---|---|---|---|
| Mexico – INEGI | Economic activity: accuracy 87.7%, precision 66%, recall 64.5%. Occupation: accuracy 83.1%, precision 57.8%, recall 57.3% | Exploratory | Analysis of results; new methodologies |
| Canada – StatsCanada | Up to 100% recall and precision on QC sample | In production for CCHS (coded cases: 2019 Q3 12.6%, 2019 Q4 13.3%) and for CHMS | TensorFlow, PyTorch, SVM |
| Belgium/Flanders – Statistiek Vlaanderen | Precision 80%, recall 81% | PoC | Use other pretrained sentiment classifiers; gather more data; production IT infrastructure |
| Serbia – SORS | Two-digit level: Random Forest ≈ 69% accuracy, SVM ≈ 75%, Logistic Regression ≈ 69%. Three-digit level: Random Forest ≈ 55%, SVM ≈ 63%, Logistic Regression ≈ 63% | Accuracy not sufficient for production; investigation continues | Achieve accuracy of over 90% |
| Norway – Statistics Norway | FastText, SVM and CNN all give about the same results; FastText is faster in training; 22% of units predicted | In production as a supporting tool: the 5 best codes with probabilities are offered, allowing a faster choice | A new application is being delivered, with expected savings of between 7 million and 17 million NOK |
| US – BLS | Comparison with the 'Gold Standard' data shows ML is more accurate than manual coding; neural network coding is better than human coding for each of the 6 codes to be assigned, with accuracy between 69.8% and 91.9% | In production: ML auto-codes only above a set threshold chosen to maximise the overall macro-F1 score of the combined human/ML coding; > 85% of codes are assigned by a neural network | Expand ML to other coding tasks |
| Poland – Statistics Poland | Accuracy: NB 90.5%, SVM 92%, LR 91.6%, RF 92.2%. Precision: 90%, 92%, 92%, 93%. Recall: 90%, 92%, 92%, 92%. F1-score: 90%, 92%, 92%, 92%. MCC: 0.90, 0.92, 0.91, 0.92 | PoC | An application to help classify products was developed and shared |
| IMF | ML codes 80% of time series data sets correctly | PoC | Move the ML solution into production; combine models to improve predictions |
The expectation of Machine Learning is to automate many of the traditional manual processes inherent in the production of official statistics, and manual Classification and Coding is one of them. The manual process is lengthy, resource-hungry and can be prone to errors. Even Subject Matter Experts (SMEs) can produce different outcomes for the same C&C task, as shown by the BLS report.
The pilot studies from Canada and BLS have proven that data can be auto-coded with ML to various degrees. In these solutions, only predictions with a high level of confidence are allowed to auto-code. The more mature the ML solution, the more trust appears to be placed in it, letting ML take on more and more of the predictions. As manual coding resources are freed up, possibly on an increasing scale over time, the cost of building, monitoring and keeping the ML solution current has to be accounted for, as does a possibly expensive IT infrastructure. Data consistency will increase as manual coding is reduced.
The ML solution in Norway acts as an advisory to the human coding process: the human coder is given the ML prediction and the option to either accept or reject it. A similar approach is taken in BLS's ML solution, where the human coders can reject suggested codes. However, BLS has shown that a suggested ML prediction is too easily accepted by the human coder, even when it is incorrect. While building their Gold Standard data set, they therefore only used coding by human experts, and these experts were not shown how others, whether human or computer, had coded the case.
As ML takes on the task of auto coding big parts of the data sets, results become more rapidly available.
A financial gain was reported only by Statistics Norway: over a 10-year period it is expected to be between 7,000,000 and 17,000,000 NOK (650,000 to 1,600,000 €).
The expectation that ML can fully replace the labour-intensive task of classification and coding immediately after implementation has not been borne out. The ML solutions from Canada and BLS only use an ML-predicted classification if its confidence is above a set threshold; predictions below it are discarded and the cases coded manually instead, although some solutions use these low-confidence predictions as advisories to assist the human coders.
Tough and rare cases still have to be coded by humans, but these manually coded cases can then be added as new training cases for the ML algorithm on a continuous basis. This allows the ML solution to mature over time, which in turn should raise the number of auto-coded cases. The cycle yields ever-increasing time and resource savings. A simulation using the ML code and product description data shared by Statistics Poland clearly demonstrates this, albeit in a relatively simple context.
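This threshold-and-feedback cycle can be sketched in a few lines. The threshold value, function names and record layout below are illustrative assumptions, not any NSO's actual implementation.

```python
# Sketch of confidence-thresholded auto-coding with a manual-coding
# feedback loop. The 0.9 threshold and the record layout are
# illustrative assumptions, not production values.

def route_predictions(predictions, threshold=0.9):
    """Split (record, code, confidence) triples into an auto-coded list
    and a manual-review queue based on a confidence threshold."""
    auto_coded, needs_review = [], []
    for record, code, confidence in predictions:
        if confidence >= threshold:
            auto_coded.append((record, code))
        else:
            needs_review.append(record)
    return auto_coded, needs_review

def feedback_cycle(training_set, manually_coded):
    """Manually coded hard cases become new training examples,
    letting the ML solution mature over time."""
    training_set.extend(manually_coded)
    return training_set
```

Run over each batch, the low-confidence queue goes to human coders, whose results are appended to the training set before the next retraining round.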
These can be very subjective, as they depend on the organisation's expectations and the context in which the ML is to be used.
The more successful pilot studies have shown that establishing a ‘Ground Truth’ or ‘Golden Data Set’ that is created manually and is deemed to be accurate and free of errors is of prime importance. A comparison between ML predictions, human coding and other legacy systems, such as rule-based systems, can only be clearly and credibly established when each is compared to a Golden Data Set in a statistically sound manner.
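As a minimal illustration of such a comparison, the macro-F1 score (the quantity BLS maximises when setting its threshold) can be computed against a golden data set as follows; the codes in the example are invented.

```python
# Minimal sketch: scoring predicted codes against a manually coded
# 'Golden Data Set'. Codes here are invented for illustration.
from collections import Counter

def macro_f1(gold, predicted):
    """Macro-averaged F1: the mean of per-class F1 scores, weighting
    every class equally regardless of how frequent it is."""
    classes = set(gold) | set(predicted)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1_scores = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

The same comparison can be run for human coders and legacy rule-based systems against the same golden set, which is what makes the comparison statistically sound.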
Other pilot studies that are in the proof of concept (PoC) phase have stated that resources to build these data sets are not available.
The neural network techniques applied by BLS can outperform the bag-of-words approaches that many others use for text classification, but there are two big caveats. First, the field's knowledge of neural networks is advancing very rapidly: the best approach today will not be the best approach tomorrow, and indeed even BLS's neural network techniques are no longer consistent with the state of the art. Second, neural networks are very hard to use effectively. Simply dropping in a generic implementation will not produce good results; the structure of the network needs to be adapted to the task using complex tools such as TensorFlow and PyTorch, and computationally expensive structures such as recurrent neural networks are needed. That in turn means specialised computing resources (specifically, very powerful GPUs) are required, which most organisations do not have and cannot easily acquire.
Neural networks should therefore not be the first thing to try, precisely because they are so much harder. For most organisations, especially those just starting out on their ML journey, the best approach is not the best approach available but rather a good approach that they have the resources to implement easily. That means some variation of bag-of-words, or even little or no pre-processing at all, in order to start experimenting with ML. Good results can be achieved very quickly; improving on them becomes increasingly costly and time consuming, depending on expectations, the use case and available resources. There will come a day in the not-too-distant future when this is no longer true, but it is true for now.
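To illustrate how little machinery a first bag-of-words attempt needs, here is a hypothetical multinomial naïve Bayes coder in plain Python. The product descriptions and codes are invented, and a real study would more likely use a library such as scikit-learn.

```python
# Sketch: bag-of-words text classification with multinomial naive Bayes
# in plain Python. Training texts and codes are invented examples.
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, code). Returns per-code word counts,
    code frequencies and the vocabulary."""
    word_counts = defaultdict(Counter)
    code_counts = Counter()
    vocab = set()
    for text, code in samples:
        words = text.lower().split()
        word_counts[code].update(words)
        code_counts[code] += 1
        vocab.update(words)
    return word_counts, code_counts, vocab

def predict(text, word_counts, code_counts, vocab):
    """Pick the code maximising log P(code) + sum of log P(word|code),
    with add-one (Laplace) smoothing for unseen words."""
    total = sum(code_counts.values())
    best, best_score = None, float("-inf")
    for code in code_counts:
        n = sum(word_counts[code].values())
        score = math.log(code_counts[code] / total)
        for w in text.lower().split():
            score += math.log((word_counts[code][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = code, score
    return best
```

A model of this kind can be trained and evaluated in minutes on an office PC, which is exactly the kind of quick, cheap first experiment the paragraph above argues for.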
Another approach could be to utilise cloud-based ML resources, which do not carry the heavy capital burden of on-premises IT resources, but these come with other challenges, such as governance, security and simply the trust needed to avoid disclosure of sensitive survey data.
Motivation
In order to explore the potential of alternative data sources to those already known in the Official Statistics (Censuses, Surveys and Administrative Records) or to enrich existing projects, several projects were carried out aiming to take advantage of satellite images with Machine Learning (ML) techniques.
This document is intended to summarize the pilot projects carried out by Australia, the Netherlands, Switzerland and Mexico.
Machine Learning involves the automatic discovery of patterns in data using computational algorithms and, from those regularities, carrying out tasks such as the detection of various categories in a training set (Bishop, 2006). This is called Supervised Learning. The pilot projects reported in this document belong to this category. They apply various classification algorithms that seek to relate the implicit or explicit patterns found in data carefully labelled by experts to equivalent patterns in unlabelled data, the intention being that the algorithms identify generalisation rules allowing categories to be assigned to objects that have not been manually analysed. Once an algorithm has assigned its "predicted" category, it is important to evaluate its ability to generalise using previously labelled testing sets that were never used in training, and to report the corresponding performance metrics for each project.
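The evaluation discipline described above, holding out labelled testing sets that are never used in training, can be sketched minimally; the split fraction and seed below are arbitrary illustrative choices.

```python
# Sketch: holding out labelled data for evaluation only. The 20% test
# fraction and the seed are illustrative choices, not project values.
import random

def train_test_split(labeled, test_fraction=0.2, seed=42):
    """Shuffle labelled examples and hold out a test set that is never
    used in training, so reported metrics measure generalisation
    rather than memorisation."""
    data = list(labeled)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]
```

Performance metrics are then computed only on the held-out portion, exactly once, after all training decisions have been made.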
Each country wrote a detailed report of its work and the corresponding experiments; we invite the reader to review the specific details for each country. In this document we present the essential aspects.
Problem to solve
Each Statistical Office established the characteristics of its pilot test, in which satellite images were used in the context of Machine Learning applications to solve specific office problems. As stated by the NSOs themselves, they seek to solve problems related to reducing human intervention in the process of updating the Address Register (AR), to measuring statistical variables such as poverty or the expansion of urban areas, and to detecting change in land use and land cover (LULC). Linking to satellite images generally requires some type of geographically referenced statistical information, as well as field work for validation, which forms the basis for training automatic classification algorithms. The participating countries each have a georeferenced source for such training.
Each country established the main motivation for its pilot test, allowing it to explore the validity of the approach through the execution of the pilot project and a subsequent evaluation against the original motivation once the project is completed. Some countries are still at preliminary stages, so definitive results are not yet available in all cases.
The expectations of the participants involve the need to create a new process that complements the activities of the NSOs or simply to improve existing processes. Either way, progress will be based on the application of Machine Learning techniques to satellite images.
Country | Problem to Solve | Contribution | Value Assessment |
Australia | Use a model of ML to reduce the amount of manual intervention required during regular Address Register (AR) maintenance processes. | Reduce costs (time) by improving the current process that is a resource intensive process. | The number of automatically classified addresses. |
Netherlands | Explore the potential of ML for detecting poverty and population distribution from aerial or satellite imagery. | Learn how to use machine learning to exploit imagery as a new data source in the production of official statistics and assist other countries who do not have income data in measuring poverty from imagery. | A working computer prototype. |
| Switzerland | Facilitate land use and land cover classification and improve change detection. | Improve an existing process to reduce costs (time). At present, internal resources are almost entirely allocated to visual interpretation, at the expense of other activities. | A working computer prototype that demonstrates the innovative potential of the FSO in the use of artificial intelligence to process images. |
| Mexico | Detect the extension of urban areas nationwide using ML. | Reduce time and money. Generate information products that contribute to the cartographic update. It will also be possible to incorporate urban growth data into the population estimation models. Finally, it will be possible to generate new types of statistics that allow observing the evolution of the extension of the cities of Mexico. | Clear objectives with links to potential impacts on existing and future data products. |
This section identifies the institutional priorities that led to the pilot exercise and identifies the stakeholders that support the execution of the exploratory project.
In an ML project with satellite images, it is very important to define the data sources. Each country determined its study region and proceeded to acquire satellite or aerial imagery; this is one of the key aspects of this type of project. Raster-handling capabilities are also required, either specialised software such as Geographic Information Systems (ArcGIS, QGIS) or processing algorithms developed with specialised libraries in programming languages such as Python (Rasterio, RasterFrames) or R (raster). The images used for classification were mainly aerial images with sub-metric resolution (~25 cm per pixel), for which the countries have developed infrastructure and invested in specialised flights generating images appropriate for visual interpretation by experts; in other cases open sources were used, such as Landsat images with a resolution of 30 m per pixel. Each type of image has a specific number of bands (also known as channels) holding the information for each colour. Once the images are available, a segmentation and labelling procedure is carried out, based on the manual work of experts in visual interpretation and/or on field work. In the labelling process, a specific number of classes, depending on the objectives of each country, is identified; this process is expensive in time and money. Hence, ML aims to contribute to automatic labelling in order to ease the workload of the manual processes. The following table shows a summary of the characteristics of the images and the number of classes that were labelled in the pilot tests carried out in each country.
In addition to the images, complementary information from the study area is often used to enrich the characterisation processes: for example, georeferenced information in vector formats such as ESRI Shapefiles or the open GeoPackage format, containing statistical or geographic information that can be the basis for new labels or contribute to classification. It is also possible to incorporate digital elevation models, reticular data in which the pixel values represent elevation above sea level and from which additional information, such as slopes, can be derived.
Data preparation and Feature Extraction
Once the data are available, they are typically processed into a format compatible with the classification algorithms. At this point data augmentation can be performed, which is very common in Deep Learning workflows (pipelines): systematic variations of the original images, for example rotations or changes of scale, expand the number of labelled examples available. This is done to prevent or reduce the chances of the algorithm overfitting when very small data sets are used. In these exercises, all the countries carried out data augmentation to increase the amount of information used by the algorithms. Feature extraction characterises the images through analysis processes, for example texture calculation, shape characterisation or the definition of spectral indices. Sometimes, as in the Mexican pilot test, feature extraction was performed manually, meaning the experts determined the characterisation strategy; the other countries relied on the convolutional layers of deep neural networks for automatic feature extraction.
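The rotation-style augmentation mentioned above can be sketched with plain NumPy; this shows only the simple geometric symmetries, and real pipelines would typically use a dedicated library such as torchvision or the Keras preprocessing layers.

```python
# Sketch of data augmentation for labelled image tiles: systematic
# rotations and flips multiply each example by eight. Pure NumPy
# illustration; real pipelines would use a dedicated library.
import numpy as np

def augment(tile):
    """Return the 8 symmetries (4 rotations, each with an optional
    horizontal flip) of an image tile of shape (H, W, bands)."""
    variants = []
    for k in range(4):  # 0, 90, 180 and 270 degree rotations
        rotated = np.rot90(tile, k)  # rotates the first two axes only
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants
```

Each variant keeps the original label, so a small expert-labelled set is stretched further before training.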
There is a great variety of ML algorithms applied to Earth observation (Ferreira, Iten, & Silva, 2020; Youssef, Aniss, & Jamal, 2020). In the pilot projects, two types of algorithm were used. Australia, the Netherlands and Switzerland used state-of-the-art methods based on convolutional neural networks (CNNs), which operate on basic building blocks (convolution filters, pooling layers) organised into architectures identified in the state of the art or built by data scientists according to the needs of each project. Because these algorithms are complex to train, tools have been developed (TensorFlow, CNTK, PyTorch, Keras) that exploit the computational power of specialised hardware (Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs)). Besides Deep Learning algorithms, Mexico, the Netherlands and Switzerland used more "traditional" ML methods such as Extremely Randomized Trees (ET), Random Forest (RF) and Support Vector Machines (SVM).
ML is iterative and incremental; hence, results can improve as experience is gathered both with the methods and with the specific problem. The countries consider the algorithms used to be well known in the literature; however, as knowledge is gained from applying the methods, customised adjustments can be reached that improve the results achieved so far.
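The basic building blocks mentioned above, convolution filters and pooling layers, can be illustrated in plain NumPy. This is a didactic sketch only; frameworks such as TensorFlow or PyTorch implement these operations far more efficiently.

```python
# Sketch of the two basic CNN building blocks: a 2-D convolution
# filter and a max-pooling layer, in plain NumPy. Didactic only.
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding, stride 1) of a single-band
    image with a small kernel, as used in a convolution filter."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep only the strongest filter
    response in each size x size window."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * size:(i + 1) * size,
                                    j * size:(j + 1) * size].max()
    return out
```

A CNN architecture stacks many such filter and pooling stages, with the filter weights learned from the labelled tiles rather than fixed by hand.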
The results achieved by the countries are at a proof-of-concept stage, with the exception of Australia, which has moved its project into production; it is currently operating in the institution.
Although they are at different stages, all the countries intend to go beyond the demonstration phase. As mentioned, Australia is already there, followed by Switzerland, which is at the validation and integration stage within the established methodologies of the process being improved. Mexico is in talks with key stakeholders to use the results of the pilot test in real models to validate the impact, specifically by taking advantage of the results of the 2020 Population Census to validate the results against field data. The Netherlands is at the stage of transferring the pilot project to a closed environment that protects the confidentiality of the data, in order to test with confidential records and validate the model.
The challenges faced in carrying out the pilot tests were diverse. For Australia, it was crucial to have a solid business case to convince the organisation to launch the project; it was also very important to define the problem as simply as possible so that the goals were achievable, thereby generating value for the organisation reasonably quickly. For the Netherlands, a bottleneck was the lack of specialised hardware (GPUs) in its computing centre to train the convolutional neural network without placing confidential data in open environments; its first exercise was therefore carried out with an open-data variant that validated the proof of concept while specialised equipment for a secure environment was being acquired. Mexico considers that more iterations of the algorithm training process are needed to achieve better results in the validation phases.
Benefits Obtained
The countries were able to generate value in various ways from the pilot exercises conducted.
Australia demonstrated that, on the basis of a solid business case and well-defined work, it was able to push the automatic address classification pipeline into production, processing a large volume of information reliably and quickly, freeing up analysts and allowing efforts to be focused on the most complex cases that previously could not be addressed due to insufficient resources. The Netherlands, Switzerland and Mexico showed that ML algorithms can associate certain statistical variables with aerial or satellite images and later recognise those variables in images that were not in the training set, meaning the algorithms were able to generalise. In their reports, the countries acknowledge that the results achieved will make it possible to create and implement more ML solutions as other application needs are identified, and that collaboration between methodologists and data scientists has been strengthened, which shows the value that ML can bring to the processes established within institutions.
Learned lessons
- Have a solid business case for the ML project.
- Narrow down the problem; do not try to solve a very complex problem, just enough complexity to add value.
- DNN training requires specialised hardware; if models are to be trained on large amounts of confidential data within the secure environments of the institutions, this hardware must be incorporated into the internal computing centres.
- To carry out classification exercises based on satellite and aerial images, it is essential to have high-quality training sets validated by experts in visual interpretation and field work, as well as complementary data sets from administrative records, surveys or censuses.
10. Comparison of results
11. Conclusions
Curzi, G., Modenini, D., & Tortora, P. (2020). Large Constellations of Small Satellites: A Survey of Near Future Challenges and Missions. Aerospace, 7, 133. doi:10.3390/aerospace7090133
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. USA: Springer.
Ferreira, B., Iten, M., & Silva, R. G. (2020). Monitoring sustainable development by means of earth observation data and machine learning: a review. Environmental Sciences Europe, 32, 120. doi:10.1186/s12302-020-00397-4
Holloway, J., & Mengersen, K. (2018). Statistical Machine Learning Methods and Remote Sensing for Sustainable Development Goals: A Review. Remote Sensing, 10, 1365. doi:10.3390/rs10091365
Safyan, M. (2020). Handbook of Small Satellites, Technology, Design, Manufacture, Applications, Economics and Regulation. 1057-1073. doi:10.1007/978-3-030-36308-664
Toth, C., & Jóźków, G. (2016). Remote sensing platforms and sensors: A survey. ISPRS Journal of Photogrammetry and Remote Sensing, 22-36.
Youssef, R., Aniss, M., & Jamal, C. (2020). Machine Learning and Deep Learning in Remote Sensing and Urban Application: A Systematic Review and Meta-Analysis. Proceedings of the 4th Edition of International Conference on Geo-IT and Water Resources 2020, Geo-IT and Water Resources 2020. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3399205.3399224