...
| Panel | ||||||
|---|---|---|---|---|---|---|
| ||||||
Table of Contents: 1. National Statistical Organisation, their role and strengths; 3. Pilot studies and Theme Reports; 8. What worked well and what worked even better |
| Panel | ||||||
|---|---|---|---|---|---|---|
| ||||||
National Statistical Organisations (NSOs) play an increasingly important role in giving government policy makers a true picture of their respective countries. They produce trusted statistics, based on the trust placed in them by the wider public, the respondents and the users of the statistical output. To stay relevant, NSOs have to adapt to and embrace new technologies and data sources to shorten the time between data collection and statistical output. The Covid-19 pandemic and the NSOs' response to it are testimony to this. |
...
| Panel | ||||||||
|---|---|---|---|---|---|---|---|---|
| ||||||||
The three themes agreed by the project, C&C, E&I and Imagery, are all based on classification tasks, with the exception of Imputation. For Editing, records are classified into two classes: the ‘Change’ class, where the data are inconsistent, missing or suspicious looking, and the ‘No-Change’ class, where the data need no further attention and are deemed to be correct or consistent. The sentiment analysis of web-based data is included in the C&C theme because it is a classification task: the data are classified into the chosen sentiment categories. Imagery, an example of big data and alternative data sources, involves classifying satellite or aerial images or their components into classes such as ‘Urban’ and ‘Non-Urban’. All pilot studies used supervised ML, in which the algorithm ‘learns’ from training data that have been labelled, e.g. where the correct code has been assigned manually. The label can be an occupation code assigned to a data record with an occupational text description, or the type of object visible on a satellite image. These labelled training data allow the algorithm, during the learning phase, to recognise rules or patterns in the data without those rules having to be formulated explicitly. New data that have not been labelled can then be fed into the algorithm for it to recognise and categorise. The participants of this project have submitted reports on their respective pilot studies. These were then summarised for each of the three themes in a Theme Report. References to these reports point to the statswiki.unece.org website. All reports will be accessible to the public. Further information can be obtained by contacting unece.org or the authors of the reports. This summary report uses information provided in the pilot study reports as well as in the 3 theme reports.
|
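The supervised learning workflow described above (learn from labelled records, then categorise unlabelled ones) can be illustrated with a minimal sketch. It is not the method of any pilot study: the naive Bayes classifier is chosen here only because it fits in a few lines, and the training texts and the labels `TEACHING` and `HEALTH` are hypothetical stand-ins for real occupation descriptions and coding-frame codes.

```python
import math
from collections import Counter, defaultdict


def train(labelled_records):
    """Learn per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_records:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the most probable label for an unlabelled text."""
    class_counts, word_counts, vocabulary = model
    total_records = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Start from the log prior of the class.
        score = math.log(count / total_records)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training data (text description, manually assigned code).
TRAINING = [
    ("primary school teacher", "TEACHING"),
    ("secondary school maths teacher", "TEACHING"),
    ("registered nurse hospital ward", "HEALTH"),
    ("hospital staff nurse", "HEALTH"),
]

model = train(TRAINING)
print(predict(model, "maths teacher"))
print(predict(model, "ward nurse"))
```

No coding rules are written by hand here; the association between words and codes comes entirely from the labelled examples, which is why the quality of the labels matters so much.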
| Panel | ||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||
“Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e., learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance.”
The given set of data referred to in the above quote is typically a text narrative provided by the respondent to describe their occupation, the industrial activity of a company, injury information of an injured worker, product descriptions scraped from the internet or sentiment text obtained from Twitter. There was only one pilot study with binary classification, the Twitter sentiment classification into positive and negative sentiments. The target classes for this are [‘Positive’, ‘Negative’].
4.1 Summary of C&C pilot studies
Most of these pilot studies fall into the group of multi-class classification tasks. These aim to classify text descriptions into internationally accepted coding frames that offer many target classes, e.g. SIC, SOC, NAICS, NOC or ECOICOP, and to assign an appropriate code from the coding frame. See Table 1 for a list of the coding frames used in the C&C pilot studies. Even though the aim of these studies appears to be mostly the same, the approaches and software solutions used differ, as do their results. The sentiment analysis of Twitter data is a binary classification task where each Twitter message is classified as either Positive or Negative.
Table 1 – Coding Frames
There were 9 pilot studies submitted; of these, 3 are in production, with one of them used as a supporting tool that helps human coders make faster decisions:
The other 6 studies have advanced considerably during the 18 months since the project started. Future plans for all of these include investigating other ML algorithms and improving prediction accuracy or IT infrastructure before embedding this technology into production pipelines. The three operationalised projects have shown that building a “Golden Data Set”, or ground truth, is essential. For this, all labels are manually assigned and are deemed to be correct. Achieving this is very labour intensive, and Subject Matter Experts (SMEs) have to check and re-check the assigned codes. This serves two purposes:
Committing this resource is a challenge for most NSOs and can be a serious blocker in the development of ML solutions. However, such resources should already be committed to assessing the current processes. Even though the success criterion for most of these studies was that ML has to perform at least as well as human coders, the motivation for conducting them was mainly gains in efficiency and timeliness. Accuracy was stated as the objective in only two studies. To achieve this, ML is used to auto-code only predictions made above a set threshold and/or predictions for coding classes that have a high prediction threshold. Predictions for minority classes, the ones that are rarely seen in the data, are then excluded from the ML solution and are still made by human coders. The BLS and Stats Canada solutions use a mix of ML auto-coding and human coding to ensure that the overall accuracy is higher than that of human or machine coding alone. Instead of just showing the overall performance, this approach assesses and demonstrates if, how and where ML is at least as good as, or even better than, human coders. It is important to analyse where to set the prediction threshold: for which groups of records ML works better, where it works well enough to assist humans, and where it does not perform well enough (e.g. minority classes). The ultimate objective is that NSOs end up with a better overall process. Please see
for individual pilot study results. The BLS solution, the most advanced in this group, has the highest proportion of auto-coding, > 85%. This has been made possible by using Neural Networks for the ML algorithm, running on 4 Graphics Processing Units (GPUs) with 3584 cores. The Norwegian solution was the only one using Cloud Computing; it is anticipated that this technology will be used more often, as it offers very powerful IT solutions without capital expenditure, although national privacy and data protection laws have to be considered first. All other studies relied on standard desktop or laptop hardware. |
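The mix of ML auto-coding and human coding described in this section can be sketched as a simple routing rule. This is an illustration only, not the BLS or Stats Canada implementation: the function name `route_predictions`, the threshold of 0.9, the record identifiers and the codes are all hypothetical.

```python
def route_predictions(predictions, threshold=0.9, excluded_classes=frozenset()):
    """Split predictions into auto-coded records and records for human coders.

    predictions: iterable of (record_id, predicted_code, confidence) tuples.
    threshold: minimum confidence for auto-coding (hypothetical default).
    excluded_classes: codes, e.g. minority classes, always left to human coders.
    """
    auto_coded, for_human = [], []
    for record_id, code, confidence in predictions:
        if confidence >= threshold and code not in excluded_classes:
            auto_coded.append((record_id, code))
        else:
            for_human.append(record_id)
    return auto_coded, for_human


# Hypothetical model output: (record id, predicted code, confidence).
PREDICTIONS = [
    ("r1", "A", 0.97),
    ("r2", "B", 0.55),  # below the threshold
    ("r3", "C", 0.95),  # confident, but a minority class
    ("r4", "A", 0.91),
]

auto_coded, for_human = route_predictions(
    PREDICTIONS, threshold=0.9, excluded_classes={"C"}
)
auto_rate = len(auto_coded) / len(PREDICTIONS)
print(auto_coded, for_human, auto_rate)
```

Raising the threshold lowers the share of records that are auto-coded and leaves more to human coders, so the value would in practice be chosen by comparing both routes against the Golden Data Set.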
...
| Panel | ||||||||
|---|---|---|---|---|---|---|---|---|
| ||||||||
Digitisation has driven the growth of big data, and machine learning makes big data more usable. It's easy to see why NSOs are exploring different ways of augmenting their existing decision-making and business operations with alternative data sources and machine learning. For example, the Office for National Statistics (ONS) has invested in infrastructure capable of supporting frictionless access to, and harnessing of, alternative data sources (specifically, administrative data and satellite data). It has also sought to grow its in-house data science capacity through the establishment of the Data Science Campus. The value of traditional approaches to data gathering (such as surveys) and analysis, however, is still high, especially in circumstances where big data isn’t (yet) readily available. Applied machine learning can add value to these traditional approaches by making them more operationally efficient. The UNECE ML project has succeeded in demonstrating that participating NSOs have made advances in the use of ML. It has shown the added value ML can bring to the production of accurate, timely and cost-efficient official statistics.
Summary of outputs
In summary, our findings on the use of ML are as follows. ML should be used for Coding & Classification: we have numerous positive demonstrations on a variety of data sources and contexts, and a few applications are currently in production or near production. ML shows great promise for Edit & Imputation: the demonstrations show a varying degree of positivity, pilots are in advanced stages and plans are in place to move these to production. ML is essential in the use of imagery, especially in the context of increasing access to large amounts of imagery data. There are more advanced developments in this theme group and the Generic Pipeline for Production of Official Statistics Using Satellite Data and Machine Learning
shows the business processes required to develop ML solutions for Imagery. This project concludes in November 2020, but it will carry on as ML 2021 under the guidance of the ONS’ Data Science Campus, continuing the aim of developing and integrating ML solutions in statistical processes at NSOs. It will also follow and support the development of the ML applications not yet operationalised. |
...