The Machine Learning Group 2022 is an international platform that aims to advance the use of ML in the official statistics field. The ML Group 2022 brings together more than 400 people from 35+ different countries and 20+ international organizations and is coordinated by the UK Office for National Statistics Data Science Campus and the UNECE HLG-MOS. The objectives of the Group include: 

  • Knowledge exchange. Joining a group to share knowledge and experience on a topic of common interest on a regular basis (e.g. a study group or discussion group).
  • Research support. Providing feedback and advice to other ML Group members working on their own research projects, or receiving input on your own project.
  • Research collaboration. Working on a common topic or data set with other group members, with the primary aim of delivering an output together.

The ML Group 2022 is focusing its activities on a number of key themes that are important for advancing our understanding of the added value of ML for official statistics and how it can best be integrated into statistical systems. Members have formed small groups to work together on knowledge exchange and research activities aimed at delivering common outputs. The groups are organised and run by the members themselves, who contribute in different ways depending on their interests and the objectives of the group. A summary of group activities can be found in the "ML Group 2022 Theme Group Outputs" panel below.

The monthly ML Group meetings held throughout the year provided a platform where members could share experiences, build connections and keep up to date with new developments (see the “ML Group 2022 Monthly Meeting Presentations” panel below for more information). The Group also ran three Coffee and Coding events, two of which are available (see the "Coffee and Coding Session" panel below).


ML Group 2022 Theme Group Outputs

Theme Group - Web-scraped Data

Led by Statistics Flanders 

The web holds great potential to complement traditional statistical production, as it contains an immense amount of information that is relevant to almost any policy domain. However, transforming web data into trustworthy statistics is not straightforward and involves numerous technical and methodological challenges. In this Theme Group, three organisations (Statistics Flanders, Statistics Poland and the Turkish Statistical Institute) worked in parallel to produce experimental statistics from web-scraped data, identifying companies engaged in AI activity, R&D activity and corporate social responsibility activity, respectively.

Click here for the full report to read more about the implementation studies from the three organisations.

Theme Group - Text Classification

Classifying textual responses into predefined categories (e.g., a job description into the Standard Occupational Classification (SOC), or a Twitter comment into positive/negative sentiment) is one of the common tasks that statistical organisations carry out when producing statistics. Traditionally, this was done manually or through a complex rule-based system, both of which are costly, time-consuming and hard to manage. With advances in natural language processing and machine learning techniques, ML can help statistical organisations carry out this task more efficiently.

This Theme Group provided a knowledge exchange platform for those working on text classification in statistical organisations to share their work, receive feedback from peers and discuss common challenges.

Click here for the full report to read more about the activity of the group in 2022, the key observations made throughout the year and the recommended NLP resources.

Theme Group - Imagery Analysis

Led by Statistics Netherlands and ONS

This group will focus on the use of machine learning for earth observation data. Its objectives are expected to concentrate on capability building and the research projects proposed by members.

Click here for full report

Theme Group - AIS Data

Led by CSO Ireland and Norwegian School of Economics

The Theme Group explored methods for identifying berth areas using AIS data. Because of the sheer volume of data, the raw data was filtered using the H3 index. During the regular meetings, various methods for handling geospatial objects were introduced.

Click here for the cookbook for creating Berth Polygons based on AIS data

Theme Group - Quality of Training Data

Led by Statistics Netherlands

This group explored issues related to the human annotation process and to sampling methods for obtaining representative training sets.

Click here for the final report

Theme Group - Model Retraining

Led by UNECE

The Group examined the key concepts around drift (e.g., data drift and model drift) and methods for monitoring and detecting it (e.g., performance-based and distribution-based approaches). It also discussed the implications for statistical organisations (pros and cons) and the factors that enable monitoring and retraining.
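
To make the distribution-based approach mentioned above more concrete, the following is a minimal sketch that compares a reference sample of one numeric feature with a newer sample using the Population Stability Index (PSI). It is not taken from the group's report: the function name `population_stability_index`, the number of bins, the smoothing constant and the example data are all assumptions chosen purely for illustration.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,      # assumed bin count, for illustration only
    eps: float = 1e-6,     # assumed smoothing constant for empty bins
) -> float:
    """PSI between two samples of one numeric feature (hypothetical helper)."""
    if len(reference) == 0 or len(current) == 0:
        raise ValueError("both samples must be non-empty")

    # Equal-width bins are defined on the reference sample only.
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if reference is constant

    def bin_shares(values: Sequence[float]) -> list:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go to the first or last bin.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Made-up example data: a reference sample and a shifted copy of it.
reference_sample = [i / 100 for i in range(1000)]
shifted_sample = [x + 3 for x in reference_sample]

psi_same = population_stability_index(reference_sample, reference_sample)
psi_shifted = population_stability_index(reference_sample, shifted_sample)
print(f"PSI (same data): {psi_same:.3f}, PSI (shifted data): {psi_shifted:.3f}")
```

A PSI of zero means the binned distributions are identical, and larger values indicate a larger shift. A value above roughly 0.2 is often quoted as a rule of thumb for investigating further, but any threshold would need to be chosen by the organisation concerned. A performance-based approach would instead track a quality metric such as accuracy on newly labelled data over time.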

Click here for the full report

Theme Group - Infrastructure

Led by Statistics Sweden

This group aims to share experiences of statistical organisations in developing open base platforms for ML data processing and analysis that will enable collaboration with external research partners.

Click here for the full report "Building an ML Ecosystem in Statistical Organisations"

ML Group 2022 Monthly Meeting Presentations 

October 26

Florian Dumpert (Federal Statistical Office of Germany)

Workshop on Quality Aspects of Machine Learning (presentation slides)

Javier Oyarzun, Laura Wile (Statistics Canada)

Quality Control of Machine Learning Coding: A Statistics Canada experience (presentation slides)

September 21

Yuhua Li (Cardiff University, UK)

Covariate shift detection based on exponentially weighted moving average (presentation slides)

Riitta Piela (Statistics Finland)

Reaching for MLOps Level 1 at Statistics Finland (presentation slides)

August 31

Summer Wang (Australian Bureau of Statistics)

Raising Survey Response Rates by Using Machine Learning to Predict Gold Providers (presentation slides)

Saeid Molladavoudi (Statistics Canada)

Statistics Canada’s Framework for Responsible ML (presentation slides)

June 15 

Piet Daas (CBS Netherlands)

Using web site texts to identify different types of companies (presentation slides)

David Corney (Full Fact, UK)

How to stop people misusing statistics: Automatic verification of statistical claims (presentation slides)

May 4

Florian Dumpert (Federal Statistical Office of Germany)

Quality Framework for Statistical Algorithms (presentation slides)

April 6

Abel Dasylva (Statistics Canada)

Estimating linkage errors without training data and without assumptions about the interactions among the linkage variables (presentation slides)

Joep Burger (CBS Netherlands)

Convolutional neural networks for learning target variables and extracting image features from Earth Observation (presentation slides)

March 2

Ingmar Weber (Qatar Computing Research Institute)

Using Advertising Data to Model Digital Gender Gaps and Poverty (presentation slides)

Ralf Becker (UN Statistics Division)

Introduction to the new UN Big Data Training Catalogue (presentation slides)

Coffee and Coding Session

January 26

Alex Noyvirt and Claus Sthamer (UK ONS)

  • ML Fundamentals: An overview of what ML does, its basic principles and a demonstration (hybrid event - view recording).
  • ML Applications: Participants got hands-on experience of the technical processes by which ML applications are developed and used (in person only).
  • Applications Deep Dive: Participants were given a deep dive into the application of ML algorithms in practice, together with Python code examples (in person only).

April 27

Tom Wise (UK ONS)

This session covered machine learning foundations and focused on the theory behind these techniques.

November 2

Tabitha Williams and Brittny Vongdara (Statistics Canada)

In this session, Tabitha Williams and Brittny Vongdara from Statistics Canada provided an interactive lesson on using GitHub, and an introduction to Git. 

Topics covered included forking a repository, making a commit, collaboration, and how to avoid uploading your data on GitHub. The session also included the theory and a discussion on the difference between GitHub and Git, what a Git project looks like normally, and best practices.

  • You can view the session recording here
  • The presentation slides here 

ML Group Sprint - Newport, UK

12-14 July

Members of the Machine Learning Group 2022 met at the UK Data Science Campus in Newport, UK on 12-14 July for an in-person sprint. The aim was to accelerate the work of three of this year's theme groups: web-scraped data, model retraining and quality of training data. The meeting was a great opportunity for in-person discussion and knowledge exchange between the three groups.

You can read more about the sprint in this report. 
