The Machine Learning Group 2022 is an international platform that aims to advance the use of machine learning (ML) in official statistics. The ML Group 2022 brings together more than 400 people from 35+ countries and 20+ international organizations, and is coordinated by the UK Office for National Statistics Data Science Campus and the UNECE HLG-MOS. The objectives of the Group include:
Knowledge exchange. Joining a group (e.g. a study group or discussion group) to share knowledge and experience on a topic of common interest on a regular basis.
Research support. Providing feedback and advice to other ML Group members working on their own research projects, or receiving input on your own project.
Research collaboration. Working on a common topic or data set with other group members, with the primary aim of delivering the output together.
The ML Group 2022 is focusing its activities on a number of key themes which are important to advancing our understanding of the added value of ML for official statistics and how it can best be integrated into statistical systems. Members have formed small groups to work together on knowledge exchange and research activities aimed at delivering common outputs. The groups are organised and run by members themselves, who contribute in different ways depending on their interests and the objectives of the group. A summary of group activities can be found in the "ML Group 2022 Theme Group Outputs" panel below.
The monthly ML Group meetings held throughout the year provided a platform where members could share experiences, build connections and keep up to date with new developments (see the “ML Group 2022 Monthly Meeting Presentations” panel below for more information). The Group also ran three Coffee and Coding events, two of which are available (see the "Coffee and Coding Session" panel below).
The web holds great potential to complement traditional statistical production, as it contains an immense amount of information relevant to almost any policy domain. However, transforming web data into trustworthy statistics is not straightforward and poses numerous technical and methodological challenges. In this Theme Group, three organisations (Statistics Flanders, Statistics Poland and the Turkish Statistical Institute) worked in parallel on experimental statistics based on web-scraped data, identifying companies engaged in AI activity, R&D activity and corporate social responsibility activity respectively.
Classifying textual responses into predefined categories (e.g., a job description into the Standard Occupational Classification (SOC), or a Twitter comment into positive/negative sentiment) is one of the common tasks that statistical organisations carry out when producing statistics. Traditionally, this was done manually or through a complex rule-based system, both of which are costly, time-consuming and hard to manage. With advances in natural language processing (NLP) and machine learning techniques, ML can help statistical organisations carry out this task more efficiently.
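As a rough illustration of what ML-based text classification can look like, the sketch below trains a small multinomial Naive Bayes classifier using only the Python standard library. It is not the method used by any member organisation: the training sentences are made up, and the labels ("teaching", "health", "it") are placeholders rather than real SOC codes.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen in training carry no information and are dropped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples; labels are placeholders, not real SOC codes.
TRAIN = [
    ("teaches mathematics to secondary school pupils", "teaching"),
    ("primary school teacher planning lessons", "teaching"),
    ("nurse providing patient care in a hospital ward", "health"),
    ("doctor treating patients in a clinic", "health"),
    ("develops software and writes code", "it"),
    ("programmer maintaining web applications and databases", "it"),
]

model = NaiveBayesTextClassifier().fit(*zip(*TRAIN))
prediction = model.predict("hospital nurse looking after patients")
```

In practice a production system would be trained on far more labelled responses and would normally rely on an established library, but the basic pattern (learn word statistics per category, then score new text against each category) is the same.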
This Theme Group provided a knowledge exchange platform for those working on text classification in statistical organisations to share their work, receive feedback from peers and discuss common challenges.
See the full report to read more about the activity of the group in 2022, the key observations made throughout the year and the recommended NLP resources.
Theme Group - Imagery Analysis
Led by Statistics Netherlands and ONS
This group will focus on the use of machine learning for earth observation data. Its objectives are expected to concentrate on capability building and the research projects proposed by members.
Led by CSO Ireland and the Norwegian School of Economics
The Theme Group explored methods to identify berth areas using AIS (Automatic Identification System) data. Because of the large volume of data, the raw data was filtered using the H3 index. During the regular meetings, various methods for handling geospatial objects were introduced.
The work continues until Q1 2023; the results will be made available afterwards.
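The idea of reducing a large volume of AIS positions with a spatial index can be sketched as follows. This is only an assumption about the general approach, not the group's actual method: it uses a plain square latitude/longitude grid as a stand-in for H3 hexagonal cells so that it runs without third-party libraries, and the speed threshold, cell size, minimum count and sample pings are all hypothetical.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to a square grid cell id (a stand-in for an H3 cell)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def candidate_berth_cells(pings, max_speed_knots=0.5, min_pings=3):
    """Return cells that contain at least `min_pings` near-stationary pings."""
    slow = Counter(
        grid_cell(lat, lon)
        for lat, lon, speed in pings
        if speed <= max_speed_knots
    )
    return {cell for cell, n in slow.items() if n >= min_pings}


# Made-up pings as (latitude, longitude, speed over ground in knots).
PINGS = [
    (53.3461, -6.1952, 0.1),
    (53.3463, -6.1957, 0.0),
    (53.3466, -6.1951, 0.2),
    (53.3300, -6.1000, 12.4),
    (53.3100, -6.0500, 11.8),
]
cells = candidate_berth_cells(PINGS)
```

Aggregating pings to cell identifiers first means later steps work on a small set of cells rather than on every raw position, which is the main benefit of indexing the data before analysis.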
Theme Group - Quality of Training Data
Led by Statistics Netherlands
This group explored issues related to the human annotation process and sampling methods for obtaining representative training sets.
The Group examined the key concepts around drift (e.g., drift in data, drift in models) and methods for monitoring and detecting it (e.g., performance-based approaches, distribution-based approaches). It also discussed the implications for statistical organisations (pros and cons) and the factors that enable monitoring and re-training.
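The two families of detection methods mentioned above can be illustrated with a minimal sketch. The distribution-based check uses the population stability index (PSI) as one possible measure, and the performance-based check simply compares accuracy on newly labelled data against a baseline. The choice of PSI, the number of bins and the sample data are assumptions made for illustration, not the group's recommendations.

```python
import math


def population_stability_index(reference, current, n_bins=5):
    """Distribution-based check: PSI between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def accuracy_drop(baseline_accuracy, y_true, y_pred):
    """Performance-based check: fall in accuracy on newly labelled data."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return baseline_accuracy - correct / len(y_true)


# Made-up feature values: the training-time sample and a shifted later sample.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
drop = accuracy_drop(0.9, [1, 0, 1, 1], [1, 0, 0, 0])
```

A PSI near zero suggests the feature distribution is unchanged, while larger values suggest drift; the threshold that should trigger re-training depends on the application and has to be chosen by the organisation. The distribution-based check needs no new labels, whereas the performance-based check does, which is one of the practical trade-offs between the two.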
Tabitha Williams and Brittny Vongdara (Statistics Canada)
In this session, Tabitha Williams and Brittny Vongdara from Statistics Canada provided an interactive lesson on using GitHub, and an introduction to Git.
Topics covered included forking a repository, making a commit, collaboration, and how to avoid uploading your data to GitHub. The session also covered theory, including the difference between GitHub and Git, what a typical Git project looks like, and best practices.
Members of the Machine Learning Group 2022 met at the UK Data Science Campus in Newport, UK on 12-14 July for an in-person sprint. The aim was to accelerate the work of three of this year's theme groups: web scraping data, model retraining and quality of training data. The meeting was a great opportunity for in-person discussions and knowledge exchange between the three groups.