...
The purpose of the HLG-MOS Machine Learning Project is to advance the use of machine learning in official statistics. To this end, much of the initial work focused on demonstrating value through pilot projects (see Work Package 1). Although the results of many of these projects show great promise, the path from pilot project to production system is far from trivial: despite the dozens of participants in this project, only a few members report currently using machine learning in production. One challenge is demonstrating the methodological suitability of these techniques, which is the focus of Work Package 2 (Quality Framework). The purpose of Work Package 3 (WP3) is to identify and address the remaining challenges to integration and production deployment.

The WP3 team pursued two activities to further this goal: a short online questionnaire designed to give a high-level overview of the key challenges and successes, and a deeper investigation into 6 key questions. The results are presented in this report.
Our online questionnaire was designed and administered using SurveyMonkey. All members of the HLG-MOS Machine Learning project were encouraged to participate and to forward the survey to colleagues with relevant expertise. Between September 15th and October 15th of 2020, 28 responses were collected; they form the basis of this report. The questionnaire remains available online at https://www.surveymonkey.com/r/6G5VVFH, and additional responses are welcome and may be incorporated into future products.

Our 28 respondents include representatives of national statistical organizations covering 14 countries and regions, all in either North America or Europe. Most report having a role of “Statistician / Data Scientist”, followed by “Analyst / Subject Matter Expert” and “Manager / Policy Maker”. Only one respondent reported a role of “Software Engineer / Information Technology Specialist”. Most respondents (54%) report belonging to large national statistical organizations, defined as those having more than 2000 employees, followed by 32% in the next largest grouping, between 500 and 2000 employees. See the appendix for details.

What are the biggest challenges facing statistical agencies in ML?

Our questionnaire divides this into two sub-questions, one asking about “organizational issues” and the other about “technical issues”. Among organizational issues, “coordination between internal stakeholders” ranked among the largest challenges, with 16/27 respondents (59%) reporting that it moderately limits, severely limits, or prevents use.

Among technical issues, “availability of staff with appropriate machine learning algorithm skills” was the most limiting factor, with 10/28 respondents (36%) reporting that it severely limits use. The average score of 1.8 makes this the most problematic issue identified in our survey.

Our survey ends with a question about which activities have been most useful. Among the options presented, “collaboration with other statistical organizations” ranked as the most useful, with 14/28 respondents indicating it is “very useful”, followed closely by external training programs with 10/28 indicating “very useful”. See the appendix for additional details on the survey results.
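The shares quoted in the survey results are simple ratios of respondents, and the denominator differs between items (27 for the coordination item, 28 elsewhere). As a minimal sketch, the percentages can be reproduced from the counts reported in the text. The dictionary keys and the `share_percent` helper are illustrative names, not part of the survey tooling, and rounding to the nearest whole percent is an assumption about how the report's figures were derived.

```python
# Counts as reported in the text:
# (respondents choosing the answer, respondents answering the item).
# The short keys are illustrative labels, not the questionnaire's wording.
reported = {
    "coordination_limits_or_prevents": (16, 27),
    "ml_staff_severely_limits": (10, 28),
    "collaboration_very_useful": (14, 28),
    "external_training_very_useful": (10, 28),
}


def share_percent(count, total):
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)


shares = {label: share_percent(c, n) for label, (c, n) in reported.items()}

for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

Keeping the count and the denominator together for each item avoids silently applying 28 to a question that only 27 respondents answered.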
While the short-form questionnaire gives us a high-level overview of the challenges and potential solutions, it lacks detail. To complement this information, we asked project participants to describe how they addressed six key questions. We received detailed responses from 4 organizations, the UK Office for National Statistics (ONS), the Australian Bureau of Statistics (ABS), Statistics Flanders, and the U.S. Bureau of Labor Statistics (BLS), and related comments from many others. The questions and a high-level overview of the responses are below.

Where should machine learning fit in a statistical organization?

Participants indicated 4 broad approaches:

What should the machine learning pipeline look like with regard to organizational structure? Where should projects start, and who should control which aspects, and when?

Interestingly, the responses to this question produced two seemingly opposite ideas. One set emphasized the importance of starting with a business need, moving to R&D, producing a prototype, and then bringing in other areas like IT. The other emphasized the importance of building ML experience first, which in turn allows one to identify suitable business problems that might be solved by machine learning. In retrospect, it is clear that both are needed. An organization cannot determine whether machine learning is suitable if it knows nothing about machine learning, but it is also clear that the ultimate goal is to serve business needs.

What machine learning skills are needed and where are they needed in the organization?

On this question, there was general agreement among the responses. In organizations that distribute machine learning responsibilities across many divisions, machine learning requires new skills in many areas. Specifically:

An alternate approach is to centralize all or most of these functions in one or several “data science experts”, who assume ownership over many of these aspects simultaneously. This limits the amount of coordination and communication that must occur, but requires individuals with a broad range of skills.

Because of the difficulty of coordinating broadly distributed activities, another increasingly popular approach is to rely on positions and operational units that blur the distinctions between research, methodology, information technology, and subject matter. See, for example, “Google’s Hybrid Approach to Research” and “Data Scientist: The Sexiest Job of the 21st Century”. In some organizations, a data scientist spends some of their time researching and evaluating different machine learning solutions to a problem (R&D, methodology), some of it building and running the model in production (IT), and some of it assisting with use and maintenance (subject matter). This blurring of boundaries reduces the extent to which machine learning skills need to be distributed across the organization, but requires individuals and teams with a broad range of skills, as well as the organizational and IT infrastructure necessary to make it work.

How can organizations efficiently acquire the ML skills they need?

Responses identified 4 strategies:

How should organizations demonstrate and communicate the value added by ML techniques?

One of the recurring challenges of working on projects involving many parties is the need to convince others to adopt or support new techniques. This is supported both by numerous anecdotes from participants in the ML group and by questionnaire responses indicating coordination and resistance issues from internal stakeholders. Responses identified 3 potential strategies.

How should statistical organizations identify the right problems for machine learning?

Our investigation uncovered 3 strategies.
I would like to thank the members of the Work Package 3 group for the many helpful inputs used to produce this report, especially Jenny Pocknee (ABS), Eric Deeben and Oliver Mahoney (ONS), Michael Reusens (Statistics Flanders), Isaac Ross (Statistics Canada), and Krystyna Piatkowska and Marta Kruczek-Szepel (Statistics Poland).






