| Panel | ||||||
|---|---|---|---|---|---|---|
| ||||||
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community. |
| Panel | ||||||
|---|---|---|---|---|---|---|
| ||||||
Generic Pipeline for Production of Official Statistics Using Satellite Data and Machine Learning |
| Panel | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Large volumes of satellite data are now available. One example is the collection announced by NASA in March 2015, which gives public access to the complete set of Landsat 8 satellite images, with 30-meter resolution, in the Amazon cloud. This facilitates access to satellite data covering the entire Earth: the satellite images the whole globe every 16 days, which generates approximately 8 terabytes of information per year. NASA also offers access to MODIS images with a resolution of 500 meters, which provide a complete image of the entire Earth on a daily basis. Sentinel-2 images, which have a resolution of 10 meters and cover the Earth every 5 days, are also accessible. The availability of satellite information continues to grow (Toth & Jóźków, 2016). Today, private companies operate constellations of nanosatellites capable of imaging the entire Earth daily at a resolution of 3-5 meters (Curzi, Modenini, & Tortora, 2020; Safyan, 2020). The wide availability of free and commercial satellite images opens opportunities to exploit these sources of information through machine learning methods. At the same time, demand is growing for information on the monitoring of natural resources and on statistical variables that can be observed in images, such as the growth of urban areas. This demand is evident in international initiatives such as the United Nations document "Transforming our world: the 2030 Agenda for Sustainable Development", which establishes a wide-ranging action plan in favor of people, the planet and prosperity across the three dimensions of sustainable development: economic, social and environmental. This wide reach is achieved through 17 Sustainable Development Goals (SDGs) and their corresponding targets. In March 2016, the indicators that allow continuous monitoring of progress toward these goals were first defined by the member countries during the meeting of the Inter-agency and Expert Group on SDG Indicators (IAEG-SDGs) of the United Nations Statistical Division. Some of the indicators have significant potential to be estimated by processing large volumes of satellite images with computer vision and machine learning techniques (Holloway & Mengersen, 2018). This report therefore presents the results of four pilot projects, carried out by Australia, the Netherlands, Switzerland and Mexico. |
...
Introduction
The purpose of the HLG-MOS Machine Learning Project is to advance the use of machine learning in official statistics. To this end, much of the initial work focused on demonstrating value through pilot projects (see Work Package 1). Although the results of many of these projects show great promise, the path from pilot project to production system is far from trivial: despite the dozens of participants in this project, only a few currently report using machine learning in production. One challenge is demonstrating the methodological suitability of these techniques, which is the focus of Work Package 2 (Quality Framework). The purpose of Work Package 3 (WP3) is to identify and address the remaining challenges to integration and production deployment. The WP3 team pursued two activities to further this goal: a short online questionnaire designed to give a high-level overview of the key challenges and successes, and a deeper investigation into 6 key questions.
The results are presented in this report. |
| Panel | ||||||
|---|---|---|---|---|---|---|
| ||||||
Our online questionnaire was designed and administered using SurveyMonkey. All members of the HLG-MOS Machine Learning Project were encouraged to participate and to forward the survey to colleagues with relevant expertise. Between September 15 and October 15, 2020, 28 responses were collected; they form the basis of this report. The questionnaire remains available online at https://www.surveymonkey.com/r/6G5VVFH, and additional responses are welcome and may be incorporated into future products.
Our 28 respondents include representatives of national statistical organizations covering 14 countries and regions, all in either North America or Europe. Most report a role of “Statistician / Data Scientist”, followed by “Analyst / Subject Matter Expert” and “Manager / Policy Maker”. Only one respondent reported a role of “Software Engineer / Information Technology Specialist”. Most respondents (54%) also report belonging to large national statistical organizations, defined as those having more than 2000 employees, followed by 32% in the next largest grouping, between 500 and 2000 employees. See the appendix for details.
What are the biggest challenges facing statistical agencies in ML?
Our questionnaire divides this into two sub-questions, one asking about “organizational issues” and the other about “technical issues”. Among organizational issues, “coordination between internal stakeholders” ranked among the largest challenges, with 16/27 respondents (59%) reporting that it moderately limits, severely limits, or prevents use.
Among technical issues, “availability of staff with appropriate machine learning algorithm skills” was the most limiting factor, with 10/28 respondents (36%) reporting that it severely limits use. The average score of 1.8 makes this the most problematic issue identified in our survey.
Our survey ends with a question about which activities have been most useful. Collaboration with other statistical organizations ranks as the most useful, with 14/28 respondents indicating it is “very useful”, followed by external training programs, with 10/28 indicating “very useful”. |
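The percentages quoted in this section can be re-derived from the reported counts. The following is a minimal Python sketch, assuming the percentages were rounded to the nearest whole number; the dictionary labels and the helper name `share_percent` are illustrative only and are not part of the survey instrument.

```python
# Counts as reported in the text: (respondents choosing the option, respondents answering).
# The labels are shorthand introduced for this sketch, not the survey's own wording.
reported = {
    "coordination between internal stakeholders (moderately limits or worse)": (16, 27),
    "staff with ML algorithm skills (severely limits)": (10, 28),
    "collaboration with other statistical organizations (very useful)": (14, 28),
    "external training programs (very useful)": (10, 28),
}


def share_percent(count: int, total: int) -> int:
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)


# Percentage share for every reported item.
shares = {label: share_percent(count, total) for label, (count, total) in reported.items()}

for label, pct in shares.items():
    print(f"{pct:>3}%  {label}")
```

Under that rounding assumption, the first two items reproduce the 59% and 36% figures given in the text. The two "very useful" counts correspond to 50% and 36% of the 28 respondents.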
| Panel | ||||||
|---|---|---|---|---|---|---|
| ||||||
While the short-form questionnaire gives us a high-level overview of the challenges and potential solutions, it cannot tell us much about the details. To complement this information, we asked project participants to describe how they were addressing six key questions. We received detailed responses from 4 organizations: the UK Office for National Statistics (ONS), the Australian Bureau of Statistics (ABS), Statistics Flanders, and the U.S. Bureau of Labor Statistics (BLS), along with related comments from many others. The questions and a high-level overview of the responses are below.
Where should machine learning fit in a statistical organization?
Participants indicated 4 broad approaches:
What should the machine learning pipeline look like with regard to organizational structure? Where should projects start, and who should control which aspects, and when?
Interestingly, the responses to this question produced two seemingly opposite ideas. One set of responses emphasized the importance of starting with a business need, moving to R&D, producing a prototype and then bringing in other areas such as IT. The second set, however, emphasized the importance of building ML experience first, which in turn allows one to identify suitable business problems that might be solved by machine learning. In hindsight, it is clear that both are needed. An organization cannot determine whether machine learning is suitable if it knows nothing about machine learning, but it is also clear that the ultimate goal is to serve business needs.
What machine learning skills are needed and where are they needed in the organization?
On this question, there was general agreement among the responses. In organizations that distribute machine learning responsibilities across many divisions, machine learning requires new skills in many areas. Specifically:
An alternative approach is to centralize all or most of these functions in one or several “data science experts”, who assume ownership of many of these aspects simultaneously. This limits the amount of coordination and communication that must occur, but requires individuals with a broad range of skills.
How can organizations efficiently acquire the ML skills they need?
Responses identified 4 strategies.
How should organizations demonstrate and communicate the added value of ML techniques?
One of the recurring challenges of working on projects involving many parties is the need to convince others to adopt or support new techniques. This is borne out both by numerous anecdotes from participants in the ML group and by questionnaire responses indicating coordination and resistance issues with internal stakeholders. Responses identified 3 potential strategies.
How should statistical organizations identify the right problems for machine learning?
Our investigation uncovered 3 strategies.
|


