6. Key aspects to accepting ML solutions
Alignment with business needs.

Ultimately, ML solutions must be accepted by the people responsible for producing data (usually subject-matter analysts) and, more importantly, by those who use the data. Like any approach or technology, ML is a means to an end. It should not be considered or adopted simply for what it is, but for what it can do to better address business needs (increased relevance, detail, timeliness, accuracy, cost efficiency, etc.). The pilot studies generally focussed on improving timeliness and accuracy in three statistical processes (see: pilot study reports). Applications of ML that address other business needs in other processes abound (see: Other applications of Machine Learning for some examples).

One of the characteristics that contributed to the delivery of the project was its hands-on approach. Early on, the project pondered the idea of producing a “cookbook” to guide the use of ML. The project worked on developing some of the basic elements of a good recipe (frameworks and good practices) while many participants learned by experimenting with a mix of ingredients (pilot studies) to satisfy the nutritional needs of their respective organisations. Going forward, it is important that the developments in the group continue to be based on the needs of participating organisations (applications) and that they provide input to the more foundational aspects (see the “Supported by” row in Table 1 in the appendix).

Guidance from a quality framework.

ML solutions must contribute to good or better-quality results towards fulfilling business needs. To do this, one needs to define what “quality” is. Definitions of quality are provided in many widely accepted quality frameworks developed by national and international statistical organisations.
The Quality Framework for Statistical Algorithms (QF4SA) supplements these frameworks and focuses on the aspects that are most prominent in the acceptance of ML solutions (see: Quality Framework for Statistical Algorithms). The QF4SA provides guidance on the choice of algorithms (including traditional algorithms) for the production process. It purposely uses the term “statistical algorithm”, which covers both the traditional and modern methods typically used by official statisticians, to strengthen mutual comprehension between proponents of each type. There is no set formula to ascertain that results from ML solutions are good enough or better than the alternatives. As with most quality frameworks, the QF4SA proposes five dimensions that must be considered jointly. One may choose to place more emphasis on one or two dimensions, but none should be ignored.

The QF4SA was developed while the pilot studies were being conducted. The studies supplied key input and inspiration to the framework but did not have enough time to formally experiment with the practices recommended in it. In keeping with the hands-on approach that several participants cited as one of the project’s qualities, it is recommended, going forward, that such experimentation take place in pilot studies to provide valuable feedback towards improving and expanding the framework.

Demonstration of added value.

Most of the pilot studies focused on this aspect. On coding and classification, the studies demonstrated that ML can deliver better quality than a strictly manual operation. The common challenge faced by the pilot studies was the lack of a statistically sound baseline against which to compare ML results. Consequently, many studies started with the goal of replicating an existing operation, e.g., producing the same product classes as a manual classification operation, and focussed the added value on timeliness and, indirectly, cost. There are three serious issues with this goal.
First, the accuracy of the existing (or competing) operation is often either not known or not supported by a sound assessment method. Second, ML can never fully replicate another operation: studies conducted in the project show that ML solutions can reproduce the results of existing (or traditional) operations at levels ranging from 40% to 85%. Third, and more importantly, the goal should not be limited to replicating another operation, unless ML can do so much more quickly and at a significantly lower cost; the goal should be to improve the operation by combining the respective strengths of ML and of the existing operation. In the context of a classification operation, this could mean using ML predictions to automatically assign a class (for predictions known to be very accurate, e.g., over 98%); using the accurate-but-not-quite-accurate-enough predictions to aid coders; and ignoring the not-good-enough predictions, relying on coders to classify the rest (often the less common classes). Variants of this strategy are used in production (see: Workplace injury and illness; Industry and occupation; Standard Industrial Classification). An experiment on this strategy using the ML code and data shared by the project was also conducted (see: A user's experiences with the ML code and data shared; Shared code; Product description dataset).

On edit and imputation, the studies showed results ranging from no added value (a simple imputation method did better than the other options) to promising. There are no indications that ML methods cannot work; they may require less programming and be quicker to implement than current methods. On the downside, creating and maintaining good training data for such algorithms is a challenge, and explaining what the algorithms produce and how they produce it, even when it is quicker or more accurate, can be very difficult, which makes acceptance by stakeholders harder.
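The confidence-based triage strategy for classification can be sketched in a few lines. This is a minimal illustration, not code from the pilot studies: the thresholds (0.98 and 0.80), the record format, and the function name are assumptions chosen for the example.

```python
# Hypothetical triage of ML classification predictions by confidence.
# Thresholds and data shapes are illustrative assumptions, not values
# taken from any of the pilot studies.

def triage(predictions, auto_threshold=0.98, assist_threshold=0.80):
    """Split predictions into auto-coded, coder-assist, and manual queues.

    `predictions` is a list of (record_id, predicted_class, confidence)
    tuples, where `confidence` is the model's score for the prediction.
    """
    auto_coded, coder_assist, manual = [], [], []
    for record_id, predicted_class, confidence in predictions:
        if confidence >= auto_threshold:
            # Highly reliable prediction: assign the class automatically.
            auto_coded.append((record_id, predicted_class))
        elif confidence >= assist_threshold:
            # Useful but not reliable enough: show as a suggestion to coders.
            coder_assist.append((record_id, predicted_class))
        else:
            # Too uncertain: route to manual classification, no suggestion.
            manual.append(record_id)
    return auto_coded, coder_assist, manual
```

In practice, the thresholds would be set by measuring prediction accuracy per confidence band on independent evaluation data, so that the auto-coded queue meets the operation's accuracy target.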
More studies and foundational developments (such as Hints and ideas on data cleaning) are needed to guide the use of ML in this area and to determine the characteristics of a favourable context in which to apply it.

From the beginning of the project, it was thought that ML is essential to exploiting large volumes of data in an efficient manner. This was confirmed in the pilot studies on the analysis of imagery data (satellite and aerial photography). As access to large volumes of such data increases, one of the challenges is to provide users with information on the complex processes needed to exploit them correctly and efficiently, including when machine learning is to be called upon. To provide some of this information, the project proposes a generic pipeline for producing official statistics using satellite data and machine learning (see: Generic pipeline). The pipeline was used to describe two studies on satellite and aerial imagery (see: Aerial Image Address Use Classification and Integrating EO with Official Statistics using Machine Learning).

Going forward, organisations are encouraged to continue advancing their current ML developments towards operationalisation, and to do so while continuing to collaborate and share with others. The development could be broadened to other areas of interest to organisations (see: Other applications of Machine Learning for some examples), particularly business needs that are labour-intensive, stable over time, and offer considerable data to train the algorithms. These developments should consider piloting some of the practices from the QF4SA and integration documents and return valuable feedback based on their experiences.

Robust performance over time.

The pilot studies focussed on assessing the added value of different ML algorithms and identifying the best model (algorithm and parameters) based on the data they had. As stated before, there are still many challenges in bringing a demonstrated ML solution into production.
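One such production challenge, detecting when a deployed model's performance has degraded, can be sketched as a periodic check against an independently labelled evaluation sample. This is a minimal sketch under assumptions: the 0.90 accuracy floor, the function names, and the data shapes are all hypothetical, not prescriptions from the project.

```python
# Hypothetical post-deployment check: re-evaluate the model on a fresh,
# independently labelled evaluation sample and flag when accuracy falls
# below an acceptance floor. The floor value (0.90) is an assumption.

def needs_retraining(model_predict, eval_records, min_accuracy=0.90):
    """Return True when accuracy on the evaluation set falls below the floor.

    `eval_records` is a list of (features, true_label) pairs labelled
    independently of the training data (e.g., a "gold standard" sample).
    `model_predict` maps features to a predicted label.
    """
    correct = sum(1 for features, true_label in eval_records
                  if model_predict(features) == true_label)
    accuracy = correct / len(eval_records)
    return accuracy < min_accuracy
```

Running such a check on each new evaluation sample turns the open question of "when to refresh the model" into a measurable trigger, provided the evaluation data stays independent of the training data.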
As importantly, the ML solution must, over time, not only continue to perform as well, but perform even better as it “learns” more and adapts to the evolving data being fed into it. As investigations into ML solutions advanced in the project, participants asked when to update or refresh the ML algorithms and/or their parameters, how frequently, and how to proceed. Comparable questions are raised in other applications of ML. Only one application in the project has been in production long enough to provide extensive experience in updating ML algorithms (see: Workplace injury and illness).

The central element in putting in place and maintaining an efficient ML solution is the data used for training, not only at the start, when settling on the initial algorithm and its parameters, but throughout the use of the ML solution. Another key element is the data used for evaluation, sometimes referred to as the “gold standard”. This data is needed to assess not only how the ML algorithm performs but how the entire operation performs, as it usually includes some clerical steps, and it must be independent of the training data. These data are essential, but they usually come at a significant cost and must meet certain requirements, e.g., collection of ground-truth data, or texts classified by subject-matter experts. The project has gained a good appreciation of the value and the characteristics of good training and evaluation data. Going forward, it is recommended to document and share this knowledge.

Respect of ethical and legal considerations.

“Machine Learning has become more powerful over the past decade, sparking an expansion of new applications. Some of these applications fall within the social domain, in which models based on data profiles can have a significant impact on the life of individuals.
To prevent unwanted discrimination in these models, different methods have been proposed within the field of algorithmic fairness.” This text is taken from the abstract of a working paper from the Statistics Netherlands Center for Big Data Statistics (see: fair algorithms in context). The excerpt stresses the importance of ethical issues. They were raised in several discussions during the project but were not addressed specifically. Going forward, these issues could be addressed either by the ML group or in collaboration with other working groups looking at them in a broader context. In these developments, it will be important to distinguish issues about the data sources from issues about the methods used to exploit them. It will also be important to focus on issues specific to official statistics, whose outputs are mostly aggregates rather than decisions about individuals (such as being accepted for a loan or receiving a medical diagnosis, the examples most often raised in discussions).

Development on solid scientific grounds from many disciplines.

National and international official statistical organisations have long produced relevant and trusted information because it is based on sound methods and processes. When ML methods are developed and implemented on the same basic principles, they go a long way towards addressing the aspects above and being accepted. The science needed to underpin the processes encompasses knowledge and skills in many disciplines: subject matter, statistics, informatics, methodology, data science, and operations. Compared to traditional methods, these disciplines need to work even more closely together from start (fleshing out an idea and connecting it with a business need) to finish (operationalisation). This is particularly the case for subject-matter knowledge: ML is not just another solution that has to work for subject-matter business needs, but a solution that particularly needs subject-matter knowledge to work.
While the idea to use ML can come from a single individual (as in some of the pilot studies), the development of this idea needs to quickly involve other disciplines, notably subject matter, to advance correctly and efficiently. Borrowing from the quality aspect above: one may count on one or two particular disciplines, but none should be left out. This is highlighted in an experiment with the ML code and data shared by the project, conducted by someone with limited knowledge of ML who, along the way, learned many lessons and made many mistakes (see: A user's experiences with the ML code and data shared). The ML project benefited from having experts from many disciplines, which allowed participants to learn and share different perspectives and aspects to consider in developing, assessing, and advancing ML solutions. Going forward, the ML group should seek more participants in data science, subject matter, and IT, depending on the group’s plans and goals. Statistical organisations will continue to face the challenge of acquiring, developing, and organising the varied expertise needed to use ML effectively and efficiently towards their business needs. The acquisition and development (e.g., training) of expertise was the most pressing need expressed in a poll conducted during the webinar (see: ML Webinar; Poll results). This aspect is further discussed below under facilitating ML solutions.