The ML project shares ML code and data provided by Statistics Poland so that users can quickly start experimenting with four ML methods on real data. In a typical experiment, the input dataset is split into training, validation and test subsets (a minimal code sketch follows the list below).

  • Training subset is used to fit the model.
  • Validation subset is used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
  • Test subset is used to provide an unbiased evaluation of a final model fit on the training dataset.
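
As a minimal sketch of such a split, assuming a pandas DataFrame loaded from a hypothetical products.csv with illustrative column names product_name and ecoicop (not necessarily those of the shared dataset):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical file and column names; the shared dataset may differ.
    df = pd.read_csv("products.csv")  # columns: product_name, ecoicop

    # Carve out the test subset first (20%), then split the remainder into
    # training (60% of the whole) and validation (20% of the whole).
    train_val, test = train_test_split(
        df, test_size=0.20, random_state=42, stratify=df["ecoicop"])
    train, val = train_test_split(
        train_val, test_size=0.25, random_state=42,
        stratify=train_val["ecoicop"])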

For each product in the test subset, the ML code provides the most likely ECOICOP category from the model built on the training subset and assessed on the validation subset. It also provides, for each prediction, its likelihood, i.e. the probability of it being correct. The code finally produces several statistics to measure this accuracy by comparing the category predicted by the ML model to the ECOICOP category present in the input dataset. The most basic accuracy measure is the number of products whose predicted category equals their input category, divided by the total number of products.
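
For illustration, and continuing with the test subset from the sketch above, this basic measure could be computed as follows; the pred column holding the ML predicted category is an assumed name, not part of the shared code:

    # Share of products whose predicted category equals the input category;
    # "pred" is assumed to hold the ML predicted ECOICOP category.
    accuracy = (test["pred"] == test["ecoicop"]).mean()
    print(f"Overall accuracy: {accuracy:.1%}")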

This accuracy measurement indicates how far from or close to the input categories the predicted categories are. One can look at this number over all products, e.g. the ML predicted categories are 87% accurate, or, more importantly, look at subgroups of products, e.g. ML predictions with a likelihood above 0.9 are 99% accurate.
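
A sketch of the subgroup view, assuming a likelihood column holds the probability attached to each prediction:

    # Accuracy on the subgroup of high-confidence predictions; "likelihood"
    # is an assumed column name for the probability of each prediction.
    confident = test[test["likelihood"] > 0.90]
    subgroup_acc = (confident["pred"] == confident["ecoicop"]).mean()
    share = len(confident) / len(test)
    print(f"Accuracy above 0.90: {subgroup_acc:.1%} on {share:.1%} of products")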

In the numbers mentioned above, 87% is good, but not great; 99% is pretty good, but compared to what? 100%? The next question that usually pops up, in a real situation, is “How accurate are the input categories?”. They are most often the result of a manual coding operation. To adequately assess the added value of ML on accuracy, we need a target, reference or baseline against which to compare it.

In the case of the products and ECOICOP categories provided by Statistics Poland, they clearly state that the products were coded by non-experts and contain a certain percentage of errors or misclassifications. This is obvious when one looks at some of the results that come out of the ML models. One example is when ML predicts that “3in1 360 g coffee powder (20 sachets)” is a “Coffee” product with a likelihood of 0.93545 while the input category indicates that it is a “Tea” product; the latter is wrong. Another example is when two very similar products (A and B) are both predicted by ML to ECOICOP value X, while the input category for product A is X and the input category for product B is Y. Both products should have the same ECOICOP value, so either X or Y is wrong.
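
The second kind of inconsistency can be surfaced mechanically: group identical product descriptions and keep those carrying more than one distinct input category. A sketch, reusing the illustrative column names from above:

    # Identical product descriptions carrying more than one distinct input
    # category: at most one of those categories can be right.
    n_codes = df.groupby("product_name")["ecoicop"].nunique()
    conflicting = n_codes[n_codes > 1].index
    suspects = df[df["product_name"].isin(conflicting)]
    print(f"{len(suspects)} duplicate products with conflicting categories")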

In my first experiments with the shared ML code and data, I was able to identify 528 products that were LIKELY misclassified. I say “likely” because only an expert reviewer can determine the correct or better category value. From these first experiments, I share a second version of the product dataset (ECOICOP alternate data) to which I added the following information:

  • kategoria_alt is the ECOICOP category with the 528 changes applied
  • diffcode with a value of 1 flags the 528 products with an alternate ECOICOP category
  • dup_prod with a value of 1 flags 519 duplicate products in the dataset

The value of this dataset is that it provides products and ECOICOP categories that are closer to reality, that is, with errors. The second ECOICOP value (kategoria_alt) provides a reference against which to compare the accuracy of BOTH the ML predicted categories AND the input categories. In the case of the dataset provided by Statistics Poland, the “assumed” accuracy of the input categories is 97%, after excluding duplicate products. That is still quite a bit higher than the 87% accuracy of the ML predicted categories. However, the ML predicted categories with a likelihood above 0.90 may prove to be more accurate than their input category counterparts. The question is then what proportion of the products have an ML predicted category with a likelihood above 0.90. A large proportion could lead to a significant reduction in cost by accepting the ML predicted category (autocoding) rather than having those products manually classified. A sketch of this comparison follows.
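
As a sketch of that comparison, assuming the alternate dataset is loaded from a hypothetical ecoicop_alternate.csv and the ML predictions (pred, likelihood) have been merged into it; only kategoria_alt, diffcode and dup_prod are actual columns of the shared file:

    alt = pd.read_csv("ecoicop_alternate.csv")  # hypothetical file name

    # Drop duplicate products, then measure both codings against the
    # alternate reference category kategoria_alt.
    kept = alt[alt["dup_prod"] != 1]
    input_acc = (kept["ecoicop"] == kept["kategoria_alt"]).mean()
    ml_acc = (kept["pred"] == kept["kategoria_alt"]).mean()

    # Share of products that could be autocoded at the 0.90 threshold.
    autocodable = (kept["likelihood"] > 0.90).mean()
    print(f"Input categories vs. reference: {input_acc:.1%}")
    print(f"ML predictions vs. reference:   {ml_acc:.1%}")
    print(f"Autocodable share (>0.90):      {autocodable:.1%}")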

The following document - experience in learning ML - describes how I, who knew little about ML, became familiar and comfortable with it using the product data and the code shared on the Statistics Poland GitHub. It describes how ML helped me identify misclassified products. It also presents a simulation of the integration of ML into a manual classification operation to achieve more accurate results at a lower or similar cost.
