Quality Assurance for Artificial Intelligence

Part 2: “She’s a model and she’s looking good”

Michael Perlin
7 min read · Mar 8, 2020

This is the second in my series about quality assurance for systems which leverage machine learning. In the first part, we introduced some general concepts and took a look at practices for ensuring the quality of data which goes into the model. Now we assume the data is clean, the model candidate is built and the data science team is eager to release it into production. What should be done to assure its accuracy?

The test set

The obvious way is to compare your model’s predictions with the truth. For this, you split the data set into a training set for training your model and a test set for evaluation. Once the model is trained, you apply it to every record in the test set and compare the predictions with the actual values.
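
As a rough illustration of what this comparison can look like, here is a minimal sketch in Python using scikit-learn and synthetic placeholder data; the feature names and the random-forest model are assumptions for the example, not the article’s actual setup.

    # Minimal sketch: train on one part of the data, predict on the held-out part,
    # and put predictions next to the truth. All data here is synthetic placeholder data.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"weekday": rng.integers(0, 7, 365),
                       "temperature": rng.normal(15, 8, 365)})
    df["demand"] = 100 + 10 * (df["weekday"] >= 5) - df["temperature"] + rng.normal(0, 5, 365)

    X, y = df[["weekday", "temperature"]], df["demand"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

    # Predictions next to the truth for every record in the test set.
    comparison = pd.DataFrame({"actual": y_test, "predicted": model.predict(X_test)})
    print(comparison.head())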

Building the right test set can be more challenging than you might expect. It needs to:

  • have no intersections with the training set
  • be similar to the data your model will encounter in production; in particular, this concerns data variety. Thus, if you recognise dog breeds by image, your test set should include all dog breeds, if possible in proportions similar to the complete data set
  • represent a task similar to the task in production

The last two requirements look similar but represent different perspectives. If, when predicting demand for your bakery, you go for task similarity, you should predict the future from the past: make the train/test split at a particular date, using the data before it for training and the data after it for testing. If you go for variety, you might instead pick 3–4 days from every month from January to December. If you cannot meet both requirements, the similarity of the task is, in my opinion, the more important one.
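
A task-similar split can be as simple as cutting the history at a date. A minimal sketch, assuming the sales history lives in a pandas DataFrame with a date column (all names and values are placeholders):

    # Train on everything before the cut-off date, test on everything after it.
    import pandas as pd

    sales = pd.DataFrame({
        "date": pd.date_range("2019-01-01", periods=365, freq="D"),
        "demand": range(365),  # placeholder values
    })

    cutoff = pd.Timestamp("2019-10-01")
    train = sales[sales["date"] < cutoff]
    test = sales[sales["date"] >= cutoff]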

When you pick the test set from your data set, keep in mind different ways of doing this (let’s say you want a 20–80 split):

  • sorting by a particular feature (like date) and taking the top 20%
  • sorting by a particular feature and taking every fifth record
  • randomly
  • stratified and randomly — you pick with respect to the distribution in some column. Usually, it makes sense to pick with respect to the distribution of target values, but you can also consider the distribution of a particular feature like age or gender. There are tools which help you make stratified splits, e.g. in the scikit-learn library
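
For the stratified case, here is a minimal sketch with scikit-learn’s train_test_split; the iris data set only stands in for your own labeled data:

    # Stratified 80/20 split: the class distribution of y is preserved in both parts.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )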

Accuracy and cost of error

Let’s assume we have a good test set. Next, we want to evaluate the accuracy of the model by applying it to that data. This sounds dead simple, but it may not be. If your task is to sort pictures into images of dogs and images of cats, labeling a dog as a cat is exactly as bad an error as labeling a cat as a dog. So you can measure the accuracy of your model by the percentage of correctly or erroneously labeled records.

But if you have to label x-ray scans as innocent or dangerous, you cannot do this, as overlooking a tumor is much worse than raising a false alarm. Even in our bakery example, baking one loaf more than needed may not be as bad as a late customer who isn’t able to get anything.

In general, criteria for measuring the accuracy of machine learning models should come from decision-makers, domain experts, and in some cases from regulators. So consider approaching them and providing reports on which specific errors your model is making. If your model has to predict a number such as demand, you can show them a comparison of your model’s predictions with the truth. If you predict a class, a good choice is a confusion matrix.
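
If you want to produce such a matrix yourself, scikit-learn ships a helper; the labels below are placeholders, not real model output:

    # Sketch: confusion matrix for a cats-vs-dogs style classifier.
    from sklearn.metrics import confusion_matrix

    y_true = ["cat", "cat", "dog", "dog", "dog", "cat"]
    y_pred = ["cat", "dog", "dog", "dog", "cat", "cat"]

    # Rows are the true classes, columns the predicted ones.
    print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))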

Ask them to define measures for the quality of predictions along with thresholds, which will be checked before a new model version goes live and also monitored in production. When you predict categories, e.g. for x-ray scans, the requirements on accuracy may look as follows:

  • the rate of overlooked dangerous scans (measure) must be less than 1% (threshold)
  • as radiologists have limited time for reviewing dangerous scans, the share of scans wrongly labeled as dangerous (measure) may be at most 25% (threshold)

Criteria like these correspond to precision and recall: the first requirement constrains the recall for the “dangerous” class, the second one its precision. You can calculate precision, recall and the “F1 score” (precision and recall combined) for every class as well as for the whole model.
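
A minimal sketch of how these numbers can be computed with scikit-learn, using placeholder labels in place of real scan results:

    # Precision, recall and F1 per class, plus averages for the whole model.
    from sklearn.metrics import classification_report

    y_true = ["dangerous", "innocent", "innocent", "dangerous", "innocent"]
    y_pred = ["dangerous", "innocent", "dangerous", "innocent", "innocent"]

    print(classification_report(y_true, y_pred))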

If you predict demand for a bakery, the requirements on accuracy may look as follows: every loaf thrown away costs us 1 EUR, every unhappy customer costs us 3 EUR; on average and per day, we shouldn’t lose (measure) more than 100 EUR (threshold).
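
Such a requirement translates directly into a custom metric. A minimal sketch using the 1 EUR / 3 EUR figures from the text; the function name and the example numbers are made up for illustration:

    import numpy as np

    def average_daily_loss(actual, predicted, waste_cost=1.0, shortage_cost=3.0):
        """Average cost per day of over- and underestimating demand."""
        actual, predicted = np.asarray(actual), np.asarray(predicted)
        surplus = np.clip(predicted - actual, 0, None)    # loaves thrown away
        shortage = np.clip(actual - predicted, 0, None)   # customers left without bread
        return float(np.mean(surplus * waste_cost + shortage * shortage_cost))

    # The threshold from the text: on average, lose no more than 100 EUR per day.
    assert average_daily_loss(actual=[100, 120], predicted=[110, 110]) <= 100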

Checking individual records

People who are familiar with QA for, let’s say, web applications may wonder why we have so far considered only set-level output validation and not validation of individual records. Well, in some cases we can check single records, e.g. when:

  • categorising pictures
  • analysing the sentiment of reviews
  • detecting a dog breed

And for some cases we hardly can do it, e.g. when:

  • recommending a book
  • predicting demand

The difference is: if we detect a dog breed from an image, we know that we have all the relevant information — every feature which impacts the result is part of the model input (in this case, the features are the image pixels). Thus, we can know with certainty what the correct output of the model should be. When predicting demand for our bakery, we usually don’t know if one of the neighbours invited his friends to a brunch party, which raised demand by 5% on that particular day. If the model’s prediction deviates from the actual data, we don’t know whether this is a problem with the model or whether we are just missing some data.

So if your use case allows creating 100% verified test cases for individual records, don’t ignore them. If it doesn’t, consider tests based on a comparison of two records which differ by just one attribute (see the sketch after this list), e.g.:

  • demand on Friday should be higher than demand on the following Saturday
  • if you detect a cat and a mouse in the same picture by finding a bounding box for each, the bounding box for the cat must be bigger
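
Here is how such a pairwise check could look as a pytest-style test; the model fixture and its predict-on-a-dict interface are hypothetical, so adapt it to however your model is actually called:

    # Pairwise check: two records that differ only in the weekday.
    def test_friday_demand_higher_than_following_saturday(model):
        base = {"temperature": 15, "holiday": False}
        friday = {**base, "weekday": "Friday"}
        saturday = {**base, "weekday": "Saturday"}
        assert model.predict(friday) > model.predict(saturday)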

Ethical issues

There is one more reason why you should compare records which differ by just one attribute. If your model is scoring people for a loan or a job position, it should not be biased with respect to race or gender. So consider creating pairs of records which differ by just this feature and check that the model returns the same prediction for both.
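
A sketch of such a check, again in pytest style; the scoring model, the feature names and the applicant values are hypothetical placeholders:

    # Fairness check: two applicants that differ only in gender must get the same score.
    def test_score_does_not_depend_on_gender(model):
        applicant = {"income": 42000, "years_employed": 6, "gender": "female"}
        counterfactual = {**applicant, "gender": "male"}
        assert model.predict(applicant) == model.predict(counterfactual)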

For the testing of ethical issues, the next section is relevant as well.

Feature importance

The list of QA steps the model should pass would not be complete without a “feature importance” review. Its purpose is to get insights into how much every feature impacts the model’s decisions. For our bakery, such a report would rank each feature by how strongly it influences the predicted demand.

For images, where you consider every pixel as a feature, the report can be a picture in which the most important pixels are colorized:

Image from https://github.com/slundberg/shap

Almost every good model is more complicated than a single decision tree, so you can hardly follow or visualise every step the model takes when making a decision. Nevertheless, there are dedicated libraries for explainable AI like SHAP, LIME and Anchor. Very generally, the approach they follow is to randomise one particular feature, analyse how much the quality of the predictions changes, then do the same for the next feature. Treating the model as a black box, they can generate feature importance reports for almost every approach data scientists use. In the end, this means that if you ask about feature importance, the answer “my model is too complex for this” is hardly acceptable in 2020.
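
The shuffle-one-feature-and-measure idea is also available directly in scikit-learn as permutation importance. A minimal sketch on a public data set standing in for your own model and data; SHAP, LIME and Anchor offer richer, model-agnostic explanations built on the same principle:

    # Shuffle each feature in turn and record how much the test score drops.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, random_state=42
    )
    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
    ranked = sorted(zip(data.feature_names, result.importances_mean), key=lambda t: -t[1])
    for name, importance in ranked[:5]:
        print(f"{name}: {importance:.3f}")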

What to do with feature importance insights depends on the use case. Sometimes, common sense is enough to assess whether they match reality. If the domain you’re trying to capture in your model requires professional experience (linguistics, radiology, stock trading), consider asking domain experts for a review.

What’s next?

We are now done with checking just the model; it’s time to release it into the big world. This will be the topic of the last part in this series.

Image from Biodiversity Heritage Library
