Business June 9, 2025

Why testing data is as important as training data for machine learning models

Matt Prendergast 4 min read

When developing machine learning systems for facial age estimation, the conversation often centres on the training data: how much you have, how diverse it is, how inclusive it is, and how well it represents your end users.

Not to mention where the data comes from.

Intuitively, that focus makes sense. More data presumably leads to better models. But test data is just as important and, in some ways, even more critical for ensuring models perform effectively.

Training data: more isn’t always better

Common sense would suggest that for a machine learning model “the more data, the better.” And that’s generally true. More data allows your model to learn from a broader array of inputs, improving accuracy and robustness. But this has caveats:

Data quality matters more than quantity. A smaller, more diverse, high-quality dataset can outperform a massive, noisy one. If your dataset includes mislabeled or biased images, the model may learn the wrong patterns.
Coverage matters. If you’re building an age estimation model, your training data should reflect the real-world distribution of age, skin tones, lighting conditions, and even types of spoofing attempt (like using photos or masks). Without this, your model might perform well in lab conditions but fail when used in the real world.
Avoiding overfitting. Overfitting occurs when a model performs well on training data but poorly on new, unseen data. This is especially problematic for facial age estimation, where the relevant features (wrinkles, skin texture and facial structure) are subtle. A model may latch onto irrelevant factors (such as camera quality, good lighting, nose piercings or clothing) rather than onto genuine age-related facial features. This is where separate test data is vital – it lets us detect overfitting early.
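One common way to spot overfitting is to compare the error on training data with the error on held-out test data. The sketch below is illustrative only: it assumes mean absolute error (MAE, in years) as the metric, the ages and predictions are made-up numbers rather than real results, and the function names and the threshold are hypothetical.

```python
from statistics import mean


def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("true_ages and predicted_ages must have the same length")
    return mean(abs(t - p) for t, p in zip(true_ages, predicted_ages))


def generalisation_gap(train_mae, test_mae):
    """Test error minus training error; a large positive gap suggests overfitting."""
    return test_mae - train_mae


# Made-up numbers for illustration only (not real results).
train_true, train_pred = [15, 22, 34, 48], [15.5, 21.5, 34.5, 47.5]
test_true, test_pred = [16, 25, 37, 52], [20.0, 21.0, 42.0, 46.0]

train_mae = mean_absolute_error(train_true, train_pred)
test_mae = mean_absolute_error(test_true, test_pred)
gap = generalisation_gap(train_mae, test_mae)

# GAP_THRESHOLD_YEARS is an arbitrary illustrative cut-off, not a recommended value.
GAP_THRESHOLD_YEARS = 2.0
print(f"train MAE={train_mae:.2f}, test MAE={test_mae:.2f}, gap={gap:.2f}")
if gap > GAP_THRESHOLD_YEARS:
    print("Possible overfitting: test error is much higher than training error")
```

A small gap does not prove the model is good, but a large one is a cheap early warning that it has learned something specific to the training images.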

Why test data is crucial

Let’s say you’ve built a powerful model you believe performs well. Without a proper test dataset, you’ll never really know if it’s any good. Even worse, you might think it is when it’s not.

A well-constructed test dataset tells you, objectively, whether your model generalises well to real-world use cases. It also:

Helps you choose the best model. If you’ve trained multiple models, only a diverse and representative test set can tell you which is best suited for deployment.
Uncovers blind spots. Regularly updating your test sets can reveal gaps in performance – perhaps your model struggles with teenagers, overestimates older users or misjudges certain skin tones. These insights are invisible if you’re using stale or narrow test data.
Protects against model collapse. A recent paper published in Nature explores how models trained on the outputs of other models can degenerate over time. This “model collapse” underscores the importance of clean, unbiased training and test data to avoid feedback loops.
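The "blind spots" point can be sketched in a few lines: instead of reporting a single error figure for the whole test set, break the error down by group. This is a minimal sketch, assuming MAE as the metric. The age bands, the records and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def mae_by_group(records):
    """Return {group: mean absolute error} for (group, true_age, predicted_age) records."""
    errors = defaultdict(list)
    for group, true_age, predicted_age in records:
        errors[group].append(abs(true_age - predicted_age))
    return {group: mean(errs) for group, errs in errors.items()}


def worst_group(records):
    """Return the (group, MAE) pair with the highest error: the likeliest blind spot."""
    per_group = mae_by_group(records)
    return max(per_group.items(), key=lambda item: item[1])


# Made-up test-set predictions for illustration only (not real results).
test_records = [
    ("13-17", 14, 18),
    ("13-17", 16, 19),
    ("18-24", 20, 21),
    ("18-24", 23, 22),
    ("60+", 64, 66),
    ("60+", 71, 70),
]

for group, mae in mae_by_group(test_records).items():
    print(f"{group}: MAE {mae:.2f} years")
print("Worst-performing group:", worst_group(test_records))
```

The same breakdown could be run for each candidate model, which is one way to compare models on more than a single headline number. It only works if the test set contains enough examples of each group in the first place.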

The importance of separate training and testing datasets

Machine learning models are trained to learn patterns from data. If the same data is used for both training (learning) and testing (evaluation), the model may appear highly accurate. This isn’t because it truly understands the task of age estimation, but because it has “learned”, or memorised, the specific images from the training dataset.
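To make the memorisation point concrete, here is a deliberately extreme sketch. The "model" below is hypothetical and learns nothing about faces: it simply stores the label for every training image it has seen. The image IDs and ages are synthetic stand-ins generated at random, not real data.

```python
import random
from statistics import mean


class MemorisingModel:
    """A deliberately bad 'model': it stores every training example verbatim."""

    def __init__(self, fallback_age):
        self.lookup = {}
        self.fallback_age = fallback_age

    def fit(self, images, ages):
        self.lookup = dict(zip(images, ages))

    def predict(self, image):
        # Known image: return the memorised label. Unseen image: guess the fallback.
        return self.lookup.get(image, self.fallback_age)


def mean_absolute_error(model, images, ages):
    """Average absolute error, in years, of the model's predictions."""
    return mean(abs(age - model.predict(image)) for image, age in zip(images, ages))


# Synthetic stand-in data: each "image" is just an ID string with a random age.
rng = random.Random(0)
image_ids = [f"img_{i}" for i in range(200)]
ages = [rng.randint(2, 75) for _ in image_ids]

# Disjoint split: no image appears in both sets.
train_images, test_images = image_ids[:150], image_ids[150:]
train_ages, test_ages = ages[:150], ages[150:]

model = MemorisingModel(fallback_age=mean(train_ages))
model.fit(train_images, train_ages)

train_mae = mean_absolute_error(model, train_images, train_ages)
test_mae = mean_absolute_error(model, test_images, test_ages)
print(f"MAE on training images: {train_mae:.2f}")
print(f"MAE on held-out images: {test_mae:.2f}")
```

Evaluated on its own training images, this model looks perfect. Only the held-out images reveal that it cannot estimate age at all.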

At Yoti, we use a separate test set of around 120,000 facial images. Each of these is tagged with a ground truth age, with ages spread across every year from 2 to 75 and with representation across gender and skin tone. This ensures the model’s performance is measured effectively for real-world audiences.
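A test set like this is only useful if it really does cover the range it claims to. As a rough illustration, and not a description of Yoti's actual tooling, a coverage check over the ground truth ages might look like the sketch below. The function name and the min_count parameter are hypothetical; the 2 to 75 range is taken from the text above.

```python
from collections import Counter


def coverage_gaps(ages, min_age=2, max_age=75, min_count=1):
    """Return the ages in [min_age, max_age] that have fewer than min_count test images."""
    counts = Counter(ages)
    return [age for age in range(min_age, max_age + 1) if counts[age] < min_count]


# Toy labels for illustration: one image per year of age, with 13 and 70 missing.
toy_ages = [age for age in range(2, 76) if age not in (13, 70)]
print("Under-represented ages:", coverage_gaps(toy_ages))
```

The same idea extends to other attributes, such as counting images per combination of age band, gender and skin tone.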

Models that have not been thoroughly tested risk performing poorly when they are evaluated independently. For example, the US National Institute of Standards and Technology (NIST) tests models against millions of test images.

How facial age estimation can help

Around the world, new regulations are being introduced to enforce age checks online. Facial age estimation offers a scalable, privacy-preserving solution to meet these requirements. This fills a gap where no safeguards previously existed. It can bring meaningful protections for young users and help ensure that everyone has access to online experiences that are appropriate for their age.