Techniques for Improving Data Quality: The Key to Machine Learning

One of the fundamental challenges for machine learning (ML) teams is data quality, or more accurately, the lack of it. Your ML solution is only as good as the data that you train it on, and therein lies the rub: Is your data of sufficient quality to train a trustworthy system? If not, can you improve your data so that it is? You need a collection of data quality “best practices”, but what is “best” depends on the context of the problem that you face. Which of the myriad strategies are the best ones for you?

This presentation compares over a dozen traditional and agile data quality techniques on five factors: timeliness of action, level of automation, directness, timeliness of benefit, and difficulty to implement.  The data quality techniques explored include: data cleansing, automated regression testing, data guidance, synthetic training data, database refactoring, data stewards, manual regression testing, data transformation, data masking, data labeling, and more.  When you understand what data quality techniques are available to you, and understand the context in which they’re applicable, you will be able to identify the collection of data quality techniques that are best for you.
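To make one of the listed techniques concrete, here is a minimal sketch of what an automated regression test for data might look like. It is an illustration only, not material from the presentation: the rule names, the field names (`age`, `label`), the allowed label values, and the sample records are all hypothetical.

```python
from typing import Callable

# A rule takes one record and returns True when the record passes.
Rule = Callable[[dict], bool]

# Hypothetical rules for an imagined training set; real rules depend on your data.
RULES: dict[str, Rule] = {
    "age_present": lambda r: r.get("age") is not None,
    "age_in_range": lambda r: r.get("age") is None or 0 <= r["age"] <= 120,
    "label_known": lambda r: r.get("label") in {"churn", "stay"},
}


def find_violations(records: list[dict], rules: dict[str, Rule] = RULES) -> list[tuple[int, str]]:
    """Return a (row_index, rule_name) pair for every failed check."""
    return [
        (i, name)
        for i, record in enumerate(records)
        for name, check in rules.items()
        if not check(record)
    ]


# Hypothetical sample records, invented for this sketch.
sample = [
    {"age": 34, "label": "stay"},
    {"age": None, "label": "churn"},
    {"age": 150, "label": "unknown"},
]

violations = find_violations(sample)
for row, rule in violations:
    print(f"row {row} failed {rule}")
```

Run on every new batch of training data, a check like this could flag quality problems before they reach model training. How strict the rules should be is a context-dependent choice.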

What You Will Learn About Data Quality and Machine Learning

    • How do organizations develop machine learning (ML)-based solutions?
    • Why is data quality (DQ) critical to the success of ML initiatives, and where do DQ techniques fit in?
    • How do you choose the most effective DQ techniques for the situation that you face?

Audience for This Presentation

    • Data practitioners who want to gain a better understanding of their data quality options.
    • Machine learning (ML) practitioners who want to improve their way of working (WoW).
    • IT and business leaders who want to identify how to improve their chance of success at ML initiatives.

Why You Want to Hear About Data Quality and Machine Learning From Me

I am the thought leader behind the Agile Data method, a focus of which is data quality techniques. Regarding machine learning, I am currently working on a Master's degree in Artificial Intelligence (AI) from the University of Leeds. I was the co-creator of the Project Management Institute (PMI)’s Disciplined Agile (DA), a context-driven hybrid toolkit of techniques. I have worked with organizations around the world to help them improve their ways of working (WoW).

Presentation History

This presentation was first developed in early 2023.  It has been presented at a data conference and to agile user groups.

Scott Ambler is an international keynote speaker