Checklist and Resources
Checklist
Identify whether your dataset has missing data; visualisation tools can help here (see Visualising Missingness)
Try to determine what type of missing data there is (MCAR/MAR/MNAR) as described in Missing Data Structures, and don’t forget about Structured Missingness
Choose an appropriate missing data handling method; you can use Missing Data Handling Methods as a starting point for ideas
Apply the missing data handling method and then continue with your analyses!
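The checklist above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: the toy dataset and its column names are made up, the group comparison is only an informal probe of the missingness mechanism (it cannot distinguish MAR from MNAR), and median imputation via scikit-learn's `SimpleImputer` is chosen purely for brevity, not because it suits every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy dataset; the column names and values are invented for illustration.
df = pd.DataFrame(
    {
        "age": [23, 35, 41, 52, 29, 60],
        "income": [21000, np.nan, 43000, np.nan, 30000, 58000],
        "score": [0.7, 0.4, np.nan, 0.9, 0.5, 0.6],
    }
)

# Step 1: identify missing data (fraction of missing values per column).
missing_fraction = df.isna().mean()
print(missing_fraction)

# Step 2: informal probe of the missingness type. If an observed variable
# (here "age") differs a lot between rows with and without "income",
# the data are unlikely to be MCAR. This is a heuristic, not a formal test.
income_missing = df["income"].isna().rename("income_missing")
print(df.groupby(income_missing)["age"].mean())

# Steps 3 and 4: choose and apply a handling method.
# Median imputation is an assumption made for this sketch only.
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)
print(df_imputed)
```

After this step `df_imputed` contains no missing values and can be passed on to the rest of the analysis. In a real project it is worth comparing results across several handling methods, since single-value imputation tends to understate uncertainty.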
References
The code in this chapter draws in part on several online tutorials, which were used as references:
Visualizing Missing Data: A Python notebook exploring the use of the missingno library.
Gallery of Missing Data Visualisations: A tutorial on missing data visualisation in R.
Intro to MICE: An Imputation Strategy: A short notebook introducing how to implement MICE in Python.
Lastly, the scikit-learn documentation is detailed and very helpful for implementing missing data handling in Python:
Other textbooks and papers used that have not been cited directly earlier in the chapter:
On multiple imputation [AWLvanBuuren21, dGvDJ+13]
What to Learn Next
If you happen to be handling sensitive data in your project, check out the Working on Sensitive Data Projects chapter.
Alternatively, if you want to make your research project and data analysis pipeline more reproducible, see the chapter on Reproducibility with Make, a build automation tool.
Further Reading
Flexible Imputation of Missing Data: A much more in-depth look at missing data imputation. It characterises missing data in greater detail, including mathematical definitions, and describes imputation methods.
Getting Started with naniar: More R functions to visualise data missingness, including one that uses decision trees to map out the proportion of missingness in a variable based on all other variables.
The papers cited throughout this chapter are all good resources for further reading. The original paper on MICE [vBGO11] and the review papers on missing data handling [OJDSP22, Pig01] are especially great resources.
For more R visualisation and imputation packages see:
The Turing-Roche partnership has some resources on structured missingness:
See #ExplainToMe: The Problem of Structured Missing Data for a great animated overview
Papers on structured missingness (that were cited previously): [MMC+23] and [JMH+23]
For more in-depth recordings from the Turing-Roche Knowledge Series see:
Modern Topics on Missing Data, which also provides a brief overview of missing data:
Structured Missingness Challenges in Data Integration: