Checklist
- Identify whether your dataset has missing data; the visualisation tools described in Visualising Missingness can help
- Try to determine what type of missing data there is (MCAR/MAR/MNAR) as described in Missing Data Structures, and don’t forget about Structured Missingness
- Choose an appropriate missing data handling method; you can use Missing Data Handling Methods as a starting point for ideas
- Apply the missing data handling method and then continue with your analyses!
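The checklist above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: the toy DataFrame and its column names are invented, and mean imputation is used only because it is short to write, not because it suits every dataset. Note that code can summarise where values are missing, but deciding between MCAR, MAR and MNAR still requires knowledge of how the data were collected.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy dataset; the columns and values are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, np.nan],
    "income": [30000.0, 42000.0, np.nan, 58000.0, 61000.0],
})

# Step 1: identify whether (and where) data are missing.
missing_counts = df.isna().sum()       # number of missing values per column
missing_fraction = df.isna().mean()    # proportion of missing values per column
print(missing_counts)
print(missing_fraction)

# Step 2: look at which columns are missing together in each row.
# This only describes the pattern; it does not prove MCAR, MAR or MNAR.
missing_patterns = df.isna().value_counts()
print(missing_patterns)

# Steps 3 and 4: choose and apply a handling method.
# Mean imputation is an assumption made here for brevity.
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

# Confirm that no missing values remain before continuing with the analysis.
print(df_imputed.isna().sum())
```

Swapping in a different method, such as dropping incomplete rows with `df.dropna()`, only changes the last step; the inspection steps stay the same.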
References
The code in this chapter draws in part on several online tutorials, which were used as references:
- Visualizing Missing Data: A Python notebook exploring the use of the missingno library.
- Gallery of Missing Data Visualisations: A tutorial on missing data visualisation in R.
- Imputing Missing Data with R; MICE package
- Intro to MICE: An Imputation Strategy: A short notebook introducing how to implement MICE in Python.
- Lastly, the scikit-learn documentation is very helpful and detailed on implementing missing data handling in Python.
Other textbook and paper references used that have not been directly cited previously:
- On types of missing data: Buuren, 2015; Mack C, 2018
- On multiple imputation: Goeij et al., 2013; Austin et al., 2021
What to Learn Next
If you happen to be handling sensitive data in your project, check out the Working on Sensitive Data Projects chapter.
Alternatively, if you want to make your research project and data analysis pipeline more reproducible, see the chapter on Reproducibility with Make, a build automation tool.
Further Reading
- Flexible Imputation of Missing Data: A much more in-depth look at missing data imputation that characterises missing data further, including mathematical definitions, and describes imputation methods in detail.
- Getting Started with naniar: More R functions to visualise data missingness, including one that uses decision trees to map out the proportion of missingness in a variable based on all other variables.
- The papers cited throughout this chapter are all good resources for further reading. The original paper on MICE (Buuren & Groothuis-Oudshoorn, 2011) and the review papers on missing data handling (Pigott, 2001; Oluwaseye Joel et al., 2022) are especially useful.
- For more R visualisation and imputation packages see:
- The Turing-Roche partnership has some resources on structured missingness:
  - See #ExplainToMe: The Problem of Structured Missing Data for a great animated overview
  - Papers on structured missingness (that were cited previously): Mitra et al., 2023 and Jackson et al., 2023
  - For more in-depth recordings from the Turing-Roche Knowledge Series see:
    - Modern Topics on Missing Data, which also provides a brief overview of missing data
    - Structured Missingness Challenges in Data Integration
- Buuren, S. van. (2015). Types of missing data [Book]. In An Introduction to Medical Statistics, Fourth Edition (pp. 306–307). Oxford University Press. https://www-users.york.ac.uk/~mb55/intro/typemiss4.htm
- Mack C, W. D., Su Z. (2018). Types of Missing Data [Book]. In Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide, Third Edition [Internet]. Rockville (MD): Agency for Healthcare Research. https://www.ncbi.nlm.nih.gov/books/NBK493614/
- de Goeij, M. C. M., van Diepen, M., Jager, K. J., Tripepi, G., Zoccali, C., & Dekker, F. W. (2013). Multiple imputation: dealing with missing data. Nephrology Dialysis Transplantation, 28(10), 2415–2420. https://doi.org/10.1093/ndt/gft221
- Austin, P. C., White, I. R., Lee, D. S., & van Buuren, S. (2021). Missing Data in Clinical Research: A Tutorial on Multiple Imputation. Canadian Journal of Cardiology, 37(9), 1322–1331. https://doi.org/10.1016/j.cjca.2020.11.010
- van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67. https://doi.org/10.18637/jss.v045.i03
- Pigott, T. D. (2001). A Review of Methods for Missing Data. Educational Research and Evaluation, 7(4), 353–383. https://doi.org/10.1076/edre.7.4.353.8937
- Oluwaseye Joel, L., Doorsamy, W., & Sena Paul, B. (2022). A Review of Missing Data Handling Techniques for Machine Learning. International Journal of Innovative Technology and Interdisciplinary Sciences, 5(3), 971–1005. https://doi.org/10.15157/IJITIS.2022.5.3.971-1005
- Mitra, R., McGough, S. F., Chakraborti, T., Holmes, C., Copping, R., Hagenbuch, N., Biedermann, S., Noonan, J., Lehmann, B., Shenvi, A., Doan, X. V., Leslie, D., Bianconi, G., Sanchez-Garcia, R., Davies, A., Mackintosh, M., Andrinopoulou, E.-R., Basiri, A., Harbron, C., & MacArthur, B. D. (2023). Learning from data with structured missingness. Nature Machine Intelligence, 5(1), 13–23. https://doi.org/10.1038/s42256-022-00596-z
- Jackson, J., Mitra, R., Hagenbuch, N., McGough, S., & Harbron, C. (2023). A Complete Characterisation of Structured Missingness. https://arxiv.org/abs/2307.02650