Case Study: Reproducible Paper using Make

In the tutorial above we used IMDB movie ratings for different genres as example data. This data was obtained from a dataset shared on Kaggle as a CSV file. The file looks like this:

titles01/tt0012349,tt0012349,Der Vagabund und das Kind (1921),der vagabund und das kind,,8.4,40550,3240,1921,,1,0,19,96,85,3,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0015864,tt0015864,Goldrausch (1925),goldrausch,,8.3,45319,5700,1925,,2,1,35,110,122,3,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
titles01/tt0017136,tt0017136,Metropolis (1927),metropolis,,8.4,81007,9180,1927,,3,4,67,428,376,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
titles01/tt0017925,tt0017925,Der General (1926),der general,,8.3,37521,6420,1926,,1,1,53,123,219,3,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

While on the surface this looks like a regular CSV file, when you try to open it with the Python CSV library, or Pandas, or R’s read_csv, or even readr:read_csv, the data is not loaded correctly. This happens because the CSV file uses an escape character \ for movie names that have commas in them and the CSV readers don’t automatically detect this variation in the CSV format. It turns out that this is quite a common issue for data scientists: CSV files are often messy and use an uncommon dialect: such as strange delimiters and uncommon quote characters. Collectively, data scientists waste quite some time on these data wrangling issues where manual intervention is needed. But this problem is also not that easy to solve: to a computer a CSV file is simply a long string of characters and every dialect will give you some table, so how do we determine the dialect accurately in general?

Recently, researchers from the Alan Turing Institute have presented a method that achieves 97% accuracy on a large corpus of CSV files, with an improvement of 21% over existing approaches on non-standard CSV files. This research was made reproducible through the use of Make and is available through an online repository:

Below we will briefly describe what the Makefile for such a project looks like. For the complete file, please see the repository. The Makefile consists of several sections:

  1. Data collection: because the data is collected from public sources, the repository contains a Python script that allows anyone to download the data through a simple make data command.

  2. All the figures, tables, and constants used in the paper are generated based on the results from the experiments. To make it easy to recreate all results of a certain type, .PHONY targets are included that depend on all results of that type (so you could run make figures). The rules for these outputs follow the same pattern as those for the figures in the tutorial above. Tables are created as LaTeX files so they can be directly included in the LaTeX source for the manuscript.

  3. The rules for the detection results follow a specific signature:

    $(OUT_DETECT)/out_sniffer_%.json: $(OUT_PREPROCESS)/all_files_%.txt
    	python $(SCRIPT_DIR)/ sniffer $(DETECTOR_OPTS) $< $@

    The % symbol is used to create outputs for both sources of CSV files with a single rule in Pattern Rules and the rule uses in Automatic Variables. to extract the input and output filenames.

  4. Some of the cleaning rules will remove output files that take a while to create. Therefore, these depend on a special check_clean target that asks the user to confirm before proceeding:

    	@echo -n "Are you sure? [y/N]" && read ans && [ $$ans == y ]

It is important to emphasize that this file was not created in one go, but was constructed iteratively. The Makefile started as a way to run several dialect detection methods on a collection of input files and gradually grew to include the creation of figures and tables from the result files. Thus the advice for using Make for reproducibility is to start small and start early.

The published Makefile in the repository does not contain the paper, but this is included in the internal Makefile and follows the same structure as the report.pdf file in the tutorial above. This proved especially useful for collaboration as only a single repository needed to be shared that contains the code, the results, and the manuscript.