Reproducible Data Science, Part 1

Virtualization lets us data scientists work in an environment that can be reproduced by sharing nothing more than a few scripts; what’s more, the environment is (in most cases) identical to the one in production.

(Image Source : Google image search)

For the impatient …

I was recently dockerizing an older analytics project. One of the components required a few MySQL databases to be set up and prepopulated. The official MySQL image creates just one database. Here’s a way to create and prepopulate the databases without a separate Dockerfile.

First, we pull the image for the required version and start a MySQL server instance. I mount a local directory (/path/to/vol) as a volume for this instance.

$ docker pull mysql:5.6
$ docker run --name some-mysql -v /path/to/vol:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -d mysql:5.6

Next, we create the required databases (db1 and db2 in this example). Note that we pass the root password inline; a bare -p flag would prompt for it interactively, which doesn’t mix well with docker exec -i.

$ docker exec -i some-mysql mysql -uroot -pmy-secret-pw -e "create database db1"
$ docker exec -i some-mysql mysql -uroot -pmy-secret-pw -e "create database db2"

Finally, we run the table creation + population script for each database. Here the inline password matters even more: an interactive prompt would swallow the SQL being piped in on stdin.

$ docker exec -i some-mysql mysql -uroot -pmy-secret-pw db1 < db1.sql
$ docker exec -i some-mysql mysql -uroot -pmy-secret-pw db2 < db2.sql
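For completeness, here is what one of those scripts might contain. This is a purely hypothetical sketch; the users table and its seed rows are assumptions for illustration, not from the original project:

```shell
# Write a minimal, hypothetical db1.sql: one table plus a couple of seed rows.
cat > db1.sql <<'EOF'
CREATE TABLE users (
  id   INT          NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  PRIMARY KEY (id)
);
INSERT INTO users (name) VALUES ('alice'), ('bob');
EOF
```

Piping a file like this through docker exec -i, as above, runs every statement against the named database.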

OK, why the shipping containers?

Docker is one of the most popular virtualization technologies today. It allows us data scientists to work in an environment that can be reproduced by sharing nothing more than a Dockerfile (or a few).

Let’s say, for example, my colleague and I work on a MacBook and a Windows laptop respectively, and we use Linux in production. We both use Python and R, but very likely with different package versions. In addition, we use a few relational (MySQL, PostgreSQL) and NoSQL (Neo4J, Redis) databases.

Keeping our environments identical by hand would be tedious and error-prone.

Enter Docker

At the start of a new analytics project (or any project that requires code to be written), either of us defines, in a plain text file (a Dockerfile or shell script), the components we’ll be using, along with the software/package versions.

Using this Dockerfile and some fairly basic commands, we get identical setups on both our laptops as well as in production; what’s more, neither of us ever needs to install anything other than Docker itself (just once).
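As a sketch of what such a file might look like — the base image, package versions, and run_analysis.py script below are illustrative assumptions, not recommendations:

```dockerfile
# Pin the base image and every package version so that all machines
# build the exact same environment.
FROM python:3.6-slim
RUN pip install pandas==0.20.3 scikit-learn==0.19.0
COPY . /app
WORKDIR /app
CMD ["python", "run_analysis.py"]
```

With this in the project root, docker build -t myproject . followed by docker run myproject reproduces the same environment on any machine that has Docker.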

The various components of our project now run as individual lightweight containers on a common machine, sharing the OS and using less RAM than they would if run separately. The containers are based on Docker images, which are an interesting concept in their own right.

A Not-So-Old Alternative : Vagrant

(Image Source : slide 16 of this presentation by Docker)

Whilst Docker containers run in a single OS sitting on top of whatever our laptops run (macOS, Windows, Linux), Vagrant creates separate virtual machines (VMs), each containing everything from the OS, to the underlying libraries, to the application-layer packages and tools.

The concepts remain similar: in the past, one of us would have written a Vagrantfile (along with one or more optional provisioning scripts).
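For comparison, a minimal Vagrantfile might look like this — the box name and the bootstrap.sh provisioning script are illustrative assumptions:

```ruby
# Vagrantfile: declares a full VM rather than a container.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"                   # base OS image ("box")
  config.vm.provision "shell", path: "bootstrap.sh"   # installs our packages
end
```

Running vagrant up then downloads the box, boots the VM, and runs the provisioning script — the same "declare once, reproduce anywhere" idea, at the cost of a full OS per environment.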


I have used Vagrant in the past