Intro to Docker for biostatisticians

Nick Strayer

4/25/2019

Goal

The goal of this document/presentation is to take you (a biostatistician or similar) from knowing nothing about Docker to being able to use it in your research, by way of a simple example.

A brief history

Docker was developed for software engineering. As software has gotten more complex and applications more numerous, the cost of manually configuring a new server every time you needed to scale up was simply too high. To deal with this, Docker software was built. (When I use the name “Docker” here I am really referring to any container or Virtual Machine (VM) software.) Docker exists as a method of recreating an image of a server, with all its software installed etc., in a single command.

Docker’s job is to make it so a programmer only has to sit and type out sudo apt-get install clang... once and then anytime they have a new machine they just use the Docker ‘image’ to start from a setup just like they had it before, guaranteed.

What does Docker do for us?

Quickly, before I bore you with a bunch of terminology you could get elsewhere, let’s motivate why you as a biostatistician/data scientist may want to use Docker.

The reasons why I find Docker valuable fall into five main points:

The Docker workflow:

A typical Docker workflow is as follows:

  1. Install Docker on your machine
  2. Write a Dockerfile that specifies what software you need
  3. Build that Dockerfile to download all the necessary software and wrap it up into an ‘image’
  4. Run the image that your Dockerfile generated (a ‘container’ is just what the image becomes when it is run)
  5. (Optional) Save the image for running later so you can skip steps 2 and 3.
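As a rough preview, steps 3 through 5 correspond to commands along these lines. This is only a sketch: the image name my_analysis and the archive file name are hypothetical placeholders, I am assuming the image has bash in it, and the guard at the top is just there so the script does nothing on a machine without Docker or a Dockerfile.

```shell
# Hypothetical names, chosen only for illustration.
IMAGE="my_analysis"
ARCHIVE="${IMAGE}.tar"

if command -v docker >/dev/null 2>&1 && [ -f Dockerfile ]; then
    # Step 3: build the Dockerfile in the current directory into an image.
    docker build -t "$IMAGE" .
    # Step 4: run the image as a container and open a shell inside it.
    docker run --rm -it "$IMAGE" bash
    # Step 5: write the image to a tar archive ...
    docker save -o "$ARCHIVE" "$IMAGE"
    # ... which can be restored later, or on another machine, with:
    #   docker load -i my_analysis.tar
else
    echo "Docker or a Dockerfile is missing, so nothing was built."
fi
```

Loading a saved archive still needs Docker installed, which is why step 1 can't be skipped.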

Because I expect that workflow to make approximately zero sense at first blush let’s expand on each step.

Install Docker

Installing Docker takes a different form on different machines. The main docs do a much better job than I can so I will point you to them.

Docker for mac

Docker for windows (Unfortunately, as of writing this, you need the professional version of Windows to run Docker, although this is supposed to change in the near-ish future.)

Docker for ubuntu
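Whichever installer you use, a quick sanity check afterwards could look something like the sketch below. It assumes the docker command ended up on your PATH and that the Docker daemon is running; hello-world is Docker's small official test image, and DOCKER_FOUND is just a variable name I made up for this example.

```shell
# Check that the docker command exists before trying to use it.
if command -v docker >/dev/null 2>&1; then
    # Print the installed client version.
    docker --version
    # Pull and run a tiny test image, removing the container afterwards.
    docker run --rm hello-world
    DOCKER_FOUND=yes
else
    echo "docker was not found on your PATH; see the install docs above."
    DOCKER_FOUND=no
fi
```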

The Dockerfile

This is a simple text script that describes the state of the machine. Think of it like a super bash/shell script that does all the typing into the command line for you. So if your usual workflow upon getting a new computer is opening up the terminal and running something like…

sudo apt-get install R-Lang, RStudio-Server, ... 

This can get translated into the Dockerfile so Docker knows how to get the various software you need.

That is all great, but you may wonder why you don’t just use a big bash script here and avoid all this hassle. The beauty of these Dockerfiles is that you can stack them, building upon previous images built by you or others.

Stacking images

For instance, say you want to run a machine that has instant access to version 3.3.1 of R and the tidyverse suite of packages already installed. You can use the Rocker project, which kindly provides pre-built images with different R versions etc. for you.

Note the stacking of the shipping containers in the logo.

# ./Dockerfile
# Start from image with R 3.3.1 and tidyverse installed
FROM rocker/tidyverse:3.3.1

...
# Add your own desired packages etc. on top
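If you want to see the stacking for yourself, one option is to pull a base image and ask Docker to list the layers it is made of. A minimal sketch, assuming Docker is installed and you are fine with a fairly large download:

```shell
# The base image used in the Dockerfile above.
BASE_IMAGE="rocker/tidyverse:3.3.1"

if command -v docker >/dev/null 2>&1; then
    # Download the image from Docker Hub (can take a while).
    docker pull "$BASE_IMAGE"
    # Show the layers the image was built from, newest first.
    docker history "$BASE_IMAGE"
else
    echo "Docker is not installed, so there is nothing to inspect."
fi
```

Each row of the docker history output corresponds to one instruction from the Dockerfiles that were stacked to produce the image.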

Let’s set up a super simple example of building a Docker image with both the tidyverse and another package, visdat, for looking at data overviews.

FROM rocker/tidyverse:3.5.0

# Install visdat from github
RUN R -e "devtools::install_github('ropensci/visdat')"
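If you would rather create that file from the terminal than from a text editor, something like the following should work. The directory name docker_intro is a hypothetical choice of mine; the file contents are exactly the two instructions above.

```shell
# Hypothetical project directory; any empty folder will do.
PROJECT_DIR="docker_intro"
mkdir -p "$PROJECT_DIR"

# The quoted 'EOF' stops the shell from expanding anything in the file body.
cat > "$PROJECT_DIR/Dockerfile" <<'EOF'
FROM rocker/tidyverse:3.5.0

# Install visdat from github
RUN R -e "devtools::install_github('ropensci/visdat')"
EOF

# Print the file back out to check it looks right.
cat "$PROJECT_DIR/Dockerfile"
```

Note the file is named Dockerfile with no extension, since that is the name docker build looks for by default.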

Building the Dockerfile

Now that we have our simple Dockerfile we can ‘build’ it. This simply means we tell our computer to go and grab all the necessary files and construct the image for use. Run this from the same directory you made your ./Dockerfile in.

$ docker build -t tidy_vizdat .

After you do this you’ll get a nice Matrix-esque string of status bars…

Sending build context to Docker daemon  50.18kB
Step 1/2 : FROM rocker/tidyverse:3.5.0
3.5.0: Pulling from rocker/tidyverse
54f7e8ac135a: Pull complete...

These show you the progress of downloading and constructing your new image!

Once this completes you now have an image. This means that no matter what happens you will always be able to run this image and it will work the same. (Note that if you don’t specify versions for packages, things may not be identical if you rebuild the image at a later date. As long as you don’t rebuild the image, though, it will always be the same when you run it.)
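One way to protect a build you are happy with from an accidental rebuild is to give it a second, dated tag. The sketch below assumes the tidy_vizdat image from above has already been built; the tag 2019-04-25 is just a hypothetical label and any tag string would do.

```shell
IMAGE="tidy_vizdat"
SNAPSHOT_TAG="2019-04-25"   # hypothetical label for this particular build

if command -v docker >/dev/null 2>&1; then
    # Confirm the image exists locally.
    docker image ls "$IMAGE"
    # Add a second name that points at the current build.
    docker tag "$IMAGE" "$IMAGE:$SNAPSHOT_TAG"
    # Print the image ID behind the snapshot tag.
    docker image inspect --format '{{.Id}}' "$IMAGE:$SNAPSHOT_TAG"
else
    echo "Docker is not installed, so there is no image to tag."
fi
```

If you later rerun docker build -t tidy_vizdat ., the plain name moves to the new build but tidy_vizdat:2019-04-25 keeps pointing at the old one.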

Running the image, aka starting your container

Now that your image is built, all you need to do is use the run command to start it up and enter.

docker run -it tidy_vizdat bash

This says to run your just-created image and enter it in a bash shell. After a second your terminal will look something like this:

$ docker run -it tidy_vizdat bash
root@061df01792d0:/#

You are now in the container! It’s just a Linux machine running within your computer. We can open R and run our world-changing calculations…

root@061df01792d0:/# R

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
...
Type 'q()' to quit R.

> 4 * 3
[1] 12

Beautiful! Now to close everything, quit R with q() and then type exit into the terminal. Docker will shut down the container and it will be like nothing ever happened!
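One small caveat to “like nothing ever happened”: a container started with a plain docker run is stopped when you exit, but Docker keeps a record of it until you remove it. A sketch of how you might check for leftovers, assuming the tidy_vizdat image from above:

```shell
IMAGE="tidy_vizdat"

if command -v docker >/dev/null 2>&1; then
    # List every container created from our image, including stopped ones.
    docker ps -a --filter "ancestor=$IMAGE"
else
    echo "Docker is not installed, so there are no containers to list."
fi

# To have Docker delete the container automatically on exit, add --rm:
#   docker run --rm -it tidy_vizdat bash
```

Either way, anything you installed or saved inside the container does not make its way back into the image.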

Going beyond the basics

But wait, this isn’t particularly helpful. What if we wanted to use RStudio or import data? Fret not, these are possible as well.

Using RStudio (aka port-mapping)

The image that we loaded also happens to have RStudio-Server loaded on it. This means that if we can get access to the container from our web browser we can use everyone’s favorite IDE to work/run scripts.

The beauty of container stacking in action. They just added the tidyverse on top of their already built RStudio image. DRY comes to software installation!

To make our web browser able to connect into the container we need to tell Docker that we want to map some local port to the container’s internal ports. That is, if we have a server that is running on port 8787 like RStudio-Server does, we need to make sure our local computer’s port 8787 is mapped to the container’s.

Luckily, this just means a couple changes to the docker run command.

docker run -it -p 8787:8787 -e DISABLE_AUTH=true tidy_vizdat

Note that we have added -p 8787:8787, which tells Docker to map port 8787 on our computer to port 8787 in the container. In addition, we have added -e DISABLE_AUTH=true, which just tells RStudio we don’t want to use the login screen. (To see more about customizing these behaviours, such as when you need more security, read the Rocker docs.) Last, we simply left off the bash at the end of the command because we will do our accessing of the container through the web browser.
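The two sides of the mapping don’t have to match: the number before the colon is the port on your computer and the number after it is the port inside the container. Here is a sketch that uses port 9000 locally, which is an arbitrary choice of mine (handy if something else is already using 8787).

```shell
IMAGE="tidy_vizdat"
HOST_PORT=9000        # port on your computer (arbitrary choice)
CONTAINER_PORT=8787   # port RStudio-Server listens on inside the container

# The address to open in your browser uses the host side of the mapping.
URL="http://localhost:${HOST_PORT}"
echo "Once the container is up, RStudio should be at ${URL}"

if command -v docker >/dev/null 2>&1; then
    # -p takes HOST:CONTAINER; --rm removes the container when it stops.
    docker run --rm -it -p "${HOST_PORT}:${CONTAINER_PORT}" \
        -e DISABLE_AUTH=true "$IMAGE"
else
    echo "Docker is not installed, so the container was not started."
fi
```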