Docker is a hot topic in the data science and developer worlds.
Data scientists are not software developers per se, yet Docker offers plenty of useful features for everything from data modeling and exploration to deployment. Since major cloud services, including AWS, support Docker containers, it has become easier to implement continuous integration and continuous delivery with Docker.
What is Docker for Data Science?
Docker is a software container platform that provides an isolated environment holding everything we need to run our experiments. It behaves much like a lightweight VM, although a container shares the host's kernel rather than booting its own operating system. The container is built from a script that can be version controlled, letting us version control the data science environment itself.
Imagine you are an astronaut working on the International Space Station. Should you want to go outside, you face immensely hostile conditions, including extreme temperatures, a lack of oxygen, and radiation: conditions you are not built to withstand.
Humans are not built to thrive in places like the deep sea or outer space, so we need a system that reproduces our own environment. Be it a submarine or a spacesuit, humans need a barrier that keeps oxygen, temperature, and pressure at livable levels.
To sum up, we need a container!
Software faces the same problem as the deep-sea diver or the astronaut: outside the environment it was built for, conditions are hostile, so it needs a protective layer that recreates its native environment.
The Docker container is the spacesuit of software programs.
Docker for Data Science
Docker works to isolate the software from every other thing on the same system.
A program running inside a Docker container, its spacesuit, has no idea it is wearing one and is generally unaffected by anything outside.
Container Stack – Docker
Application – the high-level application, such as a data science project
Dependencies – lower-level software such as Python or TensorFlow
Docker Container – the isolation layer, the spacesuit, if you will
Operating System – low-level drivers and interfaces for interacting with hardware
Hardware – memory, CPU, network, hard disk, and more
The basic idea is to package an application and its dependencies into a single reusable object. This object can then be instantiated consistently in varying environments.
Docker for Data Science | Terminology
Let’s look at some basic definitions for Docker:
Containers are small user-level virtualizations that help you build, install, and run your code. A container is a running instance of an image.
Images are snapshots of an environment, comparable to a compiled artifact.
A Dockerfile is a plain-text file with instructions to build an image; it can be kept under version control alongside your code.
Docker Hub is like GitHub for Docker images. You can set up Docker Hub to build an image automatically whenever you update the Dockerfile in GitHub.
Creating a Container
The flow of creating Docker containers:
- Dockerfile – Instructions to compile an image
- Image – Compiled object
- Container – Instance of the image which is executed
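The flow above can be sketched in Python. This is a minimal illustration rather than a definitive setup: the base image `python:3.11-slim`, the image tag `ds-env`, the file names `requirements.txt` and `train.py`, and the helper function names are all assumptions made for this example. The helpers only build the Dockerfile text and the `docker` command lines; they do not invoke Docker themselves.

```python
from typing import List

# Hypothetical names chosen for this sketch; none come from a real project.
BASE_IMAGE = "python:3.11-slim"
IMAGE_TAG = "ds-env"


def render_dockerfile(base_image: str = BASE_IMAGE) -> str:
    """Step 1: return the text of a small Dockerfile (the instructions)."""
    lines = [
        f"FROM {base_image}",
        "WORKDIR /app",
        # Copy the dependency list first so this layer is cached between builds.
        "COPY requirements.txt .",
        "RUN pip install --no-cache-dir -r requirements.txt",
        "COPY . .",
        'CMD ["python", "train.py"]',
    ]
    return "\n".join(lines) + "\n"


def build_command(tag: str = IMAGE_TAG, context: str = ".") -> List[str]:
    """Step 2: compile the Dockerfile found in `context` into an image."""
    return ["docker", "build", "-t", tag, context]


def run_command(tag: str = IMAGE_TAG) -> List[str]:
    """Step 3: start a container, an executing instance of the image."""
    # --rm removes the container once its process exits.
    return ["docker", "run", "--rm", tag]


if __name__ == "__main__":
    print(render_dockerfile())
    print(" ".join(build_command()))
    print(" ".join(run_command()))
```

On a machine with Docker installed, the rendered text could be saved as `Dockerfile` and each command list passed to `subprocess.run`.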
Why is Docker for Data Science trending?
Have you ever heard these comments from coworkers?
“I don’t know why it’s not working here; it’s working fine on my computer.”
“It takes too long to install everything on Windows, Linux, and macOS and rebuild the same environment for every OS.”
“I can’t install the package you used. Can you help, please?”
“I might need more computer power, I can use AWS, but it takes too long to install all the packages and configure the settings.”
Most of these issues are easily resolved with Docker.
The only exception at the time of writing is GPU support for Docker images, which works only on Linux machines. Other than this, you are all set.
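For the GPU case, `docker run` accepts a `--gpus` flag on hosts where the NVIDIA drivers and the NVIDIA Container Toolkit are installed. The sketch below only assembles such a command line; the helper name and the image name `ds-env-gpu` are hypothetical, and the host setup is assumed rather than checked.

```python
from typing import List


def gpu_run_command(image: str, command: List[str], gpus: str = "all") -> List[str]:
    """Build a `docker run` command that exposes host GPUs to the container."""
    # `--gpus all` requests every GPU; a count such as "1" is also accepted.
    return ["docker", "run", "--rm", "--gpus", gpus, image, *command]


if __name__ == "__main__":
    # `ds-env-gpu` is a placeholder image name for this example.
    print(" ".join(gpu_run_command("ds-env-gpu", ["nvidia-smi"])))
```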
Why should a data scientist care about Docker?
Broadly, there are two use cases for Docker in machine learning:
The run-only container lets you edit the code in your local IDE and run it with the container, so the code executes inside the controlled environment of the Docker container.
The end-to-end container means that JupyterLab or your IDE, along with the rest of your working environment, runs inside the container, and you run the code there as well.
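The two use cases differ mainly in what is started inside the container. As a rough sketch, the helpers below assemble a `docker run` command line for each case. The function names, the `/work` mount point, the image name `ds-env`, and the host path are assumptions for illustration, and the image is assumed to have Python (and, for the second case, JupyterLab) already installed.

```python
from typing import List

# Hypothetical mount point inside the container for this sketch.
CONTAINER_DIR = "/work"


def run_only_command(host_dir: str, image: str, script: str) -> List[str]:
    """Run-only: code is edited on the host and executed in the container."""
    return [
        "docker", "run", "--rm",
        # Bind-mount the host project so local edits are visible in the container.
        "-v", f"{host_dir}:{CONTAINER_DIR}",
        "-w", CONTAINER_DIR,
        image, "python", script,
    ]


def end_to_end_command(host_dir: str, image: str, port: int = 8888) -> List[str]:
    """End-to-end: JupyterLab itself runs inside the container."""
    return [
        "docker", "run", "--rm",
        # Publish the notebook port so a browser on the host can connect.
        "-p", f"{port}:{port}",
        "-v", f"{host_dir}:{CONTAINER_DIR}",
        "-w", CONTAINER_DIR,
        image,
        "jupyter", "lab", "--ip=0.0.0.0", f"--port={port}", "--no-browser",
    ]


if __name__ == "__main__":
    print(" ".join(run_only_command("/home/me/project", "ds-env", "train.py")))
    print(" ".join(end_to_end_command("/home/me/project", "ds-env")))
```

In both cases the project directory stays on the host, so work survives after the container is removed.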
Docker has a learning curve for some data scientists and developers, but it’s well worth the effort. Plus, it won’t hurt to brush up on your DevOps skills.
Are you using Docker for your data science efforts? Head to Qwak for more guidance and tools.