Hello everyone.
I am Tania and this is my
presentation for PyCon 2020 online.
I'm going to be talking about Docker
and Python and how we can make them
play nicely and securely for data
science and machine learning.
I am a developer advocate at Microsoft.
My role is helping other developers,
professionals, researchers, and academics
do their work better.
This can be by creating new applications
or developing content for them and
working with our engineering teams.
You can find me on Twitter at @ixek or
on GitHub, and my website is trallard.dev.
You'll also be able to access the
slides on that website, and a link is
going to be accessible on all of the slides,
so you can refer to them whenever you want.
So let's start with what
you're learning today.
First, we're going to learn why you'd
want to use Docker, what its
advantages are, and the specifics
of some Docker problems when working
with data science and machine learning.
I'm also going to give you some
tips around security and performance
when you're working with Docker
and containers in general.
And some tips on how not to
reinvent the wheel, but to automate.
And finally, I'll provide you some useful
and practical tips on how to use Docker
for all your data science projects.
So let's start with: why Docker?
Why you might want to use Docker
for data science, or for anything really.
More than likely you're developing
your applications, or your models,
or your scientific projects,
on your local computer.
So you might be familiar with this
kind of development where you create
your application, whatever it is.
I'm just going to generalize here.
And then you might want to ship
this to somebody else, whether
it's a customer or a colleague.
And how many times have you encountered
this issue where it says that a
module doesn't exist? And that's
probably the least of your problems.
What about folks using different runtimes?
Maybe someone's still using Python 2 and
you've developed your whole algorithm
and your whole model using Python 3.
Or what about you developing your
applications on Ubuntu and somebody
else has been using Red Hat,
or was using Debian or Windows?
How are people going to know what
dependencies, environment, or
configuration files they need
if you're not explicit about it?
Here's where Docker comes into play.
Docker is a tool that allows you to
create, deploy, and run your applications
or projects by using containers.
Throughout the presentation,
I'm going to be using that little
icon to represent a container.
But how do containers help you?
Well, the most obvious answer is
that they provide you with a
solution to the problem of how you can
get your software or your models or your
applications moving from
one computing environment to another.
This could be going from your laptop
to your test, staging, or production
environment, or it can be from your
laptop to somebody else's laptop.
So in a world where you're using
containers to do your development
or data science work, this is
what your workflow would look like.
You develop your application.
But in this case, you're also adding
the libraries, all the dependencies and
runtime environment and the configuration
files that you need for it to work.
So you build your container with your
application and all of these extra assets.
Then, when anyone wants to install
this or run this on their
machine, it's actually doable.
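As a minimal sketch, that build-and-run loop looks something like this on the command line (the image name and tag here are just placeholders):

```bash
# Build an image from the Dockerfile in the current directory
docker build -t my-project:0.1.0 .

# Anyone with access to the image can then run it as a container
docker run -it --rm my-project:0.1.0
```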
So I can imagine that those of you
who are familiar with virtual machines
might think that this looks
or sounds a bit familiar.
The way Docker and containers work is
that you have your infrastructure,
probably a server, the cloud, your
computer, or your virtual machine, and its
host operating system; sitting on top
of that, you'll have Docker, and then you
can have a lot of different apps that are
containerized running within this setup.
In this scenario, each app is
containerized, so Docker and containers
work at the application level; each
app runs as an isolated process.
And the whole operating system and
infrastructure are abstracted away.
So it doesn't matter if you develop
using Ubuntu or Red Hat and
somebody else is using Windows.
Virtual machines, on the other hand,
will also start with your infrastructure,
but they will have the hypervisor
sitting on top of that.
So your hypervisor will be sitting on
top of your physical server, and you'll
have two whole different operating
systems running on top of that.
So the abstraction works at a
hardware level, meaning that you're
going to have not only your app,
but the whole operating system,
binaries, dependencies, and everything
on top of your infrastructure.
As you delve into the world of
Docker, you're probably also going to
use or hear the words image and
container, and these terms can seem
a bit confusing at the beginning.
So let's start demystifying them.
An image is an archive with all the
data needed to run your app and as an
archive, it needs to have a tag attached.
It can be LATEST or it can be a
concrete tag, whether you're using
semantic versioning or calendar
versioning, or a reference to a commit
if you're using version control.
You decide what your
tag is going to look like.
When you pull that image from
a repository, probably Docker Hub
or your private repository, and
use the command docker run,
the Docker image actually creates
a container, where you can do your
development work, and all of your work
is going to be in this isolated container.
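As a rough sketch of that pull-and-run workflow (the image and tag here are examples):

```bash
# Pull a tagged image from a registry such as Docker Hub
docker pull jupyter/base-notebook:6.0.3

# docker run creates and starts a container from that image
docker run -it --rm jupyter/base-notebook:6.0.3
```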
So what are some of the common pain
points when using data science,
machine learning, and Docker together?
First of all, we tend to use
complex setups or dependencies.
And this is just because of the nature of
Python and some packages and libraries.
We also have a high reliance
on data and databases.
A common characteristic of data
science and machine learning projects
is that they tend to be
very fast-evolving projects.
During the phase of research and
development, we tend to follow a very
fast and iterative process.
And as you go deeper into Docker you
are going to find that it's actually
quite complex and it might take
quite a lot of time to upskill and
learn everything you need about it.
Also, a common concern is whether
containers are secure enough for
our data or our model or algorithm.
What happens when you're
working with healthcare data?
Or it could be any financial or any
other sort of data that could help
to easily identify individuals.
So let's dive into scenarios, tips,
and best practices for Docker for
data science and machine learning.
Some of these are going to
be specific for our use case.
Some others are generally applicable.
So if you are not a data scientist or
machine learner, but rather a software
engineer or a DevOps person, you can also
learn and apply some of these tricks.
Okay.
So, we've already learned what
are some of the main problems and some
of the complexities that we might
encounter when we are using Docker
or just working in data science.
So how is actually creating
containers for data science
projects different from webapps?
That was one of the first
things that I mentioned.
I said, we can have very complex setups
and dependencies on installations.
A lot of folks have been discussing this
very openly in our Python community,
because this is a problem that is not
completely unique to data science
or the scientific computing ecosystem.
The trick to overcome this is finding
a robust, reproducible way of working.
You might feel tempted to try all
of the newest tools and have a very
complex environment, but sometimes just
choosing a tool and adhering to best
practices can take you a very long way.
Now.
When we're working in data science and
machine learning, not every deliverable
is an app, not all of the things that
we develop as part of machine learning
workflows can be productized as an app.
Also, not everything is a model
and not everything is going to be
delivered as a model that can be
accessed through an endpoint or a URL,
or it's going to be on a smartphone.
We heavily rely on data.
Data is one of our most valuable
assets, and we'll also have a
mixture of wheels and compiled packages.
A lot of the Python scientific
ecosystem packages actually
rely on compiling binaries.
When you're working with sensitive
data, we also need to make sure that
we have security access levels for
both data and software, and we're
also going to have a mixture of
stakeholders, data scientists, software
engineers, machine learning engineers.
If you're working on creating an
app that uses machine learning on
the backend, you're also going
to need someone taking care of the
front end and mobile development.
So this increases the complexity
of your projects, and each of
these stakeholders have different
requirements and different needs.
Let's dive into the basics.
If you've never used Docker images,
this is going to help you to
better understand the workflow.
If you've used Docker before, this
is going to serve as a refresher.
The way we work with Docker and
create containers and images is by
specifying all of the instructions
and requirements through Dockerfiles.
On the right-hand side, you have a not
very good example, well, a bad example
of how to build a Dockerfile, but
it should be simple enough for you to
understand how we can structure these.
We normally start from a base image;
that is, let's say, our building block.
Like when you're using Lego blocks,
you need a solid foundation.
Then you're going to have a set of
instructions, and this is where you
actually install all your packages, all
the dependencies, compile binaries and
move files from your local development
environment into your container.
And finally you have an entry point.
This can be either to run an
interactive application like
Jupyterlab or executing a script or
simply to access the command line.
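To make that structure concrete, here is a minimal sketch of a Dockerfile following that base image, instructions, entry point pattern (the requirements.txt and train.py files are assumptions for illustration):

```dockerfile
# Base image: the solid foundation
FROM python:3.8

# Instructions: install dependencies, then copy your project in
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
WORKDIR /app

# Entry point: run a script, start JupyterLab, or drop into a shell
CMD ["python", "train.py"]
```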
Something that you have to be
aware of is that each of the
instructions contained in
your Dockerfile creates a layer.
So the closest analogy for you to think
about for a Docker container is an onion:
your base image is going to be at the
core, and every time you run an instruction
using the command RUN, it's going
to generate a layer, which is going
to be wrapping the previous layers.
So everything is self-contained. The
more instructions you have,
the bigger your onion, your
container, and the number of layers.
So for each instruction in a
Dockerfile, you have a layer.
So you have to be very smart about
how you're creating those layers.
One of the crucial parts
when building containers is
selecting the best base image.
Okay.
You are going to read in a lot of
places that Alpine Linux is the best
option because it's very lightweight
and you can build a
container quite easily, but I would
like to give you a word of caution.
To start with, if you were to build
your images from scratch, make sure
to use the official Python images.
You can find them all in those links and
find all the available versions and tags.
Now, going back to the question of
Alpine Linux: it is a very lightweight
base image, which means that you're
going to have to spend a lot of time
building libraries and dependencies,
and it's not worth the complexity.
If you're looking for Python 3.8,
I recommend you use slim-buster.
It is based on Debian, and it also has
long-term support at the moment, so it's
very likely to have continuous security
updates for the next couple of years.
If you're going for Python 3.7,
I recommend going for
slim-stretch or slim-buster.
The slim images have the full
distribution, but they all remove
unnecessary files like manual pages.
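So, as a small sketch, the pinned base images I'm recommending look like this (pick the variant matching your Python version):

```dockerfile
# For Python 3.8, a Debian-based slim image
FROM python:3.8-slim-buster

# For Python 3.7, slim-stretch or slim-buster also work:
# FROM python:3.7-slim-stretch
```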
Creating a Docker image from
scratch can take a very long time.
So if you find that you are going to
be needing conda, Jupyter notebooks,
and the scientific Python ecosystem,
which is more than likely
for data science and machine learning,
use the Jupyter Docker stacks.
The Jupyter community has already
put a lot of effort into making
these containers and images.
They all start from an Ubuntu base image.
And they layer one on top of each other.
So if you only need the
Jupyter notebook and you want
to install your dependencies
yourself, you can build from there.
If you need more complex setups with
scientific Python packages as well as
TensorFlow, there is something for you.
You can also use PySpark and a data
science notebook that also contains R.
And I think there are also
some other community stacks
that allow you to use Julia.
You can also use those base
images to build your own Docker images.
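For example, you can start one of the stack images directly and get Jupyter right away (a sketch; substitute a concrete tag, as discussed next):

```bash
# Run the SciPy stack image and expose the notebook server port
# (replace <tag> with a concrete tag)
docker run -it --rm -p 8888:8888 jupyter/scipy-notebook:<tag>
```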
So without further ado, let's
go into some of the best
practices for you to get started.
First of all, always know
what you're expecting.
And this goes without saying.
Always make sure that you're using
a concrete tag, so in the directive
FROM, always use a specific tag.
For example, base-notebook:6.0.3.
Avoid using LATEST, because more
than likely when the next push
happens or the next image is created,
that's going to change your base.
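In a Dockerfile, that pinning looks like this (using the example tag mentioned above):

```dockerfile
# Good: a concrete, reproducible base
FROM jupyter/base-notebook:6.0.3

# Risky: LATEST moves underneath you
# FROM jupyter/base-notebook:latest
```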
Provide context with labels.
This is very important.
You can add important information,
like who the maintainer is,
who the security contact is,
whether it's meant to be used in
production, and many, many others.
Sometimes we might feel the need to have
very complex RUN statements, but
legibility is extremely important.
Make sure to split RUN
statements and sort them.
Prefer the COPY command to add files
when all you're doing is copying packages,
configuration files, or something else.
You'll see that there also
exists an ADD command, but
COPY is more straightforward
and it's also very explicit.
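Putting those tips together, a hedged sketch might look like this (the label values and package list are placeholders):

```dockerfile
# Context via labels
LABEL maintainer="you@example.com"
LABEL environment="development"

# One RUN per scope, with packages split across lines and sorted
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    git \
 && rm -rf /var/lib/apt/lists/*

# COPY rather than ADD for plain files and configuration
COPY config.yaml /app/config.yaml
```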
Also, Docker is very
good at using its cache.
That means that if you have a dependency
or something that has not changed within
your layers, then it preserves that layer
to speed up your build without needing
to download dependencies, compile
them, or install them on the fly.
Since every instruction creates a layer,
whenever any of the files
contained within these instructions
is modified, the subsequent layer
or layers have to be rebuilt.
So be very careful when you're
structuring your Dockerfile.
Separate instructions per scope.
Make sure that you're actually
using the cache to its maximum.
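For example, ordering your instructions so the stable parts come first keeps the cache warm (a sketch, assuming a requirements.txt):

```dockerfile
# Dependencies change rarely: install them first so this layer is cached
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Source code changes often: copy it last so edits only rebuild this layer
COPY src/ /app/src/
```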
And also only install necessary packages.
Sometimes we might feel the need to
have, I don't know, additional packages
just in case we end up needing them,
but that is going to bloat your image
significantly and also increase the
potential risks, the potential security
issues, and the image size overall.
And also, explicitly ignore files.
If you're familiar with git for version
control, you can create an analogous
.dockerignore file.
Not only will it help you to
avoid carrying unnecessary files
like your README or some data.
Always try to avoid adding
data, especially when we're
working in data science.
We don't want to hard code the
data within our container, but
we want to access the data from a
database or even from our drive.
Also, make sure to never expose secrets;
explicitly marking files with secrets
in our .dockerignore will ensure
that they are not within our layers.
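A .dockerignore along these lines is a reasonable starting point (entries are examples; it uses the same pattern syntax family as .gitignore):

```
.git
README.md
data/
*.csv
.env
secrets/
```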
I've already told you not to
add data to your Dockerfiles.
So there are two ways, or really
multiple ways, to do this.
One is you can access a database
and ensure that your container
has access to whatever port
your database listens on.
The other is, if you have your data
locally or, for example, on your
cloud virtual machine, you can
use volume mounts to directories:
just specify which directory
your data or your files are in and
mount it into your container.
That way, you can run your container,
do your development work and the
progress or changes will be saved
directly in your local files.
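A hedged example of such a volume mount (the paths and image are placeholders; /home/jovyan is the home directory used by the Jupyter stack images):

```bash
# Mount the local data directory into the container;
# changes made inside the container land directly in ./data
docker run -it --rm \
  -v "$(pwd)/data":/home/jovyan/data \
  jupyter/base-notebook:6.0.3
```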
Also, there is a big caveat there.
You need to ensure that you're
running as a non-root user, because
otherwise you might have problems accessing
files or changing the permissions of files.
I'm going to dive into this
deeper when we come to security.
So thus far, I've only given you
some good practices and how to start
building your Docker containers.
These are gonna take you a very long way.
One of the major concerns about using
Docker and containers is security.
I have mentioned before that you have
to use the USER directive when running
a container, and this is done
to minimize privilege.
We call this favouring the least
privileged user, and this is very important
to minimize or prevent attacks.
First, run as a non-root user.
By default, Docker runs as root, and
this is because you normally need to
update libraries within your system.
So whenever you need to do an
install, you are able to run apt-get,
because you are the root user.
If you're using any of the Jupyter
stack containers, you don't have
to worry about it because the folks
from Jupyter have already spent a
lot of time making sure that you're
not running as a privileged user.
However, if you're extending any of their
Docker containers, you need to make sure
that you also add an unprivileged user.
Change to root to do any system
updates, and then switch back
to your non-privileged user.
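As a sketch, that root/non-root dance looks like this in a Dockerfile (the user name and package are placeholders):

```dockerfile
# Create an unprivileged user
RUN useradd --create-home appuser

# Switch to root only for system-level updates
USER root
RUN apt-get update && apt-get install -y --no-install-recommends libgomp1 \
 && rm -rf /var/lib/apt/lists/*

# Switch back to the unprivileged user for everything else
USER appuser
```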
Be very careful when working
with sensitive information.
Sometimes it's very easy to have
environment variables that correspond
to passwords or keys or API tokens.
The first step to secure your
containers is adding the files that
contain those keys to your .dockerignore,
so they don't end up in there.
Try to keep all of these
out of your Dockerfile.
You can set runtime environment
variables that can then be provided
by flags or other context.
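For example, rather than baking a token into the image, you can pass it in when the container starts (the variable and image names are placeholders):

```bash
# The token lives in your shell environment, not in any image layer
docker run -it --rm -e API_TOKEN="$API_TOKEN" my-project:0.1.0
```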
Sometimes we can feel that if we
copy those files into our container
within a layer and then delete them,
then, because we're building on top
of that, they're not going to persist.
But because of the onion organization
of containers, even things that
we've moved and deleted in previous layers
are cached, so people can access them.
A very powerful way to secure your Docker
containers and to keep your secrets
secret is using multi-stage builds.
If you fetch and manage secrets
in an intermediate layer,
that layer or that image is disposed of,
and then they are not
persisted in the final image.
Another advantage, and this is very,
very important and helpful, especially
in the scientific Python ecosystem, is
that not all of our dependencies
will have been packaged as wheels.
You might need a compiler,
whether it's GCC or gfortran.
So it's very helpful to have those
compilers compile the binaries in an
initial image and then just install
or copy across the compiled packages.
And in general, when you're
using multi-stage builds, it helps
you to create smaller images,
because you can get rid of all the
unnecessary and unwanted files and
then just keep what you really
need. In the Dockerfile that we have
here as an example, we're starting
our image from Python 3.8.2 and
we're calling it compile-image,
because that's where we're actually
doing some system updates, installing our
compilers, creating a virtual environment,
and compiling our dependencies.
So when we do docker build with whatever
our image tag is and provide a context,
Docker will start building that compile-image.
And once that's completed, it's going
to move on to the runtime-image.
And in this runtime-image, what we're
going to do is copy only the virtual
environment, or you can copy only
some of the compiled packages.
This way you're getting rid of
the compilers and any other
secondary files that you don't need.
Your final image is going
to be the runtime-image,
but with the tag that you provided when
you called the command docker build.
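Here is a hedged reconstruction of that kind of multi-stage Dockerfile (a sketch along the lines of the slide's example, assuming a requirements.txt):

```dockerfile
# Stage 1: compile-image, with compilers and build tools
FROM python:3.8.2 AS compile-image
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    gfortran \
 && rm -rf /var/lib/apt/lists/*
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime-image, which copies only the virtual environment
FROM python:3.8.2 AS runtime-image
COPY --from=compile-image /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
CMD ["python"]
```

You would then build it with something like docker build -t my-project:0.1.0 . and the final tagged image is the runtime-image.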
The final part of this talk
refers to automation.
So far I've given you a lot of tips
and a lot of ways to increase
performance and to ensure that
your image is secure, but it all can
be very, very daunting, especially if
you're only building containers for your
personal use and for reproducibility.
One of the main advantages of using
Docker containers is reproducibility.
We care a lot about this when we're
doing machine learning and research
and scientific computing; we usually
want to share our assets with others,
or we want to deposit our data,
code, and papers so that folks can
verify our research and our findings.
So one of the best things that
we can do is try to automate.
Whenever I talk about reproducibility
in the context of machine learning
and scientific computing, something
that I always recommend to everyone
is to have a standard template.
That way everything lives within
a specific space and you always
know what you're expecting.
You know what to expect in terms of
your folder structure and where files
should be, and you have explicit conventions
about your directories and naming.
If you're looking for good project
templates for data science projects,
try the Cookiecutter Data Science template.
If you're interested in a Docker version
of it, use Cookiecutter Docker Science.
When you use these packages, they're going
to create a robust template for you to start
your data science project from. You'll
see in the image the project layout for
the Cookiecutter Docker Science template.
So we have the configuration for Jupyter.
We have a Dockerfile.
We have our directories for our models,
requirements, and additional scripts.
So this is, by default, enabling
you to use Jupyter notebooks.
In a similar way, the Jupyter Docker
stacks are all enabled
with Jupyter notebooks.
So whenever you run your Docker container
using any of their images, the first
thing that you have is access to the
terminal and to your Jupyter notebooks.
So you have a reliable and
well known environment.
The second step is: do
not reinvent the wheel.
Well, sometimes we might feel the need
to start everything from scratch, but
there are very good packages that the
community has already been working on.
One of my favorite
packages is repo2docker.
It will take a repository,
either local or through a URL.
The main advantage is that it's
already configured and optimized for
data science and scientific computing.
The common workflow is that you're
going to have a configuration file to
explicitly describe the requirements
that your project relies on.
Once you have that, you execute the command
jupyter-repo2docker and provide a path.
It can either be a local directory,
or it can be a URL corresponding
to a repository, or a DOI from Zenodo.
It will then use some of the
Jupyter Docker stack bases and
build your Docker container.
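As a sketch, the invocation looks like this (the URL is a placeholder):

```bash
# Install repo2docker, then point it at a repository
pip install jupyter-repo2docker
jupyter-repo2docker https://github.com/example/my-project
```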
The neatest thing about this is that you
don't even have to write your Dockerfile.
You don't have to spend a lot of
time refining your Dockerfile
and learning how to go about it.
So if you are looking for a
reliable way to create Docker
containers, repo2docker is your tool.
Whatever you're using to manage
your dependencies and your
environments, whether it's
conda, pipenv, pip, or pip-tools,
even if you are working in R or
Julia, you can use it to create a
container. And bonus points, because
those containers are of the same
format as the ones used for Binder.
So if you're looking at sharing
your assets in a reproducible
and interactive manner, with
minimal overhead for others,
Binder and repo2docker are
your go-to tools.
And finally delegate to
your continuous integration.
So far I've mentioned that you can
create your images and tag them.
But there are a lot of instances
where you want to create a new
Docker image or a new version.
Probably you created a release of your
package or your model or your app.
So you create a tag in your version
control system and you want an
accompanying Docker image to make sure
that everything is up to date, but it
is also a good practice to rebuild
your images frequently.
If you are extensively using Docker
for research and development, testing,
and production, probably weekly
is a good idea; that way you
ensure that you are getting all
the security updates released.
But it can be cumbersome
to do this by hand.
So the best way to go about it is
using continuous integration: whatever
tools you are already using, whether
it's Travis, GitHub Actions, or anything
else, delegate your builds. That will
allow you to care less
about the building and pushing of your
image and focus more on your code.
I've added an example here of
a workflow for GitHub Actions.
For me, I have two triggers:
one is whenever I create a new tag,
and I'm also specifying that my image
is going to be built every week on
Sundays at two o'clock in the morning.
But you can choose any time of the day.
Because GitHub Actions allows you to keep
all your keys and credentials as secrets,
you don't have to worry about your
Docker username or password being exposed.
And because it integrates nicely
and directly with GitHub and
your version control system, you
can directly tag your Docker images
using a tag reference or the SHA
of any of the refs from GitHub.
So you can use details from the branch
that is coming in, the date, and so on.
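Here's a hedged sketch of what such a workflow file could look like (the image name, secrets, and action versions are placeholders):

```yaml
name: build-and-push
on:
  push:
    tags: ['*']         # trigger on new tags
  schedule:
    - cron: '0 2 * * 0' # and every Sunday at 02:00

jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Log in to Docker Hub
        run: echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
      - name: Build and push
        run: |
          docker build -t myuser/my-project:${{ github.sha }} .
          docker push myuser/my-project:${{ github.sha }}
```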
So in this workflow, you would want to
have your code in version control.
This includes, for example, all your
scripts and specification files, whether
it's an environment.yaml file, a Pipfile,
a requirements file, or any of the such.
And then you can specify the trigger,
whether it's a tag and/or a schedule trigger.
So whenever either of these two events happens,
whether it's Sunday at 2:00 AM or
you created a tag, that's going to
trigger your continuous integration.
That is going to build your image,
and then it's going to push your
image to your registry, so you don't
have to worry about your code
and your images getting out of sync.
So to summarize, let me give
you the top tips for you to work
with Docker and data science.
First, rebuild your images frequently.
Make sure that if you're going
to be using your containers,
you are constantly rebuilding them.
That ensures that you get security updates
and also that things are not broken.
Never work as root.
Minimize the privileges; the lower
the privileges you have, the better.
Also keep track of all the ports
that you are using and how you're
binding them to your container.
You don't want to use Alpine Linux; go
for buster, stretch, or the Jupyter stacks.
Always know what you're expecting:
pin your versions, pin everything.
Pin your images and use a commit
identifier to tag your images.
The same goes if you are using pip-tools,
conda, poetry, or pipenv to manage
your environment and dependencies.
And leverage the build cache we discussed;
it significantly affects the performance
as well as the security of your image.
Sometimes we feel like we have
to reuse the same Dockerfile
for all of our projects.
Eventually you're going to realize that
you end up with a super big image with a lot
of dependencies that you really don't use.
So make sure to use one Dockerfile
per project; you can have a base
image, for example, starting from the
base-notebook from the Jupyter stacks.
Use multi-stage builds.
If you're going for the option
of creating your own containers
from scratch, consider this.
If you need to compile code or
you need to reduce your image
size, this might be a good option.
Also, make sure that your images are
identifiable, that you can identify
whether they are meant to go into
testing, production, or research and
development. Be careful when accessing
your databases and when using environment
variables, as well as variable files.
Do not reinvent the wheel.
If you don't need really complex setups or
very specific packages or configurations,
use repo2docker or the Jupyter
ecosystem stacks to get you started.
You don't have to do everything
from scratch and you also don't
have to do stuff manually.
Try to automate.
If you're already using GitHub,
try to explore GitHub Actions.
If you are using GitLab, it
also has incredible CI and CD
integration; try to leverage them.
And finally, use a linter.
I didn't dive into this before, but
it can go a very long way when you're
building your Dockerfiles, so you can
identify errors early on. Depending
on the editor that you are using, there
are some really good linters out there.
VSCode has an excellent Docker extension.
It can help with image inspection,
building, and linting, adding support for
Dockerfiles and syntax highlighting.
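One concrete option, as a sketch, is hadolint, a Dockerfile linter that can run from its own container:

```bash
# Lint a Dockerfile without installing anything locally
docker run --rm -i hadolint/hadolint < Dockerfile
```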
If you follow some of these top
tips, I can guarantee your workflow
and your images are going to
work much more nicely for machine
learning, Python, and data science.
I am sure you can benefit from a lot
of these; take and use whatever you need.
Automate wisely and reuse wisely as well.
I hope you have enjoyed this talk and
thank you very much for your attention.
If you need to reach out or have
any questions, these are the means
by which you can contact me.
Keep enjoying PyCon online.
Bye.
