Elle O'Brien:
Hi, I'm Elle, and this is DeeVee, and I'm
going to attempt to explain DVC in a couple
of minutes using no code. Machine-learning
models are a special animal. They're not completely
defined by the code that you use to specify
them or by the dataset that they're trained
on, they're a reflection of both of those
pieces, and so every possible way of changing
our dataset, changing the way we process our
dataset, and changing our code represents
its own experiment, and a machine learning
project as a whole is like a family of all
of these different models, and so the complexity
is huge. It's not complexity in Big O notation,
I'm talking about the complexity of all of
the possible models that we could've created
from every version of our dataset, all of
the different ways we tried transforming it,
and all of the ways that we tried modeling
it.
Elle O'Brien:
The complication from this is that it becomes
very difficult to log what we've already done,
to reproduce what we've done, to share what
we've done, and all around, it makes machine
learning difficult for none of the fun reasons
that most of us get into it, so we need tools
to help us deal with this complexity, and
we have a couple of constraints for our solution.
We want it to be extremely flexible. Ideally,
it would work with any programming language,
any machine learning framework, whatever,
and we want it to be pretty straightforward
to learn based on all the tools that people
are already using to manage software projects
because we want to keep the cognitive load
light.
Elle O'Brien:
The underlying philosophy of DVC is to use
a tool that's really popular and really successful
for managing the complexity of software development
projects, which is Git version control and
extend it so that we can use it for data science
and machine learning, so I'm going to attempt
to demonstrate why this requires a tool using
only some objects from around my house.
Elle O'Brien:
I've got these index cards which can represent
the files in my project. Maybe I've got a
script to pull the data from storage, a script
to a clean it, a script to process and featurize
it, and a script to model it. These are lightweight,
right, they're easy, and I want to use Git
to take a snapshot of my project at any point
in time, and so Git commits are a snapshot,
and so Git is built for lightweight files,
right? It handles little files, so we can
take a snapshot, we can just make a copy of
everything in our project at some state in
time, and then we can have a bunch of these
commits and that means we can always revert
back to "Oh, where was I right then? What
did it look like?" Easy, great, works for
everyone.
Elle O'Brien:
Except when you have really big files. In
data science, your datasets and your models
are big, right? They're not index cards anymore,
they're more like the fourth book in The Twilight
Saga. This came with my house. Books one through
three are not here, I only have this book.
I don't know. Anyway, it's not going to fit
in my Git repository, so what we can do is
take another very lightweight file and I can
write down on this file where I'm going to
store this book. I could write on this, "Okay,
I'm going to store it on the third shelf of
my bookshelf, second from the left," and then
I can put this in my Git commit and boom,
now I've got a way to access my dataset, even
though I'm not trying to fit this dataset
into my Git repository.
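
The index-card-and-bookshelf idea above can be sketched in a few lines of Python. This is only an illustration of the general pointer-file pattern, not DVC's actual implementation or file format: the function names `track` and `restore`, the `.pointer.json` suffix, the `shelf` directory, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large file never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(big_file: Path, shelf: Path) -> Path:
    """Copy the big file onto the 'shelf' and write a small pointer file.

    The pointer file is the index card: it is tiny, so it can be
    committed to Git in place of the big file itself.
    """
    checksum = file_sha256(big_file)
    shelf.mkdir(parents=True, exist_ok=True)
    # Store the content under its checksum, so each version gets its own slot.
    shutil.copy2(big_file, shelf / checksum)
    pointer = big_file.with_name(big_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": big_file.name, "sha256": checksum}, indent=2))
    return pointer


def restore(pointer: Path, shelf: Path) -> Path:
    """Read the pointer file and fetch the matching content from the shelf."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(shelf / meta["sha256"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        book = root / "dataset.bin"
        book.write_bytes(b"pretend this is a very large dataset")
        card = track(book, root / "shelf")
        book.unlink()  # the big file is gone from the working directory
        print(restore(card, root / "shelf").read_bytes())
```

Because the stored copy is named after its checksum, a changed dataset lands in a new slot on the shelf, and an older commit's pointer file still leads back to the older version.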
Elle O'Brien:
That's the gist of DVC. Lots of people engineer
their own ways to do this, it's a pretty popular
approach, but the point of the DVC open-source
project is to really polish and standardize
this so that people don't have to engineer
this themselves every time they want to
use Git versioning and keep track of their
big files like models and datasets.
Elle O'Brien:
DVC is more than just that one trick, so starting
from the idea of using meta files to version
datasets and models, plus the cultural practices
around using Git, we've built some other features
that tend to be pretty popular. Pipelines
are a cool way to tie together datasets to
scripts to models and version that as a pipeline.
Another cool thing is metrics and plots, so
DVC metrics allow you to compare model performance
across commits, and recently, we added plots
so that you can visualize how a model has
changed across commits. Another value is continuous
integration, so continuous integration is
a really foundational idea in DevOps for automating
frequent tests of your project, and so with
DVC, you can use continuous integration systems
to automate testing of your machine learning
models.
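
As a rough picture of what comparing model performance across commits means, here is a minimal sketch in Python. It does not use DVC or Git at all: the commit labels, the metric names, the numbers, and the helper `diff_metrics` are all made up for illustration, and the metrics for each commit are assumed to have already been loaded into plain dictionaries.

```python
from typing import Dict

Metrics = Dict[str, float]

# Hypothetical metrics recorded at two commits of the same project.
metrics_by_commit: Dict[str, Metrics] = {
    "commit_a": {"accuracy": 0.88, "loss": 0.41},
    "commit_b": {"accuracy": 0.91, "loss": 0.35},
}


def diff_metrics(old: Metrics, new: Metrics) -> Metrics:
    """Return new minus old for every metric present in both commits."""
    shared = sorted(old.keys() & new.keys())
    return {name: new[name] - old[name] for name in shared}


if __name__ == "__main__":
    changes = diff_metrics(metrics_by_commit["commit_a"], metrics_by_commit["commit_b"])
    for name, delta in changes.items():
        # A signed delta shows at a glance whether the metric went up or down.
        print(f"{name}: {delta:+.3f}")
```

The point of the sketch is only that once metrics live in small files versioned alongside the code, comparing two commits reduces to comparing two small records.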
Elle O'Brien:
That's the scope of DVC. There is an open-source
community, so lots of active development.
There's pretty much always something being
made, so I definitely recommend checking out
the project repository. That was my five-minute
explanation. How did I do? Let me know in
the comments. Any questions, we'll try to
answer them, and thank you for watching.
