- Let's jump right into
the next generation
data science workspace
and I'd like to start with some old news.
If you're attending this conference, you probably don't need any more convincing that the data and AI trend has been transforming all industries for a couple of years now.
Almost all major companies are realizing that these investments lead to competitive advantages. Now, if you ask me, I don't know what those remaining 12 Fortune 1000 companies are up to. And as a result, the job market has been playing catch-up to meet the demand.
Three out of the top eight jobs in LinkedIn's latest Emerging Jobs Report are data and AI roles. Congratulations to you if you're in any one of these categories; I'd say you have some pretty good job security ahead of you.
Now, one thing you will notice is that there's a diversity of roles needed to deliver on these strategic initiatives. It's not just one abstract data and AI engineer; solving the world's toughest problems really is a team sport, and you need all of these functions working together as a team and not in isolation. Of course, at Databricks we call this Unified Data Analytics, and we've built a platform for that.
A collaborative data science workspace becomes the focal point of such a platform. It needs to cater to all of these users' needs and bring them together. Today I'm going to introduce you to the next generation of our data science workspace, an open and unified experience for modern data teams.
But before we do that, let's look at where we're coming from. We've already come a long way in simplifying data and AI to help data teams innovate faster.
On the left here, you can
see a MapReduce program
to count words.
You should probably remember this
just in case it ever
comes up in an interview.
Of course, writing this code snippet is just the first step. Back when this was still cutting edge, you would also have to worry about setting up a cluster, writing a couple more lines of configuration, and hoping you got it all right. Thankfully, today you don't have to worry about any of this. In Databricks you just write a Spark SQL query and hit run. The cluster in the background auto-scales for you, and you just sit back and wait for the result.
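For reference, a minimal sketch of what such a word count could look like as a Spark SQL query run from a notebook; the table name `documents` and its `text` column are assumptions for illustration, not part of the original slide.

```python
# Minimal sketch only: count words in an assumed `documents` table with a `text`
# column. On Databricks, cluster auto-scaling happens behind the scenes.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

word_counts = spark.sql("""
    SELECT word, COUNT(*) AS count
    FROM (
        SELECT explode(split(lower(text), ' ')) AS word
        FROM documents
    )
    GROUP BY word
    ORDER BY count DESC
""")
word_counts.show()
```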
Now of course this is a very simple example, but it extrapolates to the full complexity of data engineering at scale.
But as mentioned, this is
only part of the solution.
Where are we when it comes
to data science today?
When it comes to productionizing
data science projects today,
we're still dealing with a big mess.
If you've ever tried to do statistical data analysis or maybe train a machine learning model, you know that the tools that are supposed to make your life easier are still very difficult to use. In fact, according to one study, only about 15% of those Fortune 1000 companies have deployed AI capabilities into widespread production.
The reason, of course, is that the tools available to these companies, specifically in enterprise software, haven't kept up with emerging practices in data science and machine learning. As a result, we're made to choose between three options that are all not that great.
The first, and for many the most natural, option is to just give everyone the freedom to do whatever they want on their laptop. Of course, data scientists love that: you have full freedom to install anything you need and you can move fast. However, you're pretty far away from your data. You'll need to downsample and copy data onto your laptop, and of course you don't have to work in compliance to know that moving sensitive data onto your laptop is generally a bad idea. And the folks who are maintaining your production systems are definitely not going to be happy trying to reproduce your local environment.
To address some of these concerns, some vendors take the approach of just putting those same tools you use on your laptop into the cloud. Essentially, they're giving you a virtual laptop. However, just hosting Jupyter and giving you a VM with scikit-learn and TensorFlow pre-installed isn't that much of an improvement. Sure, you no longer have to copy data onto your laptop, but aside from security and governance, there are no obvious benefits, just more constraints.
And finally, you may be asking yourself the question: why not just iterate on our production infrastructure directly? Well, unfortunately those production-hardened systems are not really ideal for exploration, and most data scientists will not be happy if you try to teach them Kubernetes.
So you're left with a hard choice: the full freedom of a laptop, a slightly worse experience with the same tools in the cloud, or a fully production-hardened system that no data scientist will want to use. Thankfully, we're in the 21st century, and with a little customer obsession and engineering, we've been able to navigate these trade-offs.
Our solution for modern data
teams starts with the premise
that developer environments
need to be open
and collaborative.
Our workspace follows
open source standards
and provides a collaborative
notebook environment
on a secure and scalable platform.
Next, the industry has already figured out best practices for versioning and CI/CD, and they are built around Git. So we integrate our platform with this ecosystem and bring those best practices to data engineering and data science, where reproducibility is becoming more and more important.
Finally, to reduce the time from experimentation to production, the same environment can be scaled to production deployments, allowing you to manage the full life cycle within one platform. Bringing all of this together, I'm extremely excited to announce the next generation data science workspace on Databricks. And without further ado, let me walk you through some of these innovations step by step before we give you a demo. This is a screenshot of the current workspace navigation in Databricks.
It brings together all of the components you need for collaborative data science: notebooks, clusters, jobs, models, and access to all of your data at any scale.
For those of you who are familiar with this interface, you may notice something different. We're introducing a new concept called Projects in the left navigation panel.
One of the most common ways that data scientists start to work is to clone a Git repository. With Projects, you can bring all of your work to Databricks, where you can access all of your data and use best-of-breed open source tools in a secure and scalable environment. Because Projects are Git-based, you can keep them in sync by pushing and pulling changes. Of course, you can also switch between branches or create new ones. This basic functionality provides you with a powerful set of capabilities to integrate your Databricks workflows with your CI/CD automation, and it enables you to follow best practices when you move from experimentation to production, without having to learn new DevOps tools.
Now, many of our customers
have been waiting for
this product feature,
and I'm happy to announce
that it is available in preview today.
Now, we have many more exciting features coming, and I'll give you a quick sneak peek of those as well.
At the intersection of Git-based projects and environment management is the ability to store your environment configuration alongside your code. This integration will allow us to automatically detect and enable your environment, removing the need for you to worry about installing library dependencies yourself. And you know, sometimes "it just works" is the most powerful statement, and in this case we make it just work, following the same behavior you're used to on your laptop. We give you an environment that matches your environment specification and make it available consistently on all workers of an auto-scaling cluster.
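As a rough illustration (the file name and packages here are assumptions, not the actual demo contents), the environment specification could be a plain requirements file checked in next to the notebooks, and the integration automates what you would otherwise install by hand:

```python
# Illustrative sketch only: a requirements.txt living alongside the code might contain
#
#   prophet
#   pandas>=1.0
#   numpy
#
# The integration described above detects this file and makes the cluster match it.
# The manual, notebook-scoped equivalent in a Databricks notebook would be:
#
#   %pip install -r requirements.txt
```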
So now that your environment is all set up, let's look at your code. You can already import ipynb notebooks to Databricks. This allows you to convert your Jupyter notebooks to Databricks notebooks and vice versa.
In the future, we will store these notebooks in their native format, removing the need for conversion. This not only makes Databricks more standards-compliant, but it also enables us to support alternative editors. So if you want to, we allow you to open these notebooks in Jupyter right here in Databricks. And similarly, by the way, for the R users among you, we also support RStudio.
However, as I mentioned earlier, just providing you with cloud-hosted open source tools is not quite good enough. So by default, we will open these notebooks with the Databricks notebook editor. The Databricks notebook editor can open Jupyter notebooks and, in addition, provides you with collaborative features like co-presence, as indicated in the top right of the screen, and real-time co-editing, as indicated by the colored cursors. And to facilitate collaboration even more, Databricks notebooks also allow you to leave comments for your colleagues, all in one cloud-based environment.
Now, by allowing you to open Jupyter notebooks in the Databricks notebook editor, you will no longer have to make the trade-off between using standard formats and the collaborative features and benefits that Databricks provides.
So in summary, this is our solution for modern data teams. I showed you how we provide collaborative notebooks based on open standards, integrate with the Git ecosystem for collaboration and reproducibility, and provide integrations with CI/CD systems for a robust workflow from experimentation to production, all on a secure and scalable cloud platform. But you don't just have to take my word for it, so let me introduce Lauren Richie, who's going to give you a demo.
- Thanks, Clemens, for the introduction. Let's jump right into the demo. For the purposes of this demo, imagine I work at a big retail company. We used to do our forecasts on a quarterly basis. It's a big effort for our data scientists to come up with those forecasts using many different tools, and once that is done, we print them to PDFs and send them out by email. That has led to lags in decision making because we use outdated forecasts.
Of course, these days the world is changing at a rapid pace, and our leadership team has asked us to move from a quarterly to a weekly cadence. That will significantly improve the quality of our business decisions, like how much inventory to order. The amount of manual work involved in producing this forecast is prohibitive for doing it more frequently. So as a good data scientist, I am determined to automate this process and provide decision-makers with an interactive dashboard that always has the latest forecast ready.
First, I'm wondering if there's a better tool for those forecasts. So I search Google for Python forecast libraries, because I know that there's a lot of innovation there, and I find this library called Prophet, which is open-sourced by Facebook. I read about it online and heard good things, so I'll check it out. Of course, there are a lot of tutorial examples available online, so I found one and forked it into my GitHub account here. As you can see, it comes with a dataset and a Jupyter notebook that shows you how to create a forecast.
This is great. Usually the way I go about this is to take an example like this and try to recreate it, just to make sure it's not outdated. So I click clone to get the repo URL, because I'll need it in a minute.
So here I am in my Databricks environment, which we call the workspace. In the past, it would have been pretty difficult to get code from a Git repository into Databricks. However, as Clemens mentioned, we now have this new feature called Projects, which allows you to easily clone a Git repository. When you click create project, you provide the path to a Git repository, so I just paste the URL that I copied earlier into this text box. When you click create, we clone this repository and make it available in your user folder. As you can see, it indicates which branch you're on, and when you click into the project, you see that all of the files were cloned.
So the first thing that I'll do is create a branch, because I don't want to mess around with the master branch; that will be used to run a production job. I open the Git dialog and can just start typing a branch name into this text field. Now I click on create branch from master, and I'm ready to go.
As you can tell, we're trying to make
the most common workflows super easy
without having to leave this environment.
Now that I'm in my feature branch, I'll click on the Jupyter notebook, and you'll see that we open it in the Databricks notebook editor. In addition to supporting standard formats, this editor gives you several collaborative features that I'll highlight as part of this demo.
So let's get started. Databricks provides you with a scalable compute backend. I can attach a notebook to an existing cluster or create a new cluster. Let me attach this notebook to this cluster called ML cluster; it's already running. Now, usually I would have to worry about the environment that is set up on this cluster and which libraries are installed. But with the integration of the Projects feature and our runtime that's running on this cluster, you will see that we automatically detect the presence of this requirements.txt, and as soon as I run any cell in this notebook, the cluster will make sure that the environment matches those requirements.
So what's happening now is that Prophet is being installed in the background. We've already pre-installed many popular libraries like pandas and NumPy, and we adjust their versions if needed. So let's just run this entire notebook and see if it works. As you can see, this reads the CSV file from the data folder and loads it into a pandas DataFrame. The file has two columns: one for the date and one for the historic values of the column that we will try to forecast. In this plot you'll see that we have data up until 2016, and then we forecast another year after that.
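As a rough sketch of what that example notebook is doing (the file path is illustrative; `ds` and `y` are Prophet's conventional column names for the date and the value being forecast):

```python
import pandas as pd
from fbprophet import Prophet  # newer releases publish the package as `prophet`

# Load the toy dataset; the path is illustrative. Prophet expects a `ds` column
# for dates and a `y` column for the values to forecast.
df = pd.read_csv("data/example_retail_sales.csv")

# Fit on the historical data, then forecast roughly one more year.
model = Prophet()
model.fit(df)
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)
model.plot(forecast)
```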
This is great, because that's usually how you get started: finding an example and making sure that it works. However, this is just using toy data, and we have lots more data on our actual stores. So let's see if we can adjust this example to actually scale to our needs. I don't actually know where our sales data lives, so I leave a comment to ask my colleague to help me out here.
Okay, let's see if he's online. Great, coincidentally it looks like he's online and ready to help. You can see he opened up the notebook from the indicator up here, which shows which users are present in the notebook. Okay, it looks like he responded and he created a cell, so I'll assume he'll just share some code.
Okay, great, he's actually using Koalas. Koalas is an open source library developed by Databricks that provides the same APIs as pandas, but uses Spark in the backend to scale computation. This way, I won't have to change any of the other code. The DataFrame is still named df, and it should just work. So if I run this cell, you'll see that we run a Spark job in the background and hand you back a DataFrame.
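The cell he shared might look roughly like this; the table name and the use of `read_table` are assumptions for illustration:

```python
import databricks.koalas as ks

# Read the full sales data with pandas-like syntax; Spark does the work underneath.
# The table name is an assumption for illustration.
df = ks.read_table("sales.store_sales")
df.head()
```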
This is no different from the toy example before, except we have an additional column that indicates the store this data is from. In this case, we only have three for now: San Francisco, Amsterdam, and New York.
Of course, I don't want to generate one forecast across all stores; I want to have a forecast for each store individually. So instead of just running all of this through one big forecast, I group the DataFrame by the store ID and apply the forecast as a UDF. What will happen is that we'll run a Spark job, group the DataFrame, and then for each store we'll run this forecast. I can also delete all the other cells, because we don't need them.
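A rough sketch of that groupby-and-apply step; the column names (`store_id`, `ds`, `y`) and the forecast horizon are assumptions:

```python
from fbprophet import Prophet

def forecast_store(pdf):
    # `pdf` is a plain pandas DataFrame holding one store's history.
    model = Prophet()
    model.fit(pdf[["ds", "y"]])
    future = model.make_future_dataframe(periods=90)
    result = model.predict(future)[["ds", "yhat"]]
    result["store_id"] = pdf["store_id"].iloc[0]
    return result

# Koalas groups the data with Spark and runs the Prophet fit per store.
forecasts = df.groupby("store_id").apply(forecast_store)
```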
Now, the only thing missing is that I want to write the forecast out to a Delta table to actually use it in my dashboard; I don't want to rerun the forecast every time someone wants to look at it. So we add this code snippet that takes the forecast and stores it with today's date and the store ID, so that I can query by that later. When I hit run, we'll spin up some Spark jobs in the background that crunch through the data and write out my predictions.
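That snippet could look something like the following; the Delta table name and the `forecast_date` column are assumptions:

```python
from datetime import date
from pyspark.sql import functions as F

# Convert the Koalas result to a Spark DataFrame, stamp it with today's date,
# and append it to a Delta table the dashboard can query by date and store.
(forecasts.to_spark()
    .withColumn("forecast_date", F.lit(str(date.today())))
    .write
    .format("delta")
    .mode("append")
    .saveAsTable("sales.store_forecasts"))
```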
Now, this is great: I just cloned an example I found online, Databricks configured the environment for me, and I could easily change it to scale to all of my data by using Koalas and writing the forecasts out to Delta.
Now, as mentioned earlier, if for whatever reason you want to use Jupyter, you can right-click on this notebook to open it with Jupyter right here in Databricks. Unfortunately, the collaborative features you just saw are not available in Jupyter. As a side note, of course, you can go back and forth between the two editors whenever you want.
Okay, let's go back to the Databricks notebook editor. Of course, I've been working off of my feature branch here, so now I can go and check that code into Git. I open the Git dialog, provide a commit summary, and click commit and push. We won't show this in the demo, but usually you would go through the typical CI/CD workflow to create a PR, get it reviewed, and merge it into master. Here we set up our Git automation to automatically check out the master branch of this repo into this production folder whenever a new PR gets merged. From here, I can just open up the notebook.
Now, this is the master branch version that I want to automatically run. You can click on this calendar icon, which allows you to schedule this notebook as a Databricks job. Let's configure it to run once a week; this way I'll automatically get the new forecasts written into my Delta table. I'll click okay, and we're done here.
Now, this is a full end-to-end life cycle: experimenting with code in my own feature branch, checking it into Git, and pulling the master version into a production folder that is used to run a scheduled job. Let's quickly take a look at the dashboard that I put together. In this notebook, you can see that I use a feature of Databricks notebooks called notebook widgets. I can just create two widgets that give me the fields available in the Delta table. Those widgets will update whenever I get new data, and the data that is shown is automatically filtered by my selection.
I can create a dashboard
from this notebook
that embeds the table and visualization,
and also integrates the widgets
to control the parameters
of the SQL query.
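A minimal sketch of how those widgets could be wired up; the widget names, table name, and column names are assumptions (and `spark`, `dbutils`, and `display` are the globals available inside a Databricks notebook):

```python
from pyspark.sql import functions as F

# Populate two dropdown widgets from the Delta table, then show the forecast
# rows matching the current selection. Names here are illustrative only.
forecasts = spark.table("sales.store_forecasts")
stores = [r["store_id"] for r in forecasts.select("store_id").distinct().collect()]
dates = [r["forecast_date"] for r in forecasts.select("forecast_date").distinct().collect()]

dbutils.widgets.dropdown("store", stores[0], stores)
dbutils.widgets.dropdown("forecast_date", dates[0], dates)

display(
    forecasts
    .filter(F.col("store_id") == dbutils.widgets.get("store"))
    .filter(F.col("forecast_date") == dbutils.widgets.get("forecast_date"))
)
```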
So when I click present, you will see the version of this dashboard that I'll share with our decision makers. Here they can select which forecast they want to see for which store, all updated and without having to write any code. And that's it for the demo.
So just to summarize: in this demo, I showed you our support for the native Jupyter format and how the Databricks notebook editor provides collaborative features like co-presence, co-editing, and commenting. We saw the project-based Git integration and how easy it was to get started by cloning a Git repository and creating a new branch for development. And finally, we productionized the forecast by pulling the master branch into a production project, scheduling a job, and creating a notebook dashboard to share the latest results with our business stakeholders.
Now, in theory, I could even update these forecasts every day, simply by scheduling the production job to run daily. That's a massive improvement over the quarterly cadence that we were used to. And with this, I'll hand it back to Clemens.
- Thank you, Lauren, for this amazing demo. I hope that everyone watching is as excited about the next generation data science workspace as we are. To learn more, check out databricks.com.
