(upbeat jingle)
(scribbling)
Hi, I'm Matt Maccaux,
and I'm here to talk about how HPE BlueData can help you
accelerate the time to value for your data science
and analytic community.
Let's start talking about who these analytical users are.
The first set of users that everyone likes to talk about
are data scientists.
Data scientists are doing exploratory analytics,
the "change the organization" type of analytics,
and they're answering questions that are very open ended.
And in order to do that,
they're gonna put strain and pressure on the organization
because they need access to data that may not exist
within the four walls of your organization.
They may wanna use tools that have never been certified
by your IT organization.
They may do so in an iterative
or potentially destructive manner.
And that puts tension on the second persona,
our data analysts.
These are the traditional business intelligence
and reporting users that are using things like Tableau,
SAS, and QlikView, and that are generating code
to execute the type of report that they want.
Both sets of users though, need access to the same data.
And so, we can't have data scientists
potentially causing outages
that the data analysts are affected by.
The third persona is our data engineers.
Data engineers are responsible for operationalizing
the models and reports
that this analytic community is generating.
What do I mean by operationalization?
Well, when a data scientist writes code or an algorithm,
it has to be put into production.
So there's probably code and
applications that surround that algorithm to make it live.
The last set of users are our data operations team.
They're the team that's responsible for providing
the infrastructure, tools,
IDEs, et cetera.
And so, every set of users,
whether data scientists, analysts, or engineers,
needs access to tools.
And they all have different sets of requirements.
Data scientists may want RStudio and Python.
Your data analysts use Tableau.
They all have different IDEs,
whether I'm a Java developer
and I'm using IntelliJ,
or I'm using Zeppelin, or Jupyter Notebook
as a data scientist.
Of course I need code,
or access to my model.
And lastly, in order for my data scientists, data analysts,
or engineers to do their job,
they need access to some kind of data.
Whether it's production data, sample data,
anonymized data, all of this has to be provided
through some sort of request interface.
The data operations team meanwhile needs to know
how long do you need this environment for?
Are you gonna be doing some exploratory work with tools
and IDEs, and you need to spin up
and spin down environments six times a day?
Or are you training a model over a weekend
against high performance infrastructure?
So, we need to know what are the performance characteristics
of the work that you're doing,
so that ultimately we can determine
how much infrastructure, how much software,
and how long, so we can either do chargeback
or showback.
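As a rough sketch of that sizing and chargeback step, here's a minimal, hypothetical calculation; the rates and request fields are illustrative assumptions, not BlueData's actual cost model:

```python
# Hypothetical chargeback estimate for an environment request.
# Rates and request parameters are illustrative assumptions.
RATE_PER_CPU_HOUR = 0.05   # dollars per CPU-hour
RATE_PER_GB_HOUR = 0.01    # dollars per GB-hour of memory

def estimate_chargeback(cpus: int, mem_gb: int, hours: float) -> float:
    """Estimate the cost of an environment from its size and duration."""
    return hours * (cpus * RATE_PER_CPU_HOUR + mem_gb * RATE_PER_GB_HOUR)

# Training a model over a weekend on high-performance infrastructure:
weekend = estimate_chargeback(cpus=64, mem_gb=512, hours=48)

# Six short exploratory spin-up/spin-down sessions in a day:
exploratory = 6 * estimate_chargeback(cpus=4, mem_gb=16, hours=1.5)
```

The same function covers both usage patterns the operations team has to plan for: long-running heavy jobs and frequent short-lived sandboxes.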
This request portal should be metadata driven.
Now, this is logical metadata
because you may have a data catalog,
several code repositories,
software libraries,
but the point here is that this is not a static environment.
These users are browsing and selecting from it
based on dynamically generated information.
This metadata also will have templates
that we use to kick off automation.
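To make that concrete, here's a hypothetical environment template of the kind such a portal might store, along with a tiny function that turns it into the container jobs automation would launch. The field names and structure are illustrative assumptions, not an actual BlueData schema:

```python
# Hypothetical metadata template for an exploratory data science
# environment; every field name here is an illustrative assumption.
EXPLORATORY_DS = {
    "name": "exploratory-data-science",
    "tools": ["rstudio", "python"],
    "ide": "jupyter",
    "data": {"source": "data-tap", "mode": "read-only"},
    "workers": 3,
    "ttl_hours": 8,   # timer after which the environment is reclaimed
}

def render_jobs(template: dict) -> list[str]:
    """Expand a template into the worker containers automation spins up."""
    return [
        f"{template['name']}-worker-{i}"
        for i in range(template["workers"])
    ]
```

Because the template carries the tools, the data access mode, and a time-to-live, one request from the portal is enough to drive the whole spin-up.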
This automation is gonna do things like
spin up an environment.
So, this is my exploratory data science environment,
where I take your tools, your RStudio
and your Python,
and use containers to spin up that environment,
those workers.
And the beauty of containerization is that
if I write bad code,
and I crash a container,
well, it takes no time at all to spin another container up.
I can also add capacity.
So, if this tenant is not right sized for my job,
well, I can come in
and I can make it bigger.
The operations team can provide more capacity
so that I can spin up more containers to do my job.
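A toy model of that container lifecycle, respinning crashed containers and letting operations add capacity, might look like the following. This is purely illustrative, not the actual BlueData orchestration API:

```python
# Toy model of per-tenant container management; class and method
# names are illustrative assumptions.
class Tenant:
    def __init__(self, capacity: int):
        self.capacity = capacity          # max containers this tenant may run
        self.containers: list[str] = []   # currently running containers
        self._next_id = 0

    def spin_up(self) -> str:
        """Launch a new container if the tenant has room."""
        if len(self.containers) >= self.capacity:
            raise RuntimeError("tenant not right-sized; ask ops for capacity")
        name = f"worker-{self._next_id}"
        self._next_id += 1
        self.containers.append(name)
        return name

    def respin(self, name: str) -> str:
        """Bad code crashed a container; replace it almost instantly."""
        self.containers.remove(name)
        return self.spin_up()

    def add_capacity(self, extra: int) -> None:
        """The operations team grants room for more containers."""
        self.capacity += extra
```

The key property is that a crash costs only the respin of one container, and a too-small tenant is fixed by raising its capacity, not by rebuilding anything.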
What's also important here is that
we're using software-defined networking for multi-tenancy.
We're not just relying on YARN.
So that again, if this data scientist
crashes their entire cluster,
it doesn't affect any of the other users
that may have environments running on the system.
This system is also made up of on-premises infrastructure.
On-premises VMs, as well as the ability to
deploy these same containers, these same tools,
in all three public clouds.
But that's not enough.
That's not enough for these users to be productive.
We still need to tap into the data where it resides.
And so, under all of this is our lake.
And this is a logical data lake.
We know most organizations have multiple data lakes.
And that's okay.
The point here is that we need to think about this data lake
in two sort of different constructs.
One, we've got our data being fed in
in a curated, potentially adjusted
or transformed way.
Of course, we're gonna capture that metadata
as part of this process,
but this portion of the lake,
that curated, transformed data
is our read only data.
And why is read only important?
Well, read only is important because our data scientists
who request access to everything,
well, we can give you access to everything via our data tap.
Tap into that information,
but if you wanna write information,
you wanna bring data from outside,
well you do that in the read/write portion of the data lake.
We're gonna give you an analytical sandbox
that you can also tap into.
This is where you do your joins.
Potentially, you bring in data from outside to do your work.
And, it's important that we do that in the lake
so that we have auditability and traceability
of what is happening in that sandbox environment.
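As an illustration of the read-only tap versus the read/write sandbox, here's a hypothetical wrapper that enforces the access mode and records an audit trail. The class name, paths, and methods are assumptions for the sketch, not a real data tap API:

```python
# Hypothetical "data tap" over a logical lake path that enforces the
# read-only vs. read/write split and logs every access for audit.
class DataTap:
    def __init__(self, path: str, mode: str = "read-only"):
        assert mode in ("read-only", "read-write")
        self.path = path
        self.mode = mode
        self.audit_log: list[str] = []   # traceability of the sandbox

    def read(self, key: str) -> str:
        """Anyone with the tap can read the curated data."""
        self.audit_log.append(f"READ {self.path}/{key}")
        return f"<data at {self.path}/{key}>"

    def write(self, key: str, data: str) -> None:
        """Writes are only allowed in the read/write sandbox portion."""
        if self.mode != "read-write":
            raise PermissionError("curated lake is read-only; use your sandbox")
        self.audit_log.append(f"WRITE {self.path}/{key}")

curated = DataTap("/lake/curated")                      # read-only portion
sandbox = DataTap("/lake/sandbox/alice", "read-write")  # analytical sandbox
```

A data scientist taps the curated lake for input, does joins and brings in outside data through the sandbox tap, and every operation lands in the audit log.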
And when the timer's up,
we're gonna take what was done here,
archive it off,
potentially bring it back in
through the standard curation process
so that it shows up in the catalog.
But, because we have a timer,
we're gonna destroy the environment
after we've archived everything
and release those assets to other users in the system.
And so, this is what good looks like.
All of these different personas coming
through a common interface
to request the environment they need.
Selecting the tools, the IDE, the code or models,
and the data they need,
which is metadata driven,
using automation to spin up containers
in a true multi-tenant environment
across any set of infrastructure or cloud,
and tap into the data where it exists.
And it's HPE BlueData that provides the power of this
under the covers,
working with your existing software investments
and software teams to make this real.
(calm jingle)
Learn more about HPE BlueData here.
