Hi, everybody. >> Hi, folks.
>> Thanks for joining us. >> Did you want to stand on the stage or --
>> Actually we're about right like this. >> Perfect.
I feel so tall. This is wonderful.
>> Cool, I'm Damian, and I'm the DevOps part of this.
>> And I'm Paige, @DynamicWebPaige on Twitter and pretty much everywhere else on the internet, and I'm the data science part.
>> We thought we would have an open discussion about DevOps and data science, which I think suits us both a little bit better.
>> And to open it up to your questions as well. So if you have anything that you've been struggling with, or any questions about the options for collaboration between the two different functions, feel free to ask. But did you want to get started?
>> Yeah, the one slide I wanted to show is this, Microsoft's definition of DevOps. No, that's -- sorry, it changed. Let's do the data science one first. >> Cool. This page is intentionally left blank, because I did not give Damian a data science slide. >> If you've seen her schedule, that's okay.
>> Thank you. So data science really spans a lot of different areas. You can do descriptive analysis, which is just looking at historic data and usually making dashboards, but the more interesting bit, the bit that I think we're all excited about, is predictive modeling: using machine learning, and especially deep learning, to architect solutions and build predictive classifiers or regression models for your software application. So data science encompasses being able to pull in data, get it into the state it needs to be in to be ingested into a machine learning model; then building that model, retraining it, changing hyperparameters, and monitoring performance and accuracy; then deploying it, and making sure that once it's deployed it's still doing what it's supposed to, so your model is sustainable over time. That is all bits and pieces of data science.
>> Yeah, so the important part from the DevOps perspective is the last part of what you said there: putting it into production, monitoring what's happening, and refreshing your data and models and things like that. >> Yes.
>> So this is what we talk about when we talk about DevOps, and this is programmer- and developer-focused: the union of people, process, and products to deliver value to our end users. And with predictive models, the value is clear. What's the value of this predictive model? If it's being used in your application you know the value: it's to predict things accurately, right? But you always want to be increasing accuracy, making these models perform better, and to do that you need to refresh them.
>> Absolutely, and I love the continuous delivery piece of this statement, because a model is never 100% perfect, and it's never going to be 100% accurate every single time. As your circumstances change -- say your customer base changes, or it's a new month of the year or a new day of the week -- you might need to refresh your model to reflect those changes. The example I like to give is: if you were a Microsoft Store in downtown Seattle and you were trying to predict how many Surface notebooks you were going to sell on a Monday, you're probably going to be selling a bunch more on the Monday of Build than you would other Mondays randomly throughout the year, and if you didn't have the ability to incorporate those spiky periods into your data set, or changes for holiday seasons, then you wouldn't have a representative model.
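The seasonality point above can be sketched in a few lines of Python. The event window (the Monday through Wednesday of Build 2018) and the feature names are illustrative, not from the talk's demo:

```python
from datetime import date

# Sketch of adding calendar/event features so a sales model can learn
# about spiky periods. The event window and feature names are hypothetical.
EVENT_WINDOWS = [(date(2018, 5, 7), date(2018, 5, 9))]  # e.g. Build 2018

def calendar_features(d):
    """Features a sales model could use: day of week, month, event flag."""
    return {
        "day_of_week": d.weekday(),  # Monday == 0
        "month": d.month,
        "is_event": any(start <= d <= end for start, end in EVENT_WINDOWS),
    }

print(calendar_features(date(2018, 5, 7)))   # the Monday of Build
print(calendar_features(date(2018, 5, 14)))  # an ordinary Monday
```

With a flag like `is_event` in the training data, the model can learn that conference Mondays behave differently from ordinary Mondays instead of treating the spike as noise.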
>> Yeah, and the more these models get used in real-world scenarios, the more data you have and the more accurate you can make your model; you can build on them. So this is a continuous process: gathering information to improve the models so they work a little bit better in the applications, so they deliver the actual value. That process is really important. So DevOps for data science is still a pretty new area, right? We talk about DevOps in software development a lot, and data scientists don't start from the same point of view as developers.
>> Not even. So most data scientists aren't coming from a computer science background. Most of them are either statisticians or physical scientists, so physics majors or mathematics majors, and there is not a lot of understanding of software development life cycle best practices. So version control is difficult to explain to a person who is usually working in an isolated environment by themselves. Being able to implement tests, that's also a very hard concept to explain to a data scientist. So whenever Damian was telling me about this thing called VSTS, it was immediately apparent: oh hey, it does these checks before certain builds are deployed, or before you are able to push your model or application into production, and that's precisely the same mentality we should have for machine learning models.
>> We've got to the point with software development where the way we should be doing it is writing code, having that in source control, and then having a pipeline: a build process, some tests that run, a deployment to a test environment to run more tests and make sure everything works okay, then promotion through to your production environment, and monitoring what's happening all the way through. You monitor how people are using that application in production, how effective it is, and what parts need more work, and feed that back into your tooling. So these are things that align really well with this data science process: improving the models, testing the models, working out what things need to be fixed, what things need to be changed, and iterating on this process.
>> And gathering the telemetry is really, really cool. As a data scientist, the thing I love most is data. I love the ability to input structured information into my model, and a lot of the time I don't have the data I need, or the data I would like to have, in order to build the best predictive model possible. If I'm able to talk to Damian or the people who are developing the software and say, hey, it would be great if we tracked this button click or this mouse movement or this bit of user behavior, then it would really add value to the thing that I'm creating. Just that partnership of working closely with the software team and making sure that both of us understand each other's needs is really, really important.
>> We have different fields: I don't know what stuff you need, and you know the process that you need, but maybe not the best tools to use for it.
>> Absolutely. >> Stuff like that.
>> VSTS, though, I think is a great meeting point between the two of us, where we can see the tooling available and all of the extensions to be able to support that process.
>> Yep. So for those who don't know Visual Studio Team Services, or VSTS, you would have seen it talked about in the keynote on day one, creating a pipeline and all that sort of stuff.
Ultimately, under the covers, the release infrastructure -- the guys are all over there, so they're not going to hear me -- is kind of a glorified task runner, right? There are all sorts of tasks you can implement to do the work for you, and from a data science perspective we are having conversations about how to make this successful for data scientists without them needing to do too much work. But as a task runner you can run scripts, create models, and -- you can tell I don't know what I'm talking about -- do the machine learning modeling process and predictions.
>> And monitor the logging.
That's the coolest bit, to me at least: being able to see what my model's performance and accuracy look like over time, or the F statistic, and using that as a trigger for whether or not to retrain, or to send an email to let me know that something is wrong. >> Definitely.
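A minimal sketch of that monitoring trigger, with a hypothetical threshold (this is an illustration, not a specific VSTS feature):

```python
# Sketch of a monitoring trigger: watch reported accuracy over time and
# decide whether to retrain or carry on. The threshold is hypothetical.
ACCURACY_THRESHOLD = 0.85

def monitor(accuracy_history, threshold=ACCURACY_THRESHOLD):
    """Decide what to do based on the latest reported accuracy."""
    latest = accuracy_history[-1]
    if latest < threshold:
        # In a real pipeline this could queue a retraining build
        # or send an alert email.
        return "retrain"
    return "ok"

print(monitor([0.91, 0.89, 0.77]))  # accuracy has drifted down
print(monitor([0.91, 0.90]))
```

In a pipeline, the "retrain" outcome would be wired up to kick off the retraining build or fire the alert rather than just returning a string.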
Did you want to open up to some questions?
>> Yeah. >> We've got stuff we can show you and talk about, but, you know, it's a relatively new field, especially for me, DevOps with data science, so I want to know what people want to know about this convergence of the two areas.
>> If anyone is interested in reading a tech paper on this, there is a paper called TFX, released in 2017, and it's phenomenal. It goes through the entire development life cycle for machine learning models: how you would want to do data checks, accuracy and statistics checks, retraining your model, and getting it into -- yep, there we go, it's an end-to-end machine learning life cycle study.
>> I can push buttons okay. I think there is a question out there. >> Audience member: (away from mic.)
>> Oh, GDPR compliance, right?
>> Audience member: (away from mic.)
>> So from the version control point of view, I wouldn't necessarily consider the data itself to be something you need to put in version control. You could put the models based on the learnings from those data sets in version control; the algorithms and all the code you write to make those models work in source control; and the artifacts, the models themselves, in a version repository of some kind, whether that's a Docker Hub or a package management system, to say here is the version of the file I have. But the data itself, you could still lock that down pretty much the way you are. It's just when you run the model training and things like that, that's when the algorithms will pull the data, use it, and lock it away the way it was before.
>> And the question was about
GDPR compliance, and whether it could be considered sensitive, personal data to build a model on someone's name, rank, serial number, address, all that stuff. There has actually been a start-up in the last couple of years called Ocean Protocol -- that's what I want to say it is. It's using yet another buzzword, blockchain, to do versioning of data sets, so that you can see precisely which names are added and which files are added, and, if there was ever some sort of legal or identity issue arising from a model, you would be able to trace that data set back to every single instantiation you had in your model application. So if you want to get that granular -- a legal issue where you want to know every single model that leveraged that person's data, and retrain it on not that person's data -- then there are solutions that could help you with that.
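The data-set lineage idea can be sketched without any blockchain machinery. Everything here (the fingerprinting scheme, the version names) is a hypothetical illustration of the concept, not Ocean Protocol's actual design:

```python
import hashlib

# Sketch of data-set lineage tracking: record which data-set fingerprint
# each model version was trained on, so that a removal request can be
# traced back to every model that leveraged that data.
lineage = {}  # model_version -> data-set fingerprint

def fingerprint(dataset_bytes):
    """Stable fingerprint of a data-set snapshot."""
    return hashlib.sha256(dataset_bytes).hexdigest()

def record_training(model_version, dataset_bytes):
    lineage[model_version] = fingerprint(dataset_bytes)

def models_trained_on(dataset_bytes):
    """Which model versions would need retraining if this data is removed?"""
    fp = fingerprint(dataset_bytes)
    return [m for m, f in lineage.items() if f == fp]

record_training("v1", b"alice,seattle\nbob,redmond\n")
record_training("v2", b"alice,seattle\ncarol,tacoma\n")
print(models_trained_on(b"alice,seattle\nbob,redmond\n"))
```

The point is simply that if every training run records an immutable identifier for its input data, the "which models touched this person's data?" question becomes a lookup instead of an investigation.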
>> Interesting. >> I was right about the name, that's awesome. >> All I do is look up websites for you.
>> You're doing awesome. Damian has taught me so much about VSTS and about DevOps. >> I don't understand data science.
>> And I don't understand DevOps, so it works out nicely.
>> Awesome. The other thing I was going to show, and I haven't looked through this in great detail, is this Visual Studio Team Services project. These are build definitions for machine learning models, right? They will build predictive models, and you can see this running on CPUs and GPUs. Have you looked at this in depth?
>> No, but having the ability to define requirements for models, especially for the retraining bit, is huge.
>> So as an example, let's look
at one of these. I'm going to choose this one because it is passing. If we look at what it's doing -- I was mentioning that VSTS is a glorified task runner, which those guys would kill me for -- this is a PowerShell script running a Python training script, and we are getting the trained model results to see how well we are performing and things like that. The bit in source control is this training stuff, so if you look at the code, which is held inside VSTS, this is what we have. Where was it? In training, probably. train.py, so this is in source control, in the file. So if you change the model you're using, or things like that for the training, this is versioned. You can come back and say, this version was performing much better, let's roll back to there; let's do a branch and try to work out how it works with a different algorithm; maybe let's do a totally new project and put it in source control.
>> And having the ability to
publish logs directly to Azure, or wherever that location happens to be -- if you're a data scientist or machine learning engineer, you're probably in love with TensorBoard if you've tried it out before; it's a visual inspection tool for deep learning networks -- if you've got the logs somewhere immediately accessible, then you would be able to access them even though you're not the person running the model each time.
>> So you can actually work in a team like this and have different responsibilities, yeah. Pretty cool.
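As a library-free sketch of that log-publishing idea: in practice you would write TensorBoard event files (via `tf.summary` or `torch.utils.tensorboard.SummaryWriter`) and point TensorBoard at the log folder, but the shared-location concept looks roughly like this. The paths and metric names are hypothetical:

```python
import json
from pathlib import Path

# Sketch: publish per-step training metrics to a shared location so
# teammates (or a dashboard) can inspect runs they didn't launch.
LOG_DIR = Path("run_logs")

def log_metrics(run_id, step, metrics):
    """Append one step's metrics to the run's log file."""
    LOG_DIR.mkdir(exist_ok=True)
    with open(LOG_DIR / f"{run_id}.jsonl", "a") as f:
        f.write(json.dumps({"step": step, **metrics}) + "\n")

def read_metrics(run_id):
    """What a dashboard, or a teammate, would read back."""
    with open(LOG_DIR / f"{run_id}.jsonl") as f:
        return [json.loads(line) for line in f]

log_metrics("experiment-1", 0, {"accuracy": 0.71})
log_metrics("experiment-1", 1, {"accuracy": 0.83})
print(read_metrics("experiment-1")[-1]["accuracy"])
```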
Any other questions? >> Audience member: (away from mic.)
>> Right. So the question was, how do you determine what would be a pass/fail for the model? An example would be, as you retrain models, so as you refresh with new data sources: if the accuracy of your model ever fell below a specific threshold. So say your business requirement was that you needed to have 85% predictive accuracy, and after some sort of model retrain you're down to 77%; that might be a trigger that you shouldn't deploy that model, the latest version, to the application. Or it could be any other metric that gives you insight into the performance of the model itself, so the F statistic would be another. You could also do a trigger on whether any new additional data is added, or if, say, the data you had initially had four different categories for one particular feature, and then another feature category had been added, so you now have a fifth one. That would be a great trigger to say: hey, you have a new kind of configuration, a new structure for the incoming data set. Take a look at it, because I guarantee you, if you don't take a look at this new sampling frequency or this new data type, then your model is going to go kaput; it's not going to do well.
>> So those are triggers to refresh your model?
>> The data bit would be a check to make sure the incoming data is the same as the original data, and then, after the model has run and accuracy is reported back, if it's too low, that would be a check on not deploying it.
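The two checks just described -- a schema check on incoming data before retraining, and an accuracy gate before deployment -- could look roughly like this. The category set and threshold are hypothetical:

```python
# Sketch of the two checks described above. The expected categories and
# the accuracy threshold are made-up business requirements.
EXPECTED_CATEGORIES = {"red", "green", "blue", "yellow"}
ACCURACY_THRESHOLD = 0.85

def schema_check(incoming_categories):
    """Fail if the incoming data introduces unseen feature categories."""
    new = set(incoming_categories) - EXPECTED_CATEGORIES
    return ("fail", sorted(new)) if new else ("pass", [])

def deploy_gate(accuracy):
    """Only deploy the retrained model if it meets the requirement."""
    return "deploy" if accuracy >= ACCURACY_THRESHOLD else "hold"

print(schema_check(["red", "blue", "purple"]))  # a fifth category appeared
print(deploy_gate(0.77))                        # below the 85% requirement
```

Wired into a pipeline, `schema_check` would run before retraining (flagging the new data structure for a human to look at) and `deploy_gate` would run after, blocking promotion of a model that regressed.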
>> One of the things in VSTS is this release process. So I ran the model, and this gives you the logs. The release process is here; they're only going to one environment, but if we edit that one -- they've gone further than that -- they're deploying it to a test cluster, running tests around it, and deploying it to the production cluster if it meets your requirements. You can even -- and this went GA at Build -- use release gates, and there you go: check output from data testing. So what they have, this is amazing. [Chuckles] They're testing it and making sure it's performing, and I'm guessing, based on A/B testing, they're comparing it against the current production model. >> Yeah.
>> And if the accuracy is better, then it will automatically flow to the production environment. So this process that they have means that you can try things, update your code, your models, and it will automatically flow through to production if that model performs better than the previous one. So as a data scientist, somebody generating these models and doing all the work on the ground, you can focus on the work, and the deployment of better and better models is kind of handled for you. This is hitting an Azure Function, by the look of it. >> And it's awesome.
>> This is deployment gates, which went GA at Build. That's amazing.
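The promote-if-better comparison might be sketched as follows; the accuracy numbers are made up:

```python
# Sketch of the promote-if-better gate: compare the candidate model's
# measured accuracy against the current production model, and only let
# the better one flow through to production.
def promote(candidate_accuracy, production_accuracy):
    """A/B-style comparison gate before promotion to production."""
    if candidate_accuracy > production_accuracy:
        return "promote candidate"
    return "keep production model"

print(promote(0.88, 0.85))  # candidate beats production
print(promote(0.80, 0.85))  # candidate regressed
```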
>> Any more questions? >> Audience member: (away from mic.)
>> The batch. >> The batch? >> Audience member: (away from mic.)
>> The question was around, can you use Azure Batch AI.
>> So there's Azure Batch AI, and that's, I think, leveraging Spark in order to spin up a cluster.
>> I don't know the answer to this, but I'm going to find out right now. Does that have an API or endpoint?
>> You can use the command line to trigger it, so you can use aztk to spin up a Spark cluster, and I'm assuming it has a command line tool.
>> So if you can -- >> Shell scripts.
>> The general rule is, if you
can script it, you can do it. So you could have a task run a script from your source repository that says, this is how I call Azure Batch AI -- am I going to get that right? Batch AI -- to do the work, and that is part of the orchestration that VSTS does. So it will fire off that task, and then you can either put a task in here to wait until that was done, or have a trigger based on that completion to say, all right, now that's done, let's go to the next step. So, again, with that pipeline I was showing where you have the gates, that could literally be an API hit to say, is this done yet? If it is, let's go to the next environment, which is basically taking the results of that and pushing them on to the next environment. You can absolutely fire off that stuff from tasks inside VSTS out of the box, though you're writing scripts, and this is the kind of stuff we're talking about now: how to make this stuff more of a first-class citizen. That's one I'm putting in the backlog in my head: being able to fire off Batch AI and wait for the response. Would that make sense as a build or release task?
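The "is this done yet?" polling gate could be sketched like this; `check_status` stands in for a real status API call (for example, a Batch AI job query):

```python
import time

# Sketch of a polling gate: wait for a long-running job to report done,
# then move to the next environment. check_status is a hypothetical
# stand-in for a real job-status API call.
def wait_for_job(check_status, poll_seconds=0.01, max_polls=100):
    """Poll until the job reports 'done' (or give up after max_polls)."""
    for _ in range(max_polls):
        if check_status() == "done":
            return "proceed to next environment"
        time.sleep(poll_seconds)
    return "timed out"

# Simulated job that finishes on the third status check.
statuses = iter(["running", "running", "done"])
print(wait_for_job(lambda: next(statuses)))
```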
>> Audience member: (away from mic.)
>> Yeah, so "environment" in VSTS is poorly named. I'm glad there is nobody from the VSTS team here. Environments are really just a collection of tasks. So this environment has one phase, and there are three tasks, or two once I delete this one. That is all that environment is. It's not necessarily a machine or a target or something like that. It's just, when I'm talking about the training model environment, that's just these tasks. The next environment could be: all right, that's finished now, what's the next thing I need to do? Part of the improvements we can make is looking at renaming environments to something else. We have had lots of discussions about that.
>> The other thing I would like to talk about, too: let's say you're doing a model retrain, and your start-up has been wildly successful, and you no longer have a thousand customers but a million; you're Pokémon GO! You can use Kubernetes to say, if we have this many fewer customers, use a CPU cluster, spin up Spark, and do distributed machine learning there; or, if we have massive customer requirements, crazy numbers, scale up automatically to GPU-enabled hardware. You can use Kubeflow for that, and a couple of other tools, but Kubeflow is open source and it's very good.
>> Cool. I just looked at the time as well. >> Are we over? >> Yeah.
>> Sorry! Sorry for stealing your time!
>> I'm going to apologize as well; I need to run to Channel 9.
>> I'm going to probably run to a podcast, but I can take a couple of questions if anybody still has 'em.
>> There are also the Twitter handles; that is the best way to get hold of me.
>> Same here. If you tweet at me it goes directly to my phone. >> Thank you very much.
>> Thank you. >> Have a good rest of the
