So, as I said, we're going to talk about the general idea of big data, predictive analytics, and data science today.
Let's start with this: let's look around and try to see where big data and data science are happening.
So it is really all around us.
All of us are aware
of the search engines.
So if you look at these search engines, they have been doing big data and data science work even before these became buzzwords, right? The search results that you see come from recommenders or ranking algorithms. Or the ads: if you search for pizza, or some other query, and they detect that it's a local query, they would show you some local ads. If you type in certain queries, they might show you some stock quotes.
So they have been actually
doing big data and data
science for ages now.
It has been, I don't know, 20 years now.
Insurance companies-- insurance companies have probably been doing data science even before that, before data science and big data became buzzwords. I mean, again, big data is something that is subjective. I mean, how big is big?
And we'll get into that
debate during the boot camp
or, to some extent,
during this webinar.
But, if you look at insurance companies, they would use some demographics information like your age, where you live, your gender, your income, if they know it or have estimated it, how many people you have in your household, and so on.
And they will offer you some
insurance rates, or some auto
insurance rates, or some health
insurance rates, and so on.
I think some of us may have observed this-- and I'm not sure how it is in Europe-- but in the US, if I change my address, my auto insurance might change. It is exactly the same person. I'm still the same person, I'm still married, my age is still the same, I don't have any new accident history. But when I change my address, my insurance changes. So my auto insurance would change. Can anybody tell us why that might be happening?
Any thoughts?
Maybe local risk factor.
Yes, so what would
be-- what would be some
of the risks or risk factors?
Do you think that near my new address there's possibly a dangerous intersection or cross street where more accidents happen?
Uhuh.
Right, maybe there
are more robberies,
or carjacking incidents here.
That's possible, right.
So insurance companies would actually take many factors into account besides you. It may not be just factors related to you. If factors related to your surroundings change, their predictive models will give different results, and then they would actually enforce those different premiums and so on.
Let's take another example here.
If I go to telcos, telecom companies-- I think in the US we have T-Mobile, AT&T, Verizon, and Sprint. In Europe we have Vodafone, we have Orange, I guess. What are some big carriers in Europe? Would anybody mind telling me what are some big carriers-- telecom companies-- in Europe? Yeah, Vodafone is one of them. Orange is there, Vodafone is there, yes, OK. And Mobistar, yes.
So what do you all think? What are your thoughts on why they would be using big data and data science? Or what could be some of the reasons they might be using these technologies, or this set of tools or skills? It's OK to be wrong. I mean, I want this to be very interactive, so please just give some feedback, and it's OK to be wrong.
But I'm very sure that you would have some intuition or some understanding of why they would be using this.
So why do you think
Telco would use
data science, or predictive
analytics, or big data?
OK, revenue assurance, sure, but how would you assure that you have guaranteed revenue? I think all of us would agree that these companies would have all the call logs: when the calls were dropped, how long the calls were, how many people were calling at any given time, when the network was overloaded, when network usage was relatively low. So these companies would actually use all of this data that they gather pretty much every second from us. And based on that, they might actually plan their network: how to expand it and where to place cellular towers, and so on.
So that's one thing.
And another very important
thing that they might be doing
is predicting customer churn.
And when I say predicting customer churn, what that means is that if I'm a T-Mobile customer-- this company that you see, if you have never heard of it, is T-Mobile. I am a T-Mobile customer, so maybe every day or every week T-Mobile runs their predictive model and tries to see which customers are likely to leave them or switch to AT&T, or Sprint, or Verizon.
And if they sense that
I'm going to leave them,
they are going to actually
come up with an offer
or try to make me happy and
then make me not leave T-mobile.
So again, the same idea. I will ask all of you: first of all, do you think it is possible to predict that a given customer is going to leave the service? OK, so Jose thinks that yes, it is possible.
OK.
Yeah, I'm sure.
I'm sure as well but
I don't know why.
So how would you-- I believe that, since all of you are interested, all of us agree that it is entirely possible, right? Now, the question is, how would they go about predicting this?
What are some of the things that-- in the predictive modeling and machine learning world these are called signals, right? That's the term that they would use. So what are some of the signals, or some of the features, that they would pick? Assume that they are gathering every piece of information about you, based on usage. What are some of the things that would flag to the telecom provider that this customer is going to leave us?
Maybe a visit to the website of one of the others. But do you think they can track it? I think so.
OK, yeah, I mean, some good points in the chat window as well. If I'm changing, adding, or removing services in the network. If my family's carrier has changed. And I think another one: if you don't use your phone very often, if my usage pattern is changing. Yes, I think all of these-- Marco, your point, and the other points-- are valid, right?
So what else do you think? Assume that you get hired by one of the telcos, and they want you to say: OK, here are a million customers; give us a predictive model that gives us a rank-ordered score, a risk score, of whether somebody is going to leave us or not.
So yes, another one: customer service calls, right. So would any customer service call mean that the person is going to leave, or is it something else about the customer service calls?
Maybe outcome of the call,
if they're satisfied.
Yes and how would you
measure satisfaction?
I think they always end with a question asking if you're satisfied-- yeah, you know, of course, you're going to ask me to answer. Yes, sure-- but how many of us actually offer explicit feedback, right?
And this is precisely the point, right? I think when you come to the boot camp you will actually be surprised how easy it is to build predictive models. And I'm not exaggerating. There are libraries where, with a single line of syntax, you can actually build a predictive model. So anybody can quickly learn to build a predictive model, and many people think that they know machine learning.
But what we will emphasize during the boot camp is this kind of thinking. How do you extract some features? This is what is called features, or signals, right? How do you get into that mindset of: OK, even if the data was not given to me, how do I extract more data? Where should I go and look? We will emphasize this a lot.
So I see in the chat window: how many dropped calls we have, right. That is going to be another factor. And maybe, even if somebody does not explicitly complain, the length or duration of their calls might be a proxy for dissatisfaction.
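The signals just listed can be combined into the kind of rank-ordered churn score mentioned earlier. Here is a minimal sketch in Python; the feature names and weights are hypothetical and hand-picked for illustration-- a real carrier would learn them from historical data, for example with a library like scikit-learn, where fitting the model is essentially one line.

```python
import math

# Hypothetical churn signals; names and weights are illustrative only,
# not any real carrier's model.
WEIGHTS = {
    "dropped_calls": 0.8,    # more dropped calls -> higher churn risk
    "service_calls": 0.5,    # frequent customer-service calls
    "usage_drop_pct": 0.04,  # percentage drop in monthly usage
}
BIAS = -3.0

def churn_score(customer):
    """Return a 0..1 churn probability from a simple logistic model."""
    z = BIAS + sum(WEIGHTS[k] * customer.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def rank_customers(customers):
    """Rank-order customers from most to least likely to churn."""
    return sorted(customers, key=churn_score, reverse=True)

customers = [
    {"id": "A", "dropped_calls": 0, "service_calls": 0, "usage_drop_pct": 0},
    {"id": "B", "dropped_calls": 5, "service_calls": 3, "usage_drop_pct": 40},
]
ranked = rank_customers(customers)
```

The ranked list is exactly the rank-ordered output the telco would act on: top of the list gets a retention offer first.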
And we were contacted by one of the telcos in the US, and they wanted to actually measure satisfaction based on call transcripts. So if I call T-Mobile and I am talking to T-Mobile customer service, then, based on what they said to me and what I said to them, they turn it into a text transcript. And based on some techniques applied to that transcript, you eventually understand: was this a satisfied session or not? If this was a session where the customer experienced dissatisfaction, they would treat that session as such.
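As a toy illustration of scoring a transcript, here is a keyword-counting sketch. The word lists and the satisfied/dissatisfied rule are invented for the example; real systems use far more sophisticated text analytics.

```python
# Toy proxy for transcript scoring: count positive vs negative keywords.
# Word lists and the >= 0 threshold are illustrative assumptions only.
NEGATIVE = {"cancel", "frustrated", "terrible", "refund", "complaint"}
POSITIVE = {"thanks", "great", "resolved", "helpful", "perfect"}

def transcript_sentiment(transcript: str) -> str:
    """Label a transcript 'satisfied' or 'dissatisfied' by keyword score."""
    words = transcript.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "satisfied" if score >= 0 else "dissatisfied"
```

In practice the same idea is scaled up with trained text classifiers rather than fixed word lists.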
So we'll go into all of this,
during the boot camp we'll talk
about text analytics as well.
Very briefly though,
but we'll talk about it.
Let's keep making progress.
There are applications in
online education as well.
As you can expect, recommending content: perhaps, if there are 10 students taking the class, not everybody needs to work on the same background. Maybe, if I'm taking a math course, after some time this automated system figures out that my algebra is weak, because whenever there's an algebra concept in the question I make a mistake. So it might recommend me some remedial or prerequisite work on algebra-- more exercises in algebra-- and another person might get recommended different content.
So online education
is a big area as well,
where people are using--
or there's a lot of potential
for big data and data science.
Online retailers, these are
some of the big retailers in US.
You can map it to, for those
of you who are not from US,
you can map it to any other
big retailer in your country.
Social networks, the
friends recommendation,
the job recommendation,
the who to follow,
who to be friends with, what
jobs to look for, and so on.
All of this we see it
around us all the time.
Entertainment: the recommendation engines that you see at Netflix, YouTube, and Pandora. We see them all the time.
And we are going to actually cover this in depth during our boot camp, where we go over recommendation engines. So we'll talk about building recommendation engines in detail.
And health care, I
think health care
is a very interesting area.
A lot of promise, full of
good data science and big data
applications.
You can probably recommend different medications based on your understanding of the patient-- what you know about the patient: their age, their ethnicity, their background. Maybe this medication is shown not to work on people of this age and background, whereas that medication works better for this background, right. There are a lot of applications in health care that we can cite.
Let's look at this here: online shopping. Amazon-- I think all of us have seen this. You go and decide to buy a book, and Amazon's predictive models will say: yeah, this guy's going to buy this book, how about we recommend a second book and offer this person 5% off? And Amazon's predictive models have already figured out that this person might not even buy that second book otherwise, so let's take a revenue hit on it, offer this book, and make money out of it.
So that's a high
level idea here.
And similarly, we have seen this: who to follow, or people you may know, on LinkedIn; the groups that I may want to join; any jobs that I may be interested in. All of this is coming from some predictive models behind the scenes.
OK, let's keep going here.
Online entertainment,
we've seen this.
If I have added a
movie in my queue
or based on my past
behavior, Netflix
would recommend
some other movies
and ask me to
watch these movies.
So what I would like to do is just spend two or three minutes now: if you have any application that you think should be there, or maybe you are interested in, or maybe you know that your company is working on, or a friend of yours is working on-- anything else that you would like to brainstorm, or just mention?
We want to be in that-- before we start learning something, we have to be in that state of mind where we actually passionately think about what is happening around us.
So if you can share some applications that either you would have liked to see on my slide deck, or that you think exist, or anything else.
Any other applications
that you can think of?
And again, you don't
have to be correct.
Don't worry about being wrong.
I think being in that proactive mindset is going to help.
OK.
Let me try to-- do you think banks, financial institutions-- are there any applications that a finance or credit card company may have for using big data, or data science, or predictive modeling? Any application that you can think of in finance or the credit card industry?
Banks could offer you some extra credit when they see you are out of money, or-- Sure, how about-- I'm not sure, Marco, is there a concept of-- yes, credit card fraud. That's what I wanted to hear. So credit card fraud is an application, right. We get this phone call, or email, or sometimes our credit card may get blocked just because some transaction happened.
Jose, yes, risk. And is there a concept of a credit score in Europe? In the US we have a credit score ranging from, I don't know, 200 to 850 or something. So is there a concept of how creditworthy you are, with a score associated with you and your-- we call it a social security number in the US; I don't know what you call it.
Marco, yeah, go ahead please.
Each bank has its own regulations, and they check how creditworthy you are.
Yes, and what do you think? Do you think that they are accurate when they predict that? I hopefully think so. All loans are registered, so they know everything-- they have your income and so on.
Yeah, so essentially these are predictive models. They actually assign you a score based on different factors: how old you are, what kind of job you have, what your performance was on the previous loans that you had, and so on.
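To make that concrete, here is an illustrative sketch of a score as a weighted sum of such factors, rescaled into a 300-850 style range. The factors, weights, and range mapping are all invented for the example-- this is not any bureau's actual formula.

```python
# Toy credit score: weighted sum of factors, clamped to [0, 1], then
# mapped onto a 300-850 style scale. All numbers here are assumptions.
def credit_score(income, years_employed, missed_payments):
    raw = (0.4 * min(income / 100_000, 1.0)        # income, capped
           + 0.3 * min(years_employed / 10, 1.0)   # job stability
           - 0.3 * min(missed_payments / 5, 1.0))  # past loan performance
    raw = max(0.0, min(1.0, raw))                  # clamp to [0, 1]
    return round(300 + raw * 550)                  # map to 300-850
```

The point is only the shape of the model: each factor moves the score up or down, and the final number is what the bank acts on.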
So if you look at this, this concept is not new for us. It has been going on for a long time now. What has happened is that data gathering has become so easy, and we are gathering tons and tons of data about every piece of machinery, every individual, every car, every internet-of-things device, every product in an online store, every customer going into a grocery store. Just imagine the amount of data that we are gathering now.
So these techniques have been there for a long time, and now, suddenly, with this explosion of sensors being readily available, the possibility of sensors being connected to the cloud easily, and storage being cheap, we are in this new era where we have too much data. But many companies-- and you would be surprised, actually. We deal with companies on a daily basis. You might be thinking: wow, this is a big company, they must have smart data science people. But you will actually be surprised that many big companies have no clue what to do with their data. And that comes as a surprise, sometimes even to me. You talk to-- without naming names-- you talk to this big company, and they're saying: yeah, we have this data, but we don't know what to do with it. And then you're surprised. From outside, you would think that they are a billion-dollar company, so why wouldn't they be able to use their data effectively?
OK so let's keep going.
So, if you think about this, what we see here-- all the friend recommendations, or job recommendations, the movie recommendations, or the fraud detection, or whether a customer is going to quit or not, whether an accident will happen or not-- the underlying magic behind all of this is big data. We have lots and lots of data about everything that is happening, plus predictive analytics: some body of algorithms that really helps you predict.
I think one thing that I should mention is that Amazon actually filed a patent for anticipatory shipping just a few months ago, if not a year ago. And what that means is that they would actually predict that somebody is going to order, maybe, a laptop or a particular product. It could be a PlayStation, it could be Microsoft, or say a Dell laptop, or an HP laptop of a particular model, and they would actually start packaging it and getting it ready for shipping, even without the order arriving. I mean, it has gotten so sophisticated. Of course, it is not possible for every single product. But for the products they have enough data for, they can actually predict that somebody is going to order it.
I live in Seattle, in a particular zip code, so they might know that somebody from the zip code is going to order, say, an HP laptop, and they would start packaging that particular laptop even before the order arrives. It may sound like science fiction, but it is happening right now.
So all of this is because of the techniques that are generally called machine learning and predictive analytics. And it has become possible only after we have lots and lots of data-- which we may call big data-- about everything, and that is making all of this possible.
So what I would like to do, before we go into any predictive analytics, is have all of us take a look at what a big data pipeline looks like.
OK, so if you look at this, let's take the example of Amazon. I'm assuming that all of us know what Amazon is, right? It's a shopping engine-- millions of-- tens of billions of products-- and they are really present in many countries.
So what would a typical big
data pipeline look like?
And it doesn't have to
be exactly like this,
this is just to give you
a conceptual understanding
of how everything happens
in a big data pipeline.
So lets look at this.
So the first stage-- if I put my laser pointer here-- the first stage is data influx. I open my computer, I open my browser, I go and type amazon.com. The moment I go to amazon.com, an HTTP request goes to Amazon's servers, and then Amazon's servers return me a page. And now Amazon would know whether I am a brand new user or a returning user. If I'm a returning user, they might, at the same time, within that millisecond time window, also quickly see what it was that I was browsing for. Can we generate more recommendations? And all of this data will come in.
So if you look at this, the Amazon page that I see is a dense page. A lot of material is on the page, but the page may be coming from maybe 100 different servers, possibly. One server is giving me all the recommendations from clothing, because I was browsing clothing, but I also bought some books, so there is a book recommendation server as well. There are some electronics recommendations, and there are just tons of things that are happening, and all of this comes in and is displayed to me.
I go and click something. Now that click goes, and based on that an appropriate page comes back. Then another click, another click, right? So I will keep doing this, and at the same time, if you look at this, in real time they are also generating predictions, because based on whatever product I'm looking at, they will bring back the products that should be recommended to me, or what other people bought.
Now this is a lot of data. Tons and tons of data is arriving. You can imagine how much data they are gathering from each session, from each user. Now, what do you think they should do? They should actually go and start collecting the data, correct? Because they would not be able to do anything actionable with it unless they start collecting this data.
So it is entirely possible-- and this is actually how things are done-- that all the pages, the content of each page, might go and get stored on one server. Then you have all the clicks going to a different server. Then the products that I viewed-- maybe a list of products that I looked at from clothing-- might go to a different server. The products that I looked at from electronics might go to yet another server. So all of this data is going to different places and getting stored on hard drives, or whatever storage they have.
OK is everybody with me so far?
I hope I'm not
going too fast here.
Please ask questions.
That's the only way, I think-- with the session being remote, I cannot see your faces, so you really have to give me some feedback on whether this is making sense so far. And please ask questions. Is this clear so far?
OK, sounds good.
Thanks Marco, thanks Joeri.
So now, if you look at this, I mentioned to you that the clicks are going and being stored somewhere. The products that were shown to me were going somewhere else. Maybe my identity and my information-- who I am-- and of course they anonymize it, so they don't really save my name somewhere in the logs. But all this data is in different places.
Moreover, it is not always humans who are coming and accessing their website. Sometimes it is some company that is trying to scrape their website for different product prices-- some automated computer programs that are just clicking on every link and trying to get what products exist on their website.
So there is this automated traffic, and it should be marked as such, because if you think about it, this is not really human traffic and it should be identified. We call them bots. In the online services industry we call all of these bots, because they don't represent typical human behavior.
So there's a lot of preprocessing that would happen. You will clean up: maybe some clicks were because of some bug in the code-- instead of one click, you logged two clicks. Or maybe some click was lost because of network issues. So some data cleaning will happen.
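A minimal sketch of that cleaning stage might look like the following: drop events from known bot agents and collapse clicks that were accidentally logged twice. The log field names and bot markers here are assumptions for illustration.

```python
# Sketch of the cleaning stage. Field names ("user", "url", "ts",
# "agent") and the bot markers are illustrative assumptions.
BOT_AGENTS = {"crawler", "scraper"}

def clean_clicks(events):
    """Return human click events with exact duplicates removed, order kept."""
    seen = set()
    cleaned = []
    for e in events:
        if e["agent"] in BOT_AGENTS:
            continue                        # automated traffic: skip it
        key = (e["user"], e["url"], e["ts"])
        if key in seen:
            continue                        # same click logged twice
        seen.add(key)
        cleaned.append(e)
    return cleaned

raw = [
    {"user": "u1", "url": "/p/1", "ts": 100, "agent": "browser"},
    {"user": "u1", "url": "/p/1", "ts": 100, "agent": "browser"},  # duplicate
    {"user": "bot", "url": "/p/2", "ts": 101, "agent": "crawler"},  # bot
]
```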
Then, after that, you will have a transformation stage where you will transform all the data, possibly merge the data from different sources, and make it into a single data source. In some cases you will put it in some big data warehouse; in some cases you will push it into traditional databases. So there are different representations, because not everybody needs data in the same format.
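The merge step of that transformation stage can be sketched as combining per-source logs into one record per user. In practice this would be a join in SQL, Hive, or a dataframe library; this pure-Python version, with assumed field names, only shows the shape of the operation.

```python
# Merge two log sources (clicks, product views) into one record per
# user, ready to load into a warehouse. Field names are assumptions.
def merge_sources(clicks, views):
    merged = {}
    for c in clicks:
        merged.setdefault(c["user"], {"clicks": 0, "viewed": []})
        merged[c["user"]]["clicks"] += 1
    for v in views:
        merged.setdefault(v["user"], {"clicks": 0, "viewed": []})
        merged[v["user"]]["viewed"].append(v["product"])
    return merged

clicks = [{"user": "u1"}, {"user": "u1"}, {"user": "u2"}]
views = [{"user": "u1", "product": "book"}]
merged = merge_sources(clicks, views)
```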
And only after that, once the data is transformed and this whole transformation is complete, are you going to actually do some data mining. You will discover patterns. Patterns as in: who should be recommended what content? Who should be recommended what movie or product? Who is likely to be a bot versus an actual human? What news item should this person see? Who is more likely to buy this product? What products are like another? How do you forecast the revenue? And so on.
And so you mine the data, and then integrate and evaluate based on what your business goal is. In a typical big company, like Amazon, or Facebook, or Twitter, there are many teams doing different things with the same data. So this is typically what your big data pipeline would look like.
Any questions about this?
OK, let's keep going.
So if you look at this,
and this is only a part
of the overall landscape.
Let me just-- I'm taking a note here; I noticed a small typo here. So if you look at this, there are some tools-- you may or may not have heard of all of them-- but there are some tools that are used for logging, purely for logging.
So when somebody comes to my website and they're clicking and issuing queries, how do I gather all of that data? Splunk and Flume are some of the tools that are used for logging. And, if you look at this, Flume is also used for collecting.
And if you look at the y-axis, these are the different stages in data processing. On the x-axis, these are the different technologies that you have. So Flume-- here, where my laser pointer is-- is used for collecting and maintaining the logs.
And Splunk is for storage. You can think of these as flat files: everything that comes in is just dumped into log files. However, if you want to store your data in a more SQL format-- more of a traditional database format-- then you will use some of the tools like MySQL, Oracle, Teradata, SQL Server, DB2, and so on.
And similarly, when the data is coming in and you are gathering it, but it is not really in the form of database tables-- it is not normalized in that fashion-- then you would store it in more of a NoSQL format. And with NoSQL, you can say that you still use SQL-like languages, Hive and Pig, to access these databases, but these are on big data. So you can think of these as big data databases.
And we will talk about
this during the boot camp.
Right now you should not worry about all of this; it will become clearer. This is just giving you an idea of the overall landscape. We'll go in depth into most of these things during the boot camp.
Then MapReduce. You can think of it like this: if you have to run something on a single machine, you can simply just run the command. Say your data was 4 gigs and you have 8 gigs of RAM on your computer; you run this processing and you are fine. But what if the data that you have is about, say, a terabyte, or even 100 gigabytes, and your RAM is only 8 gigabytes? What are you going to do? For that, you are going to use this technology called MapReduce, though there are other paradigms that have emerged now.
You may have heard of Spark.
So Spark exists now.
And there are other
paradigms that are emerging.
But, in general, MapReduce is the more commonly known one. And Hortonworks, MapR, and Cloudera all have tools for doing MapReduce on your data.
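The MapReduce idea itself fits in a few lines: a map step emits key-value pairs, a shuffle groups the pairs by key, and a reduce step aggregates each group. This single-machine word-count sketch only illustrates the programming model; the frameworks above run the same pattern across many machines, so the data never has to fit in one machine's RAM.

```python
from collections import defaultdict

def map_step(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    """Reduce: aggregate each group of values into one result per key."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    pairs = [p for line in lines for p in map_step(line)]
    return reduce_step(shuffle(pairs))
```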
And if you have to build predictive models, or do some analysis work-- if you look at the y-axis part, we are exploring, predicting, recommending. So if I overlap this here, these are all the tools that you might be using for doing analysis, prediction, recommendation, and visualization using MapReduce.
And we are going to do R in detail, and Hive in detail, during our boot camp. We will not be covering Mahout, but our curriculum, our textbook, has interesting labs such that, once you know R and Hive, Mahout is going to be fairly straightforward. The lab will take you about two hours on your own after the boot camp. Mahout is a distributed machine learning library, and we will have that in our curriculum. We will not cover it during the five days, but you will have the labs to follow up on it.
Now, if you look at this, the real-time part. I think if you go back a decade, it was a big deal if somebody knew MapReduce. It was a big deal if you could process big data, but this is no longer a big deal, because there are some very simple tools that will actually let you handle big data. And people, when they come into the boot camp, are actually surprised and shocked, if they did not know, that you can actually learn how to handle big data. Just handle it-- I'm not saying a deeper understanding here.
I just want to make that clear. But you can actually start running queries on big data: if you have some basic understanding of SQL, or some background in programming, in two to three hours you can actually create your own Hadoop cluster in the cloud and start running Hive queries. So being able to handle a terabyte, or 10 terabytes, of data is not even a big deal anymore. Anybody can do that.
But what's emerging now is: can you handle this big data in real time? Previously it might have been the case that you would gather data for a whole day and then process it at the end of the day. But think about-- let's take the example of Twitter. Twitter gives you these trending tweets or trending hashtags.
What if Twitter gave you those trending hashtags the following day? How would your experience be? But if you look at it, Twitter actually gives you trending queries, and they are updated very frequently-- I don't know how often, maybe every five or 10 minutes-- but if they gave them to you even after four or five hours, they would no longer be trending. So imagine the number of tweets they get in a minute, or in 30 minutes; all of this has to be processed very efficiently.
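A toy version of that near-real-time computation is a sliding window over hashtag events: keep only the last N seconds of events and report the current top-k. This is a single-machine stand-in for what real stream processors (Storm, Spark Streaming, and the like) do at scale; timestamps and window size here are just example values.

```python
from collections import Counter, deque

class TrendingWindow:
    """Count hashtags over a sliding time window and report the top-k."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, hashtag), oldest first
        self.counts = Counter()

    def add(self, ts, tag):
        self.events.append((ts, tag))
        self.counts[tag] += 1
        self._expire(ts)

    def _expire(self, now):
        """Drop events that have fallen out of the window."""
        while self.events and self.events[0][0] <= now - self.window:
            _, old_tag = self.events.popleft()
            self.counts[old_tag] -= 1

    def top(self, k):
        return [tag for tag, n in self.counts.most_common(k) if n > 0]

# Example stream: two tags early on, one much later.
window = TrendingWindow(60)
for ts, tag in [(0, "#a"), (10, "#b"), (10, "#b"), (100, "#c")]:
    window.add(ts, tag)
```

By the time the event at t=100 arrives, the early tags have expired, so only the recent one is still "trending."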
We're talking about real-time processing of data; we're talking about real-time analytics here. It is not exactly real time, as you can expect-- sometimes it's called near-real-time analytics. But there are some tools that deal with data in real time or near real time.
And we will cover one of these tools, and then you will actually build your own real-time analytics pipeline at the end of the boot camp.
And over here, analytics. For analytics we have a bunch of different tools-- and Power BI, from Microsoft, should be here as well. And there are other tools that exist for visualization. So when I say analytics on this slide, we are also talking about visualization. So that's what this means.
I know it's a lot of material condensed into a single slide, and the boot camp will actually demystify this. But the idea is to get you to just look at it and start appreciating what is happening.
And the boot camp is going to actually give you a solid understanding of this whole ecosystem. The boot camp will take care of this.
If you look at this, the science part: we are going to spend more time on the science part, but we'll also do a lot of the engineering and management parts.
And by the end of the fifth day you will have a good understanding of: OK, what is all of this? Even so, there's no way we can teach all of these technologies, but you will have pockets: one technology from here, one technology from there, and so on.
So you have an understanding, and once you understand one technology, other technologies are really the same idea, with some more features or some fewer features.
So the goal is actually to give you a bigger, high-level understanding, end to end, so you are capable of making decisions on your own, for your company, for your business, and so that you have a basic understanding end to end.
Are there any
questions about this?
And, as I said, please feel free to ask any questions.
This is extremely important.
Asking questions
is very important.
OK let's keep going.
So we looked at this whole idea behind big data: what a big data pipeline looks like, what some of the technologies are that exist, and what some of the tasks are for processing big data. Now we'll actually start moving towards the data mining, machine learning, and predictive modeling side.
And so there can be different
types of data mining tasks.
You can think of
these tasks as--
or some methods that we use
they can be descriptive methods,
they can be predictive
methods, and they
can be prescriptive methods.
Instead of going through the definitions (as I will keep repeating during the boot camp, I'm not a big fan of definitions), I will try to explain things using examples, and then the definitions themselves will make sense.
So let's actually
look at this example.
You're managing traffic, and your goal is to know when a traffic jam has happened, or to avoid traffic jams altogether.
So the descriptive method
would be that the traffic
jam has happened already.
So it is describing the situation: a traffic jam has already happened.
And the implicit thing is,
now do something about it.
What can we do about this
traffic jam that has happened?
As opposed to this, if I
go to a predictive method,
if you notice that now there's
a change from descriptive
to predictive, it serves both an informing and a warning role.
What does that mean?
The traffic jam is about to
happen in the next 30 minutes.
And now you suddenly see that something is coming up. It is not hindsight information.
You actually are
getting this information
even before something
has happened.
And, of course, implicit
is do something about it
before this happens.
And another one is more
of a prescriptive method
that might involve
some informing,
and warning, and advisory role.
What does that mean? Take action so that the traffic jam does not happen.
So that is a
prescriptive action.
Our traffic jam is about to
happen in the next 30 minutes.
And you could possibly take the
following courses of action,
route traffic to
service road near I-5,
and I-5 is a highway in
Seattle, and, similarly,
block more traffic from entering the Washington 520 bridge.
Now I would like to
actually pause because I've
been speaking for a long time.
Does this make sense?
Do we think that this is possible?
If we somehow gather all the data from the last five years about the way traffic patterns have been in a given area, do you think everything that we're talking about is possible?
Or does it look like
science fiction?
I'm sure it's possible; the how-to-do-it part is science fiction for me. So we'll look at that, absolutely, the how part.
Yes, it is absolutely possible.
And you would actually be
surprised how easy it is now.
We were offering a
training at a company that
is owned by Audi,
BMW, and Mercedes,
I guess, it's a
geo-mapping company.
And the data from these cars is only one of their sources; they have data coming from other sources as well.
So all these cars, they
actually send back all the data,
Where is the car? Where did the car turn? What was the location? When was the brake hit?
What was the weather?
What was the temperature?
So all of these companies
are-- these companies
are gathering tons
and tons of data.
And on the how part, we will spend five days, 50 hours. But the first thing is that we need to see: OK, is it even possible?
Is it happening?
So yes, we'll cover the how part in much detail, though not specifically traffic management.
I will have some
other applications.
But you can map it to any domain
once we go through all of that.
OK let's keep going.
For instance, I'm not sure if Kayak exists outside of the US. I have no idea. But Kayak is a website, and this is one of the searches that I did at Kayak.com.
So I was looking for
a ticket from Seattle
to San Francisco
on a certain date.
And Kayak gave me this.
It gave me a price
trend like this.
What do you think is this
predictive in nature,
or prescriptive in nature,
or descriptive in nature?
What do you think
this is telling me?
Is this a prediction, or a
description, or prescription?
I think it's a description.
It is a description.
Why do we think so?
Already--
Yeah.
It has history.
Yeah it is-- it's
historical data.
And think about this,
this is the traditional BI
that we have been living in
for a long time now right.
The reports: these customers have left us already, this was our revenue, we have had so many sales in clothing, so many sales in sports, and so many sales in electronics, right.
So this is
descriptive in nature.
How about this rectangle?
Price may rise
within seven days.
Is this predictive,
descriptive, prescriptive?
It is predictive in nature.
Yes, that is correct right.
So and then there is a
prescriptive element.
I'm not sure how they are
doing the prescription.
Maybe the prescription is just
because prediction is there,
so advice is buy.
But that gives you
an idea that how
these companies are actually
leveraging all the data.
And again, do you think what
Kayak is doing is possible?
And I want to keep
emphasizing-- sometimes
my question may seem like
I'm asking the obvious,
but I just want to make sure that we understand.
Do you think Kayak can do this?
And how do you think
they might do this?
Do you think they have
access to all the data
of all the historical pricing?
Sure. I'm sure that I can get some API from somewhere; I think there is an international organization, IATA, the International Air Transport Association, or some other.
I think I can get this
data easily from somewhere
after paying some money.
And then after that, I can
look at some seasonalities,
the names of the cities,
what were the dates, what
was the day of the week, what
was the day of the month,
were there any events around?
And I can build a
model that can give me
an idea of whether the price
is going to rise or decrease
and so on.
So let's take a look
at it once again,
this whole idea of descriptive
and predictive and prescriptive
analytics.
And for the moment, ignore
the diagnostic and preemptive
analytics.
Now, as I said,
descriptive analytics
was our traditional BI.
And now, I'm not sure what is
the background of the audience
right now, but how difficult
do you think it is, these days,
to actually get this
descriptive analytics?
Most of the companies, most of the database packages, really have all of these pie charts and bar graphs and trending charts and box plots and all of that built in.
So if I want to ask a question: what is the attrition in the last six months?
Which customers have we lost?
If I'm a business owner and I want to know which of the customers we have lost, it used to be a big deal 20 or 25 years ago, but now most off-the-shelf database software would have all of these visualizations built in.
What do you think is the level
of difficulty, right now,
for this?
Not much, right. So any off-the-shelf software.
Any database developer
would just very quickly
set this up for you.
And the business value, of course: knowing it after it has happened is extremely important, but we can do more.
And I will jump to
predictive analytics
right away, before even
going to diagnostic.
I will come back to what that is.
But what if-- what if I was
able to tell what will happen?
Which customers are likely
to leave in next six months?
Now think about this, that
over here in descriptive
I am being told that
somebody has already left.
In this case, I'm being
told that somebody is
planning to leave your service.
Somebody is planning to
switch to Vodaphone or Orange.
Or somebody is planning
to, from Amazon,
somebody is planning
to switch to Azure,
or some other company.
So there-- there's clearly
more business value here,
but at the same time, the
difficulty level increases.
And if I go all the way to prescriptive analytics, not only is it giving me an idea that it might happen, it is telling me which customers we might stop from leaving if we try. More of a what-if scenario.
What if we do this?
Will this help us
mitigate the situation?
So if you look at this,
the descriptive analytics
is mostly about information.
And then, as we move
along these lines,
we're talking about insights.
We are-- we have more insights
into what has happened
and what might happen.
And then, if we go beyond,
the difficulty keeps going up
but the business value
also keeps going up.
And if you think of diagnostic analytics, it is something that our traditional data analysts and business analysts would do.
Something has happened
already, now why
did we lose the customers?
Even though we did not know in advance which customers, at least if we understand why we lost customers, we can probably avoid that in the future.
So, in this case, in predictive
case, we know beforehand.
In this case, we understand, we
diagnose what was the reason.
Does it make sense?
Does it make sense to everyone?
OK, great. So you can see where our focus is going to be. We'll learn some diagnostic aspects of things, and we'll see some descriptive analytics, but most of the focus is going to be around predictive analytics during the course.
We'll also cover diagnostic analytics, but the more common term is predictive analytics, and that's what we should be focusing on.
So, in the next few slides,
I'm going to talk about some
of the techniques that
I use for data mining
and predictive analytics.
And the goal is actually
to make you aware
that these techniques exist.
We're going to cover
all of this in depth
during the five day boot camp.
So if you have questions,
please feel free to ask.
But don't assume that this is all there is to these techniques.
We will actually spend
a lot of time diving
into the finer details of
each of these techniques.
It's just to get
the point across.
The point is to make you look at them so you already start thinking about them, so that by the time you attend the boot camp, in two or four or six weeks, you already know what it is, you have thought about it, and it is not something brand new for you.
Now, let's look at this example.
We're talking about
classification now.
And classification
is a simple data
mining, or machine learning,
or predictive analytics task.
And what classification does is predict one of a set of distinct outcomes.
For instance, if
you want to predict
whether a transaction
in a credit card
is a fraud, or not a fraud, it
is a classification problem.
For instance, you may have seen those cameras that will create a bounding box around a face when a face comes in front of the camera.
So how does that camera detect
whether it is a face or not?
That is a
classification problem.
So there is a
classification model
running behind the
scenes that is based
on everything in the image.
It is detecting whether there
is a face in the image or not.
And so whenever there is
a discrete set of outcomes
from your predictive model
that is called a classification
problem.
And the way
classification works is
let's take this example here.
And this is a toy example.
So in the US the
government agency
that handles all the taxes and the tax returns is called the IRS, the Internal Revenue Service.
And so you have
millions of taxpayers
and the IRS does not have the resources to actually audit all the tax returns.
Now, if I ask you, how would
you audit your taxpayers
for potential fraud?
What would be your strategy
for detecting whether somebody
has filed wrong returns
or intentionally
or unintentionally
fraudulent returns?
So what do you think?
What are some of the--
some of the techniques that
you would-- you might use?
And forget about the slide, I
mean I'm just asking you right.
So you have 10 million
taxpayers and you have resources
only to just check
say 10,000 taxpayers.
Which taxpayers should you choose for testing whether they are committing fraud, or whether their tax returns have errors?
Any thoughts?
Any hacks?
Any basic techniques?
Anything that you
can come up with?
Oh yeah, absolutely,
fabulous random sampling.
So you will randomly pick
10,000 users, 10,000 taxpayers,
out of all of the taxpayers.
So Marco was suggesting random sampling, which is going to work.
But any potential weaknesses
of random sampling?
And I'm saying
random would work,
but any potential weaknesses?
Anyone?
Do we think that random sampling will catch a lot of frauds if we take a 10% random sample? Don't we think that it will only capture 10% of the frauds?
Or if we do a-- OK, Jose, no answer to my comment on whether random sampling is good or not?
OK, yes, so random sampling will work, but it will only capture a small subset of potential frauds.
So random will work because
there's not much you can do.
But, yes better to search for
a pattern, absolutely right,
some sort of selective sampling.
So what if I do this?
I am looking at
this year's taxes.
But for the last 20
years I know who cheated
and who did not right.
So I will have some
historical data.
So assume that this bigger
table that I have here, this
is my historical data, and
the historical data tells me
when somebody requested a
refund, meaning that they
overpaid taxes, and their
marital status was single,
and the taxable income was 125K.
They did not cheat.
So this was one of the
taxpayers in the past.
Then I have another taxpayer
who did not request a refund,
he was married, and 100k.
He did not cheat.
And I keep going.
And the fifth
taxpayer, he cheated.
Then the sixth one did not.
The seventh one did not.
The eighth one actually
did cheat on their taxes.
So this is my historical data.
And if you look
at this, refund is
a categorical variable,
marital status
is a categorical variable, and taxable income is a continuous variable. It's a number, not really a category.
And whether somebody cheated or not, I can call it a class, which is where the name classification comes from, but it is a categorical variable.
Now if you look at this,
can I find a pattern
that can tell me
if refund is this,
and marital status is this,
and taxable income is this,
the person is going to cheat?
This is what we
call classification.
Based on some historical data, which could be categorical or continuous, I am coming up with a predictive model that, for future data, will give me an idea of whether somebody is going to cheat or not.
So if you look at this,
all of this data is there,
and we will push
this data into--
and we call this historical
data the training
data or the training set.
And once we have
the training set,
we will bring in a
machine learning algorithm
that takes all of this data and learns something, which is called a machine learning model.
And once we have this model, for any new data that I don't know anything about, called the test set, I will give it to the model, and the model will tell me whether the person is likely to cheat or not.
So this is an example
of classification.
And we will deal with classification in depth during the boot camp.
We spend a lot of time on classification.
This is just a high level idea.
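The train-then-predict flow just described can be sketched in a few lines of Python. This is only a toy illustration, not the boot camp's actual method: the first two training rows echo the slide's toy table, the remaining rows are made-up values, and the one-nearest-neighbor rule is just one simple way to turn a training set into predictions.

```python
# (refund, marital_status, taxable_income_in_K) -> cheated?
# First two rows mirror the talk's toy table; the rest are invented.
train = [
    (("yes", "single",   125), "no"),
    (("no",  "married",  100), "no"),
    (("no",  "single",    70), "no"),
    (("yes", "married",  120), "no"),
    (("no",  "divorced",  95), "yes"),
    (("no",  "married",   60), "no"),
    (("yes", "divorced", 220), "no"),
    (("no",  "single",    85), "yes"),
]

def distance(a, b):
    """Mismatches on the categorical fields plus a scaled income gap."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100.0

def predict(taxpayer):
    """Label a new taxpayer by its single nearest training example."""
    _, label = min(train, key=lambda ex: distance(ex[0], taxpayer))
    return label

print(predict(("no", "single", 80)))  # nearest neighbor is the last row
```

Here the "model" is simply the stored training set plus the distance rule; real classifiers learn a more compact pattern, as the boot camp will show.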
What is a training set?
What is a test set?
What is a model?
What is and-- what
is a classifier?
We'll get into that
during the boot camp.
Any questions about this?
OK, great, let's keep going.
And I think I did
mention to you,
I'm not a big fan
of definitions.
I will leave these definitions for you; they formally explain what I just went over.
This slide-- you will
have the slide deck
on the learning portal.
I made some minor changes yesterday, so the updated slide deck should be on the learning portal shortly.
So what are the applications?
Direct marketing,
so I'm not sure
how it is in other countries.
But in the US, we have a lot of junk mail that lands in our mailbox on a daily basis. A lot of paper that comes in.
So companies they actually
use classification
for direct marketing.
Instead of just sending it
to every possible customer
out there, how about we send
it to only the people who
are likely to buy our
product based on their age,
based on their gender,
based on their--
where they live,
based on understanding
of maybe how many people
they have in the household,
and so on?
Can we do that?
Fraud detection: can we predict fraudulent cases in credit card transactions?
I talked about that.
Customer attrition
or churn, can we
predict whether a
customer is likely to be
lost to a competitor?
All of these are examples
of classification.
There is another
body of techniques
that we will cover in depth
when we come to the boot camp
but I am just mentioning them because knowing them is important.
So what do we do in clustering? Well, computers only understand numbers.
So if there is any text, it has
to be translated into numbers.
If there is a characteristic like my gender, it has to be translated to a number. The color of my car has to be translated to a number.
So computers they
understand numbers.
So assume that these are some
data points that we have.
Each of them could
be a news article.
It could be a human, a
customer, it could be a car.
It could be just anything.
So these are the same
types of objects.
We converted them into
numbers, and now we're representing them by x, y, and z-coordinates, right.
So x could be their age.
Y could be their income, and
z could be the amount of money
that they spent on our website.
OK, so now the idea of clustering is that once you come up with this numeric representation, how do you combine the ones
how do you combine the ones
that are close to each other
or that are similar
to each other in one
group and the ones that are
not similar to each other
into a different group?
So this is the high
level idea of clustering.
And, in this case, you can see that these points came into a single cluster here, these came into another cluster here, and these points came into a different cluster.
And the idea is that we
want the points that are--
or the customers,
or the cars, I mean
so each point represents
just some entity
that we are interested in.
It could be a product, a
car, a human, an animal,
just anything.
So everything that is similar stays in close proximity, and everything that is not alike is separated as far apart as possible.
So this is the
idea of clustering.
Why would we do it?
We want to subdivide
our customers
into distinct subsets of customers.
So I have this example,
I have nine customers
but not all the
customers are the same.
Maybe these two customers, they
are my high value customers.
I want to spend--
maybe I want to give
them more attention
versus these customers that
are not high value customers.
Maybe these customers
are the ones that
are more interested in
clothing versus these that
are more interested in electronics, versus these customers that are interested in sports and other products. Same idea.
We have definitions here.
I will let you take a
look at the definitions.
We'll cover clustering in detail during the boot camp, but this is just here to give you an idea of what clustering is.
So how do we know which
points are similar?
So Euclidean distance is a fancy name for the distance formula, if you remember it. If you take the distance between two points as the square root of (x2 minus x1) squared, plus (y2 minus y1) squared, plus (z2 minus z1) squared, that is called the Euclidean distance.
So one way to actually see
how similar these two are
is how close they are in
terms of the distance metric.
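That distance formula is short enough to write down directly. A minimal Python sketch (the example points are mine, not from the talk):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension:
    sqrt of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 3-D case: the classic 3-4-5 right triangle gives distance 5.
print(euclidean((0, 0, 0), (3, 4, 0)))  # -> 5.0

# The same function works unchanged in 100 dimensions.
print(euclidean([0] * 100, [1] * 100))  # -> 10.0
```

Note the function never assumes three dimensions, which is exactly the point raised below about 100-feature spaces.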
OK and sorry please go ahead.
[Question from Marco]
No I said that--
Yeah they are, great Marco,
thanks for the question.
No, it is just
because I could not
draw more than three
dimensions on paper.
But, in theory, so
think about this--
let's do this, what do
you think about Amazon?
How many attributes might they be gathering about each customer? Hundreds, if not thousands.
Right?
Is this clear?
So each company, for
instance Netflix,
what do you think about Netflix?
If they want to cluster
movies, how many attributes might they have about a given movie?
Or Amazon, how many
attributes it might
have about a single product?
Probably hundreds right?
So if I have 100 features, this will become a 100-dimensional space, but the idea remains the same.
You will still take the
distance in the same fashion.
Marco does it answer
your question?
Yes it does.
And can you also find
strongest clustering?
Now when you say strongest clustering, what would that be? The best prediction. Yes, clustering tends to be ad hoc.
I think you're referring to
whether two clusters would
have been good enough or three
clusters would be better right?
Yes we will actually
talk about that.
How many clusters are good? We will spend about two hours on clustering itself, with some exercises.
So yes, we'll get into that when
we get to clustering actually.
We'll use K-Means in class.
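Since K-Means comes up in class, here is a rough sketch of the idea in plain Python. The (age, income)-style points and the starting centroids are made up, and real implementations add smarter initialization and a convergence check; this just shows the two alternating steps.

```python
import math

def kmeans(points, centroids, iters=10):
    """Bare-bones K-Means (Lloyd's algorithm)."""
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else cen
            for cl, cen in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two visibly separate groups of points, with hand-picked starting centroids.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, centroids=[(0, 0), (10, 10)])
print(clusters)
```

The "strongest clustering" question from the discussion is exactly the choice of how many centroids to start with, which the boot camp covers.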
OK so let's look at this.
For instance, if you look at clustering, think of Google News, or Yahoo News, or Amazon News, right. When they're clustering all the documents, nobody actually goes and manually puts each document in different categories.
But what they might be doing is converting each document, based on the number of words that are occurring and other language features, into a vector: some x1, some x2, some x3, and so on.
So you have these numbers, and you will create this clustering. Then, if a new document comes in, you will compute the distance of the document from each of the clusters, and, based on that, you will decide which kind of document it is.
And this will make sense.
We'll spend time
on text analytics
and we will also spend
time on clustering.
So it'll make much more sense.
But, at high level,
can we visualize this?
And sometimes, when you're looking at this for the first time, my understanding is that people find it hard to visualize how a text document can be converted into numbers. But this is what I want to convey here.
Do you-- do you see?
Do you have some sense
of how this might happen?
So I'm assuming that
the answer is yes.
Let's keep moving.
Yes, thanks Jose.
So there is another aspect,
which is association analysis.
And if you go to
a grocery store,
the grocery stores would see every basket that somebody bought. So they would look at each checkout: in one checkout you have milk and eggs, and in another, milk and bread.
And maybe in some
checkout somebody
bought milk, egg, and bread.
And somebody bought
toilet rolls, and egg.
And now what they
would do is they
would try to do some
association analysis.
And they will say: whenever there is milk, the basket also includes bread. Or 43% of the people who bought milk also bought eggs.
So they actually continuously run these analyses and then
optimize their inventory,
and then optimize their
placement on the shelves
and so on.
Just to make it easy
for the customers
and make it more
profitable for themselves.
For instance, in this case, how do you discover a rule? The transactions are: bread, Coke, and milk; beer and bread; beer, Coke, diaper, and milk; and so on.
So if you look at this
closely, the rules
discovered are that whenever
there's milk, there is Coke.
If you look at this, in the
first transaction we had milk,
and Coke was there.
And there was milk,
and Coke was there.
And milk, Coke was not there.
And milk, Coke was there.
So three out of four times,
whenever there was milk,
Coke was there.
And similarly, when we look
at diaper and milk together,
so diaper, and milk.
When there is diaper
and milk, you have beer.
When you have diapers
and milk, you have beer.
And diaper, and milk,
and Coke is there right.
So these are some of
the rules that they
would be learning--
grocery stores,
they would be
learning all the time.
Just to keep optimizing
their inventory and then
placement on the shelves.
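The "three out of four times" counting above is exactly what rule confidence measures. A small Python sketch using the talk's five toy baskets (the confidence function is a minimal illustration, not a full association-mining algorithm like Apriori):

```python
# The five toy transactions from the example.
baskets = [
    {"bread", "Coke", "milk"},
    {"beer", "bread"},
    {"beer", "Coke", "diaper", "milk"},
    {"beer", "bread", "diaper", "milk"},
    {"Coke", "diaper", "milk"},
]

def confidence(antecedent, consequent, baskets):
    """confidence(A -> B) = count(baskets with A and B) / count(baskets with A)."""
    has_a = [b for b in baskets if antecedent <= b]   # A is a subset of the basket
    has_both = [b for b in has_a if consequent <= b]  # ... and B as well
    return len(has_both) / len(has_a)

print(confidence({"milk"}, {"Coke"}, baskets))            # -> 0.75
print(confidence({"diaper", "milk"}, {"beer"}, baskets))  # about 0.67
```

Milk appears in four baskets and Coke in three of those, giving the 3-out-of-4 rule from the talk.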
And you may actually
see things like this.
In a grocery store, sometimes you will see a few things together that seem completely unrelated, but the stores have a reason to place them together, because they know these things are purchased together. And if they are easily accessible, people are going to buy them.
So that's the whole idea.
And I have some more
applications here.
I think I will let
you go through this.
We have only 15
minutes remaining
and I have five more slides that I need to finish.
So I will let you actually
go through this slide.
Another example is regression.
And regression is when, instead of the distinct outcomes of classification, where we were using a few variables, a few features, to predict a yes-or-no answer, a face or a non-face answer, we are predicting a continuous number.
So this is an
example from Zillow.
I will zoom it out.
Can you guys see the
numbers on these houses?
OK, great. So Zillow is a website; I don't think it exists in Europe, but we have it in the US.
So what they do is take maybe the crime rate of the neighborhood, and possibly how good the school system is in that nearby area.
Were there any
foreclosures or not?
Is public transportation
available or not?
What was the previous
selling price of the house?
And so based on a lot of
this data, historical data
about the neighborhood,
about the house,
about the general economic conditions. And they have this site; it's called Z-I-L-L-O-W dot com.
And they would actually
predict the price of a house.
So this is an example of a regression model.
So you're taking a
bunch of variables
and predicting the
price of the house
or predicting a number really.
And it could be the future stock price.
It could be how much
money you will make.
It could be how many
bikes you will rent.
It could be how many
cars you will rent.
It could be how many customers
will show up on your--
in your store.
So whenever you are faced with the question of how many or how much, you will use a regression algorithm and not a classification algorithm, because classification gives a distinct set of possibilities, and this is a continuous number.
And we will get into the
details of regression
as well, just to make sure.
Just to let you know.
So regression is as
simple as this right.
For instance, think of x as some variable, say the number of rooms in a house, and the price of the house as y.
So based on the
historical patterns,
you're trying to predict
what would it look like?
What would be the
price of the house?
But in the real world, it may not only be the number of rooms; it could be another factor, such as the covered area of the house, or the previous selling price of the house, and so on.
We'll-- we'll get into the
detail of this during the boot
camp.
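The rooms-versus-price idea can be sketched with the closed-form least-squares formulas. The numbers here are made up and perfectly linear, just to show the mechanics; real data needs more features and more care.

```python
# Fit price = slope * rooms + intercept by ordinary least squares.
rooms = [2, 3, 4, 5]
prices = [200, 250, 300, 350]  # in thousands, invented for illustration

n = len(rooms)
mean_x = sum(rooms) / n
mean_y = sum(prices) / n
# slope = covariance(x, y) / variance(x), the textbook formula.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(rooms, prices))
         / sum((x - mean_x) ** 2 for x in rooms))
intercept = mean_y - slope * mean_x

def predict_price(n_rooms):
    return slope * n_rooms + intercept

print(predict_price(6))  # -> 400.0 on this toy data
```

With more factors (covered area, previous selling price), the same idea becomes multiple regression, which the boot camp goes into.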
And another example,
if you look at this.
So when you go to a search engine and do a search query, these search engines actually
predict the
probability of click,
whether an ad will
be clicked or not.
And based on those
probability of click,
and some other factors we
don't want to get into,
they will actually show you
the ads in a particular order.
So they will keep the ad with
the highest likelihood of click
on the top, then the next one,
then the next one, and so on.
So if I ask you this question: what features would you consider about an ad?
So if I'm searching
for dog food,
what ad should have the
highest probability of click?
If you were to build
this predictive model,
without knowing anything right.
So I know that you don't know regression yet,
but what are some of the things
that you will care about?
What are going to be strong
predictors of which ad
is going to be clicked?
It doesn't have to be machine learning, right. What is your intuitive answer here? What will decide whether an ad is going to be clicked or not?
Possibly, bold text, sure.
What else?
The exact match, yes.
So for instance, if I'm
looking for dog food
and then it shows me,
I don't know, BBC.com,
it wouldn't make sense right.
So there has to be a semantic
match between the query
and the ad itself.
But Jose, if you're
deciding the position based
on the probability of click.
So the position will
actually be determined
based on the number that
we are trying to predict.
So do you think if an ad
was clicked more frequently,
in the past, there's
a chance that it
will be clicked more frequently
in the future as well?
Do you think that could
be a good indicator?
OK, so yes. I think we'll probably get into this example again when we look at regression examples later.
But right now, this is just to give you an idea that this is another technique, which has a lot of uses.
And regression
and classification
are by far the most
commonly used techniques
and we'll make sure that you
understand them very well.
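The "clicked more in the past, likely clicked more in the future" idea from the discussion is often captured as a smoothed historical click-through rate. A sketch, with made-up ad names and counts; real ad ranking uses far richer models than this single signal.

```python
# Laplace-smoothed click-through rate: (clicks + 1) / (impressions + 2).
# The smoothing avoids division by zero and gives an untried ad an
# optimistic 0.5 prior, which encourages showing it at least once.
def smoothed_ctr(clicks, impressions):
    return (clicks + 1) / (impressions + 2)

ads = {
    "dog-food-ad": (90, 1000),   # strong history
    "generic-ad": (10, 1000),    # weak history
    "brand-new-ad": (0, 0),      # no history yet
}
ranked = sorted(ads, key=lambda name: smoothed_ctr(*ads[name]), reverse=True)
print(ranked)  # the untried ad ranks first here, so it gets explored
```

Once the new ad accumulates impressions, its smoothed rate converges to its true historical rate and past performance dominates the ranking.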
Now, there is another category, which is called anomaly or deviation detection: fraud detection, credit card fraud detection, network intrusion detection, and bot traffic detection.
So all of this is yet
another body of applications
and we will talk about that during our boot camp as well.
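One of the simplest anomaly-detection ideas is to flag values far from the mean in standard-deviation terms, a z-score test. A sketch with made-up transaction amounts; production fraud systems are far more sophisticated, but the core "deviation from normal" idea is the same.

```python
import statistics

def zscore_anomalies(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Daily card charges with one wildly out-of-pattern transaction.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
print(zscore_anomalies(amounts, threshold=2.0))  # -> [500]
```

The threshold controls the trade-off between catching real frauds and raising false alarms, which is a recurring theme in all these detection problems.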
Now so what are some
of the challenges here?
So I'm done with, not all the techniques, but the ones I thought were worth mentioning in a limited amount of time.
So what are some
of the challenges?
It turns out that, back in my grad school days, the biggest data set that I worked on was about 500 megabytes. I used to work for Bing, and I will properly introduce myself during the boot camp, but when I started working at Bing we were gathering tens of terabytes of data on a daily basis.
So now think about this, that
many algorithms they may work.
They may be completely fine
working on small data sets.
But when you have big data sets, it suddenly becomes a big challenge.
So scalability is a big problem.
And that needs to be handled.
And the same machine learning algorithm might work on a small data set but not on a big data set.
Dimensionality,
usually what happens
is that if you remember
the IRS example where
we were predicting whether
somebody will commit
a fraud in taxes or not, we were looking at their marital status,
whether they were married
or not, or whether they
filed for a refund or not,
and what was their income.
So easy enough,
but in real world,
we don't consider
only three columns
or three features of the data.
And as you have more
and more features,
your again your machine
learning algorithms
they find it difficult
to actually come up
with a good predictive model.
So the number of features, or the number of different factors that you're considering for predicting something, also impacts you. That's a big challenge in data mining.
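One quick way to see this "curse of dimensionality" for yourself (my own illustration, not part of the webinar): as the number of features grows, distances between random points start to look alike, so notions like "nearest neighbor" lose their discriminating power. The helper below measures how much the farthest point differs from the nearest one among random points.

```python
import math
import random

def distance_spread(dim, n_points=200):
    """Ratio of (farthest - nearest) to nearest distance from the origin
    for random points in the unit cube [0, 1]^dim.

    In low dimensions the ratio is large (distances vary a lot);
    in high dimensions it shrinks (distances concentrate)."""
    dists = [
        math.sqrt(sum(random.random() ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

random.seed(1)
for dim in (2, 10, 1000):
    print(dim, round(distance_spread(dim), 2))
```

Running this shows the spread ratio collapsing as the dimension climbs from 2 to 1000, which is part of why models built on hundreds of features need much more data, or feature selection, to work well.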
And the third thing is complex
and heterogeneous data.
Not everybody has saved the data in the format that you like. You may be using some text data, some blogs that you're reading, some news that you're getting, some tweets.
Maybe some data is in
video, or audio format.
Some data is all numbers, some data is in a SQL database, some data is sitting in a NoSQL database. So how do you handle this complexity, this variety and heterogeneity of data?
Data quality: your values are missing, and you have to make sure that you are not taking bot traffic into your data.
Quality is a big factor.
And if you don't worry about the quality of your data, then whatever you learn from your data is going to be misleading.
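As one tiny, illustrative example of a data-quality step (mine, not from the slides): missing values are often filled in with a summary statistic of the observed values before a model ever sees the data. The function name and the income figures here are invented for the sketch; mean imputation is only one of many strategies.

```python
def impute_missing(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

incomes = [52_000, None, 61_000, 48_000, None, 55_000]
print(impute_missing(incomes))  # Nones replaced by the mean, 54000.0
```

Even this simple fix shows why quality matters: if the missing entries were mostly high earners, the imputed mean quietly biases anything learned downstream.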
Data ownership and distribution: sometimes you think that some data is going to be useful, but you don't own it. How do you get that data?
It could be some other
division or some other team
within your company, or
it could be completely
some other third party.
How do you-- how do you
gain access to that data?
Do you have to pay?
How much do you have to pay?
And if it is someone
within your own company,
how do you get
access to the data?
So these are some real challenges that you will face.
Privacy: what information can you or can you not use in the data? There are many pieces of information that may be very useful, but can you collect that information or not? Or if you are collecting it, can you use that information or not?
And reaction time: OK, you have an awesome machine learning algorithm that predicts fraud. But your company wants to detect frauds within five minutes of them happening, while your machine learning algorithm, the way it is set up, takes a day to come up with whether something is fraudulent or not. Reaction time is important, right?
So it is not just whether you can do it or not; it's how quickly you can do it.
So that's another
challenge, because now we
are not talking about one,
or two, or five megs of data.
We're talking about hundreds
of terabytes or maybe
possibly petabytes of data.
How do you process
that data and react
in a finite amount of time?
And there are many other domain-specific issues, and we will keep talking about these throughout all five days.
Any questions about
this before I go to--
I think the next slide
is the last slide?
And we'll be done.
Any questions?
OK great, OK.
So there is this concept
of five Vs of big data.
So you heard me saying velocity
and volume and all of that.
So what are five Vs of big data?
So volume: not too long ago, the biggest data sets might be gigabytes, if not hundreds of megabytes, and now we're talking about huge, huge data sets.
So we have terabytes
to exabytes of data
that we have to process.
And there is a
velocity aspect of it
that the data is in motion.
It is not standing still.
It is continuously
moving and arriving.
Think about it: you are predicting, say, stock prices based on many other factors. You are also incorporating the current stock prices and any social media activity around that stock, any Facebook or Twitter activity.
The data is coming in at
high velocity, high speed.
How do you-- how do you
ingest and handle that data?
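One common pattern for handling data in motion, sketched here as my own example rather than anything from the slides, is a sliding window: each new tick updates a summary over only the most recent values, so the result stays current without ever storing the full stream.

```python
from collections import deque

def sliding_average(stream, window=3):
    """Yield the average of the last `window` values as each new one arrives."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)

# Prices arriving one tick at a time; each output reflects only recent data.
ticks = [100.0, 101.0, 103.0, 102.0, 98.0]
print(list(sliding_average(ticks)))
```

Because the window is a generator over an incoming stream, the same code keeps up whether the ticks arrive once a minute or thousands of times a second, which is the essence of the velocity challenge.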
Variety: some of the data is structured, some of the data is text, some of it is audio, some of it is numbers and multimedia. There is a variety of different forms of data. How do you handle that?
And veracity of data: the data is in doubt. Lost clicks, automated traffic, bots, incompleteness, latency, some data that did not arrive in time. What do you do?
And the value of the data: is all the data equal or not? Some data may have more information, some may have less. Some may have more insights, some may have less.
So there are a lot of different challenges around big data, and we will actually spend a lot of time discussing all of these.
So the idea for today was to expose you to all of this and get you thinking before you show up, so that you have a good sense of what we are going to do.
And that's about it.
I think this is a--
that's all I have for today.
And if you have any
questions, I am more than
happy to answer the questions.
You have a few more webinars coming. If you can attend in person, please make sure you do attend them.
You will have the fundamentals
of data mining coming up,
introduction to R and Azure
machine learning studio
and Amazon Machine Learning.
All of them will happen.
Try to make it in person, but if you cannot, for whatever reason, please make sure you watch the videos.
And for some of those webinars, there are some quizzes as well. You're not required to do them, but we'll keep bugging you, we'll keep reminding you to do the quizzes.
That gives us a good understanding that you have reviewed the material before you come in, because it will be much more beneficial if you go through all the content before coming.
So I'm here to answer any
questions, please let me know.
And if not, I will just
hang up and the video
will be available by later
today on the learning portal.
OK so yes, we will actually send you the fifth webinar invite. I think the question is that the fifth webinar invite is not there. Yes, maybe many of you did not receive it, so I will have it sent to you shortly.
OK, thanks a lot, all of you.
It was nice having you and
then I will see you in person.
