Hello everyone and welcome to this
interesting session on data science
full course. So before we begin let's
have a quick look at the agenda of this
session. So first of all, I'll be starting off by explaining the evolution of data and how it led to the growth of data science, machine learning, AI, and all the different aspects of data.
Then we'll have a quick introduction to
data science, understand what exactly it
is then we'll move forward to the data
science careers and the salary and
understand what are the different job
profiles in the data science career path
how to become a data scientist data
analyst or a machine learning engineer.
Then we'll move on to the first and the
foremost part of data science which is
statistics and after completing statistics
we'll move on to machine learning where
we'll understand what exactly is machine
learning what are the different types of
machine learning and how are they used
and where are they used the different
algorithms and next we'll understand
what is deep learning and how deep
learning is different from machine
learning, what is the relationship
between AI, machine learning and deep
learning in terms of data science and
understand how exactly neural network
works, how to create a neural network and
much more. So let's begin our session now.
Data is increasingly shaping the systems
that we interact with every day, whether
you are searching something on Google
using Siri or browsing your Facebook
feed you are consuming the result of
data analysis. Data is growing at an alarming rate: we are generating 2.5 quintillion bytes of it every day. Now that's a lot of data, and considering there are more than 3 billion Internet users in the world (a quantity that has tripled in the last 12 years) and 4.3 billion cell phone users, that's a heck of a lot of data, and this rapid growth
has generated opportunity for new
professionals who can make sense out of
this data. Now, given its transformative ability, it's no wonder that so many data-related jobs have been created in the past few years, like data analysts, data scientists, machine learning engineers, artificial intelligence engineers and much more. And before we delve into the details of all of these different professionals, let's understand exactly what data science is. So data science, also known as data-driven science, is an interdisciplinary field that uses scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured. It is the study of where
information comes from what it
represents and how it can be turned into
a valuable resource in the creation of
business and IT strategies. So data
science employs many techniques and theories from fields like mathematics, statistics, information science as well as computer science, and it can be applied to small data sets too, yet most people think data science only comes in when you are dealing with big data or large amounts of data. So this brings up the question:
which job profile is suitable for you: is it the data analyst, the data scientist or the machine learning engineer? Now, data scientist has been called the sexiest job of the 21st century; nonetheless, data science is a hot and growing field. So before we drill into data science, let's discuss all of these profiles one by one and see what these roles are and how they work in the industry. A data science career usually starts with mathematics and stats as the base, which brings up the first profile in our data science career path, which is the data analyst. So a data analyst delivers value to companies by taking information about specific topics and then interpreting, analyzing and presenting the findings in comprehensive reports. Now
many different types of businesses use data analysts to help as experts; data analysts are often called on to use their skills and tools to provide competitive analysis and identify trends within the industry. Most entry-level professionals interested in going into data-related jobs start off as data analysts. Qualifying for this role is as simple as it gets: all you need is a bachelor's degree in computer science or mathematics and good statistical knowledge; strong technical skills would be a plus and can give you an edge over most other applicants. So next we have data
scientists. There are several definitions available for data scientists, but in simple words, a data scientist is one who practices the art of data science. The highly popular term data scientist was coined by DJ Patil and Jeff Hammerbacher. Data scientists are those who crack complex data problems with strong expertise in certain scientific disciplines; they work with several elements related to mathematics, statistics, computer science and much more. Now, data scientists are usually business analysts or data analysts with a difference: it is a position for specialists, and you can specialize in different types of skills like speech analytics, text analytics (which is natural language processing), image processing, video processing, medicine simulation, material simulation and so on. Now, each of these specialist roles is very limited in number and hence the value of such a specialist is immense.
Now if we talk about AI or machine learning engineers: machine learning engineers are sophisticated programmers who develop machines and systems that can learn and apply knowledge without specific direction. Artificial intelligence is the goal of a machine learning engineer. They are computer programmers, but their focus goes beyond specifically programming machines to perform specific tasks; they create programs that will enable machines to take actions without being specifically directed to perform those tasks. So now
if we have a look at the salary trends of all of these professionals: starting with a data analyst, the average salary in the US is around 83,000 dollars, almost close to eighty-four thousand dollars, whereas in India it's around four lakh four thousand rupees per annum. Now coming to the data scientist, the average salary in the US is around ninety-one and a half thousand dollars, and in India it is almost seven lakh rupees. And finally, for ML engineers, the average salary in the US is around one hundred and eleven thousand dollars, whereas in India it is around seven lakh twenty thousand rupees. So as you can see, the data scientist and ML engineer positions are senior positions which require a certain degree of expertise in the field, and that's the reason why there is a difference in the salary of all the three professionals. So if you have a look at the roadmap of becoming any one of these professionals: first, one needs to earn a bachelor's degree.
now this bachelor's degree can be in
either computer science mathematics
information technology statistics
finance or even economics. Now, after completing a bachelor's degree, next comes fine-tuning the technical skills. Tuning the technical skills is one of the most important parts of the roadmap, where you learn all the statistical methods and packages; you learn languages like R, Python or SAS, which are very important; you learn about data warehousing, business intelligence, data cleaning, visualization and reporting techniques; working knowledge of Hadoop and MapReduce is very, very important; and if we talk about machine learning techniques, they are one of the most important parts of the data science career. Now, apart from these technical skills, there are also some business skills which are very much required, so
this involves analytical problem-solving
effective communication creative
thinking as well as industry knowledge
now after fine-tuning your technical
skills and developing all the business
skills you have the options of either
going for a job or going for a master's degree or certification programs. Now, I would suggest you go for a master's degree, as just coming out of the B.Tech world and having the technical skills is not enough; you need to have a certain level of expertise in the field, so it's better to go for a master's or PhD program in computer science, statistics or machine learning. You can also go for big data certifications, and you can also go for industry certifications regarding data analysis, machine learning or data science. It so happens that Edureka also provides machine learning, data analysis as well as data science certification training; they have master's programs which are equivalent to a master's degree which you get from a certain university, so do check it out guys, I'll leave the link to all of these in the description box below. And after you have completed the master's degree, what comes next is working on projects which are related to this field, so it's better if you work on machine learning, deep learning or data analytics projects; that will give you an edge over other competitors while applying for a job, as a certain level of expertise in the field is also required, and this is how you will succeed in the data science career path. There are certain
skills which are required which I was
talking about earlier the technical
skills and the non technical skills now
if you talk about the skills which are
required to become all of these
professions so they are mostly the same
so for any data analyst first of all you
need to have analytical skills which
involves Maths having good knowledge of
matrix multiplications the Fourier
transformations and so on. Next we have communication skills: companies looking for a data analyst require someone who has good communication skills, who can explain all of the technical terms to non-technical teams such as the marketing or sales teams. Another important skill required is critical thinking: you need to think in certain directions and gain insights from the data, so that's one of the most important parts of a data analyst's job. Obviously you need to pay attention to the details, because even a minor shift or deviation in the result, the calculation or the analysis might result in some sort of loss for the company; it won't necessarily create a loss, but it's better to avoid any kind of deviation in the results, so paying attention to detail is very, very important. And then again we talk about the mathematical skills: knowing about all the types of differentiation and integration is going to help a lot because, you know, a lot of machine learning algorithms, as I would say, are mostly mathematical terms or mathematical functions, so having good knowledge of mathematics is also required. Apart from this, the usage of technical tools such as Python, R and SAS: you need to know about the big data ecosystem, how it works, the HDFS, how to extract data and create a pipeline, and you should know a little JavaScript. And
if you talk about the skills of a data scientist, it's almost the same: having analytical and statistics knowledge. Now, another important part here is to know the machine learning algorithms, as they play an important role in the data science career, and problem-solving skills, obviously. Now, another important aspect, if you talk about the skill which differs from that of a data analyst, is mainly deep learning. I'll talk about deep learning later, in the second half or the later part of the video, but having a good knowledge of deep learning and the various frameworks such as TensorFlow, PyTorch and Theano is very much required for data scientists. And again, business communication, as I mentioned earlier, is very much required, because as you know these are some of the most technical roles in the industry, yet the output of these roles, or what I would say the output of what these professionals do, is not that technical, it is more business oriented, so they have to explain all of these findings to the non-technical teams, the sales, the marketing; and again you need the technical tools and the skills. Now, for a machine learning engineer, obviously programming languages: having good knowledge of R, Python, C++ or Java is very much required; you need to know about calculus and statistics, as I mentioned earlier, learning about matrices and integration. Now another
important skill here is signal
processing so a lot of times machine
learning engineers have to work on
robots and signal processing they work
on human-like robots they work on
robotics which mimic human behavior so a
lot of signal processing techniques are
also required in this field applied
mathematics as I mentioned earlier and
again neural networks: they are one of the foundations of artificial intelligence being used, and again we have natural language processing. So as you know, we
have personal assistants like Siri and
Cortana and they work on language
processing and not just language
processing you have audio processing as
well as video processing so that they
can interact with a real environment and
provide a certain answer to a particular
question so these were the skills I
would say for all of these three roles
Next, if we have a look at the peripherals of data science: first of all we have statistics, then, needless to say, there are programming languages, then we have machine learning, which is a big part of data science, and then again we have big data. So let's start with statistics, which is the first area of data science, or I should say the first milestone which we should cover. So for statistics, let's understand first what exactly data is.
so data in general terms refers to facts
and statistics collected together for
reference or analysis. When working with statistics it's important to recognize the different types of data, so data can be broadly classified into numerical, categorical and ordinal. Now, data with no inherent order or ranking, such as gender or race, is called nominal data; as you can see in type 1 we have male, female, male, female, and that is nominal data. Now, data with an ordered series is called ordinal data; as you can see here we have an ordered series where we have the customer IDs and the rating scale. Now, data with only two options is called binary data; in this type of data there are only two options, like either yes or no, true or false, or 1 or 0. So as you can see here we have
customer ID and in the owner or car
column we have either yes or no now the
types of data we just discussed, which broadly describe the quality of something, its size, appearance, value and so on, are classified as qualitative data. Now, data which can be categorized into a classification, which is based upon counts, where there is only a finite number of values possible and the values cannot be subdivided meaningfully, is called discrete data. So as you can see here in our example, we have the organization and the number of products; the number of products cannot be subdivided into sub-products, right? And if we talk about data which can be measured on a continuum or a scale, data which can have almost any numeric value and can be subdivided into finer and finer increments, that is called continuous data. So as you can see here, for a patient ID we have the weight of the patient, which is 6.5 kgs; now kgs can be subdivided into grams and milligrams, and finer refinement is also possible. This type of data, which can be measured by the quantity of something rather than its quality, is called quantitative data.
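Just to make these data types a little more concrete, here is a minimal illustrative sketch in Python using pandas; the values and column names are made up for this example:

import pandas as pd

# Nominal data: categories with no inherent order or ranking
gender = pd.Series(["male", "female", "male", "female"], dtype="category")

# Ordinal data: categories with a meaningful order, like a rating scale
rating = pd.Series(pd.Categorical(["low", "high", "medium"],
                                  categories=["low", "medium", "high"],
                                  ordered=True))

# Binary data: only two options, yes/no encoded as 1/0
owns_car = pd.Series([1, 0, 1, 1])

# Discrete quantitative data: counts that cannot be meaningfully subdivided
num_products = pd.Series([12, 7, 30])

# Continuous quantitative data: measurable on a scale, can be subdivided further
weight_kg = pd.Series([6.5, 72.3, 80.1])

print(rating.cat.ordered)   # True, because the rating scale has an order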
Now that we are done with the different types of data, qualitative and quantitative, it's time to understand the types of variables we have. Now, there are
majorly two types of variables dependent
and independent variables so if you want
to know whether caffeine affects your
appetite the presence or the absence of
the amount of
caffeine would be the independent
variable and how hungry you are would be
the dependent variable. So in statistics, the dependent variable is the outcome of an experiment: as you change the independent variable, you watch what happens to the dependent variable. Whereas if we talk about the independent variable, it is a variable that is not affected by anything that you, the researcher, do, and it is usually plotted on the x-axis.
now the next step after knowing about
the datatypes and the variables is to
know about population and sampling and
that comes into experimental research
now in experimental research the aim is
to manipulate an independent variable
and then examine the effect that this
change has on a dependent variable now
since it is possible to manipulate the
independent variable experimental
research has the advantage of enabling a
researcher to identify a cause and
effect between the variables well
suppose there are 100 volunteers at the
hospital and a doctor needs to check the
working of a particular medicine which
has been cleared by the government so
the doctor divides those hundred patients into two groups of 50, asks one group to take the medicine and the other group not to take any medicine at all, and then, after some time, compares the results. And in non-experimental research, the researcher does not manipulate the independent variable; this is not to say that it is impossible to do so, but it will either be impractical or unethical to do so. So for example, a researcher may be interested in the effect of illegal recreational drug use (which is the independent variable) on certain types of behavior (which is the dependent variable); however, while it may be possible, it would be unethical to ask an individual to take illegal drugs in order to study what effects this has on certain behaviors. It is always good to go for experimental
research rather than non experimental
research so next in our session we have
population and sampling those are two of
the most important terms in statistics
so let's understand these terms. In statistics, the term population refers to the entire pool from which a sample is drawn; statisticians also speak of a population of objects, events, procedures or observations, including such things as the number of vehicles owned by any given person. A population is thus an aggregate of creatures, things, cases and so on, and a population commonly contains too many individuals to study conveniently, so an investigation is often restricted to one or more samples drawn from it. Now, a well-chosen sample will contain most of the information about a particular population parameter, but the relationship between the sample and the population must be such as to allow true inferences to be made about the population from that sample. For that, we have
different types of sampling techniques
and sampling methods are classified either as probability or non-probability. In probability sampling, each member of the population has a known non-zero probability of being selected; probability methods include random sampling, systematic sampling and stratified sampling. Whereas in non-probability sampling, members are selected from a population in some non-random manner; these include convenience sampling, judgement sampling, quota sampling and snowball sampling. While sampling is important, there is another term, which is known as sampling error. Sampling error is the degree to which a sample might differ from the population; when inferring to a population, results are reported plus or minus the sampling error. Now, in probability sampling there are three techniques, which are random sampling, systematic sampling and stratified sampling. Talking about random sampling: when each member of the population has an equal chance of being selected, such sampling is random sampling. Now, if we talk about systematic sampling, it is often used instead of random sampling, and it is also called the Nth name selection technique; pay attention to the name, Nth name. After the required sample size has been calculated, every Nth record is selected from the list of population members; its only advantage over the random sampling technique is its simplicity. Now, the final type of sampling is stratified sampling. A stratum is a subset of the population that shares at least one common characteristic; the researcher first identifies the relevant stratums and their actual representation in the population before analysis.
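As a quick illustration of these techniques, here is a minimal sketch in Python using pandas; the population, its size and the group labels are all invented for the example:

import pandas as pd

# A toy population of 1000 people, each belonging to one of two strata
population = pd.DataFrame({
    "person_id": range(1000),
    "group": ["A"] * 600 + ["B"] * 400,
})

# Random sampling: every member has an equal chance of being selected
random_sample = population.sample(n=100, random_state=42)

# Systematic sampling: every Nth record from the list of population members
n = len(population) // 100
systematic_sample = population.iloc[::n]

# Stratified sampling: sample each stratum in proportion to its share of the population
stratified_sample = (population.groupby("group", group_keys=False)
                     .apply(lambda g: g.sample(frac=0.1, random_state=42)))

print(len(random_sample), len(systematic_sample), len(stratified_sample))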
So now that we know what our data looks like and what kind of sampling is done, let's have a look at the measures of center, which help describe to what extent a pattern holds for a specific numerical value. As you can see, in measures of center we have three terms, which are the mean, median and mode, and I'm sure everyone must be aware of all of these terms, so I'll not get into their details; what's more important is to know about the measures of spread. Now, a measure of spread, sometimes called a measure of dispersion, is used to describe the variability in a sample or population; it is usually used in conjunction with a measure of central tendency, such as the mean or median, to provide an overall description of a set of data. Now, if we talk about deviation,
it is the difference between each x_i and the mean for a sample or population, which is known as the deviation about the mean. Whereas variance is based on the deviations and entails computing the squares of the deviations: as you can see here, the formula for the variance is the summation of the squared differences between each data point and the mean, divided by the total number of data points, that is, variance = Σ(x_i − mean)² / N. The standard deviation is basically the square root of the variance; as you can see, the formula is the same, we just take the square root of the variance. So that was standard deviation and variance.
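As a quick sanity check of those formulas, here is a minimal NumPy sketch; the numbers are arbitrary:

import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0])

mean = data.mean()
deviations = data - mean                # deviation of each point about the mean
variance = np.mean(deviations ** 2)     # variance: sum of squared deviations divided by N
std_dev = np.sqrt(variance)             # standard deviation: square root of the variance

print(mean, variance, std_dev)
print(np.var(data), np.std(data))       # NumPy's built-ins give the same values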
Another topic in probability and statistics is skewness. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. So as you can see here, we have left-skewed, symmetric and right-skewed distributions; normally distributed curves are the most symmetric curves, and we'll talk about the normal distribution later.
So after skewness, what we need to know about is the confusion matrix. A confusion matrix is a tabular representation of actual versus predicted values, and it helps us find the accuracy of a model: when we are creating any machine learning or deep learning model, to find the accuracy what we do is plot a confusion matrix. You can calculate the accuracy of your model by adding the true positives and the true negatives and dividing that by the sum of true positives, true negatives, false positives and false negatives, that is, accuracy = (TP + TN) / (TP + TN + FP + FN); that will give you the accuracy of the model. So as you can see in the image, we have good and bad for the predicted as well as the actual values, and the true positive (d) and the true negative (a) are the two cells where the prediction matches reality: in the true positive cell we predicted good and the actual value was good, and in the true negative cell we predicted bad and the actual value was bad. So models which get higher true positives and true negatives are the ones which have the higher accuracy, and that's what confusion matrices are for.
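To make that accuracy calculation concrete, here is a minimal sketch using scikit-learn; the actual and predicted labels are invented for the example:

from sklearn.metrics import confusion_matrix, accuracy_score

actual    = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
predicted = ["good", "bad",  "bad", "bad", "good", "good", "good", "bad"]

# With labels=["bad", "good"], "good" plays the role of the positive class
cm = confusion_matrix(actual, predicted, labels=["bad", "good"])
tn, fp, fn, tp = cm.ravel()

# accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm)
print(accuracy, accuracy_score(actual, predicted))   # both give the same value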
Now, the next term, and a very important term in statistics, is probability. Probability is the measure of how likely something is to occur; it is the ratio of desired outcomes to the total outcomes. Now, if I roll a die, there are six total possibilities: one, two, three, four, five and six. Each possibility has one outcome, so each has a probability of one out of six; for instance, the probability of getting the number two is one out of six, since there is only a single two on the die. Now, when talking about probability distribution techniques or terminologies, there are three important terms, which are the probability density function, the normal distribution and the central limit theorem. The probability density function is the equation describing a continuous probability distribution, and it is usually referred to as the PDF. Now, if we talk about the normal
as PDF now if we talk about normal
distribution so the normal distribution
is a probability distribution that
associates the normal random variable X
with a cumulative probability the normal
distribution is defined by the following
equation so as you can see here Y is 1
by Sigma into the square root of 2 pi 2
whole multiplied by E raised to power
minus X minus mu whole square divided by
2 Sigma square where X is a random
normal variable mu is the mean and Sigma
is the standard deviation now the
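As a quick illustration of that equation, here is a minimal sketch that evaluates the normal PDF directly and checks it against scipy.stats; the mean and standard deviation are arbitrary choices:

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-4, 4, 9)

# y = (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)**2 / (2 * sigma**2))
y_manual = (1.0 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
y_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(y_manual, y_scipy))   # True: both evaluate the same density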
Now, the central limit theorem states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough. The accuracy, or the resemblance to the normal distribution, depends on two factors: the first is the number of sample points taken, and the second is the shape of the underlying population.
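Here is a minimal simulation sketch of that idea: even though the population below is uniform (far from normal), the distribution of sample means quickly looks normal; the sample size and the number of samples are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# A decidedly non-normal population: uniform between 0 and 1
population = rng.uniform(0, 1, size=100_000)

# Draw many samples and record the mean of each one
sample_size = 50
sample_means = [rng.choice(population, size=sample_size).mean() for _ in range(2000)]

# The sample means cluster around the population mean (0.5) with a spread
# close to sigma / sqrt(n), as the central limit theorem predicts
print(np.mean(sample_means), np.std(sample_means), population.std() / np.sqrt(sample_size))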
Now, enough about statistics. If you want to know more about statistics and get more in-depth knowledge, you can refer to our Statistics for Data Science video; I'll leave the link to that video in the description box. That video talks about statistics and probability in more depth than I explained here, and it covers p-values, hypothesis testing, and everything else required for any data science project. So let's move on to the next part of our data science learning path, which is machine learning. So let's
understand what exactly is machine
learning so machine learning is an
application of artificial intelligence
that provides systems the ability to
automatically learn and improve from
experience without being explicitly
programmed. It is about getting computers to program themselves and teaching them to make decisions using data: where writing software is a bottleneck, let the data do the work instead. Now, machine learning is a class of algorithms which is data-driven, that is, unlike normal algorithms, it is the data that tells us what the good answer is. So if we have a look
at the various features of machine
learning so first of all it uses the
data to detect patterns in a data set
and adjust the program actions
accordingly it focuses on the
development of computer programs that
can teach themselves to grow and change
when exposed to new data so it's not
just the old data on which it has been trained; whenever new data is entered, the program changes accordingly.
it enables computers to find hidden
insights using iterative algorithms
without being explicitly programmed
either so machine learning is a method
of data analysis that automates
analytical model building. Now let's understand how exactly it works. If we have a look at the diagram given here, we have traditional programming on one side and machine learning on the other. In traditional programming, what we used to do was provide the data and provide the program, and the computer would generate the output. Things have changed now: in machine learning, what we do is provide the data and provide the expected output to the machine. What the machine does is learn from the data, find hidden insights and create a model; it then takes the output data again, reiterates, trains and grows accordingly, so that the model gets better every time it is trained with new data or new output. So the first
and the foremost application of machine
learning in the industry I would like to
get your attention towards is the
navigation or the Google Maps so Google
Maps is probably the app we use whenever
we go out and require assistance with directions and traffic, right? The other day I was traveling to another city and took the expressway, and the map suggested that, despite the heavy traffic, I was on the fastest route. But how does it know that? Well, it's a combination of people currently using the service, the historic data of that route collected over time, and a few tricks acquired from other companies. Everyone using Maps is providing their location, their average speed and the route in which they are traveling, which in turn helps Google collect massive data about the traffic, with which it can estimate the upcoming traffic and adjust your route accordingly, which is pretty amazing, right? Now coming to the second
application, which is social media. If we talk about Facebook, one of the most common applications is the automatic friend tagging suggestion in Facebook, and I'm sure you might have seen this; it's present in all the other social media platforms as well. Facebook uses face detection and image recognition to automatically find the face of the person which matches its database, and hence it suggests that we tag that person, based on DeepFace. Now, DeepFace is Facebook's machine learning project which is responsible for the recognition of faces and for identifying which person is in the picture, and it also provides alternative (alt) tags to the images already uploaded on Facebook. So for example, if we have a look at this image and we inspect the following image on Facebook, we get the alt tag which has a particular description; in our case, what we get here is: the image may contain sky, grass, outdoor and nature.
Now, transportation and commuting is another industry where machine learning is used heavily. If you have used an app to book a cab recently, then you are already using machine learning to an extent. What happens is that it provides a personalized experience which is unique to you: it automatically detects your location and provides the option to go home, to the office, or to any other frequent place, based on your history and patterns. It uses machine learning algorithms layered on top of historic trip data to make more accurate ETA predictions; in fact, Uber, with the implementation of machine learning on their app and website, saw a 26 percent improvement in accuracy for delivery and pickup, and that's a huge gain. Now coming to the
virtual personal assistant. As the name suggests, a virtual personal assistant assists in finding useful information when asked via voice or text. Here you have the major applications of machine learning: speech recognition, speech-to-text conversion, natural language processing and text-to-speech conversion. All you need to do is ask a simple question like "what is my schedule for tomorrow" or "show my upcoming flights"; for answering, your personal assistant searches for information or recalls your related queries to collect the information. Recently, personal assistants are being used in chatbots, which are being implemented in various food ordering apps, online training websites and also in commuting apps. Next, we have product recommendation: this is one of the areas where machine learning is absolutely necessary, and it was one of the few areas from which the need for machine learning emerged. Now, suppose you check
an item on Amazon but you do not buy it
then and there but the next day you are
watching videos on YouTube and suddenly
you see an ad for the same item you
switch to Facebook there also you see
the same ad, and again you go back to any other site and you see the ad for the same sort of item. So how does this happen? Well, this happens because Google tracks your search history and recommends ads based on it. This is one of the coolest applications of machine learning, and in fact 35% of Amazon's revenue is generated by product recommendations. Now coming to the cool and highly
technological side of machine learning
we have self-driving cars. If we talk about the self-driving car, it's here, and people are already using it. Machine learning plays a very important role in self-driving cars, and I'm sure you guys might have heard about Tesla, the leader in this business; their current artificial intelligence is driven by hardware from the manufacturer NVIDIA, and it is based on an unsupervised learning algorithm, which is a type of machine learning algorithm. Now, NVIDIA stated that they did not train their model to detect people or any objects as such; the model works on deep learning and crowd-sources its data from the other vehicles and drivers. It uses a lot of sensors, which are a part of IoT, and according to the data gathered by McKinsey, automotive data will hold a tremendous value of 750 billion dollars; now that's a lot of dollars we are talking about. Next,
again we have Google Translate now
remember the time when you traveled to a new place and found it difficult to communicate with the locals, or to find local spots where everything is written in a different language? Well, those days are gone. Google's GNMT, which is the Google Neural Machine Translation system, is a neural machine translation model that works across thousands of languages and dictionaries; it uses natural language processing to provide the most accurate translation of any sentence or word, and since the tone of the words also matters, it uses other techniques like POS tagging, named entity recognition and chunking. It is one of the most used applications of machine learning. Now, if we talk about dynamic
pricing: setting the right price for a good or a service is an old problem in economic theory. There is a vast number of pricing strategies that depend on the objective sought; be it a movie ticket, a plane ticket or a cab, everything is dynamically priced. In recent years, machine learning has enabled pricing solutions to track buying trends and determine more competitive product prices. Now, if we talk about Uber, how does Uber determine the price of your ride? Uber's biggest use of machine learning comes in the form of surge pricing, with a machine learning model named Geosurge: if you are getting late for a meeting and you need to book an Uber in a crowded area, get ready to pay twice the normal fare. Even for flights, if you're traveling in the festive season, the chances are that prices will be twice as much as the original price. Now coming to the final
application of machine learning we have
is the online video streaming we have
Netflix Hulu and Amazon Prime video now
here I'm going to explain the application using the Netflix example. With over 100 million subscribers, there is no doubt that Netflix is the daddy of the online streaming world, and Netflix's rise has all the movie industrialists taken aback, forcing them to ask how on earth could one single website take on Hollywood. The answer is machine learning. The Netflix algorithm constantly gathers massive amounts of data about user activities, like: when you pause, rewind or fast-forward; when you watch the content (TV shows on weekdays, movies on weekends); the date you watch; the time you watch; when you pause and leave a piece of content (so that if you ever come back, they can suggest the same video); the rating events, which are about four million per day; the searches, which are about three million per day; the browsing and scrolling behavior; and a lot more. Now, they collect this data for each subscriber they have and use a recommender system and a lot of machine learning applications, and that
is why they have such a huge customer
retention rate so I hope these
applications are enough for you to
understand how exactly machine learning
is changing the way we interact with society and how fast it is affecting the world in which we live. If you have a look at the market trend of machine learning here, you can see that initially it wasn't much in the market, but if you look at the 2016 side there was enormous growth in machine learning, and this happened mostly because, you know, earlier we had the idea of machine learning but we did not have that amount of big data. As you can see, the red line we have here in the histogram, or the bar plot, is that of big data; big data also increased over the years, which led to an increase in the amount of data generated, and recently we got the power, or I should say the underlying technology and the hardware, to support machine learning programs that can work on this big data. That is why you see a very steep rise around the 2016 period as compared to 2012: because during 2016 we got new hardware, and we were able to find insights using that hardware and create models which would work on heavy data. Now let's have a look
at the life cycle of machine learning. A typical machine learning life cycle has six steps: the first step is collecting data, the second is data wrangling, then we have the third step where we analyze the data, the fourth step where we train the algorithm, the fifth step where we test the algorithm, and the sixth step where we deploy that particular algorithm for industrial use. So when we talk about the first step, which is collecting data: here data is collected from various sources, and this stage involves the collection of all the relevant data needed. Now, if we talk about data
wrangling: data wrangling is the process of cleaning and converting raw data into a format that allows convenient consumption. Now, this is a very important part of the machine learning lifecycle, as it's not every time that we receive data which is clean and in a proper format: sometimes values are missing, sometimes there are wrong values, sometimes the data format is different, so a major part of the machine learning lifecycle goes into data wrangling and data cleaning.
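As a small illustration of what that cleaning step can look like, here is a minimal pandas sketch; the column names, the values and the cleaning choices are all invented for the example:

import numpy as np
import pandas as pd

# Invented raw data with the usual problems: missing values, wrong values, mixed formats
raw = pd.DataFrame({
    "age": [25, np.nan, 130, 41],                    # a missing value and an impossible value
    "salary": ["50000", "62,000", None, "58000"],    # numbers stored as text, one with a comma
})

clean = raw.copy()
clean["age"] = clean["age"].where(clean["age"].between(0, 100))    # drop impossible ages
clean["age"] = clean["age"].fillna(clean["age"].median())          # fill missing ages
clean["salary"] = (clean["salary"].str.replace(",", "", regex=False)
                   .astype(float))                                 # fix the text format
clean["salary"] = clean["salary"].fillna(clean["salary"].median()) # fill the missing salary

print(clean)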
If we talk about the next step, which is data analysis: the data is analyzed to select and filter what is required to prepare the model, and in this step we take the data and use machine learning algorithms to create a particular model. Next, when we have a model, what we do is train the model: here we use the data sets, and the algorithm is trained on the training data set, through which the algorithm understands the patterns and the rules which govern the particular data. Once we have trained the algorithm, next comes testing: the testing data set determines the accuracy of our model.
So what we do is provide the test dataset to the model, which tells us the accuracy of the particular model, whether it's 60%, 70% or 80%, depending upon the requirement of the company. And finally, we have operation and optimization: if the speed and accuracy of the model are acceptable, then that model should be deployed in the real system. The model that is used in production should be made with all the available data, as models improve with the amount of data used to create them, and the results of the model need to be incorporated into the business strategy. Now, after the model is deployed, based upon its performance the model is updated and improved; if there is a dip in performance, the model is retrained. All of this happens in the operation and optimization stage. Now,
before we move forward: since machine learning is mostly done in Python and R, if we have a look at the difference between Python and R, I'm pretty sure most people would go for Python, and the major reason why people go for Python is that Python has a larger number of libraries, and Python is used for much more than just data analysis and machine learning. So here are some of the important Python libraries I want to discuss. First of all, I'll talk about matplotlib: what matplotlib does is enable you to make bar charts, scatter plots, line charts and histograms; basically, it helps with the visualization aspect. As data analysts and machine learning engineers, one needs to represent the data in such a format that it can be understood by non-technical people, such as people from marketing, sales and other departments as well. Another important Python library here is Seaborn, which is focused on the visualization of statistical models, which includes heat maps that depict overall distributions; sometimes people work on data which is more geographically aligned, and I would say in those cases heat maps are very much required. Next we come to scikit-learn.
Scikit-learn is one of the most famous libraries of Python, I would say: it's simple and efficient for data mining and data analysis, it is built on NumPy, SciPy and matplotlib, and it is open source. Next on our list we have pandas: it is the perfect tool for data wrangling, designed for quick and easy data manipulation, aggregation and visualization. And finally we have NumPy. NumPy stands for Numerical Python, and it provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python; it is mostly used for mathematical purposes, which is a plus point for any machine learning algorithm. So these were the important Python libraries which one must know in order to do any Python programming for machine learning; or rather, if you are doing Python programming at all, you need to know about all of these libraries.
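To tie these libraries together, here is a minimal illustrative sketch that touches each of them; the data is randomly generated just for the example:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# NumPy for the numerical arrays, pandas for the tabular data wrangling
x = np.linspace(0, 10, 50)
df = pd.DataFrame({"x": x, "y": 2 * x + np.random.normal(0, 1, 50)})

# scikit-learn to fit a simple model on the data
model = LinearRegression().fit(df[["x"]], df["y"])
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# matplotlib for the scatter plot, Seaborn for a quick look at the distribution
plt.figure()
plt.scatter(df["x"], df["y"], s=10)
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.figure()
sns.histplot(df["y"])
plt.show()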
So guys, next what we are going to discuss are the types of machine learning. We have three types of machine learning, which are supervised, unsupervised and reinforcement learning. If we talk about supervised machine learning: supervised learning is where you have the input variable X and the output variable Y, and you use an algorithm to learn the mapping function from the input to the output. So if we take the
case of object detection here, or rather face detection: first of all, what we do is input the raw data in the form of labelled faces, and again, it's not necessary that we input only faces to train the model; what we do is input a mixture of face and non-face images. So as you can see here, we have labelled faces and labelled non-faces; we provide the data to the algorithm, and the algorithm creates a model. It uses the training dataset to understand what exactly is a face and what exactly is a picture which is not a face, and after the model is done with the training and processing, to test it what we do is provide a particular input of a face or a non-face. Now see, the major part of supervised learning here is that we exactly know the output: when we are providing a face, we ourselves know that it's a face, so to test that particular model and get the accuracy we use the labelled input raw data. So next, when we talk about
unsupervised learning: unsupervised learning is the training of a model using information that is neither classified nor labeled. This model can be used to cluster the input data into classes on the basis of their statistical properties. For example, for a basket full of vegetables, we can cluster different vegetables based upon their color or size. If we have a look at this particular example here, what we are doing is inputting the raw data, which can be apples, bananas or mangoes; what we don't have here, which was previously there in supervised learning, are the labels. So what the algorithm does is pick up the features of a particular set of data and make clusters: it will make a cluster of red-looking fruits, which are the apples, and yellow-looking fruits, which are the bananas, and based upon the shape it also determines what exactly the fruit is and categorizes it as mango, banana or apple. So this is unsupervised learning.
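As a small illustration of this kind of clustering, here is a minimal sketch using scikit-learn's KMeans; the "redness" and "size" features for the fruits are invented, and no labels are given to the algorithm:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled features: [redness, size in cm] for a mix of apples, bananas and mangoes
fruits = np.array([
    [0.90, 7.0], [0.85, 7.5], [0.95, 6.8],     # red-ish and round: probably apples
    [0.10, 18.0], [0.15, 19.5], [0.05, 20.0],  # yellow and long: probably bananas
    [0.50, 10.0], [0.55, 11.0], [0.45, 9.5],   # in between: probably mangoes
])

# Ask for three clusters; the algorithm groups the points purely by their features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(fruits)
print(kmeans.labels_)           # the cluster assigned to each fruit
print(kmeans.cluster_centers_)  # the center of each discovered cluster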
Now, the third type of learning we have here is reinforcement learning. Reinforcement learning is learning by interacting with a space or an environment: it selects an action on the basis of its past experience, its exploration, and also by new choices; a reinforcement learning agent learns from the consequences of its actions rather than from being taught explicitly. If we have a look at the example here, the input data goes to the training, goes to the agent, where the agent selects the algorithm, takes the best action in the environment, gets the reward, and the model is trained. So if you provide a picture of a green apple, although the apples it particularly knows are red, what it will do is try to get an answer, and with the past experience it has, it will refine the algorithm and then finally provide an output which is according to our requirements. So now, these were the
major types of machine learning
algorithms. Next, what we'll do is dig deep into each of these types of machine learning one by one. So let's get started with supervised learning first and understand what exactly supervised learning is, what the different algorithms inside it are, how they work, and we'll have a look at various algorithm demos which will make you understand it in a much better way. So let's go ahead and understand what exactly supervised learning is. Supervised learning is where you have the input variable X and the output variable Y, and you use an algorithm to learn the mapping function from the input to the output, as I mentioned earlier with the example of face detection. It is called supervised learning because the process of an algorithm learning from the training data set can be thought of as a teacher supervising the learning process. So if we have a
look at the supervised learning steps, or I would rather say the workflow: as you can see here, we have the historic data, then we have random sampling, and we split the data into a training data set and a testing data set. Using the training data set, with the help of machine learning (which here is supervised machine learning), we create a statistical model. Now, after we have a model which has been generated with the help of the training data set, what we do is use the testing data set for prediction and testing, we get the output, and finally we have the model validation outcome; that is the training and testing phase. If we have a look at the prediction part of any particular supervised learning algorithm, the model is used for predicting the outcome of a new data set, and whenever the performance of the model degrades, or if there are any performance issues, the model is retrained with the help of new data. Now, when we talk about
supervised learning, there is not just one but quite a few algorithms here: we have linear regression, logistic regression, decision trees, we have random forest, and we have Naive Bayes classifiers. So linear regression is used to estimate real values, for example the cost of houses, the number of cars, the total sales, based on continuous variables; that is what linear regression is. Now, when we talk about logistic regression, it is used to estimate discrete values, for example binary values like zero and one, yes or no, true and false, based on a given set of independent variables. For example, if we are talking about something like the chance of winning, which can be true or false, or whether it will rain today, which can be yes or no; so whenever the output of a particular question is either yes/no or binary, then only do we use logistic regression. Next we have decision trees: these are used for classification problems, and they work for both categorical and continuous dependent variables. If we talk about random forest, a random forest is an ensemble of decision trees; it gives better prediction and accuracy than a single decision tree. That is another type of supervised learning algorithm, and finally we have the Naive Bayes classifier: it is a classification technique based on Bayes' theorem with an assumption of independence between predictors. We'll get more into the details of all of these algorithms one by one, so let's get
started with linear regression. First of all, let us understand what exactly linear regression is. Linear regression analysis is a powerful technique for predicting the unknown value of a variable (the dependent variable) from the known value of another variable (the independent variable). A dependent variable is the variable to be predicted or explained in a regression model, whereas an independent variable is a variable related to the dependent variable in a regression equation. If you have a look here at simple linear regression, it's basically equivalent to a simple line with a slope, which is y = a + bx, where y is the dependent variable, a is the y-intercept, b is the slope of the line, and x is the independent variable. The intercept is the value of the dependent variable y when the value of the independent variable x is 0; it is the point where the line cuts the y-axis. Whereas the slope is the change in the dependent variable for a unit increase in the independent variable; it is the tangent of the angle made by the line with the x-axis.
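As a tiny illustration of y = a + bx, here is a minimal NumPy sketch that fits a slope and an intercept to a few made-up points:

import numpy as np

# Made-up data points that roughly follow y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# np.polyfit with degree 1 returns the best-fit [slope, intercept]
b, a = np.polyfit(x, y, 1)
print("intercept a:", a, "slope b:", b)

# Predict y for a new x using y = a + b*x
print("prediction at x = 5:", a + b * 5)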
Now, when we talk about the relation between variables, we have a particular term which is known as correlation. Correlation is an important factor to check the dependencies when there are multiple variables: it gives us an insight into the mutual relationship among variables, and it is used for creating a correlation plot with the help of the Seaborn library, which I mentioned earlier, one of the most important libraries in Python. So correlation is a very important term to know about. Now, if we talk about
regression lines so linear regression
analysis is a powerful technique used
for predicting the unknown value of a
variable which is the dependent variable
from the regression line which is simply
a single line that best fits the data in
terms of having the smallest overall
distance from the line to the points so
as you can see in the plot here we have
the different points or the data points
so these are known as the fitted points
then again we have the regression line
which has the smallest overall distance
from the line to the points. If you have a look at the distance between a point and the regression line, what this distance shows is the deviation from the regression line, that is, exactly how far the point is from the regression line. So
let's understand a simple use case of
linear regression with the help of a
demo so first of all there is a real
estate company use case which I'm going to talk about. Here we have John: he needs some baseline for pricing the villas and independent houses he has in Boston. Here we have the description of the data set we're going to use. This data set has different columns, such as the per capita crime rate (CRIM), the proportion of residential land zoned for large lots (ZN), the proportion of non-retail business acres (INDUS), the Charles River variable (CHAS), the nitric oxide concentration (NOX), the average number of rooms (RM), the proportion of owner-occupied units built prior to 1940 (AGE), the distance to five Boston employment centers (DIS), the index of accessibility to radial highways (RAD), and much more. So first of all, let's have a look at the data set we have here.
Now, one important thing to note here, guys, is that I'm going to be using Jupyter Notebook to execute all my practicals; you are free to use the Spyder IDE or the console, it basically comes down to your preference, but for my preference I'm going to use the Jupyter Notebook. For this use case we're going to use the Boston housing data set. As you can see here, we have the data set which has CRIM, ZN, INDUS, CHAS, NOX and the other variables, and we have data for, I would say, almost 500 houses. What John needs to do is plan the pricing of the houses depending upon all of these different variables, so that it's profitable for him to sell a house and it's easier for the customers to buy the house. So first of all, let me open the code here for you. First of all, what we're
here for you so first of all what we're
gonna do is import the library is
necessary for this project so we're
going to use the numpy we're going to
import numpy as NP import pandas at PD
then we're gonna also import the
matplotlib and then we are going to do
is read the Boston housing data set into
the BOS one variable so now what we are
going to do is create two variables x
and y so what we're gonna do is take 0
to 13 I'll say is from CR I am two LS
dat in 1x because that's the independent
variable and Y here is dependent
variable which is the MA TV which is the
final price so first of all what we need
to do is plot a correlation so what
we're gonna do is import the Seabourn
library as SN s we're going to use the
correlations to plot the correlations
between the different 0 to 13 variables
what we gonna do is also use ma DV here
also so what we're going to do is SN s
dot heatmap correlations to be going to
use the square
to differentiate usually it comes up in
square only or circles so you don't know
so we're gonna use square you want to
see you see map with the Y as GNP you
this is the color so there's no rotation
in the y axis and we're gonna rotate the
excesses to the 90 degree and let's we
gonna plot it now so this is what the
plot looks like so as you can see here
the more thicker or the more darker the
color gets the more is the correlation
between the variables so for example if
you have a look at CRI M and M a DV
right so as you can see here the color
is very less where the correlation is
very low so one thing important what we
can see here is the tax and our ad which
is the full value of the property and
RIT is the index of accessibility to the
radial highways now these things are
highly correlated and that is natural
because the more it is connected to the
highway and more closer it is to the
highway the more easier it is for people
to travel and hence the tax on it is
more as it is closer to the highways now
what we're going to do is import the train_test_split function from sklearn's cross-validation module, and we're going to split the data set. We create four variables, which are X_train, X_test, y_train and y_test, and we use the train_test_split function to split X and Y; here we're going to use a test size of 0.33, which will split the data set so that the test size will be 33% while the training size will be 67%. Now, this is up to you: usually it is either 60/40 or 70/30, and it depends on your use case, the data you have, the kind of output you are getting, the model you are creating and much more. Then, from sklearn.linear_model, we're going to import LinearRegression; this is the major function we're going to use, the linear regression function which is present in sklearn, which is scikit-learn. We create our linear regression model as lm, and we fit the model on the training data, which is X_train and y_train. Then we create predict_y, which is lm.predict on the X_test variables, and which gives us the predicted Y values. Finally, if we plot the scatter plot of the Y test values against the Y predicted values, with the x label as "Y test" and the y label as "Y predicted", we can see how the predictions line up in the scatter plot; if you were to draw a regression line, it would usually go through all of these points, excluding the extremities which are present at the endpoints. So this is how
a normal linear regression works in Python: you create a correlation plot and examine it, you split the dataset into training and testing variables, you define what your test size is going to be, you import the linear regression model, fit the model on the training data set, use the test data set to create the predictions, and then take the Y test values and the predicted Y values, plot the scatter plot, see how close your model comes to the original data it had, and check the accuracy of that model.
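Putting the narrated steps together, here is a minimal sketch of the whole demo; it assumes the Boston housing data is available locally as a CSV file named BostonHousing.csv with the CRIM-to-LSTAT columns plus a MEDV price column (the filename is an assumption), and it imports train_test_split from sklearn.model_selection, the current home of the function used in the video:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read the Boston housing data set (assumed local CSV)
bos1 = pd.read_csv("BostonHousing.csv")

X = bos1.iloc[:, 0:13]   # independent variables, CRIM through LSTAT
Y = bos1["MEDV"]         # dependent variable: the final price

# Correlation heatmap of the variables
correlations = bos1.corr()
sns.heatmap(correlations, square=True, cmap="YlGnBu")
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

# Split into training (67%) and testing (33%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=5)

# Fit the linear regression model and predict on the test set
lm = LinearRegression()
lm.fit(X_train, y_train)
predict_y = lm.predict(X_test)

# Scatter plot of actual versus predicted prices
plt.scatter(y_test, predict_y)
plt.xlabel("Y test")
plt.ylabel("Y predicted")
plt.show()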
Now, typically you follow these steps, which were: collecting the data (what we did), data wrangling, analyzing the data, training the algorithm, testing the algorithm, and then deploying it. Fitting a model means that you are making your algorithm learn the relationship between the predictors and the outcomes so that you can predict the future values of the outcome. The best fitted model has a specific set of parameters which best defines the problem at hand; since this is a linear model with the equation y = mx + c, in this case the parameters the model learns from the data are m and c. So this is what model fitting is. Now, if we have a look at the types of fitting which are available: first of all, a machine learning algorithm first attempts to solve the problem of underfitting, that is, of taking a line that does not approximate the data well and making it approximate the data better. The machine does not know where to stop in order to solve the problem, and it can go ahead from an appropriate fit to an overfit model; when we say a model overfits, we mean that it may have a low error rate on the training data but may not generalize well to the overall population of data we are interested in. So we have underfit, appropriate fit and overfit; these are the types of fitting.
which is a type of supervised learning
algorithm in machine learning so next
what we're going to do is understand the
need for logistic regression so let's
consider a use case as in political
elections are being contested in our
country and suppose that we are
interested to know which candidate will
probably win now the outcome variables
result in binary either win or lose the
predictor variables are the amount of
money spent the age the popularity rank
and etc etcetera now here the best fit
line in the regression war is going
below 0 and above what and since the
value of y will be discrete that is
between 0 & 1 the linear rain has to be
clipped at 0 & 1 now linear regression
gives us only a single line to classify
the output with linear regression our
resulting curve cannot be formulated
into a single formula as you obtain
three different straight lines what we
need is a new way to solve this problem
so hence people came up with logistic
regression so let's understand what
exactly is logic regression so logistic
regression is a statistical method for
analyzing a data set in which there are
1 or more independent variables that
determine an outcome and the outcome is
a binary class type so example a patient
goes a followed a teen checkup in the
hospital and his interest is to know
whether the cancer is benign or
malignant now a patient's data such as
sugar level blood pressure eight skin
width and the previous medical history
are recorded and a daughter checks the
patient data and it reminds the outcome
of his illness and severity of illness
the outcome will result in binary that
is zero if the cancer is malignant and
one if it's been I know no strict
regression is a statistical method used
for analyzing a dataset there were say
one or more dependent variables like we
discuss like the sugar level blood
pressure
skin with the previous medical history
and the output is binary class type so
Now let's have a look at the logistic regression curve. The logistic regression curve is also called a sigmoid curve or the S curve. The sigmoid function converts any value from minus infinity to infinity into a value between 0 and 1, which we then map to the discrete value 0 or 1. Now, how do we decide whether the value is 0 or 1 from this curve? What we do is provide a threshold value, and based on it we decide the output of the function. So let's take an example with a threshold value of 0.4: any value above 0.4 will be rounded up to 1, and any value below 0.4 will be rounded down to 0.
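As a quick sketch of that thresholding idea (the input values below are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the (0, 1) range."""
    return 1 / (1 + np.exp(-z))

threshold = 0.4                     # the cut-off discussed above
z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
probabilities = sigmoid(z)
classes = (probabilities >= threshold).astype(int)   # >= 0.4 -> 1, otherwise -> 0
print(probabilities.round(2), classes)
```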
Similarly, we also have polynomial regression: when we have nonlinear data which cannot be predicted with a linear model, we switch to polynomial regression. Such a scenario is shown in the graph below; as you can see, we have the equation y = 3x³ + 4x² − 5x + 2. Now here we cannot fit this with a straight line, so we need polynomial regression to solve these kinds of problems.
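A minimal sketch of polynomial regression on that cubic, using scikit-learn's PolynomialFeatures (the noisy sample data is generated only for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Noisy samples from the cubic y = 3x^3 + 4x^2 - 5x + 2 discussed above
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 3 * x**3 + 4 * x**2 - 5 * x + 2 + rng.normal(0, 5, size=x.shape)

# Expand x into [x, x^2, x^3] and fit an ordinary linear model on top
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
model.fit(x, y)
print(model.named_steps["linearregression"].coef_)       # roughly [-5, 4, 3]
print(model.named_steps["linearregression"].intercept_)  # roughly 2
```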
Now, alongside logistic regression, there is another important algorithm, the decision tree, and it is one of the most used algorithms in supervised learning. So let's understand what exactly a decision tree is. A decision tree is a tree-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label, which is the decision taken after computing all the attributes. A path from the root to a leaf represents a classification rule, and a decision tree is built from our data by analyzing the variables. Now, from the tree we can easily find out whether there will be a game tomorrow if the conditions are rainy and less windy.
Now let's see how we can build the same tree. So suppose we have a data set in which we have the outlook; what we can do is, from the outlook, divide the data into sunny, overcast and rainy. As you can see, on the sunny side we get two yeses and three noes, because when the outlook is sunny the humidity can be high or normal and the wind can be weak or strong, so on a sunny day what we have is not a pure subset, and what we're going to do is split it further. If you have a look at overcast, whether the humidity is high or normal and whether the wind is weak or strong, the answer is always yes, so during overcast weather we can play. And if you have a look at the rainy side, we have three yeses and two noes, so again what we're going to do is split it further. So when we take sunny, we split on humidity: when the humidity is normal we're going to play, which is a pure subset, and if the humidity is high we are not going to play, which is also a pure subset. Now let's do the same for the rainy day. During a rainy day we split on the wind: if the wind is weak it becomes a pure subset and we're going to play, and if the wind is strong it's a pure subset and we're not going to play. So the final decision tree looks like this: first of all we check if the outlook is sunny, overcast or rainy; if it's overcast we will play; if it's sunny we then check the humidity, and if the humidity is high we will not play while if the humidity is normal we will play; then again, in the case of rain, we check the wind, and if the wind is weak the play will go on, while if the wind is strong the play must stop. So this is how exactly a decision tree works.
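Here is a small, self-contained sketch of the same idea with scikit-learn; the hand-typed table below only stands in for the outlook/humidity/wind data on the slide, so the rows are illustrative, not the actual data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the play/no-play weather table discussed above
weather = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast", "sunny", "rainy"],
    "humidity": ["high", "normal", "high", "high", "normal", "normal", "high", "normal"],
    "wind":     ["weak", "strong", "weak", "strong", "weak", "strong", "weak", "weak"],
    "play":     ["no", "yes", "yes", "no", "yes", "yes", "no", "yes"],
})

X = pd.get_dummies(weather[["outlook", "humidity", "wind"]])  # one-hot encode the categories
y = weather["play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Ask the tree about a rainy day with high humidity and weak wind
new_day = pd.get_dummies(pd.DataFrame([{"outlook": "rainy", "humidity": "high", "wind": "weak"}]))
print(tree.predict(new_day.reindex(columns=X.columns, fill_value=0)))
```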
So let's go ahead and see how we can implement logistic regression and decision trees. Now, for logistic regression we're going to use the breast cancer data set, and this is how the data set looks: here we have the id, the diagnosis, the radius mean, the texture mean, the perimeter mean; these are the statistics of the particular cancer cells or cysts present in the body. We have a total of 33 columns, all the way from id to an unnamed column 32, and our main goal here is to predict whether the cancer is benign or malignant. So first of all, from sklearn.model_selection we're going to import cross_val_score; again we're going to use numpy for linear algebra, and we're going to use pandas as pd for data processing, for reading the CSV file, for data manipulation and most of that stuff. Then we're going to import matplotlib, which is used for plotting the graphs, and we're going to import seaborn, which is used to plot interactive graphs, like the heatmap of correlations we plotted in the last example. From sklearn we're going to import LogisticRegression, which is the main model or algorithm behind the whole logistic regression, we're going to import train_test_split so as to split the data into two parts, a training and a testing data set, we're going to import metrics to check the error and the accuracy of the model, and we're going to import the DecisionTreeClassifier. So first of all, what we're going to do is create a variable data and
gonna do is create a variable data and
use the pandas PD to read the data from
the data set so here the header 0 means
that the zeroth row is our column name
and if we have a look at the data or the
top six part of the data we're going to
use the friend data dot head and get the
data dot info so as you can see here we
have so many data columns such as highly
diagnosis radius being in text remain
parameter main area means smoothness
mean we have texture worst symmetry
worst we have fractal dimension worse
and lastly we have the unnamed so first
of all we can see we have six rows and
33 columns and if you have a look at all
of these columns here right we get the
total number which is the 569 which is
the total number of observation we have
and we check whether it's non null and
then again we check the type of the
particular column so it's integer it's
object float mostly most of them are
float some are integer so now again
we're going to drop the unnamed column
which is the column 30 second 0 to 33
which is the 30 second column so in this
process we will change it in our data
itself so if you want to save the old
data you can also see if that but then
again that's of no use so theta dot
columns will give us all of these
columns when we remove that so
you can see here in the output we do not
have the final one which was the unnamed
the last one we have is the type which
is float so latex we also don't want the
ID column for our analysis so what we're
gonna do is we're gonna drop the ID
again so as I said above the data can be
divided into three paths so let's divide
the features according to their category
Now, as you know, our diagnosis column is of object type, so we can map it to an integer value: what we want to do is take data['diagnosis'] and map M to 1 and B to 0, so that the output is either 1 or 0. Now, if we use data.describe(), you can see we have 8 rows and 31 columns, because we dropped two of the columns, and for the diagnosis we have its summary values here. Let's get the frequency of the cancer stages: here we're going to use seaborn's sns.countplot on the diagnosis column, and if we use plt.show() you can see that the count for diagnosis 0 is higher and for 1 is lower. If we plot the correlation among this data, we're going to use plt.figure and sns.heatmap; we're going to plot the correlation with cbar=True and square=True, and we're going to use the coolwarm colormap. As you can see here, the correlation of radius_worst with area_worst and perimeter_worst is high, and radius_worst also has a high correlation with perimeter_mean and area_mean, because if the radius is larger, the perimeter is larger and the area is larger. So, based on the correlation plot, let's select some features for the model; the selection is made in order to remove the collinearity, so we will have a prediction-variable list in which we have texture_mean, perimeter_mean, smoothness_mean, compactness_mean and symmetry_mean, and these are the variables which we'll use for the prediction. Now we're going to split the data into training and testing data sets; here our main data is split into training and test sets with a test size of 0.3, that is, roughly a 30-to-70 ratio. Next, what we're going to do is
check the dimensions of the training and the testing data sets, so we're going to use the print command and pass train.shape and test.shape. What we can see here is that we have 398 observations with 31 columns in the training data set, and 171 rows with 31 columns in the testing data set. Then again, what we're going to do is take the training data input: we create train_X with the prediction variables, and train_y is the diagnosis, which is the output of our training data. The same is done for the test set, so we use test_X for the test prediction variables and test_y for the test diagnosis, which is the output of the test data. Now we're going to create a logistic regression model and call logistic.fit, in which we fit the training data set, which is train_X and train_y; then we're going to use a temporary variable temp in which we store the predictions on test_X, and then we compare temp, that is, the predictions for test_X, with test_y to check the accuracy. The accuracy we get here is 0.91. Then again, we're going to do the same with a decision tree, so we create a DecisionTreeClassifier with random_state given as 0. Next, we compute the cross-validation score: we take the classifier clf, the train_X and train_y, and cv equals 10 for the cross-validation score. Then we fit on the training data (the sample weight we have not defined here, check_input is true and X_idx_sorted is none); once the model is fitted we predict using test_X, we can also get the predicted log-probabilities for test_X, and if we compare the score on test_X against test_y, with sample weight none, we get the accuracy of the decision tree. So this is how you implement a decision tree classifier and check the accuracy of that particular model. So that was it.
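Putting the whole walkthrough together, a condensed sketch might look like the following; it assumes the Wisconsin breast cancer CSV is available as data.csv with the column names used above (the file path is a placeholder):

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

data = pd.read_csv("data.csv", header=0)                     # breast cancer data (path is a placeholder)
data = data.drop(columns=["Unnamed: 32", "id"])              # drop the empty column and the id
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})  # malignant -> 1, benign -> 0

prediction_vars = ["texture_mean", "perimeter_mean", "smoothness_mean",
                   "compactness_mean", "symmetry_mean"]
train, test = train_test_split(data, test_size=0.3, random_state=42)
train_X, train_y = train[prediction_vars], train["diagnosis"]
test_X, test_y = test[prediction_vars], test["diagnosis"]

logistic = LogisticRegression(max_iter=1000).fit(train_X, train_y)
print("logistic accuracy:", metrics.accuracy_score(test_y, logistic.predict(test_X)))

clf = DecisionTreeClassifier(random_state=0)
print("tree 10-fold CV accuracy:", cross_val_score(clf, train_X, train_y, cv=10).mean())
```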
So next on our list is random forest, so let's understand what exactly a random forest is. A random forest is an ensemble classifier made using many decision tree models. So what exactly are ensemble models? Ensemble models combine the results from different models, and the result from an ensemble model is usually better than the result of any one of the individual models, because every tree votes for one class and the final decision is based upon the majority of votes. It is better than a decision tree because, compared to a decision tree, it can be much more accurate, it runs efficiently on large data sets, it can handle thousands of input variables without variable deletion, and it gives an estimate of which variables are important in the classification. So let's take the example
of weather data; let's understand random forest with the help of the hurricanes and typhoons data set. We have data about hurricanes and typhoons from 1851 to 2014, and the data comprises the location, wind and pressure of tropical cyclones in the Pacific Ocean. Based on the data, we have to classify the storms into hurricanes, typhoons and their sub-categories, as per the predefined classes mentioned. So the predefined classes are: TD, a tropical cyclone of tropical depression intensity, which is less than 34 knots; TS, a tropical storm, with intensity between 34 and 63 knots; HU, hurricane intensity, greater than or equal to 64 knots; EX, an extratropical cyclone; SD, a subtropical cyclone of subtropical depression intensity, less than 34 knots; SS, a subtropical cyclone of subtropical storm intensity, greater than 34 knots; then again we have LO, a low that is neither a tropical cyclone, a subtropical cyclone nor an extratropical cyclone; and finally we have DB, a disturbance of any intensity. Now these were the descriptions of the predefined classes. As you can see, this is the data, in which we have the ID, name, date, event, status, latitude, longitude, maximum wind, minimum pressure; there are so many variables.
So let's start by importing pandas; then again we import matplotlib and we're going to use %matplotlib inline, which is used for plotting interactive graphs, and it's what I like most for plots. Next, what we're going to do is import seaborn as sns, which is again used to plot graphs, and we're going to import train_test_split from sklearn's model selection. From scikit-learn we have to import metrics for checking the accuracy, then we have to import sklearn, and from sklearn we have to import tree; from sklearn.ensemble we're going to import the RandomForestClassifier, from sklearn.metrics we're going to import confusion_matrix so as to check the accuracy, and from sklearn.metrics we're also going to import the accuracy_score. So let's import random, and let's read the data set and print the first few rows; you can see here we have the ID, the name, the date, the time, the event, the status, the latitude, the longitude, so in total we have 22 columns here. As you can see, we have a column named status, which is TS for the first few rows, so what we're going to do is convert data['Status'] to pandas categorical codes; what we can do is make it categorical data with codes, so that it's easier for the machine to understand. Rather than having the categories as names, we're going to use the categories as numbers, so it's easier for the computer to do the analysis. So let's get the frequency of the different typhoons. Then what we're going to do is set random.seed, and we have to drop the status and the event from the predictors because these are unnecessary; we're going to drop latitude and longitude, and we're going to drop the ID, then the name, the date and the time it occurred. If we print the prediction list (ignore the error here, that's not necessary), we have the maximum wind, the minimum pressure and the various low-wind radii columns, and these are the parameters on which we're going to do the predictions. So now we'll split the data into training and testing data sets.
Then again we have train and test, and we're going to use train_test_split, splitting in the 70-to-30 industry-standard ratio. An important thing to note here is that you can split it in any ratio you want; it can be 60/40, 70/30 or 80/20, it all depends upon the model you have or the requirement you have. Then, after printing, let's check the dimensions: the training data set comprises 18,295 rows with 22 columns, whereas the testing data set comprises around 8,000 rows with 22 columns. We have the training data input train_x, and we have train_y, so status is the final output of the training data, which will tell us whether it's a TS, a TD, an HU, that is which kind of hurricane or typhoon it is, or any of the sub-categories which were defined, like the subtropical cyclone, the subtropical storm and so on. So our prediction or output variable will be status, and these are the training columns which we have here.
The same we have to do for the test variables, so we have test_x with the prediction variables and test_y with the status. Now what we're going to do is build a random forest classifier: in the model we have the RandomForestClassifier with 100 estimators, a simple random forest model, and then we fit the training data set, which is train_x and train_y. Then we make the prediction, which is model.predict with test_x; this will predict for the test data, and prediction will contain the values predicted by our model for the status column of the test inputs. If we print the accuracy score between the prediction and test_y to check the accuracy, we get about 95% accuracy. Now we do the same with a decision tree, so again we use the model tree.DecisionTreeClassifier, we use train_x and train_y, which are the training data sets, and the new prediction is the model's predictions on test_x. We create a DataFrame, which is pandas.DataFrame, and if we have a look at the prediction and test_y side by side, you can see the statuses go 10, 10, 3, 3, 10, 10, 11 and 5, 5, 3, 11, 3, 3, and so it goes on and on, with 7,842 rows and 1 column. If you print the accuracy we get 95.57 percent accuracy, and if you have a look at the accuracy of the random forest we get 95.66 percent, which is more than 95.57. So, as I mentioned earlier, a random forest usually gives a better output or creates a better model than the decision tree classifier because, as I mentioned earlier, it combines the results from different models; the final decision is based upon the majority of votes, and its accuracy is usually higher than that of decision tree models.
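As a hedged sketch of that comparison, assuming train_x, train_y, test_x and test_y have already been produced by the split described above (the variable names follow the walkthrough, so treat them as assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree, metrics

# Assumes train_x, train_y, test_x, test_y exist from the earlier split
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(train_x, train_y)
rf_accuracy = metrics.accuracy_score(test_y, forest.predict(test_x))

dtree = tree.DecisionTreeClassifier(random_state=0)
dtree.fit(train_x, train_y)
dt_accuracy = metrics.accuracy_score(test_y, dtree.predict(test_x))

# The majority vote of many trees is usually at least as accurate as one tree
print(f"random forest: {rf_accuracy:.4f}   decision tree: {dt_accuracy:.4f}")
```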
Now let's move ahead to the Naive Bayes algorithm and see what exactly Naive Bayes is. Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling. It is a classification technique based on Bayes' theorem with an assumption of independence among predictors, and its name has two parts, the 'naive' and the 'Bayes'. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature, even if these features depend on each other or upon the existence of the other features; all of these properties independently contribute to the probability that, say, a fruit is an apple or an orange, and that is why it is known as naive. A Naive Bayes model is easy to build and particularly useful for very large data sets. In probability theory and statistics, Bayes' theorem, which is alternatively known as Bayes' law or Bayes' rule, describes the probability of an event based on prior knowledge of conditions that might be related to the event. So Bayes' theorem is a way to figure out conditional probability. Now, conditional probability is the probability of an event happening given that it is connected to one or more other events; for example, your probability of getting a parking space is connected to the time of day you park, where you park and what conventions are going on at the same time. Bayes' theorem is slightly more nuanced: in a nutshell, it gives us the actual probability of an event given information about tests. So let's talk about Bayes' theorem now.
any I policies edge and evidence II
Bayes theorem states that the
relationship between the probability of
the hypothesis before getting the
evidence pH and the probability of the
hypothesis after getting the evidence
which is P H bar e is PE bar H into
probability of H there are a probability
of e which means it's the probability of
even after in the hypothesis inter
priority of the hypothesis divided by
the probability of the evidence so let's
understand it with a simple example here
So now, for example, if a single card is drawn from a standard deck of playing cards, the probability of that card being a king is 4 out of 52, since there are 4 kings in a standard deck of 52 cards. Rewording this, if 'king' is the event 'this card is a king', the prior probability of king, that is P(King), equals 4/52, which is 1/13. Now if the evidence is provided, for instance someone looks at the card and tells us that the single card is a face card, then the posterior probability, which is P(King | Face), can be calculated using Bayes' theorem: the probability of king given face equals the probability of face given king, times the probability of king, divided by the probability of face. Since every king is also a face card, the probability of face given king is equal to 1, and since there are 3 face cards in each suit, namely the jack, the queen and the king, the probability of a face card is 12 out of 52, that is 3/13. Combining these likelihoods using Bayes' theorem, we get the probability of king given face equal to 1 out of 3. More generally, for a joint probability distribution over events A and B, the conditional probability of A given B is defined as P(A|B) = P(A ∩ B) / P(B), and this is how we arrive at Bayes' theorem.
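The card example can be checked with a couple of lines of arithmetic:

```python
# Verifying the playing-card example with Bayes' theorem
p_king = 4 / 52               # P(King) = 1/13
p_face = 12 / 52              # P(Face) = 3/13 (jack, queen and king in each of the 4 suits)
p_face_given_king = 1.0       # every king is a face card

p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)      # 0.333... = 1/3
```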
Now let's have a look at the working of Bayes' theorem with the help of an example. Let's take the same weather-forecast data set in which we had sunny, rainy and overcast. First of all, we will create a frequency table using each attribute of the data set, so as you can see here we have the frequency tables for the outlook, the humidity and the wind. Next, what we're going to do is compute the probability of sunny given yes, which is 3 out of 10, find the probability of sunny, which is 5 out of 14 (and this 14 comes from the total number of observations, across yes and no), and similarly we find the probability of yes, which is 10 out of 14, that is 0.71. For each frequency table we will generate these kinds of likelihood tables, so the likelihood of yes given sunny comes out to about 0.60, and similarly the likelihood of no given sunny is 0.40. So here you can see that, using Bayes' theorem, we have found the likelihood of yes given sunny and of no given sunny. Similarly, we're going to build the same likelihood table for humidity and the same for wind; for humidity we're going to check the probability of playing yes given the humidity is high, and the probability of playing no given the humidity is high, and we're going to calculate them using the same Bayes' theorem. So suppose we have a day with the following values, in which we have the outlook as rainy, the humidity as high and the wind as weak; since we discussed the same example earlier with the decision tree, we already know the answer, so let's not get ahead of ourselves and let's try to find the answer using Bayes' theorem.
Let's understand how Naive Bayes actually works here. First of all, we compute the likelihood of yes on that day, which equals the probability of the outlook being rain given yes, times the probability of the humidity being high given yes, times the probability of the wind being weak given yes, times the probability of yes; that gives us 0.019. Similarly, the likelihood of no on that day is the probability of rain given no, times the probability of high humidity given no, times the probability of weak wind given no, times the probability of no, and that equals 0.016. Now what we're going to do is find the probability of yes and of no, and for that we take each likelihood and divide it by the sum of the likelihoods of yes and no, which gives us the overall probabilities. Using that formula we get the probability of yes as 0.55 and the probability of no as 0.45, so our model predicts that there is a fifty-five percent chance that there will be a game tomorrow if it's rainy, the humidity is high and the wind is weak.
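A small sketch of that calculation is shown below; the three per-class conditional probabilities are the values read off the likelihood tables, and the numbers used here are illustrative placeholders chosen only to reproduce the 0.019 / 0.016 figures quoted above:

```python
p_yes, p_no = 10 / 14, 4 / 14   # prior probabilities of playing / not playing

# P(rain|yes) * P(high humidity|yes) * P(weak wind|yes) * P(yes) -- placeholder conditionals
likelihood_yes = 0.22 * 0.33 * 0.37 * p_yes      # ~0.019
# P(rain|no) * P(high humidity|no) * P(weak wind|no) * P(no)     -- placeholder conditionals
likelihood_no = 0.33 * 0.42 * 0.40 * p_no        # ~0.016

# Normalise so the two class probabilities add up to 1
p_play = likelihood_yes / (likelihood_yes + likelihood_no)      # ~0.55
p_no_play = likelihood_no / (likelihood_yes + likelihood_no)    # ~0.45
print(round(p_play, 2), round(p_no_play, 2))
```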
Now, if you have a look at the industrial use cases of Naive Bayes, we have news categorization: the news comes in with a lot of tags and it has to be categorized so that the user gets the information he needs in a particular format. Then again we have spam filtering, which is one of the major use cases of the Naive Bayes classifier, as it classifies an email as spam or ham. Then finally we have weather prediction as well, as we just saw with the example where we predict whether we're going to play or not; that sort of prediction is always there. So guys, this was all about supervised learning: we discussed linear regression and logistic regression, we discussed Naive Bayes, we discussed random forest and decision trees, and we understood how the random forest is better than a decision tree; in some cases it might be equal to a decision tree, but nonetheless it generally provides a better result. So guys, that was all about supervised learning, but before we move on, let's go ahead and see how exactly we implement Naive Bayes.
So guys, here we have another data set, run or walk; it's a kinematic data set and it has been measured using mobile sensors. We let the target variable be y and assign all the columns after it to X; using scikit-learn's Naive Bayes model we're going to observe the accuracy and generate a classification report using scikit-learn. We could then repeat the model using only the acceleration values as predictors and then using only the gyro values as predictors, and comment on the difference in accuracy between the two models. So here we have a data set which is run or walk; let me open that for you. As you can see, we have the date, time, username, wrist, activity, acceleration x, y and z, and gyro x, y and z, so based on it let's see how we can implement the Naive Bayes classifier. First of all, what we're going to do is import pandas as pd, then we're going to import matplotlib for plotting, and we're going to read the run-or-walk data file with pandas' read_csv. Let's have a look at the info: first of all we see that we have 88,588 rows with 11 columns, so we have the date, time, username, wrist, activity, the acceleration x, y, z and the gyro x, y, z, plus the memory usage of the data. This is how you look at the columns, df.columns. Now again, we're going to split the data set into training and testing data sets, so we're going to use the train_test_split model; what we're going to do is split it into X_train, X_test, y_train and y_test, and we're going to use a test size of 0.2 here. Again, I'm saying it depends on you what the test size is. So let's print the shape of the training set and see that it has about 70,000 observations and six columns.
Now, what we're going to do is, from sklearn.naive_bayes, import GaussianNB, which is the Gaussian Naive Bayes, and we're going to create the classifier as GaussianNB. Then we'll pass the X_train and y_train variables to the classifier's fit, and again we have y_predict, which is classifier.predict on X_test, and we're going to compare y_predict with y_test to see the accuracy. For that, from sklearn.metrics we're going to import the accuracy_score; now let's compare both of these, and the accuracy we get is 95.54 percent. Another way is to build a confusion matrix, so from sklearn.metrics we're going to import confusion_matrix and we're going to compute the matrix of y_predict and y_test; as you can see, the counts on the diagonal are high, so that's a very good result. Now what we're going to do is create a classification report, so from metrics we're going to import classification_report, we're going to put the target names as walk and run, and print the report using y_test and y_predict with the target names we have.
So for walking we get a precision of 0.92 and a recall of 0.99, the F1-score is 0.96 and the support is 8,673; and for running we get a precision of about 0.99, with a recall of 0.92 and an F1-score of 0.95.
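A condensed sketch of that whole walkthrough is given below; the file name and the column names are assumptions based on how the data set is described above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df = pd.read_csv("run_or_walk.csv")     # file name is a placeholder
y = df["activity"]                      # target column (assumed: 0 = walk, 1 = run)
X = df.drop(columns=["date", "time", "username", "wrist", "activity"])  # keep the sensor columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = GaussianNB()
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

print(accuracy_score(y_test, y_predict))
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict, target_names=["walk", "run"]))
```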
So guys, this is how you use the Gaussian Naive Bayes classifier, and all of these types of algorithms, whether they belong to supervised, unsupervised or reinforcement learning, are all present in the scikit-learn library. Scikit-learn is a very important library when you are dealing with machine learning, because you do not have to hard-code any algorithm; every algorithm is present there. All you have to do is pass the data, split the data set into training and testing sets, then get the predictions and compare the predicted y with the test y. That is exactly what we do every time we work on a machine learning algorithm. Now guys, that was all
about supervised learning let's go ahead
and understand what exactly is
unsupervised learning. So sometimes the given data is unstructured and unlabeled, and it becomes difficult to classify that data into different categories; unsupervised learning helps to solve this problem. This learning is used to cluster the input data into classes on the basis of their statistical properties; for example, we can cluster different bikes based upon their speed limit, their acceleration or the average mileage they are giving. So unsupervised learning is a type of machine learning algorithm used to draw inferences from data sets consisting of input data without labeled responses. If you have a look at the workflow or the process flow of unsupervised learning, the training data is a collection of information without any label, we have the machine learning algorithm, and then we get the clustering model. What it does is distribute the data into different clusters, and again, if you provide any unlabeled new data, it will make a prediction and find out to which cluster that particular data point belongs. So one of the most important
algorithms in unsupervised learning is
clustering so let's understand exactly
what is clustering so a clustering
basically is the process of dividing the
datasets into groups consisting of
similar data points
it means grouping of objects based on
the information found in the data
describing the objects or their
relationships so clustering models
focused on identifying groups of similar
records and labeling records according
to the group to which they belong now
this is done without the benefit of
prior knowledge about the groups and
their characteristics so and in fact we
may not even know exactly how many
groups are there to look for now these
models are often referred to as
unsupervised learning models since there
is no external standard by which to
judge the models classification
performance there are no right or wrong
answers for these models. Now, if we talk about why clustering is used, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data; sometimes the partitioning itself is the goal, or the aim of the clustering algorithm is to make sense of and extract value from large sets of structured and unstructured data. That is why clustering is used in the industry. And if you have a look at the various use cases of clustering in the industry, first of all it's being used in marketing, for discovering distinct groups in customer databases, such as customers who make a lot of long-distance calls or customers who use the internet more than calls. It is also used by insurance companies, for example for identifying groups of insurance policy holders with a high average claim cost, or identifying which crops are profitable for farmers. It is used in seismic studies to define probable areas of oil or gas exploration based on seismic data, and it is also used in the recommendation of movies, it is used on Flickr photos, and it is also used by Amazon for recommending products, by determining which category a product lies in. So basically, if we talk about
clustering there are three types of
clustering so first of all we have the
exclusive clustering which is the hard
clustering so here an item belongs
exclusively to one cluster not several
clusters and the data point belong
exclusively to one cluster so an example
of this is the k-means clustering so
claiming clustering does this exclusive
kind of clustering. Secondly, we have overlapping clustering, which is also known as soft clustering; in this, an item can belong to multiple clusters, as its degree of association with each cluster is given, and as an example we have fuzzy or C-means clustering, which is used for overlapping clustering. And finally we have hierarchical clustering: when two clusters have a parent-child relationship, or a tree-like structure, then it is known as hierarchical clustering, and as you can see from the example, we have a parent-child kind of relationship in the clusters given here. So let's understand
what exactly k-means clustering is. K-means clustering is an algorithm whose main goal is to group similar data points into a cluster; it is the process by which objects are classified into a predefined number of groups, so that they are as dissimilar as possible from one group to another and as similar as possible within each group. Now, if you have a look at how the algorithm works, first of all it starts with defining the number of clusters, which is k; then we choose the centroids, we find the distance of the objects to the centroids, and then we do the grouping based on the minimum distance. If the centroids have converged, then we form the clusters; if not, we find the centroids again and repeat all of the steps again and again. So let me
show you how exactly this clustering works with an example. First we need to decide the number of clusters to be made; now, another important task here is how to decide the right number of clusters, and we'll get into that later. For now let's assume that the number of clusters we have decided on is three. After that, we provide initial centroids for all the clusters, which is essentially guessing, and the algorithm calculates the Euclidean distance of each point from each centroid and assigns the data point to the closest cluster. The Euclidean distance, as all of you know, is the square root of the sum of the squared differences between the coordinates. Next, the centroids are calculated again and we have our new clusters; for each data point the distances to the new centroids are calculated, the points are again assigned to the closest cluster, and then again we have new centroids. These steps are repeated until the centroids stop changing, or the new centroids are very close to the previous ones; until our output gets repeated or the outputs are close enough, we do not stop this process, we keep calculating the Euclidean distance of all the points to the centroids and then calculating the new centroids. That is how k-means clustering basically works. Now an important part here is to understand how to decide the value of k, or the number of clusters, because
it does not make any sense if you do not know how many clusters you are going to make. So, to decide the number of clusters, we have the elbow method. First of all, compute the sum of squared errors, which is the SSE, for some values of k, for example 2, 4, 6 and 8. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid; mathematically it is given by the equation provided here. If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because as the number of clusters increases the clusters become smaller, so the distortion is also smaller. Now the idea of the elbow method is to choose the k at which the SSE decreases most abruptly. So, for example, if we have a look at the figure given here, we see that the best number of clusters is at the elbow; as you can see, the graph changes abruptly after four, so for this particular example we're going to use four as the number of clusters.
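A minimal sketch of the elbow method looks like this; the synthetic blobs are only a stand-in for whatever data you are actually clustering:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # stand-in data with 4 real groups

sse = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)            # inertia_ = sum of squared distances to the nearest centroid

plt.plot(k_values, sse, marker="o")    # pick the k where the curve bends (the "elbow")
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()
```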
Now, while working with k-means clustering, there are two key points to know. First of all, be careful about where you start: a good approach is choosing the first center at random, choosing the second center far away from the first center, and likewise choosing the nth center as far away as possible from the closest of all the other centers. The second idea is to do many runs of k-means, each with different random starting points, so that you get an idea of where exactly and how many clusters you need to make, where the centroids lie, and how the data is converging. Now, k-means is not exactly a perfect method, so let's understand the pros and cons of k-means clustering. On the pros side, we know that k-means is simple and understandable, everyone gets it at the first go, and the items are automatically assigned to clusters. Now if we have a look at the cons, first of all, one needs to
define the number of clusters, and this is a heavy task: whether we have 3, 4 or 10 categories, if you do not know what the number of clusters is going to be, it's very difficult to guess it. Also, all the items are forced into clusters, so whether or not they actually belong to any cluster or another category, they are forced into the category they are closest to, and this again happens because of not defining, or not being able to guess, the correct number of clusters. And most of all, it's unable to handle noisy data and outliers; machine learning engineers and data scientists have to clean the data anyway, but it comes down to the analysis they are doing and the method that they are using, and typically people do not clean the data specifically for k-means clustering, and even if they do there are sometimes still noisy points and outliers left which affect the whole model. So that was all for k-means clustering. Now what we're going to do is use k-means clustering on a movie data set: we have to find the number of clusters and divide the movies accordingly. So the use case is that, first of all, we have a data set of five thousand movies, and what we want to do is group the movies into clusters based on their Facebook likes.
So guys, let's have a look at the demo here. First of all, what we're going to do is import deepcopy, numpy, pandas and seaborn, the various libraries which we're going to use, and from matplotlib we're going to use pyplot with the ggplot style. Next, what we're going to do is import the data set and look at its shape: if you have a look at the shape of the data set, we can see that it has 5,043 rows with 28 columns, and if you have a look at the head of the data set we can see those 5,043 data points. So what we're going to do is place the data points in a plot; we take the director Facebook likes, and if we have a look at the data columns we have the number of faces in the poster, the cast total Facebook likes, the director Facebook likes. So what we have done here is take the director Facebook likes and the actor 3 Facebook likes, which gives us 5,043 rows and two columns. Now, using KMeans from sklearn, what we're going to do is import it first:
we import KMeans from sklearn.cluster; remember, guys, sklearn is a very important library in Python for machine learning. And for the number of clusters we're going to provide five; note again that the number of clusters depends upon the SSE, the sum of squared errors, that is, the elbow method, so I'm not going to go into the details of that again. We're going to fit the data with kmeans.fit, and if we get the cluster centers for the k-means and print them, what we find is an array of five cluster centers, and we can also print the labels of the k-means clustering. Next, what we're going to do is plot the data we have with the new cluster assignments we have found, and for this we're going to use seaborn; as you can see here, we have plotted the data onto the grid and you can see we have five clusters. Probably what I would say is that cluster three and cluster zero are very, very close, and that's exactly what I was going to say: the main challenge in k-means clustering is to define the number of centers, which is the k. As you can see here, the third cluster and the zeroth cluster are very close to each other, so they could probably have been one single cluster, and another disadvantage is that we do not exactly know how the points ought to be arranged, so it's very difficult to force the data into any particular cluster, which makes our analysis a little difficult. It works fine, but sometimes it can be hard to settle on the clusters in k-means clustering.
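For reference, a condensed sketch of that demo might look like the following; the file name and the two column names are assumptions based on how the data set is described above:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

movies = pd.read_csv("movie_metadata.csv")      # ~5043 rows; the file name is a placeholder
X = movies[["director_facebook_likes", "actor_3_facebook_likes"]].dropna()

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)                  # one centre per cluster

plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=kmeans.labels_, s=10)
plt.scatter(*kmeans.cluster_centers_.T, c="red", marker="x")
plt.xlabel("director_facebook_likes")
plt.ylabel("actor_3_facebook_likes")
plt.show()
```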
Now let's understand what exactly fuzzy C-means clustering is. Fuzzy C-means is an extension of k-means clustering and a popular, simple clustering technique. Fuzzy clustering, also referred to as soft clustering, is a form of clustering in which each data point can belong to more than one cluster. K-means tries to find hard clusters, where each point belongs to exactly one cluster, whereas fuzzy C-means discovers soft clusters: in a soft cluster, any point can belong to more than one cluster at a time, with a certain affinity value towards each. Fuzzy C-means assigns a degree of membership, ranging from 0 to 1, of an object to a given cluster, and there is a stipulation that the sum of the fuzzy memberships of an object across all the clusters it belongs to must be equal to 1. So, for instance, the degrees of membership of a particular point to two of these clusters could be 0.6 and 0.4, and they add up to 1; that is the logic behind fuzzy C-means. And this affinity is proportional to the distance from the point to the center of the
cluster. Then again, we have the pros and cons of fuzzy C-means. First of all, it allows a data point to be in multiple clusters, which is a pro, and it is a more natural representation of the behavior of genes, since genes are usually involved in multiple functions, so it is a very good type of clustering when we are talking about genes. If we talk about the cons, again we have to define c, which is the number of clusters, same as k; next, we need to determine the membership cutoff value as well, so that takes a lot of time and it's time-consuming; and the clusters are sensitive to the initial assignment of centroids, so a slight change or deviation in the centers is going to result in a very different kind of output from fuzzy C-means. And one of the major disadvantages of C-means clustering is that it is a non-deterministic algorithm, so it does not give you one particular output as such. So that's that. Now let's have a look at the
third type of clustering which is the
hierarchical clustering. So hierarchical clustering is an alternative approach which builds a hierarchy, either from the bottom up or from the top down, and it does not require us to specify the number of clusters beforehand. The bottom-up algorithm works like this: first of all, we put each data point in its own cluster, then we identify the two closest clusters and combine them into one cluster, and we repeat the above step until all the data points are in a single cluster. Now there are two types of hierarchical clustering: one is agglomerative clustering and the other one is divisive clustering. Agglomerative clustering builds the dendrogram from the bottom level, while divisive clustering starts with all the data points in one cluster and keeps splitting it.
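A short sketch of bottom-up (agglomerative) clustering with scikit-learn and a dendrogram from scipy, again on synthetic stand-in data:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)   # stand-in data

# Bottom-up clustering: every point starts alone, the closest clusters get merged
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

# The dendrogram shows the full merge hierarchy, so the number of clusters
# can also be chosen afterwards by cutting the tree at some height
dendrogram(linkage(X, method="ward"))
plt.show()
```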
Now again, hierarchical clustering also has its pros and cons. On the pros side, no assumption about a particular number of clusters is required, and the hierarchy may correspond to meaningful taxonomies. Whereas if we talk about the cons, once a decision is made to combine two clusters it cannot be undone, and one of the major disadvantages of hierarchical clustering is that it becomes very slow on very large data sets; and nowadays, I think, every industry is working with large data sets and collecting large amounts of data, so hierarchical clustering may not be the apt or the best method someone might need to go for. So there's that. Now, when
we talk about unsupervised learning, we have k-means clustering, and then there's another important concept which people usually miss while talking about unsupervised learning, and that is market basket analysis. Now, it is one of the key techniques used by large retailers to uncover associations between items: it works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to analyze the relationships between the items that people buy. For example, people who buy bread also tend to buy butter, so the marketing team at the retail store should target customers who buy bread and butter and provide them an offer so that they buy a third item, like eggs. If a customer buys bread and butter and sees a discount or an offer on eggs, he will be encouraged to spend more money and buy the eggs, and this is what market basket analysis is all about. Now, to find the associations
all about now to find the association
between the two items and make
predictions about what the customers
will buy there are two algorithms which
are the Association rule mining and the
ebrary algorithms so let's discuss each
of these algorithm with an example first
of all if we have a look at the
Association rule mining now it's a
technique that shows how items are
associated to each other for example
customers who purchase bread have a 60%
likelihood of also purchasing Jam and
customers who purchase laptop are more
likely to purchase laptop bags now if
you take an example of an association
rule if you have a look at the example
here a aro B it means that if a person
buys an Adam 8 then he will also buy an
item P now there are three common ways
to measure a particular Association
because we have to find these rules on
the basis of some statistics right so
what we do is use support confidence and
lift now these three common ways and the
measures to have a look at the
Association rule mining and know exactly
how good is that rule so first of all we
have support so support gives the
fraction of the transaction which
contains an item a and B so it's
basically the frequency of the item in
the whole item set whereas confidence
gives how often the item a and B
occurred together given the number of
item given the number of times a occur
so it's frequency a comma B divided by
the frequency of a now lift what
indicates is the strength of the rule
over the random co-occurrence of a and B
if you have a close look at the
denominator of the lift formula here we
have support a into support B now a
major thing which can be noted from this
is that the support of a and B are
independent here so if the value of lift
or the denominator value of the lift is
more it means that the items are
independently selling more not together
so that in turn will decrease the value
of lift so what happens is that suppose
the value of lift is more that implies
that
which we get it implies that the rule is
strong and it can be used for later
purposes because in that case the
support in to support p-value which is
the denominator of lift will be low
which in turn means that there's a
relationship between the items a and B
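Counting the three measures by hand on a tiny made-up transaction list makes the formulas concrete (the items and transactions below are illustrative only):

```python
transactions = [
    {"bread", "butter", "eggs"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

def support(*items):
    """Fraction of transactions containing all the given items."""
    return sum(all(i in t for i in items) for t in transactions) / n

support_ab = support("bread", "butter")                      # freq(A and B) / N
confidence = support_ab / support("bread")                   # freq(A and B) / freq(A)
lift = support_ab / (support("bread") * support("butter"))   # >1 => bought together more than by chance
print(support_ab, confidence, lift)
```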
So let's take an example of association rule mining and understand how exactly it works. Suppose we have a set of items A, B, C, D and E, and we have a set of transactions T1, T2, T3, T4 and T5, and what we need to do is create some rules, for example A → D, which means that if a person buys A he buys D; C → A, if a person buys C he buys A; A → C, if a person buys A he buys C; and the fourth one, B and C → A, meaning if a person buys B and C he in turn buys A. What we need to do is calculate the support, confidence and lift of these rules. Now here, again, we come to the Apriori algorithm, because the Apriori algorithm and association rule mining go hand in hand. What the Apriori algorithm does is use the frequent itemsets to generate the association rules, and it is based on the concept that every subset of a frequent itemset must itself also be frequent. So let's understand what a frequent itemset is and how all of these work together. We take the following transactions of items: transactions T1 through T5, with itemsets {1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5} and {1, 3, 5}.
are 1 3 4 2 3 5 1 2 3 5 to 5 and 1 3 5
now another more important thing about
support which I forgot to mention was
that when talking about Association rule
mining there is a minimum support count
what we need to do now the first step is
to build a list of items set of size 1
using this transaction data set and use
the minimum support count 2 now let's
see how we do that if we create the
tables see when if you have a close look
at the table C 1 we have the item set 1
which has a support 3 because it appears
in the transaction 1 3 & 5 similarly if
you have a look at the item set the
single item 3 so it has a supporter of 4
it appears in t 1
D 2 D 3 and T 5 but if we have a look at
the items at 4 it only appears in the
transaction once so it's support value
is 1 now the item set with the support
rally which is less than the minimum
support value that is to have to be
eliminated so the final David which is a
table F 1 has 1 2 3 and 5 it does not
contain the 4 now what we're going to do
is create the item list of the size 2
and all the combination of the item sets
in f1 are used in this iteration so
we've left four behind we just have 1 2
3 and 5 so the possible item sets of 1 2
1 3 1 5 2 3 2 5 & 3 5 then again we'll
calculate these support so in this case
if we have a closer look at the table c2
we see that the items at 1 comma 2 is
having a support value 1 which has to be
eliminated so the final table F 2 does
not contain 1 comma 2 similarly if we
create the item sets of size 3 and
calculate these support values but
before calculating the support let's
perform the peirong on the data set now
what Spearing so after all the
combinations are made we divide the
table see three items to check if there
are another subset whose support is less
than the minimum support value this is a
priori algorithm so in the item sets 1 2
3 what we can see that we have 1 2 and
in the 1 to 5 again we have 1 2 so we'll
discard poor of these item sets and
we'll be left with 1 3 5 & 2 3 5 so with
135 we have three subsets 1 5 1 3 3 5
which are present in table F 2 then
again we have 2 3 2 5 & 3 5 which are
also present in tea we'll have to so we
have to remove 1 comma 2 from the table
C 3 and create the table F 3
now if we're using the items of C 3 to
create the adults of c4 so what we find
is that we have the item set 1 2 3 5 the
support value is 1 which is less than
the minimum support value of 2 so what
we're going to do is stop
and we're gonna return to the previous
item set that is the table c3 so the
final table f3 was one three five with
the support value of two and two three
five with the support value of two
Now what we're going to do is generate all the subsets of each frequent itemset. Let's assume that our minimum confidence value is 60%. For every subset S of a frequent itemset I, the output rule is S → (I − S), that is, S recommends I − S, and we keep the rule only if the support of I divided by the support of S is greater than or equal to the minimum confidence value; only then will we proceed further. Keep in mind that we have not used lift so far; we are only working with support and confidence. So, applying the rules to the itemsets of F3, we get rule 1, {1,3} → 5, whose confidence is support{1,3,5} divided by support{1,3}, which is 66%, meaning that if you buy 1 and 3 there's a 66% chance that you'll buy item 5 also. Similarly, the rule {1,5} → 3 means that if you buy 1 and 5 there's a 100% chance that you will buy 3 also. Similarly, if we have a look at rules 5 and 6 here, the confidence value is less than 60%, which was the assumed minimum confidence value, so what we're going to do is reject those rules. Now, an important thing to note here is to have a closer look at rule 5 and rule 3: you see they involve the same items, 1, 5 and 3, in different orders, which can be confusing, so one thing to keep in mind is that the order of the itemsets matters; that will help us create good rules and avoid any kind of confusion. So that's done. Now let's learn how association rules are used in a market basket analysis problem. What we'll be doing is
using the online transactional data of a
retail store for generating Association
rules. So first of all, what you need to do is import the pandas and MLxtend libraries and read the data: from mlxtend.frequent_patterns we're going to import apriori and association_rules. As you can see here, we have the head of the data: we have the invoice number, the stock code, the description, the quantity, the invoice date, the unit price, the customer ID and the country. In the next step, what we will do is the data clean-up, which includes removing spaces from some of the descriptions, dropping the rows that do not have invoice numbers, and removing the credit transactions: we remove the rows which do not have an invoice number, and if the invoice number string contains a C we remove those as well, because those are the credits, and we remove any spaces from the descriptions. As you can see here, that leaves a little over five hundred thousand rows with eight columns. Next, after the clean-up, we need to consolidate the items into one transaction per row, with one column per product; for the sake of keeping the data set small, we're only going to look at the sales for France, so we filter on France and group by the invoice number and description, with the quantity summed up and unstacked, which leaves us with 392 rows and
1563 columns now there are a lot of
zeros in the data but we also need to
make sure any positive values are
converted to a 1 and anything less than
0 is set to 0 so for that we're going to
use this code defining end code units if
X is less than 0 it owns 0 if X is
greater than 1 returns 1 so what we're
going to do is map and apply it to the
whole data set we have here so now that
we have structured the data properly so
the next step is to generate the
frequent item set that has support of at
least 7%
Now this number is chosen so that we get enough frequent item sets to work with. Then what we are going to do is generate the rules with the corresponding support, confidence and lift: we pass the frequent item sets, we have given the minimum support as 0.07, the metric is lift and the minimum threshold is 1, and these are the resulting rules. Now, a few rules have a high lift value, which means the combination occurs more frequently than would be expected given the number of transactions and product combinations, and in most places the confidence is high as well, so these are a few of the observations we get here. If we filter the data frame using standard pandas code for a large lift (at least 6) and a high confidence (at least 0.8), this is what the output is going to look like: one, two, three, four, five, six, seven, eight, so as you can see here we have eight rules, which are the final rules given by the association rule mining. And that is how all the industries, or the large retailers we have talked about, get to know how their products are bought and used, and how exactly they should rearrange them and provide offers on the products, so that people spend more and more money and time in the shop. So that was all about association rule mining; a rough sketch of this whole workflow with pandas and mlxtend is given below.
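Just to make that concrete, here is a rough sketch of the same workflow in Python with pandas and mlxtend. Keep in mind this is a sketch based on the steps described above, not the exact notebook from the video: the file name and the column names assume the standard Online Retail spreadsheet.

```python
# Rough sketch of the market-basket workflow described above
# (assumes the standard "Online Retail" spreadsheet and the mlxtend library).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Read the retail data (the file name here is an assumption for illustration)
df = pd.read_excel('Online_Retail.xlsx')

# Clean up: strip spaces from descriptions, drop rows without an invoice number,
# and remove credit transactions (invoice numbers containing 'C')
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

# Consolidate into one transaction per row, keeping only the sales for France
basket = (df[df['Country'] == 'France']
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

# Encode quantities: anything positive becomes 1, anything <= 0 becomes 0
def encode_units(x):
    return 1 if x >= 1 else 0

basket_sets = basket.applymap(encode_units)

# Frequent item sets with at least 7% support, then the association rules
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)

# Filter for strong rules: large lift and high confidence
print(rules[(rules['lift'] >= 6) & (rules['confidence'] >= 0.8)])
```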
So guys, that's all for unsupervised learning. I hope you got to know about the different formulas and how unsupervised learning works, because, you know, we did not provide any labels to the data; all we did was create some rules and clusters without knowing what the data was, and we did clustering of different types: k-means, fuzzy c-means and hierarchical clustering. So now, coming to the third and last type of learning, which is reinforcement learning.
so what reinforcement learning is it's a
type of machine learning where an agent
is put in an environment and it learns
to behave in this environment by
performing certain actions and observing
the rewards which it gets from those
actions so a reinforcement learning is
all about taking an appropriate action
in order to maximize a reward in the
particular situation. In supervised learning, the training data comprises the input and the expected output, so the model is trained with the expected output itself; but when it comes to reinforcement learning, there is no expected output: the reinforcement agent decides what actions to take in order to perform a given task, and in the absence of a training dataset it is bound to learn from its own experience. So
let's understand reinforcement learning
with an analogy so consider a scenario
wherein a baby is learning how to walk
Now this scenario can go two ways. First, the baby starts walking and makes it to the candy; since the candy is the end goal, the outcome is positive, the baby is happy: positive reward. Now coming to the second scenario, the baby starts walking but falls due to some hurdle in between; the baby gets hurt and does not get to the candy, the outcome is negative, the baby is sad: negative reward. Just like we humans learn from our mistakes by trial and error, reinforcement learning is similar: we have an agent, which is the baby, a reward, which is the candy, and many hurdles in between, and the agent is supposed to find the best possible path to reach the reward. So guys, if you have a look at
some of the important reinforcement
learning definitions: first of all we have the agent, which is the reinforcement learning algorithm that learns from trial and error. Then the environment is the world through which the agent moves, including the obstacles which the agent has to conquer. The actions A are all the possible steps that the agent can take, and the state S is the current condition returned by the environment. Then we have the reward R, an instant return from the environment used to appraise the last action. Next we have the policy, pi, which is the approach that the agent uses to determine the next action based on the current state, and the value V, which is the expected long-term return with discount, as opposed to the short-term reward. Then again we have the action value Q, which is similar to the value except that it takes an extra parameter, the current action A. Now let's talk
about reward maximization for a moment
Now, a reinforcement learning agent works based on the theory of reward maximization; this is exactly why the RL agent must be trained in such a way that it takes the best action, so that the reward is maximum. Now the collective reward at a particular time, for the respective actions, is written as G(t) = R(t+1) + R(t+2) + R(t+3) and so on. Now this equation is an ideal representation of rewards; generally things do not work out like this while summing up the cumulative rewards. Let me explain this with a small example: in the figure you see a fox, some meat and a tiger. Our reinforcement learning agent is the fox, and his end goal is to eat the maximum amount of meat before being eaten by the tiger.
Since this fox is a clever fellow, he eats the meat that is closer to him rather than the meat which is close to the tiger, because the closer he goes to the tiger, the higher are his chances of getting killed. As a result, the rewards near the tiger, even if they are bigger meat chunks, will be discounted; this is done because of the uncertainty factor, that the tiger might kill the fox.
Now the next thing to understand is how the discounting of rewards works. To do this, we define a discount rate called gamma; the value of gamma is between 0 and 1, and the smaller the gamma, the larger the discount, and vice versa. So our cumulative discounted reward is G(t) = sum over k from 0 to infinity of gamma^k * R(t+k+1), where gamma lies between 0 and 1. A tiny example of computing such a discounted return is shown below.
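```python
# Tiny illustration of the discounted return G(t) = sum_k gamma^k * R(t+k+1)
# (the reward values here are made up purely for illustration).
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

future_rewards = [1, 0, 0, 10]            # hypothetical rewards R(t+1), R(t+2), ...
print(discounted_return(future_rewards))  # 1 + 0 + 0 + 0.9**3 * 10 = 8.29
```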
But if the fox decides to explore a
bit it can find bigger rewards that is
this big chunk of meats this is called
exploration so the reinforcement
learning basically works on the basis of
exploration and exploitation
So exploitation is about using the already known, exploited information to heighten the rewards, whereas exploration
is all about exploring and capturing
more information about the environment
There is another problem, which is known as the K-armed bandit problem. The K-armed bandit is a metaphor representing a casino slot machine with K pull levers, or arms; the user or customer pulls any one of the levers to win a projected reward, and the objective is to select the lever that will provide the user with the highest reward. Now here comes the epsilon-greedy algorithm: it tries to be fair to the two opposite goals of exploration and exploitation by using a mechanism a bit like flipping a coin; if you flip a coin and it comes up heads you should explore for a moment, but if it comes up tails you should exploit, that is, take whatever action seems best at the present moment. So with probability 1 minus epsilon the epsilon-greedy algorithm exploits the best known option, with probability epsilon by 2 it explores the best known option, and with probability epsilon by 2 it explores the worst known option. A minimal sketch of the standard epsilon-greedy rule is given below.
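```python
# Minimal sketch of the standard epsilon-greedy rule for a K-armed bandit
# (this is the common "random arm with probability epsilon" variant, and the
# payout probabilities below are made up purely for illustration).
import random

true_win_prob = [0.2, 0.5, 0.7]   # hypothetical payout probability of each arm
estimates = [0.0, 0.0, 0.0]       # running estimate of each arm's value
pulls = [0, 0, 0]
epsilon = 0.1

for step in range(10000):
    if random.random() < epsilon:
        arm = random.randrange(len(estimates))    # explore: pick a random arm
    else:
        arm = estimates.index(max(estimates))     # exploit: best estimate so far
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]   # incremental mean

print(estimates, pulls)   # the best arm should end up pulled most often
```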
Now let's talk about the Markov
decision process the mathematical
approach for mapping a solution in
reinforcement learning is called Markov
decision process which is MDP in a way
the purpose of reinforcement learning is
to solve a Markov decision process. Now the following parameters are used to attain a solution: a set of actions A, a set of states S, the reward R, the policy pi and the value V, and we have the transition function T, the probability that an action taken from one state leads to another state. Now, to briefly sum it up, the agent must take an action to transition from the start state to the end state; while doing so, the agent receives a reward R for each action it takes. The series of actions taken by the agent defines the policy pi, and the rewards collected define the value V. The main goal here is to maximize the rewards by choosing the optimal policy. Now let's
take an example of choosing the shortest
path now consider the given example here
so what we have is given the above
representation our goal here is to find
the shortest path between a and D each
edge has a number linked to it and this
denotes the cost to traverse that edge
now the task at hand is to traverse from
point A to D with the minimum possible
cost. In this problem, the set of states is denoted by the nodes A, B, C, D, the actions are the traversals from one node to another, for example A to B or C to D, the reward is the cost represented by each edge, and the policy is the path taken to reach the destination, for example A to C to D. So you start off at node A and take baby steps to your destination; initially only the next possible nodes are visible to you. If you follow the greedy approach, you take the most optimal step, that is, choosing A to C instead of A to B. Now you are at node C and want to traverse to node D; you must again choose the path wisely, choosing the path with the lowest cost. We can see that A to C to D has the lowest cost, and hence we take that path. To conclude, the policy is A to C to D and the value is 120.
So let's understand the Q-learning algorithm, which is one of the most used reinforcement learning algorithms, with the help of an example.
So we have five rooms in a building connected by doors, and each room is numbered from 0 through 4; the outside of the building can be thought of as one big room, which is room number 5. Now the doors from rooms 1 and 4 lead into the building from room 5, the outside. Let's represent the rooms on a graph, with each room as a node and each door as a link. So as you can
see here we have represented it as a
graph and our goal is to reach the node
5 which is the outer space so what we're
going to do in the next step is associate a reward value to each door: the doors that lead directly to the goal will have a reward of 100, whereas the doors that do not directly connect to the target have a reward of 0. And because the doors are two-way, two arrows are assigned between each pair of connected rooms, and each arrow carries an instant reward value. After that, the terminology in Q-learning includes the terms states and actions: room 5 represents a state, and the agent's movement from one room to another represents an action; in this figure a state is depicted as a node, while an action is represented by the arrows. So, for example, let's say an agent wants to traverse from room 2 to room 5: the initial state is going to be state 2, then the next step is from state 2 to state 3, and next it moves from state 3 to either state 1 or state 4; if it goes to 4, it reaches state 5. That's how you represent the whole traversal of any particular agent in all of these rooms, with the states as nodes and the actions as arrows. So we can put this
state diagram and instant reward values
into a reward table which is the matrix
R. So, as you can see, the minus 1 in the table represents the null values: for example, you cannot go from room 1 to room 1 through a door, and since there is no door to go from room 1 to room 0, that entry is also minus 1. So minus 1 represents the null values, whereas 0 represents a zero reward and 100 represents the reward for going to room 5. One more important thing to note here is that for any room that connects directly to room 5, going to room 5 gives a reward of 100. So what we need to do is add
another matrix Q, representing the memory of what the agent has learned through experience. The rows of matrix Q
represent the current state of the agent
whereas the columns represent the
possible action leading to the next
state. Now the formula to calculate the Q matrix is: Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)], where gamma is the discount parameter we discussed earlier, which ranges from 0 to 1. So let's understand
this with an example. So here are the nine steps which any Q-learning algorithm typically has: first of all, set the gamma parameter and the environment rewards in the matrix R; then initialize the matrix Q to 0; select a random initial state; set the initial state as the current state; select one among all the possible actions for the current state; using this possible action, consider going to the next state; when you get the next state, get the maximum Q value for this next state based on all the actions; compute
the Q value using the formula; and repeat the above steps until the current state equals the goal state. So the first step is to set the values of the learning parameters: gamma, which is 0.8, and the initial state as room number 1. Next, initialize the Q matrix as a zero matrix; on the left-hand side, as you can see here, we have the Q matrix with all the values as 0. Now from room 1
you can either go to room 3 or room 5, so let's select room 5 because that's our end goal. Then, for going to room 5, we calculate the maximum Q value for this next state based on all possible actions: Q(1, 5) = R(1, 5) + 0.8 * max[Q(5, 1), Q(5, 4), Q(5, 5)], where R(1, 5) is 100 and 0.8 is the gamma. Since the Q values are all initialized to zero, the maximum of Q(5, 1), Q(5, 4) and Q(5, 5) is zero as of now, so the final Q value for Q(1, 5) is 100 + 0.8 * 0 = 100. So that's how we are going to update our Q matrix: the position (1, 5) in the second row gets updated to 100. So the first episode is done. Now,
for the next episode we start with a randomly chosen initial state; let's assume that the state is 3. From room number 3 you can either go to room number 1, 2 or 4, so let's select the option of room number 1, because from our previous experience we have seen that room 1 is directly connected to room 5. For going to room 1, calculate the maximum Q value for this next state based on all possible actions: Q(3, 1) = R(3, 1) + 0.8 * max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * 100, which gives us the value 80, so the matrix Q gets updated. Now for the next episode, the next state, 1, becomes the current state, and we repeat the inner loop of the Q-learning algorithm because state 1 is not the goal state. From 1 you can either go to 3 or 5, so let's select 5 as that's our goal; since Q(1, 5) is already known to the agent, the Q matrix remains the same. And that is how you select random starting points, fill up the Q matrix, and see which path will lead us to the goal with the maximum reward points.
Now what we are going to do is the same thing in code, using Python. So first we import numpy as np, and we take the R matrix as we defined it earlier, where the minus 1 entries are the null values, the zeros are the doors that provide a reward of 0, and 100 is the reward for reaching the goal. Then we initialize the Q matrix to zeros, set gamma to 0.8 and set the initial state as 1. Next we have a function which returns all the available actions for the state given as an argument, so if we call it with the given state, we get the available actions in the current state. Then we have another function, the sample next action, which chooses at random which action is to be performed from within the range of all the available actions, and finally we have the action, which is the sampled next action from the available actions. Then again we have another function, update, which updates the Q matrix according to the state and action selected, using the Q-learning formula. So initially our Q matrix is all zeros; what we are going to do is train it over 10,000 iterations and see what the output of the Q values is. As the agent learns more through further iterations, it will finally reach convergence values in the Q matrix. The Q matrix can then be normalized, that is, converted to percentages, by dividing all the non-zero entries by the highest number, which is 500 in this case. Once the matrix Q gets close enough to the state of convergence, the agent has learned the most optimal paths to the goal state. So what we are going to do next is divide the Q matrix by its maximum value and multiply by 100, so that we get a normalized
matrix. Now, once the Q matrix gets close enough to the state of convergence, the agent has learned all the paths, and the optimal path given by the Q-learning algorithm is: if it starts from 2, it will go to 3, then to 1 and then to 5; or from 2 it can go to 3, then 4, then 5, which gives the same result. So as you can see here, the output given by the Q-learning algorithm is the selected path 2, 3, 1, 5, starting from state 2. So this is how exactly a reinforcement learning algorithm works: it finds the optimal solution using the paths, the given actions and rewards, and the various other definitions, or the various other challenges I would say; actually the main goal is to get the maximum reward and the maximum value from the environment, and that's how an agent learns through its own paths, going through millions and millions of iterations, learning which path will give what reward. So that's how the Q-learning algorithm works, and that's how it works in Python as well, as I showed you; a minimal sketch of this Q-learning loop is given below.
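```python
# Minimal numpy sketch of the Q-learning loop for the rooms example described above.
# The reward matrix follows the usual six-room layout with room 5 as the goal;
# the exact door layout in the video's figure may differ slightly, so treat this
# as an illustration rather than the exact code shown.
import numpy as np

# R: -1 = no door (null), 0 = door with zero reward, 100 = door to the goal (room 5)
R = np.array([[-1, -1, -1, -1,  0, -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1, -1],
              [-1,  0,  0, -1,  0, -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]], dtype=float)

Q = np.zeros_like(R)
gamma = 0.8

def available_actions(state):
    return np.where(R[state] >= 0)[0]      # doors that actually exist

def update(state, action):
    # Q(state, action) = R(state, action) + gamma * max Q(next state, all actions)
    Q[state, action] = R[state, action] + gamma * np.max(Q[action])

# Train over many episodes with random starting states and random actions
for _ in range(10000):
    state = np.random.randint(0, 6)
    action = np.random.choice(available_actions(state))
    update(state, action)

print((Q / np.max(Q) * 100).round())       # normalized Q matrix

# Follow the learned policy from room 2 to the goal (room 5)
state, path = 2, [2]
while state != 5:
    state = int(np.argmax(Q[state]))
    path.append(state)
print(path)                                 # e.g. [2, 3, 1, 5] or [2, 3, 4, 5]
```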
So now that you have a clear idea of the different machine learning algorithms, how they work, the different phases of machine learning, the different applications of machine learning, how supervised learning works, how unsupervised learning works, how reinforcement learning works, what to choose in which scenario and what the different algorithms under all of these types of machine learning are, let's move forward to the next part of our session, which is understanding artificial intelligence, deep learning and machine learning.
well data science is something that has
been there for ages nonetheless and data
science is the extraction of knowledge
from data by using scientific techniques
and algorithms people usually have a
certain level of dilemma or I would say
a certain level of confusion when it
comes to differentiating between the
terms artificial intelligence machine
learning and deep learning so don't
worry I'll clear all of these doubts for
you artificial intelligence is a
technique which enables machine to mimic
human behavior now the idea behind
artificial intelligence is fairly simple
yet fascinating which is to make
intelligent machines that can take
decisions on their own now for years it
was thought that computers would never
match the power of the human brain well
back then we did not have enough data
and computational power but now with big
data coming into existence and with the
advent of GPUs artificial intelligence
is possible now machine learning is a
subset of artificial intelligence
technique which uses statistical method
to enable machines to improve with
experience whereas deep learning is a
subset of machine learning which makes
the computation of multi-layer neural
networks feasible; it uses neural networks to simulate human-like decision-making. So as you can see, if we talk about the data science ecosystem, we have artificial intelligence, machine learning and deep learning, with deep learning being the innermost circle, and it is very much required for machine learning as well as artificial intelligence. But why was deep learning required? For that, let's understand the need for deep learning.
intelligence was machine learning and
machine learning was a subset of ei play
it deals with the extraction of patterns
from the last dataset haslam la dataset
was not a problem what was a problem was
machine learning algorithms could not
handle the hight dimensional data where
we have a large number of inputs and
outputs which rounds thousands of
dimensions handling and processing such
type of data becomes very complex and
resource exhaustion now this is also
termed as the curse of dimensionality
now another challenge faced by machine
learning was to specify the features to
be extracted so as we saw earlier in all
the algorithms which are discussed now
we had to specify the features to be
extracted now this plays an important
role in protecting the outcome as well
as in achieving better actress therefore
without feature extraction the challenge
for the programmer increases as the
effectiveness of the algorithm very much
depends on how insightful the programmer
is now this is where deep learning comes
into picture and comes to the rescue
but deep learning is capable of handling
the high dimensional data and is also
efficient in focusing on the right
features on its own. So what exactly is deep learning? Deep learning is a subset of machine learning, as I mentioned earlier, where similar machine learning algorithms are used to train deep neural networks, so as to achieve better accuracy in those cases where the former was not performing up to the mark.
basically deep learning mimics the way
our brain functions and learns from
experience so as you know our brain is
made up of billions of neurons that allow us to do amazing things; even the brain of a small kid is capable of solving complex problems which are very difficult to solve using supercomputers. So how can we achieve the
same functionality in programs now this
is where we understand artificial neuron
and artificial neural networks so first
of all let's have a look at the
different applications of deep learning
we have automatic machine translation, object classification in photographs, automatic handwriting generation, character text generation, image caption generation, colorization of black and white images, automatic game playing and much more. Now
google lens is a set of vision based
computing capabilities that allows your
smartphone to understand what's going on
in a photo video or any live feed
for instance point your phone at a
flower and google lens will tell you on
the screen which type of flower it is
You can point the camera at any restaurant sign to see the reviews and other recommendations. Now if we talk about machine translation, this is a task where you are given words in some language and you have to translate them into a desired language, say English; when it is applied to text inside a photo, this kind of translation is also a classic example that involves image recognition. And the final application of deep learning which we have here is image colorization, the automatic colorization of black and white images: as you know, earlier we did not have color photographs; back in the 40s and 50s we did not have any color photographs, but through deep learning, by analyzing what shadows are present in the image, how the light is bouncing off and the skin tone of the people, automatic colorization is now possible, and this is all possible because of deep learning.
now deep learning studies the basic unit
of a brain cell called a neuron now let
us understand the functionality of a
biological neuron and how we mimic this
functionality in the perceptron or what
we call is an artificial neuron so as
you can see here we have the image of a
biological neuron so it has a cell body
it has mitochondria and a nucleus, we have the dendrites, we have the axon, we have the nodes of Ranvier, the Schwann cells and the synapse. Now we need not know about all of these; what we mostly need to know about is the dendrite, which receives signals from other neurons, the cell body, which sums up all the inputs, and the axon, which is used to transmit the signals to the
other cells now an artificial neuron or
perceptron is a linear model which is
based upon the same principle and is
used for binary classification
it models a neuron which has a set of
inputs each of which is given a specific
weight and the neuron computes some
functions on these weighted inputs and
gives the outputs it receives n inputs
corresponding to each feature it then
sums up those inputs applies the
transformation and produces an output it
has generally two functions which are
the summation and the transformation, and the transformation is also known as the activation function. So as you can see
here we have certain inputs we have
certain weights we have the transfer
function and then we have the activation
function now the transfer function is
nothing but the summation function here
and it is the schematic for a neuron in
a neural network so this is how we mimic
a biological neuron in terms of
programming. Now the weight shows the effectiveness of a particular input: the more the weight of an input, the more of an impact it will have on the neural network. On the other hand, the bias is an additional parameter in the perceptron which is used to adjust the output along with the weighted sum of the inputs to the neuron, which helps the model in a way that it can best fit the given data.
activation functions translate the
inputs into outputs and it uses a
threshold to produce an output. There are many functions that are used as activation functions, such as linear or identity, unit or binary step, sigmoid or logistic, tanh, ReLU and softmax. Now if we talk about the
linear transformation or the activation
function so a linear transform is
basically the identity function where
the dependent variable has a direct
proportional relationship with the
independent variable now in practical
terms it means that a function passes
the signal through unchanged now the
question arises when to use linear
transform function simple answer is when
we want to solve a linear regression
problem we apply a linear transformation
function. Next in our list of activation functions we have the unit step: the output of a unit step function is either 1 or 0, depending on the threshold value we define. A step function with a threshold value of five is shown here, so let's consider the threshold as five: if the value is less than five, the output will be zero, whereas if the value is equal to or greater than five, then the output will be one. This 'equal to' is very important to consider here, because sometimes people put the 'equal to' on the lower side, but that's not how it is used; rather, it is used on the upper side, where only if the value is greater than or equal to the threshold will the output be one. Now a
sigmoid function is a function that converts an independent variable of near infinite range into simple probabilities between 0 and 1, and most of its output will be very close to either 0 or 1. If you have a look at the function here, it is 1 divided by (1 + e raised to the power of minus beta times x). I am not going into the details of the mathematics of the sigmoid, but it is very much used to convert independent variables of a very large, near infinite range to values between 0 and 1.
now the question arises when to use a
sigmoid transformation function so when
we want to map the input values to a
value in the range of 0 to 1 where we
know the output should lie only between
these two numbers we apply the sigmoid
transformation function. Now, tanh is a hyperbolic trigonometric function; unlike the sigmoid function, the normalized range of tanh is minus 1 to 1. It is very similar to the sigmoid function, but the advantage of tanh is that it can deal more easily with
negative numbers. Now next on our list we have ReLU: the ReLU, or rectified linear unit, transform function only activates a node if the input is above a certain quantity; while the input is below 0 the output is 0, but when the input rises above the threshold (in this case 0), the output has a linear relationship with the input. Now this is very different from a normal linear transformation, because it has this threshold. The question arises here again: when to use a ReLU transformation function? When we want to map an input value x to max(0, x), that is, when the negative inputs should map to 0 and the positive inputs should be output without any change, we apply the rectified linear unit, or ReLU, transformation function. Now the final
transformation function now the final
one which we have is sort max so when we
have four or five classes of outputs the
softmax function will give the
probability distribution of each it is
useful for finding out the class which
has the maximum probability so soft mass
is a function you will often find at the
output layer of a classifier now suppose
we have an input of say the letters of
English words and we want to classify
which letter it is so for that case
we're going to use the sort max function
because in the output we have certain
classes but I would say in English if we
take English we had 26 classes from A to
Z so in that case softmax activation
function is very much important now
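```python
# Small numpy sketch of a few of the activation functions described above;
# the threshold and input values are made up purely for illustration.
import numpy as np

def linear(x):                    # identity: passes the signal through unchanged
    return x

def unit_step(x, threshold=5):    # 1 only when x >= threshold, else 0
    return np.where(x >= threshold, 1, 0)

def sigmoid(x):                   # squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                      # like sigmoid but normalized to the range (-1, 1)
    return np.tanh(x)

def relu(x):                      # max(0, x): negative inputs become 0
    return np.maximum(0, x)

def softmax(x):                   # probability distribution over the output classes
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0, 5.0])
print(relu(x), sigmoid(x), softmax(x))
```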
Now an artificial neuron can be used to implement logic gates. I'm sure you guys are familiar with the working of an OR gate, that is, the output is 1 if any of the inputs is 1. Therefore a
perceptron can be used as a separator or
a decision line that divides the input
set of or gate into two classes the
first class being the inputs having
output as 0 that lies below the decision
line and the second class would be
inputs having output as 1 that lie above
the decision line or the separator so
mathematically, a perceptron can be thought of as an equation of weights, inputs and bias, as you can see here: f(x) = (weight vector · input vector) + bias. So let's go
ahead with our demo understand how we
can implement this perceptron example
which is of an or gate using neural
networks using artificial neuron or the
perceptron and here we're going to use
TensorFlow along with Python. So let's understand what exactly TensorFlow is first, before getting into the demo. So
basically tensor flow is a deep learning
framework by Google
to understand it in a very easy way
let's understand the two terms of
tensorflow which are the tensors and the
flow. So starting with tensors: tensors are the standard way of representing data in deep learning, and they are just multi-dimensional arrays, an extension of two-dimensional tables (matrices) to data with higher dimensions. So as you can see here, first of all we have a tensor of dimension 6, then we have a tensor of dimension 6 by 4, which is 2-D, and again we have a tensor of dimension 6 by 4 by 2, which is 3-D. Now the dimensionality is not restricted to 3; we can have four dimensions, five dimensions, and so on, depending on the number of inputs or the number of classes or the parameters which we provide to a particular neural network or perceptron. A tiny numpy illustration of tensor shapes is given below.
particular perceptron so which brings us
tensorflow intensive flow the
computation is approached as a data flow
graph so we have a tensor and then again
we have a flow in which we suppose for
taking the example here we have the data
we do addition then we do matrix
multiplication then we check the result
if it's good then it's fine and if the
result is not good then we again do some
sort of matrix multiplication or
addition it depends upon the function
what we are using and then finally we
have the output. So if you want to know more about TensorFlow, we have an entire playlist on TensorFlow and deep learning which you should see; I'll give the links to all of these videos in the description box. So let's go ahead with
our demo and understand how we can
implement the or gates using perceptron
So first of all, what we are going to do is import all the required libraries, and here I am going to import only one library, which is the TensorFlow library; so what we are going to do is import tensorflow as tf.
now the next step what we're going to do
is define vector variables for input and
output so for that we need to create
variables for storing the input output
and the bias for the perceptron so as
you can see here we have the training
input and again we have the training
output now what we're going to do next
is define the weight variable and here
we are we will define the tensor
variable of the shape 3 comma 1
and for our weights and we will assign
some random values to it initially; so we are going to use tf.Variable, and we are going to use tf.random_normal to assign random values to the 3 by 1 tensor. Next, what we do is define placeholders for the input and output, so that they can accept external inputs at run time; these will be of type tf.float32, and for x we are going to use a dimension of 3 and for y a dimension of 1. Now, as discussed earlier, the input received by a perceptron is first multiplied by the respective weights, and then all of these weighted inputs are summed together; this sum is then fed to the activation function for obtaining the final result of the OR-gate perceptron. So this is the output we are defining here: it is tf.nn.relu, using the ReLU activation function, applied to the matrix multiplication of the inputs and the weights. In this case I have used the ReLU function, but you are free to use any of the activation functions according to your needs.
Next, what we are going to do is calculate the cost, or error: we need to calculate the cost, which is the squared error, which is nothing but the square of the difference between the perceptron output and the desired output. So the equation will be loss = tf.reduce_sum(tf.square(output - y)). Now the goal of a perceptron is to minimize the loss, or the cost, or the error, so here we are going to use the gradient descent optimizer, which will reduce the loss; it is a very important part of any neural network to use some sort of optimizer. Here we are using the gradient descent optimizer; you can learn more about the gradient descent optimizer in other Edureka videos on deep learning and neural networks.
Now the next step is to initialize the variables: variables defined with tf.Variable are only declared, so we need to initialize them before use. For that we are going to use tf.global_variables_initializer, create a tf.Session, and run the session with the initialization of the variables. So now that all the variables are initialized, coming to the last step, what we are going to do is train our perceptron, that is, update the values of the weights and the biases in successive iterations to minimize the error, or the loss. Here I will be training our perceptron over one hundred epochs.
So as you can see here, for i in range(100), we are going to run the session with the training data x and y as the training input and output, calculate the loss by feeding in x_train and y_train, and print the epoch. So as you can see here, for the first iteration the loss was 2.07, and it keeps coming down: as the iterations increase, the loss decreases, because the gradient descent optimizer is learning how the data behaves, and coming down to the hundredth, or final, epoch, we have a loss of 0.27. We started with 2.07 initially and we ended up with a loss of 0.27, which is very good.
This was how a perceptron works on a particular given data set: it learns about it, and as you saw earlier, we gave a set of inputs, the input variables, we provided weights, we had a summation function, and then we used the ReLU activation function in the code to get the final output; then we trained that particular model for one hundred iterations with the training data so as to minimize the loss, and the loss came down all the way from 2.07 to 0.27. A rough sketch of this OR-gate perceptron, assuming the TensorFlow 1.x-style API, is given below.
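```python
# Rough sketch of the OR-gate perceptron described above, assuming the
# TensorFlow 1.x-style API (tf.placeholder / tf.Session). The variable names
# and the learning rate are assumptions for illustration, not the exact script.
import tensorflow as tf

# Training data: two inputs plus a constant bias column, and the OR outputs
train_in = [[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]]
train_out = [[0.], [1.], [1.], [1.]]

w = tf.Variable(tf.random_normal([3, 1]))      # 3x1 weight tensor, random init
x = tf.placeholder(tf.float32, [None, 3])      # placeholder for the inputs
y = tf.placeholder(tf.float32, [None, 1])      # placeholder for the outputs

output = tf.nn.relu(tf.matmul(x, w))           # weighted sum + ReLU activation
loss = tf.reduce_sum(tf.square(output - y))    # squared error
train = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(100):                   # 100 epochs, as in the walkthrough
        _, cost = sess.run([train, loss], feed_dict={x: train_in, y: train_out})
        print('Epoch', epoch, 'loss', cost)
```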
Well, if you think the perceptron solves all the problems of mimicking a human brain, then you would be wrong. There are two major problems: the first problem is that a single-layer perceptron cannot classify data points that are not linearly separable, and the second is that complex problems involving a lot of parameters cannot be solved by a single-layer perceptron.
now consider the example here and the
complexity with the parameters involved
to take a decision by the marketing team
So as you can see here, for every channel, be it email, direct, paid, referral program or organic, we have a certain number of social media subcategories, such as Google, Facebook, LinkedIn and Twitter, and then we have the ad types, such as search ads, remarketing ads, interest ads and lookalike ads, and again the parameters to be considered are the customer acquisition cost, money spent, leads generated, customers generated and the time taken to become a customer. All of these factors cannot be handled by a single layer of perceptrons: one neuron cannot take in so many
inputs and that is why more than one
neuron would be used to solve these kind
of problems
so neural network is really just a
composition of perceptrons connected in
different ways and operating on
activation functions so for that we have
three different terminologies in a
particular neural network we have the
input layers we have the hidden layers
and we have the output layers. So the input nodes provide information from the outside world to the network and are together referred to as the input layer. The hidden nodes perform computations and transfer information from the input nodes to the output nodes, and a collection of hidden nodes forms a hidden layer; in our image we have one, two, three, four hidden layers. And finally, the
output nodes are collectively referred
to as output layers and are responsible
for computation and transferring
information from the network to the
outside world
now that you have an idea of how a
perceptron behaves the different
parameters involved and the different
layers of neural networks let's continue
this session and see how we can create
our own neural network from scratch in
this image as you can see here we have
given a list of faces first of all the
patterns of local contrast is being
computed in the input layer then in the
hidden layer 1 we get the face features
and in the hidden layer 2 we get the
different features of the face and
finally we have the output layer now if
we talk about training networks and
weights in a particular neural networks
we can estimate the weight values for
our training data using stochastic
gradient descent optimizer
as I mentioned earlier. Now it requires two parameters: the learning rate and the number of epochs. As I mentioned earlier, the learning rate is used to limit the amount by which each weight is corrected each time it is updated, and the number of epochs is the number of times to run through the training data while updating the weights; so in the previous example we had 100 epochs, which means we trained the whole model a hundred times, and these, along with the training data, will be the arguments to the function. As data
scientists or data analysts
or machine learning engineers working on
the hyper parameters is the most
important part because anyone can do the
coding it's your experience and your way
of thinking about the learning rate and
the epochs the model which you are
working the input data you are taking
how much time it will require to train
because time is limited and as you know
these hyperparameters are the things which successful data scientists will be tuning when creating a particular model, and they play a huge role in the model: even a slight difference in the learning rate or the number of epochs can change the model training time, as it may take a longer time to train when you have a large amount of data in a particular data set. These are all things that a data scientist or machine learning engineer keeps in mind while creating the model. So let's create our own neural network, and here we are going to use the MNIST data set. The MNIST data
set consists of 60,000 training samples
and 10,000 testing samples of
handwritten digit images. Now the images are of the size 28 by 28 pixels, and the output can be any digit between 0 and 9. The task here is to train a model which can accurately identify the digit present in the image, so let's see how we can do this using TensorFlow and Python. Firstly, we use the __future__ import to bring the Python 3 print function into Python 2; let's continue with our code.
So next, from tensorflow.examples.tutorials we can take the MNIST data, which is already provided by TensorFlow in its example tutorials data; this is only for the learning part, and later on you can use other data for your own practice. We create mnist using input_data.read_data_sets with one_hot set to True, and we import tensorflow and matplotlib. Next, what we are
going to do is define the hyper
parameters here so as I mentioned
earlier, we have a few hyperparameters, like the learning rate, the batch size and the display step (the display step is not a very big hyperparameter to consider here). The learning rate we have given here is 0.001, and the number of training epochs is 15; that is up to you, because the larger the number of epochs, the more time it will take for the model to train, and here you have to take a decision between the amount of time it takes for the model to train and give the output versus the accuracy you get. Again, we have a batch size of 100; this is one of the most important hyperparameters to be considered, because you cannot take all of the images at once and train on them, so you need to do it batch by batch, and for that we define a batch size of 100. So out of 60,000 images we are going to take 100 at a time as a batch, and the training will run over 15 epochs; the training set has 60,000 images, so you can do the math on how many batches we will require, and each batch will be seen in each of the 15 epochs. The next step is defining the hidden
layers and the input and the classes so
for the first hidden layer I have taken 256; this is the number of perceptrons, or the number of features to be extracted, in the first layer. This number is arbitrary, and you can set it according to your requirements and your needs; for simplicity I am using 256 here, and I am going to use the same for hidden layer 2. Now for the number of inputs I am going to use 784, and that is because, as I discussed earlier, the MNIST data has images of shape 28 by 28, which is 784; so in short we have 784 pixels to be considered in a particular image, and each pixel provides an immense amount of data, so I am taking 784 inputs. And the number of output classes here I am defining as ten, because the output can be any of zero, one, two, three, four, five, six, seven, eight or nine, so the total number of output classes here I am going to use is
ten and again we are going to create x
and y variables X for the input and Y
for the output classes now as you can
see here we have the multi-layer
perceptron in which we have defined all
the hidden layers and the output layers
So layer 1 will first do the matrix multiplication of the input with the weights, add the biases and provide the summation; the output of this layer is then passed through the ReLU activation function and given to layer 2, so as you can see here we have the ReLU activation function for layer 1. Layer 2 will take the output of layer 1, with the weights of the h2 hidden layer and the biases of the b2 layer: it will do the multiplication of layer 1 with the weights, it will add the biases, and then again we will have a ReLU activation function, and the output of this layer 2 will be given to the output layer. So as you can see here, in the final output layer we have the matrix multiplication of layer 2 with the weights of the output layer, plus the biases of the output layer, and what we are going to do is return the output.
So let's define the weights and the biases; here we are taking random values for them. Next, what we are going to do is get the prediction of the multi-layer perceptron using the input, weights and biases, and one more important thing we are going to do here is define the cost: we are going to use tf.reduce_mean with softmax_cross_entropy_with_logits, and here we are using the Adam optimizer, rather than the gradient descent optimizer, with the learning rate provided initially, and what we are going to do is minimize the cost.
So again we are going to initialize all the global variables, and we have two lists, for the cost history and the accuracy history, so as to store all the values while we train our model. We are going to create a session, and in the training cycle, for each epoch in the range of 15, we first initialize the average cost at zero, and the total number of batches is the number of examples in the MNIST set divided by the batch size, which is 100; we loop over all the batches, run the optimization (the backpropagation) and the cost operation to get the loss value, and then we display the logs for each epoch, showing the epoch and the cost at each step. We are also going to calculate the accuracy from the correct predictions and append the accuracy to its list after every epoch, and we will append the cost after every epoch as well; that is what we created the cost history and the accuracy history for. Finally, we will plot the cost history using matplotlib, plot the accuracy history as well, and see how accurate our model is. A condensed sketch of this network, assuming the TensorFlow 1.x-style API used above, is given below.
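```python
# Condensed sketch of the MNIST multi-layer perceptron described above, assuming
# the TensorFlow 1.x-style API and its bundled MNIST tutorial data; this is a
# sketch of the steps just described, not the exact notebook (the plotting of
# the cost and accuracy history is omitted for brevity).
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

learning_rate, training_epochs, batch_size = 0.001, 15, 100
n_hidden_1, n_hidden_2, n_input, n_classes = 256, 256, 784, 10

x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes])),
}
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes])),
}

def multilayer_perceptron(x):
    # two ReLU hidden layers, then a linear output layer (logits)
    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, weights['h1']), biases['b1']))
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['h2']), biases['b2']))
    return tf.matmul(layer_2, weights['out']) + biases['out']

logits = multilayer_perceptron(x)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        avg_cost, total_batch = 0.0, mnist.train.num_examples // batch_size
        for _ in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_x, y: batch_y})
            avg_cost += c / total_batch
        acc = accuracy.eval({x: mnist.test.images, y: mnist.test.labels})
        print('Epoch', epoch + 1, 'cost', avg_cost, 'accuracy', acc)
```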
So let's train it now, and as you can see, at the first epoch we have a cost of 188 and an accuracy of 0.85. If you look just after the second epoch, the cost has reduced from 188 to 42, and now it's 26, and as you can see the accuracy is increasing from 0.85 to 0.9091. When you have reached five epochs you see the cost is diminishing at a huge rate, which is very good. You can use different types of optimizers, be it gradient descent or the Adam optimizer; I will not go into the details of the optimization, because that is another half an hour or one hour to explain what exactly it is and how exactly it works. So as you can see, by the tenth or eleventh epoch we have a cost of 2.4 and the accuracy is 0.94; let's wait a little further until the 15th epoch is done.
So as you can see, in the 15th epoch we have a cost of 0.83 and the accuracy is 0.94; we started with a cost of 188 and an accuracy of 0.85, and we have reached an accuracy of 0.94. So as you can see, this is the graph of the cost: it started from 188 and ended at 0.83. We have the graph of the accuracy, which started from around 0.84 or 0.85 and went all the way to 0.94; as you can see, the 14th epoch reached an accuracy of about 0.947, as you can see here in the graph, and in the 15th epoch we came to an accuracy of about 0.94. Now one might ask the question: the accuracy was higher in that particular epoch, so why has the accuracy decreased? Another important aspect, or hyperparameter, to consider here is the cost: the lower the cost, the more accurate your model will be, so the goal is to minimize the cost, which will in turn increase the accuracy, and the final accuracy here is around 0.94, which is very good. Now this was all about deep learning, neural networks and TensorFlow: how to create a perceptron or a deep neural network, what the different hyperparameters involved are and how a neuron works. So let's
have a look at the companies hiring
these professionals these data
professionals in the data science
environment we have companies all the
way from startups to big giants. Some of the major companies here, as we can see, are Dropbox, Adobe, IBM, Walmart, Chase, LinkedIn and Red Hat, and there are so many more companies. As I mentioned earlier, the demand for these professionals is high, but the number of people qualified to apply is too low, because you need a certain level of experience to understand how things work: you need to understand machine learning to understand deep learning, and you need to understand all the statistics and probability, and that is not an easy task. So you require at least 3 to 6 months of rigorous training, with a minimum of one to two years of practical implementation and project work, I would say, to get into a data science career, if you think that's the career you want to go for. So
Edureka, as you know, provides a data science master program, and we also have a machine learning master program. As you can see, in the data science master program we have Python statistics, R statistics, data science using R, Python for data science, Apache Spark and Scala, AI and deep learning with TensorFlow, and Tableau. So guys, as you can see here, we have 12 courses in this master program, with 250 hours of interactive learning and capstone projects, and as you can see here there is a certain discount going on; the hike in salary you get is much more if you go for data science rather than any other
program. So you can see we have Python statistics, R statistics, data science using Python, Python for data science, Apache Spark and Scala, which is a very important part of data science because you need to know the Hadoop ecosystem, deep learning with TensorFlow, and Tableau, and this is a long, in-depth course. As I mentioned earlier, it's not an easy task, and you do not become a data scientist in one month or in two months; you require a lot of training and a lot of practice to become a data scientist, a machine learning engineer or even a data analyst, because, as you see, a vast list of topics and areas is what you need to cover. And once you cover all of these topics, what you need to do is select an area in which you want to work and the kind of data you are going to be handling, whether it is text data, medical records, or video, audio or images for processing. It is not an easy task to become a data scientist, so you need a very good and a very correct path of learning to become a data scientist.
So guys, that's it for this session. I hope you enjoyed the session and got to know about data science, the different aspects of data science, how it works and the whole path from statistics and probability through machine learning and deep learning, finally coming to AI. This was the path of data science, and I hope you enjoyed this session; if you have any queries regarding this session or
any other session please feel free to
mention it in the comment section below
and we'll happily answer all of your
queries till then thank you
and happy learning. I hope you have
enjoyed listening to this video, please
be kind enough to like it and you can
comment any of your doubts and queries
and we will reply to them at the earliest.
Do look out for more videos in our
playlist and subscribe to edureka!
channel to learn more. Happy learning!
