Hey guys, welcome to this session by Intellipaat. So data is today's oil, and data scientists are the rock stars of this era.
And if you want to be a data scientist, you
obviously have to attend a data science interview
and the interviewer over there could ask a wide range of questions related to statistics, machine
learning, and complicated puzzles. So we've put together this session on data science interview questions so that you can ace a data science interview at any of the major companies.
So before we start off with all of the Q&As, let's actually go through this interesting puzzle.
So let's say there are five lanes on a racetrack and 25 horses in total. Now you would have
to find out the minimum number of races to be conducted to determine the three fastest
horses. So how would you do it?
Now, what do you understand by linear regression? Well, linear regression is a supervised learning
algorithm which helps us in finding the linear relationship between two variables.
So one is the predictor or the independent variable and the other is the response or the dependent
variable. So we try to understand how does the dependent
variable change with the independent variable.
So let's say there's this telecom company called Neo.
And now the data scientist at this company
wants to understand if there's a linear relationship
between the monthly charges incurred by the customer and the tenure of the customer.
So, he collects all of the data and builds
a linear model between the monthly charges
and the tenure. So here, monthly charges would be the dependent
variable and tenure would be the independent variable. And in linear regression,
there could be more than one independent variable. So if there's just one independent variable
it is known as simple linear regression and if there's more than
one independent variable, it is known as multiple linear regression.
So guys this is the underlying concept of
linear regression where we have one dependent
variable and a single or multiple independent variables.
And we try to understand the linear relationship
between the dependent variable and the independent variables.
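As a quick illustration, the Neo example above could be sketched in Python with a hand-rolled ordinary-least-squares fit. The tenure and monthly-charges numbers here are made up for illustration, not taken from any real dataset.

```python
# Simple linear regression by ordinary least squares:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)

def fit_simple_linear_regression(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical Neo data: tenure (months) vs. monthly charges
tenure = [1, 2, 3, 4, 5]
charges = [20, 25, 30, 35, 40]  # kept perfectly linear for clarity

slope, intercept = fit_simple_linear_regression(tenure, charges)
print(slope, intercept)  # slope = 5.0, intercept = 15.0
```

With more than one independent variable (multiple linear regression), the same idea generalizes, but the coefficients are usually obtained with a library such as scikit-learn rather than by hand.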
Now we have our next question over here.
So the question is, What do you understand
by logistic regression?
Well, logistic regression is actually your
classification algorithm which can be used
when the dependent variable is binary. So let's take this example.
So here we are trying to determine whether
it will rain or not on the basis of temperature
and humidity that is temperature and humidity are the independent
variables and rain would be our dependent variable.
That is, we're trying to understand whether it will rain or not on the basis of the temperature
and the humidity. And again, the logistic regression algorithm actually produces an S-curve. So let's say the x axis over here represents the number of runs scored by Virat Kohli
and the y axis represents the probability
of Team India winning the match.
So let's say this point over here, it denotes
50 runs.
So what we can see from this graph is that if Virat Kohli scores more than 50 runs,
then there is a greater probability for Team
India to win the match.
And similarly if Virat Kohli scores less than
50 runs then the probability of Team India
winning the match is less than 50 percent. So let's take this value here.
So let's say the number of runs scored by
Virat Kohli is around 60.
So if the number of runs scored by Virat Kohli is around 60 then the probability of
Team India winning the match would be let's say around 65 percent or so.
Again let's take this value here. So let's say this is around 97 runs or 95 runs,
and if Virat Kohli scores 95 or 97 runs then
the probability of Team India winning the
match is one, which is 100 percent, isn't it? So similarly, take this value here.
So let's say this is around 5 runs or 10
runs.
So if Virat Kohli scores 5 or 10 runs
then the probability of team India winning
the match is 0. So basically in logistic regression the y
value lies within 0 and 1 range. And this is how logistic regression works.
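The S-curve described above is the logistic (sigmoid) function. Here's a minimal Python sketch; the midpoint of 50 runs and the steepness value are assumptions chosen to mirror the example, not a fitted model.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: log-odds of Team India winning as a linear function of
# runs, centred so that 50 runs corresponds to a 50% chance of winning.
def win_probability(runs, midpoint=50, steepness=0.1):
    return sigmoid(steepness * (runs - midpoint))

print(win_probability(50))   # exactly 0.5 at the midpoint
print(win_probability(95))   # close to 1 for a big score
print(win_probability(5))    # close to 0 for a small score
```

Whatever the input, the output always lies strictly between 0 and 1, which is why the sigmoid output can be read as a probability.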
Now let's head on to the next question. So what is the confusion matrix?
So confusion matrix is actually a table which is used to estimate the performance of a model.
It tabulates the actual values and the predicted values in a 2x2 matrix.
So these are the actual values and these are the predicted values.
So this cell that you see here is the true positives. So this denotes all of those
records where the actual values were true and the predicted
values were also true. So these denote all of the true positives.
After that we have the false negatives. So
false negatives denote all of those records
where the actual value was true but the predicted value was false.
So where the actual value is true but the predicted value is false
that  is known as a false negative. Then we have false positives.
So in false positive, the actual value is false but the predicted value is true.
And such values are known as false positives.
And finally we have the true negatives where the actual values are false and the predicted
values are also false. So if you want to get the correct values then
correct values would basically represent all of the true positives and the true negatives.
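The tabulation described above can be sketched in plain Python; the `confusion_matrix` helper and the sample actual/predicted lists here are made up for illustration.

```python
def confusion_matrix(actual, predicted):
    """Tabulate binary actual vs. predicted values into the four 2x2 cells."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # actual True, predicted True
        elif a and not p:
            fn += 1      # actual True, predicted False
        elif not a and p:
            fp += 1      # actual False, predicted True
        else:
            tn += 1      # actual False, predicted False
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

actual    = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
print(confusion_matrix(actual, predicted))
# {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 2}
```

The correct predictions are the TP and TN counts on the diagonal; the FN and FP counts are the two kinds of mistakes.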
And this is how the confusion matrix actually works. Now let's head on to the next question.
So, what do you understand by true positive rate and false positive rate?
So let's start with true positive rate.
So in machine learning, true positive rate
which is also referred to as sensitivity or
recall is used to measure the percentage of
actual positives which are correctly identified.
So the formula for true positive rate is true
positives divided by all the positives.
So I am stating it again, true positive rate is basically the measure of the percentage
of actual positives which have been correctly identified.
Now let's look at false positive rate.
So false positive rate is basically the probability of falsely rejecting the null hypothesis
for a particular test. So the false positive rate is calculated as the ratio between the number of negative events wrongly categorized as positive, that is, all of the false positives, upon the total number of actual negative events.
So this is how we can calculate true positive rate and false positive rate.
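The two formulas above can be written out as a small Python sketch; the confusion-matrix counts used here are hypothetical.

```python
def true_positive_rate(tp, fn):
    # TPR (sensitivity / recall) = TP / (TP + FN),
    # i.e. true positives over all actual positives
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    # FPR = FP / (FP + TN),
    # i.e. false positives over all actual negatives
    return fp / (fp + tn)

# Hypothetical counts taken from a confusion matrix
tp, fn, fp, tn = 80, 20, 10, 90
print(true_positive_rate(tp, fn))    # 0.8 -> 80% of actual positives caught
print(false_positive_rate(fp, tn))   # 0.1 -> 10% of actual negatives misflagged
```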
Now we have the next question and we are supposed to explain what is ROC Curve.
So ROC Curve which actually stands for Receiver Operating Characteristics is
basically a plot between the true positive
rate and the false positive rate and it helps
us to find out the right trade-off between
the true positive rate and the false positive
rate for different probability thresholds
of the predicted values.
So the closer the curve is to the upper left
corner, the better the model is.
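As a sketch of how this plot is built, here is one way to trace the (FPR, TPR) points for different probability thresholds in plain Python; the scores, labels, and thresholds are made up for illustration.

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, predict positive when score >= threshold,
    then record the (FPR, TPR) pair that the threshold produces."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical predicted probabilities and true labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
for fpr, tpr in roc_points(scores, labels, thresholds=[0.2, 0.5, 0.85]):
    print(fpr, tpr)
```

Plotting these pairs with FPR on the x axis and TPR on the y axis gives the ROC curve; in practice a library routine such as scikit-learn's `roc_curve` does this sweep for you.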
Or in other words, whichever curve has the greater area under it, that would be the better model.
So let's say we have this curve over here
and let's say there is another curve which
goes like this which is nearer to this upper
left corner, then in that case, since the second
curve covers greater area under it that would be a better model
than the first model. So this ROC curve helps us to find out the area
under the curve as well as the right trade-off
between the true positive rate and the false
positive rate. Now we'll head on to the next question. So what do you understand by a decision tree?
So, a decision tree is a supervised learning algorithm which is used for both classification
and regression, right. So decision tree can be used for both classification
purpose as well as regression purpose. So in this case, the dependent variable can
be both a numerical value as well as a categorical value.
So there is a flowchart-like structure where the topmost node is known as the root node, the internal nodes with children are known as the branch nodes, and the final nodes without children are known as the leaf nodes. So here, each node actually denotes a test
on an attribute and each edge represents an outcome of the test
and each leaf node holds a class label. So let's say this first node over here.
We're trying to determine the age of the patient. So let's say we check if the age of the patient is greater than 50.
If the condition is true, we'll come here. If the condition is false, then we'll come here. After that, over here, we'll check if the person smokes or not.
And if that person smokes, we'll come here. If that person doesn't smoke, we'll come here. Similarly, over here, the test condition could be whether the patient has any children or not. If the condition is true, we'll come here; if the condition is false, we'll come here. So this is how the decision tree works.
And finally we'll have class labels over 
here.
So these would represent individual class
labels.
So let's say this represents that the person
has cancer.
This represents that the person does not have cancer.
Similarly this would represent the person
has cancer.
And again this would represent that the person does not have cancer.
So in a decision tree, we basically have a series of test conditions which would give us the final class labels.
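The patient walkthrough above can be sketched as a hand-written tree of test conditions in Python; the split values and class labels here are made up to mirror the example, not learned from data.

```python
def predict_cancer_label(age, smokes, has_children):
    """A hand-written decision tree mirroring the patient example.
    Each `if` is a test on an attribute; each return is a leaf's class label."""
    if age > 50:                        # root node: test on age
        if smokes:                      # branch node: test on smoking
            return "has cancer"         # leaf node
        return "does not have cancer"   # leaf node
    if has_children:                    # branch node on the other side
        return "has cancer"             # leaf node
    return "does not have cancer"       # leaf node

print(predict_cancer_label(age=60, smokes=True, has_children=False))
# has cancer
```

A learned tree (for example, scikit-learn's `DecisionTreeClassifier`) picks these split conditions automatically from the training data instead of having them written by hand.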