Welcome to the training on Math for
data science professionals.
This is the first lecture, which is approximately eight minutes long.
In this first lesson we will focus on defining what exactly data science is
and what a typical data science process looks like.
In Chapter one we will first start with a very basic definition of data science. Then we will
take up some sample data and see a micro workflow of how data science actually works.
Then we will refine our definition into a more in-depth and professional one.
At the end of the lesson we will talk about two golden rules that will help you focus on learning math for data science.
But the first question is:
what exactly is data science?
Data science is all about
converting data
to information.
We have a huge amount of data lying in a database, in an Excel sheet, or wherever,
and we want to present that data to management or to a decision maker
as a one- or two-line summary. That is what data science is all about.
On the left-hand side I have a huge amount of data covering approximately 100 days.
If we give this much data to a manager, he won't be able to make any kind of decision.
We need to take all these single transactions
and create a summary: two or three lines of information
that represent this data. By looking at this information the decision maker can make better decisions, which can improve
business sales or give some new dimension to the business.
That's what data science is all about.
On the left-hand side I have the days and the sales made on each day.
We can give some summary
information to the decision maker.
The easiest one is the average:
the average tells us how much sales we are making per day.
We can also give some information like
the minimum sales we have made
on any of these days, i.e.
27,
and what is the maximum sales we have made.
We can also use
the Mode.
The Mode is
the value that repeats most often in the data.
Over here 27 is repeated a lot of times, 29 is repeated, 22 is repeated.
It tells us
the most repeated sales figure,
which is 80.
It is the value
that is repeated the maximum number of times in a dataset.
Here we have given 4 important
summaries to the decision maker;
by using these summaries he can make decisions.
Let us not use the word Average;
we will use Mean instead.
We can remember all of these as the M's: Mean, Min, Max, Mode.
We have applied 4 statistical formulas here
and have arrived at a summary which we can give to the decision maker.
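The lecture computes these four summaries in Excel. As a minimal sketch of the same idea, the snippet below uses Python's standard `statistics` module. The `daily_sales` list is made-up illustration data, not the lecture's sheet.

```python
import statistics

# Hypothetical daily sales figures (illustration only, not the lecture's data).
daily_sales = [27, 29, 80, 35, 80, 88, 29, 80, 112]

summary = {
    "mean": statistics.mean(daily_sales),   # average sales per day
    "min": min(daily_sales),                # lowest sales on any day
    "max": max(daily_sales),                # highest sales on any day
    "mode": statistics.mode(daily_sales),   # most frequently repeated value
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Four numbers like these are the kind of short summary a decision maker can read at a glance instead of the full list of transactions.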
When a decision maker looks at these 4 statistics
he can make some decisions or come to some conclusions.
He can see that the average sales is 582,
but at some point we also made sales of 70000.
If we can make sales of 70000,
why are we making so much less per day on average?
On the last day
there is an entry for a bulk order;
there must have been some kind of bulk order,
and because of this
the Max looks very high and the Mean does not look right.
The decision maker can concentrate more on bulk orders
so that his business can grow multifold.
What has really happened
in the past 4 minutes?
In the past 4 minutes
we have applied Data Science, and while doing so
we have done three things.
The first is that we used
statistical math: we used math formulas
to arrive at summary values.
So the first thing is
we used statistics.
The second thing we did here is we used Excel:
we used Excel knowledge and formulas like Mode, Max and so on. Because of this Excel knowledge we were able to
implement these formulas easily; we had good IT knowledge of Excel.
The most important thing is
domain knowledge.
When we saw this 70000 max value
we were able to figure out
that this must be some kind of bulk order.
Because we had domain knowledge of retail or selling, we were able to figure out why this difference was there.
We used three things:
Math, IT and Domain Knowledge.
Let us not just say Excel;
tomorrow we may use Python or R, so in general we used IT, Math and Domain Knowledge
to arrive at this summary information.
That is exactly what Data Science is.
Data Science is a multi-disciplinary field.
Initially we had started with a layman's definition.
The more professional definition is: it is a multi-disciplinary field
which
comprises
statistical math and probability math,
IT knowledge, where we can use Excel formulas or Python and R,
and domain knowledge.
Throughout these lessons
I would advise and request you to follow two golden rules if you really want to
be successful in the Data Science field.
The first rule is:
do not solve math,
but try to apply math.
When we wanted to calculate the average
of this data,
we just used the average formula of Excel.
Rather than
solving math with pen and paper, which won't really help us,
try to apply math.
The second golden rule is:
stick to Excel and do not jump to Python and R.
A lot of people start with Python and R,
get into programming languages
and installations,
but then they lose focus on the statistical math.
Throughout these lessons use Excel and stick to applying math rather than solving math.
That brings us to the end of this session.
At the end of the session we are giving you two Excel sheets. One is the sheet on which we have shown the demo.
The other Excel sheet is the practice sheet.
This is a practice sheet in which we are giving another dataset,
and we want you to find the Mean,
Min, Max and Mode.
As a hint, there are two oddities in this summary,
so try to find them.
After every lesson we will give you a practice sheet
and the sheet which we used in the tutorial,
and at the end of the video we will have a small Q&A
deck flashed up; you can have a look and summarize what you have learnt in the lesson.
In Lesson 2 we will talk about descriptive analysis.
Welcome to Lesson 2.
Lesson 2 covers 4 important topics:
descriptive statistics, i.e. Mean, Median, Mode, Max and Min;
Spread and the importance of Spread; Outliers; and Quartiles.
This whole lesson is approximately 15 to 16 minutes long and has 4 chapters.
The first thing a Data Science engineer should do when he gets any data for analysis is
to identify whether the data is spread or concentrated.
If the data is highly spread,
then the values inside that dataset are
far away from each other.
If it is concentrated,
then the data revolves around certain values.
We have two datasets of sales here.
Plot these datasets using scatter plots.
We have plotted Total Sales 1.
Create one more scatter plot
which will plot
the second one.
First we will concentrate on Total Sales 1.
In Total Sales 1, when we plotted this scatter plot,
a lot of the data is concentrated around a certain section.
Pull all of this data up and try to do a small analysis.
A lot of
the data is
revolving around 80 to 88,
or 80 to 90.
A lot of this data is actually revolving
around a certain value.
There is another group as well,
from 27 to something.
There is a
small concentration of data here as well,
which must be 27 to 35,
and we can see some data here which is almost 112.
This says the Total Sales 1 data
is
highly concentrated;
it is not dispersed,
or we can say the spread is not too much.
This other data out here is a straight line.
Add data labels and check how the data is spread.
The data has a high spread, from 16 to 54.
This is almost a linear graph;
we definitely won't find that kind of pattern of high concentration here.
The concentrated kind of dataset
is termed a
dataset which follows
the measures of central tendency.
The straight-line kind of dataset is a different kind, which
does not really have any central tendency,
and the way of doing analysis for such a dataset is different.
The first thing a Data Science engineer should find out
is whether the dataset follows the measures of central tendency or not.
The analysis will depend on whether the data has a high spread or a low spread.
The problem here is that when we have a lot of records, millions of records, it is very difficult to plot the graph and do these things.
We need some mathematical formulas which can run quickly
and tell us whether the data has a high spread or a low spread.
Creating such visual graphs for
millions of records is almost impossible. The whole point of showing you this graph was to show
what concentrated data looks like visually,
what spread data looks like,
and what data with a central tendency looks like.
Now, how can we use mathematical formulas
to find out,
without plotting a graph, whether the data follows the measures of central tendency or not?
The first thing we need to do here is to get a description of this data,
the statistical description of all the data, before we start
calculating the spread measure.
For that we have the 5 M formulas. The first one is Mean,
then we have Median,
Mode,
Max
and Min of the whole dataset.
First we will start with the Mean.
Mean means Average;
it gives us a summary of the data.
Median is the center of the data:
in a sorted dataset, the Median is the value
which comes exactly in the middle.
Say we have this dataset of 10, 20, 30, 40, 50.
The Median is the third value, 30.
Mode helps to find the most repeated value in the dataset.
In this case it is 29.
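The Median example can be checked with a short sketch using Python's standard `statistics` module. The first list is the lecture's 10 to 50 example; the four-value list is an added illustration of what happens when there is no single middle value.

```python
import statistics

# The lecture's example: five sorted values, so the third one is the middle.
values = [10, 20, 30, 40, 50]
median_odd = statistics.median(values)

# With an even count there is no single middle value;
# statistics.median averages the two middle values instead.
median_even = statistics.median([10, 20, 30, 40])

print(median_odd)
print(median_even)
```

`statistics.median` sorts the data internally, so the input does not have to be sorted first.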
These 5 formulas give me the statistical description.
It tells us the average value is 581, the Median value is 80, the most repeated value in the dataset is 29, and the Max and Min.
When we look at this dataset the first thing we see is that the Mean and Median are almost
5 to 6 times
apart.
This is not a good sign. The Mean and Median
should be close together; there should be
at most a 5 to 10% difference.
In this dataset we have purposely included a bulk value at the end, and because of this the whole average looks very bad.
In this whole dataset there is one value which is absurd and outside the range.
On that day there was a bulk sale to some corporation; that's why the sales jumped up, and because of this one single value
the whole average looks very different.
Looking only at the Mean can sometimes be very dangerous.
We also have to look at the Median.
The Median gives the middle value, so the
absurd value does not come into the calculation.
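To see why the Mean is sensitive to one absurd value while the Median is not, here is a minimal sketch. The sales figures are hypothetical; only the 70000 bulk order echoes the lecture.

```python
import statistics

# Hypothetical sales values (illustration only).
regular_days = [27, 29, 35, 80, 82, 85, 88]
with_bulk_order = regular_days + [70000]  # one bulk order on the last day

mean_regular = statistics.mean(regular_days)
median_regular = statistics.median(regular_days)
mean_bulk = statistics.mean(with_bulk_order)
median_bulk = statistics.median(with_bulk_order)

print(f"without bulk order: mean={mean_regular:.2f}, median={median_regular}")
print(f"with bulk order:    mean={mean_bulk:.2f}, median={median_bulk}")
```

In this sketch the single extra value multiplies the Mean many times over, while the Median only moves from one middle value to the average of two neighbouring ones.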
The first thing we found here is an absurd value in the dataset
which is
making the average look very weird.
This absurd value is termed an Outlier.
An Outlier is a value which is absurd and lies outside the normal range of the dataset.
In this case we could see the absurd value, but what if there were millions of records?
To find out whether there is an absurd value we need to use something called
Quartiles.
An outlier is an absurd value in a dataset
which lies outside
the range of
most of the data we have.
To find it
mathematically
we have to use
Quartiles.
We calculate the
Quartiles and find the outlier.
Quartiles
divide a sorted dataset into four parts
using three cut points: Quartile 1, Quartile 2 and Quartile 3.
To find a Quartile we have a formula.
The first Quartile
is the Median of the first
half
of the sorted dataset.
Quartile 2 and the Median are the same:
both are the middle value of the dataset.
The third Quartile is the Median of the second half.
So Quartiles are where we divide the dataset into four sections
with three cut points,
by finding the Median of the whole dataset and the Medians of its two halves.
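The median-based description of Quartiles can be sketched as follows. The helper name `quartiles_by_halves` is hypothetical, and the handling of an odd number of values (leaving the middle value out of both halves) is an assumption; spreadsheet quartile functions typically interpolate and may return slightly different numbers.

```python
import statistics

def quartiles_by_halves(data):
    """Return (Q1, Q2, Q3) using the median-of-halves approach.

    Assumption: for an odd number of values the overall median is left
    out of both halves. Other conventions exist and give slightly
    different results.
    """
    ordered = sorted(data)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the middle
    upper = ordered[-half:]   # values above the middle
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

q1, q2, q3 = quartiles_by_halves([10, 20, 30, 40, 50])
print(q1, q2, q3)
```

Quartile 2 here is just the ordinary Median, which matches the statement that the two are the same value.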
To check for
an Outlier we need to calculate the Interquartile Range (IQR).
IQR = Quartile 3 - Quartile 1
The normal range of the data has a lower end and a higher end:
low range = Quartile 1 - 1.5 * IQR
high range = Quartile 3 + 1.5 * IQR
For this dataset the normal
range goes from -53 at the low end up to 167 at the high end.
According to
these calculations
the maximum normal value can go up to 167,
so we can see that this value (70000) is very high.
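The low and high range formulas can be turned into a small outlier check. This is a sketch under assumptions: the `sales` list is hypothetical, the helper name `iqr_fences` is made up, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which may not match the exact convention used in the lecture's sheet.

```python
import statistics

def iqr_fences(data, k=1.5):
    """Return (low, high): Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates between data points; this choice
    # is an assumption, other quartile conventions shift the fences a bit.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sales values with one bulk order (illustration only).
sales = [27, 29, 29, 35, 80, 82, 85, 88, 70000]

low, high = iqr_fences(sales)
outliers = [x for x in sales if x < low or x > high]
cleaned = [x for x in sales if low <= x <= high]

print(f"normal range: {low} to {high}")
print(f"outliers: {outliers}")
```

Because the fences are computed from the quartiles rather than from the Mean, the bulk order itself barely affects them, which is why this check still works on data that contains the outlier.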
We will exclude the value for now.
By doing so,
the Mean and Median now look
quite
close.
If we put the value back, the Mean breaks again; with it excluded
the values are 67 and 80, which looks reasonable.
By using the
interquartile range
we can find
the
normal minimum and maximum range of the dataset, and then we can hunt down
the Outlier.
Before we start measuring the spread,
first we need to check
whether the data has some
values which should not be included,
or some wrong values,
or some absurd values,
or values which do not go with the current dataset.
Try to hunt down the Outlier
and remove it so the calculations are genuine.
Calculate the Mean, Median, Mode, Max and Min for the other dataset as well.
The Mean and Median are almost equal
and there are no repeating values.
In a situation where the Mean and Median are exactly equal
and there are no repeating values, it is very much possible that the data is
a sequence of values.
Here it is 5, 6, 7, 8, 9, ... and that's why
the Mean and Median
look so exact.
Second, the Max value is 140 and the Min value is 5.
By using Mean, Median and Mode we can find out the nature of the dataset;
we can describe the dataset.
For example, we can now describe this dataset as sequential.
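A small sketch of this observation, using a hypothetical run of consecutive values rather than the lecture's full column:

```python
import statistics

# Hypothetical sequential data: 5, 6, 7, ..., 14 (illustration only).
sequence = list(range(5, 15))

mean_value = statistics.mean(sequence)
median_value = statistics.median(sequence)
has_repeats = len(set(sequence)) < len(sequence)

print(mean_value, median_value, has_repeats)
```

Equal Mean and Median with no repeated values is a hint, not a proof: it shows the data is balanced around its center, which a sequence is, but other balanced datasets would give the same result.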
When the Mean and Median were very far apart,
we came to know there was an Outlier.
By using Mean, Median and Mode
we can find out
the nature of the dataset.
Now calculate the
Range.
Range is
the Max
- Min.
The range of dataset 1 (Total Sales 1)
is much smaller than the range of Total Sales 2.
This indicates
the spread of Total Sales 2 is much higher than that of Total Sales 1.
The first
arithmetic
or statistical formula
for measuring spread is simply the Range.
Range is
Max - Min.
From this we know that Total Sales 2 has
a higher spread
compared to Total Sales 1.
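The Range comparison can be sketched in a few lines. Both lists are hypothetical stand-ins for the two sales columns, chosen only so that one is concentrated and the other evenly spread.

```python
def value_range(data):
    """Range = Max - Min."""
    return max(data) - min(data)

# Hypothetical stand-ins for the two sales columns (illustration only).
total_sales_1 = [27, 29, 35, 80, 82, 85, 88]   # concentrated values
total_sales_2 = list(range(5, 141, 15))        # evenly spread: 5, 20, ..., 140

range_1 = value_range(total_sales_1)
range_2 = value_range(total_sales_2)

print(f"Total Sales 1 range: {range_1}")
print(f"Total Sales 2 range: {range_2}")
```

Since the Range depends only on the two extreme values, it should be calculated after any Outlier has been removed; otherwise one absurd value decides the result.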
In this video we were trying to find out
what Spread is
and how to calculate it.
First we did it visually,
using graphs.
If we have a huge dataset we cannot use graphs; we need to use calculations.
That's where we talked about Mean, Median, Mode, Max and Min.
Before we try to find the spread, ensure there is no Outlier.
To find the Outlier we used Quartiles:
we calculated the Quartiles, the IQR, and the normal Max and Min
from the IQR, and eliminated the Outlier.
Finally we calculated the Range;
by using the Range
we now know
that Dataset 2 has a higher spread than Dataset 1.
At the end of this video we will give you a small practice test on whatever we have talked about in this class.
At the end of the video we also have a small Q&A;
try to answer those questions to revise.
