Welcome to this course on Python for Data
Science. This is a four-week course, and we
are going to teach you some very basic
programming aspects in Python. And since this
is a course that is geared towards data
science, towards the end of the course, based
on what has been taught, we will also show
you two different case studies: one is what
we call a function approximation case study,
and the other is a classification case study.
And then we will tell you how to solve those
case studies using the programming platform
that you have learned. So, in this first
introductory lecture I am just going to talk
about why we are looking at Python for data
science.
So, to look at that, first we are going to
look at what data science is.
This is something that you would have seen
in other videos of courses in NPTEL and in
other places. Data science is basically the
science of analyzing raw data and deriving
insights from this data. And you could use
multiple techniques to derive insights: you
could use simple statistical techniques,
you could use more complicated and more
sophisticated machine learning techniques,
and so on.
Nonetheless, the key focus of data science
is in actually deriving these insights, using
whatever techniques you want to use.
Now, there is a lot of excitement about data
science, and this excitement comes because
it has been shown that you can get very
valuable insights from large data. You can
get insights about how different variables
change together, how one variable affects
another variable and so on, which with large
data is not very easy to see by very simple
computation.
So, you need to invest some time and energy
into understanding how you could look at this
data and derive these insights from it.
And from a utilitarian viewpoint, if you
look at data science in industries, doing
proper data science allows these industries
to make better decisions.
These decisions could be in multiple fields.
For example, companies could make better
purchasing decisions, better hiring decisions,
better decisions in terms of how to operate
their processes and so on.
So, when we talk about decisions, the decisions
could be across multiple verticals in an
industry. And data science is not only useful
from an industrial perspective, it is also
useful in the sciences themselves,
where you look at lots of data to model
your system or test your hypotheses or theories
about systems and so on.
So, when we talk about data science, we start
by assuming that we have a large amount of
data for the problem of interest.
And we are going to basically look at this
data: we are going to inspect the data, we are
going to clean and curate the data, and then
we will do some transformation of the data,
modeling and so on,
before we can derive insights that are valuable
to the organization, or test a theory and
so on.
Now, coming to a more practical viewpoint
of what we do once we have data.
I have these four bullet points, which roughly
tell you, supposing you were solving a data
science problem, what steps you would
follow.
So, you start with just having data: someone
gives you data, and you are trying to derive
insights from this data.
So, the very first step is really to bring
this data into your system. You have
to read the data,
so that the data comes into the programming
platform and you can use it.
Now, data could be in multiple formats: you
could have data in a simple Excel sheet or
some other format.
So, we will teach you how to pull data
into your programming platform from multiple
data formats.
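As a small illustration of this reading step, here is a minimal sketch that reads comma-separated data into Python using only the standard library csv module. The column names and values are made up for the example, and the text is held in a string in place of a real file. In practice a dedicated data library would typically be used for this, so treat this only as a sketch of the idea.

```python
import csv
import io

# Hypothetical data; in practice this would come from a file on disk.
raw_text = """name,age,city
Asha,34,Chennai
Ravi,29,Mumbai
Meena,41,Delhi
"""

def read_records(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

records = read_records(raw_text)
print(len(records))        # number of rows read
print(records[0]["name"])  # value in the first row's 'name' column
```

Note that every value comes in as a string here, so a column such as age would still need to be converted to a number before any calculation.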
So, that is the first step really. If you think
about how you are going to solve a problem,
the first step would be simply to read
the data. And then once you read the data,
many times you have to do some processing
with this data, because you could have data
that is not correct.
For example, we all know that
there are ten digits
in a mobile number. If there is a column
of mobile numbers and there is
one row where there are just five digits,
then you know there is something wrong.
So, this is a very simple check I am talking
about; in real data processing this gets much
more complicated.
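The mobile number check described above can be sketched in a few lines of Python. The function name is_valid_mobile and the sample column are hypothetical, chosen only for illustration, and the rule of exactly ten digits is the one stated in the lecture.

```python
def is_valid_mobile(value, expected_digits=10):
    """Return True if value consists of exactly expected_digits ASCII digits."""
    text = str(value).strip()
    return text.isascii() and text.isdigit() and len(text) == expected_digits

# Hypothetical column of mobile numbers; two entries are deliberately wrong.
mobile_column = ["9876543210", "98765", "9123456780", "91234abcde"]

# Collect the positions of the rows that fail the check.
bad_rows = [i for i, number in enumerate(mobile_column)
            if not is_valid_mobile(number)]
print(bad_rows)
```

Flagging the bad rows first, without deleting them, leaves the decision of whether to correct or discard them to a later cleaning step.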
So, once you bring the data in, when you try
to process this data you are going to find
errors such as this.
So, how do you remove such errors, how do you
clean the data?
That is one activity that usually precedes
doing more useful stuff with the data.
This is not the only issue that we look at;
there could be data that is missing.
So, for example, there is a variable for which
you get a value in multiple situations, but
in some situations the value is missing.
So, what do you do with this data? Do you throw
the record away,
or do you do something to fill in the data,
and so on?
So, these are all data processing and cleaning
steps. In this course we will tell you about
the tools that are available in Python so
that you can do this data processing, cleaning
and so on.
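The two choices for missing values mentioned above, throwing the record away or filling in the value, can be sketched as follows. The records and the income field are made up for illustration, and filling with the mean of the observed values is just one common option assumed here, not the only way to do it.

```python
from statistics import mean

# Hypothetical records; None marks a missing 'income' value.
records = [
    {"id": 1, "income": 40.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 60.0},
]

# Option 1: throw away any record with a missing value.
dropped = [r for r in records if r["income"] is not None]

# Option 2: fill the gap with the mean of the observed values.
observed_mean = mean(r["income"] for r in dropped)
filled = [
    {**r, "income": observed_mean if r["income"] is None else r["income"]}
    for r in records
]

print(len(dropped))         # records left after dropping
print(filled[1]["income"])  # value used to fill the missing entry
```

Dropping loses a whole record for the sake of one missing field, while filling keeps the record but introduces a value that was never observed, so which one is appropriate depends on the problem.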
Now, what you have done at this point is you
have been able to get the data into the system,
you have been able to process and clean the
data, and get to a certain data file or data
structure that is reasonably complete, so that
you think you can work with this data set.
At this point, what you will do is
try to summarize this data.
And usually, a very simple way to summarize
the data is to compute very simple
statistical measures:
you could, for example, compute the median,
mode and mean of a particular column.
So, those are simple ideas for summarizing
the data; you could also compute the variance
and so on.
So, we are going to teach you how to use these
notions of statistical quantities
to summarize the data.
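As a small sketch of these summary measures, the standard library statistics module in Python already provides the mean, median, mode and variance. The column of values below is made up for illustration.

```python
import statistics

# Hypothetical column of values.
column = [2, 4, 4, 5, 7, 8]

print(statistics.mean(column))       # arithmetic mean
print(statistics.median(column))     # middle value of the sorted data
print(statistics.mode(column))       # most frequent value
print(statistics.pvariance(column))  # population variance
```

Here pvariance treats the column as the whole population; statistics.variance would give the sample variance instead, which divides by one less than the number of values.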
Once you summarize the data, another activity
which is usually taken up is what is called
visualization.
So, visualization means you look at this data
more pictorially to get insights about
the data, before you bring heavy-duty
algorithms to bear on this data.
And this is a creative aspect of data science:
the same data could be visualized by multiple
people in multiple ways.
And some visualizations are not only
eye-catching, but are also much more
informative than other types of visualization.
So, this notion of plotting the data so that
some of the attributes or aspects of the
data are made apparent is this notion of
visualization.
And there are tools in Python that we will
teach you, in terms of how you visualize this
data.
So, at this point you have taken the data,
you have cleaned the data, you have got a set
of data points or a data structure that you
can work with, and you have done some basic
summary of this data that gives you some
insights.
You have also looked at it more visually and
you have got some more insights. But when you
have a large amount of data, big data, the last
step is really deriving those insights which
are not readily apparent either through
visualization or through a simple summary of
the data.
So, how do we then go and look at more
sophisticated analytics or analysis of data,
so that these insights come out?
That is where machine learning comes in. And
as a part of this course, when you see the
progress of this course, you will notice that
you will go through all of this, so that you
are ready to look at data science problems
in a structured format and then use Python
as a tool to solve some of these problems.
Now, why Python for doing all of this?
The number one reason is that there are these
Python libraries which are already geared
towards doing many of the things that we talked
about, so that it becomes easy to program
and very quickly you can get some interesting
outcomes out of what you are trying to do.
So, as we talked about in the previous
slide, you need to do data manipulation and
preprocessing.
There are lots of functions and libraries in
Python with which you can do data wrangling,
manipulation and so on.
From a data summary viewpoint, many
of the statistical calculations that you
want to do are already pre-programmed, and
you have to simply invoke them with your data
to be able to show a data summary.
For the next step we talked about,
visualization, there are libraries in Python
which can be used to do the visualization.
And finally, for the more sophisticated analysis
that we talked about, all kinds of machine
learning algorithms are already pre-coded and
available as libraries in Python.
So, again, once you understand a little bit
about these functions and once you get
comfortable working in Python, then applying
certain machine learning algorithms to these
problems becomes fairly straightforward.
You simply call these libraries and then run
these algorithms.
At a higher level, in the previous slide
we talked about the flow process for how you
get the data in, clean it, and go all the way
up to insights, and in parallel we said
why Python makes it easy for us to do all
of this. If you
go forward a little more and ask
about the other advantages of Python, which
go a little beyond just very simple data
science activities:
Python provides you several libraries, and
it is being continuously improved, so any time
there is a new algorithm, it comes
into the set of libraries.
So, in that sense it is very varied, and there
is also a good user community.
So, if there are some issues with new libraries
and so on, those are fixed, so that you
get a robust library to work with.
And we talk about data, and data can be of
different scales.
The examples that you will see in this
course use data of reasonably small size,
but in real-life problems you are going to
look at data which is much larger, which we
call big data.
Python has the ability to integrate with
big data frameworks like Hadoop, Spark and
so on.
And Python also allows you to do more
sophisticated programming: object-oriented
programming and functional programming.
Python, with all of these sophisticated
tools and abilities, is still a reasonably
simple language to learn, and it is reasonably
fast to prototype in.
It also gives you the ability to work
with data which is on your local machine or
in a cloud and so on. So, these are all things
that one looks for when one looks at a
programming platform which is capable of
solving problems in real life.
So, these are real problems that you can solve;
these are not only toy examples, but real
data science applications that you can build
with Python.
And just as another pointer in terms of
why we believe that Python is something that
a lot of our students and professionals in
India should learn.
As you know, there are paid
tools for machine learning, with all of these
libraries and so on.
And there are also open source tools, and in
India, based on a survey, most people of
course prefer open source tools for a variety
of reasons, cost being one, because they are
free to use.
But if a tool is just free to use and
does not have a robust user community, then
it is not really very useful. That is where
Python really scores, in terms of a robust
user community which can help people working
in Python.
So, it is both open source and has a robust
user community, both of which are advantages
for Python.
And if you think of other competing languages
for machine learning, if you look at this
chart, in India about 44 percent of the people
who were surveyed said they use Python or
they prefer Python.
And of course, a close second is R. In fact,
R was much more preferred a few years back,
but over the last few years in India, Python
is starting to become the programming platform
of choice.
So, in that sense it is a good language to
learn, because the opportunities for jobs and
so on are a lot more when you are comfortable
with Python as a language.
So, with this I will stop this brief introduction
on why Python for data science.
I hope I have given you an idea of why.
While we are going to teach you Python
as a programming language, please keep in
mind that each module that we teach in this
course is actually geared towards data science.
So, as we teach Python, we will make the
connections to how you will use some of the
things that you are seeing in data science,
and all of this will culminate in these two
case studies that will bring all of these ideas
together,
in terms of both giving you an idea and an
understanding of how a data science problem
will be solved,
and also how it will be solved in Python, which
is the programming platform of choice currently
in India.
So, I hope this short four-week course helps
you quickly get on to this programming platform
and learn data science, and then you
can enhance your skills with a much more
detailed understanding of both the programming
language and data science techniques.
Thank you.
