Dear students.
I am very happy to introduce this scalable data mining course, which will be offered in the coming semester.
The instructors for the course are myself, Professor Sourangshu Bhattacharya from IIT Kharagpur, and my colleague, Professor Anirban Dasgupta from IIT Gandhinagar.
In this video we will give you a flavor of the subject matter of the course, tell you about the course goals and prerequisites, describe the syllabus, and tell you about the objectives of the course.
Data science is a field which unifies data mining, statistics, and machine learning to obtain insights, decisions, and predictions from data.
It has been observed that the more data we use for obtaining these insights and predictions, the better they become. Hence, data science on massive data is a very useful thing to do.
However, given the current computing infrastructure, it is not always possible to do so. I will describe three sample problems where it may be difficult.
The first problem: consider a switch through which IP addresses are passing. Switches generally have a very small amount of memory. Now consider the problem where we wish to find the most frequently occurring IP address in the stream. We cannot store all the IP addresses passing through the switch, since there are a very large number of them. The problem is to find the most frequently occurring IP addresses even though we cannot store the frequencies of all the IP addresses.
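To make this concrete, here is a minimal sketch of one classical answer to this kind of question, the Misra-Gries summary; the choice of algorithm and the toy stream are illustrative assumptions, not something fixed by this lecture. It keeps at most k - 1 counters, yet any address occurring in more than a 1/k fraction of the stream is guaranteed to survive.

    def misra_gries(stream, k):
        """Candidate heavy hitters using only k - 1 counters.

        Any item occurring more than len(stream)/k times is
        guaranteed to remain in `counters` at the end.
        """
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k - 1:
                counters[item] = 1
            else:
                # No free counter: decrement all, drop any that hit zero.
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters

    # Toy stream of IP addresses; "10.0.0.1" is the heavy hitter.
    stream = ["10.0.0.1"] * 60 + ["10.0.0.2"] * 25 + ["10.0.0.3"] * 15
    print(misra_gries(stream, k=4))

A second pass over the stream, where possible, can then verify which surviving candidates are truly frequent.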
The second problem: consider that you have a collection of millions of images, some of which are near duplicates of other images. A simple calculation will show that comparing every image to every other image would take years on current computing hardware. The question is: can we quickly find the near-duplicate images, approximately, without comparing every image to every other image?
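To see why, here is the back-of-the-envelope version of that simple calculation; the collection size and the comparison rate are assumed numbers for illustration only.

    # Assumed numbers (not from the lecture): 10 million images and
    # 100,000 pairwise image comparisons per second.
    n = 10_000_000
    pairs = n * (n - 1) // 2          # about 5 * 10^13 pairs
    rate = 100_000                    # comparisons per second (assumed)
    years = pairs / rate / (3600 * 24 * 365)
    print(f"{pairs:.2e} pairs, about {years:.0f} years")  # roughly 16 years

Even with generous hardware assumptions, the brute-force approach is hopeless, which is what motivates the approximate near neighbor techniques covered later in the course.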
The third problem: consider that you have a large data set of millions of data points, each of which has billions of features, and we wish to classify these data points into different classes. How can we train or classify at such a huge scale by utilizing, say, hundreds or thousands of servers in a distributed manner?
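One standard idea, sketched below as a toy single-process simulation, is data parallelism: each server computes the gradient on its own shard of the data, and a driver averages the results. The model (least squares), the shard count, and the step size here are all assumptions made for the example.

    import numpy as np

    def shard_gradient(w, X, y):
        # Least-squares gradient on one shard: X^T (X w - y) / n
        return X.T @ (X @ w - y) / len(y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 20))
    w_true = rng.normal(size=20)
    y = X @ w_true

    shards = np.array_split(np.arange(len(y)), 4)   # 4 simulated servers
    w = np.zeros(20)
    for step in range(200):
        grads = [shard_gradient(w, X[idx], y[idx]) for idx in shards]
        w -= 0.1 * np.mean(grads, axis=0)           # averaged gradient step

    print("distance to true weights:", np.linalg.norm(w - w_true))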
These are some examples of the problems we will be dealing with in this course.
In this course we will be introducing the
mathematical and algorithmic tools needed
for scaling machine learning and data mining
to very large data sets.
The course will be a mix of theory and practice; that is, we will discuss methods which have theoretical guarantees and which also work well in practice.
The theoretical guarantees are very important because they allow you to choose the hyperparameters of a system, as well as to debug the system while you are developing it.
The key ideas which play a big role in this course are randomization techniques; answering questions or solving problems approximately; and distributed computation, that is, using multiple computers to solve a problem.
As for the prerequisites: the course is mostly self-contained; however, you need to be confident with basic linear algebra, such as vectors, matrices, and operations on them. You need to be confident with basic probability, such as random variables, expectation, and variance. You need to be comfortable with undergraduate-level algorithms and data structures courses, and you need to have good programming skills.
Previous exposure to machine learning will
be helpful.
The syllabus of the course can be divided into six parts.
The first part is the background, where we will cover the necessary probability theory, linear algebra, optimization, and machine learning.
In the second part we introduce memory-efficient data structures, such as hash tables, Bloom filters, and various types of sketches.
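To give a taste of this part, here is a minimal Bloom filter sketch; the bit-array size, the number of hash functions, and the hashing scheme are illustrative choices, not the course's prescribed ones.

    import hashlib

    class BloomFilter:
        """Set membership in fixed memory: no false negatives, and a
        false-positive rate tunable via size and num_hashes."""

        def __init__(self, size=10_000, num_hashes=5):
            self.size, self.num_hashes = size, num_hashes
            self.bits = [False] * size

        def _positions(self, item):
            # Derive num_hashes bit positions by salting one strong hash.
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos] = True

        def __contains__(self, item):
            return all(self.bits[pos] for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("10.0.0.1")
    print("10.0.0.1" in bf)   # True, always
    print("10.0.0.2" in bf)   # False, with very high probability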
In the third part we introduce near neighbor search, that is, data structures which help you answer near neighbor queries in an approximate manner.
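As a taste, here is a hedged sketch of one such technique, locality-sensitive hashing with random hyperplanes; the data, dimensions, and number of hyperplanes are arbitrary choices for the example.

    import numpy as np

    # Similar vectors tend to get identical bit signatures, so we only
    # compare items that land in the same bucket instead of all pairs.
    rng = np.random.default_rng(42)
    dim, n_planes = 50, 12
    planes = rng.normal(size=(n_planes, dim))

    def signature(x):
        # One bit per hyperplane: which side of the plane x falls on.
        return tuple(bool(b) for b in (planes @ x > 0))

    data = rng.normal(size=(1000, dim))
    buckets = {}
    for i, x in enumerate(data):
        buckets.setdefault(signature(x), []).append(i)

    query = data[0] + 0.01 * rng.normal(size=dim)  # near duplicate of item 0
    candidates = buckets.get(signature(query), [])
    print(0 in candidates)  # likely True, after checking only one bucket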
The fourth part is randomized linear algebra, where we describe how to use randomized algorithms for various matrix-factorization-related problems.
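As an illustration, here is a minimal sketch of randomized SVD via a random range finder, one of the basic tools in this area; the matrix sizes and the oversampling parameter are assumptions for the example.

    import numpy as np

    def randomized_svd(A, k, oversample=10):
        # A random projection captures the dominant column space of A.
        m, n = A.shape
        omega = np.random.default_rng(0).normal(size=(n, k + oversample))
        Q, _ = np.linalg.qr(A @ omega)
        # Solve the small (k + oversample) x n problem, then lift back.
        U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        return (Q @ U_small)[:, :k], s[:k], Vt[:k]

    A = np.random.default_rng(1).normal(size=(1000, 200))
    U, s, Vt = randomized_svd(A, k=10)
    print(U.shape, s.shape, Vt.shape)  # (1000, 10) (10,) (10, 200)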
The fifth part is MapReduce and related paradigms, where we introduce distributed computing frameworks such as MapReduce and Spark, which is used for in-memory distributed computing.
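To give a flavor of the programming model, here is word count, the classic MapReduce example, simulated in a single process; the map and reduce functions below are illustrative.

    from collections import defaultdict
    from itertools import chain

    def map_fn(line):
        # Map: emit a (key, value) pair per word.
        return [(word, 1) for word in line.split()]

    def reduce_fn(key, values):
        # Reduce: aggregate all values observed for one key.
        return key, sum(values)

    lines = ["big data big compute", "big data mining"]

    # Shuffle: group the mapped values by key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(line) for line in lines):
        groups[key].append(value)

    # Each group reduces independently, which is what runs in parallel
    # across machines in a real MapReduce system.
    print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
    # [('big', 3), ('compute', 1), ('data', 2), ('mining', 1)]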
In the sixth part we introduce distributed machine learning, where we describe how to implement machine learning algorithms on Spark.
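As a preview, here is a minimal sketch of training a classifier with Spark's MLlib from Python; the application name, the data path, and the parameter values are placeholder assumptions.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("sdm-demo").getOrCreate()

    # LIBSVM format: "label index:value index:value ..." per line.
    data = spark.read.format("libsvm").load("data/sample_libsvm_data.txt")

    # Spark distributes both the data and the gradient computation.
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(data)
    print("first coefficients:", model.coefficients.toArray()[:5])
    spark.stop()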
Regarding the learning outcomes from this course: at the end of the course you should be fairly familiar with the idea of applying probabilistic techniques, such as sampling and sketching, in data analysis and data mining.
You should know about a few different dimensionality reduction techniques and their connection to data analysis.
You should know about the MapReduce programming model and the Apache Spark framework, and you will be able to implement machine learning algorithms in the Apache Spark framework.
Many large corporations use these frameworks and techniques for solving problems on massive data in areas such as web-scale data processing, finance, the Internet of Things, and scientific computation.
Thank you.
