Hello everyone, welcome back to RU Buzzing
Before learning about Data Science, Machine Learning and Deep Learning
one should have a basic knowledge of data
which is frequently asked about in interviews
For Example
What is the difference between structured, unstructured and semi-structured data
What is Data Lake?
Why is it important?
And how is it different from a Data Warehouse?
What are the Veracity and Volume of data?
To answer this type of question
It is very important to know
What? Why? How?
Which we will discuss in this video
With that, we will also discuss the interview questions
Which I have personally experienced in my career
So don't go anywhere
Let's start our first video of today,
which is about understanding the details of data
What is Data?
Data is just a set of records with some information in it
For example audio, video, text, email
Each of these is a record that gives information
something from which we can understand a few things
Then What is Big Data?
Big Data means large data
So can I say 1 GB of data is Big Data?
Or 1 TB
or 1 petabyte of data?
Defining Big Data by its size alone is wrong
Why?
Because
Suppose, 50 or 100 years ago,
I had told you to transfer 1 GB of data to your friend
For that, you would have had to book a truck
Because
we didn't have compact storage for 1 GB of data, it used to be very huge
Today
due to advances in computing
we have pen drives, which can easily carry 1 GB of data
One can even share it on Google Drive without leaving home
So 1 GB of data, which was Big Data yesterday,
is considered very small today
In the same way, if you feel a petabyte is Big Data,
it is possible that 50 years from now
it may not be Big Data
and can easily be transferred
How do we define Big Data?
Big Data is defined by its five characteristics
Volume, Velocity, Variety, Veracity, and Valence
These are called the five Vs of Big Data
This is often asked in interviews
What are the 5 Vs of Big Data?
The answer is this
Volume, Velocity, Variety, Veracity, and Valence
Let us see one by one what these are
First is Volume
Volume defines the size of the data being generated
For example, take YouTube
Gigabytes upon gigabytes of data are generated there daily
Users like you and me upload videos every day
Even if a single user uploads just a one-hour video
millions of users are uploading data to YouTube
so the Volume is very high
Second is the Velocity of data
which defines the speed of data generation
As per analytics, on YouTube every minute
500 to 600 hours of video are uploaded
Every minute
Think of the speed of data generation
If this much data is generated every minute
then how much data is generated every day and every year?
That is what defines the Velocity of data
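To get a feel for that speed, here is a small back-of-the-envelope calculation. It is only a sketch: it takes the lower figure quoted above (500 hours per minute) at face value and assumes the rate stays constant all day and all year.

```python
# Rough scale-up of the upload rate quoted in the video.
# Assumption: a constant 500 hours of video uploaded per minute.
HOURS_PER_MINUTE = 500
MINUTES_PER_DAY = 24 * 60

hours_per_day = HOURS_PER_MINUTE * MINUTES_PER_DAY
hours_per_year = hours_per_day * 365

print(hours_per_day)   # 720000
print(hours_per_year)  # 262800000
```

So even at the lower figure, that is 720,000 hours of new video per day.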
Variety of data defines the type of data
Data can be of any type
audio format, text format, video format
it can be in any format
So data is grouped into three categories
Structured, Unstructured or Semi-Structured
which we will discuss later
Next is Veracity of data
which defines the quality of data
How can we define the quality of data?
By working out how much of the data is redundant
and how consistent it is
For example, if I draw a table
and I write Mohit, 50 marks
and then I write Mohit, 50 marks again
and then I write Mohit, 30 marks
you can see that the data here is duplicated
Because Big Data is a huge amount of data
processing it needs a very capable system
so we can't keep this redundant information
We don't need this redundancy
We also don't need the inconsistency
If Mohit, 50 marks is already recorded
then Mohit, 30 marks is inconsistent with it
So this is wrong
This is low-quality data
because we are wasting a lot of compute
on a high-powered system
If we have this kind of redundant information then it's not useful
and we won't get reliable information out of it
Usually, this is called low-quality data
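The Mohit table can be checked in a few lines of Python. This is a minimal sketch, not a standard data-quality tool: the function name find_quality_issues is made up for illustration, and "inconsistent" is assumed to mean "the same name appears with more than one mark".

```python
from collections import defaultdict

# The three rows from the table in the video.
records = [("Mohit", 50), ("Mohit", 50), ("Mohit", 30)]


def find_quality_issues(rows):
    """Return (exact duplicate rows, names that have conflicting marks)."""
    seen = set()
    duplicates = []
    marks_by_name = defaultdict(set)
    for row in rows:
        if row in seen:
            duplicates.append(row)  # same name and same marks seen before
        seen.add(row)
        name, marks = row
        marks_by_name[name].add(marks)
    # A name with more than one distinct mark is treated as inconsistent.
    inconsistent = {
        name: sorted(marks)
        for name, marks in marks_by_name.items()
        if len(marks) > 1
    }
    return duplicates, inconsistent


duplicates, inconsistent = find_quality_issues(records)
print(duplicates)    # [('Mohit', 50)]
print(inconsistent)  # {'Mohit': [30, 50]}
```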
Then what is Valence of data?
It defines the connectedness of data
For instance, take Facebook as an example
If every node is a user
and every edge defines a relation between two users
then this is highly connected data
because every piece of information is connected to others
Highly connected data is known as high-Valence data
These 5 Vs define Big Data
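One simple way to put a number on connectedness is to compare the connections that actually exist with the connections that could exist. The sketch below assumes that measure; the user names and friendships are invented for illustration and are not from the video.

```python
# Hypothetical friendship graph: each pair is one edge between two users.
friendships = {
    ("asha", "ben"),
    ("asha", "chen"),
    ("ben", "chen"),
    ("chen", "dev"),
}

# Every user that appears in at least one friendship is a node.
users = {user for pair in friendships for user in pair}

# With n users there can be at most n * (n - 1) / 2 distinct pairs.
possible_edges = len(users) * (len(users) - 1) // 2

# Assumed measure: share of possible connections that actually exist.
connectedness = len(friendships) / possible_edges

print(len(users), possible_edges)  # 4 6
print(round(connectedness, 2))     # 0.67
```

The closer this ratio is to 1, the more highly connected the data is.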
So far we have learnt
What is data?
What is Big Data?
And what are the 5 Vs of Big Data?
Now we will look at: why Big Data?
Once I was asked in an interview
Why is Big Data required, and why is having a lot of data important?
What problem does a lot of data solve that less data cannot?
Let us see it with a real-life example
Assume I have been given marks
and told which students are good and which are bad
The marks are as follows: 80 is good, 75 is good
50 is bad
30 is bad
25 is bad
So I already have this information
Now, if I ask you, as a human,
is 77 marks good or bad?
For 77, as you can see, 75 is good and 80 is good
and 77 lies between them
So we can call it good
Your decision is correct
But what if I ask you about 60?
Then what will you do?
You do not have information for that
Between 50 and 75, where does bad end and good begin?
One cannot decide this with clarity
Is it good
or bad? One cannot say with 100% certainty
One cannot even decide with 99% certainty
Now suppose I have more data
For example, labels for 60, 65 and 67, which lie between these two
If we have this data
then we can form a clear boundary
The more data we have, the easier it is to make a decision
and the more accurate the decision will be
Hence we need more data
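The same idea can be shown with a few lines of Python. This is only a sketch: the helper uncertain_gap is a made-up name, and the labels for 60, 65 and 67 (bad, good, good) are assumptions, because the video only says that such extra data would help.

```python
def uncertain_gap(labelled_marks):
    """Return (highest 'bad' mark, lowest 'good' mark).

    Any mark strictly between the two cannot be decided from the data.
    """
    highest_bad = max(m for m, label in labelled_marks if label == "bad")
    lowest_good = min(m for m, label in labelled_marks if label == "good")
    return highest_bad, lowest_good


# The five marks given in the video.
small_data = [(80, "good"), (75, "good"), (50, "bad"), (30, "bad"), (25, "bad")]

# Assumed labels for the extra marks (not stated in the video).
more_data = small_data + [(60, "bad"), (65, "good"), (67, "good")]

print(uncertain_gap(small_data))  # (50, 75)
print(uncertain_gap(more_data))   # (60, 65)
```

With five records, anything between 50 and 75 is a guess. With eight records, the undecided range shrinks to between 60 and 65, which is the clearer boundary described above.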
Now we will look at the varieties of data in detail
As I explained earlier,
Structured,
Unstructured and Semi-Structured are the data types
What is Structured data?
Structured data is usually
in the form of a table
It is pre-processed
It is in the form of rows and columns
This column is related to this one
and this one is related to that one
Usually, this type of data is stored in a database
It is very fast and uses SQL
and we can access it quickly
The second type of data is Unstructured data
Unstructured data is data in raw form
that is, raw format
or unprocessed
Examples of this type of data are images,
text, audio, video
We cannot find a relation
between one audio file and another
that fits a rows-and-columns format
And that is why it is called
Unstructured data
because it does not have a structure
Semi-Structured data combines Structured and Unstructured data
It is usually in key-value pairs
Examples of this are XML
and JSON
How does this work? For example,
email is a good example of it
If you have noticed, in an email
the body of the mail is free text
That is unstructured
But some information in it is structured, like "Mail To"
and "Mail From"
So we can structure the data according to these fields
But the message part is unstructured
Data that combines the two is called Semi-Structured
So these are the 3 forms, or
3 types, of Variety
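The email example can be written as a small JSON document. This is an illustrative sketch: the field names "to", "from" and "body" and the addresses are assumptions, not a real mail format.

```python
import json

# A hypothetical email as key-value pairs. The addresses are structured,
# while the body is free text with no internal structure.
raw_email = (
    '{"to": "friend@example.com", '
    '"from": "me@example.com", '
    '"body": "Hi, are we still meeting tomorrow?"}'
)

email = json.loads(raw_email)

# Structured part: fields we can filter or sort on, like table columns.
structured_part = {key: value for key, value in email.items() if key != "body"}

# Unstructured part: the message itself.
unstructured_part = email["body"]

print(structured_part)    # {'to': 'friend@example.com', 'from': 'me@example.com'}
print(unstructured_part)  # Hi, are we still meeting tomorrow?
```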
As we saw, Structured data is stored in a database
Now let us see how Unstructured and Semi-Structured data are stored
Semi-Structured data is stored in XML or JSON format
and Unstructured data is usually stored in Data Lakes
So what is a Data Lake?
A Data Lake is a place, or a storage,
where Structured, Unstructured
and Semi-Structured data are stored together
Structured + Unstructured + Semi-Structured
And this is for future use
For example, if one needs to do analytics on Twitter
one should not have to go and fetch the data only when the analytics needs to be done
If your company does analytics on Twitter
then you need the Twitter data; be it Structured, Unstructured
or Semi-Structured, all of it needs to be dumped in a Data Lake
Examples of Data Lakes are Amazon S3,
Microsoft Azure,
and Hadoop, a famous and very widely used one
So usually we store Unstructured data in
Hadoop or any of the other Data Lakes
We dump the data for future use
Information that is not being used at present,
the related information, everything
is dumped, and we use it in future
whenever it is required
What is a Data Warehouse?
A Data Warehouse is usually used when we have DB 1, DB 2, DB 3
These are 3 RDBMSs
and all three hold some information
For example, this one has student information
this one has professor information
and this one has subject information
So this database has information on students
this one on professors and this one on subjects
Now I need details of the students
who are studying Machine Learning under Andrew Ng
I need information on students
I also need information on professors, to know who Andrew Ng is
and then on who has taken the subject Machine Learning
That information is also required
I need information from all 3 databases
So I will need to store all 3 databases in 1 place
That place is called a Data Warehouse
where we combine relational databases for the purpose of analytics
Here we are doing analytics to know how many students
have taken the Machine Learning course
So this is called a Data Warehouse
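Here is a toy version of that query using Python's built-in sqlite3 module. It is a sketch under assumptions: one in-memory database stands in for the warehouse, and the table layout, column names and sample rows are invented for illustration. Real warehouses are far larger and are loaded from the source databases by separate jobs.

```python
import sqlite3

# One in-memory database plays the role of the warehouse. The three tables
# stand in for data copied from DB 1, DB 2 and DB 3 (hypothetical schema).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript(
    """
    CREATE TABLE students   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE professors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE subjects   (name TEXT, professor_id INTEGER, student_id INTEGER);
    """
)
warehouse.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(1, "Mohit"), (2, "Riya"), (3, "Sam")],
)
warehouse.executemany(
    "INSERT INTO professors VALUES (?, ?)",
    [(1, "Andrew Ng"), (2, "Another Professor")],
)
warehouse.executemany(
    "INSERT INTO subjects VALUES (?, ?, ?)",
    [("Machine Learning", 1, 1), ("Machine Learning", 1, 2), ("Databases", 2, 3)],
)

# Because all three tables now sit in one place, one query can join them.
query = """
    SELECT s.name
    FROM subjects AS sub
    JOIN students AS s ON s.id = sub.student_id
    JOIN professors AS p ON p.id = sub.professor_id
    WHERE sub.name = ? AND p.name = ?
    ORDER BY s.name
"""
rows = warehouse.execute(query, ("Machine Learning", "Andrew Ng")).fetchall()
ml_students = [name for (name,) in rows]

print(ml_students)       # ['Mohit', 'Riya']
print(len(ml_students))  # 2
```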
Then what is the difference between a Data Lake and a Data Warehouse?
A Data Warehouse generally deals with Structured data
and a Data Lake is for future use
and stores all 3 types of data
Now, an interesting question I was asked in an interview
is
Can a database, or RDBMS, store Unstructured data?
My answer was no
and I was rejected because of it
We will see in detail why
So far we have seen
the difference between a Data Lake and a Data Warehouse
A Data Warehouse deals with Structured data
and a Data Lake deals with all 3 types of data
Can we store Unstructured data in an RDBMS?
As I told you earlier, my answer was no
Actually, we can store it
in the form of a BLOB
or a CLOB
BLOB stands for Binary Large Object
and CLOB stands for Character Large Object
So Unstructured data can be converted into these two formats, or other formats as well,
and stored
For example, an image can be converted into a Binary Large Object,
that is, into binary form,
and then stored in the database
Similarly, a text file can be converted into a CLOB, or Character Large Object,
and then stored
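To see that an RDBMS really can hold binary data, here is a minimal sketch using SQLite through Python's sqlite3 module. The table name, file name and the bytes used as a stand-in image are made up for illustration, and other databases have their own BLOB and CLOB types and limits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT, content BLOB)")

# Stand-in for the raw bytes of an image file (not a valid image).
fake_image_bytes = bytes([137, 80, 78, 71]) + b" placeholder pixels"

# Store the bytes as a Binary Large Object.
conn.execute(
    "INSERT INTO images (name, content) VALUES (?, ?)",
    ("cat.png", fake_image_bytes),
)

# Read it back: SQLite returns the BLOB column as Python bytes.
(stored_bytes,) = conn.execute(
    "SELECT content FROM images WHERE name = ?", ("cat.png",)
).fetchone()

print(stored_bytes == fake_image_bytes)  # True
```

Note the round trip: the bytes go in, and they have to be read back out and decoded before the image can be used.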
Then why don't we use this?
Because of competition, every database tries to support everything
but this is not efficient
If we save anything as a BLOB
for example, if one converts an image into binary and then stores it
then, when one is doing Machine Learning on thousands of images,
every image has to be converted into binary and then stored
And when you need to use it,
you'll need to retrieve it, convert it back into image format, and then use it
That's a very inefficient process
That is why storing Unstructured data in an RDBMS is not preferred
and we usually use a Data Lake like Hadoop, or other systems,
to process Unstructured data
Now, what is the difference between Data Mining and Machine Learning?
Basically, Data Mining is done on a Data Warehouse
which usually deals with Structured data
and Machine Learning can deal with any data
Both Machine Learning and Data Mining
are statistical and algorithmic methodologies
that differ in the types of data they deal with
Machine Learning can deal with any type of data
but Data Mining is limited to Structured data
Today we learned
the What, Why and How aspects of Big Data
We saw what the 5 Vs of Big Data are
why we need Big Data
and what Volume, Velocity, Veracity and the other Vs of Big Data are
Then, under Variety, we saw
what Structured, Unstructured and Semi-Structured data are
what the difference between a Data Lake and a Data Warehouse is
how to store Unstructured data in an RDBMS
and why we usually do not store
Unstructured data in an RDBMS
and what Data Mining and Machine Learning are
We saw the differences between all of these in detail
If you liked this video
please Like, Share & Subscribe
and I will meet you again with another interesting video
Thank you.
