Hive is a data warehouse infrastructure which is used to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy. This is the core of this Hive tutorial.
Now, before we go ahead with this section, I would like to inform you that we have launched a completely free platform called Great Learning Academy, where you have access to free courses on topics such as AI, cloud, and digital marketing. You can check out the details in the description below.
So let's talk about Hive, which is what you people are probably most interested in. Usually I'm not a slides person, but for Hive I need some slides. This deck has around 60 slides. I will not run through all of them, don't worry — I'll share it with you and you can go through it. I'll run through a couple of them to give you an idea, because laying everything out verbally would be difficult for Hive. Hive itself is actually very simple.
In 2005, when Hadoop came out, most of the major vendors started using Hadoop — Facebook, Yahoo, and others — but they all faced one common problem. Yahoo started using Hadoop, but Yahoo had a lot of developers who did not know Java. The problem was that they had to write MapReduce programs in Java, so the skill set was the problem. You can't say that everybody should learn Java and start writing code, so Yahoo found it very troubling. At the same time, Facebook also started using Hadoop, and they had the same problem: Facebook had mostly SQL developers, and they were not comfortable with Java. Back in 2005 and 2006 we had only MapReduce; there was nothing else on Hadoop. Plain MapReduce, nothing else. So they had to keep writing MapReduce — that was all that was available — and everybody was stuck.
So Yahoo and Facebook started two projects at the same time. Yahoo created something called Pig — yes, the creature, the pig, the animal; that's its mascot: Apache Pig. Yahoo created this tool, and it's a scripting language. I teach a course on Pig on Hadoop — it's a very interesting scripting language, and it is very simple. You learn this language, and then whatever analysis you want, you write in that language. It's a script, so it's very small — like four lines, max ten lines. Once you write your Pig script and say run, it will convert your Pig script into a MapReduce program, so you don't have to learn MapReduce.
You learn Pig. Then, somewhere in 2010 and 2011, there was a heavy inflow of jobs for Pig developers — around 2010, 80% of Hadoop jobs were handled by Pig. It was such a success as a language that everybody was dying for Pig. I have given many Pig trainings, because that was the effect of Pig: it's a very easy scripting language. You just learn it, and everything happens behind the scenes. If you want to do a join operation, you say JOIN A, B — that's it. Even SQL code is more complicated: in a join you have to say A dot something equals B dot something. In Pig, nothing — you just say JOIN A comma B, and it joins them. Simple enough. So people started using Pig extensively.
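As a rough sketch of that comparison (the table and column names here are invented for illustration, not from the class dataset):

```sql
-- SQL-style join in Hive: you must spell out the join condition.
SELECT a.id, a.name, b.amount
FROM customers a
JOIN orders b
  ON a.id = b.customer_id;

-- The equivalent idea in Pig Latin is closer to plain "join A, B":
--   A = LOAD 'customers' USING PigStorage(',') AS (id:int, name:chararray);
--   B = LOAD 'orders'    USING PigStorage(',') AS (customer_id:int, amount:double);
--   C = JOIN A BY id, B BY customer_id;
```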
at the same time your Facebook developed haIf
hi Use idea is different You write sequel
okay You create your data basis your tables
and all the blah blah blah you write
Sequel on High will convert your sequel query
into my produce Same concept So to companies
Devlet one big and one haIf they didn't
know that they were doing the same thing
But then what happened? Somewhere in 2010-11, Pig started losing traction, little by little, to Spark. Spark came and became predominantly used. Spark code is also very short, like a script — you can write it in Python and so on, very easy to write. So Spark and Pig are almost similar. And people thought: if I'm learning something very short, interesting, and very fast anyway, why should I learn Pig? I can learn Spark itself. So Pig is no longer used by anybody — quite literally, Pig is gone, almost. Hive is not gone, because it is SQL. SQL will be there as long as the world exists, no questions asked. How do you get rid of SQL? What would happen to all the ETL developers? And there's a huge community to back SQL — not like Pig. I actually like Pig more, but there is nobody to back Pig, while there are lots of people who back SQL.
And that is not the only reason. Hive is a beautiful data warehouse — it is treated as the data warehouse of Hadoop, meaning you can create tables and you can query the data. So most of the visualization tools and so on can use Hive as the backend to fetch their data. That's how Hive became very popular. And now what is happening? Hive is there, plus Spark is using Hive. In Spark you have a module called Spark SQL that uses the SQL language; that module can process data itself, and it can also talk to Hive, so whatever data you have in Hive, Spark can process. So Hive is predominantly used these days, and it will always be very popular — I think it is one of the most popular ecosystem tools in the industry, because it's been 10 or 12 years and the community is still very active. And it is very simple, because it is SQL. I don't think I will repeat SQL for you — that's not the idea behind this session.
You already know SQL; we will just see what is happening behind the scenes when you write a SQL query, plus some of the advanced concepts in Hive — partitioning, bucketing, indexing, these kinds of things — along with points from industry experience.
One major thing: there are two major vendors, Hortonworks and Cloudera. Hortonworks uses the original Apache Hadoop — that is their selling point. So the Hive they ship is also the Apache project; it's open source, and the major changes in Hive come predominantly from Hortonworks, not Cloudera. Cloudera is actually not so strong on Hive. I'm not blaming them, but Cloudera is a bit behind on the Hive side. I'll tell you the reason — I teach on Hortonworks, and that is how I'm able to tell.
In Hortonworks they have a project called Tez. Some folks from Delhi created this project. Tez is a Hindi word — there is a movie by that name too, have you heard of it? Tez means "speed" in Hindi. So Tez is — how do I say it — not a replacement for Hadoop. Tez is a project that was created in India and then given to Apache, where it became open source. Tez is faster than MapReduce. What these folks did: normally in MapReduce, you read the data from disk, then you do the mapping, and then you store the output on the hard disk again. For the shuffle, you read the data, do the shuffle, and store the output back on the hard disk. Then the reducer reads from the hard disk again. So some folks thought: why don't we do all of this in memory? We don't want to use the hard disk. Read the data once, do all the map, shuffle, and reduce in memory, and then push out the result. With that, MapReduce becomes many times faster — that is Tez. It is based on a DAG — directed acyclic graph — concept.
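On a platform that ships Tez, switching Hive between engines is just a session property (the property names are real Hive settings, but whether tez is actually available depends on your distribution):

```sql
-- Run this Hive session on Tez (Hortonworks-style platforms)
SET hive.execution.engine=tez;

-- Fall back to classic MapReduce
SET hive.execution.engine=mr;
```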
This was given to Apache. And what Hortonworks does: if you write a Hive query, they will convert it into a Tez job, not a MapReduce job. So Hive queries are very fast on Hortonworks. On a Cloudera cluster, Cloudera doesn't support Tez, because Tez is contributed by Hortonworks, and Hortonworks is their competitor, right? So if you're working on a Cloudera cluster, you will never see Tez in your life — they won't allow it. The problem is that if I write a Hive query on Cloudera, it's very slow, because it gets converted into MapReduce, and a MapReduce program takes a lot of time. The same Hive query on Hortonworks is at least five times faster, because it is converted into a Tez job by default. Tez is superb; you can't compare.
So that is one drawback. When you start working in the industry — these days people mostly prefer Hortonworks, for a variety of reasons. But the market share is with Cloudera: the number of users is higher for Cloudera. Over a period of time, though, people will probably shift to Hortonworks. So you might be thinking: from Cloudera, is there no answer to this? No — they have Impala. There's a product called Impala, Apache Impala, and the contribution to it is mostly from Cloudera. Impala does the same thing Tez does: in-memory, fast, real-time queries. I can show you Impala — we have Impala on the Cloudera cluster — I'll show you if you want.
Tez itself, as a project, is — how do I say it — not really that important right now, because since Spark came, most of these things are fading, including Tez, since Spark is also in-memory, the same in-memory idea. So once Spark came, Tez's relevance went down. The only remaining relevance is that if you run a Hive query, it gets converted into Tez jobs and runs faster on Hortonworks platforms. I will show you a Hortonworks cluster, probably tomorrow — I have access — and I'll show you a Tez job running; you'll see the difference, actually.
But there are a lot of companies that invested heavily in Hive — they run on Hive and only Hive. For them, moving to Spark or going to the cloud can be a bit challenging, whereas Hortonworks basically gives you Tez for free. Also, Tez is fault-tolerant; Impala is not. I'll discuss Impala tomorrow, probably — it's too early for this class — but there are SQL engines, too many actually, in this category. There is a tough competition between all of them, and it's all marketing gimmick. Go to Cloudera and they will say Impala is the best in the world. Never listen — Hortonworks salespeople will say the same about their stack. And recently, Hive announced
something called LLAP — Hive LLAP. LLAP means real-time queries using Hive. Hive is normally batch processing, because it converts queries into MapReduce-like jobs, but recently Apache announced that Hive can support real-time queries — that's called LLAP. The real abbreviation I don't know offhand; we used to say "Live Long and Process," though that may not be the official expansion. It is still in beta, not fully stable, and the problem is that right now LLAP is available only on Hortonworks. It is a contribution to open-source Hive, so we just have it available on Hortonworks first — though Cloudera could go back and get it, since it's open source. So for now they're still with Impala; Impala is the only solution they have.
So when you talk about Hive, there are a lot of things, but you should be aware of what's happening in the industry. These are the reasons why people are shifting towards Hortonworks — some of the major things I see. That doesn't mean Cloudera is bad, okay? It's not like that. They also have great support — their technical support is excellent, and even with Impala there have been a lot of tweaks, actually. So if you have a problem, they will give you a solution. But a lot of people are preferring the open-source initiative — Hortonworks and all are open source. The problem is, as I told you, I was with a client recently; they were using the latest release, and most of the time some of the queries wouldn't work. It is open source, right? If something doesn't work, you can't blame the vendor and expect them to fix it the way they would a proprietary product of their own. So with open source you get all the latest features, but whether everything will work or not, you cannot guarantee — that's one thing. The vendor will point internally to Apache; they say they will give you technical support within limits, because originally the product is by Apache, so Hortonworks can't fix everything directly — it's an open-source community. Most of the things work — I'm not saying that everything breaks — but open source has its own drawbacks sometimes. Still, I am seeing a lot of people migrating to Hortonworks these days.
So in Hive, what you need to understand is this: it provides a SQL-like interface on Hadoop. Now, a very important point — Hive does not have storage. It uses the data in Hadoop. For example, you copy a CSV file into HDFS. What I can then do is go to Hive, say create a table, and give the schema. Then I say: load this CSV straight into the table. What is going to happen? The CSV will still be in Hadoop — it stays on HDFS, as blocks and replicas and all — but in Hive you just project a structure onto it. So you get a table, and once you get a table, you can do any SQL — well, most of the SQL. The idea is to present a structure over the data that is already available in Hadoop. Hive can create tables from various formats — structured data, even log files, semi-structured data, JSON files; different types of files can be read by Hive and put into a table format. That's the importance of Hive. So it's true that Hive is Hadoop's data warehouse system: the data itself sits in HDFS, and Hive just gives a table-like structure on top of the data. Hive will maintain only the schema, because the data is already in Hadoop. So you don't really have to move the data — when they say "load the data," the data just stays in Hadoop. I'll show you how to load the data — I mean, how to create a table.
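A minimal sketch of that flow (the table name, columns, and file path are made-up placeholders, not the class dataset):

```sql
-- Project a schema onto data; Hive stores only the schema, not the data.
CREATE TABLE transactions (
  txn_id   INT,
  txn_date STRING,
  amount   DOUBLE,
  category STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Point the table at a file already sitting in HDFS.
LOAD DATA INPATH '/data/transactions.txt' INTO TABLE transactions;
```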
If you want to run a query, you can directly write your SQL — say, select count star from the table. But here's the thing: that will hit Hive, Hive will read the query and convert it into a MapReduce program — it creates a jar file — and that will take some time. Then it runs as a MapReduce job. And the funniest part is that Hadoop never understands what Hive is. From Hadoop's point of view, there is nothing called Hive; it understands only MapReduce. So whether you write the MapReduce yourself or Hive writes the MapReduce, it's the same thing. If you wanted, you could write a MapReduce program yourself to query structured data — but it's very difficult; that's why we use Hive. Hive translates the query, creates a jar file, pushes it to the cluster, and the cluster runs a regular MapReduce program and shows you the output. You can save the result in a table or do anything with it. You can use all the regular SQL statements — you can say insert into our table from this table, and if the result matches the schema, it will insert, the same way. The only thing Hive does is project a structure. Hadoop is HDFS only — everything is on HDFS — and it is slow. Hive is expected to be slow; that is one thing. Hive is considered to be a batch-processing engine, so it is not a database.
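You can actually see this translation happening. A quick sketch (the table name is a placeholder; EXPLAIN is standard HiveQL):

```sql
-- Show the plan Hive generates instead of running the query;
-- on a MapReduce-backed cluster you will see map/reduce stages in the output.
EXPLAIN SELECT category, COUNT(*) FROM transactions GROUP BY category;
```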
Another common problem is that people may not be able to differentiate between a database and a data warehouse. They are different, right? A database is OLTP — what we call transactional, real-time. You run a query and it returns in milliseconds, at sub-second speed — that's a database, and you always have insert, update, all those kinds of operations. A data warehouse is different: it is a place where all the cleaned data comes and lands, and from there your reporting tools can connect and visualize the data.
So Hive is a data warehouse where you store your final data. Let's say you have loads of tables; you ran queries and got the final output. You still store it in Hive so that your visualization tools can access it afterwards. The data is stored in Hive, but it's actually in HDFS — meaning, if I open Hive and say select star, I will see a table; if I go to the location and open the file, it's your text file, the same thing. So the text file will be there; if I open it in HDFS, it's the same text file, but Hive will show it to me in a nice row-column format.
Another thing is that Hive's language is actually called HQL — Hive Query Language. It's not SQL, but it is based on the SQL-92 syntax, so most of your queries will work. Technically, though, it's called HQL, Hive Query Language. So the idea is to familiarize ourselves with Hive; then, depending on our business use case, we can write the SQL queries — you can write complex queries, simple queries. You need to have some idea of what SQL queries look like — what a join operation does, what a group-by query is — so you understand what is happening. Even a person who doesn't know SQL can learn Hive; by looking at a query, you can understand the intention of the query.
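For instance, a couple of ordinary queries that read the same way in HQL as in SQL (table and column names are invented for illustration):

```sql
-- A group-by: total amount per category
SELECT category, SUM(amount) AS total
FROM transactions
GROUP BY category;

-- A join: attach customer names to their transactions
SELECT c.name, t.amount
FROM transactions t
JOIN customers c ON t.customer_id = c.id;
```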
Hive is not an RDBMS. Hive queries take minutes even for small datasets, so it can't be compared to databases like Oracle; Hive does not provide real-time queries. That is what the original Hive was: it kept converting into MapReduce, so it was very, very slow. Then came Tez, so Hortonworks started using Hive plus Tez, which is quite a bit faster. Cloudera doesn't use Tez: Cloudera still runs Hive on MapReduce, and instead they have this Impala. Impala is an in-memory execution engine, but it's available only on Cloudera. I will show you Impala, and the same query will be very, very fast there. Apart from that there is no difference — the syntax and everything is the same, actually.
Data warehousing, in the sense that Hive is usually used as a storage engine: what is a data warehouse? A data warehouse is where you store all your massive structured data, and the intention of Hive is the same. All the BI tools can connect with Hive. So if you have something like Tableau, for example — Tableau is a visualization tool — and I want to visualize some bits of data from Hadoop, I can connect Tableau with Hadoop, and Tableau will fire SQL queries through Hive, and the data can be rendered. That's a typical application of an OLAP system. Hive is an OLAP system — online analytical processing. It's not an OLTP system, so you can't compare it with Oracle, because Oracle is real-time. But again, Hive cannot replace a proper data warehouse either. If you're talking about something like Teradata, for example — those are, how do I say, expensive but fast and reliable data warehouses. Hive, you understand, runs on top of Hadoop, so it partly has all the drawbacks of Hadoop.
So what most organizations do these days is categorize their data into hot data and cold data. The hot data goes to something like Teradata, where they need to fetch the data immediately and run faster queries. The cold data — where it's okay if a query takes some ten minutes — goes to Hive. Hive supports insert statements now; that was not there before. Insert statements are there, but the problem is that each insert statement fires a MapReduce job. So you do a bulk load — that's better. In Hive, normally, when you load the data you do a bulk load; you don't insert row by row. The latest editions of Hive support all the CRUD operations — create, read, update, delete; everything is supported. But again, if you want speed, you should use LLAP; that's what I'm saying — real-time queries need LLAP, and LLAP now also offers full ACID transaction support. So I can say: if you are on a Cloudera-type platform, inserts and updates are literally very slow — not like Oracle, where inserts are very fast. But the question is, in a data warehouse you usually don't have to do inserts — very rarely — because you go by bulk-uploading the data, or your ETL jobs will be dumping the data: they collect data from the RDBMSs and then dump it into the warehouse. So you very rarely modify the data; the use cases are like that. So yes, all the CRUD operations are supported — even insert, update, delete — but they will be slow. On the Hortonworks platform you have ACID support enabled — transaction management — but it is not as reliable as a database, okay, because you can never replace an RDBMS with something like Hive. But you do have transaction management support. That again is LLAP — LLAP is the feature which gives Hive real-time performance. But LLAP also requires resources — more resources to execute. In return you get all the ACID properties, transaction management, everything.
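A sketch of what an ACID-enabled Hive table looks like on newer Hive versions (the table name is a placeholder; transactional tables generally require ORC format, and on older releases bucketing as well):

```sql
-- A transactional table that allows UPDATE and DELETE
CREATE TABLE accounts (
  id      INT,
  balance DOUBLE
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
DELETE FROM accounts WHERE id = 2;
```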
Now, this is very important to understand. There is a command-line interface for Hive — it's called the Hive shell. You can open the Hive shell and start typing your queries: say create database, create table, and then fire your query. That is one way to work with Hive. But people may or may not use it, because in production and so on you might have some client — some JDBC client that connects to the data warehouse, a SQL workbench, Toad, or something like that. You can use those.
So one thing is that you can use the CLI. And Hive is very vast, actually — there are two CLIs. The original CLI is called hive: you simply type hive, it opens, and you do all the blah blah blah. The new CLI is called Beeline. Beeline is a proper JDBC-based client. So through Beeline also you can open a CLI, or you can simply say hive and it will open the command line for you, and you can do all the create-table statements and so on. There is a web UI too, but it is very rarely used. The hive CLI is Hive's own Thrift client; it's not a JDBC client. What's the difference? When you want to connect from an application to your database, you need a driver — that's called a JDBC or ODBC driver; that's one way to connect. The original hive CLI is not a JDBC client: you just open it, it directly hits Hive, and you can type all the queries. Beeline, on the other hand, is a proper JDBC client, which means you can install it on some other machine; from there you can connect — you give the server address and port number, and it will hit the server.
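A sketch of a Beeline connection (the hostname and username are placeholders; 10000 is the usual HiveServer2 default port):

```shell
# Connect to HiveServer2 over JDBC from any machine that has Beeline
beeline -u jdbc:hive2://hive-server.example.com:10000 -n myuser

# Inside Beeline you fire the same HQL:
#   0: jdbc:hive2://...> SHOW DATABASES;
```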
Now, it is not written here, but there is something called HiveServer. This whole part is the Hive server part — for some reason it is not shown in the diagram, but this whole part is called HiveServer. Because if you're connecting to a data warehouse, there should be a server component that accepts your queries — whatever you are typing, right? That is this whole part. So whether you come through the CLI, the web UI, or JDBC, you all hit the Hive server. The Hive server is the component that accepts your query. Once you submit a query, there is a planner, a parser, an optimizer — it will optimize your query, convert it into a MapReduce program, and push it onto the Hadoop cluster. Within HiveServer itself you have HiveServer1 and HiveServer2. The old one is called HiveServer1, which nobody is using now — you don't have to bother about it. The latest one is called HiveServer2; that is what everybody is using. HiveServer2 is responsible for managing all your connections. So if you open a connection with Hive, you automatically hit HiveServer2, and from there, whatever query you write, this guy has a compiler, an optimizer, and an executor: it takes the query, compiles it, optimizes it, and pushes it as a MapReduce program.
I think I can show you that from here — if I go to Cloudera Manager, it should be here. It is very difficult to talk about Hive without showing something; talking is fine, but if I show something, people will probably understand better. So if I go to the Hive configuration — where is HiveServer2... So this is called HiveServer2; they call it HiveServer2 because there was an old HiveServer1 which nobody is using now, because of performance issues and so on. So right now, whether it is Hortonworks or Cloudera or any platform, the server component of Hive is HiveServer2. Either you come through a JDBC client or the command line, you are all hitting HiveServer2. From there it will compile your query, optimize your query, and push it. There are tons of optimization techniques in Hive to tune your query and make it better; probably we will discuss some of the optimization techniques tomorrow.
Another very important point in this slide is this metastore. Meaning: if I open a Hive session and I say create a table, then if I exit, it should not go away, right? Whatever you create should persist. That is where the metastore comes into the picture. That is where all the metadata will be stored. In a production setup, what we do is set up a MySQL server, and in that MySQL server the metastore will be configured. So all the metadata of your tables — the schema, access rights, everything — will be in the metastore. You have to ensure that the metastore service is running; otherwise Hive will not start, because it needs to talk to the metastore. This metastore is very important because even Spark will talk to it. If I create a Hive table, I can read it in Spark, because Spark can talk to the same metastore and get the data from there. All the metadata will be there — it's called the Hive metastore. The metastore is a service that stores and serves the metadata.
You have to give it a destination in the configuration. So it will be here — I'll show you where the metastore is; I don't know where they have configured it. If I go here, to the Hive configuration — Hive Metastore Server — the database type is MySQL. Right. So what we see is that on this machine they have configured MySQL, and that is where the metadata will be pushed. But you don't have any direct use for the metadata yourself — Hive will read and write it, that's fine. Hive uses it to persist all the table information, the databases you create, and everything. We don't have much direct use for it, but it is there.
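For reference, the metastore connection typically lives in hive-site.xml as JDBC properties like these (the property names are real; the hostname and database name below are placeholders):

```xml
<!-- Where the metastore keeps its metadata (MySQL in this setup) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```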
And in Hive you have something called the cost-based optimizer — CBO. Yeah — I don't know why it is disabled here; "enable cost-based optimizer" is disabled for some reason, I don't know. So this is CBO, the cost-based optimizer. You may be aware of it from the RDBMS world — in MySQL, say, if you write a very complicated query, it will generate multiple query plans for how to execute that query. That's the optimizer. What it does, if you enable it: when you write a very complicated query, it generates multiple plans, and based on the cost of each plan, it selects the best plan. That is the cost-based optimizer. It is disabled here — I don't know why; usually you can enable it. This is one way optimization can be done in Hive. So this is one thing, CBO.
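Enabling it is just a set of session properties (these are real Hive properties; whether CBO actually helps depends on the query and on table statistics having been collected):

```sql
-- Turn on the cost-based optimizer for this session
SET hive.cbo.enable=true;
-- CBO relies on statistics, so these usually go together
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
```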
We will look into other things also. Okay — all the configurations are here. To start Hive, you simply say hive — you simply type hive to start the Hive shell. You may get some warning; that's fine. You will very likely see a warning saying that the Hive CLI is deprecated and migration to Beeline is recommended. That's okay. What it actually means is that this is the original Hive CLI, and I told you that Beeline is the new client — so they're saying use Beeline wherever possible. That's fine. I mean, to learn Hive, I think the best way is this CLI. They have added some more commands on the Beeline side, actually, because support for the old Hive CLI is winding down. But for learning purposes it's all the same, whether you use the CLI or Beeline. I'll show you how to use Beeline as well.
the B line Or so um so this
is the high shelf What you open by
on its very easy to get started with
high All you can do is that you
can start by creating and date of issues
they create they Davies So first thing is
that we create something called a database that
is the highest level of abstraction So I
said create database I don't know something may
19 so I just created our databases So
the commanders create database on the later these
names I'm just calling us may minding on
If I'd wash shore data basis you can
see a lot of babies out there People
have creator I see Here these are all
the databases that God created inside High on
I minus may 19 All right you could
have created something else Like whatever I mean
you remember And you have to say Use
the Swiss there baby You are to say
use May 19 These are regular recommends If
you're aware of secret there is a way
to show that it will display the baby
name set Uh hi There is a property
where if you can enable it to see
the baby I'll search for it You can
even put in the column headers So now
I believe in you say select star or
something It won't show the column headers The
column name You have columns and the names
like it won't you But you can make
it show there is a command toe Bring
that So first will make sure everybody's on
the same page Hi but And current baby
said hi You see a life brand current
leading So this is how you can print
your TV You simply say set high You
see a life print current baby true and
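That is (both are real Hive CLI properties; they only affect how the shell displays things, not the query results):

```sql
-- Show the current database in the hive> prompt
SET hive.cli.print.current.db=true;
-- Related: show column headers in query output
SET hive.cli.print.header=true;
```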
Can you try this? Is everybody able to log on to Hive and run it? Anybody having problems? You don't have this in Python, right? No — in Python you have indentation, and indentation is very complicated, actually: you miss one indentation and nothing will work in Python. I know it's very difficult for people coming from Java, who learned with curly braces; here, when you start a control structure, everything is indentation. Anyway. When you normally start Hive — I said use may19 earlier — it won't show me which database I'm in. Now it is displaying it here. See this? Now it shows in brackets what your current database is. So it will display which database you are in, and if you change it, it will show the new one.
Another important point: inside the configuration directory there is a file called hive-site.xml — hive, hyphen, site, dot xml. That file has around a thousand properties, and the whole of Hive is controlled by that file. These properties we're setting are actually from that file. Meaning, either you can do it like this — open a session and say set hive-whatever to true — but if you exit and log in again, it will not print the database again; you have to type it every time. Or you can ask your admin to change this property in hive-site.xml, so that the database is visible for everybody who logs in. So that file has a lot of properties. I'll try to find it for you. The problem is that I was on Hortonworks for the past 10 or 15 days, and I'm finding it a bit difficult to navigate Cloudera. But it will be here, in the configuration.
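Making the setting permanent means adding it to hive-site.xml, roughly like this (a sketch; your admin would edit the file under the Hive conf directory):

```xml
<!-- Show the current database for every session, for every user -->
<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>
```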
Or do one thing — now it's easy. What you can do: open one more command prompt, or the web console. If you navigate to this location — this is common for Cloudera, Hortonworks, everybody — you'll see a Hive conf folder, and in it you will have a file called hive-site.xml. This file controls Hive end to end — all the properties. And if I open it, you can see all the properties. See: hive.metastore.uris — this is the metastore URI, remember, where the connection to the metastore is. Then some other things are there — auto-convert join, and so on. How do you search in this editor... there, this one. Then I can type. What did we set just now? hive.cli... hmm, "pattern not found." It should be there, right? I'm just wondering — what if I search for "db"... "print current db" should be there... let me just search for "hive." Probably not — I don't know. So you can see which properties are in this file; probably only some of them are.
Another thing is that you are accessing this file locally, from this machine. If I go to the Cloudera Manager configuration and search for "db," it should be there somewhere. I think some of the properties are actually not visible to us, but all of these settings come from hive-site.xml — all these things will be in hive-site.xml, actually. So that is how you can print the current database.
Well, that's not really our goal here. What is our goal? Let me show you the data that we have. If you go to this folder, you will see two files. Now, once you start loading data, that part of Hive is a bit tricky. Even though the queries are very simple, people find it hard to follow — it is like watching a Christopher Nolan movie: you won't understand what happened until the movie finishes, and even then you may not understand what happened. So the point is, I will write it out first, then probably we will do it.
probably we will do it Okay So you
will create a table that everybody knows this
you will do in have your say create
able blah blah blah It'll create a table
Fine Then it's next up is that you
want to lower the data into the table
The data can be Lord er from local
file system More edifice It's a bit tricky
So if you create a table and you
want to put some data right the data
can be loaded from her Dube all from
your Lennox I system Right Okay And I
will show you what will happen How to
do both this on what is going to
happen if you do both this What will
happen to your data and all these things
like so we'll go step by step otherwise
people get confused a lot in the state
It's actually very easy But end of the
right people get confused a lot Where is
my date of what happened to my data
Sometimes you will get local or so So
what happens Thes India Courson Or sometimes they
dump it into some local machine You can
lower You can say hi Take it from
my Leanness number Tinto So it is actually
coping from limits So how do But there
is a way to do it That's what
I wanted to show So let's look at
So let's look at the data; we'll also do some small analysis. Can you open this file? Actually, let me download it real quick; the file is already available in your data set. Can you see this file, transactions.txt? It will be here: you will have a code-and-data folder, inside that a hive folder, and inside that I think it's a .txt file. For me the extension is not visible; just give me one moment, I'll install Notepad++. So what we will do is: first we will copy the file to the local file system, that is Linux; then we'll create a Hive table; then we'll say load the data into the Hive table, and we will see what is happening in that process. There are two files, actually; the other file I'll copy to HDFS, load it from there, and show you what happens in that case too. Okay, I just wanted to show you this first. Yes, so this is the transaction data. If you look at this data, I'll just take one line and explain it so you can understand. If you take this line: the first column is the transaction ID. Then you have the transaction date. Then you have the customer ID. Then you have the amount spent, $88. Then you have the category of item he bought, sports, and what he actually bought, a baseball; the city, Salt Lake City, the state of Utah, and "credit". So it's a typical sports-store transaction: what category, what item, what amount, credit or debit. And you have a number of lines like that. So this is one data set we have; it's transactional data, and we want to create a Hive table over it. That's one thing. The second data set: there is a file called cust; can you see this? I want to open it with WordPad. Yeah, so this is the customer data. The first column is the customer ID, then first name, last name, age and profession: Kristina Chung, aged 55, is a pilot, like that. So you have a store's customer data and you have its transaction data, and what we want to figure out is the total amount spent, age-wise, by the customers. For example, I want to know, in my store, what is the total amount spent by people aged between 20 to 30, 30 to 40, and 40 to 50. That is the analysis we want to do. In one data set we have all the transactions, how much everyone is spending; in the second data set we have each person's name and age. So we'll do a join, then we will apply a CASE statement and figure it out. That's what we're going to do. The SQL analysis itself is not the intention; the intention is to show you what is happening behind the scenes and how Hive is handling all this.
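To make the two layouts concrete, here is roughly what one row of each file looks like (illustrative values only, assembled from the fields just described; the real files may differ):

```
# transactions.txt: txn id, date, customer id, amount, category, product, city, state, payment type
00000018,01-23-2011,4002444,088.00,Sports,Baseball,Salt Lake City,Utah,credit

# cust.txt: customer id, first name, last name, age, profession
4000001,Kristina,Chung,55,Pilot
```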
Okay, so now what I want you to do first: we will go local only. We will copy the data to the local file system, and from there we will analyze it. So I want you to copy these two files to our Linux system. How do you do it? You open the FTP client. I'll just say FTP, log in, and hopefully it works for me. Upload these two files: one is called cust, c-u-s-t, the other one is called txns. And I need to open one more window; luckily for me it is available, you'll have to figure yours out. So use FTP and copy the files to Linux, not to Hadoop; don't upload to Hadoop. I'll just wait if you need some time. And no, you don't have to change the extension. If it is in text format, Hive can access any text file; you only have to mention the delimiter, comma or whatever it is. Hive can handle text, CSV, JSON, XML I believe, and Parquet; there are multiple file formats it can handle. By default it will read plain text, but if you're having JSON and all, that is a different story.
So, are you guys familiar with something called a SerDe? Have you heard about SerDes? No? SerDe stands for serializer/deserializer. It's a very common mechanism. What happens is this: let's say you have a JSON file. You know what JSON is, right? Key and value, key-value pairs. So if I have a JSON file, I can ask Hive to create a table from it. But the problem is that JSON is semi-structured data; it's not properly comma separated or anything, and Hive will say: I don't know what you're talking about, I can't find a schema. To handle that we can use something called a JSON SerDe. You download it; it's a jar file. I don't know whether the latest Hive has it built in; if not, you have to download a jar file called the JSON SerDe. You add it into Hive, and then you create the table using that SerDe. Using the SerDe, Hive parses the JSON: the key becomes the column, the value becomes the data, like that. So SerDes are used to, what do they say, read semi-structured and unstructured data. I will show you an example with unstructured data where you can use something called the regex SerDe: regular expressions you can use to parse the data. So Hive supports a number of SerDes, actually, to read different types of data; JSON I will show if I have time. Those are open-source projects too, most of them. Internally nothing magical happens: the JSON SerDe reads the key and the value and maps that structure onto a table. When Hive looks at it, Hive knows it's a key-value pair; it says the key is a column name and the value is the data, and applies that throughout. So it is reading the semi-structured data and giving us structure; it is a parser. We have parsers, right? It's a parser, actually. So, this part, are you able to follow, everybody?
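As a rough sketch of what that looks like in practice (the jar path here is hypothetical; the SerDe class shown ships with Hive's HCatalog in recent versions, while older setups may need a separately downloaded JSON SerDe jar):

```sql
-- Make the SerDe jar available to this session (path is an example)
ADD JAR /tmp/json-serde.jar;

-- Each top-level JSON key maps to a column; the SerDe parses each line
CREATE TABLE events (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
```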
Now, find this file: you should have a file called commands. Can you see whether it is there? Just "commands"; in the hive folder you will have a file called commands. No? Okay, that's fine; the commands we have typed or copy-pasted, so we don't need that upload. For me this data is in the folder called Raghu; that's where the data and other bits are. Let me close this. Now we can actually create our table. So let me just copy this. You can skip the CREATE DATABASE, because the first two steps we have already done: we created a database and then we said USE it. Okay, so the first thing you need to do is this command, and I'll explain it. Just copy this CREATE TABLE, go here and paste it, and add the semicolon at the end. Before you hit enter you have to add a semicolon, otherwise it will not work. Hit enter. Yeah.
So this is the typical CREATE TABLE command in SQL. It says CREATE TABLE, then the table name is txnrecords, or whatever name you want to give, and within the brackets you are mentioning the schema of the data. Hive supports simple and complex data types, for those who are aware of data types: you have simple data types like integers and strings, and it also supports complex data types like maps, structs and arrays. They're all supported, but here we are starting with something very simple. The important point is the schema I'm mentioning: the first column is a date, and Hive even has a data type for dates, but I'm not using it, I'm saying string; Hive does support date handling and all. After that, the most important part here is ROW FORMAT DELIMITED FIELDS TERMINATED BY comma. What that means: Hive is expecting the data row by row, and the delimiter is a comma, which means ideally you can load only comma-separated files. I'll tell you what will happen if I load something else, not using commas; I'll show you later. But right now we're simply creating a table. So now this table is created. And all the tables you create in Hive can be accessed through Hue; I'll show you that, and there you have auto-complete, mostly. If you're working with the Hive CLI there is no auto-complete; you have to type everything manually.
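Put together, the statement being described looks roughly like this (the column names are assumed from the data walkthrough above):

```sql
CREATE TABLE txnrecords (
  txnno    INT,
  txndate  STRING,   -- a date, but stored as STRING here
  custno   INT,
  amount   DOUBLE,
  category STRING,
  product  STRING,
  city     STRING,
  state    STRING,
  spendby  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```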
Now is the important thing: how do you load the data? There are a lot of things I want to talk about around loading the data. First, identify where your data is. In my case the data is in this location, so just make sure you understand the location of your data. In my case this is the location; I just copy it. The data is txns.txt, and I will say LOAD DATA LOCAL INPATH into the table. So this is the command. The path I found by going to that location and doing a pwd; this is where I have the data, and it is on Linux, only Linux. And it is very easy to check: if you make any mistake in the path, it says the path cannot be found and it won't load; if the path is correct it says loading data. So it's easy to tell whether the command is correct or not. In my case it says loading data, so that means it is working fine for me. Are you also getting the same thing? Make sure it's a local path, okay? People get confused here: it is a local path. Note that here we have not done any analysis; we just loaded the data, and from here you can fire SQL queries on Hive. We will be writing SQL; you don't have to write Java programs, you don't have to know Java; Hive will create the jar file, actually. Also, if you mention a folder instead of a file, it will load all the files in the folder, so mention the exact file. Somebody mentioned the folder; there were two files in it and it loaded everything. Since it is a data warehouse it just appends the files, but that can be a problem. And the file size, in bytes, I think is around 4 MB; 4 MB is the size of the file, actually.
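So the load step, with a path from my machine as an example (your path will differ):

```sql
-- LOCAL means the path is on the Linux file system, not HDFS.
-- Hive copies the file into the table's warehouse directory.
LOAD DATA LOCAL INPATH '/home/cloudera/txns.txt'
INTO TABLE txnrecords;
```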
Now, this loading part is actually easy; you're just loading the data. If you want to check, you can simply do a SELECT * FROM txnrecords, but with a LIMIT of 5. Can you try this? Do a SELECT * from the table with a LIMIT of 5; don't simply run a bare SELECT *, because there are around 50,000 records sitting there. Do a LIMIT; I want to see the top five rows. That's the command; this is just to make sure the data is loaded. Now, you can load the data from local as well as HDFS; I'm showing the local way first and I'll show you the HDFS way too; both are possible. I ran a SELECT * to select everything from the table, and I said LIMIT 5, meaning show me the top five rows. Now, if you actually want to see a real query run, you can write any query, it doesn't matter, but I can simply say SELECT COUNT(*) FROM the table. COUNT(*) is a query, and as you can see, the moment I run it, it is firing a MapReduce job. See? The query fires a MapReduce job because Hive is converting it into MapReduce: map 0%, reduce 0%. And this cluster is really good; what I mean is that we have around 10 nodes in the cluster and only around 20 people using it, with very little data, so it is very fast. Normally it is very, very slow; this speed is unexpected. If Hive is fast, there is some problem; you have to troubleshoot! So this means the table exists. Next I'm showing SELECT * FROM the table LIMIT 5: I just want to verify the top five rows, because there are 50,000 rows; a bare SELECT * would display all 50,000 and I don't want that, I just want to see five. Now, a plain SELECT * will not fire a MapReduce job, because it's a simple read; it doesn't need to do any calculation. If you do a SELECT *, it won't fire a MapReduce job; it will simply show you the output. So just let me know: are you able to do all this? You should do a SELECT COUNT(*); it should fire a MapReduce job and you should see the output.
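The two verification queries side by side, with the behaviour just described:

```sql
-- Plain read: no MapReduce job, rows are streamed straight from the file
SELECT * FROM txnrecords LIMIT 5;

-- Aggregation: Hive compiles this into a MapReduce job
SELECT COUNT(*) FROM txnrecords;
```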
One more thing I want you to do. In that commands file I have given you, it might be slightly wrong; not wrong exactly, and that's why I don't copy-paste but write it out here, because I use that file for a different session as well. If I open it here, it says LOAD DATA INPATH; see, the file says LOAD DATA INPATH, but what we actually ran is LOAD DATA LOCAL INPATH. So don't run the same thing; correct it in the file too, so that whatever you ran here is reflected there. Otherwise, tomorrow you try it and it may not work the way you expect. A bit about the MapReduce output: when Hive is firing a MapReduce job, the mapper and reducer logs are just being displayed. You can suppress that; there is a way, but normally if you submit a job through MapReduce it displays the same thing, mappers and reducers, and finally you see the result. That is how it works in Hive: even though it is a Hive query, it is not really a Hive query; it's a MapReduce program underneath. That's why it shows map and reduce, and this is good for troubleshooting; sometimes it will show you errors and warnings while it is running, and I'll show you how to enable more of that. Right now, when you're doing a SELECT *, if you want column names, set hive.cli.print.header equals true. Okay? Now, a COUNT(*) will definitely fire a job, because it is analysis, it has to compute; a plain SELECT * will not, because it is just reading the records. So are you able to load the data and at least run the basic queries? That's okay.
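That header setting mentioned just above, for reference (session-scoped, like the earlier set commands):

```sql
-- Show column names above query results in the Hive CLI
SET hive.cli.print.header=true;

SELECT * FROM txnrecords LIMIT 5;  -- output now includes a header row
```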
Someone is asking: can you see the code, how Hive converts the query into a job? No, I have not found a way to see that generated code; that logic is internal to Hive, how it converts the query into a job. We carry all of that in our heads instead, for example how to write a join in MapReduce. A plain SELECT statement doesn't get converted, it's simply reading; but a WHERE condition and the rest are converted internally; there has to be some logic where everything becomes key-value pairs. So one drawback is that Hive is good, but you will not know what is happening behind the scenes; at the end of the day, how the code is generated you will not see; you will only see the result. But the advantage is, if you know SQL, you can just fire the query: hello to Hive, and Hive talks to Hadoop, because it sits on Hadoop, and none of that machinery is really visible to you.
So now the important point. I don't just want to show you running queries; once you have this table you can write all the SQL queries you like, but that is not my intention. The real question is: where is the data? First question. You loaded the data from local, meaning from Linux. The first thing you have to understand is that the data must end up in Hadoop. So when you did LOAD DATA, it has gone into Hadoop, that's for sure. But where in Hadoop? To understand this, you have a DESCRIBE command. So let's say DESCRIBE txnrecords. This is a very common command: if you DESCRIBE a table name, it shows you the columns and data types and so on. And you can also say DESCRIBE FORMATTED and the table name; DESCRIBE FORMATTED gives you more details about the table you created, and if I hit enter there is a very interesting thing to see. It says the Table Type is MANAGED_TABLE. Any table you create in Hive is either a managed table or something called an external table, and by default everything is a managed table. I'll tell you what I mean by managed table, but you can clearly see it says it's a managed table: point number one. Point number two: it displays something called Location, and it says the location is user/hive/warehouse. So I want you to do one thing.
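The two describe variants being demonstrated:

```sql
-- Columns and data types only
DESCRIBE txnrecords;

-- Full detail: among other things, Table Type (MANAGED_TABLE)
-- and Location (an HDFS path under /user/hive/warehouse)
DESCRIBE FORMATTED txnrecords;
```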
Go to your Hue file browser; can you get to it? In Hue, go to the user folder; there's a folder called user. Can you go there? You can see that a number of user folders are there. Okay, so do one thing: type /user/hive. If you simply type it here, /user/hive, you will land on this page. What happens is that when you install Hive, there is something called the warehouse folder; Hive asks for it to be created, and normally, on all Hadoop clusters, it will be /user/hive, with a folder called warehouse inside. If you open this folder, you can see all the DBs people have created, and you can open your own DB. My database was may19, so this is my database, may19. Open that, and there is a folder called txnrecords: this is nothing but your table. Open this, and you have the data. So try to find your own data yourself. My point is, this is what happens: whenever you create a DB in Hadoop, inside this folder, /user/hive/warehouse, Hive creates a folder with your DB name; when you create a table, a folder with the table name; and when you say load the data, it simply copies the data in there. Are you able to reach this place?
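Incidentally, the same warehouse layout is visible from the command line; a sketch, using my database and table names (yours will differ):

```shell
# Each database is a folder, each table a sub-folder, the data files inside
hdfs dfs -ls /user/hive/warehouse
hdfs dfs -ls /user/hive/warehouse/may19.db/txnrecords
hdfs dfs -cat /user/hive/warehouse/may19.db/txnrecords/txns.txt | head -5
```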
Now something interesting: I can simply go here in the file browser and delete the data, so I deleted the data from here. Now if I do a SELECT *... see, it still fired the query, because the table is there; the table exists, but the data isn't. Let it run. Yeah, so if I do a SELECT *, there is nothing: I did a SELECT * from the table and there is no data, because I deleted it from here. And you can manually copy the data back, too; if you don't want to type LOAD DATA LOCAL INPATH, I can just copy the transactions file into this location. Right now there is a schema but no data; now I manually uploaded the data here, and if I go back and query, I should be able to see the table's data again. So this is where your data goes: if you say LOAD DATA LOCAL INPATH, the data goes into /user/hive/warehouse, then the DB folder, then the table folder; that's where it lands. Okay, are you able to see this folder? I think yes. And now I am going to confuse you. Okay, since you are invested this much, I'm going to confuse you. So what did you do? You loaded the data from the local file system. And I have one more data set, right? There is one more file, called customer, and this one I want to load from Hadoop. So here's what I'm going to do; you can all do it along with me. I'll go to Hue, and in my home folder, Raghu, I'll just upload the data. That's step number one. So there is the cust file, the customer file; copy that into Hadoop, into your own home folder. This is a bit confusing, I know; just remember the location where you copied it in Hadoop. Fine, uploaded. Now I want to load that into the customer table. But first we have to create the customer table; you don't have one yet. So I can go here, and there is a command in the file: create table customer; see, "sales based on age group", and below that there is a command to copy. Copy this command, come back here, paste it, and it should create a customer table. Fine. Now, you could load the data from Linux; you already know how to do that. But I don't want that; I want to load from Hadoop. How do I do it? I'll say LOAD DATA INPATH; I don't say LOCAL, just INPATH, then /user/raghu/cust, INTO TABLE customer. Try this. The only difference is that you are not using LOCAL: you say LOAD DATA INPATH and give the Hadoop location.
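The HDFS-side load, as a sketch (my home-folder path; substitute your own):

```sql
-- No LOCAL keyword: the path is an HDFS location.
-- For a managed table, Hive MOVES the file into the table's
-- warehouse directory; it does not leave a copy behind.
LOAD DATA INPATH '/user/raghu/cust.txt'
INTO TABLE customer;
```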
Now I want you to do one thing: check the HDFS location where you first uploaded the cust file. Can you see the data there? It will not be there. If you can still see it, there is a problem: Hive will move it from there. Just check; in my case it was in /user/raghu, and if I refresh here, it's gone. Why? Because the tables we are creating are managed tables; if you did DESCRIBE FORMATTED, it showed it is a managed table. What is a managed table? A managed table is a table where Hive manages your data; you don't have any control. Hive has already selected a location in Hadoop where it will keep the file, so even though you say load the data from this Hadoop location, it will cut it from there and dump it into the warehouse. Go back to that /user/hive/warehouse folder; your data will be there for sure. You don't have any control over this; Hive controls where the data sits. Can you see the data there? Yeah, in /user/hive/warehouse, that location. And someone asks: is this just a pointer change? Exactly; the metadata gets updated, that's what is happening. When it "moves" from this folder to that folder, rather than physically rewriting the data, the namenode's metadata is changed to point from here to here. That's all that happens. And in the other case, where we uploaded the data from local, this does not happen: if you're loading from local, it will not delete the local copy. Okay, so that's a bit confusing, but that is how it works. People wonder: what happened, where did my file go? Because Hive is managing the data, it will not allow you to keep the data where it was; it has to be under the /user/hive/warehouse path. So that's one thing, and that's called a managed table; what we're creating here are managed tables.
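By contrast, the external table mentioned earlier leaves the data where it is; a hedged sketch, not run in this session (table name and location are hypothetical):

```sql
-- EXTERNAL: Hive records the location but does not take ownership.
-- Data stays in place; dropping the table removes only metadata.
CREATE EXTERNAL TABLE customer_ext (
  custno INT, firstname STRING, lastname STRING,
  age INT, profession STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/raghu/customer_data';
```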
Now, if you have done everything correctly, like I have, and you do a SHOW TABLES, you should see two tables, and you should have data in both of them. The way to verify: do SHOW TABLES, and run a SELECT * on each; you should have data in both tables, because only then can we go ahead with the analysis. We can run any analysis once we have the data, and here we can use SQL. If you are not familiar with SQL: to me, SQL is among the most powerful languages in the world; analysis that is hard to express in English can be expressed in SQL, even though I don't know much SQL myself, I'm not a SQL guy. You might write complex queries, but at the end of the day that is not the point; you should know what is happening underneath, where the query runs. Maybe you don't know how to write some queries; even I don't; a DBA or somebody will write the query. But when you fire a query, what is happening, where it is running, that is the interesting part. Let me tell you a very interesting story. I have a friend who is working in Germany, and I went there to deliver a training, so he was sitting in my class. I was asking the participants: how many of you are using Hadoop or Hive? And most of them said no; they were PL/SQL guys. The thing is, they were on Hortonworks, and Hortonworks offers a tool called HAWQ, H-A-W-Q. HAWQ is like Hive, but with some modifications, and HAWQ has since become an Apache project; to be fair, previously it was called Pivotal HAWQ. So HAWQ is essentially a SQL layer that, like Hive, runs on Hadoop. These people had been using HAWQ for the past two years, but they didn't know that they were using Hadoop, because for a SQL developer it doesn't really matter: you write the SQL query; do you really care where it executes, who processes it? You don't; your job is to run the query. But HAWQ sits on Hadoop; it's an improvement in the Hive direction and runs really massive queries, and previously it was known as Pivotal HAWQ. So even three years back they were running HAWQ and didn't realize that most of their queries were on Hadoop. That's my point: if you're a SQL developer, mostly you don't care where the query is running; you care whether the query runs or not, and if there's a problem, that's a support-team issue; that is how they look at it. So I asked them, do you actually use Hadoop, and they said no, we have never used Hadoop, while they had been running queries on it for almost two years.
Okay, so once you have the data, here is what we want to do. I want to do it in a particular way; you can do it in many ways. First, I'm going to create a table; you can get the commands from the file; it's called out1. So I'm creating a table called out1, and if you look at the schema of this table you will understand what I'm doing: it's a join table, meaning I am going to do a join and I want to store the result in this table. Usually you don't have to push it into a table, but you can. So I'm creating this empty table to store my join results: we're doing a join operation, and this table will hold the result. And how do you do a join query? It's very, very simple; if you know SQL, this will be very easy for you. You write an INSERT OVERWRITE TABLE statement, and it's a typical inner join: you say a.custno, first name, age and profession, those are from one table; b.amount and product, those are from the other; FROM customer a JOIN txnrecords b, and the joining column is the customer number. For those who don't follow: we are merging two tables. This produces a result set, and we want to store it somewhere, so we created an empty table called out1 and the result gets pushed into it. This is the typical syntax of a join: a is the first table, customer; b is the transaction-records table. From table a I want the columns customer number, first name, age and profession; from b I want amount and product. And whenever you do a join there should be a common column, because only then does a join make sense. The common column here is the customer number, and this is a simple inner join; you also have outer joins, left outer, right outer, full outer, and all of these will work. Note that it is INSERT OVERWRITE: if the table has no data, it simply dumps the rows; if you already have some data, it deletes everything and loads fresh. There is also an append mode if you simply want to append. Okay.
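The join as described, in one statement (column names are assumed to match the tables created earlier in this session):

```sql
-- Store the inner-join result in the empty out1 table
INSERT OVERWRITE TABLE out1
SELECT a.custno, a.firstname, a.age, a.profession,
       b.amount, b.product
FROM customer a
JOIN txnrecords b
  ON a.custno = b.custno;
```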
For the SQL folks, I have a question; I thought I'd begin with a question. What do you mean by schema-on-write? Do you know what that is? Your RDBMSs and typical systems are called schema-on-write, meaning: if you create a table, say the table has six columns, and you try to insert a seven-column record, it will throw an error; it will say you have violated the schema. You cannot insert seven columns into a six-column table, and that is why it is called schema-on-write: it validates the schema while you are writing the data. Hive is schema-on-read, not schema-on-write, meaning I can create a Hive table and upload an MP3 file, and it will happily upload it, no complaints. I can upload a movie, and it uploads it, no complaints, because it does not validate what you are uploading. When you say load the data, it simply copies the file and dumps it in the folder. But when you query the data, it validates then, and it shows errors, or NULLs, at that point. And why is Hive schema-on-read? It is for faster loads: if it validated everything, it would take time to load the data. Today, I think, most data warehouses are schema-on-read; I don't know much about all the others, but Hive is definitely schema-on-read, while schema-on-write is typical of your traditional systems.
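The schema-on-read behaviour just described can be seen with a quick experiment (the file name here is hypothetical): load a file whose format does not match the table; the load succeeds, and the mismatch only appears when you read.

```sql
-- The load does not validate the file against the schema...
LOAD DATA LOCAL INPATH '/home/cloudera/not-a-csv.bin'
INTO TABLE txnrecords;

-- ...so the complaint (NULL columns) only shows up at query time
SELECT * FROM txnrecords LIMIT 5;
```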
So run your join query and watch what happens. It's fast? That's almost frustrating; like I said, if Hive is fast, there is some problem, and we'd have to troubleshoot. Really, we would. So now, verify the data.
Do a SELECT * FROM out1 with LIMIT 5. You should see the join output here, and as you can see, you have these details from the customer data, like the name and age, and these details from the other table, like the amount and the product; those come from the transaction table. So you should see the join result: you have now joined the tables, joined two tables, actually. Next, logically speaking, I will use something called a CASE statement in SQL. Those who know SQL are aware of it; for the others, what is a CASE statement? Very simple. I say: look at the age column. If the age is between 20 and 30, categorize the customer into one bucket; 30 to 40, into another; and so on. Because you want to group them, right? Only then can you sum the transactions. In SQL you can easily do that using a CASE statement, and that's what I'm going to do. To store the output I will again create a table; I'm just creating intermediate tables. So create a table called out2, and if you look at the schema of this table, there's a last column called level. This column did not exist before, because that is where the CASE result will go: the level column will hold low or middle or old, based on the age of the customer, and I'll push the data there. Okay, so how do you write a CASE statement? This is one way: I select everything, and I open a CASE: WHEN age is less than 30, I mark the customer 'low'; ages 30 to 50, 'middle'; more than 50, 'old'; ELSE 'others'. That is how the classification happens, and it lands in the last column of the table. It's simple CASE syntax: the last column is populated based on the conditions you mention. So here I'm just saying low, middle, old; three categories I am mentioning.
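As a sketch, the CASE query being described, with the cut-offs as stated (under 30 low, 30 to 50 middle, over 50 old):

```sql
INSERT OVERWRITE TABLE out2
SELECT custno, firstname, age, profession, amount, product,
       CASE
         WHEN age < 30               THEN 'low'
         WHEN age BETWEEN 30 AND 50  THEN 'middle'
         WHEN age > 50               THEN 'old'
         ELSE 'others'
       END AS level
FROM out1;
```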
see this is the case statement right on
high This for Love sequel I can't really
help it I think there is no other
way to write and high right without you
think sequel I cannot like anything in life
So don't worry if you don't invest in
sequel but get an idea like this is
what will happen if I do it Because
high was extensively used I mean most popular
tool I can say s so that means
We are in the last part now. We have created this table called out2, and if I do a select star from out2 limit 5 — see, if you look at the last column, every customer is classified Low or Middle. And now we will do something called a GROUP BY, where we just group them based on Low, Middle and so on, and you will get the final result. This is to show you how Hive is running, OK? Now, to do that, I will create a table called out3, and I want you to tell me the query — let's see how many of you can. So this is the third table I created, OK? Group by... by... you know this; what is that? Column: level. Yeah. So I'm just asking: what are we doing here? We're simply selecting the data, grouping it, and counting it. Now, in the final table, if you do a select star you will see the final result: select star from out3. So now you can see Low, Middle, Old and the total amount spent. So who is spending more? Not Low. I think this is bigger, right? Older people are spending more, actually. I don't know, maybe you're right. Based on — it's actually a retail store, right, sports items — probably older people are actually buying more of these things.
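The GROUP BY step described above can be sketched like this, assuming the out2 table from the CASE step has level and amount columns:

```sql
-- Total spend per age category; column names are assumptions.
CREATE TABLE out3 AS
SELECT level,
       SUM(amount) AS total_spend,
       COUNT(*)    AS txn_count
FROM out2
GROUP BY level;

SELECT * FROM out3;
```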
Now, quite interestingly, if you don't want to open a shell and type this stuff, you can go to Hue. There is a query editor, and there is Hive — same thing. You go to Hive, and here you can select your DB. Now, it's at the default database; go back — my databases: may7_db, may19. So I'm in may19, and here you can do the same thing. Select — so now it is a lot more comfortable, that kind of thing, right? If you just hit play, it'll run. So this tool called Hue is very useful, actually. That is one thing. Apart from that, let's say I take a query. We have a table, right — out2. Select star from out2, let's say limit 15, something like this. And if I run this, the query runs and it gives you options to visualize. See, you can visualize the data. This is Hue's property. In my knowledge, there's a website called gethue.com, but they are available on all the clouds. By the way, Cloudera has Hue — if you buy Cloudera, you get Hue. Hortonworks doesn't have Hue; you've got to separately install it if you want. Hortonworks has this somewhat similar thing called Ambari, and that has a file manager just like this; you can click and upload a file. That's very good, actually. Hue is third party, in my knowledge; Ambari is from Apache — Hortonworks is just using it. Even on Apache Hadoop you can install Ambari. So if you're having a plain Apache Hadoop and you want to administer it, you can install Ambari, but it's popular on Hortonworks, because nobody uses much of plain Apache Hadoop — you use either Hortonworks or Cloudera. MapR has the worst UI; it's called MapR Control System. Worst, I mean — I'll show you. By the way, I had a MapR cluster; I lost it. MapR Control System — I don't know, maybe they have improved. No, they have not improved. MapR's UI is called MapR Control System — MCS, OK? Yeah, this is how it looks. See the UI — I mean, this is theirs, like Cloudera Manager; you see, this is their UI. Very dreadful — you can't find anything in this. I mean, probably the worst UI ever created, comparing it to Cloudera Manager and all. Anyway,
I wanted to discuss one more thing. So far we have done managed tables; I want to talk about external tables, and that is again a bit confusing, OK? We'll finish external tables and wind up. You have probably been through a tough session, actually. So just do one thing — I want you to do one small thing. I want to create a table, OK, so I'm copying the command here, and I want to change it. So just open one more notepad — open a new one, right. So this command is for creating the transaction records table, the regular command; I just want to change it. I will say create external table, and I'll name it transaction_records_txt, OK? And then I will say location — I'll explain what I'm doing, OK? First let me try this — my_data. So I'll just copy this and see whether the command is working; then you can also try. Yes, it works.
So — you want me to increase the font? Let me just increase the font, double. There's just one thing. So what is the difference here? You are creating something called an external table. The first thing you can note down is that you say create external table — there is a keyword called EXTERNAL you're using; that's point number one. Point number two: after all the schema and everything, I say LOCATION, and in location I say /user/gl_faculty — that's my home folder — /my_data. What is going to happen if you run this command? It'll create a table, that's for sure. It will also create that folder for you — my_data — in Hadoop. I don't have that folder; it will create that folder for me. Let me show you; I'll tell you why it is required. If I go to my home directory — my home, actually — what is it... yeah, here it is: my_data. Can you see? So this folder is created by my table creation command; I didn't create it, OK? Now, what is the idea of an external table? When you create an external table, you have control over where the data is. So right now there is a folder called my_data, right, and all I need to do is open this folder, OK, and copy my transaction file here. I'm just uploading the file, and if I say select star from transaction_records_txt — it's fine, I have the data. So the difference here is this: what is the difference between a managed table and an external table? In a managed table, you don't have any control over where the data is; you just say load the data, and Hive will always keep it in /user/hive/warehouse, blah blah blah. In an external table, you say, 'I want a location where I will keep my data.' So here we mentioned a folder, and I can upload all my data here, and it will appear in my table.
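The external-table command discussed above looks roughly like this; the column list and the HDFS path are assumptions based on what was said in the session:

```sql
-- EXTERNAL keyword + LOCATION: Hive tracks the metadata, you own the files.
CREATE EXTERNAL TABLE transaction_records_txt (
  txnno    INT,
  txndate  STRING,
  custno   INT,
  amount   DOUBLE,
  category STRING,
  product  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/gl_faculty/my_data';  -- folder is created if it doesn't exist
```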
More than that — this is one example of where we actually use external tables. Say a customer, a client — they will have some data, and we will ask, 'Where is your data?' They will say, 'My data is in S3' — Amazon S3. You can simply say: create an external table pointing to the S3 location, and all the data will appear in the table. The data will be there, I'm saying — you can see it in your table. So any external location: you can keep your data there and access it in Hive. That's what an external table is. Try this if you want: create an external table and upload the data, and see whether you can find it. Because you may not always have control over the data — you cannot say that all the time, 'I will copy the data to my place.' Sometimes you might want to create a table and point it to where the data is. Actually, that's possible. From my local system, I uploaded to that folder. So once you upload the data to this folder, it will appear in the table. The data has to be in the my_data folder; only then can you see it in the table, because you are saying the location is there.
See, the business use cases are different; I'll show you a business use case, OK? So what is going to happen is, let's say you are having some DBs. So here I have a DB — say I have MySQL, etc., right? And what you do is — so you have some databases, and let's say you run Sqoop. You know Sqoop, right? Sqoop is your ETL tool. You say, 'Hey Sqoop, do one thing: connect with this DB every day at 12 a.m.' — midnight. So I can schedule a job: every day at midnight, Sqoop will connect to these two DBs and it'll run the ETL job. It'll pull the data from here, and once Sqoop gets the data, Sqoop has to send it somewhere. So I have my Hadoop cluster here, right? That is my Hadoop cluster. What Sqoop will do is dump this data into a folder — so there is a folder called, I don't know, /sqoop. Sqoop will keep on pushing the data to this folder, OK? Now I can create an external table and point that location to /sqoop, right? So what will happen when I do that? Automatically, the data that Sqoop is dumping here will appear in my table. Are you able to understand what I'm saying? So typically in an ETL setup, what happens is that every day you will be running ETL jobs to bring data from multiple places, and all this data will get dumped into some folder. Now, somebody cannot go there every time and say 'load data local inpath.' Rather than doing that, you create an external table pointing to that folder, so the moment a file lands, it appears in your table. That is actually the use case of an external table. Sqoop cannot do the scheduling, by the way — Oozie has to do that. Now, if you are asking whether Sqoop can directly dump into Hive tables — that is also possible. Sqoop can take the data from Oracle and dump it into a Hive table; that's possible. But there are many jobs where you get data from multiple sources. Say you have a Flume job — if you're running Flume, imagine it may be collecting log files. All the log files will get dumped into some folder. I can create a simple external table where I can use some SerDe to read the logs, and I simply select star — all the data will appear in my table. Otherwise, I have to say 'load data local inpath' every time to dump it into the table. So the actual use of an external table is that you can simply point to any folder, and all the files will appear in your table. Hive doesn't care what is inside; you have to make sure that the schema matches and everything.
Some people ask this: what if I have, say, 20-column data in the folder and you have a 10-column Hive table? The first 10 columns Hive will try to pick up from there, ideally — it will just fit wherever possible and drop the rest. So all the files will be appended to the table, and wherever a record cannot match, or Hive cannot understand it, it'll show null values — 'I don't know what it is' — or a datatype mismatch, and null problems will come if it tries to fit. There is no guarantee, right? So really, that is not a use case, because in a data warehouse you are not supposed to do that. Actually, you're not supposed to experiment in a data warehouse; the data should be clean. That is why you have this cleaning, preprocessing, data-factory operations and all; only then do you take the data to the data warehouse.
Finally, another important point about external tables is this. If I do a show tables now, I have this table called transaction_records — and this is a managed table — and this is my transaction_records_txt, the external table. I say drop table transaction_records — I dropped the managed table. Can you tell me what will happen to the data? Will it retain the data, or will the data be gone? A lot of people said no — the data will be gone. OK, I'm just asking. So I just dropped the managed table — you know what a managed table is; I'm asking. The table is gone, for sure. The file I manually uploaded, right — will it be there? Yes? No? Fifty-fifty? We'll check. So here it is: /user/hive — may19. Do you see a folder called transaction_records? The data is gone. So managed tables have this drawback — I won't call it a drawback, but — so, this is very common. It happened once in a project that I was working on, and that is how I came to know. I mean, normally when you're working, you learn these things by the mistakes somebody makes. So we had a table — a very huge table — which was shared among our projects; two projects were sharing the table, and they had full rights. By mistake, one of the engineers said drop the table, and then it's gone. So the problem — he had dropped the table accidentally; he didn't do it intentionally — but the problem is that the table is gone. They had some recovery mechanism at that point in time, though. In Hadoop, in Hive, there is no way to recover — they had deleted the table from Hadoop — but they had some recovery mechanism, a trash, recycle mechanism, and based on that they recovered somehow, OK? That is why, when you create a table which you want to share, always create an external table. Drop table — I can do the same: I type the table name, transaction_records_txt, and drop the external table. The table is gone, and I'll go to — I can just go to the home folder — we had a folder called — what was it — my_data. The data remains, because Hive does not have any control over your external data. You have control; you uploaded it; it will not delete it. So this is easy: you can recreate the table, and the data will not be lost, right? So if you want to share the data — share, expose the table — always create external tables. Managed tables are rather for learning purposes; it's easy to do a managed table. On the operations side they're all the same — the queries and everything are the same as managed; there is no difference. But always remember: create external tables if you want to share, OK?
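The drop behaviour just described, as a sketch (table names are the ones used in this session):

```sql
DROP TABLE transaction_records;      -- managed: metadata AND the files under
                                     -- /user/hive/warehouse/... are deleted
DROP TABLE transaction_records_txt;  -- external: only metadata is deleted;
                                     -- the files under its LOCATION remain
```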
This PDF has some more detail; I request you to go through that also. Yeah — the official website of Hive, very, very important: hive.apache.org. It has tons of information; I don't know how much it has — a lot of things, actually. You can start with the Getting Started guide on Hive to understand what Hive is. Yeah, I forgot how to connect using Beeline, so I'll show you tomorrow how to run Beeline, because I just tried. By default, HiveServer2 runs on port number 10000, so when you're starting Beeline you have to give a JDBC URL, connect to port number 10000, and then we connect with HiveServer2. I will show you how to do that tomorrow — anyway, I forgot. So please go through this Hive official website; it has tons of information, a lot of things are there. There is a Hive wiki page as well, right? And just look at these points we already discussed: query execution via Apache Tez, Apache Spark, or MapReduce.
Probably the last bit I can show you is Tez, but I forgot my username and password — it is not here; I'll show you Tez tomorrow. Anyway: query execution via Tez, MapReduce, or Spark. Hive can use Spark as an execution engine, but people don't prefer it, because from Spark they fire Hive queries the other way around — that's easier; you can start Spark SQL and fire queries from there. But you can actually use Spark as an execution engine; it is possible. And there is sub-second query retrieval via Hive LLAP — only on Hortonworks, lately. This is what I was talking about: LLAP enables ACID transactions, faster queries — everything that you can think about in a normal RDBMS — within Hive. But that again requires resources. I mean, even though Tez is very fast, it will take a lot of resources, actually. So when you're running Tez — Tez itself is in-memory, and anything that is in-memory requires resources; in-memory means RAM. Tez itself is fast because of in-memory, so your cluster needs to have a lot of RAM for LLAP; otherwise it's a very, very slow procedure. So, next: HQL — that is the combination; it's called Hive Query Language, and it's what you're writing. You can write your own custom functions and all inside Hive — Hive supports something called UDFs, user-defined functions, that you can write. So, as they say, HQL is more or less a fancy way of saying SQL for Hive, and vice versa, in a sense.
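Registering a UDF in Hive looks roughly like this; the jar path, function name, and Java class here are hypothetical:

```sql
-- Hypothetical jar and class names, for illustration only.
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION age_bucket AS 'com.example.hive.AgeBucketUDF';
SELECT age_bucket(age) FROM txnrecords LIMIT 5;
```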
Yeah — see, if you're very good in Java, MapReduce is your choice, but MapReduce is extinct, right? Then you have Spark. Spark uses Java, but that is Java 8, functional Java, not Java 7, so a lot of people have trouble there. But even in the Spark community, most developers prefer PySpark — Python — and Scala; only then does Java come. Java is supported in Spark, a hundred percent, but developers mostly want either Python or Scala, because it's easy to write, actually. And I would say some knowledge in SQL — and so in Hive — is very important, whatever you're doing. I will tell you this: in Spark, if you write your code using Python versus SQL, the SQL is much more optimized. Because if I'm writing anything using SQL, I have a schema, which means I can understand my data. If I have a schema, I can inspect my data, so if I write a join and a filter, the engine can push the filter first and then do the join. Are you able to understand? In SQL, I write a query, and by default it does this. By default, in SQL, if you write a group-by, a join, and then a filter, what will happen? The filter will come first — because why should you load all the data and only then do a filter? First do the filter, then do the rest. That small optimization is not possible if you write it in a general language in Spark. Even if you write Spark code, it is not optimized the same way, because Spark does not have a typed schema the way SQL has. That's what I'm saying: SQL always has a preference because of its optimization, which is not possible in other languages — even in MapReduce, right? You have no idea what the underlying data looks like; you're not mentioning any schema; it has to load the full data to do the analysis. In a MapReduce program you write, the whole file has to be loaded; you cannot say 'selectively read this column and analyze it' — not possible, because first everything has to be loaded. So the structured query languages always have a performance advantage. And Java is good — I'm not saying Java is bad — but I don't consider it further.
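The filter-pushdown idea described above can be sketched in SQL — because the engine sees a schema and a declarative query, it can apply the WHERE clause before the join (table and column names are assumptions):

```sql
SELECT c.state, SUM(t.amount) AS total_spend
FROM txnrecords t
JOIN customers  c ON t.custno = c.custno
WHERE t.amount > 100   -- the optimizer can push this filter below the join,
GROUP BY c.state;      -- so fewer rows are joined and aggregated
```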
Scala — probably you can relate with Scala. Scala is similar to Java; if you know Java, you can learn Scala easily. Scala with Spark is a good combination — a very good combination. What else? If you have questions — we'll wind up in five minutes, after your questions. So tomorrow we'll be covering something called partitioning, bucketing, and indexing in Hive, which are a bit advanced topics, actually, but very much needed to actually understand what Hive is, because they are the highlights. If somebody wants to learn Hive, these are the points; if somebody asks you about Hive, these are what they ask. Nobody will ask 'how do you do a select star' — nobody's going to ask that. They will ask, 'How do you partition your data?' So at least if you can understand the logic of it, that may be more than sufficient, right?
And we will look at this NASA data — you have the NASA data, see, NASA. So tomorrow we will analyze the NASA data using Hive. I can show you this data — there are HTTP server logs and all — and I can explain with this NASA data, so make sure this dataset is there with you. Anything else? You know, if you look at this case study that we have prepared — this is about the NASA data: analyzing the NASA web server logs with Apache Hive. And this will be helpful because we will see the same data in Spark, in the Spark class also; it will be the same data. So if you understand the data now, we can analyze it using Spark as well.
So I guess you are aware of this: server logs contain lots of information from web servers, and this case study will show you how to derive insights from web server logs. So you are going to look at some log files which are generated by a web server. If you have ever worked with web servers — let's say you type google.com and then download something; all these requests will be hitting the web server. You will get GET requests, PUT requests, the HTTP status codes — that is how your log files will look, OK? I will show you the log file. And you can actually download the data for free from this URL — we have already downloaded and uploaded the data, so you don't have to worry, but you can; it's public data from NASA. And if I scroll down — look at here: before moving to the activity, please go through the HTTP response codes at this link, right? So, are you aware of something called HTTP response codes? Yeah? If not, you can just go through this URL. So, for every request there'll be a code associated with it, right? So if you're browsing and hitting a web server, the server will respond with that code. If it is 200, it is OK — anything starting with 2 is success: 201 is Created, 202 is Accepted, 204 is No Content, and all. 3xx is redirection — the website will redirect you to some other place, right. And finally, there are the 400s, and the 500s are server errors. So sometimes you try to open a website and it will say 'server not found,' or something like that. So anything starting with 5 denotes a server error, right — Gateway Timeout, Service Unavailable, something like that, you can see. So if you want, again, just go through this list, right?
And understanding the data: the dataset contains two months' worth of all HTTP requests to the NASA Kennedy Space Center server in Florida. So this data is collected from the NASA HTTP server — two months' worth, all the requests are there — and I think around one million lines are in the data. It's pretty huge data, actually; somewhere around one million lines. The log file itself is stored in the Apache Common Log Format. So that's something a lot of people ask me: why is it a text file when it does not have a .txt extension? Normally you say 'text file' for something with .txt, but there is a lot of other data which is also called a text file; it is not mandatory that you have a .txt extension, right? So this Apache Common Log Format — it is a text file, but you cannot open it properly in Notepad. You can open it in WordPad or any other thing, and it'll show it to you properly, like text data. I'll show you the data.
Now let's look at the data. So if I go here — this is the NASA access log, and this is the data. I say 'open with' — it was WordPad — say OK. So this is how the data looks, and it's not really pretty; I mean, it's not really nice, it's not structured. So what are the fields we are interested in here? This is a log file, and if you look here at the PDF, we have the host, identity, user identity, time, request, status, size. If I scroll down: the first field is the host making the request, OK? So where is the host making the request? This is the host — this 'in24-dot'-something. So this is the fully qualified domain name of the host machine making the request. Then what we have next are the user identities from the remote and local machines — these are unavailable, so the next two fields are just dashes; that's fine. And then we have the timestamp, in day/month/year format — so here you can see this. This is your timestamp — the time it is making that request — and the time zone: this minus 0400 is your time zone; what you see here is the time zone, OK? And then there is a request in to the server. So here we can see GET — it's a GET request, OK, and somebody is trying to get shuttle mission status news, something. OK, so that's the GET request we have. And then there is an HTTP reply code — the reply code is 200; it's a success, OK? And then you have the size of the file received, probably in bytes: 1839. So this is the data. You can see that a lot of people are sending requests to the NASA web server, and that's how the data looks.
Now, what we want to do is create a Hive table to load this data, and then we want to write some queries. The size of this file is 167 megabytes — it's a big file, actually, not very small — and we will now test how the cluster behaves, because if all of us start writing queries — all of us, right — we'll see how efficient the cluster is. Anyway, so the next question is: how do we actually load this data into Hive? Because this data does not have any structure. To do that, we are using this SerDe — the serializer/deserializer.
Now, a little bit of information for you — I also want to give you an assignment. So if you simply go to Hive and search 'SerDe,' you're going to see here: this is called a SerDe. So what is a SerDe? SerDe is short for Serializer/Deserializer. Like I said, if you're having data which is not really structured — like this log file, or JSON files, or anything you want to read and write — you can use this part, called a SerDe. And Hive supports a lot of built-in SerDes — these are the built-in SerDes we have — and we are going to use this guy: regex. You are aware of regex — regular expressions, right? So that means you can pass a regular expression, Hive will parse it, and using this it'll read the data. There's an ORC SerDe, Avro — these things we'll look at later. And then custom SerDes: you can write your own SerDes if you want; that's also possible, right? We will look at that later. Let's open this Regex SerDe. Yeah — so it says, if you want to create a table and use the SerDe, you have to say ROW FORMAT SERDE. Remember, previously we used to say ROW FORMAT DELIMITED, FIELDS TERMINATED BY; now I say ROW FORMAT SERDE, and then I will give this SerDe, OK? And then you say WITH SERDEPROPERTIES, and you will put in 'input.regex' — whatever regular expression you have, you have to put it there, and that should actually be able to parse the data. Whatever regex you're writing should give the structure for the data. That is the idea. And this is very important: STORED AS TEXTFILE.
I think yesterday we didn't discuss this: when you create a Hive table, by default the option is STORED AS TEXTFILE, which means whatever data you are loading into the Hive table, it will store as a text file — a regular text file, be it a CSV file or any file. So yesterday we loaded some customer data, transaction data, and it was getting loaded as-is, OK? Now, you have other options. You can say STORED AS ORC, Parquet — and that is where the improvement comes, and I will talk about it: what is the difference between when you say text file and when you say ORC file, where is the difference. But as of now, we'll say STORED AS TEXTFILE. Were we not mentioning it? By default it is text file, huh — that is why yesterday we were not mentioning it. The default is text file.
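The storage-format clause mentioned above, as a sketch (table and column names are assumed):

```sql
-- Omitting STORED AS is equivalent to STORED AS TEXTFILE.
CREATE TABLE txn_text (custno INT, amount DOUBLE)
STORED AS TEXTFILE;

-- Columnar, compressed alternative discussed later in the session:
CREATE TABLE txn_orc (custno INT, amount DOUBLE)
STORED AS ORC;
```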
Now, the data is available in this location — can you see here? /gl_data. One moment — what was the name? access_log? Yeah. So this is the file; it's called access_log, there is no extension, and it is available in a common folder called gl_data. So first, let's do one thing: let's start Hive. We know how to start Hive — so start Hive, and let's create a table. We will use the DB we created yesterday: I say use may19 — that was the DB, right? You can switch to whichever DB you created. And this is the command to create the table; you can copy-paste this from the PDF, OK? I'll copy, and I'll explain what it is. So you will say CREATE TABLE IF NOT EXISTS nasa_log, and the fields we're interested in: host, identity, user identity, time, request, status, size. And then ROW FORMAT SERDE — we want to use the Regex SerDe. WITH SERDEPROPERTIES, you can put in the regular expression — so this is the regular expression we're using — and in the output we just say that; that's part of the SerDe. You say the output format string, separated with spaces, right, and then you say STORED AS TEXTFILE. So this is where we're saying store it as a text file. So this is your regular expression, huh? You can go to the PDF, and if you cannot copy like this, right-click and say Select Tool — there's a select tool; once you've selected, you can left-click and copy like this. So you have to have some idea about regular expressions if you want to write this from scratch — but that's OK. Yeah, sure.
I'll show you from the console — yeah, this is coming. I just said hive, then use the database, and then just copy-paste the table creation command. Right now we have not loaded the data; we haven't loaded the data yet. That output format string is simply saying that there are eight columns, right, each column separated with a space. It's a bit difficult to explain; it actually reads and separates the data by the spaces you have, and it looks at digits and strings and then extracts the interesting fields. So it's difficult to explain that regex, actually. If you're writing it from scratch, you have to validate it. This one was validated — even I don't remember how; I wrote it long back, actually. Yeah, so it's a create table — yeah, a normal one. The only difference here is that in the table properties, you are saying: whatever data comes to this table, apply this SerDe, the regular expression; otherwise it cannot recognize the data inside, right? So, right now this is the Apache Common Log Format, and the Apache Common Log Format has a fixed — schema, not schema, but the way the log files are generated is fixed — so you can parse it in multiple ways; this is one way of writing it.
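The table being described can be sketched like this. The regex below follows the Apache-weblog pattern shown in Hive's own RegexSerDe examples, trimmed to the seven fields of the Common Log Format — treat it as a starting point and validate it against your own data:

```sql
CREATE TABLE IF NOT EXISTS nasa_log (
  host     STRING,
  identity STRING,
  userid   STRING,
  time     STRING,
  request  STRING,
  status   STRING,
  size     STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) (\"[^\"]*\"|[^ ]*) (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE;
```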
So in the Hive wiki, if you look, they simply say that this is a regex SerDe. Here, I'll show you how. So here it says input.regex; see, the output regex is not shown. Here you can see the input regex, right — are you looking here? I mean, this was written such that even if you omit the output format string, it will work, but we are just ensuring that everything is read as a string — we are manually saying that it is a string. I think it is not updated — Hive's documentation is not updated. No, it is not updated; let me see. Um, so where is it? 'More about RegexSerDe can be found here' — so they have given links, actually. Probably it is here; let me just check.
So what happens — let me just see if I can find it first. See, in open-source communities, what happens is this: let's say you are looking at Hive, or Spark, or anything. Somebody will create the project — let's say I created Hive and I give it to everybody; everybody will start working, and then what will happen? Somebody will say that something is broken. So I may not add it in my documentation; I'll add it in the JIRA. JIRA is an issue tracker, right? So in the JIRA — in the JIRA it clearly says that there is input.regex and an output format string you should use. So somebody added: 'Yes, I will work on adding a SerDe how-to and some examples.' So somebody said that this is not properly documented. So it's in the JIRA; it is not in the official documentation. When they created it, probably the property was not there, I'm saying; then some people complained, but they have mentioned the link here, right? You can see that it is here — if you look here, it says 'more about this can be found here.' I don't know why they're not updating the documentation, but it is there, anyway. So one problem is this: if you want to search documentation, you may not find everything in the official documentation sometimes, so I always look at JIRA, because any issues they fix will be in JIRA. And also, not every time will things work the way you're expecting — somebody will be saying, 'Hive plus this is not able to read this file,' some issue you come across — and there'll be a direct ticket for that, always.
So now you have to load the data — let's continue with that. Where is the location? The location is here: /gl_data, right? So how do you load? You say load — wait, wait, wait. If I do a 'load data inpath,' what will happen? I said load data inpath — what is going to happen? We can create an external table, right? That's better. I didn't think about that, because normally the participants will have the data in their own clusters and they will upload it themselves. So what do you want? You want to create an external table, everybody, and point it to the data? OK, so then you have to modify this. How do you modify it? If you want an external table, you say CREATE EXTERNAL TABLE. First, drop the original table, OK — drop first: drop table — what is the name — nasa_log. So drop it, and now we say CREATE EXTERNAL TABLE nasa_log. These are all fine; the only difference is LOCATION '/gl_data'. Um — wait, wait, wait, wait. An external table, if you're creating one, should point to a folder. This will load everything from /gl_data — we don't want that, right? In /gl_data you have a lot of folders and files. So I will move it; I'll see if I can create a folder in /gl_data. Let's see — new directory, I'll call it 'nasa.' Yeah, and I'll move it inside here, right? That should make sense. See, you have to take care of everything. Not this one — this one, right: access_log, move to /gl_data/nasa. 'There is no nasa folder' — yeah, there is. Move. Fine. So if I check nasa, it should be here — now it's here. Otherwise it would load every other file from /gl_data, and we're not wanting that, right? So what should be the location? This should be the location: /gl_data/nasa, right? Let me see if it works.
Now, you can use lowercase or capital letters. So the table creation will work — I know that; my question is about the query: select star from — it works. So you have to try this, and let me know if it works. But all of us are accessing the same data, OK? All of us are on the same data, so make sure — see, you can see it properly structured.
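Once the data is visible, a first query on the log table might look like this (nasa_log and its columns are the names used in this session):

```sql
-- Count requests per HTTP status code.
SELECT status, COUNT(*) AS hits
FROM nasa_log
GROUP BY status
ORDER BY hits DESC;
```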
No I already created a table first Right
on I dropped it My guys how do
you load data to the papers We have
a shared location So it is better to
create an external table two point So I
dropped it and we created the table So
here it is location And also here is
ah location this or social this external you
can pipe small letter or capital for the
name should be case sensitive on do a
select star and let me know whether you're
able to see the later So there are
two approaches not every later can be made
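Put together, the drop-and-recreate step looks roughly like this. The column list and the delimited row format here are only a sketch (in the session the table was actually defined with a RegexSerDe); the /gl_data/nasa path and table name follow the session, so adjust them to your own setup:

```sql
-- Drop the managed table created earlier.
DROP TABLE IF EXISTS nasa_log;

-- Recreate it as an external table pointing at the shared folder.
-- EXTERNAL means Hive will not delete the underlying files on DROP.
CREATE EXTERNAL TABLE nasa_log (
  host    STRING,
  time    STRING,
  request STRING,
  status  STRING,
  size    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION '/gl_data/nasa';

-- Verify everyone can read the shared data.
SELECT * FROM nasa_log LIMIT 10;
```

Because the table is external, dropping it later removes only the metadata; the shared files in /gl_data/nasa stay untouched for the other participants.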
But not every dataset can be given a structure, and there is one problem there: Hive is a structured-data tool, which means somehow you have to give the data a structure or it won't work. There are some types of data, like the format called Avro, A-V-R-O. Previously, when you got Avro data, since it looks unstructured, you were not able to do anything with it. But now there is an Avro SerDe that they have created. Still, there are some types of data where you cannot give a structure. That's the first point. The second point is how this works in organizations.
Like I said, there will be a data ingestion team. I'm just giving you the example of GE. What happens: GE's business is so enormous that nobody has any clue what is happening there. It's one of the biggest firms in the world, actually, and the size of the data they're getting is enormous. They have financial applications, okay? These financial applications will generate data. And how do you get the data? They will have APIs, and they will pour their data in JSON format, so it comes in JSON. And sensor data: they get sensor data from the locomotive engines, the train engines in Europe that GE created. These train engines, whenever they're running, will produce sensor data; they collect it, and that data will come through Kafka. It'll come in Avro format, from Kafka to Hadoop, and ultimately land on Hadoop. But the data is sensor data, so it's like events: sensor data coming in an event format, like whenever there is an event, it generates a record. It's a text file, but the format they get it in is called Avro. So the point is that the data lands on Hadoop, and nobody can give any structure to it; nobody on Earth can. So GE has a data ingestion team. Their job is to take the data, give it some structure, and hand it to the big data team. You're getting my point, right? Because if you give it directly to the big data team, they have no clue what to do with it. So there will be a staging area where all this data lands, okay, and they have written some custom Java code. That Java code will read the Avro, remove some unwanted fields, give some structure, and send it to the next stage. So that is how they're ingesting the data. What I'm saying is: raw data, most of the time, you may not be able to handle it directly. And there's a second use case.
If you are a big data developer, like us, we don't want the data in its original format. But if you're a machine learning guy, you want the data in its original format. Are you getting the difference? If I'm a data scientist, I don't want anybody to touch my data; I want the original data with all the nonsense in it, because only then can I build my machine learning models — they want all the features. They don't want anyone to modify the data. But if I'm a big data developer, I am happy with structure. I can lose some data; I just need the structure, and my queries will work. So these two teams will be there. The machine learning teams will be collecting the original data, and then they will have their own algorithms to cleanse it.
Someone asked about GE's setup: their data is coming from Europe, streaming data, but they don't stream it live; they stream it once a day or something. I mean, they collect all the sensor data. Their whole backend setup is on Amazon, so there is some sensor service in Amazon. Who here is working on Amazon sensor data? You have a solution: Amazon Kinesis, right? Kinesis is the sensor-streaming platform in Amazon. So Kinesis will collect all this data, and from Kinesis they push it to Kafka, because somebody has to store it, right? There are very few real-time decisions they want to make on that data, because it's train data, and the only time they analyze it is if a train is not working, like it crashed — a train engine got crashed. Otherwise they don't want to analyze it in real time. About 99% of the time the trains are running fine, so this data is useless to them. But when a train engine crashed or something happened, that's when they want to look at the data, the last 24 hours of it: what caused the crash? They're not doing real-time analysis because the size of the data makes it neither sensible nor feasible for them. So they collect it all, batch it, and send it once a day or so, and it will go through Amazon Kinesis and land in Kafka. Why Kafka? Because from Kafka two or three teams will collect it. One team is using the data for something else, distributed, and then one copy will be stored in Hadoop as the original data copy, so anybody who wants it can get it. That is how their architecture is right now. It is an evolving architecture; they're just moving to Hadoop, so it's not a stable architecture, and probably after a couple of years they'll find another, better way to do these things. And right now they're not doing real time on this train data. They also have flight data: aircraft engines. GE makes the engines for most aircraft, around 80% they say, and all these engines generate the same kind of sensor data, which they collect. But again, that's all not real time; it's very rare that you get a problem, and only then do they want to analyze it. So I hope you are all able to reach this point. This is the place where your data is, right?
data is that right If you go to
this location I have copy the data there
If you look at the pdf what it
say's Lord the data selects star Now these
are some off the U S Cases I
wrote I mean you can also write your
own use cases so find the top end
points that received service identity meaning which are
the how many endpoints have received service side
edit right I'm just asking like this 501500
this out ourselves I did it And how
many times it has puckered So how do
you figure it out There are multiple ways
to do it But one of the ways
this you can see select status count off
requests from the war to say NASA log
table grew by status on you can say
having status Reggie Regular expression extract So here
I am saying that anything Starting with 50
in the status column because fires Arroyo is
so one sided Order by status dissenting limit
fight That's the equity So I just want
to know or the errors starting with 50
on Come them How many are there That's
what I'm writing now when you are using
having it has to be a real operator
assignment in having close in high when you're
using having close has to be assignment something
If you say equal to something equal to
something it will become a photo one It'll
become a become ones So here I am
just saying just comparing that comparison operators Beverly
quit Uh no In having only emceeing there
close and all it'll be normal Okay in
having Ulysse so somewhere there was a documentation
on this I'll share it with you Okay
Can you try this query and see Now
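One possible reconstruction of the query described above (the table and column names follow the session; the exact regex and limit are as read out, so treat the details as a sketch):

```sql
-- Count server-side (5xx) errors per status code.
SELECT status,
       COUNT(request) AS error_count
FROM nasa_log
GROUP BY status
-- Keep only groups whose status starts with "50".
HAVING regexp_extract(status, '^(50)', 1) = '50'
ORDER BY status DESC
LIMIT 5;
```

The HAVING clause here filters on the grouping column with a boolean comparison, which is the point being made in the session: the HAVING expression must evaluate to a comparison, not a bare assignment.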
Now, it is taking time — 24 seconds already. Anyway, this will take some time; it will be slow, so please try the query. One thing: when you mention the keywords, they should be either all small letters or all capital letters. Someone said a capital-L LOCATION caused a problem. I know about this, and it's arguably a bug: rather than throwing an error, it won't throw an error at all. This is a very common thing: if you create an external table and mention a LOCATION, even if that location does not exist, it may not throw an error up front, and then you will try querying, nothing will come up, and you will think, what is the problem? The spacing is fine, but I've seen this LOCATION keyword sometimes have a problem. That is what happened to me: I dropped and recreated the table and it started working — or maybe the keyword itself wasn't the problem; I'm not sure. So, how much time is it taking for you to run the query? 75 seconds. Ah, that is also one more point; we'll come back to it. Yeah.
So anyway, somebody was asking about this: the reference on regular expressions. Here you have the Hive Language Manual; it's in the PDF. Next: which resource is requested most frequently by the hosts? This is our question, right: which page is requested the most? I don't think it's a very difficult query, but still we'll run it just to see. So if you look at it here, it says: SELECT request, COUNT(*) AS request_count FROM nasa_log, GROUP BY request, ORDER BY the count descending, LIMIT 30. You are looking at the top 30 requests.
Randall but I don't think you can skip
the data but I don't think you can
skip columns No I don't think you can
filter Biggest Ultimately it's on her Do So
the data remains on How do Right That's
one problem Because if it is initially office
ah then there is no way you can
filter Anyway Locally I have to see whether
he is asking why loading the data Can
I filter the data like Iran warned Or
the 10 columns I want only you You
have skip Heather That option is that table
properties By the way there is something for
table properties Um you just go to Google
Say hi Create Inaba bvl properties We're so
in creation Off the table there is something
called table properties Where is it Uh huh
So table properties Right So here you can
mention the you know Skip header and or
let me just check no auto compaction mapper
memory or a bunch external True You have
Skip Heather is there that's for sure But
I don't think you can skip other columns
That is not possible I think Anyway I
will just see if I can get back
to you on that OK And there are
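For reference, the header-skipping property looks like this. The property key `skip.header.line.count` is standard Hive syntax; the table definition around it is just an illustrative sketch:

```sql
-- Tell Hive to ignore the first line of each data file in the table.
CREATE EXTERNAL TABLE nasa_log_csv (
  host    STRING,
  request STRING,
  status  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/gl_data/nasa'
TBLPROPERTIES ('skip.header.line.count' = '1');
```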
And there are some more queries I have listed; you can try those. Like: display the top N hosts that made the maximum number of requests, and so on. Find the total count of the different response codes returned by the server. So just try these queries and let me know if they're not working; I have not actually tried everything, but they should work. Okay, so now let's talk about Impala a little bit. It's probably not in the core syllabus, but you should know a little bit about Impala because it is heavily used in industry. If it is a Cloudera cluster, Impala will be included; if it is a Hortonworks cluster, everything is Hive, but on the Cloudera side we have something called Impala. So, a small intro, not much, but I'll show you an Impala query also.
So what is going to happen? Say you're having data nodes on the slide: there's one data node, there's another one, and this one. You can install something called Impala on them, and if you're using a Cloudera distribution, Impala will be there installed by default; you don't have to worry. Impala is its own SQL engine; it does not depend on Hive for anything. So when you install Impala, what happens? There is a daemon called impalad, the Impala daemon. It will start running on every data node, something like what we saw earlier, I'm just saying. So these Impala daemons will be running on all the nodes, first of all. And Impala has a shell from which you can fire queries; it does not depend on Hive. If you fire a query, what is going to happen is that, let's say it hits this machine: the query will be accepted by one of the machines. Let's say this guy, right? And this guy will look at the query. Now, Impala can talk to the Hive metastore, which means whatever tables you are creating in Hive, Impala can see. So all the tables you create in Hive, you can query using Impala, and whatever tables you create in Impala, Hive can also see, because they share the same metastore. That is one advantage.
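Because both engines read the same metastore, the same table is visible from either side. A rough sketch of that round trip (the table name is the one used later in this session; the refresh step is explained further below):

```sql
-- In Hive: create a table as usual. It is registered in the
-- shared Hive metastore.
CREATE TABLE cust (id INT, name STRING);

-- In Impala: tell Impala to pick up the new table's metadata,
-- then query it directly -- no Hive involved at query time.
INVALIDATE METADATA cust;
SELECT COUNT(*) FROM cust;
```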
On top of that, another thing: when you write a Hive query, Hive will convert it into MapReduce. So Hive is not bothered about how the query runs, because it is the responsibility of MapReduce to run it. Impala will not convert to MapReduce. Impala has these Impala daemons, which will run the query for you themselves. It is like an RDBMS, you can say — not that it is an RDBMS, I'm saying it's like one: if you fire an Oracle query, who runs the query? Oracle runs it. Like that, if I fire a query here, this Impala daemon will get the query, and it is the responsibility of the Impala daemons to execute it. So the daemons have the Hive metastore data, okay, and they will also have the block metadata, meaning: if you create an Impala table, like a Hive table, and you say load the data — now, the data is actually in Hadoop, and the data sits in blocks and replicas. Who has the metadata about those? The NameNode. These daemons will copy that to themselves, because they should know where the data is. Impala is not going to depend on anything to fire the query; it'll run it itself. So that is why the queries are very fast, and when the query hits Impala, it runs in memory. But there is a drawback.
Let's say the data was here: you have a block here and a block here. What will happen? You have RAM; this block's data will come into RAM here, and this block's data will come into RAM here. And this daemon will simply coordinate the query, because the data is here and here, right? So this guy will coordinate the query, and it will run the query in memory. The drawback is: if, let's say, this machine crashes, then the query will be aborted. It is not fault tolerant. But a Hive query is fault tolerant, because MapReduce is fault tolerant: a map task can fail and be retried, right? But Impala queries are not fault tolerant. It is distributed querying; not one machine is running the query, probably ten machines are running it, and if any one of those machines crashes, the query will be aborted. It will not convert to MapReduce or do anything; it just fails. So your queries are super fast — actually, very fast: it runs everything in memory and gives you the output where you want it, so it's faster, but this reliability issue is there with Impala. So what do we do? If you're having a Cloudera cluster and you're having ETL jobs, we never use Impala, because in ETL jobs, behind the scenes it will be running Hive queries, and probably a query will take three hours to run, or four hours, probably five hours. That's fine: even if during that ETL job a machine crashes, I'm safe, because it's a Hive query and MapReduce will take care of it. But if I run that ETL using Impala — let's say the query will be faster, maybe it takes two hours — but if after one hour one of the machines crashes, my Impala has to restart the whole query from scratch. So your ETL job has to start from the beginning; you lost one hour, right?
So Impala actually needs the Hive metastore for the metadata; Impala must use the Hive metastore, so Hive should be installed. That is another advantage for Hive: it will remain around, because a lot of tools require the Hive metastore — for example, Spark SQL. Normally when you configure Spark, you will tell Spark SQL to use the Hive metastore, so that the Hive metastore is used by everybody and all the tables you create are in one place. And if the table is there, you can either query it using Hive or using Impala; that's up to you to decide. But Impala also has some drawbacks here: this nasa_log table we created, you can't query it using Impala, because its regex SerDe is not supported. This table was created using a regex SerDe, and that is only for Hive. I can see the table in Impala, but I won't be able to query it, I think, if my memory serves me well. But most of the regular tables you can query without any problem, and both engines can access them at the same time, because the metastore is a shared place. What is my metastore? It's your MySQL database — a shared place anybody can access, not only Hive. And by the way, for other tools to access it, Hive itself need not be running; you just have to install Hive and set up the metastore. I'm saying whatever table you create in Impala, it is stored in the Hive metastore. So each data node will have an Impala daemon, and each daemon will have a copy of the Hive metastore and block metadata. And what happens if you fire a query? Any daemon can pick it up; that one is called the coordinator, meaning: if I write an Impala query, there may be hundreds of data nodes, so any data node running Impala can get the query, and that node is called the coordinator for that query.
So in this example, this is the coordinator, because this guy accepted the query, but it does not have the data. That's fine. Once it gets the query, it does a local lookup, which is very fast: it has the metadata, it understands immediately, here is the data and here is the data, it just splits the query and streams it to here and here, these guys run the query, collect the result, and it's displayed to you. So really, a lot of machines will be there where you are holding the metadata, and the metadata lookup is very, very fast. But there is no automatic sync: you have to manually refresh it; it will not automatically copy. Meaning: if I create a Hive table, it will not appear in Impala until I refresh the metadata. There's an option in Impala where you can say refresh, so if you feel that there is a new table, refresh, and the metadata will come. Ah, and block metadata: that will come once your table and its data are actually created. Say I'm creating a table; once I create the table and load the data, only then will the metadata of that data come. It will not store the whole of Hadoop's metadata — your Hadoop cluster has, let's say, 1000 files, okay? Impala will not store metadata for all 1000 files; that would be useless, right? Say out of those you created an Impala table and loaded file one into it. It will have the metadata of only this file. Why should it need all of them? The query will hit only this file, right? Only the data belonging to the table. I'm saying: when you create a table, you will say load the data; whatever data you loaded — say we loaded the NASA data — the block information of that it will remember, whether it came in from Hive or you directly loaded into it. It won't remember the block information of the other files in Hadoop; there will be a whole lot of files, right? Not all of the files are in your table. Are you getting my point? I'm saying it'll remember only the metadata of the files associated with the table; otherwise, why should it remember all the files? Manual refresh is required — I mean, the blocks it may track, but the table metadata it will not automatically refresh; you have to manually refresh. So usually we won't do it like that. Usually, even though Impala and Hive can communicate, we don't mix them, because Hive has its functionality and Impala has its functionality, meaning whatever tables you have in Hive will normally be used for reliability, like ETL jobs, and Hive will handle those; in Impala you'll create separate tables, you will say the same thing, LOAD DATA, blah blah blah, and that will be handled by Impala only — like separate tables. They can talk to each other, but what I'm saying is: if you want up-to-date information, you should say refresh; only then you get it. And there are two or three types of refresh; I don't remember completely. There's a complete refresh, which will redo and reload all the metadata from the Hive metastore, blocks, everything, and there's a faster incremental refresh. You can do it in Hue; these things are available in Hue, I'm not just saying it so you'll be wondering.
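The refresh statements themselves are standard Impala SQL; a rough sketch of when you would use each (the table name is illustrative):

```sql
-- Lightweight: re-scan file/block metadata for one existing table,
-- e.g. after new data files were added to its directory.
REFRESH nasa_log;

-- Heavier: discard and reload metadata for one table, e.g. after it
-- was created or its schema was changed outside Impala (in Hive).
INVALIDATE METADATA nasa_log;

-- Heaviest: discard all cached metadata for every table.
INVALIDATE METADATA;
```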
So if you go to the query editors in Hue, you can see Impala, right? So here is Impala. So probably the table you created right now should be coming in Impala; that's the logic, right? So if I go back — where is it, the May 17, May 19 database? This is my table, right? See, the cust table. Now, I'm in Impala here, and if I do a SELECT * FROM nasa_log, I don't think you can see it; I doubt it, I doubt you will be able to see it — maybe, I don't remember exactly. Ah, no: "Failed to load metadata... error in storage descriptor... Impala does not support this table type, RegexSerDe library not supported." It won't support that SerDe; that's why you are not able to see the data. No, it won't. That's fine; for Impala tables, if Impala cannot query your data, what's the use? Who will query it and get the data? That's the question. Ah, someone says the error mentions it's for one column. So here you can see the nasa_log thing is not supported. But if I look at any other table — so let's say SELECT COUNT(*) FROM, let's say, the cust table, right? If I do this, it really should work. For some reason Impala is a bit slow today and Hive was faster; in this case it is not, but Impala really is actually very fast. There is some problem with the direct link, it's on your services. The same query, if you run it in Hive, you will see MapReduce starting; in Impala there's no MapReduce at all. And there is one more thing: this refresh normally has to be done manually. Here you do it manually, or you can write a batch job to refresh. So I don't know whether the refresh is actually configured here or not for Impala, because this table is visible — we just created it right now and it is available in Impala, so probably. And this is the option to refresh. Can you see this? Refresh: three options will come. "Clear cache"; "Perform incremental metadata update", which will sync missing tables from Hive, meaning if you have a new table in Hive that Impala doesn't know about; and then "Invalidate all metadata and rebuild index", which can be both resource- and time-intensive. When you say invalidate, it will delete the whole metadata and rebuild it from scratch; it'll take a load of time. So these are the refresh options in Impala that you have. So try to create a table and see whether you can find it in Impala; I don't know. So we'll create some simple table, right? Which table do you want to create? Maybe I'll just change this name and run it as it is, right? Yeah, so I just created it. And if I go to Impala — May 19th, what is it — no, I don't have it. So if I say refresh, let's say perform incremental refresh — huh, now it came. So ideally you do this manually, because Impala cannot automatically identify that you created a table in Hive; if you do the refresh, it will come immediately. So Impala also stores a lot of metadata, and actually the recent queries are in the cache. So if you want to clear it, you can say clear cache; the recent queries you ran are stored in the cache. So sometimes the cache will become big, like when you're running a lot of queries, and that will affect the performance; then we can simply say clear the cache. Ideally, yes, it should keep things in the cache. Question: does Hive use Impala?
No, Hive does not use Impala; Hive uses only MapReduce. And can you hide a Hive table from Impala? No, I don't think that is possible directly; probably an admin or somebody can do it from the admin side. I've never — I mean, what is the use case? I don't know. Masking, maybe: table masking is there. I don't know if it is possible; I haven't ever tried. So table masking you can do, but that has to be done by the admins. You can say: mask this table, and it will not be visible to any other process like Impala or Spark or anything, if you want. But usually we won't do it like that. Usually they restrict access based on the user. So, say I am a guy who is a developer: my access will be set so that I can see only these tables in Hive and these tables in Impala. So behind the scenes, everywhere you will have tables where user-level access is set; you can control who can see what tables. Otherwise, everybody will be able to see all the tables. I don't think there is directly an option to stop them communicating. Hortonworks' Hadoop doesn't have Impala at all; they have their own thing: Hive is there, and by default it's Hive plus Tez as the execution engine, and then LLAP. LLAP is there, and actually Hive with LLAP is much more reliable, much faster than Impala, to be honest. So I'll show you LLAP — maybe not now; I'll see, in a later session. So this part is clear, right? Hive, and Hive versus Impala, at least the basics.
Hue has an automatic way of identifying the data, but that will work only if you have a proper delimiter, a comma or a space, in the file. You can create a table there and it will identify the columns if you have a space or comma or something; I don't think it infers structure from special delimiters beyond that. So the plan was only this much on Impala, and hands-on, not really — I mean, I'm just giving it as an option. Normally in Hive classes people don't teach Impala, or since everywhere Hive is there, Impala is not a subject as such; it has become like that, so people don't learn it separately. And in projects, because the queries and everything — the SQL you write — are the same, there is no difference at all; only the use case is slightly different. Also, it has its own story: Impala is contributed by Cloudera. Impala is an open-source project, but mostly through contributions from Cloudera, so it is like a proprietary-ish project. If any issue comes up, Cloudera will support it; a lot of Impala work is going on there. And the problem is that if you go to other platforms, you won't see Impala. Probably that is the reason people don't care much about Impala, right? So if I go to Hortonworks, I will not see Impala at all.
If you want security, there is something called the Sentry service. You have to set up Sentry; I don't think we have Sentry here. In Cloudera, I'm saying, you have something called Sentry — Apache Sentry with Cloudera. So this is Sentry: here you can set up all the, you know, user-level access, table-level access, everything. Again, Sentry is open source. So here you have users, groups, and access — see, a fine-grained authorization and privileges model for Hive and Impala. The thing is, Sentry is again Apache, but only Cloudera uses it. You go to Hortonworks, you have Knox and Ranger, two tools for this, called Ranger and Knox; again open source, but only in Hortonworks. So one problem is that platform-wise there are some slight differences; it happens. So there are slight differences: Cloudera uses this Sentry, and in Sentry you can configure who should access which table and all the authorizations; the admin has to set it. I don't think Sentry is installed in our cluster, which means anybody can see anything. Even for Hive you have authentication and authorization possibilities, but that is not there — I mean, as of now. They're asking: even the lab, the Docker ones? By default, I'm saying, right now what you're doing is you are logging into one of the machines; in a real setup it doesn't work like that. Are you getting my point? I think we discussed this in the first class. So this is your CloudLab, right? This is our CloudLab, and you have an edge node; so you're going in from here, you're connected to the lab. Any security, any service, you need to know how to enable it here, and as of now nothing is enabled. So when you just start, you just hit the cluster and anybody can see anybody's tables or anything that you create; Sentry is not enabled right now. And even when you're working in a company, you won't have access to Sentry; it's totally on the admin side. They alone can decide who should access what; you can request them. You may not even be able to see who has access. Yes, yes, the rules will be in the cluster, but once you hit it, they will be applied to you — like, to everything you're doing. Otherwise you can't restrict the users from coming in. Hortonworks has Ranger; Ranger in Hortonworks and Sentry are similar. I also like the Hortonworks documentation a lot; it is much more precise and clear — you read it and it's on point. The Cloudera documentation is very, how to say it, sometimes nobody would understand what they mean; but look at Hortonworks, their documentation is very easy to read and understand. See: comprehensive security for enterprise Hadoop with Ranger. There's how Ranger works, right? Everything you have here. So Apache Ranger offers centralized security for all these things: Hive, HBase and so on. So if you go to Hive, they will show you the policies; there is even a YouTube presentation. Very nice documentation; they have the best documentation, actually, Hortonworks. LLAP — Live Long and Process, is what I said; I think there is another explanation too. It's not actually "live long and process"... something is there; Hortonworks says it's Live Long and Process. I don't think it is "live long and process", but anyway. So for Hive, your data is here, and some optimization techniques are out there, okay? The Hortonworks document is the best if you want to learn all of it. If you're on a Cloudera production cluster, some of the things may not be applicable, since even though it is all open source, they have their own baseline. So if you are using a Cloudera cluster in production, you should always refer to the Cloudera recommendations, because not everything will be the same; some differences will be there, right? I hope that clarifies Impala. So we were talking about Impala.
Now I have to talk about something called partitioning in Hive. How many of you are actually aware of partitioning? Some of you might be aware of this. For those who already know it, consider this a refresher, right? See, the idea is very simple. So, your manager asked you to create a sales table, okay? And you are creating a DB — a database — and under it a table called sales. We're leaving out the location: it's a managed table, it will be under the Hive warehouse. Then what? The database is a folder, and the table is a folder: this is the location for the managed table. Now let's imagine that every month you are getting data, monthly. So in January you got the data; it became, say, jan.csv. You said LOAD DATA LOCAL, blah blah blah; this is where the data will go — you know this already — and the data got uploaded here. Everybody is happy, no problem. The problem is that the next month and the next month keep happening, right? What is going to happen in your sales folder? In February, the February data will get uploaded into the same folder where January's was, and again, what will happen? March, right, and April, and May, and so on. So the problem is, if this is the model you're following, after, let's say, a couple of years, if you look at the folder, you will have 10, 20 files inside the same folder. Now, the real problem is, let's say you're writing a query, something like: SELECT * FROM sales WHERE month = 'Jan'. Say you're writing this query. The problem is, Hive has no clue where your data is; it has to scan all the files in this folder to give you the result. The pain point here is that, you know, when you write a query, Hive will scan all the files in the sales folder. So if you are having, let's say, 100 files, Hive has to scan all 100 files before producing the result. The drawback of this is that your queries will be very, very slow.
So in order to avoid this problem — if this is the problem you're facing — you can do something called partitioning of the table, okay? And in partitioning you have two things: there is something called static partitioning and dynamic partitioning. Static and dynamic are the types of partitioning you can do, okay? And I will show you that with the help of an example; that will be better. But what is the idea in partitioning? First, you have to identify the column or columns based on which you want to do the partitioning. So I am assuming that this is my table structure, and most of my queries are based on month = something, right? So what can I do in Hive? I can say: okay, create a partitioned table where the partition column is month. And what will it do? If you provide the data, it will automatically identify the months, create folders like this, and place the data like this. So when you partition based on month, okay — there may be people asking how this is done; I will show you technically what is happening behind the scenes — it will identify the month column, and as many months as you have, that many subfolders will be created, and it will move the data. Now, we know that there are only 12 months in a year. So if that's your data model, you would first partition based on year, then month, so that it'll create a folder called 2016, within that 12 subfolders, then 2017, within that 12 subfolders, and so on. And if this is the model, when you write that query it will only hit here and just skip over the other folders, so your queries will be naturally faster. Now, this depends on what columns you're having.
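A minimal sketch of the month-partitioned table and the two loading styles just mentioned, static and dynamic (the table and column names are illustrative, not the exact ones from the session; the SET properties are standard Hive settings for enabling dynamic partitioning):

```sql
-- Partitioned sales table: 'month' becomes a subfolder on HDFS,
-- not a regular data column.
CREATE TABLE sales_part (
  txn_id STRING,
  amount DOUBLE
)
PARTITIONED BY (month STRING);

-- Static partitioning: you name the target partition yourself.
INSERT INTO sales_part PARTITION (month = 'jan')
SELECT txn_id, amount FROM sales WHERE month = 'jan';

-- Dynamic partitioning: Hive derives the partitions from the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO sales_part PARTITION (month)
SELECT txn_id, amount, month FROM sales;
```

A query like `SELECT * FROM sales_part WHERE month = 'jan'` then reads only the jan subfolder instead of scanning every file in the table.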
the column name is Jan So what is
the next column You have a week one
week toe like that you have I mean
this depends on the data Now in this
data I don't have date I have a
month Korla my have a day call Um
I have a year old now Partitioning is
not a must. It's not like everybody must do partitioning; there are many conditions where you cannot do partitioning. Like I said, if you're having a date column, you can't partition on it directly. Because if I have a date column, and let's say you're looking at the last 10 years of data, how many days will be there? 3,650. If I say partition by date, it creates 3,650 folders; that's useless. There is another way to tackle that, which I'll tell you about, but you don't do it with partitioning. So partitioning is applicable when the cardinality is very low. You know cardinality, right: how many unique values you have in the column. Like in this case, or let's say you are getting data from different countries in the world. Fine, at maximum you have around 150 countries, so country-wise you can say partition. But, like I said, another case is that you are having transactions and you want to divide the data by transaction ID. That's not possible, because every transaction has a unique transaction ID; it would start creating that many partitions, and that's not possible. So partitioning should be used only when your cardinality is within control, like this.
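The year-and-month layout just described can be sketched in HiveQL like this; the table and column names are my own illustration, not from the recording:

```sql
-- A sketch of the partitioned-table idea, assuming a sales data set
-- with illustrative columns (product, amount) and partition columns
-- year and month.
CREATE TABLE sales (
  product STRING,
  amount  DOUBLE
)
PARTITIONED BY (year INT, month STRING);

-- A query that filters on the partition columns only reads the
-- matching subfolder, e.g. .../sales/year=2016/month=Jan/,
-- instead of scanning every file in the table.
SELECT product, amount
FROM   sales
WHERE  year = 2016 AND month = 'Jan';
```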
Right now you're partitioning only on the year column; let me give you a different use case. Let's say you're getting the data from different countries. It also depends on what queries you will run. For example, on this partitioned data I write a query: select star from sales where country equal to India. That is the worst query I can write, because my data is partitioned on month, and a query based on country will get no benefit; it is actually worse. So if you know that most of your queries are based on country and month, you partition by country and month. I mean, we can debate which is best: probably first partition by month, then within that by country. You have to figure out the cardinality and decide. And also, if you are ending up creating a lot of partitions, that's not a good idea for Hadoop, because Hadoop does not like a lot of subfolders and subfiles, right?

So, another drawback; I think I discussed this in the first class. This partitioning, if you do it in traditional databases, is sometimes not effective. Why? Because we had a use case there: we had an Oracle cluster like this, four machines. And let's say you're looking at iPhone sales data; you're selling iPhones, you have iPhone sales data, and you want to partition the data based on country, that is, by number of sales in each country. So I will say partition by country. But that's logical partitioning; you're not physically dividing the data. The problem is it will create partitions for countries like India, China, the US and so on. So probably the US partition will be this big, and what will the Congo partition be? Tiny. You're getting the problem: you're logically partitioning, you're saying divide the data based on country, and since in the US a lot of people buy iPhones, the US partition will be very big, and the Congo partition very small. And in Oracle, if you fire a query, the machine responsible for a partition works on its own, so queries against the US partition will be very slow, because that one machine has to churn through everything, while the Congo partition will be very fast.

In Hadoop, you will do the same partitioning; you will say you have US data, India data and so on, and it will create subfolders. But the advantage is that even if the US file is three TB, it will be spread as blocks across, say, 100 machines; in Oracle it would be on one machine. Listen to what I'm saying: if you're storing the data in Oracle, Oracle has no idea of blocks and replicas, so the US partition data, maybe three TB, will be on one machine; it won't divide it. If this is in Hadoop, three TB will never be on one machine; the blocks will get divided, right? So my query will actually be faster. So partitioning is much more effective in Hadoop, you can see, right? Which means partitioning in traditional databases is not so perfect; it helps up to an extent, but it does not ensure all the partitions have an equal amount of data. And by the way, partitioning is logical in every case; even in Hadoop it is logical, and in Oracle it is logical. In Oracle, I'm saying, all the partitions of the US would mean one machine; there is no way I can further divide that data. You're getting my point, right? Because it is not Hadoop or some distributed system, all the US data will be on one machine, so that machine will have a lot of data related to the US. But in Hadoop, if these are your four data nodes, you will never have three terabytes on one node; it always divides and spreads, it distributes. So my performance with partitioning is better. These are some of the things you understand when you work on it. So that is why
some of the companies will say don't do partitioning too much. Then there is sharding. Sharding is dividing an existing table physically, and you mostly see it when doing NoSQL; some RDBMSs support sharding, but it is not extensively useful there, actually. In NoSQL databases you can do table sharding: sharding is a physical division. You take a table and say, take 10,000 rows at a time, just divide it into three or four pieces and dump them on different machines. Okay, but that is not really possible in an RDBMS, because in an RDBMS you have normalization. A partition will logically divide the data, but can you say, I want to divide this table into rows 1 to 200, 200 to 400, and so on, onto separate machines? That's not possible, right; partitioning is logical. Physical division means, like I'm saying with sharding: I take a table, the table has 2,000 rows, and I say divide it into five chunks of 400 rows each and keep them on five machines. That's not possible in a normal RDBMS; that is what is called physical division. You are not giving any condition; you're saying take the full table, split it into chunks of so many rows, keep them on five machines. That's not possible in an RDBMS. In NoSQL it is possible because it is denormalized; in NoSQL everything is denormalized, so I can divide my data and push it to whichever server I want, nobody cares, it's already distributed. In Hive, as in a traditional RDBMS, physical division is not possible; you do it logically. You say: look at this column, and if you have 100 values, values 1 to 200 go here. You don't have any control; you cannot say I want this many per partition or I want this size. There is no physical control; you are logically dividing the data, right? There is no way for you to control how much lands where; Hive has no clue, right, otherwise it just keeps on scanning all the data. So for creating these folders, either you can create them manually or, normally, you don't do it like that; we run a command and they get created, actually. So it is better to do this than just talk about it. This is just the basic idea; I want you to find the data set and then start working on it. So
that is partitioning; you can see this folder-based partitioning. And there is a Word file if you want; it has the commands, but I have been facing one issue with this Word file: when you copy-paste from it, sometimes the commands will not work. We will see whether they work here. Okay, that quote character, right, the curly single quote, was not recognized, so we will see whether it works. So note it down: we have now discussed what static partitioning and dynamic partitioning are, and I will talk about static first. Let's create a table and then we can start talking about it. So, are you in the db? Yes? Right, okay, the db is here. Now let's say create table; just copy this command, okay. Static partitioning is a
bit confusing, but we can understand it. So I'm creating a table called user1. As you can see, if you look at the schema, it has first name, last name and id, only three columns. And then I say partitioned by country comma region. This means these are my partition columns: I want to divide first by country, then by region. Fine, but there is a catch here. What is the catch? What is the data you're going to load? If you look at the data, this is the data, userinfo1, and if you open it, can you tell me how many fields are there? Three, which are first name, last name and id, right? If you look at the table I created, it is matching, right? Then I'm saying partitioned by country comma region, right, but there is no country or region in the data. This is static partitioning. So static partitioning means you are getting the data, and you know from where the data is coming, but the data itself will not have those columns. Meaning, I know that this data is coming from, say, country equal to US, state equal to California, right? And while uploading this data I should mention that. The data does not have any columns for country or region or anything. If I upload the data you will understand. So do one thing: copy this to HDFS, your userinfo file. Can you do it? Go to HDFS; I'll just wait for people who are putting userinfo1; in fact, copy all of them so that we can save time, right, the others are also required. Yeah, and if you're copying from local, use the file browser upload; it's faster, right?

And I have a question for you. So right now, in Hadoop, will the country and region folders for partitioning have already been created? I'm saying create a partitioned table with country and region; will it actually create the subfolders? No, nothing will be created as of now. And then what do you need to do? Where is the Word file? I will copy this, because I don't want to type a lot, but you have to change it, okay, don't type it exactly as it is. What is the location? For me the file name is different; I think yours is userinfo1.txt, right? Is there a .txt extension? Yeah. So now what am I saying? I'm saying load the data, and this is my data, into the table, and while loading the data I'm mentioning the details: I'm saying the country is New Zealand and the region is Quebec. So what will happen? It'll create a folder called country=New Zealand, within that another folder called region=Quebec, and within that this file will go. Can you verify that in the warehouse folder? If you have loaded it, it should be available in the warehouse folder; it's the same path in Hue. I'll go to the file system; and where is your data, user1, right? You go to the hive warehouse; so many dbs are there, people created many dbs. Ah, here is the user1 table. Can you see country equal to New Zealand? There is a folder in Hadoop called country=New Zealand, within that region=Quebec, and within that you have your data, right? Try it yourself and see whether you can see it. This will be there in your project and assignments, so partitioning is very important. It's in the Word file, right? Okay, sorry, sorry, I just showed it here. So remember this point.
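A minimal sketch of the static-partitioning steps being run here; the HDFS path and the comma delimiter are my assumptions, so adjust them to wherever you uploaded userinfo1.txt:

```sql
-- Static partitioning: the data file itself has only these
-- three columns; country and region come from you at load time.
CREATE TABLE user1 (
  firstname STRING,
  lastname  STRING,
  id        INT
)
PARTITIONED BY (country STRING, region STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- While loading, you state the partition values yourself. This
-- creates .../user1/country=New Zealand/region=Quebec/ in the
-- warehouse folder and moves the file inside it.
LOAD DATA INPATH '/user/cloudera/userinfo1.txt'
INTO TABLE user1
PARTITION (country = 'New Zealand', region = 'Quebec');
```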
Static partitioning should be used like in a use case we had, okay, where we were getting user data from multiple countries. In one of our clients' applications, the user data was coming from multiple countries, but the data didn't have a country column, so we used to mention it manually. That's called static: you will say that for this data, the country is US. Now we have to do something called dynamic partitioning; let me show you that. So make sure you are able to log in to Hive, and always use your own db. Some of you are using the default db, and when you typed that command it said the table already exists, because a lot of you are actually using the default db. Meanwhile, if you look at this data,
right, for Hive partitioning, can you tell me what is the difference between this data and the previous one you loaded? Exactly: it has the city and, ah, state or whatever. By the way, what are Togo and West Bengal doing together? Whatever, okay. So you have some city or region. Yeah, so this data has the country and city columns. So the best way to do partitioning here is: you create a partitioned table, you say it is partitioned by country and state or whatever the column is, and then you tell Hive: I'll give you the data, you decide yourself; I'm not going to divide the data. You look at the column, find out how many countries are there or how many states are there, and accordingly create the partitions and put the data in. That is called dynamic partitioning. There are only like 20 lines of data here, two for India, so that many partitions will be created. And I think the first part starts with Christmas Island; there is no repetition, right? But there is one thing you have to understand about dynamic partitioning: dynamic partitioning is by default disabled in Hive. Because somebody could come and say, okay, create a dynamic partitioned table using a transactional column, and then what will happen? Millions of partitions will get created, and it'll screw everything up. So by default, even in production clusters, dynamic partitioning will be disabled. So first you have to enable it;
that's the first thing. The second thing is that you cannot upload the data directly into a dynamic partitioned table. What you need to do first is upload the data to a temporary table, or some staging table, okay, and from that table you insert into the partitioned table. Meaning, previously what we did was create a static partitioned table and then say load data into such-and-such partition; we were able to do that because we knew which partition it was going to. Right now we don't know; Hive has to decide. So what I'll be doing is: I will create a table first, a normal table, and I will upload this data there, no partitioning, nothing. Then I'll create a partitioned table, and from the original table I will say insert into the other table. While doing that, Hive will divide the data, and that is the only way to do dynamic partitioning in Hive. Okay, we'll do that practically so you will understand. You cannot load directly; the data has to be in a table, and from there you have to insert. In this example it's not terribly realistic, because we're having all unique values; normally you would have, like, five records for one country or something like that, but here everything is unique. Still, that's the idea. Um, let me see if I have another file. Okay, there is this land property data set for practice. I will do one thing: once we finish this particular example on dynamic partitioning, we will do something called bucketing also, and once we finish that I will give you this assignment, this land property analysis. You can do it by yourself; I mean, it's easy now, the commands are already there, and it will help you to reinforce the idea of partitioning; it has partitioning also. Just try that once I finish. Okay, so for the time being, for
dynamic partitioning, what we will do is create a table called user2. You can see what I'm doing from here; same thing, I copy-paste, and this is the partitioned table. Look here: I'm saying partitioned by country and region. So this table is the one which will have my data partitioned, okay? Then what do I do? I create a table called user3; that's a regular table: first name, last name, country, region, those four columns, plus the id. Where is the data? userinfo3, right? It has first name, last name, id, country and region, so your schema should have five columns, right; only then will the load work. I mean, this is the original table, so this table called user3 will hold your original data. So what you will do first is load the data here. How will you load it? Not with a partition clause; the file is already in Hadoop. Where is it in Hadoop? Under /user/ slash whatever; what's the file name? You will say userinfo3 and the extension, right, whatever the name is, into table user3. So this will load the data into the user3 table. Yeah, so this user3 table will hold your original data, no partitioning, nothing; it just has your data. From there I will insert into the partitioned table. I cannot directly upload the data into the partitioned table in dynamic partitioning, so I should copy the data into one, let's say, staging table, and from there I insert. In static, you are mentioning which country and which state manually, so you can upload the data file directly; here the file has to be divided based on the column values dynamically, and Hive has to decide. So you load the data into one table, and from there you say insert into the partitioned table, and Hive will dynamically partition based on these two columns, country and region, which I have declared, right: first all the countries, and within countries, regions. So right now we have around 20 to 25 rows of data, so only that many partitions will be there.

A question came up: why are the partition columns outside the schema? When you define a partitioned table, the partition columns, whether they exist in the data or not, have to be outside the main column list; in the partitioned-by clause you mention that part of the schema, so they should not be repeated inside. They are not required there, only here, because the table's schema will already be including the partition columns from the partitioned-by clause, so Hive will understand. Okay.
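The staging-then-insert flow just described can be sketched like this; the HDFS path and the delimiter are my assumptions, and it presumes the dynamic-partition properties covered in this session are already set:

```sql
-- Staging table: holds the raw data with all five columns.
CREATE TABLE user3 (
  firstname STRING,
  lastname  STRING,
  id        INT,
  country   STRING,
  region    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/cloudera/userinfo3.txt' INTO TABLE user3;

-- Partitioned table: country and region live only in
-- PARTITIONED BY, never in the main column list.
CREATE TABLE user2 (
  firstname STRING,
  lastname  STRING,
  id        INT
)
PARTITIONED BY (country STRING, region STRING);

-- Hive reads country and region from the last two SELECT columns
-- and creates one partition per distinct (country, region) pair.
INSERT OVERWRITE TABLE user2 PARTITION (country, region)
SELECT firstname, lastname, id, country, region FROM user3;

-- Verify what was created:
SHOW PARTITIONS user2;
```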
Now I want you to do one thing. If you look at the Word file, right, you see these things, can you see? These are the properties you need to enable for dynamic partitioning. The first one is very easy; you will understand it if you read it. You are telling Hive to enable dynamic partitioning: set hive.exec.dynamic.partition to true. This means that by default it is false; by default dynamic partitions are not supported, so I'm saying yes, I want dynamic partitions. But what about this second one? Let me copy that. Even if you enable dynamic partitioning, by default in Hive, what happens when you load a partitioned table? At least one static partition should be there. It's a bit complicated to understand; there are a lot of layers of safety, but everything can be overridden also. So in Hive, first they say: I will not allow you to create dynamic partitions at all, only static is allowed. So you say set hive.exec.dynamic.partition to true, which means now I can create dynamic partitions. But then their thinking is: if we are allowing you to create dynamic partitions, what if you again try creating 1,000 partitions, 10,000 partitions? So there is a strict mode and a nonstrict mode. What strict mode tells you is that even if you are doing dynamic partitioning, okay, you have to first create one static partition in the table; the rest can be dynamic. One column you state yourself; the rest can be dynamic. You can again turn that off. Say many people have a country-wise partitioned table; right now our partitioning is based on what, country and region, right? So what they're saying is: the country partition you mention manually while uploading, and the region Hive will decide. In strict mode it is not entirely automatic; it is saying that you must manually mention at least one static partition to be created. It is just a safety measure; even in production this is often relaxed when we want to do fully dynamic loads, because if you have three columns and one column you have to statically mention, that is not always workable, right? But sometimes it is very useful, I'll tell you, because say you're getting data where there is no country column, okay, so you want to add it and pre-partition; that will be static. And you have some other column, let's say state or something, that is already there, so that can be dynamic. In that case strict mode fits properly, because you will mention the country statically, there is no other way, and the rest you are dynamically creating. Here we are turning it off because it's a developer environment; it's fine, we'll create as many partitions as we want. Normally, hive.exec.dynamic.partition is false, which means you cannot create any dynamic partition, and the mode will be strict by default. Yesterday
somebody asked me, and I even forgot to discuss it: you can create Hive scripts, like SQL scripts. You create a script, right, a .sql file; the same way, I can copy all these commands into a text file, save it as .sql, and say run it. It'll run as a script, so you don't have to type everything, right? Hive will honor that. So in the script, at the top, you will first add these things, you know, set this, set that, and just run it. That's how usually we do it.
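For instance, a small script file (call it, hypothetically, load_users.hql; the name and the statements are my own illustration) could hold the whole session and be run with `hive -f load_users.hql`:

```sql
-- load_users.hql: session properties first, then the statements,
-- executed top to bottom.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

USE mydb;

INSERT OVERWRITE TABLE user2 PARTITION (country, region)
SELECT firstname, lastname, id, country, region FROM user3;
```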
The extension can be .hql also, .sql or .hql. Honestly, I have not tried anything apart from those; normally we use either .sql or .hql. I don't know if anyone tried any other extension; if you have a setup, try it, because all the scripts we have were executed saved as .sql only. I haven't tried others; it might work, I'm just saying. And these set properties are specific to your session. So Hive's language is called HQL, Hive Query Language. And this is not a shell script; we're not writing shell scripts here. A query script is different from a shell script. What I'm saying is: all these commands you can write in a text file, name it, and you can say run it, and it will execute them one by one; you don't have to copy-paste each time. It is HQL, Hive Query Language.
That's what I'm saying; like I said yesterday, the language of Hive is called HQL, which is derived from SQL, so it is about 95% similar. There are some rare commands which only Hive has and SQL doesn't. I think the dialect it is using is around SQL:2003, if I remember the SQL standards; much like SQL has SQL-92, SQL:2003 and so forth. Hive does things plain SQL doesn't; do you find any syntax different from your regular SQL? The partitioning clause is different, right? That's what I'm saying: if you're doing partitioning in SQL, it's not this command; slight differences are there. So these two properties: one is enabling dynamic partitioning, and the other is setting the mode to nonstrict. Now, again, for controlling this, there is
hive.exec.max.dynamic.partitions = 100, which means you are restricting the maximum number of dynamic partitions to 100. Then there is the per-node property; the property is correct, but you should give it a suitable value. For example, let's write it as 100. What this means is: let's say you're creating dynamic partitions, you have a very large file, and in the column you have, imagine, 150 country values, so 150 partitions will be created. Each partition is written out by one reducer; this creation of the partitioned table is a MapReduce job, and ultimately one reducer works to write one partition. So if you have 150 partitions, 150 reducers would work, and here you can control it: hive.exec.max.dynamic.partitions.pernode = 100 means a maximum of 100 partitions can be created by each mapper or reducer node. So, if you want, you can bound this. There are advantages and disadvantages. Sometimes you may not really need 150 reducers to do the job; if you have 150 partitions but your data is very small, you can achieve it with fewer reducers, so you could limit it and, say, 50 reducers will be launched. So these example values are arbitrary; I'm simply saying: maximum number of partitions is 100, fine, and per node it is 100. So roughly 100 reducers will run if you have 100 partitions, but mappers will depend on your input splits, like how much data you are getting; over that you don't have any control anyway, only over the reducers. My point is, if you're having a very small file and around 100 partitions, fewer reducers can do the job, and you can control that with this property.

And I didn't make up these properties; they're available in the Hive documentation. So if people are thinking, okay, from where did you get all these properties: it's properly documented, and I'll give you the document. You can copy any of these properties, go to Google, just paste it, and the Hive Language Manual will come up, usually the first result. Okay, so this one is Hortonworks; I mean, the original Hive Language Manual will come up. If you open it and just search, you have all the properties. Where is it: hive.exec.dynamic.partition, default value false, whether or not to allow dynamic partitions; hive.exec.dynamic.partition.mode, default value strict, and in strict mode the user must specify at least one static partition; likewise hive.exec.max.dynamic.partitions, default value 1000, and hive.exec.max.dynamic.partitions.pernode, the maximum number of dynamic partitions allowed to be created in each mapper or reducer node, which effectively bounds how many partitions each task can write. Right, and you also have other properties for Hive; we will look into those later anyway.
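Putting the four properties together, the session setup looks roughly like this; the limits shown are illustrative, not recommendations:

```sql
-- Allow dynamic partitions at all (default: false).
SET hive.exec.dynamic.partition = true;

-- Don't require one static partition per insert (default: strict).
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Cap the total number of dynamic partitions per statement
-- (default: 1000) and per mapper/reducer node (default: 100).
SET hive.exec.max.dynamic.partitions = 1000;
SET hive.exec.max.dynamic.partitions.pernode = 100;
```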
So if you come here, we have enabled all these properties, right? And now what you should do, it's in the Word file, yeah, is simply this insert command. You will say: insert the data from the staging table into the partitioned table; this is how you do it. And this will fire a MapReduce job; you can see it on the screen. It'll create the partitions; look at my screen, or you can look at your own screen, where you can see the partitions loading, loading, loading. I'll show you; it was dynamic, right, so see: country, then region. Right now, see, it is country Nigeria, region something, and so on. All these partitions are loaded dynamically. And verify that in Hue; verify that this is created in HDFS. I go here; um, what is the table name? If I go to user2, see how Korea came: I think it didn't come correctly; it says just 'Korea', but we have 'North Korea', right, and with quotes. And yes, yeah, remember, we're in nonstrict mode, and under any country you take there will be a region, and under any region you take there will be a part file, because it is the output of a reducer; that is the way reducers write their output. I don't know how many reducers ran for this; that's another thing. So this is the result, and if I go to another one, yeah, so that means that many reducers ran, right? Is everything there? Otherwise I'll just check the rows, one to everything. We can actually see the statistics here: if you go to, not workflows, the job browser; there is a job browser, and this is the job that ran. Did I show you this? No? I think I showed it in the first class. If I go here it's masked; it's not displaying everything, showing stage one only, anyway. So, are you able to run this? Yes? I think yes, right? Can you see the partitions? How many reducers ran and all, we'll see that later, but are you able to create this? At least locate it in HDFS and see whether you can find all the partitions from the imported data, right?

So our data may need to be cleaned, I think, because some of the values are like this, with quotes; if they come like this, you have to clean them. So usually what we do, I mean, if there are only limited records, is use Pig or Spark to clean the data; you can easily clean it, you can find the quotes and then replace them. The first column name that you give in the partition clause will be the top-level partition, and within that it will create the nested partitions; the rest of the partitions are loading fine, there is no problem, I think. And there is no easy way to search for the regions from here; you have to manually verify, because here it says country equal to 'Korea', region equal to 'North', and that is what it produced: it actually created Korea, and the region is North, which is incorrect. Okay, so before loading, you have to validate the data; that is only fair. If you do that, then your validators will scan the fields, do field validation, check whether all the fields are proper, and handle exceptions; this is your job before you load. This data set has an exception like that; that's fine, but usually you have to validate it first.
So this is usually how you do dynamic partitioning. One more thing: when you're doing dynamic partitioning there are sometimes performance problems; creating too many dynamic partitions is not advised. Even Hadoop will not like it, because the NameNode has to handle all the metadata for the subfolders and the files. Like, if you are having 150 countries and each country has states, and you say divide by country and state, thousands of folders will be created, and that will increase the metadata load on the NameNode. So in one way it is good, your queries will be faster, but there is the other side. There is no hard limit on how many partitions you can create, but they say do it with caution; the only thing is that your NameNode performance will be affected. A few thousand partitions are not a big problem, but where a huge number of partitions are there, that will become a problem. And, as I said, partitioning is not a must. Usually you use partitioning when your queries are running slow, and the reason is that inside the same folder you're having maybe a hundred files and Hive has to scan all the files. And there are some cases where you cannot do partitioning; you may not be able to find the logic to partition on, probably there is no suitable column or something, and then you can't do partitioning. There is another way then. It's not like everybody must do partitioning; there is no rule like
that. Now, somebody asked a question, right, and this will lead us to bucketing. What if, let's say, you're getting data country-wise, so you decided to partition, and country became your partition column, with around 150 countries? Look at Amazon, look at Dell, for example; Dell has a presence in 150 countries, and customers are buying products from them, and they have this data country-wise. Okay, so what did they decide? They decided to partition based on country. Fine, no problem. Now, after this, what happens? Every time a customer purchases a product there is a transaction ID, right, and they want to write queries like this: select something where country equal to India and transaction id equal to 1234. Now, it's already partitioned based on country, so that part is saved, because it will look only at India. But within India you have millions of transactions, so this transaction query will still be huge; it has to scan everything, and probably there are multiple transaction files. And you cannot partition based on transaction ID, because every transaction is unique, right? So what do you do?
That's the question. And I don't know whether this is available in your RDBMS or not, but you do something called bucketing. You have something called hash partitioning, right? Can you tell me what hash partitioning is in an RDBMS, those of you who have used it? Because bucketing is very similar to hash partitioning on the RDBMS side; I have not done hash partitioning myself. So what we do is this: see, there's a transaction id column, right? This is your transaction id column, and one of the values we have is, let's say, 1234; like this, transaction ids up to, let's say, one million. This is the data you have. Now you want to divide this data based on transaction id, and you cannot do partitioning, so use bucketing. When you do bucketing, you typically do it on a single column, and I say I want to create, let's say, 10 buckets; you always mention the number of buckets, see, that is up to you. What is going to happen: within each country partition it will create 10 files. It will divide the data based on an internal hashing logic, okay, and create 10 buckets, 10 files, within each country folder. So when this query hits, it will calculate the hash of this transaction id, okay, find out which file is holding it, and it'll hit only there. To understand it easily (this is not the actual logic): 5, 6, 7, 8, 9, 10, 11, 12, these are the transaction ids, and now I want to divide them into 10 files. There is an internal logic which Hive uses; I'm not claiming it's the same logic, but when I say I want 10 buckets, 10 files will be created, fine. So say 1, 2, 3, 4 are the files, okay, and let's take a very simple logic: you find the modulo of 10, you know, modulo, divide and take the remainder. What is 1 modulo 10? 1. So this value goes to this bucket; this one also goes there; are you able to follow? This goes here, this goes there. It is not exactly like this, I'm saying it's similar; there's an internal hashing logic. And then when you write this query, it'll compute, say, 4 and go to the fourth bucket. That's the idea, not this exact logic; it doesn't do modulo 10, okay, there is some other more involved logic, but it will internally divide. We don't have any control over the internal logic. Right now, many people
have a misconception that you should always use
partitioning and but getting together No I mean
it's a misconception People who are already working
I have seen a lot of people discussing
no but in most of the examples when
you search you want to learn partitioning People
will say that I will first create partition
then inside that marketing that's a practice but
you can independently create But getting without partitioning
I can simply take a table side Want
a bucket It that's all I don't want
any partitioning or anything So what will happen
in the same folder It'll create those many
fires 10 files physical We can say that
marketing we can see Practically we can see
if you can do partitioning Don't know about
getting because that's the most obvious thing Very
cannot go a partitioning You enable marketing for
example India is there within that you don't
have any choice on There is no ideal
way to decide how many buckets are required
So next question is how many buckets 1000
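The transaction example above can be sketched as a bucketed-only table — no partitioning, clustering on the near-unique column (table and column names are hypothetical):

```sql
-- Hive hashes the CLUSTERED BY column to pick one of the 10 bucket files.
CREATE TABLE transactions (
  txn_id INT,
  amount DOUBLE
)
CLUSTERED BY (txn_id) INTO 10 BUCKETS;  -- 10 files in one folder, no partitions
```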
So ideally you should take care that the file size is somewhere near the block size. Don't create something like 1 MB-sized buckets. If you are having, let's say, 10 GB of data within a partition and you say create 1,000 buckets, they will become very, very small; don't do that, because that will create smaller files. So ideally, take a block-size calculation to decide how many buckets; you need some logic for that. Actually, if a bucket becomes big, that's not a problem; it should not become small — that's the problem. Like, if the bucket size would normally be 1 MB, then it'll create so many small files, and Hadoop doesn't like that. If the bucket size is 500 MB, fine; it can even become 1 GB — it'll still be divided into blocks; that's fine, we are okay with it, near the block size or bigger than the block size. If it is smaller than the block size, Hadoop doesn't prefer that, because smaller files are not handled very well
in join operations. These buckets are excellent for join operations: there is something called a bucket join. Hive has a lot of advanced concepts; for the time being you may not be aware of them. Okay, so what will happen is this: I have a table here, this is my column, and it is bucketed — this means it is divided into, let's say, five files: 1, 2, 3, 4, 5. I have another table that also has the same column, and that is also bucketed on the same column, with either the same number of buckets or a multiple of it. Then you can do a really fast bucket join. There is a property for bucket joins: since the data is divided into equal partitions, the join will be really fast when you bucket. So when you write a join query, you can say use bucket join; there's a command you can use. I mean, there can be mismatches when you do join operations — there should be a common column, a join column, and usually the tables will have similar entries; that is the idea of a join, otherwise what is the use of a join? So this is like: you are having two tables and you want to do a join operation. Normally you can do a join, no doubt; but if the join column — the common column — is bucketed, then there is something called a bucket join you can use, which should be faster. Hive will internally make it faster: since it is bucketed, the join operation will become faster — it's the hashing. And there are multiple types of bucket joins, actually, in Hive; it's very interesting, it gives better performance. Anyway, it's rare that you may get a chance to use this, but be aware that these things exist, like bucket joins. So, shall we do bucketing practically? That will be better, right?
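A sketch of the bucket join setup described — two tables bucketed on the common join column (table names are the hypothetical ones used above; hive.optimize.bucketmapjoin is a real Hive property):

```sql
-- Both tables are assumed CLUSTERED BY (txn_id), with bucket counts
-- that are equal or multiples of each other.
SET hive.optimize.bucketmapjoin = true;

SELECT t.txn_id, o.amount
FROM transactions t
JOIN orders o ON t.txn_id = o.txn_id;
```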
Rather than just talking, why don't you look at another data set? So can you open this bucketing data? How do you open it — with Excel, right? See, this is a very
interesting case study, because here you can appreciate bucketing more — and partitioning also; we have similar things. See, you are having street, and the street column has almost unique values. Then city — Sacramento; zip code; you have state — California; bedrooms, bathrooms, square feet. So this is, I believe, real estate data, and the type is Residential, Condo, and so on; a price, some price; and the sale date. This is the data you have. Now our intention is to do bucketing, but let's also do partitioning, just to see that we can do that. So what I want you to do is first upload this file to Hadoop. The file name is real_st.csv; that's the file. And I want you to open this workbook, so we are on the same page about bucketing in Hive. So what we're going to do: first we will create a table, and this is a normal table. I say create table — let's call it real_estate — and this has the schema; a regular schema, no partitioning, no bucketing, fields terminated by comma, all that stuff, no changes. And we will load the data. How do you load the data? LOAD DATA INPATH — so I guess you will be able to load the data. So first you'll create the table, then you load the data, and then you have to enable bucketing. Again, bucketing is disabled by default, so you just set hive.enforce.bucketing to true: the command is set hive.enforce.bucketing = true, and that will enable bucketing. Right. And look at this command; this is probably the most
table bucket underscore Table on I have the
schema partitioned by city clustered by street into
four buckets What do you mean by this
partition by city So city you have common
entries right Nothing Sacramento or something So I
will say that partition columnist city Then I
say clustered by street So street colon into
four buckets so it will create those many
city partitions Each partition will have for this
thing buckets but it is not mandatory Another
thing is that if a city has only
one line off data only one bucket will
be there If it has more data than
only the four buckets will make sense Some
of the cities will have only one street
makes a great only one bucket In that
case um on now just Lord it and
see whether you can see the buckets I
want you to try the severly how even
marketing is disabled now I want although should
try us said I'm just doing any certain
Once the insert is complete Go back Yo
you said hi warehouse for that You should
be able to find partitions and buckets So
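Putting the steps together — enable bucketing, create the partitioned and bucketed table, and insert from the staging table (the column list is an assumption based on the dataset description; hive.enforce.bucketing is needed on older Hive versions, and Hive 2.x enforces it by default):

```sql
-- Enable bucketed inserts and dynamic partitioning
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE bucket_table (
  street STRING,
  zip    STRING,
  price  INT
)
PARTITIONED BY (city STRING)
CLUSTERED BY (street) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Populate from the plain staging table loaded from the CSV;
-- the dynamic partition column (city) goes last in the SELECT.
INSERT OVERWRITE TABLE bucket_table PARTITION (city)
SELECT street, zip, price, city FROM real_estate;
```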
So I will even show you mine; I mean, we just want to see the buckets. Go to the Hive warehouse directory — /user/hive/warehouse — then my database, and what is the table we have? bucket_table. If I open bucket_table, it has the city partitions; if I open a city, there are four buckets. But again, bucketing also depends on the distribution of data. See here: the last two buckets are not having any data. If a city is having only one line, or just two lines, then only one or two buckets will be created; with very little data it won't create four buckets. Here I think you're having more — I don't know how many lines you have; based on that it created them — but the last two buckets are empty in this example. So ideally you should have more data; this is happening because we're having very little data — only some 1,000 lines we have, actually. But
the number of mappers — the job can run as either mappers or reducers. I'll show you here; in that property it is visible that it can be either a mapper or a reducer. Now, when we created the partition, he's saying: "I'm not able to see any reducer jobs; only mapper jobs ran." So see here: "maximum number of dynamic partitions allowed to be created in each mapper/reducer." So it can be a mapper job or a reducer job. If you're having a small amount of data, a map-only job will run; but sometimes, when you have a huge amount of data, the mappers will first move the data and the reducers will aggregate it into the different buckets or partitions. Now he's saying he's not able to see reducers in the job; I'm saying it can be either mappers or reducers when you're firing a partition query — it depends on resource allocation. If the cluster is having resources, it'll allocate them. I got only one mapper; mine actually ran with only one mapper. Yeah, it can be either a mapper or a reducer; doesn't matter, actually. To be honest,
what is going to happen is this: when you created that partition table and you do an insert, a MapReduce job runs. And what is that MapReduce job doing? That is my question. It just has to segregate the data — that is what it is doing. It is not doing any other logic; it is just segregating the data based on your partition and/or bucket column. So when you did this partitioning with bucketing, he is saying he got a reducer — with bucketing you are getting a reducer and a mapper. How many? One mapper. Even I got one mapper. So sometimes, if the cluster has enough resources, it will fire mappers and reducers so they can complete really fast: the mappers will first, probably, segregate based on partition, and the reducers will calculate the hash and dump it. But if the cluster is not having enough resources, this job will be very slow — maybe the mappers alone have to complete the whole thing. But there is no way of dictating to the cluster how to manage your partitioning process. The logic is written, but you cannot really say that it has to be done by a reducer or a mapper. Sometimes it will be achieved only using mappers; sometimes it'll shuffle and fire a reducer also. Yes — in my case it was done with only one mapper, in my memory; in his case I think four reducers ran; probably there was some shuffle operation; it depends on your data also. Another thing is that when there is a reducer, there is shuffling of data — the data goes to the shuffle first and then to the reducer. So probably in his case some shuffle operation happened. Anyway, it is not really important, I'm saying: while inserting into a partition table, it really doesn't matter whether a mapper is doing it or a reducer is doing it. It does not have any impact on the developer side, because it is just dividing by column and then moving the data. But in a query it is very important: if you write a proper query, how many mappers and reducers get involved — that's very important. Hive can
also store query output as a text file. So you write a query, and the output is normally shown on the screen — like, you write a query and it shows the result. Instead you can say: save it as an HDFS output, whatever the query is. The command is INSERT OVERWRITE DIRECTORY — there is a command called INSERT OVERWRITE DIRECTORY, and then you write your query. What will happen is that whatever the output of the query is, it will be saved as a text file on Hadoop. Normally you just want to see the result, but sometimes you want to save it; if you want, you can write it to a directory. It is INSERT OVERWRITE DIRECTORY, if I remember right — we rarely save results as text files. Okay. So you can try this command: INSERT OVERWRITE DIRECTORY, then you give a path, and then you write your SELECT. So let's try that. I will say, instead of just the query, INSERT OVERWRITE DIRECTORY '/abc', then select star from the table name that we created just now. So this means it will create a folder /abc in Hadoop, in HDFS, and the result will be stored there. If you want to store the result somewhere, you can do that. You can try this out.
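The command described looks like this — a sketch, using the /abc path and the bucket_table name mentioned in the walkthrough:

```sql
-- Saves the query result as text files under /abc on HDFS
-- (instead of printing it to the screen)
INSERT OVERWRITE DIRECTORY '/abc'
SELECT * FROM bucket_table;
```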
See, when you're creating a Hive table, it stores the data as a text file by default. Whatever data you're loading into a Hive table is available in Hadoop, and that will be in a text format — like CSV, or any text format. Over a period of years there have been a lot of compression techniques developed to compress the data, so while creating a table you can mention how to store the data. Now, I have some very important Hortonworks links I will share with you. Anyway, you're not going to see my screen, right — I'll just open my mail, because I had some things there I wanted to show you. I think this is in your assignment as well. It is very easy, I mean, storing the data; it is very, very easy — you just need to understand what is happening. So here it is: some Hortonworks links which I found very interesting. So, ORC is a format;
there is a Hortonworks link on it — "ORC files in HDP: better compression, better performance." So while creating a Hive table, in the last line you can add STORED AS ORC. Only one thing you need to do — just STORED AS ORC — and what is going to happen is that whatever data you're loading into that table, Hive will compress. ORC stands for Optimized Row Columnar format; it's a compression technique. What are its advantages? The first advantage is that your data is compressed — well, on its own that's not a great advantage. The second, and the most important, advantage of ORC is that it has indexing. There is a technique called indexing, right? Now, Hive by default has indexing, but that is very poor; if you enable an index you will not get any performance. What you are seeing with ORC is this: it'll compress your data, and on whatever data is in the compressed file it will create an index inside it. So when you hit queries — column queries and all — your queries will be really fast if the table has the ORC property. Column indexing — regular column indexes you have, right, in an RDBMS and all. What is indexing? In order to make queries faster, you create — how do I say — pointers to the original data. ORC will create an index on rows and columns as well: there is row indexing and column indexing; it'll create everything, actually, and it is very, very efficient. Apart from ORC, if you want, you can create a normal index in Hive — there is something called a bitmap index and a compact index — but they're very inefficient, actually; even if you create an index in Hive, you will hardly get any benefit from it in queries. But with ORC, since Hive itself is managing that compressed data, it will create its own index for the whole data and store it.
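As mentioned, the only change is the last line of the DDL. A minimal sketch (table and column names are made up for illustration):

```sql
CREATE TABLE real_estate_orc (
  street STRING,
  city   STRING,
  price  INT
)
STORED AS ORC;  -- Hive compresses and indexes whatever is loaded here
```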
Now I'll share this link But if you
scroll down, one of the things that they say here is this: if you have a 585 GB file in Hive and you store it as a text file, this is the size — 585 GB. There is a format called RCFile; it'll become 505. There is another format called Parquet; you're going down to 221. If it is ORC, it is 131 GB. That is the level of compression you are getting. So Parquet and RC are other compression formats. RCFile is not very common now, because ORC is the optimization of RC — O is "optimized," RC is "row columnar"; that's what you are seeing. So ORC is the optimized version of RCFile, and nowadays, instead of RC, some people use Parquet. So Parquet is also another compression technique, but Parquet does not have indexing or anything; it simply compresses your data, that's all. ORC has this advantage of, you know, indexing within it. Right — so that's one of the advantages.
Avro is a serialization format — how do I say it — let's say you are sending a lot of data; you can send the data along with its schema. Normally, where will you mention the schema? In a table. Now, Avro is a format which can apply a schema to files without a table or anything: along with the data you can have a header with the schema. So it is very common in Hadoop systems to use the Avro format. Okay. And if
you scroll down, this is how ORC will store the data. This is one thing you need to understand. It will create an index for, say, the first 10,000 rows — you can mention for how many rows it needs to create an index; it's a row and column index. So here I am saying I want an index for every 10,000 rows: for the first 10,000 rows it'll create an index and store it here; for the next 10,000 rows it'll create an index and store it there. So when somebody is querying, it can understand which group of rows it has to touch, and it can even skip a group of 10,000 in one shot. Say you're writing a filter condition: since it knows what is inside here, it can simply skip this entire 10,000-row group and scan over there. So it's very fast if you do it in ORC; the other formats don't have this ability. And there is something called vectorization in Hive; it's a property. Vectorization will allow you to read around a thousand rows in one shot. Normally, when you're reading the data, it reads one row, a second row, a third row. If you enable vectorization with ORC, it can read a thousand rows in one shot. Vectorization will work only on ORC — ORC is the only format that supports vectorization as of now — and this batch indexing, these properties, are available when you go to Hortonworks. Also, I told you that you can enable ACID properties in Hive — real-time queries. The condition is that the table should be ORC; then only will ACID work. So for all these reasons, people prefer ORC whenever possible. But what can be the one drawback?
These are all the advantages; but the drawback? Exactly: it has to be decompressed, meaning extra CPU cycles might be required. Anything which is compressed has to be uncompressed. Compression is okay while you're storing the data, but when you're querying, it has to uncompress the data, right? So CPU cycles are required. But still people prefer it, because they are okay to spend some money on CPUs; it is still very fast and gives you all these properties. So just go through this link — one link is this, okay, and try it. And this block of 10,000 rows with its own index entry is called a stride; the bigger block of data inside the file is called a stripe. Okay. And this is
the speed of queries on one terabyte of data. So this is Hive 1.0, this is Hive 1.1 — that's fine — and then Hive plus vectorized queries, and Hive plus PPD. PPD is predicate pushdown. You know what predicate pushdown is? Like sending your filter first, right. So if you write a query like this — let's say you are saying join the data and then filter the data — that's bad, right? First you filter the data, then you join. By default, in an RDBMS and all, it will bring the filter forward; it will optimize. In Hive, will it do this? You have this thing called PPD — predicate pushdown. So if you enable PPD along with ORC, the filter will come first, then the join will come. It will push the predicate, basically; that's the idea.
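As session-level switches, the two optimizations look like this (both property names exist in Hive; defaults vary by version and distribution):

```sql
-- Vectorized execution: reads rows in batches (roughly a thousand
-- at a time) instead of row by row; requires ORC tables.
SET hive.vectorized.execution.enabled = true;

-- Predicate pushdown: applies filters before joins/scans where possible.
SET hive.optimize.ppd = true;
```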
And I can show you this — I mean, I'm not just blabbering. If you look at our cloud lab — where is the cloud lab — if you look at Cloudera Manager, you can see whether these properties are enabled. Probably not all of them are enabled; that's one thing we can check. I'll go here, I'll go to Hive, and I will go to Configuration. I don't know — if I search for PPD... no, it is not showing. "Enable vectorization optimization" — okay, that is enabled. So this vectorization is the property by which, if you are storing the data as ORC, it can read a thousand rows in one shot; if you enable vectorization it can read thousands of rows in one go and then give them to you. So vectorization is enabled. PPD I am not seeing; I don't know whether it is enabled or not. If I search for "ppd" — it is not there; "predicate" — no, I don't see anything. Probably the property itself is not here; it is not available. Ultimately, when you
are reading in plain Hive, it will not get a batch; it'll get row by row. There is a default reader, and it is really slow, actually. When you're reading, the data is in a raw format, right — when you're creating the table you are saying row format, the delimiter. So let's say your query result, the output, is 10,000 rows. It will not push the 10,000 in one go: row one, row two, row three — like that it will come. It's actually very slow in original Hive. If you enable vectorization, it can read a minimum of a thousand rows at a time — in vectorization you can mention how many thousands, and whatever it is, those many will come in one shot. So it's very fast, actually, while getting that result. And vectorization requires ORC; no other format will get vectorization. There's a reader for vectorization which is available only in ORC as of now; I don't think any other format supports vectorization. So now the question is: how do you enable this ORC, right — Optimized Row Columnar format. So these things are very, very important. Okay, let me see if I
can show you from another link. So these are some of the things which you can go through. When you create a table, all you need to do is say STORED AS ORC: in the last line you add STORED AS ORC, and it will become ORC, and it will take the default values.
You can even disable compression; some people do it. So what if you don't have enough CPU power? You want to use the indexing and all, but you don't want to compress the data — you can set the compression to NONE, and it will not compress the data. So these are the properties you can mention in ORC. You can mention the compression: the default algorithm is called ZLIB — very efficient. You can say either NONE, ZLIB, or SNAPPY; those are the three options. ZLIB and Snappy are compression algorithms; if you say NONE, the ORC table will be created with no compression. Then you have orc.compress.size — the number of bytes in each compression chunk; the more you put, the more compression will happen. In this case it is 256 KB by default. Then orc.stripe.size — the number of bytes in each stripe, by default 256 MB. Each stripe holds those 10,000-row groups I told you about, and the size of the stripe is 256 MB; you can change that in your table. Probably your table has 1 TB of data; it gets divided into stripes of 256 MB, like that. And orc.row.index.stride is 10,000: for every 10,000 rows an index entry will be created. This you can increase or decrease. Compression size is the number of bytes in each compression chunk — that means when it is compressing, it creates something called chunks inside, and this is the size of each chunk; by default it is 256 KB, and it depends on the data as well. So we leave these things at their defaults; we don't change the compression size and all. The algorithm — ZLIB, Snappy, or none — you can change, and the stripe size is how it divides the data: here the stripe size is 256 MB, so 256 MB will be the size of one stripe, and in each stripe it will take every 10,000 rows and create an index — so probably in one stripe it will create like five or six index entries, depending on how many rows you are having. You are mentioning the stripe size, and within that, for how many rows one index entry needs to be created. Okay — I mean, these are the properties, just so you understand everything. And you don't have to mention all this: normally when you create an ORC table you just say STORED AS ORC, because the defaults take care of the compression itself. These are the values you can tune if you want; if you simply say STORED AS ORC, that sort of thing works. Yes — so if you say compression NONE, it will not compress the data. It will index it — it'll definitely index it — but no compression will be there.
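The tunables just discussed, spelled out in a DDL sketch (the four TBLPROPERTIES keys and the default values are as in the ORC documentation; the table and columns are made up):

```sql
CREATE TABLE real_estate_tuned (
  street STRING,
  price  INT
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"         = "ZLIB",       -- or SNAPPY / NONE
  "orc.compress.size"    = "262144",     -- 256 KB per compression chunk
  "orc.stripe.size"      = "268435456",  -- 256 MB per stripe
  "orc.row.index.stride" = "10000"       -- index entry every 10,000 rows
);
```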
It's not that text file is bad — for so long people have been using text files. ORC is an evolving standard, okay, and ORC has different versions and all; with every Hive release they will release a new standard for ORC. Text file is the standard format on its own; no one forces anybody to use a particular standard just because this one is better. So if you want, you can use it — that's the only thing. And you can't really read an ORC file: if you open the file you cannot read it, because it is compressed, and only Hive can create and read it. It looks like a zip file, but you won't be able to open it and see what is inside; only Hive can read it. Now, a question from the class: do you want
to set up a small set up or
like an actual production setup? For production, obviously, you need machines. And you may not be setting up only Hadoop and Hive — that's not how anybody does it; people will have the complete package, actually. When you say Hadoop and Hive, that's only two tools, right? So rather than setting up only Hadoop and Hive, it's like this: you have to have machines, then you go to one of the vendors — Cloudera or Hortonworks — get their distribution, and install it. If you are using plain Apache Hadoop, you have to first download the original Apache source, extract it, and there are four files — core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml — and in each of these example files you have to manually type the properties. It will probably take a year to complete, almost, and you will not succeed; even I could not succeed, because it is very difficult — nobody does that now. Earlier we used to do it; I used to set up Apache clusters, but it is very, very complicated, because everything is manual — there is no automation for anything. And there is no use for that, also; the only use is if you want pure open-source Apache. But the Hortonworks edition and the Apache edition are exactly the same — no difference at all. Yeah, even Cloudera and Hortonworks have a free edition. Okay, so since all these companies are selling open-source products, you can go to Hortonworks or Cloudera, download their product, and install it on as many machines as you want. The problem is you will not get technical support. In Cloudera
you have an edition called the Express edition. Okay — what is our cluster? Our cluster should be the Express edition, right? You can check it. So this is Cloudera, right; somewhere it should be there. I don't know — maybe I'll not be able to see the edition, because you don't have access to everything, right? Ah, see: Cloudera Express. This is free. So we're running on eight or ten nodes, right? This cluster is completely free; you don't have to pay a single penny to Cloudera. But if something happens, they won't help you — technical support is not there. The commercial edition is called Cloudera Enterprise. So when you're installing the cluster, it will ask you: do you want Express, or do you want Enterprise? If you say Enterprise, they will ask you to please browse and upload your license key; then only will the installation proceed, actually. So if you want to test or try something, any of this works — there are people who are running hundreds of nodes on the Express edition. No, the Hortonworks sandbox will not run in this manner; it's only a single-machine setup, that's what you are hitting. The sandbox is designed for one machine only; it is not designed for a distributed setup — it cannot be expanded. Even if you connect two of them, they won't work, because they're working as independent sandboxes. It's a virtual machine you can download and install on your PC; if you start it, it will have everything — Hadoop, Hive, Spark, everything — just to learn, like a single-machine setup. A production
setup is totally different — the installation and all — and that comes on the admin side; that is why I am not talking much about it. You may not be able to understand even if I show it, because there are a lot of other things apart from just Hadoop: you have to set up a lot of repositories, then the machines, the cabling and racking, and then the setup itself. Say you have a hundred machines: one will become the master, and on every machine you will use Cloudera Manager, or a tool like it, to download Cloudera's distribution. So first, on any new cluster, you download the Cloudera Manager installer, install it on one machine, and then start Cloudera Manager. It will immediately ask you how many machines you have; you say four machines, and you give it the address list. It will ask which one is your name node machine and which ones are the data node machines, so you select this, this, this. It will download the Hadoop distribution, install it, set it up, and give you the cluster. That is how this cluster was created, actually. No, there is no limitation — you can have thousands of machines — but people don't do that, actually. If it is a production setup, they go for enterprise licensing. But even the
Express edition is pretty stable. The Express edition does not have some features of the Enterprise edition — like, you have something called rolling restarts and rolling upgrades. For example, let's say you want to restart a machine in a cluster. Usually when you restart it, it'll cause some problems, because your name node will detect that this node is down. So sometimes in a Hadoop cluster you might want to restart all the machines — you applied some patches or something. And sometimes you want to upgrade the Hadoop version you are running, say from one release to the next; the upgrade will happen, but after that every machine should restart. If I do this in production, it is very difficult. In the Express edition there is no way around it — every machine will go down. In the Enterprise edition you have something called rolling restart, which means it'll restart the machines one by one without affecting the cluster; at the end of the day every machine has correctly restarted and your services were not affected. And you also have metering and backups — advanced backup and metering solutions; those are all available only in Enterprise. The Express edition will not have them, but all the main features will be there — Hive, Spark, or whatever tools you want, everything will be there. Commissioning: you can commission on the fly. You can add two machines and nothing will happen; the two machines will get added immediately and start to become part of the cluster — it's called commissioning. I don't know if I can show you here — I would not be able to add them — but you go to this Hosts menu, and there is something called Commission State: it says commissioned, 11. There are 11 machines running now. I can't add one, because I don't have the permission, but there would be an option called add machine here. So right now it says commissioned, 11 — as of now, everything is commissioned. I don't have admin rights; otherwise we could just click Add.
It would get added. Alright — this brings us to the end of this tutorial on Hive. Now, before you guys sign off, I'd like to inform you that we have launched a completely free platform called Great Learning Academy, where you have access to free courses such as AI, cloud, and digital marketing. So thank you very much for attending this session, and have a great learning.
