So Apache Spark is another kind of framework for doing big data processing distributed across clusters, like MapReduce.
The differences kind of come in how those computations are done. So for example, with Spark
you've got a lot more flexibility in the computations.
With MapReduce you've got to do map and then reduce; there's no way of getting around it. In Spark
they provide a load of different operations that you can do on the data, such as joins between different data structures.
Why would you use Spark? Its purpose is to process large volumes of data.
So it's mainly data that's not going to fit on a single node.
There are also computations over a large volume of data where you don't want to go through the data sequentially.
And if you've got parts of your computation that are independent of each other, so you can do them on the data items
individually, you can split that data across the cluster
and then
do the computations on each node. Exactly like with MapReduce, there's the data locality principle:
you do the computations on the nodes where the data is stored, and
then you reduce those results down to what you want.
The main programming structure that you're going to be dealing with is called a resilient distributed dataset,
which is usually shortened to RDD.
It's kind of a collection of objects spread across a cluster,
but as a programmer, when you're dealing with it, you're just interacting with it as if it's on a single node.
So the fact that it's distributed is kind of hidden from you.
In a Spark cluster you'll have a driver node
and then several worker nodes. The driver node runs the main program, which kind of has all of the
transformations that you want to do to your data, and these then get sent out to the worker nodes, which apply them
to the chunks of data that they have. In fact, the transformations can be similar to MapReduce.
So it still provides the same map and reduce functions, but then you have additional stuff on top of that. For example, they
give you a filter operation directly,
so rather than having to implement that yourself, you can just
call the filter function on an RDD and say I only want to return the objects for which this is true.
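For instance, a minimal sketch of that in PySpark might look like this; the app name, data and predicate are made up purely for illustration:

```python
from pyspark import SparkConf, SparkContext

# Hypothetical local setup, just to show the shape of the API
conf = SparkConf().setAppName("FilterExample").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Make an RDD from a local collection; to the programmer it feels like
# an ordinary collection even though it can be spread across a cluster
numbers = sc.parallelize(range(1, 101))

# filter returns a new RDD containing only the elements for which
# the given function returns True
evens = numbers.filter(lambda n: n % 2 == 0)

print(evens.take(5))  # [2, 4, 6, 8, 10]
sc.stop()
```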
So here we've just got a very, very simple Spark example of loading in a text file from the local file system, and
we're just going to go through and count the number of occurrences of each word.
This is exactly the same as the MapReduce example we looked at last time, but we're doing it in Spark this time.
Okay, so at the start we set up a SparkConf.
We just set the app name, which allows us to see which of our jobs is currently running within the web UI.
We then set the Spark master; because we're running this locally on a single computer, that's just local.
We then set up a SparkContext, which gives us access to the Spark functions for dealing with RDDs.
We first of all need to load our data into an RDD,
so we do this using the textFile function, and that puts the contents of that text file into an RDD. You can
kind of just view the RDD as like an array if you want to; it's like an array
distributed across the cluster.
So here we've got our lines RDD, where each element is a single line from the text file.
We then go through and split each line
using the flatMap function, which maps a single function over every single item in the dataset. So every line
we go through is split up into words, and because we're using flatMap, that takes us from an RDD of arrays
back to an RDD of strings.
Then, exactly the same as in the MapReduce example,
we use the map function
to map each word to a key-value pair, where the key is the word and the value is 1, indicating
we've got one instance of that word at that point.
That then gives us a new RDD, and on that one we call reduceByKey.
In MapReduce that would just be reduce, but here in Spark, if we just did reduce it would give us a
single value for the entire RDD back at the driver. reduceByKey takes
an RDD of key-value pairs, and
for each key you give it a function to apply to those values for how you want them to be combined.
For us, we want to add up the number of instances of that word that we have,
so we just use a simple + to
aggregate those values. That finally gives us our word count RDD, which contains key-value pairs of words and the number of instances
of those words. We then call the collect function, which brings that back to the
driver node, and then for each one of those pairs we print it out. So we write out the counts of all those words.
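Putting all of those steps together, a sketch of that word count in PySpark might look something like the following; the app name and input path are placeholders rather than the exact code being described:

```python
from pyspark import SparkConf, SparkContext

# App name shows up in the web UI; "local[*]" runs on this machine
conf = SparkConf().setAppName("WordCount").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Load the text file into an RDD: each element is one line of the file
lines = sc.textFile("input.txt")

# flatMap: split every line into words, flattening into an RDD of strings
words = lines.flatMap(lambda line: line.split())

# map: turn each word into a (word, 1) key-value pair
pairs = words.map(lambda word: (word, 1))

# reduceByKey: combine the values for each key with a simple +
counts = pairs.reduceByKey(lambda a, b: a + b)

# collect brings the results back to the driver, where we print them
for word, count in counts.collect():
    print(word, count)

sc.stop()
```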
So at the moment that code is written for something that might be on your own computer.
How would it differ if it was on a cluster, a server farm or a massive data center or something like that?
How would that vary?
So if you're running this on an actual cluster and not just on your local computer,
then
rather than setting the master within your code and setting it to run locally,
what you would do is have Spark running on the cluster, and you would use something called spark-submit to submit your Spark jobs to
Spark to then be run.
It's just a different way of running them,
basically, rather than hard-coding within your program where it's going to run.
The rest of the code would be the same.
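As a rough sketch, the only change to the code itself would be dropping the hard-coded master; the cluster address and file name below are hypothetical:

```python
from pyspark import SparkConf, SparkContext

# No setMaster here: the master is supplied when the job is submitted
conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

# ... same transformations as in the word count sketch above ...
```

You would then launch it with something along the lines of spark-submit --master spark://<cluster-host>:7077 wordcount.py, and the rest of the program stays exactly as it was.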
So the work that I've done with Spark has been using it to analyze large volumes of telematics data coming off of
lorries as they're driving around, and using the data from that to identify
locations where incidents are occurring, such as if they're cornering harshly or braking harshly.
Outside of research, what sorts of things is Spark used for? Yes,
so Spark is used quite a lot in the real world.
You would find a lot of companies using it to do large-scale jobs on all of the data that they have,
and it can be used for analysis or simply just processing that data and putting it into storage.
The good thing about distributed computing in clusters is that if you want to scale the program more, you just add
more
nodes to the cluster. So the point is, if you want to increase your processing power, you don't have to
buy new hardware in terms of replacing your hardware.
You keep your old hardware,
you buy new nodes and just stick them on the end, and you've instantly increased how much processing power you have.
So if you suddenly get a load more data that you need to be processing, you think, oh, this current cluster size
isn't great, and you can
then expand it and just add a few more nodes.
So the Spark program would then just scale automatically. Going back to RDDs, these are immutable data structures.
Immutable means they can't be changed, right? Is that right? Yes, yeah. So yeah, they're immutable; they cannot be changed once they're created.
You can pass them to other functions, but you can't change the contents of that single RDD.
So what the Spark program ends up being is kind of a chain of
transformations, with each one creating a new RDD and passing it on to the next function.
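As a small sketch of that immutability, reusing the SparkContext sc from the earlier sketches (the values are arbitrary):

```python
rdd = sc.parallelize([1, 2, 3, 4])

# Each transformation returns a brand new RDD; the original is untouched
doubled = rdd.map(lambda x: x * 2)
big_only = doubled.filter(lambda x: x > 4)

print(rdd.collect())       # [1, 2, 3, 4] -- still the original contents
print(big_only.collect())  # [6, 8]
```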
The advantage of RDDs is that they can be persisted in memory,
which means it's more efficient to reuse them later in the computation. One of the disadvantages of Hadoop MapReduce is that
you're writing everything to disk; after your MapReduce computation, if you want to reuse the result,
you've then got to go and get it from disk again.
Whereas with Spark you can just persist the RDDs in memory,
and if you want to come back to them later, you can do it really easily.
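A sketch of what that persistence might look like, again with a made-up input path and the same word count chain as before:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("PersistExample").setMaster("local[*]"))

words = sc.textFile("input.txt").flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# cache() keeps the computed RDD in memory (persist() is the more general
# form that lets you pick a storage level)
counts.cache()

# First action computes the chain and caches the result...
print(counts.count())
# ...later actions reuse the cached RDD instead of re-reading the file
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))

sc.stop()
```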
You're saying large volumes of data. Can we put some numbers on this? What are we looking at here?
So the volumes of data we're talking about can vary, I guess, depending on the company. It's probably ranging from gigabytes to terabytes,
and then for the biggest it just keeps going up basically.
