- Big Data Visualizations.
It's actually not just
a problem of quantity.
It's also a problem of quality,
meaning that even if you had
unlimited computing power,
it is very difficult to visualize millions
of data points in a meaningful way.
What are the problems?
Well, the first would be overplotting,
which occurs when some data points
occlude other structures in your data.
You can try to solve that by
decreasing your point size
and maybe revealing more
structures in your data
but there's a limit to how much
you can decrease the point size.
You can also maybe add a transparency effect,
and that may reveal the smaller, denser
distributions hiding in your data,
but it also makes your broader, sparser
distributions completely disappear.
Another problem is that if you can't plot
everything together, because it causes
too much occlusion, then you will have to sample.
But if you sample, you will probably be biased
towards the more popular,
more frequent distributions in your data.
But in visualizations you
also want to see the outliers.
So it is very difficult to navigate
this kind of visualization.
So today I'm gonna cover Datashader,
an open-source Python package
which aims to solve this kind of big data
visualization in a very clever, sophisticated way.
I hope you will enjoy it.
Creating interactive plots
from millions of data points
is a bit much for my personal computer.
So I'm gonna use Saturn Cloud today
to host my notebook on an AWS cloud server.
These machine specs should be enough,
and it's more than I have locally;
you can even go nuts here and get about
half a terabyte of RAM if you need it.
So you just host your notebook here
and from there on out it's
just a regular Jupyter notebook
with some environment already included,
very convenient in my opinion.
So today I'm gonna work
with the NYC Taxi Data
and it's about 11 million sample points:
the coordinates of taxi pickups
and drop-offs in New York City.
And this will be our
visualization challenge.
In addition to Datashader,
I'm also gonna use Bokeh
to make the plots interactive.
If you don't know Bokeh,
I've already made a video about it.
You're more than welcome to take a look.
Okay, so first we just define general parameters
for the Bokeh plots, and we define a general
function to make an arbitrary Bokeh plot.
Disclaimer: most of this code was already
beautifully written in the official Datashader
examples repository, and I've just made very small
changes to make the notebook more readable.
Okay, so the first
approach would be to sample
a thousand points from
the 11 million points
and take a look at the
general structure of the data.
So we add the map tile,
which is a cool feature that places the
coordinates on the real New York map.
So this is how it looks, and you can see
that most of the points are concentrated
in this region, probably because these are
the most popular pickup and drop-off points
for taxis in New York City.
But I doubt this
representation really shows
how the data behaves and what's really
in these 11 million sample points.
So this is a problem of undersampling,
where there is a bias towards
the most common points.
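To put a number on that undersampling bias, here's a quick back-of-the-envelope calculation. The cluster fraction below is an invented illustration, not a figure from the taxi data:

```python
# Suppose a small but real cluster (say, one odd pickup spot) makes up
# only 0.05% of the 11 million points. On the full map it still has
# thousands of points, but a 1,000-point sample is expected to contain
# only about half of one of them, so the cluster effectively vanishes.
total_points = 11_000_000
cluster_fraction = 0.0005  # invented fraction for illustration
sample_size = 1_000

points_in_cluster = total_points * cluster_fraction   # ~5,500 points
expected_in_sample = sample_size * cluster_fraction   # ~0.5 of a point
```

So anything rare enough, like the outliers we actually want to see, is almost guaranteed to be missing from a small sample.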
The next natural step would be to try
and plot more points to see
more structure in the data.
So let's try that. I'm gonna plot 10,000 points now,
and we really can see new points in different areas
of New York for pickups and drop-offs,
but the Manhattan area already seems overplotted,
like a giant blob of points.
And now we can use Bokeh's interactive abilities
to zoom in a bit, and zooming in helps;
we can see the points a little bit better now.
But here's the question: everything already
seems very uniform inside Manhattan.
Are there no distinct distributions inside Manhattan?
It's very difficult to tell now.
And if we make the dots
smaller or more transparent,
then we will completely lose
sight of the other data points
outside of Manhattan.
So we can already see the
problem of overplotting.
Let's take a more extreme example,
and I remind you that was about a tenth
of a percent of the data:
10,000 samples from 11 million.
Now let's plot 1% of the data.
So this takes a little
bit more time to load.
So Bokeh already scaled things down a bit
in order to fit all the points,
and we can already see the problem.
Because there are so many points in Manhattan,
the points were made much smaller,
and in the very dense areas of Manhattan
maybe we can see some of the data points,
but in other areas they're practically invisible.
Maybe I can see them on my screen, but you
probably can't see the points in the video at all.
So that's a problem.
Let's see how Datashader
can help us solve it.
Datashader approaches this problem
in a very innovative way.
So the trick here is not just
to project the data points,
but to aggregate them in a way
that will make the most use
of your display capabilities.
So instead of just projecting the data points,
we're no longer focusing on each individual point;
we're focusing on how to visualize the pixels.
So if we aggregate in terms of a simple count,
a certain pixel gets the value of the count
of the data points which were projected
onto that specific pixel.
You can look at it as the best-resolution heatmap
you can get for your projection on your display.
At least I hope I understood it right.
And you have a lot of freedom in choosing
that aggregation function.
The most natural one would be
to count the number of points
that landed in that specific pixel.
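To make that concrete, here's a toy, plain-Python version of the count aggregation, roughly what Datashader's Canvas does with ds.count(), though the real library does this in optimized compiled code. The grid size and points below are invented:

```python
# Toy count aggregation: bin each (x, y) point into a pixel of a small
# grid; each pixel's value is the number of points that landed in it.
def aggregate_counts(points, width, height, x_range, y_range):
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point falls outside the canvas
        # map coordinates to pixel indices, clamping the upper edge
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid

pts = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9)]
grid = aggregate_counts(pts, width=4, height=4,
                        x_range=(0, 1), y_range=(0, 1))
# the two nearby points share one pixel; the third gets its own
```

However many points you have, the output never grows beyond one number per pixel, which is why occlusion stops being a problem.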
Let's see how it works.
So we import Datashader, and then we create
a Canvas, like in every plotting library,
and then we specify how to aggregate the points.
We define points from the DataFrame
and specify the columns which contain the coordinates.
Then we choose the aggregation function,
which, as I stated, would most intuitively be
to count the number of occurrences in a certain pixel.
And then we have to shade the actual result
according to the aggregation function,
basically just coloring the image,
and you can choose in what way to map
the values to the colors.
So if you specify "linear",
the highest-frequency pixels will be dark blue,
the least frequent will be white,
and the transformation between them will be linear.
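In plain Python, that linear mapping amounts to something like this. It's a simplified stand-in for what tf.shade(agg, how='linear') does, using a single intensity channel instead of a colormap and an invented grid:

```python
# Toy linear shading: map each pixel count to a 0-255 intensity
# proportional to the maximum count.
def shade_linear(grid):
    max_count = max(max(row) for row in grid) or 1
    return [[round(255 * c / max_count) for c in row] for row in grid]

# A skewed grid: one Manhattan-like pixel dominates everything else.
grid = [[1000, 3], [2, 0]]
img = shade_linear(grid)
# the low-count pixels land at intensity ~1, nearly invisible next to 255
```

With counts this skewed, almost the entire 0-255 range is spent on the single busiest pixel, which is exactly the failure we're about to see in the plot.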
So let's see the result.
So now we're plotting all 11 million points,
and it looks like this.
It doesn't look good; I'll explain why.
But let's see, we do see
the most dense points inside Manhattan,
which we couldn't see before.
And the dots are really small
and we can't see any other points
in other areas of New York.
But let's understand the
key differences here.
We're plotting all the points,
and the point size was selected automatically.
Same goes for the transparency.
This is something we couldn't do with Bokeh;
there, we had to choose those things manually.
So the problem is that
this linear transformation
is a bit too aggressive
and it really diminishes
all the least frequent values.
We can take a look at the histogram
and see that this is exactly the case.
So we probably have many pixels in Manhattan
which have high counts,
but for other areas the counts are much lower,
so most of the linear range is wasted
on those high-count pixels.
So let's see how we can improve that.
A common solution would be to just
equalize this histogram;
those of you with a background in image processing
know this operation quite well.
We equalize the histogram to give more emphasis
to the less frequent values.
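Datashader calls this shading mode eq_hist. The idea can be sketched with a simple rank-based equalization in plain Python; this is my own simplified version, not the library's implementation, run on the same kind of invented skewed grid:

```python
# Toy histogram equalization: map each count to its rank in the
# distribution of nonzero pixel counts, so rare low counts still get
# a visible share of the intensity range.
def shade_eq_hist(grid):
    flat = sorted(c for row in grid for c in row if c > 0)
    n = len(flat) or 1
    def intensity(count):
        if count == 0:
            return 0
        rank = sum(1 for v in flat if v <= count)
        return round(255 * rank / n)
    return [[intensity(c) for c in row] for row in grid]

grid = [[1000, 3], [2, 0]]
img = shade_eq_hist(grid)
# with linear shading, the counts 2 and 3 would both be squashed to ~1;
# here they get clearly distinguishable intensities
```

The intensity now encodes "how many pixels are this busy or quieter" rather than the raw count, so the sparse outer boroughs stay visible next to Manhattan.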
And now if we visualize the plot,
it looks much, much better.
Now we finally can see a
representation of every point
in the data, so we don't
lose the less frequent data
of other parts in New York
and we can also see the high
density drop offs in Manhattan
and now for the first time,
we can see something
very interesting here,
we can see that these areas
here are extremely smudged
with a lot of noise and there's
actually a reason for that.
There are many tall buildings here
which disrupt the GPS accuracy
and just try to imagine how you would have found
that this pattern occurs here if you didn't have
this quality of visualization to make it pop out.
Now for the coolest part of this video,
did you notice how fast
we rendered this image?
It's almost instant.
It takes about a second to make
this kind of Datashader image,
because the backend is highly optimized
for producing these images.
And now maybe we can do this kind of rendering
in real time using Bokeh's interactive plots.
So what we'll do is improve the visuals a bit:
a black background with a better colormap.
Then we'll make a function which creates
these Datashader images
and pass that function to InteractiveImage,
which means that when we zoom in, a new image
can be rendered for each view we request.
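The mechanics of that zoom-driven re-rendering can be sketched like so: on each zoom, the plot calls back with the new visible ranges, and the image is re-aggregated from only the data in view. This is a toy plain-Python stand-in with invented points, not the real InteractiveImage API:

```python
# Toy zoom callback: re-aggregate counts for whatever ranges are visible.
def render(points, x_range, y_range, width=2, height=2):
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if x0 <= x <= x1 and y0 <= y <= y1:
            col = min(int((x - x0) / (x1 - x0) * width), width - 1)
            row = min(int((y - y0) / (y1 - y0) * height), height - 1)
            grid[row][col] += 1
    return grid

pts = [(0.1, 0.1), (0.3, 0.3), (5.0, 5.0)]
full = render(pts, x_range=(0, 10), y_range=(0, 10))    # two points share a pixel
zoom = render(pts, x_range=(0, 0.5), y_range=(0, 0.5))  # zoomed in, they separate
```

This is why zooming keeps revealing new structure: the same number of pixels is re-spent on a smaller and smaller slice of the data.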
So this is the same image
just with better colors.
And now when we use the zooming option,
we can specify a region and zoom into it,
and a new Datashader image will be made
using only the values in the visible range.
So now we can see the
hotspot in this area,
which we couldn't have seen
when Manhattan was in the picture.
So this is great.
At this resolution, we can already see
the individual points and, as you can see,
they're represented as pixels and not as points.
So let's zoom out.
Now the more points we include,
the slower the computation.
Now we can maybe focus on this dense area
of Manhattan and get a better look at it.
The Datashader image will now be rendered
just for this area, and we get a better,
higher-resolution view of what's going on here;
we can see that even inside this dense area
there are denser regions for drop-offs.
So this is really great.
Let's add another layer
of visualization to it
and make this kind of interactive plot
with the actual map of New York.
So now we can add the area names
to our insights and explore a bit.
So maybe if we zoom in here,
we can see there's a kind of hotspot,
and then we can zoom into that.
And we can see there is a cemetery here.
So maybe this is where people park for those
unfortunate events when you have to attend a funeral;
now we actually see it very clearly.
And there is a certain area here in this region
(maybe you guys from New York know what's here),
but it seems there are a lot of pickups
and drop-offs in this specific area.
And what more can you ask for?
You can analyze each region
at whatever resolution you want,
and on this 16GB RAM machine it renders quite quickly
and gives a very good representation of the different
densities of data points on this map.
One last piece of eye candy, just to show you
how far we can push the quality
of these visualizations.
You can also stack plots with different
tasks on top of each other.
So what we're gonna do here is color in red
the places where taxi pickups were more frequent
than drop-offs, and color in blue the opposite.
We're gonna overlay these images on top of each other,
use the same InteractiveImage,
and see the result.
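The pickup-versus-drop-off overlay boils down to aggregating the two point sets separately and comparing them per pixel. Here's a toy plain-Python sketch with invented counts; the real notebook shades and composites actual Datashader images:

```python
# Toy two-category comparison: per pixel, red if pickups dominate,
# blue if drop-offs dominate, nothing if they tie.
def classify(pickup_grid, dropoff_grid):
    out = []
    for p_row, d_row in zip(pickup_grid, dropoff_grid):
        out.append(['red' if p > d else 'blue' if d > p else 'none'
                    for p, d in zip(p_row, d_row)])
    return out

pickups  = [[10, 2], [0, 5]]   # invented per-pixel pickup counts
dropoffs = [[3, 2], [4, 5]]    # invented per-pixel drop-off counts
colors = classify(pickups, dropoffs)
```

Each pixel now answers a question about the relationship between two distributions, not just the density of one.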
So this is what we have,
and now we can see it shows a different pattern.
These vertical lines seem to be better
for getting picked up by a taxi,
and these horizontal lines are more
frequent for drop-offs.
And maybe in a different area here,
you can see that these main roads
are more popular for taxi pickups.
Maybe they are the main
streets in this area.
I'm not from New York, so I don't know.
But it seems like it: these are the main streets
in this area, and these areas here
are more for drop-offs; probably a more residential
area versus a more commercial one.
So this is a new way to look at this map
and gather more insights
about what's going on here
with the pickups and drop-offs.
So this is amazing, and I really don't see
how you could have gotten insights of this quality
in any other way than by actually using
all 11 million sample points
and seeing every one of them
represented in this visualization.
Can you believe that Datashader
is an open-source package?
It's pretty amazing what
developers contribute
to the open source community in Python.
The synergy with other packages
like Bokeh and Pandas is great;
everything works seamlessly.
Thank you guys.
So I hope you've enjoyed this video.
I make this kind of video every week.
Please tell me what you think in the comments,
if you have anything you
want to see, please ask me.
This video is a request
from one of you guys.
I want to make this content as relevant
as possible to you and I
hope I'll see you next week.
