It's a very good tool. I really like
visualizing things, just to explore our
data. There's something that's special
about visualizing things. What I'm
showing today is kind of an end-to-end
pipeline: how do we parse the packages
people use out of the raw source code?
We're only going to look at, and only
have available, people's kernels that
are public. So
we're gonna parse that with regular
expressions, save the output in a table in BigQuery,
and then we're gonna use the GCP
integration in Kaggle kernels to
access that data in the table we
just created. And then we're gonna do a
cross join in SQL to analyze and
visualize these packages as a network,
and see what packages are
related to each other. So what we have
here is that everybody's
kernel is output in a file format we
call a blob in GCS, and that's exported
into BigQuery as a really big table.
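That parsing step, regular expressions over the raw source, can be sketched minimally like this (a toy pattern of my own, not Kaggle's actual regex):

```python
import re

# Toy import-parser: pull the top-level package name out of Python
# "import x" and "from y import z" lines. Kaggle's real pattern is
# more thorough; this only illustrates the idea.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE)

source = """
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
"""

packages = sorted(set(IMPORT_RE.findall(source)))
print(packages)  # → ['matplotlib', 'numpy', 'pandas']
```

Run over every kernel blob, rows like these are what end up in that big table in BigQuery.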
However, that's really hard to aggregate
at the analyst level. So what I did here,
internally at Kaggle with some teammates, is write this regular expression to
parse how people write their
import code. We try to discover new
packages that come up in Kaggle kernels. And if some packages are used very
frequently, we don't want people to
have to install those packages all the
time, so they get built in
automatically through our Docker image
pipeline. In this analysis I'm interested in
not just the frequency of each package
that's used, but also how these packages
are linked to each other. I basically
want to see a cool network. So the first
step is to make sure that our Kaggle notebook
is connected to GCP. I have my GCP
credentials linked to the notebook.
This is how you can connect to all these
different services that we have GCP
integration on. So what I do here is
run each cell. The first cell is just
boilerplate. The second sets up my
project ID and a BigQuery client.
The third one-- I already wrote my query
and I'm gonna go through that, but I'm
gonna also show that in the BigQuery
interface. It looks like a slightly
more complicated query. What
it does is first pull up the
Python packages that I just derived,
so that's fresh. And then the interesting
thing is we want to find relationships
between packages. It's basically a k choose 2. So we define
that package A and B have a relationship
if they both coexist in the same kernel.
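A toy version of that self-join idea, using stdlib sqlite3 instead of BigQuery (the table and column names here are my own, not the real schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE imports (kernel_id TEXT, package TEXT)")
con.executemany("INSERT INTO imports VALUES (?, ?)", [
    ("k1", "numpy"), ("k1", "pandas"), ("k1", "matplotlib"),
    ("k2", "numpy"), ("k2", "pandas"),
])

# Inner-join the table with itself on the kernel ID. The
# a.package < b.package condition keeps each unordered pair once,
# giving the "k choose 2" package combinations per kernel.
pairs = con.execute("""
    SELECT a.package AS p1, b.package AS p2, COUNT(*) AS weight
    FROM imports AS a
    JOIN imports AS b
      ON a.kernel_id = b.kernel_id AND a.package < b.package
    GROUP BY p1, p2
    ORDER BY weight DESC
""").fetchall()

print(pairs)  # most frequent pair first: ('numpy', 'pandas', 2)
```

BigQuery's SQL differs in dialect details, but the shape of the join is the same.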
The way that's done is an inner join
of the table with itself, joined
on the kernel blob ID. That
creates the exhaustive
combinations of two packages sharing the
same kernel blob ID, and I have them
grouped by package 1 and package 2.
The top pair, package 1 and package 2, is numpy and pandas:
they're the most frequently
appearing pair of
packages. And then matplotlib and pandas. Basically this gives us a really
good idea of how often each package pair appears across all of
the public kernels on Kaggle. And note
that BigQuery is very fast, and the
data's not small. This would usually run a
lot slower on a normal
transactional database; switching to
BigQuery has made this kind of analysis
much faster. Back to our notebook. This
this notebook already has that query
that I ran a moment ago. And we're
gonna load this data into a data frame,
because we're data scientists and we like
notebooks. So I just ran that query, and
you can see where that is. And now,
visualization. The most famous package
is called networkX, and it looks pretty
good, so I'm gonna show it first. But
two cells below I'm gonna show this
other package called pyVis, which I
quite like; I discovered it for
this analysis. So let's
use networkX first and show the plot.
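A minimal sketch of that graph-building loop, assuming a networkX `Graph` and made-up pair weights in the same shape as the query result:

```python
import networkx as nx

# Toy pair weights; names and numbers are made up for illustration.
pairs = [
    ("numpy", "pandas", 500),
    ("numpy", "matplotlib", 300),
    ("pandas", "matplotlib", 250),
]

G = nx.Graph()
for p1, p2, w in pairs:
    # Each package becomes a node; the pair count becomes the edge
    # weight, i.e. how strong the bond between the two packages is.
    G.add_edge(p1, p2, weight=w)

print(G.number_of_nodes(), G.number_of_edges())  # → 3 3
```

Calling `nx.draw(G, with_labels=True)` then renders the network; with the real data you'd likely cap it at the top hundred or so pairs first, since the full graph is very busy.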
What this one does is build a graph.
You can see the graph here. It is very
busy; I can reduce it to
maybe just a hundred nodes. What
it does is go through our data frame,
which looks like this, and for each pair
add package 1 and package 2 each
as a node, with the weight as the relationship
between them. So there's a node
called numpy, a node called pandas,
and the weight between them is like
the bond between them: the higher the
weight, the stronger the bond between
the two nodes. So now we enter pyVis.
pyVis is compatible with
networkX, so it's very similar. I
basically define a graph, and since I already
went through each row, adding the packages and the
weight, I can now just
convert that graph into the pyVis
format. It's a little more
interactive: the size of each node is
related to how often the package is used.
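The sizing and hover logic boils down to something like this pure-Python sketch (pyVis stores these as node attributes; the names here are my own, not pyVis's API):

```python
from collections import defaultdict

# Made-up pair weights, as produced by the pair-counting query.
pairs = {("numpy", "pandas"): 500, ("numpy", "scipy"): 120,
         ("pandas", "matplotlib"): 250}

# Node "size" ~ total weight touching the node (how often it's used);
# "neighbors" is what the hover title shows for each package.
size = defaultdict(int)
neighbors = defaultdict(set)
for (a, b), w in pairs.items():
    size[a] += w
    size[b] += w
    neighbors[a].add(b)
    neighbors[b].add(a)

print(size["numpy"])               # → 620
print(sorted(neighbors["numpy"]))  # → ['pandas', 'scipy']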
So that's cool. We want the larger nodes to show up. And second we want to see the
neighbor. So this is-- this is that title
line that I just added. So you hover over
numpy and it tells me the neighbors of
numpy being scipy, scikit-learn, xgboost,
tqdm, and stuff like
that. It's not just one thing it's
combining a kind of a range of tools in
one task.The more basic part is
how powerful BigQuery is and moving on to later is really utilizing, or like
harnessing, what BigQuery can produce in
a very short time to visualize in a
Python package of the results got
from BigQuery.
