- This is the second part of our
three-part workshop series,
Introduction to Data Analysis
for Aspiring Data Scientists.
This is part two,
data analysis with pandas.
And next week we're gonna
have the third workshop
on machine learning and
that's next Wednesday,
the 22nd at 10:00 AM
same time as this week.
And then also just to
call out for part one,
that was recorded last week.
That was Introduction
to Python and Databricks
and the video is available
through our YouTube channel.
So that's a kind of shortened link there.
I'll also drop it in the chat
for everyone to easily access.
So just a quick note
for everyone joining us,
the best way to stay up-to-date
on all of our content,
all of our workshops,
our interviews with different
engineers, developers,
and also Tech Talks is going
to be joining our Data + AI
Online Meetup group.
There's a link to that, again,
in the follow-up materials,
and I'll also link that as well.
So don't worry about trying
to copy it down right now.
And then also, I know that
there's a lot of folks joining us
from YouTube, which is really awesome
and really exciting to see.
So just a note there to
subscribe to our channel
and to make sure you're
turning on your notifications.
So what we try to do is post
the upcoming livestream at least
a couple of days in advance,
whether it's a Tech Talk or a workshop,
so you can set a reminder for yourself
if you'd rather watch it through YouTube.
And so here are two links to prep everyone.
If you joined through our Meetup group,
you should have received
a message yesterday from me
encouraging you to sign up for
Community Edition in advance.
So while I have you all on the intros,
check out those links and
that's what you're gonna need
to be prepared for this workshop today.
I think we have a little time
to do some troubleshooting with the TAs
if there are any issues,
but sign up for Community Edition.
I think maybe a lot of you already are
if you joined us for
the workshop last week.
Just to call out, these are
the two links to check out
for today's workshop.
And then just a note
before I let our TAs
and instructor Conor introduce themselves:
this session is recorded,
so don't worry if you would
like to revisit it at a later time.
We'll make sure you also
get all the resources.
If you join through
Zoom, you'll get an email
within 24 hours that has again,
the links to everything that you need.
So the link to the video
recording, that GitHub link,
the link to the Meetup group,
also the YouTube channel
so you can check out
everything we have coming up.
Some folks are using the chat
just to kind of call out
what schools you're from,
where you're dialing in, which is awesome.
If we could direct questions
to the Q&A feature,
that'd be perfect.
And that's where we're gonna have
our TAs answering questions.
And then if there's any time at the end,
Conor and Brooke and Chenglyn
will also answer a few questions
that they think might be
helpful for the entire group.
So without further ado,
Brooke and Chenglyn,
if you could just do a quick introduction
and then we'll go ahead
and get started here.
- Sure, my name is Brooke Wenig.
I am the machine learning practice lead
and I'm super excited to get
to work with my team members,
Conor and Chenglyn on this webinar.
- Hi, I'm Chenglyn.
I am on the machine learning practice team
working with Brooke.
I'm a data science consultant.
Some of you might recognize me
from our previous webinar,
but I did (mumbles) trainings
and also help customers
design solutions.
- Awesome, thank you.
So now we'll pass it over to Conor,
who is our instructor for today.
So, Conor, take it away.
- Perfect, thank you.
Yeah, so my name is Conor Murphy.
I work closely with Chenglyn and Brooke
on our machine learning practice team.
And so that's a hybrid team
where we are consultants
for machine learning and
data science projects.
And we also do some instruction as well.
And so, great.
Let's go ahead and get started.
Let me go ahead and share
my screen and so let's see.
Let me just make sure
this is the right screen.
There we go and hopefully
everybody can see this okay.
And so to go ahead and get started,
let's first make sure that
we have access to Databricks.
And so you can either go to databricks.com,
and in the upper right hand corner
you can click Try Databricks
and you can sign up here
or if you were with us last
week, instead of doing that,
you can go to
community.cloud.databricks.com
and you can access that here.
And so I'll just give everybody a minute
to go ahead and get started there.
And once you have signed into there,
go ahead and drop a
quick message on the chat
so we know that you've
gained access to it.
And if you have any
challenges along the way,
feel free to post in
the Q&A channel as well.
Cool, it looks like a couple
of people have gotten in.
Awesome, we'll just give
it a couple more minutes.
So it looks like a lot of
people have access to it.
So why don't we go
ahead and keep on moving
and then if you have any
challenges along the way,
just feel free to post that in our Q&A
and we can go ahead and
have our TAs help you out
throughout that process.
But once you're signed in,
you should see a landing page
that looks something similar to this.
So if you were with us last week,
this should look pretty familiar.
If you weren't with us last week,
that's totally fine as well.
So this is the Databricks environment.
Among other things, it
lets you work in notebooks
that execute code against clusters.
And so those clusters are just resources
that we're going to create for you
so that you have somewhere
to actually run your code.
So instead of running this
code locally on your laptop
or your local computer, you'll
be running this in the cloud.
And so you'll be running this
on Amazon's infrastructure.
So on the left hand side of the screen,
you can go ahead and
click the clusters tab.
And then you can click, create cluster.
If you were with us last week,
you might see a cluster
that's already here that
just needs to be restarted.
You can click, create cluster here.
You can give it a name, you
can call it whatever you want.
I'm just gonna call this one pandas.
And then I'm gonna click, create cluster.
The rest of these configurations
should be totally fine.
So I can click Create
Cluster, like that.
And then in a moment it'll
appear here on the screen.
And I'll give you guys just a
second to catch up with that.
And then go ahead and
let us know in the chat
if you've been able to
spin up a cluster okay.
So it looks like somebody asked
if they can use the
previously created cluster.
Yes, you can use that cluster.
You just need to restart it.
- [Brooke] Sorry, actually
that's not enabled
on Databricks CE.
It's enabled in the enterprise version.
This one you will have to
recreate from scratch.
- Got you, thank you, Brooke.
- [Brooke] Yeah.
- Do I have to register my credit card?
No, you shouldn't need to register
your credit card for this.
So this is a completely free offering
that just allows you to run code
within the Databricks environment.
Cool, so it looks like many of
you have this up and running.
And so the next step is
to import the notebook
that we're gonna be
using in today's lesson.
So if you click on the home button up here
in the upper left hand corner,
and then click this dropdown here,
you can then click Import.
And so this will allow us
to import that notebook.
So once you click import, click from URL
and then, let me paste this link here,
you can import from this endpoint here.
And so I'm just gonna paste
that in there, click import.
And then in a second we'll see
a new notebook appear here.
And so it should look something like this.
And once again, I'll just
give you a few moments
to take care of that.
And just drop us a note once
you have that notebook imported
and your cluster started.
It looks like some people, their
clusters are still pending.
That's totally fine.
So those should get
started relatively soon.
And so normally it takes, I don't know,
maybe one to three minutes
or so to start that cluster.
And so if I click this dropdown menu here,
I'll see this cluster that
I have available to me
and I can see by the green light
here that it is running.
And if you have a spinning wheel there,
that just means that
it's still starting up.
And so this is a free resource.
This is what's actually
going to be running the code
that we'll be executing
from this notebook.
So we have about 15 and
a quarter gigabytes.
We have two cores, and the
rest of these configurations
are just the version of the
software that we're running.
So from that dropdown menu, I
can just click on that cluster
and then I will have attached
this notebook to that cluster.
So now I can execute code on that cluster.
So cool, let's go ahead and get started.
So today we're gonna be
introducing data analysis
using pandas.
So we're gonna start
off, we're gonna motivate
why pandas matters.
Talk a little bit about
what it is and its history.
And then we're going to go
through at a high level,
how to use this package.
So we're going to be using
the COVID-19 dataset.
And so this is going to
be available for you.
There's also a lab after the fact.
And so you can go through,
do your own analysis,
and drill down into your
country or your state
or your area and get a sense
for what those trends are.
And at the end,
we're gonna be doing some
visualization as well.
I also wanted to give you a few resources
that you can use after the fact.
And so you have a couple of links here.
First, this is a cheat sheet.
So this is a cheat sheet provided
by the creators of pandas.
And it's a really effective
tool for jogging your memory
when you're trying to
remember how to do something.
And so to give you a sense
for the basic syntax,
some of the basic operations
that you can do on data,
some of the filter operations,
how to handle missing data,
that sort of thing.
And so that cheat sheet
is incredibly helpful.
I also linked the pandas
docs here as well.
And so the pandas documentation
looks something like this.
And this is going to be the
main reference that you use
if you ever get stuck on anything.
So there're a couple of
helpful getting started guides,
so you can always check these out.
You can go through these independently.
There're some video
resources on here as well,
but this should give
you just more resources
that you can be more successful.
And then if you need to look
at the API reference docs
as well, you can see that here.
And so this will give
you a lot more detail
on each and every one of
the functions or methods
that we'll be using in the next hour.
Cool, so let's get started.
I've taught a fair amount of
intro programming in my career.
And one way that I always like to start
is just with the insanely,
insanely big picture.
And so first off, just
motivating why do you use pandas?
Well, one of the reasons why humans
are such a successful animal
is because of our use of tools.
And computers are arguably
one of the most powerful tools
we've ever created.
And so if you can truly write code,
you can unlock the full
power of those tools.
And I always find that to
just be incredibly exciting.
And so I like to share that with anybody
when I first start teaching
introductory programming.
So that's a little bit of motivation
for why writing code is so effective.
But now why would we wanna
use pandas specifically?
And so pandas is a data munging library
available in the Python environment
or within the Python programming language.
And so more and more industry trends
are showing that data is
leading the decision making.
And so you can use similar tools.
You can use something like Excel instead.
However, pandas allows certain benefits
beyond something like
spreadsheeting tools.
So for instance, if you
wanna automate your analysis
so that it reruns on any new data each day,
you can find ways to do that within Excel,
but it's generally a
little bit challenging.
And so code-based tools such
as pandas allow you to do that.
If you want to, say, build a code base
and share it with your colleagues,
you can do that using Python
and using pandas as well.
If you want more robust analyses
that feed the decision-making
processes of your business,
pandas allows you to do that,
and the same goes if you wanna
do machine learning as well.
And so the next course in the
series, part three, next week,
we'll be focusing more on machine learning
and it'll give you an
introduction to how to do that.
And so if you're a data scientist
and if you're a data
scientist using Python,
you're likely going to be
using pandas quite a bit.
And so pandas is going to
allow us to manipulate the data
that we need so that we can feed it
into machine learning models
so that we can do visualizations,
so that we can calculate reports,
so that we can do all of those things.
And so pandas is absolutely
an essential library
for anything that you do as a data analyst
or data scientist within Python.
So that's pandas.
So at a high level, what is pandas?
So it's an open source library.
So open source means that
anybody has access to it.
You can use it free of charge
and you can see what the
underlying source code is.
So basically it's a bunch of code
that somebody else wrote for you
and you can use that in
order to manipulate data.
So it's used for data
analysis, data manipulation,
and it's built on top of Python.
So it started back in 2008
and it was fully open sourced
back in 2009.
I'm not gonna go through
each one of these different
benefits of using pandas,
but you can see
that there're quite a number
of different functions
or number of different
features within pandas
that can be incredibly helpful.
And so we're mainly gonna be interacting
with this DataFrame object.
So this would give you
the data in a format
that you might expect, right?
And so it's a tabular data format,
somewhat similar to a spreadsheet,
but it allows you to have a
lot more powerful functions
than you would find within a spreadsheet.
So it has optimized tools
for reading and writing data.
It has tools for dealing
with missing data.
You can reshape it, you
can pivot, you can index,
you can do all sorts
of really interesting,
really helpful things.
There's also some time
series functionality
and there's also some visualization
functionality as well.
And so we'll be looking at
many of those features today.
And then if you're interested,
there's a book here
that you can check out as well.
Let me just open it here.
So this is "Python for Data Analysis."
This was written by Wes McKinney.
Wes McKinney is the
original creator of pandas.
And there are a ton of helpful exercises
and helpful walkthroughs here.
So if you ever get stuck or
if you want a deeper dive,
I'd definitely recommend
checking out this book.
Cool, so let's get started
with the COVID-19 dataset.
And so this dataset is
gonna be available for you
through Databricks datasets.
So this is just a number
of different datasets
that we host for you so
that you can more easily
test out different code,
look at different features.
So I can use this %sh ls
and then pass in this path here.
And so if you're familiar
with Jupyter Notebooks,
you might be more familiar
with an exclamation point
when you're doing a command like this.
But this just allows me to take a look at
what's actually within this file system.
And so if I wanna
navigate this file system,
I can see that within Databricks datasets,
there's this COVID folder
and then I have a number of
different sub-directories
within here.
So this is the sub-directory
we're gonna be using.
If I wanna navigate into it
and see what exactly is in it,
I can just paste that here
and take a closer look
at what's actually within it.
So under the CSSE COVID-19 data,
you can see that we're gonna
be using these daily reports.
And so if I call % sh
ls on that directory,
you can see that we have a ton
of different CSV files here.
So it looks like this starts
at around January 22nd
and it goes up as recently as 4/11.
And so we're gonna be using
primarily this CSV file here.
Though towards the end we'll
combine all of these CSV files
together so that we can see daily trends.
So if we wanna see the first
few lines of this dataset,
we can use % sh again and then head on it.
And so % sh is just allowing
us to run bash commands.
And so bash commands are something,
if you've ever worked
in a terminal window,
these are the same commands
that you would run there.
But this just allows us to
have really granular access
to what's going on within this cluster
that we have available for us.
So if we take a look at the
first few lines by using head,
we can see that we have a number
of different columns here.
We have this FIPS column,
we have admin 2, province,
state, country_region, et cetera.
And so we can go through
what exactly those
mean in just a moment.
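Pieced together, those directory-exploration cells look roughly like this; the exact dataset path and its capitalization are an assumption based on what's described here:

```
%sh ls /dbfs/databricks-datasets/COVID/CSSE_COVID_19_Data/csse_covid_19_daily_reports/

%sh head /dbfs/databricks-datasets/COVID/CSSE_COVID_19_Data/csse_covid_19_daily_reports/04-11-2020.csv
```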
And so those were all what
are called bash commands.
This next cell is actually running Python code.
So here we're going to import pandas.
So like I said, pandas
is an open source library
that is made available to you.
So it's a bunch of code
that somebody else wrote
that you will have available to you.
This is what's called an alias.
And so we're gonna be
importing the pandas library.
And then when we use this as pd,
this is what you will
see time and time again
in data science code.
So this is the convention that we use
in order to access pandas.
So now we can call
something like pd.read_csv
and we can use the read CSV
functionality from pandas.
So let's go ahead and
execute this command.
And just for reference, by
pressing Control + Enter
or Command + Enter, depending
on what type of system
you're using, that's how
you can execute these cells.
You can also click over here
and click run cell here too
if you prefer to do that.
So I call pd.read_csv and
I pass in this long path
that gives me access to the
daily report from 4/11/2020.
And so here we can take a look
at what's actually within this file.
So first we have this index,
so this is just the index
for a given row across this file.
Then we have this FIPS number.
So this is a specific convention
for identifying different areas.
So this is a, I believe it's a
US federal convention for it.
It's not really that important for us.
The admin 2 column, if
the area is within the US,
this is going to be the
county within the US.
And so we're gonna see some
null values for other countries
that don't have an admin 2 value.
And province_state, if it's within the US,
this is gonna be the state within the US.
If it's not within the US,
it's going to be the province or region.
And we'll take a closer
look at that in a moment.
We have the country or region,
which is what you would expect.
The last update, we're not
really gonna be using that,
but that's when this data was updated.
We have latitude, we have longitude,
and then we have the actual core data
that we're going to be working with.
So we have confirmed,
which is the number of confirmed cases.
We have deaths, which is the number of deaths.
We have recovered, which is
the number of people recovered;
be somewhat careful with this column
because not all countries and areas
are consistently reporting
recoveries from COVID.
And finally we have this combined key.
And what the combined key does
is just combine a number
of these different values.
So it combines admin
2 with province_state,
with country_region.
Great, so I'm gonna scroll through this
and you can see that we have 12 columns.
And within this file, we
have just shy of 3000 rows.
So before, you saw that
we were just executing
one line of code in a given cell,
we can combine multiple
lines of code easily enough.
And so I'm gonna rerun this.
And so it's gonna run
the import for pandas
and then it's going to
save this CSV file as df.
And so this gives us an object
so we can continue to
play around with this.
We don't always have to be
reimporting this dataset.
And so now we have this df variable
and we have this dataset saved in memory,
and so we can easily access it
and execute code against it.
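A minimal sketch of that import-and-read cell; the file path here is assumed from the directory listing above:

```python
import pandas as pd  # pd is the conventional alias you'll see everywhere

# Path assumed from the listing above; adjust if yours differs
path = ("/dbfs/databricks-datasets/COVID/CSSE_COVID_19_Data/"
        "csse_covid_19_daily_reports/04-11-2020.csv")
df = pd.read_csv(path)
df.head()
```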
Great, so first let's talk a little bit
about summarizing the data.
So these are some of the
basic pandas operations
that you're gonna be
using time and time again.
And so whether you're a data
analyst, a data scientist,
whatever the case may be,
one thing that I've noticed
a lot of individuals do
with quite a bit of success,
is they have a basic walkthrough
for summarizing new datasets
as they interact with them.
And so oftentimes they're going through
and you're calculating the
same summary statistics
anytime you see a new dataset.
And so it's really helpful
to have a good template
for doing what we would call
exploratory data analysis.
So if you're a data scientist
and you're trying to build
some sort of data science solution,
first you're doing
exploratory data analysis
just to figure out what's
going on within your dataset.
And then that process becomes more robust.
And then eventually you
apply machine learning
to the process.
And then some magic ensues
and your businesses are
automatically more successful.
It's something like that, but okay.
So let's talk a little
bit about tab completion.
So if I uncomment this code,
so this is a comment in Python.
Basically, it just indicates the Python
that you're going to ignore
whatever comes after it.
If I uncomment that code, and
I call df., and then hit Tab,
I should see the different
methods that are available to me.
So recall that this is a pandas DataFrame.
We used pd.read_csv in order
to import that DataFrame.
And when I hit Tab,
as long as there are resources
available on my cluster,
I'll be able to see the different methods
that I have associated with it.
So for instance, I can call df.abs.
And so that's gonna give me
the absolute value of each column.
Df.active, not sure what that is,
but might be helpful in some case.
I can call df.admin2,
and that's probably gonna
give me the admin 2 column.
But you can see that there're
a number of different methods
available here.
So this is all code that's
been written for me,
and then I can access by using pandas.
So we can take a look at how that works.
But first, let's take a
look at this df.admin.
If I ever get stuck
and I need to look at
the actual documentation
to get a sense for what's going on,
I can just use this help command.
And so if I call help on df.head,
it'll tell me a little
bit about what this does.
And so here, I can see that
I have this class DataFrame
and we have a number
of different parameters
associated with it.
And so it gives me some examples
of how I can use it as well.
And so it'll tell me the different methods
associated with it and
give me some details on it.
And so here,
we can see that there's quite
a bit of information here
and that's because pandas
is a very robust library.
There's a lot of functionality
available within it.
Cool, so if I just call
this df.head
without help around it,
here I can get a sense
for the first few lines of the dataset.
And so this will give me the
first five rows of the dataset.
That way I don't have
to keep on printing out
the whole thing over
and over and over again.
So here I can get a sense
for the first five lines,
looks like these are all areas within US.
And then if I want the inverse of that,
I can just call tail.
So df.tail is going to give
me the last few lines of it.
And so by default, df.tail
will give me five lines,
just like df.head will,
but if I wanna shorten that
or lengthen that, I can
add a parameter here.
And so these arguments rather
are how I start to customize my code.
And so here I can pass a value,
and then in the case of
head or in the case of tail,
this indicates how many rows
I actually wanna be able to see.
And so if I wanna know how
many records I have in total
in my dataset,
I can call this df.shape.
And so this will give me the
total shape of the dataset.
And so the first value
here is the number of rows
that I have.
So I have 2,966 rows.
And the second value here
is how wide my dataset is.
This is how many columns I have.
And so in data science and data analysis,
normally you refer to this as
features rather than columns
because features highlight the fact
that you are going to be putting this
into some sort of model in the future.
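Roughly, those inspection calls look like this:

```python
help(df.head)  # pull up the built-in documentation for any method

df.head()      # first five rows by default
df.tail(3)     # last three rows; the argument controls how many you see
df.shape       # (rows, columns), e.g. (2966, 12) for this file
```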
And so now there're a
ton of different ways
that I can summarize my data.
And so one thing that's worth mentioning
is if I do Command + /,
I can comment and uncomment
code pretty easily.
And then on a Windows machine,
that might be Control + /,
but you may have to play around with it
a little bit to confirm that.
So if I just call df.sum,
this is going to give
me the column-wise sums
across my dataset.
So for FIPS, that's that number,
that convention for regions.
You can see that it's
summed up that column
and I get some astronomical number.
So this is 8.3 times 10 to the seventh.
So that's a huge number
for country_region.
It's adding together all the 'US' strings,
that's not super helpful.
Last update time, it's adding
together all those strings.
So that's not super helpful.
Lat and long, that's
not very helpful either.
But these numbers are helpful.
So now it gives me the sum
of the confirmed cases,
the numbers of deaths,
the number of recovered,
and the number of active.
And so this is quite helpful.
So this gives me a sense
that, at this point in time,
bearing in mind that this was on the 11th,
so this was maybe four or five days ago,
the number of confirmed
cases it looks like it was,
what is that, 1.77 million cases.
And I know today it's a
little bit higher than that,
actually quite a bit higher than that.
But this gives me a sense
for generally speaking,
what are some of the summary statistics
that allow me to get a
sense for what's going on
within this dataset?
So if I wanna look at the minimum values,
I can do that as well.
So I can see that the
lowest FIPS number is 66,
I can see the lowest
latitude and longitude,
if that's helpful for me.
And the lowest number of
confirmed cases is zero,
zero deaths, zero recovered, zero active.
And so across this dataset,
you can see that there're
countries or areas here
that do have zero confirmed cases.
And it's worth mentioning
that this is column-wise minimum values.
And so it could be that one
country has zero confirmed cases
but a different country has zero deaths.
And this country is not
necessarily Afghanistan.
However, from a string perspective, right,
'cause this is a string value,
Afghanistan is going to be the minimum value.
So I can take a look at
the maximum values as well.
And so here I can see that
the maximum confirmed cases
is 163,000, and we'll go
into a little bit more detail
with that in a second.
If I just do a count,
this will give me a count
of all of the non-empty
values here within my dataset.
So now I can see that these
counts are going to be
reflective of the number of rows
that I have in my dataset.
And I can look at average values as well.
And so average values
will tell me, let's see,
on average how many
deaths, how many recovered,
how many active for different
areas within my dataset.
Now it's worth bearing in
mind that, within the US,
these are aggregated at a county level.
And so some of this information
might look a little bit off
because it's not all
done on a country level.
So we'll handle that issue in a second.
But finally I can look at
the standard deviation.
And so the standard deviation
is going to give me an idea
of the spread of my data.
And so how much is this data
spread out around its mean?
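Sketched out, those column-wise summaries look like this; on newer pandas versions you may need to pass numeric_only=True so the string columns are skipped:

```python
df.sum()    # column-wise sums; Confirmed/Deaths/Recovered/Active are the useful ones
df.min()    # column-wise minimums, computed independently per column
df.max()    # column-wise maximums
df.count()  # number of non-null values per column
df.mean()   # column-wise averages
df.std()    # column-wise standard deviations, i.e. the spread around the mean
```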
Great, so that's one way
of calculating these values
on a column-by-column basis.
If I call this df.describe,
this will do many of
these operations for us.
And so here I can look at,
across these different numerical values.
So describe is going to ignore
the non numerical values.
Across these different numerical values,
I can see the counter values, the mean,
the standard deviation,
the minimum, the maximum.
And then these are my different quartiles.
So the 50th percentile,
this is gonna be my median,
and then my 25th percentile and my 75th percentile.
So this gives me a sense
of the spread of my data.
So how much has it spread out (mumbles)
Are most of the values
generally right around the mean
or is a lot of my data
farther away from the mean?
So that's what standard
deviation and these percentiles
are gonna tell me.
So I can get a sense for what
the average number of cases is.
So the average number of
cases is gonna be around 598.
And so bear in mind, this is
gonna be county level data
within the US, so that's
gonna throw off these numbers
a little bit and we'll
handle that in a second.
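Or, all in one call:

```python
df.describe()  # count, mean, std, min, 25%/50%/75% percentiles, and max for the numeric columns
```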
Cool, so now let's go on to
slicing and munging data.
And so this will give us a sense
for how to do some more robust
manipulations around it.
So if I just wanna pull
out the confirmed cases,
I can use this syntax here.
So this is pandas specific syntax.
If you've used Python in the past,
you might recognize these
square brackets for lists
within Python, but generally speaking,
you don't use this syntax a lot elsewhere.
This is pretty specific to pandas,
but it will become a lot more natural
the more you use it.
So if I just call df of confirmed here,
I can get a sense for that
one individual column.
And so here I can just
see the individual values
from df confirmed.
And then if I wanted to, if
I wanted to shorten this,
I could call .head on this as well.
And so one thing that's
helpful to know about pandas
is it's very common to use
what's called method chaining.
And so here I index, I grab this column
and then I add a method to it.
And so it's common to go through
and chain a number of
different methods together.
And so here I can start to use
these individual components
and make a slightly more
complex analysis based upon it.
And so you will see
this syntax quite a bit.
And you'll see as we
go on in this notebook,
this will become more and more complex.
Well, cool.
If I want to take a look at
the columns that I have,
I can take a look at it that way.
And then if I wanna
index multiple columns,
I can do it this way instead.
And so here I'm using
those same square brackets,
but I'm passing in a list.
And so here, I have double square brackets
and that just indicates to pandas
that I want multiple columns here.
And if I wanna add a new
column, I can do it this way.
So I can call df['date'],
date doesn't exist here yet,
but I can assign to this column.
I'm gonna use this datetime object.
I won't go into too much detail about it,
but just know that this allows me
to have a specific formatting
specific to a date.
And then if I take a look at
the first few values of it,
it looks something like this.
So here I can create new
columns from either old columns.
I can say take two columns
and add them together,
or I can create a new column
and have it all be the same value.
In this case, it's a datetime.
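A rough sketch of the column selection and the new date column; the exact column capitalization in the CSV is an assumption, and pd.to_datetime is just one way to build the date value:

```python
df["Confirmed"].head()                      # a single column comes back as a Series
df[["Province_State", "Confirmed"]].head()  # a list of columns comes back as a DataFrame

# Add a new column where every row gets the same datetime value
df["date"] = pd.to_datetime("2020-04-11")
df["date"].head()
```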
And so that covers how to
index on a column level.
If I wanna index on a row level,
I can do something like this.
And so I can call df.loc, L-O-C,
so that just gives me the location.
And here, the first value I
pass in is the specific rows
that I want.
In the second value,
this should look familiar
as the columns that I want.
And so this is gonna evaluate
to be just about the same thing as this.
And so here, it says I wanna
go from the zeroth value.
And so Python is zero indexed.
And so that means, if you're
indexing into something,
you start with zero rather than one.
So I tell it that I wanna go from the zeroth
to the 10th value.
And so that's how I pull
out those values here.
And so this is gonna be the
same thing as this first line.
And so this gives me a sense
of how to slice from a
horizontal and a vertical perspective.
And if I just wanna pull
out the first column
and the first row, I can
do something like this.
And so that'll give me the
first FIPS number that appears
within the dataset.
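A sketch of that .loc slicing:

```python
# Rows 0 through 10 of the Confirmed column; note that .loc's end label is inclusive
df.loc[0:10, "Confirmed"]

# Roughly the same as grabbing the column first and then slicing it
df["Confirmed"][0:10]

# A single cell: the first row's FIPS value
df.loc[0, "FIPS"]
```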
Okay, so I talked a little bit
about how the regions are
a little bit different
across this dataset.
So if I wanna get a sense for
what that actually looks like,
I can call this .value_counts.
So on the country_region,
so we're on that one column,
I'm gonna call .value_counts
and it's just gonna tell
me exactly how many regions
I have for each country.
So in the US, you can see
that I have about 2,700
and that's because we have
county level information
within the US.
And so within the US, we
have about 2,700 counties
that are represented here.
On China, I believe these are
provinces, Canada as well,
UK, France, et cetera.
And so you can see that for
a handful of these countries,
I have more granular data
than for other countries.
And then for countries
like Tunisia, for Chile,
for the UAE, for Fiji, here
I only have data available
on the country level.
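That value_counts call looks roughly like this:

```python
# How many rows each country contributes:
# counties for the US, provinces for China and Canada, one row for many others
df["Country_Region"].value_counts()
```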
And so what's FIPS?
We talked a little bit about what it is,
but we don't really care.
FIPS is not something that
we necessarily wanna use.
And so if we wanna drop that column,
we can use df.drop
and then we can pass FIPS
and we can give it an axis.
And so this axis just indicates
that we wanna drop the column
rather than dropping the row.
And so it'll look for a row unless
we specify axis is equal to one.
And so we can run it like this.
And then if we call df.columns,
we should see that these
columns are all the same
as they were before,
but we no longer have
the FIPS number here.
And so bear in mind that
here, when I call df.drop,
I re-save this to df.
So I overwrote the original
DataFrame with a new DataFrame,
which is df but without the FIPS column.
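Roughly:

```python
# axis=1 tells drop to look for a column label instead of a row label
df = df.drop("FIPS", axis=1)
df.columns  # FIPS no longer appears
```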
Cool, and so now we can take
a look at confirmed cases.
So if I want to sort my DataFrame,
I can call df.sort_values
and then I can pass in the
column that I wanna sort by.
And so in this case, I'm
gonna sort by confirmed cases
and I'm gonna set
ascending to equal false.
If I set this to be equal to true,
and bear in mind that you
always capitalize true or false,
if I set this to be equal to true,
you could see that I have
all of the zero cases first.
But instead, I want this
to be equal to false.
So now you can see
that I have the highest number
of confirmed cases in Spain,
second highest in Italy,
then France, then Germany,
and then US.
And so this looks a little bit odd.
And the reason why this looks
odd is because once again,
we have that county level
granularity within the US.
And so this is specifically
within the New York City area.
And so here you can see as of 4/11,
we had about 98,000 confirmed
cases within New York City.
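A sketch of the sort:

```python
# Largest confirmed counts first; note the capital F in False
df.sort_values("Confirmed", ascending=False).head()
```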
Okay, so if we just wanna
look at what's going on
within the US, we can
call something like this.
So we'll call df[df['Country_Region'] == 'US'].
So if I run this, this will
only give me the DataFrame
where the country_region is equal to US.
And so here I can always
just call .head on this
to make it a little bit shorter
and have to do a little bit less scrolling
at the end of the day.
Okay, so this is how I filter
a column based upon one value.
What happens if I want
to filter a column based
on multiple different values?
So if I wanna do something like that,
I can call df['Country_Region'] == 'US',
I'm gonna throw all these in parentheses
and then I'm gonna add this ampersand (&)
and then I'm gonna add province_state
is equal to California
and admin 2 is equal to San Francisco.
And so this is where I'm
based out of right now.
So I live in San Francisco County,
which is the name of the county
is the same as the name of the city.
And so here, as of 4/11,
you can see that there were
857 confirmed cases, 13 deaths,
and the recovered number,
I don't really trust
because I'm not convinced
that all of these different regions
are reporting recovery
numbers in the same way.
But generally speaking, I can
trust the confirmed number
and the death numbers.
Cool, so that gives me a sense
for how I can filter based
upon multiple values.
And you can play around with this.
You can change this in
order to match the area
that you're in.
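A sketch of both filters; each condition goes in its own parentheses when you combine them with the ampersand:

```python
# One condition: just the US rows
df[df["Country_Region"] == "US"].head()

# Several conditions combined with &
df[(df["Country_Region"] == "US")
   & (df["Province_State"] == "California")
   & (df["Admin2"] == "San Francisco")]
```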
Okay, so now what if I wanna
answer different questions?
For instance, what country
has the greatest number
of confirmed cases?
So in that case, I wanna
be able to group my data
by the country
and then perform some
sort of operation on it.
So I can call df.groupby
and I can group by country or region.
And what this returns is
this DataFrame group object.
And so all we really
need to know about this
is that it's a specific type of
object that's returned to us,
but it really needs some sort
of other additional method
on top of that.
And so here I'm gonna do df.groupby on it,
I'm gonna have that same country_region,
then I'm just gonna pull
out the confirmed cases.
I'm gonna call .sum on it,
and then I'm gonna sort the values.
And so this is just that chaining
that I talked about before.
And so sometimes this can
be a little bit challenging
to read, but when you're
writing this code,
normally you start by something like this.
So you might start by saying df.groupby,
and then you wanna do confirmed,
so it just gives you the confirmed values.
And then maybe you do
this sum on top of this,
and then you see that I get sum
across all of these different values
and then you would chain
sort_values on top of that.
And so when you're coding,
you're normally doing this
in an iterative process.
It's not like you're writing
all of this code at once.
And so when you see these
longer chains of methods,
never be intimidated by it.
The person who wrote this originally
was writing one method at a
time, usually executing it,
making sure it looks like
they were expecting it,
and then going on to the next one.
Cool, so now that I've
grouped by country or region
and looked at the confirmed
cases and sum them,
it looks like
the US has the highest
number of confirmed cases,
number two is Spain,
number three is Italy,
and France, Germany, and China.
And so it's like it's the US first,
then a number of European countries,
and then China after that.
And so if I wanna ask the question,
which US states have the most cases?
I can start to chain up
what we've done before.
So here I'm gonna filter
by country or region
is equal to US.
I'm gonna group by province_state.
I'm gonna pull up the
confirmed values, do a sum,
and then sort values.
And so here, this gives
me state level information
from the United States.
So I see New York has the highest
number of confirmed cases,
followed by New Jersey
and Michigan, Massachusetts, et cetera.
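Pieced together, those two group-bys look roughly like this:

```python
# Which country has the most confirmed cases?
df.groupby("Country_Region")["Confirmed"].sum().sort_values(ascending=False)

# Which US states have the most confirmed cases?
(df[df["Country_Region"] == "US"]
   .groupby("Province_State")["Confirmed"]
   .sum()
   .sort_values(ascending=False))
```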
Great, so now I wanna move on
to talk a little bit about
missing data, duplicate values,
that sort of thing.
So this is the reality of
doing any sort of data analysis
work or data science work,
is that you're always
dealing with outliers,
null values, that sort of thing.
And so just being able to do
this in a pretty efficient way
allows you to avoid spending
too much time doing that.
So I can always call this df.isnull
and this will return trues
where a value is null
and false otherwise.
And so the way that I would read this
is I look at, say, the admin 2 column here
and see that I have at least
four different null values
within the tail here.
And then country_region,
it looks like none of these
are null values.
So if I just call sum on that,
it'll give me the column-wise
number of null values that I have.
So a lot of these are
missing admin 2 values.
That's because we don't have
county level information
for a lot of this data.
And you'll see some of the
other missing values as well,
including lat long for 59
different records I have.
So now let's see
how many different
unique countries we have.
And so here I can take df['Country_Region'],
I can call .unique and then
I can call .shape afterwards.
And then I can see that I have
about 185 different unique countries here.
If I wanted to do the same thing,
I can always just call
drop duplicates as well,
and this will give me the
unique values as well.
And then finally, I can
always use this df.fillna
and then I can pass it in some
sort of value to fill with.
And so here I call df.fillna
and I pass in 'no data available'.
So now you can see where
I did have missing values,
suddenly I have the string instead.
And so this might be helpful
if you wanna do what's called imputing,
where you fill in null values
with some sort of standard value.
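A sketch of those missing-data helpers:

```python
df.isnull()        # True wherever a value is missing
df.isnull().sum()  # column-wise counts of missing values

df["Country_Region"].unique().shape     # roughly 185 unique countries
df["Country_Region"].drop_duplicates()  # the same idea, returned as a Series

df.fillna("No data available")  # a simple form of imputing with a placeholder value
```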
Cool, so the last thing I wanted
to cover was visualization.
And so within languages like Python or R,
you have really, really robust
visualization libraries.
And so what we're gonna be
using here is what's available
within core pandas, but
there're many other libraries
that are available to you as well.
And so a lot of Pythonistas
start with matplotlib.
This is the main plotting
library within Python.
There are also other libraries
such as Seaborn and Plotly
and a number of other ones.
And so these are largely
open source libraries
and so you can use these
in quite the same way.
But I'm just gonna run
this import for matplotlib
and then call this %matplotlib inline.
All this does is indicate to Databricks
that we wanna render these images inline.
And so we wanna make sure
we show whatever's there.
So I'm gonna be looking
at just the US-based data.
So I'm gonna go ahead and create a subset
of my original DataFrame.
I'm gonna call it us_subset_df
and then I'm just gonna pass
in this code that you saw before,
df[df['Country_Region'] == 'US'].
Cool, so now I have this subset.
So now we can start to
ask different questions
about, say, the number
of distribution of deaths
by US states and territories.
And so oftentimes when we
talk about distributions,
we're talking about, for deaths,
what generally speaking is
the number of different values
that we have for different states.
And so here we can see that we're,
sorry, let me take a step back.
So the way that this is operating
is we're going to bin together
the number of different
deaths that we have.
So say all of the deaths
between zero and 50,
we're gonna put together in the same bin
and we're gonna visualize it this way.
And so that way we can get a sense
for are there any outliers
within my dataset?
And so if we take a look at this one,
here we can see that there's
one outlier out here,
but normally we have much
lower number of deaths.
But if we want a little
bit more granularity,
we can change this bins
parameter within this histogram.
And so here I'm just gonna
go through, do a group by,
sum of the deaths.
So this will give me deaths by state.
And then I'm going to call .hist
in order to get a histogram.
And then pass in this bins parameter.
And so if I execute this,
you can see that I have a histogram
that has a little bit more granularity.
So you can see that I have
one outlier way out here.
This is likely gonna be New York.
And then you can see the
majority of the states
have much lower death counts.
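A rough sketch of those histogram cells; the bins value here is just an example:

```python
import matplotlib.pyplot as plt

# In the notebook, "%matplotlib inline" tells the environment to render plots in place

us_subset_df = df[df["Country_Region"] == "US"]

# Distribution of total deaths by state/territory; bins controls the granularity
us_subset_df.groupby("Province_State")["Deaths"].sum().hist(bins=30)
```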
So if I wanna see how confirmed
cases relate to deaths,
I can take that DataFrame, call .plot
and then call .scatter.
Then I just need to define
what's on the X-axis
and what's on the Y-axis.
So this allows me to visualize
confirmed versus deaths.
And you can see
that there seems to be
some sort of correlation
between these two and then I
have this outlier way out here.
If I wanna remove that outlier,
I can take this us_subset_df,
I can filter to any deaths less than 1,000,
and then I can call that same code.
And so this allows me to
plot those same values,
but without that outlier.
And so here you can see
once again, that correlation
between confirmed cases and deaths.
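Roughly, the scatter plots look like this:

```python
# Confirmed cases vs. deaths, one point per US county
us_subset_df.plot.scatter(x="Confirmed", y="Deaths")

# The same plot with the New York City outlier filtered out
us_subset_df[us_subset_df["Deaths"] < 1000].plot.scatter(x="Confirmed", y="Deaths")
```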
And so I'm gonna gloss over this code
'cause I know we're a
little bit short on time,
but all this is allowing us to do
is take all of those
different individual CSV files
and combine them all together.
So I'm just going through
each one of these.
If you were with us last week,
this (mumbles) should
look somewhat familiar.
But you're reading in that file name,
you're changing the columns a little bit.
And the reason for this
is not all the CSVs line up perfectly.
And so this is just some code
to clean that up a little bit.
I'm adding in the date
and then I'm combining
all of these together.
And so basically I combine
all of these different DataFrames
within this list
and then I can concatenate
them together at the end.
And so this just gives me
data across the entirety
or all of those days that
I have available to me.
And so it's gonna look
something like this.
I dropped out some columns as well
because of the issue with
columns not matching up.
But this will give us the core
data that we're looking for.
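A simplified sketch of that combine step; the real notebook also renames columns, since the older daily files use slightly different headers, and the path and date parsing here are assumptions:

```python
import glob
import os

daily_dir = ("/dbfs/databricks-datasets/COVID/CSSE_COVID_19_Data/"
             "csse_covid_19_daily_reports/")

frames = []
for path in sorted(glob.glob(os.path.join(daily_dir, "*.csv"))):
    daily = pd.read_csv(path)
    # The date is encoded in the file name, e.g. 04-11-2020.csv
    daily["date"] = pd.to_datetime(os.path.basename(path).replace(".csv", ""))
    frames.append(daily)

all_days_df = pd.concat(frames)  # stack every daily DataFrame on top of one another
```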
So now we can start to ask questions
about how the disease spread over time.
So if we group this information by date
and look at confirmed cases,
so we want the sum of confirmed cases
and then we wanna call .plot on this.
And so this gives us a sense
for the increase of cases over time.
So the data goes back to January,
and you can see this increase.
And so I added this title here.
The title is gonna appear
at the top of the plot
and then I added this rotation as well.
So this rotation is for the X-axis.
If I didn't add this
rotation, all of these numbers
for dates would all
overlap with one another.
So I added a 45 degree rotation
just so I could see this
a little bit better.
So now if I wanna break this
down by the types of cases,
I can group by date,
I can pull out the confirmed,
death, and recovered values.
I can call .sum on it,
so I have a sum of each one
of those different values.
And then I can call .plot.
And so I give it a title, I
give it that same rotation
and we can get a sense for
the number of confirmed cases,
the number of deaths, and the
number of recoveries as well.
And so then if I wanna
look at what is the growth
within my country,
I can start to look at
it this way as well.
And so this might look a
little bit complicated,
but basically I wrapped all
of this within parenthesis
so I can divide this code up
over a couple of different lines.
Then I subset it by country_region,
province_state, admin 2.
So this is the same code we saw before
when we looked at San Francisco.
I group by date, pull out
confirmed, deaths, recovered,
call .sum and call .plot at the end.
And so this gives me the specific
results for San Francisco.
So you can see the
number of confirmed cases
is increasing somewhat linearly.
And we see we have very
low number of deaths
and recoveries, relatively speaking.
Cool, and so this final
function, this just wraps,
this line of code that we saw up here,
it wraps it all up into one function.
And so if you were with us last week,
you're familiar that you
can (mumbles) this code up
with (mumbles) to function
and so that you can add your
own arguments to this function.
And so now I can just call plot my country
and then I can see US
New York, New York City,
but I could add in different parameters
for my specific state or region
if I wanted to as well.
And so you can play around with this
and you can use this
to customize the result
for your own area.
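As a sketch, that wrapper function might look something like this; the function and parameter names here are illustrative rather than the notebook's exact ones, and it assumes the combined DataFrame from the sketch above (all_days_df) with normalized column names:

```python
def plot_country(df, country, state=None, county=None):
    """Plot confirmed/deaths/recovered over time for a country,
    optionally narrowed down to a state/province and county."""
    subset = df[df["Country_Region"] == country]
    if state is not None:
        subset = subset[subset["Province_State"] == state]
    if county is not None:
        subset = subset[subset["Admin2"] == county]
    (subset.groupby("date")[["Confirmed", "Deaths", "Recovered"]]
           .sum()
           .plot(title=f"{country} cases over time", rot=45))

# For example, US / New York state / New York City:
plot_country(all_days_df, "US", state="New York", county="New York")
```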
So I know that was a bit
of a fire hose of pandas.
So you'll have access to this notebook.
You also have access to a lab as well.
And so the lab is going to
give you some starter code
and it'll just have you go
through some plotting exercises
so you can get a sense for
how to actually visualize
what's going on in your area.
So I want to make sure
we had a few minutes
at least for some questions.
And so why don't we go
ahead and switch to that?
So if you have any questions,
feel free to include in the Q&A.
And then I also ask the TAs
to include some questions
that came up with some frequency
so that we could all
address them together.
- [Karen] Yeah, Conor, great job.
There're quite a few folks
who are asking about pandas versus Koalas
and also pandas versus Spark,
if you wanna talk about the differences
between single node and
distributed computing.
- Perfect, that sounds good.
So the library we used today is pandas.
Pandas is what we call a
single node technology.
And so when we run this DataFrame,
when we run any pandas code,
it's always only working
on one single machine.
And so for those of you who
are familiar with Spark,
so Spark is a distributed
computing system.
And so if you have more data
than can fit on any one machine,
then you're going to wanna
use something like Spark
rather than pandas.
Now, pandas works really well.
Data scientists use it all the time,
but it cannot scale
beyond one single machine.
And so when you use something like Spark,
it allows you to distribute that.
And so Spark also uses a DataFrame,
but that DataFrame is actually,
you have a little bit of your data
sitting on different
nodes across your cluster.
And so Brooke mentioned Koalas as well.
Koalas is a way of
writing pandas-style code
but against a Spark DataFrame
rather than a pandas DataFrame.
So it's going to allow you
to use this same syntax,
but your code is gonna be able
to scale across a cluster of machines.
And so this will be helpful
if you're working on any
sort of big data problem.
And so if you have more data
than can fit on any one machine
and you don't know how to work with it,
then you're gonna wanna
look into technologies
like Spark and Koalas.
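To make that comparison concrete, here's a hedged sketch; the file paths are placeholders, and Koalas ships as the databricks.koalas package:

```python
# pandas: runs on a single machine
import pandas as pd
pdf = pd.read_csv("/dbfs/path/to/data.csv")
pdf.groupby("Country_Region")["Confirmed"].sum()

# Koalas: nearly identical code, but the work is distributed across a Spark cluster
import databricks.koalas as ks
kdf = ks.read_csv("/path/to/data.csv")
kdf.groupby("Country_Region")["Confirmed"].sum()
```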
- [Karen] Great, and then
a few other questions
about how do you add
records to a DataFrame?
- Well, that's a good question.
So we looked a little bit
at how to add columns
to a DataFrame.
Then there're a number of different tools
that are available for you.
If we take a look at
the pandas cheat sheet,
we should have some
indication of how to do this.
So there're a number of different ways
that you can combine data.
So you can look at
these different reshaping functionalities,
but one that you might wanna
look into is this pd.concat.
So this will concat two
different DataFrames together.
And so you can do this, you
can stack them horizontally,
you can stack them vertically.
But these types of values
are something you're
going to wanna look into.
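For example, a small hypothetical sketch of appending rows with pd.concat; the new row here is made-up data purely for illustration:

```python
# A hypothetical new row with the same columns as df
new_rows = pd.DataFrame({"Country_Region": ["Atlantis"], "Confirmed": [1], "Deaths": [0]})

# Stack it underneath the existing DataFrame (axis=0 is the default);
# axis=1 would instead place DataFrames side by side
df_with_new_rows = pd.concat([df, new_rows], ignore_index=True)
```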
- [Karen] And then as a followup
to the earlier PySpark
versus pandas question,
somebody's asking about, is
there any specific threshold
when you'd use one over the
other like 10,000 records,
100,000 records, what are
some of the considerations
that you take into account?
- Yeah, so the main thing
you have to consider
is how much memory you
have available to you.
So if you have eight
gigabytes worth of memory
on one machine, and the data that you have
is eight gigabytes, you're
probably going to need
to use some sort of
distributed technology, Spark
or whatever the case may be.
And so that's one thing to consider.
It's more the size in memory
rather than the overall number
of records that you have.
And bear in mind
that if you have eight gigabytes
of memory available to you
and eight gigabytes worth of data,
that's likely not going to work
because you still need a
little bit of extra memory
available to you to do
any sort of operation.
And so I would start to
think about using something
like Spark, once you're on the order
of maybe tens of millions
of rows, but like I said,
that's really a question of the
size of that data in memory.
