A few days back I uploaded a story on my Instagram
where I did an AMA on a data analysis project
I was working on.
As a result, many of you asked me to do
a video on that project, so here I am.
Though the dataset will be different,
the project will be pretty much the same.
What’s up Internet.
I am Manish from RebellionRider.com.
Today we are going to do data analysis of
COVID-19 data using Python and Tableau.
Here is how I am planning to proceed with
this project.
1.
Data Collection
First, I will try to find a dataset.
Anything that is continuously updated
with the latest information will work.
A large part of this project's workflow
depends on the nature of this dataset.
2.
Data Cleaning (if needed)
Second, I will look for anomalies in the
dataset and try to fix them.
I will aim to find a dataset that has fewer
errors and redundancies.
But if there is still a need, we will do the
cleaning.
3.
Data Storage
In the end we will store this dataset in such
a way that we can use it with our data analysis
software, which is Tableau.
Let’s start with gathering the data.
While I was searching for the data for this
project, I came across various good APIs but unfortunately,
all were paid.
And all the free APIs were either incomplete,
poorly organized, or not updated regularly.
Now, I am left with only one option, which
is web scraping.
Initially my plan was to get the data from
the worldometer.info website.
I even drafted half of this tutorial around
it, but just as I started recording this
video, I got to know that this website's
data is also not free.
Anyway, out of respect, I decided to drop
the already drafted script and code and look
for something that is truly free and available
for any of us to use.
So far, one thing that is very clear to me
is that finding a good free dataset is next
to impossible.
Therefore, right now, my aim is to find a
dataset that is updated regularly,
regardless of whether it is clean or consistent.
The only website that fits the bill is Wikipedia.
Free and updated, but not the cleanest data
for any of us to use.
I admit it's not my first choice, as the
data is not very clean, which means we have
to do a lot of work on it.
And that also means you are going to learn
various new things.
So, let’s do that –
But before that, you can download all the
files of this project from my GitHub page.
All the necessary links are in the description.
Go ahead and check that out.
That being said now let’s get back to the
tutorial.
First let’s quickly create a directory for
storing
all our files.
Let’s name it covid-19 Analysis.
Now let’s launch the IDE.
For this project my choice of IDE is going
to be Spyder, as you can see here.
You can use whatever Editor or IDE you want.
Now let’s create a file and save it into
the directory that we just created.
I will name this file main.py
Done.
Next, click here on this menu button and select
– “Set console working directory”.
Done!
Now we are all set to write our code.
For web scraping, or fetching the data from
the website, we will be using the pandas and
requests libraries of Python.
Using the requests library, we will fetch the data
from the URL, and using pandas we will save
that data into a data frame and process it.
Data frames are heterogeneous
two-dimensional tabular data structures,
which means they can hold data of different
datatypes in row and column format.
This makes them the best-fit structure for
holding this type of data.
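As a quick illustration (the names and numbers here are made up for the demo, they are not project data), a tiny data frame with one string column and one integer column looks like this:

```python
import pandas as pd

# A minimal sketch: each column holds its own datatype
df = pd.DataFrame({
    'Country': ['Alpha', 'Beta'],    # strings
    'Total Cases': [1200, 845],      # integers
})
print(df)
print(df.dtypes)
```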
Let’s do that –
import pandas as pd
import requests
Here I just imported both the libraries that
we need –
url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic'
req = requests.get(url)
As you can see, I have declared two variables,
URL and req.
In the first variable I am saving the URL
of the website from which we want to fetch
the data.
In the next line we call the get method of
requests library.
As the parameter to this function we pass
the URL.
This function call will fetch all the data
from the specified URL and return the response
object which will be saved into the variable
– req.
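As a small defensive addition of my own (not part of the original script), it is worth checking that the request actually succeeded before parsing anything:

```python
import requests

url = 'https://en.wikipedia.org/wiki/COVID-19_pandemic'
req = requests.get(url)
# raise_for_status() raises an HTTPError for 4xx/5xx responses,
# so the script fails early instead of parsing an error page
req.raise_for_status()
```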
Let’s execute this code -
So, we have fetched the data.
But this is not the data that we want, rather
it’s the entire text of the website along
with all the HTML, CSS formatting and other
meta data.
If you double-click on the variable name
here, you can see what it is holding right now.
If you double-click on this attribute text,
you can see all the web page data it
is holding.
Now the question is: how are we going to filter
out the data that we need from all this?
To do that we will need the corresponding
path of this text attribute, which is
“req.text”.
Just note it down and close this window.
Done!
To filter out the desired data we will be
using the read_html function of the pandas library.
This function will look for all the data that
is saved in table format on the specified
web page.
After that it will extract that data into
a list of data frames.
Let’s do that –
data_list = pd.read_html(req.text)
Here we called the read_html function.
As the parameter to this function we have
passed req.text, which I mentioned earlier.
On execution it will fetch all the desired
data and save it into the variable data_list.
Let’s execute this code.
In the variable explorer you can see that
the data type of this variable is list.
This is not just any list, but a list of
data frames.
If you double-click on it, you will
see that it contains 22 different data frames.
From these 22 we need to find the one
that has our desired data.
By desired data I mean country-specific COVID-19
data.
If you look at them carefully, you
will notice that data frame number 4 is
the largest in size.
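If you would rather not count on the table keeping index 4 forever, here is a sketch of a more defensive pick (my own addition, shown on a toy list, not the video's method): select the table with the most rows programmatically.

```python
import pandas as pd

# Toy stand-in for the list that read_html returns
data_list = [
    pd.DataFrame({'a': [1]}),
    pd.DataFrame({'Country': ['X', 'Y', 'Z'], 'Cases': [1, 2, 3]}),
]

# The country table is by far the longest one on the page,
# so picking the table with the most rows is a reasonable heuristic
target_df = max(data_list, key=len)
print(len(target_df))  # 3
```

pandas' read_html also accepts a match parameter that keeps only the tables containing a given piece of text, which is another way to avoid a hard-coded index.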
Let’s double click on it and see what we
have inside it.
So, this data frame holds the data that we
need.
But not in its cleanest form.
We will handle that.
But for now, let’s extract this data frame
out from this list.
target_df = data_list[4]
So here I created a variable with the name
target_df.
On the right-hand side of this assignment
statement, I have written the name of
the variable holding all our data
frames, followed by the index number
of our target data frame, which is 4,
inside square brackets.
Let’s execute this code.
Done!
Let’s come to the variable explorer and
see what we have in the variable target_df.
Great!
We successfully extracted our desired data
set.
With this we are done with the Data Collection.
Now let’s move on to the next step which
is data cleaning.
This is going to be the most challenging process
but full of knowledge.
And it’s going to be fun.
Let’s do it –
But before that I just want to ask you to
subscribe to the channel if you haven’t
already and press the bell icon to get notified
every time I upload a new video.
And, most importantly don’t forget to press
the thumbs up button if you like this video.
Thanks.
Now let’s move on to the next step.
Data Cleaning
Let’s see what we need to clean here.
So, here we have two completely irrelevant
columns.
The first one is full of NaN values and the last
one is full of reference numbers, which are
of no use to us.
Also, this dataset has the worst column names.
We’ll have to fix those too.
Column number 4 has the string ‘No data’
stored in some rows.
We will replace this with 0.
No big issue.
I can also see that some of these country names
end with letters enclosed in square brackets.
We will need to fix this because it can be
a huge issue during data analysis.
Let’s scroll down and check whether we have any
issues there or not.
OK, the last two rows have text stored in them,
which is again of no use to us.
It’s a small issue; we will delete these rows.
These are a few of the issues that are visible
to us.
The one that is not is the data type of
the data stored in this dataset.
The data in these columns seems to be of integer
data type, but it’s actually of string data
type, which we need to convert.
So, in total we have 6 issues.
Let’s fix them one by one.
Let’s start with changing the column names
into something more meaningful.
Issue #1: Column Names
We will be giving meaningful names only to
columns 1, 2, 3 and 4, not to columns 0 and 5,
as these are of no use and we will be getting
rid of them.
So, let’s do that –
Here are the column names that I am planning
to assign.
Anyway, let’s write the code.
This is going to be simple-
target_df.columns = ['Col0','Country Name','Total
Cases', 'Total Deaths', 'Total Recoveries','Col5']
This is a pretty simple assignment statement.
On the left-hand side, we have the name of
the variable holding the data frame we
want to update, followed by a dot (.) and
then the attribute name ‘columns’.
On the right-hand side, we have the new names
that we want to give to our columns:
separated by commas, each enclosed in single quotes,
with the entire list of names enclosed
in square brackets.
Let’s execute the code.
Done.
Let’s check.
The column names have been updated.
Great.
Issue number 1 solved.
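A variation worth knowing (my own sketch on a toy frame, not the video's code): DataFrame.rename lets you rename only the columns you care about and leave the rest untouched.

```python
import pandas as pd

# Toy frame whose columns are plain integer labels
df = pd.DataFrame([[None, 'India', 100]], columns=[0, 1, 2])
# rename takes a mapping of old label -> new label;
# columns not in the mapping keep their old names
df = df.rename(columns={1: 'Country Name', 2: 'Total Cases'})
print(list(df.columns))  # [0, 'Country Name', 'Total Cases']
```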
Now let’s move on to the next one.
Issue #2: Extra Columns
So, in our dataset we have two columns, column
0 and column 5, which are of no use.
We will deal with them by extracting
all the columns that we need and saving those
into a variable.
That way we will leave these extra columns
out of our targeted data frame.
Like this –
target_df = target_df[['Country Name','Total
Cases', 'Total Deaths', 'Total Recoveries']]
On execution this statement will extract all
the columns that we have specified in the
list and save them into the variable,
which in our case is target_df.
Make sure to enclose this list inside a pair
of double square brackets: the outer pair is the
indexing operator and the inner pair is the Python
list of column names.
Otherwise you will get an error.
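For the record, the same result can be reached from the other direction with drop (a sketch on a toy frame; it assumes the column names we just assigned):

```python
import pandas as pd

df = pd.DataFrame(
    [[None, 'India', '100', '1', '50', '[1]']],
    columns=['Col0', 'Country Name', 'Total Cases',
             'Total Deaths', 'Total Recoveries', 'Col5'])
# drop(columns=...) removes the named columns and keeps the rest
df = df.drop(columns=['Col0', 'Col5'])
print(list(df.columns))
# ['Country Name', 'Total Cases', 'Total Deaths', 'Total Recoveries']
```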
Let’s execute and see!
Done.
Let’s quickly check our target_df variable
and see if we have updated columns or not.
Great.
Now this variable has data only from desired
columns.
With this, Issue number 2 is also resolved.
Now let’s move on to the next one, which is extra
rows of data.
But before moving ahead let me add some comments
to this script so that it will be more readable.
Now this script is looking much better and
more user friendly.
Issue #3: Extra Rows
As we saw earlier, there are two rows
at the bottom of the dataset which are
irrelevant for our data analysis project.
Thus, we will remove them.
To remove these rows, we will need their corresponding
index numbers.
Let’s get those.
Let me first open up the target data frame.
Ok, these index numbers are 229 for the last
row and 228 for the second last row.
Let’s close this window and write some code.
target_df = target_df.drop([229, 228])
On execution, this drop function will remove
rows corresponding to index number 229 and
228.
But we cannot use these index numbers.
Why?
Let’s say tomorrow Wikipedia updates this
list by adding two more countries to this
dataset.
In that case the index numbers corresponding
to the last and second last rows will change.
And if that happens, this script is of
no use.
The solution to this issue is not to hard-code
the numbers into the statement, but rather
to fetch the index numbers dynamically.
Like this
Like this
last_idx = target_df.index[-1]
This statement on execution will return the
index number of the last row of our data frame
and save it into the variable last_idx.
Now we will use this variable for deleting
these rows.
Like this
last_idx = target_df.index[-1]
target_df = target_df.drop([last_idx, last_idx-1])
The first parameter of the drop function holds
the index number of the last row and second
parameter which is last_idx -1 holds the index
number of the second last row.
Now no matter how many extra rows of data
get added into the data set in the future,
this script will always work.
Assuming the last two rows will always hold
the irrelevant data.
Let’s execute and see if it works or not!
Done!
Let’s quickly check.
Great Last two rows gone!
Issue number 3 Solved.
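An equivalent, slightly shorter way to drop the last two rows (my own sketch on a toy frame, under the same assumption that the junk always sits at the bottom) is positional slicing with iloc:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['A', 'B', 'note 1', 'note 2']})
# iloc[:-2] keeps every row except the last two,
# no matter what the index labels happen to be
df = df.iloc[:-2]
print(list(df['Country']))  # ['A', 'B']
```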
Issue #4: Inconsistent Country Name
The solution to this issue is going to be
an interesting one.
Because to solve it we will be using RegEx
or Regular Expressions.
We will need to write a regular expression
which searches all the country names and
looks for the ones that have a string written
inside square brackets.
For example, United Kingdom [UK].
The regular expression which best fits
such strings is \[.*\]. This regular
expression will match any string that
has something written inside square brackets,
regardless of its position.
That means it doesn’t matter whether the
bracketed part comes at the end, in the middle,
or even at the start of the country name.
This RegEx will work for all of those.
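To see the pattern on its own before the pandas call (the names here are made up purely for the demo), Python's re module behaves the same way:

```python
import re

pattern = r'\[.*\]'
for name in ['United Kingdom[a]', '[b]France', 'Ger[c]many']:
    # re.sub deletes whatever the pattern matches
    print(re.sub(pattern, '', name))
# United Kingdom
# France
# Germany
```

One caveat: .* is greedy, so a name containing two separate bracketed chunks would lose everything between the first [ and the last ]. In this dataset each name has at most one bracketed chunk, so the simple pattern is enough.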
The reason behind doing this is to future
proof this Python Data Processing script.
Now how are we going to use it?
To modify the country names, we will use the
string function name replace of the Pandas
library.
Let’s do that –
target_df['Country Name'] = target_df['Country Name'].str.replace(r'\[.*\]', '', regex=True)
To use the string function replace we first
write the series that we want to update.
In our case that series is the names of countries
which are stored in ‘Country Name’ column
of the target_df data frame.
Then we write the keyword str and after that
we have our function call.
Make sure each of these are separated using
dot (.) operator.
Here this function takes two main parameters:
first the regular expression, and second the
replacement, which is an empty string, just a
pair of single quotes with nothing between them,
not even a space, because we want to replace
the square-bracket part of the country name
with nothing.
In newer pandas versions you should also pass
regex=True, since str.replace no longer treats
the first parameter as a regular expression
by default.
Let’s execute the program and see the result.
Done!
Here we have the country names in the desired
format.
Just to show you the difference here are the
before and after of our data frame.
I must admit this data set has started looking
much cleaner.
Anyways, issue number 4 is also solved.
Now let’s move on to the next issue.
Issue #5: Extra Value in Column 4
Here is the data frame, and as we can see,
some rows have the string “No data” stored
in the ‘Total Recoveries’ column.
To fix this issue we will again use the replace
function.
We will replace the string “No data” with
0.
But the catch here is that both parameters
of this string function replace need to
be of string datatype.
Don’t worry we will fix those.
Meanwhile let’s write the function call
–
target_df['Total Recoveries'] = target_df['Total
Recoveries'].str.replace('No data','0')
On execution this function call will replace
the string “No data” with another string
which will be ‘0’.
Let’s execute and check the result.
Execution done!
Let’s check the data frame.
So, we have successfully replaced No data
with 0.
Now let’s fix the data type of all these
values.
Issue #6: Wrong Data Types
OK, so we know that the data stored in the columns
Total Cases, Total Deaths and Total Recoveries
is of string data type.
However, it should be of integer data type.
So, let’s fix it.
To change the data type of our data we will
use a function named ‘to_numeric’ of Pandas
library.
Like this
target_df['Total Cases']=pd.to_numeric(target_df['Total
Cases'])
Here pd is the alias that we assigned
to the pandas library while importing.
Following that we have our function call.
As the parameter to this function we pass
the column whose data type we
want to change.
Similarly let’s write the statements for
remaining two columns –
target_df['Total Deaths']=pd.to_numeric(target_df['Total
Deaths'])
target_df['Total Recoveries']=pd.to_numeric(target_df['Total
Recoveries'])
Done!
Let’s execute.
Great!
Let’s check the data set.
Nice and colorful.
We have done it.
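The three conversion calls can also be collapsed into one (a sketch on toy data, my own addition; the errors='coerce' option is an extra safety net that turns any leftover non-numeric string into NaN instead of crashing the script):

```python
import pandas as pd

df = pd.DataFrame({'Total Cases': ['10', '20'],
                   'Total Deaths': ['1', '2'],
                   'Total Recoveries': ['5', 'oops']})
num_cols = ['Total Cases', 'Total Deaths', 'Total Recoveries']
# apply runs pd.to_numeric on each column;
# errors='coerce' maps non-numeric values to NaN
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)
```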
Here is the raw dataset that we fetched
from Wikipedia, and here is the one that we
just cleaned.
The resulting dataset is already looking much
better.
With this we are done with the first two steps
of data processing which are data collection
and data cleaning.
Now, let’s perform the last step which is
data organizing.
Export the Data
Now we are in the last phase of our data processing.
The only thing left is to save the
data in a way that our data analysis software
can use.
For our data analysis we are using Tableau.
The two most widely accepted file formats
in Tableau are CSV and XLSX, so you can export
your dataset in either of these.
For further use I will export my data in XLSX
format.
However, for the sake of learning I will show
you how to save your dataset in both
formats.
So, let’s do that –
Exporting our dataset in either of these formats
is very easy.
For example
To export the dataset in CSV we simply have
to call the to_csv() function.
Like this –
target_df.to_csv(r'covid19_dataset.csv')
And to export the dataset in XLSX format of
Microsoft excel we just have to call to_excel()
function.
Like this –
target_df.to_excel(r'covid19_dataset.xlsx')
In both these function calls covid19_dataset
is the name of the file.
Make sure to specify the name of the file
with the extension.
Just like I did here, in the first function
call the extension is .csv and in the second
it’s .xlsx.
Also, note the r before the file name: it makes
the string a raw string literal in Python.
It is not strictly needed for a simple name
like this, but it stops backslashes from being
treated as escape sequences, which matters if
you ever specify a full Windows path.
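One extra option worth knowing (my own addition, not in the script above): by default both export functions also write the data frame's row index as a column; passing index=False keeps the file clean.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['A'], 'Total Cases': [10]})
# index=False leaves the row index out of the exported file
df.to_csv('covid19_dataset.csv', index=False)
with open('covid19_dataset.csv') as f:
    print(f.read())
# Country,Total Cases
# A,10
```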
Anyways, I will comment the first function
call as I want to export my data in XLSX format.
Done!
Let’s give this program the final execution.
Done!
Let’s check our project directory and see
if we have our dataset saved into a Microsoft
Excel file or not.
Here is our data.
Nicely saved into our desired format.
Let’s open up this file and see our data.
Here it is.
Great!
Everything is done just like we want.
So, we have our dataset saved.
Now we will put it into tableau and draw some
interesting charts.
There are multiple ways of visualizing or
analyzing this data.
Say, we can plot our charts using libraries
like Matplotlib or GeoPandas,
or we can use software like QGIS or ArcGIS.
But for this tutorial I am using Tableau.
First of all, it has a version which you can
download for free.
Second, it’s much easier than the other options
I just mentioned.
We don’t need to write further code or generate
sets of files for analysis; we simply take
the XLSX file and process it with the software.
So let’s do that.
But if you want me to do a tutorial using
matplotlib, geopandas or QGIS then just go
ahead and comment and let me know.
That being said, let’s open up Tableau
and draw some interesting charts.
First let’s minimize this Python script.
Now let’s open up Tableau.
Great, Tableau is open.
Next we have to load the data source.
This will be the Excel file that we generated.
Now come to your left-hand side menu.
If you have exported your file in XLSX format,
then select Microsoft Excel.
But if you have exported your file in CSV
format then select Text File.
In my case I have exported my data in XLSX
format therefore I will select the Microsoft
Excel option.
Next you know what you have to do.
Search your file and open it.
So, here is my file.
Double click and open it.
Done!
Here we will be plotting two graphs.
First will be a Map and second will be a Total
Deaths Vs Total Recoveries Chart.
Let’s start with the Map.
First click on this sheet 1.
Great!
Now come to the dimensions and double click
on Country Names.
You will have your MAP plotted on your screen.
Before moving ahead: no matter how much
you clean your data, something always stays
behind.
As you can see, we have 17 unknown values in
our dataset.
These 17 values contain the names of cruise
ships and islands which are outside the scope
of Tableau’s map.
If you have the latest version of Tableau,
chances are you may not have
these errors.
Anyway, except for these values we have
data from almost all the countries,
which is more than enough for us.
Now I want my map to show only the total number
of cases.
So, select Total Cases and drag it onto this
button called Label.
If you want to format the text, you can
also use this button.
Simply click on it and make your changes.
Like this.
Now it’s looking better.
Now our countries’ names are displayed twice.
Let’s fix it.
Go to Map Menu and Select Map layers.
First let’s change the style.
Let’s set it to outdoor.
Looking Good.
Let’s uncheck this Cities and Country/Region
Name checkbox.
Now this map is pretty clean.
Now let’s quickly draw Total Recoveries
Vs Total Deaths Chart.
For that create another Sheet.
Now again take Country from Dimensions Panel
and drag it to columns.
From measures take total Recoveries and drag
it to rows.
Now sort the data.
Now the Chart is looking pretty good.
Since the chart is showing the number of recoveries,
let’s change its color from blue to
green.
Good
Next I want to add a filter so that we can
only see those countries which fit into a
range.
Let’s say I want to see only those countries
where the number of recoveries is over 1000.
For that, again go to Measures and drag Total
Recoveries to Filters.
Then Select Sum and Press Next.
Here Select Range of Values and in this box
specify 1000.
Don’t worry you can change it later.
Now press apply and ok.
Next right-click on Sum and Select Show filter.
Now using this panel, you can see the countries
with specific values.
Also, come to this show mark labels button
and press it.
Now we can see the total number of recoveries
on our chart.
Next, we will be repeating all these steps
but for total deaths.
Therefore, again go to Measures and select
Total Deaths and drag it to Rows.
Since this chart is showing the total number of
deaths, I will change its color to red.
We can also add a filter to it.
Take Total deaths from the measures and drag
it to Filters.
Repeat all the steps.
Great.
Now again right click on the filter and select
show filter.
Done.
Now come here and select entire view.
Ok.
Now let’s create a dashboard.
Now we will use these sheets to make a dashboard.
For that come here and select the New Dashboard
option.
Before doing anything, come to this Size panel
and set it to Automatic.
Now take a Vertical object and drag it onto
the dashboard, and repeat the step one more time.
Done
Now take sheet number 1 and drag it to the
top vertical and take sheet number 2 and drag
it to the bottom vertical.
Right now these two sheets are individual
sheets, not connected to each other.
But I want that when I select a country in
the bar graph, it also gets highlighted on
the map, and vice versa.
Let’s do that –
Go to this menu “Worksheet” and select
“Action”.
Click Add action and Select Filter.
Come to this Source Sheet Section.
Here First Select The Name of your dashboard
and then select the sheet which has your bar
graph.
Also, in run action choose “Select”.
Now come to Target Sheet and again select
your dashboard and then select the sheet which
has your Map.
Also, choose show all values.
Make sure All Field is selected in Target
Filter.
After that Click OK and exit.
Now whatever country you select here in your
bar graph is also selected in your Map.
Like this.
Now let’s quickly create another action which
will work in the reverse manner, meaning
whatever country you select on the map, its
corresponding entry in the bar graph also
gets selected automatically.
This time the Source Sheet you select
is the one with your map, and the
Target Sheet is the one with your
bar graph.
Apart from this, everything will be
the same.
Done!
Now let’s say you want to see the deaths
and recoveries in Russia.
For that, simply select Russia and its corresponding
entry will appear on your screen, like this.
Similarly you can create charts and figures
and customize your dashboard and do further
data analysis but for this project I will
stop here.
So that’s all you need to do in this project.
Here we used Python Programming for Data Processing
and Tableau for Data Analysis.
If you liked this video please hit the thumbs
up button.
Also subscribe to the channel for more such
interesting tutorials.
And don’t forget to press the bell icon
to get notified next time I upload a video.
Thanks for watching.
This is Manish from RebellionRider.com
