Hi, welcome to this Data Science Dojo
beginner tutorial on getting started
with Python and R for data science. In
this beginner tutorial we'll take you
through some common Python and R
packages and libraries used for machine
learning and data analysis, as well as go
through a simple linear regression model.
We'll also help you set up Python and R on
your Windows, Mac, or Linux machine,
run your code locally, and push your code to
a GitHub repository.
So let's get started with installing Python and R.
To install Python on a Windows machine, we first
need to check if our machine is 64-bit
or 32-bit, as this will determine the
appropriate Python installer to download. To
do this, search for "about your PC"
and you'll see if your machine is 64-bit or
32-bit. In my case, it's 64-bit.
Next, in your web browser, type "python.org/downloads/windows"
and scroll down to
the version of Python you wish to
download. In my case, I'll choose the
64-bit executable installer for the latest version.
You can go with the default
installation, or you can do a custom
installation to include optional
features such as "pip". You can also specify
your install path directly under C: so it's
easier to locate your Python program later on.
Then just click install.
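Once the install finishes, one way to confirm that the Python build matches your machine is to ask the interpreter itself. This is a minimal sketch using only the standard library; the variable names are just for illustration.

```python
import platform
import struct

# Size of a C pointer in bytes, times 8, gives the bitness of this Python build.
python_bits = struct.calcsize("P") * 8

# platform.machine() reports the hardware type, for example "AMD64" or "x86_64".
machine_type = platform.machine()

print(f"Python build: {python_bits}-bit")
print(f"Machine type: {machine_type}")
```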
Once Python has installed on your computer,
you'll need to add Python to your path
to be able to run Python scripts in a directory or folder.
Download Git for Windows to set your path and run the
Python command. The commands used in this
program are basically the same as when
using the terminal on Mac or Linux.
Alternatively, for Windows, you can use
the default command prompt by searching "CMD".
You can also set your local path by
searching "environment variables"
and setting your path there.
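If you're not sure whether the path was set, a short script can list what is on it. This is a rough sketch, assuming you can already start Python some other way (for example by its full location); it only reads the PATH variable and changes nothing.

```python
import os
import sys

# PATH is one long string; os.pathsep is ";" on Windows and ":" on Mac or Linux.
path_entries = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]

# Folder that holds the Python program currently running this script.
python_folder = os.path.dirname(sys.executable)


def same_folder(a, b):
    """Compare two folder paths, ignoring case and slash style where the OS does."""
    return os.path.normcase(os.path.normpath(a)) == os.path.normcase(os.path.normpath(b))


python_on_path = any(same_folder(entry, python_folder) for entry in path_entries)

print(f"{len(path_entries)} folders on PATH")
print("Python folder:", python_folder)
print("Python folder is on PATH:", python_on_path)
```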
Here's an example of a Python script
saved in my "Documents/project 1" folder.
Using a text editor of my choice, such as
Notepad++, to write my Python
code, I saved my file as a .py file.
Then, I open my terminal, which is in
"C:/Program Files/Git/git-cmd".
I navigate to "Documents/project 1"
and I set my local Python path.
We'll set this up permanently using a .bashrc file
with the path to my Python program directly under C:.
Now, I simply type "py"
followed by the name of the file and extension.
If using Python 2.7,
just type "python" followed by the
name of the file and extension.
If we hit enter to run this,
it produces the output of
my code, which has predicted heights
using a linear regression model.
The final part of this Python Windows setup
is installing pip to be able to easily
install Python packages and libraries.
pip might not have come with your
installation if you didn't customize
your installation, or it might not be
included in an older version of Python.
So to get pip, type in your web browser
"bootstrap.pypa.io/get-pip.py",
right click to save the file in
your Python program folder, and
then run the command "python get-pip.py".
In my case, my Python program is directly under C:.
Moving on to installing R for Windows,
simply type in your browser
"cran.r-project.org/bin/windows/base"
and select the 32 or 64-bit version.
Once it is downloaded, press OK
and click "next" through all the prompts.
Once R has installed on your computer,
you can simply open the program
on your desktop and start typing R commands or code.
I recommend you download RStudio, as it makes the process of editing and debugging your code easier.
Otherwise, you're welcome to
use the R command line.
To save an R file, click on "file", "file history", and this
will save your code so you can run it later. If you wish
to set your path or working directory,
simply type "setwd"
followed by the path to where you
would like to store your R files locally.
You might need to use double
backslashes on Windows, as a single backslash
in an R string is read as an escape character rather than a path separator.
Now, let's install Python on a Mac.
Go to the Mac terminal in "Finder", "Applications", "Utilities".
Now we're going to install the
Xcode command line tools,
as this will help with the installation.
So type "xcode-select --install",
click "install",
and "agree".
Now, we're going
to use Homebrew to install Python.
So type "/usr/bin/ruby",
and we're going to use curl
and type the URL to
Homebrew on GitHub.
Press return
and enter your password if need be.
Next, add the path, so we will create a
.bashrc file to permanently add the path.
If you get an error message stating
"cannot write to path", try the "sudo chown"
command accompanying this video.
All commands can be copied and pasted, as
they accompany this video.
Next we'll install Python, so just type "brew install python",
or "python3" if you're using Python 3.
We'll also add this to our path,
so we'll create another .bashrc file.
Now, to check if pip is installed as part of your
Python program, simply type "which pip" and
it'll show you the location where
pip is installed. If you want to
check the version, just type "pip -V"
and it'll show you which version of
pip you've installed. As mentioned, pip
is useful for easily installing Python
packages and libraries.
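The same check can be done from inside Python, which may help if the terminal and the Python program disagree about where pip lives. This is a small sketch using only the standard library; it is an alternative to "which pip", not something shown in the video.

```python
import importlib.util
import shutil

# Rough equivalent of "which pip": full location of the pip command, or None.
pip_location = shutil.which("pip")

# True if the pip package can be imported by the Python running this script.
pip_importable = importlib.util.find_spec("pip") is not None

print("pip command:", pip_location)
print("pip importable:", pip_importable)
```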
Moving on to R, to install this on a Mac after
installing Homebrew, simply type
"brew tap homebrew/science"
and then type "brew install r".
To open the R command line, simply type
"r" and enter.
Now let's install Python and R on Linux.
I'm using Ubuntu. Later
versions of Ubuntu might already have
Python installed, but I'll take you
through the process anyway.
So open your terminal.
Now we're going to type
"sudo apt-get install python3.6", or "python2.7" if you're using Python 2.7.
Then we're going to type "sudo apt-get
install python-setuptools".
Lastly, install pip to easily install
Python libraries and packages by typing
"sudo easy_install pip".
To install R on Linux, simply type
"sudo apt-get -y install r-base".
Now type uppercase "R" and enter to open
the R command line.
Now that we've got the setup and installation part of this
tutorial out of the way, we can move
on to more fun stuff. Let's have a quick
play with some data to get you familiar
with some key data analysis and linear
regression concepts, as well as basic
scripting for this. I'm going to go
through an example of a simple linear
regression in Python and R using
simulated data on people's height in
centimeters and their weight in
kilograms. The model is based on a
formula, which can be produced using
Python and R functions, that gives a
predicted outcome or estimated y-value
given a certain x-value, a certain
constant, and a slope. Here is what's called
the "regression line". I like to think of
it as a line of predicted values along
the x-axis. For a given x-value, the line
predicts the y-value to fall about here
in height. The actual values are slightly
above and below the line, but the model
is generalized enough to take into
account where most cases would probably
fall. The formula gives a constant value
here, which we add to a given x-value
multiplied by a given coefficient
or slope. The constant means when x is at
0, y is at this value, and the slope means
for every one unit increase in x, y
increases by this number of units. So we
can use this formula to plug in any new
x-value of a person's weight to predict
their height or y-value. Of course, there
are many other factors, not only weight,
that could influence a person's height,
so we're just looking at a very
simple model to get started with.
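Written out as code, the formula is just the constant plus the slope times the x-value. The sketch below uses made-up numbers for the constant and slope purely for illustration; they are not the values fitted in this tutorial.

```python
def predict_height(weight_kg, constant, slope):
    """Return the predicted height: constant + slope * weight."""
    return constant + slope * weight_kg


# Hypothetical values for illustration only, not the tutorial's fitted model.
example_constant = 100.0  # predicted height in cm when weight is 0
example_slope = 1.0       # extra cm of height for each extra kg of weight

print(predict_height(70, example_constant, example_slope))  # 170.0
```

With these example numbers, each extra kilogram adds one centimeter to the predicted height, which is exactly what the slope describes.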
To implement linear regression in Python,
we first need to install a few commonly
used packages. We'll open our terminal
and pip install "sklearn" (published on PyPI as scikit-learn) for modeling.
If using Python 2.7, just type "python -m pip install" followed by the package name.
Now, we're going to pip install
pandas for data importing.
We'll also install matplotlib for plotting.
The last package we need to install is "scipy".
Next, go to your text editor and save a new Python file in
"Documents/project 1" or a folder of your choice.
I'll just call my file "LM
model" and save it as a Python file.
Also, don't forget to cd into this folder in the
terminal so you can run your script later.
Now we're going to import these
packages at the beginning of the script
when it runs, so at the top of the file
we'll type "from sklearn import linear_model".
That's our linear regression tool.
We're also going to import DataFrame
from pandas.
We also want to import pandas
and use it as pd,
and we want to import matplotlib's pyplot and use it as plt.
Now we need to read in our data, which
you can download as part of this
tutorial and save in your current folder.
We'll use the pandas read_table function for this.
We'll put our data in a
variable and just call it input_data.
We'll use the read_table function
and give it the data file
name and extension in our folder.
It's comma separated, as it's a CSV file,
we have headers and they start at line 0,
and we'll give our x and y headers specific names.
This automatically infers the data
types for each column too.
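Here is roughly what that read step can look like. This is a sketch under assumptions: the few rows below are made up and read from memory so the example runs on its own, and the column names "weight" and "height" are my guess at sensible names. In your script you would pass the downloaded file's name instead of the in-memory text.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tutorial's CSV file: a header row plus a few rows.
csv_text = """x,y
50.0,152.0
60.0,158.0
70.0,171.0
80.0,179.0
90.0,190.0
"""

# sep="," for comma-separated data, header=0 marks the header row,
# and names replaces the file's own headers with ours.
input_data = pd.read_table(
    io.StringIO(csv_text), sep=",", header=0, names=["weight", "height"]
)

print(input_data.dtypes)  # the type pandas inferred for each column
print(input_data.head())  # the first few rows
```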
Before applying a linear regression model,
let's plot the data using matplotlib's
plot function to see if the data
naturally follows a linear pattern and
a normal distribution, as linear
regression is not appropriate or useful
for datasets that don't meet these assumptions.
So we'll use a scatter plot,
and we're just plotting weight versus
height. Weight is on our x-axis
and height is on our y-axis.
We'll need to show this
graph so it can render on our screen.
Now save and run the script.
As we can see, the data is linear and
follows a normal distribution, making
linear regression appropriate to use on these data.
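A scatter plot along these lines could be sketched as follows. The data points are made up, and the example saves the plot to a temporary file rather than showing it, only so that it also runs on a machine without a screen; that choice is mine, not the video's.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw to a file so the sketch also runs without a display
import matplotlib.pyplot as plt

# Hypothetical weights (kg) and heights (cm), not the tutorial's dataset.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
heights = [152.0, 158.0, 171.0, 179.0, 190.0]

fig, ax = plt.subplots()
ax.scatter(weights, heights)  # weight on the x-axis, height on the y-axis
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("Height (cm)")

# Saved instead of shown; on your own machine, drop the "Agg" line above
# and call plt.show() to render the graph on screen.
plot_path = os.path.join(tempfile.gettempdir(), "weight_vs_height.png")
fig.savefig(plot_path)
plt.close(fig)
print("Saved plot to", plot_path)
```

A scatter plot mainly shows whether the relationship looks linear. To look at the distribution itself, a histogram of each variable is one option.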
Now we'll define our x predictor
variable, weight, and our y outcome variable, height.
We'll use pd, our pandas alias,
and the DataFrame function.
We'll use weight as our predictor
and we'll make height our outcome variable.
Now we'll fit a model to the
data using the fit function and use this
to predict height given weight.
So we're using a linear regression model
and we'll fit the model to the data.
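Put together, defining the variables and fitting the model might look like the sketch below. The five data rows and the names x_predictor, y_outcome, and model are illustrative assumptions, not the tutorial's actual script or dataset.

```python
import pandas as pd
from sklearn import linear_model

# Hypothetical data standing in for the tutorial's file.
input_data = pd.DataFrame({
    "weight": [50.0, 60.0, 70.0, 80.0, 90.0],
    "height": [152.0, 158.0, 171.0, 179.0, 190.0],
})

# scikit-learn expects a 2-D predictor, so keep weight as a one-column DataFrame.
x_predictor = input_data[["weight"]]
y_outcome = input_data["height"]

# Create the linear regression model and fit it to the data.
model = linear_model.LinearRegression()
model.fit(x_predictor, y_outcome)

print("constant:", model.intercept_)
print("slope:", model.coef_[0])
```

The printed constant and slope are the two numbers in the formula described earlier.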
We can now compare the first, say, six
predicted values from the predict
function with the actual height
values to see if they're on par.
First we're going to get all the predicted values,
using our predictor
variable to predict the outcome.
We'll print some subheadings to
differentiate the list of predicted
values from the actual ones,
and we'll have a look at the first six
predictions, 0 to 6, and compare them
with the first six actual values.
All right, we'll save and run the script.
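As a rough sketch of that comparison, again on made-up data and with variable names of my own choosing:

```python
import pandas as pd
from sklearn import linear_model

# Hypothetical data, eight rows so there are at least six to compare.
input_data = pd.DataFrame({
    "weight": [50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0, 85.0],
    "height": [152.0, 155.0, 160.0, 166.0, 170.0, 176.0, 181.0, 184.0],
})
x_predictor = input_data[["weight"]]
y_outcome = input_data["height"]

model = linear_model.LinearRegression()
model.fit(x_predictor, y_outcome)

# Get all predictions, then slice the first six (positions 0 up to, not including, 6).
predicted = model.predict(x_predictor)
first_predicted = predicted[0:6]
first_actual = y_outcome[0:6].to_numpy()

print("Predicted heights:")
print(first_predicted)
print("Actual heights:")
print(first_actual)
```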
A quick eyeball of the first few predictions against the
actual values shows the model was not far off
the mark, which is good. However, to
properly assess a model, we can use
measures such as R squared, which is the
percentage of explained variance.
So we'll go back to our script and
use the score function to get
the R squared,
and we want to print this, obviously.
Now we're just going to comment out the
above lines as we no longer want to view these.
We'll save and run our script again.
As we can see, a high R squared
shows the model explained most or nearly
all of the variance, which is good.
However, relying solely on R squared is
probably not good enough when assessing
and measuring our model's predictions.
Sometimes it can be misleading to look
at the R squared alone, but the course will go
through other measures you can use.
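To make "percentage of explained variance" concrete, R squared can be worked out by hand as one minus the leftover (residual) variation divided by the total variation. The sketch below does this with plain Python on made-up actual and predicted heights; it is the same quantity that scikit-learn's score function reports for a linear regression.

```python
def r_squared(actual, predicted):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Hypothetical values for illustration only, not the tutorial's results.
actual_heights = [152.0, 158.0, 171.0, 179.0, 190.0]
predicted_heights = [150.6, 160.3, 170.0, 179.7, 189.4]

print("R squared:", r_squared(actual_heights, predicted_heights))
```

A value close to 1 means the predictions leave very little of the variation in height unexplained.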
To perform the same analysis in R, we'll
first install a commonly used R package,
ggplot2, which is used for effectively
visualizing and analyzing data.
I'll select a CRAN mirror that's close to me.
We need to load ggplot2 whenever we want to use it.
We'll read in our data
using the read.table function.
We'll put our data in a variable,
use read.table,
and give it our file in our current working directory.
It's comma separated,
and we do have headers, so we'll just
use the default header names x and y.
This automatically infers data types too.
We'll also attach our data frame so we
can refer to column headers or variable
names without having to refer to the
name of our data each time,
making this more convenient.
Now we'll plot the data to see its
normal distribution, but we can also use
ggplot2 to plot the regression line or
the line of best fit.
So we'll plot our x and y, which is weight and height,
and in the smooth function, we'll specify a linear model.
As we could see before,
the actual heights are close to the
predictions of the line.
Implementing a simple linear regression in R
is quite easy using the lm function.
Now, to see the first few predictions of height, we'll
use the predict function.
We first need to get all of the predictions,
and we're just going to print the first
few to have a quick look,
so the first 0 to 6,
and we'll compare them
with our actual values.
As seen before, for the first few cases,
the predictions are pretty close.
To print the R squared, or percentage of
explained variance, for assessing the
model, we'll use summary.
As seen before, it explains nearly all
the variance, but it's a good idea to
also look at errors or other measures
for this. Finally, now that we're finished,
we'll detach our data.
In the last part of this tutorial, we'll
push our code to a GitHub repository so
you can share your code publicly or
store it privately if you wish. You can
create a GitHub account for free. You can
also follow Data Science Dojo to clone
or access a copy of the code provided as
part of the course material.
Once you have created an account, add a new
repository, without initializing it, via
the GitHub website. The instructions to
push your code to GitHub are on the website,
but I'll take you through the process
anyway. First open your terminal and cd
into your current project directory.
You'll need to configure your user name
and user email.
Now configure your username.
We'll initialize our project directory
as our git repository.
Then we'll add all the
files in our project folder. We're not
pushing it live yet, it's just selecting the files.
Commit your files to track the
first submission with a message, should
you wish to publish updates later on.
I'm just going to say "first go at implementing
simple linear regression".
As you can see, all the files in the "project 1" folder are there.
Now we're going to give
the URL of our main repository,
so go to the main page of your GitHub repo
and copy the URL, and we're going to paste it
into the terminal when adding a remote repo.
Finally, we're going to push our code to
the repo's master branch on GitHub.
Now, if you have a look at your GitHub
repo, you can see all your files are there.
All the work we have done in this tutorial is here.
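For reference, the sequence of git steps just described can be laid out as below. This is a sketch that only builds and prints the commands; it runs nothing. The name, email, and repository URL are placeholders, and the "--global" flag and the remote name "origin" are common defaults I'm assuming rather than details confirmed in the video.

```python
import shlex


def git_push_commands(user_name, user_email, commit_message, remote_url):
    """Return the git commands for a first push, one argument list per step."""
    return [
        ["git", "config", "--global", "user.name", user_name],
        ["git", "config", "--global", "user.email", user_email],
        ["git", "init"],                                  # make the folder a git repository
        ["git", "add", "."],                              # select all files in the folder
        ["git", "commit", "-m", commit_message],          # record them with a message
        ["git", "remote", "add", "origin", remote_url],   # point at the GitHub repo
        ["git", "push", "-u", "origin", "master"],        # push to the master branch
    ]


# Placeholder values: replace these with your own details and repository URL.
commands = git_push_commands(
    "Your Name",
    "you@example.com",
    "First go at implementing simple linear regression",
    "https://github.com/your-username/your-repo.git",
)

for command in commands:
    print(shlex.join(command))
```

You can copy the printed lines into your terminal one at a time from inside your project folder.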
Alternatively, after
initializing your GitHub repo via the
site, you can simply drag and drop your
project folder onto the main page of your repo.
Now that you've gone through the basics,
you should feel ready to dive into the
course and gain a deeper and wider
understanding of data science.
You know how to set up Python and R on your
machine, how to do basic scripting for
reading and visualizing data, how to
apply a model and assess it, and now you
can share your hacks and projects on
GitHub. The data used in this tutorial,
the coded examples, the commands, the
URLs to programs, and so on all
accompany this video. My name is
Rebecca Merrett. Feel free to reach out
to me by commenting on this video. I'm
more than happy to help you get ready
before you start your course. Thanks for
watching, and happy analyzing.
