Hi everyone, my name is Iliya and I’m the
Co-founder of 365 Data Science.
Hope you’re excited because we’ll be talking
about some really important data science stuff
in this video.
And if you’re here, chances are you already
know a thing or two about the topic, which
makes this even more exciting for me!
But first things first, for those of you
who don’t know 365 Data Science yet, we
are an online training platform for aspiring
data scientists.
You might have seen our name tossed around
websites like KDNuggets, BigDataMadeSimple,
or TowardsDataScience.com – but even if
you haven’t, that’s okay!
What we’re going to talk about here is not
tailored to professional data scientists anyways
– it’s for anyone who’s interested in
what data science is, why it’s the number
one job in the US right now (actually four
years in a row), and how you can make a career
as a data scientist who’s paid that oh-so-lovely
6-figure salary.
So, if you’re a professional, you really won’t benefit from watching this video,
but if you’re a beginner, or a motivated
quick learner who wants to learn more about
how to make it in the field – stick around!
Also, we’ll talk a little about the best
online training for aspiring data scientists
at the end, and we’ll share a special YouTube
coupon for those of you who want to sign up
at a substantial discount.
So, yeah, stick around – there’s plenty of awesome stuff coming!
Alright, moving on.
Now, we didn’t become data scientists overnight,
and we’re not offering you a magical solution
that will get you hired tomorrow.
We read the textbooks, did tons of practice,
read some more books – the familiar cycle
– but all of us really started thinking
like data scientists once we understood what
data science is.
Sounds like the easiest thing, right?
Yeah, but you’d be surprised how tricky
it is to put all the buzzwords and tech terms
together and have them make sense.
Anyway, what we extracted from our experience
are three knowledge pillars.
Three data science pillars around which all
your learning and career drive need to be
organized, and that’s what we’ll talk
about right now: these three pieces of information
that help you make sense of the field and
the many opportunities it offers:
- The terms in data science
- How they fit together
and
- The timeline of data science processes
Alright – let’s get started!
As I said, my name is Iliya and I am the Co-founder
of 365 Data Science.
I am also responsible for all things Mathematics,
Statistics, Machine and Deep learning related
here at 365 Data Science.
I am co-hosting this with my colleague Simona
– she teaches R Programming and Stats, and
I’m going to pass the mic to her for a minute.
Hey everyone…
It’s pretty awesome to be here and talk
to you about data science, because there are
definitely some confusing things in the data
science world and misconceptions flying around.
See, we feel like data science terms and related buzzwords are being tossed around by people from all walks of life who have taken an interest in the topic.
This is not to say that they don’t know
what they are talking about – on the contrary
– but they often use the terms of data science
assuming everyone knows and understands where
these terms come from and how they fit into
the bigger picture.
What I mean is, often, the people discussing
data science are not teachers.
And this creates a massive amount of uncertainty and confusion for beginner data scientists.
There are also a lot of step-by-step guides out there about how to become a data scientist, but nobody really delves into what data science is, or how and why it works the way it does.
Okay, everyone.
So, Data science.
Can you give a single comprehensive definition
of what data science is?
Try it.
I believe it will prove extremely difficult
to come up with something that doesn’t invite
a hundred follow up questions.
That’s the thing about data science – it’s a universally recognizable term that is in desperate need of a clear explanation.
From my experience, I believe this is a term
that escapes any single complete definition,
which makes it difficult to use, especially
if the goal is to use it correctly.
Most articles and publications use the term
freely, with the assumption that it is universally
understood.
However, data science – its methods, goals, and applications – evolves with time and technology.
25 years ago, data science referred to gathering and cleaning datasets, and then applying statistical methods to that data.
In 2018, data science has grown into a field that encompasses data analysis, predictive analytics, data mining, business intelligence, machine learning, and so much more…
Right.
In fact, because no one definition fits the
bill seamlessly, it is up to those who do
data science to define it.
Here’s the plan.
We will define the key processes in data science
and get to a complete description of the field.
Why?
Because if you want to be a competitive job
applicant you need to understand how various
data science activities fit into the big picture.
You also need to learn about the timing of
the different data processing analyses, as
well as who carries them out.
And let’s not forget how.
Does that make sense?
If you want to work in medicine, you will
learn how the human body functions and then
decide whether you want to be a paediatrician,
a nurse, an oncologist, etc.
That’s what we’re about to do here, but
for data science 😊
Let’s start with a picture.
This is data science in a nutshell.
But we’re not interested in just the nutshell.
This infographic contains the most concise
representation of data science terms that
we know of, and ultimately, all the things
you need to have a solid grasp of before entering
the field of data science.
I mean, that was our grand idea when we designed
it, right?
You are absolutely right, Simona.
By the way, if you guys want to download the
infographic and this entire presentation,
there’s a link in the description that will
take you to our download page.
Or you can click the widget on the screen
to the same effect.
Okay, let’s get started by talking about
data!
Before anything else, there is always data.
Data is the foundation of data science.
In the context of data science, there are
two types of data: traditional, and big data.
You’ve heard of both, I don’t doubt that.
Traditional data is data that is structured
and stored in databases which can be managed
from one computer; it is in table format,
containing numeric or text values.
Big data, on the other hand, is… bigger
than traditional data, and not in the trivial
sense.
It isn’t simply represented by numbers and
text, but also by images, audio, mobile data,
and so on…
In addition, big data has high velocity – what does that mean? It means that it’s retrieved and computed in real time.
Finally – think about its volume: big data
is measured in tera-, peta-, and exa-bytes,
and it is often distributed across a network
of computers… why?
Because it is very, very big!
😊
So, to get some perspective… where does
it all come from?
Well, traditional data may come from sources
like basic customer records of a retail store,
or historical stock price information in the
world of finance.
Big data, however, is all-around us.
A consistently growing number of companies
and industries use and generate big data.
Consider online communities, for example.
This might seem like an anecdotal example
but really think about it: Facebook, Google,
and LinkedIn.
They generate massive amounts of user data.
Facebook alone, for example, collects information about the pictures, places, posts, demographics, products used, and audio and video shared by all of its 2.2 billion users.
In fact, right now, digital data in the world amounts to 3.2 zettabytes. That’s 3.2 times 10 to the power of 21 bytes.
And collectively, 90% of all the data we have
(since the beginning of time), has been collected
in the last two years.
That’s a scary amount of big data gathering,
and it’s only going to grow.
Right.
Now, imagine the following scenario.
You’re already a data science professional
and you are working for a telecommunications
company.
A superior member of staff tells you one of
two things.
A: “We need to consider client satisfaction
in the next quarter, so we can predict churn
rate.
Oversee the process and come up with some
numbers.”
Or B: “We have an enormous amount of customer
data from the previous quarter.
Can you oversee the analysis and deliver an
approximation of churn rates for the next
quarter?”
Can you pinpoint the difference between A
and B?
I’ll give you 10 seconds.
The difference is that in the first case,
you do not have data.
You would need to gather it.
This data can come from surveys, like asking
people how much they like or dislike a product
on a scale of 1 to 10…
Ok.
So, you have surveyed all these people.
And their responses have been sent to you.
Is this data ready to be analysed?
Not exactly.
This is called raw data, because you still
haven’t done any processing on it.
It is untouched data that cannot be analysed
straight away.
Awesome!
Now let me introduce a new term: Pre-processing.
This is what we can think of as “preliminary
data science”.
Pre-processing is an absolutely crucial group
of operations that convert raw data into a
format that is more understandable and hence,
useful for further processing.
Plus, it fixes the mistakes that occurred
during the gathering phase – if you’ve
ever worked with data before, then you know,
these happen constantly.
Like, when we’re thinking about customer data, it’s unrealistically easy to have a person registered as 932 years old, named ‘United Kingdom’, with ‘Kevin Smith’ listed as their country.
Obviously, those data entries are incorrect
and therefore must be handled before proceeding
to any type of analysis, right?
Yeah, absolutely – that’s why there are
tons of pre-processing practices in place.
I’ll tell you about some of the more common
ones.
First is class-labelling your observations.
This consists of arranging data by category,
or labelling data points to the correct data
type.
For example, numerical, or categorical.
Number of goods sold daily would be numerical
– you can manipulate this information mathematically…
and a person’s profession or place of birth
is categorical because no mathematical operations
can be done on this information.
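Just to make class-labelling concrete, here is a tiny Python sketch using pandas – the column names and values are made up for illustration, they’re not from any real dataset:

import pandas as pd

# Hypothetical records: the columns and values are illustrative only.
df = pd.DataFrame({
    "goods_sold_daily": [12, 7, 19],                # numerical: maths applies
    "profession": ["nurse", "teacher", "chef"],     # categorical: no maths applies
})

# Label each column with an explicit class (data type).
df["goods_sold_daily"] = df["goods_sold_daily"].astype("int64")
df["profession"] = df["profession"].astype("category")
print(df.dtypes)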
Alright!
Just keep in mind that with big data the classes are extremely varied, therefore instead of ‘numerical’ vs ‘categorical’, the labels will be ‘text’, ‘digital image data’, ‘digital video data’, ‘digital audio data’, and so on.
Okay.
Then, there is data cleansing or scrubbing.
These are techniques for dealing with inconsistent
data, like misspelt categories and missing
values.
You know the lot – people sharing their
name and occupation but omitting their age,
or gender.
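If you want to picture what cleansing looks like in practice, here’s a minimal Python sketch with pandas – the misspelling and the missing age are invented examples:

import pandas as pd

# Hypothetical raw survey responses with typical collection errors.
raw = pd.DataFrame({
    "name":       ["Kevin Smith", "Ana", "Lee"],
    "occupation": ["Enginer", "Engineer", "Teacher"],   # misspelt category
    "age":        [34, None, 29],                       # missing value
})

# Fix the known misspelling and fill the missing value with the median age.
raw["occupation"] = raw["occupation"].replace({"Enginer": "Engineer"})
raw["age"] = raw["age"].fillna(raw["age"].median())
print(raw)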
Data shuffling is another interesting one!
Imagine shuffling a deck of cards.
It ensures that your dataset is free from
unwanted patterns caused by problematic data
collection.
Like what??
Like… if the first 100 observations in your
data are from the first 100 people who have
visited your website; this data isn’t randomized,
and it’s likely to reflect just the behavior
of those 100 people, when your website was
still under construction.
In a word, data shuffling prevents patterns due to sampling from emerging.
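In code, shuffling a dataset is a one-liner – here’s a small sketch in Python with pandas, using an invented visitor log:

import pandas as pd

# Hypothetical visitor log: the earliest rows all come from the site's first days.
visits = pd.DataFrame({"visitor_id": range(1, 101), "pages_viewed": range(100, 0, -1)})

# Shuffle the rows so early-collection patterns don't dominate the sample order.
shuffled = visits.sample(frac=1, random_state=42).reset_index(drop=True)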
And finally, consider data masking.
This is primarily a big-data specific technique.
And no wonder.
When collecting data on a mass scale, you can accidentally get your hands on a lot of sensitive information which you need to urgently hide, even from yourself.
Masking aims to ensure that any confidential
information in the data remains private, without
hindering the analysis and extraction of insight.
Essentially, the process involves concealing
the original data with random and false data,
allowing the scientist to conduct their analyses
without compromising private details.
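One simple way to mask data – and this is just a sketch with an invented table, not the only technique – is to replace the sensitive column with an irreversible hash in Python:

import hashlib
import pandas as pd

# Hypothetical customer table containing a sensitive column.
customers = pd.DataFrame({"email": ["a@example.com", "b@example.com"], "spend": [120, 85]})

# Replace each email with a hash so the analysis of "spend" can go ahead
# without exposing who the customers actually are.
customers["email"] = customers["email"].apply(
    lambda e: hashlib.sha256(e.encode()).hexdigest()[:12]
)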
And let’s not forget – all of this is
just at the very beginning of doing data science.
Pre-processing your data to make it usable is laying the groundwork.
Alright.
Let’s assume your databases are clean and organized at this point, so let’s get into the real deal now.
Before we begin, I want to make sure we are
on the same page here.
There are two ways of looking at data, right?
One – with the intent to explain behaviour
that has already happened, and you have gathered
data for it.
And two – to use the data you already have
in order to predict future behaviour that
has not yet happened.
You need to be very clear on this distinction
because it can be what tilts the scales one
way or another when you’re deliberating
which data science path is best for you.
There is also a temporal relationship between
the two ways of looking at data.
Basically, before data science jumps into
predictive analytics, it must look at the
patterns of behaviour the past provides, right?
It must analyse them to draw insight which
will then inform the direction in which forecasting
should go.
Yes, and there’s a name for the data science
that focuses on explaining the past – that’s
Business intelligence.
Think about it this way.
Business Intelligence provides data-driven
answers to questions like: How many units
were sold?
In which region were the most goods sold?
Which type of goods sold where?
How did the email marketing perform last quarter
in terms of click-through rates and revenue
generated?
How does that compare to the performance in
the same quarter of last year?
Although Business Intelligence does not have
“data science” in the title, it is part
of data science, and not in any trivial sense.
I’m sure you’re beginning to get that
feeling already, right?
So, what do you do as a Business Intelligence
Analyst?
Of course, Data Science can be applied to
measure business performance.
But in order for the Business Intelligence
Analyst to achieve that, they must use specific
data handling techniques.
Let’s review some of them.
Right.
So, the starting point of all data science
is data.
For BI Analysts that consists of things like
monthly revenue, customer volume, sales volume,
and so on.
And as soon as they get their hands on that
data, they must go through three fundamental
operations.
First – extract meaningful metrics from
the data set – like average quarterly revenue
per new customer…
Second – identify the Key Performance Indicators
– you know, only those metrics that will
clearly show how the business is doing, and…
Third – analyse the data to extract insights from it.
Think about it for a second: why is business
intelligence an important data science stepping
stone?
Well, consider this…
The company you’re working for is running
a marketing campaign and you have received
the data.
You examine it and identify one of the metrics:
it indicates all the traffic to a page on
your website.
Then you think about what a KPI could be in
this case, and you realize that a KPI would
show the volume of the traffic to the same
page, but only if generated from users who
have clicked on a link in your ad campaign
to get there.
This way, you can check if the ads you’re
positioning are in fact working and driving
customers to click on a link.
In turn, this will determine whether you should
continue to spend on ads or not.
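Here’s roughly what that looks like in Python – the visit log and the ‘ad_campaign’ source label are hypothetical, just to show the metric-versus-KPI idea:

import pandas as pd

# Hypothetical page-visit log; "source" records how each visitor arrived.
visits = pd.DataFrame({
    "page":   ["/landing"] * 5,
    "source": ["ad_campaign", "organic", "ad_campaign", "email", "ad_campaign"],
})

metric = len(visits)                               # all traffic to the page
kpi = (visits["source"] == "ad_campaign").sum()    # only the traffic the ads generated
print(f"Total visits: {metric}, ad-driven visits: {kpi}")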
Of course, this is not where the Business
Intelligence Analyst responsibilities conclude.
I want you to keep in mind this next thing I’m about to tell you. Really, hold on to it.
Data Science is about telling a story.
I will say that again: data science is about
telling a story.
And crunching the numbers is just the introduction
to the story.
So, apart from handling strictly numerical
information, data science, and specifically
business intelligence, is about visualizing
the findings, and creating easily digestible
images supported only by the most relevant
numbers.
After all, all levels of management should
be able to understand the insights from the
data and inform their decision-making.
And this is in the hands of the business intelligence
analyst.
Business intelligence analysts create dashboards
and reports, accompanied by graphs, diagrams,
maps, and other comparable visualisations
to present the findings most relevant to the
current business objectives.
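If you’d like a feel for the coding side of this, here’s a minimal Python sketch with matplotlib – the quarterly figures are invented, and in practice you might reach for Tableau or another tool instead:

import matplotlib.pyplot as plt

# Hypothetical quarterly revenue figures to visualize for a report.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 170]

plt.bar(quarters, revenue)
plt.title("Revenue per quarter")
plt.ylabel("Revenue (thousands)")
plt.show()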
That’s all super interesting but what are
the directly actionable results of these analyses?
What line of work would need a BI analyst…
In other words, where would you come in as
a BI analyst?
Let’s use our analytical brains and try to answer that together.
If you are a hotel manager, would you keep
the prices of rooms constant all year round?
Probably not, if you want to attract visitors
when the tourist season is not in bloom and
if you want to capitalise on it when it is.
And how would you inform your strategic decisions
to lower or increase room prices?
With BI insight.
Data science is constantly applied to inform
price optimisation techniques.
What BI Analysts can do is extract information
in real time (think about booking sites) and
then compare this to historicals, update the
dashboards, and reflect the necessary price
change instantaneously.
The exciting bottom line is…
BI allows you to adjust your strategy to past
data as soon as it is available.
Alright.
Here’s some more food for thought.
Think about inventory… of any sort, really.
Which is better?
Over-supply.
Or under-supply?
If you chose secret option number three – neither
– you are correct!
Over- and undersupply can cause massive problems in a business.
Too much inventory and you lose money you
have already invested, too little inventory,
and you lose money you could have potentially
gained.
And implementing effective inventory management
means supplying enough stock to meet demand
with the minimal amount of waste and cost.
And the million-dollar question: how do you
ensure your inventory management is effective?
Data science, and business intelligence, are invaluable for handling over- and undersupply.
A BI Analyst can carry out in-depth analyses
of past sales transactions to identify seasonality
patterns and the times of the year with the
highest sales.
This can then inform inventory managers and
result in the implementation of effective
inventory management techniques that meet
demands at minimum cost.
Easy.
Once the BI reports and dashboards have been prepared, and insights extracted from them, this information can become the basis for predicting future values.
And that’s where it becomes truly awesome.
But, the accuracy of your forecasts will differ
based on the methods and techniques you decide
to apply.
And this is where the more popular data science
concepts come out to play.
Examples of such techniques are: neural networks,
deep learning, time series, and random forests.
But let’s backtrack a little.
Just as there is a distinction between traditional
and big data, there is also a distinction
between traditional methods in predictive
analytics and machine learning.
Judging by the way I’ve positioned these,
can you guess what the basis of this distinction
is?
That’s right – it’s the type of data
the methods operate on.
Traditional data invites traditional analytics,
like regression, cluster, and factor analysis,
whereas machine learning is far better equipped
to handle big data.
As you can imagine, machine learning stands on the shoulders of classical statistical forecasting…
In fact, people in the data science industry
refer to some of these methods as machine
learning too, but when I talk about machine
learning, I am referring to newer, smarter,
better methods, like deep learning.
Just something to keep in mind 😊
Right.
Now.
What’s the statistical knowledge you need for traditional analytics in data science?
Most often, data science employs one of these
five analyses.
Linear regression
Logistic regression
Cluster analysis
Factor analysis
Time series analysis
Okay, let’s do a little exercise.
I will describe what each analysis does, but I won’t tell you its name.
Your task is to try and match the methodology
with the correct name.
Think of this as a self-test: how well do you know your way around the fundamentals of statistics?
It’ll give you a relative idea of whether
you’re prepared for practical interview
questions for a data science position.
Alright.
This method is used for quantifying causal
relationships among the different variables
included in the analysis.
You will use this if you need to assess the relationship between house prices, the size of the houses, and the year they were built.
The model calculates coefficients with which
you can predict the price of a new house,
if you have the rest of the relevant information
available.
Is this cluster or factor analysis… or is
it time series analysis?
If you said it’s a regression, you are correct
– but which type?
This, of course, is linear regression.
There is a line which governs the relationship
between the size and the price.
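If you want to see what that looks like in code, here’s a minimal sketch with scikit-learn in Python – the house sizes, years, and prices are made-up numbers:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical houses: [size in square metres, year built] and price in thousands.
X = np.array([[50, 1990], [75, 2005], [100, 2010], [120, 2018]])
y = np.array([150, 230, 310, 400])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)    # the estimated coefficients
print(model.predict([[90, 2015]]))      # predicted price for a new house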
Okay.
This exploratory data science technique is
applied when the observations in the data
form groups according to some criteria.
It takes into account that some observations
show similarities, and facilitates the discovery
of new significant predictors, ones that were
not part of the original conceptualisation
of the data.
If your house data looks like this, this analysis
would identify these groups: small expensive
houses in the city centre, big cheap houses
in the suburbs, and big expensive houses in
good neighbourhoods.
This, of course, is cluster analysis.
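A quick sketch of clustering in Python, with scikit-learn’s k-means and invented house data, just to show the mechanics:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical houses: [size in m^2, price in thousands].
houses = np.array([[45, 320], [50, 310], [180, 150],
                   [200, 160], [190, 480], [210, 500]])

# Ask for three clusters, e.g. small expensive, big cheap, big expensive.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(houses)
print(kmeans.labels_)    # the cluster each house was assigned to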
Next one…
If clustering is about grouping observations
together, this analysis is about grouping
features together.
Data science resorts to using it to reduce
the dimensionality of a problem.
Let me explain.
If you have a questionnaire with 100 questions, and every 10 questions are trying to determine a single general attitude, this analysis will identify those 10 factors.
Did you guess this correctly – this is factor
analysis.
Once factor analysis identifies some factors,
they can be used for a regression that will
deliver a more interpretable prediction.
I hope you’re not surprised – a lot of the techniques in data science are integrated like this, which is why a solid fundamental understanding of statistics is genuinely a must.
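For the curious, here’s a tiny factor analysis sketch in Python with scikit-learn – the questionnaire answers are random placeholders, only there to show the dimensionality reduction:

import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical questionnaire: 200 respondents, 100 questions (placeholder data).
rng = np.random.default_rng(0)
answers = rng.integers(1, 6, size=(200, 100)).astype(float)

# Reduce the 100 questions to 10 underlying factors.
fa = FactorAnalysis(n_components=10, random_state=0)
factors = fa.fit_transform(answers)
print(factors.shape)    # (200, 10): one score per respondent per factor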
We’re left with two statistical methods
– time series analysis, and logistic regression.
That’s almost too easy, but let’s do it
anyways!
This is a popular method for following the
development of specific values over time.
It is widely used in economics and finance
because their subject matter is stock prices
and sales volume – variables that are typically
plotted against time.
I am sure you noticed that I said the word “time” about sixteen times – this is, of course, a description of time series analysis.
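As a small illustration – with invented monthly sales numbers – here’s how you might start following a value over time in Python with pandas:

import pandas as pd

# Hypothetical monthly sales volume, indexed by date.
sales = pd.Series(
    [100, 120, 130, 90, 95, 160, 170, 150, 140, 180, 200, 210],
    index=pd.date_range("2017-01-01", periods=12, freq="MS"),
)

# A simple way to track development over time: a 3-month rolling average.
print(sales.rolling(window=3).mean())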
And, finally, logistic regression.
Since not all relationships between variables
can be expressed as linear, data science makes
use of methods like logistic regression to
create non-linear models.
Logistic regression operates with 0s and 1s.
For instance… think about the process of
hiring new staff.
Companies apply logistic regression algorithms
to filter job candidates during their screening
process.
If the algorithm estimates that the probability
that a prospective candidate will perform
well in the company within a year is above
50%, it would return 1, or a successful application.
Otherwise, it will return 0, and that candidate
will not be called in for an interview.
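Here’s a minimal sketch of that screening idea in Python with scikit-learn – the candidate features and labels are entirely made up:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical screening data: [years of experience, test score]; 1 = performed well.
X = np.array([[1, 55], [2, 60], [5, 80], [7, 85], [3, 70], [8, 90]])
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression().fit(X, y)

candidate = np.array([[6, 82]])
print(clf.predict_proba(candidate)[0, 1])   # probability of performing well
print(clf.predict(candidate))               # 1 if that probability is above 50%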
Does that make sense?
I’m pretty sure things are starting to click
already!
Just keep this in mind: Linear and logistic
regression, cluster and factor analysis, and
time series are at the core of the traditional
methods for predictive analytics in data science.
A lot of the smart advanced machine learning
methods for data handling are heavily grounded
in this statistical theory, so if you’re
interested in Data Science, please make sure
to cover these foundations well.
It’ll be a complete game changer.
(This is also one of the reasons why our program
starts from scratch explaining Math and Statistical
fundamentals first, before studying more advanced
topics.)
Okay, now that you know what traditional data
science does, you are definitely trying to
think of the practical applications and where
it fits in the world.
Or at least, you are now.
The application of these analyses is extremely
broad; data science is finding a way into
an increasingly large number of industries
and it’s getting increasingly creative.
But user experience and forecasting sales
are still the big names here.
Think about it.
When companies launch a new product, they
often design surveys that measure the attitudes
of customers towards that product.
That’s a crucial marketing step.
After the BI team has generated their dashboards,
the data scientists can spread their wings.
They can analyse the results by grouping the
observations by segments (for example, sales
regions), and then analysing each segment
separately to extract meaningful predictive
coefficients.
After all calculations are complete, the team
reaches the conclusion that the product needs
slight but significantly different adjustments
in each segment to maximise customer satisfaction.
Only then can everyone be happy and isn’t
that the goal?
What about forecasting sales volume?
This type of analysis invites time series
onto the scene.
You have sales data that’s been gathered
until a certain date, and you want to know
what is likely to happen in the next sales
period, or a year ahead.
You apply mathematical and statistical models and run multiple simulations; these simulations are what provide you with future scenarios.
This is at the core of data science, because
based on these scenarios, the company can
make better predictions and implement adequate
strategies.
Don’t worry if this sounds a little vague
at this point – data science is massive
and sometimes, you need the bird’s eye view
to stay on track.
Alright.
We are now at the end of the line.
We’ve looked at data, explanatory dashboards
from the BI team, and predictive analytics
using classical statistical approaches.
Data science is taking shape, but it’s missing one powerful methodology, isn’t it?
In fact, that’s what most people consider
true data science – Machine learning.
But just as it doesn’t make sense to learn
how to run before you can walk, approaching
machine learning needs to be done with caution.
Do you understand the basics?
If you’ve been paying attention, you have
a good idea at this point.
So…
Machine learning is the state-of-the-art approach
to doing data science.
And rightly so.
The main advantage machine learning has over
any of the traditional data science techniques
is the fact that at its core resides the algorithm.
You have heard of it before, I am sure.
The algorithm is the set of directions a computer uses to find a model that fits the data as well as possible.
The difference between machine learning and traditional data science methods is that in machine learning we do not give the computer instructions on how to find the model; instead, the algorithm performs a trial-and-error-like process to find it on its own.
Unlike in traditional data science, human
involvement is minimized.
In fact, machine learning algorithms, especially deep learning ones, are so complicated that humans cannot genuinely understand what is happening “inside”, even though we are the ones who design them.
Isn’t that fascinating!
But let’s get back to the algorithm because
that’s the core.
A machine learning algorithm is like a trial-and-error
process, but the special thing about it is
that each consecutive trial is at least as
good as the previous one.
That opens the door to much more accurate
predictions and models.
But bear in mind that in order to learn well,
the machine has to go through hundreds of
thousands of trial-and-errors, with the errors
decreasing throughout.
Then, once the training is complete, the machine
will be able to apply the complex computational
model it has learned to novel data and still
produce highly reliable predictions.
This is where the power of machine learning
lies, and the reason why it’s considered
the epitome of data science.
Okay, let me give you a better idea of how machine learning works.
There are three major types of machine learning:
supervised, unsupervised, and reinforcement
learning, right?
If you’ve been following data science trends,
you should have at least heard about them.
Imagine having some data.
A lot of it, actually – big data, consisting
of video files and images.
And you have labelled your data – these
are videos of cats, these of dogs, and third
are animals that are neither cats nor dogs.
In supervised learning, you want to have labelled data, just like the data you have right now.
The machine gets the data, and that data is associated with a correct answer; if the machine’s output does not match that correct answer, an optimization algorithm adjusts the computational process, and the computer does another trial.
Of course, keep in mind that, typically, the
machine does this on a batch of, say, 1000
data points at once.
Machine learning is powerful – that’s
the takeaway here, and I will probably say
it at least once more.
Support vector machines, deep neural networks,
random forest models, and Bayesian networks
are all instances of supervised learning.
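To make the supervised idea concrete, here’s a small sketch in Python using a random forest from scikit-learn – the “image features” are random placeholders standing in for real cat/dog data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical labelled data: each row stands in for features extracted from an image,
# and the label says whether it shows a cat (0), a dog (1), or neither (2).
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 20))
labels = rng.integers(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on data the model has never seen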
Alright, let’s get back to the data again,
but this time, imagine it’s not just big,
it’s too big; and you cannot label it.
Or you are too pressed for resources to take the time to do that, or you don’t know what the labels are at all.
In this case, data science resorts to using
unsupervised learning.
This consists of giving the machine unlabeled
data and asking it to extract insights from
it.
This often results in the data being divided
in a certain way according to its properties.
To use a term we just discussed – it is
clustered.
See how traditional methods spill into machine
learning?
The really cool thing about unsupervised learning
is that it is extremely effective for discovering
patterns in data, especially things that humans
using traditional analysis techniques would
miss.
In our case, this could be a whole new animal
group we’ve missed, like pictures and videos
of alligators which we would have otherwise
labelled “neither cats nor dogs”.
Actually, data science often makes use of
supervised and unsupervised learning together,
with unsupervised learning labelling the data,
and supervised learning finding the best model
to fit that data.
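Just as a hint of how that combination can look in Python – with placeholder data and scikit-learn – unsupervised clustering can propose the labels and a supervised model can then learn from them:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Hypothetical unlabelled data (placeholder random feature vectors).
rng = np.random.default_rng(1)
unlabelled = rng.normal(size=(500, 8))

# Step 1: unsupervised learning proposes labels by clustering the data.
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(unlabelled)

# Step 2: supervised learning fits a model to the data using those labels.
classifier = RandomForestClassifier(random_state=1).fit(unlabelled, cluster_labels)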
But we can discuss this another time.
There’s a more pertinent question here.
Where is Machine Learning applied in the world
of data science & business?
Think about fraud detection!
With machine learning, specifically supervised
learning, banks can take past data, label
the transactions as either legitimate, or
fraudulent, and train their models to detect
fraudulent activity.
When these models detect even the slightest
probability of theft, they flag the transactions,
and prevent the fraud in real time.
Pretty impressive, isn’t it?
And in the corporate world, ML is indispensable for client retention insights.
With machine learning algorithms, corporate
organizations can know which customers are
likely to purchase goods from them.
It’s again a “follow the pattern” kind of activity.
And this means the store can offer discounts
and a ‘personal touch’ in a very efficient
way, minimizing marketing costs and maximizing
profits.
When talking about this technique, one name
always comes to mind – the e-commerce giant
Amazon.
Alright, let’s quickly do a recap.
Data science is a slippery term that encompasses everything from handling data – traditional or big – to explaining patterns and predicting behavior.
Data science is done through traditional methods
like regression and cluster analysis or through
unorthodox machine learning techniques.
It is a vast field, but right now you are
one step closer to understanding how all-encompassing
and intertwined with human life it is.
And we will give you one better.
Did you notice we haven’t said anything
about programming languages and the software
data science uses?
That’s because we wanted to run you through
the ABCs of data science first, and then discuss
the most useful tools for each data science
subfield.
Okay, so we can split data science tools into
two categories: programming languages, and
software.
Knowing a programming language enables the
data scientist to devise programs that can
execute specific operations.
So, the biggest advantage programming languages have over software is that they offer massive flexibility: as you can imagine, capable programmers can write code that lets them do pretty much anything they fancy with the data.
R, Python, and MATLAB, combined with SQL,
cover most of the tools used when working
with traditional data, BI, and conventional
data science.
In fact, R and Python are the two most popular
languages across all data science sub-disciplines.
We actually carried out our own extensive research looking into 1,001 data scientists and their backgrounds – and 53% of them knew R and/or Python – yes, the overlap is huge.
Why are they so popular?
Well, their biggest advantage is that they
can manipulate data and are integrated within
multiple data and data science software platforms.
They are not just suitable for mathematical
and statistical computations; they are adaptable.
SQL is king, however, when it comes to working
with relational database management systems,
because it was specifically created for that
purpose.
SQL shines brightest when working with traditional,
historical data; for example, when preparing
a BI analysis.
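If you’ve never seen SQL in action, here’s a tiny sketch – using Python’s built-in sqlite3 module and an invented sales table – of the kind of question a BI analysis starts from:

import sqlite3

# A throwaway in-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 120), ("South", 95), ("North", 80)])

# Units sold per region: a classic BI-style aggregation.
for row in conn.execute("SELECT region, SUM(units) FROM sales GROUP BY region"):
    print(row)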
What about big data?
Big data in data science is handled with the
help of R and Python, of course, but people
working in this area are often proficient
in other languages like Java or Scala.
These two definitely come in handy when combining data from multiple sources, which is not a rare occurrence, trust me.
Right.
Let’s talk about software for a minute – a lot of software is used in data science, especially in the corporate world, because it is just that – a tool adjusted for the specific business needs of a company.
Yeah, for example, Excel is definitely a household name and a tool applicable to more than one category – traditional data, BI, and Data Science.
Similarly, SPSS is a very famous tool for
working with traditional data and applying
statistical analysis.
TensorFlow, on the other hand, is a software library designed for working with big data and designing Machine Learning algorithms.
It was developed by Google for internal use,
became public in 2015, and is generally the
leader for working with and deploying neural
networks.
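Just to give you a flavour – and this is a toy sketch with random placeholder data, not a real model – here’s roughly what a tiny neural network looks like in TensorFlow’s Keras API:

import numpy as np
import tensorflow as tf

# Placeholder training data: 100 samples, 4 features, binary labels.
X = np.random.rand(100, 4).astype("float32")
y = np.random.randint(0, 2, size=(100,))

# A tiny neural network: one hidden layer, one output neuron.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)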
You’ve also heard of Apache Hadoop, Apache HBase, and MongoDB, I don’t doubt that – these, too, are big-data-centered software.
On the visualization side of the court, especially
having Business Intelligence reports and dashboards
in mind, Tableau is unparalleled in terms
of ease of use, versatility, and looks.
Alright!
That about covers it!
This is data science in 30 minutes.
So, let’s do a quick recap everybody.
• We talked about data – where you get
it from, and what type of pre-processing operations
you must be comfortable with before the real
data science begins.
If there is one takeaway here, it is that
data is the foundation of anything data science
does.
If you cannot understand how to work with
raw data, you will not be able to get into
the more sophisticated analyses.
So, action point #1: if you’re thinking
of starting on the data scientist path, learn
how to handle data.
• We also looked at business intelligence
and saw the clear-cut line between predictive
data analytics and explanatory analysis.
Business intelligence focuses on explaining
past business behavior.
These analyses are the stepping stone for
predicting how businesses will perform in
the future.
Learn the past to know the future.
Makes sense, right?
And this leads to action point #2: be comfortable
with extracting insight from past data first,
and only then proceed to making forecasts
about your company’s business performance
in the future.
• Alright.
We mentioned dashboards and presentations
too.
Data science is about telling a story, and
that story includes both numbers and visuals,
okay?
To be an effective storyteller, you must visualize the insights you draw from your data.
Otherwise, if your audience is not number-savvy,
the point you’re trying to get across won’t
make any impact, and your work will go unnoticed.
In most cases, that’s not ideal.
For me, at least!
So, action point #3: learn data visualization
– Tableau, ggplot2, matplotlib, Seaborn…
any software would do the trick.
• Then, we talked about the traditional
methods of predictive analytics, right?
Regressions, factor and cluster analysis,
time series – the bread and butter of forecasting.
A lot of the more advanced approaches to data
science, like machine and deep learning, rest
on the theory at the core of these methodologies.
Hate to sound like a conspiracy theorist,
really, but the beautiful, marvelous thing
about data science is that everything is connected.
And honestly, I believe this is one of the
main reasons why it’s difficult to find
your way around as a beginner.
Does that make sense?
Okay, cool.
Action point #4: as far as statistics is concerned
– cover the foundations first before playing
with the big guns (like neural networks, ML
algorithms, and so on).
• And only once the foundations are covered,
can we talk about – and we did – the state-of-the-art
approach to data science – machine learning.
These are the newest, smartest, best methods
for predictive analytics, including deep learning,
neural networks, k-means clustering and reinforcement
learning.
Mastering machine learning is definitely the
#1 skill you need to start a career in data
science, but it is also the most challenging
data science subfield.
Action point #5: Build a portfolio of ML projects
(even if it takes you some time) – practice
will be your best teacher and you will have
something to show for during your interviews.
• Absolutely!
Okay, we also looked at programming languages.
There is a ton of research that indicates that data scientists, on average, can comfortably work with 2 languages – in most cases that’s some combination of R, Python, and SQL.
I guess, the action point here is simple:
when it comes to programming languages, learn
one or two but learn them well.
• Finally, have fun!
Data science is awesome and exciting and opens
up a world of possibilities – really.
So, don’t stress out too much, and enjoy
your journey.
Okay, I am super glad we cleared all of that
up.
And if you’re really serious about developing
some data science skills, we can definitely
give you some guidance.
For sure!
I, for one, hope you feel like the time you
just spent learning about data science was
worth it – not only because you now have
a clear data science path you can follow if
you choose to develop your data scientist
skills, but also because we are a motivated
team and we are going to show off something
we’re super proud of, and… you get to
be part of it.
You’ll see how 😉
Alright, it’s called the 365 Data Science
Online Program and it says it all in the name.
This is the comprehensive data science curriculum
my co-founders and I set out to develop, so
others wouldn’t have to invest as much time,
energy, and resources as we did when we were
growing our own data science skillsets.
It’s a pretty awesome program.
That’s actually something our students say,
too, which is great because it totally validates
everything we are trying to achieve, which
is creating an accessible and transparent
fast track into data science.
About our R and statistics training, Matthew
says that while other courses have been somewhat
helpful, with us he managed to really get
into doing statistics with R, and actually
had fun doing it.
That’s awesome, thanks Matthew!
And Rajiv was absolutely thrilled with our
Machine Learning portion of the program – he
says it’s both in depth and logical, unlike
a lot of the resources you can find scattered
online.
That’s exactly what we are aiming to do,
so that’s spectacular.
He also says that building models from scratch helped him learn a lot and clearly see the real-world business applications of ML!
Annabelle, on the other hand, is already a
Data Science grad student, and she finds our
statistics training flawless and super intuitive.
She says it’s essential, especially if you’re going into Machine Learning.
In fact, she would recommend the course to
anyone in the data science field, which is
absolutely fantastic, and makes us really,
really happy to hear that!
Now, the program includes a lot more than
these courses, and we’ve created it with
the entire professional journey in mind.
So, we’ve designed trainings for all of
the skills that step-by-step build up the
data scientist, including:
• The fundamentals of Mathematics
• Probability
• Statistics
• Tableau
• SQL
• R
• Python… all the way to
• Machine Learning
Yes, and this way, when you start growing
your skillset you will be able to do it systematically,
in a structured practice-rich environment,
and if your motivation is at the level we
think it is – you’ll complete the training
in a couple of months, instead of years.
In fact, I am pretty sure that most of you
guys are already highly motivated and you’re
here because you have been trying to teach
yourself on your own and find the right way
through the labyrinth of resources available
online, right?
So, how do you get to be part of this?
Well, we’ve created a secret coupon for
only those of you who are here with us in
this presentation, because that’s our way
of saying thanks and reaching out.
There’s a link to the side of the screen
that says Subscribe Now.
If you click that link, it’ll take you to
a Subscription page where you will be able
to enroll in the 365 Data Science Online Program
with 75% taken off the standard, non-webinar
price.
This offer is only available to the first
100 people who sign up, so make sure to claim
yours before the coupons are all gone.
We know that’s a massive discount, but we
decided to do it, so we can make this gateway
into data science an accessible reality.
We kind of feel like your being here, talking
to us online, is a pretty good sign that you
are serious about your future in data science.
So, this way, all of you who enroll will have access to a massive amount of resources and you will be able to learn a lot… as well as see if it makes sense for you to continue on the data science path.
What we really hope will happen, though, is
that you will realize how awesome and full
of opportunities data science is, and you
will let us help you become a more effective
professional.
So, to recap.
As our amazing webinar audience, you get a
secret coupon to begin the 365 Data Science
Online Program at 75% off.
The discount will only be applied to the first
100 of you who enroll in the training, which
means that once the coupons are gone, they’re
gone.
Clicking on the Subscribe Now widget next
to the screen will claim the coupon, so claim
yours soon.
And after that…, we’ll see you in class!
Alright, everyone, thanks for watching!
We are super glad data science terminology
is no longer confusing in your minds!
And, just a reminder, if you want to get all
the materials we used in the video – the
infographic and the presentation – there’s
a link in the description that will take you
to our website where you can download them
from.
Or you can click on the widget on the screen…
now.
Thanks!
And good luck!
Yeah – we had a lot of fun!
Good luck, everybody!
