Christopher Penn: In today's episode,
Jessica asks, "What is the most common
scientific method to analyze data, so
that when I, a businessperson, am
working with data scientists, I can have
an intelligent conversation?" To my
knowledge, there really is only one
scientific method, which is you develop
a question, you define the problem and
the data, you formulate a hypothesis,
you create a test, you run the test,
collect the data, analyze the results,
you refine your hypothesis or throw it
out, and then observe and repeat the
process. Now, all that said, the
application of the scientific method is
where things differ from traditional
science a little bit,
not a lot, but a little bit.
Let's say you're testing a new vaccine,
right? For, say, coronavirus, you would
have a question: does this vaccine work?
Right? Does it create antibodies? You
would define the parameters, you would
do the formulation, and you would run
the test. Where data science is slightly
different is you still have the question
you want answered. But in the problem
definition itself, that's where you're
going to do a lot of what's called
exploratory data analysis. And that is
to understand the problem better, to
define it better, to experiment a little
bit, not a lot, but a little bit, to
analyze the data set itself, if you have
it, and to do a lot of refinements to
it, cleaning of the data, etc., so that
you can formulate a hypothesis and
understand what it is you want to ask
and define the parameters of the
test. Let's say you want to know the
impact of Twitter on your lead
generation. That's
a good question, right? What is
Twitter's impact on my lead generation?
What data are you going to need? You'll
need Google Analytics data, probably;
you'll need Twitter data. And you'll
have a hypothesis: if you tweet more,
your conversions will go up, maybe
by a certain amount. How much is that
amount? Do you know? This is where you
take that step back into the Define
stage of the process and go Okay, let's
look at my Twitter data. Let's look at
my Google Analytics data. Is there a
mathematical relationship there? Is
there even a there there before we set
up a test before we create a hypothesis
that is testable? Are we even barking up
the right tree? So you might run, like, a
regression analysis and see if there is
a mathematical association between the
two, because without an association,
there probably isn't going to be a
causation. And then you would explore
your Twitter data. Does Twitter give you
enough data to build a good hypothesis?
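
As a rough illustration of that "is there even a there there" check, here is a minimal sketch that computes a Pearson correlation between tweet volume and conversions. The daily numbers and the variable names are made up for illustration; real data would come from your own Twitter and Google Analytics exports. An association found this way is only a reason to design a test, not evidence of causation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily counts, for illustration only.
tweets_per_day = [2, 5, 1, 7, 3, 6, 4]
conversions_per_day = [11, 14, 9, 18, 12, 13, 12]

r = pearson_r(tweets_per_day, conversions_per_day)
print(f"correlation between tweets and conversions: {r:.2f}")
```
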
Or conversely, does Twitter give you so
much data that you're not even sure what
to test? Right? Is it length of tweet?
Is it the day and time of a tweet? Is it
a certain emotion or sentiment or tone
that has been working for you? Images in
the post? Video in the post? What
kinds of stuff do you have access to?
And then what kinds of things actually
matter? And this is where you would run
things like feature selection or feature
importance or predictive
strength on your Twitter data combined
with Google Analytics data. So there is
a fair amount in that define stage that,
to our discredit, we tend to gloss over,
like, "Oh, I would just define success."
There's a lot that actually goes into
that. And it really is exploratory data
analysis, which is almost a discipline
unto itself, of being able to explore
data and understand: this is what's in
this thing, this is what's in the box.
And this is where your subject
matter expertise really is important.
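
One crude way to picture the feature screening mentioned a moment ago is to rank each candidate tweet attribute by the strength of its correlation with conversions. This is a minimal sketch, not a substitute for proper feature selection or feature importance methods: the feature names and all of the values are hypothetical, and absolute Pearson correlation is used here only as a simple stand-in for predictive strength.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def rank_features(features, outcome):
    """Return (name, |r|) pairs sorted from strongest to weakest association."""
    scores = [
        (name, abs(pearson_r(values, outcome)))
        for name, values in features.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-tweet attributes and conversions, for illustration only.
features = {
    "tweet_length_chars": [120, 240, 80, 200, 150, 60],
    "has_image": [0, 1, 0, 1, 1, 0],
    "hour_posted": [9, 14, 20, 11, 16, 8],
}
conversions = [3, 7, 2, 6, 5, 2]

ranked = rank_features(features, conversions)
for name, strength in ranked:
    print(f"{name}: {strength:.2f}")
```
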
And your analytical expertise is really
important. What happens when you see an
association? Like, yes, it looks like
Twitter data has a correlation to Google
Analytics conversions. But then when you
run regression tests and things to try
and isolate what are the most important
features, you come up with nothing, like,
I just tested a whole bunch of things.
What happened here? There's
an association, so there should
be correlative variables that contribute
to it. But none of these show any kind
of statistically valid predictive
strength. What happened? And so
the experienced marketing data
scientist would say, okay, what am I
missing? I've got data, but I've got no
statistical relevance. What's not in the
box? What else do I need to go and
get? And that's where you'll find your
biggest challenges because it's tempting
to run the test and say, Okay, here's
the top thing. But the top thing is,
you know, a 0.08, and you're looking
for 0.25 or better for some
of these multiple regression tests.
You'd be like, hmm, what do I do? The
very junior or naive data scientist
says, I'll just take the top three,
that's good enough, right? That's what
the algorithm says, that's good enough.
But it's not the case. That is very
rarely ever the case. Like I'm trying to
think of a situation where that is the
case and I'm not coming up with
anything. And so
what we want to do, as people who
want to become experienced marketing
data scientists, we have to say, Okay,
well, there's clearly something else
that's missing, there's a variable that
we don't have that would glue these
two datasets together, or a combination of
variables. On the flip side of that, you
get things called, like, conflated
variables where there's something that
is mixing the two up and creating
stronger signal strength than there
actually should be. That typically
happens with highly correlated variables
mixed together. So, simple
example: if it turns out that the length
of a tweet is important, and you have
the number of characters in the tweet
and the number of words in the tweet, and
both go into your algorithm to
determine strength, that's going to
screw things up, because those two are
very highly correlated. And it's going to
create an unnecessary signal for the
algorithms, the regression algorithms,
to say, like, "Oh yes, this is
a valid contributor to the outcome."
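
To make the character-count versus word-count problem concrete, here is a minimal sketch that flags pairs of candidate features that are highly correlated with each other before they go into a regression. The example tweets, the `has_link` values, and the 0.9 cutoff are all assumptions chosen for illustration; there is no single agreed threshold.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def collinear_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical tweets, for illustration only.
tweets = [
    "New post is live",
    "We just published our full guide to marketing analytics for small teams",
    "Quick tip: check your attribution settings today",
    "Webinar tomorrow",
    "Five lessons we learned from a year of testing tweet length and timing",
]

features = {
    "char_count": [len(t) for t in tweets],
    "word_count": [len(t.split()) for t in tweets],
    "has_link": [0, 1, 0, 1, 0],  # made-up flag, unrelated to length
}

flagged = collinear_pairs(features)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} overlap heavily (r = {r:.2f}); keep only one")
```

When a pair is flagged, the choice of which one to drop is a subject matter call rather than a statistical one.
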
Again, as a more experienced data
scientist, you would look at that and
go, okay, well, we've got to get rid of
one of the correlates here, because they
can't both be in there, or they're
going to throw a wrench into the
computation. So, in Twitter's case,
again, you'd use your social media
expertise: Twitter counts at the
character level, and you have 280
characters to work with. Okay, so if
that's the case, let's get rid of number
of words in the tweet, and just stick
with the characters in the tweet. And
that will be a better measure of whether
the length of a tweet is relevant to the
outcomes that we care about. So to sum
up, there is only one scientific method
that I know of. But it is the
implementation that matters the most and
how you do it within data science,
because there's a lot that goes into
defining the problem. You will spend a
lot of time, you should spend a lot of
time, defining the problem. If you don't, if
you immediately jump into a hypothesis,
you immediately jump into running a
test, chances are something has gone
missing, something has been omitted,
that will come back to bite you in the
end. I can virtually guarantee you that
something will come back to haunt you,
and you will not enjoy the
process of having to redo the experiment
from scratch. Really good question. We
could spend a whole lot of time on this,
but that's a good starting point. If you
have follow-up questions, leave them in the
comments box below. Subscribe to the
YouTube channel and the newsletter. I'll
talk to you soon. Take care. Want help
solving your company's data analytics
and digital marketing problems? Visit
TrustInsights.ai today and let us know
how we can help you.
