[APPLAUSE]
SPEAKER 1: So it's my pleasure
to introduce Professor Andy
Andres, who is a Senior
Lecturer of Natural Sciences
and Mathematics at
Boston University.
He's taught a very
successful MOOC,
online course, Sabermetrics
101: an Introduction
to Baseball Analytics, to
about 40,000 students on EdX.
He's formerly the lead
instructor and head
coach of the MIT Science
of Baseball Program,
and is chairman of the
Educational Resources
Committee for the Society for
American Baseball Research.
He's a data caster for
MLB.com at Fenway Park.
He scored game 6 of
the 2013 World Series.
And has a PhD in nutritional
biochemistry and physiology
from Tufts University.
So apart from the sports
angle, I like this topic.
I think it fits very
well into the Google
spirit of avoiding making
decisions from the gut,
from intuition with all the
conscious and unconscious
biases that that represents,
and taking a much more
data-driven, rational approach.
And as we see, that's a big win.
So thank you.
ANDY ANDRES: Thank you.
Thank you.
[APPLAUSE]
So this is my cover slide
for this talk, some contact
information.
But this image is from
the MOOC that we designed.
And during this few
minutes we have together,
I want to talk about basically
the quick introduction of what
I do, why it's important,
then a quick summary of why
sabermetrics has changed
the game of baseball,
why analytical thinking is
changing the game of baseball,
and then try to come up
to current technologies
and current data that's being
used to try to make decisions
in the game.
And the talk is geared
towards the novice.
Certainly at the beginning, you
might be able to follow things.
But if you're a baseball fan
or a sabermetrician yourself,
maybe you'll appreciate
the current technology.
So it's been my privilege
through the introduction
you heard-- I've sort of made
this transition from a bench
biochemist to a data scientist.
And I do lots of work with data.
And I'm an educator, so I spend
a lot of time teaching folks
how to think about
data, how to use
data to drive decision making,
how to think about modeling.
So all these things
are part of what I do.
But I've really made
this conscious effort
over my career to go
from the bench biologist
to the baseball scientist.
And it's been a lot
of different phases.
A lot of it actually has
to do with the physics
and biology of baseball.
That actually was my entree
into the whole field,
was studying anabolic
steroids in baseball,
that kind of discussion
and that kind of research,
and talks since 2001.
But then it's transitioned into
sabermetrics and analytics.
Quickly, this is a view
of my seat at Fenway Park.
I get to score the games.
I'm in the press box.
This is another view when
I'm looking down on the field
from my seat.
And this is my score sheet.
I have a different score
sheet than a lot of folks.
But it has to do
with the work I do,
which is to record every
event for all the technology.
I record every pitch,
every play in a certain set
of code and language.
And it's a very intense job.
I make the claim all
the time that I'm
the only person in the press box
paying attention to the game,
because everyone else is
paying attention to MLB
At Bat, which is what
my data generates.
So I do all the paying
attention these days at Fenway.
We generate the box
score at Fenway Park.
That's just an
emergence of technology.
The old school version of
this was baseball box scores
were written up by
the official scorer.
They don't do that anymore.
The official scorer just
calls the game itself.
So that's one nice
role I have, but it's
transitioned to the
academic piece here
of teaching sabermetrics.
And I designed the first
course in sabermetrics,
just doing baseball
analytics at Tufts in 2004.
And so we had this seminar
of many years teaching
Tufts students.
But now we've gone
into the MOOC format.
And one of the things
we did right away,
in both formats,
either seminar or MOOC,
was talk about the definition
of the term sabermetrics.
Sabermetrics has a funny origin.
Here's the origin.
The origin comes
from SABR, which
is the Society for
American Baseball Research,
and metrics is just a word to
say how to measure baseball.
And so sabermetrics-- it's
only been a word since 1981.
But Voltaire was
the first person
to do this-- those of you who
are philosophers or readers.
Voltaire said first
define your terms.
And it's really important.
When you look at
this definition,
I'd say this is incorrect.
This is not what
sabermetrics is.
Sabermetrics is not the
statistical analysis
of baseball data.
That's way too
narrow a definition
for what sabermetrics is.
And like a good
scientist, I just
want to show something and let
you see and think about this.
This is slow motion of
a pitcher for the Red
Sox named Daniel Bard.
This is from 2010.
And I say Daniel Bard
the Good, because this
was a year where Daniel Bard
was pitching extremely well.
For those of you who
are baseball fans,
you may remember Daniel
Bard a few years ago
and how great a pitcher he was
that one season especially.
Now, if you pay attention to
these two slow motion videos,
this is the same inning
in the same game.
These are like
two pitches apart.
A pitch count-- well,
three pitches apart.
You can see the pitch
count on the top screen.
The first, on the left,
is to Derek Jeter.
It's at Yankee Stadium.
And it's a fastball.
And even though the physics--
we all know the physics
say that the fastball
can't rise above gravity.
If you appreciate
what's going on here,
that fastball is defying gravity
by having a back spin put
on it by the pitcher going up.
Now, it's hard to
appreciate, but just look
how straight the
left-hand image is.
And then watch how much
curve the other pitch gets.
Just appreciate the
curve on the one pitch
and the straightness
on the other.
So like any trained
engineer or scientist,
you should be
observing this thing,
and your natural curiosity
would drive questions.
What's going on here?
These are two different pitches
with two different grips.
Both are thrown over
98 miles an hour.
Now, baseball fans might
know what that means,
but these are
extremely fast pitches.
One is rising against gravity.
It's actually defying
gravity and going up.
And the other is taking this
left to right turn so much so
that the catcher
can't even catch it.
The point is this
is just the time
to observe and ask questions,
like any good investigation.
Now, this is when Daniel
Bard was quite good.
And he was getting people
out left and right.
This pitch here to
the right was one
of those-- we call them Bugs
Bunny pitches in baseball.
It references the old
"Bugs Bunny" cartoon
when you had crazy pitches.
Now, this is real data.
So here is a field of pitches
that are yellow triangles.
You can sort of see them in
here behind the blue squares,
which are the changeup.
But this pitch right here--
these two pitches way over here
to the left side of
that screen where
the pitch was that pitch that
had that big break-- what
you're seeing here is data.
You're acting like
the catcher here.
And this is where all
the pitches move relative
to the bullet spin.
The bullet spin is right in
the middle of the axis, right
in the middle of the
two big axes there.
That would be the bullet spin.
The ball is just going along
its trajectory with gravity.
But then you add the
spin that the pitcher
can put on the ball, and you
get different types of pitches.
So now, this is
Daniel Bard the good.
This red cloud are his
fastballs that rise.
The one you saw on the
left to Derek Jeter,
that's the red cloud.
Then buried in there
is the yellow cloud
of triangles where he's
got that pitch that bends.
Now, I'm going to show you--
this is when he's good.
This is Daniel Bard the good,
Daniel Bard the next year,
and Daniel Bard the next year.
And pretty much after
this, he's out of baseball.
I'll at least scroll back.
Daniel Bard the
good, Daniel Bard
the next year, his last year.
He's tried since
then, but he's never
made really any
appearances since then
in Major League Baseball.
Now, if you follow the red
cloud, you'll see what I mean.
This red cloud-- if
you can picture this,
this is a catcher's orientation.
This red cloud is balls that are
thrown by Daniel Bard that move
a lot from that center point.
If you pick where
that red cloud is,
the center of that
red cloud, you
can see it move as Daniel
Bard is going from good
to mediocre-ish to not so good.
And then he's out of the game.
The point is these
are datasets that we
can use to analyze
pitching performance,
the movement of the baseball.
So instead of just watching
video and making assessments
of how well balls move,
we can measure now
how the balls actually
move based on the spin put
on the ball by the pitcher.
So this is driving the
analytics in baseball.
The decision making
in baseball can be
based on this kind of data now.
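To make the idea of following the center of that red cloud concrete, here is a minimal Python sketch that averages pitch movement by season. The record layout (season, horizontal movement, vertical movement in inches) and every number in it are invented for illustration; they are not Daniel Bard's PITCHf/x measurements.

```python
from collections import defaultdict

# Hypothetical fastball movement records: (season, horizontal_in, vertical_in).
# Values are invented for illustration, not real PITCHf/x measurements.
pitches = [
    (2010, -6.0, 10.0), (2010, -7.0, 11.0), (2010, -5.0, 9.0),
    (2011, -5.0, 8.0), (2011, -4.0, 9.0), (2011, -6.0, 7.0),
    (2012, -3.0, 6.0), (2012, -2.0, 5.0), (2012, -4.0, 7.0),
]

def centroid_by_season(records):
    """Return {season: (mean horizontal, mean vertical)} for the pitch cloud."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for season, dx, dz in records:
        acc = sums[season]
        acc[0] += dx
        acc[1] += dz
        acc[2] += 1
    return {s: (sx / n, sz / n) for s, (sx, sz, n) in sums.items()}

centroids = centroid_by_season(pitches)
for season in sorted(centroids):
    cx, cz = centroids[season]
    print(f"{season}: horizontal {cx:+.1f} in, vertical {cz:+.1f} in")
```

A centroid that drifts toward the origin from one season to the next would correspond to a fastball that moves less, which is the pattern described for the red cloud.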
Now, this is older data--
older data as in 2007 and '8
is when this dataset emerges.
And it's done by video.
Two video cameras are
trained between the area
of the pitcher and the
catcher, and they basically
do a triangulation of the
ball by video analytics,
and then they reconfigure
the whole trajectory
of the pitch, from the release
all the way to the catcher,
just using video analytics.
It's called PITCHf/x.
Now, PITCHf/x is
really, really important
to baseball analytics.
We've actually moved on.
But let's go back.
We've just gone through a
simple little couple minutes
exercise looking at Daniel Bard
the good, his actual video,
some data showing how
his pitches might move,
and how that might drive
some decision making.
An objective look at
Daniel Bard's performance
is what we can see with
this kind of dataset.
So this goes to the
definition again.
Voltaire was the one who
said first define your terms.
Well, sabermetrics to me--
and I think most people
who really are practitioners
would agree with this--
it's the study of
the game of baseball
through observation,
by looking carefully.
Now, sometimes you
can do experiments,
but usually you don't.
It's more like
astronomy that way.
You are observing
things and trying
to figure out what's happening
through your observations.
That's much more
what sabermetrics is.
And here's the definition I
start off with all my courses.
This is the simplified
version of sabermetrics--
the scientific and
objective analysis
of baseball-- much
more broadly defined
than just looking at
the statistical analysis
of baseball data.
So this is an important
part to start.
Baseball can be treated as
an observational science.
You can start thinking
about the game
and understanding it better.
Now, what we've done
at BU and at edx.org
is develop our course
called Sabermetrics 101:
An Introduction to
Baseball Analytics.
And this is one of
the big innovations
we had in our course.
The course is very successful.
I could go on about
all the data points
of why EdX loves this course,
and why BU loves this course,
and why the students
have loved this course.
But one of the innovations
I kept pushing for
was something called
the SQL Sandbox.
One thing we do really well
is teach SQL to novices.
And we do it through
the SQL Sandbox.
We make it very simple to
play with SQL right away.
And we really do it right
away in this course.
We do it right away in my
courses in seminar format
as well.
We take all comers, English
majors, non-technical people,
and we start doing SQL
queries using the SQL Sandbox.
So it was an engineering
thing to build this little SQL
Sandbox.
It wasn't tricky,
but it really had
to look right for the users.
That was one of the
big innovations we had
in this MOOC in the EdX course.
This is a-- if you can see
this, this is a simple query,
and you can see
some of the results.
You basically run a
query, and you get
a bunch of results in a table.
And you can even push a button
to download the full results.
In this case, this
query, I think,
generated something
like 25,000 rows.
These are different
sets of data for games
between 2000 and 2009.
So this was one innovation
that we found worked very well
for our students.
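For readers who have not seen a query like this, here is a small sketch using Python's built-in sqlite3 module. The table name, columns, and rows are hypothetical stand-ins; the talk does not show the sandbox's actual schema, so this only illustrates the "run a query, get a table of results" workflow.

```python
import sqlite3

# In-memory database with a hypothetical "games" table. The schema and rows
# are invented for illustration and are not the course's actual database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games ("
    "year INTEGER, home TEXT, away TEXT, home_score INTEGER, away_score INTEGER)"
)
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?, ?)",
    [
        (1999, "team_a", "team_b", 3, 5),
        (2004, "team_a", "team_b", 10, 3),
        (2007, "team_c", "team_a", 4, 10),
        (2010, "team_d", "team_e", 11, 7),
    ],
)

# The kind of query a novice might run: all games from 2000 through 2009.
rows = conn.execute(
    "SELECT year, home, away, home_score, away_score "
    "FROM games WHERE year BETWEEN 2000 AND 2009 ORDER BY year"
).fetchall()

for row in rows:
    print(row)
```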
In addition, the thing
we innovated here
was something called tracks.
We realized in Introduction
to Data Analytics,
we'd get some people who
were data scientists who
were curious about baseball.
We'd get some baseball
people who were
curious about data science.
And we'd get different folks
with different interests coming
into this class.
So we created tracks--
tracks of different topics.
We created a sabermetrics track.
It was a straight baseball track.
We created a
technical track, where
we worked through SQL and
R, but a lot of our students
already had that done, like
maybe many in this room.
If they took this
course, they actually
wouldn't have to take
the technical track.
We also had a history track.
We also had a statistics track.
This didn't pretend
to be Statistics 101,
but it did cover the
statistics needed
to do basic analytics that
we required for the course.
So the different tracks
was another innovation
that actually many other
introductory courses on EdX
have used since then.
So that was a nice
innovation we also had.
The last thing we had was
something called the Grader.
We would take-- people would run
a SQL query for their problem
set and submit the
dataset as their answer.
And we developed a whole algorithm
to try to grade each answer.
It involved the right
number of columns, rows,
the order of the
columns and the rows,
and different criteria
to give different grades.
But the Grader was another
innovation for the class.
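A grader along these lines could be sketched as follows. This is only an assumption about how such checks might be combined: the function name, the four criteria, and the equal weights are hypothetical, and the course's real rubric is not described in the talk beyond the column, row, and ordering checks.

```python
def grade_result(submitted, expected):
    """Score a submitted result set against the expected one (0.0 to 1.0).

    Both arguments are (column_names, rows). The equal weights are arbitrary
    assumptions for this sketch, not the course's actual rubric.
    """
    sub_cols, sub_rows = submitted
    exp_cols, exp_rows = expected
    score = 0.0
    if len(sub_cols) == len(exp_cols):
        score += 0.25  # right number of columns
    if list(sub_cols) == list(exp_cols):
        score += 0.25  # right columns in the right order
    if len(sub_rows) == len(exp_rows):
        score += 0.25  # right number of rows
    if list(sub_rows) == list(exp_rows):
        score += 0.25  # right rows in the right order
    return score

# Hypothetical expected answer and two student submissions.
expected = (["player", "hr"], [("player_a", 40), ("player_b", 35)])
perfect = (["player", "hr"], [("player_a", 40), ("player_b", 35)])
unsorted_rows = (["player", "hr"], [("player_b", 35), ("player_a", 40)])

print(grade_result(perfect, expected))
print(grade_result(unsorted_rows, expected))
```

Giving partial credit per criterion means a student who wrote the right query but forgot the ORDER BY clause still earns most of the points.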
Now, in terms of baseball,
really, baseball--
when we talk about sabermetrics,
we're modeling the game.
Just like modeling any
system you want to model,
you observe it carefully, and
you try to think about it.
So when you think
about baseball,
you try to understand
the game's about winning,
but how do you win the game.
You win the game with runs.
So essentially, when you talk
about baseball analytics,
you're talking about
analyzing runs.
How are runs created?
How are runs prevented?
And that's the model.
And that's the key--
that's the standard
you develop as you build all
these models for baseball.
Now, fundamentally, when
you think about runs--
and baseball fans know this--
you think about great players.
And they're going
to create runs.
David Ortiz is a
great player, and he's
got a great batting average
and lots of home runs.
Well, very early on-- and I'm
talking a century ago-- very
early on, people realized
the standard metrics
for batting performance, batting
average, weren't working.
There was a problem
with batting average,
and people recognized
this a long time ago.
This is a correlation
between team runs
scored for a whole season.
So you basically
take a team, you
measure their runs
scored for that season,
and you correlate it with
all their batting averages.
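The correlation being described can be computed as a Pearson coefficient. The sketch below implements it by hand; the team batting averages and run totals are made up for illustration and are not the data behind the slide.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up team seasons, for illustration only.
team_batting_avg = [0.248, 0.255, 0.262, 0.270, 0.281]
team_runs_scored = [640, 720, 690, 750, 735]

r = pearson_r(team_batting_avg, team_runs_scored)
print(f"r = {r:.2f}")
```

A value near 1 means the metric tracks run scoring closely; a value near 0 means it tells you little about runs.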
Now, batting average
was the standard metric
for hitting performance
for the longest time.
But even 100 years ago, people
said this isn't good enough.
And it actually has a
fairly low correlation.
Not as bad as this graph.
This is the graph I always
show when I do these talks,
because it's got no correlation.
Same x-axis-- team runs scored.
Take any given team in a
season, count their runs scored.
But then you try to plot it
against the number of outs
they record on defense.
Now, these should
not be correlated,
because one is offense-- scoring
runs-- and one is defense,
preventing runs.
Now, if you do it
different ways,
you can find correlations
with these data,
but I chose it this way--
very low correlation.
But I'll show you
the final graph.
This is a higher correlation
for offensive performance.
Now, this is a fundamental
concept in sabermetrics.
This is what you do.
You look at the game.
You observe the game.
You think about how
to model run scoring.
Instead of using standard
metrics like batting average,
you can take this very simple
metric-- this is actually
quite a simple formula.
This is just hits added
to walks, H plus BB.
So just measure the team
hits and the team walks.
You multiply it by
another component,
which is total bases, which
adds up singles, doubles,
triples, home runs,
a little formula,
divided by the number
of at bats plus walks.
So this is a really
simple formula.
But look how well it
correlates with run scoring.
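Written out, this is the formula commonly known as Bill James's basic Runs Created: (H + BB) × TB / (AB + BB). A minimal Python sketch follows; the function names are my own and the team season line is made up for illustration.

```python
def total_bases(singles, doubles, triples, home_runs):
    """Total bases: 1 per single, 2 per double, 3 per triple, 4 per home run."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs

def runs_created(hits, walks, total_bases_value, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    return (hits + walks) * total_bases_value / (at_bats + walks)

# Made-up team season line, for illustration only.
tb = total_bases(singles=1000, doubles=300, triples=30, home_runs=170)
rc = runs_created(hits=1500, walks=500, total_bases_value=tb, at_bats=5500)
print(tb, rc)
```

The first factor rewards getting on base, the second rewards hitting for power, and the denominator scales the product by opportunities.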
So this simple formula was
developed in the early '80s,
'70s.
And it really changed the
nature of baseball analytics.
Because the analysts who
figured this out just
did what I've just showed you.
They said people for years have
been saying batting average is
not a good metric for
offensive performance,
for modeling the game,
for understanding
the offensive
performance in baseball.
But this is.
Now, they kept saying
this over and over again
since the early '80s,
and nothing changed.
These were outsiders who had no
access to the actual clubhouse,
to the decision makers.
The decision makers weren't
listening to the folks
telling everybody
there's a better way
to model the game of baseball.
And you can imagine the
market inefficiency.
Instead of modeling your team
on the best batting average,
the players with the
best batting average,
you can start modeling
your team on the player
with-- oh, I'm sorry.
Sorry for the flashing slides.
You can model your
team on the player who
gets harder hit balls,
which is measured
by the TB here, home runs
and triples and doubles,
and get walks.
Walks are added here.
So this formula-- this really,
really simple formula--
gets at another
level of modeling
the game more accurately.
And the market
inefficiency now is
taking care of people who
hit home runs and triples
and doubles and take walks.
Now, this market inefficiency
is completely closed today.
But before the publication
of "Moneyball,"
this market inefficiency
was wide open.
It was first described, like
I said, in the early '80s.
But it took two
decades-- two decades--
before you got that
little crack in the door
into the clubhouse
to convince people
that this is the way to
think about the game, the way
to model the game.
Now, today, fast forward another
decade since "Moneyball,"
there are 150 analysts
on 30 teams who--
this is what they grew up on.
This is fundamental.
This is like 2 plus 2 equals
4 to these 150 analysts.
We're talking about a dramatic
paradigm shift in an industry.
Now, this is just entertainment.
We understand that.
But it's a dramatic paradigm
shift in this industry.
This is the importance
of baseball analytics
as an emerging-- discipline
may be too strong,
but it's certainly an emerging
area of study for folks
who want to do
this kind of work.
It's essentially
interesting data science
around entertainment.
So let's go back to
another set of data.
Another set of data
emerged in 2008.
This is Jon Lester
from 2007 to 2014.
These are all his pitches.
Now, remember what
this is again.
I'm going to try to
describe this carefully.
This is the catcher view.
You're the catcher.
So Jon Lester is the pitcher.
And he's throwing-- he's
a left-handed pitcher,
and he's throwing pitches
to you, the catcher.
And he's got five
different pitches.
The four seam
fastball is the one
that's going to rise the most.
So that cloud of black
circles is the one
that's the highest velocity.
That's the fastest pitch.
Now, you can see he's got
balls that-- the sinker,
this gray cloud
here, that actually
moves a little different
than the fastball.
These are the two pitches
you saw by Daniel Bard.
The fastball was the one
on the left to Derek Jeter
by Daniel Bard, and the
sinker was the one that
curved so much left to right.
Now, for Jon Lester, he's
a left-handed pitcher.
It's going to curve
the other way.
It's going to curve from right
to left, because he's lefty.
So when he throws his
two-seamed sinker,
it's going to move this way.
Now, he's got three
other pitches.
He's got his curveball,
which is going to spin down.
And that's the
yellow cloud here.
He's got the changeup, which is
a whole other pitch altogether.
It's meant to slow down.
You can see how it's
lower on the graph.
It's a lower velocity pitch.
So he grips it differently, and
it's actually a slower pitch.
And then he's got another
pitch that he's spinning
the other way from the sinker.
He calls it a cutter, a slider.
These are different names
for very similar pitches
that go the different way
than the classic sinker pitch.
The point is-- if you don't
know baseball, that's fine.
The point is there's a pitcher
grabbing a baseball that's
going to spin in air, and
he's doing this on purpose.
So we can collect all these
data and look at Jon Lester.
Now, one data point of one
pitcher doesn't matter much.
But if we compare Jon
Lester now to another?
This is another
left-handed pitcher
with the same
five-pitch repertoire.
This is Jose Quintana.
Now, what I've done is
I've manipulated the images
so the scales are the same.
And I'm going to
go back and forth.
Now, you should understand
that one pitcher is throwing it
faster, and one pitcher is
throwing it with more movement.
And this is the way now
to measure pitching.
We're not going to
measure ERA anymore,
which is a fundamental measure
of pitching performance.
Now we can take these data and
say who moves the ball better.
Who moves the ball faster?
Now, there's a key piece missing
here in this dataset right
here, which is how
accurately they can throw it,
their placement of the
pitches over the strike zone.
But just on the skill set
of these two pitchers here,
Lester has more velocity and
more movement of these pitches.
These scales are
exactly the same.
And so now this is
just to show you
we have another way to
measure pitching performance
using technology and new data.
So this is using the
PITCHf/x dataset,
and understanding the
different pitch types,
and understanding how these
pitches move as a measurement
of value of pitchers.
Now, I throw this one in
here just because it's fun.
Has anyone ever
seen this before?
This is Todd Frazier.
Todd Frazier now actually plays
for the Chicago White Sox.
This is when he was on the Reds.
If anyone is from
New Jersey, he's
from Point Pleasant, New Jersey.
New Jersey Shore?
No one?
Jersey Shore?
OK, fine.
But watch.
If you look carefully, it's
pretty obvious his right hand--
watch his right hand-- is off
the bat at bat-ball contact.
But if you look carefully,
his left hand is, too.
He has essentially thrown
the bat at the ball.
He's throwing the
bat at the ball.
Now, if you've ever
played baseball,
the coach would never
tell you to do this.
This is not classic baseball.
But what's really interesting
here is the physics.
We can use this as
a physics example
of understanding bat-ball
contact, the mechanics of this.
But beyond that,
this was a home run.
He hit this ball
about-- over 400 feet.
So he's throwing the bat at
the ball, and it goes 400 feet.
So this is the introduction to
understanding a little bit more
about home runs.
So here is a list of
all the different teams
and their ballparks.
It would probably
surprise you if you're
a baseball fan-- if
you understand this
and you're a baseball
fan, it would probably
surprise you to
see that last year,
the most home runs were hit by
the games played in Baltimore
at Oriole Park.
Most people who
are baseball fans
might say oh,
that's Coors Field.
That's where there's
a high altitude
and balls just travel farther.
But no.
Most home runs were
hit in Baltimore.
The fewest home runs were
hit in San Francisco.
Now, this is just a rank
order listing of the home runs
that occurred in every park.
But as we start understanding
home runs, as we transition
into understanding
this, this is one
dataset that we can
start thinking about.
Now, here's another dataset.
This is actually
really interesting.
This is using-- this
is by Greg Rybarczyk.
He's actually a Red
Sox analyst now.
What he did is took video of
every home run ever played.
He had a package, and he could
watch videos of all games.
And he would find every video
of a home run since 2006.
So this is yeoman
work, because there's
about 5,000 home runs a year.
And he would find where
it landed in the ballpark.
And he had 3D reconstructions
digitally, reconstructions
of all ballparks.
And he would time on his
video bat-ball contact
and then where it landed.
So he has a time in the air,
a position where it landed.
And he actually got more
sophisticated near the end
and put in weather in there,
and he put in wind in there.
He talked to the physicists
about batted ball flight,
understood better about
the potential batted ball
flight moves here that are
typical for righty lefty, lefty
righty stuff.
And he modeled
various components.
So this dataset is just
looking at home runs in 2009.
So this is the cloud of all home
runs graphing their batted ball
velocity, how hard
the ball was hit.
Now, Greg calculated this.
He calculated the
batted ball velocity
based on hang time in
the air and the distance
and a few other parameters.
He also determined
the launch angle.
So he is estimating both launch
angle and batted ball velocity
just by looking at video.
Lot of work, but
probably a big error
bar around the precision
of what this actually
is, because he's doing
a lot of formulation
to get these two
sets of numbers.
But clearly, it should
show you something.
What's the average?
He's not too far off.
This is pretty good modeling.
But it tells you the
average batted ball velocity
for a home run is about
103.6 miles per hour.
Off the bat, home
runs average this.
Now, there's a big range.
You can see some home runs
were pretty weakly hit.
Those are probably
things like pop ups
at Fenway Park that scrape
the foul pole or something.
So it's a weak hit that
might just be a home run.
But some were hit quite hard.
You can see the
outliers on both sides.
And then you can see
the launch angle.
Most people, when they
think about trajectories
that you learn in
Mechanics 101, you
think 45 degrees is the best.
Well, in baseball, it's not.
Data clearly shows you here
the launch angle is about 28.
Now, the range is anywhere
from 43 down to 18 or 19.
When you're hitting an
18 or 19 launch angle,
that's a line drive.
That's a low trajectory.
And then, of course, you've
got the almost 45-degree
trajectories.
But the typical home run is
much more like 30 degrees.
Now, this makes sense
when you understand
the physics in the
backspin of a baseball
and how balls move
in the fluid of air.
So you can explain this
by looking at the science
quite easily.
But this was an important
little dataset to understand.
Most people wouldn't
get this right away.
Now, also in 2009, we
have another dataset
that looks at all
batted ball events.
This is just April 2009.
But it's another technology
just using video analytics.
Remember that video
analytics of PITCHf/x
where you basically
reconstruct the trajectory
of the pitch with
high speed cameras
in this area between the
pitcher's mound and home plate?
You can do the same thing
with the batted ball.
When the ball is hit, you
can watch the trajectory
through those camera areas.
Now, it wasn't a
great technology.
It actually had lots of errors.
But this dataset tells
us a very similar thing.
What we're showing here
in different colors
is a fly out, which is
a ball hit in the air.
That's different
than a pop up, which
is a ball hit in
the air which is
closer to where the batter is.
Then there's line outs and
ground outs and singles,
doubles, and triples.
For those of you
who know the game,
you understand these clearly.
But you can see where
these clouds are.
I've reduced this cloud
into different points,
the average points for
each of those eight events.
These are the outs-- the pop
out, the fly out, the line out.
And if you understand
the graph here looking
at the velocity versus
the launch angle, which
is what we showed before
with the other dataset,
you can see why
that might curve.
You understand the game, what
a liner is, what a pop up is.
You see why they might
have different batted ball
velocities and trajectories.
But this is the single.
The double and the
triple are very close.
They're actually very similar
batted ball projectiles
off the bat.
But then here's the home run.
The home run is
singularly different.
You've got to hit it
harder, and you've
got to hit it at
the right launch
angle, the right trajectory,
the right angle off the bat.
So this is another
dataset showing
a very similar thing,
another completely different
technology.
Instead of looking
at video, which
is what Greg did, seeing
where the ball landed
and the time in the air, this is
looking at video reconstruction
of a high-speed camera
of the initial flight
path of the ball.
Now we can go to
another dataset.
We have a third dataset.
This is actually not
the third dataset yet.
Greg's dataset looking at
the trajectory, flight time,
and landing spot gets
this dataset for 2015.
So you can see
it's very similar.
It's about 30 degrees launch
angle, and it's about 104,
103 miles per hour off the bat.
These are home runs
again, very specific hits,
and they fill the same cloud.
So we're getting a sense now
of what a home run looks like.
So if you could read the
initial batted ball velocity
and you see over 100,
and you see a launch
angle in the area 20 to 35,
40, you could say home run.
You're probably going
to be pretty close.
This is definitely what a
home run requires-- the ball
to be hit over the fence.
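That rule of thumb can be written as a tiny check. This is a sketch only: the thresholds are taken loosely from the ranges mentioned here (over 100 miles per hour, roughly 20 to 40 degrees), the function name is hypothetical, and it ignores ballpark dimensions, spray direction, and weather.

```python
def looks_like_home_run(exit_velocity_mph, launch_angle_deg,
                        min_velocity=100.0, min_angle=20.0, max_angle=40.0):
    """Rough guess: hit harder than min_velocity and inside the angle window."""
    hard_enough = exit_velocity_mph > min_velocity
    good_angle = min_angle <= launch_angle_deg <= max_angle
    return hard_enough and good_angle

print(looks_like_home_run(104.0, 28.0))  # typical home run profile from the talk
print(looks_like_home_run(90.0, 28.0))   # right angle, not hit hard enough
print(looks_like_home_run(104.0, 10.0))  # hit hard, but too low
```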
So there's new data now,
and it's using radar.
Instead of using video,
there's radar in the stadiums.
This is radar data.
Now, this is a little different.
It's not the same axes anymore.
But what we have on the bottom
is batted ball velocities,
but these are average.
Each of these points
represents one player
in 2015, just one player.
And these are the top
150 or so players.
And I chose 150 mainly
because I wanted
to get the full-time players.
Players who played
the most would
be determined by
their management
to be the best players,
the best players performing
to create runs.
This is all the
different players.
So we've got their average
batted ball velocity.
It turns out when you
take the top 150 players,
you get about 300 events each.
So this is an aggregation.
Each of these points is an
aggregation of 300 batted ball
events during the year 2015--
so the top 150 batters,
300 batted ball events-- in
other words, they hit the ball.
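The aggregation behind each point can be sketched in a few lines of Python: average the batted ball velocity per player and keep only players with enough events. The event layout, player labels, and velocities are invented for illustration, and the cutoff is lowered to 2 in the usage example so the toy data passes it.

```python
from collections import defaultdict

def average_exit_velocity(events, min_events=300):
    """events: iterable of (player, exit_velocity_mph).

    Returns {player: mean velocity} for players with at least min_events
    batted balls.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for player, velocity in events:
        totals[player][0] += velocity
        totals[player][1] += 1
    return {p: s / n for p, (s, n) in totals.items() if n >= min_events}

# Made-up batted ball events, for illustration only.
events = [
    ("hitter_a", 95.0), ("hitter_a", 93.0),
    ("hitter_b", 82.0), ("hitter_b", 84.0),
    ("part_timer", 101.0),
]
averages = average_exit_velocity(events, min_events=2)
print(averages)
```

The minimum-events filter plays the role of restricting the plot to full-time players, so one hard-hit ball from a part-time player does not show up as a point.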
Now, when you take
the average velocity,
you get this huge range.
The average ranges from about
82 all the way up to 94, 95.
That is a big range of
batted ball velocities.
This is a distinguishing
characteristic of a batter.
It really is a distinguishing
characteristic of a batter.
You can really determine
what's an average player who's
playing full time.
These are full-time players.
You can pick that group in
the middle, the average.
And you can pick the
group at the bottom,
and you can call them
the weaker hitters.
And the ones at the top
are the stronger hitters.
Now, who's a baseball
fan in the audience who
wants to take a guess at
who might be a weaker hitter
and who might be
a stronger hitter?
Anybody a baseball fan
enough to want to guess that?
AUDIENCE: Sure.
Weaker hitter is Billy
Hamilton, and stronger
is Giancarlo Stanton.
ANDY ANDRES: Stanton
and Hamilton, good.
Good guesses.
Yikes.
That was just the home runs.
Any other guesses?
Any other baseball fans
that want to hazard a guess?
AUDIENCE: Jose
Iglesias, [INAUDIBLE].
ANDY ANDRES: Good, good.
Actually, you're not hitting
any of the leaders, by the way,
or the followers,
the bottom group.
Let me just show you.
Now, the other part of this
here is batted ball distance.
You can see how there's a slight
correlation between batted ball
velocity and distances.
There's something there.
You can sort of see that the
weaker hitters don't hit it
as far.
Now, so who's the cluster
at the bottom there?
The cluster at the bottom
are four players-- Burns
from Oakland A's, Ben Revere,
Ichiro Suzuki, and Dee Gordon.
And Dee Gordon-- I don't know
if you follow this stuff like I
do, but Dee Gordon
just got suspended
for anabolic
steroid usage, which
is-- he's way at the bottom.
I don't know what's
going on there.
It doesn't seem
to be-- hitting
the ball hard is not part of
the equation for Dee Gordon.
But on the other
side, these top five
are Trumbo, Bautista, Cruz,
Ortiz, and Miguel Cabrera.
Now, very close-- I'm sure
Mike Trout is very close.
I'm sure Stanton is very close
to the top of this group here.
And I'm sure the others you
mentioned are at the bottom.
The point is this is now--
I'm going to go out on a limb
and say this is the right way
to think about batting power.
And one of the key things-- this
is 2015-- you're a smart GM.
You might say look at
that list at the top.
One of those things is not
like the other-- Mark Trumbo.
Mark Trumbo was a free agent,
and one very smart team
signed Mark Trumbo.
And Mark Trumbo is having
a great year this year.
Now, small sample size.
There's not necessarily
a cause and effect here.
But the point is this might be
a new way to really understand
the value of batters.
Remember what these
points represent.
For Mark Trumbo, that's
over 300 batted ball events.
And he hit it harder than
everybody in baseball
except four, four guys who
anybody would list as a power
hitter who hits it hard-- Miguel
Cabrera, David Ortiz, Nelson
Cruz.
These guys hit the ball hard.
We all see it.
You can observe it.
Now, you see Mark
Trumbo hit it hard, too,
but it might be a surprise to
see him that high on the list.
Question?
AUDIENCE: [INAUDIBLE] on the
right side of [INAUDIBLE].
ANDY ANDRES: Probably,
yes, I agree.
The observation was
Wily Mo Pena probably
would fit on the right
side of these graphs.
This dataset doesn't
encompass Wily Mo Pena.
This is 2015, 2016.
This is when we put radar
now in all baseball parks.
AUDIENCE: Let's say
Trumbo [INAUDIBLE].
The smart GM would sign Trumbo.
The looser GM would sign Pena.
But there wasn't any evidence
to indicate [INAUDIBLE].
ANDY ANDRES: Fair point.
But I think there's more to
it than just this, for sure.
So this isn't the
end of the story
or the whole story for sure.
But this is a real new look
at the performance of players,
a new look at
performance of players
that we didn't have
before last year.
These data are now available.
We can look at that.
Now--
AUDIENCE: Is this
going to be all
available-- this
is all open source,
and I can go and download
all of this [INAUDIBLE]?
ANDY ANDRES: If you ask nicely.
The question was is the
data publicly available.
The answer is yes.
If you ask nicely, I can
tell you where to find it.
It's not hard to find.
Just Google Statcast,
and you'll find it.
But this is the radar.
And if you go to
Fenway Park any time,
just look behind home plate.
Mounted on the press box, look
for this little box right here.
This is the radar that
covers the whole field that's
measuring the pitch, measuring
the trajectory of the pitch
at 40,000 times a second.
And it gets the
actual release point.
It actually can determine the
three-dimensional components
of the actual release
point of the pitcher--
not just guess it with
the previous technology
of extrapolating data points
through video analytics.
Radar gets the release
point, the exact spot
where it's released.
It can measure seams
on the baseball.
It's that precise.
And it can obviously see
the pitch trajectory,
recreate that.
Now it can also recreate
the hit trajectory,
and that's where we're
getting these datasets from--
the velocity of the hit,
and the distance of the hit,
and the trajectory of the
hit, and the spin of the hit,
and the spin of the pitch
all measurable now because
of radar.
Now, we also get
every throw made.
So you bat the ball into play.
The fielder gets it, and they
throw it to another player.
All those throws
are now measurable.
You can measure the trajectory
of every single throw
with this technology.
So we are at the leading edge
of a whole new set of analytics
like this.
Now, there was a
great question about
is this the be all and end all.
Is Miguel Cabrera the
best hitter in baseball?
He's one of them.
He's not the best.
But when you model the game,
when you model run scoring,
when you're the
objective analyst now,
you have to take this
data into account.
You've got to start adding this
new technology to your model
to better understand
offensive performance,
to better understand
where runs come from,
to leap beyond the
idea of understanding
just what batting
average is contributing
to offensive performance
and run scoring.
So radar is key.
There's another
component here though,
too, which is really important.
These are six cameras.
And I took a photo
at Fenway Park.
Just wandered around
and took this picture.
Each of the sides
has three cameras.
The right-hand image is
three black rectangles
next to a white-- that
is actually a radar gun
to measure pitch velocity.
But that's three cameras there.
And on the other side, the left, there
are three cameras mounted
on top of one another.
If you go to any
Major League ballpark,
it might be fun to-- since
you're technology oriented,
you might try to find the radar
system and the six cameras.
Because they're in every
Major League ballpark.
They're actually in a lot
of minor league ballparks.
Because what teams have
found out now is it's
not just measuring
the Major Leagues.
You want to do it with
the minor leagues, too.
You want to install
these systems
so you can see which
batter is hitting
it the hardest in the minors
to either trade for them
or to promote them.
It really is getting
more and more pervasive,
this set of technology.
The cameras here
measure player movement.
That's not measuring
ball movement.
If you've ever followed
football, soccer,
or other moving sports--
if you watch the World Cup,
you see how they track players.
And they track
players through video.
They don't need little trackers
in their shoes or anything
like that.
You can just watch.
You can determine where the
player is by video analytics.
ChyronHego is a European company
that does this very well.
Sportvision is an American
company trying to do it.
But ChyronHego was hired
to do it for baseball.
And the technology
tracks the movement
of all the umpires, all the
coaches, all the players,
all the runners, the batter.
Everybody moving
on the field is now
tracked through video analytics.
So now we have another set of
data measuring player movement
all over the field.
And this is important
now, too, obviously,
if you want to measure
defense and if you
want to measure offensive
performance-- baserunning
or batter times to first.
These are all things that
are important to measure
if I want to better understand
offensive or defensive
performance in the game-- so
merging now our big datasets
to help us better
understand the value of each
and every player in
the game of baseball.
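As a rough sketch of how a batter's time to first could come out of tracked positions, assume one (x, y) position per video frame, a known frame rate, and a hypothetical coordinate frame with home plate at the origin and first base 90 feet along the x-axis. None of these names or conventions come from the actual tracking system.

```python
from math import hypot

# Feet; hypothetical frame with home plate at the origin.
FIRST_BASE = (90.0, 0.0)


def time_to_first(track, fps, tolerance_ft=1.0):
    """Seconds from the first frame until the runner reaches first base.

    `track` holds one (x, y) position in feet per frame, starting at contact.
    Returns None if the runner never gets within `tolerance_ft` of the base.
    """
    for frame, (x, y) in enumerate(track):
        if hypot(x - FIRST_BASE[0], y - FIRST_BASE[1]) <= tolerance_ft:
            return frame / fps
    return None


# Made-up track: a runner covering 2 feet per frame at 10 frames per second.
track = [(2.0 * i, 0.0) for i in range(60)]
elapsed = time_to_first(track, fps=10)
```

The same per-frame positions could feed fielder range or baserunning speed; only the target location and the stopping condition would change.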
Now, when they
announced this in 2014
that this was going to be
coming to Major League Baseball,
they also made the announcement
that this is seven terabytes
of data every game.
Now, that's a lot of data.
Now, Google laughs at that.
They're like, seven terabytes,
I use that in my hip pocket
in the morning.
But you know what?
This is actually a big
dataset for baseball.
Relative to what other data
sets we've had in the past,
even though baseball was
considered data rich,
this is actually really
emerging into bigger analytics
of big datasets.
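For a sense of scale, here is the back-of-the-envelope arithmetic, assuming the seven-terabyte figure holds for every game of a 30-team, 162-game regular season (postseason and minor league games are left out).

```python
TEAMS = 30
GAMES_PER_TEAM = 162
TB_PER_GAME = 7  # figure quoted in the talk

# Each game involves two teams, so divide the team-games by two.
games_per_season = TEAMS * GAMES_PER_TEAM // 2
season_tb = games_per_season * TB_PER_GAME
season_pb = season_tb / 1000  # decimal petabytes

print(f"{games_per_season} games -> {season_tb} TB (about {season_pb:.0f} PB)")
```

That works out to 2,430 games and roughly 17 petabytes per regular season under these assumptions.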
And so a lot of time has been
spent by the very smartest,
cutting edge teams to
parse through these data
to get the right data they need.
A lot of time has been spent
also by Major League Baseball
to help teams do this.
There's a lot of folks working
in Major League Baseball
to get through these
data to boil it down
to really important information
that the teams can use,
and that also might be useful
for the fans of the game.
It might be fun for fans to
see these different things.
So let's move on
to other things.
Let's move on to another
way to use these data sets.
I only have a
couple more slides.
But this is interesting.
This is David Ortiz in 2015.
What's shown here is his
batted ball location.
Most people probably
wouldn't have
guessed that David Ortiz--
most of his batted balls
go right center field.
If you're a fan, everyone
is shifting on David Ortiz.
But David Ortiz is hitting it
straight up the middle mostly.
So you look at the heat
map, and that's mostly
where his hits are landing.
Now, if you look
at Jose Bautista,
if you know anything about
Jose Bautista, you can see him.
He's a pull hitter.
You look at his heat map.
He's a right-handed batter.
This is a demonstration
of the field,
and you can see that most
of his hits are pulled.
Here's another graphic of
Jose Bautista, 2008 to 2015,
all his home runs.
He's a pull hitter.
If you understand
the game of baseball,
he's pulling everything
to left field.
This is Ryan Howard.
Now, he's a left-handed hitter.
He pulls the ball.
Now, those of you
who are Red Sox fans
might know who
Xander Bogaerts is.
Xander Bogaerts-- I see
some heads nodding.
Xander Bogaerts is the
shortstop for the Red Sox,
and he doesn't hit a
whole lot of home runs.
But he's got an
interesting profile here.
And this actually goes
along with your observation.
If you watch a lot
of baseball, you'll
see Xander Bogaerts is much
different than Jose Bautista.
They're both
right-handed hitters.
These are many
batted ball events.
This is not just one data point.
These are a lot of data points
of where their ball lands
when they hit it.
Compare this heat map.
And look-- one big thing to
compare is how close it is.
He doesn't hit it far.
It's closer to home plate,
which is the bottom of those two
black lines.
Here's Jose Bautista.
Jose Bautista-- his heat
map extends a lot further,
and it's certainly pulled
towards third base,
left field compared
to Xander Bogaerts.
He's hitting it opposite field.
So a very interesting
set of heat maps.
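The pull versus opposite-field contrast in these heat maps can be summarized with a simple spray profile. This is a sketch under assumptions: the angle convention (degrees from dead center, negative toward left field), the 15-degree cutoff, and the sample angles are all hypothetical.

```python
from collections import Counter


def spray_direction(angle_deg, bats, cutoff_deg=15.0):
    """Classify one batted ball as 'pull', 'center', or 'opposite'.

    angle_deg: horizontal angle from dead center field, negative toward
               left field, positive toward right field (assumed convention).
    bats: 'R' for a right-handed batter, 'L' for a left-handed batter.
    """
    if abs(angle_deg) <= cutoff_deg:
        return "center"
    toward_left = angle_deg < 0
    # Right-handed batters pull to left field; left-handed batters to right.
    pulled = toward_left if bats == "R" else not toward_left
    return "pull" if pulled else "opposite"


def spray_profile(angles, bats):
    """Return the share of batted balls in each direction."""
    if not angles:
        raise ValueError("need at least one batted ball")
    counts = Counter(spray_direction(a, bats) for a in angles)
    return {k: counts[k] / len(angles) for k in ("pull", "center", "opposite")}


# Made-up spray angles for a right-handed batter.
angles = [-35.0, -28.0, -20.0, -5.0, 10.0, 30.0]
profile = spray_profile(angles, bats="R")
```

A pull hitter would show a large "pull" share, while a hitter who uses the whole field would show a flatter profile.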
The last thing-- I'm going to
run through these slides quick.
If you care, there is something
called catcher framing.
But what people don't understand
is batters have a strike zone.
In other words, you might
understand that batters have--
umpires have a strike zone.
Some umpires, if
you know the game,
might have a wider strike zone.
Some umpires might be tighter.
What people don't
understand as fans
is that batters
have a strike zone.
They really do.
When you understand the
strike zone-- baseball
analysts who study
the strike zone
know there's four parts
to the strike zone.
Each pitcher has a strike zone.
Each batter has a strike zone.
Each umpire has a strike
zone, and each catcher
has a strike zone.
They all impact
that called pitch
being either a ball or a strike.
Let's do this.
This is Jose Bautista.
He's a right-handed hitter.
Let's compare him to Mike Trout,
another right-handed hitter.
These are all data
from 2014 through 2016.
So this is Bautista.
Here's Trout.
Bautista, Trout.
So these are two different
guys with two different strike
zones.
You can appreciate the
difference in the red field.
Now let's go to a left hander.
This is David Ortiz.
This is his strike zone.
Bryce Harper, another
left-handed hitter,
another great left-handed hitter.
Look at the difference.
There are differences here.
And this is subtle.
This is a subtle
difference between batters
that impact the strike zone--
so cutting edge analytics that
aren't really being
published, even in the public.
But what the teams are
doing-- what they're doing
is they're better understanding
each batter's strike
zone, each umpire's strike
zone, each pitcher's strike
zone, each catcher's strike zone.
And they're merging now, like
who's pitching, catching today.
Who's the umpire today?
And they can actually make this
part of the scouting report
to help inform either
pitchers, catchers,
or batters what
advantages there might be
to winning the game that day.
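One plausible way to build a per-batter (or per-umpire, per-catcher) strike zone map is to bin taken pitches by location and compute the share called strikes in each bin. The tuple layout, the half-foot cell size, and the sample pitches below are assumptions for illustration, not the teams' actual method.

```python
from collections import defaultdict
from math import floor


def called_strike_rates(pitches, cell_ft=0.5):
    """Map (col, row) grid cells to the fraction of taken pitches called strikes.

    pitches: iterable of (x_ft, z_ft, called_strike), where x is the horizontal
             location and z the height as the pitch crosses the plate
             (a hypothetical layout).
    """
    tallies = defaultdict(lambda: [0, 0])  # cell -> [strikes, total]
    for x, z, called_strike in pitches:
        cell = (floor(x / cell_ft), floor(z / cell_ft))
        tallies[cell][0] += int(called_strike)
        tallies[cell][1] += 1
    return {cell: strikes / total for cell, (strikes, total) in tallies.items()}


# Made-up taken pitches for one batter.
pitches = [
    (0.1, 2.6, True),
    (0.2, 2.7, True),
    (0.9, 1.4, False),
    (0.8, 1.3, True),
]
rates = called_strike_rates(pitches)
```

Filtering the input by batter, umpire, pitcher, or catcher before calling the function gives one map per party, and comparing maps cell by cell would surface the subtle differences described here.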
So this is the kind
of thing that's new.
Old was this idea of
let's value walks.
Let's value the home run.
New is using technology and
data to drive decision making
in real time to understand
better small advantages that
might help you win games.
So look-- I understand
this is entertainment.
I understand this
is fun and games.
But this is an
analogy, in a way,
to a lot of things going
on in real industry,
real decision making in real
business and real things that
might matter a lot
more than baseball.
So we can talk about
this baseball story
and how it relates to this
emerging study of anything
we observe, anything.
And I use anything
literally, because we've
done it with baseball.
So anything can suffer
a close observation,
thoughtful questioning, trying
to understand it better,
gathering data to be
objective about what
you're trying to answer.
So we've done that
with baseball.
I think it can be
done with anything.
And I thank you for your time.
If you have any questions,
I'll be glad to answer them.
[APPLAUSE]
AUDIENCE: Hi.
Thank you very much.
What are some of
the teams that you
think are the most cutting
edge in this stuff?
ANDY ANDRES: Cubs, Rays,
Red Sox, Nationals.
Yankees are pretty good.
I would say Astros.
I forgot the Astros.
Yes.
AUDIENCE: Is that a
culture of how much they're
willing to spend or
the actual people they
have doing the analysis that
produces the better results?
ANDY ANDRES: Some it's
spending, and some it's people.
Astros, it's actually people.
Rays, it's people.
Red Sox, it's a combination
of people and money.
Cubs, it's mostly money,
but a lot of good people.
AUDIENCE: So it seems
like observationally,
watching a game, whenever
anybody really hits a triple,
it always seems like they hit
a double and the fielder really
screwed up, but it wasn't
close enough to really
be called an error.
And given what your
chart showed earlier,
they're basically
the same kind of hit.
Is that your observation
as a scorer, too?
ANDY ANDRES: Yes.
I don't make the
hit-error decisions.
I just record everything
when I'm working at MLB.
But I agree with you.
From what I see, these
are hard-hit balls that
rattle around outfield fences.
And you just can't call
errors on certain plays.
If they're making ordinary
effort and they screw up,
that's when you give an error.
But if it's some crazy
thing that happened,
that's-- you can't
give errors to that.
AUDIENCE: Even when
they fall over.
But they fell over not within
five feet of the ball, so--
ANDY ANDRES: This
is being recorded,
so I better watch what I say.
But I have a lot of stories
about how the sausage is
made in official scoring.
AUDIENCE: The data you were
talking about collecting,
is that available to
all the teams equally,
or do some of them
have private data
that they're adding into this?
ANDY ANDRES: This dataset
is for every team.
Every team has access to it.
But that doesn't
mean other teams
aren't adding their own data.
There's other data
providers out there
that are trying to
sell data to teams.
And some teams buy it.
Some teams are thinking about
data a little differently.
They're gathering data.
They're spending
money to get data
in low minors and
other places, too.
So each team is a
little different.
But this dataset using
radar and video analytics
is open to all 30 teams.
AUDIENCE: So when
academics discover things,
they publish them.
In this case, finding a new
metric that works really
well is probably something
you want to keep proprietary.
Do you guys lean more
towards that or more
towards sharing with each other?
ANDY ANDRES: I think
the industry is always
looking for the
market inefficiency
and for that advantage.
Now, this isn't as
clear-cut an advantage
as in some other industries.
If you make a better widget
and you cut the price in half,
you will make more money.
That's a measure of that system.
In this system, you want
to measure it by, say,
World Series victories.
Getting more value from your
players because of some metric
about how hard
they hit the ball,
that doesn't guarantee
a World Series victory.
But it certainly would
be a slight improvement.
And it is a market inefficiency.
In terms of academics
and publishing,
I think there is
the outside world--
and I'm actually part of
the outside world, where
we actually look
at data and we try
to analyze it and publish it on
the web so everyone can see it.
Then there's other data
involved and better models
involved because of the
richer datasets in the clubs.
And that's not something
we're privy to.
The outside doesn't see that.
It's very proprietary.
There's some cross-talk
between teams, but not much.
They usually hold information
pretty close to the vest.
But I'm on the outside.
So I just try to look at data
that I can find and analyze it.
AUDIENCE: Two
questions, actually.
So at the player level,
how much are teams
sharing any data points?
So one thing that
stood out was you
said they put the
shift for Ortiz,
but he hits to the middle.
Do they share that with
their players at all
and say, hey, you shouldn't
necessarily shift,
and build in a
player-by-player strategy?
Are teams doing that?
ANDY ANDRES: Yes, absolutely.
If you watch baseball
a lot, you'll
see how every at bat
shifts the defense.
Every new batter
coming up, they'll
always have a different move.
Even within pitch
counts, they'll shift.
They'll move players
around in pitch counts.
And so they understand
the different pitch counts
and the batters impact where
you should be on defense.
It's that granular.
AUDIENCE: Cool.
That's pretty interesting.
ANDY ANDRES: I agree.
AUDIENCE: And then
the other question I
had was it seems
like this would be
pretty good information for
video game developers to have.
I'm curious if you
share a lot of this
and if they incorporate
it well or not well.
ANDY ANDRES: The data I
have is freely available.
Now, I'm not the best
web scraper in the world.
If one of you wants to
have a cup of coffee,
I have a couple web
scraping questions.
I'd love to-- I have a couple of
things I need to get better at.
But I don't know of any
game developers using this,
but it's freely available.
In other words, they
can find these datasets
to better do whatever
gaming they're designing.
It's out there.
And I would agree with you.
If I were a game
designer, I'd really
want to model the
game a lot better.
And these data are out there.
AUDIENCE: When I saw the graphs
about Daniel Bard's pitching,
two things jumped out
at me, and I wasn't sure
if you had a comment on them.
One, there's this old
thought that command
is everything for a pitcher.
If you don't have
command, it doesn't
matter how fast you throw,
and if you have command,
you can do well.
And then there's
the other one, which
is that it's all about
variance-- so if you lose
fastball velocity,
your changeup
can still be overpowering if
your changeup also slows down,
like Keith Foulke
or Trevor Hoffman.
But when I look at that graph,
one thing seems to hold up,
and the other one doesn't.
And I'm not sure if you
had a comment on that.
Because he--
ANDY ANDRES: On Daniel Bard?
AUDIENCE: Yeah.
Because he seemed to
retain a lot of variance
from 2009 to 2011, but he
clearly lost almost all
of his command.
ANDY ANDRES: The observation
is he lost his command.
If you watch what he did, he
just started walking batters
everywhere.
But that's not what this
dataset actually shows.
So what this is
trying to represent
is the movement of the pitches.
And that's his
ability to spin it.
AUDIENCE: But command is--
when you're a pitcher,
you're expecting the ball
to behave a certain way.
When I look at a graph like
this and I see a tight cluster,
it means that the
pitch is predictable.
And so the pitcher
can command the pitch
by assuming that it's
going to go a certain way.
And then if you go
two slides ahead,
as the fastball graphs
start to spread out,
it means that he's
walking players.
He's lost command.
But he still has a
lot of variation.
His slider is actually breaking
even more than it was before.
ANDY ANDRES: I think
this is not representing
the strike zone and command.
I could do another graph
which shows you how well he's
hitting his spots.
This is a measure of
movement of the pitch.
So we're not really getting
is he hitting his spots.
So you're right.
If I were getting at
command another way,
I'd look at how
well he's clustering
around a certain band where
I want the pitches to go.
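A minimal sketch of that alternative view of command, assuming we somehow had an intended target for each pitch (the dataset discussed here does not include one). The tuple layout, the inch units, and the numbers are hypothetical.

```python
from math import hypot
from statistics import mean


def mean_miss_distance(pitches):
    """Average distance between where a pitch was aimed and where it ended up.

    pitches: iterable of (target_x, target_z, actual_x, actual_z) in inches.
    A smaller value means tighter clustering around the intended spot.
    """
    return mean(hypot(ax - tx, az - tz) for tx, tz, ax, az in pitches)


# Made-up pitches: one misses the target by 5 inches, one hits it exactly.
pitches = [
    (0.0, 24.0, 3.0, 28.0),
    (0.0, 24.0, 0.0, 24.0),
]
miss = mean_miss_distance(pitches)
```

Tracking this number season by season would show a loss of command directly, separately from how much the pitches move.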
But you mentioned
something really important.
You mentioned this
idea of sequencing,
the idea of not
being consistent, not
showing the same thing.
Advanced analytics today--
and there's real teams who
understand this that
others don't-- this idea
of sequencing, this
idea of understanding
a pitcher-catcher-- being
in sync about a certain set
of pitches, different kinds
of pitches, different pitches,
different locations to different
batters and how that's working.
So sequencing is really
important-- really,
really important.
This graph is showing more
his ability to spin the pitch.
Something happened to
Daniel Bard after 2010.
And it's sort of a mystery,
because he had such talent.
It wasn't age.
It wasn't injury.
Something happened where he
just couldn't spin it as well.
He couldn't spin
the pitches as well.
AUDIENCE: [INAUDIBLE]
ANDY ANDRES: Yes, 2012
he started to start.
But it was actually his choice.
If you go back and look
it up, he wanted to start.
Very interesting scenario here.
Daniel Bard is--
someone is going
to write a great book
about Daniel Bard someday.
Maybe one of you.
AUDIENCE: So you showed a huge
difference between the stadiums
and the number of runs.
Do you have any idea
what's going on there?
Are these smaller stadiums?
ANDY ANDRES: Smaller stadiums,
but their biggest effect
is the players they have.
AUDIENCE: I see.
AUDIENCE: Toronto was
number four on that list.
ANDY ANDRES: Yeah.
Toronto was up there.
Baltimore was up there.
These guys-- they have players
who hit a lot of home runs.
AUDIENCE: OK.
ANDY ANDRES: But it
is also the park.
Trust me.
Fences in, altitude
up, more home runs.
But the players you have
also determine that.
One of the lower
teams was Atlanta.
Atlanta is-- I don't know
if anyone is a Braves fan,
but Atlanta is in--
this is a season where
they're rebuilding.
They have weaker hitters than
they would like, I'm sure.
And so they have low home
run rates at their home park.
AUDIENCE: Do you
know if anyone is
trying to design a ballpark
so that it would be more home
runs?
Is that allowed in the sport?
ANDY ANDRES: I think
people do that.
I think they do.
Because there was a famous
advertisement by two pitchers,
Greg Maddux and Tom
Glavine from Billerica.
And the advertisement was that
they're pitchers and all the fans
were appreciating all
the home run hitters.
So the phrase was everyone
loves the fly ball.
So home runs and
runs, I think people
thought that was a way
to bring fans back,
more interested fans.
Thank you for your attention.
[APPLAUSE]
