I'm Hugh Brown.
I'm here at the Data Science Immersive at
Galvanize in Denver.
I've been here for the last thirteen weeks,
getting up to speed on data science.
I come to this from a long career in software
development.
I've worked on large data projects for a long
time, with large bodies of code,
in many, many languages.
For my capstone project here at Galvanize,
I worked on Bitly data that captured the Germanwings
crash of March 24, 2015.
If you're not familiar with it, basically
the copilot, shockingly, locked the pilot
out of the cockpit and intentionally
crashed the plane.
And this information wasn't initially known.
It was found by investigators over a period
of days and the story broke gradually.
So the focus of my project was to back
out the topics and the information, show that
you could rediscover the story from the data,
and show that the data was essentially coherent
with what came out in the media.
So Bitly, a company which does URL shortening,
provided me with a lot of data over the period
of March 25 to March 27.
I had the top ten minutes of each of 72 hours,
and each of those was a zipped-up file
which, when uncompressed, turned into a gig.
So I had around 72 gigs of data.
I was working with Python on AWS with MongoDB,
because that was the best risk-managed approach
to working with the data.
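To give a sense of what that looked like, here is a minimal sketch of loading one decompressed Bitly dump into MongoDB with pymongo. The file name, database and collection names, and the assumption that the dumps are newline-delimited JSON are illustrative, not necessarily exactly what I ran.

```python
# Minimal sketch: load one Bitly dump into MongoDB so it can be cleaned
# and queried later. File name, database/collection names, and the
# newline-delimited JSON layout are illustrative assumptions.
import gzip
import json

from pymongo import MongoClient

collection = MongoClient("localhost", 27017)["bitly_capstone"]["clicks"]

with gzip.open("decodes-2015-03-25-hour00.gz", "rt") as handle:  # hypothetical file
    batch = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        try:
            batch.append(json.loads(line))
        except ValueError:
            continue  # skip malformed records instead of aborting the load
        if len(batch) >= 1000:
            collection.insert_many(batch)
            batch = []
    if batch:
        collection.insert_many(batch)
```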
I spent a lot of time cleaning the data, throwing
out a lot of material that had to do with advertising
and was essentially unrelated to my story,
and pared it down to around 25,000 documents,
on which I ended up doing topic modeling.
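As a rough illustration of that topic-modeling step, here is a hedged sketch using scikit-learn's LDA on a tiny stand-in corpus; the real run was over roughly 25,000 cleaned documents and may have used different parameters or a different library.

```python
# Hedged sketch of the topic-modeling step using scikit-learn's LDA.
# The tiny stand-in corpus and the parameters are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "copilot locked the cockpit door before the crash",
    "investigators recover the flight recorder in the alps",
    "airline confirms the names of victims and their families",
    "copilot hid medical records and mental health issues from the airline",
] * 50  # stand-in for the cleaned documents

vectorizer = CountVectorizer(stop_words="english", min_df=5)
doc_term = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=4, random_state=0)
lda.fit(doc_term)

# Show the top words per topic to see what each topic is "about".
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:8]]
    print(f"Topic {idx}: {', '.join(top)}")
```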
The topics were, in fact, largely siloed
by language, and the languages my data came
in were Danish, English, French, German,
Spanish, Dutch, and Russian.
There was a lot of data in different languages.
But apart from the single dominant topic in each
language, there was also a kind of secondary topic
on which you could draw a couple of distinctions.
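That per-language siloing falls out naturally if you split the documents by language before modeling. Here is one hedged way to do that with langdetect; the language may also be available directly in the Bitly records, so this is only one option.

```python
# Hedged sketch of grouping documents by language before topic modeling.
# langdetect is just one option; the Bitly records themselves may carry a
# language or country field that serves the same purpose.
from collections import defaultdict

from langdetect import detect

documents = [
    "the copilot locked the pilot out of the cockpit",
    "Der Copilot soll die Tür zum Cockpit verriegelt haben",
    "Le copilote aurait verrouillé la porte du cockpit",
]

by_language = defaultdict(list)
for doc in documents:
    try:
        by_language[detect(doc)].append(doc)
    except Exception:
        by_language["unknown"].append(doc)  # very short or odd texts can fail detection

for lang, docs in by_language.items():
    print(lang, len(docs))
```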
A couple of the interesting things which came
out of it were that you could clearly see
that the focus in the German stories was much
more on family, on the investigation, and
on the relatives; it had a different
angle.
And on the American side, the leading topic
that came out of the data was an interest
in the copilot and the mental health issues
associated with him, and the fact that he
hid his health records.
As I was saying earlier, the information came
out over a period of time.
And you could clearly see spikes in the data
traffic as topics came out.
And if you looked at those topics, you could
see that they were actually related to specific
stories which came out on those days, typically
in those languages, although predominantly
in English.
The other obvious thing you could back out
of it was that interest developed across
time zones.
While America is sleeping there isn't as much
interest, but as soon as the United States wakes
up on the East Coast there is interest in
this story, and the new developments get new
traffic.
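One hedged way to see those spikes and the time-zone effect is simply to bucket click timestamps by hour. The 't' Unix-timestamp field name and the MongoDB names below are assumptions about the record layout, carried over from the loading sketch earlier.

```python
# Hedged sketch: count clicks per hour to see traffic spikes and the
# East Coast "waking up" effect. The 't' Unix-timestamp field and the
# MongoDB names are assumptions about the record layout used earlier.
import pandas as pd
from pymongo import MongoClient

clicks = MongoClient("localhost", 27017)["bitly_capstone"]["clicks"]
timestamps = [doc["t"] for doc in clicks.find({}, {"t": 1}) if "t" in doc]

series = pd.to_datetime(pd.Series(timestamps), unit="s", utc=True)
per_hour = series.dt.strftime("%Y-%m-%d %H:00").value_counts().sort_index()
print(per_hour)  # spikes should line up with new developments in the story
```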
So largely the data is coherent with
what I know from the media, and you can see
the two in parallel.
A lot of the value in the project was in the
fact that there was a large amount of data.
Ordinarily, if I were doing this
for a professional job, I would have used
a slightly different toolset.
I probably would have gone with Spark on
AWS.
But my lack of deep familiarity with those tools
meant that, to manage the project's risk
properly, I needed to use the tools
I was most familiar with in the near term.
That still meant I ended up using
Amazon Web Services, which is a valuable toolset
for industry.
I would have preferred to work with labeled
data, but I didn't have labeled data,
so I was doing unsupervised learning.
The main thing I learned from it was how
important it is to have a good data pipeline,
and to manage the information you extract
and the way you manipulate it
so you can recreate it reliably.
At times, I had to go back and redo things
so that I had a repeatable pipeline
for the data.
And that was really the most important thing
to come out of it.
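To make that "repeatable pipeline" point concrete, here is a small hedged sketch of the idea: each stage reads its input from disk and writes its output back to disk, so any stage can be redone on its own. The stage names, file paths, and stub transforms are illustrative, not my actual capstone code.

```python
# Hedged sketch of a repeatable pipeline: every stage persists its output,
# so any step can be re-run without re-running everything before it.
# Stage names, file paths, and the stub transforms are illustrative only.
import json
from pathlib import Path

def run_stage(name, func, in_path, out_path):
    """Run one stage on the persisted input and persist its result."""
    data = json.loads(Path(in_path).read_text())
    result = func(data)
    Path(out_path).write_text(json.dumps(result))
    print(f"{name}: {in_path} -> {out_path} ({len(result)} records)")
    return result

# Seed a toy raw file so the example runs end to end.
Path("raw.json").write_text(json.dumps([{"u": "http://example.com"}, None, {"u": ""}]))

run_stage("clean", lambda records: [r for r in records if r and r.get("u")],
          "raw.json", "clean.json")
run_stage("extract", lambda records: [r["u"] for r in records],
          "clean.json", "docs.json")
```

The point of persisting each stage's output was that redoing one step didn't mean redoing everything upstream of it.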
As a matter of fact, I tried a lot of different
ways to get industry data, and going with
Bitly was the best approach.
