So a question that comes up regularly is,
how do I get started in data science?
What are the first steps to
learning the basics of the field?
What are some of the best
resources to go to? And for those who have
been in the field for a little
while, what are some of the resources
that can be used to stay
up-to-date? So that's what we
want to talk about in this particular
session. And so, if you're
just getting started, two excellent online
courses that you may want to explore are
Learning From Data from Caltech, and
Machine Learning from Stanford.
So, Yaser Abu-Mostafa at Caltech puts together
a great course that really touches
both on the basics of machine learning
and data science, and also throws in a
little bit of learning theory, so it
has a pretty strong
computer science approach.
Andrew Ng's course goes into a number
of additional classification
methods and techniques,
but also presents quite an array of
starting techniques that you can use
to get into the field.
I recommend both of these quite highly.
There are also a number of
books available,
some of which are even free;
I've noted those with the red dots here.
And if you look through some of these,
certainly An Introduction to Statistical Learning
and Learning From Data
are two great texts.
If you're going to start in one place, I
would recommend An Introduction to
Statistical Learning. That'll build up
the foundations for both regression and
classification, and talk through some of
the basics of evaluating the performance
of machine learning methods, so it's a great starting point.
As you start to go a
little bit further, there are a
number of more advanced texts: The
Elements of Statistical Learning,
Pattern Recognition and Machine Learning,
and Deep Learning. These build on
those basics and go into
more sophisticated techniques,
everything from hidden Markov
models on to deep neural networks,
along with the math
that goes with them too. And
there are a number of excellent applied texts:
Hands-On Machine Learning with
Scikit-Learn & TensorFlow
and Deep Learning with Python. Both of these
have a bit more of a Python twist
to them, and Python happens to be
the tool that I use the most.
But there are corresponding
books for users of R out there
that could be looked into as well. Depending
on what you're looking for,
the introductory texts listed
here are good starting points, and you can
move on to either the advanced texts or the
applied texts depending on what
you're interested in. I also put up a
link on this page to a website that
I constructed a couple years back that
lists a number of other freely available
data science resources and links to free
online texts, videos, online courses,
all of these types of resources
that may be valuable, so feel free
to use that if it's helpful for your
learning journey in this area.
Additionally, for staying up-to-date,
following a few blogs or newsletters
can be helpful here. A number that
either I subscribe to, or know others
who subscribe to, include Data Elixir,
Data Machina, the Data Science Roundup,
and the Kaggle newsletter.
For those of you not familiar
with Kaggle, Kaggle is an 
online machine learning platform.
This is a good website to go to;
you can certainly compete in online
machine learning competitions, which is
actually a good way to start
becoming more active in this
field, working on some real-world
problems, and a lot of them have
monetary prizes if you're able to win them.
Of course, it's a very, very
high bar these days for winning them,
but they also have some learning
competitions there that are useful,
and so I would certainly
recommend signing up for some of these,
because we all want a
little bit more email in our inbox, don't we?
But no, these are all really
useful resources. And then,
to stay the most
up-to-date, we need to follow the
latest research. One way
of doing that is to look at industry blogs.
These are kind of the
Bell Labs of the 21st century:
Google Research, Microsoft Research,
and DeepMind, which is now owned by
Google but has its own blog
there as well. And Kaggle, as I
mentioned before. And a number
of these research groups
are really producing cutting-edge
work in supervised, unsupervised,
and reinforcement learning,
and are well worth following.
And there are a number of academic
conferences that may be of interest too,
either to attend or to check out
the proceedings of, to see some of
the latest work coming from
both the industry research labs and
the academic institutions and national
labs around the country.
So, a lot of cutting-edge work comes
out at the annual International Conference on
Machine Learning (ICML),
as well as at NIPS, the Neural
Information Processing Systems Conference.
These are very hard conferences
to get papers accepted into, and so a lot
of the work that goes into them is
truly cutting edge. The International
Conference on Learning Representations,
another good option; it's always
worth browsing the proceedings there.
And the ACM SIGKDD Conference
on Knowledge Discovery and Data Mining.
There's also an applied track there,
which is interesting to check out.
Not that the others don't have applied papers,
but SIGKDD has a specific applied
track that may be of interest
to those with less of an academic
interest in these
types of technologies. The other thing to
follow if you really want to dive deeply
into the waters is arXiv. This is a
preprint, pre-publication online
repository for journal articles and
conference papers which haven't
necessarily been accepted yet or
are working papers. These have not gone
through the peer-review process, but
often are hot off the presses from
various research institutions,
so it can be useful to check out. Now,
arXiv itself is a bit of a rabbit hole,
with many, many papers being
uploaded each and every day, so
Andrej Karpathy created a tool for filtering
these, the Arxiv Sanity Preserver,
which is what's being shown here.
It makes it a little bit easier to go
through some of the most viewed and
cited papers: you can look at the most
recent articles, or some of the most
popular recent ones, so it can be useful
for filtering through the noise of that
many papers. If you really want to
deep dive into the latest and greatest,
arXiv and the Arxiv Sanity Preserver
are the way to go.
Additionally, there's often no better
source than going straight to the
researcher's mouth, right?
So here are a number
of folks who are leaders in
data science, machine learning, and
deep learning, all of them really pushing
the bounds here, and so
I would highly recommend
following these folks on your
favorite social media platform and
checking out their respective
organizations and research groups.
You have an assortment here
of some of the all-stars in the field.
You know, Geoffrey Hinton, Yann LeCun, Fei-Fei Li,
pushing forward on deep learning and computer vision,
along with Andrej Karpathy, Andrew Ng, and Yoshua Bengio,
all really doing cutting-edge work
in that space.
And then certainly, for
deep reinforcement learning,
there's David Silver over at DeepMind.
And if you're interested in exploring
generative adversarial networks (GANs),
which are an up-and-coming topic,
Ian Goodfellow has done a lot
of work in that space as well.
All of these folks really have been
building the foundations of
modern machine learning research,
and I can't recommend
following their latest work
highly enough.
How many R users do we have here?
Okay, Python?
Okay, MATLAB?
Ah! An even distribution. That is indeed
rare in a group.
So, these each
have their pros and cons,
and depending on your field's
specific needs and where the
different packages are for the work
that you're doing, any of these could
be a good option. The nice thing about
R and Python is that they're free.
The nice thing about MATLAB is that
all the packages just work together.
There are a lot of pros and cons
to each of these languages.
Python has certainly seen significant
increases in uptake in the machine
learning community, and it's probably the
most prominent of the three these days
in that space, but across
different disciplines, all of them
have significant representation.
You can see that here: this is a regularly
updated programming language index,
where the rating shows the
relative popularity according to
their methodology. Python, R, and MATLAB
are all pretty high up
on what is a very long list.
So, another thing: if you're going
to be playing around with data science
to any extent, you'll probably want
some degree of version control
at some point. It manages
changes to code over time,
enables collaboration and makes it easier,
and provides a complete
history of the changes made.
You don't want to have my_file.v1, .v2;
version control takes care of all that for you. The
two resources out there that are
very popular are Git,
which is the system that does it,
and GitHub, which is certainly a very
popular web-based hosting platform
for Git repositories. There's also
GitLab, which I believe
Duke has a site license for as well.
So those are some other options.
Another tool that may be 
of interest here is the Jupyter Notebook.
So, this nicely combines code,
which you can actually run
within the notebook,
with math and text in one easily
shareable document. It's great for making
websites, or for sharing a fully
described set of processes with a
collaborator. That's certainly an option
to help with that. And the next evolution
of the Jupyter Notebook is JupyterLab,
which gives you Jupyter
notebooks as well as a full
integrated development
environment. What does that mean?
Well, you have the option to have a text
editor open, as well as some other files,
and to see what some of the outputs may
be, all in one place, rather than
having 12 programs open at
once. So, this could be another useful
tool for you as well. And then, data
visualization, again, in terms of
communicating results: if we put these
tools on a scale of time-to-learn versus
ability to customize,
those usually go together.
We have Tableau, where you can quickly
load in data and make some
beautiful charts. If you want a
little more flexibility, there's ggplot2 for R
and matplotlib for Python.
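If you go the matplotlib route, a first chart is only a few lines. Here's a minimal sketch; the data is made up purely for illustration, and the off-screen backend is just so it runs without a display:

```python
# Minimal matplotlib sketch: a labeled scatter plot saved to a file.
# The data below is invented for illustration only.
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 60, 68, 74, 83]

fig, ax = plt.subplots()
ax.scatter(hours_studied, exam_scores)
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score")
ax.set_title("A quick matplotlib scatter plot")
fig.savefig("scatter.png")
```

From there, the same `ax` object gives you fairly fine-grained control over ticks, legends, annotations, and so on, which is the flexibility trade-off being described here.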
And then, if you are a hardcore visualizer,
you can go full D3 (Data-Driven Documents),
a JavaScript library, and make
incredible, New York Times-worthy
interactive visualizations. Okay,
so, we'll wrap up there. Any questions?
