LUIS BITENCOURT: All right.
This work?
Cool.
So Luis.
I work at Reddit.
And we do know our page views,
I just can't share them all.
But sorry in advance if I cough.
I'm working through a
cold from the weekend.
But started in data.
My first gig out of college
was building Excel Services,
so Google Sheets before
Google Sheets was out.
And these Google guys love that.
And then built out the Office [INAUDIBLE] team, which built the data pipelines for Office. So every time you saw a crash, or any of the quality data coming out of Office, that team built the pipelines to get all of it back to us. Office had about a billion users, so it's pretty high scale, especially when you're deciding what data to collect from people, which is a pretty big data [INAUDIBLE].
Did a little startup
down in LA for a while,
and then was going to do something in the data space as a new startup. Then I ended up kind of connected with one of the partners at Andreessen, who introduced me to Reddit and was like, well, if you want to build an AI data startup, you could go and grind for three years and try to get enough data to make it a data problem, or you could just take Reddit's, which at the time I think had three data engineers, no search team, and a handful of people on the relevance team. So I took the challenge.
I like the hats.
Not enough to wear them,
because my team is actually
called the Reddit
Intelligence Group, which
is an acronym for RIG.
My title on Slack is the
RIG Tool Pusher, which is--
it's actually the
director of an oil rig.
So I very much subscribe to that analogy there.
And that's me.
Oh, and I guess Google--
so we're trying out Google
in a number of things,
largely around Analytics.
We use mostly BigQuery, which at
Reddit scale is actually a lot.
So we're taking small chunks of the stack but pushing huge amounts of data into it.
ROBIN LI: All right, Robin Li.
I work for Tapjoy, which is an advertising monetization platform.
In terms of the
data scale, we're
dealing with around
200 terabytes
to 300 terabytes active data.
And the daily data addition is about 20 terabytes compressed. So my data journey started at an investment bank back in 2006.
I worked there for
about six years.
I did everything you can
imagine with an investment
bank, starting from
building up trading tools,
building risk modules
and then end up
with sitting on trading desk
for two years in their New York
office.
Then I got a little bit bored because of the economic crash in 2008. We literally didn't have any commissions or bonuses for two years, so I decided to move on.
And I'm a computer major, so I
feel like my passion is still
within the technology world.
So I moved from East
Coast to the West Coast.
And I started to lead a
team of data engineers
as well as data
science engineers.
So what we do right now is
we are not just moving data
around, we are also responsible
for the entire data stack, data
platform, starting from
the data infrastructure,
building up the servers,
moving data around
to the data warehousing,
using the data,
making sense of the data,
as well as training models and serving product decisions on top of it.
Right?
So that's pretty much about me.
Oh, Google Stack.
So we started using Google in 2015.
Part of the reason is we were
running a very beefy Vertica
cluster.
I'm not sure if you
guys heard of Vertica.
That was a commercial-licensed, enterprise-scale MPP system, I would call it.
And our cost was around
a million dollars
to a few million dollars a
year by the perpetual license
hardware cost as well
as the support cost.
So we were thinking of moving on to some cheaper solutions. We were looking at, let's say, AWS Redshift, Snowflake, as well as Google BigQuery.
I'm not sure if you
remember, but two of us
actually interact a lot on
Stack Overflow regarding
the early questions we
have on Google BigQuery.
So that's actually how we got started with BigQuery back in 2015. And then in 2017, just this year, we moved our entire Hadoop and Spark stack from an Oracle on-premises data center into Google Cloud.
We just [INAUDIBLE] in October,
so it's pretty impressive.
All right, that's me.
Thank you.
ANDREAS SEKINE:
Hi, Andreas again.
So what first got me
interested in data, I think,
was at my previous company
we had a lot of rich access
to user data.
And it felt like we were really
heavily underutilizing that,
and it just kind of was an itch, that when you have all this data you should be able to get a lot of value [INAUDIBLE] out of it.
And then coming to Thumbtack, where we did have a data platform team, I started to really appreciate that when you have the proper tooling and infrastructure to really make use of your data, you can use it to drive really informed decisions and gain some insights about how your product is being used in ways that you couldn't have otherwise.
As far as using GCP, we
use it pretty heavily now.
Our data platform is almost
entirely on Google Cloud.
We heavily use
Dataproc and BigQuery.
We also use Pub/Sub,
that's driving
that map right there, along with
Spark Streaming on Dataproc.
Yeah.
ANDREI SAVU: Cool.
So talking about
your data platform,
can you do a parallel between,
like, before and after
and what do you think that
changed for the better
as you evolved your
platform over time?
ANDREAS SEKINE: Before
and after-- before GCP?
ANDREI SAVU: Before and
after GCP, or, like,
the earlier stages of
GCP as you ramped up
in using different services.
ANDREAS SEKINE: So before
GCP we had a Hadoop cluster
that we were managing
ourselves in AWS,
and that worked very well for
a year and a half, two years.
We started hitting some real growing pains when it came to scale, and a lot of static provisioning: every time we needed more capacity we'd have to manually provision a new cluster, bring it up, and do an HDFS rebalance. It would be a couple of days before it could start being utilized.
We had very spiky
utilization of our resources.
So all of the ad hoc
analysis or ad hoc queries
that were being run were
being run on Hive and Impala.
Those would conflict because they were sharing resources with YARN and the Spark jobs that were running.
So just this really
inflexible infrastructure
made it so that we
wanted something
that could grow more dynamically
as our data would grow.
And GCP had some
very nice offerings
in terms of having
Dataproc clusters that
can be ephemeral and be spun
up in 90 seconds, run your job
and spin back down.
BigQuery, where we can just kind
of pass the buck on to Google
and let them deal with
all the operations
overhead, because being on call
at this time was a huge pain.
It would be, basically
most of your week on call
would be just putting out fires.
So it was a huge win for us.
FELIPE HOFFA: Cool.
How about Tapjoy?
What's your, like, the--
ROBIN LI: Yeah.
I think a lot of people bring up a very good point, which is, you know, data [INAUDIBLE] is not only about moving data around. It's also about what tools to choose and how to utilize them, right?
And I think up to that
point we previously
had a very large on-premises
data center we built ourselves.
We basically bought a couple million dollars' worth of hardware, strung it together, and outsourced to a company to do all our offline smart [INAUDIBLE] work. And we threw it into a [INAUDIBLE] data center right next to [INAUDIBLE], connected internally through a direct-connect fiber link.
All of that work is
pretty cool, but it
has a lot of operational
overheads, right?
And we had our 300- to 400-machine Hadoop production cluster, typically sitting on bare-metal hardware. If any of those machines went offline, we had to deal with them, right?
And then after we started looking at BigQuery, after we moved our entire MPP solution into BigQuery, we realized that for a company the size of Tapjoy, which is around 200 to 300 people, doing an on-premises data center is pretty impressive, and that's one thing. But the other thing is we need to think about operational overhead as well as the true running cost of your infrastructure.
So that's why we started looking at moving our entire infrastructure to the cloud, and we did a lot of benchmarking and comparison. My Google account reps helped me a lot with that. And at the end of the day it worked really well.
So we migrated our entire Hadoop infrastructure from our on-premises data center into Google Cloud. But a little bit different from this gentleman's usage: we didn't use Dataproc. We basically built our own elastic Hadoop/Spark cluster on Google Cloud, utilizing the cloud's power for a very spiky workload.
That's before and after.
ANDREI SAVU: Cool.
LUIS BITENCOURT: So we
were on Hive before.
I've always been a fan of not
building my own infra when
I don't need to.
Like, data infrastructure
is not going
to be the differentiator
for Reddit,
it's not going to be what
makes it succeed or not.
So we were using Qubole
which is kind of cool,
but the difference between
Qubole and BigQuery
is essentially like,
picture driving a manual car
in the '60s when you had to do
a lot of work and maintenance
and checking the engine oil
and doing all that stuff,
versus an automatic car
in this day and age where
you don't have to take it in for a checkup until, like, 50,000 miles. So it's even less overhead on
the engineering side, and it
helped with some of the
speed concerns that we had.
That said, you have to actually structure your data cognizant of some of the limitations, especially when you have the amount of data that Reddit has.
Because even
BigQuery has limits.
Everyone has limits.
Uh, yeah.
Sorry.
So there is some thinking that
you have to do around that.
And there are some
differences with the abilities
that you have in Hive
versus what you have
in BigQuery to acknowledge.
But it's certainly a lot
less maintenance overhead
than we had even in Qubole,
which theoretically is much easier than running your own Hive clusters.
But it was still-- basically, the speed of iteration was the main reason to try it out.
Being able to write a query,
run it and get it back really
quickly versus waiting two
minutes for Qubole Hive clusters to spin up, or
having Presto clusters
that were big enough and
that your SQL query was
optimized enough to actually
be able to run against Presto.
BigQuery just took some of those concerns away for us.
And there were more
tools that out of the box
were built to support it.
So things like Mode Analytics, Looker, Periscope Data, which you can just point at BigQuery in the cloud, versus having to figure out how you get the clusters that Qubole spins up somewhere to feed into some of these data analytics tools.
ANDREI SAVU: Awesome.
LUIS BITENCOURT: Why
do you use BigQuery?
[LAUGHTER]
FELIPE HOFFA: I tried it.
[LAUGHTER]
ANDREI SAVU: I mean, for
Felipe, if you can give one
example where you've
seen, like, a very
interesting transformation for
a customer going from something
more cumbersome, maybe, to a
much more streamlined solution?
FELIPE HOFFA: So let me bring
the before and after to the--
every person is a data
scientist, especially people
here, and how these
tools enable laziness.
Like if I have a question, if
I want to find out something
about Reddit or Stack
Overflow or Hacker
News or any other site
that looks interesting,
if I have to start by scraping, I just need to have a lot of energy to get through that part first. That's before.
Now I just need to know
that that data is available
somewhere.
It's ready on BigQuery.
I can just connect and start
asking questions right away.
And I think that's how Reddit became aware that BigQuery existed.
Basically there was this guy
Jason Baumgartner in Washington
that started scraping
every Reddit comment.
Finally, he started
sharing that data.
That data ended up in BigQuery.
And then one day Reddit itself
sees that that data is there
and it starts writing some queries
without any other effort
other than the data is already
there for you.
And so it's useful,
it's insightful, it's--
I didn't have to
prepare anything
to be able to analyze it.
ANDREI SAVU: Cool.
So let's talk a little
bit about workloads.
The type of workloads that
this technology powers.
Just to get some examples on
where does the end value come
from, and kind of like
in terms of the data that
flows through the
system, what's that data?
What does it represent for you?
And for Felipe it's more on,
like, interesting examples.
Like you showed an
example with taxis.
What else have you
seen that's more,
like, transformative in nature?
FELIPE HOFFA: Let me
give the microphone
to the actual users first.
LUIS BITENCOURT:
So right now we're
using it primarily
for analytics,
so how many page views we
get, who's looking at what.
Like, what kind of aggregate
information we have,
how feature users are running.
I'm actually working on
getting availability numbers
into there, so getting HE proxy
logs parsed into BigQuery,
so we can see by service
what the availability is.
A lot of business metrics.
OKRs that we track.
Time on site.
You know, DAUs, all
that kind of stuff.
Some of the more-- and then
there's some interesting, like,
investigations.
If you look at things
like anti-evil [INAUDIBLE]
around spam [INAUDIBLE] botnet
detection, vote manipulation.
There's a lot of
information there
that the difference in
that kind of workflow
is that it's very ad hoc.
So if you had a way that
you could automatically
detect bots or vote manipulation
that was 100% accurate,
you would just productize
it and then block them.
Which we do, but
it's an arms race,
so then they change things.
So you need a much
more ad-hoc flow
where we start seeing some
weird behavior on the site
and you just have
to dig into the data
and pivot it a number of ways
and try to figure out, OK,
what are the common themes?
What are the common threads?
Is it actually a botnet?
Is it a sleeper cell?
Are there different things
that people are doing here?
And that really helps kind of
that ad hoc [INAUDIBLE] type
of queries.
So it's been super
helpful there,
where iterations on Hive would take a couple of minutes each, whereas, you know,
if you're doing it just
on a small time range
it takes a couple of seconds
to get an answer for looking
at enough data where
you-- like, things where
you need aggregates, right?
Like to know if I have a
vote manipulation or botnet,
I can't look at, like 100 votes.
I have to look at
millions of votes
and start seeing what
the patterns are.
So that's been
super helpful there.
So essentially right
now, business metrics,
KPIs, understanding how
people are using the site, how
the site is behaving.
And then as well as
ad hoc investigations
on different things around
the anti-evil space.
There's a couple of interesting things we've built.
I have one query that
actually gives me
a listing, the same thing as you see when you go to the front page of Reddit. So it's sorted by our hot algorithm now, or some models that we're starting to experiment with.
But with any analytics
event that I have,
I have a query that I can
just point to that event
and generate a front
page based on that query.
So I could see, like, what
would the front page of Reddit
it look like if what we used
were shares to rank content?
Or views from
different countries?
Or number of comments?
Those kind of things.
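The front-page experiment Luis describes, re-ranking content by an arbitrary analytics event, can be sketched roughly like this (the event names and data shapes are made up for illustration; Reddit's real version is a BigQuery query):

```python
from collections import Counter

def rank_front_page(events, metric, top_n=25):
    """Rank posts by the count of one analytics event type.

    `events` is an iterable of (post_id, event_type) pairs;
    `metric` is the event to rank by, e.g. "share" or "comment".
    """
    counts = Counter(post for post, kind in events if kind == metric)
    return [post for post, _ in counts.most_common(top_n)]

# What would the front page look like if shares ranked content?
events = [
    ("p1", "view"), ("p2", "share"), ("p2", "share"),
    ("p3", "share"), ("p1", "comment"),
]
print(rank_front_page(events, "share"))  # ['p2', 'p3']
```

Swapping `metric` for views from a given country, or comment counts, gives the alternative front pages he mentions.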
ANDREI SAVU: Cool.
ROBIN LI: For us, pretty much the same, right?
We mentioned we use it
for ad hoc analysis,
a lot of ad hoc
queries, as well as
reporting that drive insights.
We also use BigQuery for lightweight ETLs. For example, we do a lot of modeling in our Spark and Hadoop cluster, but we use BigQuery to extract a lot of features and do a lot of the data cleanup, eventually feeding into the Hadoop and Spark cluster for model training.
I think there is one
more thing, which is
pretty interesting in BigQuery.
It lives very well with the other Google products, such as Stackdriver, which produces log [INAUDIBLE], as well as all of the APIs that Google supports. One other very interesting thing is that BigQuery is priced at the per-query level: how much data you scanned per query. So it's eliminated a lot of DBA work and infrastructure operational work.
But there's a new type of work which is introduced by BigQuery, which is how do you curb your costs, right?
So how do you keep your
costs at a minimal level,
making sure your users are
running meaningful queries
to get their results?
So we are calling all the Google APIs to get all the query usage at the per-user, per-query level and normalizing them. You know, stripping out the dates part, the common [INAUDIBLE] parts, and putting them into a BigQuery table for analyzing BigQuery costs, which actually--
yeah.
It's kind of like a feedback
loop, stuff like that.
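A minimal sketch of the feedback loop Robin describes, assuming job metadata (user, SQL text, bytes billed) pulled from the BigQuery jobs API, and the on-demand price of $5 per terabyte scanned that applied at the time; the normalization here only strips date literals:

```python
import re
from collections import defaultdict

ON_DEMAND_USD_PER_TB = 5.0  # BigQuery on-demand pricing at the time

def normalize(sql):
    """Strip date literals so repeated runs of a templated query group together."""
    return re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", sql.strip().lower())

def cost_report(jobs):
    """Aggregate cost per (user, normalized query) from jobs API records."""
    totals = defaultdict(float)
    for user, sql, bytes_billed in jobs:
        totals[(user, normalize(sql))] += bytes_billed / 1e12 * ON_DEMAND_USD_PER_TB
    return dict(totals)

jobs = [
    ("ana", "SELECT x FROM t WHERE d = '2017-10-01'", 2e12),  # 2 TB scanned
    ("ana", "SELECT x FROM t WHERE d = '2017-10-02'", 3e12),  # 3 TB scanned
]
report = cost_report(jobs)
# both runs collapse onto one normalized query, $25 total
```

Loading a table like this back into BigQuery closes the feedback loop he mentions.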
LUIS BITENCOURT: We spent
10,000 bucks on one dashboard
last year.
ROBIN LI: Yeah.
Or if you're not careful, your cost for one particular bad query could blow up pretty fast.
FELIPE HOFFA: Please turn
on your cost controls.
They are there.
ANDREI SAVU: But it's also, like-- well, the one thing I want to add here is being in a place where you have these knobs is fairly empowering, right?
Being able to decide
that I want to spend
more when I need to spend more.
ROBIN LI: Yeah.
BigQuery definitely provides all the knobs and controls for cost control and access control.
I'm just giving an example here.
ANDREAS SEKINE:
And we definitely
saw some unexpectedly high
BigQuery costs at first
and realized it was because we were putting these powerful tools in people's hands,
and they want to use them
and do new things with them.
So also educating
users, like, maybe you
don't need to do select
stars, or maybe you
should use this
partition time column.
And that helps a lot.
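That user education can even be partially automated. A toy linter for the two habits Andreas calls out, `SELECT *` and missing partition filters, might look like this (`_PARTITIONTIME` is BigQuery's real pseudo-column for day-partitioned tables; the linter itself is hypothetical):

```python
import re

def lint_query(sql):
    """Flag the two habits that most commonly inflate BigQuery bills."""
    warnings = []
    s = sql.lower()
    if re.search(r"select\s+\*", s):
        warnings.append("SELECT * scans every column; name only the columns you need")
    if "_partitiontime" not in s and "_partitiondate" not in s:
        warnings.append("no partition filter; the query scans every day of data")
    return warnings

print(lint_query("SELECT * FROM events"))  # both warnings
print(lint_query(
    "SELECT user_id FROM events "
    "WHERE _PARTITIONTIME = TIMESTAMP('2017-10-01')"))  # []
```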
So yeah, I think we use
BigQuery in similar ways.
It's really enabled
some new avenues
to do really fast analytics on
large datasets that we didn't
have the capability of before.
I'd say one of the things that
I enjoy the most about GCP
is using Dataproc, because
we use it so heavily.
Having any developer
able to spin up
an arbitrarily sized cluster to test their job,
to not have to wait for this
big Spark batch job to run,
has been really powerful
and really enabled
faster iteration.
So as a developer that's
become a huge boon for us.
ANDREI SAVU: Awesome.
FELIPE HOFFA: To give my
[INAUDIBLE] users, what
I'm pretty proud of, being
part of Google Cloud,
is that, as you listen to all of these examples of how they are solving problems at their huge scale, everyone here has access to the same tools.
You can basically run
the same processes,
you can prepare yourself,
work at the same scale.
Limiting cost is an issue, but
it's not a hard issue to solve,
you just put the
cost controls there.
But just having access to the
same size of data, same tools,
can really, one,
serve you for fun.
But then if you
want to work at any
of these places or many
more, you're ready.
You have everything.
You've tried everything and you
are speaking the same language
from day zero.
LUIS BITENCOURT: And
I'll add to that.
I mean, cost is only an issue because-- like, the $10,000 dashboard was a [INAUDIBLE] dashboard which was set to refresh every five minutes and was scanning every single screen view of Reddit.
Like, that's just dumb.
[LAUGHS] So it's not really--
and the amount of data
is ridiculous.
Like, most people, if you're
playing with it, like,
you might spend a
few bucks if you're
using on your own projects
or on most things.
It's when you're talking
about, like, a top-four website
in the US, then you're going
to have those kind of things.
But at that level, 10,000 bucks
is actually not a lot of money.
ANDREI SAVU: OK, cool.
So we talked a lot
about past, present,
let's talk about the future.
So assuming you have
this magic wand,
what are the areas in which you
are investing moving forward?
And then what do you think
is going to make a material
difference for you?
LUIS BITENCOURT: Yeah.
So I'm excited about the
ML stuff on Google Cloud.
So when you talk about
people having the same tools
and being able to work across
companies, that's actually
one of the pieces that
was kind of surprising
and as a positive, was a
lot of the ML community
is actually using Google
Cloud on their own projects
because it's a little bit
more developer-friendly.
It's a little bit
easier than AWS.
AWS is probably the least
developer-friendly thing
out there.
So that was really helpful
because we can go and recruit.
And the folks who we're
recruiting actually
get excited that we're open to
exploring that in the future.
So having your data,
that's one step.
Cleaning it up is another one.
And Google certainly has a
leg up in a lot of the ML APIs
and functionality
that we might want
to start playing with
and using in the future.
So that's an area
I'm excited about.
I think another one is just
continuous improvements
on BigQuery.
There's definitely
still low hanging fruit
even though it's a big area
where we could significantly
increase the speed of
queries, cost reduction,
even the JavaScript UDF
stuff is pretty powerful what
it can do there.
So we know we're just
scratching the surface.
We're dumping data there and
then running queries on it.
There's a lot on the whole
Google Cloud ecosystem
that I think we'd be
really interested to start
playing with.
ROBIN LI: A very similar observation on my side. We have the MPP system in BigQuery.
We have our data lake on GCS
and we have our entire Hadoop
on GCE.
So now what?
Great.
So we have all the data, we have
all the pipelines move over.
The next step for us is to translate our simple machine learning models, such as [INAUDIBLE], over to deep learning models. So we've been starting to look at the TensorFlow solutions offered by Cloud ML, and our POC deep learning models produce very impressive results. But I think the biggest challenge for us is not how to train a model on Google Cloud, it's how to serve it in real time back to our customers. That's a real challenge for us. So having that in mind, the next step for us is definitely gradually moving from a Spark-driven machine learning path over to a deep learning-driven path on our side. That's pretty much it on my side.
ANDREI SAVU: Awesome.
ANDREAS SEKINE: I think I'm in the same boat as the two of you.
One additional thing
that I'm excited about
is the unification of batch
and streaming paradigms.
I think kind of making the
leap to a streaming world
is pretty daunting.
And it takes a lot
of upfront thought
about how to make that switch.
And with the rise of Apache Beam
and Spark structured streaming,
that's becoming a lot easier,
and I'm excited to see that.
ANDREI SAVU: Awesome.
So--
FELIPE HOFFA: Just to
give my little bit.
Yes.
Machine learning is completely the future and what everyone is really focused on right now. Everyone wants to learn TensorFlow, et cetera, et cetera. So that's pretty exciting, to see how things are going to get easier to use. But then at the same time, I personally have not moved into the machine learning realm.
I just think there is so
much more to do with data.
Anything with machine learning will require data engineers.
That's the name of the
Meetup, how do we move data?
And I'm especially excited
about real time data.
So we have [INAUDIBLE]
now available.
I would really love to
see more public feeds
of fresh livestreamed data and
people doing things with it,
building pipelines over it.
Totally.
Machine learning, that's the hottest one.
ANDREI SAVU: OK, so I will
ask, like, [INAUDIBLE]
questions to go around.
But then I want you to
think about your questions
as we close down this session.
So my final question for you is, if you were able to go back, based on what you know today, and give yourself a bunch of advice, what would that be?
LUIS BITENCOURT: Think a little bit harder about how to spread apart your datasets. We put everything, all of our analytics events, in one dataset.
With BigQuery you
don't get the ability
to do custom partitions
like you do in Hive,
or hierarchical
partitions, so everything
is partitioned by day.
And it turns out that we just
generate too much data per day,
even at BigQuery scales,
to be able to scan
through one day's worth of
data in a reasonable time.
So leveraging different
datasets as basically one level
of partitioning
your data would have
been a way to be much faster.
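One way to sketch that advice, splitting by event type as an extra partitioning level on top of BigQuery's day partitions. The `table$YYYYMMDD` partition decorator is real BigQuery write syntax, but the table naming here is hypothetical:

```python
from datetime import date

def destination_table(event_type, day):
    """Route each event type to its own table, then into a daily partition.

    A query over one event type then scans only that type's slice
    of the day instead of the whole firehose.
    """
    return f"analytics.events_{event_type}${day:%Y%m%d}"

print(destination_table("vote", date(2017, 10, 1)))
# analytics.events_vote$20171001
```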
We were way too much into
the denormalized camp
and avoiding joins due to the
slowness of our Hive solutions
there.
And I have not had any issues
with joins in BigQuery,
so I would be more open to
doing joins and separating out
my data.
ANDREI SAVU: That's a refreshing
thing to hear about big data
and joins.
Big data and joins used to be incompatible.
ROBIN LI: I think for me, if I could do it over, I would probably want a lot more and richer data visualizations to tell the story of how we're dealing with data, how we're using data to drive insights that drive the revenue of the company.
I think that's very important, because your data engineers do a lot of dirty work and grunt work, but in order for your company to be a data-driven company, you definitely need a lot of insights and a lot of visualization of your data. Not just data sitting in a spreadsheet.
ANDREAS SEKINE: I think along
the same vein as visualization,
we had a bug a little while
back where some of our front end
servers stopped sending
event data to the back end
and no one noticed for a while.
For an embarrassingly long time.
So realizing that not all
failures are catastrophic
and that they could
be partial or subtle
and having richer monitoring and alerting
to make sure that your data
is trustworthy and safe
is something that is kind
of obvious in hindsight.
ANDREI SAVU: Sweet.
So any questions
from the audience?
I think we all have--
LUIS BITENCOURT: We
answered everything.
AUDIENCE: Hi.
So my question was sort of
on the data governance side.
And you mentioned in
terms of understanding
which visualizations or business
questions you need to answer.
What are some process-level
mechanisms or actual tools--
like, do you have a
glossary of certain reports
that you guys certify?
Or what are some of the ways
you kind of manage, you know,
we are getting
accurate data and it
should be used for X, Y and Z?
FELIPE HOFFA: You want to
repeat that question again?
LUIS BITENCOURT: Oh.
Yeah.
Does that work?
So the question was
on data governance,
and I'm assuming, leading
into the QA side of it,
things that break.
But I'll add a
little bit more, too.
There's all-around a
PII protection of data.
While analytics is
great, you probably
don't want everyone
in the company
to have access to-- like,
on Reddit, for example,
not everyone in the company
should know every single post
that a particular
account can see, right?
That's kind of creepy.
There's IP information,
PII information,
or a privacy policy.
I think it's like 90 days where
we can keep PII on people.
So things like an IP
address we keep for a while,
but then we have
to go and mask it.
So what happens when the scripts
to mask it or the items to mask
it actually break and
you're no longer masking it?
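A minimal sketch of that kind of masking job (the salted-hash approach, field names, and constants are illustrative, not Reddit's actual pipeline), including a check for rows a broken run left unmasked:

```python
import hashlib
from datetime import date, timedelta

PII_RETENTION_DAYS = 90  # the retention window mentioned in the discussion

def mask_ip(ip, salt="rotate-me"):
    """Replace a raw IP with a salted hash: joins on the column still
    work, but the original address is no longer stored."""
    return hashlib.sha256((salt + ip).encode()).hexdigest()[:16]

def rows_to_mask(rows, today):
    """Find rows whose IPs have aged past the retention window and were
    never masked -- exactly the silent failure mode described here."""
    cutoff = today - timedelta(days=PII_RETENTION_DAYS)
    return [r for r in rows if r["day"] <= cutoff and not r["masked"]]

rows = [
    {"day": date(2017, 7, 1), "ip": "203.0.113.7", "masked": False},
    {"day": date(2017, 10, 1), "ip": "198.51.100.4", "masked": False},
]
stale = rows_to_mask(rows, today=date(2017, 11, 1))
for r in stale:
    r["ip"] = mask_ip(r["ip"])
    r["masked"] = True
```

The point of the second function is that the masking job itself needs monitoring: if `rows_to_mask` ever returns a large backlog, the masking pipeline has been silently broken.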
So a lot of data governance on Reddit, at least,
is, first you've got to have a
breakdown of what data you're
collecting, where you're
putting it and where it lives.
One conversation we
were just having is,
some people's [INAUDIBLE]
input data in different places,
so there's also some ideas
of, OK, at some point,
we need to start caring
about where the data lives
outside of that to
make sure that we're
protecting our users,
protecting their information.
But the first step is an audit.
Take an audit of where you are.
Actually, the same way that
you get third party security
audits, there's third
party data audits.
So if you're at that
level of data governance,
you can get a third party to
come in, look at your data,
see what you're storing,
where you have it,
where data leakage is.
And it's actually not that
hard to accidentally leak data,
at least within your company.
There's so many processes.
There's things that go to an
event collector, then a Kafka
pipeline, and then we dump
it into S3 or GCS and then
BigQuery.
And any of those, if all
of a sudden other people
have access to it
for other purposes,
now they can pipe in and
look at data that they
shouldn't be looking at.
Then there's the
whole quality side.
We've had the same bugs, where the Android Reddit app stopped sending events.
I don't think it
was screen views,
but one of our core events,
it just stopped sending
and we didn't notice until a
couple of weeks after the build
was out because the adoption
curve was a little bit smaller
but we didn't have
a simple thing
like a dashboard of
build-to-build or
deploy-to-deploy, what
are your key events
and which ones have
significant changes over those?
So things like
that is another way
to have governance,
but first step, audit.
And the second step is to have it automated, to have monitoring over it.
So you have to have a set of,
what are my golden events?
What are the ones
that can't break?
And there should be a test
for every single one of those.
If there's ever a deploy
and your delta of those
goes above a certain amount,
you've got to fire an alert.
If it's not automated, if you're
relying on people to notice it,
you're not going to
catch it and you're
going to miss those bugs.
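The golden-events check Luis describes can be sketched as a simple relative-delta alarm over per-event counts before and after a deploy (thresholds and event names here are illustrative):

```python
def check_golden_events(before, after, max_delta=0.2):
    """Compare per-event counts across a deploy; alert on big swings.

    `before`/`after` map event name -> count over comparable windows.
    A relative change beyond `max_delta` fires an alert.
    """
    alerts = []
    for event, old in before.items():
        if old == 0:
            continue
        change = abs(after.get(event, 0) - old) / old
        if change > max_delta:
            alerts.append((event, change))
    return alerts

before = {"screen_view": 1_000_000, "vote": 200_000}
after = {"screen_view": 990_000, "vote": 10_000}  # vote events broke
print(check_golden_events(before, after))  # [('vote', 0.95)]
```

Wiring this to a pager instead of a dashboard is the difference Luis draws between automated alerting and relying on people to notice.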
ANDREI SAVU: Any
other questions?
I have another one around
the organization more
than anything else.
So it's kind of
related to governance.
There are various ways in
which you can organize data.
Google Cloud between
projects and folders
and I'm wondering how do
you organize your data?
Do you split storage
versus query, or not?
Or what are some of
the things you do?
LUIS BITENCOURT: Badly.
To go back to what I would tell
myself, so we did two things.
We versioned our events.
So the first thing, when we started at Reddit, there were schema-less events, which is horrible.
You always want a schema.
You always want to enforce it.
You want to be able to know
what fields are expected
and what's not expected.
So we had the
[INAUDIBLE] events.
And for those we actually
broke them out by dataset,
and we had one project for all
the events, different projects.
So datasets are grouped by,
is it projects in BigQuery?
FELIPE HOFFA: Project,
dataset, table.
LUIS BITENCOURT: OK.
So projects, I guess, is
the thing I was thinking of.
Maybe I'm thinking of tables.
Now I'm confused.
Project is a Google
Cloud project, right?
FELIPE HOFFA: Yeah.
LUIS BITENCOURT: OK, sorry.
So ignore what I was saying.
Multiple tables is what I wanted
to have, not multiple datasets.
I have one dataset and one table for all my V2 events, and then I can't actually segment them. I need something more granular than a day's worth of data.
So we have one project for
all of our production events.
We have one project for
employee datasets, so things
that are like ad hoc analytics.
And those are literally
organized by the employee name,
so we know exactly who to go
to and see if they're just
trying out different things or
trying out different datasets
and they're not polluting
the official project.
And then we now have the idea of these reporting tables and aggregate tables, which we're eventually forming into our golden dataset: the blessed tables and reports that we have cleaned-up data on and pre-aggregated, so that we can do queries really fast on the business side.
ANDREAS SEKINE: So we
definitely started down
this path on Google
Cloud Storage where
we were trying to not
have too many buckets,
and we realized that was the
wrong decision, that it's
better to have more
buckets because they
are a pretty singular unit of control: whether it's lifecycle management, or permissions, or the storage class, buckets tend to be the way that GCS has you control those things.
So we've started moving
more towards more buckets
with fewer folders within them.
As far as BigQuery goes, we
do have particular blessed
datasets where they have what we
consider analytics data, which
is kind of the nicely
sanitized, cleaned,
pre-joined data which anyone
who has kind of more cursory
or casual questions about
key metrics of the company
can query those tables and trust
that the analytics teams have
done their job to make sure it's
sane data and clean in there.
ROBIN LI: I guess another
benefit when you have multiple
projects or buckets is
cost control, so you can
structure your projects
into different tiers,
and different tiers
can have different
capacity to run queries.
So you can set up caps on
different projects as well.
So that's another way of doing
good cost control in BigQuery.
FELIPE HOFFA: I have two
questions, action-item-related.
One is what should I
tell my product managers?
What change would you
like from Google Cloud?
Excellent.
And the other question,
if someone here
wants a weekend project, what
would you recommend them to do?
What would be a fun
project for the audience?
LUIS BITENCOURT:
Custom partitions,
hierarchical
partitions in BigQuery.
That would save my life.
And weekend
projects, I would say
if you're looking for extra
projects, Reddit is hiring.
So you should just
talk to me and I'll
pay you to do your
weekend project.
If you don't want to
get paid for the work,
I literally think--
I mean, I know it's more
the data engineering.
But I think ML is huge.
It's going to be even bigger.
And if you want to be a
data engineer in the future,
likelihood is going to be a
data engineer for ML projects.
So I would pick whatever
thing you are manually
doing today, some repetitive
process, some classification
thing, and go play
and figure out
if you can actually create
some stupid, dumb models
to be able to predict
something for you there.
ROBIN LI: There is one really,
really nice feature about
BigQuery I really like,
which is HyperLogLog++.
I'm not sure if you guys all
know what HyperLogLog does.
It's basically an algorithm
developed in 1989.
It's sort of seven years
younger than my age.
But basically what it does
is try to estimate
the distinct values
over your datasets.
And Google takes it
up to another level:
they published a white paper
about HyperLogLog++.
Typically, the original
algorithm gives you
accuracy of about 0.5%,
which means if you actually
[INAUDIBLE] distinct values,
and also use the HyperLogLog
algorithm to count the distinct
values, the difference
should be around 0.5%.
But in Google's case,
it's about 0.05%.
It's really impressive
how accurate it is.
And also Google has
a really nice feature
where you can start to
partition your distinct values
into synopses, or sketches.
Which means, let's say,
previously, if you wanted
to count the last seven
days of distinct
users, what you needed
to do is always get all
the users for the
last seven days
and start counting the
distinct ones, right?
But now in Google BigQuery what
you can do is, for each day,
count the distinct values,
put them into a hash string,
and store it for that day.
Do it for the next
six days, and when
you want to look back for the
last seven days of distincts,
you just need to merge
all of the last seven days
together, which saves you
a lot of computational power.
So that's one of the
nicest features I really
like in Google BigQuery.
Other MPPs have the same
feature, but Google,
in terms of speed and accuracy,
is the best I've ever seen.
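(To illustrate the mergeable daily sketches Robin describes: a toy pure-Python HyperLogLog, not Google's HyperLogLog++ implementation. Each day gets its own sketch, and the seven-day distinct count comes from merging sketches by register-wise max instead of rescanning the raw data.)

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch: 2**p registers, mergeable by max."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.regs[idx] = max(self.regs[idx], rank)

    def merge(self, other):
        """Union of two sketches: register-wise max (this is the 7-day merge)."""
        out = HLL(self.p)
        out.regs = [max(a, b) for a, b in zip(self.regs, other.regs)]
        return out

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if est <= 2.5 * self.m and zeros:       # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

# Two "days" of user IDs with a 5,000-user overlap.
day1, day2 = HLL(), HLL()
for uid in range(10_000):
    day1.add(uid)
for uid in range(5_000, 15_000):
    day2.add(uid)
week = day1.merge(day2)   # true distinct count is 15,000
```

Each daily sketch is a few kilobytes regardless of how many users it saw, which is why storing one per day and merging on demand is so much cheaper than re-counting.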
ANDREAS SEKINE: Not so much
features, but since you're
offering some PMs' ears:
being a little bit more
public about when there are new
changes or releases to BigQuery
would be nice.
We've had breaking changes
where there were incompatibilities
between Dataproc and BigQuery.
And we would file
support tickets
and not hear back about them.
So that was a
little discouraging.
And also have noticed particular
service outages of up to 20
or 30 minutes that never show up
on the status dashboard, which
is a little frustrating
when you're on pager duty
and there's an outage
that you're observing
and there's no word.
LUIS BITENCOURT: Wait,
they didn't have requests,
so I want to throw in two more.
Native sampling
would be awesome.
We literally have
an ETL that creates
datasets that are samples,
just so we can query them.
Data sampling would be amazing.
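(A sketch of what such a sampling ETL might do in the absence of native sampling: deterministic hash-based sampling on a stable key, so the same roughly 1% of rows is selected on every run. The function name and rate are illustrative, not Reddit's actual pipeline.)

```python
import hashlib

def in_sample(key, rate=0.01):
    """Deterministically keep roughly `rate` of rows by hashing a stable key."""
    h = int.from_bytes(hashlib.md5(str(key).encode()).digest()[:8], "big")
    return h % 10_000 < int(rate * 10_000)

# Roughly 1% of 100,000 hypothetical row keys survive, and the choice is stable.
sample = [k for k in range(100_000) if in_sample(k)]
```

Hashing rather than random sampling means the sample is reproducible across ETL runs and consistent per key, so joins between sampled tables still line up.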
And now this is
coming from Reddit,
so you have to feel this one.
Like, make your
UI not look ugly.
And knowing that Reddit looks
really ugly, like, that's bad.
ROBIN LI: Can I
add a few things?
[LAUGHTER] No.
I mean, when we
moved to BigQuery,
I think the
reliability is actually
one of the reasons we
moved over to BigQuery.
We ran clusters in [INAUDIBLE],
in our own datacenter,
on AWS, on Vertica,
on Redshift.
The reliability we've seen
on all those clusters
just doesn't compare to BigQuery.
So I think the reliability
in Google Cloud
is actually very, very good.
The only thing
that confuses me is
the names of the different
products Google has,
like Dataproc, Datapipe,
Datalab, Data Studio.
They actually confuse me
sometimes: which tool to use
and what each tool is for.
LUIS BITENCOURT: Don't use AWS.
[LAUGHTER]
ANDREI SAVU: Cool.
So we'll stop here.
Thanks.
Thanks, everyone, for
sharing your experience,
sharing your thoughts about
how things can improve
and what things are working for you.
Thank you for
attending the event.
So we have half an
hour left in which
we can just socialize,
have a drink,
continue the conversations.
And that's it.
LUIS BITENCOURT:
Who gets the hat?
ANDREI SAVU: Who gets the hat?
So we have one question,
one great question.
He gets the hat.
Thank you very much and hope
to see you at the next one.
FELIPE HOFFA: Thank you.
[APPLAUSE]
