>>Tony Voellm: We have one last lightning
talk before we take a break here.
We will make it go like lightning, lightning
being exactly 15 minutes.
So it's like lightning from the sun to here
or something like that.
And so with that, I want to introduce Celal
Ziftci and Vivek Ramavajjala.
I practiced this.
Ramavajjala.
>>Vivek Ramavajjala: That's good.
>>Tony Voellm: Okay.
Who are from the University of California,
San Diego, and are going to be talking about
finding culprits automatically in failing
builds, i.e., who broke the build.
And I had an opportunity to have dinner last
night with Celal, and it was a very fascinating
discussion.
With that, I shall hand it over.
>>Celal Ziftci: Hi, everybody.
[ Applause ]
>>Celal Ziftci: As Tony mentioned, I am coming
from the University of California, San Diego.
I'm a Ph.D. student there, and I'm hopefully
finishing in a month to join Google.
And my colleague here, Vivek, is already working
at Google.
And, incidentally, he is also a graduate of
U.C.
San Diego.
So our talk today is about finding culprits
automatically in failing builds, in other
words, who broke the build.
So let me first give you some background on
how this project came to life.
So I was an intern in the Google New York
office last summer.
And this started as basically a research project
on how to do this kind of job automatically.
And we, basically, wanted to see if this is
feasible, if there were any strategies that
we could use to make this a reality.
And lucky for me, we actually found a good
way to put this into production.
And right before I left, we started using
it in two projects.
And we got some good, promising results, of which I will give you some examples at the very end.
All right.
So let me first describe what a culprit is.
So a culprit change list -- we defined it as a CL that breaks the build.
So as you saw during the talks yesterday and
today, most of the companies today, including
Google, use continuous integration.
And in a typical setup, you have a green build: the build compiles, all the tests pass, and everything is good, life is good.
And then developers commit change lists.
And those gray boxes represent change lists.
And at some point in time, the continuous
integration machines download the code, they
build it, and some tests fail.
So we have a red build.
So now what we need to do quickly is to figure out which of those gray boxes is the culprit, the one that caused the build to break.
The reason we want to do this is because we want to have more green builds, which means we will have better quality software. If we have a green build, we can have a faster development and release cycle, meaning we can be confident in our build as long as it's green, so we can release new versions of it.
And also, importantly, we are going to have
fewer engineer hours wasted.
Let me expand on that a little bit.
So when you have a setup like this, typically, developers take shifts watching the build. If the build fails, whoever is watching it that week or that month stops what they are doing and starts investigating what might have caused the breakage.
There are different ways of doing this.
And I'm going to describe different types
of automation on this problem and where we
-- where our solution lies.
But if you do this manually, it's not a very good thing to do, because you're going to use a lot of your developers' time, so, basically, you should have an automated tool to solve this problem.
So let me talk about how this can be automated, or already is automated, for different types of tests.
So the first type of test is unit tests.
As you can imagine, these are pretty short
tests.
I am talking about seconds here.
And if you have a problem like this, if the
build fails in your unit test, you can pretty
much do the following.
You can just build every single CL separately,
and possibly in parallel, and you can easily
find the culprit in seconds, or, at the most,
minutes; right?
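A minimal sketch of that per-CL strategy, assuming a hypothetical build_and_test(cl) helper that builds one change list, runs the unit tests, and returns True on green:

    from concurrent.futures import ThreadPoolExecutor

    def find_culprits_unit(cls_since_green, build_and_test):
        # Build and test every CL separately, in parallel; any CL whose run
        # comes back red is a suspect. build_and_test() is a hypothetical helper.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(build_and_test, cls_since_green))
        return [cl for cl, ok in zip(cls_since_green, results) if not ok]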
If you have medium-sized tests, which take a bit longer than unit tests and are also more computation intensive, then instead of building everything in parallel -- because you don't want to use lots of computation resources -- you make a tradeoff between time and computation, which is how many compute (indiscernible), as you know.
So we can do a binary search instead.
So what we can do is target the middle change list and build it. If that build fails, the culprit is at that change list or earlier, so we recurse into the earlier half. If it passes, the culprit must be among the later change lists, so we recurse into the later half. And we basically keep doing this binary search recursively until we find the culprit CL.
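A minimal sketch of that binary search, assuming the same kind of hypothetical build_and_test(cl) helper that builds the tree as of a given CL and returns True when the tests pass:

    def find_culprit_binary(cls_since_green, build_and_test):
        # cls_since_green is ordered oldest to newest. Assumes a single culprit,
        # so every build at or after it is red and every build before it is green.
        lo, hi = 0, len(cls_since_green) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if build_and_test(cls_since_green[mid]):
                lo = mid + 1   # green at mid: the culprit is a later CL
            else:
                hi = mid       # red at mid: the culprit is this CL or an earlier one
        return cls_since_green[lo]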
Now, these were already solved problems at Google when I came in. And the biggest problem lay in integration tests.
So these are tests that take a very long time
to run.
And I put the eight-minute cutoff there.
But the ones that I looked at could take hours,
let's say two hours; right?
And, basically, the strategies that I described
for unit tests and the medium-sized tests
don't work for those, because every single
build takes two hours.
So then we have to have a different strategy
to automate this.
And as I said, when I came in, there was no
easy way to find culprits like this automatically.
So the solution was to manually investigate
the change lists.
All right.
So, basically, our proposal was to have a completely automated tool that gives us suggestions about the culprit, does this by itself every single time the build fails, and then provides the user some information about what it found and which CLs could be the suspects.
And the important things are here.
It should be fast.
Potentially, it should give you a result within
minutes.
It should be cheap.
It shouldn't use too many computing resources.
As I mentioned, these can take two hours to
run.
And it should be good.
Obviously, good is a subjective word.
And I'm going to show you the results at the
end, and you can judge for yourself.
All right.
So this is a very high-level overview of how
this works.
So the tool monitors the build.
If the build fails, it immediately starts
looking at all the change lists that went
into the build and it ranks them.
So it calculates this metric that I call suspiciousness.
It then sorts them so the most suspicious ones are at the top and tells the users: here are the suspects that I think are the culprits; please take a look at them.
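As a minimal sketch of that flow, with changelists_since_last_green, suspiciousness, and notify_team as hypothetical stand-ins for whatever the continuous integration system actually exposes:

    def on_build_failure(failed_build):
        # Every CL that went into the failing build is a potential suspect.
        suspects = changelists_since_last_green(failed_build)
        # Score each CL and present the most suspicious ones first.
        ranked = sorted(suspects, key=suspiciousness, reverse=True)
        notify_team(failed_build, ranked)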
All right.
So, obviously, in this flow, as you can imagine, the interesting part is how we do the ranking, right? For every single change list, we look at all the files in it. I cannot give too many details, but I am going to explain how we do this at a very high level.
So we basically use two heuristics right now.
The first one, which is rather trivial, is that if a change list has more files than another change list, it's obviously more suspicious, because it touched more stuff in the build tree.
The second one is a bit more interesting.
It's distance.
So if you think about a build tree, it's basically a directed acyclic graph. Your library, your project, has lots of dependencies on other libraries. If you think about this as a tree, there's a root of the tree; let's say in Python it's the core Python libraries, right? And the heuristic says the following: a library that I depend on directly, my first adjacent library, is more likely to break my build than a library that's up the chain, closer to the root.
This might seem a bit weird, but let me explain
why we thought about this.
So think about a library that is closer to
the root.
There are two important things that we observed
about such a situation.
First of all, if somebody makes a change in a library that is closer to the root, they would be more cautious. For example, if you are changing the core Python library or something that immediately depends on it, obviously, you know that lots of projects at Google depend on that. So you are more cautious about making changes, and there's a more rigorous review process and so on.
And, secondly, let's say you still made a
mistake and you introduced a bug and you broke
the build on that project.
Then, since there are lots of projects that depend on it because it's closer to the root, some continuous build, some continuous integration project other than yours, will hit it immediately. Because lots of projects are running continuous integration, it's very likely that the first one already running with the newly changed library will hit that bug, and they will immediately realize this, figure it out, and fix it.
So this is basically the heuristic why we
use distance as a metric to calculate suspiciousness.
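One way to get that distance, as a rough sketch: treat the build tree as a dependency graph and take the breadth-first-search distance from the failing project's target to each library it transitively depends on. Here, deps is a hypothetical mapping from a target to its direct dependencies:

    from collections import deque

    def dependency_distances(failing_target, deps):
        # Shortest dependency distance from the failing target to everything
        # it transitively depends on. Under this heuristic, a library at a
        # smaller distance is more suspicious than one closer to the root.
        dist = {failing_target: 0}
        queue = deque([failing_target])
        while queue:
            node = queue.popleft()
            for dep in deps.get(node, ()):
                if dep not in dist:
                    dist[dep] = dist[node] + 1
                    queue.append(dep)
        return dist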
The important thing here is, the heuristics
are pluggable.
In this case, we used two.
But we had hopes of using other ones. For example, if something fails, you can look at the logs; hopefully, there are some keywords in the logs and in the diffs of the files that were changed, and maybe you can correlate them and so on. We didn't implement this, but I want to say these are pluggable.
So you can combine these and make your own
heuristic and so on.
It's a matter of experimenting and finding
what works for your project.
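A minimal sketch of what pluggable heuristics might look like, assuming hypothetical files_touched() and distance_to_failing_target() helpers; the weights are placeholders to tune by experiment:

    def file_count_score(cl):
        # Heuristic 1: the more files a CL touches, the more suspicious it is.
        return len(files_touched(cl))

    def distance_score(cl):
        # Heuristic 2: a CL close to the failing project in the dependency
        # graph is more suspicious than one near the root.
        return 1.0 / (1 + distance_to_failing_target(cl))

    # Pluggable: add, remove, or reweight heuristics as you experiment.
    HEURISTICS = [(file_count_score, 1.0), (distance_score, 1.0)]

    def suspiciousness(cl):
        return sum(weight * h(cl) for h, weight in HEURISTICS)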
And with that, I give it to Vivek to show
you how this looks.
>>Vivek Ramavajjala: So, basically, after Celal left from his internship, I joined about the same week, so I started working on the project that he was interning on. And I thought it was a pretty cool thing, because I had seen my teammates tear their hair out trying to find out who broke the build. So I spent a couple of months hacking things together and getting a prototype released. So you just have to go to the UI and say, look, this is the build tool I have; watch this thing for anything that breaks, anything that indicates, like, a failing test or a failure or anything like that.
And when something breaks, go and figure out which is the latest green build and which is the latest red build, figure out which (indiscernible) between them are, you know, things like automated CS data updates -- if they're not really human errors, (indiscernible) processing -- and then actually get the culprits and send them to the team in charge of the project.
And I also ended up implementing some of these (indiscernible) heuristics. So, things like, if you have logs, the logs will contain messages that say, you know, this is the place where an exception was raised, or this (indiscernible) didn't compile. So you can do a lot smarter things, like figuring out whether this file was modified by one of the CLs, and that gives you a lot more confidence in your scores.
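A rough sketch of that kind of log heuristic, assuming a Python-style traceback in the failure log and the same hypothetical files_touched() helper: boost any CL that modified a file the log mentions.

    import re

    def log_mention_score(cl, failure_log):
        # File paths named in the failure log (e.g. from a traceback).
        mentioned = set(re.findall(r'File "([^"]+)"', failure_log))
        # CLs that touched one of those files get a higher score.
        return len(mentioned & set(files_touched(cl)))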
So since January, or the middle of January, we have had about six or seven projects using this prototype. It has already investigated something like 250 breakages. That's about 30 breakages per project in less than two or three months. So that's pretty significant. And, essentially, it lets you find the culprits in something like two to three minutes instead of half an hour to an hour. So you can just, you know, get back to your work instead of having to look through other people's code.
And, basically, as I said, this was a prototype that I developed. It's now being integrated into a proper production continuous build system. That's a little slower, but it's getting there.
So, hopefully, once this is actually in place,
it should be possible for people not to worry
about, you know, going through people's code,
asking them why they did this and if they
tested properly and so on and so forth.
And here are some results. Basically, because these are large integration tests, the number of changes we end up analyzing is in the hundreds or thousands. One example there, the last one, is the most significant: 17,000 CLs between the green and the red. And the tool will tell you that, in your ranking, number (indiscernible) is the answer. So that's pretty significant.
You don't have to worry about filtering out
hundreds of CLs.
And, hopefully, we're going to save a lot
of time for people in the future, too.
>>Celal Ziftci: And I want to finish by reminding you of the three things that we wanted to have in this project. One, good results. And I think we have some promising results here. And the second one is, it should be fast.
And the first prototype that I implemented
last summer took, like, six hours, which was
suboptimal, we can say.
And then we did some runs on it, and we did
some optimizations, and then we took it down
to, as he mentioned, like, two minutes.
Obviously, we used lots of caching and so
on.
And that's the third thing that I want to
mention, resource efficiency.
We basically don't go to the file system and,
you know, query things.
We use caching as much as we can.
And that's why this is so fast.
>>Vivek Ramavajjala: There's a lot of precomputation, so we don't need to compute the build tree dynamically. Once a CL is submitted, we pretty much know what the build tree was at that CL. A lot of these things are there to save time.
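A tiny sketch of that caching idea, with load_dependency_graph() as a hypothetical expensive lookup that, in the real system, would be precomputed at submit time:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def build_tree_at(cl_number):
        # Computed (or precomputed) once per CL; repeated investigations
        # never have to go back to the file system.
        return load_dependency_graph(cl_number)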
>>Celal Ziftci: The last thing I want to mention
is I would like to make a point on what Ari
mentioned during his keynote speech yesterday.
So he mentioned that Google is having -- starting
to have some scalability problems in running
all of the unit tests in all of Google.
And after we implemented this and saw some promising results, we had some discussions about integrating this tool into the core technology infrastructure at Google. And the idea is, basically, if this tool can suggest the culprits or suspect CLs after they are submitted but before they are built, then we can still build all the CLs, even for unit tests.
Instead of building all of them in parallel,
we can actually build them in a certain order.
For example, we can take the first hundred
and then build them in parallel, basically,
the most suspicious ones.
And then we can take the second batch and
so on.
So, basically, the idea is, if you see a CL
that fails, then there's pretty much no point
running the CLs after that; right?
Because, like, until that is fixed, they are
going to fail as well.
So, basically, the idea is to integrate to
the core testing infrastructure so that we
can have more efficiency in the testing at
Google.
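A minimal sketch of that ordering idea, reusing the hypothetical suspiciousness() score from before and a hypothetical run_tests() helper:

    def test_in_suspiciousness_order(pending_cls, batch_size=100):
        # Most suspicious CLs first, in batches (e.g. the first hundred).
        ranked = sorted(pending_cls, key=suspiciousness, reverse=True)
        for start in range(0, len(ranked), batch_size):
            batch = ranked[start:start + batch_size]
            results = [run_tests(cl) for cl in batch]  # could run in parallel
            if not all(results):
                # A culprit was found; until it is fixed, there is little point
                # running the CLs that come after it.
                return [cl for cl, ok in zip(batch, results) if not ok]
        return []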
And that concludes our talk.
[ Applause ]
>>Tony Voellm: Thanks.
Great.
Great.
Thank you, Celal.
Thanks, Vivek.
We can take a question on either aisle.
I see some people lining up.
So how about the left first?
Although, Shadi (phonetic), I don't think
you get this, so --
>>> It's okay.
Hi, I'm Shadi.
I work with Account Team.
First, very nice tool.
Really nice work.
Just wondering, how do you deal with false
positives and what's the right way to update
the heuristics that you have based on the
results?
>>Celal Ziftci: I should say, I mean, we experimented with six projects so far.
I mean, when I was here, we experimented with
two projects.
And now it's six.
Unfortunately, I didn't have enough time to
do a formal evaluation, which I would like
to do once I join Google in summer.
By "formal evaluation," I mean literally going
to a (indiscernible), like, baseline from
all the unit tests that Google runs already
and run this tool on them and see if I can
actually identify the ones that actually fail.
So after we do that evaluation, I'm pretty
sure we will have a good understanding of
which experts, like which heuristics work
better.
But we, basically, like, experimented with
a couple of them, and these two look like
the best.
>>Vivek Ramavajjala: Speaking of false positives, we identified a few other tools that identify flaky failures. If it's not really a failure, the user can indicate that it's not a true failure, it's a flake. So that basically tells our tool not to trigger on this failure, and to wait for two or three failures to happen before treating it as an actual failure.
>>Tony Voellm: And with that, we've actually
run out of time.
So thank you, guys.
And we are -- Please.
[ Applause ]
