MALE SPEAKER: Thank you
for coming, everybody.
Some of you have probably
already
heard of Linus Torvalds.
Those of you who haven't,
you're the people with
Macintoshes on your laps.
He's a guy who delights in
being cruel to people.
His latest cruel act is to
create a revision control
system which is expressly
designed to make you feel less
intelligent than you
thought you were.
Thank you for coming
down today, Linus.
I've been getting emails for the
past few days from people
saying, where's Linus?
Why hasn't he merged
my tree?
Doesn't he love me anymore?
And he walked into my office
this afternoon.
What are you doing here?
But thank you for taking
the time off.
So Linus is here today to
explain to us why on Earth he
would write a software tool
which only he is smart enough
to know how to use.
Thanks, Linus.
LINUS TORVALDS: So I have a few
words of warning, which is
I don't actually do speaking
very much, partly because I
don't like speaking, partly
because over the last few
years everybody actually wants
me to talk about nebulous
visions for the next century
about Linux.
And I'm a tech geek, so I
actually prefer talking about
technology.
So that's why I am not talking
about the kernel, because it's
just too big to cram into
a one-hour talk.
Although apparently, Andrew
did that two days ago.
And I'm instead talking about
Git, which is the source
control management system that
we use for the kernel.
I'm really, really, really bad
at doing slides, which means
that if we actually end up
following these slides, you
will be bored out of your mind
and the talk will probably not
be very good anyway.
So I am the kind of
speaker who really
enjoys getting questions.
And if that means that we kind
of veer off in a tangent,
you'll be happier, I'll be
happier, the talk will
probably be more interesting
anyway.
I don't know how you do things
here at the Google talks, but
I'm just saying don't feel shy
as far as I'm concerned.
If your manager will shoot
you, that's your problem.
So next slide.
I want to give a few credits
before I start.
Credit CVS in a very, very
negative way because in many
ways when I designed Git, it's
the what would Jesus do?
Except it's what would CVS
never, ever do kind of
approach to source control
management.
I've never actually used
CVS for the kernel.
For the first 10 years of
kernel maintenance, we
literally used tarballs and
patches, which is a much
superior source control
management system than CVS is.
But I did end up using CVS for
seven years at a commercial
company and I hated
it with a passion.
When I say I hate CVS with a
passion, I have to also say
that if there are any SVN users,
Subversion users, in
the audience, you might want to
leave because my hatred of
CVS has meant that I see
Subversion as being the most
pointless project ever started,
because the slogan
for Subversion for a while
was, CVS done right or
something like that.
And if you start with that
kind of slogan, there's
nowhere you can go.
There is no way to
do CVS right.
So that's the negative
kind of credit.
The positive credit
is BitKeeper.
And I realize that a lot of
people thought there was a lot
of strife over BitKeeper and
that the parting was very
painful in many ways.
As far as I'm concerned, the
parting was amicable, even
though it looked very non-amicable
to outsiders.
And BitKeeper was not only the
first source control system
that I ever felt was worth using
at all, it was also the
source control system that
taught me why there's a point
to them and how you actually
can do things.
So Git in many ways, even though
from a technical angle
it is very, very different
from BitKeeper, which was
another design goal because I
wanted to make it clear that
it wasn't a BitKeeper clone, a
lot of the flows we use with
Git come directly from
the flows we
learned from BitKeeper.
And I don't think you use
BitKeeper here inside Google.
As far as I know, BitKeeper is
the only commercial source
control management system that
actually does distribution.
And if you need a commercial
one, that's the one you should
use, for that reason.
I'd also like to point out that
I've been doing Git now
for slightly over two years, but
while I started it and I
made all the initial coding
design, it's actually being
maintained by a much more
pleasant person, Junio
Hamano, for the last
year and a half.
And he's really the person
who actually made it more
approachable for mere mortals.
Early versions of Git did
require a certain amount of
brainpower to really wrap
your mind around.
It's gotten much, much
easier since.
Obviously the way I always do
everything is I try to get
everybody else to do as much as
possible so that I can sit
back and sip my pina colada, so
there's been a lot of other
people involved, too.
That's the credits.
With those out of the way.
So this slide is now one day
old, and I didn't actually do
the slides last night because
last night I was out carousing
and eating sushi.
But the slides will talk about
implementation of a high
performance distributed content
management thing.
And the keyword here is actually
the distributed part.
I will start off trying
to explain why
distribution is so important.
If we never get past
that point, I
will actually be happy.
If we never get to actually
what Git implementation
internally is, it's fine.
I am also not trying to teach
you how to use Git.
There is this thing
called google.com.
You may have seen it.
It has this thing you can
type things into.
You type Git and then you press
the I'm Feeling Lucky
button, and you will actually
get the home page.
The home page has tutorials,
it has the user manual,
they're all in HTML.
If you actually want to learn
to use Git, that's where you
should start, not
at this talk.
But as mentioned, if we actually
start veering off
topic into other tangents
because of
questions, it's all good.
I already gave you kind of a
heads up warning on this.
I use the term SCM, which I consider
to mean Source Code
Management, that is,
revision control.
Some other people think SCM
means Software Configuration
Management and see it as a much
bigger feature, including
release management and
stuff like that.
That's not what I'm talking
about, although Git is clearly
relevant in that setting, too.
CVS, we already went there.
You can disagree with me as much
as you want, but during
this talk, by definition anybody
who disagrees is
stupid and ugly.
So keep that in mind.
When I'm done speaking, you can
go on with your lives.
Right now, yes.
I have strong opinions and CVS
users, if you actually like
using CVS, you shouldn't
be here.
You should be in some mental
institution somewhere else.
So before I actually go and talk
about the whole distribution
thing, which I think is the most
important part, I'll talk
a bit about the background
because it invariably comes up
because people, if they have
heard about Git, a lot of the
things they've heard about is
the background for doing it in
the first place.
One piece of background
information is I really am not
an SCM person.
I have never been very
interested in revision control.
I thought it was evil until
I met BitKeeper.
I actually credit that to some
degree for why Git is so much
better than everything else.
It's because my brain did not
rot from years and years of
thinking CVS did
something sane.
I needed a replacement
for BitKeeper.
The reason for that was
BitKeeper is a commercial
product, but BitMover and Larry
McVoy allowed it to be
used freely for open
source projects, as
some of you may know.
The only restriction was you
were not supposed to reverse
engineer it and you weren't
supposed to try to create a
competing product.
And I was happy with that
because, quite frankly, as far
as I'm concerned I do open
source because I think it's
the only right way
to do software.
But at the same time, I'll use
the best tool for the job and,
quite frankly, BitKeeper
was it.
However, not everybody
agreed with me.
They are ugly and stupid.
But they cause problems and it
resulted in the fact that
Larry and I had several
telephone conversations which
ended up saying we'll all be
much happier if we just part
ways and don't make
this any worse.
So we did.
And I made the Linux 2.6.12-rc2
release about two
years ago and said, I'm not
going to touch Linux until I
have a replacement for BitKeeper
for doing source
code maintenance.
And one of the replacement
options was going back to
tarballs and patches,
but nobody
really liked that anymore.
So I actually looked at
a lot of alternatives.
Most of them I could discard
without even trying them out.
If you're not distributed,
you're not worth using.
It's that simple.
If you perform badly, you're
not worth using.
It's that simple.
And if you cannot guarantee that
the stuff I put into an
SCM comes out exactly the same,
you're not worth using.
Quite frankly, that pretty
much took care of
everything out there.
There's a lot of SCM systems
that do not guarantee that
what you get out of it again is
the same thing you put in.
If you have memory corruption,
if you have disk corruption,
you may never know.
The only way you'll know is
you notice that there's
corruption in the files when
you check them out.
The source control management
system does not protect you at
all, and this is not
even uncommon.
It is very, very common.
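One way to get that guarantee is to name every stored object by a hash of its own bytes, so that corrupted data can no longer match its name. The sketch below illustrates the idea in Python. The `blob <size>\0` header and SHA-1 follow the scheme Git is documented to use for file contents, but the `ObjectStore` class and its method names are made up for this illustration and are not Git's implementation.

```python
import hashlib


def object_id(content: bytes) -> str:
    """Name a blob by the SHA-1 of a short header plus its bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy in-memory store keyed by content hash (hypothetical class)."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        oid = object_id(content)
        self._objects[oid] = content
        return oid

    def get(self, oid: str) -> bytes:
        content = self._objects[oid]
        # Re-hash on the way out: corrupted bytes no longer match their name.
        if object_id(content) != oid:
            raise ValueError(f"object {oid} is corrupt")
        return content


store = ObjectStore()
source = b"int main(void) { return 0; }\n"
oid = store.put(source)
assert store.get(oid) == source  # what went in is what comes out

# Simulate disk or memory corruption by changing a byte behind the store's back.
store._objects[oid] = b"int main(void) { return 1; }\n"
try:
    store.get(oid)
except ValueError as err:
    print("detected:", err)
```

The point of the design is that the check needs no separate checksum file: the name of the object is the checksum.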
The performance issue, one of
the things I kind of liked was
a system called monotone, which
actually, I think there
was a talk at Google about
them some time
ago, I'm not sure.
It had a lot of interesting
ideas, but performance was so
horrendously bad that I tried it
for a day and realized that
I cannot use it.
The end result was I decided I
can write something better
than anything out there in two
weeks, and I was right.
So now we get to distribution.
And this is the worst slide of
them all, and I'm not very
proud of it.
And the problem is distribution
is really, really
important, but when I tried
to make slides about it I
could not do it.
And part of it is my obvious
artistic talents, which are on
display for all of you, but part
of it is that it's really
hard to explain.
So before I start, I'd
like to know how many people
are used to the notion of a
truly distributed source
control management system?
Are most of you kernel
developers?
No, OK.
So there were maybe 10
hands coming up.
Being distributed very much
means that you do not have one
central location that keeps
track of your data.
No single place is more
important than any other
single place.
So for example, this is why I
would never touch Subversion
with a 10 foot pole.
There is a massive Subversion
repository, and it's where
everybody has to write.
The centralized model just
doesn't work when you want to
be-- let's look at a
few of the cases.
I say it's so much more than
just offline work, but the
offline work part is actually
maybe the most obvious thing,
which is that you can take a
truly distributed source
control management system, you
can take it on a plane and
even if they don't offer Wi-Fi
and satellite hookups, you
just continue working, you can
look at all your logs, you can
commit, you can do everything
you would do even if you were
connected to a nice
gigabit ethernet
directly to the backbone.
And that is really important.
It is doubly important when you
have hundreds or thousands
of people working on the same
project and they may not be
literally disconnected, but in
practice they aren't really
well-connected either.
So part of distribution is
this offline work theme.
Even if it's not completely
offline, it is important to be
able to do everything you want
to do from any location
without having to be able
to access the server.
What that basic fact actually
results in is that you
effectively have a lot more
branching because everybody
who has a complete repository
and can do commits on his own
will effectively have his own
branch, even if you don't
realize it.
Even if you think of your
project as just having a
single branch, every single
time you disconnect your
laptop and start working
with it, you
are on your own branch.
And this is really, really
important and is very
different from anybody who's
used CVS, where branching is
considered something that
only true gurus do.
How many of you have
ever used CVS?
OK, everybody.
How many of you have really
done a branch and ever
merged it in CVS?
Good job.
I mean, it wasn't everybody but
it was actually more than
I expected.
How many of you enjoyed
the experience?
OK, so there were a couple.
But it is considered hard.
In CVS, when you merge
a branch--
I've done it as little
as possible, but
I've had to do it--
what you do is you plan ahead
for a week and then you
basically set aside one
day for doing it.
Am I wrong?
I'm not seeing a lot of people
say no, it was easy.
I liked it.
It's horrible.
If you're distributed, you have
to realize that every
single person has
his own branch.
It's not something you
even have to set up.
It just is.
In fact, in Git, we like
branches so much that a lot of
people just have five or
ten or fifteen of them.
Just because once you realize
that you have to have a
special branch anyway, you
might as well have many.
And one of the branches you do
some experimental work on and
one of the branches you
do maintenance on.
So branching is much more
inherent when you do
distribution.
One of the other things that,
to me, is very important is
that by being distributed, you
also automatically get to be
slightly more trustworthy.
I have a theory of
backups, which is
I don't do them.
I put stuff up on one site and
everybody else mirrors it.
And if I crash my own machine I
don't really care, because I
can just download my own
work right back.
And it works beautifully well,
and I don't have to have an
MIS department.
I heartily suggest everybody
else do the same.
But this only really works in
a distributed environment.
If you use CVS, you
can't do this.
What do you use here?
Perforce?
Perforce.
I'm sorry.
I'm sure it's better than
CVS. [WHISPERS].
So that's part of it.
One of the really nice things
which is also--
maybe you don't have this issue
inside a company, but we
certainly have it in every
single open source community
I've ever seen that uses CVS
or Subversion or something
like that-- is you have this
notion of commit access.
Because you have a central
repository, it means that
everybody who is working on that
project needs to write to
the central repository, which
means that since you don't
want everybody to write to the
central repository because
most people are morons, you
create this class of people
who are ostensibly not morons.
And most of the time, what
happens is you make that class
too small because it's really
hard to know if a person is
smart or not, and even when
you make it too small, you
will have problems. So this
whole commit access issue,
which some companies are able
to ignore by just giving
everybody commit access, is a
huge psychological barrier and
causes endless hours
of politics in
most open source projects.
If you have a distributed
model, it goes away.
Everybody has commit access.
You can do whatever you
want to your project.
You just get your own branch,
you do great work or you do
stupid work.
Nobody cares.
It's your copy, it's
your branch.
And later on, if it turns out
you did a good job, you can
tell people hey, here's
my branch.
And by the way, it performs 10
times faster than anybody
else's branch, so nyah
nyah nyah, how about
pulling from me?
And people do.
And that's actually how
it works, and we
never have any politics.
That's not quite true, but
we have other politics.
We don't have to worry about
the commit access thing.
And I think this is a huge issue
and that alone should
mean that every single open
source system should never use
anything but a distributed
model.
You get rid of a
lot of issues.
Another thing, for
commercial companies:
distributed models actually
also help
with the release process.
You can have a verification team
that has its own tree,
and they pull from people
and they verify it.
And when they've verified it,
they can push it to the
release team and say, hey, we
have now verified our version.
And the development people, they
can go on playing with
their head.
Instead of having to create
tagged branches, whatever you
do to try to keep off each
other's toes, you keep
off each other's toes simply
because every single group can have
its own tree and track its work
and what they want done.
So distributed is really, really
central to any SCM you
should ever use.
So get rid of Perforce now.
[APPLAUSE]
LINUS TORVALDS: It's sad,
but it is so, so true.
That was my only real slide
about distribution.
I'd love to get questions,
because we're now moving into
other areas that--
AUDIENCE: So how would
you do it?
If you had this monstrously
awesomely big code base, and
you wanted to use this without
stopping business for six
months, how would you do it?
LINUS TORVALDS: Stay by the mic
because I couldn't quite
make out your question.
OK, he went away.
How would you do this?
AUDIENCE: [INAUDIBLE].
LINUS TORVALDS: So an example of
actual distribution is you
have a group of five people
working on one small,
particular feature.
And that means that for a while,
that feature will be
very, very broken, right?
Because nobody actually creates
perfect code the first
time around except me, but
there's only one of me.
So what happens is they need
to have their own tree that
they can work in without
affecting other people.
You can do this many
different ways.
In CVS, one of the most common
ways, because branches are so
painful, is that you don't
actually commit.
You never commit until it passes
every single test. For
example, at your company you
have a very strict committing
rule saying you will never, ever
commit until it's past
the whole test suite.
And by the way, the fact that
the test suite takes two hours
to run, tough.
You cannot afford to commit.
And this is something
that happens at
every single company.
I bet it happens even
here at Google.
You probably have a strict test
suite, and you are not
supposed to commit
unless it passes.
And then in practice, people
make one-liner changes and
ignore the test suite because
they know the one-liner
changes can't possibly break.
This happens.
This is a horrible,
horrible model.
It just means that you make
huge commits because you
commit something after you've
worked on it for two weeks,
and you have three people
working in the same sandbox
because before they commit,
they can't see the changes
that the other people made.
This is common.
It happens everywhere,
it's scary.
The other alternative is to
use branches even in a
centralized environment.
But branches always end up being
pretty expensive to do,
so you can't do them for
experimental features.
You don't know beforehand if
it's something that's going to
take one day or two weeks,
but most of the time most
programmers say, hey, I can
do this in 48 hours.
And it turns out, yeah,
no you couldn't.
But because you feel you can do
it in 48 hours, creating a
branch, even in systems that
are better at creating
branches than CVS,
is a big pain.
So you don't do it because you
think you can get it resolved
and you're back to
case number one.
But if you decide to create
a branch, you will affect
everybody else's repository
because in a centralized
environment, branches
are global.
So you're kind of screwing with
everybody else, but at
least you're not screwing with
their main, head branch.
You are adding stuff to their
repositories, but hopefully in
a way that they won't notice.
But it does make everybody's
repositories bigger.
So either way, you can't win.
In contrast, in a distributed
environment, what you do is
you have five people, they pull
the current head, which
is hopefully good and tested,
and they start working on it
and they start committing
on it.
And you don't need to wait for
two weeks until your commits
are stable because your commits
are always local.
And what happens is within that
group of five people, you
can pull from each other.
That's what distributed means.
There's no central location, it
means everybody's the same.
So you can merge between
yourself.
So not only can you commit every
single line if you want
to without having to run the
two-hour test suite, but you
can then communicate by pulling
and merging each
other's work. One person
finds a bug, commits a fix,
and tells the other four people,
hey, my repository has
a fix for this.
And then when that group is done
two weeks later, they can
tell their manager, hey,
we've done this.
Can you ask the main group to
pull, and they'll get this new
feature and by the way, we've
tested it over two weeks and
it works and it performs this
much better because we have
actually been able to time it
before we even ask anybody
else to look at it.
And that's a hugely better model
for doing development.
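The workflow just described can be modeled in a few lines of Python. This is a toy sketch only: the `Repo` class, its methods, and the commit ids are invented for illustration, and a "pull" here is a plain union of histories, with none of the content merging a real system has to do.

```python
class Repo:
    """Toy repository: an owner plus an ordered commit list (hypothetical)."""

    def __init__(self, owner, commits=None):
        self.owner = owner
        self.commits = list(commits or [])

    def clone(self, new_owner):
        # A clone is a full copy of the history, with no tie to any server.
        return Repo(new_owner, self.commits)

    def commit(self, message):
        # Committing touches only the local repository.
        commit_id = f"{self.owner}:{message}"
        self.commits.append(commit_id)
        return commit_id

    def pull(self, other):
        # Take every commit the peer has that we lack; any repo can be a source.
        new = [c for c in other.commits if c not in self.commits]
        self.commits.extend(new)
        return new


main = Repo("main", ["main:tested baseline"])
alice = main.clone("alice")
bob = main.clone("bob")

alice.commit("half-finished feature")  # local, so it may be broken for a while
bob.pull(alice)                        # teammates exchange work directly
bob.commit("fix bug in feature")
alice.pull(bob)

# The main tree has seen none of this until somebody asks it to pull.
assert main.commits == ["main:tested baseline"]

main.pull(alice)
print(main.commits)
```

Nothing in the model singles out `main`: it is an ordinary repository that the group has agreed to treat as the place finished work ends up.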
And this is the model that
the kernel uses.
It turns out in many places, we
don't need all that power,
even in the kernel.
So people usually don't pull
within one group, but it
does happen.
For example, the networking
people sometimes affect the
NFS people, and the
fact that they can
synchronize actually helps.
So this is a real, practical
advantage.
Somebody else has a question.
AUDIENCE: So it feels like the
politics has just been moved
to an indirect political
question.
If everyone's got access and
they're all playing with their
branches and they have their
sandbox and they're having
fun, at the end of the day there
has to be merging and
resolving unless you have
80 billion flavors
of every Linux kernel.
LINUS TORVALDS: Absolutely.
There will be 1,000 or maybe
20,000 different branches, but
in practice you won't ever see
them because you won't care.
You will see like a few
main branches, maybe
you'll see only one.
In the case of the kernel, a lot
of people they only really
look at my branch.
So even though there are
lots of branches,
you can ignore them.
What happens is the way merging
is done is the way
real security is done, by a
network of trust. If you have
ever done any security work
and it did not involve the
concept of network of trust,
it wasn't security work.
It was masturbation.
I don't know what you were
doing, but trust me, it's the
only way you can do security,
it's the only way you can do
development.
The way I work, I don't
trust everybody.
In fact, I am a very cynical
and untrusting person.
I think most of you are
completely incompetent.
The whole point of being
distributed is I don't
have to trust you.
I don't have to give
you commit access.
But I know that among the
multitude of average people,
there are some people that just
stand out, that I trust
because I've been working
with them.
I only need to trust
5, 10, 15 people.
If I have a network of trust
that covers those 5, 10, 15
people that are outstanding
and I know they're
outstanding, I can
pull from them.
I don't have to spend a lot of
brain power on the question.
When Andrew sends me patches--
he doesn't actually use Git,
it's some kind of defect--
other than that, he's
a very solid person.
When he asks me to pull, he
does it by sending me a
million patches.
And I just do it.
Sometimes I disagree with some
of these patches, but at some
point, trust means never having
to say you're sorry.
I don't know.
It basically means you
have to accept
other people's decisions.
The nice thing about trust is it
does network, that's where
the network of trust comes in.
I only need to trust a
few people that much.
They have other people, they
have determined, hey, that guy
is actually smarter than I am.
That's actually a really
good measure of who
you should pull from.
If you have determined that
somebody else is smarter than
you, go for it.
You can't lose, right?
Even if it turns out you pulled
crap and somebody else
starts complaining, you know
who you pulled from and you
can just point to the
other person and
say, hey, I just pulled.
Go to him, he knows
what he's doing.
So that's how I work.
That's probably how most of
my lieutenants work.
I pull the networking changes
from one person, he gets them
from many other people that he's
worked with over time.
So this is how it all
comes together.
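The way trust "networks" can be sketched as a small graph walk: each maintainer lists only the few people they pull from, and everyone else's work arrives through those people. The names and the `pulls_from` table below are invented for illustration and do not describe the real kernel hierarchy.

```python
def reachable_contributors(pulls_from, maintainer):
    """Everyone whose work can reach `maintainer` through a chain of pulls."""
    seen = set()
    stack = list(pulls_from.get(maintainer, []))
    while stack:
        person = stack.pop()
        if person in seen or person == maintainer:
            continue
        seen.add(person)
        # Follow that person's own list of trusted sources.
        stack.extend(pulls_from.get(person, []))
    return seen


# Hypothetical trust table: who pulls from whom.
pulls_from = {
    "top": ["net_maintainer", "fs_maintainer"],
    "net_maintainer": ["dev_a", "dev_b"],
    "fs_maintainer": ["dev_c"],
    "dev_b": ["dev_d"],
}

direct = pulls_from["top"]
everyone = reachable_contributors(pulls_from, "top")
print(len(direct), "trusted directly,", len(everyone), "reachable in total")
```

The person at the top only ever has to judge the short `direct` list, while the set of people who can contribute grows with every extra level.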
It doesn't have to come
together to one point.
In the kernel, it comes together
to one point largely
I think for historical
reasons.
And actually, I've always tried
to encourage people to
have more trees.
So we do have vendor trees, we
do have -mm trees, we have
multiple one points, and it
happens to be that my one
point is getting maybe
more attention
than it always should.
But even if it doesn't come down
to one point, it means
that you can take these
thousands of branches and
ignore 99.9% of them.
And you know that, hey, there
are five branches that are
really interesting to follow
because I'm interested in
those sub-areas.
And it all works
very naturally.
One of the nice things about
this whole network of trust is
it's not just easy to do
technically, it's actually how
every single person in this
room is very fundamentally
wired to work.
It is how we think.
We don't know 100 people.
We have five, seven, ten close,
personal friends.
Well, we're geeks,
so we have two.
But I mean, that's basically
how humans work, is that we
have these people that we really
trust. It's family,
it's close friends.
And it really fits.
You don't even have to
have a mental model.
It fits how we are wired up.
So there's huge advantages to
this whole network-of-trust
model.
AUDIENCE: Do you know any
companies that are using
distributed systems
internally?
It seems like there might be a
risk of balkanizing the code
base as people not being in
the same sandbox don't
contribute back.
LINUS TORVALDS: So quite
frankly, there aren't that
many distributed systems.
There is BitKeeper.
It is clearly being used at
commercial companies.
We might have somebody in the
audience who actually knows.
What?
AUDIENCE: [INAUDIBLE].
LINUS TORVALDS: So HP is using
things like BitKeeper for the
printer project.
I'm sure there are a
lot more companies.
In the open source world, there
are two distributed
systems that are worth
looking at right now.
One of them is obviously Git
and you really should pick
that one, but the other one is
Mercurial, which actually has
pretty much the same
design.
There are huge differences in
implementation and there are
some differences in details,
but it boils down to a very
similar model.
Git just does it better.
Everything else, it's either
centralized or it is too
unstable or too slow to
use for anything big.
AUDIENCE: Right, but is there an
advantage for a company to
have everybody playing
in the same sandbox?
LINUS TORVALDS: I think a lot of
companies think there is an
advantage to that.
I know that inside companies,
I don't think a lot of
companies use Git knowingly
in the sense that it
is a company decision.
I know several companies who
use Git internally, not
knowing that they do so because
they actually have
their main repository in
Subversion and a lot of
developers then import it into
Git because Git can actually
merge things for you.
So you can take a Subversion
tree, just import it into Git,
let Git do the merge, which
would be a major headache to
do in Subversion, create a merge
commit, and actually
export it back to Subversion,
and nobody else even
knew you used Git.
It's kind of sad, but we have
cases of people talking about
doing exactly that
inside companies.
Git has not been around in a
form where a lot of people
will be comfortable using
it for more than
half a year or so.
We have had such huge improvements
to the user
interfaces that realistically,
a year ago at a commercial
company a lot of people
would just have said
it's too hard to use.
I think we're way
past that hump.
Git is much easier to use
than CVS, really.
It's easier to use than
anything else.
Just get over it.
You don't have to use all
the powerful tools.
Some of them might be things
you want to explain and
introduce to people only after
they got over the initial hump
of understanding what
distribution really means.
But the basic stuff is
really easy to do.
AUDIENCE: One characteristic
of a centralized system is
that it's the original developer
who has to resolve
any merges, who has
to fix merges.
How do you do that in
Git and how do you
minimize merge conflicts?
LINUS TORVALDS: Thank you for
asking me that question.
Did I tell you to ask
that question first?
One of the really nice parts
of Git is A, it does make
things much easier to merge than
a lot of other systems.
Merging a branch in CVS tends
to be really painful.
One of my main statistics is the
kernel is actually one of
the biggest open source
projects.
We have 22,000 files.
We've used Git for two years.
During those two years, we have
averaged 4.5 merges a
day, every single day.
That's not something you'd
do in something
where merging was hard.
So Git makes merging easy, but
you will inevitably have
cases where two maintainers send
me a request to please
pull their stuff.
And I pick one of them at random
usually, because their
mail happened to be first in
my mailbox, and I pull
their stuff.
And another person had
made changes that--
it doesn't happen that often,
but it does happen--
just clashed so much that I
said, I could fix this up but
I really don't want to.
I didn't write the code, it's
not my area of expertise, it's
networking or something like
that, I can't really judge it,
I can't test it, so asking
me to resolve the
merge is just crazy.
It's not how you should
do things.
OK, the Windows machine
flaked out again.
Remember, distribution means
nobody is special.
So instead of me merging, I just
push out my first tree
that didn't have any merge
issues and I tell the second
person, hey, I tried to pull
from you but I had merge
conflicts and they weren't
completely trivial, so I
decided you get to do
the honors instead.
And they do.
And they know what
they're doing
because it's their changes.
So they can do the merge and
they probably think I'm a
moron because the merge was so
easy and it was obvious I
should have taken their code.
But they do the merge and then
they update their tree and
say, hey, can you pull
from me now?
And I pull from them and they
did all the work for me.
That's what it's all about.
They did all the work for me.
And I take the credit.
Now I just need to figure
out step three, profit.
But that's another thing that
comes very naturally from
being distributed.
It's not something that
is special to Git.
Git makes merging easier than
anything else, but Git does it
exactly because Git
is distributed.
Yes.
AUDIENCE: So I guess I don't
entirely understand why you
think that it's necessary to have
a distributed system--
it seems like you get a lot of
the good effects, at least for
corporate development.
For open source development,
it seems very useful that
everybody can work
on their own.
But when you really have a
centralized, corporate tree,
then a centralized system with
really cheap branches,
wouldn't that give you pretty
much the same effect?
Or is that just impossible
to do?
LINUS TORVALDS: No.
I will argue that centralized
systems can't work.
But it is clearly true that if
you're in a tightly controlled
corporate environment,
centralized
systems work better.
And it's unquestionably true
that people have been able to
use centralized systems for
the last 35 years.
Nobody's really arguing
that centralized
systems cannot work.
They cannot work as well as
distributed systems. One of
the issues you tend to have is
centralized systems inevitably
have problems when you have
groups in different locations.
It tends to work really well
if you have a really beefy
backbone fiber.
And I guess for Google, you
probably do have some kind of
network going.
I don't know.
And maybe it's not as big of
an issue as it is for other
projects, but trust me.
Not having to go over the
network for everything is a
huge performance saver.
I can't show you demonstrations
and it's not a
very interesting demonstration
anyway, but this is a laptop
that is what, four or
five years old.
It's like a Pentium M
1.6 gigahertz thing.
I could show you me doing a
full diff of the kernel on
that laptop in whatever,
just over a second.
On my main machine, it takes
less than 1/10 of a second.
That's the kind of performance
you simply cannot get if you
have to go over a network.
We're talking a couple of
packets going over the network
and you just blew
the performance.
So if you have a centralized
system and you're used to
having something like commit
or diffing the whole source
tree taking 30 seconds.
Maybe 30 seconds doesn't
sound that bad to you.
Trust me, when you're used to
taking 1/10 of a second, 30
seconds sounds pretty bad.
So there are huge performance
issues even if you have a good
network, never mind the fact that
most people don't have a
good network.
The other thing is branches,
even if you make them
technically very cheap to
create, just the fact that you
create them and everybody sees
them because everybody will
see them since they're
centralized, basically means
that you don't want to make
branches willy-nilly.
You will have namespace
issues.
What do you call your branch?
Would you call it Test?
Oh by the way, there's 5,000
other branches called Test 1
through 5,000.
So now you have to make up all
these naming rules for your
branches because you have a
centralized system that has a
centralized branch namespace,
which is kind of inevitable
when you have a centralized
system.
How does that work in
distributed environments?
You call your branch test,
and it's that easy.
Actually, you shouldn't
call it test.
You should basically name
your branches the way
you name your functions.
You should call them
something short and
sweet and to the point.
What is that branch doing?
Git, by default, gives you one
branch that is called master.
It's short and sweet
and to the point.
It's the master branch.
But you can make a branch that
is called Experimental Feature
X and it will be obvious.
But this is something you
simply cannot do in a
centralized environment.
You cannot call branches
Experimental Feature X. You
have to make up stupid,
idiotic names.
I worked for a company
that had nice--
as nice as you probably can make
them-- scripts around CVS
that helped you make branches.
You could actually
make branches
with a simple command.
It didn't take that long.
It picked a name for you,
exactly because it would pick
the number.
So you'd give it a base name and
you would say, this is my
branch for doing so and so
and it would call your
branch So and So-56.
And it would tag where you
started that branch because in
CVS you need to do that, too.
It took a while,
but it worked.
You can do these things in
centralized systems, but you
don't need to.
If your system is decentralized,
it just works.
That is how it should work.
So I'm not going to force
you to switch over to
decentralized, I'm just
going to call you
ugly and stupid.
That's the deal.
Anyway, we are on the
performance slide.
AUDIENCE: Can I ask
a question?
LINUS TORVALDS: Yes.
AUDIENCE: Two questions,
actually.
So one is how many files
will Git take.
And then the second one, let's
say if you have a humongous
tree under Git, would it
be possible to check
out part of the tree?
LINUS TORVALDS: Great
questions.
Those questions actually kind
of dovetail into a different
issue, even though they are
performance related.
One of the things that Git is
really special about, and this is
special even with regard to
things like Mercurial which is
otherwise fairly similar,
Git tracks your content.
It never, ever tracks
a single file.
You cannot track
a file in Git.
What you can do is you can track
a project that has a
single file, but if your project
has a single file,
sure do that, and
you can do it.
But if you track 10,000 files,
Git never, ever sees those as
individual files.
Git thinks of everything
as the full content.
All history in Git is based on
the content of all of the
history of the whole project.
This has implications
for performance.
When you use CVS it's
perfectly fine.
It's stupid, but it's perfectly
fine to have one
huge repository that has a
million files in it because at
the end of the day, CVS actually
thinks of all those
million files as
a single file.
And you can actually ask CVS to
only update that one file
because CVS really thinks in
those terms. And that's
actually true of pretty much
everything else too.
It is actually even
true of BitKeeper.
That was one of the mistakes
in BitKeeper.
The problem with thinking in
terms of single files is that
quite often, especially if
you're a high level maintainer
like me, I have 22,000 files to
track, I don't care about
one of them.
I might care about a
sub-collection of them that
contains maybe 1,000
files.
I might care about the USB
subsystem, but I never care
about the single file.
So Git tracks everything as a
collection of files, and if
you ask for the history of
a single file, Git will
literally start from
the global history
and simplify it.
It's a fairly efficient
system.
It's a very efficient system.
You would normally not even
realize that it does that.
But it does mean that if you try
to track a million files
in one repository, when you
then ask for a single file
history it's going
to be slower.
So it has different scaling
properties than a lot of other
systems for this very
fundamental design reason.
We have used big repositories.
We've imported things like
something like 3/4 of the
Subversion history of the
whole KDE project.
And the KDE people are--
I like KDE but trust me, they
put every single component in
one repository.
Not very smart.
What you ended up with, you had
a repository that took I
think eight gigabytes under the
CVS tree and Subversion
blew it up to like three
times that size.
Maybe it wasn't quite
eight gigabytes in
CVS, but it was big.
It was more than
four gigabytes.
Git would actually compress
it down to
something like 1.3 gigabytes.
So Git is actually very
efficient at taking this
project and just smushing it
together and most things
perform very well.
But certain things did not.
The things that do not perform
very well, if you put a
million files in one repository,
initial clones,
when you get it,
you get it all.
You put it in one repository,
Git thinks of it as one thing.
Don't do that.
If you have multiple components,
do them as
separate repositories.
You can actually have what we
call a super project that
contains pointers to other
projects and the user
interfaces there are
somewhat lacking.
But you keep separate projects
separate, and then you avoid
the problem of, OK, you
have to get it all.
Because with Git, you do
have to get it all.
AUDIENCE: What if they
all share code?
[INAUDIBLE]?
LINUS TORVALDS: If they
all shared code.
What you can do with Git, if
you actually have a lot of
shared stuff, since Git actually
internally uses a
content-addressable file system,
if there are files
with identical content, Git will
actually use the exact
same object for them and
save you tons of space.
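The idea of a content-addressable store can be sketched in a few lines of Python. This is a toy model, not Git's implementation: the `ObjectStore` class and the sample file contents are made up for illustration, and it hashes the raw bytes only, whereas Git's real object format also includes a type and length header.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: each object is keyed by the SHA-1 of its bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        # The key is derived purely from the content, so storing the
        # same bytes twice lands on the same key and the same object.
        key = hashlib.sha1(content).hexdigest()
        self.objects[key] = content
        return key


store = ObjectStore()

# Two hypothetical projects that both carry an identical file.
id_a = store.put(b"shared license text\n")
id_b = store.put(b"shared license text\n")
id_c = store.put(b"something else\n")

print(id_a == id_b)        # identical content, identical name
print(len(store.objects))  # only two distinct objects are stored
```

Because the name is a function of the content, sharing falls out for free: nothing has to notice that two files are duplicates.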
You can have these shared
objects and still have them as
separate entities.
You can still see them as
separate repositories that
just have a shared file
system backing the
data, you can do that.
If you actually have shared code
in the sense that you,
for example, have a library that
is used by five different
things, that's when you use
the super project support,
where you have one Git
repository that just tracks
all the other Git
repositories.
It may contain stuff
like a shared build
infrastructure, too.
But then the individual
pieces are individual.
These are like CVS modules.
In CVS, modules aren't really
individual but that's because
in CVS, the directory is a
thing of its own anyway.
So CVS modules are kind of a
combination of this and just
tracking them all.
But you can basically think
of it as CVS modules.
And we do support it but I do
have to admit, that code is
fairly recent and that's one
area where our user interfaces
right now are definitely
lacking some.
There was probably some other
part to that question that I
completely forgot.
AUDIENCE: [INAUDIBLE].
LINUS TORVALDS: I
can't hear that.
AUDIENCE: The question was, can
you have just part of the
files pulled out of the
repository, not the entire
repository?
LINUS TORVALDS: You can export
things as tarballs, you can
export things as individual
files.
You can rewrite the whole
history to say, I want a new
version of that repository that
only contains that part.
You can do that.
It's a fairly expensive
operation.
It's something you would do, for
example, if you import an
old repository into one huge
Git repository and then you
can split it later on to be
multiple, smaller ones.
You can do it.
What I'm trying to say,
you should generally
try to avoid it.
It's not that Git can't handle
huge projects, it's that Git
won't perform as well as it
would otherwise and you will
have issues that you wish
you didn't have.
I'm skipping this and going back
to the performance issue.
One of the things I want to say
about performance is a lot
of people seem to think that
performance is about doing the
same thing, just doing
it faster.
And that's not true.
That's not what performance
is all about.
If you can do something really
fast really well, people start
using it differently.
One of the things I wanted to
make sure is that merges go
really, really quickly because
I want people to merge often
and merge early because
it turns out it
becomes easier to merge.
If you merge every day, suddenly
you never get to the
point where you have
huge conflicts
that are hard to resolve.
If you actually make branching
and merging easy, you actually
avoid a whole class of problems
that you otherwise
have a really, really
hard time avoiding.
So for example, let's go back
to one of the things where I
think the designers
of Subversion
were complete morons.
Strong opinions.
That's me, right?
There's a few of them in the
room today, I suspect.
You're stupid.
Subversion, for example, talks
very loudly about how they do
CVS right by making branching
really cheap.
It's probably on their main web
page where they probably
say that branching in Subversion
is an O(1) operation.
You can do as many cheap
branches as you want.
Never mind that the
O(1) is actually a
pretty large O, I think.
But even if it takes a millionth
of a second to do
branching, who cares?
It's the wrong thing
you're measuring.
Nobody is interested
in branching.
Branches are completely useless
unless you merge them,
and CVS cannot merge
anything at all.
You can merge things once, but
because CVS then forgets what
you did, you can never, ever
merge anything again without
getting horrible, horrible
conflicts.
Merging in Subversion is
a complete disaster.
The Subversion people kind of
acknowledge this and they have
a plan and their plan
sucks, too.
It is incredible how stupid
these people are.
They've been looking at the
wrong problem all the time.
Branching is not the issue,
merging is, and merging they
didn't do squat for five
years after the fact.
That is sad.
So performance is important,
but you need to
look at what matters.
Performance for making a branch
under Git, literally
you create a new file that
is 41 bytes in size.
How fast do you think that is?
I don't think you
can measure it.
If you use Windows
you can probably
measure it because file--
but whatever.
It is so fast you can't
really measure it.
That's creating a branch.
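The 41 bytes are a 40-character hexadecimal commit name plus a newline. A minimal Python sketch of that idea follows. It writes into a scratch directory rather than a real repository, the `create_branch` helper and the commit id are invented for illustration, and a real Git may also keep branch refs in a packed form.

```python
import os
import tempfile


def create_branch(git_dir: str, name: str, commit_id: str) -> str:
    """Write a loose branch ref: one small file holding the commit name."""
    ref_path = os.path.join(git_dir, "refs", "heads", name)
    os.makedirs(os.path.dirname(ref_path), exist_ok=True)
    # newline="\n" keeps the file at exactly 41 bytes on every platform.
    with open(ref_path, "w", newline="\n") as f:
        f.write(commit_id + "\n")
    return ref_path


demo_dir = tempfile.mkdtemp()  # stand-in for a .git directory
fake_commit = "0123456789abcdef0123456789abcdef01234567"  # made-up 40-hex id
path = create_branch(demo_dir, "experimental-feature-x", fake_commit)
size = os.path.getsize(path)
print(size)
```

No file contents are copied and no history is touched, which is why the cost does not depend on the size of the project.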
Nobody cares.
It's not an issue.
That's not it.
The only thing that matters
is how fast can you merge?
In Git I merge 22,000 files several
times a day and I get unhappy
if a merge takes more
than five seconds.
And all of those five seconds
are just the downloading of
the deltas between
the two trees.
The merge itself takes less
than half a second, and I
don't have to think about it.
What takes longer than the merge
is after every merge by
default, Git will do a diff
stat of everything that
changed as a result of
that merge because I
do care about that.
When I merged from somebody,
I trust them.
But on the other hand, hey, they
might have stopped using
their medication.
I mean, I trust them, but let's
just be honest here.
They might have been
OK yesterday,
today not a good day.
So I do a diff stat and Git
does that by default.
You can turn it off if you
really want to, but you
probably shouldn't.
It's fast enough anyway.
If it's a big merge, the diff
stat usually takes a second or
two because creating a diff
and actually doing all the
stats on how many lines changed,
that actually is much
more expensive than doing
the merge itself.
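To see why the diffstat costs more than the merge, note that it has to compare file contents line by line. A rough Python sketch of the counting step, using the standard `difflib` module rather than Git's own diff machinery, might look like this. The `diffstat` function name and the sample inputs are assumptions made for illustration.

```python
import difflib
from typing import Tuple


def diffstat(old: str, new: str) -> Tuple[int, int]:
    """Return (lines added, lines removed) between two versions of a file."""
    added = removed = 0
    for line in difflib.ndiff(old.splitlines(), new.splitlines()):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            removed += 1
    return added, removed


before = "a\nb\nc\n"
after = "a\nc\nd\n"
stat = diffstat(before, after)
print(stat)
```

Every changed file has to be read and compared this way, while the merge itself can often decide by comparing names of whole trees.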
That is the kind of performance
that actually
changes how you work.
It's no longer doing the same
thing faster, it's allowing
you to work in a completely
different manner, and that is
why performance matters and why
you really shouldn't look
at anything but Git.
Hg, Mercurial, is pretty good,
but Git is better.
I think I'm running
out of time.
OK, this one is still
interesting.
We never got to the
implementation part, you
really don't care.
I will say so much about
implementation is the
implementation is
really simple.
The core data structures are
really, really, really simple.
If you then look at the source
code, you realize it's 80,000
lines and mostly in C. And the
kind of C I write most people
don't understand,
but I comment it.
The source code may sometimes
look complicated because we
are very performance-centric.
I am, I really care.
And sometimes to make things
go really fast, you have to
use more complicated algorithms
than just checking
one file at a time.
When you're doing 22,000-file
merges, you don't want to
check one file at a time.
You want to check the whole
tree in one go and say,
they're the same, I didn't
need to do anything.
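One way to picture checking a whole tree in one go: if every directory has a name derived from its contents, two directories with the same name are identical and never need to be opened. The Python sketch below is a simplified model, not Git's tree format. The nested-dict layout, `tree_hash`, and `changed_paths` are invented for illustration, and it recomputes hashes on every call where a real system would have them stored already.

```python
import hashlib


def tree_hash(node) -> str:
    """Hash a blob (bytes) or a tree (dict mapping name -> node)."""
    if isinstance(node, bytes):
        return hashlib.sha1(b"blob\0" + node).hexdigest()
    h = hashlib.sha1(b"tree\0")
    for name in sorted(node):
        h.update(name.encode() + b"\0" + tree_hash(node[name]).encode())
    return h.hexdigest()


def changed_paths(a, b, prefix=""):
    """List paths that differ, descending only into subtrees whose hashes differ."""
    if tree_hash(a) == tree_hash(b):
        return []  # identical subtree: skip everything below it
    if isinstance(a, bytes) or isinstance(b, bytes):
        return [prefix.rstrip("/")]
    out = []
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            out.append(prefix + name)
        else:
            out.extend(changed_paths(a[name], b[name], prefix + name + "/"))
    return out


base = {
    "README": b"r",
    "drivers": {"usb": {"hub.c": b"1"}, "scsi": {"sd.c": b"2"}},
}
theirs = {
    "README": b"r",
    "drivers": {"usb": {"hub.c": b"1 changed"}, "scsi": {"sd.c": b"2"}},
}
diff = changed_paths(base, theirs)
print(diff)
```

Here the `scsi` directory and `README` are dismissed by a single comparison each, however many files sit underneath them.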
So Git does things like that
and that kind of blows the
source code up a bit because
doing it well is complicated.
But the basics are really,
really simple, and one of the
basics is this trust and
reliability thing.
Every single piece of data, when
Git tracks your content,
we compress it, we delta it
against everything else.
But we also do a SHA-1 hash of
the content, and we actually
check it when we use it.
If you have disk corruption, if
you have DRAM corruption,
if you have any kind of problems
at all, Git will
notice them.
It's not a question of
if, it's a guarantee.
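The check itself is small. Git names a file's content (a "blob") by the SHA-1 of a short header followed by the bytes, so any later change to the bytes no longer matches the name. A minimal Python sketch follows. The `blob_id` and `verify` helpers are names chosen here, and the sketch leaves out the compression and delta steps.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Name a blob the way Git does: SHA-1 over 'blob <size>\\0' plus the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def verify(expected_id: str, content: bytes) -> bool:
    """Re-hash the content on use and compare it with the name it was stored under."""
    return blob_id(content) == expected_id


original = b"hello world\n"
name = blob_id(original)

corrupted = b"hello w0rld\n"  # one flipped character, as from a bad disk block
print(verify(name, original))
print(verify(name, corrupted))
```

The same comparison catches accidental corruption and deliberate edits alike, since both change the bytes without changing the name.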
You can have people who
try to be malicious.
They won't succeed.
You need to know exactly 20
bytes, you need to know the
160-bit SHA-1 name of your top
of tree, and if you know that,
you can trust your
tree all the way
down, the whole history.
You can have 10 years of
history, you can have 100,000
files, you can have millions
of revisions, and you can
trust every single piece of it
because Git is so reliable and
all the basic data structures
are really, really simple.
And we check checksums.
And we don't just check some
piddly UDP packet checksum
that is a 16-bit sum
of all the bytes.
We check a checksum
that is considered
cryptographically secure.
Nobody has been able to break
SHA-1, but the point is the
SHA-1, as far as Git is
concerned, isn't even a
security feature.
It's purely a consistency
check.
The security parts are
elsewhere, so a lot of people
assume that since Git uses SHA-1
and SHA-1 is used for
cryptographically secure stuff,
they think that, OK,
it's a huge security feature.
It has nothing at all to do with
security, it's just the
best hash you can get.
Having a good hash is good for
being able to trust your data.
It happens to have some other
good features, too.
It means that when we hash
objects, we know that the
hashes are actually
well-distributed and we don't
have to worry about certain
distribution issues.
So internally, it means from an
implementation standpoint
we can trust that the hashes
are so good that we can use
hashing algorithms and know that
there are no bad cases.
So there are some reasons
to like the
cryptographic side, too.
But it's really about the
ability to trust your data.
I guarantee you, if you put
your data in Git, you can
trust the fact that five years
later, after it was converted
from your hard disk to DVD to
whatever new technology and
you copied it along, five years
later you can verify
that the data you get back
out is the exact same
data you put in.
And that's something you really
should look for in a
source control management
system.
One of the reasons I care is for
the kernel, we had a break
in on one of the BitKeeper sites
where people tried to
corrupt the kernel source
code repositories.
And BitKeeper actually
caught it.
BitKeeper did not have a really
fancy hash at all.
I think it's a 16-bit CRC,
something like that.
But it was good enough that you
could actually see clumsy attempts.
It was not cryptographically
secure, but it was hard enough
in practice to overcome that
it was caught immediately.
When that happens once to you,
you got burnt once, you don't
ever want to get burnt again.
Maybe your projects aren't
that important.
My projects, they're
important.
There's a reason I care.
This is also one of the reasons
to go back to the
distribution angle a bit.
When you do Google, for example,
Google code, you have
your source repositories that
you help people maintain, and
I think you do so under
Subversion.
I would never, ever trust Google
to maintain my source
code for me.
I'm sorry.
You're just not that
trustworthy.
The reason I really prefer a
distributed system is I can
keep my source code behind three
firewalls on a system
that does not allow
SSH in at all.
When I'm here, I cannot read my
email because my email goes
onto my machine, and the only
way I can get into that
machine is when I'm physically
on that network.
So maybe I'm cuckoo, maybe I'm
a bit crazy and I care about
security more than
most people do.
But this whole notion that I
would give the master copy of
source code that I trust and
I care about so much, and I
would give it to a third
party is ludicrous.
Not even Google, no way
in hell would I do that.
I allow Google to have a copy
of it, but I want to have
something that I know
nobody touched.
And by the way, I'm not a great
MIS person, so the disk
corruption issue is definitely
a case that I might worry
about because I don't
do backups.
So it's OK if I can then
download it again from
multiple trusted parties.
I can verify them against
each other, that
part is really easy.
I can verify them against
hopefully that 20 bytes that I
really, really cared about.
Hopefully I have that
in a few places.
20 bytes is easier to track
than 180 megabytes and
corruption is less likely
to hit those 20 bytes.
If I have those 20 bytes, I can
download a Git repository
from a completely untrusted
source and I can guarantee
that they didn't do anything
bad to it.
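Why 20 bytes are enough can be shown with a simplified hash chain. In this Python sketch each commit name covers the parent's name plus the snapshot, so the newest name depends on everything before it. It is an illustration only: `commit_id`, `head_of`, and the byte-string snapshots are assumptions, and Git's real commit objects carry more (trees, authors, messages, multiple parents).

```python
import hashlib


def commit_id(parent_id: str, snapshot: bytes) -> str:
    """Name a commit by hashing its parent's name together with its content."""
    return hashlib.sha1(parent_id.encode() + b"\0" + snapshot).hexdigest()


def head_of(history) -> str:
    """Walk a list of snapshots, oldest first, and return the newest commit name."""
    current = ""
    for snapshot in history:
        current = commit_id(current, snapshot)
    return current


# The one value kept somewhere safe: 40 hex characters, i.e. 20 bytes.
trusted_head = head_of([b"v1", b"v2", b"v3"])

downloaded_ok = [b"v1", b"v2", b"v3"]
downloaded_bad = [b"v1", b"v2 with a backdoor", b"v3"]

print(head_of(downloaded_ok) == trusted_head)
print(head_of(downloaded_bad) == trusted_head)
```

A change anywhere in the old history changes every name after it, so a copy from an untrusted mirror either reproduces the trusted name or is rejected.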
That's a huge thing, and that's
something that when you
do hosted repositories for
other people, if you use
Subversion you're just
not doing it right.
You're not allowing them
to sleep well at night.
Of course, if you do it for
75,000 projects, most of them
are probably pretty
small and not very
important, so it's OK.
That should make people
feel better.
I have a few more slides.
I think we're over time.
I'm not even going to bother
showing them, they're not that
interesting I think.
I talked a bit about this,
about content versus
individual files.
Git tracks content.
There is the only sample command
line in the whole
presentation.
Gitk is the graphical viewer of
history of a Git project.
It's a [UNINTELLIGIBLE]
script that is really only doing
viewing of stuff that
Git is really good
at showing you.
And this is the kind of command
line I use as a
top-level maintainer.
I want to be able to say what
changed since a particular
version, maybe since a
particular date, I can do that
easily, in those two directories
or in those two
directories and that file.
And what this will show me is
the global history as it
pertains to those parts
of the repository.
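The effect of limiting history to a few directories can be modeled roughly as filtering the global history by the paths each commit touched. The Python sketch below is a deliberate simplification of what Git does: it assumes a linear history with a precomputed set of changed paths per commit, and the `Commit` type, `limit_to_paths`, and the sample commits are all made up for illustration.

```python
from typing import List, NamedTuple, Set


class Commit(NamedTuple):
    name: str
    changed: Set[str]  # paths touched by this commit


def limit_to_paths(history: List[Commit], paths: List[str]) -> List[str]:
    """Keep only commits that touch one of the given files or directories."""

    def touches(commit: Commit) -> bool:
        return any(
            f == p or f.startswith(p.rstrip("/") + "/")
            for f in commit.changed
            for p in paths
        )

    return [c.name for c in history if touches(c)]


history = [
    Commit("c1", {"drivers/scsi/sd.c"}),
    Commit("c2", {"fs/ext3/inode.c"}),
    Commit("c3", {"include/scsi/scsi.h", "drivers/usb/core/hub.c"}),
    Commit("c4", {"MAINTAINERS"}),
]

# Two directories and one file, in the spirit of the command line on the slide.
selected = limit_to_paths(history, ["drivers/scsi", "include/scsi", "MAINTAINERS"])
print(selected)
```

The starting point is always the whole project's history, and the per-path view is derived from it, which is the opposite of systems that keep a separate history per file.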
It is more expensive to compute
than the global,
global history, but if my laptop
was actually connected
to the A/V system,
I could show you.
Even on that laptop, it
comes up in seconds.
It is that expensive, but
we are that good.
This is something that is
really, really unique to Git.
Nobody else can do it.
And it's a hugely important
feature.
Maybe it's not so important to
individual developers because
individual developers often do
think in terms of single
files, but it is important for
the people who merge stuff, it
is important for people like me
and the people I work with
directly because they
never basically care
about a single file.
And they do care about these
kinds of features.
Somebody sends a bug report,
which bug reports are usually
not very good.
But maybe the bug report is
good enough that you can
pinpoint, OK, SCSI
subsystem.
That's the command line.
You can't say which file, but
you can do this and say, OK,
that will cut it down from the
15,000 commits we've had since
last week, it will cut
it down to 50.
That's a huge deal.
That is something that nobody
else can do, I guarantee you.
So that's the reason you
want to use Git.
That's what it all
boils down to.
It's safe, it is so fast that
you can do things that nobody
else can do, it does things that
nobody else can do, even
slowly, and it's distributed.
So go and spread the word.
We have one more question,
I guess.
What is the timing like?
I don't know.
AUDIENCE: Quickly.
So one of the reasons why we
would switch from Perforce is
scalability
and performance.
Otherwise, people would just
say, keep using it.
Would we be exchanging one set
of scalability performance
problems for other scalability
performance problems?
LINUS TORVALDS: I already
mentioned the fact that I
don't know how you maintain
stuff in Perforce, but when
and if you do a switchover to
Git, what you want to make
sure is because of this content
model, you need to do
it at sane content boundaries.
The content boundaries usually
are actually pretty
self-obvious.
I mean, they really are.
You have the compiler, you have
the main source, you have
the documentation.
Well, you probably have the
documentation spread out, but
you may have some user visible
documentation.
Or maybe Google doesn't.
But a lot of companies have a
separate set of documentation
that they give to customers,
and then they have the
documentation that goes into
each individual package, is
package-based.
So one of the things you do have
to think about with Git
is you want to make sure it is
in a somewhat sane hierarchy.
Git can easily handle
larger projects.
You can have 10,000 files and
that's not a problem.
The kernel is 22,000, we've done
tests with 100,000, it's fine.
It's faster than
anything else.
With a million files, I suspect
other systems will be
faster at some things.
And that's the kind of
situation I don't
want you to get into.
But if you do that basic setup
correctly, it will be
basically faster at pretty much
everything, than anything
anybody else will.
I am very confident about
Git performance.
One of the things we don't
necessarily do really well is
the CVS Annotate.
People use CVS Annotate a lot
if they use CVS. I'm told it
sucks under Perforce, too, so
you probably don't use the
Perforce version of Annotate,
I'm not sure.
But CVS users are used
to CVS Annotate.
It's the one operation that CVS
can do faster than Git,
because CVS does track things
one file at a time.
Git doesn't.
Git has an annotate, but if you
moved a function from one
file to another, Git will
literally tell you the history
of that function even
across that move.
Not a file move, a function
within a file.
It will go and dig back and
say, hey, those two lines
actually came from that other
file five years ago.
That is, again, something nobody
else can do and it
boils down to the same thing.
It's the content that matters,
it's not actually the files.
But it does make it a much more
expensive operation, so
if you go back five years maybe
it takes 30 seconds.
On the kernel, it takes a second
for any file I have. We
started from no history two
years ago because we just made
the decision that let's not make
it more complicated than
it needs to be.
So right now, we only
have two years of
history in the kernel.
We have more history in other
projects that we've done
timings on.
So we've done timings on
importing the KDE and things
like that with more history.
There are performance issues,
but most of them are, Git is
one or two orders of
magnitude faster.
So most of them are
the good kind.
And if you find something, we
actually have a really, really
good community.
The Git mailing list is fairly
high signal to noise.
It does get a fair amount of
emails, but it's actually a
very pleasant mailing list. If
anybody is interested, read
the sources first, but
start looking at the
mailing list archives.
We have our flames, we have our
pointless discussions, but
most of it is actually
very good.
OK.
Thanks.
