- [Dmitry] Hello everybody.
My name is Dmitry.
- [Timur] My name is
Timur and I'm timur_audio
on like Twitter and
other online platforms.
- And we are going to
talk about parsing C++.
And we both work at JetBrains on CLion
and all this C++-related stuff.
So we both have kind of a bias,
and part of this talk will be
from the tooling point of view
and not from the purely
theoretical or compiler point of view.
And a bit of disclaimer.
The main focus of this
talk is to have fun, because
it's not all that practical.
If you are writing C++ compilers,
you probably already know this stuff,
and if you are not, well,
you'd better not use the snippets
we are showing in production anyway.
But it's still kind of cool.
It kind of helps to
understand the language better
and to have a peek inside the language
design process and how
different language features
interact with each other
and stuff like that.
A very, very brief introduction to
how the textbook compiler usually works.
So there is a lexer, a tool that converts
the source file to a stream of tokens.
Then usually there is a parser
that makes a syntax tree
from the stream of tokens,
and then comes all the fancy stuff
that we are not going to talk about today.
So we are mostly focusing on parsing
and a bit on lexing, and on how
C++ parsers and C++
lexers are kind of special.
And the first thing that is special
is the preprocessor, because
a C++ source file doesn't consist only of C++.
There is also the preprocessor,
and it's a completely different language.
It's a language in a language.
So the one minor annoyance here:
you can't even properly highlight
the source code using a
lexer-based highlighter,
because these things on the slide
look like keywords,
except they're not.
And these are not C++ keywords because
the proper C++ code is like on the right.
But it's kind of minor annoyance
and you usually don't notice that.
What is a major annoyance?
Code reuse in C++ is also
based on the preprocessor,
and it doesn't care about the C++ syntax.
So combined with all the complications
in parsing that we are going to talk about,
it's kind of hard to do tooling properly,
because the Holy Grail of C++ tooling
is not to reparse
all the included headers again and again,
because on my machine a hello world
after preprocessing
is like 14,000 lines,
and we don't want to
reparse it every time.
And it could be doable
with just the preprocessor,
or just from the
point of view of the parser,
but with the two
combined it's complicated.
And we can't wait for
modules to come and save us.
But the preprocessor is
bad and we all know it.
So let's not talk about it anymore.
Let's talk about the grammar.
And the common talking point is that C++
doesn't really have a grammar.
People like to say that,
but it's kind of true
for a lot of languages,
because it's complicated to describe
what code is well-formed and
what code is ill-formed
using just the grammar.
What's on the slide is part
of the C++ standard.
So that's kind of true
for a lot of languages.
But what is specific to C++
is that it has a highly ambiguous grammar.
So there are two kinds of ambiguities.
One is that some things just can't be
distinguished at all,
so you have to state that something
is always a declaration.
But that's not a problem.
What is a problem?
That some things are distinguishable,
but to resolve them you require
semantic information.
For example,
in the grammar there are
things like type name
and there are things like expression,
and both of them can be just an identifier.
So they can't
be distinguished using just the syntax
and the keywords.
And to distinguish them and to produce
different syntax trees for them,
you have to know this semantic information
or maybe do something else.
And it's not a problem for
a lot of modern languages.
Usually they can be parsed without it.
Usually you can write the grammar description
for your favorite parser generator,
generate the parser, and be done,
but not with C++.
And in some of the languages
there are problems of this kind,
like there is a minor problem
like that in Java or, like, (mumbles)
all the syntax read list is just list.
Nice at least.
And you can just say, okay,
we are not going to
distinguish it while parsing.
We are going to postpone it for later.
It works for some but it doesn't work
for C++ because it's very different.
And also there are smart ways
to deal with it, like parsing
both branches simultaneously,
but we are not going to talk about that
because most of the mainstream
tools don't work like that.
And most of the mainstream compilers,
the mainstream tools,
are just pretty simple-looking parsers.
But they are required to know
about the semantics of the language,
and they are required to do
semantic analysis as they go.
That's why they're usually complicated.
They're kind of hard to write.
And I'll pass it to Timur to
have some specific examples.
- Right, so let's dive
into some actual code.
As you know C++ derives a lot from C,
so it also inherits a lot
of the grammar from C.
So let's see what the
problems in the grammar of C are,
what ambiguities
we can have there.
So it turns out there's
quite a lot of things
in C that can be ambiguous.
For example here inside
the main in the first line,
this could be either a declaration
of an array of pointers,
or it could be a function call
and then a subscript on
the result of that function.
Likewise the return statement could be
either the sum of two variables
or it could be one variable,
and then it's cast to type B.
And so it's ambiguous, we don't
know how to interpret this.
And this depends on the
meaning of the identifiers.
So for example in the first one,
if we know that T is a type,
then we know that the whole
thing is going to be
a declaration, whereas
if T is a function name
we know that this is
gonna be a function call.
And in the second one, if B is a type,
then the whole thing will be a cast.
If B is a variable then
that's going to be a sum.
So what we need to know to
resolve these ambiguities is
whether the identifiers are types or not.
That's basically the main
problem to unambiguously parse C.
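The slide is not captured in this transcript, so here is a minimal reconstruction of the two ambiguities, written as C++ so that both readings can be checked at compile time. The names T, B, p and a are placeholders, and each namespace gives T and B a different meaning. In this particular spelling, the declaration reading makes p a pointer to an array.

```cpp
namespace names_are_types {
    using T = int;
    using B = int;

    constexpr int reading() {
        int a = 5;
        // Declaration: p is a pointer to an array of 3 T.
        // (Initialized only because constexpr functions require it before C++20.)
        T (*p)[3] = nullptr;
        (void)p;
        return (B) + a;  // cast: unary +a converted to B
    }
}

namespace names_are_values {
    constexpr int table[4] = {10, 20, 30, 40};
    constexpr const int* T(int) { return table; }  // here T is a function
    constexpr int B = 100;                         // here B is a variable

    constexpr int reading() {
        int a = 5;
        const int* p = &a;
        T (*p)[3];       // expression: call T(*p), subscript the result, discard it
        return (B) + a;  // sum of two variables
    }
}

static_assert(names_are_types::reading() == 5);
static_assert(names_are_values::reading() == 105);
```

The same token sequences appear in both functions; only the meaning of the identifiers differs.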
How do we do that?
So it turns out in C it's
actually relatively easy
because in C a type is
either a fundamental type,
and then you're talking
about ints or floats.
And those things are keywords anyway,
so you don't have a problem there.
Or the only other types that
we can have in C are structs.
And in C if we have a
struct type like struct Foo
and you wanna declare an
instance of this type,
in C it's mandatory to use
this so-called struct tag.
So we can't just say Foo f,
we have to say struct Foo f,
and the same if we declare
a parameter or whatever.
So this way it's also
always unambiguous, right?
So if you have an identifier
then it's a struct type
if it's preceded by the word struct.
The only case in C where it gets
a little bit tricky are typedefs.
So there's a common pattern
in C where you write
typedef struct Foo and
then the definition of Foo.
And then you can use it
without the struct tag.
And then you run into these ambiguities
where you need to know what Foo is,
but it turns out that it's actually enough
to just track all the
identifiers that are typedefs,
like, keep them in a table.
And once you have that and you know
whether an identifier is a typedef or not,
that's enough to resolve
all the ambiguities.
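As a rough sketch of that table idea (a toy, not how any particular compiler implements it): the parser records every typedef name it sees and later asks the table how to classify an identifier. The names TypedefTable, NameKind and classify_after_typedef are made up for this illustration, and the flat fixed-size table ignores scopes, which a real C parser cannot do.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

enum class NameKind { TypedefName, OrdinaryIdentifier };

// Hypothetical flat table of typedef names; capacity and lack of scoping
// are simplifications for the sketch.
class TypedefTable {
public:
    constexpr void declare_typedef(std::string_view name) {
        if (size_ < names_.size()) names_[size_++] = name;
    }

    constexpr NameKind classify(std::string_view name) const {
        for (std::size_t i = 0; i < size_; ++i) {
            if (names_[i] == name) return NameKind::TypedefName;
        }
        return NameKind::OrdinaryIdentifier;
    }

private:
    std::array<std::string_view, 64> names_{};
    std::size_t size_ = 0;
};

// Helper for the usage example: record one typedef, then classify a name.
constexpr NameKind classify_after_typedef(std::string_view declared,
                                          std::string_view queried) {
    TypedefTable table;
    table.declare_typedef(declared);
    return table.classify(queried);
}

static_assert(classify_after_typedef("Foo", "Foo") == NameKind::TypedefName);
static_assert(classify_after_typedef("Foo", "f") == NameKind::OrdinaryIdentifier);
```

With such a table, a statement starting with Foo can be routed to the declaration parser, and one starting with f to the expression parser.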
It's not quite as trivial
because there are some
weird cases where an identifier
actually can change its meaning
in the middle of a declaration.
There's some weird
examples on the internet.
That's rare and it's not really a problem.
People have done this for decades.
But of course we are
not in a C conference,
we're at a C++ conference,
so how are things in C++?
Of course in C++ it is much, much worse.
Why is that?
Because we have all these nice features
that we all like to use, right?
So what we're gonna try and do is
to point out a few features
that C++ introduced and
then say how they made it
harder to parse and made
the grammar more complicated.
So let's start with pretty much
the first really cool feature in C++
that C doesn't have, which is constructors.
So in C++ you can construct an object
with a constructor that
takes several arguments,
you give them in parens,
and by the extension it also means
that C++ introduced this
initialization syntax
where you initialize the
variable with parens.
So you have the same syntax for classes
and also all the other types like ints.
I can actually point at this thing, wow.
So you have this new syntax here where
you initialize a variable,
you declare a variable,
initialize it with parens.
Which means we have a new ambiguity.
So if you have int j and then in parens i,
then if i is a type, the whole thing
would be a function declaration,
otherwise i will be the
initializer of j, right?
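A small compile-time check of both readings, assuming C++17. The two namespaces are only there so that i can mean a variable in one and a type in the other.

```cpp
#include <type_traits>

namespace i_is_a_variable {
    constexpr int i = 42;
    int j(i);  // variable j of type int, initialized with i
    static_assert(std::is_same_v<decltype(j), int>);
}

namespace i_is_a_type {
    using i = int;
    int j(i);  // declaration of a function j taking an int and returning int
    static_assert(std::is_same_v<decltype(j), int(int)>);
}
```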
So far so good, very easy.
But the problem is that
now you have to track
whether i is a type in order
to be able to parse it.
And it's not enough to
just track the typedefs.
The origin of this is that in C++
you don't have to write the struct tags.
In C++ we call them
elaborated type specifiers.
In C++ they're optional,
and most of the time we don't use them.
And the reason why Bjarne
introduced that when he created C++,
and this is
from his book The Design and Evolution of C++,
is basically that he wanted user-defined types
and built-in types to be
kind of on the same footing,
on the same level.
He didn't want user defined types
to be more annoying to
use than built in types.
So that's great but it
makes parsing harder.
So elaborated type
specifiers, they're optional.
If you don't use them you
have these ambiguities.
Actually if you do use them,
there's one use case where you actually,
where they're actually useful is when
you have an identifier
that could be both a type
and a variable in some scope,
and then you have to kind
of disambiguate them.
This is the way you actually use them.
But you can also use them in other contexts
and then it gets weird. Like, for example,
if you declare a function like this
and you use an elaborated
type specifier in there,
like the struct S star,
then that also actually simultaneously
acts as a forward declaration.
So that's gonna compile
even if S is not defined
anywhere else above.
And actually these implicit
forward declarations
created by these
elaborated type specifiers,
they have really weird scoping rules.
So actually, if we write it like that,
then it's gonna actually
leak into the outer scope
and that bottom line there is also okay.
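The exact slide code is not in the transcript. A minimal sketch of the effect, with the hypothetical names take and leaked, could look like this: the elaborated type specifier in the parameter list is the first mention of S, and afterwards the plain name S is visible in the enclosing namespace.

```cpp
#include <type_traits>

// S has not been declared anywhere above this line.
inline void take(struct S* p) { (void)p; }

// The elaborated type specifier declared S in the enclosing namespace,
// so the plain name S works here even though S is still incomplete.
S* leaked = nullptr;

static_assert(std::is_same_v<decltype(leaked), struct S*>);
static_assert(std::is_same_v<decltype(&take), void (*)(S*)>);
```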
So this is really weird.
So, yeah.
So they make our lives harder basically.
The next thing about constructors
is there is this constructor call notation
where we can just write
type and then parens
and then some arguments or no arguments
and that's gonna be kind
of a constructor call
as an expression and that
is super useful, right?
So we can just call these constructors
create temporary objects
anywhere in expression.
Super cool feature.
But as Bjarne recognized decades ago,
this is the fertile source
of more parsing problems.
So let's see what constructor
call notation does.
First of all, it gives rise
to the most vexing parse,
which is probably the most famous
kind of grammar ambiguity in C++. I'm sure
many of you have heard about that.
So, if you have a type foo
and then we write this kind of foo f
and then, in parens, foo paren paren
as the parameter,
what is this thing?
So it could be either an initializer.
We call the constructor and
this returns a temporary foo
and that's gonna be the
initializer for F, right?
Or it could be a
declaration of a function
that takes a function
that takes no arguments
and returns a foo,
and then that function
returns a foo, right?
And then the C++ grammar says,
well, we disambiguate it by saying
everything that could be
a declaration is a declaration,
so this line would be a
function declaration actually.
And if you want it to mean
that it's an initializer,
we have to use these extra parens around,
which is really weird and can
bite you in real-world code,
and Nico had a nice example
in his talk this week.
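Both spellings side by side, as a compile-time check (assuming C++17). The name g for the parenthesized variant is an addition for this sketch.

```cpp
#include <type_traits>

struct foo {};

foo f(foo());    // function declaration: f takes a pointer to a function returning foo
foo g((foo()));  // variable g, initialized from a temporary foo

static_assert(std::is_same_v<decltype(f), foo(foo (*)())>);
static_assert(std::is_same_v<decltype(g), foo>);
```

The parameter of f is written as a function type and is adjusted to a pointer to function, which is why the first assertion spells it as foo (*)().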
But this is kind of well-known.
The other thing that kind
of goes under the hood
that you might not maybe be
aware of if you're writing code
is that if you try to parse the stuff,
actually, you have to resolve
this ambiguity, right?
So sometimes there's an open
paren, then you start parsing it
and then you try to figure
out: if this is a type,
then the whole thing is
a function declaration.
Could this be something else?
And then you can have something
arbitrarily long there,
right, where you don't know,
okay, so, if it would
end here, it would be
like most vexing parse
function declaration,
but there's a comma there
so it could also be a constructor call.
We don't know yet.
And the next one, well,
it also could be either
constructor call or a function type.
And the next one also, right?
So you have to keep, keep
going and then at some point,
there is something that
will disambiguate it,
which is, in this case, this bar.
And then if that's a type,
the whole thing will be
a function declaration.
If that's a variable,
the whole thing is going
to be an initializer.
And then if you guessed wrong,
you have to rewind the whole thing
and start over with your parsing.
And if you're parsing C++ declarations,
you have to do that all the time, right?
There are more cases like
this where we have like
two or three different possibilities
and we keep doing this all
the time when you parse code.
Just guess start parsing, oh, wrong!
Rewind.
Disambiguate, rewind, start over.
So that's what parsers do all the time.
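To make the guess-and-rewind loop concrete, here is a toy sketch under heavy assumptions: the input is an already tokenized parameter list, a name counts as a type exactly when it starts with an uppercase letter, and the only grammar is a comma-separated list of type names optionally followed by empty parens. Cursor, try_parse_parameter_list and classify are hypothetical names; a real C++ parser is far more involved.

```cpp
#include <cstddef>
#include <string_view>

// Toy assumption: a name is a type if and only if it starts with an uppercase letter.
constexpr bool is_type_name(std::string_view name) {
    return !name.empty() && name[0] >= 'A' && name[0] <= 'Z';
}

struct Cursor {
    const std::string_view* tokens;
    std::size_t count;
    std::size_t pos = 0;

    constexpr bool at_end() const { return pos == count; }
    constexpr bool accept(std::string_view text) {
        if (pos < count && tokens[pos] == text) { ++pos; return true; }
        return false;
    }
};

// parameter-list := parameter (',' parameter)*
// parameter      := TypeName [ '(' ')' ]
constexpr bool try_parse_parameter_list(Cursor& c) {
    do {
        if (c.at_end() || !is_type_name(c.tokens[c.pos])) return false;
        ++c.pos;
        if (c.accept("(") && !c.accept(")")) return false;
    } while (c.accept(","));
    return c.at_end();
}

enum class Meaning { FunctionDeclaration, Initializer };

constexpr Meaning classify(const std::string_view* tokens, std::size_t count) {
    Cursor c{tokens, count};
    const std::size_t checkpoint = c.pos;
    if (try_parse_parameter_list(c)) return Meaning::FunctionDeclaration;
    c.pos = checkpoint;  // wrong guess: rewind; a real parser would now parse expressions
    return Meaning::Initializer;
}
```

For the token list Foo ( ) , Foo ( ) , Bar the first guess succeeds. Replace the last token with a lowercase bar and the guess fails only at the very end, so everything parsed so far is thrown away.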
These initializers with parens,
they create more problems.
So it's not just that you have
to rewind sometimes and start over.
Sometimes this process of rewinding
is actually more complicated.
Let's take a normal function pointer.
So, one of the things we
talked about also this week is
we should all be using
trailing return types
because they're really cool and modern.
So we can also use them with
function pointers, right?
So we can write this
declaration like that.
That's a pointer to a
function that returns an int.
Now, of course, a function pointer
is also simultaneously a variable
so we can initialize it, right?
And the grammar allows
initializers in parens
so we can initialize it like that, right?
So in this case, bar is a null pointer
so we initialize the
pointer with a null pointer,
which is totally valid
code, but it looks like,
actually, this int bar is
a function type, right?
Which is, of course, not allowed
because you can't return a function type
from something like that.
But it's still valid grammar.
So in this case, you would have
to kind of start parsing it
and then realize, oh, something
is wrong and then rewind
but this makes it really tricky because
what's going on here?
So, we have this one part of the grammar,
which is the grammar of the declarator.
The declarator has this kind
of name, parameters,
and then trailing return type,
and the trailing return type
is an arrow and then a type.
And then several levels
above this whole thing
is actually part of the declaration,
which can have in the end
an optional initializer
which can optionally be in parens.
So if you're parsing the arrow int
and then you hit the
paren, then you think,
oh, this looks like a
type, like function type,
and then you hit the bar and then you say,
okay, that's not a function type.
You can't just say, okay,
well then, this is a wrong function type,
and then roll back the
function type, right?
So we have to roll back to after the int
because you have to realize
that several levels above,
there was something where
it could be followed
by something else in parens,
in this case, the initializer
and then kind of roll
back to the right point
and then and kind of start parsing
this other part of the grammar.
And these are cases
where you can find bugs
in many popular compilers.
At least one of
those three big compilers,
or I think even two,
fail at this code.
Another thing you can do with parens is
not just put them around the initializers,
but also put them around
declarators, right?
So if you have a declarator,
for example here, the
name of a variable X,
we can just put parens around
it, as many as you want.
Great.
So obviously, we need them
because, and we have that in C
because this way, we
can distinguish between
function pointers and
function returning pointers.
And that's actually not a problem.
The grammar of these kinds of pointers,
and we can also nest them,
is a bit nasty, but
it's not really ambiguous.
But the possibility of
having declarators with parens around them
creates all these
ambiguous constructs.
Like for example if we have
foo and then bar in parens,
in C++ this can be five
different things, right?
It can be a variable
declaration where you put
like a redundant parens
around the variable name.
It could be a constructor declaration
if it's inside a class.
It could be a function-style cast
if foo is a type and it's part of an expression.
It could be a function call
if foo is a function name.
It could be function type, right,
if both of these are types.
So there's like at least
five different things
that this thing could
mean and you have to,
and this also depends
obviously on the context,
on the outer things or
the things before it.
Or whether it's inside a class
or whether it's somewhere else
and you have to
distinguish all these cases
and disambiguate all these cases,
which makes it non-trivial.
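The five readings can be put side by side. This is a sketch assuming C++17, with one namespace per reading so that foo and bar can change meaning. The surrounding declarations are made up for the illustration.

```cpp
#include <type_traits>

namespace variable_declaration {
    struct foo {};
    foo (bar);  // declares a variable bar of type foo, with redundant parens
    static_assert(std::is_same_v<decltype(bar), foo>);
}

namespace constructor_declaration {
    struct bar {};
    struct foo {
        foo(bar);  // inside class foo: declares a constructor taking a bar
    };
    static_assert(std::is_constructible_v<foo, bar>);
}

namespace function_style_cast {
    using foo = long;
    constexpr int bar = 7;
    constexpr auto result = foo(bar);  // converts bar to long
    static_assert(std::is_same_v<decltype(result), const long>);
}

namespace function_call {
    constexpr int foo(int x) { return x + 1; }
    constexpr int bar = 7;
    constexpr auto result = foo(bar);  // calls foo with argument bar
    static_assert(result == 8);
}

namespace function_type {
    using foo = int;
    using bar = char;
    using type = foo(bar);  // the type "function taking char, returning int"
    static_assert(std::is_same_v<type, int(char)>);
}
```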
So another
example with parens:
this whole thing actually
nests as well, right?
So you can also have parens
if you declare a parameter
inside a function, and
then it gets really funny.
So if you have it like that,
then it could be either
an initializer of an integer foo
where you have X and
you just cast it to int,
or it could be that it's a function
taking an int parameter,
which is called X,
and then you just put the name in parens.
So both are perfectly valid
so the standard has all these
extra rules where it says,
well, if it's either this or
that, then we just say it's that.
And in this case, the overall rule is
whatever can be a
declaration is a declaration.
So in this case, int X
would be the declaration
of a parameter.
So it's not a cast, and oops.
And obviously, if X is a type,
then the whole thing is a function type.
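A compile-time sketch of these cases, assuming C++17. The variable named value with the extra parens is an addition to show how the expression reading can be forced.

```cpp
#include <type_traits>

namespace x_is_a_variable {
    constexpr int X = 3;
    int foo(int(X));      // still a declaration: function with an int parameter named X
    int value((int(X)));  // extra parens force the cast: an int variable
    static_assert(std::is_same_v<decltype(foo), int(int)>);
    static_assert(std::is_same_v<decltype(value), int>);
}

namespace x_is_a_type {
    using X = char;
    int foo(int(X));      // parameter of function type int(char), adjusted to a pointer
    static_assert(std::is_same_v<decltype(foo), int(int (*)(char))>);
}
```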
There are more funny cases
where you have this kind of
declaration versus expression
kind of thing going on.
One example is the initializer in for,
and since C++17 we can also
put them in if and switch.
So this is a weird case because typically,
we would always put like,
what they're made for
is typically, we would put a
declaration in there, right?
So if we have an if with initializer,
you would use the initializer
to initialize a variable
that we then use in the condition
and maybe in the body out here.
But the grammar allows
to put also arbitrary
expressions in there.
Well, not quite arbitrary,
but it allows to put expressions in there.
So if you have this
kind of if int A thing,
then you say, okay, if int A,
probably this is the classical case.
You're declaring a variable A of type int
in your if with initializer,
you put extra parens
around it, that's fine.
So that's a declaration,
but then it has this,
then you have this plus, oops, sorry.
Wrong button.
How did that happen?
Sorry.
Then you have this plus
stuff here going on.
And then actually, most parsers,
including popular parsers, they say,
well, okay, ill-formed declaration, right?
And then they spew out
these weird errors
although, in fact, it's actually valid
because it's an expression, right?
It's a sum.
It's A, you cast it to int
and you add another variable.
So it's perfectly valid code,
but most compilers actually
fail to compile it.
So, last one of these weird code examples.
So we talked about most vexing parse,
but actually it's not
the most vexing parse
because there is a more vexing parse.
The more vexing parse goes like this.
So, if you have a function, a declaration,
which is taking a parameter
of type foo, right?
So far so good.
All right, now,
you're gonna give this
parameter a name, right?
You can do that obviously.
So you also name it foo
just because you can.
- Yeah, like it's (mumbles).
It would have been so easy to
ban it like 50 years ago,
but I guess something new.
- Yeah.
But no one did,
so you can have the parameter name equal the type name.
Anyway.
So, this is still a function
that takes a parameter foo, right?
Right.
Now, we add parens around the declarator.
Now, it changes, right?
That is the most vexing parse.
So now, all of a sudden,
the whole thing inside is also
interpreted as a declaration.
So bar2 would be a function
that takes a function
that takes a foo and returns a foo,
and that returns void, right?
So those are different.
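The first two declarations can be checked at compile time. This sketch assumes C++17 and deliberately leaves out bar3, since the talk says compilers disagree on it.

```cpp
#include <type_traits>

struct foo {};

void bar1(foo foo);   // one parameter of type foo, which is also named foo
void bar2(foo(foo));  // one unnamed parameter: a function taking foo and returning foo

static_assert(std::is_same_v<decltype(bar1), void(foo)>);
static_assert(std::is_same_v<decltype(bar2), void(foo (*)(foo))>);
```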
Now, just because you can,
what happens if you add
another pair of parens around the name?
Anyone knows the answer?
We're gonna say that
so (mumbles) is not gonna help you
because different compilers
again give different answers.
So, it's actually not really
easy to work this out.
There was a funny episode
at the ACCU conference
in April this year where
I was showing this code sample around.
At some point, there were like
four people staring at this.
Like few of them were committee members
and one of them was Richard Smith
and we were kind of trying to figure out
whether bar3 is the same as
bar2 or the same as bar1.
And it took quite a while for us to reach
like an opinion on what it was.
Like it was really non-obvious
and then the kind of explanation
like the grammar production
that actually tells you like, okay,
which is the correct one.
And then if you really
squint hard at the standard,
then actually, the rule,
like the usual rule
for disambiguating this
can kind of be applied
to figure out what is,
but it's really not obvious.
Yeah, so I'm gonna leave
it to you as an exercise
to figure out what.
I think I tweeted the answer
somewhere at some point.
Anyway.
But, of course, you can say
these are all funny examples
that I'm showing to you
because I'm trying to be funny,
but, like,
no one writes code like this, right?
So why should we worry about this?
If you add extra
parens around weird things
and you add all these,
like, this is contrived, right?
Like no one writes code like this.
So, as long as we just write normal code,
we're all gonna be fine.
Well, yes, but not those of us who write
parsers, compilers, IDEs,
and other kind of tools
because we actually have
to worry about the stuff.
Because almost no one
writes code like this,
but once in a while, there
is going to be a bug report
where someone has these extra parens
and then they're like,
this is not working,
and then, well, we wanna
be standard-compliant
so we have to fix all of these, right?
So we have to write tests
for all of these cases,
we have to figure out what they mean.
We have to add all these branches and
extra checks in the parser that say,
it could be this weird thing
and we have to check for that
even though no one's gonna write that.
So development time goes into this,
and sometimes you add
these fixes into code
that actually parses very
common constructs normally,
so it (mumbles) makes
the whole thing slower.
That's not really relevant
in the big picture
because things like template
instantiation or the linker
will always consume more
resources than the parser,
but still, we might introduce more bugs.
And then overall, it has an impact on you
because it has an impact
on the tools that we write.
So, all these like weird
grammar properties,
they do have a negative impact also
so we have to handle this correctly.
We have to parse all code that is correct,
we have to parse it correctly.
So, how do we do that?
Well, okay.
So we implement the grammar
that's in the standard
and we implement the disambiguation rules
that are in the standard, right?
Why is this so hard?
Why do we still have
all these bugs anyway?
Why is it still so hard to do this?
So first of all, as we already said,
in order to parse things correctly,
you have to track all identifiers, right?
So for any identifier that appears anywhere,
you have to know: is it a variable?
Is it a type?
Is it a parameter pack?
That's another thing where
if you have template
typename dot dot dot Args
and then this identifier
appears somewhere else,
then depending on whether
that's a parameter pack,
this could mean either a pack
expansion or an ellipsis,
so this also matters,
or it could be combinations of the above.
Like bar could be like an integer,
but at the same time, it could be a type,
and it could be an undeclared identifier.
So, all of these things matter
and we have to figure that out.
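A sketch of the parameter pack case, assuming C++17. The struct names are hypothetical. The declaration of f is spelled identically in both, and only the kind of Args differs.

```cpp
#include <type_traits>

template <typename... Args>
struct args_is_a_pack {
    static void f(Args...);  // pack expansion: one parameter per element of Args
};

template <typename Args>
struct args_is_a_type {
    static void f(Args...);  // one parameter of type Args, then a C-style ellipsis
};

static_assert(std::is_same_v<decltype(args_is_a_pack<int, char>::f), void(int, char)>);
static_assert(std::is_same_v<decltype(args_is_a_type<int>::f), void(int, ...)>);
```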
Now, this still doesn't sound really hard,
but why is it hard?
So, okay, you have an identifier in C++
and you need to figure
out what that thing is,
how do you do that, right?
So you're in some kind of scope.
So what you have to do to find
the meaning of identifier,
you have to do name lookup, right?
And name lookup in C++ is
really, really hard, right?
So, what we have to do
like to even figure out
what identifier means
whether that's variable or a type, right?
That's kind of the typical case.
So first of all, we have namespaces, right?
So we also have using namespace,
which brings all the identifiers
from another namespace
into the current scope,
we have inline namespaces,
we have namespace aliases.
We have inheritance which
also takes all the identifiers
from the base class and brings
them in the current scope
and obviously we also
have multiple inheritance
and then you have more ambiguities
that you have to deal with.
You have argument-dependent lookup,
which has really complicated
and not obvious rules.
So these are kind of the three big ones,
but then there's also all
these extra little cases
and features and rules.
Like for example, out-of-line
definitions, right?
So if you have a class
and you can define like
a class inside there
and then you have a constructor
and you can declare
the constructor inside,
you can define it outside.
And then there's this weird
rule that if you have the,
if you have this out-of-line
definition of a constructor
and you have an identifier in here,
then you're not gonna look
it up in the scope you're in.
Because it's an out-of-line definition,
you have to look it up in
the scope of that class,
which is somewhere else, right?
So, you have all these extra rules
that you have to take care of.
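The same lookup rule can be shown with a member function instead of a constructor. This is a sketch with made-up names (Widget, half). The point is that after the qualified name Widget::half, the identifier T is looked up in the class first, not in the scope where the definition is written.

```cpp
using T = int;  // a T at namespace scope

struct Widget {
    using T = double;  // a different T, nested in the class
    static constexpr T half(T x);
};

// The return type comes before the name Widget::half, so it has to be
// qualified. The parameter type comes after it, so plain T means Widget::T.
constexpr Widget::T Widget::half(T x) { return x / 2; }

static_assert(Widget::half(1) == 0.5);
```

If the T in the parameter list were looked up at namespace scope, it would mean int, and the definition would not match the declaration in the class.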
Okay, so that's still not hard, right?
Because the rules are there.
So, why is it so hard?
All right, so,
okay, actually, sorry, it is hard because
all these features basically can mean that
almost any identifier from any scope
can be in any other scope basically
through one of these rules.
But, okay, there are rules, right?
So we can take all these
really complicated rules
and implement them, and
then it's not really hard
that was kind of my original point.
I mean, it's really, really
hard, but it's doable.
Okay, it's hard, but it's doable.
But why is it really hard?
And the reason is, well,
how do you even find the scope, right?
So, for example, you have again
this thing where this might be odd,
but this might be either an initializer
if baz is a variable, or if baz is a type,
then the whole thing is
a function declaration.
Okay, so you need to find baz
which is in some kind of scope.
Okay, fine, there's all
these complicated rules.
If you look it up, what's baz?
But sometimes, in order
to even find scope,
which you have to look into
to determine whether baz
is a type or a variable or something else
depends on other code, like actual code.
Like in this example, for
example, the scope is some type
which is like decltype of
some expression, right?
So it could be some class
somewhere, whatever,
but in order to even find
that scope to look up baz,
we need to evaluate foo and bar,
which are some constexpr functions
with code in them, right?
And then the return
type, for example, of foo
could have an auto return type,
which depending on which
branch you're taking
at compile time, it could
return one type or another type.
And which actually returning
depends on some kind of
if constexpr condition which
depends on yet other code
like these other constexpr functions,
which we then also have to evaluate.
So, in order to even parse
C++, you need to evaluate
almost arbitrary code
at compile time, right?
So the subset of C++
which is allowed in constexpr
is already quite big.
And in C++20, as we heard
from Louis a couple of hours ago,
it is going to be much bigger.
So every C++ parser basically has to be
a full-blown C++ interpreter
at the same time, right?
So this is a pretty unusual property
for a programming language.
That in order to just parse it
and like build an abstract syntax tree
which is like the first
thing that you have to do
before you do anything, you
have to execute the code, right?
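The slide code is not in the transcript. A reduced sketch of the same effect, assuming C++17 and using the hypothetical names HasType, HasVariable, is_even and pick, is below. Whether x and y are functions or variables depends on which branch of the if constexpr is taken, which in turn depends on evaluating a constexpr function.

```cpp
#include <type_traits>

struct HasType     { using baz = int; };               // here baz names a type
struct HasVariable { static constexpr int baz = 0; };  // here baz names a variable

constexpr bool is_even(int n) { return n % 2 == 0; }

template <int N>
auto pick() {
    if constexpr (is_even(N)) return HasType{};
    else return HasVariable{};
}

int x(decltype(pick<2>())::baz);  // baz is a type: x is a function int(int)
int y(decltype(pick<3>())::baz);  // baz is a variable: y is an int initialized with 0

static_assert(std::is_same_v<decltype(x), int(int)>);
static_assert(std::is_same_v<decltype(y), int>);
```

The two declarations differ only in the template argument, so a parser cannot choose between them without running is_even.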
Yeah, so this is what
makes C++ tooling so hard.
This is the world we
live in as tool builders.
Of course that's not all.
There's another wonderful C++ feature
that I haven't talked about
yet, which is templates.
So I'm gonna pass over to Dmitry
and he's gonna talk about that.
- Yeah, it doesn't add much to the, like,
conceptual overall picture but
it has some funny properties.
Like if you open a Compiler Explorer window
and you move your cursor around
you might notice that the
curly braces are highlighted.
The matching curly braces are highlighted,
and the matching angle brackets
are not highlighted.
Well, that is because there
is another conflict obviously,
because it's really hard
to distinguish operator
less-than from the angle brackets
of templates unless you, again,
do all of what Timur was just
talking about for so long.
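A small sketch of that conflict, assuming C++17. The expression a < b > (c) is spelled identically in both namespaces, and its meaning depends on whether a names a template.

```cpp
#include <type_traits>

namespace a_is_a_template {
    template <int N>
    constexpr int a(int x) { return N + x; }
    constexpr int b = 1;
    constexpr int c = 2;
    constexpr auto result = a < b > (c);  // template argument list: a<1>(2)
    static_assert(result == 3);
}

namespace a_is_a_variable {
    constexpr int a = 5;
    constexpr int b = 1;
    constexpr int c = 2;
    constexpr auto result = a < b > (c);  // two comparisons: (a < b) > c
    static_assert(std::is_same_v<decltype(result), const bool>);
    static_assert(result == false);
}
```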
And since C++11, even properly lexing
the source code requires all
this semantic information
in the worst case,
that's how hard it is to do that.
Actually template code has
some surprising properties.
You still need to parse the code
inside function templates, for
example, and class templates.
So you still need to build a syntax tree
for them; it's not a token soup anymore.
And how to do it?
Because you don't know anything about
the dependent names when you're doing it,
you have to put in these
kind of ugly disambiguators:
typename, when you
know you have a type name,
and template, when you know
you have a template.
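Both disambiguators in one place, as a minimal sketch with the made-up names Box and fetch, assuming C++17:

```cpp
struct Box {
    using value_type = int;
    template <int N>
    constexpr value_type get() const { return N; }
};

template <typename T>
constexpr auto fetch(const T& box) {
    // typename: T::value_type is a dependent name that we promise is a type.
    // template: get is a dependent name that we promise is a template, so
    // the following < opens a template argument list instead of comparing.
    typename T::value_type v = box.template get<7>();
    return v;
}

static_assert(fetch(Box{}) == 7);
```

With the keywords in place, the body of fetch can be parsed into a syntax tree before any T is known.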
So it looks ugly, but it has the nice property
that C++ code that consists
only of dependent names
is the subset of C++ that's easy to parse,
and it's kind of easy for tooling.
And imagine if C++ syntax
wasn't so complicated.
Imagine if it were nice to work with.
In this case you also wouldn't need
to have these disambiguators.
Okay.
So we're done with crazy examples.
And we need to talk a bit more
in depth about why it matters.
C++ compiler implementors
will take care of that,
and we don't need to think about that, right?
Well, it still does matter for the
everyday C++ developer, because
for compilers it's kind
of a minor problem.
There are more difficult problems.
I will show next why it still
matters for the compilers
and for the users, but
it's kind of doable.
But this means that every tool has
either to depend on a compiler
or to be as complicated as
a compiler, which is crazy.
Or it could also just not work.
So it's not exactly
healthy for the ecosystem,
because you can't just hack
something together and be done.
To do proper tooling,
for example, you need to
first compile your code with a compiler
you want to use with this tool.
And it's not always possible, for example,
if you use some compiler extensions,
some lesser-known compilers, and so on.
But for the actual compilers
it also matters, because
a good compiler
also shows good error messages.
And it's more art than science,
because the
standard doesn't say anything
about how to show error messages or
how to recover from errors.
The compiler must guess
the programmer's intent.
And it must point out
why this code is ill-formed
and where the error is, and
it should probably recover
and continue parsing,
because it's important for
the developer experience and
important for tooling.
And because it has an ambiguous syntax,
guessing what was meant by
the programmer is also hard.
For example, take this one.
Here's an example where you accidentally
skip a semicolon after a class definition,
and then you use its
name in a declaration.
So these lines might be in
subsequent header files,
and you might not even
realize they are subsequent.
So let's take two major compilers.
One of them says, like, a semicolon is expected
after the class declaration.
Kind of what we expect the compiler
to say to us.
But the other one says
an initializer is expected before x.
And if you think about it,
it kind of makes sense
what it's doing, because
it sees class C, then it sees C,
so it thinks it's probably a variable
of the type class C something.
It still makes sense.
Then there is x, and x is not allowed there,
so there must be something else,
and then it says the obvious thing.
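A small sketch of why that reading is legal at all, assuming the slide looked roughly like the commented-out lines (the exact slide code is not in the transcript). The member value and the function read_value are hypothetical.

```cpp
// The broken input described above, kept as a comment because it
// does not compile:
//
//   class C {}      // missing ';'
//   C x;
//
// A class definition may legally be followed by declarators, which is
// why a compiler can try to read the next identifier as a variable name.
struct C { int value = 7; } C;   // a variable named C of type struct C

// After the declaration above, plain `C` refers to the variable.
int read_value() { return C.value; }
```

So "class definition, then an identifier" is a valid start of a variable declaration, and only the second identifier makes the input ill-formed.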
- [Timur] I know how to fix that.
You can put an extra paren
around the initializer.
- Yeah, and actually I was thinking,
it's because the declaration syntax
in C++ is weird, and you can't tell
the declaration of a variable
from something else.
And I was thinking, okay,
most other languages
don't have this problem
because it's obvious
when a variable is declared.
So let's do something like that.
Let's use always auto, and,
sure, it would be clear
for the compiler, right?
It's not a variable name anymore.
So I was sure that it would help things.
I put it on the slide, then I put it in
Compiler Explorer and I got this.
It says that, I don't know.
The error messages got worse in a way,
in both cases.
Now in both cases it points
to the incorrect line,
and it doesn't make any sense.
Because this auto is
carrying some legacy.
It's not only, like, an auto variable,
it's also a type specifier,
and nobody uses it like that anymore.
I'm sorry, a storage class specifier.
But it still exists.
And you see that this compiler
doesn't have the special case
coded for this specific
case, as it did in the previous example.
Here is a similar example.
Here are another two major compilers,
and there is a typo in a type name.
So one of them says, okay,
so you have a typo in your type name,
maybe you want to fix it.
But the other one
says something different.
Well, it says undeclared identifier.
What could it be, thinks the compiler.
It could be it's not a
type, it's not anything.
Maybe it's a declaration,
and we need a type specifier,
which defaults to int in C.
Okay, let's go with
this, let's recover like this,
but, hmm, in C++ it's not allowed,
so it diagnoses this thing.
But then this is a variable of type int,
and this is, again, the x, which is not allowed,
so probably a semicolon
should go in between there.
That's what happens when you
have an ambiguous language.
Is it fixable?
Yes, it's fixable.
You can take each one of
these examples and fix it.
That's what compiler authors are doing
with each new version.
But it's complicated and it's, yeah.
And my final example is one
I actually hit several
times when I program in C++
in IDEs that use (mumbles)
for error messages.
So you have this function,
you have this lambda,
and you forget to capture
one of the variables.
And the compiler says, okay, so the variable
can't be implicitly captured,
something like that.
And you go to the capture
list and start to capture it.
And then, like, all hell breaks loose
and it says something about
what was expected and so on,
and what's more annoying,
you don't get your initial
error message anymore.
And if you forgot what that variable was,
you kind of need to go
back and check it again.
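A sketch of the situation, assuming a lambda along these lines (the slide code is not in the transcript, so sum_with_offset, offset and add are hypothetical names). The half-typed states are shown only as comments because they do not compile.

```cpp
int sum_with_offset(int a, int b) {
    int offset = 10;

    // Step 1, forgotten capture (ill-formed): the compiler reports
    // that `offset` cannot be implicitly captured.
    //   auto add = [](int x, int y) { return x + y + offset; };
    //
    // Step 2, while typing the fix (ill-formed): the capture list is
    // incomplete, and the original diagnostic may no longer be shown.
    //   auto add = [offset, ](int x, int y) { return x + y + offset; };

    // Step 3, the finished fix: the variable is captured explicitly.
    auto add = [offset](int x, int y) { return x + y + offset; };
    return add(a, b);
}
```

The point of the example is the intermediate state in step 2: an editor re-parses on every keystroke, so the quality of recovery there matters as much as the final result.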
And that's actually because
square brackets are hard.
If you take all the dialects
which Clang supports,
it's kind of a very, very ambiguous thing:
it can be an attribute,
it can be a Microsoft Visual C++ style attribute
that has only one square bracket,
or it can be one of these Objective-C/C++
message sends.
It can also start with
two square brackets.
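To make the overloading of square brackets visible, here is a small sketch that stays within standard C++ (so no Microsoft attributes and no Objective-C message sends). The function pick and its parameters are hypothetical.

```cpp
// One opening token, several meanings, all in a few lines:
//   `[[`  starts a standard attribute
//   `[3]` is part of an array declarator
//   `[&values]` is a lambda introducer
//   `[i]` is a subscript expression
[[nodiscard]] inline int pick(const int (&values)[3], int index) {
    auto at = [&values](int i) { return values[i]; };
    return at(index);
}
```

A parser that also has to accept the non-standard dialects has even more candidates to choose from when it sees an opening bracket.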
And so it kind of tries to play safe
and it doesn't parse that.
So, okay, that looks ill-formed,
and there is quite a bit here.
I don't care.
I'll probably skip something
in case it's a lambda
and I'll recover at this point.
Though it doesn't work too well.
And it especially doesn't work too well
if you have this
not as compiler output
but as error
highlighting in your editor.
And so what are we gonna do about it?
First, don't panic.
It's still kind of okay.
Like, compiler and tool authors
try hard to shield you
from all these problems.
But if you look at these
examples and if you think,
like, let's say it was me,
okay, I feel your pain.
You probably want to write simpler code
because all of it can
be avoided quite easily
in real world code.
But what's important is that when
new features are invented for the language
and when some people say, okay,
I have this real world problem and I want
to invent this syntax for this problem,
you might want to
consider not only if it's
theoretically implementable in any way,
but you want to also
consider how it affects
the error messages, how
it affects the tooling,
how it affects, like, some number of tools
that don't depend on a compiler.
- Yeah, actually this is something I think about
every time I sit in EWG and
someone proposes a crazy new syntax
for that new cool feature. Then I'm like,
oh, this could be really
cool for our users.
But then I also think, oh, but then what
is it going to break?
It's kind of always these two
things that kind of collide.
- And one final thing:
we never had a chance
to clean up some old rusty syntax from C++
because, again, code reuse
is preprocessor based
and we just can't remove anything,
we can't change the meaning of anything
without breaking the old code.
But maybe when we get modules
and when they are adopted,
we can compile old legacy code separately
from the new code you're writing,
and then maybe you can start
cleaning up and rebooting stuff.
So I think there is hope.
Thank you.
- All right.
So this is all we have.
Please ask us questions.
(crowd applauds)
- [Man] Hi, thank you
for the presentation.
You guys said that it
is time for, like, a cleanup
of the syntax of the language, right?
I have thought for a long time that
maybe what is needed is a second version
of the syntax for the whole language,
something that makes sense,
that is easy for the tools.
And it can be clean;
would you agree with that?
- Well, the important thing is that you
still have to maintain
full backward compatibility,
because it's the same thing as with
the preprocessor-based reuse problem.
And so if you just break headers so they
don't compile anymore,
you can clean that up, but nobody
will use the new version.
So you have to be able to use
old code with the new one.
And I think modules will be a fair,
really good chance to do
that, but probably not before.
All right.
- [Man] So you mentioned
in your talk that this
really nasty, challenging
problem is something that
each tool needs to solve on its own.
I was wondering if there is any attempt
to make, like, a generic
C++ parser that spits out
some sort of defined AST
that everyone could use,
kinda like the Clang AST
or something like that.
- Well, Clang AST is kind of becoming
the industry standard in some areas.
But, yeah, I don't know.
I mean there are things
that go a little bit
into this direction.
For example if you follow
the work about reflection,
it has some of the elements where you have
certain hooks into the
AST which are kind of
not depending on the compiler.
But something like the whole parser,
making the whole parser kind of
implementation-independent
or standardized,
I don't think it could
work because we have
several major implementations which
have been around for a while,
and they work the way they work.
I don't know, like, what
problem would be solved by that?
Like, I'm not sure I get it.
- [Man] I imagine there's something like,
if you're writing an IDE or if
you're writing clang-format,
you want the same thing as your parser,
if you know what I mean.
- Well, I mean, a lot of tools
they build on the Clang AST
for example.
- So, it kind of shows the problem because
you might say that you
just use the Clang AST
and be done with it.
But even for formatting,
like, clang-format
doesn't use the Clang AST.
It has its own parser;
it doesn't look inside headers
and has a bunch of heuristics.
So even that shows that you want to have
different parsers for different tasks.
- Right.
- There is probably no one-size-fits-all
solution for C++.
- So basically either you have to kind of
build on top of some of the
implementations that exist
or you have to build something
completely from scratch
which is probably not gonna work properly
and gonna be very limited
in what it can do.
So, yeah, I think it's a big problem
for the kind of C++ ecosystem.
But I don't have a solution.
If anyone has a solution,
that would be great.
- [Man] So despite all the
trouble that the preprocessor
likes to give us,
the include preprocessor directive is still
just ultimately a copy-paste mechanism.
So whenever there's a part of the code
that we're parsing
whose result
depends on other code,
at least we're guaranteed, according
to valid program rules, that the other code
is somewhere up there, that it was included,
due to declaration or definition rules.
But once modules come in,
that won't be the case anymore.
I'm curious how modules could actually
help improve the situation
as opposed to complicating it
in that respect.
- You wanna comment on that?
- Yeah, I can try.
It's a technical problem because, well,
you need to build a graph of
dependencies of the modules.
It should be doable because
otherwise no tools would work.
As you build the graph of dependencies,
it's an acyclic graph and
you can just walk it
and parse the modules one by one.
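Walking an acyclic dependency graph so that every module is parsed after the modules it imports can be sketched as a depth-first post-order traversal. This is only an illustration under stated assumptions: ModuleGraph, parse_order and the module names are hypothetical, the graph is assumed to have no cycles, and real tools would also need the build flags for each module.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation: module name -> names of modules it imports.
using ModuleGraph = std::map<std::string, std::vector<std::string>>;

namespace detail {
// Depth-first post-order visit. Assumes the graph is acyclic.
inline void visit(const ModuleGraph& graph, const std::string& name,
                  std::set<std::string>& seen,
                  std::vector<std::string>& order) {
    if (!seen.insert(name).second) return;   // already handled
    auto it = graph.find(name);
    if (it != graph.end()) {
        for (const auto& dep : it->second) visit(graph, dep, seen, order);
    }
    order.push_back(name);   // every import of `name` is already in `order`
}
}  // namespace detail

// Returns an order in which each module comes after everything it imports.
inline std::vector<std::string> parse_order(const ModuleGraph& graph) {
    std::set<std::string> seen;
    std::vector<std::string> order;
    for (const auto& entry : graph) {
        detail::visit(graph, entry.first, seen, order);
    }
    return order;
}
```

For a graph where app imports core and util, and util imports core, this yields core, then util, then app, so each module can be parsed once with its imports already available.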
And for the include directive,
that's usually not the case,
because in some cases you can have
an include directive inside a method body.
It depends on the technical
builder on the fire,
but in the class declaration
and stuff like that.
- So I think modules are
solving some problems,
like you're not gonna have
to deal with the complicated
include graph anymore.
But they're also creating other
problems for tooling,
so this is some stuff that's
on our minds these days.
As we saw, in order to parse
something we need to know
what an identifier means.
And then we potentially
get that from the module,
but then it might be just kind of,
use this module, and
then the module might be
in some kind of binary format, whatever,
and then you have to either parse that
or figure out where the
code that this came from is
and how it was built,
because you also need compiler
flags and things like that
to be able to properly evaluate macros.
And you don't have this
information in the kind of
precompiled module.
How is this even supposed to work
for, like, the tooling ecosystem?
The standard doesn't say,
like, the Modules TS doesn't say,
like people in the committee
say, well, we don't care.
This is outside of the standard.
So, yeah, I mean there's
a lot to think about here.
This is definitely something
that's on our mind.
But, yeah.
- Yeah, I also want to add
something a bit about that
because conceptually what
at least several major IDEs,
which aim to be feature
rich but don't use
compilers, are doing is like this,
or at least I think all of them.
They take a header and then try to detect
if it's, like, a well-formed
header that's included
at the top level and
doesn't affect
other ones so much, and see
what preprocessor state is left
after it, and then kind of treat it
like a module anyway
to be able to have quick
indexing and parsing of the project.
So when proper modules arrive, at least
this would not be
required and it would be
a benefit for the tools.
- [Man] Okay, thank you.
- [Man] Just a quick comment.
I have used tools that generate C++
and macros made by other
people and even myself
that are very hard to make and that generate
very, very weird code.
And it's great that a lot
of people make a great effort
on supporting the language as it is
and disambiguating all the stuff.
But just to let you know, that ambiguity
of the extra parentheses,
I have experienced it
in expanding macros because
macros try to protect
against side effects and they
put things within parentheses,
and that checks.
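One hypothetical way a parenthesizing macro can run into declaration syntax, as a sketch only: the macro PROTECT and the names counter and read_counter are made up for illustration and are not from the question.

```cpp
// Hypothetical macro that wraps its argument in parentheses,
// as macros often do to protect against precedence surprises.
#define PROTECT(x) (x)

// Expands to `int (counter) = 5;`, which is still a declaration:
// parentheses around a declarator name are allowed in C++.
int PROTECT(counter) = 5;

int read_counter() { return counter; }
```

Here the expansion happens to stay valid, but the same extra parentheses are what make a line readable either as a declaration or as an expression, depending on what the surrounding names mean.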
So I hope that a tool
like the ones that
JetBrains works on, like CLion,
can help people to deal with
that weirdness of the syntax.
Thank you.
- Okay, thank you.
- Any more questions?
We still have a couple of minutes left.
All right, I see none.
So thanks again for listening and, yeah.
- Thank you.
(crowd applauds)
