Understanding the ECMAScript spec, part 3

In this episode, we’ll go deeper into the definition of the ECMAScript language and its syntax. If you’re not familiar with context-free grammars, now is a good time to check out the basics, since the spec uses context-free grammars to define the language.

ECMAScript grammars

The ECMAScript spec defines four grammars:

The lexical grammar describes how Unicode code points are translated into a sequence of input elements (tokens, line terminators, comments, white space).

The syntactic grammar defines how syntactically correct programs are composed of tokens.

The RegExp grammar describes how Unicode code points are translated into regular expressions.

The numeric string grammar describes how Strings are translated into numeric values.

Each grammar is defined as a context-free grammar, consisting of a set of productions.

The grammars use slightly different notation: the syntactic grammar uses LeftHandSideSymbol : whereas the lexical grammar and the RegExp grammar use LeftHandSideSymbol :: and the numeric string grammar uses LeftHandSideSymbol :::.

Next we’ll look into the lexical grammar and the syntactic grammar in more detail.

Lexical grammar

The spec defines ECMAScript source text as a sequence of Unicode code points. For example, variable names are not limited to ASCII characters but can also include other Unicode characters. The spec doesn’t talk about the actual encoding (for example, UTF-8 or UTF-16). It assumes that the source code has already been converted into a sequence of Unicode code points according to the encoding it was in.
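As a small illustration (the variable names are made up for this example), identifiers may contain non-ASCII code points. The second half is about string values at run time rather than source text, and is only there to show that a code point is not the same thing as a UTF-16 code unit:

```javascript
// Identifiers are not limited to ASCII (names chosen for illustration only).
const π = 3.14159;
const größe = 2;
const umfang = 2 * π * größe;

// A string with two code points: "a" (U+0061) and U+1F600.
const text = 'a\u{1F600}';
const codePoints = Array.from(text, (ch) => ch.codePointAt(0));

console.log(umfang);
console.log(text.length);       // counts UTF-16 code units: 3
console.log(codePoints.length); // counts code points: 2
```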

It’s not possible to tokenize ECMAScript source code in advance, which makes defining the lexical grammar slightly more complicated.

For example, we cannot determine whether / is the division operator or the start of a RegExp without looking at the larger context it occurs in:

const x = 10 / 5;

Here / is a DivPunctuator.

const r = /foo/;

Here the first / is the start of a RegularExpressionLiteral.
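A runnable illustration of the same point, with values picked just for this example: the characters `/ 2 /` appear in both statements, but the surrounding code decides how the slashes are tokenized.

```javascript
const a = 8;

// After the identifier `a`, an operator is expected, so both slashes
// are DivPunctuators: (8 / 2) / 1.
const quotient = a / 2 / 1;

// At the start of an expression, a slash begins a RegularExpressionLiteral.
// This regexp matches the three characters " 2 ".
const found = / 2 /.test('a 2 b');

console.log(quotient); // 4
console.log(found);    // true
```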

Templates introduce a similar ambiguity: the interpretation of }` depends on the context it occurs in:

const what1 = 'temp';
const what2 = 'late';
const t = `I am a ${ what1 + what2 }`;

Here `I am a ${ is TemplateHead and }` is a TemplateTail.

if (0 == 1) {

}`not very useful`;

Here } is a RightBracePunctuator and ` is the start of a NoSubstitutionTemplate.

Even though the interpretation of / and }` depends on their “context” (their position in the syntactic structure of the code), the grammars we’ll describe next are still context-free.

The lexical grammar uses several goal symbols to distinguish between the contexts where some input elements are permitted and some are not. For example, the goal symbol InputElementDiv is used in contexts where / is a division and /= is a division-assignment. The InputElementDiv productions list the possible tokens which can be produced in this context:

InputElementDiv :: 
WhiteSpace 
LineTerminator
Comment 
CommonToken
DivPunctuator
RightBracePunctuator

In this context, encountering / produces the DivPunctuator input element. Producing a RegularExpressionLiteral is not an option here.

On the other hand, InputElementRegExp is the goal symbol for the contexts where / is the beginning of a RegExp:

InputElementRegExp :: 
WhiteSpace
LineTerminator
Comment  
CommonToken
RightBracePunctuator
RegularExpressionLiteral

As we see from the productions, it’s possible that this produces the RegularExpressionLiteral input element, but producing DivPunctuator is not possible.

Similarly, there is another goal symbol, InputElementRegExpOrTemplateTail, for contexts where TemplateMiddle and TemplateTail are permitted, in addition to RegularExpressionLiteral. And finally, InputElementTemplateTail is the goal symbol for contexts where only TemplateMiddle and TemplateTail are permitted but RegularExpressionLiteral is not permitted.

In implementations, the syntactic grammar analyzer (“parser”) may call the lexical grammar analyzer (“tokenizer” or “lexer”), passing the goal symbol as a parameter and asking for the next input element suitable for that goal symbol.
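Here is a minimal sketch of that interaction. The function name nextInputElement, the shape of the returned object, and the heavily simplified regexp scanning are all assumptions made for this example; real engines are organized differently, and only input elements starting with / are handled.

```javascript
// Toy lexer sketch: hypothetical API, not a spec algorithm.
function nextInputElement(source, pos, goal) {
  if (source[pos] !== '/') {
    throw new Error('this sketch only handles input elements starting with "/"');
  }
  if (goal === 'InputElementDiv') {
    // In this context "/=" is division-assignment and "/" is division.
    const text = source[pos + 1] === '=' ? '/=' : '/';
    return { type: 'DivPunctuator', text, end: pos + text.length };
  }
  if (goal === 'InputElementRegExp') {
    // Simplified scan: body up to the next unescaped "/", then flags.
    // Character classes such as [/] are deliberately not handled.
    const match = /^\/(?:\\.|[^\\\/\n])+\/[a-z]*/.exec(source.slice(pos));
    if (match === null) {
      throw new SyntaxError('unterminated regular expression literal');
    }
    return {
      type: 'RegularExpressionLiteral',
      text: match[0],
      end: pos + match[0].length,
    };
  }
  throw new Error(`unsupported goal symbol: ${goal}`);
}

// The same source text yields different input elements for different goals.
const source = '/foo/g';
const asDivision = nextInputElement(source, 0, 'InputElementDiv');
const asRegExp = nextInputElement(source, 0, 'InputElementRegExp');
console.log(asDivision.type, asDivision.text); // DivPunctuator /
console.log(asRegExp.type, asRegExp.text);     // RegularExpressionLiteral /foo/g
```

The parser is the one that knows whether an operator or an operand can come next, which is why the goal symbol is passed in rather than guessed by the lexer.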

Syntactic grammar

We looked into the lexical grammar, which defines how we construct tokens from Unicode code points. The syntactic grammar builds on it: it defines how syntactically correct programs are composed of tokens.

Example: Allowing legacy identifiers

Introducing a new keyword to the grammar is a possibly breaking change — what if existing code already uses the keyword as an identifier?

For example, before await was a keyword, someone might have written the following code:

function old() {
var await;
}

The ECMAScript grammar carefully added the await keyword in such a way that this code continues to work. Inside async functions, await is a keyword, so this doesn’t work:

async function modern() {
var await; // Syntax error
}

Allowing yield as an identifier in non-generators and disallowing it in generators works similarly.
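These four cases can be checked without running any of the functions. The helper parses below is a hypothetical name for this sketch; it relies on the Function constructor parsing its argument as a sloppy-mode function body and throwing a SyntaxError if that fails.

```javascript
// Returns true if `source` parses as a sloppy-mode function body.
// The constructed function is never called.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const results = {
  awaitInFunction: parses('function old() { var await; }'),
  awaitInAsyncFunction: parses('async function modern() { var await; }'),
  yieldInFunction: parses('function old() { var yield; }'),
  yieldInGenerator: parses('function* modern() { var yield; }'),
};
console.log(results);
```

Note that this only holds for sloppy-mode script code, which is what the examples in this post assume.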

Understanding how await is allowed as an identifier requires understanding ECMAScript-specific syntactic grammar notation. Let’s dive right in!

Productions and shorthands

Let’s look at how the productions for VariableStatement are defined. At first glance, the grammar can look a bit scary:

VariableStatement[Yield, Await] :  
var VariableDeclarationList[+In, ?Yield, ?Await] ;

What do the subscripts ([Yield, Await]) and prefixes (+ in +In and ? in ?Await) mean?

The notation is explained in the section Grammar Notation.

The subscripts are a shorthand for expressing a set of productions, for a set of left-hand side symbols, all at once. The left-hand side symbol has two parameters, which expands into four "real" left-hand side symbols: VariableStatement, VariableStatement_Yield, VariableStatement_Await, and VariableStatement_Yield_Await.

Note that here the plain VariableStatement means “VariableStatement without _Await and _Yield”. It should not be confused with VariableStatement[Yield, Await].

On the right-hand side of the production, we see the shorthand +In, meaning "use the version with _In", and ?Await, meaning “use the version with _Await if and only if the left-hand side symbol has _Await” (similarly with ?Yield).

The third shorthand, ~Foo, meaning “use the version without _Foo”, is not used in this production.

With this information, we can expand the productions like this:

VariableStatement :  
var VariableDeclarationList_In ;

VariableStatement_Yield : 
var VariableDeclarationList_In_Yield ;

VariableStatement_Await : 
var VariableDeclarationList_In_Await ;

VariableStatement_Yield_Await : 
var VariableDeclarationList_In_Yield_Await ;
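The expansion is mechanical enough to sketch in code. The functions below (subsets, expandSymbol, expandProduction) and the object shape used for parameterized symbols are invented for this illustration; the spec describes the shorthand in prose and does not define such an algorithm.

```javascript
// All subsets of the parameter list, in the order used in the text above.
function subsets(params) {
  let result = [[]];
  for (const param of params) {
    result = result.concat(result.map((subset) => [...subset, param]));
  }
  return result;
}

// Resolves "+P", "~P" and "?P" against the active left-hand side parameters.
function expandSymbol(name, args, active) {
  const suffixes = [];
  for (const arg of args) {
    const prefix = arg[0];
    const param = arg.slice(1);
    if (prefix === '+' || (prefix === '?' && active.includes(param))) {
      suffixes.push(param);
    }
    // "~P" never adds a suffix.
  }
  return [name, ...suffixes].join('_');
}

// Produces one plain production per combination of left-hand side parameters.
function expandProduction(lhsName, lhsParams, rhs) {
  return subsets(lhsParams).map((active) => ({
    lhs: [lhsName, ...active].join('_'),
    rhs: rhs.map((symbol) =>
      typeof symbol === 'string'
        ? symbol
        : expandSymbol(symbol.name, symbol.args, active)),
  }));
}

const expanded = expandProduction('VariableStatement', ['Yield', 'Await'], [
  'var',
  { name: 'VariableDeclarationList', args: ['+In', '?Yield', '?Await'] },
  ';',
]);
for (const { lhs, rhs } of expanded) {
  console.log(`${lhs} : ${rhs.join(' ')}`);
}
```

Running it prints the same four productions as listed above, one per line.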

Ultimately, we need to find out two things:

  1. Where is it decided whether we’re in the case with _Await or without _Await?
  2. Where does it make a difference, that is, where do the productions for Something_Await and Something (without _Await) diverge?

_Await or no _Await

Let’s tackle question 1 first. It’s somewhat easy to guess that non-async functions and async functions differ in whether we pick the parameter _Await for the function body or not. Reading the productions for async function declarations, we find this:

AsyncFunctionBody : 
FunctionBody[~Yield, +Await]

Note that AsyncFunctionBody has no parameters. They get added to the FunctionBody on the right-hand side.

If we expand this production, we get:

AsyncFunctionBody : 
FunctionBody_Await

In other words, async functions have FunctionBody_Await, meaning a function body where await is treated as a keyword.

On the other hand, if we’re inside a non-async function, the relevant production is:

FunctionDeclaration[Yield, Await, Default] : 
function BindingIdentifier[?Yield, ?Await] ( FormalParameters[~Yield, ~Await] ) { FunctionBody[~Yield, ~Await] }

(FunctionDeclaration has another production, but it’s not relevant for our code example.)

To avoid combinatorial expansion, let’s ignore the Default parameter, which is not used in this particular production.

The expanded form of the production is:

FunctionDeclaration : 
function BindingIdentifier ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Yield : 
function BindingIdentifier_Yield ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Await :
function BindingIdentifier_Await ( FormalParameters ) { FunctionBody }

FunctionDeclaration_Yield_Await : 
function BindingIdentifier_Yield_Await ( FormalParameters ) { FunctionBody }

In this production we always get FunctionBody and FormalParameters (without _Yield and without _Await), since they are parameterized with [~Yield, ~Await] in the non-expanded production.

The function name is treated differently: it gets the parameters _Await and _Yield if the left-hand side symbol has them.
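One observable consequence shows up with a plain function declared inside an async function: its name is parsed with the enclosing _Await parameter, while its own body starts over without it. The sketch below checks this with a hypothetical helper named parses, built on the Function constructor, which parses its argument as a sloppy-mode function body.

```javascript
// Returns true if `source` parses; the constructed function is never called.
const parses = (source) => {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
};

// The nested non-async body is a FunctionBody without _Await.
const awaitInNestedBody = parses(
  'async function outer() { function inner() { var await; } }');

// The nested function's name is a BindingIdentifier_Await.
const awaitAsNestedName = parses(
  'async function outer() { function await() {} }');

console.log(awaitInNestedBody); // true
console.log(awaitAsNestedName); // false
```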

To summarize: Async functions have a FunctionBody_Await and non-async functions have a FunctionBody (without _Await). Since we’re talking about non-generator functions, both our async example function and our non-async example function are parameterized without _Yield.

Maybe it’s hard to remember which one is FunctionBody and which FunctionBody_Await. Is FunctionBody_Await for a function where await is an identifier, or for a function where await is a keyword?

You can think of the _Await parameter as meaning "await is a keyword". This approach is also future proof. Imagine a new keyword, blob, being added, but only inside "blobby" functions. Non-blobby non-async non-generators would still have FunctionBody (without _Await, _Yield or _Blob), exactly like they have now. Blobby functions would have a FunctionBody_Blob, async blobby functions would have FunctionBody_Await_Blob and so on. We’d still need to add the Blob subscript to the productions, but the expanded forms of FunctionBody for already existing functions stay the same.

Disallowing await as an identifier

Next, we need to find out how await is disallowed as an identifier if we're inside a FunctionBody_Await.

We can follow the productions further to see that the _Await parameter gets carried unchanged from FunctionBody all the way to the VariableStatement production we were previously looking at.

Thus, inside an async function, we’ll have a VariableStatement_Await and inside a non-async function, we’ll have a VariableStatement.

We can follow the productions further and keep track of the parameters. We already saw the productions for VariableStatement:

VariableStatement[Yield, Await] :
var VariableDeclarationList[+In, ?Yield, ?Await] ;

All productions for VariableDeclarationList just carry the parameters on as is:

VariableDeclarationList[In, Yield, Await] :
VariableDeclaration[?In, ?Yield, ?Await]

(Here we show only the production relevant to our example.)

VariableDeclaration[In, Yield, Await] :  
BindingIdentifier[?Yield, ?Await] Initializer[?In, ?Yield, ?Await] opt

The opt shorthand means that the right-hand side symbol is optional; there are in fact two productions, one with the optional symbol, and one without.
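As a rough sketch of that expansion (the function name expandOptional and the { name, optional } shape are made up for this example), each optional symbol doubles the number of plain alternatives:

```javascript
// Expands a right-hand side with optional symbols into plain alternatives.
function expandOptional(rhs) {
  let alternatives = [[]];
  for (const symbol of rhs) {
    const withSymbol = alternatives.map((alt) => [...alt, symbol.name]);
    // An optional symbol yields two variants: one with it, one without.
    alternatives = symbol.optional
      ? [...withSymbol, ...alternatives]
      : withSymbol;
  }
  return alternatives;
}

const variableDeclaration = expandOptional([
  { name: 'BindingIdentifier', optional: false },
  { name: 'Initializer', optional: true },
]);
console.log(variableDeclaration);
// [ [ 'BindingIdentifier', 'Initializer' ], [ 'BindingIdentifier' ] ]
```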

In the simple case relevant to our example, VariableStatement consists of the keyword var, followed by a single BindingIdentifier without an initializer, and ending with a semicolon.

To disallow or allow await as a BindingIdentifier, we hope to end up with something like this:

BindingIdentifier_Await :
Identifier
yield

BindingIdentifier :
Identifier
yield
await

This would disallow await as an identifier inside async functions and allow it as an identifier inside non-async functions.

But the spec doesn’t define it like this. Instead, we find this production:

BindingIdentifier[Yield, Await] : 
Identifier
yield
await

Expanded, this means the following productions:

BindingIdentifier_Await :
Identifier
yield
await

BindingIdentifier :
Identifier
yield
await

(We’re omitting the productions for BindingIdentifier_Yield and BindingIdentifier_Yield_Await, which are not needed in our example.)

This looks like await and yield would always be allowed as identifiers. What’s up with that? Is the whole blog post for nothing?

Static semantics to the rescue

It turns out that static semantics are needed for forbidding await as an identifier inside async functions.

Static semantics describe static rules — that is, rules that are checked before the program runs.

In this case, the static semantics for BindingIdentifier define the following syntax-directed rule:

BindingIdentifier[Yield, Await] : await

It is a Syntax Error if this production has an [Await] parameter.

Effectively, this forbids the BindingIdentifier_Await : await production.
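Because this is a static rule, the error is reported when the source text is processed, before any of it runs. A minimal sketch of how to observe that, with the names record, calls and caught chosen just for this example:

```javascript
let calls = 0;
const record = () => { calls += 1; };

let caught = null;
try {
  // The SyntaxError is raised while the source text is being parsed,
  // so the function is never created and `record()` is never reached.
  const fn = new Function(
    'record',
    'record(); async function modern() { var await; }');
  fn(record);
} catch (error) {
  caught = error;
}

console.log(caught instanceof SyntaxError); // true
console.log(calls);                         // 0
```

The call to record() sits before the offending declaration, yet it never executes, since the whole body is rejected up front.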

The spec explains that this production exists, yet is defined as a Syntax Error by the static semantics, because of interference with automatic semicolon insertion (ASI).

Remember that ASI kicks in when we’re unable to parse a line of code according to the grammar productions. ASI tries to add semicolons to satisfy the requirement that statements and declarations must end with a semicolon. (We’ll describe ASI in more detail in a later episode.)

Consider the following code (example from the spec):

async function too_few_semicolons() {  
let 
await 0;
}

If the grammar disallowed await as an identifier, ASI would kick in and transform the code into the following grammatically correct code, which also uses let as an identifier:

async function too_few_semicolons() {
let; 
await 0;
}

This kind of interference with ASI was deemed too confusing, so static semantics were used for disallowing await as an identifier.
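The outcome can be checked directly: the version without the semicolon is rejected, while the version with an explicit semicolon after let is accepted in sloppy mode. The helper parses is a hypothetical name for this sketch; it uses the Function constructor, which parses its argument as a sloppy-mode function body without running it.

```javascript
// Returns true if `source` parses; the constructed function is never called.
const parses = (source) => {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
};

const withoutSemicolon = parses(`
  async function too_few_semicolons() {
    let
    await 0;
  }`);

const withSemicolon = parses(`
  async function too_few_semicolons() {
    let;
    await 0;
  }`);

console.log(withoutSemicolon); // false: no semicolon is inserted after let
console.log(withSemicolon);    // true: let is an ordinary identifier here
```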

Disallowed StringValues of identifiers

There’s also another related rule:

BindingIdentifier : Identifier

It is a Syntax Error if this production has an [Await] parameter and StringValue of Identifier is "await".

This might be confusing at first. Identifier is defined like this:

Identifier :  
IdentifierName but not ReservedWord

await is a ReservedWord, so how can an Identifier ever be await?

As it turns out, Identifier cannot be await, but it can be something else whose StringValue is "await": a different representation of the character sequence await.

Static semantics for identifier names define how the StringValue of an identifier name is computed. For example, the Unicode escape sequence for a is \u0061, so \u0061wait has the StringValue "await". \u0061wait won’t be recognized as a keyword by the lexical grammar; instead it will be an Identifier. The static semantics forbid using it as a variable name inside async functions.

So this works:

function old() { 
var \u0061wait;
}

And this doesn’t:

async function modern() {
var \u0061wait; // Syntax error
}
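Both behaviors can be observed from running code. In this sketch (variable names are illustrative), String.raw keeps the backslash intact so that the Function constructor receives the escaped spelling as source text:

```javascript
// In a non-async function, \u0061wait and await name the same variable,
// because both have the StringValue "await".
const sameVariable = new Function(String.raw`
  var \u0061wait = 42;
  return await;
`)();

// Inside an async function, the escaped spelling is rejected as well.
let asyncError = null;
try {
  new Function(String.raw`async function modern() { var \u0061wait; }`);
} catch (error) {
  asyncError = error;
}

console.log(sameVariable);                      // 42
console.log(asyncError instanceof SyntaxError); // true
```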

Summary

In this episode, we familiarized ourselves with the lexical grammar, the syntactic grammar, and the shorthands used for defining the syntactic grammar. As an example, we looked into forbidding await as an identifier inside async functions while allowing it inside non-async functions.

Other interesting parts of the syntactic grammar, such as automatic semicolon insertion and cover grammars, will be covered in a later episode. Stay tuned!