WTSC 6: Parser

Parsing: The parser is a simple, recursive-descent parser (implemented in lib/Parse) with an integrated, hand-coded lexer. The parser is responsible for generating an Abstract Syntax Tree (AST) without any semantic or type information, and emit warnings or errors for grammatical problems with the input source.  -- Swift Compiler Architecture

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes. --Wikipedia

From these two quotations, we know that there is a grammar for the Swift programming language and a recursive-descent parser that implements it. Moreover, the Swift grammar is written as a CFG (context-free grammar), whose rules have the form Nonterminal -> sequence of nonterminals and terminals. Here, the nonterminals are Swift language constructs (each one stands for the set of token strings that can form that construct), and the terminals are the tokens recognized by the lexer. For most nonterminals there is one procedure that recognizes it. The simplest skeleton of the parser is therefore a group of such procedures that call one another to turn source code into an AST.
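To make the idea of "one procedure per nonterminal" concrete, here is a minimal sketch for a toy grammar. The grammar and all names (ToyParser, parseTerm, and so on) are made up for illustration and are not taken from the Swift compiler; the tree is returned as a plain S-expression string to keep the example short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Toy grammar (hypothetical, NOT Swift's grammar):
//   expr -> term ('+' term)*
//   term -> NUMBER | '(' expr ')'
class ToyParser {
public:
  explicit ToyParser(std::string source) : src(std::move(source)) {}

  // Parses the whole input and returns the tree as an S-expression.
  std::string parse() {
    std::string tree = parseExpr();
    if (peek() != '\0')
      throw std::runtime_error("unexpected trailing input");
    return tree;
  }

private:
  std::string src;
  std::size_t pos = 0;

  static bool isDigit(char c) {
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
  }

  // Skips spaces and returns the next character without consuming it,
  // or '\0' at the end of the input.
  char peek() {
    while (pos < src.size() && src[pos] == ' ')
      ++pos;
    return pos < src.size() ? src[pos] : '\0';
  }

  // Consumes one character that the caller has already checked.
  void consume(char expected) {
    char c = peek();
    assert(c == expected && "caller must check the character first");
    (void)c;
    ++pos;
  }

  // Procedure for the nonterminal `expr`.
  std::string parseExpr() {
    std::string lhs = parseTerm();
    while (peek() == '+') {
      consume('+');
      std::string rhs = parseTerm();
      lhs = "(+ " + lhs + " " + rhs + ")";
    }
    return lhs;
  }

  // Procedure for the nonterminal `term`.
  std::string parseTerm() {
    char c = peek();
    if (c == '(') {
      consume('(');
      std::string inner = parseExpr(); // mutual recursion with parseExpr
      if (peek() != ')')
        throw std::runtime_error("expected ')'");
      consume(')');
      return inner;
    }
    if (isDigit(c)) {
      std::string digits;
      while (pos < src.size() && isDigit(src[pos]))
        digits += src[pos++];
      return digits;
    }
    throw std::runtime_error("expected a number or '('");
  }
};
```

With this sketch, ToyParser("1 + (2 + 3)").parse() returns "(+ 1 (+ 2 3))". The shape of the code follows the shape of the grammar: the loop in parseExpr is the `*` in the rule, and the two branches in parseTerm are the two alternatives of `term`.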

Now, let us dive into the details of the Swift parser.

In the previous post, WTSC 5: Lexer, we mentioned the swift::Parser::parseTopLevel method, which is the starting point of parsing. In a Swift source file, the top-level constructs are statements (only in allow-top-level-code mode), such as print("hello"), and declarations (in all modes), such as imports, classes, structs, global functions, and global variables. Incidentally, the entry point of a Swift program is the first statement of the main file, main.swift.

/* /wtsc/swift/include/swift/AST/SourceFile.h */
...
/// True if this is a "script mode" source file that admits top-level code.
  bool isScriptMode() const {
    switch (Kind) {
    case SourceFileKind::Main:
      return true;
    case SourceFileKind::Library:
    case SourceFileKind::Interface:
    case SourceFileKind::SIL:
      return false;
    }
    llvm_unreachable("bad SourceFileKind");
  }
...

Now, let us look inside the parseTopLevel method, which is the main entry point of the parser.

/* /wtsc/swift/lib/Parse/ParseDecl.cpp */
...
/// Main entrypoint for the parser.
...
///   top-level:
///     stmt-brace-item*
...
void Parser::parseTopLevel(SmallVectorImpl<Decl *> &decls) {
  // Prime the lexer.
  if (Tok.is(tok::NUM_TOKENS))
    consumeTokenWithoutFeedingReceiver();
  // Parse the body of the file.
  SmallVector<ASTNode, 128> items;
  while (!Tok.is(tok::eof)) {
...
    parseBraceItems(items, allowTopLevelCode()
                               ? BraceItemListKind::TopLevelCode
                               : BraceItemListKind::TopLevelLibrary);
...
  }
...
}
...

As the comment in the snippet above shows, top-level ::= stmt-brace-item*, which means that the top level consists of zero or more stmt-brace-item.

The consumeTokenWithoutFeedingReceiver call kick-starts the lexer so that the parser holds its first token. Then, as long as the current token is not eof (end of file), the while loop calls parseBraceItems to parse constructs by consuming tokens from the lexer.
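The "prime, then loop until eof" pattern can be sketched in isolation. The following is a simplified illustration, not the Swift implementation: MiniLexer, MiniParser, and TokKind::NotPrimed are hypothetical names, with NotPrimed playing the role that tok::NUM_TOKENS plays in the snippet above, and an "item" is just one whitespace-separated word.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// NotPrimed is a sentinel meaning "no token has been read yet".
enum class TokKind { NotPrimed, Word, Eof };

struct Token {
  TokKind kind;
  std::string text;
};

// Hypothetical lexer: splits the input into space-separated words.
class MiniLexer {
public:
  explicit MiniLexer(std::string source) : src(std::move(source)) {}

  Token next() {
    while (pos < src.size() && src[pos] == ' ')
      ++pos;
    if (pos == src.size())
      return Token{TokKind::Eof, ""};
    std::size_t start = pos;
    while (pos < src.size() && src[pos] != ' ')
      ++pos;
    return Token{TokKind::Word, src.substr(start, pos - start)};
  }

private:
  std::string src;
  std::size_t pos = 0;
};

class MiniParser {
public:
  explicit MiniParser(std::string source) : lexer(std::move(source)) {}

  std::vector<std::string> parseTopLevel() {
    // Prime the lexer: fetch the first token if none has been read yet.
    if (tok.kind == TokKind::NotPrimed)
      consumeToken();

    // Parse the body of the file, one item per iteration.
    std::vector<std::string> items;
    while (tok.kind != TokKind::Eof)
      items.push_back(parseItem());
    return items;
  }

private:
  MiniLexer lexer;
  Token tok{TokKind::NotPrimed, ""};

  void consumeToken() { tok = lexer.next(); }

  // Stand-in for parseBraceItems: consumes exactly one word.
  std::string parseItem() {
    assert(tok.kind == TokKind::Word);
    std::string text = tok.text;
    consumeToken();
    return text;
  }
};
```

For example, MiniParser("import Foo print").parseTopLevel() yields the three items "import", "Foo", and "print", while an empty input yields no items because the very first token is already eof.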

Let us look at the parseBraceItems method next.

/* /wtsc/swift/lib/Parse/ParseStmt.cpp */
///   brace-item:
///     decl
///     expr
///     stmt
///   stmt:
///     ';'
///     stmt-assign
///     stmt-if
///     stmt-guard
///     stmt-for-c-style
///     stmt-for-each
///     stmt-switch
///     stmt-control-transfer
///  stmt-control-transfer:
///     stmt-return
///     stmt-break
///     stmt-continue
///     stmt-fallthrough
///   stmt-assign:
///     expr '=' expr
ParserStatus Parser::parseBraceItems(SmallVectorImpl<ASTNode> &Entries,
                                     BraceItemListKind Kind,
                                     BraceItemListKind ConditionalBlockKind,
                                     bool &IsFollowingGuard) {
...
  while ((IsTopLevel || Tok.isNot(tok::r_brace)) &&
         Tok.isNot(tok::pound_endif) &&
         Tok.isNot(tok::pound_elseif) &&
         Tok.isNot(tok::pound_else) &&
         Tok.isNot(tok::eof) &&
         !isStartOfSILDecl() &&
         (isConditionalBlock ||
          !isTerminatorForBraceItemListKind(Kind, Entries))) {
    ...
    // Parse the decl, stmt, or expression.
    PreviousHadSemi = false;
    if (Tok.is(tok::pound_if)) {
      auto IfConfigResult = parseIfConfig(
        [&](SmallVectorImpl<ASTNode> &Elements, bool IsActive) {
          parseBraceItems(Elements, Kind, IsActive
                            ? BraceItemListKind::ActiveConditionalBlock
                            : BraceItemListKind::InactiveConditionalBlock,
                          IsFollowingGuard);
        });
      ...
    } else if (Tok.is(tok::pound_line)) {
      ParserStatus Status = parseLineDirective(true);
      ...
    } else if (Tok.is(tok::pound_sourceLocation)) {
      ParserStatus Status = parseLineDirective(false);
      ...
    } else if (isStartOfSwiftDecl()) {
      SmallVector<Decl*, 8> TmpDecls;
      ParserResult<Decl> DeclResult = 
          parseDecl(IsTopLevel ? PD_AllowTopLevel : PD_Default,
                    IsAtStartOfLineOrPreviousHadSemi,
                    [&](Decl *D) {
                      TmpDecls.push_back(D);

                      // Any function after a 'guard' statement is marked as
                      // possibly having local captures. This allows SILGen
                      // to correctly determine its capture list, since
                      // otherwise it would be skipped because it is not
                      // defined inside a local context.
                      if (IsFollowingGuard)
                        if (auto *FD = dyn_cast<FuncDecl>(D))
                          FD->setHasTopLevelLocalContextCaptures();
                    });
      ...
    } else if (IsTopLevel) {
     ...
      ParserStatus Status = parseExprOrStmt(Result);
     ...
    } else if (Tok.is(tok::kw_init) && isa<ConstructorDecl>(CurDeclContext)) {
      ...
    } else {
      ParserStatus ExprOrStmtStatus = parseExprOrStmt(Result);
      ...
    }
  } // End of while loop
...
}
...

The core task of Parser::parseBraceItems is to loop over the tokens and parse declarations, statements, and expressions by calling the corresponding parsing methods: parseDecl for declarations, and parseExprOrStmt, which leads to parseStmt or parseExprImpl, for everything else. Each of these in turn calls further parseXXX methods to parse the parts of the whole construct.
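The dispatch can be pictured as a decision made on the token that starts the item. The sketch below is a deliberate simplification with hypothetical names (ItemKind, classifyBraceItem); the keyword sets are taken loosely from the grammar comments quoted in this post, and the real isStartOfSwiftDecl check looks at much more than a single keyword.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class ItemKind { Decl, Stmt, Expr };

// Hypothetical helper: decides which parse routine a brace item would go to,
// based only on its first token.
ItemKind classifyBraceItem(const std::string &firstToken) {
  assert(!firstToken.empty() && "a brace item starts with a token");

  // Assumed keyword sets, simplified from the grammar comments.
  static const std::set<std::string> declStarts = {
      "import", "let",  "var",       "func",     "class",
      "struct", "enum", "typealias", "extension"};
  static const std::set<std::string> stmtStarts = {
      "if",     "guard", "for",      "switch",
      "return", "break", "continue", "fallthrough"};

  if (declStarts.count(firstToken))
    return ItemKind::Decl; // would go to parseDecl
  if (stmtStarts.count(firstToken))
    return ItemKind::Stmt; // would go to parseStmt
  return ItemKind::Expr;   // everything else is tried as an expression
}
```

Under these assumptions, "func" is classified as a declaration, "if" as a statement, and "print" falls through to the expression case, which matches the path our hello world example takes.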

/// Parse a single syntactic declaration and return a list of decl
/// ASTs.  This can return multiple results for var decls that bind to multiple
/// values, structs that define a struct decl and a constructor, etc.
///
/// \verbatim
///   decl:
///     decl-typealias
///     decl-extension
///     decl-let
///     decl-var
///     decl-class
///     decl-func
///     decl-enum
///     decl-struct
///     decl-import
///     decl-operator
/// \endverbatim
ParserResult<Decl>
Parser::parseDecl(ParseDeclOptions Flags,
                  bool IsAtStartOfLineOrPreviousHadSemi,
                  llvm::function_ref<void(Decl*)> Handler);

ParserStatus Parser::parseExprOrStmt(ASTNode &Result);

/// parseExpr
///
///   expr:
///     expr-sequence(basic | trailing-closure)
///
/// \param isExprBasic Whether we're only parsing an expr-basic.
ParserResult<Expr> Parser::parseExprImpl(Diag<> Message,
                                         bool isExprBasic);

ParserResult<Stmt> Parser::parseStmt();

Now we can use lldb to step through how the compiler parses our Helloworld.swift, which contains print("Hello world!").

swift::Parser::parseExprPostfix(this=0x00007fffffff5398, ID=(ID = expected_expr), isExprBasic=false) at ParseExpr.cpp:1377:1
swift::Parser::parseExprUnary(this=0x00007fffffff5398, Message=(ID = expected_expr), isExprBasic=false) at ParseExpr.cpp:502:12
swift::Parser::parseExprSequenceElement(this=0x00007fffffff5398, message=(ID = expected_expr), isExprBasic=false) at ParseExpr.cpp:436:9
swift::Parser::parseExprSequence(this=0x00007fffffff5398, Message=(ID = expected_expr), isExprBasic=false, isForConditionalDirective=false) at ParseExpr.cpp:183:7
swift::Parser::parseExprImpl(this=0x00007fffffff5398, Message=(ID = expected_expr), isExprBasic=false) at ParseExpr.cpp:66:10
swift::Parser::parseExpr(this=0x00007fffffff5398, ID=(ID = expected_expr)) at Parser.h:1454:12
swift::Parser::parseExprOrStmt(this=0x00007fffffff5398, Result=0x00007fffffff46b0) at ParseStmt.cpp:164:35
swift::Parser::parseBraceItems(this=0x00007fffffff5398, Entries=0x00007fffffff4a08, Kind=TopLevelCode, ConditionalBlockKind=Brace, IsFollowingGuard=0x00007fffffff48ef) at ParseStmt.cpp:437:29
swift::Parser::parseBraceItems(this=0x00007fffffff5398, Decls=0x00007fffffff4a08, Kind=TopLevelCode, ConditionalBlockKind=Brace) at Parser.h:904:12
swift::Parser::parseTopLevel(this=0x00007fffffff5398, decls=0x00007fffffff4f60) at ParseDecl.cpp:186:5


/* /wtsc/swift/lib/Parse/ParseExpr.cpp */
...
ParserResult<Expr> Parser::parseExprPostfix(Diag<> ID, bool isExprBasic) {
...
  auto Result = parseExprPrimary(ID, isExprBasic);
...
  Result = parseExprPostfixSuffix(Result, isExprBasic,
                                  /*periodHasKeyPathBehavior=*/InSwiftKeyPath,
                                  hasBindOptional);
...
}
...

We stop at the Parser::parseExprPostfix method; the call stack above shows how the parser gets there.

In Parser::parseExprPostfix, parseExprPrimary parses the identifier print, and parseExprPostfixSuffix parses the argument list enclosed in parentheses, ("Hello world!").
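The split between a primary expression and its postfix suffixes can be sketched as follows. This is a toy model under stated assumptions: PostfixParser is a hypothetical class that works directly on characters instead of tokens, handles only identifiers, string literals without escapes, and a single parenthesized argument, and returns the tree as a string. Only the method names and node names are borrowed from the post.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

class PostfixParser {
public:
  explicit PostfixParser(std::string source) : src(std::move(source)) {}

  // postfix -> primary suffix*
  std::string parseExprPostfix() {
    std::string result = parseExprPrimary();
    return parseExprPostfixSuffix(std::move(result));
  }

private:
  std::string src;
  std::size_t pos = 0;

  bool atEnd() const {
    assert(pos <= src.size());
    return pos == src.size();
  }

  // primary -> IDENTIFIER | STRING_LITERAL
  std::string parseExprPrimary() {
    if (atEnd())
      throw std::runtime_error("expected expression");
    if (src[pos] == '"') {
      std::size_t close = src.find('"', pos + 1);
      if (close == std::string::npos)
        throw std::runtime_error("unterminated string literal");
      std::string text = src.substr(pos + 1, close - pos - 1);
      pos = close + 1;
      return "(string_literal_expr value=\"" + text + "\")";
    }
    if (std::isalpha(static_cast<unsigned char>(src[pos]))) {
      std::size_t start = pos;
      while (!atEnd() && std::isalnum(static_cast<unsigned char>(src[pos])))
        ++pos;
      return "(unresolved_decl_ref_expr name=" +
             src.substr(start, pos - start) + ")";
    }
    throw std::runtime_error("expected expression");
  }

  // suffix -> '(' postfix ')'
  // Each suffix wraps the expression parsed so far in a call node.
  std::string parseExprPostfixSuffix(std::string base) {
    while (!atEnd() && src[pos] == '(') {
      ++pos; // consume '('
      std::string arg = parseExprPostfix();
      if (atEnd() || src[pos] != ')')
        throw std::runtime_error("expected ')'");
      ++pos; // consume ')'
      base = "(call_expr " + base + " (paren_expr " + arg + "))";
    }
    return base;
  }
};
```

Parsing print("Hello world!") with this sketch returns (call_expr (unresolved_decl_ref_expr name=print) (paren_expr (string_literal_expr value="Hello world!"))). The primary step only sees print; it is the suffix step that notices the opening parenthesis and turns the reference into a call.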

If you dump the parse result at this point, you get the AST of print("Hello world!"), but without type information: every node shows type='<null>'.

(lldb) expr Result.get()->dump()
(call_expr type='<null>' arg_labels=_:
  (unresolved_decl_ref_expr type='<null>' name=print function_ref=unapplied)
  (paren_expr type='<null>'
    (string_literal_expr type='<null>' encoding=utf8 value="Hello world!" builtin_initializer=**NULL** initializer=**NULL**)))

That is it. We now have the AST of our hello world example. Next, we will see how type checking fills in the types in this AST.