Implementing Your First Language Frontend with LLVM, Tutorial (6): Extending the Language: User-defined Operators

6.1. Introduction

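Welcome to Chapter 6 of the "Implementing a language with LLVM" tutorial. At this point in the tutorial, we have a fully functional language that is fairly minimal, but also useful. There is still one big problem with it, however: our language doesn't have many useful operators (such as division, logical negation, or even any comparison besides less-than).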

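This chapter of the tutorial takes a wild digression into adding user-defined operators to the Kaleidoscope language. This digression leaves us with a language that is simple and, in some ways, ugly, but also a powerful one. One of the great things about creating your own language is that you get to decide what is good and what is bad. In this tutorial we'll assume that it is okay to use this as a way to show some interesting parsing techniques.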

At the end of this chapter, we'll run through an example Kaleidoscope application. This gives an example of what you can build with Kaleidoscope and its feature set.

6.2. User-defined Operators

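The "operator overloading" that we will add to Kaleidoscope is more general than in languages like C++. In C++, you are only allowed to redefine existing operators: you can't programmatically change the grammar, introduce new operators, change precedence levels, and so on. In this chapter, we will add this capability to Kaleidoscope, which lets the user extend the set of supported operators.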

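The point of going into user-defined operators in a tutorial like this is to show the power and flexibility of a hand-written parser. Thus far, the parser we have been implementing uses recursive descent for most parts of the grammar and operator precedence parsing for expressions. See Chapter 2 for details. With operator precedence parsing, it is very easy to let the programmer introduce new operators into the grammar: the grammar is dynamically extensible as the JIT runs.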

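The two specific features we'll add are programmable unary operators (right now, Kaleidoscope has no unary operators at all) and programmable binary operators. An example of this is: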

# Logical unary not.
def unary!(v)
  if v then
    0
  else
    1;

# Define > with the same precedence as <.
def binary> 10 (LHS RHS)
  RHS < LHS;

# Binary "logical or", (note that it does not "short circuit")
def binary| 5 (LHS RHS)
  if LHS then
    1
  else if RHS then
    1
  else
    0;

# Define = with slightly lower precedence than relationals.
def binary= 9 (LHS RHS)
  !(LHS < RHS | LHS > RHS);

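Many languages aspire to being able to implement their standard runtime library in the language itself. In Kaleidoscope, we can implement significant parts of the language in the library!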

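We will break down the implementation of these features into two parts: implementing support for user-defined binary operators, and adding unary operators.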

6.3. User-defined Binary Operators

Adding support for user-defined binary operators is pretty simple with our current framework. We'll first add support for the unary/binary keywords:

enum Token {
  ...
  // operators
  tok_binary = -11,
  tok_unary = -12
};
...
static int gettok() {
...
    if (IdentifierStr == "for")
      return tok_for;
    if (IdentifierStr == "in")
      return tok_in;
    if (IdentifierStr == "binary")
      return tok_binary;
    if (IdentifierStr == "unary")
      return tok_unary;
    return tok_identifier;

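This just adds lexer support for the unary and binary keywords, as we did in previous chapters. One nice thing about our current AST is that we represent binary operators in a fully general way by using their ASCII code as the opcode. For our extended operators, we'll use this same representation, so we don't need any new AST or parser support. On the other hand, we have to be able to represent the definitions of these new operators, in the "def binary| 5" part of the function definition. In our grammar so far, the "name" of a function definition is parsed by the "prototype" production into a PrototypeAST node. To represent our new user-defined operators as prototypes, we have to extend the PrototypeAST node like this: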

/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its argument names as well as if it is an operator.
class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
  bool IsOperator;
  unsigned Precedence;  // Precedence if a binary op.

public:
  PrototypeAST(const std::string &Name, std::vector<std::string> Args,
               bool IsOperator = false, unsigned Prec = 0)
  : Name(Name), Args(std::move(Args)), IsOperator(IsOperator),
    Precedence(Prec) {}

  Function *codegen();
  const std::string &getName() const { return Name; }

  bool isUnaryOp() const { return IsOperator && Args.size() == 1; }
  bool isBinaryOp() const { return IsOperator && Args.size() == 2; }

  char getOperatorName() const {
    assert(isUnaryOp() || isBinaryOp());
    return Name[Name.size() - 1];
  }

  unsigned getBinaryPrecedence() const { return Precedence; }
};

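Basically, in addition to knowing a name for the prototype, we now keep track of whether it is an operator and, if it is, what precedence level the operator has. The precedence is only used for binary operators (as you'll see below, it just doesn't apply to unary operators). Now that we have a way to represent the prototype for a user-defined operator, we need to parse it: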

/// prototype
///   ::= id '(' id* ')'
///   ::= binary LETTER number? (id, id)
static std::unique_ptr<PrototypeAST> ParsePrototype() {
  std::string FnName;

  unsigned Kind = 0;  // 0 = identifier, 1 = unary, 2 = binary.
  unsigned BinaryPrecedence = 30;

  switch (CurTok) {
  default:
    return LogErrorP("Expected function name in prototype");
  case tok_identifier:
    FnName = IdentifierStr;
    Kind = 0;
    getNextToken();
    break;
  case tok_binary:
    getNextToken();
    if (!isascii(CurTok))
      return LogErrorP("Expected binary operator");
    FnName = "binary";
    FnName += (char)CurTok;
    Kind = 2;
    getNextToken();

    // Read the precedence if present.
    if (CurTok == tok_number) {
      if (NumVal < 1 || NumVal > 100)
        return LogErrorP("Invalid precedence: must be 1..100");
      BinaryPrecedence = (unsigned)NumVal;
      getNextToken();
    }
    break;
  }

  if (CurTok != '(')
    return LogErrorP("Expected '(' in prototype");

  std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier)
    ArgNames.push_back(IdentifierStr);
  if (CurTok != ')')
    return LogErrorP("Expected ')' in prototype");

  // success.
  getNextToken();  // eat ')'.

  // Verify right number of names for operator.
  if (Kind && ArgNames.size() != Kind)
    return LogErrorP("Invalid number of operands for operator");

  return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames), Kind != 0,
                                         BinaryPrecedence);
}

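This is all fairly straightforward parsing code, and we have already seen a lot of similar code in the past. One interesting part of the code above is the couple of lines that set up FnName for binary operators. This builds names like "binary@" for a newly defined "@" operator. It takes advantage of the fact that symbol names in the LLVM symbol table are allowed to contain any character, including embedded nul characters.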
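The naming convention can be illustrated in isolation. The following is a minimal sketch; makeOperatorName and operatorCharOf are hypothetical helpers introduced here for illustration and are not part of the tutorial code. It shows how an operator character is appended to a "binary" or "unary" prefix, and how the character is recovered from the last position of the name, which is the rule PrototypeAST::getOperatorName applies.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not in the tutorial): builds the symbol name under
// which an operator's function is stored, e.g. "binary@" or "unary!".
std::string makeOperatorName(bool IsBinary, char Op) {
  std::string Name = IsBinary ? "binary" : "unary";
  Name += Op;
  return Name;
}

// Hypothetical helper: recovers the operator character from such a name by
// taking its last character, as PrototypeAST::getOperatorName does.
char operatorCharOf(const std::string &Name) {
  assert(!Name.empty() && "operator name must not be empty");
  return Name[Name.size() - 1];
}
```

Because the operator character is always the final character, the round trip from character to name and back needs no separator or escaping.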

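The next interesting thing to add is codegen support for these binary operators. Given our current structure, this is a simple addition of a default case to our existing binary operator node: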

Value *BinaryExprAST::codegen() {
  Value *L = LHS->codegen();
  Value *R = RHS->codegen();
  if (!L || !R)
    return nullptr;

  switch (Op) {
  case '+':
    return Builder->CreateFAdd(L, R, "addtmp");
  case '-':
    return Builder->CreateFSub(L, R, "subtmp");
  case '*':
    return Builder->CreateFMul(L, R, "multmp");
  case '<':
    L = Builder->CreateFCmpULT(L, R, "cmptmp");
    // Convert bool 0/1 to double 0.0 or 1.0
    return Builder->CreateUIToFP(L, Type::getDoubleTy(*TheContext),
                                "booltmp");
  default:
    break;
  }

  // If it wasn't a builtin binary operator, it must be a user defined one. Emit
  // a call to it.
  Function *F = getFunction(std::string("binary") + Op);
  assert(F && "binary operator not found!");

  Value *Ops[2] = { L, R };
  return Builder->CreateCall(F, Ops, "binop");
}

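As you can see above, the new code is actually really simple. It just does a lookup for the appropriate operator in the symbol table and generates a function call to it. Since user-defined operators are just built as normal functions (because the "prototype" boils down to a function with the right name), everything falls into place.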
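The same "built-ins first, then look up by name" dispatch can be sketched without LLVM. The example below is an illustration under stated assumptions, not the tutorial's code generator: it evaluates values directly instead of emitting IR, the UserFunctions map is a hypothetical stand-in for the module's symbol table, and defineBinaryOperator and evalBinary are names invented for this sketch. It also uses an ordinary ordered comparison for '<' and ignores NaN handling.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

using BinaryFn = std::function<double(double, double)>;

// Hypothetical stand-in for the symbol table: maps "binary<op>" to a function.
std::map<std::string, BinaryFn> UserFunctions;

// Registers a user-defined binary operator under the name "binary<Op>".
void defineBinaryOperator(char Op, BinaryFn Fn) {
  assert(Fn && "operator needs a callable body");
  UserFunctions[std::string("binary") + Op] = std::move(Fn);
}

// Evaluates L Op R. Built-in operators are handled directly; anything else is
// looked up by name. Returns false if no such operator is known.
bool evalBinary(char Op, double L, double R, double &Result) {
  switch (Op) {
  case '+': Result = L + R; return true;
  case '-': Result = L - R; return true;
  case '*': Result = L * R; return true;
  case '<': Result = L < R ? 1.0 : 0.0; return true;
  default: break;
  }
  auto It = UserFunctions.find(std::string("binary") + Op);
  if (It == UserFunctions.end())
    return false;
  Result = It->second(L, R);
  return true;
}
```

Registering '>' as "RHS < LHS", as in the Kaleidoscope example earlier, makes evalBinary('>', 3, 2, Out) succeed with Out set to 1.0, while an unregistered operator is reported as unknown.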

The final piece of code we are missing is a small addition to function code generation:

Function *FunctionAST::codegen() {
  // Transfer ownership of the prototype to the FunctionProtos map, but keep a
  // reference to it for use below.
  auto &P = *Proto;
  FunctionProtos[Proto->getName()] = std::move(Proto);
  Function *TheFunction = getFunction(P.getName());
  if (!TheFunction)
    return nullptr;

  // If this is an operator, install it.
  if (P.isBinaryOp())
    BinopPrecedence[P.getOperatorName()] = P.getBinaryPrecedence();

  // Create a new basic block to start insertion into.
  BasicBlock *BB = BasicBlock::Create(*TheContext, "entry", TheFunction);
  ...

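Basically, before generating code for a function, if it is a user-defined operator, we register it in the precedence table. This allows the binary operator parsing logic we already have in place to handle it. Since we are working with a fully general operator precedence parser, this is all we need to do to "extend the grammar".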
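To see why a single table update is enough, here is a small self-contained sketch of operator precedence parsing over single-character operands and operators. It is a simplified illustration, not the tutorial's parser: installBinaryOperator, parseBinOpRHS, and parseExpression are written for this example, the output is a parenthesized string rather than an AST, and the built-in precedences for '+', '-', and '*' are assumed to match the values used in earlier chapters.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Precedence table seeded with the built-in operators (values assumed).
std::map<char, int> BinopPrecedence = {
    {'<', 10}, {'+', 20}, {'-', 20}, {'*', 40}};

// Returns the precedence of Tok, or -1 if it is not a known binary operator.
int getTokPrecedence(char Tok) {
  auto It = BinopPrecedence.find(Tok);
  return It == BinopPrecedence.end() ? -1 : It->second;
}

// Hypothetical helper mirroring the "install" step in FunctionAST::codegen.
void installBinaryOperator(char Op, int Prec) {
  assert(Prec >= 1 && Prec <= 100 && "precedence must be 1..100");
  BinopPrecedence[Op] = Prec;
}

// Parses (operator operand)* after LHS and returns a fully parenthesized
// string. Parsing stops at the first character that is not a known operator.
std::string parseBinOpRHS(const std::string &S, std::size_t &Pos, int ExprPrec,
                          std::string LHS) {
  while (true) {
    int TokPrec = Pos < S.size() ? getTokPrecedence(S[Pos]) : -1;
    if (TokPrec < ExprPrec)
      return LHS;
    char Op = S[Pos++];
    if (Pos >= S.size())
      return ""; // operator without a right-hand operand
    std::string RHS(1, S[Pos++]);
    int NextPrec = Pos < S.size() ? getTokPrecedence(S[Pos]) : -1;
    if (TokPrec < NextPrec) {
      // The next operator binds tighter, so it takes RHS as its left operand.
      RHS = parseBinOpRHS(S, Pos, TokPrec + 1, RHS);
      if (RHS.empty())
        return "";
    }
    LHS = "(" + LHS + Op + RHS + ")";
  }
}

// Parses a whole expression whose first character is an operand.
std::string parseExpression(const std::string &S) {
  if (S.empty())
    return "";
  std::size_t Pos = 1;
  return parseBinOpRHS(S, Pos, 0, std::string(1, S[0]));
}
```

Before '|' is installed, parseExpression("a|b") stops at the unknown character and yields just "a". After installBinaryOperator('|', 5), the same parser groups "a<b|c<d" as "((a<b)|(c<d))" without any change to the parsing code itself.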

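Now we have useful user-defined binary operators. This builds heavily on the framework we previously put in place for other operators. Adding unary operators is a bit more challenging, because we don't have any framework for them yet. Let's see what it takes.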

6.4. User-defined Unary Operators

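Since we don't currently support unary operators in the Kaleidoscope language, we'll need to add everything required to support them. Above, we added simple support for the 'unary' keyword to the lexer. In addition to that, we need an AST node: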

/// UnaryExprAST - Expression class for a unary operator.
class UnaryExprAST : public ExprAST {
  char Opcode;
  std::unique_ptr<ExprAST> Operand;

public:
  UnaryExprAST(char Opcode, std::unique_ptr<ExprAST> Operand)
    : Opcode(Opcode), Operand(std::move(Operand)) {}

  Value *codegen() override;
};

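By now, this AST node is very simple and obvious. It directly mirrors the binary operator AST node, except that it has only one child. Next, we need to add the parsing logic. Parsing a unary operator is pretty simple: we'll add a new function to do it: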

/// unary
///   ::= primary
///   ::= '!' unary
static std::unique_ptr<ExprAST> ParseUnary() {
  // If the current token is not an operator, it must be a primary expr.
  if (!isascii(CurTok) || CurTok == '(' || CurTok == ',')
    return ParsePrimary();

  // If this is a unary operator, read it.
  int Opc = CurTok;
  getNextToken();
  if (auto Operand = ParseUnary())
    return std::make_unique<UnaryExprAST>(Opc, std::move(Operand));
  return nullptr;
}

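The grammar we add here is pretty straightforward. If we see a unary operator where a primary expression is expected, we eat the operator as a prefix and parse the remaining piece as another unary expression. This allows us to handle multiple unary operators (e.g. "!!x"). Note that unary operators can't have ambiguous parses the way binary operators can, so there is no need for precedence information.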
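The recursion can be tried out on plain strings. The sketch below is a simplified stand-in for ParseUnary and rests on several assumptions: it works on characters rather than lexer tokens, treats a run of letters or digits as the primary expression, does not handle parenthesized expressions, and returns a parenthesized string instead of an AST. The names parseUnarySketch and parseUnaryString are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Parses zero or more prefix operators followed by an identifier, starting at
// Pos. Returns a fully parenthesized string, or "" on error.
std::string parseUnarySketch(const std::string &Src, std::size_t &Pos) {
  assert(Pos <= Src.size() && "position out of range");
  if (Pos == Src.size())
    return ""; // ran out of input while expecting an operand

  unsigned char C = static_cast<unsigned char>(Src[Pos]);

  // Not an operator: parse a primary expression (here, just an identifier).
  if (std::isalnum(C)) {
    std::string Id;
    while (Pos < Src.size() &&
           std::isalnum(static_cast<unsigned char>(Src[Pos])))
      Id += Src[Pos++];
    return Id;
  }

  // '(' and ',' are never unary operators; this sketch does not parse them.
  if (C == '(' || C == ',')
    return "";

  // Otherwise treat the character as a prefix operator and recurse.
  ++Pos;
  std::string Operand = parseUnarySketch(Src, Pos);
  if (Operand.empty())
    return "";
  return std::string("(") + static_cast<char>(C) + Operand + ")";
}

// Convenience wrapper that parses from the start of the string.
std::string parseUnaryString(const std::string &Src) {
  std::size_t Pos = 0;
  return parseUnarySketch(Src, Pos);
}
```

With this sketch, parseUnaryString("!!x") produces "(!(!x))": each operator wraps whatever the recursive call returns, so any number of prefix operators nests naturally.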

The remaining problem is that ParseUnary needs to be called from somewhere. To do this, we change the previous callers of ParsePrimary to call ParseUnary instead:

/// binoprhs
///   ::= ('+' unary)*
static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
                                              std::unique_ptr<ExprAST> LHS) {
  ...
    // Parse the unary expression after the binary operator.
    auto RHS = ParseUnary();
    if (!RHS)
      return nullptr;
  ...
}

/// expression
///   ::= unary binoprhs
///
static std::unique_ptr<ExprAST> ParseExpression() {
  auto LHS = ParseUnary();
  if (!LHS)
    return nullptr;

  return ParseBinOpRHS(0, std::move(LHS));
}

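With these two simple changes, we are now able to parse unary operators and build the AST for them. Next up, we need to add parser support for prototypes, so that unary operator prototypes can be parsed. We extend the binary operator code above with: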

/// prototype
///   ::= id '(' id* ')'
///   ::= binary LETTER number? (id, id)
///   ::= unary LETTER (id)
static std::unique_ptr<PrototypeAST> ParsePrototype() {
  std::string FnName;

  unsigned Kind = 0;  // 0 = identifier, 1 = unary, 2 = binary.
  unsigned BinaryPrecedence = 30;

  switch (CurTok) {
  default:
    return LogErrorP("Expected function name in prototype");
  case tok_identifier:
    FnName = IdentifierStr;
    Kind = 0;
    getNextToken();
    break;
  case tok_unary:
    getNextToken();
    if (!isascii(CurTok))
      return LogErrorP("Expected unary operator");
    FnName = "unary";
    FnName += (char)CurTok;
    Kind = 1;
    getNextToken();
    break;
  case tok_binary:
    ...

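As with binary operators, we name unary operators with a name that includes the operator character. This helps us at code generation time. Speaking of which, the final piece we need to add is codegen support for unary operators. It looks like this: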

Value *UnaryExprAST::codegen() {
  Value *OperandV = Operand->codegen();
  if (!OperandV)
    return nullptr;

  Function *F = getFunction(std::string("unary") + Opcode);
  if (!F)
    return LogErrorV("Unknown unary operator");

  return Builder->CreateCall(F, OperandV, "unop");
}

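This code is similar to, but simpler than, the code for binary operators. It is simpler primarily because it doesn't need to handle any predefined operators.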

6.5. Kicking the Tires

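It is somewhat hard to believe, but with a few simple extensions we've covered in the last chapters, we have grown a real-ish language. With this, we can do a lot of interesting things, including I/O, math, and a bunch of other things. For example, we can now add a nice sequencing operator (printd is defined to print out the specified value and a newline):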

ready> extern printd(x);
Read extern:
declare double @printd(double)

ready> def binary : 1 (x y) 0;  # Low-precedence operator that ignores operands.
...
ready> printd(123) : printd(456) : printd(789);
123.000000
456.000000
789.000000
Evaluated to 0.000000
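The session above declares printd with extern, so its body must come from the host program. This section does not show that definition; the following is a sketch of what host-side definitions of printd and putchard might look like, assuming they are compiled into the same process as the JIT and write to stderr (the output stream is an assumption made for this example).

```cpp
#include <cstdio>

// Assumed host-side helpers. extern "C" keeps the symbol names unmangled so
// that a lookup by the plain names "putchard" and "printd" can find them.

// putchard - writes the character whose code is X and returns 0.
extern "C" double putchard(double X) {
  std::fputc(static_cast<char>(X), stderr);
  return 0;
}

// printd - writes X as a double followed by a newline and returns 0.
extern "C" double printd(double X) {
  std::fprintf(stderr, "%f\n", X);
  return 0;
}
```

Both functions take and return double because that is the only type Kaleidoscope has; the return value is simply ignored by the sequencing operator.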

We can also define a bunch of other "primitive" operations, such as:

# Logical unary not.
def unary!(v)
  if v then
    0
  else
    1;

# Unary negate.
def unary-(v)
  0-v;

# Define > with the same precedence as <.
def binary> 10 (LHS RHS)
  RHS < LHS;

# Binary logical or, which does not short circuit.
def binary| 5 (LHS RHS)
  if LHS then
    1
  else if RHS then
    1
  else
    0;

# Binary logical and, which does not short circuit.
def binary& 6 (LHS RHS)
  if !LHS then
    0
  else
    !!RHS;

# Define = with slightly lower precedence than relationals.
def binary = 9 (LHS RHS)
  !(LHS < RHS | LHS > RHS);

# Define ':' for sequencing: as a low-precedence operator that ignores operands
# and just returns the RHS.
def binary : 1 (x y) y;

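Given the previous if/then/else support, we can also define interesting functions for I/O. For example, a printdensity function (used further below) can print a character whose "density" reflects the value passed in: the lower the value, the denser the character. Building on such primitives, the following function determines how many iterations it takes for a point in the complex plane to diverge: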

# Determine whether the specific location diverges.
# Solve for z = z^2 + c in the complex plane.
def mandelconverger(real imag iters creal cimag)
  if iters > 255 | (real*real + imag*imag > 4) then
    iters
  else
    mandelconverger(real*real - imag*imag + creal,
                    2*real*imag + cimag,
                    iters+1, creal, cimag);

# Return the number of iterations required for the iteration to escape
def mandelconverge(real imag)
  mandelconverger(real, imag, 0, real, imag);
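For readers who want to check the logic outside Kaleidoscope, the same iteration can be sketched in C++. This is a direct transcription of the two functions above into a loop, written for illustration only; the short-circuiting || used here gives the same result as the non-short-circuiting | in the Kaleidoscope version, because neither operand has side effects.

```cpp
#include <cassert>

// Returns the number of iterations of z = z^2 + c (starting from z = c)
// before |z|^2 exceeds 4, stopping once the count exceeds 255.
double mandelconverge(double creal, double cimag) {
  assert(creal == creal && cimag == cimag && "inputs must not be NaN");
  double real = creal, imag = cimag;
  double iters = 0;
  while (!(iters > 255 || real * real + imag * imag > 4)) {
    double nextReal = real * real - imag * imag + creal;
    double nextImag = 2 * real * imag + cimag;
    real = nextReal;
    imag = nextImag;
    iters += 1;
  }
  return iters;
}
```

A point that escapes immediately, such as (2, 2), yields 0, while a point that never escapes, such as the origin, runs until the counter passes 255 and yields 256.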

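This "z = z^2 + c" function is a beautiful little creature that is the basis for computing the Mandelbrot set. Our mandelconverge function returns the number of iterations it takes for a complex orbit to escape, capped once the count exceeds 255. This is not a very useful function by itself, but if you plot its value over a two-dimensional plane, you can see the Mandelbrot set. Given that we are limited to using putchard here, our amazing graphical output is limited, but we can whip something together using the printdensity density plotter: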

# Compute and plot the mandelbrot set with the specified 2 dimensional range
# info.
def mandelhelp(xmin xmax xstep   ymin ymax ystep)
  for y = ymin, y < ymax, ystep in (
    (for x = xmin, x < xmax, xstep in
       printdensity(mandelconverge(x,y)))
    : putchard(10)
  )

# mandel - This is a convenient helper function for plotting the mandelbrot set
# from the specified position with the specified Magnification.
def mandel(realstart imagstart realmag imagmag)
  mandelhelp(realstart, realstart+realmag*78, realmag,
             imagstart, imagstart+imagmag*40, imagmag);

ready> mandel(-0.9, -1.4, 0.02, 0.03);

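At this point, you may be starting to realize that Kaleidoscope is a real and powerful language. It may not be self-similar :), but it can be used to plot things that are!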

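With this, we conclude the "adding user-defined operators" chapter of the tutorial. We have successfully augmented our language, adding the ability to extend the language in the library, and we have shown how this can be used to build a simple but interesting end-user application in Kaleidoscope. At this point, Kaleidoscope can build a variety of applications that are functional and can call functions with side effects, but it can't actually define and mutate a variable itself.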

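Strikingly, variable mutation is an important feature of some languages, and it is not at all obvious how to add support for mutable variables without adding an "SSA construction" phase to your front-end. In the next chapter, we will describe how you can add variable mutation without building SSA in your front-end.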

6.6. Full Code Listing