Repository for this chapter

1. Lexical Analysis - GitHub

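Preface

Strictly speaking, what this chapter calls lexical analysis is not lexical analysis in the traditional sense: Sprache builds parsers by composition, so lexical and syntactic analysis end up assembled together rather than run as separate phases.

Lexer

Create a new class named Lexer with the following content: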

namespace SwordScript;

/// <summary>
/// Lexical analyzer
/// </summary>
public static class Lexer
{
    
}

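Note: if the compiler reports unsupported syntax, see Chapter 2 and set the language version to C# 10 or later.

Identifier Parsing

Building the Parser

As defined in Chapter 1, an identifier is a non-reserved word that starts with a Unicode letter or an underscore and consists of Unicode letters, underscores, and digits.

From this definition we can write the matching regular expression: [_\p{L}][0-9_\p{L}]*

Breaking down the regex:
- _ an underscore
- \p{L} any Unicode letter, including Latin letters as well as Chinese, Japanese, Cyrillic, and other scripts
- [_\p{L}] a Unicode letter or an underscore
- [0-9_\p{L}] a Unicode letter, an underscore, or a digit
- [0-9_\p{L}]* zero or more Unicode letters, underscores, or digits

With the regular expression in hand, we can use it to match the text that forms an identifier.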
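Before bringing in Sprache, the pattern can be tried on its own with the standard System.Text.RegularExpressions API. The following is a minimal sketch; the class name IdentifierRegexDemo and the helper IsIdentifier are hypothetical, and the \A and \z anchors are added here only so that the whole input has to match.

```csharp
using System.Text.RegularExpressions;

public static class IdentifierRegexDemo
{
    // The pattern from the text, anchored with \A and \z (an assumption of this
    // sketch) so that the entire input must be an identifier.
    private static readonly Regex Pattern = new Regex(@"\A[_\p{L}][0-9_\p{L}]*\z");

    // Hypothetical helper: true when the text is a well-formed identifier.
    // Reserved words are not checked here.
    public static bool IsIdentifier(string text) => Pattern.IsMatch(text);
}
```

For example, IdentifierRegexDemo.IsIdentifier("变量123") returns true, while IdentifierRegexDemo.IsIdentifier("123abc") returns false because the first character is a digit.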

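Throughout this series we use the Sprache library for both lexical and syntactic analysis. The best way to get started with a library is to read its own documentation, but this article also explains the Sprache features it uses as they come up.

In Sprache, building a parser starts from the static class Parse, whose methods return delegates of type Parser<T>, where T is the type of the value the parser produces.

The identifier parser built with Sprache looks like this: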

    public static readonly Parser<string> Identifier = Parse.Regex(@"[_\p{L}][0-9_\p{L}]*").Token();

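Parse.Regex returns a parser that matches text conforming to the regular expression; .Token() makes the parser skip leading and trailing whitespace.

We now have a working parser. Calling the .Parse(string input) extension method on it runs the parser and returns the result.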
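Besides .Parse, Sprache also offers .TryParse, which reports failure through a result object instead of an exception, and .End(), which requires that no input is left over. The sketch below shows the three side by side; the class name ParseCallDemo and its method names are hypothetical, and the code assumes the Sprache NuGet package is referenced.

```csharp
using Sprache;

public static class ParseCallDemo
{
    // Same identifier parser as in the text.
    private static readonly Parser<string> Identifier =
        Parse.Regex(@"[_\p{L}][0-9_\p{L}]*").Token();

    // Parse returns the value on success and throws ParseException on failure.
    public static string ParseOrThrow(string input) => Identifier.Parse(input);

    // TryParse returns an IResult<T> instead of throwing.
    public static bool Succeeds(string input) =>
        Identifier.TryParse(input).WasSuccessful;

    // End() additionally requires that nothing remains after the identifier.
    public static bool IsWholeInputIdentifier(string input) =>
        Identifier.End().TryParse(input).WasSuccessful;
}
```

Note that .Parse alone does not insist on consuming the whole input: ParseCallDemo.ParseOrThrow("abc def") returns "abc", whereas ParseCallDemo.IsWholeInputIdentifier("abc def") is false.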

Unit Tests

Open the Tests project and add the Sprache package via NuGet. Then create a LexerText class with the following content:

using NUnit.Framework;
using Sprache;
using SwordScript;

namespace Tests;

public class LexerText
{
    [Test]
    public void Identifier()
    {
        Assert.AreEqual("abc", Lexer.Identifier.Parse(" abc "));
        Assert.AreEqual("_abc", Lexer.Identifier.Parse(" _abc "));
        Assert.AreEqual("_abc123", Lexer.Identifier.Parse(" _abc123 "));
        Assert.AreEqual("变量", Lexer.Identifier.Parse(" 变量 "));
        Assert.AreEqual("变量123", Lexer.Identifier.Parse(" 变量123 "));
        Assert.Catch<ParseException>(() => Lexer.Identifier.Parse(" "));
        Assert.Catch<ParseException>(() => Lexer.Identifier.Parse(" 123 "));
        Assert.Catch<ParseException>(() => Lexer.Identifier.Parse(" 123abc "));
    }
}

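Run the unit tests; they all pass.

Integer Parsing

Building the Parser

Compared with identifiers, integers are much simpler to parse: an unbroken run of digits with no other characters in between is an integer.

The integer parser built with Sprache looks like this: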

public static readonly Parser<long> LongInteger = Parse.Chars("0123456789").AtLeastOnce().Text().Select(long.Parse).Token();

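Parse.Chars matches a single character from the given set, .AtLeastOnce() requires one or more matches, .Text() joins the matched characters into a string, .Select(long.Parse) passes that string to long.Parse to convert it to a long, and .Token(), as before, skips leading and trailing whitespace.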
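One edge case worth keeping in mind is a digit run that does not fit into a long. Since the conversion happens inside the Select delegate, such input would presumably surface as the exception thrown by long.Parse rather than as a ParseException. The sketch below only exercises the standard library side of this; LongRangeDemo and its members are hypothetical names.

```csharp
using System;

public static class LongRangeDemo
{
    // True when the digit string fits into a 64-bit signed integer.
    public static bool FitsInLong(string digits) => long.TryParse(digits, out _);

    // Name of the exception long.Parse throws for the input, or "none".
    public static string ParseFailure(string digits)
    {
        try
        {
            long.Parse(digits);
            return "none";
        }
        catch (Exception e)
        {
            return e.GetType().Name;
        }
    }
}
```

For instance, LongRangeDemo.ParseFailure("9223372036854775808") returns "OverflowException", because the value is one above long.MaxValue.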

Unit Tests

In the LexerText class, add a test method for integer parsing:

[Test]
public void LongInteger()
{
    Assert.AreEqual(0L, Lexer.LongInteger.Parse(" 0 "));
    Assert.AreEqual(1L, Lexer.LongInteger.Parse(" 1 "));
    Assert.AreEqual(123L, Lexer.LongInteger.Parse(" 123 "));
    Assert.AreEqual(123456789L, Lexer.LongInteger.Parse(" 123456789 "));
    Assert.AreEqual(1234567890123456789L, Lexer.LongInteger.Parse(" 1234567890123456789 "));
    Assert.Catch<ParseException>(() => Lexer.LongInteger.Parse(" "));
    Assert.Catch<ParseException>(() => Lexer.LongInteger.Parse(" abc "));
}

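Run the unit tests; they all pass.

Why not parse negative numbers?

Attentive readers will notice that negative numbers are not handled here. That is because the minus sign is left to expression parsing.

If the sign were already consumed here, expression parsing could end up parsing several consecutive minus signs.

Floating-Point Parsing

Our definition of a floating-point number is just as straightforward: digits (optional), a dot, then digits.

The floating-point parser is therefore: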

public static readonly Parser<double> DoubleFloat =
    (from integer in Parse.Chars("0123456789").Many().Text()
        from dot in Parse.Char('.')
        from fraction in Parse.Chars("0123456789").AtLeastOnce().Text()
        select double.Parse($"{integer}.{fraction}")).Token();

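Seeing the sudden jump in code size, some readers may feel as if they went straight from grade-school arithmetic to advanced calculus. Don't panic: this is simply how Sprache uses LINQ query syntax to build a parser that matches several parts in sequence.

The leading from xxx in Parser takes the result of a parser and binds it to a name.

.Many() matches zero or more occurrences.

Each subsequent from xxx in Parser is syntactic sugar for SelectMany and runs the next parser on the remaining input.

The final select double.Parse($"{integer}.{fraction}") joins the parsed pieces into a string and converts it to a number with double.Parse.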
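To make the sequencing more concrete, the same parser can be written without query syntax by chaining Sprache's .Then method, where each step receives the previous result and returns the next parser. This is only a sketch of what the from clauses roughly amount to, not the exact code the compiler generates; ThenChainDemo, Digits0, and Digits1 are hypothetical names.

```csharp
using Sprache;

public static class ThenChainDemo
{
    // Zero or more digits (integer part) and one or more digits (fraction part).
    private static readonly Parser<string> Digits0 =
        Parse.Chars("0123456789").Many().Text();
    private static readonly Parser<string> Digits1 =
        Parse.Chars("0123456789").AtLeastOnce().Text();

    // Each Then receives the result of the previous parser and returns the
    // parser to run next on the remaining input.
    public static readonly Parser<double> DoubleFloat =
        Digits0.Then(integer =>
            Parse.Char('.').Then(dot =>
                Digits1.Select(fraction =>
                    double.Parse($"{integer}.{fraction}"))))
        .Token();
}
```

The nesting is what lets the innermost lambda see both integer and fraction, which is exactly what the query syntax hides.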

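In fact, the parser above can also be replaced with an equivalent regular expression:

public static readonly Parser<double> DoubleFloat = Parse.Regex(@"[0-9]*\.[0-9]+").Select(double.Parse).Token();

The two versions behave the same. Note that the dot must be escaped as \. because an unescaped . matches any character.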
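Two details of the regex form are easy to get wrong, so here is a small standard-library sketch of both. First, an unescaped dot accepts inputs such as 1a5 or even 123. Second, double.Parse without a format provider uses the current culture, so passing CultureInfo.InvariantCulture is one way to keep '.' as the decimal separator on every machine. FloatRegexDemo and its members are hypothetical names, and the \A and \z anchors are added only for this demonstration.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class FloatRegexDemo
{
    // Unescaped dot: '.' matches any character, so "1a5" and "123" are accepted.
    private static readonly Regex Loose = new Regex(@"\A[0-9]*.[0-9]+\z");

    // Escaped dot: only a literal '.' may separate the two digit runs.
    private static readonly Regex Strict = new Regex(@"\A[0-9]*\.[0-9]+\z");

    public static bool LooseMatches(string s) => Loose.IsMatch(s);

    public static bool StrictMatches(string s) => Strict.IsMatch(s);

    // Parses with '.' as the decimal separator regardless of the current culture.
    public static double ParseInvariant(string s) =>
        double.Parse(s, CultureInfo.InvariantCulture);
}
```

With these helpers, FloatRegexDemo.LooseMatches("1a5") is true while FloatRegexDemo.StrictMatches("1a5") is false, and both accept ".5".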

Unit Tests

In the LexerText class, add a test method for floating-point parsing:

[Test]
public void DoubleFloat()
{
    Assert.AreEqual(0.0, Lexer.DoubleFloat.Parse(" 0.0 "), 0.00001);
    Assert.AreEqual(0.5, Lexer.DoubleFloat.Parse(" 0.5 "), 0.00001);
    Assert.AreEqual(.5, Lexer.DoubleFloat.Parse(" .5 "), 0.00001);
    Assert.AreEqual(1.0, Lexer.DoubleFloat.Parse(" 1.0 "), 0.00001);
    Assert.AreEqual(10.01, Lexer.DoubleFloat.Parse(" 10.01 "), 0.00001);
    Assert.AreEqual(9999.9999, Lexer.DoubleFloat.Parse(" 9999.9999 "), 0.00001);
}

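Run the unit tests; they all pass.

String Parsing

Among the basic types, strings have the most involved definition.

According to the definition in Chapter 1, we need to match the characters between two quotation marks, and the closing quotation mark must not be preceded by a \ that escapes it.

This gives the following regular expression: "(\\"|[^"])*"

Breaking down the regex:
- [^"] any character other than a quotation mark
- \\" a quotation mark preceded by \
- (\\"|[^"])* zero or more items, each either a \-escaped quotation mark or a non-quote character

The parser is as follows: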

public static readonly Parser<string> String =
    Parse.Regex(@"""(\\""|[^""])*""")
        .Select(s => s.Substring(1, s.Length - 2) // drop the surrounding quotes
            .Replace(@"\""", @"""")   // \" becomes "
            .Replace(@"\\", @"\")     // \\ becomes \
            .Replace(@"\n", "\n"))    // \n becomes a newline
        .Token();

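This parser uses the same building blocks as the identifier parser.

Note that in C#, a string prefixed with @ is a verbatim string in which backslash escapes are not processed, but a quotation mark inside it must be written as two consecutive quotation marks.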
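Because the quoting rules are easy to misread, the sketch below writes the pattern from the regex breakdown once as a verbatim string and once as a regular string; the two constants hold the same text. It also shows one possible way to resolve escapes in a single left-to-right pass with Regex.Replace, which avoids treating an escaped backslash followed by n as a newline. StringLiteralDemo and Unescape are hypothetical names and not part of the chapter's Lexer.

```csharp
using System.Text.RegularExpressions;

public static class StringLiteralDemo
{
    // Verbatim form: "" stands for one quote, backslashes are literal.
    public const string VerbatimPattern = @"""(\\""|[^""])*""";

    // The same pattern as a regular string, where \" and \\ are C# escapes.
    public const string RegularPattern = "\"(\\\\\"|[^\"])*\"";

    // Hypothetical helper: resolves \" \\ and \n in one pass over the text
    // between the quotes.
    public static string Unescape(string body) =>
        Regex.Replace(body, @"\\([""\\n])", m => m.Groups[1].Value switch
        {
            "n" => "\n",
            var c => c, // \" becomes " and \\ becomes \
        });
}
```

For example, StringLiteralDemo.Unescape(@"a\nb") yields a, a newline, then b, while StringLiteralDemo.Unescape(@"a\\nb") yields the four characters a, \, n, b.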

Unit Tests

In the LexerText class, add a test method for string parsing:

[Test]
public void String()
{
    Assert.AreEqual("", Lexer.String.Parse(@" """" "));
    Assert.AreEqual("abc", Lexer.String.Parse(@" ""abc"" "));
    Assert.AreEqual("abc\ndef", Lexer.String.Parse(@" ""abc\ndef"" "));
    Assert.AreEqual("abc\ndef\n", Lexer.String.Parse(@" ""abc\ndef\n"" "));
    Assert.AreEqual("你好，世界", Lexer.String.Parse(@" ""你好，世界"" "));
}

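Run the unit tests; they all pass.

Booleans

The definition of a boolean is simple, but it brings in a new concept: reserved words.

During lexical analysis, the words true and false are likely to be picked up by the identifier parser and misread as identifiers. So before defining the boolean parser, we first need to adjust the identifier parser.

Create a new class named Words and define true and false in it.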

namespace SwordScript;

public static class Words
{
    public static readonly string[] ALL_RESERVED_WORDS = new[]
    {
        BOOLEAN_TRUE,
        BOOLEAN_FALSE,
    };
    
    public const string BOOLEAN_TRUE = "true";
    public const string BOOLEAN_FALSE = "false";
}

Add a reserved-word check to the identifier parser:

public static readonly Parser<string> Identifier = Parse.Regex(@"[_\p{L}][0-9_\p{L}]*")
    .Where(t => !Words.ALL_RESERVED_WORDS.Contains(t)).Token();

Note: the .Contains(t) method comes from the System.Linq namespace.

Next, add the boolean parser to the Lexer class:

public static readonly Parser<bool> Boolean = Parse.String(Words.BOOLEAN_TRUE)
    .Or(Parse.String(Words.BOOLEAN_FALSE))
    .Text()
    .Select(s => s == Words.BOOLEAN_TRUE)
    .Token();
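One thing to be aware of is that Parse.String only checks a prefix, so the parser above also succeeds on input such as truely and leaves ly unconsumed. If that matters, a possible variant is to read a whole word first and then accept it only when it is exactly true or false. This is an optional sketch under that assumption, not the approach used in the rest of the chapter; BooleanWordDemo is a hypothetical name.

```csharp
using Sprache;

public static class BooleanWordDemo
{
    // Reads a whole identifier-shaped word, then keeps it only if it is
    // exactly "true" or "false".
    public static readonly Parser<bool> Boolean =
        Parse.Regex(@"[_\p{L}][0-9_\p{L}]*")
            .Where(word => word == "true" || word == "false")
            .Select(word => word == "true")
            .Token();
}
```

With this variant, BooleanWordDemo.Boolean.TryParse(" truely ").WasSuccessful is false, while " true " and " false " still parse as before.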

Unit Tests

[Test]
public void Boolean()
{
    Assert.AreEqual(true, Lexer.Boolean.Parse(" true "));
    Assert.AreEqual(false, Lexer.Boolean.Parse(" false "));
    Assert.Catch<ParseException>(() => Lexer.Boolean.Parse(" "));
}

Also add new test cases to the identifier test:

Assert.Catch<ParseException>(() => Lexer.Identifier.Parse(" true "));
Assert.Catch<ParseException>(() => Lexer.Identifier.Parse(" false "));

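Run the unit tests; they all pass.

Null Parsing

Parsing null works like parsing booleans: add null to the reserved-word array, then add the corresponding parser.

Words.cs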

public const string NULL = "null";

Lexer.cs

public static readonly Parser<object> Null = Parse.String(Words.NULL).Return<IEnumerable<char>,object>(null).Token();

Unit Tests

[Test]
public void Null()
{
    Assert.AreEqual(null, Lexer.Null.Parse(" null "));
    Assert.Catch<ParseException>(() => Lexer.Null.Parse(" "));
}

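Run the unit tests; they all pass.

Comment Parsing

With the basic types parsed, we can now handle comments at the lexical stage.

Looking back at the parsers above, each one ends with a call to Token, which skips the whitespace before and after the token. A comment plays the same role in code as whitespace does.

We can therefore define the following comment parser: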

public static readonly CommentParser Comment = new CommentParser();

public static Parser<T> SuperToken<T>(this Parser<T> parser)
{
    return from leftComment in Comment.AnyComment.Token().Many()
        from token in parser.Token()
        from rightComment in Comment.AnyComment.Token().Many()
        select token;
}

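CommentParser is a utility class provided by Sprache that builds parsers for comments. Constructed without arguments, it produces parsers for C-style comments (// and /* */).

SuperToken is an extension method that discards any number of comments on either side of the target parser.

Then replace .Token() with .SuperToken() in the other parsers.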
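As an illustration of that replacement, the sketch below shows a cut-down lexer with only two parsers switched over to .SuperToken(). The class name LexerLayoutSketch is hypothetical, and the reserved-word check on Identifier is left out to keep the example short.

```csharp
using Sprache;

public static class LexerLayoutSketch
{
    // Declared before the parsers that call SuperToken().
    public static readonly CommentParser Comment = new CommentParser();

    // Skips comments and whitespace on both sides of the wrapped parser.
    public static Parser<T> SuperToken<T>(this Parser<T> parser)
    {
        return from leftComment in Comment.AnyComment.Token().Many()
            from token in parser.Token()
            from rightComment in Comment.AnyComment.Token().Many()
            select token;
    }

    // Same parsers as before, with Token() replaced by SuperToken().
    public static readonly Parser<string> Identifier =
        Parse.Regex(@"[_\p{L}][0-9_\p{L}]*").SuperToken();

    public static readonly Parser<long> LongInteger =
        Parse.Chars("0123456789").AtLeastOnce().Text().Select(long.Parse).SuperToken();
}
```

For example, LexerLayoutSketch.Identifier.Parse(" /* note */ abc ") returns "abc".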

Unit Tests

[Test]
public void Comment()
{
    Assert.AreEqual("abc", Lexer.Identifier.Parse(" abc //abc "));
    Assert.AreEqual(123L, Lexer.LongInteger.Parse(" 123 //123 "));
    Assert.AreEqual(.5, Lexer.DoubleFloat.Parse(" .5 /* .6 */"), 0.00001);
    Assert.AreEqual("abc", Lexer.String.Parse(@" /*""abc""*/ ""abc"" "));
    Assert.AreEqual(true, Lexer.Boolean.Parse(" true //false "));
    Assert.Catch<ParseException>(() => Lexer.Boolean.Parse(" //true "));
    Assert.AreEqual("/*abc*/", Lexer.String.Parse(@" /*""abc""*/ ""/*abc*/"" "));
    Assert.AreEqual("abc", Lexer.Identifier.Parse(" /* */ /* */ abc //abc "));
}

Run the unit tests; they all pass.

Note: if you see the following error

System.TypeInitializationException : The type initializer for 'SwordScript.Lexer' threw an exception.
  ----> System.NullReferenceException : Object reference not set to an instance of an object.

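This is caused by the initialization order: static fields are initialized in the order they appear in the class, so a parser that calls SuperToken() runs before Comment has been assigned.

Move the statement public static readonly CommentParser Comment = new CommentParser(); to the top of the Lexer class.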
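The effect can be reproduced without Sprache. In the sketch below, which uses the hypothetical classes WrongOrder and RightOrder, a static field initializer reads another static field that is declared later and therefore still sees null.

```csharp
public static class WrongOrder
{
    // Runs first: Suffix has not been assigned yet, so it is still null here.
    public static readonly string Greeting = "hello " + (Suffix ?? "<null>");
    public static readonly string Suffix = "world";
}

public static class RightOrder
{
    public static readonly string Suffix = "world";

    // Runs second: Suffix already holds "world".
    public static readonly string Greeting = "hello " + (Suffix ?? "<null>");
}
```

Here WrongOrder.Greeting is "hello <null>" and RightOrder.Greeting is "hello world". In the Lexer class the same thing happens with Comment, except that dereferencing the null field throws instead of printing a placeholder.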

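Conclusion

We have now completed lexical analysis for the basic types and identifiers. As we have seen, once the shape of each kind of token is clearly defined, Sprache lets us write the parser we need.

This lays the groundwork for syntactic analysis, which starts in the next chapter.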

SwordScript - Developing a Scripting Language in C# (Part 3): Lexical Analysis
