This chapter specifies the lexical structure of the Java programming language.
Programs are written in Unicode (§3.1), but lexical translations are provided (§3.2) so that Unicode escapes (§3.3) can be used to include any Unicode character using only ASCII characters. Line terminators are defined (§3.4) to support the different conventions of existing host systems while maintaining consistent line numbers.
The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements (§3.5), which are white space (§3.6), comments (§3.7), and tokens. The tokens are the identifiers (§3.8), keywords (§3.9), literals (§3.10), separators (§3.11), and operators (§3.12) of the syntactic grammar.
Summary
1. Java programs are written in Unicode.
2. The input to lexical translation is a stream of Unicode characters, any of which can be written using only ASCII characters by means of Unicode escapes.
3. Line terminators are defined to hide operating system differences, so that line numbers stay consistent across systems.
4. After lexical translation, the program is a sequence of input elements made up of white space, comments, and tokens.
5. The tokens are the identifiers, keywords, literals, separators, and operators of the syntactic grammar.
3.1. Unicode
Programs are written using the Unicode character set (§1.7). Information about this character set and its associated character encodings may be found at https://www.unicode.org/.
The Java SE Platform tracks the Unicode Standard as it evolves. The precise version of Unicode used by a given release is specified in the documentation of the class Character.
The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range (U+D800 to U+DBFF), and the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.
The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.
Some APIs of the Java SE Platform, primarily in the Character class, use 32-bit integers to represent code points as individual entities. The Java SE Platform provides methods to convert between 16-bit and 32-bit representations.
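The conversion between the two representations can be sketched with two methods of the Character class, `Character.toChars` and `Character.toCodePoint`. The class name `CodePointSketch`, its helper names, and the sample code point U+1F600 are chosen here purely for illustration.

```java
final class CodePointSketch {
    /** Encodes one code point as UTF-16: a single code unit, or a surrogate pair. */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** Combines a high and a low surrogate back into a supplementary code point. */
    static int fromSurrogates(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int supplementary = 0x1F600; // illustrative code point greater than U+FFFF
        char[] units = toUtf16(supplementary);
        System.out.printf("U+%X -> %d code units: %04X %04X%n",
                supplementary, units.length, (int) units[0], (int) units[1]);
        System.out.printf("round trip: U+%X%n", fromSurrogates(units[0], units[1]));

        String text = new String(units);
        // length() counts UTF-16 code units; codePointCount() counts code points.
        System.out.println(text.length() + " code units, "
                + text.codePointCount(0, text.length()) + " code point");
    }
}
```

A code point in the range U+0000 to U+FFFF yields a one-element array, which matches the statement that code points and code units coincide in that range.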
This specification uses the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion.
Except for comments (§3.7), identifiers (§3.8), and the contents of character literals, string literals, and text blocks (§3.10.4, §3.10.5, §3.10.6), all input elements (§3.5) in a program are formed only from ASCII characters (or Unicode escapes (§3.3) which result in ASCII characters).
ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.
Summary
1. The Java SE Platform tracks the Unicode Standard as it evolves; the exact Unicode version used by each release is specified in the documentation of the class Character.
2. The Java programming language represents text as sequences of 16-bit code units, using the UTF-16 encoding.
3. Except for comments, identifiers, and the contents of character literals, string literals, and text blocks, all input elements in a program are formed only from ASCII characters (or Unicode escapes that result in ASCII characters).
4. ASCII (ANSI X3.4) is the American Standard Code for Information Interchange.
5. The first 128 characters of the Unicode UTF-16 encoding are the ASCII characters.
3.2. Lexical Translations
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \u*xxxx*, where *xxxx* is a hexadecimal value, represents the UTF-16 code unit whose encoding is *xxxx*. This translation step allows any program to be expressed using only ASCII characters.
2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).
3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens that are the terminal symbols of the syntactic grammar (§2.3).
The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would. There are two exceptions to account for situations that need more granular translation: in step 1, for the processing of contiguous \ characters (§3.3), and in step 3, for the processing of contextual keywords and adjacent > characters (§3.5).
The input characters a--b are tokenized as a, --, and b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program. The tokenization a, -, -, b can be realized with the input characters a- -b (with an ASCII SP character between the two - characters).
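A small illustration of the spaced form, which the compiler accepts. The class and method names are hypothetical; the unspaced form is mentioned only in a comment because it would not compile.

```java
final class LongestMatchDemo {
    static int subtractNegated(int a, int b) {
        // With the space, the tokens are a, -, -, b: binary minus applied to unary minus.
        // Without the space, the longest-match rule would produce a, --, b,
        // and the compiler would reject the expression.
        return a- -b;
    }

    public static void main(String[] args) {
        System.out.println(subtractNegated(5, 3)); // 5 - (-3)
    }
}
```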
It might be supposed that the raw input \\u1234 is translated to a \ character and (following the "longest possible" rule) a Unicode escape of the form \u1234. In fact, the leading \ character causes this raw input to be translated to seven distinct characters: \ \ u 1 2 3 4.
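Summary
The lexical translation steps (turning the raw Unicode character stream into a sequence of tokens):
1. Unicode escapes in the raw input stream are translated to the corresponding Unicode characters.
2. The resulting Unicode stream is translated into a stream of input characters and line terminators.
3. The stream of input characters and line terminators is translated into a sequence of input elements; once white space and comments are discarded, the remaining tokens are the terminal symbols of the syntactic grammar.
Notes:
1. Each step uses the longest possible translation, even when this makes the program incorrect.
2. There are two exceptions: the handling of contiguous \ characters in step 1, and the handling of contextual keywords and adjacent > characters in step 3.
For example:
The input characters a--b are tokenized as a, --, and b, which cannot be part of a correct program, yet this is the result lexical translation produces. To obtain the tokens a, -, -, b, which could be part of a correct program, write a- -b.
For the raw input \\u1234, one might expect a \ followed by the Unicode character denoted by \u1234. In fact, because a \ immediately precedes it, the second \ is not eligible to begin a Unicode escape, so the raw input is translated to seven distinct characters: \ \ u 1 2 3 4.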
3.3. Unicode Escapes
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its raw input, translating the ASCII characters \u followed by four hexadecimal digits to a raw input character which denotes the UTF-16 code unit (§3.1) for the indicated hexadecimal value. One Unicode escape can represent characters in the range U+0000 to U+FFFF; representing supplementary characters in the range U+010000 to U+10FFFF requires two consecutive Unicode escapes. All other characters in the compiler's raw input are recognized as raw input characters and passed unchanged.
This translation step results in a sequence of Unicode input characters, all of which are raw input characters (any Unicode escapes having been reduced to raw input characters).
UnicodeInputCharacter:
UnicodeEscape
RawInputCharacter
UnicodeEscape:
\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
u {u}
HexDigit:
(one of)
0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
RawInputCharacter:
any Unicode character
The \, u, and hexadecimal digits here are all ASCII characters.
The UnicodeInputCharacter production is ambiguous because an ASCII \ character in the compiler's raw input could be reduced to either a RawInputCharacter or the \ of a UnicodeEscape (to be followed by an ASCII u). To avoid ambiguity, for each ASCII \ character in the compiler's raw input, input processing must consider the most recent raw input characters that resulted from this translation step:
- If the most recent raw input character in the result was itself translated from a Unicode escape in the compiler's raw input, then the ASCII \ character is eligible to begin a Unicode escape.
  For example, if the most recent raw input character in the result was a backslash that arose from a Unicode escape \u005c in the raw input, then an ASCII \ character appearing next in the raw input is eligible to begin another Unicode escape.
- Otherwise, consider how many backslashes appeared contiguously as raw input characters in the result, back to a non-backslash character or the start of the result. (It is immaterial whether any such backslash arose from an ASCII \ character in the compiler's raw input or from a Unicode escape \u005c in the compiler's raw input.) If this number is even, then the ASCII \ character is eligible to begin a Unicode escape; if the number is odd, then the ASCII \ character is not eligible to begin a Unicode escape.
  For example, the raw input "\\u2122=\u2122" results in the eleven characters " \ \ u 2 1 2 2 = ™ " because while the second ASCII \ character in the raw input is not eligible to begin a Unicode escape, the third ASCII \ character is eligible, and \u2122 is the Unicode encoding of the character ™.
If an eligible \ is not followed by u, then it is treated as a RawInputCharacter and remains part of the escaped Unicode stream.
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
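The eligibility rules and the two follow-up rules above can be sketched as a small translator. This is a minimal illustration under stated assumptions, not how any particular compiler implements the step: the class `UnicodeEscapeSketch` and its method names are hypothetical, the raw input is taken to be a `String`, and an `IllegalArgumentException` stands in for the compile-time error.

```java
final class UnicodeEscapeSketch {
    /** Translates Unicode escapes in raw input; a malformed escape throws (standing in for a compile-time error). */
    static String translate(String raw) {
        StringBuilder result = new StringBuilder();
        boolean lastFromEscape = false; // was the most recent result character produced by an escape?
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean startsEscape = c == '\\'
                    && isEligible(result, lastFromEscape)
                    && i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                // Ordinary raw input character, including an eligible backslash not followed by u.
                result.append(c);
                lastFromEscape = false;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') j++; // one or more u
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("malformed Unicode escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(raw.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                }
                value = value * 16 + digit;
            }
            // The produced character is appended as-is and never re-scanned.
            result.append((char) value);
            lastFromEscape = true;
            i = j + 4;
        }
        return result.toString();
    }

    /** Applies the two eligibility rules to the result built so far. */
    static boolean isEligible(CharSequence result, boolean lastFromEscape) {
        if (lastFromEscape) return true;
        int backslashes = 0;
        for (int k = result.length() - 1; k >= 0 && result.charAt(k) == '\\'; k--) backslashes++;
        return backslashes % 2 == 0;
    }

    /** Value of an ASCII hexadecimal digit, or -1 for any other character. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static void main(String[] args) {
        // The Java literal below denotes the raw input: backslash, backslash, u1234.
        System.out.println(translate("\\\\u1234").length());
    }
}
```

Tracking whether the last character came from an escape keeps the first rule separate from the backslash count, mirroring the order in which the two rules are stated.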
The character produced by a Unicode escape does not participate in further Unicode escapes.
For example, the raw input \u005cu005a results in the six characters \ u 0 0 5 a, because 005c is the Unicode value for a backslash. It does not result in the character Z, which is Unicode value 005a, because the backslash that resulted from processing the Unicode escape \u005c is not interpreted as the start of a further Unicode escape.
Note that \u005cu005a cannot be written in a string literal to denote the six characters \ u 0 0 5 a. This is because the first two characters resulting from translation, \ and u, are interpreted in a string literal as an illegal escape sequence (§3.10.7).
Fortunately, the rule about contiguous backslash characters helps programmers to craft raw inputs that denote Unicode escapes in a string literal. Denoting the six characters \ u 0 0 5 a in a string literal simply requires another \ to be placed adjacent to the existing \, such as "\\u005a is Z". This works because the second \ in the raw input \\u005a is not eligible to begin a Unicode escape, so the first \ and the second \ are preserved as raw input characters, as are the next five characters u 0 0 5 a. The two \ characters are subsequently interpreted in a string literal as the escape sequence for a backslash, resulting in a string with the desired six characters \ u 0 0 5 a. Without the rule, the raw input \\u005a would be processed as a raw input character \ followed by a Unicode escape \u005a which becomes a raw input character Z; this would be unhelpful because \Z is an illegal escape sequence in a string literal. (Note that the rule translates \u005c\u005c to \\ because the translation of the first Unicode escape to a raw input character \ does not prevent the translation of the second Unicode escape to another raw input character \.)
The rule also allows programmers to craft raw inputs that denote escape sequences in a string literal. For example, the raw input \\\u006e results in the three characters \ \ n because the first \ and the second \ are preserved as raw input characters, while the third \ is eligible to begin a Unicode escape and thus \u006e is translated to a raw input character n. The three characters \ \ n are subsequently interpreted in a string literal as \ n which denotes the escape sequence for a linefeed. (Note that \\\u006e may be written as \u005c\u005c\u006e because each Unicode escape \u005c is translated to a raw input character \ and so the remaining raw input \u006e is preceded by an even number of backslashes and processed as the Unicode escape for n.)
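The string literal cases discussed above can be checked directly with a Java compiler. The class and constant names below are hypothetical; the comments spell out each raw input in words.

```java
final class EscapeInLiteralDemo {
    // Raw input: two backslashes, then u005a. The second backslash is not eligible
    // to begin a Unicode escape, so the literal denotes six characters.
    static final String SIX_CHARS = "\\u005a";

    // Raw input: three backslashes, then u006e. The third backslash is eligible,
    // so the literal is read as two backslashes followed by n, which denotes
    // the two characters backslash and n.
    static final String BACKSLASH_N = "\\\u006e";

    // Raw input: one backslash, then u005a. The escape becomes Z before the
    // literal is interpreted.
    static final String JUST_Z = "\u005a";

    public static void main(String[] args) {
        System.out.println(SIX_CHARS + " has length " + SIX_CHARS.length());
        System.out.println(BACKSLASH_N + " has length " + BACKSLASH_N.length());
        System.out.println(JUST_Z);
    }
}
```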
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u - for example, \u*xxxx* becomes \uu*xxxx* - while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
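The forward direction of this transformation might look like the sketch below. It is an illustration only: the class `AsciiTransformSketch` and its methods are hypothetical, the source is assumed to be a `String` of UTF-16 code units, malformed escapes are copied unchanged rather than reported, and the case of a non-ASCII character that directly follows an odd run of backslashes is not handled.

```java
final class AsciiTransformSketch {
    /** Adds one u to each existing Unicode escape and escapes every non-ASCII code unit. */
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0;           // contiguous backslashes at the end of the translated stream
        boolean lastFromEscape = false; // did the previous translated character come from an escape?
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            boolean eligible = lastFromEscape || backslashRun % 2 == 0;
            int end = (c == '\\' && eligible) ? escapeEnd(source, i) : -1;
            if (end >= 0) {
                // Existing escape: copy it with one extra u.
                out.append("\\u").append(source, i + 1, end);
                char decoded = (char) Integer.parseInt(source.substring(end - 4, end), 16);
                backslashRun = (decoded == '\\') ? backslashRun + 1 : 0;
                lastFromEscape = true;
                i = end;
            } else {
                if (c > 0x7F) {
                    out.append(String.format("\\u%04x", (int) c)); // single-u escape
                } else {
                    out.append(c);
                }
                backslashRun = (c == '\\') ? backslashRun + 1 : 0;
                lastFromEscape = false;
                i++;
            }
        }
        return out.toString();
    }

    /** Index just past a well-formed escape whose backslash is at i, or -1 if there is none. */
    static int escapeEnd(String s, int i) {
        int j = i + 1;
        while (j < s.length() && s.charAt(j) == 'u') j++;
        if (j == i + 1 || j + 4 > s.length()) return -1;
        for (int k = j; k < j + 4; k++) {
            if (!isHex(s.charAt(k))) return -1;
        }
        return j + 4;
    }

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    public static void main(String[] args) {
        // Input: a non-ASCII letter, then a string literal that already holds an escape.
        System.out.println(toAscii("\u00e9 = \"\\u00e9\""));
    }
}
```

The eligibility bookkeeping matters here too: a backslash preceded by an odd run of backslashes does not begin an escape, so the text after it must not gain an extra u.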
A Java compiler should use the \u*xxxx* notation as an output format to display Unicode characters when a suitable font is not available.
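Summary
- 1. The raw source of a Java program may contain any Unicode character, and can always be expressed using only ASCII characters by means of Unicode escapes.
- 2. The Unicode escape step translates each well-formed Unicode escape into the UTF-16 code unit it denotes; all other characters are passed through unchanged.
- 3. Resolving the ambiguity between \u005c and the \ character:
  - If the most recent character in the result was produced by a Unicode escape (for example a \ that came from \u005c), it stands for itself, and an ASCII \ that follows is eligible to begin a Unicode escape.
  - Otherwise, count the contiguous \ characters immediately preceding this \ in the result: if the count is even, the \ is eligible to begin a Unicode escape; if it is odd, it is not.
- 4. An eligible \ that is not followed by u does not begin a Unicode escape; it is kept as an ordinary raw input character.
- 5. If an eligible \ is followed by one or more u and the last u is not followed by four hexadecimal digits, a compile-time error occurs.
- 6. A character produced by a Unicode escape does not take part in further Unicode escapes.
- 7. Unicode escapes in string literals:
  - A well-formed escape whose \ is eligible is translated to the corresponding character before the string literal is interpreted.
  - To keep the text of an escape from being translated, use the contiguous-backslash rule (for this purpose, writing \ is equivalent to writing \u005c).
- 8. The standard transformation to ASCII rewrites each existing Unicode escape with one extra u and each non-ASCII character as an escape with a single u. The transformed source is equally acceptable to a Java compiler. To restore the original Unicode source, remove one u from each multi-u escape and convert each single-u escape to the character it denotes.
  Note: single-u form: \uXXXX; multi-u form: \uuXXXX.
- 9. When a suitable font is not available, a Java compiler should use the \u*xxxx* notation as an output format to display Unicode characters.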
3.4. Line Terminators
A Java compiler next divides the sequence of Unicode input characters into lines by recognizing line terminators.
LineTerminator:
the ASCII LF character, also known as "newline"
the ASCII CR character, also known as "return"
the ASCII CR character followed by the ASCII LF character
InputCharacter:
UnicodeInputCharacter but not CR or LF
Lines are terminated by the ASCII characters CR, or LF, or CR LF. The two characters CR immediately followed by LF are counted as one line terminator, not two.
A line terminator specifies the termination of the // form of a comment (§3.7).
The lines defined by line terminators may determine the line numbers produced by a Java compiler.
The result is a sequence of line terminators and input characters, which are the terminal symbols for the third step in the tokenization process.
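The line terminator rule, including the treatment of CR LF as a single terminator, can be sketched as a short counting routine. The class and method names are hypothetical, and the input is assumed to be text in which Unicode escapes have already been translated.

```java
final class LineTerminatorSketch {
    /** Counts line terminators, treating CR immediately followed by LF as one. */
    static int countLineTerminators(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // CR LF is a single terminator: consume the LF as well.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                count++;
            } else if (c == '\n') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countLineTerminators("a\r\nb\rc\nd"));
    }
}
```

Note that LF followed by CR is two terminators, since only the order CR then LF is combined.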
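Summary
1. A line terminator is one of the ASCII characters or combinations LF, CR, or CR LF (covering the conventions of different operating systems).
2. What line terminators do:
- a. Delimit the lines from which a Java compiler may determine line numbers.
- b. Terminate the // form of a comment.
- c. Together with input characters, serve as the terminal symbols for the third step of the tokenization process.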
3.5. Input Elements and Tokens
The input characters and line terminators that result from Unicode escape processing (§3.3) and then input line recognition (§3.4) are reduced to a sequence of input elements.
Input:
{InputElement} [Sub]
InputElement:
WhiteSpace
Comment
Token
Token:
Identifier
Keyword
Literal
Separator
Operator
Sub:
the ASCII SUB character, also known as "control-Z"
Those input elements that are not white space or comments are tokens. The tokens are the terminal symbols of the syntactic grammar (§2.3).
White space (§3.6) and comments (§3.7) can serve to separate tokens that, if adjacent, might be tokenized in another manner.
For example, the input characters - and = can form the operator token -= (§3.12) only if there is no intervening white space or comment. As another example, the ten input characters staticvoid form a single identifier token while the eleven input characters static void (with an ASCII SP character between c and v) form a pair of keyword tokens, static and void, separated by white space.
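Both examples, together with a comment acting as a separator, can be tried in a few lines. The class and method names are hypothetical.

```java
final class TokenSeparationDemo {
    static int demo() {
        int staticvoid = 10; // a single identifier token, not the keywords static and void
        staticvoid -= 3;     // adjacent - and = form the single operator -=
        int/* a comment separates tokens just as white space does */copy = staticvoid;
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```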
As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.
The Input production is ambiguous, meaning that for some sequences of input characters, there is more than one way to reduce the input characters to input elements (that is, to tokenize the input characters). Ambiguities are resolved as follows:
- A sequence of input characters that could be reduced to either an identifier token or a literal token is always reduced to a literal token.
- A sequence of input characters that could be reduced to either an identifier token or a reserved keyword token (§3.9) is always reduced to a reserved keyword token.
- A sequence of input characters that could be reduced to either a contextual keyword token or to other (non-keyword) tokens is reduced according to context, as specified in §3.9.
- If the input character > appears in a type context (§4.11), that is, as part of a Type or an UnannType in the syntactic grammar (§4.1, §8.3), it is always reduced to the numerical comparison operator >, even when it could be combined with an adjacent > character to form a different operator.
  Without this rule for > characters, two consecutive > brackets in a type such as List<List<String>> would be tokenized as the signed right shift operator >>, while three consecutive > brackets in a type such as List<List<List<String>>> would be tokenized as the unsigned right shift operator >>>. Worse, the tokenization of four or more consecutive > brackets in a type such as List<List<List<List<String>>>> would be ambiguous, as various combinations of >, >>, and >>> tokens could represent the >>>> characters.
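The contrast between > in a type context and > in an expression can be seen in a short example. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AngleBracketDemo {
    static int nestedSize() {
        // In a type context, each > is its own token, so >> closes two type argument lists.
        List<List<String>> nested = new ArrayList<>();
        nested.add(new ArrayList<>());
        nested.get(0).add("a");
        return nested.get(0).size();
    }

    static int signedShift(int value, int distance) {
        return value >> distance;  // in an expression, >> is a single operator token
    }

    static int unsignedShift(int value, int distance) {
        return value >>> distance; // likewise, >>> is a single operator token
    }

    public static void main(String[] args) {
        System.out.println(nestedSize());
        System.out.println(signedShift(-16, 2));
        System.out.println(unsignedShift(-16, 28));
    }
}
```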
Consider two tokens *x* and *y* in the resulting input stream. If *x* precedes *y*, then we say that *x* is to the left of *y* and that *y* is to the right of *x*.
For example, in this simple piece of code:
class Empty {
}
we say that the } token is to the right of the { token, even though it appears, in this two-dimensional representation, downward and to the left of the { token. This convention about the use of the words left and right allows us to speak, for example, of the right-hand operand of a binary operator or of the left-hand side of an assignment.
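Summary
- 1. The input is a sequence of input elements, optionally followed by the ASCII SUB character.
- 2. An input element is white space, a comment, or a token.
- 3. White space and comments can separate tokens; characters that are adjacent with nothing between them may be tokenized together as a single token.
- 4. For compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.
- 5. Resolving ambiguity during tokenization:
  - A sequence of input characters that could be either an identifier token or a literal token is reduced to a literal token.
  - A sequence of input characters that could be either an identifier token or a reserved keyword token is reduced to a reserved keyword token.
  - A sequence of input characters that could be either a contextual keyword token or other (non-keyword) tokens is reduced according to context (§3.9).
  - A > character in a type context is always reduced to the single operator >, even when several appear consecutively; it is never combined into the signed right shift operator >> or the unsigned right shift operator >>>.
- 6. The left/right convention for tokens: tokens are ordered as the input is read, line by line from top to bottom and from left to right within a line. A token that appears earlier is to the left; one that appears later is to the right.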