Rob Pike
r@golang.org
GTUG Sydney
Aug 30, 2011
Many programming problems realign one data structure to fit another structure.
- breaking text into lines
- "blocking" and "deblocking"
- packet assembly and disassembly
- parsing
- lexing
The pieces on either side have independent state, lookahead, buffers,
...
Can be messy to do well.
Coroutines were invented to solve this problem!
They enable us to write the two pieces independently.
Let's look at this topic in the context of a lexer.
Wanted to replace the old Go template package.
It had many problems:
- inflexible
- inexpressive
- code too fragile
Key change was re-engineering with a true lexer, parser,
and evaluator.
Has arbitrary text plus actions in {{ }}.
Evaluation: {{.Title}}
Constants and functions: {{printf "%g: %#3X" 1.2+2i 123}}
Control structures {{range $s.Text}} {{.}} {{end}}
Must tokenize:
- the stuff outside actions
- action delimiters
- identifiers
- numeric constants
- string constants
- and others
Two things identify each lexed item:
- its type
- its value; a string is all we need
// item represents a token returned from the scanner.
type item struct {
    typ itemType // Type, such as itemNumber.
    val string   // Value, such as "23.2".
}
The type is just an integer constant.
We use iota to define the values.
// itemType identifies the type of lex items.
type itemType int
const (
    itemError itemType = iota // error occurred; value is text of error
    itemDot                   // the cursor, spelled '.'
    itemEOF
    itemElse                  // else keyword
    itemEnd                   // end keyword
    itemField                 // identifier, starting with '.'
    itemIdentifier            // identifier
    itemIf                    // if keyword
    itemLeftMeta              // left meta-string
    itemNumber                // number
    itemPipe                  // pipe symbol
    itemRange                 // range keyword
    itemRawString             // raw quoted string (includes quotes)
    itemRightMeta             // right meta-string
    itemString                // quoted string (includes quotes)
    itemText                  // plain text
)
Printf has a convention making it easy to print any type: just define
a String() method:
func (i item) String() string {
    switch i.typ {
    case itemEOF:
        return "EOF"
    case itemError:
        return i.val
    }
    if len(i.val) > 10 {
        return fmt.Sprintf("%.10q...", i.val)
    }
    return fmt.Sprintf("%q", i.val)
}
Many approaches available:
- use a tool such as lex or ragel
- use regular expressions
- use states, actions, and a switch statement
Nothing wrong with using a tool but:
- hard to get good errors (can be very important)
- tend to require learning another language
- result can be large, even slow
- often a poor fit
- but lexing is easy to do yourself!
Regular expressions? Blogged about this last week.
- overkill
- slow
- can explore the state space too much
- misuse of a dynamic engine to ask static questions
It's easy!
Plus, most programming languages lex pretty much the same tokens, so once
we learn how, it's trivial to adapt the lexer for the next purpose.
- an argument both for and against tools
Many people will tell you to write a switch statement,
something like this:
// One iteration:
switch state {
case state1:
    state = action1()
case state2:
    state = action2()
case state3:
    state = action3()
}
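To make that pattern concrete, here is a hypothetical toy machine in the switch style; it splits an input into a letter section and a digit section (the names `scan`, `stateLetters`, and so on are invented for the example, not part of the template lexer):

```go
package main

import "fmt"

type state int

const (
	stateLetters state = iota
	stateDigits
	stateDone
)

// scan runs the switch-style machine over input and reports
// where the letter section and the digit section end.
func scan(input string) (lettersEnd, digitsEnd int) {
	pos := 0
	st := stateLetters
	for st != stateDone {
		switch st {
		case stateLetters:
			for pos < len(input) && input[pos] >= 'a' && input[pos] <= 'z' {
				pos++
			}
			lettersEnd = pos
			st = stateDigits // each action recomputes the next state
		case stateDigits:
			for pos < len(input) && input[pos] >= '0' && input[pos] <= '9' {
				pos++
			}
			digitsEnd = pos
			st = stateDone
		}
	}
	return
}

func main() {
	fmt.Println(scan("abc123")) // 3 6
}
```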
Boring and repetitive and error-prone, but anyway:
Why switch?
After each action, you know where you want to be;
the new state is the result of the action.
But we throw the info away and recompute it from the state.
(A consequence of returning to the caller.)
A tool can compile that out, but so can we.
State represents where we are and what we expect.
Action represents what we do.
Actions result in a new state.
Let's put them together: a state function.
Executes an action, returns the next state—as a state function.
Recursive definition but simple and clear.
// stateFn represents the state of the scanner
// as a function that returns the next state.
type stateFn func(*lexer) stateFn
Our state machine is trivial:
just run until the state goes to nil, representing "done".
// run lexes the input by executing state functions
// until the state is nil.
func run() {
    for state := startState; state != nil; {
        state = state(lexer)
    }
}
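The same toy letter/digit machine, rewritten with state functions, shows the shape of the idea; `scanner`, `scanLetters`, and `scanDigits` are invented stand-ins for the real lexer's types:

```go
package main

import "fmt"

// scanner is a minimal stand-in for the lexer in the slides.
type scanner struct {
	input string
	pos   int
	marks []int // positions recorded by each state
}

// stateFn returns the next state, exactly as in the slides.
type stateFn func(*scanner) stateFn

func scanLetters(s *scanner) stateFn {
	for s.pos < len(s.input) && s.input[s.pos] >= 'a' && s.input[s.pos] <= 'z' {
		s.pos++
	}
	s.marks = append(s.marks, s.pos)
	return scanDigits // the action itself names the next state
}

func scanDigits(s *scanner) stateFn {
	for s.pos < len(s.input) && s.input[s.pos] >= '0' && s.input[s.pos] <= '9' {
		s.pos++
	}
	s.marks = append(s.marks, s.pos)
	return nil // done: stops the run loop
}

// run is the trivial machine: loop until the state is nil.
func run(s *scanner) {
	for state := stateFn(scanLetters); state != nil; {
		state = state(s)
	}
}

func main() {
	s := &scanner{input: "abc123"}
	run(s)
	fmt.Println(s.marks) // [3 6]
}
```

Compared with the switch version, the "where to go next" information is never thrown away: each action returns it directly.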
How do we make tokens available to the client?
Tokens can emerge at times that are inconvenient to stop to return to
the caller.
Use concurrency:
Run the state machine as a goroutine,
emit values on a channel.
Here is the lexer type. Notice the channel of items; ignore the rest for
now.
// lexer holds the state of the scanner.
type lexer struct {
    name  string    // used only for error reports.
    input string    // the string being scanned.
    start int       // start position of this item.
    pos   int       // current position in the input.
    width int       // width of last rune read from input.
    items chan item // channel of scanned items.
}
A lexer initializes itself to lex a string and launches the state machine
as a goroutine, returning the lexer itself and a channel of items.
The API will change; don't worry about it now.
func lex(name, input string) (*lexer, chan item) {
    l := &lexer{
        name:  name,
        input: input,
        items: make(chan item),
    }
    go l.run() // Concurrently run state machine.
    return l, l.items
}
Here's the real state machine run function, which runs as a goroutine.
// run lexes the input by executing state functions until
// the state is nil.
func (l *lexer) run() {
    for state := lexText; state != nil; {
        state = state(l)
    }
    close(l.items) // No more tokens will be delivered.
}
A token is a type and a value, but (yay Go) the value can just be sliced
from the input string.
The lexer remembers where it is in the input and the emit routine just
lobs that substring to the caller as the token's value.
    input string // the string being scanned.
    start int    // start position of this item.
    pos   int    // current position in the input.

// emit passes an item back to the client.
func (l *lexer) emit(t itemType) {
    l.items <- item{t, l.input[l.start:l.pos]}
    l.start = l.pos
}
As the lexer begins, it's looking for plain text, so the initial state
is the function lexText.
It absorbs plain text until a "left meta" is encountered.
const leftMeta = "{{"

func lexText(l *lexer) stateFn {
    for {
        if strings.HasPrefix(l.input[l.pos:], leftMeta) {
            if l.pos > l.start {
                l.emit(itemText)
            }
            return lexLeftMeta // Next state.
        }
        if l.next() == eof {
            break
        }
    }
    // Correctly reached EOF.
    if l.pos > l.start {
        l.emit(itemText)
    }
    l.emit(itemEOF) // Useful to make EOF a token.
    return nil      // Stop the run loop.
}
A trivial state function.
When we get here, we know there's a leftMeta in the input.
func lexLeftMeta(l *lexer) stateFn {
    l.pos += len(leftMeta)
    l.emit(itemLeftMeta)
    return lexInsideAction // Now inside {{ }}.
}
func lexInsideAction(l *lexer) stateFn {
    // Either number, quoted string, or identifier.
    // Spaces separate and are ignored.
    // Pipe symbols separate and are emitted.
    for {
        if strings.HasPrefix(l.input[l.pos:], rightMeta) {
            return lexRightMeta
        }
        switch r := l.next(); {
        case r == eof || r == '\n':
            return l.errorf("unclosed action")
        case isSpace(r):
            l.ignore()
        case r == '|':
            l.emit(itemPipe)
        case r == '"':
            return lexQuote
        case r == '`':
            return lexRawQuote
        case r == '+' || r == '-' || '0' <= r && r <= '9':
            l.backup()
            return lexNumber
        case isAlphaNumeric(r):
            l.backup()
            return lexIdentifier
        }
    }
}
This will give you the flavor.
// next returns the next rune in the input.
func (l *lexer) next() rune {
    if l.pos >= len(l.input) {
        l.width = 0
        return eof
    }
    r, w := utf8.DecodeRuneInString(l.input[l.pos:])
    l.width = w
    l.pos += w
    return r
}

// ignore skips over the pending input before this point.
func (l *lexer) ignore() {
    l.start = l.pos
}

// backup steps back one rune.
// Can be called only once per call of next.
func (l *lexer) backup() {
    l.pos -= l.width
}

// peek returns but does not consume
// the next rune in the input.
func (l *lexer) peek() rune {
    r := l.next()
    l.backup()
    return r
}

// accept consumes the next rune
// if it's from the valid set.
func (l *lexer) accept(valid string) bool {
    if strings.IndexRune(valid, l.next()) >= 0 {
        return true
    }
    l.backup()
    return false
}

// acceptRun consumes a run of runes from the valid set.
func (l *lexer) acceptRun(valid string) {
    for strings.IndexRune(valid, l.next()) >= 0 {
    }
    l.backup()
}
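To see the helpers in action, here is a pared-down, byte-oriented sketch (the real helpers decode UTF-8 runes and track their width the same way) that consumes a hypothetical date from the front of the input; `scanDate` is invented for the example:

```go
package main

import (
	"fmt"
	"strings"
)

// lexer is a byte-oriented sketch of the scanner in the slides.
type lexer struct {
	input string
	pos   int
	width int
}

// next returns the next byte, or 0 at end of input.
func (l *lexer) next() byte {
	if l.pos >= len(l.input) {
		l.width = 0
		return 0
	}
	l.width = 1
	b := l.input[l.pos]
	l.pos++
	return b
}

// backup undoes the last next; safe at EOF because width is 0 there.
func (l *lexer) backup() { l.pos -= l.width }

// accept consumes the next byte if it's from the valid set.
func (l *lexer) accept(valid string) bool {
	if strings.IndexByte(valid, l.next()) >= 0 {
		return true
	}
	l.backup()
	return false
}

// acceptRun consumes a run of bytes from the valid set.
func (l *lexer) acceptRun(valid string) {
	for strings.IndexByte(valid, l.next()) >= 0 {
	}
	l.backup()
}

// scanDate consumes a YYYY-MM-DD prefix and returns the text consumed.
func scanDate(l *lexer) string {
	l.acceptRun("0123456789")
	l.accept("-")
	l.acceptRun("0123456789")
	l.accept("-")
	l.acceptRun("0123456789")
	return l.input[:l.pos]
}

func main() {
	l := &lexer{input: "2011-08-30 rest"}
	fmt.Println(scanDate(l)) // 2011-08-30
}
```

Notice how accept and acceptRun make the scanning code read almost like the grammar it recognizes.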
func lexNumber(l *lexer) stateFn {
    // Optional leading sign.
    l.accept("+-")
    // Is it hex?
    digits := "0123456789"
    if l.accept("0") && l.accept("xX") {
        digits = "0123456789abcdefABCDEF"
    }
    l.acceptRun(digits)
    if l.accept(".") {
        l.acceptRun(digits)
    }
    if l.accept("eE") {
        l.accept("+-")
        l.acceptRun("0123456789")
    }
    // Is it imaginary?
    l.accept("i")
    // Next thing mustn't be alphanumeric.
    if isAlphaNumeric(l.peek()) {
        l.next()
        return l.errorf("bad number syntax: %q",
            l.input[l.start:l.pos])
    }
    l.emit(itemNumber)
    return lexInsideAction
}
This is more accepting than it should be, but not by much. The caller must
call Atof to validate.
Easy to handle: emit the bad token and shut down the machine.
// error returns an error token and terminates the scan
// by passing back a nil pointer that will be the next
// state, terminating l.run.
func (l *lexer) errorf(format string, args ...interface{}) stateFn {
    l.items <- item{
        itemError,
        fmt.Sprintf(format, args...),
    }
    return nil
}
Concurrency makes the lexer easy to design.
Goroutines allow lexer and caller (parser) each to run at its own rate,
as clean sequential code.
Channels give us a clean way to emit tokens.
Can't run a goroutine to completion during initialization.
Forbidden by the language specification.
(Raises awful issues about order of init, best avoided.)
That means we can't lex & parse a template during init.
The goroutine is a problem....
...but it's not necessary anyway.
The work is done by the design; now we just adjust the API.
We can change the API to hide the channel, provide a function to get the
next token, and rewrite the run function.
It's easy.
Hide the channel and buffer it slightly, turning it into a ring buffer.
// lex creates a new scanner for the input string.
func lex(name, input string) *lexer {
    l := &lexer{
        name:  name,
        input: input,
        state: lexText,
        items: make(chan item, 2), // Two items sufficient.
    }
    return l
}
Traditional lexer API: return next item.
Includes the modified state machine runner.
// nextItem returns the next item from the input.
func (l *lexer) nextItem() item {
    for {
        select {
        case item := <-l.items:
            return item
        default:
            l.state = l.state(l)
        }
    }
    panic("not reached")
}
We now have a traditional API for a lexer with a simple, concurrent implementation
under the covers.
Even though the implementation is no longer truly concurrent, it still
has all the advantages of concurrent design.
We wouldn't have such a clean, efficient design if we hadn't thought about
the problem in a concurrent way, free of worry about "restart".
The model completely removes concerns about "structural mismatch".
Concurrency is not about parallelism.
(Although it can enable parallelism).
Concurrency is a way to design a program by decomposing it into independently
executing pieces.
The result can be clean, efficient, and very adaptable.
Lexers are fun.
Concurrency is fun.
Go is fun.
Go: http://golang.org
New templates: http://golang.org/pkg/exp/template/
(Next release will move them out of experimental.)