How to Optimize Pyparsing's Performance and Memory Usage


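Pyparsing is a powerful Python library for parsing a wide variety of text formats. When it is used on large inputs, however, you may run into performance and memory problems. As an example, the program below uses Pyparsing to parse a simple arithmetic expression, 3*(1+2*3*(4+5)):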

from pyparsing import *

newline = LineEnd()
minus = Literal('-')
plus = Literal('+')
star = Literal('*')
dash = Literal('/')
dashdash = Literal('//')
percent = Literal('%')
starstar = Literal('**')
lparen = Literal('(')
rparen = Literal(')')
dot = Literal('.')
comma = Literal(',')
eq = Literal('=')
eqeq = Literal('==')
lt = Literal('<')
gt = Literal('>')
le = Literal('<=')
ge = Literal('>=')
not_ = Keyword('not')
and_ = Keyword('and')
or_ = Keyword('or')
ident = Word(alphas)
integer = Word(nums)

expr = Forward()
parenthized = Group(lparen + expr + rparen)
trailer = (dot + ident)
atom = ident | integer | parenthized
factor = Forward()
power = atom + ZeroOrMore(trailer) + Optional(starstar + factor)
factor << (ZeroOrMore(minus | plus) + power)
term = ZeroOrMore(factor + (star | dashdash | dash | percent)) + factor
arith = ZeroOrMore(term + (minus | plus)) + term
comp = ZeroOrMore(arith + (eqeq | le | ge | lt | gt)) + arith
boolNot = ZeroOrMore(not_) + comp
boolAnd = ZeroOrMore(boolNot + and_) + boolNot
boolOr = ZeroOrMore(boolAnd + or_) + boolAnd
match = ZeroOrMore(ident + eq) + boolOr
expr << match
statement = expr + newline
program = OneOrMore(statement)

print(program.parseString('3*(1+2*3*(4+5))\n'))

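When we run this program, parsing takes a surprisingly long time for such a short expression, and memory usage is high as well. The grammar is the cause: every rule of the form ZeroOrMore(X + op) + X first parses X, fails to find an operator, backtracks, and then parses the same X again, and this repeats at every precedence level.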
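To make the cost of that backtracking concrete, here is a minimal sketch in plain Python. It is a toy model, not Pyparsing's actual implementation: the function `count_atom_attempts` and its `ops` parameter are made up for illustration. It recognizes digits separated by operators, using one `ZeroOrMore(sub + op) + sub` rule per operator character, and counts how often the lowest-level rule is tried.

```python
def count_atom_attempts(text, ops):
    """Recognize `text` with one `ZeroOrMore(sub + op) + sub` rule per char in `ops`.

    Returns (end_position, number_of_atom_attempts).
    Hypothetical toy recognizer; it only mimics the shape of the grammar above.
    """
    calls = 0

    def atom(pos):
        # Lowest-level rule: one or more digits.
        nonlocal calls
        calls += 1
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return end if end > pos else None

    def level(i, pos):
        if i == len(ops):
            return atom(pos)
        # ZeroOrMore(sub + op): parse an operand, then require this level's operator.
        while True:
            end = level(i + 1, pos)
            if end is None or end >= len(text) or text[end] != ops[i]:
                break  # no operator: the operand just parsed is discarded
            pos = end + 1
        # Trailing `+ sub`: the same operand is parsed again from the same position.
        return level(i + 1, pos)

    return level(0, 0), calls


# Seven precedence levels, input is a single digit:
print(count_atom_attempts("7", "+*%<=&|"))  # (1, 128)
```

Each level doubles the work of the level below it, so a single operand behind seven levels is attempted 2**7 = 128 times in this model. Nested parentheses restart the whole chain inside each operand, which is why the real grammar slows down so quickly.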

2. Solution

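To fix this, we can use Pyparsing's packrat parsing feature. Packrat parsing is a form of memoization: the result of applying a grammar element at a given input position is cached, so that when the parser backtracks and tries the same element at the same position again, the cached result is reused instead of being recomputed. This can speed up parsing dramatically for grammars that backtrack heavily. The cache itself occupies memory, so the speedup is bought with some extra bookkeeping.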
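The idea can be sketched without Pyparsing. The following is a toy recognizer in plain Python, not Pyparsing's implementation; the name `count_atom_attempts_memo` and its `ops` parameter are assumptions made for this illustration. Each precedence level uses the `ZeroOrMore(sub + op) + sub` shape, and results are cached per (level, position) pair.

```python
def count_atom_attempts_memo(text, ops):
    """Toy packrat-style recognizer with one precedence level per char in `ops`.

    Returns (end_position, number_of_atom_attempts, number_of_cache_entries).
    Hypothetical sketch for illustration only.
    """
    calls = 0
    cache = {}  # (level, position) -> end position, or None on failure

    def atom(pos):
        # Lowest-level rule: one or more digits.
        nonlocal calls
        calls += 1
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return end if end > pos else None

    def level(i, pos):
        # Memoized entry point: compute each (level, position) at most once.
        key = (i, pos)
        if key not in cache:
            cache[key] = parse_level(i, pos)
        return cache[key]

    def parse_level(i, pos):
        if i == len(ops):
            return atom(pos)
        # ZeroOrMore(sub + op)
        while True:
            end = level(i + 1, pos)
            if end is None or end >= len(text) or text[end] != ops[i]:
                break
            pos = end + 1
        # Trailing `+ sub`: served from the cache instead of being re-parsed.
        return level(i + 1, pos)

    return level(0, 0), calls, len(cache)


print(count_atom_attempts_memo("7", "+*%<=&|"))  # (1, 1, 8)
```

With the cache in place the lowest-level rule runs once for this input, and the price is one cache entry per (level, position) pair that was visited.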

To enable packrat parsing in Pyparsing, call the ParserElement.enablePackrat() method immediately after importing Pyparsing. For example:

from pyparsing import *

ParserElement.enablePackrat()

newline = LineEnd()
# ...

print(program.parseString('3*(1+2*3*(4+5))\n'))

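With packrat parsing enabled, running the program again shows a clear improvement in parsing speed. Memory usage is worth measuring as well: less repeated work means fewer temporary objects, but the cache itself takes up space.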
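One way to check claims like these on your own machine is a small measurement helper built on the standard library. This is only a sketch: `measure` is a hypothetical helper name, and tracemalloc itself slows the code down, so the timings are useful for comparing two runs rather than as absolute numbers.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Call func(*args, **kwargs); return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


# Stand-in workload so the sketch runs on its own:
result, seconds, peak = measure(sorted, [3, 1, 2])
print(f"{seconds:.6f} s, peak {peak} bytes")

# With the grammar from this article it would be used like this:
# result, seconds, peak = measure(program.parseString, '3*(1+2*3*(4+5))\n')
```

Run the script once with the ParserElement.enablePackrat() line and once without it, each in a fresh Python process, and compare the two sets of numbers.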

Code example

from pyparsing import *

ParserElement.enablePackrat()

newline = LineEnd()
minus = Literal('-')
plus = Literal('+')
star = Literal('*')
dash = Literal('/')
dashdash = Literal('//')
percent = Literal('%')
starstar = Literal('**')
lparen = Literal('(')
rparen = Literal(')')
dot = Literal('.')
comma = Literal(',')
eq = Literal('=')
eqeq = Literal('==')
lt = Literal('<')
gt = Literal('>')
le = Literal('<=')
ge = Literal('>=')
not_ = Keyword('not')
and_ = Keyword('and')
or_ = Keyword('or')
ident = Word(alphas)
integer = Word(nums)

expr = Forward()
parenthized = Group(lparen + expr + rparen)
trailer = (dot + ident)
atom = ident | integer | parenthized
factor = Forward()
power = atom + ZeroOrMore(trailer) + Optional(starstar + factor)
factor << (ZeroOrMore(minus | plus) + power)
term = ZeroOrMore(factor + (star | dashdash | dash | percent)) + factor
arith = ZeroOrMore(term + (minus | plus)) + term
comp = ZeroOrMore(arith + (eqeq | le | ge | lt | gt)) + arith
boolNot = ZeroOrMore(not_) + comp
boolAnd = ZeroOrMore(boolNot + and_) + boolNot
boolOr = ZeroOrMore(boolAnd + or_) + boolAnd
match = ZeroOrMore(ident + eq) + boolOr
expr << match
statement = expr + newline
program = OneOrMore(statement)

print(program.parseString('3*(1+2*3*(4+5))\n'))

Output:

['3', '*', ['(', '1', '+', '2', '*', '3', '*', ['(', '4', '+', '5', ')'], ')']]