Parsing Java and Kotlin source files with a Python script


Background

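Suppose we need a script that counts how image resources are used in a codebase, and we pick Python for the job. After some searching, we settled on javalang for parsing Java files and kopyt for parsing Kotlin files.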
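The split by file type described above can be sketched as a small driver. This is only an illustration under stated assumptions: `count_resources` and the `Extractor` type are hypothetical names, and each extractor is assumed to be a function that takes source text and returns the resource names found in it (in practice, a wrapper around javalang or kopyt).

```python
from collections import Counter
from pathlib import Path
from typing import Callable, Dict, Iterable

# Hypothetical type: an extractor maps source text to resource names.
Extractor = Callable[[str], Iterable[str]]


def count_resources(root: Path, extractors: Dict[str, Extractor]) -> Counter:
    """Walk root and count resource names, picking an extractor by file suffix."""
    counts: Counter = Counter()
    for path in sorted(root.rglob("*")):
        extractor = extractors.get(path.suffix)
        # Skip directories and files with no registered extractor.
        if extractor is None or not path.is_file():
            continue
        source = path.read_text(encoding="utf-8", errors="replace")
        counts.update(extractor(source))
    return counts
```

It would be called with something like `count_resources(Path("app/src"), {".java": java_extractor, ".kt": kt_extractor})`, where both extractors are placeholders for the parsing code discussed in the rest of this post.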

Notes on the two libraries

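javalang parses Java code into an abstract syntax tree following the Java 8 grammar. It has been open source for quite a while and has close to 700 stars. kopyt was developed with javalang as a reference, so many of its data structures are very similar; it supports Kotlin 1.5.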

Below are my notes on learning and using the two libraries, focusing mainly on the statement and expression nodes.

javalang

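The basic statement structure is Statement, and its subclasses are easy to understand from their names. The basic expression statement is StatementExpression, which holds an Expression member.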

Statement(Node) (javalang.tree)
    IfStatement(Statement) (javalang.tree)
    WhileStatement(Statement) (javalang.tree)
    DoStatement(Statement) (javalang.tree)
    ForStatement(Statement) (javalang.tree)
    AssertStatement(Statement) (javalang.tree)
    BreakStatement(Statement) (javalang.tree)
    ContinueStatement(Statement) (javalang.tree)
    ReturnStatement(Statement) (javalang.tree)
    ThrowStatement(Statement) (javalang.tree)
    SynchronizedStatement(Statement) (javalang.tree)
    TryStatement(Statement) (javalang.tree)
    SwitchStatement(Statement) (javalang.tree)
    BlockStatement(Statement) (javalang.tree)
    StatementExpression(Statement) (javalang.tree)
    CatchClause(Statement) (javalang.tree)

Next, look at the basic structure that statements break down into, namely Expression:

Expression(Node) (javalang.tree)
    Assignment(Expression) (javalang.tree)
    TernaryExpression(Expression) (javalang.tree)
    BinaryOperation(Expression) (javalang.tree)
    Cast(Expression) (javalang.tree)
    MethodReference(Expression) (javalang.tree)
    LambdaExpression(Expression) (javalang.tree)
    Primary(Expression) (javalang.tree)
        Literal(Primary) (javalang.tree)
        This(Primary) (javalang.tree)
        MemberReference(Primary) (javalang.tree)
        Invocation(Primary) (javalang.tree)
            ExplicitConstructorInvocation(Invocation) (javalang.tree)
            SuperConstructorInvocation(Invocation) (javalang.tree)
            MethodInvocation(Invocation) (javalang.tree)
            SuperMethodInvocation(Invocation) (javalang.tree)
        SuperMemberReference(Primary) (javalang.tree)
        ClassReference(Primary) (javalang.tree)
            VoidClassReference(ClassReference) (javalang.tree)
        Creator(Primary) (javalang.tree)
            ArrayCreator(Creator) (javalang.tree)
            ClassCreator(Creator) (javalang.tree)
            InnerClassCreator(Creator) (javalang.tree)
    ArraySelector(Expression) (javalang.tree)

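The one to focus on is the method call expression, MethodInvocation (assume we need to match the setImageResource method).
Also note that LambdaExpression is a nested structure: its body is either a single expression or a list of Statement nodes, so it has to be traversed further.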
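To make the two lambda body shapes concrete, here is a minimal sketch. The classes are hypothetical stand-ins that only mimic the field names of the corresponding javalang.tree nodes, so the example runs without the library; `calls_in` is an illustrative helper, not part of javalang. It also reflects the fact that in real Java code a lambda usually shows up as an argument of another call.

```python
from dataclasses import dataclass, field
from typing import Any, List


# Hypothetical stand-ins that only mimic the shape of javalang.tree nodes.
@dataclass
class MethodInvocation:
    qualifier: str
    member: str
    arguments: List[Any] = field(default_factory=list)


@dataclass
class StatementExpression:
    expression: Any


@dataclass
class LambdaExpression:
    body: Any  # a single expression, or a list of statements


def calls_in(node: Any, name: str) -> List[MethodInvocation]:
    """Collect invocations called `name`, descending into lambda bodies."""
    if isinstance(node, list):
        found = []
        for item in node:
            found.extend(calls_in(item, name))
        return found
    if isinstance(node, StatementExpression):
        return calls_in(node.expression, name)
    if isinstance(node, LambdaExpression):
        # The same recursive call handles both body shapes.
        return calls_in(node.body, name)
    if isinstance(node, MethodInvocation):
        found = [node] if node.member == name else []
        found.extend(calls_in(node.arguments, name))
        return found
    return []


# view.setOnClickListener(v -> { icon.setImageResource(...); })
block_lambda = LambdaExpression(
    body=[StatementExpression(MethodInvocation("icon", "setImageResource"))]
)
outer = MethodInvocation("view", "setOnClickListener", [block_lambda])

# run(() -> icon.setImageResource(...))
expr_lambda = LambdaExpression(body=MethodInvocation("icon", "setImageResource"))
```

Both `calls_in(outer, "setImageResource")` and `calls_in(expr_lambda, "setImageResource")` find the nested call, because the list case and the single-expression case go through the same recursion.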

kopyt

The kopyt author states up front that the library is modeled on javalang, and the main data classes below confirm it:

class Statement(ControlStructureBody):
    annotations: Sequence[Annotation]
    labels: Sequence[Label]
    statement: Union[Declaration, Assignment, LoopStatement, Expression]

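Because Kotlin syntax is more flexible, the structure differs somewhat: Statement has no direct subclasses, only a single member, statement, which can hold several kinds of node. This makes the syntax tree more complex.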

Expression(Node) (kopyt.node)
    BinaryExpression(Expression) (kopyt.node)
        Disjunction(BinaryExpression) (kopyt.node)
        Conjunction(BinaryExpression) (kopyt.node)
        Equality(BinaryExpression) (kopyt.node)
        Comparison(BinaryExpression) (kopyt.node)
        InfixOperation(BinaryExpression) (kopyt.node)
        ElvisExpression(BinaryExpression) (kopyt.node)
        InfixFunctionCall(BinaryExpression) (kopyt.node)
        RangeExpression(BinaryExpression) (kopyt.node)
        AdditiveExpression(BinaryExpression) (kopyt.node)
        MultiplicativeExpression(BinaryExpression) (kopyt.node)
        AsExpression(BinaryExpression) (kopyt.node)
    UnaryExpression(Expression) (kopyt.node)
        PrefixUnaryExpression(UnaryExpression) (kopyt.node)
        PostfixUnaryExpression(UnaryExpression) (kopyt.node)
    DirectlyAssignableExpression(Expression) (kopyt.node)
        ParenthesizedDirectlyAssignableExpression(DirectlyAssignableExpression) (kopyt.node)
    ParenthesizedAssignableExpression(Expression) (kopyt.node)
    PrimaryExpression(Expression) (kopyt.node)
        LiteralConstant(PrimaryExpression) (kopyt.node)
            RealLiteral(LiteralConstant) (kopyt.node)
            IntegerLiteral(LiteralConstant) (kopyt.node)
            HexLiteral(LiteralConstant) (kopyt.node)
            BinLiteral(LiteralConstant) (kopyt.node)
            UnsignedLiteral(LiteralConstant) (kopyt.node)
            LongLiteral(LiteralConstant) (kopyt.node)
            BooleanLiteral(LiteralConstant) (kopyt.node)
            NullLiteral(LiteralConstant) (kopyt.node)
            CharacterLiteral(LiteralConstant) (kopyt.node)
        ParenthesizedExpression(PrimaryExpression) (kopyt.node)
        CollectionLiteral(PrimaryExpression, Nodes[Expression]) (kopyt.node)
        StringLiteral(PrimaryExpression) (kopyt.node)
            LineStringLiteral(StringLiteral) (kopyt.node)
            MultiLineStringLiteral(StringLiteral) (kopyt.node)
        FunctionLiteral(PrimaryExpression) (kopyt.node)
            LambdaLiteral(FunctionLiteral) (kopyt.node)
            AnonymousFunction(FunctionLiteral) (kopyt.node)
        ObjectLiteral(PrimaryExpression) (kopyt.node)
        ThisExpression(PrimaryExpression) (kopyt.node)
        SuperExpression(PrimaryExpression) (kopyt.node)
        IfExpression(PrimaryExpression) (kopyt.node)
        WhenExpression(PrimaryExpression) (kopyt.node)
        TryExpression(PrimaryExpression) (kopyt.node)
        JumpExpression(PrimaryExpression) (kopyt.node)
            ThrowExpression(JumpExpression) (kopyt.node)
            ReturnExpression(JumpExpression) (kopyt.node)
            ContinueExpression(JumpExpression) (kopyt.node)
            BreakExpression(JumpExpression) (kopyt.node)
        CallableReference(PrimaryExpression) (kopyt.node)
        SimpleIdentifier(PrimaryExpression, Identifier) (kopyt.node)

Expression covers all expressions, and a method call is a PostfixUnaryExpression. The library also defines plural (container) nodes, as follows:

class Block(ControlStructureBody, Nodes[Statement]):
    ...

class Nodes(Node, Generic[NodeType]):
    sequence: Sequence[NodeType]
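Since a body can be either a plural container or a single statement, traversal code tends to normalize it first. The sketch below shows the idea with hypothetical stand-in classes; only the `sequence` field is taken from the definitions above, everything else (including the `statements_of` helper) is an assumption for illustration and not kopyt's API.

```python
from dataclasses import dataclass, field
from typing import Any, Generic, List, Sequence, TypeVar

NodeType = TypeVar("NodeType")


# Hypothetical stand-ins, not kopyt's real classes.
@dataclass
class Statement:
    statement: Any


@dataclass
class Nodes(Generic[NodeType]):
    sequence: Sequence[NodeType] = field(default_factory=list)


class Block(Nodes[Statement]):
    pass


def statements_of(body: Any) -> List[Any]:
    """Normalize a body (None, a container, or one statement) into a list."""
    if body is None:
        return []
    if isinstance(body, Nodes):
        return list(body.sequence)
    return [body]
```

With this, a caller can always write `for statement in statements_of(body)` without branching on the body type at every call site.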
       

Traversal and lookup

javalang
import javalang


# Returns the set of resource names used in a class
def find_set_src_in_java_class(java_class):
    res = set()
    # Methods and constructors are handled the same way;
    # abstract and interface methods have no body.
    for method in list(java_class.methods) + list(java_class.constructors or []):
        if method.body is None:
            continue
        for statement in method.body:
            res.update(find_set_src_in_java_statement(statement))
    return res


# Returns the set of resource names found in one statement
def find_set_src_in_java_statement(statement):
    res = set()
    if isinstance(statement, javalang.tree.StatementExpression):
        expression = statement.expression
        if isinstance(expression, javalang.tree.MethodInvocation):
            res1 = find_in_expression(expression)
            if res1 is not None:
                res.add(res1)
        elif isinstance(expression, javalang.tree.LambdaExpression):
            # A lambda body is either a single expression or a list of statements
            if isinstance(expression.body, javalang.tree.Expression):
                res1 = find_in_expression(expression.body)
                if res1 is not None:
                    res.add(res1)
            elif isinstance(expression.body, list):
                res.update(find_in_statements(expression.body))
    elif isinstance(statement, javalang.tree.BlockStatement):
        res.update(find_in_statements(statement.statements))
    elif isinstance(statement, javalang.tree.SynchronizedStatement):
        res.update(find_in_statements(statement.block))
    elif isinstance(statement, javalang.tree.IfStatement):
        res.update(find_set_src_in_java_statement(statement.then_statement))
        # else_statement may be None; the call then returns an empty set
        res.update(find_set_src_in_java_statement(statement.else_statement))
    elif isinstance(statement, javalang.tree.WhileStatement):
        res.update(find_set_src_in_java_statement(statement.body))
    elif isinstance(statement, javalang.tree.DoStatement):
        res.update(find_set_src_in_java_statement(statement.body))
    elif isinstance(statement, javalang.tree.ForStatement):
        res.update(find_set_src_in_java_statement(statement.body))
    elif isinstance(statement, javalang.tree.TryStatement):
        res.update(find_in_statements(statement.block))
        # catch and finally blocks are not traversed yet
        # res.update(find_in_statements(statement.finally_block))
    elif isinstance(statement, javalang.tree.SwitchStatement):
        for case in statement.cases:
            res.update(find_in_statements(case.statements))
    return res


# return set(res_name)
def find_in_statements(statements):
    res = set()
    if statements is None:
        return res
    type_ = type(statements)
    if type_ is not list:
        print(f"find_in_statements not list {type_}")
        return res
    for sub_statement in statements:
        result = find_set_src_in_java_statement(sub_statement)
        if len(result) > 0:
            res.update(result)
    return res


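# Returns the resource name, or None if the expression does not match
def find_in_expression(expression):
    if isinstance(expression, javalang.tree.MethodInvocation):
        # Guard against a call with no arguments
        if expression.member == 'setImageResource' and expression.arguments:
            arguments_0 = expression.arguments[0]
            if isinstance(arguments_0, javalang.tree.MemberReference):
                # e.g. 'R.drawable'; may be empty for a bare identifier
                param_qualifier = arguments_0.qualifier
                if param_qualifier and param_qualifier.find('drawable') > 0:
                    return f'{param_qualifier}.{arguments_0.member}'

    return None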

kopyt
from kopyt import node


# return set(res_name)
def find_in_class_kt(kt_class):
    res = set()
    if isinstance(kt_class, node.ClassDeclaration):
        for member in kt_class.body.members:
            if isinstance(member, node.FunctionDeclaration):
                if member.modifiers is not None and 'Composable' in member.modifiers:
                    continue
                res.update(find_in_func_body(member.body))
    elif isinstance(kt_class, node.ObjectDeclaration):
        if kt_class.body is None:
            print(f'empty body in {kt_class.name}')
    return res


# return set(res_name)
def find_in_func_body(body):
    res = set()
    if body is None:
        return res
    if isinstance(body, node.Block):
        res.update(find_in_statements_kt(body.sequence))
    elif isinstance(body, node.Expression):
        res.update(find_in_expression_kt(body))
    else:
        print(f'type of cs func_body {type(body)}')
    return res


# return set(res_name)
def find_in_cs_body(body):
    res = set()
    if body is None:
        return res
    if isinstance(body, node.Block):
        res.update(find_in_statements_kt(body.sequence))
    elif isinstance(body, node.Statement):
        res.update(find_in_statement_kt(body))
    elif isinstance(body, node.LambdaLiteral):
        res.update(find_in_statements_kt(body.statements))
    else:
        print(f'type of cs_body {type(body)}')
        raise Exception()

    return res


# return set(res_name)
def find_in_statements_kt(statements):
    res = set()
    for statement in statements:
        res.update(find_in_statement_kt(statement))

    return res


# return set(res_name)
def find_in_statement_kt(statement):
    res = set()

    _statement = statement.statement
    _type = type(_statement)
    if _type is node.ForStatement or \
            _type is node.WhileStatement \
            or _type is node.DoWhileStatement:
        res.update(find_in_cs_body(_statement.body))
    elif isinstance(_statement, node.Expression):
        res.update(find_in_expression_kt(_statement))

    return res


# return set(res_name)
def find_in_expression_kt(expression):
    res = set()

    _type = type(expression)
    if _type is node.PostfixUnaryExpression:
        _res = find_in_target_expression_kt(expression)
        if _res is not None:
            res.add(_res)
    elif _type is node.LambdaLiteral:
        res.update(find_in_statements_kt(expression.statements))
    elif _type is node.AnonymousFunction:
        res.update(find_in_func_body(expression.body))
    elif _type is node.IfExpression:
        res.update(find_in_cs_body(expression.if_body))
        res.update(find_in_cs_body(expression.else_body))
    elif _type is node.WhenExpression:
        for when_entry in expression.entries:
            res.update(find_in_cs_body(when_entry.body))
    elif _type is node.TryExpression:
        res.update(find_in_statements_kt(expression.try_block))

    return res


def find_in_target_expression_kt(expression):
    res = None
    if isinstance(expression, node.PostfixUnaryExpression):
        if len(expression.suffixes) == 2:
            suffix = expression.suffixes[0]
            if isinstance(suffix, node.NavigationSuffix) \
                    and suffix.suffix == 'setImageResource' \
                    and isinstance(expression.suffixes[1], node.CallSuffix):
                args = expression.suffixes[1].arguments
                if isinstance(args, node.ValueArguments) \
                        and len(args.sequence) > 0 \
                        and isinstance(args.sequence[0].value, node.PostfixUnaryExpression):
                    arg_postfix = args.sequence[0].value
                    if len(arg_postfix.suffixes) == 2 \
                            and isinstance(arg_postfix.suffixes[0], node.NavigationSuffix) \
                            and arg_postfix.suffixes[0].suffix == 'drawable':
                        res = str(arg_postfix)

    return res
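The function above only matches when the expression has exactly two suffixes, so a longer chain such as `binding.icon.setImageResource(...)` would presumably be missed. One possible relaxation is to scan adjacent suffix pairs instead. The sketch below uses hypothetical stand-in classes that mimic only the fields used above (`suffixes`, `suffix`, `arguments`); the arguments are simplified to a plain list rather than kopyt's ValueArguments, and `find_call_arguments` is an illustrative name.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


# Hypothetical stand-ins, not kopyt's real classes.
@dataclass
class NavigationSuffix:
    suffix: str


@dataclass
class CallSuffix:
    arguments: List[Any] = field(default_factory=list)


@dataclass
class PostfixUnaryExpression:
    expression: str
    suffixes: List[Any] = field(default_factory=list)


def find_call_arguments(
    expr: PostfixUnaryExpression, method_name: str
) -> Optional[List[Any]]:
    """Return the arguments of the first `.method_name(...)` pair, else None."""
    suffixes = expr.suffixes
    # Look at every (navigation, call) pair of neighbours in the chain.
    for nav, call in zip(suffixes, suffixes[1:]):
        if (
            isinstance(nav, NavigationSuffix)
            and nav.suffix == method_name
            and isinstance(call, CallSuffix)
        ):
            return call.arguments
    return None


# binding.icon.setImageResource(R.drawable.ic_a)
arg = PostfixUnaryExpression(
    "R", [NavigationSuffix("drawable"), NavigationSuffix("ic_a")]
)
chain = PostfixUnaryExpression(
    "binding",
    [NavigationSuffix("icon"), NavigationSuffix("setImageResource"), CallSuffix([arg])],
)
```

Here `find_call_arguments(chain, "setImageResource")` returns the argument list even though the call sits at the end of a three-suffix chain; the `drawable` check on the first argument can then proceed as in the original function.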

Closing notes

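The code is still at the demo stage and is meant for learning and discussion; it has plenty of shortcomings:
Some nested branches are not traversed, such as the catch and finally blocks of a TryStatement;
It currently checks only the method name and its argument, and does not verify that the method's owner is an ImageView.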
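One way to close the first gap is to stop enumerating statement types and instead walk every field of every node. The sketch below shows that idea on hypothetical stand-in dataclasses whose field names mimic javalang's TryStatement and CatchClause; `iter_nodes` and `find_calls` are illustrative helpers. javalang itself also supports iterating a tree directly (`for path, node in tree`) and filtering by node type with `tree.filter(...)`, which may be a simpler route in real code.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Iterator, List, Optional


# Hypothetical stand-ins that mimic a few javalang.tree node shapes.
@dataclass
class MethodInvocation:
    member: str
    arguments: List[Any] = field(default_factory=list)


@dataclass
class StatementExpression:
    expression: Any


@dataclass
class CatchClause:
    block: List[Any] = field(default_factory=list)


@dataclass
class TryStatement:
    block: List[Any] = field(default_factory=list)
    catches: Optional[List[CatchClause]] = None
    finally_block: Optional[List[Any]] = None


def iter_nodes(value: Any) -> Iterator[Any]:
    """Yield every node reachable from value, depth first."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_nodes(item)
    elif is_dataclass(value) and not isinstance(value, type):
        yield value
        # Recurse into every field, so no branch has to be listed by hand.
        for f in fields(value):
            yield from iter_nodes(getattr(value, f.name))


def find_calls(root: Any, name: str) -> List[MethodInvocation]:
    """Return all invocations of `name` anywhere under root."""
    return [
        n
        for n in iter_nodes(root)
        if isinstance(n, MethodInvocation) and n.member == name
    ]
```

Because the walker visits all fields, calls inside catch and finally blocks are found without adding a branch per statement type. The second gap, checking the owner type, would need type resolution and is not addressed by this sketch.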
Feel free to share more of your own experience.