A reference answer to the Python puzzle from a couple of days ago

Original link: zhuanlan.zhihu.com

The puzzle is here: 给大家留一道Python的思考题 (A Python puzzle for everyone) - Zhihu Column

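The reference code has been uploaded to GitHub:

hubo1016/pychecktype

The code contains many doctests. If you have already written your own solution, you can try running them against it: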

def check_type(value, type):
    """
    Generic type checking.
    
    :param type: could be:
                                  
                 - a Python type. Notice that `object` matches all types, including None. There are a few special rules: int or long type always matches
                   both int and long values; str or unicode type always matches both str and unicode values; int type CANNOT match bool values.
                 
                 - a tuple of types, meaning that the data can match any of the subtypes. When multiple subtypes match, the first matching subtype is used.
                 
                 - an empty tuple () means any data type which is not None
                 
                 - None, means None. Could be used to match nullable value e.g. `(str, None)`. Equal to types.NoneType
                 
                 - a list, means that data should be a list, or a single item which is converted to a list of length 1. Tuples are also
                   converted to lists.
                 
                 - a list with exactly one valid `type` means a list whose items all match `type`, or a single item matching `type` which is
                   converted to a list. Tuples are also converted to lists.
                   
                 - a dict, means that data should be a dict
                 
                 - a dict with keys and values. Values should be valid `type`. If a key starts with '?', it is optional and '?' is removed.
                   If a key starts with '!', it is required, and '!' is removed. If a key starts with '~', the content after '~' should be
                   a regular expression, and any key in `value` that matches the regular expression (with re.search) and is not matched by other keys
                   must match the corresponding type. The behavior is undefined when a key is matched by multiple regular expressions.
                   
                   If a key does not start with '?', '!' or '~', it is required, as if '!' is prepended.
    
    :param value: the value to be checked. It is guaranteed that this value is not modified.
    
    :return: the checked and converted value. An exception is raised (usually TypeMismatchException) when `value` is not in `type`. The returned
             result may contain objects from `value`.
             
    Some examples::
    
       >>> check_type("abc", str)
       'abc'
       >>> check_type([1,2,3], [int])
       [1, 2, 3]
       >>> check_type((1,2,3), [int])
       [1, 2, 3]
       >>> check_type(1, ())
       1
       >>> check_type([[]], ())
       [[]]
       >>> check_type(None, ())
       Traceback (most recent call last):
           ...
       TypeMismatchException: None cannot match type ()
       >>> check_type([1,2,"abc"], [int]) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: 'abc' cannot match type <... 'int'>
       >>> check_type("abc", [str])
       ['abc']
       >>> check_type(None, str) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: None cannot match type <... 'str'>
       >>> check_type(None, (str, None)) is None
       True
       >>> check_type([1,2,"abc",["def","ghi"]], [(int, [str])])
       [1, 2, ['abc'], ['def', 'ghi']]
       >>> check_type({"abc":123, "def":"ghi"}, {"abc": int, "def": str}) == {"abc":123, "def":"ghi"}
       True
       >>> check_type({"abc": {"def": "test", "ghi": 5}, "def": 1}, {"abc": {"def": str, "ghi": int}, "def": [int]}) == {"abc": {"def": "test", "ghi": 5}, "def": [1]}
       True
       >>> a = []
       >>> a.append(a)
       >>> check_type(a, a)
       [[...]]
       >>> r = _
       >>> r[0] is r
       True
       >>> check_type(1, None)
       Traceback (most recent call last):
           ...
       TypeMismatchException: 1 cannot match type None
       >>> check_type(a, ())
       [[...]]
       >>> check_type(True, int) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: True cannot match type <... 'int'>
       >>> check_type(1, bool) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: 1 cannot match type <... 'bool'>
       >>> check_type([1], [list]) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: 1 cannot match type <... 'list'>
       >>> check_type(1, 1)
       Traceback (most recent call last):
           ...
       InvalidTypeException: 1 is not a valid type: Unrecognized type
       >>> my_type = []
       >>> my_type.append(([str], my_type))
       >>>
       >>> my_data = ["abc"]
       >>> my_data.append(my_data)
       >>>
       >>> check_type(my_data, my_type)
       [['abc'], [...]]
       >>> r = _
       >>> r[1] is r
       True
       >>> my_type = {}
       >>> my_type["abc"] = my_type
       >>> my_type["def"] = [my_type]
       >>> my_data = {}
       >>> my_data["abc"] = my_data
       >>> my_data["def"] = my_data
       >>> r = check_type(my_data, my_type)
       >>> r['abc'] is r
       True
       >>> r['def'][0] is r
       True
       >>> my_obj = []
       >>> my_obj2 = [my_obj]
       >>> my_obj.append(my_obj2)
       >>> my_obj.append(1)
       >>> my_type = []
       >>> my_type.append(my_type)
       >>> check_type(my_obj, (my_type, [(my_type, int)])) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: [[[...]], 1] cannot match type ([[...]], [([[...]], <... 'int'>)])
       >>> my_type = []
       >>> my_type.append(my_type)
       >>> check_type(1, my_type)
       Traceback (most recent call last):
           ...
       TypeMismatchException: 1 cannot match type [[...]]
       >>> check_type(True, bool)
       True
       >>> check_type(1, [[[[[[[[[[int]]]]]]]]]])
       [[[[[[[[[[1]]]]]]]]]]
       >>> check_type([], [int, str]) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       InvalidTypeException: [<... 'int'>, <... 'str'>] is not a valid type: list must contain 0 or 1 valid inner type
       >>> check_type([], [])
       []
       >>> check_type([1,2,3], [])
       [1, 2, 3]
       >>> check_type([1,"abc"], [])
       [1, 'abc']
       >>> check_type((1, "abc"), [])
       [1, 'abc']
       >>> check_type({"a": 1}, [])
       [{'a': 1}]
       >>> check_type(1, {})
       Traceback (most recent call last):
           ...
       TypeMismatchException: 1 cannot match type {}
       >>> check_type([], {})
       Traceback (most recent call last):
           ...
       TypeMismatchException: [] cannot match type {}
       >>> check_type({"a":1}, {})
       {'a': 1}
       >>> check_type({"a":1}, {"b": int}) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: {'a': 1} cannot match type {'b': <... 'int'>}: key 'b' is required
       >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int}) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: 'abc' cannot match type <... 'int'>
       >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int, "abe": str}) == {'abc': 1, 'abd': 2, 'abe': 'abc'}
       True
       >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int, "?abe": str}) == {'abc': 1, 'abd': 2, 'abe': 'abc'}
       True
       >>> check_type({"abc": 1, "def": "abc"}, {"abc": int}) == {'abc': 1, 'def': 'abc'}
       True
       >>> check_type({"abc": 1, "abd": 2, "bcd": "abc", "bce": "abd"}, {"~a.*": int, "~b.*": str}) == {"abc": 1, "abd": 2, "bcd": "abc", "bce": "abd"}
       True
       >>> my_type = (str, [])
       >>> my_type[1].append(my_type)
       >>> check_type(1, my_type) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: 1 cannot match type (<... 'str'>, [(...)])
       >>> my_obj = []
       >>> my_obj.append(my_obj)
       >>> my_obj.append(1)
       >>> check_type(my_obj, my_type) # doctest: +ELLIPSIS
       Traceback (most recent call last):
           ...
       TypeMismatchException: [[...], 1] cannot match type (<... 'str'>, [(...)])
       >>> my_obj = []
       >>> my_obj.append(my_obj)
       >>> my_obj.append("abc")
       >>> check_type(my_obj, my_type)
       [[...], 'abc']
       >>> my_type = []
       >>> my_type2 = {"a": my_type, "b": my_type}
       >>> my_type.append(my_type2)
       >>> my_obj = {}
       >>> my_obj['a'] = my_obj
       >>> my_obj['b'] = my_obj
       >>> r = check_type(my_obj, my_type)
       >>> r[0]['a'][0] is r[0]['b'][0]
       True
       >>> r[0]['a'][0] is r[0]
       True
       >>> r = check_type(my_obj, my_type2)
       >>> r['a'][0] is r['b'][0]
       True
       >>> r['a'][0] is r
       True
       >>> my_obj2 = []
       >>> my_obj2.append(my_obj2)
       >>> my_obj2.append(1)
       >>> my_obj = [my_obj2, my_obj2]
       >>> my_type = []
       >>> my_type.append((int, my_type))
       >>> check_type(my_obj, my_type)
       [[[...], 1], [[...], 1]]
       >>> r = _
       >>> r[0] is r[1]
       True
       >>> my_type = []
       >>> my_type.append(([int], my_type))
       >>> check_type(my_obj, my_type)
       [[[...], [1]], [[...], [1]]]
       >>> r = _
       >>> r[0] is r[1]
       True
    """
    return _check_type_inner(value, type)
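
If you wrote your own solution, one way to try it against doctests like the ones above is the standard doctest module. The sketch below uses a toy function, check_int, which is a hypothetical stand-in and not part of pychecktype; run_doctests is likewise only an illustrative helper name.

```python
import doctest


def check_int(value):
    """Toy checker (hypothetical stand-in, not part of pychecktype).

    >>> check_int(3)
    3
    >>> check_int(True)
    Traceback (most recent call last):
        ...
    TypeError: True is not an int
    """
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(repr(value) + " is not an int")
    return value


def run_doctests(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The globals dict only needs the names that the examples refer to.
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, "<sketch>", 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


print(run_doctests(check_int))
```

To check a whole module in one go, `python -m doctest yourfile.py` runs every docstring example in that file.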

Let's walk through the key implementation details:

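import re

# Python 3 has no separate `long` type; fall back to int
try:
    _long = long
except NameError:
    _long = int

# Python 3 has no separate `unicode` type; fall back to str
try:
    _unicode = unicode
except NameError:
    _unicode = str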


# Assumption: the article does not show the two exception classes used below
# (they are defined in the GitHub repository). These minimal stand-ins only
# reproduce the messages that appear in the doctests.
class TypeMismatchException(Exception):
    def __init__(self, value, type_, info=None):
        message = repr(value) + " cannot match type " + repr(type_)
        if info is not None:
            message += ": " + info
        Exception.__init__(self, message)


class InvalidTypeException(Exception):
    def __init__(self, type_, info=None):
        message = repr(type_) + " is not a valid type"
        if info is not None:
            message += ": " + info
        Exception.__init__(self, message)


def _check_type_inner(value, type_, _recursive_check = None):
    # print('Check type:', value, id(value), type_, id(type_))
    if _recursive_check is None:
        # current, succeeded, failed, listloop
        _recursive_check = ({}, {}, {}, set())
    current_check, succeded_check, failed_check, list_loop = _recursive_check
    # Use (id(value), id(type)) to store matches that are done before
    check_id = (id(value), id(type_))
    if check_id in succeded_check:
        # This match is already done, return the result
        # print('Hit succeeded cache:', succeded_check[check_id], id(succeded_check[check_id]))
        return succeded_check[check_id]
    elif check_id in failed_check:
        # This match has already failed, raise the exception
        raise failed_check[check_id]
    elif check_id in current_check:
        # print('Hit current cache:', current_check[check_id], id(current_check[check_id]))
        # This match is in progress, and its final result depends on itself. Return the in-progress object to form a recursive structure.
        return current_check[check_id]
    return_value = None
    try:
        if type_ is None:
            # Match None only
            if value is not None:
                raise TypeMismatchException(value, type_)
            else:
                return_value = value
        elif type_ == ():
            if value is None:
                raise TypeMismatchException(value, type_)
            else:
                return_value = value
        elif type_ is int or type_ is _long:
            # Enhanced behavior when matching int type: long is also matched; bool is NOT matched
            if not isinstance(value, bool) and (isinstance(value, int) or isinstance(value, _long)):
                return_value = value
            else:
                raise TypeMismatchException(value, type_)
        elif type_ is str or type_ is _unicode:
            # Enhanced behavior when matching str: unicode is always matched (even in Python2)
            if isinstance(value, str) or isinstance(value, _unicode):
                return_value = value
            else:
                raise TypeMismatchException(value, type_)
        elif isinstance(type_, type):
            if isinstance(value, type_):
                return_value = value
            else:
                raise TypeMismatchException(value, type_)
        elif isinstance(type_, tuple):
            for subtype in type_:
                try:
                    return_value = _check_type_inner(value, subtype, _recursive_check)
                except TypeMismatchException:
                    continue
                else:
                    break
            else:
                raise TypeMismatchException(value, type_)
        elif isinstance(type_, list):
            if len(type_) > 1:
                raise InvalidTypeException(type_, "list must contain 0 or 1 valid inner type")
            if not type_:
                # matches any list or tuple
                if isinstance(value, list) or isinstance(value, tuple):
                    return_value = list(value)
                else:
                    return_value = [value]
            else:
                subtype = type_[0]
                if isinstance(value, list) or isinstance(value, tuple):
                    # matches a list or tuple with all inner objects matching subtype
                    current_result = []
                    # save the reference to the list
                    current_check[check_id] = current_result
                    # back up the succeeded checks: they may depend on the current result. If the type match fails, revert all succeeded checks
                    _new_recursive_check = (current_check, dict(succeded_check), failed_check, set())
                    current_result.extend(_check_type_inner(o, subtype, _new_recursive_check) for o in value)
                    # copy succeeded checks
                    succeded_check.clear()
                    succeded_check.update(_new_recursive_check[1])
                else:
                    # a non-list value like "abc" cannot match an infinitely looped [[...]]
                    # when a non-list value is wrapped into a list, we must prevent it from forming an infinite loop
                    if check_id in list_loop:
                        raise TypeMismatchException(value, type_)
                    list_loop.add(check_id)
                    try:
                        current_result = [_check_type_inner(value, subtype, _recursive_check)]
                    finally:
                        list_loop.discard(check_id)
                return_value = current_result
        elif isinstance(type_, dict):
            if not isinstance(value, dict):
                raise TypeMismatchException(value, type_)
            if not type_:
                return_value = dict(value)
            else:
                required_keys = dict((k[1:] if isinstance(k, str) and k.startswith('!') else k, v)
                                 for k,v in type_.items()
                                 if not isinstance(k, str) or (not k.startswith('?') and not k.startswith('~')))
                # non-str keys have no startswith(); guard them the same way as above
                optional_keys = dict((k[1:], v) for k, v in type_.items()
                                 if isinstance(k, str) and k.startswith('?'))
                regexp_keys = [(k[1:], v) for k, v in type_.items()
                               if isinstance(k, str) and k.startswith('~')]
                # check required keys
                for k in required_keys:
                    if k not in value:
                        raise TypeMismatchException(value, type_, 'key ' + repr(k) + ' is required')
                optional_keys.update(required_keys)
                current_result = {}
                # save the reference to the dict
                current_check[check_id] = current_result
                # back up the succeeded checks: they may depend on the current result. If the type match fails, revert all succeeded checks
                _new_recursive_check = (current_check, dict(succeded_check), failed_check, set())
                for k, v in value.items():
                    if k in optional_keys:
                        current_result[k] = _check_type_inner(v, optional_keys[k], _new_recursive_check)
                    else:
                        for rk, rv in regexp_keys:
                            if re.search(rk, k):
                                current_result[k] = _check_type_inner(v, rv, _new_recursive_check)
                                break
                        else:
                            current_result[k] = v
                # copy succeeded checks
                succeded_check.clear()
                succeded_check.update(_new_recursive_check[1])
                return_value = current_result
        else:
            raise InvalidTypeException(type_, "Unrecognized type")
    except Exception as exc:
        # This match fails, store the exception
        failed_check[check_id] = exc
        if check_id in current_check:
            del current_check[check_id]
        raise
    else:
        # This match succeeded
        if check_id in current_check:
            del current_check[check_id]
            # Only store the succeded_check if necessary. 
            succeded_check[check_id] = return_value
        return return_value
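
The dict branch above sorts the keys of the type into required, optional and regular-expression keys before it looks at the value. The standalone sketch below shows only that step. The names split_keys, subtype_for and UNCHECKED are hypothetical helpers introduced for illustration and do not appear in the original code.

```python
import re

# Sentinel meaning "no rule applies, copy the value unchecked" (illustrative)
UNCHECKED = object()


def split_keys(type_):
    """Split a dict type into (required, optional, patterns)."""
    required, optional, patterns = {}, {}, []
    for key, subtype in type_.items():
        if isinstance(key, str) and key.startswith('?'):
            optional[key[1:]] = subtype
        elif isinstance(key, str) and key.startswith('~'):
            patterns.append((key[1:], subtype))
        elif isinstance(key, str) and key.startswith('!'):
            required[key[1:]] = subtype
        else:
            # no prefix (or a non-str key): required, as if '!' were prepended
            required[key] = subtype
    return required, optional, patterns


def subtype_for(key, type_):
    """Find the subtype a key of the value must match, exact keys first."""
    required, optional, patterns = split_keys(type_)
    if key in required:
        return required[key]
    if key in optional:
        return optional[key]
    for pattern, subtype in patterns:
        if re.search(pattern, key):
            return subtype
    return UNCHECKED


rules = {"!a": int, "?b": str, "~c.*": int, "d": str}
print(split_keys(rules))
```

Exact keys take priority over regular expressions here, which mirrors the "not matched by other keys" rule in the docstring.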

The hardest part of this implementation is the handling of recursive structures. The key detail is that the recursive calls pass along a parameter like this:

if _recursive_check is None:
    # current, succeeded, failed, listloop
    _recursive_check = ({}, {}, {}, set())

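It actually consists of three dicts and one set. Their keys are (id(value), id(type_)). In Python, id returns an identifier that is unique to an object for as long as that object is alive. We use ids instead of the objects themselves because objects such as dict and list are unhashable and cannot be put into a set or used as dict keys, whereas ids have no such problem. Using an id to identify an object is generally discouraged, because the object may be freed by the GC and another object may then take over the same id. In our procedure, however, the objects are not modified and the ids are only used temporarily, so this is not a real problem. The json and pickle implementations in the Python standard library use the same approach.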

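Using the (value, type) pair as the identifier is also essential. The value alone is not enough, because the same value may be matched against different types and yield different results. As long as neither the value nor the type changes, though, the result of the match is unique.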
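
A small sketch of the two points above: lists cannot be hashed, while a pair of ids can, and the same value paired with two different types needs two separate cache entries. The helper name make_check_id is an assumption made for this example; the original code builds the tuple inline.

```python
def make_check_id(value, type_):
    """Build a hashable cache key from two possibly unhashable objects."""
    return (id(value), id(type_))


value = [1, 2, 3]
list_of_int = [int]
any_list = []

# A list is unhashable, so the objects themselves cannot serve as dict keys.
try:
    hash(value)
    hashable = True
except TypeError:
    hashable = False

cache = {}
cache[make_check_id(value, list_of_int)] = "matched against [int]"
cache[make_check_id(value, any_list)] = "matched against []"

print(hashable)    # False
print(len(cache))  # 2: one entry per (value, type) pair
```

Note that an equal but distinct list such as another `[1, 2, 3]` gets a different key, which is what we want: identity, not equality, decides whether two results must be the same object.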

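These four objects have the following meanings:

  1. current - records which matches are still unfinished when the recursion reaches this point; these are exactly the ancestors of the current call. The values in this map point to a list or dict object that is still being built during matching, so it is incomplete. When a cycle is found, the value in current is used directly instead of recursing further, which prevents infinite loops.
  2. succeeded - records all matches that have already succeeded. It is needed so that objects sharing the same reference in the original data produce the same result: for a structure such as [my_obj, my_obj], we want the two objects in the returned list to be the same reference as well. The values in this map point to the objects that have been successfully generated.
  3. failed - records matches that have already failed. The algorithm generally works without it, but it can reduce the number of attempts in some cases. Because a tuple expresses "one of several types", a failed match of a substructure does not always make the whole structure fail.
  4. listloop - used to prevent an infinitely nested structure from matching a structure that is not infinitely nested.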
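
To make the roles of current and succeeded concrete, here is a heavily reduced sketch that only rebuilds nested lists, with no type checking, no failed map and no listloop. The function name rebuild and its keying by id(value) alone are simplifications made for this example, not the original algorithm.

```python
def rebuild(value, current=None, succeeded=None):
    """Copy nested lists while preserving cycles and shared references."""
    if current is None:
        current, succeeded = {}, {}
    if not isinstance(value, list):
        return value
    key = id(value)
    if key in succeeded:
        # already rebuilt: reuse the result so shared references stay shared
        return succeeded[key]
    if key in current:
        # an ancestor call is still building this list: return the
        # unfinished object, which closes the cycle instead of recursing
        return current[key]
    result = []
    current[key] = result
    result.extend(rebuild(item, current, succeeded) for item in value)
    del current[key]
    succeeded[key] = result
    return result


loop = []
loop.append(loop)
copy_of_loop = rebuild(loop)
print(copy_of_loop[0] is copy_of_loop)  # True

shared = [1]
pair = rebuild([shared, shared])
print(pair[0] is pair[1])               # True
```

Because nothing can fail in this reduced version, the succeeded map never needs to be rolled back; the real code has to handle that case as well.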

---- Watch out for the big pitfall here ----

The fourth point is very easy to overlook. Consider this test case:

my_type = []
my_type.append(my_type)
check_type(1, my_type)

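In the first version of the implementation, this expression matched successfully! The result was [[...]]

The reason is the rule that a non-list value can be automatically converted to a list with a single item. In other words, 1 matches int, and also [int], and therefore [[int]], and therefore [[[int]]], and so on.

So... maybe it can also match the infinitely nested list [[...]]? After all, it has infinitely many levels, so maybe there is an int at the innermost one...

Would it be enough to add a special check that forbids infinite nesting only when the inner type is again a list? No, that does not work either: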

my_type = []
my_type.append((str, my_type))
check_type(1, my_type)

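The key to solving this is: during matching, once we have converted a non-list value into a list, it must not be matched against the same list type again until it has successfully matched a non-list type. That sentence is a bit convoluted; in code it is the last set. When we convert a value into a list, we put a flag into this set. If, while recursing further, we find that the flag is already set, we have run into an infinitely looping nested structure and must report a mismatch. When the recursive call returns, the flag is cleared (the code does this in a finally block, so it happens whether the call succeeded or failed).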
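
The guard can be seen in isolation in the reduced matcher below. It is a sketch under strong assumptions: it only understands the type int and one-element list types, it does not handle cyclic values, and the names match and Mismatch are invented for this example.

```python
class Mismatch(Exception):
    """Raised when a value cannot match a type (illustrative stand-in)."""


def match(value, type_, list_loop=None):
    if list_loop is None:
        list_loop = set()
    if type_ is int:
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        raise Mismatch(value)
    # Otherwise type_ is assumed to be a one-element list such as [int]
    subtype = type_[0]
    if isinstance(value, list):
        # a real list was consumed, so the items start with a fresh guard
        return [match(item, subtype, set()) for item in value]
    check_id = (id(value), id(type_))
    if check_id in list_loop:
        # the same scalar reached the same list type again without ever
        # matching a non-list type: the type is infinitely nested
        raise Mismatch(value)
    list_loop.add(check_id)
    try:
        return [match(value, subtype, list_loop)]
    finally:
        list_loop.discard(check_id)


print(match(1, [[int]]))  # [[1]]

looped_type = []
looped_type.append(looped_type)
try:
    match(1, looped_type)
    outcome = "matched"
except Mismatch:
    outcome = "mismatch"
print(outcome)            # mismatch
```

Without the list_loop check, the last call would recurse until Python hits its recursion limit.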

However, once a non-list item has been matched successfully, the flags must be cleared for the rest of the recursion. Consider this test case:

>>> my_type = []
>>> my_type2 = {"a": my_type, "b": my_type}
>>> my_type.append(my_type2)
>>> my_obj = {}
>>> my_obj['a'] = my_obj
>>> my_obj['b'] = my_obj
>>> check_type(my_obj, my_type)
[{'a': [{...}], 'b': [{...}]}]

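We have converted an infinitely nested structure that contains no lists into an infinitely nested structure with an extra list added at every level, and this match is legal. In the recursive call for dict, we use this statement: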

# back up the succeeded checks: they may depend on the current result. If the type match fails, revert all succeeded checks
_new_recursive_check = (current_check, dict(succeded_check), failed_check, set())

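Here a new set replaces the previous one. Why not use clear()? Because if this call fails and the exception propagates back to the upper level, the previous flags have to be restored.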
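
The difference can be shown with a toy example. All function names and the string flags below are made up for illustration; only the idea of passing set() down instead of calling clear() comes from the code above.

```python
def failing_child(list_loop):
    """Stand-in for a nested match that sets a flag and then fails."""
    list_loop.add("flag set deeper down")
    raise ValueError("inner match failed")


def parent_using_clear(list_loop):
    list_loop.clear()          # destroys the caller's flags
    failing_child(list_loop)


def parent_using_fresh_set(list_loop):
    failing_child(set())       # caller's flags stay untouched


def flags_after_failure(parent):
    list_loop = {("value id", "type id")}
    try:
        parent(list_loop)
    except ValueError:
        pass
    return list_loop


print(flags_after_failure(parent_using_fresh_set))
print(flags_after_failure(parent_using_clear))
```

With the fresh set, the flag that existed before the call is still there after the failure, so the caller can go on trying other alternatives with its guard intact.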


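Another point to watch is how succeeded is used. When the data contains recursive structures, a result recorded in succeeded may include matches from current that are not finished yet, and those matches are not guaranteed to succeed in the end. So if a recursive call created a current entry and that entry eventually fails, to be safe we must discard every succeeded cache entry added during that recursive call, because some of them may contain the current match that has just failed. This is why _new_recursive_check creates a new succeded_check map, which is written back to the previous map only on success.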
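
The copy-then-commit pattern described above can be sketched on its own. The names run_with_rollback, good_step and bad_step are hypothetical; the real code performs the same three moves inline (dict(succeded_check), then clear() and update() on success).

```python
def run_with_rollback(succeeded, step):
    """Run step on a scratch copy of the cache; commit only on success."""
    scratch = dict(succeeded)
    result = step(scratch)     # may raise; then `succeeded` is left as it was
    succeeded.clear()
    succeeded.update(scratch)
    return result


def good_step(cache):
    cache["b"] = 2
    return "done"


def bad_step(cache):
    cache["c"] = 3             # entry that depends on an unfinished match
    raise ValueError("outer match failed")


cache = {"a": 1}
run_with_rollback(cache, good_step)
try:
    run_with_rollback(cache, bad_step)
except ValueError:
    pass
print(cache)  # {'a': 1, 'b': 2}
```

The entry added by the failing step never reaches the shared cache, so no later lookup can return an object built on top of a failed match.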

failed has no such problem: if a match fails, it fails under all circumstances.

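The code makes a few implementation-level adjustments:

  1. In both Python 2 and Python 3, int and long are treated as the same type, and so are str and unicode.
  2. bool values (True, False) are not allowed to match the int type. In Python, bool is a subclass of int and True is effectively 1, so isinstance(True, int) returns True. Our program handles this case specially.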
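
The second point is easy to verify in an interpreter. The helper is_plain_int below is a hypothetical name for the same test that the int branch of the code performs.

```python
def is_plain_int(value):
    """True for integers, but not for the bool values True and False."""
    return isinstance(value, int) and not isinstance(value, bool)


print(issubclass(bool, int))   # True
print(isinstance(True, int))   # True
print(True == 1)               # True
print(is_plain_int(1))         # True
print(is_plain_int(True))      # False
```

The order of the two isinstance checks does not matter; what matters is that bool is excluded explicitly, since a plain isinstance(value, int) cannot tell the two apart.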

From today on, please call me the generalized-list matching maniac