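The problem statement is here: 给大家留一道Python的思考题 - 知乎专栏 ("A Python puzzle for everyone", Zhihu column)
The reference code has been uploaded to GitHub.
The code contains many doctests. If you have already written your own solution, try running them against it: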
def check_type(value, type):
    """
    Generic type checking.

    :param type: could be:
      - a Python type. Note that `object` matches all types, including None. There are a few special rules: int or long type
        always matches both int and long values; str or unicode type always matches both str and unicode values; int type
        CANNOT match a bool value.
      - a tuple of types, meaning that the data may match any of the subtypes. When multiple subtypes can be matched, the first
        matched subtype is used.
      - an empty tuple (), meaning any data type which is not None
      - None, meaning None. Could be used to match a nullable value, e.g. `(str, None)`. Equal to types.NoneType
      - a list, meaning that the data should be a list, or a single item which is converted to a list of length 1. Tuples are
        also converted to lists.
      - a list with exactly one valid `type`, meaning a list whose items all match `type`, or a single item matching `type`
        which is converted to a list. Tuples are also converted to lists.
      - a dict, meaning that the data should be a dict
      - a dict with keys and values. Values should be valid `type`. If a key starts with '?', it is optional and '?' is removed.
        If a key starts with '!', it is required, and '!' is removed. If a key starts with '~', the content after '~' should be
        a regular expression, and any key in `value` which matches the regular expression (with re.search) and is not matched
        by other keys must match the corresponding type. The behavior is undefined when a key is matched by multiple regular
        expressions. If a key does not start with '?', '!' or '~', it is required, as if '!' is prepended.
    :param value: the value to be checked. It is guaranteed that this value is not modified.
    :return: the checked and converted value. An exception is raised (usually TypeMismatchException) when `value` does not
             match `type`. The returned result may contain objects from `value`.

    Some examples::

    >>> check_type("abc", str)
    'abc'
    >>> check_type([1,2,3], [int])
    [1, 2, 3]
    >>> check_type((1,2,3), [int])
    [1, 2, 3]
    >>> check_type(1, ())
    1
    >>> check_type([[]], ())
    [[]]
    >>> check_type(None, ())
    Traceback (most recent call last):
    ...
    TypeMismatchException: None cannot match type ()
    >>> check_type([1,2,"abc"], [int]) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: 'abc' cannot match type <... 'int'>
    >>> check_type("abc", [str])
    ['abc']
    >>> check_type(None, str) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: None cannot match type <... 'str'>
    >>> check_type(None, (str, None)) is None
    True
    >>> check_type([1,2,"abc",["def","ghi"]], [(int, [str])])
    [1, 2, ['abc'], ['def', 'ghi']]
    >>> check_type({"abc":123, "def":"ghi"}, {"abc": int, "def": str}) == {"abc":123, "def":"ghi"}
    True
    >>> check_type({"abc": {"def": "test", "ghi": 5}, "def": 1}, {"abc": {"def": str, "ghi": int}, "def": [int]}) == {"abc": {"def": "test", "ghi": 5}, "def": [1]}
    True
    >>> a = []
    >>> a.append(a)
    >>> check_type(a, a)
    [[...]]
    >>> r = _
    >>> r[0] is r
    True
    >>> check_type(1, None)
    Traceback (most recent call last):
    ...
    TypeMismatchException: 1 cannot match type None
    >>> check_type(a, ())
    [[...]]
    >>> check_type(True, int) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: True cannot match type <... 'int'>
    >>> check_type(1, bool) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: 1 cannot match type <... 'bool'>
    >>> check_type([1], [list]) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: 1 cannot match type <... 'list'>
    >>> check_type(1, 1)
    Traceback (most recent call last):
    ...
    InvalidTypeException: 1 is not a valid type: Unrecognized type
    >>> my_type = []
    >>> my_type.append(([str], my_type))
    >>>
    >>> my_data = ["abc"]
    >>> my_data.append(my_data)
    >>>
    >>> check_type(my_data, my_type)
    [['abc'], [...]]
    >>> r = _
    >>> r[1] is r
    True
    >>> my_type = {}
    >>> my_type["abc"] = my_type
    >>> my_type["def"] = [my_type]
    >>> my_data = {}
    >>> my_data["abc"] = my_data
    >>> my_data["def"] = my_data
    >>> r = check_type(my_data, my_type)
    >>> r['abc'] is r
    True
    >>> r['def'][0] is r
    True
    >>> my_obj = []
    >>> my_obj2 = [my_obj]
    >>> my_obj.append(my_obj2)
    >>> my_obj.append(1)
    >>> my_type = []
    >>> my_type.append(my_type)
    >>> check_type(my_obj, (my_type, [(my_type, int)])) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: [[[...]], 1] cannot match type ([[...]], [([[...]], <... 'int'>)])
    >>> my_type = []
    >>> my_type.append(my_type)
    >>> check_type(1, my_type)
    Traceback (most recent call last):
    ...
    TypeMismatchException: 1 cannot match type [[...]]
    >>> check_type(True, bool)
    True
    >>> check_type(1, [[[[[[[[[[int]]]]]]]]]])
    [[[[[[[[[[1]]]]]]]]]]
    >>> check_type([], [int, str]) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    InvalidTypeException: [<... 'int'>, <... 'str'>] is not a valid type: list must contain 0 or 1 valid inner type
    >>> check_type([], [])
    []
    >>> check_type([1,2,3], [])
    [1, 2, 3]
    >>> check_type([1,"abc"], [])
    [1, 'abc']
    >>> check_type((1, "abc"), [])
    [1, 'abc']
    >>> check_type({"a": 1}, [])
    [{'a': 1}]
    >>> check_type(1, {})
    Traceback (most recent call last):
    ...
    TypeMismatchException: 1 cannot match type {}
    >>> check_type([], {})
    Traceback (most recent call last):
    ...
    TypeMismatchException: [] cannot match type {}
    >>> check_type({"a":1}, {})
    {'a': 1}
    >>> check_type({"a":1}, {"b": int}) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: {'a': 1} cannot match type {'b': <... 'int'>}: key 'b' is required
    >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int}) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: 'abc' cannot match type <... 'int'>
    >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int, "abe": str}) == {'abc': 1, 'abd': 2, 'abe': 'abc'}
    True
    >>> check_type({"abc": 1, "abd": 2, "abe": "abc"}, {"~a.*": int, "?abe": str}) == {'abc': 1, 'abd': 2, 'abe': 'abc'}
    True
    >>> check_type({"abc": 1, "def": "abc"}, {"abc": int}) == {'abc': 1, 'def': 'abc'}
    True
    >>> check_type({"abc": 1, "abc": 2, "bcd": "abc", "bce": "abd"}, {"~a.*": int, "~b.*": str}) == {"abc": 1, "abc": 2, "bcd": "abc", "bce": "abd"}
    True
    >>> my_type = (str, [])
    >>> my_type[1].append(my_type)
    >>> check_type(1, my_type) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: 1 cannot match type (<... 'str'>, [(...)])
    >>> my_obj = []
    >>> my_obj.append(my_obj)
    >>> my_obj.append(1)
    >>> check_type(my_obj, my_type) # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    TypeMismatchException: [[...], 1] cannot match type (<... 'str'>, [(...)])
    >>> my_obj = []
    >>> my_obj.append(my_obj)
    >>> my_obj.append("abc")
    >>> check_type(my_obj, my_type)
    [[...], 'abc']
    >>> my_type = []
    >>> my_type2 = {"a": my_type, "b": my_type}
    >>> my_type.append(my_type2)
    >>> my_obj = {}
    >>> my_obj['a'] = my_obj
    >>> my_obj['b'] = my_obj
    >>> r = check_type(my_obj, my_type)
    >>> r[0]['a'][0] is r[0]['b'][0]
    True
    >>> r[0]['a'][0] is r[0]
    True
    >>> r = check_type(my_obj, my_type2)
    >>> r['a'][0] is r['b'][0]
    True
    >>> r['a'][0] is r
    True
    >>> my_obj2 = []
    >>> my_obj2.append(my_obj2)
    >>> my_obj2.append(1)
    >>> my_obj = [my_obj2, my_obj2]
    >>> my_type = []
    >>> my_type.append((int, my_type))
    >>> check_type(my_obj, my_type)
    [[[...], 1], [[...], 1]]
    >>> r = _
    >>> r[0] is r[1]
    True
    >>> my_type = []
    >>> my_type.append(([int], my_type))
    >>> check_type(my_obj, my_type)
    [[[...], [1]], [[...], [1]]]
    >>> r = _
    >>> r[0] is r[1]
    True
    """
    return _check_type_inner(value, type)

Let's go through the key implementation details:
import re

# Python 2/3 compatibility: `long` and `unicode` only exist in Python 2
try:
    _long = long
except Exception:
    _long = int

try:
    _unicode = unicode
except Exception:
    _unicode = str

# TypeMismatchException and InvalidTypeException are not shown in this excerpt;
# they are assumed to be defined elsewhere in the reference code.

def _check_type_inner(value, type_, _recursive_check = None):
    # print('Check type:', value, id(value), type_, id(type_))
    if _recursive_check is None:
        # current, succeeded, failed, listloop
        _recursive_check = ({}, {}, {}, set())
    current_check, succeded_check, failed_check, list_loop = _recursive_check
    # Use (id(value), id(type)) to store matches that are done before
    check_id = (id(value), id(type_))
    if check_id in succeded_check:
        # This match is already done, return the result
        # print('Hit succeeded cache:', succeded_check[check_id], id(succeded_check[check_id]))
        return succeded_check[check_id]
    elif check_id in failed_check:
        # This match has already failed, raise the stored exception
        raise failed_check[check_id]
    elif check_id in current_check:
        # print('Hit current cache:', current_check[check_id], id(current_check[check_id]))
        # This match is still in progress and its final result depends on itself.
        # Return the partially built object to form a recursive structure.
        return current_check[check_id]
    return_value = None
    try:
        if type_ is None:
            # Match None only
            if value is not None:
                raise TypeMismatchException(value, type_)
            else:
                return_value = value
        elif type_ == ():
            # Match anything except None
            if value is None:
                raise TypeMismatchException(value, type_)
            else:
                return_value = value
        elif type_ is int or type_ is _long:
            # Enhanced behavior when matching int type: long is also matched; bool is NOT matched
            if not isinstance(value, bool) and (isinstance(value, int) or isinstance(value, _long)):
                return_value = value
            else:
                raise TypeMismatchException(value, type_)
        elif type_ is str or type_ is _unicode:
            # Enhanced behavior when matching str: unicode is always matched (even in Python2)
            if isinstance(value, str) or isinstance(value, _unicode):
                return_value = value
            else:
                raise TypeMismatchException(value, type_)
        elif isinstance(type_, type):
            if isinstance(value, type_):
                return_value = value
            else:
                raise TypeMismatchException(value, type_)
        elif isinstance(type_, tuple):
            # Try each subtype in order; the first one that matches wins
            for subtype in type_:
                try:
                    return_value = _check_type_inner(value, subtype, _recursive_check)
                except TypeMismatchException:
                    continue
                else:
                    break
            else:
                raise TypeMismatchException(value, type_)
        elif isinstance(type_, list):
            if len(type_) > 1:
                raise InvalidTypeException(type_, "list must contain 0 or 1 valid inner type")
            if not type_:
                # matches any list or tuple
                if isinstance(value, list) or isinstance(value, tuple):
                    return_value = list(value)
                else:
                    return_value = [value]
            else:
                subtype = type_[0]
                if isinstance(value, list) or isinstance(value, tuple):
                    # matches a list or tuple with all inner objects matching subtype
                    current_result = []
                    # save the reference to the list
                    current_check[check_id] = current_result
                    # back up succeeded checks: they may depend on the current result.
                    # If this match fails, all succeeded checks added meanwhile are discarded
                    _new_recursive_check = (current_check, dict(succeded_check), failed_check, set())
                    current_result.extend(_check_type_inner(o, subtype, _new_recursive_check) for o in value)
                    # copy succeeded checks
                    succeded_check.clear()
                    succeded_check.update(_new_recursive_check[1])
                else:
                    # a non-list value like "abc" cannot match an infinite looped [[...]]
                    # when a non-list value is converted to a list, we must prevent it from forming an infinite loop
                    if check_id in list_loop:
                        raise TypeMismatchException(value, type_)
                    list_loop.add(check_id)
                    try:
                        current_result = [_check_type_inner(value, subtype, _recursive_check)]
                    finally:
                        list_loop.discard(check_id)
                return_value = current_result
        elif isinstance(type_, dict):
            if not isinstance(value, dict):
                raise TypeMismatchException(value, type_)
            if not type_:
                return_value = dict(value)
            else:
                required_keys = dict((k[1:] if isinstance(k, str) and k.startswith('!') else k, v)
                                     for k, v in type_.items()
                                     if not isinstance(k, str) or (not k.startswith('?') and not k.startswith('~')))
                # guard with isinstance so that non-string keys do not raise AttributeError
                optional_keys = dict((k[1:], v) for k, v in type_.items()
                                     if isinstance(k, str) and k.startswith('?'))
                regexp_keys = [(k[1:], v) for k, v in type_.items()
                               if isinstance(k, str) and k.startswith('~')]
                # check required keys
                for k in required_keys:
                    if k not in value:
                        raise TypeMismatchException(value, type_, 'key ' + repr(k) + ' is required')
                optional_keys.update(required_keys)
                current_result = {}
                # save the reference to the dict
                current_check[check_id] = current_result
                # back up succeeded checks: they may depend on the current result.
                # If this match fails, all succeeded checks added meanwhile are discarded
                _new_recursive_check = (current_check, dict(succeded_check), failed_check, set())
                for k, v in value.items():
                    if k in optional_keys:
                        current_result[k] = _check_type_inner(v, optional_keys[k], _new_recursive_check)
                    else:
                        for rk, rv in regexp_keys:
                            if re.search(rk, k):
                                current_result[k] = _check_type_inner(v, rv, _new_recursive_check)
                                break
                        else:
                            # key not described by the type: keep the value unchanged
                            current_result[k] = v
                # copy succeeded checks
                succeded_check.clear()
                succeded_check.update(_new_recursive_check[1])
                return_value = current_result
        else:
            raise InvalidTypeException(type_, "Unrecognized type")
    except Exception as exc:
        # This match fails, store the exception
        failed_check[check_id] = exc
        if check_id in current_check:
            del current_check[check_id]
        raise
    else:
        # This match succeeded
        if check_id in current_check:
            del current_check[check_id]
        # Store the result so that the same (value, type) pair returns the same object
        succeded_check[check_id] = return_value
        return return_value

The hardest part of this implementation is handling recursive structures. The key detail is that the recursive calls pass along a parameter like this:
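if _recursive_check is None:
    # current, succeeded, failed, listloop
    _recursive_check = ({}, {}, {}, set())

It consists of three dicts and one set, all keyed by (id(value), id(type_)). In Python, id() returns an identifier that is unique to an object for as long as that object is alive. We use ids rather than the objects themselves because objects such as dict and list are unhashable and cannot be put into a set or used as dict keys; ids do not have this problem. Identifying objects by id is generally discouraged, because an object may be garbage-collected and a different object may then reuse the same id. Here, however, the objects are not modified, they stay alive for the whole call, and the ids are only used temporarily, so this is not a real problem. The json and pickle modules in the Python standard library use the same technique.
Using the (value, type) pair as the key is also essential. The value alone is not enough, because the same value may be matched against different types and produce different results. As long as the value and the type do not change, though, the match result is unique.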
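As a small illustration of why the cache is keyed by a pair of ids, here is a minimal sketch. `make_check_id` is a hypothetical helper name introduced only for this demonstration; the reference code builds the tuple inline.

```python
def make_check_id(value, type_):
    # Hypothetical helper: builds the same (id(value), id(type_)) pair
    # that the article uses as a cache key.
    return (id(value), id(type_))

data = [1, 2, 3]
list_of_int = [int]
list_of_str = [str]

cache = {}
try:
    # Using the objects themselves fails: lists are unhashable
    cache[(data, list_of_int)] = "result"
except TypeError as exc:
    print("cannot use the objects themselves:", exc)

# ids are plain integers, so the pair is hashable
cache[make_check_id(data, list_of_int)] = "matched against [int]"
cache[make_check_id(data, list_of_str)] = "matched against [str]"

# Same value, two different types: two independent cache entries
print(len(cache))  # 2
```

The last line shows why the value alone would not be a sufficient key: the same `data` object has one entry per type it was checked against.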
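The four objects have the following meanings:
- current: records which matches are still unfinished at this point of the recursion; these are exactly the ancestors of the current call. Each value in this map refers to a list or dict that is being built by the match and is still incomplete. When a cycle is detected, the value stored in current is used directly instead of recursing further, which prevents infinite recursion.
- succeeded: records every match that has already succeeded. It is needed so that objects sharing one reference in the input also share one result. For a structure like [my_obj, my_obj], we want the two items of the returned list to be the same object as well. Each value in this map refers to the successfully built result.
- failed: records matches that have already failed. The algorithm generally works without it, but in some cases it reduces the number of attempts. Because a tuple can express "any one of several types", a failed match on a substructure does not always make the whole structure fail.
- listloop: prevents a match between an infinitely nested structure and one that is not infinitely nested.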
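The roles of current and succeeded can be seen in isolation in a much smaller problem. The sketch below is not the reference implementation: `to_lists` is a hypothetical function that only copies nested lists and tuples into lists, with no type checking and no failure handling, but it uses the same two maps to preserve cycles and shared references.

```python
def to_lists(value, _current=None, _done=None):
    """Copy nested lists/tuples into lists, preserving cycles and shared references."""
    if _current is None:
        _current, _done = {}, {}
    if not isinstance(value, (list, tuple)):
        return value
    key = id(value)
    if key in _done:
        # Finished earlier: reuse the same result object (the "succeeded" map)
        return _done[key]
    if key in _current:
        # An ancestor call is still building this result: return the partial
        # object to close the loop (the "current" map)
        return _current[key]
    result = []
    _current[key] = result          # register the partial result before recursing
    result.extend(to_lists(item, _current, _done) for item in value)
    del _current[key]
    _done[key] = result
    return result

a = []
a.append(a)
r = to_lists(a)
print(r[0] is r)        # True: the cycle is preserved

shared = [1, 2]
r2 = to_lists([shared, shared])
print(r2[0] is r2[1])   # True: the shared reference is preserved
```

Because nothing can fail in this simplified version, it needs neither the failed map nor the rollback of succeeded that the real matcher performs.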
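---- Watch out: big pitfall here ----
The fourth item is very easy to overlook. Consider this test case:

my_type = []
my_type.append(my_type)
check_type(1, my_type)

In the first version of the implementation, this expression matched successfully! The result was [[...]].
The reason is that we specified that a non-list value can be converted automatically into a one-item list. In other words, 1 can match int, and also [int], and therefore [[int]], and therefore [[[int]]], and so on.
So... perhaps it can also match [[...]], the infinitely nested list? After all, with infinitely many levels, maybe there is an int at the very bottom...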
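my_type = []
my_type.append((str, my_type))
check_type(1, my_type)

The key to solving this is the following rule: during matching, once we have converted a non-list value into a list, it must not be matched against the same list type again before it has successfully matched a non-list type. That sentence is a bit convoluted; in code it is the final set. When we convert a value into a list, we put a marker into this set. If we find the marker already set while recursing further, we have run into an infinitely nested structure and must report a mismatch. When the recursive call returns, whether it succeeded or raised, the marker is removed (the code does this in a finally block).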
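The guard can be sketched on a deliberately reduced matcher. Everything here is an assumption made for illustration: `match` and `Mismatch` are hypothetical names, the only supported types are `int` and one-element lists, and the current/succeeded/failed caches are left out, so this sketch must not be given cyclic values.

```python
class Mismatch(Exception):
    """Hypothetical stand-in for TypeMismatchException."""

def match(value, type_, _list_loop=None):
    if _list_loop is None:
        _list_loop = set()
    if type_ is int:
        if isinstance(value, int) and not isinstance(value, bool):
            return value
        raise Mismatch(value)
    # Otherwise type_ is assumed to be a one-element list [subtype]
    subtype = type_[0]
    if isinstance(value, list):
        # A real list level was consumed, so the items start with a fresh guard set
        return [match(item, subtype, set()) for item in value]
    check_id = (id(value), id(type_))
    if check_id in _list_loop:
        # This value was already wrapped for this very type without any
        # progress in between: the type is infinitely nested
        raise Mismatch(value)
    _list_loop.add(check_id)
    try:
        return [match(value, subtype, _list_loop)]
    finally:
        _list_loop.discard(check_id)

print(match(1, [[int]]))    # [[1]]

loop_type = []
loop_type.append(loop_type)
try:
    match(1, loop_type)
    looped = False
except Mismatch:
    looped = True
print(looped)               # True
```

Without the `_list_loop` check, the second call would keep wrapping 1 into ever deeper lists until the interpreter hits its recursion limit.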
However, once a non-list item has been matched successfully, the marker has to be cleared for the rest of the recursion. Consider this test case:
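>>> my_type = []
>>> my_type2 = {"a": my_type, "b": my_type}
>>> my_type.append(my_type2)
>>> my_obj = {}
>>> my_obj['a'] = my_obj
>>> my_obj['b'] = my_obj
>>> check_type(my_obj, my_type)
[{'a': [{...}], 'b': [{...}]}]

Here an infinitely nested structure that contains no lists is converted into an infinitely nested structure with an extra list layer at every level, and this match is legitimate. In the recursive call for dict, we used this statement: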
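# back up succeeded checks: they may depend on the current result. If this match fails, all succeeded checks added meanwhile are discarded
_new_recursive_check = (current_check, dict(succeded_check), failed_check, set())

This supplies a new set in place of the previous one. Why not use clear()? Because if the call fails and the exception propagates back to the caller, the previous markers have to be in place again.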
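Another point concerns the use of succeeded. When the data contains recursive structures, results recorded in succeeded may include matches from current that are not yet finished, and those are not guaranteed to succeed in the end. So whenever a recursive call creates a current entry and that entry eventually fails, to be safe we must discard every succeeded entry added during that recursive call, because some of them may contain the current match that has now failed. This is why _new_recursive_check creates a new succeded_check map and only writes it back to the previous map on success.
failed does not have this problem: if a match fails, it fails under all circumstances.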
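The copy-then-commit handling of succeeded can be shown on its own. This is only a sketch of the pattern, with hypothetical names (`run_with_tentative_cache`, `good_step`, `bad_step`); in the reference code the same three steps are written inline around the recursive calls.

```python
def run_with_tentative_cache(cache, step):
    tentative = dict(cache)      # work on a copy of the cache
    result = step(tentative)     # may raise; `cache` is untouched in that case
    cache.clear()
    cache.update(tentative)      # commit the new entries only after success
    return result

def good_step(c):
    c["b"] = 2
    return "ok"

def bad_step(c):
    c["c"] = 3                   # this entry must not survive the failure
    raise ValueError("match failed")

cache = {"a": 1}
run_with_tentative_cache(cache, good_step)
try:
    run_with_tentative_cache(cache, bad_step)
except ValueError:
    pass
print(sorted(cache))  # ['a', 'b']
```

Updating the original dict in place, rather than rebinding a name to the copy, matters because callers further up the stack hold a reference to the same dict object.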
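The code also makes a few detailed changes to the behavior:
- In both Python 2 and Python 3, int and long are treated as the same type, and str and unicode are treated as the same type.
- bool values (True, False) are not allowed to match the int type. In Python, bool is a subclass of int and True is effectively 1, so isinstance(True, int) returns True. Our program handles this case specially.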
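The bool special case is easy to check interactively. The sketch below targets Python 3 only (so there is no long), and `is_strict_int` is a hypothetical helper name that mirrors the condition used in the int branch of the matcher.

```python
def is_strict_int(value):
    # bool is a subclass of int, so it has to be excluded explicitly
    return isinstance(value, int) and not isinstance(value, bool)

print(issubclass(bool, int))    # True
print(isinstance(True, int))    # True
print(True == 1)                # True
print(is_strict_int(True))      # False
print(is_strict_int(1))         # True
```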
From today on, please call me the generalized-list matching maniac.