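When comparing how similar two lists are, a common task is to find the minimal difference between them. The minimal difference is the smallest number of operations (rearranging or deleting elements) needed to make one list identical to the other.
- Solution
One common way to solve this problem is a recursive algorithm. It works as follows:
- Find the longest common subsequence (LCS) of the two lists. In the code below, this is the longest contiguous run of elements shared by both lists.
- Split both lists into two parts: the part before the LCS and the part after it.
- Apply the same algorithm recursively to each part.
- Combine the results.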
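Finding the common run fills a table with one cell per pair of elements, so a single call costs O(m·n) time and memory for lists of lengths m and n. The recursive calls on the sublists add to that, so the overall running time is at least quadratic rather than O(n log n).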
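As a cross-check for the first step, the standard library's `difflib.SequenceMatcher.find_longest_match` also locates the longest matching contiguous block of two sequences. The sketch below is not the implementation used in this article, only a stdlib counterpart; the helper name `longest_common_run` is hypothetical, and it assumes the list elements are hashable, which `difflib` requires.

```python
from difflib import SequenceMatcher


def longest_common_run(src, dst):
    """Return ((src_start, src_end), (dst_start, dst_end)), or (None, None).

    Hypothetical helper: wraps difflib instead of building the table by hand.
    """
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which would otherwise kick in for long inputs.
    matcher = SequenceMatcher(None, src, dst, autojunk=False)
    match = matcher.find_longest_match(0, len(src), 0, len(dst))
    if match.size == 0:
        return None, None
    return ((match.a, match.a + match.size),
            (match.b, match.b + match.size))


print(longest_common_run([1, 2, 3, 4], [0, 1, 2, 3, 5]))  # ((0, 3), (1, 4))
print(longest_common_run([0, 1, 2, 3], [1, 2, 4, 5]))     # ((1, 3), (0, 2))
```

When several runs tie for the longest length, `difflib` reports the earliest one, so its answer may differ from a hand-written version that breaks ties differently.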
import itertools


def _longest_common_subseq(src, dst):
    """Returns a pair of ranges of the longest common subsequence for the
    `src` and `dst` lists. The match must be contiguous in both lists.

    >>> src = [1, 2, 3, 4]
    >>> dst = [0, 1, 2, 3, 5]
    >>> # The longest common subsequence for these lists is [1, 2, 3]
    ... # which is located at (0, 3) index range for src list and (1, 4) for
    ... # dst one. Tuple of these ranges we should get back.
    ... assert ((0, 3), (1, 4)) == _longest_common_subseq(src, dst)
    """
    lsrc, ldst = len(src), len(dst)
    drange = list(range(ldst))
    # matrix[i][j] is the length of the common run ending at src[i] and dst[j]
    matrix = [[0] * ldst for _ in range(lsrc)]
    z = 0  # length of the longest subsequence
    range_src, range_dst = None, None
    for i, j in itertools.product(range(lsrc), drange):
        if src[i] == dst[j]:
            if i == 0 or j == 0:
                matrix[i][j] = 1
            else:
                matrix[i][j] = matrix[i-1][j-1] + 1
            if matrix[i][j] > z:
                z = matrix[i][j]
            if matrix[i][j] == z:
                # convert the run's end position into half-open index ranges
                range_src = (i-z+1, i+1)
                range_dst = (j-z+1, j+1)
        else:
            matrix[i][j] = 0
    return range_src, range_dst


def split_by_common_seq(src, dst, bx=(0, -1), by=(0, -1)):
    r"""Recursively splits the lists into two parts, left and right,
    around their common subsequence.

    The left part contains the differences to the left of the common
    subsequence, and the right part contains those to the right of it.

    To easily understand the process let's take two lists: [0, 1, 2, 3] as
    `src` and [1, 2, 4, 5] for `dst`. If we tried to generate a binary tree
    where nodes are common subsequences for both lists, leaves on the left
    side are subsequences for the `src` list and leaves on the right one for
    `dst`, our tree would look like::

            [1, 2]
           /      \
        [0]        []
                  /  \
               [3]    [4, 5]

    This function generates a similar structure as a flat tree, but without
    the nodes holding common subsequences, since we don't need them. Only
    the left and right leaves remain::

              []
           /      \
        [0]        []
                  /  \
               [3]    [4, 5]

    The `bx` is the absolute range for the currently processed subsequence
    of the `src` list. The `by` means the same, but for the `dst` list.
    """
    # Prevent useless comparisons in future
    bx = bx if bx[0] != bx[1] else None
    by = by if by[0] != by[1] else None

    if not src:
        return [None, by]
    elif not dst:
        return [bx, None]

    # note that these ranges are relative to the processed sublists
    x, y = _longest_common_subseq(src, dst)

    if x is None or y is None:  # no common subsequence left
        return [bx, by]

    # Offsets into `dst` must be based on by[0], not bx[0]
    return [split_by_common_seq(src[:x[0]], dst[:y[0]],
                                (bx[0], bx[0] + x[0]),
                                (by[0], by[0] + y[0])),
            split_by_common_seq(src[x[1]:], dst[y[1]:],
                                (bx[0] + x[1], bx[0] + len(src)),
                                (by[0] + y[1], by[0] + len(dst)))]


# Example 1
src = [0, 1, 2, 3]
dst = [1, 2, 4, 5]
result = split_by_common_seq(src, dst)
print(result)
# Output: [[(0, 1), None], [(3, 4), (2, 4)]]
# Example 2
src = [1, 2, 3, 4, 5]
dst = [1, 2, 3, 4, 5, 6, 7, 8]
result = split_by_common_seq(src, dst)
print(result)
# Output: [[None, None], [None, (5, 8)]]
# Example 3
src = [4, 5]
dst = [1, 2, 3, 4, 5, 6, 7, 8]
result = split_by_common_seq(src, dst)
print(result)
# Output: [[None, (0, 3)], [None, (5, 8)]]
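
The nested result can be hard to read at a glance. Each leaf pairs a half-open index range in `src` (elements that are not part of any common run) with a range in `dst`, and `None` means there is nothing on that side. The following is a minimal sketch of one way to turn the ranges back into concrete elements; the helper name `iter_leaves` is hypothetical, and the tree is hard-coded from the output of Example 1 so that the snippet runs on its own.

```python
def iter_leaves(tree):
    """Yield (src_range, dst_range) leaves of the nested result, left to right.

    Hypothetical helper: an inner node is a list of two lists, while a leaf
    is a list whose items are index tuples or None.
    """
    left, right = tree
    if isinstance(left, list):
        yield from iter_leaves(left)
        yield from iter_leaves(right)
    else:
        yield left, right


src = [0, 1, 2, 3]
dst = [1, 2, 4, 5]
tree = [[(0, 1), None], [(3, 4), (2, 4)]]  # output of Example 1

removed, added = [], []
for src_range, dst_range in iter_leaves(tree):
    if src_range is not None:
        removed.extend(src[slice(*src_range)])  # present only in src
    if dst_range is not None:
        added.extend(dst[slice(*dst_range)])    # present only in dst

print(removed)  # [0, 3]
print(added)    # [4, 5]
```

One caveat follows from the default arguments: if the two lists share no element at all, the top-level call returns the placeholder ranges `(0, -1)` unchanged, and this sketch does not special-case that situation.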