How can you determine the similarity between a source string and several other strings in Python?

Suppose we have a source string:

Humpty dumpty <span id="1">sat</span> on a wall, humpty dumpty had a great fall. All of <span id="two">the kings</span> horses and all the kings men.

and a list of several other strings, each on its own line:

Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.

Humpty dumpty sat on the wall, all of the kings horses and all the kings men.

There is a humpty dumpty who had sat on the wall, and all of the kings horses and all the kings men.

Humpty dumpty sat on some wall, humpty dumpty had a great fall. All of the kings horses and all the kings men couldn't put him together again.

Humpty dumpty this is a completely related sentence.

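Starting from the source string, we want to find which entry in the list of other strings matches it best. Using Python, we want a score for each pair of source string and candidate string, and then to pick the candidate with the best score.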

2. Solution:

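One option is to compute the Levenshtein distance between the source string and each candidate and use it as the measure of similarity. The Levenshtein distance is the minimum number of single-character operations (insertions, deletions, and substitutions) needed to turn one string into the other, so a smaller distance means a closer match. The python-Levenshtein package, imported as Levenshtein, provides a distance function that computes it.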
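To make the definition concrete, here is a minimal pure-Python sketch of the distance, using the standard dynamic-programming recurrence. The function name `levenshtein_distance` is made up for illustration and is not part of any library. A compiled package will be much faster on long strings.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the shorter string rather than with the product of both lengths.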

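# Requires the python-Levenshtein package (pip install python-Levenshtein), which is imported as Levenshtein.
from Levenshtein import distance as levenshtein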

def find_most_similar_string(source_string, other_strings):
  """
  Finds the string in the list of other strings that is most similar to the source string.

  Args:
    source_string: The source string to compare against.
    other_strings: A list of strings to compare the source string to.

  Returns:
    The string in the list of other strings that is most similar to the source string.
  """

  # Initialize the minimum Levenshtein distance and the most similar string.
  min_levenshtein_distance = float('inf')
  most_similar_string = None

  # Iterate over the other strings and find the one with the minimum Levenshtein distance to the source string.
  for other_string in other_strings:
    levenshtein_distance = levenshtein(source_string, other_string)
    if levenshtein_distance < min_levenshtein_distance:
      min_levenshtein_distance = levenshtein_distance
      most_similar_string = other_string

  return most_similar_string


# Single quotes on the outside, because the HTML attributes inside use double quotes.
source_string = 'Humpty dumpty <span id="1">sat</span> on a wall, humpty dumpty had a great fall. All of <span id="two">the kings</span> horses and all the kings men.'

other_strings = [
  "Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.",
  "Humpty dumpty sat on the wall, all of the kings horses and all the kings men.",
  "There is a humpty dumpty who had sat on the wall, and all of the kings horses and all the kings men.",
  "Humpty dumpty sat on some wall, humpty dumpty had a great fall. All of the kings horses and all the kings men couldn't put him together again.",
  "Humpty dumpty this is a completely related sentence."
]

most_similar_string = find_most_similar_string(source_string, other_strings)

print(most_similar_string)

Output:

Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.

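In this example, the closest match is the first candidate, which is the same sentence as the source string once the <span> tags are ignored. That is the expected result for this input. The distance itself is not zero, because the tags in the source string count as characters to delete.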
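The function above returns only the best match, while the problem also asks for a score for each pair. One possible way to get scores, sketched here under assumptions, is to strip the markup with a rough regular expression and then use `difflib.SequenceMatcher.ratio()` from the standard library. This is a different similarity measure from the Levenshtein distance: it ranges from 0.0 to 1.0, and higher means more similar. The helper names `strip_tags` and `rank_by_similarity` are hypothetical, and the regex is not a real HTML parser.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

# Rough pattern for anything that looks like an HTML tag (an assumption, not a parser).
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove tag-like substrings so that only the visible text is compared."""
    return TAG_PATTERN.sub("", text)


def rank_by_similarity(source: str, candidates: List[str]) -> List[Tuple[float, str]]:
    """Return (score, candidate) pairs sorted from most to least similar."""
    plain_source = strip_tags(source)
    scored = [
        (SequenceMatcher(None, plain_source, candidate).ratio(), candidate)
        for candidate in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


source = (
    'Humpty dumpty <span id="1">sat</span> on a wall, humpty dumpty had a great fall. '
    'All of <span id="two">the kings</span> horses and all the kings men.'
)
candidates = [
    "Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.",
    "Humpty dumpty sat on the wall, all of the kings horses and all the kings men.",
    "Humpty dumpty this is a completely related sentence.",
]

ranked = rank_by_similarity(source, candidates)
for score, candidate in ranked:
    print(f"{score:.3f}  {candidate}")
```

With the tags removed, the first candidate is identical to the source text, so it scores 1.0 and sorts first. The other scores depend on how `SequenceMatcher` aligns the strings.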