Suppose we have a source string:
Humpty dumpty <span id="1">sat</span> on a wall, humpty dumpty had a great fall. All of <span id="two">the kings</span> horses and all the kings men.
and a list of candidate strings, one per line:
Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.
Humpty dumpty sat on the wall, all of the kings horses and all the kings men.
There is a humpty dumpty who had sat on the wall, and all of the kings horses and all the kings men.
Humpty dumpty sat on some wall, humpty dumpty had a great fall. All of the kings horses and all the kings men couldn't put him together again.
Humpty dumpty this is a completely related sentence.
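Given the source string, we want to use Python to compute a "score" for each pairing of the source string with a candidate string from the list, and then use those scores to decide which candidate matches the source most closely.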
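2. Solution:
The PyLevenshtein module can compute the Levenshtein distance, which then serves as a measure of how similar two strings are. The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, and substitutions) needed to turn one string into the other, so a smaller distance means a closer match.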
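To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming computation of the Levenshtein distance. It is not PyLevenshtein's implementation, and the name `edit_distance` is a hypothetical one chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the rows short
    # previous[j] holds the distance between the empty string and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Turning "kitten" into "sitting" takes two substitutions and one insertion, hence the distance of 3. A library function computes the same quantity and is the better choice for long strings or large lists.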
# PyLevenshtein is distributed as python-Levenshtein and imported as Levenshtein;
# its distance() function returns the edit distance between two strings.
from Levenshtein import distance as levenshtein


def find_most_similar_string(source_string, other_strings):
    """
    Finds the string in the list of other strings that is most similar to the source string.

    Args:
        source_string: The source string to compare against.
        other_strings: A list of strings to compare the source string to.

    Returns:
        The string in the list of other strings that is most similar to the source string,
        or None if the list is empty.
    """
    # Initialize the minimum Levenshtein distance and the most similar string.
    min_levenshtein_distance = float('inf')
    most_similar_string = None
    # Iterate over the other strings and keep the one with the smallest distance to the source.
    for other_string in other_strings:
        levenshtein_distance = levenshtein(source_string, other_string)
        if levenshtein_distance < min_levenshtein_distance:
            min_levenshtein_distance = levenshtein_distance
            most_similar_string = other_string
    return most_similar_string


# Single quotes on the outside so the double quotes in the id attributes need no escaping.
source_string = 'Humpty dumpty <span id="1">sat</span> on a wall, humpty dumpty had a great fall. All of <span id="two">the kings</span> horses and all the kings men.'
other_strings = [
    "Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.",
    "Humpty dumpty sat on the wall, all of the kings horses and all the kings men.",
    "There is a humpty dumpty who had sat on the wall, and all of the kings horses and all the kings men.",
    "Humpty dumpty sat on some wall, humpty dumpty had a great fall. All of the kings horses and all the kings men couldn't put him together again.",
    "Humpty dumpty this is a completely related sentence."
]
most_similar_string = find_most_similar_string(source_string, other_strings)
print(most_similar_string)
Output:
Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.
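In this example, the best match is the source string with its <span> markup removed. Its Levenshtein distance from the source is 42, which is exactly the number of characters in the four tags, so the markup accounts for the entire difference. The result is what we would expect, although a single example illustrates the method rather than proving it.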
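The question also asks for a score for each pair, which `find_most_similar_string` does not return. The following is a minimal sketch under two assumptions: that it is acceptable to strip the markup from the source with a simple regular expression before comparing, and that the standard library's `difflib.SequenceMatcher.ratio()` is an acceptable score. That ratio is a different similarity measure from the Levenshtein distance, used here only because it needs no third-party package. The names `strip_tags` and `rank_candidates` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Naive tag matcher for this example; it is not a full HTML parser.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag."""
    return TAG_PATTERN.sub("", text)


def rank_candidates(source: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Return (score, candidate) pairs, best match first; scores lie in [0, 1]."""
    plain_source = strip_tags(source)
    scored = [
        (SequenceMatcher(None, plain_source, candidate).ratio(), candidate)
        for candidate in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


source_string = (
    'Humpty dumpty <span id="1">sat</span> on a wall, humpty dumpty had a '
    'great fall. All of <span id="two">the kings</span> horses and all the kings men.'
)
other_strings = [
    "Humpty dumpty sat on a wall, humpty dumpty had a great fall. All of the kings horses and all the kings men.",
    "Humpty dumpty sat on the wall, all of the kings horses and all the kings men.",
    "There is a humpty dumpty who had sat on the wall, and all of the kings horses and all the kings men.",
    "Humpty dumpty sat on some wall, humpty dumpty had a great fall. All of the kings horses and all the kings men couldn't put him together again.",
    "Humpty dumpty this is a completely related sentence.",
]

for score, candidate in rank_candidates(source_string, other_strings):
    print(f"{score:.3f}  {candidate}")
```

With the tags stripped, the first candidate is character-for-character identical to the source, so it scores 1.0 and is listed first. The remaining candidates follow in descending order of score.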