FuzzyWuzzy: A Fuzzy String Matching Library

MedusaSorcerer's Blog


Introduction

FuzzyWuzzy is a third-party library for fuzzy string matching. It uses Levenshtein distance to measure how different two given strings are.
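
To make "Levenshtein distance" concrete, here is a minimal sketch of the classic dynamic-programming definition: the smallest number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The function name `levenshtein` is my own choice for illustration; this is not FuzzyWuzzy's code and is not necessarily how the library computes its scores internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (illustrative sketch)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

A smaller distance means the strings are more alike. FuzzyWuzzy reports a similarity score from 0 to 100 rather than a raw distance.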

GitHub project page

Installation

Install it with pip:

python -m pip install fuzzywuzzy

Or install it together with the optional python-Levenshtein library, which speeds up matching:

python -m pip install fuzzywuzzy[speedup]

If python-Levenshtein is not installed, importing the library prints this warning:

UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning

Install the library that removes the warning:

python -m pip install python-Levenshtein
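
If you want to know ahead of time whether the warning will show up, you can check whether the optional module is importable. This is only a sketch: the helper name `has_levenshtein_speedup` is hypothetical, and it assumes that python-Levenshtein installs a top-level module named `Levenshtein`.

```python
import importlib.util


def has_levenshtein_speedup() -> bool:
    """Return True if a module named Levenshtein can be found (assumed module name)."""
    return importlib.util.find_spec("Levenshtein") is not None


if has_levenshtein_speedup():
    print("python-Levenshtein found, fast path available")
else:
    print("python-Levenshtein not found, expect the slow pure-python warning")
```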

Supported testing tools

  • pycodestyle
  • hypothesis
  • pytest

Basic usage

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from fuzzywuzzy import fuzz, process


# Simple ratio
print(fuzz.ratio("this is a medusa blog", "this is a blog!"))  # 78

# Partial ratio: scores the best-matching substring
print(fuzz.partial_ratio("this is a test", "this is a test!"))  # 100
print(fuzz.partial_ratio("It is never too old to learn", "It is never too old to learn!"))  # 100
print(fuzz.partial_ratio("no cross, no crown.", "No cross, no crown."))  # 95

# Token sort ratio: ignores word order
print(fuzz.ratio("Medusa Sorcerer Blog", "Blog Medusa Sorcerer"))  # 75
print(fuzz.token_sort_ratio("Medusa Sorcerer Blog", "Blog Medusa Sorcerer"))  # 100

# Token set ratio: ignores duplicated words as well
print(fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 84
print(fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100

# process.extract: match a query against a list of choices; returns the matches
# sorted by score in descending order: [("string", score), ...]
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(process.extract("new york jets", choices, limit=2))  # [('New York Jets', 100), ('New York Giants', 79)]
print(process.extract("new york jets", choices, limit=4))  # [('New York Jets', 100), ('New York Giants', 79), ('Atlanta Falcons', 29), ('Dallas Cowboys', 22)]
print(process.extract("new york jets", choices))  # [('New York Jets', 100), ('New York Giants', 79), ('Atlanta Falcons', 29), ('Dallas Cowboys', 22)]

# process.extractOne: returns only the best match as a tuple: ("string", score)
print(process.extractOne("cowboys", choices))  # ('Dallas Cowboys', 90)

# Pass the scorer argument to change how matches are scored, e.g. when matching file paths
songs = ["System of a down", "fly", "I am"]
print(process.extractOne("System of a down - Hypnotize - Heroin", songs))  # ('System of a down', 90)
print(process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio))  # ('System of a down', 65)
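
To show why the token-based scorers behave differently from the plain ratio, here is a rough sketch that rebuilds the idea with only the standard library's difflib. Everything in it (`simple_ratio`, `tokens`, `sketch_token_sort_ratio`, `sketch_token_set_ratio`) is a hypothetical helper written for this post, and the preprocessing is simplified, so treat it as an approximation of the technique and not as FuzzyWuzzy's actual implementation. Scores can differ from the library's on other inputs.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity as a 0-100 integer, from difflib's 2*M/T ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def tokens(s):
    """Lowercase the string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())


def sketch_token_sort_ratio(a, b):
    """Sort the tokens of each string first, so word order stops mattering."""
    return simple_ratio(" ".join(sorted(tokens(a))), " ".join(sorted(tokens(b))))


def sketch_token_set_ratio(a, b):
    """Work on token sets, so repeated words stop mattering too."""
    ta, tb = set(tokens(a)), set(tokens(b))
    common = " ".join(sorted(ta & tb))
    with_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    with_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        simple_ratio(common, with_a),
        simple_ratio(common, with_b),
        simple_ratio(with_a, with_b),
    )


print(simple_ratio("Medusa Sorcerer Blog", "Blog Medusa Sorcerer"))  # 75
print(sketch_token_sort_ratio("Medusa Sorcerer Blog", "Blog Medusa Sorcerer"))  # 100
print(sketch_token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 84
print(sketch_token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # 100
```

Sorting the tokens makes "Medusa Sorcerer Blog" and "Blog Medusa Sorcerer" identical before comparison, and reducing them to sets removes the extra "fuzzy", which is why those two cases reach 100.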

Known ports
