Introduction
FuzzyWuzzy is a third-party Python library for fuzzy string matching.
It uses Levenshtein distance to score how different two strings are.
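To make the idea of Levenshtein distance concrete, here is a minimal textbook sketch of the edit-distance computation. The function name `levenshtein` is illustrative and not part of FuzzyWuzzy's API; FuzzyWuzzy itself reports a normalized 0-100 similarity score rather than this raw count.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed
    # so far and b[:j]; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the dynamic-programming table are kept at a time, so memory use grows with the length of `b` rather than with the product of both lengths.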
Requirements and installation
- Python 2.7 or later
- difflib
- python-Levenshtein (optional): speeds up string matching by 4-10x, but may produce different match results under certain conditions
Install with pip:
python -m pip install fuzzywuzzy
Or, to install the python-Levenshtein library along with it:
python -m pip install fuzzywuzzy[speedup]
Without python-Levenshtein, importing the library emits this warning:
UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
Installing python-Levenshtein removes the warning:
python -m pip install python-Levenshtein
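The warning refers to `difflib.SequenceMatcher` from Python's standard library, which acts as the slower pure-Python fallback. The sketch below shows roughly how a 0-100 score can be derived from it. The helper name `simple_score` is an assumption for illustration, and FuzzyWuzzy's own preprocessing steps are left out, so treat it as an approximation rather than the library's implementation.

```python
from difflib import SequenceMatcher


def simple_score(a: str, b: str) -> int:
    """Similarity of two strings as an integer from 0 to 100."""
    # ratio() returns 2*M/T, where M is the number of matched characters
    # and T is the combined length of both strings.
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(simple_score("this is a test", "this is a test!"))  # 97
```

Here 14 characters match out of a combined length of 29, so the ratio is 28/29, which rounds to 97.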
Testing tools
- pycodestyle
- hypothesis
- pytest
Basic usage
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from fuzzywuzzy import fuzz, process
# Simple ratio
print(fuzz.ratio("this is a medusa blog", "this is a blog!")) # 78
# Partial ratio: scores the best-matching substring
print(fuzz.partial_ratio("this is a test", "this is a test!")) # 100
print(fuzz.partial_ratio("It is never too old to learn", "It is never too old to learn!")) # 100
print(fuzz.partial_ratio("no cross, no crown.", "No cross, no crown.")) # 95
# Token sort ratio: ignores word order
print(fuzz.ratio("Medusa Sorcerer Blog", "Blog Medusa Sorcerer")) # 75
print(fuzz.token_sort_ratio("Medusa Sorcerer Blog", "Blog Medusa Sorcerer")) # 100
# Token set ratio: ignores duplicate words and matches on the shared subset of tokens
print(fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")) # 84
print(fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")) # 100
# process.extract: scores every choice and returns the matches in descending order of similarity: [("string", score), ...]
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(process.extract("new york jets", choices, limit=2)) # [('New York Jets', 100), ('New York Giants', 79)]
print(process.extract("new york jets", choices, limit=4)) # [('New York Jets', 100), ('New York Giants', 79), ('Atlanta Falcons', 29), ('Dallas Cowboys', 22)]
print(process.extract("new york jets", choices)) # [('New York Jets', 100), ('New York Giants', 79), ('Atlanta Falcons', 29), ('Dallas Cowboys', 22)]
# process.extractOne: returns only the best match as a tuple: ("string", score)
print(process.extractOne("cowboys", choices)) # ('Dallas Cowboys', 90)
# process functions accept an extra scorer argument to change how matches are scored, e.g. when matching file paths
songs = ["System of a down", "fly", "I am"]
print(process.extractOne("System of a down - Hypnotize - Heroin", songs)) # ('System of a down', 90)
print(process.extractOne("System of a down - Hypnotize - Heroin", songs, scorer=fuzz.token_sort_ratio)) # ('System of a down', 65)
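The token-sort and ranking behaviour shown above can be approximated with the standard library alone. The sketch below is a simplified illustration, not FuzzyWuzzy's actual code: `token_sort_score` and `rank` are hypothetical names, and the preprocessing is reduced to lowercasing and whitespace splitting.

```python
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> int:
    """Score two strings after lowercasing and sorting their tokens."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sa, sb).ratio())


def rank(query, choices, scorer=token_sort_score, limit=5):
    """Return up to `limit` (choice, score) pairs, best match first."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
print(token_sort_score("Medusa Sorcerer Blog", "Blog Medusa Sorcerer"))  # 100
print(rank("new york jets", choices, limit=1))  # [('New York Jets', 100)]
```

Passing the scorer as a parameter mirrors the `scorer=` argument used with `process.extractOne` above: the ranking logic stays the same while the notion of similarity is swapped out.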
Known ports
- Java: xpresso's fuzzywuzzy implementation
- Java: fuzzywuzzy (Java port)
- Rust: fuzzyrusty (Rust port)
- JavaScript: fuzzball.js (JavaScript port)
- C++: Tmplt/fuzzywuzzy
- C#: fuzzysharp (.NET port)
- Go: go-fuzzywuzzy (Go port)
- Free Pascal: FuzzyWuzzy.pas (Free Pascal port)
- Kotlin multiplatform: FuzzyWuzzy-Kotlin
- R: fuzzywuzzyR (R port)