Pythonic Name Matching

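I have a database containing football team names, for example Houston and Alabama in the first item below. The names I need to match against my database names are different but still recognizable (in the first entry, Houst and Alab):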

[u'Houston', u'Alabama']
[u'Houst', u'Alab']


[u'Florida State', u'North Carolina State']
[u'NCSt', u'FlaSt']


[u'Penn State', u'Iowa']
[u'PnSt', u'Iowa']


[u'Oklahoma', u'Texas']
[u'Texas', u'Okla']


[u'Florida Atlantic', u'South Florida']
[u'SFla', u'FlAtl']


[u'Georgia', u'Tennessee']
[u'Geo', u'Tenn']


[u'San Jose State', u'Idaho']
[u'UI', u'SJSU']


[u'Washington State', u'Arizona State']
[u'ArzSt', u'WshSt']


[u'Fresno State', u'Nevada']
[u'Nevad', u'FrsSt']


[u'Oregon State', u'Arizona']
[u'ARIZ', u'OSU']


[u'Clemson', u'Virginia Tech']
[u'VTech', u'Clem']


[u'Chattanooga', u'Arkansas']
[u'UTC', u'AR']


[u'USC', u'Stanford']
[u'USC', u'Stanf']


[u'Baylor', u'Colorado']
[u'BU', u'CU']


[u'North Texas', u'Louisiana-Lafayette']
[u'NoTex', u'LaLaf']


[u'Tulane', u'Army']
[u'TLN', u'ARMY']


[u'Troy', u'Florida International']
[u'TROY', u'FIU']


[u'Louisiana-Monroe', u'Arkansas State']
[u'ASU', u'ULM']


[u'Texas Tech', u'Iowa State']
[u'TT', u'ISU']


[u'Akron', u'Western Michigan']
[u'AKRON', u'WMU']


[u'Liberty', u'Toledo']
[u'LIBERTY', u'TOLEDO']


[u'Virginia', u'Middle Tennessee']
[u'Virg', u'MTnSt']


[u'Oklahoma State', u'Texas A&M']
[u'TexAM', u'OKSt']


[u'Notre Dame', u'UCLA']
[u'NDame', u'UCLA']


[u'Rutgers', u'Cincinnati']
[u'Cincy', u'Rutgr']


[u'Ohio State', u'Purdue']
[u'Prdue', u'OhSt']


[u'LSU', u'Florida']
[u'Fla', u'LSU']


[u'Air Force', u'UNLV']
[u'AFA', u'UNLV']


[u'Nebraska', u'Missouri']
[u'Misso', u'Neb']


[u'New Mexico State', u'Boise State']
[u'NMxSt', u'BoiSt']


[u'Pittsburgh', u'Navy']
[u'Navy', u'Pitt']


[u'Wake Forest', u'Florida State']
[u'WFrst', u'FlaSt']


[u'San Jose State', u'Hawaii']
[u'Hawa', u'SJSt']


[u'UCF', u'South Florida']
[u'UCF', u'SFla']

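For each group of four names, I need to match each of my database names to the correct abbreviated name. I could do this with a long chain of if statements, but that takes a lot of code and is not very elegant. Is there a better way to do the matching?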

Solution
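
As a rough illustration of the idea behind the listing below, here is a minimal sketch that scores the two possible pairings with difflib.SequenceMatcher and keeps the one with the higher combined similarity. The helper names similarity and match_pair are hypothetical, introduced only for this example, and the sketch leaves out the thresholds that the full listing applies. It is a heuristic, so an ambiguous abbreviation could still be paired wrongly.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_pair(names, abbrevs):
    """Pair each full name with the abbreviation it most resembles.

    Hypothetical helper: only two pairings are possible (straight or
    swapped), so score both and return the better one as a dict.
    """
    straight = (similarity(names[0], abbrevs[0])
                + similarity(names[1], abbrevs[1]))
    swapped = (similarity(names[0], abbrevs[1])
               + similarity(names[1], abbrevs[0]))
    if straight >= swapped:
        return {names[0]: abbrevs[0], names[1]: abbrevs[1]}
    return {names[0]: abbrevs[1], names[1]: abbrevs[0]}


# Abbreviations listed in the same order as the full names
print(match_pair([u'Houston', u'Alabama'], [u'Houst', u'Alab']))
# Abbreviations listed in the opposite order
print(match_pair([u'Oklahoma', u'Texas'], [u'Texas', u'Okla']))
```

Comparing lowercased strings lets entries such as u'AKRON' or u'ARMY' match their mixed-case database names. A low combined score for both pairings would be a reasonable signal to check the entry by hand.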

from difflib import SequenceMatcher

li = [
    ([u'Houston', u'Alabama'],
     [u'Houst', u'Alab']),

    ([u'Florida State', u'North Carolina State'],
     [u'NCSt', u'FlaSt']),

    ([u'Penn State', u'Iowa'],
     [u'PnSt', u'Iowa']),

    ([u'Oklahoma', u'Texas'],
     [u'Texas', u'Okla']),

    ([u'Florida Atlantic', u'South Florida'],
     [u'SFla', u'FlAtl']),

    ([u'Georgia', u'Tennessee'],
     [u'Geo', u'Tenn']),

    ([u'San Jose State', u'Idaho'],
     [u'UI', u'SJSU']),

    ([u'Washington State', u'Arizona State'],
     [u'ArzSt', u'WshSt']),

    ([u'Fresno State', u'Nevada'],
     [u'Nevad', u'FrsSt']),

    ([u'Oregon State', u'Arizona'],
     [u'ARIZ', u'OSU']),

    ([u'Clemson', u'Virginia Tech'],
     [u'VTech', u'Clem']),

    ([u'Chattanooga', u'Arkansas'],
     [u'UTC', u'AR']),

    ([u'USC', u'Stanford'],
     [u'USC', u'Stanf']),

    ([u'Baylor', u'Colorado'],
     [u'BU', u'CU']),

    ([u'North Texas', u'Louisiana-Lafayette'],
     [u'NoTex', u'LaLaf']),

    ([u'Tulane', u'Army'],
     [u'TLN', u'ARMY']),

    ([u'Troy', u'Florida International'],
     [u'TROY', u'FIU']),

    ([u'Louisiana-Monroe', u'Arkansas State'],
     [u'ASU', u'ULM']),

    ([u'Texas Tech', u'Iowa State'],
     [u'TT', u'ISU']),

    ([u'Akron', u'Western Michigan'],
     [u'AKRON', u'WMU']),

    ([u'Liberty', u'Toledo'],
     [u'LIBERTY', u'TOLEDO']),

    ([u'Virginia', u'Middle Tennessee'],
     [u'Virg', u'MTnSt']),

    ([u'Oklahoma State', u'Texas A&M'],
     [u'TexAM', u'OKSt']),

    ([u'Notre Dame', u'UCLA'],
     [u'NDame', u'UCLA']),

    ([u'Rutgers', u'Cincinnati'],
     [u'Cincy', u'Rutgr']),

    ([u'Ohio State', u'Purdue'],
     [u'Prdue', u'OhSt']),

    ([u'LSU', u'Florida'],
     [u'Fla', u'LSU']),

    ([u'Air Force', u'UNLV'],
     [u'AFA', u'UNLV']),

    ([u'Nebraska', u'Missouri'],
     [u'Misso', u'Neb']),

    ([u'New Mexico State', u'Boise State'],
     [u'NMxSt', u'BoiSt']),

    ([u'Pittsburgh', u'Navy'],
     [u'Navy', u'Pitt']),

    ([u'Wake Forest', u'Florida State'],
     [u'WFrst', u'FlaSt']),

    ([u'San Jose State', u'Hawaii'],
     [u'Hawa', u'SJSt']),

    ([u'UCF', u'South Florida'],
     [u'UCF', u'SFla'])
]


def comp(N, D, sq=SequenceMatcher(None)):
    sq.set_seqs(N[0], D[0])
    a = sq.ratio()
    sq.set_seqs(N[1], D[1])
    b = sq.ratio()

    sq.set_seqs(N[0], D[1])
    x = sq.ratio()
    sq.set_seqs(N[1], D[0])
    y = sq.ratio()

    sq.set_seqs(N[0].lower(), D[0].lower())
    al = sq.ratio()
    sq.set_seqs(N[1].lower(), D[1].lower())
    bl = sq.ratio()

    sq.set_seqs(N[0].lower(), D[1].lower())
    xl = sq.ratio()
    sq.set_seqs(N[1].lower(), D[0].lower())
    yl = sq.ratio()

    if ((a > 0.5 and b > 0.5 and a + b > 1.4)
            or (al > 0.5 and bl > 0.5 and al + bl > 1.4)):
        return (N[