pyahocorasick

pyahocorasick is a fast and memory-efficient library for exact or approximate multi-pattern string search: you can find all occurrences of multiple key strings at once in some input text. The string “index” can be built ahead of time and saved (as a pickle) to disk to reload and reuse later. The library provides an ahocorasick Python module that you can use as a plain dict-like Trie or convert a Trie to an automaton for efficient Aho-Corasick search.

pyahocorasick is implemented in C and tested on Python 3.8 and up. It works on 64-bit Linux, macOS and Windows.

The license is BSD-3-Clause. Some utilities, such as the tests and the pure Python automaton, are dedicated to the Public Domain.

Testimonials

Many thanks for this package. Wasn’t sure where to leave a thank you note but this package is absolutely fantastic in our application where we have a library of 100k+ CRISPR guides that we have to count in a stream of millions of DNA sequencing reads.

This package does it faster than the previous C program we used for the purpose and helps us stick to just Python code in our pipeline.

Miika (AstraZeneca Functional Genomics Centre) github.com/WojciechMul…

Download and source code

You can fetch pyahocorasick from its GitHub repository or from PyPI.

The documentation is published at pyahocorasick.readthedocs.io/

Quick start

This module is written in C. You need a C compiler installed to compile native CPython extensions. To install:

 pip install pyahocorasick

Then create an Automaton:

 >>> import ahocorasick
 >>> automaton = ahocorasick.Automaton()

You can use the Automaton class as a trie. Add some string keys and their associated values to this trie. Here we associate a tuple of (insertion index, original string) as a value to each key string we add to the trie:

 >>> for idx, key in enumerate('he her hers she'.split()):
 ...   automaton.add_word(key, (idx, key))

Then check if some string exists in the trie:

 >>> 'he' in automaton
 True
 >>> 'HER' in automaton
 False

And play with the dict-like get() method:

 >>> automaton.get('he')
 (0, 'he')
 >>> automaton.get('she')
 (3, 'she')
 >>> automaton.get('cat', 'not exists')
 'not exists'
 >>> automaton.get('dog')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 KeyError

Now convert the trie to an Aho-Corasick automaton to enable Aho-Corasick search:

 >>> automaton.make_automaton()

Then search all occurrences of the keys (the needles) in an input string (our haystack).

Here we print the results and just check that they are correct. The Automaton.iter() method returns the results as two-tuples of the end index where a trie key was found in the input string and the associated value for this key. Here we had stored as values a tuple with the trie insertion order and the original string:

 >>> haystack = '_hershe_'
 >>> for end_index, (insert_order, original_value) in automaton.iter(haystack):
 ...     start_index = end_index - len(original_value) + 1
 ...     print((start_index, end_index, (insert_order, original_value)))
 ...     assert haystack[start_index:start_index + len(original_value)] == original_value
 ...
 (1, 2, (0, 'he'))
 (1, 3, (1, 'her'))
 (1, 4, (2, 'hers'))
 (4, 6, (3, 'she'))
 (5, 6, (0, 'he'))

You can also create a potentially large automaton ahead of time and pickle it to reload later. Here we just pickle to an in-memory bytes string. You would typically pickle to a file instead:

 >>> import pickle
 >>> pickled = pickle.dumps(automaton)
 >>> B = pickle.loads(pickled)
 >>> B.get('he')
 (0, 'he')

File-based version:

 # Save the automaton to a local file
 with open("automaton.pkl", "wb") as file:
     pickle.dump(automaton, file)

 # Load it back from the file
 with open("automaton.pkl", "rb") as file:
     C = pickle.load(file)

 # The reloaded automaton holds the same keys and values
 print(C.get("he"))

See also:

Documentation

The full documentation including the API overview and reference is published on readthedocs.

Overview

With an Aho-Corasick automaton you can efficiently search all occurrences of multiple strings (the needles) in an input string (the haystack) making a single pass over the input string. With pyahocorasick you can build potentially large automatons and pickle them to reuse them over and over as an indexed structure for fast multi-pattern string matching.

One of the advantages of an Aho-Corasick automaton is that the typical worst-case and best-case runtimes are about the same and depend primarily on the size of the input string and secondarily on the number of matches returned. While this may not be the fastest string search algorithm in all cases, it can search for multiple strings at once and its runtime guarantees make it rather unique. Because pyahocorasick is based on a Trie, it stores redundant key prefixes only once, using memory efficiently.
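
To make the prefix-sharing point concrete, here is a small pure Python sketch that counts trie nodes with nested dicts. It only illustrates the idea and says nothing about how the C extension lays out its nodes; the count_trie_nodes name is made up for this example:

```python
def count_trie_nodes(keys):
    """Build a nested-dict trie and return its node count (root excluded)."""
    root = {}
    nodes = 0
    for key in keys:
        node = root
        for ch in key:
            if ch not in node:
                # A new node is only created when the prefix is not stored yet
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


keys = ["he", "her", "hers", "she"]
total_chars = sum(len(k) for k in keys)  # characters if every key were stored separately
shared_nodes = count_trie_nodes(keys)    # nodes once common prefixes are shared
print(total_chars, shared_nodes)
```

Here "he", "her" and "hers" share one chain of four nodes, so the trie needs fewer nodes than the keys have characters.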

A drawback is that it needs to be constructed and “finalized” ahead of time before you can search strings. In many applications where you search for several pre-defined “needles” in variable “haystacks”, this is actually an advantage.

Aho-Corasick automatons are commonly used for fast multi-pattern matching in intrusion detection systems (such as snort), anti-viruses and many other applications that need fast matching against a pre-defined set of string keys.

Internally an Aho-Corasick automaton is typically based on a Trie with extra data for failure links and an implementation of the Aho-Corasick search procedure.
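
For readers who want to see the moving parts, here is a minimal pure Python sketch of a trie with failure links and the search loop. It is a textbook-style illustration written for this page, not the code of the C extension or of the pure Python module shipped in the repository; build_automaton and search are made-up names:

```python
from collections import deque


def build_automaton(keys):
    """Build goto, fail and output tables for a set of string keys."""
    goto = [{}]    # goto[state][char] -> next state (the trie edges)
    fail = [0]     # failure link of each state
    output = [[]]  # keys recognized when a state is reached
    for key in keys:
        state = 0
        for ch in key:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(key)

    # Breadth-first pass: the failure link of a node points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the keys that end at the failure target
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(haystack, goto, fail, output):
    """Yield (start_index, end_index, key) in a single pass over haystack."""
    state = 0
    for end_index, ch in enumerate(haystack):
        while state and ch not in goto[state]:
            state = fail[state]  # fall back instead of rescanning the input
        state = goto[state].get(ch, 0)
        for key in output[state]:
            yield end_index - len(key) + 1, end_index, key


tables = build_automaton(["he", "her", "hers", "she"])
matches = list(search("_hershe_", *tables))
print(matches)
```

The failure links are what let the search continue after a mismatch without moving backwards in the input.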

Behind the scenes the pyahocorasick Python library implements these two data structures: a Trie and an Aho-Corasick string matching automaton. Both are exposed through the Automaton class.

In addition to Trie-like and Aho-Corasick methods and data structures, pyahocorasick also implements dict-like methods: the pyahocorasick Automaton is a Trie, a dict-like structure indexed by string keys, each associated with a value object. You can use this to retrieve an associated value in a time proportional to the length of a string key.

pyahocorasick is available in two flavors:

  • a CPython C-based extension, compatible with Python 3 only. Use the older version 1.4.x for Python 2.7.x and 32-bit support.
  • a simpler pure Python module, compatible with Python 2 and 3. This is only available in the source repository (not on PyPI) under the etc/py/ directory and has a slightly different API.

Unicode and bytes

The type of strings accepted and returned by Automaton methods is either unicode or bytes, depending on a compile-time setting (the preprocessor definition of AHOCORASICK_UNICODE as set in setup.py).

The Automaton.unicode attribute can tell you how the library was built. On Python 3, unicode is the default.

Warning

When the library is built with unicode support, an Automaton will store 2 or 4 bytes per letter, depending on your Python installation. When built for bytes, only one byte per letter is needed.

Build and install from PyPI

To install for common operating systems, use pip. Pre-built wheels should be available on PyPI at some point in the future:

 pip install pyahocorasick

To build from sources you need to have a C compiler installed and configured, which should be standard on Linux and easy to get on macOS.

To build from sources, clone the git repository or download and extract the source archive.

Install pip (and its setuptools companion) and then run (in a virtualenv of course!):

 pip install .

If compilation succeeds, the module is ready to use.

Support

Support is available through the GitHub issue tracker to report bugs or ask questions.

Contributing

You can submit contributions through GitHub pull requests.

  • There is a Makefile with a default target that builds and runs tests.
  • The tests can be run with pip install -e .[testing] && pytest -vvs
  • See also the .github directory for CI tests and workflows.

Authors

The initial author and maintainer is Wojciech Muła. Philippe Ombredanne is Wojciech’s sidekick: he helps with maintenance, rewrote the documentation, set up CI servers and did some work to make this module more accessible to end users.

Alphabetic list of authors and contributors:

  • Andrew Grigorev
  • Ayan Mahapatra
  • Bogdan
  • David Woakes
  • Edward Betts
  • Frankie Robertson
  • Frederik Petersen
  • gladtosee
  • INADA Naoki
  • Jan Fan
  • Pastafarianist
  • Philippe Ombredanne
  • Renat Nasyrov
  • Sylvain Zimmer
  • Xiaopeng Xu

and many others!

This library would not be possible without the help of many people, who contributed in various ways. They created pull requests, reported bugs as GitHub issues or via direct messages, proposed fixes, or spent their valuable time on testing.

Thank you.

License

This library is licensed under the very liberal BSD-3-Clause license. Some portions of the code are dedicated to the public domain, such as the pure Python automaton and the test code.

The full text of the license is available in the LICENSE file.