Paper: PEOs: Optimizing Graph-Based ANN Search with Probabilistic Routing


Abstract

Paper: Probabilistic Routing for Graph-Based Approximate Nearest Neighbor Search
GitHub: https://github.com/ICML2024-code/PEOs
The paper proposes PEOs, which is combined with commonly used graph indexes (HNSW, NSSG) to improve search efficiency.

Introduction

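Optimization directions for graph-based ANN search: routing (HCNGG, TOGG-KMC, TODO-2, TODO-5), edge selection (TODO-1, TODO-3), and quantization (NGT, TODO-6). The authors also point out the problem with existing optimizations: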

However, most of these optimizations are heuristic, based on empirical observations (e.g., over 80% of data vectors are less relevant than the furthest element in the results list and thus should be pruned before exact distance calculations (Chen et al., 2023)), making them challenging to quantitatively analyze.

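In the FINGER paper, the authors note that, for a given node, fewer than 20% of its neighbors are worth exploring. Heuristic methods also lead to higher estimation error and lower recall.

To solve the probabilistic routing problem, the authors combine SimHash and CEOs with SOTA graph algorithms and propose PEOs.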

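Key features of PEOs

  • Combines space partitioning and random projection to generate a random variable that represents the angle between the query vector and each neighbor
  • Solves the probabilistic routing problem for e_p ≤ 5. L: the number of partitioned subspaces
  • Uses SIMD optimization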

Problem Definition

ANN search with routing

[Figure: ANN search with routing]

Here, efs is the size of the search queue.
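The role of efs and of the routing test can be sketched in a few lines of Python. This is a minimal illustration of best-first graph search with a pluggable routing test, not the paper's implementation: the names `search_with_routing`, `routing_test`, `graph`, and `vectors` are hypothetical, and taking the threshold `delta` as the distance of the furthest element currently in the result queue is an assumption based on the quote in the Introduction.

```python
import heapq
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_with_routing(graph, vectors, q, entry, efs, routing_test):
    """Best-first search over `graph`. Exact distances are computed only
    for neighbors that pass `routing_test` (a hypothetical callback)."""
    d0 = dist(vectors[entry], q)
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexpanded node first
    results = [(-d0, entry)]    # max-heap via negation: at most efs nodes
    while candidates:
        d_v, v = heapq.heappop(candidates)
        if len(results) >= efs and d_v > -results[0][0]:
            break  # closest candidate is worse than the worst kept result
        for u in graph[v]:
            if u in visited:
                continue
            visited.add(u)
            # Assumed threshold: distance of the furthest kept result,
            # or infinity while the result queue is not yet full.
            delta = -results[0][0] if len(results) >= efs else math.inf
            if not routing_test(u, v, q, delta):
                continue  # pruned without an exact distance computation
            d_u = dist(vectors[u], q)
            if d_u < delta:
                heapq.heappush(candidates, (d_u, u))
                heapq.heappush(results, (-d_u, u))
                if len(results) > efs:
                    heapq.heappop(results)  # drop the furthest result
    return sorted((-neg_d, node) for neg_d, node in results)


# Toy data: five points on a line, connected as a chain.
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0]]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}


def always_pass(u, v, q, delta):
    """Trivial routing test: every neighbor gets an exact distance check."""
    return True


nearest = search_with_routing(graph, vectors, [3.2], 0, 2, always_pass)
```

With `always_pass` the search behaves like a plain best-first search. A real routing test would return False for most neighbors, so the loop would skip their exact distance computations.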

Definition of Probabilistic Routing

Definition 3.2 (Probabilistic Routing). Given a query vector q, a node v in the graph index, an error bound ε, and a distance threshold δ, for an arbitrary neighbor u of v such that dist(u, q) < δ, if a routing algorithm returns true for u with a probability of at least 1 − ε, then the algorithm is deemed to be (δ, 1 − ε)-routing.

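In other words: given an error bound ε, a distance threshold δ, and a query vector q, for any neighbor u of node v with dist(u, q) < δ, the algorithm returns (visits) that node with probability at least 1 − ε.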
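The definition can be read as a property that is checkable empirically. The sketch below is only an illustration under stated assumptions: `routing_test(u, q, rng)` is a hypothetical randomized test, and `looks_like_routing` estimates its acceptance rate by repeated sampling, which gives evidence about the (δ, 1 − ε) condition on the given neighbors rather than a proof of it.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def empirical_pass_rate(routing_test, u, q, trials, rng):
    """Fraction of independent runs in which the randomized test accepts u."""
    return sum(routing_test(u, q, rng) for _ in range(trials)) / trials


def looks_like_routing(routing_test, neighbors, q, delta, eps,
                       trials=2000, seed=0):
    """Empirical check of the (delta, 1 - eps) condition: every neighbor
    with dist(u, q) < delta must be accepted at a rate of at least 1 - eps.
    Neighbors with dist(u, q) >= delta are not constrained at all."""
    rng = random.Random(seed)
    close = [u for u in neighbors if dist(u, q) < delta]
    return all(
        empirical_pass_rate(routing_test, u, q, trials, rng) >= 1 - eps
        for u in close
    )
```

Note that the definition only constrains neighbors closer than δ. A test that accepts everything satisfies it trivially, so a useful routing test must also reject most of the far neighbors.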

Baseline Algorithms

SimHash Test

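On SimHash: "5分钟搞懂LSH之SimHash算法原理" (Understanding the SimHash algorithm behind LSH in 5 minutes), Zhihu (zhihu.com)

[Figures: SimHash test formulas]

Given m random vectors a_i, take the dot product of the graph node u and of the query vector q with each a_i. If the value is greater than 0, the corresponding bit is set to 1; if it is less than 0, the bit is set to 0 (this effectively partitions the space). This yields one hash code each for u and q, describing where they fall in the partition, and the test counts the number of bits on which the two codes collide.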
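The bit-collision count described above can be sketched as follows. This is a minimal pure-Python illustration, not the paper's code: the function names, the Gaussian choice for the random vectors, and the `threshold` parameter are assumptions (the paper's threshold is given by the formulas in the figures, which are not reproduced here).

```python
import random


def simhash_bits(x, planes):
    """One bit per random vector a_i: 1 if a_i . x > 0, otherwise 0."""
    return [
        1 if sum(a * b for a, b in zip(plane, x)) > 0 else 0
        for plane in planes
    ]


def collisions(bits_u, bits_q):
    """Number of positions where the two hash codes agree."""
    return sum(bu == bq for bu, bq in zip(bits_u, bits_q))


def simhash_test(u, q, planes, threshold):
    """Accept u when enough bits collide with the query's code.
    `threshold` is a hypothetical parameter chosen by the caller."""
    return collisions(simhash_bits(u, planes), simhash_bits(q, planes)) >= threshold


# m random Gaussian vectors a_i in `dim` dimensions (seeded for repeatability).
rng = random.Random(0)
dim, m = 8, 64
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(m)]

q = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0]
u = [0.9, -1.8, 0.7, 2.5, 0.1, 1.2, -0.8, 2.2]
n_collisions = collisions(simhash_bits(u, planes), simhash_bits(q, planes))
```

For Gaussian random vectors, a single bit collides with probability 1 − θ/π, where θ is the angle between u and q, so a larger collision count suggests a smaller angle.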

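From the formulas above, the hash-collision computation is independent of v (the node adjacent to u), so the algorithm may miss some paths, which lowers recall. The residual vector between u and v can be used instead. To this end, the authors propose RCEOs.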

RCEOs Test

Paper: Simple yet efficient algorithms for maximum inner product search via extreme order statistics.

[Figures: RCEOs test]

PEOs(Partitioned Extreme Order Statistics)

To be updated.

TODO

  • 1. High Dimensional Similarity Search With Satellite System Graph: Efficiency, Scalability, and Unindexed Query Compatibility.
  • 2. FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search
  • 3. DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node
  • 4. High Dimensional Similarity Search With Satellite System Graph: Efficiency, Scalability, and Unindexed Query Compatibility.
  • 5. Learning to Route in Similarity Graphs.
  • 6. HVS - hierarchical graph structure based on voronoi diagrams for solving approximate nearest neighbor search
  • 7. SIMD
  • 8. Matryoshka representation learning.
  • 9. Simple yet efficient algorithms for maximum inner product search via extreme order statistics.
  • 10. Falconn++: A Locality-sensitive Filtering Approach for Approximate Nearest Neighbor Search
  • 11. Simple Yet Efficient Algorithms for Maximum Inner Product Search via Extreme Order Statistics