postgres的全文搜索

87 阅读2分钟

全文搜索

这有一篇相关的文章 本文简单介绍postgres里full text search。 postgres full text search在不安装插件的情况下,只支持英文字母和数字。

全文搜索的用途主要包括以下几个方面:信息检索,数据分析,内容推荐,文本处理。简单地说,全文搜索支持通过模糊的文本,去匹配信息。 例如,有一个数据库,保存电脑的名称信息。用户直接搜索“台式机和笔记本”,我们不应该直接SELECT title FROM computes WHERE title LIKE '%台式机和笔记本%'。子串匹配的方式,不适合从大量文本中搜索。应该使用全文搜索。它有以下能力

  1. 忽略stop words (例如 the an)
  2. 词干提取 (也就是匹配单词的各种形式 例如 run可以匹配 runs running ran
  3. 权重排序 (最好的匹配放在结果的最前面)

下面,我简单演示一下postgres的全文搜索。现在,有以下这样一个表格

it_store=# \d products;
                                      Table "public.products"
   Column    |          Type          | Collation | Nullable |               Default
-------------+------------------------+-----------+----------+--------------------------------------
 id          | integer                |           | not null | nextval('products_id_seq'::regclass)
 title       | character varying(200) |           | not null |
 description | text                   |           | not null |
Indexes:
    "products_pkey" PRIMARY KEY, btree (id)

it_store=# select * from products;
 id |                   title                   |                                                       description
----+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------
  1 | HP ProDesk 460 G3 4GB RAM 1TB HDD         | This is great high end cheap desktop computer.
  2 | ASUS UX363UB 8GB RAM 256GB SSD            | This is a great think and light laptop computer. It has good sound quality
  3 | Microsoft Surface Book 2 1TB SSD 16GB RAM | This is a great convertible laptop computer. You can use it as a tablet, as a laptop and a desktop as well if you want.
(3 rows)

it_store=#

我们要对description字段进行全文搜索。先看一下postgres会如何解析这些文本。

it_store=# select to_tsvector(description) from products;
                                             to_tsvector
------------------------------------------------------------------------------------------------------
 'cheap':6 'comput':8 'desktop':7 'end':5 'great':3 'high':4
 'comput':9 'good':12 'great':4 'laptop':8 'light':7 'qualiti':14 'sound':13 'think':5
 'comput':7 'convert':5 'desktop':20 'great':4 'laptop':6,17 'tablet':14 'use':10 'want':25 'well':22
(3 rows)

以上操作,是把文本转化为tsvector。这是一个特殊的数据类型,用来进行全文搜索的。可以看到可以正确地分词,而且过滤掉了the,a这样的无意义的词。

it_store=# select description from products WHERE to_tsvector(description) @@ to_tsquery('cheap');
                  description
------------------------------------------------
 This is great high end cheap desktop computer.
(1 row)

it_store=# select description from products WHERE to_tsvector(description) @@ to_tsquery('cheap | good');
                                description
----------------------------------------------------------------------------
 This is great high end cheap desktop computer.
 This is a great think and light laptop computer. It has good sound quality
(2 rows)

it_store=# select description from products WHERE to_tsvector(description) @@ to_tsquery('great & !tablet');
                                description
----------------------------------------------------------------------------
 This is great high end cheap desktop computer.
 This is a great think and light laptop computer. It has good sound quality
(2 rows)

第一次搜索中,我搜索所有含有cheap的行。第二次,包含cheap或者good。第三次,包含great但不包含tablet 这样类似于一个玩具型的搜索引擎了。 虽然es的某些能力,postgres没有,但是对于pg能满足的需求,就不用再装一个es了。