全文搜索
这有一篇相关的文章 本文简单介绍postgres里full text search。 postgres full text search在不安装插件的情况下,只支持英文字母和数字。
全文搜索的用途主要包括以下几个方面:信息检索,数据分析,内容推荐,文本处理。简单地说,全文搜索支持通过模糊的文本,去匹配信息。
例如,有一个数据库,保存电脑的名称信息。用户直接搜索“台式机和笔记本”,我们不应该直接SELECT title FROM computes WHERE title LIKE '%台式机和笔记本%'
。子串匹配的方式,不适合从大量文本中搜索。应该使用全文搜索。它有以下能力
- 忽略stop words (例如
the
an
) - 词干提取 (也就是匹配单词的各种形式 例如 run可以匹配
runs
running
ran
) - 权重排序 (最好的匹配放在结果的最前面)
下面,我简单演示一下postgres的全文搜索。现在,有以下这样一个表格
it_store=# \d products;
Table "public.products"
Column | Type | Collation | Nullable | Default
-------------+------------------------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('products_id_seq'::regclass)
title | character varying(200) | | not null |
description | text | | not null |
Indexes:
"products_pkey" PRIMARY KEY, btree (id)
it_store=# select * from products;
id | title | description
----+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------
1 | HP ProDesk 460 G3 4GB RAM 1TB HDD | This is great high end cheap desktop computer.
2 | ASUS UX363UB 8GB RAM 256GB SSD | This is a great think and light laptop computer. It has good sound quality
3 | Microsoft Surface Book 2 1TB SSD 16GB RAM | This is a great convertible laptop computer. You can use it as a tablet, as a laptop and a desktop as well if you want.
(3 rows)
it_store=#
我们要对description字段进行全文搜索。先看一下postgres会如何解析这些文本。
it_store=# select to_tsvector(description) from products;
to_tsvector
------------------------------------------------------------------------------------------------------
'cheap':6 'comput':8 'desktop':7 'end':5 'great':3 'high':4
'comput':9 'good':12 'great':4 'laptop':8 'light':7 'qualiti':14 'sound':13 'think':5
'comput':7 'convert':5 'desktop':20 'great':4 'laptop':6,17 'tablet':14 'use':10 'want':25 'well':22
(3 rows)
以上操作,是把文本转化为tsvector。这是一个特殊的数据类型,用来进行全文搜索的。可以看到可以正确地分词,而且过滤掉了the,a这样的无意义的词。
it_store=# select description from products WHERE to_tsvector(description) @@ to_tsquery('cheap');
description
------------------------------------------------
This is great high end cheap desktop computer.
(1 row)
it_store=# select description from products WHERE to_tsvector(description) @@ to_tsquery('cheap | good');
description
----------------------------------------------------------------------------
This is great high end cheap desktop computer.
This is a great think and light laptop computer. It has good sound quality
(2 rows)
it_store=# select description from products WHERE to_tsvector(description) @@ to_tsquery('great & !tablet');
description
----------------------------------------------------------------------------
This is great high end cheap desktop computer.
This is a great think and light laptop computer. It has good sound quality
(2 rows)
第一次搜索中,我搜索所有含有cheap的行。第二次,包含cheap或者good。第三次,包含great但不包含tablet 这样类似于一个玩具型的搜索引擎了。 虽然es的某些能力,postgres没有,但是对于pg能满足的需求,就不用再装一个es了。