Elasticsearch: How to use an LLM to extract the information you need at ingest time


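In many scenarios we can use an LLM to extract the structured data we need, whether that is a classification, a list of synonyms, or something else. In my earlier article "How to automate synonyms and upload them with our Synonyms API", we showed how to use an LLM to generate synonyms and upload them to Elasticsearch. In today's example, we move the LLM extraction step into an ingest pipeline, so that the required information is extracted automatically as each document is ingested.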

Create the LLM chat completion endpoint

We can follow the earlier article "Elasticsearch: Inference endpoints and a semantic search demo" and create a chat completion endpoint as follows:

```
PUT _inference/completion/azure_openai_completion
{
    "service": "azureopenai",
    "service_settings": {
        "api_key": "${AZURE_API_KEY}",
        "resource_name": "${AZURE_RESOURCE_NAME}",
        "deployment_id": "${AZURE_DEPLOYMENT_ID}",
        "api_version": "${AZURE_API_VERSION}"
    }
}
```
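If you would rather prepare this request from a script than from Kibana Dev Tools, the body can be assembled from environment variables. The following is only a minimal sketch: the helper name `build_endpoint_body` and the idea of reading environment variables named like the Console variables above are assumptions, and actually sending the request (with whatever HTTP client and authentication you use) is left out.

```python
import json
import os

# Service settings used in the request above, mapped to the environment
# variables this sketch (hypothetically) reads them from.
REQUIRED_SETTINGS = {
    "api_key": "AZURE_API_KEY",
    "resource_name": "AZURE_RESOURCE_NAME",
    "deployment_id": "AZURE_DEPLOYMENT_ID",
    "api_version": "AZURE_API_VERSION",
}


def build_endpoint_body(env=None):
    """Return the JSON body for PUT _inference/completion/<endpoint_id>."""
    env = os.environ if env is None else env
    # Fail early with a clear message instead of sending empty settings.
    missing = [var for var in REQUIRED_SETTINGS.values() if not env.get(var)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {
        "service": "azureopenai",
        "service_settings": {
            setting: env[var] for setting, var in REQUIRED_SETTINGS.items()
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo_env = {var: "<" + var.lower() + ">" for var in REQUIRED_SETTINGS.values()}
    print(json.dumps(build_endpoint_body(demo_env), indent=2))
```

Passing a plain dictionary as `env` keeps the helper easy to try out without touching real credentials.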

Create an ingest pipeline

Before creating the pipeline, we can test it with the simulate API. The request relies on a Kibana Console variable named EXTRACTION_PROMPT, which is defined as follows:

```
Extract audio product information from this description.  Return raw JSON only. Do NOT use markdown, backticks, or code blocks. Fields: category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories),  features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound),  use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio).  Description:
```

If you are not yet familiar with how to define such a variable, see my earlier article "Kibana: How to set variables and apply them".

The simulate request looks like this:

```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Use an LLM to extract structured product attributes from the description",
    "processors": [
      {
        "script": {
          "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
          "params": {
            "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
          }
        }
      },
      {
        "inference": {
          "model_id": "azure_openai_completion",
          "input_output": {
            "input_field": "prompt",
            "output_field": "ai_response"
          }
        }
      },
      {
        "json": {
          "field": "ai_response",
          "add_to_root": true
        }
      },
      {
        "remove": {
          "field": [
            "prompt",
            "ai_response"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
      }
    }
  ]
}
```

Tip: the endpoint used above can be backed by any LLM you prefer.

Running the simulate request returns:

```
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        },
        "_ingest": {
          "timestamp": "2026-01-22T13:56:11.926494Z"
        }
      }
    }
  ]
}
```
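To make the processor chain more concrete, here is a rough Python imitation of what the script, inference, json, and remove processors do to a single document. It is a sketch for illustration, not how Elasticsearch implements them: `run_pipeline` and `fake_complete` are hypothetical names, the LLM call is replaced by a stub that returns a fixed reply, and the `model_id` field that the real inference processor adds (visible in the output above) is left out.

```python
import json


def run_pipeline(doc, extraction_prompt, complete):
    """Mimic the script, inference, json and remove steps on one document.

    `complete` stands in for the chat completion endpoint: any callable
    that takes the prompt string and returns the model's text reply.
    """
    ctx = dict(doc)  # work on a copy so the caller's document is untouched

    # script processor: prepend the instructions to the description
    ctx["prompt"] = extraction_prompt + ctx["description"]

    # inference processor: send the prompt and keep the raw reply
    ctx["ai_response"] = complete(ctx["prompt"])

    # json processor with add_to_root: parse the reply and merge its keys
    # into the top level of the document
    parsed = json.loads(ctx["ai_response"])
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object from the model")
    ctx.update(parsed)

    # remove processor: drop the temporary fields
    for field in ("prompt", "ai_response"):
        del ctx[field]
    return ctx


def fake_complete(prompt):
    """Hypothetical stub: returns a fixed reply instead of calling an LLM."""
    return '{"category": "Headphones", "features": ["wireless"], "use_case": "Travel"}'


if __name__ == "__main__":
    doc = {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones.",
        "price": 299.99,
    }
    print(run_pipeline(doc, "Extract product information. Description:", fake_complete))
```

The sketch also shows why the prompt insists on raw JSON: if the reply were wrapped in markdown or backticks, the parsing step would fail.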

The test works as expected, so we can go ahead and create the pipeline:

```
PUT _ingest/pipeline/product-enrichment-pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [
          "prompt",
          "ai_response"
        ]
      }
    }
  ]
}
```
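Once the pipeline is stored it runs on every document, and the model is not guaranteed to stay within the values listed in the prompt. One possible safeguard is a client-side check of the enriched `_source` against that vocabulary. The sketch below is an assumption about how you might do this, not part of the pipeline itself; `find_problems` is a hypothetical helper, and the allowed values are copied from the EXTRACTION_PROMPT text.

```python
# Allowed values, copied from the EXTRACTION_PROMPT text.
ALLOWED_CATEGORIES = {"Headphones", "Earbuds", "Speakers", "Microphones", "Accessories"}
ALLOWED_FEATURES = {
    "wireless", "noise_cancellation", "long_battery", "waterproof",
    "voice_assistant", "fast_charging", "portable", "surround_sound",
}
ALLOWED_USE_CASES = {"Travel", "Office", "Home", "Fitness", "Gaming", "Studio"}


def find_problems(source):
    """Return a list of human-readable problems found in one enriched _source."""
    problems = []

    category = source.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {category!r}")

    features = source.get("features")
    if not isinstance(features, list):
        problems.append(f"features is not a list: {features!r}")
    else:
        unknown = [
            f for f in features
            if not isinstance(f, str) or f not in ALLOWED_FEATURES
        ]
        if unknown:
            problems.append(f"unexpected features: {unknown!r}")

    use_case = source.get("use_case")
    if not isinstance(use_case, str) or use_case not in ALLOWED_USE_CASES:
        problems.append(f"unexpected use_case: {use_case!r}")

    return problems


if __name__ == "__main__":
    enriched = {
        "category": "Headphones",
        "features": ["wireless", "noise_cancellation", "long_battery"],
        "use_case": "Travel",
    }
    print(find_problems(enriched))
```

An empty list means the document only uses values the prompt allows.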

Create the index and write data

Next, we create an index named products:

```
PUT products
{
  "settings": {
    "default_pipeline": "product-enrichment-pipeline"
  }
}
```

As shown above, we set default_pipeline, the index's default pipeline, to product-enrichment-pipeline. This way the pipeline is invoked automatically whenever we write data in the usual way:

```
POST _bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.", "price": 299.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.", "price": 149.99 }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Studio Condenser Microphone", "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.", "price": 199.99 }
```

Note: depending on the speed of the LLM, the call above may take a while to complete!
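For more than a handful of products, you would probably generate the bulk body rather than type it by hand. Here is a minimal sketch of building the newline-delimited JSON that the bulk API expects; `build_bulk_body` is a hypothetical helper, and sending the result to Elasticsearch is left out. No pipeline needs to be named in the body, because the index's default_pipeline takes care of that.

```python
import json


def build_bulk_body(index, docs):
    """Build the NDJSON body for POST _bulk.

    `docs` is an iterable of (doc_id, source) pairs. Each pair becomes an
    action line followed by the document itself.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    products = [
        ("1", {"name": "Wireless Noise-Canceling Headphones", "price": 299.99}),
        ("2", {"name": "Portable Bluetooth Speaker", "price": 149.99}),
    ]
    print(build_bulk_body("products", products), end="")
```

Since every document triggers one LLM call, smaller batches may be easier to retry if a request times out.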

Once the data has been written, we can inspect it with the following command:

```
GET products/_search?filter_path=**.hits
```

This returns:

```
{
  "hits": {
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        }
      },
      {
        "_index": "products",
        "_id": "2",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "waterproof",
            "surround_sound"
          ],
          "price": 149.99,
          "name": "Portable Bluetooth Speaker",
          "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
          "model_id": "azure_openai_completion",
          "category": "Speakers"
        }
      },
      {
        "_index": "products",
        "_id": "3",
        "_score": 1,
        "_source": {
          "use_case": "Studio",
          "features": [
            "noise_cancellation",
            "voice_assistant"
          ],
          "price": 199.99,
          "name": "Studio Condenser Microphone",
          "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
          "model_id": "azure_openai_completion",
          "category": "Microphones"
        }
      }
    ]
  }
}
```

With structured data like this in place, we can search our data or run aggregations over it.
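As an illustration of what such a search could look like, the sketch below builds a request body that filters on the extracted fields and counts products per use case and per feature. It rests on an assumption: the products index was created without explicit mappings, so dynamic mapping indexes each string field as text with a `.keyword` subfield. If you define your own mappings, adjust the field names. `build_facet_search` is a hypothetical helper, and only the body is built here; it would be sent to `products/_search`.

```python
import json


def build_facet_search(category=None, features=()):
    """Build a _search body that filters on the fields the LLM extracted."""
    filters = []
    if category is not None:
        filters.append({"term": {"category.keyword": category}})
    # One term filter per feature, so a product must have all of them.
    for feature in features:
        filters.append({"term": {"features.keyword": feature}})

    # Fall back to match_all when no filter was requested.
    query = {"bool": {"filter": filters}} if filters else {"match_all": {}}
    return {
        "size": 10,
        "query": query,
        "aggs": {
            "by_use_case": {"terms": {"field": "use_case.keyword"}},
            "by_feature": {"terms": {"field": "features.keyword"}},
        },
    }


if __name__ == "__main__":
    body = build_facet_search("Headphones", ["wireless", "noise_cancellation"])
    print(json.dumps(body, indent=2))
```

Filters are used instead of scoring queries because the extracted values are exact labels rather than free text.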

Happy learning!