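In many scenarios, we can use an LLM to extract the structured data we need, such as classifications or synonyms. In my previous article "How to automate synonyms and upload them using our Synonyms API", we showed how to use an LLM to generate synonyms and upload them to Elasticsearch. In today's example, we move the LLM extraction step into an ingest pipeline, so the information we need is extracted automatically as documents are ingested.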
Create an LLM chat completion endpoint
We can follow the earlier article "Elasticsearch: A demo of inference endpoints and semantic search" and create a chat completion endpoint like this:
```
PUT _inference/completion/azure_openai_completion
{
  "service": "azureopenai",
  "service_settings": {
    "api_key": "${AZURE_API_KEY}",
    "resource_name": "${AZURE_RESOURCE_NAME}",
    "deployment_id": "${AZURE_DEPLOYMENT_ID}",
    "api_version": "${AZURE_API_VERSION}"
  }
}
```
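Before wiring the endpoint into a pipeline, it can help to call it directly and confirm that it responds. The request below is a minimal sketch: it assumes the endpoint ID azure_openai_completion created above, and the input text is made up purely for illustration.

```
# Illustrative smoke test; the input string is an arbitrary example
POST _inference/completion/azure_openai_completion
{
  "input": "Reply with the single word: ok"
}
```

If the endpoint is configured correctly, the response should contain a completion array whose entries carry the model's text in a result field.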
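Create an ingest pipeline
Before creating the pipeline, we can test it with the _simulate API. The simulate request uses a Kibana Console variable named EXTRACTION_PROMPT, which we define as: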
```
Extract audio product information from this description. Return raw JSON only. Do NOT use markdown, backticks, or code blocks. Fields: category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories), features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound), use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio). Description:
```
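If you are not sure how to define this variable, see my earlier article "Kibana: How to set variables and apply them". With the variable defined, the simulate request looks like this: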
```
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Use an LLM to extract category, features, and use case from product descriptions",
    "processors": [
      {
        "script": {
          "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
          "params": {
            "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
          }
        }
      },
      {
        "inference": {
          "model_id": "azure_openai_completion",
          "input_output": {
            "input_field": "prompt",
            "output_field": "ai_response"
          }
        }
      },
      {
        "json": {
          "field": "ai_response",
          "add_to_root": true
        }
      },
      {
        "remove": {
          "field": [
            "prompt",
            "ai_response"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
      }
    }
  ]
}
```
Tip: You can create the endpoint above with any LLM you prefer.
The command above returns:
```
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        },
        "_ingest": {
          "timestamp": "2026-01-22T13:56:11.926494Z"
        }
      }
    }
  ]
}
```
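The simulation produced the category, features, and use_case fields we asked for, and the temporary prompt and ai_response fields were removed. We can now create the actual pipeline: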
```
PUT _ingest/pipeline/product-enrichment-pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [
          "prompt",
          "ai_response"
        ]
      }
    }
  ]
}
```
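An LLM does not always return valid JSON, and the inference call itself can fail. When that happens, the pipeline above rejects the document. The variant below is a sketch of one way to handle this, not part of the original setup: the pipeline name product-enrichment-pipeline-safe and the enrichment_error field are hypothetical names chosen for illustration. It adds a pipeline-level on_failure block that records the error message and still indexes the document.

```
# Hypothetical variant: the pipeline name and the enrichment_error field are illustrative
PUT _ingest/pipeline/product-enrichment-pipeline-safe
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [
          "prompt",
          "ai_response"
        ]
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "enrichment_error",
        "value": "{{{_ingest.on_failure_message}}}"
      }
    },
    {
      "remove": {
        "field": "prompt",
        "ignore_missing": true
      }
    }
  ]
}
```

With this design, a failed document keeps its raw ai_response (if one was produced) alongside enrichment_error, which makes it possible to find and reprocess such documents later.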
Create an index and write data
Next, we create an index called products:
```
PUT products
{
  "settings": {
    "default_pipeline": "product-enrichment-pipeline"
  }
}
```
As shown above, we set default_pipeline to product-enrichment-pipeline. This way, the pipeline is invoked automatically whenever we write data in the usual way:
```
POST _bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.", "price": 299.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.", "price": 149.99 }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Studio Condenser Microphone", "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.", "price": 199.99 }
```
Note: Depending on the speed of the LLM, the call above may take a while to complete.
With the data written, we can view it using the following command:
```
GET products/_search?filter_path=**.hits
```
```
{
  "hits": {
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "wireless",
            "noise_cancellation",
            "long_battery"
          ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        }
      },
      {
        "_index": "products",
        "_id": "2",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [
            "waterproof",
            "surround_sound"
          ],
          "price": 149.99,
          "name": "Portable Bluetooth Speaker",
          "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
          "model_id": "azure_openai_completion",
          "category": "Speakers"
        }
      },
      {
        "_index": "products",
        "_id": "3",
        "_score": 1,
        "_source": {
          "use_case": "Studio",
          "features": [
            "noise_cancellation",
            "voice_assistant"
          ],
          "price": 199.99,
          "name": "Studio Condenser Microphone",
          "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
          "model_id": "azure_openai_completion",
          "category": "Microphones"
        }
      }
    ]
  }
}
```
With structured data like this in place, we can run searches and aggregations against the extracted fields.
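As one possible illustration, the request below filters on an extracted feature and counts the matching products per category. It is a sketch that assumes the products index relies on default dynamic mapping, since we created it without explicit mappings; in that case string fields get a .keyword subfield that can be used for exact matching and aggregations. The aggregation name by_category is an arbitrary label.

```
# Assumes default dynamic mapping (text fields with a .keyword subfield)
GET products/_search
{
  "size": 0,
  "query": {
    "term": {
      "features.keyword": "noise_cancellation"
    }
  },
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category.keyword"
      }
    }
  }
}
```

If you define explicit mappings with keyword fields instead, drop the .keyword suffix from the field names.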
Happy learning!