The LangChain community's newly released guide to 13 popular LLM use cases



I recently came across a very valuable article from the LangChain community that walks through 13 popular LLM (large language model) use cases in detail. I think this material will be useful to many people, so I have excerpted the main content here in the hope that it sparks some ideas. I am sharing it purely as a learner and popularizer; I hope you get something out of it.

Private, local, open source LLMs

Use case

The popularity of projects like PrivateGPT, llama.cpp, and GPT4All underscores the demand for running LLMs locally (on your own device).

This has at least two important benefits:

1. Privacy: Your data is not sent to a third party, and it is not subject to the terms of service of a commercial service.

2. Cost: There is no inference fee, which is important for token-intensive applications (e.g., long-running simulations, summarization).

Overview

Running an LLM locally requires a few things:

1. Open source LLM: An open source LLM that can be freely modified and shared.

2. Inference: The ability to run this LLM on your device with acceptable latency.

Open Source LLMs

Users can now gain access to a rapidly growing set of open source LLMs. These LLMs can be assessed along at least two dimensions (see figure):

1. Base model: What is the base model and how was it trained?

2. Fine-tuning approach: Was the base model fine-tuned and, if so, what set of instructions was used?

Several leaderboards can be used to assess the relative performance of these models, including:

  1. LmSys
  2. GPT4All
  3. HuggingFace

Inference

Several frameworks have emerged to support inference of open source LLMs on various devices:

1. llama.cpp: C++ implementation of the llama inference code with weight optimization/quantization.

2. gpt4all: An optimized C backend for inference.

3. Ollama: Bundles model weights and environment into an app that runs on device and serves the LLM.

In general, these frameworks will do a few things:

1. Quantization: Reduce the memory footprint of the raw model weights.

2. Efficient implementation for inference: Support inference on consumer hardware (e.g., a CPU or laptop GPU).

In particular, see this excellent post on the importance of quantization.


With less precision, we radically reduce the memory needed to store the LLM in memory. In addition, we can see the importance of GPU memory bandwidth: the Mac M2 Max is 5-6x faster than the M1 for inference because of its greater GPU memory bandwidth.
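As a back-of-envelope illustration (a sketch of mine, not from the guide): weight memory is roughly parameter count times bytes per parameter, so 4-bit quantization shrinks a 13B model from ~26 GB at 16-bit down to ~6.5 GB.

# Back-of-envelope estimate of the memory needed just to hold model weights
# (illustrative only; real runtimes add KV-cache and activation overhead).
def weight_memory_gb(n_params_billions: float, bits_per_param: int) -> float:
    return n_params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"13B params @ {bits:>2}-bit: ~{weight_memory_gb(13, bits):.1f} GB")
# -> ~52.0, ~26.0, ~13.0, ~6.5 GB respectively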


Quickstart

Ollama is one way to easily run inference on macOS.

The instructions here provide details, which we summarize:

• Download and run the app.

• From the command line, fetch a model from this list of options: e.g., ollama pull llama2.

• When the app is running, all models are automatically served on localhost:11434.
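Before wiring it into LangChain, you can sanity-check the local server directly. A minimal sketch (assuming the third-party requests library; Ollama's /api/generate endpoint streams one JSON object per line):

import json
import requests  # pip install requests

# Ask the local Ollama server for a completion and print the streamed chunks;
# each line is a JSON object carrying a "response" fragment until "done" is true.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?"},
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(json.loads(line).get("response", ""), end="", flush=True)

With the server responding, the same model can be called through LangChain: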

from langchain.llms import Ollama

llm = Ollama(model="llama2")

llm("The first man on the moon was ...")

API Reference:

Ollama

' The first man on the moon was Neil Armstrong, who landed on the moon on July 20, 1969 as part of the Apollo 11 mission. obviously.'

Stream tokens as they are being generated:

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
llm = Ollama(
    model="llama2",
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)
llm("The first man on the moon was ...")

API Reference

CallbackManager

StreamingStdOutCallbackHandler

 The first man to walk on the moon was Neil Armstrong, an American astronaut who was part of the Apollo 11 mission in 1969. февруари 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon's surface, famously declaring "That's one small step for man, one giant leap for mankind" as he took his first steps. He was followed by fellow astronaut Edwin "Buzz" Aldrin, who also walked on the moon during the mission.




' The first man to walk on the moon was Neil Armstrong, an American astronaut who was part of the Apollo 11 mission in 1969. февруари 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon\'s surface, famously declaring "That\'s one small step for man, one giant leap for mankind" as he took his first steps. He was followed by fellow astronaut Edwin "Buzz" Aldrin, who also walked on the moon during the mission.'

Environment

Inference speed is a challenge when running models locally (see above).

To minimize latency, it is desirable to run models locally on a GPU, which ships with many consumer laptops, e.g., Apple devices.

Even when using a GPU, the available GPU memory bandwidth (as noted above) is important.
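To see why: at batch size 1, generating each token requires streaming roughly all of the model weights through the GPU, so bandwidth divided by model size gives a crude ceiling on decode speed. A rough sketch (the bandwidth figures are public chip specs; the bound ignores compute and KV-cache traffic):

# Crude upper bound on decode speed when generation is memory-bandwidth-bound:
# producing each token reads ~all quantized weights from memory once.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 7.0  # roughly a 4-bit quantized 13B model (see the llama.cpp logs below)
for chip, bw in [("M1 (~68 GB/s)", 68), ("M2 Max (~400 GB/s)", 400)]:
    print(f"{chip}: at most ~{max_tokens_per_sec(bw, MODEL_GB):.0f} tokens/sec")
# The ~6x bandwidth gap lines up with the 5-6x speedup noted above.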

Running an Apple silicon GPU

Ollama will automatically utilize the GPU on Apple devices.

Other frameworks require the user to set up the environment to utilize the Apple GPU.

For example, llama.cpp python bindings can be configured to use the GPU via Metal.

Metal is a graphics and compute API created by Apple that provides near-direct access to the GPU.

See the llama.cpp setup here to enable this.

In particular, ensure that conda is using the correct virtual environment that you created (miniforge3).

E.g., for me:

conda activate /Users/rlm/miniforge3/envs/llama

With the above confirmed:

CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

LLMs

There are various ways to gain access to quantized model weights:

1. HuggingFace - Many quantized models are available for download and can be run with frameworks such as llama.cpp.

2. gpt4all - The model explorer offers a leaderboard of metrics and associated quantized models available for download.

3. Ollama - Several models can be accessed directly via pull.

Ollama

With Ollama, fetch a model via ollama pull <model family>:<tag>:

  • E.g., for Llama-7b: ollama pull llama2 will download the most basic version of the model (e.g., smallest # parameters and 4-bit quantization).

  • We can also specify a particular version from the model list, e.g., ollama pull llama2:13b.

  • See the full set of parameters on the API reference page.

from langchain.llms import Ollama

llm = Ollama(model="llama2:13b")

llm("The first man on the moon was ... think step by step")

API Reference:

Ollama 

' Sure! Here\'s the answer, broken down step by step:\n\nThe first man on the moon was... Neil Armstrong.\n\nHere\'s how I arrived at that answer:\n\n1. The first manned mission to land on the moon was Apollo 11.\n2. The mission included three astronauts: Neil Armstrong, Edwin "Buzz" Aldrin, and Michael Collins.\n3. Neil Armstrong was the mission commander and the first person to set foot on the moon.\n4. On July 20, 1969, Armstrong stepped out of the lunar module Eagle and onto the moon\'s surface, famously declaring "That\'s one small step for man, one giant leap for mankind."\n\nSo, the first man on the moon was Neil Armstrong!'

Llama.cpp

Llama.cpp is compatible with a broad set of models.

For example, below we run inference on llama2-13b with 4-bit quantization downloaded from HuggingFace.

As noted above, see the API reference for the full set of parameters.

From the llama.cpp docs, a few are worth commenting on:

n_gpu_layers: Number of layers to be loaded into GPU memory

  • Value: 1
  • Meaning: Only one layer of the model will be loaded into GPU memory (1 is often sufficient).

n_batch: Number of tokens the model should process in parallel

  • Value: n_batch
  • Meaning: It's recommended to choose a value between 1 and n_ctx (which in this case is set to 2048).

n_ctx: Token context window

  • Value: 2048
  • Meaning: The model will consider a window of 2048 tokens at a time.

f16_kv: Whether the model should use half-precision for the key/value cache

  • Value: True
  • Meaning: The model will use half-precision, which can be more memory efficient; Metal only supports True.

pip install llama-cpp-python

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)

API Reference:

LlamaCpp

objc[10142]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x2a0c4c208) and /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/libllama.dylib (0x2c28bc208). One of the two will be used. Which one is undefined.
llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 8953.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x47774af60
ggml_metal_init: loaded kernel_mul                            0x47774bc00
ggml_metal_init: loaded kernel_mul_row                        0x47774c230
ggml_metal_init: loaded kernel_scale                          0x47774c890
ggml_metal_init: loaded kernel_silu                           0x47774cef0
ggml_metal_init: loaded kernel_relu                           0x10e33e500
ggml_metal_init: loaded kernel_gelu                           0x47774b2f0
ggml_metal_init: loaded kernel_soft_max                       0x47771a580
ggml_metal_init: loaded kernel_diag_mask_inf                  0x47774dab0
ggml_metal_init: loaded kernel_get_rows_f16                   0x47774e110
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x47774e7d0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x13efd7170
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13efd73d0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13efd7630
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13efd7890
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x4744c9740
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x4744ca6b0
ggml_metal_init: loaded kernel_rms_norm                       0x4744cb250
ggml_metal_init: loaded kernel_norm                           0x4744cb970
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x10e33f700
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x10e33fcd0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x4744cc2d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x4744cc6f0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x4744cd6b0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x4744cde20
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x10e33ff30
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x10e340190
ggml_metal_init: loaded kernel_rope                           0x10e3403f0
ggml_metal_init: loaded kernel_alibi_f32                      0x10e340de0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x10e3416d0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x10e342080
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x10e342ca0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.06 MB, ( 6986.19 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1032.00 MB, ( 8018.19 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1602.00 MB, ( 9620.19 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   426.00 MB, (10046.19 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, (10558.19 / 21845.34)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

The console log will show the following to indicate Metal was enabled properly per the steps above:

ggml_metal_init: allocating

ggml_metal_init: using MPS

llm("The first man on the moon was ... Let's think step by step")

Llama.generate: prefix-match hit
 and use logical reasoning to figure out who the first man on the moon was.

Here are some clues:

1. The first man on the moon was an American.
2. He was part of the Apollo 11 mission.
3. He stepped out of the lunar module and became the first person to set foot on the moon's surface.
4. His last name is Armstrong.

Now, let's use our reasoning skills to figure out who the first man on the moon was. Based on clue #1, we know that the first man on the moon was an American. Clue #2 tells us that he was part of the Apollo 11 mission. Clue #3 reveals that he was the first person to set foot on the moon's surface. And finally, clue #4 gives us his last name: Armstrong.
Therefore, the first man on the moon was Neil Armstrong!


llama_print_timings:        load time =  9623.21 ms
llama_print_timings:      sample time =   143.77 ms /   203 runs   (    0.71 ms per token,  1412.01 tokens per second)
llama_print_timings: prompt eval time =   485.94 ms /     7 tokens (   69.42 ms per token,    14.40 tokens per second)
llama_print_timings:        eval time =  6385.16 ms /   202 runs   (   31.61 ms per token,    31.64 tokens per second)
llama_print_timings:       total time =  7279.28 ms





" and use logical reasoning to figure out who the first man on the moon was.\n\nHere are some clues:\n\n1. The first man on the moon was an American.\n2. He was part of the Apollo 11 mission.\n3. He stepped out of the lunar module and became the first person to set foot on the moon's surface.\n4. His last name is Armstrong.\n\nNow, let's use our reasoning skills to figure out who the first man on the moon was. Based on clue #1, we know that the first man on the moon was an American. Clue #2 tells us that he was part of the Apollo 11 mission. Clue #3 reveals that he was the first person to set foot on the moon's surface. And finally, clue #4 gives us his last name: Armstrong.\nTherefore, the first man on the moon was Neil Armstrong!"

GPT4All

We can use model weights downloaded from the GPT4All model explorer.

Similar to what is shown above, we can run inference and use the API reference to set parameters of interest.

pip install gpt4all
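The loading step itself did not survive in this excerpt; judging from the log output below, it presumably looked like the following (a reconstruction, with the model path taken from the "Found model file at ..." line):

from langchain.llms import GPT4All

# Reconstructed loading step; the path comes from the log output below.
llm = GPT4All(
    model="/Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin"
)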

API Reference:

GPT4All 

Found model file at  /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin
llama_new_context_with_model: max tensor size =    87.89 MB


llama.cpp: using Metal
llama.cpp: loading model from /Users/rlm/Desktop/Code/gpt4all/models/nous-hermes-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x37944d850
ggml_metal_init: loaded kernel_mul                            0x37944f350
ggml_metal_init: loaded kernel_mul_row                        0x37944fdd0
ggml_metal_init: loaded kernel_scale                          0x3794505a0
ggml_metal_init: loaded kernel_silu                           0x379450800
ggml_metal_init: loaded kernel_relu                           0x379450a60
ggml_metal_init: loaded kernel_gelu                           0x379450cc0
ggml_metal_init: loaded kernel_soft_max                       0x379450ff0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x379451250
ggml_metal_init: loaded kernel_get_rows_f16                   0x3794514b0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x379451710
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x379451970
ggml_metal_init: loaded kernel_get_rows_q2_k                  0x379451bd0
ggml_metal_init: loaded kernel_get_rows_q3_k                  0x379451e30
ggml_metal_init: loaded kernel_get_rows_q4_k                  0x379452090
ggml_metal_init: loaded kernel_get_rows_q5_k                  0x3794522f0
ggml_metal_init: loaded kernel_get_rows_q6_k                  0x379452550
ggml_metal_init: loaded kernel_rms_norm                       0x3794527b0
ggml_metal_init: loaded kernel_norm                           0x379452a10
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x379452c70
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x379452ed0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x379453130
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32               0x379453390
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32               0x3794535f0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32               0x379453850
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32               0x379453ab0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32               0x379453d10
ggml_metal_init: loaded kernel_rope                           0x379453f70
ggml_metal_init: loaded kernel_alibi_f32                      0x3794541d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x379454430
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x379454690
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x3794548f0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.06 MB, (17542.94 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024.00 MB, (18566.94 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1602.00 MB, (20168.94 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   512.00 MB, (20680.94 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, (21192.94 / 21845.34)
ggml_metal_free: deallocating

llm("The first man on the moon was ... Let's think step by step")

".\n1) The United States decides to send a manned mission to the moon.2) They choose their best astronauts and train them for this specific mission.3) They build a spacecraft that can take humans to the moon, called the Lunar Module (LM).4) They also create a larger spacecraft, called the Saturn V rocket, which will launch both the LM and the Command Service Module (CSM), which will carry the astronauts into orbit.5) The mission is planned down to the smallest detail: from the trajectory of the rockets to the exact movements of the astronauts during their moon landing.6) On July 16, 1969, the Saturn V rocket launches from Kennedy Space Center in Florida, carrying the Apollo 11 mission crew into space.7) After one and a half orbits around the Earth, the LM separates from the CSM and begins its descent to the moon's surface.8) On July 20, 1969, at 2:56 pm EDT (GMT-4), Neil Armstrong becomes the first man on the moon. He speaks these"

Prompts

Some LLMs will benefit from specific prompts. For example, llama2 can use special tokens.

We can use ConditionalPromptSelector to set the prompt based on the model type.

Set our LLM

llm = LlamaCpp(
    model_path="/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin",
    n_gpu_layers=1,
    n_batch=512,
    n_ctx=2048,
    f16_kv=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
    verbose=True,
)

llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 8953.71 MB (+ 1608.00 MB per state)
llama_new_context_with_model: kv self size  = 1600.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x4744d09d0
ggml_metal_init: loaded kernel_mul                            0x3781cb3d0
ggml_metal_init: loaded kernel_mul_row                        0x37813bb60
ggml_metal_init: loaded kernel_scale                          0x474481080
ggml_metal_init: loaded kernel_silu                           0x4744d29f0
ggml_metal_init: loaded kernel_relu                           0x3781254c0
ggml_metal_init: loaded kernel_gelu                           0x47447f280
ggml_metal_init: loaded kernel_soft_max                       0x4744cf470
ggml_metal_init: loaded kernel_diag_mask_inf                  0x4744cf6d0
ggml_metal_init: loaded kernel_get_rows_f16                   0x4744cf930
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x4744cfb90
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x4744cfdf0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x4744d0050
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x4744ce980
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x4744cebe0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x4744cee40
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x4744cf0a0
ggml_metal_init: loaded kernel_rms_norm                       0x474482450
ggml_metal_init: loaded kernel_norm                           0x4744826b0
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x474482910
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x474482b70
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x474482dd0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x474483030
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x474483290
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x4744834f0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x474483750
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x4744839b0
ggml_metal_init: loaded kernel_rope                           0x474483c10
ggml_metal_init: loaded kernel_alibi_f32                      0x474483e70
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x4744840d0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x474484330
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x474484590
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  6984.06 MB, ( 6986.94 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1032.00 MB, ( 8018.94 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  1602.00 MB, ( 9620.94 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   426.00 MB, (10046.94 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, (10558.94 / 21845.34)
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 

Set the associated prompt.

from langchain import PromptTemplate, LLMChain
from langchain.chains.prompt_selector import ConditionalPromptSelector

DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""<<SYS>> \n You are an assistant tasked with improving Google search \
results. \n <</SYS>> \n\n [INST] Generate THREE Google search queries that \
are similar to this question. The output should be a numbered list of questions \
and each should have a question mark at the end: \n\n {question} [/INST]""",
)

DEFAULT_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an assistant tasked with improving Google search \
results. Generate THREE Google search queries that are similar to \
this question. The output should be a numbered list of questions and each \
should have a question mark at the end: {question}""",
)

QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
    default_prompt=DEFAULT_SEARCH_PROMPT,
    conditionals=[
        (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
    ],
)

prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)
prompt

API Reference:

ConditionalPromptSelector

PromptTemplate(input_variables=['question'], output_parser=None, partial_variables={}, template='<<SYS>> \n You are an assistant tasked with improving Google search results. \n <</SYS>> \n\n [INST] Generate THREE Google search queries that are similar to this question. The output should be a numbered list of questions and each should have a question mark at the end: \n\n {question} [/INST]', template_format='f-string', validate_template=True)

Chain

llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What NFL team won the Super Bowl in the year that Justin Bieber was born?"
llm_chain.run({"question": question})

  Sure! Here are three similar search queries with a question mark at the end:

1. Which NBA team did LeBron James lead to a championship in the year he was drafted?
2. Who won the Grammy Awards for Best New Artist and Best Female Pop Vocal Performance in the same year that Lady Gaga was born?
3. What MLB team did Babe Ruth play for when he hit 60 home runs in a single season?


llama_print_timings:        load time = 14943.19 ms
llama_print_timings:      sample time =    72.93 ms /   101 runs   (    0.72 ms per token,  1384.87 tokens per second)
llama_print_timings: prompt eval time = 14942.95 ms /    93 tokens (  160.68 ms per token,     6.22 tokens per second)
llama_print_timings:        eval time =  3430.85 ms /   100 runs   (   34.31 ms per token,    29.15 tokens per second)
llama_print_timings:       total time = 18578.26 ms





'  Sure! Here are three similar search queries with a question mark at the end:\n\n1. Which NBA team did LeBron James lead to a championship in the year he was drafted?\n2. Who won the Grammy Awards for Best New Artist and Best Female Pop Vocal Performance in the same year that Lady Gaga was born?\n3. What MLB team did Babe Ruth play for when he hit 60 home runs in a single season?'   

Use cases

Given an llm created from one of the models above, you can use it for many use cases.

For example, here is a guide to RAG with local LLMs (a minimal sketch follows the list below).

In general, use cases for local LLMs can be driven by at least two factors:

  • Privacy: private data (e.g., journals, etc.) that a user does not want to share
  • Cost: text preprocessing (extraction/tagging), summarization, and agent simulations are token-use-intensive tasks
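As a minimal sketch of the local RAG use case linked above (the component choices, GPT4AllEmbeddings, Chroma, and the example URL, are illustrative assumptions rather than something this guide prescribes):

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import GPT4AllEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import Ollama

# Index a document entirely locally: load, split, embed, and store in Chroma.
docs = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/").load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0).split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=GPT4AllEmbeddings())

# Answer questions with a local model; no data leaves the machine.
qa_chain = RetrievalQA.from_chain_type(
    llm=Ollama(model="llama2"),
    retriever=vectorstore.as_retriever(),
)
qa_chain.run("What is task decomposition?")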

There are several approaches to supporting specific use cases:

  • Fine-tuning (e.g., gpt-llm-trainer, Anyscale)

  • Function calling for use cases like extraction or tagging (see the sketch below)
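As a sketch of the extraction idea (create_extraction_chain is LangChain's generic extraction helper; note that it relies on function calling, which a plain local llm may not support without a suitable fine-tune or wrapper):

from langchain.chains import create_extraction_chain

# Schema describing the fields to pull out of free text.
schema = {
    "properties": {
        "name": {"type": "string"},
        "height_in_feet": {"type": "integer"},
    },
    "required": ["name"],
}

# Assumes `llm` supports function calling; with most local models this needs
# a function-calling-capable variant rather than a base chat model.
chain = create_extraction_chain(schema, llm)
chain.run("Alex is 5 feet tall. Claudia is one foot taller than Alex.")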

If you want to dig deeper, the original guide is here: python.langchain.com/docs/guides…

Thanks for reading. Likes, bookmarks, and comments are welcome.

For more free original AI tutorials, 🚀 keep following the WeChat official account: AI深度研究员