(Section 3.8) Benchmarking Large AI Models (split into parts due to length)


Because the full article runs too long, it is published in several parts.

Full title: How Slow Is Running AI Locally? Benchmarking Large-Model Inference (llama.cpp, Intel GPU A770)


3.8 Windows (CPU) Ryzen 5 5600G AVX2

Run on PC #6 (bare metal). Version:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64
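
Each test below runs llama-cli with the same short English prompt and varies -n (the number of tokens to generate), then reads the llama_print_timings summary that llama.cpp prints on exit. The release archive also ships a llama-bench executable intended for exactly this kind of measurement; an invocation along these lines should be roughly equivalent (flag meanings as of b3617; verify against --help before relying on them):

>.\llama-b3617-bin-win-avx2-x64\llama-bench.exe -m llama-2-7b.Q4_K_M.gguf -p 10 -n 100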

Running model llama2-7B.q4, generation length 100:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1724480697

llama_print_timings:        load time =    1005.41 ms
llama_print_timings:      sample time =       4.11 ms /   100 runs   (    0.04 ms per token, 24354.60 tokens per second)
llama_print_timings: prompt eval time =     399.08 ms /    10 tokens (   39.91 ms per token,    25.06 tokens per second)
llama_print_timings:        eval time =    9688.39 ms /    99 runs   (   97.86 ms per token,    10.22 tokens per second)
llama_print_timings:       total time =   10110.42 ms /   109 tokens
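
For reference, the throughput columns are simply count divided by elapsed time. In the run above, prompt eval processed the 10 prompt tokens in 399.08 ms, i.e. 10 / 0.39908 s ≈ 25.06 tokens per second (prefill speed), and eval generated 99 tokens in 9688.39 ms, i.e. 99 / 9.68839 s ≈ 10.22 tokens per second (generation speed). For interactive use these two numbers matter far more than load or sample time.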

Running model llama2-7B.q4, generation length 200:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200

llama_print_timings:        load time =    1045.93 ms
llama_print_timings:      sample time =       8.82 ms /   200 runs   (    0.04 ms per token, 22673.17 tokens per second)
llama_print_timings: prompt eval time =     436.84 ms /    10 tokens (   43.68 ms per token,    22.89 tokens per second)
llama_print_timings:        eval time =   19960.35 ms /   199 runs   (  100.30 ms per token,     9.97 tokens per second)
llama_print_timings:       total time =   20439.79 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500

llama_print_timings:        load time =    1028.02 ms
llama_print_timings:      sample time =      18.32 ms /   500 runs   (    0.04 ms per token, 27300.03 tokens per second)
llama_print_timings: prompt eval time =     382.15 ms /    10 tokens (   38.22 ms per token,    26.17 tokens per second)
llama_print_timings:        eval time =   51622.99 ms /   499 runs   (  103.45 ms per token,     9.67 tokens per second)
llama_print_timings:       total time =   52107.10 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000

llama_print_timings:        load time =    1241.78 ms
llama_print_timings:      sample time =      41.52 ms /  1000 runs   (    0.04 ms per token, 24084.78 tokens per second)
llama_print_timings: prompt eval time =     484.10 ms /    10 tokens (   48.41 ms per token,    20.66 tokens per second)
llama_print_timings:        eval time =  114393.05 ms /   999 runs   (  114.51 ms per token,     8.73 tokens per second)
llama_print_timings:       total time =  115084.29 ms /  1009 tokens

Running model qwen2-7B.q8, generation length 100:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100

llama_print_timings:        load time =    1429.29 ms
llama_print_timings:      sample time =      15.21 ms /   100 runs   (    0.15 ms per token,  6572.89 tokens per second)
llama_print_timings: prompt eval time =     523.07 ms /     9 tokens (   58.12 ms per token,    17.21 tokens per second)
llama_print_timings:        eval time =   17786.69 ms /    99 runs   (  179.66 ms per token,     5.57 tokens per second)
llama_print_timings:       total time =   18409.82 ms /   108 tokens

Running model qwen2-7B.q8, generation length 200:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200

llama_print_timings:        load time =    1424.62 ms
llama_print_timings:      sample time =      31.78 ms /   200 runs   (    0.16 ms per token,  6292.47 tokens per second)
llama_print_timings: prompt eval time =     564.79 ms /     9 tokens (   62.75 ms per token,    15.93 tokens per second)
llama_print_timings:        eval time =   36148.33 ms /   199 runs   (  181.65 ms per token,     5.51 tokens per second)
llama_print_timings:       total time =   36919.37 ms /   208 tokens

Running model qwen2-7B.q8, generation length 500:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500

llama_print_timings:        load time =    1462.26 ms
llama_print_timings:      sample time =      80.31 ms /   500 runs   (    0.16 ms per token,  6225.64 tokens per second)
llama_print_timings: prompt eval time =     720.86 ms /     9 tokens (   80.10 ms per token,    12.49 tokens per second)
llama_print_timings:        eval time =   90566.92 ms /   499 runs   (  181.50 ms per token,     5.51 tokens per second)
llama_print_timings:       total time =   91801.55 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000

llama_print_timings:        load time =    1439.21 ms
llama_print_timings:      sample time =     165.06 ms /  1000 runs   (    0.17 ms per token,  6058.48 tokens per second)
llama_print_timings: prompt eval time =     555.15 ms /     9 tokens (   61.68 ms per token,    16.21 tokens per second)
llama_print_timings:        eval time =  184706.64 ms /   999 runs   (  184.89 ms per token,     5.41 tokens per second)
llama_print_timings:       total time =  186313.82 ms /  1008 tokens
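
Reading tokens-per-second out of a pile of logs like this gets tedious. Below is a minimal Python sketch that extracts the timing rows from a captured log; it assumes exactly the llama_print_timings format shown above, which can differ between llama.cpp builds:

import re
import sys

# Matches llama_print_timings lines such as:
#   llama_print_timings:        eval time =    9688.39 ms /    99 runs   (   97.86 ms per token,    10.22 tokens per second)
PATTERN = re.compile(
    r"llama_print_timings:\s+(?P<name>[\w ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+)\s+(?:runs|tokens))?"
    r"(?:.*?(?P<tps>[\d.]+) tokens per second)?"
)

def parse_timings(log_text):
    """Return {row name: (total ms, item count, tokens per second)} for one log."""
    rows = {}
    for line in log_text.splitlines():
        m = PATTERN.search(line)
        if m:
            rows[m.group("name")] = (
                float(m.group("ms")),
                int(m.group("count")) if m.group("count") else None,
                float(m.group("tps")) if m.group("tps") else None,
            )
    return rows

if __name__ == "__main__":
    # Read a captured llama-cli log on stdin; print one summary row per metric.
    for name, (ms, count, tps) in parse_timings(sys.stdin.read()).items():
        print(f"{name:>12s}  {ms:10.2f} ms  count={count}  tok/s={tps}")

Usage would be something like python parse_timings.py < run.log (both file names are just placeholders here).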

3.9 Windows (GPU) A770 Vulkan

Run on PC #6 (bare metal). These runs add -ngl 33, which offloads all model layers to the GPU (llama2-7B loads as 33 layers in llama.cpp; any -ngl value at or above a model's layer count offloads everything). Version:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64

Running model llama2-7B.q4, generation length 100:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1724482103

llama_print_timings:        load time =    3375.14 ms
llama_print_timings:      sample time =       4.04 ms /   100 runs   (    0.04 ms per token, 24764.74 tokens per second)
llama_print_timings: prompt eval time =     471.87 ms /    10 tokens (   47.19 ms per token,    21.19 tokens per second)
llama_print_timings:        eval time =    5913.11 ms /    99 runs   (   59.73 ms per token,    16.74 tokens per second)
llama_print_timings:       total time =    6408.49 ms /   109 tokens
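
Against the CPU run at the same length, eval time per token drops from 97.86 ms to 59.73 ms (10.22 -> 16.74 tokens per second, about a 1.6x speedup), while load time grows from about 1.0 s to about 3.4 s, presumably because the weights also have to be copied into the A770's VRAM. Prompt eval is, if anything, slightly slower (21.19 vs 25.06 tokens per second).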

Running model llama2-7B.q4, generation length 200:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33

llama_print_timings:        load time =    2932.55 ms
llama_print_timings:      sample time =       8.03 ms /   200 runs   (    0.04 ms per token, 24915.91 tokens per second)
llama_print_timings: prompt eval time =     471.34 ms /    10 tokens (   47.13 ms per token,    21.22 tokens per second)
llama_print_timings:        eval time =   11931.98 ms /   199 runs   (   59.96 ms per token,    16.68 tokens per second)
llama_print_timings:       total time =   12452.04 ms /   209 tokens

Running model llama2-7B.q4, generation length 500:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33

llama_print_timings:        load time =    2913.84 ms
llama_print_timings:      sample time =      19.84 ms /   500 runs   (    0.04 ms per token, 25204.15 tokens per second)
llama_print_timings: prompt eval time =     471.64 ms /    10 tokens (   47.16 ms per token,    21.20 tokens per second)
llama_print_timings:        eval time =   30253.41 ms /   499 runs   (   60.63 ms per token,    16.49 tokens per second)
llama_print_timings:       total time =   30844.12 ms /   509 tokens

Running model llama2-7B.q4, generation length 1000:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33

llama_print_timings:        load time =    2909.30 ms
llama_print_timings:      sample time =      40.91 ms /  1000 runs   (    0.04 ms per token, 24443.90 tokens per second)
llama_print_timings: prompt eval time =     471.58 ms /    10 tokens (   47.16 ms per token,    21.21 tokens per second)
llama_print_timings:        eval time =   61725.41 ms /   999 runs   (   61.79 ms per token,    16.18 tokens per second)
llama_print_timings:       total time =   62433.39 ms /  1009 tokens

Running model qwen2-7B.q8, generation length 100:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33

llama_print_timings:        load time =    4785.92 ms
llama_print_timings:      sample time =       9.08 ms /   100 runs   (    0.09 ms per token, 11016.86 tokens per second)
llama_print_timings: prompt eval time =     609.77 ms /     9 tokens (   67.75 ms per token,    14.76 tokens per second)
llama_print_timings:        eval time =    6401.98 ms /    99 runs   (   64.67 ms per token,    15.46 tokens per second)
llama_print_timings:       total time =    7100.18 ms /   108 tokens
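
Interestingly, on the GPU the q8 model nearly keeps pace with the q4 one (15.46 vs 16.74 tokens per second at n=100), whereas on the CPU it ran at roughly half the q4 speed (5.57 vs 10.22). One plausible reading: token generation is largely memory-bandwidth-bound, q8 weights are roughly twice the size of q4, and the A770's much higher memory bandwidth absorbs that difference while desktop DDR4 cannot.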

Running model qwen2-7B.q8, generation length 200:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33

llama_print_timings:        load time =    4783.54 ms
llama_print_timings:      sample time =      18.63 ms /   200 runs   (    0.09 ms per token, 10735.37 tokens per second)
llama_print_timings: prompt eval time =     610.60 ms /     9 tokens (   67.84 ms per token,    14.74 tokens per second)
llama_print_timings:        eval time =   12910.01 ms /   199 runs   (   64.87 ms per token,    15.41 tokens per second)
llama_print_timings:       total time =   13698.94 ms /   208 tokens

Running model qwen2-7B.q8, generation length 500:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33

llama_print_timings:        load time =    4798.07 ms
llama_print_timings:      sample time =      46.32 ms /   500 runs   (    0.09 ms per token, 10794.47 tokens per second)
llama_print_timings: prompt eval time =     610.28 ms /     9 tokens (   67.81 ms per token,    14.75 tokens per second)
llama_print_timings:        eval time =   32517.07 ms /   499 runs   (   65.16 ms per token,    15.35 tokens per second)
llama_print_timings:       total time =   33565.60 ms /   508 tokens

Running model qwen2-7B.q8, generation length 1000:

>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33

llama_print_timings:        load time =    4802.01 ms
llama_print_timings:      sample time =      93.21 ms /   989 runs   (    0.09 ms per token, 10610.22 tokens per second)
llama_print_timings: prompt eval time =     610.76 ms /     9 tokens (   67.86 ms per token,    14.74 tokens per second)
llama_print_timings:        eval time =   64868.89 ms /   988 runs   (   65.66 ms per token,    15.23 tokens per second)
llama_print_timings:       total time =   66351.20 ms /   997 tokens
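
Summing up the eval (generation) speeds measured above, in tokens per second:

model           backend        n=100   n=200   n=500   n=1000
llama2-7B.q4    CPU AVX2       10.22    9.97    9.67    8.73
qwen2-7B.q8     CPU AVX2        5.57    5.51    5.51    5.41
llama2-7B.q4    A770 Vulkan    16.74   16.68   16.49   16.18
qwen2-7B.q8     A770 Vulkan    15.46   15.41   15.35   15.23

On the CPU, speed decays noticeably as the generated sequence grows (attention cost rises with context length), while the A770 barely slows down. Note also that the last qwen2 run stopped at 989 sampled tokens despite -n 1000, most likely because the model emitted an end-of-sequence token first.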

(To be continued)