Because this article is too long, it has to be published in several parts.
Title of this article: How slow is it to run AI locally? Benchmarking large-model inference speed (llama.cpp, Intel GPU A770)
3.8 Windows (CPU) R5-5600G AVX2
Run on PC #6 (physical machine). Version:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64
Running model llama2-7B.q4, generation length 100:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1724480697
llama_print_timings: load time = 1005.41 ms
llama_print_timings: sample time = 4.11 ms / 100 runs ( 0.04 ms per token, 24354.60 tokens per second)
llama_print_timings: prompt eval time = 399.08 ms / 10 tokens ( 39.91 ms per token, 25.06 tokens per second)
llama_print_timings: eval time = 9688.39 ms / 99 runs ( 97.86 ms per token, 10.22 tokens per second)
llama_print_timings: total time = 10110.42 ms / 109 tokens
Running model llama2-7B.q4, generation length 200:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200
llama_print_timings: load time = 1045.93 ms
llama_print_timings: sample time = 8.82 ms / 200 runs ( 0.04 ms per token, 22673.17 tokens per second)
llama_print_timings: prompt eval time = 436.84 ms / 10 tokens ( 43.68 ms per token, 22.89 tokens per second)
llama_print_timings: eval time = 19960.35 ms / 199 runs ( 100.30 ms per token, 9.97 tokens per second)
llama_print_timings: total time = 20439.79 ms / 209 tokens
Running model llama2-7B.q4, generation length 500:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500
llama_print_timings: load time = 1028.02 ms
llama_print_timings: sample time = 18.32 ms / 500 runs ( 0.04 ms per token, 27300.03 tokens per second)
llama_print_timings: prompt eval time = 382.15 ms / 10 tokens ( 38.22 ms per token, 26.17 tokens per second)
llama_print_timings: eval time = 51622.99 ms / 499 runs ( 103.45 ms per token, 9.67 tokens per second)
llama_print_timings: total time = 52107.10 ms / 509 tokens
Running model llama2-7B.q4, generation length 1000:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000
llama_print_timings: load time = 1241.78 ms
llama_print_timings: sample time = 41.52 ms / 1000 runs ( 0.04 ms per token, 24084.78 tokens per second)
llama_print_timings: prompt eval time = 484.10 ms / 10 tokens ( 48.41 ms per token, 20.66 tokens per second)
llama_print_timings: eval time = 114393.05 ms / 999 runs ( 114.51 ms per token, 8.73 tokens per second)
llama_print_timings: total time = 115084.29 ms / 1009 tokens
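The four runs above differ only in the -n value, so they are easy to script. Below is a minimal Python sketch; the executable and model paths are the ones used in the commands above, everything else (variable names, which streams to scan) is my own assumption to adapt:

import subprocess

# Paths as used in the commands above; adjust to your own layout.
EXE = r".\llama-b3617-bin-win-avx2-x64\llama-cli.exe"
MODEL = r"llama-2-7b.Q4_K_M.gguf"
PROMPT = "hello, this is a very very long story"

for n in (100, 200, 500, 1000):
    proc = subprocess.run(
        [EXE, "-m", MODEL, "-p", PROMPT, "-n", str(n)],
        capture_output=True, text=True,
    )
    # This build appears to write its log to stderr; scan both streams to be safe.
    for line in (proc.stdout + proc.stderr).splitlines():
        if line.startswith("llama_print_timings"):
            print(line)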
Running model qwen2-7B.q8, generation length 100:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100
llama_print_timings: load time = 1429.29 ms
llama_print_timings: sample time = 15.21 ms / 100 runs ( 0.15 ms per token, 6572.89 tokens per second)
llama_print_timings: prompt eval time = 523.07 ms / 9 tokens ( 58.12 ms per token, 17.21 tokens per second)
llama_print_timings: eval time = 17786.69 ms / 99 runs ( 179.66 ms per token, 5.57 tokens per second)
llama_print_timings: total time = 18409.82 ms / 108 tokens
Running model qwen2-7B.q8, generation length 200:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200
llama_print_timings: load time = 1424.62 ms
llama_print_timings: sample time = 31.78 ms / 200 runs ( 0.16 ms per token, 6292.47 tokens per second)
llama_print_timings: prompt eval time = 564.79 ms / 9 tokens ( 62.75 ms per token, 15.93 tokens per second)
llama_print_timings: eval time = 36148.33 ms / 199 runs ( 181.65 ms per token, 5.51 tokens per second)
llama_print_timings: total time = 36919.37 ms / 208 tokens
Running model qwen2-7B.q8, generation length 500:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500
llama_print_timings: load time = 1462.26 ms
llama_print_timings: sample time = 80.31 ms / 500 runs ( 0.16 ms per token, 6225.64 tokens per second)
llama_print_timings: prompt eval time = 720.86 ms / 9 tokens ( 80.10 ms per token, 12.49 tokens per second)
llama_print_timings: eval time = 90566.92 ms / 499 runs ( 181.50 ms per token, 5.51 tokens per second)
llama_print_timings: total time = 91801.55 ms / 508 tokens
Running model qwen2-7B.q8, generation length 1000:
>.\llama-b3617-bin-win-avx2-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000
llama_print_timings: load time = 1439.21 ms
llama_print_timings: sample time = 165.06 ms / 1000 runs ( 0.17 ms per token, 6058.48 tokens per second)
llama_print_timings: prompt eval time = 555.15 ms / 9 tokens ( 61.68 ms per token, 16.21 tokens per second)
llama_print_timings: eval time = 184706.64 ms / 999 runs ( 184.89 ms per token, 5.41 tokens per second)
llama_print_timings: total time = 186313.82 ms / 1008 tokens
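To turn these logs into numbers for a chart, the llama_print_timings lines can be parsed with a small regex. A sketch follows; parse_timings and its return format are my own naming, not part of llama.cpp:

import re

# Matches lines like:
# llama_print_timings: eval time = 9688.39 ms / 99 runs ( 97.86 ms per token, ...)
PAT = re.compile(
    r"llama_print_timings:\s+(\w[\w ]*?) time\s+=\s+([\d.]+) ms"
    r"(?:\s+/\s+(\d+) (?:runs|tokens))?"
)

def parse_timings(log_text):
    """Return {phase: (total_ms, count)} from a llama-cli log."""
    out = {}
    for m in PAT.finditer(log_text):
        phase, ms, count = m.group(1), float(m.group(2)), m.group(3)
        out[phase] = (ms, int(count) if count else None)
    return out

# Example with the eval line of the first run above:
sample = "llama_print_timings: eval time = 9688.39 ms / 99 runs ( 97.86 ms per token, 10.22 tokens per second)"
ms, runs = parse_timings(sample)["eval"]
print(f"decode: {runs / ms * 1000:.2f} tokens/s")  # -> decode: 10.22 tokens/s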
3.9 Windows (GPU) A770 Vulkan
Run on PC #6 (physical machine). Version:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe --version
version: 3617 (a07c32ea)
built with MSVC 19.29.30154.0 for x64
Running model llama2-7B.q4, generation length 100:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
Log start
main: build = 3617 (a07c32ea)
main: built with MSVC 19.29.30154.0 for x64
main: seed = 1724482103
llama_print_timings: load time = 3375.14 ms
llama_print_timings: sample time = 4.04 ms / 100 runs ( 0.04 ms per token, 24764.74 tokens per second)
llama_print_timings: prompt eval time = 471.87 ms / 10 tokens ( 47.19 ms per token, 21.19 tokens per second)
llama_print_timings: eval time = 5913.11 ms / 99 runs ( 59.73 ms per token, 16.74 tokens per second)
llama_print_timings: total time = 6408.49 ms / 109 tokens
Running model llama2-7B.q4, generation length 200:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
llama_print_timings: load time = 2932.55 ms
llama_print_timings: sample time = 8.03 ms / 200 runs ( 0.04 ms per token, 24915.91 tokens per second)
llama_print_timings: prompt eval time = 471.34 ms / 10 tokens ( 47.13 ms per token, 21.22 tokens per second)
llama_print_timings: eval time = 11931.98 ms / 199 runs ( 59.96 ms per token, 16.68 tokens per second)
llama_print_timings: total time = 12452.04 ms / 209 tokens
Running model llama2-7B.q4, generation length 500:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
llama_print_timings: load time = 2913.84 ms
llama_print_timings: sample time = 19.84 ms / 500 runs ( 0.04 ms per token, 25204.15 tokens per second)
llama_print_timings: prompt eval time = 471.64 ms / 10 tokens ( 47.16 ms per token, 21.20 tokens per second)
llama_print_timings: eval time = 30253.41 ms / 499 runs ( 60.63 ms per token, 16.49 tokens per second)
llama_print_timings: total time = 30844.12 ms / 509 tokens
Running model llama2-7B.q4, generation length 1000:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m llama-2-7b.Q4_K_M.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
llama_print_timings: load time = 2909.30 ms
llama_print_timings: sample time = 40.91 ms / 1000 runs ( 0.04 ms per token, 24443.90 tokens per second)
llama_print_timings: prompt eval time = 471.58 ms / 10 tokens ( 47.16 ms per token, 21.21 tokens per second)
llama_print_timings: eval time = 61725.41 ms / 999 runs ( 61.79 ms per token, 16.18 tokens per second)
llama_print_timings: total time = 62433.39 ms / 1009 tokens
Running model qwen2-7B.q8, generation length 100:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 100 -ngl 33
llama_print_timings: load time = 4785.92 ms
llama_print_timings: sample time = 9.08 ms / 100 runs ( 0.09 ms per token, 11016.86 tokens per second)
llama_print_timings: prompt eval time = 609.77 ms / 9 tokens ( 67.75 ms per token, 14.76 tokens per second)
llama_print_timings: eval time = 6401.98 ms / 99 runs ( 64.67 ms per token, 15.46 tokens per second)
llama_print_timings: total time = 7100.18 ms / 108 tokens
Running model qwen2-7B.q8, generation length 200:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 200 -ngl 33
llama_print_timings: load time = 4783.54 ms
llama_print_timings: sample time = 18.63 ms / 200 runs ( 0.09 ms per token, 10735.37 tokens per second)
llama_print_timings: prompt eval time = 610.60 ms / 9 tokens ( 67.84 ms per token, 14.74 tokens per second)
llama_print_timings: eval time = 12910.01 ms / 199 runs ( 64.87 ms per token, 15.41 tokens per second)
llama_print_timings: total time = 13698.94 ms / 208 tokens
Running model qwen2-7B.q8, generation length 500:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 500 -ngl 33
llama_print_timings: load time = 4798.07 ms
llama_print_timings: sample time = 46.32 ms / 500 runs ( 0.09 ms per token, 10794.47 tokens per second)
llama_print_timings: prompt eval time = 610.28 ms / 9 tokens ( 67.81 ms per token, 14.75 tokens per second)
llama_print_timings: eval time = 32517.07 ms / 499 runs ( 65.16 ms per token, 15.35 tokens per second)
llama_print_timings: total time = 33565.60 ms / 508 tokens
Running model qwen2-7B.q8, generation length 1000:
>.\llama-b3617-bin-win-vulkan-x64\llama-cli.exe -m qwen2-7b-instruct-q8_0.gguf -p "hello, this is a very very long story" -n 1000 -ngl 33
llama_print_timings: load time = 4802.01 ms
llama_print_timings: sample time = 93.21 ms / 989 runs ( 0.09 ms per token, 10610.22 tokens per second)
llama_print_timings: prompt eval time = 610.76 ms / 9 tokens ( 67.86 ms per token, 14.74 tokens per second)
llama_print_timings: eval time = 64868.89 ms / 988 runs ( 65.66 ms per token, 15.23 tokens per second)
llama_print_timings: total time = 66351.20 ms / 997 tokens
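For a quick side-by-side view of the two backends, the decode (eval) speeds can be tabulated. Every number below is copied verbatim from the eval lines above; the script only computes the ratios:

# Decode (eval) speeds in tokens/s, copied from the llama_print_timings lines above,
# keyed by generation length -n.
results = {
    ("llama2-7B.q4", "CPU AVX2"):    {100: 10.22, 200: 9.97,  500: 9.67,  1000: 8.73},
    ("llama2-7B.q4", "A770 Vulkan"): {100: 16.74, 200: 16.68, 500: 16.49, 1000: 16.18},
    ("qwen2-7B.q8",  "CPU AVX2"):    {100: 5.57,  200: 5.51,  500: 5.51,  1000: 5.41},
    ("qwen2-7B.q8",  "A770 Vulkan"): {100: 15.46, 200: 15.41, 500: 15.35, 1000: 15.23},
}

for model in ("llama2-7B.q4", "qwen2-7B.q8"):
    for n in (100, 200, 500, 1000):
        cpu = results[(model, "CPU AVX2")][n]
        gpu = results[(model, "A770 Vulkan")][n]
        print(f"{model}  n={n:4d}  CPU {cpu:5.2f} t/s  GPU {gpu:5.2f} t/s  "
              f"speedup {gpu / cpu:.1f}x")

On this machine the A770 over Vulkan decodes roughly 1.6-1.9x faster than the R5-5600G for llama2-7B.q4 and about 2.8x faster for qwen2-7B.q8, presumably because the q8 weights are about twice as large and CPU decoding is memory-bandwidth-bound.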
(To be continued)