官方的《README》文档内容很详细,包含原理、安装、模型说明、使用方法等,大家可以参考。简单说一下环境,用的 Windows 台式机,装的的有 Anaconda。
1.安装
官方《安装说明》的 ffmpeg 部分内容:
1.1 chocolatey(目的是安装工具 ffmpeg)
- 使用 Powershell 安装 chocolatey
Shift+鼠标右键 --> 在此处打开 Powershell 窗口(S) --> 输入一下文本
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
- 安装过程输出
Forcing web requests to allow TLS v1.2 (Required for requests to Chocolatey.org)
Getting latest version of the Chocolatey package for download.
Not using proxy.
Getting Chocolatey from https://community.chocolatey.org/api/v2/package/chocolatey/2.4.3.
Downloading https://community.chocolatey.org/api/v2/package/chocolatey/2.4.3 to C:\Users\ADMINI~1\AppData\Local\Temp\chocolatey\chocoInstall\chocolatey.zip
Not using proxy.
Extracting C:\Users\ADMINI~1\AppData\Local\Temp\chocolatey\chocoInstall\chocolatey.zip to C:\Users\ADMINI~1\AppData\Local\Temp\chocolatey\chocoInstall
Installing Chocolatey on the local machine
Creating ChocolateyInstall as an environment variable (targeting 'Machine')
Setting ChocolateyInstall to 'C:\ProgramData\chocolatey'
WARNING: It's very likely you will need to close and reopen your shell
before you can use choco.
Restricting write permissions to Administrators
We are setting up the Chocolatey package repository.
The packages themselves go to 'C:\ProgramData\chocolatey\lib'
(i.e. C:\ProgramData\chocolatey\lib\yourPackageName).
A shim file for the command line goes to 'C:\ProgramData\chocolatey\bin'
and points to an executable in 'C:\ProgramData\chocolatey\lib\yourPackageName'.
Creating Chocolatey CLI folders if they do not already exist.
chocolatey.nupkg file not installed in lib.
Attempting to locate it from bootstrapper.
PATH environment variable does not have C:\ProgramData\chocolatey\bin in it. Adding...
警告: Not setting tab completion: Profile file does not exist at
'D:\Documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1'.
Chocolatey CLI (choco.exe) is now ready.
You can call choco from anywhere, command line or powershell by typing choco.
Run choco /? for a list of functions.
You may need to shut down and restart powershell and/or consoles
first prior to using choco.
Ensuring Chocolatey commands are on the path
Ensuring chocolatey.nupkg is in the lib folder
1.2 ffmpeg
- 安装命令
choco install ffmpeg
- 安装过程输出
Chocolatey v2.4.3
Installing the following packages:
ffmpeg
By installing, you accept licenses for the packages.
Downloading package from source 'https://community.chocolatey.org/api/v2/'
Progress: Downloading ffmpeg 7.1.1... 100%
ffmpeg v7.1.1 [Approved]
ffmpeg package files install completed. Performing other installation steps.
The package ffmpeg wants to run 'chocolateyInstall.ps1'.
Note: If you don't run this script, the installation will fail.
Note: To confirm automatically next time, use '-y' or consider:
choco feature enable -n allowGlobalConfirmation
Do you want to run the script?([Y]es/[A]ll - yes to all/[N]o/[P]rint): Y
Extracting 64-bit C:\ProgramData\chocolatey\lib\ffmpeg\tools\ffmpeg-release-essentials.7z to C:\ProgramData\chocolatey\lib\ffmpeg\tools...
C:\ProgramData\chocolatey\lib\ffmpeg\tools
Removing extracted archive.
Sleeping for 2 seconds to allow anti-viruses to finish scanning...
Renaming ffmpeg directory to common name (Try 1 / 3)
Successfully renamed directory.
ShimGen has successfully created a shim for ffmpeg.exe
ShimGen has successfully created a shim for ffplay.exe
ShimGen has successfully created a shim for ffprobe.exe
The install of ffmpeg was successful.
Deployed to 'C:\ProgramData\chocolatey\lib\ffmpeg\tools'
Chocolatey installed 1/1 packages.
See the log for details (C:\ProgramData\chocolatey\logs\chocolatey.log).
1.3 Whisper
官方《安装说明》的 Setup 部分内容:
- 使用最简单的方式
pip install -U openai-whisper
- 安装过程输出
WARNING: Ignoring invalid distribution -rotobuf (e:\programdata\anaconda3\envs\deepface\lib\site-packages)
Looking in indexes: http://mirrors.aliyun.com/pypi/simple/
Collecting openai-whisper
Downloading http://mirrors.aliyun.com/pypi/packages/f5/77/952ca71515f81919bd8a6a4a3f89a27b09e73880cebf90957eda8f2f8545/openai-whisper-20240930.tar.gz (800 kB)
---------------------------------------- 800.5/800.5 kB 4.2 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numba in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from openai-whisper) (0.59.0)
Requirement already satisfied: numpy in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from openai-whisper) (1.26.0)
Requirement already satisfied: torch in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from openai-whisper) (2.2.1)
Requirement already satisfied: tqdm in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from openai-whisper) (4.65.0)
Collecting more-itertools (from openai-whisper)
Downloading http://mirrors.aliyun.com/pypi/packages/2b/9f/7ba6f94fc1e9ac3d2b853fdff3035fb2fa5afbed898c4a72b8a020610594/more_itertools-10.7.0-py3-none-any.whl (65 kB)
---------------------------------------- 65.3/65.3 kB 3.4 MB/s eta 0:00:00
Collecting tiktoken (from openai-whisper)
Downloading http://mirrors.aliyun.com/pypi/packages/70/22/e8fc1bf9cdecc439b7ddc28a45b976a8c699a38874c070749d855696368a/tiktoken-0.9.0-cp39-cp39-win_amd64.whl (894 kB)
---------------------------------------- 894.2/894.2 kB 4.0 MB/s eta 0:00:00
Requirement already satisfied: llvmlite<0.43,>=0.42.0dev0 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from numba->openai-whisper) (0.42.0)
Requirement already satisfied: regex>=2022.1.18 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from tiktoken->openai-whisper) (2023.10.3)
Requirement already satisfied: requests>=2.26.0 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from tiktoken->openai-whisper) (2.31.0)
Requirement already satisfied: filelock in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from torch->openai-whisper) (3.12.4)
Requirement already satisfied: typing-extensions>=4.8.0 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from torch->openai-whisper) (4.8.0)
Requirement already satisfied: sympy in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from torch->openai-whisper) (1.12)
Requirement already satisfied: networkx in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from torch->openai-whisper) (3.1)
Requirement already satisfied: jinja2 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from torch->openai-whisper) (3.1.2)
Requirement already satisfied: fsspec in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from torch->openai-whisper) (2023.10.0)
Requirement already satisfied: colorama in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from tqdm->openai-whisper) (0.4.6)
Requirement already satisfied: charset-normalizer<4,>=2 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from requests>=2.26.0->tiktoken->openai-whisper) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from requests>=2.26.0->tiktoken->openai-whisper) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from requests>=2.26.0->tiktoken->openai-whisper) (2.0.5)
Requirement already satisfied: certifi>=2017.4.17 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from requests>=2.26.0->tiktoken->openai-whisper) (2023.7.22)
Requirement already satisfied: MarkupSafe>=2.0 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from jinja2->torch->openai-whisper) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in e:\programdata\anaconda3\envs\deepface\lib\site-packages (from sympy->torch->openai-whisper) (1.3.0)
Building wheels for collected packages: openai-whisper
Building wheel for openai-whisper (pyproject.toml) ... done
Created wheel for openai-whisper: filename=openai_whisper-20240930-py3-none-any.whl size=803437 sha256=5a47002d0dfc48e09be5402d7e934e6910cbcb00b5328cb6b55b6f34487f0a04
Stored in directory: c:\users\administrator\appdata\local\pip\cache\wheels\92\5b\c7\b99c19e9966d6d455c6301d62efccc73b39366509094f9fe87
Successfully built openai-whisper
WARNING: Ignoring invalid distribution -rotobuf (e:\programdata\anaconda3\envs\deepface\lib\site-packages)
Installing collected packages: more-itertools, tiktoken, openai-whisper
Successfully installed more-itertools-10.7.0 openai-whisper-20240930 tiktoken-0.9.0
2.使用
- 帮助信息获取
whisper --help
- 使用说明(有点儿长)
usage: whisper [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR] [--output_format {txt,vtt,srt,tsv,json,all}] [--verbose VERBOSE] [--task {transcribe,translate}]
[--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
[--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE] [--patience PATIENCE] [--length_penalty LENGTH_PENALTY] [--suppress_tokens SUPPRESS_TOKENS]
[--initial_prompt INITIAL_PROMPT] [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16] [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
[--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD] [--logprob_threshold LOGPROB_THRESHOLD] [--no_speech_threshold NO_SPEECH_THRESHOLD] [--word_timestamps WORD_TIMESTAMPS]
[--prepend_punctuations PREPEND_PUNCTUATIONS] [--append_punctuations APPEND_PUNCTUATIONS] [--highlight_words HIGHLIGHT_WORDS] [--max_line_width MAX_LINE_WIDTH]
[--max_line_count MAX_LINE_COUNT] [--max_words_per_line MAX_WORDS_PER_LINE] [--threads THREADS] [--clip_timestamps CLIP_TIMESTAMPS]
[--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD]
audio [audio ...]
positional arguments:
audio audio file(s) to transcribe
optional arguments:
-h, --help show this help message and exit
--model MODEL name of the Whisper model to use (default: turbo)
--model_dir MODEL_DIR
the path to save model files; uses ~/.cache/whisper by default (default: None)
--device DEVICE device to use for PyTorch inference (default: cpu)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
directory to save the outputs (default: .)
--output_format {txt,vtt,srt,tsv,json,all}, -f {txt,vtt,srt,tsv,json,all}
format of the output file; if not specified, all available formats will be produced (default: all)
--verbose VERBOSE whether to print out the progress and debug messages (default: True)
--task {transcribe,translate}
whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate') (default: transcribe)
--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
language spoken in the audio, specify None to perform language detection (default: None)
--temperature TEMPERATURE
temperature to use for sampling (default: 0)
--best_of BEST_OF number of candidates when sampling with non-zero temperature (default: 5)
--beam_size BEAM_SIZE
number of beams in beam search, only applicable when temperature is zero (default: 5)
--patience PATIENCE optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search (default: None)
--length_penalty LENGTH_PENALTY
optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default (default: None)
--suppress_tokens SUPPRESS_TOKENS
comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations (default: -1)
--initial_prompt INITIAL_PROMPT
optional text to provide as a prompt for the first window. (default: None)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT
if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting
stuck in a failure loop (default: True)
--fp16 FP16 whether to perform inference in fp16; True by default (default: True)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK
temperature to increase when falling back when the decoding fails to meet either of the thresholds below (default: 0.2)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
if the gzip compression ratio is higher than this value, treat the decoding as failed (default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD
if the average log probability is lower than this value, treat the decoding as failed (default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD
if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence (default: 0.6)
--word_timestamps WORD_TIMESTAMPS
(experimental) extract word-level timestamps and refine the results based on them (default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the next word (default: "'“¿([{-)
--append_punctuations APPEND_PUNCTUATIONS
if word_timestamps is True, merge these punctuation symbols with the previous word (default: "'.。,,!!??::”)]}、)
--highlight_words HIGHLIGHT_WORDS
(requires --word_timestamps True) underline each word as it is spoken in srt and vtt (default: False)
--max_line_width MAX_LINE_WIDTH
(requires --word_timestamps True) the maximum number of characters in a line before breaking the line (default: None)
--max_line_count MAX_LINE_COUNT
(requires --word_timestamps True) the maximum number of lines in a segment (default: None)
--max_words_per_line MAX_WORDS_PER_LINE
(requires --word_timestamps True, no effect with --max_line_width) the maximum number of words in a segment (default: None)
--threads THREADS number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS (default: 0)
--clip_timestamps CLIP_TIMESTAMPS
comma-separated list start,end,start,end,... timestamps (in seconds) of clips to process, where the last end timestamp defaults to the end of the file (default: 0)
--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD
(requires --word_timestamps True) skip silent periods longer than this threshold (in seconds) when a possible hallucination is detected (default: None)
3.实例
- 测试文件
- 开始测试
# STT
whisper .\whisper_test.mp3 --language Chinese
第一次进行 STT 会下载模型,模型文件存储在 用户目录的.cache\whisper\下,文件为 large-v3-turbo.pt。
如果想下载其他模型可以执行如下命令,比如使用 small模型:
whisper audio.flac audio.mp3 audio.wav --model small
# 输出信息【此时是在下载模型】
1%|▎ | 15.7M/1.51G [00:09<2:26:01, 183kiB/s]
我是用的 Windows 台式机,没有显卡,性能也不行,所以换了一个半分钟的录音,测试结果如下:
E:\ProgramData\anaconda3\envs\deepface\lib\site-packages\whisper\transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:26.140] 我今年40岁,大五年,出生在天津市,父母就是普通工人,没有什么过多资源吧,基本上就是家 里条件你说多好也没有,说多差也不至于,都是普通家庭吧,大学的时候学文部专业的,毕业之后就进了拍卖行工作了,就在 北京,一直被漂到今天,我的妻子就是我的大同学,然后我结婚也比较晚,生孩子也比较晚,现在孩子也一岁半。
实际的内容如下:
我呢今年40岁,85年的,出生在天津市,父母就是普通工人,没有什么这个过多资源吧,基本上就是家里条件,你说多好也没有,说多差也不至于,都是普通家庭吧,大学的时候学文物专业的,毕业之后呢就进了拍卖行工作,就在北京,一直北漂到今天,我的妻子呢就是我的这个大同学,然后我们结婚也比较晚,生孩子也比较晚,现在孩子也一岁半。
简单总结一下,我测试过在只有 CPU 和有 GPU 的服务器上运行的大模型,输出的结果区别很大,这个是在 Windows 台式机上进行的测试,没有 GPU 且性能也不好,实际内容我也是听了好多遍才转录下来的,整体正确率还是挺高的,如果有 GPU 效果应该更好。