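GLM-OCR Deployment and Usage Guide
Choosing a Deployment Method
Local Deployment
Hardware: CPU
Operating system: Windows 11
Steps
- Install Python 3.13 and Anaconda (a Python environment manager that makes it easy to maintain a specific Python version separately)
- Open a CMD command-line window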
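Cloud Deployment
Hardware: GPU
Operating system: Ubuntu 22.04
Steps
- Open a browser
- Search for smoothcloud (or go directly to the smoothcloud 润云 website)
- Register or log in
- Open Console -> Quick Create Instance (top up first if your balance is insufficient)
- Select Inference Instance -> choose the inference card -> Virtual High-Performance Inference Card - 8GB -> 1 GPU, 1 node -> configure storage as needed -> for the image, VLLM/SGlang is recommended -> choose a billing method as needed -> check Create SSH
- Click Quick Create Instance (if you have a coupon, you can apply it here)
- Once the instance has started, go to the instance list, find the new instance, click Quick Tools -> select Jupyter to enter the instance workspace -> in the Launcher window, click Terminal under Other to open a terminal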
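Once the terminal is open, it can be worth confirming that the instance actually exposes a GPU before installing anything. The following is a minimal sketch, assuming the chosen image ships the NVIDIA driver tool nvidia-smi (an assumption; the steps above do not state this). The helper name list_gpus is hypothetical.

```python
import shutil
import subprocess


def list_gpus() -> list[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none are visible."""
    # Assumption: the NVIDIA driver utilities are on PATH inside the instance.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    print(gpus if gpus else "No GPU visible")
```

An empty list on a local CPU-only machine is expected and does not indicate a problem.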
Preparation
- Create and activate a virtual environment
  - Create the environment:
python3 -m venv venv
  - Activate the environment:
    - Windows:
.\venv\Scripts\activate
    - Linux:
source ./venv/bin/activate
- Switch the package index to speed up dependency downloads (using the Aliyun mirror as an example):
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip config set install.trusted-host mirrors.aliyun.com
  Run pip config list to check that the index was switched successfully.
- Download the transformers source code:
git clone https://github.com/huggingface/transformers.git
- Create the dependency file requirements.txt and copy in the following content (replace the placeholder with the actual path):
accelerate==1.12.0
annotated-doc==0.0.4
annotated-types==0.7.0
anyio==4.12.1
certifi==2026.1.4
charset-normalizer==3.4.4
click==8.3.1
colorama==0.4.6
fastapi==0.128.1
filelock==3.20.3
fsspec==2026.1.0
h11==0.16.0
hf-xet==1.2.0
httpcore==1.0.9
httpx==0.28.1
huggingface_hub==1.4.0
idna==3.11
Jinja2==3.1.6
MarkupSafe==3.0.3
modelscope==1.34.0
mpmath==1.3.0
networkx==3.6.1
numpy==2.4.2
packaging==26.0
pillow==12.1.0
psutil==7.2.2
pydantic==2.12.5
pydantic_core==2.41.5
python-multipart==0.0.22
PyYAML==6.0.3
regex==2026.1.15
requests==2.32.5
safetensors==0.7.0
setuptools==80.10.2
shellingham==1.5.4
starlette==0.50.0
sympy==1.14.0
tokenizers==0.22.2
torch==2.10.0
torchvision==0.25.0
tqdm==4.67.3
transformers @ file://{replace with the actual path to the transformers source code}
typer-slim==0.21.1
typing-inspection==0.4.2
typing_extensions==4.15.0
urllib3==2.6.3
uvicorn==0.40.0
  PS: On Windows, the transformers source path needs an extra / in front of it, for example: transformers @ file:///D:\src\transformers
- Download and install the dependencies:
pip install -r requirements.txt
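The extra slash on Windows follows from how file URIs are formed: an absolute Linux path already begins with /, so file:// plus the path yields three slashes, while a Windows drive path does not. As a sketch, the standard library's pathlib can build the URI for either platform. The helper name transformers_requirement and both example paths are placeholders, and pathlib emits forward slashes rather than the backslashes shown in the example above.

```python
from pathlib import PurePosixPath, PureWindowsPath


def transformers_requirement(source_dir: str, windows: bool) -> str:
    """Build the `transformers @ file://...` line for requirements.txt."""
    # Pure paths do no filesystem access, so this runs on any OS.
    path = PureWindowsPath(source_dir) if windows else PurePosixPath(source_dir)
    # as_uri() requires an absolute path and raises ValueError otherwise.
    return f"transformers @ {path.as_uri()}"


if __name__ == "__main__":
    # Example paths are placeholders, not real locations.
    print(transformers_requirement(r"D:\src\transformers", windows=True))
    # transformers @ file:///D:/src/transformers
    print(transformers_requirement("/home/user/src/transformers", windows=False))
    # transformers @ file:///home/user/src/transformers
```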
Downloading the Model
Install modelscope:
pip install modelscope
Download the GLM-OCR model files:
modelscope download --model ZhipuAI/GLM-OCR --local_dir {replace with your own path}
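After the download finishes, a quick sanity check on the target directory can catch an interrupted or misdirected download early. This is a minimal sketch using only the standard library. It assumes the model directory contains a config.json, as Hugging Face style model folders normally do; the exact file list for GLM-OCR is not given in this guide, so extend REQUIRED_FILES only after inspecting the real directory. The helper name missing_model_files is hypothetical.

```python
from pathlib import Path

# Assumption: a Hugging Face style model directory contains at least config.json.
REQUIRED_FILES = ("config.json",)


def missing_model_files(model_dir: str, required=REQUIRED_FILES) -> list[str]:
    """Return the names in `required` that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    model_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_model_files(model_dir)
    print("Model directory looks complete" if not missing else f"Missing: {missing}")
```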
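Running the Model Service
Prerequisites
- Create the configuration file config.json and copy in the following content. The model path must be a quoted JSON string; on Windows, write it with forward slashes or doubled backslashes, since a single backslash is an escape character in JSON:
{
  "host": "",
  "port": 8080,
  "model_path": "{replace with the actual model path}"
}
- Create an images directory, download some images that contain text, and place them in it. Rename one of them to test_image.png so it can be used later to verify the model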
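A malformed config.json tends to surface only as a confusing error deep inside model loading, so it may help to validate it up front. The sketch below checks the three keys that the demo scripts in this guide read (host, port, model_path). The function name load_config and the specific checks are illustrative choices, not part of the original scripts.

```python
import json
from pathlib import Path


def load_config(config_path: str) -> dict:
    """Load config.json and check the three keys the demo scripts read."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    for key in ("host", "port", "model_path"):
        if key not in config:
            raise KeyError(f"config.json is missing the '{key}' key")
    if not isinstance(config["port"], int):
        raise TypeError("'port' must be an integer, e.g. 8080")
    if not Path(config["model_path"]).is_dir():
        raise FileNotFoundError(f"model_path is not a directory: {config['model_path']}")
    return config
```

Calling load_config("config.json") before starting either demo script raises a descriptive error if a key is missing or the model path does not exist.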
Verifying the Model
Create a demo file model_demo.py and copy in the following content:
from modelscope import AutoProcessor, AutoModelForImageTextToText
import torch
import os
import json

# Load configuration
config_path = os.path.join(os.path.dirname(__file__), "config.json")
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)
MODEL_PATH = config["model_path"]

# A single user turn: the image to read plus the OCR prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "./images/test_image.png"
            },
            {
                "type": "text",
                "text": "Text Recognition:"
            }
        ],
    }
]

processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Tokenize the chat messages and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)

generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Decode only the newly generated tokens (everything after the prompt)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)
Run the program:
python3 model_demo.py
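The decode step in model_demo.py slices generated_ids before decoding. The script treats the output of generate as the prompt tokens followed by the newly generated ones, so the slice drops the prompt and leaves only the model's answer. Here is a toy illustration with plain lists in place of tensors; the token IDs are made up and strip_prompt is a hypothetical helper.

```python
def strip_prompt(generated_ids: list[int], prompt_length: int) -> list[int]:
    """Keep only the tokens produced after the prompt."""
    return generated_ids[prompt_length:]


if __name__ == "__main__":
    prompt_ids = [101, 7, 8, 9]          # stands in for inputs["input_ids"][0]
    new_ids = [42, 43, 44]               # tokens the model would add
    full_output = prompt_ids + new_ids   # stands in for generated_ids[0]
    print(strip_prompt(full_output, len(prompt_ids)))  # [42, 43, 44]
```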
Running the Model API Service
Create a demo file server_demo.py and copy in the following content:
import io
import json
import logging
import os
import time

import requests
import torch
import uvicorn
from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from fastapi.concurrency import run_in_threadpool
from fastapi.responses import FileResponse, HTMLResponse
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger("ocr-server")

# Load configuration
config_path = os.path.join(os.path.dirname(__file__), "config.json")
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

app = FastAPI()

# Load model and tokenizer
model_path = config["model_path"]
logger.info(f"Loading model from {model_path}...")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path=model_path,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else "auto",
    device_map="auto",
    trust_remote_code=True,
)
logger.info(f"Model loaded successfully. Device: {model.device}, GPU available: {torch.cuda.is_available()}")
@app.post("/ocr")
async def perform_ocr(
    file: UploadFile = File(None),
    image_url: str = Form(None),
    prompt: str = Form("Extract text from this image")
):
    start_time = time.time()
    try:
        if file:
            logger.info(f"Processing uploaded file: {file.filename}")
            image_data = await file.read()
            image = Image.open(io.BytesIO(image_data)).convert("RGB")
        elif image_url:
            logger.info(f"Processing image from URL: {image_url}")
            if image_url.startswith(("http://", "https://")):
                # Offload blocking network request to threadpool
                response = await run_in_threadpool(requests.get, image_url, timeout=10)
                response.raise_for_status()
                image = Image.open(io.BytesIO(response.content)).convert("RGB")
            else:
                # Anything else is treated as a local file path (used by the demo page examples)
                image = Image.open(image_url).convert("RGB")
        else:
            raise HTTPException(status_code=400, detail="No image file or URL provided")

        # Synchronous function holding the model call; it runs in a threadpool below
        def model_inference(img, p):
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": img},
                        {"type": "text", "text": p}
                    ],
                }
            ]
            inputs = processor.apply_chat_template(
                messages,
                tokenize=True,
                add_generation_prompt=True,
                return_dict=True,
                return_tensors="pt"
            ).to(model.device)
            inputs.pop("token_type_ids", None)
            logger.info("Generating OCR result (in threadpool)...")
            generated_ids = model.generate(**inputs, max_new_tokens=8192)
            return processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)

        # Offload the CPU/GPU intensive work to a threadpool so the event loop is not blocked
        output_text = await run_in_threadpool(model_inference, image, prompt)
        elapsed_time = time.time() - start_time
        logger.info(f"OCR request completed in {elapsed_time:.2f}s")
        return {"text": output_text, "elapsed": f"{elapsed_time:.2f}s"}
    except HTTPException:
        # Let deliberate HTTP errors (such as the 400 above) pass through instead of becoming 500s
        raise
    except Exception as e:
        logger.error(f"Error during OCR processing: {str(e)}", exc_info=True)
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/images/{filename}")
async def get_image(filename: str):
    image_path = os.path.join("images", filename)
    if not os.path.exists(image_path):
        raise HTTPException(status_code=404, detail="Image not found")
    return FileResponse(image_path)


@app.get("/demo", response_class=HTMLResponse)
async def demo_page():
    # List the example images so the page can offer them as one-click choices
    example_images = []
    if os.path.exists("images"):
        example_images = [f for f in os.listdir("images") if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp'))]
    example_html = "".join([
        f'<div class="example-item" onclick="selectExample(\'images/{f}\')">{f}</div>'
        for f in example_images
    ])
    return f"""
<!DOCTYPE html>
<html>
<head>
<title>GLM-OCR Interactive Demo</title>
<style>
body {{ font-family: 'PingFang SC', 'Microsoft YaHei', sans-serif; max-width: 900px; margin: 20px auto; padding: 0; line-height: 1.6; color: #333; background-color: #f4f7f9; }}
.container {{ background: white; border: 1px solid #ddd; padding: 30px; border-radius: 12px; box-shadow: 0 4px 20px rgba(0,0,0,0.08); }}
.form-group {{ margin-bottom: 25px; }}
label {{ display: block; margin-bottom: 10px; font-weight: bold; color: #444; }}
input[type="text"] {{ width: 100%; padding: 12px; box-sizing: border-box; border: 1px solid #ddd; border-radius: 6px; }}
button {{ background: #007bff; color: white; border: none; padding: 12px 28px; border-radius: 6px; cursor: pointer; font-size: 16px; font-weight: 500; transition: all 0.2s; }}
button:hover {{ background: #0056b3; box-shadow: 0 2px 8px rgba(0,123,255,0.4); }}
button:disabled {{ background: #ccc; cursor: not-allowed; }}
#previewContainer {{ margin: 20px 0; padding: 10px; background: #fafafa; border: 2px dashed #eee; border-radius: 8px; display: none; text-align: center; }}
#imagePreview {{ max-width: 100%; max-height: 350px; border-radius: 4px; cursor: zoom-in; transition: opacity 0.2s; }}
#imagePreview:hover {{ opacity: 0.8; }}
.result-container {{ margin-top: 25px; }}
#result {{ white-space: pre-wrap; background: #282c34; color: #abb2bf; padding: 20px; border-radius: 8px; min-height: 150px; font-family: 'Consolas', 'Monaco', monospace; line-height: 1.5; }}
#info {{ margin-top: 10px; font-size: 0.85em; color: #888; text-align: right; }}
.loader {{ color: #007bff; font-weight: bold; display: none; margin-top: 15px; text-align: center; }}
.header {{ text-align: center; margin-bottom: 30px; }}
.examples {{ display: flex; flex-wrap: wrap; gap: 8px; margin-bottom: 15px; }}
.example-item {{ padding: 6px 14px; background: #fff; border: 1px solid #dcdfe6; border-radius: 20px; cursor: pointer; font-size: 13px; transition: all 0.2s; }}
.example-item:hover {{ color: #007bff; border-color: #007bff; }}
.selected-example {{ background: #007bff !important; color: white !important; border-color: #007bff; }}
/* Viewer Modal Styles */
.modal {{ display: none; position: fixed; z-index: 1000; left: 0; top: 0; width: 100%; height: 100%; background-color: rgba(0,0,0,0.9); align-items: center; justify-content: center; overflow: hidden; }}
.modal-content {{ max-width: 95%; max-height: 95%; transition: transform 0.1s; cursor: grab; transform-origin: center; }}
.modal-content:active {{ cursor: grabbing; }}
.close-modal {{ position: absolute; top: 20px; right: 35px; color: #fff; font-size: 40px; font-weight: bold; cursor: pointer; z-index: 1001; }}
.zoom-tip {{ position: absolute; bottom: 20px; left: 50%; transform: translateX(-50%); color: white; background: rgba(0,0,0,0.6); padding: 5px 15px; border-radius: 20px; font-size: 14px; pointer-events: none; }}
</style>
</head>
<body>
<div class="header">
<h1>GLM-OCR Interactive Demo</h1>
<p>High-performance image text recognition based on the Zhipu GLM model</p>
</div>
<div class="container">
<div class="form-group">
<label>Example images (click to select):</label>
<div class="examples" id="exampleList">
{example_html}
</div>
<label>Or upload a local image:</label>
<input type="file" id="imageInput" accept="image/*">
<div id="previewContainer">
<p style="font-size: 12px; color: #999; margin-bottom: 5px;">Preview (click the image to open the zoomable viewer)</p>
<img id="imagePreview" src="" alt="Preview" title="Click to enlarge">
</div>
</div>
<div class="form-group">
<label>Prompt:</label>
<input type="text" id="promptInput" value="Extract text from this image">
</div>
<button id="submitBtn">Start recognition</button>
<div id="loader" class="loader">🚀 Processing, please wait...</div>
<div class="result-container">
<h3>Result:</h3>
<div id="result">The recognition result will appear here...</div>
<div id="info"></div>
</div>
</div>
<!-- Image Viewer Modal -->
<div id="viewerModal" class="modal">
<span class="close-modal" id="closeModal">×</span>
<img class="modal-content" id="modalImg">
<div class="zoom-tip" id="zoomTip">Zoom: 100% (scroll to zoom)</div>
</div>
<script>
const imageInput = document.getElementById('imageInput');
const imagePreview = document.getElementById('imagePreview');
const previewContainer = document.getElementById('previewContainer');
const submitBtn = document.getElementById('submitBtn');
const resultDiv = document.getElementById('result');
const infoDiv = document.getElementById('info');
const loader = document.getElementById('loader');
const modal = document.getElementById('viewerModal');
const modalImg = document.getElementById('modalImg');
const zoomTip = document.getElementById('zoomTip');
const closeModal = document.getElementById('closeModal');
let selectedExamplePath = null;
let currentScale = 1;
// Handle Example Selection
function selectExample(path) {{
selectedExamplePath = path;
imageInput.value = "";
imagePreview.src = path;
previewContainer.style.display = 'block';
document.querySelectorAll('.example-item').forEach(el => {{
el.classList.remove('selected-example');
if (el.innerText === path.split('/').pop()) el.classList.add('selected-example');
}});
}}
// Handle Local File Upload
imageInput.onchange = () => {{
const file = imageInput.files[0];
if (file) {{
selectedExamplePath = null;
document.querySelectorAll('.example-item').forEach(el => el.classList.remove('selected-example'));
const reader = new FileReader();
reader.onload = (e) => {{
imagePreview.src = e.target.result;
previewContainer.style.display = 'block';
}};
reader.readAsDataURL(file);
}}
}};
// Modal Viewer Logic
imagePreview.onclick = () => {{
modal.style.display = "flex";
modalImg.src = imagePreview.src;
currentScale = 1;
updateTranslate();
}};
closeModal.onclick = () => {{
modal.style.display = "none";
}};
modal.onclick = (e) => {{
if (e.target === modal) modal.style.display = "none";
}};
modal.onwheel = (e) => {{
e.preventDefault();
const delta = e.deltaY > 0 ? -0.1 : 0.1;
currentScale = Math.min(Math.max(0.1, currentScale + delta), 5);
updateTranslate();
}};
function updateTranslate() {{
modalImg.style.transform = `scale(${{currentScale}})`;
zoomTip.innerText = `Zoom: ${{Math.round(currentScale * 100)}}% (scroll to zoom)`;
}}
// Form Submission
submitBtn.onclick = async () => {{
const prompt = document.getElementById('promptInput').value;
const file = imageInput.files[0];
if (!file && !selectedExamplePath) {{
alert("Please select or upload an image first.");
return;
}}
const formData = new FormData();
if (file) formData.append('file', file);
if (selectedExamplePath) formData.append('image_url', selectedExamplePath);
formData.append('prompt', prompt);
submitBtn.disabled = true;
loader.style.display = 'block';
resultDiv.innerText = "Processing...";
infoDiv.innerText = "";
try {{
const response = await fetch('/ocr', {{
method: 'POST',
body: formData
}});
const data = await response.json();
if (response.ok) {{
resultDiv.innerText = data.text;
infoDiv.innerText = "Processing time: " + data.elapsed;
}} else {{
resultDiv.innerText = "Error: " + (data.detail || "An unknown error occurred");
}}
}} catch (err) {{
resultDiv.innerText = "Network error: " + err.message;
}} finally {{
submitBtn.disabled = false;
loader.style.display = 'none';
}}
}};
</script>
</body>
</html>
"""
if __name__ == "__main__":
    uvicorn.run(app, host=config["host"], port=config["port"])
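Notes on the code:
- Written with the FastAPI framework
- Endpoints:
  - [POST] /ocr: runs the OCR model
    - Parameters:
      - file: image file, optional
      - image_url: image URL or local path, optional (at least one of file and image_url must be provided)
      - prompt: prompt text, optional (defaults to "Extract text from this image")
  - [GET] /images/{filename}: returns an image file from the images directory
  - [GET] /demo: returns the demo page HTML
- Supports concurrent requests: blocking work (image download and model inference) is offloaded to a thread pool so the event loop stays responsive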
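Besides the demo page, the /ocr endpoint can be called from a script once the server is running. The sketch below uses the requests library (already listed in requirements.txt) to upload a local image as the file form field and reads back the text and elapsed fields that the endpoint returns. The helper names call_ocr and parse_elapsed and the 300-second timeout are illustrative assumptions.

```python
def parse_elapsed(elapsed: str) -> float:
    """Convert the server's "1.23s" style string into seconds."""
    return float(elapsed.rstrip("s"))


def call_ocr(base_url: str, image_path: str, prompt: str = "Text Recognition:") -> dict:
    """Upload a local image to POST /ocr and return the parsed JSON body."""
    # Imported lazily so parse_elapsed stays usable without requests installed.
    import requests

    with open(image_path, "rb") as f:
        response = requests.post(
            f"{base_url.rstrip('/')}/ocr",
            files={"file": f},
            data={"prompt": prompt},
            timeout=300,  # assumption: CPU inference can be slow
        )
    response.raise_for_status()
    return response.json()


# Example (requires the server to be running):
#   result = call_ocr("http://localhost:8080", "./images/test_image.png")
#   print(result["text"], parse_elapsed(result["elapsed"]))
```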
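Run the program:
python3 server_demo.py
Once it is running:
- Local deployment: open http://localhost:8080/demo in a browser to reach the demo page.
- Deployment on www.smoothcloud.com.cn: go to Instance List -> View Instance Details -> add port 8080 under the port settings. Once the port has been added, copy the access address into the browser's address bar and append /demo to reach the demo page.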
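Model loading can take a while, so the demo page may not respond immediately after the server process starts. As a rough sketch using only the standard library, the helper below builds the /demo address from a base address and polls it until it answers. The function names, the attempt count, and the delay are arbitrary illustrative choices.

```python
import time
import urllib.error
import urllib.request


def demo_url(base_url: str) -> str:
    """Append /demo to the access address, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/demo"


def wait_for_demo(base_url: str, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll the demo page until it answers with HTTP 200 or attempts run out."""
    url = demo_url(base_url)
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet
        time.sleep(delay)
    return False
```

For a local deployment, wait_for_demo("http://localhost:8080") returns True as soon as the page is reachable; for a cloud instance, pass the access address copied from the instance details instead.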
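Closing
Congratulations 🎉! You have now deployed a working OCR text recognition service that you can put to use right away. If you have questions or anything you would like to discuss, feel free to leave a comment 😄
Author: Smoothcloud润云-Zpekii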