Say Goodbye to Tedium: Phi-3-Vision-128K AI OCR Makes PDF Parsing Easy

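In the fast-moving field of artificial intelligence, multimodal models are setting new standards for combining visual and textual data. One of the latest advances is Phi-3-Vision-128K-Instruct, a state-of-the-art open multimodal model that extends what AI can do with images and text. It was designed with document extraction, optical character recognition (OCR), and general image understanding in mind, and it could substantially change how we process information from PDFs, charts, tables, and other structured or semi-structured documents.

Let's take a closer look at Phi-3-Vision-128K-Instruct: its architecture, its technical requirements, and the considerations for using it responsibly, along with how it can simplify complex tasks such as document extraction, PDF parsing, and AI-driven data analysis.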

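What Is Phi-3-Vision-128K-Instruct?

Phi-3-Vision-128K-Instruct belongs to the Phi-3 model family. It is built for multimodal data processing and supports a context length of up to 128,000 tokens. Because it combines textual and visual data, it suits tasks that require reading text and images together. It was trained on 500 billion tokens drawn from high-quality synthetic data and carefully filtered publicly available sources. The training process includes supervised fine-tuning and preference optimization, with the goal of making the model accurate, reliable, and safe.

Phi-3-Vision-128K-Instruct has 4.2 billion parameters. Its architecture consists of an image encoder, a connector, a projector, and the Phi-3 Mini language model, which makes it a lightweight yet capable option for a wide range of applications.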

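Core Use Cases

The model targets several areas, with particular emphasis on the following:

Document extraction and OCR: converts images of text or scanned documents into editable formats. It can handle complex layouts such as tables, charts, and diagrams, which makes it useful for digitizing physical documents or automating data-extraction workflows.
General image understanding: parses visual content to recognize objects, interpret scenes, and extract relevant information.
Memory/compute-constrained environments: its small parameter count lets it run AI tasks where compute or memory is limited.
Latency-sensitive scenarios: reduces processing delay in real-time applications such as live data feeds, chat-based assistants, or streaming content analysis.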
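
To make the memory-constrained use case concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy at different precisions. It uses the 4.2 billion parameter count quoted in this article; the function name and the dtype table are illustrative assumptions, and the estimate ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and framework overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


if __name__ == "__main__":
    params = 4.2e9  # parameter count quoted in this article
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(params, dtype):.1f} GiB")
```

At 16-bit precision the weights come to roughly 7.8 GiB, which gives a lower bound on the GPU memory needed before any input is processed.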

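Getting Started with Phi-3-Vision-128K-Instruct

To use Phi-3-Vision-128K-Instruct, you need to set up a development environment with the required libraries and tools. The model has been integrated into the development version (4.40.2) of the Hugging Face Transformers library. Before diving into the code examples, make sure your Python environment has the following packages installed: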

# Required Packages
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.40.2
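
A quick way to confirm that an environment matches these pins is to compare them with the installed versions. The sketch below uses only the standard-library `importlib.metadata` module; the `REQUIRED` table is copied from the list above, and the helper name `check_versions` is a hypothetical name chosen for illustration.

```python
from importlib import metadata

# Pins copied from the package list above.
REQUIRED = {
    "flash_attn": "2.5.8",
    "numpy": "1.24.4",
    "Pillow": "10.3.0",
    "requests": "2.31.0",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "transformers": "4.40.2",
}


def check_versions(required, get_version=metadata.version):
    """Return {package: (expected, found)} for every missing or mismatched pin."""
    problems = {}
    for name, expected in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        # Local build tags such as "+cu121" are ignored when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(REQUIRED).items():
        print(f"{name}: expected {expected}, found {found}")
```

If Transformers is installed from source, it will typically report a development version, so a mismatch on that one entry is expected rather than a sign of a broken setup.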

To load the model, you can either update your local Transformers library or install it directly from source:

pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers

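Now let's look at some practical code showing how to use the model for AI-driven document extraction and text generation.

Example Code for Loading the Model

The following Python example shows how to initialize the model and run inference. It uses a class with small methods to keep the code clear and organized: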

from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor


class Phi3VisionModel:
    def __init__(self, model_id="microsoft/Phi-3-vision-128k-instruct", device="cuda"):
        """
        Initialize the Phi3VisionModel with the specified model ID and device.

        Args:
            model_id (str): The identifier of the pre-trained model from Hugging Face's model hub.
            device (str): The device to load the model on ("cuda" for GPU or "cpu").
        """
        self.model_id = model_id
        self.device = device
        self.model = self.load_model()  # Load the model during initialization
        self.processor = self.load_processor()  # Load the processor during initialization

    def load_model(self):
        """
        Load the pre-trained language model with causal language modeling capabilities.

        Returns:
            model (AutoModelForCausalLM): The loaded model.
        """
        print("Loading model...")
        # device_map places the weights on the requested device at load time,
        # so no separate .to(device) call is needed afterwards.
        return AutoModelForCausalLM.from_pretrained(
            self.model_id,
            device_map=self.device,  # Load the weights directly onto the chosen device
            torch_dtype="auto",  # Use the data type stored in the checkpoint
            trust_remote_code=True,  # Allow execution of custom code for loading the model
            # Requires flash_attn and a supported GPU; use "eager" otherwise
            _attn_implementation="flash_attention_2",
        )

    def load_processor(self):
        """
        Load the processor associated with the model for processing inputs and outputs.

        Returns:
            processor (AutoProcessor): The loaded processor for handling text and images.
        """
        print("Loading processor...")
        # Load the processor with trust_remote_code=True to handle any custom processing logic
        return AutoProcessor.from_pretrained(self.model_id, trust_remote_code=True)

    def predict(self, image_url, prompt):
        """
        Perform a prediction using the model given an image and a prompt.

        Args:
            image_url (str): The URL of the image to be processed.
            prompt (str): The textual prompt that guides the model's generation.

        Returns:
            response (str): The generated response from the model.
        """
        # Load the image from the provided URL, failing early on HTTP errors
        http_response = requests.get(image_url, stream=True, timeout=30)
        http_response.raise_for_status()
        image = Image.open(http_response.raw)

        # Format the input prompt template for the model
        prompt_template = f"<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n"

        # Process the inputs, converting the prompt and image into tensor format
        inputs = self.processor(prompt_template, [image], return_tensors="pt").to(self.device)

        # Set generation arguments for the model's response generation
        generation_args = {
            "max_new_tokens": 500,  # Maximum number of tokens to generate
            # Greedy decoding for deterministic output; a temperature setting
            # would have no effect while sampling is disabled, so it is omitted.
            "do_sample": False,
        }
        print("Generating response...")
        # Generate the output IDs, stopping at the tokenizer's end-of-sequence token
        output_ids = self.model.generate(
            **inputs,
            eos_token_id=self.processor.tokenizer.eos_token_id,
            **generation_args,
        )
        # Ignore the input prompt in the output
        output_ids = output_ids[:, inputs["input_ids"].shape[1]:]

        # Decode the generated output tokens to obtain the response text
        response = self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
        return response


# Initialize the model
phi_model = Phi3VisionModel()

# Example prediction
image_url = "https://example.com/sample_image.png"  # URL of the sample image
prompt = "Extract the data in json format."  # Prompt for model guidance
response = phi_model.predict(image_url, prompt)  # Get the response from the model

print("Response:", response)  # Print the generated response

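The code above defines a Phi3VisionModel class that abstracts loading and using the model, which makes it easier to integrate into an application. The predict() method shows how to run image-based inference with a custom prompt.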
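
The prompt template inside predict() can also be factored into a small helper, which is handy if you want to experiment with different instructions. The sketch below is an illustration only: `build_prompt` is a hypothetical name, and numbering additional placeholders as `<|image_2|>`, `<|image_3|>`, and so on is an assumption extrapolated from the `<|image_1|>` tag used above, so check the model card before relying on multi-image input.

```python
def build_prompt(user_text: str, num_images: int = 1) -> str:
    """Build a single-turn prompt with one <|image_N|> placeholder per image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Placeholders are numbered from 1, matching the <|image_1|> tag in predict().
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}{user_text}<|end|>\n<|assistant|>\n"


if __name__ == "__main__":
    print(build_prompt("Extract the data in json format."))
```

With the default of one image, the helper produces exactly the string that predict() builds inline.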

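Testing OCR Capabilities with Scanned ID Documents

To evaluate the OCR performance of Phi-3-Vision-128K-Instruct, we tested it on several real-world scans of identity documents. The images vary in quality and sharpness, which poses a range of challenges for the model. The goal is to show how well it extracts text from documents with different characteristics, such as blur, complex backgrounds, and varied fonts.

Image 1: A fictional passport with detailed text, including personal information such as name, nationality, place of birth, date of issue, and date of expiry. The text is slightly stylized, and there is a machine-readable zone at the bottom. The image quality is high, with no noticeable background noise.

Output: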

{
  "Type/Type": "P",
  "Country code/Code du pays": "UTO",
  "Passport Number/N° de passeport": "L898902C3",
  "Surname/Nom": "ERIKSSON",
  "Given names/Prénoms": "ANNA MARIA",
  "Nationality/Nationalité": "UTOPIAN",
  "Date of Birth/Date de naissance": "12 AUGUST/AOUT 74",
  "Personal No./N° personnel": "Z E 184226 B",
  "Sex/Sexe": "F",
  "Place of birth/Lieu de naissance": "ZENITH",
  "Date of issue/Date de délivrance": "16 APR/AVR 07",
  "Authority/Autorité": "PASSPORT OFFICE",
  "Date of expiry/Date d'expiration": "15 APR/AVR 12",
  "Holder's signature/Signature du titulaire": "anna maria eriksson",
  "Passport/Passeport": "P<UTOERIKSSON<<ANNA<MARIA<<<<<<<<<<<<<<<<<<<<<<<L898902C36UT07408122F1204159ZE184226B<<<<10"
}
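
The machine-readable zone in the output above embeds check digits, which offer one way to cross-check the extracted fields. The sketch below implements the check-digit rule from ICAO Doc 9303 (weights 7, 3, 1 repeated over the characters, sum taken modulo 10). The function name `mrz_check_digit` is hypothetical and is not part of the model or of any library used in this article.

```python
def mrz_check_digit(field: str) -> int:
    """Compute an ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch in "0123456789":
            value = int(ch)
        elif "A" <= ch <= "Z":
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        elif ch == "<":
            value = 0  # filler character
        else:
            raise ValueError(f"invalid MRZ character: {ch!r}")
        total += value * weights[i % 3]
    return total % 10


if __name__ == "__main__":
    print(mrz_check_digit("L898902C3"))  # passport number from the output above
    print(mrz_check_digit("740812"))     # date of birth as YYMMDD
    print(mrz_check_digit("120415"))     # date of expiry as YYMMDD
```

For the passport number L898902C3 the function returns 6, which is the digit that follows the number in the MRZ string of the output. The date of birth (740812) and the date of expiry (120415) yield 2 and 9, which also match the digits that follow them in that string.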

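Image 2: A passport with a clear photo of the holder and neatly formatted text. The fields include the passport number, name, date of birth, nationality, and date of expiry. The document has high contrast, which makes text extraction relatively straightforward. The machine-readable zone (MRZ) at the bottom provides a structured data format that helps verify the accuracy of the extracted information.

Output: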

Here's the extracted full data from the passport in JSON format:

{
  "passport": {
    "issuingCountry": "Netherlands",
    "issuingAuthority": "Koninkrijk der Nederlanden",
    "passportNumber": "SPEC12014",
    "issuingDate": "09 MAR 2014",
    "expiryDate": "09 MAR 2024",
    "holder": {
      "gender": "F",
      "nationality": "Netherlands",
      "placeOfBirth": "SPECIMEN",
      "sex": "WF",
      "firstNames": [
        "Willem",
        "Lieselotte"
      ]
    },
    "physicalDescription": {
      "height": "1.75 m",
      "hairColor": "gray",
      "hairLength": "short"
    },
    "issuingOffice": "Burg. van Stad en Dorp",
    "issuingDateAsInt": "14032014",
    "expiryDateAsInt": "14032024",
    "fieldsExtracted": [
      {
        "code": "NL",
        "dateOfBirth": "10 MAR 1965",
        "dateOfIssue": "09 MAR 2014",
        "dateOfExpiry": "09 MAR 2024",
        "firstNames": [
          "Willem",
          "Lieselotte"
        ],
        "nationality": "Netherlands",
        "passportNumber": "SPEC12014",
        "placeOfBirth": "SPECIMEN",
        "sex": "WF"
      }
    ]
  }
}
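
Responses like this one mix a sentence of prose with the JSON payload, so they cannot be passed straight to a JSON parser. A minimal sketch of one way to pull out the first JSON object, using only the standard library, is shown below; the helper name `extract_json` is hypothetical, and it returns None when the response contains no complete object (for example, if generation stopped before the closing brace).

```python
import json


def extract_json(response: str):
    """Return the first JSON object found in a model response, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores any text that follows it.
            obj, _ = decoder.raw_decode(response, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = response.find("{", start + 1)
    return None


if __name__ == "__main__":
    sample = 'Here is the data:\n\n{"passport": {"passportNumber": "SPEC12014"}}'
    print(extract_json(sample))
```

Parsing the payload also makes it easier to spot inconsistencies between fields, which is worth doing before trusting any extracted value.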

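Try Phi-3-Vision-128K-Instruct Yourself

If you would like to try the Phi-3-Vision-128K-Instruct model yourself, you can explore it through the following link: Try Phi-3-Vision-128K-Instruct on Azure AI. There you can experiment with the model and its OCR capabilities.

Understanding the Architecture and Training

Phi-3-Vision-128K-Instruct is not an ordinary language model; it is a multimodal model that handles both visual and textual data. It was trained on 500 billion tokens covering text and image data. Its architecture combines a language model with image-processing modules into a coherent system that can work with contexts of up to 128K tokens, which suits long conversations and large documents.

The model was trained on powerful hardware, namely 512 H100 GPUs, and uses flash attention to improve memory efficiency, which helps it cope with large-scale workloads. The training dataset contains synthetic data and filtered real-world data, with an emphasis on math, coding, common-sense reasoning, and general knowledge, making the model versatile enough for a wide range of applications.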