Extract

DocumentReader extracts documents. It extends Supplier<List<Document>> and converts different kinds of source documents into a List<Document>.

public interface DocumentReader extends Supplier<List<Document>> {
    default List<Document> read() {
        return get();
    }
}
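
To make the contract concrete, here is a minimal sketch of a custom reader. SimpleDocument and SimpleDocumentReader are hypothetical stand-ins that mirror the shape of Spring AI's Document and DocumentReader so the example compiles without the library; LineDocumentReader is likewise an illustrative name, not part of Spring AI.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for Spring AI's Document: some text plus metadata.
record SimpleDocument(String content, Map<String, Object> metadata) {}

// Hypothetical stand-in with the same shape as the DocumentReader interface above.
interface SimpleDocumentReader extends Supplier<List<SimpleDocument>> {
    default List<SimpleDocument> read() {
        return get();
    }
}

// Illustrative reader: every non-blank line of the input becomes one document.
class LineDocumentReader implements SimpleDocumentReader {
    private final String source;
    private final String text;

    LineDocumentReader(String source, String text) {
        this.source = source;
        this.text = text;
    }

    @Override
    public List<SimpleDocument> get() {
        return text.lines()
                .map(String::strip)
                .filter(line -> !line.isEmpty())
                // Record where each document came from in its metadata.
                .map(line -> new SimpleDocument(line, Map.<String, Object>of("source", source)))
                .toList();
    }
}
```

Because the interface is just a Supplier, a reader only has to implement get(); callers can use either get() or the more readable read().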

JSON format

You can use the JsonReader provided by Spring AI directly.

@Component
class MyAiAppComponent {

    private final Resource resource;

    MyAiAppComponent(@Value("classpath:bikes.json") Resource resource) {
        this.resource = resource;
    }

    List<Document> loadJsonAsDocuments() {
        JsonReader jsonReader = new JsonReader(resource, "description");
        return jsonReader.read();
    }
}

The sample code uses com.fasterxml.jackson.databind.ObjectMapper to convert the JSON document into a document object.

Sample code

public List<Article> importDocuments(Path path) {
    try {
        Article article = objectMapper.readValue(path.toFile(), Article.class);
        return List.of(article);
    } catch (IOException e) {
        log.error("Failed to import JSON document {}, skipping", path.toAbsolutePath(), e);
    }
    return List.of();
}

Unit test

@Test
public void importDocumentsWithJsonTest() {
    try {
        Path path = Paths.get(DocumentServiceTest.class.getResource("/importer/1.json").toURI());
        List<Article> articles = jsonDocumentService.importDocuments(path);
        assertIterableEquals(List.of("是否可以从一个static方法内部发出对非static方法的调用？"), articles.stream().map(Article::getTitle).toList());
    } catch (URISyntaxException e) {
        throw new RuntimeException(e);
    }
}

Text format

You can use the TextReader provided by Spring AI directly.

@Component
class MyTextReader {

    private final Resource resource;

    MyTextReader(@Value("classpath:text-source.txt") Resource resource) {
        this.resource = resource;
    }
    List<Document> loadText() {
        TextReader textReader = new TextReader(resource);
        textReader.getCustomMetadata().put("filename", "text-source.txt");
        return textReader.read();
    }
}

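Markdown format

You can use the third-party library flexmark. The files must follow a specific convention, for example: the H1 heading is the title, the remaining content is the knowledge-base information, and the topic is carried in the front matter.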

public static String parseMarkdownFlexmark(String markdown) {
    MutableDataSet options = new MutableDataSet();
    options.set(Parser.EXTENSIONS, List.of(YamlFrontMatterExtension.create()));
    Parser parser = Parser.builder(options).build();
    Document document = parser.parse(markdown);
    HtmlRenderer renderer = HtmlRenderer.builder(options).build();
    return renderer.render(document);
}
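
The convention itself (topic in the front matter, H1 as the title, everything else as knowledge-base content) can be illustrated without flexmark. The sketch below is a simplified, stdlib-only stand-in and not the flexmark API. It assumes the front matter is delimited by --- lines and that the topic is stored under a key named topic; MarkdownArticle and MarkdownConventionParser are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result type: the three pieces the convention distinguishes.
record MarkdownArticle(String topic, String title, String body) {}

class MarkdownConventionParser {

    static MarkdownArticle parse(String markdown) {
        List<String> lines = markdown.lines().toList();
        String topic = "";
        String title = "";
        List<String> body = new ArrayList<>();
        int i = 0;

        // Front matter: an optional block between two "---" lines at the very top.
        if (!lines.isEmpty() && lines.get(0).strip().equals("---")) {
            i = 1;
            while (i < lines.size() && !lines.get(i).strip().equals("---")) {
                String line = lines.get(i).strip();
                if (line.startsWith("topic:")) { // assumed key name
                    topic = line.substring("topic:".length()).strip();
                }
                i++;
            }
            i++; // skip the closing "---"
        }

        // The first H1 is the title; every other line is body content.
        for (; i < lines.size(); i++) {
            String line = lines.get(i);
            if (title.isEmpty() && line.startsWith("# ")) {
                title = line.substring(2).strip();
            } else {
                body.add(line);
            }
        }
        return new MarkdownArticle(topic, title, String.join("\n", body).strip());
    }
}
```

A real implementation would more likely walk the flexmark AST than scan lines, which also handles edge cases such as headings inside code blocks.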

PDF format

Spring AI provides two DocumentReader implementations for PDFs, depending on how the document is structured.

PagePdfDocumentReader: groups one or more pages into a single Document.

@Component
public class MyPagePdfDocumentReader {
    List<Document> getDocsFromPdf() {
        PagePdfDocumentReader pdfReader = new PagePdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                        .withPageTopMargin(0)
                        .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                                .withNumberOfTopTextLinesToDelete(0)
                                .build())
                        .withPagesPerDocument(1)
                        .build());
        return pdfReader.read();
    }
}

ParagraphPdfDocumentReader: organizes the content into Documents by paragraph, but the PDF must contain an outline.

@Component
public class MyParagraphPdfDocumentReader {
    List<Document> getDocsFromPdfwithCatalog() {
        ParagraphPdfDocumentReader pdfReader = new ParagraphPdfDocumentReader("classpath:/sample1.pdf",
                PdfDocumentReaderConfig.builder()
                        .withPageTopMargin(0)
                        .withPageExtractedTextFormatter(ExtractedTextFormatter.builder()
                                .withNumberOfTopTextLinesToDelete(0)
                                .build())
                        .withPagesPerDocument(1)
                        .build());
        return pdfReader.read();
    }
}

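Related content should end up in the same Document, but PagePdfDocumentReader can only split by page count. Given this limitation, the recommended way to read PDF documents is: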

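If the PDF has an outline, use ParagraphPdfDocumentReader;
If it has no outline but every section spans a fixed number of pages, use PagePdfDocumentReader and read by page count;
Otherwise, use PagePdfDocumentReader to read the whole PDF into a single Document.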
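
These rules amount to a small decision function. A minimal sketch follows; PdfReadStrategy and PdfStrategyChooser are hypothetical names, and the two boolean inputs are assumed to be known in advance (detecting an outline would need a PDF library and is not shown).

```java
// Hypothetical labels for the three reading approaches listed above.
enum PdfReadStrategy {
    PARAGRAPH_READER,         // ParagraphPdfDocumentReader, driven by the outline
    PAGE_READER_FIXED_PAGES,  // PagePdfDocumentReader with a fixed pages-per-document value
    PAGE_READER_WHOLE_FILE    // PagePdfDocumentReader, whole PDF as one Document
}

class PdfStrategyChooser {

    // Applies the three rules in order: outline first, then fixed page count, then fallback.
    static PdfReadStrategy choose(boolean hasOutline, boolean fixedPagesPerSection) {
        if (hasOutline) {
            return PdfReadStrategy.PARAGRAPH_READER;
        }
        if (fixedPagesPerSection) {
            return PdfReadStrategy.PAGE_READER_FIXED_PAGES;
        }
        return PdfReadStrategy.PAGE_READER_WHOLE_FILE;
    }
}
```

The outline check comes first on purpose: when an outline exists it wins even if the sections also happen to have a fixed page count.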

Sample code

private class SinglePdfDocumentReader extends ParagraphPdfDocumentReader {
    private Path path;

    public SinglePdfDocumentReader(Path path) {
        super(new PathResource(path));
        this.path = path;
    }

    @Override
    public String getTextBetweenParagraphs(Paragraph fromParagraph, Paragraph toParagraph) {
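        // Read the whole PDF as a single Document, ignoring paragraph boundaries.
        PagePdfDocumentReader reader = new PagePdfDocumentReader(new PathResource(path),
                PdfDocumentReaderConfig.builder().withPagesPerDocument(ALL_PAGES).build());
        // findFirst() avoids an IndexOutOfBoundsException when the reader yields no Document.
        return reader.get().stream().findFirst().map(Document::getContent).map(String::trim).orElse("");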
    }
}

Unit test

@Test
public void importDocumentsWithPdfTest() {
    try {
        Path path = Paths.get(DocumentServiceTest.class.getResource("/importer/Mybatis.pdf").toURI());
        List<Article> articles = pdfDocumentService.importDocuments(path);
        assertIterableEquals(List.of("Mybatis是什么？"), articles.stream().map(Article::getTitle).toList());
    } catch (URISyntaxException e) {
        throw new RuntimeException(e);
    }
}

★ The spring-ai-tika-document-reader module provides a fairly universal TikaDocumentReader that can read PDF, DOC/DOCX, PPT/PPTX, HTML, and many other formats (for the full list, see tika.apache.org/2.9.0/forma… ).

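When to extract

1. Watch a specific directory for file changes

Java 7 added java.nio.file.WatchService, which can be used to listen for file changes. Where the platform supports it, WatchService is backed by the operating system's native file-change notifications, so it reports changes in the directories registered with it without traversing or comparing files. This event-driven approach is efficient.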
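
As a rough illustration of the WatchService API, the sketch below registers one directory and waits for the next batch of events. DirectoryWatcher and awaitChanges are hypothetical names chosen for this example; only the java.nio.file calls are standard.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class DirectoryWatcher implements AutoCloseable {
    private final WatchService watchService;
    private final Path directory;

    DirectoryWatcher(Path directory) throws IOException {
        this.directory = directory;
        this.watchService = directory.getFileSystem().newWatchService();
        // Only registered directories are watched; subdirectories need their own registration.
        directory.register(watchService,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
    }

    // Waits up to the timeout for the next batch of events and returns the affected paths.
    List<Path> awaitChanges(long timeout, TimeUnit unit) throws InterruptedException {
        List<Path> changed = new ArrayList<>();
        WatchKey key = watchService.poll(timeout, unit);
        if (key == null) {
            return changed; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                continue; // events were lost; there is no path to report
            }
            changed.add(directory.resolve((Path) event.context()));
        }
        key.reset(); // re-arm the key so later events are delivered
        return changed;
    }

    @Override
    public void close() throws IOException {
        watchService.close();
    }
}
```

In a long-running service, awaitChanges would typically be called in a loop on a dedicated thread, handing each changed path to the extraction step.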

In the sample code, file changes are tracked by listening for the ApplicationEvent published by create, update, delete, and import requests.

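2. Call a different DocumentReader based on the file extension to extract Documents

The sample code uses the strategy pattern: the json, text, markdown, and pdf formats are each handled in a separate class.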
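
A minimal sketch of that dispatch is shown below. It is not the sample project's actual code: DocumentImporter, StubImporter, and ImporterRegistry are hypothetical names, the importers return plain strings instead of Article objects, and the extension strings are assumptions.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical strategy interface: one implementation per file format.
interface DocumentImporter {
    String extension();

    List<String> importDocuments(Path path);
}

// Placeholder strategy that only reports which importer handled which file.
record StubImporter(String extension) implements DocumentImporter {
    @Override
    public List<String> importDocuments(Path path) {
        return List.of(extension + ":" + path.getFileName());
    }
}

// Picks the strategy by file extension; unknown extensions yield no documents.
class ImporterRegistry {
    private final Map<String, DocumentImporter> byExtension = new HashMap<>();

    ImporterRegistry(List<DocumentImporter> importers) {
        for (DocumentImporter importer : importers) {
            byExtension.put(importer.extension().toLowerCase(Locale.ROOT), importer);
        }
    }

    List<String> importFile(Path path) {
        String name = path.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        DocumentImporter importer = byExtension.get(ext);
        return importer == null ? List.of() : importer.importDocuments(path);
    }
}
```

Supporting a new format then means adding one more DocumentImporter implementation, with no change to the dispatch code.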
