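Today at work I wanted to build a feature that converts every kind of document to PDF and previews it in the browser. It breaks down into roughly three steps:
- 1. Extract the content from each document format
- 2. Write that content into a newly created PDF
- 3. Preview the PDF on the front end with pdfjs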
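Getting the text through Apache POI's text extractors worked without trouble, but reading a txt file through an IO stream produced garbled content. Opening the txt file showed its encoding as "ANSI", which means the file was saved in the system default encoding. You can check that from the command line:
chcp
On my machine this reports code page 936. Note that chcp shows the console's active code page; on a Simplified Chinese Windows install this is the same as the ANSI code page. Some common code pages and the encodings they correspond to: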
| Code page | Encoding |
|---|---|
| 936 | GBK |
| 950 | Big5 |
| 437 | IBM437 (OEM United States, an ASCII superset) |
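To make the cause of the garbling concrete, here is a minimal sketch that encodes a Chinese string as GBK and then decodes the same bytes with the wrong and the right charset. The class name `CodePageDemo` and the helper `charsetForCodePage` are hypothetical names introduced only for illustration, and the sketch assumes the JDK in use ships the GBK and Big5 charsets (a standard full JDK does).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageDemo {

    // Hypothetical helper: maps a Windows code page number to a Java Charset.
    // Only the code pages discussed in this post (plus UTF-8) are covered.
    static Charset charsetForCodePage(int codePage) {
        switch (codePage) {
            case 936:   return Charset.forName("GBK");
            case 950:   return Charset.forName("Big5");
            case 437:   return Charset.forName("IBM437");
            case 65001: return StandardCharsets.UTF_8;
            default:
                throw new IllegalArgumentException("Unmapped code page: " + codePage);
        }
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as escapes so the source file
        // encoding cannot affect the demo.
        String original = "\u4e2d\u6587\u6587\u6863";

        // What an "ANSI" txt file holds on a code page 936 system.
        byte[] gbkBytes = original.getBytes(charsetForCodePage(936));

        // Decoding with the wrong charset yields replacement characters.
        String wrong = new String(gbkBytes, StandardCharsets.UTF_8);
        // Decoding with the matching charset restores the text.
        String right = new String(gbkBytes, charsetForCodePage(936));

        System.out.println(wrong.equals(original)); // false
        System.out.println(right.equals(original)); // true
    }
}
```

The bytes on disk are fine in both cases; the text only looks broken when the reader is told to use a charset other than the one the file was written in.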
Here is part of the code I use to extract the text:
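// Imports needed at the top of the enclosing class file
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Locale;
import org.apache.poi.hssf.extractor.EventBasedExcelExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.ss.extractor.ExcelExtractor;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFShape;
import org.apache.poi.xslf.usermodel.XSLFTextParagraph;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Extracts the plain text of a single file; returns "" for unsupported types or on failure
private String getText(File file) {
    // Compare against the lower-cased name so ".TXT", ".Docx" etc. also match
    String name = file.getName().toLowerCase(Locale.ROOT);
    try {
        if (name.endsWith(".xlsx")) {
            try (XSSFWorkbook workbook = new XSSFWorkbook(file)) {
                ExcelExtractor extractor = new XSSFExcelExtractor(workbook);
                return extractor.getText();
            }
        } else if (name.endsWith(".xls")) {
            try (POIFSFileSystem fileSystem = new POIFSFileSystem(file)) {
                ExcelExtractor extractor = new EventBasedExcelExtractor(fileSystem);
                return extractor.getText();
            }
        } else if (name.endsWith(".docx")) {
            try (InputStream inputStream = new FileInputStream(file);
                 XWPFDocument document = new XWPFDocument(inputStream)) {
                XWPFWordExtractor extractor = new XWPFWordExtractor(document);
                return extractor.getText();
            }
        } else if (name.endsWith(".doc")) {
            try (InputStream inputStream = new FileInputStream(file)) {
                WordExtractor extractor = new WordExtractor(inputStream);
                return extractor.getText();
            }
        } else if (name.endsWith(".txt")) {
            // "GBK" matches code page 936; it is hard-coded for files saved as "ANSI" on this system
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), "GBK"))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    // readLine() strips the line terminator, so add it back
                    sb.append(line).append('\n');
                }
                return sb.toString();
            }
        } else if (name.endsWith(".pptx")) {
            try (InputStream inputStream = new FileInputStream(file);
                 XMLSlideShow slideShow = new XMLSlideShow(inputStream)) {
                SlideShowExtractor<XSLFShape, XSLFTextParagraph> extractor =
                        new SlideShowExtractor<>(slideShow);
                return extractor.getText();
            }
        }
    } catch (IOException | OpenXML4JException e) {
        e.printStackTrace();
    }
    return "";
}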
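Hard-coding "GBK" only works for txt files saved in that encoding. One possible way to loosen this, sketched below under my own assumptions, is to try a strict UTF-8 decode first and fall back to a legacy charset only when the bytes are not valid UTF-8. The class `TextFileReader` and its methods are hypothetical names, not part of POI or the code above, and the heuristic is not foolproof: pure ASCII decodes the same either way, and a short GBK sequence can occasionally also be valid UTF-8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextFileReader {

    // Decodes bytes as UTF-8 if they are valid UTF-8, otherwise with the fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            String text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            // Drop a leading byte order mark if the file was saved as "UTF-8 with BOM".
            return text.startsWith("\uFEFF") ? text.substring(1) : text;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy ("ANSI") encoding supplied by the caller.
            return new String(bytes, fallback);
        }
    }

    // Reads a whole text file, keeping its original line breaks.
    static String readText(Path path, Charset fallback) throws IOException {
        return decode(Files.readAllBytes(path), fallback);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String sample = "\u4e2d\u6587";
        System.out.println(decode(sample.getBytes(gbk), gbk).equals(sample));                     // true
        System.out.println(decode(sample.getBytes(StandardCharsets.UTF_8), gbk).equals(sample));  // true
    }
}
```

In the `.txt` branch this could stand in for the `BufferedReader` loop, for example `TextFileReader.readText(file.toPath(), Charset.forName("GBK"))`, at the cost of loading the whole file into memory.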
Writing the content into a PDF can be done with the iText core library or with Apache FOP; I am still looking into that part.