Checking a txt file's default encoding and extracting text from various document types


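Today at work I wanted to build a feature that converts every kind of document to PDF and previews the result in a web page. It breaks down into roughly three steps:

  • 1. Extract the content from each document format
  • 2. Write that content into a newly created PDF
  • 3. Preview the PDF in the front end with pdfjs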

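Extracting text with the Apache POI extractor classes worked without trouble, but reading a txt file through an I/O stream produced garbled text. Opening the txt file showed that its encoding was "ANSI", which means the file uses the system default encoding (the active Windows code page). You can check which one that is from the command line: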

chcp

On my machine the default code page is 936. Some common code pages and the encodings they map to:

Code page   Encoding
936         GBK
950         Big-5
437         IBM437 (OEM United States, an ASCII superset)
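
The same lookup can be sketched on the Java side. This is a minimal illustration, not part of the original project: the class `CodePageLookup` and the method `charsetNameFor` are hypothetical names, and the table only covers the three code pages listed above. It also prints the JVM's own default charset, which is what `InputStreamReader` falls back to when no charset is passed. Note that since JDK 18 this default is UTF-8 regardless of the operating system's code page, so an "ANSI" file generally needs its charset named explicitly.

```java
import java.nio.charset.Charset;

public class CodePageLookup {

    /**
     * Maps a Windows code page number to a Java charset name.
     * Hypothetical helper covering only the code pages from the table above;
     * returns null for anything else.
     */
    public static String charsetNameFor(int codePage) {
        switch (codePage) {
            case 936: return "GBK";
            case 950: return "Big5";
            case 437: return "IBM437";
            default:  return null;
        }
    }

    public static void main(String[] args) {
        // The charset the JVM uses when a reader is created without one
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        String name = charsetNameFor(936);
        // Extended charsets such as GBK can be missing on stripped-down runtimes,
        // so check before calling Charset.forName
        System.out.println(name + " supported: " + Charset.isSupported(name));
    }
}
```

Checking `Charset.isSupported` first avoids an `UnsupportedCharsetException` on runtimes that ship without the extended charset module.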

Here is part of the code I use to get the document text:

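// Imports needed at the top of the enclosing class file
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import org.apache.poi.hssf.extractor.EventBasedExcelExtractor;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.ss.extractor.ExcelExtractor;
import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFShape;
import org.apache.poi.xslf.usermodel.XSLFTextParagraph;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Extract the plain text of a single file, choosing the extractor by file extension.
// try-with-resources closes each document even when extraction throws.
    private String getText(File file){
        String name = file.getName();
        try {
            if(name.endsWith(".xlsx")){
                try (XSSFWorkbook workbook = new XSSFWorkbook(file)) {
                    ExcelExtractor extractor = new XSSFExcelExtractor(workbook);
                    return extractor.getText();
                }
            }else if(name.endsWith(".xls")){
                try (POIFSFileSystem fileSystem = new POIFSFileSystem(file)) {
                    ExcelExtractor extractor = new EventBasedExcelExtractor(fileSystem);
                    return extractor.getText();
                }
            }else if(name.endsWith(".docx")){
                try (InputStream in = new FileInputStream(file);
                     XWPFDocument document = new XWPFDocument(in)) {
                    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
                    return extractor.getText();
                }
            }else if(name.endsWith(".doc")){
                try (InputStream inputStream = new FileInputStream(file)) {
                    WordExtractor extractor = new WordExtractor(inputStream);
                    return extractor.getText();
                }
            }else if(name.endsWith(".txt")){
                // The file is "ANSI" on a code page 936 system, so decode it as GBK
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file),"GBK"))) {
                    StringBuilder sb = new StringBuilder();
                    String line;
                    while((line = reader.readLine()) != null){
                        // readLine() strips the line terminator, so add it back
                        sb.append(line).append('\n');
                    }
                    return sb.toString();
                }
            }else if(name.endsWith(".pptx")){
                try (InputStream in = new FileInputStream(file);
                     XMLSlideShow slideShow = new XMLSlideShow(in)) {
                    SlideShowExtractor<XSLFShape, XSLFTextParagraph> extractor = new SlideShowExtractor<>(slideShow);
                    return extractor.getText();
                }
            }
        } catch (IOException | OpenXML4JException e) {
            e.printStackTrace();
        }
        // Unsupported extension or extraction failure
        return "";
    }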
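
Hard-coding "GBK" works for files saved as "ANSI" on a code page 936 machine, but it garbles a txt file that was saved as UTF-8. One possible workaround, sketched here as an assumption rather than what the project actually does, is to attempt a strict UTF-8 decode first and fall back to a caller-supplied charset only when the bytes are not valid UTF-8. The class `TextDecoding` and the method `decode` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TextDecoding {

    /**
     * Decodes bytes as UTF-8 if they form valid UTF-8,
     * otherwise decodes them with the given fallback charset.
     */
    public static String decode(byte[] bytes, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of silently inserting U+FFFD
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: assume the legacy encoding supplied by the caller
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TextDecoding <path-to-txt-file>
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(decode(bytes, Charset.forName("GBK")));
    }
}
```

This is a heuristic: a short GBK byte sequence can occasionally also be valid UTF-8, and a UTF-8 byte order mark is kept as a leading U+FEFF character, so the result is a best guess rather than real encoding detection.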

Writing the content into a PDF can be done with the iText core package or with Apache FOP. I am still looking into that step.