Auto-detecting ZIP file name encoding to fix garbled names when unzipping in Java

1. Problem Description

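When ZIP files are created on different platforms, the file names are encoded differently. There are two main cases:

  1. On Windows, the default encoding for Chinese is GBK, so file names in the archive are GBK-encoded.
  2. On Linux/macOS, the default encoding is UTF-8, so file names in the archive are UTF-8-encoded.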
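The difference between the two cases is visible at the byte level. The following is a minimal sketch (the class name FileNameEncodingDemo and the sample name are made up for illustration) that encodes one file name both ways and shows that only the matching charset restores it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileNameEncodingDemo {

  /** Renders bytes as space-separated hex for easy comparison. */
  public static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02X ", b));
    }
    return sb.toString().trim();
  }

  public static void main(String[] args) {
    Charset gbk = Charset.forName("GBK");
    // A Chinese file name ending in ".txt", written with escapes to avoid source-encoding issues
    String name = "\u6d4b\u8bd5.txt";

    byte[] gbkBytes = name.getBytes(gbk);                     // 2 bytes per Chinese character
    byte[] utf8Bytes = name.getBytes(StandardCharsets.UTF_8); // 3 bytes per Chinese character
    System.out.println("GBK bytes:   " + hex(gbkBytes));
    System.out.println("UTF-8 bytes: " + hex(utf8Bytes));

    // Decoding with the matching charset restores the name; the mismatched one does not.
    System.out.println("GBK read as GBK matches:   " + new String(gbkBytes, gbk).equals(name));
    System.out.println("GBK read as UTF-8 matches: "
        + new String(gbkBytes, StandardCharsets.UTF_8).equals(name));
  }
}
```

A ZIP entry stores only the raw bytes of the name, so a reader that assumes the wrong charset reproduces the second situation.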

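This causes a problem: when an archive created on Windows is extracted on macOS, or uploaded to a server and extracted there for processing, the file names inside come out garbled. The root cause is that the ZIP format was published in 1989, before the Unicode standard existed. The current ZIP specification defines the Info-ZIP Unicode Path Extra Field (0x7075), which stores the UTF-8 encoded file name. It appears in the specification's list of extra field header IDs: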

0x07c8        Macintosh
0x2605        ZipIt Macintosh
0x2705        ZipIt Macintosh 1.3.5+
0x2805        ZipIt Macintosh 1.3.5+
0x334d        Info-ZIP Macintosh
0x4341        Acorn/SparkFS 
0x4453        Windows NT security descriptor (binary ACL)
0x4704        VM/CMS
0x470f        MVS
0x4b46        FWKCS MD5 (see below)
0x4c41        OS/2 access control list (text ACL)
0x4d49        Info-ZIP OpenVMS
0x4f4c        Xceed original location extra field
0x5356        AOS/VS (ACL)
0x5455        extended timestamp
0x554e        Xceed unicode extra field
0x5855        Info-ZIP UNIX (original, also OS/2, NT, etc)
0x6375        Info-ZIP Unicode Comment Extra Field
0x6542        BeOS/BeBox
0x7075        Info-ZIP Unicode Path Extra Field
0x756e        ASi UNIX
0x7855        Info-ZIP UNIX (new)
0xa11e        Data Stream Alignment (Apache Commons-Compress)
0xa220        Microsoft Open Packaging Growth Hint
0xfd4a        SMS/QDOS
0x9901        AE-x encryption structure (see APPENDIX E)
0x9902        unknown

The Info-ZIP Unicode Path Extra Field (0x7075) is described as follows:

       Stores the UTF-8 version of the file name field as stored in the
       local header and central directory header. (Last Revision 20070912)

         Value         Size        Description
         -----         ----        -----------
 (UPath) 0x7075        Short       tag for this extra block type ("up")
         TSize         Short       total data size for this block
         Version       1 byte      version of this extra field, currently 1
         NameCRC32     4 bytes     File Name Field CRC32 Checksum
         UnicodeName   Variable    UTF-8 version of the entry File Name

      Currently Version is set to the number 1.  If there is a need
      to change this field, the version will be incremented.  Changes
      MAY NOT be backward compatible so this extra field SHOULD NOT be
      used if the version is not recognized.

      The NameCRC32 is the standard zip CRC32 checksum of the File Name
      field in the header.  This is used to verify that the header
      File Name field has not changed since the Unicode Path extra field
      was created.  This can happen if a utility renames the File Name but
      does not update the UTF-8 path extra field.  If the CRC check fails,
      this UTF-8 Path Extra Field SHOULD be ignored and the File Name field
      in the header SHOULD be used instead.

      The UnicodeName is the UTF-8 version of the contents of the File Name
      field in the header.  As UnicodeName is defined to be UTF-8, no UTF-8
      byte order mark (BOM) is used.  The length of this field is determined
      by subtracting the size of the previous fields from TSize.  If both
      the File Name and Comment fields are UTF-8, the new General Purpose
      Bit Flag, bit 11 (Language encoding flag (EFS)), can be used to
      indicate that both the header File Name and Comment fields are UTF-8
      and, in this case, the Unicode Path and Unicode Comment extra fields
      are not needed and SHOULD NOT be created.  Note that, for backward
      compatibility, bit 11 SHOULD only be used if the native character set
      of the paths and comments being zipped up are already in UTF-8. It is
      expected that the same file name storage method, either general
      purpose bit 11 or extra fields, be used in both the Local and Central
      Directory Header for a file.

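However, this field is optional and may be absent. In practice, tools on Linux/macOS often do not record it when creating an archive, so the encoding used for the names in a zip archive cannot be identified reliably, which leads to garbled names on extraction.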
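To make the quoted layout concrete, here is a minimal sketch of parsing a 0x7075 block out of an entry's raw extra-field bytes using only the JDK. The class name UnicodePathField and the method parse are hypothetical; obtaining the extra-field bytes and the raw header file name from an archive is assumed to happen elsewhere.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class UnicodePathField {

  /**
   * Scans extra-field data for a Unicode Path block (tag 0x7075).
   * Returns the UTF-8 name, or null if the block is absent, malformed,
   * has an unknown version, or its CRC does not match the header file name.
   */
  public static String parse(byte[] extra, byte[] rawFileName) {
    ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
    while (buf.remaining() >= 4) {
      int tag = buf.getShort() & 0xFFFF;
      int size = buf.getShort() & 0xFFFF;   // TSize: data size of this block
      if (size > buf.remaining()) {
        return null;                        // malformed extra data
      }
      if (tag != 0x7075) {
        buf.position(buf.position() + size); // skip blocks of other types
        continue;
      }
      if (size < 5) {
        return null;                        // too short for Version + NameCRC32
      }
      int version = buf.get() & 0xFF;
      long storedCrc = buf.getInt() & 0xFFFFFFFFL;
      byte[] name = new byte[size - 5];     // UnicodeName fills the rest of the block
      buf.get(name);

      CRC32 crc = new CRC32();
      crc.update(rawFileName);
      if (version != 1 || crc.getValue() != storedCrc) {
        return null;                        // per the spec, ignore the field
      }
      return new String(name, StandardCharsets.UTF_8);
    }
    return null;
  }
}
```

When parse returns null, the caller is back to the situation described above: only the raw header name is available and its encoding has to be guessed.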

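2. How an Existing Tool Solves It

Because the encoding cannot be obtained from the ZIP archive itself, let us look at how a mature archiving tool, The Unarchiver, handles this. Its advanced settings include an option to automatically detect the file name encoding. When The Unarchiver is used to extract an archive created on Windows with GBK-encoded names, it presents a suggested file name encoding. From this we can infer that, when extracting, The Unarchiver runs encoding detection on the names in the archive and picks the most likely encoding, asking the user to confirm when it cannot decide.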

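3. Java Extraction Implementation

Following The Unarchiver's approach, the Java extraction code also inspects the decoded result to determine whether the extracted file names are garbled. Given our use case, we only check whether the decoded names contain garbled Chinese; garbling in other languages (Russian, for example) is not detected.

3.1 Add the zip4j dependency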

<dependency>
  <groupId>net.lingala.zip4j</groupId>
  <artifactId>zip4j</artifactId>
  <version>2.9.1</version>
</dependency>

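3.2 Extraction implementation

import java.nio.charset.Charset;
import java.util.List;
import net.lingala.zip4j.ZipFile;
import net.lingala.zip4j.exception.ZipException;
import net.lingala.zip4j.model.FileHeader;

/**
 * Extract a zip archive.
 *
 * @param srcFile source file
 * @param destPath directory to extract into
 */
public static void unzip(String srcFile, String destPath) throws ZipException {
  // First pass: decode the names as GBK and check whether they look garbled
  ZipFile zipFile = new ZipFile(srcFile);
  zipFile.setCharset(Charset.forName("GBK"));
  String charset = recognizeCharset(zipFile.getFileHeaders());
  // Second pass: reopen the archive with the recognized charset and extract
  zipFile = new ZipFile(srcFile);
  zipFile.setCharset(Charset.forName(charset));
  zipFile.extractAll(destPath);
}

/**
 * Recognize the encoding of the file names.
 *
 * @param fileHeaders file headers whose names were decoded as GBK
 * @return the encoding; defaults to GBK, since Windows users are still the majority
 */
public static String recognizeCharset(List<FileHeader> fileHeaders) {
  if (fileHeaders == null || fileHeaders.isEmpty()) {
    return "GBK";
  }
  // Only whitespace, ASCII letters, digits, underscore, dot, forward slash,
  // and Chinese characters are treated as clean.
  // Extend the pattern yourself if other characters must be accepted.
  // Punctuation and symbols (\pP|\pS) are not accepted for now.
  String messyRegex = "[\\sa-zA-Z_0-9./\u4e00-\u9fa5]+";
  for (FileHeader fileHeader : fileHeaders) {
    String fileName = fileHeader.getFileName();
    if (!fileName.matches(messyRegex)) {
      // The GBK-decoded name is garbled, so use UTF-8
      return "UTF8";
    }
  }
  return "GBK";
}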
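One limitation of this heuristic can be checked in isolation. The sketch below (class and method names are hypothetical) reuses the same pattern and simulates the first pass of unzip: it takes the UTF-8 bytes of a name, decodes them as GBK, and tests the result. Some UTF-8 names happen to decode into byte pairs that are all valid Chinese characters in GBK, so they pass the check and the archive would be treated as GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RegexHeuristicCheck {

  // Same pattern as in recognizeCharset
  static final String CLEAN = "[\\sa-zA-Z_0-9./\u4e00-\u9fa5]+";

  /**
   * Encodes the name as UTF-8, decodes those bytes as GBK (as the first pass
   * of unzip does), and reports whether the result still looks clean.
   */
  public static boolean utf8NamePassesAsGbk(String name) {
    byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
    String asGbk = new String(utf8, Charset.forName("GBK"));
    return asGbk.matches(CLEAN);
  }

  public static void main(String[] args) {
    // Two sample names, written with escapes to avoid source-encoding issues
    String first = "\u6d4b\u8bd5.zip";
    String second = "\u4e2d\u6587.zip";
    // true means the heuristic would wrongly keep GBK for a UTF-8 name
    System.out.println(first + " passes as GBK: " + utf8NamePassesAsGbk(first));
    System.out.println(second + " passes as GBK: " + utf8NamePassesAsGbk(second));
  }
}
```

Whether a given name slips through depends only on how its UTF-8 bytes happen to pair up under GBK, which is why a detector working on the raw bytes is more reliable than a check on the decoded string.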

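4. Encoding Detection via Bytecode Enhancement

The detection method above is simple and prone to misdetection. Several open-source encoding detectors already exist, such as icu4j and juniversalchardet. Reading the zip4j source shows that zip4j decodes file names in readCentralDirectory.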

  private CentralDirectory readCentralDirectory(RandomAccessFile zip4jRaf, RawIO rawIO, Charset charset) throws IOException {
    ...
    
    for (int i = 0; i < centralDirEntryCount; i++) {
      ...
          
      if (fileNameLength > 0) {
        byte[] fileNameBuff = new byte[fileNameLength];
        zip4jRaf.readFully(fileNameBuff);
        // Decode the file name
        String fileName = decodeStringWithCharset(fileNameBuff, fileHeader.isFileNameUTF8Encoded(), charset);
        fileHeader.setFileName(fileName);
      } else {
        fileHeader.setFileName(null);
      }
      ...

      fileHeaders.add(fileHeader);
    }

    ...
    return centralDirectory;
  }

During decoding, the decodeStringWithCharset method of the HeaderUtil class is called.

  public static String decodeStringWithCharset(byte[] data, boolean isUtf8Encoded, Charset charset) {
    if (charset != null) {
      return new String(data, charset);
    }

    if (isUtf8Encoded) {
      return new String(data, InternalZipConstants.CHARSET_UTF_8);
    }

    try {
      return new String(data, ZIP_STANDARD_CHARSET_NAME);
    } catch (UnsupportedEncodingException e) {
      return new String(data);
    }
  }

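Nowhere in this flow is there a method that exposes the raw bytes of the file name, so a detection library cannot be applied directly. There are a few ways to obtain those bytes:

  1. Modify the zip4j source, rebuild the jar, and replace the local jar. This makes future upgrades inconvenient.
  2. Modify the bytecode of decodeStringWithCharset to inject logic that captures the byte array.

The second approach is clearly more elegant and easier to maintain; the bytecode can be enhanced with ASM or Javassist. However, decodeStringWithCharset is called from inside zip4j, so it cannot simply be swapped for a class loaded by a custom class loader, and once a class has been loaded by a class loader it can no longer be modified. The enhanced class must therefore be loaded by the system class loader before the original is. Javassist provides exactly this mechanism: the class loader to use can be specified when the class is loaded. Both enhancement techniques are described below.

4.1 Enhancing the method with ASM

  1. Add the ASM dependency
<!-- ASM -->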
<dependency>
  <groupId>org.ow2.asm</groupId>
  <artifactId>asm-all</artifactId>
  <version>5.2</version>
</dependency>
  2. Enhance the method
import java.io.IOException;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

/**
 * Build the bytes of the enhanced class.
 *
 * @param className fully qualified name of the class to enhance
 */
public static byte[] buildClassBytes(String className) throws IOException {
  ClassReader classReader = new ClassReader(className);
  ClassWriter classWriter = new ClassWriter(ClassWriter.COMPUTE_MAXS);
  HeaderUtilClassVisitor classVisitor = new HeaderUtilClassVisitor(classWriter);
  classReader.accept(classVisitor, ClassReader.SKIP_DEBUG);
  return classWriter.toByteArray();
}

/**
 * Class visitor for HeaderUtil; enhances the decodeStringWithCharset method.
 */
public static class HeaderUtilClassVisitor extends ClassVisitor {

  public HeaderUtilClassVisitor(ClassVisitor cv) {
    super(Opcodes.ASM5, cv);
  }

  @Override
  public MethodVisitor visitMethod(int access, String name, String desc, String signature, String[] exceptions) {
    if (!name.equals("decodeStringWithCharset")) {
      return super.visitMethod(access, name, desc, signature, exceptions);
    }
    MethodVisitor mv = cv.visitMethod(access, name, desc, signature, exceptions);
    return new DecodeMethodVisitor(mv);
  }

}

/**
 * Enhancement logic: capture the file name byte array.
 */
public static class DecodeMethodVisitor extends MethodVisitor {

  public DecodeMethodVisitor(MethodVisitor mv) {
    super(Opcodes.ASM5, mv);
  }

  @Override
  public void visitCode() {
    super.visitCode();
    // For a static method, slot 0 is the first argument; for an instance method, slot 0 is this
    mv.visitVarInsn(Opcodes.ALOAD, 0);
    mv.visitMethodInsn(Opcodes.INVOKESTATIC, "com/beidou/study/asm/zip4j/ByteReader", "read", "([B)V", false);
  }

}


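// Must live in the package referenced by the injected call
package com.beidou.study.asm.zip4j;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Collects the file name byte arrays read by zip4j.
 */
public class ByteReader {

  private final static Map<Thread, List<byte[]>> PARAM_MAP = new ConcurrentHashMap<>();

  /** Start collecting for the current thread. */
  public static void put() {
    PARAM_MAP.putIfAbsent(Thread.currentThread(), new ArrayList<>());
  }

  /** Stop collecting and release the current thread's entry. */
  public static void remove() {
    PARAM_MAP.remove(Thread.currentThread());
  }

  public static List<byte[]> get() {
    return PARAM_MAP.get(Thread.currentThread());
  }

  /** Called from the enhanced decodeStringWithCharset; ignored unless put() was called first. */
  public static void read(byte[] bytes) {
    if (bytes == null || bytes.length == 0) {
      return;
    }
    List<byte[]> collected = PARAM_MAP.get(Thread.currentThread());
    if (collected == null) {
      return;
    }
    collected.add(bytes);
  }

}

  3. Result of the enhancement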
public static String decodeStringWithCharset(byte[] var0, boolean var1, Charset var2) {
  ByteReader.read(var0);
  if (var2 != null) {
    return new String(var0, var2);
  } else if (var1) {
    return new String(var0, InternalZipConstants.CHARSET_UTF_8);
  } else {
    try {
      return new String(var0, "Cp437");
    } catch (UnsupportedEncodingException var4) {
      return new String(var0);
    }
  }
}

As shown, logic to capture the byte array has been woven into the method:

ByteReader.read(var0);
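The ByteReader class above keys its map by Thread. As a design alternative, the same per-thread collection could be expressed with ThreadLocal; the following is only a sketch with a hypothetical class name (ThreadLocalByteReader) and hypothetical method names, not the class used in the rest of this article. If it were used, the injected call would have to target it instead.

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadLocalByteReader {

  // Holds the collected file name bytes for the current thread; null when not collecting
  private static final ThreadLocal<List<byte[]>> NAMES = new ThreadLocal<>();

  /** Start collecting on the current thread. */
  public static void start() {
    NAMES.set(new ArrayList<>());
  }

  /** Target for the injected call; does nothing unless start() was called on this thread. */
  public static void read(byte[] bytes) {
    List<byte[]> collected = NAMES.get();
    if (collected == null || bytes == null || bytes.length == 0) {
      return;
    }
    collected.add(bytes.clone()); // defensive copy in case the caller reuses the array
  }

  /** Stop collecting and return what was gathered (empty list if start() was never called). */
  public static List<byte[]> finish() {
    List<byte[]> collected = NAMES.get();
    NAMES.remove();
    return collected == null ? new ArrayList<>() : collected;
  }
}
```

Either way, calling the cleanup method in a finally block keeps a pooled thread from retaining the collected names after a failed extraction.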

4.2 Enhancing the method with Javassist

  1. Add the dependency
<!-- Javassist -->
<dependency>
  <groupId>org.javassist</groupId>
  <artifactId>javassist</artifactId>
  <version>3.28.0-GA</version>
</dependency>
  2. Enhance the method
ClassPool cp = ClassPool.getDefault();
CtClass cc = cp.get("net.lingala.zip4j.headers.HeaderUtil");
CtMethod method = cc.getDeclaredMethod("decodeStringWithCharset");
method.insertBefore("com.beidou.study.asm.zip4j.ByteReader.read($1);");
  3. The result is similar to the ASM version

4.3 Loading the enhanced class with a specified class loader via Javassist

Once a class has been loaded it is cached, and later accesses do not load it again. So it is enough to load the enhanced HeaderUtil class once with the system class loader at service startup; the original class will then never be loaded.

  1. Loading the ASM-enhanced class
/**
 * Load the HeaderUtil class (using the class loader of the current class).
 */
public static void loadClass() throws IOException, CannotCompileException {
  String className = "net.lingala.zip4j.headers.HeaderUtil";
  byte[] bytes = buildClassBytes(className);
  // Let Javassist wrap the generated class bytes and define the class
  // in the class loader used by the current class
  ClassPool cp = ClassPool.getDefault();
  CtClass cc = cp.makeClassIfNew(new ByteArrayInputStream(bytes));
  cc.toClass(Zip4jDemo.class.getClassLoader(), null);
}
  2. Loading the Javassist-enhanced class
/**
 * Modify the class bytecode with Javassist and load it (using the class loader of the current class).
 */
public static void loadClass() throws NotFoundException, CannotCompileException, ClassNotFoundException {
  ClassPool cp = ClassPool.getDefault();
  CtClass cc = cp.get("net.lingala.zip4j.headers.HeaderUtil");
  CtMethod method = cc.getDeclaredMethod("decodeStringWithCharset");
  method.insertBefore("com.beidou.study.asm.zip4j.ByteReader.read($1);");
  // Define the class in the class loader of the current class
  cc.toClass(Zip4jDemo.class.getClassLoader(), null);
}

4.4 Automatic encoding detection with juniversalchardet

  1. Add juniversalchardet; how it differs from icu4j is left for the reader to explore
<!-- Encoding detection -->
<dependency>
  <groupId>com.github.albfernandez</groupId>
  <artifactId>juniversalchardet</artifactId>
  <version>2.4.0</version>
</dependency>
  2. Automatic detection
/**
 * Encoding detector based on the juniversalchardet package.
 */
public class EncodingDetector {

  public static String recognize(byte[] bytes) {
    if (bytes == null || bytes.length == 0) {
      return null;
    }
    UniversalDetector detector = new UniversalDetector();
    detector.handleData(bytes);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding;
  }

}

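Detection accuracy can be tuned at this point. During testing, GBK-encoded names were sometimes identified as GB18030, so some adjustment of your own is needed.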
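One possible adjustment, sketched here with JDK classes only, is to post-process the detector's answer: first test whether the bytes are strictly valid UTF-8, and otherwise fold a GB18030 result (or no result at all) back to GBK. The class CharsetFallback and its methods are hypothetical and are not part of juniversalchardet; the GBK default mirrors the default chosen in section 3.2.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {

  /** Returns true only if the bytes decode as UTF-8 without any malformed sequence. */
  public static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  /**
   * Picks the charset name to use for decoding.
   *
   * @param bytes    raw file name bytes
   * @param detected charset name reported by a detector, possibly null
   */
  public static String choose(byte[] bytes, String detected) {
    if (isValidUtf8(bytes)) {
      return "UTF-8";
    }
    if (detected == null || "GB18030".equalsIgnoreCase(detected)) {
      return "GBK"; // assumption: Chinese Windows archives are the common non-UTF-8 case
    }
    return detected;
  }
}
```

Since GB18030 was designed as a superset of GBK, keeping a GB18030 result as-is should also decode GBK names; folding it to GBK here is just one choice. Note that the strict UTF-8 check takes priority, so it overrides the detector whenever the bytes happen to be valid UTF-8.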

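4.5 Detection demo

import java.nio.charset.Charset;
import java.util.List;
import net.lingala.zip4j.ZipFile;

public class Zip4jUnzipDemo {

  public static void main(String[] args) throws Exception {
    // Load the enhanced HeaderUtil class
    HeaderUtilEnhance.loadClass();
    // Initialize the byte list associated with the current thread
    ByteReader.put();

    // Read the file headers once; the raw bytes of each file name are captured along the way
    String fileName = "/Users/ginger/Desktop/测试压缩.zip";
    ZipFile zipFile = new ZipFile(fileName);
    zipFile.getFileHeaders();

    // Fetch the captured file name bytes and release the current thread's entry
    List<byte[]> bytes = ByteReader.get();
    ByteReader.remove();

    // Concatenate the bytes of all file names
    int len = 0;
    for (byte[] tmp : bytes) {
      len += tmp.length;
    }
    byte[] allBytes = new byte[len];
    len = 0;
    for (byte[] tmp : bytes) {
      System.arraycopy(tmp, 0, allBytes, len, tmp.length);
      len += tmp.length;
    }

    // Detect the encoding; fall back to GBK when the detector returns null
    String charset = EncodingDetector.recognize(allBytes);
    if (charset == null) {
      charset = "GBK";
    }

    // Extract using the detected encoding
    zipFile = new ZipFile(fileName);
    zipFile.setCharset(Charset.forName(charset));
    zipFile.extractAll("/Users/ginger/Desktop");
  }

}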

5. References