1. Problem description
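ZIP archives created on different platforms encode file names differently. There are two main cases:
- Windows: the default encoding for Chinese locales is GBK, so file names in the archive are stored as GBK.
- Linux/macOS: the default encoding is UTF-8, so file names in the archive are stored as UTF-8.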
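This causes a problem: when an archive created on Windows is extracted on macOS, or uploaded to a server and extracted there for processing, the file names inside come out garbled. The root cause is that the ZIP format was published in 1989, before the Unicode standard existed. The current ZIP specification defines the Info-ZIP Unicode Path Extra Field (0x7075), which stores the UTF-8 version of the file name.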
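The garbling described above can be reproduced with the JDK alone. The sketch below is illustrative only: the class name `MojibakeDemo` and the helper `decodeAs` are made up for this example, and it assumes the JDK ships the GBK charset (standard full JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class MojibakeDemo {

    // Encodes a name with one charset and decodes the bytes with another,
    // which is what happens when the writer and the reader of a ZIP disagree.
    static String decodeAs(String name, Charset writtenWith, Charset readWith) {
        return new String(name.getBytes(writtenWith), readWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String name = "\u6d4b\u8bd5.txt"; // a Chinese file name, written with escapes

        // Same charset on both sides: the name survives.
        System.out.println(decodeAs(name, gbk, gbk));
        // Written as GBK (Windows), read as UTF-8 (Linux/macOS): garbled.
        System.out.println(decodeAs(name, gbk, StandardCharsets.UTF_8));
        // Written as UTF-8, read as GBK: also garbled, in a different way.
        System.out.println(decodeAs(name, StandardCharsets.UTF_8, gbk));
    }
}
```

Nothing in the byte sequence itself says which charset produced it, which is why the archive needs either a flag or an extra field to carry that information.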
The specification lists the following third-party extra field header IDs:
0x07c8 Macintosh
0x2605 ZipIt Macintosh
0x2705 ZipIt Macintosh 1.3.5+
0x2805 ZipIt Macintosh 1.3.5+
0x334d Info-ZIP Macintosh
0x4341 Acorn/SparkFS
0x4453 Windows NT security descriptor (binary ACL)
0x4704 VM/CMS
0x470f MVS
0x4b46 FWKCS MD5 (see below)
0x4c41 OS/2 access control list (text ACL)
0x4d49 Info-ZIP OpenVMS
0x4f4c Xceed original location extra field
0x5356 AOS/VS (ACL)
0x5455 extended timestamp
0x554e Xceed unicode extra field
0x5855 Info-ZIP UNIX (original, also OS/2, NT, etc)
0x6375 Info-ZIP Unicode Comment Extra Field
0x6542 BeOS/BeBox
0x7075 Info-ZIP Unicode Path Extra Field
0x756e ASi UNIX
0x7855 Info-ZIP UNIX (new)
0xa11e Data Stream Alignment (Apache Commons-Compress)
0xa220 Microsoft Open Packaging Growth Hint
0xfd4a SMS/QDOS
0x9901 AE-x encryption structure (see APPENDIX E)
0x9902 unknown
The specification describes the Info-ZIP Unicode Path Extra Field (0x7075) as follows:
Stores the UTF-8 version of the file name field as stored in the
local header and central directory header. (Last Revision 20070912)
        Value         Size        Description
        -----         ----        -----------
(UPath) 0x7075        Short       tag for this extra block type ("up")
        TSize         Short       total data size for this block
        Version       1 byte      version of this extra field, currently 1
        NameCRC32     4 bytes     File Name Field CRC32 Checksum
        UnicodeName   Variable    UTF-8 version of the entry File Name
Currently Version is set to the number 1. If there is a need
to change this field, the version will be incremented. Changes
MAY NOT be backward compatible so this extra field SHOULD NOT be
used if the version is not recognized.
The NameCRC32 is the standard zip CRC32 checksum of the File Name
field in the header. This is used to verify that the header
File Name field has not changed since the Unicode Path extra field
was created. This can happen if a utility renames the File Name but
does not update the UTF-8 path extra field. If the CRC check fails,
this UTF-8 Path Extra Field SHOULD be ignored and the File Name field
in the header SHOULD be used instead.
The UnicodeName is the UTF-8 version of the contents of the File Name
field in the header. As UnicodeName is defined to be UTF-8, no UTF-8
byte order mark (BOM) is used. The length of this field is determined
by subtracting the size of the previous fields from TSize. If both
the File Name and Comment fields are UTF-8, the new General Purpose
Bit Flag, bit 11 (Language encoding flag (EFS)), can be used to
indicate that both the header File Name and Comment fields are UTF-8
and, in this case, the Unicode Path and Unicode Comment extra fields
are not needed and SHOULD NOT be created. Note that, for backward
compatibility, bit 11 SHOULD only be used if the native character set
of the paths and comments being zipped up are already in UTF-8. It is
expected that the same file name storage method, either general
purpose bit 11 or extra fields, be used in both the Local and Central
Directory Header for a file.
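However, this field is optional and may be absent, and many archivers write neither this extra field nor the bit 11 flag. In that case the archive carries no indication of which encoding its file names use, and extraction produces garbled names.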
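To make the field layout quoted above concrete, here is a minimal sketch that parses a raw extra-field byte array and applies the version and CRC32 rules from the specification text. The class `UnicodePathField` and both methods are hypothetical names for this example; `build` exists only so the parser can be exercised without a real archive.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper, written against the field layout quoted above.
final class UnicodePathField {
    static final int UPATH_ID = 0x7075;

    // Returns the UTF-8 name stored in the extra data, or null when the field
    // is absent, has an unknown version, or its CRC does not match rawName.
    static String read(byte[] extra, byte[] rawName) {
        ByteBuffer buf = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 4) {
            int id = buf.getShort() & 0xFFFF;
            int size = buf.getShort() & 0xFFFF;
            if (size > buf.remaining()) {
                return null; // malformed block
            }
            if (id != UPATH_ID) {
                buf.position(buf.position() + size); // skip unrelated blocks
                continue;
            }
            if (size < 5 || buf.get() != 1) {
                return null; // too short, or a version we do not recognize
            }
            long storedCrc = buf.getInt() & 0xFFFFFFFFL;
            CRC32 crc = new CRC32();
            crc.update(rawName);
            if (crc.getValue() != storedCrc) {
                return null; // header name changed after the field was written
            }
            byte[] name = new byte[size - 5]; // TSize minus Version and NameCRC32
            buf.get(name);
            return new String(name, StandardCharsets.UTF_8);
        }
        return null;
    }

    // Builds one Unicode Path block; only used to test the parser.
    static byte[] build(byte[] rawName, String unicodeName) {
        byte[] utf8 = unicodeName.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(rawName);
        ByteBuffer buf = ByteBuffer.allocate(9 + utf8.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) UPATH_ID);
        buf.putShort((short) (5 + utf8.length));
        buf.put((byte) 1);
        buf.putInt((int) crc.getValue());
        buf.put(utf8);
        return buf.array();
    }
}
```

When `read` returns null, the caller is back to the situation this article is about: the header file name has to be decoded with a guessed charset.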
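2. Tool-based solution
Since the archive itself does not reveal the file name encoding, let us look at how a mature archiver, The Unarchiver, handles this. Its advanced settings include automatic detection of the file name encoding.
When The Unarchiver is used to extract an archive whose names were written as GBK on Windows, it automatically suggests a file name encoding.
This suggests that, when extracting automatically, The Unarchiver runs encoding detection on the file names in the archive and picks the most likely encoding, and asks the user to confirm when it cannot decide.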
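The "pick the most likely encoding" idea can be approximated with the JDK alone by decoding strictly and rejecting any charset that reports malformed input. This is a simplified sketch, not a description of how The Unarchiver works internally; `CandidateCharsets` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.List;

// Hypothetical helper illustrating strict-decode candidate selection.
final class CandidateCharsets {

    // True when the bytes decode under cs without malformed or unmappable input.
    static boolean decodesCleanly(byte[] data, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the first candidate that decodes cleanly, or null if none does.
    // Order matters: UTF-8 is strict, so it should be tried before GBK,
    // which accepts most byte pairs.
    static Charset firstPlausible(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            if (decodesCleanly(data, cs)) {
                return cs;
            }
        }
        return null;
    }
}
```

A null result corresponds to the case where a tool would fall back to asking the user.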
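3. Java extraction implementation
Following The Unarchiver's approach, the Java implementation also inspects the decoded result to judge whether the extracted file names would be garbled. Given the intended use case, it only checks for garbled Chinese and does not try to detect garbling in other languages (Russian, for example).
3.1 Adding the zip4j dependency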
<dependency>
    <groupId>net.lingala.zip4j</groupId>
    <artifactId>zip4j</artifactId>
    <version>2.9.1</version>
</dependency>
3.2 Extraction
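/**
 * Extracts a zip archive.
 *
 * @param srcFile  source archive
 * @param destPath directory to extract into
 */
public static void unzip(String srcFile, String destPath) throws ZipException {
    // First pass: decode the names as GBK and check whether they look valid
    ZipFile zipFile = new ZipFile(srcFile);
    zipFile.setCharset(Charset.forName("GBK"));
    String charset = recognizeCharset(zipFile.getFileHeaders());
    // Second pass: reopen with the recognized charset and extract
    zipFile = new ZipFile(srcFile);
    zipFile.setCharset(Charset.forName(charset));
    zipFile.extractAll(destPath);
}

/**
 * Recognizes the file name encoding.
 *
 * @param fileHeaders file headers whose names were decoded as GBK
 * @return the charset name; defaults to GBK, since most users are still on Windows
 */
public static String recognizeCharset(List<FileHeader> fileHeaders) {
    if (fileHeaders == null || fileHeaders.isEmpty()) {
        return "GBK";
    }
    // Accepts only whitespace, ASCII letters, digits, underscore, dot,
    // forward slash, and Chinese characters.
    // Extend the character class if other characters must be accepted.
    // Punctuation and symbols (\pP|\pS) are not accepted for now.
    String validNameRegex = "[\\sa-zA-Z_0-9./\u4e00-\u9fa5]+";
    for (FileHeader fileHeader : fileHeaders) {
        String fileName = fileHeader.getFileName();
        if (!fileName.matches(validNameRegex)) {
            // Decoding as GBK produced garbled text, so assume UTF-8
            return "UTF-8";
        }
    }
    return "GBK";
}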
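To see how the heuristic above behaves without any archive, the check can be run against names decoded with the right and the wrong charset. The sketch below reuses the same regular expression; the class `NameHeuristic` is a hypothetical wrapper. One thing worth trying is UTF-8 bytes decoded as GBK: such bytes may happen to form legitimate Chinese characters, in which case the check passes even though the name is wrong.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the regular expression used in recognizeCharset.
final class NameHeuristic {
    private static final String VALID_NAME_REGEX = "[\\sa-zA-Z_0-9./\u4e00-\u9fa5]+";

    // True when every character of the name is in the accepted set.
    static boolean looksValid(String name) {
        return name.matches(VALID_NAME_REGEX);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String name = "\u6d4b\u8bd5.txt";

        // GBK bytes decoded as GBK: the original name.
        String right = new String(name.getBytes(gbk), gbk);
        // UTF-8 bytes decoded as GBK: what recognizeCharset sees for a UTF-8 archive.
        String wrong = new String(name.getBytes(StandardCharsets.UTF_8), gbk);

        System.out.println(right + " -> " + looksValid(right));
        System.out.println(wrong + " -> " + looksValid(wrong));
    }
}
```

If the second line also prints `true` for a given name, the heuristic has misclassified that name, which is the weakness a statistical detector is meant to address.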
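4. Encoding detection via bytecode enhancement
The check above is simple and prone to misdetection. Open-source encoding detectors already exist, such as icu4j and juniversalchardet. Reading the zip4j source shows that zip4j calls readCentralDirectory to decode file names.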
private CentralDirectory readCentralDirectory(RandomAccessFile zip4jRaf, RawIO rawIO, Charset charset) throws IOException {
    ...
    for (int i = 0; i < centralDirEntryCount; i++) {
        ...
        if (fileNameLength > 0) {
            byte[] fileNameBuff = new byte[fileNameLength];
            zip4jRaf.readFully(fileNameBuff);
            // Decode the file name
            String fileName = decodeStringWithCharset(fileNameBuff, fileHeader.isFileNameUTF8Encoded(), charset);
            fileHeader.setFileName(fileName);
        } else {
            fileHeader.setFileName(null);
        }
        ...
        fileHeaders.add(fileHeader);
    }
    ...
    return centralDirectory;
}
Decoding is done by the decodeStringWithCharset method of the HeaderUtil class.
public static String decodeStringWithCharset(byte[] data, boolean isUtf8Encoded, Charset charset) {
    if (charset != null) {
        return new String(data, charset);
    }
    if (isUtf8Encoded) {
        return new String(data, InternalZipConstants.CHARSET_UTF_8);
    }
    try {
        return new String(data, ZIP_STANDARD_CHARSET_NAME);
    } catch (UnsupportedEncodingException e) {
        return new String(data);
    }
}
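The order of the checks in this method has a practical consequence: an explicitly set charset wins even over the UTF-8 flag. The following JDK-only sketch restates the same precedence so it can be tried in isolation. `DecodePrecedence` is a hypothetical class, not zip4j code, and the Cp437 fallback is taken from the decompiled output shown later in this article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical restatement of the precedence in decodeStringWithCharset.
final class DecodePrecedence {

    static String decode(byte[] data, boolean isUtf8Encoded, Charset explicit) {
        if (explicit != null) {
            // 1. A charset set by the caller is used unconditionally.
            return new String(data, explicit);
        }
        if (isUtf8Encoded) {
            // 2. Otherwise the UTF-8 flag (general purpose bit 11) is honored.
            return new String(data, StandardCharsets.UTF_8);
        }
        try {
            // 3. Otherwise fall back to the ZIP default code page.
            return new String(data, Charset.forName("Cp437"));
        } catch (UnsupportedCharsetException e) {
            return new String(data);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u6d4b\u8bd5.txt".getBytes(StandardCharsets.UTF_8);
        // Flag set, no explicit charset: decoded correctly.
        System.out.println(decode(utf8, true, null));
        // Flag set, but GBK forced by the caller: the flag is ignored.
        System.out.println(decode(utf8, true, Charset.forName("GBK")));
    }
}
```

So forcing a guessed charset on an archive that already flags its names as UTF-8 can introduce garbling that would not otherwise occur.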
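Nothing in this flow exposes the raw bytes of the file name, so a detection library cannot be applied directly. There are two ways to get at those bytes:
- Modify the zip4j source, rebuild the jar, and replace the local jar. This makes future upgrades of the jar inconvenient.
- Modify the bytecode of decodeStringWithCharset and inject logic that captures the byte array.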
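The second approach is clearly more elegant and easier to maintain. The bytecode enhancement can be done with ASM or Javassist. However, decodeStringWithCharset is called from inside zip4j, so it cannot be swapped for a class loaded by a custom class loader. And once a class has been loaded by a class loader, it can no longer be modified (short of using a Java agent). The enhanced class therefore has to be loaded by the system class loader before the original is. Javassist provides exactly this mechanism: it lets you specify which class loader loads the class. Both enhancement techniques are introduced below.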
4.1 Enhancing the method with ASM
- Add the ASM dependency
<!-- ASM -->
<dependency>
    <groupId>org.ow2.asm</groupId>
    <artifactId>asm-all</artifactId>
    <version>5.2</version>
</dependency>
- Enhance the method
/**
 * Builds the bytes of the enhanced class.
 *
 * @param className name of the class to enhance
 */
public static byte[] buildClassBytes(String className) throws IOException {
    ClassReader classReader = new ClassReader(className);
    ClassWriter classWriter = new ClassWriter(ClassWriter.COMPUTE_MAXS);
    HeaderUtilClassVisitor classVisitor = new HeaderUtilClassVisitor(classWriter);
    classReader.accept(classVisitor, ClassReader.SKIP_DEBUG);
    return classWriter.toByteArray();
}

/**
 * Class visitor for HeaderUtil that enhances decodeStringWithCharset.
 */
public static class HeaderUtilClassVisitor extends ClassVisitor {
    public HeaderUtilClassVisitor(ClassVisitor cv) {
        super(Opcodes.ASM5, cv);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc, String signature, String[] exceptions) {
        if (!name.equals("decodeStringWithCharset")) {
            return super.visitMethod(access, name, desc, signature, exceptions);
        }
        MethodVisitor mv = cv.visitMethod(access, name, desc, signature, exceptions);
        return new DecodeMethodVisitor(mv);
    }
}

/**
 * Enhancement logic: captures the file name byte array.
 */
public static class DecodeMethodVisitor extends MethodVisitor {
    public DecodeMethodVisitor(MethodVisitor mv) {
        super(Opcodes.ASM5, mv);
    }

    @Override
    public void visitCode() {
        // Start the method body before emitting the injected instructions
        super.visitCode();
        // In a static method, slot 0 is the first parameter; in an instance method, slot 0 is this
        mv.visitVarInsn(Opcodes.ALOAD, 0);
        mv.visitMethodInsn(Opcodes.INVOKESTATIC, "com/beidou/study/asm/zip4j/ByteReader", "read", "([B)V", false);
    }
}
/**
 * Collects the file name byte arrays read by zip4j, per thread.
 */
public class ByteReader {
    private static final Map<Thread, List<byte[]>> PARAM_MAP = new ConcurrentHashMap<>();

    /** Starts collecting for the current thread. */
    public static void put() {
        PARAM_MAP.putIfAbsent(Thread.currentThread(), new ArrayList<>());
    }

    /** Stops collecting and releases the current thread's list. */
    public static void remove() {
        PARAM_MAP.remove(Thread.currentThread());
    }

    public static List<byte[]> get() {
        return PARAM_MAP.get(Thread.currentThread());
    }

    /** Called from the injected bytecode; ignored unless put() was called on this thread. */
    public static void read(byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return;
        }
        List<byte[]> list = PARAM_MAP.get(Thread.currentThread());
        if (list == null) {
            return;
        }
        list.add(bytes);
    }
}
- Result of the enhancement
public static String decodeStringWithCharset(byte[] var0, boolean var1, Charset var2) {
    ByteReader.read(var0);
    if (var2 != null) {
        return new String(var0, var2);
    } else if (var1) {
        return new String(var0, InternalZipConstants.CHARSET_UTF_8);
    } else {
        try {
            return new String(var0, "Cp437");
        } catch (UnsupportedEncodingException var4) {
            return new String(var0);
        }
    }
}
As shown, the method now begins with the injected call that captures the byte array:
ByteReader.read(var0);
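The collector that the injected call writes to keys its state by thread. As a design alternative, the same per-thread capture can be expressed with a `ThreadLocal`, which avoids holding `Thread` objects as map keys. This is a hedged sketch under a hypothetical name, `ThreadLocalByteReader`; to be callable from injected bytecode it would have to be declared public in its own file, and the injected class and method names would have to match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical ThreadLocal-based variant of the per-thread byte collector.
final class ThreadLocalByteReader {
    private static final ThreadLocal<List<byte[]>> CAPTURED = new ThreadLocal<>();

    // Starts a capture session on the current thread.
    public static void begin() {
        CAPTURED.set(new ArrayList<>());
    }

    // Target of the injected call; a no-op outside a capture session.
    public static void read(byte[] bytes) {
        List<byte[]> list = CAPTURED.get();
        if (list == null || bytes == null || bytes.length == 0) {
            return;
        }
        list.add(bytes.clone()); // copy, so later changes to the buffer do not leak in
    }

    // Ends the session and returns what was captured (empty if none was started).
    public static List<byte[]> end() {
        List<byte[]> list = CAPTURED.get();
        CAPTURED.remove();
        return list == null ? Collections.<byte[]>emptyList() : list;
    }
}
```

Calling `end()` in a `finally` block keeps the list from lingering on pooled threads if header parsing throws.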
4.2 Enhancing the method with Javassist
- Add the dependency
<!-- Javassist -->
<dependency>
    <groupId>org.javassist</groupId>
    <artifactId>javassist</artifactId>
    <version>3.28.0-GA</version>
</dependency>
- Enhance the method
ClassPool cp = ClassPool.getDefault();
CtClass cc = cp.get("net.lingala.zip4j.headers.HeaderUtil");
CtMethod method = cc.getDeclaredMethod("decodeStringWithCharset");
method.insertBefore("com.beidou.study.asm.zip4j.ByteReader.read($1);");
- The result is similar to the ASM version
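4.3 Loading the enhanced class with a specified class loader via Javassist
Once a class has been loaded it is cached, and later accesses do not load it again. So it is enough to load the enhanced HeaderUtil class once with the system class loader at service startup, before zip4j first uses it; the original class will then never be loaded.
- Loading the ASM-enhanced class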
/**
 * Loads the HeaderUtil class (using the class loader of the current class).
 */
public static void loadClass() throws IOException, CannotCompileException {
    String className = "net.lingala.zip4j.headers.HeaderUtil";
    byte[] bytes = buildClassBytes(className);
    // Let Javassist wrap the generated class bytes and hand them to the current class's loader
    ClassPool cp = ClassPool.getDefault();
    CtClass cc = cp.makeClassIfNew(new ByteArrayInputStream(bytes));
    cc.toClass(Zip4jDemo.class.getClassLoader(), null);
}
- Loading the Javassist-enhanced class
/**
 * Modifies the class bytecode with Javassist and loads it (using the class loader of the current class).
 */
public static void loadClass() throws NotFoundException, CannotCompileException, ClassNotFoundException {
    ClassPool cp = ClassPool.getDefault();
    CtClass cc = cp.get("net.lingala.zip4j.headers.HeaderUtil");
    CtMethod method = cc.getDeclaredMethod("decodeStringWithCharset");
    method.insertBefore("com.beidou.study.asm.zip4j.ByteReader.read($1);");
    // Load with the current class's loader
    cc.toClass(Zip4jDemo.class.getClassLoader(), null);
}
4.4 Automatic encoding detection with juniversalchardet
- Add juniversalchardet (how it differs from icu4j is left for the reader to explore)
<!-- Encoding detection -->
<dependency>
    <groupId>com.github.albfernandez</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>2.4.0</version>
</dependency>
- Automatic detection
/**
 * Encoding detector based on juniversalchardet.
 */
public class EncodingDetector {
    /**
     * @return the detected charset name, or null if the input is empty or nothing was detected
     */
    public static String recognize(byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return null;
        }
        UniversalDetector detector = new UniversalDetector();
        detector.handleData(bytes);
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        return encoding;
    }
}
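Detection accuracy can be tuned at this point. In testing, GBK-encoded names were sometimes reported as GB18030. Because GB18030 is a superset of GBK, decoding with GB18030 still yields the correct names, but adjust the mapping yourself if you need a specific charset name.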
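A small adapter between the detector's string result and a `Charset` covers the two edge cases mentioned here: a null or unknown result, and GB18030 being reported for GBK data. The sketch below is JDK-only; `DetectedCharsets` and `toCharset` are hypothetical names, and the choice of fallback is left to the caller.

```java
import java.nio.charset.Charset;

// Hypothetical adapter from a detector's charset name to a usable Charset.
final class DetectedCharsets {

    // Returns the named charset, or the fallback when the name is null,
    // syntactically illegal, or not supported by this JVM.
    static Charset toCharset(String detected, Charset fallback) {
        if (detected == null) {
            return fallback;
        }
        try {
            return Charset.forName(detected);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] raw = "\u6d4b\u8bd5.txt".getBytes(gbk);
        // GBK bytes decoded with the charset a detector might report instead
        System.out.println(new String(raw, toCharset("GB18030", gbk)));
        // Nothing detected: the fallback is used
        System.out.println(new String(raw, toCharset(null, gbk)));
    }
}
```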
4.5 Detection demo
public class Zip4jUnzipDemo {
    public static void main(String[] args) throws Exception {
        // Load the enhanced HeaderUtil class
        HeaderUtilEnhance.loadClass();
        // Initialize the byte list bound to the current thread
        ByteReader.put();
        // Read the file headers once; the raw bytes of every file name are captured along the way
        String fileName = "/Users/ginger/Desktop/测试压缩.zip";
        ZipFile zipFile = new ZipFile(fileName);
        zipFile.getFileHeaders();
        // Fetch the captured file name bytes and release the current thread's list
        List<byte[]> bytes = ByteReader.get();
        ByteReader.remove();
        // Concatenate the bytes of all file names
        int len = 0;
        for (byte[] tmp : bytes) {
            len += tmp.length;
        }
        byte[] allBytes = new byte[len];
        len = 0;
        for (byte[] tmp : bytes) {
            System.arraycopy(tmp, 0, allBytes, len, tmp.length);
            len += tmp.length;
        }
        // Detect the encoding; fall back to GBK when nothing is detected
        String detected = EncodingDetector.recognize(allBytes);
        Charset charset = Charset.forName(detected == null ? "GBK" : detected);
        // Extract with the detected charset
        zipFile = new ZipFile(fileName);
        zipFile.setCharset(charset);
        zipFile.extractAll("/Users/ginger/Desktop");
    }
}
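For comparison, once a charset has been decided, the JDK's own `java.util.zip` classes also accept it directly, which can be handy for building test archives. The sketch below is not part of the zip4j flow above: it writes an in-memory archive with GBK names and reads the names back with the same charset. `JdkZipCharsetDemo` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical round-trip demo using only java.util.zip.
final class JdkZipCharsetDemo {

    // Creates an in-memory archive whose entry names are encoded with cs.
    static byte[] zipWithNames(Charset cs, String... names) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos, cs)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write('x'); // one byte of payload per entry
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Lists entry names, decoding them with cs.
    static List<String> listNames(byte[] zip, Charset cs) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            for (ZipEntry e = zis.getNextEntry(); e != null; e = zis.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] zip = zipWithNames(gbk, "\u6d4b\u8bd5.txt", "docs/a.txt");
        System.out.println(listNames(zip, gbk));
    }
}
```

An archive built this way gives a reproducible GBK input for exercising the detection and extraction code without needing a Windows machine.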