Garbled download filenames? Take a look at the RFCs!

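Do your downloaded files ever end up with a name like this? %E6%89%B9%E9%87%8F%E4%B8%8B%E8%BD%BD.zip

Strictly speaking this is not mojibake; it is just a URL-encoded (percent-encoded) name that was never decoded.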
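To confirm that the name is only percent-encoded, it can be decoded by hand. The following is a minimal sketch, assuming Java 10 or later (for the `URLDecoder.decode(String, Charset)` overload); the class name `DecodeDemo` is made up for illustration.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    // Decode a percent-encoded name, interpreting the decoded octets as UTF-8.
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The file name from the example above.
        System.out.println(decode("%E6%89%B9%E9%87%8F%E4%B8%8B%E8%BD%BD.zip"));
    }
}
```

On a UTF-8 terminal this should print 批量下载.zip, the name the user was supposed to see.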

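The problem

I was recently tuning a download endpoint, and during testing the downloaded file came out with its URL-encoded name. My first thought was: it is 2026, how is this still a problem? Something must be misconfigured.

I looked at the code, which read: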

response.setHeader("Content-Disposition",
                "attachment;filename=" + encodeFileName);

After some digging and a code change, the name finally came out right: 批量下载.zip.

The working code looks like this:

response.setHeader("Content-Disposition",
                "attachment;filename*=UTF-8''" + encodeFileName);

Why does this change fix it? And what is this syntax?

RFC 6266

This syntax follows the RFC specification, which defines the header format as follows:

     content-disposition = "Content-Disposition" ":"
                            disposition-type *( ";" disposition-parm )

     disposition-type    = "inline" | "attachment" | disp-ext-type
                         ; case-insensitive
     disp-ext-type       = token

     disposition-parm    = filename-parm | disp-ext-parm

     filename-parm       = "filename" "=" value
                         | "filename*" "=" ext-value

     disp-ext-parm       = token "=" value
                         | ext-token "=" ext-value

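Per the grammar, a disposition-parm is either a filename-parm or a disp-ext-parm, and a filename-parm takes one of two forms: "filename" "=" value or "filename*" "=" ext-value.

The only difference between the two parameters is that the filename value is not encoded, whereas the filename* value uses the encoding defined in RFC 5987 and can therefore carry characters outside the ISO-8859-1 character set.

For compatibility, a response sometimes carries both filename and filename*. In that case the specification recommends (SHOULD) that clients pick filename* and ignore filename: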

recipients SHOULD pick "filename*" and ignore "filename".
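Sending both parameters could look like the sketch below. It is one possible approach rather than the code from the original interface: the class and method names are hypothetical, and replacing every non-ASCII character with "_" in the fallback is an arbitrary choice made here for illustration. It assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ContentDispositionDemo {
    // Build a header value with an ASCII fallback plus the RFC 5987 form.
    static String attachment(String fileName) {
        // Fallback: anything outside printable ASCII, and the characters
        // '"' and '\', becomes '_' so the quoted-string stays valid.
        String fallback = fileName.replaceAll("[^\\x20-\\x7E]|[\"\\\\]", "_");
        // URLEncoder emits '+' for spaces; turn those into %20.
        String encoded = URLEncoder.encode(fileName, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return "attachment; filename=\"" + fallback + "\"; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) {
        // "\u6279\u91cf\u4e0b\u8f7d.zip" is the file name used in this article.
        System.out.println(attachment("\u6279\u91cf\u4e0b\u8f7d.zip"));
    }
}
```

The returned string would then be passed to `response.setHeader("Content-Disposition", ...)`. Clients that understand filename* show the real name; older ones fall back to the ASCII approximation.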

So now we know where the file name goes. But what does an ext-value look like?

RFC 5987

The relevant part of the grammar is:

     ext-value     = charset  "'" [ language ] "'" value-chars
                   ; like RFC 2231's <extended-initial-value>
                   ; (see [RFC2231], Section 7)

     charset       = "UTF-8" / "ISO-8859-1" / mime-charset
     
     value-chars   = *( pct-encoded / attr-char )

     pct-encoded   = "%" HEXDIG HEXDIG
                   ; see [RFC3986], Section 2.1

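The specification also fixes the character sets used for encoding: recipients must support both ISO-8859-1 and UTF-8, and producers must use one of the two.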

recipients implementing this specification MUST support the character sets "ISO-8859-1" [ISO-8859-1] and "UTF-8" [RFC3629].

Producers MUST use either the "UTF-8" ([RFC3629]) or the "ISO-8859-1" ([ISO-8859-1]) character set.

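So the UTF-8 right after filename* in the code comes from this definition.
The two single quotes that follow delimit the language field, which is optional and is left empty here. It refers to a natural (human) language, and we will not dig into it further.
value-chars is the encoded value itself.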
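To make the three parts concrete, here is a rough sketch of how a recipient might take an ext-value apart. The class name `ExtValueDemo` is hypothetical, the language tag is simply ignored, and the code assumes Java 10 or later. One detail to watch: `URLDecoder` treats "+" as a space, while in value-chars a "+" is a literal character, so it is protected before decoding.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;

final class ExtValueDemo {
    // Decode an ext-value of the form: charset "'" [ language ] "'" value-chars
    static String decode(String extValue) {
        String[] parts = extValue.split("'", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("not an ext-value: " + extValue);
        }
        Charset charset = Charset.forName(parts[0]);
        // parts[1] is the optional language tag; this sketch ignores it.
        // URLDecoder would turn a literal '+' into a space, so escape it first.
        String valueChars = parts[2].replace("+", "%2B");
        return URLDecoder.decode(valueChars, charset);
    }

    public static void main(String[] args) {
        System.out.println(decode("UTF-8''%E6%89%B9%E9%87%8F%E4%B8%8B%E8%BD%BD.zip"));
    }
}
```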

Why is ext-value laid out like this? See RFC 2231.

Which encoding scheme does ext-value use? See RFC 3986.

RFC 2231

That specification describes the layout: single quotes separate the character set, the language, and the actual value.

A single quote ("'") is used to delimit the character set and language information at the beginning of the parameter value. Percent signs ("%") are used as the encoding flag, which agrees with RFC 2047.

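The space-encoding problem

According to RFC 3986, each octet is encoded as a three-character triplet. For example, the binary octet 00100000, which is the space character in ASCII, becomes %20.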

For example, "%20" is the percent-encoding for the binary octet "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space character (SP).

If you are a Java programmer, you will notice that the URLEncoder utility class turns a space into a "+" instead.

Its Javadoc says so explicitly:

When encoding a String, the following rules apply:

The space character " " is converted into a plus sign "+".
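A quick way to observe this behavior is the sketch below (the class name `SpaceEncodingDemo` is made up; Java 10 or later is assumed). It also shows that a literal "+" in the input is escaped as %2B, so any "+" left in the output can only have come from a space.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SpaceEncodingDemo {
    // Encode with URLEncoder, which follows HTML form encoding rules.
    static String formEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The space becomes '+', the literal plus becomes %2B.
        System.out.println(formEncode("a b+c.zip"));
    }
}
```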

So what do we do?

Just patch it by hand:

URLEncoder.encode(filename, StandardCharsets.UTF_8).replaceAll("\\+", "%20");
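The replacement above fixes spaces, but `URLEncoder` also leaves "*" unescaped, and RFC 5987 excludes "*" from attr-char. If stricter output is wanted, one alternative is to percent-encode every UTF-8 octet that is not an attr-char. This is a sketch of that idea, not the approach used in the original code; the class name `Rfc5987Encoder` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Rfc5987Encoder {
    // Punctuation allowed by attr-char (RFC 5987, Section 3.2.1),
    // in addition to ASCII letters and digits.
    private static final String ATTR_CHAR_PUNCT = "!#$&+-.^_`|~";

    // Percent-encode every UTF-8 octet that is not an attr-char.
    static String encode(String value) {
        StringBuilder out = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            boolean alnum = (octet >= 'a' && octet <= 'z')
                    || (octet >= 'A' && octet <= 'Z')
                    || (octet >= '0' && octet <= '9');
            if (alnum || ATTR_CHAR_PUNCT.indexOf(octet) >= 0) {
                out.append((char) octet);
            } else {
                // One octet becomes the triplet "%" HEXDIG HEXDIG.
                out.append('%').append(String.format("%02X", octet));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-8''" + encode("a b*.zip"));
    }
}
```

With this encoder a space comes out as %20 directly, so no post-processing step is needed.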

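Summary

You don't know until you look: something as small as encoding a download file name pulls in this many specifications. Having written standards really does make large-scale collaboration far more efficient.

With these specifications in mind, garbled download names should no longer trip you up in future work.

References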

下载的附件名总乱码？你该去读一下 RFC 文档了！ (Attachment names always garbled on download? Time to read the RFCs!) - 郑晓龙 - 博客园

RFC 6266 - Use of the Content-Disposition Header Field in the Hypertext Transfer Protocol (HTTP)

RFC 5987 - Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters

RFC 2231 - MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations

RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax