A Few Things About Encoding and Decoding


I. Falling into the Pit

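Two days ago I hit a problem in a project: a message that was encrypted before sending came out garbled after being decrypted on the web side. The strange part was that it had worked at first. The mobile code had not changed, the backend had not changed, and the front end said nothing had changed either, yet this spooky bug would not go away. After stepping through with breakpoints, I found that the encryption/decryption logic on mobile and on the web was exactly the same. The real problem was in the encoding, or strictly speaking in how special characters such as / were escaped. So I dug into encoding in more depth.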

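The network standard RFC 1738 lays down a hard rule:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."

In other words, letters and digits, the listed special characters (the surrounding double quotes are not part of the set), and certain reserved characters are the only ones that can go into a URL without being encoded.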
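To make the quoted rule concrete, here is a minimal Java sketch that checks whether a single character falls into the "alphanumerics plus listed special characters" set. The class and method names are made up for illustration, and reserved characters (which are only allowed for their reserved purposes) are deliberately treated as unsafe.

```java
public class Rfc1738Check {
    // The special characters listed in the RFC 1738 rule quoted above.
    private static final String SPECIAL = "$-_.+!*'(),";

    /**
     * Returns true if c is an ASCII letter, an ASCII digit, or one of the
     * listed special characters. Reserved characters such as '/' or '='
     * are intentionally reported as not safe in this sketch.
     */
    static boolean isSafeUnencoded(char c) {
        boolean alphanumeric = (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z');
        return alphanumeric || SPECIAL.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        // Print the verdict for a few characters that commonly occur in ciphertext.
        for (char c : "aZ9+/= ".toCharArray()) {
            System.out.println("'" + c + "' -> " + isSafeUnencoded(c));
        }
    }
}
```

Characters such as / and = show up frequently in encoded ciphertext, which is why their treatment matters so much here.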

II. Filling the Pit

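After finding the bug, I started thinking about encoding the ciphertext. At first I used URLEncoder.encode("****","utf-8"), but the encoded ciphertext turned out to be identical to the ciphertext before encoding. I talked this over with a colleague, and we guessed that JavaScript was using a different encoding scheme. So I started looking into the encoding function used on the web side: escape.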

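The escape function

JavaScript escape() function

This function does not encode ASCII letters and digits, nor does it encode the following ASCII punctuation characters: * @ - _ + . / . All other characters are replaced with escape sequences.

A string encoded with escape() can be decoded with unescape().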

When I searched online for a Java version of the escape function, I came across this approach:


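  • Leave digits and upper- and lowercase letters unchanged.
  • For each non-Chinese special character (code value below 256), emit % followed by the character's code value in hexadecimal.
  • For each Chinese character, emit %u followed by the character's Unicode code value in hexadecimal. For example, "中" becomes %u4E2D.

This approach badly contradicts the definition in the documentation above, and it failed my tests.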

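I kept digging and found a Java Escape utility class that shows up all over the web. Following the first rule of laziness, I copied the code into the project to see whether it worked. Just as I was about to close the project and slack off, I noticed a problem. I'll keep you in suspense for a moment. First, here is how this method differs from the one above: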

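  • Leave digits, upper- and lowercase letters, and the special characters in the array ['-','_','.','!','~','*','/','(',')'] unchanged.
  • Convert spaces to +.
  • For each non-Chinese special character (code value below 128), emit % followed by the character's code value in hexadecimal, for example %A5.
  • For each Chinese character, emit %u followed by the character's Unicode code value in hexadecimal. For example, "中" becomes %u4E2D.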

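A quick look showed that this method does not match the definition: the definition lists only 7 special characters and says nothing about spaces or the + sign. So I ran a test, and sure enough it was another pit: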

escape(" ")  // result: "%20"

So I revised it again, this time following only the definition of the JavaScript escape() function, and ended up with the following code:

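// Punctuation that JavaScript's escape() leaves untouched, in addition to ASCII letters and digits.
private const val UNESCAPED_SYMBOLS = "*@-_+./"

private fun escape(src: String): String {
    val out = StringBuilder(src.length * 6)
    for (ch in src) {
        val code = ch.code
        when {
            // Compare against ASCII ranges explicitly: Character.isDigit / isLowerCase /
            // isUpperCase also return true for many non-ASCII characters.
            ch in 'A'..'Z' || ch in 'a'..'z' || ch in '0'..'9' || ch in UNESCAPED_SYMBOLS ->
                out.append(ch)
            // Code units below 256 become %XX (two uppercase hex digits).
            code < 256 ->
                out.append('%').append(code.toString(16).uppercase().padStart(2, '0'))
            // Everything else becomes %uXXXX (four uppercase hex digits),
            // so that unescape can always read exactly four digits back.
            else ->
                out.append("%u").append(code.toString(16).uppercase().padStart(4, '0'))
        }
    }
    return out.toString()
}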

Where there is encoding there must be a matching decoding method, and this one is simple too:

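// Parses s as hexadecimal, or returns null if it contains any non-hex character.
private fun hexOrNull(s: String): Int? =
    if (s.all { it in '0'..'9' || it in 'a'..'f' || it in 'A'..'F' }) s.toInt(16) else null

fun unescape(src: String): String {
    val out = StringBuilder(src.length)
    var i = 0
    while (i < src.length) {
        val ch = src[i]
        if (ch == '%') {
            // %uXXXX: four hex digits encode one UTF-16 code unit.
            if (i + 6 <= src.length && src[i + 1] == 'u') {
                val code = hexOrNull(src.substring(i + 2, i + 6))
                if (code != null) {
                    out.append(code.toChar())
                    i += 6
                    continue
                }
            }
            // %XX: two hex digits encode one code unit below 256.
            if (i + 3 <= src.length) {
                val code = hexOrNull(src.substring(i + 1, i + 3))
                if (code != null) {
                    out.append(code.toChar())
                    i += 3
                    continue
                }
            }
        }
        // Ordinary character, or a truncated/malformed escape: copy it through unchanged
        // instead of throwing an out-of-bounds or number-format exception.
        out.append(ch)
        i++
    }
    return out.toString()
}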

III. Reflection

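In theory I could have slacked off at this point, but one question kept nagging at me: why is there no unified method for encoding and decoding? Are Java and JavaScript encodings really inconsistent? So I opened the source of URLEncoder.encode: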

/**
 * Translates a string into {@code application/x-www-form-urlencoded}
 * format using a specific encoding scheme. This method uses the
 * supplied encoding scheme to obtain the bytes for unsafe
 * characters.
 * <p>
 * <em><strong>Note:</strong> The <a href=
 * "http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars">
 * World Wide Web Consortium Recommendation</a> states that
 * UTF-8 should be used. Not doing so may introduce
 * incompatibilities.</em>
 *
 * @param   s   {@code String} to be translated.
 * @param   enc   The name of a supported
 *    <a href="../lang/package-summary.html#charenc">character
 *    encoding</a>.
 * @return  the translated {@code String}.
 * @exception  UnsupportedEncodingException
 *             If the named encoding is not supported
 * @see URLDecoder#decode(java.lang.String, java.lang.String)
 * @since 1.4
 */
public static String encode(String s, String enc)
    throws UnsupportedEncodingException {

    boolean needToChange = false;
    StringBuffer out = new StringBuffer(s.length());
    Charset charset;
    CharArrayWriter charArrayWriter = new CharArrayWriter();

    if (enc == null)
        throw new NullPointerException("charsetName");

    try {
        charset = Charset.forName(enc);
    } catch (IllegalCharsetNameException e) {
        throw new UnsupportedEncodingException(enc);
    } catch (UnsupportedCharsetException e) {
        throw new UnsupportedEncodingException(enc);
    }

    for (int i = 0; i < s.length();) {
        int c = (int) s.charAt(i);
        //System.out.println("Examining character: " + c);
        if (dontNeedEncoding.get(c)) {
            if (c == ' ') {
                c = '+';
                needToChange = true;
            }
            //System.out.println("Storing: " + c);
            out.append((char)c);
            i++;
        } else {
            // convert to external encoding before hex conversion
            do {
                charArrayWriter.write(c);
                /*
                 * If this character represents the start of a Unicode
                 * surrogate pair, then pass in two characters. It's not
                 * clear what should be done if a bytes reserved in the
                 * surrogate pairs range occurs outside of a legal
                 * surrogate pair. For now, just treat it as if it were
                 * any other character.
                 */
                if (c >= 0xD800 && c <= 0xDBFF) {
                    /*
                      System.out.println(Integer.toHexString(c)
                      + " is high surrogate");
                    */
                    if ( (i+1) < s.length()) {
                        int d = (int) s.charAt(i+1);
                        /*
                          System.out.println("\tExamining "
                          + Integer.toHexString(d));
                        */
                        if (d >= 0xDC00 && d <= 0xDFFF) {
                            /*
                              System.out.println("\t"
                              + Integer.toHexString(d)
                              + " is low surrogate");
                            */
                            charArrayWriter.write(d);
                            i++;
                        }
                    }
                }
                i++;
            } while (i < s.length() && !dontNeedEncoding.get((c = (int) s.charAt(i))));

            charArrayWriter.flush();
            String str = new String(charArrayWriter.toCharArray());
            byte[] ba = str.getBytes(charset);
            for (int j = 0; j < ba.length; j++) {
                out.append('%');
                char ch = Character.forDigit((ba[j] >> 4) & 0xF, 16);
                // converting to use uppercase letter as part of
                // the hex value if ch is a letter.
                if (Character.isLetter(ch)) {
                    ch -= caseDiff;
                }
                out.append(ch);
                ch = Character.forDigit(ba[j] & 0xF, 16);
                if (Character.isLetter(ch)) {
                    ch -= caseDiff;
                }
                out.append(ch);
            }
            charArrayWriter.reset();
            needToChange = true;
        }
    }

    return (needToChange? out.toString() : s);
}

At this point I was puzzled, because the class comment is quite clear:

Utility class for HTML form encoding. This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format. For more information about HTML form encoding, consult the HTML specification (www.w3.org/TR/html4/).
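To see what "form encoding" means in practice, here is a small Java sketch that runs a few single characters through URLEncoder.encode. The class name and the formEncode helper are hypothetical wrappers added for illustration. The outputs in the comments follow from the source listed above: a space becomes +, while +, / and ~ are percent-encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormEncodingDemo {
    /** Thin wrapper around URLEncoder.encode that always uses UTF-8. */
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode(" "));       // +
        System.out.println(formEncode("+"));       // %2B
        System.out.println(formEncode("/"));       // %2F
        System.out.println(formEncode("~"));       // %7E
        System.out.println(formEncode("*"));       // *
        System.out.println(formEncode("\u4E2D"));  // %E4%B8%AD (the UTF-8 bytes of "中")
    }
}
```

Note that non-ASCII characters are emitted as UTF-8 bytes in %XX form, not in the %uXXXX form that escape() produces.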

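This method exists specifically for web-side encoding, so why would the web side still see garbled text? Moreover, the custom encoders found online are modeled on this method and handle spaces and + the same way (P.S. if they do the same thing, why not just use the official method?). So what is the real reason this method works only some of the time? Looking at the JavaScript escape() documentation again, I noticed this sentence: ECMAScript v3 deprecates this function; use decodeURI() and decodeURIComponent() instead.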

Could it be that escape and unescape are deprecated, and that the recommended decodeURI function solves the problem? Let's test it:

URLEncoder.encode("ȋàډ\µe‹‰•¨­©•g\\RRNN…ËÉÕÈÍéՇ\j\N…ÒÝâÙÓâ–\\浭鸞�끐悯kノ﹦RNI⁋䁌ἲ︧j⁋䀹ἦKNN…ÒÝâÙÓâÈÍéՇ\\v™¬vNN‹ÜƸÓ҆\®æçڑN”×ÈÈÎßÛ®­†\\‰Ùáäåì¯lkd“˜leiqqš’b–Çš›“`dii–Øœ–b–ËŸ[NN”×ÈÈÎßÛ³¯Î҇\\邵댓돎ꩻ丬NN•ØÓÒ­­†\\l‹‘ ¯Ålhfhe•š˜™geihd’’ijgmh’š›špœÄ’gXNN•ØÓÒ²¯Î҇\\脃ʂ锭NN•ØÓÒ¸½Ö҇\\Tbbc^]da]iYQhqldlkfWNN•ØÓÒ¸½ÖÒ¸ÇÕÎݒ\kgghpofadheeg^N•ØØæÜØÝ·­†\\‰Ùáäåì¯lkd“˜leiqqš’b–Çš›“`dii–Øœ–b–ËŸ[NN•çÕÕéè•\\t—†…fŸ©N…Ðц\\…ËÉÕÓ¢«ÉÕÁÀډŸ","utf-8")

// %C8%8B%C2%9D%C2%8F%C3%A0%C3%9A%C2%89%5C%C2%B5%C2%9De%C2%8B%C2%89%C2%95%C2%A8%C2%AD%C2%A9%C2%95g%5C%5CRRNN%C2%85%C3%8B%C3%89%C3%95%C3%88%C3%8D%C3%A9%C3%95%C2%87%5Cj%5CN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C2%96%5C%5C%E6%B5%AD%EF%A4%A0%3F%EB%81%90%E6%82%AFk%EF%BE%89%EF%B9%A6%EF%BC%B2NI%E2%81%8B%E4%81%8C%E1%BC%B2%EF%B8%A7%EF%BD%8A%E2%81%8B%E4%80%B9%E1%BC%A6%EF%BC%ABNN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C3%88%C3%8D%C3%A9%C3%95%C2%87%5C%5Cv%C2%99%C2%9D%C2%ACvNN%C2%8B%C3%9C%C3%86%C2%B8%C3%93%C3%92%C2%86%5C%C2%AE%C3%A6%C3%A7%C3%9A%C2%91N%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%AE%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%B3%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E9%82%B5%EE%B1%89%EB%8C%93%EB%8F%8E%EA%A9%BB%E4%B8%ACNN%C2%95%C3%98%C3%93%C3%92%C2%AD%C2%AD%C2%86%5C%5Cl%C2%8D%C2%8D%C2%8B%C2%91%C2%A0%C2%AF%C3%85%C2%9Dlhfhe%C2%95%C2%9A%C2%98%C2%99geihd%C2%92%C2%92ijgmh%C2%92%C2%9A%C2%9B%C2%9Ap%C2%9C%C3%84%C2%92gXNN%C2%95%C3%98%C3%93%C3%92%C2%B2%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E8%84%83%EE%B9%98%CA%82%E9%94%ADNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%87%5C%5CTbbc%5E%5Dda%5DiYQhqldlkfWNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%B8%C3%87%C3%95%C3%8E%C3%9D%C2%92%5Ckgghpofadheeg%5EN%C2%95%C3%98%C3%98%C3%A6%C3%9C%C3%98%C3%9D%C2%B7%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%95%C3%A7%C3%95%C3%95%C3%A9%C3%A8%C2%95%5C%5Ct%C2%97%C2%86%C2%85f%C2%9F%C2%A9N%C2%85%C3%90%C3%91%C2%86%5C%5C%C2%85%C3%8B%C3%89%C3%95%C3%93%C2%A2%C2%AB%C3%89%C3%95%C3%81%C3%80%C3%9A%C2%89%C2%9F
decodeURI("%C8%8B%C2%9D%C2%8F%C3%A0%C3%9A%C2%89%5C%C2%B5%C2%9De%C2%8B%C2%89%C2%95%C2%A8%C2%AD%C2%A9%C2%95g%5C%5CRRNN%C2%85%C3%8B%C3%89%C3%95%C3%88%C3%8D%C3%A9%C3%95%C2%87%5Cj%5CN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C2%96%5C%5C%E6%B5%AD%EF%A4%A0%3F%EB%81%90%E6%82%AFk%EF%BE%89%EF%B9%A6%EF%BC%B2NI%E2%81%8B%E4%81%8C%E1%BC%B2%EF%B8%A7%EF%BD%8A%E2%81%8B%E4%80%B9%E1%BC%A6%EF%BC%ABNN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C3%88%C3%8D%C3%A9%C3%95%C2%87%5C%5Cv%C2%99%C2%9D%C2%ACvNN%C2%8B%C3%9C%C3%86%C2%B8%C3%93%C3%92%C2%86%5C%C2%AE%C3%A6%C3%A7%C3%9A%C2%91N%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%AE%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%B3%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E9%82%B5%EE%B1%89%EB%8C%93%EB%8F%8E%EA%A9%BB%E4%B8%ACNN%C2%95%C3%98%C3%93%C3%92%C2%AD%C2%AD%C2%86%5C%5Cl%C2%8D%C2%8D%C2%8B%C2%91%C2%A0%C2%AF%C3%85%C2%9Dlhfhe%C2%95%C2%9A%C2%98%C2%99geihd%C2%92%C2%92ijgmh%C2%92%C2%9A%C2%9B%C2%9Ap%C2%9C%C3%84%C2%92gXNN%C2%95%C3%98%C3%93%C3%92%C2%B2%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E8%84%83%EE%B9%98%CA%82%E9%94%ADNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%87%5C%5CTbbc%5E%5Dda%5DiYQhqldlkfWNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%B8%C3%87%C3%95%C3%8E%C3%9D%C2%92%5Ckgghpofadheeg%5EN%C2%95%C3%98%C3%98%C3%A6%C3%9C%C3%98%C3%9D%C2%B7%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%95%C3%A7%C3%95%C3%95%C3%A9%C3%A8%C2%95%5C%5Ct%C2%97%C2%86%C2%85f%C2%9F%C2%A9N%C2%85%C3%90%C3%91%C2%86%5C%5C%C2%85%C3%8B%C3%89%C3%95%C3%93%C2%A2%C2%AB%C3%89%C3%95%C3%81%C3%80%C3%9A%C2%89%C2%9F")
// ȋàډ\µe‹‰•¨­©•g\\RRNN…ËÉÕÈÍéՇ\j\N…ÒÝâÙÓâ–\\浭鸞%3F끐悯kノ﹦RNI⁋䁌ἲ︧j⁋䀹ἦKNN…ÒÝâÙÓâÈÍéՇ\\v™¬vNN‹ÜƸÓ҆\®æçڑN”×ÈÈÎßÛ®­†\\‰Ùáäåì¯lkd“˜leiqqš’b–Çš›“`dii–Øœ–b–ËŸ[NN”×ÈÈÎßÛ³¯Î҇\\邵댓돎ꩻ丬NN•ØÓÒ­­†\\l‹‘ ¯Ålhfhe•š˜™geihd’’ijgmh’š›špœÄ’gXNN•ØÓÒ²¯Î҇\\脃ʂ锭NN•ØÓÒ¸½Ö҇\\Tbbc^]da]iYQhqldlkfWNN•ØÓÒ¸½ÖÒ¸ÇÕÎݒ\kgghpofadheeg^N•ØØæÜØÝ·­†\\‰Ùáäåì¯lkd“˜leiqqš’b–Çš›“`dii–Øœ–b–ËŸ[NN•çÕÕéè•\\t—†…fŸ©N…Ðц\\…ËÉÕÓ¢«ÉÕÁÀډŸ

The comparison shows no problem. Let's look at some other data:

URLEncoder.encode("{\"msg\":{\"CHATTYPE\":\"0\",\"chatType\":0,\"content\":\"测试信息@+~(*\$%……,;/“”)\",\"contentType\":\"TEXT\",\"isSend\":true,\"receiveId\":\"9527\",\"receiveName\":\"武当山\",\"sendId\":\"10086\",\"sendName\":\"张三丰\",\"sendTime\":\"2021-04-09 17:22:15\",\"sendTimeStamp\":1617960135052,\"sessionId\":\"10010\",\"status\":\"READ\"},\"cmd\":\"chat_ChatMsg\"}","utf-8")

// %7B%22msg%22%3A%7B%22CHATTYPE%22%3A%220%22%2C%22chatType%22%3A0%2C%22content%22%3A%22%E6%B5%8B%E8%AF%95%E4%BF%A1%E6%81%AF%40%2B%EF%BD%9E%EF%BC%88*%24%25%E2%80%A6%E2%80%A6%EF%BC%8C%EF%BC%9B%2F%E2%80%9C%E2%80%9D%EF%BC%89%22%2C%22contentType%22%3A%22TEXT%22%2C%22isSend%22%3Atrue%2C%22receiveId%22%3A%229527%22%2C%22receiveName%22%3A%22%E6%AD%A6%E5%BD%93%E5%B1%B1%22%2C%22sendId%22%3A%2210086%22%2C%22sendName%22%3A%22%E5%BC%A0%E4%B8%89%E4%B8%B0%22%2C%22sendTime%22%3A%222021-04-09+17%3A22%3A15%22%2C%22sendTimeStamp%22%3A1617960135052%2C%22sessionId%22%3A%2210010%22%2C%22status%22%3A%22READ%22%7D%2C%22cmd%22%3A%22chat_ChatMsg%22%7D
decodeURI("%7B%22msg%22%3A%7B%22CHATTYPE%22%3A%220%22%2C%22chatType%22%3A0%2C%22content%22%3A%22%E6%B5%8B%E8%AF%95%E4%BF%A1%E6%81%AF%40%2B%EF%BD%9E%EF%BC%88*%24%25%E2%80%A6%E2%80%A6%EF%BC%8C%EF%BC%9B%2F%E2%80%9C%E2%80%9D%EF%BC%89%22%2C%22contentType%22%3A%22TEXT%22%2C%22isSend%22%3Atrue%2C%22receiveId%22%3A%229527%22%2C%22receiveName%22%3A%22%E6%AD%A6%E5%BD%93%E5%B1%B1%22%2C%22sendId%22%3A%2210086%22%2C%22sendName%22%3A%22%E5%BC%A0%E4%B8%89%E4%B8%B0%22%2C%22sendTime%22%3A%222021-04-09+17%3A22%3A15%22%2C%22sendTimeStamp%22%3A1617960135052%2C%22sessionId%22%3A%2210010%22%2C%22status%22%3A%22READ%22%7D%2C%22cmd%22%3A%22chat_ChatMsg%22%7D")

// {"msg"%3A{"CHATTYPE"%3A"0"%2C"chatType"%3A0%2C"content"%3A"测试信息%40%2B~(*%24%……,;%2F“”)"%2C"contentType"%3A"TEXT"%2C"isSend"%3Atrue%2C"receiveId"%3A"9527"%2C"receiveName"%3A"武当山"%2C"sendId"%3A"10086"%2C"sendName"%3A"张三丰"%2C"sendTime"%3A"2021-04-09+17%3A22%3A15"%2C"sendTimeStamp"%3A1617960135052%2C"sessionId"%3A"10010"%2C"status"%3A"READ"}%2C"cmd"%3A"chat_ChatMsg"}

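The tests show that the garbled text was caused by improper use of encoding functions on the web side. However, I could not find the source of the JavaScript escape function's implementation, so the encoding/decoding methods in this article are not guaranteed to be 100% correct.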
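The second decodeURI result above still contains sequences such as %3A, %2C and %2F, and the space in the timestamp stays as +. One possible way to avoid this mismatch is sketched below. It assumes the web side decodes with decodeURIComponent() instead of decodeURI() or unescape(), which is not something the project above is confirmed to do. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class ComponentEncoding {
    /** Encodes s so that every space is %20 rather than '+'. */
    static String encodeForDecodeURIComponent(String s) {
        try {
            // URLEncoder turns a literal '+' into %2B, so any '+' left in its
            // output can only stand for a space and is safe to rewrite as %20.
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decodes a string produced by encodeForDecodeURIComponent on the Java side. */
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encodeForDecodeURIComponent("2021-04-09 17:22:15");
        System.out.println(encoded);          // 2021-04-09%2017%3A22%3A15
        System.out.println(decode(encoded));  // 2021-04-09 17:22:15
    }
}
```

The design choice here is to keep the JDK encoder and adjust only the one character where form encoding and plain percent-encoding disagree, rather than maintaining a hand-written escape/unescape pair on both sides.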