I. Falling into the Pit
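A couple of days ago I ran into a problem in a project: messages were encrypted before being sent, and after the web client decrypted them the result was garbled. The odd part was that everything had worked at first. The mobile client had not changed, the backend had not changed, and the front end said it had not changed either, yet this spooky bug would not go away. After stepping through with a debugger I found that the encryption/decryption logic on mobile and on the web was identical. The problem was in the encoding, strictly speaking in how special characters such as / were escaped. So I took a closer look at encoding.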
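The network standard RFC 1738 lays down a hard rule: "...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."
In other words, only letters and digits [0-9a-zA-Z], the special symbols $-_.+!*'(), (the double quotes are not part of the set), and certain reserved characters may appear in a URL directly without encoding.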
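To make the rule concrete, here is a minimal Java sketch that checks whether a single character falls into the "alphanumerics plus $-_.+!*'()," set quoted above. The class and method names (`Rfc1738`, `isUnencodedSafe`) are made up for illustration, and reserved characters such as / are deliberately treated as "not safe" here because they are only allowed unencoded when used for their reserved purpose.

```java
final class Rfc1738 {
    // Special characters RFC 1738 allows unencoded, in addition to alphanumerics.
    private static final String SAFE_SPECIALS = "$-_.+!*'(),";

    /** Returns true if c is an ASCII letter, an ASCII digit, or one of the safe specials. */
    static boolean isUnencodedSafe(char c) {
        boolean alnum = (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z');
        return alnum || SAFE_SPECIALS.indexOf(c) >= 0;
    }
}
```

With this check, `Rfc1738.isUnencodedSafe('$')` is true, while a space, `/`, or a Chinese character all return false and would need encoding (or, for `/`, a reserved role in the URL).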
II. Filling the Pit
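Once I had found the bug, I started thinking about encoding the ciphertext. My first attempt was URLEncoder.encode("****","utf-8"), but the encoded ciphertext came out identical to the ciphertext before encoding. I talked this over with a colleague, and our guess was that JavaScript was using a different encoding scheme. So I started exploring the encoding method used on the web side: escape.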
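The escape function
From the JavaScript escape() function reference:
This method does not encode ASCII letters and digits, nor the following ASCII punctuation marks: * @ - _ + . / . All other characters are replaced with escape sequences.
A string encoded with escape() can be decoded with unescape().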
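When I searched online for a Java version of escape, I came across this approach:
- Leave digits and upper- and lowercase letters untouched.
- For each non-Chinese special symbol (character code below 256), prefix % and append the character code in hexadecimal.
- For each Chinese character, prefix %u and append the character code in hexadecimal, e.g. "中" becomes %u4E2D.
This approach contradicts the definition quoted above, because it also escapes the seven punctuation marks that escape() is supposed to leave alone. It failed my tests.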
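Digging further, I found a Java Escape utility class that turns up a lot online. Following the first rule of laziness, I copied the code into the project and tested whether it worked. Just as I was about to close the project and slack off, I noticed a problem. Before the reveal, look at how this method differs from the one above:
- Leave digits, upper- and lowercase letters, and the special characters in the array ['-','_','.','!','~','*','/','(',')'] untouched.
- Convert spaces to +.
- For each non-Chinese special symbol (character code below 128), prefix % and append the character code in hexadecimal, e.g. ¥ becomes %A5.
- For each Chinese character, prefix %u and append the character code in hexadecimal, e.g. "中" becomes %u4E2D.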
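A quick look showed that this method does not match the definition either: the definition lists only seven special symbols and says nothing about converting spaces to +. A quick test in JavaScript confirmed that it was yet another pit:
escape(" ") // result: "%20"
So I revised the code once more, this time following only the definition of the JavaScript escape() function, and ended up with the following: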
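// The seven ASCII punctuation marks that JavaScript's escape() leaves untouched.
private val specialSymbols = setOf('*', '@', '-', '_', '+', '.', '/')

private fun escape(src: String): String {
    val out = StringBuilder(src.length * 6)
    for (ch in src) {
        val code = ch.code
        when {
            // ASCII letters, ASCII digits, and the exempt symbols pass through unchanged.
            // (Character.isDigit/isLowerCase/isUpperCase would also match non-ASCII characters.)
            ch in 'a'..'z' || ch in 'A'..'Z' || ch in '0'..'9' ||
                ch in specialSymbols -> out.append(ch)
            // Code units below 256 become %XX: two uppercase hex digits.
            code < 256 -> out.append('%')
                .append(code.toString(16).uppercase().padStart(2, '0'))
            // Everything else becomes %uXXXX: four uppercase hex digits,
            // zero-padded so that unescape can always read exactly four.
            else -> out.append("%u")
                .append(code.toString(16).uppercase().padStart(4, '0'))
        }
    }
    return out.toString()
}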
Where there is an encoder there has to be a matching decoder, and that one is simple too:
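fun unescape(src: String): String {
    val out = StringBuilder(src.length)
    var lastPos = 0
    while (lastPos < src.length) {
        val pos = src.indexOf('%', lastPos)
        if (pos == -1) {
            // No more escape sequences: copy the rest verbatim.
            out.append(src, lastPos, src.length)
            break
        }
        // Copy the literal text in front of the '%'.
        out.append(src, lastPos, pos)
        // %uXXXX carries four hex digits, %XX carries two.
        val isUnicode = pos + 1 < src.length && src[pos + 1] == 'u'
        val start = if (isUnicode) pos + 2 else pos + 1
        val end = start + if (isUnicode) 4 else 2
        val hex = if (end <= src.length) src.substring(start, end) else ""
        val isHex = hex.isNotEmpty() &&
            hex.all { it in '0'..'9' || it in 'a'..'f' || it in 'A'..'F' }
        if (isHex) {
            out.append(hex.toInt(16).toChar())
            lastPos = end
        } else {
            // Truncated or malformed sequence: keep the '%' as a literal
            // instead of throwing, then continue after it.
            out.append('%')
            lastPos = pos + 1
        }
    }
    return out.toString()
}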
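For projects that are not on Kotlin, the same pair can be sketched in plain Java. This is only an illustration written against the escape() description quoted earlier (letters, digits, and * @ - _ + . / pass through; other code units below 256 become %XX; the rest become %uXXXX). The class name `JsEscape` is made up, and the code has not been checked against every browser's behaviour.

```java
final class JsEscape {
    // Characters left untouched: ASCII letters, ASCII digits, and * @ - _ + . /
    private static final String UNESCAPED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./";

    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (UNESCAPED.indexOf(c) >= 0) {
                out.append(c);
            } else if (c < 256) {
                out.append(String.format("%%%02X", (int) c));   // %XX
            } else {
                out.append(String.format("%%u%04X", (int) c));  // %uXXXX
            }
        }
        return out.toString();
    }

    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%') {
                boolean unicode = i + 1 < s.length() && s.charAt(i + 1) == 'u';
                int start = unicode ? i + 2 : i + 1;
                int end = start + (unicode ? 4 : 2);
                if (end <= s.length() && isHex(s, start, end)) {
                    out.append((char) Integer.parseInt(s.substring(start, end), 16));
                    i = end;
                    continue;
                }
            }
            out.append(c); // ordinary character, or a '%' that starts no valid sequence
            i++;
        }
        return out.toString();
    }

    private static boolean isHex(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (Character.digit(s.charAt(i), 16) < 0) return false;
        }
        return true;
    }
}
```

With this sketch, `JsEscape.escape(" ")` gives `%20`, `JsEscape.escape("\u4E2D")` (the character "中") gives `%u4E2D`, and `unescape(escape(x))` returns `x` for any string, which is a quick way to sanity-check the pair.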
III. Reflection
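By rights this was the moment to start slacking off, but one question kept nagging at me: why is there no single, unified way to encode and decode? Do Java and JavaScript really encode things differently? So I opened the source of URLEncoder.encode: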
/**
 * Translates a string into {@code application/x-www-form-urlencoded}
 * format using a specific encoding scheme. This method uses the
 * supplied encoding scheme to obtain the bytes for unsafe
 * characters.
 * <p>
 * <em><strong>Note:</strong> The <a href=
 * "http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars">
 * World Wide Web Consortium Recommendation</a> states that
 * UTF-8 should be used. Not doing so may introduce
 * incompatibilities.</em>
 *
 * @param   s   {@code String} to be translated.
 * @param   enc   The name of a supported
 *    <a href="../lang/package-summary.html#charenc">character
 *    encoding</a>.
 * @return  the translated {@code String}.
 * @exception  UnsupportedEncodingException
 *             If the named encoding is not supported
 * @see URLDecoder#decode(java.lang.String, java.lang.String)
 * @since 1.4
 */
public static String encode(String s, String enc)
    throws UnsupportedEncodingException {
    boolean needToChange = false;
    StringBuffer out = new StringBuffer(s.length());
    Charset charset;
    CharArrayWriter charArrayWriter = new CharArrayWriter();
    if (enc == null)
        throw new NullPointerException("charsetName");
    try {
        charset = Charset.forName(enc);
    } catch (IllegalCharsetNameException e) {
        throw new UnsupportedEncodingException(enc);
    } catch (UnsupportedCharsetException e) {
        throw new UnsupportedEncodingException(enc);
    }
    for (int i = 0; i < s.length();) {
        int c = (int) s.charAt(i);
        //System.out.println("Examining character: " + c);
        if (dontNeedEncoding.get(c)) {
            if (c == ' ') {
                c = '+';
                needToChange = true;
            }
            //System.out.println("Storing: " + c);
            out.append((char)c);
            i++;
        } else {
            // convert to external encoding before hex conversion
            do {
                charArrayWriter.write(c);
                /*
                 * If this character represents the start of a Unicode
                 * surrogate pair, then pass in two characters. It's not
                 * clear what should be done if a bytes reserved in the
                 * surrogate pairs range occurs outside of a legal
                 * surrogate pair. For now, just treat it as if it were
                 * any other character.
                 */
                if (c >= 0xD800 && c <= 0xDBFF) {
                    /*
                      System.out.println(Integer.toHexString(c)
                      + " is high surrogate");
                    */
                    if ( (i+1) < s.length()) {
                        int d = (int) s.charAt(i+1);
                        /*
                          System.out.println("\tExamining "
                          + Integer.toHexString(d));
                        */
                        if (d >= 0xDC00 && d <= 0xDFFF) {
                            /*
                              System.out.println("\t"
                              + Integer.toHexString(d)
                              + " is low surrogate");
                            */
                            charArrayWriter.write(d);
                            i++;
                        }
                    }
                }
                i++;
            } while (i < s.length() && !dontNeedEncoding.get((c = (int) s.charAt(i))));
            charArrayWriter.flush();
            String str = new String(charArrayWriter.toCharArray());
            byte[] ba = str.getBytes(charset);
            for (int j = 0; j < ba.length; j++) {
                out.append('%');
                char ch = Character.forDigit((ba[j] >> 4) & 0xF, 16);
                // converting to use uppercase letter as part of
                // the hex value if ch is a letter.
                if (Character.isLetter(ch)) {
                    ch -= caseDiff;
                }
                out.append(ch);
                ch = Character.forDigit(ba[j] & 0xF, 16);
                if (Character.isLetter(ch)) {
                    ch -= caseDiff;
                }
                out.append(ch);
            }
            charArrayWriter.reset();
            needToChange = true;
        }
    }
    return (needToChange? out.toString() : s);
}
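A few calls make the behaviour of this source easier to see. The sketch below only uses the standard `URLEncoder.encode(String, String)` shown above; the wrapper class `UrlEncoderDemo` and its `enc` helper are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncoderDemo {
    /** Thin wrapper so the checked exception does not clutter the examples. */
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(enc("abc-_.*XYZ019")); // abc-_.*XYZ019 (nothing to encode)
        System.out.println(enc("a b"));           // a+b   (space becomes '+')
        System.out.println(enc("a+b"));           // a%2Bb (a literal '+' is escaped)
        System.out.println(enc("/"));             // %2F
        System.out.println(enc("\u4E2D"));        // %E4%B8%AD (UTF-8 bytes of "中", not %u4E2D)
    }
}
```

Two differences from escape() stand out: a space turns into + rather than %20, and non-ASCII characters are written as their UTF-8 bytes rather than as a %uXXXX sequence.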
This is where I started to get confused, because the class comment is perfectly clear:
Utility class for HTML form encoding. This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format. For more information about HTML form encoding, consult the HTML specification (www.w3.org/TR/html4/).
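This method exists specifically for encoding data for the web, so why does the web side still end up with garbled text? The custom encoders found online also follow this method and give space and + special treatment (P.S. if they handle them the same way, why not just use the official method?). So what is the real reason this method works sometimes and fails at other times?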
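Looking at the JavaScript escape() function reference again, I noticed one sentence: ECMAScript v3 deprecates this method; decodeURI() and decodeURIComponent() should be used instead.
Could it be that escape and unescape have been deprecated, and that the recommended decodeURI function solves the problem?
Let's test it: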
URLEncoder.encode("ȋàÚ\µe¨©g\\RRNN
ËÉÕÈÍéÕ\j\N
ÒÝâÙÓâ\\浭鸞�끐悯kノ﹦RNI⁋䁌ἲ︧j⁋䀹ἦKNN
ÒÝâÙÓâÈÍéÕ\\v¬vNNÜÆ¸ÓÒ\®æçÚN×ÈÈÎßÛ®\\Ùáäåì¯lkdleiqqbÇ`diiÃbË[NN×ÈÈÎßÛ³¯ÎÒ\\邵댓돎ꩻ丬NNØÓÒ\\l ¯ÅlhfhegeihdijgmhpÄgXNNØÓÒ²¯ÎÒ\\脃ʂ锭NNØÓÒ¸½ÖÒ\\Tbbc^]da]iYQhqldlkfWNNØÓÒ¸½ÖÒ¸ÇÕÎÝ\kgghpofadheeg^NØØæÜØÝ·\\Ùáäåì¯lkdleiqqbÇ`diiÃbË[NNçÕÕéè\\t
f©N
ÐÑ\\
ËÉÕÓ¢«ÉÕÁÀÚ","utf-8")
// %C8%8B%C2%9D%C2%8F%C3%A0%C3%9A%C2%89%5C%C2%B5%C2%9De%C2%8B%C2%89%C2%95%C2%A8%C2%AD%C2%A9%C2%95g%5C%5CRRNN%C2%85%C3%8B%C3%89%C3%95%C3%88%C3%8D%C3%A9%C3%95%C2%87%5Cj%5CN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C2%96%5C%5C%E6%B5%AD%EF%A4%A0%3F%EB%81%90%E6%82%AFk%EF%BE%89%EF%B9%A6%EF%BC%B2NI%E2%81%8B%E4%81%8C%E1%BC%B2%EF%B8%A7%EF%BD%8A%E2%81%8B%E4%80%B9%E1%BC%A6%EF%BC%ABNN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C3%88%C3%8D%C3%A9%C3%95%C2%87%5C%5Cv%C2%99%C2%9D%C2%ACvNN%C2%8B%C3%9C%C3%86%C2%B8%C3%93%C3%92%C2%86%5C%C2%AE%C3%A6%C3%A7%C3%9A%C2%91N%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%AE%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%B3%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E9%82%B5%EE%B1%89%EB%8C%93%EB%8F%8E%EA%A9%BB%E4%B8%ACNN%C2%95%C3%98%C3%93%C3%92%C2%AD%C2%AD%C2%86%5C%5Cl%C2%8D%C2%8D%C2%8B%C2%91%C2%A0%C2%AF%C3%85%C2%9Dlhfhe%C2%95%C2%9A%C2%98%C2%99geihd%C2%92%C2%92ijgmh%C2%92%C2%9A%C2%9B%C2%9Ap%C2%9C%C3%84%C2%92gXNN%C2%95%C3%98%C3%93%C3%92%C2%B2%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E8%84%83%EE%B9%98%CA%82%E9%94%ADNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%87%5C%5CTbbc%5E%5Dda%5DiYQhqldlkfWNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%B8%C3%87%C3%95%C3%8E%C3%9D%C2%92%5Ckgghpofadheeg%5EN%C2%95%C3%98%C3%98%C3%A6%C3%9C%C3%98%C3%9D%C2%B7%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%95%C3%A7%C3%95%C3%95%C3%A9%C3%A8%C2%95%5C%5Ct%C2%97%C2%86%C2%85f%C2%9F%C2%A9N%C2%85%C3%90%C3%91%C2%86%5C%5C%C2%85%C3%8B%C3%89%C3%95%C3%93%C2%A2%C2%AB%C3%89%C3%95%C3%81%C3%80%C3%9A%C2%89%C2%9F
decodeURI("%C8%8B%C2%9D%C2%8F%C3%A0%C3%9A%C2%89%5C%C2%B5%C2%9De%C2%8B%C2%89%C2%95%C2%A8%C2%AD%C2%A9%C2%95g%5C%5CRRNN%C2%85%C3%8B%C3%89%C3%95%C3%88%C3%8D%C3%A9%C3%95%C2%87%5Cj%5CN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C2%96%5C%5C%E6%B5%AD%EF%A4%A0%3F%EB%81%90%E6%82%AFk%EF%BE%89%EF%B9%A6%EF%BC%B2NI%E2%81%8B%E4%81%8C%E1%BC%B2%EF%B8%A7%EF%BD%8A%E2%81%8B%E4%80%B9%E1%BC%A6%EF%BC%ABNN%C2%85%C3%92%C3%9D%C3%A2%C3%99%C3%93%C3%A2%C3%88%C3%8D%C3%A9%C3%95%C2%87%5C%5Cv%C2%99%C2%9D%C2%ACvNN%C2%8B%C3%9C%C3%86%C2%B8%C3%93%C3%92%C2%86%5C%C2%AE%C3%A6%C3%A7%C3%9A%C2%91N%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%AE%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%94%C3%97%C3%88%C3%88%C3%8E%C3%9F%C3%9B%C2%B3%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E9%82%B5%EE%B1%89%EB%8C%93%EB%8F%8E%EA%A9%BB%E4%B8%ACNN%C2%95%C3%98%C3%93%C3%92%C2%AD%C2%AD%C2%86%5C%5Cl%C2%8D%C2%8D%C2%8B%C2%91%C2%A0%C2%AF%C3%85%C2%9Dlhfhe%C2%95%C2%9A%C2%98%C2%99geihd%C2%92%C2%92ijgmh%C2%92%C2%9A%C2%9B%C2%9Ap%C2%9C%C3%84%C2%92gXNN%C2%95%C3%98%C3%93%C3%92%C2%B2%C2%AF%C3%8E%C3%92%C2%87%5C%5C%E8%84%83%EE%B9%98%CA%82%E9%94%ADNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%87%5C%5CTbbc%5E%5Dda%5DiYQhqldlkfWNN%C2%95%C3%98%C3%93%C3%92%C2%B8%C2%BD%C3%96%C3%92%C2%B8%C3%87%C3%95%C3%8E%C3%9D%C2%92%5Ckgghpofadheeg%5EN%C2%95%C3%98%C3%98%C3%A6%C3%9C%C3%98%C3%9D%C2%B7%C2%AD%C2%86%5C%5C%C2%89%C3%99%C3%A1%C3%A4%C3%A5%C3%AC%C2%AFlkd%C2%93%C2%98leiqq%C2%9A%C2%92b%C2%96%C3%87%C2%9A%C2%9B%C2%93%60dii%C2%96%C3%83%C2%98%C2%9C%C2%96b%C2%96%C3%8B%C2%9F%5BNN%C2%95%C3%A7%C3%95%C3%95%C3%A9%C3%A8%C2%95%5C%5Ct%C2%97%C2%86%C2%85f%C2%9F%C2%A9N%C2%85%C3%90%C3%91%C2%86%5C%5C%C2%85%C3%8B%C3%89%C3%95%C3%93%C2%A2%C2%AB%C3%89%C3%95%C3%81%C3%80%C3%9A%C2%89%C2%9F")
// ȋàÚ\µe¨©g\\RRNN
ËÉÕÈÍéÕ\j\N
ÒÝâÙÓâ\\浭鸞%3F끐悯kノ﹦RNI⁋䁌ἲ︧j⁋䀹ἦKNN
ÒÝâÙÓâÈÍéÕ\\v¬vNNÜÆ¸ÓÒ\®æçÚN×ÈÈÎßÛ®\\Ùáäåì¯lkdleiqqbÇ`diiÃbË[NN×ÈÈÎßÛ³¯ÎÒ\\邵댓돎ꩻ丬NNØÓÒ\\l ¯ÅlhfhegeihdijgmhpÄgXNNØÓÒ²¯ÎÒ\\脃ʂ锭NNØÓÒ¸½ÖÒ\\Tbbc^]da]iYQhqldlkfWNNØÓÒ¸½ÖÒ¸ÇÕÎÝ\kgghpofadheeg^NØØæÜØÝ·\\Ùáäåì¯lkdleiqqbÇ`diiÃbË[NNçÕÕéè\\t
f©N
ÐÑ\\
ËÉÕÓ¢«ÉÕÁÀÚ
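The decoded text matches the input, apart from one %3F that decodeURI leaves in place because ? is a reserved character. Now let's look at some other data: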
URLEncoder.encode("{\"msg\":{\"CHATTYPE\":\"0\",\"chatType\":0,\"content\":\"测试信息@+~(*\$%……,;/“”)\",\"contentType\":\"TEXT\",\"isSend\":true,\"receiveId\":\"9527\",\"receiveName\":\"武当山\",\"sendId\":\"10086\",\"sendName\":\"张三丰\",\"sendTime\":\"2021-04-09 17:22:15\",\"sendTimeStamp\":1617960135052,\"sessionId\":\"10010\",\"status\":\"READ\"},\"cmd\":\"chat_ChatMsg\"}","utf-8")
// %7B%22msg%22%3A%7B%22CHATTYPE%22%3A%220%22%2C%22chatType%22%3A0%2C%22content%22%3A%22%E6%B5%8B%E8%AF%95%E4%BF%A1%E6%81%AF%40%2B%EF%BD%9E%EF%BC%88*%24%25%E2%80%A6%E2%80%A6%EF%BC%8C%EF%BC%9B%2F%E2%80%9C%E2%80%9D%EF%BC%89%22%2C%22contentType%22%3A%22TEXT%22%2C%22isSend%22%3Atrue%2C%22receiveId%22%3A%229527%22%2C%22receiveName%22%3A%22%E6%AD%A6%E5%BD%93%E5%B1%B1%22%2C%22sendId%22%3A%2210086%22%2C%22sendName%22%3A%22%E5%BC%A0%E4%B8%89%E4%B8%B0%22%2C%22sendTime%22%3A%222021-04-09+17%3A22%3A15%22%2C%22sendTimeStamp%22%3A1617960135052%2C%22sessionId%22%3A%2210010%22%2C%22status%22%3A%22READ%22%7D%2C%22cmd%22%3A%22chat_ChatMsg%22%7D
decodeURI("%7B%22msg%22%3A%7B%22CHATTYPE%22%3A%220%22%2C%22chatType%22%3A0%2C%22content%22%3A%22%E6%B5%8B%E8%AF%95%E4%BF%A1%E6%81%AF%40%2B%EF%BD%9E%EF%BC%88*%24%25%E2%80%A6%E2%80%A6%EF%BC%8C%EF%BC%9B%2F%E2%80%9C%E2%80%9D%EF%BC%89%22%2C%22contentType%22%3A%22TEXT%22%2C%22isSend%22%3Atrue%2C%22receiveId%22%3A%229527%22%2C%22receiveName%22%3A%22%E6%AD%A6%E5%BD%93%E5%B1%B1%22%2C%22sendId%22%3A%2210086%22%2C%22sendName%22%3A%22%E5%BC%A0%E4%B8%89%E4%B8%B0%22%2C%22sendTime%22%3A%222021-04-09+17%3A22%3A15%22%2C%22sendTimeStamp%22%3A1617960135052%2C%22sessionId%22%3A%2210010%22%2C%22status%22%3A%22READ%22%7D%2C%22cmd%22%3A%22chat_ChatMsg%22%7D")
// {"msg"%3A{"CHATTYPE"%3A"0"%2C"chatType"%3A0%2C"content"%3A"测试信息%40%2B~(*%24%……,;%2F“”)"%2C"contentType"%3A"TEXT"%2C"isSend"%3Atrue%2C"receiveId"%3A"9527"%2C"receiveName"%3A"武当山"%2C"sendId"%3A"10086"%2C"sendName"%3A"张三丰"%2C"sendTime"%3A"2021-04-09+17%3A22%3A15"%2C"sendTimeStamp"%3A1617960135052%2C"sessionId"%3A"10010"%2C"status"%3A"READ"}%2C"cmd"%3A"chat_ChatMsg"}
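The leftover %3A, %2C, %40, %2B, %24, and %2F in that result are not random: decodeURI refuses to decode escapes that stand for the reserved characters ; / ? : @ & = + $ , and #, and it never turns + back into a space. The Java sketch below imitates that rule so the effect can be reproduced off the browser. It is an approximation for illustration only: the name `decodeUriLike` is made up, and unlike the real decodeURI it does not throw on malformed input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class DecodeUriSketch {
    // Characters whose escapes decodeURI leaves exactly as written.
    private static final String RESERVED = ";/?:@&=+$,#";

    static String decodeUriLike(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // Collects the bytes of a multi-byte UTF-8 sequence until it is complete.
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean isEscape = c == '%' && i + 3 <= s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0;
            if (isEscape) {
                int b = Integer.parseInt(s.substring(i + 1, i + 3), 16);
                if (b >= 0x80) {
                    pending.write(b);                 // part of a multi-byte character
                } else {
                    flush(pending, out);
                    if (RESERVED.indexOf((char) b) >= 0) {
                        out.append(s, i, i + 3);      // keep e.g. "%2F" untouched
                    } else {
                        out.append((char) b);
                    }
                }
                i += 3;
            } else {
                flush(pending, out);
                out.append(c);                        // '+' stays '+', never a space
                i++;
            }
        }
        flush(pending, out);
        return out.toString();
    }

    private static void flush(ByteArrayOutputStream pending, StringBuilder out) {
        if (pending.size() > 0) {
            out.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
            pending.reset();
        }
    }
}
```

Feeding it `%7B%22sendTime%22%3A%222021-04-09+17%3A22%22%7D` yields `{"sendTime"%3A"2021-04-09+17%3A22"}`, the same pattern as the decodeURI output above: quotes and braces come back, while the colon escapes and the + survive.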
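These tests show that the garbled text came from the web side using a decoding method that does not match the encoding: in the second test, decodeURI leaves escapes such as %3A, %2C, and %2F undecoded and does not turn + back into a space. However, I was not able to find the source code of JavaScript's escape implementation, so the encode/decode methods in this article are not guaranteed to be 100% correct.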
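One possible way to line the two sides up without a hand-written escape is sketched below. It assumes the web side can switch to decodeURIComponent and encodeURIComponent, which is an assumption on my part and not something tested in this article. On the Java side the only adjustment is the space: URLEncoder writes it as +, which decodeURIComponent does not undo. The class and method names are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class ComponentCodec {
    /** Encodes so that JavaScript's decodeURIComponent restores the original text. */
    static String encodeForJs(String s) {
        try {
            // URLEncoder emits '+' for a space; decodeURIComponent expects %20.
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Decodes text produced by JavaScript's encodeURIComponent. */
    static String decodeFromJs(String s) {
        try {
            // URLDecoder would turn a literal '+' into a space, so protect it first.
            return URLDecoder.decode(s.replace("+", "%2B"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `ComponentCodec.encodeForJs("a b/c+d")` produces `a%20b%2Fc%2Bd`, and `decodeFromJs` maps that string back to `a b/c+d`.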