A String.replaceAll pitfall: bugs caused by \ and $ in the replacement string


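TL;DR: what this post is about

String.replaceAll(String regex, String replacement) is a common method for regex-based replacement. However, if the replacement contains a backslash or a dollar sign, the call may throw an exception.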

With a $ (here as the last character of the replacement), you get:

java.lang.IllegalArgumentException: Illegal group reference: group index is missing

With a backslash \ as the last character of the replacement, you get:

java.lang.IllegalArgumentException: character to be escaped is missing

Test code

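        final String ALERT_FORMAT = "hello @name I am a rap star";
        // OK: the backslash escapes the following '2', so this prints "hello Galaxy22 I am a rap star"
        System.out.println(ALERT_FORMAT.replaceAll("@name","Galaxy\\22"));
        // Throws IllegalArgumentException: character to be escaped is missing
        System.out.println(ALERT_FORMAT.replaceAll("@name","Galaxy\\"));
        // Throws IllegalArgumentException: Illegal group reference: group index is missing
        // (only reached if the previous call is removed)
        System.out.println(ALERT_FORMAT.replaceAll("@name","Galaxy$"));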
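Since the second call throws, the third one never gets to run when the three lines are executed together. Here is a minimal sketch that isolates each case. The class name `ReplaceAllCases` and the helper `attempt` are made up for illustration and are not part of the original code.

```java
class ReplaceAllCases {
    // Hypothetical helper: runs one replaceAll call and returns either
    // the result or the exception message, so each case can be seen on its own.
    static String attempt(String replacement) {
        try {
            return "hello @name I am a rap star".replaceAll("@name", replacement);
        } catch (IllegalArgumentException e) {
            return "IllegalArgumentException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(attempt("Galaxy\\22")); // backslash in the middle: no exception
        System.out.println(attempt("Galaxy\\"));   // trailing backslash
        System.out.println(attempt("Galaxy$"));    // trailing dollar sign
    }
}
```

The first call succeeds but silently drops the backslash, which is easy to miss because nothing is thrown.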

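Cause

String.replaceAll delegates to Matcher.replaceAll. The regex string is compiled with Pattern.compile, matcher() is called on the resulting Pattern to create a Matcher, and then the Matcher's replaceAll method is invoked.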

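The problem lies in the Matcher. While scanning the input, it appends the characters that are not part of any match to the result string, and it substitutes each match through appendReplacement(). More importantly, two characters have a special meaning in the replacement string during this step: $ starts a reference to a captured group, and the backslash escapes the character that follows it. So when you call String.replaceAll with a $ that is not followed by a group index, the group reference is incomplete. If a backslash is the last character, the Matcher still expects a character to escape (a backslash anywhere else does not throw, but it is consumed as an escape and disappears from the output). Both cases throw IllegalArgumentException.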
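To make the two special characters concrete, here is a small sketch of the replacement syntax doing what it was designed for. The class and method names are hypothetical, and the inputs are invented for illustration.

```java
class ReplacementSyntaxDemo {
    // $1 and $2 refer to the first and second capturing groups of the match.
    static String reorderDate(String input) {
        return input.replaceAll("(\\d+)-(\\d+)", "$2/$1");
    }

    // A backslash makes the next character literal, so \$ yields a plain dollar sign.
    static String literalDollar(String input) {
        return input.replaceAll("@p", "\\$5");
    }

    // A doubled backslash in the replacement yields one literal backslash.
    static String literalBackslash(String input) {
        return input.replaceAll("@d", "C:\\\\tmp");
    }

    public static void main(String[] args) {
        System.out.println(reorderDate("2024-05"));      // 05/2024
        System.out.println(literalDollar("price: @p"));  // price: $5
        System.out.println(literalBackslash("dir: @d")); // dir: C:\tmp
    }
}
```

The exceptions in this post are the flip side of the same syntax: the Matcher sees a $ or \ and expects something valid to follow.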

Solution

Pre-process the string to strip the dollar signs and backslashes, or use StrUtil from the Hutool utility library.

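 return StrUtil.isEmpty(content) ? StrUtil.EMPTY : content
            // drop characters outside the BMP (this covers most emoji)
            .replaceAll("[^\\u0000-\\uFFFF]", "")
            // drop dollar signs
            .replaceAll("[$]", "")
            // drop backslashes
            .replaceAll("[\\\\]","");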

My code also strips emoji; just delete that line if you don't need it. With Hutool, the substitution itself looks like this:

                content = StrUtil.replace(content, wildcard.getWildcard(), value);
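If dropping the characters is not acceptable, the JDK itself offers two ways to treat the replacement literally: `Matcher.quoteReplacement`, which escapes every \ and $, and `String.replace`, which does no regex processing at all. The sketch below is one possible alternative, not the approach used in my project; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;

class LiteralReplacement {
    // Keeps the regex for matching, but escapes \ and $ in the value
    // so replaceAll inserts it verbatim.
    static String fillWithQuote(String template, String placeholderRegex, String value) {
        return template.replaceAll(placeholderRegex, Matcher.quoteReplacement(value));
    }

    // String.replace treats both the target and the replacement as plain text.
    static String fillWithReplace(String template, String placeholder, String value) {
        return template.replace(placeholder, value);
    }

    public static void main(String[] args) {
        String template = "hello @name I am a rap star";
        System.out.println(fillWithQuote(template, "@name", "Galaxy$"));    // hello Galaxy$ I am a rap star
        System.out.println(fillWithReplace(template, "@name", "Galaxy\\")); // hello Galaxy\ I am a rap star
    }
}
```

Use `fillWithQuote` when the placeholder really needs to be a regex; if it is a fixed string such as `@name`, the plain `replace` variant is simpler.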

Source code

 /**
     * Replaces every subsequence of the input sequence that matches the
     * pattern with the given replacement string.
     *
     * <p> This method first resets this matcher.  It then scans the input
     * sequence looking for matches of the pattern.  Characters that are not
     * part of any match are appended directly to the result string; each match
     * is replaced in the result by the replacement string.  The replacement
     * string may contain references to captured subsequences as in the {@link
     * #appendReplacement appendReplacement} method.
     *
     * <p> Note that backslashes (<tt>\</tt>) and dollar signs (<tt>$</tt>) in
     * the replacement string may cause the results to be different than if it
     * were being treated as a literal replacement string. Dollar signs may be
     * treated as references to captured subsequences as described above, and
     * backslashes are used to escape literal characters in the replacement
     * string.
     *
     * <p> Given the regular expression <tt>a*b</tt>, the input
     * <tt>"aabfooaabfooabfoob"</tt>, and the replacement string
     * <tt>"-"</tt>, an invocation of this method on a matcher for that
     * expression would yield the string <tt>"-foo-foo-foo-"</tt>.
     *
     * <p> Invoking this method changes this matcher's state.  If the matcher
     * is to be used in further matching operations then it should first be
     * reset.  </p>
     *
     * @param  replacement
     *         The replacement string
     *
     * @return  The string constructed by replacing each matching subsequence
     *          by the replacement string, substituting captured subsequences
     *          as needed
     */
     
 public String replaceAll(String replacement) {
        reset();
        boolean result = find();
        if (result) {
            StringBuffer sb = new StringBuffer();
            do {
                appendReplacement(sb, replacement);
                result = find();
            } while (result);
            appendTail(sb);
            return sb.toString();
        }
        return text.toString();
    }
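The replaceAll source above hands the replacement string to appendReplacement on every match, and that is where \ and $ get interpreted. The following is a rough, simplified sketch of that parsing step. It is not the JDK code: it assumes single-digit group numbers only, ignores named groups such as `${name}`, and takes the captured groups as a plain array. The class name and method signature are hypothetical.

```java
class ReplacementParserSketch {
    // Expands a replacement string against the groups of one match.
    // groups[0] is the whole match, groups[1] the first capturing group, and so on.
    static String expand(String replacement, String[] groups) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < replacement.length()) {
            char c = replacement.charAt(i);
            if (c == '\\') {
                i++;
                if (i == replacement.length()) {
                    // Nothing left to escape: the trailing-backslash case.
                    throw new IllegalArgumentException("character to be escaped is missing");
                }
                out.append(replacement.charAt(i)); // escaped character is copied literally
                i++;
            } else if (c == '$') {
                i++;
                if (i == replacement.length()) {
                    // Nothing after the dollar sign: the trailing-$ case.
                    throw new IllegalArgumentException("Illegal group reference: group index is missing");
                }
                char d = replacement.charAt(i);
                if (d < '0' || d > '9') {
                    throw new IllegalArgumentException("Illegal group reference");
                }
                int group = d - '0'; // simplification: one digit only
                if (group >= groups.length) {
                    throw new IndexOutOfBoundsException("No group " + group);
                }
                out.append(groups[group]);
                i++;
            } else {
                out.append(c); // ordinary character
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] groups = {"@name"}; // the pattern "@name" has no capturing groups
        System.out.println(expand("Galaxy\\22", groups)); // Galaxy22
        System.out.println(expand("[$0]", groups));       // [@name]
    }
}
```

Read this way, both exceptions come from the same place: the parser reaches the end of the replacement string while it is still waiting for the character that must follow \ or $.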