Daily Problem: Longest Duplicate Substring (Hard)


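Problem description:

Given a string s, consider all of its duplicated substrings: contiguous substrings of s that occur 2 or more times in s. The occurrences may overlap.

Return any duplicated substring that has the longest possible length. If s has no duplicated substring, the answer is "".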

Examples:

Example 1:

Input: s = "banana"
Output: "ana"

Example 2:

Input: s = "abcd"
Output: ""

Constraints:

2 <= s.length <= 3 * 10^4
s consists of lowercase English letters

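Analysis:

  1. The task is to find the longest duplicated substring. If a duplicated substring of length L exists, then every shorter length also has one, because any substring of a duplicated substring is itself duplicated. So if L is the maximum such length, every length up to L has a duplicate and no length greater than L does. This monotonicity means the first step can be a binary search on the answer length.
  2. Once a candidate length is fixed, slide a window of that length across s and check whether the content of any window has been seen before.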

Code:

import java.util.HashSet;
import java.util.Set;

class Solution {
    public String longestDupSubstring(String s) {
        int len = s.length();
        // Binary search over the candidate length; a duplicate is at most len - 1 long.
        int start = 1;
        int end = len - 1;
        String res = "";
        while (start <= end) {
            int mid = start + (end - start) / 2;
            String subStr = check(mid, s, len);
            if (!subStr.isEmpty()) {
                // A duplicate of length mid exists, so try longer lengths.
                res = subStr;
                start = mid + 1;
            } else {
                end = mid - 1;
            }
        }
        return res;
    }

    // Returns some substring of length mid that occurs at least twice, or "" if none exists.
    private String check(int mid, String s, int strLen) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i <= strLen - mid; i++) {
            String substring = s.substring(i, i + mid);
            // add() returns false when the substring is already in the set.
            if (!set.add(substring)) {
                return substring;
            }
        }
        return "";
    }
}

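Bam, Memory Limit Exceeded. Too naive! The constraint 2 <= s.length <= 3 * 10^4 was a pretty obvious hint: every call to check can materialize up to n substrings of length mid in the HashSet.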
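To put a number on that, here is a rough back-of-the-envelope sketch. It assumes the worst case in which no window of the probed length repeats, so the set keeps every window, and it counts characters only, ignoring per-object overhead. The class and method names are made up for this illustration.

```java
class SubstringMemoryEstimate {
    // Characters held by the set when every window of length len is distinct:
    // (n - len + 1) substrings, each len characters long.
    static long charsStored(int n, int len) {
        return (long) (n - len + 1) * len;
    }

    public static void main(String[] args) {
        int n = 30_000;        // upper bound from the constraints
        int firstMid = 15_000; // the first binary-search probe for n = 30000
        System.out.println(charsStored(n, firstMid)); // 225015000
    }
}
```

Roughly 2.25 * 10^8 characters for a single probe is far more than a typical judge allows, which is why the substrings themselves should not be stored.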


Official solution: binary search + Rabin-Karp string encoding

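  • Step one is the same idea: binary search for the length L of the duplicated substring.
  • Step two uses Rabin-Karp string encoding to efficiently decide whether s contains a duplicated substring of length L. So what is Rabin-Karp string encoding?
    The core idea: use a hash to decide whether two substrings are equal (if two substrings have the same hash, treat them as duplicates; PS: let's not get hung up on hash collisions here), and compute the hash of the next sliding window in only O(1) time.
    So how does it work?
  1. First, encode every character of s to get an array arr. Since s contains only lowercase letters, we can use arr[i] = (int)s.charAt(i) - (int)'a' to map each letter to a number in 0-25. For example, the string "abcde" is encoded as the array [0,1,2,3,4].
  2. Treat a substring as a base-26 number; its base-10 value is its encoding. Suppose we need the encoding of substrings of length 3. The first substring "abc" is encoded as $h_0 = 0 \times 26^2 + 1 \times 26^1 + 2 \times 26^0$. In general, with base $a$ and character codes $c_i$:
$$h_0 = c_0 a^{L-1} + c_1 a^{L-2} + \dots + c_{L-1} a^0$$
  3. The encoding of the next window is that base-26 number shifted left by one digit, with the head dropped and the tail appended. For example, the second substring is "bcd", whose encoding is $h_1 = (h_0 - 0 \times 26^2) \times 26 + 3 \times 26^0$. In general:
$$h_1 = (h_0 \times a - c_0 \times a^L) + c_L$$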
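The rolling update can be checked on the "abcde" example with a few lines of Java. This is only a sketch of the arithmetic: there is no modulus yet, so it works only for short windows, and the names encode, roll, and highPow are chosen for this illustration.

```java
class RollingHashDemo {
    // Encode s[start, start + len) as a base-26 number, most significant digit first.
    static long encode(String s, int start, int len) {
        long h = 0;
        for (int i = start; i < start + len; i++) {
            h = h * 26 + (s.charAt(i) - 'a');
        }
        return h;
    }

    // Slide the window one position to the right: drop the leading character out,
    // append the trailing character in. highPow must be 26^(len - 1).
    static long roll(long h, char out, char in, long highPow) {
        return (h - (out - 'a') * highPow) * 26 + (in - 'a');
    }

    public static void main(String[] args) {
        String s = "abcde";
        long h0 = encode(s, 0, 3);             // "abc" -> 28
        long h1 = roll(h0, 'a', 'd', 26 * 26); // "bcd" -> 731
        // The rolled value matches a from-scratch encoding of "bcd".
        System.out.println(h0 + " " + h1 + " " + encode(s, 1, 3)); // 28 731 731
    }
}
```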

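This way the hash of the next substring is obtained in only O(1) time. We store each hash value in a HashSet; if the same hash value shows up again, a duplicated substring of length L exists.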
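Putting the two steps together might look like the sketch below. It is not the official code: the modulus 1_000_000_007, the class name RabinKarpSolution, and the helper findDuplicate are assumptions made for this example. Because a modulus makes collisions possible, this version keeps the start indices seen for each hash and confirms a match with String.regionMatches instead of trusting the hash alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RabinKarpSolution {
    private static final long BASE = 26;
    // Assumed modulus: keeps every intermediate product within the range of long.
    private static final long MOD = 1_000_000_007L;

    public String longestDupSubstring(String s) {
        int lo = 1, hi = s.length() - 1;
        int bestStart = -1, bestLen = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int start = findDuplicate(s, mid);
            if (start >= 0) {          // a duplicate of length mid exists: try longer
                bestStart = start;
                bestLen = mid;
                lo = mid + 1;
            } else {                   // none of length mid: try shorter
                hi = mid - 1;
            }
        }
        return bestStart < 0 ? "" : s.substring(bestStart, bestStart + bestLen);
    }

    // Returns the start index of a window of length len whose content already
    // appeared at an earlier start index, or -1 if there is no such window.
    private int findDuplicate(String s, int len) {
        int n = s.length();
        long highPow = 1;              // BASE^(len - 1) mod MOD
        for (int i = 1; i < len; i++) highPow = highPow * BASE % MOD;

        long h = 0;                    // hash of the first window
        for (int i = 0; i < len; i++) h = (h * BASE + (s.charAt(i) - 'a')) % MOD;

        // hash -> start indices of windows with that hash; only indices are stored.
        Map<Long, List<Integer>> seen = new HashMap<>();
        seen.computeIfAbsent(h, k -> new ArrayList<>()).add(0);

        for (int start = 1; start + len <= n; start++) {
            long out = s.charAt(start - 1) - 'a';
            long in = s.charAt(start + len - 1) - 'a';
            // Drop the leading character, shift by one digit, append the new one.
            h = ((h - out * highPow % MOD + MOD) % MOD * BASE + in) % MOD;
            List<Integer> candidates = seen.computeIfAbsent(h, k -> new ArrayList<>());
            for (int prev : candidates) {
                // Equal hashes are only a hint; compare the actual characters.
                if (s.regionMatches(prev, s, start, len)) return start;
            }
            candidates.add(start);
        }
        return -1;
    }
}
```

For "banana" this returns "ana", and for "abcd" it returns "". Storing indices rather than substrings keeps the memory per probe proportional to n instead of n * mid.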

An expert's solution:

class Solution {
    public String longestDupSubstring(String S) {
        char[] sc = S.toCharArray();
        
        // Check if there aren't any duplicate substrings.  There can 
        // only be no duplicates if the string does not have more than 
        // one occurrence of any character in the string.  Since the 
        // string only contains lowercase characters, the string 
        // length must be less than 26 characters, otherwise at least 
        // one character must be duplicated.
        int longestSubstringIdx = 0;
        int longestSubstringLen = 0;
        int[] found = new int[26];
        for (int i = sc.length - 1; i >= 0; i--) {
            if (found[sc[i] - 'a']++ > 0) {
                longestSubstringIdx = i;
                longestSubstringLen = 1;
                break;
            }
        }
        if (longestSubstringLen == 0)  return "";
        
        // Check for the same character over a large contiguous area.  
        // If we find a long repeat of the same character, then we can 
        // use this to set a minimum length for the longest duplicate 
        // substring, and therefore we don't have to check any shorter 
        // substrings.
        for (int i = sc.length - 1; i > 0; i--) {
            if (sc[i] == sc[i - 1]) {
                char c = sc[i];
                int startI = i;
                for (i = i - 2; i >= 0 && sc[i] == c; i--) { }
                i++;
                if (startI - i > longestSubstringLen) {
                    longestSubstringLen = startI - i;
                    longestSubstringIdx = i + 1;
                }
            }
        }
        if (longestSubstringLen == sc.length - 1)  return S.substring(0, longestSubstringLen);

        // Build a table of two-character combined values for the 
        // passed String.  These combined values are formed for any 
        // index into the String, by the character at the current 
        // index reduced to the range [0..25] times 26, plus the 
        // next character in the string reduced to the range [0..25].  
        // This combined value is used to index into the array 
        // twoCharHead[], which contains the index into the string of 
        // the first character pair with this combined value, which is 
        // also used to index into the array twoCharList[].  The 
        // twoCharList[] array is a "linked list" of String indexes 
        // that have the same combined values for a character pair.
        //
        // To look up all character pairs with the same combined 
        // value N, start at twoCharHead[N].  This will give the 
        // String index X of the first character pair with that 
        // combined value.  To find successive String indexes, lookup 
        // in twoCharList[X] to get the new String index X.  Then 
        // repeatedly lookup new X values in twoCharList[X], until 
        // X equals zero, which indicates the end of the character 
        // pairs with the same combined value.
        short[] twoCharHead = new short[26 * 26];
        short[] twoCharList = new short[sc.length + 1];
        for (int i = sc.length - longestSubstringLen - 1; i > 0; i--) {
            int twoCharNum = (sc[i] - 'a') * 26 + sc[i + 1] - 'a';
            twoCharList[i] = twoCharHead[twoCharNum];
            twoCharHead[twoCharNum] = (short)i;
        }
        
        // Search the String for matching substrings that are longer 
        // than the current longest substring found.  Start at the 
        // beginning of the string, and successively get a character 
        // pair's combined value.  Use that character pair's combined 
        // value to find all other character pairs with the same 
        // combined value.  In the process, remove any character pairs 
        // that occur in the String before the current character pair.  
        // When two character pairs could be the start of a matching 
        // substring longer than the longest match found so far, 
        // compare the substrings character by character.
        int curIdxLimit = sc.length - longestSubstringLen - 1;
        for (int i = 0; i <= curIdxLimit; i++) {
            int twoCharNum = (sc[i] - 'a') * 26 + sc[i + 1] - 'a';
            while (twoCharHead[twoCharNum] <= i && twoCharHead[twoCharNum] != 0)
                twoCharHead[twoCharNum] = twoCharList[twoCharHead[twoCharNum]];
            int compIdx = twoCharHead[twoCharNum];
            while (compIdx != 0 && compIdx <= curIdxLimit) {
                if (sc[i + longestSubstringLen] == sc[compIdx + longestSubstringLen] && 
                            sc[i + longestSubstringLen / 2] == sc[compIdx + longestSubstringLen / 2]) {
                    int lowIdx = i + 2;
                    int highIdx = compIdx + 2;
                    while (highIdx < sc.length && sc[lowIdx] == sc[highIdx]) {
                        lowIdx++;
                        highIdx++;
                    }
                    if (lowIdx - i > longestSubstringLen) {
                        longestSubstringLen = lowIdx - i;
                        longestSubstringIdx = i;
                        curIdxLimit = sc.length - longestSubstringLen - 1;
                    }
                }
                compIdx = twoCharList[compIdx];
            }
        }
        
        return S.substring(longestSubstringIdx, longestSubstringIdx + longestSubstringLen);
    }
}
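The twoCharHead[] / twoCharList[] pair described in the comments above is an array-backed linked list that buckets string indexes by their two-character value. A stripped-down sketch of just that structure is shown below. It deliberately differs from the solution above in two ways, both assumptions of this example: it uses -1 rather than 0 as the end-of-list marker (so index 0 can be stored too), and it uses int arrays instead of short. The names PairIndexDemo, head, and next are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PairIndexDemo {
    // Returns every index i where s[i], s[i + 1] equals the pair (a, b), in increasing order.
    static List<Integer> positionsOfPair(String s, char a, char b) {
        int[] head = new int[26 * 26];    // head[key] = first index with that pair, -1 if none
        int[] next = new int[s.length()]; // next[i] = next index with the same pair, -1 at the end
        Arrays.fill(head, -1);
        // Insert from right to left so that each chain ends up in increasing index order.
        for (int i = s.length() - 2; i >= 0; i--) {
            int key = (s.charAt(i) - 'a') * 26 + (s.charAt(i + 1) - 'a');
            next[i] = head[key];
            head[key] = i;
        }
        // Walk the chain for the requested pair.
        List<Integer> result = new ArrayList<>();
        int key = (a - 'a') * 26 + (b - 'a');
        for (int i = head[key]; i != -1; i = next[i]) {
            result.add(i);
        }
        return result;
    }
}
```

For "banana", the pair "an" yields the indexes [1, 3], which are exactly the candidate start positions the solution above would then try to extend into a longer match.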

Problem link