Finding a Substring While Ignoring Characters with Regular Expressions


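Given a string st and a substring sub, we need to find every way sub occurs in st when the characters in between may be skipped. In other words, find all index tuples s[0] < s[1] < ... < s[n] such that the characters st[s[0]], st[s[1]], ..., st[s[n]] match sub.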

For example, for the string abcoeubc and the substring abc, the answer should be [(0,1,2),(0,1,7),(0,6,7)].
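As a sanity check on this example, a brute-force sketch can try every increasing index tuple of length len(sub) and keep those whose characters spell out sub. The function name brute_force_matches is hypothetical and used only for illustration, and the approach is practical only for short strings.

```python
from itertools import combinations


def brute_force_matches(st, sub):
    """Return every increasing index tuple whose characters in st spell sub."""
    matches = []
    # combinations() yields index tuples in increasing (lexicographic) order.
    for indices in combinations(range(len(st)), len(sub)):
        if all(st[i] == ch for i, ch in zip(indices, sub)):
            matches.append(indices)
    return matches


print(brute_force_matches('abcoeubc', 'abc'))
# [(0, 1, 2), (0, 1, 7), (0, 6, 7)]
```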

2. Solution

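A regular expression is not a good fit for this problem: a regex search reports the positions where the pattern matches, but it does not enumerate every possible way the characters can be matched.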
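To illustrate the limitation, here is a small sketch using Python's re module on the example string. The pattern with capturing groups is an assumption chosen for this demonstration; other patterns behave similarly in that each scan yields a single way of matching.

```python
import re

st = 'abcoeubc'

# Capture each character of 'abc' so its position can be read back.
pattern = re.compile(r'(a).*?(b).*?(c)')

# finditer() only yields non-overlapping matches, scanning left to right.
found = [tuple(m.start(g) for g in (1, 2, 3)) for m in pattern.finditer(st)]
print(found)  # [(0, 1, 2)]

# A greedy variant also finds one match; it just stretches to the last 'c'.
greedy = re.search(r'a.*b.*c', st)
print(greedy.span())  # (0, 8)
```

Only one of the three expected index tuples is reported, which is why a direct enumeration is used instead.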

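The problem can be solved with the following approach:

  1. Store the index positions of every character of st in a dictionary, where the key is the character and the value is the list of all positions at which it occurs in st.
  2. Loop over each character of sub and look up its index positions in the dictionary.
  3. Combine these positions into index tuples, keep only the tuples whose indices are strictly increasing, and return them.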
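The three steps above can be sketched compactly on the example input. This is a minimal illustration rather than an efficient implementation: it builds every combination with itertools.product and filters afterwards, and the variable names candidates and matches are hypothetical.

```python
from collections import defaultdict
from itertools import product

st, sub = 'abcoeubc', 'abc'

# Step 1: map each character to the positions where it occurs.
char_indexes = defaultdict(list)
for i, char in enumerate(st):
    char_indexes[char].append(i)

# Step 2: look up the candidate positions for each character of sub.
candidates = [char_indexes.get(char, []) for char in sub]
print(candidates)  # [[0], [1, 6], [2, 7]]

# Step 3: keep only the combinations whose indices strictly increase.
matches = [combo for combo in product(*candidates)
           if all(a < b for a, b in zip(combo, combo[1:]))]
print(matches)  # [(0, 1, 2), (0, 1, 7), (0, 6, 7)]
```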
import bisect


def find_substrings(st, sub):
    """
    Find every way sub occurs in st when other characters may be skipped.

    Args:
        st: The string to search in.
        sub: The substring to search for.

    Returns:
        A list of tuples, where each tuple contains the strictly increasing
        indices of the characters in st that match sub.
    """
    # Map each character of st to the sorted list of positions where it occurs.
    char_indexes = {}
    for i, char in enumerate(st):
        char_indexes.setdefault(char, []).append(i)

    # If any character of sub never occurs in st, there can be no match.
    if any(char not in char_indexes for char in sub):
        return []

    index_combinations = []

    def extend(pos, current):
        # pos: position in sub of the next character to place.
        # current: indices chosen so far, in increasing order.
        if pos == len(sub):
            index_combinations.append(tuple(current))
            return
        candidates = char_indexes[sub[pos]]
        # Skip every candidate at or before the previously chosen index.
        start = bisect.bisect_right(candidates, current[-1]) if current else 0
        for index in candidates[start:]:
            current.append(index)
            extend(pos + 1, current)
            current.pop()

    extend(0, [])
    return index_combinations


# Example
st = 'abcoeubc'
sub = 'abc'
print(find_substrings(st, sub))
# [(0, 1, 2), (0, 1, 7), (0, 6, 7)]

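Building the index dictionary takes O(n) time, where m is the length of sub and n is the length of st. Listing every match takes time at least proportional to the number of matches, which can reach C(n, m) in the worst case (for example, when st and sub both consist of one repeated character), so the total cost depends on the output size and is not bounded by O(m * n * log n).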
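Since any enumeration needs time at least proportional to the number of matches, it can be useful to count the matches before listing them. The sketch below uses a standard dynamic-programming count in O(m * n) time; count_matches is a hypothetical helper name, not part of the solution above.

```python
from math import comb


def count_matches(st, sub):
    """Count the index tuples of st that spell sub, without listing them."""
    # ways[j] = number of ways to match the first j characters of sub so far.
    ways = [1] + [0] * len(sub)
    for char in st:
        # Go right to left so each character of st is used at most once per tuple.
        for j in range(len(sub), 0, -1):
            if sub[j - 1] == char:
                ways[j] += ways[j - 1]
    return ways[len(sub)]


print(count_matches('abcoeubc', 'abc'))  # 3
print(count_matches('a' * 20, 'a' * 10) == comb(20, 10))  # True
```

The second call shows how quickly the number of matches can grow: 20 repeated characters already admit comb(20, 10) matches of length 10.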