Splitting a Lumpy Sequence of Strings into Roughly Equal-Sized Segments


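The problem is as follows: we have a series of string fragments of unequal length and need to divide them into several segments of roughly equal size while keeping their order unchanged. To make this concrete, think of the collection as a text that has been split into sections of varying length. We need to group these sections into a number of readings that are roughly equal in size, without reordering them. For example, suppose we have the following data: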

section_words = [100, 100, 100, 100, 100, 100, 40000, 100, 100, 100, 100]

Here each number is the word count of one section. We want to divide this collection into 3 readings whose word counts are roughly equal.
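As a quick sanity check on this example, the ideal size of a reading can be computed directly. The snippet below is only an illustration and not part of the solution itself; the names `target` and `oversized` are chosen here for the example.

```python
section_words = [100, 100, 100, 100, 100, 100, 40000, 100, 100, 100, 100]
num_readings = 3

# Ideal number of words per reading if the text could be cut anywhere.
target = sum(section_words) / num_readings

# Indices of sections that exceed the target on their own.
oversized = [i for i, words in enumerate(section_words) if words > target]

print(f"total={sum(section_words)}, target={target:.2f}, oversized={oversized}")
# prints: total=41000, target=13666.67, oversized=[6]
```

Because section 6 alone is larger than the target, no split can be perfectly even. The goal is the least uneven split.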

2. Solution

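This kind of problem can be solved with dynamic programming. Dynamic programming breaks a problem into smaller overlapping subproblems, solves each subproblem once, and combines the stored results into a solution for the whole problem.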
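One way to make the subproblem structure concrete is the recurrence below. It is a sketch under assumed notation that does not appear in the original text: $B(r, j)$ denotes the minimum total badness of splitting the first $j$ sections into $r$ readings, and $c(k, j)$ denotes the badness of a single reading made of sections $k+1$ through $j$.

```latex
B(0, 0) = 0, \qquad
B(r, j) = \min_{r-1 \le k < j} \bigl[ B(r-1, k) + c(k, j) \bigr]
```

For $m$ sections and $n$ readings the answer is $B(n, m)$, and filling the table takes on the order of $n \cdot m^2$ evaluations of $c$.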

The steps are as follows:

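  1. First, enumerate the candidate partitions. With the number of readings fixed at $n$, the section sequence is split into $n$ contiguous readings, where the $i$-th reading contains $n_i$ sections. Clearly $\sum_{i=1}^{n} n_i = m$, where $m$ is the total number of sections.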
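  2. Next, compute the badness of each reading. Badness is the cube of the absolute difference between the reading's word count and the average word count per reading. For reading $i$, the badness is:

$$badness_i = \left| avg - \sum_{j \in \text{reading } i} words_j \right|^3$$

where $avg$ is the average number of words per reading (the total word count divided by $n$) and $words_j$ is the word count of the $j$-th section.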

  3. Then, compute the total badness of each candidate partition. For a partition into $n$ readings, the total badness is:

$$badness_n = \sum_{i=1}^{n} badness_i$$

  4. Finally, choose the partition with the smallest total badness. That partition is the one we are looking for.
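For small inputs, these steps can be followed literally by brute force. The sketch below is an illustration rather than the article's implementation: `brute_force` is a hypothetical helper name, and it assumes the target for every reading is the per-reading average with the cubed absolute difference as the badness. Its running time grows combinatorially, so it is only suitable for checking small cases.

```python
from itertools import combinations


def brute_force(section_words, num_readings):
    """Try every contiguous partition and return (total_badness, readings)."""
    m = len(section_words)
    avg = sum(section_words) / num_readings
    best = None
    # Choosing num_readings - 1 cut positions among the m - 1 gaps fixes a partition.
    for cuts in combinations(range(1, m), num_readings - 1):
        bounds = (0, *cuts, m)
        readings = [section_words[a:b] for a, b in zip(bounds, bounds[1:])]
        # Sum the cubed absolute deviations from the per-reading average.
        total = sum(abs(sum(reading) - avg) ** 3 for reading in readings)
        if best is None or total < best[0]:
            best = (total, readings)
    return best


total, readings = brute_force(
    [100, 100, 100, 100, 100, 100, 40000, 100, 100, 100, 100], 3
)
print([len(reading) for reading in readings])  # prints [6, 1, 4]
```

The printed list gives the number of sections in each reading: six short sections, the 40000-word section on its own, then the remaining four.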

Implementation

from itertools import accumulate


def solve(section_words, heuristic, num_readings):
    """
    Divide a lumpy sequence of sections into a given number of roughly
    equal-sized readings, keeping the sections in their original order.

    Parameters:
        section_words: A list of integers, the number of words in each section.
        heuristic: A function taking the number of words in a reading and the
            average number of words per reading, and returning a badness value.
        num_readings: The number of readings to divide the sequence into.

    Returns:
        A list of (first, last, words) tuples, one per reading, where first and
        last are inclusive section indices and words is the reading's word count.
    """
    m = len(section_words)
    if not 1 <= num_readings <= m:
        raise ValueError("num_readings must be between 1 and the number of sections")

    # prefix[i] is the number of words in sections 0..i-1.
    prefix = [0] + list(accumulate(section_words))

    # Average number of words per reading.
    avg_words = prefix[m] / num_readings

    # best[r][j] is the minimum total badness of splitting the first j sections
    # into r readings; cut[r][j] is the index where the last of them starts.
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(num_readings + 1)]
    cut = [[0] * (m + 1) for _ in range(num_readings + 1)]
    best[0][0] = 0

    for r in range(1, num_readings + 1):
        for j in range(r, m + 1):
            # The last reading covers sections k..j-1.
            for k in range(r - 1, j):
                cost = best[r - 1][k] + heuristic(prefix[j] - prefix[k], avg_words)
                if cost < best[r][j]:
                    best[r][j] = cost
                    cut[r][j] = k

    # Walk the cut table backwards to recover the readings.
    solution = []
    j = m
    for r in range(num_readings, 0, -1):
        k = cut[r][j]
        solution.append((k, j - 1, prefix[j] - prefix[k]))
        j = k
    solution.reverse()
    return solution

def print_solution(solution):
    """
    Print a solution to the problem.

    Parameters:
        solution: A list of (first, last, words) tuples, where first and last
            are inclusive section indices and words is the reading's word count.
    """

    total_words = sum(words for _, _, words in solution)
    print(f"Total={total_words}, {len(solution)} readings, "
          f"avg={total_words / len(solution):.2f}")

    for number, (first, last, words) in enumerate(solution, start=1):
        # Show a single index for one-section readings, else the first and last index.
        sections = [first] if first == last else [first, last]
        print(f"Reading #{number} ({words:5d} words): {sections}")

if __name__ == "__main__":
    section_words = [100, 100, 100, 100, 100, 100, 40000, 100, 100, 100, 100]

    def heuristic(num_words, avg):
        # Cube of the absolute deviation from the average reading size.
        return abs(num_words - avg) ** 3

    print_solution(solve(section_words, heuristic, 3))
    print()  # blank line between the two results
    print_solution(solve(section_words, heuristic, 5))

Output:

Total=41000, 3 readings, avg=13666.67
Reading #1 (  600 words): [0, 5]
Reading #2 (40000 words): [6]
Reading #3 (  400 words): [7, 10]

Total=41000, 5 readings, avg=8200.00
Reading #1 (  300 words): [0, 2]
Reading #2 (  300 words): [3, 5]
Reading #3 (40000 words): [6]
Reading #4 (  200 words): [7, 8]
Reading #5 (  200 words): [9, 10]
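The 3-reading result can be checked by recomputing its total badness and comparing it with a nearby alternative. This is a minimal sketch assuming the cubed-deviation badness used above; `total_badness` and the boundary lists are names introduced here for illustration only.

```python
section_words = [100, 100, 100, 100, 100, 100, 40000, 100, 100, 100, 100]
avg = sum(section_words) / 3


def total_badness(bounds):
    """Sum the cubed deviations for readings delimited by consecutive bounds."""
    return sum(
        abs(sum(section_words[a:b]) - avg) ** 3
        for a, b in zip(bounds, bounds[1:])
    )


reported = total_badness([0, 6, 7, 11])  # readings [0..5], [6], [7..10]
shifted = total_badness([0, 5, 7, 11])   # section 5 moved into the middle reading
print(reported < shifted)  # prints True
```

Moving any short section into the reading that holds the 40000-word section pushes that reading further above the average and the neighbouring reading further below it, so keeping the large section alone gives the lower total.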