Core Transforms in Apache Beam, with Examples

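Introduction

A transform in Apache Beam is an operation in your pipeline, and it provides a generic processing framework. You provide the processing logic in the form of a function object (colloquially called "user code"), and your user code is applied to each element of an input [PCollection](https://www.educative.io/answers/what-is-pcollection-in-apache-beam) (or more than one PCollection).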

Beam's core transforms

Beam provides the following core transforms, each of which represents a different processing paradigm:

ParDo
GroupByKey
CoGroupByKey
Combine
Flatten
Partition

ParDo

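ParDo is a Beam transform for generic parallel processing. A ParDo transform considers each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection.

A ParDo transform is useful for a variety of common data processing operations, including:

**Filtering a data set.** You can use ParDo to consider each element in a PCollection and either output that element to a new collection or discard it.
**Formatting or type-converting each element in a data set.** If your input PCollection contains elements that are of a different type or format than you want, you can use ParDo to perform a conversion on each element and output the result to a new PCollection.
**Extracting parts of each element in a data set.** If you have a PCollection of records with multiple fields, for example, you can use ParDo to parse out just the fields you want to consider and output them to a new PCollection.
**Performing computations on each element in a data set.** You can use ParDo to perform simple or complex computations on every element, or on certain elements, of a PCollection and output the results as a new PCollection.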
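The "zero, one, or multiple outputs per element" behaviour described above can be modelled without Beam at all. The following is a minimal plain-Java sketch, not the Beam API: the class `ParDoModel` and its helper methods are hypothetical names introduced only for illustration, and the code runs on an in-memory list rather than a distributed PCollection. It assumes Java 9 or later for `List.of`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Plain-Java model of ParDo semantics. This is NOT the Beam API.
final class ParDoModel {

    // Applies fn to every input element. fn may call the emitter zero, one,
    // or many times, which mirrors how a DoFn emits output elements.
    static <I, O> List<O> parDo(List<I> input, BiConsumer<I, Consumer<O>> fn) {
        List<O> output = new ArrayList<>();
        for (I element : input) {
            fn.accept(element, output::add);
        }
        return output;
    }

    // Filtering: emit the element or drop it (zero or one output).
    static List<String> nonEmpty(List<String> lines) {
        return parDo(lines, (line, out) -> {
            if (!line.isEmpty()) {
                out.accept(line);
            }
        });
    }

    // Type conversion: exactly one output per input.
    static List<Integer> lengths(List<String> lines) {
        return parDo(lines, (line, out) -> out.accept(line.length()));
    }

    // Extraction: one output per comma-separated field (many outputs).
    static List<String> fields(List<String> records) {
        return parDo(records, (record, out) -> {
            for (String field : record.split(",")) {
                out.accept(field.trim());
            }
        });
    }

    public static void main(String[] args) {
        List<String> lines = List.of("amy,111", "", "carl,444");
        System.out.println(nonEmpty(lines));
        System.out.println(lengths(lines));
        System.out.println(fields(nonEmpty(lines)));
    }
}
```

In a real pipeline, each lambda body would live in the `@ProcessElement` method of a `DoFn`, and the emitter would be the output receiver that Beam passes to that method.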

Applying ParDo

// Assumes the enclosing class declares a logger field named LOG (for example, an SLF4J Logger).
public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
    p.apply(Create.of("Hello", "World"))
        // Convert each element to upper case: one output per input.
        .apply(MapElements.via(new SimpleFunction<String, String>() {
            @Override
            public String apply(String input) {
                return input.toUpperCase();
            }
        }))
        // Log each element. This DoFn emits no output elements.
        .apply(ParDo.of(new DoFn<String, Void>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                LOG.info(c.element());
            }
        }));
    p.run();
}

GroupByKey

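GroupByKey is a Beam transform for processing collections of key/value pairs. It is a parallel reduction operation, analogous to the Shuffle phase of a Map/Shuffle/Reduce-style algorithm. The input to GroupByKey is a collection of key/value pairs that represents a multimap, where the collection contains multiple pairs that have the same key but different values. Given such a collection, you use GroupByKey to collect all of the values associated with each unique key.

Let's examine the mechanics of GroupByKey with a simple example, where our data set consists of words from a text file and the line numbers on which they appear. We want to group together all the line numbers (values) that share the same word (key), letting us see all the places in the text where one particular word appears.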
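To make the grouping concrete, here is a minimal plain-Java sketch of what GroupByKey does to such word/line-number pairs. It is a model of the semantics only, not Beam code: `GroupByKeyModel` is a hypothetical name, the sample pairs are invented for illustration, and it assumes Java 9 or later. Unlike this in-memory model, Beam does not guarantee any particular order for the grouped values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of GroupByKey semantics. This is NOT the Beam API.
final class GroupByKeyModel {

    // Collects every value that shares a key into one list per key.
    // A TreeMap is used only so the printed output is sorted by key.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Hypothetical (word, line number) pairs.
        List<Map.Entry<String, String>> wordToLine = List.of(
            Map.entry("cat", "1"), Map.entry("dog", "5"), Map.entry("and", "1"),
            Map.entry("jump", "3"), Map.entry("cat", "5"), Map.entry("dog", "2"),
            Map.entry("and", "2"), Map.entry("cat", "9"));
        groupByKey(wordToLine)
            .forEach((word, lines) -> System.out.println(word + " -> " + lines));
    }
}
```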

// The input PCollection.
PCollection<KV<String, String>> mapped = ...;

// Apply GroupByKey to the PCollection mapped.
// Save the result as the PCollection reduced.
PCollection<KV<String, Iterable<String>>> reduced =
    mapped.apply(GroupByKey.<String, String>create());

CoGroupByKey

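CoGroupByKey performs a relational join of two or more key/value PCollections that have the same key type.

Consider using CoGroupByKey if you have multiple data sets that provide information about related things. For example, suppose you have two different files with user data: one file has names and email addresses; the other file has names and phone numbers. You can join those two data sets, using the user name as a common key and the other data as the associated values. After the join, you have one data set that contains all of the information (email addresses and phone numbers) associated with each name.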
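The join described above can be sketched in plain Java before looking at the Beam API. This is only a model of the semantics under simplifying assumptions: `CoGroupByKeyModel` and `Joined` are hypothetical names, exactly two inputs are supported, and the sample data mirrors the email and phone lists used in this article. It assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a two-input CoGroupByKey. This is NOT the Beam API.
final class CoGroupByKeyModel {

    // Per-key join result: the values found for that key in each input.
    static final class Joined {
        final List<String> emails = new ArrayList<>();
        final List<String> phones = new ArrayList<>();
    }

    // Every key that appears in either input gets a result. A key that is
    // missing from one input simply has an empty list on that side.
    static Map<String, Joined> coGroupByKey(
            List<Map.Entry<String, String>> emails,
            List<Map.Entry<String, String>> phones) {
        Map<String, Joined> result = new TreeMap<>();
        for (Map.Entry<String, String> e : emails) {
            result.computeIfAbsent(e.getKey(), k -> new Joined()).emails.add(e.getValue());
        }
        for (Map.Entry<String, String> p : phones) {
            result.computeIfAbsent(p.getKey(), k -> new Joined()).phones.add(p.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emails = List.of(
            Map.entry("amy", "amy@example.com"), Map.entry("carl", "carl@example.com"),
            Map.entry("julia", "julia@example.com"), Map.entry("carl", "carl@email.com"));
        List<Map.Entry<String, String>> phones = List.of(
            Map.entry("amy", "111-222-3333"), Map.entry("james", "222-333-4444"),
            Map.entry("amy", "333-444-5555"), Map.entry("carl", "444-555-6666"));
        coGroupByKey(emails, phones).forEach((name, joined) ->
            System.out.println(name + "; " + joined.emails + "; " + joined.phones));
    }
}
```

The two fixed fields of `Joined` play the role that tagged lookups play in the Beam SDK, where each input collection is identified by its own tag instead of by a field name.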

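In the Beam SDK for Java, CoGroupByKey accepts a tuple of keyed PCollections (PCollection<KV<K, V>>) as input. For type safety, the SDK requires you to pass each PCollection as part of a KeyedPCollectionTuple. You must declare a TupleTag for each input PCollection in the KeyedPCollectionTuple that you want to pass to CoGroupByKey. As output, CoGroupByKey returns a PCollection<KV<K, CoGbkResult>>, which groups values from all the input PCollections by their common keys. Each key (all of type K) has a different CoGbkResult, which is a map from TupleTag<T> to Iterable<T>. You can access a specific collection in a CoGbkResult object by using the TupleTag that you supplied with the initial collection.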

final List<KV<String, String>> emailsList =
    Arrays.asList(
        KV.of("amy", "amy@example.com"),
        KV.of("carl", "carl@example.com"),
        KV.of("julia", "julia@example.com"),
        KV.of("carl", "carl@email.com"));

final List<KV<String, String>> phonesList =
    Arrays.asList(
        KV.of("amy", "111-222-3333"),
        KV.of("james", "222-333-4444"),
        KV.of("amy", "333-444-5555"),
        KV.of("carl", "444-555-6666"));

PCollection<KV<String, String>> emails = p.apply("CreateEmails", Create.of(emailsList));
PCollection<KV<String, String>> phones = p.apply("CreatePhones", Create.of(phonesList));

After CoGroupByKey, the resulting data contains all data associated with each unique key from any of the input collections.

final TupleTag<String> emailsTag = new TupleTag<>();
final TupleTag<String> phonesTag = new TupleTag<>();

final List<KV<String, CoGbkResult>> expectedResults =
    Arrays.asList(
        KV.of(
            "amy",
            CoGbkResult.of(emailsTag, Arrays.asList("amy@example.com"))
                .and(phonesTag, Arrays.asList("111-222-3333", "333-444-5555"))),
        KV.of(
            "carl",
            CoGbkResult.of(emailsTag, Arrays.asList("carl@email.com", "carl@example.com"))
                .and(phonesTag, Arrays.asList("444-555-6666"))),
        KV.of(
            "james",
            CoGbkResult.of(emailsTag, Arrays.asList())
                .and(phonesTag, Arrays.asList("222-333-4444"))),
        KV.of(
            "julia",
            CoGbkResult.of(emailsTag, Arrays.asList("julia@example.com"))
                .and(phonesTag, Arrays.asList())));

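The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. The code then uses the tags to look up and format the data from each collection.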

PCollection<KV<String, CoGbkResult>> results =
    KeyedPCollectionTuple.of(emailsTag, emails)
        .and(phonesTag, phones)
        .apply(CoGroupByKey.create());

PCollection<String> contactLines =
    results.apply(
        ParDo.of(
            new DoFn<KV<String, CoGbkResult>, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                KV<String, CoGbkResult> e = c.element();
                String name = e.getKey();
                Iterable<String> emailsIter = e.getValue().getAll(emailsTag);
                Iterable<String> phonesIter = e.getValue().getAll(phonesTag);
                String formattedResult =
                    Snippets.formatCoGbkResults(name, emailsIter, phonesIter);
                c.output(formattedResult);
              }
            }));

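Combine

Combine is a Beam transform for combining collections of elements or values in your data. Combine has variants that work on entire PCollections, and some that combine the values for each key in PCollections of key/value pairs.

When you apply a Combine transform, you must provide the function that contains the logic for combining the elements or values. The combining function should be commutative and associative, because the function is not necessarily invoked exactly once on all values with a given key. The Beam SDK also provides some pre-built combine functions for common numeric combination operations such as sum, min, and max.

Simple combine operations, such as sums, can usually be implemented as a simple function. More complex combination operations might require you to create a subclass of CombineFn that has an accumulation type distinct from the input/output type.

The associativity and commutativity of a CombineFn allow runners to automatically apply some optimizations:

Combiner lifting: This is the most significant optimization. Input elements are combined per key and window before they are shuffled, so the volume of data shuffled might be reduced by many orders of magnitude. Another term for this optimization is "mapper-side combine".
Incremental combining: When you have a CombineFn that reduces the data size by a lot, it is useful to combine elements as they emerge from a streaming shuffle. This spreads out the cost of doing combines over the time that your streaming computation might be idle. Incremental combining also reduces the storage of intermediate accumulators.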

public class AverageFn extends CombineFn<Integer, AverageFn.Accum, Double> {
  public static class Accum {
    int sum = 0;
    int count = 0;
  }

  @Override
  public Accum createAccumulator() { return new Accum(); }

  @Override
  public Accum addInput(Accum accum, Integer input) {
    accum.sum += input;
    accum.count++;
    return accum;
  }

  @Override
  public Accum mergeAccumulators(Iterable<Accum> accums) {
    Accum merged = createAccumulator();
    for (Accum accum : accums) {
      merged.sum += accum.sum;
      merged.count += accum.count;
    }
    return merged;
  }

  @Override
  public Double extractOutput(Accum accum) {
    return ((double) accum.sum) / accum.count;
  }
}
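To see why the accumulator type differs from the output type, and why merging accumulators makes combiner lifting safe, the same four steps can be run by hand outside Beam. The following plain-Java sketch mirrors the AverageFn above under stated assumptions: `AverageModel` is a hypothetical name, the accumulator is a two-element `long[]` instead of the `Accum` class, and the "bundles" are ordinary lists standing in for data held by different workers. It assumes Java 9 or later.

```java
import java.util.List;

// Plain-Java model of the AverageFn life cycle. This is NOT the Beam API.
final class AverageModel {

    // Accumulator layout: acc[0] = running sum, acc[1] = element count.
    static long[] createAccumulator() {
        return new long[] {0, 0};
    }

    static long[] addInput(long[] acc, int input) {
        acc[0] += input;
        acc[1]++;
        return acc;
    }

    static long[] mergeAccumulators(List<long[]> accs) {
        long[] merged = createAccumulator();
        for (long[] acc : accs) {
            merged[0] += acc[0];
            merged[1] += acc[1];
        }
        return merged;
    }

    static double extractOutput(long[] acc) {
        return ((double) acc[0]) / acc[1];
    }

    // Folds one bundle of inputs into a fresh accumulator.
    static long[] accumulate(List<Integer> bundle) {
        long[] acc = createAccumulator();
        for (int value : bundle) {
            addInput(acc, value);
        }
        return acc;
    }

    public static void main(String[] args) {
        // One pass over all of the data.
        double single = extractOutput(accumulate(List.of(1, 2, 3, 4, 5, 6)));
        // Two partial accumulators, merged afterwards (the "lifted" path).
        double lifted = extractOutput(mergeAccumulators(
            List.of(accumulate(List.of(1, 2)), accumulate(List.of(3, 4, 5, 6)))));
        System.out.println(single + " " + lifted);
    }
}
```

Both paths agree because the accumulator carries the sum and the count separately. Averaging the two partial averages directly (1.5 and 4.5) would give 3.0 instead of 3.5, which is why the intermediate type cannot simply be the output type.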

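Combining a PCollection into a single value

Use the global combine to transform all of the elements in a given PCollection into a single value, represented in your pipeline as a new PCollection containing one element. The following example code shows how to apply the Beam-provided sum combine function to produce a single sum value for a PCollection of integers.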

// The sum combine function combines the elements in the input PCollection. The resulting PCollection, called sum,
// contains one value: the sum of all the elements in the input PCollection.
// Recent Beam 2.x releases expose this function as Sum.ofIntegers(); older releases used new Sum.SumIntegerFn().
PCollection<Integer> pc = ...;
PCollection<Integer> sum = pc.apply(
    Combine.globally(Sum.ofIntegers()));

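Flatten

Flatten is a Beam transform for PCollection objects that store the same data type. Flatten merges multiple PCollection objects into a single logical PCollection.

The following example shows how to apply a Flatten transform to merge multiple PCollection objects.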

// Flatten takes a PCollectionList of PCollection objects of a given type.
// Returns a single PCollection that contains all of the elements in the PCollection objects in that list.
PCollection<String> pc1 = ...;
PCollection<String> pc2 = ...;
PCollection<String> pc3 = ...;
PCollectionList<String> collections = PCollectionList.of(pc1).and(pc2).and(pc3);

PCollection<String> merged = collections.apply(Flatten.<String>pCollections());

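Partition

Partition is a Beam transform for PCollection objects that store the same data type. Partition splits a single PCollection into a fixed number of smaller collections.

Partition divides the elements of a PCollection according to a partitioning function that you provide. The partitioning function contains the logic that determines how to split up the elements of the input PCollection into each resulting partition PCollection. The number of partitions must be determined at graph construction time. You can, for example, pass the number of partitions as a command-line option at runtime (which is then used to build your pipeline graph), but you cannot determine the number of partitions mid-pipeline (for instance, based on data calculated after your pipeline graph is constructed).

The following example divides a PCollection into percentile groups.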

// Provide an int value with the desired number of result partitions, and a PartitionFn that represents the
// partitioning function. In this example, we define the PartitionFn in-line. Returns a PCollectionList
// containing each of the resulting partitions as individual PCollection objects.
PCollection<Student> students = ...;
// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
        public int partitionFor(Student student, int numPartitions) {
            return student.getPercentile()  // 0..99
                 * numPartitions / 100;
        }}));

// You can extract each partition from the PCollectionList using the get method, as follows:
PCollection<Student> fortiethPercentile = studentsByPercentile.get(4);
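
The integer arithmetic in the partitioning function above can be checked in isolation. This is a small plain-Java sketch, not Beam code: `PartitionModel` is a hypothetical name, and the percentile is passed directly as an int instead of being read from a Student object.

```java
// Plain-Java check of the partitioning arithmetic. This is NOT the Beam API.
final class PartitionModel {

    // Maps a percentile in 0..99 to a partition index in 0..numPartitions-1.
    // Integer division truncates, so with 10 partitions the percentiles
    // 40..49 all land in partition 4.
    static int partitionFor(int percentile, int numPartitions) {
        return percentile * numPartitions / 100;
    }

    public static void main(String[] args) {
        int[] samples = {0, 9, 10, 45, 99};
        for (int percentile : samples) {
            System.out.println(percentile + " -> " + partitionFor(percentile, 10));
        }
    }
}
```

Because the largest input is 99, the result never reaches numPartitions, so every index returned is a valid argument to the get method of the resulting list.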

Summary

In this article, we walked through the core transforms in Apache Beam with suitable examples.