Pig Design Patterns (Part 2)

Original: Pig Design Patterns

License: CC BY-NC-SA 4.0

3. Data Profiling Patterns

Time spent looking at data is always time well spent.

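In the previous chapter, you studied various patterns for ingesting and egressing different types of data to and from the Hadoop ecosystem, so that you could begin the next logical step of the analytics process. In this chapter, we look at the most widely used design patterns related to data profiling. The chapter diagnoses, step by step, whether a dataset has problems, and ends by turning the dataset into usable information.

Data profiling is a necessary first step toward gaining insight into the data ingested by Hadoop, by understanding the content, context, structure, and condition of that data.

The data profiling design patterns described in this chapter collect important information about the attributes of data in the Hadoop cluster before you begin cleansing the data into a more useful form. This chapter covers the following topics:

Understanding the concepts and significance of data profiling in the Big Data context
The rationale for using Pig in data profiling
Understanding the data type inference design pattern
Understanding the basic statistical profiling design pattern
Understanding the pattern-matching design pattern
Understanding the string profiling design pattern
Understanding the unstructured text profiling design pattern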

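Data profiling for Big Data

Bad data hides in all the data that Hadoop ingests, and its impact grows with the staggering increase in the volume and variety of Big Data. Dealing with missing records, malformed values, and wrongly formatted files increases the amount of time wasted. What frustrates us is seeing the amount of data that we have but cannot use, data that we had on hand and then lost, and data that is not the same as it was yesterday. In a Big Data analytics project, it is common to receive a very large dataset with little information about where it came from, how it was collected, what the fields mean, and so on. In many cases, the data has passed through many hands and many transformations since it was collected, and nobody really knows what it all means.

Profiling is a measure of the quality of data and of its fitness for processing in later steps; it simply tells you what is wrong with the data. Data profiling begins as a short burst of analysis to determine suitability, understand the challenges, and make a go/no-go decision early in a data-intensive effort. Profiling gives you key insights, from a qualitative point of view, into what data Hadoop has ingested, and assesses the risks involved in integrating the data with other sources before any analytical insight is derived. Sometimes profiling is performed at several stages of the analytics process to weed out bad data and improve the analysis itself.

By helping us understand data from a business perspective rather than an analytical one, data profiling plays an important role in improving overall data quality, readability, and processability. Building a data profiling framework into a Big Data information management platform such as Hadoop ensures that data quality does not affect the outcome of critical business needs such as the reporting, analytics, and forecasting that decisions depend on.

Traditionally, data profiling is performed with an intended use in mind, where the purpose of the data is defined in advance. In Big Data projects, data may have to be used in unanticipated ways, and it must be profiled accordingly to address how the data can be repurposed. This is because most Big Data projects deal with exploratory analysis of ill-defined data, which tries to find out how the data can be used or repurposed. To do this, various quality measures such as completeness, consistency, coherence, correctness, timeliness, and reasonableness must be clearly specified.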

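In Big Data projects, metadata is collected to determine data quality. This metadata includes the following:

Data quality attributes
Business rules
Drafting
Cleansing routines
Data element profiles and measures

Measuring data quality in a Big Data environment takes the following factors into account:

The source of the data
The type of data
The intentional and unintentional use of the data
The user groups that will use the data and the resulting analysis artifacts

Of these factors, the type of data plays a vital role in data quality requirements, as the following points explain:

**Profiling structured Big Data:** In Big Data projects that deal with large volumes of structured data, enterprises can reuse the existing data quality processes built for relational databases, provided those processes can scale to the larger volumes.
**Profiling unstructured Big Data:** Big Data projects related to social media focus on quality issues that involve extracting entities from sentences expressed in nonstandard language made up of slang and abbreviations. The analytical value of social media can be extracted by correlating it with structured transactional data, so that relationships between events on social networks can be mapped to the organization's internal data, such as supply chain data or customer demographics. To perform this mapping, the unstructured text must be profiled to understand the following:
- How to extract the entities that matter for analytics
- How much of the data is misspelled?
- What are the common abbreviations specific to a field?
- What are the criteria for removing stopwords?
- How to perform stemming?
- How to understand the contextual meaning of a word based on the preceding words?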

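Big Data profiling dimensions

Big Data profiling is carried out across multiple dimensions, and the choice of dimensions typically depends on the analytical problem and the time/quality trade-off. Some dimensions overlap, and some do not apply to the problem at all. The following are the most important dimensions for measuring the quality of Big Data: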

Completeness: This dimension is a measure to know if you have all the data required to answer your queries. To evaluate if your dataset is complete, start by understanding the answers that you wish to seek from the data, and determine the fields needed and the percentage of complete records required to comfortably answer these questions. The following are a few of the ways completeness can be determined:
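- If the master data statistics (number of records, fields, and so on) are known in advance, completeness can be determined as the ratio of records received to the number of master data records.
- When master data statistics are not accessible, completeness is measured by the presence of null values in the following ways:
  - Attribute completeness: This deals with the presence of nulls in a specific attribute.
  - Tuple completeness: This deals with the number of unknown attribute values in a tuple.
  - Value completeness: This deals with complete elements or attributes that are missing in semi-structured XML data.
Because one aspect of Big Data analytics is asking questions that have not been asked before, checking the completeness dimension takes on a new meaning. In this case, consider performing an iteration that looks forward to find a range of questions you expect to answer and then reasons backward to work out what data is needed to answer them. A simple record-counting mechanism can check whether the total expected records are present in the Hadoop cluster. However, for data sizes spanning gigabytes, this activity can be onerous and has to be performed by applying statistical sampling. If the data is found to be incomplete, the missing records can be fixed, deleted, flagged, or ignored, depending on the analytical use case.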
Correctness: This dimension measures the accuracy of the data. To find out if the data is accurate, you have to know what comprises inaccurate data, and this purely depends on the business context. In cases where data should be unique, duplicate data is considered inaccurate. Calculating the number of duplicate elements in data that is spread across multiple systems is a nontrivial job. The following techniques can be used to find out the measure of potential inaccuracy of data:
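- In a dataset that contains discrete values, a frequency distribution can give valuable insight into potential inaccuracy in the data. Values with a relatively low frequency are possibly incorrect.
- For strings, you can build a distribution of string lengths and flag low-frequency lengths as suspect. Similarly, strings of atypical length can be flagged as incorrect.
- For continuous attributes, descriptive statistics such as the maximum and minimum can be used to flag data as inaccurate.
For Big Data projects, it is advisable to determine which subset of attributes needs to be accurate, know how much of the data should be accurate, and sample the data to determine accuracy.

Note

Note that in the classic needle-in-a-haystack Big Data problem, a great deal of analytical value is hidden in inaccurate data that can be regarded as outliers. Such outliers are not treated as inaccurate; instead, they can be flagged and considered for further analysis in use cases such as fraud detection.
Coherence: This dimension measures whether the data makes sense relative to itself, and determines whether records relate to each other in ways that are consistent and follow the internal logic of the dataset. The coherence of a dataset can be understood through the following measures:
- Referential integrity: This ensures that relationships between tables are consistent. In a Big Data environment, referential integrity cannot be applied to data stored in NoSQL databases such as HBase, because there is no relational representation of the data.
- Value integrity: This ensures that the values in a table are consistent with themselves. Inconsistent data can be found by comparing the values against a predefined set of possible values (from master data).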
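
The null-based completeness measures described above can be sketched in plain Java. This is a minimal illustration rather than code from the book: the class `CompletenessSketch` and its method names are hypothetical, records are assumed to be string arrays, and `null` is assumed to mark a missing value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper that illustrates null-based completeness measures. */
final class CompletenessSketch {

    /** Attribute completeness: fraction of records whose value in one column is non-null. */
    static double attributeCompleteness(List<String[]> rows, int column) {
        if (rows.isEmpty()) {
            return 1.0; // nothing is missing from an empty dataset
        }
        long present = rows.stream().filter(r -> r[column] != null).count();
        return (double) present / rows.size();
    }

    /** Tuple completeness: fraction of non-null attribute values within one record. */
    static double tupleCompleteness(String[] row) {
        if (row.length == 0) {
            return 1.0;
        }
        long present = Arrays.stream(row).filter(Objects::nonNull).count();
        return (double) present / row.length;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"1", "US", "99"},
            new String[] {"2", null, "15"},
            new String[] {"3", "US", null},
            new String[] {"4", null, null});
        // Completeness of the second column across all records
        System.out.println(attributeCompleteness(rows, 1));
        // Completeness of the last record across its attributes
        System.out.println(tupleCompleteness(rows.get(3)));
    }
}
```

On a real cluster the same ratios would normally be computed in Pig over a sample; the sketch only makes the definitions concrete.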

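Sampling considerations for profiling Big Data

Sampling Big Data is a way to understand the quality of data by analyzing only a part of the population rather than going through all of it. One of the most important criteria for selecting a sample is its representativeness, which determines how similar the sampled subset is to the population. Representativeness should be high to obtain accurate results. The sample size also has a considerable effect on how accurately the subset represents the population.

Sampling large volumes of data for profiling balances the trade-off between cost and quality, because profiling the entire population is costly and complex. In most cases, profiling is not designed as a mechanism for exhaustively analyzing all the data; rather, it is a first-pass analysis/discovery phase for gauging overall data quality along dimensions such as correctness, coherence, and completeness. Profiling is performed iteratively as data moves through the pipeline, which helps refine the data. Sampling data for profiling therefore plays a very important role in Big Data.

Although sampling is considered necessary for profiling, caution is advised when applying sampling techniques to Big Data. Because of implementation complexity and inapplicability, not every data type or collection mechanism lends itself to sampling; this holds true when data is acquired from sensors in near real time. Similarly, not every use case lends itself to sampling; this holds true when data is ingested for search, recommendation systems, and clickstream analytics. In such cases, the data has to be looked at in its entirety, because sampling would introduce bias and reduce the accuracy of the results.

The choice of sampling technique affects the overall accuracy of the profile. The techniques include nonprobability and probability sampling methods. We generally do not consider nonprobability methods for data profiling; we restrict ourselves to probability sampling because it improves accuracy and reduces representativeness bias. For a better understanding of sampling techniques, see the Numerosity reduction - Sampling design pattern among the data reduction patterns in Chapter 6.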

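Sampling support in Pig

Pig has native support for sampling through the SAMPLE operator. We use the SAMPLE operator in the basic statistical profiling design pattern to explain how it works in the context of profiling. The SAMPLE operator uses a probabilistic algorithm to select a random sample from the population. The internal algorithm is rudimentary (a simple random sampling technique), and the sample sometimes fails to represent the entire dataset. The SAMPLE operator is evolving to accommodate more sophisticated sampling algorithms. More information on the road ahead can be found at issues.apache.org/jira/browse….

Other ways to implement robust sampling in Pig are to extend it using UDFs and Pig streaming.

Using Pig's extensibility, sampling can be implemented as a UDF, but this is complex and laborious, because the biggest limitation of a UDF is that it takes a single input value and produces a single output value.

You can also consider implementing sampling with streaming, which is not subject to the limitations of UDFs. A stream can take any number of inputs and emit any number of outputs. The R language has the functions needed to perform sampling, and these can be used in a Pig script through Pig streaming. The limitations of this approach are that it performs the sampling computation by holding most of the data in main memory, and that R must be installed on every data node of the Hadoop cluster for streaming to work.

The DataFu library of Pig utilities from LinkedIn has released several sampling implementations of its own. This library is now part of the Cloudera Hadoop distribution. DataFu implements the following sampling techniques:

Reservoir sampling: It generates a random sample of a given size by using an in-memory reservoir.
SampleByKey: It generates a random sample from tuples based on a certain key. Internally it uses stratified random sampling.
Weighted sample: It generates a random sample by assigning weights.

More information on DataFu's sampling implementations can be found at http://linkedin.github.io/datafu/docs/current/datafu/pig/sampling/package-summary.html.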
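
To make the idea of an in-memory reservoir concrete, here is a minimal sketch of the classic reservoir sampling algorithm (often called Algorithm R) in plain Java. It is not DataFu's implementation; the class `ReservoirSketch` and its `sample` method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of reservoir sampling over a stream of unknown length. */
final class ReservoirSketch {

    /** Returns up to k items, each stream item being kept with equal probability. */
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items
                reservoir.add(item);
            } else {
                // Draw a slot index in [0, seen); the item survives only if it lands inside the reservoir
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

The appeal of this design for large inputs is that memory use depends only on the sample size k, not on the number of records scanned.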

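Rationale for using Pig in data profiling

Implementing profiling code within the Hadoop environment reduces the dependency on external systems for quality checks. The following figure gives a high-level overview of the implementation:

[Figure: Profiling with Pig]

The following are the advantages of profiling data with Pig in the Hadoop environment:

Implementing the design patterns in Pig reduces data movement by moving the profiling code directly to the data, which improves performance and speeds up the analytics and development process.
By implementing the patterns in Pig, data quality work and data transformation take place in the same environment. This removes the manual, redundant effort of performing repetitive data quality checks every time data is ingested into Hadoop.
Pig excels in situations where the schema of the ingested data is not known before runtime; its language features give data scientists the flexibility to decipher the correct schema at runtime and to build prototype models.
Pig's inherent ability to discover the data schema and to sample makes it very advantageous to implement profiling code in the Hadoop environment.
Pig is easy to use, which makes it easier to write custom profiling code.
Pig automates the profiling process by chaining complex profiling workflows, which is useful for datasets that are updated regularly.

Now that we understand the concepts of data profiling and the rationale for using Pig for it, the following sections explore some specific design patterns.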

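The data type inference pattern

This section describes the data type inference design pattern, in which we use Pig scripts to capture important information about data types.

Background

Most data in Hadoop has some associated metadata that describes its characteristics. Metadata includes important information about field types, lengths, constraints, and uniqueness; it also tells us whether a field is mandatory. Metadata is further used to interpret values by examining scale, units of measurement, the meaning of labels, and so on. Understanding the intended structure of a dataset helps in interpreting its meaning, description, semantics, and data quality. This kind of analysis of data types helps us learn whether they are syntactically consistent (different datasets share the same consistent format specification) and semantically consistent (different datasets share the same set of values).

Motivation

The intent of this design pattern is to infer data type metadata from data ingested into Hadoop. The pattern helps you compare the type metadata with the actual data to see whether they disagree and whether any disagreement has far-reaching consequences for analysis. The data types and attribute values are scanned and compared with the documented metadata; based on this scan, appropriate data types and data lengths are proposed.

This design pattern is used to examine the structure of a dataset for which there is little or no existing metadata, or for which there is reason to doubt the completeness or quality of the existing metadata. The results of the pattern help to discover, document, and organize the "ground truth" about the dataset's metadata. Here, the results of data profiling are used to incrementally build a knowledge base about the structure, semantics, and use of the data elements.

Use cases

You can use this design pattern when you have to ingest large volumes of structured data and lack documented knowledge about the dataset. It also applies when you need to use undocumented data for further analysis, or need a deeper understanding of the domain's business terms, the related data elements and their definitions, the reference datasets used, and the structure of the attributes in the dataset.

Pattern implementation

This pattern is implemented as a standalone Pig script that uses a Java UDF. The core idea is to discover the dominant data type in a column. First, the column values are examined to determine whether each is of type int, long, double, string, or boolean. Once the values are evaluated, they are grouped by data type to find the frequency of each type. From this analysis, we can find out which data type is dominant (most frequent).

Code snippets

To illustrate how this pattern works, we consider a retail transactions dataset stored on the Hadoop Distributed File System (HDFS). It contains attributes such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we are interested in the values of the attribute Customer ID.

Pig script

The following Pig script illustrates the implementation of this pattern: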

/*
Register the datatypeinferer and custom storage jar files
*/
REGISTER '/home/cloudera/pdp/jars/datatypeinfererudf.jar';
REGISTER '/home/cloudera/pdp/jars/customdatatypeinfererstorage.jar';

/*
Load the transactions dataset into the relation transactions
*/
transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:chararray, cust_id:chararray, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*
Infer the data type of the field cust_id by invoking the DataTypeInfererUDF.
It returns a tuple with the inferred data type.
*/
data_types = FOREACH transactions GENERATE com.profiler.DataTypeInfererUDF(cust_id) AS inferred_data_type;

/*
Compute the count of each data type, total count, percentage.
The data type with the highest count is considered as dominant data type
*/
grpd = GROUP data_types BY inferred_data_type;
inferred_type_count = FOREACH grpd GENERATE group AS inferred_type, COUNT(data_types) AS count;
grpd_inf_type_count_all = GROUP inferred_type_count ALL;
total_count = FOREACH grpd_inf_type_count_all GENERATE SUM(inferred_type_count.count) AS tot_sum, MAX(inferred_type_count.count) AS max_val;
percentage = FOREACH inferred_type_count GENERATE inferred_type AS type, count AS total_cnt, CONCAT((chararray)ROUND(count*100.0/total_count.tot_sum),'%') AS percent, (count==total_count.max_val?'Dominant':'Other') AS inferred_dominant_other_datatype;
percentage_ord = ORDER percentage BY inferred_dominant_other_datatype ASC;

/*
CustomDatatypeInfererStorage UDF extends the StoreFunc. All the abstract methods have been overridden to implement logic that writes the contents of the relation into a file in a custom report like format.
The results are stored on the HDFS in the directory datatype_inferer
*/
STORE percentage_ord INTO '/user/cloudera/pdp/output/data_profiling/datatype_inferer' USING com.profiler.CustomDatatypeInfererStorage('cust_id','chararray');

Java UDF

The following is the code snippet of the Java UDF:

@Override
public String exec(Tuple tuples) throws IOException {

    String value = (String) tuples.get(0);
    String inferredType = null;
    try {
        /*
        If tuples.get(0) is null, return NULL (a constant defined elsewhere in the UDF class, not shown in this excerpt);
        otherwise invoke the getDataType() method to infer the data type
        */
        inferredType = value != null ? getDataType(value) : NULL;

    } catch (Exception e) {
        e.printStackTrace();
    }
    // Return the inferred data type of the input value
    return inferredType;
}
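
The excerpt above calls a `getDataType` helper that is not shown. One plausible shape for it, written as a standalone sketch, tries the narrowest type first and falls back to string. The class name `DataTypeSketch` and the exact ordering of the checks are assumptions, not the book's code. Note that `Double.parseDouble` also accepts inputs such as `NaN` and `1e5`, which a production profiler might want to treat differently.

```java
/** Hypothetical sketch of a type-inference helper for a single column value. */
final class DataTypeSketch {

    /** Returns one of "boolean", "int", "long", "double", or "string". */
    static String getDataType(String value) {
        String v = value.trim();
        if (v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false")) {
            return "boolean";
        }
        try {
            Integer.parseInt(v);
            return "int";
        } catch (NumberFormatException notAnInt) {
            // fall through to the next wider type
        }
        try {
            Long.parseLong(v);
            return "long";
        } catch (NumberFormatException notALong) {
            // fall through to the next wider type
        }
        try {
            Double.parseDouble(v);
            return "double";
        } catch (NumberFormatException notADouble) {
            // not numeric at all
        }
        return "string";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"817740", "9999999999", "3.14", "TRUE", "abc"}) {
            System.out.println(s + " -> " + getDataType(s));
        }
    }
}
```

Grouping the returned labels and counting them, as the Pig script does, then yields the dominant type for the column.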

Results

The following is the result of applying the design pattern to the transactions data:

Column Name :  cust_id
Defined Datatype :  chararray
Inferred Dominant Datatype(s):  int, Count: 817740 Percentage: 100%

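In the preceding result, the input column cust_id is evaluated to check whether its values accurately reflect the defined data type. At ingestion, the data type was defined as chararray. Applying the data type inference design pattern infers the values in the cust_id column to be of type int.

Additional information

The complete code and datasets for this section are in the following GitHub directories: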

Chapter3/code/
Chapter3/datasets/

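The basic statistical profiling pattern

This section describes the basic statistical profiling design pattern, in which we use Pig scripts to apply statistical functions that capture important information about data quality.

Background

The previous design pattern described a way of inferring data types. The next logical step in the data profiling process is to assess quality measures of the values. This is done by collecting and analyzing the data with statistical methods. These statistics provide a high-level overview of the suitability of the data for a particular analytical problem and reveal potential problems early in the data lifecycle.

Motivation

The basic statistical profiling design pattern helps create data quality metadata that includes basic statistics such as mean, median, mode, maximum, minimum, and standard deviation. These statistics give you a complete snapshot of an entire data field, and tracking them over time gives you insight into the characteristics of new data being ingested into the Hadoop cluster. Basic statistics of new data can be checked before it is brought into Hadoop, giving advance warning of inconsistent data and helping to prevent low-quality data from being added.

This design pattern attempts to address the following profiling requirements:

Range analysis scans the values and determines whether the data is subject to a total ordering, and whether the values are constrained within a well-defined range.
The sparseness of the data can be evaluated to find the percentage of unpopulated elements.
The cardinality of the dataset can be analyzed by finding the number of distinct values that appear in the data.
Uniqueness can be evaluated to determine whether each value assigned to an attribute is indeed exclusive.
Data overloading can be evaluated to check whether an attribute is being used for multiple purposes.
Format evaluation can be done by resolving unrecognized data into defined formats.

Use cases

The following are the use cases where the basic statistical profiling design pattern can be applied:

This design pattern can be used to detect anomalies in a dataset and to discover unexpected behavior through empirical analysis of its values. The pattern examines the frequency distribution, variance, and percentages of the recorded data and their relationship to the dataset in order to reveal potentially flawed data values.
A common use case for this design pattern is ingesting data into a Hadoop cluster from legacy data sources that are still in use. In legacy systems such as mainframes, programmers devised shortcuts and encodings at data creation time and overloaded particular fields for different purposes that are no longer used or understood. When such data is ingested into Hadoop, the basic statistical profiling pattern can help uncover the problem.

Pattern implementation

This design pattern is implemented in Pig as a standalone script that internally uses a macro to pass parameters and retrieve results. Pig Latin has a set of Math functions that can be applied directly to a column of data. The data is first loaded into a Pig relation, which is then passed as a parameter to the getProfile macro. The macro iterates over the relation and applies the Math functions to the specified column. The getProfile macro is modular by design and can be applied across various datasets to get a good understanding of the data profile.

Code snippets

To explain how this pattern works, we consider the retail transactions dataset stored on HDFS. It contains attributes such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we profile the values of the attribute Sales Price.

Pig script

The following Pig script illustrates the implementation of this pattern: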

/*
Register the datafu and custom storage jar files
*/
REGISTER '/home/cloudera/pdp/jars/datafu.jar';
REGISTER '/home/cloudera/pdp/jars/customprofilestorage.jar';

/*
Import macro defined in the file numerical_profiler_macro.pig
*/
IMPORT '/home/cloudera/pdp/data_profiling/numerical_profiler_macro.pig';

/*
Load the transactions dataset into the relation transactions
*/
transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:datetime, cust_id:long, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*
Use SAMPLE operator to pick a random subset of the data; each row is kept with probability 0.2, so approximately 20% of the data is returned as a sample
*/
sample_transactions = SAMPLE transactions 0.2;

/*
Invoke the macro getProfile with the parameters sample_transactions which contains a sample of the dataset and the column name on which the numerical profiling has to be done.
The macro performs numerical profiling on the sales_price column and returns various statistics like variance, standard deviation, row count, null count, distinct count and mode
*/
result =  getProfile(sample_transactions,'sales_price');

/*
CustomProfileStorage UDF extends the StoreFunc. All the abstract methods have been overridden to implement logic that writes the contents of the relation into a file in a custom report like format.
The results are stored on the HDFS in the directory numeric
*/
STORE result INTO '/user/cloudera/pdp/output/data_profiling/numeric' USING com.profiler.CustomProfileStorage();

Macro

The following Pig script shows the implementation of the getProfile macro:

/*
Define alias VAR for the function datafu.pig.stats.VAR
*/
DEFINE VAR datafu.pig.stats.VAR(); 

/*
Define the macro, specify the input parameters and the return value
*/
DEFINE getProfile(data,columnName) returns numerical_profile{

/*
Calculate the variance, standard deviation, row count, null count and distinct count for the column sales_price
*/
data_grpd = GROUP $data ALL;
numerical_stats = FOREACH data_grpd
{
  variance = VAR($data.$columnName);
  stdDeviation = SQRT(variance);
  rowCount = COUNT_STAR($data.$columnName);
  /* COUNT skips nulls, so this is the number of non-null values */
  nonNullCount = COUNT($data.$columnName);
  uniq = DISTINCT $data.$columnName;
  GENERATE 'Column Name', '$columnName' AS colName, 'Row Count', rowCount, 'Null Count', (rowCount - nonNullCount), 'Distinct Count', COUNT(uniq), 'Highest Value', MAX($data.$columnName) AS max_numerical_count, 'Lowest Value', MIN($data.$columnName) AS min_numerical_count, 'Total Value', SUM($data.$columnName) AS total_numerical_count, 'Mean Value', AVG($data.$columnName) AS avg_numerical_count, 'Variance', variance AS variance, 'StandardDeviation', stdDeviation AS stdDeviation, 'Mode' AS modeName, 'NONE' AS modevalue;
}

/*
Compute the mode of the column sales_price
*/
groupd = GROUP $data BY $columnName;
groupd_count = FOREACH groupd GENERATE 'Mode' AS modeName, group AS mode_values, (long) COUNT($data) AS total;
groupd_count_all = GROUP groupd_count ALL;
frequency = FOREACH groupd_count_all GENERATE MAX(groupd_count.total) AS fq;
filterd = FILTER groupd_count BY (total == frequency.fq AND total > 1 AND mode_values IS NOT NULL);
mode = GROUP filterd BY modeName;

/*
Join relations numerical stats and mode. Return these values
*/
$numerical_profile = JOIN numerical_stats BY modeName FULL, mode BY group;
};

Results

Applying the basic statistical profiling pattern gives the following results:

Column Name: sales_price
Row Count: 163794
Null Count: 0
Distinct Count: 1446
Highest Value: 70589
Lowest Value: 1
Total Value: 21781793
Mean Value: 132.98285040966093
Variance: 183789.18332067598
Standard Deviation: 428.7064069041609
Mode: 99

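The preceding results summarize the properties of the data: its row count, null count, and number of distinct values. We also learn the key characteristics of the data in terms of central tendency and dispersion. Mean and mode are measures of central tendency; variance is one way of understanding the dispersion of the data.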
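
For readers who want to check such figures by hand, the following plain-Java sketch computes the mean, variance, and mode of a small column. It is an illustration only: the class `NumericProfileSketch` is hypothetical, the input is assumed to be non-empty, and the variance is the population variance (dividing by n), which is an assumption about what the profiler reports. Unlike the macro, which reports a mode only when a value occurs more than once, this sketch always returns the most frequent value and breaks ties toward the smaller one.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of basic numeric profiling statistics for a non-empty column. */
final class NumericProfileSketch {

    static double mean(long[] values) {
        double sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population variance: mean squared deviation from the mean. */
    static double populationVariance(long[] values) {
        double m = mean(values);
        double squared = 0;
        for (long v : values) {
            squared += (v - m) * (v - m);
        }
        return squared / values.length;
    }

    /** Most frequent value; ties are broken in favor of the smaller value. */
    static long mode(long[] values) {
        Map<Long, Integer> freq = new HashMap<>();
        for (long v : values) {
            freq.merge(v, 1, Integer::sum);
        }
        long best = values[0];
        int bestFreq = 0;
        for (Map.Entry<Long, Integer> e : freq.entrySet()) {
            int f = e.getValue();
            long k = e.getKey();
            if (f > bestFreq || (f == bestFreq && k < best)) {
                best = k;
                bestFreq = f;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long[] prices = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println("mean=" + mean(prices));
        System.out.println("variance=" + populationVariance(prices));
        System.out.println("stddev=" + Math.sqrt(populationVariance(prices)));
        System.out.println("mode=" + mode(prices));
    }
}
```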

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter3/code/
Chapter3/datasets/

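The pattern-matching pattern

This section describes the pattern-matching design pattern, in which we use Pig scripts to match numeric and text patterns in order to determine whether the data is coherent relative to itself, and thus obtain a measure of data quality.

Background

In the enterprise context, data is examined for coherence after it has been ingested and its completeness and correctness have been established. The values of a given attribute can come in different shapes and sizes. This is especially true of fields that require manual input, where values are entered according to the whim of the user. A column representing phone numbers can be said to be coherent if all its values represent valid phone numbers, in that they match the expected format, length, and data type (numeric), thereby meeting the expectations of the system. Data misrepresented in an incorrect format leads to inaccurate analytics, and in a Big Data context the sheer volume of data amplifies this inaccuracy.

Motivation

Profiling data from a pattern-matching perspective measures the coherence of the data and the amount of data that matches an expected pattern. This profiling process compares the values against a predefined set of possible values to find out whether they are consistent with themselves. It captures the essence of the data and tells you whether a field is entirely numeric or has a uniform length, along with other format-specific information. Evaluation is done by resolving unrecognized data into defined formats. Abstract type recognition is performed on the data, based on pattern analysis and usage, to make semantic data type associations. Identifying the percentage of inaccurate, nonmatching patterns early in the analytics cycle ensures better data cleansing and reduces effort.

Use cases

This design pattern can be used to profile numeric or string data that is supposed to match a particular pattern.

Pattern implementation

This design pattern is implemented in Pig as a standalone script. The script attempts to discover patterns and common record types in the data by analyzing the strings stored in the attribute. It generates several patterns that match the values in the attribute and reports the percentage of the data that follows each candidate pattern. The script primarily performs the following tasks:

Discovers patterns from the tuples, along with the count and percentage of each
Examines the discovered patterns and classifies them as valid or invalid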
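
The two tasks above can be sketched in plain Java. This is a hedged illustration, not the book's implementation: the class `PatternSketch` and its methods are hypothetical. Generalization replaces every digit with 9 and every ASCII letter with a, then strips whitespace; classification uses the two phone-number shapes that this section treats as valid.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical sketch: generalize values into patterns, count them, and classify them. */
final class PatternSketch {

    // A generalized pattern is valid if it looks like 999-999-9999 or 9999999999
    private static final Pattern VALID =
        Pattern.compile("[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{10}");

    /** Replaces digits with 9 and ASCII letters with a, then removes whitespace. */
    static String toPattern(String value) {
        return value.replaceAll("\\d", "9")
                    .replaceAll("[a-zA-Z]", "a")
                    .replaceAll("\\s", "");
    }

    /** Counts how many non-null values fall under each generalized pattern. */
    static Map<String, Integer> countPatterns(Iterable<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            if (v != null) {
                counts.merge(toPattern(v), 1, Integer::sum);
            }
        }
        return counts;
    }

    static boolean isValid(String pattern) {
        return VALID.matcher(pattern).matches();
    }
}
```

Dividing each count by the total number of values gives the per-pattern percentage that the script reports.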

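Code snippets

To explain how this pattern works, we consider the retail transactions dataset stored on HDFS. It contains attributes such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we are interested in the values of the attribute Phone Number.

Pig script

The following Pig script illustrates the implementation of this pattern: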

/*
Import macro defined in the file pattern_matching_macro.pig
*/
IMPORT '/home/cloudera/pdp/data_profiling/pattern_matching_macro.pig';

/*
Load the dataset transactions.csv into the relation transactions
*/
transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:datetime, cust_id:long, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*
Invoke the macro and pass the relation transactions and the column phone_no as parameters to it.
The pattern matching is performed on the column that is passed.
This macro returns the phone number pattern, its count and the percentage
*/
result = getPatterns(transactions, 'phone_no');

/*
Split the relation result into the relation valid_pattern if the phone number pattern matches any of the two regular expressions. The patterns that do not match any of the regex are stored into the relation invalid_patterns
*/
SPLIT result INTO valid_patterns IF (phone_number MATCHES '([0-9]{3}-[0-9]{3}-[0-9]{4})' OR phone_number MATCHES '([0-9]{10})'), invalid_patterns OTHERWISE;

/*
The results are stored on the HDFS in the directories valid_patterns and invalid_patterns
*/
STORE valid_patterns INTO '/user/cloudera/pdp/output/data_profiling/pattern_matching/valid_patterns';
STORE invalid_patterns INTO '/user/cloudera/pdp/output/data_profiling/pattern_matching/invalid_patterns';

Macro

The following Pig script shows the implementation of the getPatterns macro:

/*
Define the macro, specify the input parameters and the return value
*/
DEFINE getPatterns(data,phone_no) returns percentage{

/*
Iterate over each row of the phone_no column and transform each value by replacing all digits with 9 and all alphabets with a to form uniform patterns
*/
transactions_replaced = FOREACH $data  
{
  replace_digits = REPLACE($phone_no,'\\d','9');
  replace_alphabets = REPLACE(replace_digits,'[a-zA-Z]','a');
  replace_spaces = REPLACE(replace_alphabets,'\\s','');
  GENERATE replace_spaces AS phone_number_pattern;
}
/*
Group by phone_number_pattern and calculate count of each pattern
*/
grpd_ph_no_pattern = GROUP transactions_replaced BY phone_number_pattern;
phone_num_count = FOREACH grpd_ph_no_pattern GENERATE group AS phone_num, COUNT(transactions_replaced.phone_number_pattern) AS phone_count;

/*
Compute the total count and percentage.
Return the relation percentage with the fields phone number pattern, count and the rounded percentage
*/
grpd_ph_no_cnt_all = GROUP phone_num_count ALL;
total_count = FOREACH grpd_ph_no_cnt_all GENERATE SUM(phone_num_count.phone_count) AS tot_sum;
$percentage = FOREACH phone_num_count GENERATE phone_num AS phone_number, phone_count AS phone_number_count, CONCAT((chararray)ROUND(phone_count*100.0/total_count.tot_sum),'%') AS percent;
};

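Results

The following is the result of applying the design pattern to the transactions data. The results are stored in the folders valid_patterns and invalid_patterns.

The output in the folder valid_patterns is as follows: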

9999999999  490644    60%
999-999-9999  196257    24%

The output in the folder invalid_patterns is as follows:

99999         8177  1%
aaaaaaaaaa  40887  5%
999-999-999a  40888  5%
aaa-aaa-aaaa  40887  5%

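The preceding results give us a snapshot of all the phone number patterns in our dataset, with their counts and percentages. Using this data, we can determine the percentage of inaccurate data in the dataset and take the necessary measures in the data cleansing stage. Because phone numbers in the 999-999-9999 format have a high relative frequency and the pattern is valid, you can derive a rule that requires all values in this attribute to conform to it. This rule can then be applied in the data cleansing phase.

Additional information

The complete code and datasets for this section are in the following GitHub directories: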

Chapter3/code/
Chapter3/datasets/

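The string profiling pattern

This section describes the string profiling design pattern, in which we use Pig scripts on textual data to learn important statistics.

Background

Most Big Data implementations deal with text data embedded in columns. To gain insight from these columns, they have to be integrated with other enterprise structured data. This design pattern illustrates a few ways of understanding the quality of textual data.

Motivation

The quality of text can be ascertained by applying basic statistical techniques to the attribute values. Finding the length of strings is the most important aspect when choosing appropriate data types and sizes for the target system. You can use the maximum and minimum string lengths to determine whether the data ingested into Hadoop meets given constraints. When dealing with data in the petabyte range, limiting the character count to be just large enough optimizes storage and computation by cutting unnecessary storage space.

Using string lengths, you can also determine the distinct lengths of the individual strings in a column and the percentage of rows in the table that each length represents.

For example, the profile of a column representing US state codes is expected to show two characters; if the collected profile shows values that are not two characters long, it indicates that the values in the column are not coherent.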
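
The length check in the state-code example can be sketched as a length distribution. The following plain-Java illustration is hypothetical (the class `LengthProfileSketch` is not part of the book's code) and assumes that null values are skipped rather than counted as length zero.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: percentage of non-null values at each string length. */
final class LengthProfileSketch {

    static Map<Integer, Double> lengthPercentages(List<String> values) {
        Map<Integer, Integer> counts = new TreeMap<>();
        int nonNull = 0;
        for (String v : values) {
            if (v == null) {
                continue; // nulls belong to the null count, not the length profile
            }
            nonNull++;
            counts.merge(v.length(), 1, Integer::sum);
        }
        Map<Integer, Double> percentages = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            percentages.put(e.getKey(), 100.0 * e.getValue() / nonNull);
        }
        return percentages;
    }

    public static void main(String[] args) {
        // Any key other than 2 signals state codes that break the two-character rule
        System.out.println(lengthPercentages(java.util.Arrays.asList("CA", "NY", "Tex", "WA")));
    }
}
```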

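Use cases

This pattern can be applied to data columns that predominantly contain text data types, to find out whether the text falls within the defined constraints.

Pattern implementation

This design pattern is implemented in Pig as a standalone script that internally uses a macro to retrieve the profile. Pig Latin has a set of Math functions that can be applied directly to a column of data. The data is first loaded into the Pig relation transactions, which is then passed as a parameter to the getStringProfile macro. The macro iterates over the relation and applies the Math functions to each value. The getStringProfile macro is modular by design and can be applied across various text columns to get a good understanding of the string data profile.

Code snippets

To illustrate how this pattern works, we consider the retail transactions dataset stored on HDFS. It contains attributes such as Transaction ID, Transaction date, Customer ID, Phone Number, Product, Product subclass, Product ID, Sales Price, and Country Code. For this pattern, we are interested in the values of the attribute Country Code.

Pig script

The following Pig script illustrates the implementation of this pattern: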

/*
Register the datafu and custom storage jar files
*/
REGISTER '/home/cloudera/pdp/jars/datafu.jar';
REGISTER '/home/cloudera/pdp/jars/customprofilestorage.jar';

/*
Import macro defined in the file string_profiler_macro.pig
*/
IMPORT '/home/cloudera/pdp/data_profiling/string_profiler_macro.pig';

/*
Load the transactions dataset into the relation transactions
*/
transactions = LOAD '/user/cloudera/pdp/datasets/data_profiling/transactions.csv' USING PigStorage(',') AS (transaction_id:long, transaction_date:datetime, cust_id:long, age:chararray, area:chararray, prod_subclass:int, prod_id:long, amt:int, asset:int, sales_price:int, phone_no:chararray, country_code:chararray);

/*
Invoke the macro getStringProfile with the parameters transactions and the column name on which the string profiling has to be done.
The macro performs string profiling on the country_code column and returns various statistics like row count, null count, total character count, word count, identifies distinct country codes in the dataset and calculates their count and percentage.
*/
result =  getStringProfile(transactions,'country_code');

/*
CustomProfileStorage UDF extends the StoreFunc. All the abstract methods have been overridden to implement logic that writes the contents of the relation into a file in a custom report like format.
The results are stored on the HDFS in the directory string
*/
STORE result INTO '/user/cloudera/pdp/output/data_profiling/string' USING com.profiler.CustomProfileStorage();

Macro

The following Pig script shows the implementation of the getStringProfile macro:

/*
Define the macro, specify the input parameters and the return value
*/
DEFINE getStringProfile(data,columnName) returns string_profile{

/*
Calculate row count and null count on the column country_code
*/
data_grpd = GROUP $data ALL;
string_stats = FOREACH data_grpd  
{
  rowCount = COUNT_STAR($data.$columnName);
  /* COUNT skips nulls, so this is the number of non-null values */
  nonNullCount = COUNT($data.$columnName);
  GENERATE 'Column Name', '$columnName' AS colName, 'Row Count', rowCount, 'Null Count', (rowCount - nonNullCount), 'Distinct Values' AS dist, 'NONE' AS distvalue;
}

/*
Calculate total char count, max chars, min chars, avg chars on the column country_code
*/
size = FOREACH $data GENERATE SIZE($columnName) AS chars_count;
size_grpd_all = GROUP size ALL;
char_stats = FOREACH size_grpd_all GENERATE 'Total Char Count', SUM(size.chars_count) AS total_char_count, 'Max Chars', MAX(size.chars_count) AS max_chars_count, 'Min Chars', MIN(size.chars_count) AS min_chars_count, 'Avg Chars', AVG(size.chars_count) AS avg_chars_count, 'Distinct Values' AS dist, 'NONE' AS distvalue;

/*
Calculate total word count, max words and min words on the column country_code
*/
words = FOREACH $data GENERATE FLATTEN(TOKENIZE($columnName)) AS word;
whitespace_filtrd_words = FILTER words BY word MATCHES '\\w+';
grouped_words = GROUP whitespace_filtrd_words BY word;
word_count = FOREACH grouped_words GENERATE COUNT(whitespace_filtrd_words) AS count, group AS word;
word_count_grpd_all = GROUP word_count ALL;
words_stats = FOREACH word_count_grpd_all GENERATE 'Word Count', SUM(word_count.count) AS total_word_count, 'Max Words', MAX(word_count.count) AS max_count, 'Min Words', MIN(word_count.count) AS min_count, 'Distinct Values' AS dist, 'NONE' AS distvalue;

/*
Identify distinct country codes and their count
*/
grpd_data = GROUP $data BY $columnName;
grpd_data_count = FOREACH grpd_data GENERATE group AS country_code, COUNT($data.$columnName) AS country_count;

/*
Calculate the total sum of all the counts
*/
grpd_data_cnt_all = GROUP grpd_data_count ALL;
total_count = FOREACH grpd_data_cnt_all GENERATE SUM(grpd_data_count.country_count) AS tot_sum;

/*
Calculate the percentage of the distinct country codes
*/
percentage = FOREACH grpd_data_count GENERATE country_code AS country_code, country_count AS country_code_cnt, ROUND(country_count*100.0/total_count.tot_sum) AS percent, 'Distinct Values' AS dist;

/*
Join string stats, char_stats, word_stats and the relation with distinct country codes, their count and the rounded percentage. Return these values
*/
percentage_grpd = GROUP percentage BY dist;
$string_profile = JOIN string_stats BY dist,char_stats BY dist ,words_stats BY dist, percentage_grpd BY group;
};

Results

Applying the string profiling pattern gives the following results:

Column Name: country_code
Row Count: 817740
Null Count: 0
Total Char Count: 5632733
Max Chars: 24
Min Chars: 2
Avg Chars: 6.888171056815125
Word Count: 999583
Max Words: 181817
Min Words: 90723

Distinct Values
country_code      Count    Percentage
US          181792    22%
U.S	        90687    11%
USA	        181782    22%
U.S.A        90733    11%
America        90929    11%
United States      91094    11%
United States of America  90723    11%

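The preceding results summarize properties of the data such as the row count, the number of nulls, and the total number of characters. The Max Chars and Min Chars counts can be used to validate data quality by checking whether the lengths of the values fall within a range. According to the metadata, a valid country code should be two characters long, but the results show a maximum of 24 characters, which means the data is inaccurate. The Distinct Values section shows the distinct country codes in the dataset, with their counts and percentages. Using these results, we can determine the percentage of inaccurate data in the dataset and take the necessary measures in the data cleansing stage.

Additional information

The complete code and datasets for this section are in the following GitHub directories: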

Chapter3/code/
Chapter3/datasets/

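The unstructured text profiling pattern

This section describes the unstructured text profiling design pattern, in which we use Pig scripts to learn important statistics about free-form text data.

Background

Text mining is performed on unstructured data ingested by Hadoop to extract interesting and nontrivial meaningful patterns from chunks of otherwise meaningless data. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics to extract meaningful patterns from text. Typically, Hadoop's parallel processing power is used to process massive amounts of text data for document classification, clustering of tweets, ontology building, entity extraction, sentiment analysis, and so on.

This pattern discusses a way of ascertaining the quality of text data using text preprocessing techniques such as stopword removal, stemming, and TF-IDF.

Motivation

Unstructured text is inherently inconsistent, which leads to inaccurate analytics. The inconsistencies arise because there are many ways of expressing an idea.

Text preprocessing improves data quality, improves the accuracy of analytics, and reduces the difficulty of the text mining process. The following are the steps of text preprocessing:

The first step in text preprocessing is to convert chunks of text into tokens, removing punctuation, hyphens, brackets, and so on, and keeping only meaningful keywords, abbreviations, and acronyms for further processing. Tokenization involves a measure of a document's coherence, because there is a linear proportionality between the number of meaningless tokens eliminated and the incoherence of the data relative to itself.
Stopword removal is the next logical step in text preprocessing. This step removes words that give no meaning or context to the document. These words are called stopwords; they typically include pronouns, articles, prepositions, and so on. A stopword list is defined before the words are actually removed from the raw text. The list can include other words that are specific to a particular domain.
The stemming step reduces the various forms of a word to its root form by removing or converting the word to its base word (stem). For example, agree, agreeing, disagree, agreement, and disagreement are stemmed (depending on the specific stemming algorithm) to agree. This is done to make all the tokens in the corpus consistent.
After stemming, each token is assigned a weight relative to its frequency of occurrence by computing the term frequency, a statistic that denotes the number of times a word occurs in a document. The inverse document frequency is computed to learn how frequently the word occurs across all documents; this statistic determines whether a word is common or rare across the documents. Term frequency and inverse document frequency have a bearing on text quality because these statistics tell you whether a word can be discarded or used, based on its importance to the document or the corpus.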
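
The first two steps and the term count can be sketched in plain Java. This is a simplified illustration under stated assumptions: the class `PreprocessSketch` is hypothetical, tokens are runs of ASCII letters and digits, the stopword set is supplied by the caller, and stemming is deliberately left out because a faithful Porter stemmer is too long for a short example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of tokenization, stopword removal, and term counting. */
final class PreprocessSketch {

    /** Lowercases the text, splits on anything that is not a letter or digit, and drops stopwords. */
    static List<String> tokens(String text, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    /** Counts how many times each token occurs in one document. */
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }
}
```

Dividing each count by the number of tokens in the document gives the term frequency used in TF-IDF. Note that splitting on punctuation also avoids tokens with trailing punctuation, which whitespace-only tokenization would keep as separate terms.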

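Use cases

This design pattern can be used where the quality of an unstructured text corpus needs to be understood through text preprocessing techniques. The pattern is not exhaustive; it covers a few important aspects of text preprocessing and their applicability to data profiling.

Pattern implementation

This design pattern is implemented in Pig as a standalone script. It internally uses the unstructuredtextprofiling Java UDFs to perform stemming and to generate the term frequency and inverse document frequency. The script performs a right outer join to remove stopwords: the stopword list is first loaded into a relation from an external text file and then used in the outer join.

Stemming is done using the Porter stemmer algorithm implemented in the unstructuredtextprofiling JAR file.

Code snippets

To demonstrate how this pattern works, we consider a text corpus from Wikipedia, stored in a folder accessible from HDFS. This sample corpus consists of wiki pages related to computer science and information technology.

Pig script

The following Pig script illustrates the implementation of this pattern: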

/*
Register custom text profiler jar
*/
REGISTER '/home/cloudera/pdp/jars/unstructuredtextprofiler.jar';

/*
Load stop words into the relation stop_words_list
*/
stop_words_list = LOAD '/user/cloudera/pdp/datasets/data_profiling/text/stopwords.txt' USING PigStorage();

/*
Tokenize the stopwords to extract the words
*/
stopwords = FOREACH stop_words_list GENERATE FLATTEN(TOKENIZE($0));

/*
Load the dataset into the relations doc1 and doc2.
Tokenize to extract the words for each of these documents
*/
doc1 = LOAD '/user/cloudera/pdp/datasets/data_profiling/text/computer_science.txt' AS (words:chararray);
docWords1 = FOREACH doc1 GENERATE 'computer_science.txt' AS documentId, FLATTEN(TOKENIZE(words)) AS word;
doc2 = LOAD '/user/cloudera/pdp/datasets/data_profiling/text/information_technology.txt' AS (words:chararray);
docWords2 = FOREACH doc2 GENERATE 'information_technology.txt' AS documentId, FLATTEN(TOKENIZE(words)) AS word;

/*
Combine the relations using the UNION operator
*/
combined_docs = UNION docWords1, docWords2;

/*
Perform pre-processing by doing the following
Convert the data into lowercase
Remove stopwords
Perform stemming by calling custom UDF. it uses porter stemmer algorithm to perform stemming
*/
lowercase_data = FOREACH combined_docs GENERATE documentId AS documentId, FLATTEN(TOKENIZE(LOWER($1))) AS word;
joind = JOIN stopwords BY $0 RIGHT OUTER, lowercase_data BY $1;
stop_words_removed = FILTER joind BY $0 IS NULL;
processed_data = FOREACH stop_words_removed GENERATE documentId AS documentId, com.profiler.unstructuredtextprofiling.Stemmer($2) AS word;

/*
Calculate word count per word/doc combination using the Group and FOREACH statement and the result is stored in word_count
*/
grpd_processed_data = GROUP processed_data BY (word, documentId);
word_count = FOREACH grpd_processed_data GENERATE group AS wordDoc, COUNT(processed_data) AS wordCount;

/*
Calculate Total word count per document using the Group and FOREACH statement and the result is stored in total_docs_wc
*/
grpd_wc = GROUP word_count BY wordDoc.documentId;
grpd_wc_all = GROUP grpd_wc ALL;
total_docs = FOREACH grpd_wc_all GENERATE FLATTEN(grpd_wc), COUNT(grpd_wc) AS totalDocs;
total_docs_wc = FOREACH total_docs GENERATE FLATTEN(word_count), SUM(word_count.wordCount) AS wordCountPerDoc, totalDocs;

/*
Calculate Total document count per word is using the Group and FOREACH statement and the result is stored in doc_count_per_word 
*/
grpd_total_docs_wc = GROUP total_docs_wc BY wordDoc.word;
doc_count_per_word = FOREACH grpd_total_docs_wc GENERATE FLATTEN(total_docs_wc), COUNT(total_docs_wc) AS docCountPerWord;

/*
Calculate tfidf by invoking custom Java UDF.
The overall relevancy of a document with respect to a term is computed and the resultant data is stored in gen_tfidf
*/
gen_tfidf = FOREACH doc_count_per_word GENERATE $0.word AS word,$0.documentId AS documentId,com.profiler.unstructuredtextprofiling.GenerateTFIDF(wordCount,wordCountPerDoc,totalDocs,docCountPerWord) AS tfidf;

/*
Order by relevancy
*/
orderd_tfidf = ORDER gen_tfidf BY word ASC, tfidf DESC;

/*
The results are stored on the HDFS in the directory tfidf
*/
STORE orderd_tfidf INTO '/user/cloudera/pdp/output/data_profiling/unstructured_text_profiling/tfidf';

Java UDF for stemming

The following is the code snippet of the Java UDF:

public String exec(Tuple input) throws IOException {
    //Few declarations go here
    Stemmer s = new Stemmer();
    //Code for exception handling goes here
    //Extract values from the input tuple
    String str = (String)input.get(0);

    /*
    Invoke the stem(str) method of the class Stemmer.
    It return the stemmed form of the word
    */
    return s.stem(str);
}

Java UDF for generating TF-IDF

The following is the code snippet of the Java UDF that computes TF-IDF:

public class GenerateTFIDF extends EvalFunc<Double>{
  /**
   * The pre-calculated wordCount, wordCountPerDoc, totalDocs and docCountPerWord are passed as parameters to this UDF.
   */
  @Override
  public Double exec(Tuple input) throws IOException {
    /*
    Retrieve the values from the input tuple
    */
    long countOfWords = (Long) input.get(0);
    long countOfWordsPerDoc = (Long) input.get(1);
    long noOfDocs = (Long) input.get(2);
    long docCountPerWord = (Long) input.get(3);
    /*
    Compute the overall relevancy of a document with respect to a term. 
    */
    double tf = (countOfWords * 1.0) / countOfWordsPerDoc;
    double idf = Math.log((noOfDocs * 1.0) / docCountPerWord);
    return tf * idf;
  }
}

Results

The following is the result of applying the design pattern to the computer_science and information_technology wiki text corpus:

associat  information_technology.txt  0.0015489322470613302
author  information_technology.txt  7.744661235306651E-4
automat  computer_science.txt  8.943834587870262E-4
avail  computer_science.txt  0.0
avail  information_technology.txt  0.0
babbag  computer_science.txt  8.943834587870262E-4
babbage'  computer_science.txt  0.0026831503763610786
base  information_technology.txt  0.0
base  computer_science.txt  0.0
base.  computer_science.txt  8.943834587870262E-4
basic  information_technology.txt  7.744661235306651E-4
complex.  computer_science.txt  8.943834587870262E-4
compon  information_technology.txt  0.0015489322470613302
compsci  computer_science.txt  8.943834587870262E-4
comput  computer_science.txt  0.0
comput  information_technology.txt  0.0
computation  computer_science.txt  8.943834587870262E-4
computation.  computer_science.txt  8.943834587870262E-4
distinguish  information_technology.txt  7.744661235306651E-4
distribut	computer_science.txt  0.0
distribut	information_technology.txt  0.0
divid  computer_science.txt  8.943834587870262E-4
division.  computer_science.txt  8.943834587870262E-4
document  information_technology.txt  7.744661235306651E-4
encompass  information_technology.txt  7.744661235306651E-4
engin  computer_science.txt  0.0035775338351481047
engine.[5]  computer_science.txt  8.943834587870262E-4
enigma  computer_science.txt  8.943834587870262E-4
enough  computer_science.txt  0.0017887669175740523
enterprise.[2]  information_technology.txt  7.744661235306651E-4
entertain  computer_science.txt  8.943834587870262E-4

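The raw text goes through stopword removal and stemming, after which the term frequency-inverse document frequency is computed. The result shows each word, the document it belongs to, and its TF-IDF. Words with a high TF-IDF have a strong relationship with the document they appear in, while words with a low TF-IDF are considered to be of lower quality and can be ignored.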
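
The zero scores in this output follow directly from the formula in the GenerateTFIDF UDF, which uses the natural logarithm (`Math.log`). A short worked derivation, assuming the corpus contains exactly the two documents loaded by the script (so totalDocs = 2); the symbols simply rename the UDF's parameters:

```latex
\begin{aligned}
\mathrm{tf}(w,d) &= \frac{\mathrm{wordCount}(w,d)}{\mathrm{wordCountPerDoc}(d)},
\qquad
\mathrm{idf}(w) = \ln\frac{\mathrm{totalDocs}}{\mathrm{docCountPerWord}(w)} \\[4pt]
\text{word in both documents:}\quad
\mathrm{idf}(w) &= \ln\tfrac{2}{2} = 0
\;\Rightarrow\; \mathrm{tfidf}(w,d) = 0 \\[4pt]
\text{word in one document:}\quad
\mathrm{idf}(w) &= \ln\tfrac{2}{1} \approx 0.6931
\;\Rightarrow\; \mathrm{tfidf}(w,d) \approx 0.6931\,\mathrm{tf}(w,d)
\end{aligned}
```

This is why stems such as avail, base, and comput score 0.0: they occur in both files. With only two documents, a zero marks a term shared by the whole corpus rather than a necessarily meaningless one. Within a single document, the nonzero scores are integer multiples of the single-occurrence score; for example, 0.0026831503763610786 is three times 8.943834587870262E-4, which is consistent with term counts of 3 and 1.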

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter3/code/
Chapter3/datasets/

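Summary

This chapter builds on what we learned in Chapter 2, Data Ingest and Egress Patterns, where we integrated data from multiple source systems and ingested it into Hadoop. The next step is to find clues about the data type by looking at the constituent values. The values are examined to see whether they are misrepresented, whether their units are misinterpreted, or whether the context of the units is derived incorrectly. This sleuthing mechanism is discussed in detail in the data type inference pattern.

In the basic statistical profiling pattern, we collect statistics on numerical values and examine whether they meet the quality expectations of the use case, in order to answer questions such as the following: for numeric fields, are all the values numeric? Do all the values of enumerable fields fall into the proper set? Do the fields meet the range constraints? Are they complete?

The pattern-matching design pattern explores a few techniques for measuring the consistency of numerical and text columns through data type, data length, and regex patterns. The next pattern uncovers quality metrics for columns that hold string values by using various statistical methods; this is explained in detail in the string profiling design pattern. The unstructured text profiling design pattern is an attempt to formalize text preprocessing techniques, such as stop word removal, stemming, and TF-IDF calculation, to understand the quality of unstructured text.

In the next chapter, we will focus on data validation and cleansing patterns that can be applied to a variety of data formats. After reading that chapter, readers will be able to choose the right pattern to validate the accuracy and completeness of data using techniques such as constraint checks and regex matching. We will also discuss data cleansing techniques, such as filters and statistical cleansing, in the next chapter.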

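4. Data Validation and Cleansing Patterns

In the previous chapter, you studied various patterns related to data profiling, through which you learned different ways to gather important information about the attributes, content, context, structure, and condition of the data residing in the Hadoop cluster. These data profiling patterns are applied in the data life cycle before the data is cleansed into a more useful form.

This chapter covers the following design patterns:

Constraint validation and cleansing pattern: This explains validating data against a set of constraints to check whether there are missing values, whether the values are within the range specified by a business rule, and whether the values conform to referential integrity and uniqueness constraints. Depending on the business rule, either the invalid records are removed or appropriate cleansing steps are applied to the invalid data.
Regex validation and cleansing pattern: This demonstrates validating data by matching it against a specific pattern or length using regular expressions, and pattern-based record filtering to cleanse invalid data.
Corrupt data validation and cleansing pattern: This sets the context for understanding data corruption arising from various sources. The pattern explains in detail the impact of noise and outliers on data, and the methods to detect and cleanse them.
Unstructured text data validation and cleansing pattern: This demonstrates ways to validate and cleanse an unstructured data corpus by performing preprocessing steps such as lowercase conversion, stop word removal, stemming, punctuation removal, extra space removal, number identification, and misspelling identification.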

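Data validation and cleansing for Big Data

The data validation and cleansing process detects and removes incorrect records from the data. It ensures that inconsistencies in the data are identified well before the data is used in the analytics process. The inconsistent data is then replaced, modified, or deleted to make it more consistent.

Most data validation and cleansing is performed by analyzing static constraints based on the schema. Checking the schema against these constraints reveals the presence of missing values, null values, ambiguous representations, foreign key violations, and so on.

From the Big Data perspective, data validation and cleansing plays an increasingly important role in deriving value. One of the biggest trade-offs to consider while cleansing Big Data is the time-quality trade-off. Given unlimited time, we can improve the quality of bad data, but the challenge in designing a good data cleansing script is to cover as much data as possible and cleanse it successfully within the time constraints.

In a typical Big Data use case, the integration of different types of data adds to the complexity of the cleansing process. High-dimensional data from federated systems has its own cleansing approaches, while huge volumes of longitudinal data call for different ones. Streaming data, which may be time-series data, can be handled efficiently with real-time cleansing rather than a batch cleansing mechanism. Unstructured data, descriptive data, and web data have to be handled with text preprocessing and cleansing methods.

Bad data can result from various touch points in the Hadoop environment. The following points outline the common cleansing issues mapped to these touch points:

Cleansing issues in Big Data arising from data gathering: Inconsistencies in data can arise from errors in the way data is gathered. These range from manual entry and social media data (which uses non-standard words) to the entry of duplicate data and measurement errors.
Cleansing issues in Big Data arising from improper data delivery: This can be the result of improper conversion of data after it enters an upstream system integrated with Hadoop. The causes of bad data due to improper delivery include inappropriate aggregation, nulls converted to default values, buffer overflows, and transmission problems.
Cleansing issues in Big Data arising from data storage: Inconsistent data can be the result of problems in the physical or logical storage of data. Inconsistency caused by physical storage is rare; it happens when data is stored for an extended period and tends to get corrupted, which is known as bit rot. Storing data in logical storage structures can cause issues such as inadequate metadata, missing links in entity relationships, and missing timestamps.
Cleansing issues in Big Data arising from data integration: Integrating data from heterogeneous systems contributes significantly to bad data, typically resulting in issues related to inconsistent fields. Varying definitions, differing field formats, incorrect time synchronization, the idiosyncrasies of legacy data, and wrong processing algorithms all contribute to this.

Generally, the first step in Big Data validation and cleansing is to explore the data with various mathematical techniques to detect any inaccuracies. This initial exploration is done by understanding the data type and domain of each attribute, its context, its acceptable values, and so on. The actual data is then verified against these acceptable limits, which gives an initial estimate of the characteristics of the inaccuracies and their whereabouts. In the validation phase, we conduct this exploration by specifying the expected constraints to find and filter data that does not meet them, and then take the required action on the data in the cleansing step.

After we explore the data and find anomalies, a sequence of data cleansing routines is run iteratively. These routines refer to the master data and relevant business rules to perform cleansing and produce an end result of higher quality data. Data cleansing routines clean up the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. After cleansing, an optional control step is performed in which the results are evaluated and exception handling is applied to the tuples that were not corrected during cleansing.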

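Choosing Pig for validation and cleansing

Implementing the validation and cleansing code in Pig within the Hadoop environment reduces the time-quality trade-off and removes the need to move data to external systems to perform cleansing. The following diagram depicts the high-level overview of the implementation: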

[Figure: Implementing validation and cleansing in Pig]

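The following are the advantages of performing data cleansing with Pig within the Hadoop environment:

Overall performance improves because validation and cleansing are performed in the same environment; there is no need to transfer data to external systems for cleansing.
Pig is well suited to writing validation and cleansing scripts because its built-in functions are geared toward processing messy data and exploratory analysis.
Pig enables automation of the cleansing process by chaining complex workflows, which is very handy for datasets that are periodically updated.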

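The constraint validation and cleansing design pattern

The constraint validation and cleansing pattern deals with validating data against a set of rules and techniques, and then cleansing the invalid data.

Background

Constraints tell us about the properties that data should comply with. They can be applied to an entire database, a table, a column, or an entire schema. These constraints are rules created at design time to prevent data from getting corrupted and to reduce the overhead of processing wrong data; they dictate which values are valid for the data.

Constraints such as null checks and range checks can be used to determine whether the data ingested into Hadoop is valid. Often, constraint validation and cleansing of data in Hadoop is performed based on business rules, which determine the type of constraint that has to be applied to a specific subset of data.

A data type constraint is applied where a given column must be of a specific type. A range constraint is applied when we want to enforce that a number or date falls within a specified range; range constraints typically specify a minimum and maximum value for comparison. A mandatory constraint enforces a hard validation rule to ensure that certain important fields do not remain empty, which in essence checks for null or missing values and eliminates them using a range of methods. A set-membership constraint enforces the rule that data values should always come from a predetermined set of values.

Invalid data can result from ingesting data into Hadoop from legacy systems in which no constraints are implemented in software, or from sources such as spreadsheets, where it is relatively difficult to constrain what a user chooses to enter into a cell.

Motivation

The constraint validation and cleansing design pattern implements a Pig script that validates data by examining whether it conforms to specific mandatory, range, and unique constraints, and then cleanses it.

There are many ways to check whether the data residing in Hadoop adheres to mandatory constraints; one of the most useful is to check for null or missing values. If a given set of data has missing values, it is important to understand whether those missing values account for the lack of data quality, since in many situations it is acceptable for data to have missing values.

Finding null or missing values is relatively simple, but cleansing them by filling in appropriate values is a complex task that typically depends on the business case.

Depending on the business case, null values can be ignored or entered manually as part of the cleansing process, though manual entry is the least recommended method. For categorical variables, a constant global label such as "XXXXX" can be used to denote missing values, provided the label does not clash with other existing values in the table; alternatively, the missing values can be replaced by the most frequently occurring value (the mode). Depending on the data distribution, it is recommended to use the mean for normally distributed data and the median for skewed data; the mean and median apply only to numerical data types. Using a probabilistic measure, such as Bayesian inference or a decision tree, missing values can be calculated more accurately than with the other approaches, but this is a time-consuming method.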
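The two simplest strategies described above, the median for a numeric column and a constant label for a categorical column, can be sketched outside Pig as follows. This is an illustrative sketch only: the class and method names are hypothetical, and the median is floored to an integer as an assumption for integer-valued columns such as age.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class MissingValueSketch {
    // Constant global label for missing categorical values.
    static final String MISSING_LABEL = "XXXXX";

    // Median of the non-null values; assumes at least one value is present.
    static double median(List<Integer> values) {
        List<Integer> present = new ArrayList<>();
        for (Integer v : values) {
            if (v != null) {
                present.add(v);
            }
        }
        Collections.sort(present);
        int n = present.size();
        if (n % 2 == 1) {
            return present.get(n / 2);
        }
        return (present.get(n / 2 - 1) + present.get(n / 2)) / 2.0;
    }

    // Replace nulls in a numeric column with the floor of the median.
    static List<Integer> fillWithMedian(List<Integer> values) {
        int fill = (int) Math.floor(median(values));
        List<Integer> filled = new ArrayList<>();
        for (Integer v : values) {
            filled.add(v == null ? fill : v);
        }
        return filled;
    }

    // Replace a missing categorical value with the constant label.
    static String fillCategory(String value) {
        return (value == null || value.isEmpty()) ? MISSING_LABEL : value;
    }

    public static void main(String[] args) {
        System.out.println(fillWithMedian(Arrays.asList(34, null, 59, 35, 42)));
        System.out.println(fillCategory(null));
    }
}
```

The median is preferred here over the mean because a single extreme value in a skewed column would pull the mean, and therefore every filled-in value, toward it.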

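Range constraints limit the values that can be used in the data by specifying the upper and lower limits of valid values. The design pattern first checks the validity of the data and finds out whether it falls outside the specified range. The invalid data is then cleansed according to the business rules, either by filtering it out or by replacing it: an invalid value above the range is replaced with the maximum range value, and an invalid value below the range is replaced with the minimum range value.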
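The two cleansing options for a range constraint, filtering and replacing with the nearest bound, might look like the following sketch. The class name and the bounds of 7000 and 65000 are hypothetical values chosen for illustration.

```java
final class RangeClampSketch {
    // True when the value satisfies the (inclusive) range constraint.
    static boolean inRange(double value, double min, double max) {
        return value >= min && value <= max;
    }

    // Replace an out-of-range value with the nearest range bound.
    static double clamp(double value, double min, double max) {
        return Math.max(min, Math.min(max, value));
    }

    public static void main(String[] args) {
        double min = 7000.0;   // hypothetical lower bound
        double max = 65000.0;  // hypothetical upper bound
        double[] claims = {8751.24, 4400.0, 1230000000.0};
        for (double claim : claims) {
            System.out.println(claim + " valid=" + inRange(claim, min, max)
                    + " clamped=" + clamp(claim, min, max));
        }
    }
}
```

Filtering discards the record, whereas clamping keeps it but alters the value, so the choice between them is a business decision rather than a technical one.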

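Unique constraints require values to be unique across a table. This applies to primary keys, where duplicate values amount to invalid data. A table can have any number of unique constraints, with the primary key defined as one of them. After the data is ingested by Hadoop, we can use this design pattern to validate whether the data adheres to the unique constraints and cleanse it by removing duplicates.

Use cases

You can use this design pattern when you ingest large volumes of structured data and want to perform integrity checks by validating it against mandatory, range, and unique constraints, and then cleanse it.

Pattern implementation

This design pattern is implemented in Pig as a standalone script. The script loads the data and validates it against the specified constraints. The following is a brief description of how the pattern is implemented:

Mandatory constraints: The script checks for invalid and missing data that does not adhere to the mandatory constraints and cleanses the missing data by replacing the missing values with the median.
Range constraints: The script defines a range constraint stating that valid values of the claim_amount column should lie between a lower and an upper bound. It validates the data and finds all values outside this range. In the cleansing step, these values are filtered out; alternatively, they can be updated to the minimum or maximum value of the range according to a predefined business rule.
Unique constraints: The script checks whether the data is distinct and then cleanses it by removing duplicate values.

Code snippets

To illustrate how this pattern works, we consider an automobile insurance claims dataset stored on HDFS that contains two files. automobile_policy_master.csv is the master file; it contains a unique ID, vehicle details, price, and the premium paid. The master file is used to validate the data in the claims file. The automobile_insurance_claims.csv file contains automobile insurance claims data, specifically vehicle repair charge claims; it includes attributes such as CLAIM_ID, POLICY_MASTER_ID, VEHICLE_DETAILS, and CLAIM_DETAILS. For this pattern, we perform constraint validation and cleansing on CLAIM_AMOUNT, POLICY_MASTER_ID, AGE, and CITY, as shown in the following code: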

/*
Register Datafu and custom jar files
*/
REGISTER '/home/cloudera/pdp/jars/datatypevalidationudf.jar';
REGISTER  '/home/cloudera/pdp/jars/datafu.jar';

/*
Define aliases for Quantile UDF from Datafu and custom UDF DataTypeValidationUDF.
The parameters to Quantile constructor specify list of quantiles to compute
The parameter to the DataTypeValidationUDF constructor specifies the Data type that would be used for validation
*/
DEFINE Quantile datafu.pig.stats.Quantile('0.25','0.5','0.75'); 
DEFINE DataTypeValidationUDF com.validation.DataTypeValidationUDF('double');

/*
Load automobile insurance claims data set into the relation claims and policy master data set into the relation policy_master
*/
claims = LOAD '/user/cloudera/pdp/datasets/data_validation/automobile_insurance_claims.csv' USING PigStorage(',') AS (claim_id:chararray, policy_master_id:chararray, registration_no:chararray, engine_no:chararray, chassis_no:chararray, customer_id:int, age:int, first_name:chararray, last_name:chararray, street:chararray, address:chararray, city:chararray, zip:long, gender:chararray, claim_date:chararray, garage_city:chararray, bill_no:long, claim_amount:chararray, garage_name:chararray, claim_status:chararray);
policy_master = LOAD '/user/cloudera/pdp/datasets/data_validation/automobile_policy_master.csv' USING PigStorage(',') AS (policy_master_id:chararray, model:int, make:chararray, price:double, premium:float);

/*
Remove duplicate tuples from the relation claims to ensure that the data meets unique constraint
*/
claims_distinct = DISTINCT claims;

/*
Invoke the custom DataTypeValidationUDF with the parameter claim_amount.
The UDF returns the tuples where claim_amount does not match the specified data type (double), these values are considered as invalid.
Invalid values are stored in the relation invalid_claims_amt
*/
claim_distinct_claim_amount = FOREACH claims_distinct GENERATE claim_amount AS claim_amount;
invalid_c_amount = FOREACH claim_distinct_claim_amount GENERATE DataTypeValidationUDF(claim_amount) AS claim_amount;
invalid_claims_amt = FILTER invalid_c_amount BY claim_amount IS NOT NULL;

/*
Filter invalid values from the relation claims_distinct and segregate the valid and invalid claim amount
*/
valid_invalid_claims_amount_join = JOIN invalid_claims_amt BY claim_amount RIGHT, claims_distinct BY claim_amount;
valid_claims_amount = FILTER valid_invalid_claims_amount_join BY $0 IS NULL;
invalid_claims_amount = FILTER valid_invalid_claims_amount_join BY $0 IS NOT NULL;

/*
For each invalid_claims_amount, generate all the values and specify the reason for considering these values as invalid
*/
invalid_datatype_claims = FOREACH invalid_claims_amount GENERATE $1 AS claim_id, $2 AS policy_master_id, $3 AS registration_no, $4 AS engine_no, $5 AS chassis_no, $6 AS customer_id, $7 AS age, $8 AS first_name, $9 AS last_name, $10 AS street, $11 AS address, $12 AS city, $13 AS zip, $14 AS gender, $15 AS claim_date, $16 AS garage_city, $17 AS bill_no, $18 AS claim_amount, $19 AS garage_name, $20 AS claim_status, 'Invalid Datatype for claim_amount' AS reason;

valid_datatype_claims = FOREACH valid_claims_amount GENERATE $1 AS claim_id, $2 AS policy_master_id, $3 AS registration_no, $4 AS engine_no, $5 AS chassis_no, $6 AS customer_id, $7 AS age, $8 AS first_name, $9 AS last_name, $10 AS street, $11 AS address, $12 AS city, $13 AS zip, $14 AS gender, $15 AS claim_date, $16 AS garage_city, $17 AS bill_no, $18 AS claim_amount, $19 AS garage_name, $20 AS claim_status;

/*
Compute quantiles using Datafu's Quantile UDF
*/
groupd = GROUP valid_datatype_claims ALL;
quantiles = FOREACH groupd {
  sorted = ORDER valid_datatype_claims BY age;
  GENERATE Quantile(sorted.age) AS quant;
}

/*
Check for occurrence of null values for the column Age which is a numerical field and for city which is a categorical field.
The nulls in age column are replaced with median and the nulls in city column are replaced with a constant string XXXXX.
*/
claims_replaced_nulls = FOREACH valid_datatype_claims GENERATE $0,$1 ,$2 , $3 ,$4 , $5 ,(int) ($6 is null ? FLOOR(quantiles.quant.quantile_0_5) : $6) AS age, $7, $8 ,$9 , $10 ,($11 is null ? 'XXXXX' : $11) AS city, $12, $13 , $14 , $15 ,$16 ,(double)$17 , $18 ,$19;

/*
Ensure Referential integrity by checking if the policy_master_id in the claims dataset is present in the master dataset.
The values in the claims dataset that do not find a match in the master dataset are considered as invalid values and are removed.
*/
referential_integrity_check = JOIN claims_replaced_nulls BY policy_master_id, policy_master BY policy_master_id;
referential_integrity_invalid_data = JOIN policy_master BY policy_master_id RIGHT, claims_replaced_nulls BY policy_master_id;
referential_check_invalid_claims = FILTER referential_integrity_invalid_data BY $0 IS NULL;

/*
For each referential_check_invalid_claims, generate all the values and specify the reason for considering these values as invalid
*/
invalid_referential_claims = FOREACH referential_check_invalid_claims GENERATE $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, (chararray) $22, $23, $24, 'Referential check Failed for policy_master_id' AS reason;

/*
Perform Range validation by checking if the values in the claim_amount column are within a range of 7% to 65% of the price in the master dataset.
The values that fall outside the range are considered as invalid values and are removed.
*/
referential_integrity_valid_claims = FILTER referential_integrity_check BY (claims_replaced_nulls::claim_amount >= (policy_master::price * 7 / 100) AND claims_replaced_nulls::claim_amount <= (policy_master::price * 65 / 100));
valid_claims = FOREACH referential_integrity_valid_claims GENERATE $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19;
invalid_range = FILTER referential_integrity_check BY (claims_replaced_nulls::claim_amount < (policy_master::price * 7 / 100) OR claims_replaced_nulls::claim_amount > (policy_master::price * 65 / 100));

/*
For each invalid_range, generate all the values and specify the reason for considering these values as invalid
*/
invalid_claims_range = FOREACH invalid_range GENERATE $0, $1 ,$2 ,$3 ,$4 , $5 ,$6, $7, $8 ,$9 , $10 , $11, $12, $13 , $14 , $15 ,$16 ,(chararray)$17 , $18 ,$19,'claim_amount not within range' AS reason;

/*
Combine all the relations containing invalid values. 
*/
invalid_claims = UNION invalid_datatype_claims, invalid_referential_claims, invalid_claims_range;

/*
The results are stored on the HDFS in the directories valid_data and invalid_data
The values that are not meeting the constraints are written to a file in the folder invalid_data. This file has an additional column specifying the reason for elimination of the record, this can be used for further analysis.
*/
STORE valid_claims INTO '/user/cloudera/pdp/output/data_validation_cleansing/constraints_validation_cleansing/valid_data';
STORE invalid_claims INTO '/user/cloudera/pdp/output/data_validation_cleansing/constraints_validation_cleansing/invalid_data';

Results

The following is a snippet of the original dataset; some columns have been removed to improve readability.

claim_id,policy_master_id,cust_id,age,city,claim_date,claim_amount
A123B39,A213,39,34,Maryland,5/13/2012,147157
A123B39,A213,39,34,Maryland,5/13/2012,147157
A123B13,A224,13,,Minnesota,2/18/2012,8751.24
A123B70,A224,70,59,,4/2/2012,8751.24
A123B672,A285AC,672,52,Las Vegas,10/19/2012,7865.73
A123B726,A251ext,726,26,Las Vegas,4/6/2013,4400
A123B21,A214,21,41,Maryland,2/28/2009,1230000000
A123B40,A214,40,35,Austin,6/30/2009,29500  
A123B46,A220,46,32,Austin,12/29/2011,13986 Amount
A123B20,A213,20,42,Redmond,5/18/2013,147157 Price
A123B937,A213,937,35,Minnesota,9/27/2009,147157

The following is the result of applying this pattern to the dataset:

Valid data

A123B39,A213,39,34,Maryland,5/13/2012,147157
A123B13,A224,13,35,Minnesota,2/18/2012,8751.24
A123B70,A224,70,59,XXXXX,4/2/2012,8751.24
A123B937,A213,937,35,Minnesota,9/27/2009,147157

Invalid data

A123B672,A285AC,672,52,Las Vegas,10/19/2012,7865.73,Referential check Failed for policy_master_id
A123B726,A251ext,726,26,Las Vegas,4/6/2013,4400,Referential check Failed for policy_master_id
A123B21,A214,21,41,Maryland,2/28/2009,1230000000,claim_amount not within range
A123B40,A214,40,35,Austin,6/30/2009,29500,claim_amount not within range
A123B46,A220,46,32,Austin,12/29/2011,13986 Amount,Invalid Datatype for claim_amount
A123B20,A213,20,42,Redmond,5/18/2013,147157 Price,Invalid Datatype for claim_amount

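The resulting data is classified into valid and invalid data. In the preceding results, the duplicate record with claim_id A123B39 is removed. For the record with claim_id A123B13, the null value of age is replaced by 35 (the median), and for the record with claim_id A123B70, the null value of city is replaced by XXXXX. The valid data also contains the list of records that satisfy the data type, range, and referential integrity constraints on the claim_amount and policy_master_id columns. The invalid data is written to a file in the invalid_data folder; the last column of the file states the reason why the record was considered invalid.

Additional information

The complete code and datasets for this section are in the following GitHub directories: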

Chapter4/code/
Chapter4/datasets/

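The regex validation and cleansing design pattern

This design pattern deals with validating data using regular expression functions. Regex functions can be used to validate that data matches a specific length or pattern, and to cleanse invalid data.

Background

This design pattern discusses ways to use regex functions to identify and cleanse data that has invalid field lengths. The pattern also identifies all values in the data that have a specified date format and removes the values that do not conform to it.

Motivation

Identifying string data with incorrect length is one of the quickest ways to tell whether the data is accurate. Often, the string length is all we need to judge the data without going deeper into it. This is useful in cases where a string must have an upper limit on its length; for example, a US state code is limited to a length of two.

Finding all string values that match a date pattern is one of the most common data transformations. Dates are prone to be expressed in multiple ways (DD/MM/YY, MM/DD/YY, YYYY/MM/DD, and so on). The transformation involves finding occurrences of these patterns and standardizing all of these formats into the uniform date format required by the business rules.

Use cases

Regular expressions are used in cases that require exact or partial matching of strings. The following are a few common use cases when performing Extract, Transform, and Load (ETL):

String length and pattern validation: Regular expressions are used to validate data when its structure follows a standard format and it must match a specified length. For example, a regex can validate that a field starts with a letter followed by three digits.
Filtering fields that do not match a specific pattern: This can be used in the cleansing phase if your business case requires you to eliminate data that does not match a specific pattern; for example, filtering records whose date does not match a predefined format.
Splitting strings into tokens: Unstructured text can be parsed and split into tokens using regular expressions. A common example is splitting text into words using the \s token, which means splitting on whitespace. Another use is splitting a string with a pattern to obtain a prefix or suffix; for example, extracting the value 100 from the string "100 dollars".
Extracting data that matches a pattern: This is useful for extracting text that matches a pattern from a huge file. Preprocessing of files is one example; you can form a regular expression to extract request or response patterns from a huge log and analyze the extracted data further.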
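The tokenizing and prefix extraction use cases can be sketched with the standard java.util.regex package. This is a minimal sketch; the class name, the method names, and the exact pattern used for the leading number are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtractSketch {
    // A run of digits at the start of the text, followed by whitespace and a word.
    private static final Pattern LEADING_NUMBER = Pattern.compile("^(\\d+)\\s+\\w+");

    // Split text into words on runs of whitespace (the \s token).
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    // Extract the numeric prefix, for example 100 from "100 dollars".
    static int leadingNumber(String text) {
        Matcher matcher = LEADING_NUMBER.matcher(text);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No leading number in: " + text);
        }
        return Integer.parseInt(matcher.group(1));
    }

    public static void main(String[] args) {
        System.out.println(words("split  this text").length);
        System.out.println(leadingNumber("100 dollars"));
    }
}
```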

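Pattern implementation

This design pattern is implemented in Pig as a standalone script. The script loads the data and validates it against regular expression patterns.

String pattern and length: The script validates that the values in the policy_master_id column match a predefined length and pattern. Values that do not match the pattern or length are removed.
Date format: The script validates that the values in the claim_date column match the MM/DD/YYYY date format; records with an invalid date format are filtered out.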
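Both checks can be tried in isolation with Java's regex engine, which applies whole-string matching in the same way as Pig's MATCHES operator. The following is a sketch only: the class name is hypothetical, and the patterns are written from the description above (a letter followed by three digits, and an MM/DD/YYYY date).

```java
import java.util.regex.Pattern;

final class ClaimRegexSketch {
    // One letter followed by exactly three digits.
    static final Pattern POLICY_ID = Pattern.compile("[a-zA-Z][0-9]{3}");

    // Month 1-12, day 1-31, four-digit year; leading zeros are optional.
    static final Pattern CLAIM_DATE =
            Pattern.compile("(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01])/\\d{4}");

    static boolean validPolicyId(String value) {
        return value != null && POLICY_ID.matcher(value).matches();
    }

    static boolean validClaimDate(String value) {
        return value != null && CLAIM_DATE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(validPolicyId("A213") + " " + validPolicyId("A285AC"));
        System.out.println(validClaimDate("1/5/2011") + " " + validClaimDate("6/37/2012"));
        // The regex checks shape, not calendar validity: 2/31/2012 still passes.
        System.out.println(validClaimDate("2/31/2012"));
    }
}
```

A regex of this kind only constrains the shape of the value; rejecting impossible dates such as February 31 would need a calendar-aware check in addition.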

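Code snippets

To illustrate how this pattern works, we consider an automobile insurance claims dataset stored on HDFS that contains two files. automobile_policy_master.csv is the master file; it contains a unique ID, vehicle details, price, and the premium paid. The master file is used to validate the data in the claims file. The automobile_insurance_claims.csv file contains automobile insurance claims data, specifically vehicle repair charge claims; it includes attributes such as CLAIM_ID, POLICY_MASTER_ID, VEHICLE_DETAILS, and CLAIM_DETAILS. For this pattern, we perform regex validation and cleansing on POLICY_MASTER_ID and CLAIM_DATE, as shown in the following code: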

/*
Load automobile insurance claims dataset into the relation claims
*/
claims = LOAD '/user/cloudera/pdp/datasets/data_validation/automobile_insurance_claims.csv' USING PigStorage(',') AS (claim_id:chararray, policy_master_id:chararray, registration_no:chararray, engine_no:chararray, chassis_no:chararray, customer_id:int, age:int, first_name:chararray, last_name:chararray, street:chararray, address:chararray, city:chararray, zip:long, gender:chararray, claim_date:chararray, garage_city:chararray, bill_no:long, claim_amount:chararray, garage_name:chararray, claim_status:chararray);

/*
Validate the values in the column policy_master_id with a regular expression to match the pattern where the value should start with an alphabet followed by three digits.
The values that do not match the pattern or length are considered as invalid values and are removed.
*/
valid_policy_master_id = FILTER claims BY policy_master_id MATCHES '[a-zA-Z][0-9]{3}';

/*
Invalid values are stored in the relation invalid_length
*/
invalid_policy_master_id = FILTER claims BY NOT (policy_master_id MATCHES '[a-zA-Z][0-9]{3}');
invalid_length = FOREACH invalid_policy_master_id GENERATE $0,$1 ,$2 , $3 ,$4 , $5 ,$6 , $7, $8 ,$9 , $10 , $11, $12, $13 ,$14 , $15 , $16 ,$17 , $18 ,$19,'Invalid length or pattern for policy_master_id' AS reason;

/*
Validate the values in the column claim_date to match MM/DD/YYYY format, also validate the values given for MM and DD to fall within 01 to 12 for month and 01 to 31 for day
The values that do not match the pattern are considered as invalid values and are removed.
*/
valid_claims = FILTER valid_policy_master_id BY( claim_date MATCHES '^(0?[1-9]|1[0-2])[\\/](0?[1-9]|[12][0-9]|3[01])[\\/]\\d{4}$');

/*
Invalid values are stored in the relation invalid_date
*/
invalid_dates = FILTER valid_policy_master_id BY NOT( claim_date MATCHES '^(0?[1-9]|1[0-2])[\\/](0?[1-9]|[12][0-9]|3[01])[\\/]\\d{4}$');
invalid_date = FOREACH invalid_dates GENERATE $0, $1 ,$2 , $3 ,$4 , $5 ,$6 , $7, $8 ,$9 , $10 , $11, $12, $13 , $14 , $15 ,$16 ,$17 , $18 ,$19,'Invalid date format for claim_date' AS reason;

/*
Combine the relations that contain invalid values. 
*/
invalid_claims = UNION invalid_length,invalid_date;

/*
The results are stored on the HDFS in the directories valid_data and invalid_data
The invalid values are written to a file in the folder invalid_data. This file has an additional column specifying the reason for elimination of the record, this can be used for further analysis.
*/
STORE valid_claims INTO '/user/cloudera/pdp/output/data_validation_cleansing/regex_validation_cleansing/valid_data';
STORE invalid_claims INTO '/user/cloudera/pdp/output/data_validation_cleansing/regex_validation_cleansing/invalid_data';

Results

The following is a snippet of the original dataset; some columns have been removed to improve readability.

claim_id,policy_master_id,cust_id,age,city,claim_date,claim_amount
A123B1,A290,1,42,Minnesota,1/5/2011,8211
A123B672,A285AC,672,52,Las Vegas,10/19/2012,7865.73
A123B726,A251ext,726,26,Las Vegas,4/6/2013,4400
A123B2,A213,2,35,Redmond,1/22/2009,147157
A123B28,A221,28,19,Austin,6/37/2012,31930.2
A123B888,A247,888,49,Las Vegas,21/20/2012,873
A123B3,A214,3,23,Maryland,7/8/2011,8400

The following is the result of applying this pattern to the dataset:

Valid data

A123B1,A290,1,42,Minnesota,1/5/2011,8211
A123B2,A213,2,35,Redmond,1/22/2009,147157
A123B3,A214,3,23,Maryland,7/8/2011,8400

Invalid data

A123B672,A285AC,672,52,Las Vegas,10/19/2012,7865.73,Invalid length or pattern for policy_master_id
A123B726,A251ext,726,26,Las Vegas,4/6/2013,4400,Invalid length or pattern for policy_master_id
A123B28,A221,28,19,Austin,6/37/2012,31930.2,Invalid date format for claim_date
A123B888,A247,888,49,Las Vegas,21/20/2012,873,Invalid date format for claim_date

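As shown previously, the resulting data is classified into valid and invalid data. The valid data contains the list of records that match the regex patterns used to validate policy_master_id and claim_date. The invalid data is written to a file in the invalid_data folder; the last column of the file states the reason why the record was considered invalid. We chose to filter out the invalid data; however, the cleansing technique depends on the business case, which may require the invalid data to be transformed into valid data instead.

Additional information

The complete code and datasets for this section are in the following GitHub directories: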

Chapter4/code/
Chapter4/datasets/

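The corrupt data validation and cleansing design pattern

This design pattern discusses data corruption from the perspective of corrupt data being treated as noise or outliers. The techniques for identifying and removing corrupt data are discussed in detail.

Background

This design pattern explores the use of Pig to validate and cleanse corrupt data in a dataset. It sets the context of data corruption in various Big Data sources, ranging from sensors to structured data. The design pattern looks at data corruption from two perspectives, noise and outliers, as follows:

Noise can be defined as a random error in measurement that causes corrupt data to be ingested along with the correct data. The error is not too far from the expected value.
Outliers are also a kind of noise, but the error value is too far from the expected value. Outliers can have a great impact on the accuracy of the analysis. They are often measurement or recording errors. Some of them can represent interesting phenomena that are highly significant from the perspective of the application domain, which implies that not all outliers should be eliminated.

Corrupt data can manifest as noise or as outliers; the major difference between the two is the degree of variation from the expected value. Noisy data varies to a lesser degree and its values are closer to the original data, whereas outliers vary to a greater degree and their values are far from the original data. To illustrate, in the following set of numbers, 4 can be considered noise and 21 an outlier:

A = [1, 2, 1, 4, 1, 2, 1, 1, 1, 2, 21, 1, 2, 2]

Data corruption increases the amount of work needed to perform analytics and also adversely affects the accuracy of data mining analysis. The following points illustrate a few sources of data corruption applicable to Big Data:

Data corruption in sensor data: Sensor data has become one of the biggest sources of volume among the wide range of Big Data sources. Sensors typically generate huge amounts of data over long periods, which leads to various computational challenges owing to inaccuracies in the data. Hadoop is widely used to mine patterns in longitudinal sensor data, and one of the biggest challenges in this process is the natural error and incompleteness of sensor data. A sensor's battery life can be inadequate, and many sensors may be unable to send accurate data for long periods, thus corrupting the data.
Data corruption in structured data: In the context of structured Big Data, any data, numerical or categorical, can be considered corrupt if it is stored in Hadoop in such a way that it cannot be read or used by any of the data processing routines written for it.

Motivation

Corrupt data can be validated and cleansed by applying appropriate noise and outlier detection techniques. The common techniques used for this purpose are binning, regression, and clustering, as follows:

Binning: This technique can be used to identify noise and outliers, and also to remove noisy data by applying smoothing functions. Binning works by sorting the values and partitioning them into a set of bins, using either equal-frequency or equal-width partitions. To smooth the data (or remove noise), the original data values in a given partition or bin are replaced by the mean or median of that partition. In the current design pattern, we explain the applicability of binning to remove noise.
Regression: This is a technique that fits data values to a function. Regression can be used to remove noise by identifying the regression function and removing all data values that lie far from the value the function predicts. Linear regression finds the "best fit" linear function relating two variables so that one variable can be used to predict the other. Multiple linear regression is similar, but involves more than two variables, with the data fitted to a multidimensional surface.
Clustering: Outlier analysis can be performed by clustering, which groups similar values together to find the values that fall outside the clusters and may be considered outliers. A cluster contains values that are similar to other values in the same cluster but dissimilar to values in other clusters. A cluster of values can be treated as a group and compared with other clusters at a macro level.

Another method of finding outliers is to compute the interquartile range (IQR). In this method, the three quartiles (Q1, Q2, and Q3) are first computed from the values. The quartiles divide the values into four equal groups, each containing a quarter of the data. The upper and lower fences are computed from the quartiles, and any value above the upper fence or below the lower fence is considered an outlier. The fences are a guideline that defines the range outside which values are outliers. In the current design pattern, we use this method to detect outliers.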

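Use cases

You can consider using this design pattern to cleanse corrupt data by removing noise and outliers. This design pattern helps you understand how to classify data as noise or as an outlier and then remove it.

Pattern implementation

This design pattern is implemented as a standalone Pig script that uses the third-party library datafu.jar. The script identifies and removes noise and outliers.

The binning technique is used to identify and remove noise. In binning, the values are sorted and distributed into a number of bins. The minimum and maximum values of each bin are identified and set as the bin boundaries, and each bin value is replaced by the nearest bin boundary value. This method is called smoothing by bin boundaries. To identify outliers, we use the standard box plot rule; it finds outliers based on the upper and lower quartiles of the data distribution. Q1 and Q3 of the data distribution and their interquartile distance (IQD = Q3 - Q1) are calculated, and (Q1 - c * IQD, Q3 + c * IQD) gives the range in which the data should fall. Here, c is a constant with a value of 1.5. Values outside this range are considered outliers. The script uses the DataFu library to calculate the quartiles.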
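Smoothing by bin boundaries can be illustrated on a small sorted array. The sketch below assumes equal-frequency bins of a fixed size and sends ties to the lower boundary; the class name and the sample values are hypothetical and are not taken from the claims dataset.

```java
import java.util.Arrays;

final class BinSmoothingSketch {
    // Input must be sorted ascending. Each value is replaced by the nearer
    // of its bin's minimum and maximum; ties go to the lower boundary.
    static int[] smooth(int[] sorted, int binSize) {
        int[] smoothed = new int[sorted.length];
        for (int start = 0; start < sorted.length; start += binSize) {
            int end = Math.min(start + binSize, sorted.length);
            int lower = sorted[start];
            int upper = sorted[end - 1];
            for (int i = start; i < end; i++) {
                boolean nearerLower = (sorted[i] - lower) <= (upper - sorted[i]);
                smoothed[i] = nearerLower ? lower : upper;
            }
        }
        return smoothed;
    }

    public static void main(String[] args) {
        int[] values = {4, 8, 15, 21, 21, 24, 25, 28, 34};
        // Bins of three values: [4, 8, 15], [21, 21, 24], [25, 28, 34]
        System.out.println(Arrays.toString(smooth(values, 3)));
    }
}
```

After smoothing, every bin holds at most two distinct values, which removes small fluctuations while preserving the overall spread of the data.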

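Code snippets

To illustrate how this pattern works, we consider an automobile insurance claims dataset stored on HDFS that contains two files. automobile_policy_master.csv is the master file; it contains a unique ID, vehicle details, price, and the premium paid. The master file is used to validate the data in the claims file. The automobile_insurance_claims.csv file contains automobile insurance claims data, specifically vehicle repair charge claims; it includes attributes such as CLAIM_ID, POLICY_MASTER_ID, VEHICLE_DETAILS, and CLAIM_DETAILS. For this pattern, we validate and cleanse corrupt data in CLAIM_AMOUNT and AGE, as shown in the following code: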

/*
Register Datafu jar file
*/
REGISTER  '/home/cloudera/pdp/jars/datafu.jar';

/*
Define alias for the UDF quantile
The parameters specify list of quantiles to compute
*/
DEFINE Quantile datafu.pig.stats.Quantile('0.25','0.50','0.75'); 

/*
Load automobile insurance claims data set into the relation claims
*/
claims = LOAD '/user/cloudera/pdp/datasets/data_validation/automobile_insurance_claims.csv' USING PigStorage(',') AS (claim_id:chararray, policy_master_id:chararray, registration_no:chararray, engine_no:chararray, chassis_no:chararray, customer_id:int, age:int, first_name:chararray, last_name:chararray, street:chararray, address:chararray, city:chararray, zip:long, gender:chararray, claim_date:chararray, garage_city:chararray, bill_no:long, claim_amount:double, garage_name:chararray, claim_status:chararray);

/*
Sort the relation claims by age
*/
claims_age_sorted = ORDER claims BY age ASC;

/*
Divide the data into equal frequency bins. 
Minimum and maximum values are identified for each bin and are set as bin boundaries.
Replace each bin value with the nearest bin boundary.
*/
bin_id_claims = FOREACH claims_age_sorted GENERATE (customer_id - 1) * 10 / (130 - 1 + 1) AS bin_id, $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19;
group_by_id = GROUP bin_id_claims BY bin_id;
claims_bin_boundaries = FOREACH group_by_id
{
  bin_lower_bound = (int) MIN(bin_id_claims.age);
  bin_upper_bound = (int) MAX(bin_id_claims.age);
  GENERATE bin_lower_bound AS bin_lower_bound, bin_upper_bound AS bin_upper_bound, FLATTEN(bin_id_claims);
};
smoothing_by_bin_boundaries = FOREACH claims_bin_boundaries GENERATE $3 AS claim_id, $4 AS policy_master_id, $5 AS registration_no, $6 AS engine_no, $7 AS chassis_no, $8 AS customer_id, (($9 - bin_lower_bound) <= (bin_upper_bound - $9) ? bin_lower_bound : bin_upper_bound) AS age, $10 AS first_name, $11 AS last_name, $12 AS street, $13 AS address, $14 AS city, $15 AS zip, $16 AS gender, $17 AS claim_date, $18 AS garage_city, $19 AS bill_no, $20 AS claim_amount, $21 AS garage_name, $22 AS claim_status;

/*
Identify outliers present in the column claim_amount by calculating the quartiles, interquartile distance and the upper and lower fences.
The values that do not fall within this range are considered as outliers and are filtered out.
*/
groupd = GROUP smoothing_by_bin_boundaries ALL;
quantiles = FOREACH groupd { 
  sorted = ORDER smoothing_by_bin_boundaries BY claim_amount;
  GENERATE Quantile(sorted.claim_amount) AS quant;
}
valid_range = FOREACH quantiles GENERATE (quant.quantile_0_25 - 1.5 * (quant.quantile_0_75 - quant.quantile_0_25)), (quant.quantile_0_75 + 1.5 * (quant.quantile_0_75 - quant.quantile_0_25));
claims_filtered_outliers = FILTER smoothing_by_bin_boundaries BY claim_amount >= valid_range.$0 AND claim_amount <= valid_range.$1;

/*
Store the invalid values in the relation invalid_claims
*/
invalid_claims_filter = FILTER smoothing_by_bin_boundaries BY claim_amount < valid_range.$0 OR claim_amount > valid_range.$1;
invalid_claims = FOREACH invalid_claims_filter GENERATE $0 ,$1 ,$2 ,$3 ,$4 ,$5 ,$6 ,$7 ,$8 ,$9 ,$10 ,$11 ,$12 ,$13 ,$14 ,$15 ,$16 ,$17 ,$18 ,$19,'claim_amount identified as Outlier' as reason;

/*
The results are stored on the HDFS in the directories valid_data and invalid_data
The invalid values are written to a file in the folder invalid_data. This file has an additional column specifying the reason for elimination of the record, this can be used for further analysis.
*/
STORE invalid_claims INTO '/user/cloudera/pdp/output/data_validation_cleansing/corrupt_data_validation_cleansing/invalid_data';
STORE claims_filtered_outliers INTO '/user/cloudera/pdp/output/data_validation_cleansing/corrupt_data_validation_cleansing/valid_data';

Results

The following is a snippet of the original dataset; some columns have been removed to improve readability.

claim_id,policy_master_id,cust_id,age,city,claim_date,claim_amount
A123B6,A217,6,42,Las Vegas,6/25/2010,-12495
A123B11,A222,11,21,,11/5/2012,293278.7
A123B2,A213,2,42,Redmond,1/22/2009,147157
A123B9,A220,9,21,Maryland,9/20/2011,13986
A123B4,A215,4,42,Austin,12/16/2011,35478

The following is the result of applying this pattern to the dataset:

Valid data

A123B6,A217,6,42,Las Vegas,6/25/2010,-12495
A123B9,A220,9,21,Maryland,9/20/2011,13986
A123B4,A215,4,42,Austin,12/16/2011,35478

Invalid data

A123B11,A222,11,21,,11/5/2012,293278.7,claim_amount identified as Outlier
A123B2,A213,2,42,Redmond,1/22/2009,147157,claim_amount identified as Outlier

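As shown previously, the resulting data is classified into valid and invalid data. The valid data contains the list of records in which the noise in the age column has been smoothed. Outliers are detected in the claim_amount column, with the lower and upper fences identified as -34929.0 and 70935.0; values that do not fall within this range are identified as outliers and written to a file in the invalid_data folder. The last column of the file states the reason why the record was considered invalid. The outliers are filtered out, and the remaining data is stored in the valid_data folder. The preceding script removes the outliers; however, this decision may vary depending on the business rules.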
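The fence arithmetic behind these figures can be reproduced with a few lines of code. In the sketch below, the quartile values Q1 = 4770 and Q3 = 31236 are an inference: they are back-computed from the fences reported above (-34929.0 and 70935.0) with c = 1.5, not read from the dataset, and the class name is hypothetical.

```java
final class IqrFenceSketch {
    // Box plot rule: (Q1 - c * IQD, Q3 + c * IQD), where IQD = Q3 - Q1.
    static double[] fences(double q1, double q3, double c) {
        double iqd = q3 - q1;
        return new double[] {q1 - c * iqd, q3 + c * iqd};
    }

    // A value is an outlier when it lies strictly outside the fences.
    static boolean isOutlier(double value, double[] fences) {
        return value < fences[0] || value > fences[1];
    }

    public static void main(String[] args) {
        // Assumed quartiles, back-computed from the reported fences.
        double[] range = fences(4770.0, 31236.0, 1.5);
        System.out.println(range[0] + " .. " + range[1]);
        System.out.println(isOutlier(147157.0, range));  // claim A123B2
        System.out.println(isOutlier(-12495.0, range));  // claim A123B6
    }
}
```

Under these assumed quartiles, the claim of -12495 stays inside the fences, which is consistent with it appearing in the valid data even though a negative amount looks suspicious; catching it would need a separate range constraint.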

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter4/code/
Chapter4/datasets/

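The unstructured text data validation and cleansing design pattern

The unstructured text validation and cleansing pattern demonstrates ways to cleanse unstructured data by applying various data preprocessing techniques.

Background

Processing huge volumes of unstructured data with Hadoop is a challenging task, since the data has to be cleansed and made ready for processing. Textual data, which includes documents, emails, text files, and chat files, is inherently unorganized and has no defined data model when it is ingested by Hadoop.

To open up unstructured data for analysis, we have to introduce something resembling a structure. The basis for organizing unstructured data is to integrate it with the existing structured data in the enterprise by performing planned and controlled cleansing transformations and flows of data within the data store, for operational and/or analytical use. Integrating the unstructured data is necessary for meaningful queries and analysis on the resultant data.

The first step after ingesting unstructured data is to discover metadata from the textual data and organize it in a way that facilitates further processing, thus removing some of the irregularities and ambiguities in the data. Creating this metadata is itself a multistep, iterative process that employs various data parsing, cleansing, and transformation techniques, ranging from simple entity extraction and semantic tagging to natural language processing using artificial intelligence algorithms.

Motivation

This design pattern demonstrates a way to cleanse an unstructured data corpus by performing preprocessing steps such as lowercase conversion, stop word removal, stemming, punctuation removal, extra space removal, number identification, and misspelling identification.

The motivation for this pattern is to understand the various inconsistencies in unstructured data and to help identify and remove them.

Unstructured data is prone to a variety of quality issues, ranging from completeness to inconsistency. The following are common cleansing steps for unstructured text:

Text can be represented using alternate spellings; for example, a name can be written in different ways. If the spellings differ but still refer to the same entity, a search for the name will not return every occurrence. To some extent, this can be considered a misspelling. Integrating and cleansing these alternate spellings so that they refer to the same entity reduces ambiguity. Effective conversion of unstructured text to a structured format requires us to take all alternate spellings into account.
Misspellings also account for many irregularities and affect the accuracy of analysis. To make the data processable, we have to identify the misspelt words and replace them with the correct ones.
Identifying numerical values in text enables us to select all the numbers. Depending on the business context, these extracted numbers can be included in or removed from further processing. Data cleansing can also be performed by extracting numerical values from text; for example, if the text contains the phrase "one thousand", it can be converted to 1000 to perform appropriate analysis.
Extracting data that matches a specific pattern using regular expressions can be a method of cleansing. For example, you can extract dates from text by specifying a pattern. If the extracted dates are not in a standard format (such as DD/MM/YY), standardizing them can be one of the cleansing activities needed to read and index the unstructured data by date.

Use cases

Consider using this design pattern to cleanse unstructured data after it has been ingested by Hadoop, by removing misspellings, punctuation, and so on.

Pattern implementation

This design pattern is implemented as a standalone Pig script, which internally uses a right outer join to remove stop words. The stop word list is first loaded into a relation from an external text file and is then used in the outer join.

The LOWER function is used to convert all words to lowercase. Punctuation is removed using the REPLACE function to match specific characters. Similarly, numbers are removed using REPLACE to match a pattern of all digits in the text.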
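The same chain of lowercasing and regex replacement can be tried on individual tokens in plain Java. This is a sketch under the assumption that the character classes used are \p{Punct}, \s, and \d; the class and method names are hypothetical.

```java
import java.util.Locale;

final class TextCleanSketch {
    // Lowercase a token, then strip punctuation, whitespace, and digits.
    static String clean(String token) {
        String lower = token.toLowerCase(Locale.ROOT);
        String noPunctuation = lower.replaceAll("\\p{Punct}", "");
        String noSpaces = noPunctuation.replaceAll("\\s", "");
        return noSpaces.replaceAll("\\d", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("Engine.[5]"));  // citation marker removed
        System.out.println(clean("Babbage's"));
        // A purely numeric token becomes empty and needs separate handling.
        System.out.println(clean("1000").isEmpty());
    }
}
```

Tokens that consist only of digits or punctuation collapse to an empty string, so a later step has to drop them or substitute a placeholder.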

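The code for identifying misspelt words uses a Bloom filter, which was included in Pig as a UDF in version 0.10.

A Bloom filter is a space-optimized data structure used to filter a smaller dataset out of a larger one by testing whether an element belonging to the smaller dataset is a member of the larger dataset. Internally, a Bloom filter implements a clever mechanism for storing each element so that it uses a constant amount of memory regardless of the size of the element, which yields drastic space savings. Although Bloom filters have a huge space advantage over other structures, the filtering is not completely accurate, since false positives are possible.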
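To make the mechanism concrete, the following is a toy Bloom filter built on a BitSet. It is an illustrative sketch, not Pig's implementation: the class name, the hashing scheme (two values derived from String.hashCode), and the sizes are all assumptions.

```java
import java.util.BitSet;

final class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th bit position for a word, derived from two base hashes.
    private int position(String word, int i) {
        int h1 = word.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    // Each word sets a fixed number of bits, whatever its length.
    void add(String word) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(word, i));
        }
    }

    // false means definitely absent; true means possibly present.
    boolean mightContain(String word) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(word, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch dictionary = new BloomSketch(1024, 3);
        dictionary.add("spelling");
        System.out.println(dictionary.mightContain("spelling"));
        // Usually false, but a false positive is possible by design.
        System.out.println(dictionary.mightContain("speling"));
    }
}
```

Because a lookup can only err in the direction of a false positive, a dictionary-based spell check built this way may let an occasional misspelt word through, but it never flags a word that was added to the dictionary.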

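Pig supports Bloom filters through BuildBloom, which builds a Bloom filter by loading and training it on the list of values loaded from a dictionary corpus stored in a Pig relation. The trained Bloom filter, which is stored in the distributed cache and transferred internally to the mapper function, is used to perform the actual filtering on the input data in a FILTER operation using the Bloom UDF. After the Bloom filter eliminates all the misspelt words, the filtered result set contains the correctly spelt words.

Code snippets

To demonstrate how this pattern works, we consider a text corpus of Wikipedia stored in a folder accessible on HDFS. This sample corpus consists of wiki pages related to computer science and information technology. A few misspelt words are deliberately introduced into the corpus to demonstrate the functionality of this pattern.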

/*
Define alias for the UDF BuildBloom.
The first parameter to BuildBloom constructor is the hashing technique to use, the second parameter specifies the number of distinct elements that would be placed in the filter and the third parameter is the acceptable rate of false positives.
*/
DEFINE BuildBloom BuildBloom('jenkins', '75000', '0.1');

/*
Load dictionary words
*/
dict_words1 = LOAD '/user/cloudera/pdp/datasets/data_validation/unstructured_text/dictionary_words1.csv' AS (words:chararray);
dict_words2 = LOAD '/user/cloudera/pdp/datasets/data_validation/unstructured_text/dictionary_words2.csv' AS (words:chararray);

/*
Load stop words
*/
stop_words_list = LOAD '/user/cloudera/pdp/datasets/data_validation/unstructured_text/stopwords.txt' USING PigStorage();
stopwords = FOREACH stop_words_list GENERATE FLATTEN(TOKENIZE($0));

/*
Load the document corpus and tokenize to extract the words
*/
doc1 = LOAD '/user/cloudera/pdp/datasets/data_validation/unstructured_text/computer_science.txt' AS (words:chararray);
docWords1 = FOREACH doc1 GENERATE FLATTEN(TOKENIZE(words)) AS word;
doc2 = LOAD '/user/cloudera/pdp/datasets/data_validation/unstructured_text/information_technology.txt' AS (words:chararray);
docWords2 = FOREACH doc2 GENERATE FLATTEN(TOKENIZE(words)) AS word;

/*
Combine the contents of the relations docWords1 and docWords2
*/
combined_docs = UNION docWords1, docWords2;

/*
Convert to lowercase, remove stopwords, punctuations, spaces, numbers.
Replace nulls with the value "dummy string"
*/
lowercase_data = FOREACH combined_docs GENERATE FLATTEN(TOKENIZE(LOWER($0))) AS word;
joind = JOIN stopwords BY $0 RIGHT OUTER, lowercase_data BY $0;
stop_words_removed = FILTER joind BY $0 IS NULL;
punctuation_removed = FOREACH stop_words_removed  
{
  replace_punct = REPLACE($1,'[\\p{Punct}]','');
  replace_space = REPLACE(replace_punct,'[\\s]','');
  replace_numbers = REPLACE(replace_space,'[\\d]','');
  GENERATE replace_numbers AS replaced_words;
}
replaced_nulls = FOREACH punctuation_removed GENERATE (SIZE($0) > 0 ? $0 : 'dummy string') AS word;

/*
Remove duplicate words
*/
unique_words_corpus = DISTINCT replaced_nulls;

/*
Combine the two relations containing dictionary words
*/
dict_words = UNION dict_words1, dict_words2;

/*
BuildBloom builds a bloom filter that will be used in Bloom.
Bloom filter is built on the relation dict_words which contains all the dictionary words.
The resulting file dict_words_bloom is used in bloom filter by passing it to Bloom.
The call to bloom returns the words that are present in the dictionary, we select the words that are not present in the dictionary and classify them as misspelt words. The misspelt words are filtered from the original dataset and are stored in the folder invalid_data.
*/
dict_words_grpd = GROUP dict_words all;
dict_words_bloom = FOREACH dict_words_grpd GENERATE BuildBloom(dict_words.words);
STORE dict_words_bloom into 'dict_words_bloom';
DEFINE bloom Bloom('dict_words_bloom');
filterd = FILTER unique_words_corpus BY NOT(bloom($0));
joind = join filterd by $0, unique_words_corpus by $0;
joind_right = join filterd by $0 RIGHT, unique_words_corpus BY $0;
valid_words_filter = FILTER joind_right BY $0 IS NULL;
valid_words = FOREACH valid_words_filter GENERATE $1;
misspellings = FOREACH joind GENERATE $0 AS misspelt_word;

/*
The results are stored on the HDFS in the directories valid_data and invalid_data.
The misspelt words are written to a file in the folder invalid_data.
*/
STORE misspellings INTO '/user/cloudera/pdp/output/data_validation_cleansing/unstructured_data_validation_cleansing/invalid_data';
STORE valid_words INTO '/user/cloudera/pdp/output/data_validation_cleansing/unstructured_data_validation_cleansing/valid_data';

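Results

The following words are identified as misspellings and stored in the invalid_data folder. We chose to filter these words out of the original dataset. However, this depends on the business rules; if the business rules require that misspelt words be replaced with their correct spellings, appropriate steps must be taken to correct them.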

sme
lemme
puttin
speling
wntedly
mistaces
servicesa
insertingg
missspellingss
telecommunications

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter4/code/
Chapter4/datasets/

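Summary

In this chapter, you studied various Big Data validation and cleansing techniques that detect and cleanse incorrect or inaccurate records in the data. These techniques ensure that inconsistent data is identified, by validating the data against a set of rules, before it is used in the analytics process; the inconsistent data is then replaced, modified, or deleted according to the business rules to make it more consistent. This chapter builds on what we learned about data profiling in the previous chapter.

In the next chapter, we will focus on data transformation patterns, which can be applied to a variety of data formats. After reading that chapter, the reader will be able to choose the right pattern to transform data using techniques such as aggregation, generalization, and joins.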

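5. Data Transformation Patterns

In the previous chapter, you learned about various patterns related to data validation and cleansing, and saw that there are many ways to detect and remove incorrect or inaccurate records from data. When data validation and cleansing is complete, inconsistencies in the data are identified even before the data is used in the next steps of the analytics life cycle. The inconsistent data is then replaced, modified, or deleted to make it more consistent.

In this chapter, you will learn about various design patterns related to data transformation, such as the structured-to-hierarchical, normalization, integration, aggregation, and generalization design patterns.

Data transformation processes

The data transformation process is one of the fundamental building blocks, and a vital step, of the knowledge discovery process of Big Data analytics. Data transformation is an iterative process that modifies the source data into a format that enables analytics algorithms to be applied effectively. By ensuring that data is stored and retrieved in a format conducive to applying analytics, transformation improves the performance and accuracy of the algorithms. This is achieved by improving the overall quality of the source data.

Data transformation for Big Data primarily consists of the following major processes:

Normalization: This transformation scales the attribute data to bring it within a specified range. Typically, an attribute value is transformed to fit a range between 0 and 1. This is done to remove the unwanted effects of certain attributes on the analytics. The normalization transformation is different from the first, second, and third normal forms used in relational database design.
Aggregation: This transformation performs summarization operations on the data, such as computing monthly and yearly summaries from daily stock data, thus creating a data cube for analysis at multiple dimensions and granularities.
Generalization: In this transformation, low-level raw data is replaced with higher-level abstractions using concept hierarchies. For example, low-level data such as street can be replaced with higher-level abstractions such as city or state, depending on the analytics use case.
Data integration: This is the process of joining data from multiple input pipelines of similar or dissimilar structure into a single output pipeline.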
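As a concrete instance of the normalization step, min-max scaling maps each value v to (v - min) / (max - min), so that the smallest value becomes 0 and the largest becomes 1. The following is a minimal sketch; the class name and the choice to map a constant column to 0.0 are assumptions.

```java
import java.util.Arrays;

final class MinMaxSketch {
    // Scale values into [0, 1]; a constant column maps to all zeros.
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double span = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = (span == 0.0) ? 0.0 : (values[i] - min) / span;
        }
        return scaled;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize(new double[] {10, 20, 30})));
    }
}
```

Scaling in this way keeps an attribute with a large numeric range from dominating attributes with smaller ranges in distance-based analytics.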

以下部分详细描述了最常用的 Pig 设计模式，这些模式有助于数据转换。

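Structured-to-hierarchical transformation pattern

The structured-to-hierarchical transformation pattern deals with transforming data by generating a hierarchical structure, such as XML or JSON, from structured data.

Background

The structured-to-hierarchical pattern creates a new hierarchical structure, such as JSON or XML, from data stored in a flat, row-like structure. It is a data transformation pattern that creates new records, which are represented in a different structure from the original records.

Motivation

Hadoop is good at integrating data from multiple sources, but performing joins just in time for analytics is always a complex and time-consuming operation.

To perform certain types of analytics efficiently (log file analysis, for example), the data sometimes need not be stored in Hadoop in a normalized form. Storing normalized data in multiple tables creates an extra step of joining all the data together before it can be analyzed; joins are typically performed on normalized structured data to integrate data from multiple sources that have a foreign-key relationship.

Instead, the raw data is nested hierarchically by denormalizing it. This preprocessing of the data ensures efficient analytics.

NoSQL databases such as HBase, Cassandra, or MongoDB facilitate storing flat data in column families or JSON objects. Hadoop can be used to integrate data from multiple sources in batch mode and to create hierarchical data structures that can be readily inserted into these databases.

Use cases

This design pattern is primarily applicable to integrating structured, row-based data from multiple disparate data sources. The specific objective of this integration is to convert the data into a hierarchical structure so that it can be analyzed.

The pattern is also useful for converting data from a single source into a hierarchical structure and then using that structure to load the data into columnar and JSON databases.

Pattern implementation

Pig has out-of-the-box support for hierarchical data in the form of tuples and bags, which are used to represent nested objects in a single row. The COGROUP operator groups the data in one or more relations and creates a nested representation of the output tuples.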
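
To make the nested output of COGROUP more concrete, the following is a minimal Python sketch of the same idea, not Pig's implementation. The function name cogroup and the sample rows are hypothetical; each output tuple holds the key followed by one bag (list) of matching rows per input relation.

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Group two lists of dict rows by `key`, keeping each side in its own bag."""
    grouped = defaultdict(lambda: ([], []))
    for row in left:
        grouped[row[key]][0].append(row)
    for row in right:
        grouped[row[key]][1].append(row)
    # One output tuple per key: (key, bag_of_left_rows, bag_of_right_rows)
    return [(k, bags[0], bags[1]) for k, bags in sorted(grouped.items())]

# Hypothetical rows, loosely modeled on the manufacturing dataset
units = [{"unit_id": "1", "name": "unit1"}, {"unit_id": "2", "name": "unit2"}]
production = [
    {"unit_id": "1", "product_id": "C001", "qty": 49},
    {"unit_id": "1", "product_id": "C002", "qty": 30},
]
nested = cogroup(units, production, "unit_id")
```

Unit 2 has no production rows in this sample, so its second bag is empty; the key still appears in the output, which is what distinguishes this grouping from an inner join.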

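This design pattern is implemented in Pig as a standalone script. The script demonstrates the pattern by generating a hierarchical representation of data from a structured format. It loads a denormalized CSV file and passes it to a custom Java UDF. The Java UDF uses XMLParser to build an XML file, and a custom storage function stores the result in XML format.

Code snippets

To illustrate how this pattern works, we consider a manufacturing dataset stored on HDFS. The file production_all.csv contains denormalized data derived from production.csv and manufacturing_units.csv. We will convert the structured data from CSV format into a hierarchical XML format.

The Pig script for the structured-to-hierarchical design pattern is as follows: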

/*
Register the piggybank jar and generateStoreXml jar, it is a custom storage function which generates an XML representation and stores it
*/
REGISTER '/home/cloudera/pdp/jars/generateStoreXml.jar';
REGISTER '/usr/share/pig/contrib/piggybank/java/piggybank.jar';

/*
Load the production dataset into the relation production_details
*/
production_details = LOAD'/user/cloudera/pdp/datasets/data_transformation/production_all.csv' USING  PigStorage(',') AS(production_date,production_hours,manufacturing_unit_id,manufacturing_unit_name,currency,product_id,product_name,quantity_produced);

/*
Call the custom store function TransformStoreXML to transform the contents into a hierarchical representation i.e XML and to store it in the directory structured_to_hierarchical
*/
STORE production_details INTO '/user/cloudera/pdp/output/data_transformation/structured_to_hierarchical' USING com.xmlgenerator.TransformStoreXML('production_details','production_data');

The following is a snippet of the Java UDF that the preceding Pig script uses to perform the structured-to-hierarchical transformation:

  /**
   * data from tuple is appended to xml root element
   * @param tuple
   */
  protected void write(Tuple tuple)
  {
    // Retrieving all fields from the schema
    ResourceFieldSchema[] fields = schema.getFields();
    //Retrieve values from tuple
    List<Object> values = tuple.getAll();
    /*Creating xml element by using fields as element tag 
and tuple value as element value*/
    Element transactionElement =xmlDoc.createElement(TransformStoreXML.elementName);
    for(int counter=0;counter<fields.length;counter++)
    {
      //Retrieving element value from values
      String columnValue = String.valueOf(values.get(counter));
      //Creating element tag from fields
      Element columnName = xmlDoc.createElement(fields[counter].getName().toString().trim());
      //Appending value to element tag

      columnName.appendChild(xmlDoc.createTextNode(columnValue));
      //Appending element to transaction element
        transactionElement.appendChild(columnName);    
    }
    //Appending transaction element to root element
    rootElement.appendChild(transactionElement);
  }

Results

The following is a snippet of the XML file generated by running the code on the input:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<production_details>
  <production_data>
    <production_date>2011-01-01T00:00:00</production_date>
    <production_hours>7</production_hours>
    <manufacturing_unit_id>1</manufacturing_unit_id>
    <manufacturing_unit_name>unit1</manufacturing_unit_name>
    <currency>USD</currency>
    <product_id>C001</product_id>
    <product_name>Refrigerator 180L</product_name>
    <quantity_produced>49</quantity_produced>
  </production_data>
  <production_data>
    .
    .
    .

  </production_data>
  .
  .
  .
</production_details>
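
The row-to-element mapping performed by the storage function can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. This is an illustration under assumptions, not the book's Java UDF: the helper name rows_to_xml and the shortened field list are hypothetical.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(field_names, rows, root_name, record_name):
    """Wrap each flat row in a record element with one child element per field."""
    root = ET.Element(root_name)
    for row in rows:
        record = ET.SubElement(root, record_name)
        for name, value in zip(field_names, row):
            # The field name becomes the tag, the field value becomes the text
            ET.SubElement(record, name.strip()).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical subset of the production_all.csv columns
fields = ["production_date", "production_hours", "manufacturing_unit_id", "product_id"]
rows = [("2011-01-01T00:00:00", 7, "1", "C001")]
xml_text = rows_to_xml(fields, rows, "production_details", "production_data")
```

As in the Pig script, the root and record element names are passed in as arguments rather than hard-coded.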

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter5/code/
Chapter5/datasets/

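Data normalization pattern

The data normalization design pattern discusses ways to normalize or standardize data values.

Background

Data normalization means fitting, adjusting, or scaling data values measured on different scales to a notionally common range. As a simple example, joining datasets that use different units of distance, such as kilometers and miles, can give different results if the values are not normalized. They are therefore normalized to a common unit, such as kilometers or miles, so that the difference in units does not affect the analysis.

Motivation

When integrating multiple data sources in Big Data, it is common to encounter different values for the same data attribute.

Data is preprocessed and transformed by normalizing the raw data and scaling it to a specified range (for example, 0 to 1), which assigns equal weight to all attributes. Any outliers are removed from the data before it is normalized. Normalization is needed when the analysis could be affected by the choice of measurement units and by values in a higher range.

Normalization is used in analytics such as clustering, a distance-based method, where it prevents attributes with larger values from dominating attributes with smaller values. The techniques for normalizing numeric and non-numeric data are as follows: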

Normalizing numeric data: Numeric data is normalized using methods such as min-max normalization, thus transforming the data to a value between a specified range [newMin, newMax]. The minValue and maxValue are usually identified from the dataset and the normalization is done by applying the following formula for each value:

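Normalized value = ((value - minimum) / (maximum - minimum)) * (new maximum - new minimum) + new minimum

Normalizing non-numeric data: Non-numeric data is first converted to numeric data and then normalized. For example, if a rating attribute can take the values excellent, very good, good, average, below average, poor, or worst, these can be converted to numeric values from 1 to 7, which can then be normalized to fit the model.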
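
As a worked illustration of the min-max formula and of the rating example, here is a small Python sketch. The function name min_max_normalize and the rating-to-number mapping are assumptions made for the example; the text only says that the seven ratings map to the values 1 to 7.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values into [new_min, new_max] using min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

# Hypothetical ordinal encoding of the rating attribute (1 = worst, 7 = excellent)
RATING_CODES = {
    "worst": 1, "poor": 2, "below average": 3, "average": 4,
    "good": 5, "very good": 6, "excellent": 7,
}
ratings = ["worst", "average", "excellent"]
normalized = min_max_normalize([RATING_CODES[r] for r in ratings])
```

With the ratings above, the codes 1, 4, and 7 map to 0.0, 0.5, and 1.0.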

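Use cases

You can consider using this design pattern as a preprocessing technique for analytics. It can be used in analytics use cases to keep attributes that initially have a high range from outweighing attributes that initially have a lower range.

This pattern can also be seen as a way of encapsulating the original data, since normalization transforms it.

Pattern implementation

This design pattern is implemented in Pig as a standalone script. The use case identifies similar manufacturing units for a given product, and in doing so demonstrates normalization. The script loads the data and calculates the total produced quantity and total production hours of product C001 for each manufacturing unit. Each manufacturing unit is thus represented by the product, the total produced quantity, and the total production hours. The script normalizes the total produced quantity and total production hours using min-max normalization so that all values are on the same scale (ranging from 0 to 1). It then calculates the Euclidean distance between these points. The smaller the distance, the more similar the manufacturing units.

Code snippets

To illustrate how this pattern works, we consider a manufacturing dataset stored on HDFS. The file production.csv contains the production information of each manufacturing unit, with attributes such as production_date, production_hours, manufacturing_unit_id, product_id, and produced_quantity. We will calculate the total produced quantity and total production hours of product C001 for each manufacturing unit, as shown in the following code: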

/*
Load the production dataset into the relation production
*/
production = LOAD'/user/cloudera/pdp/datasets/data_transformation/production.csv' USING PigStorage(',') AS(production_date:datetime,production_hours:int,manufacturing_unit_id:chararray,product_id:chararray,produced_quantity:int);

/*
Filter the relation production to fetch the records with product id C001
*/
production_filt = FILTER production BY product_id=='C001';

/*
Calculate the total production hours and total produced quantity of product C001 in each manufacturing unit
*/
production_grpd = GROUP production_filt BY(manufacturing_unit_id,product_id);
production_sum = FOREACH production_grpd GENERATE group.$0 AS manufacturing_unit_id, group.$1 AS product_id, (float)SUM(production_filt.production_hours) AS production_hours, (float)SUM(production_filt.produced_quantity) AS produced_quantity;

/*
Apply Min max normalization on total production hours and total produced quantity for each manufacturing unit to scale the data to fit in the range of [0-1]
*/
production_sum_grpd = GROUP production_sum ALL;
production_min_max = FOREACH production_sum_grpd GENERATE MIN(production_sum.production_hours)-1 AS min_hour, MAX(production_sum.production_hours)+1 AS max_hour, MIN(production_sum.produced_quantity)-1 AS min_qty, MAX(production_sum.produced_quantity)+1 AS max_qty;
production_norm = FOREACH production_sum 
{
  -- New minimum is 0 and new maximum is 1, so the formula ends with +0 (the new minimum)
  norm_production_hours = (float)(((production_hours - production_min_max.min_hour)/(production_min_max.max_hour - production_min_max.min_hour))*(1-0))+0;
  norm_produced_quantity = (float)(((produced_quantity - production_min_max.min_qty)/(production_min_max.max_qty - production_min_max.min_qty))*(1-0))+0;
  GENERATE manufacturing_unit_id AS manufacturing_unit_id, product_id AS product_id, norm_production_hours AS production_hours, norm_produced_quantity AS produced_quantity;
};
prod_norm = FOREACH production_norm GENERATE manufacturing_unit_id AS manufacturing_unit_id, product_id AS product_id, production_hours AS production_hours, produced_quantity AS produced_quantity;

/*
Calculate the Euclidean distance to find out similar manufacturing units w.r.t the product C001
*/
manufacturing_units_euclidean_distance  = FOREACH (CROSS production_norm,prod_norm) {
distance_between_points = (production_norm::production_hours -prod_norm::production_hours)*(production_norm::production_hours -prod_norm::production_hours) +(production_norm::produced_quantity -prod_norm::produced_quantity)*(production_norm::produced_quantity - prod_norm::produced_quantity);
GENERATE  production_norm::manufacturing_unit_id,production_norm::product_id,prod_norm::manufacturing_unit_id,prod_norm::product_id,SQRT(distance_between_points) as dist;         
};

/*
The results are stored on the HDFS in the directory data_normalization
*/
STORE manufacturing_units_euclidean_distance INTO'/user/cloudera/pdp/output/data_transformation/data_normalization';

Results

The following is a snippet of the results generated by running the code on the input:

1  C001  1  C001  0.0
1  C001  3  C001  1.413113776343348
1  C001  5  C001  0.2871426024640011
3  C001  1  C001  1.413113776343348
3  C001  3  C001  0.0
3  C001  5  C001  1.1536163027782005
5  C001  1  C001  0.2871426024640011
5  C001  3  C001  1.1536163027782005
5  C001  5  C001  0.0

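The similarity between manufacturing units is calculated for a single product (C001). As the output shows, manufacturing units 1 and 5 are similar with respect to product C001, because the distance between them is smaller than the distance between the other pairs of units.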
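
The similarity step can be sketched in Python as follows. The normalized coordinates below are invented for illustration (the text shows only the resulting distances), and the helper names euclidean and most_similar are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def most_similar(points, unit):
    """Return the other unit whose point lies closest to `unit`."""
    others = [u for u in points if u != unit]
    return min(others, key=lambda u: euclidean(points[unit], points[u]))

# Hypothetical (normalized production_hours, normalized produced_quantity) per unit
points = {"1": (0.20, 0.30), "3": (0.90, 0.95), "5": (0.25, 0.35)}
closest_to_1 = most_similar(points, "1")
```

Because both coordinates are on the same 0-to-1 scale, neither attribute dominates the distance, which is the reason for normalizing before this step.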

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter5/code/
Chapter5/datasets/

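Data integration pattern

The data integration pattern deals with ways of integrating data from multiple sources and with techniques for resolving the data inconsistencies that arise from doing so.

Background

This pattern discusses ways of integrating data from multiple sources. Data integration can sometimes lead to inconsistencies in the data; for example, different data sources may use different units of measurement. The integration pattern also deals with techniques for resolving such inconsistencies.

Motivation

For many Big Data solutions, it is common for data to reside in different places, such as SQL tables, log files, and HDFS. To discover interesting relationships between the data in these places, the data has to be ingested from the different sources and integrated. On the flip side, integrating data from multiple sources can sometimes lead to inconsistencies. Data integration enriches the data by adding more attributes and giving it more meaning and context. It can also filter the data by removing unwanted details.

Data integration is achieved mainly through join operations. A join integrates records from multiple datasets based on a field called the foreign key. A foreign key is a field in one table that matches a column of another table, and it is used as a means of cross-referencing between tables. Although this operation is fairly simple in SQL, the way MapReduce works makes it one of the most expensive operations on Hadoop.

The following example uses two datasets, A and B, as a simple way to understand the different types of joins. Consistent with the join results listed below, dataset A contains the values 1, 2, and 3, and dataset B contains the values 2, 3, and 4.

[Figure: the values in datasets A and B]

The following are the different types of joins that can be performed on the datasets: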

Inner join: When this is performed on two datasets, all the matching records from both the datasets are returned. As shown in the following figure, it returns the matching records (2, 3) from both the datasets.
Left outer join: When this is performed on two datasets, all the matching records from both the datasets are returned along with the unmatched records from the dataset on the left-hand side. As shown in the following figure, the matched records (2, 3) along with the unmatched record in the dataset to the left (1) are returned.
Right outer join: When this is performed on two datasets, all the matching records from both the tables are returned along with the unmatched records from the dataset on the right-hand side. As shown in the following figure, the matched records (2, 3) along with the unmatched record in the dataset to the right (4) are returned.
Full outer join: When this is applied on two datasets, all the matching records from both the tables are returned along with the unmatched records from both tables. As shown in the following figure, the matched records (2, 3) along with unmatched records in both the datasets (1, 4) are returned.
Cartesian join: When the Cartesian join is performed on two datasets, each record from the first dataset and all the records of the second dataset are joined together. As shown in the following figure, the result would be (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), (3, 2), (3, 3), and (3, 4).
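
The five join types can be reproduced on two tiny lists in Python. This is only a sketch of the semantics, assuming A holds 1, 2, 3 and B holds 2, 3, 4 as implied by the results quoted in the list; unmatched sides are shown as None.

```python
A = [1, 2, 3]
B = [2, 3, 4]

# Inner join: only the values present in both datasets
inner = [(a, b) for a in A for b in B if a == b]

# Outer joins: the inner result plus the unmatched rows from one or both sides
left_only = [(a, None) for a in A if a not in B]
right_only = [(None, b) for b in B if b not in A]
left_outer = inner + left_only
right_outer = inner + right_only
full_outer = inner + left_only + right_only

# Cartesian join: every record of A paired with every record of B
cartesian = [(a, b) for a in A for b in B]
```

The Cartesian result has len(A) * len(B) = 9 pairs, which is why this join is rarely affordable on large datasets.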

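Combining data from multiple sources can lead to inconsistencies. Different sources may use different units of measurement; for example, suppose there are two data sources that each use a different currency, say dollars versus euros. The data integrated from these two sources is then inconsistent. Another problem is that the data in each source may be represented differently, such as true/false versus yes/no. You have to use data transformations to resolve these inconsistencies.

There are two main techniques for performing join operations in Pig:

Reduce-side join: The first technique is called a reduce-side join in MapReduce terminology. It uses the default join operator on several large datasets that have a foreign-key relationship. This technique can perform any type of join (inner, outer, right, left, and so on) on the datasets, and it can work on many datasets at a time. Its biggest drawback is that it puts a heavy load on the network, because all the data being joined is first sorted and then sent to the reducers, which makes the operation slower to execute.
Replicated join: The second technique is called a replicated join and uses the replicated keyword with the JOIN operator syntax. It is suitable for joining one very large dataset with many small datasets. Internally, the join is performed only on the mapper side and needs no additional sorting or shuffling of data. Replication lets Pig distribute a small dataset (small enough to fit in memory) to each node, so that the datasets are joined directly in the map job, which eliminates the need for a reduce job. Not all types of joins are supported in replicated mode; only inner and left outer joins are.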
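
The idea behind a replicated join can be sketched in Python as an in-memory hash lookup: the small dataset is loaded into a dictionary once, and the large dataset is streamed past it row by row with no sorting. This illustrates the technique only and is not Pig's implementation; the function name replicated_join and the sample rows are hypothetical.

```python
def replicated_join(large_rows, small_rows, key, how="inner"):
    """Join a streamed large dataset against a small one held entirely in memory."""
    # Build the lookup table from the small dataset (the "replicated" side)
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the large dataset; each row is joined independently of the others
    for row in large_rows:
        matches = lookup.get(row[key])
        if matches:
            for match in matches:
                yield {**match, **row}
        elif how == "left":
            yield dict(row)

# Hypothetical rows
production = [{"unit_id": "1", "qty": 49}, {"unit_id": "9", "qty": 5}]
units = [{"unit_id": "1", "currency": "USD"}]
inner_rows = list(replicated_join(production, units, "unit_id"))
left_rows = list(replicated_join(production, units, "unit_id", how="left"))
```

The sketch also shows why only inner and left outer joins fit this scheme: each worker sees the whole small dataset but only a slice of the large one, so it cannot tell which small-side rows went unmatched overall.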

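Use cases

You can consider using this design pattern in the following scenarios:

When you need to combine data from multiple sources before applying analytics to it
To reduce processing time by denormalizing the data; denormalization can be achieved by joining a transactional dataset with its associated master datasets
To transform the data in order to resolve inconsistencies introduced by data integration
To filter data using specific joins

Pattern implementation

This design pattern is implemented in Pig as a standalone script. It combines the production information of all the manufacturing units, resolves data inconsistencies by transforming the data, and finds out whether each unit is performing to its optimum.

The script first loads the data of each manufacturing unit and combines it using UNION. It then denormalizes the data by joining the production dataset with its master datasets to obtain the manufacturing unit and product details. It implements replicated joins to join the huge production dataset with the much smaller manufacturing_units_products and manufacturing_units datasets. One unit uses the Indian rupee (INR) as its currency, which makes the data inconsistent. The script resolves this inconsistency by converting that unit's manufacturing cost attribute from INR to USD.

The script then compares the actual quantity produced with the expected quantity for each unit to determine whether each unit is performing optimally.

Code snippets

To illustrate how this pattern works, we consider a manufacturing dataset stored on HDFS. It contains three main files: manufacturing_units.csv holds information about each manufacturing unit, products.csv holds details of the products manufactured, and manufacturing_units_products.csv holds details of the products manufactured in the different units. The production dataset has a separate production file for each manufacturing unit; each file contains attributes such as production_date, production_hours, manufacturing_unit_id, product_id, and produced_quantity. The following Pig script demonstrates the implementation of this pattern: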

/*
Load the production datasets of five manufacturing units into the relations
*/
production_unit_1 = LOAD'/user/cloudera/pdp/datasets/data_transformation/production_unit_1.csv' USING PigStorage(',') AS(production_date:datetime,production_hours:int,manufacturing_unit_id:chararray,product_id:chararray,produced_quantity:int);
production_unit_2 = LOAD'/user/cloudera/pdp/datasets/data_transformation/production_unit_2.csv' USING PigStorage(',') AS(production_date:datetime,production_hours:int,manufacturing_unit_id:chararray,product_id:chararray,produced_quantity:int);
production_unit_3 = LOAD'/user/cloudera/pdp/datasets/data_transformation/production_unit_3.csv' USING PigStorage(',') AS(production_date:datetime,production_hours:int,manufacturing_unit_id:chararray,product_id:chararray,produced_quantity:int);
production_unit_4 = LOAD'/user/cloudera/pdp/datasets/data_transformation/production_unit_4.csv' USING PigStorage(',') AS(production_date:datetime,production_hours:int,manufacturing_unit_id:chararray,product_id:chararray,produced_quantity:int);
production_unit_5 = LOAD'/user/cloudera/pdp/datasets/data_transformation/production_unit_5.csv' USING PigStorage(',') AS(production_date:datetime,production_hours:int,manufacturing_unit_id:chararray,product_id:chararray,produced_quantity:int);

/*
Combine the data in the relations using UNION operator
*/
production = UNION production_unit_1, production_unit_2, production_unit_3, production_unit_4, production_unit_5;

/*
Load manufacturing_unit and manufacturing_units_products datasets
*/
manufacturing_units_products = LOAD'/user/cloudera/pdp/datasets/data_transformation/manufacturing_units_products.csv' USING PigStorage(',') AS(manufacturing_unit_id:chararray,product_id:chararray,capacity_per_hour:int,manufacturing_cost:float);
manufacturing_units = LOAD'/user/cloudera/pdp/datasets/data_transformation/manufacturing_units.csv' USING PigStorage(',') AS(manufacturing_unit_id:chararray,manufacturing_unit_name:chararray,manufacturing_unit_city:chararray,country:chararray,currency:chararray);

/*
Use replicated join to join the relation production, which is huge with a smaller relation manufacturing_units_products.
The relations manufacturing_units_products and manufacturing units are small enough to fit into the memory
*/
replicated_join = JOIN production BY(manufacturing_unit_id,product_id),manufacturing_units_products BY(manufacturing_unit_id,product_id) USING 'replicated';
manufacturing_join = JOIN replicated_join BY production::manufacturing_unit_id, manufacturing_units BY manufacturing_unit_id USING 'replicated';

/*
Identify varying representation of currency and transform the values in the attribute manufacturing_cost to USD for the units that have INR as currency
*/
transformed_varying_values = FOREACH manufacturing_join GENERATE $0 AS production_date, $2 AS manufacturing_unit_id, $3 AS product_id, $4 AS actual_quantity_produced, ($1*$7) AS expected_quantity_produced, (float)((($13 == 'INR') ? ($8/60) : $8)*$4) AS manufacturing_cost;

/*
Calculate the expected quantity to be produced, actual quantity produced, percentage, total manufacturing cost for each month for each manufacturing unit and product to identify how each unit is performing
*/
transformed_varying_values_grpd = GROUP transformed_varying_values BY (GetMonth($0), manufacturing_unit_id, product_id);
quantity_produced = FOREACH transformed_varying_values_grpd 
{
  expected_quantity_produced =SUM(transformed_varying_values.expected_quantity_produced);
  actual_quantity_produced =SUM(transformed_varying_values.actual_quantity_produced);
  percentage_quantity_produced =100*actual_quantity_produced/expected_quantity_produced;
  manufacturing_cost =SUM(transformed_varying_values.manufacturing_cost);
  GENERATE group.$0 AS production_month, group.$1 AS manufacturing_unit_id, group.$2 AS product_id, expected_quantity_produced AS expected_quantity_produced, actual_quantity_produced AS actual_quantity_produced, percentage_quantity_produced AS percentage_quantity_produced, ROUND(manufacturing_cost) AS manufacturing_cost;
}

/*
Sort the relation by the percentage of quantity produced
*/
ordered_quantity_produced = ORDER quantity_produced BY $5 DESC;

/*
The results are stored on the HDFS in the directory data_integration
*/
STORE ordered_quantity_produced INTO '/user/cloudera/pdp/output/data_transformation/data_integration';

Results

The following is a snippet of the results generated by running the code on the input:

6  2  C003  2400  2237  93  894800
10  2  C004  1984  1814  91  816300
12  3  L002  74400  66744  89  33372

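The first column shows the month, the second the manufacturing unit ID, and the third the product ID. These are followed by the expected quantity to be produced, the actual quantity produced, the percentage of the expected quantity that was actually produced, and the total manufacturing cost for the month; all of these are calculated per unit per month to gauge each unit's performance.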
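
The two transformations behind these columns, currency harmonization and the percentage calculation, can be sketched in Python. The rate of 60 rupees per dollar is the constant hard-coded in the Pig script; the function names are hypothetical, and the use of integer division is an assumption inferred from the whole-number percentages in the output.

```python
INR_PER_USD = 60  # fixed rate taken from the ($8/60) expression in the Pig script

def cost_in_usd(unit_cost, currency, quantity):
    """Convert a per-unit cost to USD if needed, then scale by the quantity produced."""
    per_unit = unit_cost / INR_PER_USD if currency == "INR" else unit_cost
    return per_unit * quantity

def percentage_produced(actual, expected):
    """Percentage of the expected quantity actually produced, truncated to an integer."""
    return 100 * actual // expected

# Check against the (expected, actual) pairs in the three result rows shown above
percentages = [percentage_produced(a, e) for e, a in [(2400, 2237), (1984, 1814), (74400, 66744)]]
```

Applied to the three result rows, the percentages come out as 93, 91, and 89, matching the sixth column of the output.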

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter5/code/
Chapter5/datasets/

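Data aggregation pattern

The data aggregation design pattern explores the use of Pig to transform data by applying summarization or aggregation operations to it.

Background

Aggregation provides a summarized, high-level view of the data. It combines multiple attributes into one, and it reduces the total number of records either by treating a set of records as a single record or by ignoring subsections of unimportant records. Data aggregation can be performed at different levels of granularity.

Data aggregation preserves the integrity of the data even though the resulting dataset is smaller in volume than the original.

Motivation

Data aggregation plays a key role in Big Data, because a huge volume of data is inherently unable to convey much information as a whole. Instead, data is collected daily and then aggregated into weekly data; the weekly data can be aggregated into a monthly value, and so on. Patterns then emerge in the data that can be used for analytics. A simple example is segmenting by age group to learn more about a particular group based on specific attributes, such as purchases by age. This ability to aggregate data by specific attributes quickly provides valuable insights for further analytics.

There are various techniques for aggregating data. The basic ones are SUM, AVG, and COUNT; the advanced ones include CUBE and ROLLUP.

CUBE and ROLLUP are similar in many respects; both summarize data to produce a single result set. ROLLUP calculates aggregates such as SUM, COUNT, MAX, MIN, and AVG at different levels of the hierarchy, from the most detailed subtotals up to the grand total.

CUBE enables SUM, COUNT, MAX, MIN, and AVG to be calculated for all possible combinations of values in the selected columns. Once this aggregation has been computed on a set of columns, it can answer every possible aggregation question on those dimensions.

Use cases

You can consider using this design pattern to produce a summarized representation of the data. We will look at a few scenarios in which the data needs to be replaced with summary or aggregated information. The aggregation is done before the data is sent for analytical processing. The data aggregation design pattern can be used in the following specific scenarios:

Records containing transactional information can be aggregated on multiple dimensions, such as product or transaction date.
Individual information, such as the income of each family member, can be summarized to represent the average family income.

Pattern implementation

This design pattern is implemented as a standalone Pig script. The script aggregates the data using the CUBE and ROLLUP operators, which were introduced in Pig version 0.11.0.

Aggregation is a fundamental operation performed on data in the transformation phase of Extract, Transform, Load (ETL). The fastest way to aggregate data is to use ROLLUP and CUBE, and in most cases they provide the most meaningful summaries of the data. The script loads the production data of multiple manufacturing units. This data can be aggregated for a variety of purposes. By applying ROLLUP to it, we get the following aggregates:

Production quantity of each product in each manufacturing unit for each month
Production quantity of each product in each manufacturing unit across all months
Total production quantity in each manufacturing unit
Total production quantity across all manufacturing units

By applying CUBE to the same dataset, we get the following aggregates in addition to the preceding ones:

Production quantity for each month in each manufacturing unit
Production quantity of each product for each month
Production quantity of each product
Production quantity for each month

The four additional aggregates returned by CUBE result from its built-in ability to create subtotals for all possible combinations of the grouping columns.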
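
The difference between the two operators comes down to which grouping sets they generate. The Python sketch below enumerates them for three columns and sums a value over each; it illustrates the semantics rather than Pig's implementation, and the function names, short column names, and sample rows are hypothetical.

```python
from itertools import combinations

def rollup_groupings(columns):
    """ROLLUP: progressively drop columns from the right, down to the grand total."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def cube_groupings(columns):
    """CUBE: every subset of the grouping columns."""
    return [combo for n in range(len(columns), -1, -1)
            for combo in combinations(columns, n)]

def aggregate(rows, columns, groupings, value):
    """Sum `value` per grouping set; columns outside the set appear as None."""
    totals = {}
    for row in rows:
        for grouping in groupings:
            key = tuple(row[c] if c in grouping else None for c in columns)
            totals[key] = totals.get(key, 0) + row[value]
    return totals

columns = ["unit", "product", "month"]
rows = [
    {"unit": "1", "product": "C001", "month": 1, "cost": 10},
    {"unit": "1", "product": "C002", "month": 1, "cost": 5},
    {"unit": "2", "product": "C001", "month": 2, "cost": 7},
]
rolled = aggregate(rows, columns, rollup_groupings(columns), "cost")
cubed = aggregate(rows, columns, cube_groupings(columns), "cost")
```

For three columns, ROLLUP yields 4 grouping sets and CUBE yields 8, which accounts for the four extra aggregates listed above. In the sample, the per-product total keyed by (None, "C001", None) exists only in cubed.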

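Code snippets

To illustrate how this pattern works, we consider a manufacturing dataset stored on HDFS. It contains three main files: manufacturing_units.csv holds information about each manufacturing unit, products.csv holds details of the products manufactured, and manufacturing_units_products.csv holds details of the products manufactured in the different units. The file production.csv contains the production information of each manufacturing unit, with attributes such as production_date, production_hours, manufacturing_unit_id, product_id, and produced_quantity. We will apply the CUBE and ROLLUP aggregations on manufacturing_unit_id, product_id, and production_month, as shown in the following code: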

/*
Load the data from production.csv, manufacturing_units_products.csv, manufacturing_units.csv files into the relations production, manufacturing_units_products and manufacturing_units
The files manufacturing_units_products.csv and manufacturing_units.csv contain master data information.
*/
production = LOAD'/user/cloudera/pdp/datasets/data_transformation/production.csv' USING PigStorage(',') AS(production_date:datetime,production_hours:int,manufacturing_unit_id:chararray,product_id:chararray,produced_quantity:int);
manufacturing_units_products = LOAD'/user/cloudera/pdp/datasets/data_transformation/manufacturing_units_products.csv' USING PigStorage(',') AS(manufacturing_unit_id:chararray,product_id:chararray,capacity_per_hour:int,manufacturing_cost:float);
manufacturing_units = LOAD'/user/cloudera/pdp/datasets/data_transformation/manufacturing_units.csv' USING PigStorage(',') AS(manufacturing_unit_id:chararray,manufacturing_unit_name:chararray,manufacturing_unit_city:chararray,country:chararray,currency:chararray);

/*
The relations are joined to get details from the master data.
*/
production_join_manufacturing_units_products = JOIN production BY (manufacturing_unit_id,product_id), manufacturing_units_products BY (manufacturing_unit_id,product_id);
manufacture_join = JOIN production_join_manufacturing_units_products BY production::manufacturing_unit_id, manufacturing_units BY manufacturing_unit_id;

/*
The manufacturing cost attribute is converted to dollars for the units that have currency as INR.
*/
transformed_varying_values = FOREACH manufacture_join GENERATE $2 AS manufacturing_unit_id, $3 AS product_id, GetMonth($0) AS production_month, ((($13 == 'INR') ? ($8/60) : $8)*$4) AS manufacturing_cost;

/*
Apply CUBE and ROLLUP aggregations on manufacturing_unit_id, product_id, production_month and store the results in the relations result_cubed and result_rolledup
*/
cubed = CUBE transformed_varying_values BY CUBE(manufacturing_unit_id,product_id,production_month);
rolledup = CUBE transformed_varying_values BY ROLLUP(manufacturing_unit_id,product_id,production_month);
result_cubed = FOREACH cubed GENERATE FLATTEN(group),ROUND(SUM(cube.manufacturing_cost)) AS total_manufacturing_cost;
result_rolledup = FOREACH rolledup GENERATE FLATTEN(group),ROUND(SUM(cube.manufacturing_cost)) AS total_manufacturing_cost;

/*
The results are stored on the HDFS in the directories cube and rollup
*/
STORE result_cubed INTO '/user/cloudera/pdp/output/data_transformation/data_aggregation/cube';
STORE result_rolledup INTO '/user/cloudera/pdp/output/data_transformation/data_aggregation/rollup';

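Results

After applying ROLLUP on manufacturing_unit_id, product_id, and production_month, the following combinations of results are produced. Because the script sums manufacturing_cost, the last column of each row is the total manufacturing cost for that combination.

The production of each product in each manufacturing unit for each month is as follows: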
```
1  C001  1  536600
5  C002  12  593610
```

The production of each product in each manufacturing unit across all months is as follows:

1  C001    7703200
2  C003    10704000
5  C002    7139535

The total production of each manufacturing unit is as follows:
```
1      15719450
4      15660186
```
The total production across all manufacturing units is as follows:
```
      69236355
```

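After applying CUBE on manufacturing_unit_id, product_id, and production_month, the following combinations are obtained in addition to those produced by ROLLUP:

The production for each month in each manufacturing unit is as follows: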
```
1    4  1288250
5    12  1166010
```

The production of each product for each month is as follows:

  C001  8  1829330
  L002  12  101748
  L001  10  36171

The production of each product is as follows:

  C002    15155785
  C004    16830110
  L002    667864

The production for each month is as follows:

    2  5861625
    10  5793634
    11  5019340

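As shown, CUBE returns four more aggregates than ROLLUP: the production for each month in each manufacturing unit, the production of each product for each month, the production of each product, and the production for each month. This is because CUBE has the built-in ability to create subtotals for all possible combinations of the grouping columns.

Additional information

The complete code and datasets for this section are in the following GitHub directories: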

Chapter5/code/
Chapter5/datasets/

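Data generalization pattern

The data generalization pattern deals with transforming data by creating concept hierarchies and replacing the data with these hierarchies.

Background

This design pattern explores the implementation of data generalization through a Pig script. Data generalization is the process of creating top-level summary layers, called concept hierarchies, that describe the underlying data concepts in a general form. It is a form of descriptive approach in which the data is grouped and replaced by higher-level categories or concepts using concept hierarchies. For example, the raw values of the attribute age can be replaced with conceptual labels (such as adult, teenager, or child) or with interval labels (0 to 5, 13 to 19, and so on). These labels can in turn be recursively organized into higher-level concepts, resulting in a concept hierarchy for the attribute.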
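
The age example can be sketched in Python as a mapping from raw values to conceptual labels. The cut-off ages of 13 and 20 are assumptions chosen for illustration; the text names the labels but does not define their exact boundaries.

```python
from collections import Counter

def age_to_concept(age):
    """Replace a raw age with a higher-level conceptual label (boundaries are assumed)."""
    if age < 13:
        return "child"
    if age < 20:
        return "teenager"
    return "adult"

# Hypothetical raw values of the age attribute
ages = [4, 15, 17, 34, 52]
labels = [age_to_concept(a) for a in ages]
summary = Counter(labels)
```

The five raw values collapse into three labels, and the counts per label form a compact summary that can stand in for the raw column in later analysis.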

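Motivation

In the Big Data context, a typical analytics pipeline on massive data requires the integration of multiple structured and unstructured datasets.

The data generalization process reduces the space occupied by the data to be analyzed in the Hadoop cluster by describing it in a concise, generalized form. Rather than analyzing the entire data corpus, generalization presents the general properties of the data as concept hierarchies, which helps you quickly get a broader, more compact view of analytical trends and mine the data at multiple levels of abstraction.

Applying data generalization may lose detail, but in some analytics use cases the generalized data is more meaningful and easier to interpret.

Organizing the data in top-level concept hierarchies makes it possible to represent data consistently across multiple analytics pipelines. In addition, analysis on a reduced dataset requires fewer input/output operations and less network throughput, and is more efficient than analysis on a larger, ungeneralized dataset.

Because of these benefits, data generalization is typically applied as a preprocessing step before analytics rather than during mining. There are various techniques for generalizing numeric data, such as binning, histogram analysis, entropy-based discretization, chi-square analysis, cluster analysis, and discretization by visual segmentation. Similarly, for categorical data, generalization can be performed based on the number of distinct values of the attributes that define the hierarchy.

Use cases

You can consider using this design pattern in analytics scenarios to produce a generalized representation of numeric and categorical structured data, where the data needs to be generalized with higher-level summaries, instead of low-level raw data, in order to achieve consistency.

You can also consider using this pattern as an analytics accelerator immediately after the data integration process, to create a reduced dataset that lends itself to efficient analytics.

Pattern implementation

This design pattern is implemented as a standalone Pig script. The script generates a concept hierarchy for categorical data based on the number of distinct values of each attribute.

The script performs join operations on the relations manufacturing_unit_products, products, components, and product_components. It then generates the concept hierarchy by selecting the distinct values of the attributes components and products; the attributes are sorted in ascending order of their distinct-value counts. The hierarchy follows the sort order: the first attribute is at the top of the hierarchy and the last attribute is at the bottom.

Code snippets

The master dataset components.csv contains component details, and the file products_components.csv contains the component details and the number of components required to manufacture a product. This file contains attributes such as product_id, component_id, and required_qty_per_Unit. The following Pig script demonstrates the implementation of this pattern: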

/*
Load products_components data set into the relation products_components
*/
products_components = LOAD'/user/cloudera/pdp/datasets/data_transformation/products_components.csv' USING PigStorage(',') AS(product_id:chararray,component_id:chararray,required_qty_per_Unit:int);

/*
Calculate the distinct count for product_id and component_id and store the results in the relations products_unique_count and components_unique_count
*/
products_components_grpd = GROUP products_components ALL;
products_unique_count = FOREACH products_components_grpd
{
  attribute_name = 'Products';
  distinct_prod = DISTINCT products_components.product_id;
  GENERATE attribute_name AS attribute_name, COUNT(distinct_prod) AS attribute_count;
}
components_unique_count = FOREACH products_components_grpd
{
  attribute_name = 'Components'; 
  distinct_comp = DISTINCT products_components.component_id;
  GENERATE attribute_name AS attribute_name, COUNT(distinct_comp) AS attribute_count;
}

/*
The relations products_unique_count and components_unique_count are combined using the UNION operator.
This relation contains two columns attribute_name and attribute_count, it is then sorted by attribute_count
*/
combined_products_components_count = UNION products_unique_count, components_unique_count;
ordered_count = ORDER combined_products_components_count BY attribute_count ASC;

/*
The results are stored on the HDFS in the directory data_generalization
*/
STORE ordered_count INTO '/user/cloudera/pdp/output/data_transformation/data_generalization';

Results

The following is the result of generalizing the categorical data:

Products    6
Components  18

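The result shows the attribute name and its distinct count, sorted by the count. It depicts the concept hierarchy: the first attribute, Products, is at the top of the hierarchy, and the last attribute, Components, is at the bottom.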
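
The same ordering can be sketched in Python: count the distinct values of each attribute and sort the attributes by that count, so that the attribute with the fewest distinct values lands at the top of the hierarchy. The function name concept_hierarchy and the sample rows are hypothetical.

```python
def concept_hierarchy(rows, attributes):
    """Order attributes by their number of distinct values, fewest first."""
    counts = {attr: len({row[attr] for row in rows}) for attr in attributes}
    return sorted(counts.items(), key=lambda item: item[1])

# Hypothetical product/component pairs
rows = [
    {"product_id": "C001", "component_id": "K1"},
    {"product_id": "C001", "component_id": "K2"},
    {"product_id": "C002", "component_id": "K3"},
]
hierarchy = concept_hierarchy(rows, ["component_id", "product_id"])
```

In this sample there are two distinct products and three distinct components, so product_id sits above component_id, mirroring the Products and Components ordering in the result above.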

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter5/code/
Chapter5/datasets/

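Summary

In this chapter, you learned about various Big Data transformation techniques, starting with converting structured data into a hierarchical representation to take advantage of Hadoop's ability to process semi-structured data. We saw the importance of normalizing data before analyzing it. We then discussed using joins to denormalize data. CUBE and ROLLUP aggregate the data several times over; these aggregations provide a snapshot of the data. In the data generalization section, we discussed various generalization techniques for numeric and categorical data.

In the next chapter, we will focus on data reduction techniques. Data reduction aims to obtain a reduced representation of the data; it preserves the integrity of the data even though the resulting dataset is much smaller in volume. We will discuss data reduction techniques such as dimensionality reduction, sampling, binning, and clustering. After reading that chapter, you will be able to choose the right data reduction pattern.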

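Chapter 6: Understanding Data Reduction Patterns

In the previous chapter, we looked at various Big Data transformation techniques that deal with converting structured data into a hierarchical representation, in order to take advantage of Hadoop's ability to process semi-structured data. We saw the importance of normalizing data before analyzing it. We then discussed using joins to denormalize data. CUBE and ROLLUP aggregate the data several times over; these aggregations provide a snapshot of the data. In the data generalization section, we discussed various generalization techniques for numeric and categorical data.

In this chapter, we will discuss design patterns for dimensionality reduction using the principal component analysis technique, and for numerosity reduction using clustering, sampling, and histogram techniques.

Data reduction: a quick introduction

Data reduction aims to obtain a reduced representation of the data. It preserves the integrity of the data even though the reduced dataset is much smaller in volume than the original.

Data reduction techniques fall into the following three groups:

Dimensionality reduction: This group of techniques deals with reducing the number of attributes considered for an analytics problem. They do this by detecting and eliminating irrelevant attributes, relevant but weak attributes, and redundant attributes. Principal component analysis and wavelet transforms are examples of dimensionality reduction techniques.
Numerosity reduction: This group of techniques reduces the data by replacing the original dataset with a sparser representation of it. The sparse subset is computed by parametric methods, such as regression, where a model is used to estimate the data so that a subset suffices instead of the whole dataset. There are also nonparametric methods, such as clustering, sampling, and histograms, which work without building a model.
Compression: This group of techniques uses algorithms to reduce the physical storage size consumed by the data. Typically, compression is performed at a coarser level of granularity than the attribute or record level. If you need to retrieve the original data from the compressed data without any loss of information, as is required when storing string or numeric data, use a lossless compression scheme. If, on the other hand, the data is video or sound that can tolerate an imperceptible loss of clarity when it is decompressed, use a lossy compression technique.

The following figure illustrates the different techniques used in each of these groups:

[Figure: Data reduction techniques, an overview]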

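Data reduction considerations for Big Data

In Big Data problems, data reduction techniques have to be considered part of the analytics process rather than a separate process. This lets you understand which kinds of data must be retained and which can be eliminated because they are irrelevant to the analytics questions being asked.

In a typical Big Data analytics environment, data is often acquired and integrated from multiple sources. Although analyzing the entire dataset holds the promise of hidden rewards in the form of richer and better insights, the cost sometimes outweighs the results. It is at this juncture that you may have to consider reducing the amount of data without drastically compromising the validity of the analytical insights, in essence preserving the integrity of the data.

Because of the sheer volume of data, performing any kind of analysis on Big Data usually incurs high storage and retrieval costs. The benefits of data reduction are sometimes not evident when the data is small; they start to become apparent as datasets grow. Data reduction processes are the first step in optimizing the data from a storage and retrieval perspective. It is important to weigh the cost of data reduction, so that the computational time spent on it does not exceed or cancel out the time saved by mining the reduced dataset. Now that we understand the concepts of data reduction, we will explore a few specific design patterns in the following subsections.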

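Dimensionality reduction: the Principal Component Analysis design pattern

In this design pattern, we consider implementing dimensionality reduction using Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), which are widely used for exploratory data analysis and for making predictions.

Background

The dimensions of a given dataset can be understood intuitively as the set of all attributes used to account for the observed properties of the data. Reducing dimensionality means transforming high-dimensional data into a reduced set of dimensions proportional to the intrinsic, or latent, dimensionality of the data. These latent dimensions are the minimum number of attributes needed to describe the dataset. Dimensionality reduction is thus a way of understanding the hidden structure of the data, and it is used to mitigate the undesirable properties of high-dimensional spaces, such as the curse of dimensionality.

Broadly, there are two approaches to dimensionality reduction. One is linear dimensionality reduction, examples of which are PCA and SVD. The other is nonlinear dimensionality reduction, examples of which are kernel PCA and multidimensional scaling.

In this design pattern, we explore linear dimensionality reduction by implementing PCA in R and SVD in Mahout, and integrating them with Pig.

Motivation

Let's start with an overview of PCA. PCA is a linear dimensionality reduction technique that works unsupervised on a given dataset by embedding it in a subspace of lower dimension; it does this by constructing a variance-based representation of the original data.

The underlying principle of PCA is to identify the hidden structure of the data by analyzing the direction in which the data varies the most, that is, the direction in which it is most spread out.

Intuitively, a principal component can be thought of as a line that passes through a set of data points that vary to a great degree. If the same line passes through data points that show no variation, the data is all the same and carries little information. Data points without variance are not considered representative of the properties of the whole dataset, and such attributes can be omitted.

PCA involves finding the pairs of eigenvalues and eigenvectors of a dataset. The given dataset is decomposed into pairs of eigenvectors and eigenvalues. An eigenvector defines a unit vector, a direction of the data that is perpendicular to the other eigenvectors. The eigenvalue is the value of the spread of the data in that direction.

In multidimensional data, the number of eigenvalues and eigenvectors that can exist equals the number of dimensions of the data. The eigenvector with the largest eigenvalue is the principal component.

After the principal components are found, they are sorted in descending order of their eigenvalues, so that the first vector shows the highest variance, the second the next highest, and so on. This information helps uncover hidden patterns that were not previously suspected, and thereby allows interpretations that would not ordinarily result.

Because the data is now ordered by decreasing significance, its size can be reduced by eliminating the attributes with weak components, that is, those of low significance where the variance of the data is small. Using the principal components with high eigenvalues, a good approximation of the original dataset can be constructed.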
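
For two attributes, the eigenvalues and the principal direction can be written in closed form, which makes the idea easy to check by hand. The Python sketch below is an illustration with made-up points, not the R or Mahout implementation used later in this section; it computes the sample covariance matrix [[a, b], [b, c]] and its eigen-decomposition.

```python
import math

def covariance_2d(points):
    """Sample covariance terms (var_x, cov_xy, var_y) of a list of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return var_x, cov_xy, var_y

def principal_component_2d(points):
    """Return ((largest, smallest) eigenvalues, unit eigenvector of the largest)."""
    a, b, c = covariance_2d(points)
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = (mid + spread, mid - spread)
    if b != 0:
        vx, vy = b, eigenvalues[0] - a
    else:
        # Uncorrelated attributes: the axes themselves are the eigenvectors
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return eigenvalues, (vx / norm, vy / norm)

# Made-up, perfectly correlated points lying on the line y = x
eigenvalues, direction = principal_component_2d([(1, 1), (2, 2), (3, 3), (4, 4)])
```

For these points the second eigenvalue is zero and the principal direction is the diagonal, so one dimension describes the data completely and the other can be dropped.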

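As an example, consider a sample election survey of 100 million people who were asked 150 questions about their opinions on election-related issues. Analyzing 100 million answers across 150 attributes is a tedious task. We have a high-dimensional space of 150 dimensions, which yields 150 eigenvalues/eigenvectors. We order the eigenvalues in descending order of significance (for example, 230, 160, 130, 97, 62, 8, 6, 4, 2, 1, … up to 150 dimensions). From these values we can see that, although there are 150 dimensions, only the first five carry data that varies considerably. We can therefore reduce the 150-dimensional space and consider only the top five eigenvalues in the next step of the analytics process.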
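
The cut-off in this example can be made explicit by accumulating the share of variance explained. The Python sketch below uses only the ten eigenvalues listed in the text and assumes, for illustration, that the remaining 140 are negligible; the function name components_needed and the 95 percent threshold are likewise assumptions.

```python
def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative variance share reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return count
    return len(ordered)

# Eigenvalues quoted in the example; the elided ones are treated as negligible
listed = [230, 160, 130, 97, 62, 8, 6, 4, 2, 1]
top_five_share = sum(listed[:5]) / sum(listed)
needed = components_needed(listed, 0.95)
```

Under that assumption, the first five eigenvalues account for 679 of 700, or 97 percent of the listed variance, so five components are enough to pass a 95 percent threshold.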

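Next, let's look at SVD. SVD is closely related to PCA, and the two terms are sometimes used interchangeably; SVD is a more general method of implementing PCA. SVD is a form of matrix analysis that produces a low-dimensional representation of a high-dimensional matrix. It reduces the data by removing linearly dependent data. Like PCA, SVD uses eigenvalues to reduce dimensionality. The approach is to combine the information from several correlated vectors into basis vectors that are orthogonal and that explain most of the variance in the data.

For example, if you have two attributes, ice cream sales and temperature, their correlation is so high that the second attribute, temperature, contributes no additional information useful for a classification task. The eigenvalues derived from SVD determine which attributes are most informative and which ones you can do without.

Mahout's Stochastic SVD (SSVD) computes the mathematical SVD in a distributed fashion. If the pca parameter is set to true, SSVD runs in PCA mode; the algorithm computes the column-wise mean over the input and then uses it to compute the PCA space.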

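Use cases

You can consider using this pattern to perform data reduction and data exploration, and as an input to clustering and multiple regression.

The design pattern can be applied to ordered and unordered attributes with sparse and skewed data. It can also be used on images. It cannot be applied to complex nonlinear data.

Pattern implementation

The following steps describe the implementation of PCA using R:

The script applies the PCA technique to reduce dimensions. PCA involves finding the pairs of eigenvalues and eigenvectors of a dataset. The eigenvector with the largest eigenvalue is the principal component. The components are sorted in descending order of their eigenvalues.
The script loads the data and uses streaming to call the R script. The R script performs PCA on the data and returns the principal components. Only the first few principal components, those that explain most of the variation, need be selected, which reduces the dimensionality of the data.

Limitations of the PCA implementation

While streaming lets you call the executable of your choice, it affects performance, and the solution does not scale when the input dataset is large. To overcome this, we show a better way of performing dimensionality reduction, using Mahout, which contains a set of highly scalable machine learning libraries.

The following steps describe the implementation of SSVD on Mahout:

Read the input dataset in CSV format and prepare a set of data points as key/value pairs; the key should be unique and the value should be made up of n vector tuples.
Write the preceding data to a sequence file. The key can be of type WritableComparable, Long, or String, and the value should be of type VectorWritable.
Decide on the number of dimensions in the reduced space.
Execute SSVD on Mahout with the rank argument (which specifies the number of dimensions), setting pca, us, and V to true. When the pca argument is set to true, the algorithm runs in PCA mode: it computes the column-wise mean over the input and then uses it to compute the PCA space. The USigma folder contains the output with reduced dimensions.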
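
The first step and the column-mean computation of PCA mode can be sketched in plain Python. This only illustrates the data preparation; it does not use Mahout's API or write a Hadoop sequence file, and the function names and sample CSV text are hypothetical.

```python
import csv
import io

def to_key_value_pairs(csv_text):
    """Turn CSV rows into (unique key, numeric vector) pairs, keyed by row number."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(key, [float(x) for x in row]) for key, row in enumerate(reader)]

def center_columns(pairs):
    """Compute the column-wise means and subtract them from every vector."""
    n = len(pairs)
    width = len(pairs[0][1])
    means = [sum(vec[i] for _, vec in pairs) / n for i in range(width)]
    centered = [(key, [v - m for v, m in zip(vec, means)]) for key, vec in pairs]
    return means, centered

# Hypothetical two-column input
pairs = to_key_value_pairs("1,10\n3,30\n")
means, centered = center_columns(pairs)
```

Centering each column on its mean is what distinguishes a PCA of the data from a plain SVD of the raw matrix.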

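Generally, dimensionality reduction is applied to very high-dimensional datasets; in our example, however, we demonstrate it on a dataset with fewer dimensions, for ease of explanation.

Code snippets

To illustrate how this pattern works, we consider a retail transactions dataset stored on the Hadoop Distributed File System (HDFS). It contains 20 attributes, such as Transaction ID, Transaction date, Customer ID, Product subclass, Phone No, Product ID, age, quantity, asset, Transaction Amount, Service Rating, Product Rating, and Current Stock. For this pattern, we will use PCA to reduce the dimensions. The following code snippet is the Pig script that demonstrates the implementation of this pattern via Pig streaming: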

/*
Assign an alias pcar to the streaming command
Use ship to send streaming binary files (R script in this use case) from the client node to the compute node
*/
DEFINE pcar `/home/cloudera/pdp/data_reduction/compute_pca.R` ship('/home/cloudera/pdp/data_reduction/compute_pca.R');

/*
Load the data set into the relation transactions
*/
transactions = LOAD '/user/cloudera/pdp/datasets/data_reduction/transactions_multi_dims.csv' USING  PigStorage(',') AS (transaction_id:long, transaction_date:chararray, customer_id:chararray, prod_subclass:chararray, phone_no:chararray, country_code:chararray, area:chararray, product_id:chararray, age:int, amt:int, asset:int, transaction_amount:double, service_rating:int, product_rating:int, curr_stock:int, payment_mode:int, reward_points:int, distance_to_store:int, prod_bin_age:int, cust_height:int);
/*
Extract the columns on which PCA has to be performed.
STREAM is used to send the data to the external script.
The result is stored in the relation princ_components
*/
selected_cols = FOREACH transactions GENERATE age AS age, amt AS amount, asset AS asset, transaction_amount AS transaction_amount, service_rating AS service_rating, product_rating AS product_rating, curr_stock AS current_stock, payment_mode AS payment_mode, reward_points AS reward_points, distance_to_store AS distance_to_store, prod_bin_age AS prod_bin_age, cust_height AS cust_height;
princ_components = STREAM selected_cols THROUGH pcar;

/*
The results are stored on the HDFS in the directory pca
*/
STORE princ_components INTO '/user/cloudera/pdp/output/data_reduction/pca';

The following is the R code that illustrates the implementation of this pattern:

#! /usr/bin/env Rscript
options(warn=-1)

#Establish connection to stdin for reading the data
con <- file("stdin","r")

#Read the data as a data frame
data <- read.table(con, header=FALSE, col.names=c("age", "amt", "asset", "transaction_amount", "service_rating", "product_rating", "current_stock", "payment_mode", "reward_points", "distance_to_store", "prod_bin_age", "cust_height"))
attach(data)

#Calculate covariance and correlation to understand the variation between the independent variables
covariance=cov(data, method=c("pearson"))
correlation=cor(data, method=c("pearson"))

#Calculate the principal components
pcdat=princomp(data)
summary(pcdat)
pcadata=prcomp(data, scale = TRUE)
pcadata

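The following code snippets illustrate the implementation of this pattern using Mahout's SSVD. The following is a snippet of a shell script with the command for executing the CSV-to-sequence-file converter: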

#All the mahout jars have to be included in HADOOP_CLASSPATH before execution of this script. 
#Execute csvtosequenceconverter jar to convert the CSV file to sequence file.
hadoop jar csvtosequenceconverter.jar com.datareduction.CsvToSequenceConverter /user/cloudera/pdp/datasets/data_reduction/transactions_multi_dims_ssvd.csv /user/cloudera/pdp/output/data_reduction/ssvd/transactions.seq

The following is a snippet of the Pig script with the commands for executing SSVD on Mahout:

/*
Register piggybank jar file
*/
REGISTER '/home/cloudera/pig-0.11.0/contrib/piggybank/java/piggybank.jar';

/*
*Ideally the following data pre-processing steps have to be generally performed on the actual data, we have deliberately omitted the implementation as these steps were covered in the respective chapters

*Data Ingestion to ingest data from the required sources

*Data Profiling by applying statistical techniques to profile data and find data quality issues

*Data Validation to validate the correctness of the data and cleanse it accordingly

*Data Transformation to apply transformations on the data.
*/

/*
Use sh command to execute shell commands.
Run Mahout SSVD on the sequence file generated in the previous step
-i specifies the input path of the sequence file on HDFS
-o specifies the output directory on HDFS
-k specifies the rank, i.e the number of dimensions in the reduced space
-us set to true computes the product USigma
-V set to true computes V matrix
-U set to false skips the computation of the U matrix
-pca set to true runs SSVD in pca mode
-ow overwrites the output directory if it exists
-t specifies the number of reduce tasks
*/

sh /home/cloudera/mahout-distribution-0.8/bin/mahout ssvd -i /user/cloudera/pdp/output/data_reduction/ssvd/transactions.seq -o /user/cloudera/pdp/output/data_reduction/ssvd/reduced_dimensions -k 7 -us true -V true -U false -pca true -ow -t 1

/*
Use seqdumper to dump the output in text format.
-i specifies the HDFS path of the input file
*/
sh /home/cloudera/mahout-distribution-0.8/bin/mahout seqdumper -i /user/cloudera/pdp/output/data_reduction/ssvd/reduced_dimensions/V/v-m-00000

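Results

The following is a snippet of the results of running the R script through Pig streaming. Only the important parts of the results are shown to improve readability.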

Importance of components:
                             Comp.1      Comp.2       Comp.3
Standard deviation     1415.7219657 548.8220571 463.15903326
Proportion of Variance    0.7895595   0.1186566   0.08450632
Cumulative Proportion     0.7895595   0.9082161   0.99272241

The following diagram shows a graphical representation of the results:

[Figure: PCA output]

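From the cumulative proportion, the first three components explain about 99 percent of the variance. We can therefore drop the remaining components and still account for most of the variation in the data, which achieves the data reduction.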
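The selection rule can be sketched in a few lines of Python. This is only an illustration and not part of the original scripts: the helper name `components_to_keep` and the 0.99 threshold are assumptions, and the input is the "Proportion of Variance" row copied from the summary shown earlier.

```python
from itertools import accumulate

def components_to_keep(proportions, threshold):
    """Return how many leading components are needed for the cumulative
    proportion of variance to reach the threshold (hypothetical helper)."""
    for count, cumulative in enumerate(accumulate(proportions), start=1):
        if cumulative >= threshold:
            return count
    # The threshold is never reached, so every component is needed.
    return len(proportions)

# Proportion of Variance for Comp.1 to Comp.3, copied from the R summary output.
proportions = [0.7895595, 0.1186566, 0.08450632]

# The 0.99 threshold is an assumed cut-off chosen for this illustration.
kept = components_to_keep(proportions, threshold=0.99)
print(kept)
```

A lower threshold keeps fewer components; the right cut-off depends on how much variance the downstream analysis can afford to lose.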

The following is a snippet of the results obtained after applying SSVD on Mahout:

Key: 0: Value: {0:6.78114976729216E-5,1:-2.1865954292525495E-4,2:-3.857078959222571E-5,3:9.172780131217343E-4,4:-0.0011674781643860148,5:-0.5403803571549012,6:0.38822546035077155}
Key: 1: Value: {0:4.514870142377153E-6,1:-1.2753047299542729E-5,2:0.002010945408634006,3:2.6983823401328314E-5,4:-9.598021198119562E-5,5:-0.015661212194480658,6:-0.00577713052974214}
Key: 2: Value: {0:0.0013835831436886054,1:3.643672803676861E-4,2:0.9999962672043754,3:-8.597640675661196E-4,4:-7.575051881399296E-4,5:2.058878196540628E-4,6:1.5620427291943194E-5}
.
.
Key: 11: Value: {0:5.861358116239576E-4,1:-0.001589570485260711,2:-2.451436184622473E-4,3:0.007553283166922416,4:-0.011038688645296836,5:0.822710349440101,6:0.060441819443160294}

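The contents of the V folder show the contribution of the original variables to each principal component. The result is a 12 x 7 matrix because the original dataset has 12 dimensions, which are reduced to 7 as specified by the rank argument (-k 7) of SSVD.

The USigma folder contains the output with the reduced dimensions.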
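To work with the V matrix outside Mahout, the text dump can be parsed row by row. The following Python sketch assumes the line layout is exactly the `Key: i: Value: {j:v,...}` form shown in the snippet above; the function name `parse_seqdumper_line` is hypothetical, and lines that do not match this layout are skipped.

```python
import re

# Matches lines such as "Key: 2: Value: {0:0.0013,1:3.6E-4}".
LINE_PATTERN = re.compile(r"Key:\s*(\d+):\s*Value:\s*\{(.*)\}")

def parse_seqdumper_line(line):
    """Parse one dumped row into (row_index, {column: value}), or None."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    row = int(match.group(1))
    entries = {}
    for pair in match.group(2).split(","):
        column, value = pair.split(":")
        entries[int(column)] = float(value)
    return row, entries

# Row for original variable 2, copied verbatim from the results above.
line = ("Key: 2: Value: {0:0.0013835831436886054,1:3.643672803676861E-4,"
        "2:0.9999962672043754,3:-8.597640675661196E-4,4:-7.575051881399296E-4,"
        "5:2.058878196540628E-4,6:1.5620427291943194E-5}")
row, loadings = parse_seqdumper_line(line)

# The component on which this variable has the largest absolute loading.
dominant = max(loadings, key=lambda column: abs(loadings[column]))
print(row, dominant)
```

In this row the variable loads almost entirely on a single component, which is how the V matrix can be read to see which original variables drive each reduced dimension.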

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter6/code/
Chapter6/datasets/

Information on Mahout's implementation of SSVD is available in the Apache Mahout documentation.

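Numerosity reduction - histogram design pattern

The numerosity reduction - histogram design pattern explores the implementation of the histogram technique for data reduction.

Background

Histograms belong to the numerosity reduction category of data reduction. They are nonparametric data reduction methods, in which the data is not assumed to fit a predefined model or function.

Motivation

Histograms work by dividing the entire data into buckets or groups and storing a measure of central tendency for each bucket. Internally, this is similar to binning. Histograms can be optimized through dynamic programming. A histogram differs from a bar chart in that it represents continuous data categories rather than discrete ones, which means that there are no gaps between the columns representing adjacent categories.

Histograms help reduce the number of categories in the data by grouping a large number of values of a continuous attribute. Representing every distinct value can result in a complex histogram with too many columns to interpret. The data is therefore grouped into ranges, each representing a continuous range of attribute values. The data can be grouped in the following ways:

Equal-width grouping: In this technique, every range has the same width.
Equal-frequency (or equi-depth) grouping: In this technique, the ranges are created so that each range has the same frequency, that is, each range contains the same number of contiguous data elements.
V-optimal grouping: In this technique, we consider all the possible histograms for a given number of ranges and choose the one with the least variance.
MaxDiff grouping: This technique groups values into ranges based on the difference between each pair of adjacent values. Range boundaries are placed between the pairs of adjacent values that have the largest differences. The following diagram depicts sorted data that is divided into three ranges at the largest differences, which lie between 9 and 14 and between 18 and 27.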

[Figure: MaxDiff grouping illustration]

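Among the grouping techniques mentioned, V-optimal and MaxDiff are more accurate and effective for data that ranges from sparse to dense and from highly skewed to uniform. Histograms can also handle multiple attributes by using multidimensional histograms, which capture the dependencies between attributes.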
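The difference between equal-frequency and MaxDiff grouping is easiest to see on a small example. The following Python sketch is illustrative only: the function names are hypothetical and the sample values are invented, chosen so that the two largest gaps fall between 9 and 14 and between 18 and 27, as in the diagram described above.

```python
def maxdiff_buckets(values, bucket_count):
    """Split the sorted values at the (bucket_count - 1) largest adjacent gaps."""
    ordered = sorted(values)
    # Entry (gap, i) is the difference between ordered[i] and ordered[i + 1].
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    largest = sorted(gaps, reverse=True)[:bucket_count - 1]
    cut_after = sorted(index for _, index in largest)
    buckets, start = [], 0
    for index in cut_after:
        buckets.append(ordered[start:index + 1])
        start = index + 1
    buckets.append(ordered[start:])
    return buckets

def equal_frequency_buckets(values, bucket_count):
    """Split the sorted values into buckets holding (almost) the same number of items."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), bucket_count)
    buckets, start = [], 0
    for bucket in range(bucket_count):
        end = start + size + (1 if bucket < extra else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets

# Hypothetical values; only the 9-14 and 18-27 gaps come from the text.
values = [1, 3, 5, 8, 9, 14, 15, 17, 18, 27, 29, 30]
by_gap = maxdiff_buckets(values, 3)
by_count = equal_frequency_buckets(values, 3)
print(by_gap)
print(by_count)
```

MaxDiff places the boundaries where the data naturally separates, while equal-frequency grouping ignores the gaps and only balances the number of elements per range.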

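Use cases

You can consider this design pattern in the following cases:

When the data does not fit parametric models such as regression or log-linear models
When the data is continuous rather than discrete
When the data has ordered or unordered numeric attributes
When the data is skewed or sparse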

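Pattern implementation

The script loads the data and divides it into buckets using equal-width grouping. The data of the Transaction Amount field is grouped into buckets. The script counts the number of transactions in each bucket and returns the bucket range and the count as the output.

This pattern produces a reduced representation of the dataset in which the transaction amounts are divided into the specified number of buckets, together with the count of transactions that fall in each range. This data is then plotted as a histogram.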
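The equal-width logic can be sketched compactly in Python before reading the Pig script and the UDF. This is an approximation under stated assumptions, not the UDF itself: the bucket size is the maximum value divided by the number of buckets, as the script comments describe, but the label format, the treatment of values that fall exactly on a bucket edge, and the sample amounts are all assumptions made for illustration.

```python
import math
from collections import Counter

def bucket_range(value, max_value, bucket_count):
    """Return the 'lower-upper' label of the equal-width bucket containing value."""
    bucket_size = max_value / bucket_count
    # Assumption: a value on an upper edge belongs to the bucket ending at that edge.
    index = max(math.ceil(value / bucket_size) - 1, 0)
    index = min(index, bucket_count - 1)
    lower = index * bucket_size
    upper = lower + bucket_size
    return f"{lower:g}-{upper:g}"

def histogram(values, bucket_count):
    """Count how many values fall into each equal-width bucket."""
    max_value = max(values)
    return Counter(bucket_range(v, max_value, bucket_count) for v in values)

# Hypothetical positive transaction amounts.
amounts = [50.0, 105.0, 110.0, 150.0, 215.0, 400.0, 550.0]
counts = histogram(amounts, bucket_count=5)
for label, count in sorted(counts.items(), key=lambda kv: float(kv[0].split("-")[0])):
    print(label, count)
```

The seven amounts are reduced to four range and count pairs, which is the same kind of reduced representation that the Pig script stores on HDFS.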

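Code snippets

To illustrate how this pattern works, we consider a retail transactions dataset stored on HDFS. It contains attributes such as transaction ID, transaction date, customer ID, age, area, product subclass, product ID, quantity, asset, transaction amount, phone number, and country code. For this pattern, we will generate buckets on the Transaction Amount attribute. The following code snippet is the Pig script that illustrates the implementation of this pattern: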

/*
Register the custom UDF
*/
REGISTER '/home/cloudera/pdp/jars/databucketgenerator.jar';

/*
Define the alias generateBuckets for the custom UDF, the number of buckets(20) is passed as a parameter
*/
DEFINE generateBuckets com.datareduction.GenerateBuckets('20');

/*
Load the dataset into the relation transactions
*/
transactions = LOAD '/user/cloudera/pdp/datasets/data_reduction/transactions.csv' USING  PigStorage(',') AS (transaction_id:long,transaction_date:chararray, cust_id:chararray, age:chararray, area:chararray, prod_subclass:int, prod_id:long, quantity:int, asset:int, transaction_amt:double, phone_no:chararray, country_code:chararray);

/*
Maximum value of transactions amount and the actual transaction amount are passed to generateBuckets UDF
The UDF calculates the bucket size by dividing maximum transaction amount by the number of buckets.
It finds out the range to which each value belongs to and returns the value along with the bucket range
*/
transaction_amt_grpd = GROUP transactions ALL;
transaction_amt_min_max = FOREACH transaction_amt_grpd GENERATE MAX(transactions.transaction_amt) AS max_transaction_amt,FLATTEN(transactions.transaction_amt) AS transaction_amt;
transaction_amt_buckets = FOREACH transaction_amt_min_max GENERATE generateBuckets(max_transaction_amt,transaction_amt) AS range;

/*
Calculate the count of values in each range
*/
transaction_amt_buckets_grpd = GROUP transaction_amt_buckets BY range;
transaction_amt_buckets_count = FOREACH transaction_amt_buckets_grpd GENERATE group, COUNT(transaction_amt_buckets);

/*
The results are stored on HDFS in the directory histogram.
*/
STORE transaction_amt_buckets_count INTO '/user/cloudera/pdp/output/data_reduction/histogram';

The following snippet is the Java UDF code that illustrates the implementation of this pattern:

@Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() ==0)
      return null;
    try{
      //Extract the maximum transaction amount
      max = Double.parseDouble(input.get(0).toString());
      //Extract the value
      double rangeval = Double.parseDouble(input.get(1).toString());
      /*Calculate the bucket size by dividing maximum 
        transaction amount by the number of buckets.
      */
      setBucketSize();

      /*Set the bucket range by using the bucketSize and 
        noOfBuckets
      */
      setBucketRange();

      /*
      It finds out the range to which each value belongs 
      to and returns the value along with the bucket range
      */
      return getBucketRange(rangeval);
    } catch(Exception e){
      System.err.println("Failed to process input; error - " + e.getMessage());
      return null;
    }
  }

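Results

The following is a snippet of the results of applying this pattern to the dataset; the first column is the range of the Transaction Amount attribute and the second column is the count of transactions: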

1-110        45795
110-220      50083
220-330      60440
330-440      40001
440-550      52802

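The following is the histogram generated when this data is plotted using gnuplot. It graphically shows the transaction amount ranges and the number of transactions in each range.

[Figure: Output histogram]

Additional information

The complete code and datasets for this section are in the following GitHub directories: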

Chapter6/code/
Chapter6/datasets/

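Numerosity reduction - sampling design pattern

This design pattern explores the implementation of sampling techniques for data reduction.

Background

Sampling belongs to the numerosity reduction category of data reduction. It can be used as a data reduction technique because it represents a large amount of data with a much smaller subset.

Motivation

Sampling is essentially a data reduction method that determines an approximate subset of a population that has the characteristics of the population as a whole. It is a general-purpose approach to selecting a subset of the data that accurately represents the population. Sampling is performed through various methods that differ in how they define what goes into the subset and how they locate candidates for it.

In a big data scenario, the cost of analyzing the entire population (for example, for classification or optimization) is very high. Sampling helps reduce this cost because it shrinks the data space used for the actual analysis; the results are then extrapolated to the whole population. There is a small loss of accuracy, but it is far outweighed by the savings in time and storage.

When statistical sampling techniques are applied to big data, it is important to identify the population to be analyzed. Even if the collected data is very large, the sample may relate only to a small part of the population and may not represent the whole. Representativeness plays a vital role in choosing a sample because it determines how closely the sampled data resembles the population.

Sampling can be performed using probabilistic and nonprobabilistic methods. The following diagram gives an overview of the sampling techniques:

[Figure: Sampling methods]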

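Probability sampling methods use random selection, in which every element of the population has a known nonzero chance of being selected into the sampled subset. Probability sampling methods can use weighting to obtain unbiased samples of the population. The following are a few probability sampling methods:

Simple random sampling: This is the most basic type of sampling. Every element of the population has an equal chance of being selected into the subset, and the sample is chosen objectively at random. Simple random sampling can be done either by returning each selected item to the population so that it can be selected again (sampling with replacement) or by not returning it (sampling without replacement). Random sampling does not always produce a representative sample, and it is an expensive operation on very large datasets. The representativeness of a random sample can be improved by presampling the population using stratification or clustering.
The following diagram illustrates the difference between Simple Random Sampling Without Replacement (SRSWOR) and Simple Random Sampling With Replacement (SRSWR).

[Figure: SRSWOR vs SRSWR]
Stratified sampling: This technique is used when we already know that the population contains a number of distinct categories, which are used to organize the population into subpopulations (strata) from which a sample is then selected. The selected sample must contain elements from each subpopulation. This method focuses on the relevant subgroups and ignores the irrelevant ones. Representativeness improves because absolute randomness is removed: items are drawn by simple random sampling from within each independent category. Stratified sampling is the more efficient technique when the distinct categories of the strata are identified in advance. Stratification carries an overall time-cost trade-off, because initially identifying distinct categories for a relatively homogeneous population can be tedious.

Nonprobability sampling: This sampling method selects a subset of the population without giving every element of the population an equal chance of selection. The probability of selecting an element cannot be accurately determined, and elements are chosen purely on the basis of assumptions about the population of interest. Nonprobability sampling scores low on accurately representing the population, so the analysis cannot be extrapolated from the sample to the population. Nonprobability sampling methods include convenience sampling, judgment sampling, and quota sampling.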
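The probability sampling methods described above can be contrasted with a short Python sketch. It is only an in-memory illustration using the standard `random` module, not the scalable implementation used later in this pattern; the function names, the fixed seed, the 20 percent fraction, and the toy population of customer IDs tagged with an area are assumptions.

```python
import random
from collections import defaultdict

def srswor(population, sample_size, rng):
    """Simple random sampling without replacement: no item is picked twice."""
    return rng.sample(population, sample_size)

def srswr(population, sample_size, rng):
    """Simple random sampling with replacement: an item can be picked again."""
    return [rng.choice(population) for _ in range(sample_size)]

def stratified_sample(population, stratum_of, fraction, rng):
    """Draw a simple random sample of the given fraction from every stratum."""
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        # Take at least one element so that every stratum is represented.
        take = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, take))
    return sample

rng = random.Random(42)  # fixed seed so that the run is repeatable
# Hypothetical population: 75 customers in "north" and 25 in "south".
population = [(cust_id, "south" if cust_id % 4 == 0 else "north") for cust_id in range(100)]

without_replacement = srswor(population, 10, rng)
with_replacement = srswr(population, 150, rng)
stratified = stratified_sample(population, lambda item: item[1], 0.2, rng)
print(len(without_replacement), len(with_replacement), len(stratified))
```

Sampling with replacement can return more records than the population holds, which is why duplicates appear in it, whereas the stratified sample is guaranteed to contain customers from both areas.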

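Use cases

You can consider using the numerosity reduction - sampling design pattern in the following scenarios:

When the data is continuous or discrete
When every element of the data has an equal chance of being selected without affecting the representativeness of the sample
When the data has ordered or unordered attributes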

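Pattern implementation

This design pattern is implemented in Pig as a standalone script. It uses the datafu library, which implements SRSWR as a pair of UDFs, SimpleRandomSampleWithReplacementVote and SimpleRandomSampleWithReplacementElect; together they implement a scalable algorithm for SRSWR. The algorithm consists of two phases, vote and election. In the vote phase, candidates are voted for each position. In the election phase, one candidate is elected for each position. The output is a bag of sampled data.

The script uses the SRSWR technique to select a sample of 100,000 records from the transactions dataset.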
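The two-phase idea can be illustrated with a deliberately naive Python sketch. It is not the datafu algorithm, which is designed to limit how many candidates are emitted per position; here every record votes for every position. The function names, the record labels, and the rule that the smallest random score wins are assumptions made for illustration.

```python
import random

def vote(items, sample_size, rng):
    """Vote phase: every item proposes itself for every position with a random score."""
    for item in items:
        for position in range(sample_size):
            yield position, rng.random(), item

def elect(candidates):
    """Election phase: for each position, keep the candidate with the smallest score."""
    best = {}
    for position, score, item in candidates:
        if position not in best or score < best[position][0]:
            best[position] = (score, item)
    return [best[position][1] for position in sorted(best)]

rng = random.Random(7)  # fixed seed so that the run is repeatable
transactions = [f"txn-{i}" for i in range(20)]  # hypothetical records
sampled = elect(vote(transactions, sample_size=50, rng=rng))
print(len(sampled), len(set(sampled)))
```

Because the scores are independent and uniformly distributed, each position elects every record with equal probability, independently of the other positions, which is exactly sampling with replacement. Grouping the candidates by position, as the Pig script does, lets the election run in parallel.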

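Code snippets

To illustrate how this pattern works, we consider a retail transactions dataset stored on HDFS. It contains attributes such as transaction ID, transaction date, customer ID, age, area, product subclass, product ID, quantity, asset, transaction amount, phone number, and country code. For this pattern, we will perform SRSWR on the transactions dataset. The following code snippet is the Pig script that illustrates the implementation of this pattern: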

/*
Register datafu and commons math jar files
*/
REGISTER '/home/cloudera/pdp/jars/datafu-1.2.0.jar';
REGISTER '/home/cloudera/pdp/jars/commons-math3-3.2.jar';

/*
Define aliases for the classes SimpleRandomSampleWithReplacementVote and SimpleRandomSampleWithReplacementElect
*/
DEFINE SRSWR_VOTE  datafu.pig.sampling.SimpleRandomSampleWithReplacementVote();
DEFINE SRSWR_ELECT datafu.pig.sampling.SimpleRandomSampleWithReplacementElect();

/*
Load the dataset into the relation transactions
*/
transactions= LOAD '/user/cloudera/pdp/datasets/data_reduction/transactions.csv' USING  PigStorage(',') AS (transaction_id:long,transaction_date:chararray, cust_id:chararray, age:int, area:chararray, prod_subclass:int, prod_id:long, quantity:int, asset:int, transaction_amt:double, phone_no:chararray, country_code:chararray);

/*
The input to Vote UDF is the bag of items, the desired sample size (100000 in our use case) and the actual population size.
  This UDF votes candidates for each position
*/
summary = FOREACH (GROUP transactions ALL) GENERATE COUNT(transactions) AS count;
candidates = FOREACH transactions GENERATE FLATTEN(SRSWR_VOTE(TOBAG(TOTUPLE(*)), 100000, summary.count));

/*
The Elect UDF elects one candidate for each position and returns a bag of sampled items stored in the relation sampled
*/
sampled = FOREACH (GROUP candidates BY position PARALLEL 10) GENERATE FLATTEN(SRSWR_ELECT(candidates));

/*
The results are stored on the HDFS in the directory sampling
*/
STORE sampled into '/user/cloudera/pdp/output/data_reduction/sampling';

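Results

The following is a snippet of the results obtained after sampling the transactions data. A few columns have been removed to improve readability.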

580493 … 1621624 … … … … 1 115 576 900-435-5791 U.S.A
193016 … 1808643 … … … … 1 119 735 9020138550 U.S.A
800748 … 199995 … … … … 1 28 1577 904-066-467q USA

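The result is a file containing 100,000 records that form a sample of the original dataset.

Additional information

The complete code and datasets for this section are in the following GitHub directories: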

Chapter6/code/
Chapter6/datasets/

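Numerosity reduction - clustering design pattern

This design pattern explores the implementation of clustering techniques for data reduction.

Background

Clustering belongs to the numerosity reduction category of data reduction. Clustering is a nonparametric model that uses unsupervised learning and works without prior knowledge of class labels.

Motivation

Clustering is a general approach to the problem of grouping data. It can be implemented by a variety of algorithms that differ in how they define what goes into a group and how they find candidates for that group. There are more than 100 different implementations of clustering algorithms, addressing a variety of problems for different goals. No single clustering algorithm fits every problem; the right one has to be chosen through careful experimentation. A clustering algorithm that works well for one data model does not always work for a different one. Clustering is widely used in machine learning, image analysis, pattern recognition, and information retrieval.

The goal of clustering is to partition a dataset based on a set of heuristics and effectively reduce its size. Clustering is somewhat similar to binning in that it mimics the grouping approach of binning; the difference lies in the precise way the groups are formed.

The partitioning is done so that the data in one cluster is similar to other data in the same cluster but dissimilar to data in other clusters. Here, similarity is defined as a measure of how close the data points are to each other.

K-means is one of the most widely used clustering methods. K-means cluster analysis partitions the observations into k clusters, with each observation belonging to the cluster with the nearest mean. It is an iterative process that stabilizes only when the cluster centroids no longer move.

The quality of a clustering can be measured by the diameter of each cluster or by the average distance of its objects from the cluster centroid.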
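The iteration and the quality measure can be sketched for one-dimensional data in a few lines of Python. This is an illustrative sketch and not Mahout's implementation: the function names and the ages are hypothetical, and the initial centroids are fixed by hand here rather than chosen at random.

```python
def kmeans_1d(values, centroids, max_iterations=15):
    """Iterate assignment and mean update until the centroids stop moving."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each value joins the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: the centroids no longer move
        centroids = updated
    return centroids, clusters

def mean_distance(cluster, centroid):
    """Average distance of the cluster members from the centroid."""
    return sum(abs(value - centroid) for value in cluster) / len(cluster)

ages = [23, 24, 25, 41, 42, 43, 62, 66, 67]  # hypothetical customer ages
centroids, clusters = kmeans_1d(ages, centroids=[20.0, 40.0, 60.0])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster, mean_distance(cluster, centroid))
```

A smaller average distance indicates a tighter cluster, so the measure can be used to compare runs with different values of k or different initial centroids.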

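We can intuitively understand the need for clustering to reduce data through the example of a clothing company that plans to release a new T-shirt to the market. Without a data reduction technique, the company would end up making T-shirts in many different sizes to cater to different people. To prevent this, it reduces the data: it first records people's heights and weights, plots them on a graph, and divides them into three groups: small, medium, and large.

The K-means method takes the dataset of heights and weights (n observations) and partitions it into k clusters (three in this case). For each cluster (small, medium, and large), the data points in the cluster are closest to the mean of that cluster (for example, the mean height and weight of the small group). K-means gives us the three sizes that best fit everyone, which effectively reduces the complexity of the data; instead of dealing with the actual data, we can replace it with the clusters themselves.

Note

We have considered Mahout's implementation of K-means; more information is available at https://mahout.apache.org/users/clustering/k-means-clustering.html.

From a big data perspective, the trade-off between time and quality has to be considered when choosing a clustering algorithm, because of the large volume of data to be processed. New research is under way to develop preclustering methods that can process big data efficiently. The result of a preclustering method, however, is an approximate prepartitioning of the original dataset, which is eventually clustered again using traditional methods such as K-means.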

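Use cases

You can consider this design pattern in the following cases:

When the data is continuous or discrete and the class labels of the data are not known in advance
When the data needs to be preprocessed through clustering before a large amount of data is eventually classified
When the data has numeric, ordered, or unordered attributes
When the data is categorical
When the data is not skewed, sparse, or fuzzy

Pattern implementation

This design pattern is implemented in Pig and Mahout. The dataset is loaded into Pig. The age attribute, on which K-means clustering is to be performed, is converted into vectors and stored in a Mahout-readable format. Mahout's K-means clustering is then applied to the age attribute of the transactions dataset. K-means clustering partitions the observations into k clusters, with each observation belonging to the cluster with the nearest mean; the process is iterative and stabilizes only when the cluster centroids no longer move.

This pattern produces a reduced representation of the dataset in which the age attribute is divided into a predefined number of clusters. This information can be used to identify the age groups of the customers who visit the store.

Code snippets

To illustrate how this pattern works, we consider a retail transactions dataset stored on HDFS. It contains attributes such as transaction ID, transaction date, customer ID, age, area, product subclass, product ID, quantity, asset, transaction amount, phone number, and country code. For this pattern, we will perform K-means clustering on the age attribute. The following code snippet is the Pig script that illustrates the implementation of this pattern: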

/*
Register the required jar files
*/
REGISTER '/home/cloudera/pdp/jars/elephant-bird-pig-4.3.jar';
REGISTER '/home/cloudera/pdp/jars/elephant-bird-core-4.3.jar';
REGISTER '/home/cloudera/pdp/jars/elephant-bird-mahout-4.3.jar';
REGISTER '/home/cloudera/pdp/jars/elephant-bird-hadoop-compat-4.3.jar';
REGISTER '/home/cloudera/mahout-distribution-0.7/lib/json-simple-1.1.jar';
REGISTER '/home/cloudera/mahout-distribution-0.7/lib/guava-r09.jar';
REGISTER '/home/cloudera/mahout-distribution-0.7/mahout-examples-0.7-job.jar'; 
REGISTER '/home/cloudera/pig-0.11.0/contrib/piggybank/java/piggybank.jar';

/*
Use declare to create aliases.
declare is a preprocessor statement and is processed before running the script
*/
%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';
%declare VECTOR_CONVERTER 'com.twitter.elephantbird.pig.mahout.VectorWritableConverter';
%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';

/*
Load the dataset into the relation transactions
*/
transactions = LOAD '/user/cloudera/pdp/datasets/data_reduction/transactions.csv' USING  PigStorage(',') AS (id:long,transaction_date:chararray, cust_id:int, age:int, area:chararray, prod_subclass:int, prod_id:long, quantity:int, asset:int, transaction_amt:double, phone_no:chararray, country_code:chararray);

/*
Extract the columns on which clustering has to be performed
*/
age = FOREACH transactions GENERATE id AS tid, 1 AS index, age AS cust_age;

/*
Generate tuples from the parameters
*/
grpd = GROUP age BY tid;
vector_input = FOREACH grpd generate group, org.apache.pig.piggybank.evaluation.util.ToTuple(age.(index, cust_age));
/*
Use elephant bird functions to store the data into sequence file (mahout readable format)
cardinality represents the dimension of the vector.
*/
STORE vector_input INTO '/user/cloudera/pdp/output/data_reduction/kmeans_preproc' USING $SEQFILE_STORAGE (
 '-c $TEXT_CONVERTER', '-c $VECTOR_CONVERTER -- -cardinality 100'
);

The following is a snippet of the shell script with the commands that run K-means clustering on Mahout:

#All the mahout jars have to be included in classpath before execution of this script.
#Create the output directory on HDFS before executing VectorConverter
hadoop fs -mkdir /user/cloudera/pdp/output/data_reduction/kmeans_preproc_nv

#Execute vectorconverter jar to convert the input to named vectors
hadoop jar /home/cloudera/pdp/data_reduction/vectorconverter.jar com.datareduction.VectorConverter /user/cloudera/pdp/output/data_reduction/kmeans_preproc/ /user/cloudera/pdp/output/data_reduction/kmeans_preproc_nv/

#The following Mahout command shows the usage of kmeans. The algorithm takes the input vectors from the path specified in the -i argument and chooses the initial clusters at random; -k specifies the number of clusters as 3, -x specifies the maximum number of iterations as 15, -dm specifies the distance measure to use (Euclidean distance), and -cd specifies the convergence threshold as 0.01
/home/cloudera/mahout-distribution-0.7/bin/mahout kmeans -i /user/cloudera/pdp/output/data_reduction/kmeans_preproc_nv/ -c kmeans-initial-clusters -k 3 -o /user/cloudera/pdp/output/data_reduction/kmeans_clusters -x 15 -ow -cl -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.01

# Execute cluster dump command to print information about the cluster
/home/cloudera/mahout-distribution-0.7/bin/mahout clusterdump --input /user/cloudera/pdp/output/data_reduction/kmeans_clusters/clusters-4-final --pointsDir /user/cloudera/pdp/output/data_reduction/kmeans_clusters/clusteredPoints --output age_kmeans_clusters

Results

The following is a snippet of the results of applying this pattern to the transactions dataset:

VL-817732{n=309263 c=[1:45.552] r=[1:4.175]}
  Weight : [props - optional]:  Point:
  1.0: 1 = [1:48.000]
  1.0: 2 = [1:42.000]
  1.0: 3 = [1:42.000]
  1.0: 4 = [1:41.000]
VL-817735{n=418519 c=[1:32.653] r=[1:4.850]}
  Weight : [props - optional]:  Point:
  1.0: 5 = [1:24.000]
  1.0: 7 = [1:38.000]
  1.0: 12 = [1:34.000]
  1.0: 14 = [1:23.000]
VL-817738{n=89958 c=[1:65.198] r=[1:5.972]}
  Weight : [props - optional]:  Point:
  1.0: 6 = [1:66.000]
  1.0: 8 = [1:58.000]
  1.0: 16 = [1:62.000]
  1.0: 24 = [1:74.000]

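VL-XXXXXX is the identifier of a converged cluster, c is the centroid (a vector), n is the number of points in the cluster, and r is the radius (a vector). The data is divided into three clusters as specified in the K-means command. When this data is visualized, we can infer that values between 41 and 55 are grouped under cluster 1, values between 20 and 39 under cluster 2, and values between 56 and 74 under cluster 3.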
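The age ranges can also be derived directly from the reported centroids, because K-means assigns every value to its nearest centroid. The following Python sketch uses the three centroid values from the clusterdump output above; the function names are hypothetical.

```python
# Centroids copied from the clusterdump output above (the c=[1:...] values).
centroids = {"VL-817732": 45.552, "VL-817735": 32.653, "VL-817738": 65.198}

def nearest_cluster(age):
    """Return the identifier of the cluster whose centroid is closest to age."""
    return min(centroids, key=lambda cluster_id: abs(age - centroids[cluster_id]))

def boundaries():
    """Midpoints between neighbouring centroids, where the assignment changes."""
    ordered = sorted(centroids.values())
    return [(low + high) / 2 for low, high in zip(ordered, ordered[1:])]

print(boundaries())
for age in (39, 41, 55, 56):
    print(age, nearest_cluster(age))
```

The midpoints, about 39.1 and 55.4, agree with the age ranges inferred from the visualization.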

Additional information

The complete code and datasets for this section are in the following GitHub directories:

Chapter6/code/
Chapter6/datasets/

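Summary

In this chapter, you have studied various data reduction techniques that aim to obtain a reduced representation of the data. We explored design patterns that perform dimensionality reduction using the principal component analysis technique and numerosity reduction using the clustering, sampling, and histogram techniques.

In the next chapter, you will explore advanced patterns that use Pig to model social media data and that use text classification and other related techniques to better understand context. We will also look at how the Pig language is expected to evolve in the future.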