
Pro Hadoop Data Analytics (Part 2)

Original: Pro Hadoop Data Analytics

License: CC BY-NC-SA 4.0

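5. Data Pipelines and How to Construct Them

In this chapter, we will discuss how to construct basic data pipelines using standard data sources and the Hadoop ecosystem. We provide an end-to-end example of how data sources may be linked and processed using Hadoop and other analytical components, and how this is similar to a standard ETL process. We will develop the ideas presented in this chapter in more detail in Chapter 15.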

A Note About the Example System Structure

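Since we are about to begin developing the example system in earnest, a note about its package structure is appropriate here. The basic package structure of the example system developed throughout the book is shown in Figure 5-1, and it is reproduced in Appendix A. Before moving on to data pipeline construction, let's briefly examine what the packages contain and what they do. Figure 5-2 gives a brief description of some of the main sub-packages of the Probda system.

Figure 5-1. Fundamental package structure for the analytics system

Figure 5-2. Brief description of the packages in the Probda example system

In this chapter, we will concentrate on the classes in the package com.apress.probda.pipeline.

Five basic Java classes are provided in the code contribution. They enable you to use basic data pipeline strategies for reading, transforming, and writing different data sources. See the code contribution notes for more details.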

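5.1 The Basic Data Pipeline

A basic distributed data pipeline might look like the architecture diagram in Figure 5-3.

Figure 5-3. A basic data pipeline architecture diagram

We can use standard off-the-shelf software components to implement this type of architecture.

We will use Apache Kafka, Beam, Storm, Hadoop, Druid, and Gobblin (formerly Camus) to build our basic pipeline.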

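5.2 Introduction to Apache Beam

Apache Beam ( http://incubator.apache.org/projects/beam.html ) is a toolkit designed specifically for constructing data pipelines. It has a unified programming model, which suits us well, because designing and constructing distributed data pipelines is at the core of our approach throughout this book. Whether you use Apache Hadoop, Apache Spark, or Apache Flink as the core technology, Apache Beam fits into the technology stack in a very logical way. At the time of writing, Apache Beam was still an incubating project, so check the web page for its current status.

The key concepts in the Apache Beam programming model are:

"PCollection": represents a collection of data, which may be bounded or unbounded in size
"PTransform": represents a computation that transforms input PCollections into output PCollections
"Pipeline": manages a directed acyclic graph of PTransforms and PCollections that is ready for execution
"PipelineRunner": specifies where and how the pipeline should execute

These basic elements can be used to construct pipelines with many different topologies, as in the example code in Listing 5-1.

Figure 5-4. Successful Maven build of Apache Beam, showing the reactor summary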

// Input lines for the word count test
static final String[] WORDS_ARRAY = new String[] {
    "probda analytics", "probda", "probda pro analytics",
    "probda one", "three probda", "two probda"};

static final List<String> TEST_WORDS = Arrays.asList(WORDS_ARRAY);

// Expected output, one "word: count" string per distinct word
static final String[] WORD_COUNT_ARRAY = new String[] {
    "probda: 6", "one: 1", "pro: 1", "two: 1", "three: 1", "analytics: 2"};

@Test
@Category(RunnableOnService.class)
public void testCountWords() throws Exception {
    Pipeline p = TestPipeline.create();

    // Turn the in-memory test data into a PCollection
    PCollection<String> input = p.apply(Create.of(TEST_WORDS).withCoder(StringUtf8Coder.of()));

    // Count the words, then format each count as text
    PCollection<String> output = input.apply(new CountWords())
        .apply(MapElements.via(new FormatAsTextFn()));

    PAssert.that(output).containsInAnyOrder(WORD_COUNT_ARRAY);
    p.run().waitUntilFinish();
}

To build, cd to contribs/Hadoop and run the Maven build:
    mvn clean package

Listing 5-1. Apache Beam test code snippet example
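To make the expected values in Listing 5-1 easier to check by hand, here is a minimal plain-Java sketch of the same computation, with no Beam dependency. The class name WordCountSketch and its two helper methods are assumptions made for illustration; they are not part of Apache Beam or of the book's code contribution, and the sketch only mirrors what the CountWords and FormatAsTextFn steps are expected to produce for this input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class: a local, non-distributed stand-in for the Beam pipeline.
class WordCountSketch {

    // Same input as TEST_WORDS in Listing 5-1.
    static final List<String> TEST_WORDS = Arrays.asList(
        "probda analytics", "probda", "probda pro analytics",
        "probda one", "three probda", "two probda");

    /** Splits each line on whitespace and counts how often each word occurs. */
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    /** Formats one entry in the "word: count" style used by WORD_COUNT_ARRAY. */
    static String formatAsText(String word, long count) {
        return word + ": " + count;
    }

    public static void main(String[] args) {
        countWords(TEST_WORDS)
            .forEach((word, count) -> System.out.println(formatAsText(word, count)));
    }
}
```

Running the sketch prints one line per distinct word, which can be compared against WORD_COUNT_ARRAY. In a real pipeline the same two steps run as PTransforms over a PCollection, so the work can be distributed by whichever runner is configured.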

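5.3 Introduction to Apache Falcon

Apache Falcon ( https://falcon.apache.org ) is a feed processing and feed management system aimed at making it easier for end users to onboard their feed processing and feed management on Hadoop clusters. By managing these "feeds," it makes onboarding and establishing data flows much simpler. It also has other useful features, including:

Establishing relationships between the various data and processing elements in a Hadoop environment
Feed management services such as feed retention, replication across clusters, archival, and so on
Easy onboarding of new workflows and pipelines, with support for late data handling and retry policies
Integration with a metastore or catalog such as Hive/HCatalog
Notifications to end customers based on the availability of feed groups (logical groups of related feeds, which are likely to be used together)
Support for use cases that involve local processing in a colo and global aggregation
Capture of lineage information for feeds and processes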

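5.4 Data Sources and Sinks: Using Apache Tika to Construct a Pipeline

Apache Tika (tika.apache.org) is a content analysis toolkit. See the Apache Tika installation instructions in Appendix A.

With Apache Tika, almost all mainstream data sources can be used with a distributed data pipeline.

In this example, we will load a special kind of data file, in DBF format, use Apache Tika to process the result, and use a JavaScript visualizer to observe the results of our work.

DBF files are typically used to represent standard row-oriented database data, such as the data shown in Listing 5-2.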

Map: 26 has: 8 entries...
STATION-->Numeric
5203
MAXDAY-->Numeric
20
AV8TOP-->Numeric
9.947581
MONITOR-->Numeric
36203
LAT-->Numeric
34.107222
LON-->Numeric
-117.273611
X_COORD-->Numeric
474764.37263
Y_COORD-->Numeric
3774078.43207

A typical method for reading DBF files is shown in Listing 5-3.

public static List<Map<String, Object>> readDBF(String filename) {
        Charset stringCharset = Charset.forName("Cp866");
        List<Map<String, Object>> maps = new ArrayList<Map<String, Object>>();
        try {
                File file = new File(filename);
                DbfReader reader = new DbfReader(file);
                DbfMetadata meta = reader.getMetadata();
                DbfRecord rec = null;
                int i = 0;
                // Read one record at a time and convert it to a field-name/value map
                while ((rec = reader.read()) != null) {
                        rec.setStringCharset(stringCharset);
                        Map<String, Object> map = rec.toMap();
                        System.out.println("Map: " + i + " has: " + map.size() + " entries...");
                        maps.add(map);
                        i++;
                }
                reader.close();
        } catch (IOException e) {
                e.printStackTrace();
        } catch (ParseException pe) {
                pe.printStackTrace();
        }
        System.out.println("Read DBF file: " + filename + " , with : " + maps.size() + " results...");
        return maps;
}
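Once the records are available as a list of maps, passing them to the next pipeline stage is mostly a matter of choosing columns and a serialization. The following is a minimal sketch, assuming the maps have the shape returned by readDBF() above; the class DbfRecordsToCsv and its method are hypothetical names introduced here, and the sample values are taken from Listing 5-2.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: renders record maps (as produced by readDBF) as CSV lines.
class DbfRecordsToCsv {

    /** Returns a header line followed by one line per record, using only the named columns. */
    static List<String> toCsvLines(List<Map<String, Object>> records, String... columns) {
        List<String> lines = new ArrayList<>();
        lines.add(String.join(",", columns));              // header row
        for (Map<String, Object> record : records) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                Object value = record.get(column);
                cells.add(value == null ? "" : value.toString());  // missing field becomes an empty cell
            }
            lines.add(String.join(",", cells));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for one map returned by readDBF(), with values from Listing 5-2.
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("STATION", 5203);
        record.put("LAT", 34.107222);
        record.put("LON", -117.273611);

        List<Map<String, Object>> records = new ArrayList<>();
        records.add(record);

        toCsvLines(records, "STATION", "LAT", "LON").forEach(System.out::println);
    }
}
```

The sketch does no quoting or escaping, so it is only suitable for numeric fields like the ones in this data set; text fields that may contain commas would need a proper CSV writer.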

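Gobblin ( http://gobblin.readthedocs.io/en/latest/ ), formerly known as Camus, is another example of a system based on the "universal analytics paradigm" we talked about earlier.

"Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources (such as databases, REST APIs, FTP/SFTP servers, filers, and so on) onto Hadoop. Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, and so on. Gobblin ingests data from different data sources in the same execution framework, and manages metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework."

Figure 5-5 shows a successful installation of the Gobblin system.

Figure 5-5. A successful installation of Gobblin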

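5.5 Computation and Transformation

The computation and transformation of our data stream can be performed in a small number of simple steps. There are several candidates for this part of the processing pipeline, including Splunk and the commercial offering Rocana Transform.

We can use Splunk as a basis, or we can use Rocana Transform. Rocana is a commercial product, so to use it you can either purchase it or use the free evaluation trial version.

Rocana Transform ( https://github.com/scalingdata/rocana-transform-action-plugin ) is a configuration-driven transformation library that can be embedded in any JVM-based stream processing or batch processing system, such as Spark Streaming, Storm, Flink, or Apache MapReduce.

One of the code contribution examples shows how to build a Rocana transformation engine plug-in that can perform event data processing in the example system.

In Rocana, a transformation plug-in is composed of two important classes, one based on the Action interface and the other based on the ActionBuilder interface, as described in the code contribution.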
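The two-class split described above (a builder that reads configuration and produces an action, and an action that transforms events) can be illustrated with a small sketch. Note that the interfaces below, SketchAction and SketchActionBuilder, are hypothetical stand-ins written for this example; they are not Rocana's actual Action and ActionBuilder signatures, which are documented in the plug-in repository and the code contribution.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical interfaces that mimic the action/builder pattern.
class TransformPluginSketch {

    /** Hypothetical stand-in for an action: maps one input event to one output event. */
    interface SketchAction {
        Map<String, Object> apply(Map<String, Object> event);
    }

    /** Hypothetical stand-in for a builder: turns plug-in configuration into an action. */
    interface SketchActionBuilder {
        SketchAction build(Map<String, String> config);
    }

    /** Builds an action that upper-cases the string field named by the "field" setting. */
    static class UpperCaseFieldBuilder implements SketchActionBuilder {
        @Override
        public SketchAction build(Map<String, String> config) {
            final String field = config.get("field");
            return event -> {
                Map<String, Object> result = new HashMap<>(event);  // leave the input event untouched
                Object value = result.get(field);
                if (value instanceof String) {
                    result.put(field, ((String) value).toUpperCase());
                }
                return result;
            };
        }
    }

    public static void main(String[] args) {
        Map<String, String> config = new HashMap<>();
        config.put("field", "crimedescr");
        SketchAction action = new UpperCaseFieldBuilder().build(config);

        Map<String, Object> event = new HashMap<>();
        event.put("crimedescr", "fraud");
        System.out.println(action.apply(event));
    }
}
```

Separating configuration parsing (the builder) from per-event work (the action) means the configuration is validated once, while the action stays small enough to be called for every event in a stream.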

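5.6 Visualizing and Reporting the Results

Some visualization and reporting is best done with notebook-oriented software tools. Many, such as Jupyter, are based on Python; Apache Zeppelin is another notebook tool of this kind. Recall that the Python ecosystem looks something like Figure 5-6. Jupyter and Zeppelin would fall under the "other packages and toolkits" heading, but that does not mean they are unimportant.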

Figure 5-6. Basic Python ecosystem, with a place for notebook-based visualizers

Figure 5-7. Initial installer diagram for the Anaconda Python system

Figure 5-8. Successful installation of the Anaconda Python system

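We will look at several sophisticated visualization toolkits in the chapters to come, but for now let's start with one of the more popular JavaScript-based toolkits, D3, which can be used to visualize a wide variety of data sources and presentation types. These include geolocations and maps; standard pie, line, and bar charts; tabular reports; and many other things (custom presentation types, graph database output, and more).

Once Anaconda is working correctly, we can go on to install another extremely useful toolkit, TensorFlow. TensorFlow ( https://www.tensorflow.org ) is a machine learning library that also includes support for a variety of "deep learning" techniques.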

Figure 5-9. Successfully running the Jupyter notebook program

Figure 5-10. Successfully installing Anaconda

Note

Recall that to build Zeppelin, you run the following command:

mvn clean package -Pcassandra-spark-1.5 -Dhadoop.version=2.6.0 -Phadoop-2.6 -DskipTests

Figure 5-11. Sophisticated visualizations may be created using the Jupyter visualization feature.

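5.7 Summary

In this chapter, we discussed how to build some basic distributed data pipelines and gave an overview of some of the more effective toolkits, stacks, and strategies for organizing and building your data pipeline. Among these were Apache Tika, Gobblin, Spring Integration, and Apache Flink. We also installed Anaconda (which makes a Python development environment easier to set up and use), as well as an important machine learning library, TensorFlow.

In addition, we looked at a variety of input and output formats, including the ancient but still useful DBF format.

In the next chapter, we will discuss advanced search techniques using Lucene and Solr, and introduce some interesting newer extensions of Lucene, such as Elasticsearch.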


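6. Advanced Search Techniques with Hadoop, Lucene, and Solr

In this chapter, we will describe the structure and use of the Apache Lucene and Solr third-party search engine components, how to use them with Hadoop, and how to develop advanced search capabilities customized for an analytical application. We will also investigate some newer Lucene-based search frameworks, primarily Elasticsearch, a premier search tool that is particularly well suited to building distributed analytic data pipelines. We will also discuss the extended Lucene/Solr ecosystem and some real-world programming examples of how to use Lucene and Solr in distributed big data analytics applications.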

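6.1 Introduction to the Lucene/SOLR Ecosystem

As we discussed in the overview of Lucene and Solr in Chapter 1, Apache Lucene (lucene.apache.org) is a key technology to know about when you are building customized search components, and for good reason. It is one of the oldest Apache components and has had a long time to mature. In spite of its age, the Lucene/Solr project has been the focus of some interesting new developments in search technology. As of 2010, Lucene and Solr have been merged into a single Apache project. Some of the main components of the Lucene/Solr ecosystem are shown in Figure 6-1.

Figure 6-1. The Lucene/SOLR ecosystem, with some useful additions

SolrCloud, a newer addition to the Lucene/Solr technology stack, allows multicore processing with a RESTful interface. To read more about SolrCloud, visit the information page at https://cwiki.apache.org/confluence/display/solr/SolrCloud .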

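6.2 Lucene Query Syntax

Over the lifetime of the Lucene project, Lucene queries have evolved to include some sophisticated extensions to the basic query syntax of the past. While the Lucene query syntax may change from version to version (and it has evolved a great deal since its introduction into Apache in 2001), most of the functionality and search types remain constant, as shown in Table 6-1.

Table 6-1. Lucene query types and how to use them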

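| Type of search component | Syntax | Example | Description |
| --- | --- | --- | --- |
| Free-form text | word or "phrase" | `"to be or not to be"` | Unquoted words, or phrases in double quotes |
| Keyword search | field name, colon, value | `city:Sunnyvale` | The field to search, a colon, and the string to search for |
| Boosting | term or phrase followed by a caret and a boost value | `term^3` | Use the caret symbol to give a term a new boost value. |
| Wildcard search | the symbol * can be used as a wildcard | `*kerry` | Wildcard search with the "*" or "?" symbols |
| Fuzzy search | term followed by a tilde | `Hadoop~` | Fuzzy search uses the tilde symbol to request matches that are close according to the Levenshtein distance metric. |
| Grouping | ordinary parentheses provide grouping | `(java OR C)` | Use parentheses to provide subqueries. |
| Field grouping | parentheses and a colon clarify the query string | `title:(+gift +"of the magi")` | Grouping qualified by a field name, with ordinary parentheses providing the grouping |
| Range search | field name and colon, followed by a range clause | `startDate:[20020101 TO 20030101]` `heroes:{Achilles TO Zoroaster}` | Square brackets (inclusive) or curly braces (exclusive) and the TO keyword construct a range clause, such as {Achilles TO Zoroaster}. |
| Proximity search | quoted phrase, tilde, proximity value | `"jakarta apache"~10` | Proximity search uses the tilde symbol to indicate how close to one another the matching terms must be. |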
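The syntax in Table 6-1 is plain text, so query strings can be assembled with ordinary string operations before they are handed to a query parser or to Solr. The following is a minimal sketch; the class LuceneQueryStrings and its method names are assumptions made for illustration and are not part of the Lucene API. It performs no escaping of special characters, which real input would need.

```java
// Hypothetical helper: builds query strings in the syntax summarized in Table 6-1.
class LuceneQueryStrings {

    /** field:value keyword search. */
    static String keyword(String field, String value) {
        return field + ":" + value;
    }

    /** A phrase in double quotes. */
    static String phrase(String text) {
        return "\"" + text + "\"";
    }

    /** term^factor boost. */
    static String boost(String term, int factor) {
        return term + "^" + factor;
    }

    /** term~ fuzzy search. */
    static String fuzzy(String term) {
        return term + "~";
    }

    /** "phrase"~distance proximity search. */
    static String proximity(String phraseText, int distance) {
        return phrase(phraseText) + "~" + distance;
    }

    /** field:[from TO to], both end points included. */
    static String inclusiveRange(String field, String from, String to) {
        return field + ":[" + from + " TO " + to + "]";
    }

    /** field:{from TO to}, both end points excluded. */
    static String exclusiveRange(String field, String from, String to) {
        return field + ":{" + from + " TO " + to + "}";
    }

    public static void main(String[] args) {
        System.out.println(keyword("city", "Sunnyvale"));
        System.out.println(boost("term", 3));
        System.out.println(fuzzy("Hadoop"));
        System.out.println(proximity("jakarta apache", 10));
        System.out.println(inclusiveRange("startDate", "20020101", "20030101"));
        System.out.println(exclusiveRange("heroes", "Achilles", "Zoroaster"));
    }
}
```

Each printed line corresponds to one example cell in the table, which makes it easy to paste the results into the Solr dashboard query box for experimentation.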

Installing Hadoop, Apache Solr, and NGDATA Lily

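In this section, we will give a brief overview of how to install Hadoop, Lucene/Solr, and NGData's Lily project, and suggest some "quick start" techniques to get a Lily installation up and running for development and test purposes.

First, install Hadoop. This is a download, unzip, configure, and run process similar to the many others you have encountered in this book.

When you have successfully installed and configured Hadoop, and have set up the HDFS file system, you should be able to execute some simple Hadoop commands, such as

hadoop fs -ls /

After executing this, you should see a screen similar to the one in Figure 6-2.

Figure 6-2. Successful test of installation of Hadoop and the Hadoop Distributed File System (HDFS)

Second, install Solr. This is simply a matter of downloading the zip file, uncompressing it, changing to the binary directory, and starting the server right away from the command line.

A successful installation of Solr can be tested as shown in Figure 6-3.

Figure 6-3. A successful installation and start of the Solr server

Third, download NGDATA's Lily project from the GitHub project at https://github.com/NGDATA/lilyproject .

Getting Hadoop, Lucene, Solr, and Lily to cooperate in the same software environment can be tricky, so we include some tips on setting up the environment, some of which you may have forgotten.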

Tips On Using HADOOP With SOLR and LUCENE

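Make sure you can log in with "ssh" without a password. This is essential for Hadoop to work correctly. It does not hurt to exercise your Hadoop installation from time to time to make sure all the moving parts are working. A quick test of Hadoop functionality can be done on the command line with just a few commands, such as the hadoop fs -ls / command shown earlier.
Make sure your environment variables are set correctly, and configure your init files appropriately. This includes things like your .bash_profile file, if you are on MacOS, for example.
Test component interactions frequently. There are a lot of moving parts in distributed systems. Perform individual tests to make sure each part is working smoothly.
Test interactions in standalone, pseudo-distributed, and fully distributed modes when appropriate. This includes investigating suspicious performance problems, hangs, unexpected stalls and errors, and version incompatibilities.
Watch for version incompatibilities in your pom.xml, and practice good pom.xml hygiene at all times. Make sure your infrastructure components, such as Java, Maven, Python, npm, Node, and the rest, are up to date and compatible. Please note: most of the examples in this book use Java 8 (and some examples rely on the advanced features present in Java 8), as well as Maven 3+. When in doubt, use java -version and mvn -version!
Perform "overall optimization" throughout your technology stack. This includes the Hadoop, Solr, and data source/sink levels. Identify bottlenecks and resource problems. If you are running on a small Hadoop cluster, identify "problem hardware," especially individual "problem processors."
Exercise the multicore functionality in your application frequently. It is rare to use a single core in a sophisticated application, so make sure using more than one core works smoothly.
Perform integration testing religiously.
Performance monitoring is a must. Use a standard performance monitoring "script" and evaluate performance against previous results and current expectations. Upgrade hardware and software as required to improve performance results, and re-monitor to ensure accurate profiling.
Do not neglect unit tests. A good introduction to writing unit tests for current versions of Hadoop can be found at https://wiki.apache.org/hadoop/HowToDevelopUnitTests .

Apache Katta ( http://katta.sourceforge.net/about ) is a useful addition to any Solr-based distributed data pipeline architecture. It allows Hadoop indexing into shards, along with many other advanced features.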

How to Install and Configure Apache Katta

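Download Apache Katta from the repository at https://sourceforge.net/projects/katta/files/ . Unzip the file.
Add the Katta environment variables to your .bash_profile file (if you are running under MacOS), or to the appropriate start-up file (if you are running a version of Linux). These variables include the following (please note that these are examples only; substitute your own appropriate path values here):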
```
export KATTA_HOME=/Users/kerryk/Downloads/katta-core-0.6.4
```
Then add the Katta binaries to the PATH so you can call them directly:
```
export PATH=$KATTA_HOME/bin:$PATH
```
Check to make sure the Katta process is running correctly by typing
```
ps -al | grep katta
```
on the command line. You should see an output similar to Figure 6-4.

Figure 6-4. A successful initialization of the Katta Solr subsystem

Successfully running the Katta component will produce results similar to those in Figure 6-5.

Figure 6-5. Successful installation and run of Apache Katta screen

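6.3 Programming Examples Using SOLR

We will work through a complete example of using SOLR to load, modify, evaluate, and search a standard data set that we download from the Internet. We will highlight several features of Solr along the way. As we mentioned earlier, Solr contains independent data repositories called "cores." Each one can have a separately defined schema associated with it. Solr cores can be created on the command line.

First, download the sample data set as a CSV file from the URL http://samplecsvs.s3.amazonaws.com/SacramentocrimeJanuary2006.csv

You will find it in your downloads folder, under the file name

yourDownLoadDirectory/SacramentocrimeJanuary2006.csv

Create a new SOLR core with the following command:

./solr create -c crimecore1 -d basic_configs

If your core is created successfully, you will see a screen similar to the one in Figure 6-6.

Figure 6-6. Successful construction of a Solr core

Modify the schema file schema.xml by adding the right fields to the end of the specification.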

<!-- much more of the schema.xml file will be here -->

    ............

<!-- you will now add the field specifications for the cdatetime, address, district, beat, grid, crimedescr, ucr_ncic_code, latitude, longitude
     fields found in the data file SacramentocrimeJanuary2006.csv
-->
    <field name="cdatetime" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="address" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="district" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="beat" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="grid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="crimedescr" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="ucr_ncic_code" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="latitude" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="longitude" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="internalCreatedDate" type="date" indexed="true" stored="true" required="true" multiValued="false" />

    <!-- the previous fields were added to the schema.xml file. The field type definition for currency is shown below -->

    <fieldType name="currency" class="solr.CurrencyField" precisionStep="8" defaultCurrency="USD" currencyConfig="currency.xml" />

</schema>

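It is easy to modify the data by adding keys and additional data to the individual data lines of the CSV file, for example a unique key and a creation date. Listing 6-1 is a simple example of such a CSV conversion program. The file name will be com/apress/converter/csv/CSVConverter.java.

The program needs little explanation. It reads the input CSV file line by line and appends to each line of data a date field that Solr can parse (a unique ID column can be added in the same way). There are two helper methods in the class, createInternalSolrDate() and getCSVField().

In the CSV data file, the header and the first few rows appear as in Figure 6-7, as displayed in Excel.

Figure 6-7. Crime data CSV file. This data will be used throughout this chapter.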

package com.apress.converter.csv;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;
import java.util.logging.Logger;

public class CSVConverter {
        Logger LOGGER = Logger.getAnonymousLogger();

        // Default file names; the constructor replaces them with its arguments
        String targetSource = "SacramentocrimeJan2006.csv";
        String targetDest = "crime0.csv";

        /** Make a date Solr can understand from a regular oracle-style day string.
         *
         * @param regularJavaDate the date string found in the CSV file
         * @return the date in Solr format, or "" if it is empty or cannot be parsed
         */
        public String createInternalSolrDate(String regularJavaDate){
                if (regularJavaDate.equals("")||(regularJavaDate.equals("\"\""))){ return ""; }

                String answer = "";
                TimeZone tz = TimeZone.getTimeZone("UTC");
                // Solr date fields expect seconds as well as hours and minutes
                DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
                df.setTimeZone(tz);
                try {
                        answer = df.format(new Date(regularJavaDate));
                } catch (IllegalArgumentException e){
                        return "";
                }
                return answer;
        }

        /** Get a CSV field in a CSV string by numerical index. Doesn't care if there are blank fields, but they count in the indices.
         *
         * @param s the CSV line
         * @param fieldnum the zero-based index of the field
         * @return the field value, or "" if the line is null or the index is out of range
         */
        public String getCSVField(String s, int fieldnum){
                String answer = "";
                if (s != null) {
                        s = s.replace(",,", ", ,");
                        String[] them = s.split(",");
                        if (fieldnum >= 0 && fieldnum < them.length) answer = them[fieldnum];
                }
                return answer;
        }

        /** Read the source CSV file and write a copy with an added date column.
         *
         * @param source path of the input CSV file
         * @param dest path of the output CSV file
         */
        public CSVConverter(String source, String dest) {
                targetSource = source;
                targetDest = dest;
                LOGGER.info("Performing CSV conversion for SOLR input");

                List<String> result = new ArrayList<String>();
                // try-with-resources closes both files, even if an exception is thrown
                try (LineNumberReader reader = new LineNumberReader(new FileReader(targetSource));
                     FileOutputStream writer = new FileOutputStream(new File(targetDest))) {
                        int count = 0;
                        int thefield = 0;   // cdatetime is the first column of the crime data
                        String readline;
                        while ((readline = reader.readLine()) != null){
                                if (readline.split(",").length < 2){
                                        LOGGER.info("Last line, exiting...");
                                        break;
                                }
                                if (count != 0){
                                        String origDate = getCSVField(readline, thefield).split(" ")[0];
                                        String newdate = createInternalSolrDate(origDate);
                                        String resultLine = readline + "," + newdate+"\n";
                                        LOGGER.info("===== Created new line: " + resultLine);
                                        writer.write(resultLine.getBytes());
                                        result.add(resultLine);
                                } else {
                                        String resultLine = readline +",INTERNAL_CREATED_DATE\n";   // add the internal date for faceted search
                                        writer.write(resultLine.getBytes());
                                }
                                count++;
                                LOGGER.info("Just read imported row: " + readline);
                        }
                } catch (FileNotFoundException e) {
                        e.printStackTrace();
                } catch (IOException e) {
                        e.printStackTrace();
                }
                LOGGER.info("...CSV conversion complete...");
        }

        /** MAIN ROUTINE
         *
         * @param args input CSV file name, output CSV file name
         */
        public static void main(String[] args){
                new CSVConverter(args[0], args[1]);
        }

}

Listing 6-1. Java source code for CSVConverter.java.
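The text above mentions adding a unique key to each row as well as a date. One way this could be done is sketched below, assuming the rows have already been read into memory. The class CsvKeyAdder, the column name "id", and the key prefix are hypothetical choices for illustration, and the sample row values are made up rather than taken from the data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: prepends a unique key column to CSV lines held in memory.
class CsvKeyAdder {

    /**
     * The first line is treated as the header and receives the column name "id".
     * Every following line receives prefix + its row number, starting at 1.
     */
    static List<String> addKeys(List<String> lines, String prefix) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String key = (i == 0) ? "id" : prefix + i;
            result.add(key + "," + lines.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows in the shape of the crime data file.
        List<String> lines = Arrays.asList(
            "cdatetime,address",
            "1/1/06 0:00,100 EXAMPLE ST",
            "1/1/06 0:05,200 SAMPLE AVE");
        addKeys(lines, "crime-").forEach(System.out::println);
    }
}
```

A running row number is enough when a file is converted in a single pass. If several files are loaded into the same core, a prefix per file (or a java.util.UUID per row) avoids key collisions.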

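Compile the file by typing the following command:

javac com/apress/converter/csv/CSVConverter.java

After setting up the CSV conversion program properly as described above, you can run it by typing

java com.apress.converter.csv.CSVConverter inputcsvfile.csv outputcsvfile.csv

Post the modified data to the SOLR core:

./post -c crimecore1 ./modifiedcrimedata2006.csv

Now that we have posted the data to the Solr core, we can examine the data set in the Solr dashboard. To do this, go to localhost:8983. You should see a screen similar to the one in Figure 6-8.

Figure 6-8. Initial Solr dashboard

We can also evaluate data from the Solandra core we created earlier in this chapter, as shown in Figure __.

Now select the crimecore1 core from the Core Selector drop-down. Click Query and change the output format (the "wt" parameter drop-down) to csv, so you can see several lines of data at once. You will see a data display similar to the one in Figure 6-9.

Figure 6-9. Result of Solr query, showing the JSON output format

Figure 6-10. Result of Solr query using the dashboard (Sacramento crime data core)

Because of Solr's RESTful interface, we can make queries either through the dashboard (using the Lucene query syntax discussed earlier) or on the command line using the cURL utility.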
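Since the same query can be sent from the dashboard or from cURL, it helps to see what the request URL looks like. The sketch below builds a URL for Solr's select handler, assuming the default port 8983 and the crimecore1 core created earlier; the class SolrQueryUrl and its method are hypothetical names introduced for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the URL for a query against a Solr core's select handler.
class SolrQueryUrl {

    /**
     * @param baseUrl scheme, host, and port of the Solr server, e.g. http://localhost:8983
     * @param core    name of the Solr core to query
     * @param query   query in Lucene syntax, e.g. district:3
     * @param wt      response writer (output format), e.g. json or csv
     */
    static String selectUrl(String baseUrl, String core, String query, String wt) {
        try {
            return baseUrl + "/solr/" + core + "/select?q="
                + URLEncoder.encode(query, "UTF-8") + "&wt=" + wt;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983", "crimecore1", "district:3", "csv"));
    }
}
```

The printed URL can be pasted into a browser or passed to cURL in quotes. The query string is URL-encoded because characters such as the colon and spaces in Lucene syntax are not safe to send unescaped.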

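6.4 Using the ELK Stack (Elasticsearch, Logstash, and Kibana)

As we mentioned before, there are alternatives to Lucene, Solr, and Nutch. Depending on the overall architecture of your system, you have a variety of technology stacks, languages, integration and plug-in helper libraries, and features available to you. Some of these components may use Lucene or Solr, or may be compatible with Lucene/Solr components through integration helper libraries such as Spring Data, Spring MVC, or Apache Camel, among others. Figure 6-11 shows an example of an alternative to the basic Lucene stack, called the "ELK stack."

Elasticsearch (elasticsearch.org) is a distributed, high-performance search engine. Under the hood, Elasticsearch uses Lucene as a core component, as shown in Figure 6-12. Elasticsearch is a strong competitor to SolrCloud and is easy to scale out, maintain, monitor, and deploy.

Why would you use Elasticsearch instead of Solr? A close look at the feature matrices of Solr and Elasticsearch reveals that, in many ways, the two toolkits have similar functionality. They both leverage Apache Lucene. Both Solr and Elasticsearch can use JSON as a data exchange format, although Solr also supports XML.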

Table 6-2. Feature comparison table of Elasticsearch features vs. Apache Solr features

| | JSON | XML | CSV | HTTP REST | JMX | Client libraries | Lucene query parsing | Self-contained distributed cluster | Sharding | Visualization | Web admin interface |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Solr | X | X | X | X | X | Java | X | | X | Kibana port (Banana) | |
| Elasticsearch | X | | | X | | Java, Python, JavaScript | X | X | X | Kibana | |

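Logstash (logstash.net) is a useful application that allows a variety of different kinds of data to be imported into Elasticsearch, including CSV-formatted files and ordinary "log format" files. Kibana ( https://www.elastic.co/guide/en/kibana/current/index.html ) is an open source visualization component that allows customization. Together, Elasticsearch, Logstash, and Kibana form the so-called "ELK stack." In this section, we will look at a small example of the ELK stack in action.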

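Figure 6-11. The so-called "ELK stack": Elasticsearch, Logstash, and Kibana visualization

Figure 6-12. ELK stack in use: Elasticsearch search engine/pipeline architecture diagram

Installing Elasticsearch, Logstash, and Kibana

Installing and trying out the ELK stack could not be more straightforward. If you have worked through the introductory chapters of this book, the process will be familiar. Follow these three steps to install and test the ELK stack: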

Download Elasticsearch from https://www.elastic.co/downloads/elasticsearch . Unzip the downloaded file to a convenient staging area. Then,

cd $ELASTICSEARCH_HOME/bin/
./elasticsearch

Elasticsearch will start up successfully with a display similar to that in Figure 6-13.

Figure 6-13. Successful start-up of the Elasticsearch server from the binary directory

Use the following Java program to import the crime data CSV file (or, with a little modification, any CSV formatted data file you wish):

// Imports needed by this method (Elasticsearch 2.x transport client).
// Place the method itself inside a class of your choice.
import java.io.BufferedReader;
import java.io.FileReader;
import java.net.InetAddress;
import java.util.Date;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

public static void main(String[] args)
{
    System.out.println("Import crime data");
    // Example path only: point this at your own Elasticsearch installation
    System.setProperty("es.path.home", "/Users/kerryk/Downloads/elasticsearch-2.3.1");
    String file = "SacramentocrimeJanuary2006.csv";
    Client client = null;
    try {
        client = TransportClient.builder().build()
            .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));

        int i = 0;
        String currentLine = "";
        // try-with-resources closes the reader when the import is finished
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            while ((currentLine = br.readLine()) != null) {
                if (i > 0) {
                    System.out.println("Processing line: " + currentLine);
                    String[] tokens = currentLine.split(",");
                    String cdatetime = tokens[0];
                    String address = tokens[1];
                    String district = tokens[2];
                    String beat = tokens[3];
                    String grid = tokens[4];
                    String crimedescr = tokens[5];
                    String ucrnciccode = tokens[6];
                    String latitude = tokens[7];
                    String longitude = tokens[8];
                    System.out.println("Crime description = " + crimedescr);
                    i = i + 1;
                    System.out.println("Index is: " + i);
                    // Index name "thread", type "answered", document ID "400" + row counter
                    client.prepareIndex("thread", "answered", "400" + i).setSource(
                        jsonBuilder()
                            .startObject()
                            .field("cdatetime", cdatetime)
                            .field("address", address)
                            .field("district", district)
                            .field("beat", beat)
                            .field("grid", grid)
                            .field("crimedescr", crimedescr)
                            .field("ucr_ncic_code", ucrnciccode)
                            .field("latitude", latitude)
                            .field("longitude", longitude)
                            .field("entrydate", new Date())
                            .endObject())
                        .execute().actionGet();
                } else {
                    System.out.println("Ignoring first line...");
                    i++;
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (client != null) { client.close(); }
    }
}

Run the program in Eclipse or in the command line. You will see a result similar to the one in Figure 6-14. Please note that each row of the CSV is entered as a set of fields into the Elasticsearch repository. You can also select the index name and index type by changing the appropriate strings in the code example.

Figure 6-14. Successful test of an Elasticsearch crime database import from the Eclipse IDE

You can test the query capabilities of your new Elasticsearch set-up by using 'curl' on the command line to execute some sample queries.

Figure 6-15. You can see the schema update logged in the Elasticsearch console

Figure 6-16. Successful test of an Elasticsearch crime database query from the command line
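A sample query of the kind mentioned above consists of a search endpoint and a small JSON body. The sketch below assembles both, assuming the HTTP port 9200 and the index name "thread" and type "answered" used by the import program; the class EsSearchRequest and its methods are hypothetical names for illustration, and the JSON escaping is deliberately minimal.

```java
// Hypothetical helper: assembles the endpoint and body of a simple Elasticsearch search request.
class EsSearchRequest {

    /** Search endpoint for one index and type on the default HTTP port 9200. */
    static String searchEndpoint(String host, String index, String type) {
        return "http://" + host + ":9200/" + index + "/" + type + "/_search";
    }

    /** JSON body of a match query on a single field. */
    static String matchQuery(String field, String text) {
        return "{\"query\":{\"match\":{\"" + escape(field) + "\":\"" + escape(text) + "\"}}}";
    }

    /** Escapes backslashes and double quotes; control characters are not handled. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        System.out.println(searchEndpoint("localhost", "thread", "answered"));
        System.out.println(matchQuery("crimedescr", "FRAUD"));
    }
}
```

The two printed values can be combined on the command line by passing the body to cURL with the -d option and the endpoint as the URL. For anything beyond experiments, a JSON library is a safer way to build the body than string concatenation.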

Download Logstash from https://www.elastic.co/downloads/logstash . Unzip the downloaded file to the staging area.
```
cd <your logstash staging area, LOGSTASH_HOME>
```
After entering some text, you will see an echoed result similar to Figure 6-17.

Figure 6-17. Testing your Logstash installation. You can enter text from the command line.

You will also need to set up a configuration file for use with Logstash. Follow the directions in the Logstash documentation to make a configuration file such as the one shown in Listing 6-2.
```
input { stdin { } }

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch { hosts => ["localhost:9200"] }
  stdout { codec => rubydebug }
}
```
Listing 6-2. Typical Logstash configuration file
Download Kibana from https://www.elastic.co/downloads/kibana . Unzip the downloaded file to the staging area. Start Kibana from its binary directory, in a similar way to starting the Elasticsearch server:
```
cd bin
./kibana
```
Figure 6-18. Successful start-up of the Kibana visualization component from its binary directory

You can easily query for keywords or more complex queries interactively using the Kibana dashboard as shown in Figure 6-19.

Figure 6-19. Kibana dashboard example with crime dataset

Figure 6-20. Kibana dashboard example: highlighted search for "FRAUD."

Add this schema for the crime data to Elasticsearch with this cURL command:
```
curl -XPUT http://localhost:9200/crime2 -d '
{ "mappings" :
{ "crime2" : { "properties" : { "cdatetime" : {"type" : "string"}, "address" : {"type": "string"}, "district" : {"type" : "string"}, "beat": {"type" : "string"}, "grid": {"type" : "string"}, "crimedescr" : {"type": "string"}, "ucr_ncic_code": {"type": "string"},"latitude": {"type" : "string"}, "longitude": {"type" : "string"}, "location": {"type" : "geo_point"}}  
} } }'
```
Notice the "location" tag in particular, which has a geo_point-type definition. This allows Kibana to identify the physical location on a map for visualization purposes, as shown in Figure 6-21.

Figure 6-21. The crime data for Sacramento as a visualization in Kibana

Figure 6-21 is a good example of how a visualization helps in understanding a complex data set. We can immediately pick out the "high crime" areas, shown in red.
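The mapping above declares a "location" field, but the CSV file only provides separate latitude and longitude columns, so the two have to be combined before indexing. A geo_point field accepts, among other formats, a string of the form "lat,lon". The following minimal sketch performs that conversion; the class GeoPointField and its method are hypothetical names introduced for illustration.

```java
// Hypothetical helper: combines latitude and longitude columns into a geo_point string.
class GeoPointField {

    /**
     * Returns "lat,lon" for use as a geo_point value, or null if either
     * coordinate is missing, not a number, or outside its valid range.
     */
    static String toGeoPoint(String latitude, String longitude) {
        if (latitude == null || longitude == null) {
            return null;
        }
        try {
            double lat = Double.parseDouble(latitude.trim());
            double lon = Double.parseDouble(longitude.trim());
            if (lat < -90 || lat > 90 || lon < -180 || lon > 180) {
                return null;
            }
            return lat + "," + lon;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Illustrative coordinates in the Sacramento area.
        System.out.println(toGeoPoint("38.55", "-121.39"));
    }
}
```

In the import program, the result could be added as one more field named "location" alongside the latitude and longitude fields. Rows for which the method returns null are best indexed without the field, since a malformed geo_point value would be rejected.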

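6.5 Solr vs. ElasticSearch: Features and Logic

In this section, we will use the so-called CRUD operations (create, read, update, and delete methods, with an additional search utility method) as an example, in a code sample that uses Elasticsearch.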

package com.apress.main;

import java.io.IOException;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import static org.elasticsearch.index.query.QueryBuilders.fieldQuery;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;
import org.elasticsearch.search.SearchHit;

/**
 *
 * @author kerryk
 */

public class ElasticSearchMain {

          public static final String INDEX_NAME = "narwhal";
          public static final String THEME_NAME = "messages";

    public static void main(String args[]) throws IOException{

        Node node     = nodeBuilder().node();
        Client client = node.client();

        client.prepareIndex(INDEX_NAME, THEME_NAME, "1")
              .setSource(put("ElasticSearch: Java",
                                         "ElasticSeach provides Java API, thus it executes all operations " +
                                          "asynchronously by using client object..",
                                         new Date(),
                                         new String[]{"elasticsearch"},
                                         "Kerry Koitzsch", "iPad", "Root")).execute().actionGet();

        client.prepareIndex(INDEX_NAME, THEME_NAME, "2")
              .setSource(put("Java Web Application and ElasticSearch (Video)",
                                         "Today, here I am for exemplifying the usage of ElasticSearch which is an open source, distributed " +
                                         "and scalable full text search engine and a data analysis tool in a Java web application.",
                                         new Date(),
                                         new String[]{"elasticsearch"},
                                         "Kerry Koitzsch", "Apple TV", "Root")).execute().actionGet();

        get(client, INDEX_NAME, THEME_NAME, "1");

        update(client, INDEX_NAME, THEME_NAME, "1", "title", "ElasticSearch: Java API");
        update(client, INDEX_NAME, THEME_NAME, "1", "tags", new String[]{"bigdata"});

        get(client, INDEX_NAME, THEME_NAME, "1");

        search(client, INDEX_NAME, THEME_NAME, "title", "ElasticSearch");

        delete(client, INDEX_NAME, THEME_NAME, "1");

        node.close();
    }

    public static Map<String, Object> put(String title, String content, Date postDate,
                                                      String[] tags, String author, String communityName, String parentCommunityName){

        Map<String, Object> jsonDocument = new HashMap<String, Object>();

        jsonDocument.put("title", title);
        jsonDocument.put("content", content);
        jsonDocument.put("postDate", postDate);
        jsonDocument.put("tags", tags);
        jsonDocument.put("author", author);
        jsonDocument.put("communityName", communityName);
        jsonDocument.put("parentCommunityName", parentCommunityName);
        return jsonDocument;
    }

    public static void get(Client client, String index, String type, String id){

        GetResponse getResponse = client.prepareGet(index, type, id)

                                        .execute()
                                        .actionGet();
        Map<String, Object> source = getResponse.getSource();

        System.out.println("------------------------------");
        System.out.println("Index: " + getResponse.getIndex());
        System.out.println("Type: " + getResponse.getType());
        System.out.println("Id: " + getResponse.getId());
        System.out.println("Version: " + getResponse.getVersion());
        System.out.println(source);
        System.out.println("------------------------------");

    }

    public static void update(Client client, String index, String type,

                                      String id, String field, String newValue){

        Map<String, Object> updateObject = new HashMap<String, Object>();
        updateObject.put(field, newValue);

        client.prepareUpdate(index, type, id)

              .setScript("ctx._source." + field + "=" + field)
              .setScriptParams(updateObject).execute().actionGet();
    }

    public static void update(Client client, String index, String type,

                                      String id, String field, String[] newValue){

        String tags = "";
        for(String tag :newValue)
            tags += tag + ", ";

        tags = tags.substring(0, tags.length() - 2);

        Map<String, Object> updateObject = new HashMap<String, Object>();
        updateObject.put(field, tags);

        client.prepareUpdate(index, type, id)

                .setScript("ctx._source." + field + "+=" + field)
                .setScriptParams(updateObject).execute().actionGet();
    }

    public static void search(Client client, String index, String type,

                                      String field, String value){

        SearchResponse response = client.prepareSearch(index)
                                        .setTypes(type)

                                        .setSearchType(SearchType.QUERY_AND_FETCH)
                                        .setQuery(fieldQuery(field, value))
                                        .setFrom(0).setSize(60).setExplain(true)
                                        .execute()
                                        .actionGet();

        SearchHit[] results = response.getHits().getHits();

        System.out.println("Current results: " + results.length);
        for (SearchHit hit : results) {
            System.out.println("------------------------------");
            Map<String,Object> result = hit.getSource();   
            System.out.println(result);
        }
    }

    public static void delete(Client client, String index, String type, String id){

        DeleteResponse response = client.prepareDelete(index, type, id).execute().actionGet();
        System.out.println("===== Information on the deleted document:");
        System.out.println("Index: " + response.getIndex());
        System.out.println("Type: " + response.getType());
        System.out.println("Id: " + response.getId());
        System.out.println("Version: " + response.getVersion());
    }
}

Listing 6-3. CRUD operations for Elasticsearch example

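Defining the CRUD operations for a search component is key to how the customized component "fits into" the overall architecture and logic of the rest of the system.

6.6 Spring Data Components with Elasticsearch and Solr

In this section, we will develop a code example that uses Spring Data to implement the same kind of component, using Solr and Elasticsearch as the search frameworks "under the hood."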

You can define two version properties in your pom.xml file, one for Elasticsearch and one for Solr, and refer to them from the corresponding dependencies, as shown here:

<spring.data.elasticsearch.version>2.0.1.RELEASE</spring.data.elasticsearch.version>
<spring.data.solr.version>2.0.1.RELEASE</spring.data.solr.version>

<dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-elasticsearch</artifactId>
        <version>${spring.data.elasticsearch.version}</version>
</dependency>
and
<dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-solr</artifactId>
        <version>${spring.data.solr.version}</version>
</dependency>

We can now develop our Spring Data-based code examples, as shown in Listing 6-5 and Listing 6-6.

package com.apress.probda.solr.search;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Import;
import com.apress.probda.context.config.SearchContext;
import com.apress.probda.context.config.WebContext;
@Configuration
@ComponentScan
@EnableAutoConfiguration
@Import({ WebContext.class, SearchContext.class })
public class Application {
        public static void main(String[] args) {
                SpringApplication.run(Application.class, args);
        }
}

File: SearchContext.java

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.solr.repository.config.EnableSolrRepositories;

@Configuration
@EnableSolrRepositories(basePackages = { "org.springframework.data.solr.showcase.product" }, multicoreSupport = true)
public class SearchContext {

        @Bean
        public SolrServer solrServer(@Value("${solr.host}") String solrHost) {
                return new HttpSolrServer(solrHost);
        }

}

File: WebContext.java

import java.util.List;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.web.PageableHandlerMethodArgumentResolver;
import org.springframework.web.method.support.HandlerMethodArgumentResolver;
import org.springframework.web.servlet.config.annotation.ViewControllerRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurerAdapter;

/**
 * @author kkoitzsch
 */
@Configuration
public class WebContext {
        @Bean
        public WebMvcConfigurerAdapter mvcViewConfigurer() {
                return new WebMvcConfigurerAdapter() {
                        @Override
                        public void addViewControllers(ViewControllerRegistry registry) {

                                registry.addViewController("/").setViewName("search");
                                registry.addViewController("/monitor").setViewName("monitor");
                        }
                        @Override
                        public void addArgumentResolvers(List<HandlerMethodArgumentResolver> argumentResolvers) {
                                argumentResolvers.add(new PageableHandlerMethodArgumentResolver());
                        }
                };
        }
}

Listing 6-4. NLP program: main() executable method

public static void main(String[] args) throws IOException {
        String text = "The World is a great place";
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse, sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation annotation = pipeline.process(text);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            String sentiment = sentence.get(SentimentCoreAnnotations.SentimentClass.class);
            System.out.println(sentiment + "\t" + sentence);
        }
    }

Listing 6-5. Spring Data code example using Solr

package com.apress.probda.search.elasticsearch;

import com.apress.probda.search.elasticsearch.Application;
import com.apress.probda.search.elasticsearch.Post;
import com.apress.probda.search.elasticsearch.Tag;
import com.apress.probda.search.elasticsearch.PostService;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.SpringApplicationConfiguration;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

import java.util.Arrays;

import static org.hamcrest.CoreMatchers.notNullValue;
import static org.hamcrest.core.Is.is;
import static org.junit.Assert.assertThat;

@RunWith(SpringJUnit4ClassRunner.class)
@SpringApplicationConfiguration(classes = Application.class)
public class PostServiceImplTest{

    @Autowired
    private PostService postService;

    @Autowired
    private ElasticsearchTemplate elasticsearchTemplate;

    // Recreate the index and mapping so every test starts from an empty index.
    @Before
    public void before() {
        elasticsearchTemplate.deleteIndex(Post.class);
        elasticsearchTemplate.createIndex(Post.class);
        elasticsearchTemplate.putMapping(Post.class);
        elasticsearchTemplate.refresh(Post.class, true);
    }

    //@Test
    public void testSave() throws Exception {
        Tag tag = new Tag();
        tag.setId("1");
        tag.setName("tech");
        Tag tag2 = new Tag();
        tag2.setId("2");
        tag2.setName("elasticsearch");

        Post post = new Post();
        post.setId("1");
        post.setTitle("Beginning with spring boot application and elasticsearch");
        post.setTags(Arrays.asList(tag, tag2));
        postService.save(post);
        assertThat(post.getId(), notNullValue());

        Post post2 = new Post();
        post2.setId("1");
        post2.setTitle("Beginning with spring boot application");
        post2.setTags(Arrays.asList(tag));
        // Save post2 (the original saved post a second time).
        postService.save(post2);
        assertThat(post2.getId(), notNullValue());
    }

    public void testFindOne() throws Exception {
    }

    public void testFindAll() throws Exception {
    }

    @Test
    public void testFindByTagsName() throws Exception {
        Tag tag = new Tag();
        tag.setId("1");
        tag.setName("tech");
        Tag tag2 = new Tag();
        tag2.setId("2");
        tag2.setName("elasticsearch");

        Post post = new Post();
        post.setId("1");
        post.setTitle("Beginning with spring boot application and elasticsearch");
        post.setTags(Arrays.asList(tag, tag2));
        postService.save(post);

        Post post2 = new Post();
        post2.setId("1");
        post2.setTitle("Beginning with spring boot application");
        post2.setTags(Arrays.asList(tag));
        // post2 reuses id "1", so it replaces the first document; the index
        // still holds exactly one post tagged "tech".
        postService.save(post2);

        Page<Post> posts  = postService.findByTagsName("tech", new PageRequest(0,10));
        Page<Post> posts2  = postService.findByTagsName("tech", new PageRequest(0,10));
        Page<Post> posts3  = postService.findByTagsName("maz", new PageRequest(0,10));

        assertThat(posts.getTotalElements(), is(1L));
        assertThat(posts2.getTotalElements(), is(1L));
        assertThat(posts3.getTotalElements(), is(0L));
    }
}

Listing 6-6. Spring Data code example using Elasticsearch (unit test)

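6.7 Using LingPipe and GATE for Customized Search

In this section, we will review a pair of useful analytical tools which may be used with Lucene and Solr to enhance natural language processing (NLP) capabilities in a distributed analytic application: LingPipe ( http://alias-i.com/lingpipe/ ) and GATE (General Architecture for Text Engineering, https://gate.ac.uk ). A typical architecture for an NLP-based analytical system might be similar to Figure 6-22.

Figure 6-22. NLP system architecture, using LingPipe, GATE, and NGDATA Lily

A natural language processing system can be designed and built in much the same way as any other distributed pipelining system. The only difference is the set of adjustments needed for the particular nature of the data and metadata itself. LingPipe, GATE, Vowpal Wabbit, and StanfordNLP allow for the processing, parsing, and "understanding" of text, while packages such as Emir/Caliph, ImageTerrier, and HIPI provide features to analyze and index image- and signal-based data. You may also wish to add packages that help with geolocation, such as SpatialHadoop ( http://spatialhadoop.cs.umn.edu ), which is discussed in more detail in Chapter 14.

GATE can process a variety of input formats, including raw text, XML, HTML, and PDF documents, as well as relational data and JDBC-mediated data. This includes data imported from Oracle, PostgreSQL, and others.

The Apache Tika import component might be implemented as shown in Listing 6-7.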

package com.apress.probda.io;

import java.io.*;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.apress.probda.pc.AbstractProbdaKafkaProducer;
import org.apache.commons.lang3.StringUtils;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.serialization.JsonMetadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.isatab.ISArchiveParser;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.dia.kafka.solr.consumer.SolrKafkaConsumer;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import static org.dia.kafka.Constants.*;

public class ISAToolsKafkaProducer extends AbstractProbdaKafkaProducer {

    /**
     * Tag for specifying things coming out of ISATools
     */
    public final static String ISATOOLS_SOURCE_VAL = "ISATOOLS";
    /**
     * ISA files default prefix
     */
    private static final String DEFAULT_ISA_FILE_PREFIX = "s_";
    /**
     * Json jsonParser to decode TIKA responses
     */
    private static JSONParser jsonParser = new JSONParser();

    /**
     * Constructor
     */
    public ISAToolsKafkaProducer(String kafkaTopic, String kafkaUrl) {
        initializeKafkaProducer(kafkaTopic, kafkaUrl);
    }

    /**
     * @param args
     */
    public static void main(String[] args) throws IOException {
        String isaToolsDir = null;
        long waitTime = DEFAULT_WAIT;
        String kafkaTopic = KAFKA_TOPIC;
        String kafkaUrl = KAFKA_URL;

        // TODO Implement commons-cli
        String usage = "java -jar ./target/isatools-producer.jar [--tikaRESTURL <url>] [--isaToolsDir <dir>] [--wait <secs>] [--kafka-topic <topic_name>] [--kafka-url]\n";

        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("--isaToolsDir")) {
                isaToolsDir = args[++i];
            } else if (args[i].equals("--kafka-topic")) {
                kafkaTopic = args[++i];
            } else if (args[i].equals("--kafka-url")) {
                kafkaUrl = args[++i];
            }
        }

        // Checking for required parameters
        if (StringUtils.isEmpty(isaToolsDir)) {
            System.err.format("[%s] A folder containing ISA files should be specified.\n", ISAToolsKafkaProducer.class.getSimpleName());
            System.err.println(usage);
            System.exit(0);
        }

        // get KafkaProducer
        final ISAToolsKafkaProducer isatProd = new ISAToolsKafkaProducer(kafkaTopic, kafkaUrl);
        DirWatcher dw = new DirWatcher(Paths.get(isaToolsDir));

        // adding shutdown hook for shutdown gracefully
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            public void run() {
                System.out.println();
                System.out.format("[%s] Exiting app.\n", isatProd.getClass().getSimpleName());
                isatProd.closeProducer();
            }
        }));

        // get initial ISATools files
        List<JSONObject> newISAUpdates = isatProd.initialFileLoad(isaToolsDir);
        // send new studies to kafka
        isatProd.sendISAToolsUpdates(newISAUpdates);
        dw.processEvents(isatProd);

    }

    /**
     * Checks for files inside a folder
     *
     * @param innerFolder
     * @return
     */
    public static List<String> getFolderFiles(File innerFolder) {
        List<String> folderFiles = new ArrayList<String>();
        String[] innerFiles = innerFolder.list(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                if (name.startsWith(DEFAULT_ISA_FILE_PREFIX)) {
                    return true;
                }
                return false;
            }
        });

        for (String innerFile : innerFiles) {
            File tmpDir = new File(innerFolder.getAbsolutePath() + File.separator + innerFile);
            if (!tmpDir.isDirectory()) {
                folderFiles.add(tmpDir.getAbsolutePath());
            }
        }
        return folderFiles;
    }

    /**
     * Performs the parsing request to Tika
     *
     * @param files
     * @return a list of JSON objects.
     */
    public static List<JSONObject> doTikaRequest(List<String> files) {
        List<JSONObject> jsonObjs = new ArrayList<JSONObject>();

        try {
            Parser parser = new ISArchiveParser();
            StringWriter strWriter = new StringWriter();

            for (String file : files) {
                JSONObject jsonObject = new JSONObject();

                // get metadata from tika
                InputStream stream = TikaInputStream.get(new File(file));
                ContentHandler handler = new ToHTMLContentHandler();
                Metadata metadata = new Metadata();
                ParseContext context = new ParseContext();
                parser.parse(stream, handler, metadata, context);

                // get json object
                jsonObject.put(SOURCE_TAG, ISATOOLS_SOURCE_VAL);
                JsonMetadata.toJson(metadata, strWriter);
                jsonObject = adjustUnifiedSchema((JSONObject) jsonParser.parse(new String(strWriter.toString())));
                //TODO Tika parsed content is not used needed for now
                //jsonObject.put(X_TIKA_CONTENT, handler.toString());
                System.out.format("[%s] Tika message: %s \n", ISAToolsKafkaProducer.class.getSimpleName(), jsonObject.toJSONString());

                jsonObjs.add(jsonObject);

                strWriter.getBuffer().setLength(0);
            }
            strWriter.flush();
            strWriter.close();

        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        }
        return jsonObjs;
    }

    private static JSONObject adjustUnifiedSchema(JSONObject parse) {
        JSONObject jsonObject = new JSONObject();
        List invNames = new ArrayList<String>();
        List invMid = new ArrayList<String>();
        List invLastNames = new ArrayList<String>();

        Set<Map.Entry> set = parse.entrySet();
        for (Map.Entry entry : set) {
            String jsonKey = SolrKafkaConsumer.updateCommentPreffix(entry.getKey().toString());
            String solrKey = ISA_SOLR.get(jsonKey);

//            System.out.println("solrKey " + solrKey);
            if (solrKey != null) {
//                System.out.println("jsonKey: " + jsonKey + " -> solrKey: " + solrKey);
                if (jsonKey.equals("Study_Person_First_Name")) {
                    invNames.addAll(((JSONArray) JSONValue.parse(entry.getValue().toString())));
                } else if (jsonKey.equals("Study_Person_Mid_Initials")) {
                    invMid.addAll(((JSONArray) JSONValue.parse(entry.getValue().toString())));
                } else if (jsonKey.equals("Study_Person_Last_Name")) {
                    invLastNames.addAll(((JSONArray) JSONValue.parse(entry.getValue().toString())));
                }
                jsonKey = solrKey;
            } else {
                jsonKey = jsonKey.replace(" ", "_");
            }
            jsonObject.put(jsonKey, entry.getValue());
        }

        JSONArray jsonArray = new JSONArray();

        for (int cnt = 0; cnt < invLastNames.size(); cnt++) {
            StringBuilder sb = new StringBuilder();
            if (!StringUtils.isEmpty(invNames.get(cnt).toString()))
                sb.append(invNames.get(cnt)).append(" ");
            if (!StringUtils.isEmpty(invMid.get(cnt).toString()))
                sb.append(invMid.get(cnt)).append(" ");
            if (!StringUtils.isEmpty(invLastNames.get(cnt).toString()))
                sb.append(invLastNames.get(cnt));
            jsonArray.add(sb.toString());
        }
        if (!jsonArray.isEmpty()) {
            jsonObject.put("Investigator", jsonArray.toJSONString());
        }
        return jsonObject;
    }

    /**
     * Send message from IsaTools to kafka
     *
     * @param newISAUpdates
     */
    void sendISAToolsUpdates(List<JSONObject> newISAUpdates) {
        for (JSONObject row : newISAUpdates) {
            row.put(SOURCE_TAG, ISATOOLS_SOURCE_VAL);
            this.sendKafka(row.toJSONString());
            System.out.format("[%s] New message posted to kafka.\n", this.getClass().getSimpleName());
        }
    }

    /**
     * Gets the application updates from a directory
     *
     * @param isaToolsTopDir
     * @return
     */
    private List<JSONObject> initialFileLoad(String isaToolsTopDir) {
        System.out.format("[%s] Checking in %s\n", this.getClass().getSimpleName(), isaToolsTopDir);
        List<JSONObject> jsonParsedResults = new ArrayList<JSONObject>();
        List<File> innerFolders = getInnerFolders(isaToolsTopDir);

        for (File innerFolder : innerFolders) {
            jsonParsedResults.addAll(doTikaRequest(getFolderFiles(innerFolder)));
        }

        return jsonParsedResults;
    }

    /**
     * Gets the inner folders inside a folder
     *
     * @param isaToolsTopDir
     * @return
     */
    private List<File> getInnerFolders(String isaToolsTopDir) {
        List<File> innerFolders = new ArrayList<File>();
        File topDir = new File(isaToolsTopDir);
        String[] innerFiles = topDir.list();
        for (String innerFile : innerFiles) {
            File tmpDir = new File(isaToolsTopDir + File.separator + innerFile);
            if (tmpDir.isDirectory()) {
                innerFolders.add(tmpDir);
            }
        }
        return innerFolders;
    }
}

Listing 6-7. Apache Tika import routines for use throughout the PROBDA System

Installing and Testing LingPipe, GATE, and Stanford Core NLP

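First, install LingPipe by downloading the LingPipe release JAR file from http://alias-i.com/lingpipe/web/download.html . You may also download the LingPipe models that interest you from http://alias-i.com/lingpipe/web/models.html . Follow the directions to place the models in the correct directory so that LingPipe can pick them up for the demos which require them.
Download GATE from the University of Sheffield web site ( https://gate.ac.uk ) and use the installer to install the GATE components. The installation dialog is quite easy to use and allows you to selectively install a variety of components, as shown in Figure 6-24.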
We will also introduce the StanfordNLP ( http://stanfordnlp.github.io/CoreNLP/#human-languages-supported ) library component for our example. To get started with Stanford NLP, download the CoreNLP zip file from the GitHub link above. Expand the zip file. Make sure the following dependencies are in your pom.xml file:
```
           <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.5.2</version>
            <classifier>models</classifier>
           </dependency>
           <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.5.2</version>
           </dependency>
           <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-parser</artifactId>
            <version>3.5.2</version>
           </dependency>
```
Go to Stanford NLP “home directory” (where the pom.xml file is located) and do
```
mvn clean package
```
then test the interactive NLP shell to ensure correct behavior. Type
```
./corenlp.sh
```
to start the interactive NLP shell. Type some sample text into the shell to see the parser in action. The results will be similar to those shown in Figure 6-23.

Figure 6-23. StanfordNLP interactive shell in action

We can define an interface for generalized search as follows:

public interface ProbdaSearchEngine<T> {

  <Q> List<T> search(final String field, final Q query, int maximumResultCount);

  List<T> search(final String query, int maximumResultCount);
…………}

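Listing 6-8. ProbdaSearchEngine Java interface stub

There are two different method signatures for search(). One is specifically for a field and query combination; the other takes the query alone. The query is a Lucene query in string form, and maximumResultCount limits the number of result elements to a manageable amount.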
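To make the two signatures concrete, here is a minimal in-memory sketch. It is not the Probda implementation: `SimpleSearchEngine` is a hypothetical copy of only the two methods visible in the stub, `InMemorySearchEngine` and its `add` method are invented for illustration, and plain case-insensitive substring matching stands in for real Lucene query parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical copy of the two search() signatures shown in the interface stub.
interface SimpleSearchEngine<T> {
    <Q> List<T> search(String field, Q query, int maximumResultCount);

    List<T> search(String query, int maximumResultCount);
}

// Toy engine (an assumption for illustration): each "document" is a map of
// field name to text, and a hit is a case-insensitive substring match.
final class InMemorySearchEngine implements SimpleSearchEngine<Map<String, String>> {
    private final List<Map<String, String>> documents = new ArrayList<>();

    void add(Map<String, String> document) {
        documents.add(document);
    }

    // Field-specific search: only the named field is inspected.
    @Override
    public <Q> List<Map<String, String>> search(String field, Q query, int maximumResultCount) {
        String needle = String.valueOf(query).toLowerCase();
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : documents) {
            if (hits.size() >= maximumResultCount) {
                break; // cap the result list
            }
            String value = doc.get(field);
            if (value != null && value.toLowerCase().contains(needle)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // General search: every field of every document is inspected.
    @Override
    public List<Map<String, String>> search(String query, int maximumResultCount) {
        String needle = query.toLowerCase();
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : documents) {
            if (hits.size() >= maximumResultCount) {
                break; // cap the result list
            }
            for (String value : doc.values()) {
                if (value.toLowerCase().contains(needle)) {
                    hits.add(doc);
                    break; // count each document once
                }
            }
        }
        return hits;
    }
}
```

A Lucene- or Solr-backed implementation would replace the substring test with a parsed query, but the role of maximumResultCount as a hard cap on the returned list would stay the same.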

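We can define an implementation of the ProbdaSearchEngine interface, as in Listing 6-8.

Figure 6-24. GATE Installation dialog. GATE is easy to install and use.

Simply click through the installation wizard. Refer to the web site and install all the software components offered.

To use LingPipe and GATE in a program, let's work through a simple example, as shown in Listing 6-9. Please refer to the references at the end of this chapter for a more thorough overview of the features that LingPipe and GATE can provide.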

package com.apress.probda.nlp;

import java.io.*;
import java.util.*;

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

public class ProbdaNLPDemo {

  public static void main(String[] args) throws IOException {
    PrintWriter out;
    if (args.length > 1) {
      out = new PrintWriter(args[1]);
    } else {
      out = new PrintWriter(System.out);
    }
    PrintWriter xmlOut = null;
    if (args.length > 2) {
      xmlOut = new PrintWriter(args[2]);
    }

    StanfordCoreNLP pipeline = new StanfordCoreNLP();
    Annotation annotation;
    if (args.length > 0) {
      annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
    } else {
      annotation = new Annotation("No reply from local Probda email site");
    }

    pipeline.annotate(annotation);
    pipeline.prettyPrint(annotation, out);
    if (xmlOut != null) {
      pipeline.xmlPrint(annotation, xmlOut);
    }
    List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
    if (sentences != null && sentences.size() > 0) {
      CoreMap sentence = sentences.get(0);
      Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
      out.println();
      out.println("The first sentence parsed is:");
      tree.pennPrint(out);
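    }
    // Flush so that buffered output reaches the console or file before exit.
    out.flush();
    if (xmlOut != null) {
      xmlOut.flush();
    }
  }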

}

Listing 6-9. LingPipe | GATE | StanfordNLP Java test program, imports

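6.8 Summary

In this chapter, we took a quick overview of the Apache Lucene and Solr ecosystem. Interestingly, although Hadoop and Solr started out together as part of the Lucene ecosystem, they have since gone their separate ways and evolved into useful independent frameworks. This doesn't mean that the Solr and Hadoop ecosystems cannot work together very effectively, however. Many components, such as Apache Mahout, LingPipe, GATE, and Stanford NLP, work seamlessly with Lucene and Solr. New technology additions to Solr, such as SolrCloud, make it easier to use RESTful APIs to interface with the Lucene/Solr technologies.

We worked through a complete example of using Solr and its ecosystem: from downloading, massaging, and inputting the data set to transforming the data and outputting results in a variety of data formats. It becomes even clearer that Apache Tika and Spring Data are extremely useful as data pipeline "glue."

We did not neglect the competitors to the Lucene/Solr technology stack. We discussed Elasticsearch, a strong alternative to Lucene/Solr, and described the pros and cons of using Elasticsearch over a more "vanilla" Lucene/Solr approach. One of the most interesting parts of Elasticsearch is its seamless ability to visualize data, as we showed while exploring the crime statistics of Sacramento.

In the next chapter, we will build on what we have learned so far and discuss some analytical techniques and algorithms which are particularly useful for building distributed analytical systems.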

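6.9 References

Awad, Mariette and Khanna, Rahul. Efficient Learning Machines. New York: Apress Open Publications, 2015.

Babenko, Dmitry and Marmanis, Haralambos. Algorithms of the Intelligent Web. Shelter Island: Manning Publications, 2009.

Guller, Mohammed. Big Data Analytics with Apache Spark. New York: Apress Press, 2015.

Karambelkar, Hrishikesh. Scaling Big Data with Hadoop and Solr. Birmingham, UK: PACKT Publishing, 2013.

Konchady, Manu. Building Search Applications: Lucene, LingPipe and GATE. Oakton, VA: Mustru Publishing, 2008.

Mattmann, Chris A. and Zitting, Jukka I. Tika in Action. Shelter Island: Manning Publications, 2012.

Pollack, Mark, Gierke, Oliver, Risberg, Thomas, Brisbin, Jon, and Hunger, Michael. Spring Data: Modern Data Access for Enterprise Java. Sebastopol, CA: O'Reilly Media, 2012.

Venner, Jason. Pro Hadoop. New York, NY: Apress Press, 2009.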

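7. An Overview of Analytical Techniques and Algorithms

In this chapter, we provide an overview of four categories of algorithms: statistical, Bayesian, ontology-driven, and hybrid algorithms, the last of which leverage the more basic algorithms found in standard libraries to perform more in-depth and accurate analyses using Hadoop.

7.1 Survey of Algorithm Types

As it turns out, Apache Mahout and most of the other mainstream machine learning toolkits support a wide range of the algorithms we're interested in. For example, see Figure 7-1 for a survey of the algorithms supported by Apache Mahout.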

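| Number | Algorithm Name | Algorithm Type | Description |
| --- | --- | --- | --- |
| 1 | naïve Bayes | classifier | Simple Bayesian classifier: present in almost all modern toolkits |
| 2 | hidden Markov model | classifier | System state prediction through observation of outcomes |
| 3 | (learning) random forest | classifier | Random forest algorithms (sometimes known as random decision forests) are an ensemble learning method for classification, regression, and other tasks that construct a collection of decision trees at training time, outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. |
| 4 | (learning) multilayer perceptron (LMP) | classifier | Also implemented in the Theano toolkit and several others. |
| 5 | (learning) logistic regression | classifier | Also supported in scikit-learn. Really a technique for classification, not regression. |
| 6 | stochastic gradient descent | optimizer, model finding | An objective function minimization routine, also supported in H2O and Vowpal Wabbit, among others |
| 7 | genetic algorithms | genetic algorithm | According to Wikipedia, "In the field of mathematical optimization, a genetic algorithm (GA) is a search heuristic that mimics the process of natural selection. This heuristic (also sometimes called a metaheuristic) is routinely used to generate useful solutions to optimization and search problems." |
| 8 | singular value decomposition | dimensionality reduction | Matrix decomposition for dimensionality reduction |
| 9 | collaborative filtering (CF) | recommender | A technique used by some recommender systems |
| 10 | latent Dirichlet allocation | topic modeler | A powerful algorithm (learner) which automatically (and jointly) clusters words into "topics" and documents into topic "mixtures" |
| 11 | spectral clustering | clusterer | |
| 12 | frequent pattern mining | data miner | |
| 13 | k-means clustering | clusterer | Ordinary and fuzzy k-means are available using Mahout |
| 14 | canopy clustering | clusterer | Preprocessing step for the k-means clusterer: a two-threshold system |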

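Figure 7-1. Mean, standard deviation, and normal distribution are often used in statistical methods

Statistical and numerical algorithms are the most straightforward type of distributed algorithm we can use.

Statistical techniques include the use of standard statistical computations such as those shown in Figure 7-1.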
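The three quantities named in the figure caption can be sketched in a few lines of plain Java. This is a single-machine illustration only, not code from the example system: the class name `BasicStatistics` and its methods are hypothetical, and the standard deviation shown is the population form (dividing by n rather than n - 1).

```java
// Hypothetical helper class; the names are illustrative and not part of Probda.
final class BasicStatistics {
    private BasicStatistics() {
    }

    // Arithmetic mean of the sample.
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population standard deviation: square root of the mean squared deviation.
    static double standardDeviation(double[] values) {
        double m = mean(values);
        double squaredDeviations = 0.0;
        for (double v : values) {
            double d = v - m;
            squaredDeviations += d * d;
        }
        return Math.sqrt(squaredDeviations / values.length);
    }

    // Density of the normal distribution with the given mean and standard deviation.
    static double normalDensity(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }
}
```

For the sample {2, 4, 4, 4, 5, 5, 7, 9}, the mean is 5 and the population standard deviation is 2. In a distributed setting the same results are typically obtained by having each worker emit a count, a sum, and a sum of squares, which are then combined in a final reduce step.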

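Bayesian techniques are among the most effective techniques for building classifiers, modeling data, and other purposes.

Ontology-driven algorithms, on the other hand, are a whole family of algorithms that rely on logical, structured, hierarchical modeling, grammars, and other techniques to provide infrastructure for modeling, data mining, and drawing inferences about data sets.

Hybrid algorithms combine two or more modules consisting of different types of algorithm, linked together with glueware, to provide a more flexible and powerful data pipeline than would be possible with a single algorithm type. For example, neural net technology may be combined with Bayesian technology and ML technology to create "learning Bayesian networks," a very interesting example of the synergy that can be obtained by using a hybrid approach.

7.2 Statistical/Numerical Techniques

Statistical classes and support methods in the example system are found in the com.apress.probda.algorithms.statistics subpackage.

We can see a simple distributed technology stack which uses Apache Storm in Figure 7-2.

Figure 7-3. An Apache Spark-centric technology stack

Figure 7-2. A distributed technology stack including Apache Storm

We can see a Tachyon-centric technology stack in Figure 7-4. Tachyon is a fault-tolerant distributed in-memory file system.

Figure 7-4. A Tachyon-centric technology stack, showing some of the associated ecosystem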

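7.3 Bayesian Techniques

The Bayesian techniques we implement in the example system are found in the package com.apress.probda.algorithms.bayes.

Some of the Bayesian techniques (besides the naïve Bayes algorithm) supported by our most popular libraries include the ones shown in Figure 7-1.

The naïve Bayes classifier is based upon the fundamental Bayes equation, as shown in Figure 7-5.

Figure 7-5. The fundamental Bayes equation

The equation contains four main probability types: posterior probability, likelihood, class prior probability, and predictor prior probability. These terms are explained in the references at the end of the chapter.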
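As a small worked illustration of how these four quantities fit together, the sketch below assumes the standard form of Bayes' rule, posterior = likelihood × class prior / predictor prior. The class `BayesRule`, its method names, and the spam-filter numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```java
// Hypothetical helper; assumes posterior = likelihood * classPrior / predictorPrior.
final class BayesRule {
    private BayesRule() {
    }

    // P(class | predictor) from the three other quantities.
    static double posterior(double likelihood, double classPrior, double predictorPrior) {
        if (predictorPrior <= 0.0) {
            throw new IllegalArgumentException("predictor prior must be positive");
        }
        return likelihood * classPrior / predictorPrior;
    }

    // P(predictor) for a two-class problem, by the law of total probability.
    static double predictorPrior(double likelihoodGivenClass, double classPrior,
                                 double likelihoodGivenOtherClass) {
        return likelihoodGivenClass * classPrior
                + likelihoodGivenOtherClass * (1.0 - classPrior);
    }
}
```

Suppose 20% of messages are spam (class prior 0.2), a given word appears in 60% of spam (likelihood 0.6), and in 5% of non-spam. The predictor prior is 0.6 × 0.2 + 0.05 × 0.8 = 0.16, so the posterior probability that a message containing the word is spam is 0.12 / 0.16 = 0.75. A naïve Bayes classifier repeats this calculation per feature under the assumption that the features are conditionally independent given the class.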

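We can try out the Mahout text classifier in a straightforward way. First, download one of the basic data sets to test with.

7.4 Ontology-Driven Algorithms

The ontology-driven components and support classes are found in the com.apress.probda.algorithms.ontology subpackage.

To include the Protégé Core component, add the following Maven dependency to your project pom.xml file.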

<dependency>
        <groupId>edu.stanford.protege</groupId>
        <artifactId>protege-common</artifactId>
        <version>5.0.0-beta-24</version>
</dependency>

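Register and download Protégé from the web site:

protege.stanford.edu/products.php#desktop-prot%C3%A9g%C3%A9

Ontologies may be defined interactively by using an ontology editor such as Stanford's Protégé system, as shown in Figure 7-7.

Figure 7-6. Setting up SPARQL functionality with the Stanford toolkit interactive setup

You may safely select all the components, or just the ones you need. Refer to the individual online documentation pages to see whether the components are right for your application.

Figure 7-7. Using an ontology editor to define ontologies, taxonomies, and grammars

7.5 Hybrid Algorithms: Combining Algorithm Types

The hybrid algorithms implemented in the Probda example system are found in the com.apress.probda.algorithms.hybrid subpackage.

We can mix and match algorithm types to construct better data pipelines. These "hybrid systems" might be somewhat more complex, and they usually have several additional components, but they make up for it in usefulness.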

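One of the most effective kinds of hybrid algorithm is the so-called "deep learning" component. Not everyone considers deep learners to be hybrid algorithms (in most cases they are essentially built on multilayer neural net technology), but there are some compelling reasons to treat deep learners as hybrid systems, as we will discuss below.

So-called "deep learning" techniques include those shown in the table below. Deeplearning4j and the TensorFlow toolkit are two of the more popular and powerful deep learning libraries currently available. Check out TensorFlow at https://www.tensorflow.org/versions/r0.10/get_started/basic_usage.htmlautoenc . Theano is a Python-based multidimensional array library. Check out http://deeplearning.net/tutorial/dA.html for more details about how to use Theano.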

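| Number | Algorithm Name | Algorithm Type | Toolkits | Description |
| --- | --- | --- | --- | --- |
| 1 | Deep belief networks | neural network | Deeplearning4j, TensorFlow, Theano | Multiple layers of hidden units, with connections only between layers |
| 2 | (Stacked, denoising) autoencoders (DA) | variations on basic autoencoder principles | Deeplearning4j, TensorFlow, Theano | A stacked autoencoder is a neural network consisting of multiple layers of sparse autoencoders in which the outputs of each layer are wired to the inputs of the successive layer. Denoising autoencoders can accept a partially corrupted input while recovering the original uncorrupted input. |
| 3 | Convolutional neural networks (CNN) | neural network, variation of MLP | Deeplearning4j, TensorFlow, Theano | Sparse connectivity and shared weights are two features of CNNs. |
| 4 | Long short-term memory units (LSTM) | recurrent neural network, classifier, predictor | Deeplearning4j, TensorFlow | Classification and time series prediction, even sentiment analysis |
| 5 | Recurrent neural networks | neural network | Deeplearning4j, TensorFlow | Classification, time series prediction |
| 6 | Computation graph | complex network architecture builder | Deeplearning4j, TensorFlow | Computations are represented as graphs. |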

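7.6 Code Examples

In this section, we discuss some extended examples of the algorithm types we talked about in the earlier sections.

To get a feel for how some of the algorithms compare, let's use the movie data set to evaluate some of the algorithms and toolkits we have discussed.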

package com.apress.probda.datareader.csv;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class FileTransducer {

        /**
         * This routine splits a line which is delimited into fields by the vertical
         * bar symbol '|'
         *
         * @param l
         * @return
         */
        public static String makeComponentsList(String l) {
                String[] genres = l.split("\\|");
                StringBuffer sb = new StringBuffer();
                for (String g : genres) {
                        sb.append("\"" + g + "\",");
                }
                String answer = sb.toString();
                return answer.substring(0, answer.length() - 1);
        }

        /**
         * The main routine processes the standard movie data files so that mahout
         * can use them.
         *
         * @param args
         */
        public static void main(String[] args) {
                if (args.length < 4) {
                        System.out.println("Usage: <movie data input> <movie output file> <ratings input file> <ratings output file>");
                        System.exit(-1);
                }
                File file = new File(args[0]);
                if (!file.exists()) {
                        System.out.println("File: " + file + " did not exist, exiting...");
                        System.exit(-1);
                }
                System.out.println("Processing file: " + file);
                BufferedWriter bw = null;
                FileOutputStream fos = null;
                String line;
                try (BufferedReader br = new BufferedReader(new FileReader(file))) {
                        int i = 1;
                        File fout = new File(args[1]);
                        fos = new FileOutputStream(fout);
                        bw = new BufferedWriter(new OutputStreamWriter(fos));
                        while ((line = br.readLine()) != null) {
                                String[] components = line.split("::");
                                String number = components[0].trim();
                                String[] titleDate = components[1].split("\\(");
                                String title = titleDate[0].trim();
                                String date = titleDate[1].replace(")", "").trim();
                                String genreList = makeComponentsList(components[2]);
                                String outLine = "{ \"create\" : { \"_index\" : \"bigmovie\", \"_type\" : \"film\", \"_id\" : \"" + i
                                                + "\" } }\n" + "{ \"id\": \"" + i + "\", \"title\" : \"" + title + "\", \"year\":\"" + date
                                                + "\" , \"genre\":[" + genreList + "] }";
                                i++;
                                bw.write(outLine);
                                bw.newLine();
                        }
                } catch (IOException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                } finally {
                        if (bw != null) {
                                try {
                                        bw.close();
                                } catch (IOException e) {
                                        // TODO Auto-generated catch block
                                        e.printStackTrace();
                                }
                        }
                }
                file = new File(args[2]);
                try (BufferedReader br2 = new BufferedReader(new FileReader(file))) {
                        File fileout = new File(args[3]);
                        fos = new FileOutputStream(fileout);
                        bw = new BufferedWriter(new OutputStreamWriter(fos));
                        while ((line = br2.readLine()) != null) {
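                                // Convert the "::"-delimited ratings record into a tab-delimited one.
                                String lineout = line.replace("::", "\t");
                                bw.write(lineout);
                                // readLine() strips the line terminator, so write one back per record.
                                bw.newLine();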
                        }
                } catch (IOException e) {
                        // TODO Auto-generated catch block
                        e.printStackTrace();
                } finally {
                        if (bw != null) {
                                try {
                                        bw.close();
                                } catch (IOException e) {
                                        // TODO Auto-generated catch block
                                        e.printStackTrace();
                                }
                        }
                }
        }
}
Execute the following curl command on the command line to import the data set into Elasticsearch:
curl -s -XPOST localhost:9200/_bulk --data-binary @index.json; echo

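The data set can be imported into Elasticsearch from the command line using a curl command. Figure 7-8 shows the result of executing such a command. The Elasticsearch server returns a JSON data structure, which is displayed on the console, and the submitted records are indexed into the Elasticsearch system.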
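The conversion step that produces the bulk file can be isolated into a small helper, which makes it easier to test. The following is a minimal sketch, not the book's code contribution: the class name `BulkLineSketch` and its methods are hypothetical, and it assumes input records of the form `id::title (year)::genre1|genre2`. Unlike the listing above, it takes the year from the last pair of parentheses (so titles that contain their own parentheses survive) and escapes quotes and backslashes before writing them into JSON.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Probda example system. */
final class BulkLineSketch {

    /** Escapes backslashes and double quotes so a value is safe inside a JSON string. */
    static String escapeJson(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Turns a pipe-delimited genre field into a comma-separated list of quoted strings. */
    static String genreList(String genres) {
        List<String> quoted = new ArrayList<>();
        for (String g : genres.split("\\|")) {
            if (!g.trim().isEmpty()) {
                quoted.add("\"" + escapeJson(g.trim()) + "\"");
            }
        }
        return String.join(",", quoted);
    }

    /** Builds the two-line bulk entry (action line, then document line) for one movie record. */
    static String toBulkEntry(String line, int id) {
        String[] components = line.split("::");
        if (components.length < 3) {
            throw new IllegalArgumentException("Expected id::title (year)::genres but got: " + line);
        }
        String titleAndYear = components[1].trim();
        // Use the LAST parenthesis pair as the year so that titles with
        // their own parentheses keep their full text.
        int open = titleAndYear.lastIndexOf('(');
        int close = titleAndYear.lastIndexOf(')');
        String title = titleAndYear;
        String year = "";
        if (open >= 0 && close > open) {
            title = titleAndYear.substring(0, open).trim();
            year = titleAndYear.substring(open + 1, close).trim();
        }
        String action = "{ \"create\" : { \"_index\" : \"bigmovie\", \"_type\" : \"film\", \"_id\" : \""
                + id + "\" } }";
        String doc = "{ \"id\": \"" + id + "\", \"title\" : \"" + escapeJson(title)
                + "\", \"year\":\"" + escapeJson(year)
                + "\" , \"genre\":[" + genreList(components[2]) + "] }";
        return action + "\n" + doc;
    }
}
```

Calling `BulkLineSketch.toBulkEntry(line, i)` inside the read loop and writing the result followed by a newline would produce the same file layout as the listing above. Note that the bulk endpoint expects the file to end with a newline character.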

Figure 7-8.

Importing a standard movie data set example using a CURL command

Figure 7-9.

Using Kibana as a reporting and visualization tool

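We can see a simple example of using Kibana as a reporting tool in Figure 7-7. Incidentally, we will encounter Kibana and the ELK Stack (Elasticsearch–Logstash–Kibana) throughout much of the remainder of this book. Although there are alternatives to the ELK stack, it is one of the more painless ways to build a data analytics system from third-party building blocks.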

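7.7 Summary

In this chapter, we discussed analytical techniques and algorithms, along with some criteria for evaluating algorithm effectiveness. We touched on some of the older algorithm types: statistical and numerical analysis functions. Combined or hybrid algorithms have become particularly important recently, because techniques from machine learning, statistics, and other fields can be used together very effectively, as we have seen throughout this chapter. For a general introduction to distributed algorithms, see Barbosa (1996).

Many of these algorithm types are very complex. Some of them, such as Bayesian techniques, have a dedicated literature of their own. For a detailed explanation of Bayesian techniques in particular, and probabilistic techniques in general, see Zadeh (1992).

In the next chapter, we will discuss rule-based systems, available rule engine systems such as JBoss Drools, and some applications of rule-based systems to smart data collection, rule-based analytics, and data pipeline control, scheduling, and orchestration.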

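7.8 References

Barbosa, Valmir C. An Introduction to Distributed Algorithms. Cambridge, MA: MIT Press, 1996.

Bolstad, William M. Introduction to Bayesian Statistics. New York: Wiley Inter-Science, Wiley and Sons, 2004.

Giacomelli, Piero. Apache Mahout Cookbook. Birmingham, UK: PACKT Publishing, 2013.

Gupta, Ashish. Learning Apache Mahout Classification. Birmingham, UK: PACKT Publishing, 2015.

Marmanis, Haralambos and Babenko, Dmitry. Algorithms of the Intelligent Web. Greenwich, CT: Manning Publications, 2009.

Nakhaeizadeh, G. and Taylor, C.C. (Eds.). Machine Learning and Statistics: The Interface. New York: John Wiley and Sons, 1997.

Parsons, Simon. Qualitative Approaches for Reasoning under Uncertainty. Cambridge, MA: MIT Press, 2001.

Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan-Kaufmann Publishing, 1988.

Zadeh, Lotfi A. and Kacprzyk, Janusz (Eds.). Fuzzy Logic for the Management of Uncertainty. New York: John Wiley and Sons, 1992.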