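When the data volume is relatively small (on the order of millions to tens of millions of rows), you can read files with Python and do the processing in a database. Once the volume reaches hundreds of millions of rows, these approaches struggle to keep up, and tools such as MapReduce, Spark, and HBase become the better option.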

1. Processing with pandas

1.1 merge

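merge works like a join in a relational database: it connects different DataFrames on one or more keys.

A typical use case is two tables that share the same primary key but hold different fields, which you want to combine into a single table by that key.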

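merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=False,
      suffixes=('_x', '_y'), copy=True, indicator=False)

Parameters:

 left and right: the two DataFrames to merge;

 how: the join type, one of inner, left, right, or outer; the default is inner;

 on: the column name(s) to join on, which must exist in both DataFrames. If on is not given and no other key parameter is given either, the intersection of the two DataFrames' column names is used as the join key;

 left_on: the column(s) in the left DataFrame to use as the join key. This is very useful when the key columns have different names on each side but mean the same thing;

 right_on: the column(s) in the right DataFrame to use as the join key;

 left_index: use the row index of the left DataFrame as the join key;

 right_index: use the row index of the right DataFrame as the join key;

 sort: sort the merged result by the join keys. The default is False in current pandas; leaving it at False is faster;

 suffixes: a tuple of strings appended to column names that appear in both the left and right DataFrames; the default is ('_x', '_y');

 copy: defaults to True in the signature above, which always copies the data into the result; setting it to False can avoid some copies and improve performance;

 indicator: shows where each row of the merged result came from. With indicator=True, a _merge column is added whose values are left_only, right_only, or both.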
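The parameters above can be tried out with a small sketch. The table names, column names, and values below are made up for illustration; only the pd.merge call and its keywords come from pandas.

```python
import pandas as pd

# Hypothetical tables that share the key column "id".
users = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cid"]})
orders = pd.DataFrame({"id": [2, 3, 4], "amount": [20, 30, 40]})

# Inner join: only keys present in both frames survive.
inner = pd.merge(users, orders, on="id", how="inner")

# Outer join: all keys survive; indicator=True adds a "_merge" column
# recording whether a row came from the left, the right, or both.
outer = pd.merge(users, orders, on="id", how="outer", indicator=True)

# left_on / right_on: the key columns have different names on each side.
orders_renamed = orders.rename(columns={"id": "user_id"})
left = pd.merge(users, orders_renamed, left_on="id", right_on="user_id", how="left")

print(inner)
print(outer)
print(left)
```

Note that the left_on/right_on variant keeps both key columns (id and user_id) in the result, so one of them usually gets dropped afterwards.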

1.2 concat

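pd.concat() is the function dedicated to concatenating data. It can work along rows or along columns, and it lets you choose how the other (non-concatenation) axis is combined (union, intersection, and so on).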

pd.concat(objs, axis=0, join='outer',
          ignore_index=False, keys=None,
          levels=None, names=None,
          verify_integrity=False, sort=False, copy=True)

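objs: the objects to concatenate, given as a sequence such as [df1, df2] (a single DataFrame on its own is not accepted).

axis: the axis to concatenate along. The default is 0, which stacks the objects vertically so that new rows are appended below the existing ones; with 1, the objects are placed side by side so that new columns are appended to the right.

join: how to combine the labels on the other axis, either by intersection (inner) or by union (outer); the default is outer.

ignore_index: whether to discard the original index along the concatenation axis. By default (False) the original index is kept; with True the result gets a fresh 0, 1, 2, ... index.

keys: uses the passed keys as the outermost level of a hierarchical index, in other words it gives each input table its own first-level index label.

levels: the specific levels (unique values) to use when building the MultiIndex; if omitted, they are inferred from keys.

names: the names of the levels in the resulting index, including hierarchical indexes.

verify_integrity: when True, an error is raised if the concatenated axis contains duplicate index labels.

sort: sort the non-concatenation axis when it is not already aligned.

copy: if False, data is not copied unnecessarily.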
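A short sketch of how axis, join, ignore_index, and keys change the result. The two frames and their column names are invented for the example.

```python
import pandas as pd

# Two hypothetical frames that share column "a" but differ otherwise.
df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# axis=0 stacks rows; join="outer" (the default) keeps the union of columns.
rows = pd.concat([df1, df2], axis=0, ignore_index=True)

# join="inner" keeps only the columns common to every input.
common = pd.concat([df1, df2], axis=0, join="inner", ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index.
side = pd.concat([df1, df2], axis=1)

# keys adds an outer index level that records which input each row came from.
keyed = pd.concat([df1, df2], keys=["first", "second"])

print(rows)
print(common)
print(side)
print(keyed)
```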

1.3 join

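The join method is a convenient way to combine the columns of two DataFrames, possibly with different indexes, into a single DataFrame.

Its parameters mean much the same as in merge, except that join defaults to a left outer join (how='left').

join merges on the row index. If you need to combine two DataFrames and the match is on their indexes, join is the natural choice.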

    def join(
        self, other, on=None, how="left", lsuffix="", rsuffix="", sort=False
    ) -> "DataFrame":
        """
        Join columns of another DataFrame.

        Join columns with `other` DataFrame either on index or on a key
        column. Efficiently join multiple DataFrame objects by index at once by
        passing a list.

        Parameters
        ----------
        other : DataFrame, Series, or list of DataFrame
            Index should be similar to one of the columns in this one. If a
            Series is passed, its name attribute must be set, and that will be
            used as the column name in the resulting joined DataFrame.
        on : str, list of str, or array-like, optional
            Column or index level name(s) in the caller to join on the index
            in `other`, otherwise joins index-on-index. If multiple
            values given, the `other` DataFrame must have a MultiIndex. Can
            pass an array as the join key if it is not already contained in
            the calling DataFrame. Like an Excel VLOOKUP operation.
        how : {'left', 'right', 'outer', 'inner'}, default 'left'
            How to handle the operation of the two objects.

            * left: use calling frame's index (or column if on is specified)
            * right: use `other`'s index.
            * outer: form union of calling frame's index (or column if on is
              specified) with `other`'s index, and sort it
              lexicographically.
            * inner: form intersection of calling frame's index (or column if
              on is specified) with `other`'s index, preserving the order
              of the calling's one.
        lsuffix : str, default ''
            Suffix to use from left frame's overlapping columns.
        rsuffix : str, default ''
            Suffix to use from right frame's overlapping columns.
        sort : bool, default False
            Order result DataFrame lexicographically by the join key. If False,
            the order of the join key depends on the join type (how keyword).

        Returns
        -------
        DataFrame
            A dataframe containing columns from both the caller and `other`.
        """

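In summary, other is another DataFrame, a Series, or a list of DataFrames.

on: the column(s) of the calling DataFrame to match against the index of other; it plays a role similar to ON in SQL. If omitted, the join is index-on-index.

how: {'left', 'right', 'outer', 'inner'}, default 'left'; similar to the join types in SQL.

lsuffix: the suffix added to overlapping column names from the left DataFrame.

rsuffix: the suffix added to overlapping column names from the right DataFrame.

sort: sort the result lexicographically by the join key.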
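The sketch below exercises the index-on-index case, the on parameter, and the suffixes. All frames and column names are hypothetical.

```python
import pandas as pd

# Hypothetical frames whose row indexes act as the join key.
names = pd.DataFrame({"name": ["Ann", "Bob", "Cid"]}, index=[1, 2, 3])
amounts = pd.DataFrame({"amount": [20, 30, 40]}, index=[2, 3, 4])

# Default: left join on the index, so every row of `names` is kept.
by_index = names.join(amounts)

# how="inner" keeps only index labels present in both frames.
inner = names.join(amounts, how="inner")

# on: match a column of the caller against the index of `other`.
orders = pd.DataFrame({"user_id": [1, 1, 3], "amount": [5, 6, 7]})
with_names = orders.join(names, on="user_id")

# Overlapping column names need lsuffix/rsuffix, otherwise join raises an error.
a = pd.DataFrame({"v": [1, 2]})
b = pd.DataFrame({"v": [3, 4]})
both = a.join(b, lsuffix="_l", rsuffix="_r")

print(by_index)
print(inner)
print(with_names)
print(both)
```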

1.4 append

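Of the four approaches, append is the simplest to use, but for that same reason it supports far fewer operations than the previous three. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat is the replacement. As usual, let's start with the parameters: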

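other: a DataFrame, a Series, or a list of DataFrames (or Series), the last being useful when several objects are appended to df at once.

ignore_index: boolean, default False. If True, the original index labels are discarded and the result is renumbered 0, 1, 2, ...

verify_integrity: boolean, default False (duplicates are allowed). If True, an error is raised when the resulting index contains duplicates.

sort: no sorting by default. Sorts the columns when the columns of the two objects are not already aligned.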
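Because append is gone in pandas 2.0 and later, the sketch below shows the same behaviour through pd.concat, which accepts ignore_index and verify_integrity with the same meaning. The frames are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
other = pd.DataFrame({"a": [3]})

# Counterpart of the old df.append(other, ignore_index=True):
# rows of `other` go below `df` and the index is renumbered.
appended = pd.concat([df, other], ignore_index=True)

# verify_integrity=True rejects duplicate index labels. Both inputs
# contain the label 0 here, so a ValueError is expected.
try:
    pd.concat([df, other], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

print(appended)
print(duplicate_detected)
```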

2. Merging data in SQL

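SQL has four common types of JOIN:

INNER JOIN: returns only the rows that have a match in both tables (comparable to set intersection)

LEFT JOIN: returns every row from the left table, even when the right table has no match (the full content of the left table is kept)

RIGHT JOIN: returns every row from the right table, even when the left table has no match (the full content of the right table is kept)

FULL JOIN: returns every row from both tables, paired up wherever a match exists (comparable to set union)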
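To see the difference between the join types, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table names and rows are hypothetical. RIGHT JOIN and FULL JOIN are left out because older SQLite builds do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE orders (user_id INTEGER, amount INTEGER);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
INSERT INTO orders VALUES (2, 20), (3, 30), (4, 40);
""")

# INNER JOIN: only users that have at least one matching order.
inner_rows = conn.execute(
    "SELECT u.name, o.amount FROM users u "
    "INNER JOIN orders o ON u.id = o.user_id ORDER BY u.id"
).fetchall()

# LEFT JOIN: every user; amount is NULL (None) when there is no order.
left_rows = conn.execute(
    "SELECT u.name, o.amount FROM users u "
    "LEFT JOIN orders o ON u.id = o.user_id ORDER BY u.id"
).fetchall()

conn.close()
print(inner_rows)
print(left_rows)
```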

3. Merging big data

3.1 MapReduce

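In a reduce-side join, the merge itself happens in the reduce phase. The reducers therefore carry a heavy load while the map nodes do very little computation, so resources are poorly utilized, and the reduce phase is highly prone to data skew.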
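The shape of a reduce-side join can be simulated in plain Python. This is only a single-process sketch of the idea, not Hadoop code: the function names, the "L"/"R" tags, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records, tag):
    """Map: emit (join_key, (source_tag, value)) for every input record."""
    for key, value in records:
        yield key, (tag, value)


def shuffle(pairs):
    """Shuffle: group all tagged values by join key."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(key, tagged_values):
    """Reduce: pair every left value with every right value for one key."""
    left = [v for t, v in tagged_values if t == "L"]
    right = [v for t, v in tagged_values if t == "R"]
    return [(key, l, r) for l in left for r in right]


users = [(1, "Ann"), (2, "Bob")]
orders = [(2, 20), (2, 25), (3, 30)]

mapped = list(map_phase(users, "L")) + list(map_phase(orders, "R"))
joined = []
for key, values in sorted(shuffle(mapped).items()):
    joined.extend(reduce_phase(key, values))

print(joined)
```

The map step only tags records, while each reducer has to hold and cross all values for its key. A key with very many records therefore lands on a single reducer, which is where the skew described above comes from.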

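3.2 Hive

The concat_ws and collect_set() functions can be combined to merge the values of a column within each group into one delimited string, with duplicates removed.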
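For readers without a Hive cluster at hand, the effect of this combination can be approximated in pandas. The table and column names (t, uid, tag) are hypothetical, and the HiveQL in the comment is only indicative.

```python
import pandas as pd

# Roughly what this HiveQL would compute on a table t(uid, tag):
#   SELECT uid, concat_ws(',', collect_set(tag)) FROM t GROUP BY uid;
df = pd.DataFrame({"uid": ["a", "a", "a", "b"], "tag": ["x", "y", "x", "z"]})

# set() removes duplicates like collect_set; ",".join plays the role of concat_ws.
# collect_set does not guarantee an order, so the values are sorted here
# only to make the output deterministic.
merged = (
    df.groupby("uid")["tag"]
    .agg(lambda s: ",".join(sorted(set(s))))
    .reset_index()
)

print(merged)
```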

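Processing with Hive SQL statements:

JOIN statements in HQL look much like those in SQL, but older Hive releases only support equi-joins; non-equi joins are not supported.

Inner join (inner join): the intersection

Outer join (outer join)

Left and right joins: left join, right join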

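In most cases, Hive launches one MapReduce job for each JOIN. For example, when joining the three tables teacher, course, and score, Hive first launches a MapReduce job to join teacher with course, and then launches another MapReduce job to join the output of the first job with score. A multi-table join therefore launches several MapReduce jobs.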

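3.3 HBase

HBase can be used to implement secondary indexes, combined queries, joins, and similar operations; a detailed summary will follow later.

3.4 Other databases and methods

Flink (merging streams), Storm, MongoDB, Redis, and so on.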
