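Why does HBase have column families, how are they used, and what problem do they solve? I found a Stack Overflow question that covers this: stackoverflow.com/questions/6… . Here is the question and an answer.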
What is the reason for having column families? Example:
Scenario 1:
Table Row-Key ColumnFamily1 ColumnFamily2 ColumnFamily3
Scenario 2:
Table1 Row-Key Column1...ColumnN
Table2 Row-Key Column1...ColumnN
Table3 Row-Key Column1...ColumnN
In scenario 1, although a table can have many column families, all column families are stored separately. Then why does the concept of column families exist at all? Why not simply use scenario 2? Scenario 2 doesn't block any feature HBase provides: you can still add columns dynamically later on (and use the other features).
My only concern is: if the column families are stored separately, then why are they in the same table? I'm only interested in the intent behind column families, and in what problem they solve.
A table, by definition, is a unit of organization for data that logically belongs together. Column families give you a way to create substructure within a table so that you can optimize performance for your access patterns (that's the problem they solve).
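To make the access-pattern point concrete, here is a minimal sketch of a toy model, not HBase's actual implementation or API. It assumes each column family is backed by its own store, and that a read naming only some families opens only those stores. The class `ToyTable`, the family names `profile` and `activity`, and the sample values are all hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Toy model (not HBase code): one store per column family, keyed by row key."""

    def __init__(self, families):
        # Each column family gets its own store, mimicking separate files.
        self.stores = {family: defaultdict(dict) for family in families}
        # Log of which stores each read had to open.
        self.stores_touched = []

    def put(self, row_key, family, qualifier, value):
        self.stores[family][row_key][qualifier] = value

    def get(self, row_key, families=None):
        # With no family filter, every store must be consulted.
        wanted = list(self.stores) if families is None else list(families)
        self.stores_touched.append(wanted)
        return {f: dict(self.stores[f].get(row_key, {})) for f in wanted}


table = ToyTable(["profile", "activity"])
table.put("user1", "profile", "name", "Ann")
table.put("user1", "activity", "last_login", "2024-01-01")

# A read that only needs profile data opens only the profile store.
profile_only = table.get("user1", families=["profile"])
# A read of the whole row opens every store.
whole_row = table.get("user1")
```

In this model, grouping columns that are read together into one family keeps a narrow read from paying for data it does not need, which is the kind of substructure the answer describes.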
In practical terms, although column families within a table are stored "separately," in different files, they are also stored "nearby," in the sense that HBase stores all the values for a given row in the same Region. This includes the separate files for each column family: although they are separate files, they're owned by the same Region Server.
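The "separate but nearby" layout can be sketched as follows. This is a simplified toy model under stated assumptions, not HBase internals: a table is split into regions by row-key range, each region is assigned to one server, and each region holds one store per column family. The names `ToyRegion`, `RegionedTable`, `rs1`, and `rs2`, and the split point `"m"`, are hypothetical.

```python
import bisect


class ToyRegion:
    """A contiguous row-key range, hosted by one server, with one store per family."""

    def __init__(self, start_key, server, families):
        self.start_key = start_key
        self.server = server
        self.stores = {family: {} for family in families}


class RegionedTable:
    def __init__(self, families, split_points, servers):
        # Regions cover ["", s1), [s1, s2), ... in row-key order.
        self.starts = [""] + sorted(split_points)
        self.regions = [
            ToyRegion(start, servers[i % len(servers)], families)
            for i, start in enumerate(self.starts)
        ]

    def locate(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return self.regions[bisect.bisect_right(self.starts, row_key) - 1]

    def put(self, row_key, family, qualifier, value):
        store = self.locate(row_key).stores[family]
        store.setdefault(row_key, {})[qualifier] = value


families = ["ColumnFamily1", "ColumnFamily2", "ColumnFamily3"]
table = RegionedTable(families, split_points=["m"], servers=["rs1", "rs2"])
for family in families:
    table.put("user1", family, "q", "v")

region = table.locate("user1")
# Every family's data for this row sits in a different store of the same region.
families_with_row = [f for f, store in region.stores.items() if "user1" in store]
```

The row key alone decides the region, so the three stores for `user1` are distinct objects (separate files, in the real system) yet all belong to one region on one server.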
By contrast, if you divided your data into different tables, parts of the same "row" would live in different HBase Regions, and when accessing them you'd pay the overhead of lookups that may land on different Region Servers in your cluster.
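A rough sketch of that cost compares how many servers a read of one logical row has to contact in each scenario. The region layouts below are invented for illustration; in a real cluster they depend on how each table happens to be split and assigned. The helper `locate_server` and the server names are hypothetical.

```python
import bisect


def locate_server(regions, row_key):
    """regions: list of (start_key, server) sorted by start_key; return the owning server."""
    starts = [start for start, _ in regions]
    return regions[bisect.bisect_right(starts, row_key) - 1][1]


# Scenario 1: one table with three column families shares a single region layout.
one_table = [("", "rs1"), ("m", "rs2")]

# Scenario 2: three tables, each split and assigned independently (hypothetical layout).
three_tables = {
    "Table1": [("", "rs1"), ("m", "rs2")],
    "Table2": [("", "rs3")],
    "Table3": [("", "rs2"), ("t", "rs1")],
}

row = "user1"
scenario1_servers = {locate_server(one_table, row)}
scenario2_servers = {locate_server(layout, row) for layout in three_tables.values()}

print(len(scenario1_servers), len(scenario2_servers))
```

With this particular layout, scenario 1 contacts one server and scenario 2 contacts three. The tables in scenario 2 could land on the same server by chance, but nothing in this model ties their layouts together.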
So if you opt to put some of your data in a separate table rather than in a column family, not only are you organizing your data in a way that could become hard to manage, you're also forfeiting a lot of the performance advantages HBase offers.