Creating and Using Columnar In-Memory Data Vectors in Java (Part 2)


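Apache Arrow is a cross-language development platform for high-performance in-memory analytics on columnar data. It defines a standardized columnar memory format designed for efficient data exchange and data processing.

Add the following dependencies to the project with Maven: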

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-memory-netty</artifactId>
  <version>17.0.0</version>
  <scope>runtime</scope>
</dependency>

<dependency>
  <groupId>org.apache.arrow</groupId>
  <artifactId>arrow-vector</artifactId>
  <version>17.0.0</version> <!-- use the latest version -->
</dependency>

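Creating a ValueVector

Arrow Java provides several building blocks. Data types describe the types of values; a ValueVector is a sequence of typed values; a Field describes the type of a column in tabular data; a Schema describes the sequence of columns in tabular data; and a VectorSchemaRoot represents the tabular data itself. Arrow also provides readers and writers for loading data from storage and persisting data to storage.

A ValueVector represents a sequence of values of the same type. In the columnar format, these are also known as "arrays".

Example: create a vector of 32-bit integers representing [2, null, 5]: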

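import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

try (
    // Create the memory allocator
    BufferAllocator allocator = new RootAllocator();
    // Create the IntVector instance
    IntVector intVector = new IntVector("fixed-size-primitive-layout", allocator)
) {
    // Allocate memory for 3 values
    intVector.allocateNew(3);
    intVector.set(0, 2);
    intVector.setNull(1);
    intVector.set(2, 5);
    // Set the value count
    intVector.setValueCount(3);
    System.out.println("Vector created in memory: " + intVector);
} catch (Exception e) {
    e.printStackTrace();
}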
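
To make the name "fixed-size-primitive-layout" a little more concrete, here is a minimal sketch in plain Java (no Arrow API involved) of how [2, null, 5] can be pictured in the columnar format: a validity bitmap plus a data buffer with one fixed-width slot per value. The class FixedSizeLayoutSketch and its methods are hypothetical names for illustration, and writing 0 into the null slot is an assumption of this sketch; the format leaves the content of null slots undefined.

```java
import java.util.Arrays;

// Conceptual sketch only: plain Java arrays stand in for Arrow's off-heap buffers.
class FixedSizeLayoutSketch {

    // Validity bitmap: bit i (least-significant bit first) is 1 when slot i holds a value.
    static byte[] validityBitmap(Integer[] values) {
        byte[] bitmap = new byte[(values.length + 7) / 8];
        for (int i = 0; i < values.length; i++) {
            if (values[i] != null) {
                bitmap[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bitmap;
    }

    // Data buffer: one 32-bit slot per value; null slots get a placeholder (0 in this sketch).
    static int[] dataBuffer(Integer[] values) {
        int[] data = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            data[i] = values[i] == null ? 0 : values[i];
        }
        return data;
    }

    // A slot is null when its validity bit is 0.
    static boolean isNull(byte[] bitmap, int index) {
        return (bitmap[index / 8] & (1 << (index % 8))) == 0;
    }

    public static void main(String[] args) {
        Integer[] values = {2, null, 5};
        byte[] validity = validityBitmap(values);
        int[] data = dataBuffer(values);
        System.out.println("validity bits: " + Integer.toBinaryString(validity[0] & 0xFF));
        System.out.println("data buffer:   " + Arrays.toString(data));
        System.out.println("slot 1 null?   " + isNull(validity, 1));
    }
}
```

Because every slot has the same width, value i can be located by simple index arithmetic, which is what makes this layout cheap to scan.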

Example: create a vector of UTF-8 encoded strings representing ["one", "two", "three"]:

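import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;

try (
    // Create the memory allocator
    BufferAllocator allocator = new RootAllocator();
    // Create the VarCharVector instance
    VarCharVector varCharVector = new VarCharVector("variable-size-primitive-layout", allocator)
) {
    // Allocate memory for 3 values
    varCharVector.allocateNew(3);
    // Encode explicitly as UTF-8 rather than relying on the platform default charset
    varCharVector.set(0, "one".getBytes(StandardCharsets.UTF_8));
    varCharVector.set(1, "two".getBytes(StandardCharsets.UTF_8));
    varCharVector.set(2, "three".getBytes(StandardCharsets.UTF_8));
    // Set the value count
    varCharVector.setValueCount(3);
    System.out.println("Vector created in memory: " + varCharVector);
}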
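
As a rough picture of what "variable-size-primitive-layout" means, the sketch below uses plain Java arrays (not the Arrow API) to build the two pieces that describe ["one", "two", "three"]: an offsets buffer with one more entry than there are values, and a single data buffer holding all bytes back to back. VariableSizeLayoutSketch and its methods are hypothetical names, and the validity bitmap is left out for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Conceptual sketch only: plain Java arrays stand in for Arrow's buffers.
class VariableSizeLayoutSketch {

    // Offsets buffer: values.length + 1 entries; value i occupies data[offsets[i] .. offsets[i + 1]).
    static int[] offsets(String[] values) {
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            offsets[i + 1] = offsets[i] + values[i].getBytes(StandardCharsets.UTF_8).length;
        }
        return offsets;
    }

    // Data buffer: the UTF-8 bytes of all values, concatenated without separators.
    static byte[] data(String[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String value : values) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    // Reads value i back by slicing the data buffer between two consecutive offsets.
    static String get(int[] offsets, byte[] data, int index) {
        int start = offsets[index];
        int length = offsets[index + 1] - start;
        return new String(data, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String[] values = {"one", "two", "three"};
        int[] offsets = offsets(values);
        byte[] data = data(values);
        System.out.println("offsets: " + Arrays.toString(offsets));
        System.out.println("data:    " + new String(data, StandardCharsets.UTF_8));
        System.out.println("value 2: " + get(offsets, data, 2));
    }
}
```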

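Creating a Field

A Field represents a particular column of tabular data. It consists of a name, a data type, a flag indicating whether the column can contain null values, and optional key-value metadata.

Example: create a string-typed field named "document":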

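import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;

Map<String, String> metadata = new HashMap<>();
metadata.put("A", "Id card");
metadata.put("B", "Passport");
metadata.put("C", "Visa");
// Create a nullable UTF-8 field that carries the metadata
Field document = new Field("document",
    new FieldType(true, new ArrowType.Utf8(), /*dictionary*/ null, metadata),
    /*children*/ null);
System.out.println("Field created: " + document + ", Metadata: " + document.getMetadata());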

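Creating a Schema

A Schema holds a sequence of fields together with some optional metadata.

Example: create a schema describing a dataset with two columns, an int32 column "A" and a UTF-8 encoded string column "B":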

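import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Create the schema-level metadata
Map<String, String> metadata = new HashMap<>();
metadata.put("K1", "V1");
metadata.put("K2", "V2");
// Create the fields
Field a = new Field("A", FieldType.nullable(new ArrowType.Int(32, true)), /*children*/ null);
Field b = new Field("B", FieldType.nullable(new ArrowType.Utf8()), /*children*/ null);
// Create the schema
Schema schema = new Schema(Arrays.asList(a, b), metadata);
System.out.println("Schema created: " + schema);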

Creating a VectorSchemaRoot

A VectorSchemaRoot combines ValueVectors with a Schema to represent tabular data.

Example: create a dataset of names (strings) and ages (32-bit signed integers):

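import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Create the age field
Field age = new Field("age",
    FieldType.nullable(new ArrowType.Int(32, true)), /*children*/ null);
// Create the name field
Field name = new Field("name",
    FieldType.nullable(new ArrowType.Utf8()), /*children*/ null);
// Create the schema
Schema schema = new Schema(Arrays.asList(age, name), /*metadata*/ null);
try (
    // Create the root allocator
    BufferAllocator allocator = new RootAllocator();
    // Create the VectorSchemaRoot
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    // Get the IntVector for the age column
    IntVector ageVector = (IntVector) root.getVector("age");
    // Get the VarCharVector for the name column
    VarCharVector nameVector = (VarCharVector) root.getVector("name")
) {
    // Allocate memory and fill the columns
    ageVector.allocateNew(3);
    ageVector.set(0, 10);
    ageVector.set(1, 20);
    ageVector.set(2, 30);
    nameVector.allocateNew(3);
    nameVector.set(0, "Dave".getBytes(StandardCharsets.UTF_8));
    nameVector.set(1, "Peter".getBytes(StandardCharsets.UTF_8));
    nameVector.set(2, "Mary".getBytes(StandardCharsets.UTF_8));
    // Set the row count, which also sets the value count of every vector
    root.setRowCount(3);
    System.out.println("VectorSchemaRoot created: \n" + root.contentToTSVString());
}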
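
The data in a VectorSchemaRoot is stored column by column, yet it is printed row by row. The following plain-Java sketch (no Arrow API) shows that transposition for the age and name columns above. ColumnsToTsvSketch and toTsv are hypothetical names, and the tab-separated layout with a header line is only an assumption about what output such as contentToTSVString() resembles.

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: Java lists stand in for the column vectors.
class ColumnsToTsvSketch {

    // Renders equally long columns as tab-separated text: a header line, then one line per row.
    static String toTsv(List<String> names, List<List<?>> columns, int rowCount) {
        StringBuilder sb = new StringBuilder(String.join("\t", names)).append('\n');
        for (int row = 0; row < rowCount; row++) {
            for (int col = 0; col < columns.size(); col++) {
                if (col > 0) {
                    sb.append('\t');
                }
                // Row r is assembled by reading position r of every column.
                sb.append(columns.get(col).get(row));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(10, 20, 30);
        List<String> people = Arrays.asList("Dave", "Peter", "Mary");
        List<List<?>> columns = Arrays.asList(ages, people);
        System.out.print(toTsv(Arrays.asList("age", "name"), columns, 3));
    }
}
```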

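Interprocess Communication (IPC)

Arrow data can be written to disk and read back from disk, and both operations can be done in a streaming and/or random-access fashion, depending on the application's requirements.

Writing data to an Arrow file

Example: write the dataset from the previous example to an Arrow IPC file (random access).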

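import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Create the age field
Field age = new Field("age",
    FieldType.nullable(new ArrowType.Int(32, true)), /*children*/ null);
// Create the name field
Field name = new Field("name",
    FieldType.nullable(new ArrowType.Utf8()), /*children*/ null);
// Create the schema
Schema schema = new Schema(Arrays.asList(age, name), /*metadata*/ null);
try (
    // Create the root allocator
    BufferAllocator allocator = new RootAllocator();
    // Create the VectorSchemaRoot
    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    // Get the IntVector for the age column
    IntVector ageVector = (IntVector) root.getVector("age");
    // Get the VarCharVector for the name column
    VarCharVector nameVector = (VarCharVector) root.getVector("name")
) {
    // Allocate memory and fill the columns
    ageVector.allocateNew(3);
    ageVector.set(0, 10);
    ageVector.set(1, 20);
    ageVector.set(2, 30);
    nameVector.allocateNew(3);
    nameVector.set(0, "Dave".getBytes(StandardCharsets.UTF_8));
    nameVector.set(1, "Peter".getBytes(StandardCharsets.UTF_8));
    nameVector.set(2, "Mary".getBytes(StandardCharsets.UTF_8));
    // Set the row count
    root.setRowCount(3);
    System.out.println("VectorSchemaRoot created: \n" + root.contentToTSVString());
    // Target file
    File file = new File("random_access_file.arrow");
    try (
        FileOutputStream fileOutputStream = new FileOutputStream(file);
        // Writer for the Arrow IPC file format
        ArrowFileWriter writer = new ArrowFileWriter(root, /*provider*/ null, fileOutputStream.getChannel())
    ) {
        // Start writing
        writer.start();
        // Write the current contents of root as one record batch
        writer.writeBatch();
        // Finish writing
        writer.end();
        System.out.println("Record batches written: " + writer.getRecordBlocks().size()
            + ". Number of rows written: " + root.getRowCount());
    } catch (IOException e) {
        e.printStackTrace();
    }
}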
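
A quick way to sanity-check the result is to look at the magic bytes: the Arrow IPC file format begins and ends with the ASCII string "ARROW1". The sketch below uses only the JDK and is not a replacement for ArrowFileReader. ArrowMagicSketch and looksLikeArrowFile are hypothetical names, and the default path simply reuses the file name from the example above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Sketch only: checks the leading and trailing magic bytes, nothing else.
class ArrowMagicSketch {

    static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    // True when the content both starts and ends with the "ARROW1" magic string.
    static boolean looksLikeArrowFile(byte[] content) {
        if (content.length < 2 * MAGIC.length) {
            return false;
        }
        byte[] head = Arrays.copyOfRange(content, 0, MAGIC.length);
        byte[] tail = Arrays.copyOfRange(content, content.length - MAGIC.length, content.length);
        return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in the write example; pass another path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "random_access_file.arrow");
        if (!Files.exists(path)) {
            System.out.println("File not found: " + path);
            return;
        }
        // Reads the whole file into memory, which is acceptable only for a small demo file.
        byte[] content = Files.readAllBytes(path);
        System.out.println(path + " looks like an Arrow IPC file: " + looksLikeArrowFile(content));
    }
}
```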

Reading data from an Arrow file

Example: read the dataset written in the previous example back from the Arrow IPC file (random access).

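import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

try (
    // Create the memory allocator
    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
    // Open the file and create the ArrowFileReader
    FileInputStream fileInputStream = new FileInputStream(new File("random_access_file.arrow"));
    ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), allocator)
) {
    // Load and print each record batch
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        System.out.println("VectorSchemaRoot read: \n" + root.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}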

Examples of other types

Dictionary-Encoded Array of Varchar

In some cases, dictionary-encoding a column helps save memory.

import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.dictionary.Dictionary;
import org.apache.arrow.vector.dictionary.DictionaryEncoder;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.DictionaryEncoding;

try (BufferAllocator root = new RootAllocator();
     VarCharVector countries = new VarCharVector("country-dict", root);
     VarCharVector appUserCountriesUnencoded = new VarCharVector("app-use-country-dict", root)
) {
    // Fill the dictionary vector with the distinct values
    countries.allocateNew(10);
    countries.set(0, "Andorra".getBytes(StandardCharsets.UTF_8));
    countries.set(1, "Cuba".getBytes(StandardCharsets.UTF_8));
    countries.set(2, "Grecia".getBytes(StandardCharsets.UTF_8));
    countries.set(3, "Guinea".getBytes(StandardCharsets.UTF_8));
    countries.set(4, "Islandia".getBytes(StandardCharsets.UTF_8));
    countries.set(5, "Malta".getBytes(StandardCharsets.UTF_8));
    countries.set(6, "Tailandia".getBytes(StandardCharsets.UTF_8));
    countries.set(7, "Uganda".getBytes(StandardCharsets.UTF_8));
    countries.set(8, "Yemen".getBytes(StandardCharsets.UTF_8));
    countries.set(9, "Zambia".getBytes(StandardCharsets.UTF_8));
    countries.setValueCount(10);

    // Wrap the vector in a Dictionary with 8-bit signed indices
    Dictionary countriesDictionary = new Dictionary(countries,
        new DictionaryEncoding(/*id=*/1L, /*ordered=*/false, /*indexType=*/new ArrowType.Int(8, true)));
    System.out.println("Dictionary: " + countriesDictionary);

    // Fill the unencoded data vector
    appUserCountriesUnencoded.allocateNew(5);
    appUserCountriesUnencoded.set(0, "Andorra".getBytes(StandardCharsets.UTF_8));
    appUserCountriesUnencoded.set(1, "Guinea".getBytes(StandardCharsets.UTF_8));
    appUserCountriesUnencoded.set(2, "Islandia".getBytes(StandardCharsets.UTF_8));
    appUserCountriesUnencoded.set(3, "Malta".getBytes(StandardCharsets.UTF_8));
    appUserCountriesUnencoded.set(4, "Uganda".getBytes(StandardCharsets.UTF_8));
    appUserCountriesUnencoded.setValueCount(5);
    System.out.println("Unencoded data: " + appUserCountriesUnencoded);

    // Encode the data against the dictionary; the result holds indices into it
    try (FieldVector appUserCountriesDictionaryEncoded = (FieldVector) DictionaryEncoder
        .encode(appUserCountriesUnencoded, countriesDictionary)) {
        System.out.println("Dictionary-encoded data: " + appUserCountriesDictionaryEncoded);
    }
}

Output:
Dictionary: Dictionary DictionaryEncoding[id=1,ordered=false,indexType=Int(8, true)] [Andorra, Cuba, Grecia, Guinea, Islandia, Malta, Tailandia, Uganda, Yemen, Zambia]
// Unencoded data
Unencoded data: [Andorra, Guinea, Islandia, Malta, Uganda]
// Encoded data
Dictionary-encoded data: [0, 3, 4, 5, 7]
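
The encoded values are simply positions in the dictionary. As a minimal illustration of the idea, and not of how DictionaryEncoder is implemented internally, the following plain-Java sketch maps each value to its dictionary index and back. DictionaryEncodingSketch, encode and decode are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: Java collections stand in for Arrow vectors.
class DictionaryEncodingSketch {

    // Replaces every value by its position in the dictionary.
    static int[] encode(List<String> values, List<String> dictionary) {
        Map<String, Integer> indexOf = new HashMap<>();
        for (int i = 0; i < dictionary.size(); i++) {
            indexOf.put(dictionary.get(i), i);
        }
        int[] encoded = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            Integer index = indexOf.get(values.get(i));
            if (index == null) {
                throw new IllegalArgumentException("Value not in dictionary: " + values.get(i));
            }
            encoded[i] = index;
        }
        return encoded;
    }

    // Looks every index up in the dictionary to recover the original values.
    static List<String> decode(int[] encoded, List<String> dictionary) {
        List<String> decoded = new ArrayList<>();
        for (int index : encoded) {
            decoded.add(dictionary.get(index));
        }
        return decoded;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("Andorra", "Cuba", "Grecia", "Guinea", "Islandia",
            "Malta", "Tailandia", "Uganda", "Yemen", "Zambia");
        List<String> values = Arrays.asList("Andorra", "Guinea", "Islandia", "Malta", "Uganda");
        int[] encoded = encode(values, dictionary);
        System.out.println("encoded: " + Arrays.toString(encoded));
        System.out.println("decoded: " + decode(encoded, dictionary));
    }
}
```

The memory saving comes from storing one small integer per row instead of a repeated string; it pays off when the column has few distinct values relative to its length.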

ListVector example

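import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.complex.impl.UnionListWriter;

try (
    BufferAllocator allocator = new RootAllocator();
    // Create the ListVector
    ListVector listVector = ListVector.empty("listVector", allocator);
    // Get a UnionListWriter for filling it
    UnionListWriter listWriter = listVector.getWriter()
) {
    int[] data = new int[] { 1, 2, 3, 10, 20, 30, 100, 200, 300, 1000, 2000, 3000 };
    int tmp_index = 0;
    for (int i = 0; i < 4; i++) {
        // Move the writer to list i
        listWriter.setPosition(i);
        // Start the list
        listWriter.startList();
        for (int j = 0; j < 3; j++) {
            listWriter.writeInt(data[tmp_index]);
            tmp_index = tmp_index + 1;
        }
        // Set the number of values written to this list
        listWriter.setValueCount(3);
        // End the list
        listWriter.endList();
    }
    // The vector holds 4 lists
    listVector.setValueCount(4);
    System.out.print(listVector);
} catch (Exception e) {
    e.printStackTrace();
}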

Output:
[[1,2,3], [10,20,30], [100,200,300], [1000,2000,3000]]
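
Conceptually, a list column is stored as an offsets buffer over one flat child array, much like the variable-size string layout. The sketch below reproduces that idea for the four lists above in plain Java (no Arrow API). ListLayoutSketch and its methods are hypothetical names, and the validity bitmap is omitted.

```java
import java.util.Arrays;

// Conceptual sketch only: plain Java arrays stand in for the offsets buffer and the child vector.
class ListLayoutSketch {

    // Offsets buffer: lists.length + 1 entries; list i covers child[offsets[i] .. offsets[i + 1]).
    static int[] offsets(int[][] lists) {
        int[] offsets = new int[lists.length + 1];
        for (int i = 0; i < lists.length; i++) {
            offsets[i + 1] = offsets[i] + lists[i].length;
        }
        return offsets;
    }

    // Child values: all list elements stored back to back in one flat array.
    static int[] flatten(int[][] lists) {
        int total = 0;
        for (int[] list : lists) {
            total += list.length;
        }
        int[] child = new int[total];
        int position = 0;
        for (int[] list : lists) {
            System.arraycopy(list, 0, child, position, list.length);
            position += list.length;
        }
        return child;
    }

    // Reads list i back by slicing the child array between two consecutive offsets.
    static int[] get(int[] offsets, int[] child, int index) {
        return Arrays.copyOfRange(child, offsets[index], offsets[index + 1]);
    }

    public static void main(String[] args) {
        int[][] lists = { {1, 2, 3}, {10, 20, 30}, {100, 200, 300}, {1000, 2000, 3000} };
        int[] offsets = offsets(lists);
        int[] child = flatten(lists);
        System.out.println("offsets: " + Arrays.toString(offsets));
        System.out.println("child:   " + Arrays.toString(child));
        System.out.println("list 2:  " + Arrays.toString(get(offsets, child, 2)));
    }
}
```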