StarRocks Issues


1. BE fails to start; the log shows the error: file descriptors limit is too small

E0507 21:25:49.384902 5103 storage_engine.cpp:415] File descriptor number is less than 60000. Please use (ulimit -n) to set a value equal or greater than 60000
W0507 21:25:49.385150 5103 storage_engine.cpp:199] check fd number failed, error: Internal error: file descriptors limit is too small
W0507 21:25:49.385156 5103 storage_engine.cpp:102] open engine failed, error: Internal error: file descriptors limit is too small
E0507 21:25:49.385486 5103 starrocks_be.cpp:129] file descriptors limit is too small

Cause: the system's file descriptor limit is below 60000, and the StarRocks BE needs at least 60000 file descriptors to start.

Solution:

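  1. Temporary change: in the shell session that starts the BE, run ulimit -n 60000 to raise the file descriptor limit to 60000. A regular user can only do this if the hard limit is at least 60000, and the setting is lost when the session ends.
  2. Permanent change: as root, edit /etc/security/limits.conf and add or modify the following lines, then log in again so that the new limits take effect: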
*               soft    nofile          60000
*               hard    nofile          60000


2. Issues between StarRocks and external systems

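(1) A materialized view over a table stored in an external system must be created in one of StarRocks's own internal databases. Otherwise it fails with: SQL Error [1064] [42000]: Getting analyzing error at line 1, column 89. Detail message: Can not find database:test_p.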

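Solution: switch to an internal StarRocks database, then reference the external table by its fully qualified name, paimon_catalog.test_p.p_test, in the FROM clause of the materialized view's query. This federated-query form lets you build the MV on the external table.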
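As a rough sketch of this workaround, the statements below switch to an internal database and create the MV there. The database name sr_internal_db, the MV name mv_p_test, and the column id are hypothetical; only paimon_catalog.test_p.p_test comes from the setup above. SET CATALOG assumes a StarRocks version that supports it (3.x). The refresh clause is included because an MV on an external table requires one.

```sql
-- Switch back to the internal catalog and pick an internal database.
-- sr_internal_db is a hypothetical name; use any database StarRocks owns.
SET CATALOG default_catalog;
USE sr_internal_db;

-- The MV lives in the internal database; the external table is referenced
-- by its fully qualified name: catalog.database.table.
CREATE MATERIALIZED VIEW mv_p_test
DISTRIBUTED BY HASH(id)                  -- id is an assumed column of p_test
REFRESH ASYNC EVERY (INTERVAL 1 MINUTE)
AS
SELECT *
FROM paimon_catalog.test_p.p_test;
```

Running the same CREATE statement while the session is still inside paimon_catalog.test_p is what produces the Can not find database:test_p error.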

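(2) An asynchronous materialized view on an external table must specify a refresh interval, and the interval must be at least 1 minute. Otherwise it fails with: SQL Error [1064] [42000]: Materialized view which type is ASYNC need to specify refresh interval for external table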

Cause:

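  1. For internal tables the interval is optional: with plain REFRESH ASYNC, the asynchronous materialized view is refreshed whenever the data in its base tables changes.
  2. For external tables, StarRocks cannot detect data changes on the external side, so a refresh interval must be configured for the MV to be refreshed at all.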

Solution: configure REFRESH ASYNC EVERY(INTERVAL 1 MINUTE). The interval must be at least 1 minute, otherwise the statement fails.
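To make the difference concrete, here is a hedged sketch of the two refresh clauses side by side. The internal table orders, its columns order_id and amount, the column id of p_test, and both MV names are made up for illustration.

```sql
-- Internal base table: no interval needed. StarRocks refreshes the MV
-- after the base table's data changes.
CREATE MATERIALIZED VIEW mv_internal_orders
DISTRIBUTED BY HASH(order_id)
REFRESH ASYNC
AS
SELECT order_id, SUM(amount) AS total_amount
FROM orders
GROUP BY order_id;

-- External base table: changes are invisible to StarRocks, so the MV
-- has to be refreshed on a fixed schedule instead.
CREATE MATERIALIZED VIEW mv_external_p_test
DISTRIBUTED BY HASH(id)
REFRESH ASYNC EVERY (INTERVAL 1 MINUTE)
AS
SELECT id, COUNT(*) AS cnt
FROM paimon_catalog.test_p.p_test
GROUP BY id;
```

Dropping the EVERY (...) part from the second statement is what triggers the need to specify refresh interval for external table error.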

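3. A materialized view build hangs and holds the table's metadata lock, so StarRocks transactions fail, the Flink job's checkpoints fail repeatedly, and no data can be written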

Error message:

Could not complete snapshot 1 for operator Sink: dwt_search_adpv_to_sr[3] (2/2)#0. Failure reason: Checkpoint was declined.
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 1 for operator Sink: test_sr[3] (2/2)#0. Failure reason: Checkpoint was declined.
Caused by: com.starrocks.data.load.stream.exception.StreamLoadFailException: Transaction prepare failed, db: test_db, table: test_sr, label: flink-b0439806-d365-44f8-870a-1204a579a797, responseBody: {
  "Status": "SERVICE_UNAVAILABLE",
  "Message": "Failed to load data into tablet 613617932, because of too many versions, current/limit: 1501/1500. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,: be:127.0.0.1"
}

Troubleshooting

The error message contains Failed to load data into tablet 613617932, so start from that tablet.

Look the tablet up in StarRocks:
SHOW TABLET 613617932;
The output looks like this:
DbName|TableName|PartitionName|IndexName|DbId|TableId|PartitionId|IndexId  |IsSync|DetailCmd                                                                     |
------+---------+-------------+---------+----+-------+-----------+---------+------+------------------------------------------------------------------------------+
      |         |             |         |1   |1      |1          |613617667|true  |SHOW PROC '/dbs/477830501/609929097/partitions/610006565/613617667/613617932';|

Then run the command from the DetailCmd column:
SHOW PROC '/dbs/477830501/609929097/partitions/610006565/613617667/613617932';

Output:
ReplicaId|BackendId|Version|VersionHash|LstSuccessVersion|LstSuccessVersionHash|LstFailedVersion|LstFailedVersionHash|LstFailedTime|SchemaHash|DataSize|RowCount|State|IsBad|IsSetBadForce|VersionCount|PathHash|MetaUrl                             |CompactionStatus                                                 |IsErrorState|
---------+---------+-------+-----------+-----------------+---------------------+----------------+--------------------+-------------+----------+--------+--------+-----+-----+-------------+------------+--------+------------------------------------+-----------------------------------------------------------------+------------+
613617933|449984861|3238   |0          |3238             |0                    |-1              |0                   |             |-1        |16754120|758341  |ALTER|false|false        |1501        |-1      |http://*:0/api/meta/header/613617932|http://*:0/api/compaction/show?tablet_id=613617932&schema_hash=-1|false       |
613617934|327736692|3238   |0          |3238             |0                    |-1              |0                   |             |-1        |16754120|758341  |ALTER|false|false        |1501        |-1      |http://*:0/api/meta/header/613617932|http://*:0/api/compaction/show?tablet_id=613617932&schema_hash=-1|false       |
613617935|140233404|3238   |0          |3238             |0                    |-1              |0                   |             |-1        |16754120|758341  |ALTER|false|false        |1501        |-1      |http://*:0/api/meta/header/613617932|http://*:0/api/compaction/show?tablet_id=613617932&schema_hash=-1|false       |

Next, check whether any ALTER TABLE or ALTER MATERIALIZED VIEW job is in progress:
SHOW ALTER MATERIALIZED VIEW;
The result:
JobId    |TableName|CreateTime|FinishedTime|BaseIndexName|RollupIndexName|RollupId |TransactionId|State|Msg|Progress|Timeout|
---------+---------+----------+------------+-------------+---------------+---------+-------------+-----+---+--------+-------+
613617666|xxx      |xxx       |xxx         |xxx          |xxx            |613617667|xxx          |xxx  |xxx|xxx     |xxx    |

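The JobId here is 613617666 and the RollupId is 613617667, which is exactly the IndexId returned by SHOW TABLET 613617932. This materialized view build never finished; it was stuck. That matches the SHOW PROC output, where all three replicas are in state ALTER with VersionCount 1501, the same 1501/1500 reported in the error. In StarRocks, creating a materialized view takes a metadata lock on the target table test_sr and blocks every operation that may modify the table's schema or metadata.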

Solution: drop the stuck materialized view, or let its build run to completion.
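One possible way to carry this out is sketched below. It assumes the stuck job is a synchronous materialized view (rollup) build on test_db.test_sr, which is the kind of job SHOW ALTER MATERIALIZED VIEW lists. The name stuck_mv is a placeholder for the RollupIndexName of the stuck job. Check the statements against the documentation for your StarRocks version before running them.

```sql
-- 1. Find the build job that never finished.
SHOW ALTER MATERIALIZED VIEW FROM test_db;

-- 2a. Cancel the build that is still in progress on the base table ...
CANCEL ALTER TABLE ROLLUP FROM test_db.test_sr;

-- 2b. ... or, if the MV has been created but is no longer wanted, drop it.
--     stuck_mv stands for the RollupIndexName from step 1.
DROP MATERIALIZED VIEW IF EXISTS stuck_mv;

-- 3. Confirm that no job is left pending before restarting the Flink job.
SHOW ALTER MATERIALIZED VIEW FROM test_db;
```

Once the job is gone, the Flink job should be able to write again; rerunning SHOW PROC on the tablet is a quick way to check whether VersionCount has dropped back below the limit.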

4. The Flink SQL StarRocks connector writes data in CSV format by default; if a data value contains '', it is parsed incorrectly

Solution:

-- Set the following options in the WITH clause to load data as JSON
'sink.properties.format' = 'json',  
'sink.properties.strip_outer_array' = 'true',
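
For context, this is roughly how the two options fit into a complete Flink SQL sink definition. It is a minimal sketch: the sink table name, the columns, the host fe_host, the ports, and the credentials are placeholders, not values from the original job.

```sql
-- Hypothetical Flink SQL sink table writing to test_db.test_sr.
CREATE TABLE test_sr_sink (
    id   BIGINT,
    name STRING
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://fe_host:9030',   -- FE query port (placeholder)
    'load-url' = 'fe_host:8030',                -- FE HTTP port (placeholder)
    'database-name' = 'test_db',
    'table-name' = 'test_sr',
    'username' = 'your_user',
    'password' = 'your_password',
    -- Send rows as JSON instead of the default CSV.
    'sink.properties.format' = 'json',
    -- Each batch is a JSON array; strip the outer [] so every element is one row.
    'sink.properties.strip_outer_array' = 'true'
);
```

Note that the last option in a WITH clause takes no trailing comma. With JSON, fields are matched by name, so delimiter or quote characters inside a value no longer shift the columns.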