1. The BE fails to start, and the log reports: file descriptors limit is too small
E0507 21:25:49.384902 5103 storage_engine.cpp:415] File descriptor number is less than 60000. Please use (ulimit -n) to set a value equal or greater than 60000
W0507 21:25:49.385150 5103 storage_engine.cpp:199] check fd number failed, error: Internal error: file descriptors limit is too small
W0507 21:25:49.385156 5103 storage_engine.cpp:102] open engine failed, error: Internal error: file descriptors limit is too small
E0507 21:25:49.385486 5103 starrocks_be.cpp:129] file descriptors limit is too small
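Cause: the system's file descriptor limit is below 60000, and the StarRocks BE needs at least 60000 file descriptors to start.
Solution:
- Temporary change: log in as a regular user and run ulimit -n 60000 to raise the limit to 60000 for the current session. This works only if the hard limit is already at least 60000, the BE must be started from that same session, and the setting is lost when the session ends.
- Permanent change: as an administrator, edit /etc/security/limits.conf and add or modify the following lines (they take effect for new login sessions):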
* soft nofile 60000
* hard nofile 60000
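2. Issues between StarRocks and external systems
(1) A materialized view over a table stored in an external system must be created inside one of StarRocks' own internal databases. Otherwise the statement fails with: SQL Error [1064] [42000]: Getting analyzing error at line 1, column 89. Detail message: Can not find database:test_p.
Solution: switch to an internal StarRocks database, and in the FROM clause of the CREATE MATERIALIZED VIEW statement reference the external table by its fully qualified, federated-query name, paimon_catalog.test_p.p_test. The MV can then be created over the external table.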
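The fully qualified name can be checked before creating the MV. A minimal sketch, assuming a StarRocks version that supports SET CATALOG; the internal database name sr_internal_db is hypothetical, while paimon_catalog.test_p.p_test is the external table from the text above:

```sql
-- Switch back to the internal catalog and an internal database.
-- sr_internal_db is a hypothetical name; use any existing internal database.
SET CATALOG default_catalog;
USE sr_internal_db;

-- Query the external table with its three-part name (catalog.database.table).
-- If this works, the same name can be used in the MV's FROM clause.
SELECT * FROM paimon_catalog.test_p.p_test LIMIT 10;
```

If the SELECT resolves, the "Can not find database" error should no longer occur, because the session's current database is now an internal one and the external table is reached through the catalog prefix.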
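(2) An asynchronous materialized view over an external table must specify a refresh interval of at least 1 minute. Otherwise the statement fails with: SQL Error [1064] [42000]: Materialized view which type is ASYNC need to specify refresh interval for external table
Cause:
- For an internal table the interval can be omitted entirely. With plain REFRESH ASYNC, the asynchronous materialized view is refreshed whenever the data in the internal base table changes.
- For an external table, StarRocks cannot detect data changes in the external system, so a refresh interval must be configured for the MV to be refreshed at all.
Solution: configure REFRESH ASYNC EVERY(INTERVAL 1 MINUTE). The interval must be at least 1 minute, or the statement fails.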
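Put together, the statement might look like the sketch below. The MV name mv_p_test and the database sr_internal_db are hypothetical, and the query is kept to a bare SELECT * for brevity; older StarRocks versions may additionally require a DISTRIBUTED BY clause.

```sql
-- Create the MV inside an internal database (sr_internal_db is hypothetical).
USE sr_internal_db;

-- The base table is external, so a refresh interval is mandatory.
-- One minute is the minimum described above.
CREATE MATERIALIZED VIEW mv_p_test
REFRESH ASYNC EVERY (INTERVAL 1 MINUTE)
AS
SELECT * FROM paimon_catalog.test_p.p_test;
```

For an MV whose base tables are all internal, the refresh clause could be reduced to REFRESH ASYNC.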
3. A stuck materialized view creation locks the table's metadata; StarRocks transactions fail, the Flink job's checkpoints fail repeatedly, and no data can be written
Error message:
Could not complete snapshot 1 for operator Sink: dwt_search_adpv_to_sr[3] (2/2)#0. Failure reason: Checkpoint was declined.
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 1 for operator Sink: test_sr[3] (2/2)#0. Failure reason: Checkpoint was declined.
Caused by: com.starrocks.data.load.stream.exception.StreamLoadFailException: Transaction prepare failed, db: test_db, table: test_sr, label: flink-b0439806-d365-44f8-870a-1204a579a797, responseBody: { "Status": "SERVICE_UNAVAILABLE", "Message": "Failed to load data into tablet 613617932, because of too many versions, current/limit: 1501/1500. You can reduce the loading job concurrency, or increase loading data batch size. If you are loading data with Routine Load, you can increase FE configs routine_load_task_consume_second and max_routine_load_batch_size,: be:127.0.0.1" }
Troubleshooting
The error message contains Failed to load data into tablet 613617932,
so look the tablet up in StarRocks with SHOW TABLET 613617932
The output is:
DbName|TableName|PartitionName|IndexName|DbId|TableId|PartitionId|IndexId|IsSync|DetailCmd |
------+---------+-------------+---------+----+-------+-----------+-------+------+--------------------------------------------------+
| | | |1 |1 |1 |613617667 |true |SHOW PROC '/dbs/477830501/609929097/partitions/610006565/613617667/613617932';|
Then run the DetailCmd from that output:
SHOW PROC '/dbs/477830501/609929097/partitions/610006565/613617667/613617932';
The output is:
ReplicaId|BackendId|Version|VersionHash|LstSuccessVersion|LstSuccessVersionHash|LstFailedVersion|LstFailedVersionHash|LstFailedTime|SchemaHash|DataSize|RowCount|State|IsBad|IsSetBadForce|VersionCount|PathHash|MetaUrl |CompactionStatus |IsErrorState|
---------+---------+-------+-----------+-----------------+---------------------+----------------+--------------------+-------------+----------+--------+--------+-----+-----+-------------+------------+--------+------------------------------------+-----------------------------------------------------------------+------------+
613617933|449984861|3238 |0 |3238 |0 |-1 |0 | |-1 |16754120|758341 |ALTER|false|false |1501 |-1 |http://*:0/api/meta/header/613617932|http://*:0/api/compaction/show?tablet_id=613617932&schema_hash=-1|false |
613617934|327736692|3238 |0 |3238 |0 |-1 |0 | |-1 |16754120|758341 |ALTER|false|false |1501 |-1 |http://*:0/api/meta/header/613617932|http://*:0/api/compaction/show?tablet_id=613617932&schema_hash=-1|false |
613617935|140233404|3238 |0 |3238 |0 |-1 |0 | |-1 |16754120|758341 |ALTER|false|false |1501 |-1 |http://*:0/api/meta/header/613617932|http://*:0/api/compaction/show?tablet_id=613617932&schema_hash=-1|false |
Next, check whether any ALTER TABLE or ALTER MATERIALIZED VIEW job is still running.
Run SHOW ALTER MATERIALIZED VIEW;
The output is:
JobId|TableName|CreateTime|FinishedTime|BaseIndexName|RollupIndexName|RollupId|TransactionId|State|Msg|Progress|Timeout|
-----+---------+----------+------------+-------------+---------------+--------+-------------+-----+---+--------+-------+
613617666| xxx| xxx | xxx | xxx | xxx |613617667| xxx |xxx |xxx| xxx |xxx |
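The JobId here is 613617666 and the RollupId is 613617667, which is exactly the IndexId returned by SHOW TABLET 613617932. This materialized view job never finished; it was stuck. That matches the SHOW PROC output above, where all three replicas of the tablet are in the ALTER state with a VersionCount of 1501, one over the limit of 1500. In StarRocks, creating a materialized view takes a metadata lock on the target table test_sr and blocks every operation that may modify the table's schema or metadata, so the load transactions coming from Flink fail.
Solution: drop the stuck materialized view, or let it finish.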
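One way to clear the stuck job is sketched below. The statements are standard StarRocks SQL, but which of them applies depends on the StarRocks version and on how the MV was created; the MV name stuck_mv is hypothetical, because the real name is masked in the output above.

```sql
-- Re-check the job state first; anything that is not FINISHED or CANCELLED is still pending.
SHOW ALTER MATERIALIZED VIEW FROM test_db;

-- Option 1: cancel the unfinished rollup (synchronous MV) job on the base table.
CANCEL ALTER TABLE ROLLUP FROM test_db.test_sr;

-- Option 2: drop the materialized view itself.
-- stuck_mv is a hypothetical name; substitute the masked RollupIndexName.
DROP MATERIALIZED VIEW IF EXISTS test_db.stuck_mv;
```

After the job is gone, rerunning the SHOW PROC command from above should show whether the replicas have left the ALTER state and whether VersionCount is dropping back below the limit.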
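4. The Flink SQL StarRocks connector writes in CSV format by default; if a data value contains '', the rows are parsed incorrectly
Solution:
-- Set the following options in the WITH clause to switch to JSON format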
'sink.properties.format' = 'json',
'sink.properties.strip_outer_array' = 'true',
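For context, a complete sink definition with these two options might look like the following sketch. The column list, the host fe_host, the ports, and the credentials are assumptions for illustration only; test_db and test_sr are the names from the error message in section 3.

```sql
-- Flink SQL sink table writing to StarRocks in JSON format.
-- Columns, host, ports and credentials below are hypothetical placeholders.
CREATE TABLE test_sr_sink (
    id BIGINT,
    name STRING
) WITH (
    'connector' = 'starrocks',
    'jdbc-url' = 'jdbc:mysql://fe_host:9030',
    'load-url' = 'fe_host:8030',
    'database-name' = 'test_db',
    'table-name' = 'test_sr',
    'username' = 'root',
    'password' = '',
    -- Send rows as a JSON array instead of delimiter-separated CSV
    'sink.properties.format' = 'json',
    'sink.properties.strip_outer_array' = 'true'
);
```

With JSON, field boundaries no longer depend on a delimiter character, so special characters inside a value should not split or shift columns.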