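Environment: a MongoDB v4.4.5 sharded cluster deployed on Kubernetes.
Problem: transactions against the sharded cluster frequently fail with errors such as the following: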
org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/devops-cicenter-flowmain].[dispatcherServlet] [175] -| Servlet.service() for servlet [dispatcherServlet] in context with path [/devops-cicenter-flowmain] threw exception [Request processing failed; nested exception is org.springframework.transaction.TransactionSystemException: Could not commit Mongo transaction for session [ClientSessionImpl@5bd2e9ce id = {"id": {"$binary": {"base64": "uP28bbCXR9m1tJJ4At3YhQ==", "subType": "04"}}}, causallyConsistent = true, txActive = false, txNumber = 87, error = d != java.lang.Boolean].; nested exception is com.mongodb.MongoCommandException: Command failed with error 251 (NoSuchTransaction): 'Recovery token is empty, meaning the transaction only performed reads and can be safely retried' on server rdcloud-mongo-mongodb-sharded:27017. The full response is {"ok": 0.0, "errmsg": "Recovery token is empty, meaning the transaction only performed reads and can be safely retried", "code": 251, "codeName": "NoSuchTransaction", "operationTime": {"$timestamp": {"t": 1639759743, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1639759743, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "7cEyTxp4gQrVCA/Nno+WHfz4UoI=", "subType": "00"}}, "keyId": 7016257415104430101}}, "errorLabels": ["TransientTransactionError"]}] with root cause
com.mongodb.MongoCommandException: Command failed with error 251 (NoSuchTransaction): 'Recovery token is empty, meaning the transaction only performed reads and can be safely retried' on server rdcloud-mongo-mongodb-sharded:27017. The full response is {"ok": 0.0, "errmsg": "Recovery token is empty, meaning the transaction only performed reads and can be safely retried", "code": 251, "codeName": "NoSuchTransaction", "operationTime": {"$timestamp": {"t": 1639759743, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1639759743, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "7cEyTxp4gQrVCA/Nno+WHfz4UoI=", "subType": "00"}}, "keyId": 7016257415104430101}}, "errorLabels": ["TransientTransactionError"]}
at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:175)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:359)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:280)
at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:100)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:490)
at com.mongodb.internal.connection.CommandProtocolImpl
2021-12-21 13:47:07.596 |-ERROR [http-nio-8080-exec-8] org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/zte-devops-cicenter-flowmain].[dispatcherServlet] [175] -| Servlet.service() for servlet [dispatcherServlet] in context with path [/zte-devops-cicenter-flowmain] threw exception [Request processing failed; nested exception is org.springframework.transaction.TransactionSystemException: [ClientSessionImpl@5e33d6de id = {"id": {"$binary": {"base64": "cdh12E2MQeabbPq/sGJppA==", "subType": "04"}}}, causallyConsistent = true, txActive = false, txNumber = 6, error = d != java.lang.Boolean].; nested exception is com.mongodb.MongoCommandException: Command failed with error 251 (NoSuchTransaction): 'Recovering the transaction's outcome found the transaction aborted' on server 10.57.67.51:30320. The full response is {"errorLabels": ["TransientTransactionError"], "ok": 0.0, "errmsg": "Recovering the transaction's outcome found the transaction aborted", "code": 251, "codeName": "NoSuchTransaction", "operationTime": {"$timestamp": {"t": 1640065627, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1640065627, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "z7ec6229DTGAACNjEecH8ywWsns=", "subType": "00"}}, "keyId": 7043689418468622349}}, "recoveryToken": {}}] with root cause
com.mongodb.MongoCommandException: Command failed with error 251 (NoSuchTransaction): 'Recovering the transaction's outcome found the transaction aborted' on server 10.57.67.51:30320. The full response is {"errorLabels": ["TransientTransactionError"], "ok": 0.0, "errmsg": "Recovering the transaction's outcome found the transaction aborted", "code": 251, "codeName": "NoSuchTransaction", "operationTime": {"$timestamp": {"t": 1640065627, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1640065627, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "z7ec6229DTGAACNjEecH8ywWsns=", "subType": "00"}}, "keyId": 7043689418468622349}}, "recoveryToken": {}}
at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:175)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:359)
2021-12-22 10:03:11.629 |-ERROR [http-nio-8080-exec-1] cn.com.zte.devops.flowmain.service.impl.FlowServiceImpl [252] -| update flow config flowInitialId:8358d695-a350-7d1e-88e0-b18ab8e4a975 failed
org.springframework.data.mongodb.MongoTransactionException: Command failed with error 251 (NoSuchTransaction): 'cannot continue txnId -1 for f1946d3c-4f4c-4d49-9cdd-c4450dcf1e53 - Y5mrDaxi8gv8RmdTsQ+1j7fmkr7JUsabhNmXAheU0fg= with txnId 2' on server 10.57.67.51:30320. The full response is {"ok": 0.0, "errmsg": "cannot continue txnId -1 for session f1946d3c-4f4c-4d49-9cdd-c4450dcf1e53 - Y5mrDaxi8gv8RmdTsQ+1j7fmkr7JUsabhNmXAheU0fg= with txnId 2", "code": 251, "codeName": "NoSuchTransaction", "operationTime": {"$timestamp": {"t": 1640138589, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1640138589, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "zh4C+gNah+G/Uu+crY1N2hBJw3E=", "subType": "00"}}, "keyId": 7043689418468622349}}, "errorLabels": ["TransientTransactionError"]}; nested exception is com.mongodb.MongoCommandException: Command failed with error 251 (NoSuchTransaction): 'cannot continue txnId -1 for session f1946d3c-4f4c-4d49-9cdd-c4450dcf1e53 - Y5mrDaxi8gv8RmdTsQ+1j7fmkr7JUsabhNmXAheU0fg= with txnId 2' on server 10.57.67.51:30320. The full response is {"ok": 0.0, "errmsg": "cannot continue txnId -1 for session f1946d3c-4f4c-4d49-9cdd-c4450dcf1e53 - Y5mrDaxi8gv8RmdTsQ+1j7fmkr7JUsabhNmXAheU0fg= with txnId 2", "code": 251, "codeName": "NoSuchTransaction", "operationTime": {"$timestamp": {"t": 1640138589, "i": 1}}, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1640138589, "i": 1}}, "signature": {"hash": {"$binary": {"base64": "zh4C+gNah+G/Uu+crY1N2hBJw3E=", "subType": "00"}}, "keyId": 7043689418468622349}}, "errorLabels": ["TransientTransactionError"]}
at org.springframework.data.mongodb.core.MongoExceptionTranslator.translateExceptionIfPossible(MongoExceptionTranslator.java:137)
at org.springframework.data.mongodb.core.MongoTemplate.potentiallyConvertRuntimeException(MongoTemplate.java:2881)
at org.springframework.data.mongodb.core.MongoTemplate.execute(MongoTemplate.java:563)
at org.springframework.data.mongodb.core.MongoTemplate.doCount(MongoTemplate.java:1133)
Cause: do not put a load balancer in front of mongos.
finisky.github.io/2021/01/13/…
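It later turned out that the culprit was how the sharded cluster was deployed: we had placed a load balancer in front of two mongos instances, so the operations of a single transaction could be executed on two different mongos instances, which produced the errors above.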
I found the root cause: the load balancer in front of mongos. Since there are two mongos instances behind the same stateless Kubernetes service, a transaction may not be executed on the same mongos throughout its lifetime.
I'll expose every mongos instance separately and change the connection string.
I think the mongos used during the transaction changed, which is why you got this error message. This is called mongos pinning: once you start a transaction, all of its requests must be sent to the same mongos on which it started.
So you don't need to load-balance the connections from the driver, because the driver already contains its own (random) balancing logic. Just list your mongos servers in the connection string and everything should be fine.
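As a minimal sketch of what "just specify your mongos servers" can look like, the helper below builds a standard `mongodb://` connection string that lists each mongos directly instead of a single load-balanced address. The class name `MongosConnectionString`, the host names, and the database name are hypothetical and only serve as illustration; the actual addresses depend on how each mongos is exposed in your cluster.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Builds a MongoDB connection string that lists every mongos directly (illustrative helper). */
final class MongosConnectionString {

    private MongosConnectionString() {}

    /**
     * Joins "host:port" seeds into a standard "mongodb://" URI.
     * With all mongos instances listed, the driver chooses one itself,
     * so no external load balancer sits between the driver and mongos.
     */
    static String build(List<String> mongosHosts, String database) {
        if (mongosHosts == null || mongosHosts.isEmpty()) {
            throw new IllegalArgumentException("at least one mongos host is required");
        }
        String seeds = mongosHosts.stream()
                .map(String::trim)                 // tolerate stray whitespace in config values
                .collect(Collectors.joining(","));
        return "mongodb://" + seeds + "/" + database;
    }
}
```

For example, `MongosConnectionString.build(List.of("mongos-0.example:27017", "mongos-1.example:27017"), "appdb")` yields `mongodb://mongos-0.example:27017,mongos-1.example:27017/appdb`. The resulting string could then be passed to the Java driver's `MongoClients.create(String)` or set as `spring.data.mongodb.uri` in a Spring Boot application.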
Mongos Pinning
Drivers MUST send all commands for a single transaction to the same mongos (excluding retries of commitTransaction and abortTransaction).
After the driver selects a mongos for the first command within a transaction, the driver MUST pin the ClientSession to the selected mongos. Drivers MUST send all subsequent commands that are part of the same transaction (excluding certain retries of commitTransaction and abortTransaction) to the same mongos.
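The pinning rule quoted above can be illustrated with a toy model. This is not the driver's real implementation: `PinnedSession` and its methods are hypothetical names, and the special handling of `commitTransaction` and `abortTransaction` retries mentioned in the specification is deliberately left out. It only shows the core idea that the first command in a transaction picks a mongos and every later command in that transaction reuses it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy model of mongos pinning; a sketch, not the real driver code. */
final class PinnedSession {
    private final List<String> mongosHosts;
    private final Random random;
    private boolean inTransaction;
    private String pinnedMongos; // null until the first command of a transaction

    PinnedSession(List<String> mongosHosts, Random random) {
        if (mongosHosts.isEmpty()) {
            throw new IllegalArgumentException("at least one mongos host is required");
        }
        this.mongosHosts = new ArrayList<>(mongosHosts);
        this.random = random;
    }

    void startTransaction() {
        inTransaction = true;
        pinnedMongos = null; // a new transaction selects its mongos afresh
    }

    void endTransaction() {
        inTransaction = false;
        pinnedMongos = null;
    }

    /** Returns the mongos that should receive the next command. */
    String selectServer() {
        if (!inTransaction) {
            return pickRandom(); // outside a transaction any mongos will do
        }
        if (pinnedMongos == null) {
            pinnedMongos = pickRandom(); // first command in the transaction pins the session
        }
        return pinnedMongos; // later commands stay on the pinned mongos
    }

    private String pickRandom() {
        return mongosHosts.get(random.nextInt(mongosHosts.size()));
    }
}
```

A load balancer placed between the driver and mongos defeats this: the driver believes it is talking to one pinned address, while the balancer may forward each command to a different mongos, which does not know about the transaction.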