[rds-devel] what is rdma immediate data and how is it used ?

Generally, rdma operations do not notify the remote host when they complete. It is possible to set up notification - for example, using IB native "immediate" data (4 bytes) - but the data size is limited.

RDS rdma immediate data is a normal RDS socket message which is guaranteed to be delivered after the rdma completes. The immediate data size is limited only by the socket send message operation. Furthermore, the rdma data and the immediate data form an atomic unit: either both arrive or neither arrives.

In practice the immediate data is pushed down the RDS pipe immediately (a few instructions) after the rdma is posted. So the immediate data races behind the rdma operation and, in theory, arrives with very low latency after the rdma completes.

So what is the immediate data good for ?

Well, for one, the client requesting the rdma would like to know when the rdma has completed. So generally, the immediate data is a message from the rdma server containing some identifier that the rdma client provided. The rdma client can use that identifier to recognize which operation has completed and then do things like free the rdma key and, if the rdma is incoming, process the data and free the buffer.
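
The identifier scheme described above can be sketched as a small table of pending operations kept by the rdma client. This is only an illustration: the 8-byte cookie layout, the `PendingOps` class, and its method names are assumptions made for the example and are not part of the RDS API.

```python
import struct

# Hypothetical wire format for the immediate-data message: one 8-byte
# unsigned cookie in network byte order. RDS does not mandate this layout.
COOKIE_FMT = "!Q"


class PendingOps:
    """Tracks rdma operations the client is still waiting on (illustrative)."""

    def __init__(self):
        self._next_cookie = 1
        self._pending = {}

    def start(self, buffer):
        """Register a buffer for an rdma operation and return its cookie."""
        cookie = self._next_cookie
        self._next_cookie += 1
        self._pending[cookie] = buffer
        return cookie

    def complete(self, immediate_data):
        """Match an immediate-data message to its operation and release it."""
        (cookie,) = struct.unpack(COOKIE_FMT, immediate_data)
        # Removing the entry stands in for freeing the rdma key and buffer.
        return cookie, self._pending.pop(cookie)

    def outstanding(self):
        """Number of operations still awaiting their immediate data."""
        return len(self._pending)


ops = PendingOps()
cookie = ops.start(bytearray(4096))          # client asks for a 4 KiB block
message = struct.pack(COOKIE_FMT, cookie)    # what the server would echo back
done_cookie, buf = ops.complete(message)
```

Because the immediate data is delivered only after the rdma completes, the client can treat the arrival of the cookie as the signal that the buffer contents are valid.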

Consider the case of an rdma server used to implement a simple zero copy disk block server over zero copy RDS sockets.

To read a disk block, the rdma client requests that the rdma server issue a disk read, then issue an rdma write back to the rdma client, and finally send a completion message (immediate data) indicating that the rdma write is complete and the data is in rdma client memory.

So how does the rdma disk server know when the rdma read is complete, assuming it is not a sync rdma read or a sync rds barrier operation, and that the rdma server is not polling via the rds barrier operation?

Through a common completion model implemented via poll()!

Poll() can wait for:

a) incoming messages (pollin)
b) send space available (pollout)
c) any rdma completion or a specific rdma completion (pollin)
d) congestion removed from a destination (pollin)
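
The poll() cases listed above can be sketched with Python's `select.poll`. To keep the example runnable without the RDS kernel module, a local socket pair stands in for the RDS socket; that substitution, and the helper name `ready_events`, are assumptions for illustration only. Only cases a) and b) are exercised here.

```python
import select
import socket

# A local socket pair stands in for an RDS socket so the sketch runs
# without the RDS kernel module. On a real system the socket would be
# created with socket.socket(socket.AF_RDS, socket.SOCK_SEQPACKET, 0).
server_side, client_side = socket.socketpair()

poller = select.poll()
poller.register(client_side, select.POLLIN | select.POLLOUT)


def ready_events(timeout_ms=100):
    """Return the event mask reported for client_side, or 0 if none."""
    for fd, events in poller.poll(timeout_ms):
        if fd == client_side.fileno():
            return events
    return 0


before = ready_events()            # nothing queued yet: only send space
server_side.send(b"completion")    # stand-in for an immediate-data message
after = ready_events()             # now readable as well

payload = client_side.recv(64) if after & select.POLLIN else b""
server_side.close()
client_side.close()
```

A single poll loop like this lets one thread wait for incoming messages and send space on the same descriptor, which is the point of the common completion model.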

To write a disk block, the rdma client requests that the rdma server issue an rdma read from rdma client host memory into the rdma server memory, then issue the disk write and send back a completion message. What's interesting in this case is that the immediate data which could be sent as part of the rdma read is optional (it could be a send of size zero). If the immediate data is sent, the rdma client knows that the rdma server has finished pulling the data from the rdma client host memory, so the rdma client can re-use the write buffer at that point. Of course, this assumes that either the rdma client is willing to live with possible data loss in light of a path failure, or that the rdma disk server guarantees to commit the data. When the actual disk write completes, a separate completion message is sent back to the rdma client.
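
The two completion points in the write path can be sketched as a small client-side state tracker. The class, state names, and method names below are hypothetical and exist only to make the ordering explicit; they do not correspond to any RDS interface.

```python
from enum import Enum


class WriteState(Enum):
    REQUESTED = "requested"        # client asked the server to pull the data
    BUFFER_FREE = "buffer_free"    # immediate data seen: rdma read finished
    DURABLE = "durable"            # disk-write completion message seen


class BlockWrite:
    """Client-side view of one block write (hypothetical helper)."""

    def __init__(self):
        self.state = WriteState.REQUESTED

    def on_immediate_data(self):
        # The server has finished pulling the data, so the buffer may be
        # reused, but the block is not yet known to be on disk.
        if self.state is WriteState.REQUESTED:
            self.state = WriteState.BUFFER_FREE

    def on_disk_write_complete(self):
        # Valid whether or not the optional immediate data was sent.
        self.state = WriteState.DURABLE

    def buffer_reusable(self):
        return self.state is not WriteState.REQUESTED

    def durable(self):
        return self.state is WriteState.DURABLE


w = BlockWrite()
w.on_immediate_data()
early = (w.buffer_reusable(), w.durable())
w.on_disk_write_complete()
final = (w.buffer_reusable(), w.durable())
```

A client that re-uses the buffer in the `BUFFER_FREE` state is accepting the trade-off described above: the data has left its memory but is not yet committed to disk.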