AOF Format
An AOF file follows RESP and stores Redis commands one after another. redis.io/topics/prot…
*2\r\n$6\r\nSELECT\r\n$1\r\n0\r\n
*5\r\n$5\r\nRPUSH\r\n$7\r\nNUMBERS\r\n$3\r\nONE\r\n$3\r\nTWO\r\n$5\r\nTHREE\r\n
...
AOF Read
Since an AOF file is stored as a sequence of commands, reading it means creating a fake client and letting it call Redis with the commands stored in the AOF file.
Create Fake Client
Most of the settings are left null or empty because the interaction between the fake client and Redis is one-directional.
Load AOF file
- Load the AOF file and check its integrity: an AOF file must NOT start with the 5 characters REDIS.
- Start the main while loop that iterates over the commands in RESP format.
  - Create a char buffer: char buf[128].
  - Use fgets to read one line into the buffer (the buffer holds at most 128 bytes), break if nothing is read. Note that fgets stops right after the \n of \r\n.
  - Make sure the first char of the buffer, buf[0], is equal to *.
  - Convert the number that follows * (starting at buf+1) into an integer and store it in argc.
  - Start the inner loop i over argc.
    - Use fgets to read one line into the buffer. Note that fgets stops right after the \n of \r\n.
    - Make sure the first char of the buffer, buf[0], is equal to $.
    - Convert the number that follows $ (starting at buf+1) into an integer and record it as len, the length of argv[i].
    - fread the next len chars and store them in argv[i].
    - fread the next 2 chars to skip the \r\n behind argv[i].
  - Look up the corresponding command in CommandTable via lookupCommand(argv[0]->ptr).
  - Execute the command via cmd->proc(fakeClient).
  - Go back to the top of the main loop and repeat.
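The loading loop above can be sketched in plain C. This is a minimal sketch under stated assumptions, not the Redis implementation: read_resp_command and MAX_ARGS are hypothetical names, arguments are plain malloc'd strings instead of Redis objects, and the lookupCommand / cmd->proc dispatch is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 16 /* illustrative limit, not a Redis constant */

/* Hypothetical helper: read one RESP-encoded command from fp.
 * Returns argc on success, 0 when nothing is left to read, -1 on a
 * format error. Each argv[i] is malloc'd and NUL-terminated; the caller
 * frees them. */
int read_resp_command(FILE *fp, char *argv[MAX_ARGS]) {
    char buf[128];
    int filled = 0; /* number of argv slots that must be freed on error */

    assert(fp != NULL && argv != NULL);
    if (fgets(buf, sizeof(buf), fp) == NULL) return 0; /* nothing read */
    if (buf[0] != '*') return -1;                      /* "*<argc>\r\n" */
    int argc = atoi(buf + 1);
    if (argc < 1 || argc > MAX_ARGS) return -1;

    for (int i = 0; i < argc; i++) {
        char crlf[2];
        /* "$<len>\r\n" announces the length of the next argument. */
        if (fgets(buf, sizeof(buf), fp) == NULL || buf[0] != '$') goto err;
        long len = strtol(buf + 1, NULL, 10);
        if (len < 0) goto err;
        argv[i] = malloc((size_t)len + 1);
        if (argv[i] == NULL) goto err;
        filled = i + 1;
        /* The payload is read by length, so it may contain any byte. */
        if (len > 0 && fread(argv[i], (size_t)len, 1, fp) != 1) goto err;
        argv[i][len] = '\0';
        /* Skip the \r\n that trails the payload. */
        if (fread(crlf, 2, 1, fp) != 1) goto err;
    }
    return argc;

err:
    for (int i = 0; i < filled; i++) free(argv[i]);
    return -1;
}
```

A caller would invoke read_resp_command in a loop until it returns 0, dispatching each parsed command in between.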
AOF Write
Redis encodes an entire command into RESP format and appends it to the end of server.aof_buf. If a BGREWRITEAOF is in progress, the command is also appended to the AOF rewrite buffer.
Append Generic Command
These are the commands that have nothing to do with expiration.
- Encode a SELECT command if the key Redis is going to encode doesn't reside in the db recorded in server.aof_selected_db. Then update server.aof_selected_db.
- Encode a char *.
- Encode argc as a decimal string.
- Encode \r\n.
- Loop i over argc.
  - Encode a char $.
  - Encode sdslen(argv[i]) as a decimal string.
  - Encode \r\n.
  - Encode the value of argv[i].
  - Encode \r\n.
  - Repeat for the next i.
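The encoding steps above can be sketched as a small C function. This is an illustrative sketch only: encode_resp_command is a hypothetical name, it writes into a fixed caller-supplied buffer rather than an sds string, it uses strlen (so it is not binary safe the way sdslen is), and it leaves out the SELECT step.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: encode argc/argv as one RESP command into out.
 * Returns the number of bytes written (excluding the NUL terminator),
 * or -1 if the output buffer is too small. */
int encode_resp_command(char *out, size_t cap, int argc, const char **argv) {
    size_t off = 0;

    assert(out != NULL && argv != NULL);
    /* "*<argc>\r\n" */
    int n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    off += (size_t)n;

    for (int i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]);
        /* "$<len>\r\n" */
        n = snprintf(out + off, cap - off, "$%zu\r\n", len);
        if (n < 0 || (size_t)n >= cap - off) return -1;
        off += (size_t)n;
        /* "<payload>\r\n", keeping one byte spare for the final NUL */
        if (len + 2 >= cap - off) return -1;
        memcpy(out + off, argv[i], len);
        off += len;
        memcpy(out + off, "\r\n", 2);
        off += 2;
    }
    out[off] = '\0';
    return (int)off;
}
```

Encoding the five arguments RPUSH, NUMBERS, ONE, TWO, THREE with this helper yields the second example line from the AOF Format section.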
Append Expire Command
/* This command is used in order to translate EXPIRE and PEXPIRE commands
* into PEXPIREAT command so that we retain precision in the append only
* file, and the time is always absolute and not relative. */
- If the unit of the command is seconds, convert it to ms by multiplying the value by 1000.
  - EXPIRE
  - SETEX
  - EXPIREAT
- If the command uses a relative time value, convert it to an absolute one by adding mstime().
  - EXPIRE
  - PEXPIRE
  - SETEX
  - PSETEX
- Now all the aforementioned commands can be converted into the format: PEXPIREAT [key] [absolute time value]
- Append the converted command using the steps in section: Append Generic Command
AOF Rewrite
/* This is how rewriting of the append only file in background works:
*
* 1) The user calls BGREWRITEAOF
* 2) Redis calls this function, that forks():
* 2a) the child rewrite the append only file in a temp file.
* 2b) the parent accumulates differences in server.aof_rewrite_buf.
* 3) When the child finishes '2a' it exits.
* 4) The parent will trap the exit code, if it's OK, will append the
* data accumulated into server.aof_rewrite_buf into the temp file, and
* finally will rename(2) the temp file to the actual file name.
* Then the new file is reopened as the new append only file. Profit!
*/
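The flow described in the comment above can be condensed into a toy program. This is a sketch under assumptions, not Redis code: rewrite_in_background is a hypothetical name, the snapshot string stands in for the dataset dump done by the child, the diff string stands in for server.aof_rewrite_buf, and a blocking waitpid() replaces the periodic check Redis performs in serverCron().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: rebuild the file at final_path via a forked child.
 * Returns 0 on success, -1 on failure. */
int rewrite_in_background(const char *final_path, const char *tmp_path,
                          const char *snapshot, const char *diff) {
    assert(final_path && tmp_path && snapshot && diff);

    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        /* Child (2a): write the compact file, then report via exit code. */
        FILE *out = fopen(tmp_path, "w");
        if (out == NULL) _exit(1);
        int wrote = fputs(snapshot, out) != EOF;
        wrote = (fclose(out) == 0) && wrote;
        _exit(wrote ? 0 : 1);
    }

    /* Parent (3, 4): trap the exit code of the child. */
    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;

    /* Append what accumulated while the child was working. */
    FILE *out = fopen(tmp_path, "a");
    if (out == NULL) return -1;
    int ok = fputs(diff, out) != EOF;
    ok = (fclose(out) == 0) && ok;
    if (!ok) return -1;

    /* rename(2) replaces the target in one step. */
    return rename(tmp_path, final_path) == 0 ? 0 : -1;
}
```

After a successful call, the final file holds the snapshot followed by the diff and the temp file no longer exists.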
AOF Server Rewrite Algorithm
- Open and initialize all the pipes needed for IPC between server and child. See section AOF IPC for detail.
- Fork a child process. See the next section for detail.
- Log some metrics about AOF.
- Set server.aof_selected_db = -1 to make sure the rewrite buffer starts with the correct db.
server.stat_fork_time = ustime()-start;
server.stat_fork_rate = (double) zmalloc_used_memory() * 1000000 / server.stat_fork_time / (1024*1024*1024); /* GB per second. */
server.aof_rewrite_scheduled = 0;
server.aof_rewrite_time_start = time(NULL);
server.aof_child_pid = childpid;
updateDictResizePolicy();
/* We set appendseldb to -1 in order to force the next call to the
* feedAppendOnlyFile() to issue a SELECT command, so the differences
* accumulated by the parent into server.aof_rewrite_buf will start
* with a SELECT statement and it will be safe to merge. */
server.aof_selected_db = -1;
AOF Child Rewrite Algorithm
- Open a temp file temp-rewriteaof-{pid}.aof.
- Initialize the child-side rewrite buffer server.aof_child_diff as an empty sds.
- Iterate over every DB of Redis.
  - Append a SELECT command.
  - Iterate over every key/value pair in the db.
    - Append the correct command based on o->type and o->encoding.
    - Keep accepting AOF diff from the server side and append it to the end of the child-side rewrite buffer. See section AOF Child Rewrite Buffer Read for detail.
- After rewriting all the DBs, do an fsync() first to make the next fsync() faster.
- Apply the data in the child rewrite buffer to the end of the temp file. See section AOF Child Rewrite Buffer Apply for detail.
int rewriteAppendOnlyFileRio(rio *aof) {
    ...
    size_t processed = 0;
    for (j = 0; j < server.dbnum; j++) {
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
        redisDb *db = server.db+j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;
        di = dictGetSafeIterator(d);
        ...
        /* Iterate this DB writing every entry */
        while((de = dictNext(di)) != NULL) {
            o = dictGetVal(de);
            if (o->type == OBJ_STRING) {
                ...
            } else if
                ...
            } else {
                serverPanic("Unknown object type");
            }
            /* Read some diff from the parent process from time to time. */
            if (aof->processed_bytes > processed+AOF_READ_DIFF_INTERVAL_BYTES) {
                processed = aof->processed_bytes;
                aofReadDiffFromParent();
            }
        }
    }
}
AOF Rewrite Finish
The server doesn't wait synchronously for the child to finish the AOF rewrite. Instead, it periodically checks in serverCron() whether the rewrite is done and handles the result via backgroundRewriteDoneHandler().
- Open the temp AOF file and store the fd in newfd.
- Do a final append of the server rewrite buffer.
- Rename the temp AOF file atomically.
- Save server.aof_fd, which points to the old AOF file, in oldfd.
- Assign newfd to server.aof_fd.
- Close oldfd in a background thread, so that the unlink of the old file triggered by the close doesn't block the server.
/* A background append only file rewriting (BGREWRITEAOF) terminated its work.
* Handle this. */
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    if (!bysignal && exitcode == 0) {
        ...
        // open fds, local buffers
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int)server.aof_child_pid);
        newfd = open(tmpfile,O_WRONLY|O_APPEND);
        ...
        // Flush the differences accumulated by the parent to the rewritten AOF.
        aofRewriteBufferWrite(newfd);
        ...
        // rename() won't unlink the old AOF file here (which could block the
        // server), because server.aof_fd is still referencing it
        rename(tmpfile,server.aof_filename);
        if (server.aof_fd == -1) {
            /* AOF disabled, we don't need to set the AOF file descriptor
             * to this new file, so we can close it. */
            close(newfd);
        } else {
            /* AOF enabled, replace the old fd with the new one. */
            oldfd = server.aof_fd;
            server.aof_fd = newfd;
            ...
            /* Clear regular AOF buffer since its contents was just written to
             * the new AOF from the background rewrite buffer. */
            sdsfree(server.aof_buf);
            server.aof_buf = sdsempty();
        }
        ...
        /* Asynchronously close the overwritten AOF. */
        if (oldfd != -1) {
            bioCreateBackgroundJob(BIO_CLOSE_FILE,(void*)(long)oldfd,NULL,NULL);
        }
    } else {
        ...
        // handle failure
    }
}
AOF IPC
The server and the child use dedicated pipes to communicate with each other. The data transmitted over the data pipe is temporarily stored in a data structure called the Rewrite Buffer.
/* Create the pipes used for parent - child process IPC during rewrite.
* We have a data pipe used to send AOF incremental diffs to the child,
* and two other pipes used by the children to signal it finished with
* the rewrite so no more data should be written, and another for the
* parent to acknowledge it understood this new condition. */
int aofCreatePipes(void) {
    int fds[6] = {-1, -1, -1, -1, -1, -1};
    ...
    // One direction for transferring AOF diff from server to child.
    server.aof_pipe_write_data_to_child = fds[1];
    server.aof_pipe_read_data_from_parent = fds[0];
    // Child to server ACK, indicating the child asked the server to stop sending diff.
    server.aof_pipe_write_ack_to_parent = fds[3];
    server.aof_pipe_read_ack_from_child = fds[2];
    // Server to child ACK, indicating the server agreed to stop sending diff.
    server.aof_pipe_write_ack_to_child = fds[5];
    server.aof_pipe_read_ack_from_parent = fds[4];
    server.aof_stop_sending_diff = 0;
    return C_OK;
}
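The part elided above is where the pipes are actually opened. A minimal sketch of what that could look like with plain POSIX calls follows. The names rewrite_pipes, set_nonblock and create_rewrite_pipes are hypothetical, and making only the data pipe non-blocking is an assumption of this sketch rather than something the excerpt shows.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical container for the three pipes; index 0 of each pair is the
 * read end and index 1 is the write end, as returned by pipe(2). */
struct rewrite_pipes {
    int data[2];       /* parent -> child: AOF diff */
    int child_ack[2];  /* child -> parent: "please stop sending diff" */
    int parent_ack[2]; /* parent -> child: "stopped" */
};

/* Switch a file descriptor to non-blocking mode. */
static int set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Open all three pipes; on failure close whatever was opened.
 * Returns 0 on success, -1 on error. */
int create_rewrite_pipes(struct rewrite_pipes *p) {
    assert(p != NULL);
    int *all[3] = { p->data, p->child_ack, p->parent_ack };
    int opened = 0;

    for (; opened < 3; opened++)
        if (pipe(all[opened]) == -1) goto err;

    /* Neither side should ever block on the bulk data pipe. */
    if (set_nonblock(p->data[0]) == -1) goto err;
    if (set_nonblock(p->data[1]) == -1) goto err;
    return 0;

err:
    for (int i = 0; i < opened; i++) {
        close(all[i][0]);
        close(all[i][1]);
    }
    return -1;
}
```

With the data pipe in non-blocking mode, a read on an empty pipe returns -1 immediately instead of stalling, which is what lets a reader drain it in a simple loop.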
AOF Rewrite Buffer
AOF Server Rewrite Buffer
/* ----------------------------------------------------------------------------
* AOF rewrite buffer implementation.
*
* The following code implement a simple buffer used in order to accumulate
* changes while the background process is rewriting the AOF file.
*
* We only need to append, but can't just use realloc with a large block
* because 'huge' reallocs are not always handled as one could expect
* (via remapping of pages at OS level) but may involve copying data.
*
* For this reason we use a list of blocks, every block is
* AOF_RW_BUF_BLOCK_SIZE bytes.
* ------------------------------------------------------------------------- */
#define AOF_RW_BUF_BLOCK_SIZE (1024*1024*10) /* 10 MB per block */
AOF Child Rewrite Buffer
The AOF child-side rewrite buffer is an SDS string stored at server.aof_child_diff.
AOF Server Rewrite Buffer Append
- Iterate to the last AOF rewrite buffer block.
- Append to it if there is remaining space in the last block.
- Create a new block at the end of the list and append the remaining data to it.
- Repeat the previous step until all the data are appended.
- Register a write event on server.el via aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child, AE_WRITABLE, aofChildWriteDiffData, NULL).
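The append steps above can be sketched with a hand-rolled linked list. This is an illustration under assumptions: block, block_list and block_list_append are hypothetical names, the block size is shrunk to 8 bytes so the behaviour is easy to follow, and the event registration step is omitted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 8 /* tiny for illustration; the real blocks are 10 MB */

typedef struct block {
    size_t used, free; /* bytes filled / bytes still available */
    struct block *next;
    char buf[BLOCK_SIZE];
} block;

typedef struct {
    block *head, *tail;
    size_t nblocks;
} block_list;

/* Append len bytes: fill the tail block first, then allocate new blocks
 * for whatever does not fit. Returns 0 on success, -1 if malloc fails. */
int block_list_append(block_list *l, const char *data, size_t len) {
    assert(l != NULL && (data != NULL || len == 0));
    while (len > 0) {
        block *b = l->tail;
        if (b == NULL || b->free == 0) {
            /* No room left: link a fresh block at the end of the list. */
            b = malloc(sizeof(*b));
            if (b == NULL) return -1;
            b->used = 0;
            b->free = BLOCK_SIZE;
            b->next = NULL;
            if (l->tail) l->tail->next = b; else l->head = b;
            l->tail = b;
            l->nblocks++;
        }
        /* Copy as much as fits into the current tail block. */
        size_t n = len < b->free ? len : b->free;
        memcpy(b->buf + b->used, data, n);
        b->used += n;
        b->free -= n;
        data += n;
        len -= n;
    }
    return 0;
}
```

Because existing blocks are never reallocated or moved, an append never copies data that is already buffered, which is the motivation given in the comment of the AOF Server Rewrite Buffer section.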
AOF Server Rewrite Buffer Send
- Write every block in server.aof_rewrite_buf_blocks to fds[1] (server.aof_pipe_write_data_to_child) until write returns 0 or -1.
- Delete a block once all of its bytes are written and proceed to the next.
- Unregister the write event from server.el after all blocks are written, or when server.aof_stop_sending_diff is set.
while(1) {
    ln = listFirst(server.aof_rewrite_buf_blocks);
    block = ln ? ln->value : NULL;
    if (server.aof_stop_sending_diff || !block) {
        aeDeleteFileEvent(server.el,server.aof_pipe_write_data_to_child,
                          AE_WRITABLE);
        return;
    }
    if (block->used > 0) {
        nwritten = write(server.aof_pipe_write_data_to_child,
                         block->buf,block->used);
        if (nwritten <= 0) return;
        memmove(block->buf,block->buf+nwritten,block->used-nwritten);
        block->used -= nwritten;
        block->free += nwritten;
    }
    if (block->used == 0) listDelNode(server.aof_rewrite_buf_blocks,ln);
}
AOF Child Rewrite Buffer Read
/* This function is called by the child rewriting the AOF file to read
* the difference accumulated from the parent into a buffer, that is
* concatenated at the end of the rewrite. */
ssize_t aofReadDiffFromParent(void) {
    char buf[65536]; /* Default pipe buffer size on most Linux systems. */
    ssize_t nread, total = 0;
    while ((nread =
            read(server.aof_pipe_read_data_from_parent,buf,sizeof(buf))) > 0) {
        server.aof_child_diff = sdscatlen(server.aof_child_diff,buf,nread);
        total += nread;
    }
    return total;
}
AOF Child Rewrite Buffer Apply
int rewriteAppendOnlyFile(char *filename) {
    ... // rewrite logic, see section AOF Child Rewrite Algorithm
    // At this point the rewrite has finished; time to finish up the buffer append.
    /* Read again a few times to get more data from the parent.
     * We can't read forever (the server may receive data from clients
     * faster than it is able to send data to the child), so we try to read
     * some more data in a loop as soon as there is a good chance more data
     * will come. If it looks like we are wasting time, we abort (this
     * happens after 20 ms without new data). */
    int nodata = 0;
    mstime_t start = mstime();
    while(mstime()-start < 1000 && nodata < 20) {
        if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
        {
            nodata++;
            continue;
        }
        nodata = 0; /* Start counting from zero, we stop on N *contiguous*
                       timeouts. */
        aofReadDiffFromParent();
    }
    /* Ask the master to stop sending diffs. */
    if (write(server.aof_pipe_write_ack_to_parent,"!",1) != 1) goto werr;
    if (anetNonBlock(NULL,server.aof_pipe_read_ack_from_parent) != ANET_OK)
        goto werr;
    /* We read the ACK from the server using a 5 seconds timeout. Normally
     * it should reply ASAP, but just in case we lose its reply, we are sure
     * the child will eventually get terminated. */
    if (syncRead(server.aof_pipe_read_ack_from_parent,&byte,1,5000) != 1 ||
        byte != '!') goto werr;
    /* Read the final diff if any. */
    aofReadDiffFromParent();
    if (rioWrite(&aof,server.aof_child_diff,sdslen(server.aof_child_diff)) == 0)
        goto werr;
    ... // clean-up fds, pipes. Rename temp file
    return C_OK;
}
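The stop handshake in the function above (child writes "!", then waits for "!" with a timeout) can be sketched with plain pipes and poll(2). This is a sketch under assumptions: read_byte_timeout is a hypothetical stand-in for syncRead(), child_request_stop and parent_handle_stop are hypothetical names, and the parent side is inferred from the pipe descriptions in section AOF IPC rather than taken from the excerpt.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical stand-in for syncRead(): wait up to timeout_ms for one byte.
 * Returns 1 if a byte was read, 0 on timeout, -1 on error or EOF. */
int read_byte_timeout(int fd, char *out, int timeout_ms) {
    assert(out != NULL);
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0) return ready; /* 0: timed out, -1: poll failed */
    return read(fd, out, 1) == 1 ? 1 : -1;
}

/* Child side: ask the parent to stop sending diffs. */
int child_request_stop(int ack_to_parent) {
    return write(ack_to_parent, "!", 1) == 1 ? 0 : -1;
}

/* Parent side (assumed behaviour): if the child asked to stop, raise the
 * stop flag and acknowledge with "!". Returns 0 if the request was
 * handled, -1 if there was no valid request or the reply failed. */
int parent_handle_stop(int ack_from_child, int ack_to_child,
                       int *stop_sending_diff) {
    char byte;
    assert(stop_sending_diff != NULL);
    if (read_byte_timeout(ack_from_child, &byte, 0) != 1 || byte != '!')
        return -1;
    *stop_sending_diff = 1;
    return write(ack_to_child, "!", 1) == 1 ? 0 : -1;
}
```

The timeout on the child side matters because the child would otherwise wait forever if the acknowledgement from the parent were lost.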