Redis aof.c

AOF Format

An AOF file stores Redis commands one after another, each encoded in RESP (the Redis serialization protocol). redis.io/topics/prot…

*2\r\n$6\r\nSELECT\r\n$1\r\n0\r\n
*5\r\n$5\r\nRPUSH\r\n$7\r\nNUMBERS\r\n$3\r\nONE\r\n$3\r\nTWO\r\n$5\r\nTHREE\r\n
...

AOF Read

Since the AOF file is a sequence of commands, loading it means creating a fake client and letting it execute every command stored in the file against Redis.

Create Fake Client

Most of the fake client's settings are left null or empty because the interaction between the fake client and Redis is one-directional: the client only issues commands and nobody reads its replies.

Load AOF file

  1. Open the AOF file and check its integrity: a plain AOF file must NOT start with the 5 characters REDIS (the signature of an RDB file).
  2. Start the main while loop, which iterates over the commands stored in RESP format.
  3. Create a char buffer char buf[128].
  4. fgets one line (at most 127 chars plus the terminating NUL) into the buffer; break if nothing is read. Note that fgets stops right after the \n of \r\n.
  5. Make sure the first char of the buffer, buf[0], is *.
  6. Convert the number that follows * (buf+1) into an integer and store it in argc.
  7. Start the inner loop i over argc.
  8. fgets one line into the buffer. Again, fgets stops right after the \n of \r\n.
  9. Make sure the first char of the buffer, buf[0], is $.
  10. Convert the number that follows $ (buf+1) into an integer and record it as len, the length of argv[i].
  11. fread the next len chars and store them in argv[i]. fread is used instead of fgets because the payload may itself contain \n.
  12. fread the next 2 chars to skip the \r\n behind argv[i].
  13. After the inner loop, look up the command in the command table via lookupCommand(argv[0]->ptr).
  14. Execute the command via cmd->proc(fakeClient).
  15. Go back to step 3 and repeat.
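
The parsing part of the steps above (3 to 12) can be sketched as a small standalone function. This is only an illustration under simplifying assumptions, not the code from aof.c: read_resp_command, MAX_ARGS and MAX_ARG_LEN are hypothetical names, arguments are stored in fixed-size arrays instead of Redis objects, and command lookup and execution are left out.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 16      /* hypothetical limit for this sketch */
#define MAX_ARG_LEN 64   /* hypothetical limit for this sketch */

/* Reads one RESP-encoded command from fp into argv.
 * Returns argc on success, 0 on clean EOF, -1 on a format error. */
int read_resp_command(FILE *fp, char argv[MAX_ARGS][MAX_ARG_LEN]) {
    char buf[128];
    char crlf[2];
    int argc, i;

    assert(fp != NULL);
    if (fgets(buf, sizeof(buf), fp) == NULL) return 0;  /* nothing left */
    if (buf[0] != '*') return -1;                       /* expect "*<argc>\r\n" */
    argc = atoi(buf + 1);
    if (argc < 1 || argc > MAX_ARGS) return -1;

    for (i = 0; i < argc; i++) {
        long len;

        if (fgets(buf, sizeof(buf), fp) == NULL) return -1;
        if (buf[0] != '$') return -1;                   /* expect "$<len>\r\n" */
        len = strtol(buf + 1, NULL, 10);
        if (len < 0 || len >= MAX_ARG_LEN) return -1;

        /* The payload is read by length, so it may contain '\n' bytes. */
        if (len > 0 && fread(argv[i], (size_t)len, 1, fp) != 1) return -1;
        argv[i][len] = '\0';

        /* Consume and verify the "\r\n" that follows the payload. */
        if (fread(crlf, 2, 1, fp) != 1) return -1;
        if (memcmp(crlf, "\r\n", 2) != 0) return -1;
    }
    return argc;
}
```

Calling the function in a loop until it returns 0 walks the whole file command by command; a return value of -1 corresponds to the "bad file format" path of the real loader.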

AOF Write

Redis encodes each command in RESP format and appends it to the end of server.aof_buf. If a BGREWRITEAOF is in progress, the encoded command is also appended to the AOF rewrite buffer.

Append Generic Command

These are the commands that have nothing to do with expiration.

  1. Encode a SELECT command if the key being written lives in a db other than server.aof_selected_db, then update server.aof_selected_db.
  2. Encode the char *.
  3. Encode argc as a decimal string.
  4. Encode \r\n.
  5. Loop i over argc.
  6. Encode the char $.
  7. Encode sdslen(argv[i]) as a decimal string.
  8. Encode \r\n.
  9. Encode the value of argv[i].
  10. Encode \r\n.
  11. Go back to step 6 for the next i until all argc arguments are encoded.
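
Steps 2 to 11 can be sketched as a small encoder. This is a minimal illustration, not the Redis implementation: resp_encode is a hypothetical name, it writes into a caller-supplied buffer instead of an sds, and it uses strlen, so unlike sdslen it is not binary-safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the RESP encoding of argv[0..argc-1] into out (capacity cap).
 * Returns the number of bytes written (excluding the final NUL),
 * or -1 if out is too small. */
int resp_encode(char *out, size_t cap, int argc, const char **argv) {
    size_t used;
    int i, n;

    assert(out != NULL && argv != NULL);

    /* "*<argc>\r\n" */
    n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;

    for (i = 0; i < argc; i++) {
        size_t len = strlen(argv[i]);

        /* "$<len>\r\n" */
        n = snprintf(out + used, cap - used, "$%zu\r\n", len);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;

        /* "<payload>\r\n", keeping one byte for the final NUL */
        if (len + 2 >= cap - used) return -1;
        memcpy(out + used, argv[i], len);
        used += len;
        memcpy(out + used, "\r\n", 2);
        used += 2;
    }
    out[used] = '\0';
    return (int)used;
}
```

Encoding the two arguments SELECT and 0 with this sketch yields the first line of the AOF Format example.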

Append Expire Command

/* This command is used in order to translate EXPIRE and PEXPIRE commands
 * into PEXPIREAT command so that we retain precision in the append only
 * file, and the time is always absolute and not relative. */
  1. If the command takes its time value in seconds, convert it to milliseconds by multiplying the value by 1000.
    • EXPIRE
    • SETEX
    • EXPIREAT
  2. If the command uses a relative time value, convert it to an absolute one by adding mstime().
    • EXPIRE
    • PEXPIRE
    • SETEX
    • PSETEX
  3. Now all the aforementioned commands can be converted into the format:
    PEXPIREAT [key] [absolute time value]
    
  4. Append the converted command using the steps in section: Append Generic Command
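
The two conversions in steps 1 and 2 are plain arithmetic, as the sketch below shows. expire_to_abs_ms is a hypothetical helper written for this article; it assumes upper-case command names and takes the current time as a parameter (standing in for mstime()) so the result is deterministic.

```c
#include <assert.h>
#include <string.h>

/* Converts the time argument of an expire-style command into an absolute
 * timestamp in milliseconds, i.e. the value PEXPIREAT expects.
 * now_ms plays the role of mstime(). */
long long expire_to_abs_ms(const char *cmd, long long when, long long now_ms) {
    int in_seconds, is_relative;

    assert(cmd != NULL);

    /* Step 1: commands whose unit is seconds. */
    in_seconds = strcmp(cmd, "EXPIRE") == 0 ||
                 strcmp(cmd, "SETEX") == 0 ||
                 strcmp(cmd, "EXPIREAT") == 0;

    /* Step 2: commands whose value is relative to "now". */
    is_relative = strcmp(cmd, "EXPIRE") == 0 ||
                  strcmp(cmd, "PEXPIRE") == 0 ||
                  strcmp(cmd, "SETEX") == 0 ||
                  strcmp(cmd, "PSETEX") == 0;

    if (in_seconds) when *= 1000;
    if (is_relative) when += now_ms;
    return when;
}
```

For example, with now_ms = 1000000, EXPIRE 10 becomes 1010000 and PEXPIRE 10 becomes 1000010, while a PEXPIREAT value passes through unchanged.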

AOF Rewrite

/* This is how rewriting of the append only file in background works:
 *
 * 1) The user calls BGREWRITEAOF
 * 2) Redis calls this function, that forks():
 *    2a) the child rewrite the append only file in a temp file.
 *    2b) the parent accumulates differences in server.aof_rewrite_buf.
 * 3) When the child finished '2a' exists.
 * 4) The parent will trap the exit code, if it's OK, will append the
 *    data accumulated into server.aof_rewrite_buf into the temp file, and
 *    finally will rename(2) the temp file in the actual file name.
 *    The the new file is reopened as the new append only file. Profit!
 */

AOF Server Rewrite Algorithm

  1. Open and initialize all the pipes needed for IPC between server and child. See section AOF IPC for details.
  2. Fork a child process. See the next section for details.
  3. Record some metrics about the fork and the rewrite.
  4. Set server.aof_selected_db = -1 to make sure the rewrite buffer starts with a SELECT of the correct db.
    server.stat_fork_time = ustime()-start;
    server.stat_fork_rate = (double) zmalloc_used_memory() * 1000000 / server.stat_fork_time / (1024*1024*1024); /* GB per second. */
    server.aof_rewrite_scheduled = 0;
    server.aof_rewrite_time_start = time(NULL);
    server.aof_child_pid = childpid;
    updateDictResizePolicy();
    /* We set appendseldb to -1 in order to force the next call to the
     * feedAppendOnlyFile() to issue a SELECT command, so the differences
     * accumulated by the parent into server.aof_rewrite_buf will start
     * with a SELECT statement and it will be safe to merge. */
    server.aof_selected_db = -1;

AOF Child Rewrite Algorithm

  1. Open a temp file temp-rewriteaof-{pid}.aof.
  2. Initialize the child-side rewrite buffer server.aof_child_diff as an empty sds.
  3. Iterate over every DB of Redis.
  4. Append a SELECT command.
  5. Iterate over every key/value pair in the db.
  6. Append the correct command based on o->type and o->encoding.
  7. Keep accepting AOF diffs from the server side and append them to the end of the child-side rewrite buffer. See section AOF Child Rewrite Buffer Read for details.
  8. After all the DBs are rewritten, do an fsync() first to make the next fsync() faster.
  9. Append the data in the child rewrite buffer to the end of the temp file. See section AOF Child Rewrite Buffer Apply for details.
int rewriteAppendOnlyFileRio(rio *aof) {
    ...
    size_t processed = 0;
    for (j = 0; j < server.dbnum; j++) {
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
        redisDb *db = server.db+j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;
        di = dictGetSafeIterator(d);
        ...
        /* Iterate this DB writing every entry */
        while((de = dictNext(di)) != NULL) {
            o = dictGetVal(de);
            if (o->type == OBJ_STRING) {
                ...
            } else if (...) {
                ...
            } else {
                serverPanic("Unknown object type");
            }
            /* Read some diff from the parent process from time to time. */
            if (aof->processed_bytes > processed+AOF_READ_DIFF_INTERVAL_BYTES) {
                processed = aof->processed_bytes;
                aofReadDiffFromParent();
            }
        }
        dictReleaseIterator(di);
    }
    return C_OK;
}

AOF Rewrite Finish

The server doesn't wait synchronously for the child to finish the AOF rewrite. Instead, it periodically checks in serverCron() whether the rewrite is done and handles the result via backgroundRewriteDoneHandler().

  1. Open the temp AOF file and store the fd in newfd.
  2. Do a final append of the server rewrite buffer.
  3. Rename the temp AOF file atomically.
  4. Assign server.aof_fd, which points to the old AOF file, to oldfd.
  5. Assign newfd to server.aof_fd.
  6. Close oldfd in a background thread, so that the unlink of the old file it triggers doesn't block the server.
/* A background append only file rewriting (BGREWRITEAOF) terminated its work.
 * Handle this. */
void backgroundRewriteDoneHandler(int exitcode, int bysignal) {
    if (!bysignal && exitcode == 0) {
        ...
        //open fds, local buffers
        
        snprintf(tmpfile,256,"temp-rewriteaof-bg-%d.aof", (int)server.aof_child_pid);
        newfd = open(tmpfile,O_WRONLY|O_APPEND);
        ...
        // Flush the differences accumulated by the parent to the rewritten AOF.
        aofRewriteBufferWrite(newfd);
        ...
        //rename won't actually unlink the old AOF file here (which could block the server), because server.aof_fd still references it
        rename(tmpfile,server.aof_filename);
        
        if (server.aof_fd == -1) {
            /* AOF disabled, we don't need to set the AOF file descriptor
             * to this new file, so we can close it. */
            close(newfd);
        } else {
            /* AOF enabled, replace the old fd with the new one. */
            oldfd = server.aof_fd;
            server.aof_fd = newfd;
            ...
            /* Clear regular AOF buffer since its contents was just written to
             * the new AOF from the background rewrite buffer. */
            sdsfree(server.aof_buf);
            server.aof_buf = sdsempty();
        }
        ...
         /* Asynchronously close the overwritten AOF. */
        if (oldfd != -1) {
            bioCreateBackgroundJob(BIO_CLOSE_FILE,(void*)(long)oldfd,NULL,NULL);
        }
    } else {
        ...
        //handle failure
    }
}
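
The rename-then-swap part of the handler can be tried out in isolation. The sketch below is a simplified illustration under POSIX assumptions, not the Redis code: swap_in_rewritten_file is a hypothetical helper, the rewrite buffer flush is left out, and the old descriptor is simply handed back to the caller instead of being queued as a background job.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replaces `target` with `tmp` and makes *cur_fd point to the
 * new file. The previous descriptor is returned through *old_fd so the
 * caller can close it later (for example in another thread).
 * Returns 0 on success, -1 on error. */
int swap_in_rewritten_file(const char *tmp, const char *target,
                           int *cur_fd, int *old_fd) {
    int newfd;

    assert(tmp != NULL && target != NULL && cur_fd != NULL && old_fd != NULL);

    newfd = open(tmp, O_WRONLY | O_APPEND);
    if (newfd == -1) return -1;

    /* rename(2) only replaces the directory entry. The old file's data
     * stays alive as long as *cur_fd is open, so the costly removal is
     * deferred until that descriptor is closed. */
    if (rename(tmp, target) == -1) {
        close(newfd);
        return -1;
    }

    *old_fd = *cur_fd;
    *cur_fd = newfd;
    return 0;
}
```

After the call, writes through the new descriptor land in the renamed file, while the old descriptor still refers to the replaced file until it is closed.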

AOF IPC

The server and the child use dedicated pipes to communicate with each other. The data transmitted through the data pipe is temporarily stored in a data structure called the Rewrite Buffer.

/* Create the pipes used for parent - child process IPC during rewrite.
 * We have a data pipe used to send AOF incremental diffs to the child,
 * and two other pipes used by the children to signal it finished with
 * the rewrite so no more data should be written, and another for the
 * parent to acknowledge it understood this new condition. */
int aofCreatePipes(void) {
    int fds[6] = {-1, -1, -1, -1, -1, -1};
    ...
    //One direction for transferring AOF diff from server to child.
    server.aof_pipe_write_data_to_child = fds[1];
    server.aof_pipe_read_data_from_parent = fds[0];
    //Child to server ACK, indicating child requested server to stop sending diff.
    server.aof_pipe_write_ack_to_parent = fds[3];
    server.aof_pipe_read_ack_from_child = fds[2];
    //Server to Child ACK, indicating server agreed to stop sending diff.
    server.aof_pipe_write_ack_to_child = fds[5];
    server.aof_pipe_read_ack_from_parent = fds[4];
    server.aof_stop_sending_diff = 0;
    return C_OK;
}
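
To make the roles of the three pipes concrete, here is a minimal sketch using plain pipe(2) and fcntl(2). struct rewrite_pipes, set_nonblock and open_rewrite_pipes are hypothetical names for illustration; the real function stores the descriptors in the server struct and also registers an event handler, which is omitted here. The data pipe is made non-blocking on both ends so that neither process stalls on it.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical container mirroring the six descriptors kept in `server`. */
struct rewrite_pipes {
    int data_r, data_w;          /* parent -> child: AOF diff */
    int ack_up_r, ack_up_w;      /* child -> parent: "stop sending diffs" */
    int ack_down_r, ack_down_w;  /* parent -> child: "understood" */
};

static int set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Opens the three pipes. Returns 0 on success, -1 on error (in which case
 * every descriptor that was opened is closed again). */
int open_rewrite_pipes(struct rewrite_pipes *p) {
    int fds[6] = {-1, -1, -1, -1, -1, -1};
    int i;

    assert(p != NULL);
    if (pipe(fds) == -1) goto error;      /* data pipe */
    if (pipe(fds + 2) == -1) goto error;  /* child -> parent ack */
    if (pipe(fds + 4) == -1) goto error;  /* parent -> child ack */
    if (set_nonblock(fds[0]) == -1) goto error;
    if (set_nonblock(fds[1]) == -1) goto error;

    p->data_r = fds[0];     p->data_w = fds[1];
    p->ack_up_r = fds[2];   p->ack_up_w = fds[3];
    p->ack_down_r = fds[4]; p->ack_down_w = fds[5];
    return 0;

error:
    for (i = 0; i < 6; i++)
        if (fds[i] != -1) close(fds[i]);
    return -1;
}
```

In the real flow the pipes are created before fork(), so both processes inherit all six descriptors and each side uses only its own ends.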

AOF Rewrite Buffer

AOF Server Rewrite Buffer

/* ----------------------------------------------------------------------------
 * AOF rewrite buffer implementation.
 *
 * The following code implement a simple buffer used in order to accumulate
 * changes while the background process is rewriting the AOF file.
 *
 * We only need to append, but can't just use realloc with a large block
 * because 'huge' reallocs are not always handled as one could expect
 * (via remapping of pages at OS level) but may involve copying data.
 *
 * For this reason we use a list of blocks, every block is
 * AOF_RW_BUF_BLOCK_SIZE bytes.
 * ------------------------------------------------------------------------- */

#define AOF_RW_BUF_BLOCK_SIZE (1024*1024*10)    /* 10 MB per block */

AOF Child Rewrite Buffer

The child-side AOF rewrite buffer is an SDS string stored at server.aof_child_diff.

AOF Server Rewrite Buffer Append

  1. Go to the last AOF rewrite buffer block.
  2. Append to it if there is remaining space in the last block.
  3. For the remaining data, create a new block at the end of the list and append to it.
  4. Repeat step 3 until all the data is appended.
  5. Register a write event to server.el via aeCreateFileEvent(server.el, server.aof_pipe_write_data_to_child, AE_WRITABLE, aofChildWriteDiffData, NULL).
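
Steps 1 to 4 can be sketched with a hand-rolled singly linked list. This is an illustration only: struct block, struct block_list and block_list_append are hypothetical names, the block size is tiny on purpose (Redis uses AOF_RW_BUF_BLOCK_SIZE), and the event registration of step 5 is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 10  /* tiny on purpose, to make block boundaries visible */

struct block {
    size_t used, free;
    struct block *next;
    char buf[BLOCK_SIZE];
};

struct block_list {
    struct block *head, *tail;
    size_t count;
};

/* Appends len bytes to the list: fills the tail block first, then
 * allocates new blocks for whatever is left.
 * Returns 0 on success, -1 if an allocation fails. */
int block_list_append(struct block_list *l, const char *data, size_t len) {
    assert(l != NULL && (data != NULL || len == 0));

    while (len > 0) {
        struct block *b = l->tail;

        /* Use the remaining space of the last block, if any. */
        if (b != NULL && b->free > 0) {
            size_t n = b->free < len ? b->free : len;
            memcpy(b->buf + b->used, data, n);
            b->used += n;
            b->free -= n;
            data += n;
            len -= n;
        }

        /* Still data left: link a fresh block at the tail. */
        if (len > 0) {
            struct block *nb = malloc(sizeof(*nb));
            if (nb == NULL) return -1;
            nb->used = 0;
            nb->free = BLOCK_SIZE;
            nb->next = NULL;
            if (l->tail != NULL) l->tail->next = nb;
            else l->head = nb;
            l->tail = nb;
            l->count++;
        }
    }
    return 0;
}
```

Because existing blocks are never reallocated, appending never copies data that is already buffered, which is the point made in the source comment above about avoiding huge reallocs.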

AOF Server Rewrite Buffer Send

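  1. Write every block in server.aof_rewrite_buf_blocks to server.aof_pipe_write_data_to_child (fds[1]) until write returns a value <= 0, which means the pipe cannot accept more data for now.
  2. After a partial write, move the unwritten bytes to the front of the block; delete the block once all of its bytes are written and proceed to the next one.
  3. Unregister the write event from server.el after all blocks are written, or when the child has asked the server to stop sending diffs.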
while(1) {
    ln = listFirst(server.aof_rewrite_buf_blocks);
    block = ln ? ln->value : NULL;
    if (server.aof_stop_sending_diff || !block) {
        aeDeleteFileEvent(server.el,server.aof_pipe_write_data_to_child,
                          AE_WRITABLE);
        return;
    }
    if (block->used > 0) {
        nwritten = write(server.aof_pipe_write_data_to_child,
                         block->buf,block->used);
        if (nwritten <= 0) return;
        memmove(block->buf,block->buf+nwritten,block->used-nwritten);
        block->used -= nwritten;
        block->free += nwritten;
    }
    if (block->used == 0) listDelNode(server.aof_rewrite_buf_blocks,ln);
}

AOF Child Rewrite Buffer Read

/* This function is called by the child rewriting the AOF file to read
 * the difference accumulated from the parent into a buffer, that is
 * concatenated at the end of the rewrite. */
ssize_t aofReadDiffFromParent(void) {
    char buf[65536]; /* Default pipe buffer size on most Linux systems. */
    ssize_t nread, total = 0;

    while ((nread =
            read(server.aof_pipe_read_data_from_parent,buf,sizeof(buf))) > 0) {
        server.aof_child_diff = sdscatlen(server.aof_child_diff,buf,nread);
        total += nread;
    }
    return total;
}
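
The read loop above terminates only because the data pipe is non-blocking: once the pipe is empty, read() returns -1 (EAGAIN) instead of waiting. The sketch below reproduces that behavior outside Redis. struct diff_buf, set_nonblocking and drain_into are hypothetical names, and a realloc-grown buffer stands in for the sds.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for the sds kept in server.aof_child_diff. */
struct diff_buf {
    char *data;
    size_t len;
};

/* Puts fd into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Appends everything currently readable from a non-blocking fd to d.
 * Returns the number of bytes appended, or -1 on allocation failure. */
ssize_t drain_into(int fd, struct diff_buf *d) {
    char buf[4096];
    ssize_t nread, total = 0;

    assert(d != NULL);
    while ((nread = read(fd, buf, sizeof(buf))) > 0) {
        char *grown = realloc(d->data, d->len + (size_t)nread);
        if (grown == NULL) return -1;
        memcpy(grown + d->len, buf, (size_t)nread);
        d->data = grown;
        d->len += (size_t)nread;
        total += nread;
    }
    /* Here read() returned 0 (writer closed) or -1 (pipe empty or error). */
    return total;
}
```

Calling drain_into repeatedly accumulates successive diffs in order, and a call on an empty pipe returns 0 immediately instead of blocking.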

AOF Child Rewrite Buffer Apply

int rewriteAppendOnlyFile(char *filename) {
    ...//rewrite logic, see section AOF Child Rewrite Algorithm
    
    //At this point, rewrite has finished, time to finish up buffer append.
    /* Read again a few times to get more data from the parent.
     * We can't read forever (the server may receive data from clients
     * faster than it is able to send data to the child), so we try to read
     * some more data in a loop as soon as there is a good chance more data
     * will come. If it looks like we are wasting time, we abort (this
     * happens after 20 ms without new data). */
    int nodata = 0;
    mstime_t start = mstime();
    while(mstime()-start < 1000 && nodata < 20) {
        if (aeWait(server.aof_pipe_read_data_from_parent, AE_READABLE, 1) <= 0)
        {
            nodata++;
            continue;
        }
        nodata = 0; /* Start counting from zero, we stop on N *contiguous*
                       timeouts. */
        aofReadDiffFromParent();
    }
    
    /* Ask the master to stop sending diffs. */
    if (write(server.aof_pipe_write_ack_to_parent,"!",1) != 1) goto werr;
    if (anetNonBlock(NULL,server.aof_pipe_read_ack_from_parent) != ANET_OK)
        goto werr;
    /* We read the ACK from the server using a 5 seconds timeout. Normally
     * it should reply ASAP, but just in case we lose its reply, we are sure
     * the child will eventually get terminated. */
    if (syncRead(server.aof_pipe_read_ack_from_parent,&byte,1,5000) != 1 ||
        byte != '!') goto werr;
    
    /* Read the final diff if any. */
    aofReadDiffFromParent();
    
    if (rioWrite(&aof,server.aof_child_diff,sdslen(server.aof_child_diff)) == 0)
        goto werr;
    
    ...//clean-up fds, pipes. Rename temp file
    return C_OK;
}