prometheus tsdb head


head

Prometheus keeps the most recent data in an in-memory head. For each series, only a single chunk (the head chunk) is writable.

Before a sample is written to a chunk it is appended to the WAL, so in-memory data is not lost on a crash.

Once a series' head chunk holds more than 120 samples, the chunk is written to disk and accessed from then on via mmap.

When the head spans roughly 3 hours of data, the oldest 2 hours of chunks are compacted into a block on disk, complete with an index.
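A minimal sketch of that head-chunk lifecycle (toy types; `samplesPerChunk`, `series`, and `appendSample` are simplified stand-ins, not the real Prometheus code):

```go
package main

import "fmt"

// samplesPerChunk mirrors Prometheus' default of cutting a new head chunk
// roughly every 120 samples.
const samplesPerChunk = 120

type chunk struct{ samples int }

type series struct {
	closed []chunk // full chunks; the real TSDB mmaps these to disk
	head   *chunk  // the single writable head chunk
}

// appendSample adds one sample; when the head chunk is full it is moved to
// closed (mmapped in the real implementation) and a fresh head chunk is cut.
func (s *series) appendSample() {
	if s.head == nil || s.head.samples >= samplesPerChunk {
		if s.head != nil {
			s.closed = append(s.closed, *s.head)
		}
		s.head = &chunk{}
	}
	s.head.samples++
}

func main() {
	s := &series{}
	for i := 0; i < 300; i++ {
		s.appendSample()
	}
	fmt.Println(len(s.closed), s.head.samples) // 2 60
}
```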


why mmap chunks into memory

  • With mmap, only virtual address space is allocated; physical pages are used only when the chunk is actually accessed. This cut RSS by about 30%.

(The effect is most visible with low churn; with high churn, many series never accumulate enough samples to produce a full chunk.)

  • WAL replay got about 15% faster: since the chunks are already on disk, the mmapped chunks can simply be skipped during replay.

grafana.com/blog/2020/0…

head appender

When adding timestamp/value data, an appender batches the additions, then commits them (writing the WAL and the series); if an error occurs, the batch is rolled back.

Benefits of batching:

  • fewer WAL writes (one record per batch instead of per sample)
  • each commit is a natural point to check whether compaction is needed

All samples of a batch are stored in samples, and sampleSeries holds the memSeries each sample belongs to, at the same index.

On commit, the index into samples is used to find the corresponding memSeries in sampleSeries, and the sample is written into it.

The memSeries itself is looked up in stripeSeries via lset.

type headAppender struct {
	head         *Head
	minValidTime int64 // No samples below this timestamp are allowed.
	mint, maxt   int64

	series       []record.RefSeries      // New series held by this appender.
	samples      []record.RefSample      // New samples held by this appender.
	exemplars    []exemplarWithSeriesRef // New exemplars held by this appender.
	sampleSeries []*memSeries            // Series corresponding to the samples held by this appender (using corresponding slice indices - same series may appear more than once).

	appendID, cleanupAppendIDsBelow uint64
	closed                          bool
}

// RefSample is a timestamp/value pair associated with a reference to a series.
type RefSample struct {
	Ref chunks.HeadSeriesRef
	T   int64
	V   float64
}
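The batch/commit/rollback pattern described above can be sketched with a toy appender. The real headAppender writes a WAL record and touches memSeries on Commit; this stand-in only mimics that with an in-memory store:

```go
package main

import (
	"errors"
	"fmt"
)

type sample struct {
	ref uint64 // series reference
	t   int64
	v   float64
}

type store struct{ committed []sample }

// appender buffers samples so the WAL would be written once per batch
// instead of once per sample.
type appender struct {
	db      *store
	pending []sample
}

func (a *appender) Append(ref uint64, t int64, v float64) error {
	if t < 0 {
		return errors.New("sample below min valid time")
	}
	a.pending = append(a.pending, sample{ref, t, v})
	return nil
}

// Commit flushes the whole batch at once (WAL record + series writes in
// the real implementation).
func (a *appender) Commit() {
	a.db.committed = append(a.db.committed, a.pending...)
	a.pending = nil
}

// Rollback drops the batch; nothing reaches the store.
func (a *appender) Rollback() { a.pending = nil }

func main() {
	db := &store{}
	app := &appender{db: db}
	app.Append(1, 1000, 0.5)
	app.Append(1, 2000, 0.7)
	app.Commit()
	fmt.Println(len(db.committed)) // 2
}
```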

thoughts

If samples were sorted by series, a whole run of samples could be written to one series at a time. Would that be faster?
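A stable sort on Ref would give that grouping while preserving per-series timestamp order. A sketch of the idea (toy refSample type, not the real code):

```go
package main

import (
	"fmt"
	"sort"
)

type refSample struct {
	Ref uint64
	T   int64
	V   float64
}

// sortBySeries orders a batch so all samples of one series are adjacent,
// which would let commit write each memSeries in one locked run instead of
// re-locking per sample. Per-series timestamp order must be kept, which a
// stable sort on Ref preserves.
func sortBySeries(batch []refSample) {
	sort.SliceStable(batch, func(i, j int) bool { return batch[i].Ref < batch[j].Ref })
}

func main() {
	batch := []refSample{{2, 10, 1}, {1, 10, 2}, {2, 20, 3}, {1, 20, 4}}
	sortBySeries(batch)
	fmt.Println(batch) // grouped by Ref, each series still time-ordered
}
```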

memseries

For Prometheus, a series is a unique set of labels.

memSeries is the in-memory representation of one series:

  • lset, the label set identifying this series
  • mmappedChunks, former head chunks mmapped from disk via the chunkDiskMapper
  • headChunk, the chunk receiving the most recent timestamp/value data
// memSeries is the in-memory representation of a series. None of its methods
// are goroutine safe and it is the caller's responsibility to lock it.
type memSeries struct {
	sync.RWMutex

	ref  chunks.HeadSeriesRef
	lset labels.Labels

	// Immutable chunks on disk that have not yet gone into a block, in order of ascending time stamps.
	// When compaction runs, chunks get moved into a block and all pointers are shifted like so:
	//
	//                                    /------- let's say these 2 chunks get stored into a block
	//                                    |  |
	// before compaction: mmappedChunks=[p5,p6,p7,p8,p9] firstChunkID=5
	//  after compaction: mmappedChunks=[p7,p8,p9]       firstChunkID=7
	//
	// pN is the pointer to the mmappedChunk referred to by HeadChunkID=N
	mmappedChunks []*mmappedChunk

	mmMaxTime    int64     // Max time of any mmapped chunk, only used during WAL replay.
	headChunk    *memChunk // Most recent chunk in memory that's still being built.
	chunkRange   int64
	firstChunkID chunks.HeadChunkID // HeadChunkID for mmappedChunks[0]

	nextAt int64 // Timestamp at which to cut the next chunk.

	// We keep the last 4 samples here (in addition to appending them to the chunk) so we don't need coordination between appender and querier.
	// Even the most compact encoding of a sample takes 2 bits, so the last byte is not contended.
	sampleBuf [4]sample

	pendingCommit bool // Whether there are samples waiting to be committed to this series.

	// Current appender for the head chunk. Set when a new head chunk is cut.
	// It is nil only if headChunk is nil. E.g. if there was an appender that created a new series, but rolled back the commit
	// (the first sample would create a headChunk, hence appender, but rollback skipped it while the Append() call would create a series).
	app chunkenc.Appender

	memChunkPool *sync.Pool

	// txs is nil if isolation is disabled.
	txs *txRing
}

stripeseries

All memSeries are stored in stripeSeries and found via the hash of their lset.

seriesHashmap maps an lset hash to its value.

That value is a slice of memSeries, which resolves hash collisions.

Striped locks keep lock contention low.
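The stripe idea can be shown with a toy striped map: lock and bucket are both selected by hash % size, so different label sets usually contend on different locks. (Prometheus uses a power-of-two size and hash & (size-1), which is equivalent but faster.) This is a simplified stand-in, not the real stripeSeries:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// striped partitions both the locks and the buckets by hash, so two
// goroutines touching different keys rarely share a lock.
type striped struct {
	size   uint64
	locks  []sync.RWMutex
	shards []map[string]int
}

func newStriped(size uint64) *striped {
	s := &striped{
		size:   size,
		locks:  make([]sync.RWMutex, size),
		shards: make([]map[string]int, size),
	}
	for i := range s.shards {
		s.shards[i] = map[string]int{}
	}
	return s
}

func hashOf(k string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(k))
	return h.Sum64()
}

func (s *striped) set(k string, v int) {
	i := hashOf(k) % s.size // pick the stripe: lock and bucket share the index
	s.locks[i].Lock()
	s.shards[i][k] = v
	s.locks[i].Unlock()
}

func (s *striped) get(k string) (int, bool) {
	i := hashOf(k) % s.size
	s.locks[i].RLock()
	v, ok := s.shards[i][k]
	s.locks[i].RUnlock()
	return v, ok
}

func main() {
	s := newStriped(16)
	s.set(`{job="node"}`, 1)
	v, ok := s.get(`{job="node"}`)
	fmt.Println(v, ok) // 1 true
}
```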

// seriesHashmap is a simple hashmap for memSeries by their label set.
// It is built on top of a regular hashmap and holds a slice of series to
// resolve hash collisions. Its methods require the hash to be submitted
// with the label set to avoid re-computing hash throughout the code.
type seriesHashmap map[uint64][]*memSeries

// stripeSeries locks modulo ranges of IDs and hashes to reduce lock
// contention. The locks are padded to not be on the same cache line.
// Filling the padded space with the maps was profiled to be slower -
// likely due to the additional pointer dereferences.
type stripeSeries struct {
	size   int
	series []map[chunks.HeadSeriesRef]*memSeries // keyed by series ref, for direct lookup by ref
	hashes []seriesHashmap                       // keyed by lset hash; hash % size selects the stripe
	locks  []stripeLock
}

type stripeLock struct {
	sync.RWMutex
	// Padding to avoid multiple locks being on the same cache line.
	_ [40]byte
}
isolation

isolation separates reads from writes.

The idea is much like MVCC in MySQL: maintain the set of transactions currently in progress.

An append ID smaller than the smallest in-progress transaction ID has definitely committed;

one larger than the newest has definitely not committed;

one in between is committed only if it is not in the set of in-progress transactions.

  • Every headAppender is assigned an ID, and the appenders are chained in a doubly-linked list.
  • A hashmap stores all headAppenders currently in progress.
  • readsOpen tracks all reads in progress. Each read is given an isolationState, which copies the isolation's current appendsOpen map.
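The three-way visibility rule above can be written down directly. This is a toy check, not the real code; lowWatermark, lastAppendID, and openIDs stand in for the state an isolationState captures when a read starts:

```go
package main

import "fmt"

// visible reports whether a sample written by appendID can be seen by a
// reader whose isolationState captured lowWatermark (smallest open append
// ID), lastAppendID (newest ID handed out), and openIDs (appends still in
// flight) at query start.
func visible(appendID, lowWatermark, lastAppendID uint64, openIDs map[uint64]bool) bool {
	if appendID < lowWatermark {
		return true // below every open append: definitely committed
	}
	if appendID > lastAppendID {
		return false // started after the read began: definitely not committed
	}
	return !openIDs[appendID] // in range: visible only if not still open
}

func main() {
	open := map[uint64]bool{5: true, 7: true}
	fmt.Println(visible(3, 5, 8, open)) // true: committed before the read
	fmt.Println(visible(5, 5, 8, open)) // false: still in flight
	fmt.Println(visible(9, 5, 8, open)) // false: began after the read
}
```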
type isolationAppender struct {
	appendID uint64
	prev     *isolationAppender
	next     *isolationAppender
}

// isolation is the global isolation state.
type isolation struct {
	// Mutex for accessing lastAppendID and appendsOpen.
	appendMtx sync.RWMutex
	// Which appends are currently in progress.
	appendsOpen map[uint64]*isolationAppender
	// New appenders with higher appendID are added to the end. First element keeps lastAppendId.
	// appendsOpenList.next points to the first element and appendsOpenList.prev points to the last element.
	// If there are no appenders, both point back to appendsOpenList.
	appendsOpenList *isolationAppender
	// Pool of reusable *isolationAppender to save on allocations.
	appendersPool sync.Pool

	// Mutex for accessing readsOpen.
	// If taking both appendMtx and readMtx, take appendMtx first.
	readMtx sync.RWMutex
	// All current in use isolationStates. This is a doubly-linked list.
	readsOpen *isolationState
	// If true, writes are not tracked while reads are still tracked.
	disabled bool
}

// State returns an object used to control isolation
// between a query and appends. Must be closed when complete.
func (i *isolation) State(mint, maxt int64) *isolationState {
