prometheus tsdb head


head

Prometheus keeps the most recent data in an in-memory head. For each series, only a single chunk (the head chunk) is writable.

Before a sample is written to a chunk it is appended to the WAL, so in-memory data is not lost on a crash.

Once a series' head chunk holds more than 120 samples, the chunk is written to disk and accessed from then on via mmap.

When the head spans roughly 3 hours of data, the oldest 2 hours of chunks are compacted into a block on disk, complete with an index.
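A minimal sketch of that head-chunk lifecycle (toy types; `samplesPerChunk`, `series`, and `appendSample` are simplified stand-ins, not the real Prometheus code):

```go
package main

import "fmt"

// samplesPerChunk mirrors Prometheus' default of cutting a new head chunk
// roughly every 120 samples.
const samplesPerChunk = 120

type chunk struct{ samples int }

type series struct {
	closed []chunk // full chunks; the real TSDB mmaps these to disk
	head   *chunk  // the single writable head chunk
}

// appendSample adds one sample; when the head chunk is full it is moved to
// closed (mmapped in the real implementation) and a fresh head chunk is cut.
func (s *series) appendSample() {
	if s.head == nil || s.head.samples >= samplesPerChunk {
		if s.head != nil {
			s.closed = append(s.closed, *s.head)
		}
		s.head = &chunk{}
	}
	s.head.samples++
}

func main() {
	s := &series{}
	for i := 0; i < 300; i++ {
		s.appendSample()
	}
	fmt.Println(len(s.closed), s.head.samples) // 2 60
}
```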


why mmap chunks into memory

  • With mmap, only virtual address space is allocated; physical pages are used only when the chunk is actually accessed. This cut RSS by about 30%.

(The effect is most visible with low churn; with high churn, many series never accumulate enough samples to produce a full chunk.)

  • WAL replay got about 15% faster: since the chunks are already on disk, the mmapped chunks can simply be skipped during replay.

grafana.com/blog/2020/0…

head appender

When adding timestamp/value data, an appender batches the additions, then commits them (writing the WAL and the series); if an error occurs, the batch is rolled back.

Benefits of batching:

  • fewer WAL writes (one record per batch instead of per sample)
  • each commit is a natural point to check whether compaction is needed

All samples of a batch are stored in samples, and sampleSeries holds the memSeries each sample belongs to, at the same index.

On commit, the index into samples is used to find the corresponding memSeries in sampleSeries, and the sample is written into it.

The memSeries itself is looked up in stripeSeries via lset.

type headAppender struct {
	head         *Head
	minValidTime int64 // No samples below this timestamp are allowed.
	mint, maxt   int64

	series       []record.RefSeries      // New series held by this appender.
	samples      []record.RefSample      // New samples held by this appender.
	exemplars    []exemplarWithSeriesRef // New exemplars held by this appender.
	sampleSeries []*memSeries            // Series corresponding to the samples held by this appender (using corresponding slice indices - same series may appear more than once).

	appendID, cleanupAppendIDsBelow uint64
	closed                          bool
}

// RefSample is a timestamp/value pair associated with a reference to a series.
type RefSample struct {
	Ref chunks.HeadSeriesRef
	T   int64
	V   float64
}
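The batch/commit/rollback pattern described above can be sketched with a toy appender. The real headAppender writes a WAL record and touches memSeries on Commit; this stand-in only mimics that with an in-memory store:

```go
package main

import (
	"errors"
	"fmt"
)

type sample struct {
	ref uint64 // series reference
	t   int64
	v   float64
}

type store struct{ committed []sample }

// appender buffers samples so the WAL would be written once per batch
// instead of once per sample.
type appender struct {
	db      *store
	pending []sample
}

func (a *appender) Append(ref uint64, t int64, v float64) error {
	if t < 0 {
		return errors.New("sample below min valid time")
	}
	a.pending = append(a.pending, sample{ref, t, v})
	return nil
}

// Commit flushes the whole batch at once (WAL record + series writes in
// the real implementation).
func (a *appender) Commit() {
	a.db.committed = append(a.db.committed, a.pending...)
	a.pending = nil
}

// Rollback drops the batch; nothing reaches the store.
func (a *appender) Rollback() { a.pending = nil }

func main() {
	db := &store{}
	app := &appender{db: db}
	app.Append(1, 1000, 0.5)
	app.Append(1, 2000, 0.7)
	app.Commit()
	fmt.Println(len(db.committed)) // 2
}
```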

thoughts

If samples were sorted by series, a whole run of samples could be written to one series at a time. Would that be faster?
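A stable sort on Ref would give that grouping while preserving per-series timestamp order. A sketch of the idea (toy refSample type, not the real code):

```go
package main

import (
	"fmt"
	"sort"
)

type refSample struct {
	Ref uint64
	T   int64
	V   float64
}

// sortBySeries orders a batch so all samples of one series are adjacent,
// which would let commit write each memSeries in one locked run instead of
// re-locking per sample. Per-series timestamp order must be kept, which a
// stable sort on Ref preserves.
func sortBySeries(batch []refSample) {
	sort.SliceStable(batch, func(i, j int) bool { return batch[i].Ref < batch[j].Ref })
}

func main() {
	batch := []refSample{{2, 10, 1}, {1, 10, 2}, {2, 20, 3}, {1, 20, 4}}
	sortBySeries(batch)
	fmt.Println(batch) // grouped by Ref, each series still time-ordered
}
```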

memseries

For Prometheus, a series is a unique set of labels.

memSeries is the in-memory representation of one series:

  • lset, the label set identifying this series
  • mmappedChunks, former head chunks mmapped from disk via the chunkDiskMapper
  • headChunk, the chunk receiving the most recent timestamp/value data
// memSeries is the in-memory representation of a series. None of its methods
// are goroutine safe and it is the caller's responsibility to lock it.
type memSeries struct {
	sync.RWMutex

	ref  chunks.HeadSeriesRef
	lset labels.Labels

	// Immutable chunks on disk that have not yet gone into a block, in order of ascending time stamps.
	// When compaction runs, chunks get moved into a block and all pointers are shifted like so:
	//
	//                                    /------- let's say these 2 chunks get stored into a block
	//                                    |  |
	// before compaction: mmappedChunks=[p5,p6,p7,p8,p9] firstChunkID=5
	//  after compaction: mmappedChunks=[p7,p8,p9]       firstChunkID=7
	//
	// pN is the pointer to the mmappedChunk referred to by HeadChunkID=N
	mmappedChunks []*mmappedChunk

	mmMaxTime    int64     // Max time of any mmapped chunk, only used during WAL replay.
	headChunk    *memChunk // Most recent chunk in memory that's still being built.
	chunkRange   int64
	firstChunkID chunks.HeadChunkID // HeadChunkID for mmappedChunks[0]

	nextAt int64 // Timestamp at which to cut the next chunk.

	// We keep the last 4 samples here (in addition to appending them to the chunk) so we don't need coordination between appender and querier.
	// Even the most compact encoding of a sample takes 2 bits, so the last byte is not contended.
	sampleBuf [4]sample

	pendingCommit bool // Whether there are samples waiting to be committed to this series.

	// Current appender for the head chunk. Set when a new head chunk is cut.
	// It is nil only if headChunk is nil. E.g. if there was an appender that created a new series, but rolled back the commit
	// (the first sample would create a headChunk, hence appender, but rollback skipped it while the Append() call would create a series).
	app chunkenc.Appender

	memChunkPool *sync.Pool

	// txs is nil if isolation is disabled.
	txs *txRing
}

stripeseries

All memSeries are stored in stripeSeries and found via the hash of their lset.

seriesHashmap maps an lset hash to its value.

That value is a slice of memSeries, which resolves hash collisions.

Striped locks keep lock contention low.
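The stripe idea can be shown with a toy striped map: lock and bucket are both selected by hash % size, so different label sets usually contend on different locks. (Prometheus uses a power-of-two size and hash & (size-1), which is equivalent but faster.) This is a simplified stand-in, not the real stripeSeries:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// striped partitions both the locks and the buckets by hash, so two
// goroutines touching different keys rarely share a lock.
type striped struct {
	size   uint64
	locks  []sync.RWMutex
	shards []map[string]int
}

func newStriped(size uint64) *striped {
	s := &striped{
		size:   size,
		locks:  make([]sync.RWMutex, size),
		shards: make([]map[string]int, size),
	}
	for i := range s.shards {
		s.shards[i] = map[string]int{}
	}
	return s
}

func hashOf(k string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(k))
	return h.Sum64()
}

func (s *striped) set(k string, v int) {
	i := hashOf(k) % s.size // pick the stripe: lock and bucket share the index
	s.locks[i].Lock()
	s.shards[i][k] = v
	s.locks[i].Unlock()
}

func (s *striped) get(k string) (int, bool) {
	i := hashOf(k) % s.size
	s.locks[i].RLock()
	v, ok := s.shards[i][k]
	s.locks[i].RUnlock()
	return v, ok
}

func main() {
	s := newStriped(16)
	s.set(`{job="node"}`, 1)
	v, ok := s.get(`{job="node"}`)
	fmt.Println(v, ok) // 1 true
}
```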

// seriesHashmap is a simple hashmap for memSeries by their label set.
// It is built on top of a regular hashmap and holds a slice of series to
// resolve hash collisions. Its methods require the hash to be submitted
// with the label set to avoid re-computing hash throughout the code.
type seriesHashmap map[uint64][]*memSeries

// stripeSeries locks modulo ranges of IDs and hashes to reduce lock
// contention. The locks are padded to not be on the same cache line.
// Filling the padded space with the maps was profiled to be slower -
// likely due to the additional pointer dereferences.
type stripeSeries struct {
	size   int
	series []map[chunks.HeadSeriesRef]*memSeries // keyed by series ref, for direct lookup by ref
	hashes []seriesHashmap                       // keyed by lset hash; hash % size selects the stripe
	locks  []stripeLock
}

type stripeLock struct {
	sync.RWMutex
	// Padding to avoid multiple locks being on the same cache line.
	_ [40]byte
}
isolation

isolation separates reads from writes.

The idea is much like MVCC in MySQL: maintain the set of transactions currently in progress.

An append ID smaller than the smallest in-progress transaction ID has definitely committed;

one larger than the newest has definitely not committed;

one in between is committed only if it is not in the set of in-progress transactions.

  • Every headAppender is assigned an ID, and the appenders are chained in a doubly-linked list.
  • A hashmap stores all headAppenders currently in progress.
  • readsOpen tracks all reads in progress. Each read is given an isolationState, which copies the isolation's current appendsOpen map.
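The three-way visibility rule above can be written down directly. This is a toy check, not the real code; lowWatermark, lastAppendID, and openIDs stand in for the state an isolationState captures when a read starts:

```go
package main

import "fmt"

// visible reports whether a sample written by appendID can be seen by a
// reader whose isolationState captured lowWatermark (smallest open append
// ID), lastAppendID (newest ID handed out), and openIDs (appends still in
// flight) at query start.
func visible(appendID, lowWatermark, lastAppendID uint64, openIDs map[uint64]bool) bool {
	if appendID < lowWatermark {
		return true // below every open append: definitely committed
	}
	if appendID > lastAppendID {
		return false // started after the read began: definitely not committed
	}
	return !openIDs[appendID] // in range: visible only if not still open
}

func main() {
	open := map[uint64]bool{5: true, 7: true}
	fmt.Println(visible(3, 5, 8, open)) // true: committed before the read
	fmt.Println(visible(5, 5, 8, open)) // false: still in flight
	fmt.Println(visible(9, 5, 8, open)) // false: began after the read
}
```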
type isolationAppender struct {
	appendID uint64
	prev     *isolationAppender
	next     *isolationAppender
}

// isolation is the global isolation state.
type isolation struct {
	// Mutex for accessing lastAppendID and appendsOpen.
	appendMtx sync.RWMutex
	// Which appends are currently in progress.
	appendsOpen map[uint64]*isolationAppender
	// New appenders with higher appendID are added to the end. First element keeps lastAppendId.
	// appendsOpenList.next points to the first element and appendsOpenList.prev points to the last element.
	// If there are no appenders, both point back to appendsOpenList.
	appendsOpenList *isolationAppender
	// Pool of reusable *isolationAppender to save on allocations.
	appendersPool sync.Pool

	// Mutex for accessing readsOpen.
	// If taking both appendMtx and readMtx, take appendMtx first.
	readMtx sync.RWMutex
	// All current in use isolationStates. This is a doubly-linked list.
	readsOpen *isolationState
	// If true, writes are not tracked while reads are still tracked.
	disabled bool
}

// State returns an object used to control isolation
// between a query and appends. Must be closed when complete.
func (i *isolation) State(mint, maxt int64) *isolationState {
