head
Prometheus keeps an in-memory head in which each series has exactly one chunk open for writes.
Samples are appended to the WAL before they go into the chunk, so in-memory data is not lost on a crash.
Once a series' head chunk reaches 120 samples, it is written to disk and accessed via mmap.
When the head covers roughly 3 hours of data, the oldest 2 hours of chunks are compacted into an on-disk block with an index.
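The cut rule above can be sketched as follows. This is a simplified illustration, not Prometheus code: `samplesPerChunk` and `cutNewChunk` are hypothetical names, and the real head also cuts chunks on time boundaries (`nextAt`).

```go
package main

import "fmt"

// samplesPerChunk mirrors the head's default of 120 samples per chunk
// before the chunk is closed and handed off for mmapping.
const samplesPerChunk = 120

// cutNewChunk reports whether the current head chunk should be cut
// after the given number of samples (simplified sketch: the real code
// also cuts when a sample crosses the nextAt time boundary).
func cutNewChunk(numSamples int) bool {
	return numSamples >= samplesPerChunk
}

func main() {
	fmt.Println(cutNewChunk(119)) // false: chunk still open for writes
	fmt.Println(cutNewChunk(120)) // true: write to disk, access via mmap
}
```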
Why mmap chunks into memory?
- mmap only allocates virtual address space; the corresponding physical pages are used only on access, which reduced RSS by about 30%.
(The effect is most visible at low churn rates; at high churn many series never accumulate enough samples for a chunk to be created.)
- It shortens WAL replay: since the chunks are already on disk, mmapped chunks can be skipped directly, saving about 15% of replay time.
head appender
When appending k/v (timestamp/value) data, an appender batches the additions, then Commit writes the WAL and the series data; on error the batch is rolled back.
Benefits of batching:
- fewer WAL writes
- each Commit checks whether compaction is needed
All pending samples are held in samples, and sampleSeries holds the memSeries corresponding to each sample.
At Commit time, the index into samples locates the matching memSeries, and the sample is appended to it.
The memSeries for a label set is found by looking up lset in stripeSeries.
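A minimal sketch of the index correspondence described above, with hypothetical reduced types (the real Commit also writes the WAL, tracks min/max timestamps, and handles isolation):

```go
package main

import "fmt"

// refSample stands in for record.RefSample: a timestamp/value pair.
type refSample struct {
	T int64
	V float64
}

// memSeries is reduced to a plain sample log for this sketch.
type memSeries struct {
	samples []refSample
}

func (s *memSeries) append(smp refSample) { s.samples = append(s.samples, smp) }

// commit walks samples and sampleSeries in lockstep: index i in samples
// corresponds to index i in sampleSeries, so no map lookup is needed.
func commit(samples []refSample, sampleSeries []*memSeries) {
	for i, smp := range samples {
		sampleSeries[i].append(smp)
	}
}

func main() {
	a, b := &memSeries{}, &memSeries{}
	samples := []refSample{{T: 1, V: 1.5}, {T: 1, V: 2.5}, {T: 2, V: 3.5}}
	// The same series may appear more than once, as the field comment notes.
	commit(samples, []*memSeries{a, b, a})
	fmt.Println(len(a.samples), len(b.samples)) // 2 1
}
```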
type headAppender struct {
	head         *Head
	minValidTime int64 // No samples below this timestamp are allowed.
	mint, maxt   int64

	series       []record.RefSeries      // New series held by this appender.
	samples      []record.RefSample      // New samples held by this appender.
	exemplars    []exemplarWithSeriesRef // New exemplars held by this appender.
	sampleSeries []*memSeries            // Series corresponding to the samples held by this appender (using corresponding slice indices - same series may appear more than once).

	appendID, cleanupAppendIDsBelow uint64
	closed                          bool
}
// RefSample is a timestamp/value pair associated with a reference to a series.
type RefSample struct {
	Ref chunks.HeadSeriesRef
	T   int64
	V   float64
}
thoughts
If samples were sorted by series, each series could be appended to in one batch - would that be faster?
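The experiment could be sketched like this (an assumption-laden illustration, not Prometheus code: `refSample` and `sortByRef` are hypothetical names). A stable sort by Ref groups each series' samples together while preserving their time order, so the commit loop could take one lock per series instead of one per sample.

```go
package main

import (
	"fmt"
	"sort"
)

// refSample is a simplified record.RefSample with its series reference.
type refSample struct {
	Ref uint64
	T   int64
	V   float64
}

// sortByRef groups samples of the same series together. The sort is
// stable, so per-series time order from the append sequence is kept.
func sortByRef(samples []refSample) {
	sort.SliceStable(samples, func(i, j int) bool {
		return samples[i].Ref < samples[j].Ref
	})
}

func main() {
	samples := []refSample{{Ref: 2, T: 1}, {Ref: 1, T: 1}, {Ref: 2, T: 2}, {Ref: 1, T: 2}}
	sortByRef(samples)
	for _, s := range samples {
		fmt.Println(s.Ref, s.T)
	}
	// series 1's samples now precede series 2's, each still in time order
}
```

Whether the sort pays for itself would depend on batch sizes and lock contention; it would need benchmarking.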
memseries
For Prometheus, a series is identified by a unique set of labels.
memSeries is the in-memory representation of a series:
- lset: the labels identifying this series
- mmappedChunks: head chunks mmapped via the chunkDiskMapper
- headChunk: the chunk receiving the most recent k/v data
// memSeries is the in-memory representation of a series. None of its methods
// are goroutine safe and it is the caller's responsibility to lock it.
type memSeries struct {
	sync.RWMutex

	ref  chunks.HeadSeriesRef
	lset labels.Labels

	// Immutable chunks on disk that have not yet gone into a block, in order of ascending time stamps.
	// When compaction runs, chunks get moved into a block and all pointers are shifted like so:
	//
	//                                    /------- let's say these 2 chunks get stored into a block
	//                                    |  |
	// before compaction: mmappedChunks=[p5,p6,p7,p8,p9] firstChunkID=5
	//  after compaction: mmappedChunks=[p7,p8,p9]       firstChunkID=7
	//
	// pN is the pointer to the mmappedChunk referred to by HeadChunkID=N
	mmappedChunks []*mmappedChunk

	mmMaxTime int64 // Max time of any mmapped chunk, only used during WAL replay.

	headChunk    *memChunk          // Most recent chunk in memory that's still being built.
	chunkRange   int64
	firstChunkID chunks.HeadChunkID // HeadChunkID for mmappedChunks[0]

	nextAt int64 // Timestamp at which to cut the next chunk.

	// We keep the last 4 samples here (in addition to appending them to the chunk) so we don't need coordination between appender and querier.
	// Even the most compact encoding of a sample takes 2 bits, so the last byte is not contended.
	sampleBuf [4]sample

	pendingCommit bool // Whether there are samples waiting to be committed to this series.

	// Current appender for the head chunk. Set when a new head chunk is cut.
	// It is nil only if headChunk is nil. E.g. if there was an appender that created a new series, but rolled back the commit
	// (the first sample would create a headChunk, hence appender, but rollback skipped it while the Append() call would create a series).
	app chunkenc.Appender

	memChunkPool *sync.Pool

	// txs is nil if isolation is disabled.
	txs *txRing
}
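The firstChunkID shift in the comment above can be made concrete with a small sketch (`chunkIndex` is a hypothetical helper using plain ints, not the real chunkID method):

```go
package main

import "fmt"

// chunkIndex translates a HeadChunkID into an index of mmappedChunks.
// After compaction drops older chunks, firstChunkID shifts so that
// id - firstChunkID still lands on the right slot; a negative index
// means the chunk was already compacted into a block.
func chunkIndex(id, firstChunkID int) (int, bool) {
	ix := id - firstChunkID
	return ix, ix >= 0
}

func main() {
	// before compaction: mmappedChunks=[p5,p6,p7,p8,p9], firstChunkID=5
	ix, ok := chunkIndex(7, 5)
	fmt.Println(ix, ok) // 2 true
	// after compaction: mmappedChunks=[p7,p8,p9], firstChunkID=7
	ix, ok = chunkIndex(7, 7)
	fmt.Println(ix, ok) // 0 true
	// HeadChunkID 6 was moved into a block
	ix, ok = chunkIndex(6, 7)
	fmt.Println(ix, ok) // -1 false
}
```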
stripeseries
All memSeries are stored in stripeSeries; the hash of an lset locates the matching memSeries.
seriesHashmap maps an lset hash to its value.
The value is a slice of memSeries, to resolve hash collisions.
Striped (segmented) locks reduce lock contention.
// seriesHashmap is a simple hashmap for memSeries by their label set.
// It is built on top of a regular hashmap and holds a slice of series to
// resolve hash collisions. Its methods require the hash to be submitted
// with the label set to avoid re-computing hash throughout the code.
type seriesHashmap map[uint64][]*memSeries
// stripeSeries locks modulo ranges of IDs and hashes to reduce lock
// contention. The locks are padded to not be on the same cache line.
// Filling the padded space with the maps was profiled to be slower -
// likely due to the additional pointer dereferences.
type stripeSeries struct {
	size   int
	series []map[chunks.HeadSeriesRef]*memSeries // keyed by series ref, for lookups by HeadSeriesRef
	hashes []seriesHashmap                       // keyed by lset hash; hash % size picks the seriesHashmap
	locks  []stripeLock
}
type stripeLock struct {
	sync.RWMutex
	// Padding to avoid multiple locks being on the same cache line.
	_ [40]byte
}
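A reduced sketch of the striped lookup, under stated assumptions: the types are simplified (a string stands in for labels.Labels, and `getByHash`/`set` are illustrative, not the real method signatures), and size is a power of two so `hash & (size-1)` implements the modulo cheaply.

```go
package main

import (
	"fmt"
	"sync"
)

type memSeries struct{ name string }

// stripeSeries reduced to the hash side: each stripe has its own map
// and its own lock, so lookups on different stripes never contend.
type stripeSeries struct {
	size   int
	hashes []map[uint64][]*memSeries
	locks  []sync.RWMutex
}

func newStripeSeries(size int) *stripeSeries {
	s := &stripeSeries{
		size:   size,
		hashes: make([]map[uint64][]*memSeries, size),
		locks:  make([]sync.RWMutex, size),
	}
	for i := range s.hashes {
		s.hashes[i] = map[uint64][]*memSeries{}
	}
	return s
}

// getByHash locks only one stripe, then scans the collision slice.
func (s *stripeSeries) getByHash(hash uint64, name string) *memSeries {
	i := hash & uint64(s.size-1)
	s.locks[i].RLock()
	defer s.locks[i].RUnlock()
	for _, ms := range s.hashes[i][hash] {
		if ms.name == name { // stands in for labels.Equal(ms.lset, lset)
			return ms
		}
	}
	return nil
}

func (s *stripeSeries) set(hash uint64, ms *memSeries) {
	i := hash & uint64(s.size-1)
	s.locks[i].Lock()
	s.hashes[i][hash] = append(s.hashes[i][hash], ms)
	s.locks[i].Unlock()
}

func main() {
	ss := newStripeSeries(16)
	ss.set(42, &memSeries{name: `{job="node"}`})
	fmt.Println(ss.getByHash(42, `{job="node"}`).name)
}
```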
isolation
isolation provides isolation between reads and writes.
The principle resembles MySQL's MVCC: maintain the set of currently in-progress transactions.
If an append ID is smaller than the smallest in-progress transaction ID, it has already committed;
if it is greater than the newest append ID, it has definitely not committed;
if it falls in between, check whether it is in the set of in-progress transactions.
- Each headAppender is assigned an ID, and the appenders are linked into a doubly-linked list.
- A hashmap stores all in-progress headAppenders.
- readsOpen tracks all in-progress reads. For each read we allocate an isolationState, which copies the isolation's current appendsOpen map.
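The three visibility rules above can be sketched as follows. This is a simplified illustration of the idea, not the real isolationState (field and method names here are hypothetical):

```go
package main

import "fmt"

// isolationState is the snapshot taken when a read starts: the low
// watermark, the newest append ID, and a copy of appendsOpen.
type isolationState struct {
	lowWatermark uint64          // smallest in-progress append ID at snapshot time
	maxAppendID  uint64          // newest append ID issued at snapshot time
	incomplete   map[uint64]bool // copy of appendsOpen at snapshot time
}

// visible applies the three rules: below the low watermark means
// committed, above maxAppendID means definitely not committed, and
// in between the copied in-progress set decides.
func (s *isolationState) visible(appendID uint64) bool {
	if appendID < s.lowWatermark {
		return true
	}
	if appendID > s.maxAppendID {
		return false
	}
	return !s.incomplete[appendID]
}

func main() {
	st := &isolationState{lowWatermark: 5, maxAppendID: 9, incomplete: map[uint64]bool{5: true, 8: true}}
	fmt.Println(st.visible(4))  // true: committed before the snapshot
	fmt.Println(st.visible(8))  // false: still in progress when the read began
	fmt.Println(st.visible(7))  // true: within the bounds and not in progress
	fmt.Println(st.visible(10)) // false: newer than the snapshot
}
```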
type isolationAppender struct {
	appendID uint64
	prev     *isolationAppender
	next     *isolationAppender
}
// isolation is the global isolation state.
type isolation struct {
	// Mutex for accessing lastAppendID and appendsOpen.
	appendMtx sync.RWMutex
	// Which appends are currently in progress.
	appendsOpen map[uint64]*isolationAppender
	// New appenders with higher appendID are added to the end. First element keeps lastAppendId.
	// appendsOpenList.next points to the first element and appendsOpenList.prev points to the last element.
	// If there are no appenders, both point back to appendsOpenList.
	appendsOpenList *isolationAppender
	// Pool of reusable *isolationAppender to save on allocations.
	appendersPool sync.Pool

	// Mutex for accessing readsOpen.
	// If taking both appendMtx and readMtx, take appendMtx first.
	readMtx sync.RWMutex
	// All current in use isolationStates. This is a doubly-linked list.
	readsOpen *isolationState

	// If true, writes are not tracked while reads are still tracked.
	disabled bool
}
// State returns an object used to control isolation
// between a query and appends. Must be closed when complete.
func (i *isolation) State(mint, maxt int64) *isolationState {