The previous chapter introduced Flink's basic concepts, focusing on its optimizations for high performance. This chapter focuses on Flink's memory model and the related memory-management internals.
Flink memory hierarchy
MemorySegment
Memory segment: the most basic unit of memory allocation inside Flink, 32 KB by default. It can be either heap memory (backed by a Java byte array) or off-heap memory (backed by Netty's DirectByteBuffer), and it provides methods for reading and writing binary data.
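A minimal sketch of direct MemorySegment usage, assuming the org.apache.flink.core.memory classes and factory methods of recent Flink versions (MemorySegmentDemo is only an illustrative class name):

import org.apache.flink.core.memory.MemorySegment;
import org.apache.flink.core.memory.MemorySegmentFactory;

public class MemorySegmentDemo {
    public static void main(String[] args) {
        // heap segment backed by a Java byte[]
        MemorySegment heapSeg = MemorySegmentFactory.allocateUnpooledSegment(32 * 1024);
        heapSeg.putInt(0, 42);                  // write 4 bytes at offset 0
        heapSeg.putLong(4, 123456789L);         // write 8 bytes at offset 4
        System.out.println(heapSeg.getInt(0));  // read the values back
        System.out.println(heapSeg.getLong(4));

        // off-heap segment backed by a direct ByteBuffer
        MemorySegment offHeapSeg = MemorySegmentFactory.allocateUnpooledOffHeapMemory(32 * 1024);
        offHeapSeg.put(0, new byte[] {1, 2, 3}); // raw byte access works the same way
        offHeapSeg.free();
    }
}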
Memory page
A memory page is a data-access view on top of MemorySegments: reads are abstracted as DataInputView and writes as DataOutputView. Users of the views do not have to care about individual MemorySegments; reads and writes that cross MemorySegment boundaries are handled automatically.
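A small sketch of these views in action, assuming the DataOutputSerializer / DataInputDeserializer implementations of DataOutputView / DataInputView from org.apache.flink.core.memory (DataViewDemo is an illustrative name):

import org.apache.flink.core.memory.DataInputDeserializer;
import org.apache.flink.core.memory.DataOutputSerializer;

public class DataViewDemo {
    public static void main(String[] args) throws Exception {
        // DataOutputView: write typed values without caring about the underlying memory layout
        DataOutputSerializer out = new DataOutputSerializer(64);
        out.writeInt(7);
        out.writeUTF("flink");
        out.writeLong(System.currentTimeMillis());

        // DataInputView: read the same values back from the serialized bytes
        DataInputDeserializer in = new DataInputDeserializer(out.getCopyOfBuffer());
        int id = in.readInt();
        String name = in.readUTF();
        long timestamp = in.readLong();
        System.out.println(id + " " + name + " " + timestamp);
    }
}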
Buffer
Buffers are used when tasks (operators) exchange data over the network; they are requested and released by Flink itself, and the implementation class is NetworkBuffer. One NetworkBuffer wraps exactly one MemorySegment.
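A minimal sketch of the one-Buffer-wraps-one-MemorySegment relationship, assuming the NetworkBuffer and FreeingBufferRecycler classes from org.apache.flink.runtime.io.network.buffer (a different NetworkBuffer constructor variant is used by the copyIntoSegment code later in this chapter):

import org.apache.flink.core.memory.MemorySegment;
import org.apache.flink.core.memory.MemorySegmentFactory;
import org.apache.flink.runtime.io.network.buffer.FreeingBufferRecycler;
import org.apache.flink.runtime.io.network.buffer.NetworkBuffer;

public class NetworkBufferDemo {
    public static void main(String[] args) {
        MemorySegment segment = MemorySegmentFactory.allocateUnpooledSegment(32 * 1024);

        // one NetworkBuffer wraps exactly one MemorySegment; the recycler decides what
        // happens with the segment once the buffer's reference count drops to zero
        NetworkBuffer buffer = new NetworkBuffer(segment, FreeingBufferRecycler.INSTANCE);
        buffer.asByteBuf().writeInt(42);       // NetworkBuffer is also a Netty ByteBuf
        System.out.println(buffer.getSize());  // readable bytes in the buffer

        buffer.recycleBuffer();                // release the reference, recycling the segment
    }
}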
Buffer pool
BufferPool manages Buffers: requesting, releasing, and destroying them, and notifying about available Buffers. The implementation class is LocalBufferPool, and each task has its own LocalBufferPool.
BufferPoolFactory creates and destroys BufferPools; its only implementation is NetworkBufferPool, and there is exactly one NetworkBufferPool per TaskManager. All tasks of a TaskManager share this NetworkBufferPool, which is created, and has its memory allocated, when the TaskManager starts.
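A hedged sketch of how the two pool levels fit together; these are internal runtime classes whose constructors and method signatures differ between Flink releases, so treat the exact calls below as assumptions rather than a stable API:

import org.apache.flink.runtime.io.network.buffer.Buffer;
import org.apache.flink.runtime.io.network.buffer.BufferPool;
import org.apache.flink.runtime.io.network.buffer.NetworkBufferPool;

public class BufferPoolDemo {
    public static void main(String[] args) throws Exception {
        // one NetworkBufferPool per TaskManager: allocates all network segments up front
        NetworkBufferPool networkBufferPool = new NetworkBufferPool(1024, 32 * 1024);

        // one LocalBufferPool per task, carved out of the shared NetworkBufferPool
        BufferPool localBufferPool = networkBufferPool.createBufferPool(8, 16);

        Buffer buffer = localBufferPool.requestBuffer(); // may be null if the pool is exhausted
        if (buffer != null) {
            buffer.recycleBuffer();                      // return the segment to the pool
        }

        localBufferPool.lazyDestroy();
        networkBufferPool.destroy();
    }
}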
Memory manager
MemoryManager manages the memory Flink uses for sorting, hash tables, and caching of intermediate results, as well as for state backends that use off-heap memory (RocksDB). Since version 1.10 it mainly manages the managed memory of task slots.
Flink segment allocation flow diagram
Key classes
MemoryManager.java
public class MemoryManager {
public static final int DEFAULT_PAGE_SIZE = 32 * 1024;
/** The minimal memory page size. Currently set to 4 KiBytes. */
public static final int MIN_PAGE_SIZE = 4 * 1024;
// ------------------------------------------------------------------------
/** Memory segments allocated per memory owner. */
private final Map<Object, Set<MemorySegment>> allocatedSegments;
/** Reserved memory per memory owner. */
private final Map<Object, Long> reservedMemory;
private final long pageSize;
private final long totalNumberOfPages;
private final UnsafeMemoryBudget memoryBudget;
private final SharedResources sharedResources;
MemoryManager(long memorySize, int pageSize, int verifyEmptyWaitGcMaxSleeps) {
sanityCheck(memorySize, pageSize);
this.pageSize = pageSize;
this.memoryBudget = new UnsafeMemoryBudget(memorySize, verifyEmptyWaitGcMaxSleeps);
this.totalNumberOfPages = memorySize / pageSize;
this.allocatedSegments = new ConcurrentHashMap<>();
this.reservedMemory = new ConcurrentHashMap<>();
this.sharedResources = new SharedResources();
verifyIntTotalNumberOfPages(memorySize, totalNumberOfPages);
LOG.debug(
"Initialized MemoryManager with total memory size {} and page size {}.",
memorySize,
pageSize);
}
public List<MemorySegment> allocatePages(Object owner, int numPages)
throws MemoryAllocationException {
List<MemorySegment> segments = new ArrayList<>(numPages);
allocatePages(owner, segments, numPages);
return segments;
}
public void allocatePages(Object owner, Collection<MemorySegment> target, int numberOfPages)
throws MemoryAllocationException {
// sanity check
Preconditions.checkNotNull(owner, "The memory owner must not be null.");
Preconditions.checkState(!isShutDown, "Memory manager has been shut down.");
Preconditions.checkArgument(
numberOfPages <= totalNumberOfPages,
"Cannot allocate more segments %d than the max number %d",
numberOfPages,
totalNumberOfPages);
// reserve array space, if applicable
if (target instanceof ArrayList) {
((ArrayList<MemorySegment>) target).ensureCapacity(numberOfPages);
}
long memoryToReserve = numberOfPages * pageSize;
try {
memoryBudget.reserveMemory(memoryToReserve);
} catch (MemoryReservationException e) {
throw new MemoryAllocationException(
String.format("Could not allocate %d pages", numberOfPages), e);
}
Runnable gcCleanup = memoryBudget.getReleaseMemoryAction(getPageSize());
allocatedSegments.compute(
owner,
(o, currentSegmentsForOwner) -> {
Set<MemorySegment> segmentsForOwner =
currentSegmentsForOwner == null
? new HashSet<>(numberOfPages)
: currentSegmentsForOwner;
for (long i = numberOfPages; i > 0; i--) {
MemorySegment segment =
allocateOffHeapUnsafeMemory(getPageSize(), owner, gcCleanup);
target.add(segment);
segmentsForOwner.add(segment);
}
return segmentsForOwner;
});
Preconditions.checkState(!isShutDown, "Memory manager has been concurrently shut down.");
}
public void release(MemorySegment segment) {
Preconditions.checkState(!isShutDown, "Memory manager has been shut down.");
// check if segment is null or has already been freed
if (segment == null || segment.getOwner() == null) {
return;
}
// remove the reference in the map for the owner
try {
allocatedSegments.computeIfPresent(
segment.getOwner(),
(o, segsForOwner) -> {
segment.free();
segsForOwner.remove(segment);
return segsForOwner.isEmpty() ? null : segsForOwner;
});
} catch (Throwable t) {
throw new RuntimeException(
"Error removing book-keeping reference to allocated memory segment.", t);
}
}
/**
* Tries to release many memory segments together.
*
* <p>The segment is only freed and made eligible for reclamation by the GC. Each segment will
* be returned to the memory pool, increasing its available limit for the later allocations.
*
* @param segments The segments to be released.
*/
public void release(Collection<MemorySegment> segments) {
if (segments == null) {
return;
}
Preconditions.checkState(!isShutDown, "Memory manager has been shut down.");
boolean successfullyReleased = false;
do {
Iterator<MemorySegment> segmentsIterator = segments.iterator();
try {
MemorySegment segment = null;
while (segment == null && segmentsIterator.hasNext()) {
segment = segmentsIterator.next();
}
while (segment != null) {
segment = releaseSegmentsForOwnerUntilNextOwner(segment, segmentsIterator);
}
segments.clear();
// the only way to exit the loop
successfullyReleased = true;
} catch (ConcurrentModificationException | NoSuchElementException e) {
// this may happen in the case where an asynchronous
// call releases the memory. fall through the loop and try again
}
} while (!successfullyReleased);
}
public void releaseAll(Object owner) {
if (owner == null) {
return;
}
Preconditions.checkState(!isShutDown, "Memory manager has been shut down.");
// get all segments
Set<MemorySegment> segments = allocatedSegments.remove(owner);
// all segments may have been freed previously individually
if (segments == null || segments.isEmpty()) {
return;
}
// free each segment
for (MemorySegment segment : segments) {
segment.free();
}
segments.clear();
}
public void releaseMemory(Object owner, long size) {
checkMemoryReservationPreconditions(owner, size);
if (size == 0L) {
return;
}
reservedMemory.compute(
owner,
(o, currentlyReserved) -> {
long newReservedMemory = 0;
if (currentlyReserved != null) {
if (currentlyReserved < size) {
LOG.warn(
"Trying to release more memory {} than it was reserved {} so far for the owner {}",
size,
currentlyReserved,
owner);
}
newReservedMemory =
releaseAndCalculateReservedMemory(size, currentlyReserved);
}
return newReservedMemory == 0 ? null : newReservedMemory;
});
}
private long releaseAndCalculateReservedMemory(long memoryToFree, long currentlyReserved) {
final long effectiveMemoryToRelease = Math.min(currentlyReserved, memoryToFree);
memoryBudget.releaseMemory(effectiveMemoryToRelease);
return currentlyReserved - effectiveMemoryToRelease;
}
private void checkMemoryReservationPreconditions(Object owner, long size) {
Preconditions.checkNotNull(owner, "The memory owner must not be null.");
Preconditions.checkState(!isShutDown, "Memory manager has been shut down.");
Preconditions.checkArgument(
size >= 0L, "The memory size (%s) has to have non-negative size", size);
}
/**
* Releases all reserved memory chunks from an owner to this memory manager.
*
* @param owner The owner to associate with the memory reservation, for the fallback release.
*/
public void releaseAllMemory(Object owner) {
checkMemoryReservationPreconditions(owner, 0L);
Long memoryReservedForOwner = reservedMemory.remove(owner);
if (memoryReservedForOwner != null) {
memoryBudget.releaseMemory(memoryReservedForOwner);
}
}
}
UnsafeMemoryBudget.java
class UnsafeMemoryBudget{
private static final int MAX_SLEEPS =
11; // 2^11 - 1 = (2 x 1024) - 1 ms ~ 2 s total sleep duration
static final int MAX_SLEEPS_VERIFY_EMPTY =
17; // 2^17 - 1 = (128 x 1024) - 1 ms ~ 2 min total sleep duration
private static final int RETRIGGER_GC_AFTER_SLEEPS = 9; // ~ 0.5 sec
private final long totalMemorySize;
private final AtomicLong availableMemorySize;
private final int verifyEmptyWaitGcMaxSleeps;
UnsafeMemoryBudget(long totalMemorySize, int verifyEmptyWaitGcMaxSleeps) {
this.totalMemorySize = totalMemorySize;
this.availableMemorySize = new AtomicLong(totalMemorySize);
this.verifyEmptyWaitGcMaxSleeps = verifyEmptyWaitGcMaxSleeps;
}
long getTotalMemorySize() {
return totalMemorySize;
}
long getAvailableMemorySize() {
return availableMemorySize.get();
}
boolean verifyEmpty() {
try {
// we wait longer than during the normal reserveMemory as we have to GC all memory,
// allocated by task, to perform the verification
reserveMemory(totalMemorySize, verifyEmptyWaitGcMaxSleeps);
} catch (MemoryReservationException e) {
return false;
}
releaseMemory(totalMemorySize);
return availableMemorySize.get() == totalMemorySize;
}
/**
* Reserve memory of certain size if it is available.
*
* <p>Adjusted version of {@link java.nio.Bits#reserveMemory(long, int)} taken from Java 11.
*/
void reserveMemory(long size) throws MemoryReservationException {
reserveMemory(size, MAX_SLEEPS);
}
@SuppressWarnings({"OverlyComplexMethod", "JavadocReference", "NestedTryStatement"})
void reserveMemory(long size, int maxSleeps) throws MemoryReservationException {
long availableOrReserved = tryReserveMemory(size);
// optimist!
if (availableOrReserved >= size) {
return;
}
boolean interrupted = false;
try {
// Retry allocation until success or there are no more
// references (including Cleaners that might free direct
// buffer memory) to process and allocation still fails.
boolean refprocActive;
do {
try {
refprocActive = JavaGcCleanerWrapper.tryRunPendingCleaners();
} catch (InterruptedException e) {
// Defer interrupts and keep trying.
interrupted = true;
refprocActive = true;
}
availableOrReserved = tryReserveMemory(size);
if (availableOrReserved >= size) {
return;
}
} while (refprocActive);
// trigger VM's Reference processing
System.gc();
// A retry loop with exponential back-off delays.
// Sometimes it would suffice to give up once reference
// processing is complete. But if there are many threads
// competing for memory, this gives more opportunities for
// any given thread to make progress. In particular, this
// seems to be enough for a stress test like
// DirectBufferAllocTest to (usually) succeed, while
// without it that test likely fails. Since failure here
// ends in MemoryReservationException, there's no need to hurry.
long sleepTime = 1;
int sleeps = 0;
while (true) {
availableOrReserved = tryReserveMemory(size);
if (availableOrReserved >= size) {
return;
}
if (sleeps >= maxSleeps) {
break;
}
try {
if (!JavaGcCleanerWrapper.tryRunPendingCleaners()) {
if (sleeps >= RETRIGGER_GC_AFTER_SLEEPS) {
// trigger again VM's Reference processing if we have to wait longer
System.gc();
}
Thread.sleep(sleepTime);
sleepTime <<= 1;
sleeps++;
}
} catch (InterruptedException e) {
interrupted = true;
}
}
// no luck
throw new MemoryReservationException(
String.format(
"Could not allocate %d bytes, only %d bytes are remaining. This usually indicates "
+ "that you are requesting more memory than you have reserved. "
+ "However, when running an old JVM version it can also be caused by slow garbage collection. "
+ "Try to upgrade to Java 8u72 or higher if running on an old Java version.",
size, availableOrReserved));
} finally {
if (interrupted) {
// don't swallow interrupts
Thread.currentThread().interrupt();
}
}
}
private long tryReserveMemory(long size) {
long currentAvailableMemorySize;
while (size <= (currentAvailableMemorySize = availableMemorySize.get())) {
if (availableMemorySize.compareAndSet(
currentAvailableMemorySize, currentAvailableMemorySize - size)) {
return size;
}
}
return currentAvailableMemorySize;
}
void releaseMemory(@Nonnegative long size) {
if (size == 0) {
return;
}
boolean released = false;
long currentAvailableMemorySize = 0L;
while (!released
&& totalMemorySize
>= (currentAvailableMemorySize = availableMemorySize.get()) + size) {
released =
availableMemorySize.compareAndSet(
currentAvailableMemorySize, currentAvailableMemorySize + size);
}
if (!released) {
throw new IllegalStateException(
String.format(
"Trying to release more managed memory (%d bytes) than has been allocated (%d bytes), the total size is %d bytes",
size, currentAvailableMemorySize, totalMemorySize));
}
}
/**
* Generates an release memory action that can be performed later
*
* <p>The generated runnable could be safely referenced by possible gc cleaner action without
* worrying about cycle reference back to memory manager.
*/
Runnable getReleaseMemoryAction(@Nonnegative long size) {
return () -> {
releaseMemory(size);
};
}
}
SegmentsUtil.java
public class SegmentsUtil {
/** Constant that flags the byte order. */
public static final boolean LITTLE_ENDIAN = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
private static final int ADDRESS_BITS_PER_WORD = 3;
private static final int BIT_BYTE_INDEX_MASK = 7;
/**
* SQL execution threads is limited, not too many, so it can bear the overhead of 64K per
* thread.
*/
private static final int MAX_BYTES_LENGTH = 1024 * 64;
private static final int MAX_CHARS_LENGTH = 1024 * 32;
private static final int BYTE_ARRAY_BASE_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
private static final ThreadLocal<byte[]> BYTES_LOCAL = new ThreadLocal<>();
private static final ThreadLocal<char[]> CHARS_LOCAL = new ThreadLocal<>();
/**
* Allocate bytes that is only for temporary usage, it should not be stored in somewhere else.
* Use a {@link ThreadLocal} to reuse bytes to avoid overhead of byte[] new and gc.
*
* <p>If there are methods that can only accept a byte[], instead of a MemorySegment[]
* parameter, we can allocate a reuse bytes and copy the MemorySegment data to byte[], then call
* the method. Such as String deserialization.
*/
public static byte[] allocateReuseBytes(int length) {
byte[] bytes = BYTES_LOCAL.get();
if (bytes == null) {
if (length <= MAX_BYTES_LENGTH) {
bytes = new byte[MAX_BYTES_LENGTH];
BYTES_LOCAL.set(bytes);
} else {
bytes = new byte[length];
}
} else if (bytes.length < length) {
bytes = new byte[length];
}
return bytes;
}
public static char[] allocateReuseChars(int length) {
...
}
/**
* Copy segments to a new byte[].
*
* @param segments Source segments.
* @param offset Source segments offset.
* @param numBytes the number bytes to copy.
*/
public static byte[] copyToBytes(MemorySegment[] segments, int offset, int numBytes) {
return copyToBytes(segments, offset, new byte[numBytes], 0, numBytes);
}
/**
* Copy segments to target byte[].
*
* @param segments Source segments.
* @param offset Source segments offset.
* @param bytes target byte[].
* @param bytesOffset target byte[] offset.
* @param numBytes the number bytes to copy.
*/
public static byte[] copyToBytes(
MemorySegment[] segments, int offset, byte[] bytes, int bytesOffset, int numBytes) {
if (inFirstSegment(segments, offset, numBytes)) {
segments[0].get(offset, bytes, bytesOffset, numBytes);
} else {
copyMultiSegmentsToBytes(segments, offset, bytes, bytesOffset, numBytes);
}
return bytes;
}
public static void copyMultiSegmentsToBytes(
MemorySegment[] segments, int offset, byte[] bytes, int bytesOffset, int numBytes) {
int remainSize = numBytes;
for (MemorySegment segment : segments) {
int remain = segment.size() - offset;
if (remain > 0) {
int nCopy = Math.min(remain, remainSize);
segment.get(offset, bytes, numBytes - remainSize + bytesOffset, nCopy);
remainSize -= nCopy;
// next new segment.
offset = 0;
if (remainSize == 0) {
return;
}
} else {
// remain is negative, let's advance to next segment
// now the offset = offset - segmentSize (-remain)
offset = -remain;
}
}
}
/**
* Copy segments to target unsafe pointer.
*
* @param segments Source segments.
* @param offset The position where the bytes are started to be read from these memory segments.
* @param target The unsafe memory to copy the bytes to.
* @param pointer The position in the target unsafe memory to copy the chunk to.
* @param numBytes the number bytes to copy.
*/
public static void copyToUnsafe(
MemorySegment[] segments, int offset, Object target, int pointer, int numBytes) {
if (inFirstSegment(segments, offset, numBytes)) {
segments[0].copyToUnsafe(offset, target, pointer, numBytes);
} else {
copyMultiSegmentsToUnsafe(segments, offset, target, pointer, numBytes);
}
}
private static void copyMultiSegmentsToUnsafe(
MemorySegment[] segments, int offset, Object target, int pointer, int numBytes) {
...
}
/**
* Copy bytes of segments to output view. Note: It just copies the data in, not include the
* length.
*
* @param segments source segments
* @param offset offset for segments
* @param sizeInBytes size in bytes
* @param target target output view
*/
public static void copyToView(
MemorySegment[] segments, int offset, int sizeInBytes, DataOutputView target)
throws IOException {
...
}
/**
* Copy target segments from source byte[].
*
* @param segments target segments.
* @param offset target segments offset.
* @param bytes source byte[].
* @param bytesOffset source byte[] offset.
* @param numBytes the number bytes to copy.
*/
public static void copyFromBytes(
MemorySegment[] segments, int offset, byte[] bytes, int bytesOffset, int numBytes) {
if (segments.length == 1) {
segments[0].put(offset, bytes, bytesOffset, numBytes);
} else {
copyMultiSegmentsFromBytes(segments, offset, bytes, bytesOffset, numBytes);
}
}
private static void copyMultiSegmentsFromBytes(
MemorySegment[] segments, int offset, byte[] bytes, int bytesOffset, int numBytes) {
...
}
/** Maybe not copied, if want copy, please use copyTo. */
public static byte[] getBytes(MemorySegment[] segments, int baseOffset, int sizeInBytes) {
// avoid copy if `base` is `byte[]`
if (segments.length == 1) {
byte[] heapMemory = segments[0].getHeapMemory();
if (baseOffset == 0 && heapMemory != null && heapMemory.length == sizeInBytes) {
return heapMemory;
} else {
byte[] bytes = new byte[sizeInBytes];
segments[0].get(baseOffset, bytes, 0, sizeInBytes);
return bytes;
}
} else {
byte[] bytes = new byte[sizeInBytes];
copyMultiSegmentsToBytes(segments, baseOffset, bytes, 0, sizeInBytes);
return bytes;
}
}
/**
* Equals two memory segments regions.
*
* @param segments1 Segments 1
* @param offset1 Offset of segments1 to start equaling
* @param segments2 Segments 2
* @param offset2 Offset of segments2 to start equaling
* @param len Length of the equaled memory region
* @return true if equal, false otherwise
*/
public static boolean equals(
MemorySegment[] segments1,
int offset1,
MemorySegment[] segments2,
int offset2,
int len) {
if (inFirstSegment(segments1, offset1, len) && inFirstSegment(segments2, offset2, len)) {
return segments1[0].equalTo(segments2[0], offset1, offset2, len);
} else {
return equalsMultiSegments(segments1, offset1, segments2, offset2, len);
}
}
@VisibleForTesting
static boolean equalsMultiSegments(
MemorySegment[] segments1,
int offset1,
MemorySegment[] segments2,
int offset2,
int len) {
if (len == 0) {
// quick way and avoid segSize is zero.
return true;
}
int segSize1 = segments1[0].size();
int segSize2 = segments2[0].size();
// find first segIndex and segOffset of segments.
int segIndex1 = offset1 / segSize1;
int segIndex2 = offset2 / segSize2;
int segOffset1 = offset1 - segSize1 * segIndex1; // equal to %
int segOffset2 = offset2 - segSize2 * segIndex2; // equal to %
while (len > 0) {
int equalLen = Math.min(Math.min(len, segSize1 - segOffset1), segSize2 - segOffset2);
if (!segments1[segIndex1].equalTo(
segments2[segIndex2], segOffset1, segOffset2, equalLen)) {
return false;
}
len -= equalLen;
segOffset1 += equalLen;
if (segOffset1 == segSize1) {
segOffset1 = 0;
segIndex1++;
}
segOffset2 += equalLen;
if (segOffset2 == segSize2) {
segOffset2 = 0;
segIndex2++;
}
}
return true;
}
/**
* hash segments to int, numBytes must be aligned to 4 bytes.
*
* @param segments Source segments.
* @param offset Source segments offset.
* @param numBytes the number bytes to hash.
*/
public static int hashByWords(MemorySegment[] segments, int offset, int numBytes) {
if (inFirstSegment(segments, offset, numBytes)) {
return MurmurHashUtil.hashBytesByWords(segments[0], offset, numBytes);
} else {
return hashMultiSegByWords(segments, offset, numBytes);
}
}
private static int hashMultiSegByWords(MemorySegment[] segments, int offset, int numBytes) {
byte[] bytes = allocateReuseBytes(numBytes);
copyMultiSegmentsToBytes(segments, offset, bytes, 0, numBytes);
return MurmurHashUtil.hashUnsafeBytesByWords(bytes, BYTE_ARRAY_BASE_OFFSET, numBytes);
}
/**
* hash segments to int.
*
* @param segments Source segments.
* @param offset Source segments offset.
* @param numBytes the number bytes to hash.
*/
public static int hash(MemorySegment[] segments, int offset, int numBytes) {
if (inFirstSegment(segments, offset, numBytes)) {
return MurmurHashUtil.hashBytes(segments[0], offset, numBytes);
} else {
return hashMultiSeg(segments, offset, numBytes);
}
}
private static int hashMultiSeg(MemorySegment[] segments, int offset, int numBytes) {
byte[] bytes = allocateReuseBytes(numBytes);
copyMultiSegmentsToBytes(segments, offset, bytes, 0, numBytes);
return MurmurHashUtil.hashUnsafeBytes(bytes, BYTE_ARRAY_BASE_OFFSET, numBytes);
}
/** Is it just in first MemorySegment, we use quick way to do something. */
private static boolean inFirstSegment(MemorySegment[] segments, int offset, int numBytes) {
return numBytes + offset <= segments[0].size();
}
/**
* Given a bit index, return the byte index containing it.
*
* @param bitIndex the bit index.
* @return the byte index.
*/
private static int byteIndex(int bitIndex) {
return bitIndex >>> ADDRESS_BITS_PER_WORD;
}
/**
* unset bit.
*
* @param segment target segment.
* @param baseOffset bits base offset.
* @param index bit index from base offset.
*/
public static void bitUnSet(MemorySegment segment, int baseOffset, int index) {
int offset = baseOffset + byteIndex(index);
byte current = segment.get(offset);
current &= ~(1 << (index & BIT_BYTE_INDEX_MASK));
segment.put(offset, current);
}
/**
* set bit.
*
* @param segment target segment.
* @param baseOffset bits base offset.
* @param index bit index from base offset.
*/
public static void bitSet(MemorySegment segment, int baseOffset, int index) {
int offset = baseOffset + byteIndex(index);
byte current = segment.get(offset);
current |= (1 << (index & BIT_BYTE_INDEX_MASK));
segment.put(offset, current);
}
/**
* read bit.
*
* @param segment target segment.
* @param baseOffset bits base offset.
* @param index bit index from base offset.
*/
public static boolean bitGet(MemorySegment segment, int baseOffset, int index) {
int offset = baseOffset + byteIndex(index);
byte current = segment.get(offset);
return (current & (1 << (index & BIT_BYTE_INDEX_MASK))) != 0;
}
/**
* unset bit from segments.
*
* @param segments target segments.
* @param baseOffset bits base offset.
* @param index bit index from base offset.
*/
public static void bitUnSet(MemorySegment[] segments, int baseOffset, int index) {
if (segments.length == 1) {
MemorySegment segment = segments[0];
int offset = baseOffset + byteIndex(index);
byte current = segment.get(offset);
current &= ~(1 << (index & BIT_BYTE_INDEX_MASK));
segment.put(offset, current);
} else {
bitUnSetMultiSegments(segments, baseOffset, index);
}
}
private static void bitUnSetMultiSegments(MemorySegment[] segments, int baseOffset, int index) {
int offset = baseOffset + byteIndex(index);
int segSize = segments[0].size();
int segIndex = offset / segSize;
int segOffset = offset - segIndex * segSize; // equal to %
MemorySegment segment = segments[segIndex];
byte current = segment.get(segOffset);
current &= ~(1 << (index & BIT_BYTE_INDEX_MASK));
segment.put(segOffset, current);
}
/**
* set bit from segments.
*
* @param segments target segments.
* @param baseOffset bits base offset.
* @param index bit index from base offset.
*/
public static void bitSet(MemorySegment[] segments, int baseOffset, int index) {
if (segments.length == 1) {
int offset = baseOffset + byteIndex(index);
MemorySegment segment = segments[0];
byte current = segment.get(offset);
current |= (1 << (index & BIT_BYTE_INDEX_MASK));
segment.put(offset, current);
} else {
bitSetMultiSegments(segments, baseOffset, index);
}
}
private static void bitSetMultiSegments(MemorySegment[] segments, int baseOffset, int index) {
int offset = baseOffset + byteIndex(index);
int segSize = segments[0].size();
int segIndex = offset / segSize;
int segOffset = offset - segIndex * segSize; // equal to %
MemorySegment segment = segments[segIndex];
byte current = segment.get(segOffset);
current |= (1 << (index & BIT_BYTE_INDEX_MASK));
segment.put(segOffset, current);
}
/**
* read bit from segments.
*
* @param segments target segments.
* @param baseOffset bits base offset.
* @param index bit index from base offset.
*/
public static boolean bitGet(MemorySegment[] segments, int baseOffset, int index) {
int offset = baseOffset + byteIndex(index);
byte current = getByte(segments, offset);
return (current & (1 << (index & BIT_BYTE_INDEX_MASK))) != 0;
}
/**
* get boolean from segments.
*
* @param segments target segments.
* @param offset value offset.
*/
public static boolean getBoolean(MemorySegment[] segments, int offset) {
...
}
private static boolean getBooleanMultiSegments(MemorySegment[] segments, int offset) {
...
}
/**
* set boolean from segments.
*
* @param segments target segments.
* @param offset value offset.
*/
public static void setBoolean(MemorySegment[] segments, int offset, boolean value) {
...
}
private static void setBooleanMultiSegments(
MemorySegment[] segments, int offset, boolean value) {
...
}
/**
* get byte from segments.
*
* @param segments target segments.
* @param offset value offset.
*/
public static byte getByte(MemorySegment[] segments, int offset) {
...
}
private static byte getByteMultiSegments(MemorySegment[] segments, int offset) {
...
}
/**
* set byte from segments.
*
* @param segments target segments.
* @param offset value offset.
*/
public static void setByte(MemorySegment[] segments, int offset, byte value) {
...
}
private static void setByteMultiSegments(MemorySegment[] segments, int offset, byte value) {
...
}
/**
* get int from segments.
*
* @param segments target segments.
* @param offset value offset.
*/
public static int getInt(MemorySegment[] segments, int offset) {
...
}
private static int getIntMultiSegments(MemorySegment[] segments, int offset) {
...
}
private static int getIntSlowly(
MemorySegment[] segments, int segSize, int segNum, int segOffset) {
...
}
/**
* set int from segments.
*
* @param segments target segments.
* @param offset value offset.
*/
public static void setInt(MemorySegment[] segments, int offset, int value) {
...
}
private static void setIntMultiSegments(MemorySegment[] segments, int offset, int value) {
...
}
private static void setIntSlowly(
MemorySegment[] segments, int segSize, int segNum, int segOffset, int value) {
...
}
/**
* get long from segments.
*
* @param segments target segments.
* @param offset value offset.
*/
public static long getLong(MemorySegment[] segments, int offset) {
...
}
private static long getLongMultiSegments(MemorySegment[] segments, int offset) {
...
}
private static long getLongSlowly(
MemorySegment[] segments, int segSize, int segNum, int segOffset) {
...
}
public static void setLong(MemorySegment[] segments, int offset, long value) {
...
}
private static void setLongMultiSegments(MemorySegment[] segments, int offset, long value) {
...
}
private static void setLongSlowly(
MemorySegment[] segments, int segSize, int segNum, int segOffset, long value) {
...
}
public static short getShort(MemorySegment[] segments, int offset) {
...
}
private static short getShortMultiSegments(MemorySegment[] segments, int offset) {
...
}
public static void setShort(MemorySegment[] segments, int offset, short value) {
...
}
private static void setShortMultiSegments(MemorySegment[] segments, int offset, short value) {
...
}
public static float getFloat(MemorySegment[] segments, int offset) {
...
}
private static float getFloatMultiSegments(MemorySegment[] segments, int offset) {
...
}
public static void setFloat(MemorySegment[] segments, int offset, float value) {
...
}
private static void setFloatMultiSegments(MemorySegment[] segments, int offset, float value) {
...
}
public static double getDouble(MemorySegment[] segments, int offset) {
...
}
private static double getDoubleMultiSegments(MemorySegment[] segments, int offset) {
...
}
public static void setDouble(MemorySegment[] segments, int offset, double value) {
...
}
private static void setDoubleMultiSegments(MemorySegment[] segments, int offset, double value) {
...
}
private static int getTwoByteSlowly(
MemorySegment[] segments, int segSize, int segNum, int segOffset) {
...
}
private static void setTwoByteSlowly(
MemorySegment[] segments, int segSize, int segNum, int segOffset, int b1, int b2) {
...
}
public static int find(
MemorySegment[] segments1,
int offset1,
int numBytes1,
MemorySegment[] segments2,
int offset2,
int numBytes2) {
...
}
private static int findInMultiSegments(
...
}
}
Flink MemorySegment structure diagram
MemorySegment:
- UNSAFE: the unsafe API used to operate directly on heap and off-heap memory
- BYTE_ARRAY_BASE_OFFSET: the start offset of the data in a byte array, relative to the byte array object
- byte[] heapMemory: for a heap segment, the reference to the backing byte array; null for an off-heap segment
- long address: the (relative) start address of the backing memory; final long addressLimit: marks the end address of the segment; int size: the size of the segment in bytes
- Object owner: the owner of this segment
DataView (data views)
DataInputView.java interface definition
void skipBytesToRead(int numBytes) throws IOException;
int read(byte[] b, int off, int len) throws IOException;
int read(byte[] b) throws IOException;
DataOutputView.java interface definition
void skipBytesToWrite(int numBytes) throws IOException;
void write(DataInputView source, int numBytes) throws IOException;
AbstractPagedOutputView & AbstractPagedInputView
- Provide page-based implementations on top of the view interfaces.
- Provide data-access (input/output) views that span multiple memory pages.
- Contain the encode/decode methods for reading/writing data from pages, as well as the boundary checks when crossing page boundaries.
Field descriptions:
currentSegment: the memory segment currently being operated on
headerLength: each memory segment starts with a header that may store some metadata; data access has to skip this length, and all memory segments referenced by the view are required to have the same header length
positionInSegment: a pointer-like offset to the current position within the segment (relative to the start of the segment)
limitInSegment: a pointer-like offset marking the end position of valid data in the segment (each segment has a fixed size, 32 KB by default)
numSegments: the number of segments managed by the view
AbstractPagedInputView.java
Key abstract methods and their implementations
Abstract methods: how the view obtains the next segment
protected abstract MemorySegment nextSegment(MemorySegment current)
throws EOFException, IOException;
protected abstract int getLimitForSegment(MemorySegment segment);
protected void doAdvance() throws IOException {
// note: this code ensures that in case of EOF, we stay at the same position such that
// EOF is reproducible (if nextSegment throws a reproducible EOFException)
this.currentSegment = nextSegment(this.currentSegment);
this.limitInSegment = getLimitForSegment(this.currentSegment);
this.positionInSegment = this.headerLength; // skip the header (headerLength bytes); it stores no payload data
}
public int read(byte[] b, int off, int len) throws IOException {
if (off < 0 || len < 0 || off + len > b.length) {
throw new IndexOutOfBoundsException();
}
// bytes remaining in the current segment
int remaining = this.limitInSegment - this.positionInSegment;
if (remaining >= len) {
// enough bytes remain: read directly from the current segment
this.currentSegment.get(this.positionInSegment, b, off, len);
// advance the read position
this.positionInSegment += len;
return len;
} else {
// if nothing is left in the current segment, switch to the next one and reset positionInSegment and limitInSegment
if (remaining == 0) {
try {
advance();
} catch (EOFException eof) {
return -1;
}
// after advancing, recompute the bytes remaining in the new segment
remaining = this.limitInSegment - this.positionInSegment;
}
int bytesRead = 0;
while (true) {
// read min(remaining, len - bytesRead) bytes per iteration; there are two cases:
// case 1: remaining < len - bytesRead: the current segment is not enough, continue with the next one
// case 2: the current segment has enough left: read the final len - bytesRead bytes (bytesRead is 0 if only one segment is involved, otherwise the sum of all bytes read from the previous segments)
int toRead = Math.min(remaining, len - bytesRead);
this.currentSegment.get(this.positionInSegment, b, off, toRead);
off += toRead;
bytesRead += toRead;
if (len > bytesRead) {
// case 1: advance to the next segment
try {
advance();
} catch (EOFException eof) {
this.positionInSegment += toRead;
return bytesRead;
}
remaining = this.limitInSegment - this.positionInSegment;
} else {
// case 2: done; just advance positionInSegment within the last segment by toRead
this.positionInSegment += toRead;
break;
}
}
return len;
}
}
AbstractPagedOutputView.java
protected abstract MemorySegment nextSegment(MemorySegment current, int positionInCurrent)
throws IOException;
protected final MemorySegment nextSegment(MemorySegment current, int posInSegment)
throws IOException {
if (current != null) {
writeSegment(current, posInSegment, false);
}
final MemorySegment next = this.writer.getNextReturnedBlock();
this.blockCount++;
return next;
}
@Override
public void write(MemorySegment segment, int off, int len) throws IOException {
int remaining = this.segmentSize - this.positionInSegment;
if (remaining >= len) {
segment.copyTo(off, currentSegment, positionInSegment, len);
this.positionInSegment += len;
} else {
if (remaining == 0) {
advance();
remaining = this.segmentSize - this.positionInSegment;
}
while (true) {
int toPut = Math.min(remaining, len);
segment.copyTo(off, currentSegment, positionInSegment, toPut);
off += toPut;
len -= toPut;
if (len > 0) {
this.positionInSegment = this.segmentSize;
advance();
remaining = this.segmentSize - this.positionInSegment;
} else {
this.positionInSegment += toPut;
break;
}
}
}
}
MemoryManager
From the total amount of managed memory and the size of each memory page, the number of pages is derived; that many pages are then made available as usable memory.
private final MemoryType memoryType; // where the memory lives: on-heap or off-heap; a single MemoryManager uses only one of the two
private final long memorySize; // total amount of memory managed by this MemoryManager
private final int pageSize; // size of each memory page; 32 KB by default, 4 KB minimum
public static final int DEFAULT_PAGE_SIZE = 32 * 1024; public static final int MIN_PAGE_SIZE = 4 * 1024;
private final int totalNumPages; // total number of pages = memorySize / pageSize (e.g. 1 GiB / 32 KiB = 32,768 pages)
private final boolean isPreAllocated; // whether memory is pre-allocated; off-heap memory was typically pre-allocated in older versions
private int numNonAllocatedPages; // number of pages not yet allocated, since Flink allocates memory lazily on first use
public void allocatePages(Object owner, Collection<MemorySegment> target, int numberOfPages)
throws MemoryAllocationException {
// sanity check
Preconditions.checkNotNull(owner, "The memory owner must not be null.");
Preconditions.checkState(!isShutDown, "Memory manager has been shut down.");
Preconditions.checkArgument(
numberOfPages <= totalNumberOfPages,
"Cannot allocate more segments %d than the max number %d",
numberOfPages,
totalNumberOfPages);
// reserve array space, if applicable
if (target instanceof ArrayList) {
((ArrayList<MemorySegment>) target).ensureCapacity(numberOfPages);
}
// total number of bytes to reserve for the requested pages
long memoryToReserve = numberOfPages * pageSize;
try {
memoryBudget.reserveMemory(memoryToReserve);
} catch (MemoryReservationException e) {
throw new MemoryAllocationException(
String.format("Could not allocate %d pages", numberOfPages), e);
}
// bind the segments to their owner and perform the actual allocation
Runnable gcCleanup = memoryBudget.getReleaseMemoryAction(getPageSize());
allocatedSegments.compute(
owner,
(o, currentSegmentsForOwner) -> {
Set<MemorySegment> segmentsForOwner =
currentSegmentsForOwner == null
? new HashSet<>(numberOfPages)
: currentSegmentsForOwner;
// actually allocate the segments and add them to the target collection
for (long i = numberOfPages; i > 0; i--) {
MemorySegment segment =
allocateOffHeapUnsafeMemory(getPageSize(), owner, gcCleanup);
target.add(segment);
segmentsForOwner.add(segment);
}
return segmentsForOwner;
});
Preconditions.checkState(!isShutDown, "Memory manager has been concurrently shut down.");
}
/**
* Tries to release the memory for the specified segment.
*
* <p>If the segment has already been released, it is only freed. If it is null or has no owner,
* the request is simply ignored. The segment is only freed and made eligible for reclamation by
* the GC. The segment will be returned to the memory pool, increasing its available limit for
* the later allocations.
*
* @param segment The segment to be released.
*/
public void release(MemorySegment segment) {
Preconditions.checkState(!isShutDown, "Memory manager has been shut down.");
// check if segment is null or has already been freed
if (segment == null || segment.getOwner() == null) {
return;
}
// remove the reference in the map for the owner
try {
allocatedSegments.computeIfPresent(
segment.getOwner(),
(o, segsForOwner) -> {
segment.free();
segsForOwner.remove(segment);
return segsForOwner.isEmpty() ? null : segsForOwner;
});
} catch (Throwable t) {
throw new RuntimeException(
"Error removing book-keeping reference to allocated memory segment.", t);
}
}
MemorySegmentPool (segment cache pool)
MemorySegmentPool is a pool that requests and manages MemorySegments: it hands out segments and takes them back for caching and reuse.
private static final long PER_REQUEST_MEMORY_SIZE = 16 * 1024 * 1024; // memory requested from the MemoryManager per request, 16 MB by default
private final MemoryManager memoryManager; // the underlying memory manager
private final ArrayList<MemorySegment> cachePages; // cached free pages
private final int maxPages; // maximum number of pages this pool may manage
private int pageUsage; // number of pages currently in use
@Override
public void returnAll(List<MemorySegment> memory) {
this.pageUsage -= memory.size();
if (this.pageUsage < 0) {
throw new RuntimeException("Return too more memories.");
}
this.cachePages.addAll(memory);
}
@Override
public MemorySegment nextSegment() {
int freePages = freePages();
if (freePages == 0){
return null;
}
if (this.cachePages.isEmpty()) {
int numPages = Math.min(freePages, this.perRequestPages);
try {
this.memoryManager.allocatePages(owner, this.cachePages, numPages);
} catch (MemoryAllocationException e) {
throw new RuntimeException(e);
}
}
this.pageUsage++;
return this.cachePages.remove(this.cachePages.size() - 1);
}
public List<MemorySegment> allocateSegments(int required) {
int freePages = freePages();
if (freePages < required) {
return null;
}
List<MemorySegment> ret = new ArrayList<>(required);
for (int i = 0; i < required; i++) {
MemorySegment segment;
try {
segment = nextSegment();
Preconditions.checkNotNull(segment);
} catch (Throwable t) {
// unexpected, we should first return all temporary segments
returnAll(ret);
throw t;
}
ret.add(segment);
}
return ret;
}
ResultPartition
A ResultPartition (RP) represents the chunk of data written by a BufferWriter; one RP is a collection of ResultSubpartitions (RS).
The lifecycle of a ResultPartition has three phases: production, consumption, and release.
ResultSubpartition
Represents one partition of the data produced by an operator, shipped together with the transport logic to the receiving operator. The concrete ResultSubpartition implementation determines the actual data-transfer logic; it is designed as a pluggable mechanism so that the system can satisfy all kinds of data-transfer requirements.
InputGate: on the receiving side, the logical counterpart of an RP; it collects and processes the data in buffers arriving from upstream.
InputChannel: on the receiving side, the logical counterpart of an RS; it receives the data of one specific subpartition.
Buffer: serializers and deserializers reliably convert typed records to and from raw binary data held in buffers, handling records that span multiple buffers.
ResultSubpartitionView
Summary:
- Pooling: MemorySegments are organized into object pools, improving reuse and reducing the cost of repeated allocation and reclamation.
- Paging: segments have a uniform page size, which makes operations on them more efficient.
- Views: data-access views hide the complexity of reading and writing across segment boundaries.
Sort Shuffle read/write flow
Flink Sort Shuffle flow diagram
Core classes:
SortMergeResultPartition: the core class of the sort-based shuffle; it holds a SortBuffer instance and a PartitionedFileWriter instance.
SortBuffer sortedBuffer: the in-memory sort buffer.
PartitionedFileWriter fileWriter: the instance that actually writes the files.
PartitionSortedBuffer: the in-memory buffer structure that actually implements the sort buffer.
- ArrayList<MemorySegment> buffers: the list of segments that store the records
- long[] firstIndexEntryAddresses: the address of the first record of each subpartition
- long[] lastIndexEntryAddresses: the address of the last record of each subpartition
- int bufferSize: the size of each requested buffer
- long numTotalRecords: total number of records appended to the sort buffer
- long numTotalBytes: total number of bytes appended to the sort buffer
- long numTotalBytesRead: number of bytes already read from the sort buffer
- boolean isFinished / boolean isReleased: whether the buffer has been finished / released
- int writeSegmentIndex: index (into buffers) of the segment currently being written
- int writeSegmentOffset: write offset within the current segment
- int[] subpartitionReadOrder: the order in which subpartitions are read
- long readIndexEntryAddress: the address of the index entry currently being read
- int recordRemainingBytes: the number of unread bytes of the record currently being read
- int readOrderIndex: index into subpartitionReadOrder of the channel (subpartition) that may currently be read
- BufferPool bufferPool: the BufferPool from which buffers are requested
Three concepts are involved here:
Partition: the target subpartition (channel) a record is appended to.
Index entry: the first long (64 bits) of an index entry stores the record length in its upper 32 bits and the data type in its lower 32 bits; the second long stores the address of the next index entry of the same channel (INDEX_ENTRY_SIZE is 16 bytes in total).
Index entry address (record pointer): a long whose upper 32 bits hold the index of the segment in the buffers list and whose lower 32 bits hold the offset within that segment; the record bytes are written right after their index entry.
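A small, self-contained sketch of this bit packing, mirroring what writeIndex, getSegmentIndexFromPointer and getSegmentOffsetFromPointer below do (class and method names are illustrative):

public class IndexEntryDemo {
    // first long of an index entry: upper 32 bits = record length, lower 32 bits = data type ordinal
    static long packLengthAndType(int numRecordBytes, int dataTypeOrdinal) {
        return ((long) numRecordBytes << 32) | dataTypeOrdinal;
    }

    // index entry address: upper 32 bits = segment index in the buffers list, lower 32 bits = offset
    static long packAddress(int segmentIndex, int segmentOffset) {
        return ((long) segmentIndex << 32) | segmentOffset;
    }

    static int highBits(long value) {
        return (int) (value >>> 32);
    }

    static int lowBits(long value) {
        return (int) value;
    }

    public static void main(String[] args) {
        long entry = packLengthAndType(1000, 0);  // a 1000-byte DATA_BUFFER record
        long address = packAddress(3, 16384);     // stored in segment #3 at offset 16384

        System.out.println(highBits(entry));      // 1000  (record length)
        System.out.println(lowBits(entry));       // 0     (data type ordinal)
        System.out.println(highBits(address));    // 3     (segment index)
        System.out.println(lowBits(address));     // 16384 (offset within the segment)
    }
}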
public boolean append(ByteBuffer source, int targetChannel, DataType dataType)
throws IOException {
checkArgument(source.hasRemaining(), "Cannot append empty data.");
checkState(!isFinished, "Sort buffer is already finished.");
checkState(!isReleased, "Sort buffer is already released.");
// remaining bytes of the source ByteBuffer still to be appended, i.e. limit - current position
int totalBytes = source.remaining();
// return false directly if it can not allocate enough buffers for the given record
if (!allocateBuffersForRecord(totalBytes)) {
return false;
}
// write the index entry and record or event data
writeIndex(targetChannel, totalBytes, dataType);
writeRecord(source);
++numTotalRecords;
numTotalBytes += totalBytes;
return true;
}
private boolean allocateBuffersForRecord(int numRecordBytes) throws IOException {
// bytes actually required, with INDEX_ENTRY_SIZE for the index entry included
int numBytesRequired = INDEX_ENTRY_SIZE + numRecordBytes;
// bytes still available in the current segment; bufferSize is the fixed requested size of each segment and writeSegmentOffset is the current write offset (0 bytes available if no segment has been requested yet)
int availableBytes =
writeSegmentIndex == buffers.size() ? 0 : bufferSize - writeSegmentOffset;
// return directly if current available bytes is adequate
if (availableBytes >= numBytesRequired) {
return true;
}
// skip the remaining free space if the available bytes is not enough for an index entry
// an index entry (16 bytes) must not span two segments: if fewer than INDEX_ENTRY_SIZE bytes are left, skip the remaining space by updating writeSegmentIndex and writeSegmentOffset (writeSegmentIndex++, writeSegmentOffset = 0, i.e. position 0 of the next segment)
if (availableBytes < INDEX_ENTRY_SIZE) {
updateWriteSegmentIndexAndOffset(availableBytes);
availableBytes = 0;
}
// allocate exactly enough buffers for the appended record
// request segment buffers from the pool, one bufferSize at a time (availableBytes += bufferSize);
// if availableBytes is still smaller than numBytesRequired, keep requesting segments until there is enough space
do {
MemorySegment segment = requestBufferFromPool();
if (segment == null) {
// return false if we can not allocate enough buffers for the appended record
return false;
}
availableBytes += bufferSize;
addBuffer(segment);
} while (availableBytes < numBytesRequired);
return true;
}
private void updateWriteSegmentIndexAndOffset(int numBytes) {
// advance the write offset within the current segment
writeSegmentOffset += numBytes;
// using the next available free buffer if the current is full
// the current segment is full (offset == bufferSize): move to the next segment and reset the offset
if (writeSegmentOffset == bufferSize) {
++writeSegmentIndex;
writeSegmentOffset = 0;
}
}
private void writeIndex(int channelIndex, int numRecordBytes, Buffer.DataType dataType) {
// locate the segment to write into; writeSegmentIndex is an index into the buffers list
MemorySegment segment = buffers.get(writeSegmentIndex);
// the first long of an index entry (64 bits): upper 32 bits = record length, lower 32 bits = data type
segment.putLong(writeSegmentOffset, ((long) numRecordBytes << 32) | dataType.ordinal());
// indexEntryAddress is also a long: upper 32 bits = index of the segment in the buffers list, lower 32 bits = offset within that segment
long indexEntryAddress = ((long) writeSegmentIndex << 32) | writeSegmentOffset;
// link the new index entry behind the previous entry of this channel; if there is no previous entry, store it in firstIndexEntryAddresses, marking this record as the first one of the channel
long lastIndexEntryAddress = lastIndexEntryAddresses[channelIndex];
lastIndexEntryAddresses[channelIndex] = indexEntryAddress;
if (lastIndexEntryAddress >= 0) {
// +8 skips the length/type long of the previous entry; the address of the new entry is written into its "next" slot
segment = buffers.get(getSegmentIndexFromPointer(lastIndexEntryAddress));
segment.putLong(
getSegmentOffsetFromPointer(lastIndexEntryAddress) + 8, indexEntryAddress);
} else {
firstIndexEntryAddresses[channelIndex] = indexEntryAddress;
}
// move the write position forward so as to write the corresponding record
updateWriteSegmentIndexAndOffset(INDEX_ENTRY_SIZE);
}
// index of the segment in the buffers list
private int getSegmentIndexFromPointer(long value) {
return (int) (value >>> 32);
}
// offset within the segment
private int getSegmentOffsetFromPointer(long value) {
return (int) (value);
}
private void writeRecord(ByteBuffer source) {
while (source.hasRemaining()) {
// fetch the segment currently being written; keep writing as long as the source has remaining bytes
MemorySegment segment = buffers.get(writeSegmentIndex);
// toCopy = min(free space left in the current segment, bytes left in source); source.remaining() shrinks on every iteration
int toCopy = Math.min(bufferSize - writeSegmentOffset, source.remaining());
segment.put(writeSegmentOffset, source, toCopy);
// move the write position forward so as to write the remaining bytes or next record
updateWriteSegmentIndexAndOffset(toCopy);
}
}
source is the data to append, channel is the target subpartition, and dataType currently has two kinds: data buffers and event buffers. Reading is exactly the mirror image of writing, so it is only outlined here.
@Override
public BufferWithChannel copyIntoSegment(MemorySegment target) {
checkState(hasRemaining(), "No data remaining.");
checkState(isFinished, "Should finish the sort buffer first before coping any data.");
checkState(!isReleased, "Sort buffer is already released.");
int numBytesCopied = 0;
DataType bufferDataType = DataType.DATA_BUFFER;
int channelIndex = subpartitionReadOrder[readOrderIndex];
do {
// read the index entry (metadata) part
int sourceSegmentIndex = getSegmentIndexFromPointer(readIndexEntryAddress);
int sourceSegmentOffset = getSegmentOffsetFromPointer(readIndexEntryAddress);
MemorySegment sourceSegment = buffers.get(sourceSegmentIndex);
long lengthAndDataType = sourceSegment.getLong(sourceSegmentOffset);
int length = getSegmentIndexFromPointer(lengthAndDataType);
DataType dataType = DataType.values()[getSegmentOffsetFromPointer(lengthAndDataType)];
// return the data read directly if the next to read is an event
if (dataType.isEvent() && numBytesCopied > 0) {
break;
}
bufferDataType = dataType;
// get the next index entry address and move the read position forward
long nextReadIndexEntryAddress = sourceSegment.getLong(sourceSegmentOffset + 8);
sourceSegmentOffset += INDEX_ENTRY_SIZE;
// allocate a temp buffer for the event if the target buffer is not big enough
if (bufferDataType.isEvent() && target.size() < length) {
target = MemorySegmentFactory.allocateUnpooledSegment(length);
}
// copy the record data part
numBytesCopied +=
copyRecordOrEvent(
target,
numBytesCopied,
sourceSegmentIndex,
sourceSegmentOffset,
length);
if (recordRemainingBytes == 0) {
// move to next channel if the current channel has been finished
if (readIndexEntryAddress == lastIndexEntryAddresses[channelIndex]) {
updateReadChannelAndIndexEntryAddress();
break;
}
readIndexEntryAddress = nextReadIndexEntryAddress;
}
} while (numBytesCopied < target.size() && bufferDataType.isBuffer());
numTotalBytesRead += numBytesCopied;
Buffer buffer = new NetworkBuffer(target, (buf) -> {}, bufferDataType, numBytesCopied);
return new BufferWithChannel(buffer, channelIndex);
}
private int copyRecordOrEvent(
MemorySegment targetSegment,
int targetSegmentOffset,
int sourceSegmentIndex,
int sourceSegmentOffset,
int recordLength) {
if (recordRemainingBytes > 0) {
// skip the data already read if there is remaining partial record after the previous
// copy
long position = (long) sourceSegmentOffset + (recordLength - recordRemainingBytes);
sourceSegmentIndex += (position / bufferSize);
sourceSegmentOffset = (int) (position % bufferSize);
} else {
recordRemainingBytes = recordLength;
}
int targetSegmentSize = targetSegment.size();
int numBytesToCopy =
Math.min(targetSegmentSize - targetSegmentOffset, recordRemainingBytes);
do {
// move to next data buffer if all data of the current buffer has been copied
if (sourceSegmentOffset == bufferSize) {
++sourceSegmentIndex;
sourceSegmentOffset = 0;
}
int sourceRemainingBytes =
Math.min(bufferSize - sourceSegmentOffset, recordRemainingBytes);
int numBytes = Math.min(targetSegmentSize - targetSegmentOffset, sourceRemainingBytes);
MemorySegment sourceSegment = buffers.get(sourceSegmentIndex);
sourceSegment.copyTo(sourceSegmentOffset, targetSegment, targetSegmentOffset, numBytes);
recordRemainingBytes -= numBytes;
targetSegmentOffset += numBytes;
sourceSegmentOffset += numBytes;
} while ((recordRemainingBytes > 0 && targetSegmentOffset < targetSegmentSize));
return numBytesToCopy;
}
PartitionedFileWriter
It mainly writes the index file and the data file.
public void writeBuffer(Buffer target, int targetSubpartition) throws IOException {
checkState(!isFinished, "File writer is already finished.");
checkState(!isClosed, "File writer is already closed.");
if (targetSubpartition != currentSubpartition) {
checkState(
subpartitionBuffers[targetSubpartition] == 0,
"Must write data of the same channel together.");
subpartitionOffsets[targetSubpartition] = totalBytesWritten;
currentSubpartition = targetSubpartition;
}
totalBytesWritten += writeToByteChannel(dataFileChannel, target, writeDataCache, header);
++subpartitionBuffers[targetSubpartition];
}
private void writeIndexEntry(long subpartitionOffset, int numBuffers) throws IOException {
if (!indexBuffer.hasRemaining()) {
if (!extendIndexBufferIfPossible()) {
flushIndexBuffer();
indexBuffer.clear();
allIndexEntriesCached = false;
}
}
indexBuffer.putLong(subpartitionOffset);
indexBuffer.putInt(numBuffers);
}
static long writeToByteChannel(
FileChannel channel, Buffer buffer, ByteBuffer[] arrayWithHeaderBuffer)
throws IOException {
// write the buffer header
final ByteBuffer headerBuffer = arrayWithHeaderBuffer[0];
headerBuffer.clear();
headerBuffer.putShort(buffer.isBuffer() ? HEADER_VALUE_IS_BUFFER : HEADER_VALUE_IS_EVENT);
headerBuffer.putShort(
buffer.isCompressed() ? BUFFER_IS_COMPRESSED : BUFFER_IS_NOT_COMPRESSED);
headerBuffer.putInt(buffer.getSize());
headerBuffer.flip();
final ByteBuffer dataBuffer = buffer.getNioBufferReadable();
arrayWithHeaderBuffer[1] = dataBuffer;
final long bytesExpected = HEADER_LENGTH + dataBuffer.remaining();
// The file channel implementation guarantees that all bytes are written when invoked
// because it is a blocking channel (the implementation mentioned it as guaranteed).
// However, the api docs leaves it somewhat open, so it seems to be an undocumented contract
// in the JRE.
// We build this safety net to be on the safe side.
if (bytesExpected < channel.write(arrayWithHeaderBuffer)) {
writeBuffers(channel, arrayWithHeaderBuffer);
}
return bytesExpected;
}
static Buffer readFromByteChannel(
FileChannel channel,
ByteBuffer headerBuffer,
MemorySegment memorySegment,
BufferRecycler bufferRecycler)
throws IOException {
headerBuffer.clear();
if (!tryReadByteBuffer(channel, headerBuffer)) {
return null;
}
headerBuffer.flip();
final ByteBuffer targetBuf;
final boolean isEvent;
final boolean isCompressed;
final int size;
// read the header part
try {
isEvent = headerBuffer.getShort() == HEADER_VALUE_IS_EVENT;
isCompressed = headerBuffer.getShort() == BUFFER_IS_COMPRESSED;
size = headerBuffer.getInt();
targetBuf = memorySegment.wrap(0, size);
} catch (BufferUnderflowException | IllegalArgumentException e) {
// buffer underflow if header buffer is undersized
// IllegalArgumentException if size is outside memory segment size
throwCorruptDataException();
return null; // silence compiler
}
// read the data part
readByteBufferFully(channel, targetBuf);
Buffer.DataType dataType =
isEvent ? Buffer.DataType.EVENT_BUFFER : Buffer.DataType.DATA_BUFFER;
return new NetworkBuffer(memorySegment, bufferRecycler, dataType, isCompressed, size);
}
Flink memory model
JobManager memory model
- jobmanager.memory.process.size: total process memory of the JobManager (heap, off-heap, JVM metaspace, and JVM overhead)
TaskManager memory model
- Flink memory = heap memory + off-heap memory
- Heap memory = framework heap memory + task heap memory
- Off-heap memory = framework off-heap memory + task off-heap memory + network buffer memory + managed memory
| Component | Heap memory | Off-heap memory |
|---|---|---|
| JobManager | framework heap memory | framework off-heap memory |
| TaskManager | taskmanager.memory.framework.heap.size (heap memory used by the TaskManager framework itself); taskmanager.memory.task.heap.size (heap memory used by tasks executing user code) | taskmanager.memory.framework.off-heap.size (off-heap memory used by the TaskManager framework itself); taskmanager.memory.task.off-heap.size (off-heap memory used by tasks executing user code) |
| Network buffers | none | taskmanager.memory.network.fraction; taskmanager.memory.network.min; taskmanager.memory.network.max |
| JVM metaspace | none | taskmanager.memory.jvm-metaspace.size |
| JVM overhead | none | taskmanager.memory.jvm-overhead.min; taskmanager.memory.jvm-overhead.max; taskmanager.memory.jvm-overhead.fraction |
| Managed memory | none | taskmanager.memory.managed.fraction; taskmanager.memory.managed.size (off-heap memory used for sorting, hash tables, and caching intermediate results) |
Total memory
Total process memory:
The total memory consumed by the Flink Java application (including user code) and by the JVM running the whole process.
Total process memory = total Flink memory + JVM metaspace + JVM overhead
taskmanager.memory.process.size
Total Flink memory
The memory consumed by the Flink Java application, including user code but excluding the memory the JVM allocates to run the process.
Total Flink memory = framework heap + task heap + framework off-heap + task off-heap + network memory + managed memory
taskmanager.memory.flink.size
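A back-of-the-envelope sketch of how these components add up; the sizes below are hypothetical example values, not Flink defaults:

public class TaskManagerMemoryBreakdown {
    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // hypothetical component sizes for one TaskManager
        long frameworkHeap    = 128 * mb;  // taskmanager.memory.framework.heap.size
        long taskHeap         = 1024 * mb; // taskmanager.memory.task.heap.size
        long frameworkOffHeap = 128 * mb;  // taskmanager.memory.framework.off-heap.size
        long taskOffHeap      = 0;         // taskmanager.memory.task.off-heap.size
        long network          = 256 * mb;  // bounded by taskmanager.memory.network.{fraction,min,max}
        long managed          = 512 * mb;  // taskmanager.memory.managed.size
        long metaspace        = 256 * mb;  // taskmanager.memory.jvm-metaspace.size
        long jvmOverhead      = 192 * mb;  // bounded by taskmanager.memory.jvm-overhead.{fraction,min,max}

        // total Flink memory (taskmanager.memory.flink.size)
        long totalFlink = frameworkHeap + taskHeap + frameworkOffHeap + taskOffHeap + network + managed;
        // total process memory (taskmanager.memory.process.size)
        long totalProcess = totalFlink + metaspace + jvmOverhead;

        System.out.println("total Flink memory  : " + totalFlink / mb + " MB");   // 2048 MB
        System.out.println("total process memory: " + totalProcess / mb + " MB"); // 2496 MB
    }
}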
Having covered Flink's overall architecture and memory management, from which angles can we tune and configure the memory of a Flink cluster and its jobs?
1. Number of records and size of each record
The first step in sizing a cluster is to estimate the expected number of records entering the streaming system per second (the throughput) and the size of each record. Different record types have different sizes, and this ultimately determines the resources a Flink application needs to run smoothly. For instance, roughly 1,000,000 records per second at about 1 KB each already means on the order of 1 GB/s of sustained ingress.
2. Number of distinct keys and state size per key
The number of distinct keys in the application and the amount of state stored per key both affect the resources the Flink application needs in order to run efficiently and avoid backpressure.
3. State update frequency and access pattern of the state backend
The third consideration is how frequently state is updated, because updating state is usually an expensive operation. Different state backends (e.g. RocksDB, Java heap) also have very different access patterns: every RocksDB read and update involves serialization/deserialization plus JNI calls, while the heap-based state backend does not support incremental checkpoints, so large-state scenarios have to persist a large amount of data on every checkpoint. These factors significantly influence the cluster size and the resources a Flink job needs.
4. Network capacity
Network capacity is affected not only by the Flink application itself but also by external services it interacts with, such as Kafka or HDFS, which can generate additional network traffic. For example, enabling replication can create extra traffic between the message brokers.
5. Disk bandwidth
If your application relies on a disk-based state backend such as RocksDB, or uses Kafka or HDFS, disk bandwidth also has to be taken into account.
6. Number of machines and their available CPU and memory
Last but not least, before deploying you need to consider the number of machines available in the cluster and their CPU and memory. This is what ultimately ensures the cluster has enough processing capacity once the application goes into production.