MySQL InnoDB Dirty Page Flushing


Types of Dirty Page Flushing

  • Batch flushing
  • Single-page flushing

Batch flushing is initiated by background threads, while single-page flushing is initiated by user threads. Within batch flushing, synchronous flushing is triggered by the redo log running out of free space; single-page flushing is triggered by the buffer pool having no free page.

Batch Flushing

Batch flushing scenarios

The main loop of the buf_flush_page_cleaner_coordinator thread wakes up at most every 1 second, or as soon as the buf_flush_event event is signaled, and then runs one round of flushing. There are three main batch-flushing scenarios.

  1. Synchronous flushing. If buf_flush_sync_lsn > 0, the redo log has run out of free space and we must enter synchronous flushing. In this scenario every user thread that needs to write to the database is blocked, which is a serious situation.

  2. Normal flushing. The most common path: srv_check_activity(last_activity) reports normal system activity (DML/DDL). In this case page_cleaner_flush_pages_recommendation() estimates a sensible number of pages to flush, enough to keep up with the flushing demand without causing jitter.

  3. Idle flushing. If there is no DML/DDL activity and ret_sleep == OS_SYNC_TIME_EXCEEDED, the system is fairly idle. Since the server's IO is idle too, InnoDB lets the buf_flush_page_cleaner_coordinator thread itself do the flushing, and the number of pages to flush is simply the innodb_io_capacity setting.

/******************************************************************//**
page_cleaner thread tasked with flushing dirty pages from the buffer
pools. As of now we'll have only one coordinator.
@return a dummy parameter */
extern "C"
os_thread_ret_t
DECLARE_THREAD(buf_flush_page_cleaner_coordinator)(
/*===============================================*/
	void*	arg MY_ATTRIBUTE((unused)))
			/*!< in: a dummy parameter required by
			os_thread_create */
{
	/* some logic omitted */
	while (srv_shutdown_state == SRV_SHUTDOWN_NONE) {
		if (ret_sleep != OS_SYNC_TIME_EXCEEDED
		    && srv_flush_sync
		    && buf_flush_sync_lsn > 0) {
			/* scenario 1: synchronous flushing */
		} else if (srv_check_activity(last_activity)) {
			ulint	n_to_flush;
			lsn_t	lsn_limit = 0;

			/* Estimate pages from flush_list to be flushed */
			if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {
				last_activity = srv_get_activity_count();
				n_to_flush =
					page_cleaner_flush_pages_recommendation(
						&lsn_limit, last_pages);
			} else {
				n_to_flush = 0;
			}
			/* scenario 2: normal flushing */
		} else if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {
			/* scenario 3: idle flushing */
			/* no activity, slept enough */
		} else {
			/* no activity, but woken up by event */
			n_flushed = 0;
		}
	}
}

What Each Scenario Does

Synchronous flushing: pc_request(ULINT_MAX, lsn_limit) requests that all pages with an LSN below lsn_limit be flushed to disk, and the coordinator thread itself takes part in the flushing as well.

			/* woke up for flush_sync */
			mutex_enter(&page_cleaner->mutex);
			lsn_t	lsn_limit = buf_flush_sync_lsn;
			buf_flush_sync_lsn = 0;
			mutex_exit(&page_cleaner->mutex);

			/* Request flushing for threads */
			pc_request(ULINT_MAX, lsn_limit);

			ib_time_monotonic_ms_t tm = ut_time_monotonic_ms();

			/* Coordinator also treats requests */
			while (pc_flush_slot() > 0) {}

			/* only coordinator is using these counters,
			so no need to protect by lock. */
			page_cleaner->flush_time += ut_time_monotonic_ms() - tm;
			page_cleaner->flush_pass++;

			/* Wait for all slots to be finished */
			ulint	n_flushed_lru = 0;
			ulint	n_flushed_list = 0;
			pc_wait_finished(&n_flushed_lru, &n_flushed_list);

			if (n_flushed_list > 0 || n_flushed_lru > 0) {
				buf_flush_stats(n_flushed_list, n_flushed_lru);

				MONITOR_INC_VALUE_CUMULATIVE(
					MONITOR_FLUSH_SYNC_TOTAL_PAGE,
					MONITOR_FLUSH_SYNC_COUNT,
					MONITOR_FLUSH_SYNC_PAGES,
					n_flushed_lru + n_flushed_list);
			}

			n_flushed = n_flushed_lru + n_flushed_list;
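
The coordinator/worker handshake behind pc_request, pc_flush_slot and pc_wait_finished looks roughly like this (a hedged paraphrase of the 5.7 source with asserts and the per-slot division of min_n elided, not the verbatim function):

static
void
pc_request(ulint min_n, lsn_t lsn_limit)
{
	mutex_enter(&page_cleaner->mutex);

	/* Publish one job per buffer pool instance. */
	page_cleaner->requested = (min_n > 0);
	page_cleaner->lsn_limit = lsn_limit;

	for (ulint i = 0; i < page_cleaner->n_slots; i++) {
		page_cleaner->slots[i].state
			= PAGE_CLEANER_STATE_REQUESTED;
	}

	page_cleaner->n_slots_requested = page_cleaner->n_slots;
	page_cleaner->n_slots_flushing = 0;
	page_cleaner->n_slots_finished = 0;

	/* Wake the page cleaner workers; each pc_flush_slot() call
	(by a worker or by the coordinator itself) claims one slot
	and flushes that buffer pool instance. */
	os_event_set(page_cleaner->is_requested);

	mutex_exit(&page_cleaner->mutex);
}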

Normal flushing: the number of pages to flush is computed by page_cleaner_flush_pages_recommendation.

			ulint	n_to_flush;
			lsn_t	lsn_limit = 0;

			/* Estimate pages from flush_list to be flushed */
			if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {
				last_activity = srv_get_activity_count();
				n_to_flush =
					page_cleaner_flush_pages_recommendation(
						&lsn_limit, last_pages);
			} else {
				n_to_flush = 0;
			}

			/* Request flushing for threads */
			pc_request(n_to_flush, lsn_limit);

			ib_time_monotonic_ms_t tm = ut_time_monotonic_ms();

			/* Coordinator also treats requests */
			while (pc_flush_slot() > 0) {
				/* No op */
			}

			/* only coordinator is using these counters,
			so no need to protect by lock. */
			page_cleaner->flush_time += ut_time_monotonic_ms() - tm;
			page_cleaner->flush_pass++;

			/* Wait for all slots to be finished */
			ulint	n_flushed_lru = 0;
			ulint	n_flushed_list = 0;

			pc_wait_finished(&n_flushed_lru, &n_flushed_list);

			if (n_flushed_list > 0 || n_flushed_lru > 0) {
				buf_flush_stats(n_flushed_list, n_flushed_lru);
			}

			if (ret_sleep == OS_SYNC_TIME_EXCEEDED) {
				last_pages = n_flushed_list;
			}

			n_evicted += n_flushed_lru;
			n_flushed_last += n_flushed_list;

			n_flushed = n_flushed_lru + n_flushed_list;

			if (n_flushed_lru) {
				MONITOR_INC_VALUE_CUMULATIVE(
					MONITOR_LRU_BATCH_FLUSH_TOTAL_PAGE,
					MONITOR_LRU_BATCH_FLUSH_COUNT,
					MONITOR_LRU_BATCH_FLUSH_PAGES,
					n_flushed_lru);
			}

			if (n_flushed_list) {
				MONITOR_INC_VALUE_CUMULATIVE(
					MONITOR_FLUSH_ADAPTIVE_TOTAL_PAGE,
					MONITOR_FLUSH_ADAPTIVE_COUNT,
					MONITOR_FLUSH_ADAPTIVE_PAGES,
					n_flushed_list);
			}

Idle flushing: the coordinator performs it by itself, and the number of pages to flush is simply PCT_IO(100), where #define PCT_IO(p) ((ulong) (srv_io_capacity * ((double) (p) / 100.0))). For example, with innodb_io_capacity = 200, PCT_IO(100) = 200 pages per round.

			buf_flush_lists(PCT_IO(100), LSN_MAX, &n_flushed);

			n_flushed_last += n_flushed;

			if (n_flushed) {
				MONITOR_INC_VALUE_CUMULATIVE(
					MONITOR_FLUSH_BACKGROUND_TOTAL_PAGE,
					MONITOR_FLUSH_BACKGROUND_COUNT,
					MONITOR_FLUSH_BACKGROUND_PAGES,
					n_flushed);

			}

How the Number of Pages to Flush Is Computed for Normal Flushing

Relevant parameters

  • innodb_max_dirty_pages_pct_lwm
  • innodb_max_dirty_pages_pct
  • innodb_adaptive_flushing
  • innodb_adaptive_flushing_lwm
  • innodb_flushing_avg_loops
  • innodb_io_capacity
  • innodb_io_capacity_max

First, let's see how the number of pages to flush is computed, in the function page_cleaner_flush_pages_recommendation:
	n_pages = (PCT_IO(pct_total) + avg_page_rate + pages_for_lsn) / 3;
	 /* #define PCT_IO(p) ((ulong) (srv_io_capacity * ((double) (p) / 100.0)))
	n_pages is the target number of pages to flush */
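
To make the formula concrete, here is a small self-contained illustration with made-up input values (all the numbers below are hypothetical, chosen only to show the arithmetic):

#include <algorithm>
#include <cstdio>

int main()
{
	unsigned long	srv_io_capacity     = 200;	/* innodb_io_capacity */
	unsigned long	srv_max_io_capacity = 2000;	/* innodb_io_capacity_max */

	unsigned long	pct_total     = 50;	/* max(pct_for_dirty, pct_for_lsn) */
	unsigned long	avg_page_rate = 300;	/* smoothed pages flushed per second */
	unsigned long	pages_for_lsn = 600;	/* pages below target_lsn, capped */

	/* PCT_IO(pct_total) == srv_io_capacity * pct_total / 100 == 100 */
	unsigned long	n_pages = (srv_io_capacity * pct_total / 100
				   + avg_page_rate + pages_for_lsn) / 3;

	/* final cap, as in page_cleaner_flush_pages_recommendation */
	n_pages = std::min(n_pages, srv_max_io_capacity);

	printf("n_pages = %lu\n", n_pages);	/* prints: n_pages = 333 */
	return 0;
}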

The following four parameters affect the value of pct_total:

  • innodb_max_dirty_pages_pct_lwm
  • innodb_max_dirty_pages_pct
  • innodb_adaptive_flushing
  • innodb_adaptive_flushing_lwm
	pct_for_dirty = af_get_pct_for_dirty();
	pct_for_lsn = af_get_pct_for_lsn(age);
	pct_total = ut_max(pct_for_dirty, pct_for_lsn);

af_get_pct_for_dirty computes a percentage based on the number of dirty pages. When innodb_max_dirty_pages_pct_lwm is 0: if the dirty-page ratio exceeds srv_max_buf_pool_modified_pct (innodb_max_dirty_pages_pct), pct_for_dirty is set to 100. When innodb_max_dirty_pages_pct_lwm is non-zero: if the dirty-page ratio exceeds srv_max_dirty_pages_pct_lwm, pct_for_dirty is (dirty_pct * 100) / (srv_max_buf_pool_modified_pct + 1). In all other cases it is 0.

/*********************************************************************//**
Calculates if flushing is required based on number of dirty pages in
the buffer pool.
@return percent of io_capacity to flush to manage dirty page ratio */
static
ulint
af_get_pct_for_dirty()
/*==================*/
{
	double	dirty_pct = buf_get_modified_ratio_pct();
	if (dirty_pct == 0.0) {
		/* No pages modified */
		return(0);
	}
	if (srv_max_dirty_pages_pct_lwm == 0) {
		/* The user has not set the option to preflush dirty
		pages as we approach the high water mark. */
		if (dirty_pct >= srv_max_buf_pool_modified_pct) {
			/* We have crossed the high water mark of dirty
			pages In this case we start flushing at 100% of
			innodb_io_capacity. */
			return(100);
		}
	} else if (dirty_pct >= srv_max_dirty_pages_pct_lwm) {
		/* We should start flushing pages gradually. */
		return(static_cast<ulint>((dirty_pct * 100)
		       / (srv_max_buf_pool_modified_pct + 1)));
	}

	return(0);
}

af_get_pct_for_lsn computes a percentage based on the redo LSN age. If age is below af_lwm (srv_adaptive_flushing_lwm percent of log_get_capacity()), it returns 0. It also returns 0 if adaptive flushing is disabled and age has not yet reached max_async_age (log_get_max_modified_age_async(), roughly 7/8 of the log capacity). Otherwise it returns ((srv_max_io_capacity / srv_io_capacity) * (lsn_age_factor * sqrt(lsn_age_factor))) / 7.5, where lsn_age_factor = (age * 100) / max_async_age.

/*********************************************************************//**
Calculates if flushing is required based on redo generation rate.
@return percent of io_capacity to flush to manage redo space */
static
ulint
af_get_pct_for_lsn(
/*===============*/
	lsn_t	age)	/*!< in: current age of LSN. */
{
	lsn_t	max_async_age;
	lsn_t	lsn_age_factor;
	lsn_t	af_lwm = (srv_adaptive_flushing_lwm
			  * log_get_capacity()) / 100;
	if (age < af_lwm) {
		/* No adaptive flushing. */
		return(0);
	}
	max_async_age = log_get_max_modified_age_async();
	if (age < max_async_age && !srv_adaptive_flushing) {
		/* We have still not reached the max_async point and
		the user has disabled adaptive flushing. */
		return(0);
	}
	/* If we are here then we know that either:
	1) User has enabled adaptive flushing
	2) User may have disabled adaptive flushing but we have reached
	max_async_age. */
	lsn_age_factor = (age * 100) / max_async_age;
	return(static_cast<ulint>(
		((srv_max_io_capacity / srv_io_capacity)
		* (lsn_age_factor * sqrt((double)lsn_age_factor)))
		/ 7.5));
}

The following parameter affects the computation of avg_page_rate and pages_for_lsn.

  • innodb_flushing_avg_loops controls how often avg_page_rate and lsn_avg_rate (the average rate at which the LSN advances per second) are recomputed. The larger the value, the longer the update period and the less responsive adaptive flushing is to changes in workload.
	sum_pages += last_pages_in; //last_pages_in:the number of pages flushed by the last flush_list flushing.
	ib_time_monotonic_t	curr_time    = ut_time_monotonic();
	uint64_t	        time_elapsed = curr_time - prev_time;
	const ulong             avg_loop     = srv_flushing_avg_loops;
	/* We update our variables every srv_flushing_avg_loops
	iterations to smooth out transition in workload. */
	if (++n_iterations >= avg_loop
	    || time_elapsed >= (uint64_t)avg_loop) {
		if (time_elapsed < 1) {
			time_elapsed = 1;
		}
		avg_page_rate = static_cast<ulint>(
			((static_cast<double>(sum_pages)
			  / time_elapsed)
			 + avg_page_rate) / 2);
		/* How much LSN we have generated since last call. */
		lsn_rate = static_cast<lsn_t>(
			static_cast<double>(cur_lsn - prev_lsn)
			/ time_elapsed);
		lsn_avg_rate = (lsn_avg_rate + lsn_rate) / 2; 
		prev_lsn = cur_lsn;
		prev_time = curr_time;
		n_iterations = 0;
		sum_pages = 0;
	}

pages_for_lsn is the total, over all buffer pool instances, of pages with an LSN below target_lsn. First target_lsn is derived from lsn_avg_rate, then each buffer pool's flush list is traversed to count the pages below target_lsn.

	oldest_lsn = buf_pool_get_oldest_modification();
	lsn_t	target_lsn = oldest_lsn
			     + lsn_avg_rate * buf_flush_lsn_scan_factor;
	for (ulint i = 0; i < srv_buf_pool_instances; i++) {
		buf_pool_t*	buf_pool = buf_pool_from_array(i);
		ulint		pages_for_lsn = 0;
		buf_flush_list_mutex_enter(buf_pool);
		for (buf_page_t* b = UT_LIST_GET_LAST(buf_pool->flush_list);
		     b != NULL;
		     b = UT_LIST_GET_PREV(list, b)) {
			if (b->oldest_modification > target_lsn) {
				break;
			}
			++pages_for_lsn;
		}
		buf_flush_list_mutex_exit(buf_pool);
		sum_pages_for_lsn += pages_for_lsn;
		mutex_enter(&page_cleaner->mutex);
		page_cleaner->slots[i].n_pages_requested
			= pages_for_lsn / buf_flush_lsn_scan_factor + 1;
		mutex_exit(&page_cleaner->mutex);
	}
	sum_pages_for_lsn /= buf_flush_lsn_scan_factor;
	if(sum_pages_for_lsn < 1) {
		sum_pages_for_lsn = 1;
	}
	/* Cap the maximum IO capacity that we are going to use by
	max_io_capacity. Limit the value to avoid too quick increase */
	ulint	pages_for_lsn =
		std::min<ulint>(sum_pages_for_lsn, srv_max_io_capacity * 2);

After the total n_pages has been computed, it still has to be apportioned across buffer pool instances, taking into account redo log free space and the distribution of dirty pages: when redo space is tight (pct_for_lsn > 30), instances holding more old pages are asked to flush proportionally more; otherwise n_pages is split evenly across the instances.

	for (ulint i = 0; i < srv_buf_pool_instances; i++) {
		/* if REDO has enough of free space,
		don't care about age distribution of pages */
		page_cleaner->slots[i].n_pages_requested = pct_for_lsn > 30 ?
			page_cleaner->slots[i].n_pages_requested
			* n_pages / sum_pages_for_lsn + 1
			: n_pages / srv_buf_pool_instances;
	}

Full code of page_cleaner_flush_pages_recommendation

/*********************************************************************//**
This function is called approximately once every second by the
page_cleaner thread. Based on various factors it decides if there is a
need to do flushing.
@return number of pages recommended to be flushed
@param lsn_limit	pointer to return LSN up to which flushing must happen
@param last_pages_in	the number of pages flushed by the last flush_list
			flushing. */
static
ulint
page_cleaner_flush_pages_recommendation(
/*====================================*/
	lsn_t*	lsn_limit,
	ulint	last_pages_in)
{
	static	lsn_t		prev_lsn = 0;
	static	ulint		sum_pages = 0;
	static	ulint		avg_page_rate = 0;
	static	ulint		n_iterations = 0;
	static	ib_time_monotonic_t		prev_time;
	lsn_t			oldest_lsn;
	lsn_t			cur_lsn;
	lsn_t			age;
	lsn_t			lsn_rate;
	ulint			n_pages = 0;
	ulint			pct_for_dirty = 0;
	ulint			pct_for_lsn = 0;
	ulint			pct_total = 0;

	cur_lsn = log_get_lsn();

	if (prev_lsn == 0) {
		/* First time around. */
		prev_lsn = cur_lsn;
		prev_time = ut_time_monotonic();
		return(0);
	}

	if (prev_lsn == cur_lsn) {
		return(0);
	}

	sum_pages += last_pages_in;

	ib_time_monotonic_t	curr_time    = ut_time_monotonic();
	uint64_t	        time_elapsed = curr_time - prev_time;
	const ulong             avg_loop     = srv_flushing_avg_loops;

	/* We update our variables every srv_flushing_avg_loops
	iterations to smooth out transition in workload. */
	if (++n_iterations >= avg_loop
	    || time_elapsed >= (uint64_t)avg_loop) {

		if (time_elapsed < 1) {
			time_elapsed = 1;
		}

		avg_page_rate = static_cast<ulint>(
			((static_cast<double>(sum_pages)
			  / time_elapsed)
			 + avg_page_rate) / 2);

		/* How much LSN we have generated since last call. */
		lsn_rate = static_cast<lsn_t>(
			static_cast<double>(cur_lsn - prev_lsn)
			/ time_elapsed);

		lsn_avg_rate = (lsn_avg_rate + lsn_rate) / 2; /* averaging keeps the curve smooth */


		/* aggregate stats of all slots */
		mutex_enter(&page_cleaner->mutex);

		uint64_t  flush_tm = page_cleaner->flush_time;
		ulint	flush_pass = page_cleaner->flush_pass;

		page_cleaner->flush_time = 0;
		page_cleaner->flush_pass = 0;

		uint64_t lru_tm = 0;
		uint64_t list_tm = 0;
		ulint	lru_pass = 0;
		ulint	list_pass = 0;

		for (ulint i = 0; i < page_cleaner->n_slots; i++) {
			page_cleaner_slot_t*	slot;

			slot = &page_cleaner->slots[i];

			lru_tm    += slot->flush_lru_time;
			lru_pass  += slot->flush_lru_pass;
			list_tm   += slot->flush_list_time;
			list_pass += slot->flush_list_pass;

			slot->flush_lru_time  = 0;
			slot->flush_lru_pass  = 0;
			slot->flush_list_time = 0;
			slot->flush_list_pass = 0;
		}

		mutex_exit(&page_cleaner->mutex);

		/* minimum values are 1, to avoid dividing by zero. */
		if (lru_tm < 1) {
			lru_tm = 1;
		}
		if (list_tm < 1) {
			list_tm = 1;
		}
		if (flush_tm < 1) {
			flush_tm = 1;
		}

		if (lru_pass < 1) {
			lru_pass = 1;
		}
		if (list_pass < 1) {
			list_pass = 1;
		}
		if (flush_pass < 1) {
			flush_pass = 1;
		}

		MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_TIME_SLOT,
			    list_tm / list_pass);
		MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_TIME_SLOT,
			    lru_tm  / lru_pass);

		MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_TIME_THREAD,
			    list_tm / (srv_n_page_cleaners * flush_pass));
		MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_TIME_THREAD,
			    lru_tm / (srv_n_page_cleaners * flush_pass));
		MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_TIME_EST,
			    flush_tm * list_tm / flush_pass
			    / (list_tm + lru_tm));
		MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_TIME_EST,
			    flush_tm * lru_tm / flush_pass
			    / (list_tm + lru_tm));
		MONITOR_SET(MONITOR_FLUSH_AVG_TIME, flush_tm / flush_pass);

		MONITOR_SET(MONITOR_FLUSH_ADAPTIVE_AVG_PASS,
			    list_pass / page_cleaner->n_slots);
		MONITOR_SET(MONITOR_LRU_BATCH_FLUSH_AVG_PASS,
			    lru_pass / page_cleaner->n_slots);
		MONITOR_SET(MONITOR_FLUSH_AVG_PASS, flush_pass);

		prev_lsn = cur_lsn;
		prev_time = curr_time;

		n_iterations = 0;

		sum_pages = 0;
	}

	oldest_lsn = buf_pool_get_oldest_modification();

	ut_ad(oldest_lsn <= log_get_lsn());

	age = cur_lsn > oldest_lsn ? cur_lsn - oldest_lsn : 0;

	pct_for_dirty = af_get_pct_for_dirty();
	pct_for_lsn = af_get_pct_for_lsn(age);

	pct_total = ut_max(pct_for_dirty, pct_for_lsn);

	/* Estimate pages to be flushed for the lsn progress */
	ulint	sum_pages_for_lsn = 0;
	lsn_t	target_lsn = oldest_lsn
			     + lsn_avg_rate * buf_flush_lsn_scan_factor;

	for (ulint i = 0; i < srv_buf_pool_instances; i++) {
		buf_pool_t*	buf_pool = buf_pool_from_array(i);
		ulint		pages_for_lsn = 0;

		buf_flush_list_mutex_enter(buf_pool);
		for (buf_page_t* b = UT_LIST_GET_LAST(buf_pool->flush_list);
		     b != NULL;
		     b = UT_LIST_GET_PREV(list, b)) {
			if (b->oldest_modification > target_lsn) {
				break;
			}
			++pages_for_lsn;
		}
		buf_flush_list_mutex_exit(buf_pool);

		sum_pages_for_lsn += pages_for_lsn;

		mutex_enter(&page_cleaner->mutex);
		ut_ad(page_cleaner->slots[i].state
		      == PAGE_CLEANER_STATE_NONE);
		page_cleaner->slots[i].n_pages_requested
			= pages_for_lsn / buf_flush_lsn_scan_factor + 1;
		mutex_exit(&page_cleaner->mutex);
	}

	sum_pages_for_lsn /= buf_flush_lsn_scan_factor;
	if(sum_pages_for_lsn < 1) {
		sum_pages_for_lsn = 1;
	}

	/* Cap the maximum IO capacity that we are going to use by
	max_io_capacity. Limit the value to avoid too quick increase */
	ulint	pages_for_lsn =
		std::min<ulint>(sum_pages_for_lsn, srv_max_io_capacity * 2);

	n_pages = (PCT_IO(pct_total) + avg_page_rate + pages_for_lsn) / 3;

	if (n_pages > srv_max_io_capacity) {
		n_pages = srv_max_io_capacity;
	}

	/* Normalize request for each instance */
	mutex_enter(&page_cleaner->mutex);
	ut_ad(page_cleaner->n_slots_requested == 0);
	ut_ad(page_cleaner->n_slots_flushing == 0);
	ut_ad(page_cleaner->n_slots_finished == 0);

	for (ulint i = 0; i < srv_buf_pool_instances; i++) {
		/* if REDO has enough of free space,
		don't care about age distribution of pages */
		page_cleaner->slots[i].n_pages_requested = pct_for_lsn > 30 ?
			page_cleaner->slots[i].n_pages_requested
			* n_pages / sum_pages_for_lsn + 1
			: n_pages / srv_buf_pool_instances;
	}
	mutex_exit(&page_cleaner->mutex);

	MONITOR_SET(MONITOR_FLUSH_N_TO_FLUSH_REQUESTED, n_pages);

	MONITOR_SET(MONITOR_FLUSH_N_TO_FLUSH_BY_AGE, sum_pages_for_lsn);

	MONITOR_SET(MONITOR_FLUSH_AVG_PAGE_RATE, avg_page_rate);
	MONITOR_SET(MONITOR_FLUSH_LSN_AVG_RATE, lsn_avg_rate);
	MONITOR_SET(MONITOR_FLUSH_PCT_FOR_DIRTY, pct_for_dirty);
	MONITOR_SET(MONITOR_FLUSH_PCT_FOR_LSN, pct_for_lsn);

	*lsn_limit = LSN_MAX;

	return(n_pages);
}

How a Flush Batch Is Executed

Relevant parameters

  • innodb_lru_scan_depth

Call stack

A flush batch flushes pages from both the LRU list and the flush list. innodb_lru_scan_depth bounds how deep the LRU list is traversed; only pages reached within that depth go through buf_flush_ready_for_replace and buf_flush_ready_for_flush.

pc_flush_slot
----buf_flush_LRU_list
--------buf_flush_do_batch(BUF_FLUSH_LRU)
------------buf_flush_batch
----------------buf_do_LRU_batch
--------------------buf_free_from_unzip_LRU_list_batch
--------------------buf_flush_LRU_list_batch
------------------------buf_flush_ready_for_replace=>buf_LRU_free_page
------------------------buf_flush_ready_for_flush=>buf_flush_page_and_try_neighbors
----------------------------buf_flush_page_and_try_neighbors
--------------------------------buf_flush_try_neighbors
------------------------------------buf_flush_page
----------------------------------------buf_flush_write_block_low
----buf_flush_do_batch(BUF_FLUSH_LIST)
--------buf_flush_batch
------------buf_do_flush_list_batch
-----------------buf_flush_page_and_try_neighbors
---------------------buf_flush_try_neighbors
-------------------------buf_flush_page
-----------------------------buf_flush_write_block_low
---------------------------------buf_dblwr_add_to_batch
--------buf_flush_end
------------buf_dblwr_flush_buffered_writes

A few questions

  • During a batch flush, are clean pages on the LRU evicted too? Yes.
  • If the buffer pool is full, what does the freeing logic look like? The answer comes later in this section.
  • Are dirty pages of uncommitted transactions (say, a big transaction) flushed to disk? Verified: yes, and it is independent of transaction size. Then, given that dirty pages were flushed and a checkpoint was taken, how does crash recovery work? Redo is applied first, then undo. Pages already flushed to disk can still be rolled back, because undo is a logical log.

How does batch flushing guarantee write-ahead logging?

buf_flush_write_block_low
----log_write_up_to(bpage->newest_modification, true);
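
A minimal self-contained sketch of this ordering rule (the names mimic the server's, but buf_page_toy, flush_page and flushed_to_disk_lsn are toy stand-ins, not the real implementation):

#include <cassert>
#include <cstdint>

typedef std::uint64_t lsn_t;

static lsn_t flushed_to_disk_lsn = 0;	/* how far the redo log is durable */

/* stand-in for log_write_up_to(): make redo durable up to lsn */
static void log_write_up_to(lsn_t lsn, bool /*flush_to_disk*/)
{
	if (lsn > flushed_to_disk_lsn) {
		flushed_to_disk_lsn = lsn;	/* the real code issues an fsync */
	}
}

struct buf_page_toy { lsn_t newest_modification; };

static void flush_page(const buf_page_toy& page)
{
	/* WAL: redo covering the page's latest change must be durable
	before the page itself may be written to the data file. */
	log_write_up_to(page.newest_modification, true);
	assert(flushed_to_disk_lsn >= page.newest_modification);

	/* ... now it is safe to hand the page to the doublewrite buffer ... */
}

int main()
{
	buf_page_toy p = { 12345 };
	flush_page(p);
	return 0;
}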

Doublewrite buffer

My current understanding: buf_dblwr_add_to_batch copies the page into the doublewrite buffer and returns if the buffer is not yet full; once it is full, it triggers a synchronous write of the doublewrite buffer followed by asynchronous writes of the data pages. The flushed pages may be too few to ever fill the doublewrite buffer, so buf_flush_end calls buf_dblwr_flush_buffered_writes one more time.

/********************************************************************//**
Posts a buffer page for writing. If the doublewrite memory buffer is
full, calls buf_dblwr_flush_buffered_writes and waits for free
space to appear. */
void
buf_dblwr_add_to_batch(
/*====================*/
	buf_page_t*	bpage)	/*!< in: buffer block to write */
{
	ut_a(buf_page_in_file(bpage));

try_again:
	mutex_enter(&buf_dblwr->mutex);

	ut_a(buf_dblwr->first_free <= srv_doublewrite_batch_size);

	if (buf_dblwr->batch_running) {

		/* This not nearly as bad as it looks. There is only
		page_cleaner thread which does background flushing
		in batches therefore it is unlikely to be a contention
		point. The only exception is when a user thread is
		forced to do a flush batch because of a sync
		checkpoint. */
		int64_t	sig_count = os_event_reset(buf_dblwr->b_event);
		mutex_exit(&buf_dblwr->mutex);

		os_event_wait_low(buf_dblwr->b_event, sig_count);
		goto try_again;
	}

	if (buf_dblwr->first_free == srv_doublewrite_batch_size) {
		mutex_exit(&(buf_dblwr->mutex));

		buf_dblwr_flush_buffered_writes();

		goto try_again;
	}

	byte*	p = buf_dblwr->write_buf
		+ univ_page_size.physical() * buf_dblwr->first_free;

	if (bpage->size.is_compressed()) {
		UNIV_MEM_ASSERT_RW(bpage->zip.data, bpage->size.physical());
		/* Copy the compressed page and clear the rest. */

		memcpy(p, bpage->zip.data, bpage->size.physical());

		memset(p + bpage->size.physical(), 0x0,
		       univ_page_size.physical() - bpage->size.physical());
	} else {
		ut_a(buf_page_get_state(bpage) == BUF_BLOCK_FILE_PAGE);

		UNIV_MEM_ASSERT_RW(((buf_block_t*) bpage)->frame,
				   bpage->size.logical());

		memcpy(p, ((buf_block_t*) bpage)->frame, bpage->size.logical());
	}

	buf_dblwr->buf_block_arr[buf_dblwr->first_free] = bpage;

	buf_dblwr->first_free++;
	buf_dblwr->b_reserved++;

	ut_ad(!buf_dblwr->batch_running);
	ut_ad(buf_dblwr->first_free == buf_dblwr->b_reserved);
	ut_ad(buf_dblwr->b_reserved <= srv_doublewrite_batch_size);

	if (buf_dblwr->first_free == srv_doublewrite_batch_size) {
		mutex_exit(&(buf_dblwr->mutex));

		buf_dblwr_flush_buffered_writes();

		return;
	}

	mutex_exit(&(buf_dblwr->mutex));
}

buf_dblwr_flush_buffered_writes

void
buf_dblwr_flush_buffered_writes(void)
/*=================================*/
{
	byte*		write_buf;
	ulint		first_free;
	ulint		len;

	if (!srv_use_doublewrite_buf || buf_dblwr == NULL) {
		/* Sync the writes to the disk. */
		buf_dblwr_sync_datafiles();
		return;
	}

	ut_ad(!srv_read_only_mode);

try_again:
	mutex_enter(&buf_dblwr->mutex);

	/* Write first to doublewrite buffer blocks. We use synchronous
	aio and thus know that file write has been completed when the
	control returns. */

	if (buf_dblwr->first_free == 0) {

		mutex_exit(&buf_dblwr->mutex);

		/* Wake possible simulated aio thread as there could be
		system temporary tablespace pages active for flushing.
		Note: system temporary tablespace pages are not scheduled
		for doublewrite. */
		os_aio_simulated_wake_handler_threads();

		return;
	}

	if (buf_dblwr->batch_running) {
		/* Another thread is running the batch right now. Wait
		for it to finish. */
		int64_t	sig_count = os_event_reset(buf_dblwr->b_event);
		mutex_exit(&buf_dblwr->mutex);

		os_event_wait_low(buf_dblwr->b_event, sig_count);
		goto try_again;
	}

	ut_a(!buf_dblwr->batch_running);
	ut_ad(buf_dblwr->first_free == buf_dblwr->b_reserved);

	/* Disallow anyone else to post to doublewrite buffer or to
	start another batch of flushing. */
	buf_dblwr->batch_running = true;
	first_free = buf_dblwr->first_free;

	/* Now safe to release the mutex. Note that though no other
	thread is allowed to post to the doublewrite batch flushing
	but any threads working on single page flushes are allowed
	to proceed. */
	mutex_exit(&buf_dblwr->mutex);

	write_buf = buf_dblwr->write_buf;

	for (ulint len2 = 0, i = 0;
	     i < buf_dblwr->first_free;
	     len2 += UNIV_PAGE_SIZE, i++) {

		const buf_block_t*	block;

		block = (buf_block_t*) buf_dblwr->buf_block_arr[i];

		if (buf_block_get_state(block) != BUF_BLOCK_FILE_PAGE
		    || block->page.zip.data) {
			/* No simple validate for compressed
			pages exists. */
			continue;
		}

		/* Check that the actual page in the buffer pool is
		not corrupt and the LSN values are sane. */
		buf_dblwr_check_block(block);

		/* Check that the page as written to the doublewrite
		buffer has sane LSN values. */
		buf_dblwr_check_page_lsn(write_buf + len2);
	}

	/* Write out the first block of the doublewrite buffer */
	len = ut_min(TRX_SYS_DOUBLEWRITE_BLOCK_SIZE,
		     buf_dblwr->first_free) * UNIV_PAGE_SIZE;

	fil_io(IORequestWrite, true,
	       page_id_t(TRX_SYS_SPACE, buf_dblwr->block1), univ_page_size,
	       0, len, (void*) write_buf, NULL);

	if (buf_dblwr->first_free <= TRX_SYS_DOUBLEWRITE_BLOCK_SIZE) {
		/* No unwritten pages in the second block. */
		goto flush;
	}

	/* Write out the second block of the doublewrite buffer. */
	len = (buf_dblwr->first_free - TRX_SYS_DOUBLEWRITE_BLOCK_SIZE)
	       * UNIV_PAGE_SIZE;

	write_buf = buf_dblwr->write_buf
		    + TRX_SYS_DOUBLEWRITE_BLOCK_SIZE * UNIV_PAGE_SIZE;

	fil_io(IORequestWrite, true,
	       page_id_t(TRX_SYS_SPACE, buf_dblwr->block2), univ_page_size,
	       0, len, (void*) write_buf, NULL);

flush:
	/* increment the doublewrite flushed pages counter */
	srv_stats.dblwr_pages_written.add(buf_dblwr->first_free);
	srv_stats.dblwr_writes.inc();

	/* Now flush the doublewrite buffer data to disk */
	fil_flush(TRX_SYS_SPACE);

	/* We know that the writes have been flushed to disk now
	and in recovery we will find them in the doublewrite buffer
	blocks. Next do the writes to the intended positions. */

	/* Up to this point first_free and buf_dblwr->first_free are
	same because we have set the buf_dblwr->batch_running flag
	disallowing any other thread to post any request but we
	can't safely access buf_dblwr->first_free in the loop below.
	This is so because it is possible that after we are done with
	the last iteration and before we terminate the loop, the batch
	gets finished in the IO helper thread and another thread posts
	a new batch setting buf_dblwr->first_free to a higher value.
	If this happens and we are using buf_dblwr->first_free in the
	loop termination condition then we'll end up dispatching
	the same block twice from two different threads. */
	ut_ad(first_free == buf_dblwr->first_free);
	for (ulint i = 0; i < first_free; i++) {
		buf_dblwr_write_block_to_datafile(
			buf_dblwr->buf_block_arr[i], false);
	}

	/* Wake possible simulated aio thread to actually post the
	writes to the operating system. We don't flush the files
	at this point. We leave it to the IO helper thread to flush
	datafiles when the whole batch has been processed. */
	os_aio_simulated_wake_handler_threads();
}

Free logic when the buffer pool is full

If the buffer pool is full, what does the freeing logic look like? In the monitoring graph (screenshot not reproduced here), the free list length stays at 1024 the whole time. The code lives in the buf_flush_LRU_list_batch function (the original screenshot of the code is also lost; a sketch follows below).
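
Here is a hedged paraphrase of the core loop of buf_flush_LRU_list_batch (5.7 source; locking, counters and the withdraw logic are elided, so this is a sketch rather than the verbatim code). Note the loop condition: it keeps evicting/flushing until the free list has grown back to srv_LRU_scan_depth (innodb_lru_scan_depth, default 1024), which is why the free list hovers at 1024:

	ulint	count    = 0;
	ulint	free_len = UT_LIST_GET_LEN(buf_pool->free);
	ulint	lru_len  = UT_LIST_GET_LEN(buf_pool->LRU);

	for (buf_page_t* bpage = UT_LIST_GET_LAST(buf_pool->LRU);
	     bpage != NULL
	     && free_len < srv_LRU_scan_depth
	     && lru_len > BUF_LRU_MIN_LEN;
	     bpage = buf_pool->lru_hp.get()) {

		/* Save the previous node in a hazard pointer (the same
		trick described in the next section). */
		buf_pool->lru_hp.set(UT_LIST_GET_PREV(LRU, bpage));

		if (buf_flush_ready_for_replace(bpage)) {
			/* Clean and unpinned: evict straight to the
			free list. */
			buf_LRU_free_page(bpage, true);
		} else if (buf_flush_ready_for_flush(bpage, BUF_FLUSH_LRU)) {
			/* Dirty: dispatch a write; the IO completion
			routine will move it to the free list. */
			buf_flush_page_and_try_neighbors(
				bpage, BUF_FLUSH_LRU, max, &count);
		}

		free_len = UT_LIST_GET_LEN(buf_pool->free);
		lru_len  = UT_LIST_GET_LEN(buf_pool->LRU);
	}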

Optimizing Batch Flushing: the Hazard Pointer Design and Implementation

Background

  • flush_list holds the dirty pages ordered by modification time; flushing traverses and evicts from its tail
  • With asynchronous IO, once an async page write completes, the IO helper thread calls buf_flush_write_complete, which removes the written page from the flush list. Call stack:

----fil_aio_wait
--------buf_page_io_complete(static_cast<buf_page_t*>(message));
------------buf_flush_write_complete
----------------buf_flush_remove
--------------------UT_LIST_REMOVE(buf_pool->flush_list, bpage);
The old implementation (version 5.7.1):

After flushing a page, the flushing thread had to go back to the tail of the flush list and rescan for the next flushable dirty page.

buf_do_flush_list_batch{
	do {
		/* Start from the end of the list looking for a suitable
		block to be flushed. */

		buf_flush_list_mutex_enter(buf_pool);

		/* We use len here because theoretically insertions can
		happen in the flush_list below while we are traversing
		it for a suitable candidate for flushing. We'd like to
		set a limit on how farther we are willing to traverse
		the list. */
		len = UT_LIST_GET_LEN(buf_pool->flush_list);
		bpage = UT_LIST_GET_LAST(buf_pool->flush_list);

		if (bpage) {
			ut_a(bpage->oldest_modification > 0);
		}

		if (!bpage || bpage->oldest_modification >= lsn_limit) {

			/* We have flushed enough */
			buf_flush_list_mutex_exit(buf_pool);
			break;
		}

		ut_a(bpage->oldest_modification > 0);

		ut_ad(bpage->in_flush_list);

		buf_flush_list_mutex_exit(buf_pool);

		/* The list may change during the flushing and we cannot
		safely preserve within this function a pointer to a
		block in the list! */
		while (bpage != NULL
		       && len > 0
		       && !buf_flush_page_and_try_neighbors(
				bpage, BUF_FLUSH_LIST, min_n, &count)) {

			++scanned;
			buf_flush_list_mutex_enter(buf_pool);

			/* If we are here that means that buf_pool->mutex
			 was not released in buf_flush_page_and_try_neighbors()
			above and this guarantees that bpage didn't get
			relocated since we released the flush_list
			mutex above. There is a chance, however, that
			the bpage got removed from flush_list (not
			currently possible because flush_list_remove()
			also obtains buf_pool mutex but that may change
			in future). To avoid this scenario we check
			the oldest_modification and if it is zero
			we start all over again. */
			if (bpage->oldest_modification == 0) {
				buf_flush_list_mutex_exit(buf_pool);
				break;
			}

			bpage = UT_LIST_GET_PREV(list, bpage);

			ut_ad(!bpage || bpage->in_flush_list);

			buf_flush_list_mutex_exit(buf_pool);

			--len;
		}

	} while (count < min_n && bpage != NULL && len > 0);
}
Why was it designed this way?

Because the dirty pages older than the one a thread just flushed may have been flushed by other threads in the meantime, they may no longer be on the flush list. This can happen because:

  • User threads flush pages too. Although they pick dirty pages from the LRU list, once a page is flushed to disk the user thread also removes it from the flush list. For the call stack see: www.jianshu.com/p/080056afa…
  • In addition, to shorten lock hold times, InnoDB releases the locks it holds before performing the disk write.
What were the consequences?

A flushing thread takes the last page on the list, marks it IO-fixed, hands it to the IO thread, and then scans from the tail again for the next flushable page. On this scan it finds the last page (the one just dispatched) still IO-fixed (the disk is slow and has not finished), so it skips it, flushes the second-to-last page, likewise dispatches it, and then rescans the flush list once more. Now both tail pages are unflushable (the slow disk may still be busy) and it must scan to the third-from-last page, and so on. So in the extreme case of a slow disk, the flushing algorithm degrades from O(N) to O(N*N). dev.mysql.com/worklog/tas…

The Hazard Pointer implementation (5.7.28):

static
ulint
buf_do_flush_list_batch(
	buf_pool_t*		buf_pool,
	ulint			min_n,
	lsn_t			lsn_limit)
{
	ulint		count = 0;
	ulint		scanned = 0;

	ut_ad(buf_pool_mutex_own(buf_pool));

	/* Start from the end of the list looking for a suitable
	block to be flushed. */
	buf_flush_list_mutex_enter(buf_pool);
	ulint len = UT_LIST_GET_LEN(buf_pool->flush_list);

	/* In order not to degenerate this scan to O(n*n) we attempt
	to preserve pointer of previous block in the flush list. To do
	so we declare it a hazard pointer. Any thread working on the
	flush list must check the hazard pointer and if it is removing
	the same block then it must reset it. */
	for (buf_page_t* bpage = UT_LIST_GET_LAST(buf_pool->flush_list);
	     count < min_n && bpage != NULL && len > 0
	     && bpage->oldest_modification < lsn_limit;
	     bpage = buf_pool->flush_hp.get(),
	     ++scanned) {

		buf_page_t*	prev;

		ut_a(bpage->oldest_modification > 0);
		ut_ad(bpage->in_flush_list);

		prev = UT_LIST_GET_PREV(list, bpage);
		buf_pool->flush_hp.set(prev);
		buf_flush_list_mutex_exit(buf_pool);

#ifdef UNIV_DEBUG
		bool flushed =
#endif /* UNIV_DEBUG */
		buf_flush_page_and_try_neighbors(
			bpage, BUF_FLUSH_LIST, min_n, &count);

		buf_flush_list_mutex_enter(buf_pool);

		ut_ad(flushed || buf_pool->flush_hp.is_hp(prev));

		--len;
	}

	buf_pool->flush_hp.set(NULL);
	buf_flush_list_mutex_exit(buf_pool);

	if (scanned) {
		MONITOR_INC_VALUE_CUMULATIVE(
			MONITOR_FLUSH_BATCH_SCANNED,
			MONITOR_FLUSH_BATCH_SCANNED_NUM_CALL,
			MONITOR_FLUSH_BATCH_SCANNED_PER_CALL,
			scanned);
	}

	if (count) {
		MONITOR_INC_VALUE_CUMULATIVE(
			MONITOR_FLUSH_BATCH_TOTAL_PAGE,
			MONITOR_FLUSH_BATCH_COUNT,
			MONITOR_FLUSH_BATCH_PAGES,
			count);
	}

	ut_ad(buf_pool_mutex_own(buf_pool));

	return(count);
}
How it works

The essential fix is to stop rescanning from the tail every time a page is flushed. A Hazard Pointer solves this as follows: after finding a flushable page, and before releasing the lock, point the Hazard Pointer at the next node in the flush list (this must be done while still holding the lock). Then release the lock and flush the page; afterwards re-acquire the lock, read the Hazard Pointer, set it to the next node, release the lock, flush, and repeat. While this thread is flushing, another thread that wants to flush also goes through the Hazard Pointer to obtain a reliable node and resets it to the next valid node. This mechanism guarantees that the Hazard Pointer always reads as a valid flush-list node, so no matter how slow the disk is, the flushing algorithm stays O(N). The same technique is also applied to the LRU list eviction algorithm to make eviction more efficient.
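
A minimal self-contained sketch of the mechanism (node_t and the class body below are simplified stand-ins modeled on InnoDB's HazardPointer/FlushHp classes; the real classes also carry the buffer pool pointer and mutex assertions):

#include <cstddef>

struct node_t { node_t* prev; /* payload elided */ };

class HazardPointer {
public:
	/* Reader side: called with the list mutex held. */
	node_t* get() const { return m_hp; }
	void set(node_t* hp) { m_hp = hp; }

	/* Remover side: any thread taking a node off the list must
	call this. If it is removing the node we saved, step the
	pointer to the previous node, so get() never returns a node
	that has left the list. */
	void adjust(const node_t* removed)
	{
		if (m_hp == removed) {
			m_hp = removed->prev;
		}
	}

private:
	node_t* m_hp = nullptr;
};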

Single-Page Flushing

Implementation of single-page flushing

Call stack

buf_LRU_get_free_block
----buf_LRU_scan_and_free_block /* first see whether a replaceable page can be found */
----buf_flush_single_page_from_LRU /* if not, flush one page */
--------buf_flush_page(buf_pool, bpage, BUF_FLUSH_SINGLE_PAGE, true)
------------buf_flush_write_block_low(bpage, flush_type, sync);
------------if (flush_type == BUF_FLUSH_SINGLE_PAGE) {
		        buf_dblwr_write_single_page(bpage, sync);
	        }
	        if (sync) {
		        ut_ad(flush_type == BUF_FLUSH_SINGLE_PAGE);
		        fil_flush(bpage->id.space());

		        /* true means we want to evict this page from the
		        LRU list as well. */
		        buf_page_io_complete(bpage, true);
	        }
----------------buf_page_io_complete
--------------------buf_flush_write_complete
------------------------buf_flush_remove
----------------------------UT_LIST_REMOVE(buf_pool->flush_list, bpage);
----------------buf_LRU_free_page(bpage, true); // free the page from the LRU

An LRU-driven flush also removes the page from the flush list; conversely, a flush-list flush does not remove the page from the LRU, because the page may still be a read hotspot.

/* If no block was in the free list, search from the end of the LRU list and try to free a block there. If we are doing for the first time we'll scan only tail of the LRU list otherwise we scan the whole LRU list. */

buf_LRU_scan_and_free_block scans up to innodb_lru_scan_depth blocks on the first iteration; if nothing is freed, the second iteration scans the whole LRU list. See the full code of buf_LRU_get_free_block:

/******************************************************************//**
Returns a free block from the buf_pool. The block is taken off the
free list. If free list is empty, blocks are moved from the end of the
LRU list to the free list.
This function is called from a user thread when it needs a clean
block to read in a page. Note that we only ever get a block from
the free list. Even when we flush a page or find a page in LRU scan
we put it to free list to be used.
* iteration 0:
  * get a block from free list, success:done
  * if buf_pool->try_LRU_scan is set
    * scan LRU up to srv_LRU_scan_depth to find a clean block
    * the above will put the block on free list
    * success:retry the free list
  * flush one dirty page from tail of LRU to disk
    * the above will put the block on free list
    * success: retry the free list
* iteration 1:
  * same as iteration 0 except:
    * scan whole LRU list
    * scan LRU list even if buf_pool->try_LRU_scan is not set
* iteration > 1:
  * same as iteration 1 but sleep 10ms
@return the free control block, in state BUF_BLOCK_READY_FOR_USE */
buf_block_t*
buf_LRU_get_free_block(
/*===================*/
	buf_pool_t*	buf_pool)	/*!< in/out: buffer pool instance */
{
	buf_block_t*	block		= NULL;
	bool		freed		= false;
	ulint		n_iterations	= 0;
	ulint		flush_failures	= 0;
	bool		mon_value_was	= false;
	bool		started_monitor	= false;

	MONITOR_INC(MONITOR_LRU_GET_FREE_SEARCH);
loop:
	buf_pool_mutex_enter(buf_pool);

	buf_LRU_check_size_of_non_data_objects(buf_pool);

	/* If there is a block in the free list, take it */
	block = buf_LRU_get_free_only(buf_pool);

	if (block != NULL) {

		buf_pool_mutex_exit(buf_pool);
		ut_ad(buf_pool_from_block(block) == buf_pool);
		memset(&block->page.zip, 0, sizeof block->page.zip);

		if (started_monitor) {
			srv_print_innodb_monitor =
				static_cast<my_bool>(mon_value_was);
		}

		block->skip_flush_check = false;
		block->page.flush_observer = NULL;
		return(block);
	}

	MONITOR_INC( MONITOR_LRU_GET_FREE_LOOPS );

	freed = false;
	if (buf_pool->try_LRU_scan || n_iterations > 0) {
		/* If no block was in the free list, search from the
		end of the LRU list and try to free a block there.
		If we are doing for the first time we'll scan only
		tail of the LRU list otherwise we scan the whole LRU
		list. */
		freed = buf_LRU_scan_and_free_block(
			buf_pool, n_iterations > 0);

		if (!freed && n_iterations == 0) {
			/* Tell other threads that there is no point
			in scanning the LRU list. This flag is set to
			TRUE again when we flush a batch from this
			buffer pool. */
			buf_pool->try_LRU_scan = FALSE;
		}
	}

	buf_pool_mutex_exit(buf_pool);

	if (freed) {
		goto loop;
	}

	if (n_iterations > 20
	    && srv_buf_pool_old_size == srv_buf_pool_size) {

		ib::warn() << "Difficult to find free blocks in the buffer pool"
			" (" << n_iterations << " search iterations)! "
			<< flush_failures << " failed attempts to"
			" flush a page! Consider increasing the buffer pool"
			" size. It is also possible that in your Unix version"
			" fsync is very slow, or completely frozen inside"
			" the OS kernel. Then upgrading to a newer version"
			" of your operating system may help. Look at the"
			" number of fsyncs in diagnostic info below."
			" Pending flushes (fsync) log: "
			<< fil_n_pending_log_flushes
			<< "; buffer pool: "
			<< fil_n_pending_tablespace_flushes
			<< ". " << os_n_file_reads << " OS file reads, "
			<< os_n_file_writes << " OS file writes, "
			<< os_n_fsyncs
			<< " OS fsyncs. Starting InnoDB Monitor to print"
			" further diagnostics to the standard output.";

		mon_value_was = srv_print_innodb_monitor;
		started_monitor = true;
		srv_print_innodb_monitor = true;
		os_event_set(lock_sys->timeout_event);
	}

	/* If we have scanned the whole LRU and still are unable to
	find a free block then we should sleep here to let the
	page_cleaner do an LRU batch for us. */

	if (!srv_read_only_mode) {
		os_event_set(buf_flush_event);
	}

	if (n_iterations > 1) {

		MONITOR_INC( MONITOR_LRU_GET_FREE_WAITS );
		os_thread_sleep(10000);
	}

	/* No free block was found: try to flush the LRU list.
	This call will flush one page from the LRU and put it on the
	free list. That means that the free block is up for grabs for
	all user threads.

	TODO: A more elegant way would have been to return the freed
	up block to the caller here but the code that deals with
	removing the block from page_hash and LRU_list is fairly
	involved (particularly in case of compressed pages). We
	can do that in a separate patch sometime in future. */

	if (!buf_flush_single_page_from_LRU(buf_pool)) {
		MONITOR_INC(MONITOR_LRU_SINGLE_FLUSH_FAILURE_COUNT);
		++flush_failures;
	}

	srv_stats.buf_pool_wait_free.add(n_iterations, 1);

	n_iterations++;

	goto loop;
}

Questions:

Which pages are replaceable (see the sketch below)? And is it possible that a page freed by one user thread gets used by another user thread?
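
For the first question, the check is buf_flush_ready_for_replace; the gist (a hedged paraphrase of the 5.7 source, asserts elided) is that a page is replaceable only when it is clean, unpinned and has no IO in flight:

ibool
buf_flush_ready_for_replace(
	buf_page_t*	bpage)
{
	if (buf_page_in_file(bpage)) {
		return(bpage->oldest_modification == 0
		       && bpage->buf_fix_count == 0
		       && buf_page_get_io_fix(bpage) == BUF_IO_NONE);
	}

	return(FALSE);
}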

What happens after the flush

Is the page removed from the LRU list and the flush list?

After the page cleaner thread performs a flush and the write IO completes, should the page be removed from both the flush list and the LRU list, or is it enough to just set oldest_modification to 0?

Consider two cases:

If the write completed as part of a flush-list flush, the page does not need to be evicted from the LRU list: the point of a flush-list flush is only to write the dirty data out, and the page may well still be hot and needed for reads, so there is no reason to remove it from the LRU.

If the write completed as part of an LRU flush, the page must be removed from the LRU list as well.

Reason: a page flushed from the LRU list is clearly no longer a hot page; more likely the buffer pool is short of space and pages need to be evicted from the LRU. Since these pages are being evicted from the LRU, they naturally must be removed from the LRU list.

The code is in buf_page_io_complete():

buf_page_io_complete{
	case BUF_IO_WRITE:
		/* Write means a flush operation: call the completion
		routine in the flush system */

		buf_flush_write_complete(bpage);

		if (uncompressed) {
			rw_lock_sx_unlock_gen(&((buf_block_t*) bpage)->lock,
					      BUF_IO_WRITE);
		}

		buf_pool->stat.n_pages_written++;

		/* We decide whether or not to evict the page from the
		LRU list based on the flush_type.
		* BUF_FLUSH_LIST: don't evict
		* BUF_FLUSH_LRU: always evict
		* BUF_FLUSH_SINGLE_PAGE: eviction preference is passed
		by the caller explicitly. */
		if (buf_page_get_flush_type(bpage) == BUF_FLUSH_LRU) {
			evict = true;
		}

		if (evict) {
			mutex_exit(buf_page_get_mutex(bpage));
			buf_LRU_free_page(bpage, true);
		} else {
			mutex_exit(buf_page_get_mutex(bpage));
		}

		break;
}

References

mysql.taobao.org/monthly/201…

www.jianshu.com/p/6991304a8…

mp.weixin.qq.com/s/o2OlvRiyb…

www.leviathan.vip/2020/05/19/…

developer.aliyun.com/article/410…

www.geek-share.com/detail/2706…