1. 背景问题
-
问题背景描述
运维同事反馈上线滚动发布,旧实例接收的请求发生业务中断,没有处理完成, 为保证服务质量,期望发布时旧实例将已有请求处理完成再回收。本质其实就是微服务优雅下线的问题。
-
为什么要解决这个问题
先说可能会带来的影响:
(1)业务处理中断
导致数据不完整,属于致命危害。
举例:文件写入中断,文件数据不完整;支付完成,本地业务处理失败。
应用是可以通过一些手段保证业务流程,比如业务补偿、幂等设计、状态机严格控制,这要求编程时考虑因素更大,即使能做到这些,也会带来维护成本,出现问题需要排查,甚至人工补偿,保证数据完整性。
(2)已下线的服务继续提供服务
微服务一般都由注册中心提供的服务自动发现机制,在一定时间范围内,即使服务下线,注册中心和服务有心跳时间窗口,比如eureka默认是30S,这个时间窗口内,注册中心不会下线服务,导致上游调用该服务大量报错,引发告警甚至熔断,业务请求大量失败。服务不存在绝对的优雅下线,一方面有断电、拔电源等外在因素;还有kill -15后应用本身不能在正常下线窗口时间范围内处理完成业务,触发 kill -9,发生业务中断。
但我们可以在可以控制的范围内,做到优雅下线。kill 本质上是发送一个信号,而信号量是进程间的一种异步通信机制;
kill 15,系统向应用发送SIGTERM(15)信号,该信号是可以被执行、阻塞和忽略的,应用程序接收到信号后,可以做很多事情,甚至可以决定不终止;
kill 9, 系统会发出SIGKILL(9)信号,由操作系统内核完成杀进程操作,该信号不允许忽略和阻塞,所以应用程序会立即终止;
发布时,一般都是先执行kill 15 ,并设置一个合理的下线处理时间窗口,如果在时间窗口内没有完成业务处理,则kill 9 强行进程杀。
2. 优雅下线要解决的问题
-
处理完已经接受的请求
(1)正常执行完成后台业务逻辑
(2)调用方接收正常处理完成的信号 -
不再接收新的请求
目前部署架构和发布方式,发布采用滚动发布的方式,k8s已经实现了停机时将服务从注册列表自动移除的机制。
k8s对pod是有一个terminationGracePeriodSeconds的配置项(默认是30秒),当采用RollingUpdate方式更新应用时,一个新pod起来并经过health check后,会通知k8s终止一个pod。
Pod被设置为Terminating状态,并从k8s的service/endpoint列表中删除,Pod停止获得新流量。
K8S向Pod中的容器发送SIGTERM(kill 15)信号,通知应用进程。
这给应用层面省去了不少麻烦,不然得下线时挂钩子(shutdownHook)将服务主动从注册列表移除,还没那么简单,容器重启服务时,如果发布前后ip无变化,还得对外暴露接口,配置执行脚本将服务加入注册列表,需要一整套方案去保证。
DiscoveryManager.getInstance().shutdownComponent();
3. 问题验证
-
测试验证
测试环境保持和生产一致:Springboot v1.5.7.RELEASE 内嵌 jetty-9.4.6.v20170531,测试指令kill 15。
测试代码:
@GetMapping("/test")
public String test1(@RequestBody Map<String, Integer> params) throws InterruptedException {
TimeUnit.SECONDS.sleep(params.get("s"));
System.out.println(Thread.currentThread().getName() + " finished");
return "m1";
}
Jetty 停机日志:可以看到系统会处理完成请求,但是客户端接收不到响应
2021-05-30 12:47:26.190 INFO 1235 --- [ main] .s.b.c.e.j.JettyEmbeddedServletContainer : Jetty started on port(s) 8080 (http/1.1)
2021-05-30 12:47:41.030 INFO 1235 --- [ Thread-14] ationConfigEmbeddedWebApplicationContext : Closing org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext@709ba3fb: startup date [Sun May 30 12:47:25 CST 2021]; root of context hierarchy
2021-05-30 12:47:41.033 INFO 1235 --- [ Thread-14] o.s.j.e.a.AnnotationMBeanExporter : Unregistering JMX-exposed beans on shutdown
2021-05-30 12:47:41.047 INFO 1235 --- [ Thread-14] o.e.jetty.server.AbstractConnector : Stopped ServerConnector@3185fa6b{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
2021-05-30 12:47:41.047 INFO 1235 --- [ Thread-14] org.eclipse.jetty.server.session : Stopped scavenging
2021-05-30 12:47:41.049 INFO 1235 --- [ Thread-14] o.e.j.s.h.ContextHandler.application : Destroying Spring FrameworkServlet 'dispatcherServlet'
2021-05-30 12:47:41.049 INFO 1235 --- [ Thread-14] o.e.jetty.server.handler.ContextHandler : Stopped o.s.b.c.e.j.JettyEmbeddedWebAppContext@7651218e{/,[file:///private/var/folders/kt/428mmnyd6vj_bgzz5xrfgtfm0000gn/T/jetty-docbase.8898877921167659801.8080/],UNAVAILABLE}
qtp1805164661-19 finished
Process finished with exit code 143 (interrupted by signal 15: SIGTERM)
客户端响应
org.apache.http.NoHttpResponseException: localhost:8080 failed to respond
tomcat 停机日志:可以看到系统收到指令,业务中断
2021-05-30 12:44:27.042 INFO 1177 --- [ main] s.b.c.e.t.TomcatEmbeddedServletContainer : Tomcat started on port(s): 8080 (http)
2021-05-30 12:44:27.046 INFO 1177 --- [ main] c.e.c.test.BrpcClientTestApplication : Started BrpcClientTestApplication in 1.634 seconds (JVM running for 2.076)
2021-05-30 12:44:39.702 INFO 1177 --- [nio-8080-exec-1] o.a.c.c.C.[Tomcat].[localhost].[/] : Initializing Spring FrameworkServlet 'dispatcherServlet'
2021-05-30 12:44:39.702 INFO 1177 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet : FrameworkServlet 'dispatcherServlet': initialization started
2021-05-30 12:44:39.733 INFO 1177 --- [nio-8080-exec-1] o.s.web.servlet.DispatcherServlet : FrameworkServlet 'dispatcherServlet': initialization completed in 31 ms
2021-05-30 12:44:41.519 INFO 1177 --- [ Thread-6] ationConfigEmbeddedWebApplicationContext : Closing org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext@21a947fe: startup date [Sun May 30 12:44:25 CST 2021]; root of context hierarchy
2021-05-30 12:44:41.522 INFO 1177 --- [ Thread-6] o.s.j.e.a.AnnotationMBeanExporter : Unregistering JMX-exposed beans on shutdown
Process finished with exit code 143 (interrupted by signal 15: SIGTERM)
客户端响应
org.apache.http.NoHttpResponseException: localhost:8080 failed to respond
-
Springboot对优雅关闭的支持
根据关闭打印日志可以看到,关闭相关类AnnotationConfigEmbeddedWebApplicationContext,
父类AbstractApplicationContext提供了对优雅下线的扩展支持,Springboot 在关闭时,如果有请求没有响应完,
在不同的容器会出现不同的结果,实际上具体关闭处理是交给具体容器自己实现的。
贴一下关键源码:
/**
* Register a shutdown hook with the JVM runtime, closing this context
* on JVM shutdown unless it has already been closed at that time.
* <p>Delegates to {@code doClose()} for the actual closing procedure.
* @see Runtime#addShutdownHook
* @see #close()
* @see #doClose()
*/
@Override
public void registerShutdownHook() {
if (this.shutdownHook == null) {
// No shutdown hook registered yet.
this.shutdownHook = new Thread() {
@Override
public void run() {
synchronized (startupShutdownMonitor) {
doClose();
}
}
};
Runtime.getRuntime().addShutdownHook(this.shutdownHook);
}
}
/**
* Actually performs context closing: publishes a ContextClosedEvent and
* destroys the singletons in the bean factory of this application context.
* <p>Called by both {@code close()} and a JVM shutdown hook, if any.
* @see org.springframework.context.event.ContextClosedEvent
* @see #destroyBeans()
* @see #close()
* @see #registerShutdownHook()
*/
protected void doClose() {
if (this.active.get() && this.closed.compareAndSet(false, true)) {
if (logger.isInfoEnabled()) {
logger.info("Closing " + this);
}
LiveBeansView.unregisterApplicationContext(this);
try {
// Publish shutdown event.
publishEvent(new ContextClosedEvent(this));
}
catch (Throwable ex) {
logger.warn("Exception thrown from ApplicationListener handling ContextClosedEvent", ex);
}
// Stop all Lifecycle beans, to avoid delays during individual destruction.
try {
getLifecycleProcessor().onClose();
}
catch (Throwable ex) {
logger.warn("Exception thrown from LifecycleProcessor on context close", ex);
}
// Destroy all cached singletons in the context's BeanFactory.
destroyBeans();
// Close the state of this context itself.
closeBeanFactory();
// Let subclasses do some final clean-up if they wish...
onClose();
this.active.set(false);
}
}
-
为什么tomcat默认不会优雅关闭?
tomcat 未对服务关闭事件做特殊处理,业务处理中断,调用方报错。 -
jetty的优雅关闭有什么问题?
接收到的请求能继续处理,调用方会报错,原因是Jetty中Server.stop()方法先关闭connections,然后处理队列中未完成的request,request完成后,由于connection已经关闭了,响应不能写回去了。
4. Springboot内嵌容器优雅停机的处理方案
-
Spring Boot 2.3 及以上版本,开箱即用
## 开启优雅停机, 如果不配置是默认IMMEDIATE, 立即停机
server.shutdown=graceful
## 优雅停机宽限期时间
spring.lifecycle.timeout-per-shutdown-phase=30s
2021-05-30 13:42:20.211 INFO 3103 --- [extShutdownHook] o.s.b.web.embedded.jetty.JettyWebServer : Commencing graceful shutdown. Waiting for active requests to complete
qtp1662592920-27 finished
2021-05-30 13:42:25.545 INFO 3103 --- [ jetty-shutdown] o.s.b.web.embedded.jetty.JettyWebServer : Graceful shutdown complete
-
Spring Boot 2.3 以下版本
相当于手动实现高版本功能,完美一点,可以做开关和关闭时间可配置化
(1) Jetty 实现:先看一下jetty核心类org.eclipse.jetty.server.Server两个关键的方法 /** * Set a graceful stop time. * The {@link StatisticsHandler} must be configured so that open connections can * be tracked for a graceful shutdown. * @see org.eclipse.jetty.util.component.ContainerLifeCycle#setStopTimeout(long) */ @Override public void setStopTimeout(long stopTimeout) { super.setStopTimeout(stopTimeout); } /** Set stop server at shutdown behaviour. * @param stop If true, this server instance will be explicitly stopped when the * JVM is shutdown. Otherwise the JVM is stopped with the server running. * @see Runtime#addShutdownHook(Thread) * @see ShutdownThread */ public void setStopAtShutdown(boolean stop)
解决方法:Jetty 是用 StatisticsHandler 做请求数量监控,关闭应用时,请求数量从N变为O的过程其实就是优雅关闭。
@Bean
public EmbeddedServletContainerFactory jettyEmbeddedServletContainerFactory() {
JettyEmbeddedServletContainerFactory factory = new JettyEmbeddedServletContainerFactory();
factory.addServerCustomizers(server -> {
server.setStopAtShutdown(false);
StatisticsHandler statisticsHandler = new StatisticsHandler();
statisticsHandler.setHandler(server.getHandler());
server.setHandler(statisticsHandler);
server.setStopTimeout(30000); // 为了简单,写死30s
});
return factory;
}
2021-05-30 12:50:24.410 INFO 1282 --- [ main] .s.b.c.e.j.JettyEmbeddedServletContainer : Jetty started on port(s) 8080 (http/1.1)
2021-05-30 12:50:54.196 INFO 1282 --- [ Thread-13] ationConfigEmbeddedWebApplicationContext : Closing org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext@709ba3fb: startup date [Sun May 30 12:50:23 CST 2021]; root of context hierarchy
2021-05-30 12:50:54.199 INFO 1282 --- [ Thread-13] o.s.j.e.a.AnnotationMBeanExporter : Unregistering JMX-exposed beans on shutdown
qtp548482954-17 finished
2021-05-30 12:51:16.503 INFO 1282 --- [ Thread-13] o.e.jetty.server.AbstractConnector : Stopped ServerConnector@545de5a4{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
2021-05-30 12:51:16.503 INFO 1282 --- [ Thread-13] org.eclipse.jetty.server.session : Stopped scavenging
2021-05-30 12:51:16.505 INFO 1282 --- [ Thread-13] o.e.j.s.h.ContextHandler.application : Destroying Spring FrameworkServlet 'dispatcherServlet'
2021-05-30 12:51:16.505 INFO 1282 --- [ Thread-13] o.e.jetty.server.handler.ContextHandler : Stopped o.s.b.c.e.j.JettyEmbeddedWebAppContext@32f0fba8{/,[file:///private/var/folders/kt/428mmnyd6vj_bgzz5xrfgtfm0000gn/T/jetty-docbase.2324546490925478110.8080/],UNAVAILABLE}
Process finished with exit code 143 (interrupted by signal 15: SIGTERM)
客户端响应
HTTP/1.1 200 OK
Date: Sun, 30 May 2021 04:50:46 GMT
Content-Type: application/json;charset=utf-8
Content-Length: 2
m1
Response code: 200 (OK); Time: 30145ms; Content length: 2 bytes
(2) Tomcat 实现:
@Bean
public EmbeddedServletContainerCustomizer tomcatCustomizer() {
return container -> {
if (container instanceof TomcatEmbeddedServletContainerFactory) {
((TomcatEmbeddedServletContainerFactory) container).addConnectorCustomizers(new GracefulShutdown());
}
};
}
private static class GracefulShutdown implements TomcatConnectorCustomizer,
ApplicationListener<ContextClosedEvent> {
private static final Logger log = LoggerFactory.getLogger(GracefulShutdown.class);
private volatile Connector connector;
@Override
public void customize(Connector connector) {
this.connector = connector;
}
@Override
public void onApplicationEvent(ContextClosedEvent event) {
this.connector.pause();
Executor executor = this.connector.getProtocolHandler().getExecutor();
if (executor instanceof ThreadPoolExecutor) {
try {
ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) executor;
threadPoolExecutor.shutdown();
if (!threadPoolExecutor.awaitTermination(30, TimeUnit.SECONDS)) {
log.warn("Force shutdown ...");
}
} catch (InterruptedException ex) {
Thread.currentThread().interrupt();
}
}
}
}
5. 开发需要注意的事项
- 评估每个服务停机处理完请求需要的下线时间
- 保证应用在下线指定时间内的请求都能正常处理并且响应调用方
- 手动对自定义线程资源优雅下线 业务处理线程资源释放示例:
@Bean
public ExecutorService bizExecutorService() {
ExecutorService executorService = Executors.newFixedThreadPool(10);
// shutdownAndAwaitTermination 可以参考guava线程池优雅关闭
Runtime.getRuntime().addShutdownHook(new Thread(() -> shutdownAndAwaitTermination(executorService, 10l, TimeUnit.SECONDS)));
return executorService;
}
- 对生产发布因未在指定时间内完成的业务请求进行监控,并优化性能
6. 参考资料
github.com/spring-proj…
github.com/eclipse/jet…
github.com/timpeeters/…