References:
https://mp.weixin.qq.com/s/D8efjj9ZhLyEu7zEqWvJiQ
https://stackoverflow.com/questions/71860152/actuator-health-endpoint-returns-out-of-service-when-all-groups-are-up
https://docs.spring.io/spring-boot/docs/2.6.x/reference/htmlsingle/#actuator.endpoints.kubernetes-probes
This article implements zero-downtime releases with K8s + Spring Boot: health checks + rolling updates + graceful shutdown + autoscaling + Prometheus monitoring + configuration separation (image reuse).
Health Checks
- Check types: readiness probe + liveness probe
- Probe mechanisms: exec (run a script inside the container), tcpSocket (probe a port), httpGet (call an HTTP endpoint)
Application Side
Project dependency, pom.xml:
```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
```
Define the management port, base path, and exposed endpoints in application.yaml:
```yaml
management:
  server:
    port: 50000
  endpoint:
    health:
      probes:
        enabled: true
  endpoints:
    web:
      exposure:
        base-path: /actuator
        include: health
```
This exposes the /actuator/health/readiness and /actuator/health/liveness endpoints, accessed as follows:
```
http://127.0.0.1:50000/actuator/health           -> info for all groups
http://127.0.0.1:50000/actuator/health/readiness -> readiness group
http://127.0.0.1:50000/actuator/health/liveness  -> liveness group
```
Operations Side
K8s deployment template, deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: {APP_NAME}
          image: {IMAGE_URL}
          imagePullPolicy: Always
          ports:
            - containerPort: {APP_PORT}
            - name: management-port
              containerPort: 50000
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: management-port
            initialDelaySeconds: 90
            periodSeconds: 30
            timeoutSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: management-port
            initialDelaySeconds: 90
            periodSeconds: 30
            timeoutSeconds: 30
            successThreshold: 1
            failureThreshold: 3
```
Rolling Updates
Use the K8s rolling-update strategy; zero-downtime releases additionally require working health checks:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {APP_NAME}
  labels:
    app: {APP_NAME}
spec:
  selector:
    matchLabels:
      app: {APP_NAME}
  replicas: {REPLICAS}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
```
Graceful Shutdown
In K8s, the goals of graceful shutdown are: stop accepting new traffic, let in-flight work finish, then release resources in order. Below is a directly reusable approach based on the test_graceful_shutdown_web practice.
Core principles
- First make the Pod NotReady (readiness fails) so no new traffic arrives
- Inside the application: stop the entry points first, then wait for in-flight work, then release resources
- Every wait must have a timeout to avoid hanging forever
Spring Boot configuration (application layer)
SmartLifecycle as the unified shutdown entry point
In a Spring Boot application we usually rely on the @PostConstruct and @PreDestroy annotations to run logic when a bean is initialized or destroyed; both hooks operate at the bean lifecycle level. However, some scenarios are not covered, for example when we want to react to lifecycle events of the container itself (container start and stop).
```java
public class GracefulShutdownGate implements SmartLifecycle {
    // ...
}
```
Key methods of the `SmartLifecycle` interface:
1. `isAutoStartup()`: returns whether this lifecycle component should start automatically when the Spring container starts.
2. `getPhase()`: returns an integer defining start order; components with smaller values start earlier.
3. `start()`: starts the component.
4. `stop()`: stops the component.
5. `stop(Runnable callback)`: stops the component and runs the given callback afterwards.
6. `isRunning()`: on application exit, Spring calls `isRunning()` to check whether the component is started; if it returns true, `stop()` is invoked.
Applicable scenarios:
- You need to centrally manage the shutdown order of multiple resources (e.g. Kafka, thread pools, scheduled tasks)
- You need to wait for in-flight business work to finish (e.g. message processing, async tasks, batch jobs)
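A minimal plain-Java sketch of such a gate (class and field names are illustrative, not from the project; in Spring, the `stop` logic below would sit inside `SmartLifecycle.stop()`):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative shutdown gate: reject new work once shutdown begins,
// then wait (with a timeout) for in-flight work to drain.
public class GracefulShutdownGate {
    private final AtomicBoolean accepting = new AtomicBoolean(true);
    private final AtomicInteger inFlight = new AtomicInteger();

    /** Try to enter; returns false once shutdown has begun. */
    public boolean tryEnter() {
        if (!accepting.get()) return false;
        inFlight.incrementAndGet();
        // Re-check to close the race with stop():
        if (!accepting.get()) { inFlight.decrementAndGet(); return false; }
        return true;
    }

    /** Mark one unit of work as finished. */
    public void exit() { inFlight.decrementAndGet(); }

    /** Stop accepting new work, then wait (bounded) for in-flight work. */
    public boolean stop(long timeout, TimeUnit unit) throws InterruptedException {
        accepting.set(false);
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (inFlight.get() > 0) {
            if (System.nanoTime() >= deadline) return false; // timed out
            TimeUnit.MILLISECONDS.sleep(10);
        }
        return true; // drained cleanly
    }
}
```

Every handler (HTTP, Kafka, scheduled task) would wrap its work in `tryEnter()`/`exit()`, which is what makes the bounded wait in `stop` meaningful.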
Graceful shutdown for @Async / @Scheduled
Spring already supports graceful shutdown of async and scheduled tasks:
```properties
spring.task.execution.shutdown.await-termination=true
spring.task.scheduling.shutdown.await-termination=true
```
Applicable scenarios:
- background tasks run via @Async
- periodic tasks run via @Scheduled
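Roughly speaking, these properties make Spring call `shutdown()` and then a bounded `awaitTermination()` on the underlying task executor. The effect can be sketched in plain Java:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of what shutdown.await-termination=true does to the task executor:
// stop accepting new tasks, then block (bounded) until queued tasks finish.
public class AwaitTerminationDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                try { TimeUnit.MILLISECONDS.sleep(100); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }
        pool.shutdown();                       // no new tasks accepted
        boolean done = pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("all tasks finished before timeout: " + done);
    }
}
```

The related `shutdown.await-termination-period` property bounds the wait, matching the "every wait must have a timeout" principle above.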
Graceful shutdown for HTTP requests
```yaml
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
server:
  shutdown: graceful
```
The HTTP layer is handled by Spring Boot's built-in graceful shutdown, which behaves as follows:
- New requests are no longer accepted (requests already in progress keep executing during shutdown)
- The process exits after existing requests complete (async requests follow the @Async await-termination settings)
Handling Kafka during graceful shutdown
The project uses Spring Cloud Stream; a sample consumer:
```java
@Bean
public Consumer<Message<KafkaMessageModel>> testConsumer() {
    return message -> {
        Acknowledgment ack = message.getHeaders()
                .get(KafkaHeaders.ACKNOWLEDGMENT, Acknowledgment.class);
        if (ack != null) {
            ack.acknowledge();
        }
    };
}
```
Key points:
- Manual ack: commit the offset only after processing succeeds, avoiding "acknowledged but not yet processed"
- Shutdown order: stop producers first, then consumers
- Waiting for in-flight work: MainJobState records whether a message is currently being processed, and a monitor class waits for completion
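The source does not show MainJobState itself; a plausible sketch of it, assuming it is simply an in-flight counter with a bounded wait:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical MainJobState: count messages currently being processed,
// and let the shutdown monitor wait until the count reaches zero
// or a deadline passes.
public class MainJobState {
    private static final AtomicInteger PROCESSING = new AtomicInteger();

    public static void begin() { PROCESSING.incrementAndGet(); }
    public static void end()   { PROCESSING.decrementAndGet(); }

    /** Wait (bounded) until no message is in flight. */
    public static boolean awaitIdle(long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (PROCESSING.get() > 0) {
            if (System.nanoTime() >= deadline) return false; // timed out
            TimeUnit.MILLISECONDS.sleep(10);
        }
        return true;
    }
}
```

In the consumer, the message handling (including the ack) would be wrapped in `begin()`/`end()` inside a try/finally, so the in-flight window covers the whole processing path.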
Configuration example (manual ack):
```properties
spring.cloud.stream.kafka.bindings.testConsumer-in-0.consumer.ack-mode=manual
```
K8s configuration (platform layer)
```yaml
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: {APP_NAME}
      lifecycle:
        preStop:
          exec:
            command: ["curl", "-XPOST", "127.0.0.1:50000/actuator/shutdown"]
```
Recommended shutdown sequence
- readiness turns false (K8s stops routing traffic)
- the application triggers its shutdown gate
- stop producers (stop sending/publishing)
- wait for in-flight consumption/tasks to finish
- stop consumers / release resources
Autoscaling
After setting resource limits on the pod, create an HPA:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {APP_NAME}
  labels:
    app: {APP_NAME}
spec:
  template:
    spec:
      containers:
        - name: {APP_NAME}
          image: {IMAGE_URL}
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 0.5
              memory: 1Gi
            requests:
              cpu: 0.15
              memory: 300Mi
---
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta2
metadata:
  name: {APP_NAME}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {APP_NAME}
  minReplicas: {REPLICAS}
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```
Prometheus Integration
Application Side
Project dependencies, pom.xml:
```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
Define the management port, base path, and exposed endpoints in application.yaml:
```yaml
management:
  server:
    port: 50000
  metrics:
    tags:
      application: ${spring.application.name}
  endpoints:
    web:
      exposure:
        base-path: /actuator
        include: metrics,prometheus
```
This exposes the /actuator/metrics and /actuator/prometheus endpoints, accessed as follows:
```
http://127.0.0.1:50000/actuator/metrics
http://127.0.0.1:50000/actuator/prometheus
```
Operations Side
deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        prometheus.io/port: "50000"
        prometheus.io/path: /actuator/prometheus
        prometheus.io/scrape: "true"
```
Configuration Separation
Approach: mount external configuration files through a ConfigMap and select the active profile at startup.
Benefits: configuration is separated from the image, avoiding leaking sensitive values; images are reusable across environments, improving delivery efficiency.
Generate the ConfigMap from a file
```shell
kubectl create cm -n <namespace> <APP_NAME> --from-file=application-test.yaml \
  --dry-run=client -o yaml > configmap.yaml
kubectl apply -f configmap.yaml
```
Mount the ConfigMap and select the active profile
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {APP_NAME}
  labels:
    app: {APP_NAME}
spec:
  template:
    spec:
      containers:
        - name: {APP_NAME}
          image: {IMAGE_URL}
          imagePullPolicy: Always
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: test
          volumeMounts:
            - name: conf
              mountPath: "/app/config"
              readOnly: true
      volumes:
        - name: conf
          configMap:
            name: {APP_NAME}
```
Combined Configuration
Application Side
Project dependencies, pom.xml:
```xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```
Define the ports, paths, and exposed endpoints in application.yaml:
```yaml
spring:
  application:
    name: project-sample
  profiles:
    active: @profileActive@
  lifecycle:
    timeout-per-shutdown-phase: 30s

server:
  port: 8080
  shutdown: graceful

management:
  server:
    port: 50000
  metrics:
    tags:
      application: ${spring.application.name}
  endpoint:
    shutdown:
      enabled: true
    health:
      probes:
        enabled: true
  endpoints:
    web:
      exposure:
        base-path: /actuator
        include: health,shutdown,metrics,prometheus
```
Operations Side
Make sure the Dockerfile template installs the curl tool; otherwise the preStop hook's curl command will fail.
```dockerfile
FROM openjdk:8-jdk-alpine

ARG JAR_FILE
ARG WORK_PATH="/app"
ARG EXPOSE_PORT=8080

ENV JAVA_OPTS="" \
    JAR_FILE=${JAR_FILE}

RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo 'Asia/Shanghai' > /etc/timezone
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.ustc.edu.cn/g' /etc/apk/repositories \
    && apk add --no-cache curl

COPY target/$JAR_FILE $WORK_PATH/

WORKDIR $WORK_PATH

EXPOSE $EXPOSE_PORT

ENTRYPOINT exec java $JAVA_OPTS -jar $JAR_FILE
```
K8s deployment template, deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {APP_NAME}
  labels:
    app: {APP_NAME}
spec:
  selector:
    matchLabels:
      app: {APP_NAME}
  replicas: {REPLICAS}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      name: {APP_NAME}
      labels:
        app: {APP_NAME}
      annotations:
        timestamp: {TIMESTAMP}
        prometheus.io/port: "50000"
        prometheus.io/path: /actuator/prometheus
        prometheus.io/scrape: "true"
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - {APP_NAME}
                topologyKey: "kubernetes.io/hostname"
      terminationGracePeriodSeconds: 30
      containers:
        - name: {APP_NAME}
          image: {IMAGE_URL}
          imagePullPolicy: Always
          ports:
            - containerPort: {APP_PORT}
            - name: management-port
              containerPort: 50000
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: management-port
            initialDelaySeconds: 90
            periodSeconds: 30
            timeoutSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: management-port
            initialDelaySeconds: 90
            periodSeconds: 30
            timeoutSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          resources:
            limits:
              cpu: 0.5
              memory: 1Gi
            requests:
              cpu: 0.1
              memory: 200Mi
          env:
            - name: TZ
              value: Asia/Shanghai
---
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta2
metadata:
  name: {APP_NAME}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {APP_NAME}
  minReplicas: {REPLICAS}
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```
Pitfall
The program contained code that ran while(true){...} inside CommandLineRunner.run, looping forever.
This caused a problem: the application could never start up or stop normally, and the /readiness health-check endpoint always returned status 503.
Fix: run the while(true) loop in a separate worker thread so that CommandLineRunner.run can return.
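A minimal sketch of the fix (class and method names are illustrative; in the real project, `start()` would be the body of `CommandLineRunner.run`):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Move the endless loop onto a background thread so the caller
// (e.g. CommandLineRunner.run) returns and startup can complete,
// letting the readiness probe turn UP.
public class BackgroundLoopDemo {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private Thread worker;

    /** Equivalent of CommandLineRunner.run: returns immediately. */
    public void start() {
        worker = new Thread(() -> {
            while (running.get()) {              // was: while (true)
                try { TimeUnit.MILLISECONDS.sleep(50); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
            }
        }, "background-loop");
        worker.setDaemon(true);                  // do not block JVM exit
        worker.start();
    }

    /** Called on shutdown (e.g. from @PreDestroy or SmartLifecycle.stop). */
    public void stop() throws InterruptedException {
        running.set(false);
        worker.join(1000);                       // bounded wait for the loop to exit
    }

    public boolean isLoopAlive() {
        return worker != null && worker.isAlive();
    }
}
```

The loop checks an `AtomicBoolean` instead of a literal `true`, so graceful shutdown can also stop it cleanly.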