KubeCon North America 2025 Review

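KubeCon North America 2025 wrapped up on the 13th, and some of the session materials are now on the official site. Here are notes on a few topics I found interesting.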

Dynamic Routing with Multi-Cluster Inference Gateway

https://kccncna2025.sched.com/event/27FeP/ai-inference-without-boundaries-dynamic-routing-with-multi-cluster-inference-gateway-rob-scott-google-daneyon-hansen-soloio?iframe=no&w=100%&sidebar=yes&bg=no

I happen to be working on an AI gateway at the moment, so this talk came at exactly the right time.

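Compared with traditional API traffic, inference traffic differs substantially in payload size, response time, and backend resource cost (see the figure below). 2781cd595fe30c8da311d124a8e4ee35_MD5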

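An inference gateway therefore needs backend-load-aware scheduling. (Traditional API gateways have similar schemes: when backend machine types differ and plain round-robin cannot spread load evenly, weights are adjusted dynamically based on backend load.) An EPP (Endpoint Picker) component sits between the Gateway and the inference instances and collects metrics from the instances to choose an inference backend dynamically. (Note: EPP is now also a common component for inference serving on Kubernetes.) 3462773fb295a9dabdabc886b1de43ca_MD5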

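The benchmark data shows that, compared with traditional load balancing, the inference gateway spreads load more evenly across inference instances and reduces request queuing, which lowers response time.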
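
To make the idea of load-aware endpoint picking concrete, here is a minimal Python sketch. It is not the EPP implementation: the `Endpoint` fields, the `max_queue` cap, and the equal weighting are assumptions chosen for illustration, using the kind of per-instance metrics (queue depth, KV-cache utilization) an endpoint picker might collect.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    queue_depth: int       # requests waiting on this replica (hypothetical metric)
    kv_cache_util: float   # fraction of KV-cache in use, 0.0 to 1.0 (hypothetical metric)


def load_score(ep, max_queue=32):
    """Higher is better: 1.0 means idle, 0.0 means saturated."""
    queue_score = 1.0 - min(ep.queue_depth, max_queue) / max_queue
    cache_score = 1.0 - ep.kv_cache_util
    # Equal weights are an arbitrary choice for this sketch.
    return 0.5 * queue_score + 0.5 * cache_score


def pick_endpoint(endpoints):
    """Return the endpoint with the best (highest) load score."""
    return max(endpoints, key=load_score)


if __name__ == "__main__":
    replicas = [
        Endpoint("replica-a", queue_depth=10, kv_cache_util=0.9),
        Endpoint("replica-b", queue_depth=2, kv_cache_util=0.3),
    ]
    print(pick_endpoint(replicas).name)
```

Round-robin would alternate between the two replicas regardless of their state, while this picker keeps steering new requests to the less loaded one until the metrics even out.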

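In a multi-cluster setting, this design has to solve three problems:

Service discovery: how are Cluster Inference Services exposed to the Gateway?
Backend selection: how does the Gateway distribute traffic across clusters?
Routing mode: how is traffic forwarded from the Gateway to a cluster?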

For the first problem, the speakers proposed three solutions: fd7a6ffacbfd58fcd6482dd40b861461_MD5

This was not my focus, so I will skip it.

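For the second problem, simple RR or Active-Passive would throw away the load-awareness advantage of an inference gateway. So beyond the EPP being load-aware, the Gateway has to be load-aware as well. The speakers proposed two approaches: 1d3df4bac299f39b0c917b886bab2a27_MD5 f4bd518bc6dce2999d5805d5b2d46dac_MD5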

In terms of layering, the EPP Aggregate Metrics approach is the simpler one, since a second round of scheduling still has to happen at the EPP anyway.
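
The two-level idea (the Gateway picks a cluster from aggregated metrics, then that cluster's EPP picks an instance) can be sketched roughly as follows. This is only an illustration under my own assumptions: the slides define the actual metrics, and the cluster names, the choice of queue depth as the signal, and the `aggregate` / `pick_cluster` helpers are all hypothetical.

```python
from statistics import mean


def aggregate(queue_depths):
    """Summarize one cluster's per-endpoint queue depths into cluster-level metrics."""
    return {
        "avg_queue_depth": mean(queue_depths),
        "min_queue_depth": min(queue_depths),
    }


def pick_cluster(clusters):
    """Pick the cluster with the lowest average queue depth.

    clusters maps a cluster name to the list of queue depths that the
    cluster's EPP observed on its inference instances (assumed input shape).
    """
    summaries = {name: aggregate(depths) for name, depths in clusters.items()}
    return min(summaries, key=lambda name: summaries[name]["avg_queue_depth"])


if __name__ == "__main__":
    observed = {
        "cluster-east": [8, 10, 12],
        "cluster-west": [1, 2, 3],
    }
    print(pick_cluster(observed))
```

The Gateway only sees one summary per cluster, so the instance-level decision still belongs to the EPP inside the chosen cluster.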

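For the last problem, if the EPP can be reached directly across clusters, direct routing is the best fit. If not, adding another gateway layer and exposing the cluster through a Cluster-Local Gateway also works.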

20fff68d9bf08d58b4ce101cde13febf_MD5

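To sum up, the main idea of the talk is to use load awareness to improve the performance of inference gateway serving, and it proposes several ways to solve the problems that arise when moving a single-cluster service to multiple clusters. Note, however, that overall system complexity goes up too. Keeping the system stable and keeping the multi-layer load awareness accurate are both things to watch in real engineering practice.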

Routing Stateful AI Workloads in Kubernetes

https://kccncna2025.sched.com/event/27FX6/routing-stateful-ai-workloads-in-kubernetes-maroon-ayoub-ibm-michey-mehta-red-hat?iframe=no&w=100%&sidebar=yes&bg=no

This talk is also about traffic scheduling for inference services, and it introduces llm-d. It has three key points:

AI Workloads Are Stateful
K8s Networking is blind to AI
llm-d makes Kubernetes AI-Aware

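Unlike the previous talk, which focused on load, this one focuses on how the KV cache affects traffic scheduling for inference services. Prefix Caching uses the cache to skip the expensive prefill, raising overall throughput and lowering TTFT (time to first token). 6698927efd73835c65bc5d919060eb74_MD5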

So an inference service is, in effect, a stateful service.

“The KV-cache hit rate is the single most important metric for a production-stage AI agent. It directly affects both latency and cost.”

Manus, Context Engineering for AI Agents

OpenAI GPT-5: $1.25 per 1M input tokens vs $0.125 per 1M cached input tokens
Anthropic Claude Sonnet: $3 per 1M input tokens vs $0.3 per 1M cached input tokens

Traditional Kubernetes traffic scheduling (Round-Robin, Load-Aware) cannot do Prefix-Cache Aware Scheduling, so it naturally cannot guarantee the cache hit rate.

6dff4326f7e1c4a0f4b54820dd561428_MD5

Performance comparison of the precise prefix cache-aware scheduler and the load-aware scheduler in a multi-turn conversation scenario:

6eea94533d879fbf42f6263cd883266b_MD5

How llm-d implements precise prefix cache-aware scheduling:

For each request:

Query the kvcache.Index
Calculate prefix-cache affinity score per pod:
1. % of prefix already cached
2. Represents saved computational work
Combine with load-aware scores:
1. vLLM queue depth
2. KV-cache utilization
3. Others
Route to maximize cache hits while balancing load

This achieves the highest cache affinity possible without creating hot spots.
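
The per-request steps above can be sketched in Python as follows. This is a simplified illustration, not llm-d's code: a per-pod set of cached block hashes stands in for the `kvcache.Index`, and the `Pod` fields, the `w_cache` / `w_load` weights, and the `max_queue` cap are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cached_blocks: set     # hashes of prompt blocks held in this pod's KV-cache (stand-in for the index)
    queue_depth: int       # vLLM queue depth
    kv_cache_util: float   # KV-cache utilization, 0.0 to 1.0


def prefix_affinity(prompt_blocks, pod):
    """Fraction of the prompt's leading blocks already cached on the pod."""
    hits = 0
    for block in prompt_blocks:
        if block not in pod.cached_blocks:
            break  # a prefix match ends at the first missing block
        hits += 1
    return hits / len(prompt_blocks) if prompt_blocks else 0.0


def load_score(pod, max_queue=32):
    """Higher is better: 1.0 means idle, 0.0 means saturated."""
    queue_score = 1.0 - min(pod.queue_depth, max_queue) / max_queue
    return 0.5 * queue_score + 0.5 * (1.0 - pod.kv_cache_util)


def route(prompt_blocks, pods, w_cache=0.6, w_load=0.4):
    """Pick the pod with the best weighted mix of cache affinity and load."""
    def total(pod):
        return w_cache * prefix_affinity(prompt_blocks, pod) + w_load * load_score(pod)
    return max(pods, key=total)


if __name__ == "__main__":
    prompt = ["b1", "b2", "b3", "b4"]
    pods = [
        Pod("warm", {"b1", "b2", "b3"}, queue_depth=4, kv_cache_util=0.5),
        Pod("cold", set(), queue_depth=0, kv_cache_util=0.1),
    ]
    print(route(prompt, pods).name)
```

With these weights, a pod that holds most of the prefix wins even when it is moderately busy, but a saturated pod with only a partial prefix loses to an idle one, which is the hot-spot protection described above.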

The architecture diagram from the llm-d repository:

86eb979b9afc246cec0f4f03845c54d3_MD5

https://github.com/llm-d/llm-d-inference-scheduler/blob/v0.3.2/docs/architecture.md

Scaling and Securing CoreDNS: Performance and Resilience

https://kccncna2025.sched.com/event/27Nn5/scaling-and-securing-coredns-performance-and-resilience-yong-tang-datadirect-networks-john-belamaric-google?iframe=no&w=100%&sidebar=yes&bg=no

The talk covered new features in CoreDNS 1.12.0 through 1.13.1. The main ones are:

multisocket plugin

The core idea is to use reuseport to improve how a single CoreDNS instance scales across multiple cores.

![603f221fb43550779e2026ad6ba755d7_MD5](01-计算机/03 - Kubernetes/99 - Conference/_assets/603f221fb43550779e2026ad6ba755d7_MD5.jpeg)

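The community has been discussing CoreDNS's poor multi-core performance since an issue opened in 2022: https://github.com/coredns/coredns/issues/5595 . One advantage of containerization, though, is that you can work around the problem by running more, smaller instances (HPA).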
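
To show the kernel feature that multisocket builds on, here is a small Python sketch that opens several UDP sockets on the same port with `SO_REUSEPORT`. CoreDNS itself is written in Go, so this is not its code; the helper name `open_reuseport_sockets` is made up, and the sketch assumes a platform such as Linux where `SO_REUSEPORT` is available.

```python
import socket


def open_reuseport_sockets(count, host="127.0.0.1", port=0):
    """Open `count` UDP sockets that all listen on the same address and port."""
    socks = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # SO_REUSEPORT must be set before bind() on every socket sharing the port.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        # After the first bind, reuse whatever port the kernel assigned.
        port = s.getsockname()[1]
        socks.append(s)
    return socks


if __name__ == "__main__":
    listeners = open_reuseport_sockets(4)
    print("shared port:", listeners[0].getsockname()[1])
    for s in listeners:
        s.close()
```

On Linux the kernel spreads incoming packets across the sockets bound this way, so each socket can be served by its own worker instead of all workers contending on a single socket.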

kubernetes plugin multicluster support

multicluster defines the multicluster zones as defined by Multi-Cluster Services API (MCS-API). Specifying this option is generally paired with the installation of an MCS-API implementation and the ServiceImport and ServiceExport CRDs. The plugin MUST be authoritative for the zones listed here.

Every Kubernetes Pod Eviction Path Explained

https://kccncna2025.sched.com/event/27Fdd/evicted-all-the-ways-kubernetes-kills-your-pods-and-how-to-avoid-them-ahmet-alp-balkan-linkedin?iframe=no&w=100%&sidebar=yes&bg=no

Summed up in one figure:

3e0d29adf5c5ea826ffb5ce6677deef7_MD5

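About PDB:

A Kubernetes Pod Disruption Budget (PDB) is a Kubernetes API object. Its main purpose is to ensure that a given application always has a certain number of Pods running while the cluster performs voluntary disruptions, avoiding service interruption or degraded performance.

Put simply, a PDB sets a "budget" for your application: at any moment, at most how many Pods may be down at once due to voluntary disruptions, or equivalently, at least how many Pods must stay running.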

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 70% # at least 70% of the Pods must stay running
  selector:
    matchLabels:
      app: my-app
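
As a worked example of what this budget means in numbers, the Python sketch below computes how many Pods may be voluntarily evicted at once. It is arithmetic only, not the disruption controller's code: the function names are made up, and it relies on the documented behavior that a percentage `minAvailable` is rounded up to a whole number of Pods.

```python
def min_available_pods(expected_pods, min_available_percent):
    """Pods that must stay available; the percentage is rounded up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-expected_pods * min_available_percent // 100)


def allowed_disruptions(healthy_pods, expected_pods, min_available_percent):
    """How many healthy Pods can be evicted right now without breaking the budget."""
    required = min_available_pods(expected_pods, min_available_percent)
    return max(0, healthy_pods - required)


if __name__ == "__main__":
    # 10 replicas, all healthy, minAvailable: 70%
    print(allowed_disruptions(healthy_pods=10, expected_pods=10, min_available_percent=70))
    # Same budget, but only 7 replicas are currently healthy
    print(allowed_disruptions(healthy_pods=7, expected_pods=10, min_available_percent=70))
```

With 10 healthy replicas, 7 must stay up, so 3 can be evicted at once. If only 7 are healthy, the budget is already exhausted and further voluntary evictions are refused until Pods recover.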