2 posts tagged with "Inference"

No Kubernetes? No Problem: llm-d Now Runs Anywhere

May 26, 2026 · 23 min read

Senior Technical Staff Member, IBM

llm-d was designed as a Kubernetes-native inference stack, and its guides assume you have a cluster handy. However, a large class of inference workloads runs on infrastructure that isn't managed by Kubernetes, and until recently llm-d was not a fit for them.

With the llm-d router's new file-discovery plugin, that changes. llm-d can now run as a plain process or container in any environment, with no dependency on Kubernetes or any other cluster framework. A YAML file lists your endpoints; the router reads it and reconciles changes live. That's the whole interface.

That opens the door to deployments like:

HPC clusters running Slurm, where GPU nodes are allocated per-job and there is no cluster API
Ray-based training loops (VERL, OpenRLHF) where rollout workers are Ray actors, not pods
Bare-metal inference farms provisioned statically
Local development on a workstation with one or two GPUs

This post introduces the new endpoint-discovery plugin mechanism in the llm-d router. It then shows how to use llm-d without a Kubernetes cluster by enabling the file-discovery plugin, which reads endpoints from a YAML file on disk. We illustrate this with two concrete examples that generate the endpoints file from a Ray cluster and a Slurm job.

Predicted-Latency Based Scheduling for LLMs

March 13, 2026 · 28 min read

Kaushik Mitra

Software Engineer, Google

Benjamin Braun

Software Engineer, Google

Abdullah Gharaibeh

Senior Staff Software Engineer, Google

Clayton Coleman

Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.