2 posts tagged with "Inference"

LLM inference serving and optimization

No Kubernetes? No Problem: llm-d Now Runs Anywhere

· 23 min read
Ezra Silvera
Senior Technical Staff Member, IBM

llm-d was designed as a Kubernetes-native inference stack, and its guides assume you have a cluster handy. However, a large class of inference workloads runs on infrastructure that isn't managed by Kubernetes, and until recently llm-d was not a fit for them.

With the llm-d router's new file-discovery plugin, that changes. llm-d can now run as a plain process or container in any environment, with no dependency on Kubernetes or any other cluster framework. A YAML file lists your endpoints; the router reads it and reconciles changes live. That's the whole interface.

That opens the door to deployments like:

  • HPC clusters running Slurm, where GPU nodes are allocated per-job and there is no cluster API
  • Ray-based training loops (VERL, OpenRLHF) where rollout workers are Ray actors, not pods
  • Bare-metal inference farms provisioned statically
  • Local development on a workstation with one or two GPUs

This post introduces the new endpoint-discovery plugin mechanism in the llm-d router. It then shows how to use llm-d without a Kubernetes cluster by enabling the file-discovery plugin, which reads endpoints from a YAML file on disk. We illustrate this with two concrete examples that generate the endpoints file from a Ray cluster and a Slurm job.
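To make the file-based flow concrete, here is a minimal sketch of a generator that turns a list of hostnames into an endpoints file. The excerpt does not show the file-discovery plugin's schema, so the field names (`endpoints`, `name`, `address`, `port`), the default port, and the helper names `render_endpoints_yaml` and `write_atomically` are all assumptions for illustration; check the plugin's documentation for the real format. The write goes through a temporary file and a rename so that a process watching the file would not observe a half-written document.

```python
import os
import tempfile


def render_endpoints_yaml(hosts, port=8000):
    """Render a HYPOTHETICAL endpoints document as YAML text.

    The keys used here are assumptions, not the plugin's documented schema.
    """
    if not hosts:
        return "endpoints: []\n"
    lines = ["endpoints:"]
    for host in hosts:
        lines.append(f"  - name: {host}")
        lines.append(f"    address: {host}")
        lines.append(f"    port: {port}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write text to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".endpoints-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # os.replace swaps the file in one step on the same filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if the write or the rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

In a Slurm job, the host list could come from `scontrol show hostnames "$SLURM_JOB_NODELIST"`; in a Ray cluster, from whatever addresses the rollout workers report. Either way, regenerating the file whenever the set of workers changes is what lets the router pick up the new endpoints.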

Predicted-Latency Based Scheduling for LLMs

· 28 min read
Kaushik Mitra
Software Engineer, Google
Benjamin Braun
Software Engineer, Google
Abdullah Gharaibeh
Senior Staff Software Engineer, Google
Clayton Coleman
Distinguished Engineer, Google

Not all LLM requests cost the same. A short prompt might complete in milliseconds, while a long one can occupy a GPU for seconds. If we can predict how long a request will take on each candidate server before dispatching it, we can make substantially better routing decisions. This post describes a system that does exactly that: a lightweight ML model trained online from live traffic that replaces manually tuned heuristic weights with direct latency predictions.
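The routing idea can be sketched in a few lines. The excerpt does not say which model the system uses, so the example below substitutes a plain linear regressor updated by stochastic gradient descent; the feature choice (normalized request size and normalized server load), the class name `OnlineLatencyModel`, and the function `pick_server` are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class OnlineLatencyModel:
    """Stand-in linear model: latency ~ w0 + w1 * request_size + w2 * server_load."""

    learning_rate: float = 0.1
    weights: list = field(default_factory=lambda: [0.0, 0.0, 0.0])

    def predict(self, features):
        inputs = [1.0, *features]  # leading 1.0 is the bias term
        return sum(w * x for w, x in zip(self.weights, inputs))

    def update(self, features, observed_latency):
        """One SGD step on squared error, using a latency observed from live traffic."""
        inputs = [1.0, *features]
        error = self.predict(features) - observed_latency
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, inputs)
        ]


def pick_server(model, request_size, server_loads):
    """Return the server name with the lowest predicted latency for this request.

    server_loads maps a server name to its current normalized load (0.0 to 1.0).
    """
    return min(
        server_loads,
        key=lambda name: model.predict((request_size, server_loads[name])),
    )
```

Each completed request feeds `update` with the latency that was actually observed, so the weights track the fleet's behavior instead of being tuned by hand. At dispatch time, `pick_server` scores every candidate with the current weights and takes the minimum.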