Spin up a vLLM server on HF Jobs with one command

Do you want to try a model in the cloud without setting up complex infrastructure? With HF Jobs and the official vLLM image you can spin up an OpenAI-compatible server with a single command, run tests, evaluations, or batch generation, and then shut it down when you’re done. Sounds good? Let’s go step by step.

What this does and who it’s for

Start a vLLM server in minutes that responds like the OpenAI API.
Ideal for quick tests, evaluations, batch generation, or validating models before moving to production.
It’s not the same as a managed endpoint: if you need a long-lived service with scale-to-zero, consider Inference Endpoints.

Quick requirements

An active payment method or positive prepaid balance (Jobs charges for hardware time).
huggingface_hub >= 1.20.0.
Be logged in locally:

pip install -U "huggingface_hub>=1.20.0"
hf auth login

Minimum command: one line to bring the server up

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000
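
If you launch servers for several models, it can help to assemble the command programmatically. The sketch below only rearranges the flags from the command above. `build_serve_command` is a hypothetical helper name, not part of the `hf` CLI or of `huggingface_hub`.

```python
import shlex


def build_serve_command(model, flavor="a10g-large", port=8000, timeout="2h",
                        image="vllm/vllm-openai:latest", extra_vllm_args=()):
    """Return the `hf jobs run ...` argument list used in this article.

    Hypothetical helper: it only reuses flags that appear in the command above.
    """
    return [
        "hf", "jobs", "run",
        "--flavor", flavor,
        "--expose", str(port),
        "--timeout", timeout,
        image,
        "vllm", "serve", model,
        "--host", "0.0.0.0",
        "--port", str(port),
        *extra_vllm_args,  # e.g. ["--max-model-len", "32768"]
    ]


# Print a copy-pasteable shell line; subprocess.run(cmd) would launch it.
cmd = build_serve_command("Qwen/Qwen3-4B")
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting surprises when a model name or extra flag contains unusual characters.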

Test the API: curl and Python

curl https://<job_id>--8000.hf.jobs/v1/chat/completions \
  -H "Authorization: Bearer $(hf auth token)" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(
  base_url="https://<job_id>--8000.hf.jobs/v1",
  api_key=get_token(),
)

resp = client.chat.completions.create(
  model="Qwen/Qwen3-4B",
  messages=[{"role": "user", "content": "Hello!"}],
  extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)

curl https://<job_id>--8000.hf.jobs/v1/models -H "Authorization: Bearer $(hf auth token)"
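
The server may not answer right away while the image is pulled and the model loads, so the models endpoint doubles as a readiness probe. This is a minimal sketch, assuming the URL pattern shown above and the usual OpenAI-style `{"data": [{"id": ...}]}` response. `job_base_url`, `list_models`, and `wait_until_ready` are illustrative names, and the retry count and delay are arbitrary defaults.

```python
import json
import time
import urllib.request


def job_base_url(job_id, port=8000):
    """Build the base URL following the pattern used in this article."""
    return f"https://{job_id}--{port}.hf.jobs/v1"


def list_models(base_url, token, timeout=10.0):
    """Return the model ids reported by GET {base_url}/models."""
    request = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [model["id"] for model in payload.get("data", [])]


def wait_until_ready(base_url, token, attempts=60, delay=10.0,
                     fetch=list_models, sleep=time.sleep):
    """Poll the models endpoint until it lists at least one model."""
    for attempt in range(attempts):
        try:
            models = fetch(base_url, token)
            if models:
                return models
        except (OSError, ValueError):
            # Connection errors, HTTP errors, and bad JSON all mean "not ready yet".
            pass
        if attempt < attempts - 1:
            sleep(delay)
    raise TimeoutError(f"{base_url} did not become ready after {attempts} attempts")
```

Call it as `wait_until_ready(job_base_url("<job_id>"), get_token())` before sending real traffic. The `fetch` and `sleep` parameters exist only so the loop can be tested without a network.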

Stop the job and costs

hf jobs cancel <job_id>
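
Because Jobs bills for hardware time, the `--timeout` value should act as a ceiling on what a forgotten job can cost. The sketch below turns a timeout string into that upper bound. Both the suffix grammar (`s`, `m`, `h`, `d`) and the hourly rate are assumptions: look up the current price for your flavor and pass it in yourself.

```python
import re

# Assumed duration suffixes; the article itself only shows "2h".
_SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a string such as "2h" or "90m" into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _SECONDS_PER_UNIT[unit]


def max_cost_usd(timeout, hourly_rate_usd):
    """Upper bound on cost if the job runs until its timeout.

    hourly_rate_usd is a placeholder you supply; no real price is assumed here.
    """
    return parse_duration(timeout) / 3600 * hourly_rate_usd
```

For example, `max_cost_usd("2h", hourly_rate_usd=1.5)` evaluates to `3.0` with that made-up rate. Cancelling as soon as you are done keeps the real bill below the bound.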

Scaling to large models and key flags

hf jobs run --flavor h200x2 --expose 8000 --timeout 2h \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3.5-122B-A10B \
  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
  --max-model-len 32768 --max-num-seqs 256
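
A quick back-of-the-envelope check shows why the larger model needs `--tensor-parallel-size 2`. The sketch assumes the parameter count can be read off the model name (122B), 2 bytes per parameter (bf16 weights), and 141 GB of memory per H200. It ignores the KV cache, activations, and runtime overhead, so treat the result as a rough lower bound on what you need.

```python
def weight_memory_gb(num_params, bytes_per_param=2.0):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_gpu_headroom_gb(num_params, num_gpus, gpu_memory_gb, bytes_per_param=2.0):
    """Memory left on each GPU after sharding the weights evenly across GPUs."""
    per_gpu_weights = weight_memory_gb(num_params, bytes_per_param) / num_gpus
    return gpu_memory_gb - per_gpu_weights


# Assumed figures: 122e9 parameters, bf16, two GPUs with 141 GB each.
total = weight_memory_gb(122e9)
headroom = per_gpu_headroom_gb(122e9, num_gpus=2, gpu_memory_gb=141)
print(f"weights: {total:.0f} GB, headroom per GPU: {headroom:.0f} GB")
```

Under these assumptions the weights alone exceed a single GPU, and the remaining headroom per GPU is what `--max-model-len` and `--max-num-seqs` have to fit into. A quantized checkpoint would change `bytes_per_param` and the whole picture.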

Chat UI with Gradio and reasoning

import gradio as gr
from gradio import ChatMessage
from huggingface_hub import get_token
from openai import OpenAI

client = OpenAI(base_url="https://<job_id>--8000.hf.jobs/v1", api_key=get_token())

def chat(message, history):
  # Rebuild the conversation, skipping "thinking" bubbles (they carry metadata).
  messages = [{"role": m["role"], "content": m["content"]} for m in history if not m.get("metadata")]
  messages.append({"role": "user", "content": message})
  stream = client.chat.completions.create(model="Qwen/Qwen3-4B", messages=messages, stream=True)
  thinking, answer = "", ""
  for chunk in stream:
    if not chunk.choices:
      continue  # some chunks may carry no choices
    delta = chunk.choices[0].delta
    # The reasoning text arrives in an extra field that can be missing or None.
    thinking += (delta.model_extra or {}).get("reasoning") or ""
    answer += delta.content or ""
    out = []
    if thinking.strip():
      # "pending" while the model is still thinking, "done" once the answer starts.
      status = "done" if answer.strip() else "pending"
      out.append(ChatMessage(role="assistant", content=thinking, metadata={"title": "💭 Thinking", "status": status}))
    if answer.strip():
      out.append(ChatMessage(role="assistant", content=answer))
    # Yield inside the loop so the UI updates as tokens arrive.
    if out:
      yield out

gr.ChatInterface(chat).launch()
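
The core of the chat function is separating reasoning text from answer text as deltas arrive. Here is that logic isolated as a pure function so it can be checked without a server. The tuple-based input is a simplification for illustration, and the `reasoning_content` fallback is an assumption about field naming in some vLLM versions; the code above only relies on `reasoning`.

```python
def split_delta(content, extra):
    """Return (reasoning_text, answer_text) for one streamed delta.

    `extra` stands in for delta.model_extra and may be None or lack the key.
    """
    extra = extra or {}
    # Assumption: fall back to "reasoning_content" if "reasoning" is absent.
    reasoning = extra.get("reasoning") or extra.get("reasoning_content") or ""
    return reasoning, content or ""


def accumulate(deltas):
    """Fold a sequence of (content, extra) pairs into (thinking, answer)."""
    thinking, answer = "", ""
    for content, extra in deltas:
        reasoning_piece, answer_piece = split_delta(content, extra)
        thinking += reasoning_piece
        answer += answer_piece
    return thinking, answer


# Simulated stream: two reasoning deltas followed by two answer deltas.
stream = [
    (None, {"reasoning": "Let me "}),
    (None, {"reasoning": "think."}),
    ("Hi", None),
    ("!", {}),
]
print(accumulate(stream))
```

Treating every missing or `None` field as an empty string keeps the accumulation from raising a `TypeError` partway through a stream.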

SSH, debugging and GPU monitoring

hf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh \
  vllm/vllm-openai:latest \
  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

hf jobs ssh <job_id>
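
Once you are inside the job, `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits` prints one CSV line per GPU. The sketch below parses that output into dictionaries for logging. `parse_gpu_stats` is an illustrative name, and the sample numbers are made up, not measurements from a real job.

```python
def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi ... --format=csv,noheader,nounits` output.

    Expects the columns: index, utilization.gpu, memory.used, memory.total.
    """
    stats = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        index, util, used, total = (field.strip() for field in line.split(","))
        used_mib, total_mib = float(used), float(total)
        stats.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_used_mib": used_mib,
            "mem_total_mib": total_mib,
            "mem_used_pct": round(100 * used_mib / total_mib, 1),
        })
    return stats


# Illustrative sample only; real values depend on the model and load.
sample = "0, 87, 12000, 24000\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

Inside the SSH session you could feed it the result of `subprocess.run([...], capture_output=True, text=True).stdout` on a timer to watch memory while requests are in flight.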

Integrating agents (example: Pi) and tool calling

--enable-auto-tool-choice --tool-call-parser hermes
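
With those flags added to `vllm serve`, the client side can follow the standard OpenAI tool-calling shape. This is a minimal sketch of the local half: declaring a tool and executing the call the model asks for. `get_weather`, `TOOLS`, `REGISTRY`, `run_tool_call`, and `tool_result_message` are hypothetical names for illustration, and the weather function returns a canned string.

```python
import json


def get_weather(city):
    """Hypothetical tool: a real one would query a weather service."""
    return f"Sunny in {city}"


# Tool schema in the OpenAI "function" format, passed as tools=TOOLS.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_call(name, arguments_json, registry=REGISTRY):
    """Execute one tool call; arguments arrive as a JSON string."""
    if name not in registry:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json or "{}")
    return registry[name](**kwargs)


def tool_result_message(tool_call_id, result):
    """Build the follow-up message that returns the tool output to the model."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": str(result)}


# With the OpenAI client from earlier in the article, the loop would look like:
#   resp = client.chat.completions.create(model=..., messages=messages, tools=TOOLS)
#   for call in resp.choices[0].message.tool_calls or []:
#       result = run_tool_call(call.function.name, call.function.arguments)
#       messages.append(tool_result_message(call.id, result))
```

Looking tools up in a registry, not with `eval` or `getattr`, means the model can only trigger functions you explicitly listed.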

