Spin up a vLLM server on HF Jobs with one command | Keryc
Do you want to try a model in the cloud without setting up complex infrastructure? With HF Jobs and the official vLLM image you can spin up an OpenAI-compatible server with a single command, run tests, evaluations, or batch generation, and then shut it down when you’re done. Sounds good? Let’s go step by step.
What this does and who it’s for
Start, in minutes, a vLLM server that exposes an OpenAI-compatible API.
Ideal for quick tests, evaluations, batch generation, or validating models before moving to production.
It’s not the same as a managed endpoint: if you need a long-lived service with scale-to-zero, consider Inference Endpoints.
Quick requirements
An active payment method or positive prepaid balance (Jobs charges for hardware time).
Important security note: the exposed port is protected. Every request must carry an HF token with read access to the job’s namespace. Don’t share the URL or token in untrusted places.
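To make the token requirement concrete, here is a minimal sketch of an authenticated request. It assumes the token travels as a standard `Authorization: Bearer` header, which is what the OpenAI client sends when given an `api_key`, and that the server exposes the usual OpenAI-compatible `/v1/models` route. The helper name and the job ID `my-job-id` are hypothetical placeholders.

```python
import os
import urllib.request


def build_models_request(job_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the job's /v1/models route.

    Hypothetical helper: the URL pattern follows the one used in this article,
    and the Bearer header is an assumption based on how OpenAI clients send keys.
    """
    url = f"https://{job_id}--8000.hf.jobs/v1/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Read the token from the environment instead of hard-coding it in scripts.
# "hf_xxx" is a dummy fallback so the sketch runs without a real token.
token = os.environ.get("HF_TOKEN", "hf_xxx")
req = build_models_request("my-job-id", token)
print(req.full_url)

# To actually send it against a running job:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(resp.read().decode())
```

Reading the token from an environment variable keeps it out of shell history and shared notebooks, in line with the note above.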
Stop the job and costs
Jobs are billed by the second. Cancel the job when you’re done:
hf jobs cancel <job_id>
The --timeout flag is a safety net, but cancelling manually is cheaper. For example, an a10g-large runs at about 1.50 USD/hour; check hf jobs hardware for pricing and pick the smallest flavor that works for your model.
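As a rough illustration of why cancelling early matters, the arithmetic can be sketched like this. It assumes the approximate 1.50 USD/hour a10g-large rate quoted above and simple pro-rata per-second billing; the function name is hypothetical, and real prices should come from hf jobs hardware.

```python
def job_cost_usd(seconds_running: float, usd_per_hour: float) -> float:
    """Pro-rata cost of a job billed by the second (hypothetical helper)."""
    return usd_per_hour * seconds_running / 3600


RATE = 1.50  # approximate a10g-large rate from the text; an assumption, not a quote

cancelled_early = job_cost_usd(20 * 60, RATE)   # cancelled after 20 minutes
ran_to_timeout = job_cost_usd(2 * 3600, RATE)   # left running until a 2h timeout

print(f"cancelled after 20 min: ${cancelled_early:.2f}")  # $0.50
print(f"ran to the 2h timeout:  ${ran_to_timeout:.2f}")   # $3.00
```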
Scaling to large models and key flags
For huge models you must select more GPUs and enable sharding. Example for Qwen3.5-122B on 2x H200:
--tensor-parallel-size should match the number of GPUs in the flavor (h200x2 -> 2, h200x8 -> 8).
Some models require limiting --max-model-len and --max-num-seqs due to memory. If you see out-of-memory or cache-block errors, reduce those values first.
For large models, H200 flavors often give a better cost/performance trade-off.
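The flag rules above can be sketched as a small helper that derives --tensor-parallel-size from the flavor name and appends the memory limits only when you set them. This assumes multi-GPU flavors end in "x<count>" (as in h200x2 and h200x8) and that any other name means one GPU. The function names, the `<model_id>` placeholder, and the limit values are illustrative, not recommendations.

```python
import re


def gpus_in_flavor(flavor: str) -> int:
    """Return the GPU count encoded in a flavor name (hypothetical helper).

    Assumes a trailing "x<count>" marks multi-GPU flavors; names without it,
    such as a10g-large, are treated as a single GPU.
    """
    match = re.search(r"x(\d+)$", flavor)
    return int(match.group(1)) if match else 1


def vllm_serve_args(model, flavor, max_model_len=None, max_num_seqs=None):
    """Assemble the vllm serve argument list for a given flavor."""
    args = [
        "vllm", "serve", model,
        "--host", "0.0.0.0", "--port", "8000",
        "--tensor-parallel-size", str(gpus_in_flavor(flavor)),
    ]
    # Only pass the memory-related limits when the caller sets them.
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    if max_num_seqs is not None:
        args += ["--max-num-seqs", str(max_num_seqs)]
    return args


# Illustrative values only; lower them first if you hit out-of-memory errors.
args = vllm_serve_args("<model_id>", "h200x2", max_model_len=32768, max_num_seqs=64)
print(" ".join(args))
```

Deriving the parallel size from the flavor keeps the two in sync when you switch hardware, which is the mismatch the first rule warns about.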
Chat UI with Gradio and reasoning
Prefer a UI instead of curl? A local Gradio app can point to the job. Add --reasoning-parser deepseek_r1 to vllm serve so the model’s reasoning arrives in a separate "reasoning" field, and run this snippet locally:
import gradio as gr
from gradio import ChatMessage
from huggingface_hub import get_token
from openai import OpenAI

# Replace <job_id> with the ID of your running job.
client = OpenAI(base_url="https://<job_id>--8000.hf.jobs/v1", api_key=get_token())

def chat(message, history):
    # Drop earlier "thinking" bubbles (they carry metadata) before resending history.
    messages = [{"role": m["role"], "content": m["content"]} for m in history if not m.get("metadata")]
    messages.append({"role": "user", "content": message})
    stream = client.chat.completions.create(model="Qwen/Qwen3-4B", messages=messages, stream=True)
    thinking, answer = "", ""
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no delta
        delta = chunk.choices[0].delta
        # The "reasoning" field may be missing or None on a given chunk.
        thinking += (delta.model_extra or {}).get("reasoning") or ""
        answer += delta.content or ""
        out = []
        if thinking.strip():
            status = "done" if answer.strip() else "pending"
            out.append(ChatMessage(role="assistant", content=thinking, metadata={"title": "💭 Thinking", "status": status}))
        if answer.strip():
            out.append(ChatMessage(role="assistant", content=answer))
        yield out  # re-render the partial response after every chunk

gr.ChatInterface(chat).launch()
Open http://127.0.0.1:7860 and you’ll see a chat window with the reasoning shown in a separate panel.
SSH, debugging and GPU monitoring
If you need to get into the container to run nvidia-smi or view live logs, launch the job with --ssh and make sure your public key is registered at huggingface.co/settings/keys:
hf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh \
    vllm/vllm-openai:latest \
    vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

hf jobs ssh <job_id>
SSH makes debugging and monitoring much easier than scanning remote logs.
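Once inside the container, GPU memory can be polled in a machine-readable form with `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`. The sketch below parses that CSV output into per-GPU usage. The function name is hypothetical and the sample line is made up for illustration, not captured from a real job.

```python
def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse index, memory.used, memory.total rows (MiB) from nvidia-smi CSV output."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({
            "gpu": index,
            "used_mib": used,
            "total_mib": total,
            "used_pct": round(100 * used / total, 1),
        })
    return rows


# Made-up sample in the shape nvidia-smi prints with csv,noheader,nounits.
sample = "0, 18000, 24000\n"

for row in parse_gpu_memory(sample):
    print(f"GPU {row['gpu']}: {row['used_mib']} / {row['total_mib']} MiB ({row['used_pct']}%)")
```

Watching this figure while sending requests is a quick way to check whether --max-model-len or --max-num-seqs leaves enough headroom.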
Integrating agents (example: Pi) and tool calling
If you plan to use agents that call tools, vLLM must be started with tool calling enabled. Example for agents with Pi and the Qwen3 family:
Then register the job as a provider in ~/.pi/agent/models.json, with baseUrl pointing to https://<job_id>--8000.hf.jobs/v1 and apiKey set to !hf auth token. Launch the agent and you’ll have a Read/Write/Edit/Bash agent running on your own model.
HF Jobs vs Inference Endpoints: when to choose each
HF Jobs: maximum flexibility. It’s like running docker run on HF’s cloud. You get full control over the image, the vllm serve flags and the hardware, with per-second billing. Great for experiments, tests and batch workloads.
Inference Endpoints: managed production solution. They offer finer access control, public or protected exposure options, and scale-to-zero to save costs when idle.
Not sure? Start with Jobs to validate, and if you need a stable, long-lived service, migrate to Endpoints.