Job Searcher: AI that improves job searches with reasoning

Jun 6, 2026Keryc Díaz4 minutes

You upload your resume, the system generates LinkedIn-style searches, scrapes the listings and returns a short list with a justification for each job. Sounds better than 50 results with no context? That’s exactly what Job Searcher offers: a compact pipeline so job hunting stops being a manual filter and becomes a process with humanoid reasoning and traceability.

What Job Searcher does

The workflow has three clear, repeatable steps. It’s not magic: it’s architecture and well-designed data.

Queries. The “student” reads the resume and your preferences (job type, modality, location, notes) and writes a small set of LinkedIn-style queries, reasoning out loud about each choice.
Search. Those queries are sent to LinkedIn via JobSpy, a scraping gateway that returns the real vacancies matching each query.
Scoring. For every (resume, job) pair the model produces a score on five dimensions: skills, relevance of experience, education and certifications, fit by industry/domain, and alignment of seniority. It also writes a sentence justifying each dimension.

What do you get at the end? Not a long list, but a small shortlist with defensible reasoning: you can read why the second role beats the third.

Architecture and models (technical part)

The project uses a teacher-student strategy. The “teacher” is DeepSeek V4 Pro: strong in structured reasoning and useful for generating large-scale labels offline. The “student” is Qwen3-8B, small enough to fit in a single ZeroGPU slice once quantized to Q4_K_M, but capable of absorbing the teacher’s structured judgment through distillation.

Data and closed loop:

Resumes: 2,500 (based on Divyaamith/Kaggle-Resume).
Queries: the teacher wrote LinkedIn-specific queries per resume.
Jobs: JobSpy scraped LinkedIn using those queries, generating ~10,000 postings, all linked to the query the teacher wrote for that resume.
Labels: the teacher scored each (resume, job) pair on the five dimensions and added one sentence of reasoning per dimension.

Distillation training:

Two LoRA SFT runs on an A100 via Modal, one per task (query generation and fit evaluation).
Adapter: rank 16, alpha 16, dropout off, with projections on attention and MLP.
Schedule: one epoch per task, checkpoints every 200 steps for sanity checks.
Output: safetensors in build-small-hackathon/job-searcher-qwen3-8B and a Q4_K_M + LoRA-GGUF sidecars version for llama.cpp in build-small-hackathon/job-searcher-qwen3-8B-gguf.

The LoRA configuration used is summarized like this: LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM", target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]).

Deployment, latency and streaming experience

The Space runs llama-cpp-python with the CUDA wheel precompiled in a HuggingFace ZeroGPU Space. Two key design decisions improve latency and cost:

ZeroGPU recycles the CUDA context per call, so it’s not convenient to keep a module-level instance between invocations. The Space makes one GPU call per submission (a whole resume) instead of one per job. In other words: the model is loaded once and all vacancies in the submission are evaluated in that single call.
Streaming: the UI receives the reasoning token by token using an OpenAI-style API create_chat_completion(stream=True). That lets you watch the reasoning process live, not just the final result.

The live demo and artifacts are published: the public Space and a dataset of Claude Code session traces that reconstruct each event (raw JSONL), ideal if you want to study how errors and dead-ends were resolved during development.

Practical lessons and decisions that matter

Two adapters beat one. I tried merging query generation and evaluation into a single LoRA and the model “leaked” formats (JSON in queries and prose in evaluation). Splitting the tasks into two heads on the same backbone, swapped per call, eliminated that class of bugs.
The teacher’s prompt mattered more than the student’s size. Refining the labeling guide to score against concrete resume details (for example: “four years of Rust; the role asks for five”) propagated that rigor through distillation. The student learned to be specific in its justifications.
Design for cost and UX: quantizing to Q4_K_M, using LoRA GGUF sidecars and grouping evaluations by submission reduces costs and latency in a resource-constrained environment.

If you want a system that explains why a job suits you or not, fine-tuning alone isn’t enough: you need labels with judgment, a teacher that’s consistent and a deployment that doesn’t repeat needless loads.

This project isn’t a panacea, but it’s a good example of how to combine large models, structured distillation and engineering choices to turn a tedious everyday task into an automated, transparent and defensible process. Ready to stop filtering 50 offers and start reading precise reasons?

Original source

https://huggingface.co/blog/build-small-hackathon/job-search-blog

Stay up to date!

Get AI news, tool launches, and innovative products straight to your inbox. Everything clear and useful.

What Job Searcher does

The workflow has three clear, repeatable steps. It’s not magic: it’s architecture and well-designed data.

Queries. The “student” reads the resume and your preferences (job type, modality, location, notes) and writes a small set of LinkedIn-style queries, reasoning out loud about each choice.

Search. Those queries are sent to LinkedIn via JobSpy, a scraping gateway that returns the real vacancies matching each query.

Scoring. For every (resume, job) pair the model produces a score on five dimensions: skills, relevance of experience, education and certifications, fit by industry/domain, and alignment of seniority. It also writes a sentence justifying each dimension.

What do you get at the end? Not a long list, but a small shortlist with defensible reasoning: you can read why the second role beats the third.

Architecture and models (technical part)

Data and closed loop:

Resumes: 2,500 (based on Divyaamith/Kaggle-Resume).

Queries: the teacher wrote LinkedIn-specific queries per resume.

Jobs: JobSpy scraped LinkedIn using those queries, generating ~10,000 postings, all linked to the query the teacher wrote for that resume.

Labels: the teacher scored each (resume, job) pair on the five dimensions and added one sentence of reasoning per dimension.

Distillation training:

Two LoRA SFT runs on an A100 via Modal, one per task (query generation and fit evaluation).

Adapter: rank 16, alpha 16, dropout off, with projections on attention and MLP.

Schedule: one epoch per task, checkpoints every 200 steps for sanity checks.

Output: safetensors in build-small-hackathon/job-searcher-qwen3-8B and a Q4_K_M + LoRA-GGUF sidecars version for llama.cpp in build-small-hackathon/job-searcher-qwen3-8B-gguf.

The LoRA configuration used is summarized like this:

LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM", target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"])

Deployment, latency and streaming experience

The Space runs llama-cpp-python with the CUDA wheel precompiled in a HuggingFace ZeroGPU Space. Two key design decisions improve latency and cost:

ZeroGPU recycles the CUDA context per call, so it’s not convenient to keep a module-level instance between invocations. The Space makes one GPU call per submission (a whole resume) instead of one per job. In other words: the model is loaded once and all vacancies in the submission are evaluated in that single call.

Streaming: the UI receives the reasoning token by token using an OpenAI-style API create_chat_completion(stream=True). That lets you watch the reasoning process live, not just the final result.

Practical lessons and decisions that matter

Two adapters beat one. I tried merging query generation and evaluation into a single LoRA and the model “leaked” formats (JSON in queries and prose in evaluation). Splitting the tasks into two heads on the same backbone, swapped per call, eliminated that class of bugs.

The teacher’s prompt mattered more than the student’s size. Refining the labeling guide to score against concrete resume details (for example: “four years of Rust; the role asks for five”) propagated that rigor through distillation. The student learned to be specific in its justifications.

Design for cost and UX: quantizing to Q4_K_M, using LoRA GGUF sidecars and grouping evaluations by submission reduces costs and latency in a resource-constrained environment.

If you want a system that explains why a job suits you or not, fine-tuning alone isn’t enough: you need labels with judgment, a teacher that’s consistent and a deployment that doesn’t repeat needless loads.