Blue J turned months of tax research into answers in seconds using OpenAI models. Can you imagine getting an analysis with citations and sources in the time it takes to finish a coffee? That’s the promise they share in the note published by OpenAI on August 21, 2025. (openai.com)
What Blue J did and why it matters
Blue J took its tax-research engine to three countries and more than 3,000 firms, and they did it with a very clear approach: combine deep subject-matter expertise with high-quality language models. For their product they use GPT-4.1 as the centerpiece of the system.
This combination isn’t magic — it’s product engineering focused on trust and accuracy. If you rely on answers to make costly decisions, that distinction is everything. (openai.com)
How it works in simple terms
At the core is a Retrieval-Augmented Generation (RAG) system. Blue J maintains its own library of millions of curated documents: laws, regulations, rulings, and expert commentary.
When you ask a question, the system retrieves the most relevant items and GPT-4.1 synthesizes a clear answer with inline citations, much like a knowledgeable colleague would. The result is useful and actionable for professionals who need to justify decisions. (openai.com)
A good answer isn't just the answer itself; it comes with the source and a path to verify it.
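The retrieve-then-synthesize flow can be sketched in a few lines. This is a minimal, hypothetical illustration: the document IDs, the keyword-overlap scoring, and the prompt wording are assumptions for the example, not Blue J's actual pipeline (which would use a full retrieval stack and a real model call):

```python
import re
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    title: str
    text: str

# Hypothetical mini-corpus standing in for Blue J's curated library.
CORPUS = [
    Document("irc-162", "IRC §162",
             "Trade or business expenses are deductible if ordinary and necessary."),
    Document("rev-rul-99-7", "Rev. Rul. 99-7",
             "Commuting costs between home and work are generally not deductible."),
    Document("pub-463", "Publication 463",
             "Travel expenses away from home may be deductible for business."),
]

def tokens(s: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z]+", s.lower()))

def retrieve(question: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Rank documents by naive keyword overlap; a real system would use embeddings."""
    q = tokens(question)
    return sorted(corpus, key=lambda d: len(q & tokens(d.text)), reverse=True)[:k]

def build_prompt(question: str, docs: list[Document]) -> str:
    """Assemble the grounded prompt handed to the model (e.g. GPT-4.1)."""
    context = "\n".join(f"[{d.doc_id}] {d.title}: {d.text}" for d in docs)
    return ("Answer using ONLY the sources below, citing them inline as [doc_id].\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}")

docs = retrieve("Are travel expenses deductible for business?", CORPUS)
print(build_prompt("Are travel expenses deductible for business?", docs))
```

Because the model only sees retrieved sources and is instructed to cite them, every claim in the answer can be traced back to a document ID.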
How they maintain trust and fix errors
Trust isn’t left to chance. Blue J included feedback buttons from day one, including a “disagree” button to report incorrect answers. That feedback is categorized and feeds a continuous improvement loop that analyzes patterns and prioritizes fixes.
Thanks to that design, they report a disagreement rate of less than 1 per 700 responses, and more than 70% of their users log in weekly. They also say each user saves on average 2.7 hours per week on research and client communication. (openai.com)
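The "categorize and prioritize" loop is easy to picture in code. The sketch below is hypothetical: the category labels and the `Feedback` shape are invented for illustration; the point is simply that counting disagreement reports by category tells the team what to fix first:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Feedback:
    answer_id: str
    category: str  # e.g. "wrong citation", "outdated law", "unclear"

# Hypothetical stream of reports from the "disagree" button.
reports = [
    Feedback("a1", "wrong citation"),
    Feedback("a2", "outdated law"),
    Feedback("a3", "wrong citation"),
    Feedback("a4", "unclear"),
]

def prioritize(reports: list[Feedback]) -> list[tuple[str, int]]:
    """Return disagreement categories ordered by frequency, most common first."""
    return Counter(r.category for r in reports).most_common()

print(prioritize(reports))
```

In this toy data, "wrong citation" surfaces as the top pattern, so it would be the first fix to ship.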
Evaluations that really matter
Before deploying any model, Blue J subjects new versions to a test suite with more than 350 prompts covering the U.S., Canada, and the U.K. They measure instruction adherence, alignment with sources, and clarity.
That standard prevents isolated improvements from breaking critical behaviors in production. They also note that when a major legal change arrived in 2025, the team mapped the impact and was able to update answers for users in hours. (openai.com)
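A regression suite like the one described can be gated before deployment. This is a sketch under stated assumptions: the eval cases, the `fake_model` stand-in, and the citation check are illustrative, not Blue J's actual 350-prompt suite or scoring criteria:

```python
# Each case pairs a prompt with a simple automated check.
EVAL_SUITE = [
    {"prompt": "US: home office deduction rules", "must_cite": True},
    {"prompt": "CA: GST registration threshold", "must_cite": True},
    {"prompt": "UK: VAT flat rate scheme", "must_cite": True},
]

def fake_model(prompt: str) -> str:
    """Stand-in for a model call; a real harness would query the API."""
    return f"Answer to '{prompt}' [source: doc-123]"

def passes(answer: str, case: dict) -> bool:
    """Proxy for source alignment: the answer must carry an inline citation."""
    return ("[source:" in answer) if case["must_cite"] else True

def run_suite(model, suite: list[dict]) -> float:
    """Fraction of cases passed; deploys are blocked below a threshold."""
    results = [passes(model(c["prompt"]), c) for c in suite]
    return sum(results) / len(results)

score = run_suite(fake_model, EVAL_SUITE)
print(f"pass rate: {score:.0%}")
```

Running the same suite against every candidate model version is what keeps an isolated improvement from silently regressing a critical behavior.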
Practical lessons for founders and teams
- Focus on a domain advantage no one else has. Blue J was built by tax law experts who understood the problem’s nuances.
- Design the product to learn. A good feedback button is more valuable than a pretty metric.
- Control your sources. If your answers are used for costly decisions, citations and traceability aren’t optional.
- Evaluate with real cases, not just lab metrics. Tests should reflect the problems your users face every day.
Final reflection
This story isn’t just about technology. It’s about combining human expertise and models to solve a real, regulated problem without sacrificing trust.
If you work in a complex domain, the invitation is simple: use AI to amplify your knowledge, but build the mechanisms that turn that power into repeatable trust. After all, the difference between a good and a bad answer can be expensive. (openai.com)