Open ASR Leaderboard incorporates private data
Since September 2023 the community has been using the Open ASR Leaderboard to compare speech-recognition models. Ever wondered what happens when someone optimizes for the test instead of improving real-world performance? Hugging Face's answer was to add a set of private datasets to reduce benchmaxxing and better measure robustness in conversational settings and diverse accents.
What changed and why
The new addition: Appen Inc. and DataoceanAI contributed several high-quality English datasets (scripted and conversational) that are kept unpublished so they can't leak into training data and contaminate the evaluation. Why keep them private? Because when a test set is fully public, some teams can tune their models specifically for those examples and get high scores without real production improvements.
Important: by default, the Average WER on the leaderboard is still calculated only with public datasets. You can enable an option to include the private data and see how the metrics change.
This dual approach balances two goals that often clash in a benchmark: standardization and openness. Hugging Face standardizes transcriptions using a normalizer (based on Whisper's) that removes punctuation, lowercases, and maps to American spelling. And they keep the evaluation tools and the UI open so the community can audit and contribute. But that transparency also makes benchmaxxing easier, hence the decision to add a private track.
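To make scores comparable, both references and predictions go through that normalization before WER is computed. Below is a minimal sketch of the step, assuming the EnglishTextNormalizer from the openai-whisper package, which the leaderboard's normalizer is based on; the exact configuration used on the leaderboard may differ.

```python
# Minimal sketch of pre-scoring text normalization, assuming the
# EnglishTextNormalizer shipped with the openai-whisper package
# (pip install openai-whisper). The leaderboard's normalizer is based on it,
# but its exact configuration may differ.
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = "Well, the colour of the sign is different."
hypothesis = "well the color of the sign is different"

# Both strings should end up lowercased, unpunctuated, and with British
# spellings mapped to American where the normalizer's spelling map covers them.
print(normalizer(reference))
print(normalizer(hypothesis))
```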
The datasets: technical details
Hugging Face worked with Appen and DataoceanAI to create splits with a variety of accents, styles and durations. Here are the summarized metrics:
| Dataset | Accent | Duration [h] | Male (%) / Female (%) | Style | Transcription |
|---|---|---|---|---|---|
| Appen Scripted AU | Australian | 1.42 | 49 / 51 | Read | Punctuated, cased |
| Appen Scripted CA | Canadian | 1.53 | 52 / 48 | Read | Punctuated, cased |
| Appen Scripted IN | Indian | 1.02 | 49 / 51 | Read | Punctuated, cased |
| Appen Scripted US | American | 1.45 | 49 / 51 | Read | Punctuated, cased |
| Appen Conversational IN | Indian | 1.37 | 51 / 49 | Conversational, spontaneous | Punctuated, disfluencies |
| Appen Conversational US003 | American | 1.64 | 49 / 51 | Conversational, spontaneous | Punctuated, cased, disfluencies |
| Appen Conversational US004 | American | 1.65 | 49 / 51 | Conversational, spontaneous | Punctuated, disfluencies |
| DataoceanAI Scripted US | American | 2.43 | 54 / 46 | Read | Punctuated, cased (proper nouns), disfluencies |
| DataoceanAI Scripted GB | British | 2.43 | 47 / 53 | Read | Punctuated, disfluencies |
| DataoceanAI Conversational US | American | 8.82 | NA | Conversational, spontaneous | Punctuated, disfluencies |
| DataoceanAI Conversational GB | British | 5.96 | NA | Conversational, spontaneous | Punctuated, disfluencies |
They also include audio examples to show variety: scripted, conversational, acronyms, disfluencies and proper names.
How this affects metrics (WER and averages)
- Average WER is computed as a macro-average of the per-provider means, so each data provider counts equally.
- There are dedicated metrics: Avg Scripted, Avg Conversational, Avg US, Avg non-US.
- Individual split scores are not shown, to prevent anyone from optimizing for a single provider or accent.
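A sketch of that aggregation rule, with made-up WER numbers: per-split scores are averaged within each provider first, then the provider means are averaged, so a provider contributing many splits does not dominate.

```python
# Macro-averaging as described above; the per-split WER values are invented
# purely for illustration.
from statistics import mean

wer_by_provider = {
    "appen": {"scripted_au": 7.1, "scripted_us": 6.4, "conversational_us003": 14.2},
    "dataoceanai": {"scripted_us": 6.9, "conversational_gb": 16.5},
}

# 1) average the splits of each provider, 2) average across providers
provider_means = {p: mean(s.values()) for p, s in wer_by_provider.items()}
macro_avg_wer = mean(provider_means.values())

print(provider_means)
print(f"Macro-averaged WER: {macro_avg_wer:.2f}")
```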
By default, the private sets do not influence the global ranking. If you want to see their effect, enable the "Private data" tab; then the Average WER will include those splits and you'll see the Rank Δ, which shows how the order changes.
Why this approach? Because a model that shines on a controlled script or in American English can fail in conversational audio or non-American accents. The goal is to capture those differences and give a fuller picture of performance.
Process to upload and verify your model
1. Open a pull request in the Open ASR Leaderboard repository. A checklist for models will appear.
2. Report your results on the public datasets in your model card (YAML); a sketch of that metadata follows below. This lets your model appear on an unverified leaderboard on the dataset page.
3. The team will verify the results published on the public sets and compute metrics on the private sets.
4. Confirm the verified results with the leaderboard maintainers.
This keeps evaluation decentralized for speed, but adds central verification for credibility.
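For step 2, results are reported through the model-index metadata in the model card's YAML front matter. Here is a hypothetical sketch; the model name, dataset identifiers and the WER value are placeholders, and the pull-request checklist is the authoritative reference for the exact fields expected.

```yaml
# Hypothetical model-index entry in a model card's YAML front matter;
# model name, dataset identifiers and the WER value are placeholders.
model-index:
- name: my-asr-model
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (test-clean)
      type: librispeech_asr
      config: clean
      split: test
    metrics:
    - type: wer
      value: 3.4
      name: Test WER
```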
Risks, mitigations and limitations
Benchmaxxing can still happen if someone has access to very similarly distributed data. That's why Appen and DataoceanAI were asked not to deliver these exact sets to their customers, although this can't be guaranteed 100%.
Having multiple providers reduces the edge someone could gain from using data from just one source.
There's also work on tooling to detect quality issues: low signal-to-noise ratio (SNR), misaligned transcripts, extreme cases that skew WER. That helps keep consistency across splits.
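A hedged sketch of the kind of checks such tooling might run, assuming jiwer for WER; the SNR heuristic and the thresholds below are illustrative, not the leaderboard's actual implementation.

```python
# Hypothetical per-utterance quality checks: a crude energy-based SNR estimate
# and a flag for extreme WER (often a sign of a misaligned transcript).
# The heuristic and thresholds are illustrative only.
import numpy as np
import jiwer  # pip install jiwer

def estimate_snr_db(waveform: np.ndarray, frame_len: int = 2048) -> float:
    """Rough SNR: treat the quietest frames as noise and the loudest as signal."""
    frames = [waveform[i:i + frame_len] for i in range(0, len(waveform) - frame_len, frame_len)]
    if not frames:
        return float("inf")
    energies = np.sort(np.array([float(np.mean(f ** 2)) for f in frames]))
    k = max(1, len(energies) // 10)
    noise, signal = energies[:k].mean(), energies[-k:].mean()
    return 10 * np.log10(signal / max(noise, 1e-12))

def quality_flags(reference: str, hypothesis: str, waveform: np.ndarray,
                  max_wer: float = 1.0, min_snr_db: float = 5.0) -> list[str]:
    """Return the quality issues detected for a single utterance."""
    flags = []
    if jiwer.wer(reference, hypothesis) > max_wer:
        flags.append("extreme WER (possible misaligned transcript)")
    if estimate_snr_db(waveform) < min_snr_db:
        flags.append("low SNR")
    return flags
```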
What this means for you as a developer or user
If you're a developer: don't settle for optimizing for a public benchmark. If you want models for production, look at the averages by data type and test in conversational and diverse-accent conditions (a minimal sketch of such a check follows below).
If you're a user: you now have a more robust way to compare models for your use case. Need something for casual conversations or audio with noise and varied accents? Turn on the private-data tab and watch the rankings change.
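One way to run that kind of spot check, sketched under assumptions: the model ID, audio paths and reference transcripts below are placeholders, and the same Whisper-style normalization is applied before scoring so the numbers stay comparable with the leaderboard's.

```python
# Sketch of spot-checking a candidate model on your own conversational clips;
# the model ID, file paths and reference transcripts are placeholders.
import jiwer
from transformers import pipeline
from whisper.normalizers import EnglishTextNormalizer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
normalize = EnglishTextNormalizer()

samples = [  # (audio file, reference transcript)
    ("call_center_clip.wav", "yeah so I was saying we could um move the meeting to Friday"),
    ("accented_clip.wav", "the parcel should arrive at the Bangalore office by Tuesday"),
]

references, hypotheses = [], []
for path, reference in samples:
    hypotheses.append(normalize(asr(path)["text"]))
    references.append(normalize(reference))

print(f"WER on your own conversational clips: {jiwer.wer(references, hypotheses):.2%}")
```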
The lesson is simple: a good benchmark evolves with real-world applications. Adding private data isn't closing the box; it's raising the bar so that models that score high are useful outside the lab.