Today, we're releasing Voxtral Transcribe 2, a pair of next-generation speech-to-text models: Voxtral Mini Transcribe V2 and Voxtral Realtime. Together they deliver state-of-the-art transcription quality, advanced diarization, and ultra-low latency. Voxtral Realtime is also released as open weights under the Apache 2.0 license, so developers can build privacy-first applications with ease.
To mark the launch, we're also introducing an audio playground in Mistral Studio (https://console.mistral.ai/build/audio/speech-to-text). It lets you test Voxtral Transcribe 2 instantly, complete with diarization and timestamps.
Highlights
Voxtral Mini Transcribe V2: State-of-the-art transcription across 13 languages, including English, Chinese, Hindi, and Spanish. It features speaker diarization, context biasing, and word-level timestamps, which suits it to complex tasks like meeting transcription and interview analysis. It pairs the lowest word error rate in its class with a price of $0.003 per minute.
Voxtral Realtime: Designed for live applications, this model reaches sub-200ms latency, enabling real-time transcription for voice agents, live broadcasts, and more. Its weights are released under Apache 2.0, which allows edge deployment for privacy-sensitive use cases.
Best-in-class efficiency: Voxtral Mini Transcribe V2 sets the industry standard for accuracy at a fraction of the cost. It outperforms competitors like GPT-4o, Gemini 2.5 Flash, and ElevenLabs’ Scribe v2, processing audio 3x faster while maintaining superior quality.
Diving Deeper: Voxtral Realtime
What sets Voxtral Realtime apart is its novel streaming architecture. Unlike traditional models that process audio in chunks, Realtime transcribes audio as it arrives, delivering near-instantaneous results. At a 2.4-second delay, a setting well suited to subtitling, it matches the accuracy of Voxtral Mini Transcribe V2. Even at a 480ms delay, it stays within 1-2% word error rate, giving voice agents near-offline accuracy.
Voxtral Mini Transcribe V2: The Benchmark for Accuracy
With a 4% word error rate on the FLEURS benchmark and a price of $0.003/min, this model offers the best price-performance ratio in the industry. It excels at speaker diarization, even in multilingual settings, and its context biasing feature improves transcription of names, technical terms, and industry jargon. Its non-English performance also significantly outpaces competitors.
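For readers less familiar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. The sketch below is a minimal Python illustration of that definition, not the FLEURS evaluation pipeline: the function name `word_error_rate` is our own, and the only normalization assumed is lowercasing and whitespace splitting, whereas published benchmarks typically apply more thorough text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase + whitespace split is the only normalization applied.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row in memory.
    # prev[j] = edits needed to turn the first j hypothesis words into the
    # reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis (deletion)
                curr[j - 1] + 1,     # extra word in hypothesis (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

By this definition, a 4% word error rate corresponds to roughly one error for every 25 reference words.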
Model Features That Stand Out
- Speaker Diarization: Automatically identifies and labels speakers, even in multi-party conversations. (Note: Overlapping speech is transcribed for one speaker at a time.)
- Context Biasing: Supply up to 100 words or phrases to guide transcription, ideal for names and industry-specific terminology.
- Word-Level Timestamps: Generate precise timestamps for every word, enabling advanced applications like subtitle generation and audio search.
- Noise Robustness: Maintains accuracy in noisy environments, from factory floors to busy call centers.
- Longer Audio Support: Process recordings up to 3 hours in a single request.
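Word-level timestamps map naturally onto subtitle formats. As a rough sketch, the Python below groups timed words into SubRip (SRT) cues. It assumes each word is a dict with `text`, `start`, and `end` keys (times in seconds); those field names, the helper names, and the fixed words-per-cue rule are illustrative assumptions, not the documented response schema, so check the API documentation for the actual shape.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group timed words into numbered SRT cues of at most max_words words."""
    if not words:
        return ""
    cues = []
    for index in range(0, len(words), max_words):
        group = words[index:index + max_words]
        # A cue spans from the first word's start to the last word's end.
        start = format_srt_time(group[0]["start"])
        end = format_srt_time(group[-1]["end"])
        text = " ".join(word["text"] for word in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n"


# Hypothetical word-level output; the key names are assumptions.
sample_words = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
    {"text": "again", "start": 1.0, "end": 1.5},
]
print(words_to_srt(sample_words, max_words=2))
```

A production subtitle pipeline would more likely break cues on pauses, punctuation, or speaker changes rather than on a fixed word count; the fixed count just keeps the example short.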
Transforming Industries with Voxtral
Voxtral supports a wide range of applications across industries:
- Meeting Intelligence: Transcribe multilingual meetings with clear speaker attribution, at an industry-leading cost.
- Voice Agents: Build conversational AI with sub-200ms latency, creating natural, responsive interfaces.
- Contact Center Automation: Analyze calls in real time, improve customer interactions, and streamline CRM workflows.
- Media & Broadcast: Generate live subtitles with minimal latency, even for technical or multilingual content.
- Compliance & Documentation: Ensure regulatory compliance with precise diarization and timestamped audit trails.
With GDPR and HIPAA-compliant deployments, Voxtral is suited to privacy-sensitive transcription workloads.
Get Started Today
Voxtral Mini Transcribe V2 is available now via API at $0.003 per minute. Try it in the Mistral Studio audio playground (https://console.mistral.ai/build/audio/speech-to-text) or in Le Chat (http://chat.mistral.ai/).
Voxtral Realtime is also available via API at $0.006 per minute and as open weights on Hugging Face (https://huggingface.co/mistralai/Voxtral-Mini-3B-Realtime-2602).
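To make the per-minute pricing concrete, here is a small cost estimate in Python. The dictionary keys are labels made up for this example rather than API model identifiers, and the calculation assumes billing scales linearly with audio duration, with no rounding to whole minutes and no minimum charge, which the pricing above does not specify.

```python
from decimal import Decimal

# Per-minute API prices quoted in this post.
# The keys are illustrative labels, not API model identifiers.
PRICE_PER_MINUTE = {
    "mini-transcribe-v2": Decimal("0.003"),
    "realtime": Decimal("0.006"),
}


def estimate_cost(model_key: str, audio_seconds: float) -> Decimal:
    """Estimate the USD cost of transcribing audio_seconds of audio.

    Assumption: cost is strictly proportional to duration.
    """
    minutes = Decimal(str(audio_seconds)) / Decimal(60)
    # Decimal avoids binary floating-point noise in currency arithmetic.
    return (minutes * PRICE_PER_MINUTE[model_key]).quantize(Decimal("0.0001"))


print(estimate_cost("mini-transcribe-v2", 3 * 3600))  # 0.5400
print(estimate_cost("realtime", 3600))                # 0.3600
```

Under those assumptions, a maximum-length 3-hour recording on Voxtral Mini Transcribe V2 comes to 180 × $0.003 = $0.54, and an hour of live audio on Voxtral Realtime comes to 60 × $0.006 = $0.36.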
Explore the full documentation (https://docs.mistral.ai/capabilities/audio_transcription) to get started. And if you’re passionate about pushing the boundaries of speech AI, we’re hiring (https://mistral.ai/careers).