Xiaomi Launches New AI Model to Rival Google and OpenAI

Xiaomi has announced a major upgrade to its AI voice ecosystem with the launch of the MiMo-V2.5-TTS series and MiMo-V2.5-ASR, expanding its MiMo voice AI platform into a more complete speech intelligence system.

The company says the new release is designed as a “full-link voice model system” for the agent-driven AI era, covering both speech generation (text-to-speech) and speech understanding (automatic speech recognition). This update builds on the earlier MiMo-V2-TTS model launched in March, which focused on improving control over tone, emotion, and speaking style.

MiMo-V2.5-TTS: Advanced Voice Generation Models

The new MiMo-V2.5-TTS lineup includes three separate models and is currently available for a limited-time free trial through Xiaomi’s MiMo Open Platform.

The base TTS model offers preset voices with adjustable parameters such as speed, tone, and emotional expression.

The MiMo-V2.5-TTS-VoiceDesign model allows users to generate completely new voice styles using just a short sample sentence, enabling flexible voice creation.

The MiMo-V2.5-TTS-VoiceClone model focuses on replicating a specific voice with only a small number of samples, while maintaining consistency across different emotions and speaking scenarios.

Xiaomi says the system can understand natural language instructions instead of requiring technical parameters. Users can simply describe how a voice should sound, similar to directing a voice actor.

It also supports script-style layered input, making it useful for applications like gaming characters, audio dramas, and storytelling. Developers can assign different traits, dialogue styles, and scene-based emotions within the same script.

In addition, the system supports inline audio tags that allow emotion or delivery changes within a single sentence, and it works across both Chinese and English.
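
The article does not document the actual tag syntax, so the following is only a minimal sketch of how inline delivery tags might be handled on the developer side. It assumes a hypothetical square-bracket form such as "[whisper]"; the tag names, the Segment type, and the split_inline_tags helper are illustrative and not part of Xiaomi's published API.

```python
import re
from dataclasses import dataclass

# Hypothetical tag syntax: "[style]" switches the delivery for the text that follows.
# Xiaomi's real tag format is not described in the article.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


@dataclass
class Segment:
    style: str  # delivery label in effect for this piece of text
    text: str   # the words to be spoken


def split_inline_tags(line: str, default_style: str = "neutral") -> list[Segment]:
    """Split one sentence into (style, text) segments at each inline tag."""
    segments = []
    style = default_style
    pos = 0
    for match in TAG_PATTERN.finditer(line):
        chunk = line[pos:match.start()].strip()
        if chunk:
            segments.append(Segment(style, chunk))
        style = match.group(1)  # the tag applies to the text after it
        pos = match.end()
    tail = line[pos:].strip()
    if tail:
        segments.append(Segment(style, tail))
    return segments


example = split_inline_tags(
    "I told you to wait, [whisper] but nobody listened. [laughing] Typical!"
)
for segment in example:
    print(f"{segment.style}: {segment.text}")
```

Splitting a sentence this way keeps the text and its intended delivery together, which is the kind of structure a script with per-line emotions would need before being sent to any speech model.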

MiMo-V2.5-ASR: Speech Recognition Upgrade

Alongside the TTS models, Xiaomi has released MiMo-V2.5-ASR, an open-source speech recognition system.

The company says it is built for real-world use cases such as bilingual conversations, regional dialects, and noisy environments.

It supports multiple Chinese dialects, including Wu, Cantonese, Minnan, and Sichuanese, and can switch between Chinese and English without manual language selection. Xiaomi also says it can recognize song lyrics even when they are mixed with background music.

For meetings and multi-speaker environments, the model can transcribe overlapping speech while separating different speakers.
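
The article does not describe the transcript format the model returns. As a rough illustration of what speaker-separated output with overlapping speech could look like to a developer, the sketch below assumes a hypothetical list of utterances with a speaker label and start and end times in seconds, and finds the places where two different speakers talk at once. The Utterance type and the overlapping_pairs helper are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str  # hypothetical speaker label
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def overlapping_pairs(utterances):
    """Return pairs of utterances from different speakers whose time ranges overlap."""
    ordered = sorted(utterances, key=lambda u: u.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # sorted by start time, so nothing later can overlap `first`
            if second.speaker != first.speaker:
                pairs.append((first, second))
    return pairs


# Made-up sample data, not real model output.
meeting = [
    Utterance("A", 0.0, 4.0, "Let's move the launch to Friday."),
    Utterance("B", 3.2, 6.0, "Friday works for me."),
    Utterance("A", 6.5, 8.0, "Great, Friday it is."),
]
for first, second in overlapping_pairs(meeting):
    print(f"{first.speaker} and {second.speaker} overlap from {second.start}s to {min(first.end, second.end)}s")
```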

Xiaomi also highlights improved performance in noisy settings and far-field audio capture, which it says makes the model suitable for public or industrial environments.

Structured Output and Availability

The ASR system includes built-in phonetic processing and context-aware punctuation, reducing the need for manual editing after transcription.

Xiaomi claims the model delivers state-of-the-art or near state-of-the-art performance in benchmarks covering bilingual recognition, dialect understanding, and code-switching tasks.

The TTS models are available for testing via MiMo Studio, while the ASR model is released with open-source weights and code, allowing developers to deploy or customize it for different applications.
