Key features
Open Source
Open Science is at the heart of Kyutai's and Moshi's philosophy. You can explore the full research paper for an in-depth understanding of Moshi and access the inference source code, released under the Apache 2.0 license. You can also adjust performance by fine-tuning the model's weights yourself, which are available under the CC BY 4.0 license.
A full speech-to-speech model
Moshi is an experimental yet advanced Speech-to-Speech conversational model that receives the user's voice and generates both text and a vocal response. Its innovative “Inner Monologue” mechanism enhances the coherence and quality of the generated speech, strengthening its ability to reason and respond accurately.
Say it with emotion
Moshi can modulate its intonation to adapt to various emotional contexts. Whether you ask it to whisper a mysterious story or speak with the energy of a fearless pirate, it can express over 92 different intonations, adding a powerful and immersive emotional dimension to conversations.
Fine acoustic processing
The Mimi acoustic model, integrated into Moshi, processes audio in real time at 24 kHz and compresses it to a bitrate of 1.1 kbps, while maintaining an ultra-low latency of 80 ms. Despite this high compression rate, Mimi outperforms non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) and SemantiCodec (50 Hz, 1.3 kbps), providing a smooth and accurate experience.
End-to-end seamlessness
Moshi natively integrates WebSocket protocol support, enabling real-time management of vocal inputs and outputs. This ensures natural, continuous, and expressive interactions without any noticeable latency.
Designed and trained in France
To make the training of Moshi feasible, Kyutai relied on our supercomputer Nabu2023. This cluster of 1,016 Nvidia H100 GPUs (~4 PFLOPS) is hosted at DC5, our data center known for its efficient cooling, in the greater Paris region.
A state-of-the-art model

Current voice dialogue systems rely on chains of independent components (voice activity detection, speech recognition, text processing, and voice synthesis). This results in several seconds of latency and the loss of non-linguistic information, such as emotions or non-verbal sounds. Additionally, these systems segment dialogues into turn-based interactions, overlooking interruptions or overlapping speech.

Kyutai's approach with Moshi aims to solve these issues by directly generating speech (both audio and text) from the user's voice, without relying on intermediate text.

The user's and the AI's voices are modeled separately, allowing for more natural and dynamic dialogues. The model predicts text first, before generating sounds, enhancing linguistic quality while enabling real-time speech recognition and synthesis. With a theoretical latency of 160ms, Moshi is the first real-time, full-duplex voice language model.

Deep dive into Moshi

Moshi operates in full duplex, allowing seamless conversation without any noticeable latency. The WebSocket protocol is fundamental to these real-time interactions. Unlike HTTP, which follows a request-response pattern, WebSocket enables continuous bidirectional communication, receiving and transmitting voice streams simultaneously. This ensures dynamic exchanges without interruptions, which is essential for a natural user experience.
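
To make this concrete, the sketch below opens a single bidirectional WebSocket connection with the generic websocat tool, so raw audio can flow in both directions over one socket. The endpoint URL is a placeholder, the exact path and authentication header depend on your deployment, and the official clients referenced in the cheat sheet below handle the audio framing for you.

bash

# Minimal full-duplex sketch: stdin is streamed to the server while the
# server's stream is written to stdout, over one WebSocket connection.
# <Deployment endpoint URL> is a placeholder; if authentication is enabled,
# an HTTP header carrying your IAM API key must be added as well.
$ websocat --binary "wss://<Deployment endpoint URL>"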

Mimi is a neural audio codec based on an autoencoder architecture. It converts audio into discrete acoustic tokens, which are then consumed by the model at inference time. Unlike current approaches, Mimi distills semantic information directly into the early levels of the acoustic tokens. This fusion significantly improves the quality of audio synthesis, ensuring intelligible and expressive dialogues while minimizing model complexity.
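
The 1.1 kbps and 80 ms figures quoted above follow directly from Mimi's configuration as described in the Moshi paper, namely 8 codebooks of 2048 entries (11 bits each) emitted at 12.5 Hz. These configuration values are taken from the paper rather than from this page, so treat the quick check below as illustrative:

bash

# One frame every 1000 / 12.5 = 80 ms, and 8 codebooks x 11 bits x 12.5 frames/s = 1100 bps
$ echo "1000 / 12.5" | bc      # 80  (milliseconds per frame)
$ echo "8 * 11 * 12.5" | bc    # 1100.0  (bits per second, i.e. ~1.1 kbps)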

Helium is a large language model (LLM) with 7 billion parameters based on the Transformer architecture. It was pre-trained on a large dataset of 2.1 trillion text tokens (from sources such as Wikipedia, Wikisource, and Stack Exchange). This allows Helium to deeply understand language nuances, making it a powerful tool for generating fluent and coherent text.

What makes Moshi unique is its ability to manage three distinct processing streams: one for the user, one for its audio output, and a third for its internal dialogue (“Inner Monologue”). The entire model suite (Mimi, Helium, and the three streams) was trained on 7 million hours of audio (24 kHz mono). Then, to enable Moshi to listen and speak simultaneously, the model was post-trained on multiple audio streams obtained through diarization.

An initial fine-tuning is performed on the Fisher dataset, a collection of 2,000 hours of multi-channel telephone conversations, to develop Moshi into a complete conversational agent.

To ensure emotional consistency and richness in the voice generated by Moshi, the Helium model was fine-tuned on 20,000 hours of diverse conversations, recorded under varying conditions and with multiple accents. This ensures that Moshi does not imitate the user’s voice but retains a distinct and stable vocal identity.

Finally, a last fine-tuning pass is performed on a dataset of 170 hours of high-quality scripted and natural conversations, optimizing Moshi's conversational skills.

An open model

Three models have been released: the Mimi audio codec and two pre-trained Moshi models with artificially generated voices, a masculine voice named Moshiko and a feminine voice named Moshika.

All these models have been published under the CC BY 4.0 license. This license allows others to distribute, fine-tune, and modify these models, even for commercial purposes, provided they give credit to Kyutai for the original creation.

To fully understand Moshi, the full research paper is also accessible.

Pricing
Model   | Supported languages | Quantization | GPU        | Price
Moshiko | English 👨          | FP8          | L4-1-24G   | €0.93/hour
Moshiko | English 👨          | FP8, BF16    | H100-1-80G | €3.40/hour
Moshika | English 👩          | FP8          | L4-1-24G   | €0.93/hour
Moshika | English 👩          | FP8, BF16    | H100-1-80G | €3.40/hour

For the pricing of other models, see this page
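
As a rough order of magnitude (an estimate only; actual usage-based billing may differ, so check your Scaleway console for exact figures), a single always-on L4 replica at the hourly rate above works out to:

bash

# 0.93 €/hour x 730 hours (≈ one month) ≈ 679 €
$ echo "0.93 * 730" | bc    # 678.90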

Cheat Sheet

Deploy your Moshi via cURL

In the following commands:

  • <API secret key> designates the API key which allows you to use the Scaleway API
  • <Scaleway Project ID> is the identifier of your project in the Scaleway console
  • <Scaleway Deployment UUID> designates the identifier of the deployment you're about to create
  • <IAM API key> designates the API key that will allow you to interact with Moshi if you opt for authentication

Deploy your model (you can customize the model, GPU, etc.):

bash

$ curl -X POST https://api.scaleway.com/inference/v1beta1/regions/fr-par/deployments \
-H "Content-Type: application/json" \
-H "X-Auth-Token: <API secret key>" \
-d '{
  "project_id": "<Scaleway Project ID>",
  "name": "my-moshi-deployment",
  "model_name": "kyutai/moshiko-0.1-8b:fp8",
  "node_type": "L4",
  "min_size": 1,
  "max_size": 3,
  "accept_eula": true,
  "endpoints": [
    {
      "public": {}
    }
  ]
}'
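
Optionally, you can check that the deployment is ready before creating an endpoint. This is a hedged sketch assuming the API exposes the deployment resource at the standard GET path; the exact response fields may differ:

bash

# Fetch the deployment and inspect its status
$ curl -X GET "https://api.scaleway.com/inference/v1beta1/regions/fr-par/deployments/<Scaleway Deployment UUID>" \
-H "X-Auth-Token: <API secret key>"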

Create a public endpoint to access your model:

bash

$ curl -X POST https://api.scaleway.com/inference/v1beta1/regions/fr-par/endpoints \
-H "Content-Type: application/json" \
-H "X-Auth-Token: <API secret key>" \
-d '{
  "project_id": "<Scaleway Project ID>",
  "deployment_id": "<Scaleway Deployment UUID>",
  "endpoints": [
    {
      "disable_auth": false,
      "public": {}
    }
  ]
}'

If disable_auth is set to true, you won't need the API key <IAM API key> for the rest.

Interact with Moshi

Once you've deployed your Moshi model and created an endpoint to reach it, you can use one of our clients to interact with it.

bash

$ git clone https://github.com/scaleway/moshi-client-examples

FAQ

Kyutai aims to enhance Moshi's knowledge base and factual accuracy with community support. Future updates will focus on refining the model and its scalability to handle more complex and longer conversations in additional languages.

Moshi has a limited context window and conversations longer than 5 minutes will be stopped. It also has a limited knowledge base covering the years 2018 to 2023, which can lead to repetitive or inconsistent responses during prolonged interactions.

You can find a comprehensive guide here on getting started, including details on deployment, security, and billing. If you need further assistance, feel free to reach out to us through the Slack community #inference-beta.

Moshi requires some experience to master. It may wait for you to finish speaking before responding. If Moshi takes too long, simply say "go on." Moshi will understand that it's its turn to speak.

To evaluate toxicity during content generation, the ALERT benchmark by Simone Tedeschi has been applied to Moshi. Moshi's score is 83.05 (Falcon: 88.11, GPT-4: 99.18). A higher score indicates a less "toxic" model.