Key features
Open Source
Open Science is at the heart of Kyutai's and Moshi's philosophy. You can explore the full research paper for an in-depth understanding of Moshi and access the inference source code, released under the Apache 2.0 license. You can also adjust performance by fine-tuning the model's weights yourself, which are available under the CC BY 4.0 license.
A full speech-to-speech model
Moshi is an experimental yet advanced Speech-to-Speech conversational model that receives the user's voice and generates both text and a vocal response. Its innovative “Inner Monologue” mechanism enhances the coherence and quality of the generated speech, strengthening its ability to reason and respond accurately.
Say it with emotion
Moshi can modulate its intonation to adapt to various emotional contexts. Whether you ask it to whisper a mysterious story or speak with the energy of a fearless pirate, it can express over 92 different intonations, adding a powerful and immersive emotional dimension to conversations.
Fine acoustic processing
The Mimi acoustic model, integrated into Moshi, processes audio in real time at 24 kHz and compresses it to a bitrate of 1.1 kbps, while maintaining an ultra-low latency of 80 ms. Despite this high compression rate, Mimi outperforms non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) and SemantiCodec (50 Hz, 1.3 kbps), providing a smooth and accurate experience.
End-to-end seamlessness
Moshi natively integrates WebSocket protocol support, enabling real-time management of vocal inputs and outputs. This ensures natural, continuous, and expressive interactions without any noticeable latency.
Designed and trained in France
To make the training of Moshi feasible, Kyutai relied on our supercomputer Nabu2023. This cluster of 1,016 Nvidia H100 GPUs (~4 PFLOPS) is hosted at DC5, our data center known for its efficient cooling, in the greater Paris region.
A state-of-the-art model

Current voice dialogue systems rely on chains of independent components (voice activity detection, speech recognition, text processing, and voice synthesis). This results in several seconds of latency and the loss of non-linguistic information, such as emotions or non-verbal sounds. Additionally, these systems segment dialogues into turn-based interactions, overlooking interruptions or overlapping speech.

Kyutai's approach with Moshi aims to solve these issues by directly generating speech (both audio and text) from the user's voice, without relying on intermediate text.

The user's and the AI's voices are modeled separately, allowing for more natural and dynamic dialogues. The model predicts text first, before generating sounds, enhancing linguistic quality while enabling real-time speech recognition and synthesis. With a theoretical latency of 160ms, Moshi is the first real-time, full-duplex voice language model.

Deep dive into Moshi

Moshi operates in full duplex, allowing seamless conversation without any noticeable latency. The WebSocket protocol is fundamental to these real-time interactions. Unlike HTTP, which follows a request-response pattern, WebSocket enables continuous bidirectional communication, receiving and transmitting voice streams simultaneously. This ensures dynamic exchanges without interruptions, which is essential for a natural user experience.
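
To make this concrete, the sketch below opens a single bidirectional WebSocket connection with the generic websocat tool, so raw audio can flow in both directions over one socket. The endpoint URL is a placeholder, the exact path and authentication header depend on your deployment, and the official clients referenced in the cheat sheet below handle the audio framing for you.

bash

# Minimal full-duplex sketch: stdin is streamed to the server while the
# server's stream is written to stdout, over one WebSocket connection.
# <Deployment endpoint URL> is a placeholder; if authentication is enabled,
# an HTTP header carrying your IAM API key must be added as well.
$ websocat --binary "wss://<Deployment endpoint URL>"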

Mimi is a neural audio codec based on an autoencoder architecture. It converts audio into discrete acoustic tokens, which are then consumed by the model at inference time. Unlike current approaches, Mimi distills semantic information directly into the early levels of the acoustic tokens. This fusion significantly improves the quality of audio synthesis, ensuring intelligible and expressive dialogues while minimizing model complexity.
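
The 1.1 kbps and 80 ms figures quoted above follow directly from Mimi's configuration as described in the Moshi paper, namely 8 codebooks of 2048 entries (11 bits each) emitted at 12.5 Hz. These configuration values are taken from the paper rather than from this page, so treat the quick check below as illustrative:

bash

# One frame every 1000 / 12.5 = 80 ms, and 8 codebooks x 11 bits x 12.5 frames/s = 1100 bps
$ echo "1000 / 12.5" | bc      # 80  (milliseconds per frame)
$ echo "8 * 11 * 12.5" | bc    # 1100.0  (bits per second, i.e. ~1.1 kbps)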

Helium is a large language model (LLM) with 7 billion parameters based on the Transformer architecture. It was pre-trained on a large dataset of 2.1 trillion text tokens (from sources such as Wikipedia, Wikisource, and Stack Exchange). This allows Helium to deeply understand language nuances, making it a powerful tool for generating fluent and coherent text.

What makes Moshi unique is its ability to manage three distinct processing streams: one for the user, one for its audio output, and a third for its internal dialogue (“Inner Monologue”). The entire model suite (Mimi, Helium, and the three streams) was trained on 7 million hours of audio (24 kHz mono). Then, to enable Moshi to listen and speak simultaneously, the model was post-trained on multiple audio streams obtained through diarization.

An initial fine-tuning is performed on the Fisher dataset, a collection of 2,000 hours of multi-channel telephone conversations, to develop Moshi into a complete conversational agent.

To ensure emotional consistency and richness in the voice generated by Moshi, the Helium model was fine-tuned on 20,000 hours of diverse conversations, recorded under varying conditions and with multiple accents. This ensures that Moshi does not imitate the user’s voice but retains a distinct and stable vocal identity.

Finally, a last fine-tuning pass is performed on a dataset of 170 hours of high-quality scripted and natural conversations, optimizing Moshi's conversational skills.

An open model

Three models have been released: the Mimi audio codec and two pre-trained Moshi models with artificially generated voices, a masculine voice named Moshiko and a feminine voice named Moshika.

All these models have been published under the CC BY 4.0 license. This license allows others to distribute, fine-tune, and modify these models, even for commercial purposes, provided they give credit to Kyutai for the original creation.

To fully understand Moshi, the full research paper is also accessible.

Pricing
Model   | Supported languages | Quantization | GPU        | Price
Moshiko | English 👨          | FP8          | L4-1-24G   | €0.93/hour
Moshiko | English 👨          | FP8, BF16    | H100-1-80G | €3.40/hour
Moshika | English 👩          | FP8          | L4-1-24G   | €0.93/hour
Moshika | English 👩          | FP8, BF16    | H100-1-80G | €3.40/hour

For the pricing of other models, see this page
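
As a rough order of magnitude (an estimate only; actual usage-based billing may differ, so check your Scaleway console for exact figures), a single always-on L4 replica at the hourly rate above works out to:

bash

# 0.93 €/hour x 730 hours (≈ one month) ≈ 679 €
$ echo "0.93 * 730" | bc    # 678.90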

Cheat Sheet

Deploy your Moshi via cURL

In the following commands:

  • <API secret key> designates the API key which allows you to use the Scaleway API
  • <Scaleway Project ID> is the identifier of your project in the Scaleway console
  • <Scaleway Deployment UUID> designates the identifier of the deployment you're about to create
  • <IAM API key> designates the API key that will allow you to interact with Moshi if you opt for authentication

Deploy your model (you can customize the model, GPU, etc.):

bash

$ curl -X POST https://api.scaleway.com/inference/v1beta1/regions/fr-par/deployments \
-H "Content-Type: application/json" \
-H "X-Auth-Token: <API secret key>" \
-d '{
  "project_id": "<Scaleway Project ID>",
  "name": "my-moshi-deployment",
  "model_name": "kyutai/moshiko-0.1-8b:fp8",
  "node_type": "L4",
  "min_size": 1,
  "max_size": 3,
  "accept_eula": true,
  "endpoints": [
    {
      "public": {}
    }
  ]
}'
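
Optionally, you can check that the deployment is ready before creating an endpoint. This is a hedged sketch assuming the API exposes the deployment resource at the standard GET path; the exact response fields may differ:

bash

# Fetch the deployment and inspect its status
$ curl -X GET "https://api.scaleway.com/inference/v1beta1/regions/fr-par/deployments/<Scaleway Deployment UUID>" \
-H "X-Auth-Token: <API secret key>"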

Create a public endpoint to access your model:

bash

$ curl -X POST https://api.scaleway.com/inference/v1beta1/regions/fr-par/endpoints \
-H "Content-Type: application/json" \
-H "X-Auth-Token: <API secret key>" \
-d '{
  "project_id": "<Scaleway Project ID>",
  "deployment_id": "<Scaleway Deployment UUID>",
  "endpoints": [
    {
      "disable_auth": false,
      "public": {}
    }
  ]
}'

If disable_auth is set to true, you won't need the API key <IAM API key> for the rest.

Interact with Moshi

Once you've deployed your Moshi model and created an endpoint to reach it, you can use one of our clients to interact with it.

bash

$ git clone https://github.com/scaleway/moshi-client-examples

FAQ

Kyutai aims to enhance Moshi's knowledge base and factual accuracy with community support. Future updates will focus on refining the model and its scalability to handle more complex and longer conversations in additional languages.

Moshi has a limited context window and conversations longer than 5 minutes will be stopped. It also has a limited knowledge base covering the years 2018 to 2023, which can lead to repetitive or inconsistent responses during prolonged interactions.

You can find a comprehensive guide here on getting started, including details on deployment, security, and billing. If you need further assistance, feel free to reach out to us through the Slack community #inference-beta.

Moshi requires some experience to master. It may wait for you to finish speaking before responding. If Moshi takes too long, simply say "go on." Moshi will understand that it's its turn to speak.

To evaluate toxicity during content generation, the ALERT benchmark by Simone Tedeschi has been applied to Moshi. Moshi's score is 83.05 (Falcon: 88.11, GPT-4: 99.18). A higher score indicates a less "toxic" model.