Voice Assistant Entity
[!NOTE] The Home Assistant integration is used as the first reference implementation, followed by Android TV.
A voice assistant entity interacts with a voice assistant or sends voice commands to voice-capable devices. It can request an audio stream from the Remote's microphone when the user pushes the voice button.
This allows forwarding voice commands to the voice control feature of a device, or to a cloud-based voice assistant like Amazon Alexa or Google Home.
Please note that an integration does not have direct access to the microphone and cannot initiate voice recording on its own: recording starts only when the user presses the voice button.
Features
The features describe the capabilities of the voice assistant.
Some devices only offer simple one-way voice command features, where the result of the command is shown or signaled on the device itself, for example, with a speech response on a smart speaker. Android TV is such a device: a voice command can be sent, but the result is only shown or played on the TV.
Smart home systems can offer two-way voice command features, where the results of the command are also returned to the client. Home Assistant offers such functionality with the Assist pipeline: the individual processing steps like speech-to-text, intent recognition, and speech feedback are accessible.
| Name | R | W | Description |
|---|---|---|---|
| transcription | ✅ | ❌ | Speech-to-text transcription of the voice command |
| response_text | ✅ | ❌ | Textual response of the performed action |
| response_speech | ✅ | ❌ | Voice response of the performed action |
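For illustration, a voice assistant entity supporting all three features might be announced as follows. This is a minimal sketch assuming the usual available-entity structure of the Integration-API; the entity_id and name are made up:
{
  "entity_id": "va-1",
  "entity_type": "voice_assistant",
  "features": ["transcription", "response_text", "response_speech"],
  "name": { "en": "Living room assistant" },
  "attributes": { "state": "ON" }
}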
Attributes
| Attribute | Features | Type | Values | Description |
|---|---|---|---|---|
| state | | enum | States | Default entity state attribute. |
States
The entity state attribute holds the following values:
| Value | Description |
|---|---|
| OFF | The device is ready but the voice feature is not available |
| ON | Ready for voice commands |
Also includes the common entity states.
Device Classes
None.
Options
| Name | Type | Default | Description |
|---|---|---|---|
| audio_cfg | AudioConfiguration | {} | Audio stream format |
| profiles | Profile | [] | List of supported profiles |
| preferred_profile | String | | Preferred profile to use as default |
Audio Configuration
The default audio stream the integration receives from the Remote is in PCM 16 kHz mono 16-bit signed format. Audio chunks are sent every 100–200 ms.
The audio_cfg option can specify a different audio format to be sent to the integration.
If an unsupported audio stream is requested, the default setting is used. For example, if a sample rate of 4 kHz is
requested, the default sample rate of 16 kHz is used.
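For example, with the default PCM 16 kHz mono 16-bit format, each 100 ms audio chunk carries 16000 samples/s × 2 bytes/sample × 0.1 s = 3200 bytes of audio data.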
The AudioConfiguration object of the audio_cfg option has the following properties:
| Property | Type | Default | Description |
|---|---|---|---|
| channels | Number | 1 | Number of audio channels: 1 or 2. |
| sample_rate | Number | 16000 | Sample rate of the audio stream in Hz. |
| sample_format | enum | I16 | Audio frame format. |
The actual audio stream format is included in the voice_start command and in the first
RemoteVoiceBegin protobuf message when the audio stream is
started. See the Integration-API for the full definition.
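As an illustration, an integration that prefers 8 kHz mono input could declare the following audio_cfg option in its entity definition (a sketch, using the property names from the table above):
"audio_cfg": {
  "channels": 1,
  "sample_rate": 8000,
  "sample_format": "I16"
}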
Profiles
The profiles option allows specifying a list of voice assistant profiles that can be selected when starting a voice command.
Profiles are optional and allow parameterizing voice input. A regular voice-capable device usually just accepts voice
input without additional parameters. Home automation systems can offer multi-language support or an option to use
local or cloud processing.
For example, Home Assistant allows configuring multiple Assist pipelines for voice commands. These pipelines can offer different languages or speech recognition engines.
The Profile object has the following properties:
| Property | Type | Description |
|---|---|---|
| id | String | Profile identifier |
| name | String | Friendly name to show in UI |
| language | String | Optional: language code for speech recognition if the profile represents a specific language |
| transcription | bool | Optional: supports voice command transcription. Entity feature is used if not specified. |
| response_text | bool | Optional: supports textual response about the performed action. Entity feature is used if not specified. |
| response_speech | bool | Optional: supports speech response about the performed action. Entity feature is used if not specified. |
The profile identifier can be specified in the profile_id parameter of the voice_start command.
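As a sketch, a Home Assistant-style integration exposing two Assist pipelines as profiles could declare the following options (the identifiers and names below are illustrative):
"profiles": [
  {
    "id": "pipeline-en",
    "name": "Cloud Assist (English)",
    "language": "en",
    "response_text": true,
    "response_speech": true
  },
  {
    "id": "pipeline-de",
    "name": "Local Assist (German)",
    "language": "de",
    "response_speech": false
  }
],
"preferred_profile": "pipeline-en"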
Integration API
The following sequence diagram shows the happy-path message flow of a voice command initiated by the UI. It is a simplified representation of the communication between the various components, showing the relevant Core-API and Integration-API messages. The interaction between an integration driver and the voice assistant depends on the involved system and the available features.
sequenceDiagram
participant U as UI
participant R as Remote
participant I as Integration
participant V as Voice Assistant
U-)R: voice_start
activate R
R-->R: check microphone enabled
R-)I: voice_start
activate I
I--)R: result (ok)
R--)U: result (ok)
deactivate R
I->>V: initiate voice command
I--)R: assistant_event (ready)
deactivate I
activate R
R--)U: assistant_event (ready)
U-->U: show ready
R-)I: protobuf: voice_begin
activate I
I->>V: start voice command
loop microphone button pressed
R-->R: get microphone audio chunk
R-)I: protobuf: voice_data
I-)V: relay audio stream
end
deactivate R
U-)+R: voice_end
R--)U: result (ok)
R-)-I: protobuf: voice_end
I->>V: stop voice command
deactivate I
V-->V: processing
opt transcription feature
V--)I: STT response
I--)R: assistant_event (stt_response)
R--)U: assistant_event (stt_response)
U-->U: show response
end
opt response_text feature
V--)I: intent response
I--)R: assistant_event (text_response)
R--)U: assistant_event (text_response)
U-->U: show response
end
opt response_speech feature
V--)I: speech response
I--)R: assistant_event (speech_response)
R--)U: assistant_event (speech_response)
U-->U: play audio response
end
I--)R: assistant_event (finished)
R--)U: assistant_event (finished)
Commands
The integration driver has to implement a handler for the entity_command WebSocket message to process the following
command requests in msg_data.cmd_id. See the Integration-API for the full message structure.
| cmd_id | Parameters | Type | Description |
|---|---|---|---|
| voice_start | session_id | Number | Audio session identifier that will be used in follow-up binary voice messages. |
| | audio_cfg | AudioConfiguration | Audio stream format used in WS binary messages. |
| | speech_response | bool | Optional: enable voice response of the performed action. |
| | timeout | Number | Optional: processing timeout in seconds. |
| | profile_id | String | Optional: profile used for the voice assistant command. |
Notes:
- The Integration-API only defines the voice_start WebSocket command.
- The end-of-stream notification is sent with the Protobuf message RemoteVoiceEnd.
- The Core-API defines voice_start and voice_end commands.
voice_start
After confirming the voice_start command, the audio stream is started and transmitted as binary WebSocket messages
in protocol buffer format (see protobuf messages).
- The confirmation message must be sent within 2 seconds, or the user interface might abort the command.
- If the integration already knows that it can't process voice commands, it needs to send a negative response.
- The integration must send the ready event with the provided session_id when it is ready to receive the audio stream.
- The audio stream is stopped if an error event is sent by the integration.
Example:
{
  "kind": "req",
  "id": 123,
  "msg": "entity_command",
  "msg_data": {
    "entity_type": "voice_assistant",
    "entity_id": "va-1",
    "cmd_id": "voice_start",
    "params": {
      "session_id": 8,
      "audio_cfg": {
        "channels": 1,
        "sample_rate": 8000,
        "sample_format": "I16"
      }
    }
  }
}
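The confirmation is sent as a regular result response. A sketch of a positive confirmation for the request above, assuming the common result message structure of the Integration-API:
{
  "kind": "resp",
  "req_id": 123,
  "code": 200,
  "msg": "result"
}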
Events
Entity change event
The regular entity_change event must be emitted by the integration driver if the state of the voice assistant changes.
For example, if a voice assistant becomes unavailable because a required cloud service is no longer responding.
The following attributes must be included:
| Attribute | Description |
|---|---|
| state | New entity state. |
Example:
{
  "kind": "event",
  "msg": "entity_change",
  "cat": "ENTITY",
  "msg_data": {
    "entity_type": "voice_assistant",
    "entity_id": "va-1",
    "attributes": {
      "state": "ON"
    }
  }
}
Assistant event
The assistant_event must be emitted by the integration driver to start the audio stream and to provide optional feedback about the voice command processing and its outcome.
| Event | Data | Type | Description |
|---|---|---|---|
| ready | | | Integration is ready to receive the voice audio stream. |
| stt_response | text | string | Optional. Transcribed text from the voice audio stream. |
| text_response | text | string | Optional. Textual response about the performed action. |
| | success | bool | Action result. |
| speech_response | url | string | Optional. Speech response about the performed action. |
| | mime_type | string | Mime type of the speech audio file. |
| finished | | | Voice processing finished. |
| error | code | string | Processing error while sending or processing the audio stream. |
| | message | string | Human-readable error message. |
Supported audio mime types for the speech_response event:
- audio/mpeg
- audio/mp3
- audio/wav
- audio/x-wav
- audio/ogg
- audio/opus
- audio/webm
- audio/flac
- audio/aac
See the Integration-API for the full event definitions.
Ready event
The ready event is emitted after the voice_start command has been confirmed.
The session_id is used to identify the audio stream.
{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "ready",
    "entity_id": "va-1",
    "session_id": 8
  }
}
Speech to text event
{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "stt_response",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "text": "Switch off the living room lights"
    }
  }
}
Text response event
{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "text_response",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "success": true,
      "text": "Switched off living room lights"
    }
  }
}
Speech response event
{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "speech_response",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "url": "https://smart.home/api/tts_proxy/6ZZGII-UgUfEEI8CbH1TNg.mp3",
      "mime_type": "audio/mpeg"
    }
  }
}
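Finished event
The finished event signals that voice command processing has completed and carries no additional data. A sketch following the structure of the other assistant events:
{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "finished",
    "entity_id": "va-1",
    "session_id": 8
  }
}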
Error event
{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "error",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "code": "NO_TEXT_RECOGNIZED",
      "message": "I did not understand"
    }
  }
}
Defined error codes:
- NO_TEXT_RECOGNIZED
- SERVICE_UNAVAILABLE
- INVALID_AUDIO
- INTENT_FAILED
- TTS_FAILED
- TIMEOUT
- UNEXPECTED_ERROR