Voice Assistant Entity

[!NOTE] The Home Assistant integration is used as the first reference implementation, followed by Android TV.

A voice assistant entity interacts with a voice assistant or sends voice commands to voice-capable devices. It can request an audio stream from the Remote's microphone when the user pushes the voice button.

This allows forwarding voice commands to the voice control feature of a device, or to a cloud-based voice assistant like Amazon Alexa or Google Home.

Please note that an integration does not have direct access to the microphone and cannot initiate voice recording on its own. Recording starts only when the user presses the voice button.

Features

The features describe the capabilities of the voice assistant.

Some devices only offer simple one-way voice command features, where the result of the command is shown or signaled on the device itself, for example, with a speech response on a smart speaker. Android TV is such a device: a voice command can be sent, but the result is only shown or played on the TV.

Smart home systems can offer two-way voice command features, where the results of the command are also returned to the client. Home Assistant offers such functionality with its Assist pipeline, making the individual processing steps like speech-to-text, intent recognition, and speech feedback accessible.

| Name | R | W | Description |
|------|---|---|-------------|
| transcription |  |  | Speech to text response of the transcribed voice command |
| response_text |  |  | Textual response of the performed action |
| response_speech |  |  | Voice response of the performed action |
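
For illustration, a two-way voice assistant could declare all three features in its entity definition. The shape below follows the common entity pattern of the Integration-API and should be read as a sketch, not as the normative schema:

{
  "entity_id": "va-1",
  "entity_type": "voice_assistant",
  "name": { "en": "Smart Home Assistant" },
  "features": ["transcription", "response_text", "response_speech"]
}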

Attributes

| Attribute | Features | Type | Values | Description |
|-----------|----------|------|--------|-------------|
| state |  | enum | States | Default entity state attribute. |

States

The entity state attribute holds the following values:

| Value | Description |
|-------|-------------|
| OFF | The device is ready, but the voice feature is not available |
| ON | Ready for voice commands |

Also includes the common entity states.

Device Classes

None.

Options

| Name | Type | Default | Description |
|------|------|---------|-------------|
| audio_cfg | AudioConfiguration | {} | Audio stream format |
| profiles | Profile[] |  | List of supported profiles |
| preferred_profile | String |  | Preferred profile to use as default |
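
For example, an integration that prefers a lower sample rate could set the following options. This sketch assumes that omitted AudioConfiguration properties keep their defaults; Profile objects are described in the Profiles section below:

{
  "audio_cfg": {
    "sample_rate": 8000
  }
}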

Audio Configuration

The default audio stream the integration receives from the Remote is in PCM 16 kHz mono 16-bit signed format. Audio chunks are sent every 100–200 ms.
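At the default format, this corresponds to 16000 samples/s × 2 bytes × 0.1–0.2 s, i.e. roughly 3200–6400 bytes of audio payload per chunk.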

The audio_cfg option can specify a different audio format to be sent to the integration. If an unsupported audio stream is requested, the default setting is used. For example, if a sample rate of 4 kHz is requested, the default sample rate of 16 kHz is used.

The AudioConfiguration object of the audio_cfg option has the following properties:

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| channels | Number | 1 | Number of audio channels: 1 or 2 |
| sample_rate | Number | 16000 | Sample rate of the audio stream in Hz. |
| sample_format | enum | I16 | Audio frame format. |

The actual audio stream format is included in the voice_start command and in the first RemoteVoiceBegin protobuf message when the audio stream is started. See the Integration-API for the full definition.
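
The fallback rule can be pictured with a small sketch: each requested property is checked against the supported values and silently replaced by its default when unsupported. The supported value sets below are assumptions for illustration, not values from the Integration-API:

# Illustrative sketch of the fallback rule for requested audio configurations.
# The SUPPORTED value sets are assumptions, not Integration-API values.
DEFAULTS = {"channels": 1, "sample_rate": 16000, "sample_format": "I16"}
SUPPORTED = {
    "channels": {1, 2},
    "sample_rate": {8000, 16000, 44100},
    "sample_format": {"I16"},
}

def effective_audio_cfg(requested: dict) -> dict:
    """Merge a requested AudioConfiguration with the defaults."""
    cfg = dict(DEFAULTS)
    for key, value in requested.items():
        if value in SUPPORTED.get(key, set()):
            cfg[key] = value  # supported: use the requested value
        # otherwise keep the default for this property
    return cfg

# A 4 kHz request falls back to the default sample rate of 16 kHz:
assert effective_audio_cfg({"sample_rate": 4000})["sample_rate"] == 16000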

Profiles

The profiles option allows specifying a list of voice assistant profiles that can be selected when starting a voice command. Profiles are optional and allow parameterizing the voice input. A regular voice-capable device usually just accepts voice input without additional parameters, whereas home automation systems can offer multi-language support or an option to choose between local and cloud processing.

For example, Home Assistant allows configuring multiple Assist pipelines for voice commands. These pipelines can offer different languages or speech recognition engines.

The Profile object has the following properties:

| Property | Type | Description |
|----------|------|-------------|
| id | String | Profile identifier |
| name | String | Friendly name to show in UI |
| language | String | Optional: language code for speech recognition if the profile represents a specific language |
| transcription | bool | Optional: supports voice command transcription. Entity feature is used if not specified. |
| response_text | bool | Optional: supports textual response about the performed action. Entity feature is used if not specified. |
| response_speech | bool | Optional: supports speech response about the performed action. Entity feature is used if not specified. |

The profile identifier can be specified in the profile_id parameter of the voice_start command.
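
As an illustration, a Home Assistant driver could announce two Assist pipelines as profiles; the identifiers and names below are made up for this example:

{
  "profiles": [
    { "id": "pipeline-en", "name": "Assist (English)", "language": "en" },
    { "id": "pipeline-de", "name": "Assist (Deutsch)", "language": "de" }
  ],
  "preferred_profile": "pipeline-en"
}

A client would then send "profile_id": "pipeline-de" in the voice_start command to address the German pipeline.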

Integration API

The following sequence diagram shows the happy-path message flow of a voice command initiated by the UI. It is a simplified representation of the communication between the various components, showing the relevant Core-API and Integration-API messages. The interaction between an integration driver and the voice assistant depends on the involved system and the available features.

sequenceDiagram
    participant U as UI
    participant R as Remote
    participant I as Integration
    participant V as Voice Assistant

    U-)R: voice_start
    activate R
    R-->R: check microphone enabled
    R-)I:  voice_start
    activate I
    I--)R: result (ok)
    R--)U: result (ok)
    deactivate R

    I->>V: initiate voice command
    I--)R: assistant_event (ready)
    deactivate I
    activate R
    R--)U: assistant_event (ready)
    U-->U: show ready

    R-)I:  protobuf: voice_begin
    activate I
    I->>V: start voice command
    loop microphone button pressed
        R-->R:  get microphone audio chunk
        R-)I:   protobuf: voice_data
        I-)V:   relay audio stream
    end
    deactivate R
    
    U-)+R:  voice_end
    R--)U: result (ok)
    R-)-I:  protobuf: voice_end
    I->>V: stop voice command
    deactivate I
    V-->V: processing
    
    opt transcription feature
        V--)I: STT response
        I--)R: assistant_event (stt_response)
        R--)U: assistant_event (stt_response)
        U-->U: show response
    end
    opt response_text feature
        V--)I: intent response
        I--)R: assistant_event (text_response)
        R--)U: assistant_event (text_response)
        U-->U: show response
    end
    opt response_speech feature
        V--)I: speech response
        I--)R: assistant_event (speech_response)
        R--)U: assistant_event (speech_response)
        U-->U: play audio response
    end
    I--)R: assistant_event (finished)
    R--)U: assistant_event (finished)

Commands

The integration driver has to implement a handler for the entity_command WebSocket message to process the following command requests in msg_data.cmd_id. See Integration-API for the full message structure.

| cmd_id | Parameters | Type | Description |
|--------|------------|------|-------------|
| voice_start | session_id | Number | Audio session identifier that will be used in follow-up binary voice messages. |
|  | audio_cfg | AudioConfiguration | Audio stream format used in WS binary messages. |
|  | speech_response | bool | Optional: enable voice response of the performed action. |
|  | timeout | Number | Optional: processing timeout in seconds. |
|  | profile_id | String | Optional: profile used for the voice assistant command. |

Notes:

  • The Integration-API only defines the voice_start WebSocket command.
  • The end-of-stream notification is sent with the Protobuf message RemoteVoiceEnd.
  • The Core-API defines voice_start and voice_end commands.

voice_start

After the voice_start command has been confirmed, the audio stream is started and transmitted as binary WebSocket messages in protocol buffer format (see protobuf messages).

  • The confirmation message must be sent within 2 seconds, or the user interface might abort the command.
  • If the integration already knows that it can't process voice commands, it needs to send a negative response.
  • The integration must send the ready event with the provided session_id when it is ready to receive the audio stream.
  • The audio stream is stopped if an error event is sent by the integration.

Example:

{
  "kind": "req",
  "id": 123,
  "msg": "entity_command",
  "msg_data": {
    "entity_type": "voice_assistant",
    "entity_id": "va-1",
    "cmd_id": "voice_start",
    "params": {
      "session_id": 8,
      "audio_cfg": {
        "channels": 1,
        "sample_rate": 8000,
        "sample_format": "I16"
      }
    }
  }
}
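
A driver-side handler for this command could look like the following sketch. The send_response, send_event, and assistant_available helpers are hypothetical stand-ins for the driver's WebSocket plumbing and availability check; the response codes are illustrative:

import json

# Hypothetical plumbing: a real driver would send these over its WebSocket
# connection instead of printing them.
def send_response(req_id: int, code: int) -> None:
    print(f"result for request {req_id}: {code}")

def send_event(event: dict) -> None:
    print("event:", json.dumps(event))

def assistant_available() -> bool:
    return True  # e.g. check that the required cloud service is reachable

def on_entity_command(msg: dict) -> None:
    data = msg["msg_data"]
    if data.get("cmd_id") != "voice_start":
        return

    if not assistant_available():
        # Send a negative response if voice commands can't be processed.
        send_response(msg["id"], code=503)
        return

    # Confirm within 2 seconds, or the user interface might abort the command.
    send_response(msg["id"], code=200)

    # Emit the ready event with the provided session_id once the driver is
    # ready to receive the audio stream.
    send_event({
        "kind": "event",
        "msg": "assistant_event",
        "cat": "ENTITY",
        "msg_data": {
            "type": "ready",
            "entity_id": data["entity_id"],
            "session_id": data["params"]["session_id"],
        },
    })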

Events

Entity change event

The regular entity_change event must be emitted by the integration driver if the state of the voice assistant changes. For example, if a voice assistant becomes unavailable because a required cloud service is no longer responding.

The following attributes must be included:

AttributeDescription
stateNew entity state.

Example:

{
  "kind": "event",
  "msg": "entity_change",
  "cat": "ENTITY",
  "msg_data": {
    "entity_type": "voice_assistant",
    "entity_id": "va-1",
    "attributes": {
      "state": "ON"
    }
  }
}

Assistant event

The assistant_event must be emitted by the integration driver to start the audio stream and to provide optional feedback about the voice command processing and its outcome.

| Event | Data | Type | Description |
|-------|------|------|-------------|
| ready |  |  | Integration is ready to receive voice audio stream. |
| stt_response | text | string | Optional. Transcribed text from voice audio stream. |
| text_response | text | string | Optional. Textual response about the performed action. |
|  | success | bool | Action result. |
| speech_response | url | string | Optional. Speech response about the performed action. |
|  | mime_type | string |  |
| finished |  |  | Voice processing finished. |
| error | code | string | Processing error while sending or processing the audio stream. |
|  | message | string |  |

Supported audio mime types for the speech_response event:

  • audio/mpeg
  • audio/mp3
  • audio/wav
  • audio/x-wav
  • audio/ogg
  • audio/opus
  • audio/webm
  • audio/flac
  • audio/aac

See the Integration-API for the full event definitions.

Ready event

The ready event is emitted after the voice_start command has been confirmed. The session_id is used to identify the audio stream.

{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "ready",
    "entity_id": "va-1",
    "session_id": 8
  }
}

Speech to text event

{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "stt_response",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "text": "Switch off the living room lights"
    }
  }
}

Text response event

{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "text_response",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "success": true,
      "text": "Switched off living room lights"
    }
  }
}

Speech response event

{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "speech_response",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "url": "https://smart.home/api/tts_proxy/6ZZGII-UgUfEEI8CbH1TNg.mp3",
      "media_type": "audio/mpeg"
    }
  }
}
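
Finished event

By analogy with the events above, a finished event carries no data payload (a sketch; see the Integration-API for the normative definition):

{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "finished",
    "entity_id": "va-1",
    "session_id": 8
  }
}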

Error event

{
  "kind": "event",
  "msg": "assistant_event",
  "cat": "ENTITY",
  "msg_data": {
    "type": "error",
    "entity_id": "va-1",
    "session_id": 8,
    "data": {
      "code": "NO_TEXT_RECOGNIZED",
      "message": "I did not understand"
    }
  }
}

Defined error codes:

  • NO_TEXT_RECOGNIZED
  • SERVICE_UNAVAILABLE
  • INVALID_AUDIO
  • INTENT_FAILED
  • TTS_FAILED
  • TIMEOUT
  • UNEXPECTED_ERROR