Speech recognition and synthesis
Various speech recognition and synthesis engines can be plugged into OpenDial. One can either rely on cloud-based APIs such as those provided by Nuance and AT&T, or perform speech recognition and synthesis locally with software such as Sphinx and Mary TTS.
Nuance Speech API
A bridge to the Nuance Speech API is already installed by default in OpenDial. In order to access the API, you first need to register as a developer on the Nuance Mobile Developer website. Once registered, go to "View Sandbox credentials" on the Developer website, where you will find the AppID and key (look for the one for HTTP Client Applications) that are necessary to use the plugin.
The plugin relies on three parameters:
- id (mandatory) is the Sandbox application ID
- key (mandatory) is the Sandbox application key
- lang (mandatory) is the language code for your application (e.g. eng-USA, nor-NOR, etc.)
To start OpenDial with the plugin, you can directly add the required parameters to the command line:
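A minimal invocation could look as follows, assuming OpenDial is launched through Gradle and the plugin is activated via a modules parameter pointing to a class named opendial.plugins.NuanceSpeech (both the launch command and the class name are assumptions; adapt them to your installation):

```
gradle run -Dmodules=opendial.plugins.NuanceSpeech \
    -Did=your-sandbox-app-id \
    -Dkey=your-sandbox-app-key \
    -Dlang=eng-USA
```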
Alternatively, you can also specify these parameters in the XML domain specification:
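For instance, the settings block of the domain file could look like this (the exact element layout and the module class name are assumptions to be checked against your OpenDial version; the credential values are placeholders):

```xml
<domain>
  <settings>
    <modules>opendial.plugins.NuanceSpeech</modules>
    <id>your-sandbox-app-id</id>
    <key>your-sandbox-app-key</key>
    <lang>eng-USA</lang>
  </settings>
  <!-- rest of the domain specification -->
</domain>
```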
If the plugin is successfully started, you should see a button "Press and hold to record speech" beneath the chat window of OpenDial:
Interface to capture speech input in OpenDial.
Simply click and hold the button to record speech input, and release it to send the audio signal to Nuance's speech API for recognition. You can change the device used to capture audio via the menu Options -> Audio Input. If the audio data is properly captured, you should see the volume bar moving while you speak.
For improved speech recognition results, it is recommended to use custom vocabularies tailored to your application. See the Nuance developer website for instructions on how to upload and activate such custom vocabularies. Please note that Nuance's Sandbox access is limited to 5000 transactions per day, so you may need to upgrade your account if your application requires a higher number of recognition or synthesis requests.
AT&T Speech API
The AT&T Speech API can also be used to perform cloud-based speech recognition and synthesis. Compared to Nuance, the AT&T Speech API has a slightly higher error rate and is somewhat slower. However, it allows system developers to provide a full recognition grammar to the recogniser, while Nuance only allows for custom vocabularies.
The AT&T speech plugin requires an application ID and secret. To obtain these, you first need to register as a developer on the AT&T Developer website. Once registered, you must set up a new application with access to the "Speech To Text Custom" and "Text to Speech" APIs. At the end of the setup procedure, a confidential application key and secret will be created.
As for the Nuance plugin, you can start the AT&T plugin by directly adding the required parameters to the command line:
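Such an invocation could be sketched as follows, assuming the plugin class is named opendial.plugins.ATTSpeech and the credentials are passed as id and secret (the class name and these two parameter names are assumptions; adapt them to your installation):

```
gradle run -Dmodules=opendial.plugins.ATTSpeech \
    -Did=your-app-id \
    -Dsecret=your-app-secret \
    -Dgrammar=path/to/your/grammar.grxml
```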
The grammar parameter is optional. You can also specify these parameters in the XML domain specification (cf. previous section).
As for the Nuance Speech API, a button "Press and hold to record speech" beneath the chat window should appear upon starting the system. You can change the device used to capture audio via the menu Options -> Audio Input. If you want to provide the speech API with a recognition grammar, you can do so by specifying a grammar in the GRXML format and passing its path in the grammar parameter.
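As an illustration, here is a minimal grammar in the W3C SRGS XML (GRXML) format, recognising commands such as "turn on the light" (the rule content is purely illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="command">
  <rule id="command" scope="public">
    <one-of>
      <item>turn on</item>
      <item>turn off</item>
    </one-of>
    the
    <one-of>
      <item>light</item>
      <item>heating</item>
    </one-of>
  </rule>
</grammar>
```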
Sphinx ASR and Mary TTS
If you don't want to use cloud-based solutions for speech recognition and synthesis (for instance, if your dialogue system is not always online), you can use locally installed software such as Sphinx and Mary TTS.
You can attach the two modules to the dialogue system from the command line, passing the acoustic model and dictionary as parameters:

```
-Dacousticmodel=resource_to_your_acoustic_model \
-Ddictionary=resource_to_your_dictionary
```
The easiest way to get a working acoustic model and dictionary is to uncomment the line containing sphinx4-data in the build.gradle file, and to specify resource:/edu/cmu/sphinx/models/en-us/en-us as the acoustic model and resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict as the dictionary.
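Putting these pieces together, a complete invocation could look as follows. The gradle launch command, the plugin class names opendial.plugins.SphinxASR and opendial.plugins.MaryTTS, and the grammar parameter name are assumptions; the acoustic model and dictionary values are the ones from the sphinx4-data package:

```
gradle run -Dmodules=opendial.plugins.SphinxASR,opendial.plugins.MaryTTS \
    -Dgrammar=path/to/your/grammar.gram \
    -Dacousticmodel=resource:/edu/cmu/sphinx/models/en-us/en-us \
    -Ddictionary=resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict
```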
Alternatively, you can attach the modules by modifying the domain specification:
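A sketch of the corresponding settings block (element layout, module class names, and the grammar parameter name are assumptions to be checked against your OpenDial version):

```xml
<domain>
  <settings>
    <modules>opendial.plugins.SphinxASR,opendial.plugins.MaryTTS</modules>
    <grammar>path/to/your/grammar.gram</grammar>
    <acousticmodel>resource:/edu/cmu/sphinx/models/en-us/en-us</acousticmodel>
    <dictionary>resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict</dictionary>
  </settings>
  <!-- rest of the domain specification -->
</domain>
```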
The recognition grammar for Sphinx must be specified in JSGF format. Instead of a recognition grammar, you can also specify a statistical language model (-Dlm=resource_to_your_language_model). Note that the bridge to Sphinx ASR is quite rudimentary and will most likely require some fine-tuning in order to get decent recognition results.
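For reference, a minimal recognition grammar in the JSGF format could look like this (the rule content is purely illustrative):

```
#JSGF V1.0;
grammar commands;
public <command> = (turn on | turn off) the (light | heating);
```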
For Mary TTS, you need to have at least one voice model in the classpath (one easy fix is to uncomment the line containing voice-cmu-slt-hsmm in the dependencies of the build.gradle file).
Of course, nothing prevents you from writing your own modules to connect OpenDial to other components (see the section External modules for details).
If you own a Nao robot, a bridge to the embedded Vocon speech recognition is already implemented as part of the Nao plugin (available in the Download page).