Text-to-speech feed into HiFi


Thinking about non-player characters, I’d like to send the output of a TTS program into HiFi so an NPC can speak. Does such a function exist in an API, is it something that can be programmed in JS, or is it not possible now (or ever)?

I just found the botProcedural script (via Judas’s log) and am checking it out. OK, that seems to be about dancing and sound files — not what I am interested in. What I have in mind is something similar, but which animates as appropriate for speech interaction and emits a real-time audio stream generated by TTS, driven by a Siri/Cortana-type back end.

Just learned of an avatar type meant for NPCs, called simply ‘avatar’ vs. ‘myavatar’, the one used for personal representations. Am looking into it.


Very nice idea @Simulacron3. I plan to use NPCs for a game-based domain, and text-to-speech would be awesome.


This is something I’ve also been thinking about. Here is how I’ve been thinking of achieving a speaking NPC:

You would need to create a new ScriptingInterface with Qt that uses, for example, http://api.ai/docs/reference/#tts to create the sound file. Using that new interface from JS would be something like creating the TTS object with your api.ai access keys and then calling something like speak(“this text”). The interface would make an HTTP call to api.ai, get the sound file, and then play it back just like you can currently play sound files. Shouldn’t be that difficult to implement. I am just lacking the time and motivation to execute my plan. :smile:

And of course, once you have TTS working, it should be rather simple to extend the interface to make your NPC a little bit more intelligent with the AI features that api.ai provides.
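To make the idea concrete, here is a minimal sketch of what the JS side of such an interface might look like. Everything here is hypothetical: `TTSInterface`, `requestSpeech`, and `playSoundFile` are made-up names, and the api.ai HTTP call and HiFi playback are replaced with stubs so the flow is visible and runnable anywhere:

```javascript
// Hypothetical JS-side wrapper for the proposed Qt ScriptingInterface.
// The HTTP call to the TTS backend and the audio playback are injected as
// functions so they can be stubbed here.
function TTSInterface(accessToken, requestSpeech, playSoundFile) {
  this.accessToken = accessToken;
  this.requestSpeech = requestSpeech; // (token, text) -> sound-file URL
  this.playSoundFile = playSoundFile; // (url) -> plays the file
}

TTSInterface.prototype.speak = function (text) {
  var fileUrl = this.requestSpeech(this.accessToken, text);
  this.playSoundFile(fileUrl);
  return fileUrl;
};

// Usage with stubs standing in for the real backend and audio engine:
var played = [];
var tts = new TTSInterface(
  "demo-token",
  function (token, text) {
    // Stub for the api.ai HTTP call; just fabricates a URL.
    return "https://example.invalid/tts/" + encodeURIComponent(text) + ".wav";
  },
  function (url) {
    // Stub for playback; records what would be played.
    played.push(url);
  }
);

var url = tts.speak("Hello, traveler");
```

The real work would live in the two injected functions; the wrapper itself is just the speak(text) shape the post describes.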


Would that approach work for real-time exchange? Could that somehow work with an audio stream instead of a file, so you don’t have to write a file and then read it again?

Could I do all the voice recognition and generation locally and feed an audio stream into HF in place of user voice input? Effectively, HF would see the NPC as just another user. Is that insane or something?


My experience with TTS+AI systems is that none of them are “real-time.” I believe they almost always use some kind of backend, and if there is lag in the network they do not react that fast. But keeping the text short, and maybe converting one sentence at a time, might make it feel a little bit more like magic.
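The one-sentence-at-a-time idea could be sketched with a naive splitter (assuming sentences end in ‘.’, ‘!’, or ‘?’); each chunk would then go to the TTS backend as its own small request, so playback of the first sentence can start while later ones are still converting:

```javascript
// Naive sentence splitter: break text on ., ! or ? so each sentence can be
// sent to the TTS backend separately instead of one big request.
function splitSentences(text) {
  var matches = text.match(/[^.!?]+[.!?]+/g);
  return matches
    ? matches.map(function (s) { return s.trim(); })
    : [text.trim()];
}

var chunks = splitSentences("Hello there. How are you today? Fine!");
// chunks: ["Hello there.", "How are you today?", "Fine!"]
```

A real splitter would need to handle abbreviations and ellipses, but for short NPC lines this is probably good enough.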

Anyway, I have heard that you should implement first and worry about performance later. I mean, as a first step I would be blown away just to make the NPC understand my speech and talk back with something that makes at least a little bit of sense. It should be possible to achieve this with api.ai; one just needs to add some extra features to the current scripting engine.


I was just messing with this text-to-speech site.
I think text to speech could be a cool thing to make work in-world for a few reasons.
Firstly, not everyone can speak.
Secondly, not everyone can speak at all times: a noisy room, people eavesdropping.
Thirdly, some people want to disguise their voice.


There is someone using text to speech in here already (maybe one of you?), but she/he says it is hard because of needing to switch between something and something else and turn something on and off, etc. @thoys mentioned something about connecting it directly to or via Google’s tool, but you would have to ask him what he meant, because it went over my head.


Thanks for the link, Judas. Their service has a JS API that returns an SWF or MP3 sound file, so all we need is the capability to play the sound file, which is described in the docs at http://docs.highfidelity.com/v1.0/docs/audio-functions. However, HF seems to want .wav or .raw, so a format conversion step may be needed.

So it looks like the pipeline is something like:
bot generates text --> TTS service --> local speech file --> URL --> HF audio injection script

I’m not ready to write the HF script for that yet, but it looks pretty simple.
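The pipeline above could be sketched as composed stages. Every function here is a stub with a made-up name: a real version would call an actual TTS service, write and host the file, and hand the URL to an HF audio-injection script.

```javascript
// Sketch of the pipeline as composed stages; each stub stands in for one
// arrow in the diagram above.
function generateText(botState) {        // bot generates text
  return "Welcome to my domain.";
}
function ttsService(text) {              // TTS service -> sound data
  return { format: "wav", data: "<audio for: " + text + ">" };
}
function saveLocalFile(sound) {          // local speech file (HF wants .wav/.raw)
  return "/tmp/npc-speech." + sound.format;
}
function publishUrl(path) {              // file -> URL the HF script can fetch
  return "http://localhost:8000" + path;
}
function injectAudio(url, injected) {    // HF audio injection script (stubbed)
  injected.push(url);
  return url;
}

var injected = [];
var spoken = injectAudio(
  publishUrl(saveLocalFile(ttsService(generateText({})))),
  injected
);
```

The nesting makes the data flow explicit: each stage consumes exactly what the previous one produces, which is what makes the HF script at the end “pretty simple.”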


Very nice. As @SterlingWright mentioned, I have also heard one of our fellow alphas using text-to-speech in HiFi.


I’ve found that AT&T has an API for both TTS and STT, with a free sandbox and service starting at $99/year.

What I’d like to implement is an agent capable of two-way voice exchange in HF. Now on my To Investigate list.