Simon says: conversational voice interaction?


So when implementing classic interactive voice response, one usually has to come up with a pre-canned set of phrases and then map them to a pre-canned set of commands. Even when the phrases are parametric, this approach tends to prioritize the machine over the speaker.

But with services like Google’s universal speech-to-text, a more spontaneous approach becomes possible – essentially, improvising solutions using conversational engineering techniques. At the code level it then boils down to deciphering arbitrary text messages.
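To make “deciphering arbitrary text” concrete, here’s a minimal standalone sketch (the function name and word lists are made up for illustration) that pulls an action/side/part triple out of a free-form transcript instead of demanding an exact canned phrase:

```javascript
// Hypothetical decoder: extract (action, side, part) from whatever the
// speech-to-text service transcribed, tolerating filler words in between.
function decipher(utterance) {
    var m = /(lower|raise|turn)\b.*?\b(right|left)\b.*?\b(arm|leg|head)/i
                .exec(utterance);
    if (!m) return null; // didn't understand this one
    return {
        action: m[1].toLowerCase(),
        side:   m[2].toLowerCase(),
        part:   m[3].toLowerCase()
    };
}
```

So `decipher("please raise my left arm a bit")` still resolves to `{ action: 'raise', side: 'left', part: 'arm' }` even though the speaker never matched a canned phrase verbatim.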

Anyhow, after jury-rigging Google’s recognizer into my Interface scripts, I needed a test subject – and my stock avatar looked bored. So I conscripted 'em into helping me test the theory:

[ Simon says: ] raise left arm!
[ Simon says: ] lower right arm!

Below is as far as I got with the experiment on the scripting side (enough to get things talking, but I’m probably not going to take it any further right now). If anybody wants to fool around with this concept or freestyle voice recog, lemme know and we can try to get the Google speech hack working on your system (it doesn’t require a custom build – just a modern version of Chrome, some certificate shenanigans to provision a proxy on localhost, and some glue scripts).

/* concept / work in progress */

var ws = new vlan.WebSocket('mic:recorder');
Script.scriptEnding.connect(ws, 'close');
ws.onmessage = function(utterance) {
   userSaysWhat(utterance);
};

// callable like a "push-to-talk" button for the speech recog
record = ws.send.bind(ws, 'record');

// ------------------------------------------------
var vec3 = glm.vec3,
    quat = glm.quat,
    radians = glm.radians;

// returns a zero-argument function that poses the named joint
function jointXYZ(name, v) {
   return MyAvatar.setJointRotation.bind(MyAvatar, name, quat(radians(v)));
}

var bigFatMappingTable = {
   "lower right arm": jointXYZ('RightArm', vec3( 60, -15, -30)),
   "raise right arm": jointXYZ('RightArm', vec3(-90, -30,  15)),
   "lower left arm":  jointXYZ('LeftArm',  vec3( 60,  15,  30)),
   "raise left arm":  jointXYZ('LeftArm',  vec3(-90,  30, -15))
   // etc. -- many permutations... simon says: ugh!
};

function userSaysWhat(utterance) {

   // flat static table lookup
   if (utterance in bigFatMappingTable)
      return bigFatMappingTable[utterance]();

   // dynamic deciphering idea
   utterance.replace(
      /(lower|raise|turn) (right|left|upper|lower) (arm|leg|brow|head|hips)/i,
      function(_, action, side, part) {
         // TODO: some kind of better mapping logic here ...
         // MyAvatar.setJointRotation("...", {/*...*/});
      }
   );
}


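For reference, the browser side of the hack can be sketched roughly like this – a page in Chrome feeding Web Speech API transcripts back over a WebSocket (the relay URL and the guard are illustrative; the actual endpoint depends on the localhost proxy and certificate setup mentioned above):

```javascript
// Hypothetical Chrome-side glue script. RELAY_URL is a placeholder for
// whatever the local proxy exposes on your machine.
var RELAY_URL = 'ws://localhost:8080/';

// pure helper: pick the final transcript out of a recognition event,
// normalized for table lookup on the Interface side
function topTranscript(event) {
    var result = event.results[event.resultIndex];
    return result.isFinal ? result[0].transcript.trim().toLowerCase() : null;
}

// only wire things up when running inside Chrome
if (typeof webkitSpeechRecognition !== 'undefined') {
    var sock = new WebSocket(RELAY_URL);
    var recog = new webkitSpeechRecognition();
    recog.continuous = true;
    recog.onresult = function(event) {
        var utterance = topTranscript(event);
        if (utterance) sock.send(utterance); // relay to the Interface script
    };
    recog.start();
}
```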
@humbletim There’s also a SpeechRecognizer object built into Interface’s JavaScript API. It may only work on Windows, but I’m not sure about that. For an example script, see \examples\example\audio\speechControl.js


Thanks for the breadcrumb – the SpeechRecognizer API looks pretty cool, although I’m on Linux, so I need something cross-platform.

Also, with this alternative approach there’s no upfront requirement to engineer a phrase book…

Have you ever dictated a search query into Google?

(Or spoken a text message on Android?)

This is that – the same speech-to-text service, where the user talks and it transcribes.

So instead of forcing users to conform to predetermined spoken phrases, an evolving set of supported phrases can emerge organically simply by observing what people actually think to say. That’s a powerful feedback mechanism, and it falls out of the design automatically.
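That feedback loop can be sketched in a few lines – tally the utterances the mapping table didn’t recognize, and the frequent ones become candidates for promotion into the phrase book (the table and handler names here are illustrative, not from the script above):

```javascript
// Sketch: grow the phrase book organically from real usage.
var phraseBook = {
    "raise left arm": function () { /* pose the avatar ... */ }
};
var unrecognized = {}; // utterance -> times heard

function handleUtterance(utterance) {
    var handler = phraseBook[utterance];
    if (handler) return handler();
    // record what users actually said; review these counts later to
    // decide which phrases deserve real handlers
    unrecognized[utterance] = (unrecognized[utterance] || 0) + 1;
}
```

After a few sessions, sorting `unrecognized` by count tells you exactly which phrases people expect to work.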

Also, this approach might be friendlier to 90+ fps VR – because unlike native recognition (which sometimes has to choose between responsiveness and stampeding the CPU), all the heavy lifting here is delegated to the cloud. My jury-rig uses a Chrome instance, but by employing virtual WebSockets even that doesn’t have to run locally – it could also be placed on a spare Android, iOS, laptop, C.H.I.P. computer, etc…