Simon is basically an open-source speech recognition solution. It can be used on your Linux system just as you use Siri on iOS or S-Voice on Samsung’s Galaxy range of Android smartphones. It is a new and unique solution to a daunting task, namely speech recognition, and while there remains a lot of scope for expansion, it has made a great start and is making all the right moves.
In this post I’m going to give a brief overview of Simon and try to spread the word about this new solution.
Simon is basically a client for a server (Simond) and provides a GUI for managing the speech model and commands. Because of its architecture, the same version of Simon can be used with all languages and dialects. One can even mix languages within one model if necessary.
The Simon Architecture
The main recognition architecture consists of three applications.
- Simon : The main GUI, acting as the client to the Simond server.
- Simond : The recognition server.
- KSimond : A graphical front-end for Simond.
The three components provide a real client/server solution for the recognition: there can be just one Simond (server) serving one or more Simon instances (clients).
KSimond is just a front-end for Simond. It doesn’t add any functionality but rather provides a way to interact with Simond (the server) graphically.
More specialized functions are also a part of this integrated Simon distribution.
- Sam – Provides more in-depth control over the speech model and allows testing of the acoustic model.
- SSC/SSCd – These two applications make it easier to collect large amounts of speech samples from different speakers.
- Afaras – Allows users to check large corpora of speech data for erroneous samples.
Simon records sound from the microphone and transmits it to the server (Simond), which in turn runs the recognition on the received input stream and then sends the result back to the client (Simon).
Simon then uses these results to execute commands like opening programs, following links, etc.
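The round trip described above can be pictured with a small sketch. Everything here is a hypothetical stand-in: `recognize`, `execute` and the stubbed audio bytes are invented for illustration, and have nothing to do with Simond’s real network protocol or API.

```python
# A minimal sketch of the client/server flow described above.
# All names are hypothetical illustrations, not Simon's actual API
# or wire protocol: the real Simond speaks its own network protocol.

def recognize(audio_stream: bytes) -> str:
    """Stand-in for Simond: turn an audio stream into recognized text."""
    # A real recognizer would run the acoustic + language model here;
    # we fake it with a lookup keyed on the (stubbed) audio content.
    fake_results = {b"audio:open-browser": "open firefox"}
    return fake_results.get(audio_stream, "")

def execute(command_text: str) -> str:
    """Stand-in for the Simon client reacting to a recognition result."""
    commands = {"open firefox": "launching firefox"}
    return commands.get(command_text, "no matching command")

# Client side: record audio, ship it to the server, act on the reply.
audio = b"audio:open-browser"          # microphone input (stubbed)
result = recognize(audio)              # server runs the recognition
action = execute(result)               # client maps the text to an action
```

The point of the sketch is only the division of labour: the client captures and sends audio, the server answers with text, and the client decides what to do with that text.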
Simon identifies its connection with a user/password combination which is completely independent of the underlying operating system and its users. By default a standard user is set up in both Simon and Simond.
Every Simon client logs onto the server with a unique user/password combination that identifies a unique speech model. Every user maintains his own speech model, but may use it from different computers (different physical Simon instances) simply by accessing the same Simond server.
As mentioned earlier, one Simond instance can of course serve multiple users.
If you want to open up the server to the internet or to multiple users, you will have to configure the server (Simond). Further details can be found in the Simond manual.
To get Simon to recognize speech and react to it, you need to set up a speech model. Speech models describe how your voice sounds, what words exist, how they sound and what word combinations (sentences or structures) exist.
A speech model basically consists of two parts:
- Language Model : Describes existing words and what sentences are grammatically correct.
- Acoustic Model : Describes how words sound.
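The split between the two parts can be pictured with a toy sketch. The vocabulary, sentence list and phoneme mapping below are invented for illustration and are not Simon’s actual model format:

```python
# Conceptual sketch of the two model parts; all data below is made up
# for illustration and is not Simon's real model format.

# Language model: which words exist and which word sequences are valid.
vocabulary = {"open", "close", "firefox"}
valid_sentences = {("open", "firefox"), ("close", "firefox")}

# Acoustic model: how each word sounds (here, a toy phoneme mapping).
pronunciations = {
    "open": ["ow", "p", "ah", "n"],
    "firefox": ["f", "ay", "er", "f", "aa", "k", "s"],
}

def is_recognizable(sentence):
    """A sentence is usable only if BOTH model parts cover it."""
    words = tuple(sentence.split())
    return (words in valid_sentences
            and all(w in pronunciations for w in words))
```

Note how a grammatically valid sentence ("close firefox" above) still fails if the acoustic side has no idea how one of its words sounds, which is exactly why both parts are needed.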
Both of these components are required for Simon to recognize your voice; quite obviously, you need an acoustic model to activate Simon.
You can either create your own acoustic model or adapt a base model. Base models are pre-built acoustic models that can be used with Simon.
Simon uses external software, known as backends, to build acoustic models and to recognize speech.
Backends can be split into two distinct components:
- Model Compiler (or Model Generation) – used to create or adapt acoustic models.
- Recognizer – used to recognize speech.
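As a rough sketch of the two roles (class and method names are invented here, not Simon’s real backend interface):

```python
# Hypothetical sketch of the two backend roles described above.

class ToyBackend:
    def compile_model(self, samples):
        """Model compiler role: 'train' by memorizing audio -> word pairs."""
        return dict(samples)

    def recognize(self, model, audio):
        """Recognizer role: look the audio up in the compiled model."""
        return model.get(audio, "")

backend = ToyBackend()
model = backend.compile_model([(b"sample-1", "open"), (b"sample-2", "close")])
```

The compiled model is an artifact of the backend that produced it, which hints at why models built for one backend cannot be used with another.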
If you are using base models, Simon will automatically select the appropriate backend for you. More information on building your own base models from scratch can be found in the Simond manual. Note that base models created for one backend are not compatible with any other backend.
There are three types of base models:
- Static base model : uses a pre-compiled acoustic model without modifying it.
- Adapted base model : accuracy is improved by adapting the selected base model to your voice; some training data is required for this.
- User-generated : the user trains his own model from scratch; no base model is used.
With the exception of the static base model, the corresponding model creation backend needs to be installed for the base model types mentioned above. To know more about base models, check out .
Scenarios are the use cases of Simon. To control Firefox or Amarok, just install the appropriate scenario (the Firefox or Amarok scenario) and start using your voice instead of the keyboard and mouse.
Scenarios tell Simon what words and phrases to listen for and what to do when they are recognized. Scenarios are just simple text files (XML format) and can be exchanged easily, since they don’t contain information about how words and phrases actually sound.
Scenarios are tailored to work well with a specific base model, in order to avoid issues with the phoneme set.
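As an illustration of the idea, here is a sketch that reads a scenario-like file. The XML layout and attribute names are made up for this example and do not reflect the real Simon scenario schema:

```python
# Sketch of a scenario-like file: a plain XML text mapping spoken
# phrases to actions. This layout is invented for illustration and is
# NOT the real Simon scenario schema.
import xml.etree.ElementTree as ET

scenario_xml = """
<scenario name="Browser">
  <command phrase="open firefox" action="exec:firefox"/>
  <command phrase="close tab" action="shortcut:Ctrl+W"/>
</scenario>
"""

def load_commands(xml_text):
    """Map each spoken phrase to the action the client should run."""
    root = ET.fromstring(xml_text)
    return {c.get("phrase"): c.get("action") for c in root.iter("command")}

commands = load_commands(scenario_xml)
```

Because such a file contains only phrases and actions, never audio data, it can be shared between users regardless of how their voices sound.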
The names of Simon base models most likely start with a tag like “[EN/VF/JHTK]”. It is recommended to download scenarios that start with the same tag.