If you’re wondering what Simon is all about, I blogged about it just a few days back. You may want to read that post before going through this one.
So, I was reading the Simon handbook, and chapter 4 covers this topic in great detail. You can access the handbook from the Simon page on kde.org. It first explains the fundamentals that you’ll need in order to get started with creating your own scenarios. I’m going to do pretty much the same, except that I won’t go into as much detail: the manual is quite lengthy, and I was getting distracted very easily while reading it. So I’m keeping my fingers crossed, hoping that the motivation of getting new blog posts out of it will be enough to keep me on track. I’ll be writing a series of posts that cover these topics in parts, simply because cramming the entire process into a single post would greatly reduce its readability. Let’s get started, then: this post covers the fundamentals of speech recognition and how it works in Simon.
As mentioned at the beginning of the chapter, in order to create a new Simon scenario, you first create a new “shell” by adding a new scenario object and then open it in the Simon main window (we’ll get to this a little later in this post).
Here’s what a scenario in Simon contains:
- Training Texts
Speech recognition systems take voice input and try to translate it into written text. A speech model consists of two distinct parts:
- Language Model: Defines the vocabulary and grammar you want to use.
- Acoustic Model: Represents your pronunciation in a machine-readable format.
The Simon vocabulary is made up of words (entries), each of which consists of the following:
- Word name (the written word itself)
- Category (Noun, Verb, etc.)
- Pronunciation (how the word is pronounced)
In general, it is advisable to keep your vocabulary as lean as possible. The more words you add, the more room Simon has to misunderstand you 🙂 . Below is a table that shows how *words* are categorized and pronounced.
|Word|Category|Pronunciation|
|---|---|---|
|Computer|Noun|k ax m p y uw t er|
|Mail|Noun|m ey l|
|Close|Verb|k l ow s|
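To make the idea of a vocabulary entry a bit more concrete, here is a minimal Python sketch of how such entries could be modelled. The `Word` class and the `phonemes` helper are names I made up for illustration; this is not how Simon stores its vocabulary internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One vocabulary entry (hypothetical structure, not Simon's own)."""
    name: str           # the written word itself
    category: str       # e.g. "Noun" or "Verb"
    pronunciation: str  # space-separated phonemes


# Entries taken from the examples in this post.
vocabulary = [
    Word("Computer", "Noun", "k ax m p y uw t er"),
    Word("Mail", "Noun", "m ey l"),
    Word("Close", "Verb", "k l ow s"),
]


def phonemes(word):
    """Split an entry's pronunciation into its individual phonemes."""
    return word.pronunciation.split()


for entry in vocabulary:
    print(entry.name, entry.category, phonemes(entry))
```

Keeping the pronunciation as a plain string of space-separated phonemes mirrors the table above and makes it easy to split when needed.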
The vocabulary that Simon actually uses for recognition is referred to as the active dictionary or active vocabulary. In addition to the active dictionary, Simon has something called the shadow dictionary, where you can look up the pronunciation and other characteristics of a word. A shadow dictionary can be imported from within Simon by using the “Import Dictionary wizard”.
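The relationship between the two dictionaries can be pictured with a small Python sketch. Everything here (the `look_up` function, the two plain dicts, the lookup order) is my own simplification for illustration, not Simon’s actual implementation.

```python
# Hypothetical stand-ins for the two dictionaries: word -> pronunciation.
active_dictionary = {
    "computer": "k ax m p y uw t er",
}
shadow_dictionary = {
    "computer": "k ax m p y uw t er",
    "mail": "m ey l",
    "close": "k l ow s",
}


def look_up(word, active, shadow):
    """Return (source, pronunciation), checking the active dictionary first.

    Returns None when the word is in neither dictionary.
    """
    key = word.lower()
    if key in active:
        return ("active", active[key])
    if key in shadow:
        return ("shadow", shadow[key])
    return None


print(look_up("Computer", active_dictionary, shadow_dictionary))
print(look_up("Mail", active_dictionary, shadow_dictionary))
print(look_up("Banana", active_dictionary, shadow_dictionary))
```

The point of the sketch is only that the shadow dictionary acts as a reference you can consult for words that are not (yet) part of the active vocabulary.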
Even better, Simon also has a language profile to help with transcribing words. A language profile consists of rules on how words are pronounced in the target language. The automatic deduction of a phoneme transcription from a written word is called “grapheme to phoneme conversion”.
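Just to illustrate the idea of grapheme to phoneme conversion, here is a toy rule-based converter in Python. The `RULES` table and the `to_phonemes` function are invented for this post and cover only the letters needed for two example words; a real language profile is far more elaborate, and I am not claiming this is how Simon does it.

```python
# Toy grapheme -> phoneme rules (hypothetical, only enough for the examples).
# An empty string marks a silent letter.
RULES = {
    "ai": "ey",
    "m": "m",
    "l": "l",
    "c": "k",
    "o": "ow",
    "s": "s",
    "e": "",
}


def to_phonemes(word, rules=RULES):
    """Convert a written word to a phoneme string using greedy matching."""
    word = word.lower()
    result = []
    i = 0
    while i < len(word):
        # Prefer the longest grapheme that has a rule (two letters before one).
        for size in (2, 1):
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in rules:
                if rules[chunk]:
                    result.append(rules[chunk])
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return " ".join(result)


print(to_phonemes("Mail"))   # m ey l
print(to_phonemes("Close"))  # k l ow s
```

Words that need a letter missing from the toy table raise a `ValueError`, which is a reminder of how much ground a real set of pronunciation rules has to cover.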
The next feature of a scenario after vocabulary is grammar. The grammar defines which combinations of words are correct (remember that in Simon a combination of words is considered an entry). For example, consider the following commands:
- Computer, Internet (Noun, Noun)
- Computer, Mail (Noun, Noun)
- Computer, Close (Noun, Verb)
So Simon can work with the above-mentioned combinations, but this also opens the door for something like:
- Computer, Computer (Noun, Noun)
- Internet, Internet (Noun, Noun)
- Internet, Computer (Noun, Noun)
- Mail, Internet (Noun, Noun)
Now, all of the above are also combinations that Simon accepts (Noun, Noun), but they make no logical sense. To avoid such cases, it is advisable to invent new grammatical rules relevant to each use case (we’re not bound by the rules of natural grammar while using Simon). For example, in the commands above, instead of using categories like noun and verb, it would be better to use something like this:
- Computer, Internet (Trigger, Command)
- Computer, Mail (Trigger, Command)
- Computer, Close (Trigger, Command)
This approach still allows the three intended commands and limits the accepted combinations to only those. The words “Trigger” and “Command” are pretty much self-explanatory here, I believe.
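The difference between the two category schemes is easy to see if you enumerate everything each grammar accepts. The following Python sketch does exactly that; the `allowed_sentences` function and the dict-based vocabulary are my own illustration and not Simon’s grammar format.

```python
from itertools import product


def allowed_sentences(vocabulary, grammar):
    """List every word combination the grammar accepts.

    vocabulary maps each word to its category;
    grammar is a list of category sequences (tuples).
    """
    by_category = {}
    for word, category in vocabulary.items():
        by_category.setdefault(category, []).append(word)

    sentences = []
    for rule in grammar:
        choices = [by_category.get(category, []) for category in rule]
        sentences.extend(product(*choices))
    return sentences


# Natural-language categories: accepts far more than we want.
natural_vocabulary = {
    "Computer": "Noun", "Internet": "Noun", "Mail": "Noun", "Close": "Verb",
}
natural_grammar = [("Noun", "Noun"), ("Noun", "Verb")]

# Use-case specific categories: accepts only the intended commands.
custom_vocabulary = {
    "Computer": "Trigger", "Internet": "Command",
    "Mail": "Command", "Close": "Command",
}
custom_grammar = [("Trigger", "Command")]

natural = allowed_sentences(natural_vocabulary, natural_grammar)
custom = allowed_sentences(custom_vocabulary, custom_grammar)

print(len(natural))  # 12, including ("Computer", "Computer")
print(custom)        # only the three "Computer, ..." commands
```

With the noun/verb scheme the grammar accepts twelve combinations, most of them nonsense, while the Trigger/Command scheme accepts exactly the three commands we care about.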
Now, turning our attention to the Acoustic Model: it uses the fact that spoken words are composed of sounds, much like written words are composed of letters. Using this knowledge, one can segment words into sounds (represented by the pronunciation) and assemble them back when recognizing. These building blocks are called ‘phonemes’.
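Here is a deliberately simplified Python sketch of the segment-and-reassemble idea. The `segment` and `assemble` functions are hypothetical, and the exact string matching is a big simplification: a real recognizer works on audio and has to cope with uncertainty, which this sketch ignores completely.

```python
# Pronunciations from the examples in this post: word -> phonemes.
PRONUNCIATIONS = {
    "Computer": "k ax m p y uw t er",
    "Mail": "m ey l",
    "Close": "k l ow s",
}


def segment(word):
    """Break a known word into its phoneme building blocks."""
    return PRONUNCIATIONS[word].split()


def assemble(phonemes):
    """Find the word whose pronunciation matches the phoneme sequence.

    Returns None if no known word matches exactly.
    """
    target = " ".join(phonemes)
    for word, pronunciation in PRONUNCIATIONS.items():
        if pronunciation == target:
            return word
    return None


sounds = segment("Mail")
print(sounds)            # ['m', 'ey', 'l']
print(assemble(sounds))  # Mail
```

The round trip from word to phonemes and back is the whole point: the same small set of sounds can be recombined to cover every word in the vocabulary.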
This pretty much covers the basics, and I believe it’s in the best interests of both the reader and the writer to end this post here. Stay tuned for more 🙂