I tried voice input using CMU Sphinx with Python's SpeechRecognition.
Out of the box the recognition rate is low, so I adapted the acoustic model to improve it.
In this article I introduce how to adapt the acoustic model.
Wouldn't you like to control something with a voice assistant like Apple's Siri, Google Assistant, or Amazon Alexa?
I tried to implement the speech recognition that underlies such assistants with CMUSphinx.
Here I introduce how to use it on Windows, but I think the same approach works in a Unix-like environment.
I imagine it can also be applied to devices like the Raspberry Pi.
Table of contents
- Speech recognition with CMUSphinx
- How to adapt the acoustic model
- Create audio data
- Creating tools for adaptation
- Using an adapted acoustic model in SpeechRecognition
- Support for Japanese
- pocketsphinx and speech recognition
- Get source
- Finally
- Reference
A voice assistant is one application of speech recognition.
One thing to consider when building a voice assistant is whether it works offline.
If you use it on a personal computer, this is not a concern, but on other devices (for example, devices built on a Raspberry Pi) it is.
The Python speech recognition package SpeechRecognition supports several speech recognition engines.
Among them, CMU Sphinx is the only one that supports offline speech recognition for free. 1
Speech recognition with CMUSphinx
Some time ago, I wrote a program for voice input using the SpeechRecognition package in Python. 2
It is easy to change the speech recognition engine to CMUSphinx in this program.
Just change the method used for speech recognition from recognize_google() to recognize_sphinx().
* You need to install pocketsphinx to use it.
pip install pocketsphinx
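For reference, here is a minimal sketch of such a program (my own simplified version, not the full program from the article; sr.Microphone requires PyAudio):

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:  # requires PyAudio
    print("Say something...")
    audio = r.listen(source)
try:
    # Offline recognition with CMU Sphinx (pocketsphinx must be installed)
    print(r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"Sphinx error: {e}")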
When you try it, you will notice the following:
- It does not recognize English words well
- It has difficulty recognizing English sentences
- The recognition rate is not very high even for English sentences spoken by native speakers
After investigating this result, I decided that some tuning is necessary to use CMUSphinx comfortably.
There are two tuning methods:
- Adapting the acoustic model
- Training an acoustic model
However, the only realistic option is adapting the acoustic model.
According to CMUSphinx, training requires the following: 3
- You have knowledge of the phonetic structure of the language
- You have a lot of data to train on (more than an hour of recordings of one speaker's commands, etc.)
- You have enough time to train the model and optimize the parameters
I can't meet these requirements, so I'll tune by adapting the acoustic model.
Effects of acoustic model adaptation
According to the CMUSphinx tutorial site, the effects of acoustic model adaptation are as follows.
[ Effects of Acoustic Model Adaptation ]
- Adaptation does not necessarily adapt to a particular speaker, but can adapt to a particular recording environment, accent, etc.
- Cross-language adaptation is also meaningful, and a model of English can be adapted to the sounds of another language.
- The adaptation process is more robust than training and can give good results even with small adaptation data
From these, we decided that CMUSphinx can be used as a voice assistant by adapting the acoustic model.
How to adapt the acoustic model
What it takes to adapt an acoustic model for a voice assistant
Prepare a simple language model for the voice assistant and adapt the existing English acoustic model.
Adapting the acoustic model requires the following process:
- Creating tools for adaptation
- Creating a simple language model
- Creating adaptation data
  - Creating the transcription and fileids files
  - Creating the audio data
- Performing acoustic model adaptation
Create a simple language model
If your goal is to recognize things like simple commands, it's a good idea to create a simple language model.
It is convenient to use a web service for this.
With a web service, all you need to create a language model is a corpus (a list of commands).
A corpus is essentially a large-scale structured collection of natural language sentences 4, but a voice assistant that recognizes only simple commands only needs to list the words it recognizes.
[ Steps to create a simple language model ]
Create corpus file
A "corpus" here is simply a list of sentences used to adapt an acoustic model.
In other words, it's a list of commands to use with your voice assistant.
As an example, assume a command that controls the browser.
- Creation example
down
next page
up
top
bottom
next link
back
see
search
stop
Create this in a file called corpus.txt, for example.
Create language model from corpus
Create a language model from a corpus file with the following web service:
Language model creation service: Sphinx Knowledge Base Tool VERSION 3
On the displayed site, do the following:
- Press the "Browse" button and select the created corpus.txt file
- Press the "COMPILE KNOWLEDGE BASE" button
- You will see a page titled "Sphinx knowledge base generator"
Download the "TARnnnn.tgz" file
* nnnn is a 4-digit number; the archive contains the dic, lm, log_pronounce, sent, vocab, and HEADER.html files
Create adaptive data
Acoustic model adaptation requires the following data:
- transcription file: a text file describing the mapping between words and audio data
- fileids file: a text file listing the audio data file names
- wav files: the audio data
* The transcription file and the fileids file must list the audio data file names in the same order.
[ Description of each file ]
transcription file
- A text file with the extension .transcription
- File name example: data.transcription
- Write the words pronounced in the audio file (in uppercase) and the audio file name (without extension)
- Example data:
<s> DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN </s> (down10)
<s> NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE </s> (next_page10)
<s> UP UP UP UP UP UP UP UP UP UP </s> (up10)
fileids file
- A text file with the extension .fileids
- File name example: data.fileids
- List the audio file names (without extension)
- Example data:
down10
next_page10
up10
wav file
- See the later section "Create audio data"
* There is also a way to create the transcription file and the fileids file by splitting an existing recording into sentences.
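As a supplement (my own sketch, not part of the original procedure), the two text files can be generated from a mapping of wav file names to the phrases spoken in them; the file names and phrases below are the examples above.

# Sketch: generate data.transcription and data.fileids from a mapping of
# wav file names (without extension) to the phrase repeated in each recording.
# Dict insertion order keeps both files in the same order, as required.
recordings = {
    "down10": "DOWN " * 10,
    "next_page10": "NEXT PAGE " * 10,
    "up10": "UP " * 10,
}

with open("data.transcription", "w") as trans, open("data.fileids", "w") as ids:
    for name, words in recordings.items():
        trans.write(f"<s> {words.strip()} </s> ({name})\n")
        ids.write(f"{name}\n")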
Performing Acoustic Model Adaptation
Adapting an acoustic model improves the fit between the adaptation data and the model, making it easier to recognize the speaking style that the adaptation data is based on.
Follow the steps below.
- Create a working folder
  Create a working folder and copy the acoustic model, dictionary, language model, adaptation data, and binaries into it.
  - Copy the default acoustic model
    - Get the acoustic model
      Acquisition destination: CMU Sphinx Files
      Acquisition file: cmusphinx-en-us-ptm-5.2.tar.gz (ptm is for mobile)
      * After unzipping, make sure mixture_weights is included and that mdef is a text file
      * Normally you would copy the pocketsphinx\model\en-us\en-us folder from the pocketsphinx installation, but that version is distributed in compressed form, so get the full uncompressed version from the link above
    - Create an en-us folder in the working folder and store all the acquired acoustic model files in it
      Files: feat.params, mdef, means, mixture_weights, noisedict, README, sendump, transition_matrices, variances
  - Copy the dictionary (nnnn.dic) and the language model (nnnn.lm) to the working folder
    These are the files obtained in "Create language model from corpus"
  - Copy the adaptation data to the working folder
    Copy the fileids, transcription, and wav files created in "Create adaptive data"
  - Copy the binaries
    Copy sphinx_fe.exe, bw.exe, map_adapt.exe, and mllr_solve.exe created in "Creating tools for adaptation" into the working folder
- Generation of acoustic feature files (mfc files)
Run sphinx_fe
sphinx_fe -argfile en-us/feat.params -samprate 16000
-c a_data.fileids -di . -do . -ei wav -eo mfc -mswav yes
- Accumulating observation counts
  Run bw
  * Make sure the bw arguments match the feat.params in the acoustic model folder
  * If you get the following error from bw, the number of words in your recording does not match the transcription file. Review the transcription file.
  Error: Failed to align audio to transcript: final state of the search is not reached
bw
-hmmdir en-us
-moddeffn en-us/mdef
-ts2cbfn .ptm.
-feat 1s_c_d_dd
-svspec 0-12/13-25/26-38
-cmn current
-agc none
-dictfn 8811.dic
-ctlfn my_small_model.fileids
-lsnfn my_small_model.transcription
-accumdir .
[ Options (partial) ]
- -hmmdir: folder containing the acoustic model
- -moddeffn: model definition file name (mdef)
- -dictfn: file name of the dictionary
- -ctlfn: file name of the fileids file
- -lsnfn: file name of the transcription file
- -accumdir: destination folder
- Conversion using MLLR (create mllr_matrix)
Run mllr_solve
mllr_solve -meanfn en-us/means -varfn en-us/variances
-outmllrfn mllr_matrix -accumdir .
- Copy the acoustic model folder (here, en-us-adapt)
  Copy the en-us folder to create an en-us-adapt folder
- Updating the acoustic model files with MAP (updates the files in the en-us-adapt folder)
Run map_adapt
map_adapt -moddeffn en-us/mdef -ts2cbfn .ptm. -meanfn en-us/means
-varfn en-us/variances -mixwfn en-us/mixture_weights
-tmatfn en-us/transition_matrices -accumdir .
-mapmeanfn en-us-adapt/means -mapvarfn en-us-adapt/variances
-mapmixwfn en-us-adapt/mixture_weights
-maptmatfn en-us-adapt/transition_matrices
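For convenience, here is a sketch (my own addition) that runs the four commands above in order from Python with subprocess. It assumes it is run from the working folder and unifies the example file names to the ones used in the bw example (8811.dic, my_small_model.fileids/.transcription); adjust them to your own files, and make sure the en-us-adapt folder has already been copied before map_adapt runs.

import subprocess

def run(cmd):
    # Print and run one adaptation tool; stop on the first failure
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["sphinx_fe", "-argfile", "en-us/feat.params", "-samprate", "16000",
     "-c", "my_small_model.fileids", "-di", ".", "-do", ".",
     "-ei", "wav", "-eo", "mfc", "-mswav", "yes"])

run(["bw", "-hmmdir", "en-us", "-moddeffn", "en-us/mdef", "-ts2cbfn", ".ptm.",
     "-feat", "1s_c_d_dd", "-svspec", "0-12/13-25/26-38",
     "-cmn", "current", "-agc", "none", "-dictfn", "8811.dic",
     "-ctlfn", "my_small_model.fileids",
     "-lsnfn", "my_small_model.transcription", "-accumdir", "."])

run(["mllr_solve", "-meanfn", "en-us/means", "-varfn", "en-us/variances",
     "-outmllrfn", "mllr_matrix", "-accumdir", "."])

run(["map_adapt", "-moddeffn", "en-us/mdef", "-ts2cbfn", ".ptm.",
     "-meanfn", "en-us/means", "-varfn", "en-us/variances",
     "-mixwfn", "en-us/mixture_weights",
     "-tmatfn", "en-us/transition_matrices", "-accumdir", ".",
     "-mapmeanfn", "en-us-adapt/means", "-mapvarfn", "en-us-adapt/variances",
     "-mapmixwfn", "en-us-adapt/mixture_weights",
     "-maptmatfn", "en-us-adapt/transition_matrices"])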
Create audio data
Record speaking voice
Recording is done with the software that comes with Windows.
* Of course you can use whatever you want.
"Voice Recorder (Sound Recorder) 5" that comes with Windows creates audio files in the file format m4a (or wma for Sound Recorder).
The acoustic model adaptation tools expect files in wav format.
Therefore, a file format conversion is required.
Convert m4a(wma) to wav
I created a Python conversion program using the audio file conversion tool ffmpeg.
It's simple code, so I'll show it below.
It converts the m4a files in a folder to wav files one by one.

import glob
import subprocess

ffmpeg_path = r"C:\temp\ffmpeg\bin\ffmpeg.exe"  # Path of ffmpeg.exe
target_path = r"C:\temp\音声\*.m4a"  # Path of audio data before conversion

for path in glob.iglob(target_path):
    print(path)
    cmd = f"{ffmpeg_path} -i {path} -ac 1 -ar 16000 {path.replace('.m4a', '.wav')}"
    subprocess.call(cmd, shell=True)
print("finish")
The wav file you create should meet the requirements of the adaptation tool.
[ wav file requirements ]
Single channel mono with 16kHz sampling rate
- bit resolution: 16
- sample rate: 16000
- audio channels: mono
- How to specify in ffmpeg
ffmpeg -i xx.m4a -ac 1 -ar 16000 xx.wav
* You can also execute the above manually one file at a time to convert.
- Some of ffmpeg's options
  -ac: number of audio channels
  -ar: sampling rate
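To double-check the converted files, a small script like the following (my own addition, using Python's standard wave module) can verify the requirements:

import wave

def check_wav(path):
    # Verify 16-bit samples, 16000 Hz sampling rate, and a single (mono) channel
    with wave.open(path, "rb") as w:
        ok = (w.getsampwidth() == 2 and
              w.getframerate() == 16000 and
              w.getnchannels() == 1)
    print(path, "OK" if ok else "does not meet the requirements")

check_wav("down10.wav")  # example file name from this article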
- How I chose this method
I checked how to convert from m4a (wma) to wav.
There is a package called "pydub · PyPI " in Python.
As a result of using it, I found the following.
- It uses a library called ffmpeg internally
- It does not work without downloading ffmpeg
In that case, I thought it would be easier to understand if I created a program that starts ffmpeg directly.
Download ffmpeg
ffmpeg can be downloaded from the following site.
- Obtained from:Releases · GyanD/codexffmpeg
- File: ffmpeg-5.1.2-essentials_build.zip (the version is whatever is latest)
Readme.txt shows what kind of audio files are supported (support type).
[ Examples of support types ]
- aac: file format used in m4a files
- alac: file format used in m4a files
- wmav1: it seems to correspond to wma7
- wmav2: it seems to correspond to wma8 and wma9
Creating tools for adaptation
The tools are not provided as binaries, so you have to build them from the sphinxtrain source. 6
Use the following tools for adaptation:
sphinx_fe.exe
bw.exe
mllr_solve.exe
map_adapt.exe
Get sphinxtrain source
The tools used for adaptation are released as source code in a package called sphinxtrain.
Get sphinxtrain from:
- Acquisition destination (GitHub): Releases · cmusphinx/sphinxtrain
- File: Source code (zip) or Source code (tar.gz)
After downloading, unzip it. A sphinxtrain-5.0.0 folder will be created.
Build from sphinxtrain source code
Building the sphinxtrain source requires CMake, C, Perl, and Python.
Prepare compiler etc.
- Install CMake
  Download and install the installer from the official site
  Acquisition destination: Download | CMake
- Install Perl and associate the .pl extension
  - Installation of Perl was based on an external site 7
  - Associating Perl with the file extension
    You can set the association with the following commands (run as administrator):
    - Set association (set and associate the file type)
      FTYPE SPerl="C:\strawberry\Perl\bin\perl.exe" "%1" %*
      ASSOC .pl=SPerl
    - Check the association
      Command: assoc .extension
      Displayed result: .extension=file_type_name
      Command: ftype file_type_name
      Displayed result: file_type_name=open_command_string
    - Command description
      ASSOC [.extension[=[file_type_name]]]
        View or change file extension associations
      FTYPE [file_type_name[=[open command string]]]
        View or change the file types used for file extension associations
- Install C
  Installation of C was based on an external site 8
Build with CMake
sphinxtrain provides a CMakeLists.txt (in the sphinxtrain-5.0.0 folder of the obtained source).
Therefore, it can be built with CMake.
CMake creates the build files first, then the binaries.
- Create build files with CMake
  Create a build folder, navigate to it, and run CMake.
  * The reason for creating a build folder is that cmake creates a lot of files, so this keeps the original source folder clean.
  Example:
  mkdir build
  cd build
  cmake .. -A x64
  * The current folder is the build folder, so .. specifies the parent folder, which contains the CMakeLists.txt file.
  Options:
  - -G: specify the generator (target compiler)
    The default is Visual Studio 17 2022
    Not required if the C compiler is installed in VSCode
    Use -G "MinGW Makefiles" for MinGW
  - --fresh: add when rebuilding
  - -A: specify the platform (if the generator supports it)
    "Visual Studio 17 2022" defaults to "Win32"
    Values: Win32, x64, ARM, ARM64
  - -DCMAKE_BUILD_TYPE=Release: change the build mode
- Create binaries with CMake
  cmake --build .
  Create the binaries using the build files created in the previous step.
  * The default is Debug mode, so the binaries are created in the Debug folder
  * . is the folder where the build files were created; here it is the current build folder.
Using an adapted acoustic model in SpeechRecognition
Use the adapted acoustic model by specifying it in SpeechRecognition as follows:
Specify the language argument of the recognize_sphinx() method.
[ language argument ]
language=(en-us-adapt_folder, nnnn.lm_file, nnnn.dic_file)
- Code example
self.r.recognize_sphinx(audio, language=(r"C:\my_model\en-us-adapt", r"C:\my_model\1347.lm", r"C:\my_model\1347.dic"))
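Putting it together, a minimal sketch (the paths are examples from this article; adjust them to your environment, and sr.Microphone requires PyAudio):

import speech_recognition as sr

# Adapted acoustic model folder, language model file, and dictionary file (example paths)
model = (r"C:\my_model\en-us-adapt", r"C:\my_model\1347.lm", r"C:\my_model\1347.dic")

r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say a command...")
    audio = r.listen(source)
try:
    print("Recognized:", r.recognize_sphinx(audio, language=model))
except sr.UnknownValueError:
    print("Could not understand the audio")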
Support for Japanese
Since it may serve as a reference for other languages, I will describe what I did as is.
As mentioned in "Effects of acoustic model adaptation", CMUSphinx may be able to recognize Japanese even when using an acoustic model for another language.
In fact, it turned out to be usable.
It is effective for recognizing short words like voice assistants.
Since Japanese acoustic models may not be provided, this method is used to support Japanese.
The adaptation method is the same as the acoustic model adaptation in English.
All you have to do is convert the prepared corpus into Japanese (romaji).
However, it seems to be easier to recognize if you keep the following in mind.
- Words with 3 or more letters are easier to recognize
  Example: "Uehe" is better than "Ue"
- Consonants are easier to recognize than vowels
  Example: "Ueni" is better than "Uehe"
- Multiple words are easier to recognize
  Example: "Ueni ido" is better than "Ueni"
Transition of word selection until it can be recognized in Japanese
I will introduce the contents of trial and error in word selection.
By the third round of word choices it recognizes fairly well, but still not well enough.
| 1st time | 2nd time | 3rd time | meaning |
|---|---|---|---|
| shita | shitae | shitani idou | scroll down |
| jipeiji | jipeiji | jipeiji | next page |
| ue | uee | ueni idou | scroll up |
| toppu | ichiban uee | ichiban ueni | scroll to top |
| soko | ichiban shitae | ichiban shitani | scroll to the bottom |
| tabu | tsugino rinku | tsugino rinku | next link |
| modoru | modoru | modoru | return |
| miru | hyouji | hyouji | open link |
| kensaku | kensaku | kensaku | search |
| owari | owari | owari | end |
- 1st time: Next page, search, and return are easy to recognize
- 2nd time: "Uehe" and "shitahe" are difficult to recognize
- 3rd time: 60-70% recognized (there are good times and bad times)
The challenge is to select words that raise the recognition rate.
My vague impression is that it is better to choose words whose initial sounds come from different rows and columns of the Japanese syllabary.
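For example, the third-round word list from the table above can be written as a corpus.txt (romaji, one command per line) like this:

shitani idou
jipeiji
ueni idou
ichiban ueni
ichiban shitani
tsugino rinku
modoru
hyouji
kensaku
owari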
pocketsphinx and speech recognition
pocketsphinx uses speech recognition technology.
I can't explain speech recognition technology in detail, but CMUSphinx Tutorial For Developers – CMUSphinx Open Source Speech Recognition covers the terms you should know.
Recognition method
- The recognition engine looks for words both in the dictionary and in the language model
- If the language model does not have the word, it will not be recognized even if it exists in the dictionary
Models
CMUSphinx uses three models: an acoustic model, a language model, and a phonetic dictionary.
- Acoustic model: the basis for converting speech into phonemes (the smallest unit of speech that can distinguish the meaning of words)
  General acoustic models are based on statistical processing of speech data from thousands of people over thousands of hours
- Phonetic dictionary: a mapping of words to phoneme sequences
  Words are judged from phoneme sequences by pattern matching
- Language model: a model of the "words" spoken and written by humans, based on the probability of occurrence of words 9
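As an illustration, the dictionary (nnnn.dic) generated in "Create language model from corpus" maps each word to a phoneme sequence, with entries roughly like the following (the phoneme symbols are standard CMUdict/ARPAbet notation and are shown here only as an example):

BACK    B AE K
DOWN    D AW N
NEXT    N EH K S T
PAGE    P EY JH
UP      AH P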
Language model
There are two forms of language model implementation:
- Text ARPA format
  - The ARPA format can be edited
  - ARPA files have the extension .lm
- Binary BIN format
  - The binary format speeds up loading
  - Binary files have the extension .lm.bin
* Conversion between these formats is also possible
Get source
You can get the full source here.
- Source code: voice_input_Sphinx.py
- Obtained from: GitHub juu7g/Python-voice-input
* Please note that the source contains code for debugging.
I use the logging module for debugging.
Finally
This time, I noticed something while doing this work.
For voice recognition, multiple words are easier to recognize than a single word.
Come to think of it, Google uses "OK Google" and Mercedes uses "Yes, Mercedes", so I feel this is related.
I think the reason is that in Japanese, each sound is distinct, but in English, when words are connected, the way they are pronounced changes.
Voice recognition seems to be made with that in mind.
I'm not a speech recognition expert, so I'm just guessing.
I think Japanese speech recognition is easy if you just recognize sounds.
I think it's hard to turn sounds (kana) into words (kanji).
It's semantic analysis, not speech recognition.
* Sorry, this is still about Japanese.
Caution
This article is based on what worked under the following versions:
- Python 3.8.5
- pocketsphinx 5.0.0
Disclaimer
Please check "Disclaimer" before using.
If you have any questions, please contact me from "Contact me".
Reference
- Sphinx: CMUSphinx Tutorial For Developers – CMUSphinx Open Source Speech Recognition
- Acoustic model adaptation: Adapting the default acoustic model – CMUSphinx Open Source Speech Recognition
- [ Japanese site ] What is a corpus: Link
- [ Japanese site ] voice recognition: Link
- [ Japanese site ] How speech recognition works: Link
- [ Japanese site ] 📖 Supported speech recognition engine/API[ Python ] 🔗↩
- [ Japanese site ] 📖 How to make an app that creates sentences by voice input[ Python ] 🔗↩
- See training requirements here: Training an acoustic model for CMUSphinx – CMUSphinx Open Source Speech Recognition ↩
- See Wikipedia: Text corpus - Wikipedia ↩
- Windows recording tools have been renamed in different versions of Windows. Voice Recorder (Windows 10 or later), Sound Recorder (up to Windows 8)↩
- Originally, it seems that SphinxTrain should be executed, but I could not understand how to execute it and what was executed at that time, so I used the built tool.↩
- [ Japanese site ] See here for Perl installation:Link ↩
- [ Japanese site ] See here for C installation: Link ↩
- [ Japanese site ] Language model (also for GPT): Link ↩