I tried voice input using CMU Sphinx with Python's SpeechRecognition.
Out of the box the recognition rate is low, so I adapted the acoustic model to improve it.
In this article I introduce how to adapt the acoustic model.
Wouldn't you like to control something with a voice assistant like Apple's Siri, Google Assistant, or Amazon Alexa?
I tried to implement the speech recognition that underlies such assistants with CMUSphinx.
Here I introduce how to use it on Windows, but I think the same approach works in a Unix-like environment.
I imagine it can also be applied to devices like the Raspberry Pi.
Table of contents
- Speech recognition with CMUSphinx
- How to adapt the acoustic model
- Create audio data
- Creating tools for adaptation
- Using an adapted acoustic model in SpeechRecognition
- Support for Japanese
- pocketsphinx and speech recognition
- Get source
- Finally
- Reference
A voice assistant is one application of speech recognition.
One thing to consider when building a voice assistant is whether it works offline.
If you use it on a personal computer, this is not a concern, but on other devices (for example, devices built on a Raspberry Pi) it is.
The Python speech recognition package SpeechRecognition supports several speech recognition engines.
Among them, CMU Sphinx is the only one that supports offline speech recognition for free. 1
Speech recognition with CMUSphinx
Some time ago, I wrote a program for voice input using the SpeechRecognition package in Python. 2
It is easy to change the speech recognition engine to CMUSphinx in this program.
Just change the method used for speech recognition from recognize_google() to recognize_sphinx().
* You need to install pocketsphinx to use it.
pip install pocketsphinx
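For reference, here is a minimal sketch of such a program (my own simplified version, not the full program from the article; sr.Microphone requires PyAudio):

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:  # requires PyAudio
    print("Say something...")
    audio = r.listen(source)
try:
    # Offline recognition with CMU Sphinx (pocketsphinx must be installed)
    print(r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print(f"Sphinx error: {e}")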
When you try it, you will notice the following:
- It does not recognize English words well
- It has difficulty recognizing English sentences
- The recognition rate is not very high even for English sentences spoken by native speakers
After investigating this result, I decided that some tuning is necessary to use CMUSphinx comfortably.
There are two tuning methods:
- Adapting the acoustic model
- Training an acoustic model
However, the only realistic option is adapting the acoustic model.
According to CMUSphinx, training requires the following: 3
- You have knowledge of the phonetic structure of the language
- You have a lot of data to train on (more than an hour of recordings of one speaker's commands, etc.)
- You have enough time to train the model and optimize the parameters
I can't meet these requirements, so I'll tune by adapting the acoustic model.
Effects of acoustic model adaptation
According to the CMUSphinx tutorial site, the effects of acoustic model adaptation are as follows.
[ Effects of Acoustic Model Adaptation ]
- Adaptation does not necessarily adapt to a particular speaker, but can adapt to a particular recording environment, accent, etc.
- Cross-language adaptation is also meaningful, and a model of English can be adapted to the sounds of another language.
- The adaptation process is more robust than training and can give good results even with small adaptation data
From these, we decided that CMUSphinx can be used as a voice assistant by adapting the acoustic model.
How to adapt the acoustic model
What it takes to adapt an acoustic model for a voice assistant
Prepare a simple language model for the voice assistant and adapt the existing English acoustic model.
Adapting the acoustic model requires the following process:
- Creating tools for adaptation
- Creating a simple language model
- Creating adaptation data
  - Creating the transcription and fileids files
  - Creating the audio data
- Performing acoustic model adaptation
Create a simple language model
If your goal is to recognize things like simple commands, it's a good idea to create a simple language model.
It is convenient to use a web service for this.
With a web service, all you need to create a language model is a corpus (a list of commands).
A corpus is essentially a large-scale structured collection of natural language sentences 4, but a voice assistant that recognizes only simple commands only needs to list the words it recognizes.
[ Steps to create a simple language model ]
Create corpus file
A "corpus" here is simply a list of sentences used to adapt an acoustic model.
In other words, it's a list of commands to use with your voice assistant.
As an example, assume a command that controls the browser.
- Creation example
down
next page
up
top
bottom
next link
back
see
search
stop
Create this in a file called corpus.txt, for example.
Create language model from corpus
Create a language model from a corpus file with the following web service:
Language model creation service: Sphinx Knowledge Base Tool VERSION 3
On the displayed site, do the following:
- Press the "Browse" button and select the created corpus.txt file
- Press the "COMPILE KNOWLEDGE BASE" button
- You will see a page titled "Sphinx knowledge base generator"
Download the "TARnnnn.tgz" file
* nnnn is a 4-digit number; the archive contains the dic, lm, log_pronounce, sent, vocab, and HEADER.html files
Create adaptive data
Acoustic model adaptation requires the following data:
- transcription file: a text file describing the mapping between words and audio data
- fileids file: a text file listing the audio data file names
- wav files: the audio data
* The transcription file and the fileids file must list the audio data file names in the same order.
[ Description of each file ]
transcription file
- A text file with the extension .transcription
- File name example: data.transcription
- Write the words pronounced in the audio file (in uppercase) and the audio file name (without extension)
- Example data:
<s> DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN </s> (down10)
<s> NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE </s> (next_page10)
<s> UP UP UP UP UP UP UP UP UP UP </s> (up10)
fileids file
- A text file with the extension .fileids
- File name example: data.fileids
- List the audio file names (without extension)
- Example data:
down10
next_page10
up10
wav file
- See the later section "Create audio data"
* There is also a way to create the transcription file and the fileids file by splitting an existing recording into sentences.
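As a supplement (my own sketch, not part of the original procedure), the two text files can be generated from a mapping of wav file names to the phrases spoken in them; the file names and phrases below are the examples above.

# Sketch: generate data.transcription and data.fileids from a mapping of
# wav file names (without extension) to the phrase repeated in each recording.
# Dict insertion order keeps both files in the same order, as required.
recordings = {
    "down10": "DOWN " * 10,
    "next_page10": "NEXT PAGE " * 10,
    "up10": "UP " * 10,
}

with open("data.transcription", "w") as trans, open("data.fileids", "w") as ids:
    for name, words in recordings.items():
        trans.write(f"<s> {words.strip()} </s> ({name})\n")
        ids.write(f"{name}\n")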
Performing Acoustic Model Adaptation
Adapting an acoustic model improves the fit between the adaptation data and the model, making it easier to recognize the speaking style that the adaptation data is based on.
Follow the steps below.
- Create a working folder
  Create a working folder and copy the acoustic model, dictionary, language model, adaptation data, and binaries into it.
  - Copy the default acoustic model
    - Get the acoustic model
      Acquisition destination: CMU Sphinx Files
      Acquisition file: cmusphinx-en-us-ptm-5.2.tar.gz (ptm is for mobile)
      * After unzipping, make sure mixture_weights is included and that mdef is a text file
      * Normally you would copy the pocketsphinx\model\en-us\en-us folder from the pocketsphinx installation, but that version is distributed in compressed form, so get the full uncompressed version from the link above
    - Create an en-us folder in the working folder and store all the acquired acoustic model files in it
      Files: feat.params, mdef, means, mixture_weights, noisedict, README, sendump, transition_matrices, variances
  - Copy the dictionary (nnnn.dic) and the language model (nnnn.lm) to the working folder
    These are the files obtained in "Create language model from corpus"
  - Copy the adaptation data to the working folder
    Copy the fileids, transcription, and wav files created in "Create adaptive data"
  - Copy the binaries
    Copy sphinx_fe.exe, bw.exe, map_adapt.exe, and mllr_solve.exe created in "Creating tools for adaptation" into the working folder
- Generation of acoustic feature files (mfc files)
Run sphinx_fe
sphinx_fe -argfile en-us/feat.params -samprate 16000
-c a_data.fileids -di . -do . -ei wav -eo mfc -mswav yes
- Accumulating observation counts
  Run bw
  * Make sure the bw arguments match the feat.params in the acoustic model folder
  * If you get the following error from bw, the number of words in your recording does not match the transcription file. Review the transcription file.
  Error: Failed to align audio to transcript: final state of the search is not reached
bw
-hmmdir en-us
-moddeffn en-us/mdef
-ts2cbfn .ptm.
-feat 1s_c_d_dd
-svspec 0-12/13-25/26-38
-cmn current
-agc none
-dictfn 8811.dic
-ctlfn my_small_model.fileids
-lsnfn my_small_model.transcription
-accumdir .
[ Options (partial) ]
- -hmmdir: folder containing the acoustic model
- -moddeffn: model definition file name (mdef)
- -dictfn: file name of the dictionary
- -ctlfn: file name of the fileids file
- -lsnfn: file name of the transcription file
- -accumdir: destination folder
- Conversion using MLLR (create mllr_matrix)
Run mllr_solve
mllr_solve -meanfn en-us/means -varfn en-us/variances
-outmllrfn mllr_matrix -accumdir .
- Copy the acoustic model folder (here, en-us-adapt)
  Copy the en-us folder to create an en-us-adapt folder
- Updating the acoustic model files with MAP (updates the files in the en-us-adapt folder)
Run map_adapt
map_adapt -moddeffn en-us/mdef -ts2cbfn .ptm. -meanfn en-us/means
-varfn en-us/variances -mixwfn en-us/mixture_weights
-tmatfn en-us/transition_matrices -accumdir .
-mapmeanfn en-us-adapt/means -mapvarfn en-us-adapt/variances
-mapmixwfn en-us-adapt/mixture_weights
-maptmatfn en-us-adapt/transition_matrices
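For convenience, here is a sketch (my own addition) that runs the four commands above in order from Python with subprocess. It assumes it is run from the working folder and unifies the example file names to the ones used in the bw example (8811.dic, my_small_model.fileids/.transcription); adjust them to your own files, and make sure the en-us-adapt folder has already been copied before map_adapt runs.

import subprocess

def run(cmd):
    # Print and run one adaptation tool; stop on the first failure
    print(">", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["sphinx_fe", "-argfile", "en-us/feat.params", "-samprate", "16000",
     "-c", "my_small_model.fileids", "-di", ".", "-do", ".",
     "-ei", "wav", "-eo", "mfc", "-mswav", "yes"])

run(["bw", "-hmmdir", "en-us", "-moddeffn", "en-us/mdef", "-ts2cbfn", ".ptm.",
     "-feat", "1s_c_d_dd", "-svspec", "0-12/13-25/26-38",
     "-cmn", "current", "-agc", "none", "-dictfn", "8811.dic",
     "-ctlfn", "my_small_model.fileids",
     "-lsnfn", "my_small_model.transcription", "-accumdir", "."])

run(["mllr_solve", "-meanfn", "en-us/means", "-varfn", "en-us/variances",
     "-outmllrfn", "mllr_matrix", "-accumdir", "."])

run(["map_adapt", "-moddeffn", "en-us/mdef", "-ts2cbfn", ".ptm.",
     "-meanfn", "en-us/means", "-varfn", "en-us/variances",
     "-mixwfn", "en-us/mixture_weights",
     "-tmatfn", "en-us/transition_matrices", "-accumdir", ".",
     "-mapmeanfn", "en-us-adapt/means", "-mapvarfn", "en-us-adapt/variances",
     "-mapmixwfn", "en-us-adapt/mixture_weights",
     "-maptmatfn", "en-us-adapt/transition_matrices"])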
Create audio data
Record speaking voice
Recording is done with the software that comes with Windows.
* Of course you can use whatever you want.
"Voice Recorder (Sound Recorder) 5" that comes with Windows creates audio files in the file format m4a (or wma for Sound Recorder).
The acoustic model adaptation tools expect files in wav format.
Therefore, a file format conversion is required.
Convert m4a(wma) to wav
I created a Python conversion program using the audio file conversion tool ffmpeg.
It's simple code, so I'll show it below.
It converts the m4a files in a folder to wav files one by one.

import glob
import subprocess

ffmpeg_path = r"C:\temp\ffmpeg\bin\ffmpeg.exe"  # Path of ffmpeg.exe
target_path = r"C:\temp\音声\*.m4a"  # Path of audio data before conversion

for path in glob.iglob(target_path):
    print(path)
    cmd = f"{ffmpeg_path} -i {path} -ac 1 -ar 16000 {path.replace('.m4a', '.wav')}"
    subprocess.call(cmd, shell=True)
print("finish")
The wav file you create should meet the requirements of the adaptation tool.
[ wav file requirements ]
Single channel mono with 16kHz sampling rate
- bit resolution: 16
- sample rate: 16000
- audio channels: mono
- How to specify in ffmpeg
ffmpeg -i xx.m4a -ac 1 -ar 16000 xx.wav
* You can also execute the above manually one file at a time to convert.
- Some of ffmpeg's options
  -ac: number of audio channels
  -ar: sampling rate
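To double-check the converted files, a small script like the following (my own addition, using Python's standard wave module) can verify the requirements:

import wave

def check_wav(path):
    # Verify 16-bit samples, 16000 Hz sampling rate, and a single (mono) channel
    with wave.open(path, "rb") as w:
        ok = (w.getsampwidth() == 2 and
              w.getframerate() == 16000 and
              w.getnchannels() == 1)
    print(path, "OK" if ok else "does not meet the requirements")

check_wav("down10.wav")  # example file name from this article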
- How I chose this method
I checked how to convert from m4a (wma) to wav.
There is a package called "pydub · PyPI " in Python.
As a result of using it, I found the following.
- It uses a library called ffmpeg internally
- It does not work without downloading ffmpeg
In that case, I thought it would be easier to understand if I created a program that starts ffmpeg directly.
Download ffmpeg
ffmpeg can be downloaded from the following site.
- Obtained from:Releases · GyanD/codexffmpeg
- File: ffmpeg-5.1.2-essentials_build.zip (the version is whatever is latest)
Readme.txt shows what kind of audio files are supported (support type).
[ Examples of support types ]
- aac: file format used in m4a files
- alac: file format used in m4a files
- wmav1: it seems to correspond to wma7
- wmav2: it seems to correspond to wma8 and wma9
Creating tools for adaptation
The tools are not provided as binaries, so you have to build them from the sphinxtrain source. 6
Use the following tools for adaptation:
sphinx_fe.exe
bw.exe
mllr_solve.exe
map_adapt.exe
Get sphinxtrain source
The tools used for adaptation are released as source code in a package called sphinxtrain.
Get sphinxtrain from:
- Acquisition destination (GitHub): Releases · cmusphinx/sphinxtrain
- File: Source code (zip) or Source code (tar.gz)
After downloading, unzip it. A sphinxtrain-5.0.0 folder will be created.
Build from sphinxtrain source code
Building the sphinxtrain source requires CMake, C, Perl, and Python.
Prepare compiler etc.
- Install CMake
  Download and install the installer from the official site
  Acquisition destination: Download | CMake
- Install Perl and associate the .pl extension
  - Installation of Perl was based on an external site 7
  - Associating Perl with the file extension
    You can set the association with the following commands (run as administrator):
    - Set association (set and associate the file type)
      FTYPE SPerl="C:\strawberry\Perl\bin\perl.exe" "%1" %*
      ASSOC .pl=SPerl
    - Check the association
      Command: assoc .extension
      Displayed result: .extension=file_type_name
      Command: ftype file_type_name
      Displayed result: file_type_name=open_command_string
    - Command description
      ASSOC [.extension[=[file_type_name]]]
        View or change file extension associations
      FTYPE [file_type_name[=[open command string]]]
        View or change the file types used for file extension associations
- Install C
  Installation of C was based on an external site 8
Build with CMake
sphinxtrain provides a CMakeLists.txt (in the sphinxtrain-5.0.0 folder of the obtained source).
Therefore, it can be built with CMake.
CMake creates the build files first, then the binaries.
- Create build files with CMake
  Create a build folder, navigate to it, and run CMake.
  * The reason for creating a build folder is that cmake creates a lot of files, so this keeps the original source folder clean.
  Example:
  mkdir build
  cd build
  cmake .. -A x64
  * The current folder is the build folder, so .. specifies the parent folder, which contains the CMakeLists.txt file.
  Options:
  - -G: specify the generator (target compiler)
    The default is Visual Studio 17 2022
    Not required if the C compiler is installed in VSCode
    Use -G "MinGW Makefiles" for MinGW
  - --fresh: add when rebuilding
  - -A: specify the platform (if the generator supports it)
    "Visual Studio 17 2022" defaults to "Win32"
    Values: Win32, x64, ARM, ARM64
  - -DCMAKE_BUILD_TYPE=Release: change the build mode
- Create binaries with CMake
  cmake --build .
  Create the binaries using the build files created in the previous step.
  * The default is Debug mode, so the binaries are created in the Debug folder
  * . is the folder where the build files were created; here it is the current build folder.
Using an adapted acoustic model in SpeechRecognition
Use the adapted acoustic model by specifying it in SpeechRecognition as follows:
Specify the language argument of the recognize_sphinx() method.
[ language argument ]
language=(en-us-adapt_folder, nnnn.lm_file, nnnn.dic_file)
- Code example
self.r.recognize_sphinx(audio, language=(r"C:\my_model\en-us-adapt", r"C:\my_model\1347.lm", r"C:\my_model\1347.dic"))
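Putting it together, a minimal sketch (the paths are examples from this article; adjust them to your environment, and sr.Microphone requires PyAudio):

import speech_recognition as sr

# Adapted acoustic model folder, language model file, and dictionary file (example paths)
model = (r"C:\my_model\en-us-adapt", r"C:\my_model\1347.lm", r"C:\my_model\1347.dic")

r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say a command...")
    audio = r.listen(source)
try:
    print("Recognized:", r.recognize_sphinx(audio, language=model))
except sr.UnknownValueError:
    print("Could not understand the audio")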
Support for Japanese
Since it may serve as a reference for other languages, I will describe what I did as is.
As mentioned in "Effects of acoustic model adaptation", CMUSphinx may be able to recognize Japanese even when using an acoustic model for another language.
In fact, it turned out to be usable.
It is effective for recognizing short words like voice assistants.
Since Japanese acoustic models may not be provided, this method is used to support Japanese.
The adaptation method is the same as the acoustic model adaptation in English.
All you have to do is convert the prepared corpus into Japanese (romaji).
However, it seems to be easier to recognize if you keep the following in mind.
- Words with 3 or more letters are easier to recognize
  Example: "Uehe" is better than "Ue"
- Consonants are easier to recognize than vowels
  Example: "Ueni" is better than "Uehe"
- Multiple words are easier to recognize
  Example: "Ueni ido" is better than "Ueni"
Transition of word selection until it can be recognized in Japanese
I will introduce the contents of trial and error in word selection.
By the third round of word choices it recognizes fairly well, but still not well enough.
| 1st time | 2nd time | 3rd time | meaning |
|---|---|---|---|
| shita | shitae | shitani idou | scroll down |
| jipeiji | jipeiji | jipeiji | next page |
| ue | uee | ueni idou | scroll up |
| toppu | ichiban uee | ichiban ueni | scroll to top |
| soko | ichiban shitae | ichiban shitani | scroll to the bottom |
| tabu | tsugino rinku | tsugino rinku | next link |
| modoru | modoru | modoru | return |
| miru | hyouji | hyouji | open link |
| kensaku | kensaku | kensaku | search |
| owari | owari | owari | end |
- 1st time: Next page, search, and return are easy to recognize
- 2nd time: "Uehe" and "shitahe" are difficult to recognize
- 3rd time: 60-70% recognized (there are good times and bad times)
The challenge is to select words that raise the recognition rate.
My vague impression is that it is better to choose words whose initial sounds come from different rows and columns of the Japanese syllabary.
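For example, the third-round word list from the table above can be written as a corpus.txt (romaji, one command per line) like this:

shitani idou
jipeiji
ueni idou
ichiban ueni
ichiban shitani
tsugino rinku
modoru
hyouji
kensaku
owari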
pocketsphinx and speech recognition
pocketsphinx uses speech recognition technology.
I can't explain speech recognition technology in detail, but CMUSphinx Tutorial For Developers – CMUSphinx Open Source Speech Recognition covers the terms you should know.
Recognition method
- The recognition engine looks for words both in the dictionary and in the language model
- If the language model does not have the word, it will not be recognized even if it exists in the dictionary
Models
CMUSphinx uses three models: an acoustic model, a language model, and a phonetic dictionary.
- Acoustic model: the basis for converting speech into phonemes (the smallest unit of speech that can distinguish the meaning of words)
  General acoustic models are based on statistical processing of speech data from thousands of people over thousands of hours
- Phonetic dictionary: a mapping of words to phoneme sequences
  Words are judged from phoneme sequences by pattern matching
- Language model: a model of the "words" spoken and written by humans, based on the probability of occurrence of words 9
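As an illustration, the dictionary (nnnn.dic) generated in "Create language model from corpus" maps each word to a phoneme sequence, with entries roughly like the following (the phoneme symbols are standard CMUdict/ARPAbet notation and are shown here only as an example):

BACK    B AE K
DOWN    D AW N
NEXT    N EH K S T
PAGE    P EY JH
UP      AH P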
Language model
There are two forms of language model implementation:
- Text ARPA format
  - The ARPA format can be edited
  - ARPA files have the extension .lm
- Binary BIN format
  - The binary format speeds up loading
  - Binary files have the extension .lm.bin
* Conversion between these formats is also possible
Get source
You can get the full source here.
- Source code: voice_input_Sphinx.py
- Obtained from: GitHub juu7g/Python-voice-input
* Please note that the source contains code for debugging.
I use the logging module for debugging.
Finally
This time, I noticed something while doing this work.
For voice recognition, multiple words are easier to recognize than a single word.
Come to think of it, Google uses "OK Google" and Mercedes uses "Yes, Mercedes", so I feel this is related.
I think the reason is that in Japanese, each sound is distinct, but in English, when words are connected, the way they are pronounced changes.
Voice recognition seems to be made with that in mind.
I'm not a speech recognition expert, so I'm just guessing.
I think Japanese speech recognition is easy if you just recognize sounds.
I think it's hard to turn sounds (kana) into words (kanji).
It's semantic analysis, not speech recognition.
* Sorry, this is still about Japanese.
Caution
This article is based on what worked under the following versions:
- Python 3.8.5
- pocketsphinx 5.0.0
Disclaimer
Please check "Disclaimer" before using.
If you have any questions, please contact me from "Contact me".
Reference
- Sphinx: CMUSphinx Tutorial For Developers – CMUSphinx Open Source Speech Recognition
- Acoustic model adaptation: Adapting the default acoustic model – CMUSphinx Open Source Speech Recognition
- [ Japanese site ] What is a corpus: Link
- [ Japanese site ] voice recognition: Link
- [ Japanese site ] How speech recognition works: Link
- [ Japanese site ] 📖 Supported speech recognition engine/API[ Python ] 🔗↩
- [ Japanese site ] 📖 How to make an app that creates sentences by voice input[ Python ] 🔗↩
- See training requirements here: Training an acoustic model for CMUSphinx – CMUSphinx Open Source Speech Recognition ↩
- See Wikipedia: Text corpus - Wikipedia ↩
- Windows recording tools have been renamed in different versions of Windows. Voice Recorder (Windows 10 or later), Sound Recorder (up to Windows 8)↩
- Originally, it seems that SphinxTrain should be executed, but I could not understand how to execute it and what was executed at that time, so I used the built tool.↩
- [ Japanese site ] See here for Perl installation:Link ↩
- [ Japanese site ] See here for C installation: Link ↩
- [ Japanese site ] Language model (also for GPT): Link ↩