
Acoustic model adaptation in CMU Sphinx [Python]


I tried voice input with Python's SpeechRecognition package, using CMU Sphinx as the engine.
Out of the box the recognition rate is low, so I adapted the acoustic model to improve it.
This article introduces how to adapt the acoustic model.

Wouldn't you like to control something with a voice assistant like Apple's Siri, Google Assistant, or Amazon Alexa?
I tried to realize speech recognition, the foundation of such assistants, with CMUSphinx.

Here I introduce how to use it on Windows, but it should work in much the same way in a Unix-like environment.
I imagine it can also be applied to devices like the Raspberry Pi.


A voice assistant is one application of speech recognition.
One thing I want when building a voice assistant is that it works offline.
On a personal computer this consideration is unnecessary, but on other devices (such as ones built on a Raspberry Pi) it matters.
The Python speech recognition package SpeechRecognition supports several speech recognition engines.
Among them, CMU Sphinx is the only one that supports offline speech recognition for free. 1

Speech recognition with CMUSphinx

Some time ago, I wrote a program for voice input using the SpeechRecognition package in Python. 2
Switching the speech recognition engine of this program to CMUSphinx is easy:
just change the method used for recognition from recognize_google() to recognize_sphinx().

* You need to install pocketsphinx to use it.
pip install pocketsphinx
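
As a minimal sketch of the change (the microphone handling and error handling below are my own additions, not the original program; microphone input requires PyAudio):

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say a command...")
    audio = r.listen(source)

try:
    # recognize_google(audio) -> recognize_sphinx(audio) is the only change needed
    print(r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")
except sr.RequestError as e:
    print(f"Sphinx error: {e}")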

When you try it, you notice the following:

  1. It does not recognize English words well
  2. English sentences are difficult to recognize
  3. The recognition rate is not very high even for English sentences spoken by native speakers

After investigating this result, I decided that some tuning is necessary to use CMUSphinx comfortably.

There are two tuning methods:

  • Adapting the acoustic model
  • Training an acoustic model

However, the only realistic option here is adapting the acoustic model.

According to CMUSphinx, training requires the following: 3

  • You have knowledge of the phonetic structure of the language
  • You have a lot of data to train on
    e.g., more than an hour of recordings of one speaker's commands
  • You have enough time to train the model and optimize the parameters

I can't meet these requirements, so I'll tune by adapting the acoustic model.

Effects of acoustic model adaptation

According to the CMUSphinx tutorial site, the effects of acoustic model adaptation are as follows.

[ Effects of Acoustic Model Adaptation ]

  • Adaptation does not necessarily adapt to a particular speaker, but can adapt to a particular recording environment, accent, etc.
  • Cross-language adaptation is also meaningful, and a model of English can be adapted to the sounds of another language.
  • The adaptation process is more robust than training and can give good results even with small adaptation data

From these points, I decided that CMUSphinx can be used as a voice assistant by adapting the acoustic model.

How to adapt the acoustic model

What it takes to adapt an acoustic model for a voice assistant

Prepare a simple language model for the voice assistant and adapt the existing English acoustic model.

Adapting the acoustic model involves the steps described in the following sections.

Create a simple language model

If your goal is to recognize things like simple commands, it's a good idea to create a simple language model.
It is convenient to use a web service to create it.

You only need a corpus (a list of commands) to create a language model with a web service.

A corpus is, strictly speaking, a large structured collection of natural language text 4, but for a voice assistant that recognizes only simple commands, it is enough to list the words to be recognized.

[ Steps to create a simple language model ]

  1. Create corpus file
  2. Create language model from corpus

Create corpus file

A "corpus" here is simply a list of sentences used to adapt an acoustic model.
In other words, it's a list of commands to use with your voice assistant.
As an example, assume commands that control the browser.

- Creation example

down
next page
up
top
bottom
next link
back
see
search
stop

Create this in a file called corpus.txt for example.

Create language model from corpus

Create a language model from a corpus file with the following web service:

 Language model creation service: Sphinx Knowledge Base Tool VERSION 3

On the displayed site, do the following:

  1. Press the "Browse" button and select the created corpus.txt file
  2. Press the "COMPILE KNOWLEDGE BASE" button
  3. You will see a page titled "Sphinx knowledge base generator"
    Download the "TARnnnn.tgz" file
    * nnnn is a 4-digit number; the archive contains dic, lm, log_pronounce, sent, vocab, and HEADER.html files

Create adaptation data

Acoustic model adaptation requires the following data:

  • transcription file : a text file giving, for each audio file, the words spoken in it
  • fileids file : a text file listing the audio data file names
  • wav file : the audio data

* The transcription file and the fileids file must list the audio files in the same order.

[ Description of each file ]

  • transcription file
    A text file with the extension .transcription
    File name example: data.transcription
    Each line contains the words spoken in the audio file (in uppercase) followed by the audio file name (without extension)
    Example data:
<s> DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN DOWN </s> (down10)
<s> NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE NEXT PAGE </s> (next_page10)
<s> UP UP UP UP UP UP UP UP UP UP </s> (up10)
  • fileids file
    A text file with the extension .fileids
    File name example: data.fileids
    List audio filenames (without extension)
    Example data:
down10
next_page10
up10

* There is also a way to create a transcription file and a fileids file by splitting an existing recording into sentences.
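
If you record one wav file per command, the two files can also be generated with a short script. A minimal sketch (the command list, repeat count, and file names below are assumptions matching the examples above):

# Generate data.fileids and data.transcription from a dict of
# {audio file name (no extension): command text}
commands = {
    "down10": "down",
    "next_page10": "next page",
    "up10": "up",
}
REPEATS = 10  # each command was recorded 10 times in one file

with open("data.fileids", "w") as f_ids, open("data.transcription", "w") as f_tr:
    for file_id, text in commands.items():
        f_ids.write(file_id + "\n")
        spoken = " ".join([text.upper()] * REPEATS)
        f_tr.write(f"<s> {spoken} </s> ({file_id})\n")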

Performing Acoustic Model Adaptation

Adapting an acoustic model fits the model to the adaptation data, making it easier to recognize the speaking style that the adaptation data represents.
Follow the steps below. (A small script that combines steps 2 to 6 is shown after the list.)

  1. Create a working folder
    Create a working folder and copy acoustic model, dictionary, language model, adaptation data and binaries
    • Copy default acoustic model
      1. Get Acoustic Model
        Acquisition destination : CMU Sphinx Files
        Acquisition file: cmusphinx-en-us-ptm-5.2.tar.gz (ptm is for mobile)
        * After unzipping, make sure mixture_weights is inside and mdef is a text file
        * Normally you would copy the pocketsphinx\model\en-us\en-us folder from the pocketsphinx installation, but
         that bundled model is a compressed version, so get the full uncompressed version from the link above
      2. Create an en-us folder in the working folder and store all the acquired acoustic model files
        Files: feat.params, mdef, means, mixture_weights, noisedict, README, sendump, transition_matrices, variances
    • Copy dictionary (nnnn.dic) and language model (nnnn.lm) to working folder
      These are the files obtained in "Create language model from corpus"
    • Copy adaptation data to working folder
      Copy the fileids, transcription, and wav files created in "Create adaptation data"
    • Copy binaries
      Copy sphinx_fe.exe, bw.exe, map_adapt.exe, mllr_solve.exe created in "Creating tools for adaptation" into the working folder
  2. Generation of acoustic feature files (mfc files)
    Run sphinx_fe
    sphinx_fe -argfile en-us/feat.params -samprate 16000
    -c data.fileids -di . -do . -ei wav -eo mfc -mswav yes
  3. Accumulating observation counts
    Run bw
    * Make sure the bw argument matches the feat.params in the acoustic model folder
    * If you get the following error with bw, the number of words in your recording is not the same as in the transcription file.
     Review the transcription file.
     Error: Failed to align audio to transcript: final state of the search is not reached
    bw
    -hmmdir en-us
    -moddeffn en-us/mdef
    -ts2cbfn .ptm.
    -feat 1s_c_d_dd
    -svspec 0-12/13-25/26-38
    -cmn current
    -agc none
    -dictfn 8811.dic
    -ctlfn data.fileids
    -lsnfn data.transcription
    -accumdir .
    [ Options (partial) ]
    • -hmmdir: folder with acoustic model
    • -moddeffn: model definition file name (mdef in the acoustic model folder)
    • -dictfn: file name of dictionary
    • -ctlfn: filename of fileids file
    • -lsnfn: filename of transcription file
    • -accumdir: Destination folder
  4. Conversion using MLLR (create mllr_matrix)
    Run mllr_solve
    mllr_solve -meanfn en-us/means -varfn en-us/variances
    -outmllrfn mllr_matrix -accumdir .
  5. Copy the acoustic model folder (here, en-us-adapt)
    Copy the en-us folder to create the en-us-adapt folder
  6. Updating Acoustic Model Files with MAP (Update files in en-us-adapt folder)
    Run map_adapt
    map_adapt -moddeffn en-us/mdef -ts2cbfn .ptm. -meanfn en-us/means
    -varfn en-us/variances -mixwfn en-us/mixture_weights
    -tmatfn en-us/transition_matrices -accumdir .
    -mapmeanfn en-us-adapt/means -mapvarfn en-us-adapt/variances
    -mapmixwfn en-us-adapt/mixture_weights
    -maptmatfn en-us-adapt/transition_matrices
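
For repeated runs, steps 2 to 6 can be driven from one small Python script, as referenced in the list above. This is a minimal sketch under the assumptions used in this article: the built tools (sphinx_fe, bw, mllr_solve, map_adapt) are in the working folder or on PATH, the dictionary is 8811.dic, and the adaptation data files are data.fileids / data.transcription.

import shutil
import subprocess

DICTIONARY = "8811.dic"               # dictionary from the knowledge base tool
FILEIDS = "data.fileids"              # fileids file
TRANSCRIPTION = "data.transcription"  # transcription file

def run(args):
    print(">", " ".join(args))
    subprocess.run(args, check=True)

# 2. Generate acoustic feature files (mfc)
run(["sphinx_fe", "-argfile", "en-us/feat.params", "-samprate", "16000",
     "-c", FILEIDS, "-di", ".", "-do", ".", "-ei", "wav", "-eo", "mfc",
     "-mswav", "yes"])

# 3. Accumulate observation counts
run(["bw", "-hmmdir", "en-us", "-moddeffn", "en-us/mdef", "-ts2cbfn", ".ptm.",
     "-feat", "1s_c_d_dd", "-svspec", "0-12/13-25/26-38", "-cmn", "current",
     "-agc", "none", "-dictfn", DICTIONARY, "-ctlfn", FILEIDS,
     "-lsnfn", TRANSCRIPTION, "-accumdir", "."])

# 4. Create the MLLR transform
run(["mllr_solve", "-meanfn", "en-us/means", "-varfn", "en-us/variances",
     "-outmllrfn", "mllr_matrix", "-accumdir", "."])

# 5. Copy the acoustic model folder
shutil.copytree("en-us", "en-us-adapt", dirs_exist_ok=True)

# 6. Update the copied model with MAP
run(["map_adapt", "-moddeffn", "en-us/mdef", "-ts2cbfn", ".ptm.",
     "-meanfn", "en-us/means", "-varfn", "en-us/variances",
     "-mixwfn", "en-us/mixture_weights", "-tmatfn", "en-us/transition_matrices",
     "-accumdir", ".",
     "-mapmeanfn", "en-us-adapt/means", "-mapvarfn", "en-us-adapt/variances",
     "-mapmixwfn", "en-us-adapt/mixture_weights",
     "-maptmatfn", "en-us-adapt/transition_matrices"])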

Create audio data

Record speaking voice

Recording is done with the software that comes with Windows.
* Of course you can use whatever you want.

"Voice Recorder (Sound Recorder) 5" that comes with Windows creates audio files in the file format m4a (or wma for Sound Recorder).

The acoustic model adaptation tools expect files in wav format.
Therefore, a file format conversion is required.

Convert m4a(wma) to wav

I created a Python conversion program that uses the audio file conversion tool ffmpeg.
It is simple code, so I'll show it below.

It converts the m4a files in a folder to wav files one by one.

import glob
import subprocess

ffmpeg_path = r"C:\temp\ffmpeg\bin\ffmpeg.exe" # Path of ffmpeg.exe
target_path = r"C:\temp\音声\*.m4a" # Path of audio data before conversion

for path in glob.iglob(target_path):
    print(path)
    # Convert to 16 kHz mono wav; passing the arguments as a list avoids
    # problems with spaces in file paths
    cmd = [ffmpeg_path, "-i", path, "-ac", "1", "-ar", "16000",
           path.replace(".m4a", ".wav")]
    subprocess.call(cmd)
print("finish")

The wav file you create should meet the requirements of the adaptation tool.

[ wav file requirements ]
Single channel mono with 16kHz sampling rate

  • bit resolution: 16
  • sample rate: 16000
  • audio channels: mono

- How to specify in ffmpeg

 ffmpeg -i xx.m4a -ac 1 -ar 16000 xx.wav

* You can also execute the above manually one file at a time to convert.

- Some of ffmpeg's options

  • -ac: number of audio channels
  • -ar: sampling rate
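
You can check that a converted file actually meets these requirements with Python's standard wave module. A minimal sketch (the file name below is just an example):

import wave

with wave.open(r"C:\temp\音声\down10.wav", "rb") as w:
    print("channels:", w.getnchannels())            # expect 1 (mono)
    print("sample rate:", w.getframerate())         # expect 16000
    print("bit resolution:", w.getsampwidth() * 8)  # expect 16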

- How I chose this method

I checked how to convert from m4a (wma) to wav.

There is a Python package called pydub (available on PyPI).

After trying it, I found the following:

  • It uses a library called ffmpeg internally
  • Does not work without downloading ffmpeg

Given that, I decided it would be simpler to write a program that invokes ffmpeg directly.

Download ffmpeg

ffmpeg can be downloaded from the following site.

The Readme.txt lists which audio file formats are supported (support types).

[ Examples of support types ]

  • aac: File format used in m4a files
  • alac: File format used in m4a files
  • wmav1: It seems to be compatible with wma7
  • wmav2: It seems that it corresponds to wma8 and wma9

Creating tools for adaptation

The tools are not provided as binaries, so you have to build them from the sphinxtrain source. 6
Use the following tools for adaptation:

  • sphinx_fe.exe
  • bw.exe
  • mllr_solve.exe
  • map_adapt.exe

Get sphinxtrain source

The tools used for adaptation are released in source form as sphinxtrain.
Get sphinxtrain from:

After downloading, unzip the archive.
A sphinxtrain-5.0.0 folder will be created.

Build from sphinxtrain source code

Building the sphinxtrain source requires CMake, a C compiler, Perl, and Python.

Prepare compiler etc.

  • Install CMake
    Download and install the installer from the official site
    Acquisition destination: Download | CMake
  • Install and associate perl
    • Installation of perl was based on an external site 7
    • Associating perl with file extensions
      You can associate using the following command (run as administrator)
      • set association(Set and associate file types)
        FTYPE SPerl="C:\strawberry\Perl\bin\perl.exe" "%1" %*
        ASSOC .pl=SPerl
      • Check Association
        Command: assoc .extension
        Displayed results: .extension = file type name
        Command: ftype file_type_name
        Displayed results: file type name = open command string
      • Command description
        • ASSOC [.extension[=[file_type_name]]]
          View or change file extension associations
        • FTYPE [file_type_name[=[open command string]]]
          View or change file types used for file extension associations
  • Install a C compiler
    Installation of the C compiler was based on an external site 8

Build with CMake

sphinxtrain provides a CMakeLists.txt (in the sphinxtrain-5.0.0 folder of the downloaded source).
Therefore, it is possible to build with CMake.
CMake creates the build files first, then the binaries.

  1. Create build files with CMake
    Create a build folder, navigate to the build folder and run CMake
    * The build folder keeps the many files cmake generates from polluting the original source folder
    Example: mkdir build
      cd build
      cmake .. -A x64
    * The current folder is the build folder, so the parent folder containing CMakeLists.txt is specified with ".."
    Specifying options
    • -G: Specify generator (target compiler)
      default is Visual Studio 17 2022
      Not required if C compiler is installed in VSCode
      -G "MinGW Makefiles" for MinGW
    • --fresh: Append when rebuilding
    • -A: Specify platform (if generator supports it)
      "Visual Studio 17 2022" defaults to "Win32"
      Value: Win32, x64, ARM, ARM64
    • -DCMAKE_BUILD_TYPE=Release: Change mode
  2. Create binaries with CMake
    cmake --build .
    Create binaries using the build file created in step 1.
    * The default is Debug mode, so binaries are created in the Debug folder
    * "." is the folder containing the build files, i.e. the current build folder here

Using an adapted acoustic model in SpeechRecognition

Use the adapted acoustic model by specifying it in SpeechRecognition as follows:

Specify the language argument to the recognize_sphinx() method.

[ language argument ]

  language=(en-us-adapt folder, nnnn.lm file, nnnn.dic file)

- Code example

   self.r.recognize_sphinx(audio,
                           language=(r"C:\my_model\en-us-adapt",
                                     r"C:\my_model\1347.lm",
                                     r"C:\my_model\1347.dic"))

Support for Japanese

As a reference, the method for Japanese is described below.
It may also be useful for other languages, so I'll include it as is.

As mentioned in "Effects of acoustic model adaptation", CMUSphinx may be able to recognize Japanese even when using an acoustic model for another language.
In practice, it looked usable.
It is effective for recognizing short words, as a voice assistant does.

Since Japanese acoustic models may not be provided, this method is used to support Japanese.

The adaptation method is the same as the acoustic model adaptation in English.
All you have to do is convert the prepared corpus into Japanese (romaji).
However, it seems that it will be easier to recognize if you consider the following.

  • Words with 3 or more letters are easier to recognize
    Example: "Uehe" is better than "Ue"
  • Consonants are easier to recognize than vowels
    Example: "Ueni" is better than "Uehe"
  • Multiple words are easier to recognize
    Example: "Ueni ido" is better than "Ueni"

Transition of word selection until it can be recognized in Japanese

I will introduce my trial and error in word selection.
With the third set of words, recognition is fairly good, but still not good enough.

1st time   2nd time         3rd time          meaning
shita      shitae           shitani idou      scroll down
jipeiji    jipeiji          jipeiji           next page
ue         uee              ueni idou         scroll up
toppu      ichiban uee      ichiban ueni      scroll to top
soko       ichiban shitae   ichiban shitani   scroll to the bottom
tabu       tsugino rinku    tsugino rinku     next link
modoru     modoru           modoru            return
miru       hyouji           hyouji            open link
kensaku    kensaku          kensaku           search
owari      owari            owari             end
  • 1st time: Next page, search, and return are easy to recognize
  • 2nd time: "Uehe" and "shitahe" are difficult to recognize
  • 3rd time: 60-70% recognized(There are good times and bad times)

The challenge is to select words that increase the recognition rate.
My hunch is that it is better to use words that start with sounds from different rows and columns of the Japanese syllabary.

pocketsphinx and speech recognition

pocketsphinx uses speech recognition technology.
I can't explain speech recognition technology in detail, but the terms you should know are covered in CMUSphinx Tutorial For Developers – CMUSphinx Open Source Speech Recognition.

Recognition method

  • The recognition engine looks for words both in the dictionary and in the language model
  • If the language model does not have the word, it will not be recognized even if it exists in the dictionary

Models

CMUSphinx uses three models: an acoustic model, a language model, and a phonetic dictionary.

  • Acoustic model: the basis for converting speech into phonemes (the smallest units of speech that distinguish word meanings)
    General acoustic models are built by statistically processing speech data from thousands of speakers and thousands of hours
  • Phonetic dictionary: a mapping from words to phoneme sequences
    Words are identified from phoneme sequences by pattern matching
  • Language model: A model of the "words" spoken and written by humans based on the probability of occurrence of words 9

Language model

There are two forms of language model implementation.

  • Text ARPA format
    • ARPA format can be edited
    • ARPA file has the extension .lm
  • Binary BIN format
    • Binary format speeds up loading
    • Binary file has the extension .lm.bin

* Conversion between these formats is also possible
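
For example, the ARPA file can be converted to the binary format with the sphinx_lm_convert tool (shipped with sphinxbase/pocketsphinx; check that it is available in your installation), roughly like this:

 sphinx_lm_convert -i 8811.lm -o 8811.lm.bin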

Get source

You can get the full source here.

* Please note that the source contains code for debugging.
 I use the logging module for debugging.

Finally

This time, I noticed something while doing this work.
For voice recognition, multiple words are easier to recognize than a single word.
Come to think of it, Google uses "OK Google" and Mercedes uses "Hi, Mercedes"; I feel this is related.
I think the reason is that in Japanese each sound is pronounced distinctly, whereas in English the pronunciation changes when words are connected.
Speech recognition seems to be built with that in mind.

I'm not a speech recognition expert, so this is just a guess.
I think Japanese speech recognition is easy if you just recognize sounds.
I think it's hard to turn sounds (kana) into words (kanji).
It's semantic analysis, not speech recognition.


Please also read 📖 How to make an app that creates sentences by voice input [Python] 🔗
* Sorry, it's still in Japanese.

Caution

This article is based on what worked under the following versions:

  • Python 3.8.5
  • pocketsphinx 5.0.0

Disclaimer

Please check "Disclaimer" before using.

If you have any questions, please contact me from "Contact me".

Reference


  1. [ Japanese site ] 📖 Supported speech recognition engine/API [Python] 🔗
  2. [ Japanese site ] 📖 How to make an app that creates sentences by voice input [Python] 🔗
  3. See training requirements here: Training an acoustic model for CMUSphinx – CMUSphinx Open Source Speech Recognition
  4. See Wikipedia: Text corpus - Wikipedia
  5. Windows recording tools have been renamed in different versions of Windows. Voice Recorder (Windows 10 or later), Sound Recorder (up to Windows 8)
  6. Originally, it seems that SphinxTrain should be executed, but I could not understand how to execute it and what was executed at that time, so I used the built tool.
  7. [ Japanese site ] See here for Perl installation: Link
  8. [ Japanese site ] See here for C installation: Link
  9. [ Japanese site ] Language model (also for GPT): Link