A text-and-speech processing assistant in 400 lines of Python (5) - Speech recognition

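Speech recognition is the core feature of pzh-py-speech, which relies on the SpeechRecognition library together with the CMU Sphinx engine to provide it. Today 痞子衡 walks through how speech recognition is implemented in pzh-py-speech.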

1. Introduction to SpeechRecognition

SpeechRecognition is a Python library for speech recognition written by Anthony Zhang (Uberi). It was first released in 2014 and has been updated regularly since; pzh-py-speech uses SpeechRecognition 3.8.1.

The official pages for SpeechRecognition are:

SpeechRecognition home page: https://github.com/Uberi/speech_recognition

SpeechRecognition installation: https://pypi.org/project/SpeechRecognition/

SpeechRecognition does not recognize speech by itself; it calls third-party speech recognition engines to do the work. It supports the following 8 engines:

CMU Sphinx (works offline)

Google Speech Recognition

Google Cloud Speech API

Wit.ai

Microsoft Bing Voice Recognition

Houndify API

IBM Speech to Text

Snowboy Hotword Detection (works offline)
　　

Whichever engine you pick, the calling interface in SpeechRecognition is the same. Take the example script audio_transcribe.py, which transcribes an audio file to text; an excerpt follows:

import speech_recognition as sr

# Path of the audio file to transcribe (english.wav)
from os import path
AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")

# Create a Recognizer and read the audio data from english.wav
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# Recognize the audio with the CMU Sphinx engine
try:
    print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

# Recognize the audio with the Microsoft Bing Voice Recognition engine
BING_KEY = "INSERT BING API KEY HERE"  # Microsoft Bing Voice Recognition API keys are 32-character lowercase hexadecimal strings
try:
    print("Microsoft Bing Voice Recognition thinks you said " + r.recognize_bing(audio, key=BING_KEY))
except sr.UnknownValueError:
    print("Microsoft Bing Voice Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Microsoft Bing Voice Recognition service; {0}".format(e))

# Recognize the audio with other engines
# ... ...

Isn't SpeechRecognition remarkably easy to use? That simplicity is exactly where its strength lies. More examples are available at https://github.com/Uberi/speech_recognition/tree/master/examples
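Because every engine is reached through a method named recognize_<engine> on the same Recognizer object, picking an engine at run time can be reduced to a table lookup. The following is only a minimal sketch under that assumption: the short engine keys and the transcribe() helper are made up for illustration and are not part of SpeechRecognition, while the method names are the ones 痞子衡 believes the 3.8.x Recognizer class provides. Snowboy is left out because it is hotword detection rather than a recognize_* call.

```python
# Engine keys are illustrative labels chosen for this sketch (an assumption);
# the values are recognize_* method names of speech_recognition.Recognizer.
ENGINE_METHODS = {
    "sphinx": "recognize_sphinx",
    "google": "recognize_google",
    "google_cloud": "recognize_google_cloud",
    "wit": "recognize_wit",
    "bing": "recognize_bing",
    "houndify": "recognize_houndify",
    "ibm": "recognize_ibm",
}


def transcribe(recognizer, audio, engine="sphinx", **engine_kwargs):
    """Call the recognize_* method that matches `engine` on `recognizer`."""
    try:
        method_name = ENGINE_METHODS[engine]
    except KeyError:
        raise ValueError("unsupported engine: {0}".format(engine))
    method = getattr(recognizer, method_name)
    # Engine-specific options (language, key, ...) are passed through untouched.
    return method(audio, **engine_kwargs)


# Typical use, assuming speech_recognition is installed:
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   with sr.AudioFile("english.wav") as source:
#       audio = r.record(source)
#   print(transcribe(r, audio, "sphinx"))
```

The helper only needs an object that exposes the recognize_* methods, so it can be tried out with a stand-in object before any engine is installed.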

1.1 Choosing the CMU Sphinx engine

As 痞子衡 said earlier, SpeechRecognition has no recognition capability of its own, so we need to install a speech recognition engine for it. For JaysPySPEECH, 痞子衡 chose CMU Sphinx, which works offline.

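CMU Sphinx is an open-source speech recognition engine developed at Carnegie Mellon University. It works offline and supports multiple languages (English, Chinese, French, and others). Its official pages are: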

CMU Sphinx home page: https://cmusphinx.github.io/

CMU Sphinx downloads: https://sourceforge.net/projects/cmusphinx/

JaysPySPEECH is developed in Python, so we cannot use CMU Sphinx directly. What to do? No need to worry: Dmitry Prazdnichnov wrote a Python wrapper for CMU Sphinx, the PocketSphinx package (pocketsphinx-python). Its official pages are:

PocketSphinx home page: https://github.com/bambocher/pocketsphinx-python

PocketSphinx installation: https://pypi.org/project/pocketsphinx/

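We already installed SpeechRecognition and PocketSphinx in the first article of the JaysPySPEECH series (environment setup). On 痞子衡's machine they live in \speech_recognition and \pocketsphinx under C:\tools_mcu\Python27\Lib\site-packages. With these two packages installed, the engine is in place.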
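Before going further it can be worth confirming that both packages are importable from the interpreter you actually run. The snippet below is a small sketch of such a check; the helper name find_missing_packages is made up for this article, and the only assumption is that the import names are speech_recognition and pocketsphinx, matching the two folder names above.

```python
import importlib


def find_missing_packages(names=("speech_recognition", "pocketsphinx")):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The package is not installed for this interpreter.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_packages()
    print("missing packages: {0}".format(", ".join(missing) or "none"))
```

An empty result means both packages were found; anything listed still needs to be installed with pip.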

1.2 Adding a Chinese language pack to the PocketSphinx engine

By default, PocketSphinx recognizes only US English: the C:\tools_mcu\Python27\Lib\site-packages\speech_recognition\pocketsphinx-data directory contains only an en-US folder. Let's look at what is in that folder:

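\pocketsphinx-data\en-US
    \acoustic-model                    -- acoustic model
        \feat.params                   -- feature parameters of the HMM model
        \mdef                          -- model definition file
        \means                         -- means of the Gaussian mixture model
        \mixture_weights               -- mixture weights
        \noisedict                     -- noise (non-speech) dictionary
        \sendump                       -- mixture weights extracted from the acoustic model
        \transition_matrices           -- state transition matrices of the HMM model
        \variances                     -- variances of the Gaussian mixture model
    \language-model.lm.bin             -- language model
    \pronounciation-dictionary.dict    -- pronunciation dictionary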
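The layout above can also be inspected from code, which is handy for checking a language pack you add later. The sketch below lists the language folders under pocketsphinx-data and reports which top-level entries a given language is missing. The helper names are made up for illustration, and treating exactly these three entries as required is 痞子衡's assumption based on the en-US folder shown above.

```python
import os

# Top-level entries each language folder is expected to contain,
# taken from the en-US layout shown above (an assumption of this sketch).
REQUIRED_ENTRIES = (
    "acoustic-model",
    "language-model.lm.bin",
    "pronounciation-dictionary.dict",
)


def list_language_packs(data_dir):
    """Return the sorted names of the language folders under `data_dir`."""
    return sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )


def missing_entries(data_dir, language):
    """Return the required entries that are absent from one language folder."""
    lang_dir = os.path.join(data_dir, language)
    return [
        entry for entry in REQUIRED_ENTRIES
        if not os.path.exists(os.path.join(lang_dir, entry))
    ]
```

With SpeechRecognition installed, data_dir would typically be os.path.join(os.path.dirname(speech_recognition.__file__), "pocketsphinx-data").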
　　

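Does this pile of files look a bit cryptic? It stems from how the CMU Sphinx engine recognizes speech, which we will not dig into here. For an application that only calls the API, all we need to care about is how to add another language pack (such as Chinese) to CMU Sphinx.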

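To add a language we first need its language pack data. The CMU Sphinx site offers downloads for 12 mainstream languages at https://sourceforge.net/projects/cmusphinx/files/Acoustic_and_Language_Models/ . Since JaysPySPEECH needs to recognize Chinese, we download the three files under \Mandarin: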

\Mandarin
    \zh_broadcastnews_16k_ptm256_8000.tar.bz2    -- acoustic model
    \zh_broadcastnews_64000_utf8.DMP             -- language model
    \zh_broadcastnews_utf8.dic                   -- pronunciation dictionary

With the Chinese language pack data in hand, we follow the steps given in Notes on using PocketSphinx, which 痞子衡 summarizes as follows:

1. Create a zh-CN folder under \speech_recognition\pocketsphinx-data.

2. Extract zh_broadcastnews_16k_ptm256_8000.tar.bz2 and put all of the files inside it into \zh-CN\acoustic-model.

3. Rename zh_broadcastnews_utf8.dic to pronounciation-dictionary.dict and put it into \zh-CN.

4. Use the SphinxBase tools to convert zh_broadcastnews_64000_utf8.DMP into language-model.lm.bin and put it into \zh-CN.
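The first three steps are plain file operations, so they can be scripted. Below is a rough sketch using only the standard library; the function name install_language_pack and its parameters are made up for illustration. It assumes the archive holds the model files either at its top level or inside a single folder, and it flattens them into acoustic-model by file name. The language model conversion is not covered here.

```python
import os
import shutil
import tarfile


def install_language_pack(data_dir, language, acoustic_archive, dictionary_file):
    """Create the language folder, unpack the acoustic model, copy the dictionary."""
    lang_dir = os.path.join(data_dir, language)
    acoustic_dir = os.path.join(lang_dir, "acoustic-model")
    if not os.path.isdir(acoustic_dir):
        os.makedirs(acoustic_dir)  # also creates the language folder itself

    # Copy every regular file from the archive into acoustic-model,
    # dropping any folder prefix stored in the archive (assumed layout).
    with tarfile.open(acoustic_archive, "r:bz2") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            source = archive.extractfile(member)
            target = os.path.join(acoustic_dir, os.path.basename(member.name))
            with open(target, "wb") as out_file:
                shutil.copyfileobj(source, out_file)

    # Store the dictionary under the file name used by the en-US folder.
    shutil.copyfile(
        dictionary_file,
        os.path.join(lang_dir, "pronounciation-dictionary.dict"),
    )
    return lang_dir
```

A call would look like install_language_pack(data_dir, "zh-CN", "zh_broadcastnews_16k_ptm256_8000.tar.bz2", "zh_broadcastnews_utf8.dic"), with data_dir pointing at the pocketsphinx-data folder.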
　　

Step 4 mentions the SphinxBase tools. Download the source from https://github.com/cmusphinx/sphinxbase , open the \sphinxbase\sphinxbase.sln solution in Visual Studio 2010 (or later), and run Rebuild All. The following 6 tools then appear under \sphinxbase\bin\Release\x64:

\sphinxbase\bin\Release\x64
    \sphinx_cepview.exe
    \sphinx_fe.exe
    \sphinx_jsgf2fsg.exe
    \sphinx_lm_convert.exe
    \sphinx_pitch.exe
    \sphinx_seg.exe

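The tool we need is sphinx_lm_convert.exe, which produces language-model.lm.bin in two passes: first from the DMP file to an ARPA-format language-model.lm, then from that file to the binary language-model.lm.bin. The commands are: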

PS C:\tools_mcu\sphinxbase\bin\Release\x64> .\sphinx_lm_convert.exe -i .\zh_broadcastnews_64000_utf8.DMP -o language-model.lm -ofmt arpa
Current configuration:
[NAME]      [DEFLT]   [VALUE]
-case
-help       no        no
-i                    .\zh_broadcastnews_64000_utf8.DMP
-ifmt
-logbase    1.0001    1.000100e+00
-mmap       no        no
-o                    language-model.lm
-ofmt                 arpa
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(365): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(70): No \data\ mark in LM file
INFO: ngram_model_trie.c(445): Trying to read LM in dmp format
INFO: ngram_model_trie.c(527): ngrams 1=63944, 2=16600781, 3=20708460
INFO: lm_trie.c(474): Training quantizer
INFO: lm_trie.c(482): Building LM trie

PS C:\tools_mcu\sphinxbase\bin\Release\x64> .\sphinx_lm_convert.exe -i .\language-model.lm -o language-model.lm.bin
Current configuration:
[NAME]      [DEFLT]   [VALUE]
-case
-help       no        no
-i                    .\language-model.lm
-ifmt
-logbase    1.0001    1.000100e+00
-mmap       no        no
-o                    language-model.lm.bin
-ofmt
INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(365): Header doesn't match
INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
INFO: ngram_model_trie.c(193): LM of order 3
INFO: ngram_model_trie.c(195): #1-grams: 63944
INFO: ngram_model_trie.c(195): #2-grams: 16600781
INFO: ngram_model_trie.c(195): #3-grams: 20708460
INFO: lm_trie.c(474): Training quantizer
INFO: lm_trie.c(482): Building LM trie
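If the conversion has to be repeated, the two commands above can be driven from Python. This is only a sketch: the helper names are made up for illustration, the flags (-i, -o, -ofmt arpa) are exactly the ones used in the two commands above, and the tool path is whatever your own build produced.

```python
import os
import subprocess


def build_convert_commands(tool, dmp_file, out_dir):
    """Return the two sphinx_lm_convert calls: DMP to ARPA, then ARPA to binary."""
    arpa_file = os.path.join(out_dir, "language-model.lm")
    bin_file = os.path.join(out_dir, "language-model.lm.bin")
    return [
        [tool, "-i", dmp_file, "-o", arpa_file, "-ofmt", "arpa"],
        [tool, "-i", arpa_file, "-o", bin_file],
    ]


def convert_language_model(tool, dmp_file, out_dir):
    """Run both conversions in order and return the path of the binary model."""
    for command in build_convert_commands(tool, dmp_file, out_dir):
        # check_call raises CalledProcessError if the tool exits with an error.
        subprocess.check_call(command)
    return os.path.join(out_dir, "language-model.lm.bin")
```

Keeping the command construction separate from the execution makes it easy to print the commands first and confirm they match what you would type by hand.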

2. Speech recognition implementation in pzh-py-speech

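The speech recognition code is quite simple: it just calls the API in speech_recognition. At present only the CMU Sphinx engine is implemented, and only Chinese and English are supported. In pzh-py-speech this comes down to the callback of the "ASR" button on the GUI, audioSpeechRecognition(). Once the user has chosen the configuration (language and ASR engine) and clicks the "ASR" button, audioSpeechRecognition() runs. The code is as follows: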

import os
import speech_recognition

# win is the project's GUI module; its import is not shown in this excerpt
class mainWin(win.speech_win):

    def getLanguageSelection(self):
        languageType = self.m_choice_lang.GetString(self.m_choice_lang.GetSelection())
        if languageType == 'Mandarin Chinese':
            languageType = 'zh-CN'
            languageName = 'Chinese'
        else: # languageType == 'US English':
            languageType = 'en-US'
            languageName = 'English'
        return languageType, languageName

    def audioSpeechRecognition( self, event ):
        if os.path.isfile(self.wavPath):
            # Create the speech_recognition recognizer object asrObj
            asrObj = speech_recognition.Recognizer()
            # Read the speech content from the wav file
            with speech_recognition.AudioFile(self.wavPath) as source:
                speechAudio = asrObj.record(source)
            self.m_textCtrl_asrttsText.Clear()
            # Get the language of the speech (English/Chinese)
            languageType, languageName = self.getLanguageSelection()
            engineType = self.m_choice_asrEngine.GetString(self.m_choice_asrEngine.GetSelection())
            if engineType == 'CMU Sphinx':
                try:
                    # Call recognize_sphinx to perform the recognition
                    speechText = asrObj.recognize_sphinx(speechAudio, language=languageType)
                    # Show the recognition result in the asrttsText text box
                    self.m_textCtrl_asrttsText.write(speechText)
                    self.statusBar.SetStatusText("ASR Conversation Info: Successfully")
                    # Write the recognition result to the specified file
                    fileName = self.m_textCtrl_asrFileName.GetLineText(0)
                    if fileName == '':
                        fileName = 'asr_untitled1.txt'
                    asrFilePath = os.path.join(os.path.dirname(os.path.abspath(os.path.dirname(__file__))), 'conv', 'asr', fileName)
                    asrFileObj = open(asrFilePath, 'wb')
                    asrFileObj.write(speechText)
                    asrFileObj.close()
                except speech_recognition.UnknownValueError:
                    self.statusBar.SetStatusText("ASR Conversation Info: Sphinx could not understand audio")
                except speech_recognition.RequestError as e:
                    self.statusBar.SetStatusText("ASR Conversation Info: Sphinx error; {0}".format(e))
            else:
                self.statusBar.SetStatusText("ASR Conversation Info: Unavailable ASR Engine")
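The recognition logic can also be tried without the GUI. The following sketch pulls the language mapping and the recognize-then-save steps out of the callback into a plain function. The names LANGUAGE_CODES and recognize_to_file are made up for illustration and are not part of pzh-py-speech, and the sketch assumes Python 3, where the result is written as UTF-8 text rather than through a file opened in 'wb' mode as in the Python 2.7 code above.

```python
# Menu text to language code, mirroring getLanguageSelection() above.
LANGUAGE_CODES = {
    "Mandarin Chinese": "zh-CN",
    "US English": "en-US",
}


def recognize_to_file(recognizer, audio, language_choice, out_path):
    """Recognize `audio` with CMU Sphinx and save the text as UTF-8."""
    # Any other choice falls back to en-US, like the else branch above.
    language = LANGUAGE_CODES.get(language_choice, "en-US")
    text = recognizer.recognize_sphinx(audio, language=language)
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    return text


# Typical use, assuming speech_recognition and the zh-CN pack are installed:
#   import speech_recognition
#   asrObj = speech_recognition.Recognizer()
#   with speech_recognition.AudioFile("chinese.wav") as source:  # hypothetical file
#       speechAudio = asrObj.record(source)
#   print(recognize_to_file(asrObj, speechAudio, "Mandarin Chinese", "asr_untitled1.txt"))
```

Errors such as UnknownValueError and RequestError are deliberately left to the caller, which is where the GUI callback reports them on the status bar.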
　　

That wraps up 痞子衡's walkthrough of the speech recognition implementation in the speech processing tool pzh-py-speech. Where's the applause~~~
