JP5852550B2

JP5852550B2 - Acoustic model generation apparatus, method and program thereof

Info

Publication number: JP5852550B2
Application number: JP2012244756A
Authority: JP
Inventors: 哲小橋川; 太一浅見; 祥子山畠; 済央野本; 裕司青野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-11-06
Filing date: 2012-11-06
Publication date: 2016-02-03
Anticipated expiration: 2032-11-06
Also published as: JP2014092750A

Description

The present invention relates to an acoustic model generation apparatus, method, and program that generate an acoustic model by unsupervised adaptation.

In speech recognition, an acoustic model is generally trained (adapted) using, as training data, speech files paired with correct text representing the utterance content of each file. Here, "training (adaptation) of an acoustic model" means optimizing the parameters of the acoustic model so that as many examples in the training data as possible are recognized correctly. Adaptation of an acoustic model is broadly divided into supervised adaptation, which uses as training data correct text created, for example, by a person transcribing the reading corresponding to each speech file, and unsupervised adaptation, which uses speech recognition results or the like as training data.

When an acoustic model is adapted by unsupervised adaptation, speech recognition results with high recognition accuracy must be used as the correct text. If speech recognition results with low recognition accuracy are used as the correct text, the acoustic model may be adapted incorrectly and its accuracy may fall.

To address this problem, a method is known (for example, Patent Document 1) that assigns a reliability to each speech recognition result, selects speech recognition results according to their reliability, and adapts the acoustic model using the selected results. The idea of this method is to use utterance sequences whose reliability exceeds a reference value as training data.

Patent Document 1: JP 2007-248730 A

The conventional method, which uses utterance sequences whose reliability exceeds a reference value as training data, has the problem that learning efficiency is not high. A reliability at or above a certain value means that the acoustic model has already adapted to those acoustic features. Using such utterance sequences therefore does not lower the accuracy of the acoustic model, but learning progresses slowly and inefficiently.

The present invention has been made in view of this problem, and its object is to provide an acoustic model generation apparatus, method, and program that can perform acoustic model learning efficiently.

The acoustic model generation apparatus of the present invention includes a label generation speech recognition unit, a data selection unit, and an acoustic model learning unit. The label generation speech recognition unit performs speech recognition on an input speech signal with reference to a language model and a base acoustic model, assigns a label to the speech recognition result, and outputs its reliability and acoustic likelihood. The data selection unit receives the speech signal output by the label generation speech recognition unit together with its label and reliability, and selects speech signals whose reliability is greater than a reliability threshold and whose acoustic likelihood is smaller than a likelihood threshold. The acoustic model learning unit trains the base acoustic model on the speech signals selected by the data selection unit to generate a learned acoustic model.

According to the acoustic model generation apparatus of the present invention, speech signals whose speech recognition result has a reliability at or above a predetermined value and whose acoustic likelihood is smaller than a predetermined value can be used for training the acoustic model. In other words, using speech signals that are linguistically correct but on which the acoustic model has not yet been sufficiently trained improves the learning efficiency of the acoustic model.

FIG. 1 shows an example of the functional configuration of the acoustic model generation apparatus 100 of the present invention.
FIG. 2 shows the operation flow of the acoustic model generation apparatus 100.
FIG. 3 illustrates the relationship between acoustic likelihood and reliability.
FIG. 4 shows an example of the functional configuration of the acoustic model generation apparatus 200 of the present invention.
FIG. 5 shows the operation flow of the acoustic model generation apparatus 200.
FIG. 6 illustrates grammar-based speech recognition.
FIG. 7 shows an example of the functional configuration of the acoustic model generation apparatus 300 of the present invention.
FIG. 8 shows an example of the functional configuration of the acoustic model generation apparatus 400 of the present invention.
FIG. 9 shows an example of the functional configuration of the acoustic model generation apparatus 500 of the present invention.
FIG. 10 shows an example of the functional configuration of the acoustic model generation apparatus 600 of the present invention.
FIG. 11 shows an example of the functional configuration of the acoustic model generation apparatus 700 of the present invention.
FIG. 12 shows an example of the functional configuration of the acoustic model generation apparatus 800 of the present invention.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the acoustic model generation apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The acoustic model generation apparatus 100 includes a label generation speech recognition unit 10, a language model 20, a base acoustic model 30, a data selection unit 40, an acoustic model learning unit 50, and a control unit 60. The acoustic model generation apparatus 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The label generation speech recognition unit 10 receives a speech signal as input, assigns a label to the speech signal with reference to the language model and the base acoustic model, and outputs its reliability and acoustic likelihood (step S10). The input speech signal is converted into a discrete digital signal at a sampling frequency of, for example, 16 kHz, and is converted into acoustic features frame by frame, where one frame consists of a predetermined number (for example, 320) of samples. The A/D conversion unit and the feature calculation unit are omitted from FIG. 1. The acoustic features are calculated by, for example, mel-frequency cepstral coefficient (MFCC) analysis.
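As a rough check of the framing figures above: at 16 kHz, a 320-sample frame spans 20 ms. The sketch below splits a sample sequence into such frames. The text does not say whether frames overlap, so non-overlapping frames are an assumption here, and `split_into_frames` is a hypothetical helper name. MFCC computation itself is not shown.

```python
SAMPLE_RATE_HZ = 16_000  # sampling frequency given as an example in the text
FRAME_LENGTH = 320       # samples per frame given as an example in the text


def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Split samples into consecutive frames (assumption: no overlap).

    A trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_length
    return [
        samples[i * frame_length:(i + 1) * frame_length]
        for i in range(n_frames)
    ]


# Duration of one frame in milliseconds: 1000 * 320 / 16000 = 20.0
frame_duration_ms = 1000 * FRAME_LENGTH / SAMPLE_RATE_HZ

# One second of silence yields 50 frames of 320 samples each.
one_second = [0.0] * SAMPLE_RATE_HZ
frames = split_into_frames(one_second)
```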

The label generation speech recognition unit 10 outputs, as the speech recognition result, the candidate for which the sum of the acoustic likelihood of the per-frame acoustic features under the base acoustic model 30 and the language likelihood obtained with reference to the language model 20 is highest. At this time, for each speech file output as a speech recognition result, a label based on the readings used in acoustic model training, an acoustic likelihood, and a reliability are attached to the speech recognition result, which consists of morphemes written in mixed kanji and kana together with their readings. A speech file is a unit of speech to be recognized, for example one sentence.
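The candidate selection described above amounts to taking the maximum over candidates of the summed scores. A minimal sketch follows; the `Candidate` class, the function name, and the scores are hypothetical and only illustrate the rule, not the internals of any particular recognizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    acoustic_likelihood: float  # score under the base acoustic model
    language_likelihood: float  # score under the language model


def best_candidate(candidates):
    """Return the candidate with the highest acoustic + language likelihood."""
    return max(
        candidates,
        key=lambda c: c.acoustic_likelihood + c.language_likelihood,
    )


# Hypothetical scores, for illustration only.
candidates = [
    Candidate("hypothesis A", 7900.0, -950.0),  # sum 6950
    Candidate("hypothesis B", 8000.0, -800.0),  # sum 7200
]
result = best_candidate(candidates)
```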

FIG. 3 shows examples of the acoustic likelihood, language likelihood, and reliability of speech files. For speech file 1, the speech recognition result is 「お電話ありがとうございます。」 ("Thank you for calling."), the label is 「、おでんわ、ありがとう、ござい、ます、」, and, for example, the acoustic likelihood is 8000, the language likelihood is -800, and the reliability is 0.80. For speech file 2, the speech recognition result is 「ご用件を伺います。」 ("How may I help you?"), the label is 「、ごようけん、を、うかがい、ます、」, and, for example, the acoustic likelihood is 4500, the language likelihood is -900, and the reliability is 0.80.

In this way, a label, an acoustic likelihood, a language likelihood, and a reliability are attached to the speech signal and speech recognition result output by the label generation speech recognition unit 10. The label generation speech recognition unit 10 is realized by known speech recognition technology.

The data selection unit 40 receives the speech signal output by the label generation speech recognition unit 10 together with its label and reliability, and selects speech data, comprising the speech signal and its associated information, whose reliability is greater than the reliability threshold and whose acoustic likelihood is smaller than the likelihood threshold (step S40). That is, the data selection unit 40 selects speech data whose speech recognition result is reasonably reliable but whose acoustic likelihood is low. In the example of FIG. 3, speech file 2, 「ご用件を伺います。」, is selected. As indicated by the broken line in FIG. 1, the selected speech data may be stored temporarily in an unsupervised learning speech DB 70 as unsupervised learning speech.

For example, the data selection unit 40 selects speech data whose reliability is 0.80 or higher and whose acoustic likelihood, such as 4500, is lower than, for example, the mean acoustic likelihood. Such speech data has been recognized correctly, but the acoustic model has not yet been sufficiently trained on it.
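The selection rule can be sketched with the two speech files from the FIG. 3 example. The class and function names are hypothetical. The thresholds are also assumptions: the reliability threshold is set to 0.75 so that a reliability of 0.80 passes a strict "greater than" test, and the likelihood threshold is taken as the mean of the two acoustic likelihoods.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    file_id: int
    reliability: float
    acoustic_likelihood: float


def select_for_training(utterances, reliability_threshold, likelihood_threshold):
    """Keep utterances that are reliable but acoustically poorly modeled."""
    return [
        u for u in utterances
        if u.reliability > reliability_threshold
        and u.acoustic_likelihood < likelihood_threshold
    ]


# Values from the FIG. 3 example.
utterances = [
    Utterance(1, 0.80, 8000.0),
    Utterance(2, 0.80, 4500.0),
]

# Assumed thresholds: 0.75 for reliability, the mean (6250.0) for likelihood.
mean_likelihood = sum(u.acoustic_likelihood for u in utterances) / len(utterances)
selected = select_for_training(utterances, 0.75, mean_likelihood)
```

With these assumed thresholds only speech file 2 is kept, which matches the selection described for FIG. 3.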

The acoustic model learning unit 50 trains the base acoustic model 30 on the speech signals contained in the speech data selected by the data selection unit 40 to generate a learned acoustic model (step S50). When the amount of training data is large, training is performed iteratively. The adaptation method is not limited; for example, ML (maximum likelihood) training based on maximum likelihood estimation with the Baum-Welch algorithm, or discriminative training, may be used. The learned acoustic model output by the acoustic model learning unit 50 is stored in an external learned acoustic model 80. The control unit 60 controls the time-series operation of the units described above until the acoustic model generation apparatus 100 finishes operating (step S60).

As described above, the acoustic model generation apparatus 100 of the present invention trains the acoustic model on speech data whose speech recognition result has a reliability at or above a predetermined value and whose acoustic likelihood is smaller than a predetermined value, that is, speech data that has been recognized correctly but on which the acoustic model has not yet been sufficiently trained. This improves the learning efficiency of the acoustic model.

FIG. 4 shows an example of the functional configuration of the acoustic model generation apparatus 200 of the present invention, and FIG. 5 shows its operation flow. The acoustic model generation apparatus 200 differs from the acoustic model generation apparatus 100 described above only in that it includes a label conversion re-recognition unit 210.

The label conversion re-recognition unit 210 uses the label output by the label generation speech recognition unit 10 to perform grammar-based speech recognition with reference to the base acoustic model 30, reassigns the acoustic likelihood, and outputs the result to the data selection unit 40 (step S210). Here, grammar-based speech recognition is a method that determines readings and silence insertion positions by recognizing speech according to a sequence of morphemes, that is, a grammar.

The reliability contained in the speech recognition result is also constrained by the language model, so the result may differ from the one obtained when silence insertion positions are determined by the acoustic model alone. Readings and silence insertion positions are therefore re-determined by grammar-based speech recognition that uses only the acoustic model, without the language model.

FIG. 6 shows an example of a grammar corresponding to the single sentence of the speech file 「今日は晴れです」 ("It is sunny today"). The path from start to end whose state transitions have the maximum cumulative likelihood for this sentence is taken as the speech recognition result. At this time, the lengths of silences and the readings of words that have multiple readings are also determined.
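Choosing the path with the maximum cumulative likelihood can be illustrated with a small graph search. The sketch below is only an illustration under stated assumptions: the grammar, the node names, and the arc scores are hypothetical and are not taken from FIG. 6, and each arc carries a fixed log-likelihood, whereas a real recognizer would obtain these scores by aligning acoustic features against the acoustic model.

```python
from functools import lru_cache


def best_path(edges, start, end):
    """Return (total_score, tokens) of the highest-scoring start-to-end path.

    edges maps a node to a list of (next_node, token, log_likelihood).
    The graph is assumed to be acyclic. A token of None marks a skip arc.
    Returns None if end cannot be reached from start.
    """

    @lru_cache(maxsize=None)
    def solve(node):
        if node == end:
            return 0.0, ()
        best = None
        for next_node, token, score in edges.get(node, []):
            rest = solve(next_node)
            if rest is None:
                continue
            total = score + rest[0]
            if best is None or total > best[0]:
                emitted = () if token is None else (token,)
                best = (total, emitted + rest[1])
        return best

    return solve(start)


# Hypothetical grammar and scores for 「今日は晴れです」 (not taken from FIG. 6).
# Alternative arcs model two readings of 今日 and an optional silence.
grammar = {
    "start": [("n1", "きょう", -10.0), ("n1", "こんにち", -25.0)],
    "n1": [("n2", "は", -3.0)],
    "n2": [("n3", "<sil>", -4.0), ("n3", None, -6.0)],
    "n3": [("n4", "はれ", -8.0)],
    "n4": [("end", "です", -5.0)],
}
score, tokens = best_path(grammar, "start", "end")
```

In this toy graph the best path fixes both the reading of 今日 and whether a silence is inserted, which is the kind of decision the paragraph above describes.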

Because the data selection unit 40 of the acoustic model generation apparatus 200 selects speech data in which silence lengths and word readings have been determined according to the grammar, silence can be handled more accurately and words with multiple readings can be accommodated, which further improves the training accuracy of the acoustic model in the acoustic model learning unit 50.

FIG. 7 shows an example of the functional configuration of the acoustic model generation apparatus 300 of the present invention. The acoustic model generation apparatus 300 differs from the acoustic model generation apparatus 100 described above only in that the label generation speech recognition unit 10 is replaced by a label generation speech recognition unit 310.

The label generation speech recognition unit 310 receives a speech signal as input, performs speech recognition on it with reference to the language model 20 and the base acoustic model 30, and assigns a label to and outputs only those speech signals whose reliability is at or above a predetermined value. The idea is to discard speech data with low reliability and use speech data with high reliability as training data for the acoustic model. In this way, speech data with low reliability is excluded before the processing of the data selection unit 40.

The predetermined value of the reliability is a threshold for deciding whether to trust the speech recognition result, and is set to, for example, 0.8. Setting this value higher reduces the amount of data selected by the data selection unit 40, which shortens the processing time of the acoustic model generation apparatus 300 for the same amount of input speech. The processing time of the acoustic model generation apparatus 300 can thus be made shorter than that of the acoustic model generation apparatus 100.

FIG. 8 shows an example of the functional configuration of the acoustic model generation apparatus 400 of the present invention. The acoustic model generation apparatus 400 differs from the acoustic model generation apparatus 100 described above only in that the data selection unit 40 is replaced by a data selection unit 440.

The data selection unit 440 receives the speech signal output by the label generation speech recognition unit 10 together with its label and reliability, and selects speech data whose reliability is greater than the reliability threshold (for example, reliability > 0.8) and whose acoustic likelihood is smaller than the likelihood threshold and larger than a second likelihood threshold. For example, the likelihood threshold is set to the mean μ of the acoustic likelihood distribution and the second likelihood threshold to μ - σ, where σ is the standard deviation. Speech data whose acoustic likelihood is 1σ or more below the mean can then be excluded from training.
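The two-sided likelihood condition can be sketched as follows. The function name and the data are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say how σ is estimated.

```python
import statistics


def select_in_band(utterances, reliability_threshold):
    """Select IDs with high reliability and mu - sigma < likelihood < mu.

    utterances is a list of (file_id, reliability, acoustic_likelihood).
    """
    likelihoods = [al for _, _, al in utterances]
    mu = statistics.mean(likelihoods)
    sigma = statistics.pstdev(likelihoods)  # assumption: population std
    upper = mu            # likelihood threshold
    lower = mu - sigma    # second likelihood threshold
    return [
        file_id
        for file_id, reliability, al in utterances
        if reliability > reliability_threshold and lower < al < upper
    ]


# Hypothetical data: mean is 6000, sigma is roughly 2062.
utterances = [
    ("u1", 0.9, 3000.0),  # likelihood too low, excluded
    ("u2", 0.9, 4500.0),  # inside the band, selected
    ("u3", 0.6, 5000.0),  # inside the band but reliability too low
    ("u4", 0.9, 6500.0),  # above the mean, excluded
    ("u5", 0.9, 8000.0),
    ("u6", 0.9, 9000.0),
]
selected = select_in_band(utterances, reliability_threshold=0.8)
```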

As a result, speech data whose acoustic likelihood is too small can be excluded from the training targets of the acoustic model, and an improvement in the training accuracy of the acoustic model can be expected.

FIG. 9 shows an example of the functional configuration of the acoustic model generation apparatus 500 of the present invention. The acoustic model generation apparatus 500 differs from the acoustic model generation apparatus 100 described above in that it includes an acoustic likelihood threshold determination unit 510 and in that the data selection unit 40 is replaced by a data selection unit 540, which selects data using the threshold determined by the acoustic likelihood threshold determination unit 510.

The acoustic likelihood threshold determination unit 510 automatically generates the likelihood threshold according to the distribution of the acoustic likelihoods output by the label generation speech recognition unit 10. Unlike a probability value (a value from 0 to 1), the acoustic likelihood is a log-likelihood value, so it is difficult to fix a range for it in advance. The acoustic likelihood threshold determination unit 510 therefore obtains, for example, the mean μ and the standard deviation σ from the distribution of the acoustic likelihoods output by the label generation speech recognition unit 10, and automatically sets the threshold to a value such as μ, μ - σ, or μ + σ.
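A minimal sketch of deriving the threshold from the distribution is shown below. The function name and the `offset_in_sigmas` parameter are hypothetical conveniences for expressing μ, μ - σ, and μ + σ with one function, and the population standard deviation is an assumption.

```python
import statistics


def likelihood_threshold(likelihoods, offset_in_sigmas=0.0):
    """Return mu + offset_in_sigmas * sigma for the given likelihoods.

    offset_in_sigmas = 0 gives mu, -1 gives mu - sigma, +1 gives mu + sigma.
    """
    mu = statistics.mean(likelihoods)
    sigma = statistics.pstdev(likelihoods)  # assumption: population std
    return mu + offset_in_sigmas * sigma


# Hypothetical acoustic likelihoods: mean 6000, standard deviation 2000.
likelihoods = [4000.0, 4000.0, 8000.0, 8000.0]
threshold_mean = likelihood_threshold(likelihoods)
threshold_low = likelihood_threshold(likelihoods, -1.0)
threshold_high = likelihood_threshold(likelihoods, 1.0)
```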

Alternatively, a histogram of the acoustic likelihoods may be created and the threshold set within the range where frequencies are high. For example, the mean of the histogram frequencies is taken, and the range in which the frequency exceeds that mean is treated as the high-frequency range. The data selection unit 540 selects speech data based on the threshold determined by the acoustic likelihood threshold determination unit 510. The acoustic model generation apparatus 500 saves the trouble of setting the likelihood threshold of the data selection unit 540.
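The histogram variant might be sketched as below. Several details are assumptions not fixed by the text: bins have a fixed width, the mean frequency is taken over non-empty bins only, and the returned range simply spans from the lowest to the highest frequent bin even if they are not contiguous. The function name is hypothetical.

```python
from collections import Counter


def frequent_range(likelihoods, bin_width):
    """Return (low, high) edges spanning the bins with above-mean counts.

    Returns None when no bin exceeds the mean count (for example, when all
    non-empty bins are equally full).
    """
    counts = Counter(int(value // bin_width) for value in likelihoods)
    mean_count = sum(counts.values()) / len(counts)
    frequent = sorted(b for b, count in counts.items() if count > mean_count)
    if not frequent:
        return None
    return frequent[0] * bin_width, (frequent[-1] + 1) * bin_width


# Hypothetical likelihoods: bins 3000s:1, 4000s:3, 5000s:3, 7000s:1.
# The mean count over non-empty bins is 2, so the 4000s and 5000s qualify.
likelihoods = [3200, 4100, 4300, 4700, 5200, 5600, 5900, 7400]
low, high = frequent_range(likelihoods, bin_width=1000)
```

A threshold could then be chosen anywhere inside `(low, high)`; how exactly to pick it within that range is left open by the text.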

FIG. 10 shows an example of the functional configuration of the acoustic model generation apparatus 600 of the present invention. The acoustic model generation apparatus 600 differs from the acoustic model generation apparatus 100 described above in that it includes an unsupervised learning speech database 670 (databases are hereinafter written as DB), an acoustic model adaptation unit 650, and a threshold evaluation unit 660, and in that the data selection unit 40 is replaced by a multiple data selection unit 640.

The multiple data selection unit 640 selects speech data at a plurality of reliability levels. For example, it selects speech data with a reliability of 0.9 or higher, of 0.8 or higher, and of 0.75 or higher. The speech data selected at each reliability value is stored in the unsupervised learning speech DB 670.

The acoustic model adaptation unit 650 creates adapted acoustic models that require only a single training pass, for example by MAP (maximum a posteriori) adaptation. The acoustic model adaptation unit 650 MAP-adapts the base acoustic model 30 separately for each reliability value, using the speech data stored in the unsupervised learning speech DB 670.

The threshold evaluation unit 660 evaluates the acoustic model MAP-adapted for each reliability value using a development data set, that is, speech data with transcribed text. The threshold evaluation unit 660 selects speech data using the reliability value that yields the highest speech recognition accuracy or acoustic likelihood on the development data set, and the acoustic model learning unit 50 iteratively performs ML (maximum likelihood) training or discriminative training using the speech signals contained in that speech data as unsupervised training data.

According to the acoustic model generation apparatus 600, a plurality of reliability thresholds are evaluated with acoustic models obtained at low computational cost, for example by MAP adaptation, to find the optimal reliability threshold, and iterative training of the acoustic model is then performed using that optimal threshold. The time required to generate the acoustic model is therefore shorter than when iterative training is performed for every one of the reliability thresholds.
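The threshold search can be sketched as a loop over candidate thresholds. Everything below is illustrative: `adapt` and `evaluate` are caller-supplied placeholders standing in for cheap adaptation (such as MAP adaptation) and development-set scoring, and the toy versions shown are NOT real adaptation or recognition-accuracy computations. Filtering by reliability alone is also a simplification.

```python
def choose_reliability_threshold(utterances, thresholds, adapt, evaluate):
    """Return the threshold whose quickly adapted model scores best.

    adapt(subset) -> model and evaluate(model) -> score are supplied by the
    caller; a higher score is better.
    """
    best_threshold, best_score = None, float("-inf")
    for threshold in thresholds:
        subset = [u for u in utterances if u["reliability"] >= threshold]
        score = evaluate(adapt(subset))
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


# Toy stand-ins (NOT real MAP adaptation or development-set evaluation).
def toy_adapt(subset):
    n_correct = sum(1 for u in subset if u["label_correct"])
    return {"n_correct": n_correct, "n_wrong": len(subset) - n_correct}


def toy_evaluate(model):
    # Reward correctly labeled data and penalize wrongly labeled data.
    return model["n_correct"] - 3 * model["n_wrong"]


# Hypothetical utterances; thresholds are the examples given in the text.
utterances = [
    {"reliability": 0.95, "label_correct": True},
    {"reliability": 0.92, "label_correct": True},
    {"reliability": 0.85, "label_correct": True},
    {"reliability": 0.82, "label_correct": True},
    {"reliability": 0.78, "label_correct": False},
    {"reliability": 0.76, "label_correct": False},
]
best = choose_reliability_threshold(
    utterances, [0.9, 0.8, 0.75], toy_adapt, toy_evaluate
)
```

Only the winning threshold is then used for the expensive iterative training, which is where the time saving described above comes from.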

FIG. 11 shows an example of the functional configuration of the acoustic model generation apparatus 700 of the present invention. The acoustic model generation apparatus 700 differs from the acoustic model generation apparatus 100 described above in that it includes an existing speech DB 710 and in that the acoustic model learning unit 50 is replaced by an acoustic model learning unit 750.

The existing speech DB 710 is a database that stores the speech data used to create the base acoustic model 30. The acoustic model learning unit 750 adaptively trains the base acoustic model 30 with reference to the speech data selected by the data selection unit 40 and the speech data in the existing speech DB 710.

According to the acoustic model generation apparatus 700, the acoustic model is trained on a combination of the existing speech DB 710 and the generated unsupervised learning speech, so an improvement in the accuracy of the acoustic model can be expected. The unsupervised learning speech may contain errors, whereas the speech in the existing speech DB 710 does not; using this error-free speech data corrects the acoustic model trained on the unsupervised learning speech. In short, using error-free speech data together with the unsupervised learning speech yields a more accurate acoustic model than training on the unsupervised learning speech alone.

FIG. 12 shows an example of the functional configuration of the acoustic model generation apparatus 800 of the present invention. The acoustic model generation apparatus 800 differs from the acoustic model generation apparatus 700 in that it includes a pseudo non-recognition target signal DB 820.

The pseudo non-recognition target signal DB 820 stores pseudo non-recognition target signals. A pseudo non-recognition target signal is an interference signal multiplied by a gain of 1 or less to lower its volume level. An interference signal is, for example, a speech signal in which people's voices are superimposed on the background noise of a crowd on a station platform, that is, a signal picked up with non-stationary human voices overlapping stationary background noise. The crowd background noise need not be present; the signal may be a clean human voice. The essential point is that the interference signal is a non-stationary speech signal.

Multiplying the interference signal by a gain of 1 or less to lower its volume level distinguishes the pseudo non-recognition target signal from the speech to be recognized, reduces the possibility that speech to be recognized is learned as non-speech, and increases the noise robustness of the non-speech model. The pseudo non-recognition target signal can be generated by a gain adjustment unit 810 that receives the interference signal as input.
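The gain adjustment itself is a per-sample multiplication. A minimal sketch follows; the function name is hypothetical, and rejecting gains that are zero or negative is an added assumption (the text only requires a gain of 1 or less).

```python
def make_pseudo_non_target(interference, gain):
    """Scale an interference signal by a gain in (0, 1] to lower its level."""
    if not 0.0 < gain <= 1.0:
        raise ValueError("gain must be greater than 0 and at most 1")
    return [gain * sample for sample in interference]


# Hypothetical interference samples, attenuated by a gain of 0.5.
interference = [1.0, -0.5, 0.25]
pseudo_signal = make_pseudo_non_target(interference, gain=0.5)
```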

The acoustic model learning unit 850 adaptively trains the base acoustic model 30 with reference to the speech data selected by the data selection unit 40 and the speech data in the existing speech DB 710, and also trains the non-speech model of the base acoustic model 30 using the pseudo non-recognition target signals as non-speech signals. By being adapted to the pseudo non-recognition target signals, this non-speech model becomes a model that reduces spurious recognition results caused by non-stationary interference signals, that is, by background noise.

In this way, in addition to the effects of the acoustic model generation apparatus 700, the acoustic model generation apparatus 800 can improve the noise robustness of the non-speech model.

As described above, according to the acoustic model generation apparatus of the present invention, speech data whose speech recognition result has a reliability at or above a predetermined value and whose acoustic likelihood is smaller than a predetermined value can be used for training the acoustic model. As a result, the learning efficiency of the acoustic model can be improved.

The acoustic model generation apparatus 100 described above may be combined with the idea of using grammar-based speech recognition from the acoustic model generation apparatus 200. The acoustic model generation apparatuses 100 and 200 may be combined with the idea, from the acoustic model generation apparatus 300, of discarding speech data with low reliability. The acoustic model generation apparatuses 100, 200, and 300 may be combined with the idea of the acoustic model generation apparatus 400, which performs unsupervised adaptation after removing speech data whose acoustic likelihood is too low. The acoustic model generation apparatuses 100, 200, 300, and 400 may be combined with the idea of the acoustic model generation apparatus 500, which automatically sets the acoustic likelihood threshold. The acoustic model generation apparatuses 100, 200, 300, 400, and 500 may be combined with the idea of the acoustic model generation apparatus 500 of evaluating a plurality of reliability thresholds with acoustic models obtained at low computational cost, for example by MAP adaptation, determining the optimal reliability threshold, and using that threshold. Furthermore, the idea of the acoustic model generation apparatus 700, which additionally uses the existing speech DB 710, may be combined. The idea of the acoustic model generation apparatus 800, which improves the noise robustness of the non-speech model, may also be combined. In this way, the embodiments can be combined with one another, and the effects of each can be obtained.
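Two of the ideas combined above, automatically setting the acoustic likelihood threshold and removing speech data whose acoustic likelihood is too low, can be sketched together as follows. The rule used here (thresholds placed relative to the mean in units of the standard deviation), the function names, and the coefficients `upper_k` and `lower_k` are assumptions for illustration only; the text states only that statistics of the likelihood distribution are used, not how.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def likelihood_thresholds(likelihoods: List[float],
                          upper_k: float = 0.0,
                          lower_k: float = 2.0) -> Tuple[float, float]:
    """Return (likelihood_threshold, second_likelihood_threshold).

    Assumed rule, for illustration only: both thresholds are placed
    relative to the mean of the observed acoustic likelihoods, in units
    of their (population) standard deviation.
    """
    mu = mean(likelihoods)
    sigma = pstdev(likelihoods)
    return mu + upper_k * sigma, mu - lower_k * sigma


def in_selection_band(likelihood: float, upper: float, lower: float) -> bool:
    """True when the likelihood is below the upper threshold and above the lower one."""
    return lower < likelihood < upper


# Illustrative acoustic likelihoods output by the recognizer.
observed = [-50.0, -60.0, -70.0, -80.0, -90.0]
upper, lower = likelihood_thresholds(observed)
```

With these values, an utterance with likelihood -80.0 falls inside the band, while -60.0 (above the upper threshold) and -120.0 (below the lower threshold) are excluded.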

When the processing means of the above apparatuses are implemented by a computer, the processing performed by the functions that each apparatus should have is described by a program. Executing this program on the computer implements the processing means of each apparatus on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Claims

An acoustic model generation apparatus comprising:
a label-generation speech recognition unit that takes a speech signal as input, recognizes it with reference to a language model and a base acoustic model, assigns the speech recognition result as a label, and outputs its reliability and acoustic likelihood;
a data selection unit that takes as input the speech signal output by the label-generation speech recognition unit together with its label, reliability, and acoustic likelihood, and selects speech signals whose reliability is greater than a reliability threshold and whose acoustic likelihood is smaller than a likelihood threshold;
an acoustic model learning unit that generates a trained acoustic model by adapting the base acoustic model to the speech signals selected by the data selection unit; and
an acoustic likelihood threshold determination unit that obtains any of a mean, a standard deviation, and a histogram from the distribution of the acoustic likelihoods output by the label-generation speech recognition unit, and automatically generates the likelihood threshold using at least one of the mean, the standard deviation, and the histogram.

The acoustic model generation apparatus according to claim 1,
further comprising
a label conversion re-recognition unit that, using the label output by the label-generation speech recognition unit, performs grammar-based speech recognition with reference to the acoustic model, re-assigns the acoustic likelihood, and outputs the result to the data selection unit.

The acoustic model generation apparatus according to claim 1 or 2,
wherein the label-generation speech recognition unit
takes the speech signal as input, recognizes it with reference to the language model and the base acoustic model, and assigns a label to and outputs only speech signals whose reliability is at or above a predetermined value.

The acoustic model generation apparatus according to any one of claims 1 to 3,
wherein the data selection unit
takes as input the speech signal output by the label-generation speech recognition unit together with its label and reliability, and selects speech signals whose reliability is greater than the reliability threshold and whose acoustic likelihood is smaller than the likelihood threshold and greater than a second likelihood threshold.

An acoustic model generation method comprising:
a label-generation speech recognition process that recognizes a speech signal with reference to a language model and a base acoustic model, assigns the speech recognition result as a label, and outputs its reliability and acoustic likelihood;
a data selection process that takes as input the speech signal output by the label-generation speech recognition process together with its label and reliability, and selects speech signals whose reliability is greater than a reliability threshold and whose acoustic likelihood is smaller than a likelihood threshold;
an acoustic model learning process that generates a trained acoustic model by adapting the base acoustic model to the speech signals selected by the data selection process; and
an acoustic likelihood threshold determination process that obtains any of a mean, a standard deviation, and a histogram from the distribution of the acoustic likelihoods output by the label-generation speech recognition process, and automatically generates the likelihood threshold using at least one of the mean, the standard deviation, and the histogram.

A program for causing a computer to function as the acoustic model generation apparatus according to any one of claims 1 to 4.