
THE IBM ATTILA SPEECH RECOGNITION TOOLKIT

Hagen Soltau, George Saon, and Brian Kingsbury
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA


ABSTRACT

We describe the design of IBM's Attila speech recognition toolkit. We show how the combination of a highly modular and efficient library of low-level C++ classes with simple interfaces, an interconnection layer implemented in a modern scripting language (Python), and a standardized collection of scripts for system building produces a flexible and scalable toolkit that is useful both for basic research and for the construction of large transcription systems for competitive evaluations.

Index Terms: speech recognition

1. INTRODUCTION

Our goals for the Attila toolkit were driven by our previous experience using other toolkits for both basic research and construction of large evaluation systems. A key to successful evaluation systems, for example in the DARPA EARS and GALE programs, is completing a large number of experiments in a short amount of time: efficient implementation and scalability to large compute clusters are crucial.

A key to success in basic research is rapidly prototyping new ideas without needing to write a lot of low-level code: a researcher should be able to focus on the algorithm without needing to satisfy complex interfaces. In summary, the design of the toolkit is based on the following wish list:

- Flexibility: a rich interface that supports fast prototyping
- Low overhead for fast experiment turnaround
- Focus on the algorithm, not on the interfaces
- A small, low-complexity code base

The last goal is motivated by our observation that some automatic speech recognition toolkits, including a previous internal toolkit, comprise hundreds of thousands of lines of code, leading new users to duplicate preexisting functions simply because they could not comprehend the existing code.

2. DESIGN

To accomplish these goals, the toolkit makes a clear distinction between core algorithms and glue code by combining the advantages of a high-level scripting language with the efficiency of C++ [1]. Traditionally, C++ modules are assembled into executables that are controlled via a command-line interface, configuration files, or both, and the executables are managed from the scripting language. An example of this approach is HTK. This approach is cumbersome because it entails the parsing of many parameters and provides only very coarse-grained access to the underlying C++ classes. We opt for a different approach, in which the target scripting language, Python, is extended with customized C++ classes. A key enabler is SWIG [2], a tool that automatically generates interface code from the header files of a C++ library. We have designed the C++ classes in Attila such that most class members are public; thus, nearly all C++ data structures are exposed within Python.

Attila consists of three parts:

- The C++ Library, which contains all the low-level classes needed to build a modern, large-vocabulary acoustic model, as well as two different decoders.
- The Python Library, which contains modules that connect objects from the C++ library together into higher-level building blocks used to train and test speech recognition systems.
- The Attila Training Recipe (ATR): a collection of standard scripts to build state-of-the-art large-vocabulary speech recognition systems. These scripts allow even inexperienced users to build new systems from scratch, starting with a flat-start procedure and culminating in discriminatively trained models.

Figure 1 illustrates the structure of the toolkit. The C++ classes are represented by proxy Python classes that are generated automatically using SWIG. The modules in the Python library provide the glue to create higher-level functions used in scripts for training and decoding. The main benefit of this design is that it offers maximal flexibility without requiring the writing of interface code and without sacrificing efficiency.

[Fig. 1. Structure of the Attila toolkit: low-level C++ classes (front end, trainer, transcripts, database, phones, tags, lexicon, tree, matrix, PLP, MFCC, ML and MMI estimators, GMM accumulator, GMM set, HMM, alignment, dynamic decoder, FSM decoder) exposed to the Python library through SWIG-generated interface code.]
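
To make this concrete, here is a hypothetical sketch, not code from the toolkit: the class, its members, and the attila module name below are invented for illustration. It shows the style this design encourages, a small C++ class with public members and no command-line plumbing, from which SWIG can generate a Python proxy:

    // Hypothetical example of an Attila-style C++ class: small interface,
    // public members, no argument parsing.
    #include <vector>

    class MeanNormalizer {
    public:
        std::vector<std::vector<float> > feat;  // output matrix, exposed directly

        // Subtract the per-dimension mean from every frame of 'in'.
        void apply(const std::vector<std::vector<float> >& in) {
            feat = in;
            if (in.empty()) return;
            size_t dim = in[0].size();
            std::vector<float> mean(dim, 0.0f);
            for (size_t t = 0; t < in.size(); ++t)
                for (size_t d = 0; d < dim; ++d)
                    mean[d] += in[t][d];
            for (size_t d = 0; d < dim; ++d)
                mean[d] /= in.size();
            for (size_t t = 0; t < feat.size(); ++t)
                for (size_t d = 0; d < dim; ++d)
                    feat[t][d] -= mean[d];
        }
    };

    // After SWIG processes the header, a Python script could drive the class
    // directly (hypothetical usage):
    //   m = attila.MeanNormalizer()
    //   m.apply(frames)
    //   print(m.feat)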

In the next paragraphs we highlight some of the design choices we made that we found particularly useful.

2.1. Separation of Models, Accumulators, and Estimators

As shown in Figure 1, we have separate objects for models (e.g., Gaussian mixture models), accumulators (to hold sufficient statistics), and estimators (e.g., maximum likelihood and maximum mutual information). This allows us to reuse components and combine them in new ways. For example, the accumulator routines can be used for both maximum likelihood (ML) and maximum mutual information (MMI) training. While the ML estimator uses only one accumulator, the MMI estimator will update the models using two accumulators, one for the numerator statistics and one for the denominator.
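
A minimal sketch of this separation, with invented names and single Gaussians over scalar observations for brevity: both estimators consume the same accumulator type, and only the update rule differs.

    #include <vector>

    // Sufficient statistics for a set of single-Gaussian models.
    struct Accumulator {
        std::vector<double> count, sum, sumsq;  // zeroth/first/second order stats
        Accumulator(int n) : count(n), sum(n), sumsq(n) {}
        void add(int modelX, double posterior, double obs) {
            count[modelX] += posterior;
            sum[modelX]   += posterior * obs;
            sumsq[modelX] += posterior * obs * obs;
        }
    };

    struct Model { std::vector<double> mean, var; };

    // ML estimator: one accumulator.
    void estimateML(Model& m, const Accumulator& acc) {
        for (size_t i = 0; i < m.mean.size(); ++i) {
            if (acc.count[i] <= 0) continue;
            m.mean[i] = acc.sum[i] / acc.count[i];
            m.var[i]  = acc.sumsq[i] / acc.count[i] - m.mean[i] * m.mean[i];
        }
    }

    // MMI estimator: numerator and denominator accumulators, combined in an
    // extended Baum-Welch style update with smoothing constant D.
    void estimateMMI(Model& m, const Accumulator& num,
                     const Accumulator& den, double D) {
        for (size_t i = 0; i < m.mean.size(); ++i) {
            double c = num.count[i] - den.count[i] + D;
            if (c <= 0) continue;
            double newMean = (num.sum[i] - den.sum[i] + D * m.mean[i]) / c;
            double newVar  = (num.sumsq[i] - den.sumsq[i]
                              + D * (m.var[i] + m.mean[i] * m.mean[i])) / c
                             - newMean * newMean;
            m.mean[i] = newMean;
            if (newVar > 0) m.var[i] = newVar;  // keep variance valid
        }
    }

Because the statistics live in a shared container, adding a new training criterion only requires writing a new estimator.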

2.2. Abstraction of Alignments

Our alignment object is simply a container holding a set of hidden Markov model (HMM) state posterior probabilities for each frame. An alignment can be populated using several different methods: Viterbi, Baum-Welch, modified forward-backward routines over lattices for minimum phone error (MPE) or boosted MMI training, uniform segmentation for flat-start training, or conversion of manual labels as in the TIMIT corpus. Accumulator objects accumulate sufficient statistics only through an alignment object. This makes it easy to add new models to the toolkit, because only a method to accumulate sufficient statistics given an alignment and a method to update the model parameters from the statistics are required. Likewise, it is easy to add new alignment methods, because the designer only needs to worry about properly populating the alignment object.
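
The idea can be sketched as follows; the types and names are hypothetical. Any routine that fills the container works with any accumulator, because accumulation sees only the alignment itself:

    #include <vector>

    // One (HMM state, posterior) entry for a frame.
    struct StatePosterior { int stateX; float gamma; };

    // Alignment: for each frame, the active states and their posteriors.
    // A Viterbi alignment has one entry per frame with gamma == 1.0; a
    // Baum-Welch or lattice-based alignment may have several.
    typedef std::vector<std::vector<StatePosterior> > Alignment;

    // Uniform segmentation for flat-start training: spread the frames
    // evenly over the given state sequence.
    Alignment flatStart(const std::vector<int>& states, int frameN) {
        Alignment ali(frameN);
        for (int t = 0; t < frameN; ++t) {
            int stateX = states[(size_t)t * states.size() / frameN];
            StatePosterior sp = { stateX, 1.0f };
            ali[t].push_back(sp);
        }
        return ali;
    }

    struct Accumulator {                // minimal stand-in for a real accumulator
        std::vector<float> gammaSum;    // total posterior mass per state
        void add(int stateX, float gamma, float obs) {
            if ((int)gammaSum.size() <= stateX) gammaSum.resize(stateX + 1, 0.0f);
            gammaSum[stateX] += gamma;  // a real accumulator would also use obs
        }
    };

    // Accumulation is written against the alignment alone, never the aligner.
    void accumulate(Accumulator& acc, const Alignment& ali,
                    const std::vector<float>& obs) {
        for (size_t t = 0; t < ali.size(); ++t)
            for (size_t k = 0; k < ali[t].size(); ++k)
                acc.add(ali[t][k].stateX, ali[t][k].gamma, obs[t]);
    }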

2.3. Acoustic Scorer Interface

Because we are interested in working with a variety of acoustic models, including Gaussian mixture models, neural networks, and exponential models, we use an abstract interface for acoustic scoring that permits us to use any acoustic model with any decoder or alignment method. The interface consists of only a few functions:

    class Scorer:
        virtual int   get_model_count() = 0;
        virtual int   get_frame_count() = 0;
        virtual int   set_frame(int frameX) = 0;
        virtual float get_score(int modelX) = 0;

get_model_count returns the number of models in a model container (e.g., the number of HMM states). get_frame_count returns the number of available frames in the current utterance. set_frame selects a frame, and get_score returns the score (scaled negative log-likelihood) for the modelX-th model for the current frame.
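
To illustrate how a new acoustic model plugs in, here is a hypothetical toy scorer with one diagonal Gaussian per HMM state; only the four interface methods above come from the paper, everything else is invented for the sketch:

    #include <cmath>
    #include <vector>

    class Scorer {
    public:
        virtual ~Scorer() {}
        virtual int   get_model_count() = 0;
        virtual int   get_frame_count() = 0;
        virtual int   set_frame(int frameX) = 0;
        virtual float get_score(int modelX) = 0;
    };

    // Toy acoustic model: one diagonal Gaussian per HMM state.
    class GaussScorer : public Scorer {
    public:
        std::vector<std::vector<float> > feat;       // frames x dims
        std::vector<std::vector<float> > mean, var;  // models x dims
        int curFrame;

        GaussScorer() : curFrame(0) {}
        int get_model_count() { return (int)mean.size(); }
        int get_frame_count() { return (int)feat.size(); }
        int set_frame(int frameX) { curFrame = frameX; return frameX; }

        // Scaled negative log-likelihood of the current frame under modelX.
        float get_score(int modelX) {
            const std::vector<float>& x = feat[curFrame];
            float s = 0.0f;
            for (size_t d = 0; d < x.size(); ++d) {
                float diff = x[d] - mean[modelX][d];
                s += diff * diff / var[modelX][d] + std::log(var[modelX][d]);
            }
            return 0.5f * s;  // constants dropped; decoders only compare scores
        }
    };

Any decoder or alignment routine written against Scorer can use this model without modification.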

2.4. Language Model Interface

Because the toolkit provides two different decoders, a static FSM decoder and a dynamic network decoder, and several different types of language model, it was necessary to use an abstract language model interface to maximize flexibility:

    class LM:
        virtual STATE start (WORD wordX);
        virtual STATE extend(STATE state, WORD wordX) = 0;
        virtual SCORE score (STATE state, WORD wordX) = 0;
        virtual void  score (STATE state, SCORE *scoreptr) = 0;

The language model state is an abstraction of the n-gram history that provides a unified view of the different language model types. The decoder accesses the language model only through LM::STATE instances. At the start of the utterance, the decoder generates an initial language model state by calling start(wordX).
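
As an illustration (hypothetical code; WORD, STATE, and SCORE are stand-ins for the toolkit's actual types), a bigram model fits this interface naturally, with the state recording just the previous word:

    #include <vector>

    typedef int   WORD;
    typedef int   STATE;   // opaque history handle; here, simply the last word
    typedef float SCORE;

    class LM {
    public:
        virtual ~LM() {}
        virtual STATE start (WORD wordX) = 0;
        virtual STATE extend(STATE state, WORD wordX) = 0;
        virtual SCORE score (STATE state, WORD wordX) = 0;
        virtual void  score (STATE state, SCORE* scoreptr) = 0;
    };

    // Toy bigram model: the n-gram history collapses to the previous word.
    class BigramLM : public LM {
    public:
        // logProb[prev][next], dense for simplicity; real models back off.
        std::vector<std::vector<SCORE> > logProb;

        STATE start (WORD wordX)              { return wordX; }
        STATE extend(STATE state, WORD wordX) { (void)state; return wordX; }
        SCORE score (STATE state, WORD wordX) { return logProb[state][wordX]; }

        // Bulk variant used for lookahead: fill the scores of every word in
        // the vocabulary that could follow 'state'.
        void score(STATE state, SCORE* scoreptr) {
            for (size_t w = 0; w < logProb[state].size(); ++w)
                scoreptr[w] = logProb[state][w];
        }
    };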

The state is updated by calling extend(state, wordX) when transitioning to a new word. Decoding with n-gram models or finite state grammars can be easily expressed with this interface, as can lattice rescoring. The second variant of the score method retrieves the language model scores for all words in the vocabulary for a given state. This function is needed for fast language model access when computing lookahead scores for the dynamic network decoder.

2.5. Front End Interface

We implement dependency resolution for speakers and utterances in the Python layer, within a base class for all front end classes, and allow front end module instances to depend on the outputs of multiple other instances. We also use a very simple interface for front end modules: all modules produce matrices of floating-point numbers, and, with the exception of the module responsible for audio input, all modules accept matrices of floats as input. These two features have important consequences. First, authors of the front end C++ classes can focus solely on the algorithmic details, and do not need to worry about how their code will interact with other classes: as long as they can accept and produce matrices of floats, the Python library will handle the rest. Second, we can realize front end signal processing algorithms as directed acyclic graphs of interacting instances, allowing for the production, for example, of perceptual linear prediction (PLP) and pitch features in a Mandarin speech recognition system.
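
The contract can be sketched in C++ as follows (hypothetical classes; in the toolkit the dependency resolution lives in the Python base class). A module such as frame splicing only reads and writes float matrices; a fusion module would simply read more than one input, which is what makes DAG-shaped front ends possible:

    #include <vector>

    typedef std::vector<std::vector<float> > Matrix;  // frames x dims

    // Contract for every front end module: consume float matrices from the
    // modules it depends on, produce one float matrix.
    class FrontEndModule {
    public:
        Matrix feat;                                  // this module's output
        virtual ~FrontEndModule() {}
        virtual void compute(const std::vector<const Matrix*>& inputs) = 0;
    };

    // Example node: splice +/-context neighboring frames into supervectors.
    class Splice : public FrontEndModule {
    public:
        int context;                                  // frames on each side
        Splice(int n) : context(n) {}
        void compute(const std::vector<const Matrix*>& inputs) {
            const Matrix& in = *inputs[0];
            int T = (int)in.size();
            feat.assign(T, std::vector<float>());
            for (int t = 0; t < T; ++t)
                for (int c = -context; c <= context; ++c) {
                    int s = t + c;
                    if (s < 0) s = 0;
                    if (s >= T) s = T - 1;            // repeat edge frames
                    feat[t].insert(feat[t].end(), in[s].begin(), in[s].end());
                }
        }
    };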

2.6. Front End Modules

Available front end modules include audio input from a file, audio input from a socket, audio input from a sound card, downsampling, power spectrum computation, Mel binning with support for vocal tract length normalization (VTLN), Gaussianization, PLP coefficients, Mel-frequency cepstral (MFCC) coefficients, mean and variance normalization, splicing of successive frames into supervectors, application of a linear transform or projection such as linear discriminant analysis (LDA), feature-space maximum likelihood linear regression (FMLLR), fusion of parallel feature streams, pitch estimation, and feature-space discriminative transforms.

2.7. Hidden Markov Models and Context Modifiers

We use a three-layer structure for hidden Markov models: (1) word graph, (2) phone graph, and (3) state graph. The word graph is usually constructed using a sausage structure to represent pronunciation variants and optional words. For training, a linear word sequence is constructed from the reference, and then it is extended to a sausage by adding alternative pronunciations and marking some tokens (e.g., silence) as optional. The phone graph is generated by applying the pronunciation lexicon to the word graph. In a similar fashion, the state graph is generated from the phone graph by applying a user-defined HMM topology. Users can manipulate the HMM directly at the scripting level, adding nodes and transitions, setting transition costs, and so forth. The HMM object also handles the phonetic context needed to build decision trees by making the phone graph context dependent.
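
A small sketch of the three-layer expansion (hypothetical code; it uses linear sequences instead of sausages and a fixed 3-state left-to-right topology for brevity):

    #include <map>
    #include <string>
    #include <vector>

    typedef std::map<std::string, std::vector<std::string> > Lexicon;

    // Layer 1 -> 2: apply the pronunciation lexicon to the word sequence.
    std::vector<std::string> wordsToPhones(const std::vector<std::string>& words,
                                           const Lexicon& lex) {
        std::vector<std::string> phones;
        for (size_t i = 0; i < words.size(); ++i) {
            const std::vector<std::string>& pron = lex.at(words[i]);
            phones.insert(phones.end(), pron.begin(), pron.end());
        }
        return phones;
    }

    // Layer 2 -> 3: apply an HMM topology; here every phone expands to a
    // fixed 3-state left-to-right chain.
    std::vector<std::string> phonesToStates(const std::vector<std::string>& phones) {
        std::vector<std::string> states;
        for (size_t i = 0; i < phones.size(); ++i)
            for (int s = 0; s < 3; ++s)
                states.push_back(phones[i] + "-" + char('0' + s));
        return states;
    }

    // Example: with lex["two"] = {"T", "UW"}, wordsToPhones({"two"}, lex)
    // yields {"T", "UW"}, and phonesToStates gives
    // {"T-0", "T-1", "T-2", "UW-0", "UW-1", "UW-2"}.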

