Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

Alex Graves (1)  alex@idsia.ch
Santiago Fernández (1)  santiago@idsia.ch
Faustino Gomez (1)  tino@idsia.ch
Jürgen Schmidhuber (1,2)  juergen@idsia.ch

(1) Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland
(2) Technische Universität München (TUM), Boltzmannstr. 3, 85748 Garching, Munich, Germany

Abstract

Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units.

Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN system.

1. Introduction

Labelling unsegmented sequence data is a ubiquitous problem in real-world sequence learning.

It is particularly common in perceptual tasks (e.g. handwriting recognition, speech recognition, gesture recognition) where noisy, real-valued input streams are annotated with strings of discrete labels, such as letters or words. Currently, graphical models such as hidden Markov Models (HMMs; Rabiner, 1989), conditional random fields (CRFs; Lafferty et al., 2001) and their variants are the predominant framework for sequence labelling.

[Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).]

While these approaches have proved successful for many problems, they have several drawbacks: (1) they usually require a significant amount of task-specific knowledge, e.g. to design the state models for HMMs, or choose the input features for CRFs; (2) they require explicit (and often questionable) dependency assumptions to make inference tractable, e.g. the assumption that observations are independent for HMMs; (3) for standard HMMs, training is generative, even though sequence labelling is discriminative. Recurrent neural networks (RNNs), on the other hand, require no prior knowledge of the data, beyond the choice of input and output representation.

They can be trained discriminatively, and their internal state provides a powerful, general mechanism for modelling time series. In addition, they tend to be robust to temporal and spatial noise. So far, however, it has not been possible to apply RNNs directly to sequence labelling. The problem is that the standard neural network objective functions are defined separately for each point in the training sequence; in other words, RNNs can only be trained to make a series of independent label classifications. This means that the training data must be pre-segmented, and that the network outputs must be post-processed to give the final label sequence. At present, the most effective use of RNNs for sequence labelling is to combine them with HMMs in the so-called hybrid approach (Bourlard & Morgan, 1994; Bengio, 1999).
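To make the framewise limitation concrete, here is a minimal Python sketch (not from the paper; function and variable names are illustrative). Because the loss decomposes into independent per-frame terms, a target class is needed for every single frame, which is exactly what pre-segmentation supplies.

    import numpy as np

    def framewise_loss(class_probs, frame_labels):
        """Sum of independent per-frame cross-entropy terms.
        class_probs:  (T, K) array of per-frame softmax outputs.
        frame_labels: length-T array giving one target class per frame."""
        T = class_probs.shape[0]
        # A label is required for *every* frame: the data must be pre-segmented.
        assert len(frame_labels) == T
        return -np.sum(np.log(class_probs[np.arange(T), frame_labels]))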

Hybrid systems use HMMs to model the long-range sequential structure of the data, and neural nets to provide localised classifications. The HMM component is able to automatically segment the sequence during training, and to transform the network classifications into label sequences. However, as well as inheriting the aforementioned drawbacks of HMMs, hybrid systems do not exploit the full potential of RNNs for sequence modelling. This paper presents a novel method for labelling sequence data with RNNs that removes the need for pre-segmented training data and post-processed outputs, and models all aspects of the sequence within a single network architecture.

The basic idea is to interpret the network outputs as a probability distribution over all possible label sequences, conditioned on a given input sequence. Given this distribution, an objective function can be derived that directly maximises the probabilities of the correct labellings. Since the objective function is differentiable, the network can then be trained with standard backpropagation through time (Werbos, 1990). In what follows, we refer to the task of labelling unsegmented data sequences as temporal classification (Kadous, 2002), and to our use of RNNs for this purpose as connectionist temporal classification (CTC).
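As a rough illustration of this idea (not the paper's own derivation, which follows in Sections 3 and 4), modern frameworks expose the same objective directly; the sketch below uses PyTorch's torch.nn.CTCLoss, with all shapes, sizes and names chosen purely for the example.

    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 28                 # frames, batch size, labels + blank
    rnn_logits = torch.randn(T, N, C, requires_grad=True)    # stand-in for RNN outputs
    log_probs = rnn_logits.log_softmax(dim=-1)                # per-frame label distributions

    targets = torch.randint(1, C, (N, 20), dtype=torch.long)  # correct labellings
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 20, dtype=torch.long)

    # Negative log-probability of the correct labellings; differentiable,
    # so gradients flow back through time to the network parameters.
    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()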

By contrast, we refer to the independent labelling of each time-step, or frame, of the input sequence as framewise classification. The next section provides the mathematical formalism for temporal classification, and defines the error measure used in this paper. Section 3 describes the output representation that allows RNNs to be used as temporal classifiers. Section 4 explains how CTC networks can be trained. Section 5 compares CTC to hybrid and HMM systems on the TIMIT speech corpus. Section 6 discusses some key differences between CTC and other temporal classifiers, giving directions for future work, and the paper concludes with section 7.

2. Temporal Classification

Let S be a set of training examples drawn from a fixed distribution D_{X×Z}.

The input space X = (R^m)* is the set of all sequences of m-dimensional real-valued vectors. The target space Z = L* is the set of all sequences over the (finite) alphabet L of labels. In general, we refer to elements of L* as label sequences or labellings. Each example in S consists of a pair of sequences (x, z). The target sequence z = (z_1, z_2, ..., z_U) is at most as long as the input sequence x = (x_1, x_2, ..., x_T), i.e. U ≤ T. Since the input and target sequences are not generally the same length, there is no a priori way of aligning them. Our aim is to use S to train a temporal classifier h : X → Z to classify previously unseen input sequences in a way that minimises some task-specific error measure.

2.1. Label Error Rate

In this paper, we are interested in the following error measure.

Given a test set S' ⊂ D_{X×Z} disjoint from S, define the label error rate (LER) of a temporal classifier h as the normalised edit distance between its classifications and the targets on S', i.e.

    LER(h, S') = (1/Z) Σ_{(x,z) ∈ S'} ED(h(x), z)        (1)

where Z is the total number of target labels in S', and ED(p, q) is the edit distance between the two sequences p and q, i.e. the minimum number of insertions, substitutions and deletions required to change p into q. This is a natural measure for tasks (such as speech or handwriting recognition) where the aim is to minimise the rate of transcription mistakes.

3. Connectionist Temporal Classification

This section describes the output representation that allows a recurrent neural network to be used for CTC. The crucial step is to transform the network outputs into a conditional probability distribution over label sequences.
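Returning to the label error rate of Eq. (1) above, the following Python sketch gives one concrete reading of it (names are illustrative; h is assumed to be a function mapping an input sequence to its predicted label sequence): the total edit distance is normalised by the total number of target labels Z.

    def edit_distance(p, q):
        """Minimum number of insertions, substitutions and deletions
        needed to turn sequence p into sequence q."""
        d = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
        for i in range(len(p) + 1):
            d[i][0] = i
        for j in range(len(q) + 1):
            d[0][j] = j
        for i in range(1, len(p) + 1):
            for j in range(1, len(q) + 1):
                cost = 0 if p[i - 1] == q[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(p)][len(q)]

    def label_error_rate(h, test_set):
        """LER(h, S'): total edit distance divided by total number of target labels Z."""
        Z = sum(len(z) for _, z in test_set)
        return sum(edit_distance(h(x), z) for x, z in test_set) / Z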

