Master of Science in Informatics at Grenoble
Option: Artificial Intelligence and the Web

Data Selection for Trainable Neural Machine Translation Models

Wejdene Abdelmoula
24 June 2016

Research project performed at the Laboratory of Informatics of Grenoble

Under the supervision of:
Pr. Laurent Besacier, University Grenoble Alpes
Dr. Christophe Servan, University Grenoble Alpes

Defended before a jury composed of:
Dr. Marc Dymetman, Xerox Research Centre Europe
Pr. Massih-Reza Amini, University Grenoble Alpes
Pr. Nadia Brauner, University Grenoble Alpes
Pr. Jean-Claude Fernandez, University Grenoble Alpes
Pr. Jérôme Euzenat, University Grenoble Alpes
Pr. Eric Gaussier, University Grenoble Alpes

June 2016

Abstract

Neural machine translation (NMT) is a new approach to translating text from one language into another. The core of NMT is a single deep neural network with millions of neurons that learns to map source sentences to target sentences. Despite being relatively recent, NMT has already shown promising results on various translation tasks. As our contribution, we implemented an NMT system with an attention mechanism (a bidirectional Recurrent Neural Network (RNN) encoder and a simple recurrent decoder). In this thesis, we describe the basics needed to understand what RNNs are and what they offer. We experiment with our RNN translation model using a Python library (Theano). To train the translation model we need text to learn from; we used Arabic/French and English/French corpora. Moreover, we used a data selection technique to make the neural models trainable in reasonable time while keeping acceptable translation quality.

Keywords: Neural Machine Translation (NMT), Attention mechanism, Recurrent Neural Networks (RNN), Translation Model, Data selection

Contents

Abstract
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Structure of the thesis
2 State of the art
  2.1 Recurrent Neural Network
  2.2 Recurrent Neural Network for Natural Language Processing
  2.3 Neural Machine Translation
3 Experimentation
  3.1 Data
  3.2 Pre-processing
  3.3 Data selection
  3.4 Evaluation of machine translation
  3.5 Model Architecture
  3.6 Results
  3.7 Discussion
4 Conclusion and Future work
  4.1 Future of Neural Machine Translation
A Appendix A
B Appendix B
Bibliography

List of Figures

2.1 Recurrent Neural Network [33]. Three time-steps are shown
2.2 A one-hot vector to a continuous-space representation
2.3 LSTM with one memory cell [12]
2.4 Gated recurrent unit [7]
2.5 Recurrent Neural Network: Encoder/Decoder [33]. Here a German phrase is translated into an English one: the first three time-steps encode the German words into h3 and the last two decode h3 into English output words
2.6 Bidirectional encoder [5]
3.1 Attention Based Encoder/Decoder
3.2 Attention Block
3.3 Dev and Sample BLEU scores: 800,000 lines of the (English/French) corpus
3.4 Dev and Sample BLEU scores: 800,000 lines of the random corpus
3.5 Dev BLEU scores with different sub-corpus (En/Fr) sizes
3.6 Sample BLEU scores with different sub-corpus (En/Fr) sizes
3.7 Dev BLEU scores with different sub-corpus (En/Fr) sizes at iteration 140 k
3.8 Perplexity score for different adapted sub-corpora
A.1 Dev BLEU scores (10 percent of the Arabic–French corpus)
A.2 Sample BLEU scores (10 percent of the Arabic–French corpus)
A.3 Dev BLEU scores (20 percent of the Arabic–French corpus)
A.4 Sample BLEU scores (20 percent of the Arabic–French corpus)
A.5 Dev BLEU scores (50 percent of the Arabic–French corpus)
A.6 Sample BLEU scores (50 percent of the Arabic–French corpus)

List of Tables

3.1 Bilingual corpus (Ar–Fr) size
3.2 Corpus (DEV/TEST) size
3.3 Bilingual corpus (En–Fr) size
3.4 En–Fr Dev corpus
3.5 Both (Ar/Fr, En/Fr) corpus sizes after applying all pre-processing steps
3.6 BLEU scores for PB-SMT vs. Neural MT; the BLEU scores in parentheses are not final results, the model is still training
A.1 BLEU scores for PB-SMT vs. Neural MT, for 20 percent of the Arabic–French corpus

1 Introduction

1.1 Motivation

Deep learning systems are inspired by the cyclical connectivity of neurons in the brain. Although neural network architectures in machine learning are sometimes used to help understand brain function, they are not designed to be realistic models of biological function. Recently, the popularity and usefulness of deep learning has grown fast, owing to new and large data-sets, increased computational power and various techniques for training deeper neural networks. In particular, neural networks are powerful learning models that achieve state-of-the-art results in a wide range of supervised and unsupervised machine learning tasks. Traditional translation models are quite complex: they consist of numerous machine learning algorithms applied to different stages of the language translation pipeline. In this thesis, we focus on Recurrent Neural Networks (RNNs) as a replacement for traditional translation modules, in what is called "neural machine translation".
Neural Machine Translation (NMT) is a recently proposed approach to machine translation: a new way of teaching machines to translate using deep neural networks designed to maximize translation performance. NMT has achieved state-of-the-art results in translation tasks for various language pairs, and NMT systems are much easier to build and train than traditional MT systems. Usually, in the machine translation task we deal with variable-length input and output sentences; in other words, the source and the target are not of fixed length. RNNs, with their capability of capturing the dynamics of sequences via cycles in the network of nodes, are one of the best classes of artificial neural network architectures for the translation task. Clearly, RNNs are capable of representing the history. This is unlike feed-forward neural networks, where, whenever a single sample is fed into the network, the activation of the hidden units is computed from scratch and is not influenced by the state computed from the previous sample.

A recent trend in deep learning is the attention mechanism. Attention mechanisms in neural networks are based on the visual attention mechanism found in humans. Humans usually focus on certain regions of an image or certain words of a text to understand them well, and only then analyse the whole; the human brain can distinguish the most important parts from the less important ones. Human visual attention is well studied and different models inspired by it exist. Recently, attention mechanisms have made their way into the recurrent neural network architectures that are typically used in Natural Language Processing (NLP). In this thesis, we describe how we have pushed the limits of NMT, making it applicable to a wide variety of languages with state-of-the-art performance. The initial research goals of this thesis are (1) to understand and implement a neural machine translation system for two language pairs, Arabic–French and English–French, and (2), since NMT training on very large corpora takes too long, to propose data selection (filtering) algorithms to make NMT trainable and efficient.

1.2 Structure of the thesis

The next chapter introduces recurrent neural networks and their architecture, and defines our approach to neural machine translation. Then, in chapter 3, we present the data sets used in the thesis. Moreover, the training algorithm and the architecture of our model are described in detail: we describe how an attention mechanism can be incorporated into the simple encoder-decoder model, and we introduce simple and advanced language modeling techniques as well as a training data selection approach. Finally, we focus on the results of applying neural machine translation to the two language pairs, Arabic/French and English/French, and discuss the computational limitations of the model.

2 State of the art

Introduction

This chapter provides some background on Recurrent Neural Networks (RNN) and neural machine translation (NMT) that will make this report relatively self-contained.

2.1 Recurrent Neural Network

RNN [30] is a class of artificial neural networks that have recently been used in various tasks such as speech recognition [11] [31], learning word embeddings [29] and language modeling [38]. RNNs have shown great promise in a number of natural language processing tasks.

2.1.1 What are RNNs?

The power of the RNN is its capability of representing all the history. RNNs can handle sequential data, producing an output that depends on the previous computations.
In fact, the hidden state of the RNN represents all the previous history, which provides a form of memory. This is unlike the traditional "feed-forward" neural network, where the inputs (and likewise the outputs) are independent and the history is limited to the previous few computations. In equation 2.1, y_t, the output probability distribution over the vocabulary at time step t, is estimated. h_t is the hidden state at time step t; it represents the "memory" of the RNN. h_t is calculated from the previous hidden state and the input at the current step, h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_t), where W^{(hh)}, W^{(hx)} and W^{(s)} are, respectively, the hidden-to-hidden weight matrix used to condition the output of the previous time-step, the input-to-hidden weight matrix used to condition the input vector, and the hidden-to-output weight matrix.

y_t = \mathrm{softmax}(W^{(s)} h_t)    (2.1)

Figure 2.1 shows the RNN architecture, where each rectangular box is a hidden layer holding a number of neurons. x_t is the input at time step t; it could be a one-hot vector representation corresponding to n words. The figure also shows a three-time-step RNN that ties the weights at each time step (a fixed set of weights W). This allows each output to be conditioned on the entire sequence of preceding words by keeping around the hidden state h_t. So at each hidden state we have an input x_t (a new word) and we want to predict the output y_t (the next word).

Figure 2.1: Recurrent Neural Network [33]. Three time-steps are shown

2.1.2 What can RNNs do?

We can use neural networks in various natural language processing tasks [9], including part-of-speech tagging [37], named entity recognition [20], etc. In particular, RNNs have demonstrated their efficiency in this field thanks to their capability of memorizing and of getting feedback from past experience, contrary to convolutional neural networks and feed-forward neural networks, which have no notion of time or experience. For many tasks, traditional neural networks are not sufficient. For instance, if we want to predict the next word in a sentence, it is better to know which words came before it, so that we can make a good prediction.

2.1.3 Back-propagation through time

Both the feed-forward and the recurrent neural network architectures can be trained by stochastic gradient descent using the well-known back-propagation algorithm. Usually, a neural network uses back-propagation based on gradient descent to update the weights and minimize the cost function. For the RNN, however, back-propagation through time [15] is much more appropriate. In fact, back-propagation through time (BPTT) propagates the gradients of the errors back in time through the recurrent weights; this algorithm is more suitable for the RNN than ordinary back-propagation (BP).

2.2 Recurrent Neural Network for Natural Language Processing

2.2.1 Recurrent neural network language model

A language model (LM) allows us to predict and return the probability of the next word given an input window of (n − 1) words. A traditional language model calculates this probability using equation 2.2.

P(w_1, ..., w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, ..., w_{i-1})    (2.2)

We can say that the goal of the statistical LM is to anticipate the next word in a specific context; more precisely, it measures how likely a sentence is. RNNs allow us to condition the probability of each output on the entire sequence of preceding words by keeping around a hidden state.
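To make this concrete, below is a minimal NumPy sketch of a single step of such a recurrent language model. The vocabulary size, the layer dimensions, the random weights and the tanh non-linearity are illustrative assumptions, not the exact configuration used later in this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_emb, d_hid = 10_000, 620, 1000                # hypothetical vocabulary and layer sizes

E = rng.normal(scale=0.01, size=(V, d_emb))        # word embedding matrix
W_hx = rng.normal(scale=0.01, size=(d_hid, d_emb)) # input-to-hidden weights W^(hx)
W_hh = rng.normal(scale=0.01, size=(d_hid, d_hid)) # hidden-to-hidden weights W^(hh)
W_s = rng.normal(scale=0.01, size=(V, d_hid))      # hidden-to-output weights W^(s)

def softmax(z):
    z = z - z.max()                                # numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_lm_step(word_id, h_prev):
    """One recurrent step: read word x_t, update h_t, predict P(x_{t+1} | x_1..x_t)."""
    s_t = E[word_id]                               # embedding lookup (E^T w_t)
    h_t = np.tanh(W_hh @ h_prev + W_hx @ s_t)      # hidden-state update
    y_t = softmax(W_s @ h_t)                       # distribution over the whole vocabulary
    return h_t, y_t

h = np.zeros(d_hid)                                # h_0, the initial hidden state
for w in [42, 7, 1999]:                            # a toy sequence of word indices
    h, p_next = rnn_lm_step(w, h)
print(p_next.shape)                                # (10000,): probability of every next word
```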
In fact, at each hidden state h_t we have a new word x_t and we want to predict the output y_t, which is the next word x_{t+1} = w_t. An RNN LM, in contrast with the traditional count-based LM, models word probabilities using continuous vector representations. The input of this language modeling function should represent each and every word in the vocabulary as equidistant from the others. The 1-of-k encoding is such a representation, where each word in the vocabulary is represented as a binary vector w_i whose elements sum to one: we set the i-th element of the vector to one and all other elements to zero. This vector is called a one-hot vector, so the input to our function is a set of (n−1) one-hot vectors (w_1, w_2, ..., w_{n−1}). Since we are using a neural network, these vectors are multiplied by a weight matrix E to produce a sequence of continuous vectors (s_1, s_2, ..., s_{n−1}), where s_j = E^T w_j. When the weight matrix is multiplied by a one-hot vector, all rows of the matrix are zeroed out except the one corresponding to the i-th element, whose weights are multiplied by 1. s_j is called a word embedding, and s = [s_1; s_2; ...; s_{n−1}] is called the context vector. A transformation layer is then applied to the context vector, namely a non-linear function f followed by a softmax, to obtain probabilities. This whole function is a neural language model. Figure 2.2 represents the steps from a word to a one-hot vector and then to a continuous-space representation.

Figure 2.2: A one-hot vector to a continuous-space representation

We use the same weights at all time steps, and everything else is as before: the inputs x_1, ..., x_t are presented as one-hot vectors w_i, and h_0 is some initialization vector for the hidden layer at the first time step.

h_t = f(h_{t-1}, s_t)    (2.3)

The softmax output y_t = \mathrm{softmax}(W^{(s)} h_t) gives \hat{y}_t, a probability distribution over the vocabulary:

P(x_{t+1} = v_j \mid x_t, ..., x_1) = \hat{y}_{t,j}    (2.4)

Finally, we use the cross-entropy loss function (but predicting words instead of classes):

J_t(\theta) = -\sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}    (2.5)

2.2.2 Long Short-Term Memory and Gated Recurrent Unit

2.2.2.1 Long Short-Term Memory

The Long Short-Term Memory (LSTM) [13] is a complex activation unit consisting of recurrently connected memory blocks, presented in figure 2.3.

Figure 2.3: LSTM with one memory cell [12]

Each LSTM unit consists of:
• a forget gate;
• a memory cell;
• an input gate;
• an output gate.

The forget gate is a sigmoid layer that decides what information will be forgotten (thrown away). It takes the past hidden state h_{t−1} and the input word x_t as input, and outputs a number between 0 (completely get rid of this) and 1 (completely keep this) for each entry of the cell state C_{t−1}.

f_t = \sigma(W_f x_t + U_f h_{t-1})    (2.6)

A vector of new candidate values, \tilde{c}_t, is created that could be added to the memory state. In the next step, these two are combined to update the memory cell of the j-th LSTM unit at time-step t:

c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j    (2.7)

where the new content is

\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})    (2.8)

Another important step is to decide what new information will be stored in the cell: the input gate decides which values will be updated.

i_t = \sigma(W_i x_t + U_i h_{t-1})    (2.9)

Finally, the output gate controls how much information from the memory state we are going to output.
o_t = \sigma(W_o x_t + U_o h_{t-1})    (2.10)

Cells connected together, replacing the traditional hidden unit in the recurrent neural network, allow the operations of forgetting, memorizing and exposing the memory content. All the gating units have a sigmoid non-linearity, while the input unit can have any squashing non-linearity.

2.2.2.2 Gated recurrent unit

The gated recurrent unit (GRU) is a new type of hidden unit proposed in [7]. The GRU unit is inspired by the LSTM unit but is much simpler to use and to implement. It adaptively remembers and forgets its state based on the input signal to the unit. A GRU unit consists of a reset gate, an update gate and an input unit. Unlike in the LSTM, a single gating unit (the update gate) controls both the forgetting and the decision to update. Figure 2.4 shows the internals of a GRU, followed by a mathematical description of the GRU's fundamental stages.

The reset gate decides how much information from the previous memory state will be used to compute the next target state. If the values of the reset gate are close to zero, we ignore the previous memory and only store the new word information.

r_t = \sigma(W_r x_t + U_r h_{t-1})    (2.11)

The update gate controls how much information from the previous memory state will be used. If z_t^j = 1, we copy the previous hidden state h_{t-1}^j and ignore the new candidate \tilde{h}_t^j; conversely, if z_t^j = 0, most of the new memory \tilde{h}_t is forwarded to the next memory state.

z_t = \sigma(W_z x_t + U_z h_{t-1})    (2.12)

Figure 2.4: Gated recurrent unit [7]

The hidden state h_t is finally created from the new memory, the update gate and the previous hidden state (equation 2.13); the new memory (equation 2.14) is a fusion of the input word x_t and the previous hidden state h_{t−1}.

h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j    (2.13)

\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})    (2.14)

2.2.2.3 LSTM vs. GRU

RNNs propagate weight matrices from one time-step to the next. Indeed, the goal of an RNN implementation is to propagate context information through faraway time-steps, and this makes RNNs hard to train. In fact, there is a basic problem: gradients propagated over many stages tend either to vanish (most of the time, the gradient value gradually shrinks towards zero as it propagates) or to explode (rarely, the gradient value grows extremely large, causing an overflow and much damage to the optimization). These two problems, which can occur during training, are called respectively the vanishing gradient and the exploding gradient problem [28].

To solve these problems we must use RNN units that allow the network to accumulate information over a long duration. However, once that information has been used, it might be useful for the neural network to forget the old state. For example, if a sequence is made of sub-sequences and we want the unit to accumulate evidence inside each sub-sequence, we need a mechanism to forget the old state by setting it to zero. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it. This is what gated RNNs do.

GRUs are quite new (proposed by [7] in 2014), and their trade-offs have not been fully explored yet. According to the empirical evaluations in [14] and [8], there is no clear winner: in many tasks both architectures give almost the same performance. The GRU contains fewer parameters, so it may train a bit faster and need less data to generalize. On the other hand, with enough data, the LSTM may give better results, since its larger number of parameters gives it greater expressive power.
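As an illustration of the gating equations 2.11–2.14, here is a minimal NumPy sketch of a single GRU step. The dimensions and random weights are hypothetical, and this is not the Theano implementation used later in the thesis; it only mirrors the equations above.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid = 620, 1000                                # hypothetical embedding / hidden sizes

def init(shape):
    return rng.normal(scale=0.01, size=shape)

W_r, U_r = init((d_hid, d_in)), init((d_hid, d_hid))   # reset-gate weights
W_z, U_z = init((d_hid, d_in)), init((d_hid, d_hid))   # update-gate weights
W,   U   = init((d_hid, d_in)), init((d_hid, d_hid))   # candidate-state weights

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev):
    """One GRU step following equations 2.11-2.14."""
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)            # reset gate (2.11)
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)            # update gate (2.12)
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev))    # candidate memory (2.14)
    return (1.0 - z_t) * h_prev + z_t * h_tilde        # new hidden state (2.13)

h = np.zeros(d_hid)
for _ in range(3):                                     # a toy sequence of 3 input vectors
    h = gru_step(rng.normal(size=d_in), h)
print(h.shape)                                         # (1000,)
```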
2.3 Neural Machine Translation

2.3.1 Statistical Machine Translation

The name machine translation reflects the fact that we want a system to translate a text (the source sentence) into a corresponding text (the target sentence). There are multiple ways to build such a machine. The statistical approach to machine translation (SMT) allows us to translate a sentence from a source language into a target language automatically. Moreover, SMT has many variants, depending on the data-sets, the needs and the translation model. The most common approach in SMT is the phrase-based (PB-SMT) approach proposed by [18].

\hat{t} = \arg\max_t P(t \mid s) = \arg\max_t P(s \mid t) \, P(t)    (2.15)

In equation 2.15, we cannot estimate P(t|s) directly, where s and t are respectively the source and the target utterance; thanks to Bayes' rule, we can transform the equation. In the PB-SMT approach, we estimate P(s|t)·P(t), in which P(t) is a target language model and P(s|t) is a translation model.

SMT has limitations as well. Pre-processing alignments to extract bi-phrases is a very complex and hard task, and language pairs that have different word orders are particularly tricky. NMT uses recurrent neural networks to build a system that learns to map a whole sentence from source to target at once. This addresses word ordering, because the system learns whole sentences at a time.

2.3.2 Neural Machine Translation

Neural machine translation (NMT) is a new approach to machine translation [16], [34], [3]; the goal of NMT is to provide a fully trainable model that maximizes the translation performance. The NMT model starts from a representation of a source sentence and ends by giving a representation of a target sentence. Most of the proposed neural machine translation approaches are based on the standard RNN encoder/decoder: a neural network encoder reads the source sentence and encodes it into a fixed-length vector, and the decoder outputs a translated sentence from this fixed vector.

Cho, van Merriënboer et al. [7] developed a traditional RNN encoder/decoder based on GRU units. In this approach, studied by Cho in [6], the translation score decreases as the length of the sentence increases. This is due to the model having to encode the information of the whole source sentence into a single fixed vector. They also mention that the vocabulary size has a high impact on the translation score.

Luong et al. demonstrate in [21] that deep NMT with 3 or 4 layers performs better than NMT with 1 or 2 layers in terms of perplexity and translation score. They build on the machine translation system created by Devlin [10], which uses a feed-forward neural network as a translation model, predicting one word of the target phrase at a time.

In [34], a 4-layer neural network with LSTM units is used, with one trick: the input sequence is reversed, which enables a BLEU score of 37 to be reached. This makes things work better in practice, but it is not a principled solution. Most translation benchmarks involve language pairs like French and Spanish translated to or from English, but there are languages (like Japanese or German) where the last word of a sentence can be highly predictive of the first word of its English translation; in that case, reversing the input would make things worse.

The aim of [22], by Luong and Manning, is to explore the effectiveness of NMT on spoken language; they use the global and local attention proposed in [24].
2.3.2.1 Standard RNN encoder/decoder

The RNN encoder/decoder [7] is based on two RNNs that act as an encoder and decoder pair. This type of architecture (Figure 2.5) is at the core of deep learning.

Figure 2.5: Recurrent Neural Network: Encoder/Decoder [33]. Here a German phrase is translated into an English one: the first three time-steps encode the German words into h3 and the last two decode h3 into English output words.

The encoder is a recurrent neural network that maps a sentence to a vector: it encodes the input sentence X into a context vector c, which is a continuous vector. Equation 2.16 computes the hidden-layer output at each time-step t, where f can be an LSTM or a GRU.

h_t = f(x_t, h_{t-1})    (2.16)

Equation 2.17 gives the vector c generated from the sequence of hidden states (q is a non-linear function).

c = q(h_1, ..., h_t)    (2.17)

The recurrent activation function is applied recursively over the input sequence until the end, when the final internal state of the RNN, h_t, is the summary of the whole input sentence.

The decoder generates from the vector c. It is also an RNN, a simple translation model (TM) conditioned on the source sentence X. The RNN-TM is used to predict the next word y_t given the context c and all the previous words (y_1, ..., y_{t−1}). The decoder defines the probability, where g is a non-linear function:

P(y_t \mid y_1, ..., y_{t-1}, c) = g(y_{t-1}, h_t, c)    (2.18)

2.3.2.2 Can we do better than the simple encoder-decoder based model?

To overcome the vocabulary-size shortcoming, Luong et al. [25] propose a "copy" mechanism for <unk>: a simple and effective approach that can treat any NMT system as a black box and annotates occurrences of target <unk> tokens with positional information to track their alignments. More recently, they proposed character-level translation [23], which addresses the language complexity problem.

The main idea of the work proposed by Bahdanau [3] for solving the sentence-length problem is the attention mechanism. When humans translate a long sentence, they usually first read the whole paragraph to understand the meaning and get an idea of the context, and then translate the sentence or paragraph word by word; this is the concept behind the attention approach.

2.3.2.3 Learning to Align in Machine Translation

NMT by jointly learning to align and translate [3] is based on an encoder/decoder architecture extended with an attention mechanism, which allows the model, for each new word, to focus on part of the original sentence. The traditional RNN encoder/decoder produces just a single hidden state that summarizes the whole sentence; in contrast, in this NMT model the encoder is a bidirectional recurrent neural network (Figure 2.6) composed of a forward RNN and a separate backward RNN. The forward RNN reads the sentence from the first word to the last, and the backward RNN reads it from the last word to the first; in this way the annotation contains information about the window of words around each position. On the other hand, the fixed-size context vector c is replaced by a variable-length representation, from which the decoder generates a translation. Using a variable-length representation improves the performance of encoder/decoder models by addressing the problem of sentence length. Each time the decoder RNN, as in the traditional decoder, is trained to predict the next word, it determines via a softmax over the hidden states which of them to take as input.
In fact, the probability is conditioned on a distinct context vector c_i, which depends on a sequence of annotations containing information about the whole input sequence, with a strong focus on the parts surrounding the i-th word of the input. Basically, the context vector is computed as a weighted sum of the annotations. We will give a detailed description of the model in the next chapter (section 3.5).

Figure 2.6: Bidirectional encoder [5]

This approach uses an alignment model as part of the decoding process. The alignment between the target words and the source is very useful, especially to deal with the problem of unknown words, which has an important influence on translation performance. The work of Bahdanau et al. [3] is an extension of the basic encoder/decoder, and this new approach appears to improve the translation quality of long sentences. This does not mean, however, that their approach is perfect: the language complexity problem remains, as does the cost of computing the annotation weight for every word in the source sentence.

2.3.3 Drawbacks

NMT has already achieved state-of-the-art performance for many languages and offers many advantages over traditional translation approaches, but it has not solved three shortcomings:

• The vocabulary size problem: all work in NMT has used the unk-replacement technique with a restricted vocabulary (a shortlist of the most frequent words); any other word is mapped to the <UNK> symbol.
• The sentence length problem: translation quality degrades dramatically as the length of the source sentence increases when the encoder/decoder model is small; to overcome this problem, the dimensionality of the context vector must be large enough that a sentence of any length can be compressed [7].
• The language complexity problem: "copying" mechanisms are not sufficient, as they ignore several properties of languages.

Conclusion: In this chapter, we introduced recent advances in recurrent neural networks, centered around neural machine translation. We first introduced the standard encoder/decoder model for machine translation and discussed some weaknesses of this simple model; we then described a recently proposed approach, the attention mechanism, that overcomes some of these shortcomings. Indeed, our main experiments were run using the attention-based mechanism. Our intent is to obtain good translation quality and to optimize our model as much as we can, so besides the choice of this model we also work on improving the quality of the training data, which is the subject of the next chapter.

3 Experimentation

Introduction

In this chapter we propose to study the impact of corpus-based domain adaptation techniques on the neural machine translation approach. It is compared to a baseline approach, the standard phrase-based SMT approach. We introduce the training environment and the data sets used in our experiments. Moreover, we are interested in the pre-processing of our training data and in data selection. We also describe the evaluation protocol and the recurrent neural network (RNN) model used for training.

3.1 Data

The evaluation is performed on the Arabic–French and English–French translation tasks. To train the translation model we must have a training set, called a parallel corpus, described in the next sub-section.

3.1.1 Arabic–French Corpus

Tables 3.1 and 3.2 summarize the contents of our Arabic–French corpus.
Table 3.1: Bilingual corpus (Ar–Fr) size

Corpus  | Number of lines | Number of words (Ar/Fr)
new-lc  | 90 753          | 3 M / 2 M
opensub | 4 381 835       | 41 M / 40 M
trames  | 20 539          | 0.8 M / 0.8 M
wit3    | 87 732          | 2 M / 2 M
Total   | 4 580 859       | 46.8 M / 44.8 M

The Arabic–French training corpus is composed of several corpora. The out-of-domain data includes News-commentary [36], provided by WMT for training SMT systems; News-commentary covers 12 languages. We use 90 753 Arabic lines (3M words) and 90 753 French lines (16M words). We also use Open-subtitles(1), a parallel corpus of movie subtitles [35]: 4M Arabic lines (41M words) and 4M French lines (196M words). WIT3(2) is a version, prepared for research purposes, of the multilingual transcriptions of the TED talks corpus(3), used for the translation task of the IWSLT evaluation campaign; we use 87 732 Arabic lines (2M words) and 87 732 French lines (15M words). Our in-domain corpus is the Trames corpus, transcripts from radio and television used for the evaluation campaign of the TRAD project(4); we use 20 539 Arabic lines (0.8M words) and 20 539 French lines (5M words).

The development corpus (Dev) is used during the training of the model to optimize the parameters of the translation model. This corpus consists of 795 sentences in each language.

Table 3.2: Corpus (DEV/TEST) size

Corpus | Newspapers | Mail C3   | Mail C4
DEV    | 250 lines  | 155 lines | 390 lines

3.1.2 English–French Corpus

As out-of-domain corpus, we use 2 007 723 lines of the Europarl (Ep) parallel corpus. Ep is extracted from the proceedings of the European Parliament and includes versions in 21 European languages. Finally, 274 419 lines from the UN corpus are added; it is extracted from the United Nations website(5) and composed of official records and other parliamentary documents. The in-domain corpus is the European Medicines Agency (EMEA) corpus, a parallel corpus made out of PDF documents from the European Medicines Agency; all files are automatically converted from PDF to plain text using pdf-to-text. EMEA covers 22 languages. The development corpus also comes from EMEA. Tables 3.3 and 3.4 summarize the contents of the English–French corpus.

Table 3.3: Bilingual corpus (En–Fr) size

Corpus | Number of lines | Number of words (En/Fr)
ep7    | 2 M             | 55 M / 61 M
EMEA   | 549 K           | 11 M / 13 M
UN     | 474 K           | 7 M / 7 M
Total  | 3.6 M           | 73 M / 81 M

(1) http://www.opensubtitles.org/
(2) WIT3: Web Inventory of Transcribed and Translated Talks, website: https://wit3.fbk.eu
(3) http://www.ted.com/
(4) http://www.trad-campaign.org
(5) http://ods.un.org

Table 3.4: En–Fr Dev corpus

Corpus   | Number of lines | Number of words (En/Fr)
EMEA DEV | 1 665           | 34 948 / 41 151

3.2 Pre-processing

To train our translation model we need the corpora described above, but we first need to do some pre-processing to get the data into the right format.

3.2.1 Tokenize text

The first pre-processing steps are tokenization and normalization. We applied a tokenization script (from the open-source machine translation package Moses) to our raw corpus. This script tokenizes the text into sentences and the sentences into words; it also handles punctuation, so the sentence "she is beautiful!" is split into 4 tokens: "she", "is", "beautiful", "!".

3.2.2 Remove infrequent words

NMT is limited in its handling of a large vocabulary (a huge vocabulary would make our model very slow to train). Moreover, to really understand how to use a word appropriately, a human needs to have seen it in different contexts, and SMT approaches behave similarly.
A usual practice is to construct a target vocabulary of the k most frequent words (called a shortlist), where k is 30,000 here, as in [3]. Words that are not included in the shortlist are mapped to a special token ([UNK]) in order to limit the vocabulary size. The token ([UNK]) becomes part of our vocabulary and we predict it just like any other word. We also removed all the numbers from all the corpora and replaced them with a special token, <NUMBER>. Besides, every sentence starts with the token <S> and ends with </S>.

3.2.3 Filter out sentences by length

After applying the tokenization script, we filter out long sentences in order to avoid long-range association problems in NMT approaches [6]. In our case, we remove sentences that are more than 30 words long.

3.3 Data selection

Many experiments have shown that the more training data we have, the better the score we get. The size of the training data clearly affects the translation scores, but the relationship is not linear, and increasing the training data size is not always the best choice [19]. Furthermore, data selection [2] has shown very good results both for improving translation quality and for reducing model size in statistical machine translation. We therefore use this technique to adapt our NMT system to a specific domain, especially since the performance of a neural machine translation (NMT) system depends on the quantity and quality of the available training data.

Data selection allows a better adaptation by measuring the similarity of sentences to the in-domain data, either the development or the test set. The similarity is measured with metrics such as the perplexity (PP) and the entropy (H). The entropy (equation 3.2) is the exponent in the perplexity score (equation 3.1) for a sequence of words W = (w_1, w_2, ..., w_n).

PP(W) = 2^{H(W)}    (3.1)

H(W) = -\frac{1}{N} \log_2 P(w_1, w_2, ..., w_n)    (3.2)

Various approaches have been proposed for data selection [2], [17], [39]. To improve the quality of our output we employ the data selection approach proposed by Moore and Lewis [26]. They use the cross-entropy difference (DCE), computed for each sentence according to a domain-specific (in-domain) language model and a non-domain-specific (out-of-domain) language model.

DCE(S) = H_I(S) - H_N(S)    (3.3)

This equation scores the segments, where I is an in-domain data set and N a non-domain data set; H_I(S) and H_N(S) are respectively the entropy of a sentence S according to a language model trained on I and according to a language model trained on N. The corpus sentences are then ranked with the lowest-scoring first.

Table 3.5 shows the size of the Arabic–French and English–French corpora after applying all pre-processing steps. We can see that the pre-processing steps affect the training data size; as a matter of fact, all pre-processing reduces the corpus size.

Table 3.5: Both (Ar/Fr, En/Fr) corpus sizes after applying all pre-processing steps

Corpus | Number of lines | Number of words
Ar/Fr  | 4 303 676       | 40 M / 40 M
En/Fr  | 2 000 000       | 32 M / 35 M

Data selection tends to favor short sentences, so the shortest sentences usually end up at the beginning of the ranked corpus and the longest at the end.
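To make the selection procedure concrete, here is a minimal sketch of Moore–Lewis scoring and ranking. The add-alpha unigram language model, the function names and the toy sentences are illustrative assumptions only; the thesis relies on proper n-gram language models rather than this toy scorer.

```python
import math
from collections import Counter

def unigram_lm(sentences, alpha=1.0):
    """Train a toy add-alpha unigram LM; returns a log2-probability function."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1                      # +1 to reserve mass for unseen words
    def logprob(word):
        return math.log2((counts[word] + alpha) / (total + alpha * vocab))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-word cross-entropy H(S) of a sentence under a language model."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / max(len(words), 1)

def moore_lewis_rank(general_corpus, in_domain, out_domain):
    """Rank general-domain sentences by cross-entropy difference H_I(S) - H_N(S)."""
    lm_in = unigram_lm(in_domain)
    lm_out = unigram_lm(out_domain)
    scored = [(cross_entropy(s, lm_in) - cross_entropy(s, lm_out), s)
              for s in general_corpus]
    scored.sort()                                # lowest DCE = most in-domain-like
    return [s for _, s in scored]

# Toy usage: keep the sentences closest to the in-domain data.
in_domain = ["the patient should take one tablet daily"]
out_domain = ["the parliament adopted the resolution yesterday"]
general = ["take one tablet with water", "the resolution was adopted"]
print(moore_lewis_rank(general, in_domain, out_domain)[:1])
```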
Moreover, our training sample is not random but composed of the first 1,500 sentences; for this reason, after applying data selection, we decided to shuffle the sentence pairs randomly.

3.4 Evaluation of machine translation

After training our machine translation system, the question is how to evaluate it. We need an automatic evaluation metric in order to efficiently test and compare different machine translation models. We propose to use the most common automatic metric in translation evaluation tasks: the BLEU score.

BLEU (Bilingual Evaluation Understudy)

The Bilingual Evaluation Understudy (BLEU) score [27] is one of the first and most widely used automatic evaluation metrics for assessing the quality of translations. The algorithm measures the quality of the output text produced by the machine against a human reference translation (BLEU achieves a high correlation with human judgments of quality).

\mathrm{BLEU} = BP \cdot \exp\!\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)    (3.4)

where BP (brevity penalty) is a multiplicative factor modifying the overall BLEU score:

BP = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{otherwise} \end{cases}    (3.5)

BP is a decaying exponential in r/c, where r is the test corpus's effective reference length, computed by summing the best-match lengths for each candidate sentence in the corpus, and c is the total length of the candidate translation corpus. BLEU is definitely not a perfect metric; some of the alternatives available at the moment are METEOR(6) [4] and TER(7) [32]. Nevertheless, the BLEU metric is adequate for our system.

(6) Metric for Evaluation of Translation with Explicit ORdering
(7) Translation Error Rate

3.5 Model Architecture

Figure 3.1 shows the attention-based encoder/decoder, where each rectangular box is an annotation h_j at a time step t and each layer holds a number n of neurons.

Figure 3.1: Attention Based Encoder/Decoder

Our encoder is a bidirectional neural network of two layers, each containing 1000 gated recurrent units (GRU). The first layer is a forward RNN, which reads the source sentence and produces a hidden state, noted \overrightarrow{h}_j, summarizing the source sentence up to the j-th word starting from the first word. The second layer is a backward RNN, which reads the source sentence in reverse order and produces reversed memory states, noted \overleftarrow{h}_j. Together, \overrightarrow{h}_j and \overleftarrow{h}_j are called the annotation, noted h_j, and summarize the whole input sentence. In other words, each annotation contains information about the whole sequence, with a strong focus on the words around x_j. The annotations are the states of a bidirectional network driven by the word embeddings of the source sentence. The attention mechanism assigns weights to the annotations, and the weighted sum of the annotations is then used by the translation network to predict the next word of the generated translation.

The decoder contains just one layer of 1000 GRU units; it is able to selectively focus on one or more of the annotation vectors for each target word. The attention mechanism (presented in Figure 3.2) takes as input the previous decoder hidden state z_{i−1}, one of the annotations h_j and the previously generated target word ỹ_{i−1}.

Figure 3.2: Attention Block

More generally, the attention mechanism is a small neural network with a single hidden layer and a single scalar output e_ij per source word. Once the score of each input word is computed, we apply a softmax over the scores to produce the attention weights.
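The following NumPy sketch illustrates this scoring-and-softmax step for one decoder position; the corresponding equations 3.6–3.8 are given just below. The single-hidden-layer scorer, the dimensions and the random weights here are illustrative assumptions, and the exact parametrization used in [3] and in our Theano implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
T_x, d_ann, d_dec, d_emb, d_att = 6, 2000, 1000, 620, 512   # hypothetical sizes

H = rng.normal(size=(T_x, d_ann))               # annotations h_1..h_Tx of the source sentence
z_prev = rng.normal(size=d_dec)                 # previous decoder state z_{i-1}
y_prev = rng.normal(size=d_emb)                 # embedding of the previous target word

# Single-hidden-layer scorer a(z_{i-1}, y_{i-1}, h_j) -> scalar e_ij
W_z = rng.normal(scale=0.01, size=(d_att, d_dec))
W_y = rng.normal(scale=0.01, size=(d_att, d_emb))
W_h = rng.normal(scale=0.01, size=(d_att, d_ann))
v = rng.normal(scale=0.01, size=d_att)

def attention(z_prev, y_prev, H):
    """Compute alignment scores e_ij, weights alpha_ij and context c_i."""
    e = np.array([v @ np.tanh(W_z @ z_prev + W_y @ y_prev + W_h @ h_j) for h_j in H])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                        # softmax over source positions
    c_i = alpha @ H                             # weighted average of the annotations
    return e, alpha, c_i

e, alpha, c = attention(z_prev, y_prev, H)
print(alpha.round(3), c.shape)                  # weights sum to 1; context has shape (2000,)
```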
The annotation h_j contains a summary of both the preceding and the following words. Equation 3.6 shows the relation between z_{i−1}, the previous memory state, which summarizes what has been translated up to the (i−1)-th word, the previously generated target word ỹ_{i−1}, and the annotation h_j. The alignment model (or energy) e_ij scores how well the input around position j and the output at position i match.

e_{ij} = a(z_{i-1}, \tilde{y}_{i-1}, h_j)    (3.6)

The attention weight α_ij shown in the attention model figure is the probability that the target word y_i is aligned to the source word x_j.

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}    (3.7)

Equation 3.8 gives c_i, the weighted average of the annotations, which is used to compute the new memory state z_i of the decoder.

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j    (3.8)

3.6 Results

We adapted a fully neural machine translation system using Python/Theano(8), a library which uses the GPU. To train our models we used the data described in section 3.1. In our experiments, we tested our model with different sizes of the training corpus: on the one hand, we want to observe the model performance with a variety of data-set sizes; on the other hand, we want to evaluate the model according to data-set quality. To better understand the behavior of the NMT approach, we tested our models on two language pairs: English–French and Arabic–French. The results for the Arabic–French language pair are presented in appendix A.

(8) https://github.com/mila-udem/blocks-examples/tree/master/machine_translation

3.6.1 The English–French task

As a first pre-processing step, we tokenized the whole English/French corpus, then mapped all the infrequent words (words that do not belong to the 30,000-word shortlist) and all the numbers to their specific tokens. As a second step, in order to limit memory use during training, we removed sentences longer than 30 words; since data selection tends to favor short segments, this is another reason for removing long sentences. We then shuffled the corpus to obtain our baseline (the random corpus). Finally, we applied data selection on the whole corpus to retain the most interesting sentences according to the development corpus. We then took data-sets of different sizes as sub-corpora of the sorted data and compared them to the baseline.

We trained on 800 K lines of both the random and the adapted corpus (tokenized, filtered, selected) during 16 epochs. We propose to evaluate the effect of data-set quality on the translation performance of the NMT model.

Figure 3.3 shows the BLEU scores of the dev and sample files for the adapted corpus. We applied a simple "head" to take the first 800 K lines of the tokenized, filtered and selected corpus. The BLEU validation starts after 20,000 updates and is then run every 5,000 updates.

Figure 3.3: Dev and Sample BLEU scores: 800,000 lines of the (English/French) corpus

Figure 3.4 shows the BLEU scores of the dev and sample files for the random sub-corpus: the first 800 K lines of the tokenized, filtered and shuffled corpus. This is the result after 140 K iterations.

Figure 3.4: Dev and Sample BLEU scores: 800,000 lines of the random corpus

At iteration 140 K, with the same amount of data, there is a large difference between the BLEU scores on the adapted data (Dev 38.65 and Sample 39.58) and on the random data (Dev 28.70 and Sample 39.12): approximately +10 BLEU points for the dev and +0.46 for the sample. From figures 3.3 and 3.4, we note that data selection has an effect on translation scores.
Figures 3.5 and 3.6 display the BLEU scores with respect to the different sub-corpus sizes, for the dev and for the sample. We applied all the pre-processing steps to sub-corpora of different sizes: we tokenized the whole corpus, removed sentences whose length exceeds 30 words and applied data selection; we then took the first 400 K lines (respectively 800 K, 1 200 K, 1 600 K, 2 M). This time, we wanted to observe how the data size influences the BLEU scores.

Figure 3.5: Dev BLEU scores with different sub-corpus (En/Fr) sizes

Figure 3.6: Sample BLEU scores with different sub-corpus (En/Fr) sizes

The figures above show that BLEU scores increase when the adapted corpus size decreases. For instance, at iteration 240 000, for 400 K lines the BLEU score on the dev is 67.26 (47.08 on the sample), while for 800 K lines the dev reaches 41.56 BLEU points (sample 46.12). Furthermore, with 1 200 K lines we obtain 37.55 BLEU points on the dev (sample 46.27), with 1 600 K lines 32.75 points (sample 44.79), and finally 31.94 points for 2 M lines (sample 44.45).

3.6.1.1 Comparison with a phrase-based approach

Table 3.6 presents the BLEU scores of the phrase-based machine translation system built with Moses and of our neural machine translation system, on 800 K and 400 K lines of the English–French adapted corpus (tokenized, filtered and selected) and of the random one (tokenized, filtered and shuffled).

Table 3.6: BLEU scores for PB-SMT vs. Neural MT; the BLEU scores in parentheses are not final results, the model is still training

Number of lines | MT Model                   | BLEU scores
800 K           | PB-SMT (Adapted Corpus)    | 59.62
800 K           | Neural MT (Adapted Corpus) | (60.42)
800 K           | Neural MT (Random Corpus)  | (44.73)
400 K           | PB-SMT (Adapted Corpus)    | 60.08
400 K           | Neural MT (Adapted Corpus) | 67.50

Figure 3.7 presents the dev BLEU scores for the different adapted sub-corpus sizes at iteration 140 K. The translation score decreases remarkably between 400 K lines and 800 K lines (a gap of approximately 20 BLEU points), whereas for the phrase-based machine translation the difference is only 1.30 BLEU points.

Figure 3.7: Dev BLEU scores with different sub-corpus (En/Fr) size at iteration 140 k

Figure 3.8: Perplexity score for different adapted sub-corpora (test-set perplexity as a function of corpus size)

Figure 3.8 presents the results, in terms of perplexity obtained on a test set, for the same adapted sub-corpora as in Figure 3.7. The lowest perplexity, 24.5672, is obtained for 400 K lines. Indeed, the adapted sub-corpus with 400 K lines has the highest BLEU score and the lowest perplexity.

3.7 Discussion

3.7.1 Cost of the attention mechanism

As mentioned before, the attention mechanism gives the neural network access to the hidden states of the encoder (its internal memory, so to speak). Unlike typical memory access, the attention mechanism makes the access soft (the network recovers a weighted combination of all memory locations), so we can easily train the network end-to-end using back-propagation. Despite all its benefits, the attention approach still has some shortcomings. Looking closely, we can see that it is costly: we need to calculate an attention value for each combination of input and output word. For instance, a 50-word input sequence and a 50-word output sequence would require 2,500 attention values.
In the same way, if we are dealing with very long sequences, the above attention mechanism can become prohibitively expensive and take a long time to train. We are essentially looking at everything in detail before deciding what to focus on: after generating each output word, we go back through all of the hidden states (the internal memory) of the text in order to decide which word to produce next, which seems wasteful. Nevertheless, this has not stopped attention mechanisms from performing well on many tasks (machine translation, image/video captioning), and attention mechanisms have recently become quite popular.

3.7.2 Effect of data selection on translation quality

Data selection (the cross-entropy difference approach) has shown significant improvements in the effective use of training data by extracting sentences from large general-domain corpora to adapt our NMT system to the target domain. Since the best result may depend on the size of the selected data, we investigate selected corpora of increasing size, starting from 400 K lines of the general corpus (400K-l, 800K-l, 1,200K-l, 1,600K-l, 2,000K-l, where X-l means that X lines of the general corpus are selected as a subset). Translation quality improves by at most +7 BLEU points when using 400 K lines of the general corpus, as shown in figure 3.6; the performance then begins to drop when the size threshold exceeds 800 K lines (Figure 3.5). The results show that keyword overlap plays a significant role in retrieving sentences from similar domains.

3.7.3 Arabic: a challenge for translation tasks

Arabic has a rich morphology compared to French. A single Arabic word can function as an entire sentence in French, which makes the translation hard to produce even with neural machine translation [1]. Arabic is a very complex language for the computer to understand. In our experiments, even after applying a normalization script and an OpenNLP(9) tokenizer (diacritic removal, letter normalization and tatweel removal), we did not obtain high BLEU scores.

(9) https://opennlp.apache.org/

4 Conclusion and Future work

Conclusion

In this thesis we presented a new approach to machine translation: Neural Machine Translation (NMT). NMT is a radically new way of teaching machines to translate using deep neural networks. Though developed just recently, NMT has achieved state-of-the-art results in the WMT translation tasks for various language pairs such as English-French, English-German and English-Czech. NMT is appealing since it is conceptually simple: it is essentially a big recurrent neural network that can be trained end-to-end and translates. It reads through the given source words one by one until the end, and then starts emitting one target word at a time until a special end-of-sentence symbol is produced. Recently, a very interesting approach, the attention mechanism, has shown very promising results in neural machine translation; it allows the neural network to refer back to the source sentence instead of encoding all the information into one fixed vector. As already mentioned in this thesis, we trained our English–French and Arabic–French corpora with the NMT model described in [3]. This approach has given impressive results, especially for the English–French corpus. In our work, we wanted to clarify the relation between the quality and quantity of the corpus and the translation performance.
We have shown first results of neural MT on selected data and compared them with a phrase-based system trained on the same adapted corpus. Finally, we have concluded that the quality of the data-sets affects the performance of the NMT system: in fact, our adapted corpus improved the translation quality compared to the tokenized-only corpus.

4.1 Future of Neural Machine Translation

4.1.1 Short-term projects

The short-term projects present two goals that we will try to achieve as an extension of this work. (1) Decoding: once our models are trained, we need to find a translation that maximizes the conditional probability; we can simply use a beam search. (2) Implementing the approach of "Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models" [23]: hybrid systems translate at the word level while consulting the character components for infrequent words. This approach is very easy and fast to train compared to character-level recurrent networks, and on the other hand it does not produce any special <UNK> token. The hybrid approach improved the BLEU score by +7.9 points for English/Czech translation over models that have no special treatment for unknown words. It is a very interesting approach to train on our English/French and Arabic/French corpora to improve the translation quality.

4.1.2 Long-term projects

There are many challenges related to neural machine translation. NMT is a recent advance with promising results. Our translation model works with languages (at the sentence, word or character level), translating from a source language to a target one, but the question is: can this model work with other types of input/output? (1) A suitable model for very long inputs (paragraphs, documents, ...). (2) A suitable model for graphs (e.g. the word lattices used by Moses).

A Appendix A

This appendix contains some results for the Arabic–French language pair. During the training on this corpus we faced some problems: as mentioned before, Arabic is a challenge for natural language processing and is hard to train on. The results described below are not very good, and we are still working to improve them.

Results for the Arabic–French Corpus

We applied the same pre-processing steps (tokenizing the text, removing infrequent words, filtering out sentences by length, data selection) on 10 percent, 20 percent and 50 percent of the data-sets. We compared the BLEU scores of the adapted sub-corpora with the results of training before any pre-processing (raw data-sets).

A.0.1 Results for 10 percent of the corpus

Figure A.1: Dev BLEU scores (10 percent of the Arabic–French corpus)

The results in graph A.2 are obtained with approximately 10 percent of the initial corpus (400,000 lines) after 15 epochs. We trained three different sub-corpora (the same data with different pre-processing): 10 percent of the raw corpus, 10 percent tokenized and filtered, and 10 percent selected and shuffled.

Figure A.2: Sample BLEU scores (10 percent of the Arabic–French corpus)

A.0.2 Results for 20 percent of the corpus

Figure A.3: Dev BLEU scores (20 percent of the Arabic–French corpus)

Figures A.3 and A.4 present the BLEU scores for 20 percent of the corpus (800,000 lines) after 12 epochs, for the three different pre-processing variants:

Figure A.4: Sample BLEU scores (20 percent of the Arabic–French corpus)

Table A.1: BLEU scores for PB-SMT vs. Neural MT, for 20 percent of the Arabic–French corpus

MT Model                          | BLEU scores
PB-SMT (Adapted Corpus)           | 11.5
Neural MT (Adapted Corpus)        | 16.37
Neural MT (only tokenized corpus) | 9.55
A.0.3 Results for 50 percent of the corpus

Figures A.5 and A.6 display the BLEU score results for 50 percent of the corpus (2,000,000 lines) after 8 epochs:

Figure A.5: Dev BLEU scores (50 percent of the Arabic–French corpus)

Figure A.6: Sample BLEU scores (50 percent of the Arabic–French corpus)

B Appendix B

Parameters:

• One epoch = one forward pass and one backward pass over all the training examples.
• Batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space is needed.
• Number of iterations = number of passes, each pass using [batch size] examples. To be clear, one pass = one forward pass + one backward pass (the forward pass and the backward pass are not counted as two different passes).
• Batch size = 80 (default value).
• Sort k batch = 12 (default value): this many batches are read ahead and sorted.
• CNMeM = 512 MB, the start size of the GPU memory.
• Maximum number of updates = 1 000 000.
• Show 2 samples at each sampling.
• Source and target vocabulary sizes (including bos, eos, unk tokens) = 30 000.
• Beam size = 12.
• Start BLEU validation after 80 000 updates (default value; in our experiments we changed it to 2 000).
• BLEU validation every 5 000 updates.
• BLEU script: "multi-bleu.perl" (Moses multi-perl).
• Sample size = 1 500 lines.
• Word embedding dimensionality: 620.
• Multilayer network with a single maxout hidden layer.
• The training algorithm uses SGD with Adadelta (an algorithm that adapts the learning rate of each parameter; decay rate = 0.95, epsilon = 1e-06).
• GPU: Tesla K40m.

Definitions:

Some basic definitions(1):

• Activation function: to allow neural networks to learn complex decision boundaries, we apply a non-linear activation function to some of their layers. Commonly used functions include sigmoid, tanh, ReLU (Rectified Linear Unit) and variants of these.
• Adadelta: a gradient-descent-based learning algorithm that adapts the learning rate per parameter over time. It was proposed as an improvement over Adagrad, which is more sensitive to hyper-parameters and may decrease the learning rate too aggressively. Adadelta is similar to RMSProp and can be used instead of vanilla SGD.
• Dropout: a regularization technique for neural networks that prevents overfitting. It prevents neurons from co-adapting by randomly setting a fraction of them to 0 at each training iteration. Dropout can be interpreted in various ways, such as randomly sampling from an exponential number of different networks. Dropout layers first gained popularity through their use in CNNs, but have since been applied to other layers, including input embeddings and recurrent networks.
• Softmax: the softmax function is typically used to convert a vector of raw scores into class probabilities at the output layer of a neural network used for classification. It normalizes the scores by exponentiating and dividing by a normalization constant. If we are dealing with a large number of classes, a large vocabulary in machine translation for example, the normalization constant is expensive to compute. There exist various alternatives to make the computation more efficient, including hierarchical softmax or a sampling-based loss such as NCE.
• Theano: a Python library that allows you to define, optimize and evaluate mathematical expressions. It contains many building blocks for deep neural networks. Theano is a low-level library similar to Tensorflow.
Bibliography

[1] Amjad Almahairi, Kyunghyun Cho, Nizar Habash, and Aaron Courville. First result on Arabic neural machine translation. arXiv preprint arXiv:1606.02680, 2016.
[2] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 355–362. Association for Computational Linguistics, 2011.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[4] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, volume 29, pages 65–72, 2005.
[5] Kyunghyun Cho. Natural Language Understanding with Distributed Representation. Technical report, New York University, 2015. Lecture note for DS-GA 3001.
[6] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. 2014.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[8] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011.
[10] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M. Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. In ACL (1), pages 1370–1380, 2014.
[11] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.
[12] Alex Graves. Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 5–13. Springer Berlin Heidelberg, 2012.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342–2350, 2015.
[15] Paul J. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 1550–1560, 1990.
[16] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
[17] Philipp Koehn and Barry Haddow. Towards effective use of training data in statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 317–321. Association for Computational Linguistics, 2012.
[18] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation.
In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54, 2003.
[19] Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284–293, 2011.
[20] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.
[21] Minh-Thang Luong, Michael Kayser, and Christopher D. Manning. Deep neural language models for machine translation. In Proceedings of CoNLL 2015, page 305, 2015.
[22] Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domains. 2015.
[23] Minh-Thang Luong and Christopher D. Manning. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788, 2016.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[25] Minh-Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, 2014.
[26] Robert C. Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224. Association for Computational Linguistics, 2010.
[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
[28] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
[29] Suman Ravuri and Andreas Stolcke. A comparative study of recurrent neural network models for lexical domain classification. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6075–6079. IEEE, 2016.
[30] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[31] Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays. Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947, 2015.
[32] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, 2006.
[33] Richard Socher. Deep NLP: Recurrent Neural Networks. Computer Science Department, Stanford University, August 2015. Available at http://videolectures.net/deeplearning2015_socher_deep_nlp.
[34] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[35] Jörg Tiedemann. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria, 2009.
[36] Jörg Tiedemann.
Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).
[37] Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao. Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168, 2015.
[38] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. Recurrent neural networks for language understanding. In INTERSPEECH, pages 2524–2528, 2013.
[39] Keiji Yasuda, Ruiqiang Zhang, Hirofumi Yamamoto, and Eiichiro Sumita. Method of selecting training data to build a compact and efficient translation model. In IJCNLP, pages 655–660, 2008.