Master of Science in Informatics at Grenoble
option Artificial intelligence and the web
Data Selection for trainable Neural
Machine Translation Models
Wejdene Abdelmoula
24 June 2016
Research project performed at Laboratory of Informatics of Grenoble
Under the supervision of:
Pr. Laurent Besacier, University Grenoble Alpes
Dr. Christophe Servan, University Grenoble Alpes
Defended before a jury composed of:
Dr. Marc Dymetman, Xerox Research Centre Europe
Pr. Massih Reza Amini, University Grenoble Alpes
Pr. Nadia Brauner, University Grenoble Alpes
Pr. Jean Claude Fernandez, University Grenoble Alpes
Pr. Jérôme Euzenat, University Grenoble Alpes
Pr. Eric Gaussier, University Grenoble Alpes
June
2016
Abstract
Neural Machine Translation (NMT) is a new approach to translating text from
one language into another. The core of NMT is a single deep neural network
with millions of neurons that learns to map source sentences to target sentences.
Despite being relatively recent, NMT has already shown promising results on various translation tasks. As our contribution, we implemented an NMT system with an attention
mechanism (a bi-directional Recurrent Neural Network (RNN) encoder and a simple recurrent decoder). In this thesis, we describe the basics needed to understand
what RNNs are and what they offer. We experiment with our RNN translation model
using a Python library (Theano). Training a translation model requires text
to learn from; we used Arabic/French and English/French corpora. Moreover, we
used a data selection technique to make the neural models trainable in reasonable
time while keeping acceptable translation quality.
Keywords: Neural Machine Translation (NMT), Attention mechanism,
Recurrent Neural Networks (RNN), Translation Model, Data selection
Contents
Abstract
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Structure of the thesis
2 State of the art
2.1 Recurrent Neural Network
2.2 Recurrent Neural Network for Natural Language Processing
2.3 Neural Machine Translation
3 Experimentation
3.1 Data
3.2 Pre-processing
3.3 Data selection
3.4 Evaluation of machine translation
3.5 Model Architecture
3.6 Results
3.7 Discussion
4 Conclusion and Future work
4.1 Future of Neural Machine Translation
A Appendix A
B Appendix B
Bibliography
List of Figures
2.1 Recurrent Neural Network [33]. Three time-steps are shown
2.2 A one-hot-vector to a continuous-space representation
2.3 LSTM with one memory cell [12]
2.4 Gate recurrent unit [7]
2.5 Recurrent Neural Network: Encoder/Decoder [33], here the German phrase is translated to an English one. The first three time-steps encode the German words into h3 and the last two decode h3 into English words outputs.
2.6 Bidirectional encoder [5]
3.1 Attention Based Encoder/Decoder
3.2 Attention Bloc
3.3 Dev and Sample BLEU scores: 800,000 lines of the (English/French) corpus
3.4 Dev and Sample BLEU scores: 800,000 lines of the random corpus
3.5 Dev BLEU scores with different sub-corpus (En/Fr) size
3.6 Sample BLEU scores with different sub-corpus (En/Fr) size
3.7 Dev BLEU scores with different sub-corpus (En/Fr) size at iteration 140 k
3.8 Perplexity score for different adapted sub-corpus
A.1 Dev BLEU scores (10 percent of the Arabic–French corpus)
A.2 Sample BLEU scores (10 percent of the Arabic–French corpus)
A.3 Dev BLEU scores (20 percent of the Arabic–French corpus)
A.4 Sample BLEU scores (20 percent of the Arabic–French corpus)
A.5 Dev BLEU scores (50 percent of the Arabic–French corpus)
A.6 Sample BLEU scores (50 percent of the Arabic–French corpus)
List of Tables
3.1 Bilingual corpus(Ar–Fr) size
3.2 Corpus(DEV/TEST) size
3.3 Bilingual corpus(En–Fr) size
3.4 En–Fr Dev Corpus
3.5 Both (Ar/Fr, En/Fr) corpus size after applying all pre-processing steps
3.6 BLEU scores for PB-SMT vs. Neural MT. The BLEU scores in parentheses are not final results, the model is still training
A.1 BLEU scores for PB-SMT vs. Neural MT, for 20 percent of Arabic–French corpus
1 Introduction
1.1 Motivation
Deep learning systems are inspired by the cyclical connectivity of neurons in the brain. Although neural network architectures in machine learning are occasionally used to
understand brain function, they are not designed to be realistic models of biological function.
Recently, the popularity and usefulness of deep learning have grown quickly, thanks to
new and large data sets, increased computational power, and various techniques for training deeper
neural networks. In particular, neural networks are powerful learning models that achieve state-of-the-art results in a wide range of supervised and unsupervised machine learning tasks.
Traditional translation models are quite complex: they consist of numerous machine learning algorithms applied to different stages of the language translation pipeline. In this thesis,
we focus on Recurrent Neural Networks (RNNs) as a replacement for traditional translation
modules, an approach called "neural machine translation".
Neural Machine Translation (NMT) is a recently proposed approach to machine translation,
a new way of teaching machines to translate using deep neural networks designed
to maximize translation performance. NMT has achieved state-of-the-art results in
translation tasks for various language pairs. NMT systems are also much easier to build and train than
traditional MT systems.
Machine translation usually deals with variable-length input and output sentences. In other words, the lengths of the source and the target are not fixed. RNNs, with their capability of
capturing the dynamics of sequences via cycles in the network of nodes, are among the best
classes of artificial neural network architectures for the translation task.
RNNs are capable of representing the history. In a feed-forward neural network, by contrast, whenever a
single sample is fed into the network, the activation of the hidden units is computed from scratch
and is not influenced by the state computed from the previous sample.
A recent trend in deep learning is the attention mechanism. Attention mechanisms in neural
networks are based on the visual attention mechanism found in humans. Humans usually focus on a certain region of an image or on certain words of a text to understand it well, and then
analyse the whole. The human brain can distinguish the most important parts from the less
important ones. Human visual attention is well studied and there exist different models inspired
by it. Recently, attention mechanisms have made their way into the recurrent neural network architectures that are typically used in Natural Language Processing (NLP). In this thesis, we
describe how we have pushed the limits of NMT, making it applicable to a wide variety of
languages with state-of-the-art performance. The initial research goals of this thesis are (1) to
understand and implement a neural machine translation system for two language pairs, Arabic-French and English-French, and (2) since NMT training on very large corpora takes too long, to propose data
selection (filtering) algorithms that make NMT trainable and efficient.
1.2 Structure of the thesis
The next chapter introduces recurrent neural networks and their architecture, and defines our
approach to neural machine translation.
Then, in chapter 3, we present the data sets used in the thesis. The training algorithm
and the architecture of our model are described in detail. We describe how an attention mechanism can be incorporated into the simple encoder-decoder model,
and we introduce simple and advanced language modeling techniques and a training
data selection approach. Finally, we focus on the results obtained after applying neural machine translation to different language pairs (Arabic/French, English/French) and discuss
the computational limitations of the model.
2 State of the art
Introduction
This chapter provides some background on Recurrent Neural Networks (RNN) and neural
machine translation (NMT) that will make this report relatively self-contained.
2.1 Recurrent Neural Network
RNN [30] is a class of artificial neural networks that has recently been used in various tasks
such as speech recognition [11] [31], learning word embeddings [29] and language modeling
[38]. RNNs have shown very promising results in a number of natural language processing tasks.
2.1.1 What are RNN?
The power of the RNN is its capability of representing the whole history. An RNN can handle sequential data, with an output that depends on the previous computations. In fact, the hidden state
of the RNN represents all the previous history, which provides a form of memory. In a traditional “feed-forward” neural network, by contrast, the inputs (and likewise the outputs) are independent of each other
and the history is limited to the last few computations.
In equation 2.1, $y_t$ is the estimated output probability distribution over the
vocabulary at time step $t$. $h_t$ is the hidden state at time step $t$; it represents the “memory” of the
RNN. $h_t$ is calculated from the previous hidden state and the input at the current step, where
$W^{(hh)}$, $W^{(hx)}$ and $W^{(s)}$ are respectively the hidden-to-hidden weight matrix used to condition the
output of the previous time step, the input-to-hidden weight matrix used to condition the input
vector, and the hidden-to-output weight matrix.
\[ y_t = \mathrm{softmax}(W^{(s)} h_t) \tag{2.1} \]
Figure 2.1 shows the RNN architecture, where each rectangular box is a hidden layer
holding a number of neurons. $x_t$ is the input at time step $t$; it could be a one-hot
vector over a vocabulary of $n$ words.
The figure shows an RNN unrolled over three time steps, with the weights tied across time steps (a fixed set of
weights $W$). Keeping the hidden state $h_t$ around allows each output to be conditioned on the entire sequence of preceding words.
At each time step, we thus have an input $x_t$ (a new word) and
we want to predict the output $y_t$ (the next word).
Figure 2.1: Recurrent Neural Network [33]. Three time-steps are shown
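The recurrence described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the Theano implementation used in our experiments: the sigmoid hidden update, the helper names (`matvec`, `softmax`, `rnn_step`) and the toy weight values are assumptions made for the example; only the output layer follows equation 2.1 directly.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, W_hh, W_hx, W_s):
    """One time step: new hidden state, then output distribution (eq. 2.1)."""
    # Assumed hidden update: h_t = sigmoid(W_hh h_{t-1} + W_hx x_t)
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_hx, x_t))]
    h_t = [1.0 / (1.0 + math.exp(-p)) for p in pre]
    y_t = softmax(matvec(W_s, h_t))
    return h_t, y_t

# Toy setting (hypothetical values): vocabulary of 3 words, 2 hidden units.
W_hh = [[0.5, -0.1], [0.2, 0.3]]
W_hx = [[0.1, 0.4, -0.3], [-0.2, 0.1, 0.5]]
W_s = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]

h = [0.0, 0.0]
outputs = []
for word_id in [0, 2, 1]:  # three time steps, same weights at every step
    x = [1.0 if i == word_id else 0.0 for i in range(3)]  # one-hot input
    h, y = rnn_step(x, h, W_hh, W_hx, W_s)
    outputs.append(y)
```

The same three weight matrices are reused at every step, which is what "tying the weights" means; the only thing that changes from one step to the next is the hidden state `h`.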
2.1.2 What can RNNs do?
Neural networks can be used in various natural language processing tasks [9], including part-of-speech tagging [37], named entity recognition [20], etc. In particular, RNNs have proven effective in this field thanks to their capability of memorizing and their ability to receive
feedback from past experience, contrary to Convolutional Neural Networks and feed-forward
neural networks, which have no notion of time or experience. For many tasks, traditional neural networks are not sufficient. For instance, if we want to predict the next word in
a sentence, it helps to know which words came before it in order to make a good
prediction.
2.1.3 Back propagation through time
Both feed-forward and recurrent neural network architectures can be trained
by stochastic gradient descent using the well-known back-propagation algorithm. Neural networks usually
use back-propagation with gradient descent to update the weights and minimize
the cost function. For RNNs, however, back-propagation through time (BPTT) [15] is much
more efficient. BPTT propagates the error gradients
back in time through the recurrent weights, which makes it better suited to
RNNs than ordinary back-propagation (BP).
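To make the idea concrete, the following minimal sketch applies BPTT to a scalar RNN. It is an illustration under simplifying assumptions, not the training code of our model: the network has a single hidden unit with a tanh activation, the loss is a squared error on the last hidden state only, and the names `forward`, `loss` and `bptt` as well as the numeric values are hypothetical.

```python
import math

def forward(w, u, xs, h0=0.0):
    """Run h_t = tanh(w * h_{t-1} + u * x_t) and return all hidden states."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    """Squared error between the last hidden state and a target value."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    """Gradients of the loss w.r.t. w and u, propagated back through time."""
    hs = forward(w, u, xs)
    dw = du = 0.0
    dh = hs[-1] - target                     # dL/dh_T
    for t in range(len(xs), 0, -1):          # walk backwards over time steps
        dpre = dh * (1.0 - hs[t] ** 2)       # through the tanh non-linearity
        dw += dpre * hs[t - 1]               # the shared weight w is used at every step
        du += dpre * xs[t - 1]
        dh = dpre * w                        # gradient flowing to h_{t-1}
    return dw, du

xs = [1.0, -0.5, 0.25]
w, u, target = 0.8, 0.5, 0.3
dw, du = bptt(w, u, xs, target)
```

Because the same weight `w` is used at every time step, its gradient is the sum of the contributions collected at each step of the backward pass. The repeated multiplication by `w` and by the tanh derivative in the last line of the loop is also where vanishing or exploding gradients come from.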
2.2 Recurrent Neural Network for Natural Language Processing
2.2.1 Recurrent neural network language model
A language model (LM) allows us to predict the probability of the next word given
an input window of $(n-1)$ words. This traditional language model calculates the probability
using equation 2.2.
\[ P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) \tag{2.2} \]
We can say that the goal of the statistical LM is to predict the next word in a specific
context; more precisely, it measures how likely a sentence is.
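As an illustration of equation 2.2 with $n = 2$ (a bigram model), the sketch below estimates the conditional probabilities by relative frequency on a toy corpus. The corpus, the sentence boundary markers `<s>` and `</s>`, the absence of smoothing and the function names are assumptions made for the example, not part of our experimental setup.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token that serves as a context
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams):
    """Product over i of P(w_i | w_{i-1}), estimated by relative frequency."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0                                # unseen context, no smoothing
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

corpus = ["the cat sleeps", "the dog sleeps", "the cat eats"]
uni, bi = train_bigram(corpus)
p_seen = sentence_prob("the cat sleeps", uni, bi)
p_unseen = sentence_prob("the dog eats", uni, bi)
```

On this toy corpus the sentence "the cat sleeps" gets probability 1 × 2/3 × 1/2 × 1 = 1/3, while "the dog eats" gets probability 0 because the bigram "dog eats" was never observed. This sparsity is one motivation for the continuous representations used by the RNN language model.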
RNNs allow us to condition the probability of each output on the entire sequence of
preceding words by keeping a hidden state around. At each time step, the hidden state $h_t$ receives a new
word $x_t$, and we want to predict the output $y_t$, which is the next word $x_{t+1}$.
The RNN LM, in contrast, models word probabilities using continuous vector representations.
The input of this language modeling function should represent every word in the
vocabulary so that it is equidistant from all the others. The 1-of-k encoding is such a representation:
each word in the vocabulary is represented as a binary vector $w_i$ whose elements sum to one.
Concretely, the $i$-th element of the vector is set to one and all other elements are set to zero. This
vector is called a one-hot vector, so the input to our function is a set of $(n-1)$ one-hot vectors
$(w_1, w_2, \ldots, w_{n-1})$.
Since we are using a neural network, these vectors are multiplied by a weight matrix $E$ to
produce a sequence of continuous vectors $(s_1, s_2, \ldots, s_{n-1})$, where $s_j = E^{T} w_j$. Because $w_j$ is a
one-hot vector, this multiplication simply selects the row of $E$ corresponding to the non-zero element of $w_j$:
the weights of that row are multiplied by 1 and all the other rows are multiplied by zero. The vectors $s_j$ are called
word embeddings, and their concatenation $s = [s_1; s_2; \ldots; s_{n-1}]$ is called the context vector.
A transformation layer, such as a non-linear function $f$ followed by
a softmax function, is then applied to the context vector to obtain probabilities. The whole function is a neural language model.
Figure 2.2 shows the steps from a word to a one-hot vector and then to a continuous-space
representation.
Figure 2.2: A one-hot-vector to a continuous-space representation
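The step from a one-hot vector to a continuous representation can be checked with a short sketch. The embedding matrix `E` below (one row per vocabulary word, two dimensions) contains arbitrary illustrative values, and the helper names are hypothetical.

```python
def one_hot(index, size):
    """1-of-k encoding: a vector of zeros with a single one at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(E, w):
    """Compute s = E^T w, where E has one row per vocabulary word."""
    dim = len(E[0])
    return [sum(E[i][d] * w[i] for i in range(len(E))) for d in range(dim)]

# Toy embedding matrix: 3 words in the vocabulary, 2-dimensional embeddings.
E = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

s = embed(E, one_hot(1, 3))          # selects the row of E for word 1

# Context vector: concatenation of the embeddings of the (n-1) previous words.
context = []
for idx in [0, 2]:
    context.extend(embed(E, one_hot(idx, 3)))
```

Since the product only selects one row, practical implementations usually replace the matrix multiplication by a direct row lookup, which gives the same result at a much lower cost.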
We use the same weights at all time steps, and everything else is the same: the inputs $x_1, \ldots, x_t$
are presented as one-hot vectors $w_i$, and $h_0$ is some initialization vector for the hidden layer
at time step 0. The hidden state is updated by a recurrent function $f$:
\[ h_t = f(h_{t-1}, s_t) \tag{2.3} \]
The softmax output is $\hat{y}_t = \mathrm{softmax}(W^{(s)} h_t)$, where $\hat{y}_t$ is a probability distribution over the vocabulary:
\[ P(x_{t+1} = v_j \mid x_t, \ldots, x_1) = \hat{y}_{t,j} \tag{2.4} \]
Finally, we use the cross-entropy loss function (predicting words instead of classes):
\[ J_t(\theta) = - \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j} \tag{2.5} \]
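A small numerical sketch of equations 2.4 and 2.5 follows. The scores and the three-word vocabulary are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def softmax(z):
    """Turn a list of scores into a probability distribution."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_hat):
    """J = - sum_j y_j log(y_hat_j); terms with y_j = 0 contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_hat) if y > 0.0)

scores = [2.0, 1.0, 0.1]        # W^(s) h_t for a 3-word vocabulary (toy values)
y_hat = softmax(scores)         # predicted distribution over the next word
y_true = [1.0, 0.0, 0.0]        # one-hot vector of the word actually observed
J = cross_entropy(y_true, y_hat)
```

Because the reference distribution is one-hot, the sum in equation 2.5 reduces to the negative log-probability that the model assigns to the correct next word.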
2.2.2 Long Short Term Memory and Gated Recurrent Unit
2.2.2.1 Long Short Term Memory
Long Short Term Memory (LSTM) [13] is a complex activation unit consisting of recurrently connected
memory blocks, as shown in Figure 2.3:
Figure 2.3: LSTM with one memory cell [12]
Each LSTM unit consists of:
• Forget gate ;
• A memory cell ;
• Input gate ;
• Output gate.
The forget gate layer is a sigmoid layer that decides which information will be forgotten
(thrown away). It takes the past hidden state $h_{t-1}$ and the input word $x_t$ as input, and outputs a
number between 0 (completely get rid of this) and 1 (completely keep this) for each number in
the cell state $c_{t-1}$.
\[ f_t = \sigma(W_f x_t + U_f h_{t-1}) \tag{2.6} \]
A vector of new candidate values $\tilde{c}_t$ that could be added to the memory state is then created.
The previous state, weighted by the forget gate, and the candidate values, weighted by the input gate, are combined to update the memory cell of the $j$-th LSTM unit at time step $t$:
\[ c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j \tag{2.7} \]
where the new content is:
\[ \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1}) \tag{2.8} \]
Another important step is deciding which new information will be stored in the cell.
The input gate layer decides which values will be updated:
\[ i_t = \sigma(W_i x_t + U_i h_{t-1}) \tag{2.9} \]
Finally, the output gate controls how much information is output from the memory
state:
\[ o_t = \sigma(W_o x_t + U_o h_{t-1}) \tag{2.10} \]
These cells are connected together and replace the traditional hidden units of the recurrent neural network, allowing the network to forget, memorize and expose the memory content. All the gating
units have a sigmoid non-linearity, while the input unit can have any squashing non-linearity.
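Equations 2.6 to 2.10 can be put together in a single step function. The sketch below uses one scalar LSTM unit with a scalar input to keep the arithmetic readable. The final line, $h_t = o_t \tanh(c_t)$, is the usual way of exposing the memory content and is an assumption here, since it is not written out above; the parameter names in the dictionary `p` are also hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM unit; p maps parameter names to weights."""
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev)        # forget gate, eq. 2.6
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev)        # input gate, eq. 2.9
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev)        # output gate, eq. 2.10
    c_tilde = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev)  # new content, eq. 2.8
    c_t = f_t * c_prev + i_t * c_tilde                       # memory update, eq. 2.7
    h_t = o_t * math.tanh(c_t)                               # assumed output equation
    return h_t, c_t

# With all weights at zero, every gate equals 0.5 and the candidate is 0,
# so the cell keeps exactly half of its previous content.
zero_params = {k: 0.0 for k in
               ["W_f", "U_f", "W_i", "U_i", "W_o", "U_o", "W_c", "U_c"]}
h, c = lstm_step(1.0, 0.0, 2.0, zero_params)
```

The additive form of the memory update is the important design choice: when the forget gate stays close to 1, the cell content is carried over almost unchanged from step to step.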
2.2.2.2 Gate recurrent unit
The gated recurrent unit (GRU) is a new type of hidden unit proposed in [7]. The GRU is inspired
by the LSTM unit but is much simpler to use and to implement. It adaptively remembers and
forgets its state based on the input signal to the unit. A GRU consists of a reset gate, an update
gate and a new memory content. Unlike in the LSTM, a single gating unit (the update gate) controls both the forgetting
and the decision to update.
Figure 2.4 shows the detailed internals of a GRU; it is followed by a mathematical description of
the GRU’s four fundamental stages.
The reset gate decides how much information from the previous memory state is used
to compute the new memory. If the value of the reset gate is close to zero, the
previous memory is ignored and only the new word information is stored.
\[ r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{2.11} \]
The update gate controls how much information from the previous memory state is carried over.
Following equation 2.13, if $z_t^j = 0$, the previous hidden state $h_{t-1}^j$ is copied and the new memory $\tilde{h}_t^j$ is ignored.
Conversely, if $z_t^j = 1$, the new memory $\tilde{h}_t^j$ is forwarded to the next hidden state.
\[ z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{2.12} \]
Figure 2.4: Gate recurrent unit [7]
In equation 2.13, the hidden state $h_t$ is finally computed from the new memory, the update
gate and the previous hidden state. The new memory (equation 2.14) combines the
input word $x_t$ and the previous hidden state $h_{t-1}$.
\[ h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j \tag{2.13} \]
\[ \tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1}) \tag{2.14} \]
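The four stages can be sketched for a single scalar GRU unit as follows. The code follows equations 2.11 to 2.14 as written, so that an update gate close to 0 keeps the previous state; the scalar setting, the parameter names and the numeric values are assumptions made for the example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One step of a scalar GRU unit; p maps parameter names to weights."""
    r_t = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev)             # reset gate, eq. 2.11
    z_t = sigmoid(p["W_z"] * x_t + p["U_z"] * h_prev)             # update gate, eq. 2.12
    h_tilde = math.tanh(p["W"] * x_t + r_t * (p["U"] * h_prev))   # new memory, eq. 2.14
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # hidden state, eq. 2.13

names = ["W_r", "U_r", "W_z", "U_z", "W", "U"]

# All weights at zero: both gates equal 0.5 and the new memory is 0,
# so the unit keeps half of its previous state.
zero_params = {k: 0.0 for k in names}
h_half = gru_step(1.0, 0.8, zero_params)

# A strongly negative update-gate weight closes the gate (z_t close to 0):
# the previous hidden state is copied almost unchanged.
closed = dict(zero_params, W_z=-50.0)
h_copy = gru_step(1.0, 0.8, closed)
```

Compared with the LSTM, there is no separate memory cell and no output gate: the hidden state itself plays the role of the memory, which is why the unit has fewer parameters.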
2.2.2.3 LSTM Vs. GRU
RNNs propagate weight matrices from one time step to the next. The goal of an RNN
implementation is to propagate context information through distant time steps, and this
makes RNNs hard to train. The basic problem is that gradients propagated over
many stages tend to either vanish (most of the time: the gradient values gradually shrink
towards zero as they propagate) or explode (rarely: the gradient values grow extremely
large and cause an overflow, which severely damages the optimization). These two problems, which may
arise during training, are called the vanishing gradient and the exploding
gradient respectively [28].
To solve these problems, we must use RNN units that allow the network to accumulate
information over a long duration. However, once that information has been used, it might be
useful for the neural network to forget the old state. For example, if a sequence is made of
sub-sequences and we want the unit to accumulate evidence inside each sub-sequence, we need
a mechanism to forget the old state by setting it to zero. Instead of manually deciding when to
clear the state, we want the neural network to learn when to do it. This is what gated
RNNs do.
GRUs are quite new (proposed in [7] in 2014), and their trade-offs have not been fully explored yet. According to the empirical evaluations in [14] and [8], there is no clear winner: in
many tasks both architectures yield almost the same performance. A GRU has fewer
parameters, so it may train a bit faster and need less data to generalize. On the other
hand, with enough data, the LSTM may give better results, since its larger number of parameters
gives it greater expressive power.
2.3 Neural Machine Translation
2.3.1 Statistical Machine Translation
The name machine translation reflects the fact that we want a system to translate a text
(source sentence) into the corresponding text (target sentence). There are multiple ways to build
such a machine. The statistical approach to machine translation (SMT) allows us to translate a sentence
from a source language into a target language automatically. SMT has many
variants, depending on the data sets, the needs and the translation model. The most
common SMT approach is the Phrase-Based approach (PB-SMT) proposed by [18].
\[ \hat{t} = \arg\max_{t} P(t \mid s) = \arg\max_{t} P(s \mid t) \cdot P(t) \tag{2.15} \]
In equation 2.15, $P(t \mid s)$ cannot be estimated directly, where $s$ and $t$ are respectively the source and
the target utterance. Thanks to Bayes’ rule, however, the equation can be transformed. In the PB-SMT
approach, we estimate $P(s \mid t) \cdot P(t)$, in which $P(t)$ is a target language model and
$P(s \mid t)$ is a translation model.
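The decision rule can be illustrated on a toy example: among a few candidate translations, we keep the one that maximizes the product of the translation model score and the language model score. All probabilities below are invented for the illustration, and a real PB-SMT decoder searches a much larger space than a fixed candidate list; the function name `best_translation` is hypothetical.

```python
def best_translation(source, candidates, tm, lm):
    """Return the candidate t that maximizes P(s | t) * P(t)."""
    return max(candidates,
               key=lambda t: tm.get((source, t), 0.0) * lm.get(t, 0.0))

# Toy translation model P(s | t): keys are (source, target) pairs.
tm = {
    ("the cat", "le chat"): 0.6,
    ("the cat", "chat le"): 0.7,
    ("the cat", "le chien"): 0.05,
}
# Toy target language model P(t): fluent word orders get higher probability.
lm = {"le chat": 0.02, "chat le": 0.0001, "le chien": 0.02}

best = best_translation("the cat", list(lm), tm, lm)
```

Here "chat le" has the highest translation model score, but the language model penalizes its word order, so the product favours "le chat". This division of labour between adequacy (translation model) and fluency (language model) is what the factorization provides.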
SMT has limitations as well. Pre-processing the alignments to extract bi-phrases is a very
complex and hard task. Language pairs that have different word orders are particularly tricky.
NMT uses recurrent neural networks to build a system that learns to map a whole sentence from
source to target. This addresses the word-ordering problem, because the system learns whole sentences at
once.
2.3.2 Neural Machine Translation
Neural machine translation (NMT) is a new approach to machine translation [16], [34], [3]. The
goal of NMT is to provide a fully trainable model that maximizes translation performance.
The NMT model starts from a representation of a source sentence and ends by producing a
representation of a target sentence.
Most of the proposed neural machine translation approaches are based on the standard RNN
encoder/decoder. A neural network encoder reads the source sentence and encodes it into a fixed-length
vector, and the decoder outputs a translated sentence from this fixed-length vector.
Cho et al. [7] developed a traditional RNN encoder/decoder based
on GRU units. In this approach, studied by Cho in [6], the translation score decreases as the
length of the sentence increases. This is because the model has to encode all the information of the source
sentence into a single fixed-length vector. They also mention that the vocabulary size has
a high impact on the translation score.
Luong et al. demonstrate in [21] that deep NMT models with 3 or 4 layers perform better than
those with 1 or 2 layers in terms of perplexity and translation score. They implement a
machine translation system created by Devlin [10], which uses a feed-forward neural network
as a translation model, predicting one word of the target phrase at a time.
In [34], the authors used a neural network with 4 layers of LSTM units. They also used a
trick: they reversed the input sequence, which enabled them to achieve a BLEU score of
37. This works better in practice, but it is not a principled solution. Most translation
benchmarks involve languages like French and Spanish, translated to or from
English. However, there are languages (like Japanese or German) in which the last word of a sentence
can be highly predictive of the first word of its English translation. In that case, reversing the
input would make things worse.
The aim of [22], by Luong and Manning, is to explore the effectiveness of NMT on spoken
language. They used the global and local attention proposed in [24].
2.3.2.1 Standard RNN encoder/decoder
The RNN encoder/decoder [7] is based on two RNNs that act as an encoder and decoder
pair. This type of architecture (Figure 2.5) is at the core of deep learning.
Figure 2.5: Recurrent Neural Network: Encoder/Decoder [33], here the German phrase is
translated to an English one. The first three time-steps encode the German words into h3 and
the last two decode h3 into English words outputs.
The encoder is a recurrent neural network that maps a sequence to a vector:
it encodes the input sentence $X$ into a context vector $c$, which is a continuous vector.
Equation 2.16 computes the hidden layer output at each time step $t$, where $f$ could be
an LSTM or a GRU.
\[ h_t = f(x_t, h_{t-1}) \tag{2.16} \]
Equation 2.17 gives the vector $c$ generated from the sequence of hidden states ($q$ is a
non-linear function).
\[ c = q(h_1, \ldots, h_t) \tag{2.17} \]
The recurrent activation function is applied recursively over the input sequence until the
end, when the final internal state $h_t$ of the RNN is the summary of the whole input sentence.
The decoder generates the output from the vector $c$. It is also an RNN, a simple translation model (TM) conditioned on the source sentence $X$. The RNN-TM is used to predict the
next word $y_t$ given the context $c$ and all the previous words $(y_1, \ldots, y_{t-1})$. The decoder defines this
probability as follows, where $g$ is a non-linear function.
\[ P(y_t \mid y_1, \ldots, y_{t-1}, c) = g(y_{t-1}, h_t, c) \tag{2.18} \]
2.3.2.2 Can we do better than the simple encoder-decoder based model?
To overcome the vocabulary size shortcoming, Luong et al. [25] propose a "copy" mechanism
for <unk>. It is a simple and effective approach that can treat any NMT system as a black box: occurrences of the target <unk> are annotated with positional information to track their alignments. Moreover, Luong recently proposed a character-level translation approach [23] that addresses the language complexity
problem.
The main idea of the work proposed by Bahdanau [3] for solving the sentence length problem is the
attention mechanism. When humans translate a long sentence, they usually first read the
whole paragraph to understand its meaning and get an idea of the context, and then
translate the sentence or paragraph word by word. This is the concept behind the attention approach.
2.3.2.3 Learning to Align in Machine Translation
NMT by jointly learning to align and translate [3] is based on an encoder/decoder architecture extended with an attention mechanism, which allows the model, for each new word, to focus on part
of the original sentence. The traditional RNN encoder/decoder produces just a single
hidden state that summarizes the whole sentence. In this NMT model, by contrast, the encoder is a
bidirectional recurrent neural network (Figure 2.6) composed of a forward RNN and a separate
backward RNN. The forward RNN reads the sentence from the first word to the last, and the backward RNN
reads it from the last word to the first; in this way, each annotation contains information about
the window of words around it.
In addition, the authors turned the context vector ($c$) into a variable-length representation, from which the
decoder generates a translation, instead of a fixed-size vector. Using a variable-length representation improves the performance of encoder/decoder models by solving the sentence
length problem.
Each time the decoder RNN is trained to predict the next word, as in a traditional decoder, it
computes a softmax over the encoder hidden states to determine which of them to take as input. The probability is conditioned on a distinct context vector $c_i$, which depends on a sequence of annotations that contain
information about the whole input sequence, with a strong focus on the parts surrounding the
$i$-th word of the input. Basically, the context vector is computed as a weighted sum of annotations. We
will give a detailed description of the model in the next chapter (section 3.5).
Figure 2.6: Bidirectional encoder [5]
This approach uses an alignment model as part of the decoding process. This alignment
between the target words and the source words is very useful, especially for dealing with the problem of
unknown words, which has an important influence on translation performance.
The work of Bahdanau et al. [3] is an extension of the basic encoder/decoder; this new
approach appears to improve the translation quality of long sentences. However, this does not
mean that their approach is perfect: it still suffers from the language complexity problem and
requires computing the annotation weight for every word in the source sentence.
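The computation of a context vector as a weighted sum of annotations can be sketched as follows. For brevity, the alignment score is a simple dot product between the decoder state and each annotation; this scoring function is an assumption made for the example (the model of [3] learns its alignment model as a small neural network), and the toy vectors and function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, annotations):
    """Return the attention weights and the context vector for one target word."""
    # Assumed alignment score: dot product between decoder state and annotation.
    scores = [sum(s * h for s, h in zip(decoder_state, h_j))
              for h_j in annotations]
    weights = softmax(scores)                     # one weight per source position
    dim = len(annotations[0])
    context = [sum(w * h_j[d] for w, h_j in zip(weights, annotations))
               for d in range(dim)]               # weighted sum of annotations
    return weights, context

# Three source positions with 2-dimensional annotations (toy values).
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], annotations)
```

In this example the first and third annotations receive the same, larger weight because they align best with the decoder state. A new set of weights, and therefore a new context vector, is computed for every target word, which is the cost noted above.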
2.3.3 Drawbacks
NMT has already achieved state-of-the-art performance for many languages and offers
many advantages over traditional translation approaches, but it has not solved the following three shortcomings:
• The vocabulary size problem: all work in NMT has used the unk replacement technique
with a restricted vocabulary (a shortlist of the most frequent words); any other word is
mapped to the <UNK> symbol.
• The sentence length problem: translation quality degrades dramatically as the length of
the source sentence increases when the encoder/decoder model size is small. To overcome this problem, the dimensionality of the context vector must be large enough that a
sentence of any length can be compressed [7].
• The language complexity problem: "copying" mechanisms are not sufficient, as they ignore
several properties of languages.
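The shortlist technique mentioned in the first item can be sketched as follows. The toy corpus, the shortlist size and the function names are illustrative assumptions and do not correspond to the vocabulary settings used in our experiments.

```python
from collections import Counter

def build_shortlist(sentences, size):
    """Keep the `size` most frequent words of a whitespace-tokenized corpus."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {word for word, _ in counts.most_common(size)}

def replace_unknown(sentence, shortlist, unk="<UNK>"):
    """Map every word outside the shortlist to the unknown-word symbol."""
    return " ".join(tok if tok in shortlist else unk
                    for tok in sentence.split())

corpus = ["the cat sleeps", "the dog sleeps", "the cat eats"]
shortlist = build_shortlist(corpus, 3)
mapped = replace_unknown("the dog eats", shortlist)
```

With a shortlist of three words, only "the", "cat" and "sleeps" are kept, so "the dog eats" becomes "the <UNK> <UNK>". The example shows how much information is lost when the shortlist is small relative to the vocabulary of the corpus.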
Conclusion:
In this chapter, we introduced recent advances in recurrent neural networks, centered around
neural machine translation. We first introduced the standard encoder/decoder model for machine translation and discussed some weaknesses of this simple model. We then
described a recently proposed approach, the "attention mechanism", that overcomes some
of these shortcomings.
Our main experiments were run using the attention-based mechanism. Our intent is
to obtain good translation quality and to optimize our model as much as we can; so, besides
the choice of this model, we will work on improving the quality of the training data, which
is the subject of the next chapter.
3 Experimentation
Introduction
In this chapter we propose to study the impact of corpus-based domain adaptation techniques
on the neural machine translation approach. It will be compared to a baseline approach, which
is the standard phrase-based SMT approach.
We will introduce the training environment and the data sets used in our experiments. We will then
look at the pre-processing of our training data and the data selection. We will
also describe the evaluation protocol and the recurrent neural network (RNN) model used for
training.
3.1 Data
The evaluation is performed on the Arabic–French and English–French translation tasks. To
train the translation model we need a training set, called a parallel corpus, described in the
next sub-sections.
3.1.1 Arabic–French Corpus
Tables 3.1 and 3.2 summarize the contents of our Arabic–French corpus.
Corpus     number of lines    number of words (Ar/Fr)
new-lc     90 753             3 Millions/2 Millions
opensub    4 381 835          41 Millions/40 Millions
trames     20 539             0,8 Millions/0,8 Millions
wit3       87 732             2 Millions/2 Millions
Total      4 580 859          46,8 Millions/44,8 Millions
Table 3.1: Bilingual corpus(Ar–Fr) size
The Arabic–French training corpus is composed of several corpora. The out-of-domain corpus includes News-commentary [36], provided by WMT for training SMT systems; News-commentary
covers 12 languages. We use 90 753 Arabic lines (3M words) and 90 753 French lines (16M
words). We also use Open-subtitles¹, a parallel corpus of movie subtitles [35]. We use 4M
Arabic lines (41M words) and 4M French lines (196M words). WIT3² is a version, released for research
purposes, of the multilingual transcriptions of the TED talks corpus³, used for the translation task of the
IWSLT evaluation campaign. We used 87 732 Arabic lines (2M words) and 87 732 French lines (15M
words).
Our in-domain corpus is the Trames corpus, made of transcripts from radio and television and used
for the evaluation campaign of the TRAD project⁴. We used 20 539 Arabic lines (0,8M words)
and 20 539 French lines (5M words).
The development corpus (Dev) is used during the training of the model to optimize the
parameters of the translation model. This corpus consists of 795 sentences in each language.
Corpus        DEV
Newspapers    250 lines
Mail C3       155 lines
Mail C4       390 lines
Table 3.2: Corpus (DEV/TEST) size
3.1.2
English–French Corpus
As out-of-domain corpus, we use 2 007 723 lines of the Europarl parallel corpus (Ep). Ep is
extracted from the proceedings of the European Parliament and includes versions in 21 European
languages. In addition, 274 419 lines from the UN corpus are added. It is extracted from the
United Nations website5 and is composed of official records and other parliamentary documents.
The in-domain corpus is the European Medicines Agency (EMEA) corpus, a parallel corpus
made out of PDF documents from the European Medicines Agency. All files are automatically
converted from PDF to plain text using pdf-to-text. EMEA contains 22 languages. The
development corpus is also taken from EMEA.
Tables 3.3 and 3.4 summarize the contents of the English–French corpus.
Corpus    Number of lines    Number of words (En–Fr)
ep7       2 M                55 M/61 M
EMEA      549 K              11 M/13 M
UN        474 K              7 M/7 M
Total     3.6 M              73 M/81 M
Table 3.3: Bilingual corpus (En–Fr) size
1 http://www.opensubtitles.org/
2 wit3: Web Inventory of Transcribed and Translated Talks, website: https://wit3.fbk.eu
3 http://www.ted.com/
4 http://www.trad-campaign.org
5 http://ods.un.org
Corpus      Number of lines    Number of words (En–Fr)
EMEA DEV    1 665              34 948/41 151
Table 3.4: En–Fr Dev Corpus
3.2
Pre-processing
To train our translation model we use the corpora described above.
First, however, some pre-processing is needed to get the data into the right format.
3.2.1
Tokenize Text
The first steps of pre-processing are tokenization and normalization. We applied a tokenization
script (from the open-source machine translation package Moses) to our raw corpus. This
script splits the text into sentences and the sentences into words. It also handles punctuation,
so the sentence "she is beautiful!" is split into 4 tokens: "she", "is", "beautiful", "!".
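As a rough illustration of this step, the following Python sketch splits words from punctuation with a regular expression. It is not the Moses tokenizer, which handles many more cases (abbreviations, language-specific rules, escaping); the helper name `simple_tokenize` is hypothetical.

```python
import re

# Simplified stand-in for the Moses tokenization script (assumption):
# a token is either a run of word characters or a single punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def simple_tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens."""
    return _TOKEN_RE.findall(sentence)


tokens = simple_tokenize("she is beautiful!")
print(tokens)
```

On the example sentence of this section, the sketch yields the same four tokens as described above.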
3.2.2
Remove infrequent words
NMT is limited in its ability to handle a large vocabulary: a huge vocabulary makes the
model very slow to train. Moreover, just as a human needs to have seen a word in different
contexts to understand how to use it appropriately, a model learns little from words that
occur only rarely, and the same holds for SMT approaches.
A usual practice is to construct a target vocabulary of the k most frequent words (called the
shortlist), where k is 30,000 here, as in [3]. Words that are not included in the shortlist are
mapped to a special token ([UNK]) in order to limit the vocabulary size. The token [UNK]
becomes part of the vocabulary and is predicted just like any other word. We also
replaced all the numbers in all the corpora with a special token, <NUMBER>. In addition, every sentence starts with the token <S> and ends with </S>.
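The vocabulary handling described above can be sketched as follows. This is a minimal illustration rather than the scripts actually used: the function names are hypothetical, and the regular expression used to recognise numbers is an assumption.

```python
import re
from collections import Counter

UNK, NUM, BOS, EOS = "[UNK]", "<NUMBER>", "<S>", "</S>"
# Assumed definition of a "number": digits with optional . or , separators.
_NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def replace_numbers(tokens):
    """Replace every numeric token with the <NUMBER> token."""
    return [NUM if _NUMBER_RE.fullmatch(tok) else tok for tok in tokens]


def build_shortlist(sentences, k=30000):
    """Return the set of the k most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(k)}


def encode_sentence(tokens, shortlist):
    """Map out-of-shortlist tokens to [UNK] and add sentence boundary tokens."""
    body = [tok if tok in shortlist else UNK for tok in tokens]
    return [BOS] + body + [EOS]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "in", "1999"]]
corpus = [replace_numbers(sent) for sent in corpus]
shortlist = build_shortlist(corpus, k=2)  # toy value; the thesis uses k = 30,000
print(encode_sentence(corpus[1], shortlist))
```

With a real corpus the shortlist would be built once on the training data and then applied unchanged to the development and test data.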
3.2.3
Filtering out sentences by length
After applying the tokenization script, we filter out long sentences in order to avoid the
long-range association problems of NMT approaches [6]. In our case, sentences longer than
30 words are removed.
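A length filter of this kind can be sketched in a few lines of Python. The text does not say whether the limit is applied to the source side, the target side or both; the sketch assumes both, and also drops empty lines. The function name is hypothetical.

```python
def filter_by_length(pairs, max_len=30):
    """Keep the (source, target) token-list pairs where both sides are
    non-empty and contain at most max_len tokens (assumption: the limit
    applies to both sides of the pair)."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if 0 < len(src) <= max_len and 0 < len(tgt) <= max_len
    ]


pairs = [
    (["a"] * 10, ["b"] * 12),  # kept
    (["a"] * 31, ["b"] * 20),  # source too long, dropped
    (["a"] * 30, ["b"] * 30),  # exactly 30 words, kept
]
print(len(filter_by_length(pairs)))
```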
3.3
Data selection
Many experiments have shown that more training data generally leads to better scores. The
size of the training data clearly affects translation scores, but the relationship is not
linear, and increasing the training data size is not always the best choice [19]. Furthermore,
the data selection approach [2] has shown very good results for statistical machine
translation, both improving translation quality and reducing model size. We therefore use this
technique to adapt our NMT system to a specific domain, all the more so because the
performance of a neural machine translation (NMT) system depends on the quantity and quality
of the available training data.
Data selection allows a better adaptation by measuring the similarity of sentences to the
in-domain data, either the development or the test set. The similarity is measured
with metrics such as the perplexity (PP) and the entropy (H). The entropy (equation 3.2)
is the exponent in the perplexity score (equation 3.1) for a sequence of N words
$W = (w_1, w_2, \ldots, w_N)$.
\[ PP(W) = 2^{H(W)} \tag{3.1} \]
\[ H(W) = -\frac{1}{N} \log_2 P(w_1, w_2, \ldots, w_N) \tag{3.2} \]
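To make the relation between the two quantities concrete, the sketch below computes the entropy and the perplexity of a short sequence. It assumes that the language model factorises P(W) into per-word probabilities (chain rule), so that the log-probability of the sequence is the sum of the per-word log-probabilities; the probabilities used are invented for illustration.

```python
import math


def entropy_bits(word_probs):
    """Per-word entropy H(W) = -(1/N) * log2 P(w_1 ... w_N), where P is the
    product of the per-word probabilities given by a language model."""
    n = len(word_probs)
    log_prob = sum(math.log2(p) for p in word_probs)
    return -log_prob / n


def perplexity(word_probs):
    """Perplexity PP(W) = 2 ** H(W)."""
    return 2.0 ** entropy_bits(word_probs)


# Toy example: a model that gives probability 1/4 to each of 4 words.
probs = [0.25, 0.25, 0.25, 0.25]
print(entropy_bits(probs), perplexity(probs))
```

In this toy case the entropy is 2 bits per word and the perplexity is 4: the model is as uncertain as if it chose uniformly among four words at each position.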
Various approaches have been proposed for data selection [2], [17], [39]. To improve
the quality of our output, we employ the data selection approach proposed by Moore and Lewis
[26]. For each sentence, they use the cross-entropy difference (DCE) between a domain-specific (or in-domain) language model and a non-domain-specific (or out-of-domain) language model.
\[ DCE(S) = H_I(S) - H_N(S) \tag{3.3} \]
Equation 3.3 is used to score the segments, where I is an in-domain data set and N a
non-domain-specific data set. $H_I(S)$ and $H_N(S)$ are respectively the entropy of a sentence S
according to a language model trained on I and the entropy of S according to a language model
trained on N.
The sentences of the corpus are then ranked by this score, lowest first, so that the sentences closest to the in-domain data come at the top.
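The scoring and ranking can be sketched as follows. This is only an illustration under simplifying assumptions: the language models are add-one smoothed unigram models instead of the n-gram models normally used for this purpose, and all names and toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (token counts, total token count) for a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy (bits) under an add-one smoothed unigram model."""
    counts, total = model
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in sentence
    )
    return -log_prob / len(sentence)


def moore_lewis_rank(candidates, in_domain, out_domain):
    """Score each candidate with DCE(S) = H_I(S) - H_N(S), lowest first."""
    vocab = {tok for sent in in_domain + out_domain + candidates for tok in sent}
    lm_in, lm_out = train_unigram(in_domain), train_unigram(out_domain)
    scored = [
        (cross_entropy(s, lm_in, len(vocab)) - cross_entropy(s, lm_out, len(vocab)), s)
        for s in candidates
    ]
    return sorted(scored, key=lambda item: item[0])


in_domain = [["take", "the", "tablet", "daily"], ["the", "dose", "is", "daily"]]
out_domain = [["the", "parliament", "voted"], ["the", "council", "voted", "today"]]
candidates = [["the", "parliament", "voted", "today"], ["take", "the", "dose", "daily"]]

ranked = moore_lewis_rank(candidates, in_domain, out_domain)
for score, sentence in ranked:
    print(round(score, 3), " ".join(sentence))
```

The medical-style sentence receives a negative score and is ranked first; keeping the first lines of the ranked list then gives the selected sub-corpus.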
Table 3.5 shows the size of each corpus (Arabic–French, English–French) after
applying all the pre-processing steps. The pre-processing steps affect the training
data size: every step reduces the corpus size.
Corpus    Number of lines    Number of words
Ar/Fr     4 303 676          40 Millions / 40 Millions
En/Fr     2 000 000          32 Millions / 35 Millions
Table 3.5: Both (Ar/Fr, En/Fr) corpus size after applying all pre-processing steps
On the one hand, data selection tends to favor short sentences, so the sorted corpus usually
starts with short sentences and ends with the longest ones. On the other hand, our training
sample is not drawn at random but is composed of the first 1500 sentences. For these reasons,
after applying the data selection, we decided to shuffle the sentence pairs randomly.
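When shuffling a parallel corpus, the source and target sides must be permuted together so that each sentence stays aligned with its translation. A minimal sketch, with a hypothetical function name and an arbitrary seed:

```python
import random


def shuffle_parallel(src_lines, tgt_lines, seed=1234):
    """Shuffle two aligned lists of sentences with the same permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    if not pairs:
        return [], []
    src, tgt = zip(*pairs)
    return list(src), list(tgt)


src = ["s1", "s2", "s3", "s4", "s5"]
tgt = ["t1", "t2", "t3", "t4", "t5"]
shuffled_src, shuffled_tgt = shuffle_parallel(src, tgt)
print(list(zip(shuffled_src, shuffled_tgt)))
```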
3.4
Evaluation of machine translation
Once a machine translation model is trained, the question is how to evaluate it.
An automatic evaluation metric is needed in order to test and compare different
machine translation models efficiently. We use the most common automatic metric in
translation evaluation tasks: the BLEU score.
BLEU (Bilingual Evaluation Understudy)
The Bilingual Evaluation Understudy (BLEU) score [27] is one of the first and most widely
used automatic evaluation metrics for assessing the quality of translations. This algorithm
measures the correspondence between the output text produced by the machine and a
translation produced by a human (BLEU achieves a high correlation with human judgments of quality).
\[ BLEU = BP \cdot \exp\left( \frac{1}{N} \sum_{n=1}^{N} \log_e p_n \right) \tag{3.4} \]
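where $p_n$ is the modified n-gram precision for n-grams of order n, N is the maximum n-gram order, and BP (brevity penalty) is a multiplicative factor that modifies the overall BLEU score.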
\[ BP = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{otherwise} \end{cases} \tag{3.5} \]
BP is a decaying exponential in r/c, where r is the effective reference length of the test
corpus, computed by summing the best match lengths for each candidate sentence in the corpus,
and c is the total length of the candidate translation corpus.
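The two equations above can be turned into a short corpus-level implementation. The sketch below is a simplified illustration, not the multi-bleu.perl script used in our experiments: it assumes a single reference per candidate, N = 4, no smoothing, and returns a score between 0 and 1 (the scores reported in this chapter are on a 0 to 100 scale).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate (assumption)."""
    matches = [0] * max_n
    totals = [0] * max_n
    c_len = r_len = 0
    for cand, ref in zip(candidates, references):
        c_len += len(cand)
        r_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Modified precision: clip each count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in cand_counts.items()
            )
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:
        return 0.0  # some p_n is zero, so the geometric mean is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the mat".split()
print(corpus_bleu([reference], [reference]))
print(corpus_bleu(["the cat sat on".split()], [reference]))
```

In the second call every n-gram of the candidate appears in the reference, so the score is determined by the brevity penalty alone, exp(1 - 6/4).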
BLEU is certainly not a perfect metric; some of the alternatives currently available
are METEOR6 [4] and TER7 [32]. Nevertheless, the BLEU metric is adequate for our system.
3.5
Model Architecture
Figure 3.1 shows the attention-based encoder/decoder, where each rectangular box is an
annotation $h_j$ at a time step t. Each layer holds n neurons.
Figure 3.1: Attention Based Encoder/Decoder
6 Metric for Evaluation of Translation with Explicit ORdering
7 Translation Error Rate
Our encoder is a bidirectional recurrent neural network with two layers, each containing 1000
gated recurrent units (GRU). The first layer is a forward RNN, which reads the source sentence
and produces hidden states $\overrightarrow{h}_j$, each summarizing the source sentence from
the first word up to the jth word. The second layer is a backward RNN, which reads the source
sentence in reverse order and produces the reversed memory states $\overleftarrow{h}_j$.
Together, $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ form the annotation $h_j$, which
summarizes the whole input sentence. In other words, each annotation contains information
about the whole sequence with a strong focus on the words around $x_j$. The annotations are
the states of a bidirectional network driven by the word embeddings of the source sentence.
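The construction of the annotations can be sketched with NumPy as follows. This is an illustrative toy version, not the Theano implementation used in our experiments: the weights are random, biases are omitted, the dimensions are tiny (the model described here uses 620-dimensional embeddings and 1000 GRU units per direction), and all function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Random GRU weights: W* act on the input, U* on the previous state."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        params["U" + gate] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    return params


def gru_step(params, x, h_prev):
    """One GRU update (biases omitted for brevity)."""
    z = sigmoid(params["Wz"] @ x + params["Uz"] @ h_prev)  # update gate
    r = sigmoid(params["Wr"] @ x + params["Ur"] @ h_prev)  # reset gate
    h_tilde = np.tanh(params["Wh"] @ x + params["Uh"] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde


def run_gru(params, embeddings):
    """Run the GRU over a (length, input_dim) array; return all states."""
    h = np.zeros(params["Uz"].shape[0])
    states = []
    for x in embeddings:
        h = gru_step(params, x, h)
        states.append(h)
    return np.stack(states)


def encode_source(embeddings, fwd_params, bwd_params):
    """Annotation h_j = [forward state j ; backward state j]."""
    forward = run_gru(fwd_params, embeddings)
    backward = run_gru(bwd_params, embeddings[::-1])[::-1]
    return np.concatenate([forward, backward], axis=1)


rng = np.random.default_rng(0)
emb_dim, hid_dim, sent_len = 8, 5, 6  # toy sizes
fwd, bwd = init_gru(emb_dim, hid_dim, rng), init_gru(emb_dim, hid_dim, rng)
embeddings = rng.normal(size=(sent_len, emb_dim))
annotations = encode_source(embeddings, fwd, bwd)
print(annotations.shape)
```

Each row of `annotations` concatenates a forward state, which has seen the words up to position j, and a backward state, which has seen the words from the end of the sentence down to position j.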
The attention mechanism assigns weights to the annotations. The weighted sum of the
annotations is then used by the translation network to predict the next word of the generated
translation. The decoder contains a single layer of 1000 GRU units and is able
to focus selectively on one or more of the annotation vectors for each target word.
The attention mechanism (shown in Figure 3.2) takes as input the previous hidden state of
the decoder $z_{i-1}$, one of the annotations $h_j$ and the previously generated target word $\tilde{y}_{i-1}$.
Figure 3.2: Attention Block
More generally, the attention mechanism is a small neural network with a single hidden layer
and a single scalar output $e_{ij}$ for each source word. Once the score of each input word is
computed, we apply a softmax over all the scores to produce the attention weights, which sum to one.
The annotation $h_j$ contains a summary of both the preceding and the following words.
Equation 3.6 relates it to $z_{i-1}$, the previous memory state, which summarizes what has
been translated up to the (i-1)th word, and to $\tilde{y}_{i-1}$, the previously generated
target word. The alignment model $e_{ij}$, also called the energy, scores how well the input
around position j and the output at position i match.
\[ e_{ij} = a(z_{i-1}, \tilde{y}_{i-1}, h_j) \tag{3.6} \]
The attention weight $\alpha_{ij}$ shown in our attention model figure is the probability that
the target word $y_i$ is aligned to the source word $x_j$.
\[ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \tag{3.7} \]
Equation 3.8 defines the context vector $c_i$, the weighted average of the annotations, which
is used to compute the new memory state $z_i$ of the decoder.
\[ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \tag{3.8} \]
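Equations 3.6 to 3.8 can be illustrated with a small NumPy sketch. The exact form of the alignment network a(.) is an assumption here: a single tanh hidden layer followed by a scalar output, with hypothetical weight matrices `W_z`, `W_y`, `W_h` and output vector `v`. Sizes and values are toy ones.

```python
import numpy as np


def attention(z_prev, y_prev, annotations, W_z, W_y, W_h, v):
    """Return the attention weights alpha_i and the context vector c_i.

    z_prev: previous decoder state, y_prev: embedding of the previous
    target word, annotations: array of shape (T_x, d_h).
    """
    # Equation 3.6 (assumed parameterisation): one hidden layer per source position.
    hidden = np.tanh(annotations @ W_h.T + W_z @ z_prev + W_y @ y_prev)
    energies = hidden @ v  # e_ij, shape (T_x,)
    # Equation 3.7: softmax (shifted by the max for numerical stability).
    exp_e = np.exp(energies - energies.max())
    weights = exp_e / exp_e.sum()
    # Equation 3.8: weighted average of the annotations.
    context = weights @ annotations
    return weights, context


rng = np.random.default_rng(0)
T_x, d_h, d_z, d_y, d_att = 6, 10, 5, 8, 7  # toy sizes
annotations = rng.normal(size=(T_x, d_h))
z_prev, y_prev = rng.normal(size=d_z), rng.normal(size=d_y)
W_z = rng.normal(size=(d_att, d_z))
W_y = rng.normal(size=(d_att, d_y))
W_h = rng.normal(size=(d_att, d_h))
v = rng.normal(size=d_att)

weights, context = attention(z_prev, y_prev, annotations, W_z, W_y, W_h, v)
print(weights.round(3), weights.sum())
```

The weights are positive and sum to one, so the context vector is a convex combination of the annotations.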
3.6
Results
We adapted a fully neural machine translation system using Python-theano 8, a library that can
run on GPU. To train our models we used the data described in section 3.1. In our experiments,
we tested our model with different sizes of the training corpus. On the one hand, we want to
observe the performance of the model with data sets of various sizes. On the other hand, we
want to evaluate the model according to the quality of the data sets. To better understand the
behavior of the NMT approach, we tested our models on two language pairs: English–French and
Arabic–French. The results for the Arabic–French language pair are presented in Appendix A.
3.6.1
The English–French task
As a first pre-processing step, we tokenized the whole English/French corpus, mapped all the
infrequent words (words that do not belong to the 30,000-word shortlist) to the [UNK] token
and replaced all the numbers with the specific token <NUMBER>. As a second step, in order
to limit memory use during training, we removed the sentences longer than 30 words. Since
data selection tends to favor short segments, this is another reason for removing long
sentences. We then shuffled the corpus to obtain our baseline (Random corpus).
Finally, we applied data selection to the whole corpus to retain the most interesting sentences with respect to the development corpus. We then took sub-corpora of different sizes
from the sorted data and compared them to the baseline.
We trained on 800 K lines of both the random and the adapted corpus (tokenized, filtered,
selected) for 16 epochs, in order to evaluate the effect of the quality of the data sets on
the translation performance of the NMT model.
8 https://github.com/mila-udem/blocks-examples/tree/master/machine_translation
Figure 3.3 shows the BLEU scores on the dev and sample files for the adapted corpus.
We applied a simple "head" to take the first 800 K lines from the tokenized, filtered and
selected corpus. BLEU validation starts after 20000 updates and is then repeated every 5000 updates.
Figure 3.3: Dev and Sample BLEU scores: 800,000 lines of the (English/French) corpus
Figure 3.4 shows the BLEU scores on the dev and sample files for the random sub-corpus: the
first 800 K lines of the tokenized, filtered and shuffled corpus. This is the result after 140 K iterations.
Figure 3.4: Dev and Sample BLEU scores: 800,000 lines of the random corpus
At iteration 140 K, with the same amount of data, there is a large difference between the BLEU
scores of the adapted data (Dev 38.65 and Sample 39.58) and those of the random data (Dev 28.70
and Sample 39.12): approximately +10 BLEU points for the dev and +0.46 for the sample.
From Figures 3.3 and 3.4, we note that data selection has an effect on translation scores.
Figures 3.5 and 3.6 display the BLEU scores on the dev and on the sample for the different
sub-corpora. We applied all the pre-processing steps for each sub-corpus size: we tokenized
the whole corpus, removed the sentences whose length exceeds 30 words and applied data
selection. We then took the first 400 K lines (respectively 800 K, 1 200 K, 1 600 K and 2 M).
This time, we wanted to observe how the data size influences the BLEU scores.
Figure 3.5: Dev BLEU scores with different sub-corpus (En/Fr) size
Figure 3.6: Sample BLEU scores with different sub-corpus (En/Fr) size
The above figures show that BLEU scores increase when the size of the adapted corpus decreases.
For instance, at iteration 240 000, the BLEU score for 400 K lines is 67.26 on the dev
(47.08 on the sample), while 800 K lines achieve 41.56 BLEU points on the dev (sample
46.12). Furthermore, 1 200 K lines give 37.55 BLEU points on the dev (sample 46.27),
1 600 K lines give 32.75 points (sample 44.79) and, finally, 2 M lines give 31.94
(sample 44.45).
3.6.1.1 Comparison with a phrase-based approach
Table 3.6 presents the BLEU scores of phrase-based machine translation with Moses and of our Neural Machine Translation implementation on 800 K/400 K lines of the
English–French adapted corpus (tokenized, filtered and selected) and of the random corpus
(tokenized, filtered and shuffled).
Number of lines    MT Model                      BLEU scores
800 K              PB-SMT (Adapted Corpus)       59,62
800 K              Neural MT (Adapted Corpus)    (60.42)
800 K              Neural MT (Random-Corpus)     (44,73)
400 K              PB-SMT (Adapted Corpus)       60,08
400 K              Neural MT (Adapted Corpus)    67,50
Table 3.6: BLEU scores for PB-SMT vs. Neural MT. The BLEU scores in parentheses are
not final results: the model is still training
Figure 3.7 presents the dev BLEU scores for the different adapted sub-corpus sizes at
iteration 140 K. The translation score decreases remarkably between 400 K lines and 800 K
lines (a gap of approximately 20 BLEU points), whereas for phrase-based machine translation
the difference is 1,30 BLEU points.
Figure 3.7: Dev BLEU scores with different sub-corpus (En/Fr) size at iteration 140 k
[Plot "Perplexity : corpus(En/Fr)": x-axis, corpus size (0 to 2e+06 lines); y-axis, perplexity (10 to 40)]
Figure 3.8: Perplexity score for different adapted sub-corpus
Figure 3.8 presents the perplexities obtained on a test set for the same adapted
sub-corpora as in Figure 3.7. The lowest perplexity, 24.5672, is obtained for 400 K lines.
The adapted sub-corpus with 400 K lines thus has both the highest BLEU
score and the lowest perplexity.
3.7
Discussion
3.7.1
Cost of the attention mechanism
As mentioned before, the attention mechanism gives the neural network access to the
hidden states of the encoder (in other words, to its internal memory). Unlike a typical
memory, the attention mechanism makes memory access soft (the network retrieves a weighted
combination of all memory locations), so the network can easily be trained end-to-end using
back-propagation. Despite all the benefits of the attention approach, it still has some shortcomings. Looking closely at this approach, we can see that it is costly: an attention value
must be computed for each combination of input and output word. For instance, a 50-word
input sequence and a 50-word output sequence require 2500 attention values. Likewise, for
very long sequences the attention mechanism can become prohibitively expensive and the model
takes a long time to train.
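The quadratic growth of this cost is easy to tabulate. The helper below is a hypothetical illustration that simply counts one attention value per (input word, output word) pair.

```python
def attention_values(source_len, target_len):
    """Number of attention values: one per (input word, output word) pair."""
    return source_len * target_len


for length in (10, 30, 50, 100):
    print(length, attention_values(length, length))
```

For the 50-word example above this gives 2500 values; doubling both lengths to 100 words multiplies the count by four.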
In effect, we look at everything in detail before deciding what to focus on: after
generating each output word, we go back through all the hidden states (the internal memory)
of the text in order to decide which word to produce next. That
seems like a waste. Nevertheless, this has not stopped attention mechanisms from performing
well on many tasks (machine translation, image/video captioning), and they have recently
become quite popular.
3.7.2
Effect of data selection on translation quality
Data selection (the cross-entropy difference approach) has shown significant improvements
in the effective use of training data by extracting sentences from large general-domain
corpora to adapt our NMT system to the target domain. Since the best result may depend on the
size of the selected data, we investigate selected corpora of increasing size, starting from
400 K lines of the general corpus (400K-l, 800K-l, 1,200K-l, 1,600K-l, 2,000K-l), where X-l
means that X lines of the general corpus are selected as a subset.
Translation quality improves by at most +7 BLEU points when using 400 K lines of
the general corpus, as shown in Figure 3.6. The performance then begins to drop when the size
exceeds 800 K lines (Figure 3.5). The results suggest that keyword overlap plays
a significant role in retrieving sentences from similar domains.
3.7.3
Arabic: a challenge for translation tasks
Arabic has a rich morphology compared to French. A single Arabic word can
function as an entire sentence in French, which makes translation a hard task even
with neural machine translation [1]. Arabic is a very complex language for a computer to
process. In our experiments, even after applying a normalization script and an OpenNLP9 tokenizer (diacritic removal, letter normalization and Tatweel removal), we did not
obtain high BLEU scores.
9 https://opennlp.apache.org/
4
Conclusion and Future work
Conclusion
In this thesis we presented a new approach to machine translation: Neural Machine Translation (NMT). NMT is a radically new way of teaching machines to translate using deep
neural networks. Though developed just last year, NMT has achieved state-of-the-art results in
the WMT translation tasks for various language pairs such as English-French, English-German
and English-Czech. NMT is appealing since it is conceptually simple.
NMT is essentially a big recurrent neural network that can be trained end-to-end to translate. It reads through the given source words one by one until the end, and then starts emitting
one target word at a time until a special end-of-sentence symbol is produced. Recently, a very
interesting approach, the attention mechanism, has shown very promising results in neural machine translation. It allows the neural network to
refer back to the source sentence instead of encoding all the information in one fixed vector.
As already mentioned, we trained the NMT model described in [3] on our English–French and
Arabic–French corpora. This approach gave impressive results, especially for the
English–French corpus. In our work, we wanted to clarify the relation between the quality and
quantity of the corpus and translation performance. We have shown first results for neural MT
on selected data and compared them with a phrase-based system trained on the same adapted
corpus. Finally, we concluded that the quality of the data sets affects the performance
of NMT: our adapted corpus improved the translation quality compared to the
tokenized-only corpus.
4.1
Future of Neural Machine Translation
4.1.1
Short-term projects
As an extension of this work, our short-term projects have two goals. (1) Decoding: once our
models are trained, we need to find the translation that maximizes the conditional probability, for which we can simply use a beam search. (2) Implementing the approach of
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models [23]: a hybrid system that translates at the word level and consults the character
components for infrequent words. This approach is easier and faster to train than
character-level recurrent networks and, on the other hand, it never produces the special
token <UNK>. The hybrid approach improved the BLEU score by +7.9 points for English/Czech translation over models that have no special treatment for unknown words.
It would be very interesting to train it on our English/French and Arabic/French corpora to
improve the translation quality.
4.1.2
Long-term projects
There are many challenges related to neural machine translation. NMT is a recent advance
with promising results. Our translation model works on language (at the sentence, word
or character level), translating from a source language to a target one. The question is
whether this model can work with other types of input/output: (1) a model suitable for very
long texts (paragraphs, documents...); (2) a model suitable for graphs (e.g. the word lattices used by Moses).
A
Appendix A
This appendix contains some results for the Arabic–French language pair. During the
training on this corpus we faced some problems. As mentioned before, the Arabic language is a
challenge for natural language processing and is hard to train on. The results described below
are not yet satisfactory, and we are still working to improve them.
Result for Arabic–French Corpus
We applied the same pre-processing steps (tokenizing the text, removing infrequent words,
filtering out sentences by length, data selection) to 10 percent, 20 percent and 50 percent of
the data sets. We compared the BLEU scores of the adapted sub-corpus with the results of
training before any pre-processing (raw data sets).
A.0.1 Result for 10 percent of the corpus
Figure A.1: Dev BLEU scores (10 percent of the Arabic–French corpus)
The following graph (Figure A.2) shows the result with approximately 10 percent of the initial
corpus (400,000 lines) after 15 epochs. We trained on three different sub-corpora (the same
data with different pre-processing): 10 percent of the raw corpus, 10 percent tokenized and
filtered, and 10 percent selected and shuffled.
Figure A.2: Sample BLEU scores (10 percent of the Arabic–French corpus)
A.0.2 Result for 20 percent of the corpus
Figure A.3: Dev BLEU scores (20 percent of the Arabic–French corpus)
Figures A.3 and A.4 present the BLEU scores for 20 percent of the corpus (800,000 lines)
after 12 epochs for the three different pre-processing steps:
Figure A.4: Sample BLEU scores (20 percent of the Arabic–French corpus)
MT Model                             BLEU scores
PB-SMT (Adapted Corpus)              11,5
Neural MT (Adapted Corpus)           16.37
Neural MT (only tokenized-Corpus)    9,55
Table A.1: BLEU scores for PB-SMT vs. Neural MT, for 20 percent of Arabic–French corpus
A.0.3 Result for 50 percent of the corpus
Figures A.5 and A.6 display the BLEU scores for 50 percent of the corpus (2,000,000 lines)
after 8 epochs:
Figure A.5: Dev BLEU scores (50 percent of the Arabic–French corpus)
Figure A.6: Sample BLEU scores (50 percent of the Arabic–French corpus)
B
Appendix B
Parameters:
• One epoch = one forward pass and one backward pass of all the training examples.
• Batch size = the number of training examples in one forward/backward pass. The higher
the batch size, the more memory space you’ll need.
• Number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the
forward pass and backward pass as two different passes).
• Batch size = 80 (default value).
• Sort k batch = 12 (default value): this many batches are read ahead and sorted.
• CNMeM = 512MB: the initial size of the GPU memory.
• Show 2 samples at each sampling.
• Source and target vocabularies sizes (include bos, eos, unk tokens)= 30000.
• Beam size=12.
• Start BLEU validation after 80000 updates (default value; in our experiments we changed it
to 2000).
• Bleu validation every 5000 updates.
• Maximum number of updates = 1000000.
• BLEU script: multi-bleu.perl (from Moses).
• Sample-size = 1500 lines.
• Word embedding dimensionality: 620.
• Multilayer network with a single maxout hidden layer.
• The training algorithm uses SGD with Adadelta (an algorithm that adapts the learning rate of
each parameter; decayrate=0.95, epsilon=1e-06).
• GPU : Tesla K40m.
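Using the definitions of epoch, batch size and iteration given above, the number of updates can be converted into epochs with a little arithmetic. The sketch below assumes that every update uses a full batch of 80 sentence pairs and that no pair is discarded during training; the function names are hypothetical.

```python
import math


def updates_per_epoch(num_lines, batch_size=80):
    """Number of updates (iterations) needed to see every line once."""
    return math.ceil(num_lines / batch_size)


def epochs_after(num_updates, num_lines, batch_size=80):
    """Approximate number of epochs completed after num_updates updates."""
    return num_updates * batch_size / num_lines


print(updates_per_epoch(800_000))      # updates in one epoch of 800 K lines
print(epochs_after(140_000, 800_000))  # epochs reached at iteration 140 K
```

Under these assumptions, one epoch over 800 K lines takes 10 000 updates, so iteration 140 K corresponds to 14 epochs.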
Definitions:
Here are some basic definitions1:
• Activation Function: To allow Neural Networks to learn complex decision boundaries,
we apply a nonlinear activation function to some of its layers. Commonly used functions
include sigmoid, tanh, ReLU (Rectified Linear Unit) and variants of these.
• Adadelta: is a gradient descent based learning algorithm that adapts the learning rate
per parameter over time. It was proposed as an improvement over Adagrad, which is
more sensitive to hyper-parameters and may decrease the learning rate too aggressively.
Adadelta is similar to rmsprop and can be used instead of vanilla SGD.
• Dropout: is a regularization technique for Neural Networks that prevents overfitting. It
prevents neurons from co-adapting by randomly setting a fraction of them to 0 at each
training iteration. Dropout can be interpreted in various ways, such as randomly sampling
from an exponential number of different networks. Dropout layers first gained popularity
through their use in CNNs, but have since been applied to other layers, including input
embeddings or recurrent networks.
• The softmax function is typically used to convert a vector of raw scores into class probabilities at the output layer of a Neural Network used for classification. It normalizes
the scores by exponentiating and dividing by a normalization constant. If we are dealing
with a large number of classes, a large vocabulary in Machine Translation for example, the normalization constant is expensive to compute. There exist various alternatives to make the computation more efficient, including Hierarchical Softmax or using a
sampling-based loss such as NCE.
• Theano is a Python library that allows you to define, optimize, and evaluate mathematical
expressions. It contains many building blocks for deep neural networks. Theano is a low-level library similar to Tensorflow. Higher-level libraries include Keras and Caffe.
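As an illustration of the Adadelta entry above, the following NumPy sketch applies the update rule to a one-dimensional toy problem. It uses the decay rate and epsilon listed in the parameters (0.95 and 1e-06); the function name, the state dictionary and the toy objective f(x) = x^2 are assumptions made for the example, not the Theano code used in our experiments.

```python
import numpy as np


def adadelta_update(param, grad, state, decay_rate=0.95, epsilon=1e-6):
    """One Adadelta step: per-parameter step size derived from running
    averages of squared gradients and squared past steps."""
    state["sq_grad"] = decay_rate * state["sq_grad"] + (1 - decay_rate) * grad ** 2
    step = -np.sqrt(state["sq_step"] + epsilon) / np.sqrt(state["sq_grad"] + epsilon) * grad
    state["sq_step"] = decay_rate * state["sq_step"] + (1 - decay_rate) * step ** 2
    return param + step


# Toy problem: minimise f(x) = x**2, whose gradient is 2x.
x = 3.0
state = {"sq_grad": 0.0, "sq_step": 0.0}
for _ in range(100):
    x = adadelta_update(x, 2.0 * x, state)
print(x)
```

No global learning rate appears in the update: the ratio of the two running averages sets the step size for each parameter.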
1 http://www.wildml.com/deep-learning-glossary/
Bibliography
[1] Amjad Almahairi, Kyunghyun Cho, Nizar Habash, and Aaron Courville. First result on
arabic neural machine translation. arXiv preprint arXiv:1606.02680, 2016.
[2] Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pages 355–362. Association for Computational Linguistics, 2011.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation
by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[4] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with
improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization,
volume 29, pages 65–72, 2005.
[5] Kyunghyun Cho. Natural Language Understanding with Distributed Representation.
Technical report, New York University, 2015. Lecture Note for DS-GA 3001.
[6] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the
Properties of Neural Machine Translation: Encoder-Decoder Approaches. 2014.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations
using RNN encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
[8] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
[9] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and
Pavel Kuksa. Natural language processing (almost) from scratch. The Journal of Machine
Learning Research, 12:2493–2537, 2011.
[10] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M Schwartz,
and John Makhoul. Fast and robust neural network joint models for statistical machine
translation. In ACL (1), pages 1370–1380, 2014.
[11] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with
deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP),
2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.
[12] Alex Graves. Supervised sequence labelling. In Supervised Sequence Labelling with
Recurrent Neural Networks, pages 5–13. Springer Berlin Heidelberg, 2012.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural computation, 9(8):1735–1780, 1997.
[14] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of
recurrent network architectures. In Proceedings of the 32nd International Conference on
Machine Learning (ICML-15), pages 2342–2350, 2015.
[15] Paul J. Werbos. Backpropagation Through Time: what it does and how to do it. Proceedings of the IEEE, 1550-1560, 1990.
[16] Nal Kalchbrenner and Phil Blunsom. Recurrent Continuous Translation Models. EMNLP,
3(39):413, 2013.
[17] Philipp Koehn and Barry Haddow. Towards effective use of training data in statistical
machine translation. In Proceedings of the Seventh Workshop on Statistical Machine
Translation, pages 317–321. Association for Computational Linguistics, 2012.
[18] Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation.
In Proceedings of the 2003 Conference of the North American Chapter of the Association
for Computational Linguistics on Human Language Technology-Volume 1, pages 48–54,
2003.
[19] Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. Investigations on translation model adaptation using monolingual data. In Proceedings of the Sixth
Workshop on Statistical Machine Translation, pages 284–293, 2011.
[20] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami,
and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint
arXiv:1603.01360, 2016.
[21] Minh-Thang Luong, Michael Kayser, and Christopher D Manning. Deep Neural Language Models for Machine Translation. CoNLL 2015, page 305, 2015.
[22] Minh-Thang Luong and Christopher D Manning. Stanford Neural Machine Translation
Systems for Spoken Language Domains. 2015.
[23] Minh-Thang Luong and Christopher D Manning. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788,
2016.
[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to
attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
[25] Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba.
Addressing the rare word problem in neural machine translation. arXiv preprint
arXiv:1410.8206, 2014.
[26] Robert C Moore and William Lewis. Intelligent selection of language model training data.
In Proceedings of the ACL 2010 conference short papers, pages 220–224. Association for
Computational Linguistics, 2010.
[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for
automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics, pages 311–318. Association for Computational
Linguistics, 2002.
[28] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. arXiv preprint arXiv:1211.5063, 2012.
[29] Suman Ravuri and Andreas Stolcke. A comparative study of recurrent neural network
models for lexical domain classification. In 2016 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 6075–6079. IEEE, 2016.
[30] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating
errors. Nature, 323:533–536, 1986.
[31] Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays. Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint
arXiv:1507.06947, 2015.
[32] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul.
A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, 2006.
[33] Richard Socher. Deep NLP Recurrent Neural Networks. Computer Science Department, Stanford University, August 2015. Available at
http://videolectures.net/deeplearning2015_socher_deep_nlp.
[34] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[35] Jörg Tiedemann. News from OPUS - A Collection of Multilingual Parallel Corpora with
Tools and Interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V, pages 237–248. John
Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria, 2009.
[36] Jörg Tiedemann. Parallel Data, Tools and Interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente
Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the
Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May 2012. European Language Resources Association (ELRA).
[37] Peilu Wang, Yao Qian, Frank K Soong, Lei He, and Hai Zhao. Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint
arXiv:1510.06168, 2015.
[38] Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. Recurrent
neural networks for language understanding. In INTERSPEECH, pages 2524–2528, 2013.
[39] Keiji Yasuda, Ruiqiang Zhang, Hirofumi Yamamoto, and Eiichiro Sumita. Method of
selecting training data to build a compact and efficient translation model. In IJCNLP,
pages 655–660, 2008.