Introduction to the TensorFlow seq2seq library


We used the raw TensorFlow API for all the implementations in this book for better transparency into the actual functionality of the models and a better learning experience. However, TensorFlow also provides libraries that hide the fine-grained details of these implementations. They allow users to implement sequence-to-sequence models, such as the Neural Machine Translation (NMT) model we saw in Chapter 10, Sequence-to-Sequence Learning – Neural Machine Translation, with fewer lines of code and without worrying about the finer technical details of how they work. Knowledge of these libraries is important, as they provide a much cleaner way of using these models in production code or of researching beyond the existing methods. Therefore, we will go through a quick introduction to using the TensorFlow seq2seq library. This code is available as an exercise in the seq2seq_nmt.ipynb file.
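The code in the rest of this section relies on a handful of hyperparameters, the Dense layer class, and the encoder and decoder embedding matrices, all of which are defined elsewhere in the notebook. The following is a minimal sketch of that setup; the names match the code below, but the specific values (batch size, vocabulary size, and so on) are illustrative assumptions rather than the notebook's actual settings:

import tensorflow as tf
from tensorflow.python.layers.core import Dense

# Illustrative hyperparameters (assumed values, not the notebook's settings)
batch_size = 32                 # sentences per batch
num_units = 128                 # LSTM hidden size
embedding_size = 128            # word embedding dimensionality
vocab_size = 50000              # vocabulary size (assumed shared by source and target)
source_sequence_length = 30     # padded source sentence length
target_sequence_length = 30     # padded target sentence length
learning_rate = 0.001
decoder_type = 'attention'      # or 'basic'

# Embedding matrices looked up for the encoder and decoder inputs
encoder_emb_layer = tf.Variable(
    tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name='enc_embeddings')
decoder_emb_layer = tf.Variable(
    tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name='dec_embeddings')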

Defining embeddings for the encoder and decoder

We will first define placeholders for the encoder inputs, the decoder inputs, and the decoder labels:

enc_train_inputs = []
dec_train_inputs, dec_train_labels = [], []

# One placeholder of word IDs per time step for the encoder inputs
for ui in range(source_sequence_length):
    enc_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size], name='enc_train_inputs_%d'%ui))

# One placeholder per time step for the decoder inputs and the decoder labels
for ui in range(target_sequence_length):
    dec_train_inputs.append(tf.placeholder(tf.int32, shape=[batch_size], name='dec_train_inputs_%d'%ui))
    dec_train_labels.append(tf.placeholder(tf.int32, shape=[batch_size], name='dec_train_outputs_%d'%ui))
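Because the inputs are unrolled into one placeholder per time step, a training batch has to be fed column by column. The following is a minimal sketch of how such a feed_dict could be assembled; the src_batch and tgt_batch arrays and the shifted-label convention are illustrative assumptions, not code from the notebook:

import numpy as np

# Hypothetical padded batches of word IDs (dummy zeros for illustration)
src_batch = np.zeros((batch_size, source_sequence_length), dtype=np.int32)
tgt_batch = np.zeros((batch_size, target_sequence_length + 1), dtype=np.int32)

feed_dict = {}
# Each encoder placeholder receives one column (time step) of the source batch
for ui in range(source_sequence_length):
    feed_dict[enc_train_inputs[ui]] = src_batch[:, ui]
# Decoder inputs are the target tokens; labels are the same tokens shifted by one step
for ui in range(target_sequence_length):
    feed_dict[dec_train_inputs[ui]] = tgt_batch[:, ui]
    feed_dict[dec_train_labels[ui]] = tgt_batch[:, ui + 1]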

Next, we will perform embedding lookups for all the encoder and decoder inputs to obtain the word embeddings:

# Look up the embedding of each encoder input and stack the per-step lookups into
# a time-major [source_sequence_length, batch_size, embedding_size] tensor
encoder_emb_inp = [tf.nn.embedding_lookup(encoder_emb_layer, src) for src in enc_train_inputs]
encoder_emb_inp = tf.stack(encoder_emb_inp)

# Same for the decoder inputs: [target_sequence_length, batch_size, embedding_size]
decoder_emb_inp = [tf.nn.embedding_lookup(decoder_emb_layer, src) for src in dec_train_inputs]
decoder_emb_inp = tf.stack(decoder_emb_inp)

Defining the encoder

The encoder is built with an LSTM cell as its basic building block. We then define dynamic_rnn, which takes the defined LSTM cell as input and whose state is initialized with zeros. We set the time_major parameter to True because our data has the time axis as its first axis (that is, axis 0). In other words, our data has the shape [sequence_length, batch_size, embedding_size], where the time-dependent sequence_length dimension comes first. The benefit of dynamic_rnn is its ability to handle dynamically sized inputs: you can use the optional sequence_length argument to specify the length of each sentence in the batch. For example, say you have a batch of shape [3, 30] containing three sentences with lengths [10, 20, 30] (note that short sentences are padded up to 30 with a special token). Passing a tensor with the values [10, 20, 30] as sequence_length zeroes out the LSTM outputs computed beyond each sentence's length. The cell state is not zeroed out; instead, the last cell state computed within the sentence's length is copied forward beyond the end of the sentence, until position 30 is reached:

encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# The encoder state starts from zeros
initial_state = encoder_cell.zero_state(batch_size, dtype=tf.float32)

# Every sentence in this example is padded to source_sequence_length,
# so the same length is passed for each batch element
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp, initial_state=initial_state,
    sequence_length=[source_sequence_length for _ in range(batch_size)],
    time_major=True, swap_memory=True)

The swap_memory option allows TensorFlow to swap the tensors produced during the computation between the GPU and the CPU, in case the model is too large to fit entirely in GPU memory.
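In this example, every sentence is padded to the full source_sequence_length, so a constant length is passed for each batch element. If you wanted the zeroing behavior described above to take effect for shorter sentences, you could instead compute per-sentence lengths from the padded ID matrix. The following is a minimal sketch of that idea, assuming a hypothetical pad_token_id and a batch-major ID matrix, neither of which comes from the notebook:

pad_token_id = 0  # hypothetical ID of the padding token

def compute_sequence_lengths(batch_ids, pad_id=pad_token_id):
    # Count the non-padding tokens in each row of a [batch_size, time] ID tensor
    mask = tf.cast(tf.not_equal(batch_ids, pad_id), tf.int32)
    return tf.reduce_sum(mask, axis=1)

# src_ids = tf.placeholder(tf.int32, shape=[batch_size, source_sequence_length])
# src_lengths = compute_sequence_lengths(src_ids)
# ...and pass src_lengths as the sequence_length argument of dynamic_rnn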

Defining the decoder

The decoder is defined similarly to the encoder, but has an extra layer called projection_layer, which is the fully connected layer producing vocabulary-sized logits (that is, the softmax output layer) used to sample the predictions made by the decoder. We will also define a TrainingHelper, which properly feeds the decoder inputs to the decoder. In this example, we define two decoder variants: a plain BasicDecoder and a BasicDecoder whose cell is wrapped with the BahdanauAttention mechanism. (The attention mechanism is discussed in Chapter 10, Sequence-to-Sequence Learning – Neural Machine Translation.) Many other decoders and attention mechanisms exist in the library, such as BeamSearchDecoder and BahdanauMonotonicAttention:

decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)

# Dense is imported from tensorflow.python.layers.core
projection_layer = Dense(units=vocab_size, use_bias=True)

# Feeds the ground-truth decoder inputs to the decoder at each time step
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, [target_sequence_length for _ in range(batch_size)], time_major=True)

if decoder_type == 'basic':
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state,
        output_layer=projection_layer)

elif decoder_type == 'attention':
    # BahdanauAttention is an attention mechanism rather than a decoder: it is
    # attached to the cell with an AttentionWrapper, and the wrapped cell is then
    # used inside a BasicDecoder. The attention memory must be batch-major, so the
    # time-major encoder outputs are transposed.
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
        num_units, tf.transpose(encoder_outputs, [1, 0, 2]),
        memory_sequence_length=[source_sequence_length for _ in range(batch_size)])
    attention_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_cell, attention_mechanism, attention_layer_size=num_units)
    attention_initial_state = attention_cell.zero_state(
        batch_size, tf.float32).clone(cell_state=encoder_state)
    decoder = tf.contrib.seq2seq.BasicDecoder(
        attention_cell, helper, attention_initial_state,
        output_layer=projection_layer)

We will use dynamic decoding to get the outputs of the decoder:

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, output_time_major=True,
    swap_memory=True
)
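The same dynamic_decode call is also what you would use at inference time, except that TrainingHelper is replaced by a helper that feeds the decoder's own predictions back in, such as GreedyEmbeddingHelper. The following is an illustrative sketch of such an inference decoder; the start and end token IDs are hypothetical, it reuses the plain decoder_cell for brevity, and it is not part of the notebook:

sos_id, eos_id = 1, 2  # hypothetical <s> and </s> token IDs

# At each step, embed the previous prediction and take the argmax of the logits
inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    decoder_emb_layer,
    start_tokens=tf.fill([batch_size], sos_id),
    end_token=eos_id)

inference_decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, inference_helper, encoder_state,
    output_layer=projection_layer)

inference_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    inference_decoder, output_time_major=True,
    maximum_iterations=target_sequence_length)

inference_prediction = inference_outputs.sample_id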

Next, we will define the logits, cross-entropy loss, and train prediction operations:

# rnn_output holds the projection-layer outputs (logits) for each time step
logits = outputs.rnn_output

# Stack the per-step label placeholders into a [target_sequence_length, batch_size]
# tensor so that the labels match the time-major logits
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.stack(dec_train_labels), logits=logits)
loss = tf.reduce_mean(crossent)

# sample_id holds the word ID predicted at each time step
train_prediction = outputs.sample_id
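Because shorter target sentences are padded (as discussed for the encoder), the loss above also averages over padded positions. If you want to ignore those positions, one option is to weight the cross-entropy with a mask built from per-sentence target lengths. The following is a minimal sketch, assuming a hypothetical dec_lengths placeholder that is not defined in the notebook:

# Hypothetical per-sentence target lengths, shape [batch_size]
dec_lengths = tf.placeholder(tf.int32, shape=[batch_size])

# sequence_mask gives a [batch_size, target_sequence_length] mask; transpose it
# to time-major to match the [time, batch] shape of crossent
mask = tf.transpose(
    tf.sequence_mask(dec_lengths, target_sequence_length, dtype=tf.float32))

masked_loss = tf.reduce_sum(crossent * mask) / tf.reduce_sum(mask)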

Then, we will define two optimizers: we use AdamOptimizer for the first 10,000 steps and the vanilla stochastic GradientDescentOptimizer for the rest of the optimization process. This is because using the Adam optimizer for a long time can give rise to unexpected behaviors. Therefore, we use Adam to reach a good initial position for the SGD optimizer and then use SGD from that point on:

with tf.variable_scope('Adam'):
    optimizer = tf.train.AdamOptimizer(learning_rate)
with tf.variable_scope('SGD'):
    sgd_optimizer = tf.train.GradientDescentOptimizer(learning_rate)

# Adam update with gradient clipping
gradients, v = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 25.0)
optimize = optimizer.apply_gradients(zip(gradients, v))

# SGD update with gradient clipping (applied with sgd_optimizer, not optimizer)
sgd_gradients, v = zip(*sgd_optimizer.compute_gradients(loss))
sgd_gradients, _ = tf.clip_by_global_norm(sgd_gradients, 25.0)
sgd_optimize = sgd_optimizer.apply_gradients(zip(sgd_gradients, v))
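To make the switch concrete, a training loop could run the Adam update for the first 10,000 steps and the SGD update from then on. The following is a minimal sketch of that idea, assuming a hypothetical get_batch_feed_dict() helper that builds the feed_dict shown earlier; it is not code from the notebook:

num_steps = 50000      # hypothetical total number of training steps
adam_steps = 10000     # switch from Adam to SGD after this many steps

with tf.Session() as session:
    tf.global_variables_initializer().run()
    for step in range(num_steps):
        feed_dict = get_batch_feed_dict()  # hypothetical batch-building helper
        # Use Adam for the first adam_steps steps, then continue with plain SGD
        train_op = optimize if step < adam_steps else sgd_optimize
        _, l = session.run([train_op, loss], feed_dict=feed_dict)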

Note

A rigorous evaluation of how optimizers perform in NMT training can be found in the paper by Bahar and others, Empirical Investigation of Optimization Algorithms in Neural Machine Translation, The Prague Bulletin of Mathematical Linguistics, 2017.
