# Handling Question Duplicates Using Siamese Networks

In this section we explore Siamese networks applied to natural language processing, along with the fundamentals of Trax, and learn how to implement models with different architectures.

## Outline

- [Overview](#0)
- [Part 1: Importing the Data](#1)
  - [1.1 Loading in the data](#1.1)
  - [1.2 Converting a question to a tensor](#1.2)
- [Part 2: Defining the Siamese model](#2)
  - [2.1 Understanding the Siamese network](#2.1)
  - [2.2 Hard Negative Mining](#2.2)
- [Part 3: Training](#3)
  - [3.1 Training the model](#3.1)
- [Part 4: Evaluation](#4)
  - [4.1 Evaluating our Siamese network](#4.1)
  - [4.2 Classify](#4.2)
- [Part 5: Testing with our own questions](#5)
- [On Siamese networks](#6)

<a name='0'></a>
### Overview
In this section, we will:

- Learn about Siamese networks
- Understand how the triplet loss works
- Understand how to evaluate accuracy
- Use cosine similarity between the model's output vectors
- Use the data generator to get batches of questions
- Predict using our own model

We will start by preprocessing the data. After processing the data we will build a classifier that allows us to identify whether two questions are the same or not.
We will process the data first and then pad the sequences. Our model will take in the two question embeddings, run them through an LSTM, and then compare the outputs of the two subnetworks using cosine similarity.
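
Since cosine similarity is central to both training and evaluation, here is a minimal NumPy sketch of it (the function name is illustrative, not part of the assignment code):

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine similarity between two 1-D vectors: dot product of the L2-normalized vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.7071
```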

<a name='1'></a>
# Part 1: Importing the Data
<a name='1.1'></a>
### 1.1 Loading in the data

We will be using the Quora Question Pairs dataset to build a model that can identify similar questions. This is a useful task because we don't want several versions of the same question posted on the site.
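
A minimal loading sketch, assuming the data sits in a CSV file named `questions.csv` with `question1`, `question2`, and `is_duplicate` columns (the file name and column names are assumptions, not confirmed by the assignment):

```python
import pandas as pd

# Assumed file and column names for the Quora question pairs data.
data = pd.read_csv('questions.csv')
print(data.shape)
print(data[['question1', 'question2', 'is_duplicate']].head())
```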

<img src="./images/quora_dataset.png"></img>

<a name='1.2'></a>
### 1.2 Converting a question to a tensor

We will now convert every question to a tensor, i.e. an array of numbers, using the vocabulary built above.
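
A minimal sketch of the conversion, assuming `vocab` is a dict-like mapping from token to integer id (the helper name and the unknown-word handling are illustrative):

```python
import nltk
# nltk.download('punkt')  # needed once for word_tokenize

def question_to_tensor(question, vocab):
    """Tokenize a question and map each token to its integer id (0 for out-of-vocabulary words)."""
    tokens = nltk.word_tokenize(question)
    return [vocab.get(token, 0) for token in tokens]

# Example: question_to_tensor('When will I see you?', vocab) -> a list of ints, one per token
```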

<a name='2'></a>
# Part 2: Defining the Siamese model

<a name='2.1'></a>

### 2.1 Understanding the Siamese network
A Siamese network is a neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. The Siamese network we are about to implement looks like this:

<img src="./images/siamese_networks.png"></img>

We get the question embeddings, run them through an LSTM layer, normalize `v_1` and `v_2`, and finally use a triplet loss (explained below) to get the corresponding cosine similarity for each pair of questions. The triplet loss makes use of a baseline (anchor) input that is compared to a positive (truthy) input and a negative (falsy) input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. In math terms, we are trying to minimize the following loss:

<img src="./images/triplet_loss.png"></img>

`A` is the anchor input, for example `q1_1`; `P` is the positive (duplicate) input, for example `q2_1`; and `N` is the negative input (the non-duplicate question), for example `q2_2`.<br>
α is a margin; we can think of it as a safety net, or how far we want to push the duplicates away from the non-duplicates.
<br>
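
For reference, a common way of writing this loss in terms of the cosine similarity s(·,·) between the encoded vectors is the following (a hedged reconstruction; the figure above may use an equivalent distance-based form):

```math
\mathcal{L}(A, P, N) = \max\big(\mathrm{s}(A, N) - \mathrm{s}(A, P) + \alpha,\; 0\big)
```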

<a name='ex02'></a>
### Exercise 02

**Instructions:** Implement the `Siamese` function below. We should use all of the objects explained below.

To implement this model, we will be using `trax`. Concretely, we will be using the following functions (a sketch that combines them follows this list).


- `tl.Serial`: Combinator that applies layers serially (by function composition); it lets us set up the overall structure of the feedforward network. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Serial) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/combinators.py#L26)
    - We can pass in the layers as arguments to `Serial`, separated by commas.
    - For example: `tl.Serial(tl.Embedding(...), tl.Mean(...), tl.Dense(...), tl.LogSoftmax(...))`


- `tl.Embedding`: Maps discrete tokens to vectors. It will have shape (vocabulary length × dimension of output vectors). The dimension of the output vectors (also called `d_feature`) is the number of elements in the word embedding. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L113)
    - `tl.Embedding(vocab_size, d_feature)`.
    - `vocab_size` is the number of unique words in the given vocabulary.
    - `d_feature` is the number of elements in the word embedding (some choices for a word embedding size range from 150 to 300, for example).


- `tl.LSTM`: The LSTM layer. It leverages another Trax layer called [`LSTMCell`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.LSTMCell). The number of units should be specified and should match the number of elements in the word embedding. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.LSTM) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/rnn.py#L87)
    - `tl.LSTM(n_units)` builds an LSTM layer with `n_units` units.


- `tl.Mean`: Computes the mean across a desired axis. Mean uses one tensor axis to form groups of values and replaces each group with the mean value of that group. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Mean) / [source code](https://github.com/google/trax/blob/1372b903bb66b0daccee19fd0b1fdf44f659330b/trax/layers/core.py#L276)
    - `tl.Mean(axis=1)` takes the mean over columns.


- `tl.Fn`: Layer with no weights that applies a function `f`, which should be specified using lambda syntax. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.base.Fn) / [source code](https://github.com/google/trax/blob/70f5364dcaf6ec11aabbd918e5f5e4b0f5bfb995/trax/layers/base.py#L576)
    - Here it is used to normalize the output vectors `v_1` and `v_2` before the cosine similarity is computed.
    - `tl.Fn('Normalize', lambda x: normalize(x))` returns a layer with no weights that applies the function `normalize` to its input.


- `tl.Parallel`: A combinator layer (like `Serial`) that applies a list of layers in parallel to its inputs. [docs](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Parallel) / [source code](https://github.com/google/trax/blob/37aba571a89a8ad86be76a569d0ec4a46bdd8642/trax/layers/combinators.py#L152)
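
Putting these layers together, a minimal sketch of the `Siamese` model might look like the following (the sizes and the `normalize` helper are illustrative defaults, not prescribed values):

```python
import trax
from trax import layers as tl
from trax.fastmath import numpy as fastnp

def Siamese(vocab_size=40000, d_model=128):
    """Sketch of a Siamese network: two identical branches sharing the same layers."""
    def normalize(x):
        # L2-normalize each output vector so a plain dot product equals cosine similarity.
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))

    q_processor = tl.Serial(
        tl.Embedding(vocab_size, d_model),  # token ids -> embedding vectors
        tl.LSTM(d_model),                   # sequence model over the question
        tl.Mean(axis=1),                    # average over the sequence length
        tl.Fn('Normalize', lambda x: normalize(x)),
    )
    # The same q_processor object is passed twice, so both branches share weights.
    return tl.Parallel(q_processor, q_processor)
```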

<a name='2.2'></a>

### 2.2 Hard Negative Mining


We will now implement the `TripletLoss`.<br>
As explained in the lecture, the loss is composed of two terms. One term utilizes the mean of all the non-duplicates, while the second utilizes the *closest negative*. Our loss expression is then:

<img src="./images/new_triplet_loss.png"></img>
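
As a hedged reconstruction of the figure (writing s(A, P) for the cosine similarity of an anchor/duplicate pair, mean_neg for the mean of its off-diagonal similarities, and closest_neg for its largest off-diagonal similarity):

```math
\begin{aligned}
\mathcal{L}_1 &= \max\big(\textit{mean\_neg} - \mathrm{s}(A, P) + \alpha,\; 0\big)\\
\mathcal{L}_2 &= \max\big(\textit{closest\_neg} - \mathrm{s}(A, P) + \alpha,\; 0\big)\\
\mathcal{L} &= \operatorname{mean}\big(\mathcal{L}_1 + \mathcal{L}_2\big)
\end{aligned}
```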


Further, two sets of instructions are provided. The first set gives a brief description of the task. If that proves insufficient, a more detailed set can be displayed.

<a name='ex03'></a>
### Exercise 03

**Instructions (Brief):** Here is a list of things we should do (a sketch implementing these steps follows the list): <br>

- As this will be run inside trax, use `fastnp.xyz` when using any numpy function `xyz`
- Use `fastnp.dot` to calculate the similarity matrix `v_1 v_2^T` of dimension `batch_size` x `batch_size`
- Take the score of the duplicates from the diagonal with `fastnp.diagonal`
- Use the `trax` functions `fastnp.eye` and `fastnp.maximum` for the identity matrix and the maximum.
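
A minimal sketch of the loss along these lines (the variable names and the exact hard-negative bookkeeping are illustrative, not the reference solution):

```python
from functools import partial
from trax import layers as tl
from trax.fastmath import numpy as fastnp

def TripletLossFn(v1, v2, margin=0.25):
    """Triplet loss with mean-negative and closest-negative terms over a batch."""
    scores = fastnp.dot(v1, v2.T)                    # batch_size x batch_size similarity matrix
    batch_size = len(scores)
    positive = fastnp.diagonal(scores)               # similarity of the true duplicate pairs
    # Mean of the off-diagonal (non-duplicate) similarities in each row.
    negative_zero_on_duplicate = scores * (1.0 - fastnp.eye(batch_size))
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)
    # Closest negative: the largest off-diagonal similarity in each row.
    negative_without_positive = scores - 2.0 * fastnp.eye(batch_size)
    closest_negative = fastnp.max(negative_without_positive, axis=1)
    triplet_loss1 = fastnp.maximum(0.0, margin - positive + closest_negative)
    triplet_loss2 = fastnp.maximum(0.0, margin - positive + mean_negative)
    return fastnp.mean(triplet_loss1 + triplet_loss2)

def TripletLoss(margin=0.25):
    # Wrap the function as a Trax layer so it can be used as a loss_layer or metric.
    return tl.Fn('TripletLoss', partial(TripletLossFn, margin=margin))
```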



<a name='3'></a>

# Part 3: Training

Now we are going to train our model. As usual, we have to define the cost function and the optimizer, and we also have to feed in the built model. Before going into the training, we will use a special data setup. We will define the inputs using the data generator we built above. The lambda function acts as a seed that remembers the last batch that was given. Run the cell below to get the question pair inputs.
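
A hedged sketch of that wiring (the generator arguments, the `train_Q1`/`val_Q1` names, and the lambda wrapping are assumptions about the `data_generator` helper built in earlier cells):

```python
batch_size = 256  # illustrative

# Assumed signature of the data generator built earlier: (Q1, Q2, batch_size, pad_id)
train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab['<PAD>'])
val_generator = data_generator(val_Q1, val_Q2, batch_size, vocab['<PAD>'])

# Wrapping a generator in a lambda gives the training loop a callable stream;
# because the same generator object is reused, it resumes from the last batch it yielded.
train_stream = lambda: train_generator
eval_stream = lambda: val_generator
```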

<a name='3.1'></a>

### 3.1 Training the model

We will now write a function that takes in our model and trains it. To train our model we have to decide how many times we want to iterate over the entire data set; each iteration is defined as an `epoch`. For each epoch, we have to go over all the data using our training iterator.

<a name='ex04'></a>
### Exercise 04

**Instructions:** Implement `train_model` below to train the neural network above (a hedged sketch follows the notes after this list). Here is a list of things we should do, as already shown in lecture 7:

- Create `TrainTask` and `EvalTask`
- Create the training loop `trax.supervised.training.Loop`
- Pass in the following depending on the context (train_task or eval_task):
    - `labeled_data=generator`
    - `metrics=[TripletLoss()]`
    - `loss_layer=TripletLoss()`
    - `optimizer=trax.optimizers.Adam` with learning rate of 0.01
    - `lr_schedule=lr_schedule`
    - `output_dir=output_dir`


We will be using our triplet loss function with the Adam optimizer. Please read the [trax](https://trax-ml.readthedocs.io/en/latest/trax.optimizers.html?highlight=adam#trax.optimizers.adam.Adam) documentation to get a full understanding.

This function should return a `training.Loop` object. To read more about this, check the [docs](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html?highlight=loop#trax.supervised.training.Loop).
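
A hedged sketch of `train_model` along these lines (the exact `Loop` signature varies between Trax versions, and the generators and `lr_schedule` are assumed to come from earlier cells):

```python
import trax
from trax.supervised import training

def train_model(Siamese, TripletLoss, lr_schedule, train_generator, val_generator, output_dir='model/'):
    """Return a training.Loop wired with one train task and one eval task."""
    train_task = training.TrainTask(
        labeled_data=train_generator,
        loss_layer=TripletLoss(),
        optimizer=trax.optimizers.Adam(0.01),
        lr_schedule=lr_schedule,
    )
    eval_task = training.EvalTask(
        labeled_data=val_generator,
        metrics=[TripletLoss()],
    )
    # Newer Trax versions take lists of tasks; older ones take a single task
    # plus an `eval_task=` keyword argument.
    return training.Loop(
        Siamese(),
        tasks=[train_task],
        eval_tasks=[eval_task],
        output_dir=output_dir,
    )

# Usage sketch: training_loop = train_model(...); training_loop.run(n_steps=1000)
```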

<a name='4'></a>

# Part 4: Evaluation

<a name='4.1'></a>

### 4.1 Evaluating our Siamese network

In this section we will learn how to evaluate a Siamese network. We will first start by loading a pretrained model, and then we will use it to predict.

<a name='4.2'></a>
### 4.2 Classify
To determine the accuracy of the model, we will utilize the test set that was configured earlier. While in training we used only positive examples, the test data (Q1_test, Q2_test and y_test) is set up as pairs of questions, some of which are duplicates and some of which are not.
This routine will run all the test question pairs through the model, compute the cosine similarity of each pair, threshold it, and compare the result to y_test, the correct response from the data set. The results are accumulated to produce an accuracy.


<a name='ex05'></a>
### Exercise 05

**Instructions** (a hedged sketch follows the note below):
- Loop through the incoming data in `batch_size` chunks
- Use the data generator to load q1, q2 a batch at a time. **Don't forget to set shuffle=False!**
- Copy a `batch_size` chunk of y into y_test
- Compute v1, v2 using the model
- For each element of the batch:
    - compute the cosine similarity of each pair of entries, v1[j], v2[j]
    - determine if d > threshold
    - increment accuracy if that result matches the expected result (y_test[j])
- Compute the final accuracy and return it

Due to some limitations of this environment, running classify multiple times may cause the kernel to fail. If that happens, *Restart Kernel & Clear Output* and then run from the top. During development, consider using a smaller set of data to reduce the number of calls to model().
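
A hedged sketch of `classify` following these steps (the `data_generator` signature and the batching behaviour are assumptions carried over from earlier cells):

```python
import numpy as np

def classify(test_Q1, test_Q2, y, threshold, model, vocab, data_generator, batch_size=64):
    """Compute accuracy of the Siamese model on labelled question pairs."""
    correct = 0
    for i in range(0, len(test_Q1), batch_size):
        # Assumed generator signature: (Q1, Q2, batch_size, pad_id, shuffle)
        q1, q2 = next(data_generator(test_Q1[i:i + batch_size], test_Q2[i:i + batch_size],
                                     batch_size, vocab['<PAD>'], shuffle=False))
        y_test = y[i:i + batch_size]
        v1, v2 = model((q1, q2))
        for j in range(len(y_test)):
            d = np.dot(v1[j], v2[j])   # cosine similarity: the vectors are already normalized
            res = d > threshold
            correct += (y_test[j] == res)
        # (a final short batch may be padded by the generator; this sketch ignores that edge case)
    return correct / len(test_Q1)
```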

<a name='5'></a>

# Part 5: Testing with our own questions

In this section we will test the model with our own questions. We will write a function `predict` which takes two questions as input and returns `1` or `0` depending on whether the question pair is a duplicate or not.

But first, we build a reverse vocabulary that allows us to map encoded questions back to words.

Write a function `predict` that takes in two questions, the model, and the vocabulary, and returns whether the questions are duplicates (`1`) or not duplicates (`0`) given a similarity threshold.

<a name='ex06'></a>
### Exercise 06


**Instructions** (a hedged sketch follows the list):
- Tokenize our questions using `nltk.word_tokenize`
- Create Q1, Q2 by encoding our questions as lists of numbers using vocab
- Pad Q1, Q2 with `next(data_generator([Q1], [Q2], 1, vocab['<PAD>']))`
- Use `model()` to create v1, v2
- Compute the cosine similarity (dot product) of v1, v2
- Compute res by comparing d to the threshold
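
A hedged sketch of `predict` under those steps (again, `data_generator` and the `vocab` lookup behaviour are assumptions from earlier cells; `vocab` is assumed to handle unknown words, e.g. as a defaultdict):

```python
import nltk
import numpy as np

def predict(question1, question2, threshold, model, vocab, data_generator, verbose=False):
    """Return True if the two questions are predicted to be duplicates."""
    q1 = nltk.word_tokenize(question1)
    q2 = nltk.word_tokenize(question2)
    Q1 = [vocab[word] for word in q1]
    Q2 = [vocab[word] for word in q2]
    # Pad both questions to the same length using the data generator (assumed signature).
    Q1, Q2 = next(data_generator([Q1], [Q2], 1, vocab['<PAD>']))
    v1, v2 = model((Q1, Q2))
    d = np.dot(v1[0], v2[0])   # cosine similarity of the two normalized output vectors
    res = d > threshold
    if verbose:
        print('d:', d, 'is duplicate:', res)
    return res
```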

<a name='6'></a>


### <span style="color:blue"> Output of Siamese Networks </span>

<img src="./images/sample_output_1.png"></img>
<img src="./images/sample_output_2.png"></img>

### <span style="color:blue"> On Siamese networks </span>

Siamese networks are important and useful. Often there are several versions of a question that have already been asked on Quora or other platforms, and we can use Siamese networks to avoid question duplicates.