Question about ReLU in Multi-Head Attention #21

In the multi-head attention implementation, a ReLU is applied to the linear projections of the queries, keys, and values. Is this correct? Eq. 5 in the paper does not mention a ReLU. Also, since the ReLU makes every entry of Q and K non-negative, the pre-softmax attention scores (the dot products of Q and K) can never be negative.

```python
# Linear projections
Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)
K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)
V = tf.layers.dense(values, num_units, activation=tf.nn.relu)
```
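To illustrate the concern about the sign of the scores, here is a minimal NumPy sketch (not the repository's code). The helper `attention_scores`, the random weights `w_q` and `w_k`, and the shapes are all hypothetical, and the bias term of the dense layer is omitted for brevity. It only shows that projections passed through a ReLU give non-negative scaled dot products, whereas purely linear projections can give scores of either sign.

```python
import numpy as np


def attention_scores(queries, keys, w_q, w_k, use_relu):
    """Return pre-softmax scaled dot-product scores Q K^T / sqrt(d)."""
    q = queries @ w_q  # linear projection of the queries
    k = keys @ w_k     # linear projection of the keys
    if use_relu:
        # Mimics activation=tf.nn.relu on the projections.
        q = np.maximum(q, 0.0)
        k = np.maximum(k, 0.0)
    return q @ k.T / np.sqrt(q.shape[-1])


# Hypothetical toy inputs: 4 positions, model width 8, projection width 16.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_q = rng.standard_normal((8, 16))
w_k = rng.standard_normal((8, 16))

scores_relu = attention_scores(x, x, w_q, w_k, use_relu=True)
scores_linear = attention_scores(x, x, w_q, w_k, use_relu=False)

print("min score with ReLU:   ", scores_relu.min())
print("min score without ReLU:", scores_linear.min())
```

The softmax output is positive in both cases. What the ReLU changes is that a query can no longer give a key a negative score, so it cannot push that key's score below that of an orthogonal key.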
