Seq2seq Attention: Peeping Into the Encoder States Bypasses the Last Encoder Hidden State
Solution 1:
Bottom line: you should try different approaches and see which model works best for your data. Without knowing anything about your data, or running some tests, it is impossible to say whether an attention mechanism, a CNN, etc. provides any benefit.
However, if you are using the TensorFlow seq2seq models available in tensorflow/tensorflow/python/ops/seq2seq.py, let me share some observations about the attention mechanism as implemented in embedding_attention_seq2seq() and attention_decoder() that relate to your question(s):
- The hidden state of the decoder is initialized with the final state of the encoder, so attention does not "effectively bypass the hidden state at end of encoding", IMHO.
The following code in embedding_attention_seq2seq() passes the last time step's encoder_state as the initial_state in the 2nd argument:
return embedding_attention_decoder(
    decoder_inputs, encoder_state, attention_states, cell,
    num_decoder_symbols, embedding_size, num_heads=num_heads,
    output_size=output_size, output_projection=output_projection,
    feed_previous=feed_previous,
    initial_state_attention=initial_state_attention)
And you can see that initial_state is used directly in attention_decoder() without going through any kind of attention states:
state = initial_state
...
for i, inp in enumerate(decoder_inputs):
  if i > 0:
    variable_scope.get_variable_scope().reuse_variables()
  # If loop_function is set, we use it instead of decoder_inputs.
  if loop_function is not None and prev is not None:
    with variable_scope.variable_scope("loop_function", reuse=True):
      inp = loop_function(prev, i)
  # Merge input and previous attentions into one vector of the right size.
  input_size = inp.get_shape().with_rank(2)[1]
  if input_size.value is None:
    raise ValueError("Could not infer input size from input: %s" % inp.name)
  x = linear([inp] + attns, input_size, True)
  # Run the RNN.
  cell_output, state = cell(x, state)
...
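The attns vectors in the loop above are context vectors computed from attention_states (the stacked encoder outputs). As a rough, framework-free sketch of how such a context vector can be computed for one decoder step (the actual attention_decoder() uses 1x1 convolutions and supports multiple heads, so the names and shapes below are illustrative assumptions only):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(encoder_outputs, decoder_state, W_enc, W_dec, v):
    # encoder_outputs: (T, H) stacked encoder hidden states ("attention_states")
    # decoder_state:   (H,)   current decoder hidden state
    # W_enc, W_dec:    (H, A) learned projections; v: (A,) learned scoring vector
    scores = np.tanh(encoder_outputs @ W_enc + decoder_state @ W_dec) @ v  # (T,)
    weights = softmax(scores)         # attention distribution over encoder time steps
    return weights @ encoder_outputs  # weighted sum of encoder states = context vector

Note that the last encoder state is one of the rows of encoder_outputs, so it remains fully visible to the attention mechanism.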
Attention states are combined with decoder inputs via learned linear combinations:
x = linear([inp] + attns, input_size, True)
# Run the RNN.
cell_output, state = cell(x, state)
...the linear() call does the W and b matrix operations to project ("down-rank") the combined input + attn into the decoder's input_size. The model will learn the values of W and b during training.
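As a minimal sketch of what that step amounts to for a single attention head (treating linear() as a plain fully connected layer, with hypothetical names):

import numpy as np

def combine_input_and_attention(inp, attns, W, b):
    # inp:   (input_size,)         decoder input embedding for this time step
    # attns: list of (attn_size,)  attention context vector(s), one per head
    # W, b:  learned parameters projecting the concatenation back to input_size
    concat = np.concatenate([inp] + attns)  # (input_size + num_heads * attn_size,)
    return concat @ W + b                   # the x that is fed into cell(x, state)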
Summary: the attention states are combined with the inputs to the decoder, but the last hidden state of the encoder is fed in as the initial hidden state of the decoder without attention.
Finally, the attention mechanism still has the last encoding state at its disposal and would only "bypass" it if it learned that was the best thing to do during training.
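To tie the pieces together, here is a toy sketch of the overall decoding pattern described above; rnn_cell, attention_context, and combine are illustrative stand-ins rather than the actual seq2seq.py API:

def decode(decoder_inputs, encoder_outputs, encoder_final_state,
           rnn_cell, attention_context, combine):
    state = encoder_final_state       # decoder starts from the encoder's last hidden state
    outputs = []
    attn = attention_context(encoder_outputs, state)      # initial context vector
    for inp in decoder_inputs:
        x = combine(inp, [attn])      # merge input with the attention context
        output, state = rnn_cell(x, state)
        attn = attention_context(encoder_outputs, state)  # re-attend with the new state
        outputs.append(output)
    return outputs, state

Nothing here forces the model to ignore encoder_final_state: it is both the decoder's starting state and, as part of encoder_outputs, still available to attention at every step.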