attention¶
Attention module for multi-head attention operations.
This module implements the multi-head attention mechanism as described in “Attention Is All You Need” (Vaswani et al., 2017). It provides the build_mask helper for constructing causal attention masks and the Attention class, which performs the multi-head attention operation.
- class Attention(embedding_dim, n_heads, context_window)[source]¶
Bases: Op
Multi-head attention operation.
This class implements the multi-head attention mechanism as described in “Attention Is All You Need” (Vaswani et al., 2017).
- Parameters:
embedding_dim (int)
n_heads (int)
context_window (int)
- embedding_dim¶
An integer representing the dimension of the input embeddings.
- n_heads¶
An integer representing the number of attention heads.
- context_window¶
An integer representing the size of the context window.
- mask¶
A tensor representing the attention mask.
- _grad¶
A tensor to store gradients during backpropagation.
- backward(grad)[source]¶
Compute the gradient of the attention operation.
- Parameters:
grad (Tensor) – A Tensor representing the upstream gradient.
- Returns:
A Tensor representing the gradient with respect to the input.
- forward(tensor)[source]¶
Apply the multi-head attention operation to the input tensor.
- Parameters:
tensor (Tensor) – A Tensor of shape (batch_size, seq_len, embedding_dim * 3) containing the query, key, and value projections concatenated along the last dimension.
- Returns:
A Tensor representing the output after applying multi-head attention.
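To make the expected shapes concrete, below is a pure-NumPy sketch of causal multi-head attention over a concatenated-QKV input. It illustrates the documented interface only and is not this class's implementation; the 1/sqrt(head_dim) scaling, the masking convention, and the absence of an output projection are assumptions.

```python
import numpy as np

def reference_attention(x: np.ndarray, n_heads: int) -> np.ndarray:
    """Illustrative causal multi-head attention over a concatenated-QKV
    input of shape (batch, seq, embedding_dim * 3). Scaling and the lack
    of an output projection are assumptions, not guaranteed to match
    Attention.forward."""
    batch, seq, triple = x.shape
    embedding_dim = triple // 3
    head_dim = embedding_dim // n_heads

    # Split the concatenated projections back into query, key, value.
    q, k, v = np.split(x, 3, axis=-1)             # each (batch, seq, embedding_dim)

    def to_heads(t):
        # (batch, seq, embedding_dim) -> (batch, n_heads, seq, head_dim)
        return t.reshape(batch, seq, n_heads, head_dim).transpose(0, 2, 1, 3)

    q, k, v = map(to_heads, (q, k, v))

    # Scaled dot-product scores: (batch, n_heads, seq, seq).
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)

    # Causal mask: block attention to future positions.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)

    # Softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ v                              # (batch, n_heads, seq, head_dim)
    # Merge heads back to (batch, seq, embedding_dim).
    return out.transpose(0, 2, 1, 3).reshape(batch, seq, embedding_dim)

y = reference_attention(np.random.randn(2, 8, 3 * 16), n_heads=4)
print(y.shape)   # (2, 8, 16)
```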
- build_mask(context_window, n_heads)[source]¶
Build an attention mask to prevent attending to future tokens.
This function creates a boolean mask that can be used in multi-head attention mechanisms to implement causal (unidirectional) attention.
- Parameters:
context_window (int) – An integer representing the size of the context window.
n_heads (int) – An integer representing the number of attention heads.
- Returns:
A boolean tensor of shape (n_heads, context_window, context_window) representing the attention mask.
- Return type:
Tensor
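As an illustration of the documented return value, here is a minimal NumPy sketch of a causal boolean mask of shape (n_heads, context_window, context_window). Whether True marks positions to block or positions to keep is an assumption about this library's convention.

```python
import numpy as np

def build_causal_mask(context_window: int, n_heads: int) -> np.ndarray:
    """Sketch of a boolean causal mask of shape
    (n_heads, context_window, context_window). Here True marks future
    positions that must not be attended to (convention is assumed)."""
    # Upper-triangular entries above the diagonal are future positions.
    future = np.triu(np.ones((context_window, context_window), dtype=bool), k=1)
    # Repeat the same mask for every attention head.
    return np.broadcast_to(future, (n_heads, context_window, context_window)).copy()

mask = build_causal_mask(context_window=4, n_heads=2)
print(mask.shape)   # (2, 4, 4)
print(mask[0])      # row i has True in every column j > i
```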