blocks¶
Several layers can be grouped together into a single layer called a block.
This module provides various block implementations used in transformer-based models, including multi-head self-attention, MLP blocks, and transformer blocks.
- class FeedForward(embedding_dim, dropout_prob, expansion_ratio=4, activation_fn=GeLU())[source]¶
Bases:
Layer
A simple LLaMA-style feed forward block with two linear layers around a SwiGLU function.
The size of the hidden dimension is expansion_ratio * the size of the input.
- embedding_dim¶
The dimension of the input embedding.
- Type:
int
- dropout_prob¶
The probability of dropout.
- Type:
float
- expansion_ratio¶
The ratio to expand the hidden dimension.
- Type:
float
- activation_fn¶
The activation function to use.
- Type:
Layer
- linear_1¶
The first linear layer.
- Type:
- linear_2¶
The second linear layer.
- Type:
- Parameters:
embedding_dim (int) – The dimension of the input embedding.
dropout_prob (float) – The probability of dropout.
expansion_ratio (float) – The ratio to expand the hidden dimension. Defaults to 4.
activation_fn (Layer) – The activation function to use. Can be a Layer object or a string. Defaults to GeLU().
- dropout_prob: float¶
- embedding_dim: int¶
- expansion_ratio: float¶
- from_gpu()[source]¶
Move the layer from the GPU to the CPU.
- Returns:
The FeedForward layer moved to the CPU.
- Return type:
FeedForward
- to_gpu(device=0)[source]¶
Move the layer to the GPU.
- Parameters:
device (int) – The GPU device number to move the layer to. Defaults to 0.
- Returns:
The FeedForward layer moved to the GPU.
- Return type:
FeedForward
- update(optimiser)[source]¶
Update the parameters of the layer using the given optimiser.
- Parameters:
optimiser (Optimiser) – The optimiser to use for updating the parameters.
- Returns:
The updated FeedForward layer.
- Return type:
FeedForward
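Example usage (a minimal sketch): the FeedForward signature and the to_gpu/from_gpu/update methods come from the documentation above, while the import paths and the StochasticGradientDescent optimiser are assumptions about the surrounding package and may differ in your install.

```python
from tricycle.blocks import FeedForward  # assumed module path for this page
from tricycle.optimisers import StochasticGradientDescent  # assumed optimiser class

# Signature documented above:
# FeedForward(embedding_dim, dropout_prob, expansion_ratio=4, activation_fn=GeLU())
block = FeedForward(embedding_dim=384, dropout_prob=0.1, expansion_ratio=4)

# Move the parameters to GPU 0 and back to the CPU; both methods return the layer.
block = block.to_gpu(device=0)
block = block.from_gpu()

# Apply a parameter update; update() returns the updated layer.
optimiser = StochasticGradientDescent(learning_rate=1e-3)  # assumed constructor arguments
block = block.update(optimiser)
```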
- class GPT2TransformerBlock(embedding_dim, n_heads, context_window, expansion_ratio=4, activation_fn=GeLU(), norm_fn='layer_norm', residual_dropout_prob=0, linear_dropout_prob=0)[source]¶
Bases:
Layer
A GPT-2 style transformer block.
This block combines multi-head self-attention with an MLP block and includes normalization and residual connections.
- Parameters:
embedding_dim (int)
n_heads (int)
context_window (int)
expansion_ratio (float)
activation_fn (Layer)
norm_fn (Literal['layer_norm'] | Literal['rms_norm'])
residual_dropout_prob (float)
linear_dropout_prob (float)
- embedding_dim¶
The dimension of the input embeddings.
- Type:
int
- expansion_ratio¶
The ratio for expanding the hidden dimension in the MLP block.
- Type:
float
- residual_dropout_prob¶
The dropout probability for residual connections.
- Type:
float
- linear_dropout_prob¶
The dropout probability for the MLP block.
- Type:
float
- embedding_dim: int¶
- expansion_ratio: float¶
- linear_dropout_prob: float¶
- residual_dropout_prob: float¶
- to_gpu(device=0)[source]¶
Move the layer’s parameters to the GPU.
- Parameters:
device (int, optional) – The GPU device number. Defaults to 0.
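Example usage (a minimal sketch): the constructor arguments and to_gpu come from the documentation above; the tricycle.blocks import path is an assumption.

```python
from tricycle.blocks import GPT2TransformerBlock  # assumed module path for this page

block = GPT2TransformerBlock(
    embedding_dim=384,
    n_heads=6,              # embedding_dim is typically chosen divisible by n_heads (384 / 6 = 64 per head)
    context_window=256,
    expansion_ratio=4,      # MLP hidden size = 4 * 384 = 1536
    norm_fn="layer_norm",   # or "rms_norm"
    residual_dropout_prob=0.1,
    linear_dropout_prob=0.1,
)

block.to_gpu(device=0)  # optionally move the parameters to GPU 0
```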
- class MLPBlock(embedding_dim, dropout_prob, expansion_ratio=4, activation_fn=GeLU())[source]¶
Bases:
Layer
A simple GPT-2 style MLP block with two linear layers around an activation function.
The size of the hidden dimension is expansion_ratio * the size of the input.
- Parameters:
embedding_dim (int)
dropout_prob (float)
expansion_ratio (float)
activation_fn (Layer)
- embedding_dim¶
The dimension of the input embeddings.
- Type:
int
- dropout_prob¶
The dropout probability.
- Type:
float
- expansion_ratio¶
The ratio for expanding the hidden dimension.
- Type:
float
- dropout_prob: float¶
- embedding_dim: int¶
- expansion_ratio: float¶
- from_gpu()[source]¶
Move the layer’s parameters from the GPU to the CPU.
- Returns:
The MLPBlock instance with parameters moved to CPU.
- Return type:
MLPBlock
- to_gpu(device=0)[source]¶
Move the layer’s parameters to the GPU.
- Parameters:
device (int, optional) – The GPU device number. Defaults to 0.
- Returns:
The MLPBlock instance with parameters moved to GPU.
- Return type:
MLPBlock
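Example usage (a minimal sketch): the MLPBlock signature, the GeLU default, and the GPU helpers are documented above; the tricycle.blocks import path is an assumption.

```python
from tricycle.activation import GeLU  # the default activation in the signature above
from tricycle.blocks import MLPBlock  # assumed module path for this page

# Hidden dimension = expansion_ratio * embedding_dim, here 4 * 128 = 512.
mlp = MLPBlock(embedding_dim=128, dropout_prob=0.2, expansion_ratio=4, activation_fn=GeLU())

mlp = mlp.to_gpu(device=0)  # returns the MLPBlock with parameters on GPU 0
mlp = mlp.from_gpu()        # and back to the CPU
```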
- class MultiHeadSelfAttention(embedding_dim, n_heads, context_window, residual_dropout_prob=0.0, initialiser=init_xavier)[source]¶
Bases:
Layer
Multi-head self-attention layer.
This layer implements the multi-head self-attention mechanism used in transformer models.
- Parameters:
embedding_dim (int)
n_heads (int)
context_window (int)
residual_dropout_prob (float)
- embedding_dim¶
The dimension of the input embeddings.
- Type:
int
- n_heads¶
The number of attention heads.
- Type:
int
- context_window¶
The size of the context window.
- Type:
int
- context_window: int¶
- embedding_dim: int¶
- n_heads: int¶
- to_gpu(device=0)[source]¶
Move the layer’s parameters to the GPU.
- Parameters:
device (int, optional) – The GPU device number. Defaults to 0.
- build_mask(context_window)[source]¶
Build an attention mask that prevents the model from attending to future tokens.
- Parameters:
context_window (int) – The size of the context window.
- Returns:
A mask tensor with shape (context_window, context_window).
- Return type:
Tensor
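Example usage (a minimal sketch; the tricycle.blocks import path is an assumption). The numpy snippet at the end only illustrates the idea behind build_mask, not Tricycle's implementation: every position above the diagonal is masked so that a token cannot attend to later tokens.

```python
import numpy as np

from tricycle.blocks import MultiHeadSelfAttention  # assumed module path for this page

# Constructor signature documented above.
attention = MultiHeadSelfAttention(
    embedding_dim=384,
    n_heads=6,
    context_window=256,
    residual_dropout_prob=0.1,
)

# Conceptual illustration of a causal mask with shape (context_window, context_window):
# entries above the diagonal are -inf so that, once added to the attention logits
# before the softmax, future positions receive zero attention weight.
context_window = 4
mask = np.triu(np.full((context_window, context_window), -np.inf), k=1)
print(mask)
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]
```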