blocks

Several layers can be grouped together into a single layer called a block.

This module provides various block implementations used in transformer-based models, including multi-head self-attention, MLP blocks, and transformer blocks.

class FeedForward(embedding_dim, dropout_prob, expansion_ratio=4, activation_fn=GeLU())[source]

Bases: Layer

A simple LLaMA-style feed-forward block: two linear layers around a SwiGLU activation function.

The hidden dimension has size expansion_ratio * embedding_dim.

embedding_dim
  The dimension of the input embedding.
  Type: int

dropout_prob
  The probability of dropout.
  Type: float

expansion_ratio
  The ratio to expand the hidden dimension.
  Type: float

activation_fn
  The activation function to use.
  Type: tricycle.layers.Layer

linear_1
  The first linear layer.
  Type: tricycle.layers.Dense

linear_2
  The second linear layer.
  Type: tricycle.layers.Dense

Parameters:
  • embedding_dim (int) – The dimension of the input embedding.

  • dropout_prob (float) – The probability of dropout.

  • expansion_ratio (float) – The ratio to expand the hidden dimension. Defaults to 4.

  • activation_fn (Layer) – The activation function to use. Can be a Layer object or a string. Defaults to GeLU().
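A minimal usage sketch (assumptions: the class is importable from tricycle.blocks, Tensor lives in tricycle.tensor and wraps a NumPy array, and inputs are laid out as (batch, tokens, embedding_dim); adjust the imports to the actual package layout):

    import numpy as np

    from tricycle.activation import GeLU
    from tricycle.blocks import FeedForward      # assumed module path for this page
    from tricycle.tensor import Tensor           # assumed Tensor location and constructor

    # Hidden dimension = expansion_ratio * embedding_dim = 4 * 64 = 256
    block = FeedForward(embedding_dim=64, dropout_prob=0.1, expansion_ratio=4, activation_fn=GeLU())

    x = Tensor(np.random.rand(2, 8, 64))         # assumed (batch, tokens, embedding_dim) layout
    out = block.forward(x)                       # output keeps the input's embedding shape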

activation_fn: Layer
dropout_prob: float
embedding_dim: int
expansion_ratio: float
forward(x)[source]

Forward pass of the FeedForward layer.

Parameters:
  x (Tensor) – Input tensor.

Returns:
  The output tensor after passing through the feed-forward block.

Return type:
  Tensor

from_gpu()[source]

Move the layer from the GPU to the CPU.

Returns:
  The FeedForward layer moved to the CPU.

Return type:
  FeedForward

linear_1: Dense
linear_2: Dense
to_gpu(device=0)[source]

Move the layer to the GPU.

Parameters:
  device (int) – The GPU device number to move the layer to. Defaults to 0.

Returns:
  The FeedForward layer moved to the GPU.

Return type:
  FeedForward
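A hedged sketch of moving a block between devices; it assumes a CUDA-capable machine with tricycle's GPU backend installed, and it reuses the block from the sketch above:

    block.to_gpu(device=0)   # parameters now live on GPU 0
    # ... run forward passes with GPU-resident tensors here ...
    block.from_gpu()         # bring the parameters back to the CPU

Both methods return the FeedForward layer itself (see the return types above), so the calls can also be chained.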

update(optimiser)[source]

Update the parameters of the layer using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating the parameters.

Returns:
  The updated FeedForward layer.

Return type:
  FeedForward

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

Returns:
  The FeedForward layer with zeroed gradients.

Return type:
  FeedForward
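A hedged sketch of the update cycle these methods are designed for; the optimiser class, its import path and constructor, and the loss step are all assumptions about the wider library rather than facts from this page:

    import numpy as np

    from tricycle.blocks import FeedForward
    from tricycle.optimisers import StochasticGradientDescent  # assumed import path and class name
    from tricycle.tensor import Tensor                          # assumed Tensor location

    block = FeedForward(embedding_dim=16, dropout_prob=0.0)
    optimiser = StochasticGradientDescent(learning_rate=1e-2)   # assumed constructor signature

    x = Tensor(np.random.rand(4, 8, 16))                        # assumed Tensor constructor

    block.zero_grad()            # 1. clear gradients accumulated by earlier steps
    out = block.forward(x)       # 2. forward pass through the block
    loss = compute_loss(out)     # 3. hypothetical loss function standing in for one from the library
    loss.backward()              # 4. assumed: the loss tensor exposes backward() to populate gradients
    block.update(optimiser)      # 5. let the optimiser update the block's parameters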

class GPT2TransformerBlock(embedding_dim, n_heads, context_window, expansion_ratio=4, activation_fn=GeLU(), norm_fn='layer_norm', residual_dropout_prob=0, linear_dropout_prob=0)[source]

Bases: Layer

A GPT-2 style transformer block.

This block combines multi-head self-attention with an MLP block and includes normalization and residual connections.

Parameters:
  • embedding_dim (int)

  • n_heads (int)

  • context_window (int)

  • expansion_ratio (float)

  • activation_fn (Layer)

  • norm_fn (Literal['layer_norm'] | Literal['rms_norm'])

  • residual_dropout_prob (float)

  • linear_dropout_prob (float)
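A minimal construction sketch (hedged: the tricycle.blocks and tricycle.tensor import paths and the (batch, tokens, embedding_dim) input layout are assumptions):

    import numpy as np

    from tricycle.blocks import GPT2TransformerBlock  # assumed module path
    from tricycle.tensor import Tensor                # assumed Tensor location

    block = GPT2TransformerBlock(
        embedding_dim=128,
        n_heads=4,
        context_window=32,
        norm_fn="layer_norm",
        residual_dropout_prob=0.1,
        linear_dropout_prob=0.1,
    )

    x = Tensor(np.random.rand(2, 32, 128))  # assumed (batch, tokens, embedding_dim) layout
    out = block.forward(x)                  # same shape as the input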

embedding_dim
  The dimension of the input embeddings.
  Type: int

expansion_ratio
  The ratio for expanding the hidden dimension in the MLP block.
  Type: float

activation_fn
  The activation function to use in the MLP block.
  Type: Layer

residual_dropout_prob
  The dropout probability for residual connections.
  Type: float

linear_dropout_prob
  The dropout probability for the MLP block.
  Type: float

activation_fn: Layer
embedding_dim: int
expansion_ratio: float
forward(x)[source]

Perform a forward pass through the GPT-2 transformer block.

Parameters:
  x (Tensor) – The input tensor.

Returns:
  The output tensor after applying the transformer block.

Return type:
  Tensor

from_gpu()[source]

Move the layer’s parameters from the GPU to the CPU.

linear_dropout_prob: float
residual_dropout_prob: float
to_gpu(device=0)[source]

Move the layer’s parameters to the GPU.

Parameters:
  device (int, optional) – The GPU device number. Defaults to 0.

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

class MLPBlock(embedding_dim, dropout_prob, expansion_ratio=4, activation_fn=GeLU())[source]

Bases: Layer

A simple GPT-2 style MLP block: two linear layers around an activation function.

The hidden dimension has size expansion_ratio * embedding_dim.

Parameters:
  • embedding_dim (int)

  • dropout_prob (float)

  • expansion_ratio (float)

  • activation_fn (Layer)
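A brief construction sketch mirroring the FeedForward example above (the tricycle.blocks and tricycle.tensor import paths remain assumptions):

    import numpy as np

    from tricycle.blocks import MLPBlock    # assumed module path
    from tricycle.tensor import Tensor      # assumed Tensor location

    # Default GeLU activation; hidden dimension = 4 * 64 = 256
    mlp = MLPBlock(embedding_dim=64, dropout_prob=0.1)
    out = mlp.forward(Tensor(np.random.rand(2, 8, 64)))  # embedding shape preserved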

embedding_dim
  The dimension of the input embeddings.
  Type: int

dropout_prob
  The dropout probability.
  Type: float

expansion_ratio
  The ratio for expanding the hidden dimension.
  Type: float

activation_fn
  The activation function to use.
  Type: Layer

linear_1
  The first linear layer.
  Type: Dense

linear_2
  The second linear layer.
  Type: Dense

activation_fn: Layer
dropout_prob: float
embedding_dim: int
expansion_ratio: float
forward(x)[source]

Perform a forward pass through the MLP block.

Parameters:
  x (Tensor) – The input tensor.

Returns:
  The output tensor after applying the MLP block.

Return type:
  Tensor

from_gpu()[source]

Move the layer’s parameters from the GPU to the CPU.

Returns:
  The MLPBlock instance with parameters moved to CPU.

Return type:
  MLPBlock

linear_1: Dense
linear_2: Dense
to_gpu(device=0)[source]

Move the layer’s parameters to the GPU.

Parameters:
  device (int, optional) – The GPU device number. Defaults to 0.

Returns:
  The MLPBlock instance with parameters moved to GPU.

Return type:
  MLPBlock

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating parameters.

Returns:
  The updated MLPBlock instance.

Return type:
  MLPBlock

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

Returns:
  The MLPBlock instance with zeroed gradients.

Return type:
  MLPBlock

class MultiHeadSelfAttention(embedding_dim, n_heads, context_window, residual_dropout_prob=0.0, initialiser=init_xavier)[source]

Bases: Layer

Multi-head self-attention layer.

This layer implements the multi-head self-attention mechanism used in transformer models.

Parameters:
  • embedding_dim (int)

  • n_heads (int)

  • context_window (int)

  • residual_dropout_prob (float)
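A minimal usage sketch (hedged: the import paths and the (batch, tokens, embedding_dim) input layout are assumptions; embedding_dim is typically chosen divisible by n_heads):

    import numpy as np

    from tricycle.blocks import MultiHeadSelfAttention  # assumed module path
    from tricycle.tensor import Tensor                  # assumed Tensor location

    attention = MultiHeadSelfAttention(
        embedding_dim=128,
        n_heads=4,
        context_window=32,
    )

    x = Tensor(np.random.rand(2, 32, 128))  # assumed (batch, tokens, embedding_dim) layout
    out = attention.forward(x)              # same shape as the input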

embedding_dim
  The dimension of the input embeddings.
  Type: int

n_heads
  The number of attention heads.
  Type: int

context_window
  The size of the context window.
  Type: int

context_window: int
embedding_dim: int
forward(tensor)[source]

Perform a forward pass through the multi-head self-attention layer.

Parameters:
  tensor (Tensor) – The input tensor.

Returns:
  The output tensor after applying multi-head self-attention.

Return type:
  Tensor

from_gpu()[source]

Move the layer’s parameters from the GPU to the CPU.

n_heads: int
to_gpu(device=0)[source]

Move the layer’s parameters to the GPU.

Parameters:
  device (int, optional) – The GPU device number. Defaults to 0.

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

build_mask(context_window)[source]

Build a causal attention mask that prevents the model from attending to future tokens.

Parameters:
  context_window (int) – The size of the context window.

Returns:
  A mask tensor with shape (context_window, context_window).

Return type:
  Tensor
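For intuition, an equivalent causal mask can be built in NumPy; the exact values tricycle stores (boolean flags vs. -inf entries) are not specified on this page, so this is only an illustration of the lower-triangular structure:

    import numpy as np

    context_window = 4
    # Positions above the diagonal (future tokens) are masked out.
    causal_mask = np.triu(np.ones((context_window, context_window), dtype=bool), k=1)
    # causal_mask[i, j] is True when token i would otherwise attend to a future token j > i.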

masked_fill(tensor, mask_shape, full_mask)[source]

Apply an attention mask to a tensor.

Parameters:
  • tensor (Tensor) – The input tensor to be masked.

  • mask_shape (tuple[int, int]) – The shape of the mask to be applied.

  • full_mask (Tensor) – The full mask tensor.

Returns:
  The masked tensor.

Return type:
  Tensor
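For intuition only: the usual masked-fill step in attention sets masked scores to a large negative value before the softmax, so masked positions receive (approximately) zero attention weight. Whether tricycle uses -inf or a finite sentinel is not stated here, so treat this NumPy sketch as illustrative rather than as the library's implementation:

    import numpy as np

    scores = np.random.rand(4, 4)                        # raw attention scores
    mask = np.triu(np.ones((4, 4), dtype=bool), k=1)     # True above the diagonal (future tokens)
    masked_scores = np.where(mask, -np.inf, scores)      # masked positions -> -inf

    # Softmax over the last axis now assigns zero weight to masked positions.
    weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)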