blocks

Several layers can be grouped together into a single layer called a block.

This module provides various block implementations used in transformer-based models, including multi-head self-attention, MLP blocks, and transformer blocks.

class FeedForward(embedding_dim, dropout_prob, expansion_ratio=4, activation_fn=GeLU())[source]

Bases: Layer

A simple LLaMA-style feed-forward block: two linear layers around a SwiGLU activation function.

The hidden dimension has size expansion_ratio * embedding_dim.

embedding_dim
  The dimension of the input embedding.
  Type: int

dropout_prob
  The probability of dropout.
  Type: float

expansion_ratio
  The ratio to expand the hidden dimension.
  Type: float

activation_fn
  The activation function to use.
  Type: tricycle.layers.Layer

linear_1
  The first linear layer.
  Type: tricycle.layers.Dense

linear_2
  The second linear layer.
  Type: tricycle.layers.Dense

Parameters:
  • embedding_dim (int) – The dimension of the input embedding.

  • dropout_prob (float) – The probability of dropout.

  • expansion_ratio (float) – The ratio to expand the hidden dimension. Defaults to 4.

  • activation_fn (Layer) – The activation function to use. Can be a Layer object or a string. Defaults to GeLU().
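A minimal usage sketch (assumptions: the class is importable from tricycle.blocks, Tensor lives in tricycle.tensor and wraps a NumPy array, and inputs are laid out as (batch, tokens, embedding_dim); adjust the imports to the actual package layout):

    import numpy as np

    from tricycle.activation import GeLU
    from tricycle.blocks import FeedForward      # assumed module path for this page
    from tricycle.tensor import Tensor           # assumed Tensor location and constructor

    # Hidden dimension = expansion_ratio * embedding_dim = 4 * 64 = 256
    block = FeedForward(embedding_dim=64, dropout_prob=0.1, expansion_ratio=4, activation_fn=GeLU())

    x = Tensor(np.random.rand(2, 8, 64))         # assumed (batch, tokens, embedding_dim) layout
    out = block.forward(x)                       # output keeps the input's embedding shape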

activation_fn: Layer
dropout_prob: float
embedding_dim: int
expansion_ratio: float
forward(x)[source]

Forward pass of the FeedForward layer.

Parameters:
  x (Tensor) – Input tensor.

Returns:
  The output tensor after passing through the feed-forward block.

Return type:
  Tensor

from_gpu()[source]

Move the layer from the GPU to the CPU.

Returns:
  The FeedForward layer moved to the CPU.

Return type:
  FeedForward

linear_1: Dense
linear_2: Dense
to_gpu(device=0)[source]

Move the layer to the GPU.

Parameters:
  device (int) – The GPU device number to move the layer to. Defaults to 0.

Returns:
  The FeedForward layer moved to the GPU.

Return type:
  FeedForward
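A hedged sketch of moving a block between devices; it assumes a CUDA-capable machine with tricycle's GPU backend installed, and it reuses the block from the sketch above:

    block.to_gpu(device=0)   # parameters now live on GPU 0
    # ... run forward passes with GPU-resident tensors here ...
    block.from_gpu()         # bring the parameters back to the CPU

Both methods return the FeedForward layer itself (see the return types above), so the calls can also be chained.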

update(optimiser)[source]

Update the parameters of the layer using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating the parameters.

Returns:
  The updated FeedForward layer.

Return type:
  FeedForward

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

Returns:
  The FeedForward layer with zeroed gradients.

Return type:
  FeedForward
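A hedged sketch of the update cycle these methods are designed for; the optimiser class, its import path and constructor, and the loss step are all assumptions about the wider library rather than facts from this page:

    import numpy as np

    from tricycle.blocks import FeedForward
    from tricycle.optimisers import StochasticGradientDescent  # assumed import path and class name
    from tricycle.tensor import Tensor                          # assumed Tensor location

    block = FeedForward(embedding_dim=16, dropout_prob=0.0)
    optimiser = StochasticGradientDescent(learning_rate=1e-2)   # assumed constructor signature

    x = Tensor(np.random.rand(4, 8, 16))                        # assumed Tensor constructor

    block.zero_grad()            # 1. clear gradients accumulated by earlier steps
    out = block.forward(x)       # 2. forward pass through the block
    loss = compute_loss(out)     # 3. hypothetical loss function standing in for one from the library
    loss.backward()              # 4. assumed: the loss tensor exposes backward() to populate gradients
    block.update(optimiser)      # 5. let the optimiser update the block's parameters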

class GPT2TransformerBlock(embedding_dim, n_heads, context_window, expansion_ratio=4, activation_fn=GeLU(), norm_fn='layer_norm', residual_dropout_prob=0, linear_dropout_prob=0)[source]

Bases: Layer

A GPT-2 style transformer block.

This block combines multi-head self-attention with an MLP block and includes normalization and residual connections.

Parameters:
  • embedding_dim (int)

  • n_heads (int)

  • context_window (int)

  • expansion_ratio (float)

  • activation_fn (Layer)

  • norm_fn (Literal['layer_norm'] | Literal['rms_norm'])

  • residual_dropout_prob (float)

  • linear_dropout_prob (float)
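A minimal construction sketch (hedged: the tricycle.blocks and tricycle.tensor import paths and the (batch, tokens, embedding_dim) input layout are assumptions):

    import numpy as np

    from tricycle.blocks import GPT2TransformerBlock  # assumed module path
    from tricycle.tensor import Tensor                # assumed Tensor location

    block = GPT2TransformerBlock(
        embedding_dim=128,
        n_heads=4,
        context_window=32,
        norm_fn="layer_norm",
        residual_dropout_prob=0.1,
        linear_dropout_prob=0.1,
    )

    x = Tensor(np.random.rand(2, 32, 128))  # assumed (batch, tokens, embedding_dim) layout
    out = block.forward(x)                  # same shape as the input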

embedding_dim
  The dimension of the input embeddings.
  Type: int

expansion_ratio
  The ratio for expanding the hidden dimension in the MLP block.
  Type: float

activation_fn
  The activation function to use in the MLP block.
  Type: Layer

residual_dropout_prob
  The dropout probability for residual connections.
  Type: float

linear_dropout_prob
  The dropout probability for the MLP block.
  Type: float

activation_fn: Layer
embedding_dim: int
expansion_ratio: float
forward(x)[source]

Perform a forward pass through the GPT-2 transformer block.

Parameters:
  x (Tensor) – The input tensor.

Returns:
  The output tensor after applying the transformer block.

Return type:
  Tensor

from_gpu()[source]

Move the layer’s parameters from the GPU to the CPU.

linear_dropout_prob: float
residual_dropout_prob: float
to_gpu(device=0)[source]

Move the layer’s parameters to the GPU.

Parameters:
  device (int, optional) – The GPU device number. Defaults to 0.

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

class MLPBlock(embedding_dim, dropout_prob, expansion_ratio=4, activation_fn=GeLU())[source]

Bases: Layer

A simple GPT-2 style MLP block: two linear layers around an activation function.

The hidden dimension has size expansion_ratio * embedding_dim.

Parameters:
  • embedding_dim (int)

  • dropout_prob (float)

  • expansion_ratio (float)

  • activation_fn (Layer)
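A brief construction sketch mirroring the FeedForward example above (the tricycle.blocks and tricycle.tensor import paths remain assumptions):

    import numpy as np

    from tricycle.blocks import MLPBlock    # assumed module path
    from tricycle.tensor import Tensor      # assumed Tensor location

    # Default GeLU activation; hidden dimension = 4 * 64 = 256
    mlp = MLPBlock(embedding_dim=64, dropout_prob=0.1)
    out = mlp.forward(Tensor(np.random.rand(2, 8, 64)))  # embedding shape preserved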

embedding_dim
  The dimension of the input embeddings.
  Type: int

dropout_prob
  The dropout probability.
  Type: float

expansion_ratio
  The ratio for expanding the hidden dimension.
  Type: float

activation_fn
  The activation function to use.
  Type: Layer

linear_1
  The first linear layer.
  Type: Dense

linear_2
  The second linear layer.
  Type: Dense

activation_fn: Layer
dropout_prob: float
embedding_dim: int
expansion_ratio: float
forward(x)[source]

Perform a forward pass through the MLP block.

Parameters:
  x (Tensor) – The input tensor.

Returns:
  The output tensor after applying the MLP block.

Return type:
  Tensor

from_gpu()[source]

Move the layer’s parameters from the GPU to the CPU.

Returns:
  The MLPBlock instance with parameters moved to CPU.

Return type:
  MLPBlock

linear_1: Dense
linear_2: Dense
to_gpu(device=0)[source]

Move the layer’s parameters to the GPU.

Parameters:
  device (int, optional) – The GPU device number. Defaults to 0.

Returns:
  The MLPBlock instance with parameters moved to GPU.

Return type:
  MLPBlock

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating parameters.

Returns:
  The updated MLPBlock instance.

Return type:
  MLPBlock

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

Returns:
  The MLPBlock instance with zeroed gradients.

Return type:
  MLPBlock

class MultiHeadSelfAttention(embedding_dim, n_heads, context_window, residual_dropout_prob=0.0, initialiser=init_xavier)[source]

Bases: Layer

Multi-head self-attention layer.

This layer implements the multi-head self-attention mechanism used in transformer models.

Parameters:
  • embedding_dim (int)

  • n_heads (int)

  • context_window (int)

  • residual_dropout_prob (float)
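A minimal usage sketch (hedged: the import paths and the (batch, tokens, embedding_dim) input layout are assumptions; embedding_dim is typically chosen divisible by n_heads):

    import numpy as np

    from tricycle.blocks import MultiHeadSelfAttention  # assumed module path
    from tricycle.tensor import Tensor                  # assumed Tensor location

    attention = MultiHeadSelfAttention(
        embedding_dim=128,
        n_heads=4,
        context_window=32,
    )

    x = Tensor(np.random.rand(2, 32, 128))  # assumed (batch, tokens, embedding_dim) layout
    out = attention.forward(x)              # same shape as the input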

embedding_dim
  The dimension of the input embeddings.
  Type: int

n_heads
  The number of attention heads.
  Type: int

context_window
  The size of the context window.
  Type: int

context_window: int
embedding_dim: int
forward(tensor)[source]

Perform a forward pass through the multi-head self-attention layer.

Parameters:
  tensor (Tensor) – The input tensor.

Returns:
  The output tensor after applying multi-head self-attention.

Return type:
  Tensor

from_gpu()[source]

Move the layer’s parameters from the GPU to the CPU.

n_heads: int
to_gpu(device=0)[source]

Move the layer’s parameters to the GPU.

Parameters:
  device (int, optional) – The GPU device number. Defaults to 0.

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:
  optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]

Zero out the gradients of the layer’s parameters.

build_mask(context_window)[source]

Build a causal attention mask that prevents the model from attending to future tokens.

Parameters:
  context_window (int) – The size of the context window.

Returns:
  A mask tensor with shape (context_window, context_window).

Return type:
  Tensor
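For intuition, an equivalent causal mask can be built in NumPy; the exact values tricycle stores (boolean flags vs. -inf entries) are not specified on this page, so this is only an illustration of the lower-triangular structure:

    import numpy as np

    context_window = 4
    # Positions above the diagonal (future tokens) are masked out.
    causal_mask = np.triu(np.ones((context_window, context_window), dtype=bool), k=1)
    # causal_mask[i, j] is True when token i would otherwise attend to a future token j > i.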

masked_fill(tensor, mask_shape, full_mask)[source]

Apply an attention mask to a tensor.

Parameters:
  • tensor (Tensor) – The input tensor to be masked.

  • mask_shape (tuple[int, int]) – The shape of the mask to be applied.

  • full_mask (Tensor) – The full mask tensor.

Returns:
  The masked tensor.

Return type:
  Tensor
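For intuition only: the usual masked-fill step in attention sets masked scores to a large negative value before the softmax, so masked positions receive (approximately) zero attention weight. Whether tricycle uses -inf or a finite sentinel is not stated here, so treat this NumPy sketch as illustrative rather than as the library's implementation:

    import numpy as np

    scores = np.random.rand(4, 4)                        # raw attention scores
    mask = np.triu(np.ones((4, 4), dtype=bool), k=1)     # True above the diagonal (future tokens)
    masked_scores = np.where(mask, -np.inf, scores)      # masked positions -> -inf

    # Softmax over the last axis now assigns zero weight to masked positions.
    weights = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)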