layers

class Dense(from_size, to_size, initialiser=<function init_xavier>, name=None)[source]

Bases: Layer

A dense (fully connected) layer.

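Parameters:
  • from_size (int) – Input size.

  • to_size (int) – Output size.

  • initialiser – Function used to initialise the weights. Defaults to init_xavier.

  • name (str | None) – Optional name for the layer.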

weights

The weight matrix.

Type:

Tensor

from_size

Input size.

Type:

int

to_size

Output size.

Type:

int

name

Optional name for the layer.

Type:

str | None

forward(tensor)[source]

Perform the forward pass of the dense layer.

Parameters:

tensor (Tensor) – Input tensor.

Returns:

Output of the dense layer.

Return type:

Tensor
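As a rough illustration of what a dense forward pass computes, the sketch below multiplies a batch of inputs by a weight matrix in plain Python. It is not this library's implementation: the function name dense_forward, the (from_size, to_size) weight layout, and the absence of a bias term are assumptions (the class documents only a weights attribute).

```python
def dense_forward(x, weights):
    """Multiply a batch of row vectors by a weight matrix.

    x:       list of rows, each of length from_size
    weights: list of from_size rows, each of length to_size (assumed layout)
    """
    from_size = len(weights)
    to_size = len(weights[0])
    out = []
    for row in x:
        if len(row) != from_size:
            raise ValueError("input row length must equal from_size")
        out.append(
            [sum(row[i] * weights[i][j] for i in range(from_size)) for j in range(to_size)]
        )
    return out


# Two inputs of size 3 mapped to outputs of size 2.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
batch = [[1.0, 2.0, 3.0], [0.0, 0.0, 1.0]]
result = dense_forward(batch, weights)
```

Each output row has to_size entries, one per column of the weight matrix.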

from_gpu()[source]

Move the layer from GPU to CPU.

Returns:

The layer itself.

Return type:

Dense

from_size: int

grad_back_fn(grad)[source]

Compute gradients with respect to input.

Parameters:

grad (Tensor) – Gradient from the next layer.

Returns:

Gradient with respect to input.

Return type:

Tensor

name: str | None

to_gpu(device=0)[source]

Move the layer to GPU.

Parameters:

device (int) – The GPU device number. Defaults to 0.

Returns:

The layer itself.

Return type:

Dense

to_size: int

update(optimiser)[source]

Update the weights using the given optimiser.

Parameters:

optimiser (Optimiser) – The optimiser to use for updating weights.

weight_back_fn(grad)[source]

Compute gradients with respect to weights.

Parameters:

grad (Tensor) – Gradient from the next layer.

Returns:

Gradient with respect to weights.

Return type:

Tensor
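For a layer that computes y = xW, the two gradients described by grad_back_fn and weight_back_fn are conventionally grad·Wᵀ and xᵀ·grad. The sketch below shows that arithmetic in plain Python. The helper names (matmul, transpose, input_grad, weight_grad) and the (from_size, to_size) weight layout are assumptions for illustration, not the library's code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def input_grad(grad, weights):
    """Gradient with respect to the input: grad @ weights^T."""
    return matmul(grad, transpose(weights))


def weight_grad(x, grad):
    """Gradient with respect to the weights: x^T @ grad."""
    return matmul(transpose(x), grad)


x = [[1.0, 2.0, 3.0]]            # one input of size 3
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grad = [[1.0, 1.0]]              # gradient arriving from the next layer

dx = input_grad(grad, weights)   # same shape as x
dw = weight_grad(x, grad)        # same shape as weights
```

Each gradient has the same shape as the quantity it differentiates with respect to, which is a quick sanity check when writing a backward function.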

weights: Tensor

zero_grad()[source]

Reset gradients to zero.

class Dropout(probability)[source]

Bases: Layer

A dropout layer for regularization.

Parameters:

probability (float) – The probability of dropping out a unit.

probability

The probability of dropping out a unit.

Type:

float

forward(tensor)[source]

Perform the forward pass of the dropout layer.

Parameters:

tensor (Tensor) – Input tensor.

Returns:

Output tensor with dropout applied.

Return type:

Tensor
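A minimal sketch of dropout in plain Python follows. It uses the common "inverted dropout" convention, in which surviving values are rescaled by 1 / (1 - probability). Whether this library rescales in the same way is an assumption, and the function name dropout_forward and its rng argument are hypothetical.

```python
import random


def dropout_forward(values, probability, rng=None):
    """Zero each value with the given probability and rescale the survivors.

    Rescaling by 1 / (1 - probability) is the "inverted dropout" convention;
    that this library does the same is an assumption.
    """
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    if rng is None:
        rng = random.Random()
    keep_scale = 1.0 / (1.0 - probability)
    return [0.0 if rng.random() < probability else v * keep_scale for v in values]


rng = random.Random(0)  # seeded so the example is repeatable
out = dropout_forward([1.0] * 1000, probability=0.25, rng=rng)
dropped = sum(1 for v in out if v == 0.0)
```

With rescaling, the expected value of each output equals its input, so no extra correction is needed once dropout is switched off.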

class Embedding(from_size, to_size, name=None, initialiser=<function init_xavier>)[source]

Bases: Layer

Embedding layer that converts indices to dense vectors.

This layer implements a lookup-based embedding, converting input indices to dense vector representations.

Parameters:
  • from_size (int)

  • to_size (int)

  • name (str | None)

weights

The embedding matrix.

Type:

Tensor

vocab_size

Size of the vocabulary (number of embeddings).

Type:

int

back_fn(grad)[source]

Computes the gradient with respect to the embedding weights.

Parameters:

grad (Tensor) – The gradient tensor.

Returns:

The gradient with respect to the embedding weights.

Return type:

Tensor

forward(tensor)[source]

Performs the embedding lookup.

Parameters:

tensor (Tensor) – Input tensor containing indices to be embedded.

Returns:

The embedded representation of the input indices.

Return type:

Tensor
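The lookup and its weight gradient can be sketched in plain Python as below. This is an illustration rather than the library's code: the function names are hypothetical, and the gradient is assumed to be accumulated row by row into a matrix with the same shape as the embedding matrix.

```python
def embedding_forward(indices, weights):
    """Return the row of `weights` selected by each index."""
    return [list(weights[i]) for i in indices]


def embedding_weight_grad(indices, grad, vocab_size):
    """Accumulate each incoming gradient row into the row it was looked up from."""
    dim = len(grad[0])
    out = [[0.0] * dim for _ in range(vocab_size)]
    for idx, g in zip(indices, grad):
        for j in range(dim):
            out[idx][j] += g[j]
    return out


weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # vocab_size 3, vector size 2
indices = [2, 0, 2]

embedded = embedding_forward(indices, weights)
grad = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]      # one gradient row per index
dw = embedding_weight_grad(indices, grad, vocab_size=3)
```

Index 2 appears twice, so its row receives the sum of both gradient rows, while row 1 was never looked up and stays at zero.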

from_gpu()[source]

Moves the embedding weights from GPU to CPU.

Returns:

The embedding layer with weights moved to CPU.

Return type:

Embedding

to_gpu(device=0)[source]

Moves the embedding weights to the GPU.

Parameters:

device (int) – The GPU device number. Defaults to 0.

Returns:

The embedding layer with weights moved to GPU.

Return type:

Embedding

update(optimiser)[source]

Updates the embedding weights using the given optimiser.

Parameters:

optimiser (Optimiser) – The optimiser to use for updating weights.

zero_grad()[source]

Resets the gradient of the weights to None.

class Layer[source]

Bases: ABC

A generic Layer object, representing a single operation in a neural network.

tensors

Dictionary of tensors used in the layer.

Type:

dict[str, Tensor]

layers

Sequence of sub-layers, if any.

Type:

Sequence[Layer]

abstract forward(tensor)[source]

Perform the forward pass of the layer.

Parameters:

tensor (Tensor) – Input tensor.

Raises:

NotImplementedError – Raised by the base implementation; subclasses must override this method.

from_gpu()[source]

Move the layer from GPU to CPU.

layers: Sequence[Layer] = []

tensors: dict[str, Tensor] = {}

to_gpu(device=0)[source]

Move the layer to GPU.

Parameters:

device (int) – The GPU device number. Defaults to 0.

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:

optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]

Reset gradients to zero.
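To show how an abstract base class with an abstract forward method is typically subclassed, here is a small sketch using only the standard library. BaseLayer and Scale are hypothetical stand-ins, not this library's classes, and the no-op zero_grad default is an assumption.

```python
from abc import ABC, abstractmethod


class BaseLayer(ABC):
    """Hypothetical stand-in for the documented Layer base class."""

    @abstractmethod
    def forward(self, tensor):
        raise NotImplementedError

    def zero_grad(self):
        # Assumed default: nothing to reset for layers without parameters.
        pass


class Scale(BaseLayer):
    """Example subclass: multiplies every input value by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, tensor):
        return [v * self.factor for v in tensor]


doubled = Scale(2.0).forward([1.0, 2.0, 3.0])

try:
    BaseLayer()          # abstract classes cannot be instantiated directly
    instantiated = True
except TypeError:
    instantiated = False
```

In Python, a class that inherits from ABC and leaves an abstract method unimplemented raises TypeError at instantiation, before forward is ever called.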

class LayerNorm(embedding_dim, eps=1e-05)[source]

Bases: Layer

A Layer Normalization layer.

Parameters:
  • embedding_dim (int)

  • eps (float) – A small value added for numerical stability. Defaults to 1e-05.

eps

A small value added for numerical stability.

Type:

float

gamma

Scale parameter.

Type:

Tensor

beta

Shift parameter.

Type:

Tensor

back_fn(grad)[source]

Compute gradients with respect to input.

Parameters:

grad (Tensor) – Gradient from the next layer.

Returns:

Gradient with respect to input.

Return type:

Tensor

beta_back_fn(grad)[source]

Compute gradients with respect to beta.

Parameters:

grad (Tensor) – Gradient from the next layer.

Returns:

Gradient with respect to beta.

Return type:

Tensor

forward(tensor)[source]

Perform the forward pass of the layer normalization.

Parameters:

tensor (Tensor) – Input tensor of shape (batch_size, *).

Returns:

Normalized tensor of the same shape as input.

Return type:

Tensor
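The computation behind layer normalization can be sketched for a single vector as follows. This assumes normalization over the last dimension with the population variance and per-feature gamma and beta; the function name layer_norm is hypothetical and the library's implementation may differ in detail.

```python
import math


def layer_norm(row, gamma, beta, eps=1e-5):
    """Normalise one vector to zero mean and unit variance, then scale and shift."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / n   # population variance
    inv_std = 1.0 / math.sqrt(var + eps)          # eps keeps the division stable
    return [g * (v - mean) * inv_std + b for v, g, b in zip(row, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(row, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With gamma set to ones and beta to zeros, the output has mean close to 0 and variance close to 1.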

from_gpu()[source]

Move the layer from GPU to CPU.

Returns:

The layer itself.

Return type:

LayerNorm

gamma_back_fn(grad)[source]

Compute gradients with respect to gamma.

Parameters:

grad (Tensor) – Gradient from the next layer.

Returns:

Gradient with respect to gamma.

Return type:

Tensor
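If the layer output is taken to be gamma * x_hat + beta, where x_hat is the normalized input, the parameter gradients reduce to sums over the batch. The sketch below shows this in plain Python. The output formula, the per-feature parameter shape, and the function names are assumptions for illustration only.

```python
def beta_grad(grad):
    """Sum the incoming gradient over the batch, one value per feature."""
    return [sum(col) for col in zip(*grad)]


def gamma_grad(grad, normalised):
    """Sum grad * normalised input over the batch, one value per feature."""
    return [
        sum(g * x for g, x in zip(g_col, x_col))
        for g_col, x_col in zip(zip(*grad), zip(*normalised))
    ]


normalised = [[-1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]   # x_hat for a batch of 3
grad = [[1.0, 2.0], [3.0, 4.0], [2.0, 2.0]]           # gradient from the next layer

d_beta = beta_grad(grad)
d_gamma = gamma_grad(grad, normalised)
```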

to_gpu(device=0)[source]

Move the layer to GPU.

Parameters:

device (int) – The GPU device number. Defaults to 0.

Returns:

The layer itself.

Return type:

LayerNorm

update(optimiser)[source]

Update the layer’s parameters using the given optimiser.

Parameters:

optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]

Reset gradients to zero.

class RMSNorm(embedding_dim, REALLY_SMALL_NUMBER=0.0001)[source]

Bases: Layer

Root Mean Square Layer Normalization.

This class implements RMSNorm, a normalization technique that normalizes the inputs using the root mean square.

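Parameters:
  • embedding_dim (int) – The size of the input’s last dimension.

  • REALLY_SMALL_NUMBER (float) – A small constant to avoid division by zero. Defaults to 0.0001.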

embedding_dim

The size of the input’s last dimension.

Type:

int

REALLY_SMALL_NUMBER

A small constant to avoid division by zero.

Type:

float

weights

Learnable scale parameters.

Type:

Tensor

back_fn(grad)[source]

Computes the gradient with respect to the input.

Parameters:

grad (Tensor) – The gradient tensor.

Returns:

The gradient with respect to the input.

Return type:

Tensor

forward(tensor)[source]

Applies RMS normalization to the input tensor.

Parameters:

tensor (Tensor) – Input tensor to be normalized.

Returns:

The normalized output tensor.

Return type:

Tensor
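As a sketch of the arithmetic, RMS normalization divides each vector by the root mean square of its elements and then applies a learnable scale. The function name rms_norm, and the choice to add the small constant inside the square root, are assumptions; the library may place it differently.

```python
import math


def rms_norm(row, weights, eps=1e-4):
    """Divide a vector by its root mean square, then apply a learnable scale."""
    mean_square = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)  # eps avoids division by zero
    return [w * v * inv_rms for v, w in zip(row, weights)]


# eps=0.0 here only so the result can be checked exactly by hand.
out = rms_norm([3.0, 4.0], weights=[1.0, 1.0], eps=0.0)
```

Unlike layer normalization, no mean is subtracted, so the direction of the input vector is preserved and only its scale changes.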

from_gpu()[source]

Moves the layer’s parameters from GPU to CPU.

Returns:

The layer with parameters moved to CPU.

Return type:

RMSNorm

to_gpu(device=0)[source]

Moves the layer’s parameters to the GPU.

Parameters:

device (int) – The GPU device number. Defaults to 0.

Returns:

The layer with parameters moved to GPU.

Return type:

RMSNorm

update(optimiser)[source]

Updates the layer’s parameters using the given optimiser.

Parameters:

optimiser (Optimiser) – The optimiser to use for updating parameters.

weight_back_fn(grad)[source]

Computes the gradient with respect to the weights.

Parameters:

grad (Tensor) – The gradient tensor.

Returns:

The gradient with respect to the weights.

Return type:

Tensor

zero_grad()[source]

Resets the gradient of the weights to None.

class RotaryEncode(embedding_dim, n_heads, context_window, theta=None)[source]

Bases: Layer

Applies rotary positional encoding to a key and query.

This layer implements the Rotary Position Embedding (RoPE) technique for transformer models.

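Parameters:
  • embedding_dim (int) – The size of the embedding dimension.

  • n_heads (int) – The number of attention heads.

  • context_window (int) – The size of the context window.

  • theta (float | None) – The base value for frequency calculation. Defaults to None.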

embedding_dim

The size of the embedding dimension.

Type:

int

n_heads

The number of attention heads.

Type:

int

context_window

The size of the context window.

Type:

int

theta

The base value for frequency calculation.

Type:

float

head_size

The size of each attention head.

Type:

int

freqs_cos

Precomputed cosine of frequencies.

Type:

ArrayLike

freqs_sin

Precomputed sine of frequencies.

Type:

ArrayLike

backward(grad)[source]

Computes the gradient for the rotary encoding operation.

Parameters:

grad (Tensor) – The gradient tensor.

Returns:

The gradient with respect to the input.

Return type:

Tensor

context_window: int

embedding_dim: int

forward(tensor)[source]

Applies rotary positional encoding to the input tensor.

Parameters:

tensor (Tensor) – The input tensor.

Returns:

The tensor with rotary positional encoding applied.

Return type:

Tensor
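The rotation applied by rotary encoding can be sketched for one head vector as below. This follows the usual RoPE formulation, rotating consecutive pairs of dimensions by an angle proportional to the token position. The pairing of adjacent dimensions and the function name rotate_vector are assumptions; the library may pair dimensions differently.

```python
import math


def rotate_vector(vec, position, theta=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    head_size = len(vec)
    if head_size % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(head_size // 2):
        # Each pair of dimensions gets its own rotation frequency.
        angle = position * theta ** (-2.0 * i / head_size)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


query = [1.0, 0.0, 0.0, 1.0]
at_start = rotate_vector(query, position=0)   # angle 0 leaves the vector unchanged
later = rotate_vector(query, position=3)
```

Because each pair is only rotated, the length of the vector is preserved, and the dot product of a rotated query and key depends only on the distance between their positions.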

n_heads: int

precompute_constants()[source]

Precomputes the cosine and sine of frequencies for rotary encoding.

Returns:

Precomputed cosine and sine values.

Return type:

tuple[ArrayLike, ArrayLike]
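One way such tables could be precomputed is sketched below, using the standard RoPE frequencies theta^(-2i / head_size). The function name precompute_rope_tables, the table shape (context_window, head_size // 2), and the relationship head_size = embedding_dim // n_heads are assumptions for illustration.

```python
import math


def precompute_rope_tables(head_size, context_window, theta=10000.0):
    """Build cos/sin tables of shape (context_window, head_size // 2)."""
    if head_size % 2 != 0:
        raise ValueError("head_size must be even")
    # One rotation frequency per pair of dimensions.
    freqs = [theta ** (-2.0 * i / head_size) for i in range(head_size // 2)]
    cos_table = [[math.cos(pos * f) for f in freqs] for pos in range(context_window)]
    sin_table = [[math.sin(pos * f) for f in freqs] for pos in range(context_window)]
    return cos_table, sin_table


embedding_dim, n_heads, context_window = 8, 2, 4
head_size = embedding_dim // n_heads      # assumed relationship between the attributes
freqs_cos, freqs_sin = precompute_rope_tables(head_size, context_window)
```

Computing the tables once avoids recalculating the same trigonometric values on every forward pass.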

theta: float = 10000.0

class Sequential(*layers)[source]

Bases: Layer

A sequential container of layers.

This class allows for the creation of a sequential chain of layers, where the output of each layer is fed as input to the next layer.

Parameters:

layers (Sequence[Layer])

layers

A tuple of Layer objects in the sequential chain.

Type:

tuple

forward(tensor)[source]

Performs a forward pass through all layers in the sequential chain.

Parameters:

tensor (Tensor) – The input tensor.

Returns:

The output tensor after passing through all layers.

Return type:

Tensor
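The chaining behaviour can be sketched with a few stand-in classes. Chain, Double, and AddOne are hypothetical and exist only for this example; they are not part of this library.

```python
class Double:
    """Hypothetical layer that doubles every value."""

    def forward(self, tensor):
        return [2.0 * v for v in tensor]


class AddOne:
    """Hypothetical layer that adds one to every value."""

    def forward(self, tensor):
        return [v + 1.0 for v in tensor]


class Chain:
    """Stand-in for a sequential container: stores its layers as a tuple."""

    def __init__(self, *layers):
        self.layers = layers

    def forward(self, tensor):
        # Feed the output of each layer into the next one.
        for layer in self.layers:
            tensor = layer.forward(tensor)
        return tensor


model = Chain(Double(), AddOne())
output = model.forward([1.0, 2.0])
```

Order matters: Chain(AddOne(), Double()) applied to the same input gives a different result.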

from_gpu()[source]

Moves all layers from GPU to CPU.

to_gpu(device=0)[source]

Moves all layers to the GPU.

Parameters:

device (int) – The GPU device number. Defaults to 0.

update(optimiser)[source]

Updates all layers using the given optimiser.

Parameters:

optimiser (Optimiser) – The optimiser to use for updating layers.

zero_grad()[source]

Resets the gradients of all layers to None.