layers¶

class Dense(from_size, to_size, initialiser=<function init_xavier>, name=None)[source]¶

Bases: Layer

A dense (fully connected) layer.

Parameters:

from_size (int)
to_size (int)
name (str | None)

weights¶

The weight matrix.

Type:: Tensor

from_size¶

Input size.

Type:: int

to_size¶

Output size.

Type:: int

name¶

Optional name for the layer.

Type:: str | None

forward(tensor)[source]¶

Perform the forward pass of the dense layer.

Parameters:: tensor (Tensor) – Input tensor.
Returns:: Output of the dense layer.
Return type:: Tensor
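As a rough illustration of what a fully connected forward pass computes, the sketch below multiplies a batch of input rows by a weight matrix using plain Python lists. It is not this library's implementation: the function name `dense_forward`, the `(from_size, to_size)` weight layout, and the absence of a bias term are assumptions made for the example.

```python
def dense_forward(x, weights):
    """Multiply a batch of row vectors by a weight matrix.

    x: list of rows, each of length from_size.
    weights: from_size rows, each of length to_size (assumed layout).
    """
    from_size = len(weights)
    to_size = len(weights[0])
    out = []
    for row in x:
        if len(row) != from_size:
            raise ValueError("input width must equal from_size")
        # Output element j is the dot product of the row with column j.
        out.append([
            sum(row[i] * weights[i][j] for i in range(from_size))
            for j in range(to_size)
        ])
    return out


x = [[1.0, 2.0, 3.0]]   # batch of 1, from_size = 3
w = [[1.0, 0.0],        # from_size = 3, to_size = 2
     [0.0, 1.0],
     [1.0, 1.0]]
y = dense_forward(x, w)
```

Each output row has `to_size` entries, one per column of the weight matrix.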

from_gpu()[source]¶

Move the layer from GPU to CPU.

Returns:: The layer itself.
Return type:: Dense

from_size: int¶

grad_back_fn(grad)[source]¶

Compute gradients with respect to input.

Parameters:: grad (Tensor) – Gradient from the next layer.
Returns:: Gradient with respect to input.
Return type:: Tensor

name: str | None¶

to_gpu(device=0)[source]¶

Move the layer to GPU.

Parameters:: device (int) – The GPU device number. Defaults to 0.
Returns:: The layer itself.
Return type:: Dense

to_size: int¶

update(optimiser)[source]¶

Update the weights using the given optimiser.

Parameters:: optimiser (Optimiser) – The optimiser to use for updating weights.

weight_back_fn(grad)[source]¶

Compute gradients with respect to weights.

Parameters:: grad (Tensor) – Gradient from the next layer.
Returns:: Gradient with respect to weights.
Return type:: Tensor
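The two backward functions of a dense layer follow from the chain rule. For an output `y = x @ W`, the gradient with respect to the input is `grad @ W^T` and the gradient with respect to the weights is `x^T @ grad`. The sketch below shows both with plain lists. The helper names and the `(from_size, to_size)` weight layout are assumptions for illustration, not the library's code.

```python
def matmul(a, b):
    """Plain-list matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dense_backward(x, weights, grad_out):
    """Return (gradient w.r.t. input, gradient w.r.t. weights)."""
    grad_input = matmul(grad_out, transpose(weights))   # same shape as x
    grad_weights = matmul(transpose(x), grad_out)       # same shape as weights
    return grad_input, grad_weights


x = [[1.0, 2.0, 3.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grad_out = [[1.0, -1.0]]
grad_input, grad_weights = dense_backward(x, w, grad_out)
```

A quick sanity check is that each gradient has the same shape as the quantity it differentiates.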

weights: Tensor¶

zero_grad()[source]¶: Reset gradients to zero.

class Dropout(probability)[source]¶

Bases: Layer

A dropout layer for regularization.

Parameters:: probability (float)

probability¶

The probability of dropping out a unit.

Type:: float

forward(tensor)[source]¶

Perform the forward pass of the dropout layer.

Parameters:: tensor (Tensor) – Input tensor.
Returns:: Output tensor with dropout applied.
Return type:: Tensor
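Dropout is commonly implemented in its "inverted" form: each unit is zeroed with the given probability, and the survivors are scaled by `1 / (1 - probability)` so the expected activation is unchanged. The sketch below assumes that variant. The `training` flag and the explicit `rng` argument are hypothetical additions for the example, and this library may behave differently.

```python
import random


def dropout_forward(values, probability, rng, training=True):
    """Inverted dropout over a flat list of activations."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    if not training or probability == 0.0:
        return list(values)  # identity when dropout is disabled
    keep_scale = 1.0 / (1.0 - probability)
    return [0.0 if rng.random() < probability else v * keep_scale
            for v in values]


rng = random.Random(0)  # seeded so the run is repeatable
out = dropout_forward([1.0] * 10, probability=0.5, rng=rng)
unchanged = dropout_forward([1.0, 2.0], probability=0.5, rng=rng, training=False)
```

With `probability=0.5`, every output is either `0.0` (dropped) or `2.0` (kept and rescaled).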

class Embedding(from_size, to_size, name=None, initialiser=<function init_xavier>)[source]¶

Bases: Layer

Embedding layer that converts indices to dense vectors.

The lookup returns, for each input index, the corresponding vector from the embedding matrix.

Parameters:

from_size (int)
to_size (int)
name (str | None)

weights¶

The embedding matrix.

Type:: Tensor

vocab_size¶

Size of the vocabulary (number of embeddings).

Type:: int

back_fn(grad)[source]¶

Computes the gradient with respect to the embedding weights.

Parameters:: grad (Tensor) – The gradient tensor.
Returns:: The gradient with respect to the embedding weights.
Return type:: Tensor

forward(tensor)[source]¶

Performs the embedding lookup.

Parameters:: tensor (Tensor) – Input tensor containing indices to be embedded.
Returns:: The embedded representation of the input indices.
Return type:: Tensor
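An embedding lookup and its weight gradient can be sketched with plain lists. The forward pass selects one row per index. The backward pass accumulates each incoming gradient row into the row that was selected, so repeated indices add up. The function names and the `(vocab_size, to_size)` matrix layout are assumptions for illustration, not the library's code.

```python
def embedding_forward(indices, weights):
    """Return one copy of the matching weight row per index."""
    return [list(weights[i]) for i in indices]


def embedding_weight_grad(indices, grad_out, vocab_size, to_size):
    """Scatter-add the output gradient rows back onto the rows that were used."""
    grad = [[0.0] * to_size for _ in range(vocab_size)]
    for idx, grad_row in zip(indices, grad_out):
        for j, g in enumerate(grad_row):
            grad[idx][j] += g
    return grad


weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # vocab_size = 3, to_size = 2
indices = [2, 0, 2]
embedded = embedding_forward(indices, weights)
grad = embedding_weight_grad(indices, [[1.0, 1.0]] * 3, vocab_size=3, to_size=2)
```

Row 1 is never looked up, so its gradient stays at zero, while row 2 is used twice and receives twice the gradient.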

from_gpu()[source]¶

Moves the embedding weights from GPU to CPU.

Returns:: The embedding layer with weights moved to CPU.
Return type:: Embedding

to_gpu(device=0)[source]¶

Moves the embedding weights to the GPU.

Parameters:: device (int) – The GPU device number.
Returns:: The embedding layer with weights moved to GPU.
Return type:: Embedding

update(optimiser)[source]¶

Updates the embedding weights using the given optimiser.

Parameters:: optimiser (Optimiser) – The optimiser to use for updating weights.

zero_grad()[source]¶: Resets the gradient of the weights to None.

class Layer[source]¶

Bases: ABC

A generic Layer object, representing a single operation in a neural network.

tensors¶

Dictionary of tensors used in the layer.

Type:: dict[str, Tensor]

layers¶

Sequence of sub-layers, if any.

Type:: Sequence[Layer]

abstract forward(tensor)[source]¶

Perform the forward pass of the layer.

Parameters:: tensor (Tensor) – Input tensor.
Raises:: NotImplementedError – If a subclass does not implement this method.
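The general pattern of an abstract base class with a required `forward` method can be sketched with Python's `abc` module. The class names `SketchLayer` and `Double` are hypothetical and stand in for the real `Layer` hierarchy; the sketch only shows how `abstractmethod` forces subclasses to provide `forward`.

```python
from abc import ABC, abstractmethod


class SketchLayer(ABC):
    """Hypothetical stand-in for an abstract layer base class."""

    @abstractmethod
    def forward(self, tensor):
        raise NotImplementedError


class Double(SketchLayer):
    """Concrete subclass: doubles every element."""

    def forward(self, tensor):
        return [2 * v for v in tensor]


# A class with an unimplemented abstract method cannot be instantiated.
try:
    SketchLayer()
    instantiated = True
except TypeError:
    instantiated = False

doubled = Double().forward([1, 2, 3])
```

In this sketch the failure surfaces as a `TypeError` at construction time, before `forward` is ever called.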

from_gpu()[source]¶: Move the layer from GPU to CPU.

layers: Sequence[Layer] = []¶

tensors: dict[str, Tensor] = {}¶

to_gpu(device=0)[source]¶

Move the layer to GPU.

Parameters:: device (int) – The GPU device number. Defaults to 0.

update(optimiser)[source]¶

Update the layer’s parameters using the given optimiser.

Parameters:: optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]¶: Reset gradients to zero.

class LayerNorm(embedding_dim, eps=1e-05)[source]¶

Bases: Layer

A Layer Normalization layer.

Parameters:: embedding_dim (int)

eps¶

A small value added for numerical stability.

Type:: float

gamma¶

Scale parameter.

Type:: Tensor

beta¶

Shift parameter.

Type:: Tensor

back_fn(grad)[source]¶

Compute gradients with respect to input.

Parameters:: grad (Tensor) – Gradient from the next layer.
Returns:: Gradient with respect to input.
Return type:: Tensor

beta_back_fn(grad)[source]¶

Compute gradients with respect to beta.

Parameters:: grad (Tensor) – Gradient from the next layer.
Returns:: Gradient with respect to beta.
Return type:: Tensor

forward(tensor)[source]¶

Perform the forward pass of the layer normalization.

Parameters:: tensor (Tensor) – Input tensor of shape (batch_size, *).
Returns:: Normalized tensor of the same shape as input.
Return type:: Tensor
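Layer normalization is usually defined per feature vector: subtract the mean, divide by the standard deviation, then apply the scale `gamma` and shift `beta`. The sketch below assumes that standard formulation, with a biased variance estimate and `eps` added inside the square root. It operates on a single row and is not the library's code.

```python
import math


def layer_norm(row, gamma, beta, eps=1e-5):
    """Normalize one feature vector, then scale by gamma and shift by beta."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / n  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b
            for v, g, b in zip(row, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0]
out = layer_norm(row, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With `gamma` set to ones and `beta` to zeros, the output has mean close to 0 and variance close to 1.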

from_gpu()[source]¶

Move the layer from GPU to CPU.

Returns:: The layer itself.
Return type:: LayerNorm

gamma_back_fn(grad)[source]¶

Compute gradients with respect to gamma.

Parameters:: grad (Tensor) – Gradient from the next layer.
Returns:: Gradient with respect to gamma.
Return type:: Tensor

to_gpu(device=0)[source]¶

Move the layer to GPU.

Parameters:: device (int) – The GPU device number. Defaults to 0.
Returns:: The layer itself.
Return type:: LayerNorm

update(optimiser)[source]¶

Update the layer’s parameters using the given optimiser.

Parameters:: optimiser (Optimiser) – The optimiser to use for updating parameters.

zero_grad()[source]¶: Reset gradients to zero.

class RMSNorm(embedding_dim, REALLY_SMALL_NUMBER=0.0001)[source]¶

Bases: Layer

Root Mean Square Layer Normalization.

RMSNorm rescales the input by the root mean square of its last dimension, without subtracting the mean, and then applies the learnable scale in weights.

Parameters:: embedding_dim (int)

embedding_dim¶

The size of the input’s last dimension.

Type:: int

REALLY_SMALL_NUMBER¶

A small constant to avoid division by zero.

Type:: float

weights¶

Learnable scale parameters.

Type:: Tensor

back_fn(grad)[source]¶

Computes the gradient with respect to the input.

Parameters:: grad (Tensor) – The gradient tensor.
Returns:: The gradient with respect to the input.
Return type:: Tensor

forward(tensor)[source]¶

Applies RMS normalization to the input tensor.

Parameters:: tensor (Tensor) – Input tensor to be normalized.
Returns:: The normalized output tensor.
Return type:: Tensor
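As a minimal sketch of RMS normalization, the function below divides one feature vector by its root mean square and applies a per-element scale. The placement of the small constant inside the square root is an assumption; the default of `1e-4` mirrors the `REALLY_SMALL_NUMBER` default in the signature above. This is an illustration, not the library's code.

```python
import math


def rms_norm(row, weights, eps=1e-4):
    """Scale a feature vector so its root mean square is close to 1."""
    mean_square = sum(v * v for v in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(row, weights)]


out = rms_norm([3.0, 4.0], weights=[1.0, 1.0])
```

Unlike layer normalization, the mean is not subtracted, so the ratios between elements are preserved when the weights are all ones.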

from_gpu()[source]¶

Moves the layer’s parameters from GPU to CPU.

Returns:: The layer with parameters moved to CPU.
Return type:: RMSNorm

to_gpu(device=0)[source]¶

Moves the layer’s parameters to the GPU.

Parameters:: device (int) – The GPU device number.
Returns:: The layer with parameters moved to GPU.
Return type:: RMSNorm

update(optimiser)[source]¶

Updates the layer’s parameters using the given optimiser.

Parameters:: optimiser (Optimiser) – The optimiser to use for updating parameters.

weight_back_fn(grad)[source]¶

Computes the gradient with respect to the weights.

Parameters:: grad (Tensor) – The gradient tensor.
Returns:: The gradient with respect to the weights.
Return type:: Tensor

zero_grad()[source]¶: Resets the gradient of the weights to None.

class RotaryEncode(embedding_dim, n_heads, context_window, theta=None)[source]¶

Bases: Layer

Applies rotary positional encoding to query and key tensors.

This layer implements the Rotary Position Embedding (RoPE) technique for transformer models.

Parameters:

embedding_dim (int)
n_heads (int)
context_window (int)
theta (float | None)

embedding_dim¶

The size of the embedding dimension.

Type:: int

n_heads¶

The number of attention heads.

Type:: int

context_window¶

The size of the context window.

Type:: int

theta¶

The base value for frequency calculation.

Type:: float

head_size¶

The size of each attention head.

Type:: int

freqs_cos¶

Precomputed cosine of frequencies.

Type:: ArrayLike

freqs_sin¶

Precomputed sine of frequencies.

Type:: ArrayLike

backward(grad)[source]¶

Computes the gradient for the rotary encoding operation.

Parameters:: grad (Tensor) – The gradient tensor.
Returns:: The gradient with respect to the input.
Return type:: Tensor

context_window: int¶

embedding_dim: int¶

forward(tensor)[source]¶

Applies rotary positional encoding to the input tensor.

Parameters:: tensor (Tensor) – The input tensor.
Returns:: The tensor with rotary positional encoding applied.
Return type:: Tensor

n_heads: int¶

precompute_constants()[source]¶

Precomputes the cosine and sine of frequencies for rotary encoding.

Returns:: Precomputed cosine and sine values.
Return type:: tuple[ArrayLike, ArrayLike]
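The precomputation and the rotation it feeds can be sketched as follows, using the usual RoPE formulation: pair `i` of a head rotates by the angle `position * theta ** (-2 * i / head_size)`. The interleaved pairing of adjacent elements, the helper `rotate`, and the plain-list table layout are assumptions for illustration; the library may arrange its tables differently.

```python
import math


def precompute_constants(head_size, context_window, theta=10000.0):
    """Build cos/sin tables indexed as [position][pair]."""
    half = head_size // 2
    inv_freq = [theta ** (-2.0 * i / head_size) for i in range(half)]
    freqs_cos = [[math.cos(pos * f) for f in inv_freq]
                 for pos in range(context_window)]
    freqs_sin = [[math.sin(pos * f) for f in inv_freq]
                 for pos in range(context_window)]
    return freqs_cos, freqs_sin


def rotate(vec, cos_row, sin_row):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by its angle."""
    out = []
    for i, (c, s) in enumerate(zip(cos_row, sin_row)):
        a, b = vec[2 * i], vec[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out


freqs_cos, freqs_sin = precompute_constants(head_size=4, context_window=8)
vec = [1.0, 0.0, 0.0, 1.0]
at_pos0 = rotate(vec, freqs_cos[0], freqs_sin[0])
at_pos3 = rotate(vec, freqs_cos[3], freqs_sin[3])
```

Position 0 leaves the vector unchanged, and every other position changes its direction but not its length, because each pair undergoes a pure rotation.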

theta: float = 10000.0¶

class Sequential(*layers)[source]¶

Bases: Layer

A sequential container of layers.

This class allows for the creation of a sequential chain of layers, where the output of each layer is fed as input to the next layer.

Parameters:: layers (Sequence[Layer])

layers¶

A tuple of Layer objects in the sequential chain.

Type:: tuple

forward(tensor)[source]¶

Performs a forward pass through all layers in the sequential chain.

Parameters:: tensor (Tensor) – The input tensor.
Returns:: The output tensor after passing through all layers.
Return type:: Tensor
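The chaining behaviour can be sketched in a few lines: the container stores its layers in order and feeds each output into the next layer's `forward`. The classes `Scale`, `Shift`, and `SketchSequential` are hypothetical stand-ins written for this example, not the library's own.

```python
class Scale:
    """Hypothetical layer: multiplies every element by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, tensor):
        return [self.factor * v for v in tensor]


class Shift:
    """Hypothetical layer: adds an offset to every element."""

    def __init__(self, offset):
        self.offset = offset

    def forward(self, tensor):
        return [v + self.offset for v in tensor]


class SketchSequential:
    """Apply the given layers one after another."""

    def __init__(self, *layers):
        self.layers = layers  # stored as a tuple, in call order

    def forward(self, tensor):
        for layer in self.layers:
            tensor = layer.forward(tensor)
        return tensor


model = SketchSequential(Scale(2.0), Shift(1.0))
out = model.forward([1.0, 2.0])
reversed_out = SketchSequential(Shift(1.0), Scale(2.0)).forward([1.0, 2.0])
```

Order matters: scaling then shifting gives a different result from shifting then scaling.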

from_gpu()[source]¶: Moves all layers from GPU to CPU.

to_gpu(device=0)[source]¶

Moves all layers to the GPU.

Parameters:: device (int) – The GPU device number.

update(optimiser)[source]¶

Updates all layers using the given optimiser.

Parameters:: optimiser (Optimiser) – The optimiser to use for updating layers.

zero_grad()[source]¶: Resets the gradients of all layers to None.