configs

Configurations for different GPT models.

This module contains configuration classes for various GPT models, including a base configuration class and specific configurations for debugging, Shakespeare-based models, and a small GPT model.

Classes:

GPTConfig: Base configuration class for GPT models.
DebugConfig: Configuration for debugging purposes.
ShakespeareConfig: Configuration for Shakespeare-based models.
SmolGPTConfig: Configuration for a small GPT model.
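
A minimal usage sketch, assuming the module is importable as configs (the name shown at the top of this page); the printed values come from DebugConfig's defaults listed below:

from configs import DebugConfig

config = DebugConfig()

# Fields are plain class attributes, so they can be read directly.
print(config.embedding_dim)   # 14
print(config.context_window)  # 13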

class DebugConfig[source]

Bases: GPTConfig

Configuration for debugging purposes.

This class inherits from GPTConfig and sets specific values for debugging.

activation_fn: str = 'gelu'
batch_size: int = 5
beta1: float = 0.9
beta2: float = 0.99
context_window: int = 13
device_idx: int = 0
embedding_dim: int = 14
eval_interval: int = 1
eval_steps = 1
expansion_ratio: float = 4
gradient_accumulation_steps: int = 1
input_dropout_prob: float = 0.2
linear_dropout_prob: float = 0.2
max_learning_rate: float = 0.001
min_learning_rate: float = 0.0001
mlflow_enabled = False
mlflow_tracking_uri: str = ''
momentum = 0
n_heads: int = 2
n_layers: int = 1
norm_fn: str = 'layer_norm'
residual_dropout_prob: float = 0.2
sample_size = 4
steps: int | Literal['chinchilla_optimal'] = 250
vocab_size: int = 11
warmup_steps: int = 100
weight_decay: float = 0.1
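
The batch_size, gradient_accumulation_steps, and context_window fields jointly determine how many tokens each optimizer step consumes. A hedged sketch of that arithmetic (the helper function is illustrative, not part of this module):

from configs import DebugConfig, SmolGPTConfig

def effective_tokens_per_step(config) -> int:
    # Micro-batch size x accumulation steps x sequence length.
    # Illustrative helper; not part of the configs module.
    return config.batch_size * config.gradient_accumulation_steps * config.context_window

print(effective_tokens_per_step(DebugConfig()))    # 5 * 1 * 13 = 65
print(effective_tokens_per_step(SmolGPTConfig()))  # 4 * 128 * 1024 = 524288
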
class GPTConfig[source]

Bases: object

Base configuration class for GPT models.

This class defines the common parameters and hyperparameters used in GPT model training and evaluation.

embedding_dim
Dimension of the embedding layer.
Type: int

context_window
Size of the context window.
Type: int

vocab_size
Size of the vocabulary.
Type: int

n_heads
Number of attention heads.
Type: int

n_layers
Number of transformer layers.
Type: int

expansion_ratio
Expansion ratio for feed-forward layers.
Type: float

activation_fn
Activation function used in the model.
Type: str

norm_fn
Normalization function used in the model.
Type: str

input_dropout_prob
Dropout probability for input embeddings.
Type: float

residual_dropout_prob
Dropout probability for residual connections.
Type: float

linear_dropout_prob
Dropout probability for linear layers.
Type: float

max_learning_rate
Maximum learning rate for training.
Type: float

min_learning_rate
Minimum learning rate for training.
Type: float

warmup_steps
Number of warmup steps for learning rate scheduling.
Type: int

weight_decay
Weight decay factor for regularization.
Type: float

momentum
Momentum factor for optimization.
Type: float

beta1
Beta1 parameter for the Adam optimizer.
Type: float

beta2
Beta2 parameter for the Adam optimizer.
Type: float

steps
Number of training steps, or 'chinchilla_optimal'.
Type: int | Literal['chinchilla_optimal']

eval_interval
Interval between evaluations.
Type: int

batch_size
Batch size for training.
Type: int

gradient_accumulation_steps
Number of steps for gradient accumulation.
Type: int

device_idx
Index of the device to use for training.
Type: int

mlflow_tracking_uri
URI of the MLflow tracking server.
Type: str

mlflow_experiment_name
Name of the MLflow experiment.
Type: str

activation_fn: str
batch_size: int
beta1: float
beta2: float
context_window: int
device_idx: int
dict()[source]
Convert the configuration to a dictionary (see the logging sketch after this class listing).
Returns: A dictionary representation of the configuration.
Return type: dict[str, int | float | str | bool]

embedding_dim: int
eval_interval: int
expansion_ratio: float
gradient_accumulation_steps: int
input_dropout_prob: float
linear_dropout_prob: float
max_learning_rate: float
min_learning_rate: float
mlflow_experiment_name: str
mlflow_tracking_uri: str
momentum
alias of float

n_heads: int
n_layers: int
norm_fn: str
residual_dropout_prob: float
steps: int | Literal['chinchilla_optimal']
vocab_size: int
warmup_steps: int
weight_decay: float
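
A hedged sketch of using dict() to log hyperparameters. mlflow.set_tracking_uri, mlflow.start_run, and mlflow.log_params are standard MLflow APIs, but wiring them to the config like this is an assumption, not something the module does itself:

import mlflow  # assumes the mlflow package is installed

from configs import ShakespeareConfig

config = ShakespeareConfig()

if config.mlflow_enabled:
    mlflow.set_tracking_uri(config.mlflow_tracking_uri)
    with mlflow.start_run():
        # dict() flattens the config into plain values, so every
        # hyperparameter can be logged in one call.
        mlflow.log_params(config.dict())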
class ShakespeareConfig[source]

Bases: GPTConfig

Configuration for Shakespeare-based models.

This class inherits from GPTConfig and sets specific values for Shakespeare-based language models.

activation_fn: str = 'gelu'
batch_size: int = 128
beta1: float = 0.9
beta2: float = 0.99
context_window: int = 256
device_idx: int = 1
embedding_dim: int = 384
eval_interval: int = 250
eval_steps = 128
expansion_ratio: float = 4
gradient_accumulation_steps: int = 1
input_dropout_prob: float = 0.2
linear_dropout_prob: float = 0.2
max_learning_rate: float = 0.01
min_learning_rate: float = 0.0001
mlflow_enabled = True
mlflow_tracking_uri: str = 'http://localhost:5000'
momentum = 0
n_heads: int = 6
n_layers: int = 6
norm_fn: str = 'layer_norm'
residual_dropout_prob: float = 0.2
sample_size = 512
steps: int | Literal['chinchilla_optimal'] = 5000
vocab_size: int = 1024
warmup_steps: int = 100
weight_decay: float = 0.1
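
The max_learning_rate, min_learning_rate, and warmup_steps fields suggest a warmup-then-decay schedule. A minimal sketch of one common choice (linear warmup followed by cosine decay); the function is illustrative and the codebase's actual schedule may differ:

import math

from configs import ShakespeareConfig

def learning_rate_at(step: int, config) -> float:
    # Linear warmup to max_learning_rate over warmup_steps, then
    # cosine decay down to min_learning_rate. Illustrative only.
    if step < config.warmup_steps:
        return config.max_learning_rate * (step + 1) / config.warmup_steps
    progress = (step - config.warmup_steps) / max(1, config.steps - config.warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return config.min_learning_rate + cosine * (config.max_learning_rate - config.min_learning_rate)

config = ShakespeareConfig()
print(learning_rate_at(50, config))    # mid-warmup, ~0.005
print(learning_rate_at(5000, config))  # fully decayed, 0.0001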
class SmolGPTConfig[source]

Bases: GPTConfig

Configuration for a small GPT model.

This class inherits from GPTConfig and sets specific values for a small-scale GPT model.

activation_fn: str = 'gelu'
batch_size: int = 4
beta1: float = 0.9
beta2: float = 0.95
context_window: int = 1024
device_idx: int = 0
embedding_dim: int = 768
eval_interval: int = 100
eval_steps = 128
expansion_ratio: float = 4
gradient_accumulation_steps: int = 128
input_dropout_prob: float = 0
linear_dropout_prob: float = 0
max_learning_rate: float = 0.0006
min_learning_rate: float = 0
mlflow_enabled = True
mlflow_tracking_uri: str = 'http://localhost:5000'
momentum = 0
n_heads: int = 12
n_layers: int = 12
n_tokens_to_generate = 512
norm_fn: str = 'layer_norm'
residual_dropout_prob: float = 0
steps: int | Literal['chinchilla_optimal'] = 'chinchilla_optimal'
vocab_size: int = 50256
warmup_steps: int = 150
weight_decay: float = 0.1
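
Unlike the other configs, SmolGPTConfig sets steps to 'chinchilla_optimal' rather than a fixed count. The Chinchilla scaling result suggests training on roughly 20 tokens per model parameter; a hedged sketch of how such a step count could be derived (the 20x heuristic and the parameter count are assumptions, and the module's own resolution of 'chinchilla_optimal' may differ):

from configs import SmolGPTConfig

def chinchilla_optimal_steps(config, n_params: int, tokens_per_param: int = 20) -> int:
    # Target token budget from the ~20-tokens-per-parameter heuristic,
    # divided by tokens consumed per optimizer step. Illustrative only.
    target_tokens = tokens_per_param * n_params
    tokens_per_step = (
        config.batch_size * config.gradient_accumulation_steps * config.context_window
    )
    return target_tokens // tokens_per_step

config = SmolGPTConfig()

# SmolGPTConfig roughly matches GPT-2 small (~124M parameters).
print(chinchilla_optimal_steps(config, n_params=124_000_000))  # ~4730 steps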