configs¶
Configurations for different GPT models.
This module contains configuration classes for various GPT models, including a base configuration class and specific configurations for debugging, Shakespeare-based models, and a small GPT model.
- Classes:
GPTConfig: Base configuration class for GPT models.
DebugConfig: Configuration for debugging purposes.
ShakespeareConfig: Configuration for Shakespeare-based models.
SmolGPTConfig: Configuration for a small GPT model.
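A minimal usage sketch of instantiating one of these presets and reading its hyperparameters. The import path configs is taken from this page's title and is an assumption; it may differ in the installed package.

    from configs import ShakespeareConfig

    config = ShakespeareConfig()
    print(config.embedding_dim)   # 384
    print(config.n_layers)        # 6
    print(config.context_window)  # 256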
- class DebugConfig[source]¶
Bases:
GPTConfig
Configuration for debugging purposes.
This class inherits from GPTConfig and sets specific values for debugging.
- activation_fn: str = 'gelu'¶
- batch_size: int = 5¶
- beta1: float = 0.9¶
- beta2: float = 0.99¶
- context_window: int = 13¶
- device_idx: int = 0¶
- embedding_dim: int = 14¶
- eval_interval: int = 1¶
- eval_steps = 1¶
- expansion_ratio: float = 4¶
- gradient_accumulation_steps: int = 1¶
- input_dropout_prob: float = 0.2¶
- linear_dropout_prob: float = 0.2¶
- max_learning_rate: float = 0.001¶
- min_learning_rate: float = 0.0001¶
- mlflow_enabled = False¶
- mlflow_tracking_uri: str = ''¶
- momentum = 0¶
- n_heads: int = 2¶
- n_layers: int = 1¶
- norm_fn: str = 'layer_norm'¶
- residual_dropout_prob: float = 0.2¶
- sample_size = 4¶
- steps: int | Literal['chinchilla_optimal'] = 250¶
- vocab_size: int = 11¶
- warmup_steps: int = 100¶
- weight_decay: float = 0.1¶
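A minimal sketch of inspecting DebugConfig before wiring it into a test, again assuming the module is importable as configs. The divisibility check assumes the usual multi-head attention constraint that embedding_dim splits evenly across the heads.

    from configs import DebugConfig

    config = DebugConfig()
    # Tiny dimensions keep a single training step cheap enough for unit tests.
    print(config.embedding_dim, config.context_window, config.vocab_size)  # 14 13 11
    # Assumed multi-head attention constraint: 14 dims over 2 heads = 7 per head.
    assert config.embedding_dim % config.n_heads == 0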
- class GPTConfig[source]¶
Bases:
object
Base configuration class for GPT models.
This class defines the common parameters and hyperparameters used in GPT model training and evaluation.
- embedding_dim¶
Dimension of the embedding layer.
- Type:
int
- context_window¶
Size of the context window.
- Type:
int
- vocab_size¶
Size of the vocabulary.
- Type:
int
- n_heads¶
Number of attention heads.
- Type:
int
- n_layers¶
Number of transformer layers.
- Type:
int
- expansion_ratio¶
Expansion ratio for feed-forward layers.
- Type:
float
- activation_fn¶
Activation function used in the model.
- Type:
str
- norm_fn¶
Normalization function used in the model.
- Type:
str
- input_dropout_prob¶
Dropout probability for input embeddings.
- Type:
float
- residual_dropout_prob¶
Dropout probability for residual connections.
- Type:
float
- linear_dropout_prob¶
Dropout probability for linear layers.
- Type:
float
- max_learning_rate¶
Maximum learning rate for training.
- Type:
float
- min_learning_rate¶
Minimum learning rate for training.
- Type:
float
- warmup_steps¶
Number of warmup steps for learning rate scheduling.
- Type:
int
- weight_decay¶
Weight decay factor for regularization.
- Type:
float
- momentum¶
Momentum factor for optimization.
- Type:
float
- beta1¶
Beta1 parameter for Adam optimizer.
- Type:
float
- beta2¶
Beta2 parameter for Adam optimizer.
- Type:
float
- steps¶
Number of training steps or 'chinchilla_optimal'.
- Type:
int | Literal['chinchilla_optimal']
- eval_interval¶
Interval between evaluations.
- Type:
int
- batch_size¶
Batch size for training.
- Type:
int
- gradient_accumulation_steps¶
Number of steps for gradient accumulation.
- Type:
int
- device_idx¶
Index of the device to use for training.
- Type:
int
- mlflow_tracking_uri¶
URI for MLflow tracking server.
- Type:
str
- mlflow_experiment_name¶
Name of the MLflow experiment.
- Type:
str
- activation_fn: str¶
- batch_size: int¶
- beta1: float¶
- beta2: float¶
- context_window: int¶
- device_idx: int¶
- dict()[source]¶
Convert the configuration to a dictionary.
- Returns:
A dictionary representation of the configuration.
- Return type:
dict[str, int | float | str | bool]
- embedding_dim: int¶
- eval_interval: int¶
- expansion_ratio: float¶
- gradient_accumulation_steps: int¶
- input_dropout_prob: float¶
- linear_dropout_prob: float¶
- max_learning_rate: float¶
- min_learning_rate: float¶
- mlflow_experiment_name: str¶
- mlflow_tracking_uri: str¶
- momentum¶
alias of float
- n_heads: int¶
- n_layers: int¶
- norm_fn: str¶
- residual_dropout_prob: float¶
- steps: int | Literal['chinchilla_optimal']¶
- vocab_size: int¶
- warmup_steps: int¶
- weight_decay: float¶
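A sketch of serializing a configuration with dict(), for example to log hyperparameters. The JSON dump is illustrative and assumes every value is JSON-serializable, as the documented return type suggests; the import path configs is an assumption.

    import json
    from configs import SmolGPTConfig

    config = SmolGPTConfig()
    params = config.dict()              # dict[str, int | float | str | bool]
    print(json.dumps(params, indent=2))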
- class ShakespeareConfig[source]¶
Bases:
GPTConfig
Configuration for Shakespeare-based models.
This class inherits from GPTConfig and sets specific values for Shakespeare-based language models.
- activation_fn: str = 'gelu'¶
- batch_size: int = 128¶
- beta1: float = 0.9¶
- beta2: float = 0.99¶
- context_window: int = 256¶
- device_idx: int = 1¶
- embedding_dim: int = 384¶
- eval_interval: int = 250¶
- eval_steps = 128¶
- expansion_ratio: float = 4¶
- gradient_accumulation_steps: int = 1¶
- input_dropout_prob: float = 0.2¶
- linear_dropout_prob: float = 0.2¶
- max_learning_rate: float = 0.01¶
- min_learning_rate: float = 0.0001¶
- mlflow_enabled = True¶
- mlflow_tracking_uri: str = 'http://localhost:5000'¶
- momentum = 0¶
- n_heads: int = 6¶
- n_layers: int = 6¶
- norm_fn: str = 'layer_norm'¶
- residual_dropout_prob: float = 0.2¶
- sample_size = 512¶
- steps: int | Literal['chinchilla_optimal'] = 5000¶
- vocab_size: int = 1024¶
- warmup_steps: int = 100¶
- weight_decay: float = 0.1¶
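A sketch of adjusting the Shakespeare preset for a smaller local run. Plain attribute assignment on an instance is assumed to be permitted, since the fields are documented as ordinary class-level values.

    from configs import ShakespeareConfig

    config = ShakespeareConfig()
    config.batch_size = 32          # shrink from 128 to fit a smaller GPU
    config.device_idx = 0           # the preset defaults to device 1
    config.mlflow_enabled = False   # skip experiment tracking for a local run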
- class SmolGPTConfig[source]¶
Bases:
GPTConfig
Configuration for a small GPT model.
This class inherits from GPTConfig and sets specific values for a small-scale GPT model.
- activation_fn: str = 'gelu'¶
- batch_size: int = 4¶
- beta1: float = 0.9¶
- beta2: float = 0.95¶
- context_window: int = 1024¶
- device_idx: int = 0¶
- embedding_dim: int = 768¶
- eval_interval: int = 100¶
- eval_steps = 128¶
- expansion_ratio: float = 4¶
- gradient_accumulation_steps: int = 128¶
- input_dropout_prob: float = 0¶
- linear_dropout_prob: float = 0¶
- max_learning_rate: float = 0.0006¶
- min_learning_rate: float = 0¶
- mlflow_enabled = True¶
- mlflow_tracking_uri: str = 'http://localhost:5000'¶
- momentum = 0¶
- n_heads: int = 12¶
- n_layers: int = 12¶
- n_tokens_to_generate = 512¶
- norm_fn: str = 'layer_norm'¶
- residual_dropout_prob: float = 0¶
- steps: int | Literal['chinchilla_optimal'] = 'chinchilla_optimal'¶
- vocab_size: int = 50256¶
- warmup_steps: int = 150¶
- weight_decay: float = 0.1¶
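A sketch of the effective batch size implied by SmolGPTConfig: the small per-device batch of 4 is combined with 128 gradient-accumulation steps, so each optimizer step sees 512 sequences of 1024 tokens.

    from configs import SmolGPTConfig

    config = SmolGPTConfig()
    sequences_per_step = config.batch_size * config.gradient_accumulation_steps
    tokens_per_step = sequences_per_step * config.context_window
    print(sequences_per_step)  # 4 * 128 = 512 sequences per optimizer step
    print(tokens_per_step)     # 512 * 1024 = 524288 tokens per optimizer step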