dataset
- class CausalLMDataset(tokens, vocab_size, batch_size, context_window, should_one_hot_encode=False)
Bases: object
A dataset for causal language modeling tasks.
This class provides functionality for creating batches of token sequences for training causal language models.
- tokens
The input token sequence.
- vocab_size
The size of the vocabulary.
- batch_size
The size of each batch.
- context_window
The size of the context window.
- is_batch
A boolean indicating if the data is batched.
- as_tensor
A boolean indicating if the data should be returned as tensors.
- _idx
The current index for iteration.
- batch_indices
The indices used for batching.
- should_one_hot_encode
A boolean indicating if outputs should be one-hot encoded.
- device
The device (GPU) to use for tensors.
- Parameters:
tokens (ndarray) – The input token sequence.
vocab_size (int) – The size of the vocabulary.
batch_size (int) – The size of each batch.
context_window (int) – The size of the context window.
should_one_hot_encode (bool) – Whether to one-hot encode the outputs.
- batch()
Configures the dataset for batch processing.
- Returns:
The dataset object configured for batch processing.
- from_gpu()
Resets the device so that the dataset is processed on the CPU.
- Returns:
The dataset object configured for CPU processing.
- shuffle()
Shuffles the batch indices.
- Returns:
The dataset object with shuffled batch indices.
- Raises:
NotImplementedError – If trying to shuffle a non-batched dataset.
- to_gpu(device=0)
Sets the device for GPU processing.
- Parameters:
device (int) – The GPU device number.
- Returns:
The dataset object configured for GPU processing.
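To make the role of context_window and should_one_hot_encode concrete, here is a minimal sketch of how one batch of next-token prediction pairs could be built from a token array. The helper make_causal_batch and its start_indices argument are hypothetical names introduced for illustration. The reference above does not say how CausalLMDataset chooses its windows, so the shift-by-one targets shown here are an assumption based on the usual causal language modeling setup.

```python
import numpy as np


def make_causal_batch(tokens, start_indices, context_window, vocab_size,
                      should_one_hot_encode=False):
    """Build one (inputs, targets) batch of next-token prediction pairs.

    Hypothetical helper for illustration; not part of the documented class.
    """
    tokens = np.asarray(tokens)
    # Each input is a window of `context_window` consecutive tokens.
    inputs = np.stack([tokens[i:i + context_window] for i in start_indices])
    # Each target is the same window shifted one token to the right
    # (assumed convention for causal language modeling).
    targets = np.stack(
        [tokens[i + 1:i + context_window + 1] for i in start_indices]
    )
    if should_one_hot_encode:
        # Indexing rows of the identity matrix yields one-hot vectors.
        targets = np.eye(vocab_size, dtype=np.float32)[targets]
    return inputs, targets


tokens = np.arange(10) % 5  # toy token ids in the range [0, 5)
inputs, targets = make_causal_batch(
    tokens, start_indices=[0, 3], context_window=4, vocab_size=5
)
```

With one-hot encoding enabled, the targets gain a trailing axis of length vocab_size, which is why vocab_size has to be supplied alongside the tokens.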
- class Dataset(inputs, outputs)
Bases: object
An in-memory dataset: not suitable for large datasets.
This class represents a basic dataset with inputs and corresponding outputs. It supports iteration, indexing, and shuffling of data.
- inputs
A sequence of input data.
- outputs
A sequence of output data corresponding to the inputs.
- _indices
A list of indices for accessing data.
- _index
The current index for iteration.
- Parameters:
inputs (Sequence) – A sequence of input data.
outputs (Sequence) – A sequence of output data.
- Raises:
AssertionError – If the lengths of inputs and outputs are not equal.
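The behaviour described for Dataset (paired inputs and outputs, a length check, iteration, indexing, and shuffling) can be sketched as follows. TinyDataset is a hypothetical stand-in written for illustration, not the library's implementation. In particular, the idea that shuffling reorders _indices rather than the data itself is an assumption inferred from the attribute list.

```python
import random
from typing import Sequence


class TinyDataset:
    """Hypothetical stand-in for an in-memory (inputs, outputs) dataset."""

    def __init__(self, inputs: Sequence, outputs: Sequence):
        # Mirrors the documented AssertionError on mismatched lengths.
        assert len(inputs) == len(outputs), "inputs and outputs differ in length"
        self.inputs = inputs
        self.outputs = outputs
        self._indices = list(range(len(inputs)))  # access order
        self._index = 0                           # iteration cursor

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, idx):
        # Look up through _indices so shuffling never moves the data itself.
        i = self._indices[idx]
        return self.inputs[i], self.outputs[i]

    def __iter__(self):
        self._index = 0
        return self

    def __next__(self):
        if self._index >= len(self):
            raise StopIteration
        item = self[self._index]
        self._index += 1
        return item

    def shuffle(self, seed=None):
        # Assumed behaviour: permute the access order and return self.
        random.Random(seed).shuffle(self._indices)
        return self


ds = TinyDataset([1, 2, 3], ["a", "b", "c"])
pairs = list(ds)
```

Because everything is held in ordinary Python sequences, memory use grows with the data, which matches the note that this kind of dataset is not suitable for large datasets.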
- class InfiniteBatchDataset(inputs, outputs, batch_size)
Bases: Dataset
An infinite batch dataset that generates random batches.
This class extends the Dataset class to provide infinite batches of data. It randomly selects items from the dataset to form batches.
- is_infinite
A boolean indicating if the dataset is infinite.
- _to_tensor
A boolean indicating if the data should be converted to tensors.
- is_batched
A boolean indicating if the data is batched.
- batch_size
The size of each batch.
- Parameters:
inputs (Sequence) – A sequence of input data.
outputs (Sequence) – A sequence of output data.
batch_size (int) – The size of each batch.
- is_batched = True
- is_infinite = True
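As a rough illustration of random, never-ending batches, the sketch below uses a generator that keeps sampling items from the dataset. The function infinite_batches and its seed argument are hypothetical. Sampling with replacement is also an assumption, since the reference only states that items are selected randomly.

```python
import itertools
import random


def infinite_batches(inputs, outputs, batch_size, seed=None):
    """Yield (batch_inputs, batch_outputs) forever.

    Hypothetical generator; sampling with replacement is an assumption.
    """
    assert len(inputs) == len(outputs)
    rng = random.Random(seed)
    n = len(inputs)
    while True:
        # Draw batch_size random positions and keep inputs/outputs aligned.
        picks = [rng.randrange(n) for _ in range(batch_size)]
        yield [inputs[i] for i in picks], [outputs[i] for i in picks]


xs = [0, 1, 2, 3]
ys = [x * x for x in xs]
# The stream never ends, so take a finite slice of it.
first_three = list(
    itertools.islice(infinite_batches(xs, ys, batch_size=2, seed=0), 3)
)
```

Since the stream has no natural end, a training loop over such a dataset needs its own stopping rule, such as a fixed step count.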