dataset
- class CausalLMDataset(tokens, vocab_size, batch_size, context_window, should_one_hot_encode=False)
Bases: object
A dataset for causal language modeling tasks.
This class provides functionality for creating batches of token sequences for training causal language models.
- tokens
The input token sequence.
- vocab_size
The size of the vocabulary.
- batch_size
The size of each batch.
- context_window
The size of the context window.
- is_batch
A boolean indicating if the data is batched.
- as_tensor
A boolean indicating if the data should be returned as tensors.
- _idx
The current index for iteration.
- batch_indices
The indices used for batching.
- should_one_hot_encode
A boolean indicating if outputs should be one-hot encoded.
- device
The device on which tensors are placed: a GPU after to_gpu(), or the CPU after from_gpu().
- Parameters:
tokens (ndarray) – The input token sequence.
vocab_size (int) – The size of the vocabulary.
batch_size (int) – The size of each batch.
context_window (int) – The size of the context window.
should_one_hot_encode (bool) – Whether to one-hot encode the outputs.
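As a rough illustration of what data for causal language modeling could look like, the sketch below slices a token sequence into input windows of length `context_window` and targets shifted by one position, optionally one-hot encoding the targets. It is a standalone sketch using plain Python lists rather than an ndarray, not this class's implementation; `make_windows` and `one_hot` are hypothetical helper names, and the shift-by-one target convention is an assumption that the source does not confirm.

```python
def one_hot(token, vocab_size):
    """Return a list of length vocab_size with a 1 at position `token`."""
    row = [0] * vocab_size
    row[token] = 1
    return row


def make_windows(tokens, vocab_size, context_window, should_one_hot_encode=False):
    """Build (inputs, targets) pairs where targets are the inputs shifted by one token.

    Hypothetical helper for illustration; not part of the documented API.
    """
    pairs = []
    # The last valid start must leave room for the shifted target window.
    for start in range(len(tokens) - context_window):
        inputs = tokens[start:start + context_window]
        targets = tokens[start + 1:start + context_window + 1]
        if should_one_hot_encode:
            targets = [one_hot(t, vocab_size) for t in targets]
        pairs.append((inputs, targets))
    return pairs


tokens = [0, 1, 2, 3, 4]
pairs = make_windows(tokens, vocab_size=5, context_window=3)
encoded = make_windows(tokens, vocab_size=5, context_window=3, should_one_hot_encode=True)
```

With five tokens and a window of three, this yields two pairs, each target being its input advanced by one token.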
- batch()
Configures the dataset for batch processing.
- Returns:
The dataset object configured for batch processing.
- from_gpu()
Resets the device to CPU processing.
- Returns:
The dataset object configured for CPU processing.
- shuffle()
Shuffles the batch indices.
- Returns:
The dataset object with shuffled batch indices.
- Raises:
NotImplementedError – If trying to shuffle a non-batched dataset.
- to_gpu(device=0)
Sets the device for GPU processing.
- Parameters:
device (int) – The GPU device number.
- Returns:
The dataset object configured for GPU processing.
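Because `batch`, `shuffle`, `to_gpu`, and `from_gpu` are each documented as returning the dataset object, calls can presumably be chained. The sketch below mimics that pattern with a hypothetical stand-in class, `ChainableDataset`. Its internals (how batch indices are computed, and using `None` to stand for the CPU) are assumptions made for illustration, and it only records a device number rather than touching a real GPU.

```python
import random


class ChainableDataset:
    """Hypothetical stand-in showing configuration methods that return `self`."""

    def __init__(self, n_items, batch_size):
        self.n_items = n_items
        self.batch_size = batch_size
        self.is_batch = False
        self.batch_indices = None
        self.device = None  # None stands for the CPU in this sketch

    def batch(self):
        # Record the start index of every full batch.
        last_start = self.n_items - self.batch_size
        self.batch_indices = list(range(0, last_start + 1, self.batch_size))
        self.is_batch = True
        return self

    def shuffle(self):
        # Mirrors the documented behaviour: shuffling requires batching first.
        if not self.is_batch:
            raise NotImplementedError("shuffle() requires a batched dataset")
        random.shuffle(self.batch_indices)
        return self

    def to_gpu(self, device=0):
        self.device = device  # only records the device number
        return self

    def from_gpu(self):
        self.device = None
        return self


ds = ChainableDataset(n_items=10, batch_size=2).batch().shuffle().to_gpu(0)
```

Returning `self` from each method keeps the configuration on one line; calling `shuffle()` before `batch()` in this sketch raises `NotImplementedError`, matching the documented exception.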
- class Dataset(inputs, outputs)
Bases: object
An in-memory dataset: not suitable for large datasets.
This class represents a basic dataset with inputs and corresponding outputs. It supports iteration, indexing, and shuffling of data.
- inputs
A sequence of input data.
- outputs
A sequence of output data corresponding to the inputs.
- _indices
A list of indices for accessing data.
- _index
The current index for iteration.
- Parameters:
inputs (Sequence) – A sequence of input data.
outputs (Sequence) – A sequence of output data.
- Raises:
AssertionError – If the lengths of inputs and outputs are not equal.
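The iteration, indexing, and shuffling behaviour described above can be sketched with a small stand-in class. `SimpleDataset` is a hypothetical name, and routing every lookup through `_indices` (so that shuffling reorders access without moving the data) is an assumption about how such a class might work, not the documented implementation.

```python
import random


class SimpleDataset:
    """Hypothetical in-memory dataset pairing inputs with outputs."""

    def __init__(self, inputs, outputs):
        assert len(inputs) == len(outputs), "inputs and outputs must have equal length"
        self.inputs = inputs
        self.outputs = outputs
        self._indices = list(range(len(inputs)))
        self._index = 0

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # Look up through _indices so a shuffle changes the order of access only.
        real = self._indices[idx]
        return self.inputs[real], self.outputs[real]

    def __iter__(self):
        self._index = 0
        return self

    def __next__(self):
        if self._index >= len(self):
            raise StopIteration
        item = self[self._index]
        self._index += 1
        return item

    def shuffle(self):
        random.shuffle(self._indices)
        return self


data = SimpleDataset([1, 2, 3], [2, 4, 6])
items = list(data)
second = data[1]
shuffled_items = list(data.shuffle())
```

Since both sequences are held in memory in full, this approach is only practical when the data fits comfortably in RAM.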
- class InfiniteBatchDataset(inputs, outputs, batch_size)
Bases: Dataset
An infinite batch dataset that generates random batches.
This class extends the Dataset class to provide infinite batches of data. It randomly selects items from the dataset to form batches.
- is_infinite
A boolean indicating if the dataset is infinite.
- _to_tensor
A boolean indicating if the data should be converted to tensors.
- is_batched
A boolean indicating if the data is batched.
- batch_size
The size of each batch.
- Parameters:
inputs (Sequence) – A sequence of input data.
outputs (Sequence) – A sequence of output data.
batch_size (int) – The size of each batch.
- is_batched = True
- is_infinite = True
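One way to picture an endless stream of random batches is a generator that never terminates. The sketch below uses a hypothetical function, `infinite_batches`, with an added `seed` parameter for reproducibility; sampling with replacement is an assumption, since the source only says items are selected randomly.

```python
import random
from itertools import islice


def infinite_batches(inputs, outputs, batch_size, seed=None):
    """Yield random (inputs, outputs) batches forever. Hypothetical helper."""
    assert len(inputs) == len(outputs)
    rng = random.Random(seed)
    while True:
        # Sample positions with replacement, so batch_size may exceed the data length.
        picks = [rng.randrange(len(inputs)) for _ in range(batch_size)]
        yield [inputs[i] for i in picks], [outputs[i] for i in picks]


inputs = [1, 2, 3, 4]
outputs = [10, 20, 30, 40]

# The stream never ends, so take a bounded number of batches explicitly.
first_three = list(islice(infinite_batches(inputs, outputs, batch_size=2, seed=0), 3))
```

Because iteration never stops on its own, a training loop over such a dataset needs an explicit step limit, as `islice` provides here.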