dataset

class CausalLMDataset(tokens, vocab_size, batch_size, context_window, should_one_hot_encode=False)[source]

Bases: object

A dataset for causal language modeling tasks.

This class creates batches of token sequences for training causal language models.

tokens

The input token sequence.

vocab_size

The size of the vocabulary.

batch_size

The size of each batch.

context_window

The size of the context window.

is_batch

A boolean indicating if the data is batched.

as_tensor

A boolean indicating if the data should be returned as tensors.

_idx

The current index for iteration.

batch_indices

The indices used for batching.

should_one_hot_encode

A boolean indicating if outputs should be one-hot encoded.

device

The GPU device number to place tensors on, if GPU processing is enabled.

Parameters:
  • tokens (ndarray) – The input token sequence.

  • vocab_size (int) – The size of the vocabulary.

  • batch_size (int) – The size of each batch.

  • context_window (int) – The size of the context window.

  • should_one_hot_encode (bool) – Whether to one-hot encode the outputs.
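As a rough illustration of what such a dataset yields, the sketch below shows the standard way causal LM batches are sliced from a token stream: each target sequence is the input sequence shifted one position ahead, so the model learns to predict the next token. The function name and the exact slicing scheme are assumptions for illustration, not the class's actual implementation.

```python
import numpy as np

def make_causal_lm_batch(tokens, batch_size, context_window, start=0):
    # Hypothetical helper: slice a token stream into (inputs, targets)
    # pairs, where targets are inputs shifted one token to the right.
    inputs, targets = [], []
    for b in range(batch_size):
        i = start + b
        inputs.append(tokens[i : i + context_window])
        targets.append(tokens[i + 1 : i + context_window + 1])
    return np.stack(inputs), np.stack(targets)

tokens = np.arange(10)
x, y = make_causal_lm_batch(tokens, batch_size=2, context_window=4)
# x[0] = [0, 1, 2, 3] and y[0] = [1, 2, 3, 4]
```

With `should_one_hot_encode=True`, the targets would additionally be expanded into one-hot vectors of width `vocab_size`.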

batch()[source]

Configures the dataset for batch processing.

Returns:

The dataset object configured for batch processing.

from_gpu()[source]

Resets the device to CPU processing.

Returns:

The dataset object configured for CPU processing.

shuffle()[source]

Shuffles the batch indices.

Returns:

The dataset object with shuffled batch indices.

Raises:

NotImplementedError – If trying to shuffle a non-batched dataset.

to_gpu(device=0)[source]

Sets the device for GPU processing.

Parameters:

device (int) – The GPU device number.

Returns:

The dataset object configured for GPU processing.

to_tensor()[source]

Configures the dataset to return tensors.

Returns:

The dataset object configured to return tensors.

unbatch()[source]

Configures the dataset for non-batch processing.

Returns:

The dataset object configured for non-batch processing.
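Since batch(), unbatch(), shuffle(), to_tensor(), to_gpu(), and from_gpu() all return the dataset object, they can be chained fluently. A minimal sketch of that pattern (the class and attribute names here are illustrative, not the real implementation):

```python
class FluentDataset:
    # Sketch of the fluent-configuration pattern: each method
    # mutates state and returns self, so calls chain.
    def __init__(self):
        self.is_batch = False
        self.as_tensor = False
        self.device = None

    def batch(self):
        self.is_batch = True
        return self

    def to_tensor(self):
        self.as_tensor = True
        return self

    def to_gpu(self, device=0):
        self.device = device
        return self

    def from_gpu(self):
        self.device = None
        return self

ds = FluentDataset().batch().to_tensor().to_gpu(1)
# ds.is_batch and ds.as_tensor are True; ds.device == 1
```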

class Dataset(inputs, outputs)[source]

Bases: object

An in-memory dataset; not suitable for datasets too large to fit in memory.

This class represents a basic dataset with inputs and corresponding outputs. It supports iteration, indexing, and shuffling of data.

inputs

A sequence of input data.

outputs

A sequence of output data corresponding to the inputs.

_indices

A list of indices for accessing data.

_index

The current index for iteration.

Parameters:
  • inputs (Sequence) – A sequence of input data.

  • outputs (Sequence) – A sequence of output data.

Raises:

AssertionError – If the length of inputs and outputs are not equal.
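The key invariant is that inputs and outputs stay paired under shuffling, which is why indexing goes through a shared index list (`_indices`). A simplified sketch of that design, with assumed names:

```python
import random

class PairedDataset:
    # Illustrative sketch: inputs and outputs remain aligned because
    # both indexing and shuffling go through one shared index list.
    def __init__(self, inputs, outputs):
        assert len(inputs) == len(outputs), "inputs and outputs must match"
        self.inputs = inputs
        self.outputs = outputs
        self._indices = list(range(len(inputs)))

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        j = self._indices[i]
        return self.inputs[j], self.outputs[j]

    def shuffle(self):
        random.shuffle(self._indices)
        return self

ds = PairedDataset(["a", "b", "c"], [1, 2, 3]).shuffle()
pairs = [ds[i] for i in range(len(ds))]
# pairs contains ("a", 1), ("b", 2), ("c", 3) in some shuffled order
```

Shuffling the index list rather than the data itself keeps the operation cheap and leaves the underlying sequences untouched.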

copy()[source]

Creates a shallow copy of the dataset.

Returns:

A new Dataset object with copied inputs and outputs.

reset()[source]

Resets the iteration index to 0.

Returns:

The dataset object with reset index.

shuffle()[source]

Shuffles the dataset indices.

Returns:

The dataset object with shuffled indices.

to_tensor()[source]

Converts inputs and outputs to Tensor objects.

Returns:

The dataset object with inputs and outputs as Tensors.

class InfiniteBatchDataset(inputs, outputs, batch_size)[source]

Bases: Dataset

An infinite batch dataset that generates random batches.

This class extends the Dataset class to provide infinite batches of data. It randomly selects items from the dataset to form batches.

is_infinite

A boolean indicating if the dataset is infinite.

_to_tensor

A boolean indicating if the data should be converted to tensors.

is_batched

A boolean indicating if the data is batched.

batch_size

The size of each batch.

Parameters:
  • inputs (Sequence) – A sequence of input data.

  • outputs (Sequence) – A sequence of output data.

  • batch_size (int) – The size of each batch.
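An infinite dataset like this is typically consumed as a generator that never terminates; each batch is drawn by sampling random indices. The sketch below assumes sampling with replacement, which is one plausible reading of "randomly selects items"; the actual sampling scheme and names are assumptions.

```python
import random

def infinite_batches(inputs, outputs, batch_size, seed=None):
    # Hypothetical sketch: yield batches forever by sampling
    # random indices, keeping each input paired with its output.
    rng = random.Random(seed)
    n = len(inputs)
    while True:
        idx = [rng.randrange(n) for _ in range(batch_size)]
        yield [inputs[i] for i in idx], [outputs[i] for i in idx]

gen = infinite_batches(list(range(100)), [i * 2 for i in range(100)],
                       batch_size=4, seed=0)
xs, ys = next(gen)
# every sampled output equals twice its input, and batches never run out
```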

is_batched = True
is_infinite = True
to_tensor()[source]

Sets the flag to convert data to tensors.

Returns:

The dataset object with _to_tensor flag set to True.