dataset

class CausalLMDataset(tokens, vocab_size, batch_size, context_window, should_one_hot_encode=False)[source]

Bases: object

A dataset for causal language modeling tasks.

This class creates batches of token sequences for training causal language models.

tokens

The input token sequence.

vocab_size

The size of the vocabulary.

batch_size

The size of each batch.

context_window

The size of the context window.

is_batch

A boolean indicating if the data is batched.

as_tensor

A boolean indicating if the data should be returned as tensors.

_idx

The current index for iteration.

batch_indices

The indices used for batching.

should_one_hot_encode

A boolean indicating if outputs should be one-hot encoded.

device

The GPU device number to place tensors on, if GPU processing is enabled.

Parameters:
  • tokens (ndarray) – The input token sequence.

  • vocab_size (int) – The size of the vocabulary.

  • batch_size (int) – The size of each batch.

  • context_window (int) – The size of the context window.

  • should_one_hot_encode (bool) – Whether to one-hot encode the outputs.
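As a rough illustration of what such a dataset yields, the sketch below shows the standard way causal LM batches are sliced from a token stream: each target sequence is the input sequence shifted one position ahead, so the model learns to predict the next token. The function name and the exact slicing scheme are assumptions for illustration, not the class's actual implementation.

```python
import numpy as np

def make_causal_lm_batch(tokens, batch_size, context_window, start=0):
    # Hypothetical helper: slice a token stream into (inputs, targets)
    # pairs, where targets are inputs shifted one token to the right.
    inputs, targets = [], []
    for b in range(batch_size):
        i = start + b
        inputs.append(tokens[i : i + context_window])
        targets.append(tokens[i + 1 : i + context_window + 1])
    return np.stack(inputs), np.stack(targets)

tokens = np.arange(10)
x, y = make_causal_lm_batch(tokens, batch_size=2, context_window=4)
# x[0] = [0, 1, 2, 3] and y[0] = [1, 2, 3, 4]
```

With `should_one_hot_encode=True`, the targets would additionally be expanded into one-hot vectors of width `vocab_size`.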

batch()[source]

Configures the dataset for batch processing.

Returns:

The dataset object configured for batch processing.

from_gpu()[source]

Resets the device to CPU processing.

Returns:

The dataset object configured for CPU processing.

shuffle()[source]

Shuffles the batch indices.

Returns:

The dataset object with shuffled batch indices.

Raises:

NotImplementedError – If trying to shuffle a non-batched dataset.

to_gpu(device=0)[source]

Sets the device for GPU processing.

Parameters:

device (int) – The GPU device number.

Returns:

The dataset object configured for GPU processing.

to_tensor()[source]

Configures the dataset to return tensors.

Returns:

The dataset object configured to return tensors.

unbatch()[source]

Configures the dataset for non-batch processing.

Returns:

The dataset object configured for non-batch processing.
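Since batch(), unbatch(), shuffle(), to_tensor(), to_gpu(), and from_gpu() all return the dataset object, they can be chained fluently. A minimal sketch of that pattern (the class and attribute names here are illustrative, not the real implementation):

```python
class FluentDataset:
    # Sketch of the fluent-configuration pattern: each method
    # mutates state and returns self, so calls chain.
    def __init__(self):
        self.is_batch = False
        self.as_tensor = False
        self.device = None

    def batch(self):
        self.is_batch = True
        return self

    def to_tensor(self):
        self.as_tensor = True
        return self

    def to_gpu(self, device=0):
        self.device = device
        return self

    def from_gpu(self):
        self.device = None
        return self

ds = FluentDataset().batch().to_tensor().to_gpu(1)
# ds.is_batch and ds.as_tensor are True; ds.device == 1
```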

class Dataset(inputs, outputs)[source]

Bases: object

An in-memory dataset; not suitable for datasets too large to fit in memory.

This class represents a basic dataset with inputs and corresponding outputs. It supports iteration, indexing, and shuffling of data.

inputs

A sequence of input data.

outputs

A sequence of output data corresponding to the inputs.

_indices

A list of indices for accessing data.

_index

The current index for iteration.

Parameters:
  • inputs (Sequence) – A sequence of input data.

  • outputs (Sequence) – A sequence of output data.

Raises:

AssertionError – If the length of inputs and outputs are not equal.
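The key invariant is that inputs and outputs stay paired under shuffling, which is why indexing goes through a shared index list (`_indices`). A simplified sketch of that design, with assumed names:

```python
import random

class PairedDataset:
    # Illustrative sketch: inputs and outputs remain aligned because
    # both indexing and shuffling go through one shared index list.
    def __init__(self, inputs, outputs):
        assert len(inputs) == len(outputs), "inputs and outputs must match"
        self.inputs = inputs
        self.outputs = outputs
        self._indices = list(range(len(inputs)))

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        j = self._indices[i]
        return self.inputs[j], self.outputs[j]

    def shuffle(self):
        random.shuffle(self._indices)
        return self

ds = PairedDataset(["a", "b", "c"], [1, 2, 3]).shuffle()
pairs = [ds[i] for i in range(len(ds))]
# pairs contains ("a", 1), ("b", 2), ("c", 3) in some shuffled order
```

Shuffling the index list rather than the data itself keeps the operation cheap and leaves the underlying sequences untouched.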

copy()[source]

Creates a shallow copy of the dataset.

Returns:

A new Dataset object with copied inputs and outputs.

reset()[source]

Resets the iteration index to 0.

Returns:

The dataset object with reset index.

shuffle()[source]

Shuffles the dataset indices.

Returns:

The dataset object with shuffled indices.

to_tensor()[source]

Converts inputs and outputs to Tensor objects.

Returns:

The dataset object with inputs and outputs as Tensors.

class InfiniteBatchDataset(inputs, outputs, batch_size)[source]

Bases: Dataset

An infinite batch dataset that generates random batches.

This class extends the Dataset class to provide infinite batches of data. It randomly selects items from the dataset to form batches.

is_infinite

A boolean indicating if the dataset is infinite.

_to_tensor

A boolean indicating if the data should be converted to tensors.

is_batched

A boolean indicating if the data is batched.

batch_size

The size of each batch.

Parameters:
  • inputs (Sequence) – A sequence of input data.

  • outputs (Sequence) – A sequence of output data.

  • batch_size (int) – The size of each batch.
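An infinite dataset like this is typically consumed as a generator that never terminates; each batch is drawn by sampling random indices. The sketch below assumes sampling with replacement, which is one plausible reading of "randomly selects items"; the actual sampling scheme and names are assumptions.

```python
import random

def infinite_batches(inputs, outputs, batch_size, seed=None):
    # Hypothetical sketch: yield batches forever by sampling
    # random indices, keeping each input paired with its output.
    rng = random.Random(seed)
    n = len(inputs)
    while True:
        idx = [rng.randrange(n) for _ in range(batch_size)]
        yield [inputs[i] for i in idx], [outputs[i] for i in idx]

gen = infinite_batches(list(range(100)), [i * 2 for i in range(100)],
                       batch_size=4, seed=0)
xs, ys = next(gen)
# every sampled output equals twice its input, and batches never run out
```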

is_batched = True
is_infinite = True
to_tensor()[source]

Sets the flag to convert data to tensors.

Returns:

The dataset object with _to_tensor flag set to True.