fineweb

Prepares and manages web text data from Fineweb.

This module provides functionality to download, tokenize, and manage the fineweb dataset. It includes utilities for data preparation and a custom dataset class for efficient data loading.

Typical usage example:

dataset = FineWeb(vocab_size=50257, split='train')
tokens = dataset[0:1000]  # Get the first 1000 tokens

class FineWeb(vocab_size, split, token_path=None)[source]

Bases: Sequence

A custom dataset class for efficient loading of tokenized fineweb data.

This class provides an interface to access tokenized fineweb data, supporting indexing and length operations. It also includes methods for encoding and decoding tokens.

vocab_size

An integer representing the vocabulary size.

Type:

int

token_path

A Path object pointing to the tokenized data file.

Type:

pathlib.Path

tokeniser_string

A string specifying the tokenizer to use (default: “gpt2”).

Type:

str

tokens

A numpy memmap of the tokenized data.

Type:

numpy.ndarray

Parameters:
  • vocab_size (int) – An integer specifying the vocabulary size.

  • split (Literal['train'] | Literal['valid']) – A string literal, either “train” or “valid”, specifying the dataset split.

  • token_path (Path) – An optional Path object for the tokenized data file.

Raises:

ValueError – If the tokenizer’s max token value doesn’t match the specified vocab size.
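
A minimal sketch of loading a split, assuming the tokenized data already exists on disk (the custom path shown is purely illustrative):

from pathlib import Path

dataset = FineWeb(vocab_size=50257, split="valid")
print(len(dataset))       # total number of tokens in the split
window = dataset[0:256]   # Sequence-style slice backed by the token memmap

# A custom location can be supplied explicitly; this path is hypothetical.
custom = FineWeb(vocab_size=50257, split="train", token_path=Path("data/train.bin"))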

decode(*args)[source]

Decodes the input tokens into text.

Parameters:

*args – Variable length argument list to be passed to the tokenizer.

Returns:

A string of decoded text.

encode(*args)[source]

Encodes the input text into tokens.

Parameters:

*args – Variable length argument list to be passed to the tokenizer.

Returns:

A list of integer token IDs.
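
A short round-trip sketch; arguments are forwarded to the underlying GPT-2 tokenizer, so a plain string is assumed to be accepted:

dataset = FineWeb(vocab_size=50257, split="train")
ids = dataset.encode("Hello, world!")   # list of integer token IDs
text = dataset.decode(ids)              # back to the original string
assert text == "Hello, world!"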

token_path: Path
tokeniser_string: str = 'gpt2'
tokens: ndarray
vocab_size: int
prepare_data()[source]

Downloads and tokenizes the fineweb dataset.

This function is adapted from Andrej Karpathy’s NanoGPT: https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py

The function performs the following steps:

  1. Loads the dataset
  2. Splits it into train and validation sets
  3. Tokenizes the dataset
  4. Saves the tokenized data to binary files

Note

This function uses OpenAI’s tiktoken for tokenization due to performance considerations.
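
A typical one-off preparation run might look as follows; the output file names are an assumption carried over from the nanoGPT recipe this function adapts:

prepare_data()  # downloads, tokenizes, and writes e.g. train.bin / valid.bin (assumed names)

# Afterwards the tokenized splits can be loaded lazily through the dataset class.
train = FineWeb(vocab_size=50257, split="train")
valid = FineWeb(vocab_size=50257, split="valid")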

tokenise_document(example)[source]

Tokenizes a single document from the dataset.

Parameters:

example – A dictionary containing the ‘text’ field to be tokenized.

Returns:

A dictionary with ‘ids’ (tokenized text) and ‘len’ (number of tokens).
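
A hypothetical sketch of applying it to one record loaded with the Hugging Face datasets library; the dataset name and streaming access are assumptions, not part of this module:

from datasets import load_dataset

raw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
example = next(iter(raw))                 # a dict with a 'text' field
tokenised = tokenise_document(example)
print(tokenised["len"], tokenised["ids"][:10])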