fineweb¶
Prepares and manages web text data from FineWeb.
This module provides functionality to download, tokenize, and manage the FineWeb dataset. It includes utilities for data preparation and a custom dataset class for efficient data loading.
Typical usage example:

    dataset = FineWeb(vocab_size=50257, split="train")
    tokens = dataset[0:1000]  # Get the first 1000 tokens
- class FineWeb(vocab_size, split, token_path=None)[source]¶
Bases:
Sequence
A custom dataset class for efficient loading of tokenized FineWeb data.
This class provides an interface to access tokenized FineWeb data, supporting indexing and length operations. It also includes methods for encoding and decoding tokens; see the usage sketch after the attribute listing below.
- vocab_size¶
An integer representing the vocabulary size.
- Type:
int
- token_path¶
A Path object pointing to the tokenized data file.
- Type:
pathlib.Path
- tokeniser_string¶
A string specifying the tokenizer to use (default: “gpt2”).
- Type:
str
- tokens¶
A numpy memmap of the tokenized data.
- Type:
numpy.ndarray
- Parameters:
vocab_size (int) – An integer specifying the vocabulary size.
split (Literal['train'] | Literal['valid']) – A string literal, either "train" or "valid", specifying the dataset split.
token_path (Path) – An optional Path object for the tokenized data file.
- Raises:
ValueError – If the tokenizer’s max token value doesn’t match the specified vocab size.
- decode(*args)[source]¶
Decodes the input tokens into text.
- Parameters:
*args – Variable length argument list to be passed to the tokenizer.
- Returns:
A string of decoded text.
- encode(*args)[source]¶
Encodes the input text into tokens.
- Parameters:
*args – Variable length argument list to be passed to the tokenizer.
- Returns:
A list of integer token IDs.
- token_path: Path¶
- tokeniser_string: str = 'gpt2'¶
- tokens: ndarray¶
- vocab_size: int¶
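A minimal usage sketch of the class, assuming the module is importable as fineweb and that a GPT-2-sized vocabulary (50257) matches the prepared data; it only exercises the interface documented above, so the exact values are illustrative.

    from fineweb import FineWeb

    # Open the tokenized training split; vocab_size must match the
    # tokenizer's maximum token value or a ValueError is raised.
    dataset = FineWeb(vocab_size=50257, split="train")

    print(len(dataset))       # total number of tokens in the memmap
    tokens = dataset[0:1000]  # first 1000 token IDs

    # Round-trip text through the tokenizer helpers.
    ids = dataset.encode("Hello, world!")  # list of integer token IDs
    text = dataset.decode(ids)             # decoded string

Because the tokens attribute is a numpy memmap, slicing reads only the requested region from disk rather than loading the whole tokenized file into memory.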
- prepare_data()[source]¶
Downloads and tokenizes the FineWeb dataset.
This function is adapted from Andrej Karpathy’s NanoGPT: https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py
The function performs the following steps:
1. Loads the dataset
2. Splits it into train and validation sets
3. Tokenizes the dataset
4. Saves the tokenized data to binary files
Note
This function uses OpenAI’s tiktoken for tokenization due to performance considerations.
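A hedged sketch of the flow listed above, following the nanoGPT prepare.py pattern this function is adapted from; the Hugging Face dataset name and config, split fraction, process count, and output file names are illustrative assumptions, not the module's actual configuration.

    # Illustrative sketch only: dataset name, split fraction, and paths are assumptions.
    import numpy as np
    import tiktoken
    from datasets import load_dataset

    enc = tiktoken.get_encoding("gpt2")

    def tokenize(example):
        ids = enc.encode_ordinary(example["text"])  # ignores special tokens
        ids.append(enc.eot_token)                   # delimit documents with end-of-text
        return {"ids": ids, "len": len(ids)}

    # 1. Load the dataset (name and config assumed for illustration).
    raw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")

    # 2. Split into train and validation sets.
    splits = raw.train_test_split(test_size=0.0005, seed=2357, shuffle=True)
    splits["val"] = splits.pop("test")

    # 3. Tokenize the dataset.
    tokenized = splits.map(tokenize, remove_columns=["text"], num_proc=8)

    # 4. Save the tokenized data to binary files via a uint16 memmap.
    for name, ds in tokenized.items():
        total = int(np.sum(ds["len"], dtype=np.uint64))
        arr = np.memmap(f"{name}.bin", dtype=np.uint16, mode="w+", shape=(total,))
        idx = 0
        for example in ds:
            arr[idx : idx + example["len"]] = example["ids"]
            idx += example["len"]
        arr.flush()

Writing to a uint16 memmap keeps the on-disk format compact (GPT-2 token IDs fit in 16 bits) and is what allows the FineWeb class above to read the binary files lazily via numpy.memmap.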