codeparrot¶
This module prepares and handles the CodeParrot dataset: a dataset of Python files scraped from GitHub.
It downloads, tokenizes, and processes the CodeParrot dataset, creating memory-mapped files for efficient data handling during training. The module also provides a CodeParrot class for easy access to the processed data.
Typical usage example:
dataset = CodeParrot(vocab_size=100000, split="train")
tokens = dataset[0:1000]  # Get the first 1000 tokens
- class CodeParrot(vocab_size, split, token_path=None)[source]¶
Bases:
Sequence
A class to handle the CodeParrot dataset.
This class provides an interface to access the tokenized CodeParrot dataset, including methods for encoding and decoding text.
- url¶
The source URL of the dataset.
- Type:
str
- vocab_size¶
The size of the vocabulary.
- Type:
int
- token_path¶
The path to the tokenized data file.
- Type:
pathlib.Path
- tokeniser_string¶
The name of the tokenizer to use.
- Type:
str
- tokens¶
The memory-mapped array of tokens.
- Type:
numpy.ndarray
- Parameters:
vocab_size (int) – The size of the vocabulary to use.
split (Literal['train', 'valid']) – The dataset split to use ("train" or "valid").
token_path (Path) – Optional custom path to the tokenized data file.
- decode(*args)[source]¶
Decodes the input tokens into text.
- Parameters:
*args – The tokens to decode.
- Returns:
The decoded text as a string.
- encode(*args)[source]¶
Encodes the input text into tokens.
- Parameters:
*args – The text to encode.
- Returns:
A list of token ids.
- token_path: Path¶
- tokeniser_string: str = 'cl100k_base'¶
- tokens: ndarray¶
- url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'¶
- vocab_size: int¶
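To illustrate how the attributes above fit together, here is a minimal, hypothetical stand-in for the class: a `Sequence` over a memory-mapped token file, as the `tokens` attribute describes. This is a sketch only — the real class also downloads the dataset and wraps the `cl100k_base` tokenizer, both omitted here, and `MiniTokenDataset` is an invented name.

```python
from collections.abc import Sequence
from pathlib import Path
import tempfile

import numpy as np


class MiniTokenDataset(Sequence):
    """Hypothetical stand-in for CodeParrot: a Sequence over a
    memory-mapped token file (download and tokenizer omitted)."""

    def __init__(self, token_path: Path, vocab_size: int):
        self.vocab_size = vocab_size
        self.token_path = token_path
        # Memory-map the file so slicing reads from disk lazily
        # instead of loading every token into RAM.
        self.tokens = np.memmap(token_path, dtype=np.uint16, mode="r")

    def __len__(self) -> int:
        return len(self.tokens)

    def __getitem__(self, index):
        return self.tokens[index]


# Write a small binary token file to demonstrate slicing.
tmp = Path(tempfile.mkdtemp()) / "train.bin"
np.arange(1000, dtype=np.uint16).tofile(tmp)

dataset = MiniTokenDataset(tmp, vocab_size=100_000)
first = dataset[0:5]
```

Subclassing `Sequence` is what makes slice access like `dataset[0:1000]` from the usage example work uniformly over the memory-mapped array.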
- prepare_data()[source]¶
Downloads and tokenizes the CodeParrot dataset.
This function splits the dataset into train and validation sets, tokenizes the content, and saves the tokenized data as memory-mapped files.
Note
This script is adapted from Andrej Karpathy’s NanoGPT: https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py
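The train/validation split and memory-mapped output described above follow the same pattern as the nanoGPT script referenced in the note. A simplified sketch of that pattern, with a toy byte-level tokenizer standing in for `cl100k_base` and the download step omitted (`prepare_split` is an invented helper, not the module's API):

```python
from pathlib import Path
import tempfile

import numpy as np


def prepare_split(text: str, out_path: Path) -> int:
    """Tokenize text and save it as a binary file for memory-mapped reads.

    Sketch only: one token per byte stands in for the real tokenizer.
    """
    token_ids = list(text.encode("utf-8"))  # toy byte-level "tokenizer"
    # uint16 follows the nanoGPT convention: it halves disk usage vs int32,
    # provided the vocabulary fits in 65536 ids.
    arr = np.array(token_ids, dtype=np.uint16)
    arr.tofile(out_path)
    return len(arr)


# Split the corpus 90/10 into train and validation, then write both files.
corpus = "def hello():\n    print('hi')\n" * 100
split_at = int(0.9 * len(corpus))
out_dir = Path(tempfile.mkdtemp())
n_train = prepare_split(corpus[:split_at], out_dir / "train.bin")
n_valid = prepare_split(corpus[split_at:], out_dir / "valid.bin")

# Training code can later re-open the files lazily with np.memmap.
train_tokens = np.memmap(out_dir / "train.bin", dtype=np.uint16, mode="r")
```

Writing with `ndarray.tofile` and reading back with `np.memmap` is what lets the training loop slice arbitrary token windows without holding the whole dataset in memory.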