codeparrot

This module prepares and handles the CodeParrot dataset: a dataset of Python files scraped from GitHub.

It downloads, tokenizes, and processes the CodeParrot dataset, creating memory-mapped files for efficient data handling during training. The module also provides a CodeParrot class for easy access to the processed data.

Typical usage example:

dataset = CodeParrot(vocab_size=100000, split="train")
tokens = dataset[0:1000]  # Get the first 1000 tokens

class CodeParrot(vocab_size, split, token_path=None)[source]

Bases: Sequence

A class to handle the CodeParrot dataset.

This class provides an interface to access the tokenized CodeParrot dataset, including methods for encoding and decoding text.

url

The source URL of the dataset.

Type:

str

vocab_size

The size of the vocabulary.

Type:

int

token_path

The path to the tokenized data file.

Type:

pathlib.Path

tokeniser_string

The name of the tokenizer to use.

Type:

str

tokens

The memory-mapped array of tokens.

Type:

numpy.ndarray

Parameters:
  • vocab_size (int) – The size of the vocabulary to use.

  • split (Literal['train'] | Literal['valid']) – The dataset split to use ("train" or "valid").

  • token_path (Path) – Optional custom path to the tokenized data file.
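
A minimal usage sketch. Because the class derives from Sequence, len() and slicing are available; the custom token path below is hypothetical and the memory-mapped token file must already exist (e.g. after running prepare_data):

from pathlib import Path

dataset = CodeParrot(
    vocab_size=100000,
    split="valid",
    token_path=Path("data/valid_tokens.bin"),  # hypothetical path to an existing token file
)

print(len(dataset))     # total number of tokens in the split
block = dataset[0:256]  # Sequence-style slicing returns raw token ids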

decode(*args)[source]

Decodes the input tokens into text.

Parameters:

*args – The tokens to decode.

Returns:

The decoded text as a string.

encode(*args)[source]

Encodes the input text into tokens.

Parameters:

*args – The text to encode.

Returns:

A list of token ids.
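
A sketch of the encode/decode round trip. The signatures above only show *args, so a single string argument to encode and a single list of token ids to decode are assumed:

dataset = CodeParrot(vocab_size=100000, split="train")

ids = dataset.encode("def add(a, b):\n    return a + b\n")  # list of token ids
text = dataset.decode(ids)                                  # decoded back to a string
# With a lossless tokenizer such as cl100k_base, the round trip recovers the input.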

token_path: Path
tokeniser_string: str = 'cl100k_base'
tokens: ndarray
url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
vocab_size: int
prepare_data()[source]

Downloads and tokenizes the CodeParrot dataset.

This function splits the dataset into train and validation sets, tokenizes the content, and saves the tokenized data as memory-mapped files.

Note

This script is adapted from Andrej Karpathy’s NanoGPT: https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py
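
A self-contained sketch of the pattern described above (tokenize each document, then write the ids into a single memory-mapped array). The placeholder corpus, output file name, and dtype are illustrative, not the module's actual values:

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

documents = ["print('hello')\n", "def f(x):\n    return x * 2\n"]  # placeholder corpus
all_ids = [enc.encode_ordinary(doc) for doc in documents]
total_len = sum(len(ids) for ids in all_ids)

# cl100k_base has ~100k tokens, so uint16 (max 65535) is too small; use uint32 here.
arr = np.memmap("train.bin", dtype=np.uint32, mode="w+", shape=(total_len,))
offset = 0
for ids in all_ids:
    arr[offset:offset + len(ids)] = ids
    offset += len(ids)
arr.flush()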

tokenise_document(example)[source]

Tokenizes a single document from the dataset.

Parameters:

example – A dictionary containing the document content.

Returns:

A dictionary with tokenized 'ids' and 'len' fields.
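
A sketch of the expected input and output. The "content" field name and calling tokenise_document as a module-level function are assumptions based on the signature above:

example = {"content": "import os\nprint(os.getcwd())\n"}  # assumed field name

out = tokenise_document(example)
print(out["ids"][:5])  # first few token ids for the document
print(out["len"])      # total token count for the document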