codeparrot

This module prepares and handles the CodeParrot dataset: a dataset of Python files scraped from GitHub.

It downloads, tokenizes, and processes the CodeParrot dataset, creating memory-mapped files for efficient data handling during training. The module also provides a CodeParrot class for easy access to the processed data.

Typical usage example:

dataset = CodeParrot(vocab_size=100000, split="train")
tokens = dataset[0:1000]  # Get the first 1000 tokens

class CodeParrot(vocab_size, split, token_path=None)[source]

Bases: Sequence

A class to handle the CodeParrot dataset.

This class provides an interface to access the tokenized CodeParrot dataset, including methods for encoding and decoding text.

url

The source URL of the dataset.

Type:

str

vocab_size

The size of the vocabulary.

Type:

int

token_path

The path to the tokenized data file.

Type:

pathlib.Path

tokeniser_string

The name of the tokenizer to use.

Type:

str

tokens

The memory-mapped array of tokens.

Type:

numpy.ndarray

Parameters:
  • vocab_size (int) – The size of the vocabulary to use.

  • split (Literal['train'] | Literal['valid']) – The dataset split to use ("train" or "valid").

  • token_path (Path) – Optional custom path to the tokenized data file.
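
A minimal usage sketch. Because the class derives from Sequence, len() and slicing are available; the custom token path below is hypothetical and the memory-mapped token file must already exist (e.g. after running prepare_data):

from pathlib import Path

dataset = CodeParrot(
    vocab_size=100000,
    split="valid",
    token_path=Path("data/valid_tokens.bin"),  # hypothetical path to an existing token file
)

print(len(dataset))     # total number of tokens in the split
block = dataset[0:256]  # Sequence-style slicing returns raw token ids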

decode(*args)[source]

Decodes the input tokens into text.

Parameters:

*args – The tokens to decode.

Returns:

The decoded text as a string.

encode(*args)[source]

Encodes the input text into tokens.

Parameters:

*args – The text to encode.

Returns:

A list of token ids.
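
A sketch of the encode/decode round trip. The signatures above only show *args, so a single string argument to encode and a single list of token ids to decode are assumed:

dataset = CodeParrot(vocab_size=100000, split="train")

ids = dataset.encode("def add(a, b):\n    return a + b\n")  # list of token ids
text = dataset.decode(ids)                                  # decoded back to a string
# With a lossless tokenizer such as cl100k_base, the round trip recovers the input.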

token_path: Path
tokeniser_string: str = 'cl100k_base'
tokens: ndarray
url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
vocab_size: int
prepare_data()[source]

Downloads and tokenizes the CodeParrot dataset.

This function splits the dataset into train and validation sets, tokenizes the content, and saves the tokenized data as memory-mapped files.

Note

This script is adapted from Andrej Karpathy’s NanoGPT: https://github.com/karpathy/nanoGPT/blob/master/data/openwebtext/prepare.py
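
A self-contained sketch of the pattern described above (tokenize each document, then write the ids into a single memory-mapped array). The placeholder corpus, output file name, and dtype are illustrative, not the module's actual values:

import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

documents = ["print('hello')\n", "def f(x):\n    return x * 2\n"]  # placeholder corpus
all_ids = [enc.encode_ordinary(doc) for doc in documents]
total_len = sum(len(ids) for ids in all_ids)

# cl100k_base has ~100k tokens, so uint16 (max 65535) is too small; use uint32 here.
arr = np.memmap("train.bin", dtype=np.uint32, mode="w+", shape=(total_len,))
offset = 0
for ids in all_ids:
    arr[offset:offset + len(ids)] = ids
    offset += len(ids)
arr.flush()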

tokenise_document(example)[source]

Tokenizes a single document from the dataset.

Parameters:

example – A dictionary containing the document content.

Returns:

A dictionary with tokenized 'ids' and 'len' fields.
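
A sketch of the expected input and output. The "content" field name and calling tokenise_document as a module-level function are assumptions based on the signature above:

example = {"content": "import os\nprint(os.getcwd())\n"}  # assumed field name

out = tokenise_document(example)
print(out["ids"][:5])  # first few token ids for the document
print(out["len"])      # total token count for the document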