tokeniser

A module for implementing a Byte Pair Encoding (BPE) tokenizer.

This module provides functionality for training and using a BPE tokenizer, which is a subword tokenization algorithm commonly used in natural language processing.

Performance-critical functions are compiled with Numba.

class BPETokeniser(vocab_size)

Bases: object

A simple byte pair encoding tokeniser.

This class implements a BPE tokenizer with performance optimizations using Numba. It can be trained on text data and used to tokenize and detokenize text.

Parameters: vocab_size (int)

vocab_size: The maximum size of the vocabulary.

merges: A dictionary mapping token pairs to new token IDs.

pairs: A list of token pairs in the order they were merged.

vocab: A list of byte strings representing each token.

type_: A string indicating the implementation type (always “numba” in this version).

MIN_TOKENS = 256

decode(tokens)

Convert tokens into a string.

Parameters: tokens (ndarray | int) – A numpy array of token IDs or a single integer token ID.
Returns: The decoded string.
Return type: str
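As an illustration of what decoding involves, the following is a minimal sketch and not the class's actual code. It assumes that `vocab` is a list of byte strings indexed by token ID, as described above. The function name `decode_sketch` and the choice of `errors="replace"` for invalid UTF-8 are assumptions made for this example.

```python
import numpy as np


def decode_sketch(tokens, vocab):
    """Hypothetical decoder: join each token's bytes, then decode as UTF-8."""
    # np.atleast_1d lets a single integer ID be handled like a one-element array.
    ids = np.atleast_1d(tokens)
    raw = b"".join(vocab[int(t)] for t in ids)
    # Assumption: invalid byte sequences are replaced instead of raising.
    return raw.decode("utf-8", errors="replace")


# Toy vocabulary: the 256 single-byte tokens plus one merged token, b"ab".
toy_vocab = [bytes([i]) for i in range(256)] + [b"ab"]
text = decode_sketch(np.array([256, 99]), toy_vocab)
```

Joining the bytes before decoding matters because a multi-byte UTF-8 character may be split across several tokens.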

encode(text)

Tokenize a string.

Parameters: text (str) – A string to tokenize.
Returns: A numpy array of token IDs.
Return type: ndarray
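To make the encoding step concrete, here is a minimal pure-Python sketch. It assumes the text is first converted to UTF-8 bytes and that the learned merges are then applied in training order, with the k-th merged pair receiving the ID `MIN_TOKENS + k`. That ID scheme and the name `encode_sketch` are assumptions for illustration; the documented class may work differently.

```python
import numpy as np

MIN_TOKENS = 256  # one token per possible byte value


def encode_sketch(text, pairs):
    """Hypothetical encoder: `pairs` lists merged pairs in training order."""
    tokens = list(text.encode("utf-8"))
    for k, (a, b) in enumerate(pairs):
        new_id = MIN_TOKENS + k  # assumed ID scheme, not confirmed by the docs
        merged = []
        i = 0
        while i < len(tokens):
            # Merge the pair greedily from left to right.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return np.array(tokens, dtype=np.int64)


ids = encode_sketch("abab", [(97, 98), (256, 256)])
```

Applying merges in the order they were learned is what allows later merges, such as `(256, 256)` here, to build on earlier ones.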

classmethod load(path)

Load a tokeniser from a file.

Parameters: path (str | Path) – A string or Path object representing the file path to load the tokeniser from.
Returns: A BPETokeniser instance loaded from the file.

most_common_pair(counts, token_id)

Find the most common pair of tokens in the given counts array.

Parameters:

counts (ndarray) – A numpy array containing the counts of each possible pair.
token_id (int) – The maximum token ID to consider.

Returns:

A tuple of two integers representing the most common pair, or None if no repeated pairs exist.

Return type:

tuple[int, int] | None
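The selection step can be sketched as follows. The layout of the counts array is not specified above, so this example assumes a 2D array in which `counts[a, b]` holds the number of times token `a` is followed by token `b`, and it treats a pair as "repeated" when its count is at least 2. The name `most_common_pair_sketch` is hypothetical.

```python
import numpy as np


def most_common_pair_sketch(counts, token_id):
    """Hypothetical selection of the most frequent pair from a 2D counts array."""
    # Only consider pairs made of token IDs below token_id (assumed meaning).
    sub = counts[:token_id, :token_id]
    a, b = np.unravel_index(np.argmax(sub), sub.shape)
    if sub[a, b] < 2:
        return None  # no pair occurs more than once
    return int(a), int(b)


toy_counts = np.zeros((4, 4), dtype=np.int64)
toy_counts[1, 2] = 3
toy_counts[2, 1] = 1
best = most_common_pair_sketch(toy_counts, 4)
```

Note that `np.argmax` returns the first maximum in row-major order, so ties are broken in favour of the pair with the smallest indices in this sketch.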

replace_pair(data, pair, token_id)

Replace occurrences of a pair with a new token ID.

This method is a thin wrapper around the module-level replace_pair function documented below.

Parameters:

data (ndarray) – A numpy array of integers representing the input data.
pair (tuple[int, int]) – A tuple of two integers representing the pair to be replaced.
token_id (int) – An integer representing the new token ID to replace the pair.

Returns:

A numpy array with the pair replacements applied.

Return type:

ndarray

save(path)

Save the tokeniser to a file.

Parameters: path (str | Path) – A string or Path object representing the file path to save the tokeniser to.

tokenise_ints(int_array, loading_bar=False)

Tokenize an array of integers.

Parameters:

int_array (ndarray) – A numpy array of integers to tokenize.
loading_bar (bool) – Whether to display a progress bar during tokenization.

Returns:

A numpy array of tokenized integers.

Return type:

ndarray

train(text)

Train the tokeniser on a string.

Parameters: text (str) – A string to train the tokeniser on.
Returns: The trained BPETokeniser instance.

train_ints(int_array, loading_bar=False)

Train the tokeniser on an array of integers.

Parameters:

int_array (ndarray) – A numpy array of integers representing the training data.
loading_bar (bool) – Whether to display a progress bar during training.

Returns:

The trained BPETokeniser instance.

Warns:

If the number of pairs after training is less than the specified vocab_size.
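For orientation, a generic BPE training loop over an integer array can be sketched as below. This is a plain NumPy illustration under stated assumptions, not the Numba implementation documented here: it assumes input values are byte values (0 to 255), that new token IDs start at 256, and that training stops early when no pair occurs at least twice. The name `train_sketch` and its return values are hypothetical.

```python
import numpy as np

MIN_TOKENS = 256  # IDs 0..255 are reserved for single bytes


def train_sketch(int_array, vocab_size):
    """Hypothetical BPE training loop; returns (merges, vocab, compressed data)."""
    data = np.asarray(int_array, dtype=np.int64)
    merges = {}
    vocab = [bytes([i]) for i in range(MIN_TOKENS)]
    for token_id in range(MIN_TOKENS, vocab_size):
        if len(data) < 2:
            break
        # Count adjacent pairs: counts[a, b] = number of times a precedes b.
        counts = np.zeros((token_id, token_id), dtype=np.int64)
        np.add.at(counts, (data[:-1], data[1:]), 1)
        a, b = (int(x) for x in np.unravel_index(np.argmax(counts), counts.shape))
        if counts[a, b] < 2:
            break  # no repeated pair left, so the vocabulary stays smaller
        merges[(a, b)] = token_id
        vocab.append(vocab[a] + vocab[b])
        # Replace every non-overlapping occurrence of (a, b) with token_id.
        out = []
        i = 0
        while i < len(data):
            if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
                out.append(token_id)
                i += 2
            else:
                out.append(int(data[i]))
                i += 1
        data = np.array(out, dtype=np.int64)
    return merges, vocab, data


raw = np.frombuffer(b"abababab", dtype=np.uint8)
merges, vocab, compressed = train_sketch(raw, vocab_size=258)
```

The early `break` corresponds to the situation in which training ends with fewer merges than the requested vocabulary size allows.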

count_pairs(data, token_id)

Count the number of occurrences of each pair of integers in an array.

Parameters:

data (ndarray) – A numpy array of integers representing the input data.
token_id (int) – The maximum token ID to consider when counting pairs.

Returns:

A numpy array containing the counts of each possible pair.

Return type:

ndarray

Note

Parallel execution was tried but found to be slower than sequential execution.
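A sequential pair count can be sketched with plain NumPy as follows. The shape of the returned array is not stated above, so this example assumes a 2D array of shape `(token_id, token_id)` and that every value in `data` is smaller than `token_id`. The name `count_pairs_sketch` is hypothetical, and the documented function is compiled with Numba and may be structured differently.

```python
import numpy as np


def count_pairs_sketch(data, token_id):
    """Hypothetical pair counter: counts[a, b] = times a is followed by b."""
    counts = np.zeros((token_id, token_id), dtype=np.int64)
    # np.add.at accumulates correctly even when the same (a, b) index repeats,
    # which plain fancy-index assignment (counts[...] += 1) would not do.
    np.add.at(counts, (data[:-1], data[1:]), 1)
    return counts


toy_data = np.array([1, 2, 1, 2, 3], dtype=np.int64)
toy_pair_counts = count_pairs_sketch(toy_data, token_id=4)
```

An array of n tokens always yields n - 1 adjacent pairs, which gives a quick sanity check on the total.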

replace_pair(data, pair, token_id)

Replace every occurrence of pair with token_id in the given data array.

Parameters:

data (ndarray) – A numpy array of integers representing the input data.
pair (tuple[int, int]) – A tuple of two integers representing the pair to be replaced.
token_id (int) – An integer representing the new token ID to replace the pair.

Returns:

A numpy array with the pair replacements applied.

Return type:

ndarray
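The replacement step can be illustrated with a simple two-pointer loop. This is a sketch that assumes a greedy left-to-right scan in which matches do not overlap. The name `replace_pair_sketch` is hypothetical and the code is plain Python, not the Numba-compiled function documented here.

```python
import numpy as np


def replace_pair_sketch(data, pair, token_id):
    """Hypothetical replacement: scan left to right, merging each match."""
    out = np.empty_like(data)
    n = len(data)
    i = 0  # read position in data
    j = 0  # write position in out
    while i < n:
        if i + 1 < n and data[i] == pair[0] and data[i + 1] == pair[1]:
            out[j] = token_id
            i += 2  # skip both members of the matched pair
        else:
            out[j] = data[i]
            i += 1
        j += 1
    # The output is never longer than the input, so trim the unused tail.
    return out[:j]


toy_input = np.array([1, 2, 1, 2, 3], dtype=np.int64)
toy_output = replace_pair_sketch(toy_input, (1, 2), 256)
```

Writing into a preallocated buffer and slicing at the end avoids growing a list, a pattern that also tends to suit compiled loops.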