tokeniser

This module provides functionality for training and using a Byte Pair Encoding (BPE) tokenizer, a subword tokenization algorithm commonly used in natural language processing.

Performance-critical functions are compiled with Numba.

class BPETokeniser(vocab_size)

Bases: object

A simple byte pair encoding tokeniser.

This class implements a BPE tokenizer with performance optimizations using Numba. It can be trained on text data and used to tokenize and detokenize text.

Parameters:

vocab_size (int) – The maximum size of the vocabulary.

vocab_size

The maximum size of the vocabulary.

merges

A dictionary mapping token pairs to new token IDs.

pairs

A list of token pairs in the order they were merged.

vocab

A list of byte strings representing each token.

type_

A string indicating the implementation type (always “numba” in this version).

MIN_TOKENS = 256

decode(tokens)

Convert tokens into a string.

Parameters:

tokens (ndarray | int) – A numpy array of token IDs or a single integer token ID.

Returns:

The decoded string.

Return type:

str
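
To illustrate how a vocab of byte strings can turn token IDs back into text, here is a minimal sketch. It is not the module's source: the helpers `build_vocab` and `decode_tokens` are hypothetical, and it assumes that IDs below `MIN_TOKENS` are the raw byte values, that each merged pair receives the next free ID, and that the text is UTF-8.

```python
import numpy as np

MIN_TOKENS = 256  # assumed: IDs 0..255 stand for the raw byte values


def build_vocab(pairs):
    """Hypothetical helper: expand merged pairs into byte strings."""
    vocab = [bytes([i]) for i in range(MIN_TOKENS)]
    for left, right in pairs:
        # Assumed: each merge creates the next token ID, in merge order.
        vocab.append(vocab[left] + vocab[right])
    return vocab


def decode_tokens(tokens, vocab):
    """Hypothetical helper: join token byte strings and decode as UTF-8."""
    ids = np.atleast_1d(np.asarray(tokens))  # accept an array or a single int
    raw = b"".join(vocab[int(t)] for t in ids)
    # errors="replace" avoids an exception on an incomplete multi-byte sequence.
    return raw.decode("utf-8", errors="replace")


pairs = [(104, 105)]  # "h" + "i" becomes token 256
vocab = build_vocab(pairs)
print(decode_tokens(np.array([256, 33]), vocab))  # hi!
print(decode_tokens(256, vocab))                  # hi
```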

encode(text)

Tokenize a string.

Parameters:

text (str) – A string to tokenize.

Returns:

A numpy array of token IDs.

Return type:

ndarray
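
Before any merges can be applied, the string has to become an array of integers. The sketch below shows one plausible first step, assuming the text is encoded as UTF-8 bytes; the helper `text_to_ints` and the choice of `int32` are illustrative assumptions, not taken from the module.

```python
import numpy as np


def text_to_ints(text):
    """Hypothetical helper: view the UTF-8 bytes of text as an integer array."""
    raw = text.encode("utf-8")
    # Widen from uint8 so that merged token IDs of 256 and above fit later.
    return np.frombuffer(raw, dtype=np.uint8).astype(np.int32)


ints = text_to_ints("héllo")
print(ints)  # [104 195 169 108 108 111]
```

Note that "é" occupies two bytes in UTF-8, so the array is one element longer than the string.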

classmethod load(path)

Load a tokeniser from a file.

Parameters:

path (str | Path) – A string or Path object representing the file path to load the tokeniser from.

Returns:

A BPETokeniser instance loaded from the file.

Return type:

BPETokeniser

most_common_pair(counts, token_id)

Find the most common pair of tokens in the given counts array.

Parameters:
  • counts (ndarray) – A numpy array containing the counts of each possible pair.

  • token_id (int) – The maximum token ID to consider.

Returns:

A tuple of two integers representing the most common pair, or None if no repeated pairs exist.

Return type:

tuple[int, int] | None
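
The selection step can be sketched as follows. This is an illustration only: it assumes `counts` is a 2-D array in which `counts[a, b]` holds the number of occurrences of the pair `(a, b)`, and that `token_id` is an exclusive upper bound on the IDs considered. Neither detail is confirmed by the documentation above, and the function name is hypothetical.

```python
import numpy as np


def most_common_pair_sketch(counts, token_id):
    """Hypothetical: counts[a, b] holds the occurrences of the pair (a, b)."""
    window = counts[:token_id, :token_id]  # ignore IDs that are not in use yet
    flat_index = int(np.argmax(window))
    left, right = np.unravel_index(flat_index, window.shape)
    if window[left, right] < 2:
        return None  # no pair occurs more than once
    return int(left), int(right)


counts = np.zeros((4, 4), dtype=np.int64)
counts[1, 2] = 3
counts[2, 3] = 1
print(most_common_pair_sketch(counts, 4))  # (1, 2)
print(most_common_pair_sketch(counts, 2))  # None
```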

replace_pair(data, pair, token_id)

Replace occurrences of a pair with a new token ID.

This method is a wrapper around the module-level replace_pair function.

Parameters:
  • data (ndarray) – A numpy array of integers representing the input data.

  • pair (tuple[int, int]) – A tuple of two integers representing the pair to be replaced.

  • token_id (int) – An integer representing the new token ID to replace the pair.

Returns:

A numpy array with the pair replacements applied.

Return type:

ndarray

save(path)

Save the tokeniser to a file.

Parameters:

path (str | Path) – A string or Path object representing the file path to save the tokeniser.

tokenise_ints(int_array, loading_bar=False)

Tokenize an array of integers.

Parameters:
  • int_array (ndarray) – A numpy array of integers to tokenize.

  • loading_bar (bool) – Whether to display a progress bar during tokenization.

Returns:

A numpy array of tokenized integers.

Return type:

ndarray

train(text)

Train the tokeniser on a string.

Parameters:

text (str) – A string to train the tokeniser on.

Returns:

The trained BPETokeniser instance.

Return type:

BPETokeniser

train_ints(int_array, loading_bar=False)

Train the tokeniser on an array of integers.

Parameters:
  • int_array (ndarray) – A numpy array of integers representing the training data.

  • loading_bar (bool) – Whether to display a progress bar during training.

Returns:

The trained BPETokeniser instance.

Warns:

If the number of pairs after training is less than the specified vocab_size.
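
For orientation, a generic BPE training loop is sketched below in plain Python. It is not the module's Numba implementation: `merge` and `train_sketch` are hypothetical names, and the sketch assumes that new token IDs start at `MIN_TOKENS` and that training stops early once no pair occurs at least twice, which is one way a vocabulary can end up smaller than `vocab_size`.

```python
from collections import Counter

MIN_TOKENS = 256  # assumed: IDs 0..255 are the raw byte values


def merge(ids, pair, new_id):
    """Replace non-overlapping occurrences of pair, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_sketch(ids, vocab_size):
    """Hypothetical training loop; returns the merged pairs and the final IDs."""
    ids = list(ids)
    pairs = []
    while MIN_TOKENS + len(pairs) < vocab_size:
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # fewer than two tokens left
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break  # no repeated pair left, so the vocabulary stays short
        ids = merge(ids, pair, MIN_TOKENS + len(pairs))
        pairs.append(pair)
    return pairs, ids


pairs, ids = train_sketch(list(b"abababab"), vocab_size=258)
print(pairs)  # [(97, 98), (256, 256)]
print(ids)    # [257, 257]
```

With the input `b"abc"` no pair repeats, so the loop exits immediately and returns an empty list of pairs.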

count_pairs(data, token_id)

Count the number of occurrences of each pair of integers in an array.

Parameters:
  • data (ndarray) – A numpy array of integers representing the input data.

  • token_id (int) – The maximum token ID to consider when counting pairs.

Returns:

A numpy array containing the counts of each possible pair.

Return type:

ndarray

Note

Parallel execution was tried but found to be slower than sequential execution.
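
A plain NumPy sketch of pair counting is shown below for illustration. The layout of the returned array is an assumption (a 2-D array where `counts[a, b]` is the number of times `b` directly follows `a`), `token_id` is treated as an exclusive upper bound on the IDs present, and the function name is hypothetical. The module's own Numba version may differ.

```python
import numpy as np


def count_pairs_sketch(data, token_id):
    """Hypothetical: return counts[a, b] for every adjacent pair (a, b)."""
    counts = np.zeros((token_id, token_id), dtype=np.int64)
    # np.add.at accumulates correctly when the same index pair repeats,
    # which plain fancy-index assignment would not do.
    np.add.at(counts, (data[:-1], data[1:]), 1)
    return counts


data = np.array([0, 1, 0, 1, 2])
counts = count_pairs_sketch(data, 3)
print(counts[0, 1])  # 2
print(counts[1, 0])  # 1
```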

replace_pair(data, pair, token_id)

Replace every occurrence of pair with token_id in the given data array.

Parameters:
  • data (ndarray) – A numpy array of integers representing the input data.

  • pair (tuple[int, int]) – A tuple of two integers representing the pair to be replaced.

  • token_id (int) – An integer representing the new token ID to replace the pair.

Returns:

A numpy array with the pair replacements applied.

Return type:

ndarray
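
The replacement step can be illustrated with a simple left-to-right scan. This sketch assumes that matches do not overlap (so three consecutive identical tokens yield one merged token followed by one original token); the function name is hypothetical and the module's compiled version may be implemented differently.

```python
import numpy as np


def replace_pair_sketch(data, pair, token_id):
    """Hypothetical: replace each non-overlapping occurrence of pair."""
    left, right = pair
    out = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == left and data[i + 1] == right:
            out.append(token_id)  # emit the merged token and skip both inputs
            i += 2
        else:
            out.append(int(data[i]))
            i += 1
    return np.array(out, dtype=data.dtype)


data = np.array([1, 2, 1, 2, 3])
print(replace_pair_sketch(data, (1, 2), 256))  # [256 256   3]
```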