tokeniser
A module for implementing a Byte Pair Encoding (BPE) tokenizer.
This module provides functionality for training and using a BPE tokenizer, which is a subword tokenization algorithm commonly used in natural language processing.
The implementation uses Numba for performance optimization of critical functions.
- class BPETokeniser(vocab_size)
Bases: object
A simple byte pair encoding tokeniser.
This class implements a BPE tokenizer with performance optimizations using Numba. It can be trained on text data and used to tokenize and detokenize text.
- Parameters:
vocab_size (int)
- vocab_size
The maximum size of the vocabulary.
- merges
A dictionary mapping token pairs to new token IDs.
- pairs
A list of token pairs in the order they were merged.
- vocab
A list of byte strings representing each token.
- type_
A string indicating the implementation type (always “numba” in this version).
- MIN_TOKENS = 256
- decode(tokens)
Convert tokens into a string.
- Parameters:
tokens (ndarray | int) – A numpy array of token IDs or a single integer token ID.
- Returns:
The decoded string.
- Return type:
str
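Since vocab is described as a list of byte strings, one per token, decoding can be pictured as a lookup followed by a join. The sketch below is an illustration under assumptions, not the library's code: decode_sketch, the example vocabulary, and the choice of errors="replace" are all hypothetical, and a plain Python list stands in for the numpy array.

```python
def decode_sketch(tokens, vocab):
    """Join the byte string of each token ID and decode the result as UTF-8."""
    if isinstance(tokens, int):
        tokens = [tokens]  # accept a single token ID as well as a sequence
    raw = b"".join(vocab[t] for t in tokens)
    # Assumption: errors="replace" is used because a lone token may hold only
    # part of a multi-byte character, which strict decoding would reject.
    return raw.decode("utf-8", errors="replace")


# The first 256 entries are the single bytes; merged tokens are appended.
vocab = [bytes([i]) for i in range(256)]
vocab.append(b"th")   # hypothetical merged token with ID 256
vocab.append(b"the")  # hypothetical merged token with ID 257

decoded = decode_sketch([257, 32, 256, 105, 110, 103], vocab)
single = decode_sketch(256, vocab)
```

Here decoded is "the thing" and single is "th".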
- encode(text)
Tokenize a string.
- Parameters:
text (str) – A string to tokenize.
- Returns:
A numpy array of token IDs.
- Return type:
ndarray
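Encoding with a trained BPE vocabulary is commonly done by converting the text to UTF-8 bytes and replaying the learned merges in order. The following is a minimal sketch of that idea, assuming the i-th entry of pairs was assigned the ID MIN_TOKENS + i; the function name encode_sketch and the example pairs are hypothetical, and the real method returns a numpy array rather than a list.

```python
def encode_sketch(text, pairs, first_new_id=256):
    """Turn text into UTF-8 byte values, then apply each merge in order."""
    tokens = list(text.encode("utf-8"))
    for offset, (left, right) in enumerate(pairs):
        new_id = first_new_id + offset  # assumed ID scheme, see lead-in
        merged = []
        i = 0
        while i < len(tokens):
            # Replace a matching adjacent pair with the merged token ID.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merges: "t" + "h" -> 256, then 256 + "e" -> 257.
pairs = [(116, 104), (256, 101)]
encoded = encode_sketch("the thing", pairs)
```

With these two merges, "the thing" becomes [257, 32, 256, 105, 110, 103].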
- classmethod load(path)
Load a tokeniser from a file.
- Parameters:
path (str | Path) – A string or Path object representing the file path to load the tokeniser from.
- Returns:
A BPETokeniser instance loaded from the file.
- most_common_pair(counts, token_id)
Find the most common pair of tokens in the given counts array.
- Parameters:
counts (ndarray) – A numpy array containing the counts of each possible pair.
token_id (int) – The maximum token ID to consider.
- Returns:
A tuple of two integers representing the most common pair, or None if no repeated pairs exist.
- Return type:
tuple[int, int] | None
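The selection step can be illustrated as a search for the largest entry in a table of pair counts. This sketch assumes counts is a square table indexed as counts[left][right] and that token_id is an exclusive upper bound on the IDs considered; neither detail is confirmed by the documentation above, and most_common_pair_sketch is a hypothetical name.

```python
def most_common_pair_sketch(counts, token_id):
    """Return the pair with the highest count, or None if no pair repeats."""
    best_pair = None
    best_count = 1  # a pair must occur at least twice to count as repeated
    for left in range(token_id):
        for right in range(token_id):
            if counts[left][right] > best_count:
                best_count = counts[left][right]
                best_pair = (left, right)
    return best_pair


counts = [[0] * 4 for _ in range(4)]
counts[1][2] = 3  # the pair (1, 2) was seen three times
counts[2][3] = 1  # the pair (2, 3) was seen once

best = most_common_pair_sketch(counts, 4)
none_found = most_common_pair_sketch([[0, 1], [1, 0]], 2)
```

In this example best is (1, 2), and none_found is None because no pair occurs more than once.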
- replace_pair(data, pair, token_id)
Replace occurrences of a pair with a new token ID.
This method is a wrapper around the module-level replace_pair function.
- Parameters:
data (ndarray) – A numpy array of integers representing the input data.
pair (tuple[int, int]) – A tuple of two integers representing the pair to be replaced.
token_id (int) – An integer representing the new token ID to replace the pair.
- Returns:
A numpy array with the pair replacements applied.
- Return type:
ndarray
- save(path)
Save the tokeniser to a file.
- Parameters:
path (str | Path) – A string or Path object representing the file path to save the tokeniser.
- tokenise_ints(int_array, loading_bar=False)
Tokenize an array of integers.
- Parameters:
int_array (ndarray) – A numpy array of integers to tokenize.
loading_bar (bool) – Whether to display a progress bar during tokenization.
- Returns:
A numpy array of tokenized integers.
- Return type:
ndarray
- train(text)
Train the tokeniser on a string.
- Parameters:
text (str) – A string to train the tokeniser on.
- Returns:
The trained BPETokeniser instance.
- train_ints(int_array, loading_bar=False)
Train the tokeniser on an array of integers.
- Parameters:
int_array (ndarray) – A numpy array of integers representing the training data.
loading_bar (bool) – Whether to display a progress bar during training.
- Returns:
The trained BPETokeniser instance.
- Warns:
If the number of pairs after training is less than the specified vocab_size.
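For orientation, a generic BPE training loop repeatedly counts adjacent pairs, merges the most frequent one into a new token, and stops when the vocabulary is full or no pair repeats. The sketch below follows that textbook outline in pure Python; it is not the Numba implementation, train_sketch is a hypothetical name, and it uses collections.Counter instead of a counts array.

```python
from collections import Counter

MIN_TOKENS = 256  # one starting token per byte value


def train_sketch(data, vocab_size):
    """Learn merges from a sequence of byte values (0 to 255)."""
    tokens = list(data)
    vocab = [bytes([i]) for i in range(MIN_TOKENS)]
    pairs = []   # merged pairs, in merge order
    merges = {}  # pair -> new token ID
    while len(vocab) < vocab_size:
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break  # no repeated pair left, so the vocabulary stays short
        new_id = len(vocab)
        merges[pair] = new_id
        pairs.append(pair)
        vocab.append(vocab[pair[0]] + vocab[pair[1]])
        # Rewrite the token sequence with the new token in place of the pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, pairs, vocab


merges, pairs, vocab = train_sketch(b"abababab", 258)
```

On this input the loop first merges (97, 98) into token 256 (b"ab"), then (256, 256) into token 257 (b"abab"). The early exit when no pair repeats is the kind of situation in which training can finish with fewer tokens than vocab_size.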
- count_pairs(data, token_id)
Count the number of occurrences of each pair of integers in an array.
- Parameters:
data (ndarray) – A numpy array of integers representing the input data.
token_id (int) – The maximum token ID to consider when counting pairs.
- Returns:
A numpy array containing the counts of each possible pair.
- Return type:
ndarray
Note: Parallel execution was tried but found to be slower than sequential execution.
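Pair counting can be pictured as a single sequential pass over adjacent elements. This sketch assumes the result is a token_id by token_id table indexed as counts[left][right] and that every value in data is below token_id; the real function returns a numpy array whose exact layout is not stated here, and count_pairs_sketch is a hypothetical name.

```python
def count_pairs_sketch(data, token_id):
    """Count each adjacent (left, right) pair in data."""
    # Assumed layout: a square table with one row and column per token ID.
    counts = [[0] * token_id for _ in range(token_id)]
    for left, right in zip(data, data[1:]):
        counts[left][right] += 1
    return counts


counts = count_pairs_sketch([1, 2, 1, 2, 3], 4)
```

For this input, counts[1][2] is 2, while counts[2][1] and counts[2][3] are each 1. Overlapping occurrences are all counted in this sketch, so [5, 5, 5] would record the pair (5, 5) twice.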
- replace_pair(data, pair, token_id)
Replace every occurrence of pair with token_id in the given data array.
- Parameters:
data (ndarray) – A numpy array of integers representing the input data.
pair (tuple[int, int]) – A tuple of two integers representing the pair to be replaced.
token_id (int) – An integer representing the new token ID to replace the pair.
- Returns:
A numpy array with the pair replacements applied.
- Return type:
ndarray
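The replacement step can be sketched as a left-to-right scan that emits token_id whenever the pair is found and copies every other element unchanged. This is a pure-Python illustration with a list standing in for the numpy array; replace_pair_sketch is a hypothetical name, and the left-to-right handling of overlapping matches is an assumption.

```python
def replace_pair_sketch(data, pair, token_id):
    """Return a new list with each occurrence of pair replaced by token_id."""
    left, right = pair
    out = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == left and data[i + 1] == right:
            out.append(token_id)
            i += 2  # skip both elements of the matched pair
        else:
            out.append(data[i])
            i += 1
    return out


replaced = replace_pair_sketch([1, 1, 1, 2], (1, 1), 9)
```

Scanning left to right, the first two 1s are merged and the third is left alone, so replaced is [9, 1, 2].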