shakespeare
Provides classes for handling Shakespeare datasets.
This module contains two main classes:
1. Shakespeare: For handling tokenized Shakespeare text using BPE tokenization.
2. ShakespeareChar: For handling character-level Shakespeare text.
Both classes provide methods for downloading, tokenizing, encoding, and decoding Shakespeare’s text.
Typical usage example:
shakespeare = Shakespeare(1024)
char_shakespeare = ShakespeareChar()
- class Shakespeare(vocab_size, token_path=None, raw_data_path=PosixPath('datasets/shakespeare/raw_data.txt'), tokeniser_path=PosixPath('datasets/shakespeare/tokeniser.pkl'))
Bases:
Sequence
A class for handling tokenized Shakespeare text using BPE tokenization.
This class downloads the Shakespeare dataset, tokenizes it using BPE, and provides methods for encoding and decoding text.
- Parameters:
vocab_size (int)
token_path (Path)
raw_data_path (Path)
tokeniser_path (Path)
- url
A string containing the URL for the Shakespeare dataset.
- Type:
str
- vocab_size
An integer representing the size of the vocabulary.
- Type:
int
- token_path
A Path object for the tokenized data file.
- Type:
pathlib.Path
- raw_data_path
A Path object for the raw data file.
- Type:
pathlib.Path
- tokens
A numpy array containing the tokenized data.
- Type:
numpy.ndarray
- tokeniser
A BPETokeniser object for tokenization.
- decode(*args)
Decodes the input using the BPE tokenizer.
- Parameters:
*args – Arguments to pass to the tokenizer’s decode method.
- Returns:
The decoded input.
- download()
Downloads the Shakespeare dataset.
The downloaded data is saved to the path specified by raw_data_path.
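The download step can be sketched with the standard library alone. This is a minimal illustration rather than the class's actual implementation: the helper name `download_if_missing` is hypothetical, and skipping the request when the file already exists is an assumption, not documented behaviour.

```python
from pathlib import Path
from urllib.request import urlopen

# URL taken from the documented `url` class attribute.
URL = (
    "https://raw.githubusercontent.com/karpathy/char-rnn/"
    "master/data/tinyshakespeare/input.txt"
)


def download_if_missing(url: str, raw_data_path: Path) -> Path:
    """Save the contents of `url` to `raw_data_path` (hypothetical helper).

    Assumption: an existing file is reused instead of being downloaded again.
    """
    raw_data_path = Path(raw_data_path)
    if raw_data_path.exists():
        return raw_data_path
    # Create datasets/shakespeare/ (or any parent directory) on first use.
    raw_data_path.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        raw_data_path.write_bytes(response.read())
    return raw_data_path


if __name__ == "__main__":
    path = download_if_missing(URL, Path("datasets/shakespeare/raw_data.txt"))
    print(path, path.stat().st_size, "bytes")
```

Writing the response as bytes keeps the file identical to the upstream copy; decoding to text is left to whoever reads it.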
- encode(*args)
Encodes the input using the BPE tokenizer.
- Parameters:
*args – Arguments to pass to the tokenizer’s encode method.
- Returns:
The encoded input.
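To make the encode/decode round trip concrete, here is a toy byte-level BPE sketch. It does not reproduce the BPETokeniser class, whose interface is not documented here. The function names `train_bpe`, `encode`, and `decode` are hypothetical, and starting from the 256 byte values as the base vocabulary is an assumption.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn up to `vocab_size - 256` merges on top of the raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def encode(text, merges):
    """Apply the learned merges in training order (dicts keep insertion order)."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand each id back to its byte string and decode as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        table[new_id] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8")


corpus = "to be or not to be"
merges = train_bpe(corpus, vocab_size=260)
token_ids = encode("not to be", merges)
print(token_ids, "->", decode(token_ids, merges))
```

Under this scheme a vocab_size of 1024 would correspond to 768 learned merges; whether the real tokeniser counts its vocabulary the same way is not stated in this page.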
- generate()
Downloads and tokenizes the Shakespeare dataset.
- Returns:
A BPETokeniser object trained on the Shakespeare dataset.
- Return type:
BPETokeniser
- raw_data_path: Path
- token_path: Path
- tokens: ndarray
- url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
- vocab_size: int
- class ShakespeareChar(raw_data_path=PosixPath('datasets/shakespeare/raw_data.txt'))
Bases:
Sequence
A class for handling character-level Shakespeare text.
This class downloads the Shakespeare dataset and provides methods for encoding and decoding text at the character level.
- Parameters:
raw_data_path (Path)
- url
A string containing the URL for the Shakespeare dataset.
- Type:
str
- vocab_size
An integer representing the size of the vocabulary.
- Type:
int
- raw_data_path
A Path object for the raw data file.
- Type:
pathlib.Path
- chars
A list of integers representing the characters in the dataset.
- Type:
list[int]
- chars: list[int]
- decode(char_ids)
Decodes the input character IDs into characters.
- Parameters:
char_ids (list[int]) – A list of integer character IDs to decode.
- Returns:
A list of decoded characters.
- download()
Downloads the Shakespeare dataset.
The downloaded data is saved to the path specified by raw_data_path.
- encode(chars)
Encodes the input characters into character IDs.
- Parameters:
chars (list[int] | str) – A list of integers or a string to encode.
- Returns:
A list of integer character IDs.
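The character-level mapping can be illustrated with a few standalone functions. This is a sketch under stated assumptions, not the ShakespeareChar implementation: the helper `build_vocab` is hypothetical, and assigning IDs by sorting the unique characters is an assumed convention.

```python
def build_vocab(text: str) -> list[str]:
    """Return the sorted unique characters; a character's index is its ID (assumed)."""
    return sorted(set(text))


def encode(chars: str, vocab: list[str]) -> list[int]:
    """Map each character to its integer ID."""
    index = {ch: i for i, ch in enumerate(vocab)}
    return [index[ch] for ch in chars]


def decode(char_ids: list[int], vocab: list[str]) -> list[str]:
    """Map each integer ID back to its character, returned as a list."""
    return [vocab[i] for i in char_ids]


text = "To be, or not to be"
vocab = build_vocab(text)
ids = encode("to be", vocab)
print(len(vocab), ids, "".join(decode(ids, vocab)))
```

In this sketch the vocabulary size is simply the number of distinct characters in the text, and encoding a character that never appears in the dataset raises a KeyError.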
- generate()
Downloads and processes the Shakespeare dataset.
- Returns:
A list of integers representing the characters in the dataset.
- Return type:
list[int]
- raw_data_path: Path
- url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
- vocab_size: int