shakespeare

Provides classes for handling Shakespeare datasets.

This module contains two main classes:

1. Shakespeare: For handling tokenized Shakespeare text using BPE tokenization.
2. ShakespeareChar: For handling character-level Shakespeare text.

Both classes provide methods for downloading, tokenizing, encoding, and decoding Shakespeare’s text.

Typical usage example:

    shakespeare = Shakespeare(1024)
    char_shakespeare = ShakespeareChar()

class Shakespeare(vocab_size, token_path=None, raw_data_path=PosixPath('datasets/shakespeare/raw_data.txt'), tokeniser_path=PosixPath('datasets/shakespeare/tokeniser.pkl'))

Bases: Sequence

A class for handling tokenized Shakespeare text using BPE tokenization.

This class downloads the Shakespeare dataset, tokenizes it using BPE, and provides methods for encoding and decoding text.

Parameters:

vocab_size (int)
token_path (Path)
raw_data_path (Path)
tokeniser_path (Path)

url

A string containing the URL for the Shakespeare dataset.

Type: str

vocab_size

An integer representing the size of the vocabulary.

Type: int

token_path

A Path object for the tokenized data file.

Type: pathlib.Path

raw_data_path

A Path object for the raw data file.

Type: pathlib.Path

tokens

A numpy array containing the tokenized data.

Type: numpy.ndarray

tokeniser

A BPETokeniser object for tokenization.

decode(*args)

Decodes the input using the BPE tokenizer.

Parameters: *args – Arguments to pass to the tokenizer’s decode method.
Returns: The decoded input.

download()

Downloads the Shakespeare dataset.

The downloaded data is saved to the path specified by raw_data_path.
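The download step can be sketched with the standard library alone. This is a minimal illustration, not the module's actual implementation: the helper name download_text is hypothetical, and skipping the fetch when the file already exists is an assumption about sensible caching rather than documented behaviour.

```python
from pathlib import Path
from urllib.request import urlopen


def download_text(url: str, dest: Path) -> Path:
    """Fetch `url` and write its bytes to `dest` (hypothetical helper).

    Assumption: an existing file at `dest` is treated as a cache hit
    and is not downloaded again.
    """
    dest = Path(dest)
    if dest.exists():
        return dest
    # Create parent directories such as datasets/shakespeare/ on first use.
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response:
        dest.write_bytes(response.read())
    return dest
```

Called with the class's url attribute and raw_data_path, a helper like this would leave the raw text at datasets/shakespeare/raw_data.txt.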

encode(*args)

Encodes the input using the BPE tokenizer.

Parameters: *args – Arguments to pass to the tokenizer’s encode method.
Returns: The encoded input.

generate()

Downloads and tokenizes the Shakespeare dataset.

Returns: A BPETokeniser object trained on the Shakespeare dataset.
Return type: BPETokeniser
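For readers unfamiliar with BPE, the following is a generic byte-level sketch of what training a tokenizer up to a given vocab_size involves. It is not the BPETokeniser implementation, which is not shown on this page; the function names train_bpe, bpe_encode, and bpe_decode are hypothetical, and starting from the 256 raw byte values is an assumption.

```python
from collections import Counter


def _merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, vocab_size: int) -> list[tuple[int, int]]:
    """Learn merges until the vocabulary (256 bytes + merges) hits vocab_size."""
    ids = list(text.encode("utf-8"))
    merges: list[tuple[int, int]] = []
    while 256 + len(merges) < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # assumption: stop once no pair repeats
            break
        ids = _merge(ids, pair, 256 + len(merges))
        merges.append(pair)
    return merges


def bpe_encode(text: str, merges: list[tuple[int, int]]) -> list[int]:
    """Apply the learned merges, in training order, to new text."""
    ids = list(text.encode("utf-8"))
    for k, pair in enumerate(merges):
        ids = _merge(ids, pair, 256 + k)
    return ids


def bpe_decode(ids: list[int], merges: list[tuple[int, int]]) -> str:
    """Expand token IDs back into bytes and decode them as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for k, (a, b) in enumerate(merges):
        table[256 + k] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8", errors="replace")
```

Under these assumptions, Shakespeare(1024) would correspond to learning 768 merges on top of the byte vocabulary.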

raw_data_path: Path

token_path: Path

tokens: ndarray

url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

vocab_size: int

class ShakespeareChar(raw_data_path=PosixPath('datasets/shakespeare/raw_data.txt'))

Bases: Sequence

A class for handling character-level Shakespeare text.

This class downloads the Shakespeare dataset and provides methods for encoding and decoding text at the character level.

Parameters: raw_data_path (Path)

url

A string containing the URL for the Shakespeare dataset.

Type: str

vocab_size

An integer representing the size of the vocabulary.

Type: int

raw_data_path

A Path object for the raw data file.

Type: pathlib.Path

chars

A list of integers representing the characters in the dataset.

Type: list[int]

chars: list[int]

decode(char_ids)

Decodes the input character IDs into characters.

Parameters: char_ids (list[int]) – A list of integer character IDs to decode.
Returns: A list of decoded characters.

download()

Downloads the Shakespeare dataset.

The downloaded data is saved to the path specified by raw_data_path.

encode(chars)

Encodes the input characters into character IDs.

Parameters: chars (list[int] | str) – A list of integers or a string to encode.
Returns: A list of integer character IDs.
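Character-level encoding and decoding amount to a lookup in both directions. The sketch below illustrates the idea under stated assumptions: the vocabulary is taken to be the sorted set of distinct characters in the text, and the names build_vocab, encode_chars, and decode_chars are hypothetical rather than part of this module.

```python
def build_vocab(text: str) -> list[str]:
    """Return the distinct characters of `text` in sorted order.

    Assumption: a character's ID is its position in this list.
    """
    return sorted(set(text))


def encode_chars(text: str, vocab: list[str]) -> list[int]:
    """Map each character to its integer ID."""
    char_to_id = {ch: i for i, ch in enumerate(vocab)}
    return [char_to_id[ch] for ch in text]


def decode_chars(char_ids: list[int], vocab: list[str]) -> list[str]:
    """Map each integer ID back to its character, returned as a list."""
    return [vocab[i] for i in char_ids]


vocab = build_vocab("hello")
ids = encode_chars("hello", vocab)
restored = "".join(decode_chars(ids, vocab))
```

In this sketch the vocabulary size is simply len(vocab), and joining the decoded list reproduces the original string.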

generate()

Downloads and processes the Shakespeare dataset.

Returns: A list of integers representing the characters in the dataset.
Return type: list[int]

raw_data_path: Path

url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

vocab_size: int