shakespeare

Provides classes for handling Shakespeare datasets.

This module contains two main classes:

  1. Shakespeare: For handling tokenized Shakespeare text using BPE tokenization.

  2. ShakespeareChar: For handling character-level Shakespeare text.

Both classes provide methods for downloading, tokenizing, encoding, and decoding Shakespeare’s text.

Typical usage example:

shakespeare = Shakespeare(1024)
char_shakespeare = ShakespeareChar()

class Shakespeare(vocab_size, token_path=None, raw_data_path=PosixPath('datasets/shakespeare/raw_data.txt'), tokeniser_path=PosixPath('datasets/shakespeare/tokeniser.pkl'))

Bases: Sequence

A class for handling tokenized Shakespeare text using BPE tokenization.

This class downloads the Shakespeare dataset, tokenizes it using BPE, and provides methods for encoding and decoding text.
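For readers unfamiliar with BPE, the following is a minimal sketch of byte-pair-encoding training. It is not the BPETokeniser implementation, whose internals are not shown on this page. The helper names (most_frequent_pair, merge_pair, train_bpe) are hypothetical, and starting from the 256 raw byte values is an assumption made for illustration.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` (hypothetical helper)."""
    ids = list(text.encode("utf-8"))  # assumption: start from the 256 raw byte values
    merges = {}
    for new_id in range(256, vocab_size):
        pair = most_frequent_pair(ids)
        if pair is None:  # nothing left to merge
            break
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return ids, merges
```

Under these assumptions, a vocab_size of 1024 would correspond to 768 learned merges on top of the 256 byte values. The real class may count its vocabulary differently.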

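Parameters:
  • vocab_size (int) – Size of the vocabulary.

  • token_path (Path | None) – Path to the tokenized data file. Defaults to None.

  • raw_data_path (Path) – Path to the raw data file. Defaults to datasets/shakespeare/raw_data.txt.

  • tokeniser_path (Path) – Path to the saved tokeniser file. Defaults to datasets/shakespeare/tokeniser.pkl.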

url

A string containing the URL for the Shakespeare dataset.

Type:

str

vocab_size

An integer representing the size of the vocabulary.

Type:

int

token_path

A Path object for the tokenized data file.

Type:

pathlib.Path

raw_data_path

A Path object for the raw data file.

Type:

pathlib.Path

tokens

A numpy array containing the tokenized data.

Type:

numpy.ndarray

tokeniser

A BPETokeniser object for tokenization.

Type:

BPETokeniser

decode(*args)

Decodes the input using the BPE tokenizer.

Parameters:

*args – Arguments to pass to the tokenizer’s decode method.

Returns:

The decoded input.

download()

Downloads the Shakespeare dataset.

The downloaded data is saved to the path specified by raw_data_path.
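A download step of this kind could be sketched with the standard library as follows. This is an illustration only, not the method's actual source: the standalone function form, the use of urlretrieve, and skipping the request when the file already exists are all assumptions.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the class's `url` attribute.
URL = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"


def download(url=URL, raw_data_path=Path("datasets/shakespeare/raw_data.txt")):
    """Fetch `url` into `raw_data_path` and return the path."""
    raw_data_path = Path(raw_data_path)
    if raw_data_path.exists():
        # Assumption: an existing file is reused rather than fetched again.
        return raw_data_path
    # Create datasets/shakespeare/ (or any other parent directory) if needed.
    raw_data_path.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, raw_data_path)
    return raw_data_path
```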

encode(*args)

Encodes the input using the BPE tokenizer.

Parameters:

*args – Arguments to pass to the tokenizer’s encode method.

Returns:

The encoded input.

generate()

Downloads and tokenizes the Shakespeare dataset.

Returns:

A BPETokeniser object trained on the Shakespeare dataset.

Return type:

BPETokeniser

raw_data_path: Path
token_path: Path
tokens: ndarray
url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
vocab_size: int
class ShakespeareChar(raw_data_path=PosixPath('datasets/shakespeare/raw_data.txt'))

Bases: Sequence

A class for handling character-level Shakespeare text.

This class downloads the Shakespeare dataset and provides methods for encoding and decoding text at the character level.

Parameters:

raw_data_path (Path) – Path to the raw data file. Defaults to datasets/shakespeare/raw_data.txt.

url

A string containing the URL for the Shakespeare dataset.

Type:

str

vocab_size

An integer representing the size of the vocabulary.

Type:

int

raw_data_path

A Path object for the raw data file.

Type:

pathlib.Path

chars

A list of integers representing the characters in the dataset.

Type:

list[int]

chars: list[int]
decode(char_ids)

Decodes the input character IDs into characters.

Parameters:

char_ids (list[int]) – A list of integer character IDs to decode.

Returns:

A list of decoded characters.

download()

Downloads the Shakespeare dataset.

The downloaded data is saved to the path specified by raw_data_path.

encode(chars)

Encodes the input characters into character IDs.

Parameters:

chars (list[int] | str) – A list of integers or a string to encode.

Returns:

A list of integer character IDs.
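Character-level encoding and decoding of this kind can be sketched as a pair of lookup tables. This is a standalone illustration, not the class's own code: the sample text, the sorted-vocabulary ordering, and the names char_to_id and id_to_char are assumptions.

```python
# Assumption: a short sample stands in for the downloaded dataset.
text = "to be or not to be"

# One ID per distinct character, assigned in sorted order.
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = dict(enumerate(vocab))


def encode(chars):
    """Map each character of a string to its integer ID."""
    return [char_to_id[ch] for ch in chars]


def decode(char_ids):
    """Map each integer ID back to its character."""
    return [id_to_char[i] for i in char_ids]


ids = encode("to be")
round_trip = "".join(decode(ids))  # "to be"
```

In this sketch the vocabulary size is simply len(vocab), the number of distinct characters in the text.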

generate()

Downloads and processes the Shakespeare dataset.

Returns:

A list of integers representing the characters in the dataset.

Return type:

list[int]

raw_data_path: Path
url: str = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
vocab_size: int