`fundaml.tokenizers`

Module Contents

HFTokenizer: A class to wrap the Hugging Face tokenizer for easy use.

class fundaml.tokenizers.HFTokenizer(hf_checkpoint_name='distilbert-base-cased')

A class to wrap the Hugging Face tokenizer for easy use.

encode(sentences, padding=True, truncation=True, max_length=512, add_special_tokens=True): Encodes a given list of sentences.
get_vocab_size(): Returns the vocabulary size of the tokenizer.
decode(encoded, skip_special_tokens=True): Decodes a given encoded input.

encode(sentences, padding=True, truncation=True, max_length=512, add_special_tokens=True)

Encodes a given list of sentences.

Parameters

sentences (str or List[str]): The sentences to be encoded.
padding (bool, optional): Whether to pad the sentences. Default is True.
truncation (bool, optional): Whether to truncate the sentences. Default is True.
max_length (int, optional): The maximum length for the sentences. Default is 512.
add_special_tokens (bool, optional): Whether to add special tokens. Default is True.

Returns

List[int] or List[List[int]]: The encoded sentences. Returns a list of integers if only one sentence is given.
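
The parameters above can be illustrated with a small stand-in. This is a minimal sketch, not the fundaml implementation: `toy_encode` and `VOCAB` are hypothetical names, the tokenizer is a plain whitespace splitter, and the sketch assumes that padding=True pads to the longest sentence in the batch and that truncation leaves room for the special tokens, as the underlying Hugging Face tokenizers usually do. Real token ids depend on the checkpoint.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; a real checkpoint ships its own.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hello": 4, "world": 5, "tokenizers": 6,
}


def toy_encode(
    sentences: Union[str, List[str]],
    padding: bool = True,
    truncation: bool = True,
    max_length: int = 512,
    add_special_tokens: bool = True,
) -> Union[List[int], List[List[int]]]:
    """Toy stand-in that mimics the documented encode() signature."""
    single = isinstance(sentences, str)
    batch = [sentences] if single else list(sentences)

    encoded: List[List[int]] = []
    for sentence in batch:
        # Whitespace tokenization; unknown words map to [UNK].
        ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in sentence.lower().split()]
        if truncation:
            # Assumption: reserve two slots for [CLS] and [SEP] when they are added.
            budget = max_length - 2 if add_special_tokens else max_length
            ids = ids[: max(budget, 0)]
        if add_special_tokens:
            ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]
        encoded.append(ids)

    if padding:
        # Assumption: pad every sentence to the longest one in the batch.
        longest = max((len(ids) for ids in encoded), default=0)
        encoded = [ids + [VOCAB["[PAD]"]] * (longest - len(ids)) for ids in encoded]

    # A single input sentence yields a flat list, as the Returns entry describes.
    return encoded[0] if single else encoded


single = toy_encode("hello world")
batch = toy_encode(["hello world", "hello"])
truncated = toy_encode("hello world tokenizers", max_length=4)
```

Here `single` is a flat list, `batch` is a list of equal-length lists with the shorter sentence padded, and `truncated` drops the last word so the result fits in four ids.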

get_vocab_size()

Returns the vocabulary size of the tokenizer.

decode(encoded, skip_special_tokens=True)

Decodes a given encoded input.

Parameters

encoded (List[int]): The encoded input to be decoded.
skip_special_tokens (bool, optional): Whether to skip special tokens during decoding. Default is True.
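
The effect of skip_special_tokens can be sketched the same way. Again this is a toy illustration rather than the library's code: `toy_decode`, `VOCAB`, and `SPECIAL_TOKENS` are hypothetical, and the sketch assumes that the padding, unknown, and sentence-boundary markers all count as special tokens.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; a real checkpoint ships its own.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "hello": 4, "world": 5,
}
ID_TO_TOKEN: Dict[int, str] = {idx: token for token, idx in VOCAB.items()}
# Assumption: these four markers are the tokenizer's special tokens.
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]"}


def toy_decode(encoded: List[int], skip_special_tokens: bool = True) -> str:
    """Toy stand-in that mimics the documented decode() signature."""
    tokens = [ID_TO_TOKEN[idx] for idx in encoded]
    if skip_special_tokens:
        # Drop padding and boundary markers so only the text remains.
        tokens = [token for token in tokens if token not in SPECIAL_TOKENS]
    return " ".join(tokens)


clean = toy_decode([2, 4, 5, 3, 0])
raw = toy_decode([2, 4, 5, 3, 0], skip_special_tokens=False)
```

With the default, `clean` contains only the words; with skip_special_tokens=False, `raw` keeps the markers, which can help when debugging padding or truncation.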