fundaml.tokenizers

Module Contents

Classes

HFTokenizer

A class to wrap the Hugging Face tokenizer for easy use.

class fundaml.tokenizers.HFTokenizer(hf_checkpoint_name='distilbert-base-cased')

A class to wrap the Hugging Face tokenizer for easy use.

Attributes:

checkpoint : str

The name of the Hugging Face checkpoint.

tokenizer : transformers.AutoTokenizer

The Hugging Face tokenizer.

Methods:

encode(sentences, padding=True, truncation=True, max_length=512, add_special_tokens=True)

Encodes a given list of sentences.

get_vocab_size()

Returns the vocabulary size of the tokenizer.

decode(encoded, skip_special_tokens=True)

Decodes a given encoded input.
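The following is a minimal sketch of how a wrapper with this interface could be built on top of transformers.AutoTokenizer. It is not the fundaml source: the class name SketchHFTokenizer is hypothetical, and the choice to return only the input_ids field and to report tokenizer.vocab_size are assumptions made for illustration.

```python
from typing import List, Union


class SketchHFTokenizer:
    """Illustrative stand-in for a Hugging Face tokenizer wrapper (hypothetical)."""

    def __init__(self, hf_checkpoint_name: str = "distilbert-base-cased") -> None:
        # Imported lazily so that defining the class does not require transformers.
        from transformers import AutoTokenizer

        self.checkpoint = hf_checkpoint_name
        self.tokenizer = AutoTokenizer.from_pretrained(hf_checkpoint_name)

    def encode(
        self,
        sentences: Union[str, List[str]],
        padding: bool = True,
        truncation: bool = True,
        max_length: int = 512,
        add_special_tokens: bool = True,
    ) -> Union[List[int], List[List[int]]]:
        # Calling the tokenizer returns a mapping; only the token IDs are kept
        # here (assumption: attention masks are not part of the return value).
        batch = self.tokenizer(
            sentences,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            add_special_tokens=add_special_tokens,
        )
        return batch["input_ids"]

    def get_vocab_size(self) -> int:
        # Assumption: the base vocabulary size, excluding tokens added later.
        return self.tokenizer.vocab_size

    def decode(self, encoded: List[int], skip_special_tokens: bool = True) -> str:
        return self.tokenizer.decode(encoded, skip_special_tokens=skip_special_tokens)
```

With transformers installed and the checkpoint available, SketchHFTokenizer().encode("Hello world") would return a flat list of token IDs, and passing that list to decode would return text again.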

encode(sentences, padding=True, truncation=True, max_length=512, add_special_tokens=True)

Encodes a given list of sentences.

Parameters:
sentences : str or List[str]

The sentences to be encoded.

padding : bool, optional

Whether to pad the sentences. Default is True.

truncation : bool, optional

Whether to truncate the sentences. Default is True.

max_length : int, optional

The maximum length of each encoded sentence, in tokens. Default is 512.

add_special_tokens : bool, optional

Whether to add special tokens. Default is True.

Returns:
List[int] or List[List[int]]

The token IDs of the encoded sentences. Returns a flat list of integers if a single sentence (str) is given, and one list of integers per sentence otherwise.
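The return-shape rule and the effect of each flag can be illustrated with a toy whitespace tokenizer. This is only a sketch under stated assumptions: ToyTokenizer, its vocabulary, and the special-token IDs are hypothetical and do not reproduce the behavior of any real checkpoint.

```python
from typing import Dict, List, Union

# Hypothetical special-token IDs for this toy example.
PAD, UNK, CLS, SEP = 0, 1, 2, 3


class ToyTokenizer:
    """Whitespace tokenizer that mimics the documented encode() contract."""

    def __init__(self, words: List[str]) -> None:
        # Regular words receive IDs after the four special tokens.
        self.vocab: Dict[str, int] = {w: i + 4 for i, w in enumerate(words)}

    def _encode_one(self, sentence: str, truncation: bool,
                    max_length: int, add_special_tokens: bool) -> List[int]:
        ids = [self.vocab.get(w, UNK) for w in sentence.split()]
        if truncation:
            # Leave room for CLS and SEP when they will be added.
            budget = max_length - (2 if add_special_tokens else 0)
            ids = ids[:max(budget, 0)]
        if add_special_tokens:
            ids = [CLS] + ids + [SEP]
        return ids

    def encode(self, sentences: Union[str, List[str]], padding: bool = True,
               truncation: bool = True, max_length: int = 512,
               add_special_tokens: bool = True) -> Union[List[int], List[List[int]]]:
        single = isinstance(sentences, str)
        batch = [sentences] if single else list(sentences)
        encoded = [self._encode_one(s, truncation, max_length, add_special_tokens)
                   for s in batch]
        if padding:
            # Pad every row to the length of the longest row in the batch.
            longest = max((len(ids) for ids in encoded), default=0)
            encoded = [ids + [PAD] * (longest - len(ids)) for ids in encoded]
        # A single string yields a flat list; a list yields a list of lists.
        return encoded[0] if single else encoded


tok = ToyTokenizer(["hello", "world", "again"])
single = tok.encode("hello world")                     # [2, 4, 5, 3]
batch = tok.encode(["hello world again", "hello"])     # second row is padded
short = tok.encode("hello world again", max_length=4)  # truncated to 4 IDs
```

In this toy, the shorter sentence in the batch is right-padded with PAD so that both rows have the same length, which is the usual reason to enable padding before stacking rows into a tensor.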

get_vocab_size()

Returns the vocabulary size of the tokenizer.

Returns:
int

The vocabulary size.
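A common use of the vocabulary size is to set the number of rows in an embedding table so that every token ID has a matching row. The sketch below uses plain Python lists; make_embedding_table, the value 100, and the token IDs are hypothetical placeholders rather than values taken from any checkpoint.

```python
import random
from typing import List


def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Build a vocab_size x dim table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


# Hypothetical value; in practice this would come from get_vocab_size().
vocab_size = 100
table = make_embedding_table(vocab_size, dim=8)

# Hypothetical token IDs, such as those returned by encode().
token_ids = [2, 4, 5, 3]

# Every valid token ID indexes one row of the table.
vectors = [table[i] for i in token_ids]
```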

decode(encoded, skip_special_tokens=True)

Decodes a given encoded input.

Parameters:
encoded : List[int]

The encoded input to be decoded.

skip_special_tokens : bool, optional

Whether to skip special tokens during decoding. Default is True.

Returns:
str

The decoded input.
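The effect of skip_special_tokens can be sketched with a toy lookup table. The table, the set of special IDs, and the function toy_decode are hypothetical; a real checkpoint defines its own vocabulary and may also merge subword pieces when decoding.

```python
from typing import Dict, List

# Hypothetical id-to-token table for illustration only.
ID_TO_TOKEN: Dict[int, str] = {
    0: "[PAD]", 1: "[UNK]", 2: "[CLS]", 3: "[SEP]", 4: "hello", 5: "world",
}
SPECIAL_IDS = {0, 1, 2, 3}


def toy_decode(encoded: List[int], skip_special_tokens: bool = True) -> str:
    """Map token IDs back to text, optionally dropping special tokens."""
    tokens = [
        ID_TO_TOKEN.get(i, "[UNK]")
        for i in encoded
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return " ".join(tokens)


clean = toy_decode([2, 4, 5, 3, 0])                         # "hello world"
raw = toy_decode([2, 4, 5, 3], skip_special_tokens=False)   # keeps [CLS] and [SEP]
```

Because decode is documented as taking a single List[int], the output of encode for a list of sentences would need to be decoded one row at a time.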