fundaml.tokenizers
Module Contents
Classes
- HFTokenizer: A class to wrap the Hugging Face tokenizer for easy use.
- class fundaml.tokenizers.HFTokenizer(hf_checkpoint_name='distilbert-base-cased')
A class to wrap the Hugging Face tokenizer for easy use.
Attributes:
- checkpoint : str
The name of the Hugging Face checkpoint.
- tokenizer : transformers.AutoTokenizer
The Hugging Face tokenizer.
Methods:
- encode(sentences, padding=True, truncation=True, max_length=512, add_special_tokens=True)
Encodes a given list of sentences.
- get_vocab_size()
Returns the vocabulary size of the tokenizer.
- decode(encoded, skip_special_tokens=True)
Decodes a given encoded input.
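The following is a minimal sketch of how a wrapper with these attributes and methods could be built on `transformers.AutoTokenizer`. It is not the fundaml source. The class name `HFTokenizerSketch` and the optional `tokenizer` argument are assumptions added for illustration (the documented constructor takes only `hf_checkpoint_name`), and handling a single sequence in `decode` is likewise assumed.

```python
from typing import List, Union


class HFTokenizerSketch:
    """Illustrative wrapper around a Hugging Face tokenizer (not the fundaml code)."""

    def __init__(self, hf_checkpoint_name: str = "distilbert-base-cased", tokenizer=None):
        self.checkpoint = hf_checkpoint_name
        if tokenizer is None:
            # Imported lazily so the class can also be exercised with a stub.
            # `tokenizer` is a hypothetical argument, not part of the documented API.
            from transformers import AutoTokenizer

            tokenizer = AutoTokenizer.from_pretrained(hf_checkpoint_name)
        self.tokenizer = tokenizer

    def encode(
        self,
        sentences: Union[str, List[str]],
        padding: bool = True,
        truncation: bool = True,
        max_length: int = 512,
        add_special_tokens: bool = True,
    ):
        # Calling the tokenizer returns a mapping; "input_ids" holds the token IDs.
        encoded = self.tokenizer(
            sentences,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            add_special_tokens=add_special_tokens,
        )
        return encoded["input_ids"]

    def get_vocab_size(self) -> int:
        # `vocab_size` is a property of Hugging Face tokenizers.
        return self.tokenizer.vocab_size

    def decode(self, encoded: List[int], skip_special_tokens: bool = True) -> str:
        # Assumption: one sequence of token IDs is decoded at a time.
        return self.tokenizer.decode(encoded, skip_special_tokens=skip_special_tokens)
```

With `transformers` installed and the checkpoint available, `HFTokenizerSketch().encode("Hello world")` would return a flat list of token IDs, and passing a list of sentences would return one list per sentence.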
- encode(sentences, padding=True, truncation=True, max_length=512, add_special_tokens=True)
Encodes a given list of sentences.
Parameters:
- sentences : str or List[str]
The sentences to be encoded.
- padding : bool, optional
Whether to pad the sentences. Default is True.
- truncation : bool, optional
Whether to truncate the sentences. Default is True.
- max_length : int, optional
The maximum length for the sentences. Default is 512.
- add_special_tokens : bool, optional
Whether to add special tokens. Default is True.
Returns:
- List[int] or List[List[int]]
The encoded sentences as token IDs. A single sentence yields a flat list of integers; a list of sentences yields one list of integers per sentence.
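To make the `padding`, `truncation`, and `max_length` parameters concrete, here is a pure-Python sketch of their effect on lists of token IDs. It only approximates the behavior: the real work happens inside the Hugging Face tokenizer, which also keeps special tokens in place when truncating. The helper name `pad_and_truncate` and the padding ID `pad_id=0` are assumptions for illustration.

```python
from typing import List


def pad_and_truncate(batch: List[List[int]], max_length: int = 512, pad_id: int = 0) -> List[List[int]]:
    """Cut each sequence to max_length, then right-pad to the longest one left."""
    # truncation=True: no sequence may exceed max_length.
    truncated = [ids[:max_length] for ids in batch]
    # padding=True: pad to the longest sequence in the batch.
    longest = max((len(ids) for ids in truncated), default=0)
    return [ids + [pad_id] * (longest - len(ids)) for ids in truncated]


batch = [[5, 6, 7, 8, 9], [5, 6, 7]]
print(pad_and_truncate(batch, max_length=4))
# [[5, 6, 7, 8], [5, 6, 7, 0]]
```

After this step every sequence in the batch has the same length, which is what allows the IDs to be stacked into a single tensor.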