`fundaml.transformer_from_scratch`

Module Contents

Classes

`MultiHeadSelfAttentionBlock`	A multi-head self-attention module.
`TransformerBlock`	A Transformer Block class that defines a single block in a transformer model.
`Encoder`	The Encoder class for a Transformer model.
`DecoderBlock`	The DecoderBlock class that forms a part of the Decoder in a Transformer model.
`Decoder`	The Decoder class for a Transformer model.
`Transformer`	The Transformer model class that combines an Encoder and a Decoder.

Functions

get_device()

fundaml.transformer_from_scratch.get_device()

class fundaml.transformer_from_scratch.MultiHeadSelfAttentionBlock(embedding_size, num_heads, device=None)

Bases: torch.nn.Module

A multi-head self-attention module. The constructor arguments are described below.

Parameters:

embedding_size (int) – The size of the input embeddings. Must be divisible by num_heads, because each embedding is split into num_heads equal pieces during self-attention.
num_heads (int) – The number of attention heads. Each head computes its own attention scores for every token in the sequence, which lets the model focus on different parts of the sequence for each token.

Raises:

AssertionError – If embedding_size is not divisible by num_heads. The embedding size needs to be divisible by the number of heads to ensure even division of embeddings for multi-head attention.
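
The divisibility requirement can be illustrated without any tensor library. The following is a minimal sketch, not the module's actual code: `split_heads` is a hypothetical helper, and a plain Python list stands in for one embedding vector.

```python
def split_heads(embedding, num_heads):
    """Split one embedding vector into num_heads equal, contiguous pieces."""
    embedding_size = len(embedding)
    # Mirrors the documented constraint: the split must be even.
    assert embedding_size % num_heads == 0, "embedding_size must be divisible by num_heads"
    head_dim = embedding_size // num_heads
    return [embedding[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


# An embedding of size 8 split across 4 heads gives 4 pieces of size 2.
pieces = split_heads([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8], num_heads=4)
print(len(pieces), len(pieces[0]))
```

With `num_heads=3` the same call would raise `AssertionError`, because 8 is not divisible by 3.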

forward(values, keys, queries, mask)

Forward pass for the MultiHeadSelfAttentionBlock module.

Parameters:

values (torch.Tensor) – The values tensor of shape (N, value_len, embedding_size), where N is the batch size, value_len is the sequence length for the values.
keys (torch.Tensor) – The keys tensor of shape (N, key_len, embedding_size), where key_len is the sequence length for the keys.
queries (torch.Tensor) – The query tensor of shape (N, query_len, embedding_size), where query_len is the sequence length for the queries.
mask (torch.Tensor, optional) – A mask tensor of shape (N, 1, 1, query_len/key_len), where the values are either 1 (for positions to be attended to) or 0 (for positions to be masked). The mask is used to prevent attention to certain positions. Default is None.

Returns:

The output tensor of shape (N, query_len, embedding_size), where N is the batch size, query_len is the sequence length for the queries, and embedding_size is the dimension of the output embeddings. This tensor is the result of applying self-attention to the input.

Return type:

torch.Tensor

Note

The method first transforms the input tensors (values, keys, queries) using separate linear transformations. It then splits the transformed embeddings into multiple heads and computes the attention scores (queries * keys). If a mask is provided, it is applied to the scores. The scores are then normalized to produce the attention weights. Finally, the method computes the weighted sum of the values (attention * values), applies a final linear transformation, and returns the result.
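
The score, mask, normalize, and weighted-sum steps in the note can be sketched for a single head in plain Python. This is an illustrative sketch under assumptions, not the module's implementation: `attention_head` and `softmax` are hypothetical helpers, lists stand in for tensors, the linear transformations are omitted, and the normalization is assumed to be a softmax over scores scaled by the square root of the head dimension (the documentation only says the scores are "normalized").

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values, mask=None):
    """Attention for one head; each argument is a list of vectors (lists of floats).

    mask, if given, has one entry per key position: 1 = attend, 0 = masked.
    """
    head_dim = len(queries[0])
    output = []
    for q in queries:
        # Attention scores: dot product of the query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        if mask is not None:
            # Masked positions get a very negative score, so softmax gives them ~0 weight.
            scores = [s if m == 1 else -1e20 for s, m in zip(scores, mask)]
        # Assumption: scores are scaled by sqrt(head_dim) before the softmax.
        weights = softmax([s / math.sqrt(head_dim) for s in scores])
        # Weighted sum of the value vectors.
        output.append([sum(w * v[d] for w, v in zip(weights, values))
                       for d in range(len(values[0]))])
    return output


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention_head(q, k, v, mask=[1, 1, 0])  # the third position is masked
print(out)
```

Because the third position is masked, the output is a blend of the first two value vectors only.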

class fundaml.transformer_from_scratch.TransformerBlock(embed_size, num_heads, dropout_rate, forward_expansion, device=None)

Bases: torch.nn.Module

A Transformer Block class that defines a single block in a transformer model.

Parameters:

embed_size (int) – The dimensionality of the input embeddings.
num_heads (int) – The number of attention heads for the self-attention mechanism.
dropout_rate (float) – The dropout rate used in the dropout layers to prevent overfitting.
forward_expansion (int) – The factor by which the dimensionality of the input is expanded in the feed-forward network. The feed-forward network expands the dimensionality of the input from embed_size to forward_expansion * embed_size and then reduces it back to embed_size.

self_attention

The self-attention layer used in the transformer block.

Type: SelfAttention

norm1

The first layer normalization used to stabilize the outputs of the self-attention layer.

Type: nn.LayerNorm

norm2

The second layer normalization used to stabilize the outputs of the feed-forward network.

Type: nn.LayerNorm

feed_forward

The feed-forward network used to transform the output of the self-attention layer.

Type: nn.Sequential

dropout

The dropout layer used to prevent overfitting.

Type: nn.Dropout

The Transformer Block consists of a self-attention layer followed by normalization, a feed-forward network followed by normalization, and a dropout layer.

forward(values, keys, queries, mask)

Forward pass of the Transformer Block.

Parameters:

values (torch.Tensor) – The values used by the self-attention layer. They have shape (N, value_len, embed_size) where N is the batch size, value_len is the length of the value sequence, and embed_size is the size of the embeddings.
keys (torch.Tensor) – The keys used by the self-attention layer. They have shape (N, key_len, embed_size) where N is the batch size, key_len is the length of the key sequence, and embed_size is the size of the embeddings.
queries (torch.Tensor) – The queries used by the self-attention layer. They have shape (N, query_len, embed_size) where N is the batch size, query_len is the length of the query sequence, and embed_size is the size of the embeddings.
mask (torch.Tensor) – The mask applied to the attention scores to prevent the model from attending to certain positions. It has shape (N, 1, 1, src_len), where N is the batch size and src_len is the source sequence length.

Returns:

The output tensor from the transformer block, of shape (N, query_len, embed_size), where N is the batch size, query_len is the length of the query sequence, and embed_size is the size of the embeddings.

Return type:

out (torch.Tensor)

The forward method first applies the self-attention mechanism on the input tensor using the provided keys, queries, and values. The output from the self-attention layer is then passed through a normalization layer and a dropout layer. The output from these layers is then passed through the feed-forward network. The output from the feed-forward network is also passed through a normalization layer and a dropout layer. The final output is then returned.
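
The order of operations described above can be sketched for a single token in plain Python. This is a rough sketch under several assumptions, not the class's actual code: `layer_norm`, `feed_forward`, and `block_forward` are hypothetical helpers, the weights are made-up constants, the activation is assumed to be ReLU, dropout is treated as the identity (as at inference time), learnable scale and bias terms are left out, and any residual connections the real implementation may use are omitted because the description does not mention them.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one embedding vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def feed_forward(vector, w_expand, w_reduce):
    """Expand to a wider hidden layer, apply ReLU (assumed), then reduce back."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vector, row))) for row in w_expand]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_reduce]


def block_forward(attention_out, w_expand, w_reduce):
    """Documented order for one token: norm, dropout, feed-forward, norm, dropout."""
    x = layer_norm(attention_out)            # first normalization; dropout is a no-op here
    x = feed_forward(x, w_expand, w_reduce)  # embed_size -> hidden_size -> embed_size
    return layer_norm(x)                     # second normalization; dropout is a no-op here


embed_size, forward_expansion = 2, 2
hidden_size = forward_expansion * embed_size  # the hidden layer is 4 wide
# Hypothetical fixed weights: one row per output unit.
w_expand = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]  # hidden_size x embed_size
w_reduce = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]        # embed_size x hidden_size
out = block_forward([0.5, 1.5], w_expand, w_reduce)
print(out)
```

The output has the same length as the input vector, which matches the documented output shape (N, query_len, embed_size).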

class fundaml.transformer_from_scratch.Encoder(src_vocab_size, embed_size, num_layers, num_heads, forward_expansion, dropout_rate, max_length, device=None)

Bases: torch.nn.Module

The Encoder class for a Transformer model.

Parameters:

src_vocab_size (int) – The size of the source vocabulary.
embed_size (int) – The dimensionality of the input embeddings.
num_layers (int) – The number of layers in the transformer.
num_heads (int) – The number of attention heads in the transformer block.
forward_expansion (int) – The expansion factor for the feed-forward network in each transformer block.
dropout_rate (float) – The dropout rate used in the dropout layers to prevent overfitting.
max_length (int) – The maximum sequence length the model can handle.
device (torch.device) – The device to run the model on (CPU or GPU).

forward(x, mask)

Forward method for the Encoder class.

Parameters:

x (torch.Tensor) – The input tensor of shape (batch_size, seq_length).
mask (torch.Tensor) – The mask applied to the attention scores to prevent the model from attending to certain positions.

Returns:

The output tensor from the encoder.

Return type:

out (torch.Tensor)

class fundaml.transformer_from_scratch.DecoderBlock(embed_size, num_heads, forward_expansion, dropout_rate, device=None)

Bases: torch.nn.Module

The DecoderBlock class that forms a part of the Decoder in a Transformer model.

Parameters:

embed_size (int) – The dimensionality of the input embeddings.
num_heads (int) – The number of attention heads for the self-attention mechanism.
forward_expansion (int) – The factor by which the dimensionality of the input is expanded in the feed-forward network.
dropout_rate (float) – The dropout rate used in the dropout layers to prevent overfitting.
device (torch.device) – The device to run the model on (CPU or GPU).

norm

Layer normalization.

Type: nn.LayerNorm

attention

The self-attention mechanism.

Type: SelfAttention

transformer_block

A transformer block.

Type: TransformerBlock

dropout

Dropout layer for regularization.

Type: nn.Dropout

forward(x, value, key, src_mask, trg_mask)

Forward method for the DecoderBlock class.

Parameters:

x (torch.Tensor) – The input tensor.
value (torch.Tensor) – The values to be used in the self-attention mechanism.
key (torch.Tensor) – The keys to be used in the self-attention mechanism.
src_mask (torch.Tensor) – The source mask to prevent attention to certain positions.
trg_mask (torch.Tensor) – The target mask to prevent attention to certain positions.

Returns:

The output tensor from the transformer block.

Return type:

out (torch.Tensor)

class fundaml.transformer_from_scratch.Decoder(trg_vocab_size, embed_size, num_layers, num_heads, forward_expansion, dropout_rate, max_length, device=None)

Bases: torch.nn.Module

The Decoder class for a Transformer model.

Parameters:

trg_vocab_size (int) – The size of the target vocabulary.
embed_size (int) – The dimensionality of the input embeddings.
num_layers (int) – The number of layers in the transformer.
num_heads (int) – The number of attention heads in the transformer block.
forward_expansion (int) – The expansion factor for the feed forward network in transformer block.
dropout_rate (float) – The dropout rate used in the dropout layers to prevent overfitting.
device (torch.device) – The device to run the model on (CPU or GPU).
max_length (int) – The maximum sequence length the model can handle.

forward(x, enc_out, src_mask, trg_mask)

Forward method for the Decoder class.

Parameters:

x (torch.Tensor) – The input tensor of shape (batch_size, seq_length).
enc_out (torch.Tensor) – The output from the encoder.
src_mask (torch.Tensor) – The source mask to prevent the model from attending to certain positions.
trg_mask (torch.Tensor) – The target mask to prevent the model from attending to certain positions.

Returns:

The output tensor from the decoder.

Return type:

out (torch.Tensor)

class fundaml.transformer_from_scratch.Transformer(src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, embed_size=512, num_layers=6, forward_expansion=4, heads=8, dropout=0, max_length=100, device=None)

Bases: torch.nn.Module

The Transformer model class that combines an Encoder and a Decoder.

Parameters:

src_vocab_size (int) – The size of the source vocabulary.
trg_vocab_size (int) – The size of the target vocabulary.
src_pad_idx (int) – The index of the source padding token in the source vocabulary.
trg_pad_idx (int) – The index of the target padding token in the target vocabulary.
embed_size (int) – The dimensionality of the input embeddings.
num_layers (int) – The number of layers in the transformer.
forward_expansion (int) – The expansion factor for the feed forward network in transformer block.
heads (int) – The number of attention heads in the transformer block.
dropout (float) – The dropout rate used in the dropout layers to prevent overfitting.
device (torch.device) – The device to run the model on (CPU or GPU).
max_length (int) – The maximum sequence length the model can handle.

static make_src_mask(src, src_pad_idx, device=None)

Creates a mask for the source input sequence.

Parameters:

src (torch.Tensor) – The source input sequence.
src_pad_idx (int) – The index of the source padding token in the source vocabulary.

Returns:

The mask for the source input sequence.

Return type:

src_mask (torch.Tensor)
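
A source mask of this kind usually hides padding positions. The following is a minimal sketch of that idea, not the method's actual code: `padding_mask` is a hypothetical helper, nested Python lists stand in for tensors, and the output is assumed to be 1 for real tokens and 0 for padding, with the (N, 1, 1, src_len) shape documented for the `mask` argument of `TransformerBlock.forward`.

```python
def padding_mask(src, src_pad_idx):
    """Return a nested list of shape (N, 1, 1, src_len): 1 = real token, 0 = padding."""
    return [[[[1 if token != src_pad_idx else 0 for token in sequence]]]
            for sequence in src]


# Two sequences of token indices, padded to length 5 with index 0.
src = [[5, 7, 9, 0, 0],
       [3, 4, 0, 0, 0]]
mask = padding_mask(src, src_pad_idx=0)
print(mask)  # [[[[1, 1, 1, 0, 0]]], [[[1, 1, 0, 0, 0]]]]
```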

static make_trg_mask(trg, device=None)

Creates a mask for the target input sequence.

Parameters:

trg (torch.Tensor) – The target input sequence.

Returns:

The mask for the target input sequence.

Return type:

trg_mask (torch.Tensor)
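
The documentation does not state which pattern the target mask uses. Decoder masks are commonly causal (lower-triangular), so that position i can attend only to positions up to i. The sketch below assumes that pattern and an output shape of (N, 1, trg_len, trg_len). `causal_mask` is a hypothetical helper, and nested lists stand in for tensors.

```python
def causal_mask(trg):
    """Return a nested list of shape (N, 1, trg_len, trg_len) with a lower-triangular pattern."""
    trg_len = len(trg[0])
    # Row i has ones in columns 0..i and zeros afterwards.
    lower = [[1 if col <= row else 0 for col in range(trg_len)]
             for row in range(trg_len)]
    # One independent copy of the pattern per sequence in the batch.
    return [[[row[:] for row in lower]] for _ in trg]


trg = [[4, 8, 2]]
mask = causal_mask(trg)
print(mask)  # [[[[1, 0, 0], [1, 1, 0], [1, 1, 1]]]]
```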

forward(src, trg)

Forward method for the Transformer class.

Parameters:

src (torch.Tensor) – The source input sequence.
trg (torch.Tensor) – The target input sequence.

Returns:

The output tensor from the transformer.

Return type:

out (torch.Tensor)
