CSE425 Assignment - 20101257


COURSE: CSE425

TASK: ASSIGNMENT 1

SUBMITTED TO:

FACULTY NAME: MOIN MUSTAKIN

FACULTY INITIAL: MMM

SUBMITTED BY:

NAME: SUDIPTA NANDI SARNA

STUDENT ID: 20101257

SECTION: 01
Scenario:
You are working as a data scientist at a software development company. Your team has been
assigned a project to develop a language model that can generate coherent and contextually
relevant text. To accomplish this task, you decide to implement a Vanilla Recurrent Neural
Network (RNN) due to its ability to process sequential data. Your goal is to train the RNN model
on a large corpus of text and generate meaningful text based on the learned patterns.

1. Data Preparation:
The chosen dataset is the Star Wars movie script. The "character" and "dialogue" columns make it possible to build a character-centered language model that produces meaningful, contextually appropriate text based on the patterns learned from the script.
Dataset link: https://www.kaggle.com/datasets/xvivancos/star-wars-movie-scripts

The process begins by loading the dataset from a file and applying initial preprocessing in the form of word tokenization. The build_vocabulary function builds two dictionaries, word_to_int and int_to_word, which provide bidirectional word-to-integer mapping. With the vocabulary in place, the pseudocode transforms the dataset from text into sequences of integers: the text_to_sequences function converts a list of words into the corresponding list of integers using the word_to_int dictionary.

To train the Recurrent Neural Network (RNN) effectively, uniform sequence lengths are required. The pseudocode therefore computes the maximum sequence length in the dataset and pads every sequence with zeros to that length via the pad_sequences function.

The pseudocode then partitions the padded dataset into training and validation sets using several split ratios and computes the average number of lines in each set for use in the subsequent stages.

The code is given below:

dataset_path = "path_to_SW_EpisodeV.txt"  # Replace with the actual path to the dataset

dataset = load_dataset(dataset_path)

# Tokenize every dialogue line
def preprocess_text(text):
    tokens = tokenize_text(text)
    return tokens

preprocessed_dataset = []
for dialogue in dataset:
    character, text = dialogue['character'], dialogue['dialogue']
    preprocessed_text = preprocess_text(text)
    preprocessed_dialogue = {'character': character, 'dialogue': preprocessed_text}
    preprocessed_dataset.append(preprocessed_dialogue)

# Build bidirectional word <-> integer mappings
def build_vocabulary(dataset):
    all_tokens = [token for dialogue in dataset for token in dialogue['dialogue']]
    unique_tokens = list(set(all_tokens))
    word_to_int = {word: idx for idx, word in enumerate(unique_tokens)}
    int_to_word = {idx: word for word, idx in word_to_int.items()}
    return word_to_int, int_to_word

word_to_int, int_to_word = build_vocabulary(preprocessed_dataset)

# Map each tokenized dialogue to a sequence of integer IDs
def text_to_sequences(text, word_to_int):
    return [word_to_int[word] for word in text]

sequences_dataset = []
for dialogue in preprocessed_dataset:
    character, text = dialogue['character'], dialogue['dialogue']
    sequence = text_to_sequences(text, word_to_int)
    sequences_dialogue = {'character': character, 'sequence': sequence}
    sequences_dataset.append(sequences_dialogue)

# Pad every sequence with zeros up to the longest sequence in the dataset
def pad_sequences(sequence, max_length):
    return sequence + [0] * (max_length - len(sequence))

max_sequence_length = max(len(dialogue['sequence']) for dialogue in sequences_dataset)

padded_dataset = []
for dialogue in sequences_dataset:
    character, sequence = dialogue['character'], dialogue['sequence']
    padded_sequence = pad_sequences(sequence, max_sequence_length)
    padded_dialogue = {'character': character, 'padded_sequence': padded_sequence}
    padded_dataset.append(padded_dialogue)

# Split a dataset into training and validation portions at the given ratio
def train_validation_split(dataset, split_ratio):
    split_index = int(len(dataset) * split_ratio)
    return dataset[:split_index], dataset[split_index:]

split_ratios = [0.3, 0.4, 0.2]

train_datasets = []
validation_datasets = []
for split_ratio in split_ratios:
    train_data, validation_data = train_validation_split(padded_dataset, split_ratio)
    train_datasets.append(train_data)
    validation_datasets.append(validation_data)

# Average number of lines across the training splits and the validation splits
def calculate_average_split(train_datasets, validation_datasets):
    total_train_lines = sum(len(dataset) for dataset in train_datasets)
    total_validation_lines = sum(len(dataset) for dataset in validation_datasets)
    average_train_lines = total_train_lines / len(train_datasets)
    average_validation_lines = total_validation_lines / len(validation_datasets)
    return average_train_lines, average_validation_lines

average_train_lines, average_validation_lines = calculate_average_split(train_datasets, validation_datasets)

print("Average Train Lines:", average_train_lines)
print("Average Validation Lines:", average_validation_lines)

2. Implementing a Vanilla RNN:


We begin by importing the libraries required for the PyTorch implementation: torch for tensor operations, nn for defining neural-network layers, optim for the optimizer, and DataLoader for loading data in batches during training.

The code is given below:


# Import libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

We create a custom Dataset class called `SWDataset` to manage our collection of Star Wars dialogues; it lets individual dialogues be retrieved by index as padded sequences of token IDs. For training, we use DataLoader to load the data in batches, which speeds up training and makes better use of memory.
The code is given below:
class SWDataset(Dataset):
    def __init__(self, dialogues):
        self.dialogues = dialogues

    def __len__(self):
        return len(self.dialogues)

    def __getitem__(self, index):
        # Return the padded token-ID sequence for one dialogue as a tensor
        return torch.tensor(self.dialogues[index]['padded_sequence'], dtype=torch.long)

# Wrap one of the train/validation splits in the Dataset class
train_dataset = SWDataset(train_datasets[0])
validation_dataset = SWDataset(validation_datasets[0])
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=64, shuffle=False)

The RNN layer is defined using `nn.RNN`, a basic RNN cell that accepts an input sequence, processes it step by step, and maintains a hidden state. Because the model receives integer token IDs, the forward pass one-hot encodes them into vectors of size `input_size` before feeding them to the RNN layer.

The code is given below:


class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(VanillaRNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # One-hot encode integer token IDs so they match the RNN's expected input size
        x = nn.functional.one_hot(x, num_classes=self.input_size).float()
        # Initial hidden state of zeros: one layer, batch-first input
        h0 = torch.zeros(1, x.size(0), self.hidden_size).to(x.device)
        out, _ = self.rnn(x, h0)
        out = self.fc(out)
        return out

Prior to initiating the training phase, it is essential to establish specific hyperparameters. The
parameter `input_size` signifies the dimension of the input data. The `hidden_size` is
designated as 128, representing the quantity of hidden units within the RNN layer. Similarly, the
`output_size` corresponds to the vocabulary size, aligning with the objective of generating words
from the existing vocabulary.

For the optimization process during training, we opt for the Adam optimizer with a learning rate
of 0.001, which is responsible for fine-tuning the model's parameters. As for the loss function,
we select `nn.CrossEntropyLoss()` to evaluate the model's performance during training.
The code is given below:
vocab_size = len(word_to_int)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_size = vocab_size
hidden_size = 128
output_size = vocab_size
learning_rate = 0.001
num_epochs = 10

model = VanillaRNN(input_size, hidden_size, output_size).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

The training loop runs for the designated number of epochs. Within each epoch, we iterate over the training data in batches via the `train_loader`. To create targets, we shift the input data one step to the left, since the model's objective is to predict the next word from the preceding sequence.

We perform a forward pass through the model to obtain outputs. The loss is computed by comparing the predicted outputs with the targets using `nn.CrossEntropyLoss`. This loss is then back-propagated through the network, and the optimizer updates the model's parameters via `optimizer.step()`. After training, the model is saved so it can be reloaded for text generation.

The code is given below:


for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0

    for batch in train_loader:
        inputs = batch.to(device)
        # Targets are the inputs shifted one step to the left
        targets = inputs[:, 1:]

        optimizer.zero_grad()
        outputs = model(inputs)
        # Align the prediction at step t with the token at step t + 1
        loss = criterion(outputs[:, :-1].contiguous().view(-1, output_size), targets.contiguous().view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    average_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch + 1}/{num_epochs}], Average Loss: {average_loss:.4f}")

# Save the trained model so it can be reloaded later for text generation
torch.save(model, 'trained_model.pth')


For model assessment, we create a function named `calculate_perplexity`. Within this function,
the model is configured for evaluation mode through the use of `model.eval()`. Subsequently, we
proceed to iterate over the validation data employing the `validation_loader`, mirroring the
approach taken in the training loop.

During this process, we compute the loss and maintain a record of the cumulative loss value
along with the token count within the validation dataset. Perplexity serves as a metric for
gauging the model's predictive accuracy regarding the data. This metric is determined as the
exponential of the mean loss. A lower perplexity value signifies a higher level of performance
exhibited by the model.
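
In symbols, perplexity is the exponential of the mean cross-entropy loss over the validation tokens:

\text{Perplexity} = \exp\left( \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}^{(i)}_{\text{CE}} \right)

where N is the number of predicted tokens and \mathcal{L}^{(i)}_{\text{CE}} is the cross-entropy loss for the i-th token.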

The code is given below:


def calculate_perplexity(model, data_loader, criterion):
    model.eval()
    total_loss = 0.0
    total_tokens = 0

    with torch.no_grad():
        for batch in data_loader:
            inputs = batch.to(device)  # Move data to GPU if available
            targets = inputs[:, 1:]    # Shift targets one step to the left

            outputs = model(inputs)
            loss = criterion(outputs[:, :-1].contiguous().view(-1, output_size), targets.contiguous().view(-1))

            # Accumulate the loss weighted by the number of tokens in the batch
            total_loss += loss.item() * targets.numel()
            total_tokens += targets.numel()

    average_loss = total_loss / total_tokens
    # Perplexity is the exponential of the mean loss
    perplexity = torch.exp(torch.tensor(average_loss))
    return perplexity.item()

perplexity = calculate_perplexity(model, validation_loader, criterion)
print(f"Validation Perplexity: {perplexity:.2f}")

3. Text Generation:
Before generating text, we reload the Vanilla RNN model that was previously trained and saved to a checkpoint. Calling `model.eval()` puts the model in evaluation mode, disabling training-only behavior (such as dropout, where present) so that inference is consistent.

The code is given below:


# Load the previously trained model and switch to evaluation mode
model = torch.load('trained_model.pth')
model.eval()

In this stage, we define a function named `tokenize_input_prompt` that tokenizes the input prompt with the same method used for the training data in the data preparation phase and then converts the tokens into numerical form using the vocabulary. This step prepares the prompt to be fed into the model.

The code is given below:


def tokenize_input_prompt(input_prompt):
    # Tokenize the input prompt with the same tokenizer used for the training data
    tokens = tokenize_text(input_prompt)
    # Convert the tokenized prompt to its numerical representation using the vocabulary
    return [word_to_int[word] for word in tokens if word in word_to_int]

input_prompt = " …: "
tokenized_input = tokenize_input_prompt(input_prompt)

The `generate_text` function is responsible for generating text from the given input prompt. It takes the trained model, the tokenized input, and parameters such as `max_length` and `temperature`.

Throughout the text generation procedure, we follow an iterative approach to predict the
subsequent token using the ongoing input prompt. At each step, the current input is transformed
into a tensor and passed through the model to acquire the output. The level of text randomness
is regulated by the `temperature` parameter. Elevated values (e.g., above 1.0) result in more
diverse text, while lower values (e.g., below 1.0) tend to make it more predictable.

We utilize `torch.multinomial` to sample the subsequent token from the output probabilities,
thereby introducing a degree of randomness to the generation process. The generated token is
then added to the existing input, and this sequence persists until the maximum length is attained
or an end-of-sequence token is generated.

The code is given below:


def generate_text(model, tokenized_input, max_length=100, temperature=1.0, end_of_sequence_token=None):
    current_input = tokenized_input
    with torch.no_grad():
        for _ in range(max_length):
            current_input_tensor = torch.tensor(current_input).unsqueeze(0).to(device)

            output = model(current_input_tensor)

            # Scale the logits of the last position by the temperature
            output = output[:, -1, :] / temperature

            # Sample the next token from the softmax distribution
            next_token = torch.multinomial(torch.softmax(output, dim=-1), num_samples=1).squeeze().item()

            current_input.append(next_token)

            # Stop early if an end-of-sequence token has been defined and is generated
            if end_of_sequence_token is not None and next_token == end_of_sequence_token:
                break
    return current_input

Finally, we generate text for several input prompts by calling the `generate_text` function for each one. For each prompt, we tokenize it, generate text from the tokenized form, and then convert the generated tokens back into readable text with a `detokenize_tokens` function (omitted in the provided pseudocode). This `detokenize_tokens` function essentially reverses the tokenization process, producing the final generated text.
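
A minimal sketch of such a `detokenize_tokens` helper is shown below; it assumes the `int_to_word` mapping built during data preparation and simply joins the recovered words with spaces.

def detokenize_tokens(tokens):
    # Map integer IDs back to words; IDs without a vocabulary entry are skipped
    words = [int_to_word[token] for token in tokens if token in int_to_word]
    return " ".join(words)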

The code is given below:


input_prompts = ["LUKE: ", "VADER: ", "LEIA: "]
for prompt in input_prompts:
    tokenized_input = tokenize_input_prompt(prompt)
    generated_tokens = generate_text(model, tokenized_input)
    generated_text = detokenize_tokens(generated_tokens)
    print(prompt + generated_text)

Assessment of Quality and Consistency:

When assessing the quality and consistency of the generated text, it is important to examine factors such as contextual appropriateness, grammatical accuracy, variety across prompt contexts, avoidance of overfitting, and comparisons with perplexity on the validation data. Experimenting with the temperature during text generation also affects the randomness and quality of the results.
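
For example, one simple experiment, reusing the model and helper functions defined above, is to generate from the same prompt at several temperatures and compare the outputs:

# Generate from the same prompt at several temperatures to compare diversity
prompt = "LUKE: "
for temperature in [0.5, 1.0, 1.5]:
    tokens = tokenize_input_prompt(prompt)
    generated_tokens = generate_text(model, tokens, max_length=50, temperature=temperature)
    print(f"temperature={temperature}: {detokenize_tokens(generated_tokens)}")
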
4. Limitations of Vanilla RNN:
a. Vanilla RNNs are a class of neural networks that use recurrent connections to process sequential data. Despite their versatility, they have specific drawbacks, particularly in handling long-range dependencies within sequences. These challenges include:

- The Vanishing Gradient Problem
- The Exploding Gradient Problem
- Limited Short-Term Memory
- Difficulty with Lengthy Sequences
- Absence of Explicit Contextual Information
- Issues with Irregular Time Intervals

b. Vanishing Gradient Problem: A fundamental issue with Vanilla RNNs is the vanishing gradient
problem. During backpropagation through time, when training deep RNNs on lengthy
sequences, the gradients of the error function with respect to the parameters tend to become
exceedingly small. This impedes learning and hampers the RNN's ability to recognize distant
dependencies.

Exploding Gradient Issue: Conversely, Vanilla RNNs can also experience the exploding gradient
problem. Occasionally, gradients can inflate to significant magnitudes during backpropagation,
leading to unstable training and numerical challenges.
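
As an illustration of the vanishing-gradient effect, the short sketch below (with small, arbitrary sizes unrelated to the model above) measures how much gradient from a loss on the final time step reaches each input position; early positions typically receive gradients that are orders of magnitude smaller. Gradient clipping, e.g. torch.nn.utils.clip_grad_norm_, is the usual safeguard against the opposite, exploding-gradient case.

import torch
import torch.nn as nn

torch.manual_seed(0)
demo_rnn = nn.RNN(input_size=8, hidden_size=8, batch_first=True)
x = torch.randn(1, 100, 8, requires_grad=True)  # one batch, 100 time steps

out, _ = demo_rnn(x)
out[:, -1, :].sum().backward()  # the loss depends only on the last time step

# Gradient magnitude reaching each input position
grad_norms = x.grad.squeeze(0).norm(dim=1)
print("gradient at t = 0:  ", grad_norms[0].item())
print("gradient at t = 99: ", grad_norms[-1].item())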

c. These limitations can influence the performance of the implemented language model:

- Vanishing Gradient Problem: In contrast to Vanilla RNNs, the Transformer architecture suffers far less from vanishing gradients. Multi-head self-attention enables Transformers to capture dependencies between distant tokens directly, alleviating the vanishing gradient concern. However, the quadratic computational complexity of attention can still pose challenges for processing lengthy sequences.

- Exploding Gradient Problem: The Transformer architecture mitigates the exploding gradient
problem through meticulous weight initialization and layer normalization techniques. These
strategies reduce the likelihood of gradient explosions during training, resulting in more stable
learning.

- Limited Short-Term Memory: Transformers possess a powerful attention mechanism that excels at capturing dependencies between tokens at varying distances within a sequence. Self-attention lets the model weigh the relevance of all tokens, improving contextual awareness and overcoming short-term memory limitations.
- Difficulty with Lengthy Sequences: While Transformers surpass Vanilla RNNs in capturing
long-range dependencies, they still face difficulties with exceedingly lengthy sequences. The
quadratic complexity of the attention mechanism makes processing extended sequences
computationally and memory-intensive.

- Absence of Explicit Contextual Information: Transformers excel at capturing contextual information accurately. The self-attention mechanism lets the model gather information from every position in the sequence, improving performance on language-related tasks and enhancing contextual comprehension.

- Difficulty with Irregular Time Intervals: Similar to Vanilla RNNs, Transformers operate under a
fixed time assumption. This might not be ideal for applications that involve asynchronous or
irregularly spaced data points.

d. Various solutions and alternative models have been developed to overcome Vanilla RNN
limitations and enhance language model performance:

- Long Short-Term Memory (LSTM) Networks: LSTMs are a type of RNN that addresses the
vanishing gradient problem by employing gating mechanisms. These mechanisms enable the
model to control information flow, retaining relevant information over extended periods. LSTMs
exhibit superior performance in language modeling and other sequential tasks, particularly those
involving long-range relationships.

- Gated Recurrent Units (GRUs): GRUs are similar to LSTMs but simpler in design, offering improved memory retention while being computationally more efficient. They are a strong alternative to LSTMs for tackling the vanishing gradient problem and capturing long-term dependencies (a minimal drop-in swap is sketched after this list).

- Transformer-Based Models: Transformers have become the standard architecture for language modeling tasks. By leveraging self-attention, they efficiently capture long-range relationships and achieve state-of-the-art performance across natural language processing tasks. Efficiency techniques such as sparse attention have made Transformers more scalable to longer sequences.
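
As a sketch of such a swap, the VanillaRNN class defined earlier could be adapted to use an LSTM (or GRU) cell by replacing nn.RNN, while the rest of the training pipeline stays unchanged:

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.input_size = input_size
        # nn.GRU could be substituted here with the same constructor arguments
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # One-hot encode token IDs, as in the VanillaRNN above
        x = nn.functional.one_hot(x, num_classes=self.input_size).float()
        # Hidden and cell states default to zeros when not supplied
        out, _ = self.lstm(x)
        return self.fc(out)

Training it only requires constructing LSTMLanguageModel(input_size, hidden_size, output_size) in place of VanillaRNN; the loss, optimizer, and training loop stay the same.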

5. Report:
We successfully generated text from the "Star Wars Movie Scripts" dataset by developing a language model based on a Vanilla Recurrent Neural Network (RNN). Nonetheless, we encountered
challenges related to the incomplete capture of long-term dependencies, leading to difficulties in
maintaining text coherence and quality. To tackle these concerns, alternative approaches such
as LSTM, GRU, Transformer, and attention-based RNNs offer solutions. These models have
shown enhanced capabilities in handling long-range dependencies, resulting in the generation
of text that is both contextually relevant and coherent. These advancements hold promising
potential to elevate the language modeling task, yielding more accurate and pertinent outcomes.
