CSE425 Assignment - 20101257
TASK: ASSIGNMENT 1
SUBMITTED TO:
SUBMITTED BY:
SECTION: 01
Scenario:
You are working as a data scientist at a software development company. Your team has been
assigned a project to develop a language model that can generate coherent and contextually
relevant text. To accomplish this task, you decide to implement a Vanilla Recurrent Neural
Network (RNN) due to its ability to process sequential data. Your goal is to train the RNN model
on a large corpus of text and generate meaningful text based on the learned patterns.
1. Data Preparation:
The dataset chosen is the movie script from Star Wars. The information contained in the
"character" and "dialogue" columns enables the development of a character-centered language
model, capable of producing meaningful and contextually appropriate text based on the patterns
learned from the script.
Dataset link: https://www.kaggle.com/datasets/xvivancos/star-wars-movie-scripts
The process begins by loading the dataset from a file and applying initial preprocessing in the
form of word tokenization. The build_vocabulary function then builds two dictionaries,
word_to_int and int_to_word, which provide a bidirectional word-to-integer mapping. With the
vocabulary in place, the pseudocode converts the dataset from text into sequences of integers:
the text_to_sequences function maps a list of words to the corresponding list of integers using
the word_to_int dictionary.
To train the Recurrent Neural Network (RNN) in batches, uniform sequence lengths are
required. The pseudocode therefore computes the maximum sequence length in the dataset and
pads all sequences with zeros to that length via the pad_sequences function.
Finally, the pseudocode partitions the dataset into training and validation sets using several
split ratios, and calculates the average number of lines in each set for use in the subsequent
stages of the process.
def preprocess_text(text):
    tokens = tokenize_text(text)
    return tokens

preprocessed_dataset = []
for dialogue in dataset:
    character, text = dialogue['character'], dialogue['dialogue']
    preprocessed_text = preprocess_text(text)
    preprocessed_dialogue = {'character': character, 'dialogue': preprocessed_text}
    preprocessed_dataset.append(preprocessed_dialogue)
def build_vocabulary(dataset):
    all_tokens = [token for dialogue in dataset for token in dialogue['dialogue']]
    unique_tokens = list(set(all_tokens))
    word_to_int = {word: idx for idx, word in enumerate(unique_tokens)}
    int_to_word = {idx: word for word, idx in word_to_int.items()}
    return word_to_int, int_to_word

word_to_int, int_to_word = build_vocabulary(preprocessed_dataset)
sequences_dataset = []
for dialogue in preprocessed_dataset:
    character, text = dialogue['character'], dialogue['dialogue']
    sequence = text_to_sequences(text, word_to_int)
    sequences_dialogue = {'character': character, 'sequence': sequence}
    sequences_dataset.append(sequences_dialogue)
# Maximum sequence length in the dataset, used for zero-padding all sequences.
max_sequence_length = max(len(d['sequence']) for d in sequences_dataset)

padded_dataset = []
for dialogue in sequences_dataset:
    character, sequence = dialogue['character'], dialogue['sequence']
    padded_sequence = pad_sequences(sequence, max_sequence_length)
    padded_dialogue = {'character': character, 'padded_sequence': padded_sequence}
    padded_dataset.append(padded_dialogue)
train_datasets = []
validation_datasets = []
for split_ratio in split_ratios:
    # Split the padded data so each split can be fed directly to the DataLoader below.
    train_data, validation_data = train_validation_split(padded_dataset, split_ratio)
    train_datasets.append(train_data)
    validation_datasets.append(validation_data)
def calculate_average_split(train_datasets, validation_datasets):
    total_train_lines = sum(len(dataset) for dataset in train_datasets)
    total_validation_lines = sum(len(dataset) for dataset in validation_datasets)
    average_train_lines = total_train_lines / len(train_datasets)
    average_validation_lines = total_validation_lines / len(validation_datasets)
    return average_train_lines, average_validation_lines

average_train_lines, average_validation_lines = calculate_average_split(train_datasets, validation_datasets)
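The helper functions referenced in the pseudocode (tokenize_text, text_to_sequences, pad_sequences, and train_validation_split) are not defined in the report. A minimal sketch of how they could look is given below; the whitespace tokenizer, the right-padding with zeros, the fixed random seed, and the example split ratios are assumptions rather than details taken from the original.

import random

def tokenize_text(text):
    # Assumed whitespace tokenizer; a library tokenizer could be substituted.
    return text.lower().split()

def text_to_sequences(tokens, word_to_int):
    # Map each token to its integer id, skipping out-of-vocabulary tokens.
    return [word_to_int[token] for token in tokens if token in word_to_int]

def pad_sequences(sequence, max_sequence_length, pad_value=0):
    # Truncate, then right-pad a single sequence with zeros to max_sequence_length.
    trimmed = sequence[:max_sequence_length]
    return trimmed + [pad_value] * (max_sequence_length - len(trimmed))

def train_validation_split(dataset, split_ratio, seed=42):
    # Shuffle the dialogues and split them at the given ratio (e.g., 0.8 for an 80/20 split).
    shuffled = dataset[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

# Example split ratios; the original report only says "various ratios".
split_ratios = [0.7, 0.8, 0.9]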
We create a custom Dataset class called `SWDataset` for managing our collection of Star Wars
dialogues. This class allows specific dialogues to be retrieved by index. For training, we use a
DataLoader to load the data in batches, which speeds up training and makes more efficient use
of memory.
The code is given below:
import torch
from torch.utils.data import Dataset, DataLoader

class SWDataset(Dataset):
    def __init__(self, dialogues):
        self.dialogues = dialogues

    def __len__(self):
        return len(self.dialogues)

    def __getitem__(self, idx):
        # Return the padded token sequence at the given index as a tensor of ids.
        return torch.tensor(self.dialogues[idx]['padded_sequence'], dtype=torch.long)

train_dataset = SWDataset(train_datasets[0])           # dialogues from one train/validation split
validation_dataset = SWDataset(validation_datasets[0])
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=64, shuffle=False)
The RNN layer is defined using `nn.RNN`, a basic recurrent layer that processes the input
sequentially and maintains a hidden state.
Prior to initiating the training phase, it is essential to establish specific hyperparameters. The
parameter `input_size` signifies the dimension of the input data. The `hidden_size` is
designated as 128, representing the quantity of hidden units within the RNN layer. Similarly, the
`output_size` corresponds to the vocabulary size, aligning with the objective of generating words
from the existing vocabulary.
For the optimization process during training, we opt for the Adam optimizer with a learning rate
of 0.001, which is responsible for fine-tuning the model's parameters. As for the loss function,
we select `nn.CrossEntropyLoss()` to evaluate the model's performance during training.
The code is given below:
vocab_size = len(word_to_int)   # size of the vocabulary built during preprocessing
input_size = vocab_size
hidden_size = 128
output_size = vocab_size
learning_rate = 0.001
num_epochs = 10
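The model definition itself is not shown in the report; a minimal sketch consistent with the description is given below. Note that the original sets `input_size = vocab_size`, which suggests one-hot inputs; this sketch instead uses an embedding layer, and the class name, embedding dimension, and device handling are assumptions.

import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    # Hypothetical model class: embedding -> nn.RNN -> linear projection to the vocabulary.
    def __init__(self, vocab_size, hidden_size, embedding_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)   # token ids -> dense vectors
        self.rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)               # hidden states -> vocabulary logits

    def forward(self, x, hidden=None):
        embedded = self.embedding(x)                  # (batch, seq_len, embedding_dim)
        output, hidden = self.rnn(embedded, hidden)   # (batch, seq_len, hidden_size)
        return self.fc(output)                        # (batch, seq_len, vocab_size)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = VanillaRNN(vocab_size, hidden_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)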
Training runs for the designated number of epochs. Within each epoch, we iterate over the
training data in batches via the `train_loader`. Targets are created by shifting the input data one
step to the left, since the model's objective is to predict the next word from the preceding
sequence.
We perform a forward pass through the model to generate outputs. The loss is computed by
comparing the predicted outputs with the targets using `nn.CrossEntropyLoss`. This loss is then
back-propagated through the network, and the optimizer updates the model's parameters via
`optimizer.step()`.
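A minimal sketch of this training loop is shown below, assuming the model, criterion, optimizer, and train_loader defined earlier, and that each batch is a tensor of padded token ids; handling of the padding index is omitted for brevity.

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0.0
    for batch in train_loader:
        inputs = batch.to(device)      # (batch, seq_len) tensor of padded token ids
        targets = inputs[:, 1:]        # next-word targets: inputs shifted one step to the left
        optimizer.zero_grad()
        outputs = model(inputs)        # (batch, seq_len, vocab_size)
        # Align predictions at positions 0..T-2 with targets at positions 1..T-1.
        loss = criterion(outputs[:, :-1].contiguous().view(-1, output_size),
                         targets.contiguous().view(-1))
        loss.backward()                # backpropagation through time
        optimizer.step()               # parameter update
        epoch_loss += loss.item()
    print(f"Epoch {epoch + 1}/{num_epochs} - training loss: {epoch_loss / len(train_loader):.4f}")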
During validation, we compute the loss and keep a running total of the loss and the token count
over the validation dataset. Perplexity serves as a metric for gauging how well the model
predicts the data: it is the exponential of the mean loss per token, and a lower perplexity
indicates better performance.
model.eval()
total_loss = 0.0
total_tokens = 0
with torch.no_grad():
    for batch in data_loader:        # here, data_loader refers to the validation_loader
        inputs = batch.to(device)    # move data to GPU if available
        targets = inputs[:, 1:]      # shift targets one step to the left
        outputs = model(inputs)
        loss = criterion(outputs[:, :-1].contiguous().view(-1, output_size),
                         targets.contiguous().view(-1))
        total_loss += loss.item() * targets.numel()   # accumulate summed per-token loss
        total_tokens += targets.numel()
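Perplexity can then be computed after the loop; a minimal sketch, assuming total_loss accumulates the summed per-token loss and total_tokens the token count as above:

import math

average_loss = total_loss / total_tokens   # mean cross-entropy per token
perplexity = math.exp(average_loss)        # lower perplexity indicates better predictions
print(f"Validation loss: {average_loss:.4f}, perplexity: {perplexity:.2f}")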
3. Text Generation:
Prior to text generation, it is essential to retrieve the Vanilla RNN model that was previously
trained and saved in a checkpoint. By applying `model.eval()`, we configure the model for
evaluation mode, thereby deactivating dropout layers to maintain consistent behavior during the
inference stage.
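The checkpoint loading itself is not shown in the pseudocode; a minimal sketch is given below, where the checkpoint file name and the state-dict format are assumptions.

checkpoint_path = 'vanilla_rnn_checkpoint.pt'   # assumed file name; not given in the report
model = VanillaRNN(vocab_size, hidden_size).to(device)
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model.eval()   # evaluation mode for inference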
The function `generate_text` assumes the responsibility of creating text based on the input
prompt provided. It requires the trained model, tokenized input, and certain parameters like
`max_length` and `temperature`.
Throughout the text generation procedure, we follow an iterative approach to predict the
subsequent token using the ongoing input prompt. At each step, the current input is transformed
into a tensor and passed through the model to acquire the output. The level of text randomness
is regulated by the `temperature` parameter. Elevated values (e.g., above 1.0) result in more
diverse text, while lower values (e.g., below 1.0) tend to make it more predictable.
We utilize `torch.multinomial` to sample the subsequent token from the output probabilities,
thereby introducing a degree of randomness to the generation process. The generated token is
then added to the existing input, and this sequence persists until the maximum length is attained
or an end-of-sequence token is generated.
def generate_text(model, tokenized_prompt, max_length, temperature):
    current_input = list(tokenized_prompt)
    for _ in range(max_length):
        output = model(torch.tensor([current_input], device=device))
        probabilities = torch.softmax(output[0, -1] / temperature, dim=-1)  # temperature-scaled distribution
        next_token = torch.multinomial(probabilities, num_samples=1).item()
        current_input.append(next_token)
        if next_token == end_of_sequence_token:
            break
    return current_input
Finally, we produce text for several input prompts by invoking the `generate_text`
function for each prompt. Each prompt is tokenized, text is generated from the
tokenized form, and the generated tokens are converted back into readable text by a
`detokenize_tokens` function (omitted in the provided pseudocode). This function
essentially reverses the tokenization process, yielding the final generated text output.
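A minimal sketch of the omitted `detokenize_tokens` function, together with an example of invoking `generate_text`, is given below. The whitespace joining and the sample prompts are assumptions; the original report does not list the prompts used.

def detokenize_tokens(tokens, int_to_word):
    # Reverse the tokenization: map integer ids back to words and join them with spaces.
    return ' '.join(int_to_word[token] for token in tokens if token in int_to_word)

# Illustrative prompts only.
prompts = ["may the force", "i have a bad feeling"]
for prompt in prompts:
    tokenized_prompt = text_to_sequences(tokenize_text(prompt), word_to_int)
    generated_tokens = generate_text(model, tokenized_prompt, max_length=50, temperature=0.8)
    print(detokenize_tokens(generated_tokens, int_to_word))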
b. Limitations of the Vanilla RNN:
- Vanishing Gradient Problem: A fundamental issue with Vanilla RNNs is the vanishing gradient
problem. During backpropagation through time, when training RNNs on lengthy sequences, the
gradients of the error function with respect to the parameters tend to become exceedingly small.
This impedes learning and hampers the RNN's ability to capture distant dependencies.
- Exploding Gradient Problem: Conversely, Vanilla RNNs can also experience exploding
gradients. Gradients can occasionally grow to very large magnitudes during backpropagation,
leading to unstable training and numerical issues.
c. These limitations can influence the performance of the implemented language model; compared with a Vanilla RNN, the Transformer architecture addresses several of them:
- Vanishing Gradient Problem: In contrast to Vanilla RNNs, the Transformer architecture suffers
far less from vanishing gradients. Multi-head self-attention allows Transformers to capture
dependencies between distant tokens directly, alleviating the vanishing gradient concern.
However, the quadratic computational complexity of attention can still pose challenges for
processing lengthy sequences.
- Exploding Gradient Problem: The Transformer architecture mitigates the exploding gradient
problem through meticulous weight initialization and layer normalization techniques. These
strategies reduce the likelihood of gradient explosions during training, resulting in more stable
learning.
- Limited Short-Term Memory: Transformers possess a powerful attention mechanism that
excels at identifying dependencies across arbitrary distances within a sequence. Self-attention
enables the model to weigh the significance of all tokens in the context, enhancing contextual
awareness and overcoming the memory limitations of the Vanilla RNN.
- Difficulty with Lengthy Sequences: While Transformers surpass Vanilla RNNs in capturing
long-range dependencies, they still face difficulties with exceedingly lengthy sequences. The
quadratic complexity of the attention mechanism makes processing extended sequences
computationally and memory-intensive.
- Difficulty with Irregular Time Intervals: Like Vanilla RNNs, Transformers assume regularly
spaced, discrete sequence positions. This may not be ideal for applications involving
asynchronous or irregularly spaced data points.
d. Various solutions and alternative models have been developed to overcome Vanilla RNN
limitations and enhance language model performance:
- Long Short-Term Memory (LSTM) Networks: LSTMs are a type of RNN that addresses the
vanishing gradient problem by employing gating mechanisms. These mechanisms enable the
model to control information flow, retaining relevant information over extended periods. LSTMs
exhibit superior performance in language modeling and other sequential tasks, particularly those
involving long-range relationships.
- Gated Recurrent Units (GRUs): GRUs, akin to LSTMs but with a simpler design, offer
enhanced memory retention while being computationally more efficient. They provide a powerful
alternative to LSTMs for tackling the vanishing gradient problem and capturing long-term
dependencies.
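Because LSTM and GRU layers share the same constructor interface as `nn.RNN` in PyTorch, the recurrent layer in the Vanilla RNN model sketched earlier could be swapped with minimal changes; a hedged illustration follows, where the embedding and hidden dimensions are assumed to match that earlier sketch. (Note that an LSTM's hidden state is an (h, c) tuple rather than a single tensor.)

import torch.nn as nn

embedding_dim = 128   # assumed, matching the earlier model sketch
hidden_size = 128

# The nn.RNN layer in the model could be replaced by either gated alternative.
rnn_layer = nn.RNN(embedding_dim, hidden_size, batch_first=True)    # vanilla RNN
lstm_layer = nn.LSTM(embedding_dim, hidden_size, batch_first=True)  # gating mitigates vanishing gradients
gru_layer = nn.GRU(embedding_dim, hidden_size, batch_first=True)    # simpler gating, fewer parameters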
5. Report:
We successfully generated text from the "Star Wars Movie Scripts" dataset by developing a
language model based on a Vanilla Recurrent Neural Network (RNN). Nonetheless, we
encountered challenges related to the incomplete capture of long-term dependencies, which
made it difficult to maintain text coherence and quality. To address these concerns, alternative
approaches such as LSTMs, GRUs, Transformers, and attention-based RNNs offer solutions.
These models have shown stronger capabilities in handling long-range dependencies, resulting
in generated text that is both contextually relevant and coherent, and they hold promising
potential to improve the language modeling task with more accurate and pertinent outcomes.