Narayanaswamy R
2033022
November 2023
DEPARTMENT OF COMPUTING
COIMBATORE INSTITUTE OF TECHNOLOGY
(Autonomous Institution affiliated to Anna University)
COIMBATORE – 641014
COIMBATORE INSTITUTE OF TECHNOLOGY
(Autonomous Institution affiliated to Anna University)
COIMBATORE – 641014
BONAFIDE CERTIFICATE
Project Work - I
Seventh Semester
December 2022
____________ ________________
Faculty Guide                                                  Head of the Department
__________________ _________________
Internal Examiner External Examiner
CONTENTS
CHAPTER
ACKNOWLEDGEMENT
SYNOPSIS
PREFACE
I INTRODUCTION
1.1 ORGANIZATION PROFILE
1.2 PROBLEM STATEMENT
1.3 DESCRIPTIVE STATISTICAL SUMMARY
1.4 OVERVIEW OF PREDICTIVE ANALYSIS
1.5 INFERENCES SUMMARY
VI CONCLUSION
6.2 BIBLIOGRAPHY
ACKNOWLEDGEMENT
Apart from my efforts, the success of any project depends largely on the
encouragement and guidelines of many others. I take this opportunity to express my
gratitude to the people who have been instrumental in the successful completion of this
project.
Finally, I express my utmost gratitude to the Almighty, my parents, my mentors and all of
the team members for their help and support. They have been my motivators through
thick and thin.
SYNOPSIS
The project titled "Lease Agreement Classification" aims to develop an advanced
predictive model for understanding and automatically categorizing lease agreements. This
project's primary objective is to leverage a combination of machine learning algorithms
and ensemble techniques to unravel the textual and structural factors that distinguish lease
document categories and provide accurate predictions of the class to which each agreement belongs.
The project addresses the challenge of deciphering the complex and varied content of
lease agreements. These documents differ widely in structure, language, and purpose across
residential, commercial, and industrial contexts, making classification a challenging
task. The proposed system will be of significant interest to real estate professionals,
legal teams, and property managers seeking efficient document organization. By providing precise
classifications, the system can streamline document retrieval, support compliance with lease
terms, and improve decision-making in lease-related matters. Tasks include
data collection, preprocessing, feature engineering, and implementing multiple machine
learning models. Ensemble techniques will be developed to improve predictive accuracy.
Model evaluation will employ metrics such as accuracy, precision, recall, and F1-score.
Interpretability analysis will examine feature importance, providing insights into the textual
cues that drive classification.
Python and libraries like NumPy, Pandas, Scikit-Learn, and XGBoost will be used.
Exploratory Data Analysis (EDA) will uncover dataset characteristics. Feature
importance analysis will identify key variables. Model selection and evaluation with
cross-validation and hyperparameter tuning will ensure robust predictions. Inferences aim
to reveal influential factors and showcase the effectiveness of ensemble techniques. This
project provides valuable, data-driven insights for real estate and legal stakeholders.
PREFACE
INTRODUCTION
This section gives a detailed description of the organization for which the model is
developed along with the explanation of the problem definition, goals and scope of the
proposed model. This section also gives a descriptive summary of the data and specifies the
methods, techniques and tools used in the development of the model and finally concludes
with an inference.
1.2.1 OBJECTIVE
The primary objective of this project is to create an accurate and automated lease agreement
classification system that can categorize lease documents into predefined types or classes, thereby
streamlining document organization, retrieval, and decision-making processes in the real estate and
legal industries.
1.2.2 SCOPE:
1. Data Aggregation and Cleansing: The project involves automated data processing and cleansing to
aggregate lease agreement data from multiple sources, ensuring consistency and accuracy.
2. Collaborative Decision Support: Collaboration features will be integrated to facilitate engagement
among technical and analytical stakeholders. This transparency will demystify the predictive models,
promoting trust in the decision-making process.
3. Access Control and Privacy: Regulatory-compliant access controls will be implemented to manage
data accessibility. Pseudonymization techniques will enable secure engagement with sensitive lease
agreement data, maintaining confidentiality while allowing comprehensive analysis.
4. Decision Support Enhancement: Beyond the primary scope, potential extensions for real-time
decision support systems will be explored. Additionally, comprehensive documentation will be
provided for future reference and the continuous improvement of the lease agreement decision-
making process.
5. Real-time Data Integration: Explore the integration of real-time data sources to keep predictive
models up-to-date with the latest lease agreement data and developments. This can be particularly
valuable when lease agreements change frequently.
6. Multi-Channel Data Analysis: Extend the scope to include the ability to analyse lease agreement
data from multiple sources and formats. This comprehensive approach can offer a more holistic understanding of document categories.
1.2.3 Users
1. Real Estate Professionals: Real estate agents, property managers, and brokers can use the system to
quickly classify and manage lease agreements for different properties.
2. Property Owners: Owners of residential or commercial properties can use the system to organize
and categorize lease agreements for their properties.
3. Property Management Companies: Companies specializing in property management can automate
document management and ensure compliance with lease terms.
4. Archivists and Records Managers: Those responsible for maintaining an organization's document
archives can benefit from the system's ability to categorize and retrieve lease agreements efficiently.
5. Legal Tech Companies: Companies providing legal technology solutions can integrate lease
agreement classification as a feature in their platforms.
6. Model Explainability: Enhance the scope by focusing on model explainability techniques.
Understanding why the model makes specific predictions is crucial for building trust and explaining
results to stakeholders.
7. Continuous Feedback Loop: Implement a feedback loop where the model's predictions are
continuously compared to actual label outcomes. This allows for model refinement and adaptation
over time, improving prediction accuracy.
1.4 OVERVIEW OF PREDICTIVE ANALYSIS
1.4.1 METHODOLOGY
The project embarks on the challenging task of Lease Agreement Classification, addressing a
complex issue deeply intertwined with real-world demands. In the realm of real estate and legal
matters, lease agreements vary widely in content, structure, and purpose. The accurate categorization
of these agreements is crucial for efficient document management and informed decision-making.
This project sets out to create a classification model capable of deciphering the intricate aspects of
lease agreements, enabling precise categorization for real-world applications.
The methodology employed in this project mirrors the pragmatic approach used in real-world
scenarios. It commences with comprehensive data collection, akin to the practices of real estate
management and legal professionals who gather a variety of lease agreements for analysis.
These datasets encompass a wide spectrum, including residential, commercial, and industrial
lease agreements. They may vary in terms of content, language, and structure, reflecting the diversity
of lease agreements encountered in practice. This data compilation reflects the modern need for
efficient document management in real estate and legal contexts, where digital platforms play a
crucial role in organizing and categorizing lease agreements.
classify lease documents.
Ensemble techniques are employed to further enhance predictive accuracy. The project utilizes
advanced ensemble methods, such as stacking and voting, to combine predictions from individual
models. Hyperparameter optimization is conducted with care to fine-tune model performance and
achieve optimal classification results.
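As an illustration of how such ensembles can be assembled, the sketch below combines individual classifiers through scikit-learn's voting and stacking meta-estimators over TF-IDF features. The estimators, parameters, and variable names are illustrative assumptions rather than the project's final configuration.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, StackingClassifier
from sklearn.svm import LinearSVC

base_estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("svm", LinearSVC()),
]

# Soft voting averages predicted probabilities, so only probabilistic models are included.
voting = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=200, random_state=42))],
    voting="soft",
)

# Stacking feeds each base model's output into a final logistic regression.
stacking = StackingClassifier(estimators=base_estimators,
                              final_estimator=LogisticRegression(max_iter=1000))

model = make_pipeline(TfidfVectorizer(stop_words="english"), stacking)
# model.fit(train_texts, train_labels); predictions = model.predict(test_texts)

Stacking learns how much weight to give each base model, whereas voting combines them with fixed, equal influence.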
An exhaustive evaluation process follows, wherein models are rigorously assessed using
metrics such as accuracy, precision, recall, and F1-score. The comparative analysis of different
models and ensembles aids in selecting the most effective candidate for deployment in the lease
agreement classification system.
CHAPTER II
DATA MODELING AND EXPLORATION
2.2.2 DATA PREPARATION
Data preparation is a vital step in the process of Lease Agreement Classification from PDF
documents. It involves a meticulous approach to cleaning and structuring the data to ensure its
accuracy and suitability for analysis. The process starts by addressing issues like missing information,
outliers, and inconsistencies within the lease agreements. To handle missing data, a variety of
imputation techniques are employed, which may include mean imputation or predictive modelling to
intelligently fill gaps. Feature engineering is another essential component, where variables are created
or transformed to capture relevant information for the predictive models. This includes encoding
categorical variables and scaling numerical attributes to maintain consistency. Additionally, data
integration efforts harmonize lease agreements from different sources, ensuring uniformity in terms of
data types and structures. This rigorous data preparation process serves as the foundation for
subsequent analyses, enabling accurate and meaningful insights into the dynamics of lease agreement
classification.
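A minimal sketch of these preparation steps with scikit-learn is given below; the column names, file name, and imputation strategies are hypothetical placeholders rather than the project's actual schema.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["monthly_rent", "lease_term_months"]          # hypothetical attributes
categorical_cols = ["property_type", "state"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),               # mean imputation for gaps
    ("scale", StandardScaler()),                              # keep numeric attributes consistent
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),       # encode categorical variables
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# df = pd.read_csv("lease_metadata.csv")                      # hypothetical metadata file
# X = preprocessor.fit_transform(df[numeric_cols + categorical_cols])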
phrases, or sentences. Calculate basic text statistics such as word frequency, sentence length, and
vocabulary size. This can provide insights into the complexity and readability of lease agreements.
2.3.4 Keyword Analysis: Identify relevant keywords or phrases within the documents that are indicative of
different lease agreement categories. For example, keywords like "renewal," "maintenance," or
"sublet" may be associated with specific sections.
2.3.5 Data Distribution: Analyse the distribution of lease agreements across different categories or labels.
Understand the balance or imbalance in the dataset, as this can impact model performance.
2.3.6 Visualization (Optional): Although EDA for textual data may not involve traditional charts and
graphs, you can create word clouds to visualize the most common terms in each category. This can
provide a qualitative understanding of document content.
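The statistics and distribution checks described in the preceding subsections can be sketched as follows, assuming the extracted sections are held in a DataFrame with hypothetical "text" and "label" columns.

import pandas as pd
from collections import Counter

# df = pd.read_csv("lease_sections.csv")                      # hypothetical input
df = pd.DataFrame({"text": ["Tenant shall not sublet the premises without consent",
                            "The lease year commences on the commencement date"],
                   "label": ["Assignment/Sublet", "Lease Year"]})

tokens = df["text"].str.lower().str.split()
df["n_words"] = tokens.str.len()                              # document length
vocabulary = Counter(word for doc in tokens for word in doc)

print(df["n_words"].describe())                               # length statistics
print(vocabulary.most_common(10))                             # most frequent terms
print(df["label"].value_counts())                             # class balance or imbalance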
Fig 2.3.2 Frequency of words
2.3.7 Text Pre-processing: Apply text pre-processing techniques such as text cleaning, stop-word
removal, and stemming/lemmatization to prepare the text for feature extraction.
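A hedged sketch of these pre-processing steps using NLTK is shown below; the cleaning rules and the choice of lemmatization over stemming are illustrative, not the project's recorded pipeline.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("punkt")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())          # remove punctuation and digits
    tokens = nltk.word_tokenize(text)                         # tokenize
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
    return " ".join(tokens)

print(preprocess("The Tenant shall pay the Rent on the 1st of each month."))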
CHAPTER III
PREDICTIVE ANALYTICS PROCESS
ANALYSIS MODEL
Classification. Despite its name, it's primarily used for binary classification, where
the goal is to predict one of two possible outcomes. In the context of Lease
Agreement Classification, it can be adapted to categorize lease agreements into
predefined classes or categories based on their textual content.
Logistic Regression operates by modelling the probability that a given
lease agreement belongs to a specific category. It accomplishes this by fitting a
logistic function to the input data, which maps input features to a probability
score. This score is then thresholded to make binary predictions. The algorithm is
known for its simplicity and interpretability, making it a valuable choice when
insights into feature importance are essential.
In Lease Agreement Classification, Logistic Regression can effectively
capture linear relationships between features and document categories. It evaluates
textual elements, keywords, and structural cues within lease agreements to make
informed decisions about their categorization. Additionally, Logistic Regression's
regularization techniques, such as L1 and L2 regularization, help prevent
overfitting and enhance model generalization.
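An illustrative sketch of such a model, pairing TF-IDF features with an L2-regularized Logistic Regression in scikit-learn, is shown below; the parameters and variable names are assumptions rather than the tuned values used in the project.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),            # unigram and bigram features
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),   # L2 regularization against overfitting
)
# clf.fit(train_texts, train_labels)
# probabilities = clf.predict_proba(test_texts)               # per-category probability scores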
data makes it well-suited for this task.
Furthermore, Random Forest provides valuable insights into feature
importance, helping users understand which aspects of lease agreements are most
influential in the classification process. This information can guide data pre-
processing and feature engineering efforts, ultimately improving the model's
performance. With its adaptability, interpretability, and excellent predictive
capabilities, Random Forest stands as a robust choice for Lease Agreement
Classification, contributing to more accurate and efficient document
categorization and management.
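The sketch below shows, with illustrative data and parameters, how a Random Forest can be trained on TF-IDF features and how its feature importances can be ranked; the sample texts and labels are placeholders.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["tenant may not sublet without consent",             # placeholder documents
         "insurance must be maintained by the landlord",
         "the lease year begins on the commencement date"]
labels = ["Assignment/Sublet", "Insurance", "Lease Year"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, labels)

# Rank terms by their contribution to the trees' split decisions.
ranked = sorted(zip(vectorizer.get_feature_names_out(), rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
print(ranked[:10])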
In the context of Lease Agreement Classification, SGD is utilized as an
optimization technique to train machine learning models efficiently. It works by
iteratively updating the model's parameters in a way that minimizes a predefined
loss function, ultimately leading to the best possible model fit. What sets SGD
apart is its "stochastic" nature, meaning that it optimizes the model using random
subsets of the training data (mini-batches) rather than the entire dataset. This not
only accelerates training but also introduces a level of randomness that can help
escape local minima in the optimization process.
SGD's adaptability and speed make it a powerful choice for text
classification tasks. During the training phase, the algorithm adjusts the model's
weights to better align with the textual features extracted from lease agreements.
This optimization process continues until a satisfactory model is achieved, capable
of accurately classifying lease agreements into their respective categories.
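A minimal sketch of this mini-batch style of training with scikit-learn's SGDClassifier is shown below; the batches, labels, and hyperparameters are purely illustrative.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**16)
clf = SGDClassifier(loss="log_loss", alpha=1e-4, random_state=42)

classes = np.array(["Alterations", "Insurance", "Lease Year"])        # illustrative label set
batches = [(["tenant shall make no alterations"], ["Alterations"]),
           (["landlord shall maintain insurance"], ["Insurance"]),
           (["the lease year runs twelve months"], ["Lease Year"])]

for texts, labels in batches:                                         # iterate over mini-batches
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)                       # incremental weight updates

print(clf.predict(vectorizer.transform(["the insurance policy is required"])))

Each call to partial_fit updates the weights from one small subset of the data rather than the full dataset, which is what gives the method its stochastic, incremental character.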
loss, or the difference between the actual class value of the training example and
the predicted class value. It isn't required to understand the process for reducing
the classifier's loss, but it operates similarly to gradient descent in a neural
network. In the case of Gradient Boosting Machines, every time a new weak
learner is added to the model, the weights of the previous learners are frozen or
cemented in place, left unchanged as the new layers are introduced. This is
distinct from the approach used in AdaBoost, where the weights of the training examples are adjusted
when new learners are added. The power of gradient boosting machines comes
from the fact that they are not limited to binary classification problems; they
can also be used on multi-class classification problems and even regression problems.
Gradient boosting systems have two other necessary parts: a weak learner
and an additive component. Gradient boosting systems use decision trees as their
weak learners. Regression trees are used for the weak learners, and these
regression trees output real values. Because the outputs are real values, as new
learners are added into the model the output of the regression trees can be added
together to correct for errors in the predictions. The additive component of a
gradient boosting model comes from the fact that trees are added to the model
over time, and when this occurs the existing trees aren't manipulated, their values
remain fixed. Gradient boosting models can perform incredibly well on very
complex datasets, but they are also prone to overfitting.
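The sketch below shows a small, illustrative Gradient Boosting configuration in scikit-learn; the sample documents, labels, and hyperparameters are placeholders rather than the project's tuned settings.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["tenant shall not alter the premises",               # placeholder documents
         "insurance coverage is mandatory",
         "rent is due on the first of the month",
         "the lease year is defined as twelve months"]
labels = ["Alterations", "Insurance", "Rent", "Lease Year"]

X = TfidfVectorizer().fit_transform(texts).toarray()          # dense features for the trees
gbm = GradientBoostingClassifier(n_estimators=100,            # number of weak learners (trees)
                                 learning_rate=0.1,           # shrinks each new tree's contribution
                                 max_depth=3,
                                 random_state=42)
gbm.fit(X, labels)                                            # each new tree corrects earlier errors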
Fig. 3.1.4 is a diagram that explains the workflow of the Gradient Boosting algorithm.
One of the key advantages of SVMs is their ability to handle both linear
and nonlinear data by using appropriate kernel functions, such as the radial basis
function (RBF) kernel. This flexibility allows SVMs to capture intricate
relationships within lease agreement text, making them well-suited for the task.
SVMs are known for their capacity to perform effectively in high-
dimensional feature spaces, which is essential when dealing with the multifaceted
content of lease agreements. Additionally, SVMs provide a clear separation of
categories and are less prone to overfitting, ensuring reliable classification results.
In summary, Support Vector Machines offer a robust and adaptable
approach to Lease Agreement Classification. They excel at handling complex
textual data, providing accurate categorization, and supporting various kernel
functions to capture intricate patterns within lease agreements.
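An illustrative sketch of an RBF-kernel SVM over TF-IDF features is given below; the kernel parameters are assumptions, not the tuned values used in the project.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

svm_clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),
    SVC(kernel="rbf", C=10, gamma="scale"),   # RBF kernel captures nonlinear patterns
)
# svm_clf.fit(train_texts, train_labels)
# predicted_categories = svm_clf.predict(test_texts)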
3.2 TOOLS DESCRIPTION
The technologies and tools that are used in the project and a brief
description about each of them are discussed in this section.
developers have access to a wealth of resources and support, making it an
excellent choice for a wide range of projects.
3.3.2 Colab
Google Colab, short for Google Colaboratory, is a cloud-based platform that
offers a collaborative and interactive environment for developing and running
Python code. It has gained immense popularity among data scientists, machine
learning engineers, and researchers due to its ease of use, free access to GPU
resources, and seamless integration with Google Drive. Colab provides a Jupyter
Notebook-like interface, making it convenient for users to create, edit, and execute
Python code in a notebook format. This format enables the combination of code,
documentation, and visualizations in a single document, making it ideal for data
analysis, machine learning experiments, and collaborative research projects.
One of Colab's standout features is its provision of free GPU and TPU (Tensor
Processing Unit) resources. This capability allows users to accelerate
computationally intensive tasks, such as training deep learning models, without
the need for expensive hardware. Additionally, Colab's integration with Google
Drive simplifies data management and sharing. Users can easily access datasets
and files stored in their Google Drive and share their Colab notebooks with
collaborators. These collaborative features make Google Colab a valuable tool for
both individuals and teams working on data-driven projects, research, and
development tasks in various fields.
3.3.3 NLP LIBRARIES (NLTK AND SPACY)
Natural Language Processing (NLP) libraries are essential tools for working
with human language data and enabling machines to understand, process, and
generate human-like text. Among the most prominent NLP libraries, NLTK
(Natural Language Toolkit) is widely recognized for its extensive collection of
text processing libraries and corpora, making it a valuable resource for NLP
research and development. NLTK provides tools for tokenization, stemming, part-
of-speech tagging, named entity recognition, sentiment analysis, and more. Its
user-friendly interface and detailed documentation make it an excellent choice for
educational purposes and NLP projects ranging from text analysis to machine
learning applications.
Another powerful NLP library is spaCy, known for its speed and efficiency in
handling large-scale text processing tasks. spaCy offers pre-trained models for
various languages, enabling users to perform tasks like entity recognition,
dependency parsing, and text classification with ease. Its API is designed for
production use, making it a preferred choice for building NLP applications and
integrating NLP capabilities into software systems. spaCy's focus on performance
and accuracy has made it a popular choice among developers and researchers
looking to leverage NLP capabilities for real-world applications. Both NLTK and
spaCy, along with other NLP libraries, play pivotal roles in advancing the field of
natural language processing and enabling a wide range of language-related tasks
in machine learning, text analysis, and information retrieval.
3.3.4 PANDAS
Pandas is a widely-used Python library for data manipulation and analysis. It
provides an easy-to-use and highly flexible data structure known as a DataFrame,
which is akin to a spreadsheet or database table. With Pandas, users can efficiently
load, clean, transform, and analyze data from various sources, making it an
indispensable tool for data scientists, analysts, and researchers. Pandas simplifies
data exploration by offering a wide range of functions and methods for tasks such
as data indexing and selection, grouping, aggregation, and time series
manipulation. Its seamless integration with other Python libraries, like NumPy and
Matplotlib, allows for comprehensive data analysis and visualization. Pandas'
intuitive and powerful data processing capabilities make it an essential part of the
data science toolkit.
possible to combine multiple datasets based on common keys or indexes. This
feature is invaluable for integrating data from different sources and performing
complex data transformations. Whether you're working on data cleaning,
exploration, or complex data analysis tasks, Pandas remains a versatile and
indispensable library for efficiently managing and analyzing tabular data in
Python.
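A small sketch of this kind of key-based merge is shown below; the tables and column names are hypothetical.

import pandas as pd

agreements = pd.DataFrame({"doc_id": [1, 2, 3],
                           "category": ["Insurance", "Alterations", "Lease Year"]})
properties = pd.DataFrame({"doc_id": [1, 2, 3],
                           "property_type": ["Commercial", "Residential", "Industrial"]})

# Merge on the common key "doc_id" to integrate the two sources.
combined = agreements.merge(properties, on="doc_id", how="left")
print(combined)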
3.3.5 Excel
Excel is a widely used spreadsheet application developed by Microsoft,
renowned for its versatility and ease of use in managing and analyzing data. It
offers a grid-like interface where users can input, organize, and manipulate data in
rows and columns. Excel provides a plethora of functions and formulas for
performing calculations, statistical analysis, and data visualization. Its user-
friendly features, such as drag-and-drop functionality and cell formatting options,
make it accessible to a broad range of users, from students and professionals to
data analysts and financial experts.
One of Excel's core strengths is its ability to create visually appealing and
informative charts and graphs, facilitating data visualization and presentation.
Users can choose from a variety of chart types, including bar charts, pie charts,
and line graphs, to represent data in a way that best conveys insights and trends.
Additionally, Excel supports the creation of pivot tables, which enable users to
summarize and explore large datasets efficiently. Excel's extensive functionality,
coupled with its widespread availability in both personal and professional settings,
makes it a go-to tool for tasks like budgeting, financial analysis, project
management, and data reporting.
3.4.2 Data Splitting: Split your dataset into training, validation, and test
sets using scikit-learn's train_test_split function.
3.4.3 Model Selection: Choose the machine learning model you want to
use for your task. Import the relevant model class from scikit-learn, e.g.,
from sklearn.ensemble import RandomForestClassifier for a Random
Forest Classifier.
3.4.4 Hyperparameter Tuning: Utilize scikit-learn's GridSearchCV to
perform hyperparameter tuning and find the best combination of
hyperparameters for your model.
3.4.5 Model Training: Train your selected model on the training data using
the best hyperparameters found during tuning.
3.4.7 Final Model Training and Testing: Once satisfied with your model's
performance, train it on the entire training dataset (including the validation
set) and evaluate it on the test set to get a final assessment of its
performance.
3.4.8 Results and Visualization: Analyze and visualize the results using
libraries like matplotlib and seaborn. You can create various plots and
visualizations to interpret the model's predictions and insights.
3.4.9 Exporting Results: Export the model, its predictions, and relevant
results to Excel or other formats if needed. You can use libraries like
pandas to save data frames as Excel files. A consolidated sketch of these
steps is shown below.
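The consolidated sketch below brings steps 3.4.2 to 3.4.9 together in one pipeline; the dataset path, parameter grid, scoring choice, and output file name are hypothetical placeholders.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# df = pd.read_csv("lease_sections.csv")                      # hypothetical dataset
# texts, labels = df["text"], df["label"]

pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", RandomForestClassifier(random_state=42))])

param_grid = {"clf__n_estimators": [100, 300],                # illustrative grid
              "clf__max_depth": [None, 20]}

# X_train, X_test, y_train, y_test = train_test_split(texts, labels,
#                                                     test_size=0.2, random_state=42)
# search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
# search.fit(X_train, y_train)                                # tuning and training in one step
# print(classification_report(y_test, search.predict(X_test)))
# pd.DataFrame(search.cv_results_).to_excel("tuning_results.xlsx", index=False)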
CHAPTER IV
ANALYTICAL MODEL EVALUATION
4.1.1 MODEL RESULT
The dataset is initially split into two parts: 80% for training
and 20% for validation. Then, the 80% training portion is further
divided into 80% for training and 20% for testing.
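This nested split can be sketched with train_test_split as follows; the placeholder arrays stand in for the project's actual features and labels.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)                             # placeholder features
y = np.arange(100) % 2                                        # placeholder labels

# First split: 80% working data, 20% held-out validation set
X_work, X_val, y_work, y_val = train_test_split(X, y, test_size=0.20, random_state=42)
# Second split of the 80% portion: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_work, y_work, test_size=0.20, random_state=42)

print(len(X_train), len(X_test), len(X_val))                  # 64, 16, 20 samples respectively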
Recall will determine the proportion of real positives that were correctly
identified. Recall is a performance metric in machine learning and statistics that
measures the ability of a model to correctly identify all relevant instances from a
dataset. It quantifies the proportion of true positive predictions (correctly
identified positive cases) out of all actual positive instances in the dataset.
calculated as the harmonic mean of precision and recall, providing a single value
that balances the trade-off between precision and recall.
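For reference, this harmonic-mean relationship can be written as F1 = 2 × (Precision × Recall) / (Precision + Recall). As a worked example, a precision of 0.90 combined with a recall of 0.60 gives F1 = (2 × 0.90 × 0.60) / (0.90 + 0.60) = 1.08 / 1.50 = 0.72, noticeably lower than the arithmetic mean of 0.75, reflecting how the harmonic mean penalizes imbalance between the two metrics.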
Confusion Matrix for Logistic Regression Model:
The rows in the confusion matrix represent the actual classes or labels. The
columns represent the predicted classes made by the model. Each cell in the
matrix contains the count of instances falling into a specific combination of actual
and predicted classes. Here's how to interpret the key metrics from the confusion
matrix:
Precision: It measures the model's ability to make accurate positive
predictions. In this context, precision is high for most classes, indicating that when
the model predicts a certain class, it is often correct. For example, for
"Alterations," "Insurance," and "Lease Year," the precision is 1.00, meaning that
the model rarely makes false positive predictions for these classes.
Recall: Recall quantifies the model's ability to capture all positive
instances correctly. Similar to precision, recall is high for many classes, indicating
that the model effectively identifies positive cases. For instance, the recall for
"Area," "Basic Information," and "Insurance" is perfect (1.00), suggesting that the
model rarely misses these positive instances.
F1-Score: The F1-score is a harmonic mean of precision and recall,
providing a balanced measure of a model's overall performance. High F1-scores
indicate models that are both precise and able to capture positive instances. For
example, "Alterations" and "Insurance" have F1-scores of 1.00, indicating strong
performance for these classes. the F1-score for the "Assignment/Sublet" class is
0.87, indicating a reasonably good balance between precision and recall for this
class.Support: Support represents the number of instances in each class, indicating
how many data points belong to each category.
The confusion matrix itself is a tabular representation that shows the
model's predictions (columns) compared to the actual ground truth (rows) for each
class. It provides a detailed view of the model's performance for each category,
highlighting where it excels and where it may have challenges. This information is
valuable for understanding the strengths and weaknesses of the classification
model and for making improvements as needed.
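For reference, the report and matrix discussed here are of the kind produced by scikit-learn's classification_report and confusion_matrix utilities, sketched below with purely illustrative labels.

from sklearn.metrics import classification_report, confusion_matrix

y_true = ["Insurance", "Alterations", "Lease Year", "Assignment/Sublet", "Insurance"]
y_pred = ["Insurance", "Alterations", "Lease Year", "Insurance", "Insurance"]

# Rows correspond to actual classes, columns to predicted classes.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, zero_division=0))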
4.3 INFERENCE:
• In terms of precision, for "Alterations," "Insurance," and "Lease Year" the precision is 1.00,
meaning that the model rarely makes false positive predictions for these classes.
• The recall for "Area," "Basic Information," and "Insurance" is perfect (1.00),
suggesting that the model rarely misses these positive instances.
• "Alterations" and "Insurance" have F1-scores of 1.00, indicating strong
performance for these classes.
4.4.3 Inference:
Accuracy: The model achieves an overall accuracy of approximately
83%, indicating that it correctly predicts the lease agreement categories for
the majority of instances in the dataset.
Precision: Precision measures how many of the predicted positive
cases were actually positive. For most categories, the precision is
relatively high, ranging from 50% to 100%. For example, in the
"Alterations" category, the precision is 100%, indicating that when the
model predicts "Alterations," it is almost always correct.
Recall: Recall measures how many of the actual positive cases were
correctly predicted as positive. Similar to precision, recall scores are
generally high across categories, ranging from 69% to 100%. In the "Area"
category, the recall is 100%, suggesting that the model effectively
identifies instances belonging to this category.
F1-Score: The F1-score is the harmonic mean of precision and recall,
providing a balanced measure of a model's performance. The F1-scores
are strong, with values between 0.67 and 1.00, indicating that the model
performs well in terms of both precision and recall.
Confusion Matrix: The confusion matrix provides a detailed breakdown
of true positives, true negatives, false positives, and false negatives for
each category. It allows for a more granular assessment of model
performance.
CHAPTER V
ANALYSIS REPORT
5.1 ANALYSIS REPORTS AND INFERENCES
This chapter explains the reports and screens generated as part of the
project.
Fig. 5.1 Report for Logistic Regression
Fig. 5.1 depicts the classification report and confusion matrix for Logistic Regression,
which gives good accuracy along with good precision and recall values.
5.1.2 REPORTS FOR SVM ALGORITHM
Fig. 5.2 depicts the classification report and confusion matrix for SVM,
which gives good accuracy along with good precision and recall values.
5.1.3 REPORTS FOR SGD CLASSIFIER
Fig. 5.3 depicts the classification report and confusion matrix of the SGD classifier,
which gives good accuracy along with good precision and recall values.
Fig. 5.4 depicts the classification report and confusion matrix of Naïve Bayes for the test data,
which gives good accuracy along with good precision and recall values.
Fig. 5.5 depicts the classification report and confusion matrix of MultinomialNB,
which gives good accuracy along with good precision and recall values.
5.2 Inference
For recall-focused tasks (capturing all positives), SVC or Bagging Classifier may
be chosen.
Decision Tree Classifier, while having lower accuracy, might be suitable for cases
where interpretability is essential, as decision trees are inherently interpretable.
CHAPTER VI
CONCLUSION
6.1 CONCLUSION
In conclusion, the primary goal of this project was to develop and
deploy machine learning algorithms capable of effectively classifying lease
agreements, thereby streamlining the process of document management and
decision-making in lease-related matters. The ultimate objective was to strike a
balance between minimizing false negatives, ensuring all genuine lease
agreements are correctly identified, and minimizing false positives, preventing the
misclassification of non-lease documents.
Achieving this balance was a challenging task, given the inherent trade-off
between precision and recall. In light of the project's focus on improving the
accuracy of lease agreement classification, special emphasis was placed on
reducing false positives. Extensive efforts were invested in fine-tuning
hyperparameters and optimizing the model's performance while maintaining
computational efficiency.
In essence, the success of this project has provided a valuable tool for lease
management professionals and organizations. By harnessing the power of machine
learning algorithms, they can now classify lease agreements more accurately,
enabling efficient document management and informed decision-making. This
project represents a significant step toward enhancing lease document processing
through data-driven insights and precise classification.
6.2 BIBLIOGRAPHY
• Natural Language Processing - Overview - GeeksforGeeks
• NLTK :: Natural Language Toolkit
• Natural Language Processing With Python's NLTK Package – Real Python
• NLP using NLTK Library | NLTK Library for Natural Language Processing
(analyticsvidhya.com)
• Installing scikit-learn — scikit-learn 1.3.1 documentation
• scikit-learn · PyPI
• NLP Preprocessing Steps in Easy Way - Analytics Vidhya