
LEASE AGREEMENT CLASSIFICATION

Narayanaswamy R
2033022

DISSERTATION SUBMITTED IN PARTIAL FULFILMENT OF THE


REQUIREMENTS FOR THE DEGREE OF
M.Sc. Decision and Computing Sciences
OF ANNA UNIVERSITY

November 2023

DEPARTMENT OF COMPUTING
COIMBATORE INSTITUTE OF TECHNOLOGY
(Autonomous Institution affiliated to Anna University)
COIMBATORE – 641014
COIMBATORE INSTITUTE OF TECHNOLOGY
(Autonomous Institution affiliated to Anna University)
COIMBATORE 641014

(Bonafide Certificate)

Project Work - I
Seventh Semester

Lease Agreement Classification

Bonafide record of work done by


Narayanaswamy R
Register No: 2033022

Submitted in partial fulfilment of the requirements for the degree of


M.Sc (Decision and Computing Sciences) of Anna University

December 2022

____________ ________________
Faculty Guide Head of the
Department

Submitted for the viva-voce held on _________________

__________________ _________________
Internal Examiner External Examiner
CONTENTS

CHAPTER PAGE NO
ACKNOWLEDGEMENT i

SYNOPSIS ii
PREFACE iii
I INTRODUCTION 1
1.1 ORGANIZATION PROFILE 1
1.2 PROBLEM STATEMENT 1
1.3 DESCRIPTIVE STATISTICAL SUMMARY 3
1.4 OVERVIEW OF PREDICTIVE ANALYSIS 4
1.5 INFERENCES SUMMARY 5

II DATA MODELING AND EXPLORATION 6

2.1 PROBLEM ANALYSIS 6


2.2 DATA MODEL 7
2.3 EXPLORATORY DATA ANALYSIS 8

III PREDICTIVE ANALYTICS PROCESS 11

3.1 PREDICTIVE ANALYTICS MODEL 13


3.2 TOOLS DESCRIPTION 17
3.3 IMPLEMENTATION USING TOOL 20

IV ANALYTICAL MODEL EVALUATION 22

4.1 PERFORMANCE MEASURES 22


4.2 HYPOTHESIS TESTING / CONFUSION MATRIX 23

V ANALYSIS REPORTS AND INFERENCES 29

5.1 REPORTS / VISUAL FORMATS 30

VI CONCLUSION 37

6.1 BIBLIOGRAPHY 38
ACKNOWLEDGEMENT

Apart from my efforts, the success of any project depends largely on the
encouragement and guidelines of many others. I take this opportunity to express my
gratitude to the people who have been instrumental in the successful completion of this
project.

I respect and thank Dr. A. RAJESWARI, Principal, Coimbatore Institute of


Technology, for permitting me to undertake this project work at Crocus Technology
Private Limited, Bangalore.

I express my sincere gratitude to Dr. K. SAKTHI MALA, Dean of Computing,


Coimbatore Institute of Technology, Coimbatore, for her encouragement throughout this
project.

I express my sincere gratitude to Dr. A. KANNAMMAL, Head, Department of


Computing (Decision and Computing Sciences), Coimbatore Institute of Technology,
Coimbatore, for her encouragement throughout this project.

I am indebted to my internal guide, Dr. V. Savithri, Assistant Professor,


Department of Computing (Decision and Computing Sciences), Coimbatore Institute of
Technology, Coimbatore, for her constant support and guidance throughout the project
work.

I express my deep sense of gratitude to Mr. Kiran Kandaswamy Ram Kumar,


Manager, Mobius Knowledge Services Private Limited, Chennai, for his invaluable
guidance, support and suggestions throughout the course of this project work.

I finally express my utmost gratitude to the Almighty, my parents, my mentors and all of
the team members for their help and support. They have been my motivators through
thick and thin.
SYNOPSIS

The project titled "Lease Agreement Classification" aims to develop an automated
system for categorizing lease agreement documents into predefined types. The
project's primary objective is to leverage a combination of machine learning algorithms
and ensemble techniques to interpret the varied content and structure of lease
agreements and provide accurate predictions of the category to which each document belongs.

The project addresses the challenge of organizing lease documents that differ widely
in clauses, terminology, and layout. Correct categorization depends on a range of textual
cues, including specific clauses, legal terms, and property details, making classification a
challenging task. The proposed system will be of significant interest to real estate
professionals, legal teams, and property managers seeking efficient document management
and retrieval. By providing precise predictions, the system can streamline document
organization, support compliance monitoring, and improve decision-making in lease-related
matters. Tasks include data collection, preprocessing, feature engineering, and implementing
multiple machine learning models. Ensemble techniques will be developed to improve
predictive accuracy. Model evaluation will employ metrics like accuracy, precision, recall,
and F1-score. Interpretability analysis will examine feature importance, providing insight
into which terms drive the classification.

Python and libraries like NumPy, Pandas, Scikit-Learn, and XGBoost will be used.
Exploratory Data Analysis (EDA) will uncover dataset characteristics. Feature
importance analysis will identify key variables. Model selection and evaluation with
cross-validation and hyperparameter tuning will ensure robust predictions. Inferences aim
to reveal influential factors and showcase the effectiveness of ensemble techniques. This
project provides valuable, data-driven insights for stakeholders in the real estate and legal
domains.
PREFACE

CHAPTER I - INTRODUCTION gives an introduction of the organization for which the
system was developed and also describes the objective and scope of the proposed system.

CHAPTER II - DATA MODELING AND EXPLORATION gives a detailed description of
the data set used for analytics and the various models and techniques used, a comparison of
those techniques and the proposed technique.

CHAPTER III - PREDICTIVE ANALYTICS PROCESS gives a detailed description of the
process flow, the tools, packages and libraries used for building the solution, and how the
project is implemented.

CHAPTER IV - ANALYTICAL MODEL EVALUATION gives a detailed description of
the performance measures used in the project.

CHAPTER V - ANALYSIS REPORTS AND INFERENCES gives a detailed description of
the reports and visual formats.

CHAPTER VI - CONCLUSION summarizes the outcomes of the project and presents the
bibliography.
CHAPTER I

INTRODUCTION

This section gives a detailed description of the organization for which the model is
developed along with the explanation of the problem definition, goals and scope of the
proposed model. This section also gives a descriptive summary of the data and specifies the
methods, techniques and tools used in the development of the model and finally concludes
with an inference.

1.1 ORGANIZATION PROFILE


Mobius Knowledge Services is a leading data solutions partner that offers cutting-edge
platforms, products, and software solutions to some of the leading Fortune 2000 companies,
backed by almost two decades of data experience and a strong technology backbone. The
company employs big data technologies, robotic process automation, artificial intelligence,
and self-learning bots to deliver smart data solutions that have real-time applications on
cloud and on-premises support systems. Mobius has helped several leading organizations
from various industries worldwide.

1.2 PROBLEM STATEMENT


Develop an automated lease agreement classification system that can accurately categorize
incoming lease documents into predefined types, enabling efficient document management and
retrieval for various stakeholders, such as real estate professionals, legal teams, and property
managers.

1.2.1 OBJECTIVE
The primary objective of this project is to create an accurate and automated lease agreement
classification system that can categorize lease documents into predefined types or classes, thereby
streamlining document organization, retrieval, and decision-making processes in the real estate and
legal industries.

1.2.2 SCOPE:
1. Data Aggregation and Cleansing: The project involves automated data processing and cleansing to
aggregate lease agreement data from multiple sources, ensuring consistency and accuracy.
2. Collaborative Decision Support: Collaboration features will be integrated to facilitate engagement
among technical and analytical stakeholders. This transparency will demystify the predictive models,
promoting trust in the decision-making process.
3. Access Control and Privacy: Regulatory-compliant access controls will be implemented to manage
data accessibility. Pseudonymization techniques will enable secure engagement with sensitive lease
agreement data, maintaining labels while allowing comprehensive analysis.
4. Decision Support Enhancement: Beyond the primary scope, potential extensions for real-time
decision support systems will be explored. Additionally, comprehensive documentation will be
provided for future reference and the continuous improvement of the lease agreement decision-
making process.
5. Real-time Data Integration: Explore the integration of real-time data sources to keep predictive
models up to date with the latest lease agreements and developments. This can be particularly
valuable when lease portfolios change rapidly.
6. Multi-Channel Data Analysis: Extend the scope to include the ability to analyse lease agreement
data from multiple channels. This comprehensive approach can offer a more holistic understanding of
the document categories.

1.2.3 Users
1. Real Estate Professionals: Real estate agents, property managers, and brokers can use the system to
quickly classify and manage lease agreements for different properties.
2. Property Owners: Owners of residential or commercial properties can use the system to organize
and categorize lease agreements for their properties.
3. Property Management Companies: Companies specializing in property management can automate
document management and ensure compliance with lease terms.
4. Archivists and Records Managers: Those responsible for maintaining an organization's document
archives can benefit from the system's ability to categorize and retrieve lease agreements efficiently.
5. Legal Tech Companies: Companies providing legal technology solutions can integrate lease
agreement classification as a feature in their platforms.
6. Model Explainability: Enhance the scope by focusing on model explainability techniques.
Understanding why the model makes specific predictions is crucial for building trust and explaining
results to stakeholders.
7. Continuous Feedback Loop: Implement a feedback loop where the model's predictions are
continuously compared to actual label outcomes. This allows for model refinement and adaptation
over time, improving prediction accuracy.

1.3 DESCRIPTIVE STATISTICAL SUMMARY


The descriptive statistical summary shows the statistics for the two columns in
the dataset: "Data" and "Labels." Table 1.3.1 presents a summary of key statistics for these
variables:
Count: This represents the number of non-null values in each column. For "Data," there are 1837
non-null values, and for "Labels," there are 1840 non-null values.
Unique: This represents the number of unique values in each column. For "Data," there are 1701
unique values, and for "Labels," there are 14 unique values.
Top: This represents the most frequently occurring value in each column. For "Data," the most
common value is "leas commence date 2020," and for "Labels," the most common value is "out of
scope."
Freq: This represents the frequency (count) of the most common value in each column. For
"Data," "leas commence date 2020" occurs 5 times, and for "Labels," "out of scope" occurs 440 times.
This summary provides an overview of the data distribution and the most common values within
each column, which is useful for understanding the characteristics of the lease agreement
classification dataset. The "Labels" column contains 14 unique categories or classes,
while the "Data" column contains the text passages associated with these labels.

Table 1.3.1 Descriptive Statistics of Dataset
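A summary of this kind can be produced directly with pandas. The sketch below is a minimal illustration, assuming the dataset has been loaded into a DataFrame with "Data" and "Labels" columns; the file name is hypothetical.

import pandas as pd

# Hypothetical file name; the actual dataset location may differ.
df = pd.read_csv("lease_agreements.csv")

# For text (object) columns, describe() reports count, unique, top and freq,
# matching the quantities shown in Table 1.3.1.
print(df[["Data", "Labels"]].describe(include="object"))

# Distribution of the 14 label categories.
print(df["Labels"].value_counts())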

1.4 OVERVIEW OF PREDICTIVE ANALYSIS

1.4.1 METHODOLOGY
The project embarks on the challenging task of Lease Agreement Classification, addressing a
complex issue deeply intertwined with real-world demands. In the realm of real estate and legal
matters, lease agreements vary widely in content, structure, and purpose. The accurate categorization
of these agreements is crucial for efficient document management and informed decision-making.
This project sets out to create a classification model capable of deciphering the intricate aspects of
lease agreements, enabling precise categorization for real-world applications.
The methodology employed in this project mirrors the pragmatic approach used in real-world
scenarios. It commences with comprehensive data collection, akin to the practices of real estate
management and legal professionals who gather a variety of lease agreements for analysis.
These datasets encompass a wide spectrum, including residential, commercial, and industrial
lease agreements. They may vary in terms of content, language, and structure, reflecting the diversity
of lease agreements encountered in practice. This data compilation reflects the modern need for
efficient document management in real estate and legal contexts, where digital platforms play a
crucial role in organizing and categorizing lease agreements.

1.4.2 METHODS AND TECHNIQUES IMPLEMENTED


To achieve the overarching goal of Lease Agreement Classification, the project begins with a
comprehensive data collection and integration process. Diverse datasets containing lease agreements
from various sources, including residential, commercial, and industrial leases, are meticulously
gathered and harmonized into a unified dataset. This integrated dataset serves as the foundation for
subsequent analysis and modelling.
The next phase involves a thorough Exploratory Data Analysis (EDA) to uncover underlying
patterns, identify potential outliers, and establish correlations among different attributes within the
lease agreements. This EDA phase plays a pivotal role in shaping the project's direction by providing
insights into the structure and content of the lease agreements.
The project leverages a diverse ensemble of machine learning models tailored for lease
agreement classification. These models combine Natural Language Processing (NLP) feature
extraction with classifiers such as Logistic Regression, Linear Discriminant Analysis, and
Naive Bayes, as well as more complex
models like Random Forest, Bagging, and Gradient Boosting. These models are meticulously trained
on the historical lease agreement data, utilizing their unique strengths to effectively interpret and

classify lease documents.
Ensemble techniques are employed to further enhance predictive accuracy. The project utilizes
advanced ensemble methods, such as stacking and voting, to combine predictions from individual
models. Hyperparameter optimization is conducted with care to fine-tune model performance and
achieve optimal classification results.
An exhaustive evaluation process follows, wherein models are rigorously assessed using
metrics such as accuracy, precision, recall, and F1-score. The comparative analysis of different
models and ensembles aids in selecting the most effective candidate for deployment in the lease
agreement classification system.

1.4.3 TOOLS USED


1.4.3.1 Python Programming
1.4.3.2 Scikit-learn
1.4.3.3 Matplotlib and Seaborn
1.4.3.4 Multivariate Ensembles
1.4.3.5 Data Pre-processing Tools
1.4.3.6 Colab
1.4.3.7 Excel

1.5 INFERENCES SUMMARY


"Lease Agreement Classifier: Harnessing Diverse Models for Accurate Categorization" employs
a diverse range of machine learning techniques, including natural language processing (NLP) models,
logistic regression, decision trees, and ensemble methods like random forests, bagging, and gradient
boosting. These models work in concert to unravel the complexities of lease agreement classification.
Collectively, they achieve a noteworthy performance, with an overall accuracy score of 0.84,
indicating their capability to accurately categorize lease agreements.
The precision, recall, and F1-score metrics highlight the robustness of the models, especially
in distinguishing between various lease agreement types. This project contributes valuable insights
and tools for efficiently categorizing lease agreements, offering enhanced document management and
decision support in the real estate and legal domains.

CHAPTER II
DATA MODELING AND EXPLORATION

2.1 PROBLEM ANALYSIS


2.1.1 PROBLEM UNDERSTANDING
Lease Agreement Classifier addresses the intricate challenge of classifying lease agreements
accurately. This challenge revolves around the multifaceted nature of lease documents, which can
vary significantly in content, structure, and purpose. The goal is to unravel the complex factors that
determine how lease agreements are categorized. These factors may include the presence of specific
clauses, legal terms, property types, and more, all of which contribute to the classification process.
Understanding these intricate interactions is vital for building precise predictive models
capable of categorizing lease agreements effectively. By delving into the nuances of lease agreement
content, this project aims to shed light on the underlying mechanisms that enhance document
management and decision-making processes in the real estate and legal sectors.

2.1.2 BUSINESS UNDERSTANDING


This project focuses on the significance of well-informed decision-making in the real estate
and legal domains. Accurate categorization of lease agreements holds paramount importance for
property managers, legal professionals, and organizations involved in lease-related decisions. By
comprehending the diverse factors influencing lease document categorization, businesses and legal
entities can enhance their document management and decision-making processes, ensuring
compliance and efficiency.
Furthermore, a deeper understanding of the intricacies of lease agreement classification can
extend its applications beyond real estate and law. It can be leveraged in industries such as document
management, compliance monitoring, and other fields where precise categorization of textual data is
essential. Ultimately, this project aims to bridge the gap between data-driven insights and decision-
making in the specific context of lease agreement management while also offering broader
applications in data classification and document organization.

2.1.3 FEATURE IDENTIFICATION


For Lease Agreement Classification, feature identification involves selecting and
defining relevant variables or attributes that can aid in the classification of lease agreements into
different categories. The categories relevant to this problem are described below:

Feature Identification for Lease Agreement Classification:
Alterations: Identifying clauses or terms related to any alterations or modifications allowed within the lease agreement.
Area: Identifying clauses that pertain to the specific area or location covered by the lease agreement.
Assignment/Sublet: Identifying clauses that address the conditions and terms regarding the assignment or subletting of the leased property.
Base rent: Identifying clauses that specify the base rental amount or the core financial terms of the lease.
Basic Information: Identifying sections containing fundamental information about the lease, such as parties involved, effective dates, and property details.
Estoppel: Recognizing clauses related to estoppel certificates, which are statements verifying the lease's terms and conditions.
Holdover: Identifying clauses that address situations where the tenant remains in possession of the property beyond the lease term.
Insurance: Identifying clauses specifying insurance requirements and provisions within the lease agreement.
Key Dates: Recognizing sections that include crucial dates such as lease commencement, expiration, and renewal dates.
Lease Year: Identifying clauses or sections relevant to the definition or calculation of lease years.
Maintenance Repairs: Recognizing terms related to maintenance responsibilities and repair obligations within the lease.
Premises Address: Identifying clauses that describe the physical location or address of the leased premises.
Renewal: Identifying clauses that outline renewal options, terms, and procedures.
Out of scope: Recognizing content that does not fall into any of the defined categories and may require further analysis.

2.2 DATA MODEL

2.2.1 DATA COLLECTION


The data collection process for Lease Agreement Classification in PDF documents focuses on
extracting essential lease agreement details. Initially, this involves parsing the PDF documents to
capture key elements within the agreements. These elements encompass core lease terms, clauses, and
metadata, allowing for a comprehensive understanding of the lease content and structure. This
extraction process helps create structured representations of lease agreements, enabling further
analysis and classification based on the content obtained. This method allows for the systematic
collection of pertinent lease agreement information, which can
then be utilized for classification purposes.
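One plausible way to implement this parsing step is with a PDF library such as pypdf; the snippet below is only a sketch of the idea, with a hypothetical folder name, not the project's exact extraction pipeline.

from pathlib import Path
from pypdf import PdfReader

def extract_text(pdf_path):
    """Concatenate the text of every page of a lease agreement PDF."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical folder containing the lease agreement PDFs.
documents = {path.name: extract_text(path) for path in Path("leases").glob("*.pdf")}
print(f"Extracted text from {len(documents)} documents")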

2.2.2 DATA PREPARATION
Data preparation is a vital step in the process of Lease Agreement Classification from PDF
documents. It involves a meticulous approach to cleaning and structuring the data to ensure its
accuracy and suitability for analysis. The process starts by addressing issues like missing information,
outliers, and inconsistencies within the lease agreements. To handle missing data, a variety of
imputation techniques are employed, which may include mean imputation or predictive modelling to
intelligently fill gaps. Feature engineering is another essential component, where variables are created
or transformed to capture relevant information for the predictive models. This includes encoding
categorical variables and scaling numerical attributes to maintain consistency. Additionally, data
integration efforts harmonize lease agreements from different sources, ensuring uniformity in terms of
data types and structures. This rigorous data preparation process serves as the foundation for
subsequent analyses, enabling accurate and meaningful insights into the dynamics of lease agreement
classification.
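A few of these cleaning steps can be expressed with pandas and scikit-learn. The sketch below uses assumed column names ("Data", "Labels") and a hypothetical file name, and is only one plausible rendering of the preparation described above.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("lease_agreements.csv")  # hypothetical file name

# Drop rows whose text is missing and normalise the remaining text.
df = df.dropna(subset=["Data"])
df["Data"] = df["Data"].str.strip().str.lower()

# Fill missing labels with an explicit placeholder category.
df["Labels"] = df["Labels"].fillna("out of scope")

# Encode the 14 text categories as integers for models that need numeric targets.
encoder = LabelEncoder()
df["label_id"] = encoder.fit_transform(df["Labels"])
print(df[["Data", "Labels", "label_id"]].head())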

2.3 EXPLORATORY DATA ANALYSIS


Exploratory Data Analysis is an approach to analysing data using visual techniques. It is used
to discover trends and patterns, or to check assumptions, with the help of statistical summaries and
graphical representations. It is crucial to understand the data in depth before running it through an
algorithm: one needs to know the patterns in the data, determine which variables are important and
which do not play a significant role in the output, identify correlations between variables, and
recognize errors in the data.
Exploratory Data Analysis (EDA) is a critical phase in preparing data for Lease Agreement
Classification. Although EDA for lease agreements may not involve the same visualizations as in
traditional datasets, it focuses on understanding the structure, content, and patterns within the
documents. EDA can be adapted for Lease Agreement Classification as follows:
2.3.1 Document Structure Analysis: Begin by examining the overall structure of lease agreements.
Identify common sections, headers, and formatting patterns. This helps in segmenting the documents
into meaningful sections, such as "Base rent," "Renewal," or "Insurance."
2.3.2 Text Extraction: Extract the text content from the PDF documents. This step involves parsing the
PDFs to retrieve the textual information, which can be further analysed.
2.3.3 Tokenization and Text Statistics: Tokenize the extracted text to break it into individual words,

phrases, or sentences. Calculate basic text statistics such as word frequency, sentence length, and
vocabulary size. This can provide insights into the complexity and readability of lease agreements.
2.3.4 Keyword Analysis: Identify relevant keywords or phrases within the documents that are indicative of
different lease agreement categories. For example, keywords like "renewal," "maintenance," or
"sublet" may be associated with specific sections.
2.3.5 Data Distribution: Analyse the distribution of lease agreements across different categories or labels.
Understand the balance or imbalance in the dataset, as this can impact model performance.
2.3.6 Visualization (Optional): Although EDA for textual data may not involve traditional charts and
graphs, word clouds can be created to visualize the most common terms in each category. This can
provide a qualitative understanding of document content.

Fig 2.3.1 Word cloud

Fig 2.3.2 Frequency of words
2.3.7 Text Pre-processing: Apply text pre-processing techniques such as text cleaning, stop-word
removal, and stemming/lemmatization to prepare the text for feature extraction.

Fig 2.3.3 Pre-processing
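Steps 2.3.3 and 2.3.7 can be carried out with NLTK. The sketch below tokenises a sample clause, removes stop words, lemmatises, and counts word frequencies; it assumes the listed NLTK corpora can be downloaded and is not the project's exact pre-processing code.

from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)  # punkt_tab is only required by newer NLTK versions

def preprocess(text):
    """Lower-case, tokenise, drop stop words and non-alphabetic tokens, lemmatise."""
    lemmatizer = WordNetLemmatizer()
    stops = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stops]

sample = "The base rent shall commence on the lease commencement date."
tokens = preprocess(sample)
print(tokens)
print(Counter(tokens).most_common(5))  # simple word-frequency statistic (cf. Fig 2.3.2)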

CHAPTER III
PREDICTIVE ANALYTICS PROCESS

3.1 PREDICTIVE ANALYTICS MODEL


This section offers an in-depth exploration of the dataset utilized for Lease
Agreement Classification, along with a detailed examination of the various
models and techniques employed in the project. Additionally, it includes a
comparative analysis of these techniques against the proposed methodology. The
project's objective is to develop a predictive model capable of classifying lease
agreements into distinct categories based on their content and structure. This
classification task falls under the domain of Supervised Machine Learning,
specifically as a text classification task. The goal is to enhance classification
accuracy using advanced techniques.
The analysis primarily pertains to the category of text classification, where
the aim is to assign lease agreements to specific categories (e.g., 'Base rent' or
'Renewal') based on their textual content. Throughout the project, we employ
various natural language processing (NLP) techniques, feature engineering
methods, and machine learning algorithms to achieve this objective. The process
flow outlines how lease agreements are processed, features are extracted, and
predictive models are trained. Tools, packages, and libraries commonly used in
NLP and machine learning are harnessed to implement the solution effectively.
This project's successful implementation ensures that lease agreements are
accurately classified, facilitating efficient document management and decision-
making in various real estate contexts.

ANALYSIS MODEL

3.1.1 LOGISTIC REGRESSION


Logistic Regression is a fundamental machine learning algorithm
frequently employed in classification tasks, including Lease Agreement

Classification. Despite its name, it's primarily used for binary classification, where
the goal is to predict one of two possible outcomes. In the context of Lease
Agreement Classification, it can be adapted to categorize lease agreements into
predefined classes or categories based on their textual content.
Logistic Regression operates by modelling the probability that a given
lease agreement belongs to a specific category. It accomplishes this by fitting a
logistic function to the input data, which maps input features to a probability
score. This score is then thresholded to make binary predictions. The algorithm is
known for its simplicity and interpretability, making it a valuable choice when
insights into feature importance are essential.
In Lease Agreement Classification, Logistic Regression can effectively
capture linear relationships between features and document categories. It evaluates
textual elements, keywords, and structural cues within lease agreements to make
informed decisions about their categorization. Additionally, Logistic Regression's
regularization techniques, such as L1 and L2 regularization, help prevent
overfitting and enhance model generalization.
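Before Logistic Regression can be applied, the text must be converted to numeric features. A common arrangement is a TF-IDF vectoriser feeding the classifier; the sketch below assumes a prepared DataFrame df with "Data" and "Labels" columns and uses illustrative hyperparameters, not necessarily those used in the project.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["Data"], df["Labels"], test_size=0.2, random_state=42, stratify=df["Labels"])

# TF-IDF turns each clause into a sparse feature vector; C controls the
# strength of the L2 regularisation mentioned above.
log_reg = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000, C=1.0)),
])
log_reg.fit(X_train, y_train)
print("accuracy:", log_reg.score(X_test, y_test))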

3.1.2 RANDOM FOREST ALGORITHM


A Random Forest is an ensemble machine learning technique that is
exceptionally powerful for various classification and regression tasks, including
Lease Agreement Classification. It operates by constructing a multitude of
decision tree classifiers during the training phase and then combines their outputs
to make robust predictions. Each decision tree in the forest is trained on a different
subset of the dataset and makes independent predictions. By aggregating these
predictions, Random Forest mitigates the risk of overfitting and enhances the
model's overall accuracy and robustness.
In the context of Lease Agreement Classification, Random Forest can
effectively handle the complexity and variability of lease agreements. It evaluates
different textual features, such as keywords, clauses, and document structure, to
make informed decisions about which category a lease agreement belongs to.
Random Forest's ability to capture intricate patterns and relationships within the

data makes it well-suited for this task.
Furthermore, Random Forest provides valuable insights into feature
importance, helping users understand which aspects of lease agreements are most
influential in the classification process. This information can guide data pre-
processing and feature engineering efforts, ultimately improving the model's
performance. With its adaptability, interpretability, and excellent predictive
capabilities, Random Forest stands as a robust choice for Lease Agreement
Classification, contributing to more accurate and efficient document
categorization and management.

Fig 3.1.2 Random Forest Architecture


Fig 3.1.2 is the diagram that explains the working of Random Forest Algorithm.
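A minimal sketch of a Random Forest over TF-IDF features, including the feature-importance inspection mentioned above, might look as follows (df is the prepared DataFrame assumed in the earlier sketches).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2)
X = vectorizer.fit_transform(df["Data"])

forest = RandomForestClassifier(n_estimators=300, random_state=42)
forest.fit(X, df["Labels"])

# Impurity-based importances point at the terms the forest relies on most.
terms = np.array(vectorizer.get_feature_names_out())
top = np.argsort(forest.feature_importances_)[::-1][:15]
print(list(zip(terms[top], forest.feature_importances_[top].round(4))))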

3.1.3 Stochastic Gradient Descent (SGD)


Stochastic Gradient Descent (SGD) is a widely employed optimization
algorithm in machine learning, particularly for training models, including those
used in text classification tasks like Lease Agreement Classification. SGD is
particularly suitable for large datasets and high-dimensional feature spaces,
making it a valuable tool in NLP (Natural Language Processing) tasks.

In the context of Lease Agreement Classification, SGD is utilized as an
optimization technique to train machine learning models efficiently. It works by
iteratively updating the model's parameters in a way that minimizes a predefined
loss function, ultimately leading to the best possible model fit. What sets SGD
apart is its "stochastic" nature, meaning that it optimizes the model using random
subsets of the training data (mini-batches) rather than the entire dataset. This not
only accelerates training but also introduces a level of randomness that can help
escape local minima in the optimization process.
SGD's adaptability and speed make it a powerful choice for text
classification tasks. During the training phase, the algorithm adjusts the model's
weights to better align with the textual features extracted from lease agreements.
This optimization process continues until a satisfactory model is achieved, capable
of accurately classifying lease agreements into their respective categories.
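scikit-learn exposes this optimiser through SGDClassifier, which trains a linear model with stochastic updates; with the default hinge loss it behaves like a linear SVM. The configuration below is illustrative only and reuses the train/test split from the Logistic Regression sketch.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

sgd_model = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2)),
    # alpha is the regularisation strength; max_iter bounds the number of passes over the data.
    ("clf", SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=42)),
])
sgd_model.fit(X_train, y_train)
print("accuracy:", sgd_model.score(X_test, y_test))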

3.1.4 GRADIENT BOOSTING ALGORITHM


Gradient boosting algorithm is one of the most powerful algorithms in
the field of machine learning. As known, that the errors in machine learning
algorithms are broadly classified into two categories i.e. Bias Error and Variance
Error. As gradient boosting is one of the boosting algorithms it is used to
minimize bias error of the model.
Unlike the AdaBoost algorithm, the base estimator in the gradient boosting
algorithm cannot be chosen freely; it is fixed as a shallow regression tree. Like AdaBoost, one
can tune the n_estimators parameter of the gradient boosting algorithm. However, if the value of
n_estimators is not mentioned, the default value for this algorithm is
100. Gradient boosting algorithms can be used for predicting not only continuous
target variables (as a Regressor) but also categorical target variables (as a
Classifier). When it is used as a regressor, the cost function is Mean Square Error
(MSE) and when it is used as a classifier then the cost function is Log loss.
Gradient boosting classifiers build the ensemble in a stage-wise fashion: each new
weak learner is fitted to correct the errors made by the learners added so
far. The objective of Gradient Boosting classifiers is to minimize the
loss, or the difference between the actual class value of the training example and
the predicted class value. It isn't required to understand the process for reducing
the classifier's loss, but it operates similarly to gradient descent in a neural
network. In the case of Gradient Boosting Machines, every time a new weak
learner is added to the model, the weights of the previous learners are frozen or
cemented in place, left unchanged as the new layers are introduced. This is
distinct from the approaches used in AdaBoosting where the values are adjusted
when new learners are added. The power of gradient boosting machines comes
from the fact that it can be used on more than binary classification problems, also
can be used on multi-class classification problems and even regression problems.
Gradient boosting systems have two other necessary parts: a weak learner
and an additive component. Gradient boosting systems use decision trees as their
weak learners. Regression trees are used for the weak learners, and these
regression trees output real values. Because the outputs are real values, as new
learners are added into the model the output of the regression trees can be added
together to correct for errors in the predictions. The additive component of a
gradient boosting model comes from the fact that trees are added to the model
over time, and when this occurs the existing trees aren't manipulated, their values
remain fixed. Gradient boosting models can perform incredibly well on very
complex datasets, but they are also prone to overfitting.

Fig 3.1.4 Gradient Boosting Architecture

Fig 3.1.4 is the diagram that explains the workflow of the Gradient Boosting
Algorithm.
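A hedged sketch of a gradient boosting classifier for this task, again over TF-IDF features and with illustrative hyperparameters, is shown below (splits reused from the earlier sketches).

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

gb_model = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2)),
    # n_estimators defaults to 100 as noted above; learning_rate scales the
    # contribution of each successive tree.
    ("clf", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)),
])
gb_model.fit(X_train, y_train)
print("accuracy:", gb_model.score(X_test, y_test))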

3.1.5 Support Vector Machines (SVMs)


Support Vector Machines (SVMs) are powerful and versatile machine
learning algorithms used extensively in various classification and regression tasks,
and they have applications in Lease Agreement Classification as well. SVMs are
particularly effective when dealing with complex datasets and high-dimensional
feature spaces.
In the context of Lease Agreement Classification, SVMs work by
identifying an optimal hyperplane that best separates lease agreements into
different predefined categories. The "support vectors" are data points closest to the
decision boundary, and the algorithm aims to maximize the margin between these
support vectors and the hyperplane. This margin maximization leads to robust
classification results, as it focuses on capturing the most critical information for
distinguishing between different lease agreement categories.

One of the key advantages of SVMs is their ability to handle both linear
and nonlinear data by using appropriate kernel functions, such as the radial basis
function (RBF) kernel. This flexibility allows SVMs to capture intricate
relationships within lease agreement text, making them well-suited for the task.
SVMs are known for their capacity to perform effectively in high-
dimensional feature spaces, which is essential when dealing with the multifaceted
content of lease agreements. Additionally, SVMs provide a clear separation of
categories and are less prone to overfitting, ensuring reliable classification results.
In summary, Support Vector Machines offer a robust and adaptable
approach to Lease Agreement Classification. They excel at handling complex
textual data, providing accurate categorization, and supporting various kernel
functions to capture intricate patterns within lease agreements.
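A minimal SVM sketch for this task is given below; LinearSVC is the usual practical choice for high-dimensional TF-IDF text, while SVC(kernel="rbf") would enable the nonlinear RBF kernel discussed above. Hyperparameters are illustrative, not the project's settings.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

svm_model = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=2)),
    ("clf", LinearSVC(C=1.0)),  # swap in SVC(kernel="rbf") for a nonlinear decision boundary
])
svm_model.fit(X_train, y_train)
print("accuracy:", svm_model.score(X_test, y_test))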

3.2 TOOLS DESCRIPTION

Fig 3.2.1 Tools Used

The technologies and tools that are used in the project and a brief
description about each of them are discussed in this section.

3.3.1 PYTHON LANGUAGE


Python is a versatile and widely used programming language renowned for
its simplicity, readability, and extensive ecosystem of libraries and frameworks. It
has gained immense popularity in various domains, including web development,
data science, machine learning, and automation. Python's straightforward syntax
emphasizes code clarity, making it accessible to both novice and experienced
programmers. This readability is further enhanced by its significant use of
whitespace for code structuring, encouraging clean and organized code. Python's
comprehensive standard library provides a wealth of modules and functions for
various tasks, reducing the need for reinventing the wheel and enabling developers
to focus on problem-solving rather than low-level details.
Python's prominence in data science and machine learning is driven by
libraries like NumPy, pandas, and scikit-learn, which simplify data manipulation,
analysis, and model building. Its role in web development is facilitated by
frameworks such as Django and Flask, which streamline the creation of dynamic
web applications. Python's versatility extends to automation and scripting, as it
excels at simplifying repetitive tasks through scripting and scripting automation.
Additionally, Python's active community and extensive documentation ensure that

developers have access to a wealth of resources and support, making it an
excellent choice for a wide range of projects.

3.3.2 Colab
Google Colab, short for Google Colaboratory, is a cloud-based platform that
offers a collaborative and interactive environment for developing and running
Python code. It has gained immense popularity among data scientists, machine
learning engineers, and researchers due to its ease of use, free access to GPU
resources, and seamless integration with Google Drive. Colab provides a Jupyter
Notebook-like interface, making it convenient for users to create, edit, and execute
Python code in a notebook format. This format enables the combination of code,
documentation, and visualizations in a single document, making it ideal for data
analysis, machine learning experiments, and collaborative research projects.
One of Colab's standout features is its provision of free GPU and TPU (Tensor
Processing Unit) resources. This capability allows users to accelerate
computationally intensive tasks, such as training deep learning models, without
the need for expensive hardware. Additionally, Colab's integration with Google
Drive simplifies data management and sharing. Users can easily access datasets
and files stored in their Google Drive and share their Colab notebooks with
collaborators. These collaborative features make Google Colab a valuable tool for
both individuals and teams working on data-driven projects, research, and
development tasks in various fields.

3.3.3 NLP LIBRARIES (NLTK AND SPACY)
Natural Language Processing (NLP) libraries are essential tools for working
with human language data and enabling machines to understand, process, and
generate human-like text. Among the most prominent NLP libraries, NLTK
(Natural Language Toolkit) is widely recognized for its extensive collection of
text processing libraries and corpora, making it a valuable resource for NLP
research and development. NLTK provides tools for tokenization, stemming, part-
of-speech tagging, named entity recognition, sentiment analysis, and more. Its
user-friendly interface and detailed documentation make it an excellent choice for

educational purposes and NLP projects ranging from text analysis to machine
learning applications.
Another powerful NLP library is spaCy, known for its speed and efficiency in
handling large-scale text processing tasks. spaCy offers pre-trained models for
various languages, enabling users to perform tasks like entity recognition,
dependency parsing, and text classification with ease. Its API is designed for
production use, making it a preferred choice for building NLP applications and
integrating NLP capabilities into software systems. spaCy's focus on performance
and accuracy has made it a popular choice among developers and researchers
looking to leverage NLP capabilities for real-world applications. Both NLTK and
spaCy, along with other NLP libraries, play pivotal roles in advancing the field of
natural language processing and enabling a wide range of language-related tasks
in machine learning, text analysis, and information retrieval.

3.3.4 PANDAS
Pandas is a widely-used Python library for data manipulation and analysis. It
provides an easy-to-use and highly flexible data structure known as a DataFrame,
which is akin to a spreadsheet or database table. With Pandas, users can efficiently
load, clean, transform, and analyze data from various sources, making it an
indispensable tool for data scientists, analysts, and researchers. Pandas simplifies
data exploration by offering a wide range of functions and methods for tasks such
as data indexing and selection, grouping, aggregation, and time series
manipulation. Its seamless integration with other Python libraries, like NumPy and
Matplotlib, allows for comprehensive data analysis and visualization. Pandas'
intuitive and powerful data processing capabilities make it an essential part of the
data science toolkit.

One of Pandas' standout features is its ability to handle missing data


effectively. It provides functions for identifying and handling missing values,
enabling users to perform data imputation, removal, or interpolation as needed.
Additionally, Pandas supports data merging and joining operations, making it

possible to combine multiple datasets based on common keys or indexes. This
feature is invaluable for integrating data from different sources and performing
complex data transformations. Whether you're working on data cleaning,
exploration, or complex data analysis tasks, Pandas remains a versatile and
indispensable library for efficiently managing and analyzing tabular data in
Python.
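A small illustration of the missing-data handling and merging described above, using hypothetical file and column names:

import pandas as pd

leases = pd.read_csv("leases.csv")    # hypothetical files and columns
owners = pd.read_csv("owners.csv")

# Handle missing data: drop rows without text, fill a missing numeric column.
leases = leases.dropna(subset=["Data"])
leases["monthly_rent"] = leases["monthly_rent"].fillna(leases["monthly_rent"].median())

# Merge the two sources on a shared key.
combined = leases.merge(owners, on="property_id", how="left")
print(combined.head())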

3.3.5 Excel
Excel is a widely used spreadsheet application developed by Microsoft,
renowned for its versatility and ease of use in managing and analyzing data. It
offers a grid-like interface where users can input, organize, and manipulate data in
rows and columns. Excel provides a plethora of functions and formulas for
performing calculations, statistical analysis, and data visualization. Its user-
friendly features, such as drag-and-drop functionality and cell formatting options,
make it accessible to a broad range of users, from students and professionals to
data analysts and financial experts.
One of Excel's core strengths is its ability to create visually appealing and
informative charts and graphs, facilitating data visualization and presentation.
Users can choose from a variety of chart types, including bar charts, pie charts,
and line graphs, to represent data in a way that best conveys insights and trends.
Additionally, Excel supports the creation of pivot tables, which enable users to
summarize and explore large datasets efficiently. Excel's extensive functionality,
coupled with its widespread availability in both personal and professional settings,
makes it a go-to tool for tasks like budgeting, financial analysis, project
management, and data reporting.

3.4 IMPLEMENTATION USING TOOL


3.4.1 Data Preparation: Import your dataset into Google Colab. You can
upload it from your local machine or import it from various online sources.
Use pandas to read and pre-process the dataset. Perform tasks such as
data cleaning, handling missing values, encoding categorical variables,
and scaling/normalizing features as needed.

3.4.2 Data Splitting: Split your dataset into training, validation, and test
sets using scikit-learn's train_test_split function.

3.4.3 Model Selection: Choose the machine learning model you want to
use for your task. Import the relevant model class from scikit-learn, e.g.,
from sklearn.ensemble import RandomForestClassifier for a Random
Forest Classifier.
3.4.4 Hyperparameter Tuning: Utilize scikit-learn's GridSearchCV to
perform hyperparameter tuning and find the best combination of
hyperparameters for your model.

3.4.5Model Training: Train your selected model on the training data using
the best hyperparameters found during tuning.

3.4.6 Model Evaluation: Evaluate your model's performance on the


validation dataset using appropriate metrics such as accuracy, precision,
recall, F1-score, or custom evaluation functions.

3.4.7 Final Model Training and Testing: Once satisfied with your model's
performance, train it on the entire training dataset (including the validation
set) and evaluate it on the test set to get a final assessment of its
performance.

3.4.8 Results and Visualization: Analyze and visualize the results using
libraries like matplotlib and seaborn. You can create various plots and
visualizations to interpret the model's predictions and insights.

3.4.9 Exporting Results: Export the model, its predictions, and relevant
results to Excel or other formats if needed. You can use libraries like
pandas to save data frames as Excel files.
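Strung together, steps 3.4.1 to 3.4.9 amount to something like the sketch below. It assumes the "Data"/"Labels" column names used earlier and picks a Random Forest as the selected model; it is an outline of the workflow rather than a transcript of the project notebook.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

df = pd.read_csv("lease_agreements.csv")              # 3.4.1 data preparation (hypothetical path)

X_train, X_test, y_train, y_test = train_test_split(  # 3.4.2 data splitting
    df["Data"], df["Labels"], test_size=0.2, random_state=42)

pipeline = Pipeline([                                  # 3.4.3 model selection
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

param_grid = {"clf__n_estimators": [100, 300],         # 3.4.4 hyperparameter tuning
              "clf__max_depth": [None, 20]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)                           # 3.4.5 model training

y_pred = search.best_estimator_.predict(X_test)        # 3.4.6 / 3.4.7 evaluation on held-out data
print(classification_report(y_test, y_pred))           # 3.4.8 results

pd.DataFrame({"text": list(X_test), "predicted": y_pred}).to_excel(  # 3.4.9 export (needs openpyxl)
    "predictions.xlsx", index=False)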

CHAPTER IV
ANALYTICAL MODEL EVALUATION

4.1 PERFORMANCE MEASURES


Performance measures in machine learning are critical tools used to
evaluate and quantify the effectiveness of predictive models. These measures
provide insights into how well a model is performing and help in making
informed decisions about model selection, fine-tuning, and deployment. Some
commonly used performance measures include accuracy, precision, recall, F1-
score, and the area under the receiver operating characteristic curve (AUC-ROC).
Accuracy is perhaps the most straightforward performance measure, indicating
the proportion of correctly predicted instances out of the total. Precision measures
the ratio of true positive predictions to the total positive predictions, emphasizing
the model's ability to avoid false positives. Recall, on the other hand, quantifies
the model's ability to capture all positive instances by calculating the ratio of true
positives to the total actual positives. The F1-score is a harmonized metric that
combines both precision and recall, providing a balanced measure of a model's
overall performance. Finally, the AUC-ROC score assesses a model's ability to
distinguish between classes in a binary classification task, with a higher score
indicating better discrimination.
These performance measures play a crucial role in assessing and fine-tuning
machine learning models, ensuring that they meet the desired objectives and
requirements. By understanding these metrics, data scientists and practitioners can
make informed decisions about model selection, parameter tuning, and threshold
adjustments, ultimately leading to the development of more accurate and effective
machine learning models.

4.1.1 MODEL RESULT

The dataset is initially split into two parts: 80% for training
and 20% for validation. Then, the 80% training portion is further
divided into 80% for training and 20% for testing.
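That two-stage split can be written as two calls to scikit-learn's train_test_split; a minimal sketch, assuming the df prepared earlier:

from sklearn.model_selection import train_test_split

# 80% for model development, 20% held out for validation.
X_dev, X_val, y_dev, y_val = train_test_split(
    df["Data"], df["Labels"], test_size=0.2, random_state=42)

# The development portion is split again: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.2, random_state=42)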

4.2 HYPOTHESIS TESTING/CONFUSION MATRIX


With regard to the evaluation of the models, it is worth considering Precision, Recall
and F1 Score as evaluation metrics, for the following reasons:

Precision will give us the proportion of positive identifications that were indeed
correct.

Recall will determine the proportion of real positives that were correctly
identified. Recall is a performance metric in machine learning and statistics that
measures the ability of a model to correctly identify all relevant instances from a
dataset. It quantifies the proportion of true positive predictions (correctly
identified positive cases) out of all actual positive instances in the dataset

F1 score is a single-value metric in machine learning that combines precision


and recall into a single score. It provides a balanced measure of a model's overall
performance, considering both its ability to make accurate positive predictions
(precision) and its ability to capture all positive instances (recall). The F1 score is
particularly useful in situations where there is an uneven class distribution or
when both false positives and false negatives have significant consequences. It is

calculated as the harmonic mean of precision and recall, providing a single value
that balances the trade-off between precision and recall.

4.2.1 CONFUSION MATRIX


A confusion matrix is a fundamental tool in the evaluation of classification
models, such as those used in machine learning. It provides a tabular
representation of the model's predictions compared to the actual ground truth
values for a dataset. The matrix is particularly useful in understanding the
performance of a model, especially in binary classification problems where there
are two possible outcomes, typically referred to as "positive" and "negative."
In a binary classification confusion matrix, there are four key components:
4.2.1.1. True Positives (TP): These are cases where the model correctly
predicted the positive class.
4.2.1.2. True Negatives (TN): These are cases where the model correctly
predicted the negative class.
4.2.1.3. False Positives (FP): These are cases where the model incorrectly
predicted the positive class when the actual class was negative. Also known as
Type I errors.
4.2.1.4. False Negatives (FN): These are cases where the model incorrectly
predicted the negative class when the actual class was positive. Also known as
Type II errors.
By analysing the values in the confusion matrix, you can compute various
performance metrics such as accuracy, precision, recall, and the F1-score, which
provide insights into how well the model is performing and how it is making
errors. The confusion matrix is a valuable tool for assessing the strengths and
weaknesses of a classification model and for fine-tuning it to improve its
performance.
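The matrix and the per-class metrics built from it can be obtained directly from scikit-learn; a sketch, assuming y_test and y_pred are available from one of the trained models:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix

# Rows are the actual labels, columns are the predicted labels.
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, F1-score and support.
print(classification_report(y_test, y_pred))

# Optional heat-map rendering of the matrix.
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, xticks_rotation=90)
plt.tight_layout()
plt.show()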

Confusion Matrix for Logistic Regression Model:

Fig 4.2.1 Confusion Matrix for Logistic Regression Model

The provided confusion matrix represents the performance metrics for a


classification model. Let's break down the key components:

The rows in the confusion matrix represent the actual classes or labels. The
columns represent the predicted classes made by the model. Each cell in the
matrix contains the count of instances falling into a specific combination of actual
and predicted classes. Here's how to interpret the key metrics from the confusion
matrix:
Precision: It measures the model's ability to make accurate positive
predictions. In this context, precision is high for most classes, indicating that when
the model predicts a certain class, it is often correct. For example, for
"Alterations," "Insurance," and "Lease Year," the precision is 1.00, meaning that
the model rarely makes false positive predictions for these classes.
Recall: Recall quantifies the model's ability to capture all positive
instances correctly. Similar to precision, recall is high for many classes, indicating
that the model effectively identifies positive cases. For instance, the recall for
"Area," "Basic Information," and "Insurance" is perfect (1.00), suggesting that the
model rarely misses these positive instances.
F1-Score: The F1-score is a harmonic mean of precision and recall,
providing a balanced measure of a model's overall performance. High F1-scores
indicate models that are both precise and able to capture positive instances. For
example, "Alterations" and "Insurance" have F1-scores of 1.00, indicating strong
performance for these classes. The F1-score for the "Assignment/Sublet" class is
0.87, indicating a reasonably good balance between precision and recall for this class.
Support: Support represents the number of instances in each class, indicating
how many data points belong to each category.
The confusion matrix itself is a tabular representation that shows the
actual ground truth (rows) compared to the model's predictions (columns) for each
class. It provides a detailed view of the model's performance for each category,
highlighting where it excels and where it may have challenges. This information is
valuable for understanding the strengths and weaknesses of the classification
model and for making improvements as needed.

4.3 INFERENCE:
• In Precision, "Alterations," "Insurance," and "Lease Year," the precision is 1.00,
meaning that the model rarely makes false positive predictions for these classes.

• The recall for "Area," "Basic Information," and "Insurance" is perfect (1.00),
suggesting that the model rarely misses these positive instances.
• "Alterations" and "Insurance" have F1-scores of 1.00, indicating strong
performance for these classes.

Confusion Matrix for SGD Model:

4.4 INFERENCE:
Accuracy: The model achieves an overall accuracy of approximately
83%, indicating that it correctly predicts the lease agreement categories for
the majority of instances in the dataset.
Precision: Precision measures how many of the predicted positive
cases were actually positive. For most categories, the precision is
relatively high, ranging from 50% to 100%. For example, in the
"Alterations" category, the precision is 100%, indicating that when the
model predicts "Alterations," it is almost always correct.
Recall: Recall measures how many of the actual positive cases were
correctly predicted as positive. Similar to precision, recall scores are
generally high across categories, ranging from 69% to 100%. In the "Area"
category, the recall is 100%, suggesting that the model effectively
identifies instances belonging to this category.
F1-Score: The F1-score is the harmonic mean of precision and recall,
providing a balanced measure of a model's performance. The F1-scores
are strong, with values between 0.67 and 1.00, indicating that the model
performs well in terms of both precision and recall.
Confusion Matrix: The confusion matrix provides a detailed breakdown
of true positives, true negatives, false positives, and false negatives for
each category. It allows for a more granular assessment of model
performance.

CHAPTER V
ANALYSIS REPORT
5.1 ANALYSIS REPORTS AND INFERENCES
This chapter explains the reports and screens generated as part of the
project

5.1.1 REPORTS FOR LOGISTIC REGRESSION ALGORITHM

Figure Source. 5.1 Report for Logistic Regression

Figure Source. 5.1 Depicts the Classification Report and Confusion Matrix for
Logistic Regression, which gives good Accuracy, along with Precision and Recall
value.

5.1.2 REPORTS FOR SVM ALGORITHM

Figure Source. 5.2 Report for SVM

Figure Source. 5.2 Depicts the Classification Report and Confusion Matrix for
SVM which gives good Accuracy, along with Precision and Recall value.
5.1.3 REPORTS FOR SGD CLASSIFIER

Figure Source. 5.3 Report for SGD

Figure Source. 5.3 Depicts the Classification Report and Confusion Matrix of
SGD, which gives good Accuracy, along with Precision and Recall value.

5.1.4 REPORTS FOR DECISION TREE ALGORITHM

Figure Source. 5.4 Report for Decision Tree (Train Data)

Figure Source. 5.4 Depicts the Classification Report and Confusion Matrix of the
Decision Tree classifier for Test Data, which gives good Accuracy, along with Precision and
Recall value.

5.1.5 MODEL MULTINOMIAL_NB

Figure Source. 5.5 Report MultinomialNB

Figure Source. 5.5 Depicts the Classification Report and Confusion Matrix of
MultinomialNB, which gives good Accuracy, along with Precision and Recall
value.

5.2 Inference

5.2.1 Accuracy Comparison:


Logistic Regression, Bagging Classifier, and SGD Classifier exhibit the
highest accuracy, with approximately 83.43%. SVC follows closely with an
accuracy of around 81.40%. Decision Tree Classifier lags behind with an accuracy
of 67.15%. Accuracy measures overall correctness in predictions. Higher accuracy
indicates a better-performing model in terms of correct classifications.

5.2.2 Precision, Recall, and F1-Score:


Precision: Indicates the ratio of true positive predictions to the total
positive predictions. It measures the model's ability to avoid false positives.
Recall: Measures the ratio of true positives to the total actual positives. It indicates
the model's ability to capture all positive instances. F1-Score: Balances precision
and recall, providing a harmonic mean of the two. It's a useful metric when there's
an imbalance between classes. Different models achieve varying levels of
precision, recall, and F1-scores across different lease agreement categories.
Categories like "Alterations" and "Estoppel" often have high precision, recall, and
F1-scores across models. Some models, like Logistic Regression and SGD
Classifier, tend to excel in precision, while others, like SVC and Decision Tree
Classifier, may excel in recall.

5.2.3 Model Selection Considerations:


Logistic Regression, Bagging Classifier, and SGD Classifier perform
similarly in terms of accuracy and overall F1-score.
The choice of the best model may depend on specific requirements. For precision-
focused tasks (minimizing false positives), Logistic Regression or SGD Classifier
may be preferred.

For recall-focused tasks (capturing all positives), SVC or Bagging Classifier may
be chosen.
Decision Tree Classifier, while having lower accuracy, might be suitable for cases
where interpretability is essential, as decision trees are inherently interpretable.

Challenges and Improvements:


The "Premises Address" and "out of scope" categories consistently present
challenges for all models, with lower precision, recall, and F1-scores.
Improvement strategies may involve additional data preprocessing, feature
engineering, or using more advanced models, especially for these challenging
categories.
In summary, the choice of the best model should align with the project's specific goals
and priorities. Consider the trade-offs between precision and recall and assess the
importance of interpretability. Additionally, continue to refine and optimize the
chosen model for improved performance on challenging categories.

CHAPTER VI
CONCLUSION

6.1 CONCLUSION
In conclusion, the primary goal of this project was to develop and
deploy machine learning algorithms capable of effectively classifying lease
agreements, thereby streamlining the process of document management and
decision-making in lease-related matters. The ultimate objective was to strike a
balance between minimizing false negatives, ensuring all genuine lease
agreements are correctly identified, and minimizing false positives, preventing the
misclassification of non-lease documents.

Achieving this balance was a challenging task, given the inherent trade-off
between precision and recall. In light of the project's focus on improving the
accuracy of lease agreement classification, special emphasis was placed on
reducing false positives. Extensive efforts were invested in fine-tuning
hyperparameters and optimizing the model's performance while maintaining
computational efficiency.

The implementation of various machine learning models, complemented


by techniques like grid search and cross-validation, yielded impressive results.
The model achieved a high accuracy, along with improved precision, recall, and
F1 scores. These outcomes not only met but exceeded the project's objectives,
promising robust accuracy and classification capabilities for lease agreements.

In essence, the success of this project has provided a valuable tool for lease
management professionals and organizations. By harnessing the power of machine
learning algorithms, they can now classify lease agreements more accurately,
enabling efficient document management and informed decision-making. This
project represents a significant step toward enhancing lease document processing
through data-driven insights and precise classification.

6.1 BIBLIOGRAPHY
• Natural Language Processing - Overview - GeeksforGeeks
• NLTK :: Natural Language Toolkit
• Natural Language Processing With Python's NLTK Package – Real Python
• NLP using NLTK Library | NLTK Library for Natural Language Processing
(analyticsvidhya.com)
• Installing scikit-learn — scikit-learn 1.3.1 documentation
• scikit-learn · PyPI
• Untitled10.ipynb - Colaboratory (google.com)
• NLP Preprocessing Steps in Easy Way - Analytics Vidhya
