Classification of Input Document or Text in Different Indian IT Laws Using Machine Learning Techniques
Abstract - In this paper, we study how section classification can help individuals understand the various acts and laws that apply to a legal document, using an n-grams model. Due to the exponential growth of information, there is a need for automated tools that can process and understand such texts. Our model allows a user to take a legal document such as an FIR and find the legal sections violated by the accused, thus helping a naive person understand the law.

Data extraction is used for extracting data from the web. Here we use data extraction to create a data set upon which we will train our model for the prediction of laws. All data used in the research is available on the web; we create the data set for our classifier by scraping data from the site "INDIAN KANOON.ORG".
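As a rough illustration of this data-extraction step, the sketch below pulls case text from Indian Kanoon's public search pages with requests and BeautifulSoup. The query parameters and the HTML selector used here are assumptions for illustration, not the paper's actual scraper.

```python
# Hedged sketch of the scraping step; URL parameters and the "result" selector
# are assumptions, not taken from the paper.
import requests
from bs4 import BeautifulSoup

def fetch_case_texts(query: str, pages: int = 1) -> list[str]:
    """Collect raw case/judgment text for a search query."""
    texts = []
    for page in range(pages):
        resp = requests.get(
            "https://indiankanoon.org/search/",
            params={"formInput": query, "pagenum": page},
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Hypothetical selector: each search hit is assumed to sit in a div.
        for result in soup.find_all("div", class_="result"):
            texts.append(result.get_text(" ", strip=True))
    return texts

if __name__ == "__main__":
    docs = fetch_case_texts("Information Technology Act section 66")
    print(f"Collected {len(docs)} documents")
```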
II. DATA PREPROCESSING

The text data obtained by web scraping cannot be fed directly into our proposed model, so it must be preprocessed first. Data preprocessing includes removing punctuation, extra white space and special characters from the text, leaving only alphanumeric values. All data must also be converted into lower case.
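A minimal sketch of these cleaning rules in Python; regular expressions are one possible implementation, as the paper does not specify the exact method.

```python
# Strip punctuation, special characters and extra white space,
# keep only alphanumeric characters (plus spaces), and lower-case everything.
import re

def preprocess(text: str) -> str:
    text = text.lower()                       # convert to lower case
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation / special characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra white space
    return text

print(preprocess("FIR No. 123/2021: accused charged u/s 66-C, IT Act!"))
# -> "fir no 123 2021 accused charged u s 66 c it act"
```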
III. FEATURE EXTRACTION

Before training the model we need to design features that score the different pieces of text present in a legal document. As all legal documents have a fixed format, it is possible for our model to extract consistent features from them.
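Since the abstract mentions an n-grams model, one plausible realisation of this step is an n-gram bag-of-words representation. The sketch below uses scikit-learn's TfidfVectorizer; the TF-IDF weighting, the (1, 2) n-gram range and the vocabulary size are illustrative assumptions rather than the paper's settings.

```python
# Turn cleaned documents into a sparse n-gram feature matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "accused gained unauthorised access to the computer system",
    "identity theft using a stolen password",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X = vectorizer.fit_transform(corpus)   # documents -> n-gram features
print(X.shape, len(vectorizer.get_feature_names_out()))
```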
The classification process consists of three steps:

1. Data Preprocessing: The raw text is cleaned as described above, and the data set is then split into train and validation sets (see the sketch after this list).

2. Feature Engineering: The raw data set is transformed into flat features which can be used in a machine learning model. This step also includes the process of creating new features from the existing data.

3. Model Training: The final step is model building, in which a machine learning model is trained on a labelled data set.
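A short sketch of the train/validation split from step 1, assuming scikit-learn; the toy features and labels, the 80/20 ratio and the stratification are placeholders for illustration.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real feature matrix and IT Act section labels.
X = [[0.1, 0.0], [0.0, 0.3], [0.2, 0.1], [0.0, 0.4], [0.3, 0.0], [0.1, 0.2]]
y = ["66C", "67", "66C", "67", "66C", "67"]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), "train /", len(X_val), "validation")
```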
We use two algorithms to classify the legal text and identify which section of the law it falls under.

1. Random forest algorithm - Random forest is a supervised classification algorithm. As its name suggests, it builds a forest of decision trees and makes it random. There is a direct relationship between the number of trees in the forest and the results it can get: the larger the number of trees, the more accurate the result. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of features.

For each resample $B_m$, $m = 1, \dots, M$, grow a regression tree $T_m$. For predicting a test case $C_0$ with covariate $x_0$, the predicted value of the whole random forest is obtained by combining the results given by the individual trees. The prediction can be written as

$$\hat{f}_{\mathrm{RF}}(x_0) = \frac{1}{M} \sum_{m=1}^{M} \hat{f}^{*}_{m}(x_0).$$
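A hedged sketch of the random-forest step with scikit-learn; the paper does not state its implementation or hyperparameters, so the tree count and toy data below are assumptions. For classification, scikit-learn's forest averages the per-tree class-probability estimates, which mirrors the $\frac{1}{M}$ averaging formula above.

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Toy stand-ins for the TF-IDF training features and IT Act section labels.
X_train = np.array([[0.1, 0.0], [0.0, 0.3], [0.2, 0.1], [0.0, 0.4]])
y_train = np.array(["Section 66C", "Section 67", "Section 66C", "Section 67"])

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

x0 = np.array([[0.15, 0.05]])
# The forest's probability estimate is the mean of the individual trees'
# estimates -- the classification analogue of (1/M) * sum_m f*_m(x0).
avg = np.mean([tree.predict_proba(x0) for tree in forest.estimators_], axis=0)
assert np.allclose(avg, forest.predict_proba(x0))
print(forest.predict(x0))
```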
2. Neural network - The proposed neural network follows the Perceptron in that synaptic weights are connected directly between the input layer and the output layer, and the weights are updated only when a training example is misclassified. The learning layer, given as an additional layer alongside the input and the output layer, differs from the hidden layer of back propagation with respect to its role: it determines the synaptic weights between the input and the output layer by referring to the tables owned by the learning nodes. The learning of the neural network classifier refers to the process of optimizing the weights stored in these tables.

Figure 1 shows the architecture of the neural network. It consists of three layers: the input layer, the output layer, and the learning layer. The input layer receives an input vector given as a string vector. The learning layer determines the weights between the input and the output layer corresponding to the words of the given input vector by looking them up in the tables owned by the learning nodes. The output layer generates the categorical scores indicating memberships of the string vector in categories as the output.

Figure 1. The architecture of the neural network classifier

For text categorization, the network is configured as follows.

1. The number of input nodes should be identical to the dimension of the string vectors representing documents. This layer receives an input vector given as a string vector, so each node corresponds to a word in the string vector.

2. The number of learning nodes should be identical to the number of predefined categories. Nodes of this layer own tables corresponding to the predefined categories, and determine the weights between the input and the output layer for each word in the input vector.

3. The number of output nodes should be identical to the number of predefined categories. This layer generates the categorical scores as the output, and they correspond to the predefined categories.
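The sketch below is one plain reading of the classifier described above, not the authors' exact implementation: each predefined category owns a table of per-word weights (the learning layer), a document's score for a category is the sum of the weights of its words, and the tables are updated only when a training example is misclassified, as in a multiclass perceptron.

```python
from collections import defaultdict

class PerceptronTextClassifier:
    def __init__(self, categories):
        self.categories = categories
        # One weight table per predefined category, keyed by word.
        self.tables = {c: defaultdict(float) for c in categories}

    def score(self, words, category):
        # Categorical score = sum of the word weights stored in the table.
        return sum(self.tables[category][w] for w in words)

    def predict(self, words):
        return max(self.categories, key=lambda c: self.score(words, c))

    def train(self, data, epochs=5):
        for _ in range(epochs):
            for words, label in data:
                guess = self.predict(words)
                if guess != label:            # update only on misclassification
                    for w in words:
                        self.tables[label][w] += 1.0
                        self.tables[guess][w] -= 1.0

# Toy training data: (word list, predefined category) pairs.
data = [
    ("unauthorised access computer network".split(), "Section 66"),
    ("obscene material published electronic form".split(), "Section 67"),
]
clf = PerceptronTextClassifier(["Section 66", "Section 67"])
clf.train(data)
print(clf.predict("unauthorised access to a protected computer".split()))
# -> Section 66
```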
MODEL DESIGN

The overall classification process is shown in Fig. 2; its final steps are:

4. Tokenization: Data is tokenized, filtered and stemmed in order to create a unique list of stems and their frequencies (see the sketch after this list).

5. Model Train: The data is trained using various classification techniques.
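A small sketch of the tokenization step (4): the regex tokenizer, the tiny stop-word list and NLTK's Porter stemmer are illustrative choices, since the paper does not name its tools.

```python
# Tokenize, filter and stem, then count the unique stems and their frequencies.
import re
from collections import Counter
from nltk.stem import PorterStemmer

text = "the accused accessed the computer systems and stole stored passwords"

tokens = re.findall(r"[a-z0-9]+", text.lower())        # tokenize
stop_words = {"the", "and", "of", "to", "a", "in"}      # minimal filter list
filtered = [t for t in tokens if t not in stop_words]   # filter
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]             # stem

print(Counter(stems))   # unique stems with their frequencies
```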
Critical review of research papers
CONCLUSION