PA Notes 2
Introduction to Regularization
During machine learning model building, regularization is an important and unavoidable step for improving model predictions and reducing errors. It is also called a shrinkage method: we add a penalty term that controls model complexity and helps avoid overfitting by reducing variance.
Let's discuss the available methods, their implementation, and their benefits in detail here.
Given many parameters/attributes/features, the data needs to cover enough permutations and combinations of those attributes to capture the possible relationships between the dependent and independent variables. To understand the relationship between the target and the available independent variables across all these combinations, we certainly need adequate data, in terms of the number of records, for our analysis.
If you have few records but a huge number of attributes, the data will not cover all possible permutations and combinations of the dependent and independent variables, and those gaps will influence your model for better or worse. This circumstance is what we call the Curse of Dimensionality, and it concerns the data together with the parameters/attributes/features.
The Curse of Dimensionality does not simply mean "too many dimensions"; it is the lack of possible permutations and combinations of attribute values relative to the number of dimensions.
Put another way, the missing data and the gaps create empty regions of the feature space, so the algorithm cannot connect the dots and build a reliable model. The data is spread thinly across a multi-dimensional space with large empty areas, which makes it hard for the algorithm to learn the relationship between the dependent and independent variables and to predict future data. If you try to visualize this, it is a complex picture that is difficult to follow.
During training, you will observe the behaviour described above, but during testing, on new combinations of data the model has never seen, accuracy will jump around and the model will suffer from error due to variance (variance error). Such a model is not fit for a production move and is risky to use for prediction.
With too many dimensions and too little data, the algorithm builds a "best fit" surface full of peaks and deep valleys, with coefficients of high magnitude. This leads to overfitting (drastic fluctuation in surface inclination) and is not suitable for production.
To understand or implement these techniques, we should first understand the cost function of a linear model.
Understanding the regression graph: the graph below represents all the parameters that exist in the LR model and is largely self-explanatory.
Cost function/error function: it takes the slope and intercept values (m and c) and returns the error/cost value. It measures the error between the predicted outcomes and the actual outcomes, i.e. how inaccurate the model is in its predictions.
It is used to estimate how badly a model is performing for the given dataset and its dimensions.
Why is the cost function important in machine learning? Because it helps us reach the optimal solution. So how do we do this? We will see the possible methods, with simple steps using Python libraries.
The cost function helps us figure out the best straight line by minimizing the error: the best fit line is the line for which the sum of squared errors around the line is minimized.
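As a minimal sketch of this idea in Python (the function name and the toy data are invented purely for illustration):

```python
import numpy as np

def sse_cost(m, c, x, y):
    """Sum of squared errors between the line m*x + c and the actual values y."""
    y_pred = m * x + c
    return np.sum((y - y_pred) ** 2)

# Hypothetical data just to show the call
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
print(sse_cost(2.0, 0.0, x, y))  # cost of the line y = 2x for this data
```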
Regularization Techniques
Let's discuss the available regularization techniques, followed by their implementation.
1. Ridge Regression (L2 Regularization):
Here we minimize the sum of squared errors plus the sum of the squared coefficients (β). In the background, coefficients (β) with a large magnitude produce peaks and deep slopes in the fitted surface; to suppress this we use lambda (λ), called the penalty factor, which helps us get a smooth surface instead of an irregular one. Ridge regression pushes the coefficient (β) values toward zero in magnitude. This is L2 regularization, since it adds a penalty equivalent to the square of the magnitude of the coefficients.
Ridge Regression = Loss function + Regularized term
As we know, multiple linear regression is
y = β0 + β1x1 + β2x2 + … + βnxn
and its loss function (the sum of squared errors) is
Σi (yi - β0 - Σj βj xij)²
Adding the L2 penalty term gives the ridge objective:
Σi (yi - β0 - Σj βj xij)² + λ Σj βj²
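A minimal, hedged sketch of the implementation with scikit-learn (the synthetic dataset and the alpha values below are assumptions chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data, invented purely for illustration
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=42)

lr = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda, the penalty factor
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients toward zero; Lasso can set some exactly to zero
print("OLS coefficients:  ", np.round(lr.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```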
Compare L2 and L1 regularization: after seeing the code-level implementation, you should be able to relate the importance of the regularization techniques and their influence on model improvement. As a final touch, let's compare L1 and L2.
Ridge (L2)                                              Lasso (L1)
Quite accurate and keeps all features                   Often more accurate than Ridge
λ ==> sum of the squares of the coefficients            λ ==> sum of the absolute values of the coefficients
Coefficients are shrunk toward zero but not zeroed      Coefficients can be shrunk exactly to zero
Keeps all variables (no variable selection)             Performs model selection by dropping coefficients
When you begin learning anything new, one of the things you grapple with is the lingo of the field you're getting into. Clearly understanding the terms (and in some cases the symbols and acronyms) used in a field is the first and most fundamental step to understanding the subject matter itself. When I started out in Machine Learning, the concept of parameters and hyperparameters confused me a lot. If you are here, I suppose you find it confusing too. So I wrote this article to dispel whatever confusion you might have and set you on a path of absolute clarity.
In ML/DL, a model is defined or represented by the model parameters. However, the process of
training a model involves choosing the optimal hyperparameters that the learning algorithm will use to
learn the optimal parameters that correctly map the input features (independent variables) to the
labels or targets (dependent variable) such that you achieve some form of intelligence.
So what exactly are parameters and hyperparameters and how do they relate?
Hyperparameters
Hyperparameters are parameters whose values control the learning process and determine the values of
model parameters that a learning algorithm ends up learning. The prefix ‘hyper_’ suggests that they are
‘top-level’ parameters that control the learning process and the model parameters that result from it.
As a machine learning engineer designing a model, you choose and set the hyperparameter values that your learning algorithm will use before the training of the model even begins. In this light, hyperparameters are said to be external to the model because the model cannot change their values during learning/training.
Hyperparameters are used by the learning algorithm while it is learning, but they are not part of the resulting model. At the end of the learning process we have the trained model parameters, which are effectively what we refer to as the model. The hyperparameters that were used during training are not part of this model: we cannot, for instance, tell from the model itself which hyperparameter values were used to train it; we only know the model parameters that were learned.
Basically, anything in machine learning and deep learning whose value you decide or whose configuration you choose before training begins, and whose value or configuration remains the same when training ends, is a hyperparameter.
Here are some common examples:
Train-test split ratio
Learning rate in optimization algorithms (e.g. gradient descent)
Choice of optimization algorithm (e.g., gradient descent, stochastic gradient descent, or Adam
optimizer)
Choice of activation function in a neural network (nn) layer (e.g. Sigmoid, ReLU, Tanh)
The choice of cost or loss function the model will use
Number of hidden layers in a nn
Number of activation units in each layer
The drop-out rate in nn (dropout probability)
Number of iterations (epochs) in training a nn
Number of clusters in a clustering task
Kernel or filter size in convolutional layers
Pooling size
Batch size
Parameters
Parameters on the other hand are internal to the model. That is, they are learned or estimated purely
from the data during training as the algorithm used tries to learn the mapping between the input features
and the labels or targets.
Model training typically starts with the parameters initialized to some values (random values, or set to zeros). As training/learning progresses, these initial values are updated using an optimization algorithm (e.g. gradient descent). The learning algorithm keeps updating the parameter values as learning progresses, but the hyperparameter values set by the model designer remain unchanged.
At the end of the learning process, model parameters are what constitute the model itself.
Examples of parameters
The coefficients (or weights) of linear and logistic regression models.
Weights and biases of a nn
The cluster centroids in clustering
Simply put, parameters in machine learning and deep learning are the values your learning algorithm can
change independently as it learns and these values are affected by the choice of hyperparameters you
provide. So you set the hyperparameters before training begins and the learning algorithm uses them to
learn the parameters. Behind the training scene, parameters are continuously being updated and the
final ones at the end of the training constitute your model.
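As a hedged, minimal illustration of this distinction using scikit-learn (the dataset and the chosen hyperparameter values are assumptions for this sketch):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, invented purely for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# C and max_iter are hyperparameters: we choose them before training begins
model = LogisticRegression(C=0.5, max_iter=200)

# coef_ and intercept_ are parameters: the algorithm learns them from the data
model.fit(X, y)
print("Learned weights:", model.coef_)
print("Learned bias:   ", model.intercept_)
```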
Therefore, setting the right hyperparameter values is very important because it directly impacts the
performance of the model that will result from them being used during model training. The process of
choosing the best hyperparameters for your model is called hyperparameter tuning and in the next
article, we will explore a systematic way of doing hyperparameter tuning.
Conclusion
I trust that you now have a clear understanding of what hyperparameters and parameters exactly are and
understand that hyperparameters have an impact on the parameters your model learns. I will be
following this up with a detailed practical article on hyperparameter tuning.
A Guide to Hyperparameter Tuning: Enhancing Machine Learning Models
Introduction
Hyperparameter tuning is a critical process in the development of machine learning models. It is the art
and science of finding the optimal configuration of hyperparameters that govern the behavior of a
machine learning algorithm. Hyperparameters are parameters set prior to the training process, which are
not learned from the data itself. They significantly impact a model’s performance, and choosing the right
hyperparameters can make the difference between an underperforming model and one that achieves
state-of-the-art results. In this article, we will explore the importance of hyperparameter tuning and
various techniques to achieve it successfully.
The Role of Hyperparameters
Machine learning models can be thought of as a combination of data, model architecture, and
hyperparameters. While data is the lifeblood of a machine learning model and model architecture defines
its structure, hyperparameters act as the knobs and levers that fine-tune the model’s behavior.
Here are some commonly used hyperparameters:
1. Learning Rate: It controls the step size during the training process. A higher learning rate can lead to
faster convergence but might overshoot the optimal solution, while a lower learning rate might
converge too slowly.
2. Number of Hidden Layers and Units: In neural networks, the architecture is defined by the number
of hidden layers and units in each layer. The choice of these hyperparameters can greatly impact the
model’s capacity and ability to capture complex patterns.
3. Batch Size: It determines the number of data samples used in each iteration during training. Smaller
batch sizes often lead to better generalization, but they can also slow down training.
4. Regularization Parameters: L1 and L2 regularization parameters control the degree of regularization
applied to the model, helping prevent overfitting.
5. Activation Functions: The choice of activation functions in neural networks, such as ReLU, Sigmoid,
or Tanh, can impact the model’s non-linearity and learning behavior.
The Importance of Hyperparameter Tuning
Hyperparameter tuning is crucial because selecting the right values can significantly impact a model’s
performance. When hyperparameters are poorly chosen, a model may underfit (perform poorly on both
the training and testing data) or overfit (perform well on the training data but poorly on unseen data).
Proper tuning can result in a model that generalizes well to new data, achieving higher accuracy and
robustness.
Hyperparameter tuning also plays a pivotal role in the development of real-world applications. For
example, in healthcare, the choice of hyperparameters can affect the accuracy of medical diagnosis
models. In autonomous vehicles, it can influence the safety and reliability of self-driving systems. The
financial industry relies on tuned models for predicting stock prices and managing risk.
Hyperparameter Tuning Techniques
1. Grid Search: Grid search is a systematic approach that involves specifying a range of values for each
hyperparameter and exhaustively testing all possible combinations. While it can be computationally
expensive, it ensures that no optimal configuration is missed.
2. Random Search: In random search, hyperparameters are sampled randomly from predefined
distributions. This approach is less computationally intensive than grid search and can often yield
similar results.
3. Bayesian Optimization: Bayesian optimization uses a probabilistic model to predict the performance
of different hyperparameter configurations. It then selects configurations that are more likely to
lead to improved results, making it an efficient and effective technique.
4. Genetic Algorithms: Inspired by natural selection, genetic algorithms evolve a population of
hyperparameter configurations over multiple generations. The best-performing configurations
survive and reproduce, leading to the discovery of optimal settings.
5. Automated Hyperparameter Tuning: Tools like AutoML and platforms such as Google’s Cloud
AutoML or H2O.ai offer automated solutions for hyperparameter tuning. These platforms streamline
the process and require minimal manual intervention.
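As a hedged, minimal sketch of grid search and random search with scikit-learn (the estimator, parameter ranges, and dataset below are assumptions chosen only for illustration):

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data, invented purely for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Grid search: exhaustively tries every combination in the grid
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)

# Random search: samples hyperparameter values from distributions
rand = RandomizedSearchCV(SVC(), {"C": uniform(0.1, 10)}, n_iter=10, cv=5, random_state=0)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)
```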
K-means Clustering
K-means clustering is an unsupervised machine learning algorithm that groups an unlabeled dataset into different clusters. This article aims to explore the fundamentals and workings of k-means clustering along with its implementation.
What is K-means Clustering?
Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified data
and enabling the algorithm to operate on that data without supervision. Without any previous data
training, the machine’s job in this case is to organize unsorted data according to parallels, patterns, and
variations.
K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters. It starts by randomly placing the cluster centroids in the space. Then each data point is assigned to one of the clusters based on its distance from the cluster centroid. After assigning every point to one of the clusters, new cluster centroids are computed. This process runs iteratively until it finds good clusters. In this analysis we assume that the number of clusters is given in advance and we have to put the points into one of the groups.
In some cases, K is not clearly defined and we have to think about the optimal number of clusters. K-means performs best when the data is well separated; when data points overlap, this clustering is not suitable. K-means is fast compared to other clustering techniques and provides strong coupling between the data points, but it does not provide clear information regarding the quality of the clusters. Different initial assignments of the cluster centroids may lead to different clusters, the algorithm is sensitive to noise, and it may get stuck in local minima.
K-mean++:
To overcome the above-mentioned drawbacks we use K-means++. This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering. Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm; that is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.
Initialization algorithm:
The steps involved are:
1. Randomly select the first centroid from the data points.
2. For each data point compute its distance from the nearest, previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point as
centroid is directly proportional to its distance from the nearest, previously chosen centroid. (i.e.
the point having maximum distance from the nearest centroid is most likely to be selected next as
a centroid)
4. Repeat steps 2 and 3 until k centroids have been sampled
Intuition: By following the above procedure for initialization, we pick up centroids that are far away
from one another. This increases the chances of initially picking up centroids that lie in different
clusters. Also, since centroids are picked up from the data points, each centroid has some data points
associated with it at the end.
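A minimal NumPy sketch of this initialization, written from the steps above (the helper name kmeans_pp_init and the toy data are assumptions for illustration); note that it weights the selection by the squared distance, which is the standard K-means++ weighting:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids following the K-means++ scheme described above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random from the data points
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance of every point to its nearest already-chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to that
        # (squared) distance, so far-away points are the most likely to be picked
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Hypothetical 2-D data just to show the call
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print(kmeans_pp_init(X, k=2))
```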
Applications of k-means++ algorithm
Image segmentation: K-means++ can be used to segment images into different regions based on
their color or texture features. This is useful in computer vision applications, such as object
recognition or tracking.
Customer segmentation: K-means++ can be used to group customers into different segments
based on their purchasing habits, demographic data, or other characteristics. This is useful in
marketing and advertising applications, as it can help businesses target their marketing efforts
more effectively.
Anomaly detection: K-means++ can be used to identify outliers or anomalies in a dataset. This is
useful in fraud detection, network intrusion detection, and other security applications.
Document clustering: K-means++ can be used to group similar documents together based on
their content. This is useful in natural language processing applications, such as text
classification or sentiment analysis.
Recommender systems: K-means++ can be used to recommend products or services to users
based on their past purchases or preferences. This is useful in e-commerce and online
advertising applications.
K-means++
K-means++ suggests picking the next center randomly but with probability proportional to its distance to the closest center: the higher the distance to the closest center, the higher the probability of being chosen. This minimizes the chance of picking centers that are close to one another while avoiding the systematic pitfall of the farthest-first traversal method.
K-Means Clustering Explained
Clustering was introduced in 1932 by H.E. Driver and A.L. Kroeber in their paper "Quantitative expression of cultural relationship". Since then the technique has taken a big leap and has been used to discover the unknown in a number of application areas, e.g. healthcare.
Clustering is a type of unsupervised learning where the references need to be drawn from unlabelled
datasets. Generally, it is used to capture meaningful structure, underlying processes, and grouping
inherent in a dataset. In clustering, the task is to divide the population into several groups in such a way
that the data points in the same groups are more similar to each other than the data points in other
groups. In short, it is a collection of objects based on their similarities and dissimilarities.
With clustering, data scientists can discover intrinsic groupings in unlabelled data. There are no universal criteria for a good clustering; it depends entirely on the user and how they want to use it for their specific needs. It can be used to find unusual data points/outliers in the data or to identify unknown properties and find a suitable grouping in the dataset.
Let’s take an example, imagine you work in a Walmart Store as a manager and would like to better
understand your customers to scale up your business by using new and improved marketing strategies.
It is difficult to segment your customers manually. You have some data that contains their age and purchase history; here clustering can help to group customers based on their spending. Once the customer segmentation is done, you can define different marketing strategies for each group according to its target audience.
There are many clustering algorithms grouped into different cluster models. Before choosing any
algorithm for a use case, it is important to get familiar with the cluster models and if it is suitable for the
use case. One more thing which should be considered while choosing any clustering algorithm is the size
of your dataset.
Datasets can contain millions of records and not all algorithms scale efficiently. K-Means is one of the
most popular algorithms and it is also scale-efficient as it has a complexity of O(n). In this article, we will
talk about K-Means in-depth and what makes it popular.
K-Means clustering
K-means is a centroid-based clustering algorithm, where we calculate the distance between each data
point and a centroid to assign it to a cluster. The goal is to identify the K number of groups in the
dataset.
“K-means clustering is a method of vector quantization, originally from signal processing, that aims to
partition n observations into k clusters in which each observation belongs to the cluster with the nearest
mean, serving as a prototype of the cluster.” – Source
It is an iterative process of assigning each data point to the groups and slowly data points get clustered
based on similar features. The objective is to minimize the sum of distances between the data points
and the cluster centroid, to identify the correct group each data point should belong to.
Here, we divide a data space into K clusters and assign a mean value to each. The data points are placed
in the clusters closest to the mean value of that cluster. There are several distance metrics available that
can be used to calculate the distance.
Let’s take an example to understand how K-means work step by step. The algorithm can be broken
down into 4-5 steps.
1. Choosing the number of clusters : The first step is to define the K number of clusters in which we
will group the data. Let’s select K=3.
2. Initializing centroids: a centroid is the center of a cluster, but initially the exact centers of the data points are unknown, so we select random data points and define them as the centroids for each cluster. We will initialize 3 centroids in the dataset.
3. Assign data points to the nearest cluster
Now that the centroids are initialized, the next step is to assign each data point Xn to its closest cluster centroid Ck.
In this step, we first calculate the distance between data point X and centroid C using the Euclidean distance metric, and then choose, for each data point, the cluster whose centroid is at the minimum distance.
4. Re-initialize centroids
Next, we re-initialize each centroid by calculating the average of all the data points assigned to that cluster.
5. Repeat steps 3 and 4: we keep repeating steps 3 and 4 until we have optimal centroids and the assignments of data points to clusters no longer change.
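A hedged, minimal sketch of these steps using scikit-learn's KMeans (the synthetic blob data is invented purely for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 groups, invented purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is K; init="random" mirrors the random-centroid initialization above
kmeans = KMeans(n_clusters=3, init="random", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First 10 assignments:", labels[:10])
```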
Does this iterative process sound familiar? Well, K-means follows the same approach as Expectation-
Maximization(EM). EM is an iterative method to find the maximum likelihood of parameters where the
machine learning model depends on unobserved features. This approach consists of two steps
Expectation(E) and Maximization(M) and iterates between these two.
For K-means, The Expectation(E) step is where each data point is assigned to the most likely cluster and
the Maximization(M) step is where the centroids are recomputed using the least square optimization
technique.
Positioning the initial centroids can be challenging, and the aim is to initialize the centroids as close as possible to the optimal values of the actual centroids. It is recommended to use a strategy for defining the initial centroids, as it directly impacts the overall runtime. The traditional way is to select the centroids randomly, but there are other methods as well, which we will cover in the following sections.
Random Data Points
This is the traditional approach to initializing centroids, where K random data points are selected and defined as centroids. As we saw in the example above, in this method every data instance in the dataset has to be enumerated, and a record of the minimum/maximum value of each attribute has to be kept. This is a time-consuming process; with increased dataset complexity, the number of steps needed to reach the correct centroids, and therefore the correct clusters, also increases.
Naive Sharding
The sharding centroid initialization algorithm primarily depends on the composite summation value of
all the attributes for a particular instance or row in a dataset. The idea is to calculate the composite
value and then use it to sort the instances of the data. Once the data set is sorted, it is then divided
horizontally into k shards.
Finally, all the attributes from each shard are summed and their mean is calculated. The collection of shard mean values is identified as the set of centroids that can be used for initialization.
Centroid initialization using sharding happens in linear time and the resultant execution time is much
better than random centroid initialization.
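A minimal NumPy sketch of naive sharding as described above (the helper name naive_sharding_init and the toy data are assumptions for illustration):

```python
import numpy as np

def naive_sharding_init(X, k):
    """Initialize k centroids by sorting rows on their attribute sum and averaging k shards."""
    # Composite value: sum of all attributes for each row
    composite = X.sum(axis=1)
    # Sort the instances by their composite value
    X_sorted = X[np.argsort(composite)]
    # Split horizontally into k shards and take the mean of each shard as a centroid
    shards = np.array_split(X_sorted, k)
    return np.array([shard.mean(axis=0) for shard in shards])

# Hypothetical data just to show the call
X = np.random.rand(20, 3)
print(naive_sharding_init(X, k=4))
```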
K-Means++
K-means++ is a smart centroid initialization method for the K-means algorithm. The goal is to spread out the initial centroids: the first centroid is assigned randomly, and each remaining centroid is selected from the data points with probability proportional to its squared distance from the nearest centroid chosen so far. The idea is to push the centroids as far as possible from one another.
The quantity driving this selection is D(xi)² = min over the already chosen centroids Cj of ||xi - Cj||², the squared distance of data point xi from its nearest chosen centroid Cj.
How to choose K?
Some factors can challenge the efficacy of the final output of the K-means clustering algorithm and one
of them is finalizing the number of clusters(K). Selecting a lower number of clusters will result in
underfitting while specifying a higher number of clusters can result in overfitting. Unfortunately, there is
no definitive way to find the optimal number.
The optimal number of clusters depends on the similarity measures and the parameters used for
clustering. So, to find the number of clusters in the data, we need to run the k-means clustering for a
range of values and compare the outcomes. At present, we don’t have any method to determine the
exact accurate value of clusters K but we can estimate the value using some techniques, including Cross-
validation, Elbow method, Information Criteria, the Silhouette method, and the G-means algorithm.
Elbow method: the distance metric is one of the metrics commonly used to compare results across different K values. As the number of clusters K is increased, the distance from the centroids to the data points decreases, eventually reaching zero when K equals the number of data points. This is why we use the mean of the distances to the centroids. In the elbow method, we plot this mean distance and look for the elbow point where the rate of decrease shifts; this elbow point can be used to determine K.
“The elbow method is a heuristic used in determining the number of clusters in a data set. The method
consists of plotting the explained variation as a function of the number of clusters, and picking the elbow
of the curve as the number of clusters to use.” - Source
Elbow point is used as a cutoff point in mathematical optimization to decide at which point the
diminishing returns are no longer worth the additional cost. Similarly, in clustering, this is used to choose
a number of clusters when adding another cluster doesn’t improve the outcomes of modeling. It is an
iterative process where K-means clustering will be done on the dataset for a range of values of K as
below.
1. Perform K-means clustering with all the K values. For each K value, we compute the average
distance to the centroid across all the data points:
2. Plot each of these points and find the point where mean distance suddenly falls (Elbow):
This is probably the most popular method for determining the optimal number of clusters, though finding the elbow point can be a challenge because in practice there may not be a sharp elbow.
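A hedged, minimal sketch of the elbow method using KMeans.inertia_ (the within-cluster sum of squared distances) on synthetic data invented for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data, invented purely for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Run K-means for a range of K values and record the within-cluster distances
inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" in this curve suggests a reasonable K
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Within-cluster sum of squared distances")
plt.show()
```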
Silhouette method: Finding an Elbow point is challenging in practice but there are other techniques to
determine the optimal value of K and one of them is the Silhouette Score method.
“Silhouette refers to a method of interpretation and validation of consistency within clusters of data.
The technique provides a succinct graphical representation of how well each object has been classified.”
– Source
The silhouette coefficient is used to measure the quality of the clusters by checking how similar a data point is to its own cluster compared to the other clusters. Silhouette analysis can be used to study the distance between the resulting clusters. The measure ranges between -1 and 1:
+1 indicates that the data point is far away from the neighboring cluster and thus optimally
positioned.
0 indicates either it is on or very close to the decision boundary between two neighbor clusters.
-1 indicates that the data point is assigned to the wrong cluster.
To find an optimal value for the number of clusters K, we use a silhouette plot to display a measure of
how close each point in one cluster is to a point in the neighboring clusters and thus provide a way to
assess parameters like the number of clusters visually. Let’s see how it works.
Using the above Silhouette analysis, we can choose K’s optimal value as 3 because the average
silhouette score is higher and indicates that the data points are optimally positioned.
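A hedged, minimal sketch of computing the average silhouette score for a range of K values with scikit-learn (synthetic data, for illustration only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data, invented purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Silhouette needs at least 2 clusters, so start the range at K=2
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"K={k}: average silhouette score = {silhouette_score(X, labels):.3f}")
```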
In clustering, we don’t have any labeled data but just a set of features and the objective is to obtain high
intra-cluster similarity and low inter-cluster similarity for those features. Evaluating the performance of
any clustering algorithm is not as easy as calculating the number of errors or finding precision or recall
like in supervised learning. Here we evaluate the outcomes based on the similarities or dissimilarities
between data points.
In the previous section, we saw how distance measures and the silhouette score can help in finding the optimal value of K, and many of these evaluation metrics can also be used to find the best clustering for parametric clustering algorithms. A clustering algorithm is only as good as its similarity measure, so we need to make sure we use an appropriate one. One way is to experiment with your measures and determine which algorithm provides more accurate similarities.
“In theory, the clustering researcher has acquired an intuition for the clustering evaluation, but in
practise the mass of data on the one hand and the subtle details of data representation and clustering
algorithms on the other hand, make an intuitive judgement impossible.” - Phd Thesis, University of
Stuttgart
There are several clustering evaluation metrics available and continuously evolving to help researchers
with clustering. In this section, we will be discussing some of the most common and popular metrics.
Dunn index: the Dunn index is used to identify dense and well-separated groups. It is the ratio between the minimum inter-cluster distance and the maximum intra-cluster distance, and can be computed as below:
DI = (min over i ≠ j of d(i, j)) / (max over k of d′(k))
Here d(i, j) is the distance between clusters i and j, and we take the minimum over all inter-cluster distances; d′(k) is the intra-cluster distance of cluster k, and we take the maximum over all intra-cluster distances. Algorithms that create clusters with a high Dunn index are more desirable, as those clusters are more compact and more different from each other.
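A hedged sketch of computing a Dunn index in NumPy/SciPy; definitions of the inter- and intra-cluster distances vary, and this sketch assumes single-linkage inter-cluster distance and the cluster diameter as the intra-cluster distance:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def dunn_index(X, labels):
    """Dunn index: minimum inter-cluster distance divided by maximum intra-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Inter-cluster distance: smallest distance between points of two different clusters
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    # Intra-cluster distance: largest distance between two points in the same cluster
    intra = max(cdist(c, c).max() for c in clusters)
    return inter / intra

# Hypothetical usage with K-means labels on synthetic data
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Dunn index:", dunn_index(X, labels))
```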
Silhouette score: the average silhouette score is also used as an evaluation measure in clustering. The best silhouette score is 1 and the worst is -1; values close to zero indicate that data points lie on the boundary, i.e. overlapping clusters.
F-Measure: the F-measure is applied to the precision and recall of pairs and is used to balance false negatives by weighting recall.
In clustering, the common approach is to apply the F-measure to the precision and recall of pairs, which is referred to as the pair-counting F-measure. We can calculate it using the formula below:
Fβ = ((β² + 1) · P · R) / (β² · P + R)
Here β is chosen such that recall is considered β times as important as precision; when we set β to 1, it becomes the harmonic mean of precision and recall. We calculate the recall and precision of the clusters for each given class, i.e. a set of classes should be given for the objects.
Rand index
The Rand index can be used to compute how similar the clusters are to a benchmark. Its value can be found using the following formula:
RI = (TP + TN) / (TP + TN + FP + FN)
Here TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. With this evaluation metric, we can count the number of correct pairwise assignments.
TP is the number of data point pairs that are clustered together both in the predicted partition and in the ground truth partition; FP is the number of data point pairs that are clustered together in the predicted partition but not in the ground truth partition.
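A hedged, minimal sketch of computing the Rand index with scikit-learn, assuming ground-truth labels are available (the label lists below are invented for illustration):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

# Hypothetical ground-truth partition and predicted clustering
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]

print("Rand index:         ", rand_score(labels_true, labels_pred))
print("Adjusted Rand index:", adjusted_rand_score(labels_true, labels_pred))
```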