N-Modal Contrastive Losses with Applications to Social Media Data in Trimodal Space



William Theisen and Walter Scheirer
University of Notre Dame
[email protected]

arXiv:2403.12747v1 [cs.CV] 18 Mar 2024

Abstract

The social media landscape of conflict dynamics has grown increasingly multi-modal. Recent advancements in model architectures such as CLIP have enabled researchers to begin studying the interplay between the modalities of text and images in a shared latent space. However, CLIP models fail to handle situations on social media when the modalities present in a post expand above two. Social media dynamics often require understanding the interplay between not only text and images, but video as well. In this paper we explore an extension of the contrastive loss function to allow for any number of modalities, and demonstrate its usefulness in trimodal spaces on social media. By extending CLIP into three dimensions we can further aid understanding of social media landscapes where all three modalities are present (an increasingly common situation). We use a newly collected public data set of Telegram posts containing all three modalities to train, and then demonstrate the usefulness of, a trimodal model in two OSINT scenarios: classifying a social media artifact as either pro-Russian or pro-Ukrainian and identifying which account a given artifact originated from. While trimodal CLIP models have been explored before (though not on social media data), we also present a novel quadmodal CLIP model. This model can learn the interplay between text, image, video, and audio. We demonstrate new state-of-the-art baseline results on retrieval for quadmodal models moving forward.

[Figure 1: an example Telegram post (its Ukrainian text translates roughly to "My body two nanoseconds after arriving at a position under constant artillery fire") is split into image, video, and text artifacts, each passed through its own encoder and a shared projection layer, producing a cube of cross-modal similarities.]

Figure 1. An intuitive visualization of contrastive loss expanded to a trimodal space, with the optimization happening across a cube of similarities rather than a 2-dimensional grid in order to account for the multi-modal properties of a social media post containing not only images and text, but video as well. After training a shared projection layer, embeddings from all modalities are projected into a shared latent space, with artifacts from the same post being close to each other.
1. Introduction

As social media continues to grow, so too has open-source intelligence (OSINT) played an increasingly large role in governments' responses to international conflicts. The recent invasion of Ukraine by Russia has been accompanied by a truly digital form of warfare, with social media literally leading to "boots-on-the-ground" action. An example of this is the Ukrainian government using a social media post by a Russian soldier to target a missile strike on his platoon's location [1]. Unfortunately, most OSINT work requires the majority of the heavy lifting to be done manually by human operators. Understanding not only the inter-post context of a social media artifact, let alone the surrounding context, is a difficult problem. One issue preventing more advanced computational success in this scenario is the difficulty of understanding multi-modal posts on social media. Social media posts are no longer just small snippets of text, and indeed posts containing images and videos have surpassed the amount of text that is now posted on social media [*]. To further computational OSINT capabilities, multi-modal understanding is critical.

One recent advance in multi-modal understanding in the computer sciences has been the introduction of Contrastive Language-Image Pre-training (hereafter CLIP) [25]. By training models on pairs of images and text, the model can learn the similarities between the two modalities. CLIP models have become ubiquitous in the field and form the foundation of many exciting new multimodal AI tools today such as Stable Diffusion [29] and DALL-E [26]. These models work extremely well for relating images and text together. Unfortunately, as discovered by Theisen and Scheirer [35], CLIP models trained on non-social media data do not work when applied to social media data. This means that models for use on social media data need to be trained from scratch. While CLIP models for three modalities have been explored (lightly) in the literature, they have yet to be applied to social media. Additionally, we formalize the extension of CLIP loss to dimensions higher than three and give a novel demonstration of what a quadmodal model would look like.

In short, the paper makes the following contributions:
1. A formal extension of two contrastive losses to 3+ modalities
2. A new data set of social media posts containing videos and text, out of which a trimodal data set can be synthesized
3. Publicly available trimodal models
4. A demonstration of trimodal CLIP applications applied to social media, via a binary stance classifier and a multiclass account provenance classifier
5. A novel proof-of-concept quadmodal CLIP model, providing a new baseline for quadmodal models in the future
2. Related Work

Motivating Works: A number of pipelines intended to aid social media understanding have been published. One part of computational social media analysis is understanding the interplay between different types of items posted on social media such as images, videos, and text. Early work in social media image understanding was done by Zannettou et al. [45], Beskow et al. [5], Dubey et al. [10], and Theisen et al. [34]. All of these were published prior to CLIP, and thus suffer from focusing on only a single modality (images). A similar work that expands across modalities is "Few-shot Learning for Multi-modal Social Media Event Filtering" by Nascimento et al. [22]. This work uses CLIP to embed image-text pairs into the same space to increase the ability to detect "events" from social media posts. Another such project is "MEWS: Misinformation Early Warning System", demoed by Ford et al. [11]. While this project extracts embeddings for more than two types of modalities, it keeps them in separate latent spaces and instead does the comparison post-hoc.

Autoencoders: The models used in this paper are built on top of pre-existing work. We use four different off-the-shelf models for the four modalities studied in this work: video, image, text, and audio. For vision we use a masked autoencoder from Meta [12]. We pair with this a multilingual DistilBERT model [31] [27]. For video we use another masked autoencoder, VideoMAE [36]. Audio feature extraction was done using wav2vec2 [4]. Having strong features is the foundation on top of which contrastive learning operates. We did not task pre-train these autoencoders in this work.

Multimodal Embedding Models: Early work in multimodal understanding was focused primarily on "fusion" techniques, or methods of combining relatively unrelated features post-hoc [23] [37]. The two most common fusion techniques were early fusion and late fusion. Early fusion [37] [24] relies on concatenating feature vectors, as can be seen in [34]. Late fusion uses a weighted-average technique to consider the monomodal feature vectors differently [44] [40] [47]. While fusion worked, it was rather inelegant and struggled to capture the interplay between modalities at a fundamental level. As fusion techniques continued to be developed, tensor-based fusion [20] [43] [14] and low-rank modality fusion rose to the fore [19]. These methods used an outer-product method to learn both intra- and inter-modality features. As contrastive losses have become more prevalent, fusion techniques have fallen by the wayside.

Prior to the introduction of contrastive loss, triplet loss had been shown to work on bimodal data sets with slight modifications. In 2018 Deng et al. [9] showed that triplet loss could be used to facilitate hashing methods for cross-modal retrieval. Wang et al. [39] demonstrated that expanding the triplet loss with additional terms and improving neighborhood constraints could increase retrieval results across modalities.

This work is heavily based on prior advances in multimodal machine learning. First and foremost is the introduction of Contrastive Language-Image Pre-training (CLIP) by Radford et al. in 2021 [25]. CLIP allows multimodal pairs (originally image-text pairs) to be contrasted into a shared latent space. The algorithm introduced in CLIP led to a flurry of papers focused on multimodal understanding such as VideoCLIP [42], Wav2CLIP [41] (a method for learning audio representations via CLIP), CLIP-NeRF [38] (allowing for the manipulation of neural radiance fields), and C-CLIP [35] (a version of CLIP trained explicitly for social media). C-CLIP is the most similar work, and the one we compare against, due to its focus on social media.

Extending contrastive loss to three modalities has been previously explored in the literature [2] [18] [3]. Mai et al. [21] proposed what they termed a "Hybrid Contrastive" model for projecting audio, images, and text into a shared latent space with the goal of using the combined vectors to perform sentiment analysis. Another exploration of trimodal learning via contrastive loss was by Ruan et al. [30] in 2024. In their work "TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval" they showed that one can successfully map three modalities (voxel shapes of tables and chairs, images of these voxel shapes from different angles, and text descriptions of the shapes) into the same latent space. Our experimental setup is similar to theirs, but we instead optimize for social media data, due to the task discrepancy outlined in Theisen and Scheirer [35].

Classifiers — Binary and Multiclass: To demonstrate the usefulness of the models we demonstrate two classifiers, a binary classifier and a multiclass classifier. Three common techniques were tested using the features extracted from our models, for both binary and multiclass applications: Naive Bayes [28], Random Forests [7], and SVMs (Support Vector Machines) [13]. In addition to these three, we train two basic models, one binary and one multiclass, following the outline provided by Tam [32] [33].
3. N-modal Losses

For this work we explore the extension of two common loss functions used in bimodal learning, triplet loss and contrastive loss, in a manner similar to Ruan et al. [30]. Given below is a brief overview of the two loss functions at the bimodal level, followed by an explanation of our extension of these two losses to a higher number of modalities.

Bimodal Triplet Loss: Triplet loss involves three elements: an anchor, a positive example, and a negative example. The goal is to ensure that the anchor is closer to the positive example than to the negative example in the embedding space. Triplet loss can be mathematically expressed as:

L_{triplet} = \max\{sim(A, N) - sim(A, P) + margin, 0\}   (1)

where A represents the anchor, P the positive example, N the negative example, sim() is a similarity function, and margin is a hyperparameter. When used for bimodal learning, the positive sample and the anchor are from the two different modalities associated with the ground-truth pair, while the positive sample and the negative sample are from the same modality but different pairs. In this manner, the loss function ensures the cross-modal positive samples are closer than the cross-modal negative samples, therefore allowing for bimodal comparisons in the latent space.
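To make Eq. (1) concrete, the following is a minimal PyTorch sketch of a similarity-based bimodal triplet loss. The tensor names and the margin value are illustrative assumptions, not the paper's released code.

import torch
import torch.nn.functional as F

def bimodal_triplet_loss(anchor, positive, negative, margin=0.2):
    """Similarity-based triplet loss in the spirit of Eq. (1).

    anchor:   embeddings from one modality, shape (batch, dim)
    positive: embeddings of the paired artifact from the other modality
    negative: embeddings of a non-matching artifact from the positive's modality
    """
    # Cosine similarity plays the role of sim() in Eq. (1).
    sim_ap = F.cosine_similarity(anchor, positive, dim=-1)
    sim_an = F.cosine_similarity(anchor, negative, dim=-1)
    # Hinge: push sim(A, P) above sim(A, N) by at least the margin.
    return torch.clamp(sim_an - sim_ap + margin, min=0).mean()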
N-Modal Triplet Loss: A basic extension of triplet loss to include additional modalities beyond two is relatively straightforward. One can simply iterate over the different modality combinations to generate comparisons across all modality pairings. We can generalize an N-modal triplet loss function over a batch of size N and M modalities as:

L_{total} = \alpha \sum_{i=0}^{N-1} \sum_{a=1}^{M} \sum_{p=1}^{M} \sum_{q=1}^{M} L(e^{a}_i, e^{p}_i, e^{q}_{(i+1) \bmod N})   (2)

where:

• e^{a}_i, e^{p}_i are the embeddings for the anchor and positive examples in the i-th element, selected from the set of M modalities,
• e^{q}_{(i+1) \bmod N} is the embedding for the negative example from the cyclically next element in the batch, ensuring a diverse selection,
• L(a, p, n) represents the triplet loss between an anchor a, a positive p, and a negative n example,
• \alpha is a scaling factor applied to the total loss.

The indices a, p, q iterate over the modalities (1 through M), allowing for all combinations of anchor, positive, and negative selections within and across modalities. Much like in bimodal triplet loss, the anchor, positive, and negative are computed across modalities. When three or more modalities are present, one can simply brute-force all triplets across modalities for which the triplet loss can be computed. Unsurprisingly this is a costly loss to compute: with three modalities it requires 9 comparisons, and with four it would require 16. So while it works, it scales very poorly, requiring M^2 computations, where M is the number of modalities. This appears directly in the runtimes of training a model using this loss, as can be seen in Table 1.
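A minimal sketch of brute-forcing Eq. (2) over modalities is given below. It assumes the per-modality projected embeddings arrive as a list of (batch, dim) tensors and uses the cyclic batch shift from the equation to pick negatives; the function name and margin are illustrative, not the paper's released implementation.

import torch

def nmodal_triplet_loss(embeddings, margin=0.2, alpha=1.0):
    """Brute-force N-modal triplet loss following Eq. (2).

    embeddings: list of M tensors, one per modality, each of shape (N, dim),
                where row i of every tensor comes from the same post.
    """
    embeddings = [torch.nn.functional.normalize(e, dim=-1) for e in embeddings]
    M = len(embeddings)
    total = 0.0
    for a in range(M):            # anchor modality
        for p in range(M):        # positive modality
            for q in range(M):    # negative modality
                anchor = embeddings[a]
                positive = embeddings[p]
                # Negative: same modality q, but the cyclically next batch element.
                negative = embeddings[q].roll(shifts=-1, dims=0)
                sim_ap = (anchor * positive).sum(dim=-1)
                sim_an = (anchor * negative).sum(dim=-1)
                total = total + torch.clamp(sim_an - sim_ap + margin, min=0).mean()
    return alpha * total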
Bimodal CLIP Loss: In contrast to triplet loss, CLIP loss is a method used to learn embeddings by contrasting positive pairs (similar or related items) against all other negative pairs (dissimilar or unrelated items). The goal is to ensure that positive pairs are closer in the shared embedding space than negative pairs. The loss is defined as:

L_{CLIP} = -\frac{1}{2n} \sum_{i=1}^{n} \left[ \log \frac{e^{sim(I_i, T_i)/\tau}}{\sum_{j=1}^{n} e^{sim(I_i, T_j)/\tau}} + \log \frac{e^{sim(T_i, I_i)/\tau}}{\sum_{j=1}^{n} e^{sim(T_j, I_i)/\tau}} \right]   (3)

where I_i and T_i are the embeddings of the image and text in the i-th pair, respectively [25] [46].

N-Modal CLIP Loss: N-Modal CLIP Loss extends the traditional concept of contrastive loss to accommodate more than two modalities. In N-Dimensional CLIP Loss, data from N different modalities are projected into a shared embedding space. Similarly to the bimodal loss shown above, the objective is to minimize the distance between corresponding tuples from different modalities while maximizing the distance between non-corresponding tuples. We take as our "ground-truth" tuples the N artifacts present in a single social media post, where each artifact is from a different modality. If bimodal contrastive loss can be visualized as a grid of pairs, trimodal contrastive loss can be visualized as a cube (refer to Figure 1). More than three modalities becomes harder to visualize, but the process of the extension is the same as going from 2 to 3 modalities.

The loss function for N-Dimensional CLIP Loss is given by:

L_{N-contrastive} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} -\log \frac{e^{sim(M_i, M_j)/\tau}}{\sum_{k=1, k \neq i}^{N} e^{sim(M_i, M_k)/\tau}}   (4)

where M_i represents the embedding from the i-th modality, sim() is a similarity function (like cosine similarity), and \tau is a temperature parameter that scales the similarity scores. As per Ruan et al. [30], the final step of the loss function is summing the pairwise losses. Whereas they pin it to three dimensions, we choose to represent this as a summation across any number of dimensions, averaged by 2N. This is what formally allows our loss to extend to any number of modalities.
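The sketch below shows one common batch-wise reading of this objective, assuming the standard CLIP recipe of contrasting each post against the other posts in the batch for every ordered pair of modalities. Function and variable names are illustrative and not taken from the paper's code.

import torch
import torch.nn.functional as F

def nmodal_clip_loss(embeddings, temperature=0.07):
    """Pairwise CLIP-style loss summed over all ordered modality pairs.

    embeddings: list of N tensors, one per modality, each (batch, dim),
                where row i of every tensor comes from the same post.
    """
    embeddings = [F.normalize(e, dim=-1) for e in embeddings]
    batch = embeddings[0].shape[0]
    targets = torch.arange(batch, device=embeddings[0].device)
    losses = []
    for i in range(len(embeddings)):
        for j in range(len(embeddings)):
            if i == j:
                continue
            # Cosine similarity matrix between modality i and modality j.
            logits = embeddings[i] @ embeddings[j].t() / temperature
            # The matching post is the positive; all other posts in the batch are negatives.
            losses.append(F.cross_entropy(logits, targets))
    return torch.stack(losses).mean()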
New Dataset: Social media datasets with more than two modalities are shockingly difficult to come by. To foster this work we introduce a new dataset collected from Telegram. This dataset consists of Telegram posts containing a video of at least 64 frames and text of at least 5 words. Extending the collection methodology outlined in Theisen and Scheirer [35], we collect videos and text instead of images and text. Over the course of the project we collected 69,831 video-text pairs from 33 accounts randomly selected from the list they provided. The discerning reader will notice that this dataset still only contains two modalities, videos and text. A standard video transformer only operates on a small number of frames from the overall video (in this work, 16 frames). This means that we can treat a frame not considered by the video transformer as a separate but related image. While imperfect, this allows us to generate a trimodal dataset out of a bimodal video-text dataset. This new dataset will be made available upon publication, along with any code involved with the work, and all models trained.
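A minimal sketch of this trimodal synthesis step is shown below. It assumes a decoded video is available as a (num_frames, H, W, C) array, that 16 uniformly spaced frames go to the video encoder, and that the "image" artifact is drawn from the remaining frames; the helper name and sampling scheme are assumptions for illustration.

import numpy as np

def synthesize_trimodal_sample(video_frames, text, num_video_frames=16):
    """Split one video-text post into a (video clip, image, text) tuple.

    video_frames: array of shape (num_frames, H, W, C) with num_frames >= 64.
    """
    total = len(video_frames)
    # Frames the video transformer will actually see (uniformly spaced).
    clip_idx = np.linspace(0, total - 1, num_video_frames).round().astype(int)
    clip = video_frames[clip_idx]
    # Any frame NOT consumed by the video encoder can act as the related "image" artifact.
    leftover = np.setdiff1d(np.arange(total), clip_idx)
    image = video_frames[np.random.choice(leftover)]
    return clip, image, text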
4. Building Trimodal Models

To show that the losses can successfully be extended to more than two dimensions, we build two different models based on the N-dimensional triplet and contrastive losses, hereafter referred to as triTRIP and triCLIP respectively, with the amount of training data given as a suffix after the model name.

A bimodal model is typically built with two encoders, one for each modality, on top of which one or more projection layers are placed. It is these projection layers that are then trained using the loss to minimize the distance between the paired encodings. By using the trimodal contrastive loss extension we can perform a similar process using tuples of three items which are encoded separately. For this paper we use the following three encoders: text — 'distilbert-base-multilingual-cased' [15], image — 'facebook/vit-mae-base' [16], video — 'MCG-NJU/videomae-base' [17]. All three of these encoders are used as-is and are publicly available from HuggingFace. These encoders also have the property of being transformers, meaning their default output length is 768. This means that, in theory, the projection layer could output a vector of up to length 768. Prior to encoding, videos and images were resized to 244 x 244 pixels, standard practice. Of particular importance was the selection of the text encoder. The corpus is multilingual and it was important to select an encoder that could handle the various languages present in the collected data. As recommended by Theisen and Scheirer [35], we use a multilingual DistilBERT model in order to handle the multilingual dataset without first needing to translate the texts.
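Since the encoders are used off the shelf and the features are pre-extracted, extraction amounts to running each artifact through its HuggingFace model and keeping a 768-dimensional vector. Below is a minimal sketch for the text branch; the first-token pooling and function name are assumptions for illustration, and the image ('facebook/vit-mae-base') and video ('MCG-NJU/videomae-base') branches follow the same AutoModel pattern.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
text_encoder = AutoModel.from_pretrained("distilbert-base-multilingual-cased")

@torch.no_grad()
def extract_text_feature(text):
    # Tokenize the (possibly non-English) post text; no translation is needed
    # because the encoder is multilingual.
    batch = tokenizer(text, truncation=True, return_tensors="pt")
    hidden = text_encoder(**batch).last_hidden_state   # (1, seq_len, 768)
    # Keep a single 768-dimensional vector per artifact (first-token pooling).
    return hidden[:, 0].squeeze(0)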
The implementation is based on Shariatnia's implementation of CLIP, but extended to allow for more than two encoders. For the triTRIP and triCLIP implementations, each of the three encoders is given its own projection head, through which we train a projection layer. The projection heads are how the model takes a high-dimensional vector (in the case of our three encoders, 768-dimensional) and projects it into a shared, lower-dimensional latent space (256 dimensions in our standard implementation of triTRIP and triCLIP). To train, the encodings are first calculated for every item in the batch; all of these embeddings are then passed through the projection heads assigned to their modality. With the output of the projection heads the loss is then calculated. The runtimes of training are given in Table 1 for the models at various amounts of training data and epochs. The models were trained using the standard Adam optimizer on a single GPU (out of a heterogeneous collection). A batch size of 128 was used.
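The following sketch shows the shape of a projection head and one training step as described above (pre-extracted 768-dimensional features, per-modality heads, a 256-dimensional shared space, and the Eq. (4) loss as sketched earlier in Section 3). The head architecture is an assumption modeled on common CLIP projection heads, not the authors' exact code.

import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Maps a 768-d encoder feature into the shared 256-d latent space."""

    def __init__(self, in_dim=768, out_dim=256, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.mlp = nn.Sequential(nn.GELU(), nn.Linear(out_dim, out_dim), nn.Dropout(dropout))
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x):
        projected = self.proj(x)
        return self.norm(projected + self.mlp(projected))

# One projection head per modality; the encoders stay frozen (features are pre-extracted).
heads = nn.ModuleDict({m: ProjectionHead() for m in ["text", "image", "video"]})
optimizer = torch.optim.Adam(heads.parameters(), lr=1e-3)

def training_step(batch_features):
    """batch_features: dict mapping modality name -> (128, 768) tensor of features."""
    projected = [heads[m](batch_features[m]) for m in ["text", "image", "video"]]
    loss = nmodal_clip_loss(projected)  # the Eq. (4) sketch given in Section 3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()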
Evaluation of the models was done on a post-level retrieval task: given a single-modality artifact from a post, the goal is to determine which post it belongs to. Retrieval was measured at K=1, 5, 10, and 25. As the similarities were calculated at the artifact level, they had to be collated to the post level. This was done by simply summing the similarities of a post's artifacts whenever they were returned. A visualization of this can be seen in Fig. 2. For each trial there was a population of 100 posts, all of whose artifacts were tested as the query embedding, leading to 300 queries being performed during test time. All tests were then repeated 5 times, including training, to get an average recall score over the 5 trials (deviations of recall can be found in the supplemental material).

[Figure 2: randomly selected test posts have their artifacts embedded through the projection layer; a query embedding is compared by cosine similarity to embeddings of the other modalities, and the top-K similarities are summed per post (e.g., Post 2 = 0.91 + 0.78).]

Figure 2. The evaluation method for measuring the recall of our models. The chosen embedding is compared only to those embeddings that are not of the same modality, to highlight the cross-modal abilities of the model. Similarities for the top-K embeddings are then summed together when they are from the same post.
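A sketch of the post-level collation described above and in Figure 2 is given below, assuming pre-computed, L2-normalized projected embeddings. The data layout (a list of per-post embeddings keyed by modality) and function name are illustrative.

import torch

def post_recall_at_k(query, query_modality, posts, true_post, k=5):
    """posts: list over the test posts; each entry is a dict mapping
    modality name -> L2-normalized projected embedding of shape (dim,)."""
    similarities = []
    for post_id, artifacts in enumerate(posts):
        for modality, emb in artifacts.items():
            if modality == query_modality:
                continue  # the query is only compared across modalities
            similarities.append((post_id, torch.dot(query, emb).item()))
    # Take the top-K individual artifact similarities, then sum them per post.
    scores = {}
    for post_id, sim in sorted(similarities, key=lambda c: c[1], reverse=True)[:k]:
        scores[post_id] = scores.get(post_id, 0.0) + sim
    # A hit at K means the true post is represented among the top-K artifacts.
    return true_post in scores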

Average Training Time for Models by Epochs (5 Trials)

Model         1 Epoch     10 Epochs   50 Epochs   100 Epochs
triTRIP-100   00:00:14    00:00:31    00:01:36    00:01:57
triCLIP-100   00:00:12    00:00:13    00:00:21    00:00:17
triTRIP-1k    00:00:22    00:02:24    00:07:16    00:23:32
triCLIP-1k    00:00:13    00:00:11    00:00:16    00:00:19
triTRIP-10k   00:01:48    00:17:28    01:16:42    03:47:24
triCLIP-10k   00:00:10    00:00:16    00:00:38    00:01:13
triTRIP-50k   00:11:13    01:53:11    05:59:07    18:52:17
triCLIP-50k   00:00:15    00:00:41    00:02:33    00:05:15

Table 1. The average training times for the trimodal models in HH:MM:SS format. Features were pre-extracted, leading to much shorter training times than one might expect. The triTRIP models take much longer to train due to the modality-pairwise computations required.
5. Experiments and Results

In total, 41 models were trained. triTRIP and triCLIP models were trained on training sets of sizes 100, 1000, 10000, and 50000 posts (each of which has three artifacts). Each of these training sizes was trained for 4 different epoch counts: 1, 10, 50, and 100. The runtimes for training can be seen in Table 1. The cost of training a triTRIP model is significantly higher than training a triCLIP model due to the modality-pairwise triplet loss computations that need to be performed (the author is willing to allow their optimization skills to be called into question at this time). The features used during training were pre-extracted, so if one were to run the pipeline from scratch, time would need to be added for feature extraction (for reference, extracting features from 50,000 posts took roughly 21 hours). Regrettably, the increased training time required for triTRIP models does not correlate to higher accuracy on the retrieval task, especially at K=1.

To test the success of the training we evaluated models on a simple retrieval task: given a monomodal social media artifact, we attempt to retrieve the post the artifact belongs to. The model is attempting to maximize the similarity between a tuple of three artifacts. If we are querying a video, then the post the model should return is the one containing said video. However, if we compare this video against all three artifacts for every post, the task becomes too easy, as the most similar object would of course be the video itself, which would overly bias the recall metric. For this reason, when we query an artifact we query it only across modalities, i.e., if a text is queried it is only compared to images and videos. This means the most similar posts returned are most similar to the text by virtue of their images and videos, and it avoids achieving a 1.0 recall because the system simply retrieves the post that contains the query data directly. This also highlights the model's ability to learn cross-modal similarities, the goal of the work.

As mentioned in Sec. 4, the projection layer could theoretically keep the output vector at 768 dimensions, as this is what all three input vectors are. A natural instinct is to assume that a higher output dimensionality would lead to higher recall; however, per Table 4 this appears not to be the case. Our hypothesis is that, due to the original task-mismatch of the three transformers used, the higher-dimension projection layers lead to the retention of information extraneous to the task at hand (relating the three modalities to each other). By reducing the output dimensionality the model is able to shed the information that is unrelated to the task and focus the output embeddings on cross-modality relations.
Average Recall for Models @K by Epochs Trained (5 Trials)

               1 Epoch                            10 Epochs                          50 Epochs                          100 Epochs
Model          @1      @5      @10     @25        @1      @5      @10     @25        @1      @5      @10     @25        @1      @5      @10     @25
triTRIP-100    1.00%   5.00%   9.33%   25.80%     1.13%   5.73%   12.20%  29.27%     2.27%   7.47%   15.47%  34.67%     3.53%   14.27%  23.80%  46.93%
triCLIP-100    0.60%   4.80%   9.60%   24.73%     0.80%   6.13%   13.13%  30.20%     2.27%   11.80%  21.20%  42.73%     4.53%   16.07%  26.60%  52.73%
triTRIP-1k     1.73%   6.00%   11.40%  30.07%     2.93%   12.07%  21.40%  41.40%     10.80%  34.27%  46.07%  68.80%     15.67%  41.00%  56.33%  76.67%
triCLIP-1k     1.40%   7.13%   13.60%  29.47%     4.40%   18.27%  28.60%  49.13%     15.00%  40.13%  53.27%  72.33%     15.47%  40.20%  56.53%  74.07%
triTRIP-10k    2.67%   11.67%  21.00%  44.33%     21.27%  45.33%  57.47%  75.73%     34.33%  62.80%  71.40%  86.53%     32.87%  60.00%  70.13%  84.73%
triCLIP-10k    4.33%   17.53%  27.13%  49.40%     24.53%  55.00%  68.60%  83.13%     60.47%  76.27%  80.87%  88.33%     59.47%  73.40%  79.07%  87.27%
triTRIP-50k    12.80%  36.13%  47.40%  68.73%     28.67%  61.33%  73.13%  86.33%     46.60%  73.60%  83.33%  93.67%     45.40%  73.87%  81.00%  91.27%
triCLIP-50k    14.27%  38.73%  52.20%  71.47%     57.73%  73.33%  78.07%  86.20%     66.13%  80.33%  85.07%  92.20%     67.47%  79.60%  85.20%  91.40%

Table 2. The recall results of the models trained, averaged across 5 training trials. Evaluated given a single-modality artifact (text, image, or video), the goal was to return the post the artifact originated from. For each of the 5 trials, 300 artifacts were queried. Unsurprisingly the models achieve higher recall when trained on more data for longer, with triCLIP-50k achieving the best results (split across 50 epochs and 100 epochs).
While the original CLIP paper pins the output dimensions at 256, the findings beg the question of whether reducing the output dimensions further would increase recall. Therefore, in addition to testing recall with output dimensions of 512 and 768, we test recall at 128 and 64. As can be seen in Table 4, a projection layer of size 256 leads to the best results for the majority of K values (and is within the standard deviation of the best result when it is not). Thus, for all other model results reported in this paper, we use the default CLIP projection layer size of 256 dimensions.

Comparison of Recall @K with a Post Population of 100

Model             @1       @5       @10      @25
M-CLIP [8]        23.40%   38.80%   47.70%   66.70%
C-CLIP [35]       16.09%   34.89%   48.59%   68.10%
bi-triTRIP-10k    12.40%   34.60%   50.27%   73.87%
bi-triCLIP-10k    10.93%   27.20%   39.87%   66.53%
triTRIP-10k       34.33%   62.80%   71.40%   86.53%
triCLIP-10k       60.47%   76.27%   80.87%   88.33%

Table 3. A comparison to baseline M-CLIP [8] and C-CLIP [35] models from the literature. The triTRIP and triCLIP models were trained on image-video-text tuples taken from similar Telegram accounts. All models were trained on 10,000 items for 50 epochs, other than M-CLIP (which was trained on 7M image-text pairs). The bi-tri models are evaluated on image-text pairs only, to allow for a more direct comparison to the baseline.

In order to ground our results to prior work, we compare with the most similar bimodal CLIP application. Theisen and Scheirer developed a C-CLIP model on a similar dataset of Telegram image-text pairs. Table 3 shows that when trained on the same number of items (image-text pairs in their case, image-text-video tuples in ours) we achieve recall results that are better than other state-of-the-art social media CLIP models, thus demonstrating that the extension of a model to three modalities using the aforementioned loss functions works. An intuitive explanation for the increase in accuracy could be that, because the correct post has two artifacts floating around in the population, the combination of their two similarity scores makes the correct post more likely to be identified. To demonstrate this, we also test our models on bimodal data (image-text pairs) after training on trimodal data. The results seem to support the hypothesis, with the image-text-only retrieval results being similar to the baseline (though slightly lower).
6. Applications in Trimodal Space

To demonstrate the usefulness of N-dimensional CLIP loss applied to OSINT scenarios, we demonstrate two applications of an N-dimensional model: a stance classifier for pro-Ukrainian or pro-Russian social media artifacts and a provenance classifier, labeling an artifact as belonging to a certain Telegram account.

Figure 3. The receiver operating characteristic (ROC) curves for the stance classifiers on 10,000 posts, along with the area under the curve for each classifier. As can be seen, all classifiers achieve results well above the baseline. The accuracy table on the right shows that Random Forests achieve the highest stance classification accuracy at 80.91%, though all methods other than Naive Bayes were within 1% of each other.

Figure 4. The ROC curves and AUC for the account classifiers, alongside the per-method classification accuracies. Random Forests achieved a 64.57% accuracy across 10,000 posts when using triCLIP-50k features. Much like with the binary classifier, Naive Bayes performed significantly worse than any of the other methods.

Binary Stance Classifier: We demonstrate binary stance classifiers that use our trimodal model to extract embeddings, allowing the classification of videos, images, and text as either pro-Russian or pro-Ukrainian. Figure 3 gives the ROC curves for the classifiers using the various models trained. The models were evaluated using a 5-fold cross-validation technique. The baseline accuracy of a binary classifier is 50%, and our classifiers achieve accuracies of 94.23%. While we train four single-purpose binary classifiers following Tam's example [32], the encodings from the triCLIP models could be utilized by any of the other common binary classification methods such as random forests [7], naive Bayes [28], or SVMs [13]. The four models trained were each given features from a different triCLIP model, with the amount of training data the model saw varying. This was to explore the effect that more training data had on a downstream classification task. Interestingly, using features from a model with more training data did not appear to improve the classification accuracy (the models used for feature extraction were all trained for 50 epochs).

Table ?? shows the classification results of our model compared to the three other common methods. Our simple model matches the best AUC result by a baseline (Random Forest) with an AUC of 0.86. The results here are not meant to show that our simple binary classification model is state-of-the-art, but instead to demonstrate the usefulness of features extracted using the triCLIP models developed above. With these new models, operators can classify the stance of an image, a text, or a video with a high degree of accuracy using simple out-of-the-box binary classifiers, with the features for all three modalities coming from a single model.
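As a concrete illustration of this downstream use, the sketch below fits one of the off-the-shelf classifiers mentioned above (a random forest) on projected triCLIP features with 5-fold cross-validation. The array names and placeholder data are assumptions for illustration, not the evaluation protocol used in the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# features: (num_artifacts, 256) array of projected triCLIP embeddings
# stances:  (num_artifacts,) array of 0 (pro-Russian) / 1 (pro-Ukrainian) labels
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))   # placeholder data
stances = rng.integers(0, 2, size=1000)   # placeholder labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, features, stances, cv=5, scoring="roc_auc")
print(f"5-fold ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")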
Account Provenance Classifier: We also show a multiclass account classifier using the triCLIP models, allowing the user to identify probabilistically which account a given artifact originated from. Figure 4 shows the accuracy on the train and test sets as a function of the number of epochs trained for. The final accuracy at epoch 5000 was 67.9%. With 33 accounts in the testing set, the baseline random accuracy is 3.03%, showing that the classifier works quite well. Table ?? shows the classification results alongside three other methods of multi-class classification (KNN, Naive Bayes, and SVM). Due to the large imbalance between the number of posts each account has in the dataset, we make use of the synthetic minority over-sampling technique (SMOTE) by Chawla et al. [6] to artificially balance the dataset.
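A minimal sketch of this balancing plus multiclass step is shown below, using the imbalanced-learn implementation of SMOTE before fitting a KNN classifier (one of the three methods listed above). The variable names, placeholder data, and train/test split are illustrative assumptions.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# features: (num_artifacts, 256) projected triCLIP embeddings
# accounts: (num_artifacts,) integer account labels (33 classes, heavily imbalanced)
rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 256))     # placeholder data
accounts = rng.integers(0, 33, size=2000)   # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    features, accounts, test_size=0.2, stratify=accounts, random_state=0)

# Oversample the minority accounts only in the training split.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")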
Effect of Projection Layer Dimensions on Recall @K for triCLIP-10k (5 Trials)

           10 Epochs                          50 Epochs
Output     @1       @5       @10      @25     @1       @5       @10      @25
64D        24.60%   53.27%   65.87%   76.80%  58.67%   68.80%   75.20%   87.00%
128D       22.40%   52.93%   63.60%   76.07%  56.53%   73.13%   78.33%   87.33%
256D       24.53%   55.00%   68.60%   83.13%  60.47%   76.27%   80.87%   88.33%
512D       27.73%   54.40%   65.27%   79.53%  54.43%   71.33%   77.00%   86.53%
768D       27.20%   55.27%   64.93%   77.87%  54.13%   71.07%   76.73%   85.00%

Table 4. The effects of the size of the projection layer output on recall and train time. The model tested was trained on 10,000 posts at 10 and 50 epochs. Interestingly, a higher output dimensionality does not directly correlate to a higher recall.

7. Quadmodal Contrastive Loss

To demonstrate that the N-dimensional triplet loss formalized above extends beyond the trimodal applications shown above, we train a simple proof-of-concept model on tuples with 4 modalities, the first of its kind to our knowledge. In addition to image, video, and text we add audio. As outlined above, current video encoders simply sample several frames from a video and treat them as images. Due to this process the audio of a video plays no role in the generation of the embedding and is more or less entirely cast aside. By extending our model to 4 dimensions, via an additional audio encoder that can take in the audio from a video, we can avoid losing the information that this signal provides and use it as another positive sample in our tuple.

We use an audio encoder similar to the ones used for the other three modalities: wav2vec2 [41]. Using this encoder we can extract an additional audio embedding, which is otherwise ignored when the video embedding is generated. The audio is sampled at 16,000 Hz, and then a single 768-dimensional vector is extracted from the audio track. Using a variation of Equation 4 pegged to four modalities, we can train a model following the same process as done for the triCLIP models.
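A sketch of this fourth branch is shown below: audio resampled to 16 kHz is passed through a wav2vec2 encoder, mean-pooled to a single 768-dimensional vector, and appended to the tuple fed to the N-modal loss sketched in Section 3. The checkpoint name ('facebook/wav2vec2-base') and the pooling choice are assumptions for illustration.

import torch
from transformers import AutoFeatureExtractor, AutoModel

audio_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
audio_encoder = AutoModel.from_pretrained("facebook/wav2vec2-base")

@torch.no_grad()
def extract_audio_feature(waveform_16khz):
    """waveform_16khz: 1-D float array of audio samples resampled to 16,000 Hz."""
    batch = audio_extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    hidden = audio_encoder(**batch).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)                # pool to a single 768-d vector

# At training time the audio vector is simply a fourth entry in the tuple, e.g.:
#   projected = [heads[m](features[m]) for m in ["text", "image", "video", "audio"]]
#   loss = nmodal_clip_loss(projected)   # the Eq. (4) sketch, now with four modalities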
Table 5 gives the retrieval results for the quadCLIP models, evaluated in the same manner as the triCLIP models from Section 4. As the model is intended more as a proof-of-concept, we train on sets of only 100, 1,000, and 10,000 items and for only 1, 10, and 50 epochs. The best results were achieved by the model trained for the most epochs on the most data. Similarly to the triTRIP models, the results at K=1 are much lower than the triCLIP counterpart, though they get closer as K grows larger. While these results demonstrate that, as is, the quadCLIP model could prove useful, more exploration is needed to understand why quadCLIP experiences the drop in retrieval, especially at lower values of K.

Average Recall for quadCLIP Models @K (5 Trials)

           quadCLIP-100                       quadCLIP-1k                        quadCLIP-10k                       triCLIP-10k
Epochs     @1      @5      @10     @25        @1      @5      @10     @25        @1      @5      @10     @25        @1      @5      @10     @25
1          0.67%   7.00%   13.33%  30.00%     0.67%   4.00%   9.00%   25.67%     2.00%   12.67%  19.67%  37.33%     4.33%   17.53%  27.13%  49.90%
10         0.33%   3.00%   7.00%   26.00%     2.67%   14.00%  21.67%  41.00%     10.00%  27.33%  36.00%  52.00%     24.53%  55.00%  68.60%  83.13%
50         2.67%   6.33%   14.67%  34.67%     5.00%   18.67%  29.67%  48.33%     15.67%  37.00%  47.00%  61.67%     60.47%  76.27%  80.87%  87.27%

Table 5. The recall results for the quadCLIP model (and triCLIP-10k for easy comparison). They are noticeably worse than the results for triCLIP models at the same number of epochs and training data, but further exploration is required to determine the reason.

While theoretically the N-dimensional contrastive loss could be scaled to an infinite number of modalities, it seems likely a ceiling would quickly take effect. In reality there are only so many modalities present on social media, so until scratch-and-sniff monitors are released, scaling above four dimensions seems unlikely. Our novel results provide a benchmark for future efforts in the quad-modal space.
8. Future Work and Conclusions

Future Work: We demonstrate that trimodal contrastive loss models can successfully be applied to social media. A possible avenue for future work is exploring the effects the encoders have on their individual modalities, in addition to jointly training the encoders on the data. This work leveraged out-of-the-box encoders and performed no pre-training. One could expect the results to improve were this done.

The retrieval results of quadCLIP show a drop relative to similar trimodal models. The underlying causes of this are unclear, though it would be natural to suspect it has something to do with the audio embeddings. More exploration needs to be done on successfully embedding audio for use in a quadmodal model. New datasets of social media data where each post contains more than two modalities would also be of use to future efforts in this space.

Conclusions: To further our computational analysis of social media, it is of vital importance to take the entire context of a post into consideration, which includes studying all of its modalities. The amount of video available on social media only continues to grow. By extending contrastive loss to N dimensions we can theoretically enable the comparison of any number of modalities on social media. High accuracy on media retrieval is useful for OSINT operators trying to understand and discover similar posts in fast-moving multi-modal situations.

In addition to an extension of contrastive loss and a display of its usefulness in retrieval settings, we demonstrate the usefulness of the trimodal extension in various use cases on social media. Using a triCLIP model allows the classification of three different modalities using a single model. We show results on two different classification tasks: a binary stance classification task (pro-Ukrainian or pro-Russian) and a multiclass classification task of account provenance.

We also demonstrate, a first to our knowledge, the extension of CLIP loss to four modalities. The results of quadCLIP provide a new baseline for others in this space moving forward. Having a single model that can handle any modality seen on social media and measure similarity across the modalities is valuable to many non-technical workers in the OSINT space, and extending the tooling on top of the triCLIP and quadCLIP models presented here is an exciting task.

References

[1] Open-source intelligence is piercing the fog of war in Ukraine, Jan 2023. 1
[2] Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. CoRR, abs/2104.11178, 2021. 2

[3] Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelovic, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. CoRR, abs/2006.16228, 2020. 2
[4] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. CoRR, abs/2006.11477, 2020. 2
[5] D. Beskow, S. Kumar, and K. M. Carley. The evolution of political memes: Detecting and characterizing internet memes with multi-modal deep learning. Information Processing and Management, 57(2), 2020. 2
[6] Kevin W. Bowyer, Nitesh V. Chawla, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. CoRR, abs/1106.1813, 2011. 7
[7] L. Breiman. Random forests. Machine Learning, 45:5–32, 10 2001. 3, 6
[8] Fredrik Carlsson, Philipp Eisen, Faton Rekathati, and Magnus Sahlgren. Cross-lingual and multilingual CLIP. In Proceedings of the Language Resources and Evaluation Conference, pages 6848–6854, Marseille, France, June 2022. European Language Resources Association. 6
[9] Cheng Deng, Zhaojia Chen, Xianglong Liu, Xinbo Gao, and Dacheng Tao. Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing, 27(8):3893–3903, 2018. 2
[10] A. Dubey, E. Moro, M. Cebrian, and I. Rahwan. MemeSequencer: Sparse matching for embedding image macros. In Proceedings of the International World Wide Web Conference, 2018. 2
[11] Trenton W. Ford, William Theisen, Michael Yankoski, Tom Henry, Farah Khashman, Katherine R. Dearstyne, and Tim Weninger. MEWS: Real-time social media manipulation detection and analysis, 2022. 2
[12] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. CoRR, abs/2111.06377, 2021. 2
[13] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18–28, 1998. 3, 6
[14] Ming Hou, Jiajia Tang, Jianhai Zhang, Wanzeng Kong, and Qibin Zhao. Deep multimodal multilinear fusion with high-order polynomial pooling. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. 2
[15] HuggingFace. distilbert-base-multilingual-cased. 4
[16] HuggingFace. facebook/vit-mae-base. 4
[17] HuggingFace. MCG-NJU/videomae-base. 4
[18] Yunze Liu, Qingnan Fan, Shanghang Zhang, Hao Dong, Thomas A. Funkhouser, and Li Yi. Contrastive multimodal fusion with TupleInfoNCE. CoRR, abs/2107.02575, 2021. 2
[19] Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Efficient low-rank multimodal fusion with modality-specific factors. CoRR, abs/1806.00064, 2018. 2
[20] Sijie Mai, Haifeng Hu, and Songlong Xing. Divide, conquer and combine: Hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 481–492, Florence, Italy, July 2019. Association for Computational Linguistics. 2
[21] Sijie Mai, Ying Zeng, Shuangjia Zheng, and Haifeng Hu. Hybrid contrastive learning of tri-modal representation for multimodal sentiment analysis. CoRR, abs/2109.01797, 2021. 2
[22] José Nascimento, João Phillipe Cardenuto, Jing Yang, and Anderson Rocha. Few-shot learning for multi-modal social media event filtering, 2022. 2
[23] Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37:98–125, 2017. 2
[24] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. Context-dependent sentiment analysis in user-generated videos. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–883, Vancouver, Canada, July 2017. Association for Computational Linguistics. 2
[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 1, 2, 3
[26] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. 2
[27] Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation, 2020. 2
[28] Irina Rish. An empirical study of the naïve Bayes classifier. IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 3, 01 2001. 3, 6
[29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CoRR, abs/2112.10752, 2021. 2
[30] Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, and Angel X. Chang. TriCoLo: Trimodal contrastive loss for text to shape retrieval. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5815–5825, 2024. 2, 3, 4
[31] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR, abs/1910.01108, 2019. 2
[32] Adrian Tam. Building a binary classification model in PyTorch, Apr 2023. 3, 6
[33] Adrian Tam. Building a multiclass classification model in PyTorch, Apr 2023. 3
[34] William Theisen, Daniel Gonzalez Cedre, Zachariah Carmichael, Daniel Moreira, Tim Weninger, and Walter Scheirer. Motif mining: Finding and summarizing remixed image content. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1319–1328, 2023. 2
[35] William Theisen and Walter Scheirer. C-CLIP: Contrastive image-text encoders to close the descriptive-commentative gap, 2023. 2, 3, 4, 6
[36] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training, 2022. 2
[37] Johannes Wagner, Elisabeth Andre, Florian Lingenfelser, and Jonghwa Kim. Exploring fusion methods for multimodal emotion recognition with missing data. IEEE Transactions on Affective Computing, 2(4):206–218, 2011. 2
[38] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields. CoRR, abs/2112.05139, 2021. 2
[39] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning two-branch neural networks for image-text matching tasks. CoRR, abs/1704.03470, 2017. 2
[40] Chung-Hsien Wu and Wei-Bin Liang. Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Transactions on Affective Computing, 2(1):10–21, 2011. 2
[41] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. Wav2CLIP: Learning robust audio representations from CLIP. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4563–4567, 2022. 2, 8
[42] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. CoRR, abs/2109.14084, 2021. 2
[43] Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Tensor fusion network for multimodal sentiment analysis. CoRR, abs/1707.07250, 2017. 2
[44] Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. CoRR, abs/1606.06259, 2016. 2
[45] S. Zannettou, T. Caulfield, J. Blackburn, E. D. Cristofaro, M. Sirivianos, G. Stringhini, and G. Suarez-Tangil. On the origins of memes by means of fringe web communities. In ACM Internet Measurement Conference, 2018. 2
[46] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, and Curtis P. Langlotz. Contrastive learning of medical visual representations from paired images and text. CoRR, abs/2010.00747, 2020. 3
[47] Yuxuan Zhao, Xinyan Cao, Jinlong Lin, Dunshan Yu, and Xixin Cao. Multimodal affective states recognition based on multiscale CNNs and biologically inspired decision fusion model. IEEE Transactions on Affective Computing, 14(2):1391–1403, Apr. 2023. 2