Distributed Marker Representation For Ambiguous Discourse Markers and Entangled Relations




Dongyu Ru¹, Lin Qiu¹, Xipeng Qiu², Yue Zhang³, Zheng Zhang¹
¹ Amazon AWS AI  ² School of Computer Science, Fudan University
³ School of Engineering, Westlake University

{rudongyu,quln,zhaz}@amazon.com
[email protected]
[email protected]

arXiv:2306.10658v1 [cs.CL] 19 Jun 2023

Abstract

Discourse analysis is an important task because it models intrinsic semantic structures between sentences in a document. Discourse markers are natural representations of discourse in our daily language. One challenge is that the markers, as well as pre-defined and human-labeled discourse relations, can be ambiguous when describing the semantics between sentences. We believe that a better approach is to use a context-dependent distribution over the markers to express discourse information. In this work, we propose to learn a Distributed Marker Representation (DMR) by utilizing the (potentially) unlimited discourse marker data with a latent discourse sense, thereby bridging markers with sentence pairs. Such representations can be learned automatically from data without supervision, and in turn provide insights into the data itself. Experiments show the SOTA performance of our DMR on the implicit discourse relation recognition task and strong interpretability. Our method also offers a valuable tool to understand complex ambiguity and entanglement among discourse markers and manually defined discourse relations.

Figure 1: Entangled discourse relations and corresponding markers between clauses. As shown in the figure, there exist diverse discourse relations (marked in blue) and corresponding markers (marked in red) for the same pair of clauses. It suggests that the semantic meanings of different discourse relations can be entangled with each other.

1 Introduction

Discourse analysis is a fundamental problem in natural language processing. It studies the linguistic structures beyond the sentence boundary and is a component of chains of thinking. Such structural information has been widely applied in many downstream applications, including information extraction (Peng et al., 2017), long document summarization (Cohan et al., 2018), document-level machine translation (Chen et al., 2020), and conversational machine reading (Gao et al., 2020).

Discourse relation recognition (DRR) focuses on semantic relations, namely, discourse senses between sentences or clauses. Such inter-sentence structures are sometimes explicitly expressed in natural language by discourse connectives, or markers (e.g., and, but, or). The availability of these markers makes it easier to identify corresponding relations (Pitler et al., 2008), as in the task of explicit discourse relation recognition (EDRR), since there is a strong correlation between discourse markers and relations. In contrast, implicit discourse relation recognition (IDRR), where markers are missing, remains a more challenging problem.

Prior work aims to address such challenges by making use of discourse marker information over explicit data in learning implicit discourse relations, either by injecting marker prediction knowledge into a representation model (Zhou et al., 2010; Braud and Denis, 2016), or by transferring the marker prediction task into implicit discourse relation prediction through a manually defined marker-relation mapping (Xiang et al., 2022; Zhou et al., 2022). It has been shown that discourse marker information can effectively improve relation prediction results. Nevertheless, relatively little work has investigated in detail the various subtleties concerning the correlation between discourse markers and discourse relations, and their effect on IDRR.

To properly model discourse relations and markers, we need to consider that manually-defined discourse relations can be semantically entangled and
markers are ambiguous. As shown in Fig. 1, for a pair of clauses, based on different emphasis on semantics, we have different choices of discourse relations and their corresponding markers. The existence of multiple plausible discourse relations indicates the entanglement between their semantic meanings. Besides, discourse markers and relations do not exclusively map to each other. As an example, “Ann went to the movies, and Bill went home” (Temporal.Synchrony) and “Ann went to the movies, and Bill got upset” (Contingency.Cause) both use the marker and but express different meanings. Identifying relations based on single markers is difficult in certain scenarios because of such ambiguity. Thus, a discrete and deterministic mapping between discourse relations and markers cannot precisely express the correlations between them.

Based on the study of the above issues, we propose to use Distributed Marker Representation to enhance the informativeness of discourse expression. Specifically, we use a probabilistic distribution over markers or corresponding latent senses, instead of a single marker or relation, to express discourse semantics. We introduce a bottleneck in the latent space, namely a discrete latent variable indicating discourse senses, to capture semantics between clauses. The latent sense then produces a distribution of plausible markers to reflect its surface form. This probabilistic model, which we call DMR, naturally deals with ambiguities between markers and entanglement among the relations. We show that the latent space reveals a hierarchical marker-sense clustering, and that entanglement among relations is currently under-reported. Empirical results on the IDRR benchmark Penn Discourse Tree Bank 2 (PDTB2) (Prasad et al., 2008) show the effectiveness of our framework. We summarize our contributions as follows:

• We propose a latent-space learning framework for discourse relations and effectively optimize it with cheap marker data.¹

• With the latent bottleneck and corresponding probabilistic modeling, our framework achieves the SOTA performance on implicit discourse relation recognition without a complicated architecture design.

• We investigate the ambiguity of discourse markers and entanglement among discourse relations to explain the plausibility of probabilistic modeling of discourse relations and markers.

¹ Code is publicly available at: https://github.com/rudongyu/DistMarker

2 Related Work

Discourse analysis (Brown et al., 1983; Joty et al., 2019; McCarthy et al., 2019) targets the discourse relation between adjacent sentences. It has attracted attention beyond intra-sentence semantics. It is formulated into two main tasks: explicit discourse relation recognition and implicit discourse relation recognition, referring to relation identification between a pair of sentences with markers explicitly included or not. While EDRR has achieved satisfactory performance (Pitler et al., 2008) with wide applications, IDRR remains challenging (Pitler et al., 2009; Zhang et al., 2015; Rutherford et al., 2017; Shi and Demberg, 2019). Our work builds upon the correlation between the two critical elements in discourse analysis: discourse relations and markers.

Discourse markers have been used not only for marker prediction training (Malmi et al., 2018), but also for improving the performance of IDRR (Marcu and Echihabi, 2002; Rutherford and Xue, 2015) and representation learning (Jernite et al., 2017). Prior efforts on exploring markers have found that training with discourse markers can alleviate the difficulty of IDRR (Sporleder and Lascarides, 2008; Zhou et al., 2010; Braud and Denis, 2016). Compared to their work, we focus on a unified representation using distributed markers instead of relying on transferring from explicit markers to implicit relations. Jernite et al. (2017) first extended the usage of markers to sentence representation learning, followed by Nie et al. (2019) and Sileo et al. (2019), which introduced principled pretraining frameworks and large-scale marker data. Xiang et al. (2022) and Zhou et al. (2022) explored the possibility of connecting markers and relations with prompts. In this work, we continue the line of improving the expression of discourse information as distributed markers.

3 Distributed Marker Representation Learning

We elaborate on the probabilistic model in Sec. 3.1 and its implementation with neural networks in Sec. 3.2. We then describe the way we optimize the model (Sec. 3.3).
[Figure 2 graphic omitted. Its example pairs $s_1$ = "I am weak" with $s_2$ = "I go to the gym everyday"; the accompanying table over latent senses $z$ and markers $m$ shows 0.64 for ($z_4$, so) and 0.58 for ($z_2$, although), with the remaining listed entries at 0.00 or 0.01.]

Figure 2: The graphical model of $p(m \mid s_1, s_2)$. $z$ is the latent variable indicating the latent sense, namely the semantic relation between two clauses. $K$ is the number of candidate values for the random variable $z$.

3.1 Probabilistic Formulation

We learn the distributed marker representation by predicting markers given pairs of sentences. We model the distribution of markers by introducing an extra latent variable $z$ which indicates the latent senses between two sentences. We assume the distribution of markers depends only on the latent senses, and is independent of the original sentence pairs when $z$ is given, namely $m \perp (s_1, s_2) \mid z$:

$$p_{\psi,\phi}(m \mid s_1, s_2) = \sum_{z} p_{\psi}(z \mid s_1, s_2) \cdot p_{\phi}(m \mid z), \tag{1}$$

where the latent semantic sense $z$ describes the unambiguous semantic meaning of $m$ in the specific context, and our target is to model the probabilistic distribution $p(m \mid s_1, s_2)$ with $z$. The probabilistic model is depicted in Fig. 2 with an example.

The key inductive bias here is that we assume the distribution of discourse markers is independent of the original sentence pairs given the latent semantic senses (Eq. 1). This formulation is based on the intuition that humans first decide the relationship between two sentences in their cognitive worlds, then pick one proper expression with a mapping from latent senses to expressions (which we call the z2m mapping in this paper) without reconsidering the semantics of the sentences. Decoupling the z2m mapping from the distribution of discourse marker prediction makes the model more interpretable and transparent.

Therefore, the distribution $p_{\psi,\phi}(m \mid s_1, s_2)$ can be decomposed into $p_{\psi}(z \mid s_1, s_2)$ and $p_{\phi}(m \mid z)$ based on the independence assumption above. $\psi$ and $\phi$ denote the parameters for each part.² The training objective with latent senses included is to maximize the likelihood on a large-scale corpus under this assumption:

$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{(s_1, s_2, m) \sim D} \log p_{\psi,\phi}(m \mid s_1, s_2). \tag{2}$$

² We omit the subscript of parameters $\psi$ and $\phi$ in some expressions later for conciseness.

3.2 Neural Architecture

Our model begins by processing each sentence pair with an encoder SentEnc:

$$h = \mathrm{SentEnc}_{\psi_s}([s_1, \texttt{[SEP]}, s_2]), \tag{3}$$

where $h \in \mathbb{R}^{d}$ denotes the sentence pair representation in $d$ dimensions for $s_1$ and $s_2$. $\psi_s$ are the parameters of the sentence encoder. The encoder is instantiated as a pre-trained language model in practice.

Then we use two linear layers to map the pair representation $h$ to the distribution of $z$ as below:

$$h_z = \psi_{w1} \cdot h + \psi_{b1}, \tag{4}$$
$$p_{\psi}(z \mid s_1, s_2) = \mathrm{softmax}(\psi_{w2} \cdot h_z + \psi_{b2}), \tag{5}$$

where $\psi_{w1} \in \mathbb{R}^{d \times 4d}$, $\psi_{b1} \in \mathbb{R}^{d}$, $\psi_{w2} \in \mathbb{R}^{K \times d}$, $\psi_{b2} \in \mathbb{R}^{K}$ are trainable parameters. $K$ is the dimension of latent discourse senses.

The parameter $\psi_{w2}$ not only acts as the mapping from the representation $h_z$ to the distribution of $z$, but can also be seen as an embedding lookup table for the $K$ values of $z$. Each row in $\psi_{w2}$ is a representation vector for the corresponding value, as an anchor in the companion continuous space of $z$.

To parameterize the z2m mapping, the parameter $\phi \in \mathbb{R}^{K \times N}$ is defined as a probabilistic transition matrix from latent semantic senses $z$ to markers $m$ (in log space), where $N$ is the number of candidate markers:

$$\log p_{\phi}(m \mid z) = \log \mathrm{softmax}(\phi), \tag{6}$$

where $\psi = (\psi_s, \psi_{w1}, \psi_{b1}, \psi_{w2}, \psi_{b2})$ and $\phi$ are the learnable parameters that parameterize the distribution $p_{\psi,\phi}(m \mid s_1, s_2)$.

3.3 Optimization

We optimize the parameters $\psi$ and $\phi$ with the classic EM algorithm due to the existence of the latent variable $z$. The latent variable $z$ serves as a regularizer during model training. In the E-step of each iteration, we obtain the posterior distribution $p(z \mid s_1, s_2, m)$ according to the parameters in the current iteration $\psi^{(t)}, \phi^{(t)}$ as shown in Eq. 7.
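The factorization in Eq. 1, together with the two softmax parameterizations in Eqs. 5 and 6, can be illustrated with a minimal sketch. It is not the authors' implementation: the sense logits stand in for $\psi_{w2} \cdot h_z + \psi_{b2}$, the softmax in Eq. 6 is assumed to be taken over markers for each sense, and all numbers and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marker_distribution(sense_logits, phi):
    """Eq. 1: p(m | s1, s2) = sum_z p(z | s1, s2) * p(m | z).

    sense_logits: K scores standing in for psi_w2 . h_z + psi_b2 (Eq. 5).
    phi: K x N matrix of unnormalized scores; a softmax over each row
         is assumed to give p(m | z) (Eq. 6).
    """
    p_z = softmax(sense_logits)                   # p(z | s1, s2)
    p_m_given_z = [softmax(row) for row in phi]   # one row per latent sense
    n_markers = len(phi[0])
    return [
        sum(p_z[k] * p_m_given_z[k][j] for k in range(len(p_z)))
        for j in range(n_markers)
    ]


# Hypothetical toy setting: K = 3 latent senses, N = 4 markers.
sense_logits = [2.0, 0.5, -1.0]
phi = [
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
p_m = marker_distribution(sense_logits, phi)
print([round(p, 3) for p in p_m])
```

Because each row of the transition matrix is itself a distribution, the mixture is a valid distribution over markers, and the marker favored by the most probable sense dominates the result.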
Algorithm 1 EM Optimization for Discourse Marker Training with Latent Senses
 1: Initialize model parameters as $\psi^{(0)}, \phi^{(0)}$.
 2: while not converged do    ⊲ $t$-th iteration
 3:     Sample a batch of examples for EM optimization.
 4:     for each example $(s_1, s_2, m)$ in the EM batch do
 5:         Calculate and save the posterior $p(z \mid s_1, s_2, m)$ according to $\psi^{(t)}, \phi^{(t)}$.
 6:     end for
 7:     for each example $(s_1, s_2, m)$ in the EM batch do
 8:         Estimate $\mathbb{E}_{p(z \mid s_1, s_2, m)}[\log p_{\psi,\phi}(m, z \mid s_1, s_2)]$ according to $\psi^{(t)}, \phi^{(t)}$.    ⊲ E-step
 9:     end for
10:     Update parameters $\psi$ to $\psi^{(t+1)}$ in mini-batch with the gradient calculated as $\nabla_{\psi} \mathcal{L}(\psi, \phi^{(t)})$.
11:     Update parameters $\phi$ to $\phi^{(t+1)}$ according to the updated $\psi^{(t+1)}$ and the gradient $\nabla_{\phi} \mathcal{L}(\psi^{(t+1)}, \phi)$.    ⊲ M-step
12: end while
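The E-step of Algorithm 1 relies on the posterior $p(z \mid s_1, s_2, m) \propto p(z \mid s_1, s_2) \cdot p(m \mid z)$ (Eq. 7). The following is a minimal sketch of that computation for a single example, along with the expected joint log-likelihood that the M-step maximizes (Eq. 8). The probabilities are hypothetical toy values and the function names are illustrative; the mini-batch gradient updates of the actual training procedure are not reproduced here.

```python
import math


def posterior_over_senses(p_z, p_m_given_z, marker):
    """Eq. 7: p(z | s1, s2, m) is proportional to p(z | s1, s2) * p(m | z)."""
    joint = [p_z[k] * p_m_given_z[k][marker] for k in range(len(p_z))]
    evidence = sum(joint)  # equals p(m | s1, s2) from Eq. 1
    return [j / evidence for j in joint], evidence


def expected_joint_log_likelihood(posterior, p_z, p_m_given_z, marker):
    """Per-example objective of Eq. 8: E_{p(z|s1,s2,m)}[log p(m, z | s1, s2)]."""
    return sum(
        q * math.log(p_z[k] * p_m_given_z[k][marker])
        for k, q in enumerate(posterior)
        if q > 0.0
    )


# Hypothetical toy values: K = 3 senses, N = 3 markers, observed marker index 1.
p_z = [0.5, 0.3, 0.2]
p_m_given_z = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
posterior, evidence = posterior_over_senses(p_z, p_m_given_z, marker=1)
q_value = expected_joint_log_likelihood(posterior, p_z, p_m_given_z, marker=1)
print([round(q, 2) for q in posterior], round(evidence, 2))
```

In this toy case the observed marker shifts most of the posterior mass to the second sense, even though the first sense has the larger prior probability. The expected joint log-likelihood is never larger than the log marginal likelihood, which is the usual EM lower bound.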

Based on our assumption that $m \perp (s_1, s_2) \mid z$, we can get the posterior distribution:

$$p(z \mid s_1, s_2, m) = \frac{p(m \mid s_1, s_2, z) \cdot p(z \mid s_1, s_2)}{p(m \mid s_1, s_2)} = \frac{p(m \mid z) \cdot p(z \mid s_1, s_2)}{p(m \mid s_1, s_2)} \propto p_{\psi^{(t)}}(z \mid s_1, s_2) \cdot p_{\phi^{(t)}}(m \mid z). \tag{7}$$

In the M-step, we optimize the parameters $\psi, \phi$ by maximizing the expectation of the joint log-likelihood on the estimated posterior $p(z \mid s_1, s_2, m)$. The updated parameters $\psi^{(t+1)}, \phi^{(t+1)}$ for the next iteration can be obtained as in Eq. 8.

$$\psi^{(t+1)}, \phi^{(t+1)} = \arg\max_{\psi, \phi} \mathbb{E}_{p(z \mid s_1, s_2, m)}[\log p_{\psi,\phi}(m, z \mid s_1, s_2)]. \tag{8}$$

In practice, the alternating EM optimization can be costly and unstable due to the expensive expectation computation and the sensitivity to hyper-parameters when optimizing $\psi$ and $\phi$ jointly. We alleviate the training difficulty by empirically estimating the expectation on mini-batches and separating the optimization of $\psi$ and $\phi$. We formulate the loss functions as below, for separate gradient descent optimization of $\psi$ and $\phi$:

$$\mathcal{L}(\psi, \phi^{(t)}) = \mathrm{KLDiv}\big(p(z \mid s_1, s_2, m),\; p_{\psi,\phi^{(t)}}(m, z \mid s_1, s_2)\big),$$
$$\mathcal{L}(\psi^{(t+1)}, \phi) = -\log p_{\psi^{(t+1)},\phi}(m \mid s_1, s_2),$$

where $\phi^{(t)}$ means the value of $\phi$ before the $t$-th iteration and $\psi^{(t+1)}$ means the value of $\psi$ after the $t$-th iteration of optimization. KLDiv denotes the Kullback-Leibler divergence. The overall optimization algorithm is summarized in Algorithm 1.

4 Experiments

DMR adopts a latent bottleneck for the space of latent discourse senses. We first verify the effectiveness of the latent variable and compare against current SOTA solutions on the IDRR task. We then examine what the latent bottleneck learned during training and how it addresses the ambiguity and entanglement of discourse markers and relations.

4.1 Dataset

We use two datasets for learning our DMR model and evaluating its strength on downstream implicit discourse relation recognition, respectively. See Appendix A for statistics of the datasets.

Discovery Dataset (Sileo et al., 2019) is a large-scale discourse marker dataset extracted from commoncrawl web data, the Depcc corpus (Panchenko et al., 2018). It contains 1.74 million sentence pairs with a total of 174 types of explicit discourse markers between them. Markers are automatically extracted based on part-of-speech tagging. We use top-k accuracy ACC@k to evaluate the marker prediction performance on this dataset. Note that we use explicit markers to train DMR but evaluate it on IDRR, thanks to different degrees of verbosity when using markers in everyday language.

Penn Discourse Tree Bank 2.0 (PDTB2) (Prasad et al., 2008) is a popular discourse analysis benchmark with manually-annotated discourse relations and markers on Wall Street Journal articles. We perform the evaluation on its implicit part with 11 major second-level relations included. We follow Ji and Eisenstein (2015) for the data split, which is widely used in recent studies for IDRR. Macro-F1 and ACC are metrics for IDRR performance. We note that although annotators are allowed to annotate multiple senses (relations), only 2.3% of the data have more than one relation. Therefore whether DMR can capture more entanglement among relations is of interest as well (Sec. 4.5).

4.2 Baselines

We compare our DMR model with competitive baseline approaches to validate the effectiveness of DMR. For the IDRR task, we compare the DMR-based classifier with current SOTA methods, including BMGF (Liu et al., 2021), which combines representation, matching, and fusion; LDSGM (Wu et al., 2022), which considers the hierarchical dependency among labels; and the prompt-based connective prediction method PCP (Zhou et al., 2022), among others. For further analysis on DMR, we also include a vanilla sentence encoder without the latent bottleneck as an extra baseline, denoted as BASE.

4.3 Implementation Details

Our DMR model is trained on 1.57 million examples with 174 types of markers in the Discovery dataset. We use the pretrained RoBERTa model (Liu et al., 2019) as SentEnc in DMR. We set the default latent dimension $K$ to 30. More details regarding the implementation of DMR can be found in Appendix A.

For the IDRR task, we strip the marker generation part from the DMR model and use the hidden state $h_z$ as the pair representation. BASE uses the [CLS] token representation as the representation of input pairs. A linear classification layer is stacked on top of models to predict relations.

4.4 Implicit Discourse Relation Recognition

We first validate the effectiveness of modeling latent senses on the challenging IDRR task.

Main Results DMR demonstrates comparable performance with current SOTAs on IDRR, but with a simpler architecture. As shown in Table 1, DMR leads in terms of accuracy by 2.7pt and is a close second in macro-F1.

The results exhibit the strength of DMR in more straightforwardly modeling the correlation between discourse markers and relations. Despite the absence of supervision on discourse relations during DMR learning, the semantics of latent senses distilled by EM optimization successfully transferred to manually-defined relations in IDRR.

Model                                          Backbone   macro-F1   ACC
IDRR-C&E (Dai and Huang, 2019)                 ELMo       33.41      48.23
MTL-MLoss (Nguyen et al., 2019)                ELMo       -          49.95
BERT-FT (Kishimoto et al., 2020)               BERT       -          54.32
HierMTN-CRF (Wu et al., 2020)                  BERT       33.91      52.34
BMGF-RoBERTa (Liu et al., 2021)                RoBERTa    -          58.13
MTL-MLoss-RoBERTa† (Nguyen et al., 2019)       RoBERTa    38.10      57.72
HierMTN-CRF-RoBERTa† (Wu et al., 2020)         RoBERTa    38.28      58.61
LDSGM (Wu et al., 2022)                        RoBERTa    40.49      60.33
PCP-base (Zhou et al., 2022)                   RoBERTa    41.55      60.54
PCP-large (Zhou et al., 2022)                  RoBERTa    44.04      61.41
DMR-base w/o z                                 RoBERTa    37.24      59.89
DMR-large w/o z                                RoBERTa    41.59      62.35
DMR-base                                       RoBERTa    42.41      61.35
DMR-large                                      RoBERTa    43.78      64.12

Table 1: Experimental Results of Implicit Discourse Relation Classification on PDTB2. Results with † are from Wu et al. (2022). DMR-large and DMR-base adopt roberta-large and roberta-base as SentEnc, respectively.

Relation                BMGF     LDSGM    DMR
Comp.Concession         0.       0.       0.
Comp.Contrast           59.75    63.52    63.16
Cont.Cause              59.60    64.36    62.65
Cont.Pragmatic Cause    0.       0.       0.
Expa.Alternative        60.0     63.46    55.17
Expa.Conjunction        60.17    57.91    58.54
Expa.Instantiation      67.96    72.60    72.16
Expa.List               0.       8.98     36.36
Expa.Restatement        53.83    58.06    59.19
Temp.Async              56.18    56.47    59.26
Temp.Sync               0.       0.       0.
Macro-F1                37.95    40.49    42.41

Table 2: Experimental Results of Implicit Discourse Relation Recognition on PDTB2 Second-level Senses
Based on the comparison to DMR without latent $z$, we observe a significant performance drop resulting from the missing latent bottleneck. It indicates that the latent bottleneck in DMR serves as a regularizer to avoid overfitting on similar markers.

Fine-grained Performance We list the fine-grained performance of DMR and compare it with SOTA approaches on second-level senses of PDTB2. As shown in Table 2, DMR achieves significant improvements on relations with little supervision, like Expa.List and Temp.Async. The performance on majority classes, e.g. Expa.Conjunction, is slightly worse. It may be caused by the entanglement between Expa.Conjunction and Expa.List, to be discussed in Sec. 4.5. In summary, DMR achieves better overall performance by maintaining equilibrium among entangled relations with different strengths of supervision.

Few-shot Analysis Fig. 3 shows DMR achieves significant gains against BASE in few-shot learning experiments. The results are averaged over 3 independent runs for each setting. In fact, with only ∼60% of annotated data, DMR achieves the same performance as BASE with full data by utilizing the cheap marker data more effectively.

[Figure 3 plot omitted: performance (y-axis) against # training examples (x-axis, 0 to 500) for DMR-acc, DMR-f1, BASE-acc and BASE-f1.]

Figure 3: Few-shot IDRR Results on PDTB2

To understand the ceiling of this family of BERT-based pretrained models with markers as an extra input, we augment the data in two ways: BASE_g inserts the groundtruth marker, and BASE_p inserts markers predicted by a model³ officially released by Discovery (Sileo et al., 2019). Table 3 presents the results where the informative markers are inserted to improve the performance of BASE, following the observations and ideas from Zhou et al. (2010) and Pitler et al. (2008). DMR continues to enjoy the lead, even when the markers are groundtruth (i.e. BASE_g), suggesting DMR’s hidden state contains more information than single markers.

# Training Examples      25       100      500      full (10K)
BASE      ACC            -        32.20    33.85    59.45
          F1             -        13.40    16.70    34.34
BASE_p    ACC            -        33.76    37.56    60.90
          F1             -        13.54    17.21    35.45
BASE_g    ACC            19.12    34.07    39.23    63.19
          F1             5.75     13.72    19.27    36.59
DMR       ACC            21.32    37.14    42.53    62.97
          F1             7.01     15.29    19.57    39.33

Table 3: Few-shot IDRR Results on PDTB2

³ They also use the RoBERTa model as a backbone.

4.5 Analysis & Discussion

Marker Prediction The performance of DMR on marker prediction is sensitive to the capacity of the bottleneck. When setting $K$ to be the number of markers (174), it matches and even outperforms the Discovery model, which directly predicts the markers on the same data (Table 4). A smaller $K$ sacrifices marker prediction performance but it can cluster related senses, resulting in a more informative and interpretable representation.

Model       ACC@1    ACC@3    ACC@5    ACC@10
Discovery   24.26    40.94    49.56    61.81
DMR30       8.49     22.76    33.54    48.11
DMR174      22.43    40.92    50.18    63.21

Table 4: Experimental results of marker prediction on the Discovery test set. DMR30 and DMR174 indicate the models with the dimension $K$ equal to 30 and 174, respectively.

Marker          1st Cluster                                       2nd Cluster
additionally    $z_1$: as a result, in turn, simultaneously       $z_{20}$: for example, for instance, specifically
amazingly       $z_9$: thankfully, fortunately, luckily           $z_{21}$: oddly, strangely, unfortunately
but             $z_{19}$: indeed, nonetheless, nevertheless       $z_{24}$: anyway, and, well

Table 5: Top 2 clusters of three randomly sampled markers. Each cluster corresponds to a latent $z$ coupled with its top 3 markers.

Multiple markers may share similar meanings when connecting sentences. Thus, evaluating the performance of marker prediction simply on top-1 accuracy is inappropriate. In Table 4, we demonstrated the results on ACC@k and observed that
DMR ($K$=174) gets better performance against the model optimized by an MLE objective when $k$ gets larger. We assume that it comes from the marker ambiguity. Our DMR models the ambiguity better, so any of the plausible markers is easier to observe in a larger range of predictions but more difficult to observe as top-1. To examine the marker ambiguity more directly, we randomly sample 50 examples to analyze their top-5 predictions. The statistics show that 80% of those predictions have plausible explanations. To conclude, a considerable number of examples have multiple plausible markers, thus ACC@k with larger $k$ can better reflect the true performance on marker prediction, where DMR can beat the MLE-optimized model.

z2m Mapping The latent space is not interpretable, but DMR has a transition matrix that outputs a distribution of markers, which reveals what a particular dimension may encode.

To analyze the latent space, we use $\psi_{w2}$ (Eq. 5) as the corresponding embedding vectors and perform T-SNE visualization of the latent $z$, similar to what Discovery (Sileo et al., 2019) does using the softmax weight at the final prediction layer. The complete T-SNE result can be found in Appendix B. What we observe is an emerging hierarchical pattern, in addition to proximity. That is, while synonymous markers are clustered as expected, semantically related clusters are often closer. Fig. 4b shows the top left corner of the T-SNE result. We can see that the temporal connectives and senses are located in the top left corner. According to their coupled markers, we can recover the semantics of these latent $z$: preceding ($z_{27}$), succeeding ($z_{25}$, $z_{22}$, $z_{16}$) and synchronous ($z_{15}$) form nearby but separated clusters.

For a comparison with connective-based prompting approaches, we also demonstrate the T-SNE visualization of marker representations from BASE in Fig. 4a. Unlike the semantically aligned vector space of DMR, the locality of markers in the space of the BASE representation is determined by the surface form of markers and shifted from their exact meaning. Marker representations of the model w/o latent $z$ are closer because of similar lexical formats instead of underlying discourse.

[Figure 4 scatter plots omitted.]
(a) The cropped T-SNE visualization of discourse markers from the BASE PLM.
(b) The cropped T-SNE visualization of latent $z$ from DMR.

Figure 4: The cropped T-SNE visualization of latent $z$ from DMR. Each $z$ is coupled with its top 3 markers from z2m mapping.

From the z2m mapping, we can take a step further to analyze the correlation between markers learned by DMR. Table 5 shows the top 2 corresponding clusters of three randomly sampled markers. We can observe correlations between markers such as polysemy and synonymy.

[Figure 5 charts omitted: panel (a) is a pie chart over 0, 1, 2 and 3 reasonable relations; panel (b) is a bar chart over the 1st, 2nd and 3rd predictions with the series "anno" and "human".]

Figure 5: Human Evaluation. Figure (a) shows numbers of reasonable relations in top-3 predictions. Figure (b) shows the accuracy of each of the top-3 predictions evaluated by annotations or human, respectively.

Understanding Entanglement Labeling discourse relations is challenging since some of them can correlate, and discerning the subtleties can be difficult. For example, List strongly correlates with Conjunction and the two are hardly distinguishable. DMR is trained to predict a distribution of markers, thus we expect its hidden state to capture the distribution of relations as well, even when the multi-sense labels are scarce. We drew 100 random samples and asked two researchers to check whether each of the corresponding top-3 predictions is valid
Example 1
  $s_1$: Rather, they tend to have a set of two or three favorites
  $s_2$: Sometimes, they’ll choose Ragu spaghetti sauce
  markers: because_of_this, therefore, for_example, for_instance
  relations: Contingency.Cause, Expansion.Instantiation

Example 2
  $s_1$: It just makes healthy businesses subsidize unhealthy ones and gives each employer less incentive to keep his workers healthy
  $s_2$: the HIAA is working on a proposal to establish a privately funded reinsurance mechanism to help cover small groups that can’t get insurance without excluding certain employees
  markers: because_of_this, conversely, therefore, in_contrast
  relations: Contingency.Cause, Comparison.Contrast

Example 3
  $s_1$: The Hart-Scott filing is then reviewed and any antitrust concerns usually met
  $s_2$: Typically, Hart-Scott is used now to give managers of target firms early news of a bid and a chance to use regulatory review as a delaying tactic
  markers: although, though, besides, also
  relations: Comparison.Concession, Expansion.Conjunction

Table 6: Case Study on Marker Ambiguity and Discourse Relation Entanglement.

and give a binary justification⁴. Fig. 5a shows that a considerable 64% of examples have two or more relations evaluated as reasonable in the top-3 predictions, much higher than the 2.3% of multi-sense labels in PDTB2. This suggests that one way to improve upon the lack of multi-sense annotation is to use DMR to provide candidates for the annotators. For these samples, we also inspect annotator agreement in PDTB2 (Fig. 5b). While the trend is consistent with what DMR reports, it also validates again that the PDTB2 annotators under-labeled multi-senses.

To gain a deeper understanding of relation correlation, we rank the sentence pairs according to the entropy of relation prediction; a higher entropy suggests more model uncertainty, namely more confusion. We use the top-3 predictions of the 20 highest entropy examples to demonstrate highly confusing discourse relations, as shown in Fig. 6. The accumulated joint probability of paired relations on these examples is computed as weights in the confusion matrix. The statistics meet our expectation that there exist specific patterns of confusion. For example, asynchronous relations are correlated with causal relations, while another type of temporal relations, synchronous ones, are correlated with conjunction. A complete list of these high entropy examples is given in Appendix C.

[Figure 6: heatmap of pairwise confusion weights over the relations Concession, Contrast, Cause, Alternative, Conjunction, Instantiation, List, Restatement, Asynchronous, and Synchrony; color scale from 0.0 to 1.6.]

Figure 6: Confusion on Discourse Relations. We use entropy as the metric for filtering the most confusing examples. We use the top-3 predictions of the 20 most confusing examples to show the entanglement between relations. We use accumulated 𝑝(𝑟_𝑖) · 𝑝(𝑟_𝑗) as weights for a pair of relations 𝑟_𝑖, 𝑟_𝑗. Note that implausible predictions are suppressed to ignore model errors.

To further show that DMR can learn diverse distributions even when multi-sense labels are scarce, we also evaluate our model on DiscoGeM (Scholman et al., 2022), where each instance is annotated by 10 crowd workers. The distribution discrepancy is evaluated with cross entropy. Our model, trained solely on majority labels, achieved a cross entropy score of 1.81 against all labels. Notably, our model outperforms the BMGF model (1.86) under the same conditions and comes close to the performance of the BMGF model trained on multiple labels (1.79) (Yung et al., 2022). These results highlight the strength of our model in capturing multiple senses within the data.

To conclude, while we believe explicit relation labeling is still useful, it is incomplete without also specifying a distribution. As such, DMR’s ℎ_𝑧 or the distribution of markers are legitimate alternatives to model inter-sentence discourse.

⁴ The annotators achieve a substantial agreement with a Kappa coefficient of 0.68.
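The cross-entropy comparison against crowd-sourced label distributions described above can be illustrated with a minimal sketch. It is not the authors' evaluation script: the votes, the two model outputs, the use of the natural logarithm, and the helper names `label_distribution` and `cross_entropy` are all assumptions made for illustration.

```python
import math
from collections import Counter


def label_distribution(votes):
    """Turn a list of crowd labels into a relative-frequency distribution."""
    counts = Counter(votes)
    total = len(votes)
    return {label: n / total for label, n in counts.items()}


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_r target[r] * log(predicted[r]).

    `eps` guards against log(0) when the model assigns no mass to a label.
    """
    return -sum(
        t * math.log(max(predicted.get(label, 0.0), eps))
        for label, t in target.items()
        if t > 0
    )


# Hypothetical instance: 10 crowd votes (as in a 10-annotator setup).
votes = ["Cause"] * 6 + ["Conjunction"] * 3 + ["Contrast"]
target = label_distribution(votes)

# Hypothetical model outputs: one spreads mass over plausible senses,
# the other is nearly one-hot on the majority label.
spread = {"Cause": 0.55, "Conjunction": 0.30, "Contrast": 0.10, "Concession": 0.05}
peaked = {"Cause": 0.97, "Conjunction": 0.01, "Contrast": 0.01, "Concession": 0.01}

ce_spread = cross_entropy(target, spread)
ce_peaked = cross_entropy(target, peaked)
print(f"spread: {ce_spread:.3f}  peaked: {ce_peaked:.3f}")
```

In this toy case the model that spreads probability over several plausible senses obtains the lower cross entropy, which is the behavior the metric is meant to reward.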
Case Study on Specific Examples To complement the previous discussion on understanding entanglement from a macro perspective, we present a few examples in PDTB2 with markers and relations predicted by the DMR-based model. As demonstrated in Table 6, the identification of discourse relations relies on different emphasis of semantic pairs. Taking the first case as an example, the connection between “two or three favorites” and “Ragu spaghetti sauce” indicates the Instantiation relation, while the connection between the complete semantics of these two sentences results in Cause. Thanks to the probabilistic modeling of discourse information in DMR, the cases demonstrate entanglement among relations and ambiguity of markers well.

5 Conclusion
In this paper, we propose the distributed marker representation for modeling discourse based on the strong correlation between discourse markers and relations. We design the probabilistic model by introducing a latent variable for discourse senses. We use the EM algorithm to effectively optimize the framework. The study on our well-trained DMR model shows that the latent-included model can offer a meaningful semantic view of markers. Such a semantic view significantly improves the performance of implicit discourse relation recognition. Further analysis of our model provides a better understanding of discourse relations and markers, especially the ambiguity and entanglement issues.

Limitation & Risks
In this paper, we bridge the gap between discourse markers and the underlying relations. We use distributed discourse markers to express discourse more informatively. However, learning DMR requires large-scale data on markers. Although such data is potentially unlimited in corpora, the distribution and types of markers may affect the performance of DMR. Besides, the current solution proposed in this paper is limited to relations between adjacent sentences.
Our model can potentially be used for natural language commonsense inference and has the potential to be a component for large-scale commonsense acquisition in a new form. Potential risks include a possible bias in the collected commonsense due to the data it relies on, which may be alleviated by introducing a voting-based selection mechanism on large-scale data.

References
Chloé Braud and Pascal Denis. 2016. Learning connective-based word representations for implicit discourse relation identification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 203–213, Austin, Texas. Association for Computational Linguistics.
Gillian Brown and George Yule. 1983. Discourse analysis. Cambridge University Press.
Junxuan Chen, Xiang Li, Jiarui Zhang, Chulun Zhou, Jianwei Cui, Bin Wang, and Jinsong Su. 2020. Modeling discourse structure for document-level neural machine translation. In Proceedings of the First Workshop on Automatic Simultaneous Translation, pages 30–36, Seattle, Washington. Association for Computational Linguistics.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.
Zeyu Dai and Ruihong Huang. 2019. A regularization approach for incorporating event knowledge and coreference relations into neural discourse parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2976–2987.
Yifan Gao, Chien-Sheng Wu, Jingjing Li, Shafiq Joty, Steven C.H. Hoi, Caiming Xiong, Irwin King, and Michael Lyu. 2020. Discern: Discourse-aware entailment reasoning network for conversational machine reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2439–2449, Online. Association for Computational Linguistics.
Yacine Jernite, Samuel R Bowman, and David Sontag. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.
Yangfeng Ji and Jacob Eisenstein. 2015. One vector is not enough: Entity-augmented distributed semantics for discourse relations. Transactions of the Association for Computational Linguistics, 3:329–344.
Shafiq Joty, Giuseppe Carenini, Raymond Ng, and Gabriel Murray. 2019. Discourse analysis and its applications. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pages 12–17.
Yudai Kishimoto, Yugo Murawaki, and Sadao Kurohashi. 2020. Adapting BERT to implicit discourse relation classification with a focus on discourse connectives. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 1152–1158.
Xin Liu, Jiefu Ou, Yangqiu Song, and Xin Jiang. 2021. On the importance of word and sentence representation learning in implicit discourse relation classification. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 3830–3836.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Eric Malmi, Daniele Pighin, Sebastian Krause, and Mikhail Kozhevnikov. 2018. Automatic prediction of discourse connectives. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Daniel Marcu and Abdessamad Echihabi. 2002. An unsupervised approach to recognizing discourse relations. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 368–375, USA. Association for Computational Linguistics.
Michael McCarthy, Christian Matthiessen, and Diana Slade. 2019. Discourse analysis. In An introduction to applied linguistics, pages 55–71. Routledge.
Linh The Nguyen, Linh Van Ngo, Khoat Than, and Thien Huu Nguyen. 2019. Employing the correspondence of relations and connectives to identify implicit discourse relations via label embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4201–4207, Florence, Italy. Association for Computational Linguistics.
Allen Nie, Erin Bennett, and Noah Goodman. 2019. DisSent: Learning sentence representations from explicit discourse relations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4497–4510.
Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone P. Ponzetto, and Chris Biemann. 2018. Building a web-scale dependency-parsed corpus from CommonCrawl. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics, 5:101–115.
Emily Pitler, Annie Louis, and Ani Nenkova. 2009. Automatic sense prediction for implicit discourse relations in text.
Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee, and Aravind K Joshi. 2008. Easily identifiable discourse relations. Technical Reports (CIS), page 884.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08).
Attapol Rutherford, Vera Demberg, and Nianwen Xue. 2017. A systematic study of neural discourse models for implicit discourse relation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 281–291.
Attapol Rutherford and Nianwen Xue. 2015. Improving the inference of implicit discourse relations via classifying explicit discourse connectives. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 799–808, Denver, Colorado. Association for Computational Linguistics.
Merel Scholman, Tianai Dong, Frances Yung, and Vera Demberg. 2022. DiscoGeM: A crowdsourced corpus of genre-mixed implicit discourse relations. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3281–3290, Marseille, France. European Language Resources Association.
Wei Shi and Vera Demberg. 2019. Next sentence prediction helps implicit discourse relation classification within and across domains. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5790–5796.
Damien Sileo, Tim Van de Cruys, Camille Pradel, and Philippe Muller. 2019. Mining discourse markers for unsupervised sentence representation learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3477–3486.
Caroline Sporleder and Alex Lascarides. 2008. Using automatically labelled examples to classify rhetorical relations: an assessment. Natural Language Engineering, 14(3):369–416.
Changxing Wu, Liuwen Cao, Yubin Ge, Yang Liu, Min Zhang, and Jinsong Su. 2022. A label dependence-aware sequence generation model for multi-level implicit discourse relation recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11486–11494.
Changxing Wu, Chaowen Hu, Ruochen Li, Hongyu Lin, and Jinsong Su. 2020. Hierarchical multi-task learning with CRF for implicit discourse relation recognition. Knowledge-Based Systems, 195:105637.
Wei Xiang, Zhenglin Wang, Lu Dai, and Bang Wang. 2022. ConnPrompt: Connective-cloze prompt learning for implicit discourse relation recognition. In Proceedings of the 29th International Conference on Computational Linguistics, pages 902–911, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Frances Yung, Kaveri Anuranjana, Merel Scholman, and Vera Demberg. 2022. Label distributions help implicit discourse relation classification. In Proceedings of the 3rd Workshop on Computational Approaches to Discourse, pages 48–53, Gyeongju, Republic of Korea and Online. International Conference on Computational Linguistics.
Biao Zhang, Jinsong Su, Deyi Xiong, Yaojie Lu, Hong Duan, and Junfeng Yao. 2015. Shallow convolutional neural network for implicit discourse relation recognition. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235.
Hao Zhou, Man Lan, Yuanbin Wu, Yuefeng Chen, and Meirong Ma. 2022. Prompt-based connective prediction method for fine-grained implicit discourse relation recognition. arXiv preprint arXiv:2210.07032.
Zhi-Min Zhou, Yu Xu, Zheng-Yu Niu, Man Lan, Jian Su, and Chew Lim Tan. 2010. Predicting discourse connectives for implicit discourse relation recognition. In Coling 2010: Posters, pages 1507–1514, Beijing, China. Coling 2010 Organizing Committee.
Train Valid Test
1566k 174k 174k

Table 7: Statistics of Discovery Dataset

Relations Train Valid Test
Comp.Concession 180 15 17
Comp.Contrast 1566 166 128
Cont.Cause 3227 281 269
Cont.Pragmatic Cause 51 6 7
Expa.Alternative 146 10 9
Expa.Conjunction 2805 258 200
Expa.Instantiation 1061 106 118
Expa.List 330 9 12
Expa.Restatement 2376 260 211
Temp.Async 517 46 54
Temp.Sync 147 8 14
Total 12406 1165 1039

Table 8: Statistics of PDTB2 Dataset

A Implementation Details

We use Huggingface transformers (4.2.1) for the PLM backbones in our experiments. For optimization, we optimize the overall framework according to Algorithm 1. We train the model on Discovery for 3 epochs with the learning rate for 𝜓 set to 3e-5 and the learning rate for 𝜙 set to 1e-2. The EM batch size is set to 500 according to the trade-off between optimization efficiency and performance. The optimization requires around 40 hours to converge on a Tesla V100 GPU. For the experiments on PDTB2, we use the data according to the LDC license for research purposes on discourse relation classification. The corresponding statistics of the two datasets are listed in Table 7 and Table 8.

B Visualization of the latent 𝒛

To obtain an intrinsic view of how well the connections between markers 𝒎 and 𝒛 can be learned in our DMR model, we draw a t-SNE 2-d visualization of 𝒛’s representations in Fig. 7 with the top-3 connectives of each 𝑧 attached nearby. The representation vector for each 𝑧 is extracted from 𝜓_𝑤2. Interestingly, we observe not only the clustering of similar connectives under each 𝑧, but also semantically related 𝑧 located close together in the representation space.

C High Entropy Examples from Human Evaluation

For analysis of the entanglement among relations, we did a human evaluation on randomly extracted examples from PDTB2. To better understand the entanglement among relations, we further filter the 20 most confusing examples with entropy as a metric. The entanglement is shown in Fig. 6 in Sec. 4.5. We list these examples in Table 9 for clarity.
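The entropy-based filtering used to pick the most confusing examples (Sec. 4.5 and Appendix C) can be sketched as follows. This is an illustrative sketch only: the predicted distributions are hypothetical, the natural logarithm is an assumption, and `entropy` and `most_confusing` are names introduced here, not functions from the authors' code.

```python
import math


def entropy(dist):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def most_confusing(predictions, k):
    """Return the indices of the k predictions with the highest entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical relation distributions for three sentence pairs.
preds = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.20, 0.05],  # several plausible relations
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
]

top = most_confusing(preds, k=2)
print(top)
```

With `k=20` over the evaluated PDTB2 examples, the same ranking would yield the kind of subset listed in Table 9.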
[Figure 7: 2-d t-SNE scatter plot of the latent senses 𝑧, each annotated with its top-3 connectives.]

Figure 7: T-SNE Visualization of the Latent 𝒛. We draw the t-SNE embeddings of each latent 𝑧 in 2-d space with the well-trained 𝜓_𝑤2 as corresponding embedding vectors. While each 𝑧 groups markers with similar meanings, we can also observe that related senses are clustered together. For example, temporal connectives and senses are located in the top left corner with preceding (𝑧_27), succeeding (𝑧_25, 𝑧_22, 𝑧_16), and synchronous (𝑧_15) ones separated. The existence of 𝒛 helps to construct a hierarchical view of semantics between sentences.
[Figure 8: 2-d t-SNE scatter plot of individual discourse markers, each point labeled with its marker.]

Figure 8: T-SNE Visualization of discourse markers from BASE. We draw the t-SNE embeddings of each marker in 2-d space with averaged token representations of markers from the BASE PLM. Compared to the well-organized hierarchical view of latent senses in DMR, markers are not well-aligned to semantics in the representation space of BASE. This indicates the limitation of bridging markers and relations with a direct mapping.
s1 | s2 | 1st-pred | 2nd-pred | 3rd-pred
Right away you notice the following things about a Philip Glass concert | It attracts people with funny hair | Instantiation 0.502 | Restatement 0.449 | List 0.014
There is a recognizable musical style here, but not a particular performance style | The music is not especially pianistic | Restatement 0.603 | Conjunction 0.279 | Instantiation 0.048
Numerous injuries were reported | Some buildings collapsed, gas and water lines ruptured and fires raged | Restatement 0.574 | Instantiation 0.250 | List 0.054
this comparison ignores the intensely claustrophobic nature of Mr. Glass’s music | Its supposedly austere minimalism overlays a bombast that makes one yearn for the astringency of neoclassical Stravinsky, the genuinely radical minimalism of Berg and Webern, and what in retrospect even seems like concision in Mahler | Cause 0.579 | Restatement 0.319 | Instantiation 0.061
The issue exploded this year after a Federal Bureau of Investigation operation led to charges of widespread trading abuses at the Chicago Board of Trade and Chicago Mercantile Exchange | While not specifically mentioned in the FBI charges, dual trading became a focus of attempts to tighten industry regulations | Cause 0.504 | Asynchronous 0.400 | Conjunction 0.045
A menu by phone could let you decide, ‘I’m interested in just the beginning of story No. 1, and I want story No. 2 in depth | You’ll start to see shows where viewers program the program | Cause 0.634 | Conjunction 0.188 | Asynchronous 0.116
His hands sit farther apart on the keyboard. Seventh chords make you feel as though he may break into a (very slow) improvisatory riff | The chords modulate | Cause 0.604 | Conjunction 0.266 | Restatement 0.082
His more is always less | Far from being minimalist, the music unabatingly torments us with apparent novelties not so cleverly disguised in the simplicities of 4/4 time, octave intervals, and ragtime or gospel chord progressions | Cause 0.456 | Restatement 0.433 | Instantiation 0.052
It requires that "discharges of pollutants" into the "waters of the United States" be authorized by permits that reflect the effluent limitations developed under section 301 | Whatever may be the problems with this system, it scarcely reflects "zero risk" or "zero discharge | Contrast 0.484 | Cause 0.387 | Concession 0.072
The study, by the CFTC’s division of economic analysis, shows that "a trade is a trade | Whether a trade is done on a dual or non-dual basis doesn’t seem to have much economic impact | Restatement 0.560 | Conjunction 0.302 | Cause 0.095
Currently in the middle of a four-week, 20-city tour as a solo pianist, Mr. Glass has left behind his synthesizers, equipment and collaborators in favor of going it alone | He sits down at the piano and plays | Restatement 0.357 | Synchrony 0.188 | Asynchronous 0.115
For the nine months, Honeywell reported earnings of $212.1 million, or $4.92 a share, compared with earnings of $47.9 million, or $1.13 a share, a year earlier | Sales declined slightly to $5.17 billion | Conjunction 0.541 | Contrast 0.319 | Synchrony 0.109
The Bush administration is seeking an understanding with Congress to ease restrictions on U.S. involvement in foreign coups that might result in the death of a country’s leader | that while Bush wouldn’t alter a longstanding ban on such involvement, "there’s a clarification needed" on its interpretation | Restatement 0.465 | Conjunction 0.403 | Cause 0.094
With "Planet News Mr. Glass gets going | His hands sit farther apart on the keyboard | Synchrony 0.503 | Asynchronous 0.202 | Cause 0.147
The Clean Water Act contains no "legal standard" of zero discharge | It requires that "discharges of pollutants" into the "waters of the United States" be authorized by permits that reflect the effluent limitations developed under section 301 | Alternative 0.395 | Contrast 0.386 | Restatement 0.096
Libyan leader Gadhafi met with Egypt’s President Mubarak, and the two officials pledged to respect each other’s laws, security and stability | They stopped short of resuming diplomatic ties, severed in 1979 | Contrast 0.379 | Concession 0.373 | Conjunction 0.129
His hands sit farther apart on the keyboard. Seventh chords make you feel as though he may break into a (very slow) improvisatory riff. The chords modulate, but there is little filigree even though his fingers begin to wander over more of the keys | Contrasts predictably accumulate | Conjunction 0.445 | Synchrony 0.303 | List 0.181
NBC has been able to charge premium rates for this ad time | but to be about 40% above regular daytime rates | Conjunction 0.409 | Restatement 0.338 | Contrast 0.224
Mr. Glass looks and sounds more like a shaggy poet describing his work than a classical pianist playing a recital | The piano compositions are relentlessly tonal (therefore unthreatening), unvaryingly rhythmic (therefore soporific), and unflaggingly harmonious but unmelodic (therefore both pretty and unconventional | Cause 0.380 | Instantiation 0.323 | Restatement 0.241
It attracts people with funny hair | Whoever constitute the local Left Bank come out in force, dressed in black | Cause 0.369 | Asynchronous 0.331 | Conjunction 0.260
Table 9: High Entropy Examples of Model Inference on Implicit Discourse Relation Classification
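As a rough illustration of how the pairwise weights in Fig. 6 could be accumulated from top-3 predictions such as those in Table 9, the sketch below sums 𝑝(𝑟_𝑖) · 𝑝(𝑟_𝑗) over every pair of relations within each example. The three input rows are the first three rows of Table 9; treating pairs as unordered, and the function name `confusion_weights`, are assumptions. The suppression of implausible predictions mentioned in the Fig. 6 caption is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def confusion_weights(top3_predictions):
    """Accumulate p(r_i) * p(r_j) for each unordered pair of relations
    that co-occur in an example's top-3 predictions."""
    weights = defaultdict(float)
    for preds in top3_predictions:
        for (r_i, p_i), (r_j, p_j) in combinations(preds, 2):
            pair = tuple(sorted((r_i, r_j)))  # unordered pair as a stable key
            weights[pair] += p_i * p_j
    return dict(weights)


# First three rows of Table 9: (relation, probability) for the top-3 predictions.
examples = [
    [("Instantiation", 0.502), ("Restatement", 0.449), ("List", 0.014)],
    [("Restatement", 0.603), ("Conjunction", 0.279), ("Instantiation", 0.048)],
    [("Restatement", 0.574), ("Instantiation", 0.250), ("List", 0.054)],
]

weights = confusion_weights(examples)
for pair, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pair, round(w, 4))
```

On these three rows the Instantiation and Restatement pair receives the largest weight, since both relations appear together in all three top-3 lists.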
