Reversible fingerprinting for genomic information
Daniel Naro, Jaime Delgado, Silvia Llorente
Departament d’Arquitectura de Computadors (DAC),
Universitat Politècnica de Catalunya · UPC BarcelonaTECH,
C/Jordi Girona, 1-3, 08034 Barcelona
{dnaro, jaime.delgado, silviall}@ac.upc.edu
http://dmag.ac.upc.edu
Corresponding author
Name: Silvia Llorente
e-mail:
[email protected]
Phone: +34 93 401 74 09
Abstract – New genome sequencing technologies have simplified the generation of genomic data,
making them more common but in turn a likely target of attack. Security strategies have been devised
such as restricting the amount of information that can be queried or using new encryption techniques.
These solutions might not be enough if the entire file has to be shared, as the recipient might leak
the accessible information. This contribution addresses this issue using watermarking. Each read in
a genomic file is modified depending on its content and a secret key. This allows generating different
watermarked instances of the original file. Each watermark acts as a fingerprint: if a leak occurs, the
unique modifications of the instance point to who originated the unauthorized publication. Using
the key, the modifications can be undone. This allows sharing a leak-discouraging version with
which the relevance of a file can be assessed, and which can be reverted to the original if needed.
Keywords: Watermarking, Genomic Information, Information Leakage, MPEG-G, Fingerprinting
1 Introduction
The genomic information obtained from sequencing the DNA (DeoxyriboNucleic
Acid) [1], [2] of an individual contains sensitive information. As with other
biometric measures such as the palm print, the genome cannot be modified in order
to mitigate the leakage of sensitive information. This information not only identifies
the sequenced individual for life and informs about possible health issues, but also
gives information about blood relatives and the diseases they might have to face.
This privacy issue motivates the development of new ways of protecting the
genomic information: for example, through the anonymization offered by beacons
[3], cryptography [4] or access rules [5].
In the case of the beacons, the interaction with the data changes. Only the
genomic information for given regions of the genome is stored in the beacons,
which can then be queried with statistical requests such as the frequency of a
mutation in a population suffering from a given disease. This is a way to query the
data without making it available in its entirety, similar to the biometric
identification solution described in [6], where features are scattered across multiple
storages.
The use of beacons alters how the data is used, as only the statistics for an
entire population are known. For some studies, however, the entire genome is
required and, in those cases, the researcher must have access to the entire data, not
only to statistical information. In this case, if the data turns out to have been wrongfully
released, we are interested in a method to identify who is responsible for the
leakage. This is what we propose in this paper, using watermarking.
From the world of audio, images and video (in summary, multimedia) we
are familiar with the idea of watermarking: inserting some alteration which can be
identified and is hard to undo. This is a first step in addressing the problem
introduced above. With one mark identifying each of the known data-holders, i.e. all
individuals in possession of one instance of a genomic information file, we are
able to find who broke the rules by publishing the genomic sequence.
This is referred to as fingerprinting [7]. Nevertheless, as genomic information has
many important applications and each modification could have an effect on the
conclusions of a study, we prefer a modification method which has as limited an effect
as possible, whilst still distinguishing the multiple copies of the shared information, in
order to detect possible information leaks.
The structure of this paper is as follows. First of all, we introduce some
genomic concepts needed to understand the method proposed and which kind of
information we are processing. Then, we review in which use cases a watermarking
or fingerprinting method might be useful to protect genomic information. Then, we
introduce different properties a watermark could have, both in terms of features and
resistance to modifications. After that, we present our proposal to watermark /
fingerprint genomic information in a deterministic and reversible manner. It
generates a new file with a minimum amount of changes, taking as input a
genomic information file, a key and a set of parameters, and returning a modified
genomic information file. We qualify the method as reversible because,
if the key is available, it is possible to restore the original state of the genomic
information. Next, we present some results of applying the method.
Finally, we draw some conclusions and outline future work.
2 Background
2.1 Introduction to genomic concepts
The genomic information is stored in each cell as multiple molecules of DNA,
which can be interpreted as a sequence of nucleotides. In DNA, there are four types
of nucleotides, A (adenine), C (cytosine), G (guanine) and T (thymine) [1], [2].
During the life of a cell, the DNA molecules are translated into proteins. Some
portions of the DNA molecule encode the protein sequence: three nucleotides at a
time are read and translated into an amino acid, the building block of proteins.
Within a species, the genomic information is almost the same, however
some mutations can occur: some nucleotides might change, be inserted or be
removed from the DNA, leading to changes in the resulting proteins.
Research in this field is interested in finding mutations explaining certain
diseases or advantages. In order to do so, the genomes of individuals are compared
with one another. This allows the definition of a reference genome. Then, it is
possible to determine the difference between one individual’s genome and the
reference defined.
The first step to obtain the genomic information from one individual is to
sequence a biological sample. Nowadays, this can be done with Next Generation
Sequencing (NGS) machines, like Illumina devices [8]. These machines sequence
the genome by obtaining small subsequences of contiguous nucleotides forming the
DNA molecules. We call each DNA molecule a chromosome and we refer to these
subsequences as reads. Chromosomes are grouped in pairs: one chromosome is
inherited from the mother and the other one from the father. The output is stored in
text files in a format called FASTQ [9], representing the collection of generated
reads as the sequence of obtained nucleotides and the confidence with which each
nucleotide is identified (i.e. a measure of the quality of the read for each nucleotide).
The order of appearance of the reads within the FASTQ file is random with respect to
the species genome. This means that, to be able to process the genomic information
stored in the FASTQ file, several steps have to be taken, as described next.
The next step in the genome sequencing pipeline is to align the FASTQ file
information to a reference genome. A reference genome is assembled by scientists
as a representative example of a species' set of genes. The result of the alignment
process is stored in a SAM file (Sequence Alignment Map) [10]. The information
generated for each read during the alignment is the position where the read is stored
(i.e. on which chromosome and where on the chromosome), and the differences
between the reference genome and the read.
The main differences that can be found during alignment are mutations,
insertions and deletions. A mutation means that a nucleotide on the read is different
from the nucleotide on the reference for that location. An insertion means that some
nucleotides have been added. Finally, a deletion means that some nucleotides are
missing. There are other possible differences, which are further explained in [10].
The way of representing these differences inside a SAM file is to use CIGAR
(Concise Idiosyncratic Gapped Alignment Report) strings.
A CIGAR string contains information about the reference sequence, the read
sequences, and the type of operations (for example, match (M), insertion (I) and/or
deletion (D)). A match means that the nucleotide in the read is aligned to a
position on the reference genome, but it does not discriminate between
the case where the two nucleotides are the same and the case where they differ (this
can be determined by comparison with the reference genome, by using an auxiliary tag
attached to the read in the SAM file, or by using a newer set of symbols in the CIGAR
string that distinguishes the two cases).
In summary, the SAM file stores the genomic information contained in the
FASTQ file, indicating where each read lies with respect to the reference genome. If
the alignment is not perfect, the differences between the sequenced genome and the
reference genome are stored, summarized in a CIGAR string, an example of which is
shown in Table 1. In the example, a fictitious read is aligned to a fictitious reference.
The values in the CIGAR operations row of Table 1 have the following meaning:
- the first two nucleotides of the sequence match the ones from the reference (i.e.
we have two M operations),
- a nucleotide is inserted (i.e. we have an I operation),
- a nucleotide matches the reference (M operation),
- a nucleotide from the reference is not present in the read (D operation),
- a nucleotide matches the reference (M operation),
- a nucleotide in the read replaces the nucleotide in the reference (M operation).
Finally, all operations are collapsed into one representation indicating the
type of operation and the number of nucleotides to which the operation applies, as
shown in the “Read CIGAR” row in Table 1.
Table 1: CIGAR example
Reference (example chromosome)   (nucleotide row not recoverable)
Read                             (nucleotide row not recoverable)
CIGAR operations                 M M I M D M M
Read CIGAR (as stored in SAM)    2M1I1M1D2M
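The collapsing step described above can be sketched in a few lines of Python. This is only an illustration: the function name collapse_cigar is ours, and the input is the per-nucleotide list of operations from the example (M, M, I, M, D, M, M).

```python
from itertools import groupby


def collapse_cigar(operations):
    """Collapse per-nucleotide operations into a run-length CIGAR string."""
    # groupby yields one group per run of identical consecutive operations.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(operations))


print(collapse_cigar(["M", "M", "I", "M", "D", "M", "M"]))  # 2M1I1M1D2M
```

Each run of identical operations becomes one "length + operation" field, which is why the two leading matches are stored as 2M.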
Additionally, when the beginning or end of a read does not match the
reference sequence, part of its nucleotides can be clipped. This is indicated with the
type S operation (for soft clipping) in the CIGAR string, which means that the
nucleotides are left in the read even though they are not mapped. This could be due
to different reasons: sequencing errors at the beginning or end of the read, lack of
information in the reference genome for the location, or more complex reasons (for
example one part of a chromosome has been copied to another chromosome).
It is worth noting that the text-based information stored in a SAM file can
be compressed into a BAM file (Binary Alignment Map) [10] in order to reduce the
size of the information stored. SAM or BAM files also contain metadata regarding
the process followed to generate them, like the tools used to perform the alignment.
Nevertheless, sequencing machines can make mistakes when identifying
nucleotides, and as such, the fact that one nucleotide appears altered in one read
does not mean that the studied individual has a mutation at that location. Instead,
enough reads are sequenced such that there are multiple reads mapping to a same
position, in order to have enough evidence. Then, all reads are analyzed in order to
determine if mutations are really present. This information is stored in a VCF file
(Variant Call Format) [11], which lists each mutation with its position, its type, and
whether it affects one or both copies of the chromosome.
2.2 Watermarking in genomics
2.2.1 Watermarking
Watermarking techniques can be used for a broad set of tasks. In [12], the authors
list different use cases which could be covered with the use of watermarking in the
context of healthcare, specifically when the watermark is applied to an image.
These use cases include:
- A simplification in the access control: if the image’s metadata (e.g. information
about the patient) is merged with the image by the means of the watermark,
protecting the image and controlling the access to it already ensures the
protection and the access to the metadata.
- Moving identification information present in the image to a watermark only
accessible through a key.
- Improving the captioning of the image opening new ways to attach information
to it.
- Using a watermark to carry information identifying the origin, such as a
signature.
The work presented here has something in common with [12], as we are
using the new channel obtained thanks to watermarking to merge an identification
of the receiver to the genomic information.
As in the case of multimedia information, the watermarking may be more
or less obtrusive (depending on the significance of the introduced changes). There
are different criteria to compare watermark methods, for example:
- How easily the watermark can be recognized by a computer.
- The decrease in value of the data.
- The resistance of the watermark detection to file alterations.
Following with the multimedia analogy, these points would correspond to:
- Is the watermark visible at plain sight, or is it hidden, for example using
steganography?
- Is the watermark adding known content which can be easily searched for, or is
it just adding noise?
- Is the watermark still present if the image is downscaled, rotated or otherwise
modified?
On the other hand, the modifications can be more or less intrusive: methods
can limit their effects on less significant regions, or, on the contrary, modify the
content without considering its impact. We can find in [13] an example of limited
effect modifications in images, as the proposed algorithm only modifies the least
significant bits of certain elements.
The modification of the least significant bit is also employed in [14]. The
authors add a new signal to the wireless transmission of an ECG
(electrocardiogram): the purpose of the new signal is to indicate the identity of the
patient within the ECG signal. The effect of the least significant bit
modification is mitigated by upscaling the values of the signal.
Another strategy is the one employed by the authors of [15]: their
contribution addresses the need for watermarking images. In order to preserve the
clinically relevant portions of the image, they discriminate its regions as of interest
(Region Of Interest, ROI) or not (Non-Region Of Interest, NROI). By applying the
watermarking techniques to only the NROI regions, the potential drawbacks due to
the file modification are mitigated.
In order to know if the watermark will be resistant to file modifications, we
first identify actions which can be performed when modifying a genomic file, and
the implications on the security method. They are as follows:
- Exporting portions of the file, e.g. data for one chromosome: the effects of the
watermark should be present in the entirety of the file, so that it is unlikely that
any significant portion has not been modified.
- Importing other files (or portions of files): the watermark should not be
invalidated by the presence of new non-watermarked data.
- Modifying without semantic changes (file changed but not its meaning, e.g. the
read ids are changed): the watermarking method should rely on a minimal set
of information, unlikely to be modified.
- Modifying with semantic changes: certain mutations have been added or
deleted. The watermarking method should limit the input from the file's data, so
that sparse modifications have as little effect as possible.
Additionally, the watermarking strategy could be defeated by collusion: if
multiple instances of the same file (with different watermarks) are obtained, they
can be compared in order to determine the non-watermarked version.
As mentioned previously, a watermarking technique might alter the original
host signal. The magnitude of the modification could render the signal unusable.
One approach to mitigate this issue, beyond simply constraining the magnitude of
the modification, is to ensure that the modification as a whole is reversible. This is
done for example in [16], where two images are merged together, and both are
recoverable. The host signal is an image to which a logo is associated, without
notably altering the original image. When received, and if the necessary side
information is also available, both the image and the logo can be recovered.
2.2.2 Use of watermarking in genomic information
The described process to sequence and align the genome is important for many use
cases. For example, it can be used to perform research on some disease, to identify
the best treatment or, when the genome is not from a human being, to find better
crops. Thus, it is important to be able to share the genomic information, but we
might want to retain some control on the shared data: either because of intellectual
property or to hold bad actors accountable in case of data leaks. To this end, it is
interesting to be able to modify in some way the described data (i.e. watermarking
it), in order to identify leakers in case of an audit. We must distinguish
between watermarking a DNA molecule as found in a living cell and including
a mark in the result of sequencing such a molecule.
[17] and [18] describe an algorithm to introduce content in the genome of a
living organism. The information is introduced by changing some nucleotides
(specifically the last nucleotide of each group of three nucleotides). Each changed
nucleotide encodes two bits of information (as there are four different
nucleotides). The nucleotides to be changed are chosen according to the consequence of
changing them. In the process of translating the DNA to a protein, each
group of three nucleotides is translated to an amino acid, and for certain
combinations of the two first nucleotides, the translation will result in the same
amino acid no matter the value of the third nucleotide. Only those changes which
do not change the protein synthesized from the edited portion of the DNA are
possible. This procedure ensures that the living organism will still produce the same
proteins. As the modifications are meant to be present in a living cell, and DNA
replications might occur, the authors propose to integrate error detection and
correction schemes to the message being encoded in the DNA: the method and the
length of the correction strategies are determined by the likelihood of a mutation
affecting the modified region.
Therefore, the authors of [17], [18] focus on ensuring that the modifications
do not affect the living cell.
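The third-nucleotide idea of [17], [18] can be illustrated with a minimal sketch. It relies on the eight codon families of the standard genetic code in which the first two nucleotides alone determine the amino acid. The bit-to-nucleotide mapping and the function names below are our own assumptions, and the error detection and correction layer mentioned above is left out.

```python
# First two nucleotides of the codon families in which, under the standard
# genetic code, the third nucleotide does not change the amino acid.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
# Hypothetical 2-bit mapping, chosen only for this illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def embed_bits(coding_sequence, bits):
    """Write two bits into the third nucleotide of each eligible codon."""
    out, cursor = [], 0
    for i in range(0, len(coding_sequence), 3):
        codon = coding_sequence[i:i + 3]
        eligible = len(codon) == 3 and codon[:2] in FOURFOLD_PREFIXES
        if eligible and cursor + 2 <= len(bits):
            codon = codon[:2] + BITS_TO_BASE[bits[cursor:cursor + 2]]
            cursor += 2
        out.append(codon)
    return "".join(out)


def extract_bits(coding_sequence, n_bits):
    """Read the embedded bits back from the eligible codons."""
    bits = ""
    for i in range(0, len(coding_sequence) - 2, 3):
        codon = coding_sequence[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and len(bits) < n_bits:
            bits += BASE_TO_BITS[codon[2]]
    return bits


marked = embed_bits("GCAATGGGT", "1101")
print(marked, extract_bits(marked, 4))
```

In this example the codon ATG is left untouched because changing its third nucleotide would change the encoded amino acid, whereas GCA and GGT each carry two bits.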
In contrast, we focus on watermarking the representation of the
result of sequencing DNA molecules. Other authors have similar
objectives, such as those of [19]. They modify sequential data before sending it to every
recipient, in such a way that the ability to reconstruct the original data is minimal,
the modifications are overall unique for each recipient, and it is hard for the
recipients to collude and revert the modifications. Furthermore, the number of
modified nucleotides is a constraint given as an input.
[19] uses the concept of sequential data as a sequence of data points where
each point can take one value from a fixed pool of values. The data points
(nucleotides which could be modified) they use, are the locations of well-known
mutations, and the pool of values is whether the mutation is present in none, one or
the two chromosomes of the individual. As such, [19] does not address one specific
genomic file format, but rather the genomic data as an array of properties. This
array of properties more closely resembles the data represented in a VCF file (limited to
mutations affecting only one nucleotide, i.e. Single Nucleotide Polymorphisms, SNPs),
but it could also be used for sequenced data files (such as a SAM file), if all reads
covering one position modified by the watermark are modified.
Based on the knowledge of what has been shared before, and with which
modifications, the authors use an optimization problem represented as an Integer
Linear Program (ILP) to decide which modifications should be applied to the data
before sending it to the next recipient. The optimization minimizes the probability
that the recipients collude and deduce the watermarked positions, under
constraints guaranteeing certain levels of utility.
Each modification is deliberate (the ILP minimization can select each data
point independently). This allows them to perform such actions as to repeat certain
modifications across multiple watermarked instances to fight off collusions.
The authors consider the utility as a fraction of data modified: if the number
of modified data points compared to the number of total points is low, the authors
consider the utility to be high.
A limitation of this algorithm is that the watermark cannot be undone.
2.3 Use case for watermarking in genomics
Our use case applies watermarking as a fingerprinting mechanism for genomic
information. Nevertheless, several situations can be covered, as described next.
Either the genomic data owner or the genomic data custodian can receive a
request for a copy of the data. The request might be extended with metadata
regarding the intentions of the researcher (e.g. conduct a study on cancer or use it
as input for a genealogy search), making it possible to check against the owner's
policy whether the request should be granted.
In addition to this protection, the genomic data owner or custodian might
prefer to slightly alter the file during the transfer by including a fingerprint. If the
alteration is unique for each request, the data owner is then able to audit the result
of leakages. If the leaked data matches the transferred one (exactly the same reads
are present, or the number of identical mutations is above a certain threshold), it is
worth comparing the variations present to the ones inserted deliberately. In the case
where they match the ones sent to one of the requesters, the owner is at least aware
of who breached the link of trust.
Another application could be to limit the temptation of separating the
genomic data from associated metadata and privacy rules. In the upcoming MPEG
standard for genomic data, ISO/IEC 23092, MPEG-G [20], [21], [22], it will be
possible to convey such type of fields alongside the genomic information. By
modifying the privacy rules, an attacker could repurpose a file without permission.
In this case, the watermarking could be generated using the privacy rules as input
instead of the genomic data. In the case where some auditing agent detects data
supplied with a privacy rules field which is not the input used for the watermark, a
flag could be raised indicating a modification of the content.
Finally, if for a given case the data request is just for a brief showing in order
to decide if this could be a valuable input, but protection methods such as
cryptography are for any reason not suitable, watermarking could be an approach
to ensure that the data has a low value outside the very scope of the request. The
ideas used in [23] could be of interest in such a use case: the same concepts of
optimizing the trade-off between privacy and utility and how to measure both ideas
are relevant to this scenario.
3 Fingerprinting of reads with mutations
3.1 Introduction
The aim of the presented method is to be as unintrusive as possible. We focus on
watermarking reads with some mutations, considering them more likely candidates
for an exportation of data. The reads with mutations are independent subsequences
of DNA read by the sequencing machine where the alignment has detected
differences between the read and the reference genome, as introduced in Subsection
2.1. Each read is treated separately, so importing other information will not make
the watermark disappear.
On the other hand, the fingerprint becomes hard to detect if the reads themselves
are modified. This attack, however, is limited by the cost of the modification:
introducing or removing mutations could alter the conclusions drawn from the file,
which makes it an unsuitable way to defeat the fingerprinting. An
important feature of our approach is the fact that the user holding the
watermarking key can reverse the modifications.
The proposed method consists of four steps for each read sequence:
- generate a description for it,
- verify that it is a suitable candidate for the fingerprint method,
- based on a secure transformation over the description, decide if we can perform
the modifications,
- in case of positive result, perform the modifications.
The four steps are described in detail in the following four subsections.
3.2 Description
The assumption on top of which this method is built is that the valuable information
within a genomic file lies in the positions of the modified nucleotides (i.e. the positions
of an insertion, deletion or skipped nucleotide base). Therefore, we assume that no
such operation will be added or removed, and that we can use them to construct the
description.
To do so, we propose to use a 256-bit description. For every
modification (insertion I, deletion D, skipped N and mutation X), in the order in which
they appear in the read, we append to our description the position of the modification,
for example as an 8-bit unsigned integer (which is enough to store the position in the
case of a file with a maximum read length of 150 base pairs), followed by another
byte encoding the modification operation (first four bits) and the number of
nucleotides to which the operation applies (last four bits). The remaining bits are
set to 0. An example is shown in Table 2, where we separate with
hyphens each field used for the description.
Table 2: Example of description construction
Read CIGAR                          5M   2I      3M   4D       1M
Aggregated position                 0    5       7    10       14
Information stored in description        5-I-2        10-D-4
3.3 Suitability of read
As we have seen, the method works on the information contained in the CIGAR
string of each read (i.e. the list of operations with respect to the reference, such as
insertion or deletion).
Under the conditions introduced in the previous subsection, we can overflow
neither the 256 bits of the description (which limits it to 16 operations) nor the
four-bit operation length (which limits the maximum length of each operation to 16). We
consider suitable those reads for which we can construct the description. Unsuitable
reads are disregarded: changing the size of the description and of its fields
changes the number of reads taken into account.
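The description construction and the suitability check can be sketched together as follows. Several details are assumptions on our part and not taken from the method itself: the numeric codes in OP_CODES, the choice to store the operation length minus one in the four-bit field (so that lengths 1 to 16 fit), and the rule that M, = and S operations advance the aggregated position without being recorded. Reads with hard clips or padding are simply rejected in this sketch.

```python
import re

# Hypothetical 4-bit codes for the recorded operations.
OP_CODES = {"I": 1, "D": 2, "N": 3, "X": 4}
# Assumed: these operations advance the position but are not recorded.
UNRECORDED = {"M", "=", "S"}
DESCRIPTION_BYTES = 32  # 256 bits


def build_description(cigar):
    """Return the 32-byte description, or None if the read is not suitable."""
    fields = bytearray()
    position = 0
    for length_text, op in re.findall(r"(\d+)([A-Z=])", cigar):
        length = int(length_text)
        if op in OP_CODES:
            # Assumption: the length nibble stores length - 1 (1..16 fits).
            if position > 255 or not 1 <= length <= 16:
                return None
            fields += bytes([position, (OP_CODES[op] << 4) | (length - 1)])
        elif op not in UNRECORDED:
            return None  # hard clips and padding are left out of this sketch
        position += length
    # Perfect matches carry no recorded operation; too many operations overflow.
    if not fields or len(fields) > DESCRIPTION_BYTES:
        return None
    return bytes(fields.ljust(DESCRIPTION_BYTES, b"\x00"))


print(build_description("5M2I3M4D1M").hex())
```

For the CIGAR string of Table 2, the first four bytes hold the fields 5-I-2 and 10-D-4 and the remaining 28 bytes are zero.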
3.4 Secure transformation
In order for this algorithm to be secure, we need to obtain a secret from the
previously constructed description. This secret will define how to apply the
fingerprint. Furthermore, in order to be able to undo the modification, the
secret must always be the same for a given description. We use the term secure
transformation for an operation which, given a description, deterministically
generates a secret.
During both watermark auditing and watermark removal we want to
obtain the same result from the secure transformation. To this end, we do not
consider the numbering of the read to be a suitable input since, for example, if at
some point the user generates a new file containing only a subset of the reads
(perhaps all the reads aligned to one reference sequence only), this input is lost.
Similarly, an initialization vector would be required per read (in order to maintain the
independence of reads), and transmitting one for every read would not be reasonable.
Therefore, the only input to the transformation will be the previously constructed
description.
One candidate for the transformation is the AES cipher in electronic
codebook mode. In this mode, the cipher is stateless and takes only the plaintext
block and the key as input: the consequence is that the same input always returns
the same output. There are flaws in this secure transformation which have
repercussions on the fingerprinting method. These are discussed in Section 4.
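The property required from the secure transformation (same key and same description always give the same secret) can be illustrated with the Python standard library. Note that this sketch does not use AES in electronic codebook mode, which is not available in the standard library. As a stand-in it uses HMAC-SHA256, a different keyed deterministic function, and the function name secure_transformation is ours.

```python
import hashlib
import hmac


def secure_transformation(key: bytes, description: bytes) -> bytes:
    """Derive a 32-byte secret from the description (HMAC-SHA256 stand-in)."""
    return hmac.new(key, description, hashlib.sha256).digest()


description = bytes([5, 0x11, 10, 0x23]) + bytes(28)
first = secure_transformation(b"recipient-key", description)
second = secure_transformation(b"recipient-key", description)
print(first == second)  # True: no counter, nonce or read number is involved
```

Because no per-read state enters the computation, the secret can be recomputed for any read taken in isolation, for instance after a subset of the reads has been exported.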
3.5 Perform the modifications
The result of the fingerprinting will be a modification of the soft clipping at the
beginning of the read (see Subsection 2.1 for details). As soft clipping is not
considered in the description's formulation, the process will not affect the
description. The method can be readily extended to support also modifications at
the end of the read.
As explained later, we interpret the result of the secure transformation as the
conditions to be met and the output of the fingerprinting operation. We want the
obtained secret to convey the necessary information to perform and reverse the
fingerprinting process. The modification process at the beginning or the end of the
read is the same, the only difference is that the effect is mirrored. The basic idea is
that the secure transformation result will encode the state before and after
fingerprinting, the state being the length of the soft clip operation and its content.
We define a number of bits encoding the length of the soft clip operation. In
this case, it is preferable not to use a straightforward encoding as this would lead to
sampling uniformly in the pool of possible lengths. On the contrary, if we plot the
histogram of soft clip lengths (Figure 1), we observe that short lengths are far more
likely. As such, it is preferable to construct the decoding of the length of the soft
clip operation using the observed Cumulative Distribution Function.
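One possible way to decode a length through the cumulative distribution is sketched below. The histogram counts are invented for illustration and are not the values behind Figure 1. The use of a single byte of the secure transformation output as input is also an assumption.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical soft clip length counts (index = length); not measured data.
HISTOGRAM = [900, 40, 25, 15, 10, 6, 4]


def build_cdf_thresholds(histogram, levels=256):
    """Scale the cumulative counts to the range of one byte."""
    total = sum(histogram)
    return [count * levels // total for count in accumulate(histogram)]


def decode_length(byte_value, thresholds):
    """Map a uniformly distributed byte to a length following the histogram."""
    return bisect_right(thresholds, byte_value)


thresholds = build_cdf_thresholds(HISTOGRAM)
zeros = sum(decode_length(b, thresholds) == 0 for b in range(256))
print(thresholds, zeros)
```

With these counts, 230 of the 256 possible byte values decode to a length of zero, so short soft clips are proposed far more often than long ones, mirroring the shape of the observed distribution.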
Fig. 1. Histogram of softclip lengths
With this mechanism, we can now first read from the result of the secure
transformation a guess for the current soft clip operation size, $guess_{size}$, and then the
length into which we should transform it, $transformation_{size}$. This distribution
function is likely to return the value zero for both random variables, thus
reducing the likelihood of executing a transformation. Additionally, we also decode
base pair strings: $guess_{string}$ and $transformation_{string}$. For the interpretation of the
string we use only the four base pairs found in nature ('A','C','T','G'). As such, each
byte of the secure transformation encodes up to four base pairs. In Figure 2 we show
how the bytes are interpreted.
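Decoding four base pairs from one byte amounts to reading it two bits at a time. The following sketch assumes that the two-bit values 0 to 3 map to 'A', 'C', 'T', 'G' in the order listed above and that the most significant pair comes first. The actual layout is the one given in Figure 2, which this sketch does not claim to reproduce.

```python
BASES = "ACTG"  # order as listed in the text; the bit assignment is assumed


def decode_bases(data: bytes, count: int) -> str:
    """Decode `count` base pairs from the given bytes, two bits per base."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):  # assumed: most significant pair first
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:count])


print(decode_bases(bytes([0b00011011]), 4))  # ACTG
```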
Fig. 2. Interpretation of the secure transformation output.
The operation is performed if $guess_{size}$ and $guess_{string}$ match the observed
operation. This ensures that if we perform the reverse operation we will have the
original soft clip content encoded in the output of the secure transformation. One
exception to this is when the guess is not equal to the observed operation, but the
proposed result of the transformation is. In other words, $transformation_{size}$
and $transformation_{string}$ match the observed soft clip. In this case, if we did not apply
any modification, during the reverse operation we would be wrongly led to
believe that the watermark operation had been performed. In order to avoid this, we swap
the values of guess and transformation. In doing so, we signal this special case to the
unfingerprinting routine and we are able to correctly reverse the
modifications (see Algorithm 1 for an illustration of this).
The operation is only applied when both $guess_{size}$ and $transformation_{size}$ are
smaller than the position of the first non-softclip and non-equal operation. This is
to ensure that all the information required to undo the operation is available in the
reference.
Alg. 1. Pseudocode for reading watermarking function parameters.
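The selection logic described in this subsection, including the swap for the special case, can be sketched as follows. This is our own reading of the text and not a transcription of Algorithm 1. Each state is modelled as a (soft clip length, soft clip string) pair, and first_edit_position is a hypothetical name for the position of the first non-softclip and non-equal operation.

```python
def select_parameters(observed, guess, transformation, first_edit_position):
    """Return (source, target) for the read, or None to leave it untouched."""
    if guess == transformation:
        return None  # nothing would change
    # Both lengths must stay below the first described operation.
    if max(guess[0], transformation[0]) >= first_edit_position:
        return None
    if observed == guess:
        return guess, transformation
    if observed == transformation:
        # Special case: swap the roles so that the reverse pass stays consistent.
        return transformation, guess
    return None


guess, transformation = (2, "AC"), (0, "")
print(select_parameters((2, "AC"), guess, transformation, 10))
print(select_parameters((0, ""), guess, transformation, 10))
```

The second call shows the special case: the read already looks like the transformation result, so the roles are exchanged and the read is changed to the guess instead.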
3.5.1 Fingerprinting operation
In a nutshell, the fingerprinting operation will replace the initial $guess$ content with
the $transformation$ input. This gives us new criteria for the suitability of the read
to undergo a fingerprinting operation: there cannot be a base pair involved in the
description in the first $\max(guess_{length}, transformation_{length})$ base pairs of
the read. This ensures that we will only replace soft clip operations (for which the
content is encoded in $transformation$) or match operations (in which case the
content is straightforwardly present in the reference genome).
See Figure 3 for a visual representation of the case where
$transformation_{length} < guess_{length}$. In the inverse case, we would copy the
reference genome in the undoing operation. Figure 4 shows how a read, its soft
clipping and its corresponding CIGAR string are modified after applying the
transformation operation.
3.5.2 Undoing fingerprints
The process to undo the modifications is the same as for modifying. If the read starts
with transformation, we change it back to guess. If it starts with
guess, we replace the beginning with transformation.
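The symmetry between marking and unmarking can be sketched as follows. This is a simplified illustration under stated assumptions: the leading soft clip is treated as a plain string, the adjustments to the CIGAR string and the use of the reference genome are ignored, and the name toggle_soft_clip is hypothetical.

```python
def toggle_soft_clip(soft_clip: str, guess: str, transformation: str) -> str:
    """Swap the observed soft clip between guess and transformation.

    The same function marks and unmarks: applying it twice with the same
    (guess, transformation) pair returns the original soft clip.
    """
    if guess == transformation:
        return soft_clip          # nothing to encode
    if soft_clip == guess:
        return transformation     # regular case
    if soft_clip == transformation:
        return guess              # special case: values are swapped
    return soft_clip              # read is left untouched


marked = toggle_soft_clip("ACCT", guess="ACCT", transformation="GA")
print(marked)                                   # GA
print(toggle_soft_clip(marked, "ACCT", "GA"))   # ACCT
```

Because the routine is its own inverse, running it with a different (guess, transformation) pair, as a wrong key would produce, does not restore the read.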
Fig. 3. Soft clip contraction operation
Fig. 4. Example of application of transformation operation to a read and its associated CIGAR String
4 Results and Discussion
4.1 Basic tests
The fingerprinting method needs a clear mapping between each of the file's
recipients and watermarking keys. In case of auditing leaked data, we would test
for each key if the leaked data matches the modified result. In case of granting
access to the unmodified version, only the fingerprinting key is required by the
recipient to reverse the modification process. If one of the keys is leaked, the
corresponding file could be reversed, but the other files do not lose their
watermarking. In fact, using an incorrect key to remove the watermark from a file
would add a new watermark to it.
In order to test the algorithm, we use a file containing the sequencing of a
human genome, generated with an Illumina device [8]. The file has a low coverage
(2.3). Coverage refers to the average number of identifications for a single
nucleotide, as described in [24]. The file is part of the database of test material [25]
used in the standardization of MPEG-G.
The file contains 5.5 ⋅ 10^7 reads, of which 3.8 ⋅ 10^7 are perfect matches
and are therefore discarded. From the remaining reads, the description could be
built for 6.2 ⋅ 10^5.
We discard those cases where the change in soft clip length would be too
big (threshold arbitrarily set at 10); this discards 2.8 ⋅ 10^5 reads. The algorithm can only
be applied if both guess_length and transform_length are less than the position of
the first operation which is not a soft clip or a match. This is the case in 3.4 ⋅ 10^5
reads, of which for 3.1 ⋅ 10^5 either guess or transform matches the beginning.
However, as it is highly likely to read a null length from the secure transformation,
in 2.0 ⋅ 10^5 instances the ciphertext does not represent a real transformation. In the
end, 1.2 ⋅ 10^5 reads are transformed.
This number has to be compared to the 6.2 ⋅ 10^5 reads which could
possibly have been modified. Someone trying to circumvent the mark would need to
work on these 6.2 ⋅ 10^5 reads in order to reverse the modifications purposely introduced
in 18.6% of them.
4.2 Resistance to collusion attacks
4.2.1 Vulnerability
Let us assume a collusion attack, where all parties suppose that the most frequent
version of a read has not been modified. Such a situation is more likely if a
watermarked read is modified differently for multiple recipients. In order to test
this risk, we reduce the previous test data to only those reads mapped to the first
chromosome. We then simulate sharing the file with multiple parties by
executing the fingerprint method with different keys. For each read, several
outcomes are possible: the read could not be fingerprinted with any of the
keys, it could be fingerprinted with some of the keys, or with all of the keys. As the
probability of a read being modified is fairly low, the most likely outcome is that, if
different variants of the same read exist across multiple watermarked instances of
the same file, the most common variant is the original, non-watermarked
one.
In this test, we reduced the number of reads to 4.1 ⋅ 10^6, and we used seven
different keys for watermarking. On average, around 1.8 ⋅ 10^4 reads were
watermarked for each key, for a total of 6.62 ⋅ 10^4 reads being watermarked at least
once across all instances. For a total of 6.59 ⋅ 10^4 reads, the most frequent variant
was the original, non-watermarked one.
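The attack assumed in this test amounts to a per-read majority vote. A minimal sketch follows, with made-up variants and a hypothetical function name; it is not the code used for the experiment.

```python
from collections import Counter


def majority_variant(variants):
    """Return the most frequent variant of one read across colluding copies."""
    return Counter(variants).most_common(1)[0][0]


# One read as seen in five watermarked instances: two instances were marked,
# each with a different key, so the original variant remains the most frequent.
copies = ["ACGT", "ACGT", "TTGT", "ACGT", "ACCT"]
print(majority_variant(copies))  # ACGT
```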
Figure 5 shows how the risk of collusion increases with the number of
watermarked files: as the number of watermarked files increases, the ratio
of cases where the non-watermarked variant is the most frequent variant increases,
thus simplifying a collusion attack which only selects the most frequent variant.
The version of the algorithm presented here is vulnerable to such attacks. However,
some of the concepts from [19] could be applied to the algorithm. Although we
cannot select each read to be watermarked independently, due to the way the reads
are selected, the algorithm could be executed multiple times, once for each of the
provided keys. As soon as a read is watermarked with one of the keys, the iterations
stop.
Fig. 5. Evolution of the ratio of cases where the original non-watermarked variant is not the most frequent
across all watermarked files.
As the likelihood of each read being modified is low, we can reason as
if each key modified a distinct subset of reads, without any overlap with
any other subset. Under these conditions, we want a configuration such that
each recipient is assigned a unique set of keys, that none of the sets is empty
(as this would imply that the recipient receives a non-modified version), and
that each key is present in a majority of sets (thus ensuring that the modified
version is the most frequent version, even when all recipients collude). This
configuration also needs to minimize the number of keys used in each
set, as each key implies modifications to the file, thus decreasing its utility and possibly
harming conclusions. If the number of recipients is known beforehand, the task is
trivial, and the collusion strategy of selecting the most frequent version will always
fail.
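The three conditions on a configuration can be checked mechanically. The sketch below is an illustration with hypothetical function names; it assumes that "majority" means a strict majority of the sets, which is our reading rather than a definition given in the text.

```python
def leaking_keys(key_sets):
    """Keys used at least once but not present in a strict majority of sets."""
    used = set().union(*key_sets)
    total = len(key_sets)
    return {k for k in used if 2 * sum(k in s for s in key_sets) <= total}


def is_valid_configuration(key_sets):
    """Unique sets, no empty set, and every used key in a majority of sets."""
    unique = len(set(map(frozenset, key_sets))) == len(key_sets)
    non_empty = all(key_sets)
    return unique and non_empty and not leaking_keys(key_sets)


# Three recipients, three keys: each key appears in two of the three sets.
print(is_valid_configuration([{1, 2}, {1, 3}, {2, 3}]))  # True
# Two recipients with one key each: both keys are in a minority.
print(is_valid_configuration([{1}, {2}]))                # False
```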
4.2.2 An improvement using an ILP
In the case where the number of recipients is unknown, some concessions have to
be made. Let us assume that the set of keys for a new user is decided upon receiving
the new request. We only know which keys were used for the previous recipients.
Similar to [19], we construct an Integer Linear Program (ILP), which aims
at minimizing the cost associated with the decision. We construct the objective
function as a lexicographic search: first we minimize the risk of leakage, then
the sum of the sizes of the sets, and finally the size of the biggest set. As
previously explained, whenever a key is used at least once but is not present in a
majority of sets, the reads associated with that key will be recoverable. The ILP decides
the set for the new recipient, which has to be different from all previously employed sets.
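For a small key pool, the same lexicographic decision can be illustrated by exhaustive search instead of an ILP solver. The sketch below is a stand-in, not the ILP used here: it assumes that the risk of leakage is measured as the number of keys that are used but not in a strict majority of sets, and all names are hypothetical.

```python
from itertools import combinations


def leaking_keys(key_sets):
    """Keys used at least once but not present in a strict majority of sets."""
    used = set().union(*key_sets)
    return {k for k in used if 2 * sum(k in s for s in key_sets) <= len(key_sets)}


def choose_next_set(previous, pool):
    """Pick the key set for a new recipient by brute force.

    Cost tuples are compared lexicographically:
    (number of leaking keys, sum of set sizes, size of the biggest set).
    Returns None if every non-empty subset of the pool is already in use.
    """
    best, best_cost = None, None
    for size in range(1, len(pool) + 1):
        for combo in combinations(sorted(pool), size):
            candidate = frozenset(combo)
            if candidate in previous:
                continue  # each recipient needs a unique set
            sets = previous + [candidate]
            cost = (len(leaking_keys(sets)),
                    sum(len(s) for s in sets),
                    max(len(s) for s in sets))
            if best_cost is None or cost < best_cost:
                best, best_cost = candidate, cost
    return best


first = choose_next_set([], {1, 2, 3})
second = choose_next_set([first], {1, 2, 3})
print(sorted(first), sorted(second))  # [1] [1, 2]
```

In this toy run the second recipient forces a new key into the pool, and that key is in a minority until further recipients receive it.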
Table 3 shows the result of this iterative process: in order to obtain a
column, all previous columns were provided as input. We can observe how the
solver attempts to keep the number of keys used as low as possible, but as soon
as all combinations are used, a new key appears in the pool of used keys. This key
is then a source of leakage, as it appears in a minority of sets. In subsequent calls
the key is always selected, so it eventually reaches a point where it is no longer a
source of leakage. However, this happens when all combinations are used, and therefore
the problem arises again at the next iteration. The process can be observed for
recipients 2, 4, 8, and 16.
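Table 3: ILP iterative decisions
Keys: Key 1, Key 7, Key 3, Key 2, Key 5, Key 4, Key 6
Recipient 1: Y
Recipient 2: Y
Recipient 3: Y Y
Recipient 4: Y
Recipient 5: Y Y Y
Recipient 6: Y Y
Recipient 7:
Recipient 8: Y Y Y
Recipient 9: Y Y Y Y
Recipient 10: Y Y
Recipient 11:
Recipient 12: Y Y Y Y Y
Recipient 13: Y
Recipient 14: Y Y Y Y
Recipient 15: Y
Recipient 16: Y Y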
We generate the fingerprints for each of the recipients as decided by the ILP;
the results are summarized in Figure 6. We can see that as the number of
recipients increases, the average number of modified reads for each recipient
increases. Furthermore, and as previously explained, the introduction of a new key
to the pool creates leakage situations, which are corrected as soon as the file has
been shared with enough recipients.
Fig. 6. Number of reads modified and leakage ratio versus number of recipients.
There is a trade-off between the use of multiple keys to avoid collusion
attacks and the computation needed to generate files with multiple keys: each
key increases the time required to generate such a file, as depicted in Figure 7.
Fig. 7. Time required to fingerprint a file compared to copying it
4.3 Generalization of the description
Our next step is to slightly modify the behavior of the algorithm so that
a broader set of cases is accepted as input. We use a file with
variable read length, in which some reads have a length greater than 255 (average depth
of 8.1 reads per base). The byte used to store the position of the mutation
would thus overflow, so we convert it to a two-byte field. The maximal length of the
description is kept at 32 bytes; therefore, we are able to store fewer mutations in it.
The observed likelihood of a read being marked, given that it had at least one
mutation and the description could be constructed, is 25.5%.
Finally, we test a file with 1.7 ⋅ 10^8 reads [26], for an average depth of 7.74.
In this file, there are 1.5 ⋅ 10^8 candidates. For 6 ⋅ 10^7 of them, a description could
be built, and in 25.3% of these the constructed description resulted in a modification of the
read.
The proposed watermarking method has the advantage of being reversible
and of limiting the alteration of the file.
The drawback is that an attacker can render it useless by modifying
the reads. For example, shifting all reads one position to the left, or inserting
new modifications, breaks the process. Furthermore, modifying the length of
every soft clip operation removes the result of the watermarking. However, this
same approach greatly decreases the value of the published information, as it
blindly modifies the data and thus introduces more noise. To
successfully defeat the security method, the attacker would have to publish a
version of the file as close as possible to the original non-fingerprinted file, without
the modifications introduced to identify that instance of the data.
An attacker could also try to partially reconstruct the original file. As the
secure transformation will always give the same output for the same description,
the attacker knows that, among all reads whose descriptions are the same, some will
share the same soft clip operation due to the watermarking. If there are enough
instances, it would be possible to infer the values of guess and transformation,
allowing the attacker to reverse the process for those reads.
Similarly, and as mentioned in Section 3, it is possible that different
recipients of the same data, but with different fingerprints, could collude in order to
undo the fingerprinting. In this case, the attackers could compare the
different variants of the same read. The most common variant is most likely not
marked by any fingerprint. As such, a collusion of three or more actors could
make it possible to undo the security measure: considering a marking likelihood of 25%,
there is a probability of 84% that a read will be marked at most once in the three
copies.
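The 84% figure follows from a binomial model, assuming that each copy marks a given read independently with probability 0.25. A short check, with a hypothetical function name:

```python
from math import comb


def prob_at_most_one_marked(p: float, copies: int) -> float:
    """P(a read is marked in at most one of `copies` independent copies)."""
    return sum(comb(copies, k) * p**k * (1 - p)**(copies - k) for k in range(2))


# 0.75^3 + 3 * 0.25 * 0.75^2 = 0.84375
print(round(prob_at_most_one_marked(0.25, 3), 3))  # 0.844
```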
5 Conclusions
We have presented a method to modify a file of aligned genomic information in
order to introduce changes that are recognizable, but hard to identify as such
without the required parameters. These changes can be undone when
the original parameters of the method are known.
The proposed method is designed with some assumptions in mind: mainly
that the reads to be marked by the fingerprint operation are those carrying
information about mutations, and that a file recipient has no incentive to modify
blindly and broadly the content of the file showing mutations, as this decreases the
value of the information.
The inner workings of the fingerprint operation can be viewed as separate
steps. First, special features of each data entry are combined; this combination is
then used as input of a cryptographic function; and lastly the ciphertext is
interpreted as the modifications to be applied to the data entry. By modifying each
one of these three steps, the proposed method could be adapted to other needs or
another set of assumptions. For example, if modifying the soft clip operations is
considered too harmful to the value of the file, a new interpretation of the ciphertext
could be devised to modify the quality values instead.
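The three-step structure can be sketched as three replaceable functions. Everything in this sketch is an assumption made for illustration: the feature encoding, the use of HMAC-SHA256 as a stand-in for the cryptographic function, and the way the output bytes are read are not the ones used by the method.

```python
import hashlib
import hmac


def combine(features):
    """Step 1: join the features of a data entry into one description."""
    return "|".join(features).encode("utf-8")


def transform(key: bytes, description: bytes) -> bytes:
    """Step 2: keyed, deterministic function (HMAC-SHA256 as a stand-in)."""
    return hmac.new(key, description, hashlib.sha256).digest()


def interpret(output: bytes):
    """Step 3: hypothetical reading of the output: a size in 0..3, then payload."""
    size = output[0] % 4
    return size, output[1:1 + size]


def parameters(features, key: bytes):
    """Chain the three steps; each one can be swapped independently."""
    return interpret(transform(key, combine(features)))


features = ["chr1", "10468", "3:A>G"]  # made-up description of one read
print(parameters(features, b"key-for-recipient-1") ==
      parameters(features, b"key-for-recipient-1"))  # True
```

Replacing interpret alone would be enough to target, for instance, the quality values instead of the soft clips.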
The closest algorithms to what we propose are watermarking strategies as
used in the audio-visual world, e.g. in the case of images (still and video) and audio.
Alongside the familiar strategy of clearly modifying the values of certain regions
of an image, other approaches have been proposed, as surveyed in [27]. As in the
case of image watermarking, we have to be concerned with the possible
transformations applied to the file. However, despite the similarity in the objectives
(either marking the ownership of the intellectual property or identifying a specific
copy), the data types are quite different. For example, in the case of genomic
information we have multiple reads storing multiple copies of what should be the
same information, with no clear equivalent in the case of an image file. Similarly,
a modification not visible to the human eye or perceptible to the ear is acceptable
in the audio-visual world, whereas in the case of genomic information it might be
wrongly interpreted as a mutation.
Furthermore, another path which could be explored in order to create a
fingerprinting method for genomic information is exploiting the least significant bits
of the quality values. Each read's base pair comes with a quality score, a representation
of the sequencer device's confidence when identifying that base pair. Using the least
significant bits could however fail if a new format used lossy compression for the
quality scores [21].
Some requirements used in this version of the algorithm could be lifted in
the case where the reversing properties were not to be used. This could allow a new
version of the proposed method where more reads are changed.
Acknowledgements
The work presented in this paper has been partially supported by the Spanish
Research Agency/ERDF(EU), through the project Secure Genomic Information
Compression (GenCom, TEC2015-67774-C2-1-R, TEC2015-67774-C2-2-R), and
by the Generalitat de Catalunya (2017 SGR 1749).
References
[1] E. S. Lander et al., “Initial sequencing and analysis of the human genome”,
Nature, vol. 409, no. 6822, pp. 860–921, Feb. 2001.
[2] J. C. Venter et al., “The Sequence of the Human Genome”, Science, vol.
291, no. 5507, pp. 1304–1351, Feb. 2001.
[3] M. Fiume et al., “Federated discovery and sharing of genomic data using
Beacons”, Nat. Biotechnol., vol. 37, no. 3, pp. 220–224, Mar. 2019.
[4] H. Tang et al., “Protecting genomic data analytics in the cloud: state of the art
and opportunities”, BMC Med. Genomics, vol. 9, no. 1, pp. 1–9, Dec. 2016.
[5] J. Delgado, S. Llorente, and D. Naro, “Protecting Privacy of Genomic
Information”, Stud. Health Technol. Inform., vol. 235, pp. 318–322, 2017.
[6] L. Leng, A. B. Teoch, and M. Li, “Simplified 2dpalmhash code for secure
palmprint verification”, Multimed. Tools Appl., vol. 76, no. 6, pp. 8373-8398,
Mar. 2017.
[7] R. Popa, “An Analysis of Steganographic Techniques”, The “Politehnica”
University of Timisoara, 1998.
[8] M. L. Metzker, “Sequencing technologies — the next generation”, Nat. Rev.
Genet., vol. 11, p. 31, Dec. 2009.
[9] P. J. A. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice, “The Sanger
FASTQ file format for sequences with quality scores, and the Solexa/Illumina
FASTQ variants”, Nucleic Acids Res., vol. 38, no. 6, pp. 1767–1771, 2009.
[10] H. Li et al., “The Sequence Alignment/Map format and SAMtools”,
Bioinformatics, 2009.
[11] P. Danecek et al., “The variant call format and VCFtools”, Bioinformatics,
vol. 27, no. 15, pp. 2156–2158, 2011.
[12] A. Giakoumaki, S. Pavlopoulos, and D. Koutsouris, “Multiple Image
Watermarking Applied to Health Information Management”, IEEE
Transactions on Information Technology in Biomedicine, vol. 10, no. 4, pp.
722–732, 2006.
[13] N. Provos and P. Honeyman, “Hide and seek: An introduction to
steganography”, IEEE Secur. Priv., vol. 1, no. 3, pp. 32–44, May 2003.
[14] A. Ibaida, I. Khalil, and R. van Schyndel, “A low complexity high capacity
ECG signal watermark for wearable sensor-net health monitoring system”,
2011 Computing in Cardiology, pp. 393–396, 2011.
[15] D. S. Chauhan, A. K. Singh, B. Kumar, and J. P. Saini, “Quantization based
multiple medical information watermarking for secure e-health”, Multimed.
Tools Appl., vol. 78, no. 4, pp. 3911–3923, 2019.
[16] H. Zarrabi, M. Hajabdollahi, S. M. R. Soroushmehr, N. Karimi, S. Samavi,
and K. Najarian, “Reversible Image Watermarking for Health Informatics
Systems Using Distortion Compensation in Wavelet Domain”, 2018 40th
Annual International Conference of the IEEE Engineering in Medicine and
Biology Society (EMBC), pp. 798–801, 2018.
[17] D. Heider and A. Barnekow, “DNA-based watermarks using the DNA-Crypt
algorithm”, BMC Bioinformatics, vol. 8, no. 1, p. 176, 2007.
[18] D. Heider and A. Barnekow, “DNA watermarks: A proof of concept”, BMC
Mol. Biol., 2008.
[19] A. Yilmaz and E. Ayday, “Collusion-Secure Watermarking for Sequential
Data”, eprint arXiv:1708.01023, Aug. 2017.
[20] J. Delgado, S. Llorente, and D. Naro, “Protecting Privacy of Genomic
Information”, Stud. Health Technol. Inform., vol. 235, pp. 318–322, Apr. 2017.
[21] ISO/IEC JTC 1/SC 29/WG 11, “MPEG-G, ISO/IEC 23092 Genomic
Information Representation”, 2019. [Online]. Available:
https://mpeg.chiariglione.org/standards/mpeg-g. [Accessed: 12-September-2019].
[22] C. Alberti et al., “An introduction to MPEG-G, the new ISO standard for
genomic information representation”, bioRxiv, p. 426353, Jan. 2018.
[23] M. Humbert, E. Ayday, J.-P. Hubaux, and A. Telenti, “Reconciling Utility
with Privacy in Genomics”, in Proceedings of the 13th Workshop on Privacy in
the Electronic Society - WPES ’14, pp. 11–20, 2014.
[24] K. Song et al., “Coverage recommendation for genotyping analysis of
highly heterologous species using next-generation sequencing technology”,
Scientific Reports, vol. 6, Oct. 2016.
[25] Run: ERR317482, “Illumina HiSeq 2000 paired end sequencing”, 2019.
Available: https://www.ebi.ac.uk/ena/data/view/ERR317482. [Accessed: 12-September-2019].
[26] Run: ERR194146, “Utah residents (CEPH) with Northern and Western
European ancestry”, 2019. Available:
https://storage.googleapis.com/genomics-public-data/platinum-genomes/bam/NA12877_S1.bam.
[Accessed: 12-September-2019].
[27] N. Agarwal, A. K. Singh, and P. K. Singh, “Survey of robust and
imperceptible watermarking”, Multimed. Tools Appl., vol. 78, no. 7, pp. 8603–
8633, Apr. 2019.