
Medical document anonymization with a semantic lexicon

2000, Proceedings / AMIA ... Annual Symposium. AMIA Symposium

We present an original system for locating and removing personally-identifying information in patient records. In this experiment, anonymization is seen as a particular case of knowledge extraction. We use natural language processing tools provided by the MEDTAG framework: a semantic lexicon specialized in medicine, and a toolkit for word-sense and morpho-syntactic tagging. The system finds 98-99% of all personally-identifying information.

Patrick Ruch, Robert H. Baud, Anne-Marie Rassinoux; Pierrette Bouillon, Gilbert Robert
Medical Informatics Division, University Hospital of Geneva; ISSCO, University of Geneva

INTRODUCTION AND BACKGROUND

For centuries, care providers have been taking written notes about their patients. But with the age of the electronic patient record (EPR), the very confidential relationship between the clinician and the patient may reach its limits if hundreds of professionals have access to this information, even when such access is legitimate, as for example in the retrieval of similar cases in clinical document warehouses [note 1]. Research on corpora, as it needs both large amounts of textual data and frequent cooperation with more specialized groups, tends to extend this problem out of the medical sphere.

With reference to the rare [note 2] works [1] studying document anonymization, our system is a data de-identifier, a 'scrubbing' system, designed to remove explicit identifiers [note 3] such as name, address, phone number, and date of birth. However, such an operation does not guarantee that the result is universally anonymous, i.e. that nobody will ever be able to infer some information about the patient, for example by linking the document to other sources. Therefore the tool should be used together with more classical legal and procedural barriers.

From a technical point of view, the system is based on the MEDTAG lexicon [2] for the lexical resources, and on an original rule-based word-sense (WS) and morpho-syntactic (MS) tagger for the disambiguation task. The basic idea is to rely on markers to locate identifiers, but some markers may be ambiguous (i.e. they may not be markers in some contexts). Taggers are used to solve such ambiguities. Like the taggers described earlier [2], the toolkit we used is of Markovian type, i.e. it uses local grammar [3], but the first system was data-driven, while the present one is rule-based: disambiguation rules are written manually, as in the FACILE project [4]. A third component, using formal recursive transition networks (RTN) [5] for extraction and removal of the confidential items, has been specially applied.

OVERVIEW OF THE PROBLEM

Anonymization is seen as a particular case of knowledge extraction since, before removing the specific information, it is necessary to locate it. To show the complexity of the task, here are some excerpts [note 4]:

Miss Maria Christina GOMEZ DA LOVIS (0), born 3.3.1956 (1), without medical antecedents, but a caesarian section by Pfannentiel (2)
However, doctors Robert de Baud (3) and Anton Gicebuhler (4) as they know her well [...]
One stitches up by Donati (5), with Maxon (6) 4.0.
Doctors of the Geneva University Hospital (7) [...]
In the night, she decided to phone the EMS (8) [...]
Alan River, MD (9)

Table 1: Some examples of identifiers

2 and 5 are technical names. 0 is the full name of a patient. 1 is a date of birth. 3, 4 and 9 are physicians' names; note that they are composed of several items. In 3, the last name starts with a lower-case letter, while in 9, the last name is also a common noun. 6 is a medical device. 7 and 8 are health care institutions. In conclusion, only 0, 1, 3, 4 and 9 must be anonymized, whereas 2, 5, 6, 7 and 8 must be left intact.

The replacement operation we have designed is very simple: it replaces each character of any confidential item by an 'x', and it respects the case (capital letters are replaced by an 'X'); punctuation (hyphen, dot...) occurring within confidential items is kept (see Fig. 2).

Note 1: This tool is part of the University Hospital EPR.
Note 2: If medical papers are rare, the message understanding conferences (MUC) provide an interesting framework for named entity recognition.
Note 3: In the following, identifier refers to explicit personally-identifying items.
Note 4: Any occurrence of real names is fortuitous.

1067-5027/00/$5.00 © 2000 AMIA, Inc.

METHODS

Fortunately, among so many capitalized words, one very productive pattern emerges: an IDentity Marker (IDM), such as markers of politeness (Sir, Mrs., Miss) or titles (Dr., doctors), precedes, or follows as MD does in case 9, each occurrence of patient and doctor identifiers. Notice that this is also true for 4, which is related to the marker doctors by the coordination and. Unfortunately, while each occurrence of an identifier is preceded or followed by an IDM, the reverse is not true: an IDM may be neither followed nor preceded by an identifier. This is the case in 7, where Doctors refers to some humans without indicating their names, and in 8, where phone is not followed by a digit but by the name of a medical institution. Here is the starting point of the study: the task-oriented semantic separation between strict IDM (Ms., Dr. ...), which are always directly followed or preceded by identifiers, and tokens likely to refer to persons in general (doctors, professors ...), which are not necessarily followed or preceded by names.
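The replacement operation from the overview (every character becomes 'x', case is respected, punctuation is kept) and the idea of a strict IDM can be illustrated together. The following is only a minimal sketch, not the system described in this paper: it skips tagging entirely, and the marker set `STRICT_IDM`, the particle list `PARTICLES` and the function names are assumptions introduced for illustration.

```python
# Strict identity markers quoted in the text; the exact set is an assumption.
STRICT_IDM = {"Mr.", "Mrs.", "Ms.", "Dr."}
# Lower-case name particles (hypothetical list, for illustration only).
PARTICLES = {"van", "von", "de"}


def mask(item):
    """Replace letters and digits by x/X, respecting case; keep punctuation."""
    out = []
    for ch in item:
        if ch.isupper():
            out.append("X")
        elif ch.isalnum():
            out.append("x")
        else:
            out.append(ch)  # hyphens, dots, slashes and commas are kept
    return "".join(out)


def scrub(text):
    """Mask the name tokens that directly follow a strict IDM."""
    in_name = False
    result = []
    for tok in text.split():
        if tok in STRICT_IDM:
            in_name = True
            result.append(tok)
        elif in_name and (tok[0].isupper() or tok in PARTICLES):
            result.append(mask(tok))
            if tok.endswith(","):
                in_name = False  # a trailing comma closes the name
        else:
            in_name = False
            result.append(tok)
    return " ".join(result)


print(scrub("Your patient, Mr. A.-M. COGER, born in Geneva"))
print(mask("11.8.1959"))
```

Such a naive scan would also mask any capitalized word that happens to follow a name, which is one reason why a lexicon and disambiguation rules are needed.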
Again, the rule that a strict IDM is always adjacent to an identifier has never been violated within our corpora; instead, we found many misspelled occurrences (prfessor for professor) and many errors mixing up lower-case and capital letters, as for example in Miss Jane Dermott and Mr. Lawrence van Belleghem, which were incorrectly written jane Dermott and Lawrence Van Belleghem. Concerning phone numbers and dates, together with explicit markers similar to IDM, such as born (cf. case 1), we can also rely on very well defined patterns, which can be exhaustively listed. For example, a token with 3 digits, followed by 4 digits, with a dash in between, is considered a phone number.

Methodological hypothesis

From an epistemic point of view, three main hypotheses are guiding the experiment:
a. syntax can help to distinguish meanings of words having different syntactic categories;
b. syntactic and semantic ambiguities can be solved using simple taggers;
c. information extraction can also be tag-assisted.

We earlier verified hypotheses a and b [2]. This experiment focuses particularly on hypothesis c, and reports on our attempt to apply the MEDTAG tag-like framework to the task of anonymization. These hypotheses have been tested in the following way: texts are first annotated with the MS tagger, in order to make explicit the part-of-speech (POS). This stage provides the first disambiguation filter at the MS level. Then, the WS tagger provides the WS tag. Finally, this information serves to extract and annotate the identifiers, via a formal RTN algorithm.

Choice of a corpus

Let us note that French is the regular working language at the Geneva University Hospital. However, a small number of documents are written in English (less than 1%), and an even smaller number are written in German. This is particularly true of discharge summaries, mainly for patients living abroad, due to the international status of the city and the different languages spoken across the country. Therefore the system we designed is able to work in such a multilingual environment. In the test corpus only two documents were written in English, but the examples are provided in English for the sake of clarity.

<Header>
<mpi> [note 5] 423751
<name> Dr. P. GALLAWAY
<address> 56 Montaigne Av.
<city code> 1211 GENEVA
<unit> Surgery
<department> Digestive Surgery
<phone> Phone: 345-7343
<place, date> Geneva, the 5th of December, 1999
<Body>
Dear Dr. Gallaway,
Your patient, Mr. A.-M. COGER, born 11.8.1959, stayed in our service, from 05/05/99 to 05/06/99.
DIAGNOSIS: Acute pancreatitis of unknown origin.
Comorbidities: Gastritis. Hypertension. Esophagitis with backflow.
Mr. Coger, 72 year-old [note 6], has been admitted in emergency, after [...]. However tests for the cytomegalovirus and the EBV were negative. Therefore we perform an abdominal CT-scan [...] Mr. Coger will be followed in ambulatory by Drs. David Ducruet and Robert van Belleghem [...].

Figure 1: Example of a complete medical document

Standard documents within the University Hospital of Geneva include a header and a body (see Fig. 1). The header, where only structured data occurs, can be easily handled, as each field can be processed individually. So the document header has been automatically discarded in this experiment. We finally picked 1000 documents for a total of 80784 tokens: 600 postoperative reports, 200 laboratory and test results, and 200 discharge summaries. Two sets of documents have been extracted from this ad hoc corpus. The first set (set A, a representative sample with about 20% of the corpus, i.e. 16456 tokens) helped set up the system, whereas the second (set B, about 80% of the corpus, i.e. 64328 tokens) was used for assessment purposes. In the first set, we counted 124 identifiers.

Note 5: MPI stands for Master Patient Index.
Note 6: The age of the patient is considered as clinical data, and therefore is not to be anonymized.

Adapting taggers and lexicons

The MEDTAG tagset (see Tab. 3), based on the UMLS Metathesaurus, is seen as a basic ontology of the domain. With fewer than 40 tags, it aims at describing major semantic features related to the medical domain. The main target of the MEDTAG taggers is word-sense disambiguation. Considering an ambiguous token, both taggers attempt to provide the right tag (first MS, then WS). For example, a token like miss is ambiguous out of context between an action (to fail) and a person (a young lady), but may become unambiguous in a particular context. However, applying both taggers to the anonymization task has implied some refinements, mainly on the WS tagger.

Lexical refinements

Some lexical information had to be added, but we decided not to modify the MEDTAG tagset, and we worked with lists of particular cases in order to provide the anonymization-specific tags. Thus, we created two tags: first idm, which is attributed to lexemes such as Dr.; second id, which is given to human proper nouns (David, Louise...). Few proper nouns are present in the lexicon, and this tag is mainly used for recognizing identifiers (cf. the distribution of these tags, together with the MEDTAG tagset, in Tab. 3). It was also necessary to improve the coverage of the MEDTAG lexicon, which now contains 5131 entries. In particular, it was necessary to list more tokens likely to occur with a capital letter. As it is clearly impossible to put all proper names (tagged id) in the lexicon, we decided to work on lexemes written with a capital letter but which are not proper names. We focused on tokens referring to medical institutions (hospitals, clinics, ... tagged hcorg or spec). We also partially linked the MEDTAG lexicon with the Swiss Compendium, which describes all the drugs used within the country. Finally, most of the medical devices (tagged mdev, such as CT scan, Doppler, Macron...) were also added to the MEDTAG lexicon.

Strategy

Basically, it is necessary to recognize whether a given token is an IDM.
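The lexical refinements described above amount to a lookup that returns candidate tags for a token and protects capitalized words that are not names. The sketch below is a toy illustration under stated assumptions: the dictionary `LEXICON`, its few entries and the helper names are hypothetical, and the actual MEDTAG lexicon format is not shown in this paper.

```python
# Toy lexicon: lower-cased token -> candidate WS tags (hypothetical entries).
LEXICON = {
    "dr.": ["idm"],
    "doctors": ["idm", "pers"],  # ambiguous marker
    "hospital": ["hcorg"],       # health care institution
    "doppler": ["mdev"],         # medical device
    "david": ["id"],             # human proper noun
}


def ws_candidates(token):
    """Return the candidate WS tags of a token, or ['x'] if it is unknown."""
    return LEXICON.get(token.lower(), ["x"])


def may_be_identifier(token):
    """A capitalized token stays a possible identifier unless the lexicon
    lists it with tags that exclude a proper name (assumed policy)."""
    if not token[:1].isupper():
        return False
    tags = ws_candidates(token)
    return "id" in tags or tags == ["x"]


for tok in ["Dr.", "Doctors", "Hospital", "Doppler", "Mitchell"]:
    print(tok, ws_candidates(tok), may_be_identifier(tok))
```

Listing institutions and devices is what keeps capitalized tokens such as Hospital or Doppler out of the set of removal candidates, while an unknown capitalized token such as Mitchell remains a candidate.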
However, for some of these markers, recognizing the occurrences is not enough, as they may appear without being followed by confidential items, as for example Doctors in 7. Therefore the class pers has been split into two tags: idm and pers. For example, Dr. is unambiguously an idm, i.e. it is necessarily followed by a name identifier, while Doctors may be either an idm, if it is followed by an identifier, or a pers, if it is followed by anything else. A last group gathers tokens such as miss. Miss is ambiguous at the MS level between a common noun and a verb, before being ambiguous at the WS level between an action and an IDM. Fortunately, such ambiguities are rare: IDM are generally clearly defined at the MS level (they are common nouns, tagged nc), as for example Dr., Mr., Prof., Professor. And only some of them are ambiguous at the WS level (Professor, Doctors...) between an idm and a pers.

Disambiguation: MS and WS tagging

The output of the disambiguation task is a 3-level stream (see Tab. 2): the token, the MS tag, and the WS tag. Tab. 2 shows the tagging process: column 1 provides the token, as it appears in the corpus; column 2 provides the lexical tag-like morpho-syntactic information. Column 3 picks up the preferred MS tags, which contain the POS. Using the POS, a lexical access returns the hypothetical WS tags (column 4). Lexical ambiguities are separated by a '/'. Finally, column 5 provides the preferred WS tags.

Tokens     Lex. MS    MS       Lex. WS     WS
Miss       v/nc[s]    nc[s]    idm/pers    idm
L.         x          x        x           id
Mitchell   x          x        x           id
phones     v/nc[p]    v        act         act
the        det        det      def         def
Hospital   nc         nc[s]    hcorg       hcorg
Doctors    nc[p]      nc[p]    idm/pers    pers
of         sp         sp       rel         rel
the        det        det      def         def

Table 2: Morpho-syntactic and word-sense disambiguation

MS tags: the symbol before the bracket gives the part-of-speech (verb, noun, adjective ...), while the symbol between the brackets provides optional information about the morpho-syntactic features (s for singular, p for plural) [note 7].
nc is the MS tag for common nouns
v is the MS tag for verbs
det is the MS tag for determiners
sp is the MS tag for prepositions
x is the MS tag for unknown tokens

Note 7: The MS tagset tends to follow the European MULTEXT standard: http://www.lpl.univ-aix.fr/projects/multext/LEX/LEX1.html

Writing the rules

While adding more lexical information, it was necessary to write the disambiguation rules. The WS and the MS taggers use the same framework: a set of very contextual rules, applied on a short string of tokens (up to pentagrams), ranked according to their reliability. Here are two examples.

In Tab. 2, the following two-level rule is applied in order to disambiguate the token miss:

v/nc[s] {TOK:miss} ; np -> nc[s] ; np

This rule means: 'At the MS level: a word ambiguous (/) between a v and a nc[s], whose token form (between {}) is miss, and which is followed (;) by a np, must be rewritten (->) nc[s]'.

In Tab. 2, the following two-level rule is applied to disambiguate Doctors:

idm/pers ; rel {MS:sp} -> pers ; rel

This rule means: 'At the WS level: a word ambiguous (/) between an idm and a pers, and which is followed (;) by a rel, which is also a sp at the MS level (between {}), must be rewritten (->) pers'.

Reliable and long-distance rules

Unlike most similar systems (see [6] and [7]), the output of both taggers (Tab. 2) may still be ambiguous if no relevant disambiguation rule can be applied. The basic idea is that we prefer not to choose rather than to choose the wrong tag! When ambiguities (for example, between idm and pers) remain unsolved, the extraction module finally considers the token as being an idm. In this case, some tokens may be improperly removed, but as seen in the evaluation part, the system behaves very well.

Finally, the n-gram contextual rules are completed by a set of long-distance rules (coded as finite-state automata). Such rules are necessary for mastering coordination dependencies. Plural idm (Drs, professors...) are expected to be followed by more than one identifier, therefore at least one coordination is expected. In such a case the ambiguous tag idm/pers is attributed to the next occurring coordination, and the usual rules are applied. As in the following example, idm/pers is attributed to the coordination and:

Doctors B. Billot of the digestive surgery unit and G. Simali of the oncology department

Extraction

The extraction module finally processes the 3-level stream returned by the tagging stage. The transition network switches on the extraction mode when it reads a token tagged id at the WS level, and switches it off when it reads a barrier, i.e. a token which is not tagged id. Some nodes of the RTN are specialized for handling names with particles starting with a lower-case letter (such as van for Dutch names, von for German names, and de for French names).

Dear Dr. Xxxxxxxx,
Your patient, Mr. X.-X. XXXXX, born xx.x.xxxx, stayed in our service, from xx/xx/xx to xx/xx/xx.
DIAGNOSIS: Acute pancreatitis of unknown origin.
Comorbidities: Gastritis. Hypertension. Esophagitis with backflow.
Mr. Xxxxx, 72 year-old, has been admitted in emergency, after [...]. However tests for the cytomegalovirus and the EBV were negative. Therefore we perform an abdominal CT-scan [...] Mr. Xxxxx will be followed in ambulatory by Drs. Xxxxx Xxxxxxx and Xxxxxx xxx Xxxxxxxxx [...].

Figure 2: Example of a complete anonymization

RESULTS

More than 40 rules (1-, 2-, and 3-level rules) were written to reach a 100% success rate on set A. It took around three weeks to write them all. In order to assess the performance of the anonymizer, we ran the engine on the assessment set (set B), and then manually checked the output. We counted six types of results:
a. Identifiers in the corpus: 467 (100%)
b. Identifiers correctly removed: 452 (96.8%)
c. Identifiers removed, with irrelevant tokens also removed: 8 (1.7%)
d. Identifiers incompletely removed: 3 (0.6%)
e. Identifiers left in the text: 4 (0.9%)
f. Tokens removed which are not identifiers: 0

The first interesting result concerns the scalability of the approach: by tuning the system on 20% of the corpus, we correctly address 96-97% of all cases. Although the target is clearly reached (98% < b + c < 99%, since b + c covers 460 of the 467 identifiers, i.e. 98.5%), we believe the system is perfectible. Thus, a spelling error is responsible for one of the four errors in e, and two other errors can be solved by adding one more idm to the lexicon (we forgot an idm, similar to MD). In c, five errors can be mastered by adding more words to the lexicon, as unknown tokens (tagged x) are considered as identifiers when they follow an identifier.

CONCLUSIONS AND FUTURE WORK

We have designed a system for removing identifiers in medical records, with a success rate of about 99%. Unlike other systems performing similarly, this system uses natural language processing tools. One detail must be mentioned concerning the replace operation: the traceability of the anonymization is not allowed in the system, i.e. P. Nertens and W. Keuster are both replaced by X. Xxxxxxx. This is an advantage considering security, as reverse 'scrubbing' is impossible, but traceability may be necessary for studies on genealogy. Although this functionality is part of the replace task, and therefore does not call into question the extraction task of the system, we believe that this field should be investigated in future work.

Tag        Freq. (%)  Example        Label (definition)
1-qual     10.1       fat            qualifier
2-acto     9.5        leave          general act
3-oc       9.3        liver          organ, body location
4-spat     8.7        high           spatial concept
5-temp     5.3        late           temporal concept
6-mod      5.1        maybe          modal
7-quant    4.7        five           quantitative concept
8-papr     4.5        infection      pathological process
9-find     4.2        fever          signs or symptoms
10-cpt     4.1        idea           other concept
11-ther    4.0        inject         therapeutic procedure
12-mdev    3.6        scissors       medical device
13-thers   3.6        abscission     surgery procedure
14-hca     2.1        care           physician's act
15-diap    1.9        biopsy         diagnosis procedure
16-rel     1.9        same           relationships (other)
17-medi    1.7        penicillin     drugs and chemicals
18-name    1.6        Spigel         name for medical techniques
19-dis     1.4        diabetes       disease or syndrome
20-bosp    1.4        vesicle        body space or junction
21-bopr    1.3        digestion      body process
22-rconj   1.2        with           conjunction relation
23-bosu    1.0        blood          body substance
24-obj     0.9        watch          general object
25-rspat   0.8        behind         spatial relation
26-actp    0.7        to suffer      patient's act
27-mpr     0.7        think          mental process
28-rtemp   0.7        during         temporal relation
29-spec    0.7        ophthalmology  medical speciality
30-occup   0.7        carpenter      occupation
31-neop    0.6        carcinoma      neoplastic process
idm        0.6        Dr.            identity markers
32-pers    0.5        professor      person
33-tiss    0.5        T024           tissue
34-labo    0.5        crasis         laboratory or test results
35-subst   0.4        water          substance (other)
36, 37, 38 << 0.1                    respectively aux, def, and indef, for auxiliary, definite and non-definite determiner
id         << 0.1     Louise         identity proper nouns

Table 3: Distribution of the semantic tagset in the lexicon.

Acknowledgments

The FNRS (Swiss National Foundation) funds the MEDTAG project.

References

1. Sweeney LA, 1996, Replacing Personally-Identifying Information in Medical Records, the Scrub System. AMIA Annual Fall Symposium 1996, JJ Cimino Ed., JAMIA, p. 343-347.
2. Ruch P, Wagner J, Bouillon P, Baud RH, Rassinoux A-M, Scherrer J-R, 1999, MEDTAG: Tag-like Semantics for Medical Document Indexing. AMIA Annual Fall Symposium 1999, NM Lorenzi Ed., JAMIA, p. 137-141.
3. Gross M, 1997, The Construction of Local Grammars, in Finite-State Language Processing, Roche E, Schabes Y (Eds), p. 329-354. MIT Press, Cambridge.
4. Black WJ, Gilardoni L, Dressel R, Rinaldi F, 1997, Integrated text categorization and information extraction using pattern matching and linguistic processing. Proceedings of the 5th RIAO conference, McGill University, Montreal, Canada, 25th-27th June 1997.
5. Gazdar G, Mellish C, 1989, Natural Language Processing in Prolog: An Introduction to Computational Linguistics, Chap. 3. Eds
6. Ruch P, Bouillon P, Baud R, Robert G, 2000, Minimal Commitment and Full Lexical Disambiguation: Balancing Rules and Hidden Markov Models. Proceedings of the 5th CoNLL Conference (ACL-SIGNLL), Lisbon, Portugal.
7. Silberztein M, 1997, The Lexical Analysis of Natural Languages, in Finite-State Language Processing, Roche E, Schabes Y (Eds), p. 176-203. MIT Press, Cambridge.
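Returning to the two disambiguation rules quoted under "Writing the rules" and to the policy of leaving a tag ambiguous when no rule applies, the following is a minimal sketch of how such rules might act on the 3-level stream. It is an illustration under stated assumptions, not the MEDTAG rule engine: the dictionary layout and the names `entry`, `disambiguate` and `treated_as_idm` are hypothetical, and rule ranking and the long-distance automata are left out.

```python
def entry(tok, ms, ws):
    """One element of the 3-level stream; 'a/b' means ambiguous between a and b."""
    return {"tok": tok, "ms": ms.split("/"), "ws": ws.split("/")}


def _next(stream, i):
    return stream[i + 1] if i + 1 < len(stream) else None


def ms_rule_miss(stream, i):
    """v/nc[s] {TOK:miss} ; np -> nc[s] ; np"""
    cur, nxt = stream[i], _next(stream, i)
    if (cur["ms"] == ["v", "nc[s]"] and cur["tok"].lower() == "miss"
            and nxt is not None and nxt["ms"] == ["np"]):
        cur["ms"] = ["nc[s]"]


def ws_rule_pers(stream, i):
    """idm/pers ; rel {MS:sp} -> pers ; rel"""
    cur, nxt = stream[i], _next(stream, i)
    if (cur["ws"] == ["idm", "pers"] and nxt is not None
            and nxt["ws"] == ["rel"] and nxt["ms"] == ["sp"]):
        cur["ws"] = ["pers"]


def disambiguate(stream):
    """Apply the MS rule first, then the WS rule; unmatched tags stay ambiguous."""
    for i in range(len(stream)):
        ms_rule_miss(stream, i)
    for i in range(len(stream)):
        ws_rule_pers(stream, i)
    return stream


def treated_as_idm(item):
    """An unresolved idm/pers ambiguity is handled as idm by the extractor."""
    return "idm" in item["ws"]


s = disambiguate([entry("Doctors", "nc[p]", "idm/pers"),
                  entry("of", "sp", "rel"),
                  entry("the", "det", "def")])
print(s[0]["ws"], treated_as_idm(s[0]))
```

In the example, Doctors followed by a preposition is rewritten to pers and is therefore not treated as a marker. When no rule fires, the idm/pers ambiguity is kept and the extractor falls back on idm, which errs on the side of removing too much rather than too little.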