AIML-HC Mod 04
NATURAL LANGUAGE PROCESSING IN HEALTHCARE
Introduction
• Natural Language Processing or NLP pursues a defined set of problems within AI.
• NLP refers to the ability of systems to analyze, understand, and generate
human language, including speech and text.
• NLP is an aspect of computational linguistics (studying linguistics using computer
science) and is useful in the following:
• The retrieval of structured and unstructured data within a dataset. For example,
searching clinical notes by keyword or phrase
• Social media monitoring
• Question answering: interpretation of natural language from humans to interact
appropriately; for instance, as with virtual assistants or speech recognition software.
• Analysis of a document to determine key findings
• Ability to parse and interpret a text to understand sentiment and mood
• Recognizing distinctions among diagnoses and relationships
• Image-to-text recognition; for instance, reading a sign or menu
• Machine translation: NLP is used in machine translation programs in which one
human language is automatically translated into another human language.
• Topic modelling—What is this document talking about?
• Understanding sentiment from social media or discussion posts
• Interpreting natural language is fraught with challenges, as human language is
naturally ambiguous in its vocabulary, pronunciation, expression, and perception.
• Although there are rules with human language, they are often misunderstood and
misused. NLP takes into consideration the structure of language to derive meaning.
• Words make phrases; phrases make sentences; sentences make documents; and all
of the aforementioned convey ideas.
• NLP has a tool kit of text processing procedures including a range of data mining
methods that can be used for model development.
• Due to the nature of unstructured data, NLP tasks can be expensive in terms of
computational resources and time.
• Neural networks and deep learning can also be used for NLP tasks.
• With most data generated existing in the form of unstructured data, NLP is a powerful tool
to interpret and understand natural language.
• As with any aspect of computing, there are several terms to understand before proceeding:
• Tokenization: the process of converting a corpus of text into smaller units, or tokens; there
are many algorithms available for breaking a text into tokens
• Tokens: words or entities present in the text
• Text object: a sentence, phrase, word, or article
• Stemming: a basic rule-based process of stripping suffixes (“ing,” “ly,” “es,” “s,” etc.)
from words
• Stem: the text created after stemming
• Lemmatization: determining the root of a word from dictionaries and morphological
analysis
• Morpheme: unit of meaning in a language
• Syntax: arranging symbols (words) to make a sentence; it involves determining the
structural role of words in the sentence and phrases
• Semantics: meaning of words and how to join words into meaningful phrases
and sentences
• Pragmatics: using and understanding sentences in different situations and how
interpretations are affected.
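• To make the terms above concrete, the following is a minimal sketch of tokenization, stemming, and lemmatization using NLTK; it assumes the library and its tokenizer/WordNet data have been installed, and the sample sentence is illustrative only.

# Tokenization, stemming, and lemmatization with NLTK.
# Assumes: pip install nltk, plus nltk.download('punkt') and nltk.download('wordnet').
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The patients were coughing and complained of worsening breathlessness."

tokens = word_tokenize(text)                                  # corpus -> tokens
stems = [PorterStemmer().stem(t) for t in tokens]             # rule-based suffix stripping
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]   # dictionary-based root form

print(tokens)
print(stems)
print(lemmas)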
• In this section, I aim to introduce key concepts and methods of NLP and to
demonstrate the techniques that can be applied to datasets to generate defined value.
• Natural Language Processing (NLP) has numerous important applications in
the medical field. Here's a brief overview of some key areas:
1. Clinical documentation:
• Automating medical transcription
• Extracting relevant information from clinical notes
2. Information retrieval:
• Searching medical literature databases more effectively
• Finding relevant patient information in electronic health records
3. Clinical decision support:
• Analyzing patient data to suggest diagnoses or treatments
• Identifying potential drug interactions or contraindications
4. Patient communication:
• Chatbots for initial patient triage or answering common questions
• Analyzing patient-reported outcomes from surveys or social media
5. Medical coding:
• Automatically assigning diagnostic and procedure codes
• Improving billing accuracy and efficiency
6. Predictive analytics:
• Identifying patients at risk for certain conditions
• Predicting disease progression or treatment outcomes
7. Medical education:
• Creating interactive learning tools for medical students
• Analyzing and providing feedback on case presentations
Getting Started with NLP
• Giving an NLP model an input sentence and receiving a useful output requires
several key components
Preprocessing: Lexical Analysis
• As with any dataset, text in a corpus that is not relevant to the context of the data can be
understood as noise.
• The first stage in NLP is to clean and standardize the input text, ensuring it is noise
free and ready for analysis.
• Over and above spelling correction and grammar correction, the following
techniques are used to reduce noise.
Noise Removal
• Noise removal involves preparing a dictionary of noisy tokens (i.e., words) and
parsing the text, removing the tokens found in the noise dictionary.
• For instance, words like the, a, of, this, that, and so forth, would be removed.
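• As a sketch of this idea, the snippet below uses NLTK's English stopword list as the noise dictionary; it assumes nltk and its 'stopwords' and 'punkt' data are installed, and the input sentence is illustrative.

# Dictionary-based noise removal: drop tokens found in a prepared noise dictionary.
# Assumes nltk.download('stopwords') and nltk.download('punkt') have been run.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

noise_dictionary = set(stopwords.words("english"))

def remove_noise(text):
    tokens = word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in noise_dictionary]

print(remove_noise("The patient was seen in the clinic for a review of this rash."))
# e.g. ['patient', 'seen', 'clinic', 'review', 'rash']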
Lexicon Normalization
• Follow, following, followed, follower are all variations of the word follow.
• Contextually, the words are similar. Lexicon normalization reduces dimensionality
through stemming, which strips suffixes and prefixes, and lemmatization, which is a
defined procedure that uses word structure and grammar relationships.
Porter Stemmer
• The Porter stemmer algorithm is a popular and useful method to improve the
effectiveness of information retrieval.
• The algorithm works on the principle that many words in English share a common
root.
• Stemming works on suffixes and removes common morphological and inflectional
endings from words.
• Therefore, stemming allows one to reduce similar words into a common root form.
• For example, take the text, “I felt troubled by the fact that my best friend was in
trouble. Not only that, but the issues I had dealt with yesterday were still troubling
me.”
• The words troubled, trouble, and troubling all share the common root trouble.
• Therefore, according to the Porter stemmer algorithm, instead of counting each of the
three words once, the common stem is counted three times.
• The benefit of stemming is that common words can be clustered under a common
stem to provide a more accurate statistical representation of the number of
occurrences of a certain word.
• However, a drawback of stemming is that the semantic meaning of the word may be
lost.
• Stemming and lemmatization are used to reduce inflectional forms and sometimes
derivationally related forms of a word to a common base form.
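• The following is a minimal sketch of the Porter stemmer applied to the example text above, using NLTK's implementation; note that the actual stem produced is the truncated form "troubl" rather than the dictionary word "trouble".

# Count how often the common stem appears in the example text.
# Assumes nltk is installed and nltk.download('punkt') has been run.
from collections import Counter
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = ("I felt troubled by the fact that my best friend was in trouble. "
        "Not only that, but the issues I had dealt with yesterday were still troubling me.")

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in word_tokenize(text.lower()) if t.isalpha()]

print(Counter(stems)["troubl"])   # troubled, trouble and troubling all reduce to "troubl" -> 3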
Object Standardization
• A corpus of text may contain words that cannot be found in lexicon dictionaries.
• For example, on Twitter, someone may mention DM’ing someone, or another may
like an RT of someone else’s tweet.
• Acronyms, hashtags, slang, and colloquialisms can be removed through prepared
dictionaries or using regular expressions.
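• A minimal sketch of object standardization is shown below, using a hand-prepared lookup dictionary and a regular expression; the abbreviations and their expansions are illustrative assumptions rather than a standard resource.

# Expand acronyms/slang via a prepared dictionary; strip simple punctuation with a regex.
import re

lookup = {"rt": "retweet", "dm": "direct message", "pt": "patient", "hx": "history"}

def standardize(text):
    words = []
    for word in text.split():
        cleaned = re.sub(r"[^a-z0-9]", "", word.lower())   # remove hashtags, punctuation
        words.append(lookup.get(cleaned, word))
    return " ".join(words)

print(standardize("Pt has a hx of asthma"))
# -> 'patient has a history of asthma'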
Syntactic Analysis
• To be analyzed, the text needs to be converted into features.
• The syntactic analysis of text involves analyzing sentences to understand
relationships between words and assigning a syntactic structure to it.
• There are several algorithms for syntactic analysis; however, context-free grammar is the
most widely used because it is the simplest style of grammar.
• Take the sentence, “David saw a patient with uncontrolled type 2 diabetes.”
• Within the sentence, we need to identify the subject, objects, noise, and attributes to
understand the sequence of words and its dependencies.
Dependency Parsing
• Sentences are composed of words in a structure.
• Basic grammar can determine the relationships, or dependencies, between words in a
structure.
• Dependency parsing represents these relationships in a tree structure that captures the
grammar and arrangement of words.
• Dependency grammar analyses asymmetrical binary relationships between tokens.
• The Stanford parser from NLTK is commonly used for this purpose.
• For the example sentence above, the parse tree identifies "saw" as the root, which is
then linked to subtrees.
• The subtrees are split by subject and object, with each subtree also showing
dependencies.
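• As an illustrative sketch (using spaCy rather than the Stanford parser mentioned above, and assuming the en_core_web_sm model has been downloaded), dependency parsing of the example sentence looks like this:

# Print each token's dependency relation and its head word.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("David saw a patient with uncontrolled type 2 diabetes.")

for token in doc:
    print(f"{token.text:<12} {token.dep_:<10} head: {token.head.text}")
# "saw" is the root; "David" attaches as its subject and "patient" as its object.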
Part of Speech Tagging
• Part of speech tagging involves associating each word or token in a sentence with a
part of speech (POS) tag.
• These tags are the basic English labels that you learned in primary school and
determine nouns, verbs, adjectives, adverbs, numbers, and so on.
• This is particularly useful for tasks such as building parse trees, which can, in turn,
be used for determining what something is, sentiment analysis, determining
appropriate answers to questions, or understanding similar entities.
Reducing Ambiguity
• Some sentences have multiple meanings given the structure; for example, take the
two sentences: “I managed to read my book on the train.” “Can you please book my
train tickets?”
• Part of speech tagging identifies “book” as a noun in the first sentence and as a verb
in the second sentence.
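• A minimal sketch of this disambiguation with NLTK's off-the-shelf tagger (assuming the 'punkt' and 'averaged_perceptron_tagger' data have been downloaded) is shown below; the exact tags depend on the tagger model.

# Tag both sentences and inspect the label assigned to "book".
from nltk import pos_tag, word_tokenize

s1 = "I managed to read my book on the train."
s2 = "Can you please book my train tickets?"

print(pos_tag(word_tokenize(s1)))   # "book" typically tagged as a noun (NN)
print(pos_tag(word_tokenize(s2)))   # "book" typically tagged as a verb (VB)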
Identifying Features
• By identifying a word's part of speech alongside its different contexts, POS tagging
can distinguish between uses and create stronger features.
Normalization
• POS tags are the foundation of normalization and lemmatization, helping to understand
sentence structure and dependencies.
Stopword Removal
• POS is useful in removing commonly used words, or stopwords, from a text.
Semantic analysis
• Semantic analysis is the most complex phase of NLP.
• It draws the exact, or dictionary, meaning from the text.
• Using knowledge about the structure of words and sentences together with their context,
the meaning of words, phrases, sentences, and texts is established, and subsequently,
so are their purpose and consequences.
Techniques Used Within NLP
• Once data preprocessing, lexical, syntactical, and semantic analysis of corpus has
taken place, we are required to transform text into mathematical representations for
evaluation, comparison, and retrieval.
• For instance, searching a collection of patient profiles for users with “hypertension”
should return only those with hypertension.
• This is achieved through transforming documents into the vector space model with
scoring and term weighting essential for query ranking and search retrieval.
• Documents can be in the form of patient records, web pages, digitalized books, and
so forth.
• The following algorithms are typical in the comparison and evaluation process.
N-grams
• N-grams are used in many NLP problems. If X = the number of words in a given
sentence K, the number of n-grams of size n for sentence K would be:
• Ngrams(K, n) = X - (n - 1)
TF-IDF (Term Frequency–Inverse Document Frequency)
• Each term’s weight (Wt,d) is calculated by multiplying the frequency of the term
(ft,d) by the log of the total number of documents (D) divided by the number of
documents the term occurs at least once in (NDt):
• Wt,d = ft,d × log(D / NDt)
• The ordering of the terms may not necessarily be maintained.
• Weights for the terms gathered can then be used to determine documents with high
frequencies of the specific term within them.
• A collection of TF-IDF vectors could be used to represent a user’s interest.
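• The sketch below illustrates n-gram extraction and TF-IDF weighting with scikit-learn on a toy corpus; note that scikit-learn uses a smoothed variant of the weighting formula above, and the documents are illustrative only.

# Turn a toy corpus of "documents" into TF-IDF vectors over unigrams and bigrams.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "patient with hypertension and diabetes",
    "patient with asthma",
    "hypertension follow-up visit",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # n-grams for n = 1 and 2
X = vectorizer.fit_transform(docs)                 # one weighted vector per document

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))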
Latent Semantic Analysis
• Where there is a corpus of a large number of documents, for each document d, the
dimension of the vector representing each document can typically exceed several
thousand.
• Latent semantic analysis relies on the fact that intuitively, terms in documents may
often be related.
• For example, if document d contains the term sea, it will often contain the word
beach.
• Equivalently, if the vector representing d has a non-zero component in the entry for
sea, it will often also have a non-zero component in the entry for beach.
• If this kind of structure can be detected, relationships between words can be
automatically learned from the data.
• The word–document relationship is represented by a matrix A, which is decomposed
using singular value decomposition (SVD) to give the strength of the most significant
correlations and their directions.
• The decomposition of A allows one to discover the semantics of the document
through correlations between terms and their significance within the document.
• The latent semantic analysis method can be applied to determine the context of a
variety of materials, presenting a web user with results that are contextually
relevant.
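• The following sketch applies latent semantic analysis with scikit-learn: a TF-IDF term-document matrix is built and decomposed with truncated SVD; the corpus and the number of components are illustrative assumptions.

# Build a term-document matrix and project documents onto two latent components.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the sea was calm near the beach",
    "children played on the beach by the sea",
    "the clinic reviewed the patient's blood pressure",
    "blood pressure control in the hypertension clinic",
]

X = TfidfVectorizer().fit_transform(docs)          # rows = documents, columns = terms
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)                  # documents in the latent "topic" space

print(doc_topics.round(2))   # sea/beach documents tend to load on one component, clinical ones on the other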
Cosine Similarity
• The cosine similarity is a metric that measures the similarity between two vectors.
• Therefore, this could either be used to define the similarity between two documents, or a
document and a query.
• The cosine similarity between two vectors A and B can be defined as the following:
• sim(A, B) = (A · B) / (|A| × |B|)
• That is, the similarity between two vectors is calculated as the inner product of the vectors,
divided by the product of the lengths of the vectors in question.
• Intuitively, the greater the angle between two documents, the less similar they are.
• This holds for vectors in any N-dimensional space.
• Furthermore, the TF-IDF vector scheme can be integrated with the cosine similarity metric
to determine the similarity between a search query and a plethora of documents:
• Sim(q, d) = Σt (Wt,q × Wt,d) / (|q| × |d|)
• The similarity between a query q and document d is calculated as the product of the
TF-IDF weights for each term in both the document and the query, summed over all the
terms in the query and document; this is divided by the product of the length of
the document and the length of the query.
• Practically, however, calculating Sim(q,d) would prove computationally expensive
as the number of documents used grows.
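• A minimal sketch of the basic definition, computed directly with NumPy over two illustrative weight vectors:

# Cosine similarity: dot product of the vectors divided by the product of their norms.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.0, 0.7, 0.7])   # illustrative TF-IDF weights for a query
doc = np.array([0.2, 0.6, 0.5])     # illustrative TF-IDF weights for a document

print(round(cosine_similarity(query, doc), 3))   # values near 1 indicate high similarity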
Naive Bayesian Classifier
• The naive Bayesian classifier is based on Bayes’ theorem and is particularly suited
to problems where the dimensionality of the inputs is high.
• Despite its simplicity, Naïve Bayes can often outperform more sophisticated
classification methods.
• Given a specified threshold, this method can be used to classify the probability of
whether a vector representing a document is of interest to a user.
• Given an attribute (document) d, we can calculate the probability that the example belongs
in class C with the following formula:
• P(C | d) = P(d | C) × P(C) / P(d)
• Other techniques such as kNN and ANN can also be used to classify and retrieve
information.
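• As a sketch, the snippet below trains a multinomial naive Bayes classifier over TF-IDF features with scikit-learn; the training documents and labels are illustrative assumptions.

# Classify short clinical snippets into illustrative categories with naive Bayes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "chest pain and shortness of breath",
    "elevated blood pressure at review",
    "routine vaccination appointment",
    "flu jab and travel vaccines",
]
train_labels = ["cardiology", "cardiology", "immunisation", "immunisation"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)

model = MultinomialNB().fit(X, train_labels)
print(model.predict(vectorizer.transform(["patient reports chest pain"])))   # expected: ['cardiology']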
Genetic Algorithms
• Genetic algorithms (GA) are a fascinating topic within machine learning.
• GA take inspiration from evolution to minimize the error rate, attempting to mimic
the function of chromosomes much like neural networks attempt to mimic the
human brain.
• Evolution is considered the optimal learning algorithm.
• In machine learning, the application of this is in models whereby several candidate
answers (referred to as chromosomes or genotypes) are produced, and the cost
function applied to all.
• In GA, a fitness function is defined that determines whether the chromosomes are fit
enough to mate.
• Chromosomes furthest away from the optimal outcome are removed.
• Chromosomes are also subject to mutation.
• GA are a type of search and optimization learner and apply to discrete and
continuous problems.
• Chromosomes that are close to the optimal solution may be combined.
• The combination or mating of chromosomes is known as a crossover.
• The survival of the fittest approach identifies chromosomes that aim to express
characteristics that adhere to natural selection—where the offspring is more optimal
than the parent.
• Mutation helps to overcome overfitting.
• It is a random process to get over local optima and find the global optimum.
• Mutation helps to ensure that child chromosomes are different from the parents’ and
continue evolution.
• The degree to which chromosomes mutate and mate is governed by parameters that can be
controlled or left to the model to learn.
• GA have varied applications:
• Detection of blood vessels in ophthalmology imaging
• Detecting the structure of RNA
• Financial modeling
• Routing vehicles
• A group of chromosomes is referred to as a population.
• Although it stays at a defined, constant size, it usually evolves to better average
predictions over the course of generations or time.
• The evaluation of a chromosome, c, is calculated as its evaluation function value
divided by the average evaluation of the generation, represented as the following:
• Fitness(c) = Eval(c) / AvgEval(generation)
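• A minimal sketch of a genetic algorithm is shown below, using a toy fitness function (maximizing the number of 1s in a bit string); the population size, mutation rate, and other parameters are illustrative assumptions.

# Toy genetic algorithm: selection, crossover (mating), and mutation over bit strings.
import random

GENES, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 40, 0.02

def fitness(chromosome):
    return sum(chromosome)                      # toy objective: count of 1s

def crossover(a, b):
    point = random.randint(1, GENES - 1)        # single-point crossover (mating)
    return a[:point] + b[point:]

def mutate(chromosome):
    return [1 - g if random.random() < MUTATION_RATE else g for g in chromosome]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]     # remove chromosomes furthest from the optimum
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children

print(max(fitness(c) for c in population))      # best fitness after evolution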
Low-Level NLP Components in Healthcare AIML
• 1. Text Preprocessing Components
• 1.1 Medical Text Normalization
• Standardization of medical abbreviations (e.g., "pt" → "patient")
• Unit conversion (metric vs imperial)
• Date/time normalization for clinical events
• Handling of medical symbols (↑, ↓, Δ, ±)
• Numerical value standardization
• 1.2 Clinical Text Cleaning
• Removal of PHI (Personal Health Information)
• Handling of medical punctuation
• Special character processing
• Whitespace normalization
• Noise removal specific to EMR/EHR systems
• 1.3 Tokenization for Medical Text
• Medical compound word handling
• Drug name tokenization
• Chemical formula processing
• Medical measurement tokenization
• Clinical abbreviation handling
• 2. Lexical Analysis Components
• 2.1 Medical Named Entity Recognition (NER)
• Disease names identification
• Drug name recognition
• Anatomical term detection
• Medical procedure identification
• Laboratory test recognition
• Vital sign extraction
• 2.2 Medical Vocabulary Processing
• UMLS (Unified Medical Language System) integration
• SNOMED CT terminology processing
• ICD-10 code mapping
• RxNorm drug terminology
• LOINC laboratory codes
• 2.3 Part-of-Speech Tagging
• Medical term POS tagging
• Clinical narrative parsing
• Temporal expression tagging
• Numerical expression handling
• Medical modifier identification
• 3. Syntactic Analysis Components
• 3.1 Medical Dependency Parsing
• Clinical relationship extraction
• Symptom-disease relationships
• Drug-disease relationships
• Treatment-outcome relationships
• Temporal relationship parsing
• 3.2 Medical Grammar Rules
• Clinical narrative structure
• Medical documentation patterns
• Progress note formatting
• Laboratory report structure
• Prescription syntax
• 3.3 Phrase Chunking
• Medical phrase identification
• Treatment regimen extraction
• Dosage instruction parsing
• Clinical finding grouping
• Temporal phrase recognition
• 4. Semantic Analysis Components
• 4.1 Medical Concept Extraction
• Disease concept mapping
• Treatment concept identification
• Diagnostic concept recognition
• Medication concept extraction
• Procedure concept mapping
• 4.2 Clinical Relation Extraction
• Symptom-disease associations
• Drug-drug interactions
• Treatment-outcome relationships
• Risk factor associations
• Contraindication identification
• 4.3 Medical Ontology Mapping
• UMLS concept mapping
• SNOMED CT hierarchy navigation
• ICD-10 code assignment
• RxNorm terminology mapping
• LOINC code identification
• 5. Healthcare-Specific Features
• 5.1 Temporal Processing
• Clinical timeline extraction
• Treatment duration analysis
• Follow-up scheduling
• Disease progression tracking
• Medication timing analysis
• 5.2 Negation Detection
• Clinical negation patterns
• Absence of symptoms
• Rule-out diagnoses
• Medication discontinuation
• Negative test results
• 5.3 Uncertainty Analysis
• Diagnostic uncertainty
• Treatment response probability
• Risk assessment
• Prognosis uncertainty
• Decision confidence levels
• 6. Domain-Specific Processing
• 6.1 Specialty-Specific Components
• Radiology report processing
• Pathology report analysis
• Surgical note parsing
• Mental health narrative analysis
• Emergency department note processing
• 6.2 Document Type Processing
• Admission notes
• Progress notes
• Discharge summaries
• Consultation reports
• Laboratory reports
• 6.3 Clinical Context Analysis
• Patient history context
• Treatment context
• Diagnostic context
• Follow-up context
• Emergency vs. routine care
• 7. Output Processing
• 7.1 Structured Data Generation
• FHIR format conversion
• HL7 message generation
• Clinical database formatting
• EMR/EHR integration
• Research database formatting
• 7.2 Clinical Report Generation
• Summary generation
• Alert generation
• Recommendation formatting
• Decision support output
• Patient education materials
• 8. Performance Optimization
• 8.1 Processing Efficiency
• Batch processing optimization
• Real-time processing capabilities
• Memory management
• CPU utilization optimization
• Pipeline parallelization
• 8.2 Accuracy Improvements
• Error detection mechanisms
• Confidence scoring
• Validation rules
• Quality assurance checks
• Performance monitoring
• 9. Integration Components
• 9.1 API Integration
• FHIR API compatibility
• HL7 interface
• EMR/EHR integration
• Laboratory system integration
• Pharmacy system integration
• 9.2 Security Components
• PHI protection
• HIPAA compliance
• Access control
• Audit logging
• Data encryption
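• To illustrate one of these low-level components (5.2 Negation Detection), the sketch below applies simple rule-based negation in the spirit of the NegEx approach; the trigger phrases and word window are illustrative assumptions, not a validated clinical tool.

# Rule-based clinical negation detection with a small set of trigger phrases.
import re

NEGATION_TRIGGERS = ["no", "denies", "without", "negative for", "rules out"]

def is_negated(sentence, concept):
    """Return True if `concept` appears within a few words of a negation trigger."""
    sentence = sentence.lower()
    for trigger in NEGATION_TRIGGERS:
        # trigger followed by up to four intervening words, then the concept
        pattern = rf"\b{re.escape(trigger)}\b(\s+\w+){{0,4}}\s+{re.escape(concept)}"
        if re.search(pattern, sentence):
            return True
    return False

print(is_negated("The patient denies chest pain.", "chest pain"))    # True
print(is_negated("The patient reports chest pain.", "chest pain"))   # False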
High-Level NLP Components in Healthcare AIML
• 1. Clinical Text Understanding Systems
• 1.1 Document Classification
• Clinical note categorization
• Medical specialty identification
• Emergency vs. routine documentation
• Research document classification
• Administrative document sorting
• Triage note classification
• 1.2 Information Extraction
• Key clinical finding extraction
• Diagnosis identification
• Treatment plan extraction
• Medication regimen analysis
• Patient history summarization
• Risk factor identification
• 1.3 Clinical Summarization
• Patient encounter summaries
• Medical history compilation
• Treatment progress summary
• Longitudinal care overview
• Multi-document synthesis
• Discharge summary generation
• 2. Advanced Analytics Components
• 2.1 Clinical Decision Support
• Diagnosis suggestion systems
• Treatment recommendation engines
• Drug interaction analysis
• Clinical pathway optimization
• Risk assessment models
• Outcome prediction systems
• 2.2 Predictive Analytics
• Disease progression prediction
• Readmission risk assessment
• Treatment response prediction
• Complications forecasting
• Resource utilization prediction
• Patient outcome modeling
• 2.3 Population Health Analytics
• Epidemiological trend analysis
• Disease outbreak detection
• Healthcare utilization patterns
• Public health monitoring
• Demographic health analysis
• Resource allocation optimization
• 3. Knowledge Discovery Systems
• 3.1 Medical Knowledge Base Construction
• Clinical guideline extraction
• Treatment protocol mining
• Disease-symptom relationship mapping
• Drug-interaction database building
• Medical literature synthesis
• Best practice identification
• 3.2 Clinical Research Support
• Literature review automation
• Clinical trial matching
• Research hypothesis generation
• Evidence synthesis
• Systematic review assistance
• Meta-analysis support
• 3.3 Knowledge Graph Generation
• Medical entity relationship mapping
• Treatment pathway visualization
• Disease progression modeling
• Healthcare provider networks
• Patient journey mapping
• 4. Interactive Systems
• 4.1 Clinical Question Answering
• Medical query processing
• Evidence-based answering
• Clinical decision support
• Patient education systems
• Healthcare provider assistance
• Training and education support
• 4.2 Dialog Systems
• Patient intake systems
• Medical history collection
• Symptom assessment
• Follow-up monitoring
• MetaMap
• Feature: UMLS concept mapping
• Components: Lexical analysis, variant generation
• Use cases: Biomedical text annotation, concept identification
• MedSpaCy
• Feature: Clinical text processing
• Components: Clinical pipelines, custom extensions
• Use cases: Healthcare data extraction, medical entity recognition
• 1.2 General Purpose NLP Libraries with Clinical Extensions
• NLTK Medical
• Feature: Medical text processing extensions
• Components: Medical tokenizers, specialized taggers
• Use cases: Basic clinical text processing
• SNOMED CT
• Content: Clinical healthcare terminology
• Access: National Release Center
• Coverage: Diagnoses, procedures, findings
• RxNorm
• Content: Clinical drug terminology
• Access: Through NLM APIs
• Updates: Monthly releases
• 2.2 Classification Systems
• ICD-10
• Content: Disease classification
• Access: WHO platform
• Versions: Country-specific modifications
• LOINC
• Content: Laboratory test codes
• Access: Regenstrief Institute
• Updates: Semi-annual releases
• RxNorm API
• Features: Drug information
• Access: Web services
• Coverage: Medication data
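• As a sketch of programmatic access to these resources, the snippet below queries the public RxNorm web services for a drug identifier; the endpoint URL and the response fields used here are assumptions to verify against the current RxNav API documentation.

# Look up the RxCUI for a medication name via the public RxNav REST service.
import requests

response = requests.get(
    "https://rxnav.nlm.nih.gov/REST/rxcui.json",
    params={"name": "aspirin"},
    timeout=10,
)
data = response.json()
# The identifier is expected under idGroup -> rxnormId (field names assumed).
print(data.get("idGroup", {}).get("rxnormId"))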