HCI - Notes-Ch1-2
Infrastructures
Degree of Biomedical Engineering
Universitat Rovira i Virgili
Hatem A. Rashwan
In Loving Memory
and Gratitude
• Before we delve into today's
lecture, I would like to take
a moment to remember and
express our heartfelt
gratitude to the esteemed
professor who previously
taught this course.
• Unfortunately, Dr. David
Riaño is no longer with us,
but their influence and
dedication continue to
inspire us.
Healthcare data management is critical for improving
patient care, patient safety, operational efficiency, research,
population health management, and evidence-based
decision-making for healthcare organizations.
Presentation
• Computer Science (CS) and Artificial
Intelligence (AI) technologies are being
incorporated in health care centers to
improve the treatment of patients. Database
and knowledge-base technologies are at
the core of this revolution.
• In this course, we will introduce some of
the most relevant standards for clinical
data codification, data structuring,
electronic health records, and semantic
annotation. Moreover, the course will
introduce notions of data analysis,
knowledge representation and
management in medicine, and will
present decision support systems and
artificial intelligence (AI) techniques in
medicine.
Contents
1. Introduction
2. Clinical Data
– Data Sources
– Types of Data, Variables, and Transformations
– Data Standards: Coding Systems and Terminologies
– Patient Records
– Interoperability
3. Clinical Data Analysis
– A Data Science Project
– Data Pre-Processing
– Statistical Clinical Data Analysis
– Artificial Intelligence Clinical Data Analysis
– Quality of the Analysis
4. Clinical Knowledge
– Knowledge Representation: An Introduction
– K. Representation: FOL
– K. Representation: Production Rules
– K. Representation: Objects
– K. Representation: Ontologies
– Knowledge Life Cycle
5. Clinical Decision Support Systems
– Differential Diagnosis Generators
– Drug Interaction Checkers
– Alarm and Surveillance Systems
6. Ethics and Security Issues
– Spanish Legislation on health records
– Code of ethics of Health Inf. Systems
– General Data Protection Regulation
7. Conclusions
1. Introduction
• Health Care and Health Care Systems
Def (Health Care, HC): Efforts made to maintain or restore physical, mental, or
emotional well-being, especially by trained and licensed professionals.
Def (Health Care System): The organization of people, institutions, and resources
that deliver health care services to meet the health needs of target populations.
• Health Care Center
Def (Health Care Center, HCC): a building or establishment housing local medical
services or the practice of a group of doctors.
Computer Technology in Health Care
Def (Information System): an integrated set of components for
collecting, storing, and processing data and for providing
information, knowledge, and digital products.
• Data and Information
– Single Clinical Data Units: Textual vs. Codified
– Information Structures: Unstructured (Textual), Semi-structured (Forms), Structured
(EHR).
– Data uses
• Primary Use: Clinical Care
• Secondary Use: Research Units
– Clinical Data Analysis
• Statistical
• Machine Learning (Artificial Intelligence)
• Knowledge
– Medical Knowledge Representation
– Clinical Decision Support System (CDSS): health information technology system that is
designed to provide physicians and other health professionals with clinical decision
support (CDS), that is, assistance with clinical decision-making tasks.
[Figure: DATA (39) → meaning → INFORMATION (Temp = 39) → generalization → KNOWLEDGE (Temp > 38 => Fever)]
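The step from data to information to knowledge can be sketched in a few lines of Python. This is only an illustration: the names `Observation` and `has_fever` are assumptions introduced here, and the threshold of 38 is the one used in the slide's example rule (Temp > 38 => Fever).

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Information: a raw value (data) together with its clinical meaning."""
    variable: str   # what the number refers to, e.g. "Temp"
    value: float    # the raw datum, e.g. 39


def has_fever(obs: Observation, threshold: float = 38.0) -> bool:
    """Knowledge: the general rule Temp > 38 => Fever, applied to one case."""
    return obs.variable == "Temp" and obs.value > threshold


datum = 39                            # DATA: a bare number with no meaning
info = Observation("Temp", datum)     # INFORMATION: Temp = 39
fever = has_fever(info)               # KNOWLEDGE applied to the information
```

The bare number becomes useful only once it is attached to a variable, and the rule is what turns a single reading into a clinical statement.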
Simplified Global View
[Figure: Health Care Center (HC) computers; data (secondary use); Research Unit (Clinical Data Analysis), upper level.]
2. Clinical Data
• Data Sources in Health Care
• Data Uses in Health Care
• Types of Data
• Variables
• Data Transformations
• Wrong, Missing, and Censoring Data
• Big Data in Health Care
• Standards of Biomedical Data
• Patient Records
• Interoperability
• EHR Systems
Data Sources in Health Care
• Primary Source: data producers (where the data is produced)
– Visits and professional encounters
– Monitoring Equipment (e.g., ECG, Pulse Oximeter)
– Laboratories (e.g., blood analysis)
– Clinical Center Devices (e.g., X-Ray)
Primary Data Sources in Health Care
[Figure: primary data sources in health care, including insurance companies.]
Secondary Data Sources in Health Care
Def. (Health Care Record, HR): Structured and systematic documentation of one
single patient’s medical history and care over time.
Def (Clinical Data Repository, CDR): Real-time database that consolidates data from a
variety of clinical sources to present a unified view of a single patient. It is optimized
to allow clinicians to retrieve data for a single patient rather than to identify a
population of patients with common characteristics or to facilitate the management of
a specific clinical department.
– Clinical Data Warehouse (CDW) including the CDR and HR.
– Clinical Data Warehouse (CDW)
Topic | Clinical Data Repository | Clinical Data Warehouse
Specificity of the data | Detail-oriented, focused on the individual patient | Aggregated data (summarized to decision-making levels)
User’s data access | Read/write | Non-volatile, read-only access
Updates | Real-time from operational systems | Periodically (static) by operational systems
Data normalization | Normalized data, no redundant data | De-normalized and redundant data, often
Data contained | Integrated clinical data (only) | Integrated operational, clinical, and financial data
Data comes from | Clinical systems | Clinical, financial, and administrative systems
• Big Data: According to Oracle, big data is data that contains greater variety, arriving
in increasing volumes, with ever-higher velocity, and of questionable veracity.
These are known as the four Vs.
CDW Basic Architecture
Data Uses in Health Care
Types of Data: Clinical Sense
• According to the Health Sciences Library at the University of
Washington, clinical data falls into six major types:
– Health care record data: Data are obtained, for a single patient, at the point of care at a
medical facility, hospital, clinic or practice. It is generally not available to outside
researchers, and the data collected includes administrative and demographic
information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic
monitoring data, hospitalization, patient insurance, etc.
– Administrative data: Often associated with electronic health records, these data are
primarily hospital discharge information (summary) reported to a government agency.
– Claims data: Claims data describe the billable interactions (insurance claims) between
insured patients and the healthcare delivery system. Claims data falls into four general
categories: inpatient, outpatient, pharmacy, and enrollment.
– Patient-Disease registry data: These registries are clinical information systems that track
a narrow range of key data for certain chronic conditions such as Alzheimer's Disease,
cancer, diabetes, heart disease, and asthma. Registries often provide critical information
for managing patient conditions.
– Health surveys data: In order to provide an accurate evaluation of the population
health, national surveys of the most common chronic conditions are generally
conducted to provide prevalence estimates. These surveys are specific for research
purposes and policy decisions.
– Clinical trials data: Data on a subset of subjects for the purpose of testing a treatment
before it is introduced in a health care system.
Source: https://guides.lib.uw.edu/hsl/data/findclin
Data Types: A Complete View
[Figure: data types grouped by source: Patient, Professional, Center, Devices, Logistics, Agencies.]
Types of Data: Format
UNSTRUCTURED
– Images: X-ray, Computed Tomography (CT), Mammography, Positron-Emission Tomography (PET), Magnetic Resonance Imaging (MRI), Ultrasound, pathology images, …
– Charts: Electrocardiogram (ECG, EKG), Spirometry, Electroencephalogram (EEG), Glycemic index chart (GI)
STRUCTURED
– Values and Variables: Values can be obtained directly from the patient (e.g., temperature) or from the analysis of an
image/chart (e.g., cancer stage). Values are stored in variables.
Cumulative Data
• The clinical structured data sometimes contains
cumulative and non-cumulative data
• The difference between cumulative and non-
cumulative data is that
– cumulative data displays the total amount of
information that's been gathered over a period of
time,
– whereas non-cumulative data shows the amount
of information gathered only at a certain point in
time.
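As a small illustration of the difference, the sketch below turns a non-cumulative series into its cumulative counterpart. The daily admission counts are made-up numbers used only for the example.

```python
from itertools import accumulate

# Non-cumulative data: admissions recorded on each day (hypothetical values).
daily_admissions = [3, 5, 2, 4]

# Cumulative data: total admissions gathered up to and including each day.
cumulative_admissions = list(accumulate(daily_admissions))
```

Here `cumulative_admissions` is `[3, 8, 10, 14]`: each entry is the running total of the daily values up to that day.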
Cumulative Data
• Count(*): indicates a quantity. e.g., number of female patients.
• Ratio: a number divided by another number (e.g., body mass
index BMI = weight/height², in kg/m²)
• Proportion: a ratio of counts where the numerator is a subset
of the denominator (e.g., 30 patients out of 50 are depressed,
30/50 or 0.6) – Range [0.0, 1.0]
• Percentage: proportion expressed as a percentage (e.g., 0.6 is
expressed as 60%) – Range [0.0, 100.0]
• Risk: a proportion where the numerator counts events that
happen prospectively (e.g., 80 patients started the clinical trial
but only 50 remain, the risk of censoring is 30/80 or 37.5%).
• Rate: proportion that involves a time (e.g., an ICU has a
mortality rate of 10% if in one year it receives 1500 patients of
which 150 died).
(*) Be careful: absolute numbers (counts) are not related to the universe of work (the population they are drawn from). For example, knowing that 50 female
nurses smoke versus 3 male nurses does not necessarily mean that female nurses smoke more than male nurses.
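The measures above can be reproduced with the numbers used in the examples. This is a minimal sketch: the helper names (`proportion`, `percentage`, `bmi`) and the 70 kg / 1.75 m patient are assumptions made for illustration.

```python
def proportion(part: float, whole: float) -> float:
    """Ratio of counts where the numerator is a subset of the denominator."""
    return part / whole


def percentage(part: float, whole: float) -> float:
    """Proportion expressed on a 0 to 100 scale."""
    return 100.0 * part / whole


def bmi(weight_kg: float, height_m: float) -> float:
    """Ratio example: body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


depressed = proportion(30, 50)               # 30 of 50 patients
censoring_risk = percentage(80 - 50, 80)     # 80 started the trial, 50 remain
icu_mortality = percentage(150, 1500)        # deaths among patients in one year
example_bmi = bmi(70.0, 1.75)                # hypothetical patient
```

Note that the rate differs from the plain percentage only in its interpretation: the 10% is tied to the one-year observation period.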
Variables
• Clinical Variables according to their relationship:
– Independent: we can control or change (e.g., dosage)
– Dependent: variable that we measure (e.g., temperature)
• Types of variables according to their possible values (domain):
– Quantitative or Numeric: there’s a distance function
defined on all the pairs of values of the variable (e.g.,
temperature, heart rate).
• Continuous: “all” values are possible (e.g., temperature =
37.4957 °C)
• Discrete: “some” values are possible (Integer) (e.g., HR = 100 beats
per minute, but not 100.35 bpm)
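Changing Variable Types
[Figure: a variable is either Quantitative (Numeric) or Qualitative (Categorical: Binary or N-ary). Discretization converts quantitative into qualitative; Continuity converts qualitative into quantitative.]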
Discretization
BY BINNING
– Input: number of bins = n
– Output: groups g1, g2, …, gn, where gi contains the values x between mi and Mi
– Max and min are the greatest and lowest values in the list, S = (Max-min)/n is the size of each bin, and [mi, Mi]
= [min+(i-1)*S, min+i*S] is the i-th group.
BY FREQUENCY
– Input: number of bins = n
– Output: groups g1, g2, …, gn, each with approximately N/n of the N values
– Equal values go in the same group, which is why not all groups are necessarily of size N/n.
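Both discretization strategies can be sketched in plain Python. The function names are illustrative assumptions, and so is the treatment of boundaries: a value that falls exactly on a bin edge is placed in the upper bin, except Max, which stays in the last bin. The heart-rate values are made up.

```python
def bin_by_width(values, n):
    """Binning: n bins of equal size S = (Max - min) / n; returns group numbers 1..n."""
    lo, hi = min(values), max(values)
    size = (hi - lo) / n
    groups = []
    for x in values:
        # Guard against a zero-width range; clamp Max into the last bin.
        index = 0 if size == 0 else min(int((x - lo) / size), n - 1)
        groups.append(index + 1)
    return groups


def bin_by_frequency(values, n):
    """Frequency: about N/n values per group, keeping equal values together."""
    ordered = sorted(values)
    target = len(ordered) / n
    group_of = {}
    for rank, x in enumerate(ordered):
        if x not in group_of:  # a repeated value reuses the group of its first copy
            group_of[x] = min(int(rank / target), n - 1) + 1
    return [group_of[x] for x in values]


heart_rates = [60, 62, 64, 66, 68, 120]
by_width = bin_by_width(heart_rates, 2)
by_frequency = bin_by_frequency(heart_rates, 2)
with_ties = bin_by_frequency([60, 60, 60, 60, 70, 80], 2)
```

With the outlier 120, binning puts five of the six values in the first group, while the frequency method splits them three and three. The `with_ties` case shows why groups are not always of size N/n: the four equal values must stay together.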
Data Transformation II: Categorical to …
Continuity
• Binary to Continuous: One of the binary values (false) is converted to 0,
the other one (true) is converted to 1.
• n-Ary to Numeric: Conversion processes
– Unique Integers (careful: the integers impose an order and distances that the categories do not have): E.g., drugs acting on the cardiovascular system are
codified as numbers.
Antihypertensive (2), Diuretic (3), Peripheral vasodilators (4), Vasoprotectives (5), Beta blocking
agents (7), Calcium channel blockers (8), Agents acting on the renin–angiotensin system (9), Lipid
modifying agents (10), …
– Dummy Coding (without comparison group): E.g., the use of a drug d in a
treatment is marked with 1/0 in the corresponding new numeric column d.
d1, d2, d1, d1, d3, d4, d2, d4, d3 → 1000, 0100, 1000, 1000, 0010, 0001, 0100, 0001, 0010
– Dummy Coding (with comparison group g): E.g., the use of a drug d ≠ g in a
treatment is marked with 1/0 in the corresponding new numeric column d. Drug g
does not have a column.
d1, d2, d1, d1, d3, d4, d2, d4, d3 (comparison group d2) → 100, 000, 100, 100, 010, 001, 000, 001, 010
– Effect Coding (with comparison group g): E.g., the use of a drug d ≠ g in a treatment
is marked with 1/0 in the corresponding new numeric column d. Drug g does not
have a column; a treatment with drug g has all the other columns set to -1.
d1, d2, d1, d1, d3, d4, d2, d4, d3 (comparison group d2) → 100, -1-1-1, 100, 100, 010, 001, -1-1-1, 001,
010
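The three n-ary conversions can be written as small functions. This is a sketch under assumptions: the function names are invented here, columns follow the order of the `categories` list, and unique integers are assigned as 1..n rather than the arbitrary codes of the drug example.

```python
def unique_integers(values, categories):
    """Map each category to an integer 1..n (this imposes an artificial order)."""
    code = {c: i + 1 for i, c in enumerate(categories)}
    return [code[v] for v in values]


def dummy_coding(values, categories, comparison=None):
    """One 0/1 column per category; the comparison group, if given, has no column."""
    columns = [c for c in categories if c != comparison]
    return [[1 if v == c else 0 for c in columns] for v in values]


def effect_coding(values, categories, comparison):
    """Like dummy coding, but rows of the comparison group are all -1."""
    columns = [c for c in categories if c != comparison]
    return [
        [-1] * len(columns) if v == comparison
        else [1 if v == c else 0 for c in columns]
        for v in values
    ]


drugs = ["d1", "d2", "d1", "d1", "d3", "d4", "d2", "d4", "d3"]
cats = ["d1", "d2", "d3", "d4"]

as_integers = unique_integers(drugs, cats)
dummy_full = dummy_coding(drugs, cats)
dummy_ref = dummy_coding(drugs, cats, comparison="d2")
effect_ref = effect_coding(drugs, cats, comparison="d2")
```

With comparison group d2 the encoded rows have three columns (d1, d3, d4), so d2 appears as `[0, 0, 0]` under dummy coding and as `[-1, -1, -1]` under effect coding.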
Example: coding a categorical variable with values V1 to V4 (comparison group = V2)
Source: https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
Conversion tables
Value | Unique integers | Dummy coding (without comparison group) | Dummy coding (comparison group = V2) | Effect coding (comparison group = V2)
V1 | 1 | 1, 0, 0, 0 | 1, 0, 0 | 1, 0, 0
V2 | 2 | 0, 1, 0, 0 | 0, 0, 0 | -1, -1, -1
V3 | 3 | 0, 0, 1, 0 | 0, 1, 0 | 0, 1, 0
V4 | 4 | 0, 0, 0, 1 | 0, 0, 1 | 0, 0, 1
Categorical column (rows 1 to 11): V1, V2, V3, V1, V2, V1, V3, V2, V2, V3, V1
Numeric column with unique integers (rows 1 to 11): 1, 2, 3, 1, 2, 1, 3, 2, 2, 3, 1
Benefits of Discretization and Continuity
• … Discretization: The goal of discretization is to reduce the number of values a continuous variable assumes by
grouping them into a number of n intervals or bins.
• … Continuity: The goal of continuity is to convert a nominal variable into a numeric variable
by projecting it in a continuous space.
Wrong, Missing, and Censoring Data
• Dealing with wrong, missing, and censoring data:
1. Detect: not always possible
• Values out of range
• The average of the sample is different from epidemiological/population
studies
2. Correct: There are several alternatives, some of which are …
• Remove instances (high % of wrong/missing/censoring) (Complete Cases)
• Remove feature (high % of wrong/missing/censoring)
• Encoding: transform it to special values (e.g., -1, 99999, 0)
• Imputation (Replace): introduce the mean/median/mode value or
min/max
• Imputation (Predict): calculate value with a regression or k-nearest
neighbors (KNN) or Linear Interpolation (LI) algorithms
• Leave as NA (NaN or ?) and use algorithms that are capable of handling
missing values (e.g., XGBoost supports missing values by default).
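Some of the detection and correction alternatives listed above can be sketched with the standard library. The helper names and the plausibility range of 30 to 45 for body temperature are assumptions made for the example, and missing values are represented as `None`.

```python
from statistics import mean, median


def out_of_range(values, low, high):
    """Detect: flag values outside a plausible range (missing values are not flagged)."""
    return [v is not None and not (low <= v <= high) for v in values]


def complete_cases(rows):
    """Correct by removing instances: keep only rows without missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute(values, strategy=mean):
    """Correct by imputation (replace): fill gaps with the mean, median, etc."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


temps = [36.0, None, 38.0, 43.0]
flags = out_of_range([36.0, None, 38.0, 430.0], 30, 45)
mean_filled = impute(temps)
median_filled = impute(temps, median)
kept = complete_cases([{"temp": 36.0, "hr": 80}, {"temp": None, "hr": 95}])
```

Mean and median give different fills here (39.0 and 38.0) because the high value 43.0 pulls the mean upwards, which is one reason the choice of replacement value matters.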
Big Data in Health Care
Source: Minor L.B. Harnessing the Power of Data in Health. Stanford University School of Medicine, 2017.
The four Vs of Big Data
Biomedical data satisfies all four Vs.
Let’s put some order …
Standards of Biomedical Data
Codification, Terminologies, and Ontologies
Clinical Data Standards: Standards like ICD, SNOMED CT, LOINC, and CPT are
used to codify clinical information, diagnoses, procedures, and laboratory
tests, ensuring consistent terminology across healthcare systems.
What do we want to code?
[Figure: matrix of coding systems (ICD, DRG, ICPC, ICF, CPT, ATC, LOINC, SNOMED) versus the health care processes they code (biomedical/clinical processes and administrative processes).]
Why do we want to code?
1. Standardization of health, clinical vocabulary, and meaning
2. Facilitates communication across multiple languages
3. Improves reports’ precision
4. Normalization of data structures and cross reference
5. Helps reduce error, inaccuracy, and
misunderstanding
6. Billing and cost analysis
7. Secondary use of data
8. Facilitates automated applications and data exchange and
sharing
International Classification of Diseases (ICD)
Used abbreviations: NEC (Not Elsewhere Classifiable), NOS (Not Otherwise Specified).
Format notation: A: alphabetic (A, B, C, …); N: numeric (0, 1, 2, …); X: alphanumeric (A ∪ N).
ICD-9-CM
• http://icd9.chrisendres.com/
• http://www.icd9data.com/
• https://www.findacode.com/
• 17,000 codes
ICD-10
• http://www.icd9data.com/
• https://www.findacode.com/
• +155,000 codes
• ICD-10-CM - Diseases:
– 21 chapters
– Format: ANN.N[NNNA]
– Examples:
• K40.1 Bilateral inguinal hernia, with gangrene
• L20.81 - Atopic neurodermatitis
• M24.541 - Contracture, right hand
• M24.542 - Contracture, left hand
• S12.110A - Anterior displaced Type II dens fracture,
initial encounter for closed fracture
• ICD-10-PCS - Procedures:
– 17 chapters
– Format: XXXXXXX
– Examples:
• 0016070 Bypass Cerebral Ventricle to Nasopharynx
with Autologous Tissue Substitute, Open Approach
• 30243G0 - Transfusion of Autologous Bone Marrow
into Central Vein, Percutaneous Approach
• BV18ZZZ Fluoroscopy of Vasa Vasorum
ICD-11
• https://icd.who.int/browse11/l-m/en
• https://www.findacode.com/
• Diseases:
– 28 chapters
– Format: XXXX.X[X]
– Examples:
• 1C12.0 Whooping cough due to Bordetella
pertussis
• KB00.0 Perinatal arterial stroke
• HA01.10 Male erectile dysfunction, lifelong,
generalized
• LA85.20 Double outlet right ventricle with
subpulmonary ventricular septal defect,
transposition type
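The code formats given in the slides (ANN.N[NNNA] for ICD-10-CM, XXXXXXX for ICD-10-PCS, XXXX.X[X] for ICD-11) can be checked with regular expressions. The patterns below are a simplified reading of that notation, not the official validation rules of any of the classifications; in particular, the bracket in ANN.N[NNNA] is assumed to mean up to two further digits and an optional trailing letter.

```python
import re

# Simplified patterns derived from the slide notation (A: letter, N: digit, X: alphanumeric).
ICD10_CM = re.compile(r"[A-Z]\d{2}\.\d{1,3}[A-Z]?")      # ANN.N[NNNA]
ICD10_PCS = re.compile(r"[0-9A-Z]{7}")                   # XXXXXXX
ICD11 = re.compile(r"[0-9A-Z]{4}\.[0-9A-Z]{1,2}")        # XXXX.X[X]


def matches(pattern, code):
    """True when the whole code fits the pattern."""
    return pattern.fullmatch(code) is not None


icd10_cm_ok = all(matches(ICD10_CM, c) for c in ["K40.1", "L20.81", "M24.541", "S12.110A"])
icd10_pcs_ok = all(matches(ICD10_PCS, c) for c in ["0016070", "30243G0", "BV18ZZZ"])
icd11_ok = all(matches(ICD11, c) for c in ["1C12.0", "KB00.0", "HA01.10", "LA85.20"])
```

All the example codes from the slides fit their own pattern, while an ICD-10-CM code such as K40.1 does not fit the ICD-11 pattern because it has only three characters before the dot.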
Overall comparison of ICD9 and ICD10
• Extension: Issues today with the ICD-9 diagnosis
and procedure code sets are addressed in ICD-10.
• More detailed Clinical information (laterality,
temporality, etc.): One concern today with ICD-9 is
the lack of specificity of the information conveyed
in the codes. For example, if a patient is seen for
treatment of a burn on the right arm, the ICD-9
diagnosis code does not distinguish that the burn is
on the right arm. If the patient is seen a few weeks
later for another burn on the left arm, the same
ICD-9 diagnosis code would be reported. Additional
documentation would likely be required for a claim
for the treatment to explain that the burn treated
at this time is a different burn from the one that
was treated previously. In the ICD-10 diagnosis code
set, characters in the code identify right versus left,
initial encounter versus subsequent encounter, and
other clinical information.
• Resizing: Another issue with ICD-9 is that some
chapters are full and impede the ability to add new
codes. In some cases, new codes have been
assigned to different chapters making it difficult to
locate all available codes. ICD-10 codes have
increased character length, which greatly expands
the number of codes that are available for use.
With more available codes, it is less likely that
chapters will run out of codes in the future.
• Updating: Other issues that are addressed in ICD-10
include the use of full code titles and appropriately
reflecting advances in medical knowledge and
technology.
Source: https://www.unitypoint.org/waterloo/filesimages/for%20providers/icd9-icd10-differences.pdf
Overall comparison of ICD10 and ICD11
ICD has been reviewed to accommodate the needs of multiple use cases
and users in recording, reporting, and analysis of health information. ICD-11
comes with:
Source: https://www.who.int/classifications/icd/en/#page21
Diagnosis-Related Groups (DRG)
Def. (DRG): is a classification of the “services” that
acute health care centers can provide. DRGs were
intended to describe all types of patients in an acute
hospital setting. DRGs have been used in the US since
1982 to determine how much Medicare pays the
hospital for each “service”, since patients within each
category are clinically similar and are expected to use
the same level of hospital resources. DRGs are assigned
to patients by a “grouper” program based on ICD
diagnoses (primary diagnosis), procedures, age, sex,
discharge status, and the presence of complications or
comorbidities (secondary diagnoses).
Int’l Classification of Primary Care (ICPC-2)
Def. (ICPC-2): it classifies patient data and clinical activity in the domains of General/Family Practice
and primary care, taking into account the frequency distribution of problems seen in these domains.
It allows classification of the patient’s reason for encounter (RFE), the problems/diagnosis managed,
interventions, and the ordering of these data in an episode of care structure.
• It has a biaxial structure consisting of 17 chapters, each one divided into 7 components.
• ~1,300 codes.
• Abbreviated ICPC-2 in Spanish: https://www.iqb.es/patologia/ciap/ciap_toc.htm
Chapters
A General and unspecified
B Blood, blood forming organs, lymphatics, spleen
D Digestive
F Eye
H Ear
K Circulatory
L Musculoskeletal
N Neurological
P Psychological
R Respiratory
S Skin
T Endocrine, metabolic and nutritional
U Urology
W Pregnancy, childbirth, family planning
X Female genital system and breast
Y Male genital system
Z Social problems
Components
1: symptoms and complaints
2: diagnostic, screening and preventive procedures
3: medication, treatment and procedures
4: test results
5: administrative
6: referrals and other reasons for encounter
7: diseases
Source: http://docpatient.net/3CGP/QC/ICPC_desk.pdf
Overall Comparison of ICD and ICPC
• Oriented to Primary Care: ICD covers the needs of hospital care where patients normally present for a
single episode of care and mostly with one, often clearly differentiated, problem. In primary care,
however, healthcare providers deal with multiple episodes of care over time, and deal with many, often
undifferentiated, problems simultaneously. Therefore, ICPC allows information to be captured on episodes of
care (EoC) over time. An EoC starts with the first contact between patient and
healthcare provider concerning a certain health problem, and ends with the last contact relating to this
same problem. The EoC allows for grouping of information over time. Healthcare providers can use this to
improve continuity and coordination of care. The ability to collect data using the EoC also creates more
insight into the processes related to certain conditions over time, and so a greater understanding of what
is needed and the associated costs.
• Reflects the content of primary care: ICPC is a classification system which aims to reflect the content of
primary care. The ICPC contains codes that are mainly based on the frequencies with which they are
encountered in primary care and with a level of detail that is appropriate for primary care. It is possible to
tailor ICPC to match local epidemiological needs. It enables easy and consistent coding: the whole of ICPC,
with all its codes, fits on the front and back of one A4 sheet of paper.
• Reduced size: ICPC-2 has around 1,300 codes whereas ICD has between 14,000 and 140,000 codes with a
complex coding system.
• Formal structure: The components that form part of each ICPC chapter permit considerable specificity for
all three elements of the encounter (i.e., findings, diagnosis, and treatment), yet their symmetrical
structure and largely uniform numbering across all chapters also facilitate usage even in manual recording
systems.
• Multilingual: ICPC is available in Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French,
German, Greek, Italian, Japanese, Norwegian, Portuguese, Romanian, Russian, Serbian, Slovenian, and
Spanish.
Source: https://www.globalfamilydoctor.com/site/DefaultSite/filesystem/documents/Groups/WICC/International%20Classification%20of%20Primary%20Care%20Dec16.pdf
Episode of Care
An "Episode of Care" is a healthcare management and billing concept that refers to a
specific period during which a patient receives a sequence of healthcare services related
to a particular health issue, condition, or treatment. This concept is primarily used in
healthcare reimbursement and administration to organize and account for the care
provided to a patient over a defined period. Here are some key points to understand
about episodes of care:
• Episode of Care = (RFE; Diagnosis; Procedure)
– RFE = Patient’s Reason for Encounter (e.g., headache), that is, the symptoms.
– Diagnosis = differential knowledge acquired about the patient’s physical or
psychological state.
– Procedure = treatment action(s) or test(s) started on the patient.
• Simplified examples:
Encounter | RFE | Diagnosis | Process
EoC2 | “What’s the (blood) test result?” | “B80 iron deficiency anemia” | “-40 diagnostic endoscopy” (colonoscopy)
EoC3 | “What is the (colonoscopy) test result?” | “D75 malignant neoplasm colon/rectum” | “-67 referral to physician/specialist/…”
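The triple (RFE; Diagnosis; Procedure) maps naturally onto a small record type. The sketch below stores the two simplified examples from the table; the class name and field names are assumptions made for illustration, and the codes are copied from the slide.

```python
from typing import NamedTuple


class EpisodeOfCare(NamedTuple):
    """One entry of the form (RFE; Diagnosis; Procedure)."""
    rfe: str         # patient's reason for encounter
    diagnosis: str   # ICPC diagnosis code and label
    procedure: str   # ICPC process code and label


eoc2 = EpisodeOfCare(
    rfe="What's the (blood) test result?",
    diagnosis="B80 iron deficiency anemia",
    procedure="-40 diagnostic endoscopy",
)
eoc3 = EpisodeOfCare(
    rfe="What is the (colonoscopy) test result?",
    diagnosis="D75 malignant neoplasm colon/rectum",
    procedure="-67 referral to physician/specialist/…",
)

history = [eoc2, eoc3]
# The leading token of each diagnosis is its ICPC code.
diagnosis_codes = [e.diagnosis.split()[0] for e in history]
```

Keeping the entries in a list preserves their order over time, which is what allows information about one health problem to be grouped across contacts.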
Classification of Functioning, Disability and
Health (ICF)
Def (ICF): it is the WHO framework for measuring health and disability at both
individual and population levels. ICF was officially endorsed by all 191 WHO
Member States in 2001 as the international standard to describe and
measure health and disability.
• http://apps.who.int/classifications/icfbrowser/
• ICF has 4 chapters: b, d, e, s
• Code format: A-N-NN-N
Anatomical Therapeutic Chemical
Classification System (ATC)
Def. (ATC): Codification system in which drugs are classified at five levels.
A-NN-A-A-NN
Level 1: Anatomical Group (A): 14 groups
Level 2: Therapeutic Subgroup (NN)
Level 3: Therapeutic/Pharmacological Subgroup (A)
Level 4: Chemical/therapeutic/pharmacological Subgroup (A)
Level 5: Chemical Substance (NN)
https://www.whocc.no/atc_ddd_index/
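The five-level structure A-NN-A-A-NN means that a full ATC code can be split by position. The following sketch assumes a complete seven-character code; the function name is hypothetical, and C03CA01 (furosemide) is used as the example code.

```python
import re

# (level name from the slide, number of characters at that level)
ATC_LEVELS = [
    ("Anatomical group", 1),
    ("Therapeutic subgroup", 2),
    ("Therapeutic/pharmacological subgroup", 1),
    ("Chemical/therapeutic/pharmacological subgroup", 1),
    ("Chemical substance", 2),
]


def split_atc(code):
    """Split a full ATC code (format A-NN-A-A-NN) into its five levels."""
    if re.fullmatch(r"[A-Z]\d{2}[A-Z]{2}\d{2}", code) is None:
        raise ValueError(f"not a full ATC code: {code!r}")
    parts, position = {}, 0
    for name, width in ATC_LEVELS:
        parts[name] = code[position:position + width]
        position += width
    return parts


furosemide = split_atc("C03CA01")
```

For C03CA01 the result maps the anatomical group to `C`, the therapeutic subgroup to `03`, the next two levels to `C` and `A`, and the chemical substance to `01`.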
Logical Observation Identifiers Names and
Codes (LOINC)
Def. (LOINC): it is an international standard to assist in the electronic exchange and gathering of
clinical results (such as laboratory tests, clinical observations, outcomes management and
research).
• https://loinc.org/
• > 71,000 observation terms
• LOINC has two parts:
– Laboratory: to describe results of laboratory and microbiology tests,
– Clinical: to refer to a variety of non-lab concepts (e.g., ECG, cardio echo, ultrasound). The
clinical part has:
• Terms for clinical documents: to be incorporated in clinical reports such as discharge summaries.
• Terms for survey instruments: to be used in standard surveys such as the Glasgow Coma Scale.
• Each code has six dimensions or parts:
– Component (Analyte): the substance or entity being measured or observed.
– Property: the characteristic or attribute of the analyte.
– Time: the interval of time which an observation was made.
– System (Specimen): The specimen or thing upon which the observation was made.
– Scale: how the observation value is qualified or expressed: quantitative, ordinal, nominal.
– Method (optional): how the observation was made.
• Example: code 806-0 (parts Leukocytes: NCnc: Pt: CSF: Qn: Manual count) stands for
“manual count of white blood cells in cerebral spinal fluid specimen”. NCnc = number
concentration, Pt = point in time, CSF = cerebral spinal fluid, Qn = quantitative.
Source: https://loinc.org/806-0/
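The six parts of a LOINC name can be separated by splitting on the colon. This is a sketch under assumptions: the class and function names are invented here, the parts are assumed to be colon-separated as in the example for code 806-0, and a name with only five parts is treated as having no method.

```python
from typing import NamedTuple, Optional


class LoincName(NamedTuple):
    component: str          # analyte being measured or observed
    prop: str               # property, e.g. NCnc (number concentration)
    time: str               # e.g. Pt (point in time)
    system: str             # specimen, e.g. CSF
    scale: str              # e.g. Qn (quantitative)
    method: Optional[str]   # optional sixth part


def parse_loinc_name(text):
    """Split 'Component: Property: Time: System: Scale[: Method]' into its parts."""
    parts = [p.strip() for p in text.split(":")]
    if len(parts) == 5:
        parts.append(None)   # the method part is optional
    if len(parts) != 6:
        raise ValueError(f"expected 5 or 6 parts, got {len(parts)}")
    return LoincName(*parts)


wbc_csf = parse_loinc_name("Leukocytes: NCnc: Pt: CSF: Qn: Manual count")
```

Reading the parsed fields back gives the same description as the example: leukocytes, number concentration, point in time, cerebral spinal fluid, quantitative, manual count.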
Systematized Nomenclature of Medicine (SNOMED CT)
Def. (Systematized Nomenclature of Medicine - Clinical Terms, SNOMED CT): it is the most comprehensive,
multilingual, codified clinical terminology in the world. SNOMED CT is also a terminology product
that can be used to encode, retrieve, communicate, and analyze clinical data, enabling healthcare professionals to
represent information in an appropriate, accurate, and unambiguous way. The terminology is built from three basic
components: concepts, descriptions, and relationships. These items are intended to accurately represent clinical
knowledge and information in the healthcare setting.
• https://browser.ihtsdotools.org
• Poly-hierarchical structure of clinical concepts (IS-A relations).
[Figure: SNOMED CT diagramming notation: defined concept, attribute, is-a relationship, attribute group, conjunction (AND), equivalence, subsumption (inclusion), and unidirectional/bidirectional connectors.]
Source: https://confluence.ihtsdotools.org/download/attachments/29951081/doc_DiagrammingGuideline_Current-en-US_INT_20140131.pdf?api=v2
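A poly-hierarchy means that a concept may have more than one IS-A parent, so subsumption has to be checked along every path. The sketch below uses a toy graph: the concept names are illustrative only and are not actual SNOMED CT content or identifiers.

```python
# Toy IS-A graph: each concept maps to the set of its direct parents.
IS_A = {
    "viral pneumonia": {"infectious pneumonia", "viral respiratory infection"},
    "infectious pneumonia": {"pneumonia"},
    "viral respiratory infection": {"respiratory infection"},
    "pneumonia": {"lung disease"},
}


def ancestors(concept):
    """All concepts reachable by following IS-A links upwards."""
    seen, stack = set(), [concept]
    while stack:
        for parent in IS_A.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed_by(child, parent):
    """Subsumption: child is the same concept as parent or one of its descendants."""
    return child == parent or parent in ancestors(child)


viral_ancestors = ancestors("viral pneumonia")
```

In this toy graph "viral pneumonia" is subsumed both by "lung disease" and by "respiratory infection", each through a different parent, which a strict single-parent tree could not express.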
SNOMED CT: A Big Ontology
Unified Medical Language System (UMLS)
Def. (UMLS): The UMLS is a set of files and software that brings together
many health and biomedical vocabularies and standards to enable
interoperability between computer systems.
• The UMLS integrates and distributes key terminology, classification and
coding standards, and associated resources to promote creation of more
effective and interoperable biomedical information systems and services,
including electronic health records.
• It contains 3 knowledge sources:
– Metathesaurus: Terms and codes from many vocabularies, including CPT, ICD-10-CM,
LOINC, MeSH, RxNorm, and SNOMED CT. Hierarchies, definitions, and other
relationships and attributes.
– Semantic Network: Broad categories (semantic types) and their relationships
(semantic relations).
– SPECIALIST Lexicon and Lexical Tools: A large syntactic lexicon of biomedical and
general English and tools for normalizing strings, generating lexical variants, and
creating indexes.
Electronic Health Records
• Definition and Related Terms
– Electronic Medical Record
– Electronic Health Record
– Electronic Personal Health Records
• Parts of an EHR
• EHR Software Strategies
• Health Information Systems: Past, Present, and Future
• EHR Standards
– HL7
– OpenEHR
– EHRcom
• EHR Systems
– Proprietary solutions: SAP Health-Care, HPCIS (DXC), Selene (CGM),
Millennium (Cerner)
– Ad hoc solutions: Diraya, Jimena, Abucasis, Ianus, e-Osabide.
Electronic Health Record (EHR)
Def. (Electronic Health Record, EHR): An electronic
(digital) collection of medical information about a
person that is stored on a computer.
An electronic health record includes information about
a patient’s health history, such as diagnoses, medicines,
tests, allergies, immunizations, and treatment plans.
Electronic health records can be seen by all healthcare
providers who are taking care of a patient and can be
used by them to help make recommendations about
the patient’s care.
“Also called electronic medical record”.
Office of the National Coordinator for
Health Information Technology (ONC)
within the Office of the Secretary for the U.S. Department of Health and Human Services (HHS).
Parts of an EHR
Sources: Kelley T. Electronic Health Records for Quality Nursing & Health Care. DEStech Publications, Inc. 2016.
Institute of Medicine (US) Committee on Data Standards for Patient Safety. Key Capabilities of an Electronic Health Record System: Letter Report.
Washington (DC): National Academies Press (US); 2003. Available from: https://www.ncbi.nlm.nih.gov/books/NBK221802/ doi: 10.17226/10781
EHR: 1. Health Care Data and Information
• An EHR must contain certain data about patients. The input-output access
to this data by care providers must be efficient.
• Difference between data and information:
– Data: symbols representing numbers, letters, words, abbreviations, etc.
– Information: assignment of a (clinical) meaning to data that allows decision making
and knowledge generation.
• Health care data and information can be introduced in EHRs in either a
structured or unstructured format. E.g., text vs. SNOMED codes.
• Possible categories of EHR data and information:
– Patient demographics: e.g., name, date of birth, gender, ethnicity, race, medical
record number, etc.
– Patient list of problems and diagnoses: primary and secondary causes of treatment.
– List of medications and prescriptions: pharmacological treatment of the patient.
– List of allergies: allergies to medication, food, substances, etc.
– Clinical documentation: e.g., electronic forms, templates, spreadsheets, notes, etc.
– Patient orders: list of patient actions required by a health care professional for the
management of the patient.
– Medication and administration record (MAR): Nurses, pharmacists, and providers
use the MAR to administer patient medications and subsequently monitor the
patient’s response.
EHR: 2. Results Management
• An EHR must contain the results of the clinical tests performed on the
patient.
• This is a broad category that incorporates results from tests performed in
clinical areas, other diagnostic tests, and consultative exams.
– Result: outcome of a test or exam ordered by a health care provider and performed on
the patient. Used to evaluate the patient’s condition and to make clinical decisions
about the patient’s care.
• Examples: Complete blood count (CBC), Potassium level (K+), Chest X-Ray
(CXR), Electrocardiogram (ECG/EKG), Pulmonary function tests (PFTs),
Biopsies, Strep Test, Urinalysis (UA), Mononucleosis Test (Mono), Magnetic
Resonance Imaging (MRI), Nuclear Medicine Scans (NM), Computerized
Tomography (CT), etc.
EHR: 3. Order Entry and Management
• An EHR must contain the treatment orders issued by health care
professionals. In modern health care systems, order entry is computerized.
• Computerized Provider Order Entry (CPOE) as part of (or connected to)
the EHR.
• CPOE facilitates:
– Reduction of errors caused by illegible handwriting (orders are easy to read).
– Automatic filling of fields in the order (e.g., patient/doctor name, date, etc.)
– Automatic incorporation of orders in the EHR of the patient.
– Electronic prescription.
– Time reduction.
– Automatic detection of medication errors (drugs, dosages, frequencies, etc.)
and interactions (drug-drug, drug-allergy, etc. checkers).
– Paper elimination and cost reduction.
EHR: 4. Clinical Decision Support
• An EHR can contain (or use) some tools for decision support.
• Clinical Decision Support (CDS) component in EHR was not possible in
paper-based health records.
• CDS is meant to assist the provider, nurse, and other health care
professionals to make optimal decisions about a patient’s treatment plan.
• CDS uses the patient’s data and information for this purpose.
• CDS and CPOE have proved to be a beneficial symbiosis when working together.
• Monitoring whether values stay within their normal ranges is another use of CDS.
TO BE CONSIDERED LATER
EHR: 5. Electronic Communication and
Connectivity
• An EHR must facilitate secure data communication between different
units in the same health care center or among health care centers.
• Communication: EHR data and information moves
– Data sources: where EHR data is produced
– Data storages: where EHR data is stored (secondary sources)
– Data uses: where EHR data is consumed/required
[Diagram: data sources (… Source m), data storages (… Storage n), and data uses (… Use n)]
EHR: 7. Administrative Process
• This EHR component supports the organization of health care around the patient.
• It is normally used during admission and discharge (inpatients) or visits
(outpatients).
• It is usually centered on demographic information (patient’s name, date
of birth, gender, race, ethnicity, language), the medical record identifier (MRI)
assigned to the patient, the patient’s insurance or payer information (social
security number, insurance number, credit card number), and patient
locations (exam rooms, emergency rooms, operating rooms, ICU rooms,
inpatient care rooms, etc.).
EHR: 8. Reporting and Population Health
Management
• Some EHRs incorporate a module to extract summary information
required by national health care systems.
• Health care centers usually report their activities and information about their
patients to federal, state, and local governmental institutions for global
supervision of patient safety and health care quality.
• Units and departments in hospitals also have to report to the hospital as
an input for global management and decision making.
• The EHR becomes an outstanding tool in these processes.
• EHR data extraction and arrangement for these purposes should be automated
with EHR modules that implement the extraction and processing algorithms.
• These components reduce reporting time and costs and improve reporting accuracy.
EHR Software Strategies
• Ad hoc strategy: the hospital IT staff develops their own system.
• Off-the-shelf strategy: the hospital purchases a software product and customizes
it.
– Single EHR: a single software product is available with modules for financial, billing,
human resources, material management, etc. It’s a neat, compact solution.
– Best-of-breed strategy: the solutions of many vendors are analyzed and the best
components of each are purchased and integrated. It involves intensive
interface development between components.
– Best-of-suite strategy: there is a core EHR into which other software systems are
integrated.
[Figure: health care centers (HCC 1 … HCC 5) exchanging EHR data, with HL7 as the exchange standard]
Simplified HL7 v3 RIM
• Entities: elements (either objects or agents) participating. Ex: person, organization, place,
material, etc.
• Roles: provide the kind of participation and liabilities of entities. Ex: patient, employee (for
doctors, nurses, ...), etc.
• Participations: information about the involvement of a role in an act.
• Acts: health-care past/present/future action. Ex: procedure, patient encounter, observations,
invoice, etc.
Example (operative report): Dr. Smith (entity) as physician (role) observes uncontrolled diabetes (observation act) in patient Mr. Jones (entity, role
patient) and prescribes metformin (supply act) to stabilize the situation (act). [Both the doctor and the patient (roles) participate in this clinical situation
(participation)].
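The four RIM concepts and the operative report example can be sketched as plain Python data classes. This is only a rough illustration: the class attributes and the participation kinds ("author", "subject") are hypothetical simplifications chosen for this sketch, not the attribute names of the actual HL7 v3 RIM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entity:
    name: str              # e.g., a person, organization, place, or material


@dataclass
class Role:
    kind: str              # e.g., "physician", "patient"
    player: Entity         # the entity that plays this role


@dataclass
class Participation:
    kind: str              # hypothetical label for how the role takes part in the act
    role: Role


@dataclass
class Act:
    kind: str              # e.g., "observation", "supply"
    description: str
    participations: List[Participation] = field(default_factory=list)


# The operative report example from the slide
smith = Role("physician", Entity("Dr. Smith"))
jones = Role("patient", Entity("Mr. Jones"))

observation = Act("observation", "uncontrolled diabetes",
                  [Participation("author", smith), Participation("subject", jones)])
supply = Act("supply", "metformin",
             [Participation("author", smith), Participation("subject", jones)])

for act in (observation, supply):
    who = ", ".join(f"{p.role.player.name} as {p.role.kind}" for p in act.participations)
    print(f"{act.kind}: {act.description} [{who}]")
```

Note that the same roles are reused in both acts: a role is attached to an entity once, and each act records separately how that role participates in it.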
HL7 messages: v2 vs. v3
HL7 v2 Messages
• Editor: http://7edit.com/home
• Others: https://hl7latam.blogspot.com/2018/07/editor-de-mensajes-hl7.html
7Edit: HL7 v2 Message Editor
Types of Segments
HL7 Message
HL7 RIM
HL7 List of Messages
Case Example 0: Use of 7Edit
1. Create a message of type ADT_A01 (admission/visit notification).
2. Indicate that the admission was on Feb 4, 2020 at 12:30 h, by writing a segment of type EVN (event).
3. Incorporate personal information about the patient as a PID segment: Mary Higgins, born on Dec 15, 1997,
female, Asian, married, who lives at 1122 Alberton Av and has home phone number 111222. She receives
the patient id 123456.
4. Introduce a contact person: create a next of kin (NK1) segment with the information of the husband, John
Carter, with phone number 333444.
5. Introduce the information that Mary is allergic (AL1 segment) to penicillin (ATC code J01C).
6. Mary was diagnosed (DG1 segment) with essential hypertension (ICD-9-CM code 401.0) on Aug 10, 2019.
7. One hour after arrival, she received a procedure (PR1 segment) of anamnesis by doctor Charles Cannot
(ROL segment), who observed (OBX segment) a systolic blood pressure (SNOMED-CT code 163020007) of
140 mmHg, and prescribed (PR1 segment) atenolol 100 mg, with ATC code C07AB03.
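To see what an editor such as 7Edit produces, the first steps of this case can be sketched as a pipe-delimited HL7 v2 message built and parsed in Python. This is a rough sketch under assumptions: the field positions follow the usual HL7 v2.x segment layouts as best understood here and should be checked against the HL7 v2 specification; the application name, facility name, message id, and version in MSH are hypothetical placeholders; only the MSH, EVN, PID, NK1, and AL1 segments are included.

```python
FIELD_SEP = "|"
SEGMENT_SEP = "\r"   # HL7 v2 segments are separated by carriage returns


def segment(name, *fields):
    """Join a segment name and its fields with the field separator."""
    return FIELD_SEP.join([name, *fields])


message = SEGMENT_SEP.join([
    # MSH: sender names, message id, and version are hypothetical placeholders
    segment("MSH", "^~\\&", "DEMO_APP", "DEMO_HCC", "", "", "202002041230",
            "", "ADT^A01", "MSG0001", "P", "2.5"),
    # EVN: event type and date/time of the admission
    segment("EVN", "A01", "202002041230"),
    # PID: id, name, birth date, sex, race, address, home phone, marital status
    segment("PID", "1", "", "123456", "", "Higgins^Mary", "", "19971215", "F",
            "", "Asian", "1122 Alberton Av", "", "111222", "", "", "M"),
    # NK1: next of kin (the coded relationship field is left empty in this sketch)
    segment("NK1", "1", "Carter^John", "", "", "333444"),
    # AL1: drug allergy (DA) to penicillin, coded with ATC
    segment("AL1", "1", "DA", "J01C^Penicillin^ATC"),
])


def parse(msg):
    """Return a dict mapping each segment name to its list of fields."""
    parsed = {}
    for line in msg.split(SEGMENT_SEP):
        fields = line.split(FIELD_SEP)
        parsed[fields[0]] = fields
    return parsed


pid = parse(message)["PID"]
print(pid[3], pid[5], pid[7])   # patient id, name, date of birth
```

The parser keeps only one segment per name, which is enough here; a real message can repeat segments (for example several OBX), so a production parser would store lists of segments instead.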
HL7 CDA
Def. (HL7 CDA): it provides a standard for the representation,
persistence and communication of clinical documents for exchange
between systems.
• A CDA document is a tree-structure in which higher levels can
contain lower level CDA structures. They are both human readable
and machine processable.
• A CDA document is an XML file consisting of:
– Header: identifies the patient, provider, document type, etc.
– Body: mandatory human-readable part containing the complete content
and an optional encoded part that can be safely ignored by recipients
which are unable to process it.
• Depending on the body, there are three classes of CDA documents
(called levels):
– Level 1: with a NonXMLBody (free style) body. Ex., PDF file.
– Level 2: with a StructuredBody body containing textual sections.
– Level 3: with a StructuredBody body containing textual sections with
codified entries (clinical information) for machine processing.
• Level 3 allows semantic interoperability.
https://web.archive.org/web/20081026071806/http://hl7book.net/index.php?title=CDA
CDA Document Complete Structure
Header: metadata about the document
• Who created the document
• Who is the document about
• When was the document created
• Where was the document created
• Etc.
Entry Example:
This entry is representing …
• an observation (OBS),
• with value 93 kg
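A Level 3 document of this kind can be read with Python's standard XML parser. The fragment below is a heavily simplified sketch, not a conformant CDA document: the required header elements are omitted, and the section title and narrative text are invented for illustration. Only the namespace `urn:hl7-org:v3` and the observation entry with value 93 kg correspond to the slide.

```python
import xml.etree.ElementTree as ET

NS = {"cda": "urn:hl7-org:v3"}

# Simplified CDA-like document: header omitted, one section with one coded entry
cda_xml = """
<ClinicalDocument xmlns="urn:hl7-org:v3"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <title>Example note</title>
  <component>
    <structuredBody>
      <component>
        <section>
          <title>Vital signs</title>
          <text>Body weight: 93 kg</text>
          <entry>
            <observation classCode="OBS" moodCode="EVN">
              <value xsi:type="PQ" value="93" unit="kg"/>
            </observation>
          </entry>
        </section>
      </component>
    </structuredBody>
  </component>
</ClinicalDocument>
"""

root = ET.fromstring(cda_xml.strip())

for section in root.iterfind(".//cda:section", NS):
    title = section.find("cda:title", NS).text
    narrative = section.find("cda:text", NS).text        # human-readable part
    for value in section.iterfind("cda:entry/cda:observation/cda:value", NS):
        # machine-processable part: numeric value and unit
        print(title, "|", narrative, "|", value.get("value"), value.get("unit"))
```

A recipient that cannot process the coded entry can still show the `<text>` narrative, which is why the encoded part can be safely ignored.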
CDA Header: Example
XML header
CDA declaration
CDA header
Birthday: 1932/09/24
The information about the patient is provided by the health care center 2.16.840.1.113883.19.5
CDA Body Examples
Non-Structured Body: expressed with the element <text>.
– Example content: plain text, patient.txt, or http://www.hcc.com/patients/patient0012453.txt
– The patient is described in a local (or remote) text file patient.txt (or patient0012453.txt) which is attached to
the body. A full URL reference to the remote document could be provided.
Structured Body: composed of 1+ elements <component>, each one with 0+ elements <section>,
composed of 0+ elements <entry>.
– Example content: Patient with significant alterations …
– The document represents a patient’s textual anamnesis that states “Patient with significant alterations …”
CDA Documents Can Automate Print out
HL7 CDA Viewer: Backbeach
https://backbeachsoftware.com.au/challenge/index.htm
XML view of
CDA document
Examples in XML
Examples in blocks
CDA Header
Section
Visibility
Control
OpenEHR
• OpenEHR is an open standard that specifies all the architectural components
needed to create health information systems that are interoperable, highly
maintainable, and very flexible. It describes the management and storage,
retrieval, and exchange of health data in EHRs. It has three basic components:
– The Reference Model (RM) (or Information Model): is a hierarchy of data structures. NON-MODIFIABLE
– The Knowledge or Archetype Model (AM): composed of models to describe archetypes and templates.
– The Service Model (SM): under development; it includes definitions of basic services in the health
information environment, centered around the EHR.
https://specifications.openehr.org/releases/1.0.1/html/architecture/overview/Output/overviewTOC.html
openEHR: The Reference Model
• openEHR RM defines all the basic classes in openEHR
• Everything in openEHR (archetypes, templates, etc.) is based on the classes of openEHR RM.
• Classes are organized in the following packages:
[Figure: openEHR package structure; labels: Template DM, openEHR Archetype Profile, Archetype DM]
– Package ehr_extract: defines the semantics of Extracts (i.e., things that can be shared) from openEHR data sources,
including EHRs.
– Package ehr: contains the top level structure, the EHR.
– Package demographic: expresses attributes and relationships of demographic entities (e.g. contact address) which exist
regardless of particular clinical involvements or participations in particular events. Ex., concepts as PARTY, ROLE, etc.
– Package integration: for legacy and other data integration situations.
– Package composition: defines the containment and context semantics of the key concepts COMPOSITION, SECTION,
and ENTRY.
– Package common: defines abstract concepts and design patterns used in higher level openEHR models.
– Package data_structures: describes generic path-addressable data structures (e.g., single, list, table, tree) and a generic
notion of linear history (i.e., time-series structure), for recording events in past time.
– Package data_types: contains classes to define openEHR basic data types (ex., text, date_time, URI, etc.)
– Package support: defines the semantics for constants, terminology access, and access to externally defined
scientific units and conversion information.
Source: https://specifications.openehr.org/releases/RM/latest/ehr.html
OpenEHR AM: Archetypes
OpenEHR archetypes are clinical content specifications that formalize the patterns and requirements for
representation of detailed, computable health-related concepts. Each archetype defines a topic-related set of
data groups and elements. For example, there are separate archetypes for recording symptoms, a blood
pressure measurement, an ultrasound report, and a medication order.
• Metadata (Archetype Header): each archetype describing a concept must have a concept name, a concept
description, a purpose, a use, and possible misuses.
• Archetype Classes: there are four basic classes to describe archetypes
– Composition: container class. All information stored within the EHR will be contained within a Composition. For example, an
encounter, a health summary, or a report. Similar to HL7’s body.
– Section: organizing class, usually contained within a composition. They correspond to the headings that you might find on a blank
piece of paper. They are most commonly used to provide a framework in which to place the smaller Entry and Cluster class
archetypes which hold most of the detailed clinical content.
– Entry: standalone 'semantic unit' of information. Entries can be grouped together usefully and re-used in many different
settings. The information within an Entry will mean the same thing no matter where it is used. Entries can be an observation
(without interpretation, ex., blood pressure), evaluation (an interpretation of an observation, ex., gender or contraindication),
action (ex., procedure), instruction (ex., medication order), or admin (administrative, ex., patient admission).
– Cluster: reusable archetypes to be used in Entries or other Clusters. They can capture recursive concepts (e.g., observation).
They represent common and fundamental domain patterns that are required in many archetypes and clinical scenarios (e.g.,
size, symptom, inspection, and relative location).
• Archetype data contents: data units and archetypes that define a concrete archetype.
Some archetype examples follow …
Source: https://openehr.atlassian.net/wiki/spaces/healthmod/overview
Observation Archetype: Blood Pressure
Data Protocol
Evaluation Archetype: Gender
Data Protocol
OpenEHR AM: Templates
OpenEHR templates are combinations and constraints on archetypes to create context-specific
clinical data sets and documents such as clinical notes, discharge summary documents, or
messages that will be used in EHR systems.
• Making templates is a “document design process”:
– Determine which organizational models are used, in which order and which ‘primary’ archetypes these models will
contain.
– Set appropriate default values in the primary archetypes, if required
– Specialize some archetypes, if required.
• For example: Antenatal Examination Template
Template structure:
• History
– Symptoms
– Concerns
• Physical Examination
– Blood Pressure
– Fetal Heart Rate
– Palpation of Abdomen
• Assessment
• Plan
STEPS
• Template will contain 4 sections: History, Physical Examination, …
• Sections will contain archetypes
• Some archetypes may require refinement
• Some archetypes’ slots may require default values
Clinical Knowledge Manager (CKM)
The openEHR Clinical Knowledge Manager (CKM) is an international, online clinical
knowledge resource to manage openEHR archetypes and templates that has gathered
an active community of interested and motivated individuals from around the world
focused on furthering an open and international approach to clinical informatics for
sharing health information between individuals, clinicians and organizations; between
applications, and across regional and national borders.
All contributions to CKM are on a voluntary basis, and all CKM content is open source
and freely available under a Creative Commons license.
Concretely:
• CKM is a library of clinical knowledge artefacts - currently predominantly openEHR
archetypes and templates;
• CKM supports the full life cycle management of openEHR archetypes and
templates through a review and publication process;
• CKM provides governance of the knowledge artefacts.
• https://ckm.openehr.org/ckm/
Clinical Knowledge Manager Tool
Generating New Archetypes
• New archetypes can be generated:
– From scratch: we can generate an archetype for a new
concept.
– By Specialization: we can take an existing archetype (ex.,
the archetype general laboratory observation) and refine it
to represent a more specific archetype (ex., laboratory
observation of blood glucose level).
– By Composition: we can take several existing archetypes
and combine them to form a new more complex
archetype. Composition is done by declaring archetype
slots of type allow_archetype and assign an archetype to
that slot. Example, if we have archetypes A1, …, An, we can
create archetype B(A1, …, An).
EHRcom
• The Health informatics - Electronic Health Record
Communication (EN 13606) was the European Standard
for an information architecture to communicate
Electronic Health Records (EHR) of a patient. The
standard was later adopted as ISO 13606 and later
replaced with ISO 13606-2 and recently ISO 13606-5:2019.
• This standard was intended to support the
interoperability of systems and components that need to
communicate (access, transfer, add or modify) EHR data
via electronic messages or as distributed objects:
– preserving the original clinical meaning intended by the author;
– reflecting the confidentiality of that data as intended by the
author and patient.
ISO 13606 standard
• ISO 13606 is a standard from the International Organization for
Standardization (ISO), originally designed by the European Committee for
Standardization (CEN).
• ISO 13606 defines a standard, rigorous, and stable information
architecture for communicating part or all of the electronic health record
(EHR) of a single subject of care (patient) between EHR systems, or
between EHR systems and a centralized EHR data repository. It may also
be used for EHR communication between an EHR system and clinical
applications or middleware components (such as decision support
components) that need to access EHR data, or as the representation of
EHR data within a distributed (federated) record system.
• ISO 13606 follows a Dual Model architecture separating information from
knowledge. Information is structured through a Reference Model with the
basic entities for representing any information of the EHR. Knowledge is
based on archetypes, which are formal definitions of clinical information
models, such as discharge report, glucose measurement or family history,
in the form of structured and constrained combinations of the entities of a
Reference Model.
Source: https://www.capterra.com/infographics/top-emr-software
The Top 5
By total customers (hospitals using it)
By total users (health care professionals using it)
By social media followers (presence in social media: Facebook, LinkedIn, Twitter)
By vendor size (in number of employees)
By total customers and users
By social media followers and vendor size
EHR Systems in Spain
Source: Miguel Ángel Montero, Health Care & Social Services Director, IECISA. 21-July-2020.
Ad-hoc solutions
Diraya
Jimena
Abucasis
Ianus
e-Osabide
Proprietary solutions
SAP HealthCare
DXC HPCIS
(previously from HP)
CGM Selene
(Previously from Siemens/Cerner)
Cerner Millennium
3. Clinical Data Analysis
• A Data Science Project
• Statistical Analysis of Health Care Data
– Descriptive Statistics
– Inferential Statistics
– Regression
• Artificial Intelligence Analysis of Health Care Data
– Unsupervised Machine Learning
– Supervised Machine Learning
A Data Science Project
• Data science is an inter-disciplinary field that uses
scientific methods, processes, algorithms and systems to
extract knowledge and insights from structured and
unstructured data. It unifies statistics, data analysis,
machine learning, domain knowledge and their related
methods in order to understand and analyze actual
phenomena with data. (Source: Wikipedia)
Data Cleaning
• Data cleaning (or data cleansing) is the process of detecting
and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database. (Source: Wikipedia)
• The main issues in data cleaning are:
– Missing Values: some data can be absent. For example, the blood pressure of some
patients can be unknown.
– Outliers: some data can be highly atypical. For example, some patients may be older
than 100.
– Errors: some data can be corrupt. For example, heart rate can become null because the
connected machine disconnects while moving the patient.
– Duplicated Data: some data can be redundant. For example, we could have birth date,
age, and admission date (birthdate = admission_date – age).
– Pre-Calculation: some required data can be calculated from the available data. For
example, body mass index can be calculated from patient’s height and weight
(BMI = weight (in kg) / height (in m)²).
– Useless Features: some features can be irrelevant to the current DS project. For
example, some study may not need patient’s gender information.
– Useless Cases: some cases can be irrelevant to the current DS project. For example,
pediatric studies may remove patients older than 18.
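The issues listed above can be illustrated with pandas on a tiny table. This is a minimal sketch under assumptions: the column names and all values are made up for illustration, and the chosen policies (treat a heart rate of 0 as an error, fill missing values with the median, flag ages above 100) are just one possible set of choices, not a recommended protocol.

```python
import numpy as np
import pandas as pd

# Made-up toy records (hypothetical column names and values)
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "age":        [34, 12, 105, 58, 41],
    "gender":     ["F", "M", "F", "M", "F"],
    "height_m":   [1.65, 1.50, 1.58, 1.80, 1.70],
    "weight_kg":  [60.0, 45.0, 52.0, 95.0, np.nan],
    "heart_rate": [72, 80, 0, 65, 70],
})

# Errors: a heart rate of 0 is treated here as a device error, i.e., as missing
df["heart_rate"] = df["heart_rate"].replace(0, np.nan)

# Missing values: one possible policy is to fill with the column median
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())
df["heart_rate"] = df["heart_rate"].fillna(df["heart_rate"].median())

# Outliers: flag atypical ages for review instead of silently dropping them
df["age_outlier"] = df["age"] > 100

# Pre-calculation: BMI = weight (kg) / height (m)^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Useless features: drop a column that this (hypothetical) study does not need
df = df.drop(columns=["gender"])

# Useless cases: a pediatric study removes patients older than 18
pediatric = df[df["age"] <= 18]

print(df)
print(pediatric)
```

Flagging outliers rather than deleting them keeps the decision visible: a patient older than 100 is atypical but not necessarily wrong.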
Feature Engineering and Model Creation
• Def. (Feature engineering): the process of using domain knowledge to
extract features from raw data via data mining techniques. (Source: Wikipedia)
• Def. (Scientific modelling): the process of making a particular part or
feature of the world easier to understand, define, quantify, visualize, or
simulate by referencing it to existing and usually commonly accepted
knowledge. It identifies relevant aspects of a situation in the real world
and then uses different types of models for different aims, such as
conceptual models to better understand, operational models to
operationalize, mathematical models to quantify, and graphical models to
visualize the subject.
TO BE CONSIDERED LATER
Statistical Analysis of Health Care Data
• Sample Tools
– MS Excel
– Python
Sample Tool: Python
• Rationale:
– Accessibility: it is free and open source.
– Programmable: analyses can be embedded within
computer programs.
– Simple: with a few instructions, statistical analysis with Python is
easy.
– Powerful: statistical functions are fast.
– Complete: Python provides a great variety of statistical
functions implemented and ready to be used.
Statistical Data Analysis: Descriptive Statistics
Descriptive statistics is a branch of statistics that aims to quantitatively
describe or summarize features of a collection of data.
• Qualitative variables: proportion or percentage of occurrence of
each variable value (e.g., percentage of patients taking one drug).
• Quantitative variables:
– Measures of central tendency
• Mean: arithmetic average of the values, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
• Median: middle value of the set of values.
• Mode: most commonly observed value of the set of values.
– Measures of dispersion or variability
• Variance and standard deviation: st. dev = square-root(variance), $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$.
– For normally distributed data, ~68% of the cases are in the interval [mean ± st.dev]
– For normally distributed data, ~95% of the cases are in the interval [mean ± 2*st.dev]
• Interquartile range: obtain the first and third quartiles Q1 and Q3; then [Q1, Q3] is
the interquartile range, containing 50% of the data.
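As a quick illustration, these measures can be computed in Python for the eight temperature values used in the one-sample t-test example of these notes. This is a minimal sketch: the quartiles use NumPy's default linear interpolation, which is one of several accepted quartile conventions, so other tools may give slightly different Q1 and Q3.

```python
import statistics

import numpy as np

temps = [39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6]

# Measures of central tendency
mean = statistics.mean(temps)
median = statistics.median(temps)
mode = statistics.mode(temps)

# Measures of dispersion (sample versions, dividing by n - 1)
variance = statistics.variance(temps)
st_dev = statistics.stdev(temps)          # square root of the sample variance

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(temps, [25, 75])

print(f"mean={mean:.2f} median={median:.2f} mode={mode}")
print(f"variance={variance:.4f} st.dev={st_dev:.4f}")
print(f"interquartile range=[{q1:.2f}, {q3:.2f}]")
```

The mean (39.55) and standard deviation (about 0.4175) agree with the values reported for the same data in the t-test example.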
Statistical Description of Data
N = 451
Variable              Type       Mean      St. Dev.  95% CI                 Population Description
Age                   Numeric    46.4080   16.4298   44.8916 to 47.9243
Sex                   Categoric                                             male 44.79%, female 55.21%
Height                Numeric    166.1353  37.1946   162.7025 to 169.5681
Weight                Numeric    68.1441   16.5998   66.6121 to 69.6762
QRS duration          Numeric    88.9224   15.3814   87.5028 to 90.3420
P-R interval          Numeric    155.0953  44.8755   150.9537 to 159.2370
Q-T interval          Numeric    367.2239  33.4208   364.1395 to 370.3084
T interval            Numeric    169.9335  35.6711   166.6413 to 173.2257
P interval            Numeric    89.9756   25.8480   87.5900 to 92.3612
Heart Rate            Numeric    74.4634   13.8707   73.1833 to 75.7436
Ragged R wave         Categoric                                             exists 0.22%, not exists 99.78%
Diphasic der R wave   Categoric                                             exists 1.11%, not exists 98.89%
https://archive.ics.uci.edu/ml/datasets/Heart+Disease
N = 303
Natt = 14
Case Study 1: Variable Types
Position  Short name  Type             Description
1.        age         Quantitative     Patient’s age in years
2.        sex         Qualitative      Patient’s gender (1:male, 0:female)
3.        cp          Qualitative      Chest pain type (1:typical angina, 2:atypical angina, 3:non-anginal pain, 4:asymptomatic)
4.        trestbps    Quantitative     Resting blood pressure (in mm Hg on admission to the hospital)
5.        chol        Quantitative     Serum cholesterol in mg/dl
6.        fbs         Qualitative      Fasting blood sugar > 120 mg/dl (1:true, 0:false)
7.        restecg     Qualitative      Resting electrocardiographic results (0:normal, 1:having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2:showing probable or definite left ventricular hypertrophy by Estes' criteria)
8.        thalach     Quantitative     Maximum heart rate achieved
9.        exang       Qualitative      Exercise induced angina (1:yes, 0:no)
10.       oldpeak     Quantitative     ST depression induced by exercise relative to rest
11.       slope       Qualitative      The slope of the peak exercise ST segment (1:upsloping, 2:flat, 3:downsloping)
12.       ca          Qualitative (*)  Number of major vessels (0-3) colored by fluoroscopy
13.       thal        Categorical      3: normal, 6: fixed defect, 7: reversible defect
14.       num         Categorical      Diagnosis of heart disease (angiographic disease status) (0: < 50% diameter narrowing, 1: > 50% diameter narrowing)
CS1: Data Description with Python
1. Download the .csv file in a local folder
2. Take a look at the particularities of this data set
3. Make a table with all the attributes
4. In python, load the file in a pandas DataFrame structure indicating the names of
the columns
5. Look at the types of the attributes
6. Declare categorical attributes according to the previous table
7. Obtain the description of all the variables
8. Obtain frequency of all categories
9. Observe from the previous result that there are some missing values (marked with "?")
10. Calculate the mean, median, and mode of age
11. Calculate the mean, median, and mode of age for males and females separately
12. Calculate 95% CI of chol
13. Calculate 95% CI of chol for women and men separately
14. Calculate 95% CI of all numeric attributes
15. Calculate interquartile ranges of chol
16. Calculate interquartile ranges of all numeric attributes
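Steps 4 to 16 can be sketched with pandas and SciPy as follows. This is a sketch under assumptions: instead of the downloaded file it reads a few invented rows in the same 14-column layout (they are not real records), it assumes missing values are written as "?", and it computes confidence intervals of the mean with Student's t distribution, which is one common choice. With the real file, pass its local path to `pd.read_csv` instead of `sample`.

```python
import io

import pandas as pd
from scipy import stats

COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
CATEGORICAL = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "num"]

# Stand-in for the downloaded file: invented rows, not the real data set
sample = io.StringIO(
    "60,1,1,140,230,1,2,150,0,2.0,3,0,6,0\n"
    "65,1,4,160,280,0,2,110,1,1.5,2,3,3,1\n"
    "40,0,2,130,200,0,0,170,0,1.0,1,0,3,0\n"
    "55,0,2,140,290,0,2,155,0,1.2,2,?,3,0\n"
    "58,1,4,135,190,0,0,145,0,0.5,2,0,7,1\n"
    "62,0,4,138,270,0,2,160,0,3.0,3,2,3,1\n"
)

# Steps 4-6: load with column names, read "?" as missing, declare categorical attributes
df = pd.read_csv(sample, names=COLUMNS, na_values="?")
df = df.astype({col: "category" for col in CATEGORICAL})
print(df.dtypes)

# Steps 7-9: description, category frequencies, and missing values per column
print(df.describe(include="all"))
for col in CATEGORICAL:
    print(df[col].value_counts(dropna=False))
print(df.isna().sum())

# Steps 10-11: mean, median, and mode of age, overall and by sex
print(df["age"].mean(), df["age"].median(), df["age"].mode().tolist())
print(df.groupby("sex", observed=True)["age"].agg(["mean", "median"]))


def ci95(series):
    """95% confidence interval of the mean, based on Student's t distribution."""
    x = series.dropna()
    return stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))


# Steps 12-14: 95% CI of chol, by sex, and for all numeric attributes
numeric = df.select_dtypes("number")
print(ci95(df["chol"]))
print({sex: ci95(group["chol"]) for sex, group in df.groupby("sex", observed=True)})
print({col: ci95(numeric[col]) for col in numeric.columns})

# Steps 15-16: interquartile ranges [Q1, Q3]
print(df["chol"].quantile([0.25, 0.75]).tolist())
print(numeric.quantile([0.25, 0.75]))
```

Reading "?" as a missing value at load time keeps numeric columns numeric; otherwise pandas would load the affected columns as text.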
Statistical Data Analysis: Inferential Statistics
Inferential statistics is a branch of statistics that aims to deduce properties of a population
from a sample, assuming an underlying probability distribution.
• Examples:
– One single population: e.g., probability of developing a lung cancer being a smoker.
– Two populations: e.g., treated patients vs. not-treated patients.
– N populations: e.g., effect of one drug depending on the patient’s disease.
• Related concepts:
– Confidence Intervals (CI): “P% CI [a, b]” means that we are P% confident that the population
parameter is in the interval [a, b] (it is not that P% of the population is within the interval).
– Hypothesis test: null hypothesis vs. alternative hypothesis.
• Equality: e.g., a treatment has no effect.
• Improvement: e.g., one treatment is better than another.
– P-value (or significance): probability of obtaining a result at least as extreme as the observed one by random chance alone, i.e., if the null hypothesis is true.
• P-value < 0.05: we reject the null hypothesis in favor of the alternative hypothesis (e.g., the treatment has an influence on the evolution of
the patient).
• P-value > 0.05: there is no statistical evidence to reject the null hypothesis (e.g., we cannot conclude that the
drug improves the evolution of the patient).
– Test Statistics:
• Quantitative variables: Student’s t-test, Welch’s t-Test.
• Qualitative variables: Pearson’s Chi-squared test, Fisher’s test.
– ANOVA
Concepts Similarity
② Hypothesis Test
(needs data and tables)
③ P-value
(needs only data)
Student’s t-Test and Welch’s t-Test
A t-test is a type of inferential statistic used to determine if there
is a significant difference between the means of two groups.
• Assumption: the sample (quantitative values) follows a normal distribution.
• One sample: To test whether the mean of a population has a
specific value. E.g., the average temperature of patients with
flu is 39.5 °C.
• Two samples: To compare the mean values of two samples.
– Variance of the Samples
• Same variance: use Student’s t-Test.
• Different or unknown variances: use Welch’s t-Test.
– Samples relationship
• Unpaired Samples: The two samples are independent. E.g., maternal smoking
affects the mean weight of newborns (we have separate independent samples of
smoking women and non-smoking women).
• Paired Samples: The two samples are dependent (they belong to the same
subject) E.g., the mean temperature of the patients after the treatment went
down (we compare the temperatures before and after the treatment of the
same patient).
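Worked examples (significance level α = 0.05):
1. One-sample Student’s t-test
– Hypothesis: the average temperature is 39.5 °C.
– Data: 39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6 (n = 8, mean = 39.55, s = 0.4175).
– Statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = 0.3387$.
– P-value: 0.7447 (> α).
– Conclusion: we cannot reject that the average temperature is 39.5 °C.
2. Two samples, unpaired, equal variances: Student’s t-test
– Hypothesis: smoking during pregnancy reduces the mean weight of newborns.
– Data (weights): Smokers: 2.7, 2.9, 2.7, 2.8, 3.0, 2.7, 2.5, 3.1 (n = 8, mean = 2.80, s = 0.1927).
Non-Smokers: 3.4, 2.8, 3.5, 3.2, 3.6, 2.9, 3.5, 3.2, 3.6, 2.9 (n = 10, mean = 3.26, s = 0.3062).
– Statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = -3.6918$, with pooled standard deviation $s = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$.
– P-value: 0.00198 (< α).
– Conclusion: smoking during pregnancy affects the newborn weight (reduction).
3. Two samples, paired: Student’s t-test
– Hypothesis: the mean temperature of the patients went down after the treatment.
– Data (temperature before-after): 39.0-37.5, 40.0-39.5, 38.9-37.8, 39.9-39.4, 39.4-37.5, 39.7-39.9, 39.9-38.9, 39.6-39.7 (n = 8).
– Differences: $d = x_1 - x_2$, with $\bar{d} = 0.775$ (≈ 0.8) and $s_d = 0.7382$.
– Statistic: $t = \frac{\bar{d} - \mu_0}{s_d / \sqrt{n}} = 2.9693$.
– P-value: 0.02083 (< α).
– Conclusion: after the treatment, the temperature went down.
4. Two samples, unknown variances: Welch’s t-test
– Hypothesis: smoking during pregnancy reduces the mean weight of newborns.
– Data: same smokers (n = 8, mean = 2.8, s = 0.1927) and non-smokers (n = 10, mean = 3.26, s = 0.3062) as in example 2.
– Statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = -3.8848$.
– P-value: 0.0014 (< α).
– Conclusion: smoking during pregnancy reduces the newborn weight.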
Student’s/Welch’s t-Tests with Excel
(Functions in Excel)
df = n - 1
Student’s & Welch’s t-Tests with Excel
(Example)
One-sample data:
#   Temperature   x0
1   39.0          39.5
2   40.0          39.5
3   38.9          39.5
4   39.9          39.5
5   39.4          39.5
6   39.7          39.5
7   39.9          39.5
8   39.6          39.5
Derived values:
n = 8
mean = 39.55       promedio(data)
std.dev = 0.4175   desvest.m(data)
error = 0.1476     std.dev/sqrt(n)
df = 7             n-1
t = 0.3388         abs(mean-x0)/error
α = 0.05
                              1 TAIL                              2 TAIL
(1) t-table (t=)              1.8946  INV.T(1-α; df)              2.3646  INV.T.2C(α; df)
(2) P-value (from t)=         0.3724  DISTR.T.CD(t; df)           0.7447  DISTR.T.2C(t; df)
(3) P-value (from data)=      0.3724  PRUEBA.T.(D1; D2; 1; 1)     0.7447  PRUEBA.T.2C(D1; D2; 2; 1)
(4) Output of the Excel tool "Prueba t para medias de dos muestras emparejadas" (t-test: paired two sample for means):
                                   Temperature   x0
Mean                               39.55         39.5
Variance                           0.17428571    0
Observations                       8             8
Pearson correlation coefficient    #¡DIV/0!
Hypothesized mean difference       0
Degrees of freedom                 7
t statistic                        0.33875374
P(T<=t) one tail                   0.37236486
t critical value (one tail)        1.89457861
P(T<=t) two tails                  0.74472973
t critical value (two tails)       2.36462425
Options:
(1) Manual calculation of hypothesis test
(2) Based on P-values if we know or we want to calculate the t value (DISTR.T.* functions)
(3) Based on P-values if we don’t know the t value (PRUEBA.T.* functions)
(4) Using the Data > Data Analysis add-in of Excel
Student’s & Welch’s t-Tests with Python
import numpy as np
import scipy.stats as stats
smokers = np.array([2.7, 2.9, 2.7, 2.8, 3.0, 2.7, 2.5, 3.1])
non_smokers = np.array([3.4, 2.8, 3.5, 3.2, 3.6, 2.9, 3.5, 3.2, 3.6, 2.9])
# Student's t-test (equal variances)
tvalue, pvalue = stats.ttest_ind(smokers, non_smokers, equal_var=True)
print(tvalue, pvalue)  # -3.6918328279635917 0.0019761593646508767
# Welch's t-test (different or unknown variances)
tvalue, pvalue = stats.ttest_ind(smokers, non_smokers, equal_var=False)
print(tvalue, pvalue)  # -3.8848476429716334 0.0014179799721141778
Conclusion (both tests): we can reject smokers = non_smokers
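The one-sample and paired examples of the t-test table can be checked in the same way. This is a minimal sketch that assumes the SciPy functions `ttest_1samp` and `ttest_rel` with their default two-sided alternative; the data are the temperature values from the examples in these notes.

```python
import numpy as np
import scipy.stats as stats

# One-sample test: is the average temperature 39.5 degrees C?
temps = np.array([39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6])
t_one, p_one = stats.ttest_1samp(temps, popmean=39.5)
print(t_one, p_one)

# Paired test: temperatures of the same patients before and after treatment
before = np.array([39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6])
after = np.array([37.5, 39.5, 37.8, 39.4, 37.5, 39.9, 38.9, 39.7])
t_pair, p_pair = stats.ttest_rel(before, after)
print(t_pair, p_pair)

alpha = 0.05
print("one-sample: reject H0" if p_one < alpha else "one-sample: cannot reject H0")
print("paired: reject H0" if p_pair < alpha else "paired: cannot reject H0")
```

The results should agree with the table (t ≈ 0.3388 with P ≈ 0.7447, and t ≈ 2.9693 with P ≈ 0.0208), so only the paired test rejects the null hypothesis at α = 0.05.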
Pearson’s Chi-Square Test
Chi-square test: when the values in the sample are categorical, apply the chi-square
test instead of the t-test.
• E.g., does smoking during pregnancy affect the weight
of newborns, when these are classified as normal-weight
or under-weight?
• Used to determine whether there is a significant difference
between the expected frequencies and the observed
frequencies in one or more categories.
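• Expected frequencies:
$$\text{expected}[i, j] = \frac{\text{obs}[i, \text{total}] \cdot \text{obs}[\text{total}, j]}{\text{obs}[\text{total}, \text{total}]}$$
• Chi-square value:
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$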
• P-value < α ➔ There is a statistically significant
association between maternal smoking and the
probability of having an underweight newborn.
Karl Pearson (1857-1936)
Chi-Square Test with Excel
(Functions in Excel)
df = 3
133
Chi-Square Test with Excel
(Example)
Expected   Smoker    Non-smoker
normal     79.1005   50.8995
under      35.8995   23.1005
134
Chi-Square Test with Python
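                     GROUP
COMPARED VARIABLE    Smoker   Non Smoker
Normal-weight        86       44
Under-weight         29       30
import numpy as np
import scipy.stats as stats
# Chi-square test on the observed 2x2 contingency table
data = np.array([[86, 44],[29, 30]])
# chi2_contingency returns the statistic, the P-value, the degrees of freedom and the expected frequencies
chi2, pvalue, dof, expected = stats.chi2_contingency(data)
print(chi2, pvalue)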
Conclusion: there’s an association between the two variables (smoking during pregnancy affects normality of the
weight of the newborns)
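As a cross-check of the formulas for the expected frequencies and the chi-square value, the following sketch recomputes them by hand for the smoking example. It assumes NumPy and SciPy; the variable names are illustrative, and `correction=False` is used only so that SciPy's result is comparable to the plain formula (by default SciPy applies Yates' continuity correction to 2x2 tables).

```python
import numpy as np
from scipy import stats

# Observed frequencies from the slide: rows = weight class, columns = smoker / non-smoker.
observed = np.array([[86, 44],
                     [29, 30]])

row_totals = observed.sum(axis=1, keepdims=True)   # obs[i, total]
col_totals = observed.sum(axis=0, keepdims=True)   # obs[total, j]
grand_total = observed.sum()                       # obs[total, total]

# expected[i, j] = obs[i, total] * obs[total, j] / obs[total, total]
expected = row_totals * col_totals / grand_total

# Chi-square value: sum over all cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# Cross-check with SciPy, without Yates' continuity correction
chi2_scipy, p_scipy, _, expected_scipy = stats.chi2_contingency(observed, correction=False)

print(np.round(expected, 4))
print(round(chi2_manual, 4), round(p_manual, 4))
```

The expected frequencies reproduce the table of the Excel example (79.1005, 50.8995, 35.8995, 23.1005), and the P-value is below 0.05 with or without the continuity correction.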
135
ANOVA
ANOVA or Analysis of Variance (between group means): used for
testing two or more groups to see whether there’s a difference
between their mean values. For example, testing if several
treatments cause the same result.
• Types of ANOVA tests:
– One-way ANOVA: used when you want to test several groups considering one
single factor (or category) to see if there’s a difference between them. For
example, testing whether a drug has the same effect in children, adults, and
elders (1 factor: age; 3 groups: children, adults, elders).
– Two-way ANOVA: used when you want to test several groups considering two
factors to see if there’s a difference between groups for each one of the
factors. For example, testing whether a drug has the same effect in children,
adults, and elders, but also if there’s any different effect on male and female
patients (2 factors: age, gender; 6 groups: children-male, children-female, …).
• With/without replication: If groups have m>1 cases or m=1 cases.
Hypotheses tested by each type of ANOVA:
• One-way ANOVA (groups G1, G2, …, Gr):
– H0: there’s no difference between groups
– H1: there are some groups that are different
• Two-way ANOVA without replication (factor 1 with groups G1, …, Gr; factor 2 with groups G1, …, Gc):
– H0: there’s no difference between groups for factor 1 (e.g., there’s no difference in the drug effect in males and females)
– H1: there are some groups that are different for factor 1 (e.g., there’s a difference in the drug effect in males and females)
– H0: there’s no difference between groups for factor 2 (e.g., there’s no difference in the drug effect in terms of age groups)
– H1: there are some groups that are different for factor 2 (e.g., there’s a difference in the drug effect in terms of age groups)
• Two-way ANOVA with replication: same as without replication, plus
– H0: there’s no interaction between factors. For example, following a concrete treatment (f1) does not interact with gender (f2).
– H1: there is some interaction between factors. For example, taking contraceptives (f1) has an interaction with gender (f2) (e.g., women are more sensitive).
137
One-Way ANOVA
One-way analysis of variance (ANOVA): used to determine whether there are any
statistically significant differences between the means of two or more independent
(unrelated) groups.
– E.g., compare the efficacy of different dosages of a drug against high blood pressure.
One-factor ANOVA
SUMMARY
Group       n   sum   mean    variance
null        5   891   178.2   15.7
salt free   5   832   166.4   54.3
100 mg      5   823   164.6   27.8
150 mg      5   790   158     81.5
200 mg      5   757   151.4   44.3
ANALYSIS OF VARIANCE
Variation source   Sum of squares   df   Mean square   F           P-value     F crit
Between groups     2010.64          4    502.66        11.240161   0.0000606   2.8660814
Within groups      894.4            20   44.72
Total              2905.04          24
With SciPy (wide format, one column per group):
import numpy as np
import pandas as pd
import scipy.stats as stats
data = np.array([[180,172,163,158,147],
                 [173,158,170,146,152],
                 [175,167,158,160,143],
                 [182,160,162,171,155],
                 [181,175,170,155,160]])
df = pd.DataFrame(data=data, columns=["null", "salt free", "100 mg", "150 mg", "200 mg"])
fvalue, pvalue = stats.f_oneway(df['null'], df['salt free'], df['100 mg'], df['150 mg'], df['200 mg'])
print(fvalue, pvalue)
11.240161001788922 6.0616159772040674e-05
With pingouin (long format, one row per measurement: null 175, null 182, null 181, salt free 172, salt free 158, …):
import numpy as np
import pandas as pd
import pingouin as pg
data = np.array([[1,180],[1,173],[1,175],[1,182],[1,181],
                 [2,172],[2,158],[2,167],[2,160],[2,175],
                 [3,163],[3,170],[3,158],[3,162],[3,170],
                 [4,158],[4,146],[4,160],[4,171],[4,155],
                 [5,147],[5,152],[5,143],[5,155],[5,160]])
df = pd.DataFrame(data=data, columns=["Group","Value"])
aov = pg.anova(dv='Value', between='Group', data=df, detailed=True)
print(aov)
   Source       SS  DF      MS          F     p-unc       np2
0   Group  2010.64   4  502.66  11.240161  0.000061  0.692121
1  Within   894.40  20   44.72        NaN       NaN       NaN
Total: $$SS_t = \sum_{j=1}^{k} \sum_{i=1}^{n} (X_{ij} - \bar{X})^2$$ with df = r·c − 1
Two-way ANOVA (w/o replication) with Excel
“We want to evaluate the efficacy of different doses of a drug and of the patient’s gender against
high blood pressure. For this, we recruit 15 males and 15 females and randomly assign the individuals
of each gender to the 5 given treatments (3 males and 3 females per treatment). These treatments are: (1) no
treatment (placebo), (2) a salt-free diet, (3) the drug at a 100 mg dose,
(4) the drug at a 150 mg dose, and (5) the drug at a 200 mg dose. The systolic blood
pressures of the 30 subjects are measured and tabulated at the end of the treatments.”
                 TREATMENTS
#   Gender   null   salt free   100 mg   150 mg   200 mg
1   men      180    172         163      158      147
2   men      173    158         170      146      152
3   men      175    167         158      160      143
4   women    182    160         162      171      155
5   women    181    175         170      155      160
6   women    179    163         150      148      140
Two-Way ANOVA
Source of variation   Sum of squares   df   Mean square   F            P-value       F crit
Gender                28.0333333       1    28.0333333    0.4740699    0.499030358   4.3512435
Treatment             2813.53333       4    703.383333    11.8948703   4.14415E-05   2.8660814
Interaction           63.1333333       4    15.7833333    0.26691094   0.895750572   2.8660814
Within group          1182.66667       20   59.1333333
Total                 4087.36667       29
142
Two-way ANOVA with Replication
With replication: the intersection between the two factors defines groups of more than 1 value.
– E.g., compare the effect of three different drugs on the systolic blood pressure of patients, depending on their gender,
when the same number of males and females (more than one) receive each treatment.
– Three null hypotheses: (1) the means of the genders are the same, (2) the means of the treatments are the same, and
(3) there is no interaction between treatment and gender.
            Group B1          Group B2          …   Group Bc          Mean
Group A1    X_111, …, X_11m   X_121, …, X_12m   …   X_1c1, …, X_1cm   X̄_1..
…           …                 …                 …   …                 …
Group Ar    X_r11, …, X_r1m   X_r21, …, X_r2m   …   X_rc1, …, X_rcm   X̄_r..
Mean        X̄_.1.             X̄_.2.             …   X̄_.c.             X̄_...
Cell mean: $$\bar{X}_{ij.} = \frac{1}{m} \sum_{k=1}^{m} X_{ijk}$$
Total: $$SS_t = \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{m} (X_{ijk} - \bar{X}_{...})^2$$ with df = r·c·m − 1
Two-way ANOVA with Python
• There’s no evidence to conclude that men and women (factor A) have different SBP
• There is evidence that treatments (factor B) cause different SBP (P=0.000041)
• There’s no evidence that gender and treatment interact (for example,
men disliking some treatment and failing to follow it).
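The slide's Python code is not reproduced here, so the following is only a sketch of how the two-way ANOVA with replication can be computed from its definition, using NumPy and SciPy. The array layout and all variable names are our own assumptions; the 30 measurements are the ones tabulated in the Excel example.

```python
import numpy as np
from scipy import stats

# sbp[g, t, k]: gender g (0 = men, 1 = women), treatment t, replicate k.
sbp = np.array([
    [[180, 173, 175], [172, 158, 167], [163, 170, 158], [158, 146, 160], [147, 152, 143]],
    [[182, 181, 179], [160, 175, 163], [162, 170, 150], [171, 155, 148], [155, 160, 140]],
], dtype=float)
r, c, m = sbp.shape                     # 2 genders, 5 treatments, 3 replicates

grand_mean = sbp.mean()
gender_means = sbp.mean(axis=(1, 2))    # one mean per gender
treat_means = sbp.mean(axis=(0, 2))     # one mean per treatment
cell_means = sbp.mean(axis=2)           # one mean per gender x treatment cell

# Sums of squares of the balanced two-way design
ss_gender = c * m * ((gender_means - grand_mean) ** 2).sum()
ss_treat = r * m * ((treat_means - grand_mean) ** 2).sum()
ss_within = ((sbp - cell_means[:, :, None]) ** 2).sum()
ss_total = ((sbp - grand_mean) ** 2).sum()
ss_inter = ss_total - ss_gender - ss_treat - ss_within

df_gender, df_treat = r - 1, c - 1
df_inter, df_within = df_gender * df_treat, r * c * (m - 1)
ms_within = ss_within / df_within

def f_test(ss, df):
    """Return the F statistic and its P-value against the within-group variance."""
    f_value = (ss / df) / ms_within
    return f_value, stats.f.sf(f_value, df, df_within)

f_gender, p_gender = f_test(ss_gender, df_gender)
f_treat, p_treat = f_test(ss_treat, df_treat)
f_inter, p_inter = f_test(ss_inter, df_inter)
print(f_gender, p_gender, f_treat, p_treat, f_inter, p_inter)
```

The sums of squares and P-values should match the Excel table (Gender 28.03, Treatment 2813.53, Interaction 63.13, Within 1182.67), which supports the three conclusions listed above.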
Post-Hoc Analysis: Tukey’s Test
• T-tests and ANOVA allow us to detect whether there are differences between the group mean
values, but they do not provide insight into the pairwise comparison of groups. The means can be
considered equal or different, but which ones are equal and which ones are different?
Post-hoc analysis: a kind of study that allows the pairwise comparison of groups after a multi-group test.
Tukey’s test: a post-hoc analysis used to find means that are significantly different from each
other.
Which treatments (column B) can be considered different from which other treatments?
import numpy as np
import pandas as pd
import pingouin as pg
data = np.array([[1,1,180],[1,1,173],[1,1,175],[2,1,182],[2,1,181],[2,1,179],
                 [1,2,172],[1,2,158],[1,2,167],[2,2,160],[2,2,175],[2,2,163],
                 [1,3,163],[1,3,170],[1,3,158],[2,3,162],[2,3,170],[2,3,150],
                 [1,4,158],[1,4,146],[1,4,160],[2,4,171],[2,4,155],[2,4,148],
                 [1,5,147],[1,5,152],[1,5,143],[2,5,155],[2,5,160],[2,5,140]])
df = pd.DataFrame(data=data, columns=["A","B","Value"])
# 'B' is the treatment column
pt = pg.pairwise_tukey(data=df, dv='Value', between='B')
print(pt)
   A  B  mean(A)     mean(B)     diff       se        tail       T         p-tukey   hedges
0  1  2  178.333333  165.833333  12.500000  4.121219  two-sided  3.033083  0.026411  1.616448
1  1  3  178.333333  162.166667  16.166667  4.121219  two-sided  3.922788  0.001503  2.090605
2  1  4  178.333333  156.333333  22.000000  4.121219  two-sided  5.338227  0.001000  2.844948
3  1  5  178.333333  149.500000  28.833333  4.121219  two-sided  6.996312  0.001000  3.728606
4  2  3  165.833333  162.166667  3.666667   4.121219  two-sided  0.889704  0.891345  0.474158
5  2  4  165.833333  156.333333  9.500000   4.121219  two-sided  2.305143  0.154899  1.228500
6  2  5  165.833333  149.500000  16.333333  4.121219  two-sided  3.963229  0.001293  2.112158
7  3  4  162.166667  156.333333  5.833333   4.121219  two-sided  1.415439  0.599289  0.754342
8  3  5  162.166667  149.500000  12.666667  4.121219  two-sided  3.073524  0.023565  1.638000
9  4  5  156.333333  149.500000  6.833333   4.121219  two-sided  1.658086  0.462465  0.883658
Treatment 1 (null) and 3 (100mg), Treatment 1 (null) and 4 (150mg), Treatment 1 (null) and 5 (200mg), and
Treatment 2 (salt free) and 5 (200mg) all can be considered to have different means.
John Wilder Tukey (1915-2000)
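If pingouin is not installed, a similar pairwise comparison can be sketched with SciPy's `tukey_hsd` (available in recent SciPy versions). This is an alternative to the slide's call, not the same implementation, so the adjusted P-values may differ slightly from the `p-tukey` column; the group labels and the choice α = 0.05 are assumptions made for the illustration.

```python
import numpy as np
from scipy import stats

# Systolic blood pressure per treatment (values from the slide; men and women pooled).
treatments = {
    "null":      [180, 173, 175, 182, 181, 179],
    "salt free": [172, 158, 167, 160, 175, 163],
    "100 mg":    [163, 170, 158, 162, 170, 150],
    "150 mg":    [158, 146, 160, 171, 155, 148],
    "200 mg":    [147, 152, 143, 155, 160, 140],
}
names = list(treatments)

# tukey_hsd compares every pair of groups; pvalue[i, j] is the adjusted P-value.
result = stats.tukey_hsd(*treatments.values())

alpha = 0.05   # assumed significance level
different_pairs = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if result.pvalue[i, j] < alpha
]
print(different_pairs)
```

Which pairs are reported as different depends on the chosen α: a stricter level keeps only the pairs with the smallest adjusted P-values.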
Normality test
• Values do not necessarily follow a normal distribution
• Testing normality:
– Rule of thumb: plot a histogram of the data and observe whether it follows a normal distribution.
– Shapiro-Wilk method:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
• Excel:
– Install “Real Statistics”
– Access real statistics with CTRL-M
– Select Shapiro-Wilk test
– Observe the P-values obtained
– P-value > α => normality cannot be rejected
• Example:
– Data: 11.25, 10.00, 9.68, 10.52, 8.77, 9.92, 8.62, 10.21, 9.09, 10.36.
– P-value (=SWTEST(data)): 0.8122 => normality cannot be rejected (the data can be assumed to follow a normal distribution)
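The example above can also be checked outside Excel. The sketch below assumes SciPy's `shapiro` function and α = 0.05; SciPy may report a P-value slightly different from the 0.8122 given by Real Statistics, since the two tools approximate the distribution of W in different ways.

```python
import numpy as np
from scipy import stats

# Data of the example above
data = np.array([11.25, 10.00, 9.68, 10.52, 8.77, 9.92, 8.62, 10.21, 9.09, 10.36])
alpha = 0.05   # assumed significance level

# Shapiro-Wilk test: returns the W statistic and the P-value
w_stat, p_value = stats.shapiro(data)

# Normality is rejected only when the P-value is below alpha
normality_rejected = p_value < alpha
print(w_stat, p_value, normality_rejected)
```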
146
source: https://synapse.koreamed.org/Synapse/Data/PDFData/1010JRD/jrd-26-5.pdf
Shapiro-Wilk Test
Critical Values
A.
1. Calculate the SW statistic W for the sample:
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
2. Obtain the critical value CV from the table:
CV = table(n, α)
3. Conclude about normality:
If W < CV, we reject normality
Otherwise we cannot reject normality
B.
1. Calculate the P-value of the data:
P-value = SWTEST(data)
2. Conclude about normality:
If P-value < α, we reject normality
Otherwise we cannot reject normality
Samuel Sanford Shapiro (1930-)
Martin B. Wilk (1922–2013)
A-Coefficients for Shapiro-Wilks
148
Example Shapiro-Wilk Test in Excel
149
Kolmogorov-Smirnov Test
A.
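1. Calculate the KS max value D for the
sample:
$$D = \max_{x} |F_n(x) - F(x)|$$
where F_n is the empirical distribution function of the sample and F is the
distribution function of the reference (normal) distribution.
2. Obtain the critical value CV from the
table:
CV = table(n, α)
3. Conclude about normality:
If D > CV, we reject normality
Otherwise we cannot reject normality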
151
Normality test with Python
Shapiro-Wilk:
import numpy as np
import pandas as pd
import pingouin as pg
data = np.array([2.4,4.5,-5.5,-0.8,-7.9,9.9,-1,7.8,0.3,
                 -10,4.6,5.8,-3.3,1.2,-4.8,7.1,-2.3,-3.8,-9])
df = pd.DataFrame(data=data)
sw = pg.normality(df)
print(sw)
          W     pval  normal
0  0.972835  0.83139    True
Conclusion: we cannot discard that the data follow a normal distribution
Kolmogorov-Smirnov:
import numpy as np
from scipy import stats
data = np.array([2.4,4.5,-5.5,-0.8,-7.9,9.9,-1,7.8,0.3,
                 -10,4.6,5.8,-3.3,1.2,-4.8,7.1,-2.3,-3.8,-9])
# 'norm' without further arguments is the standard normal distribution N(0, 1)
ks, p_value = stats.kstest(data, 'norm')
print(ks, p_value)
0.4103285215572715 0.00208587800956761
Conclusion: the data do not follow the standard normal distribution N(0, 1). This call does not estimate the mean and
the standard deviation from the data, so the result does not contradict the Shapiro-Wilk test.
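To make the difference explicit, the sketch below runs the Kolmogorov-Smirnov test twice on the same data: against the standard normal N(0, 1), and against a normal distribution whose mean and standard deviation are taken from the sample. It assumes SciPy's `kstest` with its `args` parameter; note that estimating the parameters from the same data makes the P-value only approximate (a Lilliefors-type correction would be stricter).

```python
import numpy as np
from scipy import stats

data = np.array([2.4, 4.5, -5.5, -0.8, -7.9, 9.9, -1, 7.8, 0.3,
                 -10, 4.6, 5.8, -3.3, 1.2, -4.8, 7.1, -2.3, -3.8, -9])

# Against the standard normal N(0, 1): mean 0 and standard deviation 1 are imposed
d_std, p_std = stats.kstest(data, 'norm')

# Against a normal distribution with the sample mean and standard deviation
mu, sigma = data.mean(), data.std(ddof=1)
d_fit, p_fit = stats.kstest(data, 'norm', args=(mu, sigma))

print(d_std, p_std)
print(d_fit, p_fit)
```

With the fitted parameters the distance D is much smaller and normality cannot be rejected, in line with the Shapiro-Wilk result on the same data.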
Frank Wilcoxon Henry Berthold Mann D R Whitney William Henry "Bill" Kruskal Wilson Allen Wallis
(1892 –1965) (1905-2000) (? - ? ) (1919 –2005) (1912 – 1998) 153
Wilcoxon’s Rank Sum Test in Python
“Consider a Phase II clinical trial designed to investigate the effectiveness of a
new drug to reduce symptoms of asthma in children. A total of n=10
participants are randomized to receive either the new drug or a placebo.
Participants are asked to record the number of episodes of shortness of breath
over a 1 week period following receipt of the assigned treatment.”
Placebo 7 5 6 4 12 3 1 3 2 1
New Drug 3 6 4 2 1 8 9 12 1 10
import numpy as np
import pingouin as pg
data = np.array([1,2,5,3,2,1,1,3,2,1,4,3,6,5,2,6,1,6,5,4,9,6,7,7,5,1,8,9,6,5])
group = np.concatenate((np.full(10,1), np.full(10,2), np.full(10,3)))
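For the placebo / new drug table of the asthma example, a rank sum test can be sketched with SciPy's `mannwhitneyu`, which implements the Wilcoxon rank sum (Mann-Whitney U) test for two independent samples. The sketch takes the two rows exactly as printed in the table (ten values per row), which is an assumption, since the statement mentions n=10 participants in total.

```python
from scipy import stats

# Episodes of shortness of breath, copied row by row from the table above.
placebo = [7, 5, 6, 4, 12, 3, 1, 3, 2, 1]
new_drug = [3, 6, 4, 2, 1, 8, 9, 12, 1, 10]

# Two-sided Wilcoxon rank sum (Mann-Whitney U) test for two independent samples
u_stat, p_value = stats.mannwhitneyu(placebo, new_drug, alternative='two-sided')
print(u_stat, p_value)
```

A P-value above α means that a difference between the two groups cannot be claimed from these values.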
• F-test
• Levene’s test
• Bartlett’s test
• Brown-Forsythe test
156
Comparing Two-Variable Groups
• Concept: y = f(x), where x is the predictor variable (or group), and y the
prediction variable or event occurrence variable (or factor to study).
• Consider y a binary variable.
– Ex1. survival = f(taking_drug_d)
– Ex2. develop_comorbidity = f(has_index_disease_D)
– Ex3. goes_to_hospital = f(patient’s_old)
– Ex4. receives_the_correct_treatment = f(has_disease_D)
• Risk and Odds: quantifying one binary factor for one group:
y: event occurrence (factor to study), e.g., death, develop disease, …
x: predictor variable (group), e.g., drug, treatment, …
           y: Yes   y: No   Total
x: Yes     a        b       a+b
Risk = a/(a+b) = Pr(event | group); Odds = a/b = Pr(event | group) / Pr(no event | group)
Example: what is the risk that a diabetic patient develops a second disease? Pr(develop|DM)
Example: what are the odds that a diabetic patient develops a second disease? Pr(develop|DM)/Pr(no develop|DM)
157
Comparing two groups: Risk Ratio
• Risk ratio (RR) or relative risk: measures how many times more likely it is to observe a property p
(event or factor) in a group A than in a group B (groups are the predictor variable). For
example, how many times riskier it is to develop breast cancer (property) if there is a
close family history of it (group A) than if there is not (group B).
RR = Risk(p in A) / Risk(p in B) = Pr(p | A) / Pr(p | B)
– If RR = 1: the risk of A and the risk of B are the same.
– If RR < 1: the risk of A is lower than the risk of B.
– If RR > 1: the risk of A is higher than the risk of B.
(RR compares the same risk between two groups.)
$$RR = \frac{a/(a+b)}{c/(c+d)}$$
• Relative Risk Reduction (RRR): measures how much risk is reduced between a
treatment group and a control group. For example, how much risk is reduced when
a chemotherapy drug is tested with respect to the usual treatment.
RRR = (Risk(p in ctrl) – Risk(p in treat)) / Risk(p in ctrl) = (Pr(p | ctrl) – Pr(p | treat)) / Pr(p | ctrl)
$$RRR = \frac{\frac{a}{a+b} - \frac{c}{c+d}}{\frac{a}{a+b}}$$
Comparing two groups: Odds Ratio
• Odds ratio (OR): measures the association between a certain property p1 and a second
property p2 in a population A, telling how the presence or absence of p2 affects the presence
or absence of p1. For example, what is the proportion of patients having diabetes mellitus
(p1) or not (p2), in the development of hypertension (A)?
OR = Odds(p1 in A) / Odds(p2 in A)
Odds(p in A) = Pr(p | A) / Pr(not p | A)
$$OR = \frac{a/b}{c/d}$$
159
RR and OR Examples
• What is the risk of developing secondary effects when a patient is
treated with drug D1 in comparison to the treatment with drug D2,
if half of the D1 patients and one third of the D2 patients observed
secondary effects after their treatments?
Groups = {D1, D2}; Factors = {2ary effect Y/N}
Pr(2ary effect Y | D1) = 0.5; Pr(2ary effect Y | D2) = 0.3333
RR = 0.5 / 0.3333 = 1.5
(i.e., the risk of having secondary effects with D1 is 1.5 times the risk with D2)
• When we test a new drug on breast cancer patients we observe
that 20% of the control group (i.e., patients receiving the regular
treatment) get worse, and 10% of the experimental group (i.e.,
patients receiving the new drug) get worse. How much risk is
reduced with the use of this new drug?
Groups = {ctrl, treat}; Factors = {worse, better}
Pr(worse | ctrl) = 20%; Pr(worse | treat) = 10%
RRR = (0.2-0.1)/0.2 = 0.5 = 50%
(i.e., the risk of getting worse with the new drug is reduced by one half)
• Skin melanoma and the use of protection cream among people going to the beach:
Protection N: 15% 85%
Odds (Prot-Y in melanoma) = 5/95
Odds (Prot-N in melanoma) = 15/85
OR = (5/95) / (15/85) ≈ 0.30
(i.e., among people going to the beach, the odds of developing skin melanoma when using a protection cream
are about 30% of the odds when not using any)
160
RR and OR in Python
2x2 Contingency Table:
            Effect   No Effect
Placebo     a        b
Treatment   c        d
import statsmodels.api as sm
1.5 3.0
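The slide's own code is not recoverable from this text, so the following is only a plain-Python sketch of the RR, RRR and OR formulas for a 2x2 table [[a, b], [c, d]]. The function names and the counts a, b, c, d are hypothetical; the RRR call uses the percentages of the breast cancer example (per 100 patients in each group).

```python
def risk(events, non_events):
    """Risk of the event in one group: events / group size."""
    return events / (events + non_events)

def odds(events, non_events):
    """Odds of the event in one group: events / non-events."""
    return events / non_events

def risk_ratio(a, b, c, d):
    """RR = (a/(a+b)) / (c/(c+d)) for the 2x2 table [[a, b], [c, d]]."""
    return risk(a, b) / risk(c, d)

def relative_risk_reduction(a, b, c, d):
    """RRR = (risk of first row - risk of second row) / risk of first row."""
    return (risk(a, b) - risk(c, d)) / risk(a, b)

def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d) for the 2x2 table [[a, b], [c, d]]."""
    return odds(a, b) / odds(c, d)

# Hypothetical counts, chosen only for illustration
a, b, c, d = 15, 5, 10, 10
rr = risk_ratio(a, b, c, d)          # (15/20) / (10/20)
or_value = odds_ratio(a, b, c, d)    # (15/5) / (10/10)

# Breast cancer example: 20% of controls and 10% of treated patients get worse
rrr = relative_risk_reduction(20, 80, 10, 90)

print(rr, or_value, rrr)
```

Libraries such as statsmodels also provide these measures together with confidence intervals; the plain functions are used here to keep the formulas visible.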
[Figure: timelines of four patient types, measured from the starting event; some observations are censored]
Patient Type 1: dies before study ends
Patient Type 2: survives the study period
Patient Type 3: withdraws from study
Patient Type 4: enters in the middle of study
• Use Kaplan-Meier to:
– Measure the fraction of patients living for a certain amount of time after treatment.
– Calculate the probability of one patient to survive a certain amount of time after treatment
– Determine the probability of one patient to be discharged after an amount of time after diagnosis
– Etc.
• Elements of Survival Analysis:
1. Kaplan-Meier Survival table: What is the probability of surviving after a certain time?
2. Kaplan-Meier Survival curve: What’s the survival evolution of one group? And the mean survival time?
3. Log-Rank test: Can we compare the survival of two (or more) groups (treatments)? What’s the p-value?
4. Hazards Ratio: Can we quantify the survival difference between two (or more) groups? Can we provide
a confidence interval?
5. Cox’s Proportional Hazards regression: to analyze the contribution of different variables to
the prediction of survival.
2. Kaplan-Meier Survival Table
$$S(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}$$
S(t) = probability of a patient surviving time t
S(t) = proportion of patients surviving after time t
i | time ti | Remained ni = ri-1 | Withdrawn wi | Deaths di | At risk ri = (ni - wi - di) | Pr of death di/ri | Pr surviving pi = 1 - di/ri | Cumulative Pr surviving S(ti) = pi * S(ti-1)
0 | 0  | 12 | - | - | 12 | 0.0000 | 1.0000 | 1.0000
1 | 4  | 12 | - | 1 | 11 | 0.0909 | 0.9091 | 0.9091
2 | 5  | 11 | - | 1 | 10 | 0.1000 | 0.9000 | 0.8182
3 | 6  | 10 | 1 | - | 9  | 0.0000 | 1.0000 | 0.8182
4 | 7  | 9  | - | 1 | 8  | 0.1250 | 0.8750 | 0.7159
5 | 8  | 8  | - | 2 | 6  | 0.3333 | 0.6667 | 0.4773
6 | 9  | 6  | 1 | 1 | 4  | 0.2500 | 0.7500 | 0.3580
7 | 11 | 4  | - | 1 | 3  | 0.3333 | 0.6667 | 0.2386
8 | 12 | 3  | - | - | 3  | 0.0000 | 1.0000 | 0.2386
[Figure: Kaplan-Meier survival curve; x-axis: time (0 to 14), y-axis: survival probability (0.0 to 1.0)]
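As an illustration of the product formula S(t) = Π (n_i − d_i)/n_i, the sketch below recomputes the survival curve from the times, withdrawals and deaths of the table. The tuple layout and variable names are our own. Note that the table divides the deaths by r_i (patients remaining after the events) rather than by n_i, so its last column (0.9091, …, 0.2386) is somewhat lower than the values obtained from the formula (0.9167, …, 0.3472).

```python
# (time, withdrawn, deaths) for each row of the table above; 12 patients at the start.
events = [(4, 0, 1), (5, 0, 1), (6, 1, 0), (7, 0, 1),
          (8, 0, 2), (9, 1, 1), (11, 0, 1), (12, 0, 0)]

at_risk = 12
survival = 1.0
curve = []                               # (time, n_i, S(t_i))
for time, withdrawn, deaths in events:
    # Product-limit step: S(t_i) = S(t_{i-1}) * (n_i - d_i) / n_i
    survival *= (at_risk - deaths) / at_risk
    curve.append((time, at_risk, survival))
    # Patients who died or withdrew leave the risk set before the next time point
    at_risk -= withdrawn + deaths

for time, n_i, s in curve:
    print(time, n_i, round(s, 4))
```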
Survival Analysis with Python
• Not this year
• If interested: check https://medium.com/towards-artificial-intelligence/survival-analysis-with-python-tutorial-how-what-when-and-why-19a5cfb3c312
164
Case Study 2: Data Analysis with Python
1. Download the .csv file in a local folder
2. Take a look at the particularities of this data set
3. Make a table with all the attributes
4. In python, load the file in a pandas DataFrame structure indicating the names of the columns
5. Declare categorical attributes according to the previous table
6. Calculate the mean cholesterol level of all the patients, but also of male and female separately
7. Can the mean of men be considered equal to the global mean?
8. Can we consider the chol mean value in women to be 22.15 mg/dl (=261.75-239.6) higher than in men?
9. Oldpeak measures the ST depression induced by exercise relative to rest. Let us assume that the larger half of the
depressions corresponds to patients before doing exercise, and the smaller half to the same patients after
doing exercise. Could we conclude that doing exercise increases their maximum heart rate (thalach)?
a) We have to calculate the mean value of oldpeak
b) We use this mean oldpeak value to split thalach into two groups of values: one before doing exercise and another one after doing
exercise
c) We adjust the two sets to have the same number of values
d) We apply the paired Student’s t-test to reach a conclusion
10. We could be interested in analyzing whether the type of chest pain (cp) is indicative of the sort of heart disease
(num), or not.
11. Could we consider that the maximum heart rate (thalach) depends on the patient’s age, when grouped into young (<45),
middle (45-65), and old (>65)?
12. Do these age groups and the patient’s gender affect the average serum cholesterol (chol)?
13. In a previous analysis we found that the maximum heart rate (thalach) differs between the patients’ age groups, but
is it different between the young and the middle-aged? And between the middle-aged and the elderly?
14. In our initial analyses we tested cholesterol mean differences, but never checked if chol values follow a normal
distribution
15. We should recalculate if men and women have different chol mean values (recall that women had +22.15 mg/dl
than men, under normality assumption)
16. Is cholesterol lack of normality affecting the conclusions that age and gender affect the chol mean value?
165
Conclusions
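Task | Normal (parametric) | Same variances (homoscedasticity) | Samples | Quantitative | Test
Compare means between groups | Y | Y | 1, 2 | Y | Student’s
Compare means between groups | Y | N | 2 | Y | Welch’s
Compare means between groups | - | - | 1, 2 | N | Chi-square
Compare means between groups | Y | Y | 2+ | Y | ANOVA, Tukey’s (2 samples)
Compare means between groups | N | - | 2 | Y | Wilcoxon-Mann-Whitney
Compare means between groups | N | - | 2+ | Y | Kruskal-Wallis
Check normality | ? | - | - | Y | Shapiro-Wilk, Kolmogorov-Smirnov
Association of one property between complementary groups | - | - | - | - | Odds Ratio
Relative relation of one property between two groups | - | - | - | - | Risk Ratio
Survival analysis | - | - | - | - | Kaplan-Meier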
166
Artificial Intelligence Analysis of Health Care Data
• Sample Tool:
– Rapid Miner
• Data Preprocessing
• Case Study 3: Data Preparation with Rapid Miner
• AI Data Analysis: Unsupervised Machine Learning
• AI Data Analysis: Supervised Machine Learning
• Quality Assessment
• Case Study 4: Data Modelling with Rapid Miner
167
Rapid Miner
• Rationale:
– Accessibility: complete free version for students.
– Capability: RM can analyze large sets of data.
– Simplicity: all the AI-based Data Analysis involved in the
course is easily performed with RM.
168
Data Structure (Cross Sectional Study)
• Data Analysis deals with data organized as a matrix
(set) where
– Each column represents a (clinical) variable
– Each row describes an instance (e.g., patient) in terms of
the variables in the columns
column
169
Example Data Set
• Heart Disease Data Set
• https://archive.ics.uci.edu/ml/datasets/Heart+Disease
• N = 303
• Features: 14
# | Name | Explanation | Units | Type | Missing
1 | Age | Age of the patient | Years | Numeric |
2 | Sex | Gender of the patient | 1=male, 0=female | Binary |
3 | CPT | Chest Pain Type | 1=typical angina, 2=atypical angina, 3=non-anginal pain, 4=asymptomatic | Categorical |
4 | BP | Resting Blood Pressure on admission to hospital | mmHg | Numeric |
5 | Chol | Serum Cholesterol | mg/dl | Numeric |
6 | FBS | Fasting Blood Sugar > 120 mg/dl | 1=true, 0=false | Binary |
7 | ECG | Resting Electrocardiographic Results | 0=normal, 1=having ST-T wave abnormality, 2=showing probable or definite LV hypertrophy | Categorical |
8 | maxHR | Maximum Heart Rate Achieved | | Numeric |
9 | ExAng | Exercise Induced Angina | 1=yes, 0=no | Binary |
10 | OldPeak | ST depression induced by exercise relative to rest | | Numeric |
11 | Slope | Slope of the peak exercise ST segment | 1=upsloping, 2=flat, 3=downsloping | Categorical |
12 | Ca | Number of major vessels colored by fluoroscopy | 0, 1, 2, 3 | Numeric | 4
13 | Thal | | 3=normal, 6=fixed defect, 7=reversible defect | Categorical | 2
14 | Diag | Diagnosis of heart disease | 0=<50% diameter narrowing, 1=>50% diameter narrowing | Categorical |
170
Data (Descriptive) Statistics
• Binary data: proportion
• Quantitative data: average, min, max
• Qualitative data: histogram
171
Data Visualizations
values statistics
box-plots
plots
pie charts
Data preprocessing is used to transform the raw data into a useful and efficient format
for data analysis.
• Data Cleaning: raw data can have irrelevant or missing parts, among other issues.
Data cleaning deals with these issues.
• Data Transformation: data can come in suboptimal or wrong forms. Data
transformation converts data into forms that are suitable for mining.
• Data Reduction: on the one hand data mining results improve as more and more
data is available, on the other hand the mining processes can be time-consuming.
Data reduction deals with the application of trade-off solutions to minimize data
size without compromising quality too much.
• Data Balancing: in the data, one subset can dominate over the other subsets, which
biases predictions towards the majority class. Data balancing covers a set of
methods that enlarge or reduce the data set in order to obtain more
balanced data.
173
Source: https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
Data Preprocessing Summary Chart
Data Pre-Processing
• Data Cleaning
– Missing Data: Remove instance; Remove feature; Encoding; Imputation (replace); Imputation (predict); NA, NaN
– Noisy Data: Binning; Clustering
• Data Transformation
– Normalization: Statistical Norm.; Specific Range Norm.; Proportional Norm.; Interquartile Range Norm.
– Attr. Selection: Remove useless attrs.; Transform attrs.; Combine attrs.
– Discretization: By size; By binning; By frequency; By user specification; By entropy
– (Concept Hierarchy) Generalization
• Data Reduction: Aggregation; Attr. Subset Selection; Cardinality Reduction; Dimensionality Reduction
• Data Balancing: Collect more data; Penalized models; New models & Algorithms; Resample; Change Perform. metric
Data Cleaning: Missing Data
Missing data: some values are not available in the data matrix.
• Six alternative solutions are proposed:
– Remove instance: get rid of the instance that contains the missing value.
– Remove feature: get rid of the column that contains many missing values.
– Encoding: indicate missing values with a special symbol (e.g., 99999).
– Imputation (Replace): use the mean/median/mode or most frequent
categorical value.
– Imputation (Predict): apply a regression function to calculate the value.
– NA: leave it as not available value and only use data analysis algorithms that
allow missing values.
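Some of the alternatives above can be sketched with NumPy. The small data matrix is hypothetical (rows are patients; columns are age, cholesterol and maximum heart rate), and `np.nan` is used to mark the missing values; the prediction-based imputation is left out because it requires training a model.

```python
import numpy as np

# Hypothetical data matrix; np.nan marks a missing value.
X = np.array([[63.0, 233.0, 150.0],
              [67.0, np.nan, 108.0],
              [41.0, 204.0, np.nan],
              [56.0, 236.0, 178.0]])

missing = np.isnan(X)

# Remove instance: keep only the rows without missing values
complete_rows = X[~missing.any(axis=1)]

# Remove feature: keep only the columns without missing values
complete_cols = X[:, ~missing.any(axis=0)]

# Encoding: flag missing values with a special symbol
encoded = np.where(missing, 99999, X)

# Imputation (replace): use the mean of the observed values of each column
col_means = np.nanmean(X, axis=0)
imputed = np.where(missing, col_means, X)

print(complete_rows.shape, complete_cols.shape)
print(imputed)
```

Removing instances or features loses information (here, half of the rows or two of the three columns), which is why imputation is often preferred when few values are missing.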
175
Missing Data with Rapid Miner
Solution | RM Module Explanation
Remove Instance | Filter Examples: the operator returns those examples that match the given condition. The conditions are defined by the user. Several pre-defined conditions also exist as advanced options. One of the available conditions is no_missing_attributes, which matches only the examples that have no missing values.
Remove Feature | Select Attributes: the operator provides different filter types to make attribute selection easy. Possibilities are, for example, direct selection of attributes, selection by a regular expression, or selecting only attributes without missing values.
Encode | Replace Missing Values: missing values can be replaced by the minimum, maximum or average value of that attribute. Zero can also be used to replace missing values. Any replenishment value can also be specified as a replacement of missing values.
Imputation (Replace) | Replace Missing Values with the default field set to average, or set to value with the pre-calculated mean, median, or mode as replenishment value.
Imputation (Predict) | Impute Missing Values: this is a nested operator, i.e., it has a subprocess. This subprocess should always accept an ExampleSet and return a model. The Impute Missing Values operator estimates values for missing values by learning models for each attribute (except the label) and applying those models to the ExampleSet. The learner for estimating missing values should be placed in the subprocess of this operator.
NA | Declare Missing Value: the Declare Missing Value operator replaces the specified values of the selected attributes by Double.NaN, thus these values will become missing values. These values will be treated as missing values by the subsequent operators. The desired values (to be converted) can be selected through nominal, numeric or regular expression mode.
176
Data Cleaning: Noisy Data
Noisy data: some data in the matrix is detected to be wrong (meaningless).
It can appear due to faulty data collection, data entry errors, etc.
Noise can affect the class label (label noise), which occurs when an
example is incorrectly labelled, causing either contradictory examples or
instance misclassification; or the regular attributes (attribute noise), which occurs
when a regular attribute has an erroneous value. Three actions are possible:
do nothing and use robust ML methods (avoid overfitting), detect & remove
noise, and detect & correct noise.
• The following approaches to noise cleaning are proposed:
– Data Binning: The original data values are used to generate small intervals (bins), then all
the values in the bin are replaced by the mean value in the bin (detect & correct).
– Clustering: Groups of similar data are made. The outliers fall out of the classes, and they
can be removed (detect & remove).
177
Binning Methods for Data Smoothing
• Smoothing data by equal frequency bins
Example: 15, 21, 27, 8, 16, 9, 21, 24, 30, 26, 30, 34
1. Sort the data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
2. Make N equal frequency bins:
Bin 1: [8, 9, 15, 16]; Bin 2: [21, 21, 24, 26]; Bin 3: [27, 30, 30, 34]
3. Calculate mean within each bin:
Bin 1: 12; Bin 2: 23; Bin 3: 30.25 (rounded to 30)
4. Replace the values in the bins by the bin mean:
Sorted: 12, 12, 12, 12, 23, 23, 23, 23, 30, 30, 30, 30
Original sequence becomes: 12, 23, 30, 12, 12, 12, 23, 23, 30, 23, 30, 30
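The four steps above can be sketched with NumPy as follows. The variable names are illustrative; the data and the number of bins are those of the example, and the mean of the third bin is kept as 30.25 instead of being rounded to 30.

```python
import numpy as np

values = np.array([15, 21, 27, 8, 16, 9, 21, 24, 30, 26, 30, 34], dtype=float)
n_bins = 3

# 1. Sort the data (keeping the sorting order to restore the original sequence)
order = np.argsort(values, kind="stable")

# 2. Make N equal-frequency bins over the sorted positions
bins = np.array_split(order, n_bins)

# 3-4. Replace every value by the mean of its bin, in the original positions
smoothed = np.empty_like(values)
for positions in bins:
    smoothed[positions] = values[positions].mean()

print(smoothed)
```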
178
Clustering: Local Outlier Factor (LOF)
Example:
1. 13 points in a 2-dimensional space
2. Find out the 3-nearest neighbors of
A:
{n1, n2, n3}
3. Calculate 3-density of A
$$d_A = \frac{3}{\max_{i=1,\dots,3} dist(A, n_i)} = 0.23$$
4. Calculate the 3-density of the
neighbors n1, n2, n3
dn1 = 2.1; dn2 = 2.7; dn3 = 2.4
5. Calculate the average of dn1, dn2,
and dn3:
Avg = (2.1+2.7+2.4)/3 = 2.4
6. Calculate LOF = Avg /dA:
LOF = 2.4 / 0.23 = 10.43
7. If LOF is clearly greater than 1 (as here), A is an outlier: its density is much lower than the density of its neighbors.
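The steps of the example can be sketched with NumPy. This follows the simplified definition used on the slide (k-density = k divided by the distance to the farthest of the k nearest neighbors), not the full LOF algorithm with reachability distances. The 13 points are hypothetical, since the coordinates of the slide's figure are not available, and the function names are our own.

```python
import numpy as np

def k_density(points, k):
    """k-density of every point: k / (distance to the farthest of its k nearest neighbors)."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # Column 0 of the sorted neighbors is the point itself (distance 0)
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]
    max_dist = np.take_along_axis(dist, neighbors, axis=1).max(axis=1)
    return k / max_dist, neighbors

def simplified_lof(points, k=3):
    """Average k-density of the neighbors divided by the point's own k-density."""
    density, neighbors = k_density(points, k)
    return density[neighbors].mean(axis=1) / density

# Hypothetical 2-D data: a 4 x 3 grid of 12 points plus an isolated point A
grid = np.array([[x, y] for x in range(4) for y in range(3)], dtype=float)
A = np.array([[10.0, 10.0]])
points = np.vstack([grid, A])

lof = simplified_lof(points, k=3)
print(np.round(lof, 2))
```

Points inside the grid obtain values close to 1, while the isolated point A obtains a value far above 1 and is the one flagged as an outlier.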
179
Noisy Data with Rapid Miner
Clustering | Detect Outlier: Rapid Miner implements several algorithms to detect outliers (based
on distances, densities, LOF –local outlier factors, and COF –class outlier factors)
180
Data Transformation: Normalization
Normalization: Normalization is used to scale values so they fit in a specific range
(e.g., [-1.0, +1.0] or [0.0, 1.0]). Adjusting the value range is very important when
dealing with attributes of different units and scales. For example, when using the
Euclidean distance all attributes should have the same scale for a fair comparison.
Normalization is useful to compare attributes that vary in size.
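As a sketch of what the four kinds of normalization listed in the summary chart could look like, the block below applies common textbook definitions to a hypothetical attribute. These formulas are an assumption for illustration; a specific tool such as Rapid Miner may differ in details (for example, in how the interquartile variant is centred).

```python
import numpy as np

# Hypothetical attribute (e.g., resting blood pressure in mmHg)
x = np.array([110.0, 120.0, 130.0, 140.0, 180.0])

# Statistical normalization (z-score): mean 0, standard deviation 1
z_score = (x - x.mean()) / x.std(ddof=1)

# Specific range normalization (min-max) into [low, high]
low, high = 0.0, 1.0
ranged = low + (x - x.min()) * (high - low) / (x.max() - x.min())

# Proportional normalization: each value as a fraction of the attribute total
proportional = x / x.sum()

# Interquartile range normalization: centre on the median, scale by the IQR
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr_scaled = (x - median) / (q3 - q1)

print(z_score, ranged, proportional, iqr_scaled, sep="\n")
```

The interquartile variant is less sensitive to extreme values (such as 180 here) than the z-score or the min-max scaling.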
181
Normalization Examples (1)
182
Normalization Examples (2)
183
Normalization with Rapid Miner
184
Data Transformation: Attribute Selection
Attribute Selection: attribute selection is the process of choosing a
subset of representative attributes and getting rid of useless attributes.
Extending the idea, we could also consider transforming attributes, or
combining several attributes to form a new one that replaces them.
• Sorts of attribute selection:
– Remove useless attributes: Some attributes can become useless because of
the existence of other attributes and they can be removed. Ex., attributes with
a predominant value or with a low standard deviation (low variability), or
highly correlated attributes.
– Transform attributes: Some attributes may require a conversion of their
values. For example, change date of birth to years, change temperature in
Fahrenheit to Celsius, etc.
– Combine attributes: Some irrelevant attributes may be combined to form
other relevant attributes. Ex., patient’s height and weight can be combined to
describe the patient’s body mass index (BMI).
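The three sorts of attribute selection can be illustrated with a few lines of NumPy. All values, names and the zero threshold are hypothetical; the conversions are the standard Fahrenheit-to-Celsius formula and BMI = weight / height².

```python
import numpy as np

# Hypothetical patients
height_m = np.array([1.60, 1.75, 1.82])
weight_kg = np.array([64.0, 70.0, 95.0])
temp_f = np.array([98.6, 100.4, 102.2])
hospital = np.array([1.0, 1.0, 1.0])     # same value for every patient

# Remove useless attributes: standard deviation at or below a threshold
threshold = 0.0
is_useless = hospital.std() <= threshold

# Transform attributes: Fahrenheit to Celsius
temp_c = (temp_f - 32.0) * 5.0 / 9.0

# Combine attributes: body mass index from weight and height
bmi = weight_kg / height_m ** 2

print(is_useless, temp_c, np.round(bmi, 2))
```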
185
Attribute Selection with Rapid Miner
Solution | RM Module Explanation
Remove Useless Attributes | Remove Useless Attributes: removes four kinds of useless attributes: (1) Nominal attributes where the most frequent value is contained in more than the specified ratio of all examples. This is used for removing nominal attributes in which one value dominates all other values. (2) Nominal attributes where the most frequent value is contained in less than the specified ratio of all examples. This is used for removing nominal attributes with too many possible values. (3) Numerical attributes where the Standard Deviation is less than or equal to a given threshold. This is used for removing attributes with low variability of their values. (4) Nominal attributes where the value of all examples is unique. This property is used to remove id-like attributes.
Remove Useless Attributes | Remove Correlated Attributes: removes attributes which are correlated above/below a given threshold.
Transform Attributes | Map: maps specified values of selected attributes to new values according to a conversion table.
186
Data Transformation: Discretization
Discretization: Discretization is the process by which a continuous-valued
attribute is transformed into a discrete (categorical) attribute.
• Five alternative types of discretization are considered:
– By Size: converts the selected numerical attribute into a nominal attribute by discretizing
the numerical attribute into bins of user-specified size. All the values of the attribute
(including all the repetitions) are ordered and grouped into bins of size N (predefined
value). Each bin defines a new discretized value of the attribute. All bins contain the
“same” number of examples. Notice that equal values are kept in the same bin.
– By Binning: discretizes the selected numerical attributes into a user-specified number of
bins N representing intervals of equal size. The number of values in each bin may vary.
– By Frequency: discretizes the selected numerical attributes into a user-specified number
of bins N. Each bin contains the same number of values, so the range of the bins may vary.
– By User Specification: discretizes the selected numerical attributes to nominal attributes.
The numerical values are mapped to the classes according to the thresholds specified by
the user in the classes parameter. Classes are specified by their upper limit.
– By Entropy: converts the selected numerical attributes into nominal attributes. The
boundaries of the bins are chosen so that the entropy is minimized in the induced
partitions.
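Three of the discretization types above (by binning, by frequency, and by user specification) can be sketched in plain Python as follows. The function names, the sample values, and the class thresholds are illustrative assumptions; this is not how Rapid Miner implements its operators.

```python
import bisect


def discretize_by_binning(values, n_bins):
    """Equal-width bins: return a bin index (0 .. n_bins-1) per value.
    Assumes the values are not all equal (otherwise the width is zero)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value falls exactly on the last edge, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def discretize_by_frequency(values, n_bins):
    """Equal-frequency bins: rank the values and cut the ranking in
    n_bins parts. Simplification: ties may end up in different bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def discretize_by_user_specification(values, classes):
    """classes: list of (name, upper_limit) sorted by upper limit;
    a value belongs to the first class whose upper limit is >= value."""
    limits = [upper for _, upper in classes]
    names = [name for name, _ in classes]
    return [names[bisect.bisect_left(limits, v)] for v in values]


ages = [2, 5, 11, 16, 30, 45, 61, 90]
print(discretize_by_binning(ages, 4))    # bins of equal range
print(discretize_by_frequency(ages, 4))  # bins of equal size

# Hypothetical thresholds for body temperature in Celsius.
temperature_classes = [("low", 36.0), ("normal", 37.5), ("fever", float("inf"))]
print(discretize_by_user_specification([35.5, 36.8, 37.5, 39.0],
                                       temperature_classes))
```

With equal-width bins most of the ages fall in the first bin, while equal-frequency bins always hold two ages each; this is the trade-off between the two approaches.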
Discretization with Rapid Miner
• Solution: By Size
– RM module Discretize by Size: converts the selected numerical attributes
into nominal attributes by discretizing the numerical attribute
into bins of user-specified size. Thus each bin contains a
user-defined number of examples.
• Solution: By Binning
– RM module Discretize by Binning: discretizes the selected numerical
attributes into a user-specified number of bins. Bins of equal range
are automatically generated; the number of values in
different bins may vary.
• Solution: By Frequency
– RM module Discretize by Frequency: converts the selected numerical
attributes into nominal attributes by discretizing the numerical
attribute into a user-specified number of bins. Bins of equal
frequency are automatically generated; the range of different
bins may vary.
• Solution: By User Specification
– RM module Discretize by User Specification: discretizes the selected
numerical attributes into user-specified classes. The selected
numerical attributes will be changed to nominal attributes.
• Solution: By Entropy
– RM module Discretize by Entropy: converts the selected numerical attributes
into nominal attributes. The boundaries of the bins are chosen so
that the entropy is minimized in the induced partitions.
Data Transformation: Concept Hierarchy Generalization
Concept Hierarchy Generation: If we have a hierarchy for the
values an attribute can take, then this method allows replacing
the values of the attribute by other values which are in higher
positions of the hierarchy (generalization).
• Numerical values can be generalized to ranges. For
example, numerical age to infant (<4), child (4-12), young (13-18),
adult (19-60), and elder (>60).
• Categorical values, if there is a hierarchy of such concepts,
can be generalized to upper levels in the hierarchy. For
example, Chlorhexidine (ATC code R02AA05) can be
generalized to its group Antiseptic (ATC code R02AA).
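Both kinds of generalization can be sketched in Python. The function names are illustrative assumptions, and the sketch assumes that age 18 belongs to the "young" range and that the ATC group is obtained by keeping the first five characters of the seven-character code.

```python
def generalize_age(age):
    """Generalize a numerical age to a range label."""
    if age < 4:
        return "infant"
    if age <= 12:
        return "child"
    if age <= 18:
        return "young"
    if age <= 60:
        return "adult"
    return "elder"


def generalize_atc(code, length=5):
    """Generalize an ATC code to an upper level of the hierarchy by
    truncating it (5 characters give the chemical subgroup)."""
    return code[:length]


print([generalize_age(a) for a in (2, 10, 18, 19, 75)])
print(generalize_atc("R02AA05"))  # Chlorhexidine -> its group
```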
Concept Hierarchy Generalization with
Rapid Miner
Data Reduction
Data Reduction: Some data mining technologies are slow when handling a huge amount of
data. To cope with this, we can use a data reduction technique.
• Some common data reduction techniques are:
– Aggregation: process by which information is summarized. For example, if you have the
weights of all patients in a database, you could reduce the size of the data by
grouping all the patients with the same primary disease and using the mean weight of
the patients with the same diagnosis to represent the group’s weight.
– Attribute Subset Selection: some attributes are more relevant than others with respect
to the classification label. There are multiple alternatives to calculate the relevance
(weight) of attributes: information gain, information gain ratio, deviation, correlation,
Chi-square, Gini index, Tree (by random forest), Relief, Support Vector Machine (SVM),
Principal Component Analysis (PCA), etc.
– Cardinality reduction: The number of examples (rows) in the database can be reduced by
removing those that do not satisfy one or several conditions. For example, remove
examples with missing data, or outliers.
– Dimensionality reduction: Not covered in this course.
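As a small illustration of cardinality reduction followed by aggregation, the Python sketch below drops the examples with a missing weight and then replaces each diagnosis group by its mean weight. The patient rows and attribute names are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical patient rows; None marks a missing value.
patients = [
    {"diagnosis": "diabetes", "weight_kg": 82.0},
    {"diagnosis": "diabetes", "weight_kg": 90.0},
    {"diagnosis": "asthma", "weight_kg": 64.0},
    {"diagnosis": "asthma", "weight_kg": None},
    {"diagnosis": "asthma", "weight_kg": 70.0},
]

# Cardinality reduction: keep only the examples without missing data.
complete = [p for p in patients if p["weight_kg"] is not None]

# Aggregation: group by diagnosis and summarize each group by its mean weight.
weights_by_diagnosis = defaultdict(list)
for p in complete:
    weights_by_diagnosis[p["diagnosis"]].append(p["weight_kg"])

mean_weight = {d: mean(w) for d, w in weights_by_diagnosis.items()}
print(mean_weight)
```

Five rows are reduced to two summary rows, one per diagnosis.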
Data Reduction with Rapid Miner
• Solution: Aggregation
– RM module Aggregate: performs the aggregation functions known from SQL. This
operator provides a lot of functionalities in the same format as provided
by the SQL aggregation functions. SQL aggregation functions and
GROUP BY and HAVING clauses can be imitated using this operator.
Aggregation functions include SUM, COUNT, MIN, MAX, AVERAGE and
many others.
• Solution: Cardinality Reduction
– RM module Filter Examples: returns those examples that match the given condition.
The conditions are defined by the user, and each condition consists of
an attribute, a comparison function and a value to match. Several
pre-defined conditions also exist as advanced options.
• Solution: Dimensionality Reduction
– NA
Data Balancing
Data Balancing: Data imbalance takes place when the majority classes dominate
(i.e., have many more examples than) the minority classes in a labelled dataset. Most ML
algorithms are designed to improve accuracy by reducing the overall error, so
they do not take the balance of classes into account. In such cases, the predictive
model built with conventional machine learning algorithms can be biased
and inaccurate.
• Five alternative solutions are proposed:
– Collect more data: a larger dataset might turn an unbalanced dataset into a balanced one.
– Penalized models: use penalizing ML methods (e.g., a cost matrix) to increase the cost of
classification mistakes on the minority class.
– New models and algorithms: Imbalanced data can be handled with an appropriate model. For
example, boosting models such as XGBoost offer options to weight the classes during training.
– Resample:
• Over-sampling: When the quantity of data is insufficient, balance the data by increasing the number of rare samples.
• Under-sampling: Reduce the size of the class which dominates the other classes.
Important: apply resampling to the unbalanced class, not to the whole dataset.
– Change the performance metric: It is important to choose the evaluation metric of the model
correctly; otherwise one ends up optimizing a useless parameter. With imbalanced data,
consider a performance metric other than accuracy.
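Random over-sampling, one of the resampling options above, can be sketched as follows. The function name, the label key, and the toy dataset are assumptions for illustration; real projects often use more elaborate techniques.

```python
import random
from collections import Counter


def oversample_minority(rows, label_key, seed=0):
    """Randomly duplicate examples of the smaller classes until every
    class has as many examples as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Only the under-represented classes receive extra (repeated) rows.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical unbalanced dataset: 8 survivors and 2 non-survivors.
rows = [{"id": i, "survived": "yes"} for i in range(8)]
rows += [{"id": i, "survived": "no"} for i in range(8, 10)]

balanced = oversample_minority(rows, "survived")
counts = Counter(row["survived"] for row in balanced)
print(counts)
```

Only the minority class is resampled; the majority class is left untouched.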
Data Balancing with Rapid Miner
• Solution: Collect More Data
– RM module Append: This operator builds a merged example set from two or more
compatible example sets by adding all examples together. All the example sets
in the input must have the same attribute signature.
• Solution: Penalization with Cost Matrix
– RM module MetaCost: The MetaCost operator makes its base classifier cost-sensitive by
using the cost matrix specified in the cost matrix parameter. The MetaCost
operator is a nested operator; i.e., it has a subprocess. The subprocess must
have a learner; i.e., an operator that expects a data set and generates a model.
This operator tries to build a better model using the learner provided in its
subprocess.
• Solution: Change Performance Metric
– NA
Case Study 3: Data Preparation with Rapid Miner
1. Download the .csv file in a local folder
2. Import the data from the csv file and save it as a RM file
3. Check that data is correctly imported
4. Spend some time with the statistical description of the
different features
5. Visual analysis of data
6. Perform some data preprocessing actions
• Missing Values
• Noisy Data
• Normalization
• Attribute Selection
• Discretization
• Data Reduction
• Data Balance
7. The data set is ready for data analysis
AI Data Analysis: Machine Learning
• Machine Learning: subarea of artificial intelligence that
uses computer algorithms to build mathematical models
based on sample data, known as "training data", in order
to make predictions or decisions without being explicitly
programmed to do so.
• Types of ML:
– Unsupervised ML: the training data is not labelled; that is,
nobody has supervised the data to say which examples correspond to
which decision group. For example, data on breast cancer
patients is captured but no information is given on survival.
– Supervised ML: the training data is composed of examples of
decision groups, each example annotated with the class or
decision taken. Annotating data requires somebody to supervise
the data, providing them with a meaning. For example, data on
breast cancer patients including a column on whether the
patient survived or not.
AI Data Analysis: Unsupervised ML
• Some of the best-known unsupervised ML algorithms:
– k-means clustering
– Hierarchical Clustering
• Agglomerative
k-means
Concept:
1. Select k data points at random
2. Make one cluster with each selected point
as centroid
3. Assign the rest of the data to the
cluster with the closest centroid
4. Recalculate the centroid of each
cluster
5. Repeat steps 3-4 N times or until
the centroids stabilize
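The k-means steps can be sketched in plain Python as shown below. This is a minimal illustration under stated assumptions: points are tuples of numbers, the distance is Euclidean, and the function and parameter names are invented for the example.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: pick k points at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iterations):
        # Step 3: assign every point to the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 4: recalculate each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Two hypothetical, well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

The result of k-means depends on the random initial centroids, so in practice the algorithm is often run several times with different seeds.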
Concept (Agglomerative):
1. Calculate the pairwise distances between objects
2. Make singletons (clusters of one object)
3. Combine the closest clusters into superclusters by applying a
linkage operation
4. Repeat step 3 until a single supercluster remains
[Figure: dendrogram with a cut]
Linkage operations: the distance between 2 clusters is …
– Single linkage: … the distance between the closest objects.
– Complete linkage: … the distance between the farthest objects.
– Average linkage: … the average distance between all pairs.
– Centroid linkage: … the distance between centroids.
Attributes: qualitative and quantitative.
Missing values: accepted
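A naive Python sketch of agglomerative clustering with single, complete, and average linkage follows. It recomputes all distances at every step, which is fine for a handful of objects but not efficient; the function names and sample points are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def cluster_distance(a, b, linkage):
    """Distance between two clusters under the chosen linkage operation."""
    pair_distances = [dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pair_distances)
    if linkage == "complete":
        return max(pair_distances)
    if linkage == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError("unknown linkage: " + linkage)


def agglomerative(points, linkage="single"):
    """Return the list of merges, from the first pair to the final supercluster."""
    clusters = [[p] for p in points]  # singletons
    merges = []
    while len(clusters) > 1:
        # Find the two closest clusters under the linkage operation.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]],
                                            clusters[ij[1]], linkage),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional objects.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges = agglomerative(points, linkage="single")
for left, right in merges:
    print(left, "+", right)
```

The sequence of merges is what a dendrogram displays; cutting the dendrogram at a given height corresponds to stopping the merges early.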
AI Data Analysis: Supervised ML
• Some of the best-known supervised ML algorithms:
– k-Nearest Neighbor
– Logistic Regression
– Decision trees
– Naïve Bayes Classifier
– Classification Rules
– Neural Networks
– Support Vector Machines
– Discriminant Analysis
– Etc.
• Ensembles
K-Nearest Neighbor (k-NN)
K-NN: algorithm that stores all
available cases and classifies new
cases based on a similarity measure
(e.g., a distance function): each new case
is assigned to the most common class
among its k closest neighbors
(majority vote).
Concept:
1. Let S be the training set
2. Let O be a new object
3. Let N be the set of the k objects
in S closest to O
4. Let C be the most frequent class in N
5. Return “O classifies in C”
[Figure: examples of 3-NN and 5-NN classification]
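The five k-NN steps map almost directly onto Python. In this sketch the training set is a list of (attributes, class) pairs and the distance is Euclidean; the function name and the sample data are assumptions for illustration.

```python
from collections import Counter
from math import dist


def knn_classify(training_set, new_object, k):
    """training_set: list of (attributes, class) pairs (S);
    new_object: tuple of attributes (O)."""
    # N: the k objects in S closest to O.
    neighbours = sorted(training_set,
                        key=lambda example: dist(example[0], new_object))[:k]
    # C: the most frequent class in N.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training set with two numerical attributes.
training_set = [
    ((1.0, 1.0), "benign"),
    ((1.2, 0.8), "benign"),
    ((0.9, 1.1), "benign"),
    ((3.0, 3.0), "malignant"),
    ((3.2, 2.9), "malignant"),
]

print(knn_classify(training_set, (1.1, 1.0), k=3))
print(knn_classify(training_set, (3.1, 3.0), k=3))
```

Because distances are sensitive to the scale of the attributes, k-NN is usually applied after normalization.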
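[Table: one column per variable-value pair x1 … xn and one row per class C1 … Cm; the cell for class Cj and pair xi holds the conditional probability p(xi | Cj), and a final column holds the class probability p(Cj).]
Attributes: qualitative and quantitative.
Class: qualitative.
Missing values: accepted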
Classification Rules
Classification Rule: if-then expression in
which the premise is a condition on a
subset of features and the conclusion is
one of the classes in the training set.
Artificial Neuron
[Figure: artificial neuron with its activation function]
Attributes: quantitative.
Class: qualitative.
Missing values: accepted
Quality of Data Analysis
• Recommender systems can make two types of error:
– Type I Error: The system recommends something that should not be recommended. For
example, a dangerous drug is recommended.
– Type II Error: The system fails to recommend something that should have been
recommended. For example, a needed drug is not recommended.
• Confusion matrix
False Positives: objects in class N that are classified in class P. Equivalent to Type I error.
False Negatives: objects in class P that are classified in class N. Equivalent to Type II error.
Quality Ratios
• Accuracy: proportion of cases correctly classified. A system has good accuracy if it detects
positive cases and rejects negative cases. For example, if required drugs are recommended
and unnecessary drugs are not recommended.
• Sensitivity (recall): proportion of positive cases correctly classified. A system has good
sensitivity if it detects many positive cases. For example, if required drugs are recommended.
• Specificity: proportion of negative cases correctly classified. A system has good specificity if it
rejects negative cases. For example, if unnecessary drugs are not recommended.
• Positive Predictive Value (precision): proportion of cases classified as positive that are actually
positive. A system has good precision if it does not recommend unnecessary things. For example,
if all the recommended drugs are necessary.
• Negative Predictive Value: proportion of cases classified as negative that are actually negative.
A system has a good negative predictive value if it does not fail to recommend necessary things.
For example, if all the drugs that are not recommended are unnecessary.
• Measures are usually given in pairs:
– Sensitivity + Specificity
– Precision + Recall
• In order to have one single quality measure:
– F-score: two times the product of precision and recall divided by their sum: F = 2 · (precision · recall) / (precision + recall).
– AUC (area under the ROC curve): The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR)
as the threshold for a binary classification varies. The AUC value is the area under the ROC curve.
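The quality ratios can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the usual abbreviations TP, FP, TN and FN (true/false positives/negatives); the function name and the example counts are invented for illustration.

```python
def quality_ratios(tp, fp, tn, fn):
    """Quality ratios from the counts of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)     # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f_score": f_score,
    }


# Hypothetical counts: 40 TP, 10 FP, 45 TN, 5 FN.
ratios = quality_ratios(tp=40, fp=10, tn=45, fn=5)
for name, value in ratios.items():
    print(name, round(value, 3))
```

The sketch assumes that none of the denominators is zero, i.e., that both classes occur and both are predicted at least once.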
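Testing: n-fold cross-validation
[Figure: the data is split into n parts (1, 2, 3, …, n). In each of n rounds, one part is used as test data and the remaining parts as training data, which gives the quality ratios q1, q2, q3, …, qn; the final quality ratio is their average.]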
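n-fold cross-validation can be sketched in Python as follows. The helper names are assumptions, and the "model" used here is deliberately trivial (it always predicts the majority class of the training data) so that the example stays self-contained.

```python
from collections import Counter


def n_fold_cross_validation(examples, n, train, evaluate):
    """Split the examples in n folds; each fold is used once as test data
    while the other folds are used for training. Returns the average
    quality ratio and the list q1 .. qn."""
    folds = [examples[i::n] for i in range(n)]
    qualities = []
    for i in range(n):
        test = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i
                    for ex in fold]
        model = train(training)
        qualities.append(evaluate(model, test))
    return sum(qualities) / n, qualities


def train_majority(training):
    """Trivial learner: remember the most frequent class."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def accuracy(model, test):
    """Proportion of test examples whose class equals the predicted one."""
    return sum(1 for _, label in test if label == model) / len(test)


# Hypothetical dataset: 20 examples, 16 labelled "yes" and 4 labelled "no".
examples = [(i, "yes" if i % 5 else "no") for i in range(20)]
average, qualities = n_fold_cross_validation(examples, 4,
                                             train_majority, accuracy)
print(qualities, average)
```

Any learner and any quality ratio can be plugged in through the train and evaluate arguments; in practice the examples are usually shuffled before being split into folds.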
Quality Ratios and Test with Rapid Miner
• Performance (Binomial Classification): accuracy,
sensitivity, specificity, positive predictive value,
negative predictive value, AUC, etc.
• Performance (Classification): accuracy, normalized
absolute error, root mean square error, etc.
• Cross Validation:
A Data Science Project (recall)
(Previously introduced)
A Data Science project is defined as a cyclic process
which combines the following steps:
1. Goals and Objectives Setting
2. Data Extraction
3. Data Cleaning
4. Feature Engineering
5. Model Creation and Assessment
6. Impact Analysis
Case Study 4: Data Modelling with Rapid Miner
1. Download the .csv file in a local folder
2. Insert the attribute names
3. Import hepatitis.txt with Rapid Miner and save it as an RM
file.
4. Get rid of missing values
5. Obtain clusters of similar cases
6. Modeling for prediction 1: k-NN
7. Modeling for prediction 2: Logistic Regression
8. Modeling for prediction 3: Decision Trees
9. Modeling for prediction 4: Naïve Bayes Classifier
10. Modeling for prediction 5: Neural Network
11. Modeling for prediction 6: Support Vector Machine
12. Observe that Support Vector Machine is the best approach
13. Calculate sensitivity and specificity for all the approaches
4. KR and KM in Medicine
• What is knowledge?
• Types of knowledge
• Representation of knowledge with logic
• Representation of knowledge with production rules
• Representation of knowledge with objects
• Representation of knowledge with ontologies
• Knowledge management: Knowledge Lifecycle
Knowledge in Knowledge Representation
• Concepts
– Data: values without any meaning. They can be operated.
– Information: values with a meaning. They can be interpreted and combined.
– Knowledge: actionable generalization.
• Types of Knowledge
– Declarative (know what) ex. blood contains red cells
– Procedural (know how) ex. hand surgery consists of (1) anesthesia, (2) the
incision, (3) closing the incisions, and (4) seeing the results.
(Source: https://www.plasticsurgery.org/reconstructive-procedures/hand-surgery/procedure)
“Coronavirus disease 2019 (COVID-19) is an infectious disease(1) caused by severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2)(2). […] Common symptoms include fever, cough, fatigue, shortness of breath, and loss
of smell and taste(3). While most people have mild symptoms(4), some people develop acute respiratory distress
syndrome (ARDS) possibly precipitated by cytokine storm, multi-organ failure, septic shock, and blood clots(5).
[…] As is common with infections, there is a delay between the moment a person first becomes infected and the
appearance of the first symptoms(1’). This delay is called the incubation period(6). The median incubation period for
COVID-19 is four to five days. Most symptomatic people experience symptoms within two to seven days after
exposure, and almost all symptomatic people will experience one or more symptoms before day twelve.”
(Source: Wikipedia)
Knowledge Representation: Logic
• First Order Logic (FOL)
Declarative Knowledge:
• Jane is pregnant: Pregnant(Jane)
• Male patients cannot be pregnant: ∀x: Male(x) ∧ Patient(x) → ¬Pregnant(x)
• Pregnancy complications can only occur to pregnant patients: ∀x, y: ICD9_Chapter11(x) ∧ has(y, x) →
Pregnant(y)
• Severe headaches, nosebleeds, and fatigue are signs of hypertension: ∀x: diagnosed(x, hypertension) →
has(x, severe_headache) ∧ has(x, nosebleed) ∧ has(x, fatigue)
Procedural knowledge:
• After anesthesia, the surgeon takes over the operation: ∀x ∀t ∃t’ ∃y: anesthetized(x, t) ∧ surgeon(y) → after(t’,
t) ∧ takesOver(y, x, t’)
Knowledge Representation: Production Rules
[Figure: working memory holding knowledge tuples]
Declarative Knowledge:
• Positive cases of COVID19 must follow quarantine: IF (COVID19_positive :x) THEN (ASSERT (quarantine :x))
• After overcoming COVID19, the patient has antibodies: IF (has :x COVID19) (overcomes :x COVID19) THEN
(RETRACT 1) (ASSERT (has_antibodies :x))
Procedural knowledge:
• ER admission must follow triage: IF (admitted :x ER) THEN (ASSERT (on_triage :x))
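How such production rules act on a working memory can be sketched with a tiny forward-chaining interpreter in Python. Everything here is an illustrative assumption: the tuple encoding of facts, the rule format, the patient names, and the reading of (RETRACT 1) as "remove the fact matched by the first condition".

```python
def match(pattern, fact, bindings):
    """Match a pattern such as ("has", ":x", "COVID19") against a fact.
    Terms starting with ':' are variables. Returns extended bindings or None."""
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith(":"):
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def match_all(conditions, memory, bindings=None):
    """Yield (bindings, matched_facts) for each way to satisfy all conditions."""
    bindings = bindings or {}
    if not conditions:
        yield bindings, []
        return
    for fact in sorted(memory):
        b = match(conditions[0], fact, bindings)
        if b is not None:
            for rest_b, rest_facts in match_all(conditions[1:], memory, b):
                yield rest_b, [fact] + rest_facts


def run(rules, memory):
    """Fire rules until no new rule instance can fire."""
    memory, fired, changed = set(memory), set(), True
    while changed:
        changed = False
        for name, conditions, actions in rules:
            for bindings, facts in list(match_all(conditions, memory)):
                key = (name, tuple(facts))
                if key in fired or not all(f in memory for f in facts):
                    continue
                fired.add(key)
                for action, arg in actions:
                    if action == "ASSERT":
                        memory.add(tuple(bindings.get(t, t) for t in arg))
                    elif action == "RETRACT":  # arg = condition number
                        memory.discard(facts[arg - 1])
                changed = True
    return memory


rules = [
    ("quarantine", [("COVID19_positive", ":x")],
     [("ASSERT", ("quarantine", ":x"))]),
    ("antibodies", [("has", ":x", "COVID19"), ("overcomes", ":x", "COVID19")],
     [("RETRACT", 1), ("ASSERT", ("has_antibodies", ":x"))]),
    ("triage", [("admitted", ":x", "ER")],
     [("ASSERT", ("on_triage", ":x"))]),
]
working_memory = {("COVID19_positive", "ann"), ("has", "bob", "COVID19"),
                  ("overcomes", "bob", "COVID19"), ("admitted", "carl", "ER")}
result = run(rules, working_memory)
print(sorted(result))
```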
Knowledge Representation: Objects
Declarative Knowledge:
• The current age of a patient is the difference between the current date and the patient’s birth date: IF-NEEDED
• If a patient is declared to have COVID19, he/she has fever: IF-ADDED
• In general, patients arriving at a hospital do not have COVID19: DEFAULT
Procedural knowledge:
• The COVID-protocol starts with a triage of the patient to decide if he/she needs homecare, general
hospitalization or ICU hospitalization: Script
Object Sign
Slot Name: String
Object Fever
Slot Subclass: Sign
Object Patient
Slot Name: String
Slot Birth: Date
Slot Age: Integer
:IF-NEEDED to_integer(− Current_Date Self:Birth)
Slot Has_COVID: Boolean
:DEFAULT False
:IF-ADDED insert( (create Fever) :Self:Signs) &
insert( (create COVID-Treat :Self) :Self:Treatments)
Slot Signs: list of Sign
Slot Treatments: list of Treatment
Script COVID-Treat
Slot Subject: Patient
Slot Starts: Triage
Slot Steps:
(Triage … (Homecare) …)
(Homecare …)
(General_hosp …)
(ICU_hosp …)
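The IF-NEEDED, DEFAULT and IF-ADDED facets of the Patient object have rough counterparts in an ordinary Python class. The sketch below is only an analogy under stated assumptions: signs and treatments are represented by plain strings, and the current date is passed explicitly so that the age computation is reproducible.

```python
from datetime import date


class Patient:
    def __init__(self, name, birth):
        self.name = name
        self.birth = birth
        self.signs = []
        self.treatments = []
        self._has_covid = False  # :DEFAULT False

    def age(self, current_date):
        """IF-NEEDED: the age is not stored, it is computed on demand."""
        years = current_date.year - self.birth.year
        if (current_date.month, current_date.day) < (self.birth.month,
                                                     self.birth.day):
            years -= 1
        return years

    @property
    def has_covid(self):
        return self._has_covid

    @has_covid.setter
    def has_covid(self, value):
        """IF-ADDED: setting the slot triggers the attached actions."""
        self._has_covid = value
        if value:
            self.signs.append("Fever")
            self.treatments.append("COVID-Treat")


patient = Patient("Jane", date(1990, 5, 20))
print(patient.has_covid)                 # default value
print(patient.age(date(2024, 5, 19)))    # computed when needed
patient.has_covid = True                 # triggers the IF-ADDED actions
print(patient.signs, patient.treatments)
```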
Knowledge Representation: Ontologies
[Figure: elements of an ontology: classes, subclasses (IS-A relationship), instances, and properties (PROPERTY 1, PROPERTY 2, …, PROPERTY n), each property with a domain and a range.]
Declarative Knowledge:
• COVID19 is an infectious disease: class (Infectious-Disease), class (COVID19), subclass (COVID19, Infectious-Disease)
• Infectious diseases are caused by disease causing agents, like viruses: class (Disease-Causing-Agent), class (Virus), subclass (Virus,
Disease-Causing-Agent), property (caused-By Domain(Infectious-Disease) Range(Disease-Causing-Agent))
• COVID19 is caused by the SARS-CoV-2 virus: class (SARS-COVID-2), subclass (SARS-COVID-2, Virus), property (caused-By
Domain(COVID19) Range(SARS-COVID-2))
Procedural knowledge:
• Ad hoc solutions implementing chains as next properties in action objects: property (Next Domain(Action) Range(Action)),
minCardinality (Next, 0)
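The declarative statements above can be mirrored with two small Python dictionaries, one for the subclass hierarchy and one for the domain and range of properties. This is a toy sketch, not an ontology language: the data structures and the function names is_a and allowed are assumptions made for illustration.

```python
# subclass (Child, Parent) statements from the examples above.
subclass_of = {
    "COVID19": "Infectious-Disease",
    "Virus": "Disease-Causing-Agent",
    "SARS-COVID-2": "Virus",
}

# property (name Domain(...) Range(...)) statements.
properties = {
    "caused-By": [
        ("Infectious-Disease", "Disease-Causing-Agent"),
        ("COVID19", "SARS-COVID-2"),
    ],
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = subclass_of.get(cls)
    return False


def allowed(prop, subject_class, object_class):
    """True if some declared domain/range pair of the property covers
    the given subject and object classes."""
    return any(is_a(subject_class, domain) and is_a(object_class, rng)
               for domain, rng in properties[prop])


print(is_a("SARS-COVID-2", "Disease-Causing-Agent"))
print(allowed("caused-By", "COVID19", "SARS-COVID-2"))
print(allowed("caused-By", "Virus", "COVID19"))
```

The IS-A relationship is followed transitively, which is why SARS-COVID-2 counts as a disease-causing agent even though that is never stated directly.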
Knowledge Life Cycle
Def. (Knowledge life cycle): a cyclic process of transforming information
into knowledge within an organization.
• Create
Knowledge is identified and represented in a
formal way
• Share
Created knowledge and its possible uses are
introduced to the final users
• Store
Created knowledge is stored in (formal)
knowledge bases for a later use
• Use
Stored knowledge is exploited in the daily
practice of the knowledge-based organization
• Update
Daily practice reveals needs for new, more specific,
or corrected knowledge, which trigger a new loop
of the whole process.
5. Clinical Decision Support Systems (CDSS) and Artificial
Intelligence in Medicine (AIM)
• Def. (Biomedicine): the branch of medicine concerned with the application of the principles
of biology and biochemistry to medical research or practice.
• Def. (Medical practice): the practice of medicine; a complex problem involving multiple
cognitive tasks such as diagnosis, treatment, and prognosis.
– Diagnosis: Differential knowledge that is acquired of the physical and mental state of the patient by
observing the signs and symptoms of the disease that they present.
– Treatment: The management and care of a patient for the purpose of combating disease, injury, or
disorder.
– Prognosis: The likely outcome or course of a disease; the chance of recovery or recurrence.
• These are decisional issues that can be supported by Clinical Decision Support Systems and
Artificial Intelligence tools.
– Clinical Decision Support System (CDSS): health information technology system that is designed to
provide health-care professionals with clinical decision support; i.e., assistance with clinical decision-
making tasks.
– Artificial Intelligence in Medicine (AIM): the use of artificial intelligence technology and automated
processes in the diagnosis, treatment, and prognosis of patients who require care.
• Some outstanding CDSS
– Electronic Differential Diagnosis (DDx) Generators
– Drug Interaction Checkers (DIG)
– Alert and surveillance Systems
Further Reading: https://www.ncbi.nlm.nih.gov/books/NBK543516/pdf/Bookshelf_NBK543516.pdf
Electronic Differential Diagnosis (DDx) Generators
Def. (Differential Diagnosis (DDx) Generators): Electronic tools that may
facilitate the diagnostic process: the user enters the observed signs and
symptoms and obtains a ranking of possible causes.
• Examples:
https://symptomchecker.isabelhealthcare.com/
https://symptoms.webmd.com/
Drug Interaction Checkers (DIG)
Def. (Drug Interaction Checker (DIG)): Electronic tools that help detect
(and solve) interactions between groups of drugs and other
substances that prevent the drugs from performing as expected.
• Examples:
https://reference.medscape.com/drug-interactionchecker
https://www.webmd.com/interaction-checker/default.htm
Alert and surveillance Systems
Def. (Clinical Surveillance and Alert System): Clinical surveillance systems
utilize multivariate, continuous, real-time data from multiple monitoring
devices; apply advanced analytics to provide a quantitative and qualitative
estimate of a patient's condition over time; and communicate clinically
relevant alerts to the appropriate clinician.
• Examples:
[Figure: Patient, Device, Intelligent Alarm System, Health-care Professionals]
6. Legality, Security and Ethics
• Legal Aspects of Health Care Administration
– USA
– Spain
• Code of Ethics of Medical Informatics
– International Medical Informatics Association
• Other proposals
– General Data Protection Regulation (GDPR)
– Spanish Ley Orgánica de Protección de Datos (LOPDGDD)
– Artificial Intelligence and Law Enforcement - Impact on
Fundamental Rights
Legal Aspects of Health Care Administration
• Injury report (Parte de Lesiones)
– Código Penal (art. 147)
– Ley de Enjuiciamiento judicial (arts.
262, 355)
• Clinical record (Historia Clínica, HC)
– Ley General de Sanidad (arts. 10, 61)
– Código de Ética y Deontología
Médica (art. 13)
– Ley 41/2002, the basic law regulating
patient autonomy and the
rights and obligations regarding
clinical information and
documentation
• Use of the HC
• Conservation of the HC
• Right of access to the HC
• Ownership of the HC
Source: https://www.elsevier.es/es-revista-medicina-familia-semergen-40-pdf-13072713
Spanish Legislation on Clinical Records (I)
Ley General de Sanidad
• Art. 10: “Everyone has the following rights with respect to the different public health administrations: […]
11. To have a written record kept of their entire process. At the end of the user’s stay in a hospital
institution, the patient, a relative or a person close to them will receive their discharge report.”
• Art. 61: “In each Health Area, the maximum integration of the information relating to each patient must be
sought, so the principle of a single clinical-health record per patient must be maintained, at least
within the limits of each care institution. The record will be available to the patients and to the physicians
directly involved in the patient’s diagnosis and treatment, as well as for medical inspection or
scientific purposes, while fully guaranteeing the patient’s right to personal and family privacy and
the duty of secrecy of whoever, by virtue of their competences, has access to the clinical record. The
public authorities will adopt the measures necessary to guarantee these rights and duties.”
Código de Ética y Deontología Médica
• Art. 13:
1. Medical acts will be recorded in the corresponding clinical record. The physician has the duty and the
right to write it.
2. The physician and, where applicable, the institution they work for, are obliged to conserve the
clinical records and the material elements of diagnosis. If conservation is discontinued because of the
passage of time, the material that is not considered relevant may be destroyed, without prejudice to the
provisions of special legislation. In case of doubt, the Deontology Commission of the medical association
(Colegio) must be consulted.
3. When a physician ceases private practice, their archive may be transferred to the colleague who
succeeds them, unless the patients express their will to the contrary. When no such succession takes
place, the archive may be destroyed, in accordance with the previous paragraph.
4. Clinical records are written and conserved for the care of the patient or for another purpose that
complies with the rules of medical secrecy and has the authorization of the physician and of the patient.
5. The scientific and statistical analysis of the data contained in the records, and the presentation of
specific cases for teaching purposes, can provide very valuable information, so their publication and use
are in accordance with deontology, provided that the confidentiality and the right to privacy of the
patients are rigorously respected.
6. The physician is obliged, on request and for the benefit of the patient, to provide another colleague
with the data necessary to complete the diagnosis, and to facilitate the examination of the tests
performed.
Spanish Legislation on Clinical Records (II)
Ley 41/2002, the basic law regulating patient autonomy and
the rights and obligations regarding clinical information and
documentation
• Article 14. Definition and archiving of the clinical record
• Article 15. Content of each patient’s clinical record
• Article 16. Uses of the clinical record
• Article 17. Conservation of clinical documentation
• Article 18. Rights of access to the clinical record
• Article 19. Rights related to the custody of the
clinical record
Summary of Spanish Legislation on Clinical Records (HC) (I)
A. Use of the clinical record
1. By care professionals of the medical center in which the patient’s diagnosis or treatment takes place.
2. By administration and management staff of the health center (only data related to their functions).
3. By health staff with inspection, evaluation, accreditation and planning functions, in order to check
the quality of care or the respect for the patient’s rights.
4. For judicial, epidemiological, public health, research or teaching purposes. Keep in mind:
– A specific legal authorization will be needed.
– The authorization will always be restrictive, separating the patient’s identification data from
the clinical care data.
– For judicial purposes, the complete clinical record is requested.
B. Conservation of the clinical record
– The duty falls on the health professional when practicing individually, and on the health center
when medical care is provided in that setting.
– When the duty of custody falls on the physician, it is a patrimonial duty and therefore passes to
their heirs upon death. Ley 41/2002 establishes a minimum conservation period of 5 years from the date
of discharge. The maximum varies depending on the type of care provided, the foreseeable
complications, the persistence of the need for certain treatments, etc.
– Conserving the HC is fundamental in the case of liability claims. Failing to provide it can have
adverse effects on the professional (there are several judgments against the physician or the
institution for not providing the clinical record, alleging that it had been lost, had been given to
the patient, or had never been written).
Summary of Spanish Legislation on Clinical Records (HC) (II)
Code of Ethics
• International Medical Informatics Association (IMIA)
• 2016 IMIA Code of Ethics for Health Information Professionals
Source: https://imia-medinfo.org/wp/imia-code-of-ethics/
Fundamental Ethical Principles
1. Principle of Autonomy: All persons have a fundamental right to self-determination.
2. Principle of Equality and Justice: All persons are equal as persons and have a right
to be treated accordingly.
3. Principle of Beneficence: All persons have a duty to advance the good of others
where the nature of this good is in keeping with the fundamental and ethically
defensible values of the affected party.
4. Principle of Non-Malfeasance: All persons have a duty to prevent harm to other
persons insofar as it lies within their power to do so without undue harm to
themselves.
5. Principle of Impossibility: All rights and duties hold subject to the condition that it
is possible to meet them under the circumstances that obtain.
6. Principle of Integrity: Whoever has an obligation has a duty to fulfil that obligation
to the best of their ability.
General Principles of Informatics Ethics
1. Principle of Information-Privacy and Disposition: All persons and group of persons have a fundamental
right to privacy, and hence to control over the collection, storage, access, use, communication,
manipulation, linkage and disposition of data about themselves.
2. Principle of Openness: The collection, storage, access, use, communication, manipulation, linkage and
disposition of personal data must be disclosed in an appropriate and timely fashion to the subject or
subjects of those data.
3. Principle of Security: Data that have been legitimately collected about persons or groups of persons should
be protected by all reasonable and appropriate measures against loss, degradation, unauthorized
destruction, access, use, manipulation, linkage, modification or communication.
4. Principle of Access: The subjects of electronic health records have the right of access to those records and
the right to correct them with respect to its accurateness, completeness and relevance.
5. Principle of Legitimate Infringement: The fundamental right of privacy and of control over the collection,
storage, access, use, manipulation, linkage, communication and disposition of personal data is conditioned
only by the legitimate, appropriate and relevant data-needs of a free, responsible and democratic society,
and by the equal and competing rights of others.
6. Principle of the Least Intrusive Alternative: Any infringement of the privacy rights of a person or group of
persons, and of their right of control over data about them, may only occur in the least intrusive fashion
and with a minimum of interference with the rights of the affected parties.
7. Principle of Accountability: Any infringement of the privacy rights of a person or group of persons, and of
the right to control over data about them, must be justified to the latter in good time and in an appropriate
fashion.
234
Rules of Ethical Conduct
A. Subject-centered duties (12): duties derived from the relationship
between EHRs, the data contained in them and the subjects of those
records
B. Duties towards HCPs (6): HIPs' obligations to assist the HCPs they are
associated with, compatible with the duties towards the subjects of the
EHRs
C. Duties towards institutions, employers and agencies (10)
D. Duties towards society (5): HIPs' social obligations
E. Self-regarding duties (7): HIPs' own obligations
F. Duties towards the profession (5): HIPs' obligations as a group
236
Artificial Intelligence and Law Enforcement
• Not this course
Source: https://www.europarl.europa.eu/thinktank/en/document.html?reference=IPOL_STU(2020)656295
237
Conclusions
In this course we learned:
• About clinical data…
… what the sources and sinks of data are; what types of clinical data we can find and how to store them in variables; how to convert
variables between different types and why; common problems with variables (wrong values, missing values, noise, etc.) and how to deal with
them; basic concepts of big data; the most frequently used standards for clinical data codification; what an EHR is, its parts,
uses, past, present, and future, its standards, and some current EHR software systems; what interoperability is and its relation
to clinical data sharing.
• About clinical data description and analysis…
… what the parts of a data science project are; how to use descriptive statistics to describe clinical data and how to do that in Python;
a practical example of clinical data description (Case Study 1); how to analyze data with inferential statistics; Student’s and Welch’s t-tests;
Pearson's Chi-square test; ANOVA (one-way, two-way with and without replication); post-hoc analysis with Tukey’s test; how to check data
normality (Shapiro-Wilk and Kolmogorov-Smirnov tests); how to perform non-parametric tests (Wilcoxon’s rank-sum and Kruskal-Wallis
tests); how to calculate and interpret the risk ratio and odds ratio; how to do all this with Python; a practical example of clinical data inference
(Case Study 2); how to perform survival analysis with the Kaplan-Meier table, curve, and log-rank test.
• About artificial intelligence for clinical data analysis…
… practical use of RapidMiner; basics of data statistics and visualization; how to identify data preprocessing issues and alternative solutions for
data cleaning, transformation, reduction, and balancing; practical application of these techniques with RapidMiner (Case Study 3); use
of ML algorithms for data analysis and modeling; introduction to unsupervised ML (k-Means and Agglomerative Clustering); introduction
to supervised ML (k-NN, Logistic Regression, DT, NB classifier, Rules, ANN, SVM, discriminant analysis); the main quality measures
for testing clinical predictive models; how to develop a data science project for the analysis of hepatitis (Case Study 4).
• About knowledge representation in medicine…
… what the difference between clinical data, information, and knowledge is; the practical distinction between descriptive and procedural
knowledge; four alternative ways to represent clinical knowledge (logic, production rules, objects, and ontologies); the steps of the
knowledge life cycle.
• About clinical decision support systems…
… what a CDSS is; a practical introduction to three types of CDSS: differential diagnosis generators (diagnosis), drug interaction checkers (treatment),
and alert and surveillance systems (follow-up).
• About legal, security, and ethics issues…
… legal aspects of HC administration in the USA and Spain; the code of ethics of medical informatics; the European General Data Protection
Regulation (GDPR) and its Spanish implementation, the LOPDGDD organic law.
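As a brief reminder of the Python-based analysis recapped above, the following is a minimal sketch of how the risk ratio and odds ratio could be computed from a 2x2 table. The counts and the function names `risk_ratio` and `odds_ratio` are hypothetical and chosen only for illustration; they are not taken from the course case studies.

```python
# Hypothetical 2x2 contingency table (made-up counts, for illustration only):
#
#                outcome   no outcome
#   exposed        a = 20      b = 80
#   unexposed      c = 10      d = 90


def risk_ratio(a, b, c, d):
    """Risk of the outcome in the exposed group divided by the risk in the unexposed group."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    return (a * d) / (b * c)


a, b, c, d = 20, 80, 10, 90
rr = risk_ratio(a, b, c, d)
odds = odds_ratio(a, b, c, d)

print(f"Risk ratio: {rr:.2f}")
print(f"Odds ratio: {odds:.2f}")
```

A value above 1 for either measure suggests the outcome is more frequent in the exposed group. A confidence interval would be needed before drawing any conclusion from real data.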
238