Text Retrieval Slide - Update
All-in-One Course
(TA Session)
Text Retrieval
Project
Dinh-Thang Duong – TA
Year 2023
AI VIETNAM
Outline
➢ Introduction
➢ Create Corpus
➢ Text Representation
➢ Text Normalization
➢ Ranking
➢ Optional: Semantic Search with BERT
➢ Question
(TA Session) Introduction
❖ Getting Started
Most famous search engines
❖ Getting Started
Search
❖ Text Retrieval
Text Retrieval (TR), also called Document Retrieval [1]: A branch of Information Retrieval (IR) in which the system matches a stated user search query against a set of texts.
Ad-hoc Retrieval [2]: A system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query.
[1] https://en.wikipedia.org/wiki/Document_retrieval
[2] https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html
❖ Text Retrieval
[Figure: Query → Search → Relevant Documents]
❖ Applications
• Document Indexing
• Text Representation
❖ Basic Text Retrieval Pipeline
• Query: a text that describes the user’s information need.
• Corpus: a set of documents (texts).
• Relevance: satisfaction of the user’s information need.
• Information need: the topic about which the user desires to know more.
Input:
• Search Query (a text)
• The Corpus (collection of documents)
Output:
• Relevant Documents (collection of documents)
[Figure: Input (Query, Corpus) → Query Processing → Searching → Output (Relevant Documents)]
❖ Project Statement
Using the MS MARCO dataset, create a simple text retrieval program based on the Vector Space Model.
❖ Project Statement
[Figure: Text Retrieval System. Inputs: Query, Corpus. Labeled parts: Vector Similarity, 4. Ranking]
[Figure: project pipeline. Input → Vectorizer; Input (Query) → Query Processing → Cosine Similarity → Ranking → Output (Relevant Documents)]
(TA Session) Create Corpus
❖ Problem
❖ Download
❖ Step 1: Install datasets
❖ Step 2: Load MS_MARCO
❖ Step 4: Extract text
1. Use only samples whose type == entity.
2. Load the text (loading only passage_text is enough) and append it to the corpus.
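The two extraction steps above can be sketched as follows. This is a minimal illustration, not the slide's original code: the record layout (a `query_type` field and a `passages` dictionary holding a `passage_text` list) is an assumption about how the MS MARCO samples are structured, and the toy records are made up so the snippet runs without downloading anything.

```python
def build_corpus(samples, wanted_type="entity"):
    """Collect passage texts from MS MARCO-style records of one query type."""
    corpus = []
    for sample in samples:
        # Step 1: keep only samples of the wanted type.
        if sample["query_type"] != wanted_type:
            continue
        # Step 2: append every passage text of the sample to the corpus.
        for passage in sample["passages"]["passage_text"]:
            corpus.append(passage)
    return corpus


# Hand-made records for illustration only. Real samples would come from
# the downloaded dataset, e.g. (assumed dataset name and config):
#   from datasets import load_dataset
#   samples = load_dataset("ms_marco", "v1.1")["train"]
toy_samples = [
    {"query_type": "entity",
     "passages": {"passage_text": ["Passage A.", "Passage B."]}},
    {"query_type": "numeric",
     "passages": {"passage_text": ["Passage C."]}},
]

corpus = build_corpus(toy_samples)
print(corpus)  # ['Passage A.', 'Passage B.']
```

Keeping the filter and the extraction in one function makes it easy to try another query type later by changing `wanted_type`.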
(TA Session) Text Representation
❖ Introduction
S1 = “this is a book”
S2 = “machine learning book”
Similarity?
❖ Challenges
Word relations:
• Synonymy: <water, H2O>…
• Antonymy: <up, down>…
• Polysemy: sentence, mouse…
• Similarity: <car, truck>…
• Relatedness: <coffee, cup>…
• Connotation: great (positive), terrible (negative)…
❖ Representation Taxonomy
Word Embeddings
• Context-Independent
  ◦ Without machine learning: Bag-of-Words, TF-IDF
  ◦ With machine learning: GloVe, CBOW
• Context-Dependent
  ◦ RNN-based: ELMo
  ◦ Transformer-based: BERT Family, GPT
https://medium0.com/nlplanet/two-minutes-nlp-11-word-embeddings-models-you-should-know-a0581763b9a9
❖ Introduction to Bag-of-Words
Bag-of-Words (BoW): A text representation method that represents a text as the bag of its words, disregarding grammar and even word order but keeping multiplicity.
1. The vocabulary
2. The term weighting method
https://en.wikipedia.org/wiki/Bag-of-words_model
❖ Bag-of-Words Pipeline
[Figure: Bag-of-Words pipeline. (1) Corpus (list of paragraphs) → Text Normalization → Create Dictionary. (2) A string (text) → Text Normalization → Vectorize → new text representation, e.g. 14 2 9 36 89]
❖ Dictionary
Set of documents
doc_i = [‘book’, ‘deep’, ‘learning’]
BoW:        1 2 1 1 0 0 0 0 1
Binary BoW: 1 1 1 1 0 0 0 0 1
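To make the difference between BoW and Binary BoW concrete, here is a small sketch. The two example documents and the function names `create_dictionary` and `vectorize` are chosen for illustration and are not taken from the slides; the dictionary is simply the sorted set of all tokens.

```python
def create_dictionary(tokenized_docs):
    """Return the sorted list of unique tokens across all documents."""
    return sorted({token for doc in tokenized_docs for token in doc})


def vectorize(tokens, dictionary, binary=False):
    """Count how often each dictionary word occurs in `tokens`."""
    counts = [tokens.count(word) for word in dictionary]
    if binary:
        # Binary BoW only records presence (1) or absence (0).
        return [1 if count > 0 else 0 for count in counts]
    return counts


docs = [
    ["book", "deep", "learning"],
    ["machine", "learning", "book", "book"],
]
dictionary = create_dictionary(docs)

print(dictionary)                                  # ['book', 'deep', 'learning', 'machine']
print(vectorize(docs[1], dictionary))              # [2, 0, 1, 1]
print(vectorize(docs[1], dictionary, binary=True)) # [1, 0, 1, 1]
```

The count vector keeps multiplicity ("book" appears twice), while the binary vector only says whether a word is present.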
❖ Vectorizer
[Figure: Vectorizer. Input: s = “Hello AI VIETNAM” → Text Normalization → Bag-of-Words (using an n-word vocabulary) → Output: [0, 0, 1, …, 0] (a vector of n elements)]
❖ Vectorizer
❖ Indexing
[Figure: retrieval pipeline with indexing. Corpus → Indexing; Input (Query) → Query Processing → Searching → Output (Relevant Documents); Feedback]
❖ Indexing
Indexing: The process of organizing and structuring a collection of documents or data to facilitate efficient retrieval of information. It involves creating an index that enables quick access to relevant documents based on search queries or specific attributes.
E.g.: Inverted Indexing
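As one possible illustration of inverted indexing, the sketch below maps each term to the set of document ids that contain it, so a query only touches the entries for its own terms. The example documents and the helper names `build_inverted_index` and `lookup` are assumptions made for this sketch.

```python
from collections import defaultdict


def build_inverted_index(tokenized_docs):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(tokenized_docs):
        for token in tokens:
            index[token].add(doc_id)
    return index


def lookup(index, query_tokens):
    """Return sorted ids of documents containing at least one query token."""
    hits = set()
    for token in query_tokens:
        # .get avoids creating empty entries for unknown terms.
        hits |= index.get(token, set())
    return sorted(hits)


docs = [
    ["deep", "learning", "book"],
    ["machine", "learning"],
    ["cook", "book"],
]
index = build_inverted_index(docs)

print(lookup(index, ["learning"]))      # [0, 1]
print(lookup(index, ["book", "cook"]))  # [0, 2]
print(lookup(index, ["unknown"]))       # []
```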
❖ Indexing
Document-term Matrix: A mathematical matrix that describes the frequency of terms that occur in a collection of documents.
❖ Document-term Matrix
doc1  1 1 1 0 0 0 0 0 0
doc2  0 1 0 1 1 0 0 0 0
doc3  0 1 0 0 0 1 1 1 0
doc4  0 0 0 0 0 1 0 0 1
(Columns represent terms.)
❖ Doc-Term Matrix as Index
Normalize & Tokenize:
doc1 = “Học sách học AI.” → [‘học’, ‘sách’, ‘học’, ‘ai’]
doc2 = “Sách Học Máy” → [‘sách’, ‘học’, ‘máy’]
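A minimal sketch of using a document-term matrix as an index, built from the two tokenized documents above. The vocabulary ordering (sorted) and the helper `docs_containing` are assumptions for illustration; each row is a document and each column a term.

```python
# Tokenized documents from the slide ("học" = learn, "sách" = book, "máy" = machine).
doc1 = ["học", "sách", "học", "ai"]
doc2 = ["sách", "học", "máy"]
docs = [doc1, doc2]

# Columns of the matrix: one per unique term, in sorted order.
vocab = sorted(set(doc1) | set(doc2))

# Rows of the matrix: term frequencies of each document.
matrix = [[doc.count(term) for term in vocab] for doc in docs]


def docs_containing(term):
    """Use the matrix as an index: row ids whose column for `term` is non-zero."""
    if term not in vocab:
        return []
    column = vocab.index(term)
    return [row_id for row_id, row in enumerate(matrix) if row[column] > 0]


print(vocab)                   # ['ai', 'học', 'máy', 'sách']
print(matrix)                  # [[1, 2, 0, 1], [0, 1, 1, 1]]
print(docs_containing("máy"))  # [1]
```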
(TA Session) Text Normalization
❖ Motivation
Tokenize:
doc1 = “Học sách học AI.” → [‘Học’, ‘sách’, ‘học’, ‘AI.’]
doc2 = “Sách Học Máy” → [‘Sách’, ‘Học’, ‘Máy’]
vocab_size = 10
‘Học’ and ‘học’ both refer to the meaning of “học” (“learn”); ‘Sách’ and ‘sách’ both refer to the meaning of “sách” (“book”).
❖ Motivation
Preprocess & Tokenize:
doc1 = “Học sách học AI.” → [‘học’, ‘sách’, ‘học’, ‘ai’]
doc2 = “Sách Học Máy” → [‘sách’, ‘học’, ‘máy’]
vocab_size = 8
❖ Input/Output
Input: “Hello, this is AI VIETNAM!”
Steps: Lowercasing → Punctuations Removal → Stopwords Removal → Stemming
Output: “hello ai vietnam”
❖ Stopwords Removal
Less crucial words ➔ no need to represent them
❖ Stemming
❖ Stemming
If we do not consider the semantics of words, we shouldn’t include every form of a word in the dictionary, only its root form.
change, changing, changes, changer: the same meaning as “change”
[Figure: Input (word forms such as change, changing, changer) → Stemming Rules → Output (root form)]
❖ Stemming
❖ Stemming
❖ Final Text Normalization Function
Lowercasing
Punctuations Removal
Stopwords Removal
Stemming
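The four steps above could be chained as in the sketch below. It is only an illustration under stated assumptions: the stopword list is a tiny hand-picked set, and `toy_stem` is a crude suffix stripper standing in for a real stemmer (such as a Porter stemmer from an NLP library), so its output will differ from a proper stemmer on many words.

```python
import string

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "the", "this", "that", "is", "are", "of", "to", "in", "and"}

# Assumption: crude suffix rules standing in for real stemming rules.
SUFFIXES = ("ing", "es", "er", "s", "e")


def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase, remove punctuation, remove stopwords, then stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return [toy_stem(token) for token in tokens]


print(normalize("Hello, this is AI VIETNAM!"))
# ['hello', 'ai', 'vietnam']
print([toy_stem(w) for w in ["change", "changing", "changes", "changer"]])
# ['chang', 'chang', 'chang', 'chang']
```

Note that the stem does not have to be a dictionary word: all four forms collapse to the same string, which is all the vocabulary needs.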
(TA Session) Ranking
❖ Motivation
doc1   1 1 1 0 0 0
doc2   0 1 0 1 1 0
doc3   0 0 1 0 2 1
query  0 0 0 1 1 0
distance(q, d) → Ranked List
DocID  Similarity
d2     0.8165
d3     0.5774
d1     0.0000
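The similarity values in the ranked list above are consistent with cosine similarity between the query vector and each document vector. A minimal sketch that recomputes them with the standard library only (the function names are chosen for this illustration):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank(query, doc_vectors):
    """Return (doc_id, similarity) pairs sorted by similarity, descending."""
    scores = [(doc_id, cosine_similarity(query, vector))
              for doc_id, vector in doc_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


doc_vectors = {
    "d1": [1, 1, 1, 0, 0, 0],
    "d2": [0, 1, 0, 1, 1, 0],
    "d3": [0, 0, 1, 0, 2, 1],
}
query = [0, 0, 0, 1, 1, 0]

ranked = rank(query, doc_vectors)
for doc_id, score in ranked:
    print(f"{doc_id} {score:.4f}")
# d2 0.8165
# d3 0.5774
# d1 0.0000
```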
❖ Cosine Similarity
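$$\cos(\boldsymbol{a}, \boldsymbol{b}) = \frac{\boldsymbol{a} \cdot \boldsymbol{b}}{\lVert \boldsymbol{a} \rVert \, \lVert \boldsymbol{b} \rVert}$$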
Dot product favours long vectors (higher value in dimensions)
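A small made-up example of why the raw dot product favours long vectors: two documents pointing in the same direction as the query get very different dot products, but the same cosine similarity. The vectors below are illustrative only.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1, 1]
short_doc = [1, 1]
long_doc = [3, 3]  # same direction as short_doc, three times the counts

print(dot(query, short_doc), dot(query, long_doc))  # 2 6
# Cosine similarity removes the length effect (equal up to rounding):
print(math.isclose(cosine(query, short_doc), cosine(query, long_doc)))  # True
```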
❖ Cosine Similarity
[Figure: similarity between vectors in a two-term space. Axes: term 1 “learning”, term 2 “information”]
❖ Ranking based on similarity value
doc3   0 0 1 0 1 1 1
query  0 0 2 1 1 0 1
Sort by similarity, descending:
DocID  Similarity
d3     0.756
d1     0.308
d2     0.218
❖ Ranking code & results
❖ Ranking code & results
(TA Session) Optional: Semantic Search with BERT
❖ Introduction
❖ Introduction
❖ BERT
❖ BERT
Input Output
❖ Semantic Search
❖ Why Semantic Search?
Semantic Search: A search technique that aims to understand the meaning or semantics of a query and the content
being searched.
[Figure: Semantic Search Pipeline. Input (Corpus) → BERT Encode → Indexing; Input (Query) → Query Processing → Cosine Similarity → Ranking → Output (Relevant Documents)]
❖ Step 1: Import BERT and encode corpus
❖ Step 2: Define Cosine Similarity function
❖ Step 3: Define Ranking function
❖ Step 4: Search
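The four steps (encode the corpus, define cosine similarity, define ranking, search) can be sketched end to end as below. To keep the snippet runnable without downloading a model, the encoder is passed in as a function and `toy_encode` is a placeholder that counts letters; it is NOT BERT and carries no semantics. In practice `encode` would wrap a BERT-style model, for example the `encode` method of a `SentenceTransformer` from the `sentence-transformers` library, which is an assumption here and not necessarily what the slides used.

```python
import math


def cosine_similarity(a, b):
    """Step 2: cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_encode(text):
    """Placeholder encoder: 26-dimensional letter counts (stand-in for BERT)."""
    vector = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vector[ord(ch) - ord("a")] += 1.0
    return vector


def semantic_search(query, corpus, corpus_embeddings, encode, top_k=2):
    """Steps 3 and 4: encode the query, score every document, return the top_k."""
    query_embedding = encode(query)
    scores = [(i, cosine_similarity(query_embedding, embedding))
              for i, embedding in enumerate(corpus_embeddings)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(corpus[i], score) for i, score in scores[:top_k]]


corpus = ["deep learning book", "cooking recipes", "machine learning course"]
# Step 1: encode the corpus once and reuse the embeddings for every query.
corpus_embeddings = [toy_encode(doc) for doc in corpus]

results = semantic_search("learning", corpus, corpus_embeddings, toy_encode)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Encoding the corpus once up front is the design point: only the query has to be encoded at search time, whichever encoder is plugged in.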
(TA Session) Question
?