NLP Unit-1 - 1
Syntax: Generates human language syntax | Strict syntax for every language
Aim: Enable computers to interact with human language and do all manipulation | Solve the task and computational problems
Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks
use this approach, in which words are treated as the basic units of meaning.
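As a minimal sketch of word tokenization, the snippet below uses a regular expression to split text into words and standalone punctuation marks. The function name word_tokenize and the pattern are illustrative assumptions, not a standard library API; real tokenizers handle contractions, hyphens, and numbers more carefully.

```python
import re

def word_tokenize(text):
    """Split text into word tokens and single punctuation marks (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("Words are treated as basic units.")
print(tokens)
```

Note that this simple pattern splits a contraction such as "isn't" into three tokens, which is one reason practical systems use more elaborate rules.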
Sentence Tokenization:
The text is segmented into sentences during sentence tokenization. This is
useful for tasks requiring individual sentence analysis or processing.
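A rough sketch of sentence tokenization is shown below, assuming that a sentence ends with ".", "!" or "?" followed by whitespace. The function name sentence_tokenize is hypothetical, and this rule will wrongly split after abbreviations such as "Dr." or "e.g.", so treat it as an illustration rather than a robust segmenter.

```python
import re

def sentence_tokenize(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to the preceding sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings that appear when the input is empty.
    return [part for part in parts if part]

for sentence in sentence_tokenize("I like NLP. Do you? Yes!"):
    print(sentence)
```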
Subword Tokenization:
Subword tokenization entails breaking down words into smaller units, which
can be especially useful when dealing with morphologically rich languages or rare
words.
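One way to picture subword tokenization is greedy longest-match segmentation against a fixed vocabulary of pieces. The sketch below assumes a tiny hand-written vocabulary (VOCAB is hypothetical); in practice the vocabulary is learned from data by algorithms such as byte-pair encoding, which this example does not implement.

```python
# Hypothetical subword vocabulary, written by hand for illustration.
VOCAB = {"un", "break", "able", "token", "ization"}

def subword_tokenize(word, vocab):
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_tokenize("unbreakable", VOCAB))
print(subword_tokenize("tokenization", VOCAB))
```

Because unknown spans fall back to single characters, a rare or unseen word can still be represented, which is the main appeal of subword units.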
Character Tokenization:
This process divides the text into individual characters. This can be useful
for modelling character-level language.
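Character tokenization is simple enough to sketch in a few lines. The helper names below (char_tokenize, build_char_vocab) are illustrative assumptions; the second function shows how a character-level model might map each distinct character to an integer id.

```python
def char_tokenize(text):
    """Return the text as a list of individual characters."""
    return list(text)

def build_char_vocab(text):
    """Map each distinct character to an integer id, in sorted order."""
    return {ch: idx for idx, ch in enumerate(sorted(set(text)))}

chars = char_tokenize("NLP")
vocab = build_char_vocab("banana")
print(chars)
print(vocab)
```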
STEMMING
• In natural language processing, stemming is the text preprocessing
normalization task concerned with bluntly removing word affixes
(prefixes and suffixes).
• Stemming in natural language processing reduces words to their base
or root form, aiding in text normalization for easier processing.
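• For example, “chocolates” becomes “chocolate” and “retrieval”
becomes “retrieve.” Because affixes are removed bluntly, many stemmers
return truncated stems that are not dictionary words, such as “retriev”.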
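The blunt affix removal described above can be sketched as a naive suffix stripper. The suffix list and the minimum stem length of three characters are assumptions chosen for illustration; this is not the Porter algorithm or any other published stemmer.

```python
# Hypothetical suffix list, checked in order; the first match is removed.
SUFFIXES = ("ing", "ed", "s")

def simple_stem(word):
    """Strip one common suffix if at least three characters would remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["chocolates", "walked", "running"]:
    print(w, "->", simple_stem(w))
```

The result for "running" is "runn", which illustrates that a stem is not guaranteed to be a valid word.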
LEMMATIZATION
• Lemmatization is the process of grouping together different inflected forms of the
same word.
• Lemmatization is similar to stemming, but it takes the context of a word (such as its part of speech) into account and returns a valid dictionary form, the lemma.
Three types of lemmatization techniques are:
• 1. Rule-Based Lemmatization: Rule-based lemmatization involves the
application of predefined rules to derive the base or root form of a word.
Example:
Word: “walked”, Rule Application: Remove “-ed”, Result: “walk”
• 2. Dictionary-Based Lemmatization : Dictionary-based lemmatization relies on
predefined dictionaries or lookup tables to map words to their corresponding base
forms or lemmas.
• 3. Machine Learning-Based Lemmatization: Machine learning-based
lemmatization leverages computational models to automatically learn the
relationships between words and their base forms.
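The first two techniques above can be combined in a small sketch: look the word up in a dictionary of irregular forms first, then fall back to suffix rules. The entries in LEMMA_DICT and RULES are hypothetical examples written for this illustration, and the sketch ignores part of speech, which real lemmatizers typically use.

```python
# Hypothetical lookup table for irregular forms (dictionary-based step).
LEMMA_DICT = {"went": "go", "better": "good", "mice": "mouse", "was": "be"}

# Hypothetical (suffix, replacement) rules, tried in order (rule-based step).
RULES = [("ies", "y"), ("ed", ""), ("ing", ""), ("s", "")]

def lemmatize(word):
    """Return a lemma using dictionary lookup first, then suffix rules."""
    word = word.lower()
    if word in LEMMA_DICT:
        return LEMMA_DICT[word]
    for suffix, replacement in RULES:
        # Require a few characters before the suffix to avoid mangling short words.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for w in ["walked", "went", "studies", "cats"]:
    print(w, "->", lemmatize(w))
```

A machine learning-based lemmatizer would replace the hand-written table and rules with a model trained on pairs of word forms and lemmas.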
PART OF SPEECH TAGGING
• Part-of-speech (PoS) tagging assigns each word in a text a grammatical category, such
as noun, verb, adjective, or adverb.
• In many NLP applications, including machine translation, sentiment analysis, and
information retrieval, PoS tagging is essential. PoS tagging serves as a link between
language and machine understanding, enabling the creation of complex language
processing systems and serving as the foundation for advanced linguistic analysis.
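To make the idea of assigning a grammatical category to each word concrete, here is a toy tagger that uses a small lexicon plus suffix heuristics. The LEXICON entries, the tag names, and the fallback rules are all assumptions made for illustration; practical taggers are usually statistical or neural models that use the surrounding context.

```python
# Hypothetical word-to-tag lexicon for illustration.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN",
    "small": "ADJ", "chased": "VERB",
}

def pos_tag(tokens):
    """Tag each token by lexicon lookup, then by simple suffix heuristics."""
    tagged = []
    for token in tokens:
        lower = token.lower()
        if lower in LEXICON:
            tag = LEXICON[lower]
        elif lower.endswith("ly"):
            tag = "ADV"   # heuristic: many adverbs end in -ly
        elif lower.endswith("ed") or lower.endswith("ing"):
            tag = "VERB"  # heuristic: common verb inflections
        else:
            tag = "NOUN"  # default guess for unknown words
        tagged.append((token, tag))
    return tagged

print(pos_tag(["The", "small", "dog", "barked", "loudly"]))
```

The heuristics ignore context, so a word like "walking" is always tagged as a verb here even when it is used as a noun; handling such ambiguity is what makes PoS tagging a non-trivial task.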