NLP Practicals


Pra 1

1) Natural Language Toolkit (NLTK):

NLTK is an essential library that supports tasks such as classification, stemming, tagging, parsing,
semantic reasoning, and tokenization in Python.

It is a standard entry point for natural language processing in Python, and today it also serves as an educational foundation for developers who are new to the field and to machine learning.
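As a small illustration of the semantic-reasoning side mentioned above, here is a minimal sketch using NLTK's WordNet interface (assuming the wordnet corpus has been downloaded):

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')  # one-time corpus download

# List the first few senses ("synsets") of an ambiguous word
for syn in wordnet.synsets('bank')[:3]:
    print(syn.name(), '-', syn.definition())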

2) scikit-learn:

This library provides developers with a wide range of algorithms for building machine-learning models. For NLP, it offers utilities for creating bag-of-words features to tackle text classification problems. Its main strength is an intuitive, consistent API.
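A minimal sketch of the bag-of-words workflow with scikit-learn's CountVectorizer; the two sample sentences are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was terrible"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray())                          # token counts per document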

3) Pattern:

Pattern is a Python library designed for web mining, natural language processing, and machine learning. It provides modules for various text analysis tasks, including part-of-speech tagging, sentiment analysis, word lemmatization, and language translation.
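A short sketch of Pattern's English module, assuming the pattern package is installed (the example strings are made up, and Pattern's releases only support certain Python versions):

from pattern.en import tag, lemma, sentiment

print(tag('The cats were sleeping'))            # part-of-speech tags
print(lemma('were'))                            # 'be'
print(sentiment('A wonderful little library'))  # (polarity, subjectivity)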

4) Polyglot Library:

Polyglot is a multilingual NLP library that supports over 130 languages. It offers functionalities for
tasks such as tokenization, named entity recognition, sentiment analysis, language detection, and
translation. Polyglot’s extensive language support makes it suitable for analyzing text data from
diverse sources.
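A sketch of Polyglot's language detection and entity extraction, assuming the package and the per-language models (fetched separately with Polyglot's downloader) are installed:

from polyglot.detect import Detector
from polyglot.text import Text

print(Detector('Bonjour tout le monde').language.code)  # detected language code, e.g. 'fr'

text = Text('George Washington lived in Virginia.')
print(text.entities)  # named entities (requires the English NER model)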

5) FastText:

FastText is a library developed by Facebook AI Research for efficient text classification and word
representation learning. It provides tools for training and utilizing word embeddings and text
classifiers based on neural network architectures.

FastText’s key feature is its ability to handle large text datasets quickly, making it suitable for
applications requiring high-speed processing, such as sentiment analysis, document classification,
and language identification in diverse languages.
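A minimal sketch with the fasttext Python bindings; corpus.txt and train.txt are hypothetical files, and the supervised training file follows fastText's __label__ prefix convention:

import fasttext

# Word vectors from a plain-text corpus (hypothetical file)
model = fasttext.train_unsupervised('corpus.txt', model='skipgram')
print(model.get_word_vector('language')[:5])

# Text classifier; each line of train.txt looks like: __label__positive great movie
clf = fasttext.train_supervised('train.txt')
print(clf.predict('great movie'))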

Pra 2

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')    # tokenizer models
nltk.download('wordnet')  # lemmatizer data

# Sample text
text = "The quick brown foxes were running quickly through the forest."

# Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# Stemming (crude suffix stripping; note 'quickli' in the output below)
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens if token.isalpha()]
print("Stems:", stems)

# Lemmatization (dictionary-based; defaults to the noun part of speech)
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized Tokens:", lemmatized_tokens)

output:

Tokens: ['The', 'quick', 'brown', 'foxes', 'were', 'running', 'quickly', 'through', 'the', 'forest', '.']

Stems: ['the', 'quick', 'brown', 'fox', 'were', 'run', 'quickli', 'through', 'the', 'forest']

Lemmatized Tokens: ['The', 'quick', 'brown', 'fox', 'were', 'running', 'quickly', 'through', 'the', 'forest', '.']
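The lemmatizer treats every token as a noun unless told otherwise, which is why 'were' and 'running' pass through unchanged above; a part-of-speech hint changes that:

print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('were', pos='v'))     # be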

Pra 3

import re

text = """
Contact us at support@example.com or sales@example.org.
Call us at +1 800-555-5555 or (123) 456-7890.
You can also reach out via email on info@example.net.
Important dates: 12/31/2024, 31-12-2024, 2024-12-31, and December 31, 2024.
"""

# Define regex patterns
date_pattern = r'\b(?:\d{2}/\d{2}/\d{4}|\d{2}-\d{2}-\d{4}|\d{4}-\d{2}-\d{2}|\d{1,2} \w{3,9} \d{4})\b'
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
phone_pattern = r'\b(?:\+?\d{1,4}[\s-]?)?\(?\d{2,4}\)?[\s-]?\d{3,4}[\s-]?\d{3,4}\b'

# Extract data using regex
dates = re.findall(date_pattern, text)
emails = re.findall(email_pattern, text)
phones = re.findall(phone_pattern, text)

# Print extracted data
print("Dates:", dates)
print("Emails:", emails)
print("Phones:", phones)

output:

Dates: ['12/31/2024', '31-12-2024', '2024-12-31']

Emails: ['support@example.com', 'sales@example.org', 'info@example.net']

Phones: ['1 800-555-5555', '123) 456-7890']

Note: 'December 31, 2024' is missing from the dates because the pattern expects the day before the month name (e.g. '31 December 2024'), and the leading '+' and '(' are missing from the phones because \b cannot match before a non-word character.
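A variant phone pattern that keeps the leading '+' and '(' by anchoring with a lookbehind instead of \b; this is a sketch tuned to the formats in this sample text:

# (?<!\w) succeeds wherever the preceding character is not a word character,
# so unlike \b the match can start on '+' or '('
phone_pattern2 = r'(?<!\w)(?:\+\d{1,3}[\s-]?)?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}'
print("Phones:", re.findall(phone_pattern2, text))
# Phones: ['+1 800-555-5555', '(123) 456-7890']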

Pra 4

import re

text = """
Contact us at support@example.com or SALES@example.org.
You can also reach out via email on Info@Example.net.
"""

# Pattern for email addresses (lowercase class; the flag below supplies case-insensitivity)
email_pattern = r'\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b'

# Find email addresses (case-insensitive)
emails = re.findall(email_pattern, text, re.IGNORECASE)

# Print result
print("Emails:", emails)

output:

Emails: ['support@example.com', 'SALES@example.org', 'Info@Example.net']
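The same matches without a flag argument, using Python's inline flag syntax:

# (?i) at the start turns on IGNORECASE for the whole expression
emails_inline = re.findall(r'(?i)\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b', text)
print("Emails:", emails_inline)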


Pra 5

1) What is a token?

Tokenization, in the realm of Natural Language Processing (NLP) and machine learning, refers to the
process of converting a sequence of text into smaller parts, known as tokens. These tokens can be as
small as characters or as long as words.

2) What is POS tagging?

One of the core tasks in Natural Language Processing (NLP) is Part-of-Speech (POS) tagging, which assigns each word in a text a grammatical category, such as noun, verb, adjective, or adverb.

3) Why do we use POS tagging?

POS tagging is useful for machine translation, named entity recognition, and information extraction, among other things. It also helps resolve ambiguity in words with multiple meanings and reveals a sentence's grammatical structure.

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')                        # tokenizer models
nltk.download('averaged_perceptron_tagger')   # POS tagger model

sentence = 'With Colab you can harness the full power of popular Python libraries to analyze'

tokens = word_tokenize(sentence)
print(tokens)

pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

output:

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.

['With', 'Colab', 'you', 'can', 'harness', 'the', 'full', 'power', 'of', 'popular', 'Python', 'libraries', 'to', 'analyze']

[('With', 'IN'), ('Colab', 'NNP'), ('you', 'PRP'), ('can', 'MD'), ('harness', 'VB'), ('the', 'DT'), ('full', 'JJ'), ('power', 'NN'), ('of', 'IN'), ('popular', 'JJ'), ('Python', 'NNP'), ('libraries', 'NNS'), ('to', 'TO'), ('analyze', 'VB')]
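To see what a Penn Treebank tag such as NNP means, NLTK ships a tag reference (needs a one-time nltk.download('tagsets')):

nltk.download('tagsets')
nltk.help.upenn_tagset('NNP')  # noun, proper, singular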
import spacy

nlp = spacy.load('en_core_web_sm')

sentence = 'Pattern is a Python library designed for web mining, natural language processing, and machine learning tasks.'

doc = nlp(sentence)
print(doc)

for token in doc:
    print(f'{token.text}:{token.pos_}')

output:

Pattern is a Python library designed for web mining, natural language processing, and machine learning tasks.

Pattern:NOUN

is:AUX

a:DET

Python:PROPN

library:NOUN

designed:VERB

for:ADP

web:NOUN

mining:NOUN

,:PUNCT

natural:ADJ

language:NOUN

processing:NOUN

,:PUNCT

and:CCONJ

machine:NOUN

learning:NOUN

tasks:NOUN

.:PUNCT
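spaCy's tag names are terse; spacy.explain turns one into a human-readable description:

print(spacy.explain('CCONJ'))  # coordinating conjunction
print(spacy.explain('ADP'))    # adposition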
Pra 6

import spacy

nlp = spacy.load("en_core_web_sm")

text = """The Gemini API gives you access to Gemini models created by Google DeepMind"""

doc = nlp(text)
print(doc)

for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

output:

The Gemini API gives you access to Gemini models created by Google DeepMind

Entity: Gemini, Label: GPE

Entity: Google DeepMind, Label: ORG
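Entity labels (and occasional quirks such as 'Gemini' being tagged GPE, a geopolitical entity) depend on the loaded model; the NER component lists its label set directly:

print(nlp.get_pipe("ner").labels)  # e.g. ('CARDINAL', 'DATE', ..., 'ORG', 'PERSON', ...)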
