NLP Subject Orientation SH23
Department Mission
Pre-requisite: Theory of Computer Science, System Programming & Compiler Construction
Course Outcomes: Students will be able to:
1. Describe the field of natural language processing.
2. Design language models for word-level, syntactic, semantic and pragmatic analysis in text processing.
3. Design various language models and POS tagging techniques.
4. Design, implement and test algorithms for semantic analysis.
5. Formulate discourse segmentation and anaphora resolution.
6. Apply NLP techniques to design real world NLP applications.
Textbooks:
T1. Daniel Jurafsky and James H. Martin, Speech and Language Processing, Second
Edition, Prentice Hall, 2008.
T2. Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural
Language Processing, MIT Press, 1999.
References:
R1. Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval, Oxford University Press, 2008.
R2. Daniel M. Bikel and Imed Zitouni, Multilingual Natural Language Processing Applications: From Theory to Practice, IBM Press, 2013.
R3. Alexander Clark, Chris Fox and Shalom Lappin, The Handbook of Computational Linguistics and Natural Language Processing, John
Wiley and Sons, 2012.
R4. Nitin Indurkhya and Fred J. Damerau, Handbook of Natural Language Processing, Second Edition, Chapman and Hall/CRC Press,
2010.
R5. Niel J le Roux and Sugnet Lubbe, A Step by Step Tutorial: An Introduction into R Application and Programming.
R6. Steven Bird, Ewan Klein and Edward Loper, Natural language processing with Python: Analyzing text with the natural language
toolkit, O’Reilly Media, 2009.
Digital References or Web:
1. http://www.cse.iitb.ac.in/~cs626-449
2. http://cse24-iiith.virtual-labs.ac.in/#
3. https://nptel.ac.in/courses/106105158
Assessment
Internal Assessment
Assessment consists of two class tests of 20 marks each. The first class test is to be conducted when approximately 40%
of the syllabus is completed, and the second class test when an additional 40% of the syllabus is completed. The duration
of each test shall be one hour.
3. Questions will be mixed in nature (for example, suppose Q.2 has part (a) from module 3; then part (b) will
be from any module other than module 3).
5. In the question paper, the weightage of each module will be proportional to the number of respective lecture hours as
mentioned in the syllabus.
No. of Hours, Weightage, Nature of Questions
Unit    No. of Hours    Weightage    Nature of Questions
I       3               8            Theory
II      9               23           Theory + Problems
III     10              26           Theory + Problems
IV      7               18           Theory + Problems
V       5               13           Theory
VI      5               13           Theory
Module 1: Introduction to NLP (3 Hours)
Detailed Content:
Origin & History of NLP;
Language, Knowledge and Grammar in language processing;
Stages in NLP;
Ambiguities and their types in English and Indian Regional Languages;
Challenges of NLP;
Applications of NLP
References: Internet; Pawan Goyal NPTEL Course; Books T1 & R1
Sample Questions
Module 2: Word Level Analysis (9 Hours)
Detailed Content:
Basic Terms: Tokenization, Stemming, Lemmatization;
Survey of English Morphology, Inflectional Morphology, Derivational Morphology;
Regular expression with types;
Morphological Models: Dictionary lookup, finite state morphology;
Morphological parsing with FST (Finite State Transducer);
Lexicon free FST Porter Stemmer algorithm;
N-grams and their variations: Bigram, Trigram;
Simple (Unsmoothed) N-grams; N-gram Sensitivity to the Training Corpus;
Unknown Words: Open versus closed vocabulary tasks;
Evaluating N-grams: Perplexity;
Smoothing: Laplace Smoothing, Good-Turing Discounting
References: T1 & R2; Pawan Goyal NPTEL Course
Exercises on stemming with the Porter stemmer, n-gram, k-gram, Laplace smoothing,
Good-Turing, FSA, FST
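As a starting point for the n-gram and Laplace smoothing exercises, the following is a minimal sketch of a bigram model with add-one smoothing and a perplexity calculation. The function names, the `<s>` and `</s>` boundary markers, and the two-sentence toy corpus are illustrative assumptions, not part of the syllabus. The sketch assumes a closed vocabulary, so every test word must have been seen in training.

```python
import math
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = [START] + list(tokens) + [END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def laplace_prob(prev, word, unigrams, bigrams):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    # V counts every type that can be predicted, so START is excluded.
    vocab_size = sum(1 for w in unigrams if w != START)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def perplexity(tokens, unigrams, bigrams):
    """Perplexity of one tokenized sentence under the smoothed bigram model."""
    padded = [START] + list(tokens) + [END]
    pairs = list(zip(padded, padded[1:]))
    log_prob = sum(math.log(laplace_prob(p, w, unigrams, bigrams)) for p, w in pairs)
    return math.exp(-log_prob / len(pairs))

# Toy corpus (assumed for illustration only).
corpus = [["i", "like", "nlp"], ["i", "like", "python"]]
unigrams, bigrams = train_bigram(corpus)
print(laplace_prob("i", "like", unigrams, bigrams))
print(perplexity(["i", "like", "nlp"], unigrams, bigrams))
```

With this corpus, count(i, like) = 2, count(i) = 2 and V = 5, so the smoothed estimate of P(like | i) is (2 + 1) / (2 + 5) = 3/7, compared with the unsmoothed value of 1.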
Module 3: Syntax Analysis (10 Hours)
Detailed Content:
Part-Of-Speech (POS) tagging; Tag set for English (UPenn Treebank);
Difficulties/Challenges in POS tagging;
Rule-based, Stochastic and Transformation-based tagging;
Generative Model: Hidden Markov Model (HMM Viterbi) for POS tagging;
Issues in HMM POS tagging;
Discriminative Model: Maximum Entropy model, Conditional Random Field (CRF);
Parsers: Top down and Bottom up;
Modeling constituency;
Bottom Up Parser: CYK, PCFG (Probabilistic Context Free Grammar), Shift Reduce Parser;
Top Down Parser: Earley Parser, Predictive Parser
References: T1, T2; Pawan Goyal NPTEL Course
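To make the CYK bottom-up parser listed above concrete, here is a minimal recognizer sketch. It assumes the grammar is already in Chomsky Normal Form; the dictionary-based grammar encoding and the three-word toy grammar are hypothetical choices made for this illustration.

```python
from itertools import product

def cyk_recognize(tokens, lexical, binary, start="S"):
    """Return True if the CNF grammar derives the token sequence from `start`.

    lexical: maps a word to the set of non-terminals A with a rule A -> word
    binary:  maps a pair (B, C) to the set of non-terminals A with a rule A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive tokens[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][i] = set(lexical.get(token, ()))
    for span in range(2, n + 1):            # length of the substring
        for i in range(n - span + 1):       # start of the substring
            j = i + span - 1                # end of the substring
            for k in range(i, j):           # split point between the two halves
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= binary.get((left, right), set())
    return start in table[0][n - 1]

# Toy grammar (assumed): S -> NP VP, VP -> V NP, NP -> she | fish, V -> eats
lexical = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
print(cyk_recognize(["she", "eats", "fish"], lexical, binary))
```

Extending each table cell to store probabilities and back-pointers instead of a plain set turns the same loop structure into a PCFG parser.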
Exercises on HMM Model: Formation of Emission Probability Matrix, State Transition Matrix,
HMM Viterbi Algorithm
Exercises on Parser: Bottom Up Parser: CYK, PCFG (Probabilistic Context Free Grammar), Shift Reduce
Parser; Top Down Parser: Earley Parser, Predictive Parser
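For the HMM exercises, the Viterbi algorithm can be sketched as below. The two-tag state set and every probability value are made-up toy numbers chosen only to show how the start, transition and emission tables are used; in an actual exercise they would be estimated from a tagged corpus.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence and its probability."""
    # best[t][s]: probability of the best path that ends in state s at position t
    best = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s].get(observations[t], 0.0)
            back[t][s] = prev
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy model (all numbers assumed): N = noun, V = verb
states = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}                       # initial tag probabilities
trans_p = {"N": {"N": 0.3, "V": 0.7},                # state transition matrix
           "V": {"N": 0.6, "V": 0.4}}
emit_p = {"N": {"fish": 0.8, "swim": 0.2},           # emission probability matrix
          "V": {"fish": 0.3, "swim": 0.7}}
print(viterbi(["fish", "swim"], states, start_p, trans_p, emit_p))
```

For the input `fish swim`, the best path is N then V with probability 0.8 × 0.8 × 0.7 × 0.7 = 0.3136. For longer sentences, log probabilities are usually preferred to avoid numerical underflow.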
Module 4: Semantic Analysis (7 Hours)
Detailed Content:
Introduction, meaning representation; Lexical Semantics;
Corpus study; Study of various language dictionaries like WordNet, BabelNet;
Relations among lexemes & their senses: Homonymy, Polysemy, Synonymy, Hyponymy;
Semantic Ambiguity; Word Sense Disambiguation (WSD);
Knowledge based approach (Lesk’s Algorithm), Supervised (Naïve Bayes, Decision List),
Introduction to Semi-supervised method (Yarowsky), Unsupervised (Hyperlex)
References: T1, T2; NPTEL by Pawan Goyal
Small Exercises or Think Questions on the following topics:
Knowledge based approach (Lesk’s Algorithm), Supervised (Naïve Bayes, Decision List),
Introduction to Semi-supervised method (Yarowsky), Unsupervised (Hyperlex)
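As a warm-up for the knowledge-based approach, here is a minimal sketch of the simplified Lesk idea: pick the sense whose dictionary gloss shares the most words with the context. The sense labels, glosses and stop-word list are invented for this example and are not taken from WordNet or any other dictionary.

```python
def simplified_lesk(senses, context_tokens, stopwords=frozenset()):
    """Return the sense whose gloss overlaps most with the context words.

    senses: maps a sense label to its gloss (a plain string)
    """
    context = {t.lower() for t in context_tokens} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        gloss_tokens = set(gloss.lower().split()) - stopwords
        overlap = len(gloss_tokens & context)
        if overlap > best_overlap:  # ties keep the sense listed first
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for the ambiguous word "bank".
bank_senses = {
    "bank_finance": "an institution that accepts deposits of money and lends money",
    "bank_river": "the sloping land beside a river or a lake",
}
stop = frozenset({"a", "an", "and", "of", "on", "or", "that", "the", "in"})

sentence = "he sat on the bank of the river and watched the water".split()
print(simplified_lesk(bank_senses, sentence, stop))
```

A useful think question: what happens when no gloss word appears in the context, and how would adding example sentences for each sense change the result?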
Module 5: Pragmatic & Discourse Analysis (5 Hours)
Detailed Content:
Discourse: Reference Resolution, Reference Phenomena, Syntactic & Semantic constraints on coherence;
Anaphora Resolution using Hobbs and Centering Algorithm
References: T1, T2
Module 6: Applications of NLP (5 Hours)
Detailed Content:
Case studies on (preferably in a regional language):
a) Machine translation;
b) Text Summarization;
c) Sentiment analysis;
d) Information retrieval;
e) Question Answering system
References: T1, R2, R2; NPTEL: a, b, c, d; Sharavari Madam’s paper: QA
CSDL7013: Natural Language Processing Lab
Lab Objectives :
1. To understand the key concepts of NLP.
2. To learn various phases of NLP.
3. To design and implement various language models and POS tagging techniques.
4. To understand various NLP algorithms.
5. To learn NLP applications such as Information Extraction, Sentiment Analysis, Question Answering,
Machine Translation, etc.
6. To design and implement applications based on natural language processing
Lab Outcomes:
At the end of the course, the student should be able to:
1. Apply various text processing techniques.
2. Design language model for word level analysis.
3. Model linguistic phenomena with formal grammar.
4. Design, implement and analyze NLP algorithms.
5. Apply NLP techniques to design real world NLP applications such as machine translation,
sentiment analysis, text summarization, information extraction, question answering systems, etc.
6. Implement proper experimental methodology for training and evaluating empirical NLP systems.
Suggested List of Experiments: (Select a Case Study (Mini Project) for performing the experiments)
Sr. No. Name of the Experiment
1 Study various applications of NLP and formulate the Problem Statement for the Mini Project based on chosen
real world NLP applications: [Machine Translation, Text Categorization, Text Summarization, Chatbot,
Plagiarism, Spelling & Grammar Checkers, Sentiment / Opinion Analysis, Question Answering, Personal
Assistant, Tutoring Systems, etc.]
2 Apply various text preprocessing techniques for any given text: Tokenization and Filtration & Script
Validation.
3 Apply various other text preprocessing techniques for any given text: Stop Word Removal, Lemmatization
/ Stemming.
4 Perform morphological analysis and word generation for any given text.
5 Implement N-Gram model for the given text input.
6 Study the different POS taggers and perform POS tagging on the given text.
7 Perform Chunking for the given text input.
8 Implement Named Entity Recognizer for the given text input.
9 Implement Text Similarity Recognizer for the chosen text documents.
10 Exploratory data analysis of a given text (Word Cloud)
Mini Project Report: For any one chosen real world NLP application. Implementation and
Presentation of Mini Project
iNLTK package for Indian languages: Hindi, Punjabi, Marathi, Bengali, Sanskrit
PyTorch
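One possible baseline for Experiment 9 (text similarity) is cosine similarity over bag-of-words counts, sketched below with only the Python standard library. The function names are illustrative, and the regular-expression tokenizer is an assumption that only suits lowercase Latin-script text; Indian-language documents would need a script-aware tokenizer in its place.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and digits (Latin script only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two documents."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc1 = "Natural language processing helps computers understand text."
doc2 = "Computers use natural language processing to understand human text."
print(round(cosine_similarity(doc1, doc2), 3))
```

Raw counts weight frequent words heavily, so stop word removal (Experiment 3) or TF-IDF weighting is a natural next step for the mini project.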
Term Work:
1 Term work should consist of 8 experiments and a mini project.
2 The Case Study / Mini project is to be conducted on Indian languages (preferably).
3 The final certification and acceptance of term work ensures satisfactory performance of
laboratory work and minimum passing marks in term work.
Thank You