Theory and Practice in Language Studies, Vol. 11, No. 9, pp. 1041-1049, September 2021
DOI: http://dx.doi.org/10.17507/tpls.1109.09
Fanghui Hu
School of Foreign Languages, Jining Medical University, Rizhao, Shandong Province, China
Abstract—Corpora play an important role in linguistics research and foreign language teaching. At present, corpus-based research in China mainly relies on retrieval tools such as WordSmith and AntConc. The NLTK library, which is based on the Python language, provides more flexible and richer research methods, and it uses unified data formats, avoiding the trouble of converting between different data types. At the same time, with the help of Python's numerous third-party libraries, it can make up for the shortcomings of other tools in syntactic analysis, graphic rendering, regular expression retrieval and other aspects. Focusing on the main steps in corpus research, such as text cleaning, lemmatization, part-of-speech tagging and retrieval statistics, this paper takes the US presidential inaugural addresses included in NLTK as an example to show how to use this toolkit to process language data, and introduces the application of the Python NLTK library in corpus research.
I. INTRODUCTION
At present, many fields of linguistic research pay increasing attention to the application of corpora, because taking massive amounts of authentic language data as the research object makes the research scientific and accurate (Feng, 2020). With the rapid development of computer technology, the corpus has entered a stage of systematic theoretical innovation and extensive application in the field of linguistics. More and more researchers from different academic backgrounds have joined corpus research, and many research fields, such as lexicography, sociolinguistics, stylistic analysis and pragmatics, can no longer do without corpora.
Both the construction of a corpus and research based on it are inseparable from the processing of corpus data. Currently, the commonly used corpus processing tools include WordSmith, AntConc, Range, PowerGREP, etc. Most of these tools provide functions such as retrieval, segmentation, substitution and statistics. However, they are limited to the level of words and lexical collocations rather than the syntactic and discoursal levels. In addition, because of the limitations of their design, they cannot be flexibly customized, so researchers may have to learn to operate several different programs. The NLTK library, based on the computer programming language Python, is a toolkit for natural language processing. The toolkit not only has the retrieval function common to the tools above, but also offers functions such as text cleaning, lemmatization, part-of-speech tagging, grammatical analysis and semantic analysis. With this toolkit, researchers can complete the whole process from corpus construction to retrieval and analysis in one environment, eliminating the inconvenience of switching between different programs and converting data, and further expanding the scope and depth of research.
Covering the application of the Python NLTK library to text cleaning, lemmatization, part-of-speech tagging and retrieval statistics, this paper takes the American presidential inaugural addresses in the corpus as an example to introduce how to use this natural language processing toolkit to process language data. In this way, corpus researchers can become familiar with the toolkit and use it, their research tools will be enriched, and corpus linguistics research will develop more rapidly.
This paper is supported by the Education Research Project of Jining Medical University in 2018: Construction of a micro-course resource library of Fundamentals of Program Design for medical students (No. 18049).
Natural language processing is an important branch of computer science and artificial intelligence. It mainly studies various theories and methods that enable effective communication between humans and computers in natural language. At present, the main programming language used in natural language processing is Python.
As a high-level programming language, Python, with its elegant, concise and clear syntax, is very suitable for people outside computer science to learn. In addition, Python is supported by a large number of third-party libraries, which have led to its wide application in web crawling, data analysis, machine learning, artificial intelligence, natural language processing and other fields, so that Python is now widely regarded as an excellent programming language.
NLTK (Natural Language Toolkit) is one of the most widely used Python libraries in natural language processing. NLTK is a Python library that can process natural language text quickly and easily. The toolkit was developed at the University of Pennsylvania as a research and teaching tool for natural language processing. NLTK has a large number of built-in corpora, including various types of text material such as novels, news, online chat texts and film reviews. These include the Brown Corpus, the Gutenberg Corpus, the Inaugural Address Corpus, the Reuters Corpus, etc. (Kambhampati, 2019). In addition, NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. Working together with Python's powerful standard library and other third-party libraries, its results can be processed further, which provides solid support for handling complex texts (Li, 2019).
The NLTK library is not part of the Python standard library, so it needs to be downloaded before use. The platform used in this paper is Windows, the Python version is 3.6.2, and NLTK is downloaded and installed with the pip tool. Running "pip install nltk" on the command line completes the automatic download and installation, as shown in Figure 1.
After NLTK is installed, the necessary datasets and models need to be installed before specific functions can be used. You can install the packages by running the following code in Python's IDLE (Integrated Development and Learning Environment):
>>>import nltk
>>>nltk.download()
This will open the graphical NLTK Downloader, in which you can download various corpora, models, etc., as shown in Figure 2.
You can select the collection you want in the "Collections" tab. It is recommended to select "all" to install all the collections. If you need the corpora related to the NLTK book, you can select "book" and then click the "Download" button. Texts such as Moby Dick, Sense and Sensibility, the Book of Genesis, the Inaugural Address Corpus and so on then become available. If you only want to download a particular corpus, you can switch to the "Corpora" tab and select the corresponding corpus to download, such as "inaugural".
If you only need the Inaugural Address Corpus, you can also download it with the following code in IDLE:
>>>nltk.download('inaugural')
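The code listing referred to in the next paragraph did not survive extraction. The following is a minimal sketch of the cleaning steps it describes, assuming the stopwords data has already been downloaded; the variable names and the exact order of operations are assumptions, chosen to stay consistent with the filtered_words list used in the later examples:
>>>from nltk.corpus import inaugural, stopwords # Import the inaugural corpus and the stop word list
>>>words = [ w.lower() for w in inaugural.words() if w.isalpha() ] # Keep alphabetic tokens and convert them to lowercase
>>>stop_words = set( stopwords.words('english') ) # Build a stop word set for fast lookup
>>>filtered_words = [ w for w in words if w not in stop_words ] # Filter out the stop words
>>>T = len( set(words) ) / len( words ) # Lexical richness: number of distinct words divided by the total number of words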
In the above code, after importing the corpus, the isalpha() and lower() methods are used to keep only the English words in the text and unify them into lowercase, and then the word list is filtered with the stopwords corpus. The built-in Python function len() is used to count the imported words, and the set() function is used to eliminate duplicate words so that the lexical richness T can be calculated. Lexical richness is used to analyze the number of distinct words in a text and reflects its overall use of vocabulary. The lexical richness of the Inaugural Address Corpus is 6.713%. The larger the T value, the richer the vocabulary of the text, so the use of vocabulary in the text can be displayed numerically and intuitively.
C. Part of Speech Tagging and Lemmatization
English words have different forms, such as singular and plural forms, tenses, and so on. For example, "do" has five forms: "do", "does", "did", "done" and "doing". In the actual study of words, different forms of the same word need to be combined and treated as the same word. This process is called lemmatization. The purpose of lemmatization is to reduce the different forms of a word to a common base form. If word forms are not lemmatized, there will be a large deviation in the statistical results. Lemmatization can be achieved with the WordNetLemmatizer module provided by NLTK (Deng, 2017). The lemmatize() method performs the lemmatization; its first parameter is the word, and its second parameter is the part of speech of the word, such as noun, verb or adjective. The method returns the lemmatized form of the input word.
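For illustration, a minimal interactive example of the method follows; the input words here are illustrative and not taken from the corpus data:
>>>from nltk.stem import WordNetLemmatizer # Import the lemmatization tool
>>>wnl = WordNetLemmatizer()
>>>wnl.lemmatize( 'doing' , 'v' ) # Lemmatize "doing" as a verb
'do'
>>>wnl.lemmatize( 'citizens' , 'n' ) # Lemmatize "citizens" as a noun
'citizen'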
Therefore, the part of speech of each word needs to be determined before lemmatization in order to obtain accurate results. Part-of-speech tagging is the process of automatically marking the part of speech of every word in a text according to the contextual information in the text. That is, corresponding labels are attached to the nouns, adjectives, verbs and other words in the text to facilitate retrieval and processing (Liu, 2015). Part-of-speech tagging can be achieved by calling the pos_tag() method in NLTK. The specific code is as follows:
>>>from nltk import pos_tag # Import pos_tag
>>>from nltk.corpus import wordnet # Import Semantic Dictionary wordnet
>>>from nltk.stem import WordNetLemmatizer # Import lemmatization tool
>>>wnl = WordNetLemmatizer()
>>>words_tag = pos_tag( filtered_words ) # Add part of speech tags
>>>original_words = [ wnl.lemmatize(i, j[0].lower()) if j[0].lower() in [ 'a' , 'n' , 'v' ] else wnl.lemmatize(i) for i, j in
words_tag ] #Complete lemmatization according to part of speech
>>>words_tag[:30] # Part-of-speech tagging results (first 30 items)
>>>original_words[:30] # Lemmatization results (first 30 items)
In the above code, the necessary libraries and modules are imported first, and then the pos_tag() method is called for part-of-speech tagging. Finally, the lemmatize() method is used to lemmatize all the words in the word list.
Part-of-speech tagging not only makes lemmatization accurate, but also helps to analyze sentence components and divide sentence structure. Part-of-speech tagging adds a part-of-speech tag to each word, as shown in Figure 4. For example, "fellow" is marked as an adjective, "citizens" is marked as a plural noun, and "among" is marked as a preposition.
The lemmatization results are shown in Figure 5. Through comparison, we can see that words such as "citizens", "filled", "forms" and "flags" are successfully restored to "citizen", "fill", "form" and "flag".
Probability statistics, the most commonly used mathematical analysis method in NLTK, is used for data processing and analysis of text. In Python, the frequency functions defined in NLTK can be used to count word frequency, word length and other statistics for the words, collocations, common expressions or symbols that appear in a text. FreqDist() in NLTK implements word frequency statistics. First call the function to create a frequency distribution; then the most_common(n) method can be called to extract high-frequency words from the distribution, and the tabulate(n) method can be called to output the results as a table, where the parameter n is the number of words to extract (Bird et al., 2009). Some of the results are shown in Figure 7, and the specific code is as follows:
>>>fdist = nltk.FreqDist( filtered_words ) # Create a frequency distribution for the cleaned and filtered words
>>>fdist.most_common(30) # Extract the top 30 high-frequency words
>>>fdist['target word'] # Use the created frequency distribution to look up the number of occurrences of a target word
>>>fdist.tabulate() # Output in tabular form
E. Graphic Display
In addition to outputting the data results directly, you can also use Python's third-party libraries for secondary processing of the data and display the results in visual form, which is more intuitive.
Matplotlib is a visualization library for Python that can be used to draw statistical charts for structured data, such as histograms, pie charts, line charts, bar charts and so on. The statistical data obtained in the previous step are displayed as a frequency line chart, as shown in Figure 8, from which we can compare the frequencies of different words. The code is as follows:
>>>fdist.plot(30) # The top 30 items with the highest frequency are shown in line charts
You can also use the dispersion_plot() method to show the positions of words in the text in the form of a dispersion plot. The implementation code is shown below, and the results are shown in Figure 9. It can be seen from the figure that the frequency of "us" in recent years is significantly higher than in the early period. Combined with the chronological structure of the Inaugural Address Corpus, we can see that the frequencies of different words in the speeches differ significantly over time.
>>>text.dispersion_plot( ['government', 'people', 'us', 'citizens'] )
# Show the position of "government", "people", "us" and "citizens" in the text in the form of discrete graph
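The dispersion_plot() call above operates on an nltk.Text object named text, whose construction is not shown in the extracted listings; a minimal sketch of how it could be built from the corpus is:
>>>text = nltk.Text( inaugural.words() ) # Wrap the corpus words in a Text object that supports dispersion plots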
An interesting feature of the Inaugural Address Corpus is its temporal dimension, so we can compare the frequencies of keywords used in inaugural addresses in different years to see how word usage changes over time. We can use the NLTK ConditionalFreqDist() method to see how many times each keyword appeared in the speeches over time. The following code takes the two keywords "american" and "citizen" as an example. First, w.lower() is used to convert the words in the Inaugural Address Corpus into lowercase, then startswith() is used to check whether they start with the target words "american" and "citizen", and finally the frequency of the words in each speech text is counted. The statistical results are displayed as a line chart, as shown in Figure 10. It can be seen from the figure that the word "citizen" peaked in the text of 1941. The specific implementation code is as follows:
>>> cfd = nltk.ConditionalFreqDist( ( target, fileid[:4] )
for fileid in inaugural.fileids()
for w in inaugural.words( fileid )
for target in [ 'american', 'citizen' ]
if w.lower().startswith( target ))
#Use the conditional probability distribution method ConditionalFreqDist
>>> cfd.plot() # The frequency of two keywords is compared with the line chart
In addition, the word cloud is also an effective form of data display. A word cloud, also known as a text cloud, is a visual display of the high-frequency words in a text. It filters out the large amount of low-frequency, low-value text information so that the viewer can grasp the theme of the text at a glance. The wordcloud library in Python can be used to generate all kinds of attractive word cloud images. In this case, the lemmatization results are displayed as a word cloud. First, import the wordcloud library and the Matplotlib library, then call the generate() function of the WordCloud module to generate the word cloud. Finally, use the pyplot module of the Matplotlib library to display it. The word cloud image is shown in Figure 11, and the specific code is as follows:
>>>from wordcloud import WordCloud # Import WordCloud
>>>import matplotlib.pyplot as plt # Import pyplot
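The remainder of this listing did not survive extraction; a minimal completion consistent with the description above (the background color and the use of the original_words list are assumptions) might be:
>>>wc = WordCloud( background_color = 'white' ).generate( ' '.join( original_words ) ) # Generate the word cloud from the lemmatized words
>>>plt.imshow( wc, interpolation = 'bilinear' ) # Render the word cloud image
>>>plt.axis('off') # Hide the axes
>>>plt.show() # Display the figure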
IV. CONCLUSION
NLTK, based on the Python language, supports a large number of corpora and, with its powerful functions, is widely applied in natural language processing, computational linguistics, scientific computing and analysis, and other areas. At present, in corpus-based research in China, the commonly used tools are WordSmith, AntConc, Range, etc., while Python's NLTK is rarely used. The reason is that most researchers do not have the Python programming skills needed to take full advantage of NLTK's capabilities. Focusing on the main steps in corpus research and taking the US presidential inaugural address corpus as an example, this paper introduces how to use Python's NLTK to process the corpus and how to use Python's third-party libraries for data visualization, so that corpus researchers can become familiar with these tools and use them, enrich their research toolset, promote the development of corpus linguistics research, and promote the cross-disciplinary application of computer technology.
REFERENCES
[1] Deng Qingqiong, Peng Weiming, Yin Gan. (2017). A case study of practical word frequency statistics in Python teaching.
Computer Education 12, 20-27.
[2] Feng Min. (2020). Research on corpora in College English Grammar Teaching. China Journal of Multimedia & Network
Teaching 11, 150-152.
[3] Kambhampati Kalyana Kameswari, J Raghaveni, R. Shiva Shankar, Ch. Someswara Rao. (2019). Predicting Election Results
using NLTK. International Journal of Innovative Technology and Exploring Engineering 9.1, 4519-4529.
[4] Li Junfei. (2019). Research on the Application of Natural Language Processing Toolkit in College English Teaching. Education
Modernization 92, 136-137.
[5] Li Chen, Liu Weiguo. (2019). Chinese Text Information Extraction Based on NLTK. Computer Systems & Applications 28.1,
275−278.
[6] Liu Xu. (2015). The Application of NLTK Toolkit Based on Python in Corpus Research. Journal of Kunming Metallurgy
College 31.5, 65-69.
[7] Liu Weiguo, Li Chen. (2019). Case design of NLTK module application in Python programming teaching. Computer Education
3, 92-97.
[8] Bird Steven, Klein Ewan, Loper Edward. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol, CA: O'Reilly Media.
[9] Wiebke Wagner. (2010). Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing
Text with the Natural Language Toolkit. Language Resources and Evaluation 44.4, 421-424.
Meng Wang received the B.S. and M.S. degrees in computer science from Shandong Normal University, China, in 2003 and 2006, respectively.
He is an associate professor in the School of Medical Information Engineering, Jining Medical University, Rizhao, Shandong, China. His current research interests include computer application technology and the application of Python in English research.