Group-04 ADCProject TopicModelling
Table of Contents
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
ANNEXURE
Automated Data Collection Group-04 Term Project
Business Understanding
Blockchain technology has become a significant catalyst for change in diverse industries within
the rapidly expanding technological landscape of the present era. The decentralized and
transparent nature of this technology holds the potential to significantly transform operations,
bolster security measures, and facilitate the emergence of novel business models. The potential
uses of blockchain technology span across various sectors, including but not limited to financial
services, supply chain management, healthcare, and voting systems.
To have a more comprehensive understanding of the vast domain of blockchain technology, it
is imperative to conduct an extensive examination of the available scholarly literature. Scopus,
a widely recognized and esteemed academic database, is an indispensable platform for
retrieving various scholarly articles, conference papers, and research publications. The vast
scope of its coverage ensures the compilation of a comprehensive dataset, which is the basis
for conducting informed analysis.
Analyzing the extensive body of research on blockchain technology might be a formidable
challenge when attempting to derive significant findings. Topic modeling plays a crucial role
in this context. Topic modeling is a highly effective machine-learning technique that facilitates
the automated detection of latent themes within a substantial collection of textual data. The
process of categorizing interconnected papers into cohesive subjects provides a systematic
methodology for comprehending the diverse aspects of blockchain research.
Topic modelling extends beyond its technical nature, presenting significant potential for
academic and professional pursuits. Researchers and professionals can acquire a full
understanding of the ongoing conversation in the field of blockchain by identifying significant
issues within the extensive body of literature available.
The process of integrating a wide range of research findings into cohesive themes facilitates
the synthesis of knowledge and the development of complete idea maps. The evolution of
dominant topics through time facilitates the ability to forecast changes in research focus and
industrial trends. The use of topic extraction techniques in recommender systems improves the
discovery of relevant information. In strategic decision-making, the identified themes guide the
alignment of organizational plans with emerging research directions.
The primary objective of this project is to investigate and evaluate the five most significant
topics in the extensive domain of blockchain technology research. This investigation aims to
uncover and define the prevailing themes that influence the ongoing discourse surrounding
blockchain technology. The process of extracting significant insights from each highlighted
issue plays a crucial role in developing a comprehensive grasp of the varied applications and
difficulties associated with the technology. Visual representations are created with the purpose
of improving understanding and facilitating the communication of the identified themes.
Moreover, the primary objective of this initiative is to enhance the accessibility of research
findings to academic and corporate communities, thereby promoting informed discussions and
creating collaboration prospects.
Data Understanding
The data extracted from the Scopus platform contains various information related to scholarly
documents (papers, articles, etc.) that discuss or analyze blockchain technology. Here's a
breakdown of the different fields present in the dataset:
1. Authors: The authors who contributed to the document.
2. Author Full Names: The complete names of the authors.
3. Author(s) ID: A unique identifier assigned to each author in the Scopus database.
4. Title: The title of the research paper or document.
5. Year: The year in which the document was published.
6. Source Title: The title of the journal, conference, or source where the document was
published.
7. Volume: The volume number of the source.
8. Issue: The issue number of the source.
9. Art. No.: Article number, if applicable.
10. Page Start: The starting page number of the document within the source.
11. Page End: The ending page number of the document within the source.
12. Page Count: The total number of pages in the document.
13. Cited By: The number of times other articles have cited this document.
14. DOI: Digital Object Identifier, a unique alphanumeric string assigned to the document for
easy identification.
15. Link: URL link to the document.
16. Affiliations: The affiliations of the authors, indicating their institutional affiliations or
organizational affiliations.
17. Authors with Affiliations: The names of the authors along with their respective
affiliations.
18. Abstract: A summary of the document's content and main findings.
19. Author Keywords: Keywords provided by the authors to highlight the main concepts of
the document.
20. Index Keywords: Keywords assigned by the database to categorize the document.
21. Molecular Sequence Numbers: If applicable, sequence numbers related to molecular
structures.
22. Chemicals/CAS: Information about chemicals and their Chemical Abstracts Service
Registry Numbers.
23. Tradenames: Tradenames of products mentioned in the document.
24. Manufacturers: Names of manufacturers mentioned in the document.
25. Funding Details: Information about funding sources for the research.
26. Funding Texts: Textual descriptions of the funding details.
27. References: List of references and citations used in the document.
28. Correspondence Address: Address of the corresponding author.
29. Editors: Names of editors if the document is part of an edited volume.
30. Publisher: The publisher of the source where the document was published.
31. Sponsors: Sponsoring organizations, if applicable.
32. Conference Name: Name of the conference if the document was presented at a conference.
33. Conference Date: Date of the conference.
34. Conference Location: Location of the conference.
35. Conference Code: Code associated with the conference.
36. ISSN: International Standard Serial Number of the source.
37. ISBN: International Standard Book Number, if applicable.
38. CODEN: Code assigned to the source.
39. PubMed ID: PubMed identifier, if the document is available on PubMed.
40. Language of Original Document: The language in which the original document is written.
41. Abbreviated Source Title: Abbreviated title of the source.
42. Document Type: The type of document (e.g., article, conference paper).
43. Publication Stage: The stage of publication (e.g., in press, published).
Data Preparation
The following is a comprehensive overview of the individual steps undertaken to prepare our
data for analysis:
Transforming the preprocessed text into a Document-Term Matrix (DTM) enables the measurement of
term frequency across the documents. The matrix functions as a foundational component for
subsequent analysis, allowing the identification of prominent terms within each document and
simplifying the extraction of overarching themes.
In complex datasets, some documents may lack meaningful content after preprocessing, which can
leave rows in the Document-Term Matrix containing only zero values. Excluding these rows
guarantees that only documents with substantive material are carried into the subsequent
analysis, improving the dataset's representation of the thematic landscape of the corpus.
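This filtering step can be sketched as follows, assuming the matrix is named `DTM`:

```r
# Remove documents whose rows contain only zeros after preprocessing
row_totals <- apply(DTM, 1, sum)
DTM <- DTM[row_totals > 0, ]
```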
The computation of word frequencies is a crucial process in comprehending the relative
significance of terms within the corpus. Arranging words in descending order based on
frequency enables recognizing often occurring terms. These terms often indicate significant
themes or concepts that are extensively covered in the existing body of literature.
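A sketch of the frequency computation, assuming the Document-Term Matrix `DTM` from the previous step:

```r
# Total frequency of each term across the corpus, sorted in descending order
freq <- sort(colSums(as.matrix(DTM)), decreasing = TRUE)
head(freq, 10)  # the ten most frequent terms
```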
The utilization of visual representations of data has the potential to significantly amplify the
effectiveness of conveying analytical outcomes. The generation of a word cloud, in which the
size of each word is proportional to its frequency, offers a prompt and intuitive representation
of the most prominent terms within the collection of texts. This visualization tool facilitates
efficient comprehension of the prevailing topics in the text, benefiting both researchers and
readers. This serves as a fundamental basis for further in-depth analysis.
Modelling
The topic modeling technique is a powerful tool in the pursuit of understanding the complex
network of ideas and concepts present in the literature on blockchain technology. The following
steps delineate a complex analytical procedure designed to uncover the underlying themes
contained in preprocessed textual material. This process serves to facilitate scholars in attaining
a more profound comprehension of the discourse.
The initial step in topic modeling involves a fundamental inquiry: What is the ideal number of
topics that may effectively reveal the most significant themes? The question at hand is
addressed using the ldatuning library, which utilizes a range of metrics such as
"Griffiths2004," "CaoJuan2009," "Arun2010," and "Deveaud2014." Every metric provides
insight into the cohesion and uniqueness of themes. The analysis employs an iterative process
that explores a range of potential topic numbers, spanning from 2 to 40. This exploration uses
the Gibbs sampling approach, which aims to determine the most suitable number of topics
for the given context. This stage effectively directs the subsequent analysis, guaranteeing that
the number of topics chosen corresponds to the inherent structure of the corpus.
The convergence of coherence and uniqueness measurements can be observed through a visual
depiction of the metrics across various topic numbers. This convergence provides insight into
determining the optimal number of topics. The resulting visualization offers a tangible
foundation for selecting the most suitable number of topics and enhances the process of
extracting robust and coherent themes.
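The search over candidate topic numbers can be sketched with the `ldatuning` package; the seed value here is illustrative:

```r
library(ldatuning)

# Evaluate the four metrics for every candidate topic count from 2 to 40
result <- FindTopicsNumber(
  DTM,
  topics  = seq(2, 40, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1234),
  verbose = TRUE
)

# Plot all metrics against the number of topics to locate the optimum
FindTopicsNumber_plot(result)
```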
After determining the ideal number of topics, the analysis proceeds to a more in-depth
exploration using the Latent Dirichlet Allocation (LDA) methodology. This approach reveals the
underlying semantic organization of the preprocessed dataset. The LDA process is carefully
tuned by adjusting several parameters, including burn-in, iterations, thinning, and seed. The
outcomes are evident in the form of a depiction of the prevalent topics within the collection of
texts and the arrangement of words within those topics.
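A sketch of the model fit with the `topicmodels` package; the control values shown are typical placeholders, not the exact settings used:

```r
library(topicmodels)

# Fit a 5-topic LDA model with Gibbs sampling
ldatm <- LDA(
  DTM, k = 5, method = "Gibbs",
  control = list(burnin = 1000, iter = 2000, thin = 100, seed = 1234)
)
```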
The result of the Latent Dirichlet Allocation (LDA) analysis manifests as a matrix that exhibits
the relative prominence of each theme across the documents. In the matrix, each row is
associated with a document, while the columns reflect the detected subjects. The utilization of
this matrix facilitates a comprehensive comprehension of the relative frequency of each
thematic element within distinct papers, offering valuable insights into the range of issues
investigated within the literature on blockchain technology.
Moreover, a word matrix for each topic serves to elucidate the particular words that contribute
substantially to the comprehension of each respective topic. This stage facilitates the
qualitative analysis of the topics, as these selected words provide insight into the
fundamental principles encompassed by each theme.
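Both matrices can be extracted from the fitted model (`ldatm`) as follows:

```r
# Document-topic matrix: one row per document, one column per topic
doc_topics <- posterior(ldatm)$topics

# The ten highest-weighted terms for each topic
terms(ldatm, 10)
```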
Each topic that has been identified represents more than just a compilation of words; rather, it
represents a conceptual cluster that encompasses a distinct facet of blockchain technology. As
an illustration,
- Topic 1: Emerging Technologies: This subject matter could involve discussions regarding
the dynamic nature of technology advancements in blockchain.
- Topic 2: Data Security: The focal point of this discourse pertains to the protection and
preservation of data and information within the blockchain technology framework.
- Topic 3: Digital Innovation: This topic encompasses investigations into the innovative
capabilities of blockchain technology inside the digital domain.
- Topic 4: Technological Applications: This topic will explore the practical applications and
utilization of blockchain technology across many sectors.
- Topic 5: Blockchain Technology: This topic encompasses comprehensive discussions
focused on the fundamental principles and characteristics of blockchain.
Ultimately, the probabilities associated with categorizing documents into their respective topics
are consolidated and represented in a structured dataframe. This analysis demonstrates the degree
to which each document corresponds with the identified topics, offering a quantitative basis for
comprehending the prevalence of themes throughout the collection.
Evaluation
Perplexity is a prevalent metric for evaluating the efficacy of language models, including
topic models such as Latent Dirichlet Allocation (LDA). Nevertheless, interpreting perplexity
can be somewhat intricate and contingent upon the specific circumstances.
The log-likelihood of the LDA model (ldatm) is calculated with the line of code
`logLikelihood <- logLik(ldatm)`. The log-likelihood is an evaluative metric for the extent to
which the model accurately represents the observed data, here the document-term matrix (`DTM`).
A higher log-likelihood value signifies a closer alignment between the model and the observed
data.
The perplexity value is obtained by exponentiating the negative sum of log-likelihoods divided
by the total number of words. Perplexity evaluates a model's ability to accurately predict data
it has not previously encountered; lower perplexity scores indicate a higher level of
proficiency in predicting the observed data.
The perplexity value is printed to the console using the `print` function, and the `paste`
function is used to concatenate the string "Perplexity:" with the perplexity value. The obtained
perplexity of 482.904600513577 is deemed satisfactory. Evaluating the quality of a perplexity
value depends on the particular use case and domain under consideration: in some contexts, a
perplexity of approximately 100 may be regarded as exemplary, while in others, scores below 500
can still be deemed satisfactory.
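The computation described above can be sketched as:

```r
# Log-likelihood of the fitted model on the training DTM
logLikelihood <- logLik(ldatm)

# Perplexity: exponentiate the negative log-likelihood per word
perp <- exp(-as.numeric(logLikelihood) / sum(as.matrix(DTM)))
print(paste("Perplexity:", perp))
```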
Deployment
We have deployed our model on a new sentence to check the model's classification in real-world
scenarios. The provided sentence is, "Blockchain technology is a decentralized and
distributed digital ledger system that records transactions across multiple computers in a secure
and transparent manner.” The sentence undergoes several preprocessing steps, such as
converting to lowercase, removing punctuation, and eliminating common English stopwords,
to refine the text. Following this, a document-term matrix (DTM) is created for the
preprocessed sentence, which corresponds to the format of the DTM utilized in training the
LDA model. The topic distribution for the new phrase is computed by employing the trained
LDA model ('ldatm'), thereby estimating the sentence's probability of belonging to each
subject. The topic with the highest probability is designated as the assigned topic, and the
code produces this assignment as output. In this illustration, the sentence is assigned to
Topic 2: Data Security. This exemplifies how a trained LDA model can allocate new, unseen
sentences to specific themes based on their content.
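The deployment steps can be sketched as follows; variable names other than `ldatm` and `DTM` are illustrative:

```r
# Preprocess the new sentence the same way as the training corpus
new_text <- "Blockchain technology is a decentralized and distributed digital ledger system that records transactions across multiple computers in a secure and transparent manner."
new_corpus <- Corpus(VectorSource(new_text))
new_corpus <- tm_map(new_corpus, content_transformer(tolower))
new_corpus <- tm_map(new_corpus, removePunctuation)
new_corpus <- tm_map(new_corpus, removeWords, stopwords("english"))

# Restrict the new DTM to the vocabulary the model was trained on
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(DTM)))

# Estimate the topic distribution and report the most probable topic
topic_probs <- posterior(ldatm, newdata = new_dtm)$topics
assigned <- which.max(topic_probs)
print(paste("Assigned topic:", assigned))
```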
ANNEXURE
# Group -04
# Akila - MBAA22004
# Deepak Bhatt - MBAA22085
# Dhruv Chowdary - MBAA22025
# Pranjal Agnihotri - MBAA22052
# Shambhavi Gupta - MBAA22063
# Shivangi Agrawal - MBAA22065
# Automated Data Collection Term Project
# (Extracting Data from Scopus and conducting Topic Modelling)
###################################################################################
>
> # Keywords used in Scopus Search - Blockchain Technology
> # Filter applied: Documents limited to Business, Management and Accounting
> setwd("G:/My Drive/Term-4 Classes/Automated Data Collection")
>
> # Importing Necessary Libraries
> library(tm)
Loading required package: NLP
Warning message:
package ‘tm’ was built under R version 4.2.3
> library(wordcloud2)
> library(topicmodels)
Warning message:
package ‘topicmodels’ was built under R version 4.2.3
> library(ggplot2)
Attaching package: ‘ggplot2’
The following object is masked from ‘package:NLP’:
    annotate
Warning message:
package ‘ggplot2’ was built under R version 4.2.3
>
> # Importing the Data set
> data<-read.csv("scopus.csv")
> DATA<-data$Abstract
>
> # Text Analysis----------------------
> Docs<-Corpus(VectorSource(DATA))
> Docs
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 4249
> writeLines(as.character(Docs[[6]]))
The working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps (MFDAs), which deliver an innovative method to interact with and offer high-quality services to customers. This study pinpoints the imperative factors affecting the customer's attitude and continued intention in light of the task technology fit (TTF) model. The required data were collected from MFDA users and analyzed by the structural equation modeling technique via Amos-23 and SPSS-22. The results confirm that customer rating, ordering review, food tracking, navigational design, and user self-efficacy positively impact TTF. Further, self-efficacy positively moderates the relationship between visual design and TTF, navigational design and TTF, and food tracking and TTF. Moreover, TTF positively influences attitude and continued intention, and in turn, attitude positively influences continued intention. Additionally, blockchain technology (BT) enabled traceability positively moderates the relationship between TTF, attitudes, and continued intention to use MFDAs. The developers of MFDAs should consider how customers perceive BT-enabled traceability and take steps to embrace it to increase customer trust in MFDAs. Furthermore, the theoretical and managerial applications are explained in detail so that developers can offer what MFDA users need. © 2023 Elsevier Ltd
>
> #Cleaning the data
> Docs<-tm_map(Docs,content_transformer(tolower)) # Transforming the text
to lowercase
Warning message:
In tm_map.SimpleCorpus(Docs, content_transformer(tolower)) :
transformation drops documents
> writeLines(as.character(Docs[[6]]))
the working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps (mfdas), which deliver an innovative method to interact with and offer high-quality services to customers. this study pinpoints the imperative factors affecting the customer's attitude and continued intention in light of the task technology fit (ttf) model. the required data were collected from mfda users and analyzed by the structural equation modeling technique via amos-23 and spss-22. the results confirm that customer rating, ordering review, food tracking, navigational design, and user self-efficacy positively impact ttf. further, self-efficacy positively moderates the relationship between visual design and ttf, navigational design and ttf, and food tracking and ttf. moreover, ttf positively influences attitude and continued intention, and in turn, attitude positively influences continued intention. additionally, blockchain technology (bt) enabled traceability positively moderates the relationship between ttf, attitudes, and continued intention to use mfdas. the developers of mfdas should consider how customers perceive bt-enabled traceability and take steps to embrace it to increase customer trust in mfdas. furthermore, the theoretical and managerial applications are explained in detail so that developers can offer what mfda users need. © 2023 elsevier ltd
>
> d1<-content_transformer(function(x,pattern){return(gsub(pattern,"",x))})
> Docs<-tm_map(Docs,d1,"tool")
Warning message:
In tm_map.SimpleCorpus(Docs, d1, "tool") : transformation drops documents
> writeLines(as.character(Docs[[6]]))
the working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps (mfdas), which deliver an innovative method to interact with and offer high-quality services to customers. this study pinpoints the imperative factors affecting the customer's attitude and continued intention in light of the task technology fit (ttf) model. the required data were collected from mfda users and analyzed by the structural equation modeling technique via amos-23 and spss-22. the results confirm that customer rating, ordering review, food tracking, navigational design, and user self-efficacy positively impact ttf. further, self-efficacy positively moderates the relationship between visual design and ttf, navigational design and ttf, and food tracking and ttf. moreover, ttf positively influences attitude and continued intention, and in turn, attitude positively influences continued intention. additionally, blockchain technology (bt) enabled traceability positively moderates the relationship between ttf, attitudes, and continued intention to use mfdas. the developers of mfdas should consider how customers perceive bt-enabled traceability and take steps to embrace it to increase customer trust in mfdas. furthermore, the theoretical and managerial applications are explained in detail so that developers can offer what mfda users need. © 2023 elsevier ltd
>
> Docs<-tm_map(Docs, removePunctuation) # Removing Punctuation
Warning message:
In tm_map.SimpleCorpus(Docs, removePunctuation) :
transformation drops documents
> writeLines(as.character(Docs[[6]]))
the working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps mfdas which deliver an innovative method to interact with and offer highquality services to customers this study pinpoints the imperative factors affecting the customers attitude and continued intention in light of the task technology fit ttf model the required data were collected from mfda users and analyzed by the structural equation modeling technique via amos23 and spss22 the results confirm that customer rating ordering review food tracking navigational design and user selfefficacy positively impact ttf further selfefficacy positively moderates the relationship between visual design and ttf navigational design and ttf and food tracking and ttf moreover ttf positively influences attitude and c