Group-04 ADCProject TopicModelling
Table of Contents
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment
ANNEXURE
Automated Data Collection Group-04 Term Project
Business Understanding
Blockchain technology has become a significant catalyst for change in diverse industries within
the rapidly expanding technological landscape of the present era. The decentralized and
transparent nature of this technology holds the potential to significantly transform operations,
bolster security measures, and facilitate the emergence of novel business models. The potential
uses of blockchain technology span across various sectors, including but not limited to financial
services, supply chain management, healthcare, and voting systems.
To have a more comprehensive understanding of the vast domain of blockchain technology, it
is imperative to conduct an extensive examination of the available scholarly literature. Scopus,
a widely recognized and esteemed academic database, is an indispensable platform for
retrieving various scholarly articles, conference papers, and research publications. The vast
scope of its coverage ensures the compilation of a comprehensive dataset, which is the basis
for conducting informed analysis.
Analyzing the extensive body of research on blockchain technology might be a formidable
challenge when attempting to derive significant findings. Topic modeling plays a crucial role
in this context. Topic modeling is a highly effective machine-learning technique that facilitates
the automated detection of latent themes within a substantial collection of textual data. The
process of categorizing interconnected papers into cohesive subjects provides a systematic
methodology for comprehending the diverse aspects of blockchain research.
Topic modelling extends beyond its technical nature, presenting significant potential for
academic and professional pursuits. Researchers and professionals can acquire a full
understanding of the ongoing conversation in the field of blockchain by identifying significant
issues within the extensive body of literature available.
The process of integrating a wide range of research findings into cohesive themes facilitates
the synthesis of knowledge and the development of complete idea maps. The evolution of
dominant topics through time facilitates the ability to forecast changes in research focus and
industrial trends. The use of topic extraction techniques in recommender systems improves the
discovery of relevant information. In strategic decision-making, the identified themes guide the
alignment of organizational plans with emerging research directions.
The primary objective of this project is to investigate and evaluate the five most significant
topics in the extensive domain of blockchain technology research. This investigation aims to
uncover and define the prevailing themes that influence the ongoing discourse surrounding
blockchain technology. The process of extracting significant insights from each highlighted
issue plays a crucial role in developing a comprehensive grasp of the varied applications and
difficulties associated with the technology. Visual representations are created with the purpose
of improving understanding and facilitating the communication of the identified themes.
Moreover, the primary objective of this initiative is to enhance the accessibility of research
findings to academic and corporate communities, thereby promoting informed discussions and
creating collaboration prospects.
Data Understanding
The data extracted from the Scopus platform contains various information related to scholarly
documents (papers, articles, etc.) that discuss or analyze blockchain technology. Here's a
breakdown of the different fields present in the dataset:
1. Authors: The authors who contributed to the document.
2. Author Full Names: The complete names of the authors.
3. Author(s) ID: A unique identifier assigned to each author in the Scopus database.
4. Title: The title of the research paper or document.
5. Year: The year in which the document was published.
6. Source Title: The title of the journal, conference, or source where the document was
published.
7. Volume: The volume number of the source.
8. Issue: The issue number of the source.
9. Art. No.: Article number, if applicable.
10. Page Start: The starting page number of the document within the source.
11. Page End: The ending page number of the document within the source.
12. Page Count: The total number of pages in the document.
13. Cited By: The number of times other articles have cited this document.
14. DOI: Digital Object Identifier, a unique alphanumeric string assigned to the document for
easy identification.
15. Link: URL link to the document.
16. Affiliations: The affiliations of the authors, indicating their institutional affiliations or
organizational affiliations.
17. Authors with Affiliations: The names of the authors along with their respective
affiliations.
18. Abstract: A summary of the document's content and main findings.
19. Author Keywords: Keywords provided by the authors to highlight the main concepts of
the document.
20. Index Keywords: Keywords assigned by the database to categorize the document.
21. Molecular Sequence Numbers: If applicable, sequence numbers related to molecular
structures.
22. Chemicals/CAS: Information about chemicals and their Chemical Abstracts Service
Registry Numbers.
23. Tradenames: Tradenames of products mentioned in the document.
24. Manufacturers: Names of manufacturers mentioned in the document.
25. Funding Details: Information about funding sources for the research.
26. Funding Texts: Textual descriptions of the funding details.
27. References: List of references and citations used in the document.
28. Correspondence Address: Address of the corresponding author.
29. Editors: Names of editors if the document is part of an edited volume.
30. Publisher: The publisher of the source where the document was published.
31. Sponsors: Sponsoring organizations, if applicable.
32. Conference Name: Name of the conference if the document was presented at a conference.
33. Conference Date: Date of the conference.
34. Conference Location: Location of the conference.
35. Conference Code: Code associated with the conference.
36. ISSN: International Standard Serial Number of the source.
37. ISBN: International Standard Book Number, if applicable.
38. CODEN: Code assigned to the source.
39. PubMed ID: PubMed identifier, if the document is available on PubMed.
40. Language of Original Document: The language in which the original document is written.
41. Abbreviated Source Title: Abbreviated title of the source.
42. Document Type: The type of document (e.g., article, conference paper).
43. Publication Stage: The stage of publication (e.g., in press, published).
Data Preparation
The following is a comprehensive overview of the individual steps undertaken to prepare our
data for analysis:
Transforming the preprocessed text into a Document-Term Matrix (DTM) enables the measurement of
term frequency across the documents. The matrix functions as a foundational component for
subsequent analysis, allowing the identification of prominent terms within each document and
simplifying the extraction of overarching themes.
In complex datasets, some documents may lack meaningful content after preprocessing, which can
leave rows in the Document-Term Matrix containing only zero values. Excluding these rows
guarantees that only documents with substantive material are carried into the subsequent
analysis, improving the dataset's representation of the thematic landscape of the corpus.
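This filtering step can be sketched as follows, assuming the matrix is named `DTM`:

```r
# Remove documents whose rows contain only zeros after preprocessing
row_totals <- apply(DTM, 1, sum)
DTM <- DTM[row_totals > 0, ]
```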
The computation of word frequencies is a crucial process in comprehending the relative
significance of terms within the corpus. Arranging words in descending order based on
frequency enables recognizing often occurring terms. These terms often indicate significant
themes or concepts that are extensively covered in the existing body of literature.
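A sketch of the frequency computation, assuming the Document-Term Matrix `DTM` from the previous step:

```r
# Total frequency of each term across the corpus, sorted in descending order
freq <- sort(colSums(as.matrix(DTM)), decreasing = TRUE)
head(freq, 10)  # the ten most frequent terms
```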
The utilization of visual representations of data has the potential to significantly amplify the
effectiveness of conveying analytical outcomes. The generation of a word cloud, in which the
size of each word is proportional to its frequency, offers a prompt and intuitive representation
of the most prominent terms within the collection of texts. This visualization tool facilitates
efficient comprehension of the prevailing topics in the text, benefiting both researchers and
readers. This serves as a fundamental basis for further in-depth analysis.
Modelling
The topic modeling technique is a powerful tool in the pursuit of understanding the complex
network of ideas and concepts present in the literature on blockchain technology. The following
steps delineate a complex analytical procedure designed to uncover the underlying themes
contained in preprocessed textual material. This process serves to facilitate scholars in attaining
a more profound comprehension of the discourse.
The initial step in topic modeling involves a fundamental inquiry: What is the ideal number of
topics that may effectively reveal the most significant themes? The question at hand is
addressed using the ldatuning library, which utilizes a range of metrics such as
"Griffiths2004," "CaoJuan2009," "Arun2010," and "Deveaud2014." Every metric provides
insight into the cohesion and uniqueness of themes. The analysis employs an iterative process
that explores a range of potential topic numbers, spanning from 2 to 40. This exploration uses
the Gibbs sampling approach, which aims to determine the most suitable number of topics
for the given context. This stage effectively directs the subsequent analysis, guaranteeing that
the number of topics chosen corresponds to the inherent structure of the corpus.
The convergence of coherence and uniqueness measurements can be observed through a visual
depiction of the metrics across various topic numbers. This convergence provides insight into
determining the optimal number of topics. The resulting visualization offers a tangible
foundation for selecting the most suitable number of topics and enhances the process of
extracting robust and coherent themes.
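The search over candidate topic numbers can be sketched with the `ldatuning` package; the seed value here is illustrative:

```r
library(ldatuning)

# Evaluate the four metrics for every candidate topic count from 2 to 40
result <- FindTopicsNumber(
  DTM,
  topics  = seq(2, 40, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1234),
  verbose = TRUE
)

# Plot all metrics against the number of topics to locate the optimum
FindTopicsNumber_plot(result)
```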
After determining the ideal number of topics, the analysis proceeds to a more in-depth
exploration using the Latent Dirichlet Allocation (LDA) methodology. This approach reveals the
underlying semantic organization of the preprocessed dataset. The LDA process is carefully
tuned by adjusting several parameters, including burn-in, iterations, thinning, and seed. The
outcomes are evident in the form of a depiction of the prevalent topics within the collection of
texts and the arrangement of words within those topics.
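A sketch of the model fit with the `topicmodels` package; the control values shown are typical placeholders, not the exact settings used:

```r
library(topicmodels)

# Fit a 5-topic LDA model with Gibbs sampling
ldatm <- LDA(
  DTM, k = 5, method = "Gibbs",
  control = list(burnin = 1000, iter = 2000, thin = 100, seed = 1234)
)
```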
The result of the Latent Dirichlet Allocation (LDA) analysis manifests as a matrix that exhibits
the relative prominence of each theme across the documents. In the matrix, each row is
associated with a document, while the columns reflect the detected subjects. The utilization of
this matrix facilitates a comprehensive comprehension of the relative frequency of each
thematic element within distinct papers, offering valuable insights into the range of issues
investigated within the literature on blockchain technology.
Moreover, a word matrix for each topic serves to elucidate the particular words that contribute
substantially to the comprehension of each respective topic. This stage facilitates the
qualitative analysis of the topics, as these selected words provide insight into the
fundamental principles encompassed by each theme.
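Both matrices can be extracted from the fitted model (`ldatm`) as follows:

```r
# Document-topic matrix: one row per document, one column per topic
doc_topics <- posterior(ldatm)$topics

# The ten highest-weighted terms for each topic
terms(ldatm, 10)
```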
Each topic that has been identified represents more than just a compilation of words; rather, it
represents a conceptual cluster that encompasses a distinct facet of blockchain technology. As
an illustration,
- Topic 1: Emerging Technologies: This subject matter could involve discussions regarding
the dynamic nature of technology advancements in blockchain.
- Topic 2: Data Security: The focal point of this discourse pertains to the protection and
preservation of data and information within the blockchain technology framework.
- Topic 3: Digital Innovation: This topic encompasses investigations into the innovative
capabilities of blockchain technology inside the digital domain.
- Topic 4: Technological Applications: This topic will explore the practical applications and
utilization of blockchain technology across many sectors.
- Topic 5: Blockchain Technology: This topic encompasses comprehensive discussions
focused on the fundamental principles and characteristics of blockchain.
Ultimately, the probabilities associated with categorizing documents into their respective topics
are consolidated and represented in a structured dataframe. This analysis demonstrates the degree
to which each document corresponds with the identified topics, offering a quantitative basis for
comprehending the prevalence of themes throughout the collection.
Evaluation
Perplexity is a prevalent metric for evaluating the efficacy of language models, including
topic models such as Latent Dirichlet Allocation (LDA). Nevertheless, interpreting perplexity
can be somewhat intricate and contingent upon the specific circumstances.
The log-likelihood of the LDA model (ldatm) is calculated with the line of code
`logLikelihood <- logLik(ldatm)`. The log-likelihood is an evaluative metric for the extent to
which the model accurately represents the observed data, here the document-term matrix (`DTM`).
A higher log-likelihood value signifies a closer alignment between the model and the observed
data.
The perplexity value is obtained by exponentiating the negative sum of log-likelihoods divided
by the total number of words. Perplexity evaluates a model's ability to accurately predict data
it has not previously encountered; lower perplexity scores indicate a higher level of
proficiency in predicting the observed data.
The perplexity value is printed to the console using the `print` function, and the `paste`
function is used to concatenate the string "Perplexity:" with the perplexity value. The obtained
perplexity of 482.904600513577 is deemed satisfactory. Evaluating the quality of a perplexity
value depends on the particular use case and domain under consideration: in some contexts, a
perplexity of approximately 100 may be regarded as exemplary, while in others, scores below 500
can still be deemed satisfactory.
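The computation described above can be sketched as:

```r
# Log-likelihood of the fitted model on the training DTM
logLikelihood <- logLik(ldatm)

# Perplexity: exponentiate the negative log-likelihood per word
perp <- exp(-as.numeric(logLikelihood) / sum(as.matrix(DTM)))
print(paste("Perplexity:", perp))
```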
Deployment
We have deployed our model on a new sentence to check the model's classification in real-world
scenarios. The provided sentence is, "Blockchain technology is a decentralized and
distributed digital ledger system that records transactions across multiple computers in a secure
and transparent manner.” The sentence undergoes several preprocessing steps, such as
converting to lowercase, removing punctuation, and eliminating common English stopwords,
to refine the text. Following this, a document-term matrix (DTM) is created for the
preprocessed sentence, which corresponds to the format of the DTM utilized in training the
LDA model. The topic distribution for the new phrase is computed by employing the trained
LDA model ('ldatm'), thereby estimating the sentence's probability of belonging to each
subject. The topic with the highest probability is designated as the assigned topic, and the
code produces this assignment as output. In this illustration, the sentence is assigned to
Topic 2: Data Security. This exemplifies how a trained LDA model can allocate new, unseen
sentences to specific themes based on their content.
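The deployment steps can be sketched as follows; variable names other than `ldatm` and `DTM` are illustrative:

```r
# Preprocess the new sentence the same way as the training corpus
new_text <- "Blockchain technology is a decentralized and distributed digital ledger system that records transactions across multiple computers in a secure and transparent manner."
new_corpus <- Corpus(VectorSource(new_text))
new_corpus <- tm_map(new_corpus, content_transformer(tolower))
new_corpus <- tm_map(new_corpus, removePunctuation)
new_corpus <- tm_map(new_corpus, removeWords, stopwords("english"))

# Restrict the new DTM to the vocabulary the model was trained on
new_dtm <- DocumentTermMatrix(new_corpus,
                              control = list(dictionary = Terms(DTM)))

# Estimate the topic distribution and report the most probable topic
topic_probs <- posterior(ldatm, newdata = new_dtm)$topics
assigned <- which.max(topic_probs)
print(paste("Assigned topic:", assigned))
```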
ANNEXURE
# Group -04
# Akila - MBAA22004
# Deepak Bhatt - MBAA22085
# Dhruv Chowdary - MBAA22025
# Pranjal Agnihotri - MBAA22052
# Shambhavi Gupta - MBAA22063
# Shivangi Agrawal - MBAA22065
# Automated Data Collection Term Project
# (Extracting Data from Scopus and conducting Topic Modelling)
###################################################################################
>
> # Keywords used in Scopus Search - Blockchain Technology
> # Filter applied: Documents limited to Business, Management and Accounting
> setwd("G:/My Drive/Term-4 Classes/Automated Data Collection")
>
> # Importing Necessary Libraries
> library(tm)
Loading required package: NLP
Warning message:
package ‘tm’ was built under R version 4.2.3
> library(wordcloud2)
> library(topicmodels)
Warning message:
package ‘topicmodels’ was built under R version 4.2.3
> library(ggplot2)
Attaching package: ‘ggplot2’
The following object is masked from ‘package:NLP’:
    annotate
Warning message:
package ‘ggplot2’ was built under R version 4.2.3
>
> # Importing the Data set
> data<-read.csv("scopus.csv")
> DATA<-data$Abstract
>
> # Text Analysis----------------------
> Docs<-Corpus(VectorSource(DATA))
> Docs
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 4249
> writeLines(as.character(Docs[[6]]))
The working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps (MFDAs), which deliver an innovative method to interact with and offer high-quality services to customers. This study pinpoints the imperative factors affecting the customer's attitude and continued intention in light of the task technology fit (TTF) model. The required data were collected from MFDA users and analyzed by the structural equation modeling technique via Amos-23 and SPSS-22. The results confirm that customer rating, ordering review, food tracking, navigational design, and user self-efficacy positively impact TTF. Further, self-efficacy positively moderates the relationship between visual design and TTF, navigational design and TTF, and food tracking and TTF. Moreover, TTF positively influences attitude and continued intention, and in turn, attitude positively influences continued intention. Additionally, blockchain technology (BT) enabled traceability positively moderates the relationship between TTF, attitudes, and continued intention to use MFDAs. The developers of MFDAs should consider how customers perceive BT-enabled traceability and take steps to embrace it to increase customer trust in MFDAs. Furthermore, the theoretical and managerial applications are explained in detail so that developers can offer what MFDA users need. © 2023 Elsevier Ltd
>
> #Cleaning the data
> Docs<-tm_map(Docs,content_transformer(tolower)) # Transforming the text
to lowercase
Warning message:
In tm_map.SimpleCorpus(Docs, content_transformer(tolower)) :
transformation drops documents
> writeLines(as.character(Docs[[6]]))
the working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps (mfdas), which deliver an innovative method to interact with and offer high-quality services to customers. this study pinpoints the imperative factors affecting the customer's attitude and continued intention in light of the task technology fit (ttf) model. the required data were collected from mfda users and analyzed by the structural equation modeling technique via amos-23 and spss-22. the results confirm that customer rating, ordering review, food tracking, navigational design, and user self-efficacy positively impact ttf. further, self-efficacy positively moderates the relationship between visual design and ttf, navigational design and ttf, and food tracking and ttf. moreover, ttf positively influences attitude and continued intention, and in turn, attitude positively influences continued intention. additionally, blockchain technology (bt) enabled traceability positively moderates the relationship between ttf, attitudes, and continued intention to use mfdas. the developers of mfdas should consider how customers perceive bt-enabled traceability and take steps to embrace it to increase customer trust in mfdas. furthermore, the theoretical and managerial applications are explained in detail so that developers can offer what mfda users need. © 2023 elsevier ltd
>
> d1<-content_transformer(function(x,pattern){return(gsub(pattern,"",x))})
> Docs<-tm_map(Docs,d1,"tool")
Warning message:
In tm_map.SimpleCorpus(Docs, d1, "tool") : transformation drops documents
> writeLines(as.character(Docs[[6]]))
the working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps (mfdas), which deliver an innovative method to interact with and offer high-quality services to customers. this study pinpoints the imperative factors affecting the customer's attitude and continued intention in light of the task technology fit (ttf) model. the required data were collected from mfda users and analyzed by the structural equation modeling technique via amos-23 and spss-22. the results confirm that customer rating, ordering review, food tracking, navigational design, and user self-efficacy positively impact ttf. further, self-efficacy positively moderates the relationship between visual design and ttf, navigational design and ttf, and food tracking and ttf. moreover, ttf positively influences attitude and continued intention, and in turn, attitude positively influences continued intention. additionally, blockchain technology (bt) enabled traceability positively moderates the relationship between ttf, attitudes, and continued intention to use mfdas. the developers of mfdas should consider how customers perceive bt-enabled traceability and take steps to embrace it to increase customer trust in mfdas. furthermore, the theoretical and managerial applications are explained in detail so that developers can offer what mfda users need. © 2023 elsevier ltd
>
> Docs<-tm_map(Docs, removePunctuation) # Removing Punctuation
Warning message:
In tm_map.SimpleCorpus(Docs, removePunctuation) :
transformation drops documents
> writeLines(as.character(Docs[[6]]))
the working pattern of the food industry has entirely changed with the emergence of mobile food delivery apps mfdas which deliver an innovative method to interact with and offer highquality services to customers this study pinpoints the imperative factors affecting the customers attitude and continued intention in light of the task technology fit ttf model the required data were collected from mfda users and analyzed by the structural equation modeling technique via amos23 and spss22 the results confirm that customer rating ordering review food tracking navigational design and user selfefficacy positively impact ttf further selfefficacy positively moderates the relationship between visual design and ttf navigational design and ttf and food tracking and ttf moreover ttf positively influences attitude and c