MySEC 2011 6140639

2011 5th Malaysian Conference in Software Engineering (MySEC)
Controlled Vocabulary based Software Requirements Classification
Nasir Mehmood Minhas, Shahla Majeed, Zia ul Qayyum, Muhammad Aasem
UIIT, PMAS-Arid Agriculture University, Rawalpindi, Pakistan
978-1-4577-1531-0/11/$26.00 ©2011 IEEE

Abstract — The nature of software requirements is highly subjective and multi-faceted. Their complexity grows with their volume, especially when the requirements are written in natural language. In the early phase of requirements engineering, it is desirable to transform these user-written requirements into a more understandable form. Organizing the requirements into groups can make subsequent activities much easier than working on them directly. In this paper, we present a classifier that transforms requirements written in natural language into corresponding groups. The organization of these groups depends on the association between keywords, i.e., the hierarchy of keywords. The classifier works best when the requirements have been written using a related vocabulary, i.e., a controlled vocabulary. The technique is composed of three main components: 1) Repository: keywords and their relationships as source data; 2) Mapping: matching words in the requirements document against keywords of the repository; and 3) Presentation: presenting the classified (grouped) requirements in more meaningful ways.

I. INTRODUCTION

The importance of classification in library, biology, earth, astrology, and social sciences has been widely accepted. It has also been a major research area in computer science, e.g., in image processing, bioinformatics, data mining and machine learning. Over the last couple of decades, ‘classification’ has also been an attractive keyword in the software engineering domain for classifying errors, software features, software risks, and software testing [1, 3, 4, 6]. In requirements engineering, classification can play a supportive role in many stages, such as improving the understandability of user requirements [2, 7], setting priorities [17, 26, 27, 28], and assessing requirement quality. Moreover, requirements classification can reduce the complexity of decision making by reducing a large number of requirements to fewer groups [26].

Classifying requirements also presents their different perspectives [4], which may need to be analyzed in later stages. For a small or medium number of requirements, manual classification should be tried first, because human experts can express the value of a requirement more easily and more accurately. But it becomes a complex and time-consuming task in projects with a large number of requirements or requirements of a complex nature [8]. In such cases, manual classification suffers from limitations such as dealing with a large number of requirements, unavailability of experts, and wrong judgment.

Due to these limitations, it is desirable to automate the manual process in the form of an automated classifier. This saves time and effort and decreases the cost of experts’ man-hours. The focus of this paper is to develop such an automated classifier that can efficiently classify software requirements.

One of the main issues with such a classifier is reaching the level of decision maturity of a human expert. As an ideal scenario, we envision an automated classification system that is capable of taking a large number of requirements written in natural language and transforming them into a manageable number of groups. These groups should be formed on the basis of one or more parameters so that they can reflect the perspectives of different stakeholders.

In this paper we discuss a simple automated classifier, based on a controlled vocabulary, that divides user-written requirements into specified groups. After an overview of related literature in the next section, an outline of the proposed technique is given in section III. Sections IV and V discuss the controlled-vocabulary-based repository and the classifier, respectively. The application of the technique to three case studies is discussed in section VI, and the results of these case studies are discussed in section VII. Finally, we conclude and set future directions.

II. LITERATURE REVIEW

Most of the existing literature relates to text classification and the classification of non-functional requirements (NFR). For text processing there are two approaches, statistical and linguistic [8]. Linguistic approaches are costlier and harder to scale than statistical approaches, so statistical approaches are used more often. Classification is performed in three phases using both manual and automated means: a mining phase, a classification phase and an application phase [10]. Classification
method described in that work is appropriate for intermediate and high-level projects. Requirements can be successfully identified in the first two phases. Non-functional requirements are identified from scattered documents such as interviews, meeting minutes and stakeholders’ comments. An automatic NFR classifier detects the requirements in all types of documents, and an analyst then finalizes them. Because it combines manual and automated methods, the approach requires more effort and time.

This classification method is used for large-scale projects. A bootstrapping method can automatically classify requirements sentences into topic categories using topic words that have already been assigned to the system [8]. A text classification framework based on a genetic algorithm is proposed in [11]. The framework is divided into four modules: a pretreatment module (converts the document into a computer model that is easier to handle than the written document), a learning classifier (generates the classifier using different algorithms based on the genetic algorithm), a classification module (generates tagged documents) and a feedback module (updates the classified templates using stakeholders’ responses) [11]. Text classification methods are used to reduce the ambiguities that arise in natural language. Hierarchal text classification is presented in [9], which introduces two types of text classification: heavy classification and light classification. Heavy classification deals with documents of similar content using the Naïve Bayes algorithm. Light classification deals with documents whose data vary, which is overcome using Euclidean distance. Generally, requirements should be classified into two main categories, functional requirements (FR) and non-functional requirements (NFR), which have a strong impact on requirements engineering [12]. Some people consider security requirements to be FRs, although this view has not been fully successful, while others consider them NFRs. To secure the behavior of the system, however, security has to be modeled as a functional requirement. The authors of [12] describe security requirements from every aspect, so that no separate security engineering process is needed, and they examine and compare some important classification methods for security requirements. A leaf classification method is used in a computerized plant identification system [13]. A scaled Centroid-Contour Distance (CCD) code system classifies the basic shape and margin type of a leaf, and a neural network approach extracts its vein details. The proposed method integrates low-level image features with domain knowledge of botany [13]. Statistical text classification methods treat terms as independent, with no semantic relationship between them [14]. Since requirements are written in natural language, words always have some relationship with each other, so this problem is important to consider. To resolve it, the authors of [14] combine statistical and ontology methods. The relationships are obtained using the ontology method [14]. The ontology-based method could not be used on its own because ontology-based classification lacks scientific statistical calculations. A tool written in Java automates the extraction, analysis and classification of events from textual requirements [15]. It also helps in adding further events relevant to the application domain. A multi-subject text classification algorithm based on a fuzzy vector machine is used to classify samples [16]. SVM is used to resolve the problem of multi-class classification and for text categorization [16, 20, 21]. A classification framework is proposed for requirements prioritization methods based on benefit and cost estimation [17]. This framework is derived from grounded theory, through which many relevant categories can be discovered. K-Nearest Neighbor (K-NN) and centroid-based classification algorithms are commonly used for text classification. In [18], both algorithms are examined, and the centroid-based algorithm is then used for text classification with some improvements. The centroid-based algorithm is preferred over K-NN because K-NN is slower at processing the distances/similarities of the data. A top-down, level-based classification method is proposed in [19] that can classify the requirements document at each level, i.e., leaf and internal categories. The performance of categories can be measured incorrectly when categories are considered independently of their relationships, so category-similarity measures and distance-based measures are used to account for misclassification when measuring classification performance. Several classification methods, namely Support Vector Machines (SVM), a k-Nearest Neighbor (kNN) classifier, a neural network (NNet) approach, the Linear Least Squares Fit (LLSF) mapping and a Naive Bayes (NB) classifier, are used for text categorization, and their performance is examined with respect to the training set [20]. One major issue in text categorization/classification is feature selection, which can be addressed by any feature selection method (e.g., Information Gain (IG), Document Frequency (DF), Mutual Information (MI)), but all of these methods have shortcomings, so Knowledge Gain (KG) is proposed and successfully implemented for feature selection in text categorization [22]. A two-step method of feature selection using IG and NB is proposed for text categorization [23].

III. SOFTWARE REQUIREMENTS CLASSIFIER

Classification is generally described as the arrangement of objects into groups based on specified rules. From the requirements engineering perspective, classification refers to the organization of software requirements into different classes such that they provide a meaningful basis for decision making in further (analytical) activities. These activities may include planning releases, identification of critical and core requirements, assessment of conflicting requirements, and
summarizing requirements. Three main components can be found in every definition of classification: a set of objects (to be classified), classification criteria, and a classifier. This paper presents a classification technique that takes a set of software requirements written in natural language and organizes them into different classes based on specified rules (a repository of controlled vocabulary). Classifying requirements written in natural language is much easier for human experts than for computers. The technique does not consider the semantics of the language; rather, it looks for the predefined vocabulary and its relationship with the associated groups. It therefore follows a simple algorithm that is mainly supported by the supplied repository.

The structure of the technique can be described in three components: 1) a set of software requirements written as English sentences, i.e., each statement refers to one requirement; 2) a set of parameters or rules, i.e., the repository of controlled vocabulary of the functional domain; and 3) the classifier (the algorithm), which takes the first two components as inputs and organizes the requirements into associated groups. The repository of controlled vocabulary illustrated in figure 1 is the means of training the classifier to assign each requirement to its related group.

IV. REPOSITORY OF CONTROLLED VOCABULARY

The core component of the proposed classifier is the repository of controlled vocabulary. This repository can take the form of a flat file, XML, or a database managed by a DBMS. It can be managed and compiled using any approach discussed in [25]. Existing methods of compiling a controlled vocabulary fall into four categories [25]: 1) compilation based on extracted words; 2) adding natural language search signs to an existing vocabulary or classification table; 3) manual compilation of the vocabulary, using computers to automatically collect and accumulate terms; 4) using a ready-made general vocabulary as a substitute.

The logical structure for storing the data of such a controlled vocabulary is hierarchal. There are two reasons for this: data can be traced easily, even manually, and searching for a particular keyword or word is fast. The repository is organized into three levels, i.e., Keywords, Words, and Alternatives. A topmost level has also been proposed to refer to a particular repository. Figure 1 illustrates the organization of the proposed hierarchy. It shows that the levels at the top present generic terminology, while the lower levels expand and explain the upper ones. The keywords are explained by associated words, which are further linked with alternative words. These alternative words can be synonyms, thesaurus entries, different forms of verbs, etc.

At this point, our main focus is to develop a repository that can demonstrate the effectiveness of the classifier for software requirements. Therefore, to keep it simple, the repository is maintained in the form of database tables. The data in these tables can easily be accessed using structured query language (SQL). The hierarchy of relational data in a relational database can also be represented in views (virtual tables) by writing queries that arrange the data hierarchically.

Figure 1. Hierarchal Organization of controlled vocabulary

From the implementation point of view, we developed relational tables and the relationships between them. Figure 2 shows these tables, prefixed with level numbers t0, t1, t2 and t3, referring to levels 0, 1, 2 and 3 respectively.

Figure 2. Relational Tables for storing controlled vocabulary

Based on these tables, a view is required to access all the data of the repository. This view can easily be created by joining the tables according to the hierarchy. The following query can be used to create such a view:
    select t0repo.*, t1keywords.*, t2words.*, t3alternates.*
    from t0repo, t1keywords, t2words, t3alternates
    where t0repo.t0repo = t1keywords.t1repo
    and t1keywords.t1kid = t2words.t2kid
    and t2words.t2kid = t3alternates.t3kid
    and t2words.t2wid = t3alternates.t3wid;

V. CLASSIFICATION USING CONTROLLED VOCABULARY

The classifier strongly depends on the vocabulary used in the repository discussed previously. If the repository is highly coherent with the given requirements, the resulting output will be of high quality. The challenging part of this technique is selecting suitable keywords and building their repository for classification. The creation of the repository must involve the key person of the functional area, because he or she best knows the keywords and what they mean to the domain user. The vocabulary should be selected by anticipating why and when the user will choose a given (key)word. Moreover, the groups reflecting a keyword must be meaningful for the decision making objective.

Figure 3 is an abstraction of the proposed approach. It starts with the user requirements obtained from different users. These requirements are mapped against the words in the repository iteratively. Finally, they are presented in the form of groups, each containing a set of requirements. These groups can then be used for further analysis in requirements engineering.

Another illustration of the classifier's internal mechanics can be seen in figure 4, which presents the pseudo code of the technique. The set REQ represents the user requirements and REPO the repository of controlled vocabulary. The mapping function links keywords with words and words with alternative words. Two nested loops are used. The inner loop is executed for each requirement to map it against one or more elements of the repository. Note that the inner loop searches for as many elements as possible, which means that a requirement can be a member of one or many groups. On each successful match, the mapped elements are placed in the set GROUPS.

    REQ = {Set of user written requirements}
    GROUPS = {}
    KEYWORDS = {KW1, KW2, …, KWn}
    WORDS = {W1, W2, …, Wnm}
    ALTERNATIVES = {A1, A2, …, Anma}
    REPO = { map KEYWORDS to WORDS
             map WORDS to ALTERNATIVES }
    for each requirement of REQ
        R ← current_requirement | requirement ∈ REQ
        for each element of REPO
            if R ∋ KEYWORD then
                GROUPS = GROUPS ∪ {map R to REQ}
            end if
        end for
    end for

Figure 4. Pseudo code for the Classifier

The pseudo code shows that requirements are analyzed in the light of the repository of controlled vocabulary and assigned to groups if an associated keyword is found. As there is no early exit from the inner loop after the first occurrence, a requirement can be assigned to more than one group.

VI. CASE STUDIES

To validate the results and performance of our approach, we developed database application software based on the underlying concept and applied it to three case studies, i.e., LIMS: Library Information Management System, RESMAN: Online Resource Management System, and IESCO: Automated Accountant System. These case studies were selected from the projects submitted by BS (CS) students. For demonstration purposes, we explain the first case study, LIMS, step by step, while RESMAN and IESCO are discussed without internal details.

LIMS has been described by its functional user in at most ten requirements. These requirements can be seen in figure 6 (ignoring the tags enclosed in << and >>). Taking the perspective of a library user (librarian), we developed a repository capable of capturing the different classes within these requirements. This repository is shown hierarchically in figure 5. Using our template-based Excel file, the user requirements and the developed repository were transformed into separate
sheets. The file was then imported into our database software application and the mapping process was executed.

In the first execution, nine out of ten requirements were mapped. The exception was requirement no. 3, in which no keyword, word or alternative was found. The requirement was analyzed and an ‘Alternative’ word was identified with which to update the repository. Once the repository was updated, ten out of ten requirements were mapped and classified.

Figure 5. Repository for LIMS

1. All library members enter into their accounts using user name and password. <<Secure>>
2. If a member forgets his/her password, password shall be sent by email. <<Secure>>
3. Once login into member area, a member can search, sort, bookmark books. <<Transaction>>
4. Each member has a quota to issue and reserve books. <<Transaction>>
5. New Books can be added, that will immediately available for transactions. <<Transaction>>
6. A reserved book can either be released or issued by the same member. <<Transaction>>
7. If reserved book was not released or issued within two days, it shall be released automatically. <<Transaction>><<Automatic>>
8. Issued book shall be returned within 15 days. <<Transaction>>
9. If issued book was not returned in 15 days, then each day Rs. 50 will be charged. <<Transaction>>
10. System automatically generates mail to remind issued book to be returned within 13 days. <<Transaction>><<Automatic>>

Figure 6. User Requirements mapped (tagged) by classifier

At the execution of the mapping process, the classifier added the associated group(s) to each requirement. These tags indicate that the sentence contains one or more controlled vocabulary terms of the repository. Hence the requirements are classifiable for the purpose the repository was designed for.

The requirements of RESMAN and IESCO were processed in a similar way. RESMAN contains 21 requirements covering modules such as employee management, assets management, document management, transport management, and employee book. IESCO’s 21 requirements describe features for recruitment, employee management, appraisal, leave management, pay calculation, and account management. All of RESMAN’s requirements were mapped, while 2 requirements were left un-tagged in IESCO.

VII. RESULTS AND DISCUSSION

The results obtained from the case studies were compared systematically with the results of a human expert, who performed the validation of our classification approach because no comparable approach was available. After performing the exercise, we found some accuracy gap between the two sets of results.

The accuracy for LIMS and IESCO is 85% and 86% respectively, while ResMan has 79% accuracy. We therefore place them in two groups: group 1 includes LIMS and IESCO because of their closer results, and group 2 contains ResMan. Group 1 can be called the best case, while group 2 is referred to as the worst case. The purpose of dividing them into such groups is to analyze their parameters and the causes of their results.

Figure 7 shows the trend of the case study results based on the inputs. The number of requirements in group 1 is smaller than that in group 2. In the same way, group 1 also has fewer keywords than group 2. From these two input parameters it can be deduced that as the number of requirements grows and their keywords are not fully defined, classification accuracy decreases.

Figure 7. Parameters and Results of the case studies

Since keywords are described by words and alternative words, these have a great impact on the results. Keywords, words and alternatives form a repository that has to be designed carefully for its domain. The pattern in figure 7
regarding the repository shows that the numbers of words and alternates are close on average. This observation illustrates the trend of improved results when the vocabulary used in writing the requirements sufficiently matches the controlled vocabulary used in the repository. At this point it is important to highlight the importance of the quality of repository items. Even a large quantity of repository items cannot guarantee effective results if their quality has not been improved.

VIII. CONCLUSION AND FUTURE WORK

This paper presented an automated software requirements classification technique and demonstrated its implementation in three case studies. The repository of controlled vocabulary was highlighted as the main component of the technique, and its three-level hierarchal structure was discussed in detail. The mechanics of the classifier were explained from different perspectives and the results were presented. The work presented in this paper is in its initial form. Two main areas for improving its effectiveness have been identified: 1) updating the repository with the missing words of unclassified requirements, and 2) improving the quality of results so that they better reflect the real groups of interest.

REFERENCES
[1] Z.T. Bieniawski, “Engineering rock mass classifications”, John Wiley & Sons, New York, 1989, pp. 251.
[2] H. Stille and A. Palmstrom, “Classification as a tool in rock engineering”, Tunnelling and underground space technology, 2003, pp. 331-345.
[3] E. Hochmuller, “Requirement classification as a first step to grasp quality requirements”, Proc. Third International Workshop on Requirements Engineering: foundation of Software Quality (REFSQ’97), Barcelona, June 1997.
[4] H. Hoodat and H. Rashidi, “Classification and Analysis of Risks in Software Engineering”, World Academy of science, Engineering and Technology, 2009.
[5] P. Zave, “Classification of Research Efforts in Requirements Engineering”, AT & T Bell Laboratories, 1997.
[6] M. Hertzum, “Small-Scale Classification Schemes: A field study of requirement engineering”, Computer Supported Cooperative Work, Vol. 13, 2004, pp. 35-61.
[7] M. Y. Kiang, “A comparative assessment of classification methods”, Decision support system, 2003, pp. 441-454.
[8] Y. Ko, S. Park, J. Seo and S. Choi, “Using classification techniques for informal requirements in the requirements analysis-supporting system”, Information and Software Technology, 2007, pp. 1128-1140.
[9] J. Polpinij and A. Ghose, “An Automatic Elaborate Requirement Specification By Using Hierarchal Text Classification”, International Conference on Computer Science and Software Engineering, 2008, pp. 706-709.
[10] J. C. Huang, R. Settimi, X. Zou and P. Solc, “The Detection and Classification of Non-Functional Requirements with Application to Early Aspects”, 14th IEEE International Requirements Engineering Conference, 2006.
[11] Z. Z. Fang, L. P. Yu and L. Ran, “Research of text classification technology based on genetic annealing algorithm”, International symposium on computational intelligence and design, 2008, pp. 265-269.
[12] T. R. Farkhani and M. R. Razzazi, “Examination and Classification of Security Requirements of Software Systems”, IEEE 2006, pp. 2778-2783.
[13] H. Fu, Z. Chai, D. Feng and J. Song, “Machine Learning Techniques for Ontology-based Leaf Classification”, 8th International Conference of on control automation robotics and vision Kunming, December 2004, pp. 681-686.
[14] G. Wu and K. Liu, “Research on Text Classification Algorithm By Combining Statistical and Ontology Methods”, IEEE 2009.
[15] S. K. Singh, S. Sabharwal and J. P. Gupta, “E-XTRACT: A Tool for Extraction, Analysis and Classification of Events from Textual Requirements”, International Conference on Advances in Recent Technologies in Communication and Computing, 2009, pp. 306-308.
[16] Q. Yu-ping, A. Qing, W. Xiu-kun and L. Xiang-na, “Study on Classification Algorithm of Multi-Subject Text”, IEEE 2007, pp. 435-438.
[17] M. Daneva and A. Herrmann, “Requirements Prioritization Based on Benefit and Cost Prediction: A Method Classification Framework”, IEEE 2008, pp. 240-247.
[18] Z. Cataltepe and E. Aygun, “An improvement of Centroid-Based Classification Algorithm for Text Classification”, IEEE 2007, pp. 952-956.
[19] A. Sun and E. Lim, “Hierarchical Text Classification and Evaluation”, IEEE 2001.
[20] Y. Yang and X. Liu, “A re-examination of text categorization methods”, 1999.
[21] T. Zhang, “Text Categorization Based on Regularized Linear Classification Methods”, 2001.
[22] X. Yan, L. Jin-Tao, W. Bin and S. Chun-Ming, “A Category Resolve Power-Based Feature Selection Method”, Journal of Software, Vol. 19, No. 1, January 2008, pp. 8289.
[23] N. Long, D. Gianola, G. J. M. Rosa, K. A. Weigel and S. Avendan, “Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers”, 2007, pp. 377-389.
[24] G. S. Walia and J. C. Carver, “A systematic literature review to identify and classify software requirement errors”, Information and Software Technology, 2009, pp. 1087-1109.
[25] J. Li and Y. Dong, “Post-controlled Vocabulary Compiling in Competitive Intelligence System”, IEEE transaction, 2010.
[26] M. Aasem, “Analysis and optimization of software requirements prioritization Techniques”, MS Thesis-2011, Arid Agriculture University, Rawalpindi.
[27] A. Kashif, “A Systematic Review of Software Requirements Prioritization”, Thesis no MSE-2006-18, BTH.
[28] A. Ahl, “An Experimental Comparison of Five Prioritization Techniques - Investigating Ease of Use, Accuracy, and Scalability”, Master Thesis No. MSE-2005-11, BTH.
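As a supplementary illustration of sections IV and V, the repository view and the classification loop of figure 4 can be sketched in a few lines of Python using the standard sqlite3 module. This is a minimal sketch under stated assumptions, not the authors' implementation. The paper names the tables t1keywords, t2words and t3alternates but does not list their columns, so the column names, the repo_terms view, the sample vocabulary, the whole-word matching rule and the helper names build_repository and classify are all hypothetical. The t0repo level is left out because the sketch uses a single repository. The three sample requirements are requirements 1, 3 and 10 of figure 6.

```python
import re
import sqlite3

# Hypothetical schema: column names are assumptions, not taken from the paper.
SCHEMA = """
CREATE TABLE t1keywords   (kid INTEGER PRIMARY KEY, keyword TEXT);
CREATE TABLE t2words      (wid INTEGER PRIMARY KEY, kid INTEGER, word TEXT);
CREATE TABLE t3alternates (aid INTEGER PRIMARY KEY, wid INTEGER, alternate TEXT);

-- Flatten the hierarchy: one row per (group, matchable term).
CREATE VIEW repo_terms AS
    SELECT keyword AS grp, keyword AS term FROM t1keywords
    UNION
    SELECT k.keyword, w.word
    FROM t1keywords k JOIN t2words w ON w.kid = k.kid
    UNION
    SELECT k.keyword, a.alternate
    FROM t1keywords k
    JOIN t2words w ON w.kid = k.kid
    JOIN t3alternates a ON a.wid = w.wid;
"""

# Illustrative vocabulary only; figure 5 of the paper is not reproduced here.
KEYWORDS = [(1, "Secure"), (2, "Transaction"), (3, "Automatic")]
WORDS = [(1, 1, "password"), (2, 2, "issue"), (3, 2, "reserve"), (4, 3, "automatic")]
ALTERNATES = [(1, 2, "issued"), (2, 3, "reserved"), (3, 4, "automatically")]

# Requirements 1, 3 and 10 from figure 6, without their tags.
LIMS_SAMPLE = [
    "All library members enter into their accounts using user name and password.",
    "Once login into member area, a member can search, sort, bookmark books.",
    "System automatically generates mail to remind issued book to be returned within 13 days.",
]


def build_repository():
    """Create an in-memory repository and load the sample vocabulary."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO t1keywords VALUES (?, ?)", KEYWORDS)
    conn.executemany("INSERT INTO t2words VALUES (?, ?, ?)", WORDS)
    conn.executemany("INSERT INTO t3alternates VALUES (?, ?, ?)", ALTERNATES)
    return conn


def classify(conn, requirements):
    """Map each requirement to the set of groups whose terms it contains."""
    terms = conn.execute("SELECT grp, term FROM repo_terms").fetchall()
    groups = {}
    for req in requirements:  # outer loop over REQ
        tokens = set(re.findall(r"[a-z]+", req.lower()))
        tags = set()
        for grp, term in terms:  # inner loop over REPO, no early exit
            if term.lower() in tokens:  # assumed whole-word, case-insensitive match
                tags.add(grp)
        groups[req] = tags
    return groups


if __name__ == "__main__":
    for req, tags in classify(build_repository(), LIMS_SAMPLE).items():
        print(sorted(tags), req)
```

With this sample vocabulary the second requirement receives no tag, which resembles the situation reported for requirement no. 3 in the first LIMS run. Inserting a suitable row into t3alternates and calling classify again would tag it, without any change to the classification loop. Because the inner loop never exits early, a requirement that matches terms from several keywords is placed in several groups.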
