Information Retrieval in Business

Revati Vilas Wable
Department of Information Technology, National Institute of Technology Raipur
G.E. Road, Raipur, Chhattisgarh-492010, India
[email protected]

Abstract— For several years people have realized the importance of archiving and finding information. With the advent of computers, finding useful information in such collections has become a necessity. Information retrieval has become an important research area in computer science and has gained importance in fields such as business, healthcare, agriculture, medicine and law. This paper focuses on the need for, the models of, and the processes involved in information retrieval. A case study of the INSYDER system is presented to give a holistic view of information retrieval in the field of business.

Keywords— Information Retrieval, IR models, Processes, IR tools, INSYDER system.

I. INTRODUCTION

Information retrieval is finding material, usually documents of an unstructured nature, that provides the required information [1]. This material is drawn from large collections stored on computers. It is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. An information retrieval system [2] is a software system that provides access to books, journals and other documents, and that stores and manages those documents. It is often said that information is not knowledge without information retrieval systems. The objective of an information retrieval system [3] is to minimize the time it takes a user to locate the information they need, or in other words to provide the information needed to satisfy the user's question. Satisfaction does not mean finding all the required information on a particular issue. Thus, an information retrieval system collects and organizes information in more than one subject area to equip users with the relevant information as soon as it is requested. It helps in identifying potential candidates for business on the basis of the idea they are looking for. Hence, an information retrieval system does not inform the user about the subject of the inquiry itself, but about the existence or non-existence and whereabouts of documents related to the request.

Fig. 1: Information Retrieval Process

Figure 1 illustrates the information retrieval process. Information retrieval can be classified as free recall, cued recall, and recognition. Information technology is therefore useful in managing vital production knowledge, and it helps the workforce, management, and owners of a company to grow and run their business [4] and to maximize profits. According to Olson (2003), the term 'Information Retrieval' was coined in 1952 and gained popularity in the research community from 1961 onward. Information retrieval was seen as the organizing function behind major advances in libraries, which were no longer storehouses of books [5] but places where information was indexed and catalogued. The concept applies to documents or records that have been organized in an order suited to easy retrieval, which is why such systems are designed to retrieve the documents or information required by the user community. The right information should be available to the target users. Some of the tools generally used for information retrieval [6] are bibliographies, indexes and abstracts, shelf lists, and library card catalogues. Evaluation in information retrieval is the process of systematically determining a subject's worth, merit, and significance using criteria governed by a set of standards. The primary issues in information retrieval systems [7] are query evaluation, document and query indexing, and system evaluation.

An example of an information retrieval problem: consider a large book that many people own, such as Shakespeare's Collected Works. To determine which plays contain the words 'Brutus' and 'Caesar' but not 'Calpurnia', one way would be to start at the beginning and read through all the text [8]. For a computer, the simplest form of document retrieval is this kind of linear scan through the documents, a process generally referred to as grepping through text. Given the speed of modern computers it can be very effective, and it allows useful possibilities for wildcard pattern matching.

In our modern, technology-driven society, data, facts, and knowledge have a higher priority than they had a few years ago. With greater use of the internet, information has become more and more accessible. When we want to access information, it has to be retrieved from online sources such as the Google search engine, which is why search engines are so common. The answer to all these queries is "information retrieval" [9], a discipline within computer and information science concerned with gathering and recovering information precisely. Above all, the importance and accuracy of search engines are due to the complex information retrieval systems that they use. These systems recognize the intentions or needs behind specific search terms and thus provide relevant data for search queries.
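The Shakespeare example from the introduction (plays containing 'Brutus' and 'Caesar' but not 'Calpurnia') can be illustrated with a linear scan. The following Python sketch is only an illustration under stated assumptions: the function names `tokens` and `linear_scan` are hypothetical, and the three short strings are toy stand-ins for the plays, not the actual texts.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def linear_scan(plays, required, excluded):
    """Scan every play and keep those with all required and no excluded words."""
    hits = []
    for title, text in plays.items():
        words = tokens(text)
        has_all = all(word in words for word in required)
        has_none = not any(word in words for word in excluded)
        if has_all and has_none:
            hits.append(title)
    return hits


# Toy stand-ins for the plays (assumption: not the real texts).
plays = {
    "Julius Caesar": "Brutus and Caesar speak; Calpurnia warns Caesar.",
    "Hamlet": "I did enact Julius Caesar; Brutus killed me.",
    "Macbeth": "Is this a dagger which I see before me?",
}

result = linear_scan(plays, required=["brutus", "caesar"], excluded=["calpurnia"])
print(result)
```

Every document is read in full for every query, which is acceptable for a collection of this size but is the cost that indexing is meant to avoid for larger collections.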
Fig. 2: Model of Information Retrieval

Figure 2 shows the concepts involved in the model of an information retrieval system. Each concept is discussed in later sections of this paper. For simple querying of modest collections on modern computers, nothing more than a linear scan is needed: Shakespeare's Collected Works [10] amounts to a bit under one million words of text in total. The primary aim of these systems is to gather knowledge from every level of the organization, summarize it, and present it in a way that improves the quality of the decisions made to increase the company's profit and productivity. Information systems are essential for running and managing a business today.

II. SIGNIFICANCE

Organizations worldwide rely heavily on modern technology, including information systems, to develop new ways to generate revenue, engage customers and streamline time-consuming tasks. With an adequate information system [11], businesses can save time and money and make smarter decisions. This technology can be automated as required. Some of the uses of information retrieval systems in business are:

1) To represent the contents of the analyzed sources in a way that can be matched against users' queries, and to represent the queries in a form suitable for matching against the database. This can be achieved through the design of sophisticated search interfaces.

2) To identify the information related to the user's areas of interest and to act as a bridge between the creators of information and its users. Thus, even a small business should have a well-organized information storage and retrieval system to improve its performance and compete with large-scale businesses. Such a system also helps it to learn, adapt and apply these techniques to increase its growth rate.

3) To provide organizations with immediate value, along with ways to capture tacit knowledge. Such systems also focus on information that already exists in electronic formats, and vendors with established install bases are in a better position to grow than small startups.

4) To make adequate adjustments to the system based on feedback from users and to retrieve the information that is relevant to the business. Additionally, it becomes easier to provide and monitor the internal controls designed to check for fraud, waste and abuse, and to ensure that the business complies with information privacy requirements.

5) To capture, process, store and retrieve the information that holds a business together. The system identifies the sources relevant to the areas of interest of the target user community, analyzes their contents, and represents them in a manner suitable for matching users' queries.

6) To analyze users' queries and represent them in a form suitable for matching against the database. A proper information retrieval system includes an effective indexing system, which not only decreases the chance that information will be misfiled but also speeds up retrieval. The result is a time saving that increases office efficiency and productivity while reducing issues such as stress and anxiety.

7) To match the search statement, as per the user's requirement, against the stored database. The information is retrieved using a variety of tools and techniques that determine the relevance of information and its ranking. Such a system also supports compliance regulations and tax record-keeping guidelines, which increases a business's confidence that it is fully compliant.

Thus all aspects of the system [12] change continuously, driven by rapid developments in information and communication technologies and by the changing patterns of society, of users, and of the information they require. Because such a system relies on algorithms, there is less room for human error. Furthermore, employees can focus on the core aspects of the business rather than spending hours collecting data, filling out paperwork and doing manual analysis, tasks that data analytics tools simplify.

III. MODELS

Fig. 3: Models of information retrieval systems

A model of information retrieval should select and rank the relevant documents for a user's query. Figure 3 shows the prominent models [13, 14, 15, 16, 17] of information retrieval systems, which are discussed in this section. In these models, the texts of the documents and the queries are represented in the same way, so that the selection and ranking of documents can be formalized by a matching function. This function returns a retrieval status value (RSV) for each document in the collection.
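To make the idea of a matching function that returns a retrieval status value (RSV) concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the formulation of any particular model: documents and queries share a bag-of-words representation, the RSV is simply the number of occurrences of query terms in the document, and the names `represent`, `rsv` and `rank` are hypothetical.

```python
from collections import Counter


def represent(text):
    """Shared representation for documents and queries: a bag of lowercase words."""
    return Counter(text.lower().split())


def rsv(query_repr, doc_repr):
    """Illustrative matching function: occurrences of query terms in the document."""
    return sum(doc_repr[term] for term in query_repr)


def rank(query, docs):
    """Return (doc_id, RSV) pairs in decreasing RSV order, dropping zero scores."""
    query_repr = represent(query)
    scored = [(doc_id, rsv(query_repr, represent(text))) for doc_id, text in docs.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "quarterly revenue report for retail business",
    "d2": "business intelligence and business information retrieval",
    "d3": "weather forecast for the weekend",
}

ranking = rank("business information", docs)
print(ranking)
```

The models discussed in this section differ mainly in how the representation and the matching function are defined; swapping out `represent` and `rsv` is where those differences would appear in a sketch like this.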
A. Boolean Model

This model evolved from set theory and is based on the principle of "exact match". A query [14] is posed as a Boolean expression of terms, that is, one in which terms are combined with the operators AND, OR and NOT. Its disadvantage is that it cannot rank the returned list of documents. It is known as a classical information retrieval model because it was the first and the most widely adopted one, and it is used by virtually all commercial information retrieval systems. Each document either matches the query or does not. The result of an exact match is therefore a set of documents without any ranking. The Boolean model is the most common exact-match model. The retrieval function used in this model decides whether a given document is relevant or not.

B. Inference Network Model

In this model, document retrieval is modeled as an inference process in an inference network. Most techniques used by information retrieval systems can be implemented under this model. A document instantiates a term with a certain strength, and the credit from multiple terms is accumulated, given a query, to compute the equivalent of a numeric score for the document [15]. From an operational perspective, the strength of instantiation of a term for a document can be considered as the weight of the term in the document, and ranking is then done in a manner similar to ranking in the vector space and probabilistic models. The strength of instantiation of a term for a document is not specified by the inference network model, and any formulation can be used.

C. Probabilistic Model

The most significant feature of the probabilistic model is its attempt to rank documents by their probability of relevance to the query. Stephen Robertson [16] formulated the probability ranking principle. In this model, documents and queries are represented by binary vectors $\vec{d}$ and $\vec{q}$, each vector element indicating whether a document attribute or term occurs in the document or query.

For each document d, the conditional probabilities that it is relevant, $P(R \mid q, d)$, or irrelevant, $P(I \mid q, d)$, to the query q are calculated. The model uses the odds $O(R) = P(R) / (1 - P(R))$, where $R$ denotes "document is relevant" and $\bar{R}$ denotes "document is irrelevant". Documents are ranked in decreasing order of their probability of relevance, and those exceeding a cut-off threshold c form the retrieved document set, defined as $R(q) = \{ d \mid P(R \mid q, d) \ge P(I \mid q, d),\ P(R \mid q, d) > c \}$, with $P(R \mid d, d) = 1$ and $P(I \mid d, d) = 0$. Hence the ranking of the probabilities provides an effective way to obtain information.

D. Vector Space Model

This model was invented by Salton and his working group. In the Vector Space Model, documents and queries are represented as vectors, and the angle between the two vectors is evaluated using the cosine similarity function [17]. The model introduces a term weighting scheme known as tf-idf weighting. It has a term frequency (tf) factor, which measures the frequency of occurrence of a term in the document or query text, and an inverse document frequency (idf) factor, which measures the inverse of the number of documents that contain a query or document term.

The idea is to represent text and query by weighted term vectors $D = (d_1, d_2, \ldots, d_n)$ and $Q = (q_1, q_2, \ldots, q_n)$ respectively. These two representations are then used to measure the degree of similarity between the query (or a sample document) and a document, $\mathrm{sim}(D, Q) = \sum_{i=1}^{n} d_i q_i$, resulting in a ranked list. Different methods have been proposed for the similarity measurement, such as the cosine measure:

$$\cos(D, Q) = \frac{\sum_{i=1}^{n} d_i q_i}{\sqrt{\sum_{i=1}^{n} d_i^2}\,\sqrt{\sum_{i=1}^{n} q_i^2}}$$

IV. PROCESSES

Web pages mostly contain semi-structured and dynamic information, along with links that may not be easily accessible. Hence searching the World Wide Web is significantly different from searching data in databases, which are static and centralized. A number of query languages have been developed, based on semi-structured data models that are mostly represented as labeled graphs. The main problem is how to convert knowledge into information so that data mining algorithms can function properly. Although most web documents are text-oriented, a considerable amount of information is not easily accessible through common search methods, so documents cannot be retrieved without accessing each one individually. In this section each process involved is described in detail, with examples, to give readers a better understanding. Algorithms and strategies that help to find adequate information precisely are also discussed. The ultimate aim of an information retrieval system is to find the relevant information or documents that satisfy the user's information need. To achieve this goal, it usually implements the following processes:

A. Indexing Process

Indexing is used to describe the aboutness of documents, text and media. Index terms may be derived from the document itself or from a document-independent source. Extracting the terms from the document itself (author aboutness) preserves the author's original intention and reflects his expertise. The index can use controlled or uncontrolled vocabulary and can be built manually or automatically. Controlled vocabulary is derived from an authorized term list or glossary in order to overcome problems such as homonyms and unfamiliarity with the original terms; using the glossary, users can locate the appropriate indexed documents more easily. Controlled terms are used particularly in online databases, e.g. those supplied by hosts like STN [18], Dialog and many others.

Indexing is where the documents required by users are transformed into searchable data structures, so it can be referred to as a process of extraction. It is a core function of the information retrieval process, as it is the first step and supports the efficient retrieval of information. In this process, document surrogates are first created to represent each document. Secondly, the original documents are analyzed, which involves both simple data (identifying meta-information such as author, title and subject) and complex data (linguistic analysis of the content).
Many of these systems allow the user to list the controlled terms. Indexes are the data structures used to make searching faster. A disadvantage, however, is that controlled terms frequently lag behind new developments. The manual assignment of index terms lacks consistency; it is a subjective form of indexing. Having several people index the same document typically leads to many different index terms, depending on the background of those people. Nonetheless, in a professional environment one tries to reduce the problem of indexer inconsistency and inter-indexer inconsistency by finding individuals with domain knowledge and by using controlled vocabulary in addition to uncontrolled vocabulary, along with other indexing policies. Automatic indexing [19] guarantees the consistency of index terms, as the underlying algorithms are always the same and produce the same results. The idea behind automatic indexing is to find the characteristic terms that best represent a document. Consequently, at least two requirements for the representation exist: identification of suitable content units (recall) and determination of term weights to distinguish important terms from less important ones (precision). An outline of automatic indexing is given by the following steps:

1. Calculate the frequency of each term k within each document i ($FREQ_{ik}$).
2. Sum the frequency of each term k over the entire collection, $\sum_i FREQ_{ik}$.
3. Sort the terms by decreasing frequency.
4. Define an upper and a lower threshold for the frequency.
5. Remove all terms above or below these thresholds.
6. The remaining terms are then the indexing terms.

With this approach, however, no term weighting is performed to distinguish important terms from less important ones. A reasonable measure of importance is obtained with the tf-idf equation, which favors terms with a high frequency in particular documents (tf) but a low frequency overall in the collection (idf). On the basis of a comparison of automatic indexing methods with manual keyword indexing, using the abstracts of the Cranfield collection, it has been argued that automated indexing is not always as good as manual indexing techniques. The conclusion of that evaluation suggests that weighted terms should be used, derived from document excerpts at least as long as an abstract.

B. Filtering Process

Filtering is a name used to describe a variety of processes involving the delivery of information to people who need it. A more specific definition of the information filtering problem considers a set of dynamic information objects: an information filtering system matches the characterizations [20] of the information objects against user profiles, that is, descriptions of the users' information needs, to obtain a relevance estimate of the information objects with respect to those needs. Some authors raise, in the title of their article, the question of whether information retrieval and information filtering are "two sides of the same coin", hinting at the connection between the two disciplines. They conclude that information retrieval and information filtering are indeed two sides of the same coin: both disciplines address the same goal, satisfying people's information needs, using similar techniques. These systems deal with the ranking of semi-structured or unstructured (usually textual) data in order of relevance. Considering selective dissemination of information (SDI), which is common in online database searching, the original application area of information retrieval, filtering has been part of the information retrieval context for a long time. However, surrounding issues such as user modeling, user tracking to build profiles, and social and privacy aspects have never been the focus of information retrieval research, although they are now very much at the center of research.

Fig. 4: Information Filtering System

Figure 4 shows the basic mechanism of an information filtering system. Filtering refers to the selection of relevant information, or the rejection of irrelevant information, from a stream of incoming data. The filtering agent uses filters to remove the irrelevant incoming documents, and thus presents the user with only those documents that match the user's interests. Over time the filtering system [21] becomes more effective by learning the user's preferences, and hence becomes more accurate in its filtering tasks. It performs tasks such as interfacing with the source document subsystem, managing the user profile, calculating the relevance of a document vector with respect to user profiles, and communicating with the user. These systems are generally applied to serve a user's long-term interests. Different criteria may be used to filter documents or articles. One such approach is collaborative filtering.

• Collaborative Filtering:
It is a form of social filtering based on the subjective evaluations of other readers, attached as annotations to the shared document. Schemes using collaborative filtering [22] rely on human judgments, so they do not suffer from the problems that automatic techniques have with natural language, such as polysemy, synonymy, and homonymy, and language constructs at a pragmatic level, such as sarcasm, humor and irony, may also be recognized.
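The profile-matching step of an information filtering system, in which each incoming document vector is compared with a user profile and irrelevant documents are rejected, can be sketched as follows. This is a minimal illustration under stated assumptions: the profile is a fixed bag of words, relevance is the cosine similarity of raw term counts, and the threshold value and the names `to_vector`, `cosine` and `filter_stream` are hypothetical. It does not model how a filtering system learns the user's preferences over time.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_stream(stream, profile, threshold=0.2):
    """Yield (document, score) for incoming documents that match the profile."""
    profile_vec = to_vector(profile)
    for doc in stream:
        score = cosine(to_vector(doc), profile_vec)
        if score >= threshold:  # reject documents below the assumed threshold
            yield doc, score


incoming = [
    "supplier price changes in retail market",
    "football results from the weekend",
    "new competitor enters retail market",
]
profile = "retail market competitor supplier"

kept = [doc for doc, _ in filter_stream(incoming, profile)]
print(kept)
```

Unlike ad hoc retrieval, where the collection is fixed and the queries change, here the profile is long-lived and the documents arrive as a stream, which is why the function is written as a generator over incoming items.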
C. Search Strategies

The required information is retrieved according to the user's requirement. For this purpose, various searching algorithms such as linear search, binary search, brute-force search and many others are applied to the World Wide Web to get the most suitable information. Searching also draws on one's knowledge of online searching systems, indexing vocabularies and the conventions generally followed in constructing text databases. A good understanding of these, and of how they are implemented in the system being searched, makes searching easier. Some of the most suitable search strategies are as follows:

1) Brief search: This search consists of a single query statement such as Information Retrieval in Business or Information Retrieval and Business Intelligence [23]. It can act as a good starting point for further in-depth querying with other search strategies. It identifies the sources of information relevant to the areas of interest of the target user community and analyzes the contents of those sources accordingly. It also represents the contents of the analyzed sources so that they can be matched with queries, and analyzes user queries so that they can be matched with the database. Brief search is one of the most commonly used searching techniques on the World Wide Web.

2) Block building approach: The idea behind this approach is to divide the required information into several concepts, to search for these concepts separately, and to combine the results in a bottom-up manner. Each concept yields a result set within the database, and combining these result sets with the set operation AND retrieves the desired documents. Adding related words or acronyms with the set operation OR can expand the final result set. Figure 5 illustrates the block building [24] approach:

Fig. 5: Example of block building approach

The approach is advantageous when the information to be retrieved is complex and consists of more than three entities or concepts, because the user can keep track of how well each entity is represented in the given database. The only problem is that the user has to be aware of the Boolean set operations used; otherwise the search will fail. Only some search engines report how many times a keyword in the query has been found.

3) Successive Fractions Approach: This approach starts with a broad set of documents and successively filters it with the set operation AND, narrowing the result set until a feasible size is reached from which the desired information can be retrieved. In comparison to the block building approach discussed above, this search strategy is impractical for searching on the World Wide Web [25], as most of the time no status information about the search is given. Consider instead an online database search system that keeps track of the different search sets and combines them. It narrows the search by applying limiting techniques such as the Boolean operators AND, OR and NOT. This process is carried out step by step until the search is reduced to a manageable number of hits.

4) Facets strategies: These strategies can be described as variations of the block building approach. In the most specific concept first strategy, the user selects as the first facet the one believed to be the most specific in the information need. This type of search allows navigation along several independent dimensions. It is important to communicate the user's current location and navigation options, and this navigational state can be communicated in three ways: Breadboxes, Multi-Selectable Facets and Inline Breadcrumbs [26]. In the lowest postings first strategy, the user chooses as the first concept the one believed to be the rarest in the database.

Fig. 6: Most specific facet strategy

Figure 6 represents the most specific facet strategy, where the most specific and next most specific concepts are analyzed and added to the result set. Apart from the search strategies, searching the World Wide Web also differs from searching online databases. Online databases describe the documents they contain using a formal, content-based approach. Unlike with internet search engines, the user of an online database gains an overview of all his search sets and can combine them easily. As the HTTP protocol is stateless, this is difficult to achieve for search engines.

5) Citation Pearl Growing: This searching technique is based on the idea of finding one relevant document and then finding similar ones by using vocabulary, descriptors or classification codes from that document. The user stops the search when he has enough documents to satisfy his information need. This technique corresponds to the relevance feedback option that some retrieval systems offer and is also useful for searching the internet [27], as the user can select appropriate keywords from relevant documents. The only problem with this technique is that search engines mostly do not provide system support for it.

V. INFORMATION RETRIEVAL TOOLS

Several information retrieval tools are available on the Internet. Users can choose the tool of their choice to retrieve the desired material. Because different people access information for different reasons, users need to use the right tools to locate their material. The primary goal is to supply the
right information to the right user at the right time. Different techniques [28], materials and methods are used for retrieving the desired information. Information retrieval provides organizations and businesses with immediate value, namely important information and data, and with ways to capture tacit knowledge. Examples of information retrieval tools include classification schemes, catalogues and indexes. Other information retrieval tools in the library include almanacs, handbooks, periodicals, atlases, encyclopedias, directories, dictionaries, and concordances, among others. Online tools include internet search engines, subject directories, online databases, online public access catalogues (OPAC) and digital libraries. With the help of these tools, information retrieval provides a means to get at information that already exists in electronic formats. These tools differ in structure and function, and use different methods and techniques for storing and retrieving information. This reflects their target audience and their intended use. Some of these tools are:

1) OPACs, i.e. online public access catalogues, are generally used by students to find books in an online library instead of borrowing them from the physical library. An OPAC is a computerized catalogue [29] containing bibliographic records of the items in a library. However, in the digital age students rely heavily on the Internet and usually use an internet search engine to find sources of information. The information found by a search engine may be a web page, an image or any other type of file. A search engine uses a web crawler to retrieve information from millions of web pages, which is then stored in the search engine index, giving the search engine the most comprehensive coverage of the web.

2) Subject directories such as Yahoo and DMOZ are also used to locate the required information. These directories [30] are created by the directory developers themselves, who manually assign submitted sites to a suitable subject category. Despite this, some students mistake subject directories for internet search engines, whereas real internet search engines include AOL Search, AltaVista, and Google.

3) Online databases provide access to remote databases through a database vendor or service provider. Examples of such databases [31] are IEEE, Elsevier, and ACM. An online database can also be regarded as a digital library, as it is an organized collection of information with associated services in which the information is stored in digital formats and is accessible over a network. This ensures a high-quality resource.

VI. CASE STUDY

Here we review a real application of information retrieval by considering the INSYDER system, which helps in seeking business information on the World Wide Web. Using external information in a business intelligence system helps an enterprise to know more about its customers, suppliers, competitors, government agencies, and other external factors. Valuable information about external business factors is readily available on the internet, but only a few sources are used as reliable data sources. Figure 7 shows the architecture [32] of the INSYDER system, whose components were mainly developed in Java. Only the component involved in semantic analysis was developed in C++.

Fig. 7: Architecture of the INSYDER System

The user interface and visualizations were also developed in Java, using the Java Foundation Classes (Swing). The scheduler is responsible for monitoring the user's query on the internet. The watch function used for this purpose regularly checks user-defined Web pages [33] for changes. These sources are defined in various XML documents, which makes it easy to maintain and extend the sources in an organized manner. The Web-API, a set of functions and methods, supports easy access to the documents. For every document that matches the user's requirement, the system calls the semantic analysis via a COM wrapper to get a relevance value. The semantic analysis uses a semantic net that models the real world through a controlled vocabulary and can be individually adapted to various application domains.

INSYDER uses a dynamic search approach [34] for the online search, discovering relevant information by following links. The sources represent the starting points of a search. The relevance ranking is determined by semantic analysis of the documents. A significant aspect is that ideas and components from different fields are combined. The combination of a dynamic search with metadata generation and the visualization of document-inherent data is new. The visualization of the query is performed in the following manner:

1) The first step is the semantic analysis, which includes various relationships such as narrower term, part-of and broader term. The type of relationship is not represented in the graph visualization, which shows only that a relationship exists.

2) To keep the overview (for the user's assistance), the system was designed with a detailed view and a full view. This is done by taking the information from the tree view. For example, if the user clicks on a branch of the tree view, then only that branch should be visualized in the graph. On clicking the root of the tree, the result should be a graphical presentation of the entire tree in the graph.

3) Lastly, interaction with the graph representation includes all the terms represented in the graphical representation. It can be moved by keeping the
