Title: Data Mining & Warehousing


Mailing address:

Ms. Jayamma.R,
Ms. Archana.B

P.B.Siddhartha College of Arts and Sciences,


Siddartha nagar,
Mogalrajpuram,
Vijayawada.

Email: [email protected]
[email protected]

Phone number: 9985407626


Data Mining and Data Warehousing
- A Case Study on Geographic Information Systems (GIS)

Abstract

Data mining, the extraction of hidden predictive information from
large databases, is a powerful new technology with great potential to help
companies focus on the most important information in their data
warehouses. Data mining tools predict future trends and behaviors,
allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the
analyses of past events provided by retrospective tools typical of decision
support systems. Data mining tools can answer business questions that
traditionally were too time consuming to resolve. They scour databases
for hidden patterns, finding predictive information that experts may miss
because it lies outside their expectations.
Most companies already collect and refine massive quantities of
data. Data mining techniques can be implemented rapidly on existing
software and hardware platforms to enhance the value of existing
information resources, and can be integrated with new products and
systems as they are brought on-line. When implemented on high
performance client/server or parallel processing computers, data mining
tools can analyze massive databases to deliver answers to questions such
as, "Which clients are most likely to respond to my next promotional
mailing, and why?"
This paper investigates the basic technologies of data mining and
data warehousing and illustrates their relevance to today's business
environment. It also gives a basic description of how data warehouse
architectures can evolve to deliver the value of data mining to end users.
This paper focuses on the need for data mining and data warehousing
in knowledge retrieval, with a case study on Geographic Information
Systems (GIS).
Introduction

Data mining is the process of exploration and analysis of large
quantities of data in order to discover meaningful patterns and rules.
Generally, data mining (sometimes called data or knowledge
discovery) is the process of analyzing data from different perspectives and
summarizing it into useful information - information that can be used to
increase revenue, cut costs, or both. Data mining software is one of a
number of analytical tools for analyzing data. It allows users to analyze
data from many different dimensions or angles, categorize it, and
summarize the relationships identified. Technically, data mining is the
process of finding correlations or patterns among dozens of fields in large
relational databases.
The activities involved in extracting meaningful new information
from the data are:
• classification
• estimation
• prediction
• affinity grouping or association rules
• clustering
• description and visualization
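Classification, estimation, and prediction are examples of directed data
mining, in which a known target variable guides the analysis. Affinity
grouping, clustering, and description and visualization are examples of
undirected data mining, in which no target is specified in advance.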
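As an illustration of one undirected activity from the list above, affinity grouping, the following is a minimal sketch that computes support and confidence for association rules over a handful of market baskets. The transactions, item names, and function names are hypothetical and chosen only for illustration; real tools use more scalable algorithms over far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) taken from the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def frequent_pairs(baskets, min_support):
    """Item pairs whose support is at least min_support."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

if __name__ == "__main__":
    print(frequent_pairs(transactions, 0.4))
    print(confidence({"bread"}, {"milk"}, transactions))
```

No target field is named in advance here: the pairs that surface are whatever co-occurs often enough, which is what distinguishes this from directed tasks such as classification.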

What is Data Warehousing?


A data warehouse is a copy of transaction data specifically
structured for querying and reporting. It can be a relational database,
multidimensional database, flat file, hierarchical database, object
database, etc. Data warehouse data often gets changed. And data
warehouses often focus on a specific activity or entity. Data warehousing
is not necessarily for the needs of "decision makers" or used in the
process of decision making. The overwhelming majority of data warehouse uses
are for quite mundane, non-decision-making purposes rather than as grist
for making decisions.
Data warehousing is becoming "de rigueur" in every major
company. And big business is betting more and more on a series of data
mining tools to help them predict future trends based on an analysis of
historical behavior. In constructing data warehouses today,
companies are pinning their hopes on being able to identify, from the
content stored in those warehouses, likely new customers
for products or services by integrating existing customer
account information with demographic and lifestyle data.
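To make the idea of "a copy of transaction data specifically structured for querying and reporting" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (dim_customer, fact_sales), columns, and rows are assumptions invented for this example; they stand in for customer account data joined with demographic attributes.

```python
import sqlite3

# In-memory database standing in for a small warehouse (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        age_band    TEXT,   -- demographic attribute
        region      TEXT
    );
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product     TEXT,
        amount      REAL
    );
""")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?, ?)",
    [(1, "A", "18-34", "North"), (2, "B", "35-54", "North"), (3, "C", "18-34", "South")],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [(1, 1, "widget", 20.0), (2, 1, "gadget", 35.0), (3, 2, "widget", 20.0),
     (4, 3, "gadget", 35.0), (5, 3, "gadget", 35.0)],
)

# Reporting query: revenue by region and age band, largest first.
report = conn.execute("""
    SELECT c.region, c.age_band, COUNT(*) AS n_sales, SUM(s.amount) AS revenue
    FROM fact_sales AS s
    JOIN dim_customer AS c USING (customer_id)
    GROUP BY c.region, c.age_band
    ORDER BY revenue DESC
""").fetchall()

if __name__ == "__main__":
    for row in report:
        print(row)
```

The point of the sketch is the shape of the data rather than the engine: transactions are copied into a structure where a single join and aggregation answers a reporting question.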

The Expansion of the Appeal of Data Mining and Data Warehousing

In the book "Advances in Knowledge Discovery and Data
Mining," published by the MIT Press, Dr. Usama M. Fayyad and his fellow
editors stated that: " ... in combining the two terms "data mining" and
"data warehousing", they are attempting to build bridges between the
statistical, database and machine learning communities and appeal to a
wider audience of information systems developers."
We think this approach is very helpful - in fact, we believe that the
term "data mining" is actually on the cusp of appearing on the radar
screens, for the first time, of millions of companies. We predict that the
path of familiarity with the terms "data mining" and "data warehousing"
will be very similar to what has happened to the term "GIS" in the last five
years.

Information Overload

Advances in data technology have overwhelmed corporations with
information, driving the urgent need to develop new tools that can help
transform data into business advantage. The advent of the Internet and
the World Wide Web has, in a matter of five years, made more
information accessible to more individuals than at any other time in our
history.
Companies know that their in-house databases contain untapped
knowledge about themselves. This information, if properly stored, can
then be retrieved and analyzed. The outcomes can provide them with a
competitive edge in a world of saturated markets. The single biggest
problem is that so much of this information is distributed across networks,
divisions, and often continents!

What to do?
In our opinion, corporate nirvana in the use of data mining tools
and data warehousing will only be achieved when companies link the
concept of data mining to equally sophisticated information retrieval tools.
These tools will work on the basis of combined machine and human
intervention in more intelligent ways than those presently offered in
today's information retrieval tools. Corporations will need to run two
complementary data/information retrieval processes. One process will
literally mine data and allow software to detect hidden patterns. Another
process will query information through the posing of specific questions
and secure targeted answers.

The Risks in Data Warehousing

We recognize that there is a major difference between data and
information. In data warehousing, there is a tremendous risk that what
will be delivered from these warehouses will be vast quantities of data
rather than quality information. This is where the importance of metadata
comes into play. There is no question that improving the quality of
metadata automatically improves the quality of the information retrieved
and also reduces the amount of data retrieved. Successful data
warehouses are invariably small, heavily used by the target
business unit, constantly changing in content
in response to changing market needs, and controlled by the
business unit of an enterprise.
Providing all in-house company users with a
single, uniform view of information throughout a corporation is key to a
company's efficiency and ability to deliver higher-quality customer
service. It is only through the complementary use of data mining tools
and data warehousing, alongside cutting-edge information
retrieval tools and processes, that corporate nirvana in the area of
knowledge extraction can be achieved.

Impact of Cyberspace on Data Warehousing and Data Mining

We see today a unification of decision-support technologies into a
universal knowledge system. The advent of the Internet and the World
Wide Web allows not only for the cheap publication of terabits of
information around the world, but also has created a paradigm shift in
how we view information, where it is, and how any piece of information
relates to any other. This is good news for corporations if they can start
being really creative about how to organize both data and information
within their organizations.

The Distribution of Documents in Cyberspace

Ever since Gibson, we are familiar with applying a spatial metaphor to the
Internet. Documents on the Web are distributed in a "cyberspace." By
extension of the metaphor, one "navigates" across the Web, visits sites,
and so on.
Part of the excitement of the Web flows directly from this
paradigm shift in how we view what information is, where it is, and how
one piece of information is related to another. In the near future we will
see the reality of the Web adhere even more strongly to this metaphor. We will be able to
navigate cyberspace with the aid of good maps, maps which tell us where
we are and what is nearby. Cyberspace will acquire the textures of real
space, with landmarks both personal and official. We will be able to mark
our trails like a cyber-Hansel and cyber-Gretel. We will be able to measure
distances in any number of useful ways, effectively warping that space to
our specification. All this has very interesting implications for progress in
achieving greater accuracy in information retrieval.

The Shortcomings of Information Technology

Information technology as offered to corporations today
suffers from having yet to catch up with cyberspace, even in its current
incoherent state. Corporate data is treated as a supermarket from which
items are to be retrieved, or a pit from which data should be mined.
"Search engines" on the Web are similarly crippled at a conceptual level.
Users try to funnel their desires through keywords shot into the dark.
There are lots of well-known problems with keywords. For
instance, very few recipes contain the word "recipe." The way to look for
recipes is actually with such words as "teaspoon," "tablespoon," and
"cuillere"! And then the same words can mean different things to different
people. An architect looking for documents on computer-aided
architecture might form the query -"computers AND architecture" - and
arrive at the home page of Intel. Mapping cyberspace will solve these
problems, and the enabling technologies of data mining and data
warehousing must figure out quickly how to proactively lead the business
charge. If not, there is a risk of being trampled by a stampede
of business users, who will quickly brand those terms as their own,
inventing a broader meaning for those concepts than they have
today.

Mapping Cyberspace

In a sense, current search engines produce zero-dimensional
maps. Points in this space correspond to vectors of words and documents,
and documents are returned in a list ordered by their distance (relevancy
distance) from the point representing the query. They collapse the multiple
dimensions of cyberspace into a point and all orientation is lost.
Information retrieval tools need to learn that lesson and must 'figure out'
how to ensure that orientation within information can be guaranteed to all
users of a data warehouse.
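The ranking behavior described above, and the keyword problems from the previous section, can be sketched with a bag-of-words vector model and cosine similarity. This is an assumed, simplified stand-in for what a search engine does, not a description of any particular engine; the documents and names below are invented for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent text as a bag-of-words vector (word -> count)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return (score, name) pairs, most similar document first."""
    q = vectorize(query)
    return sorted(((cosine(q, vectorize(text)), name) for name, text in docs.items()),
                  reverse=True)

# Hypothetical document collection.
docs = {
    "chip_maker": "computers processor architecture chips",
    "cad_guide": "software for drawing building plans",
    "recipe_card": "add one teaspoon sugar and one tablespoon butter",
}

if __name__ == "__main__":
    print(rank("computers architecture", docs))
    print(rank("recipe", docs))
    print(rank("teaspoon tablespoon", docs))
```

In this toy collection the architect's query ranks the chip maker first and scores the drawing guide at zero, and the query "recipe" matches nothing even though a recipe is present. The output is a flat ordered list, so any sense of how the documents relate to one another is lost.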

The Importance of Agents

Personal agents are entering our lives more and more. And
over the next few years, they will come to have even more importance in
the area of information retrieval.
Coming from many sources, notably artificial life, agents have
been gaining popularity as a way of conceptualizing software design. The
agent has a certain autonomy, and inherent rules of behavior. There may
be many agents in a system, each responsible for one or many tasks, and
able to cooperate with other agents. For example, in the "Chiliad
publishing system," an agent may be responsible for the maintenance of a
particular document, or subject area.
Some of these ideas are extensions of the already familiar
personalized newspaper, such as PointCast. However, by having many
agents, and having them interacting in sophisticated ways and under
expert control, and then having the results presented in a visual map, we
create a system with both quantitative and qualitative advantages. The
multiplicity of agents contributes to the robustness of the system, since
imperfections in a given agent need not propagate. It also contributes to
its speed, since an agent-based system is naturally scalable.
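A minimal sketch of the agent idea follows, assuming each agent owns one subject area and a dispatcher offers every document to every agent. The class and function names are hypothetical and do not describe the Chiliad system or any real product. A deliberately faulty agent is included to show how one agent's failure need not propagate to the others.

```python
class SubjectAgent:
    """An agent responsible for documents in one subject area."""

    def __init__(self, subject, keywords):
        self.subject = subject
        self.keywords = set(keywords)
        self.documents = []

    def accepts(self, text):
        """Rule of behavior: accept text sharing any keyword with the subject."""
        return bool(self.keywords & set(text.lower().split()))

    def maintain(self, doc_id):
        """Record that this agent now looks after the document."""
        self.documents.append(doc_id)


class FaultyAgent(SubjectAgent):
    """Simulates an imperfect agent that always fails."""

    def accepts(self, text):
        raise RuntimeError("agent malfunction")


def dispatch(agents, doc_id, text):
    """Offer a document to every agent; return the subjects that took it."""
    handled_by = []
    for agent in agents:
        try:
            if agent.accepts(text):
                agent.maintain(doc_id)
                handled_by.append(agent.subject)
        except Exception:
            continue  # a failing agent is skipped; the others still run
    return handled_by


if __name__ == "__main__":
    agents = [
        FaultyAgent("broken", []),
        SubjectAgent("gis", ["gis", "map"]),
        SubjectAgent("mining", ["mining", "patterns"]),
    ]
    print(dispatch(agents, "d1", "GIS map layers"))
    print(dispatch(agents, "d2", "mining patterns on a map"))
```

Because each agent keeps its own state and rules, more agents can be added to the list without changing the dispatcher.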

GIS - A Case Study


Geographic Information Systems (better known as GIS) were for
over 25 years the purview of companies involved with primary
resources that possessed large mainframes on which to
store and retrieve massive amounts of relational data. ESRI,
based in Redlands, California, was one of the original leading
companies and remains so today. There are now millions of
companies using 'business GIS' from their desktops and laptops to
both display and extract all kinds of information relating to such
common problems as locating a new shop or service store so as to
guarantee the largest "reservoir" of people in the surrounding area.
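The store-location problem mentioned above can be sketched as a simple catchment count: for each candidate site, sum the population living within a given radius and pick the largest. The coordinates, population figures, site names, and radius below are invented for illustration, and great-circle (haversine) distance is an assumed simplification of the drive-time analysis a commercial GIS would offer.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def population_within(site, blocks, radius_km):
    """Total population of blocks whose centre lies within radius_km of site."""
    return sum(pop for lat, lon, pop in blocks
               if haversine_km(site[0], site[1], lat, lon) <= radius_km)

def best_site(sites, blocks, radius_km):
    """Name of the candidate site with the largest surrounding population."""
    return max(sites, key=lambda name: population_within(sites[name], blocks, radius_km))

# Hypothetical census blocks: (latitude, longitude, population).
blocks = [
    (16.500, 80.640, 1200),
    (16.510, 80.650, 800),
    (16.600, 80.700, 1500),
    (16.605, 80.705, 900),
    (16.900, 81.000, 5000),
]
# Hypothetical candidate store locations.
sites = {"A": (16.505, 80.645), "B": (16.602, 80.702)}

if __name__ == "__main__":
    for name, location in sites.items():
        print(name, population_within(location, blocks, 3.0))
    print("best:", best_site(sites, blocks, 3.0))
```

Linking each block to demographic attributes instead of a single population figure would turn the same loop into the kind of customer-profile query the business GIS products made popular.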
In other words, GIS went from a very specialized status, known
only to a small group of professionals working in such fields as the oil and
gas industry, forestry, and mining, to being an enabling application that
could also be used on the desktop. It proved to be a strategic business
advantage to both small and large companies alike. A burst of new
applications came on the market, and names like MapInfo and
Strategic Mapping captured chunks of the global business GIS market.
Interestingly enough, it was the GIS capability of linking
demographics with psychographics that opened up the way for whole
new suites of business applications to be built.
It is our prognosis that the same trajectory is going to be followed
by both data mining and data warehousing. Once that
happens, the growth of this industry, like that of GIS, will be exponential.
Even today, there is confusion by many in business over the use of the
term "data." Often that term is used to cover textual and other forms of
information and not just transactional data. In fact the terms "data" and
"information" are often used interchangeably by many business people as
they go about their day-to-day data/information-gathering tasks.
We believe that in the very near future, the present suites of data
mining tools will start coming with a complement of new information
retrieval tools. These tools will have an in-built ability to help even an
untrained user extract, from textual and other media formats, the kind of
gold nuggets of information presently being pulled from a company's
transactional data, by suites of data mining tools. It is important to bear
in mind that a large chunk of any company's in-house information is to be
found in textual form. That information, if retrieved from a data
warehouse using advanced information retrieval tools, can be made to
bring forth added value information, leading to better business decisions.
The first computerized GIS began its life in 1964 as a project of the
Rehabilitation and Development Agency Program within the government
of Canada. The Canada Geographic Information System (CGIS) was
designed to analyze Canada's national land inventory data to aid in the
development of land for agriculture. The CGIS project was completed in
1971 and the software is still in use today.
From the mid-1960s to 1970s, developments in GIS were mainly
occurring at government agencies and at universities. In 1964, Howard
Fisher established the Harvard Lab for Computer Graphics where
many of the industry's early leaders studied. The Harvard Lab produced a
number of mainframe GIS applications. This development was one of the
first systematic map databases.
A Geographic Information System (GIS) is an automated
information system that is able to compile, store, retrieve, analyze, and
display mapped data. Only a decade ago this technology was limited to a
relatively small number of colleges, universities, and local, state, and
federal agencies. The two general types of users are systems users (who
have hands-on use of the technology) and end users (who are users of
the information generated by a GIS). Today, it is used by government
officials, natural resource and social analysts, and many others. Its
applications include environmental research and model building, urban
demographic studies, and transportation analysis to mention only a few.
While its use is expanding almost daily, its most important applications
include those that support decision making.
The activities normally carried out on a GIS include:
• The measurement of natural and human-made phenomena and
processes from a spatial perspective. These measurements
emphasize three types of properties commonly associated with
these types of systems: elements, attributes, and relationships.
• The storage of measurements in digital form in a computer
database. These measurements are often linked to features on a
digital map. The features can be of three types: points, lines, or
areas (polygons).
• The analysis of collected measurements to produce more data and
to discover new relationships by numerically manipulating and
modeling different pieces of data.
• The depiction of the measured or analyzed data in some type of
display - maps, graphs, lists, or summary statistics.

Components of a GIS

A Geographic Information System combines computer
cartography with a database management system. Within the GIS
databases a user can enter, analyze, and manipulate data that is
associated with some spatial element in the real world. The cartographic
software of the GIS enables one to display the geographic information at
any scale or projection and as a variety of layers which can be turned on
or off. Each layer would show some different aspect of a place on the
Earth. These layers could show things like a road network, topography,
vegetation cover, streams and water bodies, or the distribution of annual
precipitation received.
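The feature types and layers described above can be sketched with a small data model: features are points, lines, or polygons carrying attributes, layers group features and can be switched on or off, and simple analysis derives new measurements from stored coordinates. The class names, layer names, and coordinates are hypothetical, and planar (x, y) map units are assumed rather than any real projection.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Feature:
    kind: str                      # "point", "line", or "polygon"
    coords: list                   # list of (x, y) tuples in map units
    attributes: dict = field(default_factory=dict)

@dataclass
class Layer:
    name: str
    features: list
    visible: bool = True           # layers can be turned on or off

def line_length(coords):
    """Length of a line feature: sum of its segment lengths."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

def polygon_area(coords):
    """Area of a simple polygon using the shoelace formula."""
    n = len(coords)
    twice_area = sum(
        coords[i][0] * coords[(i + 1) % n][1] - coords[(i + 1) % n][0] * coords[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

def visible_features(layers):
    """Features from every layer that is currently switched on."""
    return [f for layer in layers if layer.visible for f in layer.features]

# Hypothetical layers for one small area.
roads = Layer("roads", [Feature("line", [(0, 0), (3, 4)], {"name": "Main Rd"})])
lakes = Layer("water", [Feature("polygon", [(0, 0), (4, 0), (4, 3), (0, 3)], {"name": "Lake"})])
wells = Layer("wells", [Feature("point", [(1, 1)], {"depth_m": 30})], visible=False)

if __name__ == "__main__":
    print(line_length(roads.features[0].coords))
    print(polygon_area(lakes.features[0].coords))
    print([f.attributes for f in visible_features([roads, lakes, wells])])
```

Keeping attributes beside the geometry is what lets the database side of a GIS answer questions about the features that the cartographic side draws.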

Conclusion
If data mining and data warehousing are to become corporate
nirvana for the 21st Century, then they must be built as complex adaptive
systems with the business end user firmly in mind. As data mining
companies work on adding complements of information retrieval
processes and tools to their present suites of offerings, this will vastly
speed up the adaptation of the data mining industry to the broader needs
of the business end user. When the worlds of data mining and knowledge
extraction expand to include information retrieval from alternative media
formats and text based data, then "data mining" and "data warehousing"
will become the hottest buzzwords for businesses in the information age.

