Data Mining Project
CSC 312
Computer Architecture
Assignment:
Project Report
Submitted to:
Dr. Ruwayda Takchi
Submitted on:
10/7/2022
Submitted by:
JoeElie el Fadl (EE): 20197074
Marcel Samrout (CS): 20217156
Live Link:
https://youtu.be/Xd42RPLbfhA
Plenty of techniques are used for data mining; our main objective, however, is to focus on the latest
and most important of them.
Data mining tools sift through large batches of data sets with a broad range of techniques to discover
structures such as anomalies, patterns, journeys, or correlations.
Data mining has been around since the 1990s, and the data mining we know and use today comprises three
disciplines: statistics, artificial intelligence, and machine learning.
These three elements have helped us move beyond the tedious processes of the past, toward simpler and
better automation for today's complex data sets.
In fact, the more complex and varied these sets are, the more relevant and accurate their insights and
predictions will be.
Data mining is a crucial component of successful analytics initiatives in organizations. The information it generates
can be used in business intelligence (BI) and advanced analytics applications that involve analysis of historical data,
as well as real-time analytics applications that examine streaming data as it's created or collected.
Effective data mining aids in various aspects of planning business strategies and managing operations. That
includes customer-facing functions such as marketing, advertising, sales and customer support, plus
manufacturing, supply chain management, finance and HR. Data mining supports fraud detection, risk
management, cybersecurity planning and many other critical business use cases. It also plays an important role in
healthcare, government, scientific research, mathematics, sports and more.
Outlining your goals: It is important to understand our objective, allowing us to set the most accurate
project parameters, which include the time frame and scope of the data and, most importantly, our primary
objective and how to achieve it.
Understanding your data sources: With deeper knowledge of your project parameters, you will be better
able to determine which platforms and databases are necessary to solve the problem, whether that is a
CRM system or an Excel spreadsheet.
Preparing your data: In this step we use the ETL process, which stands for extract, transform, and load.
This prepares the data, ensuring it is collected from the various selected sources, cleaned, and then collated.
Analyzing data: At this stage, the organized data is fed into an advanced application, and different machine
learning algorithms get to work identifying relationships and patterns that can help inform decisions
and forecast future trends.
The application organizes your elements of data, a.k.a. data points, and standardizes how they relate to one another.
Reviewing the results: Here you determine if, and how well, the results and insights
delivered by the model can assist in confirming your predictions, answering your questions, and achieving
your objective.
Deployment or implementation: Upon completion of the data mining project, the results should be made
available to the decision makers via a report. They can then choose how they would like to use the
information to achieve the objective.
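The steps above can be sketched as a minimal extract-transform-load-and-analyze pipeline. Everything here is a hypothetical illustration: the field names, the in-memory "CSV export," and the cleaning rule (dropping rows with missing amounts) are made up, and the "analysis" is just summary statistics standing in for a real model.

```python
import csv
import io
import statistics

# Extract: read raw records from a (hypothetical) CSV export, e.g. from a CRM.
raw = io.StringIO("customer,amount\nAlice,120\nBob,\nCarol,95\n")
rows = list(csv.DictReader(raw))

# Transform: clean the data -- drop rows with missing amounts, cast types.
clean = [{"customer": r["customer"], "amount": float(r["amount"])}
         for r in rows if r["amount"]]

# Load: collate into the structure the analysis step expects.
amounts = [r["amount"] for r in clean]

# Analyze: a trivial stand-in for a model -- summary statistics that could
# inform a decision or a report.
print(len(clean), statistics.mean(amounts))
```

In a real project each stage would target actual databases and a proper mining tool, but the shape of the pipeline is the same.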
Classification: the process of grouping different data under different classes, where
the main aim is to build a better connection among data sets.
For example: vegetables, fruits, and rice could all be placed under the label of food.
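As a toy illustration of classification, here is a minimal nearest-centroid classifier in pure Python. The training data and features (imagined sweetness and water-content measurements) are entirely made up for the example; real classifiers are usually trained with a library rather than written by hand.

```python
import math

# Labeled training data: (feature vector, class). The features are
# hypothetical measurements, e.g. (sweetness, water content).
train = [((8.0, 0.9), "fruit"), ((7.5, 0.8), "fruit"),
         ((2.0, 0.9), "vegetable"), ((1.5, 0.95), "vegetable")]

def centroids(data):
    # Average the feature vectors of each class.
    sums = {}
    for x, label in data:
        s, n = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((s[0] + x[0], s[1] + x[1]), n + 1)
    return {label: (s[0] / n, s[1] / n) for label, (s, n) in sums.items()}

def classify(x, cents):
    # Assign x to the class whose centroid is nearest (Euclidean distance).
    return min(cents, key=lambda c: math.dist(x, cents[c]))

cents = centroids(train)
print(classify((7.0, 0.85), cents))  # a sweet item is classified as "fruit"
```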
Clustering: a process similar to classification in that it groups data, but it
focuses instead on the similarities and differences among the items.
For example: if certain individuals watch horror movies the most, horror films could be
marketed to that group.
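Unlike classification, clustering has no predefined labels; the groups emerge from the data itself. A minimal 1-D k-means sketch shows the idea, using made-up horror-movie watch counts per user and hypothetical starting centers:

```python
# Hypothetical horror-movie watch counts per user.
watch_counts = [1, 2, 2, 9, 10, 11]

def kmeans_1d(data, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = {c: [] for c in centers}
        for x in data:
            nearest = min(centers, key=lambda c: abs(x - c))
            clusters[nearest].append(x)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(pts) / len(pts) for pts in clusters.values() if pts]
    return sorted(centers)

# Two clusters emerge: casual viewers (around 1-2) and heavy viewers
# (around 9-11) -- the latter being the group to market horror films to.
print(kmeans_1d(watch_counts, centers=[0.0, 5.0]))
```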
Prediction: the process of using previous and current data to predict future actions. It
is sometimes also combined with other techniques, such as classification.
For example: relying on customers' previous data history in order to predict future total
income.
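A simple way to predict from historical data is an ordinary least-squares line fit. The income figures below are invented illustration data; real forecasting would use a proper model and far more history.

```python
# Hypothetical income per period; fit y = a*x + b and forecast period 5.
periods = [1, 2, 3, 4]
income = [100.0, 110.0, 121.0, 130.0]

n = len(periods)
mx = sum(periods) / n
my = sum(income) / n

# Ordinary least-squares slope and intercept.
a = sum((x - mx) * (y - my) for x, y in zip(periods, income)) / \
    sum((x - mx) ** 2 for x in periods)
b = my - a * mx

print(round(a * 5 + b, 1))  # forecast income for the next period
```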
Association rule: a technique that shows the correlation between two or more items by
identifying hidden patterns within a data set.
For example: it helps identify items that are purchased together in a
supermarket. It may determine that if a customer buys a bag of chips, there is a 50%
chance he will also buy a can of soda, and that chips contribute to 1% of the products
being sold; using this technique, it becomes easier to determine the
behavior of the customer.
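The two numbers in the example are the standard association-rule metrics: support (how often the items appear together) and confidence (how likely soda is given chips). They can be computed directly from a set of baskets; the baskets below are hypothetical:

```python
# Hypothetical supermarket baskets, each a set of purchased items.
baskets = [
    {"chips", "soda"},
    {"chips"},
    {"bread", "milk"},
    {"chips", "soda", "bread"},
]

n = len(baskets)
chips = sum(1 for b in baskets if "chips" in b)
both = sum(1 for b in baskets if {"chips", "soda"} <= b)

support = both / n         # how often chips and soda are bought together
confidence = both / chips  # P(buys soda | bought chips)
print(support, confidence)
```

Full association-rule miners (e.g. the Apriori algorithm) search all item combinations for rules whose support and confidence exceed chosen thresholds.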
Pattern tracking: This technique checks for regularities and tendencies in the data set
over a period of time, based on previous data history.
For example: the sales of a company increase during Christmas or summer every
year.
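A bare-bones version of pattern tracking is to average a measurement over each recurring period and see which period stands out. The (month, sales) history below is invented to mimic the Christmas example:

```python
from collections import defaultdict

# Hypothetical (month, sales) history across two years.
history = [(1, 90), (7, 150), (12, 200), (1, 95), (7, 160), (12, 210)]

# Group sales by calendar month, then average each month across years.
by_month = defaultdict(list)
for month, sales in history:
    by_month[month].append(sales)

averages = {m: sum(v) / len(v) for m, v in by_month.items()}
peak = max(averages, key=averages.get)
print(peak)  # December (month 12) shows the recurring yearly spike
```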
Outlier analysis: a technique used to determine which data items do not conform to an expected
pattern or behavior. It is also referred to as outlier mining, and it is widely used in
fraud detection models.
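One common outlier criterion (by no means the only one a fraud model might use) is the z-score test: flag values more than a few standard deviations from the mean. The transaction amounts below are made up:

```python
import statistics

# Hypothetical transaction amounts; one is suspiciously large.
amounts = [20, 22, 19, 21, 23, 20, 500]

mean = statistics.mean(amounts)
sd = statistics.pstdev(amounts)  # population standard deviation

# Flag anything more than 2 standard deviations from the mean.
outliers = [x for x in amounts if abs(x - mean) > 2 * sd]
print(outliers)
```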
Hardware:
When searching for data mining hardware, or a device in general, there are several key factors to
look at in order to pick what suits our demands:
Performance (hash rate) – This relates to how many hashes per second the miner can
compute, i.e. how fast it performs the calculations, and thus whether it is pipelined,
multi-threaded, and so on. The miner has to be able to cope with a multitude of
calculation types. The more hashes, the higher the cost, which is why both performance
and efficiency are crucial.
Efficiency – It is important to emphasize that mining is proof of work, with electricity
consumption doing the work. Indeed, the faster and tougher the calculations, the more
electricity is consumed, which is why miners use a large amount of electricity; you want
hardware that converts the largest share of that electricity into useful work while
performing the data tasks.
Price – Inexpensive mining hardware will mine less data, which is why efficiency and
electricity usage are important. The fastest and most efficient mining hardware costs
more.
Power – We want the best performance while keeping power consumption low and battery life long.
When we get the most out of all four of these factors, we will have the best
hardware available.
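To make the hash-rate factor concrete, here is a rough benchmark of how many SHA-256 hashes per second the current machine can compute. It is only a sketch: real miners use dedicated hardware and different workloads, and the "block header" bytes are a placeholder.

```python
import hashlib
import time

def hash_rate(duration=0.5):
    """Count SHA-256 hashes computed in `duration` seconds."""
    count, data = 0, b"block header"  # placeholder input
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        # Vary the input each round, like a nonce, so no hash is repeated.
        hashlib.sha256(data + count.to_bytes(8, "big")).digest()
        count += 1
    return count / duration

print(f"{hash_rate():.0f} hashes/second")
```

Comparing this number across machines is exactly the performance factor above, and dividing it by power draw in watts gives the efficiency factor.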
Central Processing Unit: The CPU is the primary component of a computer that acts as its
"control center." Also referred to as the "central" or "main" processor, it is a complex set of
electronic circuitry that runs the machine's operating system and applications. Choosing
the best mining CPU for your needs can be difficult, especially if you are unfamiliar with the most
recent offerings. The CPU market is constantly evolving, with newer, more powerful
alternatives frequently shifting prices.
● For example, the AMD Ryzen 7 3700X frequently exceeds its base clock speed, particularly in light-to-
medium workloads.
Graphics Processing Unit: A GPU server is a machine built for working with certain kinds of data. It has a
modern CPU, fast RAM, and a high-performance GPU, all connected by hardware logic and data buses. Why is
such a server called a GPU server? Because it usually has a powerful GPU installed. The advantages of a GPU
include a large amount of on-board memory with high bandwidth, plus high-speed CUDA cores and tensor cores.
Basically, the GPU in a GPU server is a computer within a computer: it can perform specific calculations at very high speed.
NVIDIA is a major vendor that offers its own solutions for mining tasks, such as the RTX 3090, RTX
3080, and RTX 3070. These cards share the Ampere GPU architecture: a large
number of CUDA computing units and tensor cores, high component frequencies, and large
amounts of high-performance, high-bandwidth memory. The RTX 3090 has 24 GB of GDDR6X,
10496 CUDA cores, and 328 tensor cores. The RTX 3080 is equipped with 10 GB of
GDDR6X, 8704 CUDA cores, and 272 tensor cores. The RTX 3070 includes 8 GB of GDDR6, 5888
CUDA cores, and 184 tensor cores. Today, NVIDIA solutions are the best available on the market.
Data mining Software:
Data mining software is an analytical tool that allows the user to view data from different angles
depending on the algorithm or code being used, where each has its own purpose. Such
software aims to find, extract, and then distribute information.
Orange Data Mining: Orange is an open-source data visualization, machine learning,
and data processing toolkit with an interactive analysis and visualization interface.
Data mining in Orange is done using Python scripting and visual programming.
R software environment: R is a free software environment for statistical computing and
graphics. It works on UNIX platforms, macOS, and Windows, and provides a suite of software
facilities for calculation, graphical display, and data manipulation.
Weka Data Mining: Weka is a collection of machine learning algorithms for performing data
mining tasks. The algorithms can either be called from Java code or applied directly to a
dataset. Written in Java, it offers features such as preprocessing, clustering, regression,
classification, visualization, and attribute selection, among other techniques.
Anaconda: Anaconda is an open data science platform and a high-performance distribution of
Python and R. It includes Python, R, and Scala packages for data mining, statistics, deep
learning, simulation and optimization, natural language processing, and image analysis.
DataMelt: DataMelt is software for statistics, numeric computation, scientific visualization,
and big data analysis. It is a computational platform that can be used with several
programming languages.
GNU Octave: Octave is a high-level language built for numerical computations. It
works through a command-line interface and allows users to solve linear and nonlinear
problems numerically using a language largely compatible with MATLAB. It offers
visualization tools and runs on Windows, macOS, GNU/Linux, and BSD.
Conclusion:
To sum up, data mining depends on a number of key factors that together lead to an
effective mining process. The miner's hardware should have the newest features and
technologies in order to achieve the best performance and handle both numeric and
categorical data. Research in data mining will always bring new methods and technologies
to solve ever more complex calculations, so keeping up with the newest technology will
improve every aspect of performance; the evolution of such technology has no limits. In
addition, data mining is nowadays being used as one way to fight the economic crisis in
Lebanon, whether through complex calculations to fight inflation and fraud, or by
individuals who use it as a mining tool in the crypto world.
In Lebanon, amid the economic crisis and the collapse of the Lebanese pound, data
mining could serve as a cure for many of these problems. If even 15% of computer science
and engineering students invested in data mining technology, from Bitcoin to several other
ventures, the outcome of this project could be huge on both a personal and a national level,
due to the large inflow of USD that this work could generate.