Data Mining Project


Notre Dame University - Louaise

North Lebanon Campus – NLC

CSC 312
Computer Architecture

Assignment:
Project Report

Submitted to:
Dr. Ruwayda Takchi

Submitted on:
10/7/2022

Submitted by:
JoeElie el Fadl (EE): 20197074
Marcel Samrout (CS): 20217156
Live Link:
https://youtu.be/Xd42RPLbfhA

Data mining draws on a wide range of techniques; the main objective of this report is to focus on the most recent and most important of them.

What is data mining?


Data mining is an analytical process that sifts through large data sets to identify meaningful trends and
relationships in raw data, typically in order to predict future data.

Data mining tools comb through large batches of data with a broad range of techniques to discover data
structures such as anomalies, patterns, journeys or correlations.

Data mining has been around since about 1990. The data mining we know and use today combines three disciplines:

 Statistics = the numerical study of data relationships.
 Artificial intelligence = human-like intelligence displayed by software or machines.
 Machine learning = the ability to learn automatically from data with minimal human assistance.

These three elements have helped us move beyond the tedious processes of the past and on to simpler and better
automation for today's complex data sets.

In fact, the more complex and varied these data sets are, the more relevant and accurate the resulting insights and
predictions can be.

Why is data mining important?

Data mining is a crucial component of successful analytics initiatives in organizations. The information it generates
can be used in business intelligence (BI) and advanced analytics applications that involve analysis of historical data,
as well as real-time analytics applications that examine streaming data as it's created or collected.

Effective data mining aids in various aspects of planning business strategies and managing operations. That
includes customer-facing functions such as marketing, advertising, sales and customer support, plus
manufacturing, supply chain management, finance and HR. Data mining supports fraud detection, risk
management, cybersecurity planning and many other critical business use cases. It also plays an important role in
healthcare, government, scientific research, mathematics, sports and more.

Data Mining Steps:

What are the steps in data mining?

The overall process of data mining generally consists of six steps:

 Outlining your goals: it is important to understand the objective first. This lets us set accurate
project parameters, including the time frame, the scope of the data and, most importantly, the primary
objective and how to achieve it.
 Understanding your data sources: with deeper knowledge of your project parameters, you will be able to
better understand which platforms and databases are needed to solve the problem, whether that is a
CRM system or an Excel spreadsheet.
 Preparing your data: in this step we use the ETL process, which stands for extract, transform and load.
This prepares the data by ensuring it is collected from the selected sources, cleaned and then collated.

 Analyzing data: at this stage, the organized data is fed into an advanced application, and different machine
learning algorithms get to work identifying relationships and patterns that can help inform decisions
and forecast future trends. This application organizes your elements of data (also known as data points) and
standardizes how they relate to one another.

 Reviewing the results: here you will be able to determine whether, and how well, the results and insights
delivered by the model can help confirm your predictions, answer your questions and achieve
your objective.

 Deployment or implementation: upon completion of the data mining project, the results should be made
available to the decision makers in a report. They can then choose how they would like to use the
information to achieve the objective.
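
The data preparation step above can be illustrated with a minimal ETL sketch. The two sources, the field names (`customer`, `amount`) and the sample rows are assumptions made up for this illustration; they are not taken from the project itself.

```python
# Hypothetical raw records from two sources (a CRM export and a spreadsheet).
CRM_ROWS = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "bob", "amount": "80"},
    {"customer": "Carol", "amount": ""},       # missing amount: dropped while cleaning
]
SHEET_ROWS = [
    {"customer": "ALICE", "amount": "30"},
    {"customer": "Bob ", "amount": "n/a"},     # unparseable amount: dropped while cleaning
]


def extract():
    """Extract: gather the raw rows from every selected source."""
    return CRM_ROWS + SHEET_ROWS


def transform(rows):
    """Transform: normalize names, parse amounts, drop unusable rows."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().lower()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if name:
            cleaned.append({"customer": name, "amount": amount})
    return cleaned


def load(rows):
    """Load: collate the cleaned rows into one table keyed by customer."""
    table = {}
    for row in rows:
        table[row["customer"]] = table.get(row["customer"], 0.0) + row["amount"]
    return table


totals = load(transform(extract()))
print(totals)
```

In a real project the extract step would read from the actual CRM or spreadsheet files, and the load step would write to a database or data warehouse rather than a dictionary.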

Data mining techniques:

 Classification: the process of assigning data items to predefined classes, with the
aim of making the relationships among data sets clearer.
For example: vegetables, fruits, and rice could all be placed under the class "food".

 Clustering: a process similar to classification in that it groups data, but the groups
are formed from the similarities and differences among the items themselves.
For example: individuals who mostly watch horror movies could be grouped together and
marketed to as a group.

 Prediction: the process of using previous and current data to predict future actions. It
is sometimes combined with other techniques such as classification.
For example: relying on customers' purchase history to predict total future
income.

 Association rule: a technique that shows the correlation between two or more items by
identifying hidden patterns within a data set.
For example: it helps to identify the items that are purchased together in a
supermarket. It might determine that if a customer buys a bag of chips there is a 50%
chance he will also buy a can of soda, and that chips account for 1% of the products
sold. With this technique it becomes easier to determine the
behavior of the customer.

 Pattern tracking: this technique checks for regularities and tendencies in the data set
over a period of time, based on previous data history.
For example: the sales of a company increase during Christmas or summer every
year.

 Outlier analysis: a technique used to determine which data items do not comply with the expected
pattern or data behavior. It is also referred to as outlier mining. It is commonly used in
fraud detection models.
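
As a rough illustration of the association rule technique, the sketch below computes two standard measures for the chips and soda example: support (the share of baskets that contain an itemset) and confidence (how often the consequent appears in baskets that contain the antecedent). The six baskets are invented for this example and were chosen so that the confidence comes out at the 50% figure mentioned above.

```python
# Hypothetical supermarket baskets, one set of items per transaction.
TRANSACTIONS = [
    {"chips", "soda"},
    {"chips", "bread"},
    {"soda", "milk"},
    {"chips", "soda", "milk"},
    {"bread", "milk"},
    {"chips"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


chips_support = support({"chips"}, TRANSACTIONS)
chips_to_soda = confidence({"chips"}, {"soda"}, TRANSACTIONS)
print(f"support(chips) = {chips_support:.2f}")
print(f"confidence(chips -> soda) = {chips_to_soda:.2f}")
```

Algorithms such as Apriori build on these two measures to search for all rules above chosen support and confidence thresholds; this sketch only evaluates a single rule.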

Data Mining Technology


Data mining tools are available from many vendors. Vendors that offer tools for data mining
include Alteryx, AWS, Databricks, Dataiku, DataRobot, Google, H2O.ai, IBM, Knime, Microsoft,
Oracle, RapidMiner, SAP, SAS Institute and Tibco Software, among others.

Hardware:

When searching for mining hardware, or for a device in general, there are several key factors to
look at in order to pick what suits our demands:

 Performance (hash rate): how many hashes per second the miner can compute, that is,
how fast it performs the calculations. This depends on features such as pipelining and
multi-threading. The miner has to be able to cope with a multitude of
calculation types. The more hashes it computes, the higher the cost, which is why both performance
and efficiency are crucial.
 Efficiency: mining relies on proof of work, and electricity is what does the work.
The faster and tougher the calculations are, the more electricity is consumed, which is why
miners use a large amount of electricity. You want a miner that gets the most
work out of each unit of electricity.
 Price: inexpensive mining hardware will mine less, which is why efficiency and
electricity usage are important. The fastest and most efficient mining hardware is going
to cost more.
 Power: we want the best performance while keeping power draw low, so that a battery-powered device lasts longer.

The best hardware is therefore the one that strikes the best balance across these four
factors.
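
The trade-off between these factors can be sketched with a small comparison. The three miners and all of their figures below are hypothetical and only serve to show how hash rate, efficiency (hashes per joule) and price can point to different "best" devices.

```python
from dataclasses import dataclass


@dataclass
class Miner:
    name: str
    hash_rate: float    # hashes per second (performance)
    power_watts: float  # electrical draw in watts (joules per second)
    price_usd: float    # purchase price

    @property
    def hashes_per_joule(self) -> float:
        """Efficiency: work done per unit of electrical energy."""
        return self.hash_rate / self.power_watts

    @property
    def hash_rate_per_dollar(self) -> float:
        """Value: performance obtained per dollar of purchase price."""
        return self.hash_rate / self.price_usd


# Hypothetical devices, not real products or measured figures.
CANDIDATES = [
    Miner("Miner A", 100e6, 200.0, 500.0),
    Miner("Miner B", 150e6, 400.0, 900.0),
    Miner("Miner C", 60e6, 100.0, 350.0),
]

fastest = max(CANDIDATES, key=lambda m: m.hash_rate)
most_efficient = max(CANDIDATES, key=lambda m: m.hashes_per_joule)
best_value = max(CANDIDATES, key=lambda m: m.hash_rate_per_dollar)

print("Fastest:", fastest.name)
print("Most efficient:", most_efficient.name)
print("Best value:", best_value.name)
```

With these made-up numbers each criterion selects a different device, which is why the factors have to be weighed together rather than one at a time.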

Central Processing Unit: The CPU is the primary component of a computer
and acts as its “control center.” Also referred to as the “central” or “main” processor, it is
a complex set of electronic circuitry that runs the machine's operating system and apps. Choosing
the best mining CPU for your needs can be difficult, especially if you are unfamiliar with the most
recent offerings, because the CPU market is constantly
evolving, with newer, more powerful alternatives frequently changing prices.

● AMD is a major vendor that offers its own solutions for mining tasks, such as the Ryzen 7 3700X.

● AMD’s third-generation Ryzen CPUs have faster clock rates and more cores than the first- and
second-generation components, and the Ryzen 7 3700X is currently one of the finest CPUs for gaming.

● The Zen 2 CPUs are so capable that AMD hardly needs the faster options.

● The 3700X frequently exceeds its base clock speed, particularly in light to
medium workloads.

Graphics Processing Unit: A GPU server is a machine built for processing specific kinds of data. It has a modern CPU, fast
RAM and a high-performance GPU, all connected by hardware logic and data buses. Why is such a
server called a GPU server? Because it usually has a powerful GPU installed. The advantages of a GPU include a
large amount of on-board, high-bandwidth memory and high-speed CUDA cores and tensor cores. In effect, the
GPU in a GPU server is a computer within a computer that can perform specific calculations at very high speed.

NVIDIA is a major vendor that offers its own solutions for mining tasks, such as the RTX 3090, RTX
3080, and RTX 3070. These solutions share the Ampere GPU architecture, a large
number of CUDA cores and tensor cores, high component frequencies and large
amounts of high-bandwidth memory. The RTX 3090 GPU has 24 GB of
GDDR6X, 10496 CUDA cores and 328 tensor cores. The RTX 3080 GPU is equipped with 10 GB of
GDDR6X, 8704 CUDA cores and 272 tensor cores. The RTX 3070 GPU includes 8 GB of GDDR6, 5888
CUDA cores and 184 tensor cores. At the time of writing, NVIDIA's solutions are among the best available on the market.

Data Mining Software:

Data mining software is an analytical tool that allows the user to view data from different angles
depending on the algorithm or code being used, each of which has its own purpose. Such
software aims to find, extract, and then distribute information.

Some of its features:


1. Ease of use: it provides a graphical user interface (GUI) that allows
users to analyze data efficiently.
2. Preprocessing: it performs data cleaning, data transformation, data
normalization, and data integration.
3. Scalable processing: it scales with the size of the data and the number of users.
4. High performance: it generates results quickly.
5. Anomaly detection: it detects errors and unusual behaviors.
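
Two of these features, data normalization and anomaly detection, can be sketched in a few lines using only the Python standard library. The sample readings and the z-score threshold of 2.0 are assumptions chosen for illustration; real tools offer many more methods than the two shown here.

```python
from statistics import mean, pstdev

# Hypothetical readings; the last value is deliberately unusual.
READINGS = [10, 12, 11, 13, 12, 95]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def find_anomalies(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the (assumed) threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


normalized = min_max_normalize(READINGS)
anomalies = find_anomalies(READINGS)
print(normalized)
print(anomalies)
```

The z-score approach is only one simple way to flag unusual values, and it is sensitive to the very outliers it is trying to detect; it is used here because it is short and easy to follow.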

Different Data Mining software:

 Orange Data Mining: Orange is an open-source data visualization,
machine learning, and data mining toolkit.
Data mining in it is done through Python scripting and visual programming.
 R software environment: R is a free software environment for statistical
computing and graphics. It works on UNIX platforms, macOS and Windows. It is a set of software
facilities for calculation, graphical display, and data manipulation.
 Weka Data Mining: a collection of machine learning algorithms for data
mining tasks. The algorithms can be called from Java code or applied
directly to a dataset. It is written in Java and offers features such as
preprocessing, clustering, regression, classification, visualization, and
attribute selection, along with some other techniques.
 Anaconda: an open data science platform and a high-performance distribution of R
and Python. It includes R, Scala, and Python for data mining, statistics, deep learning,
simulation and optimization, natural language processing, and image analysis.
 DataMelt: software for statistics, numeric computation, scientific visualization,
and analysis of big data. It is a computational platform that can be used with different programming
languages.
 GNU Octave: a high-level language built for numerical computations. It
works through a command-line interface and allows users to solve linear and nonlinear
problems numerically using a language compatible with MATLAB. It also offers
visualization tools. It runs on Windows, macOS, GNU/Linux, and BSD.

 MATLAB: a multi-paradigm numerical computing environment for processing mathematical
information. It is closed-source software that facilitates matrix functions, algorithmic
implementation and statistical modeling of data. MATLAB is widely used in several
scientific disciplines. Using the MATLAB graphics library, you can create powerful
visualizations. MATLAB is also used in image and signal processing.

 SAS: one of the data science tools specifically designed for
statistical operations. SAS is closed-source, proprietary software that is used
by large organizations to analyze data. It uses the Base SAS programming
language for statistical modeling. It is widely used by professionals and
companies that rely on commercial software. SAS offers
numerous statistical libraries and tools that data scientists can use for
modeling and organizing their data.

 Apache Spark: a powerful analytics engine and one of the most used data science tools. Spark is
specifically designed to handle batch processing and stream processing.
It comes with many APIs that let data scientists make repeated access to data for machine
learning, SQL storage, and so on. It is an improvement
over Hadoop and can perform up to 100 times faster than
MapReduce. Spark also has many machine learning APIs that can
help data scientists make powerful predictions with the given data.

Conclusion:
To sum up, data mining depends on a few key factors that together make for an
effective mining process. The miner's hardware should include up-to-date
features and technologies in order to achieve the best performance and handle both numeric and
categorical data. Research in data mining will keep introducing new
methods and technologies to solve new and more complex calculations, and
keeping up with them will improve every
aspect of performance; the evolution of this technology therefore has no clear limit. In addition,
data mining is now being used as a way to fight the economic crisis in Lebanon, whether
for complex calculations that help fight inflation and fraud, or by individuals who
use it as a mining tool in the crypto world.

In Lebanon, with the ongoing economic crisis and the collapse of the Lebanese pound, data
mining could help with many of these problems. If only 15% of computer science
and engineering students invested in data mining technology, from bitcoin to several other applications,
the outcome could be huge at both the personal and the national level, because of the large
flow of USD that this work could bring in.

Live video : https://youtu.be/Xd42RPLbfhA

Some content is covered in the video and not here.
