BIG Data Analysis Assign - Final
Department of ICT
Course title: BIG Data Analysis (IT 562)
1. Volume: refers to the quantity of data gathered by a company, which must then be
used to gain important knowledge. Enterprises are overflowing with ever-growing
data of all types, easily accumulating terabytes or even petabytes of information
(e.g. turning 12 terabytes of Tweets per day into improved product sentiment
analysis, or converting 350 billion annual meter readings to better predict power
consumption).
2. Velocity: refers to the speed at which Big Data must be processed. Some activities are
very important and need immediate responses, which is why fast processing
maximizes efficiency. For time-sensitive processes such as fraud detection, Big Data
flows must be analysed and used as they stream into the organization in order to
maximize the value of the information (e.g. scrutinizing 5 million trade events created
each day to identify potential fraud).
Trends
Big data analytics started with a modest shift from traditional analytics towards batch-processing
computations built on the MapReduce paradigm.
It then gradually moved to a higher level in which stream processing is handled by platforms
such as Apache Spark. The trend continued towards near-real-time processing and is currently
progressing to real-time analytics.
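To make the shift from batch processing towards streaming concrete, here is a minimal Spark Structured Streaming word-count sketch in Java. It is an illustrative sketch only: the socket source on localhost:9999 and the local[*] master are assumptions chosen to keep the example self-contained, and a production job would typically read from a durable stream instead.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("StreamingWordCount")
                .master("local[*]")          // assumption: run locally for illustration
                .getOrCreate();

        // Read a text stream from a socket (assumed test source on localhost:9999).
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // Split each line into words and keep a running count per word.
        Dataset<Row> counts = lines
                .select(explode(split(col("value"), "\\s+")).as("word"))
                .groupBy("word")
                .count();

        // Print updated counts to the console as new data streams in.
        StreamingQuery query = counts.writeStream()
                .outputMode("complete")
                .format("console")
                .start();
        query.awaitTermination();
    }
}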
I have also been reading about the following trends, reported since 2019 by different sources:
IoT.
Augmented analytics.
The harnessing of dark data.
Cold storage and cloud cost optimization.
Edge computing and analytics.
Data storytelling and visualization.
DataOps.
Technology
BIG DATA is a term used for collections of data sets so large and complex that they are difficult to
process using traditional applications/tools; such data typically exceeds terabytes in size. Because of
the variety of data that it encompasses, big data always brings a number of challenges relating to
its volume and complexity. Below are technologies used to store and analyse Big Data; some
books categorise them into two groups (storage and querying/analysis).
Apache Hadoop
Apache Hadoop is a Java-based, free software framework that can effectively store large
amounts of data across a cluster.
Other data analysis techniques include spatial analysis, predictive modelling, association rule
learning, network analysis, visual analysis and many more. The technologies that
process, manage, and analyse this data form an expansive field of their own, one that
similarly evolves and develops over time.
Currently, I can say that there is no well-known big data analytics company in Ethiopia, but
there are mega companies with millions of customers in Ethiopia, such as Ethiopian
Telecommunication, Ethiopian Airlines, and the Commercial Bank of Ethiopia. I could not find
any information on whether these companies use big data technology or not.
However, there are some international companies that are well known and have a large
number of customers in Ethiopia, such as social media companies (Facebook, Twitter,
etc.) and Google.
Answer:
Prerequisites
Server: To run Apache Hadoop jobs, it is recommended to use dual-core machines or
dual processors.
Memory: There should be 4 GB or 8 GB of RAM per processor, with Error-Correcting Code
(ECC) memory. Without ECC memory, there is a high chance of getting checksum errors.
Storage: For storage, high-capacity SATA drives (around 7200 rpm) should be used in the
Hadoop cluster.
Bandwidth: Around 10 Gigabit Ethernet is good for Hadoop networks.
Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or
later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to
work. Hadoop runs on Unix and on Windows. Linux is the only supported production platform,
but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development.
Windows is only supported as a development platform, and additionally requires Cygwin to run.
During the Cygwin installation process, you should include the openssh package if you plan to
run Hadoop in pseudo-distributed mode.
Hadoop can be run in one of three modes:
Standalone (or local) mode:
There are no daemons running and everything runs in a single JVM. Standalone mode is
suitable for running MapReduce programs during development, since it is easy to test and
debug them.
Pseudo-distributed mode:
The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
Fully distributed mode:
The Hadoop daemons run on a cluster of machines.
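As a small illustration of how these modes differ from a client's point of view, the sketch below (my own example, not part of the assignment) points a Hadoop client at a pseudo-distributed NameNode. In standalone mode fs.defaultFS stays at its file:/// default, so everything runs against the local filesystem; the address hdfs://localhost:9000 used here is the conventional pseudo-distributed setting and is an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HadoopModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standalone mode: fs.defaultFS defaults to file:///, with no daemons needed.
        // Pseudo-distributed mode: core-site.xml normally sets the value below instead.
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed local NameNode
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("Default filesystem in use: " + fs.getUri());
        }
    }
}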
Answer:
Filesystems that manage the storage across a network of machines are called distributed
filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem.
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes
in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, then various
analyses are performed on that dataset over time. Each analysis will involve a large proportion, if
not all, of the dataset, so the time to read
the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware obtainable from multiple
vendors) for which the chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable interruption to the user in
the face of such failure.
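The write-once, read-many-times pattern can be illustrated with Hadoop's Java FileSystem API. The sketch below is my own illustrative example rather than part of the assignment: the file paths and the NameNode address hdfs://localhost:9000 are placeholders chosen purely for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            // Write once: copy the dataset from the local disk into HDFS.
            fs.copyFromLocalFile(new Path("/tmp/readings.csv"),   // assumed local file
                                 new Path("/data/readings.csv")); // assumed HDFS path

            // Read many times: stream the whole file back for one analysis pass.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/readings.csv"))))) {
                long lineCount = 0;
                while (reader.readLine() != null) {
                    lineCount++;
                }
                System.out.println("Lines read from HDFS: " + lineCount);
            }
        }
    }
}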
Answer:
NameNode: The NameNode is the core of HDFS; it manages the metadata, i.e. the information about
which file maps to which block locations and which blocks are stored on which DataNode. In simple
terms, it is the data about the data being stored. The NameNode supports a directory tree-like structure
consisting of all the files present in HDFS on a Hadoop cluster. It uses the following files for the
namespace:
fsimage file - a persistent snapshot (checkpoint) of the filesystem namespace.
edits file - a log of the changes that have been made to the namespace since the last checkpoint.
Checkpoint NameNode: The Checkpoint NameNode has the same directory structure as the NameNode,
and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits
files and merging them in a local directory. The new image produced by the merge is then
uploaded to the NameNode.
There is a similar node, commonly known as the Secondary NameNode, but it does not
support the ‘upload to NameNode’ functionality.
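To make "metadata" concrete, the short sketch below (again my own illustrative example) lists the root of the HDFS namespace; every path, file length, and replication factor printed here is answered from the NameNode's metadata rather than from the DataNodes. The NameNode address is again an assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListNamespace {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf)) {
            // Each FileStatus comes from the NameNode's directory-tree metadata.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.printf("%s  length=%d  replication=%d%n",
                        status.getPath(), status.getLen(), status.getReplication());
            }
        }
    }
}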
Answer:
Let me start with a simple definition.
Commodity hardware means cheap servers.
This doesn’t mean scrap servers. Instead, it indicates servers that are affordable and easy to
obtain, even in large quantities. The modern company needs computing power to process and
store information. Of course, we can get that from a battery of servers. The thing is, any company
needs that. Alessandro Maggio elaborates on this concept with a scenario on
www.ictshore.com/data-center:
Imagine a company in the ’60s, when all the information was on paper. At that time,
companies heavily relied on paper to store information. Yet, no one thought that paper
was a critical part of the business model. Yet, everyone used it. Paper was just there; its
usage was taken for granted. In other words, paper was a commodity.
Fast forward 60 years: things have changed, yet the principles stay the same. At the end of
the day, businesses are hungry for information, just as they have been for the last century.
However, they now rely on modern technologies instead of paper. Now as then, what’s
important is the information, not the way you process it. Thus, the hardware is now the
commodity.
Big Data is another buzzword of modern times. With all this digital information, any company
has a lot of data to process. Big data means big computers to process it, right? Well, not
quite. Thanks to new development paradigms, the industry is moving away from
supercomputers. Rather than having one big system processing everything, we now want to have
many servers, each processing a small chunk of the data. That’s where commodity hardware comes
into the picture.
Now, modern applications prefer parallelism. This means we can have lots of not-so-powerful
servers instead of a single big one. If one server fails, we lose only one tiny part of our
processing power. In the end, we will have better efficiency and more availability. In case
we need more power, we can simply add a server: the application is already parallel, and we don’t
need to rethink our entire solution.
Hadoop is an open source solution by the Apache Foundation that helps us achieve parallelism. It
is one of the leading projects in the Big Data world, and it has driven the industry in its early
stages.
The concept behind Hadoop is simple: we have several servers (the commodity hardware) and
we distribute the load among them. This is possible thanks to Hadoop MapReduce, a special
feature of this solution. Hadoop is installed on all the servers, and it then distributes the data
among them. In the end, each server will have a piece of data, but no server will have everything.
However, the same piece of data will be duplicated on two servers to protect against faults. Now
that each server has its piece of data, we can process the data. When we do that, Hadoop tells each
server to process its own data and give back only the results. That’s parallelism: many servers
running together at the same time toward the same goal.
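The classic word-count program sketches this idea in code: each mapper runs over the block of data its own node stores and emits small (word, 1) pairs, and the reducers give back only the aggregated counts. This is a minimal illustrative sketch (the driver/job configuration is omitted), not a program taken from any of the sources above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Runs in parallel on every node, over the block of input stored locally there.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1) for each word seen
                }
            }
        }
    }

    // Receives all the 1s for a given word and sends back only the total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum)); // only the result leaves the node
        }
    }
}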
Commodity hardware is server hardware you can get at affordable prices, fast and in
large quantities. That’s what modern companies use, even tech giants like Google.
B) How does big data analysis help businesses increase their revenue? Give
examples and name some companies that use Hadoop.
Answer:
Big data analytics is done using advanced software systems. This allows businesses to
reduce the analytics time for speedy decision making. Basically, the modern big data
analytics systems allow for speedy and efficient analytical procedures. This ability to
work faster and achieve agility offers a competitive advantage to businesses.
The use of big data allows businesses to observe various customer-related patterns and
trends. Observing customer behavior is important for building loyalty. Theoretically, the
more data a business collects, the more patterns and trends it can
identify. In the modern business world and the current technology age, a business can
easily collect all the customer data it needs. This means that it is very easy to understand
the modern-day client. Basically, all that is necessary is having a big data analytics
strategy to maximize the data at your disposal. With a proper customer data analytics
mechanism in place, a business will have the capability to derive critical behavioral
insights that it needs to act on so as to retain the customer base.
Understanding customer insights will allow your business to deliver what
the customers want from it. This is the most basic step to attain high customer retention.
Example of a Company that uses Big Data for Customer Acquisition and Retention
o Coca-Cola. In the year 2015, Coca-Cola managed to strengthen its data strategy
by building a digital-led loyalty program.
Big Data Analytics to Solve Advertisers Problem and Offer Marketing Insights
A more targeted and personalized campaign means that businesses can save money and
ensure efficiency, because they target high-potential clients with the right
products. Big data analytics is good for advertisers, since companies can use this data
to understand customers’ purchasing behavior. We also cannot ignore the huge ad fraud
problem: through predictive analytics, it is possible for organizations to define their
target clients. Therefore, businesses can achieve an appropriate and effective reach while avoiding
the huge losses incurred as a result of ad fraud.
Example of a Brand that uses Big Data for Targeted Adverts
o Netflix is a good example of a big brand that uses big data analytics for targeted
advertising. If you are a subscriber, you are familiar with how they send you
suggestions for the next movie you should watch. Basically, this is done using
your past search and watch data, which gives them insights into what
interests each subscriber most.
Big Data Analytics for Risk Management
So far, big data analytics has contributed greatly to the development of risk management
solutions. The tools available allow businesses to quantify and model the risks they
face every day. Considering the increasing availability and diversity of statistics, big data
analytics has huge potential for enhancing the quality of risk management models.
Therefore, a business can achieve smarter risk mitigation strategies and make better
strategic decisions.
Example of a Brand that uses Big Data Analytics for Risk Management
o UOB Bank from Singapore is an example of a brand that uses big data to drive
risk management. As a financial institution, it faces a huge potential for
incurring losses if risk management is not well thought out. UOB Bank recently
tested a risk management system that is based on big data.
Big Data Analytics as a Driver of Innovations and Product Development
Every design process has to begin by establishing what exactly fits the customers.
There are various channels through which an organization can study customer needs.
The business can then identify the best approach to capitalize on those needs based on
big data analytics.
Example of the use of Big Data to Drive Innovations
Modern supply chain systems based on big data enable more complex supplier networks.
These are built on knowledge sharing and high-level collaboration to achieve contextual
intelligence. It is also essential to note that supply chain executives consider big data
analytics a disruptive technology, based on the thinking that it will set a
foundation for change management in organizations.
Example of a Brand that uses Big Data for Supply Chain Efficiency
Question No 6
Given the currently ongoing COVID-19 crisis, how could Information and
Communication Technologies and big data analytics contribute to
solving it?
Answer:
It is a known fact that big data is acting as an asset that helps forecast and understand the impact of
the coronavirus. It is being used by healthcare workers, scientists, epidemiologists, and
policymakers to aggregate and synthesize data on a regular basis.
Here are 10 ways artificial intelligence, data science, and technology are being used to manage
and fight COVID-19.
1. Track and forecast outbreaks
The better we can track the virus, the better we can fight it. By analyzing news reports, social
media platforms, and government documents, AI can learn to detect an outbreak. Tracking
infectious disease risks using AI is exactly the service Canadian startup BlueDot provides. In
fact, BlueDot’s AI warned of the threat several days before the Centers for Disease Control
and Prevention or the World Health Organization issued their public warnings.
2. Help diagnose the virus
Artificial intelligence company Infervision launched a coronavirus AI solution that helps front-
line healthcare workers detect and monitor the disease efficiently. Imaging departments in
healthcare facilities can use it to cope with the increased workload created by the virus.
3. Process healthcare claims
It is not only the clinical operations of healthcare systems that are being taxed, but also the
business and administrative divisions as they deal with the surge of patients. A blockchain
platform offered by Ant Financial helps speed up claims processing and reduces the amount of
face-to-face interaction between patients and hospital staff.
4. Deliver medical supplies by drone
One of the safest and fastest ways to get medical supplies where they need to go during a disease
outbreak is drone delivery. Terra Drone is using its unmanned aerial vehicles to transport
medical samples and quarantine material with minimal risk between Xinchang County’s disease
control centre and the People’s Hospital. Drones are also used to patrol public spaces, track non-
compliance with quarantine mandates, and perform thermal imaging.
5. Robots sterilize, deliver food and supplies and perform other tasks
Robots aren’t susceptible to the virus, so they are being deployed to complete many tasks such as
cleaning and sterilizing and delivering food and medicine to reduce the amount of human-to-
human contact. UVD robots from Blue Ocean Robotics use ultraviolet light to autonomously kill
bacteria and viruses. In China, Pudu Technology deployed its robots that are typically used in the
catering industry to more than 40 hospitals around the country.
6. Develop drugs
Google’s DeepMind division used its latest AI algorithms and its computing power to
understand the proteins that might make up the virus, and published the findings to help others
develop treatments. BenevolentAI uses AI systems to build drugs that can fight the world’s
toughest diseases and is now helping to support the efforts to treat the coronavirus, the first time the
company has focused its product on infectious diseases. Within weeks of the outbreak, it used its
predictive capabilities to propose existing drugs that might be useful.
7. Advanced fabrics offer protection
Companies such as Israeli startup Sonovia hope to arm healthcare systems and others with face
masks made from their anti-pathogen, anti-bacterial fabric, which relies on metal-oxide
nanoparticles.
8. Identify infected or non-compliant individuals
While certainly a controversial use of technology and AI, China’s sophisticated surveillance
system has used facial recognition technology and temperature detection software from SenseTime
to identify people who may have a fever and therefore be more likely to be infected.
9. Chatbots to share information
Tencent operates WeChat, through which people can access free online health consultation
services. Chatbots have also been essential communication tools for service providers in the
travel and tourism industries, keeping travellers updated on the latest travel procedures and
disruptions.
10. Supercomputers working on a cure
The cloud computing resources and supercomputers of several major tech companies such as
Tencent, DiDi, and Huawei are being used by researchers to fast-track the development of a cure
or vaccine for the virus. These systems can run calculations and model solutions much faster
than standard computer processing.