Subject: Port Information Systems and Platforms: Proposed By: Prof Tali


2018

Subject : Port Information Systems and Platforms

Prepared by: Omar Lanjri Riyahi Proposed by: Prof Tali


C.N.E : 1210714573
Contents

Introduction
1. Definition
2. The sources of big data
3. The handling systems of the big data
4. Big Data Use Cases
Conclusion
Webography

Introduction
In today's world, every tiny gadget is a potential data source that adds to an already huge
data bank. Every bit of data generated has potential value, be it enterprise or personal
data, historical or transactional data. This data, generated through large volumes of
customer transactions and social networking sites, is varied, voluminous and produced at a
rapid pace. All of it poses a storage and processing challenge for enterprises. The data
generated by massive web logs, healthcare data sources, point-of-sale systems and satellite
imagery needs to be stored and handled well. At the same time, this huge amount of data
proves to be a very useful knowledge bank if handled carefully. Hence big companies are
investing heavily in researching and harnessing it. Given today's enthusiasm for Big Data,
one can easily regard Big Data technology as the next best thing to learn. The attention it
has received over the past decade is due to the overwhelming need for it in industry. So how
can we define big data? Where does it come from? And how can we handle it and benefit from it?

1. Definition
While the term “big data” is relatively new, the act of gathering and storing large
amounts of information for eventual analysis is ages old. The concept gained momentum in
the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition
of big data as the three Vs:
Volume: Organizations collect data from a variety of sources, including business
transactions, social media and information from sensor or machine-to-machine data. In the
past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased
the burden.
Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely
manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of
data in near-real time.
Variety: Data comes in all types of formats – from structured, numeric data in
traditional databases to unstructured text documents, email, video, audio, stock ticker data and
financial transactions.
When it comes to big data, we can consider two additional dimensions:
Variability: In addition to the increasing velocities and varieties of data, data flows
can be highly inconsistent with periodic peaks. Is something trending in social media? Daily,
seasonal and event-triggered peak data loads can be challenging to manage. Even more so
with unstructured data.
Complexity: Today's data comes from multiple sources, which makes it difficult to
link, match, cleanse and transform data across systems. However, it’s necessary to connect
and correlate relationships, hierarchies and multiple data linkages or your data can quickly
spiral out of control.
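
As a rough illustration of velocity and variability, the sketch below counts how many events arrive within a sliding time window and reports the moments at which the load exceeds a threshold. The class and function names, the window length and the threshold are all assumptions made for this example; a real streaming platform would handle this at far larger scale.

```python
from collections import deque


class SlidingWindowCounter:
    """Count the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the number of events in the window."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps)


def find_peaks(timestamps, window_seconds, threshold):
    """Return the timestamps at which the windowed count exceeds the threshold.

    Timestamps are assumed to arrive in increasing order.
    """
    counter = SlidingWindowCounter(window_seconds)
    return [t for t in timestamps if counter.add(t) > threshold]


# Hypothetical arrival times in seconds: a quiet period, then a short burst.
arrivals = [0, 10, 20, 21, 22, 23, 40]
peaks = find_peaks(arrivals, window_seconds=5, threshold=2)
```

Here the burst between seconds 20 and 23 is the only stretch in which more than two events fall inside one five-second window.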

2. The sources of big data
Some of the many sources of big data are:

1. Sensors/meters and activity records from electronic devices: This kind of
information is produced in real time. The number and periodicity of the observations
vary: sometimes they depend on a lapse of time, sometimes on the occurrence of
some event (for example, a car passing through the field of view of a camera), and
sometimes on manual manipulation (which, strictly speaking, is also the occurrence
of an event). The quality of this kind of source depends mostly on the capacity of
the sensor to take accurate measurements in the way that is expected.

2. Social interactions: This is data produced by human interactions through a network,
such as the Internet. The most common is the data produced in social networks. This
kind of data has both qualitative and quantitative aspects that are of interest to
measure. Quantitative aspects are easier to measure than qualitative ones. The former
involve counting the number of observations grouped by geographical or temporal
characteristics, while the quality of the latter relies mostly on the accuracy of
the algorithms applied to extract the meaning of the contents, which are commonly
found as unstructured text written in natural language. Examples of analyses made
from this data are sentiment analysis, trending topic analysis, etc.;

3. Business transactions: Data produced as a result of business activities can be
recorded in structured or unstructured databases. When it is recorded in structured
databases, the most common problem in analyzing that information and getting
statistical indicators is the large volume of information and the rate at which it is
produced, because sometimes this data is produced at a very fast pace: thousands of
records can be produced in a second when big companies like supermarket chains are
recording their sales. But this kind of data is not always produced in formats that can
be stored directly in relational databases. An electronic invoice is an example of this
kind of source: it has more or less a structure, but if we need to put the data that it
contains into a relational database, we will need to apply some process to distribute
that data over different tables (in order to normalize the data in accordance with
relational database theory), and it may not be in plain text (it could be a picture, a
PDF, an Excel record, etc.). One problem we could have here is that this process takes
time and, as previously said, the data may be produced too fast, so we would need
different strategies to use the data: processing it as it is without putting it in a
relational database, discarding some observations (by which criteria?), using parallel
processing, etc. The quality of information produced from business transactions is
tightly related to the capacity to get representative observations and to process them;

4. Electronic files: These are unstructured documents, statically or dynamically
produced, which are stored or published as electronic files, like Internet pages, videos,
audio, PDF files, etc. They can have contents of special interest that are difficult to
extract; different techniques could be used, like text mining, pattern recognition, and
so on. The quality of our measurements will rely mostly on the capacity to extract and
correctly interpret all the representative information from those documents;

5. Broadcasts: This mainly refers to video and audio produced in real time. Getting
statistical data from the contents of this kind of electronic data is, for now, too
complex and requires great computational and communications power. Once the problems
of converting “digital-analog” contents to “digital-data” contents are solved, we will
face complications in processing it similar to the ones we find with social interactions.
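
The invoice case described under business transactions can be sketched as follows. This is a minimal illustration, assuming the invoice has already been parsed into a nested Python dictionary; the field names, the three-table layout and the sample values are hypothetical, and Python's built-in sqlite3 module stands in for whatever relational database would really be used.

```python
import sqlite3

# Hypothetical semi-structured invoice, e.g. parsed from an electronic file.
invoice = {
    "invoice_id": "INV-001",
    "customer": {"name": "Example Shop", "city": "Tangier"},
    "lines": [
        {"product": "Pallet", "quantity": 10, "unit_price": 12.5},
        {"product": "Strap", "quantity": 4, "unit_price": 3.0},
    ],
}


def store_invoice(conn, doc):
    """Distribute one nested invoice over three normalized tables."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS customer (
            id INTEGER PRIMARY KEY, name TEXT UNIQUE, city TEXT);
        CREATE TABLE IF NOT EXISTS invoice (
            id TEXT PRIMARY KEY, customer_id INTEGER REFERENCES customer(id));
        CREATE TABLE IF NOT EXISTS invoice_line (
            invoice_id TEXT REFERENCES invoice(id),
            product TEXT, quantity INTEGER, unit_price REAL);
        """
    )
    cust = doc["customer"]
    # The customer is stored once, however many invoices refer to it.
    conn.execute(
        "INSERT OR IGNORE INTO customer (name, city) VALUES (?, ?)",
        (cust["name"], cust["city"]),
    )
    customer_id = conn.execute(
        "SELECT id FROM customer WHERE name = ?", (cust["name"],)
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO invoice (id, customer_id) VALUES (?, ?)",
        (doc["invoice_id"], customer_id),
    )
    conn.executemany(
        "INSERT INTO invoice_line VALUES (?, ?, ?, ?)",
        [
            (doc["invoice_id"], line["product"], line["quantity"], line["unit_price"])
            for line in doc["lines"]
        ],
    )
    conn.commit()


def invoice_total(conn, invoice_id):
    """Recompute the invoice total from the normalized rows."""
    row = conn.execute(
        "SELECT SUM(quantity * unit_price) FROM invoice_line WHERE invoice_id = ?",
        (invoice_id,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
store_invoice(conn, invoice)
total = invoice_total(conn, "INV-001")
```

Even in this small form, every invoice costs several statements, which hints at why the text suggests alternative strategies when invoices arrive faster than they can be normalized.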

3. The handling systems of the big data
Big data is a term used for a collection of data sets so large and complex that it is
difficult to process using traditional applications and tools. It typically refers to data
exceeding terabytes in size. Because of the variety of data that it encompasses, big data
always brings a number of challenges relating to its volume and complexity. A recent survey
says that 80% of the data created in the world is unstructured. One challenge is how this
unstructured data can be structured before we attempt to understand and capture the most
important data. Another challenge is how we can store it. Here are the top tools used to
store and analyze big data. We can group them into two categories (storage and
querying/analysis):

1. Apache Hadoop
The Apache Hadoop software is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the library
itself is designed to detect and handle failures at the application layer, so delivering a highly-
available service on top of a cluster of computers, each of which may be prone to failures.
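
One of the simple programming models associated with Hadoop is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) for a word count inside a single Python process. It is only meant to show the shape of the model: it is not Hadoop's API, the function names are chosen for this example, and nothing here is distributed across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values emitted for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big clusters"])
```

Because each call to map_phase looks at one line only and each call to reduce_phase at one key only, the work could in principle be split across many machines, which is the property the framework relies on.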

2. Microsoft HDInsight
It is a Big Data solution from Microsoft, powered by Apache Hadoop, which is
available as a service in the cloud. HDInsight uses Windows Azure Blob storage as the default
file system. It also provides high availability at low cost.

3. NoSQL
While traditional SQL can be used effectively to handle large amounts of structured
data, we need NoSQL (Not Only SQL) to handle unstructured data. NoSQL databases store
unstructured data with no particular schema. Each row can have its own set of column values.
NoSQL gives better performance in storing massive amounts of data. There are many
open-source NoSQL databases available to analyze big data.
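
The idea that each row can have its own set of columns can be pictured with plain Python dictionaries. This is a toy in-memory stand-in, not the interface of any particular NoSQL product; the records, field names and helper functions are invented for the example.

```python
# Hypothetical in-memory "collection": each record is a dict with its own fields.
collection = [
    {"id": 1, "type": "tweet", "text": "Port congestion again", "likes": 12},
    {"id": 2, "type": "sensor", "crane": "C-07", "temperature_c": 71.5},
    {"id": 3, "type": "tweet", "text": "New terminal opened"},
]


def find(records, **criteria):
    """Return the records whose fields match all the given criteria.

    A record that lacks one of the fields simply does not match.
    """
    return [
        record
        for record in records
        if all(
            field in record and record[field] == value
            for field, value in criteria.items()
        )
    ]


def field_names(records):
    """List every field that appears in at least one record."""
    return sorted({field for record in records for field in record})


tweets = find(collection, type="tweet")
all_fields = field_names(collection)
```

No table definition had to be changed to store the sensor record next to the tweets, which is the flexibility the paragraph above describes. The trade-off is that queries must cope with missing fields.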

4. Hive
This is a distributed data management system for Hadoop. It supports a SQL-like query
language, HiveQL (HQL), to access big data. It is used primarily for data mining
purposes. It runs on top of Hadoop.

5. Sqoop
This is a tool that connects Hadoop with various relational databases to transfer data.
This can be effectively used to transfer structured data to Hadoop or Hive.

6. PolyBase
This works on top of SQL Server 2012 Parallel Data Warehouse (PDW) and is used to
access data stored in PDW. PDW is a data warehousing appliance built for processing any
volume of relational data and provides an integration with Hadoop allowing us to access non-
relational data as well.

7. Big data in Excel
As many people are comfortable doing analysis in Excel, a popular tool from
Microsoft, you can also connect to data stored in Hadoop using Excel 2013. Hortonworks,
which primarily provides Enterprise Apache Hadoop, offers an option to
access big data stored in its Hadoop platform using Excel 2013. You can use the Power View
feature of Excel 2013 to easily summarize the data.

Similarly, Microsoft’s HDInsight allows us to connect to big data stored in the Azure
cloud using the Power Query option.

8. Presto
Facebook developed its query engine (SQL-on-Hadoop), named Presto, and open-sourced it
in 2013. It is built to handle petabytes of data. Unlike Hive, Presto does
not depend on the MapReduce technique and can retrieve data quickly.

4. Big Data Use Cases
Big data can help you address a range of business activities, from customer experience
to analytics. Here are just a few.

Product Development

Companies like Netflix and Procter & Gamble use big data to anticipate customer demand.
They build predictive models for new products and services by classifying key attributes of
past and current products or services and modeling the relationship between those attributes
and the commercial success of the offerings. In addition, P&G uses data and analytics from
focus groups, social media, test markets, and early store rollouts to plan, produce, and launch
new products.

Predictive Maintenance

Factors that can predict mechanical failures may be deeply buried in structured data,
such as the equipment year, make, and model of a machine, as well as in unstructured data
that covers millions of log entries, sensor data, error messages, and engine temperature. By
analyzing these indications of potential issues before the problems happen, organizations can
deploy maintenance more cost effectively and maximize parts and equipment uptime.
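
A very reduced sketch of the idea: the function below flags a machine when its recent engine temperatures or error counts look abnormal. It is a fixed-threshold rule used as a stand-in for a real predictive model, and the machine names, readings and thresholds are all hypothetical.

```python
from statistics import mean


def needs_maintenance(readings, window=3, max_avg_temp=90.0, max_errors=2):
    """Flag a machine from its most recent sensor readings.

    readings: non-empty list of (engine_temperature_c, error_count) tuples,
    oldest first. The thresholds are illustrative assumptions.
    """
    recent = readings[-window:]
    avg_temp = mean([temp for temp, _ in recent])
    errors = sum([count for _, count in recent])
    return avg_temp > max_avg_temp or errors > max_errors


# Hypothetical fleet: one steady machine and one that is heating up.
fleet = {
    "crane-1": [(70, 0), (72, 0), (71, 1)],
    "crane-2": [(85, 0), (93, 1), (97, 0)],
}
flagged = [name for name, readings in fleet.items() if needs_maintenance(readings)]
```

In practice the thresholds would be learned from historical failures rather than fixed by hand, but the flow is the same: collect readings, score each machine, and schedule work only where the score calls for it.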

Customer Experience

The race for customers is on. A clearer view of customer experience is more possible
now than ever before. Big data enables you to gather data from social media, web visits, call
logs, and other data sources to improve the interaction experience and maximize the value
delivered. Start delivering personalized offers, reduce customer churn, and handle issues
proactively.

Fraud and Compliance

When it comes to security, it’s not just a few rogue hackers; you’re up against entire
expert teams. Security landscapes and compliance requirements are constantly evolving. Big
data helps you identify patterns in data that indicate fraud and aggregate large volumes of
information to make regulatory reporting much faster.
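
One simple way to look for unusual patterns is to flag transactions whose amount lies far from the average. The sketch below uses a z-score for this; it is only one of many possible techniques, not a complete fraud detection method, and the sample amounts and the threshold are assumptions for illustration.

```python
from statistics import mean, pstdev


def flag_outliers(amounts, z_threshold=2.0):
    """Return the indexes of the amounts whose z-score exceeds the threshold."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        return []  # all amounts are identical, so nothing stands out
    return [
        index
        for index, amount in enumerate(amounts)
        if abs(amount - mu) / sigma > z_threshold
    ]


# Hypothetical card payments: seven ordinary amounts and one very large one.
payments = [20, 25, 22, 19, 24, 21, 23, 500]
suspicious = flag_outliers(payments)
```

A flagged transaction is only a candidate for review. Real systems combine many signals, since a single large purchase is not in itself evidence of fraud.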

Machine Learning

Machine learning is a hot topic right now. And data—specifically big data—is one of
the reasons why. We are now able to teach machines instead of program them. The availability
of big data to train machine-learning models makes that happen.

Operational Efficiency

Operational efficiency may not always make the news, but it’s an area in which big
data is having the most impact. With big data, you can analyze and assess production,
customer feedback and returns, and other factors to reduce outages and anticipate future
demands. Big data can also be used to improve decision-making in line with current market
demand.

Drive Innovation

Big data can help you innovate by studying interdependencies between humans,
institutions, entities, and processes and then determining new ways to use those insights. Use
data insights to improve decisions about financial and planning considerations. Examine
trends and what customers want to deliver new products and services. Implement dynamic
pricing. There are endless possibilities.

Improving Healthcare and Public Health

The computing power of big data analytics enables us to decode entire DNA strings in
minutes and will allow us to find new cures and better understand and predict disease
patterns. Just think of what happens when the individual data from smart watches and
wearable devices is combined across millions of people and their various diseases. The
clinical trials of the future won't be limited by small sample sizes but could potentially include
everyone!

Apple's ResearchKit, a framework for building health research apps, has effectively
turned your phone into a biomedical research device. Researchers can now create studies
through which they collect data and input from users' phones for health studies. Your phone might track
how many steps you take in a day, or prompt you to answer questions about how you feel
after your chemo, or how your Parkinson's disease is progressing. It's hoped that making the
process easier and more automatic will dramatically increase the number of participants a
study can attract as well as the fidelity of the data.

Improving Sports Performance

Most elite sports have now embraced big data analytics. We have the IBM
SlamTracker tool for tennis tournaments; we use video analytics that track the performance of
every player in a football or baseball game, and sensor technology in sports equipment such
as basketballs or golf clubs allows us to get feedback (via smartphones and cloud servers) on
our game and how to improve it. Many elite sports teams also track athletes outside of the
sporting environment - using smart technology to track nutrition and sleep, as well as social
media conversations to monitor emotional wellbeing.

Improving Science and Research

Science and research are currently being transformed by the new possibilities big data
brings. Take, for example, CERN, the particle physics lab with its Large Hadron Collider, the
world's largest and most powerful particle accelerator. Experiments to unlock the secrets of
our universe - how it started and works - generate huge amounts of data.
The CERN data center has 65,000 processors to analyze its 30 petabytes of data.
However, it also uses the computing power of thousands of computers distributed across 150
data centers worldwide to analyze the data. Such computing power can be leveraged to transform
so many other areas of science and research.

Conclusion
The increase in the amount of data available presents both opportunities and problems.
In general, having more data on one’s customers (and potential customers) should allow
companies to better tailor their products and marketing efforts in order to create the highest
level of satisfaction and repeat business.

Companies that are able to collect large amounts of data are provided with the
opportunity to conduct deeper and richer analysis. This data can be collected from publicly
shared comments on social networks and websites, voluntarily gathered from personal
electronics and apps, through questionnaires, product purchases, and electronic check-ins. The
presence of sensors and other inputs in smart devices allows for data to be gathered across a
broad spectrum of situations and circumstances.

Webography:
https://www.bernardmarr.com/default.asp?contentID=1076

https://www.investopedia.com/terms/b/big-data.asp

http://www.hadoopadmin.co.in/sources-of-bigdata/

https://bigdata-madesimple.com/top-big-data-tools-used-to-store-and-analyse-data/

https://www.oracle.com/big-data/guide/what-is-big-data.html
