TM351 Data Management and Analysis: Prepared by Eng. A.Samy Tel: 99941566


TM351
Data management and analysis
Part 1

Data and data sets (Rob Kitchin 2014)

Metadata
Data about the dataset itself. Three kinds:
1. Descriptive: supporting identification and discovery: for example, the name of a dataset, or a
description of its contents
2. Structural: relating to the structure of the dataset: for example, the column headings in a tabular dataset
3. Administrative: recording the means by which the dataset came into being and how it may be, or may
have been, used.
• Metadata can itself be subject to data analysis
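
As a rough illustration of the three kinds of metadata, the sketch below stores them in a plain Python dictionary. The dataset name, column names and values are all invented for the example; nothing here comes from a real dataset.

```python
# Hypothetical metadata record for an imaginary dataset (all values are illustrative).
metadata = {
    "descriptive": {      # supports identification and discovery
        "name": "city_air_quality",
        "description": "Hourly air-quality readings from roadside sensors",
    },
    "structural": {       # describes the structure of the dataset
        "columns": ["sensor_id", "timestamp", "no2_ugm3"],
    },
    "administrative": {   # how the dataset came into being and may be used
        "collected_by": "example sensor network",
        "licence": "example open licence",
    },
}


def column_count(meta):
    """Metadata can itself be analysed: count the declared columns."""
    return len(meta["structural"]["columns"])


print(column_count(metadata))
```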

Stakeholders
• A dataset or database may have a very broad range of stakeholders
• For example, if data about an individual is being analyzed, then:
• That individual is a stakeholder

 Kuwait – Salmiya – Salem Al-Mubarak St. North Salmiya Market Complex 2nd floor.
 (965) 2572 6686 - 2571 4343  (965) 2571 0775  [email protected]

Scale: the three (or six) Vs


1. Volume: our traditional measure of data size – how much there is of it.
2. Variety: in many different, sometimes incompatible forms and representations.
3. Velocity: how fast new data is generated and has to be processed.

• Three more Vs are now becoming current:


4. Veracity: the quality of the data; how ‘clean’ it is.
5. Validity: to what extent the facts the data incorporates are correct and consistent for their context.
6. Volatility: how quickly data changes, or becomes invalid

Big data
• Big data aims to gather, analyze, link, and compare large datasets to identify patterns

Data handling
• Comprises two distinct sets of activities, roles and responsibilities: data management and data analysis.
• Two ways to characterise data handling:
• as a cycle or life cycle (which mainly emphasizes data management)
• or as a pipeline (which combines data management with the way the data is used).

Repurposing

Key issues in data management:


• Legal
• Control
• Curation (processing)
• Flexibility
• Currency (accuracy) and maintenance.

Some common large-scale data management architectures


• OLTP (online transaction processing): guarantees a high degree of consistency and correctness in the data
through a series of transactions, in real time. Examples: order entry and financial transactions.
• OLAP (online analytical processing): extracts and views settled (not frequently updated) data from selected
points of view: for example, seeing data on sales aggregated by sales region, by month, or by product range.
• Data stream processing (DSP): supports processing of data arriving in continuous streams. Processing
is triggered by certain data elements arriving on the stream, or by user intervention, so new results can be
generated as long as new data arrives to be processed.
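
The sketch below contrasts the three styles on a toy scale, using Python's built-in sqlite3 module. The `sales` table, its columns and the helper names are assumptions made for the example; real OLTP, OLAP and DSP systems are far larger and usually separate products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")


def record_sale(conn, region, month, amount):
    """OLTP-style write: the transaction commits fully or is rolled back."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", (region, month, amount))
        if amount <= 0:
            raise ValueError("amount must be positive")


def sales_by_region(conn):
    """OLAP-style read: view settled data aggregated by one dimension."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(rows.fetchall())


def running_totals(stream):
    """DSP-style processing: emit a new result each time a value arrives."""
    total = 0.0
    for amount in stream:
        total += amount
        yield total


record_sale(conn, "North", "2024-01", 100.0)
record_sale(conn, "North", "2024-02", 50.0)
record_sale(conn, "South", "2024-01", 70.0)
try:
    record_sale(conn, "South", "2024-02", -5.0)
except ValueError:
    pass  # the invalid row was rolled back, so the data stays consistent

print(sales_by_region(conn))
print(list(running_totals([100.0, 50.0, 70.0])))
```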


Issues in data analysis


1. Trust
2. Data quality and fitness for purpose
3. Bias
4. Completeness and correctness
5. Reproducibility and provenance (source).

Trust
Trust can relate to several aspects of data analysis:
• Trust in the data itself: in its origins, documentation, security, and curation and in the quality of its
maintenance.
• Trust in the processing applied to the data
• Trust in the data managers and analysts themselves: their competence, their understanding of procedures and
processes, and of concepts of fitness for purpose, data quality, appropriate interpretation of results and
requirements.

Data quality and fitness for purpose
High-level attributes of data:

• Relevance
• Consistency
• Completeness
• Provenance (source)
• Accuracy
• Validity
• Reliability
• Timeliness

Bias
• Human bias
• Bias in data capture
• Bias in data cleaning
• Bias in data handling

• Data engineering has been defined as ‘extracting information partly through the analysis of data’.

The tasks of data engineers include:


• Collecting data over space and time
• Cleaning it of errors
• Anonymizing it
• Filtering it
• Representing it so that it can be exported from one system and imported into others
• Sorting and storing it across distributed systems
• Shaping it into forms that allow it to be analyzed
• Visualizing it.
• All of these tasks must respect legal and ethical concerns.


TM351
Data management and analysis
Part 2

Acquiring and representing data


A data analysis pipeline

Numerous problems can arise when trying to represent data in the world (O/P) and on a computer
(coding). Resolving them may require difficult decisions.

2.2 In the beginning was the bit: data and data types

Typing is useful in a programming language because it allows the language to:


1. Allocate an appropriate amount of memory for the values.
2. Constrain the legal operations that may be applied to a variable.

Every typed programming language supplies its own set of atomic primitive types.
• Java offers:
• boolean (true or false; one bit of information, although its storage size is not precisely defined)
• byte (1-byte signed)
• char (2-byte unsigned)
• short (2-byte signed)
• int (4-byte signed)
• long (8-byte signed)
• float (4-byte floating point)
• double (8-byte floating point).

• Python offers a richer set of built-in types, including complex numbers and various collection types.
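
The two uses of typing can be sketched in Python. The example uses the standard struct module to report the number of bytes needed for fixed-width values (the widths match the Java-style sizes above), and shows a type constraining which operations are legal. The helper name `add_checked` is made up for the illustration.

```python
import struct

# Bytes needed for fixed-width values; "<" selects standard, platform-independent sizes.
SIZES = {
    "byte": struct.calcsize("<b"),    # 1-byte signed integer
    "short": struct.calcsize("<h"),   # 2-byte signed integer
    "int": struct.calcsize("<i"),     # 4-byte signed integer
    "double": struct.calcsize("<d"),  # 8-byte floating point
}


def add_checked(a, b):
    """Return a + b, or None if the two types do not support addition together."""
    try:
        return a + b
    except TypeError:
        return None


# Complex numbers are a built-in Python type: (1+2j) * 1j = -2 + 1j.
z = complex(1, 2) * complex(0, 1)

print(SIZES)
print(add_checked(1, 2), add_checked("1", 1), z)
```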

Numbers for measurements – Stevens’ NOIR


• In a 1946 paper in Science, psychologist Stanley Smith Stevens (Stevens, 1946) identified four
classes of measurement scale (NOIR):
1. Nominal: numbers are used only as labels, in which case words or letters would serve equally
well.
• Example: gender (1 = M and 2 = F)
2. Ordinal: numbers have a rank ordering, but the distances between ranks are not known.

3. Interval: numbers on an interval scale can be ranked, and we know how far apart things are,
such as on a temperature scale, but the zero point is arbitrary.
Example: temperature on the Celsius or Fahrenheit scale

4. Ratio: numbers are on an interval scale, but with a meaningful, known, fixed origin.
Example 1: Mass of an object
Example 2: Height of an object
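
One way to see the difference between the scales is to ask which summary operations make sense on each. The sketch below uses invented sample values; the variable names are illustrative only. It relies on the fact that the Kelvin scale, unlike Celsius, has a true zero and is therefore a ratio scale.

```python
from statistics import median, mode

gender_codes = [1, 2, 2, 1, 2]    # nominal: codes are labels, so only counting makes sense
satisfaction = [1, 3, 2, 5, 4]    # ordinal: ranks can be ordered, so a median makes sense
celsius = [10.0, 20.0]            # interval: differences are meaningful, ratios are not
kelvin = [c + 273.15 for c in celsius]  # ratio: true zero, so ratios are meaningful

most_common = mode(gender_codes)        # most frequent label
middle = median(satisfaction)           # middle rank
difference = celsius[1] - celsius[0]    # a 10-degree difference is meaningful
ratio_c = celsius[1] / celsius[0]       # 2.0, but 20 °C is NOT "twice as hot" as 10 °C
ratio_k = kelvin[1] / kelvin[0]         # meaningful ratio on the Kelvin scale

print(most_common, middle, difference, ratio_c, round(ratio_k, 4))
```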


3 Representing structured data: tables


• The table is a very common schema for representing structured data
• Tabular data is data that is structured into rows, each of which contains information about something.
• Each row contains the same number of cells (although some of these cells may be empty), which
provide values of properties of the thing described by the row.
• In tabular data, cells within the same column provide values for the same property of the thing
described by the particular row.
• A table must contain at least one column and at least one row.
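
The rules above can be checked mechanically. The following is a minimal sketch that assumes a table is held as a header list plus a list of rows; the function names and sample data are hypothetical.

```python
def is_valid_table(header, rows):
    """Check the rules: at least one column, at least one row, equal row lengths."""
    if len(header) < 1 or len(rows) < 1:
        return False
    return all(len(row) == len(header) for row in rows)


def column(header, rows, name):
    """Cells in the same column give values of the same property for each row."""
    index = header.index(name)
    return [row[index] for row in rows]


header = ["name", "height_cm"]
rows = [
    ["Ada", 170],
    ["Bob", None],  # an empty cell is allowed; the row still has two cells
]

print(is_valid_table(header, rows))
print(column(header, rows, "height_cm"))
```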

3.2 Representing tables in web pages


• Tabular data can be represented in two forms within a web page, both of which allow the browser
and its plug-ins to handle the interaction between the logical and physical aspects of the data. The two
forms are:
1. As an (X)HTML element
2. As a JavaScript data object for manipulation
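
To make the two forms concrete, the sketch below generates both from the same rows in Python: an HTML table as a string, and JSON text that a page's JavaScript could load as a data object. The function names are made up for the example, and the markup is deliberately minimal.

```python
import json
from html import escape


def to_html_table(header, rows):
    """Render rows as a minimal HTML table, escaping cell contents."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_js_records(header, rows):
    """Render rows as JSON text: a list of objects, one per row."""
    return json.dumps([dict(zip(header, row)) for row in rows])


print(to_html_table(["name", "height_cm"], [["Ada", 170]]))
print(to_js_records(["name", "height_cm"], [["Ada", 170]]))
```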

4 Representing structured data: documents


• Data scientists use the term document to mean any file or representation that embodies a
particular data record.
• Documents have structure. For example, books are usually divided into chapters or sections; they may
contain illustrations, footnotes, endnotes, tables of contents, indexes and special headings.


4.3 Extended markup


• XML allows user-defined tags
• XML can represent arbitrary data structures
• XML documents are both human and machine readable.
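
As a small illustration of user-defined tags, the sketch below builds and re-parses a tiny XML document with Python's standard xml.etree.ElementTree module. The tag names (`book`, `chapter`, `footnote`) are invented for the example and do not belong to any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Build a nested structure using tags we define ourselves.
book = ET.Element("book", attrib={"title": "Example"})
chapter = ET.SubElement(book, "chapter", attrib={"number": "1"})
ET.SubElement(chapter, "footnote").text = "An illustrative footnote."

# Serialise to text (human readable) and parse it back (machine readable).
xml_text = ET.tostring(book, encoding="unicode")
parsed = ET.fromstring(xml_text)
footnotes = [node.text for node in parsed.iter("footnote")]

print(xml_text)
print(footnotes)
```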


5 Transporting data
• XML is popular as a message-passing format in the delivery of web services.
• Other common transport formats:
• CSV – comma-separated values file (sometimes referred to as a comma-separated variable file).
• JSON – JavaScript Object Notation
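
The sketch below moves the same records through CSV and JSON using Python's standard csv and json modules. The records and function names are invented for the example. One practical difference it highlights: CSV carries every value as text, whereas JSON preserves the distinction between numbers and strings.

```python
import csv
import io
import json

records = [
    {"name": "Ada", "height_cm": 170},
    {"name": "Bob", "height_cm": 182},
]


def to_csv(records):
    """Write records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Read CSV text back into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


csv_round_trip = from_csv(to_csv(records))
json_round_trip = json.loads(json.dumps(records))

print(csv_round_trip)
print(json_round_trip)
```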
