Data Science: Lecture #1
(CS3206)
Lecture #1
Introduction
Quote of the day…
“We are what our thoughts have made us; so take care of what
you think. Words are secondary. Thoughts live; they travel
far.”
Swami Vivekananda
In today’s discussion…
• Introduction to data
• Current trend
Ref: https://cse.iitkgp.ac.in/~dsamanta/courses/da/index.html
Introduction to data
• Example:
10, 25, …, Kharagpur, 10CS3002, [email protected]
Anything else?
How large is your data?
• What is the maximum file size you have
dealt with so far?
• Movies/files/streaming video that you have
used?
Growth of data
Sources of data
• “Every day, we create 2.5 quintillion bytes of data,
so much that 90% of the data in the world today has been created in the last
two years alone.”
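To put the figure above in familiar units, here is a minimal sketch that converts a byte count into decimal (SI) units. The helper name `human_readable` and the choice of powers of 1000 rather than 1024 are assumptions made for illustration; they are not from the lecture.

```python
# Decimal (SI) units of data size, each 1000 times larger than the previous one.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Express a byte count in the largest SI unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest known unit.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

# 2.5 quintillion bytes = 2.5 x 10^18 bytes
daily_bytes = 2.5e18
print(human_readable(daily_bytes))  # 2.5 EB
```

Under these assumptions, 2.5 quintillion bytes per day is 2.5 exabytes per day, or 2.5 million terabytes.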
Examples
Now data is Big data!
Big data is data whose scale, diversity, and complexity require new architecture,
techniques, algorithms, and analytics to manage it and extract value and hidden
knowledge from it…
Characteristics of Big data: V3
V3: V for Volume
• The volume of data that needs to be
processed is increasing rapidly
• More storage capacity
• More computation
• More tools and techniques
V3: V for Variety
• Various formats, types, and
structures
• Text, numerical, images, audio,
video, sequences, time series,
social media data,
multi-dimensional arrays, etc…
V3: V for Velocity
• Data is being generated fast and needs to be
processed fast
• For time-sensitive processes such as
catching fraud, big data must be used as it
streams into your enterprise in order to
maximize its value
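As a rough illustration of processing data as it streams in, the sketch below checks each transaction amount the moment it arrives and flags any amount far above the running mean of what has been seen so far. The function name `flag_suspicious`, the `factor` threshold, and the sample amounts are all hypothetical; real fraud detection uses far richer models.

```python
from typing import Iterable, Iterator, Tuple

def flag_suspicious(amounts: Iterable[float], factor: float = 5.0) -> Iterator[Tuple[int, float]]:
    """Yield (index, amount) for each amount exceeding factor times the running mean."""
    count = 0
    total = 0.0
    for i, amount in enumerate(amounts):
        # Compare against the mean of earlier items only, so each decision
        # is made as the item arrives, without waiting for the full dataset.
        if count > 0 and amount > factor * (total / count):
            yield i, amount
        count += 1
        total += amount

# Hypothetical stream of transaction amounts.
stream = [20.0, 35.0, 25.0, 900.0, 30.0]
alerts = list(flag_suspicious(stream))
print(alerts)  # [(3, 900.0)]
```

Because the function is a generator, it holds only a running count and total in memory, which is the property that matters when data arrives faster than it can be stored and revisited.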
Big data vs. small data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time in nature
Big data vs. small data
Challenges ahead…
Big data players
Major players…
• Hadoop
• MapReduce
• Mahout
• Apache HBase
• Cassandra
Tools available
• NoSQL databases
• MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
• MapReduce
• Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
• Storage
• S3, HDFS, GDFS
• Servers
• EC2, Google App Engine, Elastic Beanstalk, Heroku
• Processing
• R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
Any questions?
Questions of the day…
1. What are the smallest and largest units for measuring the size of data?
Questions of the day…
5. What types of data are involved in the following applications?
1. Weather forecasting