The Role of Data Architecture in NOSQL: What Advances Occurred in DBMSs?
Tom Haughey, President, InfoModel LLC, 868 Woodfield Road, Franklin Lakes, NJ 07417, 201 755 3350, [email protected]
Why NOSQL?
- Stands for "Not Only SQL" or even "Not SQL"
- RDBMSs (they say) have shown poor performance on data-intensive applications, including:
  - Indexing a large number of documents
  - Serving pages on high-traffic websites
  - Handling the volumes of social networking data
  - Delivering streaming media
- Typical RDBMS implementations (they say) are tuned for small but frequent read/write transactions or for large batch transactions with rare write access
- NOSQL (they also say) can service heavy read/write workloads (yes, but only for certain types of workloads, which they don't say)
- Real-world NOSQL deployments include:
  - Digg's 3 TB for green badges (markers that indicate stories upvoted by others in a social network)
  - Facebook's 50 TB for inbox search
  - Google uses BigTable in over 60 applications, such as Google Earth and Orkut
InfoModel LLC, 2012 Role of Data Architecture in NOSQL 3
NOSQL Characteristics
"NOSQL has nothing to do with SQL" (Michael Stonebraker)
Characteristics of NOSQL data stores are:
- Often a file management system, not a DBMS
- Large volumes of data
- Non-relational, with no support for joins
- Improved performance for large data sets
- Distributed databases and queries
- Fault tolerance, so that an application will continue even if some connection is lost
- Horizontally scalable using component nodes
- Schema-free (so they have a more flexible structure)
- Eventually consistent (ACID-free)
- Easy to replicate so as to improve availability
- Access via API support or other non-SQL interfaces
- Open-source
Actually, there are now more DBMS types than there were DBMS instances 10 years ago
NOSQL Assumptions
NOSQL DBMSs further rely on one or more of three assumptions:
- The database will be big enough that it should be scaled across multiple servers.
- The application should run well if the database is replicated across multiple geographically distributed data centers, even if the connection between them is temporarily lost.
- The database should run well if it is replicated across a host server and a number of occasionally connected mobile devices.
In addition, NOSQL proposes that a database should have no fixed schema, other than whatever emerges as a byproduct of the application-writing process.
CAP Theorem
- Eric Brewer challenged the ACID properties of RDBMSs:
  - Atomic: do the whole transaction or nothing at all
  - Consistent: obey all integrity and business rules
  - Isolated: hide the results of a transaction in progress
  - Durable: once committed, the results are persisted
- He contends that a database cannot ensure all three of the following properties at once, called CAP:
  - Consistency: the client perceives that all of the operations have been performed at once (Do you agree with this definition?)
  - Availability: every operation must end in an intended response
  - Partition tolerance: operations will continue even if individual components are unavailable
- A database can ensure only two of the three (PICK TWO)
- Consequently, NOSQL proposes to replace the ACID properties of an RDBMS with CAP properties
- NOSQL emphasizes eventual consistency
Message Queuing
- Enables processes running at different times to communicate across heterogeneous networks and systems that may be temporarily offline
- Applications send messages to queues and read messages from queues
[Figure: events, processes, and a message queue with guaranteed delivery]
- Can be used for:
  - Mission-critical financial services
  - Embedded and hand-held applications
  - Outside sales
  - Workflow
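The queueing idea above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular product works: the class and method names (`MessageQueue`, `send`, `receive`, `ack`, `redeliver_unacked`) are hypothetical, and a real queue would persist messages to disk. It shows a sender posting while the receiver is offline, and a message staying available until it is acknowledged.

```python
from collections import deque


class MessageQueue:
    """Toy queue: a message stays available until the receiver acknowledges it."""

    def __init__(self):
        self._pending = deque()   # messages waiting for delivery
        self._in_flight = {}      # message id -> body handed out but not yet acked
        self._next_id = 0

    def send(self, body):
        """Sender side: works even when no receiver is connected."""
        self._next_id += 1
        self._pending.append((self._next_id, body))
        return self._next_id

    def receive(self):
        """Receiver side: returns (id, body), or None if the queue is empty."""
        if not self._pending:
            return None
        msg_id, body = self._pending.popleft()
        self._in_flight[msg_id] = body
        return msg_id, body

    def ack(self, msg_id):
        """Receiver confirms processing; only now is the message dropped."""
        self._in_flight.pop(msg_id, None)

    def redeliver_unacked(self):
        """Put unacknowledged messages back at the front, e.g. after a receiver crash."""
        for msg_id in sorted(self._in_flight, reverse=True):
            self._pending.appendleft((msg_id, self._in_flight.pop(msg_id)))


queue = MessageQueue()
queue.send("order 1 placed")      # the receiver is offline at this point
queue.send("order 2 placed")

first = queue.receive()           # the receiver comes online later
queue.redeliver_unacked()         # the receiver crashed before acknowledging
again = queue.receive()           # the same message is delivered again
queue.ack(again[0])
second = queue.receive()
```

The acknowledgement step is what stands in for "guaranteed delivery" here: a message is removed only after the receiver says it has been processed.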
Advantages of Sharding
- The number of rows in each physical space is reduced
  - This reduces index size, improving search performance
- Database activity can be spread out over multiple machines, greatly improving performance
- If a shard is based on some real-world segmentation of the data (e.g., European customers vs. American customers), only the relevant shard needs to be queried
- Two considerations in sharding can improve performance:
  - Even distribution of data, to avoid peaks and valleys in instances
  - Collocation of related data (Customer, Order, Order Item all related by Customer ID)
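The two considerations above (even distribution and collocation) can be illustrated with a small Python sketch. It assumes a fixed shard count and hash-based placement on Customer ID; `NUM_SHARDS`, `shard_for`, `insert`, and the sample rows are all hypothetical names and data.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(customer_id: str) -> int:
    """Map a customer ID to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(customer_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# One dict per shard; each shard holds rows keyed by (table, primary key).
shards = [dict() for _ in range(NUM_SHARDS)]


def insert(table: str, key: str, customer_id: str, row: dict) -> int:
    """Store a row on the shard chosen by its customer ID, whatever the table."""
    shard_no = shard_for(customer_id)
    shards[shard_no][(table, key)] = row
    return shard_no


s1 = insert("customer", "C42", "C42", {"name": "Acme"})
s2 = insert("order", "O1001", "C42", {"total": 250})
s3 = insert("order_item", "O1001-1", "C42", {"sku": "X9", "qty": 2})

# All three rows for customer C42 land on one shard, so reading the customer
# together with its orders touches a single machine.
rows_for_c42 = shards[s1]
```

Hashing tends to spread customers evenly, while sharding every table on the same Customer ID keeps related rows together. A segmentation-based scheme (e.g., by region) would replace `shard_for` with a lookup on the segment.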
We are building applications today that have data volume and load requirements that vastly exceed typical OLTP, and even DW, applications of the recent past.
Examples
Key value data store
Automobile
Key | Attributes
1 | Make: Toyota; Model: Highlander; Color: Maroon; Year: 2004
(key not shown) | Make: Toyota; Model: Highlander; Color: Blue; Year: 2005; Transmission: Automatic
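The Automobile example can be expressed as a minimal key-value sketch in Python. A plain dictionary stands in for the store; `put` and `get` are hypothetical helper names, and key `2` for the second entry is an assumption, since the slide does not show it.

```python
store = {}  # key -> value; the store never inspects the structure of the value


def put(key, value):
    """Write (or overwrite) the value stored under a key."""
    store[key] = value


def get(key, default=None):
    """Read by key; this is the only access path a pure key-value store offers."""
    return store.get(key, default)


put(1, {"Make": "Toyota", "Model": "Highlander", "Color": "Maroon", "Year": 2004})
# Key 2 is assumed for illustration. Note the extra attribute: no schema
# forces both entries to carry the same set of fields.
put(2, {"Make": "Toyota", "Model": "Highlander", "Color": "Blue", "Year": 2005,
        "Transmission": "Automatic"})

first_car = get(1)
second_car = get(2)
missing = get(3)
```

Because lookups go through the key alone, a query such as "all blue cars" would require scanning every value or maintaining a separate index.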
Amazon Requirements
Amazon had these characteristics:
- Query: simple read/write
- ACID: requires high availability but weaker consistency
- Efficiency: use commodity hardware
- Other: non-hostile environment with no security issues
To solve this, they came up with Dynamo:
- Very simple to build a key-value store, and very easy to scale
- Usually good performance (for certain applications)
- Schema-less
Amazon's Dynamo
- The most famous Key : Value data store
- Available, scalable and distributed
- Sample services are best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog
- Failure will happen
  - Millions of components
  - At any given moment, a small but significant number of server and network components are failing
  - Must treat failure handling as the norm
- Data is partitioned & replicated for availability by hashing the key
- Consistency facilitated by object versioning
- Consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization
- Nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution
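Partitioning by hashing the key, and adding nodes without manual redistribution, can be sketched with a consistent-hashing ring. This is a simplified illustration under stated assumptions, not Dynamo's actual implementation: it omits virtual nodes, versioning, and failure handling, and `HashRing`, `replicas_for`, `quorum_overlaps`, and the node and key names are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (_hash(node), node))

    def remove_node(self, node: str) -> None:
        self._ring.remove((_hash(node), node))

    def replicas_for(self, key: str, n: int = 3) -> list:
        """Walk clockwise from the key's position and take the next n nodes."""
        if not self._ring:
            return []
        start = bisect.bisect_right(self._ring, (_hash(key), ""))
        count = min(n, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """True when any r replicas read must include one of the w replicas written."""
    return r + w > n


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"cart:{i}" for i in range(1000)]
before = {k: ring.replicas_for(k, 1)[0] for k in keys}

ring.add_node("node-e")  # no manual repartitioning step
after = {k: ring.replicas_for(k, 1)[0] for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `node-e`; keys owned by other nodes stay where they were, which is why a node can join without a full redistribution. The `quorum_overlaps` check expresses the usual condition for a quorum-like scheme: with N replicas, R readers and W writers overlap when R + W > N.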
Columnar Databases
A column-oriented database puts the values of a column together:
1,2,3; Smith,Jones,Johnson; Joe,Mary,Cathy; 40000,50000,44000;
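The difference between the two layouts can be shown with the same sample values in Python. The column names (`id`, `last_name`, `first_name`, `salary`) are assumptions; the slide gives only the values.

```python
# Row-oriented layout: each record's fields are stored together.
rows = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]
column_names = ("id", "last_name", "first_name", "salary")  # assumed names

# Column-oriented layout: each column's values are stored together.
columns = {name: list(values) for name, values in zip(column_names, zip(*rows))}

# Row store: every whole row is read, then sliced down to the one field needed.
avg_from_rows = sum(row[3] for row in rows) / len(rows)

# Column store: only the salary column is read.
avg_from_columns = sum(columns["salary"]) / len(columns["salary"])
```

Both computations give the same answer; the column layout simply avoids touching the name fields, which is where the benefit for analytic queries over a few columns comes from.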
In most RDBMSs, the entire row is read into a buffer and then sliced down to the attributes the query needs
Columnar Databases
Michael Stonebraker (CTO of Vertica) says:
- Column databases will take over the warehouse market
- Many warehouses can't load in the available load window, can't support ad-hoc queries, and can't get better performance without a "fork-lift" upgrade
- Vertica beats all row stores, typically by a factor of 50
Columnar database systems are not new; witness Sybase IQ
- Columnar databases take advantage of exhaustive indexing of columns, bitmapping and data compression to improve performance
- They used to be considered a niche offering, but no more
Google's BigTable data store keeps data by columns and is great for what it does
- But is it suitable for data warehousing?
- Is it suitable for critical financial and medical applications?
BigTable (Google)
A BigTable instance contains a column or group of elements that are:
- Sparse (rows only where there are values; not fixed length)
- Distributed (horizontally partitioned over many servers)
- Persistent (stored)
- Multidimensional (consisting of a multi-part key)
- Sorted map (in lexicographic order)
A special variation of Key : Value data stores
- Indexed by row key, column key, and a timestamp
- Each value in the map is an uninterpreted array of bytes
(row: string, column: string, time: int64) -> string
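The mapping (row, column, time) -> string can be sketched as a small sorted map in Python. This is a single-machine toy, not Google's implementation: `SparseSortedMap`, its methods, and the sample row keys are hypothetical, and storing newer versions first is a design choice of this sketch.

```python
import bisect


class SparseSortedMap:
    """Toy (row, column, timestamp) -> bytes map kept in sorted key order."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> bytes

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        key = (row, column, -timestamp)  # negated so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row: str, column: str):
        """Return the most recent value for a cell, or None if the cell is empty."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan_rows(self, start_row: str, end_row: str):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


table = SparseSortedMap()
table.put("com.example/a", "contents", 1, b"v1")
table.put("com.example/a", "contents", 2, b"v2")
table.put("com.example/b", "anchor", 5, b"x")   # row b has no "contents" cell

latest = table.get("com.example/a", "contents")
empty = table.get("com.example/a", "anchor")
scanned = list(table.scan_rows("com.example/a", "com.example/b"))
```

Only cells that were written take up space (sparse), and keeping the keys sorted is what makes the row-range scan cheap.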
[Figure: BigTable example; labels: Row ID, Attribute, Company; sample value 9019435035]
Document Stores
- Have databases, collections, and indexes, but not joins (~)
- Collections contain documents
- They store unstructured (e.g., text) or semi-structured (e.g., XML) documents, basically hierarchical (1 : M)
- Essentially, they support the embedding of documents and arrays within other documents and arrays
- Document structures do not have to be predefined, so they are schema-free (think about XML)
- These can be very valuable for storing data in data marts, such as salesperson sales or insurance agent sales, for querying
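The embedding described above can be sketched with nested Python dictionaries and lists. All names and values here are hypothetical sample data; a list stands in for a collection, and `total_sales` is an illustrative helper, not any product's API.

```python
import json

collection = []  # a collection is simply a list of documents in this sketch

# A salesperson document with an embedded array of sale documents (1 : M).
collection.append({
    "salesperson": "S-17",
    "name": "Pat Lee",
    "sales": [
        {"date": "2012-03-01", "product": "Auto policy", "amount": 1200},
        {"date": "2012-03-09", "product": "Home policy", "amount": 900},
    ],
})

# A second document with a different shape: an extra field and no sales yet.
collection.append({
    "salesperson": "S-22",
    "name": "Chris Diaz",
    "region": "Northeast",
    "sales": [],
})


def total_sales(doc: dict) -> int:
    """Sum the embedded sales; no join is needed because they live in the document."""
    return sum(sale["amount"] for sale in doc.get("sales", []))


totals = {doc["salesperson"]: total_sales(doc) for doc in collection}
serialized = json.dumps(collection[0])  # documents map directly onto JSON text
```

Reading one salesperson with all of their sales is a single document fetch, which is the data-mart style access pattern the slide has in mind.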
Graph Databases
- Employ nodes (like entities), properties (attributes), and edges (relationships)
- They say: faster for associative data sets, with no joins
- Map more directly to the structure of OO apps
- Can scale to large data sets without joins
- Have a less rigid schema than an RDBMS, so they deal with structural change more easily
- RDBMSs are faster at set processing
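Nodes, properties, and edges can be sketched in Python with a dictionary and an adjacency list. The sample people, the `FOLLOWS` and `LIKES` labels, and the `reachable` helper are all hypothetical; the traversal is an ordinary breadth-first search.

```python
from collections import deque

# Nodes with properties (attributes).
nodes = {
    "alice": {"type": "Person"},
    "bob": {"type": "Person"},
    "carol": {"type": "Person"},
    "dave": {"type": "Person"},
    "story1": {"type": "Story"},
}

# Edges (relationships) as (source, label, target).
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("carol", "FOLLOWS", "dave"),
    ("alice", "LIKES", "story1"),
]

# Adjacency list: each node points directly at its neighbors.
adjacency = {}
for src, label, dst in edges:
    adjacency.setdefault(src, []).append((label, dst))


def reachable(start: str, label: str, max_hops: int) -> list:
    """Breadth-first traversal along edges with the given label, up to max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for edge_label, neighbor in adjacency.get(node, []):
            if edge_label == label and neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return found


two_hops = reachable("alice", "FOLLOWS", 2)
one_hop = reachable("alice", "FOLLOWS", 1)
```

Each hop follows stored edges instead of joining tables, which is the "no joins" argument; in a relational design the same question would typically be a self-join per hop.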
Database Comparison
Columns: Performance | Scalability of Volume | Flexibility of Structure | Complexity of DB | Base Model | Main Players

Key-Value Stores: high | high | high | none
high | high | moderate | low
high | high | high | high | inverted file
Document Stores: high | high | low | hierarchical
Graph Databases: variable | high | high | graph theory
Relational Databases: variable | variable | moderate | relational algebra

Main players: Amazon S3, Redis, Memcached, Voldemort; Cassandra, HBase, BigTable & clones; Sybase IQ, Vertica, ParAccel, InfoBright, Aster Data; CouchDB, MongoDB, Cassandra; Neo4j, FlockDB, InfiniteGraph; The Big Four. For DW: Teradata & all DW appliances
Levels of Models
- Conceptual Model
  - Contains major business entities, major attributes and business relationships
  - Used as a planning tool at a business level or as an initial project data model
- Logical Model
  - Contains all the data entities, all attributes and all data relationships
  - A complete model is accurate and detailed, and identifies all the required data
  - A good model offers a number of options for the physical implementation
- Physical Model
  - Contains table names, columns, indexes and constraints, as well as partitioning
  - This is the implementation model, targeting a specific physical DBMS, e.g., DB2, Oracle, etc.
The conceptual and logical levels should be unaffected by NOSQL and columnar databases
Data modeling should be done as normal
Trust in Ourselves
- As professional data modelers, we must have trust in our science
- We treat data with respect and should ensure others do also
- We should also learn to be a bit quicker about it. Data modeling shouldn't take forever.
- Improved productivity is about tradeoffs. Is it more productive to iterate and iterate through data modeling until we get it right, or is it better to decide "this is good enough," build it and then refactor it?
- Think twice about putting into a data model something that is extraordinarily difficult to implement (or give guidelines)
- In implementing, use denormalization sensibly, guided by quantifiable factors
- Divide deliverables into increments
  - Deliver in shorter increments
  - Provide for early delivery of a prototype
  - Assemble functional teams of motivated, skilled, practical people
  - Timebox the work
Conclusions
- RDBMSs are here to stay because they have proven their value
  - Remember, the paychecks you get, the bank accounts you have, the stocks you buy and sell, the pills you take, and the auto insurance you have were all enabled by an RDBMS
- NOSQL data stores are suitable for certain web-oriented applications
- Most NOSQL DBMSs are unsuitable for DW and BI querying
- Assess the value of column-based and other BI DBMSs
- Believe in the value of logical data modeling, regardless of the physical storage implementation, but just be quicker about it
- Streamline DBA functions so that refactoring is less painful
- Use a modified Agile approach for data modeling, especially in a DW and data mart environment
  - You'd have to be crazy not to be Agile
  - But just as crazy to do it in 2-week increments