Final


DATABASE PERFORMANCE TUNING AND QUERY OPTIMIZATION

A key function of a DBMS is to provide data as quickly as possible. The process begins when an application generates a query; the query is sent to the DBMS, the DBMS executes the query, and the DBMS sends the resulting data set back to the application. Database performance tuning is the set of activities and procedures aimed at reducing the response time of a database management system. It requires a holistic approach: all factors must operate at an optimum level with minimal bottlenecks.

Server-side tuning involves tuning the performance of the DBMS: the DBMS environment is configured to respond to client requests as fast as possible while still making optimum use of available resources. The majority of performance tuning activities focus on minimizing database input/output (I/O) operations. Client-side tuning involves SQL performance tuning: the objective is to create SQL queries that return the desired data in the least amount of time and use the fewest server resources.

DBMS Architecture: Database data is stored in data files, which automatically expand in predefined increments known as extents. Data files are grouped into file groups or table spaces, a logical grouping of data files that store data with similar characteristics. The data or buffer cache is a shared, reserved memory area that stores data read from the data files. The SQL or procedure cache is a second shared memory area that stores recently executed SQL statements in pre-processed form.

DBMS performance tuning at the server end focuses on setting system parameters. Data cache size must be large enough to permit most data requests to be serviced from the cache. SQL cache size must be large enough to retain optimizations for frequently executed queries. Sort cache size must be large enough to handle relevant ORDER BY and GROUP BY operations. Optimizer mode must be set to yield optimal performance. Choices related to the DBMS physical storage environment can also have an impact.

Server-side performance tuning – physical storage recommendations: Utilize I/O accelerators. Use RAID (Redundant Array of Independent Disks) to provide a balance between performance improvement and fault tolerance. Minimize disk contention: assign indexes, the system table space, and high-usage tables to separate data files located on separate storage devices. Partition tables based on usage: placing different attributes of a table on different physical storage devices makes it possible to retrieve them simultaneously. Denormalize tables where appropriate to reduce the number of JOIN operations needed. Store computed and aggregate attributes in tables, which can notably improve the performance of aggregation queries.

Query Processing: The DBMS parses the SQL query and chooses the most efficient access/execution plan, executes the SQL query using the chosen execution plan, then fetches the data and sends the result set back to the client. SQL query parsing is the most time-consuming step and is performed by the query optimizer, which analyzes the SQL query to find the most efficient way to access the data. The query is broken down and transformed into a slightly different form; the revised query is fully equivalent and more efficient (it almost always executes faster). The parsing process: validate syntax, validate table and column names, ensure that the user submitting the query has the necessary access rights, analyze and decompose the original query, transform the original query into an optimized, fully equivalent query, and determine the most efficient access or execution plan.

Query Optimization: The access plan provides the sequence of operations the DBMS should use to most efficiently execute a query and return the result set. Access plans for queries are retained in the SQL cache for possible reuse; if no access plan is available for a query, the optimizer evaluates various plans, chooses one to be used, and places it in the SQL cache. Common access plan operations include table scans, table access, index scans, index access, nested loops, merges, and sorts.
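As a sketch of how the chosen access plan can be inspected (exact syntax varies by DBMS; EXPLAIN as shown here is the MySQL/PostgreSQL form, and the table and column names are illustrative, not from the source):

    EXPLAIN
    SELECT c.customer_name, SUM(o.order_total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2024-01-01'
    GROUP BY c.customer_name;
    -- The output lists the planned operations (index access, join method, sort, etc.)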
Execution and Fetching: During the execution phase of query processing, all I/O operations indicated in the access plan are performed, locks are acquired, data are retrieved and placed in the data cache, and transaction management commands are processed. During the fetching phase, the rows of the query result set are returned to the client; the DBMS may use temporary table space to store temporary data during this phase, and the database server coordinates the movement of the result set rows from the server cache to the client cache.

Query Processing Bottlenecks: Bottlenecks are delays introduced into the processing of a query operation that reduce system performance. Sources: competing transactions are constraining resources, the CPU is experiencing high levels of utilization, insufficient RAM is increasing the need for data movement operations (which can also increase CPU utilization), hard disks are too slow or have a low data transfer rate, the network is transferring large volumes of data, the application code is inefficient, or the database is poorly designed.

Indexes: An index is an ordered list of the values in a table column that includes pointers to the table rows that have these values. Indexes increase data access speed; facilitate search, sort, and join operations; and facilitate the execution of aggregation functions. Using an index is more efficient than scanning an entire table because index data is preordered and contains less data. Despite the benefits, it is not practical to index every column in every database table: maintaining an index incurs costs that can readily overwhelm its benefits. Data sparsity refers to the number of distinct values that can appear in a column and impacts the type of index that will be most effective.

B-tree indexes create a tree structure that can be moved through quickly to identify a pointer to the desired row; they are widely used when columns contain many different values. Bitmap indexes create a series of bits that are coded to represent the value of a column in each row; they are used when columns contain a limited set of different values and use less space than B-trees. Hash indexes compress column values into hashes and maintain a list of these hash values with pointers to their associated rows; they are used in data warehouses, as star schemas call for many joins on relatively unique key values.
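A minimal sketch of the two common index types (table, column, and index names are illustrative; the BITMAP keyword is Oracle-style syntax and not available in every DBMS):

    -- Default index (typically a B-tree) on a column with many distinct values
    CREATE INDEX idx_orders_customer ON orders (customer_id);

    -- Bitmap index (Oracle-style syntax) on a column with few distinct values
    CREATE BITMAP INDEX idx_orders_status ON orders (order_status);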
Query Optimization: Optimization involves a wide range of choices, such as how to perform join operations and which tables to access first. Most query optimization algorithms focus on two key principles: identify the query order that yields the fastest execution time, and access the data storage sites that result in minimal communication costs. Optimization can be manual or automatic. Static optimization occurs at the time of query compilation; dynamic optimization takes place at the time of execution and uses current information about the state of the DBMS, but must be performed every time the query is executed.

The query optimizer can operate in two modes. Rule-based optimization relies on preset rules to determine the best approach to execute a query; the rules assign "costs" to SQL operations such as table scans and are defined by the DBA. Cost-based optimization uses algorithms based on statistics about the database objects being accessed to determine the best approach to execute a query; the optimizer adds processing costs, I/O costs, and resource costs (RAM and temporary space) to determine the total cost of alternative execution plans. Statistical information for cost-based optimization is generated in one of two ways: manually, when a user runs a statistics generation utility, or dynamically by the DBMS as it operates (most common). "Optimizer hints" can be used to steer the optimizer toward a particular plan.
Client-Side Performance Tuning: Many performance problems are due to poorly written code; indexes, including function-based indexes, can be used to improve performance. Index selectivity is a measure of the likelihood that an index will be used in query processing; the aim is to create indexes with high selectivity. Indexes are typically used when: the indexed column appears alone in a WHERE, HAVING, ORDER BY, or GROUP BY clause; the data sparsity of the indexed column is high (do not create indexes for low-sparsity columns or for small tables); MIN or MAX functions are applied to the indexed column; a small set of rows is being selected from a large table; or JOIN operations are performed. Minimize the use of JOINs and create indexes for columns used in JOIN operations that are not declared as primary or foreign keys.
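A sketch of a function-based (expression) index, supported with this syntax in e.g. Oracle and PostgreSQL; the table and column names are illustrative. Without it, a predicate that wraps the column in a function typically cannot use an ordinary index on that column:

    -- Without an expression index, this predicate cannot use a plain index on last_name
    SELECT customer_id FROM customers WHERE UPPER(last_name) = 'SMITH';

    -- Function-based index that makes the predicate above indexable
    CREATE INDEX idx_customers_last_name_upper ON customers (UPPER(last_name));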
Conditional Expressions: The structure of the conditional expressions used in a WHERE clause can impact query performance. Use simple columns or literals as operands rather than expressions. Numeric comparisons are faster than character, date, and NULL comparisons; LIKE is the slowest. Equality comparisons are faster than inequality comparisons. When using multiple AND conditions, the condition most likely to be false should be placed first; when using multiple OR conditions, the condition most likely to be true should be placed first. Avoid the NOT logical operator.
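To illustrate the "simple operand" guideline with a hypothetical salary column (the two predicates select the same rows, but the first leaves the indexed column untouched):

    -- Preferred: the indexed column stands alone as a simple operand
    SELECT employee_id FROM employees WHERE salary > 50000;

    -- Avoid: an equivalent predicate with an expression on the column side
    SELECT employee_id FROM employees WHERE salary - 50000 > 0;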
2. DISTRIBUTED DBMS

DB Architectures: In a centralized architecture all responsibilities of the DBMS (data, application logic, and presentation logic) are handled by one centralized entity; this is rare, expensive, and difficult to maintain. In a client-server architecture data is stored on a server while application and presentation logic are maintained on PCs; this generates considerable network traffic and is difficult to maintain. Tiered architectures distribute computing tasks among multiple computing devices: in the two-tier (client-server) architecture, the fat-client variant maintains application and presentation logic on PCs while the thin-client variant maintains only presentation logic on PCs; three-tier and N-tier architectures distribute data, application logic, and presentation logic across many layers.

Distributed DB Systems: Distributed systems consist of several processing units (nodes) with a certain level of autonomy that are connected by a network to perform complex tasks. System complexity becomes more manageable by dividing and distributing system tasks into smaller, more manageable units of work. Scalability and availability are improved because the addition of new and redundant nodes is relatively simple. Distributed DBSs distribute data and data retrieval functionality across multiple data sources and/or locations, provide an integrated view of the distributed data to users, and provide transparency such that the distributed system appears as a single, standalone system to users.

Distributed processing: logical processing of data is shared across multiple sites; it does not require a distributed database. Distributed database: storage of data is distributed across multiple sites; it requires the use of distributed processing (fully distributed).

Distributed Database Architectures: In a shared memory architecture, multiple interconnected processors that run the DBMS share the same primary and secondary storage. In a shared disk architecture, each processor has its own primary storage but shares secondary storage with the other processors. In a shared nothing architecture, each processor has its own primary storage and hard disk units, and data sharing occurs via network communication; this is the most common and most resilient configuration. Some criteria for selecting a distributed system architecture: performance needs, need for local autonomy, availability needs, degree of transparency needed or desired, and the complexity of the query processing, transaction management, concurrency control, and recovery that takes place.

Scalability: Two key strategies. Vertical scaling involves increasing the capacity of nodes by adding RAM, CPUs, etc., and is limited by the capacity of the existing hardware to support such additions. Horizontal scaling involves the addition of more nodes; coordination costs limit the extent to which this can be done with an RDBMS. There are two approaches to distributing query processing with distributed databases: inter-query parallelism involves simultaneously executing many small queries in parallel on different nodes, while intra-query parallelism involves searching for different subsets of data on different nodes in response to a single query.

Fragmentation is the dividing of data into subsets or fragments to improve database performance, local autonomy, and availability. Vertical fragmentation is the creation of fragments that are subsets of columns; it is useful if only a subset of attributes is relevant to a certain node, such as an order-processing node. The full table is returned through the use of a JOIN query, so every vertical fragment must contain the primary key to re-join the data. Horizontal fragmentation, or sharding, is the creation of fragments that are subsets of rows; the full table is returned through the use of a UNION query. Derived fragmentation is the use of data from other tables to determine how data is fragmented. Mixed fragmentation combines vertical and horizontal fragmentation.
Replication: Data replication occurs when there is redundancy in the data in different fragments and when identical fragments are allocated to different nodes; the fragments are no longer disjoint and can include metadata. Replication supports local autonomy, performance and scalability, and reliability and availability. Replication creates additional overhead and complexity to keep the replicas consistent. Propagation of changes can be synchronous or asynchronous.
With asynchronous propagation, nodes decide when to retrieve updates, which involves less overhead; this is also known as push and pull replication.

Transparency ensures that applications and users are confronted with only a single logical database, insulating users from the complexities of the distributed environment. Transparency is an extension of logical and physical data independence. Different types: Distribution transparency – location transparency means users do not need to know on which node data resides, fragmentation transparency means users execute queries without knowing that distributed fragments need to be combined, and replication transparency means different replicas are kept consistent and updates to one replica are propagated transparently to the others. Failure transparency – the system continues to operate when a node or the network fails; other nodes continue the tasks of the failed nodes. Transaction transparency – the DBMS performs distributed transactions as if they were on a standalone system, which heightens the challenges associated with concurrency control and database recovery. Performance transparency – the system performs as if it were a centralized system. Access or heterogeneity transparency – the distributed database can be accessed and queried uniformly regardless of its reliance on different database systems; this often relies on "wrappers" and is primarily relevant in relation to federated databases.

Distributed Requests and Distributed Transactions: A remote request is a single SQL statement that accesses data processed by a single remote database processor. A remote transaction consists of multiple remote requests that access data at a single remote site. A distributed transaction contains requests that reference data from several different local or remote sites on a network, but each individual SQL statement in the transaction accesses only one site. A distributed request is a single SQL statement that can reference data at multiple sites.
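As an illustration of a distributed request (a single SQL statement touching data at two sites), written here with Oracle-style database links; the link and table names are hypothetical:

    -- One statement references tables stored at two different sites
    SELECT c.customer_name, o.order_total
    FROM   customers@site_newyork c
    JOIN   orders@site_london o ON o.customer_id = c.customer_id;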
Distributed Query Processing: Query processing is more complex in a distributed environment because a distributed transaction can access data on multiple data processing nodes. The optimizer must consider the elements of the local setting as well as the location of data on the network, the structure of the data fragments (distribution and replication), and communication costs. This demands global (across all nodes) and local (within a single node) query optimization. The objective of the optimizer is to minimize CPU processing time, I/O operations, and communication activity. Steps: Decomposition – the query is analysed for correctness (syntax, etc.), represented in relational algebra, and transformed into the most appropriate form. Data localization – the query is transformed into a fragment query, fragments that will contain no data of relevance to the query are eliminated, and database fragments and locations are identified. Global query optimization – a cost model is used to evaluate different global strategies, and query tasks are distributed to local nodes. Local query optimization – nodes optimize their query segment.

Distributed Concurrency Control: Concurrency control is especially important in distributed database environments because multi-site, multiple-process operations are more likely to create inconsistencies and deadlocked transactions. The solution to an inconsistent database is a two-phase commit protocol: first check that all nodes are ready to commit, use a write-ahead protocol that forces log file entries to be written to permanent storage before the actual operations take place, and commit the transaction once all nodes indicate they are ready.

Distributed DB advantages: data is located near the site of greatest demand, faster data access, faster data processing, growth facilitation, improved communication, reduced operational cost, user-friendly interface, less danger of single-point failure, processor independence. Disadvantages: complexity to manage and control, technological difficulty, security, lack of standards, increased storage and infrastructure requirements, training costs, higher costs.

Operational decision support depends on highly normalized data to avoid duplication and inconsistencies, incorporates transaction management and recovery capabilities, supports high transaction volumes, relies on online transaction processing (OLTP) systems, and uses simple INSERT, UPDATE, DELETE and/or SELECT statements. Tactical and strategic decision support depends on data from multiple sources, represents data in a multidimensional way, supports multiple levels of aggregation and summarization, provides trend analysis, relies on online analytical processing (OLAP) systems, and focuses on data retrieval via complex ad-hoc queries (SELECT statements).

Data Warehouses: A data warehouse is an integrated database organized to provide enterprise-wide, cleansed data in a standardized format; it is a repository of historical data organized by subject to support managerial decision making. It can use a relational database management system to store data; however, the structure of a data warehouse database differs fundamentally from that of a transaction-oriented database. A data mart is a low-cost, scaled-down version of a data warehouse designed for the needs of a single business unit or department; data marts provide focused content, improve query performance, and are located closer to users to reduce network traffic. Two key forms of data marts: independent data marts are standalone systems that draw data directly from operational systems and/or external sources, while dependent data marts pull their data from a central data warehouse and use conformed dimensions to avoid the data inconsistencies that often arise with independent data marts.

Characteristics of data warehouses and data marts: Subject-oriented – organized around key facts, excluding data not relevant to the decision process. Integrated – data is collected from many operational systems and altered to conform to standards. Nonvolatile – data cannot be updated by users, which reduces the need for concurrency control mechanisms. Time-variant – data contains a time dimension and includes a time series sequence of data snapshots (operational systems only show the most recent data state), which supports trend analysis. Multi-dimensional – more than two dimensions are stored together in "cubes".

ETL: Extraction – full or incremental approaches are possible; operational data changes must be flagged with the incremental approach. Transformation – facilitated by moving data to a staging area (an alternative is to extract the data, load it into the warehouse, and then transform it: ELT); key activities include formatting data consistently, cleansing to address missing values, irregularities, and inconsistencies, aggregating and merging duplicate data, and enriching data by adding calculated values and linking to other data. Loading – populate dimension tables and generate surrogate keys, populate fact tables, build indexes, and generate documentation and metadata.

Scalability: Good scalability means that queries and other data-access functions will grow linearly with the size of the warehouse. Key factors impacting scalability: the amount of data in the warehouse, the number of concurrent users, the complexity of user queries, and how quickly the warehouse is expected to grow. In essence, data warehouse scalability is impacted by many of the same factors as the scalability of database management systems in general.

Data Warehouse Structure – Dimensional Modeling: Dimensional modeling is a method for designing database structures that support high-volume query access. Four key steps to construct a dimensional model: classify entities (transactions capture the details of an event; components describe the who, what, when, where, how, and why of the event; classifications are related to component entities via one or more one-to-many relationships); identify hierarchies; create the dimensional model (collapse hierarchies, aggregate measures); evaluate and refine the model.

Star Schema: Consists of a fact table connected to several dimension tables. The fact table includes one tuple per transaction or event with the measurement data for the event, and includes foreign keys that link to the dimensions via 1:N relationships; its primary key is the combination of all of these foreign keys. Measurement data can be additive, semi-additive, or non-additive. Dimension tables provide details about the facts in the fact table (e.g. time, store), provide the criteria for aggregating measurement data, and are typically denormalized with fewer rows than fact tables.
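A minimal star-schema sketch (table and column names are illustrative; foreign keys to the date and product dimensions are left as plain columns to keep the example self-contained): the fact table holds one row per sales event with additive measurement data, and dimension attributes supply the aggregation criteria.

    CREATE TABLE dim_store (
        store_key   INTEGER PRIMARY KEY,   -- surrogate key
        store_name  VARCHAR(100),
        region      VARCHAR(50)
    );

    CREATE TABLE fact_sales (
        date_key     INTEGER NOT NULL,     -- FK to a date dimension (omitted here)
        store_key    INTEGER NOT NULL REFERENCES dim_store (store_key),
        product_key  INTEGER NOT NULL,     -- FK to a product dimension (omitted here)
        sales_amount DECIMAL(10,2),        -- additive measure
        PRIMARY KEY (date_key, store_key, product_key)
    );

    -- Dimension attributes provide the criteria for aggregating the measures
    SELECT d.region, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_store d ON d.store_key = f.store_key
    GROUP BY d.region;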
Snowflake Schemas are similar to star schemas except that the dimensions are normalized. This can slow report generation, as more JOIN operations are needed to build reports; snowflake schemas are used when dimension tables grow too large and when most queries do not use data in the outer-level dimensions.

A Fact Constellation includes more than one fact table and is sometimes referred to as a galaxy schema or as a collection of star schemas; it is representative of real-world data warehouses. Blended variants of star, snowflake, and constellation schemas are common.

Surrogate Keys: Dimension tables typically rely on meaningless integers as primary keys rather than the natural keys used in the originating business systems. Natural keys reflect the current business setting and requirements and can change when business circumstances change, for example when a merger or acquisition takes place. Natural keys are typically larger in size, consume more storage space in the fact table, and lead to larger indexes and reduced performance. Natural keys might also be re-used over longer periods of time; in such cases natural keys cannot be used as primary keys in a data warehouse environment. Natural keys are retained as attributes in the dimension tables to maintain links with operational systems; dimension tables have far fewer tuples than fact tables.

Fact Table Granularity: Granularity is the level of detail provided in one row of the fact table; higher granularity implies more rows. Trade-offs exist between the detail available for analysis, storage requirements, and query performance. The time dimension presents particularly noteworthy challenges, since storage and processing needs can grow quite rapidly with higher granularity.

Factless Fact Tables: A factless fact table contains only foreign keys and no measurement data. It is used to track events such as attendance or equipment failures; aggregations over factless fact tables can determine, for example, average attendance and maximum attendance.
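A sketch of a factless fact table tracking attendance (names are illustrative); aggregation is done purely by counting rows:

    CREATE TABLE fact_attendance (
        date_key    INTEGER NOT NULL,
        student_key INTEGER NOT NULL,
        course_key  INTEGER NOT NULL,      -- only foreign keys, no measures
        PRIMARY KEY (date_key, student_key, course_key)
    );

    -- Average daily attendance per course, derived by counting rows
    SELECT course_key,
           COUNT(*) * 1.0 / COUNT(DISTINCT date_key) AS avg_daily_attendance
    FROM   fact_attendance
    GROUP BY course_key;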
Optimizing Dimension Tables: Dimension table attributes are typically used as selection criteria in queries and should be heavily indexed to improve query execution time. An excessive number of dimensions (>25) suggests a need to combine some dimensions; a "junk" dimension can be created for low-cardinality attributes that includes a tuple for each unique combination of these attributes. The time dimension can be populated in advance and often includes date attributes not supported by SQL date functions, such as seasons and holidays.

Outrigger Tables: Used to store sets of attribute types from a dimension table that are highly correlated, low in cardinality, and updated at the same time, for example demographic data about the regions where customers are located that is obtained from a third-party organization. The outrigger table is connected to the dimension table via a 1:N relationship. Outrigger tables reduce data duplication and facilitate updates, but reduce performance as a consequence of the extra JOIN operations needed.

Slowly Changing Dimensions: Dimensions that change slowly and irregularly over time, such as a customer segment in a customer table that is updated once per year. Four distinct approaches to handling such changes (a sketch of approach 2 follows this list): 1. Overwrite the previous data: suitable for correcting errors and eliminating data that is no longer relevant, such as old phone numbers; not suitable when historical values are relevant to analysis. 2. Duplicate the existing tuple and add StartDate, EndDate, and CurrentFlag attribute types: a new surrogate key is created while the natural key is retained in both tuples, and fact tables use the surrogate keys to ensure that facts are associated with the correct versions of the dimensions; suitable for dimensions that are relatively small and change infrequently. 3. Add a new attribute type to the dimension table: retains only partial historical information; a date attribute can be included to indicate the most recent change. 4. Create a history table for the dimension, linked to the fact table using the surrogate key.
Rapidly Changing Dimensions: Some dimensions contain data that changes regularly over time. It is still valid to overwrite the previous data or to add a new attribute type to the dimension table when either of these approaches is suitable. In contrast, approaches that seek to record historical data will generate large numbers of rows with relatively few changes. An alternative is to split the dimension into two dimensions: the first dimension contains the stable information (the original dimension) and the second contains the information that changes regularly (the rapidly changing dimension). One cannot simply add the surrogate key of the rapidly changing dimension as a foreign key in the original dimension, since that would require creating a new original-dimension row whenever a row is added to the rapidly changing dimension, which defeats the purpose of the split. Instead, surrogate keys for both dimensions can be included in the fact table; adding a foreign key that links the rapidly changing dimension to the original dimension permits identification of the current data, but the rapidly changing data can then only be updated when a new fact is added to the fact table. An alternative is to create a junction table.

NoSQL DATABASE

As data volumes or the number of parallel transactions increase, capacity can be increased by:
Vertical scaling – extending the storage capacity and/or CPU power of the database server. Horizontal scaling – multiple DBMS servers arranged in a cluster. RDBMSs are not good at extensive horizontal scaling because of coordination overhead stemming from their focus on consistency and rigid database schemas. Other types of DBMSs are needed for situations with massive volumes, flexible data structures, and where scalability and availability are more important: NoSQL databases. The term describes databases that store and manipulate data in formats other than tabular relations, i.e. non-relational databases. NoSQL databases aim at near-linear horizontal scalability by distributing data over a cluster of database nodes for the sake of performance as well as availability. Eventual consistency: the data (and its replicas) will become consistent at some point in time after each transaction.

Key-Value Stores: A key-value based database stores data as (key, value) pairs in which keys are unique – a hash map, hash table, or dictionary. Keys are hashed by means of a so-called hash function; a hash function takes an arbitrary value of arbitrary size and maps it to a key with a fixed size, which is called the hash value. Each hash can be mapped to a space in computer memory. NoSQL databases are built with horizontal scalability support in mind and distribute the hash table over different locations. Assume we need to spread our hashes over three servers: hash every key ("wilfried", "seppe") to a server identifier, index(hash) = mod(hash, nrServers) + 1.
Key-Value Stores – Request Coordination: In many NoSQL implementations (e.g. Cassandra, Google's BigTable, Amazon's DynamoDB) all nodes implement the same functionality and are all able to perform the role of request coordinator. This creates the need for a membership protocol: dissemination, based on periodic, pairwise communication, and failure detection. Eventual Consistency: The membership protocol does not guarantee that every node is aware of every other node at all times; it will reach a consistent state over time. The state of the network might not be perfectly consistent at any moment in time, though it will become eventually consistent at a future point in time. Many NoSQL databases guarantee so-called eventual consistency. Most NoSQL databases follow the BASE principle. The CAP theorem states that a distributed computer system cannot guarantee the following three properties at the same time: Consistency (all nodes see the same data at the same time), Availability (guarantees that every request receives a response indicating a success or failure result),
Partition tolerance (the system continues to work even if
nodes go down or are added). Most NoSQL databases
sacrifice the consistency part of CAP in their setup, instead
striving for eventual consistency.
The full BASE acronym stands for: Basically available:
NoSQL databases adhere to the availability guarantee of the
CAP theorem; Soft state: the system can change over time,
even without receiving input; Eventual consistency: the
system will become consistent over time.
The operation that repartitions hashes over nodes if nodes
are added or removed is called stabilization.
Integrity Constraints and Querying: Key value stores
represent a very diverse range of systems, Full blown
DBMSs versus caches, Only limited query facilities are
offered- E.g. put and set, Limited to no means to enforce
structural constraints- DBMS remains agnostic to the
internal structure, No relationships, referential integrity
constraints, or database schema can be defined. A tuple
store is similar to a key-value store, with the difference that
it does not store pairwise combinations of a key and a
value, but instead stores a unique key together with a vector
of data. No requirement to have the same length or
semantic ordering (schema-less!). Various NoSQL
implementations do, however, permit organizing entries in
semantic groups (aka collections or tables).
Document stores store a collection of attributes that are
labeled and unordered, representing items that are semi-
structured, Most modern NoSQL databases choose to
represent documents using JSON.
Items With Keys: Most NoSQL document stores will
allow you to store items in tables (collections) in a schema-
less manner, but will enforce that a primary key be
specified, e.g. Amazon's DynamoDB, MongoDB (_id).
Primary key will be used as a partitioning key to create a
hash and determine where the data will be stored.
Filters and Queries; Complex Queries and Aggregation with MapReduce.
SQL after all: GROUP BY style SQL queries are convertible to an equivalent map-reduce pipeline.
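For instance, a query like the following (table and column names are illustrative) groups and aggregates exactly the way a map step (emit the genre as key) followed by a reduce step (count and sum per key) would:

    SELECT genre, COUNT(*) AS nr_books, SUM(nr_pages) AS total_pages
    FROM   books
    GROUP BY genre;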
Many document store implementations express queries using an SQL interface; Couchbase, for example, also allows defining foreign keys and performing join operations.

Many RDBMS vendors have started implementing NoSQL features: focusing on horizontal scalability and distributed querying, dropping schema requirements, supporting nested data types or allowing JSON to be stored directly in tables, supporting Map-Reduce operations, and supporting special data types such as geospatial data.

Column-Oriented DB: A column-oriented DBMS is a database management system that stores data tables as sections of columns of data. It is useful if aggregates are regularly computed over large numbers of similar data items or if the data is sparse, i.e. columns contain many null values; it can also be an RDBMS, key-value, or document store. Row-based databases are not efficient at performing operations that apply to the entire data set and need indexes, which add overhead. When all values of a column are placed together on disk, a column matches the structure of a normal index in a row-based system: operations such as "find all records with price equal to 10" can be executed directly, and null values no longer take up storage space. Disadvantages: retrieving all attributes pertaining to a single entity becomes less efficient, and join operations are slowed down.

Graph-Based DB: Graph databases apply graph theory to the storage of records; graphs consist of nodes and edges. One-to-one, one-to-many, and many-to-many structures can easily be modeled in a graph, queried for example using the Cypher query language from Neo4j. A graph database is a hyper-relational database in which JOIN tables are replaced by more interesting and semantically meaningful relationships that can be navigated and/or queried using graph traversal based on graph pattern matching. Examples: location-based services, recommender systems, social media, knowledge-based systems.

Other NoSQL Categories: XML databases, OO databases, database systems to deal with time series and streaming events, database systems to store and query geospatial data, and database systems such as BayesDB which let users query the probable implications of their data.

Evaluation: NoSQL databases have not yet proven their true worth and are harder to learn and use. NoSQL vendors are starting to focus again on robustness and durability, whereas RDBMS vendors are starting to implement features to build schema-free, scalable data stores. NewSQL systems blend the scalable performance and flexibility of NoSQL systems with the robustness guarantees of a traditional RDBMS.
