DATABASE PERFORMANCE TUNING AND QUERY OPTIMIZATION

A key function of a DBMS is to provide data as quickly as possible. The process begins when an application generates a query; the query is sent to the DBMS, the DBMS executes the query, and the DBMS sends the resulting data set back to the application. Database performance tuning is the set of activities and procedures aimed at reducing the response time of a database management system. It requires a holistic approach: all factors must operate at an optimum level with minimal bottlenecks.

Server-side tuning involves tuning the performance of the DBMS: the DBMS environment is configured to respond to client requests as fast as possible while still making optimum use of available resources, and the majority of performance tuning activities focus on minimizing database input/output (I/O) operations. Client-side tuning involves SQL performance tuning: the objective is to create SQL queries that return the desired data in the least amount of time and use the fewest server resources.

DBMS Architecture: Database data is stored in data files, which automatically expand in predefined increments known as extents. Data files are grouped into file groups or table spaces, logical groupings of data files that store data with similar characteristics. The data (or buffer) cache is a shared, reserved memory area that stores data read from the data files. The SQL (or procedure) cache is a second shared memory area that stores recently executed SQL statements in pre-processed form.

DBMS performance tuning at the server end focuses on setting system parameters: the data cache size must be large enough to permit most data requests to be serviced from the cache; the SQL cache size must be large enough to retain optimizations for frequently executed queries; the sort cache size must be large enough to handle the relevant ORDER BY and GROUP BY operations; and the optimizer mode must be set to yield optimal performance. Choices related to the DBMS physical storage environment can also impact performance.
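As a minimal sketch of tuning a cache parameter, the following uses Python's built-in sqlite3 module; SQLite's cache_size pragma stands in here for the data cache setting of a server DBMS, whose parameter names and defaults differ per product.

    # Illustrative only: enlarging SQLite's page cache so that more data
    # requests can be serviced from memory instead of from the data files.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Current cache size (positive = pages, negative = size in KiB).
    print(cur.execute("PRAGMA cache_size").fetchone())

    # Raise the data cache to roughly 64 MB for this connection.
    cur.execute("PRAGMA cache_size = -64000")
    print(cur.execute("PRAGMA cache_size").fetchone())

    conn.close()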
Server-side performance tuning, physical storage: Recommendations for DBMS physical storage include: utilize I/O accelerators; use RAID (Redundant Array of Independent Disks) to provide a balance between performance improvement and fault tolerance; minimize disk contention by assigning indexes, the system table space, and high-usage tables to separate data files located on separate storage devices; partition tables based on usage (placing different attributes of a table on different physical storage devices makes it possible to retrieve them simultaneously); denormalize tables where appropriate to reduce the number of JOIN operations that are needed; and store computed and aggregate attributes in tables, which can notably improve the performance of aggregation queries.

Query Processing: The DBMS parses the SQL query and chooses the most efficient access/execution plan, executes the SQL query using the chosen execution plan, and then fetches the data and sends the result set back to the client. SQL query parsing is the most time-consuming phase and is performed by the query optimizer: it analyzes the SQL query to find the most efficient way to access the data; the query is broken down and transformed into a slightly different form, and the revised query is fully equivalent and more efficient (it almost always executes faster). The parsing process: validate the syntax, validate table and column names, ensure that the user submitting the query has the necessary access rights, analyze and decompose the original query, transform the original query into an optimized, fully equivalent query, and determine the most efficient access or execution plan.

Query Optimization: The access plan provides the sequence of operations the DBMS should use to most efficiently execute a query and return the result set. Access plans for identical queries are retained in the SQL cache for possible reuse; if no access plan is available for a query, the optimizer evaluates various plans, chooses one to be used, and places it in the SQL cache. Typical access plan I/O operations include the table scan (full), table access (by row ID), index scan (range), index access (unique), nested loop, merge, and sort.
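To make the access plan idea concrete, here is a small sketch using Python's sqlite3 module and an invented employee table; the plan text shown in the comments is what SQLite typically reports, and other DBMSs expose their plans through their own EXPLAIN facilities.

    # Inspecting the access plan the optimizer chooses, before and after an index.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, dept TEXT, salary REAL)")
    cur.executemany("INSERT INTO employee (dept, salary) VALUES (?, ?)",
                    [("D" + str(i % 20), 30000 + i) for i in range(1000)])

    # Without an index on dept the plan is a full table scan.
    for row in cur.execute("EXPLAIN QUERY PLAN SELECT * FROM employee WHERE dept = 'D7'"):
        print(row)   # ... SCAN employee

    # With an index the optimizer switches to an index search.
    cur.execute("CREATE INDEX idx_emp_dept ON employee (dept)")
    for row in cur.execute("EXPLAIN QUERY PLAN SELECT * FROM employee WHERE dept = 'D7'"):
        print(row)   # ... SEARCH employee USING INDEX idx_emp_dept (dept=?)

    conn.close()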
Execution and Fetching: During the execution phase of query processing, all I/O operations indicated in the access plan are performed, locks are acquired, data are retrieved and placed in the data cache, and transaction management commands are processed. During the fetching phase, the rows of the query result set are returned to the client; the DBMS may use temporary table space to store temporary data during this phase, and the database server coordinates the movement of the result set rows from the server cache to the client cache.

Query Processing Bottlenecks: A bottleneck is a delay introduced into the processing of a query or operation that reduces system performance. Sources include: competing transactions are constraining resources; the CPU is experiencing high levels of utilization; insufficient RAM is increasing the need for data movement operations (which can also increase CPU utilization); hard disks are too slow or have a low data transfer rate; the network is transferring large volumes of data; application code is inefficient; the database is poorly designed.

Indexes: An index is an ordered list of the values in a column that includes pointers to the table rows that have these values. Indexes increase data access speed, facilitate search, sort, and join operations, and facilitate the execution of aggregation functions. Using an index is more efficient than scanning an entire table because index data is preordered and contains less data. Despite these benefits, it is not practical to index every column in every database table: maintaining an index incurs costs that can readily overwhelm its benefits. Data sparsity refers to the number of distinct values that can appear in a column; it impacts the type of index that will be most effective. B-tree indexes create a tree structure that can be traversed quickly to identify a pointer to the desired row; they are widely used when columns contain many different values. Bitmap indexes create a series of bits coded to represent the value of a column in each row; they are used when columns contain a limited set of different values and use less space than a B-tree. Hash indexes compress column values into hashes, and a list of these hash values is created that includes pointers to their associated rows; they are used in data warehouses, as star schemas call for many joins on relatively unique key values.

Query Optimization: Optimization involves a wide range of choices, such as how to perform join operations and which tables to access first. Most query optimization algorithms focus on two key principles: identify the query order that yields the fastest execution time, and identify the data storage sites whose access results in minimum communication costs. Optimization can be manual or automatic. Static optimization occurs at the time of query compilation; dynamic optimization takes place at the time of query execution and uses current information about the state of the DBMS, but must be performed every time the query is executed. The query optimizer can operate in two modes. Rule-based optimization relies on preset rules to determine the best approach to execute a query; the rules assign "costs" to SQL operations such as table scans and are defined by the DBA. Cost-based optimization uses algorithms based on statistics about the database objects being accessed; the optimizer adds processing costs, I/O costs, and resource costs (RAM and temporary space) to determine the total cost of alternative execution plans. Statistical information for cost-based optimization is generated in one of two ways: manually, when a user runs a statistics generation utility, or dynamically, by the DBMS as it operates (the most common case). "Optimizer hints" can be used to influence the optimizer's choices.
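The following toy Python sketch illustrates only the arithmetic behind the cost-based idea: sum the processing, I/O, and resource costs of each candidate plan and keep the cheapest. The plans and numbers are invented; a real optimizer derives them from statistics such as row counts and selectivity.

    # Toy illustration of cost-based plan selection; all figures are invented.
    candidate_plans = {
        "full table scan + sort":   {"cpu": 40, "io": 120, "resources": 10},
        "index scan + nested loop": {"cpu": 25, "io": 60,  "resources": 20},
        "hash join":                {"cpu": 55, "io": 45,  "resources": 35},
    }

    def total_cost(costs):
        # Total cost = processing cost + I/O cost + resource cost (RAM, temp space).
        return costs["cpu"] + costs["io"] + costs["resources"]

    for plan, costs in candidate_plans.items():
        print(plan, "->", total_cost(costs))

    best = min(candidate_plans, key=lambda p: total_cost(candidate_plans[p]))
    print("chosen plan:", best)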
Client-Side Performance Tuning: Poor performance is often due to poorly written code; indexes, including function-based indexes, can be used to improve performance. Index selectivity is a measure of the likelihood that an index will be used in query processing; the aim is to create indexes with high selectivity. Indexes are typically used when: the indexed column appears alone in a WHERE, HAVING, ORDER BY, or GROUP BY clause; the data sparsity of the indexed column is high (do not create indexes for low-sparsity columns or for small tables); MIN or MAX functions are applied to the indexed column; a small set of rows is being selected from a large table; or JOIN operations are performed (minimize the use of JOINs and create indexes for columns used in JOIN operations that are not declared as primary or foreign keys).

Conditional Expressions: The structure of the conditional expressions used in a WHERE clause can impact query performance. Use simple columns or literals as operands rather than expressions. Numeric comparisons are faster than character, date, and NULL comparisons, and LIKE comparisons are the slowest; equality comparisons are faster than inequality comparisons. When using multiple AND conditions, the condition most likely to be false should come first; when using multiple OR conditions, the condition most likely to be true should come first. Avoid the NOT logical operator.
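A small sketch of the "simple column rather than expression" guideline, again with sqlite3 and an invented table: wrapping the indexed column in arithmetic forces a table scan, while moving the arithmetic to the literal side lets the optimizer use the index.

    # The same filter written two ways against an indexed salary column.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, salary REAL)")
    cur.executemany("INSERT INTO employee (salary) VALUES (?)",
                    [(30000 + 10 * i,) for i in range(5000)])
    cur.execute("CREATE INDEX idx_emp_salary ON employee (salary)")

    # Expression on the column: the index cannot be used, so the table is scanned.
    print(cur.execute("EXPLAIN QUERY PLAN "
                      "SELECT * FROM employee WHERE salary / 12 > 5000").fetchall())

    # Simple column compared with a literal expression: the index can be used.
    print(cur.execute("EXPLAIN QUERY PLAN "
                      "SELECT * FROM employee WHERE salary > 5000 * 12").fetchall())

    conn.close()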
2. DISTRIBUTED DBMS

DB Architectures: In a centralized architecture, all responsibilities of the DBMS (data, application logic, and presentation logic) are handled by one centralized entity; this is rare, expensive, and difficult to maintain. In a client-server architecture, data is stored on a server while application and presentation logic are maintained on PCs; this generates considerable network traffic and is difficult to maintain. Tiered architectures distribute computing tasks among multiple computing devices. In the two-tier (client-server) architecture, the fat-client variant maintains application and presentation logic on PCs, while the thin-client variant maintains only presentation logic on PCs; three-tier and N-tier architectures distribute data, application logic, and presentation logic across many layers.

Distributed DB Systems: Distributed systems consist of several processing units (nodes) with a certain level of autonomy that are connected by a network to perform complex tasks. System complexity becomes more manageable by dividing and distributing system tasks into smaller, more manageable units of work, and scalability and availability are improved because the addition of new and redundant nodes is relatively simple. Distributed DBSs distribute data and data retrieval functionality across multiple data sources and/or locations, provide an integrated view of the distributed data to users, and provide transparency such that the distributed system appears as a single, standalone system to users. Distributed processing means the logical processing of data is shared across multiple sites; it does not require a distributed database. A distributed database means the storage of data is distributed across multiple sites; it requires the use of distributed processing. A fully distributed system combines both.

Distributed Database Architectures: In the shared memory architecture, multiple interconnected processors that run the DBMS share the same primary storage and secondary storage. In the shared disk architecture, each processor has its own primary storage but shares secondary storage with the other processors. In the shared nothing architecture, each processor has its own primary storage and hard disk units, and data sharing occurs via network communication; this is the most common and most resilient configuration. Criteria for selecting a distributed systems architecture include performance needs, the need for local autonomy, availability needs, the degree of transparency needed or desired, and the complexity of the query processing, transaction management, concurrency control, and recovery that takes place.

Scalability: There are two key strategies. Vertical scaling involves increasing the capacity of nodes by adding RAM, CPUs, etc., and is limited by the capacity of the existing hardware to support such additions. Horizontal scaling involves the addition of more nodes; coordination costs limit the extent to which this can be done with an RDBMS. There are two approaches to distributing query processing with distributed databases: inter-query parallelism involves simultaneously executing many small queries in parallel on different nodes, while intra-query parallelism involves searching different subsets of data on different nodes in response to a single query.

Fragmentation is the dividing of data into subsets or fragments to improve database performance, local autonomy, and availability. Vertical fragmentation is the creation of fragments that are subsets of columns; it is useful if only a subset of attributes is relevant to a certain node, such as an order-processing node. The full table is returned through the use of a JOIN query, so every vertical fragment must contain the primary key to re-join the data. Horizontal fragmentation, or sharding, is the creation of fragments that are subsets of rows; the full table is returned through the use of a UNION query. Derived fragmentation is the use of data from other tables to determine how data is fragmented. Mixed fragmentation combines vertical and horizontal fragmentation.
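A minimal sketch of re-joining fragments, with in-memory SQLite tables standing in for fragments held at different nodes (table and column names are invented): vertical fragments are recombined with a JOIN on the shared primary key, horizontal fragments with a UNION.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Vertical fragments: each keeps the primary key so the rows can be re-joined.
    cur.execute("CREATE TABLE cust_profile (cust_id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE cust_orders (cust_id INTEGER PRIMARY KEY, last_order_date TEXT)")
    cur.execute("INSERT INTO cust_profile VALUES (1, 'Avery'), (2, 'Noor')")
    cur.execute("INSERT INTO cust_orders VALUES (1, '2024-01-15'), (2, '2024-02-03')")
    print(cur.execute("""SELECT p.cust_id, p.name, o.last_order_date
                         FROM cust_profile p JOIN cust_orders o USING (cust_id)""").fetchall())

    # Horizontal fragments (shards): subsets of rows, recombined with UNION.
    cur.execute("CREATE TABLE sales_eu (sale_id INTEGER, amount REAL)")
    cur.execute("CREATE TABLE sales_us (sale_id INTEGER, amount REAL)")
    cur.execute("INSERT INTO sales_eu VALUES (10, 99.0)")
    cur.execute("INSERT INTO sales_us VALUES (11, 45.0)")
    print(cur.execute("SELECT * FROM sales_eu UNION SELECT * FROM sales_us").fetchall())

    conn.close()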
Replication: Data replication occurs when there is redundancy in the data in different fragments and when identical fragments are allocated to different nodes; the fragments are no longer disjoint and can include metadata. Replication supports local autonomy, performance and scalability, and reliability and availability, but it creates additional overhead and complexity to keep the replicas consistent. Propagation of changes can be synchronous or asynchronous; with asynchronous propagation, nodes decide when to retrieve updates, which involves less overhead (also known as push and pull replication).

Transparency ensures that applications and users are confronted with only a single logical database, insulating users from the complexities of the distributed environment; it is an extension of logical and physical data independence. Different types: Distribution transparency: location transparency means users do not need to know on which node data resides; fragmentation transparency means users execute queries without knowing that distributed fragments need to be combined; replication transparency means different replicas are kept consistent and updates to one replica are propagated transparently to the others. Failure transparency: the system continues to operate when a node or the network fails, and other nodes continue the tasks of the failed nodes. Transaction transparency: the DBMS performs distributed transactions as if they were on a standalone system, which heightens the challenges associated with concurrency control and database recovery. Performance transparency: the system performs as if it were a centralized system. Access (or heterogeneity) transparency: the distributed database can be accessed and queried uniformly regardless of its reliance on different database systems; this often relies on "wrappers" and is primarily relevant in relation to federated databases.

Distributed Requests and Distributed Transactions: A remote request is a single SQL statement that accesses data processed by a single remote database processor. A remote transaction consists of multiple remote requests that access data at a single remote site. A distributed transaction consists of requests that reference data from several different local or remote sites on a network, where each individual SQL statement in the transaction accesses only one site. A distributed request is a single SQL statement that can reference data at multiple sites.

Distributed Query Processing is more complex in a distributed environment because a distributed transaction can access data on multiple data processing nodes. The optimizer must consider the elements of the local setting as well as the location of data on the network, the structure of the data fragments (distribution and replication), and communication costs; this demands both global (across all nodes) and local (within a single node) query optimization. The objective of the optimizer is to minimize CPU processing time, I/O operations, and communication activity. Steps: Decomposition: the query is analysed for correctness (syntax, etc.), represented in relational algebra, and transformed into the most appropriate form. Data localization: the query is transformed into a fragment query, fragments that will contain no data of relevance to the query are eliminated, and the database fragments and their locations are identified. Global query optimization: a cost model is used to evaluate different global strategies, and query tasks are distributed to the local nodes. Local query optimization: each node optimizes its own query segment.

Distributed Concurrency Control: Concurrency control is especially important in distributed database environments because multi-site, multiple-process operations are more likely to create inconsistencies and deadlocked transactions. The solution to an inconsistent database is the two-phase commit protocol: first check that all nodes are ready to commit; a write-ahead protocol forces log file entries to be written to permanent storage before the actual operations take place; the transaction is committed once all nodes indicate they are ready.
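A toy sketch of the two-phase commit voting idea follows; the class and method names are invented, and a real implementation also handles write-ahead logging, timeouts, and recovery of the coordinator and participants.

    # Phase 1: every node votes on whether it can commit.
    # Phase 2: commit everywhere only if all votes were "yes"; otherwise roll back.
    class Participant:
        def __init__(self, name, can_commit=True):
            self.name = name
            self.can_commit = can_commit

        def prepare(self):
            # A real node would flush its log to permanent storage here
            # (write-ahead) before voting that it is ready.
            return self.can_commit

        def commit(self):
            print(self.name, "committed")

        def rollback(self):
            print(self.name, "rolled back")

    def two_phase_commit(participants):
        if all(p.prepare() for p in participants):
            for p in participants:
                p.commit()
        else:
            for p in participants:
                p.rollback()

    two_phase_commit([Participant("node A"), Participant("node B"),
                      Participant("node C", can_commit=False)])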
Distributed DB advantages: data is located near the site of greatest demand, faster data access, faster data processing, growth facilitation, improved communication, reduced operational cost, user-friendly interface, less danger of a single point of failure, processor independence. Disadvantages: complexity of management and control, technological difficulty, security, lack of standards, increased storage and infrastructure requirements, training costs, higher costs.

Operational decision support depends on highly normalized data to avoid duplication and inconsistencies, incorporates transaction management and recovery capabilities, supports high transaction volumes, relies on online transaction processing (OLTP) systems, and uses simple INSERT, UPDATE, DELETE and/or SELECT statements. Tactical and strategic decision support depends on data from multiple sources, represents data in a multidimensional way, supports multiple levels of aggregation and summarization, provides trend analysis, relies on online analytical processing (OLAP) systems, and focuses on data retrieval via complex ad-hoc queries (SELECT statements).

Data Warehouses: A data warehouse is an integrated database organized to provide enterprise-wide, cleansed data in a standardized format; it is a repository of historical data organized by subject to support managerial decision making. It can use a relational database management system to store data; however, the structure of a data warehouse database differs fundamentally from that of a transaction-oriented database. A data mart is a low-cost, scaled-down version of a data warehouse designed for the needs of a single business unit or department; data marts provide focused content, improve query performance, and are located closer to users to reduce network traffic. There are two key forms of data marts: independent data marts are standalone systems that draw data directly from operational systems and/or external sources, while dependent data marts pull their data from a central data warehouse and use conformed dimensions to avoid the data inconsistencies that often arise with independent data marts.

Characteristics of data warehouses and data marts: Subject-oriented: organized around key facts, excluding data not relevant to the decision process. Integrated: data is collected from many operational systems and altered to conform to standards. Nonvolatile: data cannot be updated after it is loaded, and read-only access by users reduces the need for concurrency control mechanisms. Time-variant: the data contains a time dimension and includes a time series sequence of data snapshots (operational systems only show the most recent data state), which supports trend analysis. Multi-dimensional: more than two dimensions are stored together in "cubes".

ETL: Extraction: full or incremental approaches are possible; with the incremental approach, operational data changes must be flagged. Transformation: facilitated by moving data to a staging area (an alternative is to extract the data, load it into the warehouse, and then transform it, i.e. ELT); key activities include formatting data consistently, cleansing to address missing values, irregularities, and inconsistencies, aggregating and merging duplicate data, and enriching data by adding calculated values and linking to other data. Loading: populate the dimension tables and generate surrogate keys, populate the fact tables, build indexes, and generate documentation and metadata.

Scalability: Good scalability means that queries and other data-access functions will grow linearly with the size of the warehouse. Key factors impacting scalability are the amount of data in the warehouse, the number of concurrent users, the complexity of user queries, and how quickly the warehouse is expected to grow. In essence, data warehouse scalability is impacted by many of the same factors as the scalability of database management systems in general.

Data Warehouse Structure: Dimensional modeling is a method for designing database structures that support high-volume query access. Four key steps to construct a dimensional model: classify entities (transactions capture the details of an event; components represent the who, what, when, where, how, and why of the event; classifications are related to component entities via one or more one-to-many relationships); identify hierarchies; create the dimensional model (collapse hierarchies, aggregate measures); evaluate and refine the model.

Star Schema: Consists of a fact table connected to several dimension tables. The fact table includes one tuple per transaction or event with measurement data for the event; it includes foreign keys that link to the dimensions via 1:N relationships, and its primary key is the combination of all of these foreign keys. Measurement data can be additive, semi-additive, or non-additive. Dimension tables provide details about the facts in the fact table (e.g. time, store), provide the criteria for aggregating measurement data, and typically have fewer rows than fact tables. Snowflake schemas are similar to star schemas except that the dimensions are normalized; this can slow report generation because more JOIN operations are needed to build reports, and it is used when dimension tables grow too large and when most queries do not use data in the outer-level dimensions. A fact constellation includes more than one fact table and is sometimes referred to as a galaxy schema or as a collection of star schemas; it is representative of real-world data warehouses, and blended variants of star, snowflake, and constellation schemas are common.
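As a minimal sketch of a star schema (all table and column names invented), the following SQLite DDL shows surrogate integer keys in the dimension tables and a fact table whose primary key is the combination of its dimension foreign keys.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key  INTEGER PRIMARY KEY,   -- surrogate key
            full_date TEXT,
            season    TEXT                   -- attribute not supported by SQL date functions
        );
        CREATE TABLE dim_store (
            store_key        INTEGER PRIMARY KEY,  -- surrogate key
            store_natural_id TEXT,                 -- natural key kept as an attribute
            city             TEXT
        );
        CREATE TABLE fact_sales (
            date_key     INTEGER REFERENCES dim_date(date_key),
            store_key    INTEGER REFERENCES dim_store(store_key),
            sales_amount REAL,               -- additive measurement data
            PRIMARY KEY (date_key, store_key)
        );
    """)
    conn.close()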
Surrogate Keys: Dimension tables typically rely on meaningless integers as primary keys rather than the natural keys that are used in the originating business systems. Natural keys reflect the current business setting and requirements and can change when business circumstances change, for example when a merger or acquisition takes place. Natural keys are typically larger in size, so they consume more storage space in the fact table, which leads to larger indexes and reduced performance. Natural keys might also be re-used over longer periods of time; in such cases natural keys cannot be used as primary keys in a data warehouse environment. Natural keys are retained as attributes in the dimension tables to maintain links with the operational systems, and dimension tables have far fewer tuples than fact tables.

Fact Table Granularity: Granularity is the level of detail provided in one row of the fact table; higher granularity implies more rows. Trade-offs exist between the detail available for analysis, storage requirements, and query performance. The time dimension presents particularly noteworthy challenges, since storage and processing needs can grow quite rapidly with higher granularity.

Factless Fact Tables: A factless fact table contains only foreign keys and no measurement data; it is used to track events such as attendance or equipment failures. Aggregations can be used on factless fact tables to determine, for example, average attendance and maximum attendance.

Optimizing Dimension Tables: Dimension table attributes are typically used as selection criteria in queries and should be heavily indexed to improve query execution time. An excessive number of dimensions (>25) suggests a need to combine some dimensions; a "junk" dimension can be created for low-cardinality attributes that includes one tuple for each unique combination of these attributes. The time dimension can be populated in advance and often includes date attributes not supported by SQL date functions, such as seasons and holidays.

Outrigger Tables: Used to store sets of attribute types from a dimension table that are highly correlated, low in cardinality, and updated at the same time, for example demographic data about the regions where customers are located that is obtained from a third-party organization. An outrigger table is connected to the dimension table via a 1:N relationship. Outrigger tables reduce data duplication and facilitate updates, but they reduce performance as a consequence of the need for extra JOIN operations.

Slowly Changing Dimensions: Dimensions that change slowly and irregularly over time, such as a customer segment in a customer table that is updated once per year. Four distinct approaches to handling such changes: 1. Overwrite the previous data: suitable for correcting errors and eliminating data that is no longer relevant, such as old phone numbers, but not suitable when historical values are relevant to analysis. 2. Duplicate the existing tuple and add StartDate, EndDate, and CurrentFlag attribute types: a new surrogate key is created while the natural key is retained in both tuples, and fact tables use the surrogate keys to ensure that facts are associated with the correct versions of the dimensions; suitable for dimensions that are relatively small and change infrequently. 3. Add a new attribute type to the dimension table: retains only partial historical information; a date attribute can be included to indicate the most recent change. 4. Create a history table for the dimension, linked to the fact table using the surrogate key.

Rapidly Changing Dimensions: Some dimensions contain data that changes regularly over time. It is still valid to overwrite previous data or to add a new attribute type to the dimension table when either of these approaches is suitable; in contrast, approaches that seek to record historical data will generate large numbers of rows with relatively few changes. An alternative is to split the dimension into two dimensions: the first contains the stable information (the original dimension) and the second contains the information that changes regularly (the rapidly changing dimension). One cannot simply add the surrogate key of the rapidly changing dimension as a foreign key in the original dimension, because that requires the creation of a new original-dimension row whenever a row is added to the rapidly changing dimension, which defeats the purpose of the split. Surrogate keys for both dimensions can instead be included in the fact table; adding a foreign key that links the rapidly changing dimension to the original dimension permits identification of the current data, but then the rapidly changing data can only be updated when a new fact is added to the fact table. An alternative is to create a junction table.
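A short sketch of approach 2 for a slowly changing dimension, using sqlite3 and an invented dim_customer table: the current row is expired and a new version is inserted under a new surrogate key, while the natural key is kept in both tuples.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("""CREATE TABLE dim_customer (
                       customer_key INTEGER PRIMARY KEY,  -- surrogate key
                       customer_id  TEXT,                 -- natural key, kept in every version
                       segment      TEXT,
                       start_date   TEXT,
                       end_date     TEXT,
                       current_flag INTEGER)""")
    cur.execute("INSERT INTO dim_customer VALUES (1, 'C042', 'silver', '2023-01-01', NULL, 1)")

    # The customer moves to the 'gold' segment on 2024-06-01:
    # 1) close out the currently valid version of the row,
    cur.execute("""UPDATE dim_customer
                   SET end_date = '2024-05-31', current_flag = 0
                   WHERE customer_id = 'C042' AND current_flag = 1""")
    # 2) insert a new version under a new surrogate key; new facts reference key 2.
    cur.execute("INSERT INTO dim_customer VALUES (2, 'C042', 'gold', '2024-06-01', NULL, 1)")

    for row in cur.execute("SELECT * FROM dim_customer ORDER BY customer_key"):
        print(row)
    conn.close()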
NoSQL DATABASES

As data volumes or the number of parallel transactions increase, capacity can be increased by vertical scaling (extending the storage capacity and/or CPU power of the database server) or horizontal scaling (multiple DBMS servers arranged in a cluster). RDBMSs are not good at extensive horizontal scaling, because their focus on consistency and rigid database schemas creates coordination overhead. Other types of DBMSs are needed for situations with massive data volumes and flexible data structures, and where scalability and availability are more important. NoSQL databases: the term describes databases that store and manipulate data in formats other than tabular relations, i.e. non-relational databases. NoSQL databases aim at near-linear horizontal scalability by distributing data over a cluster of database nodes for the sake of performance as well as availability. Eventual consistency: the data (and its replicas) will become consistent at some point in time after each transaction.

Key-Value Stores: A key-value based database stores data as (key, value) pairs in a hash map, hash table, or dictionary; keys are unique. Keys are hashed by means of a so-called hash function, which takes an arbitrary value of arbitrary size and maps it to a key with a fixed size, called the hash value; each hash can be mapped to a space in computer memory. NoSQL databases are built with horizontal scalability support in mind and distribute the hash table over different locations. Assume we need to spread our hashes over three servers: hash every key ("wilfried", "seppe") to a server identifier, index(hash) = mod(hash, nrServers) + 1.
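A minimal Python sketch of that key-to-server assignment (index(hash) = mod(hash, nrServers) + 1); the hash function choice is just an example, and real systems typically use consistent hashing so that adding or removing nodes moves fewer keys.

    import hashlib

    NR_SERVERS = 3

    def server_for(key: str) -> int:
        # Hash the arbitrary-size key to a fixed-size value, then map the
        # hash value onto one of the available servers.
        h = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)
        return h % NR_SERVERS + 1

    for key in ("wilfried", "seppe"):
        print(key, "-> server", server_for(key))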
Request Coordination: In many NoSQL implementations (e.g. Cassandra, Google's BigTable, Amazon's DynamoDB), all nodes implement the same functionality and are all able to perform the role of request coordinator. This creates the need for a membership protocol: dissemination, based on periodic, pairwise communication, and failure detection.

Eventual Consistency: The membership protocol does not guarantee that every node is aware of every other node at all times, but it will reach a consistent state over time. The state of the network might not be perfectly consistent at any moment in time, though it will become eventually consistent at a future point in time; many NoSQL databases guarantee such eventual consistency, and most follow the BASE principle. The CAP theorem states that a distributed computer system cannot guarantee the following three properties at the same time: Consistency (all nodes see the same data at the same time), Availability (every request receives a response indicating a success or failure result), and Partition tolerance (the system continues to work even if nodes go down or are added). Most NoSQL databases sacrifice the consistency part of CAP in their setup, instead striving for eventual consistency. The full BASE acronym stands for: Basically available (NoSQL databases adhere to the availability guarantee of the CAP theorem), Soft state (the system can change over time, even without receiving input), and Eventual consistency (the system will become consistent over time). The operation that repartitions hashes over the nodes when nodes are added or removed is called stabilization.

Integrity Constraints and Querying: Key-value stores represent a very diverse range of systems, from full-blown DBMSs to caches. Only limited query facilities are offered (e.g. put and get), and there are limited to no means to enforce structural constraints: the DBMS remains agnostic to the internal structure, and no relationships, referential integrity constraints, or database schema can be defined. A tuple store is similar to a key-value store, with the difference that it does not store pairwise combinations of a key and a value, but instead stores a unique key together with a vector of data; there is no requirement for the vectors to have the same length or semantic ordering (schema-less!). Various NoSQL implementations do, however, permit organizing entries into semantic groups (aka collections or tables).

Document stores store a collection of attributes that are labeled and unordered, representing items that are semi-structured; most modern NoSQL databases choose to represent documents using JSON. Items with keys: most NoSQL document stores will allow you to store items in tables (collections) in a schema-less manner, but will enforce that a primary key be specified (e.g. Amazon's DynamoDB, MongoDB's _id); the primary key is used as a partitioning key to create a hash and determine where the data will be stored. Document stores provide filters and queries, and complex queries and aggregation are supported with MapReduce. SQL after all: GROUP BY style SQL queries are convertible to an equivalent map-reduce pipeline, and many document store implementations express queries using an SQL interface; Couchbase, for example, also allows foreign keys to be defined and join operations to be performed.
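A small pure-Python sketch of how a GROUP BY style aggregation maps onto a map-reduce pipeline; the documents and field names are invented, and the equivalent SQL would be SELECT genre, SUM(price) FROM books GROUP BY genre.

    from collections import defaultdict

    books = [
        {"title": "A", "genre": "scifi",   "price": 10},
        {"title": "B", "genre": "scifi",   "price": 15},
        {"title": "C", "genre": "mystery", "price": 8},
    ]

    # Map phase: emit one (key, value) pair per document.
    mapped = [(doc["genre"], doc["price"]) for doc in books]

    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: aggregate each group to a single result.
    totals = {key: sum(values) for key, values in groups.items()}
    print(totals)   # {'scifi': 25, 'mystery': 8}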
Many RDBMS vendors are, in turn, starting to implement NoSQL features: focusing on horizontal scalability and distributed querying, dropping schema requirements, supporting nested data types or allowing JSON to be stored directly in tables, supporting Map-Reduce operations, and supporting special data types such as geospatial data.

Column-oriented DB: A column-oriented DBMS is a database management system that stores data tables as sections of columns of data. It is useful if aggregates are regularly computed over large numbers of similar data items or if the data is sparse, i.e. columns with many null values; it can also be an RDBMS, key-value, or document store. Row-based databases are not efficient at performing operations that apply to the entire data set and need indexes, which add overhead. In a column store, all values of a column are placed together on disk, and a column matches the structure of a normal index in a row-based system: operations such as "find all records with price equal to 10" can be executed directly, and null values do not take up storage space anymore. Disadvantages: retrieving all attributes pertaining to a single entity becomes less efficient, and join operations will be slowed down.

Graph-based DB: Graph databases apply graph theory to the storage of records; graphs consist of nodes and edges. One-to-one, one-to-many, and many-to-many structures can easily be modeled in a graph and queried, for example using the Cypher query language from Neo4j. A graph database is a hyper-relational database, where JOIN tables are replaced by more interesting and semantically meaningful relationships that can be navigated and/or queried using graph traversal based on graph pattern matching. Examples: location-based services, recommender systems, social media, knowledge-based systems.

Other NoSQL categories: XML databases, OO databases, database systems to deal with time series and streaming events, database systems to store and query geospatial data, and database systems such as BayesDB which let users query the probable implications of their data.

Evaluation: NoSQL databases have not yet proven their true worth and are harder to learn and use. NoSQL vendors are starting to focus again on robustness and durability, whereas RDBMS vendors are starting to implement features to build schema-free, scalable data stores. NewSQL systems blend the scalable performance and flexibility of NoSQL systems with the robustness guarantees of a traditional RDBMS.