
Native JSON Datatype Support: Maturing SQL and NoSQL convergence in Oracle Database


Zhen Hua Liu, Beda Hammerschmidt, Doug McMahon, Hui Chang, Ying Lu, Josh Spiegel,
Alfonso Colunga Sosa, Srikrishnan Suresh, Geeta Arora, Vikas Arora
Oracle Corporation
Redwood Shores, California, USA
{zhen.liu, beda.hammerschmidt, doug.mcmahon, hui.x.zhang, ying.lu, josh.spiegel,
alfonso.colunga, srikrishnan.s.suresh, geeta.arora, vikas.arora}@oracle.com

ABSTRACT

Both RDBMS and NoSQL database vendors have added varying degrees of support for storing and processing JSON data. Some vendors store JSON directly as text while others add new JSON type systems backed by binary encoding formats. The latter option is increasingly popular as it enables richer type systems and efficient query processing. In this paper, we present our new native JSON datatype and how it is fully integrated with the Oracle Database ecosystem to transform Oracle Database into a mature platform for serving both SQL and NoSQL style access paradigms. We show how our uniquely designed Oracle binary JSON format (OSON) is able to speed up both OLAP and OLTP workloads over JSON documents.

PVLDB Reference Format:
Z. Hua Liu et al. Native JSON Datatype Support: Maturing SQL and NoSQL convergence in Oracle Database. PVLDB, 13(12): 3059-3071, 2020.
DOI: https://doi.org/10.14778/3415478.3415534

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 12
ISSN 2150-8097.
DOI: https://doi.org/10.14778/3415478.3415534

1. INTRODUCTION

JSON has a number of benefits that have contributed to its growth in popularity among database vendors. It offers a schema-flexible data model where consuming applications can evolve to store new attributes without having to modify an underlying schema. Complex objects with nested master-detail relationships can be stored within a single document, enabling efficient storage and retrieval without requiring joins. Further, JSON is human readable, fully self-contained, and easily consumed by popular programming languages such as JavaScript, Python, and Java. As a result, JSON is popular for a broad variety of use cases, including data exchange, online transaction processing, and online data analytics.

OLTP for JSON: NoSQL vendors, such as MongoDB [11] and Couchbase [4], provide JSON document storage coupled with simple NoSQL-style APIs to enable a lightweight, agile development model that contrasts with the classic schema-rigid SQL approach over relational data. These operational stores provide create, read, update, and delete (CRUD) operations over collections of schema-flexible document entities. This contrasts with traditional relational databases, which support similar operations but over structured rows in a table. However, over the past decade, many relational database vendors, such as Oracle [29], Microsoft SQL Server [10], MySQL [12], and PostgreSQL [16], have added support for storing JSON documents to enable schema-flexible operational storage.

OLAP for JSON: Both SQL and NoSQL databases have added support for real-time analytics over collections of JSON documents [4, 16, 15]. In general, analytics require expressive and performant query capabilities, including full-text search and schema inference. SQL vendors such as Oracle [28] are able to automatically derive structured views from JSON collections to leverage existing SQL analytics over JSON. The SQL/JSON 2016 standard [21] provides a comprehensive SQL/JSON path language for sophisticated queries over JSON documents. NoSQL users leverage the Elastic Search API [8] for full-text search over JSON documents as a basis for analytics. All of these have created online analytical processing over JSON similar to classical OLAP over relational data.

While well suited for data exchange, JSON text is not an ideal storage format for query processing. Using JSON text storage in a database requires expensive text processing each time a document is read by a query or is updated by a DML statement. Binary encodings of JSON such as BSON [2] are increasingly popular among database vendors. Both MySQL [12] and PostgreSQL [16] have their own binary JSON formats and have cited the benefits of binary JSON for query processing. Oracle's in-memory JSON feature, which loads and scans Oracle binary JSON (OSON) in memory, has shown better query performance compared with JSON text [28]. In addition to better query performance, binary formats allow the primitive type system to be extended beyond the set supported by JSON text (strings, numbers, and booleans).

Supporting a binary JSON format only to enable efficient query processing and richer types is not enough for OLTP use cases. In such cases, it is critical that applications can also efficiently create, read, and update documents. Efficient updates over JSON are especially challenging, and most vendors resort to replacing the entire document for each update, even when only a small portion of the document has actually changed. Compare this to update operations over relational data, where each column can be modified independently. Ideally, updates to JSON documents should be equally granular and support partial updates in a piecewise manner. Updating a single attribute in a large JSON document should not require rewriting the entire document.
In this paper, we describe the native JSON datatype in Oracle
Database and how it is designed to support the efficient query,

update, ingestion, and retrieval of documents for both OLTP and OLAP workloads over JSON. We show how fine-grained updates are expressed using the new JSON_TRANSFORM() operator and how the underlying OSON binary format is capable of supporting these updates without full document replacement. This results in update performance improvements for medium to large JSON documents.

We will show how data ingestion and retrieval rates are improved by keeping OSON as the network exchange format and adding native OSON support to existing client drivers. These drivers leverage the inherent read-friendly nature of the format to provide "in-place", efficient, random access to the document without requiring conversions to intermediate formats on the server or client. OSON values are read by client drivers using convenient object-model interfaces without having to first materialize the values to in-memory data structures such as hash tables and arrays. This, coupled with the natural compression of the format, results in a significant improvement in throughput and latency for simple reads. We will show how ingestion rates are not hindered by the added cost of client document encoding but instead tend to benefit from reduced I/O costs due to compression.

In this paper, we also present the set of design principles and techniques used to support the JSON datatype in the Oracle Database ecosystem. The design is driven by a variety of customer use cases: pure JSON document storage use cases that process both OLTP (put/get/query/modify) and OLAP (ad-hoc query reporting, full-text search) operations; hybrid use cases where JSON is stored alongside relational data to support flexible fields within a classic relational schema; JSON generation use cases from relational data via SQL/JSON functions; and JSON shredding use cases where JSON is shredded into relational tables or materialized views. Both horizontal scaling via Oracle data sharding and vertical scaling via Oracle Exadata and the In-Memory store have been leveraged to support all these cases efficiently.

The main contributions of this paper are:

1. The OSON binary format to support the efficient query, update, ingestion, and retrieval of JSON documents. To the best of our knowledge, OSON is the first binary JSON format that supports general piecewise updates and efficient in-place server and client-side navigation without sacrificing schema flexibility. The novel design enables queries and updates to be done in logarithmic rather than linear running time.

2. The JSON_TRANSFORM() operator, which provides declarative partial updates over JSON documents in a way that is amenable to efficient piecewise evaluation over OSON.

3. Integration of the JSON datatype with all the salient features of Oracle Database to achieve high performance for both OLTP and OLAP workloads. In particular, the in-memory path-value index format and inverted keyword hash index format for in-memory JSON_EXISTS() and JSON_TEXTCONTAINS() predicate evaluation for OLAP are novel.

4. An extensive performance study of the benefits of using OSON storage over JSON text for both server and client.

The rest of the paper is organized as follows. Section 2 gives an overview of JSON datatype functionality. Section 3 describes its design. Section 4 is on support of JSON OLTP and OLAP workloads. Section 5 is on performance experiments. Section 6 is on related work, and Section 7 is on future work. Section 8 concludes, with acknowledgments in Section 9.

2. JSON DATATYPE FUNCTIONALITY

2.1 SQL/JSON 2016

The SQL/JSON 2016 standard [21] defines a set of SQL/JSON operators and table functions to query JSON text and to generate JSON text using VARCHAR2/CLOB/BLOB as the underlying storage. JSON_VALUE() selects a scalar JSON value using a path expression and produces it as a SQL scalar. JSON_QUERY() selects a nested JSON object or array using a path expression and returns it as JSON text. JSON_TABLE() is a table function used in the SQL FROM clause to project a set of rows out of a JSON object based on multiple path expressions that identify rows and columns. JSON_EXISTS() is used in boolean contexts, such as the SQL WHERE clause, to test if a JSON document matches certain criteria expressed using a path expression. These JSON query operators accept SQL/JSON path expressions that are used to select values from within a document. The SQL/JSON path language is similar to XPath and uses path steps to navigate the document tree of objects, arrays, and scalar values. Each step in a path may optionally include predicates over the values being selected. Like XPath, SQL/JSON path leverages a sequence data model, and the intermediate result of any SQL/JSON path expression is a sequence of JSON values (objects, arrays, scalars). While the mechanics of SQL/JSON path follow XPath, the syntax is more similar to JavaScript.
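As a minimal sketch (not one of the paper's numbered examples), the following exercises all four operators against the purchaseOrder table defined later in Figure 1 (D1); the paths and the price threshold are illustrative:

  SELECT JSON_VALUE(jdoc, '$.purchaseOrder.podate'),   -- scalar as a SQL value
         JSON_QUERY(jdoc, '$.purchaseOrder.items')     -- nested array as JSON
  FROM   purchaseOrder
  WHERE  JSON_EXISTS(jdoc, '$.purchaseOrder.items?(@.price > 100)');

  -- JSON_TABLE() projects one relational row per array element:
  SELECT jt.name, jt.price
  FROM   purchaseOrder,
         JSON_TABLE(jdoc, '$.purchaseOrder.items[*]'
           COLUMNS (name  VARCHAR2(32) PATH '$.name',
                    price NUMBER       PATH '$.price')) jt;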
2.2 JSON Datatype

In Oracle Database 20c, the JSON type can be used to store JSON data instead of VARCHAR2/CLOB/BLOB. The JSON type data model is closely aligned with JSON text and includes objects, arrays, strings, numbers, true, false, and null. But like other binary JSON formats [2], the data model is also extended with SQL primitive types for packed decimal, IEEE float/double, dates, timestamps, time intervals, and raw values. We refer to this logical data model as the JSON Document Object Model (JDOM). The OSON binary format for the JSON datatype is a serialization of a JDOM.

SQL/JSON 2016 supports type-casting item functions, such as .number(), .string(), .date(), .binary(), etc., that can cast strings to non-string built-in datatypes.
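As a small illustration (a sketch using paths from Figure 1's document), an item function such as .number() casts inside the path expression, while RETURNING specifies the SQL type of the result:

  SELECT JSON_VALUE(jdoc,
           '$.purchaseOrder.shippingAddress.zip.number()'),
         JSON_VALUE(jdoc,
           '$.purchaseOrder.podate' RETURNING DATE)
  FROM   purchaseOrder;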
The JSON datatype can be used as the type of a table column, view column, parameter, return value, or local variable in SQL and PL/SQL functions. The SQL/JSON operator JSON_QUERY() returns the JSON datatype by default. JSON_TABLE() can return the JSON datatype as a projected column datatype. SQL/JSON generation functions can return the JSON datatype. All of these overcome a limitation of the JSON-text pseudotype based on the IS JSON constraint, which in some cases may lose type information between SQL operations. Implicit conversion between the JSON datatype and JSON text is supported by the SQL compiler.
Figure 1 shows a set of SQL/JSON queries and DML statements over a purchaseOrder table whose DDL definition is shown as D1. The JSON type column jdoc stores a purchase order JSON document. DML statement I1 shows an example of a JSON document representing a purchase order being inserted into the purchaseOrder table. In I1, the compiler will implicitly wrap the JSON string within a JSON() constructor that encodes JSON text to OSON binary during insertion.
D1  CREATE TABLE purchaseOrder
      (did NUMBER PRIMARY KEY, jdoc JSON)

I1  INSERT INTO purchaseOrder
    VALUES (1, '{"purchaseOrder": {
      "podate": "2015-06-03",
      "shippingAddress": {"street": "3467 35th Ave",
        "city": "Berkeley", "state": "CA", "zip": 94612},
      "comments": "Discounted sales Independence Day",
      "sparse_id": "CDEG35",
      "items": [
        {"name": "TV", "price": 345.55, "quantity": 2,
         "parts": [
           {"partName": "remoteCon", "partQuantity": 1},
           {"partName": "antenna", "partQuantity": 2}]},
        {"name": "PC", "price": 446.78, "quantity": 10,
         "parts": [
           {"partName": "mouse", "partQuantity": 2},
           {"partName": "keyboard", "partQuantity": 1}]}
      ]}}')

Q1  SELECT did,
      po.jdoc.purchaseOrder.podate.date(),
      po.jdoc.purchaseOrder.shippingAddress,
      po.jdoc.purchaseOrder.items[*].count(),
      po.jdoc.purchaseOrder.items[1]
    FROM purchaseOrder po
    WHERE po.jdoc.purchaseOrder.podate.date() =
            TO_DATE('2015-06-03', 'YYYY-MM-DD') AND
          po.jdoc.purchaseOrder.shippingAddress.zip.number()
            BETWEEN 94610 AND 94620

Q2  SELECT did,
      JSON_QUERY(jdoc, '$.purchaseOrder.items?(@.price > 300)'),
      JSON_VALUE(jdoc,
        '$.purchaseOrder?(exists(@..parts?(@.partName ==
         "mouse" && @.partQuantity >= 2))).podate')
    FROM purchaseOrder po
    WHERE
      JSON_EXISTS(jdoc,
        '$.purchaseOrder.sparse_id?(@ == "CDEG35")') AND
      JSON_EXISTS(jdoc,
        '$.purchaseOrder.items?(@.name == "TV" &&
         @.parts.partQuantity >= 2)')

Q3  SELECT po.did, jt.*
    FROM purchaseOrder po, JSON_TABLE (jdoc
      COLUMNS (
        poid   NUMBER PATH '$.purchaseOrder.id',
        podate DATE   PATH '$.purchaseOrder.podate',
        sp_id         PATH '$.purchaseOrder.sparse_id',
        NESTED '$.purchaseOrder.items[*]'
          COLUMNS (name, price NUMBER, quantity NUMBER,
            NESTED '$.parts[*]'
              COLUMNS (partName, partQuantity NUMBER)))) jt

Q4  SELECT jdoc
    FROM purchaseOrder
    WHERE JSON_EXISTS(jdoc,
      '$.purchaseOrder.items?(
         @.price > $price && @.quantity >= $qty &&
         (exists(@.parts?(
            @.partName == $pname &&
            @.partQuantity >= $pquantity))))'
      PASSING TO_NUMBER(:1) AS "price",
              TO_NUMBER(:2) AS "qty",
              :3 AS "pname",
              TO_NUMBER(:4) AS "pquantity")
    AND JSON_TEXTCONTAINS(jdoc,
          '$.purchaseOrder.comments',
          '{Independence} NEAR {discount}')

Q5  SELECT JSON {
      'name'  : li.itemName,
      'sales' : li.price * li.quantity
    }
    FROM lineItems_rel li

Q6  SELECT JSON {
      'id'     : po.Id,
      'poDate' : po.podate,
      'items'  : (SELECT JSON_ARRAYAGG(JSON {*})
                  FROM lineItems_rel E
                  WHERE E.fid_po = po.Id)
    }
    FROM PurchaseOrder_rel po

U1  UPDATE purchaseOrder po
    SET jdoc = JSON_TRANSFORM(jdoc,
      REPLACE '$.purchaseOrder.shippingAddress.city'
        = 'Oakland',
      REPLACE '$.purchaseOrder.shippingAddress.zip'
        = 94607,
      SET '$.purchaseOrder.contactPhone' =
        JSON('["(415)-667-8960","(510)332-8888"]'),
      REMOVE '$.purchaseOrder.sparse_id',
      APPEND '$.purchaseOrder.items' =
        JSON('{"items" : [{"name" : "iphone",
               "price" : 635.54, "quantity" : 2}]}'))
    WHERE po.jdoc.purchaseOrder.podate.date() =
          TO_DATE('2015-06-03')

Q7  SELECT did,
      JSON_QUERY(jdoc, '$.purchaseOrder.items?(@.price > 300)'),
      JSON_VALUE(jdoc,
        '$.purchaseOrder?(@..parts?(@.partName == "mouse"
         && @.partQuantity >= 2)).podate')
    FROM purchaseOrder po
    WHERE EXISTS(
      SELECT 1
      FROM MV_PO
      WHERE MV_PO.FID = po.did AND
            MV_PO.sp_id = 'CDEG35' AND
            MV_PO.name = 'TV' AND
            MV_PO.quantity >= 2)

Figure 1. Example SQL/JSON queries and DML statements
2.3 Simplified Syntax for SQL/JSON

Oracle Database provides a simplified syntax for querying JSON values as an alternative to calling the more expressive but verbose SQL operators such as JSON_QUERY() and JSON_VALUE(). A simple JSON path navigation without any predicates can be abbreviated using the dot notation, as shown in the select list of Q1. Trailing step functions such as number(), binary(), date(), and timestamp() can be used to specify the return type of the expression. For example, po.jdoc.purchaseOrder.podate.date() is internally translated into JSON_VALUE(po.jdoc, '$.purchaseOrder.podate' RETURNING DATE). The trailing step function casts the value to the corresponding SQL built-in type (in this case, DATE). Without a trailing step function, the return type of the expression is JSON. For example, po.jdoc.purchaseOrder.shipping is translated into JSON_QUERY(po.jdoc, '$.purchaseOrder.shipping' RETURNING JSON). Other types of trailing step functions are also supported. For example, po.jdoc.purchaseOrder.items[*].count() in Q1 illustrates a sequence item aggregation function that returns the total number of items in an array. This avoids using the general-purpose JSON_TABLE() operator for simple aggregations.

Examples Q5 and Q6 show the simplified syntax for JSON_OBJECT(). This JSON constructor syntax allows users to create new JSON objects and arrays using a syntax that resembles the JSON that the expression creates. In Q6, the expression JSON {*} automatically creates a JSON object representation of the relational row. There are other syntax simplifications for accessing JSON as well, all of which map to the core SQL/JSON operators [20, 21].

2.4 Updating with JSON_TRANSFORM

JSON_TRANSFORM() is a new SQL operator to declaratively apply a sequence of modification operations to JSON values selected by the SQL/JSON path language. Figure 2 shows the grammar for the JSON_TRANSFORM() operator. The INSERT operation inserts a new value into a JSON object or array. The REPLACE operation replaces an existing value with a new value. The APPEND operation appends a new value to an array. The SET operation either replaces an existing value with a new value or adds a new value if the original one does not exist. The REMOVE operation removes values. U1 in Figure 1 is an example of a SQL UPDATE statement that uses JSON_TRANSFORM to modify a JSON column value. Logically, the update is a full replacement of the existing value with the modified value. Internally, however, the update is applied piecewise without materializing the new and old value at once (see Section 3.4). JSON_TRANSFORM can also be used in the select list of a query to perform a transformation on a JSON value without changing the persistent storage. For example, JSON_TRANSFORM could be used to redact a social security number using a REMOVE operation before sending the document to the client. The KEEP operation is the inverse of REMOVE: only values identified by path expressions are kept and all others are removed.

'JSON_TRANSFORM' '('
    input_expr ',' operation (',' operation)*
    JSON_TRANSFORM_returning_clause?
    JSON_passing_clause?
')'

operation := (removeOp | insertOp | replaceOp |
              appendOp | setOp | renameOp | keepOp)

removeOp := 'REMOVE' pathExpr
    (('IGNORE' | 'ERROR') 'ON' 'MISSING')?

insertOp := 'INSERT' pathExpr '=' rhsExpr
    (('REPLACE' | 'IGNORE' | 'ERROR') 'ON' 'EXISTING')?
    (('NULL' | 'IGNORE' | 'ERROR' | 'REMOVE') 'ON' 'NULL')?

replaceOp := 'REPLACE' pathExpr '=' rhsExpr
    (('CREATE' | 'IGNORE' | 'ERROR') 'ON' 'MISSING')?
    (('NULL' | 'IGNORE' | 'ERROR' | 'REMOVE') 'ON' 'NULL')?

appendOp := 'APPEND' pathExpr '=' rhsExpr
    (('CREATE' | 'IGNORE' | 'ERROR') 'ON' 'MISSING')?
    (('NULL' | 'IGNORE' | 'ERROR') 'ON' 'NULL')?

setOp := 'SET' pathExpr '=' rhsExpr
    (('IGNORE' | 'ERROR' | 'REPLACE') 'ON' 'EXISTING')?
    (('CREATE' | 'IGNORE' | 'ERROR') 'ON' 'MISSING')?
    (('NULL' | 'IGNORE' | 'ERROR') 'ON' 'NULL')?

renameOp := 'RENAME' pathExpr 'WITH' stringLiteral
    (('IGNORE' | 'ERROR') 'ON' 'MISSING')?

keepOp := 'KEEP' pathExpr (('IGNORE' | 'ERROR') 'ON' 'MISSING')?
    (',' pathExpr (('IGNORE' | 'ERROR') 'ON' 'MISSING')?)*

rhsExpr := sqlExpr ('FORMAT JSON')?

Figure 2. JSON_TRANSFORM grammar
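For instance, the redaction use case mentioned above might look like the following sketch; the customers table and its ssn field are hypothetical:

  -- Redact a field on the way out; persistent storage is unchanged.
  -- IGNORE ON MISSING (per the grammar) tolerates documents that
  -- lack the field.
  SELECT JSON_TRANSFORM(jdoc,
           REMOVE '$.customer.ssn' IGNORE ON MISSING)
  FROM   customers;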
2.5 Client JSON Datatype Access

Both JDBC (Java Database Connectivity) and OCI (Oracle Call Interface) have been enhanced with OSON support. Using these libraries, applications can read, create, and modify JSON type values, convert values between JSON text and OSON, and store and retrieve values in the database. These libraries provide simple object-model access APIs to randomly navigate and access values within a document. For example, the package oracle.sql.json in JDBC implements the JSR 374/JSON-P interfaces (javax.json.*) over OSON and is compatible with JSR 367/JSON-B so that user domain objects can be mapped directly to and from JSON type values without incurring JSON text parsing or serialization costs.

  PreparedStatement select = con.prepareStatement(
      "SELECT jdoc FROM purchaseOrder WHERE did = 1");
  ResultSet rs = select.executeQuery();
  rs.next();
  JsonObject doc =
      rs.getObject(1, javax.json.JsonObject.class);
  System.out.println(doc.getString("sparse_id"));
  rs.close();

Figure 3. JSON datatype in JDBC

Figure 3 is an excerpt from a Java program that uses JDBC to retrieve a purchase order document from the server. In this case, the OSON bytes for the purchase order are transferred directly to the application and exposed to the developer using the standard javax.json.JsonObject interface. The value for sparse_id is read from the object in-place without processing or converting the rest of the document. This is discussed more in Sections 3 and 4.
oson_document := header dictionary tree_seg
                 (extended_tree_seg)
dictionary := sortedFieldHashes fieldNameOffsets
              fieldNames
tree_seg := value+
extended_tree_seg := tree_seg
value := object | array | scalar | forward_offset
object := OBJECT_CODE (fields | delegate_offset)
          offsetArray
fields := size sortedFidArray
array := ARR_CODE size offsetArray
scalar := scalarCode sqlScalar

Figure 4. OSON Binary Format (pseudo grammar)

2.6 Simple Oracle Document Access (SODA)

SODA [19] is Oracle's simple document access API for developers to use Oracle Database as a pure JSON document store. This API provides NoSQL-style access (CRUD) based on the document's ID, presenting a key/value model similar to other common document stores. With SODA, data is logically managed by an application using collections of simple documents rather than tables containing rows. However, SODA collections are still backed by internally managed relational tables having ID, JSON, and other metadata columns. With this model, applications can be written without using SQL, but SQL/JSON can still be leveraged over collection data for analytics and reporting if ever needed. More complex queries over JSON documents are specified as QBEs (Query By Example). QBEs are JSON documents themselves which are internally transformed into equivalent SQL/JSON queries over the underlying JSON datatype column of the table. SODA is available in all popular client programming languages: Java, Node.js, Python, C, and PL/SQL.

3. JSON DATATYPE DESIGN

3.1 JSON Datatype Derivation from BLOB

The JSON datatype is internally designed as a derivation of the BLOB datatype for storing OSON bytes. This simple approach enables Oracle to provide complete implementation support for the JSON datatype in every part of Oracle's ecosystem within a yearly release. However, this is all transparent to developers because the general LOB APIs, to read and write LOBs using offsets, cannot be used on the JSON type. One main advantage of being a SQL datatype is that SQL static typing can enforce desired type semantics over JSON. In contrast, with JSON text storage, developers must use the more complicated LOB API explicitly to access and modify JSON.

Classic BLOB was initially designed to store large binary objects and provide random access over any file or media (e.g. a music file). However, JSON documents for operational data are typically smaller (kilobytes to megabytes). To achieve optimal performance, we inline OSON values up to the database block size to avoid out-of-row BLOB access as much as possible. This allows SQL/JSON evaluation to directly access OSON bytes as if they were a normal relational RAW column inside a row. When an OSON value is larger than the block size, it is stored outside of the row using multiple data blocks managed by the Oracle SecureFiles i-node infrastructure [13]. Data blocks for OSON BLOB storage are lazily read and cached in the buffer cache based on tree navigation patterns instead of linearly reading everything. For a large OSON document, we only need to read the OSON data blocks that are relevant to answer the path query.

For a BLOB storing OSON bytes for the JSON datatype, we applied value-based LOB semantics on both the server and client. A large client-side prefetch buffer is auto-configured for OSON BLOBs. The lifecycle of an OSON BLOB is restricted to the cursor fetch duration in order to prevent resource leakage.

[Figure 5 (image not reproduced): an example OSON layout, before and after a partial update.]
Figure 5. OSON example

3.2 OSON Binary Format Design

The following section describes the OSON binary format and its characteristics. Space precludes giving a full formal definition of OSON, but Figure 4 gives a pseudo-grammar that identifies the salient aspects of its structure. This section gives an overview of the structure and highlights its benefits.

3.2.1 OSON Structure

An OSON image consists of a header, a dictionary, and a value segment. The header gives a fixed signature specific to OSON and records various aspects about the image that are needed by a reader, such as the size of byte offsets used within the document.

The dictionary contains the set of distinct field names used within all objects in the document. Specifically, it first stores a sorted array of hash codes for each distinct key (sortedFieldHashes). This array is followed by a second array of equal length (fieldNameOffsets) that stores the corresponding offset of each key string. These offsets reference a third array (fieldNames) that contains the actual field strings. The unique field ID of a given key is its ordinal position within fieldNameOffsets.

The tree value segment follows the dictionary and encodes the objects, arrays, and primitive values used within the document. Objects are encoded by a set of size-prefixed parallel arrays that encode the entries in the object. The first array (sortedFidArray) gives the sorted field IDs of the field names in the object. The second array (offsetArray) gives the corresponding offsets to values in the value segment. Arrays are encoded by a single size-prefixed offset array containing the offsets of the values in the array. The upper part of Figure 5 shows an example of the OSON layout without any partial updates. After a partial OSON update, an extended tree segment is appended, as shown in the bottom half of Figure 5. The update replaces the string 'CDEG4' with 'CDEG52'. Note that if 'CDEG4' is replaced by 'CDEG8', then a direct replacement is done without appending in the extended tree segment.
3.2.2 Self-contained and platform independent

Similar to JSON text and BSON, OSON does not depend on any external schema information and supports unencumbered schema evolution. This is a critical property for supporting distributed computing over the data without synchronizing on central schema access. Many database features, like partitioning, replication, sharding, import/export, and transportable tablespaces, require data to be self-contained and accessible on any platform without data conversion.

3.2.3 Compact

Unlike JSON text and BSON, OSON maintains a local dictionary in the header that encodes each distinct field name. For JSON documents having many repeating field names, due to arrays of identical JSON object structures or recursive JSON object structures, OSON is often much smaller than the equivalent UTF-8 encoded JSON text or BSON, since the repeated field names are replaced by small field IDs. Furthermore, multiple objects containing the same field names will only have their field ID array stored once. Objects can reference the offset of another object that shares the same structure (delegate_offset) instead of repeating the same field ID array. For example, in the purchaseOrder JSON document shown in Figure 1, there is an 'items' array, each element of which has an identical object structure of 'name', 'price', 'quantity', and 'parts' fields. Each 'parts' array has identical object structures of 'partName' and 'partQuantity' fields. These repeating field IDs are stored just once and reused by multiple objects. The field ID size is fixed within an OSON document as 1, 2, or 4 bytes, depending on how many distinct field names there are in the document. Commonly, documents have fewer than 256 distinct keys, and a 1-byte field ID is thus sufficient. Similarly, the size of the value offsets used to reference values will be either 2 or 4 bytes, depending on whether the total size of the OSON exceeds 64K. Both the field ID and offset sizes in effect are encoded in the OSON header.

3.2.4 Efficient in-place navigation

Tree navigation is performed in-place over the OSON bytes without first loading the data into other in-memory structures. A value offset is used as a direct pointer to the OSON bytes that encode that value. Searching for a key name within an object, or an array element by index, uses jump navigation to efficiently skip past irrelevant portions of the document. Specifically, when searching for a given key within an object, a binary search over sortedFidArray is performed to identify the offset in the corresponding offsetArray. This offset can be used to jump directly to the corresponding value. Likewise, elements within an array can be efficiently located at any position using the offsetArray. Both the server and client drivers make use of in-place jump navigation to support efficient data access without converting the OSON image to other data structures.

3.2.5 Efficient piece-wise updates

OSON supports partial updates so that changes can be made without replacing the entire document, as would be required for JSON text or BSON. Partial BSON update is limited to the case where the new content length exactly matches the old content length, because BSON uses fixed offsets throughout the document which would need to be recalculated after a size-changing update. Partial update for OSON can handle complex update operations. When the old content length is bigger than or equal to the new content length, it does an in-place update. When an in-place update is not feasible, the old value is tombstoned with the forwarding address (forward_offset) of the new content that is appended at the end of the original OSON bytes. A subsequent change of the same node with larger content creates a new forwarding address that is still stored in the original tombstone location to avoid forwarding-address chaining. See the bottom part of Figure 5. When accumulated appended pieces exceed a certain threshold relative to the original document size, the encoding is re-compacted and the resulting OSON bytes fully replace the old document. Partial update leverages the fact that OSON uses tree-offset-based jump navigation. To avoid a rebuild of the dictionary due to the insertion of new distinct field names, OSON supports partial dictionary rebuilding by tracking the dictionary codes that have actually changed due to the insertion of new distinct field names and then only patching those changed dictionary codes at the end. Compared with the OSON format described in [28], this OSON format is enhanced to support piece-wise update of OSON using the concept of forward offsets (forward_offset) and reduces OSON size for common JSON documents by allowing object encodings to share field structures (delegate_offset).

3.2.6 SQL scalar binary compatibility

All JSON scalar values use the same encoding as existing SQL built-in datatypes. For example, a JSON packed decimal is encoded using the same binary format as that of the NUMBER built-in type in Oracle SQL. This allows for efficient conversions and low impedance mismatch when working with JSON inside the database. JSON scalar content is mapped to relational columns and back without loss of fidelity.

3.3 Fast SQL/JSON Path Evaluation

Q1, Q2, and Q4 in Figure 1 show SQL/JSON path language expressions used in JSON_VALUE(), JSON_QUERY(), and JSON_EXISTS(). Notice that in addition to simple child steps, Oracle also supports a recursive descendant path step; see "..parts" in Q2 of Figure 1, which selects all entries for key "parts" within each descendant object of the current value. Q3 uses JSON_TABLE() to project a master-detail hierarchy within a document as a set of flat table rows. The path expressions used within these SQL/JSON operators are evaluated efficiently over the input OSON column value using in-place jump navigation. When compiling a path expression, a hash function is applied to the field names in the path to generate a corresponding hash code for each field name. These hash codes are stored in the compiled execution plan for the path expression. The OSON field name dictionary (sortedFieldHashes) is organized based on the sorted hash codes. At run time, the hash code from the path is first located within the OSON field name dictionary using a binary search to obtain the corresponding field ID within the document. If the field is not found in the dictionary, the field does not occur anywhere in the document. This is particularly helpful for heterogeneous JSON collections where there are many possible sparse optional fields. Searching for the non-existence of a field within JSON text or BSON requires a scan of the entire document to the end, whereas OSON only requires a binary search within the OSON field name dictionary. When the object field ID is found in the dictionary, it is then located in the field ID array of the current object, again using binary search, to get the corresponding child offset.
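This makes negative lookups cheap. In the sketch below (the discountCode field is hypothetical and assumed to be sparse), deciding that a document lacks the field costs one binary search in the dictionary rather than a scan of the whole document:

  SELECT COUNT(*)
  FROM   purchaseOrder
  WHERE  JSON_EXISTS(jdoc, '$.purchaseOrder.discountCode');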
Oracle's built-in SQL datatype formats are designed to support type-agnostic, byte-comparable range comparison. This property is exploited for range comparisons in SQL/JSON path expressions because the OSON scalar binary format is the same as that of the SQL built-in datatypes. Also, when the SQL return type used within JSON_VALUE() matches a JSON scalar, its bytes are copied out directly from the OSON bytes without any datatype conversion.

3.4 Fast JSON_TRANSFORM() Evaluation

Consider execution of the update statement U1 in Figure 1. Logically, the JSON_TRANSFORM() expression on the right-hand side of the SET clause is evaluated over the input JSON value, applying a sequence of modification operations to yield a new temporary document that replaces the JSON column value. However, replacing the entire document in this manner is inefficient, as it requires the database to generate a redo log entry proportional to the size of the document. Oracle Database optimizes U1 execution by applying piece-wise updates on the OSON storage without generating a temporary new OSON document to fully replace the previously stored OSON document. With this optimization, the transaction redo log size is usually proportional to the size of the change rather than the full document size. At runtime, JSON_TRANSFORM() evaluates the update over the OSON bytes using only three low-level update operations: length-preserving byte replacement, appending bytes at the end, and truncating bytes at the end. Typical OSON update operations only result in a few changed BLOB storage blocks instead of every BLOB storage block being modified. Therefore, update performance using JSON_TRANSFORM() is improved significantly. When many partial updates have accumulated on an OSON document, re-compaction of the OSON document to reclaim the space due to deletion occurs automatically. The partial BLOB update API is similar to the POSIX/UNIX file system API [13]. We note this partial OSON update technique is applicable to OSON stored in file systems as well.
4. JSON WORKLOADS

The traditional model of using separate databases for different types of workloads is increasingly breaking down in favor of so-called translytic databases that support both transactional and analytic workloads at once. We have designed the JSON type and the SQL/JSON facilities to support both types of workloads over JSON data.

4.1 OLTP Performance

JSON OLTP workloads require high-performance document retrieval (i.e. get() operations). A primary key index on a document identifier is required to efficiently retrieve a JSON document by ID lookup. For example, the DDL statement D1 in Figure 1 shows the primary key specification for column did of the purchaseOrder table. A functional index is needed to facilitate common secondary access paths. Efficient execution of Q1 in Figure 1 needs two functional indexes, on paths po.jdoc.purchaseOrder.podate.date() and po.jdoc.purchaseOrder.shippingAddress.zip.number(). B+-tree based functional indexes offer the best tradeoff to speed up document retrieval based on range or equality predicates while being both statement and transactionally consistent with respect to the underlying DML. DML statements, such as inserts, must compute the functional index expression to maintain the B+ tree for each operation. Since OSON uses jump navigation to evaluate SQL/JSON path expressions, it provides the best performance to compute the JSON_VALUE() expression during functional index maintenance.
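A sketch of the two functional indexes Q1 needs follows, spelled with the JSON_VALUE() form into which the dot notation translates; the index names are illustrative:

  CREATE INDEX po_podate_idx ON purchaseOrder
    (JSON_VALUE(jdoc, '$.purchaseOrder.podate' RETURNING DATE));

  CREATE INDEX po_zip_idx ON purchaseOrder
    (JSON_VALUE(jdoc,
       '$.purchaseOrder.shippingAddress.zip' RETURNING NUMBER));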
OSON also reduces transfer costs and enables efficient client-side access to documents. The OSON bytes for the retrieved document are transferred directly to the client application, avoiding any transformation or serialization costs on the server. The OSON bytes are also typically smaller than the corresponding JSON text or BSON value, which reduces both I/O and network transfer overhead. Like the server, the client application can read the OSON bytes using jump offset navigation, avoiding any transformation costs on the client as well. In contrast, for JSON text or BSON, the application must load the data into alternative data structures such as hash tables to enable efficient random access to the document. Some database vendors use binary JSON as a storage format but then serialize the value to JSON text before sending it to a client. These conversions can consume significant resources on the database server, especially in OLTP use cases where many clients are using the database at once.

OLTP workloads also require high-performance ingestion (i.e. put() operations). Using OSON reduces I/O and network transfer costs during an insert, since the OSON value is typically smaller than the corresponding JSON text. Client-side encoding of OSON also reduces server-side CPU usage during an insert, since the server can directly store the bytes received from the client, subject to OSON byte verification.

As discussed in Section 3.4, document updates are also important to OLTP workloads. OSON has an efficient partial update capability that typically results in a transaction redo-log size proportional to the small changes to the OSON instead of a full document replacement. Execution of U1 in Figure 1 uses a functional index to locate the document and then performs a partial update.

4.2 OLAP Performance

4.2.1 Incrementally Refreshable Materialized Views

Materialized views (MVs) have traditionally been used for materializing aggregations persistently. Depending on the complexity of the SQL expressions used in MVs, Oracle allows developers to control whether MVs are refreshed incrementally or fully, and the frequency of refresh: at the statement level, at the transaction level, or at scheduled intervals. MVs are typically used to speed up OLAP queries. In this paper, we have applied JSON_TABLE() MVs to speed up SQL/JSON OLAP queries. JSON_TABLE() queries, such as the one shown in Q3, can be used to define MVs for analytics. The view can be set to refresh at the statement or transaction commit level. This is feasible because the underlying JSON_TABLE() MV table stores the primary key of the JSON datatype column's original table as a foreign key. In this example, when a new JSON document is inserted into the purchaseOrder table, Oracle MV maintenance runs JSON_TABLE() over the new JSON document to return a set of relational rows tagged with the primary key of the new document and then inserts them into the MV table. JSON_TABLE() evaluation over OSON is more efficient than over JSON text for the reasons discussed in Section 3. When JSON documents are deleted from the purchaseOrder table, Oracle MV maintenance deletes those rows in the MV table whose foreign key values match the primary keys of the deleted rows.
An update of the JSON datatype column in the purchaseOrder table is handled as a deletion followed by an insertion in the MV maintenance layer. The JSON_TABLE() MV maintenance can be triggered at individual DML statement completion time. This statement-level consistency is semantically equivalent to that of index maintenance, during which a session can see its own change immediately without committing the transaction. The MV table update is rolled back if the transaction containing these DML statements is rolled back.

Due to the MV statement-level consistency semantics, JSON_TABLE() based MVs can be used to transparently rewrite not only queries that use JSON_TABLE() in the FROM clause but also queries that use JSON_EXISTS() as a predicate in the WHERE clause. For example, assume that MV_PO is an MV defined over the JSON_TABLE() query given in Q3. The query given in Q2 can be internally rewritten to the query given in Q7. Q7 uses MV_PO to help evaluate the query over the purchaseOrder table. The JSON_EXISTS() expressions in the original Q2 are transformed in Q7 into a semi-join using an EXISTS subquery. MV_PO.FID is the internal foreign key of the MV table that joins to the primary key did in the purchaseOrder table. Semi-joins like this one are amenable to query optimizations for efficient execution plans. For example, when MV_PO has secondary indexes on the sp_id, name, and quantity columns, or if the table is placed in the Oracle in-memory columnar store, significant performance improvement is achievable.
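A simplified sketch of such an MV over Q3's JSON_TABLE() query follows; the column list is abbreviated, and the ON STATEMENT refresh clause reflects the statement-level refresh described above:

  CREATE MATERIALIZED VIEW MV_PO
    REFRESH FAST ON STATEMENT
  AS
    SELECT po.did AS fid, jt.*
    FROM   purchaseOrder po,
           JSON_TABLE(po.jdoc COLUMNS (
             sp_id PATH '$.purchaseOrder.sparse_id',
             NESTED '$.purchaseOrder.items[*]'
               COLUMNS (name, quantity NUMBER))) jt;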

4.2.2 Parallel Query

SQL/JSON OLAP queries can be executed in parallel, with multiple slave processes executing the SQL/JSON operators used in the select, where, group by, and order by clauses. JSON_TABLE() is executed in the same slave process that reads the JSON column. SQL/JSON operators in the select list are pushed down to each slave process so that they are executed in parallel. Slave processes add intermediate results as temporary OSON BLOBs in the parallel table queue. The query coordinator processes the queue and ships the final results to the database client.
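As a minimal sketch, a parallel degree can be requested with a hint; the degree of 8 is illustrative, and the JSON_VALUE() expressions are then evaluated by the slave processes:

  SELECT /*+ PARALLEL(po, 8) */
         JSON_VALUE(po.jdoc,
           '$.purchaseOrder.shippingAddress.state'),
         COUNT(*)
  FROM   purchaseOrder po
  GROUP  BY JSON_VALUE(po.jdoc,
           '$.purchaseOrder.shippingAddress.state');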
4.2.3 In-memory Indexes

Although the JSON search index is powerful, it is based on disk structures that inevitably impact DML performance. The Oracle Database In-Memory store [22], however, is able to maintain an alternative format of the data for fast query without paying the disk I/O cost to persist it, while still providing transactional consistency.

The Oracle In-Memory store has an in-memory expression feature (IME) [1] that enables storing virtual column expressions in memory. JSON_VALUE() is used to shred the JSON into multiple columns and load them into an in-memory columnar format that is amenable to SIMD scans [28]. The JSON datatype is further integrated with IME as a memory pool to maintain a compact JSON search-index-based memory representation so that JSON_EXISTS() and JSON_TEXTCONTAINS() predicates can be evaluated in-memory. This is achieved by using both a path-value index format to support JSON_EXISTS() and an inverted keyword hash index format to support JSON_TEXTCONTAINS(). The Oracle In-Memory store assigns each row an ordinal number within a memory unit, and we use that ordinal number as the document ID (docid) for the keyword inverted list and path-value indexing. The hash index maps keywords into posting lists of the documents having that keyword, using a serialized in-memory hash table; a posting list is a bitmap of docids with document offsets. The path-value in-memory index maps a 16-byte hash ID for each unique path to a sorted list of distinct leaf scalar values, each of which maps to a bitmap of the docids having that scalar value.

If the in-memory index fails to load in an IME unit due to lack of memory, or if it cannot keep up with heavy DML rates, execution automatically falls back to using the OSON bytes for evaluation. Unlike disk-based indexes, the in-memory index maintenance does not slow down the main DML execution path because its population is triggered periodically and is done in the background.

4.2.4 Predicate Scans in Exadata Storage Cells

Oracle Exadata [14] is an engineered system for Oracle Database. A key aspect of the Exadata architecture is to separate compute and storage and push predicate evaluation down to the storage servers. SQL predicates evaluated during a table scan are pushed down and evaluated in the storage layer by parallel kernel threads. The storage layer returns matching rows. This improves the performance of scans by orders of magnitude by reducing the amount of data sent back to the compute nodes. JSON support takes full advantage of this by supporting SQL/JSON functional evaluation in the storage cells, so that SQL/JSON predicates, such as JSON_EXISTS() and JSON_VALUE() in WHERE clauses, are pushed down to the storage cell server for evaluation when indexing mechanisms on the compute node are unavailable or determined by the optimizer to be non-beneficial.

4.3 Sharding & Full Text Search

This section covers sharding and full-text search, which are common features of NoSQL databases and useful for improving both operational and analytical workloads.

4.3.1 Full-text Search

Oracle Database supports the operator JSON_TEXTCONTAINS(), usable in a WHERE clause, to facilitate JSON path-aware full-text search. For example, Q4 in Figure 1 uses JSON_TEXTCONTAINS() to perform the full-text search '{Independence} NEAR {discount}' under the SQL/JSON path $.purchaseOrder.comments. OLAP queries over JSON data typically involve ad-hoc path-value range searches, for which it is not feasible to create enough functional indexes or statement-refreshable MVs to capture all possible paths in a JSON datatype column. Therefore, the Oracle JSON search index is designed to index everything in a JSON datatype column, both full text and leaf scalar values. The SQL compiler detects all SQL/JSON operators (JSON_TEXTCONTAINS(), JSON_EXISTS(), JSON_VALUE()) in the WHERE clause over the same JSON datatype column and combines them into one tree-pattern search operator to be evaluated by the JSON search index.
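A sketch of creating such an index follows (the index name is illustrative; the SYNC parameter anticipates the commit-time/interval synchronization options discussed at the end of this section):

  CREATE SEARCH INDEX po_search_idx ON purchaseOrder (jdoc)
    FOR JSON
    PARAMETERS ('SYNC (ON COMMIT)');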
There are two components in the JSON search index. One component indexes keyword tokens of both scalar strings and the hierarchical tree structure. It uses an inverted index layout to map each keyword token to its document IDs and its containing hierarchical paths as intervals. The other component indexes leaf number and timestamp values along with their leading JSON paths. It has a B+-tree index over the combination of a 16-byte hash of each unique JSON path and its distinct leaf scalar values, for path-value range search. The posting lists are compact because they use ordinal numbers as document IDs and delta-compress them. Furthermore, ordinal-number-based document IDs also facilitate fast bitmap joins and pre-sorted merge joins when combining posting-list results from multiple full-text and path-value ranges.
Table 1. Document sizes

ID   UTF8 (b)    BSON (b)    OSON (b)    Dict.  vsUTF8  vsBSON   #Object  #Keys (total/distinct)  #Array  #String  #Number
D1   613         764         524         5%     0.9x    0.7x     20       33/5                    1       31       0
D2   1,782       1,813       1,950       30%    1.1x    1.1x     4        56/55                   0       31       21
D3   2,608       3,094       2,160       16%    0.8x    0.7x     26       100/32                  14      124      1
D4   2,943       3,293       2,476       6%     0.8x    0.8x     46       100/19                  14      87       0
D5   8,842       8,440       5,591       19%    0.6x    0.7x     38       307/74                  29      107      56
D6   40,285      37,526      20,486      18%    0.5x    0.5x     81       1,435/246               9       572      476
D7   76,861      75,195      38,383      11%    0.5x    0.5x     490      3,300/282               23      1,263    1,129
D8   141,051     133,307     103,897     0%     0.7x    0.8x     1,688    6,620/40                52      652      6,860
D9   682,228     No Data     483,053     0%     0.7x    No Data  9,727    36,778/42               42      666      40,375
D10  3,374,379   3,303,387   2,167,101   0%     0.6x    0.7x     14,712   112,356/90              12,738  43,927   27,618
D11  41,548,995  37,352,414  13,801,333  0%     0.3x    0.4x     100,141  1,839,847/73            1       100,143  901,263

The posting lists are not DML-friendly structures. To keep up with high DML rates, the index can be configured to synchronize on transaction commit or at certain intervals. It internally uses a log-structured merge mechanism to initially keep less-compacted posting lists tailored for fast DML and then gradually merge and compact multiple posting lists for the same key into read-friendly structures, using parallel group-by/aggregation query execution as the underlying map-reduce framework. As with functional indexes, OSON storage helps with indexing time, since it can be read in-place more efficiently.

4.3.2 Sharding

NoSQL databases typically use sharding to scale horizontally in a distributed environment. Data is partitioned across multiple machines based on a shard key, which may be the document ID, the document creation date, etc. Oracle Database also supports sharding stored JSON data, and each shard can have a locally defined JSON search index. NoSQL-style OLTP and OLAP operations, with shard key specifications, are routed to the corresponding shard server. NoSQL-style cross-shard OLAP queries are distributed to each shard server to be executed in parallel. SQL/JSON operators, including JSON_TABLE(), are pushed down to each shard server.
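As a sketch, a sharded variant of the purchaseOrder table could distribute documents by a consistent hash of the document ID; the tablespace set name is illustrative:

  CREATE SHARDED TABLE purchaseOrder
  ( did  NUMBER PRIMARY KEY,
    jdoc JSON )
  PARTITION BY CONSISTENT HASH (did)
  PARTITIONS AUTO
  TABLESPACE SET ts1;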
5. PERFORMANCE EXPERIMENTS

We compare the performance of JSON text storage in a BLOB with an 'IS JSON' check constraint to that of the JSON type backed by OSON storage. Specifically, we show that document retrieval, ingestion, queries, and updates all benefit from using OSON. For JSON text storage, the document is stripped of insignificant whitespace and encoded as UTF-8 to be as small as possible. The database character set is AL32UTF8. All experiments ran on a compute node in Oracle's US East (Ashburn) data center [17]. The compute node used was a VM.Standard2.24 [25] type virtual machine with the following characteristics: 24 OCPUs (2.0 GHz Intel Xeon Platinum 8167M), 320 GB of memory, and a 1 TB encrypted block volume (ext4) [3]. In Section 5.2, we use two of these compute nodes, one for the database and one for the Java application. Otherwise, the experiments are run on a single node. Figures presented in this section show ratios rather than absolute numbers. The purpose of these experiments is not to demonstrate absolute performance numbers, but rather to compare the ratios among different approaches to show what relative improvement can be achieved. Also, since we use a variety of document sizes, using ratios helps show all the data in a single figure/scale. Measurements are done after a warm-up period so that the working set of data is fully loaded in memory, to eliminate measuring physical I/O overhead.

5.1 Encoding Size

We compare the encoding sizes of JSON text, BSON, and OSON. Encoding size is an important metric, as it determines how much data is read and written to disk by the database as well as how much data is sent over the network to the client. We use a set of 11 JSON documents where 4 were taken from public sources and the rest were given to us by customers. Our goal here is to use only real-world, non-synthetic data that covers a variety of document sizes. Since we cannot disclose the specific customers and public sources used, we instead give each document a generic identifier (D1 - D11).

Table 1 summarizes the encoding sizes of the 11 documents we selected. The first column gives the identifier of the document, while the next three columns give the sizes of the JSON text, BSON, and OSON values. The Dict. column gives the percentage of space taken by the OSON dictionary. The vsUTF8 and vsBSON columns give the size of OSON relative to the JSON text and BSON encodings. The last five columns give the total number of objects, keys, arrays, strings, and numbers in each document. The #Keys column gives two values: the total number of keys in the document followed by the number of distinct keys.

In all cases except for D2, the size of the OSON encoding is smaller than the UTF-8, whitespace-stripped JSON text. D2 is a small, flat document with no repeating keys, short key names, and small numeric values. In this case, the op codes and jump offsets in OSON made it 10% bigger than the corresponding JSON text. In all other cases, the OSON encoding is smaller than the corresponding JSON text, especially for documents having arrays of object structures or recursive structures with many repeating keys. For example, the OSON encodings of D6 and D7 are half the size of the corresponding JSON text. D11 achieved the largest relative compression, at one third the size of the JSON text. D11 is a large document with 1.8 million keys but only 73 distinct keys. Thus, encoding these keys as numbers results in a large reduction in size. Similarly, the OSON encoding is typically significantly smaller than the corresponding BSON encoding for medium to large documents, because BSON does not encode key names as identifiers. For D9, the customer document
contained large numeric values that caused the 3rd party BSON absolute values since the insert rates for the smaller documents are
encoder we used to produce an error and we report "No Data". much greater than that of the larger documents. For D1 through
D4 the average writes per second is essentially the same showing
Just as JSON text and BSON, OSON is designed for OLTP use that the cost of encoding OSON at client side is not a significant
cases with OSON as an operating CRUD entity. Therefore, each
factor. Furthermore, for the remaining cases, the rate of insertion
OSON document is self-contained and it does not try to do is significantly higher for OSON due to OSON storage of having
common schema elements compression across document a smaller size, reducing disk I/O and data transfer costs. For
boundary as what relational storage would do by extracting and D11, the OSON is about a third of the size of the JSON text and
storing the common schema in central data dictionary. However, we consequently get about three times the insert throughput.
each OSON document contains common intra-document fields in
the dictionary. Table 1 shows the percentage of the dictionary size 5.2.2 Retrieval
relative to the whole OSON document in the Dict. column of In this experiment, we measure the read throughput of a client
Table 1. For small OSON documents, the OSON dictionary size Java application running on a compute node in the same data
represents average of 16% of the document. For medium to large center as the database node. The application contains 24 threads
OSON document, the OSON dictionary size is negligible. This and each thread repeatedly retrieves the same document from the
means even if we do schema inference to extract and store database and obtains a DOM API for the value. In the case of
common dictionary over a collection of OSON documents in a JSON text, we use a popular, 3rd party, JSON parser that we
central dictionary, its results in 15% of saving in exchange for found to be the most performant of several other Java JSON
schema-evolution and migration issues. This is why the design of parsers we tested. We use this parser to fully parse the JSON text
OSON using intra-document dictionary achieves the balance of and create an in-memory DOM for the value. In the case of
schema flexibility and storage space compaction for OLTP JSON OSON, we use DOM API implemented on top of our in-place
use case. client side OSON decoder.
On the other hand, for OLAP JSON use cases, Oracle [28] is able
to do common schema inference over JSON document collection 6.0x 5.1x
to derive JSON Data Guide to shred JSON collection as relational 4.0x
storage with columnar in-memory format whose size is similar 2.0x 1.8x 2.1x
1.5x 1.2x
closer to that of Dremel [18] and STREED [27] style of columnar 2.0x 1.0x 1.0x 1.1x 1.1x 1.1x
encoding. While these are efficient columnar encodings for OLAP 0.0x
queries through leaf scalar value dictionary encoding and SIMD D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11
scans [1,22], they lack of efficient random update capability at Figure 7. Relative full read throughput, text vs OSON
document level in comparison with OSON for efficient OLTP
JSON use case in the same way as that of row store and columnar Figure 7 shows the relative read throughput when the client
store of relational model for OLTP and OLAP use cases application fully iterates all of the values in the DOM. All JSON
respectively. strings are accessed as Java strings and all numbers are converted
5.2 Ingestion and Retrieval

5.2.1 Ingestion
In this experiment, we measure the insert throughput of a client Java application running on a compute node in the same data center as the database node. The application contains 24 threads, and each thread starts with a pre-parsed, in-memory object model representation of the given document. Each thread repeatedly inserts the document into the database using a SQL INSERT statement. In the case of JSON text, the document object is converted to UTF-8 JSON text and inserted into a BLOB column. In the case of OSON, the document object is converted to OSON binary at the client side and inserted into a JSON type column with server-side OSON byte verification (section 4.1). After a warm-up period of 6 seconds, the average inserts per second is recorded over a 1-minute window.
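For concreteness, the following minimal sketch shows the shape of the target table and insert statement; the names docs, id, and doc are illustrative assumptions, not the benchmark's actual schema.

    -- JSON type column: the database stores the value as OSON.
    CREATE TABLE docs (
      id  NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      doc JSON
    );

    -- Each thread repeatedly executes a bind-variable insert; with
    -- the JSON type, the client driver encodes the bound document
    -- to OSON before sending it, so the server only verifies bytes.
    INSERT INTO docs (doc) VALUES (:document);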
Figure 6. Relative insert throughput, text vs OSON

Figure 6 shows the insert throughput of JSON text compared to OSON binary storage. The figure gives the ratio rather than absolute values. Up through D4, the average writes per second is essentially the same, showing that the cost of encoding OSON at the client side is not a significant factor. Furthermore, for the remaining cases, the rate of insertion is significantly higher for OSON because the OSON storage is smaller, reducing disk I/O and data transfer costs. For D11, the OSON is about a third of the size of the JSON text and we consequently get about three times the insert throughput.
5.2.2 Retrieval
In this experiment, we measure the read throughput of a client Java application running on a compute node in the same data center as the database node. The application contains 24 threads, and each thread repeatedly retrieves the same document from the database and obtains a DOM API over the value. In the case of JSON text, we use a popular third-party JSON parser that we found to be the most performant of the several Java JSON parsers we tested. We use this parser to fully parse the JSON text and create an in-memory DOM for the value. In the case of OSON, we use a DOM API implemented on top of our in-place client-side OSON decoder.

Figure 7. Relative full read throughput, text vs OSON

Figure 7 shows the relative read throughput when the client application fully iterates all of the values in the DOM. All JSON strings are accessed as Java strings and all numbers are converted into Java's BigDecimal. In this experiment, full document access performance with OSON is typically equivalent to or better than JSON text. For D11, the OSON is about a third of the size of the JSON text (saving on transfer costs) and reading the OSON is more efficient than parsing the JSON text.

Figure 8 shows the relative read throughput when only a single path is accessed on the client instead of iterating every value. For each document, we selected a path to some arbitrary value within the document (the path chosen is not shown). The specific path used is not important, as both the parsed JSON object model, which uses hash tables, and the jump offsets within the OSON image provide fast access to any point in the document. The advantage of OSON, however, is that any point in the document can be read without pre-processing the rest of the document; in the case of JSON text, the entire document must be parsed before the path can be evaluated. This advantage is also useful to applications that need low-latency access to the first value read within a document. For example, a page load on a social media feed can display the first 10 items without having to first process all 100 returned by the server. And this can be achieved without any of the limitations or usability issues typically imposed by a streaming text parser. These benefits are most pronounced for D11, where the relative increase is 7.7x because the client can skip reading portions of this large document.

Figure 8. Relative path read throughput, text vs OSON
5.3 Query Performance
In this section we compare query performance over JSON text and the JSON type using 10 queries from the EYCSB [7] benchmark. In this experiment, no index is created, so every query uses a table scan and evaluates the SQL/JSON operators over each document. The table contains 2 million EYCSB JSON documents with an average document size of 2 KB. The total storage size of the table is 4.4 GB for JSON text and 4.2 GB for OSON. The equivalent relational row storage, obtained by extracting all common metadata and storing it in a central dictionary, is 3.7 GB for EYCSB. Although OSON has a 12% storage size overhead compared with pure relational row storage, in exchange OSON storage has no schema evolution issues.
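The queries are ordinary SQL statements over the JSON column. As a hedged illustration only (the actual EYCSB query text is not reproduced here, and the table and field names below are assumptions), a no-index query has this shape:

    -- Illustrative only: a full-scan query evaluating SQL/JSON
    -- operators per document; ycsb_docs and the paths are assumed.
    SELECT JSON_VALUE(doc, '$.field0')
    FROM   ycsb_docs
    WHERE  JSON_EXISTS(doc, '$?(@.field1 == "value1")');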
Figure 9. Relative response time, text vs OSON (no index)

Figure 9 shows a 6x-8x improvement in response time using OSON storage. SQL/JSON path expressions can be efficiently evaluated over OSON values in-place using the technique described in section 3.3.

As explained in section 4.2.3, the JSON datatype is integrated with the Oracle in-memory store by populating in-memory path/value indexes over the OSON bytes. Figure 10 shows an average of 200x response time improvement for the 10 EYCSB queries using IMC. Q6, Q7, and Q8 show a smaller improvement than the other queries because they return more rows, which require post-filters after the in-memory index probes.
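Enabling this integration uses standard Oracle In-Memory DDL. The one-line sketch below assumes the hypothetical ycsb_docs table from above.

    -- Populate the table (including its JSON column) into the Oracle
    -- In-Memory column store, allowing the in-memory path/value
    -- indexes over the OSON bytes to be built; name is hypothetical.
    ALTER TABLE ycsb_docs INMEMORY PRIORITY HIGH;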
Figure 10. EYCSB response time, IMC index speedup

5.4 Update Performance
In this section we compare the update performance of the JSON type using JSON_TRANSFORM() for two customer datasets (D7 and D11 in Table 1). For JSON text storage, partial update is not feasible and, consequently, the corresponding transaction redo log size is proportional to the full document size. For OSON storage, transaction redo size is proportional to the size of the actual update delta, not the full document. Consequently, partial update of OSON is significantly faster, with significantly smaller redo.
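The following sketch shows the style of piecewise update measured here; the paths and values are invented for illustration and are not the actual U1-U8 statements.

    -- JSON_TRANSFORM combines several modification commands in one
    -- operator; with OSON storage, only the changed pieces of the
    -- document are written to the redo log.
    UPDATE docs d
    SET    d.doc = JSON_TRANSFORM(
             d.doc,
             SET    '$.status'      = 'SHIPPED',
             SET    '$.lastUpdated' = SYSTIMESTAMP,
             REMOVE '$.internalNotes'
           )
    WHERE  d.id = :id;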
Figure 11. Update response time, full replacement versus piecewise update

Figure 12. Update redo log size, full replacement versus piecewise update (log scale)

We used 4 statements (U1-U4) that apply JSON_TRANSFORM() to update over 20,000 documents generated from D7. The total storage size is 1.5 GB for JSON text and 0.77 GB for OSON. The equivalent relational row storage, obtained by extracting all common metadata and storing it in a central dictionary, is 0.73 GB. So for large OSON documents, the equivalent relational row storage has a negligible storage size advantage.

For updating D7, Figure 11 shows a 7x-17x improvement in update response time, and Figure 12 shows a 10x-31x reduction in transaction redo log size. We used another 4 statements (U5-U8) that apply JSON_TRANSFORM() to 40 documents generated from D11. The total storage size is 1.6 GB for JSON text and 0.55 GB for OSON. For updating D11, Figure 11 shows a 112x-125x update performance improvement, and Figure 12 shows a 105x-8809x reduction in transaction redo log size.
6. RELATED WORK
Oracle [29], Microsoft SQL Server [10], MySQL [12], PostgreSQL [16], and Teradata [23] support JSON text storage. MongoDB [11], Teradata [23], and DB2 [6] support BSON storage. MySQL [12] and PostgreSQL [16] each support their own binary JSON storage. MySQL [12] and Teradata [23] support a JSON datatype: the MySQL JSON datatype uses its own JSON binary format, while Teradata supports a JSON datatype with a variety of storage formats (text, BSON [2], UBJSON [24]). Oracle supports fast in-memory OSON with JSON text storage on disk [28].

Similar to OSON [14], the binary JSON formats of MySQL, PostgreSQL, Sinew [5], and STEED [27] all share the common design strategy of supporting jump navigation, based on indexed field key names in the binary format, to speed up queries. Therefore, all of these binary formats deliver faster query performance than JSON text storage. However, none of these query-friendly binary formats addresses partial update within the binary format. Our earlier work that loaded the OSON format in-memory [28] does not address partial updates either; the OSON format used for JSON datatype support in this paper does support partial update at the binary format level. Mison [26] proposes a SIMD-based fast JSON parser for JSON queries, but it does not address partial update either. JSON text and streaming-encoding-based binary formats, such as BSON [2] and UBJSON [24], are very limited in supporting partial updates efficiently without replacing the full document. And whereas JSON_TRANSFORM() combines multiple modification commands in one operator, the MySQL [12] and PostgreSQL [16] JSON update operators can only perform a single update command at a time.
The Sinew [5] and STEED [27] binary formats separate dictionary metadata out of the binary format and store it in a central location. This approach makes each binary instance non-self-contained and hard to distribute without shipping the central dictionary. Our design of OSON is self-contained and does not rely on a centrally located schema. Consequently, it is able to fully support database features such as partitioning, replication, sharding, import/export, and transportable tablespaces, since it requires no central schema synchronization.

The Dremel [18] and STEED [27] columnar layouts are examples of JSON columnar encoding formats. However, both assume that all JSON documents in a collection have the same schema; this is required to correctly re-assemble the original JSON documents from their columnar decomposition. A generic JSON datatype, by contrast, must handle both homogeneous and heterogeneous JSON collections. We store the OLTP-friendly OSON format on disk and, to achieve columnar scan performance, leverage the Oracle in-memory store [22] and in-memory expressions (IME) [1] to load columnar projections of JSON into memory for columnar SIMD scans. Furthermore, we use a path-value index and an inverted keyword hash index format to facilitate in-memory predicate evaluation and joins for JSON_EXISTS() and JSON_TEXTCONTAINS().
SQL++ [9] (Couchbase) and the similar PartiQL [15] (Amazon) take a "JSON first" approach, extending a simpler SQL-92 language with a singular JSON-like data model and dynamic typing. Notably, these languages do not use explicit SQL/JSON operators or a secondary JSON path language to access JSON values. The Oracle simplified syntax for SQL/JSON (section 2.3) represents a somewhat similar approach but is defined on top of SQL-2016's SQL/JSON operators. SQL-2016 has been, and will continue to be, extended to support many data models beyond JSON. Oracle's approach integrates well with the latest version of the SQL standard and supports storing and querying many different data models side-by-side (relational, object-relational, XML, JSON, spatial, graph, and so on). At the same time, it does not sacrifice the performance or expressiveness of queries over JSON data.
XML and JSON are both popular ways to represent semi-structured data. XML is the more mature of the two and has a larger body of academic and industry-led research on efficient processing using binary encodings. However, the JSON data model serves the same purpose and is semantically much simpler. XML is complicated by its document-oriented heritage, with concepts such as document node order, interleaved node types, mixed content, a lack of in-line scalar data types, and no explicit array construct. In contrast, JSON does not specify a document node order, has explicit numeric and boolean types, and has an explicit array type to connote ordering when it is required. The XML DOM interface requires implementations to provide costly semantics such as navigation to an explicit parent node and sibling traversals. XML also has constructs, such as attributes and namespaces, that complicate processing without adding much expressive power for data representation. Our work on the OSON format has been able to exploit the simpler and cleaner JSON data model to allow JSON DOMs to be persisted with minimal storage space, fast queries, and efficient updates that cannot be achieved with comparable binary XML DOM representations.
7. FUTURE WORK
Multi-value functional indexes over JSON documents will be supported, since a JSON array within a JSON document can hold multiple values. Multi-value functional indexes are used for JSON array range predicates in JSON_EXISTS() to determine a known set of scalar values that map to a given document. We will work with the SQL/JSON standards committee to add the JSON datatype, JSON_TRANSFORM(), and the simplified syntax, and we will publish a benchmark for hybrid JSON use cases.
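As an illustrative sketch of what such an index and the predicate it serves might look like (the names are hypothetical, and the exact DDL syntax is our assumption rather than something specified in this paper):

    -- Hypothetical multi-value functional index over the elements
    -- of a JSON array, enabling array range predicates.
    CREATE MULTIVALUE INDEX doc_item_qty_idx
      ON docs d (d.doc.items[*].qty.number());

    -- A JSON_EXISTS() array range predicate such an index could serve:
    SELECT COUNT(*)
    FROM   docs d
    WHERE  JSON_EXISTS(d.doc, '$.items?(@.qty > 10)');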
8. CONCLUSIONS
The native JSON datatype, in combination with the SQL/JSON standard, establishes a strong semantic foundation to support JSON in a relational database. Using OSON as the encoding format enables efficient query, update, ingestion, and retrieval of JSON documents. Features such as SODA, full-text search indexes, and sharding have filled the main functional gaps with NoSQL databases. Further, we support high-performance OLAP workloads over JSON by deeply integrating the JSON type with Oracle in-memory, parallel query, materialized views, and storage cell smart scans. Combined, these features make a robust platform for processing JSON data. End-to-end OSON support accomplishes the goal of object databases: to persist application-level objects in a schema-flexible way without shredding them across multiple tables and rows.

9. ACKNOWLEDGEMENTS
The authors would like to thank Andrew Mendelsohn for his commitment to JSON datatype support; Tirthankar Lahiri and Juan Loaiza for supporting JSON functionality as a crucial feature in the RDBMS; Cetin Ozbutun for his direction of in-memory JSON support; Fred Zemke and Jan Michels for the SQL/JSON standard; Martijn Cohen, Niraj Srivastava, Manoj Sahoo, and Srinivas Kareenhalli for JSON/OSON performance evaluation; Sriram Krishnamurthy and his team (Rodrigo Fuentes, Jose Valera, Jose Alvizo, Sunitha Subramanyam, Sundeep Abraham, Hui Zhang) for their implementation help; Shasank Chavan, Aurosish Mishra, and Saurabh Naresh Netravalkar for in-memory support; Kam Shergill for memory-optimized write support; Maxim Orgiyan and Tim Yu for SODA support; the Venkat Venkatraman team for quality assurance; Andrew Witkowski and Tsae-Feng Yu for parallel query and materialized view support; Rajendra Pingte, Saurabh Verma, Martha Woo Woo, and their client teams for supporting JSON datatype access on all Oracle client tiers; Ron Decker for PL/SQL support; Wei Hu and Mark Dilman for sharding support; Yuhong Gu and Thuvan Hoang for replication support; and Rich Phillips for DB import/export tool support.

10. REFERENCES
[1] A. Mishra, et al. Accelerating Analytics with Dynamic In-Memory Expressions. PVLDB 9(13): 1437-1448, 2016.
[2] BSON: http://bsonspec.org/
[3] Block Volume: https://docs.cloud.oracle.com/iaas/Content/Block/Concepts/overview.htm
[4] Couchbase JSON Support: https://developer.couchbase.com/documentation/server/3.x/developer/dev-guide-3.0/using-json-docs.html
[5] D. Tahara, et al. Sinew: a SQL system for multi-structured data. SIGMOD Conference 2014: 815-826.
[6] DB2 JSON support: https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/json/src/tpc/db2z_jsonfunctions.html
[7] EYCSB: https://blog.couchbase.com/ycsb-json-benchmarking-json-databases-by-extending-ycsb/
[8] Elastic Search API: https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index.html
[9] K. W. Ong, et al. The SQL++ Query Language: Configurable, Unifying and Semi-structured. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.690.8323
[10] Microsoft SQL Server JSON support: https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
[11] MongoDB BSON and JSON: https://www.mongodb.com/json-and-bson
[12] MySQL JSON Datatype: https://dev.mysql.com/worklog/task/?id=8132
[13] N. Mukherjee, et al. Oracle SecureFiles System. PVLDB 1(2): 1301-1312, 2008.
[14] Oracle Exadata Machine & Storage Server: http://www.oracle.com/us/products/database/exadata-tech-overview-wp-1867437.pdf
[15] PartiQL: https://partiql.org/
[16] PostgreSQL with JSON and JSONB support: https://www.postgresql.org/docs/9.4/datatype-json.html
[17] Regions and Availability Domains: https://docs.cloud.oracle.com/iaas/Content/General/Concepts/regions.htm
[18] S. Melnik, et al. Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB 3(1): 330-339, 2010.
[19] SODA: Simple Oracle Document Access API: https://docs.oracle.com/en/database/oracle/simple-oracle-document-access/index.html
[20] SQL NESTED Clause: https://docs.oracle.com/en/database/oracle/oracle-database/19/adjsn/function-JSON_TABLE.html
[21] SQL/JSON 2016 Standard: ISO/IEC TR 19075-6:2017, Information technology, Part 6: SQL support for JavaScript Object Notation (JSON). http://standards.iso.org/ittf/PubliclyAvailableStandards/
[22] T. Lahiri, et al. Oracle Database In-Memory: A dual format in-memory database. ICDE 2015: 1253-1258.
[23] Teradata JSON Datatype: https://docs.teradata.com/reader/C8cVEJ54PO4~YXWXeXGvsA/4IAzgRsj_8aRj5pCQoEqzA
[24] UBJSON: http://ubjson.org/
[25] Virtual Machines: https://www.oracle.com/cloud/compute/virtual-machines.html
[26] Y. Li, et al. Mison: A Fast JSON Parser for Data Analytics. PVLDB 10(10): 1118-1129, 2017.
[27] Z. Wang, et al. STEED: An Analytical Database System for TrEE-structured Data. PVLDB 10(12): 1897-1900, 2017.
[28] Z. H. Liu, et al. Closing the functional and Performance Gap between SQL and NoSQL. SIGMOD Conference 2016: 227-238.
[29] Z. H. Liu, et al. JSON data management: supporting schema-less development in RDBMS. SIGMOD Conference 2014: 1247-1258.