Databricks Delta Guide
Note
Databricks Delta is in Preview.
Use this guide to learn about Databricks Delta, a powerful transactional storage
layer that harnesses the power of Apache Spark and Databricks DBFS.
Introduction to Databricks Delta
o Requirements
o Create a table
o Read a table
o Optimize a table
o Clean up snapshots
Table Batch Reads and Writes
o Create a table
o Read a table
o Write to a table
o Schema validation
o Views on tables
o Table properties
o Table metadata
Table Streaming Reads and Writes
o As a source
o As a sink
Optimizing Performance and Cost
o Compaction (bin-packing)
o Garbage collection
o Isolation levels
Porting Existing Workloads to Databricks Delta
o Example
Introduction to Databricks Delta
Note
Databricks Delta is in Preview.
Databricks Delta delivers a powerful transactional storage layer by harnessing
the power of Apache Spark and Databricks DBFS. The core abstraction of
Databricks Delta is an optimized Spark table that stores your data as Parquet
files in DBFS and maintains a transaction log that tracks changes to the table.
You read and write data stored in the delta format using the same familiar
Apache Spark SQL batch and streaming APIs that you use to work with Hive
tables and DBFS directories. With the addition of the transaction log and other
enhancements, Databricks Delta offers significant benefits:
o ACID transactions: multiple writers can modify a table concurrently while readers continue to see a consistent snapshot of the data.
Requirements
Databricks Delta requires Databricks Runtime 4.1 or above. If you created a
Databricks Delta table using a Databricks Runtime version lower than 4.1, the
table version must be upgraded. For details, see Table Versioning.
Databricks Delta does not support the following features:
o OUTPUTFORMAT and INPUTFORMAT
o COMPRESSION
o STORED AS
o ANALYZE TABLE PARTITION
o ALTER TABLE [ADD|DROP] PARTITION
o ALTER TABLE SET LOCATION
o ALTER TABLE RECOVER PARTITIONS
o ALTER TABLE SET SERDEPROPERTIES
o CREATE TABLE LIKE
o INSERT OVERWRITE DIRECTORY
o LOAD DATA
o Bucketing
o Specifying a schema when reading from a table. A command such as spark.read.format("delta").schema(df.schema).load(path) will fail. (See the sketch after this list for the supported way to read.)
o SparkR
o Spark-submit jobs
o Client-side S3 encryption
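For the schema restriction above, the supported pattern is simply to omit the schema. This is a minimal sketch, assuming a SparkSession named spark and the /data/events path used elsewhere in this guide:
Python
# Works: Databricks Delta supplies the schema from the table itself.
events = spark.read.format("delta").load("/data/events")
# Not supported: passing an explicit schema when reading a Delta table.
# events = spark.read.format("delta").schema(events.schema).load("/data/events")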
In this topic:
Create a table
Read a table
Write to a table
o Example notebooks
Stream data into a table
Optimize a table
Clean up snapshots
Create a table
Create a table from a dataset. You can use existing Spark SQL code and change
the format from parquet, csv, json, and so on, to delta.
Python
events = spark.read.json("/data/events")
events.write.format("delta").save("/data/events")
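If you also want to refer to the data by table name, as later examples in this guide do, one option (a sketch, not part of the original snippet) is to register it in the metastore with saveAsTable:
Python
# Sketch: save the same DataFrame as a metastore table named "events".
events.write.format("delta").saveAsTable("events")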
Read a table
You access data in Databricks Delta tables either by specifying the path on
DBFS ("/data/events") or the table name ("events"):
Python
events = spark.read.format("delta").load("/data/events")
or
events = spark.table("events")
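You can also query the table with Spark SQL, by name or by path. A minimal sketch via spark.sql; the count(*) query is only an illustration:
Python
# Sketch: query the Delta table with SQL, by table name or by path.
spark.sql("SELECT count(*) FROM events").show()
spark.sql("SELECT count(*) FROM delta.`/data/events`").show()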
Write to a table
To add new data to an existing table, write to it in append mode:
Python
(newEvents.write
  .format("delta")
  .mode("append")
  .save("/data/events"))
or
(newEvents.write
  .format("delta")
  .mode("append")
  .saveAsTable("events"))
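The same append can be written in SQL. A sketch via spark.sql, assuming newEvents is the DataFrame from the example above and is first exposed as a temporary view:
Python
# Sketch: SQL-style append into the events table.
newEvents.createOrReplaceTempView("new_events")
spark.sql("INSERT INTO events SELECT * FROM new_events")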
Example notebooks
Python notebook
Scala notebook
SQL notebook
Stream data into a table
You can also write to a Databricks Delta table using Structured Streaming, for example to continuously load JSON data into the table:
Python
# Continuously read JSON files and append them to the Delta table.
events = spark.readStream.json("/data/events")
(events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/delta/events/_checkpoint/etl-from-json")
  .start("/delta/events"))
For more information about Databricks Delta integration with Structured
Streaming, see Table Streaming Reads and Writes.
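A Databricks Delta table can also act as a streaming source, not just a sink. A minimal sketch that reads the table written above and echoes new rows to the console (the console sink is only for illustration):
Python
# Sketch: read the Delta table as a stream and print new rows to the console.
stream = spark.readStream.format("delta").load("/delta/events")
(stream.writeStream
  .format("console")
  .outputMode("append")
  .start())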
Optimize a table
Once you have been streaming for a while, you will likely have accumulated many
small files in the table. To improve the speed of read queries, you can
use OPTIMIZE to compact the small files into larger ones:
OPTIMIZE delta.`/data/events`
or
OPTIMIZE events
You can also specify columns that frequently appear in query predicates for
your workload, and Databricks Delta uses this information to cluster related
records together, as in the sketch below:
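A sketch via spark.sql; eventType stands in for whatever column your queries filter on most often:
Python
# Sketch: co-locate records by a frequently filtered column (eventType is hypothetical).
spark.sql("OPTIMIZE events ZORDER BY (eventType)")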
Clean up snapshots
Databricks Delta provides snapshot isolation for reads, which means that it is
safe to run OPTIMIZE even while other users or jobs are querying the table.
Eventually you should clean up old snapshots. You can do this by running
the VACUUM command:
VACUUM events
You control how long old snapshots are retained by using
the RETAIN <N> HOURS option, as in the sketch below:
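For example, a sketch via spark.sql; the 24-hour retention value is only illustrative:
Python
# Sketch: stop retaining snapshots older than 24 hours (value is illustrative).
spark.sql("VACUUM events RETAIN 24 HOURS")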