Snowflake Snowpro Exam Cheatsheet

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7
At a glance
Powered by AI
Some of the key takeaways from the document are that Snowflake is a cloud-based data warehouse with no hardware to maintain, it uses a hybrid architecture of shared disk and shared nothing databases, and it offers different pricing models.

Data can be loaded into Snowflake from local files, cloud storage like AWS S3, Azure Blob Storage and GCP Cloud Storage. It supports loading structured and semi-structured file formats like CSV, JSON, ORC and Parquet.

Snowflake encrypts all data by default using 126-bit or 256-bit keys. It uses a hierarchy key model to manage encryption keys with rotation of account master keys, table master keys and file keys.

Snowflake SnowPro Certification Exam Cheat Sheet by Jeno Yamma

// Snowflake - General // // Architecture //

Introduction Overview
- Analytic data warehouse - Hybrid of shared-disk db and shared-nothing db
- SaaS offering - Uses central data repo for persisted storage - All compute nodes have data access
- No hardware (or installation or updates/patch) - Using MPP clusters to process queries - Each node stores portion of data locally
- No ongoing maintenance or tuning
- Can’t run privately (on-prem or hosted) Architectural layers
- Runs completely on the cloud
- Either AWS, Azure or GCP (NO ON-PREM or HYBRID!)
- Has its own VPC!
- Decoupled compute and storage (scaled compute does not need scaled storage)

Pricing
- Unit costs for Credits and data storage determined by region (not cloud platform)
- Cost of Warehouse used
- Minimum 60s on resuming, per 1s after
Storage
- Cost of storage (Temp, Transient, Perm Tables & time-travel data & fail-safe)
- All data stored as an internal, optimised, compressed columnar format (micro-partitions -
- Calcs based on daily average
represents logical structure of table)
- Pricing model = on demand or discounted upfront
- Can’t access the data directly, only through Snowflake (SQL etc)
- Just remember:
- >= Enterprise = 90 days time travel (default: 1 day for all) + materialized view +
Query Processing
multi-cluster warehouse
- Where the processing of query happens
- >= Business Critical = lots more security (HIPAA, SOC 1&2, PCI DSS)
- Uses virtual warehouses - an MPP compute cluster
- VPS edition = own cloud service layer (not shared with accounts)
- Each cluster = Independent - doesn’t share resources with other vwh and doesn’t impact
others
Supported Regions
- Multi-region account isn’t supported, each SF in single region
Cloud Services
- 17 Regions in total
- Command centre of SF - coordinates and ties all activities together within SF
- AWS - 9 regions (Asia Pacific-3, EU-2, North America-4)
- Gets provisioned by Snowflake and within AWS, Azure or GCP
- GCP - 1 (Asia Pacific-0, EU-0, North America-1) >> in preview
- You don’t have access to the build of this or to do any modifications
- Azure - 7 (Asia Pacific-2, EU-1, North America-4)
- Instance is shard to other SF accounts unless you have VPS SF Edition
- Services that this layer takes care of:
// Connecting to SF //
- Authentication
- Infrastructure management
- Metadata management
- Web-based UI (see above for usage,
- Query parsing and optimisation
capabilities, restriction)
- Access control
- Command line client (SnowSQL)
- ODBC and JDBC (you have to download the
Caches
driver!)
- Snowflake Caches different data to improve query performance and assist in reducing cost
- Native Connectors (Python, Spark & Kafka)
>Metadata Cache - Cloud Services layer
- Third party connectors (e.g. Matilion)
- Improves compile time for queries against commonly used tables
- Others (Node.js, .Net)
>Result Cache - Cloud Services layer
- Holds the query results
Snowflake Address
- If Customers run the exact same query within 24 hours, result cache is used and no warehouse is
https://account_name_region_{provider}.snowflakecomputing.com
required to be active
Account_name either:
>Local Disk Cache or Warehouse Cache - Storage Layer
- AWS = account_name.region
- Caches the data used by the SQL query in its local SSD and memory
- GCP/Azure = account_name.region.gcp/azure
- This improves query performance if the same data was used (less time to fetch remotely)
e.g. https://pp12345.ap-southeast-2.snowflakecomputing.com
- Cache is deleted when the Warehouse is suspended
// Data Loading // Planning Data Load
Dedicating Separate Warehouse
File Location - Load of large data = affect query performance
- On Local - Use seperate Warehouse
- On Cloud - # data files processed in parallel determined by servers in the warehouse.
- AWS (Can load directly from S3 into SF) - Split large data to scale linearly, use of small vwh should be sufficient
- Azure (Can load directly from Blob Storage into SF)
- GCP (Can load directly from GCS into SF) Staging Data
File Type
- Structured > Organising Data by Path - Snowflake Staging
- Delimited files (CSV, TSV etc) - Both internal and external ref can include path (prefix) in the cloud storage
- Sem-structured - SF recommends (for file organisation in Cloud Storage):
- JSON (SF can auto detect if Snappy compressed) 1. Partition into logical path - includes identifying detail with date
- ORC (SF can auto detect if Snappy compressed or zlib) 2. Organising file by path = allow copy fraction of partitioned data in SF in one command
- Parquet (SF can auto detect if Snappy compressed) - Result: Easier to execute concurrent COPY statements - takes advantage of
- XML (in preview) parallel operations
- * Note: If files = uncompressed, on load to SF it is gzip (can disable) - Named stage operation:
- Staging = A place where the location/path of the data is stored to assist in processing the
>Compression upload of files
- SF auto compresses data from local fs to gzip - can change (AUTO_COMPRESS) - Remember: Files uploaded to snowflake Staging area using PUT are automatically
- Specify compression type on loading compressed data encrypted with 128-bit or 256-bit (CLIENT_ENCRYPTION_KEY_SIZE to specify) keys!
Loading Data
>Encryption for load > COPY command - parallel execution
- Loading unencrypted files - Supports loading by Internal Stage or S3 Bucket path
- SF auto encrypts files using 126-bit keys! (or 256-keys - requires configuring) - Specifying specific list of files to upload (1000 files max at a time)
- Loading encrypted files - Identify files through pattern matching (regex
- Provide your key to SF on load - Validating Data:
Best Practice - Use VALIDATION_MODE - validate errors on load - does not load into table
- ON_ERROR to run actions to follow
>File Sizing - When COPY command runs, SF sets load status in the table’s metadata
- ~10-100 MB file, compressed - Optimises parallel operations for data load - Prevents parallel COPY from loading the same file
- Aggregate smaller files - Minimises processing overhead - On complete, SF sets the load status and to those that failed as well
- Split up larger files to small files - Distributes workload among server in active vwh - Metadata (expires after 64 days):
- # data files processed in parallel depends on capacity of servers in vwh - Name of file
- Parquet: >3GB compressed could time out - split into 1GB chunks - File size
- ETag of the file
>Semi Structured Sizing - # rows parsed in file
- VARIANT data type has 16 MB compressed size limit per row - Timestamp of last load
- For JSON or Avro, outer array structure can be removed using STRIP_OUTER_ARRAY. - Information on errors during load
- SF recommends removing the Data from the Stage once the load is completed to avoid reloading
>Continues Data Loads File Sizing again - use REMOVE command (and specify PURGE in COPY argument)
- Load new data within a minute after file notification sent, longer if file is large - Improves performance - doesn’t have to scan processed files
- Large compute is required (decompress, decrepit, transform etc). > Note for Semi-structured
- If > 1min to accumulate MBs of data in source app, create data file once per min - SF loads ss data to VARIANT type column
- Lowers cost, improve performance (load latency) - Can load ss data into multiple columns but ss data must be stored as field in structured data
- Creating smaller files and staging them in cloud storage >1 time per minute = disadvantage - Use FLATTEN to explode compounded values into multiple rows
- Reduction in latency between staging and loading data != guaranteed
- Overhead to manage file in internal load queue - increase utilisation cost >Data Transformation during load
- Overhead charges apply if per than 1000 files in queue - Supported
- Sequence, substring, to_binary, to_decimal
- Commands not supported
- WHERE, FLATTEN, JOIN, GROUP BY, DISTINCT (not fully supported)
- VALIDATION_MODE (if any aggregation applied)
- CURRENT_TIME (will result in the same time)
Snowflake Stage Snowpipe - Incremental Load

>Introduction

Types of Stages
- Default: each table and user are allocated an internal named stage
- Specify Internal Stage in PUT command when uploading file to SF
- Specify the Same Stage in COPY INTO when loading data into a table
> User Stage (Ref: @~) - Enable you to loads data as soon as they are in stage (uses COPY command)
- Accessed by single user but need copying to multiple tables - Uses Pipe (first-class SF object) which contains the COPY command
- Can’t be altered or dropped - Can DROP, CREATE, ALTER, DESCRIBE and SHOW PIPES
- Can’t set file format, need to specify in COPY command to table - Think of this as SF way of streaming data into the table as long as you have created a Named
Stage.
> Table Stage (Ref: @%) - Generally loads older files first but doesn’t guarantee - Snowpipe appends to queue.
- Accessed by multiple users and copied to single tale - Has loading metadata to avoid duplicated loads
- Can’t be altered or dropped
- Can’t set file format, need to specify in COPY command to table >Billing and Usage
- No transformation while loading - User doesn’t need to worry about managing vwh
- Charges based on resource usage - consume credit only when active
Internal Named Stages (Ref: @) - View charges and usage
- A Database object - Web UI - Account > Billing & Usage
- Can load data into any tables (Needs user with privilege) - SQL - Information Schema > PIPE_USAGE_HISTORY
- Ownership of stage can be transferred >Automating Snowpipe
- Check out their AWS, GCP, Azure, and REST API article
AWS - Bulk Load Loading from S3
Querying from Staged File - Query data outside SF
- SF allows you to query data directly in external locations as long as you have created a stage for it
- SF recommends using simple queries only for
- Performance = impacted since it’s not compressed and columnised to SF standard

// Monitoring //

Resource Monitor
- Helps control the cost and unexpected spike in credit usage
- Set Actions (or triggers) to Notify and/or Suspend if credit usage is above certain threshold
- Set intervals of monitoring
- SF uses S3 Gateway Endpoint. If region bucket = SF region - no route through public internet. - Created by users with admin roles (ACCOUNTADMIN)
>Securely access S3 - If monitor suspends, any warehouse assigned to the monitor cannot be resumed until
1. Configure storage integration object - delegate authentication responsibility for external cloud 1. Next interval starts
storage to SF identity and IAM 2. Minitor is dropped
2. Configure AWS IAM role -> allow S3 bucket access 3. Credit quota is increased
3. Configure IAM user, provide Key and Secret Key 4. Warehouse is no longer assigned to the monitor
5. Credit threshold is suspended
// Unloading Data // // Virtual Warehouse //

Introduction Cluster of resource


- Allows you to export your data out of SF - Provides CPU, Mem and Temp Storage
- Uses COPY INTO <location> (either external or SF stage) - Is active on usage of SELECT and DML (DELETE, INSERT, COPY INTO etc)
- If SF then use GET command to download file - Can be stopped at any time and resized at any time (even while running)
- Use SELECT and other full SQL syntax to build data for export - Running queries are do not get affected, only new queries
- Can specify MAX_FILE_SIZE to split into multiple files by chunks - Warehouse sizes = T-shirt sizes (generally query performance scales linearly with vwh size)
Format and Restrictions - X-Small - 4X-Large
- Allowed formats for export (only UTF-8 as encoding allowed) - When creating a vwh you can specify:
- Delimited Files (CSV etc) - Auto-suspend (default: 10mins) - suspends warehouse if active after certain time
- JSON - Auto-resume - auto-resumes warehouse whenever a statement that requires active vwh
- Parquet is required
- Allowed compression for export - Query Caching
- gzip (default) - vwh maintains cache of table data - Improves query performance
- bzip2 - The Larger the vwh, larger the cache - Cache dropped when vwh suspended
- Brotli
- Zstandard Choosing VWH
- Encryption for export - SF recommends experimenting by running the same queries on different vwh
- Internal Stage - 128-bit or 256-bit - gets decrypted when downloaded - Monitoring Warehouse Load - WebUI: In Warehouse (view queued and running queries)
- External - Customer supply encryption key
Considerations Query Processing and Concurrency
- Empty String and Null - When queries submitted, vwh calls and reserves resource
- FIELD_OPTIONALLY_ENCLOSED_BY - Query is queued if vwh doesn't have enough resource
- EMPTY_FIELD_AS_NULL - STATEMENT_QUEUED_TIMEOUT_IN_SECONDS
- NULL_IF - STATEMENT_TIMEOUT_IN_SECONDS
- Unloading a single file - Both can be used to control query processing and concurrency
- MAX_FILE_SIZE default = 16mb - Size and complexity of query determines the concurrency (also # queries being executed)
- Max max is 5gb for AWS, GCP, and 256mb for Azure
- Unloading Relational table to JSON Multi-cluster warehouse
- User OBJECT_OBSTRUCT obstruct in COPY command to convert rows of relational - Up to 10 server clusters
table to VARIANT column then unload the data as per usual - Auto-suspend and auto-resume is of whole cluster not 1 server
- Can resize anytime
Unloading in SF Stage Unloading into S3 - Multi-cluster mode
- Maximised - associate same min and max for # cluster
- SF starts all clusters at start would be good for expected workload
- Auto-scale - different min and max for # cluster
- To help control this, SF provides scaling policy

- Some rules:
// Tables in Snowflake // Types of Tables (Internal)

Micro-partition >Temporary Tables


- All of SF Tables are divided into micro-partition - Used to store data temporarily
- Each micro-partitions are compressed columnar data - Non-permanent and exists only within the session - data is purged after session ends
- Max size of 16mb compressed (50-500MB of uncompressed data) - Not visible to other users and not recoverable
- Stored on logical hard-drive - Contributes to overall storage
- Data only accessible via query through SF not directly - Belongs to DB and Schema - Can have the same name as another non-temp table within the same
- They are immutable, can’t be changed DB and Schema
- Order of data ingestion are used to partition these data - Default 1 day Time Travel
- SF uses natural clustering to colocate column with the same value or similar range
- Results to non-overlapping micro-partition and least depth >Transient Tables (or DB and Schema)
- Improves query performance to avoid scanning unnecessarily micr-partitions - Persists until dropped
- Have all functionality as permanent table but no Fail-safe mode (no FS storage cost)
- Default 1 day Time Travel

>Permanent Table
- Have 7 days fail safe
- Default 1 day Time Travel (up to 90 days for >=Enterprise edition)

>External Table
- Allows access to data stored in External Stage

>Creating tables
- SF doesn’t enforce any constraints (primary key, unique, foreign key) beside the NOT NULL
constraint
- Metadata of the micropartitions are stored in the Cloud Services Layer
- Customers can query this without needing an Active vwh Types of Views
- Range of each columns
- Number of Distinct Values >Non-materialized views (views)
- Count NULL - Named definition of query
>Clustering Key - Result are not stored
- SF recommends the use of Clustering key once your table grows really large (multi-terabyte
range). >Materialized views
- Especially when data does not cluster optimally. - Result are stored
- Clustering keys are subset of columns designed to colocate data in the table in the same - Faster performance than normal views
micro-partitions - Contributes towards storage cost
- Improves query performance on large tables by skipping data that is not part of filtering
predicate

Zero-copy Cloning
- Allows Customers to Clone their table
- SF references the original data (micro-partition) - hence zero-copy (no additional storage cost)
- When a change is made to the cloned table (new data added, deleted or updated) then
a new micro-partition is created (incurs storage cost).
- Cloned object do not inherit the source’s granted privileges

>Secure view
- Both non-materialized and materialized view can be defined as Secure view
- Improves data privacy - hides view definition and details from unauthorised viewers
- Performance is impacted as internal optimization is bypassed
// Time Travel and Fail Safe // Share

Introduction

>Time Travel (0-90 days)


- Allows db to query, clone or restore historical data to tables, schema or db for up to 90 days
- Can use this to access snapshot of data at a point in time
- Useful if data was deleted, dropped or updated.
>Fail safe (7 days)
- Allows disaster recovery of historical data - Only accessible by Snowflake!

// Data Sharing //

Introduction
- Between SF accounts
- Each share consists of
- Privilege that grant db access
- Privilege that grant access to specific object (tables, views)
- Consumer accounts with which db and object are shared
- Can perform DML operations

Reader Account

- Only ACCOUNTADMIN can provision share


- The sharer is the Provider while the user of the shared data is the Consumer
- Only ACCOUNTADMIN role can create this
- Secure data sharing enables account-to-account sharing of Database, Tables and Views
- No actual data is copied or transferred
- Sharing accomplished using SF service layer and metadata store
- Charge is compute resources used to query shared data
- User creates share of DB then grants object level access
- read-only db is created on consumer side
- Share data with user who isn’t on SF
- Access can be revoked at any time
- Objects can only be read not modified (no DML) - consumer can only consume the data
- Type of share includes Share and Reader
- No limit on number of shares or accounts but 1 DB per share
// Access Control within Snowflake // >Custom Roles
- Created by SECURITYADMIN
Network Policy - Assigned to SYSADMIN
- Allows access based on IP whitelist or restrictions to IP blacklist - If not assigned, only roles with MANAGE GRANTS can modify grants on it.
- Apply through SQL or WebUI - Create custom roles with least privilege and role them up
- Only ACCOUNTADMIN or SECURITYADMIN can modify, drop or create these
PrivateLink
- If time permits briefly read through these (AWS and Azure)

MFA
- Powered by Duo System and enrolled on all accounts, Customers need to enable it
- Recommend enabling MFA on ACCOUNTADMIN
- Use with WebUI, SnowSQL, ODBC, JDCB, Python Connector

Federated Auth and SSO


- User authenticate through external SAML 2.0-compliant identity provider (IdP)
- Users don’t have to log into Snowflake directly - Visit “Access Control Privileges” in SF Doc and have a quick skim over the privileges
- As per basic introduction to SSO, a token is passed to the application that the user is trying to
login to authenticate which in turns will open access when verified // Data Security within Snowflake //

Access Control Models > End-to-end Encryption


SF approach to access control uses the following models: - Snowflake encrypts all data by default, no additional cost
- Discretionary Access Control (DAC) - All object have an owner, owner can grant access to their - No one other than the customer or runtime components can read the data
objects - Data is always protected as it is in an encrypted state
- Role-based Access Control (RBAC) - Privileges are assigned to roles, then roles to users - Customer provides encrypted data to the external staging area (S3 etc)
- Roles are entities that contains granted privileges (level of access to object) which are - Provide SF with the encryption master key when creating the Named Stage

System Defined Roles

- Client Side Encryption


- Customer provides unencrypted data to SF internal staging area, SF will automatically encrypt the
data.

>ACCOUNTADMIN
- encapsulates SECURITYADMIN and SYSADMIN
>SECURITYADMIN >Key Rotation
- Creates, modify and drops user, roles, networks, monitor or grants - Snowflake encrypts all data by default and keys are rotated regularly
>SYSADMIN - Uses Hierarchy Key Model for its Encryption Key Management - comprises of
- Has privileges to create vwh, and db and its objects - Root Key
- Recommend assigning all custom roles to SYSADMIN so it has control over all objects created - Account Master Key (auto-rotate if >30 days old)
(create role hierarchy based on this) - Table Master Key (auto-rotate if >30 days old)
>PUBLIC - File Keys
- Automatically grant to all users
- Can own secured objects >Tri-Secret Secure and Customer-managed keys (Business Critical Edition)
- Used where explicit access control is not required - Combines the customer key with snowflake’s maintained key
- Creates composite master key then use it to encrypts all data in the account
- If customer key or snowflake key is revoked, data cannot be decrypted

You might also like