Redshift Dist Key and Sort Key in a Data Warehouse

Question

I have a data warehouse in Redshift. The cluster is a 2-node ra3.xlplus (4 vCPU, 32 GB memory).

My dimensions are relatively small: the largest one has 1M records. The fact tables will contain around 10M records.

Based on the blogs, answers, and videos that I have checked so far, would the following be the right combination of DISTKEY and SORTKEY?

For all dimensions: DISTSTYLE ALL (since the tables are small)

SORTKEY: the dimension's surrogate key

For all fact tables: DISTSTYLE KEY

DISTKEY: the surrogate key of the most important dimension table, the one most frequently joined in my BI queries

SORTKEY: Dim_Date_ID, since it is used in WHERE clauses
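
As a rough sketch of how the layout above might look as Redshift DDL, the Python snippet below renders two `CREATE TABLE` statements. The table names (`dim_customer`, `fact_sales`), the columns, and the `create_table` helper are all hypothetical and only for illustration; only `DISTSTYLE`, `DISTKEY`, and `SORTKEY` are actual Redshift table attributes.

```python
# Hypothetical star schema: every table and column name here is illustrative.

def create_table(name, columns, diststyle, distkey=None, sortkey=None):
    """Render a Redshift CREATE TABLE statement with distribution and sort attributes."""
    cols = ",\n".join(f"    {col} {typ}" for col, typ in columns)
    attrs = [f"DISTSTYLE {diststyle}"]
    if distkey:
        attrs.append(f"DISTKEY ({distkey})")
    if sortkey:
        attrs.append(f"SORTKEY ({sortkey})")
    return f"CREATE TABLE {name} (\n{cols}\n)\n" + "\n".join(attrs) + ";"


# Small dimension: replicated to every node, sorted on its surrogate key.
dim_customer_ddl = create_table(
    "dim_customer",
    [("customer_sk", "BIGINT NOT NULL"), ("customer_name", "VARCHAR(256)")],
    diststyle="ALL",
    sortkey="customer_sk",
)

# Fact table: distributed on the most frequently joined surrogate key,
# sorted on the date key used in WHERE clauses.
fact_sales_ddl = create_table(
    "fact_sales",
    [
        ("customer_sk", "BIGINT NOT NULL"),
        ("dim_date_id", "INTEGER NOT NULL"),
        ("amount", "DECIMAL(18,2)"),
    ],
    diststyle="KEY",
    distkey="customer_sk",
    sortkey="dim_date_id",
)

print(dim_customer_ddl)
print(fact_sales_ddl)
```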

Can someone please confirm whether this is the correct combination?

Reference links that I have checked - This and This

Thank you!

Sanket

John Rotenstein · Accepted Answer · 2022-11-22 04:54:13Z

4

You are correct. In general:

If the tables are small, then DISTSTYLE ALL is fine. It replicates the tables to all nodes, thereby reducing cross-node data transfer.

Preferably, use the same DISTKEY on all tables that are JOINed. That way, rows with matching join keys are stored on the same node and can be joined without moving data between nodes.
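
To make the co-location idea concrete, here is a toy simulation in Python. It is only a sketch under loud assumptions: the slice count, the modulo "hash", and the tables (`orders`, `order_items`, `dim_customer`) are invented for illustration and do not reflect Redshift's actual hashing or storage internals. It counts how many join matches can be found without moving rows between slices.

```python
from collections import Counter, defaultdict

NUM_SLICES = 4  # assumption: arbitrary slice count for the illustration


def distribute_by_key(rows, key_col):
    """Place each row on a slice chosen from its key (modulo stands in for a hash)."""
    slices = defaultdict(list)
    for row in rows:
        slices[row[key_col] % NUM_SLICES].append(row)
    return slices


def replicate_all(rows):
    """Mimic DISTSTYLE ALL: every slice sees a full copy of the table."""
    return {s: list(rows) for s in range(NUM_SLICES)}


def local_join_count(left, right, join_col):
    """Count join matches where both rows already sit on the same slice."""
    total = 0
    for s in range(NUM_SLICES):
        right_counts = Counter(row[join_col] for row in right[s])
        total += sum(right_counts[row[join_col]] for row in left[s])
    return total


# Hypothetical data: 100 orders, two items per order, 10 customers.
orders = [{"order_id": i, "customer_sk": i % 10} for i in range(100)]
order_items = [{"order_id": i // 2, "item_id": i} for i in range(200)]
dim_customer = [{"customer_sk": c} for c in range(10)]

orders_by_order = distribute_by_key(orders, "order_id")

# Both tables distributed on the join column: every match is local.
colocated = local_join_count(
    distribute_by_key(order_items, "order_id"), orders_by_order, "order_id"
)

# order_items distributed on a different column: many matches are not local.
mismatched = local_join_count(
    distribute_by_key(order_items, "item_id"), orders_by_order, "order_id"
)

# A replicated dimension joins locally whatever the other table's key is.
dim_local = local_join_count(orders_by_order, replicate_all(dim_customer), "customer_sk")

print(colocated, mismatched, dim_local)
```

In this toy model, the matches missing from the mismatched case are the ones that would require rows to be shipped between nodes at query time.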


2

John is spot on, as usual. I'd just add that when joining a DISTSTYLE ALL dim table to a fact table, it doesn't matter what the dist key is on the fact table. Every join target is local. It will matter when joining two tables that both have dist keys. So focus on the fact-to-fact joins when selecting the dist key.
– Bill Weiner
Commented Nov 23, 2022 at 11:28
Hey @BillWeiner -- does the latest incarnation of Redshift automatically figure out the best DISTKEY and SORTKEY these days?
– John Rotenstein
Commented Nov 23, 2022 at 20:35
Redshift does have an auto distribution mode, but that is far from choosing the best dist key. It rarely chooses a key, picking ALL or EVEN instead. EVEN is equivalent to random, which is a join and GROUP BY performance killer. Unless there is a fall-off-a-log obvious dist key, I haven't seen RS make a good choice. Is your experience different?
– Bill Weiner
Commented Nov 23, 2022 at 20:45
