Cassandra data-modelling around email-system

Question

I need data modelling help as I haven't found a resource that tackles the same problem.

The use case is similar to an email system. I want to store a timeline of all emails a user has received and then fetch them back in three different ways:

All emails ever received
Mails that have been read by a user
Mails that are still unread by a user

My current model is as follows:

CREATE TABLE TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    read boolean,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE INDEX ON TIMELINE (userID, read);

The queries I need to support are:

SELECT * FROM TIMELINE where userID = '12';
SELECT * FROM TIMELINE where userID = '12' order by timestamp asc;
SELECT * FROM TIMELINE where userID = '12' and read = true;
SELECT * FROM TIMELINE where userID = '12' and read = false;
SELECT * FROM TIMELINE where userID = '12' and read = true order by timestamp asc;
SELECT * FROM TIMELINE where userID = '12' and read = false order by timestamp asc;

My questions are:

Should I keep read as a secondary index? It will be updated frequently and can create tombstones, which is a problem according to http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html.
Can we do an inequality check on a secondary index? I found that at least one equality condition must be present on a secondary index.
If this is not the right way to model this, please suggest how to support the above queries. Maintaining three different tables worries me because of the number of insertions (for read/unread): the number of users * emails viewed per day will be huge.

Aaron · Accepted Answer · 2015-04-27 21:57:35Z

Your indexed column (userID) is high cardinality, which makes a secondary index a poor fit. You'd probably want to manage that as a second (or third) column family (table) that you keep in sync manually from the application.

Perhaps something like:

CREATE TABLE READ_TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE TABLE UNREAD_TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

That gives you the ability to satisfy queries like:

SELECT * FROM READ_TIMELINE where userID = '12';
SELECT * FROM UNREAD_TIMELINE where userID = '12';
SELECT * FROM READ_TIMELINE where userID = '12' order by timestamp asc;
SELECT * FROM UNREAD_TIMELINE where userID = '12' order by timestamp asc;

That is, you rely on the natural clustering order (or its reverse) for the ORDER BY, and you can move emails from UNREAD to READ with a simple batch (one DELETE, one INSERT).
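As a minimal sketch of that move, assuming the two tables defined above: the email ID 'email-42' and the timestamp value are made-up placeholders, and the user ID is quoted because userID is declared as varchar.

```cql
-- Hypothetical values: user '12', email 'email-42', timestamp 1430171855000.
-- A logged batch ensures that both statements are eventually applied together.
BEGIN BATCH
  -- Remove the row from the unread timeline (this writes a tombstone).
  DELETE FROM UNREAD_TIMELINE
   WHERE userID = '12' AND timestamp = 1430171855000;
  -- Re-create the same row in the read timeline.
  INSERT INTO READ_TIMELINE (userID, emailID, timestamp)
  VALUES ('12', 'email-42', 1430171855000);
APPLY BATCH;
```

The DELETE has to name the full primary key (userID and timestamp), so the application needs the email's original timestamp at hand when it marks the email read.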

Now, you'll potentially end up with lots of tombstones in the UNREAD table as you mark emails read. Setting gc_grace_seconds (GCGS) low and compacting frequently can help somewhat, but you may also want to break those partitions up to avoid being overwhelmed by tombstones if thousands of emails fly in and get marked read.
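A rough sketch of both mitigations follows. The one-hour value, the UNREAD_TIMELINE_BY_DAY table name, and the day bucket column are illustrative assumptions, not part of the original answer. Note that a gc_grace_seconds shorter than your repair interval risks deleted rows reappearing, so pick the value with your repair schedule in mind.

```cql
-- Assumption: one hour instead of the default 864000 seconds (10 days).
ALTER TABLE UNREAD_TIMELINE WITH gc_grace_seconds = 3600;

-- Hypothetical bucketed variant: one partition per user per day, so the
-- tombstones from a busy day stay confined to that day's partition.
CREATE TABLE UNREAD_TIMELINE_BY_DAY (
    userID varchar,
    day varchar,        -- assumed bucket format, e.g. '2015-04-27'
    timestamp bigint,
    emailID varchar,
    PRIMARY KEY ((userID, day), timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

-- Reading one bucket; the application iterates over the days it needs.
SELECT * FROM UNREAD_TIMELINE_BY_DAY WHERE userID = '12' AND day = '2015-04-27';
```

The trade-off of bucketing is that listing all unread mail for a user now takes one query per bucket instead of a single partition read.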

edited Apr 27, 2015 at 21:57

Aaron

answered Apr 27, 2015 at 21:55

Jeff Jirsa

Good answer Jeff. I was thinking something along the same lines.
– Aaron
Commented Apr 27, 2015 at 21:57
There's a very real potential tombstone problem here that shouldn't be ignored, but I think it's probably the right direction.
– Jeff Jirsa
Commented Apr 27, 2015 at 22:21
@JeffJirsa Thanks for the answer and explanation. I still have one question: to render an all-email timeline, I would fire two queries at Cassandra and then do an in-memory merge, right?
– sangupta
Commented Apr 28, 2015 at 3:11
No - you'd do a single query against the TIMELINE table, and only use the READ_TIMELINE and UNREAD_TIMELINE if you needed the read status in the WHERE clause.
– Jeff Jirsa
Commented Apr 28, 2015 at 3:22
@JeffJirsa OK, got it. So we maintain three different timelines for a single user. Thanks. I will get back if I have more questions after implementation.
– sangupta
Commented Apr 29, 2015 at 3:21
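
The three-timeline layout settled on in the comments could be sketched roughly as below, assuming the TIMELINE table from the question and the UNREAD_TIMELINE table from the answer. The email ID and timestamp are made-up placeholders, and whether TIMELINE keeps its read column is left to the application.

```cql
-- A new email arrives: record it in the full timeline and the unread timeline.
BEGIN BATCH
  INSERT INTO TIMELINE (userID, emailID, timestamp, read)
  VALUES ('12', 'email-43', 1430190000000, false);
  INSERT INTO UNREAD_TIMELINE (userID, emailID, timestamp)
  VALUES ('12', 'email-43', 1430190000000);
APPLY BATCH;

-- All emails, newest first: a single query against TIMELINE, no merge needed.
SELECT * FROM TIMELINE WHERE userID = '12';
```

If TIMELINE keeps the read flag, the batch that moves an email from UNREAD_TIMELINE to READ_TIMELINE would presumably also need an UPDATE on TIMELINE to keep the three tables consistent.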
