3

I need data modelling help as I haven't found a resource that tackles the same problem.

The user case is similar to an email-system. I want to store a timeline of all emails a user has received and then fetch them back with three different ways:

  1. All emails ever received
  2. Mails that have been read by a user
  3. Mails that are still unread by a user

My current model is as under:

CREATE TABLE TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    read boolean,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE INDEX ON TIMELINE (userID, read);

The queries I need to support are:

SELECT * FROM TIMELINE where userID = 12;
SELECT * FROM TIMELINE where userID = 12 order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = true;
SELECT * FROM TIMELINE where userID = 12 and read = false;
SELECT * FROM TIMELINE where userID = 12 and read = true order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = false order by timestamp asc;

My queries are:

  1. Should I keep read as my secondary index as It will be frequently updated and can create tombstones - per http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html its a problem.

  2. Can we do inequality check on secondary index because i found out that atleast one equality condition should be present on secondary index

  3. If this is not the right way to model, please suggest on how to support the above queries. Maintaining three different tables worries me about the number of insertions (for read/unread) as number of users * emails viewed per day will be huge.

1 Answer 1

2

Your index (userID) is high cardinality - you'd probably want to manage that as a second (or third) CF that you manually sync with the application.

Perhaps something like

CREATE TABLE READ_TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE TABLE UNREAD_TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

That gives you the ability to satisfy queries like:

SELECT * FROM READ_TIMELINE where userID = 12;
SELECT * FROM UNREAD_TIMELINE where userID = 12;
SELECT * FROM READ_TIMELINE where userID = 12 order by timestamp asc;
SELECT * FROM UNREAD_TIMELINE where userID = 12 order by timestamp asc;

That is, you use the natural clustering order for the ORDER BY, and you can move emails from UNREAD to READ with a simple batch (one DELETE, one INSERT)

Now, you'll end up with potentially lots of tombstones in the UNREAD table, as you mark emails read. Setting GCGS low and using frequent compaction can help that somewhat, but you may also want to break those partitions up to avoid having tombstoneoverwhelming issues if you have thousands of emails fly in, get marked read.

6
  • 1
    Good answer Jeff. I was thinking something along the same lines.
    – Aaron
    Commented Apr 27, 2015 at 21:57
  • There's a very real potential tombstone problem here that shouldn't be ignored, but I think it's probably the right direction.
    – Jeff Jirsa
    Commented Apr 27, 2015 at 22:21
  • @JeffJirsa Thanks for the answer and explanation. I do have one query still: to render an all email timeline - I would be firing two queries on cassandra and then doing an in-memory merge, right?
    – sangupta
    Commented Apr 28, 2015 at 3:11
  • No - you'd do a single query against the TIMELINE table, and only use the READ_TIMELINE and UNREAD_TIMELINE if you needed the read status in the WHERE clause.
    – Jeff Jirsa
    Commented Apr 28, 2015 at 3:22
  • @JeffJirsa Ok got it. So we maintain three different timelines for a single user. Thanks. Will get back if have more questions after implementation.
    – sangupta
    Commented Apr 29, 2015 at 3:21

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.