3

I've got a table structure that can be summarized as follows:

pagegroup
* pagegroupid
* name

has 3600 rows

page
* pageid
* pagegroupid
* data

references pagegroup; has 10000 rows; can have anything between 1-700 rows per pagegroup; the data column is of type mediumtext and the column contains 100k - 200kbytes data per row

userdata
* userdataid
* pageid
* column1
* column2
* column9

references page; has about 300,000 rows; can have about 1-50 rows per page

The above structure is pretty straight forwad, the problem is that that a join from userdata to page group is terribly, terribly slow even though I have indexed all columns that should be indexed. The time needed to run a query for such a join (userdata inner_join page inner_join pagegroup) exceeds 3 minutes. This is terribly slow considering the fact that I am not selecting the data column at all. Example of the query that takes too long:

SELECT userdata.column1, pagegroup.name
FROM userdata
INNER JOIN page USING( pageid )
INNER JOIN pagegroup USING( pagegroupid )

Please help by explaining why does it take so long and what can i do to make it faster.

Edit #1

Explain returns following gibberish:

id  select_type  table      type    possible_keys        key      key_len  ref                         rows    Extra
1   SIMPLE       userdata   ALL     pageid                                                             372420
1   SIMPLE       page       eq_ref  PRIMARY,pagegroupid  PRIMARY  4        topsecret.userdata.pageid   1
1   SIMPLE       pagegroup  eq_ref  PRIMARY              PRIMARY  4        topsecret.page.pagegroupid  1

Edit #2

SELECT
u.field2, p.pageid
FROM
userdata u
INNER JOIN page p ON u.pageid = p.pageid;
/*
0.07 sec execution, 6.05 sec fecth
*/

id  select_type  table  type    possible_keys  key      key_len  ref                rows     Extra
1   SIMPLE       u      ALL     pageid                                              372420
1   SIMPLE       p      eq_ref  PRIMARY        PRIMARY  4        topsecret.u.pageid 1        Using index

SELECT
p.pageid, g.pagegroupid
FROM
page p
INNER JOIN pagegroup g ON p.pagegroupid = g.pagegroupid;
/*
9.37 sec execution, 60.0 sec fetch
*/

id  select_type  table  type   possible_keys  key          key_len  ref                      rows  Extra
1   SIMPLE       g      index  PRIMARY        PRIMARY      4                                 3646  Using index
1   SIMPLE       p      ref    pagegroupid    pagegroupid  5        topsecret.g.pagegroupid  3     Using where

Moral of the story

Keep medium/long text columns in a separate table if you run into performance problems such as this one.

6 Answers 6

4

What's the data type and purpose of columnX in the userdata table? It should be noted that any text data type (i.e excluding char, varchar) forces any temporary tables to be created on disk. Now since you're doing a straight join without conditions, grouping or ordering, it probably won't need any temporary tables, except for aggregating the final result.

I think it would also be very helpful if you show us how your indexes are created. One thing to remember is that while InnoDB concatenates the primary key of the table to each index, MyISAM does not. This means that if you index column name and search for it with LIKE, but still want to get the id of the page group; Then the query would still need to visit the table to get the id instead of being able to retrieve it from the index.

What this means, in your case, if I understand your comment to apphacker correctly, is to get the name of each users pagegroups. The query optimizer would want to use the index for the join, but for each result it would also need to visit the table to retrieve the page group name. If your datatype on name is not bigger than a moderate varchar, i.e. no text, you could also create an index (id, name) which would enable the query to fetch the name directly from the index.

As a final try, you point out that the whole query would probably be faster if the mediumtext was not in the page table.

  1. This column is excluded from the query you are running I presume?
  2. You could also try to separate the page data from the page "configuration", i.e. which group it belongs to. You'd then probably have something like:
    • Pages
      • pageId
      • pageGroupId
    • PageData
      • pageId
      • data

This would hopefully enable you to join quicker since no column in Pages take up much space. Then, when you needed to display a certain page, you join with the PageData table on the pageId-column to fetch the data needed to display a particular page.

3
  • Answer: #1 - Yes, i am not SELECTing the data column; #2 - Yes thats a workaround which "should" work, another possibility is to de-normalize the table a tad bit and add pagegroupid into userdata but the question is if there is something wrong in the table structure or the query. Commented May 9, 2009 at 9:02
  • most of the columns are varchar 100s including pagesource.name and userdata.field2 Commented May 9, 2009 at 9:09
  • It worked after I moved the "data" column into a separate table "pagetemp" that relates 1-to-1 with the page table. None of the indexes work otherwise. Commented May 9, 2009 at 10:40
2

The easy way to figure out what MySQL is doing with your query is to have it explain the query to you. Run this and have a look at the output:

EXPLAIN SELECT userdata.column1, pagegroup.name
FROM userdata
INNER JOIN page USING( pageid )
INNER JOIN pagegroup USING( pagegroupid )

MySQL will tell you in which order it processes the queries and what indexes it uses. The fact that you created indexes does not mean that MySQL actually uses them.

See also Optimizing queries with EXPLAIN

EDIT

The output of your EXPLAIN looks fine. It does a full table scan on the userdata table, but that is normal since you want to return all rows in it. The best way to optimize this is to rethink your application. Do you really need to return all 372K rows?

1
  • I've revised my question and added the result from explain command. Appears to be using correct indices but still 128 seconds just to execute. Commented May 9, 2009 at 7:39
2

I'm assuming the userdata table is very large and does not fit in memory. MySQL would have to read the entire table from harddisk, even if it needs only two small columns.

You can try to eliminate the need for scanning the entire table by defining an index that contains everything the query needs. That way, the index is not a way to facilitate a search into the main table, but it's a shorthand version of the table itself. MySQL only has to read the shorthand table from disk.

The index could look like this:

column1, pageid

This has to be non-clustered, or it would be part of the big table, defeating its purpose. See this page for an idea on how MySQL decides which index to cluster. The easiest way seems to make sure you have a primary key on pageid, which will be clustered, so the secondary column1+pageid index will be non-clustered.

2
  • I tried creating a two column index (pageid-sourceid) on the page table in order to create a short-circuit between userdata and pagesource which reduced execution time but not much. Commented May 9, 2009 at 10:47
  • Your comment above says performance got better with a 1:1 table. That really seems to suggest the index on (pageid,sourceid) was clustered in some way. Oh well... problem solved I guess.
    – Andomar
    Commented May 10, 2009 at 0:14
1

One possible issue is that MySQL uses only one index per query and maybe you don't have a single index with those columns -- or MySQL's query optimizer isn't picking it. What does EXPLAIN SELECT &c tell you here?

0
1

I would start with breaking the query up, to figure out if there is one slow and one fast part, or if both are slow (sorry, I'm no fan of the USING syntax, so I'm going to use ON):

SELECT 
  u.userdata, p.pageid
FROM
  userdata u
  INNER JOIN page p ON u.pageid = p.pageid

SELECT 
  p.pageid, g.pagegroupid
FROM
  page 
  INNER JOIN pagegroup g ON p.pagegroupid = g.pagegroupid

What does that give you? Running these with EXPLAIN EXTENDED will provide additional hints.

3
  • I've posted the output of the two queries. Explain extended returns similar queries in different syntax. Commented May 9, 2009 at 8:33
  • Looks like the second query is the trouble maker. Please include the info what indexes you have in place.
    – Tomalak
    Commented May 9, 2009 at 9:31
  • primary keys + all keys used in joins are indexed. For the three tables I have indexes on pagegroup.pagegroupid (PK), page.pageid (PK), page.pagegroupid (INDEX), userdata.userdataid (PK), userdata.pageid (INDEX), userdata.column1 (INDEX) Commented May 9, 2009 at 9:58
1

Looks like you're doing a join on all rows on userdata and then trying to select everything. That is every page in a pagegroup with userdata. Where's the WHERE clause? There's no LIMIT, how many results did you want? Why don't you get your row count down on userdata row in your explain result, that should speed up the query. Heh.

2
  • I need a dump of selected columns from userdata along with pagegroup.name for cross reference. I believe it should work fast enough if there was no "mediumtext" column in the page table. Commented May 9, 2009 at 8:17
  • Maybe you want to start recording this information in a log as it comes in instead of using SQL, maybe think about something other than SQL for this data, like non-normalized berkeley db or something.
    – Bjorn
    Commented May 9, 2009 at 10:26

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.