
We are using Python and LOAD DATA INFILE to load data from CSVs into our staging database. From staging, SQL scripts move the data into our actual production database.

LOAD DATA INFILE is lightning fast compared to selecting rows from staging and inserting them into production.

We are on MySQL 5.7 with InnoDB, and we have applied the following configuration to optimise our inserts:

Set innodb_autoinc_lock_mode to 2
Set innodb_buffer_pool_size to half of memory (16 GB)
Set innodb_log_buffer_size to 4 GB
Wrap the inserts in transactions
Use SET autocommit = 0;
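
For reference, a sketch of the equivalent my.cnf entries (our settings as listed above):

[mysqld]
innodb_autoinc_lock_mode = 2
innodb_buffer_pool_size  = 16G
innodb_log_buffer_size   = 4G

and the per-session pattern around each batch:

SET autocommit = 0;
START TRANSACTION;
-- INSERT ... SELECT statements here
COMMIT;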

Still, the insert from one table to another is significantly slower than LOAD DATA INFILE.

When I look at I/O writes, LOAD DATA INFILE reaches up to 30 MB/s, while normal inserts max out at about 500 KB/s.

Is there any way we could improve this performance, or do we need to completely rethink our approach? I can think of using SELECT ... INTO OUTFILE for the subqueries and loading the result back in with LOAD DATA INFILE, but it doesn't sound like the right approach.
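
For completeness, the round trip I have in mind would look roughly like this (the file path is illustrative, and resolvedDocumentKey is a hypothetical staging table/view holding the already-resolved rows):

SELECT documentClassCode, dId, fileTypeCode, internet, pathId,
       creationTime, signature, CSVimportId
INTO OUTFILE '/tmp/documentkey_resolved.csv'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
FROM mydb_stage.resolvedDocumentKey;  -- hypothetical: result of the SELECT below

LOAD DATA INFILE '/tmp/documentkey_resolved.csv'
INTO TABLE documentkey
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(documentClassCode, dId, fileTypeCode, internet, pathId,
 creationTime, signature, CSVimportId);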

INSERT INTO documentkey (documentClassCode,dId,fileTypeCode,internet,pathId,
                         creationTime,signature,CSVimportId) 
SELECT case when csv.`Document Class` is null
                then (select classCode from mydb.class where classDesc = 'Empty'
                    And LookupId = (select LookupId from mydb.Lookup
                              where LookupGroupCode = 'C'
                                and EntityLookedup = 'documentkey')
                    )
                else (select classCode from mydb.class
                              where classDesc = csv.`Document Class`
                                And LookupId = (select LookupId from mydb.Lookup
                                         where LookupGroupCode = 'C'
                                           and EntityLookedup = 'documentkey')
                    )
        end,
        csv.`dId`,
        (select typeCode from mydb.type
                Where typeDesc = csv.`File Type`
                And LookupId = (select LookupId from mydb.Lookup
                          where LookupGroupCode = 'T'
                            and EntityLookedup = 'documentkey')
        ),
        case    when csv.`message ID` is null
                then (select messageIncrId from message where internetdesc = 'Empty')
                else case   when    exists (select internetMessageIncrId
                                      from internetMessage
                                      where internetdesc = csv.`Internet Message ID`)
                            then    (select internetMessageIncrId
                                      from internetMessage
                                      where internetdesc = csv.`Internet Message ID`)
                            else    0
                    end
        end,
        case    when exists (select pathId from Path where pathDesc = csv.`path`)
                then    (select pathId from Path where pathDesc = csv.`path`)
                else 0
        end,
        case when csv.`Creation Time` <> ''
                  then STR_TO_DATE(csv.`Creation Time`, '%d/%m/%Y  %H:%i:%s')
                  else '2016-06-16 10:00:00' end,
        #STR_TO_DATE(csv.`Creation Time`, '%Y-%m-%d %H:%i:%s'),
        csv.`Signature Hash`,
        1
        #csv.`CSV import id`
FROM `mydb_stage`.`csvDocumentKey` csv
where csv.`dId` is not null and csv.threadId = @thread;

The SELECT part of the query on its own takes only a fraction of a second.

Explain:

id | select_type        | table   | partitions | type   | possible_keys      | key                | key_len | ref   | rows | filtered | Extra
1  | PRIMARY            | csv     | NULL       | ALL    | NULL               | NULL               | NULL    | NULL  | 1    | 100.00   | Using where
12 | DEPENDENT SUBQUERY | path    | NULL       | eq_ref | pathDesc_UNIQUE    | pathDesc_UNIQUE    | 1026    | func  | 1    | 100.00   | Using where; Using index
11 | DEPENDENT SUBQUERY | path    | NULL       | eq_ref | pathDesc_UNIQUE    | pathDesc_UNIQUE    | 1026    | func  | 1    | 100.00   | Using where; Using index
10 | SUBQUERY           | message | NULL       | const  | messageDesc_UNIQUE | messageDesc_UNIQUE | 2050    | const | 1    | 100.00   | Using index
9  | DEPENDENT SUBQUERY | message | NULL       | eq_ref | messageDesc_UNIQUE | messageDesc_UNIQUE | 2050    | func  | 1    | 100.00   | Using where; Using index
8  | DEPENDENT SUBQUERY | message | NULL       | eq_ref | messageDesc_UNIQUE | messageDesc_UNIQUE | 2050    | func  | 1    | 100.00   | Using where; Using index
6  | DEPENDENT SUBQUERY | type    | NULL       | eq_ref | typeDesc_UNIQUE    | typeDesc_UNIQUE    | 1026    | func  | 1    | 100.00   | Using index condition; Using where
7  | SUBQUERY           | Lookup  | NULL       | ref    | PRIMARY            | PRIMARY            | 6       | const | 3    | 10.00    | Using where
4  | SUBQUERY           | class   | NULL       | const  | classDesc_UNIQUE   | classDesc_UNIQUE   | 1026    | const | 1    | 100.00   | NULL
5  | SUBQUERY           | Lookup  | NULL       | ref    | PRIMARY            | PRIMARY            | 6       | const | 2    | 10.00    | Using where
2  | DEPENDENT SUBQUERY | class   | NULL       | eq_ref | classDesc_UNIQUE   | classDesc_UNIQUE   | 1026    | func  | 1    | 20.00    | Using index condition; Using where
3  | SUBQUERY           | Lookup  | NULL       | ref    | PRIMARY            | PRIMARY            | 6       | const | 2    | 10.00    | Using where

1 Answer

innodb_log_buffer_size should not be set to 4G. The default of 8M is usually sufficient; let's compromise on 100M, ok?

case when exists ( select ... ) then ( select ... ) else 0 end -->
coalesce( ( select ... ), 0 )
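
Applied to the path lookup from your query, that becomes:

coalesce( (select pathId from Path where pathDesc = csv.`path`), 0 )

and the same rewrite applies to the internetMessage lookup. Each row then runs the subquery once instead of twice.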

"Creation time" comes out either as YMD format or DMY; did you want that inconsistency?

Indexes needed:

INDEX(classDesc, LookupId, classCode)
INDEX(dId, threadId)  -- in either order
INDEX(LookupGroupCode, EntityLookedup, LookupId)
INDEX(pathDesc, pathId)
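
Assuming those column lists map to the tables visible in your query (a sketch; index names are illustrative):

ALTER TABLE mydb.class                  ADD INDEX class_desc_lookup (classDesc, LookupId, classCode);
ALTER TABLE mydb_stage.csvDocumentKey   ADD INDEX csv_did_thread    (dId, threadId);
ALTER TABLE mydb.Lookup                 ADD INDEX lookup_group      (LookupGroupCode, EntityLookedup, LookupId);
ALTER TABLE Path                        ADD INDEX path_desc_id      (pathDesc, pathId);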

Back to your complaint... Time the SELECT by itself. I think that is why the INSERT is so slow.
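
One way to time it: in the mysql command-line client, discard the result set so only the execution time is reported, then run the full SELECT from the INSERT unchanged:

pager cat > /dev/null
-- run the SELECT ... FROM `mydb_stage`.`csvDocumentKey` csv ... here, as-is
nopager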

Please provide SHOW CREATE TABLE if you wish to discuss this further.
