During twenty years of creating relational database applications for clients, I have optimized data loads primarily by changing INSERT statements into a BULK LOAD process.  In SQL Server, an external command line utility (bcp.exe) allows bidirectional data streams to flow.  In MySQL, the inward flow is managed by the LOAD DATA INFILE command.

The key to using these BULK LOAD processes is formatting the data properly in files before loading the data.  The best language for parsing large streams of data is Perl.  If you convert your INSERT  statement-generating PHP, Java, C#, VB, etc. to a Perl scrubbing and formatting operation and use the database’s BULK LOAD utility, I am sure you will see a significant increase in the performance of your data loads.  I have used operating system shell scripting languages to tie it all together for automation.

Another factor is the configuration settings of the relational database logging operation.  Most simplified databases do not log transactions because they do not enforce ACID principals.  In fact, most databases, relational or not, are based on straight-forward ISAM principals.  Transactional databases, those that require recovery of transactional (e.g. INSERT) failures, perform extra operations to allow for that recovery.   But doing this at every INSERT of a row slows the load process down unnecessarily.  In a data warehouse, this transactional logging can be turned off.  Data recovery is reduced to the size of each bulk load operation, but then that is what you are already managing with your load process.

A BULK LOAD operation in transactional databases will treat the entire load as a single transaction, so breaking up a large set of rows into separate files may make the processes more manageable.  The cause of failed file loads can be investigated and dealt with on a more granular level if the files are split into manageable pieces.  What defines “manageable” is determined on a case by case basis by the latency and use requirements of the data.



Leave a Reply

Your email address will not be published. Required fields are marked *