04 Sep

Things to check when your data warehouse loads take too long

If you ever find yourself suffering from a slow data load problem, let me say without a doubt: there is no situation where the loading of data into a data warehouse should go beyond a couple of hours. If a data warehouse here at WhippsX NHS Trust can load 2 years' worth of data from 3 different trusts in less than 2 hours, any data warehouse should be able to load a day's worth of data in a few minutes, or at the very least within an hour.

If a data warehouse load is taking longer than it should, then it is most likely disk bound. Generally speaking, SQL Server is heavy on disk usage; a SQL Server database server is constrained far more by its disks than by memory, CPU power or network.

The reason is simple: if you update 1 row in a SQL table, that one action can hit the disk dozens of times. After all, the server has to (1) locate the table on the disk, (2) locate the record in the table, (3) read the data from the table, (4) prepare the update in a temporary area, (5) check whether the data has changed since it was read, (6) write the table changes, (7) update any associated indexes, and (8) for all of those reads and changes, write to a log file. Changing 10,000 table rows can therefore mean hundreds of thousands of disk operations.

So with this in mind, if you ever find a database or data warehouse load taking longer than it should, things you can check and change are as follows:

  1. Make sure that all databases involved in the load have their recovery model set to SIMPLE (in Enterprise Manager, right-click on the database, select Properties, then Options); see the recovery model sketch after this list.  The FULL recovery model means SQL Server writes massive log files, which slows down disk processing
  2. Make sure that all three parts of the database are on separate disks: index files, log files and data files should each be on a different disk (otherwise one disk will be hammered).
  3. If possible, you can also split the data, log and index storage into multiple Windows files spread over different disks for even better performance; see the filegroup sketch after this list.
  4. Check the setup of the file store.  SQL Server keeps log files, data files and index files for each database; make sure that all of the associated files are on separate disks to balance out the work
  5. Check how many copies of the data you are processing.  For a data warehouse there should only ever be three copies at most: the source data (flat file or transactional database), an interim staging copy and the final data warehouse.   Any more copies are overkill and result in too much disk activity
  6. Balance the indexes: too many indexes on your database tables can be as bad as too few.   If you have too many indexes, updating records takes too long because the server has too much index maintenance to do
  7. If a database or data warehouse is only ever going to be updated overnight, any insertion into a table should use a table lock hint (TABLOCK); see the sketch after this list. This gives the load process exclusive access to the table, so it does not have to re-check for changes before the update or write as much to the log files, cutting the disk processing by around a third
  8. If you are updating many records in a table, turn off the non-primary indexes during the insertion/update and turn them back on once processing is complete; see the index rebuild sketch after this list.  This will greatly improve loading time.  As a rule of thumb, if you are inserting or updating more than 20,000 rows AND this is more than 10% of the total record count in the table, disabling the indexes is a good test to try.
  9. If you are updating multiple columns in a table, it is far quicker to perform all of the updates in one command than to run separate updates, even though the UPDATE statement will be more complex, because the server only has to process the records once; see the update sketch after this list.
  10. If the data insertion is being performed by a SQL Server Integration Services (SSIS) package rather than a stored procedure, then massive performance improvements can be found by (again) ticking the table lock option and by setting the commit size on the destination.  The ideal commit size will vary with the number of rows and the number of columns in the table, but somewhere around 10,000 is a good starting point.
  11. If the data is coming from another relational data source (SQL Server, Oracle, MySQL, etc.) and the number of insertions is likely to go over 100,000, then it is worth recoding the INSERT statement to use BULK INSERT instead, as this takes care of a lot of the previous suggestions on its own; see the BULK INSERT sketch after this list.
  12. If any of the above are out of your control (the loading is done by a product supplied by a third party) and the loads take more than one or two hours, then their design is flawed and they should be called in to rectify the problems.
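
To make point 1 concrete, here is a minimal T-SQL sketch of switching the recovery model and then checking it. The database names (StagingDB, WarehouseDB) are placeholders for whatever your own load uses.

    -- Switch the databases involved in the load to the SIMPLE recovery model
    -- (StagingDB and WarehouseDB are placeholder names).
    ALTER DATABASE StagingDB   SET RECOVERY SIMPLE;
    ALTER DATABASE WarehouseDB SET RECOVERY SIMPLE;

    -- Confirm the recovery model of every database on the server.
    SELECT name, recovery_model_desc
    FROM   sys.databases;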
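For points 2 to 4, the sketch below shows one way of spreading a database over separate drives: an index filegroup with its file on one disk and the log on another. The drive letters, logical file names and the FactAdmissions table are assumptions; substitute your own layout.

    -- Add a filegroup to hold indexes, with its file on a separate disk (F:).
    ALTER DATABASE WarehouseDB ADD FILEGROUP INDEXES;

    ALTER DATABASE WarehouseDB
        ADD FILE
        (
            NAME = WarehouseDB_Indexes1,
            FILENAME = 'F:\SQLIndexes\WarehouseDB_Indexes1.ndf',
            SIZE = 1GB,
            FILEGROWTH = 512MB
        )
        TO FILEGROUP INDEXES;

    -- Point the log file at its own drive (E:); the logical name
    -- WarehouseDB_log is an assumption, and the move takes effect
    -- once the database has been taken offline and brought back.
    ALTER DATABASE WarehouseDB
        MODIFY FILE (NAME = WarehouseDB_log,
                     FILENAME = 'E:\SQLLogs\WarehouseDB_log.ldf');

    -- New non-clustered indexes can then be created on the index filegroup.
    CREATE NONCLUSTERED INDEX IX_FactAdmissions_PatientId
        ON dbo.FactAdmissions (PatientId)
        ON INDEXES;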
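For point 7, an overnight insert with a table lock hint looks something like this; the staging and fact table names and columns are placeholders.

    -- Take an exclusive table lock for the duration of the insert.
    INSERT INTO dbo.FactAdmissions WITH (TABLOCK)
           (PatientId, AdmissionDate, WardCode)
    SELECT PatientId, AdmissionDate, WardCode
    FROM   Staging.Admissions;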
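For point 8, non-clustered indexes can be disabled before a big load and rebuilt afterwards, as in this sketch (the index and table names are placeholders; do not disable the clustered or primary key index, or the table becomes unreadable).

    -- Disable the non-clustered indexes before the load.
    ALTER INDEX IX_FactAdmissions_PatientId ON dbo.FactAdmissions DISABLE;
    ALTER INDEX IX_FactAdmissions_WardCode  ON dbo.FactAdmissions DISABLE;

    -- ...run the insert/update here...

    -- Rebuilding re-enables each index and recreates it in a single pass.
    ALTER INDEX ALL ON dbo.FactAdmissions REBUILD;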
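For point 9, the difference is between several passes over the table and one; the column names and values below are made up purely for illustration.

    -- Three separate passes over the same rows...
    UPDATE dbo.FactAdmissions SET WardCode   = 'W01' WHERE AdmissionDate < '20120401';
    UPDATE dbo.FactAdmissions SET Specialty  = 'GEN' WHERE AdmissionDate < '20120401';
    UPDATE dbo.FactAdmissions SET IsArchived = 1     WHERE AdmissionDate < '20120401';

    -- ...versus one pass that sets all three columns at once.
    UPDATE dbo.FactAdmissions
    SET    WardCode   = 'W01',
           Specialty  = 'GEN',
           IsArchived = 1
    WHERE  AdmissionDate < '20120401';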
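For point 11, BULK INSERT works from a flat file, so this sketch assumes the source rows have first been exported to a delimited file; the file path, delimiters and table name are all placeholders.

    -- Minimally logged bulk load from a pipe-delimited extract file.
    BULK INSERT dbo.FactAdmissions
    FROM 'D:\Extracts\admissions.txt'
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR   = '\n',
        TABLOCK,                 -- take the table lock, as in point 7
        BATCHSIZE       = 10000  -- commit in batches, as in point 10
    );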