Tuesday, September 27, 2011

Problems with the SCD

The SCD has issues with ease of use and performance. I’ll start with the most important one: Performance. How the SCD works is for each incoming row of data it issues a SQL command to check against the reference (or Dimension) table in the database to compare the incoming row against its corresponding row in the reference (you can watch this happening in SQL profile). This isn’t a problem for small reference tables, but once you start processing thousands of incoming rows against tables with thousands of reference rows, performance starts to drag, because it is doing these row by row checks. The only performance tuning option you have at your disposal is to index the Business Key in the reference table. It would be much better if it was possible to cache the reference table in memory so look-ups could be done in memory instead of row by row against the database – according to this Connect article it may be on the list for the next release.

In terms of ease of use, there’s a couple of annoying things that can trip you up with this component. First of all, every time you complete the wizard, it creates new output transformations, deleting the old ones. If you have customized these in anyway – e.g. adding an update date column - it gets annoying fast. Fortunately the workaround is easy – create your own transformations to receive the outputs independently of the SCD, and when the wizard completes, just delete the outputs it creates and re-map the output data flows to your own transformations.

Secondly, the Wizard is actually disconnected from what is stored in the package for the data flow. The wizard’s data is stored in a separate chunk of XML within the package definition. What this means in practice is if you use the Advanced Editor to make any changes, these will not be picked up by the wizard if you run it again. So its quite easy to make tweaks that get lost if you re-use the wizard.

No comments:

Post a Comment