Monday, 4 August 2008

DW 2.0: The Architecture for the Next Generation of Data Warehousing

There is a new book on data warehousing as captioned from William H Inmon.

http://www.amazon.com/2-0-Architecture-Generation-Warehousing-Management/dp/0123743192/ref=wl_it_dp?ie=UTF8&coliid=I1QPVT468NJ02Q&colid=PCMLLT8JEW9C

Is his design concepts / principles another school of thoughts from Dr Kimball? Is it common in the DW industry?

Saturday, 2 August 2008

Part 2 - data model analysis

Current model
I have spent a few days now looking at the existing data model. The model mainly consists of ETL related objects (normalised - 3NF), and reporting tables (wide denormalised). Access to reporting tables are done by joining various ETL tables to the reporting tables, resulting in very complex expensive joins.

Where we want to be
We want to reintroduce the basic fundamentals of data warehousing here, without introducing too many changes, as this is a live system, where a lot of investment has been made. Try to protect existing investment by reusing as many components as possible.

The key things that I would like to re-emphasise are:
  • ETL tables should be off-limits to the business. No users or reports should query these ETL tables.
  • we need to move from existing model to a star schema.

OK, so let's now braindump what's important here.

FACT tables
  • candidate for FACT tables (pnl vector, sensitivities, limits)
  • asset class covered will be IR, FX, CS, AB (asset backed) and EQ. Will all the data received be of the same grain? i.e. will they all be additive?
  • Make sure FACT table design is REPORT driven, not ETL driven. While there is pressure to be able to load data as fast as possible, remember that the success of this risk system will not be judged by how fast the data can be pumped in, but how quickly users can pull the data out in a timely fashion, as well as whether data can be presented in a useful format for users
  • Currently I have identified about 30 attributes used to store these data in its current format
  • Remembering that we will try to perform aggregation on the fly, it would be preferable to ensure that FACT tables stay as small as possible. One way to achieve this is by moving the common attributes out to common dimension tables, and introduce use of surrogate keys.

DIMENSION tables
  • Dr Kimball suggests 15-25 dimension tables max. I think for this purpose I want to aim for no more than 10.
  • I have a feeling that we will not be able to create a pure star schema, as there are some complex data structure which will force me to snowflake the design. OK, wll try to minimise this

Common dimensions can be identified as follows
  1. Date
  2. Hierarchy
  3. Asset Class/Accounting Type and grouping
  4. Currency and currency group
  5. Bucket detail information
  6. Product details
  7. IR/FX detailed attributes
  8. CS/AB detailed attributes
  9. EQ detailed attributes
  10. Limit details