Current model I have spent a few days now looking at the existing data model. The model mainly consists of ETL related objects (normalised - 3NF), and reporting tables (wide denormalised). Access to reporting tables are done by joining various ETL tables to the reporting tables, resulting in very complex expensive joins.
Where we want to be We want to reintroduce the basic fundamentals of data warehousing here, without introducing too many changes, as this is a live system, where a lot of investment has been made. Try to protect existing investment by reusing as many components as possible.
The key things that I would like to re-emphasise are:
ETL tables should be off-limits to the business. No users or reports should query these ETL tables.
we need to move from existing model to a star schema.
OK, so let's now braindump what's important here.
FACT tables
candidate for FACT tables (pnl vector, sensitivities, limits)
asset class covered will be IR, FX, CS, AB (asset backed) and EQ. Will all the data received be of the same grain? i.e. will they all be additive?
Make sure FACT table design is REPORT driven, not ETL driven. While there is pressure to be able to load data as fast as possible, remember that the success of this risk system will not be judged by how fast the data can be pumped in, but how quickly users can pull the data out in a timely fashion, as well as whether data can be presented in a useful format for users
Currently I have identified about 30 attributes used to store these data in its current format
Remembering that we will try to perform aggregation on the fly, it would be preferable to ensure that FACT tables stay as small as possible. One way to achieve this is by moving the common attributes out to common dimension tables, and introduce use of surrogate keys.
DIMENSION tables
Dr Kimball suggests 15-25 dimension tables max. I think for this purpose I want to aim for no more than 10.
I have a feeling that we will not be able to create a pure star schema, as there are some complex data structure which will force me to snowflake the design. OK, wll try to minimise this
Overview An existing market risk system is processing 800-900 million rows daily. This number is made up of leaf node level data (pnl vector and sensitivities). Data is then aggregated up the hierarchy, as well as along dimensions, aggregated data is then persisted in the database.
The same leaf node data can belong to multiple hierarchies. As users add more hierarchies, or dimensions, even with the same amount of raw data, increase in data volume is becoming insupportable.
Quite often, users do not get their VaR numbers until late afternoon, and when they get it, reporting performance is not very good.
Analysis
Daily leaf node data for both pnl and sensitivities amount to ~10 million rows
ETL is spending ages loading these rows and aggregating the data up the hierarchy and along dimensions
As a result daily partition size of the FACT tables are enormous
Development effort has been focused on reducing ETL time. System is ETL driven, rather than report driven.
Data model was originally designed as a data warehouse, however quickly turned into a hybrid system (denormalised FACT tables with a lot of normalised supporting tables)
Current thoughts
10 million rows seem to be a "tiny" amount
Rather than persisting the aggregated data, how about we perform aggregation on-the-fly.
Can the database cope with 20 concurrent requests to produce these aggregated data?
How about 100 requests?
What will the speed be like for a global report, where most of the leaf node data will have to be read?
With analytical functions required to produce VaR ranking, can the database cope with aggregation plus ranking analytics?
I am starting this blog because I am hoping that I have finally found a media where I can just put down summaries of my experiences working as data architect for various projects, mainly in the financial world.
Out of the various financial projects, risk systems have always interested me the most, as they posses certain unique challenges. In the current credit climate, demands for risk systems to perform well are higher than ever.
As a data architect, I am finding it hard to find resources specific to data architecting related to risk systems. So hopefully some experts out there would find this blog and start contributing.