An existing market risk system is processing 800-900 million rows daily. This number is made up of leaf node level data (pnl vector and sensitivities). Data is then aggregated up the hierarchy, as well as along dimensions, aggregated data is then persisted in the database.
The same leaf node data can belong to multiple hierarchies.
As users add more hierarchies, or dimensions, even with the same amount of raw data, increase in data volume is becoming insupportable.
Quite often, users do not get their VaR numbers until late afternoon, and when they get it, reporting performance is not very good.
Analysis
- Daily leaf node data for both pnl and sensitivities amount to ~10 million rows
- ETL is spending ages loading these rows and aggregating the data up the hierarchy and along dimensions
- As a result daily partition size of the FACT tables are enormous
- Development effort has been focused on reducing ETL time. System is ETL driven, rather than report driven.
- Data model was originally designed as a data warehouse, however quickly turned into a hybrid system (denormalised FACT tables with a lot of normalised supporting tables)
- 10 million rows seem to be a "tiny" amount
- Rather than persisting the aggregated data, how about we perform aggregation on-the-fly.
- Can the database cope with 20 concurrent requests to produce these aggregated data?
- How about 100 requests?
- What will the speed be like for a global report, where most of the leaf node data will have to be read?
- With analytical functions required to produce VaR ranking, can the database cope with aggregation plus ranking analytics?
No comments:
Post a Comment