Tracking historical changes within a dimension is a common task in data warehousing and is well covered by Ralph Kimball’s slowly changing dimension (SCD) methods. In short, the SCD methods proposed by Ralph Kimball assume that the source system (for example the ERP system) doesn’t keep historical versions of its data records, so changes need to be detected at the time when data is loaded into the warehouse. To keep historical values, versions of the master data records are created to memorize each state of the original record together with a valid from/to timestamp, so that fact data can be joined to the corresponding dimension data. But the valid from/to dates are usually not a good choice for joining fact data to the associated dimensions, because this would result in range lookups (ETL) or date-range (between) joins (in SQL or ELT). The surrogate key concept offers a good solution here, by assigning a unique key (the surrogate key) to each version of a record. Now, this key can be used as a direct inner join from the fact table to its dimensions. This approach moves the time-consuming process of resolving date ranges from query time to data loading time, so it has to be performed only once. Query performance now benefits from the simplified link structure between the tables.

However, there may be some cases where you find valid from/to dates in the original source system. In this case, the historical values are provided by the source system and it’s usually not necessary for the data warehouse to track the changes. While this sounds much simpler than the case with missing validity dates, it’s usually a challenging situation, especially when past records (and their valid from/to dates) may be modified. For example, a given date range could be split or merged, or the from and to dates may shift. In either case, the surrogate keys of some fact rows would point to the “wrong” dimension record afterwards. So, for these cases you will need to periodically reload parts of your data warehouse (for example the last three months) or, in some rare cases, track the changes and adjust the surrogate keys of the fact tables. I’m saying rare cases because update operations on fact tables that are tuned for high-volume bulk loads and bulk queries are usually not a good idea, so you may want to implement a partition-wise recreation of the fact table (partition switch operations), which adds some complexity to the overall workload management.

However, after this intro, my post today is about a situation where you have several linked tables in the source system, all with a valid from/to date. You may find this situation, for example, in SAP’s human resources tables, where the properties of an employee are stored in so-called info types which are independently versioned by valid from/to date ranges. In this post, I’m using a much more simplified scenario with four tables, reflecting a very simple human resources model: a base table Employee and three detail tables, all joined by the EmployeeNo field. Each table may contain multiple versions of data, and therefore each table has valid from/to fields to distinguish the versions. In my example I’m using the approach of an inclusive ValidFrom and an exclusive ValidTo. If you take a look at the first two rows of the OrgUnit table, for example, this means that employee 100 was in the organizational unit “Delivery” until December and then, starting with January, in “PreSales”. For each of the four tables, EmployeeNo together with ValidFrom forms a primary key.
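The inclusive-ValidFrom / exclusive-ValidTo convention above can be sketched in a few lines. This is my own minimal illustration (the dates and values are made up, not taken from the post's tables); the key point is that a boundary day belongs to the new version only, so the intervals cover every day exactly once, with no gaps and no overlaps:

```python
def version_at(versions, day):
    """Return the value valid on `day`, where each version is a tuple
    (valid_from, valid_to, value) with an inclusive valid_from and an
    exclusive valid_to. ISO date strings compare correctly as text."""
    for valid_from, valid_to, value in versions:
        if valid_from <= day < valid_to:
            return value
    return None  # no version covers this day

# Two versions for employee 100 in OrgUnit (illustrative dates only).
org_unit = [
    ("2010-01-01", "2011-01-01", "Delivery"),
    ("2011-01-01", "9999-12-31", "PreSales"),
]

# The boundary day 2011-01-01 falls into the *new* version only.
assert version_at(org_unit, "2010-12-31") == "Delivery"
assert version_at(org_unit, "2011-01-01") == "PreSales"
```

The half-open convention also makes the primary key (EmployeeNo, ValidFrom) sufficient to order and delimit the versions, since each ValidTo equals the next version's ValidFrom.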
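To make the contrast between a date-range (between) join and a surrogate-key join concrete, here is a small sketch using SQLite from Python. All table and column names (DimOrgUnit, Fact, SK, OrgUnitSK) are my own assumptions for the sketch, not taken from the post; it only shows that the range join and the pre-resolved equi-join return the same rows, while the equi-join is structurally much simpler:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension: one row per version, with a surrogate key (SK) and a
# half-open validity interval (ValidFrom inclusive, ValidTo exclusive).
cur.execute("""CREATE TABLE DimOrgUnit
               (SK INTEGER, EmployeeNo INTEGER, OrgUnit TEXT,
                ValidFrom TEXT, ValidTo TEXT)""")
cur.executemany("INSERT INTO DimOrgUnit VALUES (?,?,?,?,?)", [
    (1, 100, "Delivery", "2010-01-01", "2011-01-01"),
    (2, 100, "PreSales", "2011-01-01", "9999-12-31"),
])

# Fact table carrying both the transaction date and the surrogate key
# that was resolved once at load time.
cur.execute("""CREATE TABLE Fact
               (EmployeeNo INTEGER, TxDate TEXT,
                OrgUnitSK INTEGER, Amount REAL)""")
cur.executemany("INSERT INTO Fact VALUES (?,?,?,?)", [
    (100, "2010-06-15", 1, 500.0),
    (100, "2011-03-01", 2, 700.0),
])

# Date-range join: every fact row must be compared against two date columns.
range_join = cur.execute("""
    SELECT f.TxDate, d.OrgUnit
    FROM Fact f
    JOIN DimOrgUnit d
      ON d.EmployeeNo = f.EmployeeNo
     AND f.TxDate >= d.ValidFrom AND f.TxDate < d.ValidTo
    ORDER BY f.TxDate""").fetchall()

# Surrogate-key join: a plain inner equi-join on the resolved key.
sk_join = cur.execute("""
    SELECT f.TxDate, d.OrgUnit
    FROM Fact f
    JOIN DimOrgUnit d ON d.SK = f.OrgUnitSK
    ORDER BY f.TxDate""").fetchall()

assert range_join == sk_join  # both resolve to the same dimension versions
```

The range predicate has to be evaluated per fact row at query time, whereas the surrogate-key lookup is paid once during loading, which is exactly the trade-off described above.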
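When several detail tables are versioned independently, their change dates differ, so joining them requires a common timeline. One common building block (my own sketch of the general idea, not necessarily the exact method the post goes on to develop) is to take the union of all ValidFrom/ValidTo boundaries: between two adjacent boundary dates, every table holds at most one version, so the tables can then be joined interval by interval:

```python
def combined_intervals(*tables):
    """Given several lists of (valid_from, valid_to) half-open intervals,
    return the half-open intervals of the combined timeline: the union of
    all boundary dates, paired into adjacent intervals."""
    boundaries = sorted({d for table in tables for interval in table
                           for d in interval})
    return list(zip(boundaries, boundaries[1:]))

# Illustrative intervals for two independently versioned detail tables.
org_unit = [("2010-01-01", "2011-01-01"), ("2011-01-01", "9999-12-31")]
salary   = [("2010-01-01", "2010-07-01"), ("2010-07-01", "9999-12-31")]

timeline = combined_intervals(org_unit, salary)
# Three combined intervals: the salary change on 2010-07-01 splits the
# first OrgUnit interval, even though OrgUnit itself did not change then.
```

Within each combined interval, a simple point-in-time lookup (any day inside the interval) picks the matching version from every detail table.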