пятница, 19 сентября 2014 г.

Modeling Date and Time


Date and time seem trivial subjects at first, but when you think about it for a couple of seconds there’s a lot more to it. For instance, think about how you would account for different time zones your organization operates in. Or think about how to handle fiscal years that are out of sync with the regular calendar year. The week numbering in Europe is based on ISO 3306, whereas in the U.S. week numbering starts at 1 on January 1, causing possible differences in results when an EU subsidiary reports weekly results to a U.S.-based head office. ISO also has a different year calculation, which means that the ISO year for a date can differ from the calendar year of the same date. And how would you model date and time? What granularity do you want to use? We give you our thoughts on the subject here and a couple of hints and best practices to get you started, but ultimately the design decisions should be made by you based on the case at hand.

Time Dimension Granularity
Sometimes when people start with designing a dimensional model, they think that date and time should be stored in a single dimension table. This assumption is perhaps not entirely wrong, but usually not the best one to make. Suppose you want to be able to reference a date-time dimension by seconds. There are 24 × 60 × 60 seconds in each day. For a year consisting of 365 days, you would need 31,536,000 rows in the dimension table, and because you usually store 10 or more years, that adds up to over 315 million rows. Doesn’t look like a sensible approach. The most extreme alternative is to create a dimension table per date-time part. In that case, you’d have to create dimension tables for year, month, day, hour, minute, and second. These tables would all be very small but would require six foreign keys in the fact table. This is also not a very good approach. It’s best to stay somewhere in the middle between these two extremes by creating one date and one time dimension. Even in the case where you would need your transactions to reference a specific second, the dim_time table would only contain 86,400 rows. Our sample WCM data warehouse uses a time dimension by minute, which is only 1,440 rows.

Local Versus UTC Time
When dealing with time, you must also consider the challenges presented when your organization covers multiple time zones. In that case, it might be beneficial to add both local and UTC time to the time dimension table so any transaction can always be looked at from different perspectives.

UTC is an international standard time and time notation governed by the ISO 8601 standard. The acronym is a compromise between the French TUC (Temps Universel Coo ̈rdonne ́) and the English CUT (Coordinated Universal Time). Anyone familiar with military terminology might have heard of ‘‘Zulu time,’’ which is the same as UTC. International time zones are indicated as UTC plus or minus a number of hours. This offset can even vary over time when countries change the daylight savings start or end date, or when a government (for example, Venezuela) decides to deviate from international standards. The fixed offset in our time dimension table is therefore a simplification of the real world. 


Smart Date Keys
There is one exception to the use of meaningless surrogate keys—the key for the date dimension. We prefer to use an integer in the form YYYYMMDD for two reasons. First, this key can be easily generated from an incoming date and therefore saves a lookup operation on the date dimension table. Second and probably more important is the fact that this numbering scheme can be used to partition your fact tables. Table partitions can also get a name with a date number extension to easily identify, such as P_200808 for all the August 2008 transactions. Other than these two reasons, the use of a smart date key is prohibited and can under no circumstances be used directly in queries on the fact table without joining the corresponding date dimension table.

Handling Relative Time
In many cases, we are not interested in a specific time period such as a week or a month, but want to know what a specific time period looks like compared to another one. The question is how to set up the time dimension to handle relative time, such as last week, last month, or same month last year. Of course, you can write SQL to retrieve the corresponding rows from the database but then you either need to adjust the statement for each new period, or need to use functions and expressions to make a statement dynamic. The examples in the following query blocks show what we mean. The first query retrieves the result from the current and last month, but the statements are hard coded: 


SELECT d.year4, d.month_number, f.sum(revenue) AS revenue FROM fact_sales AS f
INNER JOIN dim_date d ON f.order_date_key = d.date_key
WHERE d.year4 = 2008 AND d.month_number BETWEEN 7 AND 8

A dynamic version might look something like this:

page214image3808








This will work fine, except when the current month is January. To avoid complex coding like this it’s better to add additional columns to the time dimension table that can be updated during the ETL process. As an example, we’ll use two columns: current_month and last_month. When applicable, they will contain the value 1, otherwise 0. The same technique can be applied to indicate year to date, month to date, and so on.
Another technique we’d like to mention is the use of offsets to be able to perform date arithmetic in a simple way. A good example of this is the use of Julian dates, which are consecutive integer values. This makes it very easy to filter on the last 30 days or 30 days prior to a chosen date. Again, you can expand this to weeks, months, and years as well. By setting the current week, month, or year to 0 and counting backwards, it’s possible to have both a current period indicator and a means of filtering on the last three months, last six quarters, etc.

Table below shows a partial collection of rows and columns from the time dimension to illustrate this technique. The current date in this example is July 1, 2009 with a week starting on Monday.
In this table, you can see three ways of handling relative time:
  • True/false indicators—The current and last week columns are updated each week. Retrieving the current week is a matter of adding current_week = 1 to a query.
  • Sequence number—The Julian date is an integer that allows for simple arithmetic, e.g., ‘‘last 30 days.’’ Any date can be used as the offset (starting point).
  • Sequence with offset 0—This combines the first two options in one column. Current week is always 0, last week –1, and so on.
The third option might appear to be the ideal solution to handle all relative time issues, but there’s a catch. The extra current and last week columns with just 1 and 0 enable you to do calculations as well. In many cases, you need
to see current and last week’s revenue side by side, and this can easily be accomplished by using a statement like the following: