пятница, 19 сентября 2014 г.

Multi-Valued Dimensions and Bridge Tables


One of the hardest to solve problems in dimensional modeling is the multi-valued dimension. Again, the actor table is a perfect example: Our customers buy or rent a single DVD but this item usually has multiple actors appearing in the movie. Depending on the kind of information we’d like to retrieve from the data warehouse, there are two possible solutions to solve this problem. The first one is to list the actors in a text field at the movie level, as in Table:

This is fine for informational purposes and answers the question ‘‘Which actors played in movie X?’’ but cannot be used for other, more interesting, questions as ‘‘In which movies did actor Y play?’’ or ‘‘Which are the top 10 actors based on rental revenue?’’ To solve this last problem, you need a way to relate multiple facts to multiple dimension values. And because you cannot directly create many-to-many relationships in a database you need a bridge table to perform the task. Schema below shows the part of the data warehouse model with the dvd_actor bridge table. 

Note that in this diagram, the table dvd_actor_bridge contains a dvd_actor_factor field. This is a necessary addition to the bridge table to force SQL to return the correct results. If we omit the factor in our calculations involving actors, the revenue will be multiplied by the number of actors linked to a specific DVD. Debate is, of course, open to whether this should be an equal division (10 actors, factor 0.1) or that the leading actor gets a higher factor (10 actors of which one is Brad Pitt; Brad counts for 0.55, the others for 0.05 each). This poses a problem when you want an answer to the question ‘‘How much revenue did we generate with the movies starring Brad Pitt?’’ In that case, only one actor is selected causing the factor to return a value that’s too low so the calculation factor should be either omitted or set to the value 1 in this case. Things get really messy when we want to know our top 10 movie stars based on the total revenue of the movies they played in. Some of these actors might have co-starred in one or more movies in the past, causing incorrect results. The message here is: Be careful when modeling these kinds of relationships and make sure that you provide access only to bridged relations when the user or analyst working with the data knows exactly what he or she is dealing with.