пятница, 19 сентября 2014 г.

Querying Star Schemas


First of all, if you want to really learn how work with SQL, you can learn it on sql-ex.ru for free.

So far, you’ve seen diagrams linking dimension and fact tables together, but in a database these are just tables. The relations in the diagrams are foreign key constraints, which means that it will not be possible to insert a product key into the orders fact table if this key does not exist in the dim_product dimension table. Getting data out of the database is another thing. For this, you need a query language called Structured Query Language, or SQL. If you’re not familiar with SQL, don’t worry—the SQL you need for retrieving information from a star schema is not very complicated. And although this book is not a SQL text, we cover enough of the basics here to enable you to start writing queries to retrieve data from dimensional data marts. All we need for this is the SELECT statement, a way of joining the different tables together, and an understanding of how to group the results from the query. For starters, have a look at the basic SQL building blocks for selecting data:

  • SELECT—A list of the columns, constants, and expressions you want to retrieve from the database.
  • FROM—The tables and views that contain the data you need, including the relationships between those tables and views.
  • WHERE—The restrictions you want to apply to the data in the selected tables, views, and expressions, excluding the aggregations.
  • GROUP BY—Specifies the summary level of the query. All non-aggregate columns and expressions in the SELECT list should be part of the GROUP BY statement.
  • HAVING—Contains conditions on summarized expressions, e.g., having sum (revenue) > 100,000.
  • ORDER BY—Denotes the order in which the results will be displayed.

If only the total revenue is needed from the fact table, there’s no need to use
JOINs or GROUP BY clauses, just the following: 

SELECT SUM(revenue)   FROM fct_orders

This little query doesn’t tell you very much, so you can add the date dimension to calculate the revenue by year:

SELECT date_year, SUM(revenue)
FROM fct_orders
JOIN dim_date
ON fct_orders.date_key = dim_date.date_key GROUP BY date_year


Now you have a simple query that is the basis for all other possible queries that can be executed against a dimensional data model. The first line with the SELECT clause tells the database what the result set will look like. Next, you need to tell where the data is retrieved from by using the FROM clause.
You also want to JOIN another table and you need to tell the database which columns the join will be based on—in this case, the date_key in the dimension and fact tables. Finally, the results are grouped by all elements of the SELECT statement that are not part of an aggregate function. For example, when the SELECT statement looks like this:

SELECT date_year, product_type, product_name, SUM(revenue), SUM(quantity)

the GROUP BY statement needs to be

GROUP BY date_year, product_type, product_name


To complete this query all you need to add are the FROM and JOIN parts:



The SQL examples used so far still lack the use of aliases. An alias is another name you can use to refer to a table, column, or expression. Although they are not always required, it’s good practice to always use them. Adding the alias total_revenue to the expression SUM(revenue) gives more meaning to the result column, and using aliases for table names enables a shorthand notation of the join conditions. Compare the following JOIN clauses to understand what we mean by that:

FROM fct_orders
JOIN dim_date ON fct_orders.date_key = dim_date.date_key


versus

FROM fct_orders AS f
JOIN dim_date AS d ON f.date_key = d.date_key


There are cases in which the use of aliases is not optional but required. The first practical case where aliases are at least very helpful is when the same column name appears in more than one table of the set used in the query. Suppose that in the previous example you wanted to include date_key in the result set. Because this column is part of both the fact and the dimension table, the SQL parser cannot determine which table to pick the column from, so you need to include the table reference in your SELECT statement—for example, SELECT dim_date.datekey. Now it’s immediately obvious that the use of a short alias such as d will save a lot of typing. The complete query statement, including the use of aliases, now looks as follows:



The second case where aliases are not only practical but required is when the same table is used in different roles. An example is a query that asks for all revenue by product type with an order date in 2007 and a ship date in 2008:
  


The WHERE part of this query is covered in more detail shortly.