вторник, 30 декабря 2014 г.

CDC (changed data capture)

The data warehouse stores the history of the data changes, so you need to sign in to submit only the changed data since the last update in DWH.


  • To be sure that the isolation of the changed data
  • Provide certain type of change (insert, delete, update)
  • Ensure batch downloading changes in DWH


There are several ways to detect changes:

1. Analysis of the column audit or timestamp
The table is added to the system source column contains the date and time of insertion or last update. The value of the column is usually filled with triggers that run automatically when you insert or update. In some cases, in place of the time you can use a monotonically increasing numeric code audit.


  • increasing the load on the database system resources and slowing down data modifications because of the trigger;
  • if these columns are filled not trigger, then ensure the accuracy of change detection is not possible;
  • it is impossible to determine whether the removal;
  • problems with the time change to the source server or not synchronized with time DWH (for timestamp);
  • storage of digital codes of the last audit exemption for each table (for digital labels).

2. Analysis of tables with the change log
For each tracked table to create a log table that stores all the line after modification, the modification time, and type of modification. Filling a table is done via a trigger to insert, delete, and update the tracked table.

  • increasing the load on the database system resources and slowing down data modifications because of the trigger;
  • need to be able to aggregate changes in one line. Since the magazine provides all the changes, the hour might happen inserting rows, updating and subsequent removal;
  • the need for periodic cleaning of the magazines;
  • problems with the time change to the source server or not synchronized with time DWH;
3. Analysis of log files
The method is to periodically copy the redo log (redo log) database and analysis of these logs in order to highlight the transaction affecting the table you are interested in.

  • magazines not supported by all DBMS;
  • need to be able to aggregate changes in one line;
  • loss of information about changes in the case of carrying out administrative work.
Often crowded logs that blocks further transactions. When this happens in a commercial transaction system, the reaction responsible DBA is to clean the logs. If you have sorted out all the options, and log analysis was the only appropriate, it is necessary to convince the DBA to create for you a special transaction log.

4. Full comparison
At full comparison holds the full picture of the table on the last day, and compared to each entry with today's version of the search for each change. Gives a guaranteed change detection.

  • the cost of the disk subsystem;
  • cost comparison;
  • implementation of the algorithm for calculating checksums.

ETL Team Mission

The main mission of the ETL team is to build the DWH.
More deeply, the ETL team have to develop ETL framework which can effectively:

  • Deliver data most effectively to end user tools;
  • Add value to data in the cleaning and conforming steps;
  • Protect and document the lineage of data.

The ETL framework have to cover the following main points:

  • Extracting data from the sources;
  • Keep an eye on quality of data and cleaning data;
  • Conforming the labels and measures in the data to achieve consistency across the sources;
  • Delivering data in a business layer, where it can be accessed by BI tools and business users.
In addition, I want to mention about responsibility of ETL guys, they are responsible for the quality and accuracy of data and they do not hesitate to contact business in case of technical problems or in case of unclear business requirements.

We, technical staff, all work for our business departments and help to make right and accuracy business decision, that's why we have to work in collaboration with business, although we are technical guys and used to work with machines.

Data Models in DWH

The data model is a corporate repository ER-model (Entity-relationship model - the model "entity-relationship"), described on several levels set of interconnected entities, which are grouped by functional area and reflect the needs of the business in the analytical analysis and reporting.

General corporate data model repository is developed consistently and consists of:

  • conceptual data model;
  • logical data model;
  • physical data model.

Conceptual Model
Conceptual model of a data warehouse is a description of the main (basic) entities and relations between them. The conceptual model is a reflection of the subject areas in which it is planned to build a data warehouse.
Logic model
Logical model extends the conceptual definitions for entities by their attributes, descriptions and restrictions, clarifies the composition of entities and relationships between them.

Physical model
Physical data model describes the implementation of a logical model of objects at the level of a specific database objects.

Comparison of models of different levels
The table below presents a comparative analysis of models of different levels.

пятница, 26 декабря 2014 г.

Using bitmap indexes in databases

Kelly Sommers posted a great article on the internals of bitmap indexes:
"There are many different data structures used in databases to create indexes used to quickly evaluate queries. Each one has different strengths and weaknesses based on the tradeoffs they make on memory, cpu and storage (if persisted). One of these types of indexes is called a bitmap index. For the purpose of this post I’m going to be using relational database terminology but the same techniques can apply to different database types like column oriented databases or many others non-relational databases.
A bitmap index represents membership. A bitmap index is an array of bits, a bit represents each row where each bit set to 1 represents membership in the set.
To illustrate this better lets create an example table with 4 rows but consider it could be a table with 100 million rows.
Creating a bitmap index on the country column would give us 3 bitmap indexes (3 arrays of bits) that looks like the following.
[1, 0, 0, 0]
[0, 1, 0, 1]
[0, 0, 1, 0]
Now a bitmap index for the Name column which creates 4 bitmap indexes.
[1, 0, 0, 0]
[0, 1, 0, 0]
[0, 0, 1, 0]
[0, 0, 0, 1]
In addition to the bitmap indexes, we need a map that maps index positions to rows in the database (more on why later).
[100, 101, 102, 103]

Types of data

Bitmap indexes do well at indexing categorial data with well defined values like countries, dates and ages as a few examples. Bitmap indexes don’t do well with numerical data with continuous values like prices or kilometres per hour that you may want to sum or average. A 1 in the array of bits represents membership in the set for a unique term which makes them suitable for WHERE A = B type evaluations and not WHERE A > B evaluations.

Memory and storage

If you have 100 million rows in your database, the storage for the Country column index would be 3 bitmap indexes with 100 million bits (12 megabytes) each taking a total 36MB.
Using bitmap indexes means managing a lot of indexes because you need a new array of bits per unique term. That’s one of the cons of a bitmap index. If you have 100 million rows and each has a unique term (say a timestamp) you would create bitmap indexes for each timestamp where only 1 bit is set out of 100 million bits.
Fortunately there are ways to tackle this problem called Compressed Bitmaps which I will cover later.

Evaluating a query

With the indexes and row map defined we can jump back to the high level and explain how a query like this works.
WHERE Name = ‘Julie’ AND Country = ‘USA’
There’s a lot of material and research on how to optimize query operations such as choosing which order to do things based on cardinality and many other aspects. In these examples I’m not explaining the process of optimizing a query execution plan, just explaining the basics.
To evaluate which rows match the criteria we can do bitwise operations like bitwise AND and OR’s.
STEP 1: Evaluate Name = ‘Julie’ AND Country = ‘USA’
We take the 2 bitmap indexes that represent the terms “Julie” and “USA” and we bitwise AND them.
    0001 (Julie)
AND 0101 (USA)
  = 0001
STEP 2: Map results to row map
Take all the set bits and map the row key.
[0, 0, 0, 1]
= rowmap[3]
= rows["103"]
STEP 3: Select results
With the results from the where clause evaluated into row keys now we can use however our database storage works to fetch the rows and the columns selected in the query to return.


Depending on the database implementation, deleting a row in a database may cause a bunch of index rebuilds to occur but luckily with bitmap indexes you can still respond to queries with responses that exclude the deleted rows of data while these indexes rebuild in the background. Create a new bitmap index that represents deleted rows then bitwise AND against this bitmap index with any query results. In a bitmap index that represents deletes, a bit set to 1 represents a valid row and a bit set to 0 represents a deleted row.
Using the query mentioned above, if we deleted Julie the operations would look something like this.
    0001 (Julie)
AND 0101 (USA)
AND 1110 (Deletes)
  = 0000 (No matching rows!)
The bitmap index representing deletes is temporary and can be removed once the indexes affected by the row deletion have been rebuilt.

Group By and Joins

Bitmap indexes can be used for evaluating a lot of different parts of a query such as GROUP BY and JOIN clauses. Evaluating a join is a multi-step process because you need to evaluate the bitmap indexes from the right side join table and then take the resulting bitmap and bitwise AND it against the query result on the left side like we did on the deletes example. One to many to one relationships get even more complicated because there is an intermediary table facilitating the relationship. Explaining joins is a much more involved example I will leave for another blog post.

Bitmap indexes in OLAP databases

An OLAP (online analytical processing) database is typically used for reporting and business intelligence where there is a need to query data in many different ways (dimensions) with quick response times and is sometimes called an “OLAP Cube”. OLAP data is usually stored with a star schema where there is a fact table and multiple dimension tables. If we were analyzing sales, the sales table would be the fact table. Country and day of week would be dimension tables. This star schema would allow us to do analytical queries answering questions like “top 10 sales by country on Saturday” giving you insight on which countries you may want to increase promotions.
Some OLAP databases use bitmap indexes. When the OLAP cube is being generated it creates pre-calculated results so that queries can be extremely fast. When the cube is being generated it’s going through a materialization stage where it is evaluating every possible bitmap index combination for the dimension tables and creates all the resulting bitmaps.
The nice thing about using bitmap indexes in an OLAP database, since everything is an array of bits, you can choose not to materialize everything. If a pre-calculated result explodes the cube due to the size of possible combinations you can do what is called partial materialization where you don’t materialize everything but you can still evaluate bitmaps from parts that are materialized with bitmaps that aren’t materialized.

Accelerating with SIMD & GPU’s

Included in many modern processors is SSE2 which provides SIMD (Single instruction, multiple data). Using SIMD to evaluate the bitwise AND or bitwise OR operations would give the ability to evaluate (depending on CPU) 128 bits at once. With Intel AVX (Advanced Vector Extensions) in Sandy Bridge and AMD BullDozer up to 256 bits is supported. This would be a significant increase in performance for the operations explained earlier.
Using GPGPU could also accelerate processing even further but this poses additional challenges around getting data in and out of the GPU VRAM. If your indexes are larger than the VRAM available you will need to swap data in and out of the GPU many times.

Compressed Bitmap Index

One of the cons to using bitmap indexes is the amount of bitmap indexes that get created. You need to create one bitmap index the size of the row count per unique term. Luckily run-length encoding fits this problem really well. If you have really high cardinality columns, you will likely have most bits set to 0 and these can be compressed very well. Compressed bitmaps allow you to do the bitwise AND and bitwise OR operations in the compressed form avoiding having to decompress these large bitmaps before using them. Sparsity of membership when using run-length encoding will dictate how effective the compression rate is.
You can find some really good compressed bitmap implementations at the following links.
Compressed bitmap indexes can be really useful for evaluating queries on really large datasets quickly and offer a good compromise between memory efficiency and processing speed since in many cases they can be faster than uncompressed bitmaps. I only cover some basic uses of bitmap indexes, there are a lot of different ways to approach data indexing and query execution plans."

ETL Tool versus Hand Coding

If you are planning DWH project, You should consider ETL Tool.
I found very interesting and useful article of Gary Nissen where he compared two various approaches.

ETL Tool vs Scripting.

Advantages of tool-based ETL

  1. Simpler, faster, cheaper development. The ETL tool cost will make up for itself in projects that are large enough or sophisticated enough.
  2. Many ETL tools have integrated metadata repositories that can synchronize metadata from source systems, target databases and other BI tools.
  3. Most ETL tools enforce a consistent metadata driven methodology that all developers must follow.
  4. Data flow and data lineage documentation can often be automatically generated from a metadata repository.
  5. ETL tools have connectors pre-built for many source and target systems. So, if you are working with many sources or targets, you will find many advantages in a tool-based approach.
  6. Most ETL tools deliver good performance even for very large data sets. If your ETL data volume is really large, or it will be large in a year or two, you should seriously consider a tool-based option.
  7. Some ETL tools help you conduct change impact analysis further down the road.

 Advantages of hand-coded ETL

  1. Automated unit testing tools are available in a hand-coded system, but not with a tool-based approach. For example, the JUnit library (http://junit.org) is a highly regarded and well-supported tool for unit testing Java programs. There are also similar packages for other languages. Another common approach to unit testing is to use a scripting language, like TCL or Python, to setup test data, run an ETL process and verify the results. Automating the testing process through one of these methods will significantly improve the productivity of your QA staff and the quality of your deliverables.
  2. Object-oriented programming techniques help you to make all of your transformations consistent for error reporting, validation, and metadata updates. You may not have this degree of control with the parameter setting and snippet programming style of ETL tools.
  3. Metadata can be more directly managed by hand-coded systems. Actually, metadata driven hand-coded systems are much more flexible than any alternative.
  4. A brief requirements analysis of an ETL system quickly points you towards file-based processing, not database stored procedures. File-based processes are more direct. They are simply coded, easily tested and well understood.
  5. A tool-based approach will limit you to the tool vendor’s abilities and their unique scripting language. But, a hand-coded system can be developed in a common and well-known language. In fairness, all the ETL tools allow “escapes” to standard programming languages in isolated modules.
  6. Hand-coded ETL provides unlimited flexibility, if that is indeed what you need. You can literally do anything you want. There are many instances where a unique approach or a different language can provide a big advantage.
In addition, be aware about you IT guys, who used to change job sometimes, and it is a fact, that hand-code of good programmer very often is impossible to understand. Moreover, hand code is quite difficult to scale.

пятница, 19 декабря 2014 г.

Columnar DBMS - principle, advantages and scope

The middle of the 2000s was marked by rapid growth of columnar DBMS. Vertica, ParAccel, Kognito, Infobright, SAND and others joined the club columnar DBMS and diluted proud solitude Sybase IQ, founded it in the early 90s. In this article I will discover the reasons for the popularity of the idea in a column of data storage, operation and use of the area columnar DBMS.

Let's start with the fact that popular nowadays relational databases - Oracle, SQL Server, MySQL, DB2, Postgresql and etc. Based on the architecture, its history is counted more since 1970s, when transistor radios were sideburns long, flared trousers, a database in the world dominated by hierarchical and network data management system. The main task of the database, then, is to support beginning in the 1960s, a massive shift from paper records of economic activity to the computer. A wealth of information from paper documents was transferred to a database accounting systems that were supposed to securely store all incoming information and, if necessary, to quickly find them. These requirements led to the architectural features of relational databases, the remaining hitherto virtually unchanged: progressive storage, indexing records and logging operations.

Under the interline storage is usually understood the physical storage of all rows in the table in the form of a record in which the fields are sequentially one after the other, and the last field recordings, in general, is the first next record. Something like this:

[A1, B1, C1], [A2, B2, C2], [A3, B3, C3] ...

where A, B and C - a field (columns), and 1.2 and 3 - record number (row).

Such storage is extremely convenient for frequent operations to add new rows in the database, usually stored on a hard drive - in fact in this case, a new record can be added entirely in just one pass of the drive heads. Significant speed restrictions imposed by the HD, caused the need to conduct specific indexes, which would allow to find the desired item on the disc in the minimum number of passes head HDD. Typically, several indices formed, depending on which fields required to make the search, which increases the volume of the database on a disk is sometimes several times. For resiliency traditional DBMS automatically duplicate transactions in the log, which leads to even more space occupied on disk. As a result, for example, an Oracle database average takes up 5 times more space than the volume of payload data therein. For srednepotolochnoy database to DB2, this ratio is even more - 7: 1.

However, in the 1990s, the spread of information systems analysis and storage of data used for the analysis of management accounting systems in the accumulated data, it became clear that the nature of the load in these two types of systems is radically different.

If transactional applications characterized by very frequent small transactions add or change one or more records (insert / update), in the case of analytical systems opposite picture - the largest load is created relatively rare but serious samples (select) the hundreds of thousands and millions of records, often with groups and calculation of totals (so-called units). Write operations wherein a low, often less than 1% of the total. And often you record large blocks (bulk load). It should be noted that the analytical sample is one important feature - they usually contain only a few fields. On average, in the analytic SQL-user request them rarely more than 7-8. This is due to the fact that the human mind is unable to properly absorb information more than 5-7 sections.

However, what happens if you choose, for example, only 3 fields from the table, in which there are only 50? Due to progressive data storage in traditional DBMS (required, as we remember, for frequent operations to add new entries to the accounting systems) will be read completely all lines completely with all fields. This means that no matter whether we need only 50 fields or 3, with the disc in any case they are read entirely, passed through the disk controller, input and output to the processor, which has only necessary to take away request. Unfortunately, the channels of disk IO are usually the main limiter performance analytical systems. As a result, the effectiveness of traditional RDBMS in the performance of this query can be reduced by 10-15 times because of the imminent reading unnecessary data. And Moore's Law on the rate of input-output disk drives much weaker than the speed of processors and amount of memory. So, apparently, the situation will only get worse.

To solve this problem are called columnar DBMS. The basic idea of columnar database - it can store data in rows, as do traditional DBMS, and on columns. This means that from the point of view of SQL-client data is typically represented as a table, but a table is physically these plurality of columns, each of which is essentially a table of one field. At the same physical disk space value of one field are stored one after the other - something like this:

[A1, A2, A3], [B1, B2, B3], [C1, C2, C3], etc.

This data organization leads to the fact that when the select which featured only three fields of 50 fields in the table, the disc will be physically read only 3 columns. This means that the load on the input-output channel will be about 50/3 = 17 times smaller than if the same in traditional database query.

In addition, when columnar data storage appears a great opportunity to greatly compress the data as a single column of the table data is generally of the same type can not be said about the line. Compression algorithms may be different. Here is an example of one of them - the so-called Run-Length Encoding (RLE):

If we have a table with 100 million records, made within one year, in the column "Date" will actually be stored for no more than 366 possible values, as of the year no more than 366 days (including leap years). So we can 100 million sorted values in this field is replaced by 366 pairs of values of the form <date, the number of times> and store them on disk in this form. Thus they will occupy approximately 100 thousand. Times less space, which also contributes to the query performance.

From a developer's perspective, columnar DBMS usually correspond ACID and support largely SQL-99 standard.


Columnar DBMS designed to solve the problem of inefficiency of traditional databases in analytical systems and systems in the vast majority of operations such as 'read'. They allow for cheaper and less powerful hardware to get a speed boost query performance at 5, 10 and sometimes even 100 times, while, thanks to compression, the data on the disk will take 5-10 times less than in the case of traditional DBMS.

In columnar DBMS there are disadvantages - they are slow to write, not suitable for transactional systems, and as a rule, because of the "youth" have a number of limitations to the developer who is used to the development of traditional DBMS.

Columnar DBMS usually used in analytical systems class business intelligence (ROLAP) and analytical data warehouse (data warehouses). And the amount of data can be quite large - there are examples on 300-500TB and even cases with> 1PB of data.