Data Warehouse and OLAP

Data Warehouse and DBMS

  1. What is Data Warehouse?
  2. Data warehousing
  3. OLTP vs. OLAP.

Multidimensional data model

Let us consider the weather data defined in the introduction. The dependent variable play has just two values - yes and no. As these values are mutually exclusive, we can replace them by 1 and 0 respectively. This will allows us to add up values and thus get the total number of days when tennis was played and at the same time the number of days tennis was not played (the complement of the former to the total number of days). Let us also rename the day attribute into time, which is more general and will allow us to use other time units (e.g. weeks). Thus we get the following relational table.
     
    time outlook temperature humidity windy play
    1 sunny 85 85 false 0
    2 sunny 80 90 true 0
    3 overcast 83 86 false 1
    4 rainy 70 96 false 1
    5 rainy 68 80 false 1
    6 rainy 65 70 true 0
    7 overcast 64 65 true 1
    8 sunny 72 95 false 0
    9 sunny 69 70 false 1
    10 rainy 75 80 false 1
    11 sunny 75 70 true 1
    12 overcast 72 90 true 1
    13 overcast 81 75 false 1
    14 rainy 71 91 true 0

Concept hierarchies

Let us assume also that we know some partial ordering among the values of the attributes. These partial ordering define the so called concept hierarchies.  For example, for attributes day, temperature and humidity we can group values in subsets and name these subsets, thus obtaining the following hierarchies (all denotes the set of all values).
day:
            all
       ______|_________
      |                |
   week 1            week 2
 _____|_____    _______|_______
| | | | | | |  | | |  |  |  |  |
1 2 3 4 5 6 7  8 9 10 11 12 13 14
 

temperature:
                all
     ____________|_____________
     |           |             |
    hot         mild         cool
 ____|___     ___|____      ___|____
|  |  |  |   |  |  |  |    |  |  |  |
80 81 83 85  70 71 72 75   64 65 68 69
 

humidity:
             all
        ______|________
       |               |
     high            normal
 ______|_______     ___|____
|  |  |  |  |  |   |  |  |  |
85 86 90 91 95 96  65 70 75 80

We may also extend the sets of numbers or replace them with intervals, which will make the hierarchy complete (covering all possible values). For example, humidity may look like this:
           all
        ____|____
       |         |
     high     normal
       |         |
    [85,96]   [65,84]
For the nominal (non numeric) attributes outlook and windy we define one-level hierarchies, as their values cannot be ordered or grouped.
outlook:
        all
  _______|________
 |       |        |
sunny  rainy  overcast
 

windy:
        all
      ___|____
     |        |
   true    false
 

Data cube

To create a data cube we have to:
  1. Select dimensions, that is select a subset of attributes. For example, let us select time and temperature. Thus we will create a two-dimensional data cube.
  2. Select levels in the concept hierarchies. For example, let us select weeks for time and degrees for temperature.
  3. Select a measure to populate the cube. This is the attribute whose values will be aggregated across the dimensions (obviously it has to be numeric). Let us select play.
Then placing the time values in the rows and the temperature values in the columns we get the following cube:
 
64 65 68 69 70 71 72 75 80 81 83 85
week 1 1 0 1 0 1 0 0 0 0 0 1 0
week 2 0 0 0 1 0 0 1 2 0 1 0 0
The numbers in the internal cells are obtained by adding up the values of the play attribute, where the time and the temperature attribute are equal to the values in the corresponding row and column. For example the value 2 (row 2, column 8) means that tennis was played two days during week 2 when the temperature was 75.

OLAP operations

Now assume we want to change the level that we selected for the temperature hierarchy to the intermediate level (hot, mild, cool). To do this we have to group columns and add up the values according to the concept hierarchy. This operation is called roll-up, and in this particular case it produces the following cube.
 
cool mild hot
week 1 2 1 1
week 2 1 3 1
In other words, climbing up the concept hierarchy produces roll-up's. Inversely, climbing down the concept hierarchy expands the table and is called drill-down. For example, the drill down of the above data cube over the time dimension produces the following:
 
cool mild hot
day 1 0 0 0
day 2 0 0 0
day 3 0 0 1
day 4 0 1 0
day 5 1 0 0
day 6 0 0 0
day 7 1 0 0
day 8 0 0 0
day 9 1 0 0
day 10 0 1 0
day 11 0 1 0
day 12 0 1 0
day 13 0 0 1
day 14 0 0 0

Lattice of cubes, slice and dice operations

The number of dimensions define the total number of data cubes that can be created. Actually this is the number of elements in the power set of the set of attributes. Generally if we have a set of N attributes, the power set of this set will have 2N elements. The elements of the power set  form a lattice. This is an algebraic structure that can be generated by applying intersection to all subsets of the given set. It has a bottom element - the set itself and a top element - the empty set. Here is a part of the lattice of cubes for the weather data cube.

                      {}
                 _____|______
                |           |
        ... {outlook}  {temperature} ...
                 ___________|________
                |                    |
... {temperature,humidity}   {outlook,temperature}  ...
             |                       |
            ...                     ...
                 ...
                  |
{outlook,temperature,humidity,windy}    {time,temperature,humidity,windy}
                  |____________________________________|
                                   |
                  {time,outlook,temperature,humidity,windy}

In the above terms the selection of dimensions actually means selection of a cube, i.e. an element of the above lattice.

There are two other OLAP operations that are related to the selection of a cube - slice and dice. Slice performs a selection on one dimension of the given cube, thus resulting in a subcube. For example, if we make the selection (temperature=cool) we will reduce the dimensions of the cube from two to one, resulting in just a single column from the table above.

 
cool
day 1 0
day 2 0
day 3 0
day 4 0
day 5 1
day 6 0
day 7 1
day 8 0
day 9 1
day 10 0
day 11 0
day 12 0
day 13 0
day 14 0
The dice operation works similarly and performs a selection on two or more dimensions. For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) to the original cube we get the following subcube (still two-dimensional):
 
cool hot
day 3 0 1
day 4 0 0

Relational representation of the data cube

The use of the lattice of cubes and concept hierarchies gives us a great flexibility to represent and manipulate data cubes. However, a still open question is how to implement all this. An interesting approach to this based on a simple extension of standard relational representation used in DBMS is proposed by Jim Gray and collaborators (his paper on this is available in word format from here). The basic idea is to use the value ALL as a legitimate value in the relational tables. Thus, ALL will represent the set of all values aggregated over the corresponding dimension. By using ALL we can also represent the lattice of cubes, where instead of dropping a dimension when intersecting two subsets, we will replace it with ALL. Then all cubes will have the same number of dimensions, where their values will be extended with the val,ue ALL. For example, a part of the above shown lattice will now look like this:

                         {ALL,ALL,temperature,ALL,ALL}
                     __________________|_________________
                    |                                    |
    {ALL,ALL,temperature,humidity,ALL}    {ALL,outlook,temperature,ALL,ALL}

Using this technique the whole data cube can be represented as a single relational table as follows (we use higher levels in the concept hierarchies and omit some rows for brevity):

     
    time outlook temperature humidity windy play
    week 1 sunny cool normal true 0
    week 1 sunny cool normal false 0
    week 1 sunny cool normal ALL 0
    week 1 sunny cool high true 0
    week 1 sunny cool high false 0
    week 1 sunny cool high ALL 0
    week 1 sunny cool ALL true 0
    week 1 sunny cool ALL false 0
    week 1 sunny cool ALL ALL 0
    week 1 sunny mild normal true 0
    ... ... ... ... ... ...
    week 1 overcast ALL ALL ALL 2
    week 1 ALL ALL ALL ALL 4
    week 2 sunny cool normal true 0
    week 2 sunny cool normal false 1
    week 2 sunny cool normal ALL 1
    week 2 sunny cool high true 0
    ... ... ... ... ... ...
    ALL ALL ALL high ALL 3
    ALL ALL ALL ALL true 3
    ALL ALL ALL ALL false 6
    ALL ALL ALL ALL ALL 9
The above table allows us to use an unified approach to implement all OLAP operations - they all can me implemented just by selecting proper rows. For example, the following cube:
 
cool mild hot
week 1 2 1 1
week 2 1 3 1
can be extracted from the table by selecting the rows that match the pattern (*, ALL, *, ALL, ALL), where * matches all legitimate values for the corresponding dimension except for  ALL.