Introduction to Data Mining

"Drowning in Data yet Starving for Knowledge"
???

"Computers have promised us a fountain of wisdom but delivered a flood of data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus

“Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?”
T. S. Eliot

What is NOT data mining?

Data Mining, noun: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything"
Jeff Jonas, IBM

"An Unethical Econometric practice of massaging and manipulating the data to obtain the desired results"
W.S. Brown “Introducing Econometrics”

"A buzz word for what used to be known as DBMS reports"
An Anonymous Data Mining Skeptic

What is data mining?

"The non trivial extraction of implicit, previously unknown, and potentially useful information from data"
William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus

Data Mining - an interdisciplinary field

Related Technologies

Machine Learning vs. Data Mining

Data Mining vs. DBMS

Data Warehouse

On-line Analytical Processing (OLAP)

Statistical Analysis

Data Mining Goals

Classification

Association

Sequence/Temporal

Database management systems (DBMS), Online Analytical Processing (OLAP) and Data Mining

Task
    DBMS:        extraction of detailed and summary data
    OLAP:        summaries, trends, and forecasts
    Data Mining: knowledge discovery of hidden patterns and insights

Type of result
    DBMS:        information
    OLAP:        analysis
    Data Mining: insight and prediction

Method
    DBMS:        deduction (ask the question, verify with data)
    OLAP:        multidimensional data modeling, aggregation, statistics
    Data Mining: induction (build the model, apply it to new data, get the result)

Example question
    DBMS:        Who purchased mutual funds in the last 3 years?
    OLAP:        What is the average income of mutual fund buyers, by region and by year?
    Data Mining: Who will buy a mutual fund in the next 6 months, and why?

Stages of the Data Mining Process

Techniques

Knowledge Representation Methods

Data Mining Applications

Example of DBMS, OLAP and Data Mining: Weather data

Assume we have recorded the weather conditions over a two-week period, along with the decisions of a tennis player whether or not to play tennis on each particular day. Thus we have generated tuples (also called examples or instances) consisting of values of four independent variables (outlook, temperature, humidity, windy) and one dependent variable (play). See the textbook for a detailed description.

DBMS

Consider our data stored in a relational table as follows:
 
Day   outlook    temperature   humidity   windy   play
1     sunny      85            85         false   no
2     sunny      80            90         true    no
3     overcast   83            86         false   yes
4     rainy      70            96         false   yes
5     rainy      68            80         false   yes
6     rainy      65            70         true    no
7     overcast   64            65         true    yes
8     sunny      72            95         false   no
9     sunny      69            70         false   yes
10    rainy      75            80         false   yes
11    sunny      75            70         true    yes
12    overcast   72            90         true    yes
13    overcast   81            75         false   yes
14    rainy      71            91         true    no

By querying a DBMS containing the above table we may answer questions like: What was the temperature on the sunny days? On how many days did the player play tennis?
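
For illustration, here is a small sketch (an assumption of this write-up, with pandas standing in for SQL queries against an actual DBMS) answering the two questions above:

import pandas as pd

weather = pd.DataFrame({
    "day":         list(range(1, 15)),
    "outlook":     ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                    "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "humidity":    [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "windy":       [False, True, False, False, False, True, True,
                    False, False, False, True, True, False, True],
    "play":        ["no", "no", "yes", "yes", "yes", "no", "yes",
                    "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

# What was the temperature on the sunny days?
print(weather.loc[weather.outlook == "sunny", ["day", "temperature"]])

# On how many days did the player play tennis?
print((weather.play == "yes").sum())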

OLAP

Using OLAP we can create a Multidimensional Model of our data (a Data Cube). For example, using the dimensions time, outlook, and play, we can create the following model:
 
9 / 5    sunny   rainy   overcast
Week 1   0 / 2   2 / 1   2 / 0
Week 2   2 / 1   1 / 1   2 / 0

Here the time dimension represents the days grouped into weeks (week 1: days 1-7; week 2: days 8-14) along the vertical axis. The outlook dimension is shown along the horizontal axis, and the third dimension, play, is shown in each individual cell as a pair of values corresponding to its two values: yes / no. The upper left corner of the cube holds the total over all weeks and all outlook values.
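
Continuing the sketch above with the same weather DataFrame, the cube can be reproduced with a pandas group-by (standing in for a real OLAP engine):

# Concept hierarchy over time: days 1-7 -> week 1, days 8-14 -> week 2.
weather["week"] = (weather["day"] - 1) // 7 + 1

# Each cell of the cube holds the play = yes / no counts for one (week, outlook) pair.
cube = (weather.groupby(["week", "outlook"])["play"]
               .value_counts().unstack(fill_value=0))
print(cube)
print(weather["play"].value_counts())   # the 9 / 5 grand total in the corner cell

# Drill-down over the time dimension: climb down the hierarchy from weeks to days.
# (Roll-up is the reverse: aggregate the daily counts back up to the week level.)
drilled = (weather.groupby(["day", "outlook"])["play"]
                  .value_counts().unstack(fill_value=0))
print(drilled)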

By observing the data cube we can easily identify important properties of the data and find regularities or patterns. For example, the third column clearly shows that if the outlook is overcast, the play attribute is always yes. This may be put as a rule:

if outlook = overcast then play = yes

We may now apply drill-down to our data cube over the time dimension. This assumes the existence of a concept hierarchy for the time attribute, which we can show as a horizontal tree:

time
    week 1: days 1, 2, 3, 4, 5, 6, 7
    week 2: days 8, 9, 10, 11, 12, 13, 14

The drill-down operation climbs down this concept hierarchy, so that we get the following data cube:
     
9 / 5   sunny   rainy   overcast
1       0 / 1   0 / 0   0 / 0
2       0 / 1   0 / 0   0 / 0
3       0 / 0   0 / 0   1 / 0
4       0 / 0   1 / 0   0 / 0
5       0 / 0   1 / 0   0 / 0
6       0 / 0   0 / 1   0 / 0
7       0 / 0   0 / 0   1 / 0
8       0 / 1   0 / 0   0 / 0
9       1 / 0   0 / 0   0 / 0
10      0 / 0   1 / 0   0 / 0
11      1 / 0   0 / 0   0 / 0
12      0 / 0   0 / 0   1 / 0
13      0 / 0   0 / 0   1 / 0
14      0 / 0   0 / 1   0 / 0

The reverse of drill-down (called roll-up) applied to this data cube results in the previous cube, with two values (week 1 and week 2) along the time dimension.

Data Mining

By applying various Data Mining techniques we can find associations and regularities in our data, extract knowledge in the form of rules, decision trees, etc., or simply predict the value of the dependent variable (play) in new situations (tuples). Here are some examples (all produced by Weka):

Mining Association Rules

To find associations in our data we first discretize the numeric attributes (part of the data pre-processing stage in data mining). We group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal), and substitute the values in the data with the corresponding names. Then we apply the Apriori algorithm and get the following association rules:

1. humidity=normal windy=false 4 ==> play=yes (4, 1)
2. temperature=cool 4 ==> humidity=normal (4, 1)
3. outlook=overcast 4 ==> play=yes (4, 1)
4. temperature=cool play=yes 3 ==> humidity=normal (3, 1)
5. outlook=rainy windy=false 3 ==> play=yes (3, 1)
6. outlook=rainy play=yes 3 ==> windy=false (3, 1)
7. outlook=sunny humidity=high 3 ==> play=no (3, 1)
8. outlook=sunny play=no 3 ==> humidity=high (3, 1)
9. temperature=cool windy=false 2 ==> humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false 2 ==> play=yes (2, 1)

These rules show attribute value sets (the so-called item sets) that appear frequently in the data. The numbers after each rule show the support (the number of occurrences of the item set in the data) and the confidence (accuracy) of the rule. Interestingly, rule 3 is the same as the one we derived earlier by observing the data cube.
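
The following is a minimal, pure-Python sketch of the idea behind Apriori (an illustration written for this text, not Weka's implementation): count how often attribute-value sets occur in the discretized data and keep the frequent ones. The exact discretized values below are reconstructed to be consistent with the rules above.

from itertools import combinations
from collections import Counter

# The weather data after discretization, one tuple per day (same order as the table).
rows = [
    ("sunny", "hot", "high", "false", "no"),       ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),   ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),   ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),   ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
names = ("outlook", "temperature", "humidity", "windy", "play")
records = [{f"{n}={v}" for n, v in zip(names, row)} for row in rows]

MIN_SUPPORT = 3   # minimum number of occurrences for an item set to count as frequent

# Brute-force support counting for item sets of size 2 and 3. Real Apriori prunes
# candidates level by level: a set can be frequent only if all its subsets are.
support = Counter()
for record in records:
    for size in (2, 3):
        support.update(combinations(sorted(record), size))

for itemset, count in support.most_common():
    if count >= MIN_SUPPORT:
        print(count, itemset)

# Confidence of rule 3, "outlook=overcast ==> play=yes":
lhs = sum(1 for r in records if "outlook=overcast" in r)
both = sum(1 for r in records if {"outlook=overcast", "play=yes"} <= r)
print(both / lhs)   # 1.0 -- the rule holds on every overcast day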

Classification by Decision Trees and Rules

Using the ID3 algorithm we can produce the following decision tree (shown as a horizontal tree):

outlook = sunny
    humidity = high: play = no
    humidity = normal: play = yes
outlook = overcast: play = yes
outlook = rainy
    windy = true: play = no
    windy = false: play = yes

The decision tree consists of decision nodes that test the values of their corresponding attributes. Each value of such an attribute leads to a subtree, and so on, until the leaves of the tree are reached; the leaves determine the value of the dependent variable. Using a decision tree we can classify new tuples (not used to generate the tree). For example, according to the above tree the tuple {sunny, mild, normal, false} will be classified under play=yes.
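
As a rough cross-check, here is a short sketch (an illustration for this text, not the original notes' tooling) that trains a decision tree on the discretized data with scikit-learn, reusing the rows and names defined in the association-rule sketch above. Note that scikit-learn implements a CART-style learner over one-hot encoded attributes rather than classical multiway ID3, so the printed tree may differ in shape, though it classifies the example tuple the same way.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame(rows, columns=names)          # discretized weather data
X = pd.get_dummies(data.drop(columns="play"))     # one-hot encode the nominal attributes
clf = DecisionTreeClassifier(criterion="entropy").fit(X, data["play"])
print(export_text(clf, feature_names=list(X.columns)))

# Classify the new tuple {sunny, mild, normal, false}
new = pd.DataFrame([("sunny", "mild", "normal", "false")], columns=names[:-1])
print(clf.predict(pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)))  # ['yes']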

A decision tree can be represented as a set of rules, where each rule represents a path through the tree from the root to a leaf. Other Data Mining techniques can produce rules directly. For example, the Prism algorithm available in Weka generates the following rules:

If outlook = overcast then yes
If humidity = normal and windy = false then yes
If temperature = mild and humidity = normal then yes
If outlook = rainy and windy = false then yes
If outlook = sunny and humidity = high then no
If outlook = rainy and windy = true then no

Prediction Methods

Data Mining also offers techniques to predict the value of the dependent variable directly, without first generating an explicit model. One of the most popular approaches for this purpose is based on statistical methods. It uses Bayes' rule to predict the probability of each value of the dependent variable given the values of the independent variables. For example, applying Bayes' rule to the new tuple discussed above we get:

P(play=yes | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.8
P(play=no  | outlook=sunny, temperature=mild, humidity=normal, windy=false) = 0.2

The predicted value is therefore "yes".
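
As a minimal hand-rolled sketch (an illustration for this text, not Weka's implementation) of where these numbers come from, reusing the records defined in the association-rule sketch above:

def naive_bayes(records, new_tuple):
    """Return normalized P(class | new_tuple) under the naive independence assumption."""
    scores = {}
    for label in ("play=yes", "play=no"):
        in_class = [r for r in records if label in r]
        score = len(in_class) / len(records)              # prior P(class)
        for item in new_tuple:                            # likelihood P(value | class)
            score *= sum(1 for r in in_class if item in r) / len(in_class)
        scores[label] = score
    total = sum(scores.values())                          # normalize over both classes
    return {label: round(s / total, 2) for label, s in scores.items()}

new = {"outlook=sunny", "temperature=mild", "humidity=normal", "windy=false"}
print(naive_bayes(records, new))   # {'play=yes': 0.8, 'play=no': 0.2}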