Introduction to Data Mining
"Drowning in Data yet Starving for Knowledge"
"Computers have promised us a fountain of wisdom but delivered
a flood of data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus
“Where is the wisdom we have lost in knowledge? Where is the knowledge
we have lost in information?”
T. S. Eliot
What is NOT data mining?
Data Mining, noun: "Torturing data until it confesses ... and if
you torture it enough, it will confess to anything"
Jeff Jonas, IBM
"An Unethical Econometric practice of massaging and manipulating
the data to obtain the desired results"
W.S. Brown “Introducing Econometrics”
"A buzz word for what used to be known as DBMS reports"
An Anonymous Data Mining Skeptic
What is data mining?
"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus
- Data mining finds valuable information hidden in large volumes of data.
- Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
- The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
- It is possible to "strike gold" in unexpected places, as the data mining software extracts patterns that were not previously discernible, or were so obvious that no one had noticed them before.
- Mining analogy:
  - large volumes of data are sifted in an attempt to find something worthwhile.
  - in a mining operation, large amounts of low-grade material are sifted in order to find something of value.
Data Mining - an interdisciplinary field
- Databases
- Statistics
- High Performance Computing
- Machine Learning
- Visualization
- Mathematics
Related Technologies
Machine Learning vs. Data Mining
- Large data sets in Data Mining
- Efficiency of algorithms is important
- Scalability of algorithms is important
- Real-world data
- Lots of missing values
- Pre-existing data - not user-generated
- Data not static - prone to updates
- Efficient methods for data retrieval available for use
- Domain knowledge in the form of integrity constraints available
Data Mining vs. DBMS
- Example DBMS reports
  - Last month's sales for each service type
  - Sales per service grouped by customer sex or age bracket
  - List of customers who lapsed their policy
- Questions answered using Data Mining
  - What characteristics do customers who lapse their policy have in common, and how do they differ from customers who renew their policy?
  - Which motor insurance policy holders would be potential customers for my House Content Insurance policy?
Data Warehouse
- Data Warehouse: a centralized data repository which can be queried for business benefit.
- Data Warehousing makes it possible to:
  - extract archived operational data
  - overcome inconsistencies between different legacy data formats
  - integrate data throughout an enterprise, regardless of location, format, or communication requirements
  - incorporate additional or expert information
On-line Analytical Processing (OLAP)
- Multi-Dimensional Data Model (Data Cube)
- Operations:
  - Roll-up
  - Drill-down
  - Slice and dice
  - Rotate
Statistical Analysis
- Ill-suited for nominal and structured data types
- Completely data-driven - incorporation of domain knowledge is not possible
- Interpretation of results is difficult and daunting
- Requires expert user guidance
Data Mining Goals
Classification
- The DM system learns from examples or from the data how to partition or classify the data, i.e. it formulates classification rules
- Example - a customer database in a bank
- Question - is a new customer applying for a loan a good investment or not?
- Typical rule formulated:
  if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes
  then INVESTMENT_TYPE = good
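A rule like this maps directly onto a predicate. The sketch below uses the attribute names from the rule itself; the customer values passed in are hypothetical.

```python
def investment_is_good(status, income, house_owner):
    """The classification rule from the text: a married house owner
    with income over 10000 is classified as a good investment."""
    return status == "married" and income > 10000 and house_owner

print(investment_is_good("married", 25000, True))   # True
print(investment_is_good("married", 25000, False))  # False - fails the HOUSE_OWNER test
```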
Association
- Rules that associate one attribute of a relation to another
- Set-oriented approaches are the most efficient means of discovering such rules
- Example - a supermarket database
  - 72% of all the records that contain items A and B also contain item C
  - the specific percentage of occurrences, 72%, is the confidence factor of the rule
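Confidence can be computed directly from a list of transactions. The baskets below are made-up illustrations; with them, two of the three baskets containing A and B also contain C, so the confidence is about 67%.

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent: the fraction of
    transactions containing the antecedent that also contain the consequent."""
    containing = [t for t in transactions if antecedent <= t]
    if not containing:
        return 0.0
    return sum(1 for t in containing if consequent <= t) / len(containing)

# Hypothetical supermarket transactions (sets of item codes):
baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
print(confidence(baskets, {"A", "B"}, {"C"}))  # 2 of 3 baskets -> ~0.67
```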
Sequence/Temporal
- Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time
- The difference between sequence rules and other rules is the temporal factor
- Example - a retailer's database can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven
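One simple form of this temporal analysis is to count which items customers bought before the target item. Everything in this sketch (function name, purchase histories, threshold) is illustrative, not a specific algorithm named in the text.

```python
from collections import Counter

def frequent_predecessors(histories, target, min_count=2):
    """Count which items appear in a purchase history before the
    first purchase of `target`; keep those seen at least min_count times."""
    counts = Counter()
    for history in histories:
        if target in history:
            counts.update(set(history[:history.index(target)]))
    return {item for item, c in counts.items() if c >= min_count}

# Hypothetical purchase sequences, oldest purchase first:
histories = [
    ["toaster", "kettle", "microwave"],
    ["kettle", "blender", "microwave"],
    ["toaster", "lamp"],
]
print(frequent_predecessors(histories, "microwave"))  # {'kettle'}
```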
Database management systems (DBMS), Online Analytical Processing (OLAP)
and Data Mining
| Area | DBMS | OLAP | Data Mining |
|------|------|------|-------------|
| Task | Extraction of detailed and summary data | Summaries, trends and forecasts | Knowledge discovery of hidden patterns and insights |
| Type of result | Information | Analysis | Insight and prediction |
| Method | Deduction (ask the question, verify with data) | Multidimensional data modeling, aggregation, statistics | Induction (build the model, apply it to new data, get the result) |
| Example question | Who purchased mutual funds in the last 3 years? | What is the average income of mutual fund buyers by region by year? | Who will buy a mutual fund in the next 6 months and why? |
Stages of the data mining process
- Data pre-processing
  - Heterogeneity resolution
  - Data cleansing
  - Data transformation
  - Data reduction
  - Discretization and generating concept hierarchies
- Creating a data model: applying Data Mining tools to extract knowledge from data
- Testing the model: the performance of the model (e.g. accuracy, completeness) is tested on independent data (not used to create the model)
- Interpretation and evaluation: user bias can direct the DM tools to areas of interest, using
  - Attributes of interest in databases
  - Goal of discovery
  - Domain knowledge
  - Prior knowledge or belief about the domain
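The discretization step above can be sketched as simple binning: a numeric value is mapped to a named interval (concept). The cut-points (70 and 80) and interval names here are illustrative assumptions, not values fixed by the text.

```python
def discretize(value, cutpoints, labels):
    """Map a numeric value to the named interval it falls in.
    cutpoints are the upper bounds of all intervals except the last."""
    for cut, label in zip(cutpoints, labels):
        if value < cut:
            return label
    return labels[-1]

temp_labels = ["cool", "mild", "hot"]  # hypothetical concept names
print(discretize(85, [70, 80], temp_labels))  # hot
print(discretize(72, [70, 80], temp_labels))  # mild
print(discretize(65, [70, 80], temp_labels))  # cool
```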
Techniques
- Set-oriented database methods
- Statistics: can be used in several data mining stages
  - data cleansing: removal of erroneous or irrelevant data
  - EDA (exploratory data analysis): frequency counts, histograms, etc.
  - data selection and sampling: reduce the scale of computation
  - attribute re-definition
  - data analysis: measures of association and relationships between attributes, interestingness of rules, classification, etc.
- Visualization: enhances EDA and makes patterns more visible
- Clustering (cluster analysis)
  - Clustering and segmentation partition the database so that each partition or group is similar according to some criterion or metric
  - Clustering according to similarity is a concept that appears in many disciplines, e.g. the clustering of molecules in chemistry
  - Data mining applications make use of clustering according to similarity, e.g. segmenting a client/customer base
  - It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases
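Partitioning a population by similarity can be illustrated with a minimal k-means sketch (one of many possible clustering criteria; the text does not name a specific algorithm). The customer ages below are made up.

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """A minimal k-means sketch for one-dimensional data: repeatedly
    assign each value to the nearest centre, then move each centre
    to the mean of its assigned values."""
    random.seed(seed)
    centres = random.sample(values, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return sorted(groups, key=lambda g: min(g, default=float("inf")))

# Hypothetical customer ages with two natural segments:
ages = [21, 23, 25, 58, 61, 64]
print(kmeans_1d(ages, 2))  # [[21, 23, 25], [58, 61, 64]]
```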
Knowledge Representation Methods
- Neural Networks
  - a trained neural network can be thought of as an "expert" in the category of information it has been given to analyze
  - provides projections given new situations of interest and answers "what if" questions
  - problems include:
    - the resulting network is viewed as a black box
    - no explanation of the results is given, i.e. it is difficult for the user to interpret the results
    - difficult to incorporate user intervention
    - slow to train due to their iterative nature
- Decision trees
  - used to represent knowledge
  - built using a training set of data, and can then be used to classify new objects
  - problems are:
    - opaque structure - difficult to understand
    - missing data can cause performance problems
    - they become cumbersome for large data sets
- Rules
  - probably the most common form of representation
  - tend to be simple and intuitive
  - unstructured and less rigid
  - problems are:
    - difficult to maintain
    - inadequate to represent many types of knowledge
  - Example format: if X then Y
Data Mining Applications
- Credit Assessment
- Stock Market Prediction
- Fault Diagnosis in Production Systems
- Medical Discovery
- Fraud Detection
- Hazard Forecasting
- Buying Trends Analysis
- Organizational Restructuring
- Target Mailing
- Knowledge Acquisition
- Scientific Discovery
- Semantics-based Performance Enhancement of DBMS
Example of DBMS, OLAP and Data Mining: Weather data
Assume we have made a record of the weather conditions during a two-week
period, along with the decisions of a tennis player whether or not to play
tennis on each particular day. Thus we have generated tuples (or
examples, instances) consisting of values of four independent variables
(outlook, temperature, humidity, windy) and one dependent variable
(play). See the textbook for a detailed description.
DBMS
Consider our data stored in a relational table as follows:
| Day | outlook | temperature | humidity | windy | play |
|-----|----------|-------------|----------|-------|------|
| 1 | sunny | 85 | 85 | false | no |
| 2 | sunny | 80 | 90 | true | no |
| 3 | overcast | 83 | 86 | false | yes |
| 4 | rainy | 70 | 96 | false | yes |
| 5 | rainy | 68 | 80 | false | yes |
| 6 | rainy | 65 | 70 | true | no |
| 7 | overcast | 64 | 65 | true | yes |
| 8 | sunny | 72 | 95 | false | no |
| 9 | sunny | 69 | 70 | false | yes |
| 10 | rainy | 75 | 80 | false | yes |
| 11 | sunny | 75 | 70 | true | yes |
| 12 | overcast | 72 | 90 | true | yes |
| 13 | overcast | 81 | 75 | false | yes |
| 14 | rainy | 71 | 91 | true | no |
By querying a DBMS containing the above table we may answer questions like:
- What was the temperature on the sunny days? {85, 80, 72, 69, 75}
- On which days was the humidity less than 75? {6, 7, 9, 11}
- On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
- On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the two sets above: {11}
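These queries are simple selections and projections, and can be mimicked with list comprehensions over the table rows (the values below are copied from the weather table).

```python
# Rows as (day, outlook, temperature, humidity, windy, play) tuples:
rows = [
    (1, "sunny", 85, 85, False, "no"),     (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"), (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),  (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"), (14, "rainy", 71, 91, True, "no"),
]

# Temperatures on the sunny days:
print([r[2] for r in rows if r[1] == "sunny"])          # [85, 80, 72, 69, 75]
# Days where temperature > 70 and humidity < 75:
print([r[0] for r in rows if r[2] > 70 and r[3] < 75])  # [11]
```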
OLAP
Using OLAP we can create a Multidimensional Model of our data (Data Cube). For example, using the dimensions time, outlook and play, we can create the following model:

| 9 / 5 | sunny | rainy | overcast |
|--------|-------|-------|----------|
| Week 1 | 0 / 2 | 2 / 1 | 2 / 0 |
| Week 2 | 2 / 1 | 1 / 1 | 2 / 0 |
Here time represents the days grouped into weeks (week 1 - days 1 to 7; week 2 - days 8 to 14) along the vertical axis. The outlook is shown along the horizontal axis, and the third dimension, play, is shown in each individual cell as a pair of values corresponding to the two values along this dimension - yes / no. Thus in the upper left corner of the cube we have the totals over all weeks and all outlook values.
By observing the data cube we can easily identify some important properties
of the data, find regularities or patterns. For example, the third column
clearly shows that if the outlook is overcast the play attribute is always
yes. This may be put as a rule:
if outlook = overcast then play = yes
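The cube itself is just an aggregation over the chosen dimensions. A minimal sketch keeps a counter keyed by (week, outlook, play); the triples are copied from the weather table.

```python
from collections import Counter

# (day, outlook, play) triples from the weather table:
records = [(1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
           (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
           (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
           (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
           (13, "overcast", "yes"), (14, "rainy", "no")]

# Aggregate play counts over the dimensions (week, outlook):
cube = Counter()
for day, outlook, play in records:
    week = 1 if day <= 7 else 2
    cube[(week, outlook, play)] += 1

print(cube[(1, "overcast", "yes")], "/", cube[(1, "overcast", "no")])  # 2 / 0
print(cube[(2, "sunny", "yes")], "/", cube[(2, "sunny", "no")])        # 2 / 1
```

Note that outlook=overcast never co-occurs with play=no, which is exactly the regularity the rule above captures.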
We may now apply "Drill-down" to our data cube over the time dimension.
This assumes the existence of a concept hierarchy for this attribute.
We can show this as a horizontal tree as follows:
time
- week 1
  - day 1
  - day 2
  - day 3
  - day 4
  - day 5
  - day 6
  - day 7
- week 2
  - day 8
  - day 9
  - day 10
  - day 11
  - day 12
  - day 13
  - day 14
The drill-down operation is based on climbing down the concept hierarchy,
so that we get the following data cube:
| 9 / 5 | sunny | rainy | overcast |
|-------|-------|-------|----------|
| 1 | 0 / 1 | 0 / 0 | 0 / 0 |
| 2 | 0 / 1 | 0 / 0 | 0 / 0 |
| 3 | 0 / 0 | 0 / 0 | 1 / 0 |
| 4 | 0 / 0 | 1 / 0 | 0 / 0 |
| 5 | 0 / 0 | 1 / 0 | 0 / 0 |
| 6 | 0 / 0 | 0 / 1 | 0 / 0 |
| 7 | 0 / 0 | 0 / 0 | 1 / 0 |
| 8 | 0 / 1 | 0 / 0 | 0 / 0 |
| 9 | 1 / 0 | 0 / 0 | 0 / 0 |
| 10 | 0 / 0 | 1 / 0 | 0 / 0 |
| 11 | 1 / 0 | 0 / 0 | 0 / 0 |
| 12 | 0 / 0 | 0 / 0 | 1 / 0 |
| 13 | 0 / 0 | 0 / 0 | 1 / 0 |
| 14 | 0 / 0 | 0 / 1 | 0 / 0 |
The reverse of drill-down (called roll-up) applied to this data cube
results in the previous cube with two values (week 1 and week 2) along
the time dimension.
Data Mining
By applying various Data Mining techniques we can find associations and
regularities in our data, extract knowledge in the forms of rules, decision
trees etc., or just predict the value of the dependent variable (play)
in new situations (tuples). Here are some examples (all produced by Weka):
Mining Association Rules
To find associations in our data we first discretize the numeric attributes (part of the data pre-processing stage in data mining). Thus we group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal), and substitute the values in the data with the corresponding names. Then we apply the Apriori algorithm and get the following association rules:
1. humidity=normal windy=false 4 ==> play=yes (4, 1)
2. temperature=cool 4 ==> humidity=normal (4, 1)
3. outlook=overcast 4 ==> play=yes (4, 1)
4. temperature=cool play=yes 3 ==> humidity=normal (3, 1)
5. outlook=rainy windy=false 3 ==> play=yes (3, 1)
6. outlook=rainy play=yes 3 ==> windy=false (3, 1)
7. outlook=sunny humidity=high 3 ==> play=no (3, 1)
8. outlook=sunny play=no 3 ==> humidity=high (3, 1)
9. temperature=cool windy=false 2 ==> humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false 2 ==> play=yes (2, 1)
These rules show some sets of attribute values (so-called item sets) that appear frequently in the data. The numbers after each rule show the support (the number of occurrences of the item set in the data) and the confidence (accuracy) of the rule. Interestingly, rule 3 is the same as the one that we produced by observing the data cube.
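The support and confidence of rule 3 can be verified by hand from the (outlook, play) pairs in the table:

```python
# (outlook, play) pairs copied from the weather table:
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rainy", "no")]

# Rule 3: outlook=overcast ==> play=yes
antecedent = [d for d in data if d[0] == "overcast"]
both = [d for d in antecedent if d[1] == "yes"]
support = len(both)                 # occurrences of the full item set
conf = len(both) / len(antecedent)  # confidence of the rule
print(support, conf)  # 4 1.0
```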
Classification by Decision Trees and Rules
Using the ID3 algorithm we can produce the following decision tree
(shown as a horizontal tree):
- outlook = sunny
  - humidity = high: no
  - humidity = normal: yes
- outlook = overcast: yes
- outlook = rainy
  - windy = true: no
  - windy = false: yes
The decision tree consists of decision nodes that test the values of their
corresponding attribute. Each value of this attribute leads to a subtree
and so on, until the leaves of the tree are reached. They determine the
value of the dependent variable. Using a decision tree we can classify
new tuples (not used to generate the tree). For example, according to the
above tree the tuple {sunny, mild, normal, false} will be classified
under play=yes.
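Such a tree can be encoded as nested structures and walked attribute by attribute until a leaf is reached. The encoding below (tuples for decision nodes, strings for leaves) is our own choice of representation.

```python
# The decision tree above: internal nodes are (attribute, branches) pairs,
# branches map attribute values to subtrees; leaves are class labels.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})

def classify(node, example):
    """Walk the tree, following the branch matching the example's
    attribute value, until a leaf (a plain string) is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

new_tuple = {"outlook": "sunny", "temperature": "mild",
             "humidity": "normal", "windy": "false"}
print(classify(tree, new_tuple))  # yes
```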
A decision tree can be represented as a set of rules, where each rule represents a path through the tree from the root to a leaf. Other Data Mining techniques can produce rules directly. For example, the Prism algorithm available in Weka generates the following rules:
If outlook = overcast then yes
If humidity = normal and windy = false then yes
If temperature = mild and humidity = normal then yes
If outlook = rainy and windy = false then yes
If outlook = sunny and humidity = high then no
If outlook = rainy and windy = true then no
Prediction methods
Data Mining offers techniques to predict the value of the dependent variable directly, without first generating a model. One of the most popular approaches for this purpose is based on statistical methods. It uses Bayes' rule to predict the probability of each value of the dependent variable given the values of the independent variables. For example, applying Bayes' rule to the new tuple discussed above we get:
P(play=yes | outlook=sunny, temperature=mild, humidity=normal,
windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false)
= 0.2
Then obviously the predicted value must be "yes".
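These probabilities can be reproduced with a from-scratch naive Bayes sketch (naive in that it assumes the attributes are independent given the class). The nominal rows below come from discretizing the numeric table as in the association-rule section; that mapping is our assumption.

```python
# Rows: (outlook, temperature, humidity, windy, play), after discretization.
data = [
    ("sunny", "hot", "high", "false", "no"),     ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]

def posterior(example):
    """P(play=cls | example) = P(cls) * prod P(attr=value | cls), normalized."""
    scores = {}
    for cls in ("yes", "no"):
        rows = [r for r in data if r[4] == cls]
        score = len(rows) / len(data)        # prior P(play=cls)
        for i, value in enumerate(example):  # likelihoods P(attr=value | cls)
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

p = posterior(("sunny", "mild", "normal", "false"))
print(round(p["yes"], 1), round(p["no"], 1))  # 0.8 0.2
```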