Introduction to Data Mining
"Drowning in Data yet Starving for Knowledge"
"Computers have promised us a fountain of wisdom but delivered
a flood of data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus
“Where is the wisdom we have lost in knowledge? Where is the knowledge
we have lost in information?”
T. S. Eliot
What is NOT data mining?
Data Mining, noun: "Torturing data until it confesses ... and if
you torture it enough, it will confess to anything"
Jeff Jonas, IBM
"An Unethical Econometric practice of massaging and manipulating
the data to obtain the desired results"
W.S. Brown “Introducing Econometrics”
"A buzz word for what used to be known as DBMS reports"
An Anonymous Data Mining Skeptic
What is data mining?
"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data"
William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus
- Data mining finds valuable information hidden in large volumes of data.
- Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
- The computer is responsible for finding the patterns by identifying the underlying rules and features in the data.
- It is possible to "strike gold" in unexpected places, as the data mining software extracts patterns that were not previously discernible, or were so obvious that no one had noticed them before.
- Mining analogy:
  - large volumes of data are sifted in an attempt to find something worthwhile.
  - in a mining operation, large amounts of low-grade material are sifted in order to find something of value.
Data Mining - an interdisciplinary field
- Databases
- Statistics
- High Performance Computing
- Machine Learning
- Visualization
- Mathematics
Related Technologies
Machine Learning vs. Data Mining
- Large data sets in Data Mining
- Efficiency of algorithms is important
- Scalability of algorithms is important
- Real-world data
- Lots of missing values
- Pre-existing data - not user-generated
- Data not static - prone to updates
- Efficient methods for data retrieval available for use
- Domain knowledge in the form of integrity constraints available
Data Mining vs. DBMS
- Example DBMS reports
  - Last month's sales for each service type
  - Sales per service grouped by customer sex or age bracket
  - List of customers who lapsed their policy
- Questions answered using Data Mining
  - What characteristics do customers who lapse their policy have in common, and how do they differ from customers who renew their policy?
  - Which motor insurance policy holders would be potential customers for my House Content Insurance policy?
Data Warehouse
- Data Warehouse: a centralized data repository which can be queried for business benefit.
- Data Warehousing makes it possible to:
  - extract archived operational data
  - overcome inconsistencies between different legacy data formats
  - integrate data throughout an enterprise, regardless of location, format, or communication requirements
  - incorporate additional or expert information
On-line Analytical Processing (OLAP)
- Multi-Dimensional Data Model (Data Cube)
- Operations:
  - Roll-up
  - Drill-down
  - Slice and dice
  - Rotate
Statistical Analysis
- Ill-suited for nominal and structured data types
- Completely data-driven - incorporation of domain knowledge is not possible
- Interpretation of results is difficult and daunting
- Requires expert user guidance
Data Mining Goals
Classification
- The DM system learns from examples or from the data how to partition or classify the data, i.e. it formulates classification rules
- Example - a customer database in a bank
- Question - is a new customer applying for a loan a good investment or not?
- Typical rule formulated:
  if STATUS = married and INCOME > 10000 and HOUSE_OWNER = yes
  then INVESTMENT_TYPE = good
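A rule like this maps directly onto a predicate. The sketch below uses the attribute names from the rule itself; the customer values passed in are hypothetical.

```python
def investment_is_good(status, income, house_owner):
    """The classification rule from the text: a married house owner
    with income over 10000 is classified as a good investment."""
    return status == "married" and income > 10000 and house_owner

print(investment_is_good("married", 25000, True))   # True
print(investment_is_good("married", 25000, False))  # False - fails the HOUSE_OWNER test
```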
Association
- Rules that associate one attribute of a relation to another
- Set-oriented approaches are the most efficient means of discovering such rules
- Example - a supermarket database
  - 72% of all the records that contain items A and B also contain item C
  - the specific percentage of occurrences, 72%, is the confidence factor of the rule
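Confidence can be computed directly from a list of transactions. The baskets below are made-up illustrations; with them, two of the three baskets containing A and B also contain C, so the confidence is about 67%.

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent: the fraction of
    transactions containing the antecedent that also contain the consequent."""
    containing = [t for t in transactions if antecedent <= t]
    if not containing:
        return 0.0
    return sum(1 for t in containing if consequent <= t) / len(containing)

# Hypothetical supermarket transactions (sets of item codes):
baskets = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
print(confidence(baskets, {"A", "B"}, {"C"}))  # 2 of 3 baskets -> ~0.67
```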
Sequence/Temporal
- Sequential pattern functions analyze collections of related records and detect frequently occurring patterns over a period of time
- The difference between sequence rules and other rules is the temporal factor
- Example - a retailer's database can be used to discover the set of purchases that frequently precedes the purchase of a microwave oven
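One simple form of this temporal analysis is to count which items customers bought before the target item. Everything in this sketch (function name, purchase histories, threshold) is illustrative, not a specific algorithm named in the text.

```python
from collections import Counter

def frequent_predecessors(histories, target, min_count=2):
    """Count which items appear in a purchase history before the
    first purchase of `target`; keep those seen at least min_count times."""
    counts = Counter()
    for history in histories:
        if target in history:
            counts.update(set(history[:history.index(target)]))
    return {item for item, c in counts.items() if c >= min_count}

# Hypothetical purchase sequences, oldest purchase first:
histories = [
    ["toaster", "kettle", "microwave"],
    ["kettle", "blender", "microwave"],
    ["toaster", "lamp"],
]
print(frequent_predecessors(histories, "microwave"))  # {'kettle'}
```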
Database management systems (DBMS), Online Analytical Processing (OLAP)
and Data Mining
| Area | DBMS | OLAP | Data Mining |
|------|------|------|-------------|
| Task | Extraction of detailed and summary data | Summaries, trends and forecasts | Knowledge discovery of hidden patterns and insights |
| Type of result | Information | Analysis | Insight and prediction |
| Method | Deduction (ask the question, verify with data) | Multidimensional data modeling, aggregation, statistics | Induction (build the model, apply it to new data, get the result) |
| Example question | Who purchased mutual funds in the last 3 years? | What is the average income of mutual fund buyers by region by year? | Who will buy a mutual fund in the next 6 months and why? |
Stages of the data mining process
- Data pre-processing
  - Heterogeneity resolution
  - Data cleansing
  - Data transformation
  - Data reduction
  - Discretization and generating concept hierarchies
- Creating a data model: applying Data Mining tools to extract knowledge from data
- Testing the model: the performance of the model (e.g. accuracy, completeness) is tested on independent data (not used to create the model)
- Interpretation and evaluation: user bias can direct the DM tools to areas of interest, using
  - Attributes of interest in databases
  - Goal of discovery
  - Domain knowledge
  - Prior knowledge or belief about the domain
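The discretization step above can be sketched as simple binning: a numeric value is mapped to a named interval (concept). The cut-points (70 and 80) and interval names here are illustrative assumptions, not values fixed by the text.

```python
def discretize(value, cutpoints, labels):
    """Map a numeric value to the named interval it falls in.
    cutpoints are the upper bounds of all intervals except the last."""
    for cut, label in zip(cutpoints, labels):
        if value < cut:
            return label
    return labels[-1]

temp_labels = ["cool", "mild", "hot"]  # hypothetical concept names
print(discretize(85, [70, 80], temp_labels))  # hot
print(discretize(72, [70, 80], temp_labels))  # mild
print(discretize(65, [70, 80], temp_labels))  # cool
```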
Techniques
- Set-oriented database methods
- Statistics: can be used in several data mining stages
  - data cleansing: removal of erroneous or irrelevant data
  - EDA (exploratory data analysis): frequency counts, histograms, etc.
  - data selection and sampling: reduce the scale of computation
  - attribute re-definition
  - data analysis: measures of association and relationships between attributes, interestingness of rules, classification, etc.
- Visualization: enhances EDA and makes patterns more visible
- Clustering (cluster analysis)
  - Clustering and segmentation partition the database so that each partition or group is similar according to some criterion or metric
  - Clustering according to similarity is a concept that appears in many disciplines, e.g. the clustering of molecules in chemistry
  - Data mining applications make use of clustering according to similarity, e.g. segmenting a client/customer base
  - It provides sub-groups of a population for further analysis or action - very important when dealing with very large databases
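Partitioning a population by similarity can be illustrated with a minimal k-means sketch (one of many possible clustering criteria; the text does not name a specific algorithm). The customer ages below are made up.

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """A minimal k-means sketch for one-dimensional data: repeatedly
    assign each value to the nearest centre, then move each centre
    to the mean of its assigned values."""
    random.seed(seed)
    centres = random.sample(values, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return sorted(groups, key=lambda g: min(g, default=float("inf")))

# Hypothetical customer ages with two natural segments:
ages = [21, 23, 25, 58, 61, 64]
print(kmeans_1d(ages, 2))  # [[21, 23, 25], [58, 61, 64]]
```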
Knowledge Representation Methods
- Neural Networks
  - a trained neural network can be thought of as an "expert" in the category of information it has been given to analyze
  - provides projections given new situations of interest and answers "what if" questions
  - problems include:
    - the resulting network is viewed as a black box
    - no explanation of the results is given, i.e. it is difficult for the user to interpret the results
    - difficult to incorporate user intervention
    - slow to train due to their iterative nature
- Decision trees
  - used to represent knowledge
  - built using a training set of data, and can then be used to classify new objects
  - problems are:
    - opaque structure - difficult to understand
    - missing data can cause performance problems
    - they become cumbersome for large data sets
- Rules
  - probably the most common form of representation
  - tend to be simple and intuitive
  - unstructured and less rigid
  - problems are:
    - difficult to maintain
    - inadequate to represent many types of knowledge
  - Example format: if X then Y
Data Mining Applications
- Credit Assessment
- Stock Market Prediction
- Fault Diagnosis in Production Systems
- Medical Discovery
- Fraud Detection
- Hazard Forecasting
- Buying Trends Analysis
- Organizational Restructuring
- Target Mailing
- Knowledge Acquisition
- Scientific Discovery
- Semantics-based Performance Enhancement of DBMS
Example of DBMS, OLAP and Data Mining: Weather data
Assume we have made a record of the weather conditions during a two-week
period, along with the decisions of a tennis player whether or not to play
tennis on each particular day. Thus we have generated tuples (or
examples, instances) consisting of values of four independent variables
(outlook, temperature, humidity, windy) and one dependent variable
(play). See the textbook for a detailed description.
DBMS
Consider our data stored in a relational table as follows:
| Day | outlook | temperature | humidity | windy | play |
|-----|----------|-------------|----------|-------|------|
| 1 | sunny | 85 | 85 | false | no |
| 2 | sunny | 80 | 90 | true | no |
| 3 | overcast | 83 | 86 | false | yes |
| 4 | rainy | 70 | 96 | false | yes |
| 5 | rainy | 68 | 80 | false | yes |
| 6 | rainy | 65 | 70 | true | no |
| 7 | overcast | 64 | 65 | true | yes |
| 8 | sunny | 72 | 95 | false | no |
| 9 | sunny | 69 | 70 | false | yes |
| 10 | rainy | 75 | 80 | false | yes |
| 11 | sunny | 75 | 70 | true | yes |
| 12 | overcast | 72 | 90 | true | yes |
| 13 | overcast | 81 | 75 | false | yes |
| 14 | rainy | 71 | 91 | true | no |
By querying a DBMS containing the above table we may answer questions like:
- What was the temperature on the sunny days? {85, 80, 72, 69, 75}
- On which days was the humidity less than 75? {6, 7, 9, 11}
- On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
- On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the two sets above: {11}
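These queries are simple selections and projections, and can be mimicked with list comprehensions over the table rows (the values below are copied from the weather table).

```python
# Rows as (day, outlook, temperature, humidity, windy, play) tuples:
rows = [
    (1, "sunny", 85, 85, False, "no"),     (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"), (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),  (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"), (14, "rainy", 71, 91, True, "no"),
]

# Temperatures on the sunny days:
print([r[2] for r in rows if r[1] == "sunny"])          # [85, 80, 72, 69, 75]
# Days where temperature > 70 and humidity < 75:
print([r[0] for r in rows if r[2] > 70 and r[3] < 75])  # [11]
```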
OLAP
Using OLAP we can create a Multidimensional Model of our data (Data Cube). For example, using the dimensions time, outlook and play, we can create the following model:

| 9 / 5 | sunny | rainy | overcast |
|--------|-------|-------|----------|
| Week 1 | 0 / 2 | 2 / 1 | 2 / 0 |
| Week 2 | 2 / 1 | 1 / 1 | 2 / 0 |
Here time represents the days grouped into weeks (week 1 - days 1 to 7; week 2 - days 8 to 14) along the vertical axis. The outlook is shown along the horizontal axis, and the third dimension, play, is shown in each individual cell as a pair of values corresponding to the two values along this dimension - yes / no. Thus in the upper left corner of the cube we have the totals over all weeks and all outlook values.
By observing the data cube we can easily identify some important properties
of the data, find regularities or patterns. For example, the third column
clearly shows that if the outlook is overcast the play attribute is always
yes. This may be put as a rule:
if outlook = overcast then play = yes
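The cube itself is just an aggregation over the chosen dimensions. A minimal sketch keeps a counter keyed by (week, outlook, play); the triples are copied from the weather table.

```python
from collections import Counter

# (day, outlook, play) triples from the weather table:
records = [(1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
           (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
           (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
           (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
           (13, "overcast", "yes"), (14, "rainy", "no")]

# Aggregate play counts over the dimensions (week, outlook):
cube = Counter()
for day, outlook, play in records:
    week = 1 if day <= 7 else 2
    cube[(week, outlook, play)] += 1

print(cube[(1, "overcast", "yes")], "/", cube[(1, "overcast", "no")])  # 2 / 0
print(cube[(2, "sunny", "yes")], "/", cube[(2, "sunny", "no")])        # 2 / 1
```

Note that outlook=overcast never co-occurs with play=no, which is exactly the regularity the rule above captures.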
We may now apply "Drill-down" to our data cube over the time dimension.
This assumes the existence of a concept hierarchy for this attribute.
We can show this as a horizontal tree as follows:
time
- week 1
  - day 1
  - day 2
  - day 3
  - day 4
  - day 5
  - day 6
  - day 7
- week 2
  - day 8
  - day 9
  - day 10
  - day 11
  - day 12
  - day 13
  - day 14
The drill-down operation is based on climbing down the concept hierarchy,
so that we get the following data cube:
| 9 / 5 | sunny | rainy | overcast |
|-------|-------|-------|----------|
| 1 | 0 / 1 | 0 / 0 | 0 / 0 |
| 2 | 0 / 1 | 0 / 0 | 0 / 0 |
| 3 | 0 / 0 | 0 / 0 | 1 / 0 |
| 4 | 0 / 0 | 1 / 0 | 0 / 0 |
| 5 | 0 / 0 | 1 / 0 | 0 / 0 |
| 6 | 0 / 0 | 0 / 1 | 0 / 0 |
| 7 | 0 / 0 | 0 / 0 | 1 / 0 |
| 8 | 0 / 1 | 0 / 0 | 0 / 0 |
| 9 | 1 / 0 | 0 / 0 | 0 / 0 |
| 10 | 0 / 0 | 1 / 0 | 0 / 0 |
| 11 | 1 / 0 | 0 / 0 | 0 / 0 |
| 12 | 0 / 0 | 0 / 0 | 1 / 0 |
| 13 | 0 / 0 | 0 / 0 | 1 / 0 |
| 14 | 0 / 0 | 0 / 1 | 0 / 0 |
The reverse of drill-down (called roll-up) applied to this data cube
results in the previous cube with two values (week 1 and week 2) along
the time dimension.
Data Mining
By applying various Data Mining techniques we can find associations and
regularities in our data, extract knowledge in the forms of rules, decision
trees etc., or just predict the value of the dependent variable (play)
in new situations (tuples). Here are some examples (all produced by Weka):
Mining Association Rules
To find associations in our data we first discretize the numeric attributes (part of the data pre-processing stage in data mining). Thus we group the temperature values into three intervals (hot, mild, cool) and the humidity values into two (high, normal), and substitute the values in the data with the corresponding names. Then we apply the Apriori algorithm and get the following association rules:
1. humidity=normal windy=false 4 ==> play=yes (4, 1)
2. temperature=cool 4 ==> humidity=normal (4, 1)
3. outlook=overcast 4 ==> play=yes (4, 1)
4. temperature=cool play=yes 3 ==> humidity=normal (3, 1)
5. outlook=rainy windy=false 3 ==> play=yes (3, 1)
6. outlook=rainy play=yes 3 ==> windy=false (3, 1)
7. outlook=sunny humidity=high 3 ==> play=no (3, 1)
8. outlook=sunny play=no 3 ==> humidity=high (3, 1)
9. temperature=cool windy=false 2 ==> humidity=normal play=yes (2, 1)
10. temperature=cool humidity=normal windy=false 2 ==> play=yes (2, 1)
These rules show some sets of attribute values (so-called item sets) that appear frequently in the data. The numbers after each rule show the support (the number of occurrences of the item set in the data) and the confidence (accuracy) of the rule. Interestingly, rule 3 is the same as the one that we produced by observing the data cube.
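The support and confidence of rule 3 can be verified by hand from the (outlook, play) pairs in the table:

```python
# (outlook, play) pairs copied from the weather table:
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rainy", "no")]

# Rule 3: outlook=overcast ==> play=yes
antecedent = [d for d in data if d[0] == "overcast"]
both = [d for d in antecedent if d[1] == "yes"]
support = len(both)                 # occurrences of the full item set
conf = len(both) / len(antecedent)  # confidence of the rule
print(support, conf)  # 4 1.0
```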
Classification by Decision Trees and Rules
Using the ID3 algorithm we can produce the following decision tree
(shown as a horizontal tree):
- outlook = sunny
  - humidity = high: no
  - humidity = normal: yes
- outlook = overcast: yes
- outlook = rainy
  - windy = true: no
  - windy = false: yes
The decision tree consists of decision nodes that test the values of their
corresponding attribute. Each value of this attribute leads to a subtree
and so on, until the leaves of the tree are reached. They determine the
value of the dependent variable. Using a decision tree we can classify
new tuples (not used to generate the tree). For example, according to the
above tree the tuple {sunny, mild, normal, false} will be classified
under play=yes.
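Such a tree can be encoded as nested structures and walked attribute by attribute until a leaf is reached. The encoding below (tuples for decision nodes, strings for leaves) is our own choice of representation.

```python
# The decision tree above: internal nodes are (attribute, branches) pairs,
# branches map attribute values to subtrees; leaves are class labels.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})

def classify(node, example):
    """Walk the tree, following the branch matching the example's
    attribute value, until a leaf (a plain string) is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

new_tuple = {"outlook": "sunny", "temperature": "mild",
             "humidity": "normal", "windy": "false"}
print(classify(tree, new_tuple))  # yes
```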
A decision tree can be represented as a set of rules, where each rule represents a path through the tree from the root to a leaf. Other Data Mining techniques can produce rules directly. For example, the Prism algorithm available in Weka generates the following rules:
If outlook = overcast then yes
If humidity = normal and windy = false then yes
If temperature = mild and humidity = normal then yes
If outlook = rainy and windy = false then yes
If outlook = sunny and humidity = high then no
If outlook = rainy and windy = true then no
Prediction methods
Data Mining offers techniques to predict the value of the dependent variable directly, without first generating a model. One of the most popular approaches for this purpose is based on statistical methods. It uses Bayes' rule to predict the probability of each value of the dependent variable given the values of the independent variables. For example, applying Bayes' rule to the new tuple discussed above we get:
P(play=yes | outlook=sunny, temperature=mild, humidity=normal,
windy=false) = 0.8
P(play=no | outlook=sunny, temperature=mild, humidity=normal, windy=false)
= 0.2
Then obviously the predicted value must be "yes".
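These probabilities can be reproduced with a from-scratch naive Bayes sketch (naive in that it assumes the attributes are independent given the class). The nominal rows below come from discretizing the numeric table as in the association-rule section; that mapping is our assumption.

```python
# Rows: (outlook, temperature, humidity, windy, play), after discretization.
data = [
    ("sunny", "hot", "high", "false", "no"),     ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]

def posterior(example):
    """P(play=cls | example) = P(cls) * prod P(attr=value | cls), normalized."""
    scores = {}
    for cls in ("yes", "no"):
        rows = [r for r in data if r[4] == cls]
        score = len(rows) / len(data)        # prior P(play=cls)
        for i, value in enumerate(example):  # likelihoods P(attr=value | cls)
            score *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

p = posterior(("sunny", "mild", "normal", "false"))
print(round(p["yes"], 1), round(p["no"], 1))  # 0.8 0.2
```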