Data preprocessing
Why preprocessing ?
-
Real world data are generally
-
Incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
-
Noisy: containing errors or outliers
-
Inconsistent: containing discrepancies in codes or names
-
Tasks in data preprocessing
-
Data cleaning: fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.
-
Data integration: using multiple databases, data cubes, or files.
-
Data transformation: normalization and aggregation.
-
Data reduction: reducing the volume but producing the same or similar analytical
results.
-
Data discretization: part of data reduction, replacing numerical attributes
with nominal ones.
Data cleaning
-
Fill in missing values (attribute or class value):
-
Ignore the tuple: usually done when class label is missing.
-
Use the attribute mean (or majority nominal value) to fill in the missing
value.
-
Use the attribute mean (or majority nominal value) for all samples belonging
to the same class.
-
Predict the missing value by using a learning algorithm: consider the attribute
with the missing value as a dependent (class) variable and run a learning
algorithm (usually Bayes or decision tree) to predict the missing value.
-
Identify outliers and smooth out noisy data:
-
Binning
-
Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below);
-
Then smooth by bin means, bin median, or bin boundaries.
-
Clustering: group values in clusters and then detect and remove outliers
(automatic or manual)
-
Regression: smooth by fitting the data into regression functions.
-
Correct inconsistent data: use domain knowledge or expert decision.
Data transformation
-
Normalization:
-
Scaling attribute values to fall within a specified range.
-
Example: to transform V in [min, max] to V' in [0,1],
apply V'=(V-Min)/(Max-Min)
-
Scaling by using mean and standard deviation (useful when min and max are
unknown or when there are outliers): V'=(V-Mean)/StDev
-
Aggregation: moving up in the concept hierarchy on numeric attributes.
-
Generalization: moving up in the concept hierarchy on nominal attributes.
-
Attribute construction: replacing or adding new attributes inferred by
existing attributes.
Data reduction
-
Reducing the number of attributes
-
Data cube aggregation: applying roll-up, slice or dice operations.
-
Removing irrelevant attributes: attribute selection (filtering and wrapper
methods), searching the attribute space (see Lecture 5: Attribute-oriented
analysis).
-
Principle component analysis (numeric attributes only): searching for a
lower dimensional space that can best represent the data..
-
Reducing the number of attribute values
-
Binning (histograms): reducing the number of attributes by grouping them
into intervals (bins).
-
Clustering: grouping values in clusters.
-
Aggregation or generalization
-
Reducing the number of tuples
Discretization and generating concept hierarchies
-
Unsupervised discretization - class variable is not used.
-
Equal-interval (equiwidth) binning: split the whole range of numbers in
intervals with equal size.
-
Equal-frequency (equidepth) binning: use intervals containing equal number
of values.
-
Supervised discretization - uses the values of the class variable.
-
Using class boundaries. Three steps:
-
Sort values.
-
Place breakpoints between values belonging to different classes.
-
If too many intervals, merge intervals with equal or similar class distributions.
-
Entropy (information)-based discretization. Example:
-
Information in a class distribution:
-
Denote a set of five values occurring in tuples belonging to two classes
(+ and -) as [+,+,+,-,-]
-
That is, the first 3 belong to "+" tuples and the last 2 - to "-" tuples
-
Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5)-(2/5)*log(2/5) (logs
are base 2)
-
3/5 and 2/5 are relative frequencies (probabilities)
-
Ignoring the order of the values, we can use the following notation:
[3,2] meaning 3 values from one class and 2 - from the other.
-
Then, Info([3,2]) = -(3/5)*log(3/5)-(2/5)*log(2/5)
-
Information in a split (2/5 and 3/5 are weight coefficients):
-
Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
-
Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
-
Method:
-
Sort the values;
-
Calculate information in all possible splits;
-
Choose the split that minimizes information;
-
Do not include breakpoints between values belonging to the same class (this
will increase information);
-
Apply the same to the resulting intervals until some stopping criterion
is satisfied.
-
Generating concept hierarchies: recursively applying partitioning or discretization
methods.