Data preprocessing
Why preprocessing?

Real-world data are generally:

Incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data

Noisy: containing errors or outliers

Inconsistent: containing discrepancies in codes or names

Tasks in data preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.

Data integration: combining data from multiple databases, data cubes, or files.

Data transformation: normalization and aggregation.

Data reduction: reducing the volume but producing the same or similar analytical
results.

Data discretization: part of data reduction, replacing numerical attributes
with nominal ones.

Data cleaning

Fill in missing values (attribute or class value):

Ignore the tuple: usually done when class label is missing.

Use the attribute mean (or majority nominal value) to fill in the missing
value.

Use the attribute mean (or majority nominal value) for all samples belonging
to the same class.

Predict the missing value by using a learning algorithm: consider the attribute
with the missing value as a dependent (class) variable and run a learning
algorithm (usually Bayes or decision tree) to predict the missing value.
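
The two mean-based strategies above can be sketched as follows. This is a minimal illustration, assuming missing values are stored as None; the function names and the sample data are hypothetical and not part of the original notes.

```python
from collections import Counter, defaultdict
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of observed values in the same class."""
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

def fill_with_mode(values):
    """Replace None entries in a nominal attribute with the majority value."""
    majority = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [majority if v is None else v for v in values]

# Hypothetical sample: a numeric attribute with two missing values.
ages = [20, None, 40, 60, None]
labels = ["yes", "yes", "no", "no", "no"]
filled_global = fill_with_mean(ages)                  # uses the overall mean
filled_by_class = fill_with_class_mean(ages, labels)  # uses the per-class mean
```

Note that the per-class version fails with a KeyError if some class has no observed value at all; a real implementation would need a fallback for that case.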

Identify outliers and smooth out noisy data:

Binning

Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below);

Then smooth by bin means, bin median, or bin boundaries.

Clustering: group values in clusters and then detect and remove outliers
(automatic or manual)

Regression: smooth by fitting the data to regression functions.
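
As an illustration of the binning approach listed above, the sketch below sorts the values, partitions them into equal-frequency bins, and then smooths by bin means or by bin boundaries. The function names and the price data are assumptions made for this example; it also assumes there are at least as many values as bins.

```python
from statistics import mean

def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

# Hypothetical sample data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by bin median works the same way, with statistics.median in place of mean.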

Correct inconsistent data: use domain knowledge or expert decision.

Data transformation

Normalization:

Scaling attribute values to fall within a specified range.

Example: to transform V in [Min, Max] to V' in [0,1],
apply V' = (V - Min)/(Max - Min)

Scaling by using mean and standard deviation (useful when min and max are
unknown or when there are outliers): V' = (V - Mean)/StDev
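
Both normalization formulas can be sketched in a few lines. This is a minimal example, assuming the population standard deviation is meant (the notes do not say which variant) and that the attribute is not constant, since a constant attribute would cause a division by zero. The income values are made up.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Map values from [Min, Max] to [0, 1]: V' = (V - Min)/(Max - Min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score_scale(values):
    """Standardize values: V' = (V - Mean)/StDev (population StDev assumed)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical sample data.
incomes = [10, 20, 30, 40, 50]
scaled = min_max_scale(incomes)        # [0.0, 0.25, 0.5, 0.75, 1.0]
standardized = z_score_scale(incomes)  # mean 0, standard deviation 1
```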

Aggregation: moving up in the concept hierarchy on numeric attributes.

Generalization: moving up in the concept hierarchy on nominal attributes.

Attribute construction: replacing existing attributes with, or adding, new
attributes derived from existing ones.

Data reduction

Reducing the number of attributes

Data cube aggregation: applying roll-up, slice or dice operations.

Removing irrelevant attributes: attribute selection (filtering and wrapper
methods), searching the attribute space (see Lecture 5: Attribute-oriented
analysis).

Principal component analysis (numeric attributes only): searching for a
lower dimensional space that can best represent the data.

Reducing the number of attribute values

Binning (histograms): reducing the number of attribute values by grouping
them into intervals (bins).

Clustering: grouping values in clusters.

Aggregation or generalization

Reducing the number of tuples

Discretization and generating concept hierarchies

Unsupervised discretization - class variable is not used.

Equal-interval (equi-width) binning: split the whole range of numbers into
intervals of equal size.

Equal-frequency (equi-depth) binning: use intervals containing an equal number
of values.
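
The difference between the two unsupervised methods shows up clearly on skewed data. The sketch below is illustrative only: the function names and sample values are assumptions, and the maximum value is placed in the last equal-width bin by convention.

```python
def equal_width_edges(values, n_bins):
    """Interval boundaries that split [min, max] into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def equal_width_bin(value, edges):
    """Index of the interval containing value; the maximum goes in the last bin."""
    n_bins = len(edges) - 1
    width = edges[1] - edges[0]
    return min(int((value - edges[0]) / width), n_bins - 1)

def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

# Hypothetical skewed sample: one large value stretches the range.
values = [0, 1, 2, 3, 4, 5, 6, 7, 30]
edges = equal_width_edges(values, 3)
counts = [0, 0, 0]
for v in values:
    counts[equal_width_bin(v, edges)] += 1
depth_bins = equal_frequency_bins(values, 3)
```

With equal-width bins, eight of the nine values land in the first interval and the middle interval stays empty, while equal-frequency binning puts three values in each bin.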

Supervised discretization - uses the values of the class variable.

Using class boundaries. Three steps:

Sort values.

Place breakpoints between values belonging to different classes.

If too many intervals, merge intervals with equal or similar class distributions.
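
The first two steps of the class-boundary method can be sketched as below. Placing each breakpoint at the midpoint between the two neighbouring values is an assumption of this example (the notes only say "between"), and the merging step is not covered. The function name and data are hypothetical.

```python
def class_boundary_breakpoints(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    breakpoints = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        # Equal values cannot be separated, so they never get a breakpoint.
        if c1 != c2 and v1 != v2:
            breakpoints.append((v1 + v2) / 2)
    return breakpoints

# Hypothetical sample data.
temperatures = [64, 65, 68, 69, 70, 71]
play = ["yes", "no", "yes", "yes", "yes", "no"]
cuts = class_boundary_breakpoints(temperatures, play)
```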

Entropy (information)-based discretization. Example:

Information in a class distribution:

Denote a set of five values occurring in tuples belonging to two classes
(+ and -) as [+,+,+,-,-]

That is, the first 3 belong to "+" tuples and the last 2 to "-" tuples

Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5) - (2/5)*log(2/5) (logs
are base 2)

3/5 and 2/5 are relative frequencies (probabilities)

Ignoring the order of the values, we can use the following notation:
[3,2] meaning 3 values from one class and 2 from the other.

Then, Info([3,2]) = -(3/5)*log(3/5) - (2/5)*log(2/5)

Information in a split (2/5 and 3/5 are weight coefficients):

Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])

Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
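
To check the numbers, the two quantities can be computed directly. This is a small sketch; the function names info and split_info are chosen here for illustration, and a class with zero count is taken to contribute nothing to the sum.

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [3, 2]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_info(partitions):
    """Weighted average entropy of a split, e.g. [[2, 0], [1, 2]]."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

whole = info([3, 2])                        # about 0.971 bits
after_split = split_info([[2, 0], [1, 2]])  # about 0.551 bits
```

The split leaves less information than the unsplit set because the [2,0] part is pure and contributes zero.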

Method:

Sort the values;

Calculate information in all possible splits;

Choose the split that minimizes information;

Do not include breakpoints between values belonging to the same class (this
will increase information);

Apply the same to the resulting intervals until some stopping criterion
is satisfied.
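
The method can be sketched as a recursive search for the split with the least information. Several details here are assumptions made for the example rather than part of the notes: breakpoints are placed at midpoints, ties go to the first candidate, and the stopping criterion is a pure interval or a fixed recursion depth (the hypothetical max_depth parameter).

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_split(values, labels):
    """Return (breakpoint, information) for the binary split with least information."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        (v1, c1), (v2, c2) = pairs[i - 1], pairs[i]
        if c1 == c2 or v1 == v2:
            continue  # no breakpoints between same-class or equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        weighted = (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if best is None or weighted < best[1]:
            best = ((v1 + v2) / 2, weighted)
    return best

def discretize(values, labels, max_depth=2):
    """Recursively split the intervals; return the sorted list of breakpoints."""
    if max_depth == 0 or len(set(labels)) < 2:
        return []  # stopping criterion: depth exhausted or interval is pure
    split = best_split(values, labels)
    if split is None:
        return []
    cut = split[0]
    cuts = [cut]
    for keep in (lambda v: v < cut, lambda v: v > cut):
        part = [(v, c) for v, c in zip(values, labels) if keep(v)]
        vs, cs = zip(*part)
        cuts += discretize(vs, cs, max_depth - 1)
    return sorted(cuts)

# The five-value example from the notes, with hypothetical attribute values 1..5.
first = best_split([1, 2, 3, 4, 5], ["+", "+", "+", "-", "-"])
cuts = discretize([1, 2, 3, 4, 5, 6], ["+", "+", "-", "-", "+", "+"])
```

For the five-value example the only candidate breakpoint lies between the third and fourth values, and it yields two pure intervals with zero information.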

Generating concept hierarchies: recursively applying partitioning or discretization
methods.