Data Preprocessing

Definition:

Data preprocessing is a data mining technique that transforms raw data into a format that is easier to understand and analyze. Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and it is likely to contain many errors. Data preprocessing addresses these issues and prepares raw data for further processing.

Data preprocessing is used in database-driven applications, such as customer relationship management, and in rule-based applications.

Why Data Preprocessing?

Data in the real world is dirty:

incomplete: missing attribute values, lack of certain attributes of interest, or containing only aggregate data

e.g., occupation=“”

noisy: containing errors or outliers

e.g., Salary=“-10”

inconsistent: containing discrepancies in codes or names

e.g., Age=“42” Birthday=“03/07/1997”

e.g., Was rating “1,2,3”, now rating “A, B, C”

e.g., discrepancy between duplicate records.

Data Preprocessing Methods:

Data preprocessing is one of the most critical steps in the data mining process. It deals with the preparation and transformation of the dataset. Data preprocessing methods are divided into the following categories:

Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Discretization

Data Cleaning:

Data is cleansed through processes such as filling in missing values, smoothing the noisy data, or resolving the inconsistencies in the data.

1. Fill in missing values (attribute or class value):

Ignore the tuple: usually done when class label is missing.
Use the attribute mean (or majority nominal value) to fill in the missing value.
Use the attribute mean (or majority nominal value) for all samples belonging to the same class.
Predict the missing value by using a learning algorithm: consider the attribute with the missing value as a dependent (class) variable and run a learning algorithm (usually Bayes or decision tree) to predict the missing value.
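The mean-based and majority-value strategies above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample data are hypothetical, missing values are assumed to be represented as None, and every class is assumed to have at least one observed value.

```python
from collections import Counter, defaultdict
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of the observed values in the same class."""
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        if v is not None:
            by_class[c].append(v)
    # Assumption: each class has at least one observed value.
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]


def fill_with_majority(values):
    """Replace None entries of a nominal attribute with its most frequent value."""
    observed = [v for v in values if v is not None]
    majority = Counter(observed).most_common(1)[0][0]
    return [majority if v is None else v for v in values]


# Hypothetical sample data.
salaries = [30.0, None, 50.0, 70.0, None]
classes = ["low", "low", "high", "high", "high"]

print(fill_with_mean(salaries))                 # [30.0, 50.0, 50.0, 70.0, 50.0]
print(fill_with_class_mean(salaries, classes))  # [30.0, 30.0, 50.0, 70.0, 60.0]
print(fill_with_majority(["clerk", None, "clerk", "engineer"]))
```

Filling by class mean usually preserves more structure than the global mean, because the replacement reflects the group the tuple belongs to.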

2. Identify outliers and smooth out noisy data:

Binning

Sort the attribute values and partition them into bins (see "Unsupervised discretization" below).
Then smooth by bin means, bin medians, or bin boundaries.

Clustering: group values in clusters and then detect and remove outliers (automatic or manual)
Regression: smooth by fitting the data into regression functions.
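As an illustration of the binning approach, the sketch below sorts the values, partitions them into equal-frequency bins, and then smooths by bin means or by bin boundaries. The function names and the sample values are hypothetical, and the sketch assumes there are at least as many values as bins.

```python
from statistics import mean


def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum or maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Hypothetical sample data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by medians would follow the same pattern, with statistics.median in place of mean.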

3. Correct inconsistent data:

Use domain knowledge or an expert's decision.

Data Integration:

Data with different representations are put together and conflicts within the data are resolved.

Data Transformation:

Data is normalized, aggregated and generalized.

1. Normalization:

Scaling attribute values to fall within a specified range.

Example: to transform V in [Min, Max] to V' in [0, 1], apply V' = (V - Min) / (Max - Min)

Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers): V'=(V-Mean)/StDev
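Both scaling formulas can be written directly in Python. The sketch below is illustrative only: the function names are hypothetical, and the standard deviation is assumed to be the population standard deviation (statistics.pstdev). A sample standard deviation would work equally well as long as it is used consistently.

```python
from statistics import mean, pstdev


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that Min maps to new_min and Max to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined when all values are equal")
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


def z_score(values):
    """Scale values to V' = (V - Mean) / StDev (population standard deviation)."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("z-score scaling is undefined when StDev is zero")
    return [(v - m) / s for v in values]


# Hypothetical sample data.
incomes = [10, 20, 30, 40, 50]
print(min_max_scale(incomes))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score(incomes))
```

Min-max scaling is sensitive to outliers because a single extreme value stretches the range, which is why the mean and standard deviation variant is preferred in that case.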

2. Aggregation:

moving up in the concept hierarchy on numeric attributes.

3. Generalization:

moving up in the concept hierarchy on nominal attributes.

4. Attribute construction:

replacing existing attributes with, or adding, new attributes derived from existing ones.

Data Reduction:

This step aims to present a reduced representation of the data in a data warehouse.

1. Reducing the number of attributes

Data cube aggregation: applying roll-up, slice or dice operations.
Removing irrelevant attributes: attribute selection (filtering and wrapper methods), searching the attribute space (see Lecture 5: Attribute-oriented analysis).
Principal component analysis (numeric attributes only): searching for a lower dimensional space that can best represent the data.

2. Reducing the number of attribute values

Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
Clustering: grouping values in clusters.
Aggregation or generalization.

3. Reducing the number of tuples

Sampling

Data Discretization:

Involves reducing the number of values of a continuous attribute by dividing the range of the attribute into intervals.

1. Unsupervised discretization - class variable is not used.

Equal-interval (equiwidth) binning: split the whole range of numbers in intervals with equal size.
Equal-frequency (equidepth) binning: use intervals containing an equal number of values.
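The difference between the two unsupervised schemes is easiest to see when both are applied to the same data. The following sketch assigns each value a bin index under each scheme. The function names and the sample ages are hypothetical, and tied values may end up in different equal-frequency bins in this simple version.

```python
def equal_width_labels(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_labels(values, n_bins):
    """Assign bin indices so that each bin holds (nearly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


# Hypothetical sample data.
ages = [5, 10, 11, 13, 15, 35, 50, 55, 72]
print(equal_width_labels(ages, 3))      # [0, 0, 0, 0, 0, 1, 2, 2, 2]
print(equal_frequency_labels(ages, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With skewed data, equal-width binning can leave most values in one bin, while equal-frequency binning keeps the bins balanced at the cost of uneven interval widths.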

2. Supervised discretization - uses the values of the class variable.

Using class boundaries. Three steps:

Sort values.
Place breakpoints between values belonging to different classes.
If too many intervals, merge intervals with equal or similar class distributions.
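The first two steps of the class-boundary method can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and sample data are hypothetical, each breakpoint is placed at the midpoint between two neighboring values, equal values with different classes get no breakpoint between them, and the third step (merging intervals) is not implemented.

```python
def class_boundary_breakpoints(values, labels):
    """Return midpoints between adjacent sorted values whose class labels differ."""
    # Step 1: sort the values, keeping each class label with its value.
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    breakpoints = []
    # Step 2: place a breakpoint wherever the class changes between neighbors.
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            breakpoints.append((v1 + v2) / 2)
    return breakpoints


# Hypothetical sample data.
temperatures = [64, 65, 68, 69, 70, 71]
play = ["yes", "no", "yes", "yes", "yes", "no"]
print(class_boundary_breakpoints(temperatures, play))  # [64.5, 66.5, 70.5]
```

On noisy data this produces many small intervals, which is why the merging step is needed in practice.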

3. Generating concept hierarchies:

recursively applying partitioning or discretization methods.