CHAPTER 3: DATA PRE-PROCESSING

Topics Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary3Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certainattributes of interest, or containing only aggregatedata e.g., occupation=“” noisy: containing errors or outliers e.g., Salary=“-10” inconsistent: containing discrepancies in codes ornames e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records4Why Is Data Dirty? Incomplete data comes from n/a data value when collected different consideration between the time when the data wascollected and when it is analyzed. human/hardware/software problems Noisy data comes from the process of data collection entry transmission Inconsistent data comes from Different data sources Functional dependency violation5Why Is Data Preprocessing Important? No quality data, no quality mining results! Quality decisions must be based on quality data e.g., duplicate or missing data may cause incorrect or evenmisleading statistics. Data warehouse needs consistent integration of qualitydata Data extraction, cleaning, and transformation comprisesthe majority of the work of building a data warehouse. —Bill Inmon6Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility Broad categories: intrinsic, contextual, representational, andaccessibility.7Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or removeoutliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the sameor similar analytical results Data discretization Part of data reduction but with particular importance, especially fornumerical data8Forms of data preprocessing9DATA CLEANING10Data Cleaning Importance “Data cleaning is one of the three biggest problemsin data warehousing”—Ralph Kimball “Data cleaning is the number one problem in datawarehousing”—DCI survey Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Resolve redundancy caused by data integration11Missing Data Data is not always available E.g., many tuples have no recorded value for severalattributes, such as customer income in sales data Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time ofentry not register history or changes of the data Missing data may need to be inferred.12How to Handle Missing Data? Ignore the tuple: usually done when class label is missing (assumingthe tasks in classification—not effective when the percentage ofmissing values per attribute varies considerably. Fill in the missing value manually: tedious + infeasible? Fill in it automatically with a global constant : e.g., “unknown”, a new class?! the attribute mean the attribute mean for all samples belonging to the same class:smarter the most probable value: inference-based such as Bayesian formulaor decision tree13Noisy Data Noise: random error or variance in a measured 
variable Incorrect attribute values may due to faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention Other data problems which requires data cleaning duplicate records incomplete data inconsistent data14How to Handle Noisy Data? Binning method: first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by binmedian, smooth by bin boundaries, etc. Clustering detect and remove outliers Combined computer and human inspection detect suspicious values and check by human (e.g.,deal with possible outliers) Regression smooth by fitting the data into regression functions15Simple Discretization Methods: Binning Equal-width (distance) partitioning: Divides the range into N intervals of equal size:uniform grid if A and B are the lowest and highest values of theattribute, the width of intervals will be: W = (B –A)/N. The most straightforward, but outliers may dominatepresentation Skewed data is not handled well. Equal-depth (frequency) partitioning: Divides the range into N intervals, each containingapproximately same number of samples Good data scaling Managing categorical attributes can be tricky.16Binning Methods for Data Smoothing
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
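The same smoothing operations can be written as a short plain-Python sketch; it reproduces the bins above (means rounded to the nearest integer, as in the example).

    # Equi-depth bins from the example above.
    bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

    # Smoothing by bin means: every value is replaced by its bin's mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
    # -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

    # Smoothing by bin boundaries: every value is replaced by the closer
    # of the bin's minimum and maximum value.
    by_bounds = [
        [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
        for b in bins
    ]
    # -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
    print(by_means, by_bounds)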
Cluster Analysis (figure: values grouped into clusters; points falling outside every cluster are candidate outliers)

Regression (figure: data fitted to the line y = x + 1; a value Y1 observed at X1 is smoothed to the value Y1' on the line)

DATA INTEGRATION AND TRANSFORMATION

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources; entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ; possible reasons include different representations and different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
- Redundant data occur often when integrating multiple databases: the same attribute may have different names in different databases, and one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detectable by correlation analysis
- Careful integration of data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range (min-max normalization, z-score normalization, normalization by decimal scaling)
- Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization
- Min-max normalization: v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
- Z-score normalization: v' = (v - mean_A) / stand_dev_A
- Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
- All three are sketched in code below.
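A minimal NumPy sketch of the three normalizations; the attribute values and the target range [0, 1] are hypothetical.

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute values

    # Min-max normalization to a new range [new_min, new_max].
    new_min, new_max = 0.0, 1.0
    minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # Z-score normalization (population standard deviation).
    zscore = (v - v.mean()) / v.std()

    # Decimal scaling: divide by 10^j for the smallest j with max(|v'|) < 1.
    j = 0
    while np.abs(v / 10 ** j).max() >= 1:
        j += 1
    decimal = v / 10 ** j

    print(minmax, zscore, decimal)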
DATA REDUCTION

Data Reduction Strategies
- A data warehouse may store terabytes of data, so complex data analysis or mining may take a very long time to run on the complete data set
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Strategies: data cube aggregation; dimensionality reduction (remove unimportant attributes); data compression; numerosity reduction (fit the data into models); discretization and concept hierarchy generation

Data Cube Aggregation
- The lowest level of a data cube holds the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes further reduce the size of the data to deal with
- Reference appropriate levels: use the smallest representation that is sufficient to solve the task
- Queries about aggregated information should be answered using the data cube when possible

Dimensionality Reduction
- Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the classes given those feature values is as close as possible to the original distribution given all features
- Fewer attributes also appear in the discovered patterns, which makes the patterns easier to understand
- Heuristic methods (because the number of attribute subsets is exponential): step-wise forward selection; step-wise backward elimination; a combination of forward selection and backward elimination; decision-tree induction

Example of Decision Tree Induction
- Initial attribute set: A1, A2, A3, A4, A5, A6
- (Figure: the induced tree tests A4 at the root, then A1 and A6, with Class 1 / Class 2 leaves)
- Reduced attribute set: A1, A4, A6

Data Compression
- String compression: extensive theory and well-tuned algorithms exist; typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression: typically lossy, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio: they are typically short and vary slowly with time

Data Compression (figure: lossless compression recovers the original data exactly; lossy compression recovers only an approximation)

Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method: the length L must be an integer power of 2 (pad with 0s when necessary); each transform step applies two functions, smoothing and difference, to pairs of data, producing two data sets of length L/2; the two functions are applied recursively until the desired length is reached
- Example wavelet families: Haar-2, Daubechies-4

DWT for Image Compression (figure: the image is repeatedly split into low-pass and high-pass components)

Principal Component Analysis
- Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only; used when the number of dimensions is large
- (Figure: axes X1 and X2 with the principal directions Y1 and Y2)
- A small sketch of the computation follows.
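One common way to compute principal components is via the SVD of the centered data; this is a sketch under that assumption (random data, c = 2), not the chapter's prescribed procedure.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))      # hypothetical data: N = 100 vectors, k = 5 dimensions
    c = 2                              # keep c <= k principal components

    Xc = X - X.mean(axis=0)            # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:c]                # c orthogonal principal directions
    scores = Xc @ components.T         # each vector re-expressed in c dimensions

    # Approximate reconstruction from the reduced representation.
    X_approx = scores @ components + X.mean(axis=0)
    print(scores.shape, X_approx.shape)   # (100, 2) (100, 5)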
Numerosity Reduction
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers); e.g., log-linear models obtain the value at a point in m-dimensional space as a product over appropriate marginal subspaces
- Non-parametric methods: do not assume a model; the major families are histograms, clustering, and sampling

Regression and Log-Linear Models
- Linear regression: the data are modeled to fit a straight line, Y = α + βX; the two parameters α and β specify the line and are estimated from the data, typically by applying the least-squares criterion to the known values Y1, Y2, ..., X1, X2, ...
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector, e.g., Y = b0 + b1·X1 + b2·X2; many nonlinear functions can be transformed into this form
- Log-linear models: approximate discrete multidimensional probability distributions; the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) ≈ α_ab · β_ac · χ_ad · δ_bcd

Histograms
- A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
- (Figure: an equi-width histogram over values 10,000-100,000 with bucket counts between 0 and 40)

Clustering
- Partition the data set into clusters and store only the cluster representations
- Can be very effective if the data is clustered, but not if it is "smeared"
- Hierarchical clustering can be used, with the clusters stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, detailed further in Chapter 8

Sampling
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods, e.g., stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; useful with skewed data
- Sampling may not reduce database I/Os (data is read a page at a time)
- A brief sketch of simple random and stratified sampling follows the figure notes below.

Sampling (figures: SRSWOR, a simple random sample without replacement, and SRSWR, with replacement, drawn from the raw data; a cluster/stratified sample of the raw data)
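A minimal pandas sketch of the sampling schemes; the skewed two-class data set and the sample sizes are hypothetical.

    import pandas as pd

    df = pd.DataFrame({
        "cls":   ["A"] * 90 + ["B"] * 10,   # hypothetical skewed data
        "value": range(100),
    })

    # SRSWOR / SRSWR: simple random sample without / with replacement.
    srswor = df.sample(n=20, replace=False, random_state=1)
    srswr  = df.sample(n=20, replace=True,  random_state=1)

    # Stratified sampling: preserve the class proportions of the full
    # data set by sampling the same fraction from every stratum.
    stratified = df.groupby("cls", group_keys=False).apply(
        lambda g: g.sample(frac=0.2, random_state=1)
    )
    print(stratified["cls"].value_counts())   # 18 rows of A, 2 rows of B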
Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation: an index tree hierarchically divides a data set into partitions by the value range of some attributes; each partition can be considered a bucket, so an index tree with aggregates stored at each node is a hierarchical histogram

DISCRETIZATION AND CONCEPT HIERARCHY GENERATION

Discretization
- Three types of attributes: nominal (values from an unordered set), ordinal (values from an ordered set), continuous (real numbers)
- Discretization: divide the range of a continuous attribute into intervals
- Useful because some classification algorithms only accept categorical attributes, it reduces data size, and it prepares the data for further analysis

Discretization and Concept Hierarchy
- Discretization reduces the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see the sections above)
- Histogram analysis (see the sections above)
- Clustering analysis (see the sections above)
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S, T) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is applied recursively to the resulting partitions until some stopping criterion is met, e.g., Ent(S) - E(T, S) < δ
- Experiments show that it may reduce data size and improve classification accuracy
- A small sketch of one binary split follows.
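A plain-Python sketch of choosing one boundary T that minimizes E(S, T); the attribute values, class labels, and midpoint candidates are assumptions made for illustration.

    import math
    from collections import Counter

    def ent(labels):
        # Entropy of a list of class labels.
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split(values, labels):
        # Boundary T minimizing E(S, T) for one numeric attribute.
        pairs = sorted(zip(values, labels))
        best_t, best_e = None, float("inf")
        for i in range(1, len(pairs)):
            t = (pairs[i - 1][0] + pairs[i][0]) / 2     # candidate boundary
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            if not left or not right:
                continue
            e = (len(left) / len(pairs)) * ent(left) + (len(right) / len(pairs)) * ent(right)
            if e < best_e:
                best_t, best_e = t, e
        return best_t, best_e

    # Hypothetical samples: attribute values with class labels.
    vals = [1, 2, 3, 10, 11, 12]
    labs = ["no", "no", "no", "yes", "yes", "yes"]
    print(best_split(vals, labs))   # boundary 6.5, entropy 0.0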
Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
- A rough sketch of a single partitioning step follows.
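A rough sketch of one level of the 3-4-5 rule, assuming the interval is first rounded outward to the most-significant-digit grid; the example range is hypothetical and the recursive refinement of each sub-interval is omitted.

    import math

    def three_four_five(low, high):
        # Split [low, high] into 3, 4, or 5 equi-width intervals based on
        # the number of distinct most-significant-digit values covered.
        msd = 10 ** int(math.floor(math.log10(high - low)))
        lo = math.floor(low / msd) * msd
        hi = math.ceil(high / msd) * msd
        distinct = round((hi - lo) / msd)
        if distinct in (3, 6, 7, 9):
            parts = 3
        elif distinct in (2, 4, 8):
            parts = 4
        else:                              # 1, 5, or 10 distinct values
            parts = 5
        width = (hi - lo) / parts
        return [(lo + i * width, lo + (i + 1) * width) for i in range(parts)]

    print(three_four_five(-351, 4700))
    # -> [(-1000.0, 1000.0), (1000.0, 3000.0), (3000.0, 5000.0)]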