Topics
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=""
  - noisy: containing errors or outliers
    - e.g., Salary="-10"
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age="42", Birthday="03/07/1997"
    - e.g., was rating "1, 2, 3", now rating "A, B, C"
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data comes from
  - "n/a" data values when collected
  - different considerations between the time the data was collected and the time it is analyzed
  - human/hardware/software problems
- Noisy data comes from the process of data
  - collection
  - entry
  - transmission
- Inconsistent data comes from
  - different data sources
  - functional dependency violations

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- "Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse." — Bill Inmon

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, but of particular importance, especially for numerical data

Forms of data preprocessing
[Figure: overview of the forms of data preprocessing]

DATA CLEANING

Data Cleaning
- Importance
  - "Data cleaning is one of the three biggest problems in data warehousing" — Ralph Kimball
  - "Data cleaning is the number one problem in data warehousing" — DCI survey
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data
- Data is not always available
  - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - data inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - history or changes of the data not registered
- Missing data may need to be inferred

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with
  - a global constant: e.g., "unknown", a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree
  (a short code sketch of these fill-in strategies follows below)
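A minimal sketch of the automatic fill-in strategies listed above, assuming a small pandas DataFrame with a hypothetical "income" attribute and a "class" label column:

```python
import pandas as pd

# Hypothetical toy data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "income": [30000, None, 52000, None, 47000, 61000],
    "class":  ["low", "low", "high", "high", "high", "high"],
})

# Global constant: replace missing values with a sentinel (here -1, standing in for "unknown").
global_const = df["income"].fillna(-1)

# Attribute mean: replace missing values with the overall mean of the attribute.
overall_mean = df["income"].fillna(df["income"].mean())

# Per-class mean: replace missing values with the mean of samples in the same class.
per_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(per_class_mean)
```

The inference-based option (most probable value) would instead train a classifier or regressor on the tuples where the attribute is present and predict it for the tuples where it is missing.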
Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin medians, by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)
- Regression
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
  (a code sketch contrasting the two schemes follows below)
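A minimal sketch of the two partitioning schemes with numpy, using the price values from the worked example that follows:

```python
import numpy as np

# Price values from the worked example below.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins (an illustrative choice)

# Equal-width partitioning: N intervals of width W = (B - A) / N.
A, B = prices.min(), prices.max()
edges = np.linspace(A, B, N + 1)
width_bin_index = np.digitize(prices, edges[1:-1])  # bin index per value

# Equal-depth (frequency) partitioning: N bins with roughly equal counts.
depth_bins = np.array_split(np.sort(prices), N)

print("equal-width bin index per value:", width_bin_index)
print("equal-depth bins:", [list(b) for b in depth_bins])
```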
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smoothing by bin means:
Bin 1: 9, 9, 9, 9
Bin 2: 23, 23, 23, 23
Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
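A minimal sketch, continuing the example above, of smoothing by bin means and by bin boundaries (assuming the equi-depth bins of four values each shown above):

```python
bins = [
    [4, 8, 9, 15],
    [21, 21, 24, 25],
    [26, 28, 29, 34],
]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of the
# bin's minimum and maximum value.
by_boundaries = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]

print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```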
Cluster Analysis
[Figure: values grouped into clusters; points far from any cluster are potential outliers]

Regression
[Figure: data points (x, y) with a fitted regression line y = x + 1]

DATA INTEGRATION AND TRANSFORMATION

Data Integration
- Data integration:
  - combines data from multiple sources into a coherent store
- Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
- Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Normalization
- min-max normalization:
  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
- z-score normalization:
  v' = (v − mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
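A minimal sketch of the three normalization formulas above, assuming a one-dimensional numpy array of attribute values:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute values

# Min-max normalization to a new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer making all |v'| < 1.
j = 0
while np.abs(v).max() / 10 ** j >= 1:
    j += 1
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```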
DATA REDUCTION

Data Reduction Strategies
- A data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction — remove unimportant attributes
  - Data compression
  - Numerosity reduction — fit data into models
  - Discretization and concept hierarchy generation

Data Cube Aggregation
- The lowest level of a data cube
  - the aggregated data for an individual entity of interest
  - e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to deal with
- Reference appropriate levels
  - Use the smallest representation that is sufficient to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

Dimensionality Reduction
- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes in the discovered patterns, making them easier to understand
- Heuristic methods (due to the exponential number of choices):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction

Example of Decision Tree Induction
- Initial attribute set: A1, A2, A3, A4, A5, A6
[Figure: decision tree splitting on A4, then A1 and A6, with leaves Class 1 / Class 2]
- Reduced attribute set: A1, A4, A6

Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio
  - Typically short and varying slowly with time

Data Compression
[Figure: original data mapped to compressed data; lossless compression recovers the original data exactly, lossy compression recovers only an approximation]

Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method (e.g., Haar-2, Daubechies-4):
  - The length, L, must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - Applies to pairs of data, resulting in two sets of data of length L/2
  - Applies the two functions recursively until the desired length is reached

DWT for Image Compression
[Figure: an image passed recursively through low-pass and high-pass filters]

Principal Component Analysis
- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  - The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal-component vectors
- Works for numeric data only
- Used when the number of dimensions is large

Principal Component Analysis
[Figure: data in the X1–X2 plane with principal components Y1 and Y2]

Numerosity Reduction
- Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as the product over appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = α + β X
  - The two parameters, α and β, specify the line and are estimated from the data at hand
  - Apply the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd, where each factor depends only on the indicated subset of attributes

Histograms
- A popular data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
[Figure: example equi-width histogram with bucket counts between 0 and 40 over the value range 10,000–100,000]
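A minimal sketch of histogram-based reduction: the raw values are dropped and only one (average, count) pair per equal-width bucket is stored. The value range and bucket count are illustrative assumptions.

```python
import numpy as np

# Hypothetical raw values in the range 10,000-100,000.
values = np.random.default_rng(0).integers(10_000, 100_000, size=1_000)
n_buckets = 9  # an illustrative choice

# Equal-width buckets over the value range.
edges = np.linspace(values.min(), values.max(), n_buckets + 1)
bucket_of = np.clip(np.digitize(values, edges) - 1, 0, n_buckets - 1)

# Store only one (average, count) pair per bucket instead of the raw values.
reduced = [
    (float(values[bucket_of == b].mean()), int((bucket_of == b).sum()))
    for b in range(n_buckets)
]
print(reduced)
```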
Clustering
- Partition the data set into clusters, and store only the cluster representation
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

Sampling
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (one page at a time)

Sampling
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]

Sampling
[Figure: raw data versus a cluster/stratified sample]

Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by the value range of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram

DISCRETIZATION AND CONCEPT HIERARCHY GENERATION

Discretization
- Three types of attributes:
  - Nominal — values from an unordered set
  - Ordinal — values from an ordered set
  - Continuous — real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and Concept Hierarchy
- Discretization
  - Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then replace actual data values
- Concept hierarchies
  - Reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see the sections before)
- Histogram analysis (see the sections before)
- Clustering analysis (see the sections before)
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., the information gain Ent(S) − E(T, S) falls below a threshold
- Experiments show that it may reduce data size and improve classification accuracy
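A minimal sketch of one step of entropy-based discretization: candidate boundaries are the midpoints between consecutive distinct values, and the boundary minimizing E(S, T) is returned. The attribute name and toy data are assumptions for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Return the boundary T that minimizes E(S, T) over candidate boundaries."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_t, best_e = None, float("inf")
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # only split between distinct values
        t = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

# Hypothetical toy attribute (e.g., age) with class labels.
age = np.array([23, 25, 30, 41, 45, 52, 60, 64])
cls = np.array(["n", "n", "n", "y", "y", "y", "y", "n"])
t, e = best_split(age, cls)
print("boundary:", t, "expected entropy:", round(e, 3))
# Recurse on {x < t} and {x >= t} while Ent(S) - E(T, S) stays above a threshold.
```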
Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  - If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
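A minimal sketch of the 3-4-5 rule applied to a single interval, assuming the interval endpoints are already rounded to the most significant digit; the function name and the msd_width parameter are hypothetical, introduced only for illustration.

```python
def three_four_five(low, high, msd_width):
    """Partition [low, high] per the 3-4-5 rule.

    msd_width is the unit of the most significant digit, e.g. 1000 for a
    range such as 2000..5000 (an assumption of this sketch).
    """
    distinct = round((high - low) / msd_width)
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    elif distinct in (1, 5, 10):
        parts = 5
    else:
        parts = 3  # fall back to an equi-width split
    step = (high - low) / parts
    return [(low + i * step, low + (i + 1) * step) for i in range(parts)]

print(three_four_five(2000, 5000, 1000))  # 3 intervals of width 1000
```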