DATA MINING: DEFERRED/REFERRED ASSIGNMENT

Section 1: INTRODUCTION

1.0 LEARNING OUTCOMES
This assignment will assess the following learning outcomes:
1) Correctly apply and interpret a variety of statistical and data mining techniques using appropriate data mining software.
2) Effectively communicate your results.

1.1 DEADLINE: 26 August 2016
To be submitted TWICE electronically: once as a “SHU Assignment” AND again via Turnitin.

1.2 ASSESSMENT FOR THIS MODULE
This module will be assessed via a case study. This will involve the analysis of a dataset that is described below.

1.3 PROBLEM OUTLINE
For this assignment you are required to analyse a data set taken from the data mining competition held prior to the third international conference of Principles and Practices of Knowledge Discovery in Databases (PKDD). This conference was held in Prague in 1999‡. One of the challenges given for the competition was a set of datasets concerning financial transactions and details for customers at a Czech bank. Several of these files have been combined to give 4500 observations of various financial and personal details relating to different accounts. Details of the variables included are given in the appendix.

1.4 DATASETS
The data you will need to analyse consists of four datasets: czechr.sas7bdat, czechin.sas7bdat, district.sas7bdat and loans2.sas7bdat. A full list of the variables in each of these is given in the appendix.

‡ http://lisp.vse.cz/pkdd99/

1.4.1 CZECH DATASET
The data has a few missing values. The full dataset consists of 4500 observations, of which 682 (15.16%) have a loan (LOAN=yes). These data are in the czechr.sas7bdat dataset and should be used in 2.1 for questions 1), 2) and 3).
A full list is given in the Appendix.

1.4.2 CZECHIN DATASET
This is a copy of the Czech dataset but includes dummy variables for the missing values. These additional variables are given in the Appendix. This dataset will be formed as part of the Enterprise Miner analysis in question 3) below and should be used in question 4). Questions 6)-8) use either this dataset or the Czech dataset following on in the Enterprise Miner diagram.

1.4.3 DISTRICT DATASET
A number of the variables in the first two datasets relate to the demographic background of the district, one of 77, that the account holder comes from.
• In addition we also have the name of each district, distname.
• In the original dataset the variables unemploy95 and crime95 each contained one missing value. These two values have been imputed using a regression model.
• In addition, the two variables crime95 and crime96 will vary with the size of the district as measured by the number of inhabitants. The other variables of this type (unemploy95, unemploy96 and enter) have been converted to a rate. It seems sensible to also do this for the crime variables. Consequently two new variables have been created and used in place of crime95 and crime96:
crime95r = crime95/no_inhab*100
crime96r = crime96/no_inhab*100

This results in the final set of variables for the district dataset. A full list of these is given in the appendix. Given that these data concern the Czech Republic, the appendix also gives useful starting points for finding information about the Czech Republic. This dataset is used in 2.2 and 2.3 and questions 9), 10), 11), 12) and 13).

1.4.4 LOANS2 DATASET
The data set to be analysed contains all the variables listed in czechr.sas7bdat plus additional variables that are transformations of these, details of which are given in the appendix. The target is the variable that records whether the account holder has a loan with the bank or not.
LOAN = yes if account holder has a loan (with this bank)
LOAN = no if not
This dataset is used in 2.4 in questions 14), 15), 16), 17), 18), 19) and 20).

1.4.5 AVAILABILITY OF DATASETS
All datasets will be available on the SAS 9.4 Server in the following path:
E:SHUUsers!SharedDataTeresaRefer BD Assign

1.4.5 – i SAS ENTERPRISE GUIDE
In Enterprise Guide a library is needed to point to the path above. The first line of code in ass2r.sas and ass3r.sas creates this library. Ensure you run this line of code before running code that references this library.

1.4.5 – ii ENTERPRISE MINER
On the SAS Enterprise Miner server the data is in the path above. You will need to create a library to access the data.
To create a library in SAS Enterprise Miner follow the instructions given in tutorial 2 on page 81 of the first booklet. However, ensure that you point to the path above.

1.4.5 – iii SAS/IML STUDIO
Follow the instructions at the start of tutorial 2 on page 78 of the first booklet to launch SAS/IML Studio, but navigate to the path above and open the required dataset.

1.5 LOADING THE SAS MACROS AND SAS PROGRAM
To help you with this assignment, programs containing some of the code you will need have been produced. These require the use of two macros which must also be downloaded. One macro, label.sas, has been written by Michael Friendly§. The link to this is available on Blackboard. The other three files are available on Blackboard as direct downloads.
The files you need to download are:
• ass2r.sas
• ass3r.sas
and the two macros
• label.sas (via the link to Michael Friendly’s website) and
• pcait.sas.
Load each of these into SAS Enterprise Guide. Run the macros label.sas and pcait.sas. When you run them nothing will appear to have happened, but this will load them into SAS and then compile them.
This will enable them to be called from the remaining programs.

§ http://www.math.yorku.ca/SCS/sasmac/label.html Michael Friendly, Psychology Department, York University, Toronto, Canada

Section 2: ANALYSIS REQUIRED

2.0 INTRODUCTION
There are four parts to this assignment, in line with the original four topics for this module. You need to attempt all parts and obtain a mark of 40% (i.e. 120 marks out of the possible 300) to pass the module. The final mark will be converted to a percentage.

2.1 PART 1: EXPLORING THE DATA
You are required to analyse the czechr.sas7bdat data in SAS Enterprise Guide, SAS/IML Studio and Enterprise Miner as follows:

1) Using either SAS/IML Studio or Enterprise Miner, produce suitable graphs of the data that will aid your understanding of the variables and also their relationship to the target variable LOAN. Fully discuss your results in the report (see details below). You should discuss your plots and any output fully. Which variables are symmetrical? Which ones are likely to be good predictors of the target variable LOAN? What other features can you discover about these data? (11 marks)

2) Load the data into Enterprise Miner. Ensure that all the variables have the correct role and level. Produce suitable summary measures (means, standard deviation, skewness etc.) of the data and fully interpret your results. For example, what is the most frequent type of credit? Which is the most consistent? What about withdrawals? How does unemployment change? Are there any other interesting features? Discuss the missing values that exist in these data. (8 marks)

Figure 2.1

3) Create an Enterprise Miner stream by adding the Czechr data to a new diagram.
a) Now add an impute node and join this to the Czechr node. Leave the default method for input interval variables set to “mean”. Similarly leave the default method for class variables as “count”. Under Score set “Hide Original Variables” to “No”.
Also under Score and Indicator Variables set “Type” to “unique” and “Role” to “input”. The final diagram will look like this (see below for further nodes to be added):

Figure 2.2

Run the impute node and examine the results.
b) Add a StatExplore node. Right click and select Edit Variables… From here select the new variables created by running the impute node and the original ones that had missing values. Click Explore. Examine the table of values, and make sure you understand what the variables that have been created represent. Include a small section of this table in your report. (2 marks)
c) In your own words explain what the impute node has done. Do you think the particular choice of settings used was sensible? Should any settings be changed to impute any of the categorical variables? Fully justify your answers. (4 marks)

4) A copy of the resulting dataset czechin.sas7bdat that includes these dummy variables and imputed variables is available. Using either this dataset and SAS/IML Studio, or Enterprise Miner and appropriate nodes, produce suitable plots to see if the indicator variables generated from the replacement node are likely to be good predictors of LOAN. If you use Enterprise Miner, join subsequent nodes to the impute node, so that you are using a dataset with the new variables. Join a GraphExplore node to this to obtain a suitable plot against interval variables (see b) below). Join a StatExplore node, run it and, inside the results window, select View > Plots > Class Variables: Loan for categorical variables against Loan.
a) Comment on the type of missingness that these plots seem to suggest. Justify your answer. (3 marks)
b) Also plot these indicator variables against one other interval variable and one other categorical variable. Again comment on the type of missingness this seems to suggest.
(4 marks)
c) What other plots would be needed to fully investigate the type of missingness? (3 marks)

5) The data have been analysed using a macro to calculate Little’s test for the interval variables. This yields the output given below:

The MI Procedure

Model Information
Data Set | WORK.ONE
Method | MCMC
Multiple Imputation Chain | Single Chain
Initial Estimates for MCMC | EM Posterior Mode
Start | Starting Value
Prior | Jeffreys
Number of Imputations | 0
Number of Burn-in Iterations | 200
Number of Iterations | 100
Seed for random number generator | 146279001

Missing Data Patterns (X = observed, . = missing)
Group | creditn | creditt | withdrn | withdrt | cashcrn | cashcrt | cashwdn | cashwdt | bankcolt | bankcoln | bankrn | bankrt | othcrn | othcrt | days | no_inhab | mu_low
1 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X
2 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X
3 | X | X | X | X | X | . | X | X | X | X | X | X | X | X | X | X | X
4 | X | X | X | X | X | . | X | X | X | X | X | X | X | X | X | X | X
5 | X | X | X | X | . | X | X | X | X | X | X | X | X | X | X | X | X
6 | X | X | X | X | . | X | X | X | X | X | X | X | X | X | X | X | X
7 | X | X | X | X | . | . | X | X | X | X | X | X | X | X | X | X | X
8 | X | X | X | X | . | . | X | X | X | X | X | X | X | X | X | X | X

Missing Data Patterns (continued; the last two columns are group means)
Group | mu_lmid | mu_umid | mu_high | cities | urbanr | ave_sal | unemploy95 | unemploy96 | enter | crime95 | crime96 | Freq | Percent | creditn | creditt
1 | X | X | X | X | X | X | X | X | X | X | X | 2685 | 59.67 | 90.067412 | 761230
2 | X | X | X | X | X | X | . | X | X | . | X | 28 | 0.62 | 93.142857 | 782562
3 | X | X | X | X | X | X | X | X | X | X | X | 405 | 9.00 | 91.143210 | 625108
4 | X | X | X | X | X | X | . | X | X | . | X | 3 | 0.07 | 61.666667 | 262904
5 | X | X | X | X | X | X | X | X | X | X | X | 587 | 13.04 | 89.621806 | 661191
6 | X | X | X | X | X | X | . | X | X | . | X | 4 | 0.09 | 68.000000 | 523226
7 | X | X | X | X | X | X | X | X | X | X | X | 775 | 17.22 | 89.856774 | 652148
8 | X | X | X | X | X | X | . | X | X | . | X | 13 | 0.29 | 79.000000 | 928778

SAS Output 2.1: part 1

Missing Data Patterns: Group Means
Group | withdrn | withdrt | cashcrn | cashcrt | cashwdn | cashwdt | bankcolt | bankcoln | bankrn | bankrt | othcrn
1 | 146.067412 | 716722 | 35.265549 | 559981 | 97.420484 | 555482 | 195041 | 14.145996 | 46.931099 | 157333 | 40.655866
2 | 145.392857 | 736734 | 45.285714 | 603856 | 104.000000 | 582953 | 171939 | 5.750000 | 39.107143 | 148046 | 42.107143
3 | 143.293827 | 582119 | 33.429630 | . | 96.627160 | 451503 | 136283 | 16.207407 | 45.071605 | 127215 | 41.506173
4 | 108.666667 | 235966 | 24.666667 | . | 53.333333 | 110597 | 49539 | 12.666667 | 52.333333 | 118903 | 24.333333
5 | 142.826235 | 618393 | . | 506809 | 96.148211 | 475222 | 148308 | 14.652470 | 44.691652 | 138498 | 40.502555
6 | 95.000000 | 482927 | . | 517894 | 66.750000 | 361671 | 0 | 0 | 28.250000 | 121256 | 34.500000
7 | 143.029677 | 609421 | . | . | 94.596129 | 462042 | 138299 | 15.196129 | 46.441290 | 143018 | 40.696774
8 | 119.076923 | 892208 | . | . | 86.153846 | 772858 | 261501 | 7.923077 | 32.923077 | 119351 | 33.692308

Missing Data Patterns: Group Means (continued)
Group | othcrt | days | no_inhab | mu_low | mu_lmid | mu_umid | mu_high | cities | urbanr | ave_sal | unemploy95
1 | 6208.208268 | 1223.985847 | 271465 | 39.476350 | 21.032030 | 5.572439 | 1.740037 | 5.542272 | 69.651359 | 9537.816015 | 2.870011
2 | 6768.300000 | 1242.678571 | 42821 | 4.000000 | 13.000000 | 5.000000 | 1.000000 | 3.000000 | 48.400000 | 8173.000000 | .
3 | 5956.532840 | 1266.148148 | 294505 | 37.916049 | 19.518519 | 5.279012 | 1.676543 | 5.429630 | 70.872099 | 9622.735802 | 2.921901
4 | 3368.333333 | 945.000000 | 42821 | 4.000000 | 13.000000 | 5.000000 | 1.000000 | 3.000000 | 48.400000 | 8173.000000 | .
5 | 6073.293356 | 1259.357751 | 274538 | 40.683135 | 22.018739 | 5.563884 | 1.758092 | 5.579216 | 69.087394 | 9527.419080 | 2.907445
6 | 5332.150000 | 1190.750000 | 42821 | 4.000000 | 13.000000 | 5.000000 | 1.000000 | 3.000000 | 48.400000 | 8173.000000 | .
7 | 5850.351355 | 1261.165161 | 258358 | 41.938065 | 20.983226 | 5.569032 | 1.703226 | 5.680000 | 68.363871 | 9454.548387 | 2.839419
8 | 5327.923077 | 1097.230769 | 42821 | 4.000000 | 13.000000 | 5.000000 | 1.000000 | 3.000000 | 48.400000 | 8173.000000 | .

Missing Data Patterns: Group Means (continued)
Group | unemploy96 | enter | crime95 | crime96
1 | 3.458276 | 121.120298 | 14702 | 16376
2 | 7.010000 | 124.000000 | . | 1358.000000
3 | 3.481654 | 121.308642 | 16516 | 18470
4 | 7.010000 | 124.000000 | . | 1358.000000
5 | 3.478296 | 121.407155 | 14991 | 16753
6 | 7.010000 | 124.000000 | . | 1358.000000
7 | 3.434142 | 120.598710 | 13903 | 15529
8 | 7.010000 | 124.000000 | . | 1358.000000

SAS Output 2.1: part 2

Number of Observed Variables = 28
Number of Missing Data Patterns = 8

Summary of Missing Data Patterns (0 = Missing, 1 = Observed)
Frequency | Pattern | d2j
13 | 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 | 251.5981
775 | 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 37.74116
4 | 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 | 87.35500
587 | 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 40.33753
3 | 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 | 61.13851
405 | 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 39.07896
28 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 | 523.3387
2685 | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 | 46.99653

Sum of the Number of Observed Variables Across Patterns (Sigma psubj) = 208

Little’s (1988) Chi-Square Test of MCAR
Chi-Square (d2) = 1087.584
df (Sigma psubj – p) = 180
p-value = 0.000
SAS Output 2.1: part 3

a) Explain how many missing value patterns exist and what each of these represents. Is there any evidence in the printed means as to what type of missingness exists? (4 marks)
b) Interpret the result of Little’s test. What type of missingness exists? (3 marks)

6) Some of the techniques that we will try later require that the variables be symmetrical. Add a transformation node and find suitable transformations of the interval variables that could meet this requirement. (To enable comparisons to be made it is best under Score to set both “Hide” and “Reject” to “No” so that the original variables are still kept.)
a) Explain what transformations have been selected. (2 marks)
b) In some cases the transformation adds 1 first; explain why this is necessary. (2 marks)

7) Produce suitable plots to check if the transformations appear to have worked. Fully discuss your results. (6 marks)

8) Using your discussion above, summarize your findings from questions 1)-7). For example, indicate how each variable should be treated in the subsequent analysis and whether it is likely to be a good or poor predictor. (4 marks)

2.2 PART 2: PRINCIPAL COMPONENTS
You are required to analyse the district data in SAS Enterprise Guide, SAS/IML Studio and Enterprise Miner as described below. First, launch Enterprise Guide and load and run the macros label.sas and pcait.sas. When you run them nothing will appear to have happened, but this will load them into SAS and then compile them. This will enable them to be called from the remaining programs.

9) The missing value for unemploy95 was imputed using a regression model of unemploy95 on unemploy96. The model was fitted, and then the predicted value was used to replace the missing value.
a) Use Enterprise Guide and load and compile the macros label.sas and pcait.sas. Load the ass2r.sas program. Add some code at the start of this to produce a scatter plot of unemploy95 vs.
unemploy96 and comment on the relationship between them. (2 marks)
b) Run the first section of code in the ass2r.sas program (BLOCK 1) to fit a simple linear regression of unemploy95 on unemploy96. Fully interpret the output. (6 marks)
c) Hence or otherwise comment on the suitability of this method to impute the missing values. (2 marks)
d) A similar process was used to impute the missing value for crime95. Do you think this is reasonable? Justify your answer. (1 mark)

10) Carry out a principal component analysis of the data. You may use SAS/IML Studio, SAS Enterprise Miner or the code provided in Enterprise Guide to do this. Examples of the code can be found in the second block of ass2r.sas. Please ensure that the library that you have created that contains the district dataset is correctly referenced in the code.
a) Examine the eigenvalues. Also produce a scree plot (if using the code this is done for you). How many components would you recommend? Fully justify your answer. (7 marks)
b) Interpret the component loadings for the first three components. What “name” might you give each? (9 marks)
c) Produce a variety of scatter plots of one component against another. Include labels for the districts on the edge of the scatter plot and the component loadings. The macros given to you will achieve this, and example code can be found in the second block of ass2r.sas. You should produce plots that correspond to the number of components you would keep. Note that the component loadings are superimposed on the scatter plot rather than shown as a separate plot.
i) Using your plots, explain how the component loadings (in blue) are represented on the plot. Illustrate this for two variables. How does this representation help in understanding the components? How does this representation help in understanding which variables are similar? Use examples from your plot where necessary. (8 marks)
ii) Examine the scattered points showing each district.
What do you notice? Are there any unusual points? Identify these. Illustrate what features these district(s) have that might make them unusual. Use your interpretation of the components in b) above and the component loadings shown on the plot to achieve this. (5 marks)
d) Given all of your results for question 10) parts a) through to c), discuss how well principal component analysis works for this particular dataset. What interesting features have you discovered about the data? How might we utilise the results in subsequent analysis? (5 marks)

2.3 PART 3: CLUSTER ANALYSIS
You are required to analyse this data in SAS Enterprise Guide, SAS/IML Studio and Enterprise Miner as follows:

11) You are required to carry out variable clustering. First the data needs to be standardised. To do this run block one in the ass3r.sas program, which will also add the principal components to the data. Now run block two, which will perform the variable clustering.
a) Examine the dendrogram. How many clusters would you pick? Fully justify your answer. (5 marks)
b) Examine the printed output. Explain how many clusters the algorithm picks as its final solution. What criterion has the algorithm used to do this? (4 marks)
c) Fully discuss the printed output for the final solution. How well does this clustering seem to have worked? What variables are in the same cluster? Compare your answers to the results of principal component analysis from the solution to assignment 2. Are there any similarities? Does this surprise you? (12 marks)

12) The next stage is to carry out observational clustering. Run block 3 of the code to perform Ward’s method of clustering.
a) Examine the printed output. How many clusters would you pick? Give reasons for your answer utilising the information given in the printed output. (5 marks)
b) Examine the graphical output, the dendrogram and the plot of the Cubic Clustering Criterion. How many clusters would each of these plots suggest?
Fully explain your reasoning. (10 marks)
c) Using all the results from 12) a) and b) above, explain how many clusters you would pick. Justify your answer. (4 marks)
d) Edit the code in block 3 to carry out clustering for one other type of hierarchical clustering. Fully interpret both the printed output and the graphs produced, again explaining how many clusters you would pick. Are the results from your chosen method any clearer than for Ward’s method? Why might this be the case? (12 marks)

13) The next stage is to use the k-means algorithm to find six clusters for the observations. Run block 4 and examine the output.
a) Is it possible to produce a dendrogram for this method of clustering? Explain why. (4 marks)
b) Fully interpret the output such as the R2 values, the R2 ratio, cluster centroids etc. Using this evidence explain how well this clustering seems to have worked. Is there any evidence that the clusters overlap? (14 marks)
c) Various plots can be produced to validate the cluster solution. These are produced in blocks 5-8. Run each of the blocks and use a suitable selection from these to discuss the properties of the cluster containing Prague and two other clusters. Profile these three clusters and explain their features. Compare these profiles to the results from the solution to part 2 and to where each cluster appears on the plot of the principal components. (Please note that in Enterprise Guide some plots do not display correctly if you do not use the correct graphics device. Goptions statements have been inserted in the code; please make sure you run these when running the code.) (14 marks)

2.4 PART 4: PREDICTIVE MODELLING
In the first parts of this assignment you will have already explored the data and discussed its features. In this section you will attempt to produce a predictive model for the variable LOAN. In earlier work it could be suggested that various transformations of the variables could be tried.
These have been included in the dataset loans2, but you are free to create others that you may think appropriate.
You should read the loans2.sas7bdat data into SAS Enterprise Miner, checking carefully that each variable is correctly specified (see appendix). In addition you should partition the data into training, validation and testing data using a data partition node (found under Sample).

14) Use suitable software to fit two logistic regression models to LOAN. In both cases you may wish to use suitable methodology to reduce the set of variables used in the model (e.g. in one case you might use backward elimination and in the other forward selection). You will only need to discuss one of these models in detail once you have compared them in question 17) below. Ensure that your settings do produce two different models. Explain and justify the settings used. ([fitting/settings] 6 marks)

15) Use suitable software to fit two decision trees. These should be distinguished by the algorithms and settings you use, which should be appropriate for these data, and you should justify your choices. Again the two decision trees will be compared in question 17) below, after which you will need to discuss only one of these trees in detail. Explain and justify the settings used. ([fitting/settings] 6 marks)

16) Build at least one other model. You will need to experiment with Enterprise Miner to do this and possibly read the help available. Suggestions include Neural Network or Memory Based Reasoning. Briefly explore the software and explain one setting unique to your chosen method. ([choice/fitting/settings] 4 marks)

17)
a) Compare the logistic regression models, the decision tree models and the third model by producing appropriate summary measures of their goodness of fit. Explain which logistic regression model and which decision tree you would prefer. Which model would you prefer overall? (12 marks)
b) Compare all five models by producing suitable %Cumulative response, %Cumulative captured response and ROC charts.
For each type of chart illustrate what it is showing by interpreting one particular point, and discuss how well the models are performing. Using these charts, explain which logistic regression model and which decision tree you would prefer. Which model would you prefer overall? (12 marks)
c) Given your answers to 17) a) and 17) b) above, which model would you pick? Justify your answer. (3 marks)

18) In each case fully discuss the results for these preferred models:
a) For the chosen regression model, discuss the parameter estimates, odds ratios, summary statistics, statistical tests etc. (12 marks)
b) For the best decision tree, discuss the number of leaves, structure, variables, plot of misclassification against leaves, goodness of fit, etc. (12 marks)

19) For the selected logistic regression model and the selected decision tree (chosen models in question 17) c) above), illustrate how the predicted probability of taking a loan for the first observation in the data is calculated. (16 marks)

20) Use all of your results from questions 14) to 19) to select one particular method that you would recommend to predict LOAN. You should bear in mind practical as well as statistical considerations. Fully discuss your choice and any other final conclusions you have resulting from all of your analyses 14) to 19) above. Suggest any further work that may be useful to build an improved model in the future. (15 marks)

Total Marks 280

2.5 REPORT
You should write all your findings in a technical report which should include any relevant output either in the main body of the report or in a suitable appendix (as appropriate). Marks will be awarded for good English, report format, layout, use of Figures, balance between appendices and main report etc.
The report should not exceed 4000 words. (20 marks)

Total Marks available: 300 marks

Section 3: APPENDIX

3.0 DETAILS OF CZECHR.SAS7BDAT
Variable | Meaning | Notes
account_id | identification of the account |
Creditn | Total number of credits | § Measured over the period of the transaction data base, which is from 1/1/93 – 31/12/98; however, many accounts do not show any transactions until much later than 1/1/93 (the last “first transaction date” is 29/12/97). Similarly for the last transaction.
Creditt | Total value of credits |
Withdrn | Total number of withdrawals |
Withdrt | Total value of withdrawals |
Cashcrn | Total number of cash credits |
Cashcrt | Total value of cash credits |
Cashwdn | Total number of cash withdrawals |
Cashwdt | Total value of cash withdrawals |
Bankcoln | Total number of collections from other banks (i.e. electronic credits) |
Bankcolt | Total value of collections from other banks (i.e. electronic credits) |
Bankrn | Total number of times of remittance to other banks (i.e. electronic debits) |
Bankrt | Total value of remittance to other banks (i.e. electronic debits) |
Othcrn | Total number of other credits |
Othcrt | Total value of other credits |
Days | Total number of days between first transaction and last |
Sex | Gender of primary account holder |
Second | | = N if no second account holder; = Y if there is a secondary account holder
Frequency | frequency of issuance of statements | “monthly” stands for monthly issuance; “weekly” stands for weekly issuance; “After_Trans” stands for issuance after transaction
Region | Region in which account holder lives | One of: Prague; North Moravia; South Moravia; South Bohemia; Central Bohemia; North Bohemia; West Bohemia; East Bohemia
no_inhab | no. of inhabitants | Demographic data based on one of the 77 possible districts the account holder lives in. (Therefore a large number of the values of these variables will be the same for several observations corresponding to clients living in the same district.)
mu_low | no. of municipalities with inhabitants < 499 |
mu_lmid | no. of municipalities with inhabitants 500-1999 |
mu_umid | no. of municipalities with inhabitants 2000-9999 |
mu_high | no. of municipalities with inhabitants >10000 |
Cities | no. of cities |
Urbanr | ratio of urban inhabitants |
ave_sal | average salary |
unemploy95 | unemployment rate ’95 |
unemploy96 | unemployment rate ’96 |
Enter | no. of entrepreneurs per 1000 inhabitants |
crime95 | no. of committed crimes ’95 |
crime96 | no. of committed crimes ’96 |
LOAN | whether the account holder has a loan with the bank | = yes if account holder has a loan (with this bank); = no if not

3.1 DETAILS OF CZECHIN.SAS7BDAT
This dataset contains all the variables of Czech above but in addition has the following variables:

Variable | Meaning | Notes
IMP_cashcrn | Imputed cashcrn | Copy of cashcrn, but missing values replaced by the mean of the remaining cashcrn values.
IMP_cashcrt | Imputed cashcrt | Copy of cashcrt, but missing values replaced by the mean of the remaining cashcrt values.
IMP_crime95 | Imputed crime95 | Copy of crime95, but missing values replaced by the mean of the remaining crime95 values.
IMP_unemploy95 | Imputed unemploy95 | Copy of unemploy95, but missing values replaced by the mean of the remaining unemploy95 values.
M_cashcrn | Indicator for if value of cashcrn is missing | = 0 if cashcrn not missing; = 1 if cashcrn missing
M_cashcrt | Indicator for if value of cashcrt is missing | = 0 if cashcrt not missing; = 1 if cashcrt missing
M_crime95 | Indicator for if value of crime95 is missing | = 0 if crime95 not missing; = 1 if crime95 missing
M_unemploy95 | Indicator for if value of unemploy95 is missing | = 0 if unemploy95 not missing; = 1 if unemploy95 missing
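
The relationship between an original variable, its IMP_ copy and its M_ indicator can be sketched in a few lines of Python. This is only an illustration of the rule stated in the Notes column, not the Enterprise Miner implementation: the function name impute_mean_with_indicator and the sample cashcrn values are hypothetical, and a missing value is represented by None.

```python
from statistics import mean

def impute_mean_with_indicator(values):
    """Return (imputed, indicator) lists for one interval variable.

    imputed   : copy of values with None replaced by the mean of the
                non-missing values (the IMP_ variable).
    indicator : 1 where the original value was missing, else 0
                (the M_ variable).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

# Hypothetical cashcrn values; None marks a missing value.
cashcrn = [30, None, 42, 36, None]
imp_cashcrn, m_cashcrn = impute_mean_with_indicator(cashcrn)
print(imp_cashcrn)  # missing entries replaced by the mean of 30, 42 and 36
print(m_cashcrn)    # 1 marks the positions that were missing
```

Keeping the indicator alongside the imputed copy means the fact that a value was missing remains available as a possible predictor even after the gap has been filled.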

3.2 DETAILS OF DISTRICT.SAS7BDAT

Variable | Meaning | Notes
Region | Region in which account holder lives | One of: Prague; North Moravia; South Moravia; South Bohemia; Central Bohemia; North Bohemia; West Bohemia; East Bohemia
no_inhab | no. of inhabitants | Demographic data based on one of the 77 possible districts the account holder lives in. (Therefore, in the Czech dataset, a large number of the values of these variables will be the same for several observations corresponding to clients living in the same district.) In this version of the dataset there is just the data on districts, i.e. ONE observation per district.
mu_low | no. of municipalities with inhabitants < 499 |
mu_lmid | no. of municipalities with inhabitants 500-1999 |
mu_umid | no. of municipalities with inhabitants 2000-9999 |
mu_high | no. of municipalities with inhabitants >10000 |
cities | no. of cities |
urbanr | ratio of urban inhabitants |
ave_sal | average salary |
unemploy95 | unemployment rate ’95 |
unemploy96 | unemployment rate ’96 |
enter | no. of entrepreneurs per 1000 inhabitants |
crime95r | crime rate ’95 as a % |
crime96r | crime rate ’96 as a % |
distname | district name |
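
The crime-rate variables in this table are described in section 1.4.3 as the number of crimes in the year divided by the number of inhabitants, multiplied by 100. A minimal Python sketch of that calculation is given here; the function name crime_rate is hypothetical, and the example figures (no_inhab = 42821, crime96 = 1358) are the values printed in SAS Output 2.1 for the groups in which crime95 is missing, used purely for illustration.

```python
def crime_rate(crimes, no_inhab):
    """Number of crimes expressed as a percentage of the number of inhabitants."""
    if no_inhab <= 0:
        raise ValueError("no_inhab must be positive")
    return crimes / no_inhab * 100

# Illustrative values taken from SAS Output 2.1 (assumed to describe one district).
crime96r = crime_rate(1358, 42821)
print(round(crime96r, 2))
```

Expressing the counts as a rate puts the crime variables on the same footing as unemploy95, unemploy96 and enter, so that large districts do not dominate simply because they have more inhabitants.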

3.2.1 BACKGROUND
Given that the data concerns the Czech Republic and municipalities within it, you may find it useful to research the structure of this country. A good starting point is:
http://en.wikipedia.org/wiki/Czech_Republic
http://en.wikipedia.org/wiki/Regions_of_the_Czech_Republic
and
http://en.wikipedia.org/wiki/Districts_of_the_Czech_Republic
A map of the regions is given overleaf.

3.3 DETAILS OF LOANS2.SAS7BDAT
Variable
Meaning
Notes
account_id
identification of the account
creditn
Total number of credits
=cashcrn + bankcoln+ othcrn ; seealso §
creditt
Total value of credits
= cashcrt + bankcolt + othcrt ; seealso §
withdrn
Total number of withdrawls
= cardwdln + cashwdn + bankrn +othwdn ; see also §
withdrt
Total value of withdrawls
= cardwdlt + cashwdt + bankrt +othwdt ; see also §
clbal
Closing balance at the end of the transactionperiod.
For most accounts this is at the end of1998 i.e. 31/12/98, however somerecord a last transaction one or twomonths earlier presumably becausethere were no transactions in the lastone or two months of 1998.
cardwdln
Total number of credit card withdrawals
§ Measured over the period of the transaction database, which runs from 1/1/93 to 31/12/98; however, many accounts do not show any transactions until much later than 1/1/93 (the last “first transaction date” is 29/12/97). Similarly for the last transaction: see the previous remark.
cardwdlt
Total value of credit card withdrawals
cashcrn
Total number of cash credits
cashcrt
Total value of cash credits
cashwdn
Total number of cash withdrawals
cashwdt
Total value of cash withdrawals
bankcoln
Total number of collections from other banks
bankcolt
Total value of collections from other banks
bankrn
Total number of times of remittance to other banks
bankrt
Total value of remittance to other banks
othcrn
Total number of other credits
othcrt
Total value of other credits
Days
Total number of days between first transaction and last
sex
Gender of primary account holder
card
= no if account holder does not hold a credit card with this bank; = yes if account holder does hold a credit card with this bank
age
Age of primary account holder
second
frequency
frequency of issuance of statements
“monthly” stands for monthly issuance; “weekly” stands for weekly issuance; “After_Trans” stands for issuance after transaction
Region
Region in which account holder lives
One of: Prague; North Moravia; South Moravia; South Bohemia; Central Bohemia; North Bohemia; West Bohemia; East Bohemia
no_inhab
no. of inhabitants
Demographic data based on one of the 77 possible districts the account holder lives in. (Therefore a large number of the values of these variables will be the same for several observations corresponding to clients living in the same district.)
mu_low
no. of municipalities with inhabitants < 499
mu_lmid
no. of municipalities with inhabitants 500-1999
mu_umid
no. of municipalities with inhabitants 2000-9999
mu_high
no. of municipalities with inhabitants >10000
cities
no. of cities
urbanr
ratio of urban inhabitants
ave_sal
average salary
unemploy95
unemployment rate ’95
unemploy96
unemployment rate ’96
enter
no. of entrepreneurs per 1000 inhabitants
crime95
no. of committed crimes ’95
crime96
no. of committed crimes ’96
loan
= yes if account holder has a loan with the bank
Target variable
creditnd
Total number of credits/days
= cashcrnd + bankcolnd + othcrnd; see also §2
credittd
Total value of credits/days
= (cashcrt + bankcolt + othcrt)/Days; see also §2
withdrnd
Total number of withdrawals/days
= (cardwdln + cashwdn + bankrn + othwdn)/Days; see also §2
withdrtd
Total value of withdrawals/days
= (cardwdlt + cashwdt + bankrt + othwdt)/Days; see also §2
second (values): = N if no second account holder; = Y if there is a second account holder
loan (values, continued): = no if account holder does not have a loan with the bank
cardwdlnd
Total number of credit card withdrawals/days
§2 Measured over the period of the transaction database, which runs from 1/1/93 to 31/12/98; however, many accounts do not show any transactions until much later than 1/1/93 (the last “first transaction date” is 29/12/97). Similarly for the last transaction: see the remark on clbal.
cardwdltd
Total value of credit card withdrawals/days
cashcrnd
Total number of cash credits/days
cashcrtd
Total value of cash credits/days
cashwdnd
Total number of cash withdrawals/days
cashwdtd
Total value of cash withdrawals/days
bankcolnd
Total number of collections from other banks/days
bankcoltd
Total value of collections from other banks/days
bankrnd
Total number of times of remittance to other banks/days
bankrtd
Total value of remittance to other banks/days
othcrnd
Total number of other credits/days
othcrtd
Total value of other credits/days
accredit
Average value of credits = Total value of credits / Total number of credits = creditt/creditn
= (cashcrt + bankcolt + othcrt)/(cashcrn + bankcoln + othcrn)
awithdr
Average value of withdrawals = Total value of withdrawals / Total number of withdrawals = withdrt/withdrn
= (cardwdlt + cashwdt + bankrt + othwdt)/(cardwdln + cashwdn + bankrn + othwdn)
acardwdl
Average value of credit card withdrawals = Total value of credit card withdrawals / Total number of credit card withdrawals = cardwdlt/cardwdln
§3 Given earlier remarks (see §), the average is based on a different time period for each observation
acashcr
Average value of cash credits = Total value of cash credits / Total number of cash credits = cashcrt/cashcrn
acashwd
Average value of cash withdrawals = Total value of cash withdrawals / Total number of cash withdrawals = cashwdt/cashwdn
abankcol
Average value of collections from other banks = Total value of collections from other banks / Total number of collections from other banks = bankcolt/bankcoln
abankr
Average value of remittance to other banks = Total value of remittance to other banks / Total number of times of remittance to other banks = bankrt/bankrn
aothcr
Average value of other credits = Total value of other credits / Total number of other credits = othcrt/othcrn
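The derived credit variables in the table above (totals, per-day rates and averages) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the helper names safe_div and derive_credit_vars are hypothetical, the per-day variables are taken to be the corresponding totals divided by Days, and returning None when a count is zero is an assumption, since the table does not say how the dataset records an average with no transactions. The account values are made up.

```python
def safe_div(num, den):
    """Return num/den, or None when the denominator is zero or missing."""
    if num is None or not den:
        return None
    return num / den


def derive_credit_vars(acct):
    """Add the credit totals, per-day rates and averages to one account row."""
    out = dict(acct)
    # Totals: number and value of all credits.
    out["creditn"] = acct["cashcrn"] + acct["bankcoln"] + acct["othcrn"]
    out["creditt"] = acct["cashcrt"] + acct["bankcolt"] + acct["othcrt"]
    # Per-day rates (assumed to be total / Days).
    out["creditnd"] = safe_div(out["creditn"], acct["Days"])
    out["credittd"] = safe_div(out["creditt"], acct["Days"])
    # Averages: total value / total number.
    out["accredit"] = safe_div(out["creditt"], out["creditn"])
    out["acashcr"] = safe_div(acct["cashcrt"], acct["cashcrn"])
    out["abankcol"] = safe_div(acct["bankcolt"], acct["bankcoln"])
    out["aothcr"] = safe_div(acct["othcrt"], acct["othcrn"])
    return out


# One made-up account (NOT taken from the real data).
account = {
    "cashcrn": 10, "cashcrt": 5000.0,
    "bankcoln": 0, "bankcolt": 0.0,
    "othcrn": 5, "othcrt": 1000.0,
    "Days": 300,
}

derived = derive_credit_vars(account)
for name in ("creditn", "creditt", "creditnd", "credittd", "accredit", "abankcol"):
    print(name, derived[name])
```

In this toy account there are no collections from other banks, so abankcol has no defined value. Averages of this kind are one place where missing values can arise, and they should be handled deliberately before modelling.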