ADM3308: Business Data Mining
Data Mining Project Using IBM SPSS Modeler
(Team work)
_____________________________________________________________________________________
Weight: 25% of the final mark. This is a team work project (only one submission per team).
_____________________________________________________________________________________
Important Note: Read the following academic integrity statement, type in your full name and student ID, and include a copy in your submission. Electronic submission of this form by the team representative is considered a signature by BOTH members of the team.
Personal Ethics & Academic Integrity Statement
By typing in my name and student ID on this form and submitting it electronically, I am attesting to the fact that I have reviewed not only my own work, but the work of my team member, in its entirety.
I attest to the fact that my own work in this project adheres to the fraud policies as outlined in the Academic Regulations in the University’s Undergraduate Studies Calendar. I further attest that I have knowledge of and have respected the “Beware of Plagiarism” brochure found on the Telfer School of Management’s doc-depot site. To the best of my knowledge, I also believe that each of my group colleagues has met the aforementioned requirements and regulations. I understand that if my group assignment is submitted without a completed copy of this Personal Work Statement from each group member, the school will interpret any missing student name as confirmation that the student did not participate in the required work.
We, by typing in our names and student IDs on this form and submitting it electronically,
• warrant that the work submitted herein is our own group members’ work and not the work of others
• acknowledge that we have read and understood the University Regulations on Academic Misconduct
• acknowledge that it is a breach of University Regulations to give or receive unauthorized and/or unacknowledged assistance on a graded piece of work
IBM SPSS Modeler is a commercial data mining package from IBM that supports predictive and descriptive modelling through a user-friendly interface. IBM SPSS Modeler is available on the computers in the lab. Tutorials on using it for data mining will be presented in class. Students are also required to consult online resources to learn more about IBM SPSS Modeler.
For this project, you are required to complete two parts:
• Part-1 (100 points): Data mining modelling project using a dataset selected from Table-1.
• Part-2 (30 points): Perform data pre-processing and data cleaning on the raw dataset provided to you (Unclean-Bank-Data.Xlsx) using IBM SPSS Modeler nodes.
PART-1
(A) Dataset Selection:
Each team must select one of the datasets listed in Table-1 (or one from another recommended repository, with the pre-approval of the professor) and announce it on the “Discussion Board” in the Forum named “Announcing Dataset Selection”. Post your name, your team member’s name, and the dataset selected. Once a team has posted its dataset on the Forum, no other team can select that dataset. Therefore, I recommend that you select your dataset and announce it on the Discussion Board as early as possible.
NOTE: You may choose a dataset not listed in Table-1 with the professor’s prior approval. To do so, please email me the details of the dataset for my review (e.g. the source of the data, the number of records, the number of attributes).
(B) Data Analysis and Model Building:
You are required to import the data, perform pre-processing tasks if needed (such as reformatting the data, normalizing it, dealing with missing values, dealing with outliers), and then carry out two or more modeling tasks such as classification (decision tree, Bayesian, KNN, neural networks, etc.), clustering (K-means, agglomerative), and association rule mining.
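As an illustration of what the normalization step does to an attribute, here is a minimal Python sketch of min-max normalization. It is only meant to clarify the idea; the project itself must be carried out with IBM SPSS Modeler nodes. The function name and the sample values are hypothetical and do not come from any of the project datasets.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute values, used only for illustration.
ages = [23, 35, 41, 29, 52]
normalized = min_max_normalize(ages)
```

After this transformation the smallest value maps to 0 and the largest to 1, so attributes measured on different scales contribute comparably to distance-based techniques such as KNN and K-means.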
(C) Project Report for Part-1:
Your report for this part of the project should include:
• Explaining the data you selected for your project (attributes, instances, etc.)
• Explaining your pre-processing tasks if any (cleaning, transforming, normalizing, etc.)
• Explaining the data mining modeling techniques you performed on the data (at least two techniques)
• Demonstrating the graphs/tables of the results produced by the techniques
• Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings
• Concluding remarks, your recommendations, actionable discoveries, and future trends/studies you would recommend
Overall, your report for this part of the project should be 15 to 25 pages long (including graphs). Use 12pt Times New Roman font with 1.5 line spacing. Keep a margin of 1” on all sides of the page.
Rubrics for Part-1
Your report for Part-1 of the project will be evaluated as follows:
Components of the Report (Part-1) | Points
Abstract or executive summary | 10
Explanation of the dataset and the pre-processing tasks (if any) to prepare the data | 10
Explanation of at least two data mining tasks you performed on the data. Also, explain why you considered the specific data mining tasks for your dataset | 20
Relevant graphs showing the output results of the techniques you applied | 20
Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings | 10
A conclusion section summarizing your findings, discussing the results, your understanding of the results, your recommendation, and any useful patterns, rules, prediction or future trend you infer from the data | 10
Overall organization of the paper, its soundness and readability, and quality of the presentation | 20
Total (Part-1) | 100
(D) List of Datasets:
Select one of the following datasets, then post a message on the Discussion Board on Brightspace to claim your dataset.
Table-1: List of datasets for Part-1 of the project
Note: These datasets are available at the UCI Machine Learning Repository. For more information, visit http://archive.ics.uci.edu/ml/datasets.html
# | Name | Number of features | Number of samples | Comments
1 | Waveform Database Generator (version 2) | 40 | 5000 | Use the dataset without Noise
2 | Statlog (Landsat Satellite) | 36 | 6435 | Training and Testing datasets are different
3 | seismic-bumps | 22 | 8124 |
4 | Image Segmentation | 19 | 2310 | Use only the testing dataset
5 | Bank Marketing | 17 | 45211 |
6 | Pen-Based Recognition of Handwriting Digits | 16 | 10992 | Training and Testing datasets are different
7 | Student Performance | 33 | 649 |
8 | Adult | 14 | 48842 | Training and Testing datasets are different
9 | Statlog (Shuttle) | 9 | 58000 |
10 | Abalone | 8 | 4177 |
11 | Nursery | 8 | 12960 |
12 | Yeast | 8 | 1484 |
13 | One-hundred plant species leaves data set | 64 | 1600 | Use just-data_Mar_64.txt
14 | Spambase | 57 | 4601 |
15 | Cardiotocography | 23 | 2126 |
16 | Statlog (German Credit Card) | 20 | 1000 |
17 | Letter Recognition | 16 | 20000 |
18 | EEG Eye State | 15 | 14980 |
19 | Page Blocks Classification | 10 | 5473 |
20 | Contraceptive Method Choice | 9 | 1473 |
21 | Weight lifting exercises monitored | 10 | 39242 | Use the following features: roll_belt, pitch_belt, yaw_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y, (class as output)
22 | Connect-4 | 42 | 67557 |
23 | Mushroom | 22 | 8124 |
24 | Default of credit card clients | 24 | 30000 |
25 | Autism Screening Adult Data Set | 21 | 704 |
26 | Drug consumption (quantified) Data Set | 32 | 1885 |
27 | Polish companies bankruptcy data, Data Set | 64 | 10503 |
PART-2
In this part of the project, all teams will use the dataset Unclean-Bank-Data.Xlsx posted on the “Project Description” page of the course website.
This dataset includes missing values, invalid values, and outliers. You should use the IBM SPSS Modeler nodes to pre-process and clean the data.
Do not remove a record if there is only one missing value in that record. Instead, use the IBM Modeler to fill in the missing value with an algorithm of your choice.
Similarly, do not remove a record if it has only one invalid value. Instead, use the IBM Modeler to fill in the invalid value with an algorithm of your choice.
If you find a record with more than one missing value, or more than one invalid value, then you may either remove the record, or use the IBM Modeler to fill in for the missing or invalid values.
If you detect outliers, you may then delete the entire record.
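The record-level rules above can be sketched in Python as follows. This is not a replacement for the Modeler nodes you are required to use; it only makes the decision logic explicit. The field names (age, income), the validity checks, the choice of mean imputation, and the 1.5 × IQR outlier fence are all assumptions made for illustration; the actual fields and a suitable fill-in algorithm depend on the bank dataset and on your own choices.

```python
from statistics import mean, quantiles

# Hypothetical validity checks; the real rules depend on the bank dataset.
VALID = {
    "age": lambda v: isinstance(v, (int, float)) and 0 < v < 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def clean(records):
    """Apply the missing/invalid rules to dict records (None marks a missing value)."""
    # Fill-in values: column means over values that are present and valid.
    fill = {
        f: mean(r[f] for r in records if r[f] is not None and ok(r[f]))
        for f, ok in VALID.items()
    }
    kept = []
    for r in records:
        missing = [f for f in VALID if r[f] is None]
        invalid = [f for f in VALID if r[f] is not None and not VALID[f](r[f])]
        if len(missing) > 1 or len(invalid) > 1:
            # Removal is allowed here; filling in would also be acceptable.
            continue
        fixed = dict(r)
        for f in missing + invalid:
            fixed[f] = fill[f]  # a single bad value is filled in, never dropped
        kept.append(fixed)
    return kept


def drop_outliers(records, field, k=1.5):
    """Drop records whose `field` lies outside Q1 - k*IQR .. Q3 + k*IQR."""
    q1, _, q3 = quantiles([r[field] for r in records], n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in records if lo <= r[field] <= hi]


# Hypothetical records, used only for illustration.
rows = [
    {"age": 30, "income": 40000},
    {"age": None, "income": 52000},   # one missing value: filled in
    {"age": -5, "income": 61000},     # one invalid value: filled in
    {"age": None, "income": None},    # two missing values: removed
    {"age": 45, "income": 58000},
]
cleaned = clean(rows)
final = drop_outliers(cleaned, "income")
```

Missing and invalid values are counted separately, following the wording of the rules, so a record with one missing value and one invalid value is kept and both values are filled in.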
You may also want to do other pre-processing tasks such as data normalization, binning data, etc.
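For example, binning replaces a numeric attribute with the index of the interval it falls into. The following Python sketch shows equal-width binning as one possible scheme; the function name, the number of bins, and the sample balances are hypothetical, and in the project the equivalent step should be done with a Modeler node.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so they share a single bin.
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical account balances, used only for illustration.
balances = [0, 250, 500, 750, 1000]
bins = equal_width_bins(balances, 4)
```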
Deliverables for Part-2:
1- Include in your project report a short explanation of three different cleaning and pre-processing tasks you applied to the data using IBM SPSS Modeler.
2- Also, include the clean dataset (name it “Clean-Bank-Data.xlsx”) in your submission together with your project report (you may submit everything in one zip file).
Rubrics for Part-2
Your report for Part-2 of the project will be evaluated as follows:
Components of the Report (Part-2) | Points
Explanation of three cleaning and pre-processing tasks applied to the data; explanation of the results after you pre-processed the data; inclusion of Clean-Bank-Data.xlsx with your submission | 3 × 10
Total (Part-2) | 30