
Date: 2020-03-23 10:17

ADM3308: Business Data Mining

Data Mining Project Using IBM SPSS Modeler

(Team work)

_____________________________________________________________________________________

Weight: 25% of the final mark.        This is a team work project (only one submission per team).

_____________________________________________________________________________________


Important Note: Read the following academic integrity statement, type in your full name and student ID, and include a copy in your submission. Electronic submission of this form by the team representative counts as a signature by BOTH members of the team.

Personal Ethics & Academic Integrity Statement

By typing in my name and student ID on this form and submitting it electronically, I am attesting to the fact that I have reviewed not only my own work, but the work of my team member, in its entirety.

I attest to the fact that my own work in this project adheres to the fraud policies as outlined in the Academic Regulations in the University’s Undergraduate Studies Calendar. I further attest that I have knowledge of and have respected the “Beware of Plagiarism” brochure found on the Telfer School of Management’s doc-depot site. To the best of my knowledge, I also believe that each of my group colleagues has also met the aforementioned requirements and regulations. I understand that if my group assignment is submitted without a completed copy of this Personal Work Statement from each group member, it will be interpreted by the school that the missing student(s) name is confirmation of non-participation of the aforementioned student(s) in the required work.

We, by typing in our names and student IDs on this form and submitting it electronically,

• warrant that the work submitted herein is our own group members’ work and not the work of others

• acknowledge that we have read and understood the University Regulations on Academic Misconduct

• acknowledge that it is a breach of University Regulations to give or receive unauthorized and/or unacknowledged assistance on a graded piece of work



IBM SPSS Modeler is a commercial data mining package from IBM. It supports both predictive and descriptive modeling through a user-friendly interface. IBM SPSS Modeler is available on the computers in the lab, and tutorials on using it for data mining will be presented in class. Students are also required to consult online resources to learn more about IBM SPSS Modeler.

For this project, you are required to complete two parts:

• Part-1 (100 points): Data mining modelling project using a dataset selected from Table-1.

• Part-2 (30 points): Pre-process and clean the raw dataset provided to you (Unclean-Bank-Data.Xlsx) using IBM SPSS Modeler nodes.


PART-1

(A) Dataset Selection:

Each team must select one of the datasets listed in Table-1 (or one from another recommended repository, with the pre-approval of the professor) and announce it on the “Discussion Board” in the Forum named “Announcing Dataset Selection”. Post your name, your team member’s name, and the dataset selected. If a dataset has already been claimed by a team, as posted on the Forum, it cannot be selected by another team. I therefore recommend that you select your dataset and announce it on the Discussion Board as early as possible.

NOTE: You may choose a dataset not listed in Table-1 only with the professor’s prior approval. If you would like to analyze such a dataset, please email me its details for my review (e.g. the source of the data, the number of records, and the number of attributes).


(B) Data Analysis and Model Building:

You are required to import the data and perform pre-processing tasks if needed (such as reformatting the data, normalizing it, and dealing with missing values and outliers). Then carry out two or more modeling tasks, such as classification (decision tree, Bayesian, KNN, neural networks, etc.), clustering (K-means, agglomerative), and association rule mining.
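The project itself must be carried out in IBM SPSS Modeler, but the idea behind "normalize, then classify" may be easier to see in a few lines of code. The following is a minimal Python sketch, not a substitute for the Modeler stream. It assumes a tiny made-up table with two numeric attributes and a yes/no label, applies min-max normalization, and classifies one row with a simple k-nearest-neighbours vote. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def min_max_normalize(rows):
    """Rescale each numeric column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k nearest training rows (Euclidean)."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical toy data: [age, account balance] with a made-up yes/no label.
features = [[25, 1000], [30, 1500], [45, 20000], [50, 22000], [28, 1200]]
labels = ["no", "no", "yes", "yes", "no"]

# For brevity all rows are normalized together; a real workflow would
# compute the scaling from the training rows only.
scaled = min_max_normalize(features)

# Classify the last row using the first four rows as training data.
prediction = knn_predict(scaled[:4], labels[:4], scaled[4], k=3)
print(prediction)
```

Without normalization the balance column, whose values are far larger than the ages, would dominate the distance calculation, which is why the scaling step comes first.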


(C) Project Report for Part-1:

Your report for this part of the project should include:

• Explaining the data you selected for your project (attributes, instances, etc.)

• Explaining your pre-processing tasks if any (cleaning, transforming, normalizing, etc.)

• Explaining the data mining modeling techniques you performed on the data (at least two techniques)

• Demonstrating the graphs/tables of the results produced by the techniques

• Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings

• Concluding remarks, your recommendations, actionable discoveries, and future trends/studies you would recommend


Overall, your report for this part of the project should be 15 to 25 pages long (including graphs). Use 12pt Times New Roman font with 1.5 line spacing, and keep a margin of 1” on all sides of the page.


Rubrics for Part-1

Your report for Part-1 of the project will be evaluated as follows:

Components of the Report (Part-1) | Points
Abstract or executive summary | 10
Explanation of the dataset, and the pre-processing tasks (if any) to prepare the data | 10
Explanation of at least two data mining tasks you performed on the data. Also, explain why you considered the specific data mining tasks for your dataset | 20
Relevant graphs showing the output results of the techniques you applied | 20
Interpretation of the modeling results: useful patterns, predicted values, significance of the features, what actions you might suggest based on your findings | 10
A conclusion section summarizing your findings, discussing the results, your understanding of the results, your recommendation, and any useful patterns, rules, prediction or future trend you infer from the data | 10
Overall organization of the paper, its soundness and readability, and quality of the presentation | 20
Total (Part-1) | 100



(D) List of Datasets:

Select one of the following datasets, then post a message on the Discussion Board on Brightspace to claim your dataset.

Table-1: List of datasets for Part-1 of the project

Note: These datasets are available at the UCI Machine Learning Repository. For more information, visit http://archive.ics.uci.edu/ml/datasets.html


# | Name | Number of features | Number of samples | Comments
1 | Waveform Database Generator (version 2) | 40 | 5000 | Use the dataset without noise
2 | Statlog (Landsat Satellite) | 36 | 6435 | Training and testing datasets are different
3 | seismic-bumps | 22 | 8124 |
4 | Image Segmentation | 19 | 2310 | Use only the testing dataset
5 | Bank Marketing | 17 | 45211 |
6 | Pen-Based Recognition of Handwriting Digits | 16 | 10992 | Training and testing datasets are different
7 | Student Performance | 33 | 649 |
8 | Adult | 14 | 48842 | Training and testing datasets are different
9 | Statlog (Shuttle) | 9 | 58000 |
10 | Abalone | 8 | 4177 |
11 | Nursery | 8 | 12960 |
12 | Yeast | 8 | 1484 |
13 | One-hundred plant species leaves data set | 64 | 1600 | Use just-data_Mar_64.txt
14 | Spambase | 57 | 4601 |
15 | Cardiotocography | 23 | 2126 |
16 | Statlog (German Credit Card) | 20 | 1000 |
17 | Letter Recognition | 16 | 20000 |
18 | EEG Eye State | 15 | 14980 |
19 | Page Blocks Classification | 10 | 5473 |
20 | Contraceptive Method Choice | 9 | 1473 |
21 | Weight lifting exercises monitored | 10 | 39242 | Use the following features: roll_belt, pitch_belt, yaw_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y, (class as output)
22 | Connect-4 | 42 | 67557 |
23 | Mushroom | 22 | 8124 |
24 | Default of credit card clients | 24 | 30000 |
25 | Autism Screening Adult Data Set | 21 | 704 |
26 | Drug consumption (quantified) Data Set | 32 | 1885 |
27 | Polish companies bankruptcy data, Data Set | 64 | 10503 |



PART-2

In this part of the project, all teams will use the dataset  Unclean-Bank-Data.Xlsx posted on the “Project Description” page of the course website.

This dataset includes missing values, invalid values, and outliers.  You should use the IBM SPSS Modeler nodes to pre-process and clean the data.

Do not remove a record if there is only one missing value in that record. Instead, use the IBM Modeler to fill in the missing value with an algorithm of your choice.

Similarly, do not remove a record if it has only one invalid value. Instead, use the IBM Modeler to fill in the invalid value with an algorithm of your choice.

If you find a record with more than one missing value, or more than one invalid value, then you may either remove the record, or use the IBM Modeler to fill in for the missing or invalid values.

If you detect an outlier, you may delete the entire record that contains it.
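The cleaning rules above must be implemented with IBM SPSS Modeler nodes, but their logic can be illustrated outside the tool. The Python sketch below is only an illustration under stated assumptions: the column names, the validity ranges, the choice of the column median as the fill-in algorithm, and the 1.5 × IQR outlier rule are all hypothetical and are not prescribed by the assignment.

```python
from statistics import median, quantiles

# Hypothetical columns and validity ranges (the real ones come from the dataset).
VALID_RANGES = {"age": (18, 100), "income": (0, 10_000_000), "children": (0, 20)}

def is_missing(value):
    return value is None

def is_invalid(column, value):
    low, high = VALID_RANGES[column]
    return value is not None and not (low <= value <= high)

def fill_or_drop(records):
    """Impute single missing/invalid values with the column median; drop
    records with more than one missing or more than one invalid value."""
    medians = {
        col: median(
            r[col] for r in records
            if not is_missing(r[col]) and not is_invalid(col, r[col])
        )
        for col in VALID_RANGES
    }
    kept = []
    for rec in records:
        missing = [c for c in VALID_RANGES if is_missing(rec[c])]
        invalid = [c for c in VALID_RANGES if is_invalid(c, rec[c])]
        if len(missing) > 1 or len(invalid) > 1:
            continue  # removal is allowed here (imputing would also be allowed)
        kept.append({**rec, **{c: medians[c] for c in missing + invalid}})
    return kept

def drop_outliers(records, column, k=1.5):
    """Remove whole records whose value lies outside the k * IQR fences."""
    q1, _, q3 = quantiles([r[column] for r in records], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in records if low <= r[column] <= high]

raw = [
    {"age": 34, "income": 52000, "children": 2},
    {"age": None, "income": 48000, "children": 1},   # one missing -> impute
    {"age": 41, "income": -5, "children": 0},        # one invalid -> impute
    {"age": None, "income": None, "children": 3},    # two missing -> drop
    {"age": 29, "income": 50000, "children": 1},
    {"age": 38, "income": 900000, "children": 2},    # valid but an outlier
    {"age": 45, "income": 51000, "children": 3},
]

cleaned = drop_outliers(fill_or_drop(raw), "income")
print(len(cleaned))
```

Imputation runs before outlier removal in this sketch so that filled-in values are included when the quartiles are computed. The reverse order is equally defensible, and the report should state which order was used.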

You may also want to do other pre-processing tasks such as data normalization, binning data, etc.
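For reference, the two optional tasks named above can be sketched in Python as follows. This is an illustrative sketch only, since the project requires the equivalent IBM SPSS Modeler nodes. It shows z-score normalization and equal-width binning on a hypothetical `ages` column. Both helper functions are assumptions made for illustration, and both assume the column is not constant.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 over equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]

ages = [20, 30, 40, 50, 60]  # hypothetical column
standardized = z_scores(ages)
age_bins = equal_width_bins(ages, 4)
print(age_bins)
```

Equal-width binning is sensitive to outliers, because a single extreme value stretches the range and pushes most records into one bin. This is one reason to handle outliers before binning.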

Deliverables for Part-2:

1- Include in your project report a short explanation of three different cleaning and pre-processing tasks you applied to the data using IBM SPSS Modeler.

2- Also include the clean dataset (name it “Clean-Bank-Data.xlsx”) in your submission together with your project report (you may submit everything in one zip file).

Rubrics for Part-2

Your report for Part-2 of the project will be evaluated as follows:


Components of the Report (Part-2) | Points
Explanation of three cleaning and pre-processing tasks applied on the data; explaining the results after you pre-processed the data; including the Clean-Bank-Data.xlsx with your submission. | 3 × 10
Total (Part-2) | 30


