CP1407 Machine Learning algorithms 


CP1407 Assignment 2 

 


 

 

Note: This is an individual assignment. While it is expected that students will discuss their ideas with one another, students need to be aware of their responsibilities in ensuring that they do not deliberately or inadvertently plagiarise the work of others.

 

 

Assignment 2 – Practice on various Machine Learning algorithms 

 

 

 

 1. [Data Pre-Processing, Clustering] [10 marks] 

Why is attribute scaling of data important? The following table contains sample records giving the number of employees and the total revenue generated by particular stores of a supermarket. Use the table as an example to discuss the necessity of normalisation in any proximity measure used for clustering; a brief scaling sketch follows the table.

 

Supermarket ID Employee Count Revenue 

001 38 $5,500,000 

002 29 $5,000,000 

003 24 $5,000,000 

004 10 $890,000 

005 40 $2,500,000 

006 31 $3,200,000 

007 14 $678,000 

008 35 $5,200,000 

009 30 $5,300,000 

010 22 $5,500,000 
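
To make the scaling effect concrete, the following Python sketch (illustrative only, not part of the assignment brief) compares the Euclidean distance between two of the stores before and after min-max normalisation; the choice of Euclidean distance and min-max scaling here is an assumption made purely for illustration.

import math

# (employee count, revenue) for stores 001, 004 and 005 from the table above
stores = {"001": (38, 5_500_000), "004": (10, 890_000), "005": (40, 2_500_000)}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Raw distance: completely dominated by the revenue attribute
print(euclidean(stores["001"], stores["005"]))  # about 3,000,000 despite similar staff sizes

# Min-max normalisation over the three sample stores (illustration only)
counts = [c for c, _ in stores.values()]
revenues = [r for _, r in stores.values()]

def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

scaled = {k: (minmax(c, min(counts), max(counts)),
              minmax(r, min(revenues), max(revenues)))
          for k, (c, r) in stores.items()}

# After scaling, both attributes contribute comparably to the distance
print(euclidean(scaled["001"], scaled["005"]))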

 

 

 

 

  2. [Classification – Decision Tree algorithm] [20 marks]

Use the soybean dataset (diabetes.arff) to perform decision tree induction in Weka using three different decision tree induction algorithms: J48, REPTree, and RandomTree. Investigate the different options, looking in particular at the differences between pruned and unpruned trees (a brief illustrative sketch of this comparison follows the questions). In discussing your results, consider the following questions.

 

a) What are the effects of pruning on the results for this dataset?

b) Are there differences in the performances of the three decision tree algorithms? 

c) What impacts do other parameters of the algorithms have on the results? 
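
As a rough analogue only (an assumption, not the Weka workflow the task requires), the sketch below contrasts an unpruned decision tree with a cost-complexity-pruned one in scikit-learn on a placeholder dataset; in Weka the comparable controls are J48's unpruned and confidence-factor options.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset; the assignment uses the ARFF file named above in Weka instead
X, y = load_breast_cancer(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0)                # grows the full tree
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)  # cost-complexity pruning

for name, clf in [("unpruned", unpruned), ("pruned", pruned)]:
    scores = cross_val_score(clf, X, y, cv=10)                   # 10-fold cross-validation
    clf.fit(X, y)
    print(f"{name}: mean accuracy {scores.mean():.3f}, {clf.tree_.node_count} nodes")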

 

  3. [Classification – Naïve Bayes algorithm] [30 marks]

Suppose we have data on a few individuals randomly examined for a basic health check. The following table gives these individuals’ health-related attributes.

 


Body Weight   Body Height   Blood Pressure   Blood Sugar Level   Habit       Class
Heavy         Tall          High             3                   Smoker      P
Heavy         Short         High             1                   Nonsmoker   P
Normal        Tall          Normal           3                   Nonsmoker   N
Heavy         Tall          Normal           2                   Smoker      N
Low           Medium        Normal           2                   Nonsmoker   N
Low           Tall          Normal           1                   Nonsmoker   P
Normal        Medium        High             3                   Smoker      P
Low           Short         High             2                   Smoker      P
Heavy         Tall          High             2                   Nonsmoker   P
Low           Medium        Normal           3                   Smoker      P
Heavy         Medium        Normal           3                   Smoker      N

 

Use the data together with the Naïve Bayes classifier to classify the following new instance. Build and apply the classifier by hand, not with Weka, and show all your working; a cross-check sketch follows the instance table.

Body Weight   Body Height   Blood Pressure   Blood Sugar Level   Habit       Class
Low           Tall          High             2                   Smoker      ?
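
The sketch below is offered only as a cross-check for the hand calculation, not a substitute for it: it tallies the class priors P(C) and the conditional probabilities P(x_i | C) from the table and prints the unnormalised Naïve Bayes score P(C) · Π P(x_i | C) for each class. No Laplace smoothing is applied, which is an assumption since the task does not specify one.

from collections import Counter

# Rows: (body weight, body height, blood pressure, blood sugar level, habit, class)
rows = [
    ("Heavy", "Tall", "High", "3", "Smoker", "P"),
    ("Heavy", "Short", "High", "1", "Nonsmoker", "P"),
    ("Normal", "Tall", "Normal", "3", "Nonsmoker", "N"),
    ("Heavy", "Tall", "Normal", "2", "Smoker", "N"),
    ("Low", "Medium", "Normal", "2", "Nonsmoker", "N"),
    ("Low", "Tall", "Normal", "1", "Nonsmoker", "P"),
    ("Normal", "Medium", "High", "3", "Smoker", "P"),
    ("Low", "Short", "High", "2", "Smoker", "P"),
    ("Heavy", "Tall", "High", "2", "Nonsmoker", "P"),
    ("Low", "Medium", "Normal", "3", "Smoker", "P"),
    ("Heavy", "Medium", "Normal", "3", "Smoker", "N"),
]
new_instance = ("Low", "Tall", "High", "2", "Smoker")

class_counts = Counter(r[-1] for r in rows)
for c in class_counts:
    score = class_counts[c] / len(rows)         # prior P(C)
    for i, value in enumerate(new_instance):
        matches = sum(1 for r in rows if r[-1] == c and r[i] == value)
        score *= matches / class_counts[c]      # likelihood P(x_i | C), no smoothing
    print(c, score)                             # larger unnormalised score wins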

 

 4. [Association Rules Mining] [20 marks] 

The following table lists the film-watching histories of several viewers of an on-demand service.

 

User Id Items 

001 Airplane!, Downfall, Evita, Idiocracy, Jurassic Park 

002 Casablanca, Downfall, Evita, Flubber, Jurassic Park 

003 Airplane!, Downfall, Half Baked, Jurassic Park 

004 Airplane!, Downfall 

005 Casablanca, Downfall, Flubber, Jurassic Park, Zoolander 

006 Casablanca, Downfall, Half Baked, Idiocracy, Zoolander 

007 Evita, Idiocracy, Jurassic Park 

008 Downfall, Jurassic Park, Zoolander 

009 Casablanca, Downfall, Evita, Half Baked, Jurassic Park, Zoolander 

 

a) Follow the steps outlined in Practical 07 and conduct a mining task for Boolean 

association rules using the Apriori algorithm in Weka. 

b) Set different parameters and observe the association rules discovered. 

c) Weka provides rule evaluation measures other than support and confidence. Report the values of those measures for a few example rules; a brief sketch of the underlying metrics follows.
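
For orientation only, the sketch below computes support, confidence and lift for a single illustrative rule, Downfall => Jurassic Park, directly from the viewing table; the choice of rule is an arbitrary assumption, and the mining itself must still be done with Weka's Apriori, whose alternative evaluation measures include lift, leverage and conviction.

# Transactions copied from the table above
transactions = [
    {"Airplane!", "Downfall", "Evita", "Idiocracy", "Jurassic Park"},
    {"Casablanca", "Downfall", "Evita", "Flubber", "Jurassic Park"},
    {"Airplane!", "Downfall", "Half Baked", "Jurassic Park"},
    {"Airplane!", "Downfall"},
    {"Casablanca", "Downfall", "Flubber", "Jurassic Park", "Zoolander"},
    {"Casablanca", "Downfall", "Half Baked", "Idiocracy", "Zoolander"},
    {"Evita", "Idiocracy", "Jurassic Park"},
    {"Downfall", "Jurassic Park", "Zoolander"},
    {"Casablanca", "Downfall", "Evita", "Half Baked", "Jurassic Park", "Zoolander"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

antecedent, consequent = {"Downfall"}, {"Jurassic Park"}
sup = support(antecedent | consequent)
conf = sup / support(antecedent)
lift = conf / support(consequent)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")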


 


 

  5. [Clustering] [20 marks]

Consider the following 2-dimensional point data set, given in (x, y) coordinates:
 P1(1,1), P2(1,3), P3(4,3), P4(5,4), P5(9,4), P6(9,6).
Apply the agglomerative hierarchical clustering algorithm by hand to obtain a final set of two clusters. Use the Manhattan distance function to measure the distance between points and the single-linkage scheme when merging clusters. Show all your working; a cross-check sketch follows.
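
The sketch below can serve as a cross-check on the hand working, not a replacement for it: it runs SciPy's single-linkage agglomerative clustering with the Manhattan ("cityblock") metric on the six points and cuts the dendrogram at two clusters.

from scipy.cluster.hierarchy import fcluster, linkage

points = [(1, 1), (1, 3), (4, 3), (5, 4), (9, 4), (9, 6)]   # P1 .. P6

# Single-linkage agglomerative clustering with Manhattan distance
Z = linkage(points, method="single", metric="cityblock")
labels = fcluster(Z, t=2, criterion="maxclust")             # cut into two clusters
for name, label in zip(["P1", "P2", "P3", "P4", "P5", "P6"], labels):
    print(name, label)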

 

Rubric 

 Exemplary: 90-100%   Good: 70-80%   Satisfactory: 50-60%   Limited: 30-40%   Very Limited: 0-20%
