Data Mining Assignment 1: Data Understanding
Submission: Submit a hard copy of the assignment in the second Data Mining class of the week (23 or 24 Nov. 2023).
1. (20 points)
Apply your basic data mining knowledge to compare students’ midterm exam performance in a course across
two years, 2020 and 2021 (result_20_21.xls). Support your comments and comparison with a statistical
description of the data (e.g., mean, median, mode, variance, five-number summary) and with plots
(e.g., boxplot, histogram). (A 2- to 3-page report is required.)
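As a starting point, the descriptive statistics named above can be sketched with Python's standard library. The marks below are hypothetical and do not come from result_20_21.xls, and the names `five_number_summary` and `describe` are illustrative. Quartiles depend on the interpolation convention, so other tools may report slightly different values for Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))


def describe(values):
    """Collect the basic descriptive statistics for one group of marks."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),    # list, since ties are possible
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "five_number": five_number_summary(values),
    }


# Hypothetical marks, for illustration only (not the assignment data).
marks_by_year = {
    "2020": [12, 15, 15, 18, 20, 22, 25],
    "2021": [10, 14, 16, 16, 19, 24, 28],
}

for year, marks in marks_by_year.items():
    print(year, describe(marks))
```

Placing the two summaries side by side, together with a boxplot per year, is usually enough to comment on differences in centre and spread.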
2. (20 points)
Download the Dry Bean dataset from the UCI Machine Learning Repository. Read the dataset’s description and report
the following (use any language or tool of your choice to solve this problem):
a. The types of the attributes (continuous [interval, ratio], categorical [nominal, ordinal]). Also identify which
attribute(s) are input attribute(s) and which are class attribute(s) (if any).
b. Compute the five-number summary for any two continuous attributes. Compute the mode for categorical
attributes.
c. Compute the mean and standard deviation for the two continuous attributes.
d. Generate the quantile (percentile) plots for two attributes of the dataset.
e. Generate the histogram or distribution plot for each of the two attributes selected in (b).
f. Generate the scatter plots for the two attributes selected in (d).
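For part (d), a quantile plot can be built by pairing each sorted value with the fraction of the data at or below it. The sketch below assumes the common convention f_i = (i - 0.5) / n, which your course text may or may not use. The function name `quantile_plot_points` and the sample values are hypothetical, and the resulting pairs can be drawn with any plotting tool.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical measurements, for illustration only (not Dry Bean values).
sample = [7.1, 5.4, 6.3, 5.9]

# Each pair is (f-value on the horizontal axis, data value on the vertical axis).
for f, x in quantile_plot_points(sample):
    print(f"{f:.3f}  {x}")
```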
3. (10 points)
Download and install Weka, a data mining tool, on your system. Explore the tool and the datasets provided
with the installation. Submit a report containing basic statistics and plots (e.g., scatter plot matrix) for the Iris
dataset, produced with Weka. (A 2- to 3-page report is required.)
The following links can be useful.
[Link]
[Link]
[Link]
4. (30 points) Handwritten solution is required.
a. Given the following four points in 3-D space, compute and show the dissimilarity matrix. Use
Euclidean distance as the dissimilarity measure. A(4,5,5), B(5,3,3), C(1,1,0), D(4,4,1)
b. Repeat part (a) using Manhattan distance as the dissimilarity measure.
c. Draw a scatter plot for the distances obtained in parts (a) and (b) to identify the relationship
between the two dissimilarity measures.
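To check a handwritten answer, the two distance measures can be sketched as below. The points P, Q and R are hypothetical and deliberately differ from A to D, and the helper names (`euclidean`, `manhattan`, `dissimilarity_matrix`) are illustrative rather than taken from any library.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def dissimilarity_matrix(points, dist):
    """Full symmetric matrix of pairwise distances, in dictionary order."""
    names = list(points)
    return [[dist(points[r], points[c]) for c in names] for r in names]


# Hypothetical points, for illustration only (not the assignment's A-D).
example_points = {"P": (0, 0, 0), "Q": (3, 4, 0), "R": (1, 1, 1)}

for label, dist in (("Euclidean", euclidean), ("Manhattan", manhattan)):
    print(label)
    matrix = dissimilarity_matrix(example_points, dist)
    for name, row in zip(example_points, matrix):
        print(name, [round(v, 3) for v in row])
```

The matrix is symmetric with a zero diagonal, so a handwritten solution normally shows only the lower triangle.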
5. (20 points) Handwritten solution is required.
Name    Fever  Cough  Height  Weight  Profession  City
Ali     N      Y      65      80      Student     Lahore
Bilal   Y      Y      55      65      Student     Karachi
Khan    N      N      70      75      Teacher     Lahore
Ahmed   Y      N      60      55      Doctor      Islamabad
Given the data above, compute the dissimilarity matrix. Fever and Cough are asymmetric binary attributes,
Height and Weight are numeric, and Profession and City are nominal. Based on your computed dissimilarity
matrix, who should be suggested as a friend to Ali?
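One common way to combine attributes of mixed types is to average the per-attribute dissimilarities, skipping an asymmetric binary attribute when both objects have the negative value. The sketch below assumes that approach, with numeric differences scaled by the attribute's range; the course may prescribe a different formula. The records, the attribute names and the function `mixed_dissimilarity` are hypothetical.

```python
def mixed_dissimilarity(a, b, schema, ranges):
    """Average per-attribute dissimilarity over the attributes that count.

    schema maps attribute name -> "asym_binary" | "numeric" | "nominal".
    ranges maps each numeric attribute -> (max - min) over the whole data set.
    """
    total = 0.0
    counted = 0
    for attr, kind in schema.items():
        x, y = a[attr], b[attr]
        if kind == "asym_binary":
            if x == 0 and y == 0:
                continue  # negative match carries no information; skip it
            d = 0.0 if x == y else 1.0
        elif kind == "numeric":
            d = abs(x - y) / ranges[attr]
        elif kind == "nominal":
            d = 0.0 if x == y else 1.0
        else:
            raise ValueError(f"unknown attribute type: {kind}")
        total += d
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical schema and records, for illustration only.
schema = {"flag": "asym_binary", "age": "numeric", "color": "nominal"}
ranges = {"age": 20}  # assumed max - min of "age" in this toy data set
r1 = {"flag": 1, "age": 20, "color": "red"}
r2 = {"flag": 0, "age": 30, "color": "red"}
r3 = {"flag": 0, "age": 40, "color": "blue"}

print(mixed_dissimilarity(r1, r2, schema, ranges))
print(mixed_dissimilarity(r2, r3, schema, ranges))
```

Under this scheme, the object with the smallest dissimilarity to a given person is the natural candidate to suggest as a friend.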