ADVANCED DATABASES
AND DATA MINING
CSCI-527
PROJECT REPORT PRESENTATION
ANALYSIS OF
AUTOMOBILE DATASET
TEAM MEMBERS:
Anusha Vadlamudi Narasimha Rao
50134597
Deepthi Chidura 50129270
Namitha Yellokonda 50126906
Shravya Beerakayala 50124534
Abstract
The goal of a data mining process is to
extract information from a dataset and
transform it into a format that can be used
for further analysis in the field concerned.
We have examined the Auto dataset, in
which the performance of various cars is
analyzed based on attributes such as
mpg, cylinders, displacement, horsepower,
weight, acceleration, year, and origin.
The analysis uses the Apriori data mining
algorithm.
INTRODUCTION
This Auto dataset contains the car model
name, mpg (miles per gallon), cylinders,
displacement, horsepower, weight,
acceleration, year, and origin.
Using the Apriori algorithm, support and
confidence allow us to make important
decisions about how these attributes work
as combined factors, and to find ways to
increase performance.
The minimum support and confidence values
are set according to the application; the
itemsets that satisfy these criteria are
kept, in order to finally determine which
attributes together satisfy the support and
confidence thresholds.
DATA OF AUTO:
ATTRIBUTE      DESCRIPTION
MPG            fuel consumption in miles per gallon
CYLINDERS      number of engine cylinders
DISPLACEMENT   engine displacement (cubic inches)
HORSEPOWER     engine horsepower
WEIGHT         vehicle weight (lbs)
ACCELERATION   time to accelerate from 0 to 60 mph (seconds)
YEAR           model year
ORIGIN         region of origin (1 = USA, 2 = Europe, 3 = Japan)
NAME           car model name
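For Apriori, each car record is treated as a transaction of items. One way such a transaction could be built from a cleaned CSV row is sketched below; this is not part of the report's code, and the attribute=value item encoding and the sample row are hypothetical:

HEADER = ["mpg", "cylinders", "displacement", "horsepower",
          "weight", "acceleration", "year", "origin", "name"]

# one hypothetical cleaned row from the Auto data
row = ["14", "8", "350", "165", "4209", "12", "72", "1", "chevrolet impala"]

# pair each attribute name with its value to form the items of one transaction
transaction = set(attr + "=" + value for attr, value in zip(HEADER, row))
print(transaction)   # e.g. {'mpg=14', 'cylinders=8', ..., 'origin=1', ...}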
APRIORI ALGORITHM
The Apriori algorithm is an influential
algorithm for mining frequent itemsets
for Boolean association rules that have
support and confidence greater than the
minimum support (min-sup) and minimum
confidence (min-conf), respectively.
The problem of discovering all association
rules can be broken down into two parts
as follows:
Find all sets of items that have support
greater than the minimum support. These
are called large itemsets.
Use the large itemsets to generate the
desired rules.
Two factors affect the significance of
association rules:
Support: the rule X → Y has support s in
the transaction set D if s% of the
transactions in D contain X ∪ Y.
Confidence: the rule X → Y holds in the
transaction set D with confidence c if c%
of the transactions in D that contain X
also contain Y.
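As a quick illustration of these two measures, the following sketch computes support and confidence for a toy rule over five made-up transactions (all items here are hypothetical):

# five made-up transactions in the style of the Auto data
D = [
    {"cylinders=8", "origin=1", "mpg=14"},
    {"cylinders=8", "origin=1"},
    {"cylinders=4", "origin=3"},
    {"cylinders=8", "origin=1", "mpg=13"},
    {"cylinders=4", "origin=2"},
]
X = {"cylinders=8"}
Y = {"origin=1"}

# support of X -> Y: fraction of transactions containing X ∪ Y
support = sum(1 for t in D if (X | Y) <= t) / len(D)

# confidence of X -> Y: of the transactions containing X, the fraction also containing Y
confidence = sum(1 for t in D if (X | Y) <= t) / sum(1 for t in D if X <= t)

print(support, confidence)   # 0.6 1.0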
PSEUDO CODE
L1 = {large 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do
begin
    Ck = apriori-gen(Lk-1);      // new candidates
    forall transactions t ∈ D do
    begin
        Ct = subset(Ck, t);      // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup}
end
Answer = ∪k Lk;
DATA CLEANING
Unclean data refers to data that contains
erroneous information. The term may also be
used for data that is in memory and not yet
loaded into a database. In this dataset,
some fields have missing values.
UNCLEAN DATA
CODE FOR DATA CLEANING
# read the raw Auto data (input file name assumed); missing horsepower values appear as "?"
autoData <- read.csv(file = "~/Documents/data/Auto.csv", header = TRUE)
horsepwr <- as.character(autoData$horsepower)
horsepwr <- ifelse(horsepwr == "?", 0, horsepwr)   # flag the missing entries
After data cleaning, these fields are
removed.
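An equivalent cleaning step could also be done directly in Python. The sketch below is an assumption, not the report's code: the input path and the "horsepower" header name are illustrative. It drops the records whose horsepower field is the placeholder "?":

import csv

def clean_auto_csv(in_path="Auto.csv", out_path="Auto_clean_data.csv"):
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader)
        writer.writerow(header)
        hp_col = header.index("horsepower")   # assumes a "horsepower" column header
        for row in reader:
            if row[hp_col] != "?":            # keep only records with a real horsepower value
                writer.writerow(row)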
PYTHON CODE FOR APRIORI ALGORITHM
import csv
def apriori_generation_algo(data, min_support=0.3, verbose=False):
    can_keys = create_candidate_keys(data)
    # materialize transactions as sets so they can be scanned more than once
    D_map = [set(transaction) for transaction in data]
    F1, supporting_data = back_prune(D_map, can_keys, min_support, verbose=False)
    F = [F1]
    key = 2
    while len(F[key - 2]) > 0:
        candidate_keys = apriori_generation(F[key - 2], key)
        F_key, support_K = back_prune(D_map, candidate_keys, min_support)
        supporting_data.update(support_K)
        F.append(F_key)
        key += 1
    if verbose:
        for kset in F:
            for item in kset:
                print("{"
                      + "".join(str(i) + ", " for i in item).rstrip(', ')
                      + "}"
                      + ": supp = " + str(round(supporting_data[item], 3)))
    return F, supporting_data
def create_candidate_keys(data, verbose=False):
    can_keys = []
    for transac in data:
        for item in transac:
            if [item] not in can_keys:
                can_keys.append([item])
    can_keys.sort()
    # frozensets are hashable, so they can be used as dictionary keys later on
    return [frozenset(key) for key in can_keys]
def back_prune(data, candidates, min_support, verbose=False):
    sscount = {}
    # count how many transactions contain each candidate itemset
    for tid in data:
        for candidate in candidates:
            if candidate.issubset(tid):
                sscount.setdefault(candidate, 0)
                sscount[candidate] += 1
    num_items = float(len(data))
    ret_list = []
    supporting_data = {}
    for key in sscount:
        support = sscount[key] / num_items
        if support >= min_support:
            ret_list.insert(0, key)
        supporting_data[key] = support
    if verbose:
        for kset in ret_list:
            for item in kset:
                print("{" + str(item) + "}")
        print("")
        for key in sscount:
            print("{"
                  + "".join([str(i) + ", " for i in key]).rstrip(', ')
                  + "}"
                  + ": supp = " + str(supporting_data[key]))
    return ret_list, supporting_data
def apriori_generation(frequency_sets, key):
    returnList = []
    lenLk = len(frequency_sets)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            a = sorted(frequency_sets[i])
            b = sorted(frequency_sets[j])
            F1 = a[:key - 2]
            F2 = b[:key - 2]
            if F1 == F2:  # join two (k-1)-itemsets whose first k-2 items agree
                returnList.append(frequency_sets[i] | frequency_sets[j])
    return returnList
def rules_from_conseq(frequency_set, H, supporting_data, rules, min_confidence=0.9,
                      verbose=False):
    m = len(H[0])
    if m == 1:
        Hmp1 = cal_conf(frequency_set, H, supporting_data, rules, min_confidence, verbose)
    if len(frequency_set) > (m + 1):
        Hmp1 = apriori_generation(H, m + 1)
        Hmp1 = cal_conf(frequency_set, Hmp1, supporting_data, rules, min_confidence,
                        verbose)
        if len(Hmp1) > 1:
            rules_from_conseq(frequency_set, Hmp1, supporting_data, rules, min_confidence,
                              verbose)
def cal_conf(frequency_set, H, supporting_data, rules, min_confidence=0.9, verbose=False):
    pruned_H = []
    for consequence in H:
        # conf(A --> c) = supp(A ∪ c) / supp(A), with A = frequency_set - consequence
        confidence = supporting_data[frequency_set] / supporting_data[frequency_set - consequence]
        if confidence >= min_confidence:
            rules.append((frequency_set - consequence, consequence, confidence))
            pruned_H.append(consequence)
            if verbose:
                print("{"
                      + "".join([str(i) + ", " for i in frequency_set - consequence]).rstrip(', ')
                      + "}"
                      + " --> "
                      + "{"
                      + "".join([str(i) + ", " for i in consequence]).rstrip(', ')
                      + "}"
                      + ": conf = " + str(round(confidence, 3))
                      + ", supp = " + str(round(supporting_data[frequency_set], 3)))
    return pruned_H
def gen_rules(F, supporting_data, min_confidence=0.9, verbose=True):
    rules = []
    # mine rules from every frequent itemset of size >= 2
    for i in range(1, len(F)):
        for frequency_set in F[i]:
            H1 = [frozenset([item]) for item in frequency_set]
            if i > 1:
                rules_from_conseq(frequency_set, H1, supporting_data, rules, min_confidence, verbose)
            else:
                cal_conf(frequency_set, H1, supporting_data, rules, min_confidence, verbose)
    return rules
def import_data():
    with open('C:/Users/Anusha/Desktop/Auto_clean_data.csv', "r") as fin:
        data = [row for row in csv.reader(fin.read().splitlines())]
    return data
data = import_data()
D_map = [set(row) for row in data]
can_keys = create_candidate_keys(data, verbose=True)
F1, supporting_data = back_prune(D_map, can_keys, 0.3, verbose=True)
F, supporting_data = apriori_generation_algo(data, min_support=0.05,
                                             verbose=True)
H = gen_rules(F, supporting_data, min_confidence=0.9, verbose=True)
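Each rule collected by cal_conf is an (antecedent, consequent, confidence) triple, so the returned list can be printed directly; a minimal sketch:

# print each mined rule with its confidence and support
for antecedent, consequent, confidence in H:
    print(set(antecedent), "-->", set(consequent),
          "conf =", round(confidence, 3),
          "supp =", round(supporting_data[antecedent | consequent], 3))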
OBSERVATION
If mpg equals 14, cylinders equals 8, and origin
equals 1, then confidence = 1.0 and support =
0.063.
If mpg equals 13, cylinders equals 8, and origin
equals 1, then confidence = 0.929 and support =
0.066.
If cylinders equals 8, year equals 73, and origin
equals 1, then confidence = 1.0 and support =
0.051.
If horsepower equals 150, cylinders equals 8, and origin
equals 1, then confidence = 1.0 and support =
0.056.
CONCLUSION
In our project, we studied the Apriori
algorithm and generated rules by considering
minimum support and confidence. The dataset
was cleaned using R programming, and the
algorithm was implemented in Python. The
Python code was run in the Java environment
and the results were obtained.