DataMining代写:CS626AprioriAlgorithm


代写数据挖掘里面的经典推荐算法Apriori,从给定的数据里面,根据规则挖掘特征。

Objective

The purpose of the project is to implement the apriori algorithm for
generating all association rules whose support is greater than a user-supplied
minimum support and whose confidence is greater than a user supplied minimum
confidence.
Your program should take as command line option six parameters: (i) the
minimum support, (ii) the minimum confidence, (iii) the name of the input
file, (iv) the name of the output file, (v) the range of the hash function
(i.e., the branching factor of the hash tree), and (vi) the maximum leaf size
of the hash table. The specific parameter sequence along with the name of the
executable are as follows:
hcrminer minsup minconf inputfile outputfile hfrange maxleafsize
NOTE: For those using Java, the command line should look like the following:
java hcrminer minsup minconf inputfile outputfile hfrange maxleafsize
Your program must be written in either C, C++, or Java.
You can assume that you can store the database and the frequent itemsets in
memory.
For hash function, you can just simply use a “ID % hfrange” (i.e., a mod
function).

Input file format

The input file consists of a set of lines, each line containing two numbers.
The first number is the transaction ID and the second number is the item ID.
The lines in the file are ordered in increasing transaction ID order. Note
that the set of items that make up the transaction will be derived by
combining the item IDs of all the lines that correspond to the same
transaction ID.
Two input files are provided. The first is a small one that you can use during
initial code development. The second is larger and will be the one on which
you need to report performance results.

Output file format

The output file will contain as many lines as the number of high-confidence
frequent rules that you found. The format of each line will be as follows:
LHS|RHS|SUPPORT|CONFIDENCE
Both LHS and RHS will contain the items that make up the left- and right-hand
side of the rule in a space delimited fashion.
Due to the large number of the generated association rules, for support values
of 20 and 15 do not generate the rules and just output the frequent itemsets.
but you need to generate them for the other support values. The format of the
output file when no association rules are generated will be of the form:
LHS|{}|SUPPORT|-1
where LHS is the frequent item set.

Report

Along with your code, you need to submit a report that contains the following:
Run your code for each combination of the following sets of values for the
minimum support, the minimum confidence, and option:
minsup: {15, 20, 30, 50, 100, 500, 1000} [Note that this is the support count (act ual frequency)]
minconf: {0.8, 0.9, 0.95}
hfrange: {5, 10, 20}
maxleafsize: {10, 50, 100}
For each combination of minconf, hfrange, and maxleafsize, generate two bar-
charts. The first will show the amount of time required to find the frequent
itemsets and to find the rules (a set of bars for each of these combinations)
for the different values of minimum support. The second will show the number
of frequent itemsets that were found and the number of high confidence rules
that were generated (a set of bars for each of these combinations) for the
different values of minimum support. Make sure that each bar is also annotated
with the quantity that is plotting (at the top of the bar).
Here is a table of the approximate amount of time taken for a couple of minsup
values for just the frequent itemset generation (the time corresponds to C
implementation, the hfrange is set to 20 and maxleafsize is set to 100, no
optimization flags are used):
minsup time (secs)
500 29.56
30 3756.69
A discussion as to which values of the hfrange and maxleafsize parameters
resulted in faster execution time and why.


文章作者: SafePoker
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 SafePoker !
  目录