DataMining代写:CS412TextMining


代写数据挖掘作业,根据算法对文本文件进行特征提取。作业详细给了每一步的步骤,比开放性的作业要容易很多。

Objective

  • Explore how frequent pattern mining can be applied to text mining to discover meaningful phrases.
  • In this assignment, you will first run LDA on a corpus made up of titles from 5 domains’ conference papers. Based on the results from LDA, a topic (representing a particular domain) is assigned to each word in each title. Then you write a frequent mining algorithm to mine frequent patterns from each topic to get meaningful phrases. The mined frequent patterns may not necessarily be meaningful phrases for the topic. So you will consider the question how to extract meaningful ones out of all the frequent patterns. The final goal is to output highly representative phrases for each topic.

Step 1: Get to Know the Data

We collect paper titles from conferences in computer science of 5 domains:
Data Mining(DM), Machine Learning(ML), Database(DB), Information Retrieval(IR)
and Theory(TH). You can download the raw data here paper_raw.txt. (Note that
we will not use this file directly for this assignment. But you can look at
this file to see what the original titles look like.) Each line contains two
columns, PaperID and Title of a paper, separated by Tab(‘\t’). Recall the
example in class. For each line in the file you can consider it as one
transaction. Each word in the title is equivalent to an item in a transaction.
Note that PaperID is unique in the whole data set. However, since this data is
a subset of all the paper titles from a large corpus, PaperIDs are not
starting from 0 and not consecutive. The raw data looks like this:

PaperID Title
7600 The Automatic Acquisition of Proof Methods
85825 Frequent pattern discovery with memory constraint
For this assignment, we pre-processed the raw data by removing stop words,
converting the words to lower cases and lemmatization. You can download the
processed data here paper.txt. This is the actual dataset we will use for this
assignment. In this file, each line is a PaperID followed by a list of
terms.The format is PaperID Tab(‘\t’) term1 Space(‘ ‘) term2 Space(‘ ‘) …
paper.txt looks like this:
PaperID [term1] [term2] [term3] [term4] [term5] …… [termN]
7600 automatic acquisition proof method
85825 frequent pattern discovery memory constraint

Step 2: Preprocessing

This step prepares input for LDA. You will generate two files based on
paper.txt.

Generate a Dictionary

First, you need to generate a vocabulary from paper.txt. Please name your
vocabulary file as vocab.txt. Each line in this file is a unique term
extracted from paper.txt; each term should appear exactly once. Here is an
example of the first five lines in vocab.txt. Note that the sequence of terms
may be different.
automatic
acquisition
proof
method
philosophical

Tokenize Plain Text by Dictionary

In this step, we represent each title as a sparse vector of word counts. We
ask you to transform each title (i.e., each line in paper.txt) into the
following format.

Note: if you use any other lda package rather than the one mentioned in next
step, please make sure the format in title.txt match what your lda package
requires.

Step 3: Partitioning

Recall that we collect paper titles from conferences of 5 domains in computer
science. If we run frequent pattern mining algorithms directly on paper
titles, the patterns would be independent of topics. Thus, if we want to mine
frequent patterns in each domain, titles and terms should be partitioned into
5 domains. Note that the domain knowledge is not known to you. Instead, we
apply topic modeling to uncover the hidden topics behind titles and terms
automatically. Specifically, we apply LDA (you are not required to understand
how exactly topic modeling works) to assign a topic to each term.

Assign a Topic to Each Term

  1. Download the LDA package here. Unzip it and you could see a list of source code. You can refer to this page for a comprehensive introduction to the package.
  2. Open a terminal and go to the directory of the source code. Type make. Then an executable lda is generated.
  3. In lda-c-dist directory, there is a settings.txt file. You could use the following settings, but feel free to tune the parameters if you know how inference works for LDA.
  4. Run LDA with the following command.
  5. Check your output.

Re-organize the Terms by Topic

You are asked to re-organize the terms by 5 topics. For the i-th topic, you
should create a file named topic-i.txt. Separate each line in word-
assignment.dat by topics assigned to them. For example, the lines in word-
assignment.dat can be considered as the following form ( Note that here we
replace the integers with the real terms for better illustration):
004 automatic:02 acquisition:02 proof:02 method:02
005 classification:03 noun:02 phrase:03 concept:01 individual:03

Step 4: Mining Frequent Patterns for Each Topic

In this step, you need to implement a frequent pattern mining algorithm. You
can choose whatever frequent pattern mining algorithms you like, such as
Apriori, FP-Growth, ECLAT, etc.
(Hint: you need to figure out min_sup by yourself.)
Question to ponder A: How you choose min_sup for this task? Explain how you
choose the min_sup in your report. Any reasonable choice will be fine.

Step 5: Mining Maximal/Closed Patterns

In this step, you need to implement an algorithm to mine maximal patterns and
closed patterns. You could write the code based on the output of Step 4, or
implement a specific algorithm to mine maximal/closed patterns, such as
CLOSET, MaxMiner, etc.
The output should be of the same form as the output of Step 4. Max patterns
are put into max directory. The i-th file is named as max-i.txt.Closed
patterns are put into closed directory. The i-th file is named as
closed-i.txt.
Question to ponder B: Can you figure out which topic corresponds to which
domain based on patterns you mine? Write your observations in the report.
Question to ponder C: Compare the result of frequent patterns, maximal
patterns and closed patterns, is the result satisfying? Write down your
analysis.

Step 6: Re-rank by Purity of Patterns

In this paper, purity is introduced as one of the ranking measures for
phrases. A phrase is pure in topic t if it is only frequent in documents (
here documents refer to titles) about topic t and not frequent in documents
about other topics. For example, ‘query processing’ is a more pure phrase than
‘query’ in the Database topic. We measure the purity of a pattern by comparing
the probability of seeing a phrase in the topic-t collection D(t) and the
probability of seeing it in any other topic-t’ collection.

Step7: Bonus

Could you come up with other filtering/ranking criteria to improve the quality
of your mined phrase lists? Implement your algorithm and put your analysis in
your report.
(Hint: Some related papers describe strategies to deal with this problem by
balancing max patterns and closed patterns.

Step8: Report

Now you are ready to write your report.
You should include the following sections in your report:

  1. Briefly explain what algorithms you use in Step4~Step6.
  2. Answer all the questions in Question to ponder.
  3. List your source file names and their corresponding Steps.

文章作者: SafePoker
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 SafePoker !
  目录