Introduction
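Task 3 is the last part of the earlier big data assignment, and this time all of the data processing is done with Spark.
One awkward thing about Spark is that, although programming in Scala is more productive than programming in Java, Scala is a harder language to learn than Java.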
Task 3: Creating inverted index from Bag of Words data
In this task you are asked to create an inverted index of words to documents
that contain the words. Using this inverted index you can search for all the
documents that contain a particular word easily. The data has been stored in a
very compact form. There are two files. The first file is called docword.txt,
which contains the contents of all the documents stored in the following
format:
Attribute number | Attribute name | Description |
---|---|---|
1 | Doc id | The id of the document that contains the word. |
2 | Vocab Index | Instead of storing the word itself, we store the index into the vocabulary file. The index starts from 1. |
3 | count | An integer representing the number of times this word occurred in this document. |

The second file, called vocab.txt, contains each word in the vocabulary, which
is indexed by attribute 2 of the docword.txt file.
The data set used for this task can be found inside the Bag of words directory
of the assignment_datafiles.zip on LMS.
If you want to test your solution with more data sets you can download other
data sets of the same format from the following source.
Here is a small example of the contents of the docword.txt file.

Doc id | Vocab Index | Count |
---|---|---|
1 | 3 | 1200 |
1 | 2 | 120 |
1 | 1 | 1000 |
2 | 3 | 702 |
2 | 5 | 200 |
2 | 2 | 500 |
3 | 1 | 100 |
3 | 3 | 600 |
3 | 4 | 122 |
3 | 5 | 2000 |

Here is an example of the vocab.txt file:
Plane
Car
Motorbike
Truck
Boat
Using the input files docword.txt and vocab.txt downloaded from LMS, complete
the following subtasks using Spark:
a) [spark] Output into a text file called “task3a.txt” a list of the total
count of each word across all documents. List the words in ascending
alphabetical order. So for the above small example input the output would be
the following (the format of the text file can be different from below but the
data content needs to be the same):
Boat 2200
Car 620
Motorbike 2502
Plane 1100
Truck 122
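The data flow for subtask a) can be checked on the small example before writing the Spark job. The following is a minimal sketch in plain Python (not Spark), assuming the two files have already been parsed into the in-memory lists shown; the names `docword`, `vocab` and `total_counts` are illustrative only. In Spark the same steps would roughly correspond to joining on the vocab index, summing per word (for example with `reduceByKey`) and sorting by key.

```python
from collections import defaultdict

# Example data copied from the tables above: (doc_id, vocab_index, count).
docword = [
    (1, 3, 1200), (1, 2, 120), (1, 1, 1000),
    (2, 3, 702), (2, 5, 200), (2, 2, 500),
    (3, 1, 100), (3, 3, 600), (3, 4, 122), (3, 5, 2000),
]
# Lines of vocab.txt in file order; the first line has vocab index 1.
vocab = ["Plane", "Car", "Motorbike", "Truck", "Boat"]


def total_counts(docword, vocab):
    """Return (word, total count) pairs sorted alphabetically by word."""
    totals = defaultdict(int)
    for _doc_id, vocab_index, count in docword:
        # Vocab indices are 1-based, Python lists are 0-based.
        totals[vocab[vocab_index - 1]] += count
    return sorted(totals.items())


task3a = total_counts(docword, vocab)
for word, total in task3a:
    print(word, total)
```

Resolving the index to a word before summing keeps the final sort on the word itself, which is what the required alphabetical order refers to.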
b) [spark] Create an inverted index of the words in the docword.txt file and
store the entire inverted index in binary format under the name InvertedIndex.
Also store the output in text file format under the name task3b. The inverted
index contains one line per word. Each line stores the word followed by a list
of (Doc id, counts) pairs (one pair per document). So the output format is the
following:
word, (Doc id, count), (Doc id, count), …
- Note that the list of (Doc id, count) pairs must be in decreasing order by count.
- Note that the words must be in ascending alphabetical order.
So for the above example input the output text file(s) would contain (the
actual format can be a bit different but it should contain the same content):
Boat (3, 2000), (2, 200)
Car (2, 500), (1, 120)
Motorbike (1, 1200), (2, 702), (3, 600)
Plane (1, 1000), (3, 100)
Truck (3, 122)
For example, the following format for the text file would also be acceptable:
(Boat,ArrayBuffer((3,2000), (2,200)))
(Car,ArrayBuffer((2,500), (1,120)))
(Motorbike,ArrayBuffer((1,1200), (2,702), (3,600)))
(Plane,ArrayBuffer((1,1000), (3,100)))
(Truck,ArrayBuffer((3,122)))
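The two ordering requirements of subtask b) are easy to mix up, so here is a minimal sketch of the index construction in plain Python (not Spark), again assuming the example data is already parsed into lists. The function name `build_inverted_index` is hypothetical. In Spark the grouping step would roughly correspond to `groupByKey` on the word, followed by a sort inside each group and a sort of the keys.

```python
from collections import defaultdict

# Example data copied from the tables above: (doc_id, vocab_index, count).
docword = [
    (1, 3, 1200), (1, 2, 120), (1, 1, 1000),
    (2, 3, 702), (2, 5, 200), (2, 2, 500),
    (3, 1, 100), (3, 3, 600), (3, 4, 122), (3, 5, 2000),
]
# Lines of vocab.txt in file order; the first line has vocab index 1.
vocab = ["Plane", "Car", "Motorbike", "Truck", "Boat"]


def build_inverted_index(docword, vocab):
    """Return a list of (word, [(doc_id, count), ...]) entries.

    Words are in ascending alphabetical order; each posting list is in
    decreasing order by count.
    """
    postings = defaultdict(list)
    for doc_id, vocab_index, count in docword:
        # Vocab indices are 1-based, Python lists are 0-based.
        postings[vocab[vocab_index - 1]].append((doc_id, count))
    return [
        (word, sorted(pairs, key=lambda pair: pair[1], reverse=True))
        for word, pairs in sorted(postings.items(), key=lambda kv: kv[0])
    ]


inverted_index = build_inverted_index(docword, vocab)
for word, pairs in inverted_index:
    print(word, ", ".join(f"({doc_id}, {count})" for doc_id, count in pairs))
```

The inner sort orders each posting list by count, while the outer sort orders the entries by word; they are independent of each other.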
c) [spark] Load the previously created inverted index stored in binary format
from subtask b) and cache it in RAM. Search for a particular word in the
inverted index and return the list of (Doc id, count) pairs for that word. The
word can be hard coded inside the spark script. In the execution test we will
modify that word to search for some other word. For example if we are
searching for “Car” the output for the example dataset would be:
Car (2, 500), (1, 120)
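The save, load, cache and search cycle of subtask c) can be sketched as follows. This is plain Python using `pickle` as a stand-in for Spark's binary output, so it only illustrates the flow; the helper names and the temporary file location are assumptions. In the Scala API the analogous calls would likely be `saveAsObjectFile` to write, `objectFile` to read back, and `cache()` to keep the loaded RDD in memory.

```python
import os
import pickle
import tempfile

# Example inverted index, as produced for the small example in subtask b).
inverted_index = [
    ("Boat", [(3, 2000), (2, 200)]),
    ("Car", [(2, 500), (1, 120)]),
    ("Motorbike", [(1, 1200), (2, 702), (3, 600)]),
    ("Plane", [(1, 1000), (3, 100)]),
    ("Truck", [(3, 122)]),
]

# Hard-coded search word; this is the value that would be changed in testing.
SEARCH_WORD = "Car"


def save_index(index, path):
    """Write the index to disk in a binary format."""
    with open(path, "wb") as handle:
        pickle.dump(index, handle)


def load_index(path):
    """Load the binary index and keep it in memory as a dict for lookups."""
    with open(path, "rb") as handle:
        return dict(pickle.load(handle))


def search(index, word):
    """Return the (doc_id, count) pairs for a word, or [] if it is absent."""
    return index.get(word, [])


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "InvertedIndex")
    save_index(inverted_index, path)
    cached_index = load_index(path)

result = search(cached_index, SEARCH_WORD)
print(SEARCH_WORD, ", ".join(f"({doc_id}, {count})" for doc_id, count in result))
```

Returning an empty list for an unknown word is a design choice of this sketch; the assignment text does not say what should happen in that case.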
Bonus Marks
- Using Spark, perform the following task using the data set of task 2.
[spark] Across all hash tag names, find the one whose number of tweets
increased the most between any two consecutive months. Consecutive months
means, for example, 200801 to 200802, or 200902 to 200903, etc. Report
the hash tag name, the 1st month count and the 2nd month count.
- Propose any other data processing task using real data you obtained from somewhere. You need to do the data processing using either MapReduce or Pig or Spark. You need to give me a proposal of this task by the end of week 9. I will tell you how many marks it is worth. You can get a maximum of 10 bonus marks for this task. Include the proposal as a PDF, TXT or Word document in the assignment submission.
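For the first bonus task, the core logic is pairing each month with the one that follows it. The sketch below is plain Python (not Spark) and rests on several assumptions that the task text does not confirm: the task 2 data has already been aggregated into (hash tag, month as YYYYMM integer, tweet count) records, a month with no record is skipped rather than treated as zero, and the sample records are made up purely for illustration.

```python
def next_month(yyyymm):
    """Return the month after yyyymm, rolling over from December to January."""
    year, month = divmod(yyyymm, 100)
    return (year + 1) * 100 + 1 if month == 12 else yyyymm + 1


def largest_increase(records):
    """Return (hash tag, 1st month count, 2nd month count) for the largest
    month-over-month increase, or None if no consecutive pair exists."""
    counts = {(tag, month): count for tag, month, count in records}
    best = None
    for (tag, month), first in counts.items():
        second = counts.get((tag, next_month(month)))
        if second is None:
            continue  # Assumption: a missing month does not form a pair.
        increase = second - first
        if best is None or increase > best[0]:
            best = (increase, tag, first, second)
    if best is None:
        return None
    _increase, tag, first, second = best
    return tag, first, second


# Hypothetical records, invented for this sketch only.
records = [
    ("spark", 200811, 10), ("spark", 200812, 40), ("spark", 200901, 100),
    ("scala", 200801, 5), ("scala", 200802, 50), ("scala", 200804, 500),
]
print(largest_increase(records))
```

Note that 200812 to 200901 counts as consecutive, while 200802 to 200804 does not, which is why plain integer subtraction on the month value is not enough.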