Hadoop代写:INFR11088ExtremeComputing


代写 Hadoop 的作业,需要使用
MapReduce
进行数据分析。

Tasks

Invented index with MapReduce

Task 1

Use the files in the folder /data/assignments/ex2/task1/large/ as input and
produce an inverted index using MapReduce. For instance, given the following
documents:
d1.txt: cat dog cat fox
d2.txt: cat bear cat cat fox
d3.txt: fox wolf dog
you should build the following full inverted index.
bear: 1: {(d2.txt,1)}
cat : 2: {(d1.txt, 2), (d2.txt, 3)}
dog : 2: {(d1.txt, 1), (d3.txt, 1)}
fox : 3: {(d1.txt, 1), (d2.txt, 1), (d3.txt, 1)}
wolf: 1: {(d3.txt,1)}
For each term (anything separated by spaces), there is a single record
consisting of a number and a list of what are termed postings; the colon
character (‘:’) is used to delimit the fields of each record. The first field
is a number that represents the number of documents that contain the term.
Then a list of postings follows where each posting is a pair consisting of the
document name and the frequency of the word in that specific document. Note
that terms are sorted alphabetically and also that the items inside lists are
also sorted alphabetically by document identifier. For example, the following
line:
cat : 2: {(d1.txt, 2), (d2.txt, 3)}
To get the full path to the input file in Hadoop streaming, read the
mapreduce_map_input_file environment variable. In Python, that’s
os.environ[“mapreduce_map_input_file”].

Parsing StackOverflow

For tasks 2, 3, and 4 you will use a dataset from StackOverflow
(stackLarge.txt) and extract specific pieces of information. Initially, you
should understand the format of the dataset, next you will need to do parse
each post, and finally you will need to implement your MapReduce workflows.
Use MapReduce for the tasks 2, 3, and 4.
The dataset contains a number of post records, one record per line. Each
record consists of comma-separated key-value pairs, which are then pointlessly
wrapped in an XML element.
Each record has its own identifier stored in a field named Id and a type,
indicated by the value of a field PostTypeId. If the value of PostTypeId is 1,
than the post refers to a question, otherwise is the value of PostTypeId is 2
the post refers to an answer.
You will need to parse the record into a structure that will allow to access
the value of each attribute by name. In this example, Id=”2155” represents the
unique identifier given to the post; PostTypeId=”1” means that this post is a
question; AcceptedAnswerId=”2928” means that the accepted answer from the user
for this query is the answer with Id=”2928”; and so on.
The attribute-value pair Id=”659891” represents the unique identifier given to
this post. The value for ParentId represents the identifier of the question
this answer applies to, and the value for OwnerUserId represents the user who
wrote the answer for this question.
You do not need to know exactly what each attribute means, but you will need
to be able to access the value of each attribute given a record. You will need
to write a parser for each record and then answer the following questions. For
each question we give the expected output format; lines beginning with ‘#’ are
only descriptive comments and you do not need to print them; you only need to
print the actual output values.

Task 2

Which are the 10 most popular questions according to their view counts
(attribute ViewCount in a question post)? Output Format:
#Count Id
17551 659891
2131 659892
1782 314159

The columns are count and question id. Ties may be broken arbitrarily. Sort in
decreasing order of count. Use a single space between count and id.

Task 3

Who was the user that answered the most questions and what were the Ids of
these questions? Output Format:
# OwnerUserId -> PostId, PostId, PostId, …
1342 -> 23, 26, 531
Use a single space in your actual output: “1342 -> 23, 26, 531”.

Task 4

Who was the user that had the most accepted answers and what were these
answers? Output Format:
# OwnerUserId -> Number of accepted answers, AnswerId, AnswerId, …
1 -> 2, 1432, 1643
Use a single space in your actual output: “1 -> 2, 1432, 1643”.

Sifting Web Data

Task 5

In lectures you saw how to sample uniformly a single line from a stream of
lines efficiently on a single machine. For large data, running reservoir
sampling on a single machine would take too long. Implement a MapReduce
version of reservoir sampling which uniformly samples only a single line and
uses MapReduce to do so. Use your implementation to sample a single line from
the file webLarge.txt. Your output should contain only a single line.

Task 6

Extend the basic version of reservoir sampling such that it can sample
multiple lines uniformly without replacement and run on a single machine (i.e.
no MapReduce). This means that if we want to sample k lines from a file with n
lines in total, then each of the (nk) possible outcomes has equal probability.
Implement a program that will sample 100 lines from the file webLarge.txt. Run
your program locally (not as a MapReduce job).

Task 7

Make your own implementation of a Bloom filter. We leave the choice of a
hashing function up to you. Write a program that uses your implementation of
Bloom filter to approximately de-duplicate the lines in the file webLarge.txt.
The output of an approximate de-duplication contains no duplicate lines, but
some lines from the input might not appear at all in the output. You should
think carefully about the number of hashing functions and the size of the
Bloom filter you use. The probability that a line (and its duplicates) from
the input does not appear at all in the output should be less than 1%. You can
assume that your hashing functions produce every value equally likely. When
choosing the size of your Bloom filter and number of hashing functions, you
should assume the worst case in which all lines are unique. The number of
lines should be a command-line parameter to your program.

Mining Query Logs

Task 8

Imagine you are Google and you want to know which search queries (if any) form
at least 1% of all queries in total. In the file queriesLarge.txt each line is
a hash of a query and queries occurred in the order as listed in the file.
Implement the lossy counting algorithm and run it on the file
queriesLarge.txt. Your output should contain all queries that form at least 1%
of all queries and no query that formed less than 0.9% of all queries.


文章作者: SafePoker
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 SafePoker !
  目录