Introduction
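A big data assignment: run a dataset through Hadoop. It starts with a few basic MapReduce problems (Hive can be used instead), followed by computing TF-IDF. You have to download the dataset yourself and set up the Hadoop platform yourself.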
Requirement
Tasks:
- Using MapReduce, carry out the following tasks:
  - Acquire the top 250,000 posts by ViewCount (see Notes)
  - Using Pig or MapReduce, extract, transform and load the data as applicable
  - Using MapReduce, calculate the per-user TF-IDF (just submit the top 10 terms for each user)
  - Bonus: use Elastic MapReduce to execute one or more of these tasks (if so, provide logs / screenshots)
- Using Hive and/or MapReduce, get:
  - The top 10 posts by score
  - The top 10 users by post score
  - The number of distinct users who used the word ‘java’ in one of their posts
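To make the three Hive/MapReduce queries concrete, here is a minimal in-memory sketch in plain Python rather than Hadoop or Hive. The sample records and the field names (Id, OwnerUserId, Score, Body) are assumptions for illustration, and "top users by post score" is read here as the sum of each user's post scores, which is only one possible interpretation of the task.

```python
import heapq
import re
from collections import defaultdict

# Hypothetical sample records; the field names are assumed, not taken from the assignment.
POSTS = [
    {"Id": 1, "OwnerUserId": 10, "Score": 50, "Body": "How do I sort a list in Java?"},
    {"Id": 2, "OwnerUserId": 11, "Score": 70, "Body": "Python list comprehension question"},
    {"Id": 3, "OwnerUserId": 10, "Score": 30, "Body": "JavaScript closures explained"},
    {"Id": 4, "OwnerUserId": 12, "Score": 90, "Body": "Why is java so verbose"},
]

# Whole-word tokens, so "javascript" does not count as "java".
WORD = re.compile(r"[a-z0-9]+")


def top_posts_by_score(posts, n=10):
    """Return the n posts with the highest Score."""
    return heapq.nlargest(n, posts, key=lambda p: p["Score"])


def top_users_by_score(posts, n=10):
    """Map each post to (user, score), reduce by summing per user, keep the top n."""
    totals = defaultdict(int)
    for p in posts:
        totals[p["OwnerUserId"]] += p["Score"]
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])


def distinct_users_using(posts, word):
    """Count distinct users with at least one post containing `word` as a token."""
    users = {
        p["OwnerUserId"]
        for p in posts
        if word in WORD.findall(p["Body"].lower())
    }
    return len(users)
```

In a real MapReduce job the per-user sum would be the reducer and the top-n selection would typically run as a second, single-reducer pass; the sketch collapses both into one function.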
Notes
TF-IDF
The TF-IDF algorithm is used to calculate the relative frequency of a word in
a document, as compared to the overall frequency of that word in a collection
of documents. This allows you to discover the distinctive words for a
particular user or document.
The formula is:
TF(t) = (Number of times t appears in the document) / (Number of words in the document)
IDF(t) = log_e(Total number of documents / Number of documents containing t)
The TF-IDF(t) score of the term t is the product of those two.
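The formula can be sketched for the per-user case as follows. This is a plain-Python illustration, not a Hadoop job, and it assumes that all posts by one user are concatenated into a single "document". The tokenizer and the names `tokenize` and `per_user_tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (a simplification)."""
    return TOKEN.findall(text.lower())


def per_user_tfidf(posts_by_user, top_n=10):
    """Return {user: [(term, score), ...]} with the top_n TF-IDF terms per user."""
    # Stage 1: term counts per user (one "document" per user, by assumption).
    counts = {
        user: Counter(t for body in bodies for t in tokenize(body))
        for user, bodies in posts_by_user.items()
    }
    # Stage 2: document frequency, i.e. the number of users whose posts contain the term.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    n_docs = len(counts)

    # Stage 3: TF * IDF per (user, term), then keep the highest-scoring terms.
    result = {}
    for user, c in counts.items():
        total = sum(c.values())
        if total == 0:
            result[user] = []
            continue
        scores = {t: (k / total) * math.log(n_docs / df[t]) for t, k in c.items()}
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[user] = ranked[:top_n]
    return result
```

The three stages correspond roughly to the separate jobs a MapReduce implementation would chain, with each stage's output written out before the next one starts. Note that a term used by every user gets an IDF of log(1) = 0 and therefore a score of 0.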
Downloading from Stack Overflow
- You can only download 50000 rows in one query. Here is a query to get the most popular posts:
  select top 50000 * from posts where posts.ViewCount > 1000000 ORDER BY posts.ViewCount
- To count the number of records in a range:
  select count(*) from posts where posts.ViewCount > 15000 and posts.ViewCount < 20000
- To retrieve records from a particular range:
  select * from posts where posts.ViewCount > 15000 and posts.ViewCount < 20000
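Because each query is capped at 50000 rows, collecting 250,000 posts means splitting the ViewCount axis into several ranges. The helper below is a hypothetical sketch that only builds the query strings. The boundary values are made up, and each range should still be checked with the count query to confirm it stays under the cap. Unlike the queries above, it uses `>=` on the lower bound so that posts sitting exactly on a boundary are not dropped between adjacent ranges.

```python
def range_queries(boundaries):
    """Build one SELECT per consecutive pair of ViewCount boundaries.

    `boundaries` is an ascending list such as [15000, 20000, 30000];
    each query covers the half-open interval [lo, hi).
    """
    queries = []
    for lo, hi in zip(boundaries, boundaries[1:]):
        queries.append(
            "select * from posts "
            f"where posts.ViewCount >= {lo} and posts.ViewCount < {hi}"
        )
    return queries


# Example with made-up boundaries; prints one query per range.
for q in range_queries([15000, 20000, 30000]):
    print(q)
```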
Summary
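Computing TF-IDF with Hadoop is fairly time-consuming: a lot of intermediate data has to be written to disk, and the problem cannot be solved with a single Hadoop job. Switching to Spark should be considerably more efficient.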