Hadoop Assignment Help: CSE3BDC Big Data Tools Task 2

Published: 2016-08-19

Introduction

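Following on from the previous big data assignment, Task 2 performs big data analysis on a time series data set, and requires processing it with Spark, Hive, and Hadoop respectively.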

Task 2: Analysing Twitter Time Series Data

In this task we will be doing some analytics on real Twitter data. The data is
a section of the data obtained from the infochimps.org web site.
The data set used for this task can be found inside the twitter directory of
the assignment_datafiles.zip on LMS. Note that the data file is tab (\t) delimited.
The data has the following attributes:

Attribute number	Attribute name	Description
1	Token type	In our data set all rows have a Token type of hashtag, so this attribute is useless for this assignment.
2	Month	The year and month in the form YYYYMM: 4 digits for the year followed by 2 digits for the month. For example, 200905 means May 2009.
3	count	An integer representing the number of tweets of this hash tag for the given year and month.
4	Hash Tag Name	The #tag name, e.g. babylove, mydate, etc.
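As a rough illustration of the record layout described above, the following plain-Python sketch splits one tab-delimited line into its four attributes and breaks the YYYYMM value into year and month. The names `TweetCount` and `parse_line` are hypothetical, and the input line is made up from the example values in the attribute descriptions; it is not taken from the real data file.

```python
from typing import NamedTuple


class TweetCount(NamedTuple):
    """One row of the Twitter data set (hypothetical helper type)."""
    token_type: str
    year: int
    month: int
    count: int
    hashtag: str


def parse_line(line: str) -> TweetCount:
    # The file is tab-delimited with exactly four attributes per row.
    token_type, yyyymm, count, hashtag = line.rstrip("\n").split("\t")
    # YYYYMM: first four digits are the year, last two the month.
    return TweetCount(token_type, int(yyyymm[:4]), int(yyyymm[4:6]),
                      int(count), hashtag)


# Made-up example line, for illustration only.
row = parse_line("hashtag\t200905\t23\tbabylove\n")
print(row.year, row.month, row.count, row.hashtag)  # 2009 5 23 babylove
```

Keeping the month as the original six-character string would work equally well for the subtasks, since YYYYMM strings compare in chronological order.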
Here is a small example of the Twitter data that we will use to illustrate the
subtasks below:
Token type	Month	count
—	—	—
hashtag	200910	2
hashtag	200911	2
hashtag	200912	90
hashtag	200812	100
hashtag	200901	201
hashtag	200910	1
hashtag	200912	500
hashtag	200905	23
hashtag	200907	1000
Using the Twitter data set downloaded from LMS, please perform the following.
a) [Spark] Find the single row that has the highest count and for that row
report the month, count and hashtag name. So for the above small example data
set the result would be:

Month: 200907, count: 1000, hash tag name: abc
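The logic of subtask (a) can be sketched in plain Python as a single reduce that keeps the row with the larger count, which is roughly what a reduce over a Spark RDD would do. This is only an illustration of the idea, not Spark code and not the required solution. The hash tag column is not shown in the small example table, so the tag names below are assumptions: `abc` and `mycoolwife` are placed to agree with the stated expected outputs, while `tag_a` and `tag_b` are pure placeholders.

```python
from functools import reduce

# Sample rows with an ASSUMED fourth column (see the note above).
SAMPLE = [
    "hashtag\t200910\t2\ttag_a",
    "hashtag\t200911\t2\ttag_a",
    "hashtag\t200912\t90\ttag_a",
    "hashtag\t200812\t100\ttag_b",
    "hashtag\t200901\t201\ttag_b",
    "hashtag\t200910\t1\tmycoolwife",
    "hashtag\t200912\t500\tmycoolwife",
    "hashtag\t200905\t23\tabc",
    "hashtag\t200907\t1000\tabc",
]


def parse(line):
    # Drop the useless token type; keep (month, count, hashtag).
    _, month, count, hashtag = line.split("\t")
    return (month, int(count), hashtag)


def higher(a, b):
    # Keep whichever of two rows has the larger count.
    return a if a[1] >= b[1] else b


best = reduce(higher, map(parse, SAMPLE))
print(f"Month: {best[0]}, count: {best[1]}, hash tag name: {best[2]}")
# Month: 200907, count: 1000, hash tag name: abc
```

Because `higher` is associative, the same function could be applied to partial results computed on separate partitions, which is the property a distributed reduce relies on.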

b) [Do twice, once using Hive and once using Spark] Find the hash tag name
that was tweeted the most in the entire data set. Report the total number of
tweets for that hash tag name. So for the above small example data set the
output would be:
abc 1023
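Subtask (b) is a group-and-sum followed by taking the maximum, which in Hive would be a GROUP BY with SUM and in Spark a per-key reduction. The plain-Python sketch below shows only that logic. As before, the hash tag names in the rows are assumptions chosen to be consistent with the expected output, since the example table does not show that column; `tag_a` and `tag_b` are placeholders.

```python
from collections import defaultdict

# Sample rows with an ASSUMED fourth column (not shown in the example table).
SAMPLE = [
    "hashtag\t200910\t2\ttag_a",
    "hashtag\t200911\t2\ttag_a",
    "hashtag\t200912\t90\ttag_a",
    "hashtag\t200812\t100\ttag_b",
    "hashtag\t200901\t201\ttag_b",
    "hashtag\t200910\t1\tmycoolwife",
    "hashtag\t200912\t500\tmycoolwife",
    "hashtag\t200905\t23\tabc",
    "hashtag\t200907\t1000\tabc",
]

# Sum the count column per hash tag name.
totals = defaultdict(int)
for line in SAMPLE:
    _, _month, count, hashtag = line.split("\t")
    totals[hashtag] += int(count)

# Pick the hash tag with the largest total.
top_tag, top_total = max(totals.items(), key=lambda kv: kv[1])
print(top_tag, top_total)  # abc 1023
```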
c) [MapReduce] Given two months x and y, where y > x, find the hashtag name
whose number of tweets increased the most from month x to month y.
Ignore the tweets in the months between x and y, so just compare the number of
tweets at month x and at month y. Report the hashtag name and the number of
tweets in months x and y. Ignore any hashtag names that had no tweets in
either month x or y. You can assume that the combination of hashtag and month
is unique, so the same hashtag and month combination cannot occur more
than once. Take x and y as command line arguments, as was done in Task D of Lab
3. For the above small example data set:
Input: x = 200910, y = 200912
Output (hashtag, month x count, month y count):
mycoolwife, 1, 500
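One way to think about subtask (c) is as a map step that keeps only rows for month x or y and emits them keyed by hashtag, and a reduce step that, per hashtag, reports the two counts when both months are present. The plain-Python sketch below simulates that flow in memory; it is not Hadoop code, and `mapper`, `reducer` and `biggest_increase` are hypothetical names. It assumes a hashtag is skipped when either month is missing, which is one reading of the rule above. The hash tag names in the rows are again assumptions, since the example table omits that column, and x and y are passed as function arguments here rather than read from the command line.

```python
from collections import defaultdict

# Sample rows with an ASSUMED fourth column (not shown in the example table).
SAMPLE = [
    "hashtag\t200910\t2\ttag_a",
    "hashtag\t200911\t2\ttag_a",
    "hashtag\t200912\t90\ttag_a",
    "hashtag\t200812\t100\ttag_b",
    "hashtag\t200901\t201\ttag_b",
    "hashtag\t200910\t1\tmycoolwife",
    "hashtag\t200912\t500\tmycoolwife",
    "hashtag\t200905\t23\tabc",
    "hashtag\t200907\t1000\tabc",
]


def mapper(line, x, y):
    # Emit (hashtag, (month, count)) only for the two months of interest.
    _, month, count, hashtag = line.split("\t")
    if month in (x, y):
        yield hashtag, (month, int(count))


def reducer(hashtag, values, x, y):
    # (hashtag, month) is unique, so a dict per hashtag is safe.
    counts = dict(values)
    if x in counts and y in counts:
        yield hashtag, counts[x], counts[y]


def biggest_increase(lines, x, y):
    # Simulate the shuffle: group mapper output by key.
    grouped = defaultdict(list)
    for line in lines:
        for hashtag, value in mapper(line, x, y):
            grouped[hashtag].append(value)
    candidates = [r for tag, vals in grouped.items()
                  for r in reducer(tag, vals, x, y)]
    # Largest (count at y) minus (count at x); None if nothing qualifies.
    return max(candidates, key=lambda r: r[2] - r[1], default=None)


print(biggest_increase(SAMPLE, "200910", "200912"))
# ('mycoolwife', 1, 500)
```

In a real MapReduce job the final maximum would need its own step, for example a single reducer or a second job, because each reducer call sees only one hashtag.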

Appendix A: Working with Linux

As 3rd/4th year students of a computer science subject, we assume that you
know how to use Linux. If you have not used Linux before, this is a good
opportunity to gain hands-on experience. After startup, the VM will
automatically log you in and load a graphical environment. However, you
may sometimes find it useful to work with the command shell (bash, also
called the terminal).
It is possible to complete the entire assignment using just the tools
currently installed in the VM, without installing anything else. However, your
user “cloudera” (password “cloudera”) has all the rights needed for
downloading and installing additional application packages.

Appendix B: Transferring files between your host system and the Cloudera VM

The following information is particularly useful if drag and drop is not
working correctly.
VirtualBox: Create a folder on your host system and mark it as shared
in the settings of your VirtualBox application.
You can then mount it to the folder ~/Public of your Linux guest operating system
using the following command, where Temp is the name of the shared folder:
sudo mount -t vboxsf Temp ~/Public

