MapReduce assignment: programming with the MapReduce framework to process CSV file data.
Aim
This assignment assesses your understanding of the MapReduce framework and
your ability to write a distributed program using it.
Description
This assignment consists of two parts: a theoretical part (assessed by a quiz)
and a practical part.
Part 1: Quiz
A closed-book Mylo quiz will be held during Lecture 8. It is worth 3%. The
multiple-choice questions will be drawn from the slides for Lectures 6 and 7.
Part 2: Practical Part
Here, you need to implement MapReduce code for Hadoop that analyses the given
weather data. This part of the assignment consists of two sub-tasks: a Basic
level and an Advanced level.
Data
Input data will be several .csv files for different years. Each file contains
several rows giving information about weather conditions at different weather
stations on different days of the year. The data is from
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/. There are at least two
measurements each day, one for the maximum temperature (TMAX) and one for the
minimum temperature (TMIN), and sometimes one for the precipitation (PRCP).
Each row contains the following relevant information:
- The weather station id
- the date in format yyyymmdd
- the type of measurement (for this homework we care about the maximum temperature TMAX and the minimum temperature TMIN)
- the temperature in tenths of a degree Celsius (e.g. -90 = -9.0 deg. C., -184 = -18.4 deg. C.)
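The field list above can be turned into a small parser. The following is a minimal sketch in plain Python, not the Hadoop API. It assumes the four fields are the first four comma-separated columns in the order listed. The names `Observation`, `parse_row` and `degrees_celsius` are hypothetical, and the sample row is illustrative.

```python
from typing import NamedTuple, Optional


class Observation(NamedTuple):
    station_id: str
    year: int
    element: str    # measurement type, e.g. "TMAX" or "TMIN"
    raw_value: int  # value exactly as stored in the file (tenths of a degree)


def parse_row(line: str) -> Optional[Observation]:
    """Parse one CSV row; return None if the row is malformed."""
    fields = line.strip().split(",")
    if len(fields) < 4:
        return None
    # Assumption: columns are station id, date (yyyymmdd), type, value.
    station_id, date, element, raw_value = fields[:4]
    if len(date) != 8 or not date.isdigit():
        return None
    try:
        value = int(raw_value)
    except ValueError:
        return None
    return Observation(station_id, int(date[:4]), element, value)


def degrees_celsius(obs: Observation) -> float:
    """Convert a temperature stored in tenths of a degree to degrees Celsius."""
    return obs.raw_value / 10.0


row = parse_row("ITE00100554,17890101,TMAX,-63")
print(row, degrees_celsius(row))
```

Returning None for malformed rows lets a mapper skip bad input instead of failing the whole job.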
Outline of Tasks
Basic level: Finding Average
In the first task, your goal is to write a MapReduce program that finds the
average maximum temperature at each station in each year. The input to
your program will be the csv files for different years provided to you. The
output should have rows with three fields: Stationid Year AverageTemp. For
example, a sample output file will look like:
ITE00100554 1789 -63
ITE00100554 1789 -90
GM000010962 1789 4
EZE00100082 1789 -103
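One way to structure this task is to have the map step emit a (station, year) key with each TMAX value and the reduce step average the values per key. The sketch below simulates that locally in plain Python and does not use the Hadoop API. The function names, the column order and the sample rows are assumptions made for illustration, and the average is left in the same units as the input.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (station_id, year)


def map_tmax(line: str) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((station, year), value) for each TMAX row."""
    fields = line.strip().split(",")
    if len(fields) < 4 or fields[2] != "TMAX":
        return
    station_id, date, _, raw_value = fields[:4]
    try:
        value = int(raw_value)
    except ValueError:
        return  # skip rows whose value is not an integer
    yield (station_id, date[:4]), value


def reduce_average(key: Key, values: Iterable[int]) -> Tuple[Key, float]:
    """Reduce step: average all values that share one key."""
    collected = list(values)
    return key, sum(collected) / len(collected)


def run_job(lines: Iterable[str]) -> Dict[Key, float]:
    """Local stand-in for the shuffle: group map output by key, then reduce."""
    groups: Dict[Key, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_tmax(line):
            groups[key].append(value)
    return dict(reduce_average(k, v) for k, v in sorted(groups.items()))


# Illustrative rows, not taken from the real data set.
sample_rows = [
    "ITE00100554,17890101,TMAX,-63",
    "ITE00100554,17890101,TMIN,-90",
    "ITE00100554,17890102,TMAX,-75",
    "GM000010962,17890101,TMAX,4",
]

for (station, year), average in run_job(sample_rows).items():
    print(station, year, average)
```

Keying on the (station, year) pair means each reducer call sees exactly the values that belong to one output row.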
Advanced level: Finding Similarity Between Different Stations
The goal of this task is to implement a MapReduce program that can find
similarity between different weather stations. Similarity between two stations
is calculated based on the following:
You may use the output of the previous task as the input to this task. The
output of this task should be in the following format:
weatherStationID1 weatherStationID2 SimilarityScore
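The input and output handling for this task can be sketched independently of the similarity measure. The plain-Python sketch below reads rows in the Stationid Year AverageTemp format, builds a year-to-average profile per station, and emits one row per unordered station pair. The similarity measure is passed in by the caller. The `shared_years` function used in the demonstration is only a stand-in and is not the measure required by the assignment. All names here are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, Iterable, Iterator, Tuple

Profile = Dict[str, float]  # year -> average temperature


def load_profiles(lines: Iterable[str]) -> Dict[str, Profile]:
    """Group 'Stationid Year AverageTemp' rows into one profile per station."""
    profiles: Dict[str, Profile] = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip rows that do not have exactly three fields
        station_id, year, average = parts
        profiles.setdefault(station_id, {})[year] = float(average)
    return profiles


def station_pairs(
    profiles: Dict[str, Profile],
    similarity: Callable[[Profile, Profile], float],
) -> Iterator[Tuple[str, str, float]]:
    """Yield (station1, station2, score) once for each unordered pair."""
    for first, second in combinations(sorted(profiles), 2):
        yield first, second, similarity(profiles[first], profiles[second])


def shared_years(a: Profile, b: Profile) -> float:
    """Stand-in score: the number of years both stations report."""
    return float(len(a.keys() & b.keys()))


# Illustrative input rows with made-up station ids.
rows = ["A 1789 -63", "A 1790 -50", "B 1789 4", "C 1790 10"]
for first, second, score in station_pairs(load_profiles(rows), shared_years):
    print(first, second, score)
```

Sorting the station ids before pairing means each pair appears once, with the ids in a fixed order.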
Submission
a) Source code for both tasks
b) A report explaining your map/reduce programs. If you apply any optimisation
to improve performance, such as using a combiner to reduce the number of
intermediate key-value pairs, please describe it in the report. If you have
taken inspiration from existing MapReduce programs to complete these tasks,
please reference them.
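As an illustration of the combiner optimisation mentioned in b): an average of averages is generally not the overall average, so a combiner for the averaging task cannot emit averages. A common approach is for the map step to emit (sum, count) pairs, which a combiner can merge safely. The following is a minimal plain-Python sketch with hypothetical function names and illustrative values. It does not use the Hadoop API.

```python
from typing import Iterable, Tuple

Partial = Tuple[int, int]  # (sum of values, number of values)


def combine(pairs: Iterable[Partial]) -> Partial:
    """Combiner: merge partial (sum, count) pairs that share one key."""
    total = 0
    count = 0
    for partial_sum, partial_count in pairs:
        total += partial_sum
        count += partial_count
    return total, count


def reduce_average(pairs: Iterable[Partial]) -> float:
    """Reducer: merge the remaining partial pairs, then divide once."""
    total, count = combine(pairs)
    return total / count


# Values for one key, split across two map tasks (illustrative numbers).
task_a = [(-63, 1), (-75, 1)]  # each map output is (value, 1)
task_b = [(-30, 1)]

# Each task's combiner shrinks its output to a single pair before the shuffle.
combined = [combine(task_a), combine(task_b)]
print(combined, reduce_average(combined))
```

Because the combiner's output has the same shape as its input, the result is the same whether the framework runs it zero, one or several times.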