Hadoop Assignment Writing: CSC555 Mining Big Data 2


On the AWS platform set up previously, use tools such as Hadoop, Hive, Pig, and Mahout to process big data.

Requirement

You will run various queries using Hive, Pig and Hadoop Streaming. The schema is
available below, but remember that your schema definition must specify the
delimiter. The data is available at (note that the data is |-separated, not
comma-separated):
Please note what instance and what cluster you are using (you can reuse your
existing cluster for most of the questions).
Please be sure to submit all code (pig and python and Hive). You should also
submit the command lines you use and a screenshot of a completed run (just the
last page, do not worry about capturing the whole output). You can use time
command to record time of execution of anything you run.
I highly recommend creating a small sample input and testing your code with
it, e.g., by running head lineorder.tbl > lineorder.tbl.sample (you can use
head -n 100 to get only the first 100 lines).

Part 1: Data Transformation

Using the Scale4 data, perform the following data processing.

  • A. Transform the lineorder.tbl table into a CSV (comma-separated) file. Use Hive, MapReduce with Hadoop Streaming, and Pig (i.e. 3 different solutions).
  • B. Extract five of the numeric columns, for rows where lo_discount is between 4 and 6, into a space-separated text file (for K-Means clustering later). Use Hive and Pig (2 different solutions). (NOTE: you do not need to write code to identify which columns are numeric; just go by what the data types say. Manually pick any 5 columns that contain only numbers.)
  • C. Create a pre-join (i.e. a new data file) that corresponds to the query below. You can think of it as a materialized view. What is the size of the new file? Use Hive and Pig (2 different solutions).
    SELECT lo_partkey, lo_suppkey, lo_discount, d_year, lo_revenue
    FROM lineorder, dwdate
    WHERE lo_orderdate = d_datekey;
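For the Hadoop Streaming solution to 1-A, a mapper along the following lines is one possible starting point. It is only a sketch under stated assumptions: the function name pipe_to_csv, the script name to_csv.py and the map argument are illustrative, and it assumes each row of lineorder.tbl ends with a trailing "|" (check this against your own data with head before relying on it). The csv module is used rather than a plain string replace so that any field containing a comma is quoted.

```python
#!/usr/bin/env python3
"""Sketch of a map-only Hadoop Streaming mapper: |-separated rows to CSV."""
import csv
import sys


def pipe_to_csv(lines, out):
    """Write each |-separated input line to `out` as one CSV row."""
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        # Assumption: rows end with a trailing "|", which would otherwise
        # produce an empty last column in the CSV.
        if fields[-1] == "":
            fields.pop()
        writer.writerow(fields)


# Only read stdin when explicitly started as a mapper: python3 to_csv.py map
if __name__ == "__main__" and sys.argv[1:] == ["map"]:
    pipe_to_csv(sys.stdin, sys.stdout)
```

Since no aggregation is needed, a job like this would normally run without reducers; test it locally first with something like head -n 100 lineorder.tbl | python3 to_csv.py map, and inspect the first lines of the job output before timing the full run.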

Part 2: Querying

All queries from SSBM benchmark are available here.
Using the Scale4 data, perform the following data processing, and don’t forget
to time your results.

  • A. Run SSBM queries 2.2, 3.2 and 4.2 using Hive only.
  • B. For this part, use Hive and Pig (two different solutions) to run Q2.1 using what you created in 1-C (i.e. use PreJoin1 instead of the lineorder and dwdate tables in the FROM clause). You will need to rewrite the query accordingly, e.g. something like:
    select sum(lo_revenue), d_year, p_brand1
    from MyNewStructureFrom1C, part, supplier
    where lo_partkey = p_partkey
    and lo_suppkey = s_suppkey
    and p_category = 'MFGR#12'
    and s_region = 'AMERICA'
    group by d_year, p_brand1
    order by d_year, p_brand1;

Part 3: Clustering

Using the file you have created in 1-B, run KMeans clustering using 7
clusters.

  • A. Using Mahout synthetic clustering, as you did on sample data in a previous assignment.
  • B. Using Hadoop Streaming, perform one iteration manually with randomly chosen input centers. This requires passing a text file with the cluster centers using the -file option, opening centers.txt in the mapper with open('centers.txt', 'r'), and assigning a key to each point based on which center is closest to that point. Your reducer then needs to compute the new centers, at which point the iteration is done.
    NOTE: if you get a java.lang.OutOfMemoryError error, you will need to
    reconfigure Hadoop to supply the Java virtual machine with more memory. You
    can do this by editing mapred-site.xml and setting the property
    mapreduce.reduce.java.opts to -Xmx1024m (the mapper should not need much RAM).

The amount of memory can be tweaked (you can go higher, but keep in mind how
much physical memory your machine has). Do not forget to restart Hadoop after
any configuration file change.
If you still run out of memory in 3-A, submit a screenshot of the error and
you will get full credit for the question.
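The manual iteration in 3-B can be sketched as a single script with a mapper mode and a reducer mode. This is a minimal sketch, not the expected solution: the script layout, the function names and the map/reduce arguments are assumptions, as is the format of centers.txt (one space-separated center per line, in the same column order as the 1-B file). Distance is plain squared Euclidean distance, and the reducer keeps one running sum per cluster key, so it does not depend on the keys arriving sorted.

```python
#!/usr/bin/env python3
"""Sketch of one manual K-Means iteration for Hadoop Streaming."""
import sys


def load_centers(path="centers.txt"):
    """Read one space-separated center per line (assumed file format)."""
    with open(path, "r") as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]


def closest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    def dist2(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))
    return min(range(len(centers)), key=lambda i: dist2(centers[i]))


def map_points(lines, centers):
    """Yield 'center_index<TAB>point' for every input point."""
    for line in lines:
        line = line.strip()
        if line:
            point = [float(x) for x in line.split()]
            yield "%d\t%s" % (closest(point, centers), line)


def reduce_points(lines):
    """Yield 'center_index<TAB>new_center' as the mean of each key's points."""
    sums, counts = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split("\t", 1)
        point = [float(x) for x in value.split()]
        if key not in sums:
            sums[key] = [0.0] * len(point)
            counts[key] = 0
        sums[key] = [s + p for s, p in zip(sums[key], point)]
        counts[key] += 1
    for key in sorted(sums, key=int):
        mean = [s / counts[key] for s in sums[key]]
        yield "%s\t%s" % (key, " ".join(str(m) for m in mean))


# Run as: python3 kmeans_step.py map   or   python3 kmeans_step.py reduce
if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        rows = map_points(sys.stdin, load_centers())
    else:
        rows = reduce_points(sys.stdin)
    for row in rows:
        print(row)
```

With more than one reducer, each reducer would emit the new centers only for the keys it receives, so the part files need to be combined to obtain the full list of 7 centers. A center that attracts no points produces no output line at all, which is worth checking when the input centers are chosen at random.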

Part 4: Performance

Compare the performance of the following combinations. If you have already run
a combination, it is sufficient to copy its runtime for comparison.

  • A. All three of your solutions to Part-1A with
    • a. Scale4: single node and a cluster of at least 4 nodes
    • b. Scale14: a cluster of at least 4 nodes
  • B. Both of your solutions for 2-B.
    • a. Scale4: single node and a cluster of at least 4 nodes

Summarize the results and cluster performance/scaling in at least a paragraph.
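
When writing up the scaling paragraph, it can help to reduce each pair of runtimes to a speedup and a per-node efficiency. The helper below is a small sketch of that arithmetic; the function name is illustrative and the numbers in the example call are placeholders, not measurements, so substitute the times you recorded with the time command.

```python
def scaling_summary(single_node_seconds, cluster_seconds, nodes):
    """Return (speedup, efficiency) of a cluster run relative to one node.

    speedup    = single-node runtime / cluster runtime
    efficiency = speedup / number of nodes (1.0 would be perfect scaling)
    """
    speedup = single_node_seconds / cluster_seconds
    efficiency = speedup / nodes
    return speedup, efficiency


# Placeholder runtimes only: 120 s on one node, 40 s on a 4-node cluster.
speedup, efficiency = scaling_summary(120.0, 40.0, 4)
print("speedup: %.2fx, efficiency: %.0f%%" % (speedup, efficiency * 100))
```

An efficiency well below 100% on a small input is not unusual, since job startup and shuffle overhead do not shrink as nodes are added; comparing Scale4 against Scale14 on the same cluster shows whether the larger input uses the nodes better.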

Extra Credit

Research and describe the most affordable way to build a 1-Petabyte drive.
Note that the setup has to be self-sufficient (i.e. easily usable) and include
references. Buying 250 4TB drives is not enough because you still need a
way to use them. The drive should be built to own, not to rent (Dropbox or
similar services do not count, even if they advertise “unlimited” storage).
Submit a single document containing your written answers.


Author: SafePoker
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit SafePoker when reposting.