Hadoop Assignment: CSC555 Mining Big Data


Based on the provided document, you need to set up a Hadoop environment on AWS and perform the computations.

Requirement

In this part of the project, you will 1) Set up a 3-node cluster and 2)
perform data warehousing and transformation queries using Hive, Pig and Hadoop
streaming.

Part 1: Multi-node cluster

  Your first step is to set up a multi-node cluster and re-run a simple
  wordcount. For this part, you will create a 3-node cluster (with a total of 1
  master + 2 worker nodes). Include your master node in the “slaves” file to
  make sure all 3 nodes are working.
  You need to perform the following steps:
  1. Create a new node of a medium size. It is possible to reconfigure your existing Hadoop node into this new cluster, but I do not recommend trying it (it is much easier to make 3 new nodes, for a total of 4 in your AWS account).
    a. When creating a node, I recommend changing the default 8G hard drive to 30G
    so that you do not run out of space easily.
    b. Change your security group settings to open firewall access. Rather than
    figuring out all the individual ports, you can open the 0-64000 range, opening
    up all ports (not the most secure setting in the long term, but fine for us).
    c. Follow the step-by-step instructions on how to make the change to open up
    the ports.
    d. Finally, right-click on the Master node and choose “create more like this”
    to create 2 more nodes with the same settings. If you configure the network
    settings on the master first, the security group information will be copied.
    NOTE: Hard drive size will not be copied and will default to 8G unless you
    change it.
  2. Connect to the master and set up Hadoop similarly to what you did previously. Do not attempt to repeat these steps on workers yet – you will only need to set up Hadoop once.
    a. Configure core-site.xml, adding the Private IP (do not use the public IP)
    of the master.
    b. Configure hdfs-site.xml and set the replication factor to 2.
    c. Run cp hadoop-2.6.4/etc/hadoop/mapred-site.xml.template
    hadoop-2.6.4/etc/hadoop/mapred-site.xml and then configure mapred-site.xml.
    d. Configure yarn-site.xml (once again, use the Private IP of the master).
    A sketch of what these four files might look like is shown at the end of this
    step.
    Finally, edit the slaves file and list your 3 nodes (master and 2 workers)
    using Private IPs:
    [ec2-user@ip-172-31-7-201 ~]$ cat hadoop-2.6.4/etc/hadoop/slaves
    172.31.7.201
    172.31.5.246
    172.31.11.50
    Make sure that you use private IP (private DNS is also ok) for your
    configuration files (such as conf/masters and conf/slaves or the other 3
    config files). The advantage of the Private IP is that it does not change
    after your instance is stopped (if you use the Public IP, the cluster would
    need to be reconfigured every time it is stopped). The downside of the Private
    IP is that it is only meaningful within the Amazon EC2 network. So all nodes
    in EC2 can talk to each other using Private IP, but you cannot connect to your
    instance from the outside (e.g., from your laptop) because Private IP has no
    meaning for your laptop (since your laptop is not part of the Amazon EC2
    network).
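    For reference, here is a minimal sketch of the relevant properties in the four
    configuration files, assuming Hadoop 2.6.4 and the example master Private IP
    172.31.7.201 from above (substitute your own IP; port 9000 for the NameNode is
    a common choice, not a requirement):
    core-site.xml:
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.31.7.201:9000</value>
      </property>
    </configuration>
    hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>
    mapred-site.xml:
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
    yarn-site.xml:
    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>172.31.7.201</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>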
    Now we will set up passwordless SSH access and then pack up and move Hadoop to
    the workers. All you need to do is generate a key pair on the master and then
    copy the public key to the worker nodes to achieve passwordless access across
    your cluster.
  3. Run ssh-keygen -t rsa (and enter empty values for the passphrase) on the master node. That will generate .ssh/id_rsa and .ssh/id_rsa.pub (private and public key). You now need to manually copy the .ssh/id_rsa.pub and append it to ~/.ssh/authorized_keys on each worker.
    Keep in mind that this is a single-line public key, and accidentally
    introducing a line break would break it. For example, on one of the workers
    (ip-172-31-5-246), NOT the master, the authorized_keys file would contain two
    public keys: the first is the public half of the Amazon .pem key pair, and the
    second is the master’s public key copied in as one line.
    You can add the public key of the master to the master by running this
    command:
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
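    If you have copied your Amazon .pem key to the master, you can also push the
    master’s public key to each worker in one command instead of pasting it
    manually; a minimal sketch (mykey.pem and the worker IP are placeholders):
    cat ~/.ssh/id_rsa.pub | ssh -i mykey.pem [email protected] 'cat >> ~/.ssh/authorized_keys'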
    Make sure that you can ssh from the master node to all of the nodes (by
    running ssh 54.186.221.92, where the IP address is that of your worker node)
    and ensure that you are able to log in. You can exit after a successful ssh
    connection by typing exit (the command prompt will tell you which machine you
    are connected to, e.g., ec2-user@ip-172-31-37-113). Here is an example of
    ssh-ing from master to worker.
    Once you have verified that you can ssh from the master node to every
    cluster member including the master itself (ssh localhost), you are going to
    return to the master node (exit until your prompt shows the IP address of the
    master node) and pack up the contents of the hadoop directory there. Make sure
    your Hadoop installation is configured correctly (because from now on, you
    will have 3 copies of the Hadoop directory and all changes may need to be
    applied in 3 places).
    cd (go to your home directory, i.e., /home/ec2-user/)
    tar cvf myHadoop.tar hadoop-2.6.4
    (this packs up the entire Hadoop directory into a single file for transfer;
    you can optionally compress the file with gzip, as shown below)
    ls -al myHadoop.tar (to verify that the .tar file has been created)
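    If you choose the optional gzip compression, the equivalent commands would be
    (you would then also scp the .gz file and unpack it on the workers with tar
    xzvf):
    tar czvf myHadoop.tar.gz hadoop-2.6.4
    ls -al myHadoop.tar.gz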
    Now, you need to copy the myHadoop.tar file to every non-master node in the
    cluster. If you had successfully setup public-private key access in the
    previous step, this command (for each worker node) will do that:
    (This copies the myHadoop.tar file from the current node to a remote node,
    into a file called myHadoopWorker.tar. Don’t forget to replace the IP address
    with that of your worker node. By the way, since you are on the Amazon EC2
    network, either the Public or the Private IP will work just fine.)
    scp myHadoop.tar [email protected]:/home/ec2-user/myHadoopWorker.tar
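    A small shell loop saves some typing if you have several workers; the IPs
    below are the example worker IPs from the slaves file above (substitute your
    own):
    for ip in 172.31.5.246 172.31.11.50; do
      scp myHadoop.tar ec2-user@$ip:/home/ec2-user/myHadoopWorker.tar
    done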
    Once the tar file containing your Hadoop installation from master node has
    been copied to each worker node, you need to login to each non-master node and
    unpack the .tar file.
    Run the following command (on each worker node, not on the master) to untar
    the Hadoop file. We are purposely using a different tar archive name (i.e.,
    myHadoopWorker.tar), so if you get a “file not found” error, that means you
    are running this command on the master node or have not successfully copied
    the myHadoopWorker.tar file.
    tar xvf myHadoopWorker.tar
    Once you are done, run this on the master (nothing needs to be done on the
    workers to format the cluster unless you are re-formatting, in which case
    you’ll need to delete the dfs directory first; see the note after the
    command).
    hadoop namenode -format
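    NOTE: if you do need to re-format later, the dfs directory lives under
    hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name} unless you have
    overridden it in core-site.xml. Assuming that default, the cleanup command on
    each node would be:
    rm -rf /tmp/hadoop-ec2-user/dfs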
    Once you have successfully completed the previous steps, you should be able to
    start and use your new cluster by going to the master node and running the
    start-dfs.sh and start-yarn.sh scripts (you do not need to explicitly start
    anything on worker nodes – the master will do that for you).
    You should verify that the cluster is running by pointing your browser to the
    link below.
    http://[insert-the-public-ip-of-master]:50070/
    Make sure that the cluster is operational (you can see the 3 nodes under
    Datanodes tab).
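    Besides the web UI, you can sanity-check the cluster from the master’s
    command line (jps lists the running Java daemons, such as NameNode and
    ResourceManager, and the dfsadmin report should show 3 live datanodes):
    jps
    hdfs dfsadmin -report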
    Submit a screenshot of your cluster status view.
    Repeat the steps for wordcount using bioproject.xml from Assignment 1 and
    submit screenshots of running it.
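    A sketch of the wordcount re-run, using the examples jar that ships with
    Hadoop 2.6.4 and assuming bioproject.xml is in your home directory (the HDFS
    paths are placeholders):
    hadoop fs -mkdir /data
    hadoop fs -put bioproject.xml /data/
    hadoop jar hadoop-2.6.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.4.jar wordcount /data/bioproject.xml /wordcount-output
    hadoop fs -cat /wordcount-output/part-r-00000 | head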

Part 2: Hive

Run the following five queries (1.1, 1.2, 1.3, 2.1, and 2.2) in Hive and record
the time they take to execute.
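One simple way to time them: the Hive CLI prints a “Time taken: ... seconds”
line after each query, so you can put each query into its own file and run it
non-interactively from the Hive directory (query1_1.sql is a placeholder name;
the query text comes from the assignment):
bin/hive -f query1_1.sql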

Part 3: Pig

Convert and load the data into Pig, implementing queries 0.1, 0.2, and 0.3
(these three queries are the only ones you need to run in Pig). Check disk
storage first; if your disk usage is over 90%, Pig may hang without an error or
a warning.
One easy way to time Pig is as follows: put your sequence of Pig commands into
a text file and then run, from the command line in the Pig directory (e.g.,
[ec2-user@ip-172-31-6-39 pig-0.15.0]$), bin/pig -f pig_script.pig (which will
report how long the Pig script took to run).
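For example, assuming the script name from above (df -h is just the disk-usage
check that catches the over-90% condition mentioned earlier):
df -h
bin/pig -f pig_script.pig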

Part 4: Hadoop Streaming

Implement query 0.3 using Hadoop streaming with Python (you may implement this
part in Java if you prefer).
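A sketch of the streaming invocation, assuming Hadoop 2.6.4 and placeholder
names for the mapper/reducer scripts and the HDFS paths (the actual map and
reduce logic depends on query 0.3):
hadoop jar hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar \
  -files mapper.py,reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /data/input \
  -output /data/streaming-output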
Submit a single document containing your written answers. Be sure that this
document contains your name and “CSC 555 Project Phase 1” at the top.

