Based on the provided document, you will set up a Hadoop environment on AWS to perform the computation.
Requirement
In this part of the project, you will 1) Set up a 3-node cluster and 2)
perform data warehousing and transformation queries using Hive, Pig and Hadoop
streaming.
Part 1: Multi-node cluster
- Your first step is to set up a multi-node cluster and re-run a simple
wordcount. For this part, you will create a 3-node cluster (with a total of 1
master + 2 worker nodes). Include your master node in the “slaves” file, to
make sure all 3 nodes are working.
You need to perform the following steps:
- Create a new node of medium size. Reconfiguring your existing Hadoop node into this new cluster is possible, but I do not recommend it (it is much easier to make 3 new nodes, for a total of 4 in your AWS account).
a. When creating a node I recommend changing the default 8G hard drive to 30G
so that you do not run out of space easily.
b. Change your security group setting to open firewall access. Rather than
figure out all of the individual ports, you can open the 0-64000 range, opening up
all ports (not the most secure setting in the long term, but fine for us).
c. Follow the step-by-step instructions on how to make the change to open up the ports.
d. Finally, right-click on the Master node and choose “create more like this”
to create 2 more nodes with the same settings. If you configure the network
settings on the master first, the security group information will be copied.
NOTE: The hard drive size will not be copied and will default to 8G unless you
change it.
- Connect to the master and set up Hadoop similarly to what you did previously. Do not attempt to repeat these steps on the workers yet – you will only need to set up Hadoop once.
a. Configure core-site.xml, adding the Private IP (do not use the Public IP) of
the master.
b. Configure hdfs-site.xml and set the replication factor to 2.
c. Copy the template (cp hadoop-2.6.4/etc/hadoop/mapred-site.xml.template
hadoop-2.6.4/etc/hadoop/mapred-site.xml) and then configure mapred-site.xml.
d. Configure yarn-site.xml (once again, use the Private IP of the master).
Finally, edit the slaves file and list your 3 nodes (master and 2 workers)
using Private IPs
[ec2-user@ip-172-31-7-201 ~]$ cat hadoop-2.6.4/etc/hadoop/slaves
172.31.7.201
172.31.5.246
172.31.11.50
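The config edits above can be sketched as follows (a minimal sketch for Hadoop 2.6; the port 9000 and the sample Private IP 172.31.7.201 are assumptions – substitute your own master's Private IP):

```xml
<!-- core-site.xml: point HDFS at the master's Private IP -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://172.31.7.201:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor of 2 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: ResourceManager on the master's Private IP -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>172.31.7.201</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```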
Make sure that you use private IP (private DNS is also ok) for your
configuration files (such as conf/masters and conf/slaves or the other 3
config files). The advantage of the Private IP is that it does not change
after your instance is stopped (if you use the Public IP, the cluster would
need to be reconfigured every time it is stopped). The downside of the Private
IP is that it is only meaningful within the Amazon EC2 network. So all nodes
in EC2 can talk to each other using Private IP, but you cannot connect to your
instance from the outside (e.g., from your laptop) because Private IP has no
meaning for your laptop (since your laptop is not part of the Amazon EC2
network).
Now, we will pack up and move Hadoop to the workers. All you need to do is to
generate and then copy the public key to the worker nodes to achieve
passwordless access across your cluster.
- Run ssh-keygen -t rsa (and enter empty values for the passphrase) on the
master node. That will generate .ssh/id_rsa and .ssh/id_rsa.pub (the private
and public key). You now need to manually copy .ssh/id_rsa.pub and append it
to ~/.ssh/authorized_keys on each worker.
Keep in mind that this is a single-line public key and accidentally
introducing a line break would break it. For example, note that this is NOT
the master, but one of the workers (ip-172-31-5-246). The first public key is
the .pem Amazon half and the 2nd public key is the master’s public key copied
in as one line.
You can add the public key of the master to the master by running this
command:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Make sure that you can ssh to all of the nodes from the master node (by
running ssh 54.186.221.92, where the IP address is that of your worker node)
and ensuring that you are able to log in. You can exit after a successful ssh
connection by typing exit (the command prompt will tell you which machine you
are connected to, e.g., ec2-user@ip-172-31-37-113). Here’s me ssh-ing from
master to worker.
Once you have verified that you can ssh from the master node to every cluster
member including the master itself (ssh localhost), return to the master node
(exit until your prompt shows the IP address of the master node) and pack up
the contents of the hadoop directory there. Make sure your Hadoop installation
is configured correctly (because from now on, you will have 3 copies of the
Hadoop directory and all changes may need to be applied in 3 places).
cd
(go to the root home directory, i.e., /home/ec2-user/)
(pack up the entire Hadoop directory into a single file for transfer. You can
optionally compress the file with gzip)
tar cvf myHadoop.tar hadoop-2.6.4
ls -al myHadoop.tar (to verify that the .tar file had been created)
Now, you need to copy the myHadoop.tar file to every non-master node in the
cluster. If you have successfully set up public-private key access in the
previous step, this command (run once for each worker node) will do that:
(copies the myHadoop.tar file from the current node to a remote node into a
file called myHadoopWorker.tar. Don’t forget to replace the IP address with
that of your worker node. By the way, since you are on the Amazon EC2 network,
either the Public or the Private IP will work just fine.)
scp myHadoop.tar [email protected]:/home/ec2-user/myHadoopWorker.tar
Once the tar file containing your Hadoop installation from master node has
been copied to each worker node, you need to login to each non-master node and
unpack the .tar file.
Run the following command (on each worker node, not on the master) to untar
the Hadoop file. We are purposely using a different tar archive name (i.e.,
myHadoopWorker.tar), so if you get a “file not found” error, that means you
are running this command on the master node or have not successfully copied
the myHadoopWorker.tar file.
tar xvf myHadoopWorker.tar
Once you are done, run this on the master (nothing needs to be done on the
workers to format the cluster unless you are re-formatting, in which case
you’ll need to delete the dfs directory).
hadoop namenode -format
Once you have successfully completed the previous steps, you should be able to
start and use your new cluster by going to the master node and running the
start-dfs.sh and start-yarn.sh scripts (you do not need to explicitly start
anything on worker nodes – the master will do that for you).
You should verify that the cluster is running by pointing your browser to the
link below.
http://[insert-the-public-ip-of-master]:50070/
Make sure that the cluster is operational (you can see the 3 nodes under the
Datanodes tab).
Submit a screenshot of your cluster status view.
Repeat the steps for wordcount using bioproject.xml from Assignment 1 and
submit screenshots of running it.
Part 2: Hive
Run the following five (1.1, 1.2, 1.3 and 2.1, 2.2) queries in Hive and record
the time they take to execute.
Part 3: Pig
Convert and load the data into Pig, implementing queries 0.1, 0.2, 0.3. These
are the only queries you need to run for Pig. Check your disk usage: if it is
over 90%, Pig may hang without an error or a warning.
One easy way to time Pig is as follows: put your sequence of Pig commands into
a text file and then run, from the command line in the Pig directory (e.g.,
[ec2-user@ip-172-31-6-39 pig-0.15.0]$), bin/pig -f pig_script.pig (which will
inform you how long the Pig script took to run).
Part 4: Hadoop Streaming
Implement query 0.3 using Hadoop streaming with python (you may implement this
part in Java if you prefer).
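The mapper/reducer contract that Hadoop streaming expects can be sketched as follows (a minimal sketch only; the file name streaming.py and the wordcount logic are placeholders, since query 0.3 itself is defined in the assignment handout):

```python
#!/usr/bin/env python
# Minimal sketch of the Hadoop streaming pattern in Python (wordcount logic
# shown as a placeholder -- adapt map_line/reduce_sorted to query 0.3).
# Hadoop streaming runs the mapper and reducer as separate processes:
# input lines arrive on stdin, "key<TAB>value" lines are emitted on stdout,
# and the framework sorts mapper output by key before the reducer sees it.
import sys

def map_line(line):
    """Mapper logic for one input line: emit (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_sorted(pairs):
    """Reducer logic: sum the values of consecutive identical keys.

    Assumes pairs are already sorted by key, as Hadoop guarantees.
    """
    totals = []
    current_key, current_sum = None, 0
    for key, value in pairs:
        if key == current_key:
            current_sum += value
        else:
            if current_key is not None:
                totals.append((current_key, current_sum))
            current_key, current_sum = key, value
    if current_key is not None:
        totals.append((current_key, current_sum))
    return totals

if __name__ == "__main__" and len(sys.argv) > 1:
    # Invoke as "python streaming.py map" (mapper) or "... reduce" (reducer).
    if sys.argv[1] == "map":
        for line in sys.stdin:
            for key, value in map_line(line):
                print("%s\t%d" % (key, value))
    else:
        split_lines = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for key, total in reduce_sorted((k, int(v)) for k, v in split_lines):
            print("%s\t%d" % (key, total))
```

You would then submit the job through the streaming jar bundled with your Hadoop distribution, passing the script as both -mapper and -reducer (with the appropriate mode argument); the exact jar path varies by install, so treat any path you find online as an assumption to verify against your own hadoop-2.6.4 directory.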
Submit a single document containing your written answers. Be sure that this
document contains your name and “CSC 555 Project Phase 1” at the top.