EC2 Tutorial
The following steps will help you get started running Accumulo in Amazon's Elastic Compute Cloud. These instructions have been tested on clusters of up to 400 EC2 m1.large instances.
The current stable release of Accumulo is version 1.3.5. The AMI and instructions will be updated when 1.4 is released.
For all node types (NameNode, DataNode, TabletServer, etc.), m1.large instances have been tested and are known to work on clusters of up to 400 nodes. Using a larger instance type for the HDFS NameNode, specifically one with more memory, may be desirable.
m1.large instances have 7.5GB of RAM, which is enough to run all three types of processes: TabletServer, HDFS DataNode, and MapReduce TaskTracker simultaneously without problems. Organizations have been known to use larger instances with success.
m1.larges also have two local disks (a.k.a. instance storage), which may be preferable to a single disk of the same total size, as they provide more IOPS to Accumulo.
This tutorial was tested with the Ubuntu Maverick 10.10 x86_64 AMI (ID ami-08f40561).
We may post a pre-installed and configured Accumulo AMI here soon.
Accumulo has been tested on a cluster spanning up to four separate Availability Zones within the same region without suffering from inter-zone latency problems. It is unknown whether inter-region latency will cause problems.
For better failure resilience, HDFS's "Rack Awareness" feature could theoretically be configured to place replicas in different Availability Zones. This could help avoid data unavailability if an entire zone becomes unavailable. Doing this is outside the scope of this tutorial.
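Purely as an illustration of how that might look, Hadoop's topology.script.file.name property in core-site.xml points HDFS at a script that maps DataNode IPs to "racks"; the script path and the IP-to-zone mapping below are hypothetical:
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>
#!/bin/bash
# /etc/hadoop/conf/topology.sh (hypothetical): print one "rack" per IP argument,
# naming racks after Availability Zones so replicas spread across zones.
for ip in "$@"; do
  case "$ip" in
    10.1.*) echo "/zone-a" ;;
    10.2.*) echo "/zone-b" ;;
    *)      echo "/default-rack" ;;
  esac
done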
Although all of the necessary processes can be run on a single machine, for production clusters it is best to separate some of them. The sections below note which packages and processes belong on which types of nodes.
Much of the installation and configuration can be performed by executing this script on your instances after they boot. After the script finishes, resume the steps at Hadoop Configuration. The steps in the script are described below.
Cloudera's Hadoop distribution, CDH3u2, has been tested and is known to work. apt-get is an easy way to install Hadoop.
RELEASE=`lsb_release -c | awk '{print $2}'`
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo apt-get install python-software-properties -y
sudo add-apt-repository "deb http://archive.canonical.com/ $RELEASE partner"
sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu/ $RELEASE multiverse"
sudo add-apt-repository "deb http://archive.cloudera.com/debian $RELEASE-cdh3u2 contrib"
sudo apt-get update
cat << EOD | sudo debconf-set-selections
sun-java6-jdk shared/accepted-sun-dlj-v1-1 select true
sun-java6-jre shared/accepted-sun-dlj-v1-1 select true
EOD
sudo dpkg --set-selections <<EOS
sun-java6-jdk install
EOS
sudo apt-get install -y gcc g++ hadoop-0.20 hadoop-0.20-datanode hadoop-0.20-tasktracker hadoop-zookeeper
In addition to these, the ZooKeeper servers, the NameNode, and the JobTracker will need the following packages, respectively:
sudo apt-get install hadoop-zookeeper-server -y
sudo apt-get install hadoop-0.20-namenode -y
sudo apt-get install hadoop-0.20-jobtracker -y
On all nodes that will serve as Accumulo TabletServers or the Master, wget the binary distribution of Accumulo from the Incubator site:
wget http://www.alliedquotes.com/mirrors/apache/incubator/accumulo/1.3.5-incubating/accumulo-1.3.5-incubating-dist.tar.gz
tar -xzf accumulo-1.3.5-incubating-dist.tar.gz
ln -s accumulo-1.3.5-incubating accumulo
If you plan to run MapReduce over Accumulo, copy the following JAR files to Hadoop's lib directory on each of the nodes that will serve as MapReduce TaskTrackers:
sudo cp accumulo/lib/accumulo-core-1.3.5-incubating.jar /usr/lib/hadoop/lib/
sudo cp accumulo/lib/log4j-1.2.16.jar /usr/lib/hadoop/lib/
sudo cp accumulo/lib/libthrift-0.3.jar /usr/lib/hadoop/lib/
sudo cp accumulo/lib/cloudtrace-1.3.5-incubating.jar /usr/lib/hadoop/lib/
sudo cp /usr/lib/zookeeper/zookeeper.jar /usr/lib/hadoop/lib/
A basic Ubuntu AMI like Maverick 10.10 (ami-08f40561) works fine. After booting, apply the following tweaks:
sudo sysctl -w vm.swappiness=0
echo -e "ubuntu\t\tsoft\tnofile\t65536" | sudo tee --append /etc/security/limits.conf
echo -e "ubuntu\t\thard\tnofile\t65536" | sudo tee --append /etc/security/limits.conf
sudo apt-get install xfsprogs -y
sudo umount /mnt
sudo /sbin/mkfs.xfs -f /dev/sdb
sudo mount -o noatime /dev/sdb /mnt
sudo mkdir /mnt2
sudo /sbin/mkfs.xfs -f /dev/sdc
sudo mount -o noatime /dev/sdc /mnt2
sudo chown -R ubuntu /mnt
sudo chown -R ubuntu /mnt2
mkdir /mnt/hdfs
mkdir /mnt/namenode
mkdir /mnt/mapred
mkdir /mnt/walogs
mkdir /mnt2/hdfs
mkdir /mnt2/mapred
sudo chown -R hdfs /mnt/hdfs
sudo chown -R hdfs /mnt/namenode
sudo chown -R mapred /mnt/mapred
sudo chown -R hdfs /mnt2/hdfs
sudo chown -R mapred /mnt2/mapred
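A quick sanity check (purely illustrative) that both disks are mounted and the directories have their intended owners:
mount | grep xfs                             # expect /dev/sdb on /mnt and /dev/sdc on /mnt2
ls -ld /mnt/hdfs /mnt/walogs /mnt2/mapred    # owned by hdfs, ubuntu, and mapred respectively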
EBS volumes have not been tested, but should work. For larger clusters, EBS volumes are less cost-effective and may not perform as well as local instance storage, because HDFS already provides the reliability features that make EBS's built-in redundancy unnecessary, and because EBS I/O travels over the network.
HDFS, MapReduce, and Accumulo were designed with local storage in mind.
To use EBS volumes, simply attach them to the desired nodes, then format, create directories, and set permissions as above. Make sure these paths are entered in the HDFS and Accumulo configuration files, as outlined below.
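For example, assuming a hypothetical attached device /dev/sdf and mount point /mnt3:
sudo /sbin/mkfs.xfs -f /dev/sdf
sudo mkdir /mnt3
sudo mount -o noatime /dev/sdf /mnt3
sudo mkdir /mnt3/hdfs
sudo chown -R hdfs /mnt3/hdfs
# then add /mnt3/hdfs to dfs.data.dir in hdfs-site.xml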
Add this line to /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
The following tags should be placed in /etc/hadoop/conf/core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://ip-address-of-namenode:9000</value>
</property>
The following tags should be placed in /etc/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hdfs,/mnt2/hdfs</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
The following tags should be placed in /etc/hadoop/conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>ip-address-of-your-jobtracker:9001</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/mapred,/mnt2/mapred</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
Add the following line to /etc/zookeeper/zoo.cfg:
maxClientCnxns=250
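Once ZooKeeper is restarted (see the start-up steps below), its standard four-letter commands offer a quick health check, for example:
echo ruok | nc localhost 2181    # a healthy server answers "imok"
echo stat | nc localhost 2181    # lists current client connections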
Edit the following lines of accumulo/conf/accumulo-env.sh
test -z "$JAVA_HOME" && export JAVA_HOME="/usr/lib/jvm/java-6-sun/jre/"
test -z "$HADOOP_HOME" && export HADOOP_HOME="/usr/lib/hadoop/"
test -z "$ZOOKEEPER_HOME" && export ZOOKEEPER_HOME="/usr/lib/zookeeper/"
Add the following tags to accumulo/conf/accumulo-site.xml
<property>
  <name>instance.zookeeper.host</name>
  <value>your-zk-servers-internal-dns1:2181,another-zk-server-internal-dns:2181</value>
</property>
<property>
  <name>logger.dir.walog</name>
  <value>/mnt/walogs</value>
</property>
<property>
  <name>instance.secret</name>
  <value>DEFAULT</value>
</property>
<property>
  <name>tserver.memory.maps.max</name>
  <value>1G</value>
</property>
EC2 provides internal domain names for all instances. These should be used wherever servers are referenced.
Place the internal domain name of the Accumulo master into accumulo/conf/masters
Place the internal domain names of the Accumulo tablet servers into accumulo/conf/slaves
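For example, with hypothetical internal hostnames, accumulo/conf/masters would contain a single line:
ip-10-1-2-3.ec2.internal
and accumulo/conf/slaves one hostname per line:
ip-10-1-2-4.ec2.internal
ip-10-1-2-5.ec2.internal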
Edit your instances' security group policy to allow communication on the following ports to and from your internal instances:
2181 Zookeeper
2888 Zookeeper
3888 Zookeeper
4560 Accumulo monitor
9000 HDFS
9001 JobTracker
9997 Tablet Server
9999 Master Server
11224 Accumulo Logger
12234 Accumulo Tracer
50010 DataNode Data
50020 DataNode Metadata
50060 TaskTrackers
50070 NameNode HTTP monitor
50075 DataNode HTTP monitor
50091 ?
50095 Accumulo HTTP monitor
Allowing access to these ports from 10.0.0.0/8 will allow internal access but not access from outside EC2.
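With the EC2 API tools, a rule opening one such port to internal traffic might look like the following (the security group name "accumulo" is hypothetical):
ec2-authorize accumulo -P tcp -p 2181 -s 10.0.0.0/8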
To use the Accumulo start-up scripts (which must be used), set up password-less SSH from the controller node to the other nodes. To generate a new SSH key, run the following and hit enter when prompted for a passphrase to create a password-less key:
ssh-keygen
cd ~/.ssh
This will create a private key and a public key. Append the public key to the authorized_keys file on each TabletServer node:
cat id_rsa.pub >> authorized_keys
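To push the key out from the controller, one approach (using the EC2 key pair you launched with; file names hypothetical) is:
for host in `cat ~/accumulo/conf/slaves`; do
  cat ~/.ssh/id_rsa.pub | ssh -i your-key.pem ubuntu@$host 'cat >> ~/.ssh/authorized_keys'
done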
These steps only need to be performed once.
If not already done, the NameNode must be formatted. On the NameNode run:
sudo -u hdfs hadoop namenode -format
Accumulo must be initialized once HDFS and ZooKeeper are started. See below.
HDFS DataNodes should already be running. If not, on each DataNode run:
sudo /etc/init.d/hadoop-0.20-datanode start
On the NameNode run:
sudo /etc/init.d/hadoop-0.20-namenode start
ZooKeeper should already be running on the machines where the hadoop-zookeeper-server package was installed. You should restart it so ZooKeeper picks up the higher connection limit configured above.
sudo /etc/init.d/hadoop-zookeeper-server stop
sudo /etc/init.d/hadoop-zookeeper-server start
If you plan to run MapReduce, the TaskTrackers should already be started. If not, on each TaskTracker run:
sudo /etc/init.d/hadoop-0.20-tasktracker start
MapReduce will need write access to HDFS. The next step is overkill, but securing HDFS is outside the scope of this tutorial.
sudo -u hdfs hadoop fs -chmod a+rwx /
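A narrower alternative, if blanket permissions are undesirable, is to give each submitting user a home directory in HDFS (username ubuntu assumed):
sudo -u hdfs hadoop fs -mkdir /user/ubuntu
sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu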
The JobTracker must be started. On the JobTracker run:
sudo /etc/init.d/hadoop-0.20-jobtracker start
To accept the SSH host keys, either log into each of your Accumulo nodes once or run:
for i in `cat conf/slaves`; do
  ssh -o "StrictHostKeyChecking no" $i "echo"
done
Initialize Accumulo using the init command. Only do this once.
~/accumulo/bin/accumulo init
You will be prompted to choose an instance name and a root password.
Accumulo ships with some start-up scripts for convenience. These must be used when starting Accumulo for the first time. They will start the Master, TabletServer, Logger, GarbageCollector, Monitor, and Tracer processes:
~/accumulo/bin/start-all.sh
Hadoop and Accumulo provide web interfaces for monitoring health and status. It's often convenient to SSH into an EC2 instance with the -X option and run Firefox on the instance.
ssh -i your-key.pem ubuntu@your-instance.com -X
sudo apt-get install firefox -y
firefox localhost:50095 &
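Alternatively, SSH port forwarding avoids running a remote browser by tunneling the monitor page to your local machine:
ssh -i your-key.pem -L 50095:localhost:50095 ubuntu@your-instance.com
# then browse to http://localhost:50095 locally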
Accumulo (and MapReduce, if running) should always be stopped before HDFS.
~/accumulo/bin/stop-all.sh
In order to suspend instances (assuming they are EBS-backed), Accumulo, MapReduce, ZooKeeper, and HDFS should be stopped prior to suspension, in that order.
~/accumulo/bin/stop-all.sh
sudo /etc/init.d/hadoop-0.20-tasktracker stop; # on all tasktrackers
sudo /etc/init.d/hadoop-0.20-jobtracker stop;
sudo /etc/init.d/hadoop-zookeeper-server stop; # on zookeeper nodes
sudo /etc/init.d/hadoop-0.20-namenode stop;
sudo /etc/init.d/hadoop-0.20-datanode stop; # on all datanodes
On any machine that has the Accumulo and Hadoop config files as described above, the Accumulo shell can be used to create, configure, and inspect tables:
~/accumulo/bin/accumulo shell -u root
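A minimal example session (table name hypothetical): after entering the root password, the following creates a table, writes one cell, and reads it back:
createtable testtable
insert row1 colfam colqual value1
scan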
To begin writing clients to write and read data, see the Client section of the Accumulo 1.3 Manual.