EC2 Tutorial
The following steps will help you get started running Accumulo in Amazon's Elastic Compute Cloud. These instructions have been tested on clusters of up to 400 EC2 m1.large instances.
The current stable release of Accumulo is version 1.3.5. The AMI and instructions will be updated when 1.4 is released.
For all node types (NameNode, DataNode, TabletServer, etc.), m1.large instances have been tested and are known to work on clusters of up to 400 nodes. Using a larger instance type for the HDFS NameNode, specifically one with more memory, may be desirable.
m1.large instances have 7.5GB of RAM, which is enough to run all three types of processes: TabletServer, HDFS DataNode, and MapReduce TaskTracker simultaneously without problems. Organizations have been known to use larger instances with success.
m1.larges also have two local disks (a.k.a. instance storage), which may be preferable to a single disk of the same total size, as they provide more IOPS to Accumulo.
This tutorial was tested with the Ubuntu Maverick 10.10 x86_64 AMI (ID ami-08f40561).
We may post a pre-installed and configured Accumulo AMI here soon.
Accumulo has been tested on a cluster spanning up to four separate Availability Zones within the same region without suffering from inter-zone latency problems. It is unknown whether inter-region latency will cause problems.
For better failure resilience, HDFS's "Rack Awareness" feature could theoretically be configured to place replicas in different Availability Zones. This could help avoid data unavailability if an entire zone becomes unavailable. Doing this is outside the scope of this tutorial.
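Purely as an illustration of how that might look, Hadoop's topology.script.file.name property in core-site.xml points HDFS at a script that maps DataNode IPs to "racks"; the script path and the IP-to-zone mapping below are hypothetical:
<property>
  <name>topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>
#!/bin/bash
# /etc/hadoop/conf/topology.sh (hypothetical): print one "rack" per IP argument,
# naming racks after Availability Zones so replicas spread across zones.
for ip in "$@"; do
  case "$ip" in
    10.1.*) echo "/zone-a" ;;
    10.2.*) echo "/zone-b" ;;
    *)      echo "/default-rack" ;;
  esac
done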
Although all of the necessary processes can be run on a single machine, for production clusters it is best to separate some of them. The sections below note which packages and processes belong on which types of nodes.
Much of the installation and configuration can be performed by executing this script on your instances after they boot. After the script finishes, resume the steps at Hadoop Configuration. The steps in the script are described below.
Cloudera's Hadoop distribution, CDH3u2, has been tested and is known to work. apt-get is an easy way to install Hadoop.
RELEASE=`lsb_release -c | awk '{print $2}'`
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
sudo apt-get install python-software-properties -y
sudo add-apt-repository "deb http://archive.canonical.com/ $RELEASE partner"
sudo add-apt-repository "deb http://archive.ubuntu.com/ubuntu/ $RELEASE multiverse"
sudo add-apt-repository "deb http://archive.cloudera.com/debian $RELEASE-cdh3u2 contrib"
sudo apt-get update
cat << EOD | sudo debconf-set-selections
sun-java6-jdk shared/accepted-sun-dlj-v1-1 select true
sun-java6-jre shared/accepted-sun-dlj-v1-1 select true
EOD
sudo dpkg --set-selections <<EOS
sun-java6-jdk install
EOS
sudo apt-get install -y gcc g++ hadoop-0.20 hadoop-0.20-datanode hadoop-0.20-tasktracker hadoop-zookeeper
In addition to these, the ZooKeeper servers, the NameNode, and the JobTracker will need the following packages, respectively:
sudo apt-get install hadoop-zookeeper-server -y
sudo apt-get install hadoop-0.20-namenode -y
sudo apt-get install hadoop-0.20-jobtracker -y
On all nodes that will serve as Accumulo TabletServers or the Master, wget the binary distribution of Accumulo from the Incubator site:
wget http://www.alliedquotes.com/mirrors/apache/incubator/accumulo/1.3.5-incubating/accumulo-1.3.5-incubating-dist.tar.gz
tar -xzf accumulo-1.3.5-incubating-dist.tar.gz
ln -s accumulo-1.3.5-incubating accumulo
If you plan to run MapReduce over Accumulo, copy the following JAR files to Hadoop's lib directory on each of the nodes that will serve as MapReduce TaskTrackers:
sudo cp accumulo/lib/accumulo-core-1.3.5-incubating.jar /usr/lib/hadoop/lib/
sudo cp accumulo/lib/log4j-1.2.16.jar /usr/lib/hadoop/lib/
sudo cp accumulo/lib/libthrift-0.3.jar /usr/lib/hadoop/lib/
sudo cp accumulo/lib/cloudtrace-1.3.5-incubating.jar /usr/lib/hadoop/lib/
sudo cp /usr/lib/zookeeper/zookeeper.jar /usr/lib/hadoop/lib/
A basic Ubuntu AMI like Maverick 10.10 (ami-08f40561) works fine. After booting, apply the following tweaks:
sudo sysctl -w vm.swappiness=0
echo -e "ubuntu\t\tsoft\tnofile\t65536" | sudo tee --append /etc/security/limits.conf
echo -e "ubuntu\t\thard\tnofile\t65536" | sudo tee --append /etc/security/limits.conf
sudo apt-get install xfsprogs -y
sudo umount /mnt
sudo /sbin/mkfs.xfs -f /dev/sdb
sudo mount -o noatime /dev/sdb /mnt
sudo mkdir /mnt2
sudo /sbin/mkfs.xfs -f /dev/sdc
sudo mount -o noatime /dev/sdc /mnt2
sudo chown -R ubuntu /mnt
sudo chown -R ubuntu /mnt2
mkdir /mnt/hdfs
mkdir /mnt/namenode
mkdir /mnt/mapred
mkdir /mnt/walogs
mkdir /mnt2/hdfs
mkdir /mnt2/mapred
sudo chown -R hdfs /mnt/hdfs
sudo chown -R hdfs /mnt/namenode
sudo chown -R mapred /mnt/mapred
sudo chown -R hdfs /mnt2/hdfs
sudo chown -R mapred /mnt2/mapred
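A quick sanity check (purely illustrative) that both disks are mounted and the directories have their intended owners:
mount | grep xfs                             # expect /dev/sdb on /mnt and /dev/sdc on /mnt2
ls -ld /mnt/hdfs /mnt/walogs /mnt2/mapred    # owned by hdfs, ubuntu, and mapred respectively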
EBS volumes have not been tested, but should work. For larger clusters, EBS volumes are less cost-effective and may not perform as well as local instance storage, because HDFS already provides the reliability features that make EBS's built-in redundancy unnecessary, and because EBS I/O travels over the network.
HDFS, MapReduce, and Accumulo were designed with local storage in mind.
To use EBS volumes, simply attach them to the desired nodes, then format, create directories, and set permissions as above. Make sure these paths are entered in the HDFS and Accumulo configuration files, as outlined below.
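For example, assuming a hypothetical attached device /dev/sdf and mount point /mnt3:
sudo /sbin/mkfs.xfs -f /dev/sdf
sudo mkdir /mnt3
sudo mount -o noatime /dev/sdf /mnt3
sudo mkdir /mnt3/hdfs
sudo chown -R hdfs /mnt3/hdfs
# then add /mnt3/hdfs to dfs.data.dir in hdfs-site.xml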
Add this line to /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
The following tags should be placed in /etc/hadoop/conf/core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://ip-address-of-namenode:9000</value>
</property>
The following tags should be placed in /etc/hadoop/conf/hdfs-site.xml
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hdfs,/mnt2/hdfs</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
The following tags should be placed in /etc/hadoop/conf/mapred-site.xml
<property>
  <name>mapred.job.tracker</name>
  <value>ip-address-of-your-jobtracker:9001</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/mapred,/mnt2/mapred</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
Add the following line to /etc/zookeeper/zoo.cfg:
maxClientCnxns=250
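Once ZooKeeper is restarted (see the start-up steps below), its standard four-letter commands offer a quick health check, for example:
echo ruok | nc localhost 2181    # a healthy server answers "imok"
echo stat | nc localhost 2181    # lists current client connections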
Edit the following lines of accumulo/conf/accumulo-env.sh
test -z "$JAVA_HOME" && export JAVA_HOME="/usr/lib/jvm/java-6-sun/jre/"
test -z "$HADOOP_HOME" && export HADOOP_HOME="/usr/lib/hadoop/"
test -z "$ZOOKEEPER_HOME" && export ZOOKEEPER_HOME="/usr/lib/zookeeper/"
Add the following tags to accumulo/conf/accumulo-site.xml
<property>
  <name>instance.zookeeper.host</name>
  <value>your-zk-servers-internal-dns1:2181,another-zk-server-internal-dns:2181</value>
</property>
<property>
  <name>logger.dir.walog</name>
  <value>/mnt/walogs</value>
</property>
<property>
  <name>instance.secret</name>
  <value>DEFAULT</value>
</property>
<property>
  <name>tserver.memory.maps.max</name>
  <value>1G</value>
</property>
EC2 provides internal domain names for all instances. These should be used wherever servers are referenced.
Place the internal domain name of the Accumulo master into accumulo/conf/masters
Place the internal domain names of the Accumulo tablet servers into accumulo/conf/slaves
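For example, with hypothetical internal hostnames, accumulo/conf/masters would contain a single line:
ip-10-1-2-3.ec2.internal
and accumulo/conf/slaves one hostname per line:
ip-10-1-2-4.ec2.internal
ip-10-1-2-5.ec2.internal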
Edit your instances' security group policy to allow communication on the following ports to and from your internal instances:
2181 Zookeeper
2888 Zookeeper
3888 Zookeeper
4560 Accumulo monitor
9000 HDFS
9001 JobTracker
9997 Tablet Server
9999 Master Server
11224 Accumulo Logger
12234 Accumulo Tracer
50010 DataNode Data
50020 DataNode Metadata
50060 TaskTrackers
50070 NameNode HTTP monitor
50075 DataNode HTTP monitor
50091 ?
50095 Accumulo HTTP monitor
Allowing access to these ports from 10.0.0.0/8 will allow internal access but not access from outside EC2.
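With the EC2 API tools, a rule opening one such port to internal traffic might look like the following (the security group name "accumulo" is hypothetical):
ec2-authorize accumulo -P tcp -p 2181 -s 10.0.0.0/8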
To use the Accumulo start-up scripts (which must be used), set up password-less SSH from the controller node to the other nodes. To generate a new SSH key, run the following and hit enter when prompted for a passphrase to create a password-less key:
ssh-keygen
cd ~/.ssh
This will create a private key and a public key. Append the public key to the authorized_keys file on each TabletServer node:
cat id_rsa.pub >> authorized_keys
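To push the key out from the controller, one approach (using the EC2 key pair you launched with; file names hypothetical) is:
for host in `cat ~/accumulo/conf/slaves`; do
  cat ~/.ssh/id_rsa.pub | ssh -i your-key.pem ubuntu@$host 'cat >> ~/.ssh/authorized_keys'
done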
These steps only need to be performed once.
If not already done, the NameNode must be formatted. On the NameNode run:
sudo -u hdfs hadoop namenode -format
Accumulo must be initialized once HDFS and ZooKeeper are started. See below.
HDFS DataNodes should already be running. If not, on each DataNode run:
sudo /etc/init.d/hadoop-0.20-datanode start
On the NameNode run:
sudo /etc/init.d/hadoop-0.20-namenode start
ZooKeeper should already be running on the machines where the hadoop-zookeeper-server package was installed. You should restart it so ZooKeeper picks up the higher connection limit configured above.
sudo /etc/init.d/hadoop-zookeeper-server stop
sudo /etc/init.d/hadoop-zookeeper-server start
If you plan to run MapReduce, the TaskTrackers should already be started. If not, on each TaskTracker run:
sudo /etc/init.d/hadoop-0.20-tasktracker start
MapReduce will need write access to HDFS. The next step is overkill, but securing HDFS is outside the scope of this tutorial.
sudo -u hdfs hadoop fs -chmod a+rwx /
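A narrower alternative, if blanket permissions are undesirable, is to give each submitting user a home directory in HDFS (username ubuntu assumed):
sudo -u hdfs hadoop fs -mkdir /user/ubuntu
sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu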
The JobTracker must be started. On the JobTracker run:
sudo /etc/init.d/hadoop-0.20-jobtracker start
To accept the SSH host keys, either log into each of your Accumulo nodes once or run:
for i in `cat conf/slaves`; do
  ssh -o "StrictHostKeyChecking no" $i "echo"
done
Initialize Accumulo using the init command. Only do this once.
~/accumulo/bin/accumulo init
You will be prompted to choose an instance name and a root password.
Accumulo ships with some start-up scripts for convenience. These must be used when starting Accumulo for the first time. They will start the Master, TabletServer, Logger, GarbageCollector, Monitor, and Tracer processes:
~/accumulo/bin/start-all.sh
Hadoop and Accumulo provide web interfaces for monitoring health and status. It's often convenient to SSH into an EC2 instance with the -X option and run Firefox on the instance.
ssh -i your-key.pem ubuntu@your-instance.com -X
sudo apt-get install firefox -y
firefox localhost:50095 &
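Alternatively, SSH port forwarding avoids running a remote browser by tunneling the monitor page to your local machine:
ssh -i your-key.pem -L 50095:localhost:50095 ubuntu@your-instance.com
# then browse to http://localhost:50095 locally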
Accumulo (and MapReduce, if running) should always be stopped before HDFS.
~/accumulo/bin/stop-all.sh
In order to suspend instances (assuming they are EBS-backed), Accumulo, MapReduce, ZooKeeper, and HDFS should be stopped prior to suspension, in that order.
~/accumulo/bin/stop-all.sh
sudo /etc/init.d/hadoop-0.20-tasktracker stop; # on all tasktrackers
sudo /etc/init.d/hadoop-0.20-jobtracker stop;
sudo /etc/init.d/hadoop-zookeeper-server stop; # on zookeeper nodes
sudo /etc/init.d/hadoop-0.20-namenode stop;
sudo /etc/init.d/hadoop-0.20-datanode stop; # on all datanodes
On any machine that has the Accumulo and Hadoop config files as described above, the Accumulo shell can be used to create, configure, and inspect tables:
~/accumulo/bin/accumulo shell -u root
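A minimal example session (table name hypothetical): after entering the root password, the following creates a table, writes one cell, and reads it back:
createtable testtable
insert row1 colfam colqual value1
scan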
To begin writing clients to write and read data, see the Client section of the Accumulo 1.3 Manual.