Notes on Spark Servers
======================

Spark EC2 Script
----------------
**tl;dr**: To deploy, use the aasted branch of git-aws.internal.justin.tv/ids/apache-spark; it uses a pre-built image
from https://github.com/mesos/spark-ec2

Preconditions:

- You have created an SSH key using `ssh-keygen -C lumberjack -t rsa -b 4096`
  **with a passphrase** and uploaded the public key to AWS at
  https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#KeyPairs:sort=keyName
- Your machine resolves DNS through the server for the VPC you're launching into,
  for example 10.192.0.4 for the science VPC.
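
A minimal sketch for checking the first precondition, assuming bash and OpenSSH's `ssh-keygen` on your PATH. `key_has_passphrase` is a hypothetical helper name, not part of spark-ec2. It relies on `ssh-keygen -y` being unable to derive the public key with an empty passphrase when the private key is encrypted (what the precondition calls a password).

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if the private key at $1 is passphrase-protected,
# 1 if it is unencrypted, 2 if the file cannot be read.
key_has_passphrase() {
    local key_path="$1"
    [[ -r "$key_path" ]] || { echo "cannot read key: $key_path" >&2; return 2; }
    # With an empty passphrase (-P ""), ssh-keygen -y only succeeds
    # when the private key is stored unencrypted.
    if ssh-keygen -y -P "" -f "$key_path" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

Call it as `key_has_passphrase <key location> && echo "ok"`. A corrupt key file or one with overly open permissions will also make `ssh-keygen -y` fail, so treat a positive result as a hint rather than proof.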


Suppose your uploaded key pair is named `<key name>` in the Amazon console, its private key is stored at
`<key location>` on your local system, and you want to launch a cluster named `<cluster name>`.
To launch a 1 master/9 slave cluster in us-west-2c on spot instances
in the science VPC, with HTTP servers open to anyone on the office network, run:

```
./spark-ec2 -k <key name> -i <key location> \
    -s 9 \
    -t m3.xlarge \
    -r us-west-2 \
    --zone=us-west-2c \
    --vpc-id vpc-0713b162 \
    --subnet-id subnet-ef4ab1b6 \
    --spot-price=0.1 \
    --authorized-address 10.0.0.0/8 \
    launch <cluster name>
```

This can take up to 10 minutes. If it appears to hang, double-check that your machine is using the VPC's DNS server. For larger tasks, the maximum number of slaves you can launch is 99.
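
The launch command can be wrapped so the slave count is checked before anything is sent to AWS. This is only a sketch, assuming bash; `build_launch_cmd` is a hypothetical helper, not part of spark-ec2, and it reuses the region, VPC, subnet, and spot price values from the command above. It prints the command for review instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the spark-ec2 launch command for review.
# Arguments: <key name> <key location> <cluster name> [slave count, default 9]
build_launch_cmd() {
    local key_name="$1" key_path="$2" cluster="$3" slaves="${4:-9}"

    # These notes give 99 as the maximum slave count; accept 1..99 only.
    if ! [[ "$slaves" =~ ^[1-9][0-9]?$ ]]; then
        echo "slave count must be between 1 and 99, got: $slaves" >&2
        return 1
    fi

    local cmd=(./spark-ec2 -k "$key_name" -i "$key_path"
        -s "$slaves" -t m3.xlarge -r us-west-2 --zone=us-west-2c
        --vpc-id vpc-0713b162 --subnet-id subnet-ef4ab1b6
        --spot-price=0.1 --authorized-address 10.0.0.0/8
        launch "$cluster")
    echo "${cmd[*]}"
}
```

For example, `build_launch_cmd mykey ~/.ssh/mykey demo 20` prints the full command with `-s 20`, while a count of 100 is rejected with a non-zero exit status.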

If something goes wrong, you can manually clean up the nodes and security groups in the console.
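
Before cleaning up by hand, it can help to list what is left over. The sketch below assumes the AWS CLI is installed and configured, and that the cluster's security groups follow the upstream spark-ec2 naming of `<cluster name>-master` and `<cluster name>-slaves`; that convention is an assumption about the aasted branch, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Security group names for a cluster, comma-separated for an AWS CLI filter.
# Assumes the <cluster name>-master / <cluster name>-slaves convention.
cluster_group_names() {
    local cluster="$1"
    echo "${cluster}-master,${cluster}-slaves"
}

# List instance IDs and states still attached to the cluster's groups.
# Requires the AWS CLI and valid credentials.
list_cluster_instances() {
    aws ec2 describe-instances \
        --region us-west-2 \
        --filters "Name=instance.group-name,Values=$(cluster_group_names "$1")" \
        --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
        --output text
}
```

Running `list_cluster_instances <cluster name>` prints one instance per line; an empty result suggests only the security groups remain.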

Experiment Results
------------------
The Amazon-provided version of Spark at
https://aws.amazon.com/articles/4926593393724923 seems to be
an order of magnitude slower than the spark-ec2 script's cluster
when tested against a ~20 GB corpus doing line counts; it
took ~300s to do what the vanilla Spark cluster did in 25s.

The spark-ec2 script runs a Hadoop 1.0.6 cluster; it outperformed the
Hadoop 2.0.0 cluster that spark-ec2 deploys when
`--hadoop-major-version` is set to 2. In our experiment using Scala
on a large corpus with a 9 slave cluster, the 1.0.6 version finished
in ~17 minutes, while the Hadoop 2 cluster crashed with an OOM error
after ~19 minutes.


Misc Facts
----------
 - IAM roles cannot be used because Hadoop does not support them before version 2.6.0.
   See https://issues.apache.org/jira/browse/HADOOP-10400
 - us-west-2c has the cheapest spot instances of any AZ, often an order of magnitude cheaper.
   Check the pricing history for m3.xlarge before launching.
 - The version of Spark that Homebrew ships (1.2.1) vendors Boto for the spark-ec2 script,
   which prevents launching in a named VPC. Use the aasted branch instead; we'll work
   on pushing this upstream soon.
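
The pricing history check can also be done from the command line. This is a sketch that assumes the AWS CLI is installed and configured; `recent_spot_prices` and `cheapest_az` are hypothetical helper names. Passing the current time as `--start-time` asks for the prices in effect right now.

```shell
#!/usr/bin/env bash
# Print current m3.xlarge spot prices as "AZ<TAB>price" lines.
# Requires the AWS CLI and valid credentials.
recent_spot_prices() {
    aws ec2 describe-spot-price-history \
        --region us-west-2 \
        --instance-types m3.xlarge \
        --product-descriptions "Linux/UNIX" \
        --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice]' \
        --output text
}

# Read "AZ<TAB>price" lines on stdin and print the line with the lowest price.
cheapest_az() {
    LC_ALL=C sort -t $'\t' -k2,2n | head -n 1
}
```

Use it as `recent_spot_prices | cheapest_az`, and compare the result against the `--spot-price` you plan to bid.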


Building Spark
--------------

```
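# cd to your apache-spark checkout
brew install maven
# Optional: run a zinc server to speed up compilation
# brew install zinc && zinc --start
git checkout branch-1.2
# Build with YARN and Hive support against Hadoop 2.6.0, skipping tests
mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests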
```

Questions
---------
Please feel free to ping me: aasted@twitch.tv
