Single Node Condor and Pegasus on Ubuntu 12.04 on AWS EC2

Recently I needed to do some experiments with the Montage astronomical image mosaic engine, using Pegasus as the workflow management system. This involves setting up a Condor cluster and Pegasus on the submit host, plus several other steps to run Montage in such an environment. After extensive searching on the Internet, I found no good documentation on how to accomplish this complicated task with my favorite Linux distribution, Ubuntu 12.04. So I decided to write a tutorial on the topic, in the hope that it might save someone else's time in the future.

This tutorial includes the following three parts:

Single Node Condor and Pegasus on Ubuntu 12.04 on AWS EC2

Multi Node Condor and Pegasus on Ubuntu 12.04 on AWS EC2

Running Montage with Pegasus on AWS EC2

[STEP 1: Create an EC2 Instance for the Master Host]

Create an EC2 instance with an Ubuntu 12.04 AMI. For testing purposes you might wish to take advantage of spot instances to save money. Use one of the compute-optimized (C3) instance types so that you will see multiple Condor slots on a single EC2 instance. In this document I use c3.xlarge, which has 4 vCPUs.

For a single-node setup, all you need to do with the security group is open port 22 to your IP address so that you can SSH to the instance once it is up.
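
If you prefer the command line over the AWS console, the same setup can be done with the AWS CLI. This is only a sketch: it assumes the CLI is installed and configured, and the AMI ID, key name, and IP address below are placeholders that you need to replace with your own values.

# create a security group that only allows SSH from your own IP
$ aws ec2 create-security-group --group-name condor-single --description "single node condor"
$ aws ec2 authorize-security-group-ingress --group-name condor-single --protocol tcp --port 22 --cidr 203.0.113.10/32
# launch a c3.xlarge from an Ubuntu 12.04 AMI of your choice
$ aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c3.xlarge --key-name yourkey --security-groups condor-single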

When the instance is up and running, SSH to the instance.

ssh -i yourkey.pem ubuntu@ip_of_the_instance

[STEP 2: Install Condor]

Download the latest version of HTCondor (native package) for Ubuntu 12.04 from the following URL. At the time of writing I downloaded condor-8.1.6-247684-ubuntu_12.04_amd64.deb; the actual filename might change over time.

http://research.cs.wisc.edu/htcondor/downloads/

Install Condor using the following commands (if dpkg complains about missing dependencies, the apt-get install -f step will pull them in):

$ sudo dpkg -i condor-8.1.6-247684-ubuntu_12.04_amd64.deb
$ sudo apt-get update
$ sudo apt-get install -f
$ sudo apt-get install chkconfig
$ sudo chkconfig condor on
$ sudo service condor start

Now we should have Condor up and running, and it should start automatically when the system boots. Check the status of Condor using the following commands:

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@ip-10-0-5-11 LINUX      X86_64 Unclaimed Benchmar  0.060 1862  0+00:00:04
slot2@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:05
slot3@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:06
slot4@ip-10-0-5-11 LINUX      X86_64 Unclaimed Idle      0.000 1862  0+00:00:07
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       0         4       0          0        0

               Total     4     0       0         4       0          0        0
$ condor_q


-- Submitter: ip-10-0-5-114.ec2.internal :  : ip-10-0-5-114.ec2.internal
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
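
Before moving on to Pegasus, you can optionally verify that Condor really runs jobs by submitting a trivial test job directly. The submit file below is just my own minimal example, not something that ships with Condor:

$ cat > test.sub <<'EOF'
# minimal vanilla-universe test job: run /bin/echo on one of the slots
universe   = vanilla
executable = /bin/echo
arguments  = "hello from condor"
output     = test.out
error      = test.err
log        = test.log
queue
EOF
$ condor_submit test.sub
$ condor_q        # the job should appear briefly and then leave the queue
$ cat test.out    # should contain: hello from condor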

[STEP 3: Install Pegasus]

Pegasus needs Java (1.6 or higher) and Python (2.4 or higher). Ubuntu 12.04 comes with Python 2.7 but not Java, so we need to install Java first. Pegasus can also optionally use Globus for grid support; we will take care of Globus later.

$ sudo apt-get install openjdk-7-jdk
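
A quick sanity check that the JDK is now on the path:

$ java -version   # should report an OpenJDK 1.7 runtime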

Then we configure the Pegasus APT repository and install Pegasus. First, import the repository's signing key:

$ gpg --keyserver pgp.mit.edu --recv-keys 81C2A4AC
$ gpg -a --export 81C2A4AC | sudo apt-key add -  

Add the following line to /etc/apt/sources.list:

deb http://download.pegasus.isi.edu/wms/download/debian wheezy main
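
If you would rather do this from the shell than with an editor, appending the line with tee should work:

$ echo 'deb http://download.pegasus.isi.edu/wms/download/debian wheezy main' | sudo tee -a /etc/apt/sources.list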

Update the repository and install Pegasus:

$ sudo apt-get update
$ sudo apt-get install pegasus

Now we should have Pegasus installed on the system. Check the installation with the following command. If you see similar output, congratulations!

$ pegasus-status
(no matching jobs found in Condor Q)

Pegasus comes with some examples; we will use these to test the installation further.

$ cd ~
$ cp -r /usr/share/pegasus/examples .
$ cd examples/hello-world
$ ls
dax-generator.py  hello.sh  pegasusrc  submit  world.sh
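
The dax-generator.py script builds the abstract workflow (the DAX) that Pegasus plans and hands over to Condor. The snippet below is not the shipped example verbatim, just a rough sketch of what a two-job generator looks like with the DAX3 Python API that comes with Pegasus:

#!/usr/bin/env python
# rough sketch of a two-job DAX generator (assumes the Pegasus.DAX3 module is importable)
import sys
from Pegasus.DAX3 import ADAG, Job

dax = ADAG("hello_world")

hello = Job(namespace="hello_world", name="hello", version="1.0")
world = Job(namespace="hello_world", name="world", version="1.0")
dax.addJob(hello)
dax.addJob(world)

# run "world" only after "hello" has finished
dax.depends(parent=hello, child=world)

# write the DAX as XML to stdout
dax.writeXML(sys.stdout)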

Run the hello-world example:

$ ./submit 
2014.06.24 10:34:00.455 UTC:   Submitting job(s). 
2014.06.24 10:34:00.460 UTC:   1 job(s) submitted to cluster 1. 
2014.06.24 10:34:00.465 UTC:    
2014.06.24 10:34:00.471 UTC:   ----------------------------------------------------------------------- 
2014.06.24 10:34:00.476 UTC:   File for submitting this DAG to Condor           : hello_world-0.dag.condor.sub 
2014.06.24 10:34:00.481 UTC:   Log of DAGMan debugging messages                 : hello_world-0.dag.dagman.out 
2014.06.24 10:34:00.487 UTC:   Log of Condor library output                     : hello_world-0.dag.lib.out 
2014.06.24 10:34:00.492 UTC:   Log of Condor library error messages             : hello_world-0.dag.lib.err 
2014.06.24 10:34:00.497 UTC:   Log of the life of condor_dagman itself          : hello_world-0.dag.dagman.log 
2014.06.24 10:34:00.503 UTC:    
2014.06.24 10:34:00.508 UTC:   ----------------------------------------------------------------------- 
2014.06.24 10:34:00.513 UTC:    
2014.06.24 10:34:00.519 UTC:   Your workflow has been started and is running in the base directory: 
2014.06.24 10:34:00.524 UTC:    
2014.06.24 10:34:00.530 UTC:     /home/ubuntu/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000 
2014.06.24 10:34:00.535 UTC:    
2014.06.24 10:34:00.540 UTC:   *** To monitor the workflow you can run *** 
2014.06.24 10:34:00.546 UTC:    
2014.06.24 10:34:00.551 UTC:     pegasus-status -l /home/ubuntu/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000 
2014.06.24 10:34:00.556 UTC:    
2014.06.24 10:34:00.562 UTC:   *** To remove your workflow run *** 
2014.06.24 10:34:00.567 UTC:    
2014.06.24 10:34:00.572 UTC:     pegasus-remove /home/ubuntu/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000 
2014.06.24 10:34:00.578 UTC:    
2014.06.24 10:34:01.024 UTC:   Time taken to execute is 1.109 seconds

Check the status of the Pegasus jobs and the Condor queue using the pegasus-status and condor_q commands:
$ pegasus-status
STAT  IN_STATE  JOB                                               
Run      01:05  hello_world-0                                     
Summary: 1 Condor job total (R:1)

$ condor_q


-- Submitter: ip-10-0-5-114.ec2.internal :  : ip-10-0-5-114.ec2.internal
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   ubuntu          6/24 10:34   0+00:01:28 R  0   0.0  pegasus-dagman -f 

1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

After the hello-world workflow has been executed, a trace file (jobstate.log) can be found in the working directory. Workflow-related information is buried several sub-directories deep; in my case the directory is ~/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000. Please note that the last sub-directory is a timestamp that depends on when you submitted the workflow.
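
Once the workflow has completed, the same directory can be fed to Pegasus' reporting tools: pegasus-analyzer reports which jobs succeeded or failed, and pegasus-statistics summarizes runtimes. The commands below are only a sketch; substitute your own timestamped run directory:

$ pegasus-analyzer ~/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000
$ pegasus-statistics -s all ~/examples/hello-world/work/ubuntu/pegasus/hello_world/20140624T103359+0000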

As a bonus for this tutorial, I have prepared an AMI with the above-mentioned setup and made it publicly available to the community. If all you need is a single-node Pegasus + Condor configuration, you don't need to repeat any of the steps above: just launch an EC2 instance with AMI ami-5ee01b36 in the US-EAST-1 (N. Virginia) region. If you need to run this in another region, copy the AMI to the desired region and launch an instance from the copied AMI there.
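
With the AWS CLI, copying the AMI to another region and launching from the copy might look like the following sketch (the target region, image name, and key name are placeholders):

# copy the AMI from us-east-1 to another region (eu-west-1 as an example)
$ aws ec2 copy-image --source-region us-east-1 --source-image-id ami-5ee01b36 --name pegasus-condor-single --region eu-west-1
# launch an instance from the new AMI ID returned by the copy (placeholder below)
$ aws ec2 run-instances --image-id ami-yyyyyyyy --instance-type c3.xlarge --key-name yourkey --region eu-west-1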
