Cluster-Handbook/Torque

From Wikibooks, open books for an open world

Torque

Torque is an open source resource manager based on the original PBS project (http://www.pbsworks.com/). It starts, deletes and monitors jobs, and thereby provides the functions that a scheduler needs in order to manage jobs. Torque ships with its own scheduler (pbs_sched), but you can also use a different one. Torque is flexible enough to be used for space planning, but it is used mostly in clusters. The following sections describe how to install and configure Torque for simple jobs on a cluster. To install the latest version of Torque, do not use the Ubuntu package; use the package from the following website instead: http://www.adaptivecomputing.com/products/open-source/torque/.

Download Torque

Download the files on the master (here we used version 4.1.4).

$ sudo wget http://adaptive.wpengine.com/resources/downloads/torque/torque-4.1.4.tar.gz

Unzip file and navigate to the directory

$ tar -xzvf torque-4.1.4.tar.gz

$ cd torque-4.1.4/

It is best to stay in this directory while configuring and installing.

Configure and install the package on the master

Set Directory

By default, make install installs all files in /usr/local/bin, /usr/local/lib, /usr/local/sbin, /usr/local/include, and /usr/local/man.
You can specify a different folder for the files by appending --prefix=$directoryname to ./configure. If you do not want to change the location, you can skip this step.

Set Library Folder

Create a new file: /etc/ld.so.conf.d/torque.conf

$ sudo nano /etc/ld.so.conf.d/torque.conf

Write the path to the libraries into this file. With the default settings this is /usr/local/lib (if /home had been chosen as the installation directory, it would be /home/lib). Then enter the following command:

$ sudo ldconfig
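
The file can also be written without an editor. The following is a minimal sketch, assuming the default library path /usr/local/lib. The variables CONF_DIR and LIB_DIR are names introduced here for illustration only, and CONF_DIR defaults to a scratch directory so that the snippet can be tried without root rights.

```shell
# Sketch: write the library path into torque.conf without opening an editor.
# CONF_DIR and LIB_DIR are illustrative names, not part of Torque.
# On a real master, set CONF_DIR=/etc/ld.so.conf.d and run this as root.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
LIB_DIR="${LIB_DIR:-/usr/local/lib}"

# The file consists of one line: the directory that holds the Torque libraries.
printf '%s\n' "$LIB_DIR" > "$CONF_DIR/torque.conf"

# Show the result; on a real master, "sudo ldconfig" must be run afterwards.
cat "$CONF_DIR/torque.conf"
```

When the file has been written to /etc/ld.so.conf.d, ldconfig still has to be run once so that the linker cache picks up the new path.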

Perform Configure

To run configure, you first have to install build-essential, libssl-dev and libxml2-dev with this command:

$ sudo apt-get install build-essential libssl-dev libxml2-dev

If you run ./configure, you will get an error saying that libxml2-devel is not installed. This is a bug in Torque and can be fixed with the following steps:
First, two lines in the configure.ac file need to be changed (see screenshot).

$ sudo nano configure.ac

Figure 12.1: Configure Bug Fix


The minus describes the line that needs to be changed, the plus describes how the line should read after the change. It is best to look for a keyword for the line to be changed because the file has a lot of lines.
After that execute autoconf:

$ sudo autoconf

and change the configure file:

sudo nano configure

Figure 12.2: Configure Bug Fix 2


Again, look for the line marked in yellow and, at its end (red rectangle), change the -1 to -l.
Now you can run ./configure, and it should finish without errors.

sudo ./configure

Finally, run make and make install.

sudo make

sudo make install

By default, make install creates the directory /var/spool/torque. This directory is referred to as TORQUE_HOME. There, various subfolders are created that are used to configure and run the program.

Install Torque on the Nodes

Create packages

Torque can create packages that contain the configuration and can then be installed on the nodes. Use the make command for this.

make packages

The packages are stored in torque-4.1.4/ and must be copied from there to a shared directory that the nodes can access. In our case this is the /home directory.
For example:

cp torque-package-mom-linux-i686.sh /home

On the nodes only the mom-linux package is needed. All others are optional.

Install Package

On the node you navigate to the directory in which you have copied the package and install it with the following command:

./torque-package-mom-linux-i686.sh --install
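
With several nodes, the installation can be repeated in a loop from the master. The following is only a sketch: it assumes SSH access to the nodes and sudo rights there, the host names node01 and node02 are hypothetical, and NODES, PKG and DRY_RUN are variables introduced here for illustration. By default it only prints the commands it would run.

```shell
# Sketch: install the mom package on several nodes over SSH.
# NODES, PKG and DRY_RUN are illustrative names; node01 and node02 are hypothetical hosts.
NODES="${NODES:-node01 node02}"
PKG="${PKG:-/home/torque-package-mom-linux-i686.sh}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

install_on_nodes() {
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            # Dry run: show what would be executed on each node.
            echo "ssh $node sudo $PKG --install"
        else
            # Real run: execute the package installer on the node.
            ssh "$node" "sudo $PKG --install"
        fi
    done
}

install_on_nodes
```

Setting DRY_RUN=0 switches from printing to executing, which makes it easy to check the node list before anything is changed.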

Configure Torque

Initialise serverdb

The directory TORQUE_HOME/server_priv contains the configuration and information that the pbs_server service uses. To initialise the serverdb file, run the following command, passing the name of the user who is to administer Torque:

sudo ./torque.setup <username>

Then the pbs_server needs a restart.

sudo qterm

sudo pbs_server

The server properties can be displayed with the following command:

sudo qmgr -c 'p s'

Specify Nodes

The pbs_server has to know which computers in the network are nodes. To tell it, create a new file named nodes in the directory TORQUE_HOME/server_priv:

sudo nano nodes

In this file, the nodes are listed by name. Normally it is sufficient to write just the names into the file, but you may also set special properties for each node. The syntax is:
NodeName[:ts] [np=] [gpus=] [properties]
[:ts] This option marks the node as timeshared. Such nodes are listed by the server but are not allocated any jobs.
[np=] This option specifies how many virtual processors the node has.
[gpus=] This option specifies how many GPUs the node has.
[properties] This option lets you enter a name that identifies the node. It must start with a letter.
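
As an illustration of this syntax, the following sketch writes a nodes file with two entries. The host names node01 and node02, the np values and the property bigmem are hypothetical, and SERVER_PRIV is a variable introduced here that defaults to a scratch directory. On a real master it would point to TORQUE_HOME/server_priv.

```shell
# Sketch: create a nodes file with two entries.
# node01, node02, the np values and the property "bigmem" are hypothetical.
# SERVER_PRIV is an illustrative variable; on a real master it would be
# TORQUE_HOME/server_priv (by default /var/spool/torque/server_priv).
SERVER_PRIV="${SERVER_PRIV:-$(mktemp -d)}"

# One line per node: name, number of virtual processors, optional property.
cat > "$SERVER_PRIV/nodes" <<'EOF'
node01 np=4
node02 np=8 bigmem
EOF

cat "$SERVER_PRIV/nodes"
```
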
The number of processors can also be detected automatically:

sudo qmgr -c 'set server auto_node_np = True'

This sets the server attribute auto_node_np to True.

Configure Nodes

To configure the nodes, the file config in the directory TORQUE_HOME/mom_priv has to be created:

sudo nano config

This file is identical on all nodes and should read as follows:

Figure 12.3: Config file

Furthermore, the line $usecp *:/home /home must be written into it. This ensures that the output files of finished jobs are stored in a specific directory (here the shared /home). Otherwise the following error will occur when running the command tracejob:

Figure 12.4: Tracejob error
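
The config file described in this section could be written as in the following sketch. Only the $usecp line is taken from the text above. The $pbsserver and $logevent lines are common pbs_mom settings that are assumed here and may differ from the file shown in the figure. The host name master is hypothetical, and MOM_PRIV and SERVER_NAME are variables introduced for illustration.

```shell
# Sketch: write a pbs_mom config file.
# MOM_PRIV and SERVER_NAME are illustrative names; "master" is a hypothetical host.
# On a real node MOM_PRIV would be TORQUE_HOME/mom_priv.
MOM_PRIV="${MOM_PRIV:-$(mktemp -d)}"
SERVER_NAME="${SERVER_NAME:-master}"

# \$ keeps the dollar sign literal; only $SERVER_NAME is expanded by the shell.
cat > "$MOM_PRIV/config" <<EOF
\$pbsserver $SERVER_NAME
\$logevent 255
\$usecp *:/home /home
EOF

cat "$MOM_PRIV/config"
```

After the file has been changed on a node, pbs_mom has to be restarted there so that it reads the new settings.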


Execute Job

Run Services

For a job to be executed, at least four services must be started. On the master these are pbs_server, pbs_sched and trqauthd. On the nodes it is pbs_mom:

sudo pbs_server

sudo pbs_sched

sudo trqauthd

sudo pbs_mom
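
Whether the services are actually running can be checked with a small helper. This is a sketch that assumes the pgrep tool is available. The function name check_daemon is introduced here and is not part of Torque.

```shell
# Sketch: report whether a daemon with the given process name is running.
# check_daemon is an illustrative helper, not a Torque command.
check_daemon() {
    # pgrep -x matches the process name exactly; its output is discarded.
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: not running"
    fi
}

# Services expected on the master; on a node, check pbs_mom instead.
for daemon in pbs_server pbs_sched trqauthd; do
    check_daemon "$daemon"
done
```
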

Run Job

Figure 12.5: Bash file example


The command qsub [file name], executed on the master, starts a job. To run a job, you need a Bash file. The example above prints the date, waits 10 seconds and then prints the date again. The result is stored in the directory on the master from which the job was started.
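
A job file with the behaviour described here could look like the one created by the following sketch. The file name date_job.sh, the job name set with #PBS -N and the variable JOB_DIR are assumptions made for illustration. The original file may differ in detail.

```shell
# Sketch: create a job script that prints the date, waits 10 seconds
# and prints the date again.
# JOB_DIR and date_job.sh are illustrative names; JOB_DIR defaults to a scratch directory.
JOB_DIR="${JOB_DIR:-$(mktemp -d)}"

cat > "$JOB_DIR/date_job.sh" <<'EOF'
#!/bin/bash
#PBS -N date_job
date
sleep 10
date
EOF

echo "Job script written to $JOB_DIR/date_job.sh"
# On the master, the job would then be submitted with:
#   qsub "$JOB_DIR/date_job.sh"
```
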

Useful Commands

Torque provides some commands with which you can trace running jobs and which are very useful for troubleshooting.

The command

pbsnodes -a

executed on the master, shows whether a node is active or not. The command

qstat

displays a list of running or finished jobs.

Figure 12.6: qstat display


There you can see the number of each job, which node is used, and whether the job has been started, is in progress or has already ended.
A very useful command for debugging is

tracejob [job number]

This Torque command searches and summarizes the log files of pbs_server, mom and the scheduler, which gives a quick overview.
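
On clusters with many nodes, the output of pbsnodes -a mentioned above can become long. The following sketch condenses it to one line per node. The function name node_states is introduced here, and the sample text only imitates the layout of the pbsnodes output with hypothetical hosts.

```shell
# Sketch: condense "pbsnodes -a" style output to one "name state" line per node.
# node_states is an illustrative helper, not a Torque command.
node_states() {
    awk '
        /^[^ \t]/          { node = $1 }        # unindented line: node name
        /^[ \t]+state = /  { print node, $3 }   # indented state line: print name and state
    '
}

# Illustrative sample in the layout that pbsnodes prints; the hosts are hypothetical.
sample='node01
     state = free
     np = 4

node02
     state = down
     np = 4'

printf '%s\n' "$sample" | node_states
# On the master, the real output would be filtered with: pbsnodes -a | node_states
```

For the sample text this prints node01 free and node02 down, each on its own line.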