CiGri User Documentation

Authors: Bruno Bzeznik
Ghislain Charrier
Contact: cigri-devel@lists.gforge.inria.fr
Organization: LIG laboratory
Address:
Laboratoire d'Informatique de Grenoble
Bat. ENSIMAG - antenne de Montbonnot
ZIRST 51, avenue Jean Kuntzmann
38330 MONTBONNOT SAINT MARTIN
Status: Testing
Copyright: Licensed under the GNU General Public License

Abstract

Cigri is a tool for submitting multiparametric jobs on a lightweight computing grid. It is built to run on top of a set of clusters managed by the OAR resource and job management system.

Dedication: For users.

1   Cigri Tour

1.1   General Presentation

Cigri is a campaign management tool. It is designed to run on top of multiple clusters, each managed by a batch scheduler.

1.2   Campaigns

A campaign is a set of jobs to be executed. In our context, all the jobs of a campaign are similar: they run the same executable with different parameters. A typical Monte-Carlo campaign, where each job uses a different seed for its random generator, could be schematized by:

for i in 0..1000000
  program.exe i
end

1.3   Cigri Features

Cigri's features include, among others:

  • Management of multiple campaigns
  • Multiple users
  • Different campaign types
  • Automatic resubmission

TODO

1.4   Campaign types

Cigri distinguishes four types of campaigns:

  • Normal campaigns: with this type of campaign, Cigri submits jobs to the batch schedulers as normal (non best-effort) submissions. Normal campaigns are the best for users because the jobs are guaranteed to get the requested time. However, because the primary role of Cigri is to use idle resources with minimum impact on the other users, this type of campaign will most likely require authorization from the admins.
  • Best-effort campaigns: this type of campaign submits jobs to the batch scheduler in best-effort mode. This means that when resources are needed by a non best-effort job, the campaign job is killed and has to be resubmitted later. This type of campaign can take advantage of idle resources without disturbing the platform. However, because jobs are likely to be killed, it is better if they are small or checkpointable.
  • Semi-best-effort campaigns: a semi-best-effort campaign is a mix of the two previous policies. During the day, jobs are submitted in best-effort mode; during the night, normal submissions are used. This ensures that job execution progresses during the night.
  • Nightly campaigns: for some kinds of jobs (long and parallel ones, for example), trying to execute them in best-effort mode has no purpose, as they would get killed most of the time and resources would just be wasted. For this kind of job, it is better to use only normal submissions during the night, in order to leave resources to the other users of the platform during the day.
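
The four policies above can be summarized as a small decision function. The following Python sketch is illustrative only and is not Cigri's implementation: the names `is_night` and `submission_mode` are hypothetical, and the night window (20:00 to 08:00) is an assumption, since this documentation does not define when the night starts or ends.

```python
from typing import Optional

# Hypothetical night window; the documentation does not define these hours.
NIGHT_START = 20
NIGHT_END = 8


def is_night(hour: int) -> bool:
    """Return True if the hour (0-23) falls in the assumed night window."""
    return hour >= NIGHT_START or hour < NIGHT_END


def submission_mode(campaign_type: str, hour: int) -> Optional[str]:
    """Return 'normal', 'best-effort', or None when nothing is submitted."""
    if campaign_type == "normal":
        return "normal"
    if campaign_type == "best-effort":
        return "best-effort"
    if campaign_type == "semi-best-effort":
        # Best-effort during the day, normal submissions during the night.
        return "normal" if is_night(hour) else "best-effort"
    if campaign_type == "nightly":
        # Normal submissions during the night only.
        return "normal" if is_night(hour) else None
    raise ValueError(f"unknown campaign type: {campaign_type}")
```

With these assumed hours, `submission_mode("nightly", 12)` returns `None`: a nightly campaign submits nothing at noon.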

2   Job Description Language (JDL)

To describe a campaign, we use a Job Description Language (JDL). The JDL is based on JSON [1].

[1] See http://www.json.org/ for more information about JSON

The JDL has 2 main parts:

  1. The global settings
  2. The cluster settings

Emphasized values correspond to the default.

Attributes followed by a "*" are mandatory.

2.1   Global Settings

  • name*: Name of the campaign
  • clusters*: list of the clusters where the campaign should run. See Cluster Settings
  • param_file: path to the file containing all the parameters to run for the campaign
  • nb_jobs: number of jobs
  • params: array of parameters
  • jobs_type:
    • normal: jobs using the param_file or nb_jobs
    • desktop_computing: jobs always launched with the same parameters
  • Any field described in Cluster Settings

Note

  • If jobs_type is not desktop_computing, then one of param_file, nb_jobs or params is mandatory
  • nb_jobs is just syntactic sugar, equivalent to a param_file containing a number from 0 to nb_jobs on each line
  • If param_file or nb_jobs is given, it is converted into params. These two attributes only exist to make submissions easier.
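
The conversion described in these notes can be sketched as follows. This is a minimal Python illustration, not Cigri's code: the function name `expand_params` is hypothetical, the precedence between the three attributes is an assumption, and so is the choice of numbering the jobs from 0 to nb_jobs - 1 (the text does not say whether the upper bound is included).

```python
from typing import Any, Dict, List


def expand_params(jdl: Dict[str, Any]) -> List[str]:
    """Return the parameter list implied by params, param_file or nb_jobs."""
    if "params" in jdl:
        return [str(p) for p in jdl["params"]]
    if "param_file" in jdl:
        with open(jdl["param_file"]) as f:
            # One parameter line per job; blank lines are skipped (assumption).
            return [line.strip() for line in f if line.strip()]
    if "nb_jobs" in jdl:
        # Assumption: nb_jobs jobs, numbered 0 .. nb_jobs - 1.
        return [str(i) for i in range(jdl["nb_jobs"])]
    if jdl.get("jobs_type") == "desktop_computing":
        # Desktop computing jobs need no parameter list.
        return []
    raise ValueError("one of param_file, nb_jobs or params is mandatory")
```

Under these assumptions, `expand_params({"nb_jobs": 3})` yields the three parameters "0", "1" and "2".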

2.2   Cluster Settings

Settings in this section can also be defined in the global section, where they act as default values for all clusters.

  • type: Values other than best-effort may require approval from platform admins
    • best-effort: jobs are executed day and night as best-effort
    • semi-best-effort: jobs are executed as best-effort during the day and as normal submissions during the night
    • nightly: jobs are only executed as normal submissions during the night
    • normal: jobs are executed as normal submissions during the day and the night
  • walltime: maximum duration of the jobs
    • Default defined in Cigri configuration file
  • exec_file*: script to execute
  • exec_directory: path to the directory in which jobs are executed
    • Default: $HOME
  • resources: resources requested from the underlying batch scheduler (-l in OAR)
    • Default: /<resource_unit>=1. resource_unit is defined per cluster and can therefore differ between two clusters. Users should set this field.
  • properties: properties passed to OAR to select resources
  • prologue: commands that are executed before the first job on each cluster
  • epilogue: commands that are executed at the end of a campaign
  • prologue_walltime: specific walltime for the prologue
  • epilogue_walltime: specific walltime for the epilogue
  • output_gathering_method: method to use to gather results in a single place
    • None
    • iRods: files will be put in iRods at the end of the execution
    • collector: a collector runs regularly to gather files
    • scp: a simple scp is performed on the output files after completion
  • output_file: file or directory to save
  • output_destination: some server (not used with iRods) where output files will be gathered
  • dimensional_grouping: allows several jobs to be executed in parallel in a single submission if possible
    • true
    • false
  • temporal_grouping: allows several jobs to be executed one after the other in a single submission. The number of jobs is computed automatically by Cigri
    • true
    • false
  • checkpointing_type:
    • None
    • BLCR
    • ...
  • test_mode: when test_mode is enabled, only one job per active cluster is submitted, in normal mode even if best-effort is enabled. The jobs of such a campaign are also executed before those of other campaigns. This allows a campaign to be tested without sending all the jobs and with less waiting.
    • true
    • false
  • max_jobs: limits the number of jobs submitted for the current campaign on the cluster. This is useful when, for example, your jobs do a lot of I/O and may crash distributed filesystems if too many instances are running at once.
    • None
    • <integer>

Note

  • resources: if several types of resources are requested, the default resources (nodes, cpus, cores, ...) MUST BE first. Example: "resources": "nodes=3+other_type_of_resource=2"
  • dimensional_grouping: enabling this feature will speed up execution; however, jobs must not write to common files
  • dimensional_grouping: should be activated for jobs requiring a small number of resources (typically, one core)
  • temporal_grouping: should be activated for short jobs (typically less than 5 minutes).
  • output_gathering_method is defined

2.3   Example of JDL

Here is an example of a JDL file described in JSON:

{
  "name": "Some campaign",
  "nb_jobs": 2,
  "resources": "nodes=1",
  "exec_file": "$HOME/script.sh",
  "output_gathering_method": "scp",
  "output_destination": "my.dataserver.fr",
  "clusters": {
    "tchernobyl": {
    },
    "my.other_cluster.fr": {
    },
    "fukushima": {
      "exec_file": "$HOME/path/script"
    }
  }
}
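
In this example, resources and exec_file are given globally and fukushima overrides exec_file. The Python sketch below illustrates one way to read that inheritance. It is not Cigri's implementation: the function `per_cluster_settings` and the `GLOBAL_ONLY` set are hypothetical, and the example JDL is shortened.

```python
import json
from typing import Any, Dict

# Keys assumed to be global-only rather than per-cluster defaults.
GLOBAL_ONLY = {"name", "clusters", "param_file", "nb_jobs", "params", "jobs_type"}


def per_cluster_settings(jdl: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Apply global values as defaults, then the per-cluster overrides."""
    defaults = {k: v for k, v in jdl.items() if k not in GLOBAL_ONLY}
    return {
        cluster: {**defaults, **overrides}
        for cluster, overrides in jdl["clusters"].items()
    }


jdl = json.loads("""
{
  "name": "Some campaign",
  "nb_jobs": 2,
  "resources": "nodes=1",
  "exec_file": "$HOME/script.sh",
  "clusters": {
    "tchernobyl": {},
    "fukushima": {"exec_file": "$HOME/path/script"}
  }
}
""")

settings = per_cluster_settings(jdl)
print(settings["tchernobyl"]["exec_file"])  # global value is inherited
print(settings["fukushima"]["exec_file"])   # cluster value takes precedence
```

Both clusters end up with "resources": "nodes=1", while only fukushima runs $HOME/path/script.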

3   Client tools

This chapter describes the client tools available to users for interacting with the grid. Most of the CLI tools (gridsub, gridstat, gridevents, gridnotify, ...) print a minimal help message with the -h option.

3.1   gridsub

The gridsub command is used for submitting new job campaigns or adding jobs to a running campaign.

3.2   gridstat

The gridstat command is used to get information about the campaigns and the jobs. It may also be used to fetch some output files from the clusters.

3.3   gridnotify

Users must use this command to set up their notification preferences.

3.4   gridevents

This command is used to manage events. It can list the events of a given campaign and fix them. When used to fix events, it can also be asked to trigger an automatic resubmission.

3.5   griddel

This command is used to delete, suspend and resume campaigns.

3.6   gridclusters

This command shows useful information about the clusters: their names, their stress status and their usage. It can display colored bar graphs of the current usage of the grid.

4   REST API

Cigri offers a REST API accessible through HTTP.

4.1   URLs

HTTP request  URL                                     Purpose
GET           /                                       List the available links
GET           /clusters                               List all clusters available in Cigri
GET           /clusters/<cluster_id>                  Get details on a specific cluster
GET           /campaigns                              List all running campaigns
GET           /campaigns/<campaign_id>                Get details on a specific campaign
GET           /campaigns/<campaign_id>/jdl            Get the expanded JDL of a campaign
GET           /campaigns/<campaign_id>/jobs           List all jobs of a specific campaign (see API options)
GET           /campaigns/<campaign_id>/jobs/<job_id>  Get details of a specific job of a specific campaign
POST          /campaigns                              Submit a new campaign
PUT           /campaigns/<campaign_id>                Update a campaign (status, name)
DELETE        /campaigns/<campaign_id>                Delete a campaign
GET           /campaigns/<campaign_id>/events         List the open events for the given campaign
DELETE        /campaigns/<campaign_id>/events         Fix (close) all the events for the given campaign
GET           /notifications                          List notification subscriptions for the current user
POST          /notifications/mail                     Subscribe to the mail notification service
POST          /notifications/jabber                   Subscribe to the jabber notification service
DELETE        /notifications/<mail|jabber>            Unsubscribe from a notification service
GET           /events/<id>                            Get a specific event
DELETE        /events/<id>                            Fix (close) a specific event
DELETE        /events/<id>?resubmit                   Fix (close) a specific event and resubmit the job
GET           /gridusage                              Get the current usage state of the grid
GET           /gridusage?from=<date>&to=<date>        Get usage stats between two dates (Unix timestamps)
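
Since /gridusage expects its two dates as Unix timestamps, a client has to convert them first. The following Python sketch shows one way to do so. The function name `gridusage_url` and the host and port in the usage lines are placeholders, and the sketch assumes the timestamps are expressed in seconds.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def gridusage_url(base_url: str, start: datetime, end: datetime) -> str:
    """Build a /gridusage URL with from/to given as Unix timestamps."""
    query = urlencode({
        "from": int(start.timestamp()),  # seconds since the epoch (assumed unit)
        "to": int(end.timestamp()),
    })
    return f"{base_url}/gridusage?{query}"


url = gridusage_url(
    "http://api-host:8080",  # placeholder host and port
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
)
print(url)
```

Using timezone-aware datetimes avoids an implicit dependency on the local timezone of the client machine.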

4.2   Accessing the API

Getting the links available on the server:

$ curl http://api-host:port
{"links":[{"href":"/","rel":"self"},{"href":"/campaigns","title":"campaigns","rel":"campaigns"},{"href":"/clusters","title":"clusters","rel":"clusters"}]}

When a campaign is posted, a JSON document containing the ID of the submitted campaign is returned:

$ curl -X POST http://api-host:port/campaigns -d '{"name":"n", "nb_jobs":0,"clusters":{"fukushima":{"exec_file":""}}}'
{"id":"585","links":[{"href":"/campaigns/585","rel":"self"},{"href":"/campaigns","rel":"parent"}]}
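
The same submission can be scripted. The sketch below uses only the Python standard library. The helper names `build_campaign_request` and `submit_campaign` are hypothetical, the host and port are placeholders, and authentication (which this documentation does not describe) is not handled.

```python
import json
import urllib.request


def build_campaign_request(base_url: str, jdl: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST /campaigns request carrying the JDL."""
    return urllib.request.Request(
        url=f"{base_url}/campaigns",
        data=json.dumps(jdl).encode("utf-8"),
        method="POST",
    )


def submit_campaign(base_url: str, jdl: dict) -> str:
    """Send the request and return the campaign ID from the JSON answer."""
    with urllib.request.urlopen(build_campaign_request(base_url, jdl)) as resp:
        return json.load(resp)["id"]


jdl = {"name": "n", "nb_jobs": 2,
       "clusters": {"cluster1": {"exec_file": "toto.sh"}}}
req = build_campaign_request("http://api-host:8080", jdl)  # placeholder host
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it possible to inspect the method, URL and body without contacting a server.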

4.3   Return codes

Each action performed through the API returns an HTTP status code in the response. The codes are listed here:

Code  HTTP request            Meaning
200   GET                     Request successful: everything went well :)
201   POST                    Resource created: the campaign has been submitted
202   PUT, DELETE             Accepted: modifications done
400   POST, PUT               Bad request: see the body of the answer for details
403   POST, PUT, DELETE       Forbidden: see response for details
404   GET, POST, PUT, DELETE  Page not found: the URL does not exist

Examples:

$ curl -i http://api-host:port
  HTTP/1.1 200 OK
$ curl -i -X DELETE http://api-host:port/campaigns/1
  HTTP/1.1 403 Forbidden
$ curl -i -X POST http://api-host:port/campaigns -d '{"name":"n", "nb_jobs":2,"clusters":{"cluster1":{"exec_file":"toto.sh"}}}'
  HTTP/1.1 201 Created
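
A client script may want to turn these codes into errors. The Python sketch below is one possible way to do it. The meanings are copied from the return-code table, while the helper `check_status` and its behaviour (raising on any code of 400 or above) are assumptions rather than part of Cigri.

```python
# Meanings copied from the return-code table; the helper is hypothetical.
CODE_MEANINGS = {
    200: "Request successful",
    201: "Resource created",
    202: "Accepted",
    400: "Bad request",
    403: "Forbidden",
    404: "Page not found",
}


def check_status(code: int, body: str = "") -> str:
    """Return the meaning of a success code, raise on an error code."""
    meaning = CODE_MEANINGS.get(code, "Undocumented code")
    if code >= 400:
        # For 400 and 403, the response body carries the details.
        detail = f": {body}" if body else ""
        raise RuntimeError(f"{code} {meaning}{detail}")
    return meaning
```

For instance, `check_status(201)` returns "Resource created", whereas `check_status(403, "...")` raises a RuntimeError that includes the response body.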

4.4   API options

Options that can be passed in the URL, with their default value in parentheses:

  • pretty (false): displays the returned JSON in a more readable (but larger) format. Pretty printing is off only when the option is omitted or set to false:

    $ curl 'http://api-host:port?pretty' => pretty print on
    $ curl 'http://api-host:port?pretty=true' => pretty print on
    $ curl 'http://api-host:port?pretty=whatever' => pretty print on
    $ curl 'http://api-host:port' => pretty print off
    $ curl 'http://api-host:port?pretty=false' => pretty print off
    
  • limit (100) and offset (0): some resources may contain many items, so only a subset of them is displayed:

    $ curl 'http://api-host:port/campaigns/<campaign_id>/jobs' => display the first 100 jobs
    $ curl 'http://api-host:port/campaigns/<campaign_id>/jobs?limit=23' => display the first 23 jobs
    $ curl 'http://api-host:port/campaigns/<campaign_id>/jobs?limit=12&offset=50' => display 12 jobs, starting at offset 50
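
To walk through all the jobs of a large campaign, limit and offset can be combined page by page. The Python sketch below only builds the URLs. The function name `job_page_urls` is hypothetical, the host is a placeholder, and the total number of jobs is assumed to be known by the caller.

```python
from typing import Iterator
from urllib.parse import urlencode


def job_page_urls(base_url: str, campaign_id: int, total: int,
                  limit: int = 100) -> Iterator[str]:
    """Yield one jobs URL per page of `limit` items, up to `total` jobs."""
    for offset in range(0, total, limit):
        query = urlencode({"limit": limit, "offset": offset})
        yield f"{base_url}/campaigns/{campaign_id}/jobs?{query}"


# Placeholder host, campaign ID and job count.
for url in job_page_urls("http://api-host:8080", 585, total=250):
    print(url)
```

With 250 jobs and the default limit of 100, three URLs are produced, with offsets 0, 100 and 200.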
    

5   Cigri Changelog

5.1   version 0.0.1:

  • nothing changed, because there was nothing before