Tool Integration Short Tutorial


Galaxy is an open, web-based platform designed to provide easy access to a versatile tool-box and preconfigured workflows.

This page describes a short version of the best tool integration practice recommendations elaborated by the Institut Français de Bioinformatique Galaxy working group.

Best practices for writing 'Galaxifiable' programs/scripts

A few guidelines can be followed when developing a program/script with Galaxy integration in mind:

  • Input and output files should be passed through arguments (preferably long ones, e.g. "--input [input]").
  • On errors and warnings, the program should exit with a positive exit code (with distinct values if possible) and print a meaningful error message.
  • Tool purpose and parameters should be documented in detail and available through the help or the manual.
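
As an illustration of these guidelines, here is a minimal sketch of a script written with Galaxy integration in mind. The toy task (upper-casing a file), the option names and the exit code values are assumptions chosen for the example, not values imposed by Galaxy.

```python
import argparse
import sys

# Hypothetical exit codes: one distinct positive value per failure type.
# (argparse itself already exits with code 2 on a usage error.)
EXIT_OK = 0
EXIT_INPUT_ERROR = 3
EXIT_OUTPUT_ERROR = 4


def build_parser():
    # The description and per-option help make the tool self-documenting (--help).
    parser = argparse.ArgumentParser(
        description="Toy example: copy a text file, upper-casing its content."
    )
    parser.add_argument("--input", required=True, help="path to the input file")
    parser.add_argument("--output", required=True, help="path to the output file")
    return parser


def main(argv):
    args = build_parser().parse_args(argv)
    try:
        with open(args.input) as handle:
            content = handle.read()
    except OSError as error:
        print(f"ERROR: cannot read input file: {error}", file=sys.stderr)
        return EXIT_INPUT_ERROR
    try:
        with open(args.output, "w") as handle:
            handle.write(content.upper())
    except OSError as error:
        print(f"ERROR: cannot write output file: {error}", file=sys.stderr)
        return EXIT_OUTPUT_ERROR
    return EXIT_OK
```

A real script would typically end with sys.exit(main(sys.argv[1:])) so that the returned value becomes the process exit code.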

Prerequisites to integration

Before integrating a tool in Galaxy, there are a few prerequisites:

  • The tool has to be:
    • installed on your computing hardware
    • tested via classic console runs
    • available to the Galaxy user
  • Dependencies also have to be installed on your computing hardware
  • You must have access to a Galaxy development instance (if possible with restart rights)

Wrapper writing. Do I need a wrapper ?

Several cases call for the development of an additional wrapper:

  • Input files are not passed through arguments but loaded based on predetermined fixed names: the wrapper can create symbolic links between the files and the predetermined fixed names in the current working directory (cwd).
  • Output files with conditional names are not passed through arguments: the wrapper can create symbolic links between the files and predetermined fixed names in the cwd.
  • Using R functions: the wrapper can be used for library loading, data preprocessing and argument parsing.
  • Multiple command lines: a simple bash wrapper can be used to pass arguments and call multiple commands.
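
The symbolic link approach for input files with fixed names could be sketched as follows in a Python wrapper. The function name link_fixed_names and the file names used with it are hypothetical; the sketch only covers the linking step, not the call to the tool itself.

```python
import os


def link_fixed_names(links, workdir="."):
    """Create symbolic links named after the fixed file names a tool expects.

    `links` maps each fixed name (e.g. "input.txt") to the real path
    of the file supplied on the command line.
    """
    created = []
    for fixed_name, real_path in links.items():
        link_path = os.path.join(workdir, fixed_name)
        # Replace a stale link left over from a previous run.
        if os.path.islink(link_path):
            os.remove(link_path)
        os.symlink(os.path.abspath(real_path), link_path)
        created.append(link_path)
    return created
```

After the links are created, the wrapper would run the tool from the same working directory, for example with subprocess.run(..., cwd=workdir).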

XML writing

The main step to integrate a tool in Galaxy consists in writing its XML description file. This file has to be put in the ~/tools directory of your Galaxy instance.

An empty template can be found here: Template.xml

and a very complete example can be found here: Tutorial.xml

A few definitions

As in any standard XML document, the tool description file consists of nested elements opening with a start-tag (e.g. "<toto>") and closing with an end-tag (e.g. "</toto>"). Attributes in the start-tag allow more or less complex configuration (e.g. "<toto attribute1='string' attribute2='42' >"); attribute values must always be quoted. The content is the space between the start-tag and the end-tag. An empty-element tag is self-sufficient and has neither content nor end-tag (e.g. "<toto name='Toto' />").

<tool>

The <tool> element is the master element. Every other element is inside the <tool> element content. The tool XML description file begins with:

<tool id="tool_id" name="Tool Name" version="1.0.0">

and ends with:

</tool>

The complete description of all required and optional attributes can be found on the Galaxy wiki[1].

<requirements>

The <requirements> element is optional. It defines a list of third-party tools, binaries, modules, or ToolShed packages required for the tool to work. Each requirement is checked when Galaxy starts, and the tool is not loaded if one is missing.

An individual requirement can be defined with the sub-element <requirement> (note the missing "s").

For example:

<requirements>
    <requirement type="binary">perl</requirement>
    <requirement type="python-module">argparse</requirement>
</requirements>

More info on the Galaxy wiki[2].

<version_command>

<command>

The <command> element contains the single command line that will be run by Galaxy. Multiple command lines cannot be run directly from the XML description file and require an additional wrapper (#Wrapper writing. Do I need a wrapper ?).


Some examples of interpreters: bash, python, perl, Rscript ...

The interpreter attribute should be set to the executable/wrapper language following a few rules:

  • Executable in $PATH: the interpreter doesn't have to be set.
<command>
    tool_binary
</command>
<command>
    script.py
</command>
  • Executable NOT in $PATH, compiled executable: the interpreter doesn't have to be set; use the full path.
<command>
    /path/to/tool_binary
</command>
  • Executable NOT in $PATH, interpreted executable/wrapper in the same directory as the XML description file: set the interpreter.
<command interpreter="python">
    script.py
</command>
  • Executable NOT in $PATH, interpreted executable/wrapper in a different directory from the XML description file: set the interpreter and use the full path.
<command interpreter="python">
    /path/to/script.py
</command>

Arguments management

The executable input parameters are defined in the <inputs> element. Their values can be used in the command line by preceding the parameter's name with a dollar sign ($).

<command>
    tool_binary --input $input --output $output
</command>

To access a parameter defined inside conditional blocks, it has to be referred to with the name of every enclosing conditional block.

$conditional1.conditional2.param

Command special syntax

A comment begins with a double hash sign (##):

<command>
    ## This is a comment
    tool_binary
</command>

Python-like flow control (the <command> content is processed as a Cheetah template) is written using single hash signs "#":

  • the if-then-else conditional statement
#if str($input) == "test.txt":
    dummy command
#else:
    dummier command
#end if
  • a for loop for iterable parameters
#for $r in $repeat:
    --param $r
#end for

Every line break in the <command> element content is converted to a space at runtime by Galaxy. As a consequence, the resulting command becomes a single line.
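
The effect of this conversion can be approximated with a few lines of Python. This is only a rough emulation written for illustration, not Galaxy's actual code; the function name flatten_command and the example command are assumptions.

```python
def flatten_command(command_text):
    """Approximate how a multi-line <command> body ends up on a single line."""
    # Each line break becomes a space; leading and trailing blanks are trimmed.
    return command_text.replace("\r", " ").replace("\n", " ").strip()


command = """
    tool_binary
        --input input.dat
        --output output.dat
"""

one_line = flatten_command(command)
# The shell splits the single line on whitespace, giving these arguments:
print(one_line.split())  # ['tool_binary', '--input', 'input.dat', '--output', 'output.dat']
```

One practical consequence is that a command cannot rely on line breaks to separate several shell commands, which is why multiple command lines need a wrapper.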

<inputs>

The <inputs> element allows defining the tool input files and parameters. Most inputs are defined using empty-element tags and various attributes.

Files

Input files are defined as data type parameters:

<inputs>
    <param name="input" type="data" format="tabular" optional="false" label="Input file" help="Some help" />
</inputs>

The format attribute can be set to a predefined Galaxy datatype (txt, fasta, pdf, bam, ...). By setting it to format="data" any file format will be accepted. This configuration can be useful if the input format is unknown, not defined in Galaxy datatypes, or you encounter problems with the predefined datatype methods.

Numbers (integers and float)

Integer and float parameters can be defined as follows:

<inputs>
    <param name="int" type="integer" value="42" min="0" max="100" label="Integer value"/>
    <param name="flt" type="float" value="1.6180" min="0" max="2.5" label="Float value"/>
</inputs>

The value (default value), min and max attributes are optional.

Text

Text boxes can be defined as follows:

<inputs>
    <param name="txt" type="text" value="GATTACA" optional="true" label="Text box" help="Some help"/>
</inputs>

Checkbox (on/off switch)

Checkboxes are an easy way to pass switches to the tool executable:

<inputs>
    <param name="switch" type="boolean" checked="false" truevalue="--switch" falsevalue="" label="Check-box"/>
</inputs>

This type of parameter is easily used in the <command> element:

<command>
    tool_binary $switch "Some random string"
</command>

which becomes at runtime:

Checkbox unchecked:
tool_binary "Some random string"
Checkbox checked:
tool_binary --switch "Some random string"

Single or multiple choice(s) list

To let the user choose a value from a list:

<inputs>
    <param name="selection" type="select" label="List Selection">
        <option value="value1" selected="true">Value 1</option>
        <option value="value2"                >Value 2</option>
        <option value="value3"                >Value 3</option>
    </param>
</inputs>

To allow multiple choices selection, with check-boxes:

<inputs>
    <param name="selection" type="select" display="checkboxes" multiple="true" label="Multiple Choices">
        <option value="value1" selected="true">Value 1</option>
        <option value="value2" selected="true">Value 2</option>
        <option value="value3"                >Value 3</option>
    </param>
</inputs>

In this case, the $selection variable will contain all the selected values separated by commas (e.g. 'value1,value2').
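
On the script side, such a comma-separated value has to be split back into individual items. A minimal sketch with argparse, assuming a hypothetical --selection option that receives the $selection variable:

```python
import argparse


def comma_list(text):
    """Split 'value1,value2' into ['value1', 'value2'], dropping empty items."""
    return [item for item in text.split(",") if item]


parser = argparse.ArgumentParser()
# --selection is a hypothetical option name chosen for this example.
parser.add_argument("--selection", type=comma_list, default=[])

args = parser.parse_args(["--selection", "value1,value2"])
print(args.selection)  # ['value1', 'value2']
```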

To select from multiple reference files (databases) located on the instance disks, .loc files can be used (#Using .loc files).

Common attributes

  • name: the parameter variable name. It is used internally to designate the parameter in all the XML elements.
  • label: the parameter name, displayed above the parameter in the Galaxy web interface. It should be short and meaningful.
  • help: the parameter help, displayed below the parameter in the Galaxy web interface. It should be clear and as detailed as it needs to be for the user to know how to use the parameter.
  • optional: when not present or set to false, the parameter is required and a value has to be set. If set to true, the parameter is optional.

Conditional arguments

Frequently, the usage of some tool parameters depends on other parameters. Conditional blocks can be used to handle these cases:

<inputs>
    <conditional name="conditional_bloc" >
        <param name="condition" type="select" label="Condition" help="" >
            <option value="choice1" selected="true">Choice 1</option>
            <option value="choice2">Choice 2</option>
        </param>
            <when value="choice1">
                <param name="simple_param" type="text" value="dummy" />
            </when>
            <when value="choice2">
                <param name="complex_param1" type="text" value="dummy" />
                <param name="complex_param2" type="text" value="dummy" />
            </when>
    </conditional>
</inputs>

Conditional blocks can also be used to hide advanced parameters:

<inputs>
    <conditional name="advanced_parameters" >
        <param name="adv_param" type="select" label="Advanced Parameters" help="" >
            <option value="hide" selected="true">Hide</option>
            <option value="show">Show</option>
        </param>
            <when value="hide" />
            <when value="show">
                <param name="adv_param1" type="text" value="dummy" />
                <param name="adv_param2" type="text" value="dummy" />
            </when>
    </conditional>
</inputs>

Reusing repeated configuration elements (macros)

You can reuse the same XML fragments within a file, or between tools in the same repository, by using the macros element.

Imported macros

To reuse XML elements between wrappers in the same directory, you must create a "file_macros.xml" file (example shown below):

<macros>
  <macro name="own_junctionsConditional">
    <conditional name="own_junctions">
      <param name="use_junctions" type="select" label="Use Own Junctions">
        <option value="No">No</option>
        <option value="Yes">Yes</option>
      </param>
    </conditional>
  </macro>
</macros>

<outputs>

The <outputs> element allows retrieving all the relevant tool output files.

Output passed through arguments or being STDOUT

With a command line as:

<command>
    tool_binary --input $input --output $output
</command>

or:

<command>
    tool_binary --input $input > $output
</command>

output files can be defined as follows:

<outputs>
    <data name="output" format="tabular" label="Tabular output file" />
</outputs>

Outputs written to the disk with a fixed name

Some tools do not allow output files to be passed through arguments, but instead write them to the current working directory with predefined fixed names. In that case, output files can be defined as follows:

<outputs>
    <data name="output" format="tabular" from_work_dir="tool_binary.output" label="Tabular output file" />
</outputs>

<stdio>

In previous Galaxy releases, when a tool wrote any information to STDERR, even without a fatal error, the tool run was considered as failed. In order to activate a more user-friendly error management, the <stdio> element has to be defined:

<stdio>
    <exit_code range="1:" level="fatal" />
</stdio>
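
To make the meaning of range="1:" explicit, here is a small Python sketch of how such a range can be read. It assumes that both bounds are inclusive and that an omitted bound means "unbounded"; the function name in_exit_code_range is hypothetical and this is not Galaxy's own implementation.

```python
def in_exit_code_range(code, range_spec):
    """Return True if `code` falls inside a range such as '1:', ':-1', '3:5' or '2'."""
    if ":" not in range_spec:
        # A single value matches only that exit code.
        return code == int(range_spec)
    low, high = range_spec.split(":", 1)
    if low and code < int(low):
        return False
    if high and code > int(high):
        return False
    return True


print(in_exit_code_range(0, "1:"))  # False
print(in_exit_code_range(3, "1:"))  # True
```

Under these assumptions, the <stdio> definition above treats every exit code of 1 or more as a fatal error, while an exit code of 0 is not flagged by this rule regardless of what was written to STDERR.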

WARNING: There is currently a bug in Galaxy when both <stdio> and the from_work_dir attribute of <outputs><data> are used. It is explained here. If from_work_dir is avoided, the exit code is returned correctly.

<tests>

<help>

xml help tag

Advanced development

Using .loc files

Using pre-defined datatypes

The list of supported data formats is contained in the ~/datatypes_conf.xml.sample file. The “format” attribute of an input or output file has to match the “extension” attribute of an existing datatype.
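
Checking a format against the declared extensions can be scripted. The sketch below parses a shortened, hypothetical excerpt written in the style of datatypes_conf.xml.sample; the function names are assumptions, and a real check would read the file from the Galaxy instance instead.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical excerpt in the style of datatypes_conf.xml.sample.
SAMPLE = """
<datatypes>
  <registration>
    <datatype extension="bam" type="galaxy.datatypes.binary:Bam" />
    <datatype extension="fasta" type="galaxy.datatypes.sequence:Fasta" />
    <datatype extension="tabular" type="galaxy.datatypes.tabular:Tabular" />
  </registration>
</datatypes>
"""


def known_extensions(xml_text):
    """Collect the extension attribute of every <datatype> element."""
    root = ET.fromstring(xml_text)
    return {node.get("extension") for node in root.iter("datatype")}


def check_format(fmt, xml_text=SAMPLE):
    """Return True if `fmt` matches a declared datatype extension."""
    return fmt in known_extensions(xml_text)


print(check_format("tabular"))  # True
print(check_format("tabulr"))   # False
```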

Each Galaxy datatype is defined by a Python class, sub-classed from the data:Data class, with its own methods and attributes. These methods usually check that the given file is correctly formatted. They also allow the system to convert it to other formats, or tell it how to display the file. There are some loosely defined formats (e.g. data:Text, binary:Binary, tabular:Tabular) and some much more strictly defined formats with multiple checking points (e.g. binary:Bam, sequence:FastqSanger, interval:Gff).

Most usual file formats are already defined, or can be sub-classed without modification from an existing class. In rare cases, a specific format will not be already defined, or it will be defined too strictly for the wanted usage. In these cases, refer to the following section (#Adding proprietary datatypes).

Known problems with specific datatypes

  • Compressed files (zip, tar, gz): decompression before sniffing
  • HTML files: sometimes (html5?) can’t display or download
  • BAM files: can’t upload non-sorted bam

Adding proprietary datatypes

Making Galaxy aware of your tool

Modify tool_conf.xml

Add the following line to the appropriate section:

<tool file="path/to/the/tool_wrapper.xml" />

with the path to the wrapper starting from the “~galaxy/tools” directory.

Did it work ?

After writing the XML wrapper, open it with a web browser: a browser reports XML syntax errors such as an unclosed tag.
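
The same kind of well-formedness check can also be scripted, for instance with the Python standard library. This is a minimal sketch; the function name xml_problem and the two example strings are assumptions, and the check only covers XML syntax, not whether Galaxy accepts the tool.

```python
import xml.etree.ElementTree as ET


def xml_problem(xml_text):
    """Return None if the text is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None


good = '<tool id="tool_id" name="Tool Name" version="1.0.0"><command>tool_binary</command></tool>'
# The <command> element is never closed in this one.
bad = '<tool id="tool_id"><command>tool_binary</tool>'

print(xml_problem(good))  # None
print(xml_problem(bad))   # parser message giving the line and column of the problem
```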

Sharing your Galaxy tool (ToolShed)

Creating a ToolShed repository

First, you have to create a ToolShed account, which is separate from any account on the main public Galaxy or on a local instance. Then log in and click on the "Create new repository" option in the left menu.

You have to fill in several fields of the "Create Repository" form:

  • Name: name of the repository.
  • Repository type: unrestricted or tool dependency definition.
  • Synopsis: short description of the objects that the repository contains.
  • Detailed description: explains in more detail the function of the objects within the repository (tools, workflows, etc.).
  • Categories: categories in which the repository will appear in the ToolShed's search list (data source, text manipulation, sequence analysis, etc.).

Repository types

Associating a repository with a certain type changes the way that the ToolShed generates metadata for the repository revisions.

There are two types of repository:

  • Unrestricted: the repository can contain any set of Galaxy utilities or files.
  • Tool dependency definition: the repository can only contain a single file named tool_dependencies.xml. Generally, this type of repository is used to download and compile a specific version of a package. Following best practices, these repositories are named like this: package_<name>_<version> (e.g., package_amos_3_1_0, package_ape_3_0, package_atlas_3_10, etc.).

Adding files to a repository

On the ToolShed page, click on the "Repository actions" button, then on "Upload files to repository". You can upload individual files or tar archives (gzip and bzip2 supported).

Type of files

  • Basic Galaxy tool wrapper (tool config file and executable).
  • Functional tests: input and output datasets used by the tests must be put in a directory named test-data.
  • Index location file: your repository should include a xxx.loc.sample file.
  • Images displayed in the tool's help section: all image files must be contained in the directory <repository root>/static/images within the repository hierarchy ("best practice" approach).
  • Datatypes
  • Workflows

Adding dependencies

A useful example

References
