Getting started

Quickstart

To predict selenoproteins and selenocysteine machinery proteins in a metazoan genome, use:

selenoprofiles -o outfolder -t genome.fasta -s "species name" -p metazoa,machinery -output_gtf_file all_predictions.gtf

This generates all_predictions.gtf, in addition to other output files in outfolder/species_name.genome/output/

Command line structure

In a single run, selenoprofiles can search for multiple profiles in a single target. Here’s a minimal command line:

selenoprofiles  -o output_folder  -t target_file  -s "species name"  -p profile  [options]

These are the compulsory arguments:

  • -o output_folder: the output folder, created if it does not exist

  • -t target_file: a (multi-)fasta file containing nucleotide sequences

  • -s "species name": a free-text species descriptor; use quotes if it contains multiple words

  • -p the profile(s) to be searched. Multiple comma-separated arguments are accepted. Each argument can be:

    • a profile name: invokes a built-in profile (located in the profiles_folder defined in the config file)

    • a profile set: a keyword expanded to a list of profiles. Profile sets are defined in the config file

    • a path to a profile alignment: to create one, see: selenoprofiles build -h

By default, selenoprofiles assumes the target is a eukaryotic genome sequence, and it will attempt to predict introns. If you’re searching (intronless) prokaryotes or eukaryotic mRNA sequences, use option -no_splice.

Blast searches can use multiple CPUs. To control how many, use option -ncpus. This is just one of many optional parameters. The config file ~/.selenoprofiles_config.txt defines default values for all of them. The full list of options can be inspected with:

selenoprofiles -h full
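For scripted runs, the command line structure described above can be assembled programmatically. The following is a minimal sketch, not part of selenoprofiles: the helper name build_command is hypothetical, it uses only options mentioned in this section, and it assumes that -ncpus takes the number of CPUs as its argument.

```python
import shlex

def build_command(output_folder, target_file, species, profiles,
                  no_splice=False, ncpus=None):
    """Assemble a selenoprofiles argument list (hypothetical helper)."""
    cmd = ["selenoprofiles",
           "-o", output_folder,
           "-t", target_file,
           "-s", species,
           # multiple profiles or profile sets are passed comma-separated
           "-p", ",".join(profiles)]
    if no_splice:
        # for prokaryotes or eukaryotic mRNA targets
        cmd.append("-no_splice")
    if ncpus is not None:
        # assumption: -ncpus is followed by the number of CPUs
        cmd += ["-ncpus", str(ncpus)]
    return cmd

cmd = build_command("outfolder", "genome.fasta", "species name",
                    ["metazoa", "machinery"], ncpus=4)
# shlex.join quotes arguments containing spaces, such as the species name
print(shlex.join(cmd))
```

Keeping the command as a list avoids quoting problems with multi-word species names; the list can be passed directly to subprocess.run(cmd, check=True).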

Searching for selenoproteins and selenium markers

The most common use of selenoprofiles is the prediction of selenoprotein families and selenium metabolism genes. For this task, you must specify appropriate profile sets, because the selenoprotein families expected in a genome depend on its taxonomy.

To search for metazoan selenoprotein families and selenium usage markers, use:

-p metazoa,machinery

To search for all eukaryotic families, use:

-p eukarya

To search for all prokaryotic families, use:

-p prokarya

Selenoprofiles output

Upon completion, you will find output files in a subfolder of the output folder:

output_folder/species_name.target_file_name/output/
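A script that collects results may need to compute this path. The sketch below rests on an assumption inferred only from the Quickstart example (species "species name" and target genome.fasta give species_name.genome): spaces in the species descriptor become underscores and the target file extension is dropped. The function name expected_output_dir is hypothetical, and the actual naming rule may differ in other cases.

```python
from pathlib import Path

def expected_output_dir(output_folder, species, target_file):
    """Guess the results subfolder (hypothetical helper, inferred rule)."""
    # assumption: spaces in the species descriptor are replaced by underscores
    species_tag = species.replace(" ", "_")
    # assumption: the target is identified by its file name without extension
    target_tag = Path(target_file).stem
    return Path(output_folder) / f"{species_tag}.{target_tag}" / "output"

out_dir = expected_output_dir("outfolder", "species name", "genome.fasta")
print(out_dir.as_posix())
```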

Selenoprofiles produces one file per gene prediction, per requested format. The output files are named after the prediction identifier, which has the syntax:

profile_name.index.label

Where:

  • profile_name identifies the source profile for the prediction

  • index is an arbitrary numeric id

  • label identifies the class of predicted gene.

For selenoprotein families, the label can be:

  • “selenocysteine” for selenoproteins: UGA is aligned to the Sec position of the profile

  • “cysteine” for cysteine-homologs of selenoproteins (i.e. Cys aligned to Sec position)

  • if any other amino acid is aligned to Sec, the label takes its name

  • for predictions that do not include the Sec position of the profile:

    • “unaligned” if the prediction ends upstream or starts downstream of the Sec position

    • “gapped” if Sec is not aligned but there are homologous regions on both sides

  • “pseudo” for predictions containing in-frame stops or frameshifts (likely pseudogenes)

  • “uga_containing” for predictions whose only pseudogene features are in-frame UGAs

For profiles that are not selenoprotein families (i.e. do not contain Sec), the label can be:

  • “homologue” for standard predictions

  • “pseudo” (see above)

Output files are named after the prediction identifier and the format:

profile_name.index.label.format   # e.g. GPx.1.selenocysteine.p2g
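When post-processing results, it can help to split an output file name back into its components. The following is a minimal sketch based only on the naming syntax above; the names OutputName and parse_output_name are hypothetical. It splits from the right, on the assumption that the index, label and format never contain dots while a profile name might.

```python
from typing import NamedTuple

class OutputName(NamedTuple):
    profile_name: str
    index: int
    label: str
    fmt: str

def parse_output_name(filename):
    """Split 'profile_name.index.label.format' into its four parts."""
    # rsplit keeps any dots inside the profile name intact;
    # a name with fewer than four parts raises ValueError
    profile_name, index, label, fmt = filename.rsplit(".", 3)
    return OutputName(profile_name, int(index), label, fmt)

parsed = parse_output_name("GPx.1.selenocysteine.p2g")
print(parsed.profile_name, parsed.index, parsed.label, parsed.fmt)
```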

These output formats are supported:

  • p2g: native format with query-target alignment, coordinates and other info

  • fasta: fasta file with the predicted protein sequence

  • gff: gff file with the genomic coordinates of the prediction

  • gtf: analogous to gff, but with the last field formatted as: gene_id "prediction_id"; transcript_id "prediction_id";

  • cds: fasta file with coding sequence, including the stop codon if applicable

  • three_prime: fasta file with the sequence immediately downstream (3’) of the prediction; its length is set by -three_prime_length

  • five_prime: fasta file with the sequence immediately upstream (5’) of the prediction; its length is set by -five_prime_length

  • dna: fasta file with the full nucleotide sequence, including introns (and frameshift-causing insertions if any)

  • introns: fasta file with the sequence of the introns, split into different fasta headers
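Because the gtf format stores the prediction identifier in its last field, it is easy to recover which prediction each line belongs to. The sketch below assumes the standard nine-column, tab-separated GTF layout. The function name gtf_prediction_id is hypothetical, and the example line (sequence name, source, coordinates) is made up for illustration rather than taken from real selenoprofiles output.

```python
import re

def gtf_prediction_id(gtf_line):
    """Return the gene_id stored in the last field of a GTF line, or None."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    # the ninth column holds: gene_id "..."; transcript_id "...";
    match = re.search(r'gene_id "([^"]+)"', fields[8])
    return match.group(1) if match else None

# illustrative line only: the first eight columns are invented values
example = ("chr1\tselenoprofiles\tCDS\t100\t250\t.\t+\t0\t"
           'gene_id "GPx.1.selenocysteine"; '
           'transcript_id "GPx.1.selenocysteine";')
print(gtf_prediction_id(example))
```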

Additionally, a fasta alignment called profile_name.ali is created. Only one such ali file is produced per profile, containing the sequences of all predictions plus the profile sequences.

On the command line, option -output_FORMAT activates the corresponding output for each prediction, e.g. -output_gff will produce gff files. By default, only the ali and p2g formats are active, as shown in the config file:

### active output format
output_ali=1
output_p2g=1
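To check which output formats a given configuration activates, the lines above can be read with a few lines of code. This is a minimal sketch under stated assumptions: the config file is taken to consist of key=value lines with # starting a comment, and a value of 1 is taken to mean active. The function name active_output_formats is hypothetical, and the output_gff=0 line in the sample is illustrative, not copied from the real config file.

```python
def active_output_formats(config_text):
    """List formats whose output_FORMAT key is set to 1 (assumed syntax)."""
    formats = []
    for line in config_text.splitlines():
        # assumption: everything after '#' is a comment
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("output_") and value == "1":
            formats.append(key[len("output_"):])
    return formats

sample = """### active output format
output_ali=1
output_p2g=1
output_gff=0
"""
active = active_output_formats(sample)
print(active)
```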

To create a single output file for all predictions, use -output_FORMAT_file, providing as argument the path of the file to create, e.g. -output_fasta_file all_predicted_proteins.fa
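The two option patterns, -output_FORMAT and -output_FORMAT_file, can be generated from a format name. The following is a sketch only: the helper name output_options is hypothetical, and it assumes that -output_FORMAT is used as a bare flag on the command line, as the -output_gff example suggests.

```python
def output_options(per_prediction=(), merged=None):
    """Build output-related command line options (hypothetical helper).

    per_prediction: formats to write as one file per prediction
    merged: mapping of format -> path of a single file for all predictions
    """
    # assumption: -output_FORMAT is a bare flag with no argument
    args = [f"-output_{fmt}" for fmt in per_prediction]
    for fmt, path in (merged or {}).items():
        args += [f"-output_{fmt}_file", path]
    return args

options = output_options(["gff", "cds"],
                         {"fasta": "all_predicted_proteins.fa"})
print(" ".join(options))
```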