Advanced customization
======================

This page is intended for advanced users that want to customize selenoprofiles behavior.
Make sure you read ref:`pipeline_detailed_` first.

Selenoprofiles was designed to be as customizable as possible. It offers
to the user the possibility of writing python code which will be
integrated and run. The code can be provided mainly through the
configuration file of each profile, and through the main selenoprofiles
configuration file. Additionally, custom modules can be loaded using
option *-add*, as we will see later.

In the simplest use of custom code, the user can set profile specific
procedures, exploiting the built-in methods for filtering::

  ### fam1.fa.config
  blast_filtering = x.evalue < 1e-15
  p2g_filtering = x.awsi_filter (awsi=0.3)
  p2g_refiltering = x.coverage() > 0.5

With more experience, it is possible to add custom information to
output, or even annotate motifs or secondary structures in the
predictions::

  ### selenoprofiles config at ~/.selenoprofiles_config.txt
  (...)
  ACTION.pre_output.see_cys= print(x.output_id()+ " Cys:" +(','.join([str(i)  for i, aa in enumerate(x.protein()) if aa== "C"] or "None" ))

::

  ### output
  fam1.1.homologue Cys:14,17,64,189,192
  fam1.5.homologue Cys:18,21,194,19
  fam1.11.pseudo Cys:60,63
  fam1.19.pseudo Cys:None

The *p2ghit* class
------------------

To learn how to use custom code, you need to be familiar with some
variables and classes in selenoprofiles, as these are the objects that
your code will be manipulating. To do this, you should have already some
experience with python code and classes. The *p2ghit* class is the key
of user customization. It represents a prediction of selenoprofiles,
coming from any source among blast, exonerate or genewise. It contains
the alignment of a query against a target, and the genomic coordinates
of such alignments. Let’s see its mostly used attributes and methods:

+-------------------+--------------------------------------------------+
| **Attribute or    | **Description**                                  |
| method**          |                                                  |
+===================+==================================================+
| :sup:`id`         | :sup:`The numeric id of the prediction (string). |
|                   | It is unique for that family and target`         |
+-------------------+--------------------------------------------------+
| :sup:`chromosome` | :sup:`The first word of the fasta header of the  |
|                   | chromosome or scaffold where this prediction     |
|                   | resides`                                         |
+-------------------+--------------------------------------------------+
| :sup:`strand`     | :sup:`The strand of the prediction ( + or - )`   |
+-------------------+--------------------------------------------------+
| :sup:`label`      | :sup:`The label assigned to this prediction in   |
|                   | the labeling phase`                              |
+-------------------+--------------------------------------------------+
| :sup:`filtered`   | :sup:`The filtered state assigned by the         |
|                   | filtering phase (kept, filtered, refiltered,     |
|                   | redundant).`                                     |
|                   |                                                  |
|                   | :sup:`After inter-family overlaps are computed,  |
|                   | the state overlapping is also possible`          |
+-------------------+--------------------------------------------------+
| :                 | :sup:`The prediction name displayed in output    |
| sup:`output_id()` | (profile name.index id.label). Example:          |
|                   | SelK.1.pseudo`                                   |
+-------------------+--------------------------------------------------+
| :sup:`pred        | :sup:`The program that generated this prediction |
| iction_program()` | (blast, exonerate or genewise)`                  |
+-------------------+--------------------------------------------------+
| :sup:`q           | :sup:`The full name of the query, as it appears  |
| uery_full_name()` | in the profile alignment`                        |
+-------------------+--------------------------------------------------+
| :sup:`coverage()` | :sup:`A float value, indicating how much profile |
|                   | is spanned by the prediction (max is 1.0)`       |
+-------------------+--------------------------------------------------+
| :sup:`protein()`  | :sup:`Protein sequence, with \* for stop codons, |
|                   | U for Sec`                                       |
+-------------------+--------------------------------------------------+
| :sup:`cds()`      | :sup:`Nucleotide coding sequence, as ATGC        |
|                   | characters`                                      |
+-------------------+--------------------------------------------------+
| :sup:`pos         | :sup:`A string with the positions of all exons.  |
| itions_summary()` | Examples:`                                       |
|                   |                                                  |
|                   | :sup:`24-40,70-100 (+ strand) 400-450,340-354 (- |
|                   | strand)`                                         |
+-------------------+--------------------------------------------------+
| :sup:`exons`      | :sup:`A list (array) containing the exons. Each  |
|                   | exon is a list of 2 elements (integers), the     |
|                   | position of start and the position of end of the |
|                   | coding sequence, both 1-based and included. Each |
|                   | prediction has at least one exon.`               |
+-------------------+--------------------------------------------------+
| :sup:`header()`   | :sup:`A string used as default fasta header.     |
|                   | Contains lots of non-sequence information.       |
|                   | Example: SBP2.1.homologue chromosome:scaffold1   |
|                   | strand:+ positions:869-881,1163-1417             |
|                   | species:Polysphondylium_pallidum_PN500           |
|                   | target:genomes/P.pallidum/genome.fa              |
|                   | prediction_program:exonerate`                    |
+-------------------+--------------------------------------------------+
| :sup:`dna()`      | :sup:`Full nucleotide gene sequence, including   |
|                   | introns and frameshifts if present.`             |
+-------------------+--------------------------------------------------+
| :sup:`splice_     | :sup:`A list of 4 letter strings, with the first |
| site_sequences()` | two and last two nucleotides of each intron in   |
|                   | the prediction.`                                 |
+-------------------+--------------------------------------------------+
| :sup:`            | :sup:`Generic function to return any nucleotide  |
| subsequence(self, | subsequence of a prediction, using lazy          |
| start, length)`   | computing. It can be used with negative start or |
|                   | large length to get the sequence around the      |
|                   | genomic interval. Normally the indexes are       |
|                   | relative to the predicted coding sequence, but   |
|                   | you can use include_introns=True to count any    |
|                   | nucleotide in the gene prediction.`              |
+-------------------+--------------------------------------------------+
| :sup:`alignment`  | :sup:`Pairwise protein alignment between a       |
|                   | profile query and the target, as an instance of  |
|                   | the alignment class in MMlib (if interested,     |
|                   | check its code in the installation directory).`  |
+-------------------+--------------------------------------------------+

There are plenty more of methods. Many are actually inherited from the
*p2ghit* parent class, called *gene*, defined in the library *MMlib.py*.

Custom output: option *-fasta_add*
----------------------------------

The *-fasta_add* option represents an elegant and fast way to add
information to output. A python written procedure with the same style of
actions and filters must be provided as argument. The procedure is
evaluated to a string which is inserted in the fasta headers of the
files in output. All the fasta files in output will contain the add-on,
as they all call the same function to determine the fasta header. Files
with extension *fasta*, *cds*, *dna*, *three_prime*, *five_prime* and
also *ali* will have it. Let’s see an example. Normally the fasta
headers contain the following information::

  >GPx.6.selenocysteine chromosome:chr3 strand:- positions:49395460-49395711,49394824-49395180 species:"Homo sapiens" target:/Genomes/Homo_sapiens/genome.fa prediction_program:genewise

Let’s say that you want to add the length of the protein to the header.
You could add this to your command line::

  -fasta_add '"seq_length:" +str( len(x.protein()) )'

Now if you run selenoprofiles with this (forcing the replacement of the
old output with *-O* or specifying another output folder), you will
have::

  >GPx.6.selenocysteine chromosome:chr3 strand:- positions:49395460-49395711,49394824-49395180 species:"Homo sapiens" target:/Genomes/Homo_sapiens/genome.fa prediction_program:genewise seq_length:203

Actions
-------

The actions are performed during the workflow on each prediction coming
from the prediction choice/labelling step. Each action is provided as
python code that is directly executed in the selenoprofiles environment.
In a classical *for* loop, the variable *x* in the code is replaced by
each *p2ghit* instance and executed. The keyword *ACTION* in the main
configuration file denotes the active actions. Actions can be specified
also in the command line. From now on, we will display the examples with
the configuration file syntax::

  ACTION.pre_filtering.echo = print( f'hello world {x.id} {x.label}' )

Separating the left side with dots, the first field is the keyword
*ACTION*, the second field is the category of the action and the third
is the name of the action. The category determines the time point of the
actions, while the name is used only to order the actions in the same
category. In this example, the user will just see something like this
appearing in the output of selenoprofiles::


  CHOOSE: choosing among available predictions, assigning label -->  selenoprofiles_results/Polysphondylium_pallidum_PN500.genome/prediction_choice/SelI.tab (just loading file)
  SelI.1 : exonerate longest CDS predicted unaligned
  SelI.3 : blast longest CDS predicted unaligned
  SelI.4 : exonerate longest CDS predicted unaligned
  SelI.7 : blast SecTGA aligned pseudo
  hello world 1 unaligned
  hello world 3 unaligned
  hello world 4 unaligned
  hello world 7 pseudo
  ...

Each action is performed on all available prediction at a certain step
of the pipeline, determined by his category. There are many possible
categories of actions: *post_blast_filter, post_blast, post_blast_merge,
pre_choose, pre_filtering, post_filtering, pre_output*.

The categories names are pretty self-explanatory. The actions
*post_blast* and *post_blast_merge* are performed on blast hits, while
the others are performed on blast hits or exonerate/genewise
predictions.

You will have to choose the category of your actions depending on what
operation you want to perform. Actions executed during *pre_filtering*
can be used to improve the predictions, but remember that their
attribute .filtered is not set yet. *post_filtering* actions can access
the .filtered attribute and are performed before storing results on the
database. *pre_output* actions can add useful information to the log
output.

Let’s see an example which uses an if statement to execute operations
only on a certain subset of the available predictions. Typically, the
attributes that you want to check are the .label and the .filtered
attributes. Let’s say for example that we want to check the chromosomes
and strands where the prediction with label “unaligned” rely::

  ACTION.post_filtering.test = "if x.label=='unaligned': print( f'{x.output_id()} CHROMOSOME {x.chromosome} {x.strand} ') "

This adds something like this in the standard output of selenoprofiles::

  ...
  SelI.1.unaligned CHROMOSOME gb|GL290990.1| +
  SelI.3.unaligned CHROMOSOME gb|GL290997.1| +
  SelI.4.unaligned CHROMOSOME gb|GL290984.1| -
  ...

The next action is for giving a quick look to the protein sequence of
all discarded predictions. Below is the output added.

::

   ACTION.post_filtering.check_ali = "if x.filtered != 'kept': print( f'{x.output_id()} {x.protein()} )'"

::

   SelI.4.unaligned ITLVGLFCNIAMYLIVYFQCPGLTEPAPRWCYFLIAFLIFAYQTLDNLDGKQARRTKSSSPLGELFDHCCDA
   SelI.7.pseudo VTATGFVCNFIALFLMSSYMRPVNDGQEPV

After the *post_filtering* actions are performed, the results are stored
in the selenoprofiles database. Remember that if selenoprofiles finds
the results in the database, it does not perform the steps up to
filtering. Therefore beware that if you specify actions of category pre
or post filtering (or any of the categories before them) on a second run
of selenoprofiles, it won’t perform them unless you force the proper
routine, for example with option -F to force the filtering routine.
*pre_output* actions, on the contrary, are performed both if in the
current run results are produced or loaded from the database, but only
on the results which are output (determined by the *-state* option).

Later, we will see how actions can be used to correct gene structures,
or to add custom genomic features to the predictions.

Blast filtering
---------------

There are 3 layers of filtering in selenoprofiles, all regulated by
procedures defined in the profile. We have already seen them: blast
filtering, p2g filtering and refiltering. The same grammar applies to
all of them. For blast filtering, the most common attribute checked is
the *evalue*, an attribute specific of blast hits. The blast hit is a
subclass of *p2ghit* and has the same methods. Let’s see a simple blast
filtering procedure as written in a profile configuration file; this
accepts only the blast hits with *evalue* minor (better) than 1e-5::

  blast_filtering = x.evalue < 1e-5

Selenoprofiles offers also more sophisticated tools, which map the
prediction back to profile alignment to use what we know from the
profile alignment. For example many families possess N-terminal regions
of disordered or repetitive sequence, which hits spuriously many regions
in the genome. The resulting blast hits span only the initial portion of
the profile.

You may want to exclude those, using function
*is_contained_in_profile_range*::

  blast_filtering = x.evalue < 1e-5 and not x.is_contained_in_profile_range(1, 35)

The similar function *spans_profile_range* asks whether the predictions
spans certain columns of the alignment, useful when you want only
proteins with a certain conserved domain::

  blast_filtering = x.evalue < 1e-5 and x.spans_profile_range(50, 60)

The function *show_conservation_in_profile_range* is useful when dealing
with blast filtering of profiles with regions of low information. It
checks the number of pairwise similarities (defined as positive scores
in the BLOSUM62 matrix) between the amino acids in the query and in the
target in the prediction along a certain profile range. In the example
below, predictions are required to have 3 conserved amino acids in the
region from positions 1 to 50::

  blast_filtering = x.show_conservation_in_profile_range(1, 50, 3)

AWSI Z-score based filtering
----------------------------

We developed various method to score how much a sequence “fits” in a
protein profile. We called the best performing one Average Weighted
Sequence Identity (AWSI).

It is based on the Weighted Sequence Identity (WSI), a scoring method
for comparison of two sequences, with one of the two belonging to a
profile alignment.

The WSI score is computed as an average of sequence identities with
different weights on the different columns of the profiles. In the
pairwise comparison between the profile sequence and the candidate
sequence, the weight is given by the representation of the amino acid in
this profile sequence and column across all the profile. More conserved
columns are given more weight thus more importance. This weight is also
multiplied by the column coverage, that is to say, the total number of
characters which are not gaps divided by the total number of profile
sequences. In this way, the alignment regions present only in a small
subset of sequences have less importance.

When the term AWSI is used in this manual, we refer to the variant
AWSIc, computed as just explained. There is another variant (AWSIw),
which is computed in the same way, but the weight is not multiplied by
the column coverage.

When comparing a candidate sequence against a profile, a WSI for each
profile sequence is calculated. Each one ranges from 0 to 1, as it is
normalized to the sum of weights in that WSI. Now the AWSI of the
candidate sequence is just the average of all computed WSI.

Although the range of AWSI is also between 0 and 1, the maximum value it
can assume is constrained by the profile characteristics. In a profile
with very dissimilar sequences, no candidate sequence can reach high
scores (as if it matches a sequence of the profile, it cannot match the
different ones at the same time). Thus, it is useful to adjust the AWSI
threshold for each profile.

For this purpose, profile alignments are analyzed when used for the
first time, and AWSI values for all sequences are computed. For each
profile sequence, we compute its AWSI as explained above, considering
this sequence as a candidate, and the rest of sequences as the profile
to compare against.

The distribution of these AWSI scores is used to decide the similarity
threshold when fitting a sequence into this particular profile. The AWSI
score of the target sequence is fit in a Gaussian distribution with the
profile average and standard deviation, and a Z-score is computed. In
the default p2g refiltering procedure (*awsi_filter*), the Z-score must
be greater than -3.

The script *selenoprofiles_build_profile.py* can be used to display the
distribution of the AWSI scores with option *-d*, as shown here above
(pylab must be installed). The frequencies of the computed AWSI values
are shown as green columns, while colored dots are used to display the
approximated gaussian distribution: the red dot is the average, while
the purple, blue and cyan dots correspond to the average minus 1, 2, 3
standard deviation respectively. The default cut-off point is thus
indicated by the leftmost cyan dot.

.. figure:: images/AWSI_distribution.png
	    :width: 450

The methods of the *p2ghit* class relevant to AWSI scores are:

 * **awsi()** with no arguments, it returns the AWSIc value for this candidate.
   Used as *awsi(with_coverage=False)*, returns AWSIw instead
   
 * **awsi_z_score()** returns the z-score compute comparing the AWSI of this
   candidate with the profile distribution. This function also accepts the
   *with_coverage=False* switch to return AWSIw instead.
   
 * **awsi_filter()** returns *True* if the prediction would pass the default
   AWSI-based filtering, *False* otherwise. This function also accepts the
   *with_coverage=False* switch to return AWSIw instead. This is normally
   computed just as *awsi_z_score()>-3*, with two possible exceptions. For
   extremely conserved profiles, the cut-off threshold would be generally
   too strict. Thus, if the candidate has an extremely high AWSI (>0.9), it
   is accepted regardless of the profile characteristics. The second
   exception is for profiles with few sequences (<3). In these case, the
   computed AWSI standard deviation is always zero or extremely close to
   it, and this would also result in filtering too strict. Thus, for these
   profiles the filter just checks that *awsi()>=0.3*

One can  alter the filter behavior using any of these arguments to
the *awsi_filter* function: *z_score*, *awsi*, *few_sequences_awsi.* For
example awsi_filter(awsi=0.5) accepts any candidate scoring a AWSI with
the profile of 0.5 or greater (or a *z_score* >-3).

Other filtering functions
-------------------------

Here’s some other methods useful for blast or p2g filtering of specific
families.

The function *seq_in_profile_pos* provides the amino acid predicted in
the target at a certain position of the profile alignment (may be - for
unaligned). It can be used to check that certain domains are complete
(e.g. redox boxes CXXC).

::

   p2g_refiltering = x.seq_in_profile_pos(31)=='C' and x.seq_in_profile_pos(34)=='C'

The function *sequence_identity_with_profile* computes a quantitative
measure of how much the prediction fits in the profile: it computes the
sequence identity of the prediction with every profile sequence, and
average them. It is a simplification of the AWSI score. With no
arguments, internal (but not terminal) gaps are counted as mismatches.
The choice of the threshold in this case depends largely on the profile.

::
   
   p2g_refiltering = x.label!='pseudo' and x.sequence_identity_with_profile()>=0.25

The more useful function *sequence_identity_in_range* is analogous the
previous one, but computes the average sequence identity only on a
certain range of the profile. Predictions not spanning this region are
given 0.

::
   
   p2g_refiltering = x.label!='pseudo' and x.sequence_identity_in_range(40, 80)>=0.35

For a full list of the methods of the *p2ghit* class, run
*selenoprofiles* with *-help full* or inspect the script inside
your installation directory.

Tag blast filtering
-------------------

Tag blast is an implemented form of filtering. This consists in
searching the protein sequence predicted in the target against a
comprehensive protein database (by default, Uniref50 maintained by
Uniprot). The output generally provides a good annotation of the protein
in question. Note that your profile may match sequences in the genome
that are real genes, but do not belong to your family and are hit
because of their sequence similarity. These predictions usually have
blast hits against proteins in Uniref50 which are not in your protein
family. Tag blast utilizes a set of profile-defined tags to scan the
titles in the blast output and assign a score to the prediction. A
predicted sequence that resembles proteins not belonging to the family
are likely to be spurious, and will be assigned a negative tag score. To
use tag blast, you must first set the list of tags for your profile in
its configuration file. Tags are strings which are interpreted as perl
regular expressions. In the configuration file of the profile, the tags
are written as a python list of strings::

  tags = ['SecS ', '(Sec|selenocysteine|tRNA).\* selenium transferase']

Tags should be carefully designed in order to recognize all sequences of
the profile and those with similar names. For each blast hit appearing
in the blast file, the tags are tested and a score is assigned to the
title. Its absolute value is the negative logarithm of the *evalue*: a
blast hit with *evalue* 1e-5 gets 5 points. The final tag score assigned
to prediction is the sum of all the titles. If the title matches any
profile tag, its score will be positive. If it matches any neutral tag,
its score will be zero. If a title does not match any profile or neutral
tag, its score will be negative. The neutral tags are used to skip all
the blast hits with uninformative titles and those based only on
computational prediction. The neutral tags are defined in your main
configuration file, with a decent default value. For filtering, we check
whether the final tag score assigned to predictions is positive::

  p2g_refiltering = x.label!='pseudo' and x.tag_score() > 0

If you want to use the tag score in a filter, we suggest you to inspect
manually the results and check their tag score first. For example with
this action (paste it in the main configuration file)::

  ACTION.post_filtering.check_score = print(f"Tag score of {x.output_id()} filtered: {x.filtered}:\n {x.tag_score(verbose=1)}")

The verbose mode will allow you to check the titles of all proteins
present in the blast output and the score assigned to them. This will
allow you to build and improve useful tags for your family.

When the method *tag_score* is run for the first time on a *p2ghit*,
blastp is run against the database defined in the profile or in the main
configuration file (under the keyword *tag_db*). The output file is kept
in the *tag_blast* subfolder inside the folder dedicated to this target.
A tag blast run takes a few minutes, so take care of avoid doing it on a
lot of hits. If you put the *tag_score* evaluation on the right side of
an *and* construct, the tag blast will not be performed unless all
conditions to his left are true::

  p2g_refiltering = x.coverage()>0.4 and x.label!='pseudo' and x.tag_score()>0

GO score filtering
------------------

Similarly to the tag score, the GO score utilizes the same blast search
against Uniref50, but in this case it is the GO terms associated to the
proteins found which are evaluated. A list of the positive GO terms is
to be provided in the profile configuration file::

  go_terms = ["GO:08028", "GO:08030"]

A score is assigned to each blast hit depending on the *evalue*, as in
the tag score. The GO terms are searched considering their hierarchy: if
for a certain title in the blast output, a GO term is found which is a
child of a GO term defined in the profile configuration, this will count
as positive. Blast hit with no annotated GO are scored neutral. Only
molecular functions GO terms are checked.

::

   p2g_refiltering = x.label!='pseudo' and x.go_score()>0

Integrate your own code: option *-add*
--------------------------------------

With the *-add* option, you can provide a python add-on file that will
be loaded in selenoprofiles. This will allow you to define functions can
then be used in any procedure, for example for filtering or output. The
code inside the file provided is read line by line and executed in
selenoprofiles when all variables are already loaded and everything is
ready to run.

User defined functions are useful for filtering, labeling or outputing.
Let’s see how to create a simple output function. Create a file called
*extension.py* where you define function which accept a *p2ghit*::

  def my_name_is(z):
    """This functions accepts a p2ghit and returns its output id """
    return z.output_id()

If you now you provide this file with the option *-add*, the function
*my_name_is* will be available in selenoprofiles. Running selenoprofiles
with::

  -add extension.py -ACTION.pre_output.test "print(my_name_is(x))"

you will have something like this in the output::

  ...
  SelI.1.unaligned
  SelI.3.unaligned
  SelI.4.unaligned
  ...

Let’s see a more relevant example. Assume that for some reason you are
interested only in the non-pseudo, single-exon predictions. You could
then write this function in your *extension.py*::

  def has_no_introns(z):  
    """ This functions accepts a p2ghit and returns True if it has no introns"""
    return len(z.exons)==1
  
You may then use this function for filtering, adding something like this
in your profile *.config* file::

  p2g_refiltering = x.label!= "pseudo" and has_no_introns(x)

Adding functions may be useful for several purposes. It is possible to
write procedures to improve the predictions, as those
previously presented, or for filtering, as shown above. It can
also be used to perform one-time operations (for example to load custom
data), or override some functions or attributes used in selenoprofiles.
For example, the user may want to customize the labeling procedure used
in selenoprofiles. The easiest way to do this is writing a new labeling
procedure in the *extension.py* file, which redefines the *.label*
attribute of the input *p2ghit*, and use it in a *pre_filtering* action.
In this example, we define a procedure to label the predictions as short
or long, checking their predicted protein length::

  def custom_labelling(z):
    if len(z.protein()) >= 50:
      z.label='long'
    else:
      z.label='short'

We activate this by adding this action in the main configuration file::

  ACTION.pre_filtering.labelling = custom_labelling(x)

Note that when the new function is called, the standard labeling
procedure has been already called, so a *.label* attribute is available,
and you can check it (or use it) to define the new label. Example::

  def custom_labelling(z):
    original_label=z.label
    if len(z.protein()) >= 50:
      z.label='long_'+original_label
    else:
      z.label='short_'+original_label

The label is then typically used for filtering::

  p2g_refiltering = x.label.startswith("long")

There are a few global functions in selenoprofiles that user may be
interested in altering. In various steps of the workflow, the program
must decide which gene structure prediction is best among 2 or more
candidates. The first such function is named *choose_prediction*. This
is used in the prediction choice step, when a single prediction among
blast, exonerate and genewise is chosen. It accepts a list of *p2ghit,*
with variable length (1-3). It returns a tuple like (*p*, *s*) where *p*
is the chosen *p2ghit* and *s* is a string with a reason why (it will be
printed and stored in a file). The native function is the quite complex,
and takes into account the presence of frameshifts, presence of stop
codons, aligned Sec position (for selenoprotein families), length of
coding sequence (you can inspect the code at *def choose_prediction* in
*selenoprofiles.py*). Let’s see an example in which this function is
replaced by a simple hierarchal function, choosing predictions by
genewise over those by exonerate, over those by blast (note that it is
still possible that even blast is chosen in this way, if for a given hit
the exonerate and genewise predictions are empty or non-valid). Put this
into your *extension.py* file provided to option *-add*::

  global choose_prediction
  def choose_prediction(candidates):
    for c in candidates:
      if c.prediction_program()=='genewise':
        return (c, 'genewise is available')
	
    for c in candidates:
      if c.prediction_program()=='exonerate':
        return (c, 'exonerate is 2nd best')
	
    return (candidates[0], 'only blast available')

When writing a new *choose_prediction* function, you may still want to
call internally the old function, which you can refer to as
*choose_prediction_selenoprofiles.* In this example, the new function
keeps the behavior of the old one, except for blast predictions which
are forced to be never chosen. This is accomplished by returning an
*empty_p2g*\ () object when only blast is available::

  global choose_prediction
  def choose_prediction(candidates):
    if all( [ c.prediction_program()=='blast' for c in candidates ] ):
      return empty_p2g(), 'excluding blast'
    else:
      return choose_prediction_selenoprofiles(candidates)

The second such function is named
*choose_among_overlapping_p2gs_intrafamily* and is used when removing
intrafamily redundancy. This accepts two *p2ghit* that were found
overlapping and returns the best one, which is kept. The default
function calls internally *choose_prediction.* In its code, this is
named *choose_prediction_selenoprofiles*, so if you override the
*choose_prediction*, the *choose_among_overlapping_p2gs_intrafamily*
function will still run the original, built-in procedure.

If you want to remove intrafamily redundancy using an overridden
*choose_prediction* function, it is necessary to override
*choose_among_overlapping_p2gs_intrafamily* too. You can search its code
in *selenoprofiles_3.py* as a template.

The third and last function is named
*choose_among_overlapping_p2gs_interfamily* and is used when removing
redundancy between gene predictions by various profiles. This also
accepts two *p2ghit* and returns one. The default function considers the
AWSI score of the candidate with the 2 profiles, and their filtered
attribute (a prediction kept by a profile is never masked by an
overlapping prediction filtered by another profile). Let’s see how to
replace it with a function which always keeps the prediction with longer
protein sequence. Create an *extension.py* file like this::

  global choose_among_overlapping_p2gs_rem_red
  def choose_among_overlapping_p2gs_rem_red(p2g_hit_A, p2g_hit_B):
    if len(p2g_hit_A.protein()) > len(p2g_hit_B.protein()):
      return p2g_hit_A
    elif len(p2g_hit_A.protein()) < len(p2g_hit_B.protein()):
      return p2g_hit_B
    else:
      return p2g_hit_A

If you believe that your own function may be useful to other users, or
if you need help building your own function, contact us.

Custom prediction features
--------------------------

Selenoprofiles offers the possibility to annotate and manipulate custom
features linked to gene predictions. Such annotations (*p2g_features*)
can be used for example for protein motifs or domains, or signal
sequences, or secondary structures, present in all or some gene
predictions. Within selenoprofiles, SECIS elements are implemented as
*p2g_features*. Technically, *p2g_feature* is a python class, thought to
be generic so the user can created a child-class (subclass) to adapt it
to his specific purpose.

Selenoprofiles includes a built-in example to show the capabilities of
*p2g_features*: the class *protein_motif*. This is thought to annotate a
short motif within the protein sequence, the redox box, expressed as the
perl-like regular expression *C..C* (*C* stands for cysteine, and *.*
means any character). The class *protein_motif* allows to detect these
motifs and easily integrate them in the p2g or gff output.

For any custom *p2g_feature,* the user has to define at least the
following procedures: how to search and assign these features, how to
dump them in the sqlite database, how to load them back. Then,
optionally one can define how to output them to the gff and/or p2g file,
and also how to reload the features if gene structure predictions are
modified. The *protein_motif* includes examples of all these procedures.

All the code relevant to the *protein_motif* is here below, copied from
*selenoprofiles_3.py*.

::

  def annotate_protein_motif(p, silent=False):
    """p is a p2ghit. This is an example of method to annotate the p2g_feature protein_motif. To use, add this to the main configuration file:  
    ACTION.post_filtering.annotate_motif =    "if x.filtered == 'kept':  annotate_protein_motif(x)"   
    """
    s= protein_motif.motif.search(  p.protein() )   ##using search method of re.RegexObject  --  protein_motif.motif is such an object
    while s:
      protein_motif_instance=         protein_motif()
      protein_motif_instance.start=   s.start()+1   #making 1 based
      protein_motif_instance.end=     s.end()       #making 1 based and included, so it'd be +1-1
      protein_motif_instance.sequence=     \\
		    p.protein() [ protein_motif_instance.start-1 : protein_motif_instance.end ]
      p.features.append(protein_motif_instance)     ## adding feature to p2g object
      if not silent:  printerr('annotate_protein_motif found a motif: ‘ \\
				 +protein_motif_instance.output()+' in prediction: '+p.output_id(), 1)
      s=protein_motif.motif.search(  p.protein(), pos= s.start()+1 ) ## searching again, starting from just right of the previously found position


  class protein_motif(p2g_feature):
    """ protein motif is an example of a p2g_feature, to annotate the positions of a certain motif defined as a perl-style regexp.  The motif is defined in the line following this, as a class attribute. In the example, the redox box (CXXC) is the motif. 
	Attributes:
	- start      start of the protein motif in the protein sequence (1-based, included)
	- end        end of protein motif in the protein sequence (1-based, included)
	- sequence   motif sequence 
    """
    motif=re.compile( 'C..C' )
    included_in_output=True
    included_in_gff=   True

    def dump_text(self):
      """ Returns a string with all the information for this feature. This string is stored in the sqlite database. """
      return str(self.start)+':'+str(self.end)+':'+self.sequence

    def load_dumped_text(self, txt):
      """ Reverse the dump_text method: gets a string as input, and loads the self object with the information found in that string. """
      start, end, sequence= txt.split(':')
      self.start= int(start);    self.end=int(end);    self.sequence=sequence

    def output(self):
      """ Returns a string. This will be added to the p2g output of the prediction to which this            feature is linked -- if class attribute included_in_output is True"""
      return 'Motif: '+self.sequence+' Start: '+str(self.start)+' End: '+str(self.end)

    def gff(self, **keyargs):
      """This must return a gff-like tab-separated string. In this case, we are exploiting and overriding the gff method of the gene class, which is a parent class for p2g_feature"""
      ## getting a gene object with the genomic coordinates of the protein motif. we use the gene method subseq, which returns a subsequence of the parent gene. Indexes are adjusted for protein-nucleotide conversion
      motif_gene_object= self.parent.subseq( start_subseq= (self.start-1)*3 +1,    \\
					     length_subseq=(self.end-self.start+1)*3, minimal=True )
      #now motif_gene_object has a .exons attributes with the genomic coordinates of the protein motif. now we can use the native gff method of the obtained gene object
      return gene.gff(motif_gene_object,    **keyargs)

    def reset(self):
      """ This method is called when the linked prediction is modified, to allow to recompute some or all attributes of the feature. In this case, we are removing all features of this class, and annotating them again with the same method used to add them in first place: annotate_protein_motif"""
      ##removing instances of this class
      for index_to_remove in \\
       [i  for i, f in enumerate(self.parent.features) if f.__class__ == protein_motif ]  [::-1]: \\
	  self.parent.features.pop(index_to_remove)
      #reannotating
      annotate_protein_motif( self.parent, silent=True )


The code above contains the definition of a class (*protein_motif*, including
5 methods), and the function *annotate_protein_motif*. This function
takes as input a *p2ghit* instance, analyzes it, and if any protein
motif is found, it populates its *.features* attribute with one
*protein_motif* instance for each motif found.

If this function is never run, the *protein_motif* class is unused. As
mentioned within the code, to activate it you should add this line to
the main configuration file::

  ACTION.post_filtering.annotate_motif = if x.filtered == 'kept':  annotate_protein_motif(x)

In this way, the *annotate_protein_motif* will be run on every
prediction that passed filtering. The protein motif *C..C* is defined as
the class attribute *motif*, which is of type *RegexObject* from the
pattern matching module *re*. Inside the *annotate_protein_motif*
function, it is searched in the predicted protein sequence its dedicated
method *search*. For each motif found, a *protein_motif* instance is
created, and the start and end positions of the match are stored within
this object; the protein sequence of the motif is also derived and
stored.

Once the *protein_motif* instance is ready, it is appended to
the *.features* list attribute of the input *p2ghit*. Shortly after,
this *p2ghit* reaches the database step, and its information is stored
as a sqlite entry. All the features associated to it are also stored in
the database. For this reason, the method *dump_text* is called on every
feature instance. This method must return a string containing all the
information sufficient to then load it back. The method
*load_dumped_text* is its reverse, and is used during the output phase
to load the dumped information from the database into an empty
*protein_motif* instance. An annotating function (in this case
*annotation_protein_motif*), and the *p2g_feature* class methods
*dump_text* and *load_dumped_text* are the minimal set of definitions to
make a functional feature.

Other attributes and methods can be used to
output the features. To output features to the native selenoprofiles
format (*.p2g*), the class attribute
*included_in_output* must be *True*, and the *output* method has to be
defined. Features can be used for gff output too, if the class attribute
*included_in_gff* is set to *True*. In this case, it makes sense to take
advantage of the *gene* class, the parent of both classes *p2ghit* and
*p2g_feature.*

The *gene* object is designed to represent a genomic
interval, optionally composed by multiple exons, on a certain chromosome
(or scaffold) of a target file. It provides plenty of methods such as
for fasta fetching, cutting subsequences, computing overlaps, merging
gene structures and so on. Its native *gff* method returns one line for
each exon in the object, reporting its coordinates and optionally other
attributes.

In the example above, the *protein_motif* class is not
really used as a *gene* object, but just as a data container for the
attributes *start*, *end*, *sequence*: its attributes *chromosome*,
*strand*, *exons* are not used. Instead, the correct genomic coordinates
of the protein motif are derived dynamically, and added to output by
overriding the native *gff* method of the class gene.

For each motif
instance, its start and end positions relative to the full protein
sequence are available. Thus, the *gene* method *subseq* is used to
derive the global genomic coordinates of the motif. This function
accepts as input a *gene* (self) object, a start position and a region
length, and returns another *gene* object, which contains a subset of
the genomic intervals in the self object. If the desired region spans
any exon boundary, the returned object contains multiple exons. In the
code, the indexes are adjusted for converting protein-based to
nucleotide-based positions. Once the appropriate gene object containing
the global genomic coordinates for the motif is ready
(*motif_gene_object*), the native *gene* class *gff* method can be
called.

Lastly, the method *reset* can be defined for custom features that have
to be recomputed when the predictions are modified, by actions such as
those explained in Improving predictions in the Pipe. In the example, the
*protein_motif* instances are searched and expelled from the *features*
list of the *p2ghit* object for which the *reset* function is run. Then,
the annotating function *annotate_protein_motif* is run again.