Getting started
---------------

Quickstart
==========

To predict selenoproteins and selenocysteine machinery proteins in a metazoan genome, use::

    selenoprofiles -o outfolder -t genome.fasta -s "species name" -p metazoa,machinery -output_gtf_file all_predictions.gtf

This generates all_predictions.gtf, in addition to other output files in outfolder/species_name.genome/output/

Command line structure
======================

In a single run, selenoprofiles can search for multiple profiles in a single target.
Here's a minimal command line::

    selenoprofiles -o output_folder -t target_file -s "species name" -p profile [options]

These are the compulsory arguments:

* **\-o** output folder; it is created if it does not exist
* **\-t** target_file = a (multi-)fasta file containing nucleotide sequences
* **\-s** a species descriptor with no restrictions. For multiple words, use quotes
* **\-p** the profile(s) to be searched. Multiple comma-separated arguments are accepted. Each argument can be:

  * a profile name: invokes a built-in profile (located in the profiles_folder defined in the config file)
  * a profile set: a keyword expanded to a list of profiles. Profile sets are defined in the config file
  * a path to a profile alignment: to create one, see: selenoprofiles build -h

By default, selenoprofiles assumes the target is a eukaryotic genome sequence, and it will attempt to predict introns.
If you're searching (intronless) prokaryotes or eukaryotic mRNA sequences, use option *-no_splice*.

Blast searches can use multiple CPUs. To control how many, use option *-ncpus*.
This is just one of the many non-compulsory options and parameters.
The config file *~/.selenoprofiles_config.txt* defines default values for all of them.
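
When selenoprofiles is launched from a script, the compulsory arguments and the two options just described can be assembled as an argument list. The following is a minimal sketch, not part of selenoprofiles: the helper ``build_command`` is hypothetical, and it assumes that *-ncpus* takes the number of CPUs as its value.

.. code-block:: python

    import shlex


    def build_command(output_folder, target_file, species, profiles,
                      no_splice=False, ncpus=None):
        """Return a selenoprofiles command line as a list of arguments."""
        cmd = [
            "selenoprofiles",
            "-o", output_folder,
            "-t", target_file,
            "-s", species,              # kept as one argument, even with spaces
            "-p", ",".join(profiles),   # profiles are comma-separated
        ]
        if no_splice:
            # for intronless targets: prokaryotes or eukaryotic mRNA
            cmd.append("-no_splice")
        if ncpus is not None:
            # assumption: -ncpus is followed by the number of CPUs
            cmd += ["-ncpus", str(ncpus)]
        return cmd


    cmd = build_command("outfolder", "genome.fasta", "species name",
                        ["metazoa", "machinery"], ncpus=4)
    print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste

Passing the list to ``subprocess.run(cmd, check=True)`` would run the search without going through a shell, so the species descriptor needs no extra quoting.
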
The full list of options can be inspected with::

    selenoprofiles -h full

Searching for selenoproteins and selenium markers
=================================================

The most common use of selenoprofiles is the prediction of selenoprotein families and selenium metabolism genes.
For this task, you must specify appropriate profile sets: the selenoprotein families expected in a genome depend on its taxonomy.

To search for metazoan selenoprotein families and selenium usage markers, use::

    -p metazoa,machinery

To search for all eukaryotic families, use::

    -p eukarya

To search for all prokaryotic families, use::

    -p prokarya

Selenoprofiles output
=====================

Upon completion, you will find output files in a subfolder of the output folder::

    output_folder/species_name.target_file_name/output/

Selenoprofiles produces one file per gene prediction, per requested format.
The output files are named after the **prediction identifier**, which has the syntax::

    profile_name.index.label

Where:

- *profile_name* identifies the source profile for the prediction
- *index* is an arbitrary numeric id
- *label* identifies the class of predicted gene.

For selenoprotein families, the label can be:

- "selenocysteine" for selenoproteins: UGA is aligned to the Sec position of the profile
- "cysteine" for cysteine-homologs of selenoproteins (i.e. Cys aligned to Sec position)
- if any other amino acid is aligned to Sec, the label takes its name
- for predictions that do not include the Sec position of the profile:

  - "unaligned" if the prediction ends upstream or starts downstream of the Sec position
  - "gapped" if Sec is not aligned but there are homologous regions on both sides

- "pseudo" for predictions containing inframe stops or frameshifts (likely pseudogenes)
- "uga_containing" for predictions whose only pseudogene features are in-frame UGAs

For profiles that are not selenoprotein families (i.e.
do not contain Sec), the label can be:

- "homologue" for standard predictions
- "pseudo" (see above)

Output files are named after the prediction identifier and the format::

    profile_name.index.label.format     # e.g. GPx.1.selenocysteine.p2g

These output **formats** are supported:

- *p2g*: native format with query-target alignment, coordinates and other info; :download:`see an example here`.
- *fasta*: fasta file with the predicted protein sequence
- *gff*: gff file with the genomic coordinates of the prediction
- *gtf*: analogous to gff, but with this last field: gene_id "prediction_id"; transcript_id "prediction_id";
- *cds*: fasta file with coding sequence, including the stop codon if applicable
- *three_prime*: fasta with the sequence immediately at 3' of the prediction. Length is defined by -three_prime_length
- *five_prime*: fasta with the sequence immediately at 5' of the prediction. Length is defined by -five_prime_length
- *dna*: fasta file with the full nucleotide sequence, including introns (and frameshift-causing insertions if any)
- *introns*: fasta file with the sequence of the introns, split into different fasta headers

Additionally, a fasta alignment called *profile_name.ali* is created.
Only one such *ali* file is produced per profile, containing the sequences of all predictions plus the profile sequences.

On the command line, option -output_FORMAT activates the corresponding output for each prediction, e.g. -output_gff will produce gff files.
By default, only the ali and p2g formats are active, as visible in the config file::

    ### active output format
    output_ali=1
    output_p2g=1

To create a single output file for all predictions, use -output_FORMAT_file, providing as argument the file that will be created, e.g. ``-output_fasta_file all_predicted_proteins.fa``
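
For downstream scripts, the naming scheme ``profile_name.index.label.format`` described above can be split back into its fields. The following is a minimal sketch, not part of selenoprofiles: the names ``PredictionFile`` and ``parse_prediction_file`` are hypothetical, and it assumes that the label and the format never contain a dot.

.. code-block:: python

    from typing import NamedTuple


    class PredictionFile(NamedTuple):
        profile_name: str
        index: int
        label: str
        fmt: str


    def parse_prediction_file(filename):
        """Split 'profile_name.index.label.format' into its four fields."""
        # Split from the right, so a dot inside profile_name (if any) is kept.
        parts = filename.rsplit(".", 3)
        if len(parts) != 4 or not parts[1].isdigit():
            raise ValueError(f"not a per-prediction output file: {filename!r}")
        profile_name, index, label, fmt = parts
        return PredictionFile(profile_name, int(index), label, fmt)


    parsed = parse_prediction_file("GPx.1.selenocysteine.p2g")
    print(parsed.profile_name, parsed.index, parsed.label, parsed.fmt)

Per-profile alignments such as *profile_name.ali* do not follow this pattern, so the sketch rejects them with a ``ValueError`` and leaves them to be handled separately.
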