seqann package
seqann.sequence_annotation
-
class seqann.sequence_annotation.BioSeqAnn(server: BioSQL.BioSeqDatabase = None, dbversion: str = '3310', datfile: str = '', verbose: bool = False, verbosity: int = 0, pid: str = 'NA', kir: bool = False, align: bool = False, load_features: bool = False, store_features: bool = False, refdata: seqann.models.reference_data.ReferenceData = None, cached_features: Dict[KT, VT] = None, safemode: bool = False, debug: Dict[KT, VT] = None)
Bases: seqann.models.base_model_.Model
from seqann import BioSeqAnn
seqann1 = BioSeqAnn()
seqann2 = BioSeqAnn(dbversion="3300", verbose=True, verbosity=3)
seqann3 = BioSeqAnn(debug={"align": 4}, safemode=True)
Parameters:
- server (BioSQL Database) – A BioSQL database to use for retrieving the sequence features. Using a BioSQL DB will speed up the annotations dramatically.
- dbversion (str) – The IPD-IMGT/HLA or KIR database release.
- datfile (str) – The IPD-IMGT/HLA or KIR dat file to use in place of the server parameter.
- pid (str) – A process label that can be provided to help track the logging output.
- load_features (bool) – Flag for downloading all gene features and accessions from the feature service.
- store_features (bool) – Flag for caching all features and their corresponding accessions.
- cached_features (dict) – Dictionary containing all the features from the feature service.
- kir (bool) – Flag for indicating the input sequences are from the KIR gene system.
- align (bool) – Flag for producing the alignments along with the annotations.
- verbose (bool) – Flag for running in verbose mode.
- verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
- debug (dict) – Dictionary containing names of steps that you want to debug.
- safemode (bool) – Flag for running the annotations in safemode. No alignments will be done if no feature matches were made. This can prevent the alignment step from running for too long on bad sequences.
-
annotate(sequence: Bio.Seq.Seq = None, locus: str = None, nseqs: int = 20, alignseqs: int = 10, skip: List[T] = [], rerun: bool = True, full: bool = True) → seqann.models.annotation.Annotation
annotate - method for annotating a BioPython sequence
Parameters:
- sequence (Seq) – The input consensus sequence.
- locus (str) – The gene locus associated with the sequence.
- nseqs (int) – The number of blast sequences to use.
- alignseqs (int) – The number of sequences to use for targeted alignments.
- skip (List) – A list of alleles to skip when choosing a reference. This is used for validation and testing.
Return type: Annotation
Returns: The annotate function returns an Annotation object that contains the sequence features and the names associated with them.
Example output:
{
    'complete_annotation': True,
    'annotation': {
        'exon_1': SeqRecord(seq=Seq('AGAGACTCTCCCG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[]),
        'exon_2': SeqRecord(seq=Seq('AGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACCAACGGGACGGAGC...GAG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[]),
        'exon_3': SeqRecord(seq=Seq('TGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACA...ATG', SingleLetterAlphabet()), id='HLA:HLA00630', name='<unknown name>', description='HLA:HLA00630', dbxrefs=[])
    },
    'features': {
        'exon_1': SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(13), strand=1), type='exon_1'),
        'exon_2': SeqFeature(FeatureLocation(ExactPosition(13), ExactPosition(283), strand=1), type='exon_2'),
        'exon_3': SeqFeature(FeatureLocation(ExactPosition(283), ExactPosition(503), strand=1), type='exon_3')
    },
    'method': 'nt_search and clustalo',
    'gfe': 'HLA-Aw2-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-4',
    'seq': SeqRecord(seq=Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC...ATG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[])
}
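The 'features' entries in the example output hold start and end positions into 'seq'. The following is a minimal sketch of how such coordinates select subsequences, assuming half-open [start, end) positions. It uses plain tuples and strings in place of the Biopython FeatureLocation and Seq objects, and `slice_features` is a hypothetical helper, not part of seqann.

```python
# Feature coordinates copied from the example output, written as plain
# (start, end) tuples instead of Bio.SeqFeature.FeatureLocation objects.
features = {
    "exon_1": (0, 13),
    "exon_2": (13, 283),
    "exon_3": (283, 503),
}


def slice_features(sequence, features):
    """Hypothetical helper: map each feature name to its subsequence.

    Coordinates are treated as half-open [start, end) string indices.
    """
    return {name: sequence[start:end] for name, (start, end) in features.items()}


# Only the 54 bases shown before the ellipsis in 'seq' are available here,
# so exon_2 is truncated and exon_3 comes back empty.
sequence = "AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC"
parts = slice_features(sequence, features)
print(parts["exon_1"])  # AGAGACTCTCCCG
```

With the full 503+ base sequence, the same slicing would return each exon in full.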
Example usage:
>>> from Bio.Seq import Seq
>>> from seqann import BioSeqAnn
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> seqann = BioSeqAnn()
>>> ann = seqann.annotate(sequence)
>>> for f in ann.annotation:
...     print(f, ann.method, str(ann.annotation[f].seq), sep=" ")
exon_2 nt_search and clustalo AGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACCAACGGGACGGAGCGCGTGCGTTATGTGACCAGATACATCTATAACCGAGAGGAGTACGCACGCTTCGACAGCGACGTGGAGGTGTACCGGGCGGTGACGCCGCTGGGGCCGCCTGCCGCCGAGTACTGGAACAGCCAGAAGGAAGTCCTGGAGAGGACCCGGGCGGAGTTGGACACGGTGTGCAGACACAACTACCAGTTGGAGCTCCGCACGACCTTGCAGCGGCGAG
exon_3 nt_search and clustalo TGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACAACCTGCTGGTCTGCTCAGTGACAGATTTCTATCCAGCCCAGATCAAAGTCCGGTGGTTTCGGAATGACCAGGAGGAGACAACCGGCGTTGTGTCCACCCCCCTTATTAGGAACGGTGACTGGACCTTCCAGATCCTGGTGATGCTGGAAATGACTCCCCAGCATGGAGACGTCTACACCTGCCACGTGGAGCACCCCAGCCTCCAGAACCCCATCACCGTGGAGTGGC
exon_1 nt_search and clustalo AGAGACTCTCCCG
exon_4 nt_search and clustalo GGGCTCAGTCTGAATCTGCCCAGAGCAAGATG
-
ref_align(found_seqs, sequence: Bio.Seq.Seq = None, locus: str = None, annotation: seqann.models.annotation.Annotation = None, partial_ann: seqann.models.annotation.Annotation = None, run: int = 0, cutoff: float = 0.9) → seqann.models.annotation.Annotation
ref_align - method for doing targeted alignments on partial annotations
Parameters:
- found_seqs (Seq) – The input sequence record.
- sequence (Seq) – The input sequence record.
- locus (str) – The gene locus associated with the sequence.
- annotation (Annotation) – The incomplete annotation from a previous iteration.
- partial_ann (Annotation) – The partial annotation after looping through all of the blast sequences.
Return type: Annotation
-
add_alignment(ref_seq, annotation) → seqann.models.annotation.Annotation
add_alignment - method for adding the alignment to an annotation
Parameters:
- ref_seq (List) – List of reference sequences.
- annotation (Annotation) – The complete annotation.
Return type: Annotation
seqann.seq_search
-
class seqann.seq_search.SeqSearch(verbose: bool = False, verbosity: int = 0)
Bases: seqann.models.base_model_.Model
This is a class for annotating a BioPython sequence without using alignment.
Parameters:
- verbose (bool) – Flag for running in verbose mode.
- verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
Example usage:
>>> from seqann.seq_search import SeqSearch
>>> seqsrch = SeqSearch()
-
classmethod from_dict(dikt) → seqann.seq_search.SeqSearch
Returns the dict as a model.
Parameters: dikt (dict) – A dict.
Returns: The SeqSearch built from the dict.
Return type: SeqSearch
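The `from_dict` pattern can be illustrated with a small stand-alone class. This is a sketch only: `SearchSettings` is a hypothetical stand-in and does not reproduce how seqann deserializes its models. It assumes the dict keys match the constructor arguments `verbose` and `verbosity` listed for SeqSearch.

```python
class SearchSettings:
    """Hypothetical stand-in for a model class; not part of seqann."""

    def __init__(self, verbose=False, verbosity=0):
        self.verbose = verbose
        self.verbosity = verbosity

    @classmethod
    def from_dict(cls, dikt):
        """Build an instance from a plain dict, falling back to defaults."""
        return cls(
            verbose=bool(dikt.get("verbose", False)),
            verbosity=int(dikt.get("verbosity", 0)),
        )


settings = SearchSettings.from_dict({"verbose": True, "verbosity": 2})
print(settings.verbose, settings.verbosity)  # True 2
```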
-
search_seqs(seqrec, in_seq, locus, run=0, partial_ann=None)
search_seqs - method for annotating a BioPython sequence without alignment
Parameters:
- seqrec (SeqRecord) – The reference sequence.
- locus (str) – The gene locus associated with the sequence.
- in_seq (SeqRecord) – The input sequence.
- run (int) – The number of runs that have been done.
- partial_ann (Annotation) – A partial annotation from a previous step.
Example usage:
>>> from Bio.Seq import Seq
>>> from seqann.seq_search import SeqSearch
>>> inseq = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> sqsrch = SeqSearch()
>>> # refseqs (the reference sequence) and locus are assumed to be defined already
>>> ann = sqsrch.search_seqs(refseqs, inseq, locus)
-
verbose
Gets the verbose flag of this SeqSearch.
Returns: The verbose flag of this SeqSearch.
Return type: bool
-
verbosity
Gets the verbosity of this SeqSearch.
Returns: The verbosity of this SeqSearch.
Return type: int
seqann.gfe
-
class seqann.gfe.GFE(url='http://feature.nmdp-bioinformatics.org', loci=['KIR2DP1', 'KIR2DL5A', 'KIR2DS4', 'HLA-DRA', 'HLA-DPA1', 'HLA-DQA1', 'HLA-DPB1', 'KIR2DS2', 'KIR3DP1', 'HLA-DRB4', 'KIR2DL1', 'KIR2DS5', 'HLA-DRB3', 'KIR2DS3', 'KIR3DL1', 'HLA-A', 'HLA-DRB5', 'KIR2DL4', 'HLA-DQB1', 'KIR3DL2', 'HLA-B', 'KIR3DS1', 'KIR2DL5B', 'HLA-DRB1', 'KIR3DL3', 'KIR2DS1', 'HLA-C'], load_features=False, store_features=False, cached_features=None, verbose=False, pid='NA', verbosity=0)
Bases: object
This class is used for converting annotations into GFE notations.
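A GFE string such as HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0 can be split into its parts with a few lines of standard Python. The sketch below assumes the layout suggested by the strings in this documentation: a locus name, a literal "w", then hyphen-separated integers. `parse_gfe` is a hypothetical helper and is not part of seqann or pygfe.

```python
def parse_gfe(gfe):
    """Hypothetical helper: split a GFE string into (locus, list of integers).

    Assumes the form '<locus>w<n>-<n>-...'; the last 'w' is used as the
    separator because the numeric part cannot contain one.
    """
    locus, sep, numbers = gfe.rpartition("w")
    if not sep or not locus or not numbers:
        raise ValueError("not a GFE string: %r" % gfe)
    return locus, [int(n) for n in numbers.split("-")]


locus, accessions = parse_gfe("HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0")
print(locus)       # HLA-DQB1
print(accessions)  # [0, 4, 0, 141, 0, 12, 0, 4, 0, 0, 0, 0, 0]
```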
Example:
>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> from seqann.sequence_annotation import BioSeqAnn
>>> from pygfe.pygfe import pyGFE
>>> seq_file = 'test_dq.fasta'
>>> gfe = pyGFE()
>>> server = BioSeqDatabase.open_database(driver="pymysql", user="root",
...                                       passwd="", host="localhost",
...                                       db="bioseqdb")
>>> seqann = BioSeqAnn(server=server)
>>> seq_rec = list(SeqIO.parse(seq_file, 'fasta'))[0]
>>> annotation = seqann.annotate(seq_rec, "HLA-DQB1")
>>> features, gfe = gfe.get_gfe(annotation, "HLA-DQB1")
>>> print(gfe)
HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0
seqann.blast_cmd
-
seqann.blast_cmd.has_hla(x)
-
seqann.blast_cmd.blastn(sequences, locus, nseqs, kir=False, verbose=False, refdata=None, evalue=10)
Gets a list of the alleles that are most similar to the input sequence.
Parameters:
- sequences (SeqRecord) – The input sequence record.
- locus (str) – The gene locus associated with the sequence.
- nseqs (int) – The number of blast sequences to use.
- evalue (int) – The evalue to use (default = 10).
- kir (bool) – Run with KIR or not.
- verbose (bool) – Run in verbose mode.
- refdata (Reference Data) – An object with reference data.
Example usage:
>>> from Bio.Seq import Seq
>>> from seqann.blast_cmd import blastn
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> # locus and nseqs are assumed to be defined already
>>> blast = blastn(sequence, locus, nseqs)
-
seqann.blast_cmd.get_locus(sequences, kir=False, verbose=False, refdata=None, evalue=10)
Gets the locus of the sequence by running blastn.
Parameters:
- sequences – Sequences to blast.
- kir (bool) – Whether the sequences are KIR or not.
Return type: str
Example usage:
>>> from Bio.Seq import Seq
>>> from seqann.blast_cmd import get_locus
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> locus = get_locus(sequence)
seqann.align
-
seqann.align.flatten(l)
-
seqann.align.align_seqs(found_seqs, sequence, locus, start_pos, missing, annotated, cutoff=0.9, verbose=False, verbosity=0)
align_seqs - Aligns sequences with clustalo
Parameters:
- found_seqs (List) – List of the reference sequences.
- sequence (SeqRecord) – The input consensus sequence.
- locus (str) – The gene locus associated with the sequence.
- annotated (dict) – Dictionary of the annotated features.
- start_pos (int) – Where the reference sequence starts.
- missing (List) – List of the unmapped features.
- cutoff (float) – The alignment cutoff.
- verbose (bool) – Flag for running in verbose mode.
- verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
-
seqann.align.find_features(feats, sequ, annotated, start_pos, cutoff)
find_features - Finds the reference sequence features in the alignments and records the positions
Parameters:
- feats (dict) – Dictionary of sequence features.
- sequ (List) – The sequence alignment for the input sequence.
- annotated (dict) – Dictionary of the annotated features.
- start_pos (int) – Where the reference sequence starts.
- missing (List) – List of the unmapped features.
- cutoff (float) – The alignment cutoff.
- verbose (bool) – Flag for running in verbose mode.
- verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
Return type: List
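Locating reference features in an alignment comes down to translating ungapped reference coordinates into alignment columns. The sketch below shows that translation on plain strings. It illustrates the general idea only, not seqann's implementation, and `ungapped_to_columns` and `feature_columns` are hypothetical names. Gaps are assumed to be written as "-", and features are assumed to be non-empty half-open (start, end) ranges.

```python
def ungapped_to_columns(aligned):
    """Entry i is the alignment column holding the i-th non-gap base."""
    return [col for col, base in enumerate(aligned) if base != "-"]


def feature_columns(aligned_ref, ref_features):
    """Hypothetical helper: map ungapped (start, end) feature ranges on the
    reference to half-open column ranges in the gapped alignment."""
    cols = ungapped_to_columns(aligned_ref)
    located = {}
    for name, (start, end) in ref_features.items():
        # First column of the feature, and one past its last column.
        located[name] = (cols[start], cols[end - 1] + 1)
    return located


aligned_ref = "ACG--TTGA"  # reference row of a toy alignment
ref_features = {"exon_1": (0, 3), "intron_1": (3, 7)}
print(feature_columns(aligned_ref, ref_features))
# {'exon_1': (0, 3), 'intron_1': (5, 9)}
```

The same column ranges can then be applied to the input sequence's row of the alignment to read off the matching bases.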
-
seqann.align.resolve_feats(feat_list, seqin, seqref, start, locus, missing, verbose=False, verbosity=0)
resolve_feats - Resolves features from alignments
Parameters:
- feat_list (List) – List of the found features.
- seqin (str) – The input sequence.
- locus (str) – The input locus.
- start (int) – Where the sequence starts in the alignment.
- missing (List) – List of the unmapped features.
- verbose (bool) – Flag for running in verbose mode.
- verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
-
seqann.align.count_diffs(align, feats, inseq, locus, cutoff, verbose=False, verbosity=0)
count_diffs - Counts the number of mismatches, gaps, and insertions and then determines if those are within an acceptable range.
Parameters:
- align (List) – The alignment.
- feats (dict) – Dictionary of the features.
- locus (str) – The gene locus associated with the sequence.
- inseq (str) – The input sequence.
- cutoff (float) – The alignment cutoff.
- verbose (bool) – Flag for running in verbose mode.
- verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
Return type: List
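The counting step described for count_diffs can be sketched on two aligned strings. This is an illustration under stated assumptions, not the library's code: `count_diffs_sketch` is a hypothetical name, gaps are written as "-", and the cutoff is interpreted here as a minimum fraction of matching columns, which may differ from how seqann applies it.

```python
def count_diffs_sketch(aligned_in, aligned_ref, cutoff=0.9):
    """Count matches, mismatches, gaps and insertions between two aligned rows.

    Returns (counts, acceptable), where acceptable is True when the fraction
    of matching columns is at least `cutoff` (an assumed interpretation).
    """
    if len(aligned_in) != len(aligned_ref):
        raise ValueError("aligned rows must have the same length")

    counts = {"matches": 0, "mismatches": 0, "gaps": 0, "insertions": 0}
    for a, b in zip(aligned_in, aligned_ref):
        if a == "-" and b == "-":
            continue  # column is empty in both rows
        if a == "-":
            counts["gaps"] += 1        # reference base missing from the input
        elif b == "-":
            counts["insertions"] += 1  # input base absent from the reference
        elif a == b:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1

    total = sum(counts.values())
    identity = counts["matches"] / total if total else 0.0
    return counts, identity >= cutoff


counts, ok = count_diffs_sketch("ACGT-ACGTA", "ACGTTAC-TA")
print(counts)  # {'matches': 8, 'mismatches': 0, 'gaps': 1, 'insertions': 1}
print(ok)      # False, since 8 of 10 columns match and the cutoff is 0.9
```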