seqann package¶

seqann.sequence_annotation¶

class seqann.sequence_annotation.BioSeqAnn(server: BioSQL.BioSeqDatabase = None, dbversion: str = '3310', datfile: str = '', verbose: bool = False, verbosity: int = 0, pid: str = 'NA', kir: bool = False, align: bool = False, load_features: bool = False, store_features: bool = False, refdata: seqann.models.reference_data.ReferenceData = None, cached_features: Dict[KT, VT] = None, safemode: bool = False, debug: Dict[KT, VT] = None)[source]¶

Bases: seqann.models.base_model_.Model

from seqann import BioSeqAnn
seqann1 = BioSeqAnn()
seqann2 = BioSeqAnn(dbversion="3300", verbose=True, verbosity=3)
seqann3 = BioSeqAnn(debug={"align": 4}, safemode=True)

Parameters:

server (BioSQL Database) – A BioSQL database to use for retrieving the sequence features. Using a BioSQL DB will speed up the annotations dramatically.
dbversion (str) – The IPD-IMGT/HLA or KIR database release.
datfile (str) – The IPD-IMGT/HLA or KIR dat file to use in place of the server parameter.
pid (str) – A process label that can be provided to help track the logging output.
load_features (bool) – Flag for downloading all gene features and accessions from the feature service.
store_features (bool) – Flag for caching all features and their corresponding accessions.
cached_features (dict) – Dictionary containing all the features from the feature service.
kir (bool) – Flag for indicating the input sequences are from the KIR gene system.
align (bool) – Flag for producing the alignments along with the annotations.
verbose (bool) – Flag for running in verbose mode.
verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
debug (dict) – Dictionary containing names of steps that you want to debug.
safemode (bool) – Flag for running the annotations in safemode. No alignments will be done if no feature matches were made. This can prevent the alignment step from running for too long on bad sequences.

annotate(sequence: Bio.Seq.Seq = None, locus: str = None, nseqs: int = 20, alignseqs: int = 10, skip: List[T] = [], rerun: bool = True, full: bool = True) → seqann.models.annotation.Annotation[source]¶

annotate - method for annotating a BioPython sequence

Parameters:

sequence (Seq) – The input consensus sequence.
locus (str) – The gene locus associated with the sequence.
nseqs (int) – The number of blast sequences to use.
alignseqs (int) – The number of sequences to use for targeted alignments.
skip (List) – A list of alleles to skip for using as a reference. This is used for validation and testing.

Return type:	Annotation

Returns:

The annotate function returns an Annotation object that contains the sequence features and the names associated with them.

Example output:

{
     'complete_annotation': True,
     'annotation': {'exon_1': SeqRecord(seq=Seq('AGAGACTCTCCCG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[]),
                    'exon_2': SeqRecord(seq=Seq('AGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACCAACGGGACGGAGC...GAG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[]),
                    'exon_3': SeqRecord(seq=Seq('TGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACA...ATG', SingleLetterAlphabet()), id='HLA:HLA00630', name='<unknown name>', description='HLA:HLA00630', dbxrefs=[])},
     'features': {'exon_1': SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(13), strand=1), type='exon_1'),
                  'exon_2': SeqFeature(FeatureLocation(ExactPosition(13), ExactPosition(283), strand=1), type='exon_2'),
                  'exon_3': SeqFeature(FeatureLocation(ExactPosition(283), ExactPosition(503), strand=1), type='exon_3')},
     'method': 'nt_search and clustalo',
     'gfe': 'HLA-Aw2-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-4',
     'seq': SeqRecord(seq=Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC...ATG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[])
}
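
The coordinates in the features entry read as half-open intervals, as in BioPython's FeatureLocation. The following is a minimal sketch in plain Python, using tuples as stand-ins for SeqFeature objects, of how such coordinates map back to subsequences. The helper names feature_lengths and extract are hypothetical and not part of seqann, and the sequence string is only the truncated prefix shown in the usage example, so only exon_1 can be extracted in full here.

```python
# Coordinates copied from the example output; each pair is a half-open
# interval [start, end). Plain tuples stand in for SeqFeature objects.
features = {
    "exon_1": (0, 13),
    "exon_2": (13, 283),
    "exon_3": (283, 503),
}

# Truncated input sequence: only the leading bases are available here.
sequence = "AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC"


def feature_lengths(features):
    """Return the length in bases of every feature (hypothetical helper)."""
    return {name: end - start for name, (start, end) in features.items()}


def extract(sequence, start, end):
    """Return the bases covered by the half-open interval [start, end)."""
    return sequence[start:end]


lengths = feature_lengths(features)
exon_1 = extract(sequence, *features["exon_1"])

print(lengths)
print(exon_1)
```

Because each feature ends where the next one starts, consecutive features tile the sequence without gaps or overlaps.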

Example usage:

>>> from Bio.Seq import Seq
>>> from seqann import BioSeqAnn
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> seqann = BioSeqAnn()
>>> ann = seqann.annotate(sequence)
>>> for f in ann.annotation:
...    print(f, ann.method, str(ann.annotation[f].seq), sep="       ")
exon_2  nt_search and clustalo  AGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACCAACGGGACGGAGCGCGTGCGTTATGTGACCAGATACATCTATAACCGAGAGGAGTACGCACGCTTCGACAGCGACGTGGAGGTGTACCGGGCGGTGACGCCGCTGGGGCCGCCTGCCGCCGAGTACTGGAACAGCCAGAAGGAAGTCCTGGAGAGGACCCGGGCGGAGTTGGACACGGTGTGCAGACACAACTACCAGTTGGAGCTCCGCACGACCTTGCAGCGGCGAG
exon_3  nt_search and clustalo  TGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACAACCTGCTGGTCTGCTCAGTGACAGATTTCTATCCAGCCCAGATCAAAGTCCGGTGGTTTCGGAATGACCAGGAGGAGACAACCGGCGTTGTGTCCACCCCCCTTATTAGGAACGGTGACTGGACCTTCCAGATCCTGGTGATGCTGGAAATGACTCCCCAGCATGGAGACGTCTACACCTGCCACGTGGAGCACCCCAGCCTCCAGAACCCCATCACCGTGGAGTGGC
exon_1  nt_search and clustalo  AGAGACTCTCCCG
exon_4  nt_search and clustalo  GGGCTCAGTCTGAATCTGCCCAGAGCAAGATG

ref_align(found_seqs, sequence: Bio.Seq.Seq = None, locus: str = None, annotation: seqann.models.annotation.Annotation = None, partial_ann: seqann.models.annotation.Annotation = None, run: int = 0, cutoff: float = 0.9) → seqann.models.annotation.Annotation[source]¶

ref_align - Method for doing targeted alignments on partial annotations

Parameters:

found_seqs (Seq) – The input sequence record.
sequence (Seq) – The input sequence record.
locus (str) – The gene locus associated with the sequence.
annotation (Annotation) – The incomplete annotation from a previous iteration.
partial_ann (Annotation) – The partial annotation after looping through all of the blast sequences.

Return type:	Annotation

add_alignment(ref_seq, annotation) → seqann.models.annotation.Annotation[source]¶

add_alignment - method for adding the alignment to an annotation

Parameters:

ref_seq (List) – List of reference sequences
annotation (Annotation) – The complete annotation

Return type:	Annotation

seqann.sequence_annotation.getblocks(coords)[source]¶

seqann.seq_search¶

class seqann.seq_search.SeqSearch(verbose: bool = False, verbosity: int = 0)[source]¶

Bases: seqann.models.base_model_.Model

This is a class for annotating a BioPython sequence without using alignment

Parameters:

verbose (bool) – Flag for running in verbose mode.
verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.

Example usage:

>>> from seqann.seq_search import SeqSearch
>>> seqsrch = SeqSearch()

classmethod from_dict(dikt) → seqann.seq_search.SeqSearch[source]¶

Returns the dict as a model

Parameters:	dikt – A dict.
Type:	dict
Returns:	The SeqSearch of this SeqSearch.
Return type:	SeqSearch

search_seqs(seqrec, in_seq, locus, run=0, partial_ann=None)[source]¶

search_seqs - method for annotating a BioPython sequence without alignment

Parameters:

seqrec (SeqRecord) – The reference sequence
locus (str) – The gene locus associated with the sequence.
in_seq (SeqRecord) – The input sequence
run (int) – The number of runs that have been done
partial_ann (Annotation) – A partial annotation from a previous step

Return type:	Annotation

Example usage:

>>> from Bio.Seq import Seq
>>> from seqann.seq_search import SeqSearch
>>> inseq = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> sqsrch = SeqSearch()
>>> # refseqs (the reference SeqRecord) and locus (a str) must be defined beforehand
>>> ann = sqsrch.search_seqs(refseqs, inseq, locus)

verbose¶

Gets the verbose of this SeqSearch.

Returns:	The verbose of this SeqSearch.
Return type:	bool

verbosity¶

Gets the verbosity of this SeqSearch.

Returns:	The verbosity of this SeqSearch.
Return type:	int

seqann.seq_search.loctype(s1, e1, s2, e2)[source]¶

seqann.seq_search.getblocks(coords)[source]¶

seqann.gfe¶

class seqann.gfe.GFE(url='http://feature.nmdp-bioinformatics.org', loci=['KIR2DP1', 'KIR2DL5A', 'KIR2DS4', 'HLA-DRA', 'HLA-DPA1', 'HLA-DQA1', 'HLA-DPB1', 'KIR2DS2', 'KIR3DP1', 'HLA-DRB4', 'KIR2DL1', 'KIR2DS5', 'HLA-DRB3', 'KIR2DS3', 'KIR3DL1', 'HLA-A', 'HLA-DRB5', 'KIR2DL4', 'HLA-DQB1', 'KIR3DL2', 'HLA-B', 'KIR3DS1', 'KIR2DL5B', 'HLA-DRB1', 'KIR3DL3', 'KIR2DS1', 'HLA-C'], load_features=False, store_features=False, cached_features=None, verbose=False, pid='NA', verbosity=0)[source]¶

Bases: object

This class is used for converting annotations into GFE notations.

Example:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> from seqann.sequence_annotation import BioSeqAnn
>>> from pygfe.pygfe import pyGFE
>>> seq_file = 'test_dq.fasta'
>>> gfe = pyGFE()
>>> server = BioSeqDatabase.open_database(driver="pymysql", user="root",
...                                       passwd="", host="localhost",
...                                       db="bioseqdb")
>>> seqann = BioSeqAnn(server=server)
>>> seq_rec = list(SeqIO.parse(seq_file, 'fasta'))[0]
>>> annotation = seqann.annotate(seq_rec, "HLA-DQB1")
>>> features, gfe = gfe.get_gfe(annotation, "HLA-DQB1")
>>> print(gfe)
HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0
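
The printed GFE notation can be taken apart with ordinary string handling. The sketch below assumes, based only on the example output above, that the notation is the locus name, the letter "w", and then hyphen-separated integer accession numbers, presumably one per gene feature. The function parse_gfe is a hypothetical helper and not part of seqann or pygfe.

```python
def parse_gfe(gfe):
    """Split a GFE string into (locus, list of accession numbers).

    Assumption: the first "w" separates the locus from the accessions,
    which holds for the loci listed in the GFE constructor because none
    of their names contain a lowercase "w".
    """
    locus, sep, accessions = gfe.partition("w")
    if not sep or not accessions:
        raise ValueError("not a GFE notation: %r" % gfe)
    return locus, [int(a) for a in accessions.split("-")]


locus, accessions = parse_gfe("HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0")
print(locus)
print(accessions)
```

Keeping the accessions as a list preserves their order, which matters because the position of each number identifies the feature it refers to.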

load_features()[source]¶

Loads all the known features from the feature service

locus_features(locus)[source]¶

Returns all features associated with a locus

Parameters:	locus (`str`) – string containing HLA locus.
Return type:	`dict`

get_gfe(annotation, locus)[source]¶

creates GFE from a sequence annotation

Parameters:

locus (str) – The gene locus
annotation (List) – A sequence annotation object

Return type:	`List`

Returns:	The GFE notation and the associated features in an array

seqann.blast_cmd¶

seqann.blast_cmd.has_hla(x)¶

seqann.blast_cmd.blastn(sequences, locus, nseqs, kir=False, verbose=False, refdata=None, evalue=10)[source]¶

Gets a list of the alleles that are most similar to the input sequence

Parameters:

sequences (SeqRecord) – The input sequence record.
locus (str) – The gene locus associated with the sequence.
nseqs (int) – The number of blast sequences to use.
evalue (int) – The evalue to use (default = 10)
kir (bool) – Run with KIR or not
verbose (bool) – Run in verbose mode
refdata (Reference Data) – An object with reference data

Return type:

Blast

Example usage:

>>> from Bio.Seq import Seq
>>> from seqann.blast_cmd import blastn
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> # locus (a str) and nseqs (an int) must be defined beforehand
>>> blast = blastn(sequence, locus, nseqs)

seqann.blast_cmd.get_locus(sequences, kir=False, verbose=False, refdata=None, evalue=10)[source]¶

Gets the locus of the sequence by running blastn

Parameters:

sequences – Sequences to blast
kir – bool, whether the sequences are KIR or not

Return type:	`str`

Example usage:

>>> from Bio.Seq import Seq
>>> from seqann.blast_cmd import get_locus
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> locus = get_locus(sequence)

seqann.align¶

seqann.align.flatten(l)¶

seqann.align.align_seqs(found_seqs, sequence, locus, start_pos, missing, annotated, cutoff=0.9, verbose=False, verbosity=0)[source]¶

align_seqs - Aligns sequences with clustalo

Parameters:

found_seqs (List) – List of the reference sequences
sequence (SeqRecord) – The input consensus sequence.
locus (str) – The gene locus associated with the sequence.
annotated (dict) – Dictionary of the annotated features
start_pos (int) – Where the reference sequence starts
missing (List) – List of the unmapped features
cutoff (float) – The alignment cutoff
verbose (bool) – Flag for running in verbose mode.
verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.

Return type:

Annotation

seqann.align.find_features(feats, sequ, annotated, start_pos, cutoff)[source]¶

find_features - Finds the reference sequence features in the alignments and records the positions

Parameters:

feats (dict) – Dictionary of sequence features
sequ (List) – The sequence alignment for the input sequence
annotated (dict) – Dictionary of the annotated features
start_pos (int) – Where the reference sequence starts
cutoff (float) – The alignment cutoff

Return type:

List
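
Recording feature positions from an alignment requires translating alignment columns, which include gap characters, into positions in the ungapped input sequence. The following is an illustrative sketch of that general idea, not the implementation used by find_features. The helpers column_to_position and project are hypothetical, and "-" is assumed to be the gap character.

```python
def column_to_position(aligned):
    """Map every alignment column to an index in the ungapped sequence.

    Gap columns ("-") map to None because they hold no base.
    """
    mapping = []
    pos = 0
    for ch in aligned:
        if ch == "-":
            mapping.append(None)
        else:
            mapping.append(pos)
            pos += 1
    return mapping


def project(mapping, col_start, col_end):
    """Project alignment columns [col_start, col_end) onto the sequence.

    Returns a half-open (start, end) pair, or None when the whole span
    falls on gaps.
    """
    hits = [p for p in mapping[col_start:col_end] if p is not None]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


mapping = column_to_position("AC--GT")
span = project(mapping, 1, 5)
print(mapping)
print(span)
```

Returning None for an all-gap span makes it easy to treat that feature as unmapped instead of assigning it an empty location.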

seqann.align.resolve_feats(feat_list, seqin, seqref, start, locus, missing, verbose=False, verbosity=0)[source]¶

resolve_feats - Resolves features from alignments

Parameters:

feat_list (List) – List of the found features
seqin (str) – The input sequence
locus (str) – The input locus
start (int) – Where the sequence starts in the alignment
missing (List) – List of the unmapped features
verbose (bool) – Flag for running in verbose mode.
verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.

Return type:

Annotation

seqann.align.count_diffs(align, feats, inseq, locus, cutoff, verbose=False, verbosity=0)[source]¶

count_diffs - Counts the number of mismatches, gaps, and insertions and then determines if those are within an acceptable range.

Parameters:

align (List) – The alignment
feats (dict) – Dictionary of the features
locus (str) – The gene locus associated with the sequence.
inseq (str) – The input sequence
cutoff (float) – The alignment cutoff
verbose (bool) – Flag for running in verbose mode.
verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.

Return type:

List
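
The counting step described for count_diffs can be illustrated on a pair of aligned strings. This is a minimal sketch under stated assumptions and not the library's implementation: the function name count_diffs_sketch is hypothetical, "-" is assumed to be the gap character, and cutoff is interpreted here as the minimum fraction of matching columns, which the documentation above does not confirm.

```python
def count_diffs_sketch(aligned_ref, aligned_in, cutoff=0.9):
    """Count mismatches, gaps, and insertions between two aligned strings.

    A gap is a column where the input has "-" and the reference has a
    base; an insertion is the reverse. Columns with "-" in both are skipped.
    """
    if len(aligned_ref) != len(aligned_in):
        raise ValueError("aligned sequences must have equal length")

    matches = mismatches = gaps = insertions = 0
    for ref_base, in_base in zip(aligned_ref, aligned_in):
        if ref_base == "-" and in_base == "-":
            continue
        if in_base == "-":
            gaps += 1
        elif ref_base == "-":
            insertions += 1
        elif ref_base != in_base:
            mismatches += 1
        else:
            matches += 1

    total = matches + mismatches + gaps + insertions
    identity = matches / total if total else 0.0
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        "insertions": insertions,
        # Assumed meaning of cutoff: minimum acceptable identity.
        "acceptable": identity >= cutoff,
    }


result = count_diffs_sketch("ACGT-ACGTA", "ACGTTAC-TC", cutoff=0.9)
print(result)
```

With the default-style cutoff of 0.9, an alignment like this one, where three of ten columns differ, would be rejected; a looser cutoff accepts it.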