seqann package

seqann.sequence_annotation

class seqann.sequence_annotation.BioSeqAnn(server: BioSQL.BioSeqDatabase = None, dbversion: str = '3310', datfile: str = '', verbose: bool = False, verbosity: int = 0, pid: str = 'NA', kir: bool = False, align: bool = False, load_features: bool = False, store_features: bool = False, refdata: seqann.models.reference_data.ReferenceData = None, cached_features: Dict[KT, VT] = None, safemode: bool = False, debug: Dict[KT, VT] = None)[source]

Bases: seqann.models.base_model_.Model

from seqann import BioSeqAnn
seqann1 = BioSeqAnn()
seqann2 = BioSeqAnn(dbversion="3300", verbose=True, verbosity=3)
seqann3 = BioSeqAnn(debug={"align": 4}, safemode=True)
Parameters:
  • server (BioSQL Database) – A BioSQL database to use for retrieving the sequence features. Using a BioSQL DB will speed up the annotations dramatically.
  • dbversion (str) – The IPD-IMGT/HLA or KIR database release.
  • datfile (str) – The IPD-IMGT/HLA or KIR dat file to use in place of the server parameter.
  • pid (str) – A process label that can be provided to help track the logging output.
  • load_features (bool) – Flag for downloading all gene features and accessions from the feature service.
  • store_features (bool) – Flag for caching all features and their corresponding accessions.
  • cached_features (dict) – Dictionary containing all the features from the feature service.
  • kir (bool) – Flag for indicating the input sequences are from the KIR gene system.
  • align (bool) – Flag for producing the alignments along with the annotations.
  • verbose (bool) – Flag for running in verbose mode.
  • verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
  • debug (dict) – Dictionary containing names of steps that you want to debug.
  • safemode (bool) – Flag for running the annotations in safemode. No alignments are done if no feature matches were made. This prevents the alignment step from running for too long on bad sequences.
annotate(sequence: Bio.Seq.Seq = None, locus: str = None, nseqs: int = 20, alignseqs: int = 10, skip: List[T] = [], rerun: bool = True, full: bool = True) → seqann.models.annotation.Annotation[source]

annotate - method for annotating a BioPython sequence

Parameters:
  • sequence (Seq) – The input consensus sequence.
  • locus (str) – The gene locus associated with the sequence.
  • nseqs (int) – The number of blast sequences to use.
  • alignseqs (int) – The number of sequences to use for targeted alignments.
  • skip (List) – A list of alleles to skip for using as a reference. This is used for validation and testing.
Return type:

Annotation

Returns:

The annotate function returns an Annotation object that contains the sequence features and the names associated with them.

Example output:

{
     'complete_annotation': True,
     'annotation': {'exon_1': SeqRecord(seq=Seq('AGAGACTCTCCCG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[]),
                    'exon_2': SeqRecord(seq=Seq('AGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACCAACGGGACGGAGC...GAG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[]),
                    'exon_3': SeqRecord(seq=Seq('TGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACA...ATG', SingleLetterAlphabet()), id='HLA:HLA00630', name='<unknown name>', description='HLA:HLA00630', dbxrefs=[])},
     'features': {'exon_1': SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(13), strand=1), type='exon_1'),
                  'exon_2': SeqFeature(FeatureLocation(ExactPosition(13), ExactPosition(283), strand=1), type='exon_2'),
                  'exon_3': SeqFeature(FeatureLocation(ExactPosition(283), ExactPosition(503), strand=1), type='exon_3')},
     'method': 'nt_search and clustalo',
     'gfe': 'HLA-Aw2-1-1-1-1-1-1-1-1-1-1-1-1-1-1-1-4',
     'seq': SeqRecord(seq=Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC...ATG', SingleLetterAlphabet()), id='HLA:HLA00630', name='HLA:HLA00630', description='HLA:HLA00630 DQB1*03:04:01 597 bp', dbxrefs=[])
}
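
The coordinates in the 'features' entry follow Biopython's convention of zero-based, half-open intervals, so a feature's sequence is a plain slice of the annotated sequence. The sketch below illustrates this with the coordinates from the example output; plain tuples stand in for FeatureLocation objects, and the helper name slice_feature is hypothetical, not part of seqann.

```python
# Feature coordinates copied from the example output, as (start, end) pairs.
# Plain tuples stand in for Biopython FeatureLocation objects (an assumption
# made to keep the sketch dependency-free).
features = {
    "exon_1": (0, 13),
    "exon_2": (13, 283),
    "exon_3": (283, 503),
}

# Prefix of the annotated sequence, as used in the usage example.
sequence = "AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC"


def slice_feature(seq, location):
    """Return the part of seq covered by a zero-based, half-open (start, end) pair."""
    start, end = location
    return seq[start:end]


# exon_1 is fully contained in the prefix.
exon_1 = slice_feature(sequence, features["exon_1"])

# Only the beginning of exon_2 is available, because the prefix is shorter
# than the feature; slicing past the end simply truncates.
exon_2_prefix = slice_feature(sequence, features["exon_2"])

# With half-open intervals the length of a feature is end - start.
feature_lengths = {name: end - start for name, (start, end) in features.items()}
```

Because the intervals are half-open, adjacent features share a boundary value (13 ends exon_1 and starts exon_2) without overlapping.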

Example usage:

>>> from Bio.Seq import Seq
>>> from seqann import BioSeqAnn
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> seqann = BioSeqAnn()
>>> ann = seqann.annotate(sequence)
>>> for f in ann.annotation:
...    print(f, ann.method, str(ann.annotation[f].seq), sep="       ")
exon_2  nt_search and clustalo  AGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACCAACGGGACGGAGCGCGTGCGTTATGTGACCAGATACATCTATAACCGAGAGGAGTACGCACGCTTCGACAGCGACGTGGAGGTGTACCGGGCGGTGACGCCGCTGGGGCCGCCTGCCGCCGAGTACTGGAACAGCCAGAAGGAAGTCCTGGAGAGGACCCGGGCGGAGTTGGACACGGTGTGCAGACACAACTACCAGTTGGAGCTCCGCACGACCTTGCAGCGGCGAG
exon_3  nt_search and clustalo  TGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACAACCTGCTGGTCTGCTCAGTGACAGATTTCTATCCAGCCCAGATCAAAGTCCGGTGGTTTCGGAATGACCAGGAGGAGACAACCGGCGTTGTGTCCACCCCCCTTATTAGGAACGGTGACTGGACCTTCCAGATCCTGGTGATGCTGGAAATGACTCCCCAGCATGGAGACGTCTACACCTGCCACGTGGAGCACCCCAGCCTCCAGAACCCCATCACCGTGGAGTGGC
exon_1  nt_search and clustalo  AGAGACTCTCCCG
exon_4  nt_search and clustalo  GGGCTCAGTCTGAATCTGCCCAGAGCAAGATG
ref_align(found_seqs, sequence: Bio.Seq.Seq = None, locus: str = None, annotation: seqann.models.annotation.Annotation = None, partial_ann: seqann.models.annotation.Annotation = None, run: int = 0, cutoff: float = 0.9) → seqann.models.annotation.Annotation[source]

ref_align - Method for doing targeted alignments on partial annotations

Parameters:
  • found_seqs (Seq) – The input sequence record.
  • sequence (Seq) – The input sequence record.
  • locus (str) – The gene locus associated with the sequence.
  • annotation (Annotation) – The incomplete annotation from a previous iteration.
  • partial_ann (Annotation) – The partial annotation after looping through all of the blast sequences.
Return type:

Annotation

add_alignment(ref_seq, annotation) → seqann.models.annotation.Annotation[source]

add_alignment - method for adding the alignment to an annotation

Parameters:
  • ref_seq (List) – List of reference sequences
  • annotation (Annotation) – The complete annotation
Return type:

Annotation

seqann.sequence_annotation.getblocks(coords)[source]
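
getblocks is listed without a description. Judging only from its name and its coords argument, it presumably groups coordinates into blocks of consecutive positions. The sketch below illustrates that idea with a hypothetical helper, group_consecutive; it is an assumption about the intent, not the library's implementation, and the real function may expect a different input type.

```python
def group_consecutive(positions):
    """Group integer positions into sorted runs of consecutive values.

    Hypothetical helper for illustration only; not part of seqann.
    """
    blocks = []
    for pos in sorted(set(positions)):
        if blocks and pos == blocks[-1][-1] + 1:
            # pos directly follows the last value of the current block.
            blocks[-1].append(pos)
        else:
            # A gap was found (or this is the first value): start a new block.
            blocks.append([pos])
    return blocks


blocks = group_consecutive([7, 3, 4, 5, 10, 11])
```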

seqann.gfe

class seqann.gfe.GFE(url='http://feature.nmdp-bioinformatics.org', loci=['KIR2DP1', 'KIR2DL5A', 'KIR2DS4', 'HLA-DRA', 'HLA-DPA1', 'HLA-DQA1', 'HLA-DPB1', 'KIR2DS2', 'KIR3DP1', 'HLA-DRB4', 'KIR2DL1', 'KIR2DS5', 'HLA-DRB3', 'KIR2DS3', 'KIR3DL1', 'HLA-A', 'HLA-DRB5', 'KIR2DL4', 'HLA-DQB1', 'KIR3DL2', 'HLA-B', 'KIR3DS1', 'KIR2DL5B', 'HLA-DRB1', 'KIR3DL3', 'KIR2DS1', 'HLA-C'], load_features=False, store_features=False, cached_features=None, verbose=False, pid='NA', verbosity=0)[source]

Bases: object

This class is used for converting annotations into GFE notations.

Example:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> from seqann.sequence_annotation import BioSeqAnn
>>> from pygfe.pygfe import pyGFE
>>> seq_file = 'test_dq.fasta'
>>> gfe = pyGFE()
>>> server = BioSeqDatabase.open_database(driver="pymysql", user="root",
...                                       passwd="", host="localhost",
...                                       db="bioseqdb")
>>> seqann = BioSeqAnn(server=server)
>>> seq_rec = list(SeqIO.parse(seq_file, 'fasta'))[0]
>>> annotation = seqann.annotate(seq_rec, "HLA-DQB1")
>>> features, gfe = gfe.get_gfe(annotation, "HLA-DQB1")
>>> print(gfe)
HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0
load_features()[source]

Loads all the known features from the feature service

locus_features(locus)[source]

Returns all features associated with a locus

Parameters:locus (str) – string containing HLA locus.
Return type:dict
get_gfe(annotation, locus)[source]

Creates a GFE notation from a sequence annotation

Parameters:
  • locus (str) – The gene locus
  • annotation (List) – A sequence annotation object
Return type:

List

Returns:
The GFE notation and the associated features in an array
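
The GFE strings shown on this page, such as 'HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0', consist of a locus, the letter 'w', and a dash-separated list of integers. The sketch below splits such a string back into those parts. The helper parse_gfe is hypothetical and not part of seqann, and reading the integers as one number per gene feature is an assumption based on the examples here.

```python
def parse_gfe(gfe):
    """Split a GFE string into its locus and a list of integer fields.

    Hypothetical helper for illustration only; not part of seqann.
    """
    # None of the locus names listed for the GFE class contains a lowercase
    # 'w', and the numeric fields cannot contain one either, so the last 'w'
    # separates the locus from the numbers.
    locus, sep, fields = gfe.rpartition("w")
    if not sep or not locus or not fields:
        raise ValueError(f"not a GFE string: {gfe!r}")
    # int() raises ValueError on any non-numeric field.
    return locus, [int(field) for field in fields.split("-")]


locus, numbers = parse_gfe("HLA-DQB1w0-4-0-141-0-12-0-4-0-0-0-0-0")
```

Splitting on the last 'w' rather than on the first dash matters because locus names such as HLA-DQB1 contain a dash themselves.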

seqann.blast_cmd

seqann.blast_cmd.has_hla(x)
seqann.blast_cmd.blastn(sequences, locus, nseqs, kir=False, verbose=False, refdata=None, evalue=10)[source]

Gets a list of the alleles that are most similar to the input sequence

Parameters:
  • sequences (SeqRecord) – The input sequence record.
  • locus (str) – The gene locus associated with the sequence.
  • nseqs (int) – The number of blast sequences to use.
  • evalue (int) – The evalue to use (default = 10)
  • kir (bool) – Run with KIR or not
  • verbose (bool) – Flag for running in verbose mode.
  • refdata (Reference Data) – An object with reference data
Return type:

Blast

Example usage:

>>> from Bio.Seq import Seq
>>> from seqann.blast_cmd import blastn
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> locus = 'HLA-DQB1'
>>> nseqs = 20
>>> blast = blastn(sequence, locus, nseqs)
seqann.blast_cmd.get_locus(sequences, kir=False, verbose=False, refdata=None, evalue=10)[source]

Gets the locus of the sequence by running blastn

Parameters:
  • sequences – The sequences to blast
  • kir – bool whether the sequences are KIR or not
Return type:

str

Example usage:

>>> from Bio.Seq import Seq
>>> from seqann.blast_cmd import get_locus
>>> sequence = Seq('AGAGACTCTCCCGAGGATTTCGTGTACCAGTTTAAGGCCATGTGCTACTTCACC')
>>> locus = get_locus(sequence)

seqann.align

seqann.align.flatten(l)
seqann.align.align_seqs(found_seqs, sequence, locus, start_pos, missing, annotated, cutoff=0.9, verbose=False, verbosity=0)[source]

align_seqs - Aligns sequences with clustalo

Parameters:
  • found_seqs (List) – List of the reference sequences
  • sequence (SeqRecord) – The input consensus sequence.
  • locus (str) – The gene locus associated with the sequence.
  • annotated (dict) – Dictionary of the annotated features
  • start_pos (int) – Where the reference sequence starts
  • missing (List) – List of the unmapped features
  • cutoff (float) – The alignment cutoff
  • verbose (bool) – Flag for running in verbose mode.
  • verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
Return type:

Annotation

seqann.align.find_features(feats, sequ, annotated, start_pos, cutoff)[source]

find_features - Finds the reference sequence features in the alignments and records the positions

Parameters:
  • feats (dict) – Dictionary of sequence features
  • sequ (List) – The sequence alignment for the input sequence
  • annotated (dict) – Dictionary of the annotated features
  • start_pos (int) – Where the reference sequence starts
  • missing (List) – List of the unmapped features
  • cutoff (float) – The alignment cutoff
  • verbose (bool) – Flag for running in verbose mode.
  • verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
Return type:

List
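
Recording feature positions from an alignment amounts to projecting coordinates from the ungapped reference, through the alignment columns, onto the ungapped input sequence. The sketch below shows that projection for a single feature. It is a minimal illustration under stated assumptions, not the library's algorithm: the function name map_feature_to_input is hypothetical, gaps are assumed to be '-' characters, and the feature is given as a zero-based, half-open interval on the reference.

```python
def map_feature_to_input(ref_aln, inp_aln, start, end):
    """Project the reference interval [start, end) onto the input sequence.

    ref_aln and inp_aln are the two rows of a pairwise alignment (equal
    length, '-' for gaps). Returns a half-open (start, end) pair in ungapped
    input coordinates, or None if no input base aligns to the feature.
    Hypothetical helper for illustration only; not part of seqann.
    """
    if len(ref_aln) != len(inp_aln):
        raise ValueError("alignment rows must have the same length")

    ref_pos = 0   # index of the next ungapped reference base
    inp_pos = 0   # index of the next ungapped input base
    covered = []  # input positions aligned to a base inside the feature

    for ref_char, inp_char in zip(ref_aln, inp_aln):
        in_feature = ref_char != "-" and start <= ref_pos < end
        if in_feature and inp_char != "-":
            covered.append(inp_pos)
        if ref_char != "-":
            ref_pos += 1
        if inp_char != "-":
            inp_pos += 1

    if not covered:
        return None
    return covered[0], covered[-1] + 1


# Toy alignment: the input lacks one reference base and has one extra base.
ref_aln = "ACGT-ACGT"
inp_aln = "AC-TTACGA"
mapped = map_feature_to_input(ref_aln, inp_aln, 2, 6)
```

In this toy case the reference feature GTAC (positions 2 to 6) maps to positions 2 to 6 of the input, which reads TTAC: the deleted G is skipped and the inserted T falls inside the mapped interval.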

seqann.align.resolve_feats(feat_list, seqin, seqref, start, locus, missing, verbose=False, verbosity=0)[source]

resolve_feats - Resolves features from alignments

Parameters:
  • feat_list (List) – List of the found features
  • seqin (str) – The input sequence
  • locus (str) – The input locus
  • start (int) – Where the sequence starts in the alignment
  • missing (List) – List of the unmapped features
  • verbose (bool) – Flag for running in verbose mode.
  • verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
Return type:

Annotation

seqann.align.count_diffs(align, feats, inseq, locus, cutoff, verbose=False, verbosity=0)[source]

count_diffs - Counts the number of mismatches, gaps, and insertions and then determines if those are within an acceptable range.

Parameters:
  • align (List) – The alignment
  • feats (dict) – Dictionary of the features
  • locus (str) – The gene locus associated with the sequence.
  • inseq (str) – The input sequence
  • cutoff (float) – The alignment cutoff
  • verbose (bool) – Flag for running in verbose mode.
  • verbosity (int) – Numerical value to indicate how verbose the output will be in verbose mode.
Return type:

List
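
The idea behind count_diffs, counting mismatches, gaps, and insertions and then comparing the result with a cutoff, can be sketched on a pairwise alignment as follows. This is a simplified illustration, not the library's implementation: the function names are hypothetical, and treating cutoff as a minimum fraction of matching columns is an assumption, since the exact acceptance rule is not documented here.

```python
def count_alignment_diffs(ref_aln, inp_aln):
    """Count matches, mismatches, gaps, and insertions column by column.

    A 'gap' is a reference base with no input base; an 'insertion' is an
    input base with no reference base. Hypothetical helper, not seqann code.
    """
    if len(ref_aln) != len(inp_aln):
        raise ValueError("alignment rows must have the same length")

    counts = {"matches": 0, "mismatches": 0, "gaps": 0, "insertions": 0}
    for ref_char, inp_char in zip(ref_aln, inp_aln):
        if ref_char == "-" and inp_char == "-":
            continue  # column is empty in both rows; nothing to count
        if ref_char == "-":
            counts["insertions"] += 1
        elif inp_char == "-":
            counts["gaps"] += 1
        elif ref_char == inp_char:
            counts["matches"] += 1
        else:
            counts["mismatches"] += 1
    return counts


def within_cutoff(counts, cutoff=0.9):
    """Accept the alignment if the fraction of matching columns >= cutoff.

    Assumed acceptance rule, chosen for illustration.
    """
    total = sum(counts.values())
    if total == 0:
        return False
    return counts["matches"] / total >= cutoff


counts = count_alignment_diffs("ACGT-ACGT", "AC-TTACGA")
accepted_strict = within_cutoff(counts, cutoff=0.9)
accepted_loose = within_cutoff(counts, cutoff=0.6)
```

With six matching columns out of nine, the toy alignment fails a 0.9 cutoff and passes a 0.6 cutoff.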