ppiref.comparison

Methods to measure similarity between protein-protein interactions.

class ppiref.comparison.FoldseekMMComparator(path: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/ppiref/envs/latest/lib/python3.11/site-packages/external/foldseek/bin/foldseek'), args: str = '', **kwargs)

Bases: PPIComparator

Wrapper for the Foldseek-MM (Foldseek-Multimer) protein-protein interaction comparator.

Foldseek-MM converts each of the partners to a sequence in the 3Di alphabet and then uses MMseqs2 to find similar sequences combined with search for consensus alignment between individual partners.

Please note that the wrapper does not aim to provide an efficient way to use Foldseek-MM. The official, purely C++ implementation should perform better for a single comparison or a database search. This wrapper exists to provide a unified interface with the other comparators for large-scale pairwise comparison in parallel.

To use the wrapper, please install Foldseek in the PPIRef/external/foldseek directory. Alternatively, you can install Foldseek in a different location and change ppiref.definitions.FOLDSEEK_PATH accordingly. The resulting directory structure may look like this:

foldseek
├── bin
│   └── foldseek
└── README.md

1 directory, 2 files

If you find Foldseek-MM useful, please cite the original papers:

@article{kim2024rapid,
    title={Rapid and sensitive protein complex alignment with foldseek-Multimer},
    author={Kim, Woosub and Mirdita, Milot and Karin, Eli Levy and Gilchrist, Cameron LM and Schweke, Hugo and S{\"o}ding, Johannes and Levy, Emmanuel D and Steinegger, Martin},
    journal={bioRxiv},
    pages={2024--04},
    year={2024},
    publisher={Cold Spring Harbor Laboratory}
}

@article{van2022foldseek,
    title={Foldseek: fast and accurate protein structure search},
    author={van Kempen, Michel and Kim, Stephanie S and Tumescheit, Charlotte and Mirdita, Milot and Gilchrist, Cameron LM and S{\"o}ding, Johannes and Steinegger, Martin},
    journal={Biorxiv},
    pages={2022--02},
    year={2022},
    publisher={Cold Spring Harbor Laboratory}
}
Parameters:
  • path (Path, optional) – Path to the foldseek executable. Defaults to ppiref.definitions.FOLDSEEK_PATH.

  • args (str, optional) – Optional command line arguments to be passed to Foldseek-MM. Defaults to ''.

compare(ppi0: Path, ppi1: Path) dict

Compare two protein-protein interactions with Foldseek-MM.

Parameters:
  • ppi0 (Path) – Path to the first .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • ppi1 (Path) – Path to the second .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

Returns:

Dictionary with two PPI ids being compared and comparison metrics produced by Foldseek-MM. If there is no hit detected, the dictionary only contains the PPI ids.

Return type:

dict

class ppiref.comparison.IAlign(path: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/ppiref/envs/latest/lib/python3.11/site-packages/external/ialign/bin/ialign.pl'), args: str = '-a 2 -minp 1 -mini 1', tmp_out_pref: str = 'iAlign_tmp_out', multiple_resolution: tuple[str, Literal['max', 'min']] = ('P-value', 'min'), **kwargs)

Bases: PPIComparator

Wrapper for the iAlign protein-protein interaction comparator.

iAlign is the direct adaptation of TM-align (3D alignment of protein structures) to protein-protein interfaces.

To use the wrapper, please download the official Perl source code from the iAlign website and place it under PPIRef/external/iAlign. Alternatively, you can place iAlign under a different location but change the ppiref.definitions.IALIGN_PATH. After installation, the resulting directory structure may look like this:

iAlign
├── README.md
├── bin
└── example

3 directories, 1 file

If you find iAlign useful, please cite the original paper:

@article{gao2010ialign,
    title={iAlign: a method for the structural comparison of protein--protein interfaces},
    author={Gao, Mu and Skolnick, Jeffrey},
    journal={Bioinformatics},
    volume={26},
    number={18},
    pages={2259--2265},
    year={2010},
    publisher={Oxford University Press}
}
Parameters:
  • path (Path, optional) – Path to the ialign.pl executable. Defaults to ppiref.definitions.IALIGN_PATH.

  • args (str, optional) – Command line arguments to use for iAlign. Defaults to '-a 2 -minp 1 -mini 1' to operate in verbose mode and not to skip small interfaces.

  • tmp_out_pref (str, optional) – Prefix to use for temporary directories storing iAlign outputs. Defaults to 'iAlign_tmp_out'.

  • multiple_resolution (tuple[str, Literal['max', 'min']], optional) – If one or both input files contain multiple interfaces, this parameter specifies how to resolve multiple comparison results. The first element of the tuple specifies the metric to use for resolution, and the second element specifies whether to use the maximum or minimum.
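The resolution rule described for multiple_resolution can be pictured with a small self-contained sketch. It does not call iAlign or reproduce the wrapper's internals: the function name resolve_multiple, the 'IS-score' key, and the metric values are hypothetical and only illustrate picking one result by the minimum or maximum of a chosen metric.

```python
from typing import Literal


def resolve_multiple(
    results: list[dict],
    multiple_resolution: tuple[str, Literal["max", "min"]] = ("P-value", "min"),
) -> dict:
    """Pick one result from several by the min or max of a chosen metric."""
    metric, mode = multiple_resolution
    if mode not in ("max", "min"):
        raise ValueError(f"Unknown resolution mode: {mode}")
    pick = max if mode == "max" else min
    return pick(results, key=lambda r: r[metric])


# Hypothetical metric values for two interface pairs found in the inputs.
results = [
    {"P-value": 0.20, "IS-score": 0.31},
    {"P-value": 0.01, "IS-score": 0.72},
]
best = resolve_multiple(results, ("P-value", "min"))
```

With the default ('P-value', 'min'), the comparison with the smallest P-value is kept, so here the second result is selected.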

compare(ppi0: Path, ppi1: Path) dict

Compare two protein-protein interactions with iAlign.

Parameters:
  • ppi0 (Path) – Path to the first .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • ppi1 (Path) – Path to the second .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

Returns:

Dictionary with two PPI ids being compared and all comparison metrics produced by iAlign.

Return type:

dict

class ppiref.comparison.IDist(kind: Literal['amino_acid_one_hot', 'esm_embedding', 'meiler_embedding'] = 'amino_acid_one_hot', near_duplicate_threshold: float = 0.04, pdb_dir: Path | str | None = None, max_interface_size: int = 1000000, *args, **kwargs)

Bases: PPIComparator

Implementation of the iDist protein-protein interaction comparator used to create the PPIRef dataset.

The comparator uses simple, non-parametrized, one-step message passing to embed protein-protein interfaces and then compares them using Euclidean distance. iDist approximates the 3D alignment-based methods iAlign and US-align in detecting near-duplicate protein-protein interfaces. iDist is more than 100 times faster and finds the same near duplicates with 99% precision and 97% recall.

Parameters:
  • kind (IDIST_EMBEDDING_KIND, optional) – Kind of node embeddings to use for message passing. Defaults to ‘amino_acid_one_hot’ which leads to the best alignment approximation performance.

  • near_duplicate_threshold (float, optional) – Threshold on Euclidean distance to detect near-duplicate interfaces. It is recommended to use the threshold of 0.04 for the interfaces extracted with the 6A cutoff radius between heavy atoms, and the threshold of 0.03 with the 10A cutoff. Please see the paper for details. Defaults to 0.04.

  • pdb_dir (Optional[Path], optional) – Directory storing the complete .pdb files from which the interfaces were extracted. Must not be None if kind == 'esm_embedding', as the ESM protein language model is used with full protein sequences. Defaults to None.

  • max_interface_size (int, optional) – Maximum number of nodes in the interface graph.

build_index() None

Build an index for fast near-duplicate detection based on Euclidean distance between embeddings.

cluster_embeddings() array

Cluster embeddings in the iDist cache using the agglomerative clustering algorithm such that there are no near-duplicated PPI interfaces in different clusters.

The clustering is performed based on the Euclidean distance between embeddings and iteratively connects embeddings that are closer than the near-duplicate threshold of iDist. By using the 'single' linkage strategy, the algorithm ensures that there is no contamination across clusters (i.e. no near-duplicates in different clusters). The clusters are then suitable for creating leakage-free data splits for machine learning.

Returns:

Cluster labels for each embedding from cache.

Return type:

np.array
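The 'no near-duplicates in different clusters' property of single-linkage clustering at a distance threshold can be illustrated without the iDist cache. The sketch below is a pure-Python stand-in, not the library's implementation: it links any two points closer than the threshold with a union-find structure, which is equivalent to cutting a single-linkage dendrogram at that threshold. The function name and the toy 2D points are hypothetical, and the strict '<' comparison is an assumption.

```python
import math


def single_linkage_clusters(embeddings: list[list[float]], threshold: float) -> list[int]:
    """Label points so that any two points closer than `threshold` share a label."""
    parent = list(range(len(embeddings)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if math.dist(embeddings[i], embeddings[j]) < threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive integers in order of first appearance.
    roots: dict[int, int] = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(embeddings))]


# Toy 2D "embeddings": the first three form a chain, the last one is isolated.
points = [[0.00, 0.0], [0.03, 0.0], [0.06, 0.0], [1.00, 0.0]]
labels = single_linkage_clusters(points, threshold=0.04)
```

The first and third points are 0.06 apart, which is above the threshold, yet they land in the same cluster because both are near duplicates of the second point. This chaining is what keeps near duplicates from ending up in different folds of a split.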

compare(path0: Path | str, path1: Path | str) dict

Compare two protein-protein interfaces with iDist.

Parameters:
  • path0 (Union[Path, str]) – Path to the first .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • path1 (Union[Path, str]) – Path to the second .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

Returns:

Dictionary with two PPI ids being compared and the resulting iDist distance (Euclidean distance in the embedding space).

Return type:

dict

compare_all_against_all(ppis0: Iterable[Path] = None, ppis1: Iterable[Path] = None, ppi_pairs: Iterable[Iterable[Path]] = None, embed: bool = True) DataFrame

Compare all PPIs from one set against all PPIs from another set efficiently using iDist.

Parameters:
  • ppis0 (Iterable[Path], optional) – First set of PPI paths. Defaults to None.

  • ppis1 (Iterable[Path], optional) – Second set of PPI paths. Defaults to None.

  • ppi_pairs (Iterable[Iterable[Path]], optional) – Pre-defined pairs to compare instead of complete pair-wise comparison of two sets. Defaults to None.

  • embed (bool, optional) – If set to True, embeds all PPIs before comparison so that the same embedding is not computed twice. Defaults to True.

Returns:

Data frame with comparison results. The data frame has two columns corresponding to pairs of PPI ids, and an additional column with iDist distances.

Return type:

pd.DataFrame

deduplicate_embeddings() None

Deduplicate embeddings in the iDist cache based on the threshold Euclidean distance between them.

The method iteratively removes embeddings that are closer than the threshold to any other embedding while making only one-sided comparisons (i.e. a<->b but not b<->a). Since the number of embeddings may be large, the method processes the pairwise distance matrix in chunks of consecutive rows.
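One possible reading of the one-sided, chunked deduplication is sketched below in pure Python. It is an assumption-laden illustration rather than the library's code: each row is compared only with the rows that follow it (the upper triangle of the distance matrix), and a row is dropped if any later row is closer than the threshold. The function name, the PPI ids, and the chunk_size parameter are hypothetical.

```python
import math


def deduplicate(
    embeddings: dict[str, list[float]], threshold: float, chunk_size: int = 2
) -> dict[str, list[float]]:
    """Drop every embedding that has a near duplicate later in the ordering."""
    ids = list(embeddings)
    drop: set[str] = set()
    # Walk the rows in chunks; with a real distance matrix this bounds memory use.
    for start in range(0, len(ids), chunk_size):
        for i in range(start, min(start + chunk_size, len(ids))):
            # One-sided: compare row i only with the rows after it.
            if any(
                math.dist(embeddings[ids[i]], embeddings[ids[j]]) < threshold
                for j in range(i + 1, len(ids))
            ):
                drop.add(ids[i])
    return {k: v for k, v in embeddings.items() if k not in drop}


# Hypothetical ids and toy 2D embeddings; "a" and "b" are near duplicates.
cache = {"a": [0.0, 0.0], "b": [0.01, 0.0], "c": [1.0, 0.0]}
kept = deduplicate(cache, threshold=0.04)
```

Because comparisons are one-sided, exactly one member of the near-duplicate pair survives ("b" here) instead of both being removed.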

embed(ppi: Path, store: bool = True) array

Embed a protein-protein interface.

Parameters:
  • ppi (Path) – Path to the .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • store (bool, optional) – Set to True to store the embedding in cache. Defaults to True.

Returns:

PPI interface embedding.

Return type:

np.array

embed_parallel(ppis: Iterable[Path], chunksize: int = 1) None

Embed a set of PPIs in parallel and store in cache.

Parameters:
  • ppis (Iterable[Path]) – Paths to the .pdb files containing PPIs. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • chunksize (int) – Number of PPIs to embed at a time by a single process.

embed_without_exception(ppi: Path) array

Embed a PPI and catch exceptions to avoid breaking the parallel execution.

Parameters:

ppi (Path) – Path to the .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

Returns:

PPI interface embedding.

Return type:

np.array

get_embeddings() DataFrame

Get embeddings stored in cache as a Pandas data frame.

Returns:

Data frame with embeddings in rows indexed by PPI ids.

Return type:

pd.DataFrame

query(q: array) array

Query the index for near-duplicate embeddings.

Parameters:

q (np.array) – Input query embedding(s). May be a single embedding or a stack of multiple embeddings.

Returns:

If a single query embedding is provided, returns an array of near-duplicate PPI ids. If multiple query embeddings are provided, returns an array of arrays of near-duplicate PPI ids. The ids are sorted by their distance to the query embedding(s), with the first id being the closest near duplicate.

Return type:

np.array
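The shape of the result for a single query can be pictured with a brute-force stand-in. The sketch below does not use the index built by build_index(); it scans a small dictionary of hypothetical ids and toy embeddings and returns the ids within the threshold, closest first. The function name and the default threshold argument are assumptions for illustration.

```python
import math


def query_near_duplicates(
    index: dict[str, list[float]], q: list[float], threshold: float = 0.04
) -> list[str]:
    """Return ids of embeddings closer than `threshold` to q, closest first."""
    hits = sorted((math.dist(q, emb), ppi_id) for ppi_id, emb in index.items())
    return [ppi_id for distance, ppi_id in hits if distance < threshold]


# Hypothetical ids with toy 2D embeddings.
index = {"x": [0.0, 0.0], "y": [0.02, 0.0], "z": [0.5, 0.0]}
near = query_near_duplicates(index, [0.0, 0.005])
```

For a stack of query embeddings, the same function would be applied to each row, giving one list of ids per query.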

read_embeddings(df: Path | DataFrame, dropna: bool = False) None

Read embeddings from a .csv file or a Pandas data frame and store them in cache.

Parameters:
  • df (Union[Path, pd.DataFrame]) – Pandas data frame with embeddings.

  • dropna (bool, optional) – Drop rows (embeddings) containing NaN values. A NaN value may appear if iDist fails to embed a PPI. Defaults to False.

write_embeddings(path: Path) None

Write embeddings stored in cache to a .csv file.

class ppiref.comparison.PPIComparator(max_workers: int = 0, parallel_kind: Literal['threads', 'processes'] = 'processes', verbose=False)

Bases: ABC

Abstract class for comparing protein-protein interactions (PPIs).

Parameters:
  • max_workers (int, optional) – Number of workers to use for parallel operations (such as comparing large sets of PPIs pairwise). Defaults to os.cpu_count() - 2.

  • parallel_kind (Literal['threads', 'processes'], optional) – Use multi-threading or multi-processing for parallel operations. Defaults to ‘processes’.

  • verbose (bool, optional) – If set to True, prints detailed log to the standard output. May be useful for debugging. Defaults to False.

compare(ppi0: Path, ppi1: Path) dict

Abstract method for comparing two PPIs. Should be implemented in a subclass.

Parameters:
  • ppi0 (Path) – Path to the first .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • ppi1 (Path) – Path to the second .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

Returns:

Dictionary with comparison results.

Return type:

dict

compare_all_against_all(ppis0: Iterable[Path] = None, ppis1: Iterable[Path] = None, ppi_pairs: Iterable[Iterable[Path]] = None) DataFrame

Compare all PPIs from one set against all PPIs from another set. This method in the abstract class serves as the default implementation, in which all PPI pairs are compared in a data-parallel way. A subclass may override this method to provide a more efficient implementation.

Parameters:
  • ppis0 (Iterable[Path], optional) – First set of PPI paths. Defaults to None.

  • ppis1 (Iterable[Path], optional) – Second set of PPI paths. Defaults to None.

  • ppi_pairs (Iterable[Iterable[Path]], optional) – Pre-defined pairs to compare instead of complete pair-wise comparison of two sets. Defaults to None.

Returns:

Data frame with comparison results. The data frame has two columns corresponding to pairs of PPI ids, and additional columns with comparison metrics.

Return type:

pd.DataFrame
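The data-parallel default can be sketched with the standard library alone. The example below is not the package's implementation: compare() is a stand-in with a made-up 'score' metric, the 'ppi0'/'ppi1' keys are hypothetical column names, and a thread pool is used for brevity. It shows the two input modes, a full Cartesian product of two sets or an explicit list of pairs, with each pair handled as an independent task.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from pathlib import Path


def compare(ppi0: Path, ppi1: Path) -> dict:
    """Stand-in for a subclass's compare(); the 'score' metric is made up."""
    return {
        "ppi0": ppi0.stem,
        "ppi1": ppi1.stem,
        "score": abs(len(ppi0.stem) - len(ppi1.stem)),
    }


def compare_all_against_all(ppis0=None, ppis1=None, ppi_pairs=None, max_workers=2):
    # Use the explicit pairs if given, otherwise the full Cartesian product.
    if ppi_pairs is not None:
        pairs = [tuple(pair) for pair in ppi_pairs]
    else:
        pairs = list(product(ppis0, ppis1))
    # Each pair is an independent task, so the work is data parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: compare(*pair), pairs))


set0 = [Path("1abc_A_B.pdb"), Path("2xyz_C_D.pdb")]
set1 = [Path("3def_E_F.pdb"), Path("4ghi_G_H.pdb")]
rows = compare_all_against_all(set0, set1)
```

The returned list of dictionaries has one row per pair and could be passed to a data frame constructor to obtain a table like the one described above.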

class ppiref.comparison.SequenceIdentityComparator(pdb_dir: Path | str, nested_pdb_dir: bool = False, aligner: PairwiseAligner = None, **kwargs)

Bases: PPIComparator

Protein-protein interaction comparator based on sequence identity. The comparator uses the BioPython library to align protein sequences and calculate the pairwise sequence identity. The similarity is calculated as the maximum pairwise sequence identity between chains from different PPIs.

Parameters:
  • pdb_dir (Union[Path, str]) – Directory with complete .pdb files that were used to extract PPI interactions from. The directory is used to get complete protein sequences for comparison.

  • nested_pdb_dir (bool) – True if files are in the PDB format pdb_dir/bc/abcd.pdb. False if files are in the format pdb_dir/abcd.pdb. Defaults to False.

  • aligner (Align.PairwiseAligner, optional) – BioPython Aligner to use. Defaults to None to use the one employed in the PoseBusters package.

compare(ppi0: Path, ppi1: Path) dict

Compare two protein-protein interactions based on sequence similarity.

Parameters:
  • ppi0 (Path) – Path to the first .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • ppi1 (Path) – Path to the second .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

Returns:

Dictionary with two PPI ids being compared and maximum pairwise sequence identity between chains from different PPIs.

Return type:

dict
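The 'maximum over chain pairs' rule can be shown in isolation. In the sketch below the alignment step is deliberately replaced by a naive, ungapped position-by-position identity, so it is not what BioPython's PairwiseAligner computes; only the outer maximum over chains from the two PPIs mirrors the description above. Function names and sequences are hypothetical.

```python
from itertools import product


def naive_identity(seq0: str, seq1: str) -> float:
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(seq0), len(seq1))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(seq0, seq1)) / n


def max_pairwise_identity(chains0: list[str], chains1: list[str]) -> float:
    """Maximum identity over all pairs of chains taken from different PPIs."""
    return max(naive_identity(s0, s1) for s0, s1 in product(chains0, chains1))


# Toy chain sequences for two PPIs with two chains each.
ppi0_chains = ["ACDE", "GGGG"]
ppi1_chains = ["ACDF", "WWWW"]
identity = max_pairwise_identity(ppi0_chains, ppi1_chains)
```

Taking the maximum means that two PPIs count as similar as soon as any one chain from the first resembles any one chain from the second.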

class ppiref.comparison.USalign(path: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/ppiref/envs/latest/lib/python3.11/site-packages/external/USalign/USalign'), args: str = '', **kwargs)

Bases: PPIComparator

Wrapper for the US-align protein-protein interaction comparator.

Compared to iAlign, US-align is a more recent adaptation of TM-align (3D alignment of protein structures) and is designed for the unified comparison of different kinds of macromolecules.

To use the wrapper, please download the official compiled C++ executable from the US-align website and place it in the PPIRef/external/USalign directory. Alternatively, you can place US-align under a different location but change the ppiref.definitions.USALIGN_PATH. The resulting directory structure may look like this:

USalign
└── USalign

1 directory, 1 file

If you find US-align useful, please cite the original paper:

@article{zhang2022us,
    title={US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes},
    author={Zhang, Chengxin and Shine, Morgan and Pyle, Anna Marie and Zhang, Yang},
    journal={Nature methods},
    volume={19},
    number={9},
    pages={1109--1115},
    year={2022},
    publisher={Nature Publishing Group US New York}
}
Parameters:
  • path (Path, optional) – Path to the USalign executable. Defaults to ppiref.definitions.USALIGN_PATH.

  • args (str, optional) – Optional command line arguments to be passed to US-align. Defaults to ''.

compare(ppi0: Path, ppi1: Path) dict

Compare two protein-protein interactions with US-align.

Parameters:
  • ppi0 (Path) – Path to the first .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

  • ppi1 (Path) – Path to the second .pdb file containing a PPI. It is recommended to use the files produced by the ppiref.extraction.PPIExtractor class.

Returns:

Dictionary with two PPI ids being compared and all comparison metrics produced by US-align.

Return type:

dict

ppiref.definitions

Global variables used across the package.

ppiref.extraction

Module for extracting protein-protein interfaces from .pdb files.

class ppiref.extraction.PPIExtractor(out_dir: Path | str, kind: Literal['heavy', 'bsa'] = 'heavy', radius: float = 10.0, expansion_radius: float = 0.0, bsa: bool = False, join: bool = False, nest_out_dir: bool = True, max_workers: int = 0, chunk_size: int = 1, verbose: bool = False, noppi_files: bool = True, input_format: Literal['pdb', 'haddock'] = 'pdb')

Bases: object

Extract protein-protein interfaces from .pdb files.

Parameters:
  • out_dir – Path to output directory where extracted interfaces are written as .pdb files.

  • kind – Kind of interfaces to extract. The 'heavy' option leads to extracting interfaces based on the interatomic distances between heavy atoms. Specifically, if heavy atoms from different proteins are close enough (within the radius), they form an interface. The 'bsa' option extracts interfaces based on the buried residues determined by their buried surface area (BSA). Defaults to 'heavy'.

  • radius – Maximum distance in Angstroms (A) between heavy atoms from different proteins to be considered interacting. Defaults to 10.

  • expansion_radius – Expand interface by adding residues within the specified radius. Defaults to 0.

  • bsa – Calculate and write buried surface area (BSA) to output .pdb files. Setting this to True may incur a large computational overhead. Defaults to False.

  • join – If True joins dimeric interfaces into oligomeric interfaces based on shared residues. Defaults to False.

  • nest_out_dir – If True, the output .pdb files are written into subdirectories named by the middle two characters of the PDB ID. This leads to the file organization consistent with PDB. For example the A-B interaction from abcd.pdb is stored as out_dir/bc/abcd_A_B.pdb instead of out_dir/abcd_A_B.pdb. Defaults to True.

  • max_workers – Maximum number of workers to use for parallel processing. Defaults to os.cpu_count() - 2.

  • chunk_size – Number of files to process in a single worker at a time. Defaults to 1.

  • verbose – If True, print progress messages on each extraction. This option may be useful for debugging. Defaults to False.

  • noppi_files – If True, write .<pdb_id>.noppi files for PDB files that do not contain any PPIs. This option is useful in combination with PPIExtractor.extract_parallel with resume=True to avoid re-extracting PPIs from files that do not contain any. Defaults to True.

  • input_format – Format of input .pdb files based on their origin. Defaults to 'pdb' corresponding to the Protein Data Bank origin.

extract(pdb_path: Path | str, partners: Iterable[str] | None = None) None

Extract interfaces from the .pdb file and write as separate .pdb files.

The files will be named based on the input file names and the interacting chains. For example the A-B interaction from abcd.pdb will be stored as abcd_A_B.pdb. Please note that if the input file name contains underscores (_), they are replaced with dashes (-) in the output file name.

Parameters:
  • pdb_path – Path to the .pdb file to extract PPIs from.

  • partners – If not None, the interface is extracted between the specified chains. If None, all dimeric interfaces from the file are extracted. Defaults to None.
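The naming rules described above (chain suffixes, underscores replaced with dashes, and the optional nesting by the middle two characters of the PDB ID) can be summarized in a short sketch. The helper interface_path is hypothetical and only mirrors the documented examples; how the extractor treats file names that are not four-character PDB IDs is not covered here.

```python
from pathlib import Path


def interface_path(
    out_dir: Path, pdb_path: Path, chains: tuple[str, ...], nest_out_dir: bool = True
) -> Path:
    """Build the output path for an interface, following the documented examples."""
    # Underscores in the input file name are replaced with dashes.
    stem = pdb_path.stem.replace("_", "-")
    name = f"{stem}_{'_'.join(chains)}.pdb"
    if nest_out_dir:
        # Middle two characters of a four-character PDB ID, e.g. 'bc' for 'abcd'.
        return out_dir / stem[1:3] / name
    return out_dir / name


nested = interface_path(Path("out_dir"), Path("abcd.pdb"), ("A", "B"))
flat = interface_path(Path("out_dir"), Path("abcd.pdb"), ("A", "B"), nest_out_dir=False)
```

Replacing underscores in the input name keeps the underscore free to act as the separator between the file stem and the chain identifiers.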

extract_parallel(in_dir: Path | str, in_file_pattern: str | None = '.*\\.pdb$', partition: tuple[float] = (0.0, 1.0), resume: bool = True) None

Extract interfaces from all .pdb files in the directory in parallel.

Parameters:
  • in_dir – Input directory with .pdb files to extract PPIs from.

  • in_file_pattern – Regular expression pattern to match all input files in in_dir. Defaults to '.*\.pdb$'.

  • partition – Fractional partition of input files to process. For example, (0., 0.5) will process the first half of the files. This is useful when running the extraction in a data-parallel way across multiple nodes. Defaults to (0., 1.).

  • resume – If set to True, will check what .pdb files were already processed based on the resulting files in the output directory, and skip them for processing. Defaults to True.
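How in_file_pattern and partition might combine to select the files for one node is sketched below. This is an illustration under assumptions, not the extractor's code: the helper name is hypothetical, files are sorted to make the partition deterministic, the pattern is applied with re.match, and the fractional bounds are truncated to integer indices.

```python
import re


def select_partition(
    file_names: list[str],
    in_file_pattern: str = r".*\.pdb$",
    partition: tuple[float, float] = (0.0, 1.0),
) -> list[str]:
    """Filter file names by a regular expression and keep a fractional slice."""
    matched = sorted(name for name in file_names if re.match(in_file_pattern, name))
    start = int(partition[0] * len(matched))
    end = int(partition[1] * len(matched))
    return matched[start:end]


names = ["d.pdb", "a.pdb", "notes.txt", "c.pdb", "b.pdb"]
first_half = select_partition(names, partition=(0.0, 0.5))
second_half = select_partition(names, partition=(0.5, 1.0))
```

Running one job per partition, for example (0., 0.5) on one node and (0.5, 1.) on another, covers every matching file exactly once.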

ppiref.retrieval

Module for retrieving similar protein-protein interactions from large datasets.

class ppiref.retrieval.MMSeqs2PPIRetriever(db: Path | str | None = None, ppi_split: str = 'ppiref_6A_filtered', ppi_fold: str = 'whole', verbose: bool = False)

Bases: object

Retriever of similar protein-protein interactions using MMseqs2 sequence similarity search.

This class retrieves, from a larger database, the PPIs that contain sequences similar to any of the sequences involved in the query PPI.

Please follow the official MMseqs2 installation guide to install MMseqs2, and cite the original paper if you find the wrapper useful:

@article{steinegger2017mmseqs2,
    title={MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets},
    author={Steinegger, Martin and S{\"o}ding, Johannes},
    journal={Nature biotechnology},
    volume={35},
    number={11},
    pages={1026--1028},
    year={2017},
    publisher={Nature Publishing Group US New York}
}
Parameters:
  • db (Union[str, Path], optional) – MMseqs2 data base. Please see the MMseqs2 documentation on how to create a data base. Defaults to PPIRef/ppiref/ppi_6A_stats/mmseqs_db/db, which indexes all protein sequences present in the PPIRef50K (6A version) dataset, comprising all proper protein-protein interactions from PDB.

  • ppi_split (str, optional) – Split of PPI ids to consider. In combination with the ppi_fold argument, specifies the subset of PPIs to consider from the data base. Defaults to 'ppiref_6A_filtered', which corresponds to complete PPIRef50K.

  • ppi_fold (str, optional) – Subset (fold) of PPI ids to consider from the specified split. Defaults to 'whole' to consider all PPIs from the split.

  • verbose (bool, optional) – If set to True prints MMseqs2 log. Defaults to False.

query(seq: str | Path) tuple[list[float], list[str], list[str]]

Query the MMseqs2 data base for protein-protein interactions with similar sequences.

Parameters:

seq (Union[str, Path]) – Path to a query sequence in the fasta format.

Returns:

Tuple with three lists storing 3-tuples of PPI sequence similarity matches. The first list contains MMseqs2 sequence similarity scores, the second list contains ids of the corresponding matched PPI entries, and the third list contains chain names of the matched sequences from the corresponding PPI entries.

Return type:

tuple[list[float], list[str], list[str]]

ppiref.split

Functions to read and write data splits in a standardized way and single JSON format.

ppiref.split.read_fold(location: Path | str, fold: str | int, full_paths: bool = True, processed_split: dict[str, list[Path]] = None) list[Path | str]

Read a specific data fold from a data split of protein-protein interactions.

Parameters:
  • location (Union[Path, str]) –

    Source of the split file. This can be:

    • A name of the split file, which is stored as ppiref.definitions.PPIREF_SPLITS_DIR / f'{location}.json'.

    • A Path object representing the path to the JSON split file.

    • An absolute path string starting with '/' representing the path to the JSON split file.

  • fold (Union[str, int]) – Name of the fold to read. Should match one of the keys in the split dictionary. If 'whole', all PPIs are returned. If a '+'-separated string, PPIs from all specified folds are returned. If a non-negative int, a random sample of that size is returned. Defaults to 'whole'.

  • full_paths (bool, optional) – If set to True, return full paths instead of IDs. Defaults to True.

  • processed_split (dict[str, list[Path]], optional) – Pre-defined split dictionary. Defaults to None.

Returns:

Fold of PPIs represented by the list of their IDs or paths.

Return type:

list[Union[Path, str]]

ppiref.split.read_split(location: Path | str, full_paths: bool = True) dict[str, list[Path | str]]

Read data split of protein-protein interactions from a JSON file.

Parameters:
  • location (Union[Path, str]) –

    Source of the split file. This can be:

    • A name of the split file, which is stored as ppiref.definitions.PPIREF_SPLITS_DIR / f'{location}.json'.

    • A Path object representing the path to the JSON split file.

    • An absolute path string starting with '/' representing the path to the JSON split file.

  • full_paths (bool, optional) – If set to True, return full paths instead of IDs. Defaults to True.

Returns:

Dictionary of data folds. Each fold is a list of PPIs represented by their IDs or paths.

Return type:

dict[str, list[Union[Path, str]]]

ppiref.split.read_split_source(location: Path | str) Path

Read source directory containing .pdb files in a data split of protein-protein interactions.

Parameters:

location (Union[Path, str]) –

Source of the split file. This can be:

  • A name of the split file, which is stored as ppiref.definitions.PPIREF_SPLITS_DIR / f'{location}.json'.

  • A Path object representing the path to the JSON split file.

  • An absolute path string starting with '/' representing the path to the JSON split file.

Returns:

Path to the source directory containing .pdb files with PPI structures.

Return type:

Path

ppiref.split.write_split(location: Path | str, source: Path, folds: dict[str, Iterable[str | Path]], analyze: bool = True) None

Write data split of protein-protein interactions to a JSON file.

Parameters:
  • location (Union[Path, str]) –

    Destination of the split file. This can be:

    • A name of the split file, which will be stored as ppiref.definitions.PPIREF_SPLITS_DIR / f'{location}.json'.

    • A Path object representing the path to the JSON split file.

    • An absolute path string starting with '/' representing the path to the JSON split file.

  • source (Path) – Path to the source directory containing PPIs.

  • folds (dict[str, Iterable[Union[str, Path]]]) – Dictionary of folds. Each fold (e.g. 'train' or 'val') is a list of PPIs represented by their names or paths.

  • analyze (bool, optional) – If True, run simple sanity checks such as no overlapping PPI ids across the folds. If any issues are found, warnings are raised. Defaults to True.
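The kind of sanity check that analyze=True refers to, no PPI id shared between folds, can be sketched as follows. This is a minimal illustration, not the checks the package actually runs: the helper name, fold names, and ids are hypothetical, and the warning text is made up.

```python
import warnings
from itertools import combinations


def find_overlaps(folds: dict[str, list[str]]) -> dict[tuple[str, str], set[str]]:
    """Warn about, and return, PPI ids shared between any two folds."""
    overlaps = {}
    for (name0, ids0), (name1, ids1) in combinations(folds.items(), 2):
        shared = set(ids0) & set(ids1)
        if shared:
            overlaps[(name0, name1)] = shared
            warnings.warn(f"Folds '{name0}' and '{name1}' share {len(shared)} PPI id(s).")
    return overlaps


folds = {
    "train": ["1abc_A_B", "2xyz_C_D"],
    "val": ["1abc_A_B", "3def_E_F"],
    "test": ["4ghi_G_H"],
}
# Capture the warnings so the example can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    overlaps = find_overlaps(folds)
```

Raising warnings rather than errors lets a split still be written when the overlap is intentional, while making accidental leakage visible.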

ppiref.surface

Module to process protein surface properties.

class ppiref.surface.DR_SASA(path: Path | str = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/ppiref/envs/latest/lib/python3.11/site-packages/external/dr_sasa_n/build/dr_sasa'), tmp_dir: Path | str = None, verbose: bool = False, auto_clean: Literal['lazy', 'instant', 'none'] = 'lazy')

Bases: object

Wrapper for the dr_sasa software to calculate buried surface area (BSA) for PDB files.

To use the wrapper, build the C++ source code according to the original documentation and provide the path to the executable. The default value ppiref.definitions.DR_SASA_PATH assumes that dr_sasa is built in the PPIRef/external directory. The resulting directory structure may look like this:

dr_sasa_n
├── CMakeLists.txt
├── INSTALL
├── LICENSE
├── README.md
├── build
├── doc
├── examples
├── src
└── utils

6 directories, 4 files

If you find dr_sasa useful, please cite the original paper:

@article{ribeiro2019calculation,
    title={Calculation of accurate interatomic contact surface areas for the quantitative analysis of non-bonded molecular interactions},
    author={Ribeiro, Judemir and R{\'\i}os-Vera, Carlos and Melo, Francisco and Sch{\"u}ller, Andreas},
    journal={Bioinformatics},
    volume={35},
    number={18},
    pages={3499--3501},
    year={2019},
    publisher={Oxford University Press}
}
Parameters:
  • path (Union[Path, str], optional) – Path to dr_sasa executable. Defaults to ppiref.definitions.DR_SASA_PATH.

  • tmp_dir (Union[Path, str], optional) – Path to a temporary directory to store outputs. Defaults to None to create a directory with a random name.

  • verbose (bool, optional) – Print dr_sasa log. Defaults to False.

  • auto_clean (Literal['lazy', 'instant', 'none'], optional) – Strategy to clean the temporary files produced by dr_sasa. The 'lazy' strategy cleans the temporary directory on object destruction, 'instant' cleans the files immediately after the calculation, and 'none' does not clean the files. Defaults to 'lazy'.

clean(pdb_stem: str | None = None, partners: tuple[str] | None = None) None

Clean whole temporary directory or individual single-run files.

Parameters:
  • pdb_stem (Optional[str]) – PDB file stem corresponding to the files to clean. If None, cleans the whole directory.

  • partners (Optional[tuple[str]]) – Protein partners from the PDB file corresponding to the files to clean. If None, cleans the whole directory.

static parse_residue(res: str) Residue

Parse residue in the dr_sasa format into a Residue namedtuple.

Parameters:

res (str) – Residue in the dr_sasa format ('VAL/M/14A').

Returns:

Residue namedtuple (Residue(chain_id='M', residue_number=14, insertion='A')).

Return type:

Residue
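A parser consistent with the documented example can be sketched with a regular expression. This is not the method's source: only the input 'VAL/M/14A' and the resulting Residue fields come from the description above, while the handling of negative residue numbers and the empty string for a missing insertion code are assumptions.

```python
import re
from collections import namedtuple

Residue = namedtuple("Residue", ["chain_id", "residue_number", "insertion"])


def parse_residue(res: str) -> Residue:
    """Parse 'NAME/CHAIN/NUMBER[INSERTION]' into a Residue namedtuple."""
    _name, chain_id, number = res.split("/")
    # Split the last field into an integer part and an optional insertion letter.
    match = re.fullmatch(r"(-?\d+)([A-Za-z]?)", number)
    if match is None:
        raise ValueError(f"Unexpected residue format: {res!r}")
    return Residue(chain_id, int(match.group(1)), match.group(2))


parsed = parse_residue("VAL/M/14A")
```

The three-letter residue name is discarded because the namedtuple identifies a residue by chain, number, and insertion code only.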

ppiref.visualization

Module for visualizing protein-protein interactions using PyMOL.

class ppiref.visualization.PyMOL(port: int = 9123)

Bases: object

Python PyMOL wrapper to visualize protein-protein interactions. The wrapper is based on a more general wrapper implemented in the Graphein package.

Parameters:

port (int, optional) – Port to use for communication with PyMOL. Defaults to PyMOLPORT from Graphein.

display_ppi(ppi_path: Path, reuse_session: bool = False, colors: Iterable[str] = ('hotpink', 'greencyan'), residue_color_sets: Iterable[str] = ('reds+magentas', 'greens+yellows+oranges'), swap_colors: bool = False, transparency: float = 0.95, color_by_residues: bool = False, sticks: bool = False, letters: bool = True)

Display protein-protein interaction using PyMOL. If used in a Jupyter notebook, the method may also be used to show a static image of the interaction.

Parameters:
  • ppi_path (Path) – Path to a .pdb file containing a PPI. It is recommended to use a file produced by the ppiref.extraction.PPIExtractor class.

  • reuse_session (bool, optional) – If set to True, displays the PPI in the same session as the previous call. Otherwise, creates a new PyMOL session. Defaults to False.

  • colors (Iterable[str], optional) – Two colors to use for two interacting proteins. Defaults to ('hotpink', 'greencyan').

  • residue_color_sets (Iterable[str], optional) – If color_by_residues is set to True, the two provided palettes are used to color each residue type differently in the two interacting proteins. Please see PYMOL_COLOR_SETS from the same module for the list of available options. Defaults to ('reds+magentas', 'greens+yellows+oranges').

  • swap_colors (bool, optional) – Swap colors for two interacting proteins. Defaults to False.

  • transparency (float, optional) – Transparency factor from the [0, 1] range. Defaults to 0.95.

  • color_by_residues (bool, optional) – Color residues of different amino acid types with different colors. Defaults to False.

  • sticks (bool, optional) – Show amino acids in the stick representation (more detailed). Defaults to False.

  • letters (bool, optional) – Show one-letter codes of amino acid types. Defaults to True.

ppiref.utils