{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Retrieving PPIs\n", "\n", "The package lets you search the Protein Data Bank (PDB) for protein-protein interactions (PPIs) similar to your query PPI. The search can be based on either the interface structure or the protein sequence of interest." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from ppiref.comparison import IDist\n", "from ppiref.retrieval import MMSeqs2PPIRetriever\n", "from ppiref.definitions import PPIREF_DATA_DIR, PPIREF_TEST_DATA_DIR\n", "import pandas as pd\n", "\n", "# Suppress Graphein log\n", "from loguru import logger\n", "logger.disable('graphein')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we use the near-duplicate homooligomeric PPIs that involve different sequences (taken from Figure 3 of the paper [\"Revealing data leakage in protein interaction benchmarks\"](https://arxiv.org/abs/2404.10457)). We search the PDB for PPIs similar to one of the entries (1k3f), aiming to retrieve another one (1k9s).\n", "\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Fast search requires precomputed data: iDist embeddings for interface search and MMseqs2 database for sequence search. Thereofore, we download the `ppiref_6A_stats.zip` first." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Downloading: 100%|██████████| 3.10G/3.10G [05:03<00:00, 10.2MiB/s]\n", "Extracting: 100%|██████████| 15/15 [00:29<00:00, 1.98s/files]\n" ] } ], "source": [ "from ppiref.utils.misc import download_from_zenodo\n", "download_from_zenodo('ppi_6A_stats.zip')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## By similar interface structure\n", "\n", "One can find PPI interfaces in the PDB that are structurally similar to the query PPI. This can be done using the precomputed iDist embeddings. Under the hood, iDist will build an `sklearn` index for all the PPI embeddings and use it to find the neighbors of the query embedding, in the near-duplicate radius (0.04 by default, which is validated for 6A interfaces)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>PPI</th><th>iDist</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1k3f_C_E</td><td>0.000000</td></tr>\n",
"    <tr><th>1</th><td>1k3f_A_D</td><td>0.019316</td></tr>\n",
"    <tr><th>2</th><td>1u1g_C_D</td><td>0.029032</td></tr>\n",
"    <tr><th>3</th><td>1sj9_A_F</td><td>0.029668</td></tr>\n",
"    <tr><th>4</th><td>8a7d_C_Q</td><td>0.029722</td></tr>\n",
"    <tr><th>5</th><td>5efo_A_B</td><td>0.029956</td></tr>\n",
"    <tr><th>6</th><td>2hrd_A_F</td><td>0.030052</td></tr>\n",
"    <tr><th>7</th><td>1sj9_B_D</td><td>0.030148</td></tr>\n",
"    <tr><th>8</th><td>1u1e_C_D</td><td>0.030332</td></tr>\n",
"    <tr><th>9</th><td>1u1d_C_D</td><td>0.030373</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " PPI iDist\n", "0 1k3f_C_E 0.000000\n", "1 1k3f_A_D 0.019316\n", "2 1u1g_C_D 0.029032\n", "3 1sj9_A_F 0.029668\n", "4 8a7d_C_Q 0.029722\n", "5 5efo_A_B 0.029956\n", "6 2hrd_A_F 0.030052\n", "7 1sj9_B_D 0.030148\n", "8 1u1e_C_D 0.030332\n", "9 1u1d_C_D 0.030373" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Initialize IDist and read embeddings for all PPI interfaces in PPIRef (i.e., all PPIs in PDB)\n", "idist = IDist()\n", "idist.read_embeddings(PPIREF_DATA_DIR / 'ppiref/ppi_6A_stats/idist_emb.csv', dropna=True)\n", "\n", "# Embed your query PPI interface\n", "ppi_dir = PPIREF_TEST_DATA_DIR / 'ppi_dir'\n", "query_ppi_path = ppi_dir / 'k3/1k3f_C_E.pdb'\n", "query_embedding = idist.embed(query_ppi_path, store=False)\n", "\n", "# Query for 10 most similar PPIs\n", "dists, ppi_ids = idist.query(query_embedding)\n", "df_idist = pd.DataFrame({'PPI': ppi_ids, 'iDist': dists}).head(10)\n", "df_idist" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "iDist enables to retrieve 1k9s as a near duplicate of 1k3f by the interface structure." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'1k9s' in [x.split('_')[0] for x in ppi_ids]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## By similar sequence\n", "\n", "One can also find PPIs in PDB that involve sequences similar to the one of interest. This can be done using the prepared [MMseqs2](https://github.com/soedinglab/mmseqs2) database. Install MMseqs2 according to the official documentation and then you can use the wrapper as below. Under the hood, the wrapper will use the `mmseqs2 easy-search` with default parameters." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>PPI</th><th>Sequence similarity</th><th>Chain</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1u1c_A_B</td><td>1.0</td><td>A</td></tr>\n",
"    <tr><th>1</th><td>1u1c_A_C</td><td>1.0</td><td>A</td></tr>\n",
"    <tr><th>2</th><td>1rxs_M_m</td><td>1.0</td><td>m</td></tr>\n",
"    <tr><th>3</th><td>1rxs_N_m</td><td>1.0</td><td>m</td></tr>\n",
"    <tr><th>4</th><td>1rxu_E_F</td><td>1.0</td><td>F</td></tr>\n",
"    <tr><th>5</th><td>1rxu_A_F</td><td>1.0</td><td>F</td></tr>\n",
"    <tr><th>6</th><td>1u1e_C_D</td><td>1.0</td><td>D</td></tr>\n",
"    <tr><th>7</th><td>1u1e_D_E</td><td>1.0</td><td>D</td></tr>\n",
"    <tr><th>8</th><td>1rxs_M_o</td><td>1.0</td><td>o</td></tr>\n",
"    <tr><th>9</th><td>1rxs_O_o</td><td>1.0</td><td>o</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " PPI Sequnce similarity Chain\n", "0 1u1c_A_B 1.0 A\n", "1 1u1c_A_C 1.0 A\n", "2 1rxs_M_m 1.0 m\n", "3 1rxs_N_m 1.0 m\n", "4 1rxu_E_F 1.0 F\n", "5 1rxu_A_F 1.0 F\n", "6 1u1e_C_D 1.0 D\n", "7 1u1e_D_E 1.0 D\n", "8 1rxs_M_o 1.0 o\n", "9 1rxs_O_o 1.0 o" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Initialize the wrapper for MMseqs2 database to store all sequences from PPIRef\n", "mmseqs2 = MMSeqs2PPIRetriever(PPIREF_DATA_DIR / 'ppiref/ppi_6A_stats/mmseqs_db/db')\n", "\n", "# Prepare your fasta file (for example by downloading from Uniprot)\n", "query_path = PPIREF_TEST_DATA_DIR / 'misc/1k3f.fasta'\n", "\n", "# Query the MMseqs2 database for 10 PPIs involving sequences most similar to the query sequence\n", "# (returns triples (PPI id, sequence similarity, partner similar to query sequence))\n", "seq_sims, ppi_ids, partners = mmseqs2.query(query_path)\n", "df_mmseqs2 = pd.DataFrame({'PPI': ppi_ids, 'Sequnce similarity': seq_sims, 'Chain': partners})\n", "df_mmseqs2.head(10)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Since 1k3f and 1k9s share low sequence identity, the sequence search is not able to retrieve 1k9s as a near duplicate of 1k3f." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'1k9s' in [x.split('_')[0] for x in ppi_ids]" ] } ], "metadata": { "kernelspec": { "display_name": "ppiref", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }