{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Retrieving PPIs\n",
    "\n",
    "The package lets you search the Protein Data Bank (PDB) for protein-protein interactions (PPIs) similar to a query PPI. The search can be based on either the interface structure or the protein sequence of interest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from ppiref.comparison import IDist\n",
    "from ppiref.retrieval import MMSeqs2PPIRetriever\n",
    "from ppiref.definitions import PPIREF_DATA_DIR, PPIREF_TEST_DATA_DIR\n",
    "import pandas as pd\n",
    "\n",
    "# Suppress Graphein log\n",
    "from loguru import logger\n",
    "logger.disable('graphein')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we use a pair of near-duplicate homooligomeric PPIs that involve different sequences (taken from Figure 3 of the paper [\"Revealing data leakage in protein interaction benchmarks\"](https://arxiv.org/abs/2404.10457)). We search the PDB for PPIs similar to one of the entries (1k3f), aiming to retrieve the other one (1k9s).\n",
    "\n",
    "<p align=\"center\">\n",
    "  <img width=\"350\" src=\"./_static/images/1k3f_1k9s.png\"/>\n",
    "</p>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fast search requires precomputed data: iDist embeddings for the interface search and an MMseqs2 database for the sequence search. Therefore, we first download `ppi_6A_stats.zip`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Downloading: 100%|██████████| 3.10G/3.10G [05:03<00:00, 10.2MiB/s]\n",
      "Extracting: 100%|██████████| 15/15 [00:29<00:00,  1.98s/files]\n"
     ]
    }
   ],
   "source": [
    "from ppiref.utils.misc import download_from_zenodo\n",
    "download_from_zenodo('ppi_6A_stats.zip')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## By similar interface structure\n",
    "\n",
    "One can find PPI interfaces in the PDB that are structurally similar to the query PPI. This is done using the precomputed iDist embeddings. Under the hood, iDist builds an `sklearn` index over all PPI embeddings and uses it to find the neighbors of the query embedding within the near-duplicate radius (0.04 by default, which is validated for 6A interfaces)."
   ]
  },
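  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the radius query described above, the sketch below uses a brute-force search in plain Python as a stand-in for the `sklearn` index. The Euclidean metric, the `radius_query` helper, and the toy ids and 2D vectors are assumptions made for this example; they are not part of the package API.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "NEAR_DUPLICATE_RADIUS = 0.04  # default radius quoted above\n",
    "\n",
    "\n",
    "def radius_query(query, embeddings, radius=NEAR_DUPLICATE_RADIUS):\n",
    "    '''Return (distance, id) pairs within `radius` of `query`, closest first.'''\n",
    "    # Assumption: Euclidean distance between embedding vectors\n",
    "    hits = [(math.dist(query, emb), ppi_id) for ppi_id, emb in embeddings.items()]\n",
    "    return sorted(hit for hit in hits if hit[0] <= radius)\n",
    "\n",
    "\n",
    "# Toy 2D vectors with made-up ids (not real iDist embeddings)\n",
    "toy_embeddings = {\n",
    "    'toy_a': (0.00, 0.00),\n",
    "    'toy_b': (0.01, 0.02),\n",
    "    'toy_c': (0.03, 0.00),\n",
    "    'toy_d': (0.50, 0.50),\n",
    "}\n",
    "neighbors = radius_query((0.0, 0.0), toy_embeddings)\n",
    "print([ppi_id for _, ppi_id in neighbors])\n",
    "```\n",
    "\n",
    "Only the points within the radius are returned, closest first, which mirrors how near duplicates of a query interface are collected."
   ]
  },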
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PPI</th>\n",
       "      <th>iDist</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1k3f_C_E</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1k3f_A_D</td>\n",
       "      <td>0.019316</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1u1g_C_D</td>\n",
       "      <td>0.029032</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1sj9_A_F</td>\n",
       "      <td>0.029668</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>8a7d_C_Q</td>\n",
       "      <td>0.029722</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5efo_A_B</td>\n",
       "      <td>0.029956</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2hrd_A_F</td>\n",
       "      <td>0.030052</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1sj9_B_D</td>\n",
       "      <td>0.030148</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1u1e_C_D</td>\n",
       "      <td>0.030332</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1u1d_C_D</td>\n",
       "      <td>0.030373</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        PPI     iDist\n",
       "0  1k3f_C_E  0.000000\n",
       "1  1k3f_A_D  0.019316\n",
       "2  1u1g_C_D  0.029032\n",
       "3  1sj9_A_F  0.029668\n",
       "4  8a7d_C_Q  0.029722\n",
       "5  5efo_A_B  0.029956\n",
       "6  2hrd_A_F  0.030052\n",
       "7  1sj9_B_D  0.030148\n",
       "8  1u1e_C_D  0.030332\n",
       "9  1u1d_C_D  0.030373"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Initialize IDist and read embeddings for all PPI interfaces in PPIRef (i.e., all PPIs in PDB)\n",
    "idist = IDist()\n",
    "idist.read_embeddings(PPIREF_DATA_DIR / 'ppiref/ppi_6A_stats/idist_emb.csv', dropna=True)\n",
    "\n",
    "# Embed your query PPI interface\n",
    "ppi_dir = PPIREF_TEST_DATA_DIR / 'ppi_dir'\n",
    "query_ppi_path = ppi_dir / 'k3/1k3f_C_E.pdb'\n",
    "query_embedding = idist.embed(query_ppi_path, store=False)\n",
    "\n",
    "# Query for near-duplicate PPIs and keep the first 10 for display\n",
    "dists, ppi_ids = idist.query(query_embedding)\n",
    "df_idist = pd.DataFrame({'PPI': ppi_ids, 'iDist': dists}).head(10)\n",
    "df_idist"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "iDist retrieves 1k9s as a near duplicate of 1k3f based on the interface structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'1k9s' in [x.split('_')[0] for x in ppi_ids]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## By similar sequence\n",
    "\n",
    "One can also find PPIs in the PDB that involve sequences similar to the one of interest. This is done using the prepared [MMseqs2](https://github.com/soedinglab/mmseqs2) database. Install MMseqs2 according to the official documentation, then use the wrapper as in the code below. Under the hood, the wrapper runs `mmseqs easy-search` with default parameters."
   ]
  },
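  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a default-parameter `mmseqs easy-search` call takes a query FASTA file, a target database, an output file, and a temporary directory. The sketch below assembles such a command; the helper names and file names are hypothetical, and the wrapper used in the next cell may invoke MMseqs2 differently.\n",
    "\n",
    "```python\n",
    "import shutil\n",
    "import subprocess\n",
    "\n",
    "\n",
    "def build_easy_search_cmd(query_fasta, target_db, result_path, tmp_dir):\n",
    "    '''Assemble `mmseqs easy-search <query> <target> <result> <tmp>`.'''\n",
    "    return ['mmseqs', 'easy-search', str(query_fasta), str(target_db), str(result_path), str(tmp_dir)]\n",
    "\n",
    "\n",
    "def run_easy_search(cmd):\n",
    "    '''Run the command, failing early if the MMseqs2 binary is not on PATH.'''\n",
    "    if shutil.which(cmd[0]) is None:\n",
    "        raise FileNotFoundError('MMseqs2 binary not found on PATH')\n",
    "    subprocess.run(cmd, check=True)\n",
    "\n",
    "\n",
    "# Hypothetical file names, used only for illustration\n",
    "cmd = build_easy_search_cmd('query.fasta', 'mmseqs_db/db', 'hits.m8', 'tmp')\n",
    "print(' '.join(cmd))\n",
    "```\n",
    "\n",
    "Calling `run_easy_search(cmd)` would start the search once MMseqs2 is installed and the input files exist."
   ]
  },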
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>PPI</th>\n",
       "      <th>Sequence similarity</th>\n",
       "      <th>Chain</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1u1c_A_B</td>\n",
       "      <td>1.0</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1u1c_A_C</td>\n",
       "      <td>1.0</td>\n",
       "      <td>A</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1rxs_M_m</td>\n",
       "      <td>1.0</td>\n",
       "      <td>m</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1rxs_N_m</td>\n",
       "      <td>1.0</td>\n",
       "      <td>m</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1rxu_E_F</td>\n",
       "      <td>1.0</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1rxu_A_F</td>\n",
       "      <td>1.0</td>\n",
       "      <td>F</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1u1e_C_D</td>\n",
       "      <td>1.0</td>\n",
       "      <td>D</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1u1e_D_E</td>\n",
       "      <td>1.0</td>\n",
       "      <td>D</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1rxs_M_o</td>\n",
       "      <td>1.0</td>\n",
       "      <td>o</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1rxs_O_o</td>\n",
       "      <td>1.0</td>\n",
       "      <td>o</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        PPI  Sequence similarity Chain\n",
       "0  1u1c_A_B                  1.0     A\n",
       "1  1u1c_A_C                  1.0     A\n",
       "2  1rxs_M_m                  1.0     m\n",
       "3  1rxs_N_m                  1.0     m\n",
       "4  1rxu_E_F                  1.0     F\n",
       "5  1rxu_A_F                  1.0     F\n",
       "6  1u1e_C_D                  1.0     D\n",
       "7  1u1e_D_E                  1.0     D\n",
       "8  1rxs_M_o                  1.0     o\n",
       "9  1rxs_O_o                  1.0     o"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Initialize the wrapper around the MMseqs2 database that stores all sequences from PPIRef\n",
    "mmseqs2 = MMSeqs2PPIRetriever(PPIREF_DATA_DIR / 'ppiref/ppi_6A_stats/mmseqs_db/db')\n",
    "\n",
    "# Prepare your FASTA file (for example, by downloading it from UniProt)\n",
    "query_path = PPIREF_TEST_DATA_DIR / 'misc/1k3f.fasta'\n",
    "\n",
    "# Query the MMseqs2 database for PPIs involving sequences most similar to the query sequence\n",
    "# (returns sequence similarities, PPI ids, and the partner chains similar to the query sequence)\n",
    "seq_sims, ppi_ids, partners = mmseqs2.query(query_path)\n",
    "df_mmseqs2 = pd.DataFrame({'PPI': ppi_ids, 'Sequence similarity': seq_sims, 'Chain': partners})\n",
    "df_mmseqs2.head(10)"
   ]
  },
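  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hits from the two searches can be compared at the level of PDB entries by stripping the chain identifiers from the PPI ids. The sketch below does this for a few ids copied from the two result tables above; the `pdb_ids` helper is an assumption made for this example.\n",
    "\n",
    "```python\n",
    "def pdb_ids(ppi_ids):\n",
    "    '''Map PPI ids of the form <pdb>_<chain1>_<chain2> to the set of their PDB ids.'''\n",
    "    return {ppi_id.split('_')[0] for ppi_id in ppi_ids}\n",
    "\n",
    "\n",
    "# Small subsets of the hits shown in the tables above\n",
    "structure_hits = ['1k3f_C_E', '1u1g_C_D', '1u1e_C_D']\n",
    "sequence_hits = ['1u1c_A_B', '1u1e_C_D', '1rxs_M_m']\n",
    "\n",
    "# PDB entries found only by the interface search, and those found by both searches\n",
    "only_structure = pdb_ids(structure_hits) - pdb_ids(sequence_hits)\n",
    "shared = pdb_ids(structure_hits) & pdb_ids(sequence_hits)\n",
    "print(sorted(only_structure), sorted(shared))\n",
    "```\n",
    "\n",
    "Working with sets of PDB ids avoids counting several chain pairs of the same entry as separate hits."
   ]
  },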
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since 1k3f and 1k9s share low sequence identity, the sequence search is not able to retrieve 1k9s as a near duplicate of 1k3f."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'1k9s' in [x.split('_')[0] for x in ppi_ids]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ppiref",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}