Clustering PPIs

The IDist.cluster_embeddings method clusters protein-protein interactions (PPIs) based on their iDist embeddings. Specifically, iDist uses agglomerative clustering to group PPIs so that near-duplicate PPIs never end up in different clusters. The resulting clusters can then be used to create leakage-free data splits for machine learning.
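To build intuition for why agglomerative clustering can keep near duplicates together, here is a minimal, self-contained sketch. It is not the iDist implementation: the single-linkage rule, the Euclidean distance, the toy 2-D points, and the `threshold` value are all assumptions made for illustration. Under single linkage with a distance cutoff, any two points closer than the cutoff are merged, so no pair of near duplicates can be separated.

```python
from math import dist


def threshold_clusters(points, threshold):
    """Single-linkage clustering with a distance cutoff (illustrative sketch only)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair of points closer than the threshold
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) < threshold:
                parent[find(i)] = find(j)

    # Relabel cluster roots as consecutive integers in order of first appearance
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# Hypothetical toy "embeddings": two near-duplicate pairs and one isolated point
toy_embeddings = [(0.00, 0.00), (0.01, 0.00), (1.00, 1.00), (1.00, 1.01), (5.00, 5.00)]
toy_labels = threshold_clusters(toy_embeddings, threshold=0.05)
print(toy_labels)
```

The two close pairs each share a label and the isolated point gets its own cluster, which mirrors the outcome expected for the PPIs in the example that follows.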

[2]:
from ppiref.comparison import IDist
from ppiref.utils.ppi import PPI
from ppiref.definitions import PPIREF_TEST_DATA_DIR

# Suppress Graphein log
from loguru import logger
logger.disable('graphein')

In this example, we will use 5 PPIs: (1p7z_A_C, 3p9r_B_D) and (8atd_A_C, 8atd_B_D) are two groups of near duplicates, and 10gs_A_B is a singleton with no near duplicate in the set. The aim of the clustering is to place each group of near duplicates in a single cluster.

[3]:
ppis = [
    PPIREF_TEST_DATA_DIR / 'ppi/1p7z_A_C.pdb',
    PPIREF_TEST_DATA_DIR / 'ppi/3p9r_B_D.pdb',
    PPIREF_TEST_DATA_DIR / 'ppi/8atd_A_C.pdb',
    PPIREF_TEST_DATA_DIR / 'ppi/8atd_B_D.pdb',
    PPIREF_TEST_DATA_DIR / 'ppi/10gs_A_B.pdb'
]

for ppi in ppis:
    ppi_id = ppi.stem
    swap_colors = ppi_id == '8atd_B_D'  # partners are swapped with respect to 8atd_A_C
    print(f'{ppi_id}:')
    display(PPI(ppi).visualize(swap_colors=swap_colors))
1p7z_A_C:
_images/clustering_ppis_3_2.png
3p9r_B_D:
_images/clustering_ppis_3_5.png
8atd_A_C:
_images/clustering_ppis_3_8.png
8atd_B_D:
_images/clustering_ppis_3_11.png
10gs_A_B:
_images/clustering_ppis_3_14.png
[4]:
idist = IDist(max_workers=1)

# Embed all PPIs
idist.embed_parallel(ppis)

# Cluster
cluster_labels = idist.cluster_embeddings()
Embedding PPIs (1 processes): 100%|██████████| 5/5 [00:07<00:00,  1.55s/it]

The clustering correctly grouped the near duplicates together:

[5]:
dict(zip(idist.get_embeddings().index, cluster_labels))
[5]:
{'1p7z_A_C': 2, '3p9r_B_D': 2, '8atd_A_C': 0, '8atd_B_D': 0, '10gs_A_B': 1}
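As a rough illustration of how such cluster labels can feed a leakage-free split, the sketch below assigns whole clusters to either the training or the test set, so near duplicates never straddle the two. The `split_by_cluster` helper, the `test_fraction` value, and the seed are hypothetical and not part of ppiref; only the cluster assignment is copied from the output above.

```python
import random
from collections import defaultdict

# Cluster assignment copied from the output above
ppi_to_cluster = {'1p7z_A_C': 2, '3p9r_B_D': 2, '8atd_A_C': 0, '8atd_B_D': 0, '10gs_A_B': 1}


def split_by_cluster(ppi_to_cluster, test_fraction=0.3, seed=0):
    """Hypothetical helper: split PPI ids so that each cluster lands in one set only."""
    # Group PPI ids by their cluster label
    clusters = defaultdict(list)
    for ppi_id, label in ppi_to_cluster.items():
        clusters[label].append(ppi_id)

    # Visit clusters in a reproducible random order
    labels = sorted(clusters)
    random.Random(seed).shuffle(labels)

    # Fill the test set cluster by cluster until it reaches the target size,
    # then send all remaining clusters to the training set
    n_total = len(ppi_to_cluster)
    train, test = [], []
    for label in labels:
        target = test if len(test) < test_fraction * n_total else train
        target.extend(clusters[label])
    return train, test


train_ids, test_ids = split_by_cluster(ppi_to_cluster)
print('train:', train_ids)
print('test:', test_ids)
```

Because clusters are moved as units, the realized test fraction can overshoot `test_fraction` on a dataset this small; with many clusters the sizes approach the target more closely.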