- PDB database
- Swiss-Prot database (AI-predicted structures)
- OMG_prot50 (protein sequences derived from the Open MetaGenomic dataset)
PDB database
download
- https://files.rcsb.org/pub/pdb/data/structures/all/pdb/
- save this page as `PDB - FTP Archive over HTTP.html` (or fetch it programmatically, as sketched below)
- extract the list of pdbXXXX.ent.gz filenames
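A minimal sketch of the fetch step, in case you prefer to skip the browser (this assumes the server returns an HTML directory index):

```python
import requests

resp = requests.get(
    "https://files.rcsb.org/pub/pdb/data/structures/all/pdb/", timeout=120)
with open('PDB - FTP Archive over HTTP.html', 'w') as fh:
    fh.write(resp.text)
```

The saved page can then be parsed to collect the IDs: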
```python
from bs4 import BeautifulSoup

# count the ent.gz files listed on the archive page
with open('PDB - FTP Archive over HTTP.html', 'r') as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, 'lxml')
text = soup.get_text('\n', '\n\n')
lines = text.split('\n')

PDB_id_list = []
for line in lines:
    if 'ent.gz' in line:
        # 'pdb100d.ent.gz' -> '100d': the ID is always the last
        # four characters of the filename stem
        PDB_id_list.append(line.split(".")[0][-4:])

# PDB_id_list[:5]
# ['100d', '101d', '101m', '102d', '102l']
```
It's not okay to extract the ID with `.split("pdb")`, because "pdb" can itself be part of the ID in some special cases, for example `pdb1pdb.ent.gz`.
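A quick demonstration of the failure mode, and of the slice used above:

```python
>>> "pdb1pdb.ent.gz".split("pdb")
['', '1', '.ent.gz']            # [1] would give '1', not '1pdb'
>>> "pdb1pdb.ent.gz".split(".")[0][-4:]
'1pdb'                          # the last four characters are the ID
```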
- parallel downloads
```python
import pandas as pd

# df holds the extracted IDs in a column named "id"
df = pd.DataFrame({"id": PDB_id_list})

# split the ID list into entry_0.txt ... entry_43.txt,
# 5,000 comma-separated IDs per file
length = 5000
count = (len(df) + length - 1) // length   # 44 groups for 218,546 IDs
for i in range(count):
    sublist = df.iloc[i * length:(i + 1) * length]["id"].tolist()
    with open(f'groups/entry_{i}.txt', 'w') as text:
        text.write(",".join(sublist))
```
Obtain the batch-download script `batch_download.sh` from RCSB's *Batch Downloads with Shell Script* page.
```bash
#!/bin/bash
# parallel_download.sh
cd /(your path)    # replace with your working directory

# launch one background batch_download.sh job per group file
for line in $(cat ./../groups/entry_list.txt); do
    ./../batch_download.sh -f ./../groups/${line} -p &
done

# equivalent while-read version:
# while IFS= read -r line; do
#     ./../batch_download.sh -f ./../groups/${line} -p &
# done < ./../groups/entry_list.txt
```
where `entry_list.txt` lists the `entry_i.txt` filenames line by line.
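A small helper to generate it, as a sketch (assuming the 44 group files created above):

```python
# groups/entry_list.txt: one entry_i.txt filename per line
with open('groups/entry_list.txt', 'w') as fh:
    fh.write("\n".join(f'entry_{i}.txt' for i in range(44)))
```

Then launch the downloads and capture the log: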
```bash
$ bash parallel_download.sh > download.log
```
Failed downloads are common when a shell script fetches many large files simultaneously, so it is important to check that the downloaded ent.gz files are complete and correct. For example, running `gunzip *.gz` reports an unzip error for every corrupted .gz file; each such file should be removed and downloaded again.
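Scripted, that check might look like the following minimal sketch (the chunked read and the delete-and-refetch policy are my assumptions, not part of the original pipeline):

```python
import gzip
from pathlib import Path

bad_files = []
for path in Path('.').glob('*.ent.gz'):
    try:
        # fully decompress in 1 MiB chunks; truncated or corrupted
        # archives raise an error before the stream ends
        with gzip.open(path, 'rb') as fh:
            while fh.read(1 << 20):
                pass
    except (OSError, EOFError):
        bad_files.append(path.name)
        path.unlink()   # remove the broken archive so it can be re-fetched

print(f'{len(bad_files)} corrupted files removed:', bad_files)
```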
description
At the time of writing, I have downloaded 218,546 PDB entries from the PDB database.
Swiss-Prot database
download
UniProt provides the reviewed Swiss-Prot database.

With the help of unipressed, I can quickly analyze the reviewed Swiss-Prot database (a query sketch follows). There are 22,140 entries with no stored 3D-structure information. Among the remaining 549,724 entries that do have 3D structures, 33,725 have structures in both the AlphaFold Protein Structure Database and the PDB; 2,194 have structures only in the PDB; and 513,805 have structures only in the AlphaFold Protein Structure Database.
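A minimal sketch of the kind of query involved, assuming unipressed's `UniprotkbClient` and the UniProt REST cross-reference fields `xref_pdb` and `xref_alphafolddb` (the classification logic here is mine, not from the original analysis):

```python
from unipressed import UniprotkbClient

both = pdb_only = af_only = none = 0
for record in UniprotkbClient.search(
    query="reviewed:true",
    fields=["accession", "xref_pdb", "xref_alphafolddb"],
).each_record():
    dbs = {x["database"] for x in record.get("uniProtKBCrossReferences", [])}
    has_pdb, has_af = "PDB" in dbs, "AlphaFoldDB" in dbs
    if has_pdb and has_af:
        both += 1
    elif has_pdb:
        pdb_only += 1
    elif has_af:
        af_only += 1
    else:
        none += 1

print(both, pdb_only, af_only, none)   # expect 33725, 2194, 513805, 22140
```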
I compared the dataset already packaged for bulk download in the AlphaFold Protein Structure Database (Fig. 2) against the 513,805 entries that UniProt records as having only an AlphaFold structure (Fig. 1). The structures of 4,274 entries in (Fig. 1) are not stored in (Fig. 2).
To minimize duplication between the UniProt Swiss-Prot set and the PDB set, I focused on downloading the structures of the 513,805 UniProt (Swiss-Prot) entries that are only available in the AlphaFold Protein Structure Database. Of these, I only need to download 4,274 entries' structures, because the remaining 509,531 can be extracted directly from (Fig. 2).
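A minimal download sketch for those 4,274 entries, assuming the AlphaFold DB file-naming pattern `AF-<accession>-F1-model_v<N>.pdb` (model version 4 assumed; `fetch_alphafold_pdb` is a hypothetical helper, not from the original):

```python
import requests

def fetch_alphafold_pdb(accession: str, version: int = 4) -> None:
    """Download one predicted structure from the AlphaFold DB file server."""
    filename = f"AF-{accession}-F1-model_v{version}.pdb"
    resp = requests.get(f"https://alphafold.ebi.ac.uk/files/{filename}",
                        timeout=60)
    resp.raise_for_status()
    with open(filename, "wb") as fh:
        fh.write(resp.content)

# e.g. fetch_alphafold_pdb("P12345")
```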

description
In UniProt, Swiss-Prot has 571,864 entries, each with a corresponding FASTA file. 549,724 entries have a 3D structure, and most of these structures (513,805) come from AlphaFold predictions.
OMG_Prot50 database
download
- https://huggingface.co/datasets/tattabio/OMG_prot50
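A minimal loading sketch with the Hugging Face `datasets` library (streaming assumed so the full dataset is not downloaded up front; the record fields are whatever the dataset defines):

```python
from datasets import load_dataset

# stream records instead of materializing the full dataset on disk
ds = load_dataset("tattabio/OMG_prot50", split="train", streaming=True)
print(next(iter(ds)))   # inspect the first record and its fields
```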
description
The OMG_prot50 dataset is a protein-only dataset created by clustering the Open MetaGenomic dataset (OMG) at 50% sequence identity. MMseqs2 linclust (Steinegger and Söding 2018) was used to cluster all 4.2B protein sequences from the OMG dataset, resulting in 207M protein sequences. Sequences were clustered at 50% sequence identity and 90% sequence coverage, and singleton clusters were removed.