- PDB database
- Swiss-Prot database (AI-predicted structures)
- OMG_Prot50 (protein sequences from the Open MetaGenomic (OMG) dataset)
- malidup
- malisam
PDB database
download
- https://files.rcsb.org/pub/pdb/data/structures/all/pdb/
- save this website as PDB - FTP Archive over HTTP.html
- extract pdbXXXX.ent.gz list
from bs4 import BeautifulSoup

# Collect the ID of every ent.gz file listed in the saved page
with open('PDB - FTP Archive over HTTP.html', 'r') as file:
    html_content = file.read()
soup = BeautifulSoup(html_content, 'lxml')
text = soup.get_text('\n', '\n\n')
lines = text.split('\n')
PDB_id_list = []
for line in lines:
    if 'ent.gz' in line:
        # Slice the ID out by position; do not use .split("pdb")[1]
        PDB_id_list.append(line.split(".")[0][4:])
# PDB_id_list[:5]
# ['100d', '101d', '101m', '102d', '102l']
Do not extract the ID with .split("pdb"), because "pdb" can itself be part of the ID in some special cases, for example pdb1pdb.ent.gz.
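The pitfall is easy to reproduce with a small sketch. The helper name `pdb_id_from_filename` is hypothetical, and it assumes a bare file name of the form pdbXXXX.ent.gz with no surrounding whitespace or other text.

```python
def pdb_id_from_filename(filename: str) -> str:
    """Return the PDB ID from a bare file name such as 'pdb100d.ent.gz'.

    Assumption: the name starts with the literal prefix 'pdb' and the ID
    runs up to the first dot.
    """
    prefix = "pdb"
    if not filename.startswith(prefix):
        raise ValueError(f"unexpected file name: {filename!r}")
    stem = filename.split(".")[0]   # 'pdb1pdb.ent.gz' -> 'pdb1pdb'
    return stem[len(prefix):]       # drop only the leading prefix


# Splitting on "pdb" cuts the ID wherever "pdb" occurs again:
naive = "pdb1pdb.ent.gz".split("pdb")[1]        # '1' (truncated)
safe = pdb_id_from_filename("pdb1pdb.ent.gz")   # '1pdb'
```

Removing the prefix by position only touches the start of the name, so an ID that happens to contain "pdb" survives intact.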
- parallel downloads
# Split the IDs into entry_i.txt files of 5,000 IDs each (44 files here)
# df is assumed to be a pandas DataFrame whose "id" column holds the PDB IDs
length = 5000
count = -(-len(df) // length)  # ceiling division, so the last partial chunk is kept
for i in range(count):
    sublist = df.iloc[i * length:(i + 1) * length]["id"].tolist()
    with open('groups/entry_' + str(i) + '.txt', 'w') as text:
        text.write(",".join(sublist))
Obtain the batch-download script batch_download.sh from the RCSB PDB page "Batch Downloads with Shell Script", then run it on every group in parallel:
#!/bin/bash
cd /(your path)
for line in $(cat ./../groups/entry_list.txt); do
    ./../batch_download.sh -f ./../groups/${line} -p &
done
wait  # block until every background download has finished
where entry_list.txt lists the entry_i.txt file names, one per line.
$ bash parallel_download.sh > download.log
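The notes do not show how entry_list.txt is produced, so the following is only one possible sketch. The function name `write_entry_list` is hypothetical, and it assumes the group files are named entry_<i>.txt and sit in a single directory.

```python
import re
from pathlib import Path


def write_entry_list(groups_dir: str, list_name: str = "entry_list.txt") -> list[str]:
    """Write the names of all entry_<i>.txt files in groups_dir, one per line."""
    pattern = re.compile(r"entry_(\d+)\.txt")
    found = []
    for path in Path(groups_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:  # skips entry_list.txt itself and any unrelated files
            found.append((int(match.group(1)), path.name))
    names = [name for _, name in sorted(found)]  # numeric order: 0, 1, 2, ...
    Path(groups_dir, list_name).write_text("\n".join(names) + "\n")
    return names
```

Matching on the numeric suffix matters because entry_list.txt also fits a plain `entry_*.txt` glob and would otherwise end up listing itself.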
Failed downloads are common when the shell script fetches many large files simultaneously, so it is important to check that every ent.gz file is complete and valid. For example, running gunzip *gz prints an error for each corrupt .gz file. Remove each of those files and download it again.
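As an alternative to gunzip, the same check can be sketched in Python with the standard gzip module. The function name `find_corrupt_gz` is hypothetical, and the sketch assumes all downloads sit in one directory.

```python
import gzip
import zlib
from pathlib import Path


def find_corrupt_gz(directory: str) -> list[str]:
    """Return the names of .gz files in directory that fail to decompress fully."""
    corrupt = []
    for path in sorted(Path(directory).glob("*.gz")):
        try:
            with gzip.open(path, "rb") as handle:
                # Read to the end in 1 MiB chunks; truncated or damaged
                # files raise an exception during these reads.
                while handle.read(1 << 20):
                    pass
        except (OSError, EOFError, zlib.error):
            corrupt.append(path.name)
    return corrupt
```

Unlike gunzip, this leaves the archives compressed on disk and returns the names that need to be downloaded again.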
description
At the time of writing, I have downloaded 218,546 entries from the PDB database.
Swiss-Prot database
download
UniProt provides the reviewed Swiss-Prot database.
Unipressed API client
description
In UniProt, Swiss-Prot has 571,864 entries, each with a corresponding FASTA file. Of these, 549,724 entries have a 3D structure, and most of those structures (513,805) are AlphaFold predictions.
OMG_Prot50 database
download
huggingface.co/datasets/tattabio/OMG_prot50
description
The OMG_prot50 dataset is a protein-only dataset, created by clustering the Open MetaGenomic dataset (OMG) at 50% sequence identity. MMseqs2 linclust (Steinegger and Söding 2018) was used to cluster all 4.2B protein sequences from the OMG dataset, resulting in 207M protein sequences. Sequences were clustered at 50% sequence identity and 90% sequence coverage, and singleton clusters were removed.
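To illustrate only the singleton-removal step, here is a toy sketch. It assumes cluster assignments arrive as (representative, member) pairs, the two-column layout of an MMseqs2 cluster TSV in which a representative is also listed as a member of its own cluster. The function name and sequence IDs are hypothetical, and this is not the pipeline used to build OMG_prot50.

```python
from collections import defaultdict


def drop_singletons(pairs):
    """Group (representative, member) pairs and keep clusters with 2+ members."""
    clusters = defaultdict(list)
    for representative, member in pairs:
        clusters[representative].append(member)
    return {rep: members for rep, members in clusters.items() if len(members) > 1}


# Hypothetical assignments: seqA heads a two-member cluster, seqC is a singleton.
example_pairs = [("seqA", "seqA"), ("seqA", "seqB"), ("seqC", "seqC")]
kept = drop_singletons(example_pairs)   # {'seqA': ['seqA', 'seqB']}
```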
malidup
http://prodata.swmed.edu/malidup/
malisam
http://prodata.swmed.edu/malisam/
Transporter Classification Database (TCDB)
The Transporter Classification Database (TCDB) is a curated resource specializing in the functions and evolution of transporters from all domains of life [Saier et al., 2021].