************
Introduction
************
==========================================
Objective
==========================================
The aim of this task was to retrieve and superimpose PDB structures using
open source tools and services, and to demonstrate the functionality using CDK2
(UniProt: P24941) as an example.
==========================================
Solution
==========================================
The solution to this problem required two parts:

#. Search and retrieve data from the PDB. This was implemented using urllib from the Python standard library to wrap the RCSB PDB's web services.
#. Superimpose the retrieved data. This was implemented using BioPython's PDB parsing and superimposition functionality.
Although this is a small demo task, I have tried to demonstrate coding practices that I
would employ in larger projects:
* I've used a "tests first" approach.
* I've set up continuous integration with CircleCI.
* I've configured documentation with Sphinx.
******************
Test the Code
******************
1. Clone this repository
.. code-block:: shell
git clone https://github.com/prcurran/pdb_superimposer.git
2. Create a new environment
.. code-block:: shell
conda env create -f environment.yml
3. Install this package
.. code-block:: shell
conda activate super
pip install .
4. Run the CDK2 example
.. code-block:: shell
python example.py
5. Remove environment and clean up
.. code-block:: shell
conda env remove --name super
rm -rf pdb_superimposer
******************
CDK2 Example
******************
This section runs through the code for the CDK2 example.
=========================
Step 1: Search the PDB
=========================
The first step is to search the PDB. This is done using the RCSB Search API, a RESTful
service that accepts URL-encoded JSON payloads over HTTP.
The full documentation for the query syntax can be found `here `_.
For this example, we use three of the request building blocks:

* `query`: this is where the search expression is constructed. For this example only the UniProt accession is used, but other criteria (such as structure resolution) could also have been included.
* `request_options`: this controls various aspects of the search request; the pagination was altered to ensure all results were returned (the default is only the first 10).
* `return_type`: this specifies the type of identifier returned; `polymer_entity` was used rather than `entry`, since a single entry can contain multiple polymer entities.
.. code-block:: JSON
{
"query": {
"type": "group",
"logical_operator": "and",
"nodes": [
{
"type": "terminal",
"service": "text",
"parameters": {
"operator": "exact_match",
"value": "P24941",
"attribute": "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession"
}
},
{
"type": "terminal",
"service": "text",
"parameters": {
"operator": "exact_match",
"value": "UniProt",
"attribute": "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_name"
}
}
]
},
"request_options": {
"pager": {
"start": 0,
"rows": 500
}
},
"return_type": "polymer_entity"
}
The query was read from file and `urllib `_ was used to send the request:

.. code-block:: python

    search_query_path = "example_search.json"
    with open(search_query_path, "r") as r:
        query = r.read()  # keep query as str; it is sent verbatim as the request payload
    results = pdb_search_query(query)
    # result identifiers have the form "<PDB ID>_<entity number>"; map each
    # PDB ID to a zero-based polymer entity index
    pdb_entities = {i["identifier"].split("_")[0]: int(i["identifier"].split("_")[1]) - 1
                    for i in results["result_set"]}
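The `pdb_search_query` helper is part of this package and is not reproduced above. A minimal
sketch of what such a urllib wrapper might look like is given below; the endpoint URL
(`https://search.rcsb.org/rcsbsearch/v2/query`) and the exact function signature are assumptions,
not the package's definitive implementation.

.. code-block:: python

    import json
    import urllib.request

    def pdb_search_query(query: str) -> dict:
        # Illustrative sketch only: POST the JSON query string to the RCSB Search API
        # and return the parsed JSON response. The endpoint version is an assumption.
        url = "https://search.rcsb.org/rcsbsearch/v2/query"
        request = urllib.request.Request(
            url,
            data=query.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
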
=================================
Step 2: Download the Search Hits
=================================
Next, the data is downloaded. This is done using the `RCSB FTP service `_.
Since there are over 400 entries for CDK2, `multiprocessing `_ has been used to download
the files in parallel. I've also incorporated a `tqdm `_
progress bar so that the user can see that something is happening.

.. code-block:: python

    def wrap_ftp_download(inputs):
        # simple wrapper to unpack the (pdb, out_dir) tuple passed by imap_unordered
        pdb, out_dir = inputs
        return ftp_download(pdb, out_dir)

    args = ((a, out_dir) for a in pdb_entities.keys())
    with Pool(processes=processes) as pool:
        # tqdm wraps the iterator so a progress bar is shown as downloads complete
        list(tqdm(pool.imap_unordered(wrap_ftp_download, args), total=len(pdb_entities)))
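The `ftp_download` helper itself is not shown here. As a rough, illustrative sketch (not the
package's actual implementation), an equivalent download step could be written with urllib,
assuming the files are fetched from the RCSB download service at `https://files.rcsb.org/download/`:

.. code-block:: python

    import os
    import urllib.request

    def ftp_download(pdb: str, out_dir: str) -> str:
        # Illustrative sketch only: fetch a single PDB entry over HTTPS and save it
        # as <out_dir>/<PDB ID>.pdb. The URL scheme is an assumption.
        url = f"https://files.rcsb.org/download/{pdb}.pdb"
        destination = os.path.join(out_dir, f"{pdb}.pdb")
        urllib.request.urlretrieve(url, destination)
        return destination
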
=================================
Step 3: The Superimposition
=================================
Finally, the downloaded files can be superimposed. `BioPython's `_
:class:`Bio.PDB.Superimposer.Superimposer` contains
the functionality to do the actual transformation, minimising the RMS of the fit. However,
the selection of the atoms to be considered in the superimposition had to be implemented in this
package. Each residue in a PDB file carries a sequence identifier which corresponds to that residue's
position when aligned with the UniProt reference sequence. I used these identifiers to ensure that:

* In both the `reference` and `other` chain there is a residue at a given index (some are missing, some are expression tags)
* At a given index, the residue in `reference` and `other` is the same (some are mutated)
* For a given residue, all atoms are present (some are only partially modelled)
* For a given residue, only heavy atoms are considered

I also included functionality to select only binding-site residues, since this is a useful operation
in some use cases. As this task required ALL CDK2 structures, and some structures don't have bound
ligands, it wasn't used here.
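The residue and atom selection rules above are implemented in this package's `ChainSuperimposer`.
As a simplified, illustrative sketch (not the package's actual code), the atom pairing fed to
`Bio.PDB.Superimposer` could look roughly like this, restricted here to the heavy backbone atoms
for brevity:

.. code-block:: python

    from Bio.PDB import Superimposer

    def superimpose_chain(ref_chain, other_chain, other_structure):
        # Illustrative sketch only; ChainSuperimposer applies the selection rules
        # described above in full.
        ref_atoms, other_atoms = [], []
        # index standard residues by their sequence identifier
        ref_residues = {r.id[1]: r for r in ref_chain if r.id[0] == " "}
        other_residues = {r.id[1]: r for r in other_chain if r.id[0] == " "}
        # only indices present in both chains are considered
        for seq_id in sorted(set(ref_residues) & set(other_residues)):
            ref_res, other_res = ref_residues[seq_id], other_residues[seq_id]
            if ref_res.get_resname() != other_res.get_resname():
                continue  # skip mutated positions
            for name in ("N", "CA", "C", "O"):  # heavy backbone atoms
                if name in ref_res and name in other_res:
                    ref_atoms.append(ref_res[name])
                    other_atoms.append(other_res[name])
        sup = Superimposer()
        sup.set_atoms(ref_atoms, other_atoms)   # fixed atoms, moving atoms
        sup.apply(other_structure.get_atoms())  # transform the whole structure
        return sup.rms

The loop that superimposes each downloaded structure onto the reference is shown below; the
output file name in the final line is filled in here only as an example.
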
.. code-block:: python

    # the entity index from the search results is used to pick the relevant chain
    reference = Helper.protein_from_file(ref_id, os.path.join(out_dir, f"{ref_id}.pdb"))
    ref_chain = [c for c in reference[0]][pdb_entities[ref_id]]
    for pdb, entity in pdb_entities.items():
        other = Helper.protein_from_file(pdb, os.path.join(out_dir, f"{pdb}.pdb"))
        other_chain = [c for c in other[0]][entity]
        cs = ChainSuperimposer(reference=ref_chain, other=other_chain, other_struc=other)
        cs.superimpose()
        # write out the transformed structure (output file name chosen as an example)
        Helper.protein_to_file(other, os.path.join(out_dir, f"{pdb}_superimposed.pdb"))