Skip to content

Commit b18bec3

Browse files
authored
Parse secondary structure annotations from mmCIF (#65)
Unless `run_dssp` or `run_stride`, this will parse any secondary structure annotations in the mmCIF file. --- Co-authored by: Joe Greener <jgreener@hotmail.co.uk>
1 parent 58456ce commit b18bec3

File tree

3 files changed

+86
-22
lines changed

3 files changed

+86
-22
lines changed

docs/src/documentation.md

Lines changed: 21 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -286,7 +286,7 @@ The operators currently supported are:
286286
| Operators | Acts on |
287287
| :-------------------------- | :---------------------------------- |
288288
| `=`, `>`, `<` `>=`, `<=` | Properties with real values. |
289-
| `and`, `or`, `not` | Selections subsets. |
289+
| `and`, `or`, `not` | Selections subsets. |
290290

291291
The keywords supported are:
292292

@@ -541,13 +541,31 @@ So if you wanted the graph of chain contacts in a protein complex you could give
541541

542542
## Assigning secondary structure
543543

544-
The secondary structure code of a residue or atom can be accessed after assigning the secondary structure using [DSSP](https://github.com/PDB-REDO/dssp) or [STRIDE](https://webclu.bio.wzw.tum.de/stride).
544+
Any secondary structure assignment at a residue is accessed via [`sscode`](@ref):
545+
```julia
546+
sscode(res)
547+
sscode(at)
548+
```
549+
Or for the whole structure:
550+
```julia
551+
sscode.(collectresidues(struc))
552+
```
553+
`sscode` may return `'-'`, indicating either that the residue has no recognized secondary structure or that it has not yet been assigned.
554+
The secondary structure code of a residue can be changed using [`sscode!`](@ref).
555+
556+
mmCIF files with secondary structure [annotations](https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Items/_struct_conf_type.id.html) will, by default, parse and assign secondary structure.
557+
558+
Secondary structure can also be assigned using [DSSP](https://github.com/PDB-REDO/dssp) or [STRIDE](https://webclu.bio.wzw.tum.de/stride).
559+
This functionality is provided through extensions, so you must load the corresponding package library (`DSSP_jll` or `STRIDE_jll`) before this functionality will work.
560+
545561
To assign secondary structure when reading the structure:
546562
```julia
547563
# Assign secondary structure using DSSP
564+
using DSSP_jll
548565
read("/path/to/pdb/file.pdb", PDBFormat, run_dssp=true)
549566

550567
# Assign secondary structure using STRIDE
568+
using STRIDE_jll
551569
read("/path/to/pdb/file.pdb", PDBFormat, run_stride=true)
552570
```
553571
[`rundssp!`](@ref), [`runstride!`](@ref), [`rundssp`](@ref) and [`runstride`](@ref) can also be used to assign secondary structure to a [`MolecularStructure`](@ref) or [`Model`](@ref):
@@ -557,21 +575,12 @@ runstride!(struc)
557575
```
558576
The assignment process may fail if the structure is too large, since we use an intermediate PDB file where the atom serial cannot exceed 99999 and the chain ID must be a single character.
559577
To get access to the secondary structure code of a residue or atom as a `Char`:
560-
```julia
561-
sscode(res)
562-
sscode(at)
563-
```
564-
Or for the whole structure:
565-
```julia
566-
sscode.(collectresidues(struc))
567-
```
568-
The secondary structure code of a residue can be changed using [`sscode!`](@ref).
569578
[`rundssp`](@ref) and [`runstride`](@ref) can also be run directly on structure files:
570579
```julia
571580
rundssp("/path/to/pdb/file.pdb", "out.dssp") # Also works with mmCIF files
572581
runstride("/path/to/pdb/file.pdb", "out.stride")
573582
```
574-
See the documentation for [DSSP_jll](https://docs.juliahub.com/General/DSSP_jll/stable/autodocs) and [STRIDE_jll](https://docs.juliahub.com/General/STRIDE_jll/stable/autodocs), and also [ProteinSecondaryStructures.jl](https://github.com/m3g/ProteinSecondaryStructures.jl), for other ways to run these programs.
583+
See the documentation for [DSSP_jll](https://docs.juliahub.com/General/DSSP_jll/stable/autodocs) and [STRIDE_jll](https://docs.juliahub.com/General/STRIDE_jll/stable/autodocs), and also [ProteinSecondaryStructures.jl](https://github.com/BioJulia/ProteinSecondaryStructures.jl), for other ways to run these programs.
575584

576585
## Downloading PDB files
577586

src/mmcif.jl

Lines changed: 54 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,24 @@ const mmciforder = Dict(
4141
]
4242
)
4343

44+
const cif_sscodes = Dict(
45+
"HELX_LH_AL_P" => 'H',
46+
"HELX_P" => 'H',
47+
"STRN" => 'E',
48+
"HELX_LH_3T_P" => 'G',
49+
"HELX_LH_PI_P" => 'I',
50+
"HELX_LH_PP_P" => 'P',
51+
"TURN_OT_P" => 'T',
52+
"TURN_P" => 'T',
53+
"TURN_TY1P_P" => 'T',
54+
"TURN_TY2P_P" => 'T',
55+
"TURN_TY3P_P" => 'T',
56+
"TURN_TY1_P" => 'T',
57+
"TURN_TY2_P" => 'T',
58+
"TURN_TY3_P" => 'T',
59+
"BEND" => 'S',
60+
)
61+
4462
"""
4563
MMCIFDict(filepath; gzip=false)
4664
MMCIFDict(io; gzip=false)
@@ -150,17 +168,23 @@ function tokenizecif(f::IO)
150168
return tokens
151169
end
152170

153-
# Get tokens from a mmCIF file corresponding to atom site records only
171+
# Get tokens from a mmCIF file corresponding to atom_site or struct_conf records only
154172
# This will fail if there is only a single atom record in the file
155173
# and it is not in the loop format
156174
function tokenizecifstructure(f::IO)
157175
tokens = String[]
158176
reading = false
159177
in_keys = true
178+
category_groups = ["_atom_site.", "_struct_conf."]
160179
for line in eachline(f)
161-
if (!reading && !startswith(line, "_atom_site.")) || startswith(line, "#")
180+
startswith(line, "#") && continue
181+
iscat = any(k -> startswith(line, k), category_groups)
182+
if !reading && !iscat
162183
continue
163-
elseif startswith(line, "_atom_site.")
184+
elseif iscat
185+
if !reading
186+
push!(tokens, "loop_")
187+
end
164188
reading = true
165189
push!(tokens, rstrip(line))
166190
elseif startswith(line, ";")
@@ -175,14 +199,14 @@ function tokenizecifstructure(f::IO)
175199
end
176200
push!(tokens, join(token_buffer, "\n"))
177201
elseif (!in_keys && startswith(line, "_")) || startswith(line, "loop_") ||
178-
startswith(line, "LOOP_")
179-
break
202+
startswith(line, "LOOP_")
203+
reading = false
204+
in_keys = true
180205
else
181206
in_keys = false
182207
append!(tokens, splitline(line))
183208
end
184209
end
185-
length(tokens) > 0 ? pushfirst!(tokens, "loop_") : nothing
186210
return tokens
187211
end
188212

@@ -322,6 +346,30 @@ function MolecularStructure(mmcif_dict::MMCIFDict;
322346
fixlists!(struc)
323347
end
324348

349+
# Secondary structure assignment
350+
if !(run_dssp | run_stride) && haskey(mmcif_dict, "_struct_conf.conf_type_id")
351+
for (i, id) in pairs(mmcif_dict["_struct_conf.conf_type_id"])
352+
chainid = get(mmcif_dict, "_struct_conf.beg_auth_asym_id", nothing)[i]
353+
chainidend = get(mmcif_dict, "_struct_conf.end_auth_asym_id", nothing)[i]
354+
if chainid === nothing
355+
chainid = mmcif_dict["_struct_conf.beg_label_asym_id"][i]
356+
chainidend = mmcif_dict["_struct_conf.end_label_asym_id"][i]
357+
end
358+
chainidend == chainid || continue # mismatch in chain id
359+
haskey(chains(struc), chainid) || continue
360+
chain = struc[chainid]
361+
bg = tryparse(Int, mmcif_dict["_struct_conf.beg_auth_seq_id"][i])
362+
nd = tryparse(Int, mmcif_dict["_struct_conf.end_auth_seq_id"][i])
363+
(bg === nothing || nd === nothing) && continue
364+
code = get(cif_sscodes, id, '-')
365+
for j in bg:nd
366+
r = get(residues(chain), string(j), nothing)
367+
isa(r, Residue) || continue
368+
r.ss_code = code
369+
end
370+
end
371+
end
372+
325373
if run_dssp && run_stride
326374
throw(ArgumentError("run_dssp and run_stride cannot both be true"))
327375
end

test/runtests.jl

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1787,6 +1787,7 @@ end
17871787
@test dic["_pdbx_database_status.recvd_initial_deposition_date"] == ["1991-11-08"]
17881788
@test dic["_audit_author.name"] == ["Mueller, C.W.", "Schulz, G.E."]
17891789
@test length(dic["_atom_site.group_PDB"]) == 3816
1790+
@test length(dic["_struct_conf.conf_type_id"]) == 18
17901791
dic["_pdbx_database_status.recvd_initial_deposition_date"] = ["changed"]
17911792
@test dic["_pdbx_database_status.recvd_initial_deposition_date"] == ["changed"]
17921793
@test length(keys(dic)) == 610
@@ -1942,10 +1943,14 @@ end
19421943

19431944
open(testfilepath("mmCIF", "1AKE.cif")) do f
19441945
tokens = tokenizecifstructure(f)
1945-
@test length(tokens) == 80158
1946-
@test tokens[1] == "loop_"
1947-
@test tokens[10] == "_atom_site.label_seq_id"
1948-
@test tokens[80089] == "77.69"
1946+
loopidxs = findall(==("loop_"), tokens)
1947+
@test length(loopidxs) == 2
1948+
idxsc = loopidxs[1]
1949+
@test tokens[idxsc + 1] == "_struct_conf.conf_type_id"
1950+
idxas = loopidxs[2]
1951+
@test length(tokens) - idxas + 1 == 80158
1952+
@test tokens[idxas + 9] == "_atom_site.label_seq_id"
1953+
@test tokens[idxas + 80088] == "77.69"
19491954
end
19501955

19511956
# Test parsing empty file
@@ -1996,6 +2001,8 @@ end
19962001
res = collect(chs[1])
19972002
@test length(res) == 456
19982003
@test resid(res[20]) == "20"
2004+
@test sscode(res[11]) == '-'
2005+
@test sscode(res[12]) == 'H'
19992006
ats = collect(res[20])
20002007
@test length(ats) == 8
20012008
@test atomname.(ats) == ["N", "CA", "C", "O", "CB", "CG1", "CG2", "CD1"]

0 commit comments

Comments
 (0)