nom_pdb/coordinate.rs
// Copyright (c) 2020 Tianyi Shi
//
// This software is released under the MIT License.
// https://opensource.org/licenses/MIT

use crate::common::parser::{jump_newline, parse_residue, parse_right, FieldParser};

use crate::types::{
    Anisou, Atom, AtomName, AtomSerial, Connect, Element, ModifiedAminoAcidTable,
    ModifiedNucleotideTable, ParseFw2, ParseFw4, Residue,
};
use nom::{bytes::complete::take, character::complete::anychar, combinator::map, IResult};
use std::str::from_utf8_unchecked;

/// # ATOM
///
/// ## Overview
///
/// The ATOM records present the atomic coordinates for standard amino acids and nucleotides. They
/// also present the occupancy and temperature factor for each atom. Non-polymer chemical
/// coordinates use the HETATM record type. The element symbol is always present on each ATOM
/// record; charge is optional. Changes in ATOM/HETATM records result from the standardization of
/// atom and residue nomenclature. This nomenclature is described in the [Chemical Component Dictionary](ftp://ftp.wwpdb.org/pub/pdb/data/monomers).
///
/// ## Record Format
///
/// | COLUMNS | DATA TYPE    | FIELD      | DEFINITION                                 |
/// | ------- | ------------ | ---------- | ------------------------------------------ |
/// | 1 - 6   | Record name  | "ATOM  "   |                                            |
/// | 7 - 11  | Integer      | serial     | Atom serial number.                        |
/// | 13 - 16 | Atom         | name       | Atom name.                                 |
/// | 17      | Character    | altLoc     | Alternate location indicator.              |
/// | 18 - 20 | Residue name | resName    | Residue name.                              |
/// | 22      | Character    | chainID    | Chain identifier.                          |
/// | 23 - 26 | Integer      | resSeq     | Residue sequence number.                   |
/// | 27      | AChar        | iCode      | Code for insertion of residues.            |
/// | 31 - 38 | Real(8.3)    | x          | Orthogonal coordinates for X in Angstroms. |
/// | 39 - 46 | Real(8.3)    | y          | Orthogonal coordinates for Y in Angstroms. |
/// | 47 - 54 | Real(8.3)    | z          | Orthogonal coordinates for Z in Angstroms. |
/// | 55 - 60 | Real(6.2)    | occupancy  | Occupancy.                                 |
/// | 61 - 66 | Real(6.2)    | tempFactor | Temperature factor.                        |
/// | 77 - 78 | LString(2)   | element    | Element symbol, right-justified.           |
/// | 79 - 80 | LString(2)   | charge     | Charge on the atom.                        |
///
/// ## Details
///
/// ATOM records for proteins are listed from amino to carboxyl terminus.
/// Nucleic acid residues are listed from the 5' to the 3' terminus.
/// A one-letter atom name such as C starts at column 14, while a two-letter atom name such
/// as FE starts at column 13. Atom nomenclature begins with the atom type.
/// No ordering is specified for polysaccharides.
/// A non-blank alphanumeric character is used for the chain identifier.
/// The list of ATOM records in a chain is terminated by a TER record.
/// If more than one model is present in the entry, each model is delimited by MODEL and ENDMDL
/// records. AltLoc is the placeholder that indicates an alternate conformation. The alternate
/// conformation can cover the entire polymer chain, several residues, or part of a residue
/// (several atoms within one residue). If an atom is provided in more than one position, then a
/// non-blank alternate location indicator must be used for each of the atomic positions. Within a
/// residue, all atoms that are associated with each other in a given conformation are assigned the
/// same alternate position indicator. There are two ways of representing alternate conformations:
/// either at atom level or at residue level (see examples). For atoms that are in alternate sites
/// indicated by the alternate site indicator, sorting of atoms in the ATOM/HETATM list uses the
/// following general rules:
///
/// - In the simple case that involves a few atoms or a few residues with alternate sites, the
///   coordinates occur one after the other in the entry.
/// - In the case of large heterogen groups that are disordered, the atoms for each conformer
///   are listed together.
///
/// Letters of the alphabet are commonly used for the insertion code. The insertion code is used
/// when two residues have the same numbering. The combination of residue numbering and insertion
/// code defines the unique residue. If the depositor provides the data, then the isotropic B value
/// is given for the temperature factor. If there are neither isotropic B values from the
/// depositor, nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is used
/// for the temperature factor. Columns 79 - 80 indicate any charge on the atom, e.g., 2+, 1-. In
/// most cases, these are blank. For refinements with the program REFMAC prior to 5.5.0042 which
/// use TLS refinement, the values of B may include only the TLS contribution to the isotropic
/// temperature factor rather than the full isotropic value.
///
/// # HETATM
///
/// ## Overview
///
/// <http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#HETATM>
///
/// Non-polymer or other “non-standard” chemical coordinates, such as water molecules or atoms presented in HET groups, use the HETATM record type. They also present the occupancy and temperature factor for each atom. The ATOM records present the atomic coordinates for standard residues. The element symbol is always present on each HETATM record; charge is optional.
///
/// Changes in ATOM/HETATM records will require standardization in atom and residue nomenclature. This nomenclature is described in the Chemical Component Dictionary, <ftp://ftp.wwpdb.org/pub/pdb/data/monomers>.
///
/// ## Record Format
///
/// | COLUMNS | DATA TYPE    | FIELD      | DEFINITION                       |
/// | ------- | ------------ | ---------- | -------------------------------- |
/// | 1 - 6   | Record name  | "HETATM"   |                                  |
/// | 7 - 11  | Integer      | serial     | Atom serial number.              |
/// | 13 - 16 | Atom         | name       | Atom name.                       |
/// | 17      | Character    | altLoc     | Alternate location indicator.    |
/// | 18 - 20 | Residue name | resName    | Residue name.                    |
/// | 22      | Character    | chainID    | Chain identifier.                |
/// | 23 - 26 | Integer      | resSeq     | Residue sequence number.         |
/// | 27      | AChar        | iCode      | Code for insertion of residues.  |
/// | 31 - 38 | Real(8.3)    | x          | Orthogonal coordinates for X.    |
/// | 39 - 46 | Real(8.3)    | y          | Orthogonal coordinates for Y.    |
/// | 47 - 54 | Real(8.3)    | z          | Orthogonal coordinates for Z.    |
/// | 55 - 60 | Real(6.2)    | occupancy  | Occupancy.                       |
/// | 61 - 66 | Real(6.2)    | tempFactor | Temperature factor.              |
/// | 77 - 78 | LString(2)   | element    | Element symbol; right-justified. |
/// | 79 - 80 | LString(2)   | charge     | Charge on the atom.              |
///
/// ## Details
///
/// The x, y, z coordinates are in Angstrom units.
/// No ordering is specified for polysaccharides.
/// See the HET section of this document regarding naming of heterogens. See the Chemical Component Dictionary for residue names, formulas, and topology of the HET groups that have appeared so far in the PDB (see <ftp://ftp.wwpdb.org/pub/pdb/data/monomers>).
/// If the depositor provides the data, then the isotropic B value is given for the temperature factor.
/// If there are neither isotropic B values provided by the depositor, nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is used for the temperature factor.
/// Insertion codes and element naming are fully described in the ATOM section of this document.
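pub struct GenericAtomParser;

impl GenericAtomParser {
    /// Parses the body of an ATOM or HETATM record, starting at column 7
    /// (the record name in columns 1 - 6 is expected to be consumed already).
    pub fn parse<'a, 'b>(
        inp: &'a [u8],
        modified_aa: &'b ModifiedAminoAcidTable,
        modified_nuc: &'b ModifiedNucleotideTable,
    ) -> IResult<&'a [u8], Atom> {
        // Columns 7 - 11: atom serial number.
        let (inp, id) = parse_right::<AtomSerial>(inp, 5)?;
        // Column 12 is unused. `take` returns an error on a truncated line
        // where slicing would panic.
        let (inp, _) = take(1usize)(inp)?;
        // Columns 13 - 16: atom name; column 17: alternate location indicator.
        let (inp, name) = map(take(4usize), AtomName::parse_fw4)(inp)?;
        let (inp, id1) = anychar(inp)?;

        // Columns 18 - 20: residue name.
        let (inp, residue) = parse_residue(inp, modified_aa, modified_nuc)?;

        // Column 21 is unused; column 22 is the chain identifier.
        let (inp, _) = take(1usize)(inp)?;
        let (inp, chain) = anychar(inp)?;
        // Columns 23 - 26: residue sequence number; column 27: insertion code.
        let (inp, sequence_number) = parse_right::<u32>(inp, 4)?;
        let (inp, insertion_code) = anychar(inp)?;
        // Columns 28 - 30 are unused.
        let (inp, _) = take(3usize)(inp)?;
        // Columns 31 - 54: x, y, z; 55 - 60: occupancy; 61 - 66: temperature factor.
        let (inp, x) = parse_right::<f32>(inp, 8)?;
        let (inp, y) = parse_right::<f32>(inp, 8)?;
        let (inp, z) = parse_right::<f32>(inp, 8)?;
        let (inp, occupancy) = parse_right::<f32>(inp, 6)?;
        let (inp, temperature_factor) = parse_right::<f32>(inp, 6)?;
        // Columns 67 - 76 are unused.
        let (inp, _) = take(10usize)(inp)?;
        // Columns 77 - 78: element symbol.
        let (inp, element) = map(take(2usize), Element::parse_fw2)(inp)?;
        // Columns 79 - 80: charge, written as a digit followed by a sign
        // (e.g. "2+", "1-") and blank in most files.
        let (inp, charge) = map(take(2usize), |raw: &[u8]| match raw {
            [b' ', b' '] => 0,
            [d @ b'0'..=b'9', b'+'] => (*d - b'0') as i8,
            [d @ b'0'..=b'9', b'-'] => -((*d - b'0') as i8),
            _ if raw.is_ascii() => {
                // SAFETY: ASCII bytes are always valid UTF-8.
                let s = unsafe { from_utf8_unchecked(raw) };
                // Accept a plain signed integer such as "-1"; anything else is
                // treated as uncharged instead of panicking.
                s.trim().parse::<i8>().unwrap_or(0)
            }
            _ => 0,
        })(inp)?;
        let (inp, _) = nom::character::complete::line_ending(inp)?;
        Ok((
            inp,
            Atom {
                id,
                id1,
                name,
                residue,
                chain,
                sequence_number,
                insertion_code,
                coord: [x, y, z],
                occupancy,
                temperature_factor,
                element,
                charge,
            },
        ))
    }
}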
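
/*
The Details section above gives the charge in columns 79 - 80 as a digit followed by a sign
(e.g. 2+, 1-). The standalone sketch below decodes only that form. `parse_charge_field` is a
hypothetical helper written for illustration, not part of this crate, and returning `None` for
blank or malformed fields is an assumption rather than the crate's behaviour.

```rust
/// Decode a two-byte PDB charge field such as `2+` or `1-`.
/// Returns `None` for a blank field or anything that is not digit-then-sign.
fn parse_charge_field(field: &[u8]) -> Option<i8> {
    match field {
        // One ASCII digit followed by '+' is a positive charge.
        [d @ b'0'..=b'9', b'+'] => Some((*d - b'0') as i8),
        // One ASCII digit followed by '-' is a negative charge.
        [d @ b'0'..=b'9', b'-'] => Some(-((*d - b'0') as i8)),
        _ => None,
    }
}
```

A caller that wants the blank field to mean "uncharged" could write
`parse_charge_field(raw).unwrap_or(0)`.
*/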

/// # ANISOU
///
/// The [ANISOU](http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ANISOU) records present the anisotropic temperature factors.
///
/// ## Record Format
///
/// | COLUMNS | DATA TYPE    | FIELD    | DEFINITION                       |
/// | ------- | ------------ | -------- | -------------------------------- |
/// | 1 - 6   | Record name  | "ANISOU" |                                  |
/// | 7 - 11  | Integer      | serial   | Atom serial number.              |
/// | 13 - 16 | Atom         | name     | Atom name.                       |
/// | 17      | Character    | altLoc   | Alternate location indicator.    |
/// | 18 - 20 | Residue name | resName  | Residue name.                    |
/// | 22      | Character    | chainID  | Chain identifier.                |
/// | 23 - 26 | Integer      | resSeq   | Residue sequence number.         |
/// | 27      | AChar        | iCode    | Insertion code.                  |
/// | 29 - 35 | Integer      | u[0][0]  | U(1,1)                           |
/// | 36 - 42 | Integer      | u[1][1]  | U(2,2)                           |
/// | 43 - 49 | Integer      | u[2][2]  | U(3,3)                           |
/// | 50 - 56 | Integer      | u[0][1]  | U(1,2)                           |
/// | 57 - 63 | Integer      | u[0][2]  | U(1,3)                           |
/// | 64 - 70 | Integer      | u[1][2]  | U(2,3)                           |
/// | 77 - 78 | LString(2)   | element  | Element symbol, right-justified. |
/// | 79 - 80 | LString(2)   | charge   | Charge on the atom.              |
pub struct AnisouParser;

impl FieldParser for AnisouParser {
    type Output = Anisou;
    fn parse(inp: &[u8]) -> IResult<&[u8], Anisou> {
        // Columns 7 - 11: atom serial number.
        let (inp, id) = parse_right::<AtomSerial>(inp, 5)?;
        // Skip columns 12 - 28 (atom name through insertion code). `take`
        // returns an error on a truncated line where slicing would panic.
        let (inp, _) = take(17usize)(inp)?;

        // Columns 29 - 70: six 7-column integers in the order given in the table.
        let (inp, u11) = parse_right::<i32>(inp, 7)?;
        let (inp, u22) = parse_right::<i32>(inp, 7)?;
        let (inp, u33) = parse_right::<i32>(inp, 7)?;
        let (inp, u12) = parse_right::<i32>(inp, 7)?;
        let (inp, u13) = parse_right::<i32>(inp, 7)?;
        let (inp, u23) = parse_right::<i32>(inp, 7)?;
        // Skip columns 71 - 80; element and charge are not stored in `Anisou`.
        let (inp, _) = take(10usize)(inp)?;
        let (inp, _) = nom::character::complete::line_ending(inp)?;
        Ok((
            inp,
            Anisou {
                id,
                u11,
                u22,
                u33,
                u12,
                u13,
                u23,
            },
        ))
    }
}
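
/*
The six ANISOU integers above are the independent components of a symmetric 3x3 tensor. In the
wwPDB format they are stored scaled by 10^4, in units of Angstroms squared. The standalone sketch
below shows one way to recover the tensor. `ANISOU_SCALE` and `anisou_to_tensor` are hypothetical
names used only for illustration and are not part of this crate.

```rust
/// Factor by which the PDB format scales ANISOU values before writing them as integers.
const ANISOU_SCALE: f32 = 1.0e4;

/// Convert raw ANISOU integers, in record order (U11, U22, U33, U12, U13, U23),
/// into a symmetric 3x3 tensor in Angstroms squared.
fn anisou_to_tensor(raw: [i32; 6]) -> [[f32; 3]; 3] {
    // Undo the integer scaling for a single component.
    let unscale = |v: i32| v as f32 / ANISOU_SCALE;
    let (u11, u22, u33) = (unscale(raw[0]), unscale(raw[1]), unscale(raw[2]));
    let (u12, u13, u23) = (unscale(raw[3]), unscale(raw[4]), unscale(raw[5]));
    // Off-diagonal components are mirrored across the diagonal.
    [[u11, u12, u13], [u12, u22, u23], [u13, u23, u33]]
}
```

Keeping the parsed values as `i32` and converting on demand avoids any rounding at parse time.
*/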

/// # CONECT
///
/// ## Overview
///
/// The CONECT records specify connectivity between atoms for which coordinates are supplied. The connectivity is described using the atom serial number as shown in the entry. CONECT records are mandatory for HET groups (excluding water) and for other bonds not specified in the standard residue connectivity table. These records are generated automatically.
///
/// ## Record Format
///
/// | COLUMNS | DATA TYPE   | FIELD    | DEFINITION                    |
/// | ------- | ----------- | -------- | ----------------------------- |
/// | 1 - 6   | Record name | "CONECT" |                               |
/// | 7 - 11  | Integer     | serial   | Atom serial number.           |
/// | 12 - 16 | Integer     | serial   | Serial number of bonded atom. |
/// | 17 - 21 | Integer     | serial   | Serial number of bonded atom. |
/// | 22 - 26 | Integer     | serial   | Serial number of bonded atom. |
/// | 27 - 31 | Integer     | serial   | Serial number of bonded atom. |
///
/// ## Details
///
/// CONECT records are present for:
///
/// - Intra-residue connectivity within non-standard (HET) residues (excluding water).
/// - Inter-residue connectivity of HET groups to standard groups (including water) or to other HET groups.
/// - Disulfide bridges specified in the SSBOND records have corresponding records.
///
/// Further notes:
///
/// - No differentiation is made between atoms with delocalized charges (excess negative or positive charge).
/// - Atoms specified in the CONECT records have the same numbers as given in the coordinate section.
/// - All atoms connected to the atom with serial number in columns 7 - 11 are listed in the remaining fields of the record.
/// - If more than four fields are required for non-hydrogen and non-salt bridges, a second CONECT record with the same atom serial number in columns 7 - 11 will be used.
/// - These CONECT records occur in increasing order of the atom serial numbers they carry in columns 7 - 11. The target-atom serial numbers carried on these records also occur in increasing order.
/// - The connectivity list given here is redundant in that each bond indicated is given twice, once with each of the two atoms involved specified in columns 7 - 11.
/// - For hydrogen bonds, when the hydrogen atom is present in the coordinates, a CONECT record between the hydrogen atom and its acceptor atom is generated.
/// - For NMR entries, CONECT records for one model are generated describing heterogen connectivity and others for LINK records assuming that all models are homogeneous models.
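pub struct ConectParser;

impl FieldParser for ConectParser {
    type Output = Vec<Connect>;
    fn parse(inp: &[u8]) -> IResult<&[u8], Self::Output> {
        let mut res = Vec::new();
        // Columns 7 - 11: serial number of the atom whose bonds are listed.
        let (inp, x) = parse_right::<AtomSerial>(inp, 5)?;
        let mut last_inp = inp;
        loop {
            // Each following 5-column field holds the serial of one bonded atom.
            let (inp, y) = parse_right::<AtomSerial>(last_inp, 5)?;
            // Store every pair with the smaller serial first.
            if y > x {
                res.push([x, y]);
            } else {
                res.push([y, x]);
            }
            // Stop when the line ends (too short for another field, or a line
            // break comes next) or when the next field is blank.
            let at_line_end = inp.len() < 5 || inp[0] == b'\n' || inp[0] == b'\r';
            if at_line_end || inp[..5] == [b' '; 5][..] {
                break;
            }
            last_inp = inp
        }
        let (inp, _) = jump_newline(last_inp)?;
        Ok((inp, res))
    }
}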
282}