nom_pdb/
coordinate.rs

// Copyright (c) 2020 Tianyi Shi
//
// This software is released under the MIT License.
// https://opensource.org/licenses/MIT

use crate::common::parser::{jump_newline, parse_residue, parse_right, FieldParser};

use crate::types::{
    Anisou, Atom, AtomName, AtomSerial, Connect, Element, ModifiedAminoAcidTable,
    ModifiedNucleotideTable, ParseFw2, ParseFw4, Residue,
};
use nom::{bytes::complete::take, character::complete::anychar, combinator::map, IResult};
use std::str::from_utf8_unchecked;

/// # ATOM
///
/// ## Overview
///
/// The ATOM records present the atomic coordinates for standard amino acids and nucleotides. They
/// also present the occupancy and temperature factor for each atom. Non-polymer chemical
/// coordinates use the HETATM record type. The element symbol is always present on each ATOM
/// record; charge is optional. Changes in ATOM/HETATM records result from the standardization of
/// atom and residue nomenclature. This nomenclature is described in the [Chemical Component Dictionary](ftp://ftp.wwpdb.org/pub/pdb/data/monomers).
///
/// ## Record Format
///
/// |COLUMNS        |DATA TYPE    | FIELD       | DEFINITION                                |
/// |---------------|-------------|-------------|-------------------------------------------|
/// | 1 -  6        |Record name  | "ATOM  "    |                                           |
/// | 7 - 11        |Integer      | serial      | Atom serial number.                       |
/// |13 - 16        |Atom         | name        | Atom name.                                |
/// |17             |Character    | altLoc      | Alternate location indicator.             |
/// |18 - 20        |Residue name | resName     | Residue name.                             |
/// |22             |Character    | chainID     | Chain identifier.                         |
/// |23 - 26        |Integer      | resSeq      | Residue sequence number.                  |
/// |27             |AChar        | iCode       | Code for insertion of residues.           |
/// |31 - 38        |Real(8.3)    | x           | Orthogonal coordinates for X in Angstroms.|
/// |39 - 46        |Real(8.3)    | y           | Orthogonal coordinates for Y in Angstroms.|
/// |47 - 54        |Real(8.3)    | z           | Orthogonal coordinates for Z in Angstroms.|
/// |55 - 60        |Real(6.2)    | occupancy   | Occupancy.                                |
/// |61 - 66        |Real(6.2)    | tempFactor  | Temperature factor.                       |
/// |77 - 78        |LString(2)   | element     | Element symbol, right-justified.          |
/// |79 - 80        |LString(2)   | charge      | Charge on the atom.                       |
///
/// ## Details
///
/// ATOM records for proteins are listed from amino to carboxyl terminus.
/// Nucleic acid residues are listed from the 5' to the 3' terminus.
/// A one-letter atom name such as C starts at column 14, while a two-letter atom name such
/// as FE starts at column 13. Atom nomenclature begins with atom type.
/// No ordering is specified for polysaccharides.
/// A non-blank alphanumeric character is used for the chain identifier.
/// The list of ATOM records in a chain is terminated by a TER record.
/// If more than one model is present in the entry, each model is delimited by MODEL and ENDMDL
/// records. AltLoc is the placeholder that indicates an alternate conformation. The alternate
/// conformation can cover the entire polymer chain, several residues, or part of a residue
/// (several atoms within one residue). If an atom is provided in more than one position, then a
/// non-blank alternate location indicator must be used for each of the atomic positions. Within
/// a residue, all atoms that are associated with each other in a given conformation are assigned
/// the same alternate position indicator. There are two ways of representing alternate
/// conformations: at atom level or at residue level (see examples). For atoms that are in
/// alternate sites indicated by the alternate site indicator, sorting of atoms in the ATOM/HETATM
/// list uses the following general rules:
///
/// - In the simple case that involves a few atoms or a few residues with alternate sites, the
///   coordinates occur one after the other in the entry.
/// - In the case of large heterogen groups that are disordered, the atoms for each conformer
///   are listed together.
///
/// Letters of the alphabet are commonly used for the insertion code. The insertion code is used
/// when two residues have the same numbering. The combination of residue numbering and insertion
/// code defines the unique residue. If the depositor provides the data, then the isotropic B
/// value is given for the temperature factor. If there are neither isotropic B values from the
/// depositor, nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is
/// used for the temperature factor. Columns 79 - 80 indicate any charge on the atom, e.g., 2+,
/// 1-. In most cases, these are blank. For refinements with the program REFMAC prior to 5.5.0042
/// that use TLS refinement, the values of B may include only the TLS contribution to the
/// isotropic temperature factor rather than the full isotropic value.
///
/// # HETATM
///
/// ## Overview
///
/// <http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#HETATM>
///
/// Non-polymer or other “non-standard” chemical coordinates, such as water molecules or atoms presented in HET groups, use the HETATM record type. They also present the occupancy and temperature factor for each atom. The ATOM records present the atomic coordinates for standard residues. The element symbol is always present on each HETATM record; charge is optional.
///
/// Changes in ATOM/HETATM records will require standardization in atom and residue nomenclature. This nomenclature is described in the [Chemical Component Dictionary](ftp://ftp.wwpdb.org/pub/pdb/data/monomers).
///
/// ## Record Format
///
/// | COLUMNS | DATA TYPE    | FIELD      | DEFINITION                       |
/// | ------- | ------------ | ---------- | -------------------------------- |
/// | 1 - 6   | Record name  | "HETATM"   |                                  |
/// | 7 - 11  | Integer      | serial     | Atom serial number.              |
/// | 13 - 16 | Atom         | name       | Atom name.                       |
/// | 17      | Character    | altLoc     | Alternate location indicator.    |
/// | 18 - 20 | Residue name | resName    | Residue name.                    |
/// | 22      | Character    | chainID    | Chain identifier.                |
/// | 23 - 26 | Integer      | resSeq     | Residue sequence number.         |
/// | 27      | AChar        | iCode      | Code for insertion of residues.  |
/// | 31 - 38 | Real(8.3)    | x          | Orthogonal coordinates for X.    |
/// | 39 - 46 | Real(8.3)    | y          | Orthogonal coordinates for Y.    |
/// | 47 - 54 | Real(8.3)    | z          | Orthogonal coordinates for Z.    |
/// | 55 - 60 | Real(6.2)    | occupancy  | Occupancy.                       |
/// | 61 - 66 | Real(6.2)    | tempFactor | Temperature factor.              |
/// | 77 - 78 | LString(2)   | element    | Element symbol; right-justified. |
/// | 79 - 80 | LString(2)   | charge     | Charge on the atom.              |
///
/// ## Details
///
/// The x, y, z coordinates are in Angstrom units.
/// No ordering is specified for polysaccharides.
/// See the HET section of the format specification regarding naming of heterogens. See the Chemical Component Dictionary for residue names, formulas, and topology of the HET groups that have appeared so far in the PDB (see <ftp://ftp.wwpdb.org/pub/pdb/data/monomers>).
/// If the depositor provides the data, then the isotropic B value is given for the temperature factor.
/// If there are neither isotropic B values provided by the depositor, nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is used for the temperature factor.
/// Insertion codes and element naming are fully described in the ATOM section above.
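///
/// ## Example
///
/// The fixed-column layout in the tables above can be explored with plain string slicing. The
/// sketch below is illustrative only and is not how `GenericAtomParser` works: `field` and
/// `sample_atom_line` are hypothetical helpers that are not part of this crate, and the sample
/// values are made up.
///
/// ```rust
/// // Returns columns `start..=end` (1-based and inclusive, as in the tables above), trimmed.
/// // Assumes `start >= 1`; yields "" when the line is too short.
/// fn field(line: &str, start: usize, end: usize) -> &str {
///     line.get(start - 1..end).unwrap_or("").trim()
/// }
///
/// // Builds an 80-column ATOM line from made-up values, one format spec per field.
/// fn sample_atom_line() -> String {
///     format!(
///         "{:<6}{:>5} {:<4}{}{:>3} {}{:>4}{}   {:>8.3}{:>8.3}{:>8.3}{:>6.2}{:>6.2}{:10}{:>2}{:>2}",
///         "ATOM", 1, " N", ' ', "MET", 'A', 1, ' ', 27.340, 24.430, 2.614, 1.00, 9.67, "", "N", ""
///     )
/// }
///
/// fn main() {
///     let line = sample_atom_line();
///     assert_eq!(line.len(), 80);
///     let serial: u32 = field(&line, 7, 11).parse().unwrap();
///     let name = field(&line, 13, 16);
///     let x: f32 = field(&line, 31, 38).parse().unwrap();
///     let element = field(&line, 77, 78);
///     println!("serial={} name={} x={} element={}", serial, name, x, element);
/// }
/// ```
///
/// Slicing by column rather than splitting on whitespace matters here because adjacent fields
/// may touch (for example a four-character atom name followed by an altLoc character).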
pub struct GenericAtomParser;

impl GenericAtomParser {
    pub fn parse<'a, 'b>(
        inp: &'a [u8],
        modified_aa: &'b ModifiedAminoAcidTable,
        modified_nuc: &'b ModifiedNucleotideTable,
    ) -> IResult<&'a [u8], Atom> {
        // `inp` is expected to start at column 7, right after the 6-character record name.
        let (inp, id) = parse_right::<AtomSerial>(inp, 5)?; // columns 7 - 11: serial
        let inp = &inp[1..]; // column 12 is unused
        let (inp, name) = map(take(4usize), AtomName::parse_fw4)(inp)?; // columns 13 - 16
        let (inp, id1) = anychar(inp)?; // column 17: altLoc

        let (inp, residue) = parse_residue(inp, modified_aa, modified_nuc)?; // columns 18 - 20

        let inp = &inp[1..]; // column 21 is unused
        let (inp, chain) = anychar(inp)?; // column 22
        let (inp, sequence_number) = parse_right::<u32>(inp, 4)?; // columns 23 - 26
        let (inp, insertion_code) = anychar(inp)?; // column 27
        let inp = &inp[3..]; // columns 28 - 30 are unused
        let (inp, x) = parse_right::<f32>(inp, 8)?;
        let (inp, y) = parse_right::<f32>(inp, 8)?;
        let (inp, z) = parse_right::<f32>(inp, 8)?;
        let (inp, occupancy) = parse_right::<f32>(inp, 6)?;
        let (inp, temperature_factor) = parse_right::<f32>(inp, 6)?;
        let inp = &inp[10..]; // columns 67 - 76 are unused
        let (inp, element) = map(take(2usize), Element::parse_fw2)(inp)?;
        let (inp, charge) = map(take(2usize), |x: &[u8]| match x {
            [b' ', b' '] => 0,
            // The format writes the digit first and the sign second, e.g. "2+" or "1-".
            [d @ b'0'..=b'9', b'+'] => (d - b'0') as i8,
            [d @ b'0'..=b'9', b'-'] => -((d - b'0') as i8),
            _ => {
                // Sign-first or bare forms such as "+2"; still panics on anything else.
                let x = unsafe { from_utf8_unchecked(x) };
                x.parse::<i8>().unwrap()
            }
        })(inp)?;
        let (inp, _) = nom::character::complete::line_ending(inp)?;
        Ok((
            inp,
            Atom {
                id,
                id1,
                name,
                residue,
                chain,
                sequence_number,
                insertion_code,
                coord: [x, y, z],
                occupancy,
                temperature_factor,
                element,
                charge,
            },
        ))
    }
}

/// # ANISOU
///
/// The [ANISOU](http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ANISOU) records present the anisotropic temperature factors.
///
/// ## Record Format
///
/// | COLUMNS | DATA TYPE    | FIELD    | DEFINITION                       |
/// | ------- | ------------ | -------- | -------------------------------- |
/// | 1 - 6   | Record name  | "ANISOU" |                                  |
/// | 7 - 11  | Integer      | serial   | Atom serial number.              |
/// | 13 - 16 | Atom         | name     | Atom name.                       |
/// | 17      | Character    | altLoc   | Alternate location indicator.    |
/// | 18 - 20 | Residue name | resName  | Residue name.                    |
/// | 22      | Character    | chainID  | Chain identifier.                |
/// | 23 - 26 | Integer      | resSeq   | Residue sequence number.         |
/// | 27      | AChar        | iCode    | Insertion code.                  |
/// | 29 - 35 | Integer      | u[0][0]  | U(1,1)                           |
/// | 36 - 42 | Integer      | u[1][1]  | U(2,2)                           |
/// | 43 - 49 | Integer      | u[2][2]  | U(3,3)                           |
/// | 50 - 56 | Integer      | u[0][1]  | U(1,2)                           |
/// | 57 - 63 | Integer      | u[0][2]  | U(1,3)                           |
/// | 64 - 70 | Integer      | u[1][2]  | U(2,3)                           |
/// | 77 - 78 | LString(2)   | element  | Element symbol, right-justified. |
/// | 79 - 80 | LString(2)   | charge   | Charge on the atom.              |
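///
/// ## Example
///
/// The six integers can be turned back into a symmetric tensor. The sketch below assumes the
/// scaling given in the wwPDB format description (values stored as integers scaled by 10^4, in
/// square Angstroms) and the common convention B_eq = 8 * pi^2 * trace(U) / 3. The functions
/// `anisou_to_matrix` and `b_equivalent` are hypothetical and not part of this crate, and the
/// sample values are made up.
///
/// ```rust
/// // Expands [U11, U22, U33, U12, U13, U23] (the column order of the table above)
/// // into a symmetric 3x3 matrix in square Angstroms.
/// fn anisou_to_matrix(u: [i32; 6]) -> [[f64; 3]; 3] {
///     // Undo the assumed 10^4 scaling.
///     let s = |v: i32| f64::from(v) * 1e-4;
///     let [u11, u22, u33, u12, u13, u23] = u;
///     [
///         [s(u11), s(u12), s(u13)],
///         [s(u12), s(u22), s(u23)],
///         [s(u13), s(u23), s(u33)],
///     ]
/// }
///
/// // Equivalent isotropic B factor: 8 * pi^2 times the mean of the diagonal.
/// fn b_equivalent(m: &[[f64; 3]; 3]) -> f64 {
///     let u_eq = (m[0][0] + m[1][1] + m[2][2]) / 3.0;
///     8.0 * std::f64::consts::PI.powi(2) * u_eq
/// }
///
/// fn main() {
///     let m = anisou_to_matrix([2406, 1892, 1614, 198, 519, -328]);
///     println!("U = {:?}", m);
///     println!("B_eq = {:.2}", b_equivalent(&m));
/// }
/// ```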
pub struct AnisouParser;

impl FieldParser for AnisouParser {
    type Output = Anisou;
    fn parse(inp: &[u8]) -> IResult<&[u8], Anisou> {
        let (inp, id) = parse_right::<AtomSerial>(inp, 5)?;
        let inp = &inp[17..]; // 12 - 28

        let (inp, u11) = parse_right::<i32>(inp, 7)?;
        let (inp, u22) = parse_right::<i32>(inp, 7)?;
        let (inp, u33) = parse_right::<i32>(inp, 7)?;
        let (inp, u12) = parse_right::<i32>(inp, 7)?;
        let (inp, u13) = parse_right::<i32>(inp, 7)?;
        let (inp, u23) = parse_right::<i32>(inp, 7)?;
        let inp = &inp[10..];
        let (inp, _) = nom::character::complete::line_ending(inp)?;
        Ok((
            inp,
            Anisou {
                id,
                u11,
                u22,
                u33,
                u12,
                u13,
                u23,
            },
        ))
    }
}

/// # Overview
///
/// The CONECT records specify connectivity between atoms for which coordinates are supplied. The connectivity is described using the atom serial number as shown in the entry. CONECT records are mandatory for HET groups (excluding water) and for other bonds not specified in the standard residue connectivity table. These records are generated automatically.
///
/// # Record Format
///
/// | COLUMNS | DATA TYPE    | FIELD    | DEFINITION                       |
/// | ------- | ------------ | -------- | -------------------------------- |
/// | 1 - 6   | Record name  | "CONECT" |                                  |
/// | 7 - 11  | Integer      | serial   | Atom serial number.              |
/// | 12 - 16 | Integer      | serial   | Serial number of bonded atom.    |
/// | 17 - 21 | Integer      | serial   | Serial number of bonded atom.    |
/// | 22 - 26 | Integer      | serial   | Serial number of bonded atom.    |
/// | 27 - 31 | Integer      | serial   | Serial number of bonded atom.    |
///
/// # Details
///
/// CONECT records are present for:
///
/// - Intra-residue connectivity within non-standard (HET) residues (excluding water).
/// - Inter-residue connectivity of HET groups to standard groups (including water) or to other HET groups.
/// - Disulfide bridges specified in the SSBOND records have corresponding records.
///
/// - No differentiation is made between atoms with delocalized charges (excess negative or positive charge).
/// - Atoms specified in the CONECT records have the same numbers as given in the coordinate section.
/// - All atoms connected to the atom with serial number in columns 7 - 11 are listed in the remaining fields of the record.
/// - If more than four fields are required for non-hydrogen and non-salt bridge bonds, a second CONECT record with the same atom serial number in columns 7 - 11 will be used.
/// - These CONECT records occur in increasing order of the atom serial numbers they carry in columns 7 - 11. The target-atom serial numbers carried on these records also occur in increasing order.
/// - The connectivity list given here is redundant in that each bond indicated is given twice, once with each of the two atoms involved specified in columns 7 - 11.
/// - For hydrogen bonds, when the hydrogen atom is present in the coordinates, a CONECT record between the hydrogen atom and its acceptor atom is generated.
/// - For NMR entries, CONECT records for one model are generated describing heterogen connectivity and others for LINK records assuming that all models are homogeneous models.
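///
/// # Example
///
/// Because each bond is listed twice, a consumer usually wants to normalize and deduplicate the
/// pairs. The sketch below shows one way to do that with the standard library only. It is not
/// the implementation used by `ConectParser`: `conect_pairs` and `unique_bonds` are hypothetical
/// helpers, and the sample serial numbers are made up.
///
/// ```rust
/// use std::collections::BTreeSet;
///
/// // Extracts the bonds from one CONECT line, smaller serial number first.
/// // Blank, missing, or non-numeric fields are skipped.
/// fn conect_pairs(line: &str) -> Vec<[u32; 2]> {
///     // Reads the 5-column field that starts at 0-based byte offset `start`.
///     let serial = |start: usize| -> Option<u32> {
///         line.get(start..start + 5)?.trim().parse().ok()
///     };
///     // Columns 7 - 11 hold the atom whose bonds are listed.
///     let origin = match serial(6) {
///         Some(v) => v,
///         None => return Vec::new(),
///     };
///     // Columns 12 - 16, 17 - 21, 22 - 26 and 27 - 31 hold the bonded atoms.
///     [11usize, 16, 21, 26]
///         .iter()
///         .filter_map(|&start| serial(start))
///         .map(|other| if origin < other { [origin, other] } else { [other, origin] })
///         .collect()
/// }
///
/// // Collects the bonds of several CONECT lines into a sorted set without duplicates.
/// fn unique_bonds(lines: &[&str]) -> BTreeSet<[u32; 2]> {
///     lines.iter().flat_map(|l| conect_pairs(l)).collect()
/// }
///
/// fn main() {
///     let lines = ["CONECT  413  412  414", "CONECT  412  413"];
///     for bond in unique_bonds(&lines) {
///         println!("{} - {}", bond[0], bond[1]);
///     }
/// }
/// ```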
pub struct ConectParser;

impl FieldParser for ConectParser {
    type Output = Vec<Connect>;
    fn parse(inp: &[u8]) -> IResult<&[u8], Self::Output> {
        let mut res = Vec::new();
        // Columns 7 - 11: serial number of the atom whose bonds this record lists.
        let (inp, x) = parse_right::<AtomSerial>(inp, 5)?;
        let mut last_inp = inp;
        loop {
            let (inp, y) = parse_right::<AtomSerial>(last_inp, 5)?;
            // Store each bond with the smaller serial number first.
            if y > x {
                res.push([x, y]);
            } else {
                res.push([y, x]);
            }
            // Stop at a blank field or at the end of an unpadded line. The length check also
            // keeps the slice below in bounds.
            if inp.len() < 5
                || matches!(inp.first(), Some(b'\n') | Some(b'\r'))
                || inp[..5] == b"     "[..]
            {
                break;
            }
            last_inp = inp;
        }
        let (inp, _) = jump_newline(last_inp)?;
        Ok((inp, res))
    }
}
280        Ok((inp, res))
281    }
282}