1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
// Copyright (c) 2020 Tianyi Shi
//
// This software is released under the MIT License.
// https://opensource.org/licenses/MIT

use crate::common::parser::{jump_newline, parse_residue, parse_right, FieldParser};

use crate::types::{
    Anisou, Atom, AtomName, AtomSerial, Connect, Element, ModifiedAminoAcidTable,
    ModifiedNucleotideTable, ParseFw2, ParseFw4, Residue,
};
use nom::{bytes::complete::take, character::complete::anychar, combinator::map, IResult};
use std::str::from_utf8_unchecked;

/// # ATOM
///
/// ## Overview
///
/// The ATOM records present the atomic coordinates for standard amino acids and nucleotides. They
/// also present the occupancy and temperature factor for each atom. Non-polymer chemical
/// coordinates use the HETATM record type. The element symbol is always present on each ATOM
/// record; charge is optional. Changes in ATOM/HETATM records result from the standardization atom
/// and residue nomenclature. This nomenclature is described in the [Chemical Component Dictionary](ftp://ftp.wwpdb.org/pub/pdb/data/monomers).
///
/// ## Record Format
///
/// |COLUMNS        |DATA  TYPE   | FIELD       | DEFINITION                                |
/// |---------------|-------------|-------------|-------------------------------------------|
/// | 1 -  6        |Record name  | "ATOM  "    |                                           |
/// | 7 - 11        |Integer      | serial      | Atom  serial number.                      |
/// |13 - 16        |Atom         | name        | Atom name.                                |
/// |17             |Character    | altLoc      | Alternate location indicator.             |
/// |18 - 20        |Residue name | resName     | Residue name.                             |
/// |22             |Character    | chainID     | Chain identifier.                         |
/// |23 - 26        |Integer      | resSeq      | Residue sequence number.                  |
/// |27             |AChar        | iCode       | Code for insertion of residues.           |
/// |31 - 38        |Real(8.3)    | x           | Orthogonal coordinates for X in Angstroms.|
/// |39 - 46        |Real(8.3)    | y           | Orthogonal coordinates for Y in Angstroms.|
/// |47 - 54        |Real(8.3)    | z           | Orthogonal coordinates for Z in Angstroms.|
/// |55 - 60        |Real(6.2)    | occupancy   | Occupancy.                                |
/// |61 - 66        |Real(6.2)    | tempFactor  | Temperature  factor.                      |
/// |77 - 78        |LString(2)   | element     | Element symbol, right-justified.          |
/// |79 - 80        |LString(2)   | charge      | Charge  on the atom.                      |
///
/// ## Details
///
/// ATOM records for proteins are listed from amino to carboxyl terminus.
/// Nucleic acid residues are listed from the 5' to the 3' terminus.
/// Alignment of one-letter atom name such as C starts at column 14, while two-letter atom name such
/// as FE starts at column 13. Atom nomenclature begins with atom type.
/// No ordering is specified for polysaccharides.
/// Non-blank alphanumerical character is used for chain identifier.
/// The list of ATOM records in a chain is terminated by a TER record.
/// If more than one model is present in the entry, each model is delimited by MODEL and ENDMDL
/// records. AltLoc is the place holder to indicate alternate conformation. The alternate
/// conformation can be in the entire polymer chain, or several residues or partial residue (several
/// atoms within one residue). If an atom is provided in more than one position, then a non-blank
/// alternate location indicator must be used for each of the atomic positions. Within a residue,
/// all atoms that are associated with each other in a given conformation are assigned the same
/// alternate position indicator. There are two ways of representing alternate conformation- either
/// at atom level or at residue level (see examples). For atoms that are in alternate sites
/// indicated by the alternate site indicator, sorting of atoms in the ATOM/HETATM list uses the
/// following general rules:
///
/// - In the simple case that involves a few  atoms or a few residues with alternate sites, the
///   coordinates occur one after  the other in the entry.
/// - In the case of a large heterogen groups  which are disordered, the atoms for each conformer
///   are listed together.
///
/// Alphabet letters are commonly used for insertion code. The insertion code is used when two
/// residues have the same numbering. The combination of residue numbering and insertion code
/// defines the unique residue. If the depositor provides the data, then the isotropic B value is
/// given for the temperature factor. If there are neither isotropic B values from the depositor,
/// nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is used for the
/// temperature factor. Columns 79 - 80 indicate any charge on the atom, e.g., 2+, 1-. In most
/// cases, these are blank. For refinements with program REFMAC prior 5.5.0042 which use TLS
/// refinement, the values of B may include only the TLS contribution to the isotropic temperature
/// factor rather than the full isotropic value.
///
/// # HETATOM
///
/// ## Overview
///
/// http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#HETATM
///
/// Non-polymer or other “non-standard” chemical coordinates, such as water molecules or atoms presented in HET groups use the HETATM record type. They also present the occupancy and temperature factor for each atom. The ATOM records present the atomic coordinates for standard residues. The element symbol is always present on each HETATM record; charge is optional.
///
/// Changes in ATOM/HETATM records will require standardization in atom and residue nomenclature. This nomenclature is described in the Chemical Component Dictionary, ftp://ftp.wwpdb.org/pub/pdb/data/monomers.
///
/// ## Record Format
///
/// | COLUMNS | DATA  TYPE   | FIELD      | DEFINITION                       |
/// | ------- | ------------ | ---------- | -------------------------------- |
/// | 1 - 6   | Record name  | "HETATM"   |                                  |
/// | 7 - 11  | Integer      | serial     | Atom serial number.              |
/// | 13 - 16 | Atom         | name       | Atom name.                       |
/// | 17      | Character    | altLoc     | Alternate location indicator.    |
/// | 18 - 20 | Residue name | resName    | Residue name.                    |
/// | 22      | Character    | chainID    | Chain identifier.                |
/// | 23 - 26 | Integer      | resSeq     | Residue sequence number.         |
/// | 27      | AChar        | iCode      | Code for insertion of residues.  |
/// | 31 - 38 | Real(8.3)    | x          | Orthogonal coordinates for X.    |
/// | 39 - 46 | Real(8.3)    | y          | Orthogonal coordinates for Y.    |
/// | 47 - 54 | Real(8.3)    | z          | Orthogonal coordinates for Z.    |
/// | 55 - 60 | Real(6.2)    | occupancy  | Occupancy.                       |
/// | 61 - 66 | Real(6.2)    | tempFactor | Temperature factor.              |
/// | 77 - 78 | LString(2)   | element    | Element symbol; right-justified. |
/// | 79 - 80 | LString(2)   | charge     | Charge on the atom.              |
///
/// ## Details
///
/// The x, y, z coordinates are in Angstrom units.
/// No ordering is specified for polysaccharides.
/// See the HET section of this document regarding naming of heterogens. See the Chemical Component Dictionary for residue names, formulas, and topology of the HET groups that have appeared so far in the PDB (see ftp://ftp.wwpdb.org/pub/pdb/data/monomers ).
/// If the depositor provides the data, then the isotropic B value is given for the temperature factor.
/// If there are neither isotropic B values provided by the depositor, nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is used for the temperature factor.
/// Insertion codes and element naming are fully described in the ATOM section of this document.
pub struct GenericAtomParser;

impl GenericAtomParser {
    pub fn parse<'a, 'b>(
        inp: &'a [u8],
        modified_aa: &'b ModifiedAminoAcidTable,
        modified_nuc: &'b ModifiedNucleotideTable,
    ) -> IResult<&'a [u8], Atom> {
        let (inp, id) = parse_right::<AtomSerial>(inp, 5)?;
        let inp = &inp[1..];
        let (inp, name) = map(take(4usize), AtomName::parse_fw4)(inp)?;
        let (inp, id1) = anychar(inp)?;

        let (inp, residue) = parse_residue(inp, modified_aa, modified_nuc)?;

        let inp = &inp[1..];
        let (inp, chain) = anychar(inp)?;
        let (inp, sequence_number) = parse_right::<u32>(inp, 4)?;
        let (inp, insertion_code) = anychar(inp)?;
        let inp = &inp[3..];
        let (inp, x) = parse_right::<f32>(inp, 8)?;
        let (inp, y) = parse_right::<f32>(inp, 8)?;
        let (inp, z) = parse_right::<f32>(inp, 8)?;
        let (inp, occupancy) = parse_right::<f32>(inp, 6)?;
        let (inp, temperature_factor) = parse_right::<f32>(inp, 6)?;
        let inp = &inp[10..];
        let (inp, element) = map(take(2usize), Element::parse_fw2)(inp)?;
        let (inp, charge) = map(take(2usize), |x: &[u8]| match x {
            b"  " => 0,
            _ => {
                let x = unsafe { from_utf8_unchecked(x) };
                x.parse::<i8>().unwrap()
            }
        })(inp)?;
        let (inp, _) = nom::character::complete::line_ending(inp)?;
        Ok((
            inp,
            Atom {
                id,
                id1,
                name,
                residue,
                chain,
                sequence_number,
                insertion_code,
                coord: [x, y, z],
                occupancy,
                temperature_factor,
                element,
                charge,
            },
        ))
    }
}

/// # ANISOU
///
/// The [ANISOU](http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ANISOU) records present the anisotropic temperature factors.
///
/// ## Record Format
///
/// | COLUMNS | DATA  TYPE   | FIELD    | DEFINITION                       |
/// | ------- | ------------ | -------- | -------------------------------- |
/// | 1 - 6   | Record name  | "ANISOU" |                                  |
/// | 7 - 11  | Integer      | serial   | Atom serial number.              |
/// | 13 - 16 | Atom         | name     | Atom name.                       |
/// | 17      | Character    | altLoc   | Alternate location indicator     |
/// | 18 - 20 | Residue name | resName  | Residue name.                    |
/// | 22      | Character    | chainID  | Chain identifier.                |
/// | 23 - 26 | Integer      | resSeq   | Residue sequence number.         |
/// | 27      | AChar        | iCode    | Insertion code.                  |
/// | 29 - 35 | Integer      | u[0][0]  | U(1,1)                           |
/// | 36 - 42 | Integer      | u[1][1]  | U(2,2)                           |
/// | 43 - 49 | Integer      | u[2][2]  | U(3,3)                           |
/// | 50 - 56 | Integer      | u[0][1]  | U(1,2)                           |
/// | 57 - 63 | Integer      | u[0][2]  | U(1,3)                           |
/// | 64 - 70 | Integer      | u[1][2]  | U(2,3)                           |
/// | 77 - 78 | LString(2)   | element  | Element symbol, right-justified. |
/// | 79 - 80 | LString(2)   | charge   | Charge on the atom.              |
pub struct AnisouParser;

impl FieldParser for AnisouParser {
    type Output = Anisou;
    fn parse(inp: &[u8]) -> IResult<&[u8], Anisou> {
        let (inp, id) = parse_right::<AtomSerial>(inp, 5)?;
        let inp = &inp[17..]; // 12 - 28

        let (inp, u11) = parse_right::<i32>(inp, 7)?;
        let (inp, u22) = parse_right::<i32>(inp, 7)?;
        let (inp, u33) = parse_right::<i32>(inp, 7)?;
        let (inp, u12) = parse_right::<i32>(inp, 7)?;
        let (inp, u13) = parse_right::<i32>(inp, 7)?;
        let (inp, u23) = parse_right::<i32>(inp, 7)?;
        let inp = &inp[10..];
        let (inp, _) = nom::character::complete::line_ending(inp)?;
        Ok((
            inp,
            Anisou {
                id,
                u11,
                u22,
                u33,
                u12,
                u13,
                u23,
            },
        ))
    }
}

/// # Overview
///
/// The CONECT records specify connectivity between atoms for which coordinates are supplied. The connectivity is described using the atom serial number as shown in the entry. CONECT records are mandatory for HET groups (excluding water) and for other Connect not specified in the standard residue connectivity table. These records are generated automatically.
///
/// # Record Format
///
/// COLUMNS    |  DATA  TYPE    |  FIELD   |     DEFINITION
/// -----------|----------------|----------|-----------------------------------
///  1 -  6    |   Record name  |  "CONECT"|
///  7 - 11    |  Integer       | serial   |    Atom  serial number
/// 12 - 16    |   Integer      |  serial  |     Serial number of bonded atom
/// 17 - 21    |   Integer      |  serial  |     Serial  number of bonded atom
/// 22 - 26    |   Integer      |  serial  |     Serial number of bonded atom
/// 27 - 31    |   Integer      |  serial  |     Serial number of bonded atom
///
/// Details
///
/// CONECT records are present for:
///
/// - Intra-residue connectivity within  non-standard (HET) residues (excluding water).
/// - Inter-residue connectivity of HET  groups to standard groups (including water) or to other HET groups.
/// - Disulfide bridges specified in the  SSBOND records have corresponding records.
///
/// - No differentiation is made between atoms with delocalized charges (excess negative or positive charge).
/// - Atoms specified in the CONECT records have the same numbers as given in the coordinate section.
/// - All atoms connected to the atom with serial number in columns 7 - 11 are listed in the remaining fields of the record.
/// - If more than four fields are required for non-hydrogen and non-salt bridges, a second CONECT record with the same atom serial number in columns 7 - 11 will be used.
/// - These CONECT records occur in increasing order of the atom serial numbers they carry in columns 7 - 11. The target-atom serial numbers carried on these records also occur in increasing order.
/// - The connectivity list given here is redundant in that each bond indicated is given twice, once with each of the two atoms involved specified in columns 7 - 11.
/// - For hydrogen Connect, when the hydrogen atom is present in the coordinates, a CONECT record between the hydrogen atom and its acceptor atom is generated.
/// - For NMR entries, CONECT records for one model are generated describing heterogen connectivity and others for LINK records assuming that all models are homogeneous models.
pub struct ConectParser;

impl FieldParser for ConectParser {
    type Output = Vec<Connect>;
    fn parse(inp: &[u8]) -> IResult<&[u8], Self::Output> {
        let mut res = Vec::new();
        let (inp, x) = parse_right::<AtomSerial>(inp, 5)?;
        let mut last_inp = inp;
        loop {
            let (inp, y) = parse_right::<AtomSerial>(last_inp, 5)?;
            if y > x {
                res.push([x, y]);
            } else {
                res.push([y, x]);
            }
            if inp[..5] == b"     "[..] {
                break;
            }
            last_inp = inp
        }
        let (inp, _) = jump_newline(last_inp)?;
        Ok((inp, res))
    }
}