hts-sys 2.2.0

This library provides HTSlib bindings.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
# HTS API

## HTSLib APIs and samtools

HTSLib is a C library implementation used to access and process the genome
sequence data. HTSLib implements multiple API interfaces, HTS API, VCF API and
SAM API.  HTS API provides a framework for use by other APIs and applications,
implements bgzf compression, htscodecs and provides CRAM format support.  VCF
APIs work with variant data in VCF and BCF format.

SAM API works with sequence data of different formats, SAM / BAM / CRAM /
FASTA / FASTQ, and provides methods to do operations on the data. It uses
methods from HTS API.

'samtools' is the utility used to read and modify sequence data. It uses SAM
APIs from HTSLib to work on the sequence data.


## About this document

There are a number of demonstration utilities and their source code in
'samples' directory of HTSLib and this document gives the description of them
and the usage of API of HTSLib.  The samples are for demonstration
purposes only and proper error handling is required for actual usage. This
document is based on HTSLib version 1.17.

Updates to this document may be made along with later releases when required.


## The sample apps

Flags - This application showcases the basic read of alignment files and flag
access. It reads and shows the count of read1 and read2 alignments.

Split - This application showcases the basic read and write of alignment data.
It saves the read1 and read2 as separate files in given directory, one as sam
and other as bam.

Split2 - This application showcases the output file format selection. It saves
the read1 and read2 as separate files in given directory, both as compressed
sam though the extensions are different.

Cram - This application showcases the different way in which cram reference
data is used for cram output creation.

Read_fast - This application showcases the fasta/fastq data read.

Read_header - This application showcases the read and access of header data.
It can show all header line of given type, data of a given tag on a specific
header line or for all lines of given type.

Read_ref - This application showcases the read and access of header data.
It shows all reference names which has length equal or greater to given input.

Read_bam - This application showcases read of different alignment data fields.
It shows contents of each alignment.

Read_aux - This application showcases read of specific auxiliary tag data in
alignment. It shows the data retrieved using 2 APIs, one as a string with tag
data and other as raw data alternatively.

Dump_aux - This application showcases read of all auxiliary tag data one by one
in an alignment. It shows the data retrieved.

Add_header - This application showcases the write of header lines to a file.
It adds header line of types, SQ, RG, PG and CO and writes to standard output.

Remove_header - This application showcases removal of header line from a file.
It removes either all header lines of given type or one specific line of given
type with given unique identifier. Modified header is written on standard
output.

Update_header - This application shows the update of header line fields, where
update is allowed. It takes the header line type, unique identifier for the
line, tag to be modified and the new value. Updated data is written on standard
output.

Mod_bam - This application showcases the update of alignment data. It takes
alignment name, position of field to be modified and new value of it.
Modified data is written on standard output.

Mod_aux - This application showcases the update of auxiliary data in alignment.
It takes alignment name, tag to be modified, its type and new value. Modified
data is written on standard output.

Mod_aux_ba - This application showcases the update of auxiliary array data in
alignment. It adds count of ATCGN base as an array in auxiliary data, BA:I.
Modified data is written on standard output.

Write_fast - This application showcases the fasta/fastq data write. It appends
a dummy data to given file.

Index_write - This application showcases the creation of index along with
output creation. Based on file type and shift, it creates bai, csi or crai
files.

Read_reg - This application showcases the usage of region specification in
alignment read.

Read_multireg - This application showcases the usage of mulitple regionn
specification in alignment read.

Pileup - This application showcases the pileup api, where all alignments
covering a reference position are accessed together. It displays the bases
covering each position on standard output.

Mpileup - This application showcases the mpileup api, which supports multiple
input files for pileup and gives a side by side view of them in pileup format.
It displays the bases covering each position on standard output.

Modstate - This application showcases the access of base modifications in
alignment. It shows the modifications present in an alignment and accesses them
using available APIs. There are 2 APIs and which one to be used can be selected
through input.

Pileup_mod - This application showcases the base modification access in pileup
mode. It shows the pileup display with base modifications.

Flags_field - This application showcases the read of selected fields alone,
reducing the overhead / increasing the performance. It reads the flag field
alone and shows the count of read1 and read2. This has impact only on CRAM
files.

Split_thread1 - This application showcases the use of threads in file handling.
It saves the read1 and read2 as separate files in given directory, one as sam
and other as bam. 2 threads are used for read and 1 each dedicated for each
output file.

Split_thread2 - This application showcases the use of thread pool in file
handling. It saves the read1 and read2 as separate files in given directory,
one as sam and other as bam. A pool of 4 threads is created and shared for both
read and write.


## Building the sample apps

The samples expect the HTSLib is installed, libraries and header file path are
part of the PATH environment variable. If not, these paths need to be explicitly
passed during the build time.

Gcc and compatible compilers can be used to build the samples.

These applications can be linked statically or dynamically to HTSLib.
For static linking, along with htslib other libraries and/or headers required
to build are, math, pthread, curl, lzma, z and bz2 libraries.

A makefile is available along with source files which links statically to
htslib. To use dynamic linking, update the makefile's 'LDFLAGS' and 'rpath'
path. The 'rpath' path to be set as the path to lib directory of htslib
installation.


## Usage of HTS APIs
### Sequence data file access for read

The sequence data file for read may be opened using the sam_open method. It
opens the file and returns samFile (htsFile) pointer on success or NULL on
failure.  The input can be path to a file in disk, network, cloud or '-'
designating the standard input.

SAM, BAM and CRAM file formats are supported and the input file format is
detected from the file content.

Once done with the file, it needs to be closed with sam_close.

Many times, header details would be required and can be read using
sam_hdr_read api.  It returns sam_hdr_t pointer or NULL. The returned header
needs to be destroyed using sam_hdr_destroy when no longer required.

The sequence data may be compressed or uncompressed on disk and on memory it
is read and kept as uncompressed BAM format. It can be read from a file using
sam_read1 api. samFile pointer, header and bam storage are to be passed as
argument and it returns 0 on success, -1 on end of file and < -1 in case of
errors.

The bam storage has to be initialised using bam_init1 api before the call and
can be reused for successive reads. Once done, it needs to be destroyed using
bam_destroy1.  The member field named core - bam1_core_t - in bam storage,
bam1_t, has the sequence data in an easily accessible way. Using the fields
and macros, data can easily be read from it.

    #include <htslib/sam.h>

    int main(int argc, char *argv[])
    {
        ...
        //initialize
        if (!(bamdata = bam_init1())) {
        ...
        //open input files - r reading
        if (!(infile = sam_open(inname, "r"))) {
        ...
        //read header
        if (!(in_samhdr = sam_hdr_read(infile))) {
        ...
        //read data, check flags and update count
        while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
            if (bamdata->core.flag & BAM_FREAD1) {
                cntread1++;
            }
        ...
        //clean up
        if (in_samhdr) {
            sam_hdr_destroy(in_samhdr);
        }
        if (infile) {
            sam_close(infile);
        }
        if (bamdata) {
            bam_destroy1(bamdata);
        }
        return ret;
    }
Refer: flags_demo.c

This shows the count of read1 and read2 alignments.

    ./flags /tmp/sample.sam.gz

To read CRAM files, reference data is required and if it is not available, based
on configuration, library may try to download it from external repositories.


### Sequence data file access for write

File access for write is similar to read with a few additional optional steps.

The output file can be opened using sam_open api as in read, with "w" instead
of "r" as mode. This opens the file for writing and uses mode to select the
output file type. "w" alone denotes SAM, "wb" denotes BAM and "wc" denotes CRAM.

Another way is to use sam_open_mode method, which sets the output file type and
compression based on the file name and explicit textual format specification.
This method expects a buffer to append type and compression flags. Usually a
buffer with standard file open flag is used, the buffer past the flag is passed
to the method to ensure existing flags and updates from this method are present
in the same buffer without being overwritten. This method will add more flags
indicating file type and compression based on name. If explicit format detail
given, then extension is ignored and the explicit specification is used. This
updated buffer can be used with sam_open to select the file format.

sam_open_format method may also be used to open the file for output as more
information on the output file can be specified using this. Can use
mode buffer from sam_open_mode api or explicit format structure for this.

The header data can be written using the sam_hdr_write api. When the header
data is copied to another variable and has different lifetime, it is good to
increase the reference count of the header using sam_hdr_incr_ref and
sam_hdr_destroy called as many times as required.

The alignment data can be written using the sam_write1 api. It takes a samFile
pointer, header pointer and the alignment data. The header data is required to
set the reference name in the alignment. It returns -ve value on error.

    int main(int argc, char *argv[])
    {
        ...
        if (!(infile = sam_open(inname, "r"))) {
        ...
        outfile1 = sam_open(file1, "w");            //as SAM
        outfile2 = sam_open(file2, "wb");           //as BAM
        ...
        if (!(in_samhdr = sam_hdr_read(infile))) {
        ...
        //write header
        if ((sam_hdr_write(outfile1, in_samhdr) == -1) ||
         (sam_hdr_write(outfile2, in_samhdr) == -1)) {
        ...
        while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
            if (bamdata->core.flag & BAM_FREAD1) {
                if (sam_write1(outfile1, in_samhdr, bamdata) < 0) {
        ...
    }
Refer: split.c

This creates 1.sam and 2.bam in /tmp/ containing read1 and read2 respectively.

    ./split /tmp/sample.sam.gz /tmp/

Below code excerpt shows sam_open_mode api usage.

    int main(int argc, char *argv[])
    {
        ...
        //set file open mode based on file name for 1st and as explicit for 2nd
        if ((sam_open_mode(mode1+1, file1, NULL) == -1) ||
         (sam_open_mode(mode2+1, file2, "sam.gz") == -1)) {
        ...
        if (!(infile = sam_open(inname, "r"))) {
        ...
        //open output files
        outfile1 = sam_open(file1, mode1);                          //as compressed SAM through sam_open
        outfile2 = sam_open_format(file2, mode2, NULL);             //as compressed SAM through sam_open_format
        ...
    }
Refer: split2.c

This creates 1.sam.gz and 2.sam in /tmp/ both having compressed data.

    ./split2 /tmp/sample.sam.gz /tmp/

An htsFormat structure filled appropriately can also be used to specify output
file format while using sam_open_format api.


### CRAM writing

CRAM files uses reference data and compresses alignment data. A CRAM file may
be created with external reference data file - most appropriate, with embedded
reference in it or with no reference data at all. It can also be created using
an autogenerated reference, based on consensus with-in the alignment data.
The reference detail can be set to an htsFormat structure using hts_parse_format
api and used with sam_open_format api to create appropriate CRAM file.

    ...
    snprintf(reffmt1, size1, "cram,reference=%s", reffile);
    snprintf(reffmt2, size2, "cram,embed_ref=1,reference=%s", reffile);
    ...
    if (hts_parse_format(&fmt1, reffmt1) == -1 ||                   //using external reference - uses the M5/UR tags to get
     reference data during read
            hts_parse_format(&fmt2, reffmt2) == -1 ||               //embed the reference internally
            hts_parse_format(&fmt3, "cram,embed_ref=2") == -1 ||    //embed autogenerated reference
            hts_parse_format(&fmt4, "cram,no_ref=1") == -1) {       //no reference data encoding at all
    ...
    outfile1 = sam_open_format(file1, "wc", &fmt1); outfile2 = sam_open_format(file2, "wc", &fmt2);
    ...
Refer: cram.c


### FASTA/FASTQ data access

FASTA/FASTQ files have the raw sequence data and the data can be read one by
one using sam_read1 or a selected range using a region. The data can be written
similar to alignment data using sam_write1 api. To write the file, format
can be set by updating mode buffer using sam_open_mode with file name
or explicit format text. This mode buffer can be used with sam_open or can be
used with sam_open_format with explicit format information in htsFormat
structure.

    ...
    if (!(bamdata = bam_init1())) {
    ...
    if (!(infile = sam_open(inname, "r"))) {
    ...
    if (infile->format.format != fasta_format && infile->format.format != fastq_format) {
    ...
    if (!(in_samhdr = sam_hdr_read(infile))) {
    ...
    while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
        printf("\nsequence: ");
        for (c = 0; c < bamdata->core.l_qseq; ++c) {
            printf("%c", seq_nt16_str[bam_seqi(bam_get_seq(bamdata), c)]);
        }
        if (infile->format.format == fastq_format) {
            printf("\nquality: ");
            for (c = 0; c < bamdata->core.l_qseq; ++c) {
                printf("%c", bam_get_qual(bamdata)[c]);
    ...
Refer: read_fast.c

    ...
    char mode[4] = "a";
    ...
    if (sam_open_mode(mode + 1, outname, NULL) < 0) {
    ...
    if (!(outfile = sam_open(outname, mode))) {
    ...
    if (bam_set1(bamdata, sizeof("test"), "test", BAM_FUNMAP, -1, -1, 0, 0, NULL, -1, -1, 0, 10, "AACTGACTGA", "1234567890", 0)
     < 0) {
    ...
    if (sam_write1(outfile, out_samhdr, bamdata) < 0) {
        printf("Failed to write data\n");
    ...
Refer: write_fast.c


### Header data read

The header gives the version, reference details, read group, change history
and comments. These data are stored inside the sam_hdr_t. Each of these
entries, except comments, have their unique identifier and it is required to
access different fields of them.  The api sam_hdr_count_lines gives the count
of the specified type of header line.  The value of a unique identifier to a
specific type of header line can be retrieved with sam_hdr_line_name api.  The
api sam_hdr_find_tag_id and sam_hdr_find_tag_pos can get the field data from a
header line using unique identifier values or using position. The full header
line can be retrieved using sam_hdr_find_line_pos or sam_hdr_line_id with
position and unique identifier values respectively.

    ...
    if (!(in_samhdr = sam_hdr_read(infile))) {
    ...
            ret = sam_hdr_find_tag_id(in_samhdr, header, id, idval, tag, &data);
    ...
            ret = sam_hdr_find_line_id(in_samhdr, header, id, idval, &data);
    ...
        linecnt = sam_hdr_count_lines(in_samhdr, header);
    ...
            ret = sam_hdr_find_tag_pos(in_samhdr, header, c, tag, &data);
    ...
            ret = sam_hdr_find_line_pos(in_samhdr, header, c, &data);
    ...
Refer: read_header.c

This will show the VN tag's value from HD header.

    ./read_header /tmp/sample.sam.gz HD VN

Shows the 2nd SQ line's LN field value.

    ./read_header /tmp/sample.sam.gz SQ SN T2 LN

Below code excerpt shows the reference names which has length above given value.

    ...
    linecnt = sam_hdr_count_lines(in_samhdr, "SQ"); //get reference count
    ...
    //iterate and check each reference's length
    for (pos = 1, c = 0; c < linecnt; ++c) {
        if ((ret = sam_hdr_find_tag_pos(in_samhdr, "SQ", c, "LN", &data) == -2)) {
    ...
        size = atoll(data.s);
        if (size < minsize) {
            //not required
            continue;
        }
        if (!(id = sam_hdr_line_name(in_samhdr, "SQ", c))) {
            //sam_hdr_find_tag_pos(in_samhdr, "SQ", c, "SN", &data) can also do the same!
    ...
        printf("%d,%s,%s\n", pos, id, data.s);
    ...
Refer: read_refname.c


### Alignment data read

The alignment / sequence data contains many fields. Mainly the read/query
name, flags indicating the properties of the read, reference sequence name,
position in reference to which it matches, quality of the read, CIGAR string
indicating the match status, position of mate / reverse strand, name of
reference sequence to which mate matches, the insert length, base sequence,
quality value of each base and auxiliary fields.

Header data would be required to retrieve the reference names as alignment
contains the position of the reference in the header.

A few of the data are directly visible in bam1_t and the rest are hidden
inside data member of bam1_t and can easily be retrieved using macros.
bam_get_qname gives the name of the read, sam_hdr_tid2name gives the reference
name. bam_get_cigar retrieves the cigar operation array, which can be decoded
using bam_cigar_oplen to get count of bases to which that operation applicable
and bam_cigar_opchr to get the cigar operation.  bam_seqi retrieves the base
data at a given position in alignment and it can be converted to character by
indexing the seq_nt16_str array.

    ...
    while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0)
    {
        //QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL [TAG:TYPE:VALUE]
        printf("NAME: %s\n", bam_get_qname(bamdata));                           //get the query name using the macro
        flags = bam_flag2str(bamdata->core.flag);                               //flags as string
        ...
        tidname = sam_hdr_tid2name(in_samhdr, bamdata->core.tid);
        ...
        printf("MQUAL: %d\n", bamdata->core.qual);                              //map quality value
        cigar = bam_get_cigar(bamdata);                                         //retrieves the cigar data
        for (i = 0; i < bamdata->core.n_cigar; ++i) {                           //no. of cigar data entries
            printf("%d%c", bam_cigar_oplen(cigar[i]), bam_cigar_opchr(cigar[i]));   //the macros gives the count of operation
             and the symbol of operation for given cigar entry
        }
        printf("\nTLEN/ISIZE: %"PRIhts_pos"\n", bamdata->core.isize);
        data = bam_get_seq(bamdata);
                 //get the sequence data
        if (bamdata->core.l_qseq != bam_cigar2qlen(bamdata->core.n_cigar, cigar)) { //checks the length with CIGAR and query
        ...
        for (i = 0; i < bamdata->core.l_qseq ; ++i) {       //sequence length
            printf("%c", seq_nt16_str[bam_seqi(data, i)]);  //retrieves the base from (internal compressed) sequence data
        ...
            printf("%c", bam_get_qual(bamdata)[i]+33);      //retrives the quality value
        ...
Refer: read_bam.c

Shows the data from alignments.

    ./read_bam /tmp/sample.sam.gz


### Aux data read

Auxiliary data gives extra information about the alignment. There can be a
number of such data and can be accessed by specifying required tag or by
iterating one by one through them once the alignment is read as bam1_t. The
auxiliary data are stored along with the variable length data in the data
field of bam1_t. There are macros defined to retrieve information about
auxiliary data from the data field of bam1_t.

Data for a specific tag can be retrieved as a string or can be retrieved as raw
data. bam_aux_get_str retrieves as a string, with tag name, tag type and data.
bam_aux_get can get raw data and with bam_aux_type and bam_aux2A, bam_aux2f etc.
the raw data can be extracted.

To iterate through all data, the start of aux data is retrieved using macro
bam_aux_first and successive ones using bam_aux_next. Macro bam_aux_tag gives
the tag of the aux field and bam_aux_type gives the information about type of
the aux field.

Bam_aux2i, bam_aux2f, bam_aux2Z macros retrieve the aux data's value as
integer, float and string respectively. The integer value may be of different
precision / size and the bam_aux_type character indicates how to use the
value. The string/hex data are NULL terminated.

For array data, bam_aux_type will return 'B' and bam_auxB_len gives the length
of the array. bam_aux_type with the next byte will give the type of data in
the array. bam_auxB2i, bam_auxB2f will give integer and float data from a
given position of the array.

    ...
    while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
        if (i % 2) {    //use options alternatively to demonstrate both
            //option 1 - get data as string with tag and type
            if ((c = bam_aux_get_str(bamdata, tag, &sdata)) == 1) {
                printf("%s\n",sdata.s);
    ...
            //option 2 - get raw data
            if (!(data = bam_aux_get(bamdata, tag))) {
    ...
                if (printauxdata(stdout, bam_aux_type(data), -1, data) == EXIT_FAILURE) {
    ...
Refer: read_aux.c

Shows the MD aux tag from alignments.

    ./read_aux ../../samtools/test/mpileup/mpileup.1.bam MD

    ...
    while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
        data = bam_aux_first(bamdata);                                              //get the first aux data
        while (data) {
            printf("%.2s:%c:", bam_aux_tag(data), NULL != strchr("cCsSiI", bam_aux_type(data)) ? 'i' : bam_aux_type(data));
              //macros gets the tag and type of aux data
            //dump the data
            if (printauxdata(stdout, bam_aux_type(data), -1, data) == EXIT_FAILURE) {
    ...
            data = bam_aux_next(bamdata, data);                                     //get the next aux data
    ...
Refer: dump_aux.c

Shows all the tags from all alignments.

    ./dump_aux ../../samtools/test/mpileup/mpileup.1.bam


### Add/Remove/Update header

There are specific types of data that can be part of header data. They have
a tag from HD, SQ, RG, PG and CO. Fully formatted header lines, separated by new
line, can be added with sam_hdr_add_lines api. A single header line can be added
using sam_hdr_add_line api where the header type, tag and value pair are passed
as arguments, terminated with a NULL argument. The PG header lines are special
that they have a kind of linkage to previous PG lines. This linkage can be auto
generated by using sam_hdr_add_pg api which sets the 'PP' field used in linkage.
sam_hdr_write api does the write of the header data to file.

    ...
    //add SQ line with SN as TR1 and TR2
    if (sam_hdr_add_lines(in_samhdr, &sq[0], 0)) {                                      //length as 0 for NULL terminated data
    ...
    //add RG line with ID as RG1
    if (sam_hdr_add_line(in_samhdr, "RG", "ID", "RG1", "LB", "Test", "SM", "S1", NULL)) {
    ...
    //add pg line
    if (sam_hdr_add_pg(in_samhdr, "add_header", "VN", "Test", "CL", data.s, NULL)) {    //NULL is to indicate end of args
    ...
    if (sam_hdr_add_line(in_samhdr, "CO", "Test data", NULL)) {                         //NULL is to indicate end of args
    ...
    //write output
    if (sam_hdr_write(outfile, in_samhdr) < 0) {
    ...
Refer: add_header.c

Not all type of header data can be removed but where it is possible, either a
specific header line can be removed or all of a header type can be removed. To
remove a specific line, header type, unique identifier field tag and its value
to be used. To remove all lines of a type, header type and unique identifier
field tag are to be used.

    ...
        //remove specific line
        if (sam_hdr_remove_line_id(in_samhdr, header, id, idval)) {
    ...
        //remove multiple lines of a header type
        if (sam_hdr_remove_lines(in_samhdr, header, id, NULL)) {
    ...
    if (sam_hdr_write(outfile, in_samhdr) < 0) {
    ...
Refer: rem_header.c

Shows the file content after removing SQ line with SN 2.
    ./rem_header ../../samtools/test/mpileup/mpileup.1.bam SQ 2

The unique identifier for the line needs to be found to update a field, though
not all types in the header may be modifiable.  The api sam_hdr_update_line
takes the unique identifier for the header line type, its value, the field
which needs to be modified and the new value with which to modify it, followed
by a NULL.
e.g. To change LN field from 2000 to 2250 in SQ line with unique identifier SN
as 'chr1', sam_hdr_update_line( header, "SQ", "SN", "chr1", "LN", "2250",
NULL).  To change PP field from ABC to DEF in PG line with ID APP.10,
sam_hdr_update_line( header, "PG", "ID", "APP.10", "PP", "DEF", NULL).

    ...
    //update with new data
    if (sam_hdr_update_line(in_samhdr, header, id, idval, tag, val, NULL) < 0) {
        printf("Failed to update data\n");
        goto end;
    }
    ...
Refer: update_header.c

Shows new sam file with 2nd SQ line having length as 38.

    ./update_header /tmp/sample.sam.gz SQ T1 LN 38


### Update alignment data

Many of the bam data fields may be updated by setting new value to appropriate
field in bam1_core_t structure and for a few, creating a new bam1_t record would
be easier than update of existing record.

    ...
    while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0)
    {
    ...
            case 1:// QNAME
                ret = bam_set_qname(bamdata, val);
            break;
            case 2:// FLAG
                bamdata->core.flag = atol(val) & 0xFFFF;
            break;
            case 3:// RNAME
            case 7:// RNEXT
                if ((ret = sam_hdr_name2tid(in_samhdr, val)) < 0) {
    ...
                if (field == 3) {
                    //reference
                    bamdata->core.tid = ret;
                }
                else {
                    //mate reference
                    bamdata->core.mtid = ret;
                }
            break;
            case 4:// POS
                bamdata->core.pos = atoll(val);
            break;
            case 5:// MAPQ
                bamdata->core.qual = atoi(val) & 0x0FF;
            break;
            case 6:// CIGAR
            {
    ...
                //get cigar array and set all data in new bam record
                if ((ncigar = sam_parse_cigar(val, NULL, &cigar, &size)) < 0) {
    ...
                if (bam_set1(newbam, bamdata->core.l_qname, bam_get_qname(bamdata), bamdata->core.flag, bamdata->core.tid,
                 bamdata->core.pos, bamdata->core.qual, ncigar, cigar, bamdata->core.mtid, bamdata->core.mpos,
                  bamdata->core.isize, bamdata->core.l_qseq, (const char*)bam_get_seq(bamdata),
                   (const char*)bam_get_qual(bamdata), bam_get_l_aux(bamdata)) < 0) {
    ...
                //correct sequence data as input is expected in ascii format and not as compressed inside bam!
                memcpy(bam_get_seq(newbam), bam_get_seq(bamdata), (bamdata->core.l_qseq + 1) / 2);
                //copy the aux data
                memcpy(bam_get_aux(newbam), bam_get_aux(bamdata), bam_get_l_aux(bamdata));
    ...
            break;
            case 8:// PNEXT
                bamdata->core.mpos = atoll(val);
            break;
            case 9:// TLEN
                bamdata->core.isize = atoll(val);
            break;
            case 10:// SEQ
    ...
                for( c = 0; c < i; ++c) {
                    bam_set_seqi(bam_get_seq(bamdata), c, seq_nt16_table[(unsigned char)val[c]]);
                }
            break;
            case 11:// QUAL
    ...
                for (c = 0; c < i; ++c) {
                    val[c] -= 33;               //phred score from ascii value
                }
                memcpy(bam_get_qual(bamdata), val, i);
    ...
Refer: mod_bam.c

Shows data with RNAME modified to T2.

    ./mod_bam /tmp/sample.sam ITR1 3 T2

The auxiliary data in bam1_t structure can be modified using
bam_aux_update_float, bam_aux_update_int etc. apis. If the aux field is not
present at all, it can be appended using bam_aux_append.

    ...
    //matched to qname, update aux
    if (!(data = bam_aux_get(bamdata, tag))) {
        //tag not present append
    ...
        if (bam_aux_append(bamdata, tag, type, length, (const uint8_t*)val)) {
    ...
    else {
        char auxtype = bam_aux_type(data);
        //update the tag with newer value
        switch (type) {
            case 'f':
            case 'd':
    ...
                if (bam_aux_update_float(bamdata, tag, atof(val))) {
    ...
            case 'C':
            case 'S':
            case 'I':
    ...
                if (bam_aux_update_int(bamdata, tag, atoll(val))) {
    ...
            case 'Z':
    ...
                if (bam_aux_update_str(bamdata, tag, length, val)) {
    ...
            case 'A':
    ...
                //update the char data directly on buffer
                *(data+1) = val[0];
    ...
Refer: mod_aux.c

Shows the given record's MD tag set to Test.

    ./mod_aux samtools/test/mpileup/mpileup.1.bam ERR013140.6157908 MD Z Test

The array aux fields can be updated using bam_aux_update_array api.

    ...
    if (bam_aux_update_array(bamdata, "BA", 'I', sizeof(cnt)/sizeof(cnt[0]), cnt)) {
    ...
Refer: mod_aux_ba.c

Shows the records updated with an array of integers, containing count of ACGT
and N in that order.

    ./mod_aux_ba samtools/test/mpileup/mpileup.1.bam


### Create an index

Indexes help to read data faster without iterating sequentially through the
file. Indexes contain the position information about alignments and that they
can be read easily. There are different type of indices, BAI, CSI, CRAI, TBI,
FAI etc. and are usually used with iterators.

Indexing of plain/textual files are not supported, compressed SAM&FASTA/Q, BAM,
and CRAM files can be indexed. CRAM files are indexed as .crai and the other two
can be indexed as .bai or .csi files. Each of these types have different
internal representations of the index information. Bai uses a fixed
configuration values where as csi has them dynamically updated based on the
alignment data.

Indexes can be created either with save of alignment data or explicitly by
read of existing alignment file.

To create index along with alignment write, the sam_idx_init api need to be
invoked before the start of alignment data write. This api takes the output
samFile pointer, header pointer, minimum shift and index file path. For BAI
index, the min shift has to be 0.

At the end of write, sam_idx_save api need to be invoked to save the index.

    //write header
    if (sam_hdr_write(outfile, in_samhdr)) {
    ...
    // initialize indexing, before start of write
    if (sam_idx_init(outfile, in_samhdr, size, fileidx)) {
    ...
        if (sam_write1(outfile, in_samhdr, bamdata) < 0) {
    ...
    if (sam_idx_save(outfile)) {
    ...
Refer:index_write.c

Creates mpileup.1.bam and mpileup.1.bam.bai in /tmp/.

    ./idx_on_write  ../../samtools/test/mpileup/mpileup.1.bam 0 /tmp/

To create index explicitly on an existing alignment data file, the
sam_index_build api or its alike can be used. sam_index_build takes the
alignment file path, min shift for the index and creates the index file in
same path. The output name will be based on the alignment file format and min
shift passed.

The sam_index_build2 api takes the index file path as well and gives more
control than the previous one.  The sam_index_build3 api provides an option to
configure the number of threads in index creation.


### Read with iterators

Index file helps to read required data without sequentially accessing the file
and are required to use iterators. The interested reference, start and end
position etc. are required to read data with iterators. With index and these
information, an iterator is created and relevant alignments can be accessed by
iterating it.

The api sam_index_load and the like does the index loading. It takes input
samFile pointer and file path. It loads the index file based on the input file
name, from the same path and with implicit index file extension - cram file
with .crai and others with .bai. The sam_index_load2 api accepts explicit path
to index file, which allows loading it from a different location and explicit
extensions. The sam_index_load3 api supports download/save of the index
locally from a remote location. These apis returns NULL on failure and index
pointer on success.

The index file path can be appended to alignment file path and used as well.
In this case the paths are expected to be separated by '##idx##'.

The sam_iter_queryi or sam_iter_querys apis may be used to create an iterator
and sam_itr_next api does the alignment data retrieval. Along with retrieval
of current data, it advances the iterator to next relevant data. The
sam_iter_queryi takes the interested positions as numeric values and
sam_iter_querys takes the interested position as a string.

With sam_iter_queryi, the reference id can be the 0 based index of reference
data, -2 for unmapped alignments, -3 to start read from beginning of file, -4
to continue from current position, -5 to return nothing. Based on the
reference id given, alignment covering the given start and end positions will
be read with sam_iter_next api.

With sam_iter_querys, the reference sequence is identified with the name and
interested positions can be described with start and end separated by '-' as
string. When sequence is identified as '.', it begins from the start of file
and when it is '*', unmapped alignments are read. Reference with <name>[:],
<name>:S, <name>:S-E, <name>:-E retrieves all data, all data covering position
S onwards, all data covering position S to E, all data covering upto position
E of reference with ID <name> respectively on read using sam_iter_next.

The index and iterator created are to be destroyed once the need is over.
sam_itr_destroy and hts_idx_destroy apis does this.

    ...
    //load index file
    if (!(idx = sam_index_load2(infile, inname, idxfile))) {
    ...
    //create iterator
    if (!(iter = sam_itr_querys(idx, in_samhdr, region))) {
    ...
    //read using iterator
    while ((c = sam_itr_next(infile, iter, bamdata)) >= 0) {
    ...
    if (iter) {
        sam_itr_destroy(iter);
    }
    if (idx) {
        hts_idx_destroy(idx);
    ...
Refer:index_reg_read.c

With sample.sam, region as \* will show alignments with name UNMAP2 and UNMAP3

    ./read_reg /tmp/sample.sam.gz \*

With region as \., it shows all alignments

    ./read_reg /tmp/sample.sam.gz \.

With region as T1:1-4, start 1 and end 4 it shows nothing and with T1:1-5 it
shows alignment with name ITR1.

    ./read_reg /tmp/sample.sam.gz T1:1-5

With region as T2:30-100, it shows alignment with name ITR2M which refers the
reference data T2.

    ./read_reg /tmp/sample.sam.gz T2:30-100


Multiple interested regions can be specified for read using sam_itr_regarray.
It takes index path, header, count of regions and region descriptions as array
of char array / string. This array passed need to be released by the user
itself.

    ...
    //load index file, assume it to be present in same location
    if (!(idx = sam_index_load(infile, inname))) {
    ...
    //create iterator
    if (!(iter = sam_itr_regarray(idx, in_samhdr, regions, regcnt))) {
    ...
    if (regions) {
        //can be freed as it is no longer required
        free(regions);
        regions = NULL;
    }
    //get required area
    while ((c = sam_itr_multi_next(infile, iter, bamdata) >= 0)) {
    ...
Refer:index_multireg_read.c

With compressed sample.sam and 2 regions from reference T1 (30 to 32) and 1
region from T2 (34 onwards), alignments with name A1, B1, A2 and ITR2M would
be shown.

    ./read_multireg /tmp/sample.sam.gz 2 T1:30-32,T2:34

To use numeric indices instead of textual regions, sam_itr_regions can be used.
It takes index file path, header, count of regions and an array of region
description (hts_reglist_t*), which has the start end positions as numerals.

The index and iterators are to be destroyed using the sam_itr_destroy and
hts_idx_destroy. The hts_reglist_t* array passed is destroyed by the library
on iterator destroy. The regions array (array of char array/string) needs to be
destroyed by the user itself.


### Pileup and MPileup

Pileup shows the transposed view of the SAM alignment data, i.e. it shows the
the reference positions and bases which cover that position through different
reads side by side. MPileup facilitates the piling up of multiple sam files
against each other and same reference at the same time.

Mpileup has replaced the pileup. The input expects the data to be sorted by
position.

Pileup needs to be initialized with bam_pileup_init method which takes pointer
to a method, which will be called by pileup to read data from required files,
and pointer to data which might be required for this read method to do the
read operation. It returns a pointer to the pileup iterator.

User can specify methods which need to be invoked during the load and unload
of an alignment, like constructor and destructor of objects.
Bam_plp_constructor and bam_plp_destructor methods does the setup of
these methods in the pileup iterator. During invocation of these methods, the
pointer to data passed in the initialization is passed as well. If user want
to do any custom status handling or actions during load or unload, it can be
done in these methods. Alignment specific data can be created and stored in
an argument passed to the constructor and the same will be accessible during
pileup status return. The same will be accessible during destructor as well
where any deallocation can be made.

User is expected to invoke bam_plp_auto api to get the pileup status. It
returns the pileup status or NULL on end. During this all alignments are read
one by one, using the method given in initialization for data read, until one
for a new reference is found or all alignment covering a position is read. On
such condition, the pileup status is returned and the same continuous on next
bam_plp_auto call.  The pileup status returned is an array for all positions
for which the processing is completed. Along with the result, the reference
index, position in reference data and number of alignments which covers this
position are passed. User can iterate the result array and get bases from each
alignment which covers the given reference position. The alignment specific
custom data which were created in constructor function will also be available
in the result.

The bam_plp_auto api invokes the data read method to load an alignment and the
constructor method is invoked during the load. Once the end of alignment is
passed, it is removed from the processing and destructor method is invoked,
that user could do deallocations and custom actions as in load during this
time. The custom data passed during the initialization is passed to the
constructor and destructor methods during invocation.

Once the forward and reverse strands are identified, the better of the quality
is identified and used. Both reads are required for this and hence reads are
cached until its mate is read. The maximum number of reads that can be cached
is controlled by bam_plp_set_maxcnt. Reads covering a position are cached and
as soon as mate is found, quality is adjusted and is removed from cache. Reads
above the cache limit are discarded.

Once done, the pileup iterator to be discarded by sam_plp_destroy api.

    ...
    if (!(plpiter = bam_plp_init(readdata, &conf))) {
    ...
    //set constructor destructor callbacks
    bam_plp_constructor(plpiter, plpconstructor);
    bam_plp_destructor(plpiter, plpdestructor);

    while ((plp = bam_plp_auto(plpiter, &tid, &refpos, &n))) {
        printf("%d\t%d\t", tid+1, refpos+1);
        for (j = 0; j < n; ++j) {
            //doesnt detect succeeding insertion and deletion together here, only insertion is identified
            //deletion is detected in plp->is_del as and when pos reaches the position
            //if detection ahead is required, use bam_plp_insertion here which gives deletion length along with insertion
            if (plp[j].is_del || plp[j].is_refskip) {
                printf("*");
                continue;
            }
            //start and end are displayed in UPPER and rest on LOWER
            printf("%c", plp[j].is_head ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
                            (plp[j].is_tail ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
                             tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)])));
            if (plp[j].indel > 0) {
                //insertions, anyway not start or end
                printf("+%d", plp[j].indel);
                for (k = 0; k < plp[j].indel; ++k) {
                    printf("%c", tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos + k + 1)]));
                }
            }
            else if (plp[j].indel < 0) {
                printf("%d", plp[j].indel);
                for (k = 0; k < -plp[j].indel; ++k) {
                    printf("?");
                }
    ...
    if (plpiter) {
        bam_plp_destroy(plpiter);
    ...
Refer:pileup.c

The read method may use a simple read or it could be an advanced read using
indices, iterators and region specifications based on the need. The constructor
method may create any custom data and store it in the pointer passed to it. The
same need to be released by use on destructor method.

MPileup works same as the pileup and supports multiple inputs against the same
reference, giving side by side view of reference and alignments from different
inputs.

MPileup needs to be initialized with bam_mpileup_init method which takes
pointer to a method, which will be called by pileup to read data from required
files, and an array of pointer to data which might be required for this read
method to do the read operation. It returns a pointer to the mpileup iterator.

User can specify methods which need to be invoked during the load and unload
of an alignment, like constructor and destructor of objects.
bam_mplp_constructor and bam_mplp_destructor methods does the setup
of these methods in the pileup iterator. During invocation of these methods,
the pointer to data passed in the initialization is passed as well. If user
want to do any custom status handling or actions during load or unload, it can
be done on these methods. Alignment specific data can be created and
stored in the custom data pointer and the same will be accessible during
return of pileup status. The same will be accessible during destructor as well
where any deallocation can be made.

User is expected to invoke bam_mplp_auto api to get the pileup status. It
returns the pileup status. During this all alignments are read one by one,
using the method given in initialization for data read, until one for a new
reference is found or all alignment covering a position is read. On such
condition, the pileup status is returned and the same continuous on next
bam_mplp_auto call.

The pileup status is returned through a parameter in the method itself, is an
array for all inputs, each containing array for positions on which the
processing is completed. Along with the result, the reference index, position
in reference data and number of alignments which covers this position are
passed. User can iterate the result array and get bases from each alignment
which covers the given reference position. The alignment specific custom data
which were created in constructor function will also be available in the
result.

Once the forward and reverse strands are identified, the better of the quality
is identified and used. Both reads are required for this and hence reads are
cached until its mate is read. The maximum number of reads that can be cached
is controlled by bam_mplp_set_maxcnt. Reads covering a position are cached and
as soon as mate is found, quality is adjusted and is removed from cache. Reads
above the cache limit are discarded.

Once done, the pileup iterator to be discarded by sam_mplp_destroy api.

    ...
    if (!(mplpiter = bam_mplp_init(argc - 1, readdata, (void**) conf))) {
    ...
    //set constructor destructor callbacks
    bam_mplp_constructor(mplpiter, plpconstructor);
    bam_mplp_destructor(mplpiter, plpdestructor);

    while (bam_mplp64_auto(mplpiter, &tid, &refpos, depth, plp) > 0) {
        printf("%d\t%"PRIhts_pos"\t", tid+1, refpos+1);

        for (input = 0; input < argc - 1; ++input) {
            for (dpt = 0; dpt  < depth[input]; ++dpt) {
                if (plp[input][dpt].is_del || plp[input][dpt].is_refskip) {
                    printf("*");
                    continue;
                }
                //start and end are displayed in UPPER and rest on LOWER
                printf("%c", plp[input][dpt].is_head ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
                 plp[input][dpt].qpos)]) : (plp[input]->is_tail ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
                  plp[input][dpt].qpos)]) : tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
                   plp[input][dpt].qpos)])));
                if (plp[input][dpt].indel > 0) {
                    //insertions, anyway not start or end
                    printf("+%d", plp[input][dpt].indel);
                    for (k = 0; k < plp[input][dpt].indel; ++k) {
                        printf("%c", tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
                         plp[input][dpt].qpos + k + 1)]));
                    }
                }
                else if (plp[input][dpt].indel < 0) {
                    printf("%d", plp[input][dpt].indel);
                    for (k = 0; k < -plp[input][dpt].indel; ++k) {
                        printf("?");
    ...
    if (mplpiter) {
        bam_mplp_destroy(mplpiter);
    }
    ...
    if (plp) {
        free(plp);
    ...
Refer:mpileup.c

This sample takes multiple sam files and shows the pileup of data side by side.

    ./mpileup /tmp/mp.bam /tmp/mp.sam


### Base modifications

The alignment data may contain base modification information as well. This
gives the base, modifications found, orientation in which it was found and the
quality for the modification. The base modification can be identified using
hts_parse_basemod api. It stores the modification details on hts_base_mod_state
and this has to be initialized using hts_base_mod_state_alloc api.

Once the modifications are identified, they can be accessed through different
ways. bam_mods_recorded api gives the modifications identified for an alignment.
Modifications can be queried for each base position iteratively using
bam_mods_at_next_pos api. Check the returned value with buffer size to see
whether the buffer is big enough to retrieve all modifications.
Instead of querying for each position, the next modified position can be
directly retrieved directly using bam_next_basemod api. An alignment can be
queried to have a specific modification using bam_mods_query_type api. At the
end of processing, the state need to be released using hts_base_mod_state_free
api.

    ...
    if (!(ms = hts_base_mod_state_alloc())) {
    ...
    while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0)
    {
    ...
        if (bam_parse_basemod(bamdata, ms)) {
    ...
        bm = bam_mods_recorded(ms, &cnt);
        for (k = 0; k < cnt; ++k) {
            printf("%c", bm[k]);
        }
        printf("\n");
        hts_base_mod mod[5] = {0};  //for ATCGN
        if (opt) {
            //option 1
            for (; i < bamdata->core.l_qseq; ++i) {
                if ((r = bam_mods_at_next_pos(bamdata, ms, mod, sizeof(mod)/sizeof(mod[0]))) <= -1) {
                    printf("Failed to get modifications\n");
                    goto end;
                }
                else if (r > (sizeof(mod) / sizeof(mod[0]))) {
                    printf("More modifications than this app can handle, update the app\n");
                    goto end;
                }
                else if (!r) {
                    //no modification at this pos
                    printf("%c", seq_nt16_str[bam_seqi(data, i)]);
                }
                //modifications
                for (j = 0; j < r; ++j) {
                    printf("%c%c%c", mod[j].canonical_base, mod[j].strand ? '-' : '+', mod[j].modified_base);
    ...
        else {
            //option 2
            while ((r = bam_next_basemod(bamdata, ms, mod, sizeof(mod)/sizeof(mod[0]), &pos)) >= 0) {
                for (; i < bamdata->core.l_qseq && i < pos; ++i) {
                    printf("%c", seq_nt16_str[bam_seqi(data, i)]);
                }
                //modifications
                for (j = 0; j < r; ++j) {
                    printf("%c%c%c", mod[j].canonical_base, mod[j].strand ? '-' : '+', mod[j].modified_base);
                }
    ...
        //check last alignment's base modification
        int strand = 0, impl = 0;
        char canonical = 0, modification[] = "mhfcgebaon";      //possible modifications
        printf("\n\nLast alignment has \n");
        for (k = 0; k < sizeof(modification) - 1; ++k) {        //avoiding NUL termination
            if (bam_mods_query_type(ms, modification[k], &strand, &impl, &canonical)) {
                printf ("No modification of %c type\n", modification[k]);
            }
            else {
                printf("%s strand has %c modified with %c, can %sassume unlisted as unmodified\n", strand ? "-/bottom/reverse" :
                "+/top/forward", canonical, modification[k], impl?"" : "not " );
            }
        }
    ...
    if (ms) {
        hts_base_mod_state_free(ms);
    ...
Refer:modstate.c

The modification can be accessed in pileup mode as well. bam_mods_at_qpos gives
the modification at given pileup position. Insertion and deletion to the given
position with possible modification can be retrieved using bam_plp_insertion_mod
api.

    ...
    int plpconstructor(void *data, const bam1_t *b, bam_pileup_cd *cd) {
        //when using cd, initialize and use as it will be reused after destructor
        cd->p = hts_base_mod_state_alloc();
        //parse the bam data and gather modification data from MM tags
        return (-1 == bam_parse_basemod(b, (hts_base_mod_state*)cd->p)) ? 1 : 0;
    }

    int plpdestructor(void *data, const bam1_t *b, bam_pileup_cd *cd) {
        if (cd->p) {
            hts_base_mod_state_free((hts_base_mod_state *)cd->p);
            cd->p = NULL;
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
    ...
    if (!(plpiter = bam_plp_init(readdata, &conf))) {
    ...
    //set constructor destructor callbacks
    bam_plp_constructor(plpiter, plpconstructor);
    bam_plp_destructor(plpiter, plpdestructor);

    while ((plp = bam_plp_auto(plpiter, &tid, &refpos, &depth))) {
        memset(&mods, 0, sizeof(mods));
        printf("%d\t%d\t", tid+1, refpos+1);

        for (j = 0; j < depth; ++j) {
            dellen = 0;
            if (plp[j].is_del || plp[j].is_refskip) {
                printf("*");
                continue;
            }
            /*invoke bam mods_mods_at_qpos before bam_plp_insertion_mod that the base modification
            is retrieved before change in pileup pos thr' plp_insertion_mod call*/
            if ((modlen = bam_mods_at_qpos(plp[j].b, plp[j].qpos, plp[j].cd.p, mods, NMODS)) == -1) {
    ...
            //use plp_insertion/_mod to get insertion and del at the same position
            if ((inslen = bam_plp_insertion_mod(&plp[j], (hts_base_mod_state*)plp[j].cd.p, &insdata, &dellen)) == -1) {
    ...
            //start and end are displayed in UPPER and rest on LOWER, only 1st modification considered
            //base and modification
            printf("%c%c%c", plp[j].is_head ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
                (plp[j].is_tail ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
                    tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)])),
                modlen > 0 ? mods[0].strand ? '-' : '+' : '\0', modlen > 0 ? mods[0].modified_base : '\0');
            //insertion and deletions
            if (plp[j].indel > 0) {
                //insertion
                /*insertion data from plp_insertion_mod, note this shows the quality value as well
                which is different from base and modification above;the lower case display is not attempted either*/
                printf("+%d%s", plp[j].indel, insdata.s);
                //handle deletion if any
                if (dellen) {
                    printf("-%d", dellen);
                    for (k = 0; k < dellen; ++k) {
                        printf("?");
    ...
            else if (plp[j].indel < 0) {
                //deletion
                printf("%d", plp[j].indel);
                for (k = 0; k < -plp[j].indel; ++k) {
                    printf("?");
                }
            }
    ...
Refer:pileup_mod.c


### Read selected fields

At times the whole alignment data may not be of interest and it would be
better to read required fields alone from the alignment data. CRAM file format
supports such specific data read and HTSLib provides an option to use this.
This can improve the performance on read operation.

The hts_set_opt method does the selection of specified fields. There are flags
indicating specific fields, like SAM_FLAG, SAM_SEQ, SAM_QNAME, in alignment
data and a combination of flags for the required fields can be passed with
CRAM_OPT_REQUIRED_FIELDS to this api.

    ...
       //select required field alone, this is useful for CRAM alone
       if (hts_set_opt(infile, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG) < 0) {
    ...
       //read header
       in_samhdr = sam_hdr_read(infile);
    ...
    //read data, check flags and update count
    while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
        if (bamdata->core.flag & BAM_FREAD1) {
            cntread1++;
    ...
Refer: flags_htsopt_field.c


### Thread-pool to read / write

The HTSLib api supports thread pooling for better performance. There are a few
ways in which this can be used. The pool can be made specific for a file or a
generic pool can be created and shared across multiple files. Another way to
use thread pool is to schedule tasks explicitly to queues which gets executed
using threads in pool.

To have a thread pool specific for a file, hts_set_opt api can be used with the
file pointer, HTS_OPT_NTHREADS and the number of threads to use in the pool.
Closure of file releases the thread pool as well. To have a thread pool which
can be shared across different files, it needs to be initialized using
hts_tpool_init api, passing number of threads as argument. This thread pool can
be associated with a file using hts_set_opt api. The file pointer,
HTS_OPT_THREAD_POOL and the thread pool address are to be passed as arguments
to api. The thread pool has to be released with hts_tpool_destroy.

Below excerpt shows file specific thread pool,

    ...
    //create file specific threads
    if (hts_set_opt(infile, HTS_OPT_NTHREADS, 2) < 0 ||     //2 thread specific for reading
    hts_set_opt(outfile1, HTS_OPT_NTHREADS, 1) < 0 ||       //1 thread specific for sam write
    hts_set_opt(outfile2, HTS_OPT_NTHREADS, 1) < 0) {       //1 thread specific for bam write
        printf("Failed to set thread options\n");
        goto end;
    }
Refer: split_thread1.c

Below excerpt shows thread pool shared across files,

    ...
    //create a pool of 4 threads
    if (!(tpool.pool = hts_tpool_init(4))) {
    ...
    //share the pool with all the 3 files
    if (hts_set_opt(infile, HTS_OPT_THREAD_POOL, &tpool) < 0 ||
    hts_set_opt(outfile1, HTS_OPT_THREAD_POOL, &tpool) < 0 ||
    hts_set_opt(outfile2, HTS_OPT_THREAD_POOL, &tpool) < 0) {
    ...
    if (tpool.pool) {
        hts_tpool_destroy(tpool.pool);
    }
    ...
Refer: split_thread2.c


## More Information

### CRAM reference files

The cram reference data is required for the read of sequence data in CRAM
format. The sequence data file may have it as embedded or as a reference to
the actual file. When it is a reference, it is downloaded locally, in the
cache directory for later usage. It will be stored in a directory structure
based on the MD5 checksum in the cache directory.

Each chromosome in a reference file gets saved as a separate file with md5sum
as its path and name. The initial 4 numerals make the directory name and rest
as the file name (<cache path>/<1st 2 of md5sum>/<2nd 2 of md5sum>/<rest of
md5sum>).

The download would be attempted from standard location, EBI ENA
(https://www.ebi.ac.uk/ena).


### Bam1_t

This structure holds the sequence data in BAM format. There are fixed and
variable size fields, basic and extended information on sequence
data. Variable size data and extended information are kept together in a
buffer, named data in bam1_t. Fields in the member named core, bam1_core_t,
and a few macros together support the storage and handling of the whole
sequence data.

- core has a link to reference as a 0 based index in field tid. The mate /
  reverse strand's link to reference is given by mtid.

- Field pos and mpos gives the position in reference to which the sequence and
  its mate / reverse strand match.

- Field flag gives the properties of the given alignment. It shows the
  alignment's orientation, mate status, read order etc.

- Field qual gives the quality of the alignment read.

- l_qname gives the length of the name of the alignment / read, l_extranul gives
  the extra space used internally in the data field.

- l_qseq gives the length of the alignment / read in the data field.

-- n_cigar gives the number of CIGAR operations for the given alignment.

- isize gives the insert size of the read / alignment.

The bases in sequence data are stored by compressing 2 bases together in a
byte.  When the reverse flag is set, the base data is reversed and
complemented from the actual read (i.e. if the forward read is ACTG, the
reverse read to be CAGT; it will be stored in SAM format with reversed and
complemented format as ACTG with reverse flag set).

Macros bam_get_qname, bam_get_seq, bam_get_qual, bam_get_aux, bam_get_l_aux,
bam_seqi etc access the data field and retrieve the required data. The aux
macros support the retrieval of auxiliary data from the data field.


### Sam_hdr_t

This structure holds the header information. This holds the number of targets
/ SQ lines in the file, each one's length, name and reference count to this
structure. It also has this information in an internal data structure for
easier access of each field of this data.

When this data is shared or assigned to another variable of a different scope
or purpose, the reference count needs to be incremented to ensure that it is
valid till the end of the variable's scope. sam_hdr_incr_ref and it needs to
be destroyed as many times with sam_hdr_destroy api.


### Index

Indices need the data to be sorted by position.  They can be of different
types with extension .bai, .csi or .tbi for compressed SAM/BAM files and .crai
for CRAM files.  The index name can be passed along with the alignment file
itself by appending a specific character sequence. The apis can detect this
sequence and extract the index path. ##idx## is the sequence which separates
the file path and index path.


### Data files

The data files can be a local file, a network file, a file accessible through
the web or in cloud storage like google and amazon.  The data files can be
represented with URIs like file://, file://localhost/.., ,ftp://..,
gs+http[s].., s3+http[s]://