klassify 0.1.6

Classify chimeric reads based on unique kmer contents
Documentation
Breakpoint-first mosaic using a precomputed UCSC chain (minimap2 + transanno already run),
with optional on-the-fly generation if no chain is provided.

Pipeline:
1) (Optional) If --chain not provided:
     minimap2 (A vs B) -> PAF
     transanno minimap2chain -> CHAIN
2) Load CHAIN with the 'liftover' Python library to get an A->B converter.
3) Simulate N breakpoints on A with a minimum spacing.
4) Lift each A breakpoint to B (keep only those mapping to B's main chrom on '+'
   strand; require strictly increasing on B).
5) Write paired A/B breakpoints to TSV.
6) Build mosaic by alternating A and B between consecutive breakpoints, starting on A.

Requirements:
  - Python: liftover, pyfaidx, numpy
  - If --chain is NOT supplied: minimap2 and transanno must be in PATH.

Example:
  python build_mosaic.py A.fa B.fa results/run1 \
    --chain precomputed.chain \
    --n 4 --min-distance 1000000 --seed 13

Functions:
  (check_tool and run are imported from the local utils module.)

  read_first_record(fa_path) -> (name, seq)
      Read the first record from a FASTA file.
      Returns the record name and its upper-cased sequence.
      Raises ValueError ("No records in <fa_path>") if the file has no records.

  write_fasta(path, header, seq, width=60)
      Write a FASTA file: a ">header" line, then the sequence wrapped at
      width characters per line.
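
As an illustration of the line wrapping that write_fasta performs, here is a minimal sketch using only the standard library. The function name write_wrapped_fasta is hypothetical and chosen for this example; it is not the module's own code.

```python
def write_wrapped_fasta(path: str, header: str, seq: str, width: int = 60) -> None:
    """Write a single FASTA record, wrapping the sequence at `width` characters."""
    with open(path, "w", encoding="utf-8") as fw:
        # Header line: ">" followed by the record name.
        fw.write(f">{header}\n")
        # Emit the sequence in fixed-width chunks; the last chunk may be shorter.
        for i in range(0, len(seq), width):
            fw.write(seq[i:i + width] + "\n")
```

A 10-base sequence written with width=4 would produce two full lines of 4 bases and a final line of 2.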

  run_minimap2_paf(faA, faB, threads, out_paf)
      Produce a PAF mapping A (query) -> B (target) with CIGAR/CS; suitable for
      transanno. Runs minimap2 with "-cx asm5 --cs -t <threads>" and writes its
      standard output to out_paf.
.�	-�	-�s	�9�A�paf�	out_chainc�(�dd|d|g}t|�y)z,
    Convert PAF to UCSC chain (A -> B)
    �	transanno�
minimap2chainz--outputN)r)r3r4r0s   r�paf_to_chain_with_transannor8Hs�����j�)�
D�C���Hr�L�n�dmin�seedc�t��
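
The two external commands used for chain generation can be sketched as argument lists. This is a hedged illustration: the helper names minimap2_command and transanno_command are hypothetical, and the target-before-query order follows minimap2's usual usage (minimap2 [options] target.fa query.fa) rather than anything confirmed here. Nothing is executed; the functions only build the lists that would be passed to a process runner.

```python
from typing import List


def minimap2_command(fa_a: str, fa_b: str, threads: int) -> List[str]:
    """Build a minimap2 call aligning A (query) against B (target).

    Assumption: minimap2 takes the target FASTA first, then the query.
    """
    return ["minimap2", "-cx", "asm5", "--cs", "-t", str(threads), fa_b, fa_a]


def transanno_command(paf: str, out_chain: str) -> List[str]:
    """Build the transanno call that converts a PAF file to a UCSC chain."""
    return ["transanno", "minimap2chain", paf, "--output", out_chain]
```

Building the command as a list, rather than a shell string, avoids quoting problems with file names that contain spaces.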

  sample_breakpoints(L, n, dmin, seed)
      Sample n positions in [1, L-1) (avoid exact ends), with >= dmin spacing.
      Greedy retries; may return fewer if space is tight.
      Positions are drawn with a seeded NumPy generator and returned sorted.
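
The greedy retry scheme described for sample_breakpoints can be sketched as follows. This is a simplified illustration that swaps in the standard-library random module for NumPy; the function name sample_spaced_positions and the max_tries value are assumptions made for the example.

```python
import random
from typing import List


def sample_spaced_positions(
    L: int, n: int, dmin: int, seed: int, max_tries: int = 10_000
) -> List[int]:
    """Greedily sample up to n positions in [1, L-1) that are >= dmin apart."""
    if n <= 0 or L <= 2:
        return []
    rng = random.Random(seed)
    picks: List[int] = []
    tries = 0
    while len(picks) < n and tries < max_tries:
        x = rng.randrange(1, L - 1)  # upper bound is exclusive
        # Accept the candidate only if it keeps the minimum spacing
        # to every position accepted so far.
        if all(abs(x - y) >= dmin for y in picks):
            picks.append(x)
        tries += 1
    picks.sort()
    return picks
```

Because rejected candidates are simply discarded, the function can return fewer than n positions when L is small relative to n * dmin.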

  lift_pos(converter, chromA, posA_0based)
      Use liftover to map a 0-based A position to B.
      Returns list of hits (chrom, pos, strand); we will filter outside.

  choose_mapped_breaks(converter, chromA, chromB_target, L, n, dmin, seed)
      Find n A breakpoints that:
        - are >= dmin apart on A,
        - each lifts to B (same chrom as chromB_target),
        - all on '+' strand (to keep B slicing simple),
        - B positions strictly increase with A.
      Will resample multiple times; raises RuntimeError if it cannot satisfy
      these ("Could not find enough mapped breakpoints; try lowering
      --min-distance or using a different seed.").
      Returns the paired lists of A and B breakpoints.
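
The acceptance test applied to one candidate set of A breakpoints can be sketched without the liftover library. In this hedged example a plain dictionary stands in for the chain converter, and lift_all_increasing is a hypothetical name; hits are (chrom, pos, strand) tuples as described for lift_pos.

```python
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, int, str]  # (chrom, pos, strand)


def lift_all_increasing(
    lookup: Dict[int, List[Hit]], a_breaks: List[int], chrom_b: str
) -> Optional[List[int]]:
    """Return B positions for a_breaks, or None if any constraint fails."""
    b_hits: List[int] = []
    prev_b = -1
    for a in a_breaks:
        # Keep only hits on the target chromosome and on the '+' strand.
        candidates = [
            (c, p, s) for (c, p, s) in lookup.get(a, [])
            if c == chrom_b and s == "+"
        ]
        if not candidates:
            return None
        bpos = int(candidates[0][1])
        # B positions must strictly increase along A.
        if bpos <= prev_b:
            return None
        b_hits.append(bpos)
        prev_b = bpos
    return b_hits
```

A caller would resample the A breakpoints and try again whenever this returns None.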
������#�A��J�A�t�Q��t�9�D��v�~�����M�M�$���F�!�"��V�#�#�3�4�j����!s�C1�C1�C1�seqA�seqBrc�b_breaksc��t|�t|�k(rt|�dkDsJ�t|�}dg|z|gz}g}d}tt|�dz
�D]V}||||dz}
}	|r|j||	|
�n-|dz
}||||dz}
}|j||
kr|||
nd�|}�Xdj|�S)z�
    Start on A. Between consecutive A breakpoints, alternate A and B segments:
      segments: [0..a1] (A), [b1..b2] (B), [a2..a3] (A), ...
    Tail: append the remainder of whichever donor you end on.
    rTr?�)r$r#rL�join)rnrorcrpr9�a_pos�pieces�donor_Ar&�aL�aR�bi�bL�bRs              r�build_mosaic_alternater|�s����x�=�C��M�)�c�(�m�a�.?�?�?��D�	�A�
�C�(�N�a�S� �E��F��G�
�3�u�:��>�
"���q��5��Q��<�B����M�M�$�r�"�+�&��Q��B��b�\�8�B��F�#3��B��M�M��r��$�r�"�+�r�:��+��#��7�7�6�?�rc

  main()
      Command-line entry point: "Breakpoint-first mosaic using a precomputed
      chain (or generate one if absent)."

      Arguments:
        faA             FASTA for genome A (source of breakpoints)
        faB             FASTA for genome B (target of liftover)
        out_prefix      Output prefix (creates .tsv and .fa; may also create
                        .paf/.chain if --chain not provided)
        --chain         Precomputed UCSC chain file mapping A->B. If provided,
                        minimap2/transanno are skipped.
        --threads       Integer thread count
        --min-distance  Minimum spacing between A breakpoints (default: 1000000 bp)
        --n             Number of breakpoints to simulate
        --seed          Integer random seed (default: 42)
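
The command-line interface can be sketched with argparse. Argument names and help strings follow the listing for main(); the defaults for --threads and --n are placeholders chosen for this example and are not taken from the source, and make_parser is a hypothetical helper name.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Build a parser with three positional inputs and five options."""
    ap = argparse.ArgumentParser(
        description="Breakpoint-first mosaic using a precomputed chain "
                    "(or generate one if absent)."
    )
    ap.add_argument("faA", help="FASTA for genome A (source of breakpoints)")
    ap.add_argument("faB", help="FASTA for genome B (target of liftover)")
    ap.add_argument("out_prefix", help="Output prefix")
    ap.add_argument("--chain", help="Precomputed UCSC chain file mapping A->B")
    ap.add_argument("--threads", type=int, default=1)  # placeholder default
    ap.add_argument(
        "--min-distance", type=int, default=1_000_000,
        help="Minimum spacing between A breakpoints (default: %(default)s bp)",
    )
    ap.add_argument("--n", type=int, default=4,  # placeholder default
                    help="Number of breakpoints to simulate")
    ap.add_argument("--seed", type=int, default=42)
    return ap
```

Note that argparse exposes --min-distance as the attribute min_distance, and that the three inputs are positional, so they are given without a leading flag.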
�����������	�	�	rr�r�r�c�~�t|�\}}	t|�\}
}|dz}|dz}
|r/|}tjj|�sHt	d|����td�td�|dz}|dz}t
||||�t||�t|d�	�}t|||
t|	�t|�t|�t|��
�\}}tjtjj|�xsdd�
�t|dd��5}|jd�t!t#||��D](\}\}}|j|�d|�d|�d|
�d|�d�
��*	ddd�t%|	|||�}t'|
tjj)|�|��y#1swY�CxYw)z4
    Build a mosaic from a pair of FASTA files.
    z.breakpoints.tsvz
.mosaic.faz--chain file not found: r-r6z.pafz.chainF)�	one_based)rYrUr[r9r:r;r<�.T)�exist_okrrrz3idx	A_chrom	A_bp_0based	B_chrom	B_bp_0based	strand
�	z	+
N)rr)r�osr�exists�FileNotFoundErrorrr2r8rrmr$rI�makedirs�dirnamer!r"�	enumerate�zipr|r'�basename)r(r)r�r�r*r�r:r<rUrn�chromBro�out_tsv�out_fa�
chain_pathr+rYrcrprr&rg�b�mosaics                        rr�r��s���%�S�)�L�F�D�$�S�)�L�F�D��-�-�G�
�,�
&�F�
��
��w�w�~�~�j�)�#�&>�z�l�$K�L�L��:���;���v�%���(�*�
���c�7�G�4�#�G�Z�8��*��6�I�.����

�d�)�

�a�&�
��
�
��Y���H�h��K�K�������(�/�C�$�?�	
�g�s�W�	-��	���K�L�"�3�x��#:�;�I�A�v��1�
�G�G�q�c��F�8�2�a�S��6�(�"�Q�C�u�=�>�<�
.�$�D�$��(�
C�F���r�w�w�/�/�
�;��H�
.�	-�s
�AF3�3F<�__main__)�<)�__doc__r�r��typingrr�numpyrF�liftoverr�pyfaidxr�utilsrrrrrIr'r2r8rTrZrmr|r�r��__name__rXrr�<module>r�s����6�	�����!�.�s�.�u�S�#�X��.�0�c�0�3�0�S�0��0��#��C��#����
�S�
�S�
��#��#��S�����S�	��**��*�#�*�)��)�+.�)�36�)�;>�)�FI�)�QT�)�
�4��9�d�3�i�� �)�X�

����$(��I��9=�c�����4$�N7I�	�7I�	�7I��7I���:�	7I�
�7I��
7I��7I��7I�t�z���F�r
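
The layout of the paired-breakpoint TSV can be sketched as follows. The column header matches the one embedded in the module; the helper name write_breakpoint_rows and the 0-based idx column are assumptions made for this example.

```python
import io
from typing import List, TextIO

HEADER = "idx\tA_chrom\tA_bp_0based\tB_chrom\tB_bp_0based\tstrand\n"


def write_breakpoint_rows(
    handle: TextIO, chrom_a: str, chrom_b: str,
    a_breaks: List[int], b_breaks: List[int],
) -> None:
    """Write one tab-separated row per paired A/B breakpoint."""
    handle.write(HEADER)
    for i, (a, b) in enumerate(zip(a_breaks, b_breaks)):
        # Only '+' strand hits are kept upstream, so strand is constant.
        handle.write(f"{i}\t{chrom_a}\t{a}\t{chrom_b}\t{b}\t+\n")


buf = io.StringIO()
write_breakpoint_rows(buf, "chrA", "chrB", [100, 200], [120, 230])
print(buf.getvalue(), end="")
```

Writing to any text handle keeps the function easy to exercise in memory before pointing it at a real file.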