jin 0.1.0

Approximate Nearest Neighbor Search: HNSW, DiskANN, IVF-PQ, ScaNN, quantization
Documentation

scripts/generate_multiscale_data.py

Generate multi-scale datasets for rigorous ANN benchmarking.

Based on research from:
- ann-benchmarks (erikbern): Standard recall-QPS evaluation
- BigANN 2023 NeurIPS: Filtered, OOD, streaming scenarios at 10M scale
- VIBE: Out-of-distribution evaluation for embeddings

Scales:
- B (base):  10K vectors - quick validation
- T (test): 100K vectors - meaningful scaling behavior
- P (prod):   1M vectors - production-relevant

Run: uvx --with numpy python scripts/generate_multiscale_data.py [--scale B|T|P|all]

generate_anisotropic_embeddings(n, dim, n_topics, rank, topic_scale, topic_noise, global_noise, seed) -> tuple[np.ndarray, np.ndarray]
Generate realistic anisotropic embeddings with topic structure.

Key insight from empirical embedding studies:
- Real embeddings (BERT, GTE, etc.) have ~80% variance in top 10% of PCA components
- Topic clustering is implicit in semantic similarity
- Zipf-like frequency distribution for topics
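The recipe in the bullets above can be sketched directly: sample a low-rank orthonormal basis, assign topics with Zipf-like weights, add topic centroids plus noise inside the subspace, and unit-normalize. A minimal sketch; the function name, parameter defaults, and noise levels here are illustrative assumptions, not the script's exact values:

```python
import numpy as np

def anisotropic_embeddings(n, dim, n_topics=32, rank=16, seed=0):
    """Illustrative sketch: low-rank anisotropic vectors with Zipf topics."""
    rng = np.random.default_rng(seed)
    rank = min(rank, dim)
    # Orthonormal basis of a low-rank subspace (QR of a Gaussian matrix).
    Q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    # Zipf-like topic frequencies -> sampling probabilities.
    freqs = 1.0 / np.arange(1, n_topics + 1)
    topic_ids = rng.choice(n_topics, size=n, p=freqs / freqs.sum()).astype(np.int32)
    # Topic centroids plus per-point noise, all inside the low-rank subspace.
    centroids = rng.standard_normal((n_topics, rank))
    latent = centroids[topic_ids] + 0.4 * rng.standard_normal((n, rank))
    x = latent @ Q.T                                   # embed into the ambient space
    x += 0.02 * rng.standard_normal((n, dim))          # small isotropic noise
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # unit-normalize (cosine metric)
    return x.astype(np.float32), topic_ids
```

Because nearly all the energy lives in the rank-dimensional subspace, a PCA of the output shows the anisotropy the bullets describe.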

inject_near_duplicates(vectors, frac, noise, seed) -> np.ndarray
Inject near-duplicates to model repeated content: a fraction frac of the
vectors is overwritten with perturbed copies of other vectors, then
re-normalized.

compute_lid_mle(vectors, queries, k) -> np.ndarray
Compute Local Intrinsic Dimensionality via MLE estimator.
LID measures the local dimensionality around each query point.
- Low LID: queries in dense, low-dimensional regions (easy)
- High LID: queries in sparse, high-dimensional regions (hard)

Based on Amsaleg et al. "Estimating Local Intrinsic Dimensionality" (KDD 2015)
g@r�����r�rN绽���|�=g�A�����?)
r'r�sqrt�clip�	partition�sortr$�logrr )	r6rDrE�sims�dists�k_dists�r_max�ratios�lids	         r3�compute_lid_mlerUas����Y�Y��D�
�G�G�C�3�����r�1�!5�5�6�7�E��l�l�5�!�,�Q���1���W�5�G��g�g�g�A�&�q�!�"�u�-�G�
�A�r�s�F�O�e�#�E�
�_�F�
�W�W�V�U�K�
0�F��"�r�v�v�b�f�f�V�n�1�-�
-�C��:�:�b�j�j�!�!r5�train�train_topics�	n_queries�difficulty_dist�generator_paramsc��[RRU5nURSnUS-n[	UUURSS5URSS5URSS5URS	S
5URSS5US
-S9up�[
X	SS9nX�R-n[R"USSS9SS2SS24n
[R"U
SS2S4U
SS2S45n[R"U
SS2S4U
SS2S45nX�-
nX�R5-
UR5UR5-
S--nUUR5-
UR5UR5-
S--nUSU-
-n[R"US5n[R"US5nUU:nUU:�UU:-nUU:�n[X#RSS5-5n[X#RSS5-5nUU-
U-
n"U5Sn[R"U5Sn[R"U5SnUR!U[U[#U55SS9nUR!U[U[#U55SS9n UR!U[U[#U55SS9n![R$"UU U!/5n"[#U"5U:a`[R&"[R("U5U"5n#UR!U#U[#U"5-
SS9n$[R$"U"U$/5n"[R*"U[R,S9n%SU%[#U5[#U5[#U 5-&SU%[#U5[#U 5-S&U	U"n&UU"n'U&U'U%4$)a)Select queries stratified by difficulty (LID-based).

IMPORTANT: Queries must be from the SAME distribution as training data,
not random unit vectors (which would be orthogonal to the data manifold).

Returns: (queries, lid_values, difficulty_labels)
- difficulty_labels: 0=easy, 1=medium, 2=hard
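One way to realize this stratification is to combine a normalized LID with the top1-top2 similarity margin into a single score and cut at the 33rd/66th percentiles; the exact weights and thresholds below are illustrative assumptions, not necessarily the script's:

```python
import numpy as np

def stratify_by_difficulty(lid, margin):
    """Label queries 0=easy, 1=medium, 2=hard from LID and the top1-top2 margin."""
    norm = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    score = norm(lid) + (1.0 - norm(margin))        # high LID + thin margin = hard
    p33, p66 = np.percentile(score, [33, 66])
    labels = np.zeros(len(score), dtype=np.int32)   # easy by default
    labels[score > p33] = 1                         # medium
    labels[score > p66] = 2                         # hard
    return labels, score
```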

simulate_cross_modal_drift(queries, train, drift_strength, seed) -> np.ndarray
Simulate cross-modal distribution shift.

Based on BigANN 2023 OOD track: Text2Image-10M where:
- Database: image embeddings (Se-ResNext-101)
- Queries: text embeddings (DSSM)

We simulate this by:
1. Computing principal subspace of train data
2. Partially rotating queries into an orthogonal subspace
3. Adding controlled noise

This creates queries that are semantically related but in a
different embedding subspace (mimicking cross-modal embeddings).
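The three numbered steps above can be sketched as follows; the sample size, subspace fraction, and noise level are illustrative assumptions:

```python
import numpy as np

def cross_modal_drift(queries, train, strength=0.3, seed=0):
    """Sketch: rotate queries partway out of the data's principal subspace."""
    rng = np.random.default_rng(seed)
    dim = train.shape[1]
    # 1. Principal subspace of (a sample of) the training data.
    sample = train[rng.choice(len(train), min(10_000, len(train)), replace=False)]
    _, _, Vt = np.linalg.svd(sample - sample.mean(axis=0), full_matrices=False)
    top = Vt[: max(1, dim // 4)]                     # top principal directions
    # 2. Rotate the in-subspace component partway toward random directions.
    rand_dirs, _ = np.linalg.qr(rng.standard_normal((dim, top.shape[0])))
    coeff = queries @ top.T                          # coordinates in the subspace
    drifted = queries - strength * (coeff @ top) + strength * (coeff @ rand_dirs.T)
    # 3. Controlled noise, then back onto the unit sphere.
    drifted = drifted + 0.05 * rng.standard_normal(queries.shape)
    return (drifted / np.linalg.norm(drifted, axis=1, keepdims=True)).astype(np.float32)
```

The drifted queries stay correlated with the originals, which is the point: semantically related, but living partly in a different subspace.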

compute_ground_truth(train, test, k) -> np.ndarray
Exact k-NN via brute force.
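On unit vectors, brute-force ground truth is a matrix multiply plus a partial sort; a sketch (the use of argpartition is an implementation choice of this sketch, not necessarily the script's):

```python
import numpy as np

def ground_truth(train, test, k=100):
    """Exact top-k neighbors by inner product (cosine on unit vectors)."""
    sims = test @ train.T                                  # (n_test, n_train)
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]     # unordered top-k ids
    order = np.argsort(-np.take_along_axis(sims, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1).astype(np.int32)
```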

compute_filtered_ground_truth(train, train_labels, test, test_labels, k) -> np.ndarray
Exact k-NN within label-filtered subsets: each query is answered only against
database points that share its label.

compute_difficulty_metrics(train, queries) -> dict
Compute dataset difficulty metrics. Returns relative_contrast_mean and
relative_contrast_median (mean distance over nearest distance), lid_mean,
lid_std, lid_p25, lid_p75, and hubness_skewness (skew of the 10-NN in-degree
distribution).

save_vectors(path, vectors), save_neighbors(path, neighbors),
save_labels(path, labels), save_f32_array(path, arr)
Binary writers. Each file starts with a 4-byte magic ("VEC1", "NBR1", "LBL1",
"F32A"), then a little-endian header (struct "<II" n, dim for vectors and
neighbors; "<I" count for labels and 1-D float arrays), then the raw array
bytes (float32, int32, uint32, and float32 respectively).
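The header layout above makes round-tripping straightforward; a sketch of a matching writer/reader pair for the vector format (the reader is an assumption of this sketch, not part of the script):

```python
import io
import struct
import numpy as np

def write_vectors(buf, vectors):
    """Write float32 vectors: "VEC1" magic, <II (n, dim) header, raw bytes."""
    n, d = vectors.shape
    buf.write(b"VEC1")
    buf.write(struct.pack("<II", n, d))
    buf.write(np.ascontiguousarray(vectors, dtype=np.float32).tobytes())

def read_vectors(buf):
    """Inverse of write_vectors."""
    assert buf.read(4) == b"VEC1", "bad magic"
    n, d = struct.unpack("<II", buf.read(8))
    return np.frombuffer(buf.read(n * d * 4), dtype=np.float32).reshape(n, d)
```

The same pattern covers the NBR1/LBL1/F32A variants with int32, uint32, and float32 payloads.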

SCALES maps "B" to Base (quick validation), "T" to Test (scaling behavior),
and "P" to Prod (production-relevant), each with its n_train, n_test, dim,
and description.

generate_scale(scale, data_dir) -> dict
Generate all datasets for a given scale, in four stages: [1/4] anisotropic
embeddings with injected near-duplicates, LID-stratified queries, ground
truth at k=100, and difficulty metrics; [2/4] the cross-modal drift scenario;
[3/4] the topic-filtered scenario; [4/4] metadata. Files written per scale:
train.bin, test.bin, neighbors.bin, train_topics.bin, test_lids.bin,
test_difficulty.bin, test_drift.bin, neighbors_drift.bin, test_filter.bin,
test_filter_topics.bin, neighbors_filter.bin, and metrics.json (which also
records n_train, n_test, dim, generation time, and total output size).

main()
CLI entry point. --scale {B,T,P,all} selects which scales to generate;
--output sets the output directory (default: data/multiscale). Per-scale
metrics are aggregated into all_metrics.json. A MemoryError aborts with a
hint to retry at a smaller scale or with more RAM.
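With ground truth stored alongside the vectors, recall@k of any index is a set-overlap average; a minimal sketch (the helper name is mine, and an ANN index's results would stand in for approx_ids):

```python
def recall_at_k(approx_ids, true_ids, k=10):
    """Mean fraction of the true top-k recovered by the approximate top-k."""
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(approx_ids, true_ids))
    return hits / (len(true_ids) * k)
```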