{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "97bb4212",
"metadata": {},
"outputs": [],
"source": [
"from seismic import SeismicIndex"
]
},
{
"cell_type": "markdown",
"id": "c912a968",
"metadata": {},
"source": [
"## Indexing"
]
},
{
"cell_type": "markdown",
"id": "ee9a4a91",
"metadata": {},
"source": [
"### Building"
]
},
{
"cell_type": "markdown",
"id": "c3696691",
"metadata": {},
"source": [
"We can build the index either from a jsonl file or a compressed archive .tar.gz containing the jsonl file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2eefe4af",
"metadata": {},
"outputs": [],
"source": [
"json_input_file = \"\"\n",
"compressed_input_file = \"\""
]
},
{
"cell_type": "markdown",
"id": "feb5215e",
"metadata": {},
"source": [
"We can use the default configuration by specifying only the input file or choose each of the parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08e55de8",
"metadata": {},
"outputs": [],
"source": [
"index = SeismicIndex.build(json_input_file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5fb2b50c",
"metadata": {},
"outputs": [],
"source": [
"index = SeismicIndex.build(\n",
" compressed_input_file,\n",
" n_postings=3500,\n",
" centroid_fraction=0.1,\n",
" min_cluster_size=2,\n",
" summary_energy=0.4, \n",
" batched_indexing=10000000)"
]
},
{
"cell_type": "markdown",
"id": "59d4942e",
"metadata": {},
"source": [
"By setting the nknn parameter we can build the knn graph together with the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "80f6f899",
"metadata": {},
"outputs": [],
"source": [
"index = SeismicIndex.build(\n",
" json_input_file,\n",
" n_postings=3500,\n",
" centroid_fraction=0.1,\n",
" min_cluster_size=2,\n",
" summary_energy=0.4,\n",
" nknn=10,\n",
" batched_indexing=10000000)"
]
},
{
"cell_type": "markdown",
"id": "45135b50",
"metadata": {},
"source": [
"While, if we set also the knn_path, we can add to the index a precomputed knn graph.\n",
"In this case, the nknn parameter allow us to add a subset of the knn graph (with less neighbors)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8c5305aa",
"metadata": {},
"outputs": [],
"source": [
"knn_path = \"\"\n",
"\n",
"index = SeismicIndex.build(\n",
" json_input_file,\n",
" n_postings=3500,\n",
" centroid_fraction=0.1,\n",
" min_cluster_size=2,\n",
" summary_energy=0.4,\n",
" knn_path=knn_path,\n",
" nknn=5,\n",
" batched_indexing=10000000)"
]
},
{
"cell_type": "markdown",
"id": "34eecf33",
"metadata": {},
"source": [
"Once the index is constructed, we can serialize and store it in a file.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06de631d",
"metadata": {},
"outputs": [],
"source": [
"index_path = \"\"\n",
"\n",
"index.save(index_path)"
]
},
{
"cell_type": "markdown",
"id": "7f572d70",
"metadata": {},
"source": [
"### Loading"
]
},
{
"cell_type": "markdown",
"id": "a8c20f59",
"metadata": {},
"source": [
"We may load a serialized index to query it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "44532aed",
"metadata": {},
"outputs": [],
"source": [
"index_path = \"\"\n",
"\n",
"index = SeismicIndex.load(index_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00e1c150",
"metadata": {},
"outputs": [],
"source": [
"print(\"Number of documents: \", index.len)\n",
"print(\"Avg number of non-zero components: \", index.nnz / index.len)\n",
"print(\"Dimensionality of the vectors: \", index.dim)\n",
"\n",
"index.print_space_usage_byte()"
]
},
{
"cell_type": "markdown",
"id": "f618f314",
"metadata": {},
"source": [
"### KNN Graph\n",
"\n",
"Given an inverted index, we can build a knn graph and attach to it with the build_knn function.\n",
"It is also possible to serialise the graph and then link it to another index with the load_knn function."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "946322aa",
"metadata": {},
"outputs": [],
"source": [
"nknn=10\n",
"index.build_knn(nknn)\n",
"\n",
"knn_path = \"\"\n",
"\n",
"index.save_knn(knn_path)"
]
},
{
"cell_type": "markdown",
"id": "8a0034c7",
"metadata": {},
"source": [
"When adding the knn graph we can specify a subset of the neighbours we want for each entry of the index or load the full knn graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f93fb6c",
"metadata": {},
"outputs": [],
"source": [
"index_path = \"\"\n",
"knn_path = \"\"\n",
"\n",
"#load full knn graph\n",
"index.load_knn2(knn_path)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f27085b7",
"metadata": {},
"outputs": [],
"source": [
"nknn = 5\n",
"#load partial graph\n",
"index.load_knn2(knn_path, nknn)"
]
},
{
"cell_type": "markdown",
"id": "c73a4d8a",
"metadata": {},
"source": [
"### Perform the search"
]
},
{
"cell_type": "markdown",
"id": "93ed4c1a",
"metadata": {},
"source": [
"Prepare the data to perform the search"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "46e2f0d9",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import json\n",
"\n",
"file_path = \"\"\n",
"\n",
"queries = []\n",
"with open(file_path, 'r') as f:\n",
" for line in f:\n",
" queries.append(json.loads(line))\n",
"\n",
"MAX_TOKEN_LEN = 30\n",
"string_type = f'U{MAX_TOKEN_LEN}'\n",
"\n",
"queries_ids = np.array([q['id'] for q in queries], dtype=string_type)\n",
"\n",
"query_components = []\n",
"query_values = []\n",
"\n",
"for query in queries:\n",
" vector = query['vector']\n",
" query_components.append(np.array(list(vector.keys()), dtype=string_type))\n",
" query_values.append(np.array(list(vector.values()), dtype=np.float32))"
]
},
{
"cell_type": "markdown",
"id": "b8286c10",
"metadata": {},
"source": [
"We can ran a single search or a parallel batch search"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f68dc58",
"metadata": {},
"outputs": [],
"source": [
"results = index.search(\n",
" query_id=str(queries_ids[0]),\n",
" query_components=query_components[0],\n",
" query_values=query_values[0],\n",
" k=10,\n",
" query_cut=20,\n",
" heap_factor=0.7,\n",
" n_knn=0,\n",
" sorted=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5682de5f",
"metadata": {},
"outputs": [],
"source": [
"\n",
"results = index.batch_search(\n",
" queries_ids=queries_ids,\n",
" query_components=query_components,\n",
" query_values=query_values,\n",
" k=10,\n",
" query_cut=20,\n",
" heap_factor=0.7,\n",
" n_knn=0,\n",
" sorted=True,\n",
" num_threads=1,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8c6b6cf4",
"metadata": {},
"source": [
"## Evaluation of results\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "4216faaa",
"metadata": {},
"source": [
"Evaulation of the results with the ir_measure package for the choosen dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90784514",
"metadata": {},
"outputs": [],
"source": [
"import ir_measures\n",
"import ir_datasets\n",
"\n",
"ir_results = [ir_measures.ScoredDoc(query_id, doc_id, score) for r in results for (query_id, score, doc_id) in r]\n",
"qrels = ir_datasets.load('msmarco-passage/dev/small').qrels\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c0aa5cd",
"metadata": {},
"outputs": [],
"source": [
"from ir_measures import *\n",
"\n",
"ir_measures.calc_aggregate([RR@10], qrels, ir_results)"
]
},
{
"cell_type": "markdown",
"id": "a3a67154",
"metadata": {},
"source": [
"# Raw Seismic Index\n"
]
},
{
"cell_type": "markdown",
"id": "d7afb44c",
"metadata": {},
"source": [
"Input file in Seismic inner format (this means that we have to provide a method to produce documents and queries in the seismic inner format)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d2e63df",
"metadata": {},
"outputs": [],
"source": [
"from seismic import SeismicIndexRaw"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2f21e7b",
"metadata": {},
"outputs": [],
"source": [
"input_path = \"\"\n",
"\n",
"index = SeismicIndexRaw.build(input_path)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b3d2654",
"metadata": {},
"outputs": [],
"source": [
"queries_path=\"\"\n",
"\n",
"query_path=\"\"\n",
"\n",
"\n",
"results = index.batch_search(\n",
" query_path,\n",
" k=10,\n",
" query_cut=3,\n",
" heap_factor=0.9,\n",
" n_knn=0,\n",
" sorted=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}