seismic 0.2.1 - Docs.rs

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97bb4212",
   "metadata": {},
   "outputs": [],
   "source": [
    "from seismic import SeismicIndex"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c912a968",
   "metadata": {},
   "source": [
    "## Indexing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee9a4a91",
   "metadata": {},
   "source": [
    "### Building"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3696691",
   "metadata": {},
   "source": [
    "We can build the index either from a jsonl file or a compressed archive .tar.gz containing the jsonl file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2eefe4af",
   "metadata": {},
   "outputs": [],
   "source": [
    "json_input_file = \"\"\n",
    "compressed_input_file = \"\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "feb5215e",
   "metadata": {},
   "source": [
    "We can use the default configuration by specifying only the input file or choose each of the parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08e55de8",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SeismicIndex.build(json_input_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fb2b50c",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SeismicIndex.build(\n",
    "    compressed_input_file,\n",
    "    n_postings=3500,\n",
    "    centroid_fraction=0.1,\n",
    "    min_cluster_size=2,\n",
    "    summary_energy=0.4, \n",
    "    batched_indexing=10000000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59d4942e",
   "metadata": {},
   "source": [
    "By setting the nknn parameter we can build the knn graph together with the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80f6f899",
   "metadata": {},
   "outputs": [],
   "source": [
    "index = SeismicIndex.build(\n",
    "    json_input_file,\n",
    "    n_postings=3500,\n",
    "    centroid_fraction=0.1,\n",
    "    min_cluster_size=2,\n",
    "    summary_energy=0.4,\n",
    "    nknn=10,\n",
    "    batched_indexing=10000000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45135b50",
   "metadata": {},
   "source": [
    "While, if we set also the knn_path, we can add to the index a precomputed knn graph.\n",
    "In this case, the nknn parameter allow us to add a subset of the knn graph (with less neighbors)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c5305aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "knn_path = \"\"\n",
    "\n",
    "index = SeismicIndex.build(\n",
    "    json_input_file,\n",
    "    n_postings=3500,\n",
    "    centroid_fraction=0.1,\n",
    "    min_cluster_size=2,\n",
    "    summary_energy=0.4,\n",
    "    knn_path=knn_path,\n",
    "    nknn=5,\n",
    "    batched_indexing=10000000)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34eecf33",
   "metadata": {},
   "source": [
    "Once the index is constructed, we can serialize and store it in a file.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06de631d",
   "metadata": {},
   "outputs": [],
   "source": [
    "index_path = \"\"\n",
    "\n",
    "index.save(index_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f572d70",
   "metadata": {},
   "source": [
    "### Loading"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8c20f59",
   "metadata": {},
   "source": [
    "We may load a serialized index to query it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44532aed",
   "metadata": {},
   "outputs": [],
   "source": [
    "index_path = \"\"\n",
    "\n",
    "index = SeismicIndex.load(index_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00e1c150",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Number of documents: \", index.len)\n",
    "print(\"Avg number of non-zero components: \", index.nnz / index.len)\n",
    "print(\"Dimensionality of the vectors: \", index.dim)\n",
    "\n",
    "index.print_space_usage_byte()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f618f314",
   "metadata": {},
   "source": [
    "### KNN Graph\n",
    "\n",
    "Given an inverted index, we can build a knn graph and attach to it with the build_knn function.\n",
    "It is also possible to serialise the graph and then link it to another index with the load_knn function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "946322aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "nknn=10\n",
    "index.build_knn(nknn)\n",
    "\n",
    "knn_path = \"\"\n",
    "\n",
    "index.save_knn(knn_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a0034c7",
   "metadata": {},
   "source": [
    "When adding the knn graph we can specify a subset of the neighbours we want for each entry of the index or load the full knn graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f93fb6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "index_path = \"\"\n",
    "knn_path = \"\"\n",
    "\n",
    "#load full knn graph\n",
    "index.load_knn2(knn_path)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f27085b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "nknn = 5\n",
    "#load partial graph\n",
    "index.load_knn2(knn_path, nknn)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c73a4d8a",
   "metadata": {},
   "source": [
    "### Perform the search"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93ed4c1a",
   "metadata": {},
   "source": [
    "Prepare the data to perform the search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "46e2f0d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import json\n",
    "\n",
    "file_path = \"\"\n",
    "\n",
    "queries = []\n",
    "with open(file_path, 'r') as f:\n",
    "    for line in f:\n",
    "        queries.append(json.loads(line))\n",
    "\n",
    "MAX_TOKEN_LEN = 30\n",
    "string_type  = f'U{MAX_TOKEN_LEN}'\n",
    "\n",
    "queries_ids = np.array([q['id'] for q in queries], dtype=string_type)\n",
    "\n",
    "query_components = []\n",
    "query_values = []\n",
    "\n",
    "for query in queries:\n",
    "    vector = query['vector']\n",
    "    query_components.append(np.array(list(vector.keys()), dtype=string_type))\n",
    "    query_values.append(np.array(list(vector.values()), dtype=np.float32))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8286c10",
   "metadata": {},
   "source": [
    "We can ran a single search or a parallel batch search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f68dc58",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = index.search(\n",
    "    query_id=str(queries_ids[0]),\n",
    "    query_components=query_components[0],\n",
    "    query_values=query_values[0],\n",
    "    k=10,\n",
    "    query_cut=20,\n",
    "    heap_factor=0.7,\n",
    "    n_knn=0,\n",
    "    sorted=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5682de5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "results = index.batch_search(\n",
    "    queries_ids=queries_ids,\n",
    "    query_components=query_components,\n",
    "    query_values=query_values,\n",
    "    k=10,\n",
    "    query_cut=20,\n",
    "    heap_factor=0.7,\n",
    "    n_knn=0,\n",
    "    sorted=True,\n",
    "    num_threads=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c6b6cf4",
   "metadata": {},
   "source": [
    "## Evaluation of results\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4216faaa",
   "metadata": {},
   "source": [
    "Evaulation of the results with the ir_measure package for the choosen dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90784514",
   "metadata": {},
   "outputs": [],
   "source": [
    "import ir_measures\n",
    "import ir_datasets\n",
    "\n",
    "ir_results = [ir_measures.ScoredDoc(query_id, doc_id, score) for r in results for (query_id, score, doc_id) in r]\n",
    "qrels = ir_datasets.load('msmarco-passage/dev/small').qrels\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c0aa5cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from ir_measures import *\n",
    "\n",
    "ir_measures.calc_aggregate([RR@10], qrels, ir_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3a67154",
   "metadata": {},
   "source": [
    "# Raw Seismic Index\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7afb44c",
   "metadata": {},
   "source": [
    "Input file in Seismic inner format (this means that we have to provide a method to produce documents and queries in the seismic inner format)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d2e63df",
   "metadata": {},
   "outputs": [],
   "source": [
    "from seismic import SeismicIndexRaw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2f21e7b",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_path = \"\"\n",
    "\n",
    "index = SeismicIndexRaw.build(input_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b3d2654",
   "metadata": {},
   "outputs": [],
   "source": [
    "queries_path=\"\"\n",
    "\n",
    "query_path=\"\"\n",
    "\n",
    "\n",
    "results = index.batch_search(\n",
    "    query_path,\n",
    "    k=10,\n",
    "    query_cut=3,\n",
    "    heap_factor=0.9,\n",
    "    n_knn=0,\n",
    "    sorted=True)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}