119 | 119 | "\n",
120 | 120 | "You'll start creating a hybrid pipeline by initializing a DocumentStore and preprocessing documents before storing them in the DocumentStore.\n",
121 | 121 | "\n",
122 |     | - "You will use the PubMed Abstracts as Documents. There are a lot of datasets from PubMed on Hugging Face Hub; you will use [ywchoi/pubmed_abstract_3](https://huggingface.co/datasets/ywchoi/pubmed_abstract_3/viewer/default/test) in this tutorial.\n",
    | 122 | + "You will use the PubMed Abstracts as Documents. There are many PubMed datasets on the Hugging Face Hub; you will use [anakin87/medrag-pubmed-chunk](https://huggingface.co/datasets/anakin87/medrag-pubmed-chunk) in this tutorial.\n",
123 | 123 | "\n",
124 | 124 | "Initialize `InMemoryDocumentStore` and don't forget to set `use_bm25=True` and the dimension of your embeddings in `embedding_dim`:"
125 | 125 | ]

135 | 135 | "from datasets import load_dataset\n",
136 | 136 | "from haystack.document_stores import InMemoryDocumentStore\n",
137 | 137 | "\n",
138 |     | - "dataset = load_dataset(\"ywchoi/pubmed_abstract_3\", split=\"test\")\n",
    | 138 | + "dataset = load_dataset(\"anakin87/medrag-pubmed-chunk\", split=\"train\")\n",
139 | 139 | "\n",
140 | 140 | "document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)"
141 | 141 | ]
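
A note on `embedding_dim=384`: the store's embedding dimension must match the output size of the dense retriever's model, which is configured in lines not shown in this diff. Below is a minimal sketch of how to check a candidate model's dimension; the model name is an illustrative assumption, not necessarily the one the tutorial uses:

```python
# Sketch: confirm a dense model's embedding size before setting embedding_dim.
# Assumption: a 384-dim sentence-transformers model; the tutorial's actual
# model is configured in lines elided from this diff.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384 -> use embedding_dim=384
```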

150 |     | - "The data has 3 features:\n",
    | 150 | + "The data has 4 features:\n",
151 |     | - "* *pmid*\n",
    | 151 | + "* *id*: the PubMed ID (pmid)\n",
152 | 152 | "* *title*\n",
153 |     | - "* *text*\n",
    | 153 | + "* *content*: the abstract\n",
    | 154 | + "* *contents*: abstract + title\n",
154 | 155 | "\n",
155 |     | - "Concatenate *title* and *text* to embed and search both. The single features will be stored as metadata, and you will use them to have a **pretty print** of the search results.\n"
    | 156 | + "For searching, you will use the *contents* feature. The other features will be stored as metadata, and you will use them to **pretty print** the search results.\n"
156 | 157 | ]
157 | 158 | },
158 | 159 | {

165 | 166 | "source": [
166 | 167 | "from haystack.schema import Document\n",
167 | 168 | "\n",
168 |     | - "documents = []\n",
    | 169 | + "docs = []\n",
169 | 170 | "for doc in dataset:\n",
170 |     | - "    documents.append(\n",
171 |     | - "        Document(\n",
172 |     | - "            content=doc[\"title\"] + \" \" + doc[\"text\"],\n",
173 |     | - "            meta={\"title\": doc[\"title\"], \"abstract\": doc[\"text\"], \"pmid\": doc[\"pmid\"]},\n",
174 |     | - "        )\n",
    | 171 | + "    docs.append(\n",
    | 172 | + "        Document(content=doc[\"contents\"], meta={\"title\": doc[\"title\"], \"abstract\": doc[\"content\"], \"pmid\": doc[\"id\"]})\n",
175 | 173 | "    )"
176 | 174 | ]
177 | 175 | },
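
The `preprocessor` applied in the next hunk is configured in lines elided from this diff. For orientation, a minimal Haystack v1 `PreProcessor` setup might look like the sketch below; every parameter value here is an assumption, not the tutorial's actual configuration:

```python
from haystack.nodes import PreProcessor

# Illustrative configuration only; the real settings live in lines not shown
# in this diff. Splitting keeps long abstracts within typical model input limits.
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=512,
    split_overlap=32,
    split_respect_sentence_boundary=True,
)
```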

216 | 214 | },
217 | 215 | "outputs": [],
218 | 216 | "source": [
219 |     | - "docs_to_index = preprocessor.process(documents)"
    | 217 | + "docs_to_index = preprocessor.process(docs)"
220 | 218 | ]
221 | 219 | },
222 | 220 | {
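
After preprocessing, the documents still need to be written to the store; that call sits in lines elided from this diff. With the Haystack v1 API it is typically a single call, sketched here:

```python
# Standard Haystack v1 indexing call for the preprocessed documents.
# Dense embeddings are computed later via document_store.update_embeddings(...)
# once the dense retriever exists.
document_store.write_documents(docs_to_index)
```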

381 | 379 | "outputs": [],
382 | 380 | "source": [
383 | 381 | "prediction = pipeline.run(\n",
384 |     | - "    query=\"treatment for HIV\",\n",
    | 382 | + "    query=\"apnea in infants\",\n",
385 | 383 | "    params={\n",
386 | 384 | "        \"SparseRetriever\": {\"top_k\": 10},\n",
387 | 385 | "        \"DenseRetriever\": {\"top_k\": 10},\n",
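
The metadata attached to each `Document` earlier enables the "pretty print" the tutorial mentions. A minimal sketch, assuming the standard Haystack v1 output format where `prediction["documents"]` holds the ranked `Document` objects (the function name is illustrative):

```python
# Sketch: render results using the title/abstract/pmid metadata stored above.
def pretty_print_results(prediction):
    for doc in prediction["documents"]:
        print(f"{doc.meta['title']} (pmid: {doc.meta['pmid']}) score: {doc.score}")
        print(doc.meta["abstract"])
        print()

pretty_print_results(prediction)
```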