119 | 119 | "\n",
120 | 120 | "You'll start creating a hybrid pipeline by initializing a DocumentStore and preprocessing documents before storing them in the DocumentStore.\n",
121 | 121 | "\n",
122 |     | - "You will use the PubMed Abstracts as Documents. There are a lot of datasets from PubMed on Hugging Face Hub; you will use [ywchoi/pubmed_abstract_3](https://huggingface.co/datasets/ywchoi/pubmed_abstract_3/viewer/default/test) in this tutorial.\n",
    | 122 | + "You will use the PubMed Abstracts as Documents. There are many PubMed datasets on the Hugging Face Hub; you will use [anakin87/medrag-pubmed-chunk](https://huggingface.co/datasets/anakin87/medrag-pubmed-chunk) in this tutorial.\n",
123 | 123 | "\n",
124 | 124 | "Initialize `InMemoryDocumentStore` and don't forget to set `use_bm25=True` and the dimension of your embeddings in `embedding_dim`:"
125 | 125 | ]

135 | 135 | "from datasets import load_dataset\n",
136 | 136 | "from haystack.document_stores import InMemoryDocumentStore\n",
137 | 137 | "\n",
138 |     | - "dataset = load_dataset(\"ywchoi/pubmed_abstract_3\", split=\"test\")\n",
    | 138 | + "dataset = load_dataset(\"anakin87/medrag-pubmed-chunk\", split=\"train\")\n",
139 | 139 | "\n",
140 | 140 | "document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)"
141 | 141 | ]
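
A note on `embedding_dim=384`: the store's embedding dimension must match the output size of the dense retriever's model, which is configured in lines not shown in this diff. Below is a minimal sketch of how to check a candidate model's dimension; the model name is an illustrative assumption, not necessarily the one the tutorial uses:

```python
# Sketch: confirm a dense model's embedding size before setting embedding_dim.
# Assumption: a 384-dim sentence-transformers model; the tutorial's actual
# model is configured in lines elided from this diff.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384 -> use embedding_dim=384
```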

150 |     | - "The data has 3 features:\n",
    | 150 | + "The data has 4 features:\n",
151 |     | - "* *pmid*\n",
    | 151 | + "* *id*: the PubMed ID (pmid)\n",
152 | 152 | "* *title*\n",
153 |     | - "* *text*\n",
    | 153 | + "* *content*: the abstract\n",
    | 154 | + "* *contents*: abstract + title\n",
154 | 155 | "\n",
155 |     | - "Concatenate *title* and *text* to embed and search both. The single features will be stored as metadata, and you will use them to have a **pretty print** of the search results.\n"
    | 156 | + "For searching, you will use the *contents* feature. The other features will be stored as metadata, and you will use them to **pretty print** the search results.\n"
156 | 157 | ]
157 | 158 | },
158 | 159 | {

165 | 166 | "source": [
166 | 167 | "from haystack.schema import Document\n",
167 | 168 | "\n",
168 |     | - "documents = []\n",
    | 169 | + "docs = []\n",
169 | 170 | "for doc in dataset:\n",
170 |     | - "    documents.append(\n",
171 |     | - "        Document(\n",
172 |     | - "            content=doc[\"title\"] + \" \" + doc[\"text\"],\n",
173 |     | - "            meta={\"title\": doc[\"title\"], \"abstract\": doc[\"text\"], \"pmid\": doc[\"pmid\"]},\n",
174 |     | - "        )\n",
    | 171 | + "    docs.append(\n",
    | 172 | + "        Document(content=doc[\"contents\"], meta={\"title\": doc[\"title\"], \"abstract\": doc[\"content\"], \"pmid\": doc[\"id\"]})\n",
175 | 173 | "    )"
176 | 174 | ]
177 | 175 | },
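
The `preprocessor` applied in the next hunk is configured in lines elided from this diff. For orientation, a minimal Haystack v1 `PreProcessor` setup might look like the sketch below; every parameter value here is an assumption, not the tutorial's actual configuration:

```python
from haystack.nodes import PreProcessor

# Illustrative configuration only; the real settings live in lines not shown
# in this diff. Splitting keeps long abstracts within typical model input limits.
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    split_by="word",
    split_length=512,
    split_overlap=32,
    split_respect_sentence_boundary=True,
)
```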

216 | 214 | },
217 | 215 | "outputs": [],
218 | 216 | "source": [
219 |     | - "docs_to_index = preprocessor.process(documents)"
    | 217 | + "docs_to_index = preprocessor.process(docs)"
220 | 218 | ]
221 | 219 | },
222 | 220 | {
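
After preprocessing, the documents still need to be written to the store; that call sits in lines elided from this diff. With the Haystack v1 API it is typically a single call, sketched here:

```python
# Standard Haystack v1 indexing call for the preprocessed documents.
# Dense embeddings are computed later via document_store.update_embeddings(...)
# once the dense retriever exists.
document_store.write_documents(docs_to_index)
```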

381 | 379 | "outputs": [],
382 | 380 | "source": [
383 | 381 | "prediction = pipeline.run(\n",
384 |     | - "    query=\"treatment for HIV\",\n",
    | 382 | + "    query=\"apnea in infants\",\n",
385 | 383 | "    params={\n",
386 | 384 | "        \"SparseRetriever\": {\"top_k\": 10},\n",
387 | 385 | "        \"DenseRetriever\": {\"top_k\": 10},\n",
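
The metadata attached to each `Document` earlier enables the "pretty print" the tutorial mentions. A minimal sketch, assuming the standard Haystack v1 output format where `prediction["documents"]` holds the ranked `Document` objects (the function name is illustrative):

```python
# Sketch: render results using the title/abstract/pmid metadata stored above.
def pretty_print_results(prediction):
    for doc in prediction["documents"]:
        print(f"{doc.meta['title']} (pmid: {doc.meta['pmid']}) score: {doc.score}")
        print(doc.meta["abstract"])
        print()

pretty_print_results(prediction)
```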