Commit 4978600

update hybrid retrieval tutorials (deepset-ai#299)
1 parent e459f3d commit 4978600

2 files changed: +18 -33 lines changed

tutorials/26_Hybrid_Retrieval.ipynb

Lines changed: 10 additions & 12 deletions
@@ -119,7 +119,7 @@
 "\n",
 "You'll start creating a hybrid pipeline by initializing a DocumentStore and preprocessing documents before storing them in the DocumentStore.\n",
 "\n",
-"You will use the PubMed Abstracts as Documents. There are a lot of datasets from PubMed on Hugging Face Hub; you will use [ywchoi/pubmed_abstract_3](https://huggingface.co/datasets/ywchoi/pubmed_abstract_3/viewer/default/test) in this tutorial.\n",
+"You will use the PubMed Abstracts as Documents. There are a lot of datasets from PubMed on Hugging Face Hub; you will use [anakin87/medrag-pubmed-chunk](https://huggingface.co/datasets/anakin87/medrag-pubmed-chunk) in this tutorial.\n",
 "\n",
 "Initialize `InMemoryDocumentStore` and don't forget to set `use_bm25=True` and the dimension of your embeddings in `embedding_dim`:"
 ]
@@ -135,7 +135,7 @@
 "from datasets import load_dataset\n",
 "from haystack.document_stores import InMemoryDocumentStore\n",
 "\n",
-"dataset = load_dataset(\"ywchoi/pubmed_abstract_3\", split=\"test\")\n",
+"dataset = load_dataset(\"anakin87/medrag-pubmed-chunk\", split=\"train\")\n",
 "\n",
 "document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)"
 ]
@@ -150,9 +150,10 @@
 "The data has 3 features:\n",
 "* *pmid*\n",
 "* *title*\n",
-"* *text*\n",
+"* *content*: the abstract\n",
+"* *contents*: abstract + title\n",
 "\n",
-"Concatenate *title* and *text* to embed and search both. The single features will be stored as metadata, and you will use them to have a **pretty print** of the search results.\n"
+"For searching, you will use the *contents* feature. The other features will be stored as metadata, and you will use them to have a **pretty print** of the search results.\n"
 ]
 },
 {
@@ -165,13 +166,10 @@
 "source": [
 "from haystack.schema import Document\n",
 "\n",
-"documents = []\n",
+"docs = []\n",
 "for doc in dataset:\n",
-"    documents.append(\n",
-"        Document(\n",
-"            content=doc[\"title\"] + \" \" + doc[\"text\"],\n",
-"            meta={\"title\": doc[\"title\"], \"abstract\": doc[\"text\"], \"pmid\": doc[\"pmid\"]},\n",
-"        )\n",
+"    docs.append(\n",
+"        Document(content=doc[\"contents\"], meta={\"title\": doc[\"title\"], \"abstract\": doc[\"content\"], \"pmid\": doc[\"id\"]})\n",
 "    )"
 ]
 },
@@ -216,7 +214,7 @@
 },
 "outputs": [],
 "source": [
-"docs_to_index = preprocessor.process(documents)"
+"docs_to_index = preprocessor.process(docs)"
 ]
 },
 {
@@ -381,7 +379,7 @@
 "outputs": [],
 "source": [
 "prediction = pipeline.run(\n",
-"    query=\"treatment for HIV\",\n",
+"    query=\"apnea in infants\",\n",
 "    params={\n",
 "        \"SparseRetriever\": {\"top_k\": 10},\n",
 "        \"DenseRetriever\": {\"top_k\": 10},\n",

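The hunks above switch the notebook to a dataset that already ships a combined *contents* field, so the loop no longer concatenates title and abstract itself. What follows is a minimal sketch of that mapping, assuming each record is a plain dict with the keys the new loop reads (`id`, `title`, `content`, `contents`). `SimpleDocument` and `record_to_document` are hypothetical stand-ins (not Haystack's `Document`), and the sample record is made up, so the snippet runs without Haystack or the dataset.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: searchable content plus metadata."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def record_to_document(record: Dict[str, Any]) -> SimpleDocument:
    """Mirror the field mapping used by the updated loop in the diff."""
    return SimpleDocument(
        content=record["contents"],  # precomputed title + abstract
        meta={
            "title": record["title"],
            "abstract": record["content"],
            "pmid": record["id"],  # the record key is "id"; it is stored as "pmid"
        },
    )


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "id": "0000001",
    "title": "Example title.",
    "content": "Example abstract.",
    "contents": "Example title. Example abstract.",
}

doc = record_to_document(sample)
print(doc.content)
print(doc.meta["pmid"])
```

Note that the markdown cell lists the identifier as *pmid*, while the loop reads `doc["id"]` and stores it under the `pmid` metadata key.
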
tutorials/33_Hybrid_Retrieval.ipynb

Lines changed: 8 additions & 21 deletions
@@ -134,26 +134,16 @@
 "source": [
 "## Fetching and Processing Documents\n",
 "\n",
-"As Documents, you will use the PubMed Abstracts. There are a lot of datasets from PubMed on Hugging Face Hub; you will use [ywchoi/pubmed_abstract_3](https://huggingface.co/datasets/ywchoi/pubmed_abstract_3/viewer/default/test) in this tutorial.\n",
+"As Documents, you will use the PubMed Abstracts. There are a lot of datasets from PubMed on Hugging Face Hub; you will use [anakin87/medrag-pubmed-chunk](https://huggingface.co/datasets/anakin87/medrag-pubmed-chunk) in this tutorial.\n",
 "\n",
 "Then, you will create Documents from the dataset with a simple for loop.\n",
-"Each data point in the PubMed dataset has 3 features:\n",
+"Each data point in the PubMed dataset has 4 features:\n",
 "* *pmid*\n",
 "* *title*\n",
-"* *text*\n",
+"* *content*: the abstract\n",
+"* *contents*: abstract + title\n",
 "\n",
-"Concatenate *title* and *text* before creating the Document content to make sure that titles of PubMed abstracts are searchable.\n",
-"\n",
-"Other features of articles will be stored as `meta`, and you can then use this info to have a **pretty print** of the search results or for [metadata filtering](https://docs.haystack.deepset.ai/v2.0/docs/metadata-filtering)."
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {
-"id": "JcMIAXulPSU3"
-},
-"source": [
-"> This step might take ~2 min depending on your internet speed 🏎️"
+"For searching, you will use the *contents* feature. The other features will be stored as metadata, and you will use them to have a **pretty print** of the search results or for [metadata filtering](https://docs.haystack.deepset.ai/v2.0/docs/metadata-filtering)."
 ]
 },
 {
@@ -167,15 +157,12 @@
 "from datasets import load_dataset\n",
 "from haystack import Document\n",
 "\n",
-"dataset = load_dataset(\"ywchoi/pubmed_abstract_3\", split=\"test\")\n",
+"dataset = load_dataset(\"anakin87/medrag-pubmed-chunk\", split=\"train\")\n",
 "\n",
 "docs = []\n",
 "for doc in dataset:\n",
 "    docs.append(\n",
-"        Document(\n",
-"            content=doc[\"title\"] + \" \" + doc[\"text\"],\n",
-"            meta={\"title\": doc[\"title\"], \"abstract\": doc[\"text\"], \"pmid\": doc[\"pmid\"]},\n",
-"        )\n",
+"        Document(content=doc[\"contents\"], meta={\"title\": doc[\"title\"], \"abstract\": doc[\"content\"], \"pmid\": doc[\"id\"]})\n",
 "    )"
 ]
 },
@@ -457,7 +444,7 @@
 }
 ],
 "source": [
-"query = \"treatment for HIV\"\n",
+"query = \"apnea in infants\"\n",
 "\n",
 "result = hybrid_retrieval.run(\n",
 "    {\"text_embedder\": {\"text\": query}, \"bm25_retriever\": {\"query\": query}, \"ranker\": {\"query\": query}}\n",

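After this commit, the loops in both notebooks read the keys `id`, `title`, `content`, and `contents` from every record, where the old dataset exposed `pmid`, `title`, and `text`. A small guard, sketched here under the assumption that each record behaves like a dict, could fail fast if a dataset is swapped again and a field is renamed. `REQUIRED_KEYS` and `missing_keys` are hypothetical helpers for illustration, not part of Haystack or `datasets`, and both sample records are made up.

```python
from typing import Any, List, Mapping

# Keys the updated loops read from every record (taken from the diff above).
REQUIRED_KEYS = ("id", "title", "content", "contents")


def missing_keys(record: Mapping[str, Any]) -> List[str]:
    """Return the required keys that are absent from a record, in a stable order."""
    return [key for key in REQUIRED_KEYS if key not in record]


# Made-up records shaped like the new and the old field layout.
new_style = {"id": "1", "title": "T", "content": "A", "contents": "T A"}
old_style = {"pmid": 1, "title": "T", "text": "A"}

print(missing_keys(new_style))  # []
print(missing_keys(old_style))  # ['id', 'content', 'contents']
```

Running such a check on the first record before the loop is cheap compared with discovering a `KeyError` partway through building the Documents.
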