Paper retriever

labridge.func_modules.paper.retrieve.paper_retriever

labridge.func_modules.paper.retrieve.paper_retriever.PaperRetriever

We use a hybrid, multi-level retrieval method.

In the first step, the retriever queries the vector index and the summary index to obtain candidate papers. These two index storages are constructed in the class PaperStorage; refer to its docstring for details.

  • In the vector index, the paper contents (excluding references) are chunked and embedded. The retriever gets the vector_similarity_top_k most relevant text chunks from the vector index, and we collect their ref_doc_id.
  • In the summary index, each paper is summarized; both the summary text and the paper chunks are stored. The retriever searches the summary texts to get the summary_similarity_top_k most relevant document summaries. Similarly, we collect their doc_id.

The first step yields several candidate papers. Subsequently, we use the PaperSummaryLLMPostSelector to rank these papers by the relevance between their summaries and the query; the relevance scores are given by the LLM. Among these papers, the LLM selects the docs_top_k most relevant ones.

Finally, we conduct a secondary retrieval among the text chunks of the selected papers. Note that, in this step, we hide all metadata of these nodes from the LLM and the embedding model for the sake of fine-grained retrieval. In the end, we obtain re_retrieve_top_k text chunks.

If final_use_context is set to True, the prev_node and next_node of each retrieved node will be added. If final_use_summary is set to True, the summary node corresponding to each node's doc will be added.

PARAMETER DESCRIPTION
llm

the employed LLM.

TYPE: LLM

paper_vector_retriever

the retriever based on the VectorIndex in paper storage.

TYPE: VectorIndexRetriever

paper_summary_retriever

the retriever based on the DocumentSummaryIndex in the paper storage.

TYPE: DocumentSummaryIndexEmbeddingRetriever

docs_top_k

the number of most relevant docs selected in the second retrieval step.

TYPE: int DEFAULT: 2

re_retrieve_top_k

the number of finally retrieved nodes.

TYPE: int DEFAULT: 5

final_use_context

Whether to add the context nodes of each final node.

TYPE: bool DEFAULT: True

final_use_summary

Whether to add the summary node of each final node's doc.

TYPE: bool DEFAULT: True
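
The following is a minimal usage sketch, not part of the source. It assumes the default persist directories already contain an indexed paper storage (see from_storage below) and that an LLM and embedding model are configured via Settings; the query strings are invented.

from labridge.func_modules.paper.retrieve.paper_retriever import PaperRetriever

# Build the retriever from the on-disk vector and summary storages.
retriever = PaperRetriever.from_storage()

# Stage 1: candidate papers from the vector and summary indices, re-ranked by the LLM.
# Stage 2: secondary retrieval among the chunks of the selected papers.
nodes = retriever.retrieve("diffusion models for protein structure prediction")

for node_with_score in nodes:
	print(node_with_score.score, node_with_score.node.get_content()[:80])

# The source papers of the returned nodes, one PaperInfo per distinct paper.
for paper in retriever.get_ref_info():
	print(paper.title, paper.possessor, paper.file_path)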

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
class PaperRetriever:
	r"""
	We use a hybrid, multi-level retrieval method.

	In the first step, the retriever queries the vector index and the summary index to obtain candidate papers.
	These two index storages are constructed in the class `PaperStorage`; refer to its docstring for details.

	- In the vector index, the paper contents (excluding references) are chunked and embedded. The retriever gets the
	`vector_similarity_top_k` most relevant text chunks from the vector index, and we collect their `ref_doc_id`.
	- In the summary index, each paper is summarized; both the summary text and the paper chunks are stored.
	The retriever searches the summary texts to get the `summary_similarity_top_k` most relevant document summaries.
	Similarly, we collect their `doc_id`.

	The first step yields several candidate papers. Subsequently, we use the `PaperSummaryLLMPostSelector`
	to rank these papers by the relevance between their summaries and the query; the relevance scores are
	given by the LLM. Among these papers, the LLM selects the `docs_top_k` most relevant ones.

	Finally, we conduct a secondary retrieval among the text chunks of the selected papers.
	Note that, in this step, we hide all metadata of these nodes from the LLM and the embedding model for the sake of
	fine-grained retrieval. In the end, we obtain `re_retrieve_top_k` text chunks.

	If `final_use_context` is set to True, the prev_node and next_node of each retrieved node will be added.
	If `final_use_summary` is set to True, the summary node corresponding to each node's doc will be added.

	Args:
		llm (LLM): the employed LLM.
		paper_vector_retriever (VectorIndexRetriever): the retriever based on the VectorIndex in paper storage.
		paper_summary_retriever (DocumentSummaryIndexEmbeddingRetriever):
			the retriever based on the DocumentSummaryIndex in the paper storage.
		docs_top_k (int): the number of most relevant docs selected in the second retrieval step.
		re_retrieve_top_k (int): the number of finally retrieved nodes.
		final_use_context (bool): Whether to add the context nodes of each final node.
		final_use_summary (bool): Whether to add the summary node of each final node's doc.
	"""
	def __init__(
		self,
		llm: LLM,
		paper_vector_retriever: VectorIndexRetriever,
		paper_summary_retriever: DocumentSummaryIndexEmbeddingRetriever,
		docs_top_k: int = 2,
		re_retrieve_top_k: int = 5,
		final_use_context: bool = True,
		final_use_summary: bool = True
	):
		self.paper_vector_retriever = paper_vector_retriever
		self.paper_summary_retriever = paper_summary_retriever
		self.paper_summary_post_selector = PaperSummaryLLMPostSelector(
			summary_nodes=[],
			llm=llm,
			choice_top_k=docs_top_k,
		)
		self.re_retrieve_top_k = re_retrieve_top_k
		self.final_use_context = final_use_context
		self.final_use_summary = final_use_summary
		self.doc_id_to_summary_id = self.paper_summary_retriever._index._index_struct.doc_id_to_summary_id
		self.summary_id_to_node_ids = self.paper_summary_retriever._index._index_struct.summary_id_to_node_ids
		self.retrieved_nodes = []
		root = Path(__file__)
		for i in range(5):
			root = root.parent
		self.root = root

	def _exclude_all_llm_metadata(self, node: BaseNode):
		r""" Hidden all metadata of a node to LLM. """
		node.excluded_llm_metadata_keys.extend(list(node.metadata.keys()))

	def _exclude_all_embedding_metadata(self, node: BaseNode):
		r""" Hidden all metadata of a node to the embed model. """
		node.excluded_embed_metadata_keys.extend(list(node.metadata.keys()))

	def get_ref_info(self) -> List[PaperInfo]:
		r"""
		Get the reference paper infos

		Returns:
			List[PaperInfo]: The reference paper infos in answering.
		"""
		doc_ids, doc_titles, doc_possessors = [], [], []
		ref_infos = []
		for node_score in self.retrieved_nodes:
			ref_doc_id = node_score.node.ref_doc_id
			if ref_doc_id not in doc_ids:
				doc_ids.append(ref_doc_id)
				title = node_score.node.metadata.get(PAPER_TITLE) or ref_doc_id
				possessor = node_score.node.metadata.get(PAPER_POSSESSOR)
				rel_path = node_score.node.metadata.get(PAPER_REL_FILE_PATH)
				if rel_path is None:
					raise ValueError("Invalid database.")
				paper_info = PaperInfo(
					title=title,
					possessor=possessor,
					file_path=str(self.root / rel_path),
				)
				ref_infos.append(paper_info)

				doc_titles.append(title)
				doc_possessors.append(possessor)
		return ref_infos

	def _secondary_retrieve(
		self,
		final_doc_ids: List[str],
		item_to_be_retrieved: str
	) -> Tuple[List[NodeWithScore], List[NodeWithScore]]:
		r"""
		Secondary retrieve among the nodes of the selected papers.

		Args:
			final_doc_ids (List[str]): the doc_ids of the selected papers.
			item_to_be_retrieved (str): the retrieving items.

		Returns:
			the summary_nodes and the content_nodes:

				- summary_nodes (List[NodeWithScore]): the summary nodes of these docs.
				- content_nodes (List[NodeWithScore]): the retrieved nodes among the chunked nodes of these docs.
		"""
		# get all nodes of these docs.
		summary_nodes = []
		all_doc_nodes = []
		for doc_id in final_doc_ids:
			summary_id = self.doc_id_to_summary_id[doc_id]
			summary_node = self.paper_summary_retriever._index.docstore.get_node(summary_id)
			# exclude metadata of summary nodes for llm using.
			self._exclude_all_llm_metadata(summary_node)
			summary_nodes.append(NodeWithScore(node=summary_node))

			# all doc nodes.
			doc_node_ids = self.summary_id_to_node_ids[summary_id]
			doc_nodes = self.paper_summary_retriever._index.docstore.get_nodes(doc_node_ids)
			# exclude metadata of content nodes
			for doc_node in doc_nodes:
				self._exclude_all_llm_metadata(doc_node)
				self._exclude_all_embedding_metadata(doc_node)
			all_doc_nodes.extend(doc_nodes)

		content_index = VectorStoreIndex(nodes=all_doc_nodes, embed_model=self.paper_vector_retriever._embed_model)
		content_retriever = content_index.as_retriever(similarity_top_k=self.re_retrieve_top_k)
		content_nodes = content_retriever.retrieve(item_to_be_retrieved)
		return summary_nodes, content_nodes

	async def _asecondary_retrieve(
		self,
		final_doc_ids: List[str],
		item_to_be_retrieved: str
	) -> Tuple[List[NodeWithScore], List[NodeWithScore]]:
		r"""
		Asynchronous secondary retrieve among the nodes of the selected papers.

		Args:
			final_doc_ids (List[str]): the doc_ids of the selected papers.
			item_to_be_retrieved (str): the retrieving items.

		Returns:
			the summary_nodes and the content_nodes:

				- summary_nodes (List[NodeWithScore]): the summary nodes of these docs.
				- content_nodes (List[NodeWithScore]): the retrieved nodes among the chunked nodes of these docs.
		"""
		summary_nodes = []
		all_doc_nodes = []
		for doc_id in final_doc_ids:
			summary_id = self.doc_id_to_summary_id[doc_id]
			summary_node = self.paper_summary_retriever._index.docstore.get_node(summary_id)
			# exclude metadata of summary nodes for llm using.
			self._exclude_all_llm_metadata(summary_node)
			summary_nodes.append(NodeWithScore(node=summary_node))

			# all doc nodes.
			doc_node_ids = self.summary_id_to_node_ids[summary_id]
			doc_nodes = self.paper_summary_retriever._index.docstore.get_nodes(doc_node_ids)
			# exclude metadata of content nodes
			for doc_node in doc_nodes:
				self._exclude_all_llm_metadata(doc_node)
				self._exclude_all_embedding_metadata(doc_node)
			all_doc_nodes.extend(doc_nodes)

		content_index = VectorStoreIndex(nodes=all_doc_nodes, embed_model=self.paper_vector_retriever._embed_model)
		content_retriever = content_index.as_retriever(similarity_top_k=self.re_retrieve_top_k)
		content_nodes = await content_retriever.aretrieve(item_to_be_retrieved)
		return summary_nodes, content_nodes

	def _get_context(self, content_nodes: List[NodeWithScore]) -> List[NodeWithScore]:
		r"""
		Get the 1-hop context nodes of each content node retrieved in the secondary retrieving.
		"""
		content_ids = [node.node.node_id for node in content_nodes]
		extra_ids = []
		for node in content_nodes:
			prev_node = node.node.prev_node
			next_node = node.node.next_node
			if prev_node is not None:
				prev_id = node.node.prev_node.node_id
				if prev_id not in content_ids:
					extra_ids.append(prev_id)
					content_ids.append(prev_id)

			if next_node is not None:
				next_id = node.node.next_node.node_id
				if next_id not in content_ids:
					extra_ids.append(next_id)
					content_ids.append(next_id)

		context_nodes = self.paper_summary_retriever._index.docstore.get_nodes(extra_ids)
		context_nodes = [NodeWithScore(node=node) for node in context_nodes]
		# exclude metadata in LLM using.
		for node in context_nodes:
			self._exclude_all_llm_metadata(node.node)
		return context_nodes

	@dispatcher.span
	def retrieve(
		self,
		item_to_be_retrieved: str,
	) -> List[NodeWithScore]:
		r"""
		This tool is used to retrieve academic information in the Laboratory's shared paper database.
		It is useful to help answer the user's academic questions.

		Args:
			item_to_be_retrieved (str): The things that you want to retrieve in the shared paper database.
		"""
		# This docstring is used as the tool description.
		vector_nodes = self.paper_vector_retriever.retrieve(item_to_be_retrieved)
		summary_chunk_nodes = self.paper_summary_retriever.retrieve(item_to_be_retrieved)

		hybrid_doc_ids = set()
		for node in summary_chunk_nodes + vector_nodes:
			hybrid_doc_ids.add(node.node.ref_doc_id)

		doc_id_to_summary_id = self.paper_summary_retriever._index._index_struct.doc_id_to_summary_id
		hybrid_summary_ids = [doc_id_to_summary_id[doc_id] for doc_id in hybrid_doc_ids]
		doc_summary_nodes = self.paper_summary_retriever._index.docstore.get_nodes(hybrid_summary_ids)

		self.paper_summary_post_selector._summary_nodes = doc_summary_nodes
		final_doc_ids = self.paper_summary_post_selector.select(item_to_be_retrieved)

		summary_nodes, content_nodes = self._secondary_retrieve(
			final_doc_ids=final_doc_ids,
			item_to_be_retrieved=item_to_be_retrieved,
		)

		final_nodes = content_nodes
		if self.final_use_summary:
			final_nodes.extend(summary_nodes)

		if self.final_use_context:
			context_nodes = self._get_context(content_nodes)
			final_nodes.extend(context_nodes)
		self.retrieved_nodes = final_nodes
		return final_nodes

	@dispatcher.span
	async def aretrieve(
		self,
		item_to_be_retrieved: str,
	) -> List[NodeWithScore]:
		r"""
		This tool is used to retrieve academic information in the Laboratory's shared paper database, which contains
	abundant research papers. It is useful to help you answer the user's academic questions.

		Args:
			item_to_be_retrieved (str): The things that you want to retrieve in the shared paper database.
		"""
		vector_nodes = await self.paper_vector_retriever.aretrieve(item_to_be_retrieved)
		summary_chunk_nodes = await self.paper_summary_retriever.aretrieve(item_to_be_retrieved)

		hybrid_doc_ids = set()
		for node in summary_chunk_nodes + vector_nodes:
			hybrid_doc_ids.add(node.node.ref_doc_id)

		doc_id_to_summary_id = self.paper_summary_retriever._index._index_struct.doc_id_to_summary_id
		hybrid_summary_ids = [doc_id_to_summary_id[doc_id] for doc_id in hybrid_doc_ids]
		doc_summary_nodes = self.paper_summary_retriever._index.docstore.get_nodes(hybrid_summary_ids)

		self.paper_summary_post_selector._summary_nodes = doc_summary_nodes
		final_doc_ids = await self.paper_summary_post_selector.aselect(item_to_be_retrieved)

		summary_nodes, content_nodes = await self._asecondary_retrieve(
			final_doc_ids=final_doc_ids,
			item_to_be_retrieved=item_to_be_retrieved,
		)

		final_nodes = content_nodes
		if self.final_use_summary:
			final_nodes.extend(summary_nodes)

		if self.final_use_context:
			context_nodes = self._get_context(content_nodes)
			final_nodes.extend(context_nodes)
		self.retrieved_nodes = final_nodes
		return final_nodes


	@classmethod
	def from_storage(
		cls,
		llm: Optional[LLM] = None,
		embed_model: Optional[BaseEmbedding] = None,
		vector_persist_dir: Optional[Union[Path, str]] = None,
		paper_summary_persist_dir: Optional[Union[Path, str]] = None,
		vector_similarity_top_k: Optional[int] = PAPER_VECTOR_TOP_K,
		summary_similarity_top_k: Optional[int] = PAPER_SUMMARY_TOP_K,
		service_context: Optional[ServiceContext] = None,
		docs_top_k: int = PAPER_TOP_K,
		re_retrieve_top_k: int = PAPER_RETRIEVE_TOP_K,
		final_use_context: bool = True,
		final_use_summary: bool = True,
	):
		r"""
		Load from an existing storage.
		"""
		root = Path(__file__)
		for i in range(5):
			root = root.parent

		llm = llm or llm_from_settings_or_context(Settings, service_context)
		embed_model = embed_model or embed_model_from_settings_or_context(Settings, service_context)

		vector_persist_dir = vector_persist_dir or root / DEFAULT_PAPER_VECTOR_PERSIST_DIR
		vector_storage_context = StorageContext.from_defaults(persist_dir=vector_persist_dir)
		vector_index = load_index_from_storage(
			storage_context=vector_storage_context,
			index_id=PAPER_VECTOR_INDEX_ID,
			embed_model=embed_model
		)
		vector_retriever = vector_index.as_retriever(similarity_top_k=vector_similarity_top_k)

		paper_summary_persist_dir = paper_summary_persist_dir or root / DEFAULT_PAPER_SUMMARY_PERSIST_DIR
		paper_summary_storage_context = StorageContext.from_defaults(persist_dir=paper_summary_persist_dir)
		paper_summary_index = load_index_from_storage(
			storage_context=paper_summary_storage_context,
			index_id=PAPER_SUMMARY_INDEX_ID,
			llm=llm,
			embed_model=embed_model
		)
		summary_retriever = paper_summary_index.as_retriever(
			retriever_mode=DocumentSummaryRetrieverMode.EMBEDDING,
			similarity_top_k=summary_similarity_top_k)
		return cls(
			llm = llm,
			paper_vector_retriever=vector_retriever,
			paper_summary_retriever=summary_retriever,
			docs_top_k=docs_top_k,
			re_retrieve_top_k=re_retrieve_top_k,
			final_use_context=final_use_context,
			final_use_summary=final_use_summary,
		)

labridge.func_modules.paper.retrieve.paper_retriever.PaperRetriever.aretrieve(item_to_be_retrieved) async

This tool is used to retrieve academic information in the Laboratory's shared paper database, which contains abundant research papers. It is useful to help you answer the user's academic questions.

PARAMETER DESCRIPTION
item_to_be_retrieved

The things that you want to retrieve in the shared paper database.

TYPE: str
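
A brief sketch of calling the asynchronous variant, under the same assumptions as the usage sketch above (existing storage, configured Settings); the query string is invented.

import asyncio

from labridge.func_modules.paper.retrieve.paper_retriever import PaperRetriever

async def main():
	retriever = PaperRetriever.from_storage()
	# The vector retrieval, summary retrieval and LLM post-selection are awaited.
	return await retriever.aretrieve("memristor-based in-memory computing")

nodes = asyncio.run(main())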

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
@dispatcher.span
async def aretrieve(
	self,
	item_to_be_retrieved: str,
) -> List[NodeWithScore]:
	r"""
	This tool is used to retrieve academic information in the Laboratory's shared paper database, which contains
	abundant research papers. It is useful to help you answer the user's academic questions.

	Args:
		item_to_be_retrieved (str): The things that you want to retrieve in the shared paper database.
	"""
	vector_nodes = await self.paper_vector_retriever.aretrieve(item_to_be_retrieved)
	summary_chunk_nodes = await self.paper_summary_retriever.aretrieve(item_to_be_retrieved)

	hybrid_doc_ids = set()
	for node in summary_chunk_nodes + vector_nodes:
		hybrid_doc_ids.add(node.node.ref_doc_id)

	doc_id_to_summary_id = self.paper_summary_retriever._index._index_struct.doc_id_to_summary_id
	hybrid_summary_ids = [doc_id_to_summary_id[doc_id] for doc_id in hybrid_doc_ids]
	doc_summary_nodes = self.paper_summary_retriever._index.docstore.get_nodes(hybrid_summary_ids)

	self.paper_summary_post_selector._summary_nodes = doc_summary_nodes
	final_doc_ids = await self.paper_summary_post_selector.aselect(item_to_be_retrieved)

	summary_nodes, content_nodes = await self._asecondary_retrieve(
		final_doc_ids=final_doc_ids,
		item_to_be_retrieved=item_to_be_retrieved,
	)

	final_nodes = content_nodes
	if self.final_use_summary:
		final_nodes.extend(summary_nodes)

	if self.final_use_context:
		context_nodes = self._get_context(content_nodes)
		final_nodes.extend(context_nodes)
	self.retrieved_nodes = final_nodes
	return final_nodes

labridge.func_modules.paper.retrieve.paper_retriever.PaperRetriever.from_storage(llm=None, embed_model=None, vector_persist_dir=None, paper_summary_persist_dir=None, vector_similarity_top_k=PAPER_VECTOR_TOP_K, summary_similarity_top_k=PAPER_SUMMARY_TOP_K, service_context=None, docs_top_k=PAPER_TOP_K, re_retrieve_top_k=PAPER_RETRIEVE_TOP_K, final_use_context=True, final_use_summary=True) classmethod

Load from an existing storage.
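
For illustration, a hedged example of loading from non-default locations. The directory paths below are made up; the real defaults are DEFAULT_PAPER_VECTOR_PERSIST_DIR and DEFAULT_PAPER_SUMMARY_PERSIST_DIR resolved against the project root.

from labridge.func_modules.paper.retrieve.paper_retriever import PaperRetriever

retriever = PaperRetriever.from_storage(
	vector_persist_dir="storage/papers/vector_index",          # hypothetical path
	paper_summary_persist_dir="storage/papers/summary_index",  # hypothetical path
	vector_similarity_top_k=8,
	summary_similarity_top_k=8,
	docs_top_k=3,
	re_retrieve_top_k=6,
)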

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
@classmethod
def from_storage(
	cls,
	llm: Optional[LLM] = None,
	embed_model: Optional[BaseEmbedding] = None,
	vector_persist_dir: Optional[Union[Path, str]] = None,
	paper_summary_persist_dir: Optional[Union[Path, str]] = None,
	vector_similarity_top_k: Optional[int] = PAPER_VECTOR_TOP_K,
	summary_similarity_top_k: Optional[int] = PAPER_SUMMARY_TOP_K,
	service_context: Optional[ServiceContext] = None,
	docs_top_k: int = PAPER_TOP_K,
	re_retrieve_top_k: int = PAPER_RETRIEVE_TOP_K,
	final_use_context: bool = True,
	final_use_summary: bool = True,
):
	r"""
	Load from an existing storage.
	"""
	root = Path(__file__)
	for i in range(5):
		root = root.parent

	llm = llm or llm_from_settings_or_context(Settings, service_context)
	embed_model = embed_model or embed_model_from_settings_or_context(Settings, service_context)

	vector_persist_dir = vector_persist_dir or root / DEFAULT_PAPER_VECTOR_PERSIST_DIR
	vector_storage_context = StorageContext.from_defaults(persist_dir=vector_persist_dir)
	vector_index = load_index_from_storage(
		storage_context=vector_storage_context,
		index_id=PAPER_VECTOR_INDEX_ID,
		embed_model=embed_model
	)
	vector_retriever = vector_index.as_retriever(similarity_top_k=vector_similarity_top_k)

	paper_summary_persist_dir = paper_summary_persist_dir or root / DEFAULT_PAPER_SUMMARY_PERSIST_DIR
	paper_summary_storage_context = StorageContext.from_defaults(persist_dir=paper_summary_persist_dir)
	paper_summary_index = load_index_from_storage(
		storage_context=paper_summary_storage_context,
		index_id=PAPER_SUMMARY_INDEX_ID,
		llm=llm,
		embed_model=embed_model
	)
	summary_retriever = paper_summary_index.as_retriever(
		retriever_mode=DocumentSummaryRetrieverMode.EMBEDDING,
		similarity_top_k=summary_similarity_top_k)
	return cls(
		llm = llm,
		paper_vector_retriever=vector_retriever,
		paper_summary_retriever=summary_retriever,
		docs_top_k=docs_top_k,
		re_retrieve_top_k=re_retrieve_top_k,
		final_use_context=final_use_context,
		final_use_summary=final_use_summary,
	)

labridge.func_modules.paper.retrieve.paper_retriever.PaperRetriever.get_ref_info()

Get the reference paper infos

RETURNS DESCRIPTION
List[PaperInfo]

List[PaperInfo]: The reference paper infos in answering.
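
A short sketch of reading the returned infos after a retrieval call; the PaperInfo field names follow the constructor call in the source, and the query string is invented.

from labridge.func_modules.paper.retrieve.paper_retriever import PaperRetriever

retriever = PaperRetriever.from_storage()
retriever.retrieve("spiking neural network hardware")

for paper_info in retriever.get_ref_info():
	# One entry per distinct source paper among the last retrieved nodes.
	print(f"{paper_info.title} (possessor: {paper_info.possessor}) -> {paper_info.file_path}")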

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
def get_ref_info(self) -> List[PaperInfo]:
	r"""
	Get the reference paper infos

	Returns:
		List[PaperInfo]: The reference paper infos in answering.
	"""
	doc_ids, doc_titles, doc_possessors = [], [], []
	ref_infos = []
	for node_score in self.retrieved_nodes:
		ref_doc_id = node_score.node.ref_doc_id
		if ref_doc_id not in doc_ids:
			doc_ids.append(ref_doc_id)
			title = node_score.node.metadata.get(PAPER_TITLE) or ref_doc_id
			possessor = node_score.node.metadata.get(PAPER_POSSESSOR)
			rel_path = node_score.node.metadata.get(PAPER_REL_FILE_PATH)
			if rel_path is None:
				raise ValueError("Invalid database.")
			paper_info = PaperInfo(
				title=title,
				possessor=possessor,
				file_path=str(self.root / rel_path),
			)
			ref_infos.append(paper_info)

			doc_titles.append(title)
			doc_possessors.append(possessor)
	return ref_infos

labridge.func_modules.paper.retrieve.paper_retriever.PaperRetriever.retrieve(item_to_be_retrieved)

This tool is used to retrieve academic information in the Laboratory's shared paper database. It is useful to help answer the user's academic questions.

PARAMETER DESCRIPTION
item_to_be_retrieved

The things that you want to retrieve in the shared paper database.

TYPE: str
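
As a sketch of what comes back: in the listing below, the content chunks found by the secondary retrieval carry similarity scores, while the appended summary and context nodes are wrapped in NodeWithScore without a score, so the two kinds can be told apart. The query string is invented.

from labridge.func_modules.paper.retrieve.paper_retriever import PaperRetriever

retriever = PaperRetriever.from_storage()
results = retriever.retrieve("low-power analog front-end design")

content_chunks = [n for n in results if n.score is not None]   # secondary-retrieval hits
extra_nodes = [n for n in results if n.score is None]          # summary and 1-hop context nodes
print(len(content_chunks), "scored chunks;", len(extra_nodes), "summary/context nodes")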

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
@dispatcher.span
def retrieve(
	self,
	item_to_be_retrieved: str,
) -> List[NodeWithScore]:
	r"""
	This tool is used to retrieve academic information in the Laboratory's shared paper database.
	It is useful to help answer the user's academic questions.

	Args:
		item_to_be_retrieved (str): The things that you want to retrieve in the shared paper database.
	"""
	# This docstring is used as the tool description.
	vector_nodes = self.paper_vector_retriever.retrieve(item_to_be_retrieved)
	summary_chunk_nodes = self.paper_summary_retriever.retrieve(item_to_be_retrieved)

	hybrid_doc_ids = set()
	for node in summary_chunk_nodes + vector_nodes:
		hybrid_doc_ids.add(node.node.ref_doc_id)

	doc_id_to_summary_id = self.paper_summary_retriever._index._index_struct.doc_id_to_summary_id
	hybrid_summary_ids = [doc_id_to_summary_id[doc_id] for doc_id in hybrid_doc_ids]
	doc_summary_nodes = self.paper_summary_retriever._index.docstore.get_nodes(hybrid_summary_ids)

	self.paper_summary_post_selector._summary_nodes = doc_summary_nodes
	final_doc_ids = self.paper_summary_post_selector.select(item_to_be_retrieved)

	summary_nodes, content_nodes = self._secondary_retrieve(
		final_doc_ids=final_doc_ids,
		item_to_be_retrieved=item_to_be_retrieved,
	)

	final_nodes = content_nodes
	if self.final_use_summary:
		final_nodes.extend(summary_nodes)

	if self.final_use_context:
		context_nodes = self._get_context(content_nodes)
		final_nodes.extend(context_nodes)
	self.retrieved_nodes = final_nodes
	return final_nodes

labridge.func_modules.paper.retrieve.paper_retriever.PaperSummaryLLMPostSelector

Use an LLM to re-rank the retrieved papers obtained by the vector_retriever and summary_retriever, according to their summaries.
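
A minimal sketch of using the selector on its own. The names summary_nodes and llm are assumed to exist already (for example, summary nodes fetched from the DocumentSummaryIndex docstore, as PaperRetriever does); the query string is invented.

from labridge.func_modules.paper.retrieve.paper_retriever import PaperSummaryLLMPostSelector

# summary_nodes: assumed list of BaseNode, one summary node per candidate paper.
# llm: assumed llama_index LLM; when omitted, Settings.llm is used.
selector = PaperSummaryLLMPostSelector(
	summary_nodes=summary_nodes,
	llm=llm,
	choice_top_k=2,  # keep the 2 papers the LLM judges most relevant
)
doc_ids = selector.select("quantum error correction with surface codes")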

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
class PaperSummaryLLMPostSelector:
	r"""
	Use LLM to re-rank the retrieved papers obtained by vector_retriever and summary_retriever,
	according to their summaries.
	"""
	def __init__(
		self,
		summary_nodes: List[BaseNode],
		llm: Optional[LLM] = None,
		choice_select_prompt: Optional[BasePromptTemplate] = None,
		choice_batch_size: int = 10,
		choice_top_k: int = 2,
		format_node_batch_fn: Optional[Callable] = None,
		parse_choice_select_answer_fn: Optional[Callable] = None,
	):
		self._summary_nodes = summary_nodes
		self._choice_select_prompt = (choice_select_prompt or DOC_CHOICE_SELECT_PROMPT)
		self._choice_batch_size = choice_batch_size
		self._choice_top_k = choice_top_k
		self._format_node_batch_fn = (format_node_batch_fn or default_format_node_batch_fn)
		self._parse_choice_select_answer_fn = (parse_choice_select_answer_fn or default_parse_choice_select_answer_fn)
		self._llm = llm or Settings.llm

	def select(
		self,
		item_to_be_retrieved: str,
	) -> List[str]:
		r"""
		Select from the paper summaries according to the relevance to the retrieving string.

		Args:
			item_to_be_retrieved (str): The retrieving string.

		Return the ref_doc_ids of the selected docs.
		"""
		all_nodes: List[BaseNode] = []
		all_relevances: List[float] = []
		for idx in range(0, len(self._summary_nodes), self._choice_batch_size):
			summary_nodes = self._summary_nodes[idx: idx + self._choice_batch_size]
			fmt_batch_str = self._format_node_batch_fn(summary_nodes)
			# call each batch independently
			raw_response = self._llm.predict(
				self._choice_select_prompt,
				context_str=fmt_batch_str,
				query_str=item_to_be_retrieved,
			)

			raw_choices, relevances = self._parse_choice_select_answer_fn(raw_response, len(summary_nodes))
			choice_idxs = [choice - 1 for choice in raw_choices]

			choice_summary_nodes = [summary_nodes[ci] for ci in choice_idxs]

			all_nodes.extend(choice_summary_nodes)
			all_relevances.extend(relevances)

		zipped_list = list(zip(all_nodes, all_relevances))
		sorted_list = sorted(zipped_list, key=lambda x: x[1], reverse=True)
		top_k_list = sorted_list[: self._choice_top_k]

		doc_ids = [node.ref_doc_id for node, relevance in top_k_list]
		return doc_ids

	async def aselect(
		self,
		item_to_be_retrieved: str,
	) -> List[str]:
		r"""
		Asynchronously select from the paper summaries according to the relevance to the retrieving string.

		Args:
			item_to_be_retrieved (str): The retrieving string.

		Return the ref_doc_ids of the selected docs.
		"""
		all_nodes: List[BaseNode] = []
		all_relevances: List[float] = []
		for idx in range(0, len(self._summary_nodes), self._choice_batch_size):
			summary_nodes = self._summary_nodes[idx: idx + self._choice_batch_size]
			fmt_batch_str = self._format_node_batch_fn(summary_nodes)
			# call each batch independently
			raw_response = await self._llm.apredict(
				self._choice_select_prompt,
				context_str=fmt_batch_str,
				query_str=item_to_be_retrieved,
			)
			raw_choices, relevances = self._parse_choice_select_answer_fn(raw_response, len(summary_nodes))
			choice_idxs = [choice - 1 for choice in raw_choices]

			choice_summary_nodes = [summary_nodes[ci] for ci in choice_idxs]

			all_nodes.extend(choice_summary_nodes)
			all_relevances.extend(relevances)

		zipped_list = list(zip(all_nodes, all_relevances))
		sorted_list = sorted(zipped_list, key=lambda x: x[1], reverse=True)
		top_k_list = sorted_list[: self._choice_top_k]

		doc_ids = [node.ref_doc_id for node, relevance in top_k_list]
		return doc_ids

labridge.func_modules.paper.retrieve.paper_retriever.PaperSummaryLLMPostSelector.aselect(item_to_be_retrieved) async

Asynchronously select from the paper summaries according to the relevance to the retrieving string.

PARAMETER DESCRIPTION
item_to_be_retrieved

The retrieving string.

TYPE: str

Return the ref_doc_ids of the selected docs.

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
async def aselect(
	self,
	item_to_be_retrieved: str,
) -> List[str]:
	r"""
	Asynchronously select from the paper summaries according to the relevance to the retrieving string.

	Args:
		item_to_be_retrieved (str): The retrieving string.

	Return the ref_doc_ids of the selected docs.
	"""
	all_nodes: List[BaseNode] = []
	all_relevances: List[float] = []
	for idx in range(0, len(self._summary_nodes), self._choice_batch_size):
		summary_nodes = self._summary_nodes[idx: idx + self._choice_batch_size]
		fmt_batch_str = self._format_node_batch_fn(summary_nodes)
		# call each batch independently
		raw_response = await self._llm.apredict(
			self._choice_select_prompt,
			context_str=fmt_batch_str,
			query_str=item_to_be_retrieved,
		)
		raw_choices, relevances = self._parse_choice_select_answer_fn(raw_response, len(summary_nodes))
		choice_idxs = [choice - 1 for choice in raw_choices]

		choice_summary_nodes = [summary_nodes[ci] for ci in choice_idxs]

		all_nodes.extend(choice_summary_nodes)
		all_relevances.extend(relevances)

	zipped_list = list(zip(all_nodes, all_relevances))
	sorted_list = sorted(zipped_list, key=lambda x: x[1], reverse=True)
	top_k_list = sorted_list[: self._choice_top_k]

	doc_ids = [node.ref_doc_id for node, relevance in top_k_list]
	return doc_ids

labridge.func_modules.paper.retrieve.paper_retriever.PaperSummaryLLMPostSelector.select(item_to_be_retrieved)

Select from the paper summaries according to the relevance to the retrieving string.

PARAMETER DESCRIPTION
item_to_be_retrieved

The retrieving string.

TYPE: str

Return the ref_doc_ids of the selected docs.
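
For example, with 23 summary nodes and the default choice_batch_size of 10, the LLM is prompted three times (batches of 10, 10 and 3 summaries); the per-batch choices and relevance scores are pooled, sorted by relevance, and the ref_doc_ids of the top choice_top_k papers are returned.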

Source code in labridge\func_modules\paper\retrieve\paper_retriever.py
def select(
	self,
	item_to_be_retrieved: str,
) -> List[str]:
	r"""
	Select from the paper summaries according to the relevance to the retrieving string.

	Args:
		item_to_be_retrieved (str): The retrieving string.

	Return the ref_doc_ids of the selected docs.
	"""
	all_nodes: List[BaseNode] = []
	all_relevances: List[float] = []
	for idx in range(0, len(self._summary_nodes), self._choice_batch_size):
		summary_nodes = self._summary_nodes[idx: idx + self._choice_batch_size]
		fmt_batch_str = self._format_node_batch_fn(summary_nodes)
		# call each batch independently
		raw_response = self._llm.predict(
			self._choice_select_prompt,
			context_str=fmt_batch_str,
			query_str=item_to_be_retrieved,
		)

		raw_choices, relevances = self._parse_choice_select_answer_fn(raw_response, len(summary_nodes))
		choice_idxs = [choice - 1 for choice in raw_choices]

		choice_summary_nodes = [summary_nodes[ci] for ci in choice_idxs]

		all_nodes.extend(choice_summary_nodes)
		all_relevances.extend(relevances)

	zipped_list = list(zip(all_nodes, all_relevances))
	sorted_list = sorted(zipped_list, key=lambda x: x[1], reverse=True)
	top_k_list = sorted_list[: self._choice_top_k]

	doc_ids = [node.ref_doc_id for node, relevance in top_k_list]
	return doc_ids