Paper Deepdives, RAG, II - Dense Passage Retrieval

Posted by Ibrahim Cikrikcioglu on October 13, 2023 · 6 mins read

1. Introduction

In the previous post, I discussed the first paper of this series. There, the authors employed an inverted index and sparse retrieval methods, like TFIDF, to fetch relevant passages or documents from Wikipedia for answering questions. The primary innovation of today’s paper revolves around this retrieval aspect. Instead of relying on lexical matching and traditional IR methods, Karpukhin et al. utilize a dense vector representation of documents to retrieve passages or documents pertinent to a given question.

In this post, besides briefly summarizing the paper’s contributions, I will also introduce a library named FAISS. With the rise of LLMs and vector databases, such tools and libraries are becoming increasingly significant!

Paper in Detail

Problem Definition

Given a question, the objective is to identify the document containing its answer. Earlier techniques employed sparse retrieval methods for this task. In essence, sparse retrieval directly matches the query terms with document terms without resorting to dense representations. The paper offers a clear example and explanation:

> Consider the question “Who is the bad guy in Lord of the Rings?”, which can be answered from the context “Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy.” A term-based system would struggle to retrieve such a context, whereas a dense retrieval system would efficiently correlate “bad guy” with “villain” and fetch the correct context.

Paper’s Solution

Let’s envisage each document or passage as a dense vector — an n-dimensional continuous and compact vector. For instance, in this representation, a lengthy document might be denoted as $[0.12,-0.58,0.39, \ldots]$.

Similarly, the query is also represented as a dense vector. Once both vectors are established, a similarity measure can determine how alike or dissimilar they are. Consequently, from millions of passage vectors, those most similar to the query vector can be pinpointed. Using the paper’s notation:

Given query q and document p, let $E_Q$ and $E_P$ be encoding systems that transform the query and the document into dense vectors, respectively. The similarity is then computed using the dot product:

\[\operatorname{sim}(q, p)=E_Q(q)^{\top} E_P(p)\]
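To make the formula concrete, here is a tiny numpy sketch; the four-dimensional vectors are made-up stand-ins for the encoder outputs (the real encoders produce 768-dimensional vectors):

```python
import numpy as np

# Made-up outputs of E_Q(q) and E_P(p); real encoders produce 768-dim vectors.
q_vec = np.array([0.12, -0.58, 0.39, 0.70])
p_vec = np.array([0.10, -0.60, 0.35, 0.65])

# sim(q, p) = E_Q(q)^T E_P(p): a plain dot product, higher means more similar.
similarity = float(q_vec @ p_vec)
print(similarity)
```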

Two challenges arise:

  1. How do we train these encoding systems to ensure that a relevant passage and a question are indeed proximate in the dense encoding space?
  2. Even with optimal encoding systems, won’t calculating the similarities between millions of documents and a particular question be computationally intensive?

Training Encoding Systems with Contrastive Learning

To address the training challenge, the authors apply a traditional method that’s both straightforward and intuitive. Broadly, for a query, q, where $p^{+}$ denotes the document or passage containing the answer, the aim is to maximize the similarity between the query and the passage while minimizing the similarity with all other passages.

This goal is realized using contrastive learning. Here, a set of documents irrelevant to the question, termed negatives, $p^{-}$, is chosen. The loss is large when the relevant passage is only marginally more similar to the question than the negatives are. Essentially, this loss function nudges the anchor (question) towards the positives and distances it from the negatives in the vector space. Quoting from the paper:

\[\begin{aligned} & L\left(q_i, p_i^{+}, p_{i, 1}^{-}, \cdots, p_{i, n}^{-}\right) \\ = & -\log \frac{e^{\operatorname{sim}\left(q_i, p_i^{+}\right)}}{e^{\operatorname{sim}\left(q_i, p_i^{+}\right)}+\sum_{j=1}^n e^{\operatorname{sim}\left(q_i, p_{i, j}^{-}\right)}} . \end{aligned}\]
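In code, this is simply a softmax cross-entropy in which the positive passage is the target class. Below is a minimal PyTorch sketch, assuming `sim_scores` holds the dot products of one question with its positive passage (index 0) followed by its n negatives; the function name `dpr_loss` is just illustrative.

```python
import torch
import torch.nn.functional as F

def dpr_loss(sim_scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the positive passage.

    sim_scores: shape (1 + n,), where index 0 is sim(q, p+) and
    indices 1..n are sim(q, p-_j).
    """
    # -log softmax over all candidates, evaluated at the positive (index 0).
    return -F.log_softmax(sim_scores, dim=0)[0]

# Example: the positive scores slightly higher than the negatives.
scores = torch.tensor([5.2, 3.1, 2.8, 4.0])
print(dpr_loss(scores))  # small scalar loss
```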

The pivotal question is the selection of negatives. The authors experiment with different negative setups:

  1. Arbitrary documents from the corpus.
  2. BM25 negatives: the top passages returned by a sparse retriever such as BM25 that do not contain the answer. These negatives aren’t “trivial”, since they still share considerable lexical overlap with the query.
  3. Gold: passages that are the positives of other questions in the training set but not of the current one (see the in-batch sketch after this list).
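In the paper’s strongest configuration, the gold negatives come essentially for free: within a training mini-batch, each question treats the positive passages of all the other questions as its negatives (plus one hard BM25 negative per question). Below is a minimal PyTorch sketch of this in-batch-negative objective, assuming `q_emb` and `p_emb` are the already-encoded question and positive-passage vectors of one batch; the function name `in_batch_loss` is just illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """In-batch negatives: q_emb and p_emb have shape (batch, dim),
    and p_emb[i] is the positive passage for q_emb[i]."""
    # (batch, batch) matrix of dot-product similarities between every
    # question and every passage in the batch.
    sim_matrix = q_emb @ p_emb.T
    # For question i, the correct "class" is passage i; all other passages
    # in the batch act as negatives.
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(sim_matrix, targets)
```

Reusing the in-batch passages this way makes the number of (question, negative) pairs grow quadratically with the batch size at no extra encoding cost.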

Encoding Problem

Typically, a document is an ordered sequence of words, which raises the question: how do we represent it as a single vector? Since the encoder produces one vector per token, one might think of averaging the individual word vectors. The authors take a different approach: both encoders are BERT networks, and the representation at the special [CLS] token is used as the embedding of the entire sequence.
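Here is a minimal sketch of this idea with the Hugging Face transformers library, using a plain bert-base-uncased checkpoint as a stand-in for the trained DPR encoders (the paper actually trains two separate BERT encoders, one for questions and one for passages); the helper name `encode` is just illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    """Return one dense vector per input text, taken from the [CLS] position."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = encoder(**batch)
    # The hidden state at position 0 belongs to the [CLS] token and serves
    # as the embedding of the whole sequence.
    return output.last_hidden_state[:, 0]

vecs = encode(["Sala Baker is best known for portraying the villain Sauron ..."])
print(vecs.shape)  # (1, 768)
```

If you want the actual trained weights rather than a vanilla BERT, transformers also ships dedicated DPRQuestionEncoder and DPRContextEncoder classes.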

FAISS: Efficiently Finding the Most Similar Document

As noted earlier, pinpointing the most similar document can be computationally demanding unless approached intelligently. This is where FAISS steps in.

Developed by Facebook AI Research (FAIR), FAISS is a library tailored for large-scale similarity search and clustering of dense vectors. It’s primarily designed for in-memory operations, facilitating speedy searches—vital for real-time applications such as open-domain question-answering.

An essential point to note is that FAISS doesn’t retain the original data items (like text passages). Instead, it operates on their vector representations. When vectors are integrated into FAISS, they can be tagged with unique IDs. After conducting a similarity search, FAISS provides the IDs of the most similar vectors. These IDs serve as “pointers” enabling the retrieval of the original data from its respective storage location, be it databases or file systems.
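Here is a minimal FAISS sketch along those lines, using random vectors as stand-ins for real passage and query embeddings and an exact inner-product index for clarity (FAISS also offers approximate indexes, such as HNSW, that trade a little accuracy for large speedups):

```python
import faiss
import numpy as np

dim = 768                                                     # embedding dimension
passage_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in passage vectors
passage_ids = np.arange(10_000, dtype="int64")                # IDs pointing back to the text store

base_index = faiss.IndexFlatIP(dim)       # exact maximum-inner-product search
index = faiss.IndexIDMap(base_index)      # lets us attach our own integer IDs
index.add_with_ids(passage_vecs, passage_ids)

query_vec = np.random.rand(1, dim).astype("float32")          # stand-in query embedding
scores, ids = index.search(query_vec, 5)                      # top-5 most similar passages
print(ids[0])  # use these IDs to fetch the original passages from wherever they are stored
```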

Conclusion

In NLP, efficient retrieval and meaningful representation are critical challenges. The paper’s innovative approach, from transforming lengthy documents into a single dense vector to the strategic employment of FAISS for rapid similarity searches, exemplifies the strides being made in this field. Libraries like FAISS, optimized for real-time applications, highlight the growing intersection of research and practical application. As we delve deeper into the complexities of representing and accessing vast textual data, tools and methodologies like the ones discussed in this paper pave the way for future advancements. Stay tuned as we explore more cutting-edge insights from subsequent RAG papers and beyond.

Resources

  1. Karpukhin et al., “Dense Passage Retrieval for Open-Domain Question Answering”, EMNLP 2020.