Let's define what we mean by vector databases and vector search. In short, a vector database stores high-dimensional vectors, where each vector represents some piece of data, like an image or a piece of text, and the core property is that the closer two vectors are, the more similar the underlying items are. The simplest way to answer a query is to calculate the similarity between the query vector and every vector in the database, but this brute-force scan becomes extremely slow as the database grows, since every query takes O(N) time.

One way to speed up the search is to use a data structure called a navigable small world (NSW), a graph in which each node represents a vector. The graph is constructed as follows. Imagine we have a database of documents, each represented by a vector. We build the navigable small world graph by iteratively connecting each document to its k most similar documents. We pick a document from the database and add it to the graph. Then we pick another document and link it to the closest documents added so far, which at this point is only the first document. We add another document and again connect it to its k nearest documents in the graph, and we repeat this step again and again until every document in the database has been added to the graph.

To answer a query, we start by randomly selecting a node from the graph. We calculate the similarity between the query and the neighbours of that node, move to the neighbour that is closest to the query, and keep repeating this greedy step until no neighbour is closer to the query than the current node. We then repeat the whole procedure from several different random entry points and keep the best result, which is already much faster than the brute-force method discussed above.
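To make this concrete, here is a minimal Python sketch of both approaches, assuming the vectors are NumPy arrays and cosine similarity is the distance measure. The function names and parameters (brute_force_search, build_nsw, nsw_search, k, restarts) are illustrative, not taken from any particular library.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def brute_force_search(query, vectors):
    """Compare the query against every stored vector: O(N) per query."""
    best_idx, best_sim = -1, -np.inf
    for idx, vec in enumerate(vectors):
        sim = cosine_similarity(query, vec)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx

def build_nsw(vectors, k=5):
    """Insert vectors one by one, linking each new node to the k most
    similar nodes among those already in the graph."""
    graph = {}  # node index -> set of neighbour indices
    for i, vec in enumerate(vectors):
        existing = list(graph)
        neighbours = sorted(
            existing,
            key=lambda j: cosine_similarity(vec, vectors[j]),
            reverse=True,
        )[:k]
        graph[i] = set(neighbours)
        for j in neighbours:        # keep the links bidirectional
            graph[j].add(i)
    return graph

def nsw_search(query, vectors, graph, restarts=3, seed=0):
    """Greedy search from several random entry points: hop to the neighbour
    most similar to the query until no neighbour improves on the current node."""
    rng = np.random.default_rng(seed)
    best_idx, best_sim = -1, -np.inf
    for _ in range(restarts):
        current = int(rng.choice(len(vectors)))
        current_sim = cosine_similarity(query, vectors[current])
        while True:
            best_nb, best_nb_sim = None, current_sim
            for j in graph[current]:
                sim = cosine_similarity(query, vectors[j])
                if sim > best_nb_sim:
                    best_nb, best_nb_sim = j, sim
            if best_nb is None:     # local optimum: no neighbour is closer
                break
            current, current_sim = best_nb, best_nb_sim
        if current_sim > best_sim:
            best_idx, best_sim = current, current_sim
    return best_idx

vectors = np.random.default_rng(42).normal(size=(1000, 64))
query = vectors[123] + 0.01 * np.random.default_rng(1).normal(size=64)
print(brute_force_search(query, vectors))                    # exact answer, O(N)
print(nsw_search(query, vectors, build_nsw(vectors, k=5)))   # approximate, far fewer comparisons
```

The greedy walk can get stuck in a local optimum, which is exactly why the search is restarted from several random entry points and the best result is kept.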
The next idea we need is the skip list, a probabilistic data structure that allows inserting and searching elements within a sorted list in O(log n) time on average. A skip list is built from several layers of linked lists. The lowest layer is the original linked list containing all of the elements; as we move to higher layers, more and more elements are skipped, so each layer has fewer nodes and fewer connections. Put differently, each node can carry multiple links, or "skip pointers," that point to nodes further along the list, which lets a search jump over large sections of the list instead of traversing every node sequentially. The higher levels have few links, the lower levels have more, and the very bottom level holds all of the nodes, which makes navigation and search much faster than in a standard linked list.

As an example, consider searching for the value 11 in a skip list. The search begins at the head node, at the top level. At each step we compare the target value (11) with the value of the next node at the current level. If the next node's value is higher than the target, or if we have reached the end of the list at that level, we move down one level. If the next node's value is lower than the target, we move horizontally to that node and repeat the comparison. Following these steps for 11: we start at the head node. The next node at the top level is the end of the list, which is effectively higher than 11, so we move down. At the next level down, the next node is 13, which is greater than 11, so we move down again. At that level, the next node is lower than 11, so we move horizontally to it. From this new position the next node is 13 again, which is greater than 11, so we move down. At the lowest level, the next node is 11, which is the target value, and the search ends successfully. The skip pointers let the search descend through the levels, skipping over many nodes, until it narrows the search space down to the target.

Hierarchical Navigable Small World (HNSW) graphs combine these two ideas, navigable small worlds and skip lists. The vectors are organised into a hierarchy of NSW graphs: the top layer contains only a few nodes with a small number of long-range connections, each layer below contains more nodes, and the bottom layer contains all of them. During a search, the cosine similarity (or whatever distance metric is used) is calculated repeatedly, not just once at the top layer with the initial closest node. The search typically begins at a random entry point in an upper layer of the hierarchy. At each step, the algorithm calculates the similarity between the query vector and the current node, as well as between the query vector and each of the current node's neighbours, and then moves to the neighbour with the best similarity score, i.e. the one closest to the query. This process of calculating similarities and moving to the best neighbour is iterated, and as the search works its way down through the layers, the same neighbour comparison is performed at every visited node to steer the search towards the region containing the closest vectors. The search stops when the algorithm reaches a node where none of its neighbours is closer to the query than the current node itself.
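To make the two building blocks concrete in code, here is first a small sketch of a skip list with the search procedure walked through above. The inserted values and the randomly chosen node levels are illustrative assumptions, not the exact list from the original example.

```python
import random

random.seed(0)   # fix the random level choices so the example is reproducible

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * level   # forward[i] = next node at level i

class SkipList:
    def __init__(self, max_level=4, p=0.5):
        self.max_level = max_level
        self.p = p
        self.head = Node(None, max_level)   # sentinel head spanning all levels
        self.level = 1                      # number of levels currently in use

    def _random_level(self):
        level = 1
        while random.random() < self.p and level < self.max_level:
            level += 1
        return level

    def insert(self, value):
        update = [self.head] * self.max_level
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node                # last node before the new value at level i
        level = self._random_level()
        self.level = max(self.level, level)
        new = Node(value, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def search(self, target):
        node = self.head
        # Start at the top level: move right while the next value is smaller
        # than the target, otherwise drop down a level.
        for i in reversed(range(self.level)):
            while node.forward[i] and node.forward[i].value < target:
                node = node.forward[i]
        node = node.forward[0]              # candidate node at the bottom level
        return node is not None and node.value == target

skiplist = SkipList()
for value in [3, 6, 7, 9, 11, 13, 17, 19]:
    skiplist.insert(value)
print(skiplist.search(11))   # True
print(skiplist.search(12))   # False
```

Because node levels are assigned randomly, the exact path taken for 11 will vary from run to run, but the down-or-right decision rule is the same as in the walk-through.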
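And here is a rough sketch of the layered, HNSW-style search, assuming the hierarchy of graphs has already been built. It uses a single greedy walker per layer rather than the candidate-list bookkeeping of the full HNSW algorithm, so treat it as an illustration of the idea rather than a faithful implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def greedy_search_layer(query, vectors, graph, entry):
    """Greedy search within a single layer: move to the neighbour most similar
    to the query until no neighbour is closer than the current node."""
    current = entry
    current_sim = cosine_similarity(query, vectors[current])
    while True:
        best, best_sim = None, current_sim
        for j in graph[current]:
            sim = cosine_similarity(query, vectors[j])
            if sim > best_sim:
                best, best_sim = j, sim
        if best is None:                 # local optimum in this layer
            return current
        current, current_sim = best, best_sim

def hnsw_search(query, vectors, layers, entry_point):
    """Start at the sparse top layer, find the closest node there, then use it
    as the entry point for the next, denser layer, down to the bottom layer."""
    current = entry_point
    for graph in reversed(layers):       # layers[0] is the bottom (full) layer
        current = greedy_search_layer(query, vectors, graph, current)
    return current

# Toy hierarchy over 6 vectors: layer 0 holds every node, layer 1 only a few.
vectors = np.random.default_rng(7).normal(size=(6, 8))
layers = [
    {0: {1, 5}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 0}},  # bottom layer
    {0: {3, 5}, 3: {0, 5}, 5: {0, 3}},                                    # sparse top layer
]
print(hnsw_search(vectors[2] + 0.01, vectors, layers, entry_point=5))  # index of a (locally) nearest node
```

In practical implementations the bottom-layer search keeps a list of candidate nodes rather than a single walker, which improves recall, but the layered greedy descent shown here is the core idea.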