How pgvector works

from mqp on Discord

pgvector isn't as weird as it looks, i suggest just understanding it, it took me about half an hour of reading and thinking about the papers etc. it points to. the short version is

when you make the index it uses some kind of k-means clustering sort of algorithm to partition all the existing data points into K spatial clusters. iirc K = the lists parameter to the index
it never changes these clusters. data points are assigned to the clusters when you insert/update/delete
when you do a query of the form select order by distance to point limit N it tries to figure out whether it should use the index based on normal stuff like how large N is compared to the table size

so if you understand this then you understand

if it decides it should use the index, then it will determine N clusters to look at, N = the probes pgvector config, it will pick the N clusters with the closest centroids to the point you are looking for close rows to
then it will look for points in those clusters only for your query, speeding up your query by a factor of ~N/K compared to if it scanned all points

the index works with some specific distance metric because it used that distance metric to make the clusters, you can't use the same index with some other distance metric
it may not return all matches, because some closest point matches may have been in clusters that it didn't check -- the centroids only give it a good heuristic
higher N config will make it take longer proportional to N but have a higher chance of catching all matches
higher K config will have similar effects i am too lazy to type but which you can reason about
the quality of the index will depend on how well the clusters were chosen, a great way to have terrible clusters would be to change all the data points and never rebuild the index, if you do that then when you do a query you will pick stupid clusters and get no results or terrible results

Actionable bits

I’m trying to make an index on user interest vectors. What does the index do? pgvector partitions the data into a set of “lists”, each of which covers a distinct partition of the multi-dimensional space vectors are embedded in. When initially building an index, the system takes a sample of vectors and runs a K-means clustering of the sample, to generate the space partitions for each list.

-- This index has a small number of lists and resulted in the planner 
-- ignoring my index using the query from below
create index concurrently if not exists user_embeddings_interest_embedding_2 on user_embeddings using ivfflat (interest_embedding vector_cosine_ops)
    with (lists = 100);
-- this index has a larger number of lists and the planner used my index
create index concurrently if not exists user_embeddings_interest_embedding_2 on user_embeddings using ivfflat (interest_embedding vector_cosine_ops)
    with (lists = 500);

Another person’s experience with changing their lists parameter:

Untitled

I’m trying to find users who’ve similar interest vectors to myself, or I’m trying to find similar contracts to my contract.