from mqp on Discord
pgvector isn't as weird as it looks, i suggest just understanding it, it took me about half an hour of reading and thinking about the papers etc. it points to. the short version is
- when you make the index it uses some kind of k-means clustering sort of algorithm to partition all the existing data points into K spatial clusters. iirc K = the
lists
parameter to the index
- it never changes these clusters. data points are assigned to the clusters when you insert/update/delete
- when you do a query of the form
select order by distance to point limit N
it tries to figure out whether it should use the index based on normal stuff like how large N is compared to the table size
so if you understand this then you understand
- if it decides it should use the index, then it will determine N clusters to look at, N = the
probes
pgvector config, it will pick the N clusters with the closest centroids to the point you are looking for close rows to
- then it will look for points in those clusters only for your query, speeding up your query by a factor of ~N/K compared to if it scanned all points
- the index works with some specific distance metric because it used that distance metric to make the clusters, you can't use the same index with some other distance metric
- it may not return all matches, because some closest point matches may have been in clusters that it didn't check -- the centroids only give it a good heuristic
- higher N config will make it take longer proportional to N but have a higher chance of catching all matches
- higher K config will have similar effects i am too lazy to type but which you can reason about
- the quality of the index will depend on how well the clusters were chosen, a great way to have terrible clusters would be to change all the data points and never rebuild the index, if you do that then when you do a query you will pick stupid clusters and get no results or terrible results
Actionable bits
- I’m trying to make an index on user interest vectors. What does the index do? pgvector partitions the data into a set of “lists”, each of which covers a distinct partition of the multi-dimensional space vectors are embedded in. When initially building an index, the system takes a sample of vectors and runs a K-means clustering of the sample, to generate the space partitions for each list.
-- This index has a small number of lists and resulted in the planner
-- ignoring my index using the query from below
create index concurrently if not exists user_embeddings_interest_embedding_2 on user_embeddings using ivfflat (interest_embedding vector_cosine_ops)
with (lists = 100);
-- this index has a larger number of lists and the planner used my index
create index concurrently if not exists user_embeddings_interest_embedding_2 on user_embeddings using ivfflat (interest_embedding vector_cosine_ops)
with (lists = 500);
Another person’s experience with changing their lists parameter:
- I’m trying to find users who’ve similar interest vectors to myself, or I’m trying to find similar contracts to my contract.