Auto-Embeddings
Generate embeddings for your application automatically
Auto-emeddings leverages pgvector and OpenAI to easily generate embeddings for your data in near real-time and without hassle. In addition, Auto-Embeddings allows you to seamlessly query your data using natural language or even using another object in your database (similarity search). Finally, the embeddings generated by Auto-Embeddings can be leveraged by your own code to perform any analysis or provide any extra functionality you may need.
Generating embeddings
Embeddings generation works in the following way:
- Every
sync-period
(5 minutes by default) graphite queries your data and searches for new data (or existing data that changed recently). - It extracts the relevant data for generating the embeddings
- It requests OpenAI to generate embeddings
- It stores the embeddings alongside the data.
Querying embeddings
Once the embeddings have been generated you can leverage pgvector to perform any operation you may want. In addition, Auto-Embeddings will automatically provide two new queries; graphiteSearchXXX
(search using natural language) and graphiteSimilarXXX
(search for objects similar to a given object), where XXX
is a name given by you.
Demo
To demonstrate how to configure Auto-Embeddings we will use an example project where we are storing movies.
Before we start, let’s start by explaining the project. Our project contains a single table called movies
with the following columns:
Our goal is going to be to generate embeddings using the data in the columns name
, genre
and overview
.
Preparing your database
Before we can start generating embeddings we need to prepare our database:
- First we are going to need a column to store the embeddings.
- Finally, we are going to need a mechanism to detect when embeddings need to be regenerated.
Embeddings Column
Creating a column to store embeddings is as easy as creating any other column. Just make sure it is of type vector(1536)
and it can be NULL
. For our project we can simply go to the SQL tab and create a migration with the following content:
-- in this example we are using "embeddings" as the name for our column
-- but you can choose anything you want
ALTER TABLE public.movies ADD COLUMN embeddings vector(1536) NULL;
For instance:
Detecting Changes
On every sync-period
graphite will perform a graphql query to get all the rows that have outdated embeddings. This means we can build this query in a way that:
- Gets the data we need for the embeddings.
- Retrieves objects with the embeddings column set to
NULL
. - Leverages another mechanism to detect which rows needs their embeddings regenerated.
In this example we are going to opt for the following mechanism to detect outdated embeddings:
- Add a column
outdated
(boolean) to indicated whether the row is outdated or not. - Add a postgres trigger and function that will set the
outdated
column to true everytime there is a change to our data.
With this in mind we are going to create a migration with the following contents:
-- Add the "outdated" column of type "boolean" with a default value of true
ALTER TABLE "movies" ADD COLUMN "outdated" boolean DEFAULT true;
-- Create a trigger that sets "outdated" to true if the columns
-- "name", "genre" or "overview" are updated
CREATE OR REPLACE FUNCTION set_outdated_trigger()
RETURNS TRIGGER AS $$
BEGIN
IF NEW.name <> OLD.name OR NEW.genre <> OLD.genre OR NEW.overview <> OLD.overview THEN
NEW.outdated := true;
ELSEIF NEW.embeddings IS NOT NULL THEN
NEW.outdated := false;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- Attach the trigger to the table
CREATE TRIGGER set_outdated
BEFORE UPDATE ON "movies"
FOR EACH ROW
EXECUTE FUNCTION set_outdated_trigger();
For instance:
Now, graphite
’s query will be (more on this later):
query GetOutdatedMovies {
movies(where: {
_or: [
{embeddings: {_is_null: true}}, # new rows without embeddings
{outdated: {_eq: true}, # existing rows with changed data
},
]}) {
id # id column is mandatory
name
genre
overview
}
}
outdated
directly from your application or you could use some completely different mechanism (i.e. a computed field). The important bit is that graphite needs to be able to make a graphql query and get the relevant rows and data.Configuring Auto-Embeddings
Now that we have prepared our database we can proceed to configure Auto-Embeddings. You will need the following data:
- A unique name. We are going to use
movies
for this particular example but it can be anything. This will determine the name of the GraphQL queriesgraphiteSearchXXX
andgraphiteSimilarXXX
. - The location of the embeddings column; schema, table and column names, in this example
public
,movies
andembeddings
respectively. - A GraphQL query to retrieve the outdated rows and their new data (the query we worked on in the previous section)
- A GraphQL mutation that takes the
id
of the object, the embeddings and that updates the relevant object. For instance, in this particular example the following mutation would suffice:
mutation UpdateEmbeddingsMovie($id: uuid!, $embeddings: vector) {
updateMovie(pk_columns: {id: $id}, _set: {embeddings: $embeddings}) {
__typename
}
}
NOTE: It is important that the query returns the id of the object and that the mutation accepts it as otherwise graphite won’t know which object to update.
Now that we have all the data we need, adding the configuration is as simple as running the following graphql mutation:
Aftermath
Embeddings Generation
After executing the mutation above two things will happen; the first one is that if we look at our logs we will start seeing entries like this on the next sync-period
:
The logs indicate that graphite has started to generate embeddings for the movies. We can track the progress by counting movies with the embeddings
column set to NULL
:
Until eventually it reaches 0.
Natural Language and Similarity Search
The second thing that will happen is that the queries graphiteSearchMovies
, graphiteSearchMoviesAggregate
, graphiteSimilarMovies
and graphiteSimilarMoviesAggregate
will be created. These queries will work similar to the standard movies
and moviesAggregate
queries provided by hasura and will respect the same permissions but they will also allow you to query movies using natural language or other movies for comparison. For instance:
Both queries accept a third argument maxDistance
. This argument allows you to filter responses that are too far from the query, useful for ensuring that very specific queries do not return entirely unrelated responses. The argument is a float between 0.0
(an exact match) and 1.0 (completely unrelated), with a default setting of 1.0
, ensuring the best matches are always returned even if they are unrelated to the query.