Task Overview
This guide will walk you through the process of building a Graph-Vector RAG in Python to recommend professors based on their research areas.
Introduction
We want to create a Python script that uses HelixDB to help students find professors based on their research areas and bios. For example, a user could ask:
- “Which professors do High-Energy Physics research?”
- “Which professors work at X University?”
- “Which professors work in the Computer Science department?”
- “Which professors work at X University in the Computer Science department?”
- “I like doing research in Large Language Models. Can you recommend some professors doing this at X University?”
The dataset
In this example, we have a dataset of professors with fields such as:
- Name
- Title
- Department(s)
- University
- Their page URL
- Short biography
- Key Research Areas and their descriptions
- Lab(s): each lab’s name and research focus
We will be ingesting this data from a JSON file; an example is shown below:
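As a hypothetical stand-in for such a record, the sketch below shows what one entry might look like and how it could be loaded. Every key name and value here (`departments`, `key_research_areas`, "Jane Doe", the example URL, and the helper `load_professors`) is an assumption chosen to mirror the field list above, not the actual dataset schema.

```python
import json

# Hypothetical record: key names and values are placeholders that mirror
# the field list above, not the real dataset schema.
SAMPLE_JSON = """
[
  {
    "name": "Jane Doe",
    "title": "Associate Professor",
    "departments": ["Computer Science"],
    "university": "X University",
    "page": "https://example.edu/jane-doe",
    "bio": "Jane Doe studies large language models.",
    "key_research_areas": [
      {"area": "Large Language Models",
       "description": "Training and evaluating language models."}
    ],
    "labs": [
      {"name": "Example NLP Lab",
       "research_focus": "Natural language processing"}
    ]
  }
]
"""

REQUIRED_KEYS = {
    "name", "title", "departments", "university",
    "page", "bio", "key_research_areas", "labs",
}


def load_professors(raw: str) -> list[dict]:
    """Parse JSON text and check that every record has the expected keys."""
    professors = json.loads(raw)
    for record in professors:
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record is missing keys: {sorted(missing)}")
    return professors


professors = load_professors(SAMPLE_JSON)
print(professors[0]["name"], "-", professors[0]["university"])
```

In a real script the text would come from a file (for example via `json.load` on an open file handle) rather than an inline string.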
Building a Graph
Based on this data, we can create a Vector Graph RAG with the following nodes and edges:
Nodes:
- Professor Node with properties name, title, page, bio
- Research Area Node with properties area and description
- Department Node with the property name
- University Node with the property name
- Lab Node with properties name, research_focus
Vector Nodes for Embeddings:
For each professor, we will embed a single string that combines their research areas and the descriptions of those areas.
- Professor Combined Research Area and Description Node with the property areas_and_descriptions
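One way the combined string could be built before embedding is sketched below. The helper name `combine_research_areas`, the "area: description" layout, and the newline separator are assumptions for illustration; the guide does not specify the exact format.

```python
def combine_research_areas(research_areas: list[dict]) -> str:
    """Join each research area with its description into one string to embed."""
    parts = [f"{item['area']}: {item['description']}" for item in research_areas]
    return "\n".join(parts)


# Hypothetical sample input with the Research Area properties named above.
areas = [
    {"area": "Large Language Models",
     "description": "Training and evaluating language models."},
    {"area": "Information Retrieval",
     "description": "Search and ranking methods."},
]

areas_and_descriptions = combine_research_areas(areas)
print(areas_and_descriptions)
```

The resulting string is what would be passed to an embedding model and stored on the vector node.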
Edges:
- Professor to Research Area Edge
- Professor to Department Edge
- Professor to University Edge
- Professor to Lab Edge
- Professor to Professor Combined Research Area and Description Embedding
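To make the node and edge layout concrete, here is a minimal in-memory sketch of the same shape in plain Python. It does not use HelixDB; the `Graph` class and the edge labels `WorksAt` and `BelongsTo` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str        # e.g. "Professor", "University"
    properties: dict  # e.g. {"name": "X University"}


@dataclass
class Graph:
    nodes: list[Node] = field(default_factory=list)
    # Each edge is (source node id, edge label, target node id).
    edges: list[tuple[int, str, int]] = field(default_factory=list)

    def add_node(self, label: str, **properties) -> int:
        """Store a node and return its id (its index in the node list)."""
        self.nodes.append(Node(label, properties))
        return len(self.nodes) - 1

    def add_edge(self, src: int, label: str, dst: int) -> None:
        self.edges.append((src, label, dst))

    def neighbors(self, src: int, label: str) -> list[Node]:
        """Follow all outgoing edges with the given label from one node."""
        return [self.nodes[d] for s, l, d in self.edges if s == src and l == label]


graph = Graph()
prof = graph.add_node("Professor", name="Jane Doe", title="Associate Professor",
                      page="https://example.edu/jane-doe", bio="Studies LLMs.")
uni = graph.add_node("University", name="X University")
dept = graph.add_node("Department", name="Computer Science")
graph.add_edge(prof, "WorksAt", uni)     # Professor to University Edge
graph.add_edge(prof, "BelongsTo", dept)  # Professor to Department Edge

print([n.properties["name"] for n in graph.neighbors(prof, "WorksAt")])
```

A graph database stores the same structure persistently and indexes it, so traversals like `neighbors` stay fast as the data grows.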
Why do we need a Vector Graph RAG?
Imagine this at a larger scale, with data on professors, research areas, departments, academic achievements, and labs across 1000+ universities. A graph connects these entities to one another, so we can traverse it quickly to find all professors working in a specific university, department, research area, or lab. For example, we can filter the graph to find all professors working at X University using FILTER in HelixQL. Vector embeddings, in turn, let us find the professors most similar to a free-form query, e.g. “Which professors have worked in a startup before?”.
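The combination of the two steps can be sketched in plain Python: first narrow the candidates with an exact filter (the graph side), then rank what remains by embedding similarity (the vector side). This is only an illustration of the idea, not HelixDB or HelixQL code; the three-dimensional embeddings, the professor names, and the `recommend` helper are made-up assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: real embeddings would come from an embedding model.
professors = [
    {"name": "Professor A", "university": "X University", "embedding": [0.9, 0.1, 0.0]},
    {"name": "Professor B", "university": "X University", "embedding": [0.1, 0.9, 0.0]},
    {"name": "Professor C", "university": "Y University", "embedding": [0.95, 0.05, 0.0]},
]


def recommend(query_embedding: list[float], university: str, k: int = 1) -> list[str]:
    # Step 1 (graph side): keep only professors at the requested university.
    candidates = [p for p in professors if p["university"] == university]
    # Step 2 (vector side): rank the remaining professors by similarity.
    ranked = sorted(
        candidates,
        key=lambda p: cosine_similarity(query_embedding, p["embedding"]),
        reverse=True,
    )
    return [p["name"] for p in ranked[:k]]


print(recommend([1.0, 0.0, 0.0], "X University"))
```

Professor C has the closest embedding overall but is excluded by the university filter, which is the behavior the hybrid approach is meant to provide.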