Skip to main content

Keyword Search using BM25 with SearchBM25  

Search for keywords in nodes using the BM25 ranking algorithm.
SearchBM25<Type>(text, limit)
BM25 is a ranking function used for full-text search. It searches through the text properties of nodes and ranks results based on keyword relevance and frequency.
When using the SDKs or curling the endpoint, the query name must match what is defined in the queries.hx file exactly.
QUERY SearchKeyword (keywords: String, limit: I64) =>
    documents <- SearchBM25<Document>(keywords, limit)
    RETURN documents

QUERY InsertDocument (content: String, created_at: Date) =>
    document <- AddN<Document>({ content: content, created_at: created_at })
    RETURN document
Here’s how to run the query using the SDKs or curl
from datetime import datetime, timezone
from helix.client import Client

client = Client(local=True, port=6969)

sample_docs = [
    "Machine learning algorithms for data analysis",
    "Introduction to artificial intelligence and neural networks",
    "Database optimization techniques and performance tuning",
    "Web development with modern JavaScript frameworks"
]

for content in sample_docs:
    client.query("InsertDocument", {
        "content": content,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })

result = client.query("SearchKeyword", {
    "keywords": "machine learning algorithms",
    "limit": 5
})

print(result)

Example 2: Keyword search with postfiltering

QUERY SearchRecentKeywords (keywords: String, limit: I64, cutoff_date: Date) =>
    searched_docs <- SearchBM25<Document>(keywords, limit)
    documents <- searched_docs::WHERE(_::{created_at}::GTE(cutoff_date))
    RETURN documents

QUERY InsertDocument (content: String, created_at: Date) =>
    document <- AddN<Document>({ content: content, created_at: created_at })
    RETURN document
Here’s how to run the query using the SDKs or curl
from datetime import datetime, timezone, timedelta
from helix.client import Client

client = Client(local=True, port=6969)

recent_date = datetime.now(timezone.utc).isoformat()
old_date = (datetime.now(timezone.utc) - timedelta(days=15)).isoformat()

recent_docs = [
    "Modern machine learning techniques in 2024",
    "Latest artificial intelligence research papers"
]

for content in recent_docs:
    client.query("InsertDocument", {
        "content": content,
        "created_at": recent_date,
    })

old_docs = [
    "Traditional machine learning approaches from last year",
    "Historical AI development milestones"
]

for content in old_docs:
    client.query("InsertDocument", {
        "content": content,
        "created_at": old_date,
    })

cutoff_date = (datetime.now(timezone.utc) - timedelta(days=10)).isoformat()

result = client.query("SearchRecentKeywords", {
    "keywords": "machine learning artificial intelligence",
    "limit": 5,
    "cutoff_date": cutoff_date,
})

print(result)