HelixDB Docs

Understanding Grouping Operations

HelixDB provides two powerful operations for organizing and summarizing data: GROUP_BY and AGGREGATE_BY. While they may seem similar, they serve different purposes and return different results.

Key Differences

Feature	GROUP_BY	AGGREGATE_BY
Returns	Count summaries only	Full data objects + counts
Memory Usage	Low - only stores counts	Higher - stores all objects
Use Case	Analytics, distributions	Detailed reports, processing
Output Size	Small, compact	Large, comprehensive
Best For	Dashboards, statistics	Data analysis, transformations

Syntax Comparison

Both operations support single or multiple properties:

// GROUP_BY - Returns counts only
::GROUP_BY(property)
::GROUP_BY(property1, property2, ...)

// AGGREGATE_BY - Returns full data
::AGGREGATE_BY(property)
::AGGREGATE_BY(property1, property2, ...)

Output Format Comparison

GROUP_BY Output

[
    {'country': 'USA', 'count': 3},
    {'country': 'Canada', 'count': 2},
    {'country': 'UK', 'count': 1}
]

AGGREGATE_BY Output

[
    {
        'count': 3,
        'data': [
            {'name': 'Alice', 'country': 'USA', 'age': 28},
            {'name': 'Charlie', 'country': 'USA', 'age': 25},
            {'name': 'Frank', 'country': 'USA', 'age': 35}
        ]
    },
    {
        'count': 2,
        'data': [
            {'name': 'Bob', 'country': 'Canada', 'age': 32},
            {'name': 'Eve', 'country': 'Canada', 'age': 27}
        ]
    }
]

Performance Characteristics

GROUP_BY Performance

Memory: O(n) where n = number of unique groups
Speed: Fast - only counts are stored
Bandwidth: Minimal - small response size
Scalability: Excellent for large datasets

AGGREGATE_BY Performance

Memory: O(m) where m = total number of items
Speed: Moderate - full objects stored
Bandwidth: Higher - complete data returned
Scalability: Good for moderate datasets

For large datasets where you only need counts, GROUP_BY can be orders of magnitude more efficient in terms of memory and bandwidth usage.

Use Case Decision Tree

Do you need the actual data objects?
│
├─ YES → Do you need to process/transform them?
│         │
│         ├─ YES → Use AGGREGATE_BY
│         │        (You need the full objects)
│         │
│         └─ NO  → Do you need to display them?
│                  │
│                  ├─ YES → Use AGGREGATE_BY
│                  │        (You need to show details)
│                  │
│                  └─ NO  → Use GROUP_BY
│                           (You only need counts)
│
└─ NO  → Use GROUP_BY
          (Counts are sufficient)

Best Practices

Use GROUP_BY When:

Building analytics dashboards
Showing data distributions
Generating summary reports
Optimizing for memory/bandwidth
Working with large datasets (millions of records)
Creating charts or graphs

Use AGGREGATE_BY When:

Need to process grouped data further
Building detailed reports with examples
Need to display sample records per group
Performing transformations on grouped items
Working with moderate datasets (thousands of records)
Building data exploration interfaces

When using the SDKs or curling the endpoint, the query name must match what is defined in the queries.hx file exactly.

Example 1: Side-by-Side Comparison - User Distribution

QUERY GroupUsersByCountry () =>
    users <- N<User>
    RETURN users::GROUP_BY(country)

QUERY AggregateUsersByCountry () =>
    users <- N<User>
    RETURN users::AGGREGATE_BY(country)

QUERY CreateUser (name: String, country: String, age: U8) =>
    user <- AddN<User>({
        name: name,
        country: country,
        age: age
    })
    RETURN user

Here’s how to run both queries using the SDKs or curl

from helix.client import Client

client = Client(local=True, port=6969)

users = [
    {"name": "Alice", "country": "USA", "age": 28},
    {"name": "Bob", "country": "Canada", "age": 32},
    {"name": "Charlie", "country": "USA", "age": 25},
    {"name": "Diana", "country": "UK", "age": 30},
    {"name": "Eve", "country": "Canada", "age": 27},
    {"name": "Frank", "country": "USA", "age": 35},
]

for user in users:
    client.query("CreateUser", user)

# GROUP_BY - Returns only counts
group_result = client.query("GroupUsersByCountry", {})
print("GROUP_BY result:", group_result)
# Output: [{'country': 'USA', 'count': 3}, {'country': 'Canada', 'count': 2}, ...]

# AGGREGATE_BY - Returns full data
aggregate_result = client.query("AggregateUsersByCountry", {})
print("AGGREGATE_BY result:", aggregate_result)
# Output: [{'count': 3, 'data': [{'name': 'Alice', ...}, ...]}, ...]

Example 2: Using COUNT with Both Operations

QUERY CountWithGroupBy () =>
    orders <- N<Order>
    RETURN orders::COUNT::GROUP_BY(status)

QUERY CountWithAggregateBy () =>
    orders <- N<Order>
    RETURN orders::COUNT::AGGREGATE_BY(status)

QUERY CreateOrder (customer_name: String, status: String, total: F64) =>
    order <- AddN<Order>({
        customer_name: customer_name,
        status: status,
        total: total
    })
    RETURN order

Here’s how to run both queries using the SDKs or curl

from helix.client import Client

client = Client(local=True, port=6969)

orders = [
    {"customer_name": "Alice", "status": "pending", "total": 99.99},
    {"customer_name": "Bob", "status": "shipped", "total": 149.99},
    {"customer_name": "Charlie", "status": "pending", "total": 75.50},
    {"customer_name": "Diana", "status": "delivered", "total": 200.00},
    {"customer_name": "Eve", "status": "shipped", "total": 89.99},
]

for order in orders:
    client.query("CreateOrder", order)

# COUNT with GROUP_BY - Compact counts
group_count = client.query("CountWithGroupBy", {})
print("COUNT::GROUP_BY result:", group_count)
# Output: [{'status': 'pending', 'count': 2}, {'status': 'shipped', 'count': 2}, ...]

# COUNT with AGGREGATE_BY - Counts with data
aggregate_count = client.query("CountWithAggregateBy", {})
print("COUNT::AGGREGATE_BY result:", aggregate_count)
# Output: [{'count': 2, 'data': [{'customer_name': 'Alice', ...}, ...]}, ...]

Common Pitfalls

Memory Issues with AGGREGATE_BY

// ❌ Bad: AGGREGATE_BY on millions of records
users <- N<User>
RETURN users::AGGREGATE_BY(country)

// ✅ Good: Use GROUP_BY for large datasets
users <- N<User>
RETURN users::GROUP_BY(country)

Using GROUP_BY When You Need Data

// ❌ Bad: GROUP_BY when you need to process items
items <- N<Item>
grouped <- items::GROUP_BY(category)
// Can't access individual items here!

// ✅ Good: Use AGGREGATE_BY to access data
items <- N<Item>
grouped <- items::AGGREGATE_BY(category)
// Now you have access to full item data

Summary

Choose the right operation for your use case:

GROUP_BY: Lightweight, fast, perfect for counts and distributions
AGGREGATE_BY: Comprehensive, detailed, ideal for data processing

Both operations are powerful tools in your HelixDB toolkit. Understanding when to use each will help you build efficient and effective queries.

GROUP_BY

Group results with count summaries

AGGREGATE_BY

Aggregate results with full data objects

COUNT Operation

Count operation and other result operations

Property Access

Property filtering and access patterns

HelixQL

Grouping and Aggregation Guide

Understanding Grouping Operations

Key Differences

Syntax Comparison

Output Format Comparison

GROUP_BY Output

AGGREGATE_BY Output

Performance Characteristics

GROUP_BY Performance

AGGREGATE_BY Performance

Use Case Decision Tree

Best Practices

Use GROUP_BY When:

Use AGGREGATE_BY When:

Example 1: Side-by-Side Comparison - User Distribution

Example 2: Using COUNT with Both Operations

Common Pitfalls

Memory Issues with AGGREGATE_BY

Using GROUP_BY When You Need Data

Summary

GROUP_BY

AGGREGATE_BY

COUNT Operation

Property Access

HelixQL

​Understanding Grouping Operations

​Key Differences

​Syntax Comparison

​Output Format Comparison

​GROUP_BY Output

​AGGREGATE_BY Output

​Performance Characteristics

​GROUP_BY Performance

​AGGREGATE_BY Performance

​Use Case Decision Tree

​Best Practices

​Use GROUP_BY When:

​Use AGGREGATE_BY When:

​Example 1: Side-by-Side Comparison - User Distribution

​Example 2: Using COUNT with Both Operations

​Common Pitfalls

​Memory Issues with AGGREGATE_BY

​Using GROUP_BY When You Need Data

​Summary

​Related Topics

GROUP_BY

AGGREGATE_BY

COUNT Operation

Property Access

Understanding Grouping Operations

Key Differences

Syntax Comparison

Output Format Comparison

GROUP_BY Output

AGGREGATE_BY Output

Performance Characteristics

GROUP_BY Performance

AGGREGATE_BY Performance

Use Case Decision Tree

Best Practices

Use GROUP_BY When:

Use AGGREGATE_BY When:

Example 1: Side-by-Side Comparison - User Distribution

Example 2: Using COUNT with Both Operations

Common Pitfalls

Memory Issues with AGGREGATE_BY

Using GROUP_BY When You Need Data

Summary

Related Topics