Skip to main content

Understanding Grouping Operations  

HelixDB provides two powerful operations for organizing and summarizing data: GROUP_BY and AGGREGATE_BY. While they may seem similar, they serve different purposes and return different results.

Key Differences

FeatureGROUP_BYAGGREGATE_BY
ReturnsCount summaries onlyFull data objects + counts
Memory UsageLow - only stores countsHigher - stores all objects
Use CaseAnalytics, distributionsDetailed reports, processing
Output SizeSmall, compactLarge, comprehensive
Best ForDashboards, statisticsData analysis, transformations

Syntax Comparison

Both operations support single or multiple properties:
// GROUP_BY - Returns counts only
::GROUP_BY(property)
::GROUP_BY(property1, property2, ...)

// AGGREGATE_BY - Returns full data
::AGGREGATE_BY(property)
::AGGREGATE_BY(property1, property2, ...)

Output Format Comparison

GROUP_BY Output

[
    {'country': 'USA', 'count': 3},
    {'country': 'Canada', 'count': 2},
    {'country': 'UK', 'count': 1}
]

AGGREGATE_BY Output

[
    {
        'count': 3,
        'data': [
            {'name': 'Alice', 'country': 'USA', 'age': 28},
            {'name': 'Charlie', 'country': 'USA', 'age': 25},
            {'name': 'Frank', 'country': 'USA', 'age': 35}
        ]
    },
    {
        'count': 2,
        'data': [
            {'name': 'Bob', 'country': 'Canada', 'age': 32},
            {'name': 'Eve', 'country': 'Canada', 'age': 27}
        ]
    }
]

Performance Characteristics

GROUP_BY Performance

  • Memory: O(n) where n = number of unique groups
  • Speed: Fast - only counts are stored
  • Bandwidth: Minimal - small response size
  • Scalability: Excellent for large datasets

AGGREGATE_BY Performance

  • Memory: O(m) where m = total number of items
  • Speed: Moderate - full objects stored
  • Bandwidth: Higher - complete data returned
  • Scalability: Good for moderate datasets
For large datasets where you only need counts, GROUP_BY can be orders of magnitude more efficient in terms of memory and bandwidth usage.

Use Case Decision Tree

Do you need the actual data objects?

├─ YES → Do you need to process/transform them?
│         │
│         ├─ YES → Use AGGREGATE_BY
│         │        (You need the full objects)
│         │
│         └─ NO  → Do you need to display them?
│                  │
│                  ├─ YES → Use AGGREGATE_BY
│                  │        (You need to show details)
│                  │
│                  └─ NO  → Use GROUP_BY
│                           (You only need counts)

└─ NO  → Use GROUP_BY
          (Counts are sufficient)

Best Practices

Use GROUP_BY When:

  1. Building analytics dashboards
  2. Showing data distributions
  3. Generating summary reports
  4. Optimizing for memory/bandwidth
  5. Working with large datasets (millions of records)
  6. Creating charts or graphs

Use AGGREGATE_BY When:

  1. Need to process grouped data further
  2. Building detailed reports with examples
  3. Need to display sample records per group
  4. Performing transformations on grouped items
  5. Working with moderate datasets (thousands of records)
  6. Building data exploration interfaces
When using the SDKs or curling the endpoint, the query name must match what is defined in the queries.hx file exactly.

Example 1: Side-by-Side Comparison - User Distribution

QUERY GroupUsersByCountry () =>
    users <- N<User>
    RETURN users::GROUP_BY(country)

QUERY AggregateUsersByCountry () =>
    users <- N<User>
    RETURN users::AGGREGATE_BY(country)

QUERY CreateUser (name: String, country: String, age: U8) =>
    user <- AddN<User>({
        name: name,
        country: country,
        age: age
    })
    RETURN user
Here’s how to run both queries using the SDKs or curl
from helix.client import Client

client = Client(local=True, port=6969)

users = [
    {"name": "Alice", "country": "USA", "age": 28},
    {"name": "Bob", "country": "Canada", "age": 32},
    {"name": "Charlie", "country": "USA", "age": 25},
    {"name": "Diana", "country": "UK", "age": 30},
    {"name": "Eve", "country": "Canada", "age": 27},
    {"name": "Frank", "country": "USA", "age": 35},
]

for user in users:
    client.query("CreateUser", user)

# GROUP_BY - Returns only counts
group_result = client.query("GroupUsersByCountry", {})
print("GROUP_BY result:", group_result)
# Output: [{'country': 'USA', 'count': 3}, {'country': 'Canada', 'count': 2}, ...]

# AGGREGATE_BY - Returns full data
aggregate_result = client.query("AggregateUsersByCountry", {})
print("AGGREGATE_BY result:", aggregate_result)
# Output: [{'count': 3, 'data': [{'name': 'Alice', ...}, ...]}, ...]

Example 2: Using COUNT with Both Operations

QUERY CountWithGroupBy () =>
    orders <- N<Order>
    RETURN orders::COUNT::GROUP_BY(status)

QUERY CountWithAggregateBy () =>
    orders <- N<Order>
    RETURN orders::COUNT::AGGREGATE_BY(status)

QUERY CreateOrder (customer_name: String, status: String, total: F64) =>
    order <- AddN<Order>({
        customer_name: customer_name,
        status: status,
        total: total
    })
    RETURN order
Here’s how to run both queries using the SDKs or curl
from helix.client import Client

client = Client(local=True, port=6969)

orders = [
    {"customer_name": "Alice", "status": "pending", "total": 99.99},
    {"customer_name": "Bob", "status": "shipped", "total": 149.99},
    {"customer_name": "Charlie", "status": "pending", "total": 75.50},
    {"customer_name": "Diana", "status": "delivered", "total": 200.00},
    {"customer_name": "Eve", "status": "shipped", "total": 89.99},
]

for order in orders:
    client.query("CreateOrder", order)

# COUNT with GROUP_BY - Compact counts
group_count = client.query("CountWithGroupBy", {})
print("COUNT::GROUP_BY result:", group_count)
# Output: [{'status': 'pending', 'count': 2}, {'status': 'shipped', 'count': 2}, ...]

# COUNT with AGGREGATE_BY - Counts with data
aggregate_count = client.query("CountWithAggregateBy", {})
print("COUNT::AGGREGATE_BY result:", aggregate_count)
# Output: [{'count': 2, 'data': [{'customer_name': 'Alice', ...}, ...]}, ...]

Common Pitfalls

Memory Issues with AGGREGATE_BY

// ❌ Bad: AGGREGATE_BY on millions of records
users <- N<User>
RETURN users::AGGREGATE_BY(country)

// ✅ Good: Use GROUP_BY for large datasets
users <- N<User>
RETURN users::GROUP_BY(country)

Using GROUP_BY When You Need Data

// ❌ Bad: GROUP_BY when you need to process items
items <- N<Item>
grouped <- items::GROUP_BY(category)
// Can't access individual items here!

// ✅ Good: Use AGGREGATE_BY to access data
items <- N<Item>
grouped <- items::AGGREGATE_BY(category)
// Now you have access to full item data

Summary

Choose the right operation for your use case:
  • GROUP_BY: Lightweight, fast, perfect for counts and distributions
  • AGGREGATE_BY: Comprehensive, detailed, ideal for data processing
Both operations are powerful tools in your HelixDB toolkit. Understanding when to use each will help you build efficient and effective queries.