Graph Database Design for Recommender Systems

Design patterns to make your graph database serve as a feature store, model store, and serving layer

Building a recommender system on a graph database isn’t just about modelling relationships; it’s about designing a schema that scales, performs, and evolves with your algorithms. This deep dive explores how we designed a Neo4j schema to power 15+ different recommendation approaches for Steam gaming data.

The key insight? Your graph schema becomes your feature store, similarity engine, and serving layer all in one. But only if you design it right.

The Schema Design Challenge

Most graph database tutorials show you how to create nodes and relationships. Production recommender systems need to solve harder problems:

Multi-algorithm support: Content-based, collaborative filtering, embeddings, and deep learning all need different graph patterns
Performance at scale: 50M+ user-game interactions with sub-100ms query times
Feature engineering: Converting raw properties into recommendation-ready features
Schema evolution: Adding new node types and relationships without breaking existing algorithms

Our Steam recommender handles all of this through a layered schema design that separates core data from computed features.

Core Schema: The Foundation Layer

The foundation layer models the essential Steam gaming entities and their natural relationships:

Users anchor the entire graph. Every recommendation algorithm traces paths from users to discover similarity patterns. We keep user nodes lean—just identifier, display name, and creation timestamp. Rich profile data gets modelled as separate connected nodes.

Apps as Feature Aggregators

CREATE (a:APP {
    appid: 413150,
    title: "Stardew Valley",
    price: 14.99,
    required_age: 0,
    is_multiplayer: true
})

Games aggregate features from multiple dimensions. Rather than storing all metadata directly on APP nodes, we model features as separate nodes (GENRE, DEVELOPER, TYPE). This pattern enables flexible querying and algorithm-specific feature selection.

Property-to-Node Expansion

The most powerful design pattern is converting properties into nodes. The Schema Evolution Patterns section covers it in detail.

Performance Layer: Constraints and Indexes

Graph databases perform through strategic constraints and indexes. Our schema uses five constraint patterns:

Uniqueness Constraints

CREATE CONSTRAINT steamid IF NOT EXISTS
FOR (user:USER)
REQUIRE user.steamid IS UNIQUE;
 
CREATE CONSTRAINT appid IF NOT EXISTS  
FOR (app:APP)
REQUIRE app.appid IS UNIQUE;

Every entity type gets a uniqueness constraint on its primary identifier. This enables fast MERGE operations during data loading and prevents duplicate nodes.

Composite Business Logic

CREATE CONSTRAINT genre IF NOT EXISTS
FOR (genre:GENRE) 
REQUIRE genre.genre IS UNIQUE;

Game-related entities like genres and developers use their natural names as unique identifiers. This simplifies data modeling and makes queries more readable.

Query-Optimised Indexes

Beyond uniqueness, we create indexes for common query patterns:

// Fulltext search on game titles
CREATE FULLTEXT INDEX app_title IF NOT EXISTS
FOR (n:APP) ON EACH [n.title];

// Temporal queries on user activity
CREATE INDEX user_created_at IF NOT EXISTS
FOR (n:USER) ON (n.created_at);

// Relationship properties for time-based filtering
CREATE INDEX friend_since_date IF NOT EXISTS
FOR ()-[n:FRIENDS]-() ON (n.friend_since);

The fulltext index enables keyword searches over game titles (for example, “Cyberpunk”). The temporal indexes support user cohort analysis and friendship timeline queries.

Algorithm-Specific Schema Extensions

The foundation layer handles core data. Algorithm layers add specialised patterns for different recommendation approaches.

Content-Based Features Layer

Content-based filtering requires feature vectors. We model this through a FEATURE abstraction:

graph TD
    U[USER] -->|LIKES| F[FEATURE]
    A[APP] -->|HAS| F
    U -->|OWNS| A
    F -->|IS_A| G[GENRE]
    F -->|IS_A| T[TYPE]
    F -->|IS_A| NP[N_PLAYERS]
    
    style F fill:#ffeb3b
    style U fill:#e1f5fe
    style A fill:#f3e5f5

The pattern works through inheritance:

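// Games connect to their natural features.
// g is already bound, so the extra label is added with SET rather than inside MERGE.
MATCH (a:APP)-[:HAS_GENRE]->(g:GENRE)
SET g:Feature
MERGE (a)-[:FEATURE]->(g);

// Users inherit features from owned games.
// MERGE on the relationship alone, then SET the weight, so re-runs update instead of duplicating.
MATCH (u:USER)-[:OWNS]->(a:APP)-[:FEATURE]->(f:Feature)
WITH u, f, count(a) AS weight
MERGE (u)-[l:LIKES]->(f)
SET l.weight = weight;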

Now users and games exist in the same feature space. KNN similarity becomes a graph traversal rather than matrix math.
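To make the inheritance step concrete, here is a minimal in-memory sketch in Python of what the second query computes: for each user, the number of owned games carrying each feature. The user names, app names, and the `inherit_features` helper are hypothetical stand-ins for the OWNS and FEATURE relationships, not part of the actual pipeline.

```python
from collections import Counter

# Hypothetical toy data standing in for OWNS and FEATURE relationships.
owns = {
    "alice": ["app_a", "app_b"],
    "bob": ["app_c"],
}
app_features = {
    "app_a": ["Indie", "Simulation"],
    "app_b": ["Indie", "Action"],
    "app_c": ["Action"],
}


def inherit_features(owned_apps, features_by_app):
    """Count, per feature, how many owned apps carry it (the LIKES weight)."""
    weights = Counter()
    for app in owned_apps:
        # set() guards against a feature listed twice on the same app.
        weights.update(set(features_by_app.get(app, [])))
    return dict(weights)


likes = {user: inherit_features(apps, app_features) for user, apps in owns.items()}
print(likes["alice"])
```

The resulting weights play the role of the `weight` property on LIKES: a feature shared by more of a user's games counts for more.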

Collaborative Filtering Similarities

Collaborative filtering creates computed similarity relationships:

graph TD
    U1[USER] -->|PLAYED| A1[APP]
    U2[USER] -->|PLAYED| A1
    U2 -->|PLAYED| A2[APP]
    A1 -.->|ITEM_SIMILAR| A2
    U1 -.->|USER_SIMILAR| U2
    
    style A1 fill:#f3e5f5
    style A2 fill:#f3e5f5
    style U1 fill:#e1f5fe
    style U2 fill:#e1f5fe

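The algorithm computes Jaccard similarity between nodes based on their shared neighbours (games by shared players, users by shared games), then writes similarity relationships:

// Compute item-item similarities.
// Node Similarity compares the source nodes of the projected relationships,
// so this assumes 'user_item_projection' orients PLAYED from APP to USER.
CALL gds.nodeSimilarity.write(
    'user_item_projection',
    {
        writeRelationshipType: 'ITEM_SIMILAR',
        writeProperty: 'similarity',
        similarityCutoff: 0.1,
        topK: 20
    }
)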

These computed relationships become the graph’s memory of similarity patterns.
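For intuition, the Jaccard computation with a cutoff and a top-K limit can be sketched in plain Python. This is an illustrative approximation rather than the GDS implementation; `similar_items`, `players_by_app`, and the toy identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_items(players_by_app, similarity_cutoff=0.1, top_k=20):
    """Return {app: [(other_app, score), ...]}, dropping scores below the cutoff."""
    result = {}
    for app, players in players_by_app.items():
        scored = [
            (other, jaccard(players, other_players))
            for other, other_players in players_by_app.items()
            if other != app
        ]
        kept = [(o, s) for o, s in scored if s >= similarity_cutoff]
        # Highest score first; ties broken by name for a stable order.
        kept.sort(key=lambda pair: (-pair[1], pair[0]))
        result[app] = kept[:top_k]
    return result


# Hypothetical toy data: which users played which app.
players_by_app = {
    "A1": {"u1", "u2"},
    "A2": {"u2"},
    "A3": {"u3"},
}
sims = similar_items(players_by_app)
print(sims["A1"])
```

Here A1 and A2 share one of two distinct players, so their score is 0.5, while A3 shares no players with anything and gets no similarity relationships.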

FastRP Embedding Properties

FastRP creates embedding vectors as node properties:

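// Generate 128-dimensional embeddings.
// mutate stores them on the in-memory projection; gds.fastRP.write with
// writeProperty persists them to the database instead.
CALL gds.fastRP.mutate(
    'multi_entity_graph',
    {
        embeddingDimension: 128,
        mutateProperty: 'embedding'
    }
)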

Now every node (users, games, groups, friends) has a 128-dimensional vector property. Cross-type recommendations become similarity queries in embedding space.

Schema Evolution Patterns

Production schemas evolve. Our design supports non-breaking changes through three patterns:

Additive Node Labels

New algorithms add labels to existing nodes:

// Content-based filtering adds feature labels
MATCH (u:USER) WHERE u.features IS NOT NULL
SET u:USER_WITH_FEATURES;

// Embedding algorithms add vector labels
MATCH (n) WHERE n.embedding IS NOT NULL
SET n:HAS_EMBEDDING;

Existing queries continue working. New algorithms use specialised labels for performance.

Relationship Type Expansion

New relationship types augment the schema:

// Original relationships
(u:USER)-[:PLAYED]->(a:APP)

// Algorithm-specific relationships
(u:USER)-[:USER_SIMILAR]->(u2:USER)
(a:APP)-[:ITEM_SIMILAR]->(a2:APP)
(u:USER)-[:FEATURE_SIMILAR]->(a:APP)

Each algorithm writes its own relationship types. This prevents conflicts and enables algorithm-specific optimisations.

Property-to-Node Migration

The most powerful evolution: converting properties to nodes without breaking existing code:

// Before: multiplayer as property
CREATE (a:APP {is_multiplayer: true});

// After: multiplayer as a connected node shared by all multiplayer apps
MATCH (a:APP {is_multiplayer: true})
MERGE (n:N_PLAYERS {is_multiplayer: true})
MERGE (a)-[:IS_MULTIPLAYER]->(n);

The property remains on the original node for backward compatibility. The new node enables relationship-based queries.

Query Pattern Optimisation

Schema design is only as good as query performance. We optimise for three critical patterns:

User Recommendation Queries

// Content-based recommendations
MATCH (u:USER {steamid: $userId})-[:LIKES]->(f:Feature)
MATCH (a:APP)-[:FEATURE]->(f)
WHERE NOT (u)-[:PLAYED]->(a)
RETURN a.title, count(f) as feature_overlap
ORDER BY feature_overlap DESC
LIMIT 10

Optimisation: The USER_WITH_FEATURES label restricts traversal to users with computed feature vectors, reducing query time from seconds to milliseconds.
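The ranking logic of the content-based query can be sketched in Python as follows. It is an in-memory illustration only; the `recommend` helper and the toy app identifiers are hypothetical, and the real work happens in the graph traversal.

```python
def recommend(user_features, app_features, played, limit=10):
    """Rank unplayed apps by the number of features shared with the user."""
    scored = []
    for app, features in app_features.items():
        if app in played:
            continue  # mirrors WHERE NOT (u)-[:PLAYED]->(a)
        overlap = len(user_features & set(features))
        if overlap:
            scored.append((app, overlap))
    # Largest overlap first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


# Hypothetical toy data.
user_features = {"Indie", "Simulation"}
app_features = {
    "app_played": ["Indie", "Simulation"],
    "app_x": ["Indie", "Simulation"],
    "app_y": ["Simulation", "Strategy"],
    "app_z": ["Action"],
}
recs = recommend(user_features, app_features, played={"app_played"})
print(recs)
```

Apps with no shared feature never appear, just as they never match the second MATCH clause in the query.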

Similarity Discovery

// Find similar games via collaborative filtering
// (the score is stored on the ITEM_SIMILAR relationship, not on the node)
MATCH (a:APP {appid: $gameId})-[s:ITEM_SIMILAR]-(similar:APP)
RETURN similar.title, similar.appid, s.similarity
ORDER BY s.similarity DESC
LIMIT 10

Optimisation: Pre-computed similarity relationships eliminate runtime computation. Queries become simple graph traversals.

Cross-Type Recommendations

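// FastRP cross-type recommendations.
// gds.knn.stream has no sourceNode option and yields node IDs rather than nodes,
// so the user and the APP label are filtered on the yielded IDs.
MATCH (u:USER {steamid: $userId})
CALL gds.knn.stream('embedding_projection', {
    nodeProperties: ['embedding'],
    topK: 10,
    concurrency: 1
})
YIELD node1, node2, similarity
WHERE node1 = id(u) AND 'APP' IN labels(gds.util.asNode(node2))
RETURN gds.util.asNode(node2).title AS title, similarity
ORDER BY similarity DESC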

Optimisation: The embedding projection includes only nodes with embedding properties, dramatically reducing search space.
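As a rough illustration of a similarity query in embedding space, the sketch below ranks APP nodes by cosine similarity to a user's vector. Cosine is assumed here as the metric, the three-dimensional vectors are made up for readability (the article uses 128 dimensions), and `nearest_apps` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_apps(user_vec, nodes, top_k=10):
    """Rank nodes labelled APP by cosine similarity to the user's embedding."""
    scored = [
        (name, cosine(user_vec, vec))
        for name, (label, vec) in nodes.items()
        if label == "APP"  # cross-type filter: keep games, skip other node types
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings keyed by node name: (label, vector).
nodes = {
    "game_1": ("APP", [1.0, 0.0, 0.0]),
    "game_2": ("APP", [0.0, 1.0, 0.0]),
    "group_1": ("GROUP", [1.0, 0.0, 0.0]),
}
recs = nearest_apps([0.9, 0.1, 0.0], nodes, top_k=2)
print([name for name, _ in recs])
```

Because users, games, and groups share one vector space, the same function could rank groups for a user by changing only the label filter.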

Memory and Storage Optimisation

At scale, schema design impacts memory and storage:

Relationship Direction Strategy

// Symmetric relationships (friendships)
(u1:USER)-[:FRIENDS]-(u2:USER)  // Matched without direction

// Asymmetric relationships (ownership)
(u:USER)-[:PLAYED]->(a:APP)     // Directed

Neo4j stores every relationship with a direction. Symmetric relationships can be stored once and matched without direction, which uses less storage than storing both directions but requires careful query planning. Directed relationships enable faster traversals in one direction.

Batch Import Optimisation

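// Optimized for bulk loading
UNWIND $records AS record
MERGE (u:USER {steamid: record.steamid})
MERGE (a:APP {appid: record.appid})
// MERGE on the relationship alone, then SET playtime, so updated playtimes do not create duplicates
MERGE (u)-[p:PLAYED]->(a)
SET p.playtime = record.playtime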

UNWIND with MERGE operations enable transaction-efficient bulk loading while preventing duplicates.
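On the client side, the `$records` parameter is typically filled one batch at a time. The helper below is a minimal sketch of that batching step in Python; the batch size and record values are assumptions for illustration, and the driver call that would send each chunk is deliberately left out.

```python
from itertools import islice


def batches(records, size):
    """Yield successive lists of at most `size` records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical interaction records shaped like the UNWIND input.
records = [{"steamid": i, "appid": 413150, "playtime": 10 * i} for i in range(5)]

# Each chunk would be passed as the $records parameter of one transaction.
chunks = list(batches(records, size=2))
print([len(chunk) for chunk in chunks])
```

Smaller batches keep each transaction's memory footprint bounded, while larger ones reduce round trips; the right size depends on the deployment.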

Real-World Performance Results

Our schema design delivers production performance across multiple recommendation algorithms:

Query Performance

User recommendations: <50ms for 10 results
Similarity lookups: <10ms for pre-computed relationships
Cross-type recommendations: <100ms using FastRP embeddings

Storage Efficiency

50M user-game relationships: 2.1GB Neo4j database
15 algorithm-specific relationship types: minimal storage overhead
Feature vectors and embeddings: 500MB additional space

Scalability Patterns

Read replicas for recommendation serving
Write clusters for model training
Projection-based algorithm isolation

Schema Anti-Patterns to Avoid

Over-Normalized Entity Modeling

// Anti-pattern: Excessive normalization
(u:USER)-[:HAS_PROFILE]->(p:PROFILE)-[:HAS_NAME]->(n:NAME {value: "Alice"})

// Better: Reasonable denormalization
(u:USER {name: "Alice", profile_created: datetime()})

Graph databases aren’t relational databases. Some denormalization improves performance.

Algorithm-Agnostic Feature Storage

// Anti-pattern: Generic features
(n)-[:HAS_FEATURE]->(f:FEATURE {name: "genre", value: "RPG"})

// Better: Typed feature nodes
(n)-[:HAS_GENRE]->(g:GENRE {genre: "RPG"})

Strongly typed relationships enable algorithm-specific optimisations and clearer query patterns.

The Future of Graph-Based Recommenders

Our schema design points toward emerging patterns in recommendation systems:

Multi-Modal Integration: Unified graphs supporting text, images, and behavioural data in single recommendation queries.

Real-Time Learning: Schema patterns that support online learning algorithms updating recommendations based on live user behaviour.

Explainable AI: Graph traversals that provide natural explanation paths for recommendations (“Because you liked RPGs and your friend Alice recommends it”).

Federated Recommendations: Schema designs that enable recommendations across multiple platforms and data sources.

The graph database becomes more than storage—it becomes the computational substrate for next-generation recommendation algorithms.

Conclusion: Schema as Strategy

In recommendation systems, schema design is a strategic choice. Our Neo4j design enables:

Algorithm flexibility: One schema supports 15+ different recommendation approaches
Performance scalability: Sub-100ms queries across 50M+ relationships
Feature evolution: Non-breaking schema changes as algorithms evolve
Operational simplicity: One database serves as feature store, model store, and serving layer

The key insight: design your schema for the algorithms you haven’t built yet. The patterns that support today’s content-based filtering will power tomorrow’s graph neural networks.

Your schema is not just about storing data, but enabling algorithmic innovation at scale.

Dr. Riccardo Scott

Explorer