Leveraging Neo4j’s Graph Data Science library to compute real-time user-user and item-item similarities over 15 million interactions—no matrices, no bottlenecks, sub-100 ms query times.
Traditional collaborative filtering computes similarities in memory-constrained matrices. Modern graph databases flip this paradigm: relationships become first-class citizens, and similarity algorithms operate directly on graph structure. The result? Collaborative filtering that scales to billions of interactions while maintaining sub-100ms query times.
This deep dive reveals how to implement collaborative filtering using Neo4j’s Graph Data Science library, processing 15 million user-game interactions to generate real-time recommendations. The key insight: graph-native algorithms eliminate the memory bottlenecks that plague traditional matrix-based approaches.
Our system generates item-item and user-user similarities across 200,000 users and 50,000 games, all within Neo4j’s unified graph structure.
The Graph Advantage in Collaborative Filtering
Traditional collaborative filtering faces three fundamental constraints:
Memory Explosion: User-item matrices become prohibitively large. A dense 200,000 × 50,000 float32 matrix occupies roughly 40 GB, even though most of its entries are zeros.
Update Complexity: Adding new users or items requires matrix reconstruction, making real-time updates impractical.
Algorithm Isolation: Similarity computation happens separately from storage, requiring expensive data movement between systems.
Graph-based collaborative filtering solves all three by treating relationships as the computational foundation:
graph TD
    A[Traditional Matrix CF] --> B[User-Item Matrix<br/>200K × 50K<br/>~40GB RAM]
    B --> C[In-Memory Similarity<br/>Computation]
    C --> D[Separate Storage<br/>for Results]
    E[Graph-Based CF] --> F[Native Relationships<br/>15M edges<br/>~2GB storage]
    F --> G[In-Graph Similarity<br/>Algorithms]
    G --> H[Similarity as<br/>Graph Relationships]
    style E fill:#4caf50
    style A fill:#ffeb3b
    style F fill:#e8f5e8
    style G fill:#e8f5e8
    style H fill:#e8f5e8
The transformation is profound: similarities become graph traversals rather than matrix operations.
Item-Based Collaborative Filtering: Apps Through Users
Item-based collaborative filtering finds games similar to games you’ve played. In graph terms: “Which apps share the most users with apps I’ve played?”
Graph Projection Strategy
The foundation is a bipartite user-app graph projected for similarity computation:
graph LR
    A[User: Alice] -->|PLAYED<br/>playtime: 100h| B[APP: Cyberpunk]
    A -->|PLAYED<br/>playtime: 50h| C[APP: Witcher 3]
    D[User: Bob] -->|PLAYED<br/>playtime: 80h| B
    D -->|PLAYED<br/>playtime: 120h| E[APP: Elden Ring]
    F[User: Carol] -->|PLAYED<br/>playtime: 200h| C
    F -->|PLAYED<br/>playtime: 90h| E
    B -.->|Similar via<br/>Alice & Bob| C
    C -.->|Similar via<br/>Carol| E
    style A fill:#e1f5fe
    style D fill:#e1f5fe
    style F fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#f3e5f5
    style E fill:#f3e5f5
For item-based similarity, we reverse the relationship direction to create an app-centric projection:
// Item-based projection: Apps point to Users
CALL gds.graph.project(
  'app_user_projection',
  ['APP', 'USER'],
  {
    PLAYED: {
      orientation: 'REVERSE',   // project PLAYED as APP -> USER
      properties: ['playtime_forever', 'playtime_user_normalized']
    }
  }
)

This projection enables Neo4j GDS to compute app-app similarities based on shared users.
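For reference, the same projection can be created from Python; a minimal sketch using the graphdatascience client (the connection details are placeholders, not the article's actual configuration):

from graphdatascience import GraphDataScience

# Placeholder connection details
gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))

# Create the item-based projection and report its size
result = gds.run_cypher("""
CALL gds.graph.project(
  'app_user_projection',
  ['APP', 'USER'],
  {PLAYED: {orientation: 'REVERSE',
            properties: ['playtime_forever', 'playtime_user_normalized']}}
)
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
""")
print(result)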
Jaccard Similarity on Graph Structure
Neo4j’s NodeSimilarity algorithm computes Jaccard similarity natively on graph relationships:
// Compute item-item similarities using Jaccard
CALL gds.nodeSimilarity.write(
  'app_user_projection',
  {
    topK: 40,
    similarityCutoff: 0.01,
    writeRelationshipType: 'SIMILAR_NODESIM_APP_VIA_USER',
    writeProperty: 'score'
  }
)

Jaccard Similarity Formula: For apps A and B, where U_A and U_B are the sets of users who played each app:

J(A, B) = |U_A ∩ U_B| / |U_A ∪ U_B|
The algorithm automatically handles:
- Sparse data: Only non-zero relationships are considered
- Parallel computation: Multi-core processing across the graph
- Memory efficiency: Streaming computation without loading full matrices
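To make the formula concrete, here is a tiny worked example with made-up users:

# Two apps and the sets of users who played them (made-up data)
users_a = {"alice", "bob", "carol"}
users_b = {"bob", "carol", "dave", "erin"}

# Jaccard similarity: shared users / distinct users across both apps
jaccard = len(users_a & users_b) / len(users_a | users_b)
print(jaccard)  # 2 / 5 = 0.4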
Weighted Collaborative Filtering
Raw interaction counts miss signal strength. Users who play games for 200 hours show stronger preference than 5-minute trials. We implement weighted Jaccard similarity using playtime data:
// Weighted item-based similarity using playtime
CALL gds.nodeSimilarity.write(
  'app_user_projection',
  {
    topK: 40,
    relationshipWeightProperty: 'playtime_forever',
    writeRelationshipType: 'SIMILAR_NODESIM_APP_VIA_USER_WEIGHTED',
    writeProperty: 'score'
  }
)

Weighted Jaccard Formula:

J_w(A, B) = Σ_u min(w(u, A), w(u, B)) / Σ_u max(w(u, A), w(u, B))

Where w(u, A) represents user u's playtime for app A (0 if u has not played A).
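As a tiny worked example (made-up playtimes), this is the min/max form of weighted Jaccard that GDS applies when relationshipWeightProperty is set:

# Playtime (hours) per user for two apps; users missing from one app count as 0
playtime_a = {"alice": 100.0, "bob": 80.0}
playtime_b = {"alice": 50.0, "carol": 200.0}

users = set(playtime_a) | set(playtime_b)
numerator = sum(min(playtime_a.get(u, 0.0), playtime_b.get(u, 0.0)) for u in users)
denominator = sum(max(playtime_a.get(u, 0.0), playtime_b.get(u, 0.0)) for u in users)
print(numerator / denominator)  # 50 / 380 ≈ 0.13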
Normalization Strategies for Fair Comparison
Playtime varies dramatically across users and games. We implement multiple normalization schemes:
User-Normalized Playtime: Relative to user’s total gaming activity
// User normalization: playtime / user_total_playtime
MATCH (user:USER)-[p:PLAYED]-(app:APP)
SET p.playtime_user_normalized = p.playtime_forever / user.total_playtime

App-Normalized Playtime: Relative to the game's average playtime
// App normalization: playtime / app_average_playtime
MATCH (user:USER)-[p:PLAYED]-(app:APP)
SET p.playtime_app_normalized = p.playtime_forever / app.average_playtime

Each normalization strategy captures different signals:
- User-normalized: Identifies games important to individual users
- App-normalized: Accounts for games with naturally different playtime patterns
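The two SET queries above rely on user.total_playtime and app.average_playtime already existing on the nodes; a hedged sketch of how those aggregates might be precomputed from the PLAYED relationships (property names follow the article, the exact pipeline is an assumption):

from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

# Total playtime per user
gds.run_cypher("""
MATCH (user:USER)-[p:PLAYED]->(:APP)
WITH user, sum(p.playtime_forever) AS total
SET user.total_playtime = total
""")

# Average playtime per app
gds.run_cypher("""
MATCH (:USER)-[p:PLAYED]->(app:APP)
WITH app, avg(p.playtime_forever) AS avg_playtime
SET app.average_playtime = avg_playtime
""")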
User-Based Collaborative Filtering: Users Through Apps
User-based collaborative filtering finds users with similar gaming preferences: “Which users play similar games to me?”
Projection for User Similarity
User-based filtering requires a natural orientation projection:
graph LR
    A[USER: Alice] -->|PLAYED| B[APP: RPGs]
    A -->|PLAYED| C[APP: Strategy]
    D[USER: Bob] -->|PLAYED| B
    D -->|PLAYED| E[APP: Action]
    F[USER: Carol] -->|PLAYED| C
    F -->|PLAYED| G[APP: Simulation]
    A -.->|Similar via<br/>shared games| D
    A -.->|Similar via<br/>Strategy games| F
    style A fill:#e1f5fe
    style D fill:#e1f5fe
    style F fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#f3e5f5
    style E fill:#f3e5f5
    style G fill:#f3e5f5
// User-based projection: Users point to Apps
CALL gds.graph.project(
  'user_app_projection',
  ['USER', 'APP'],
  {
    PLAYED: {
      orientation: 'NATURAL',   // USER -> APP
      properties: ['playtime_forever', 'playtime_user_normalized']
    }
  }
)

User-User Similarity Computation
// Compute user-user similarities using weighted Jaccard
CALL gds.nodeSimilarity.write(
  'user_app_projection',
  {
    topK: 40,
    relationshipWeightProperty: 'playtime_forever',
    writeRelationshipType: 'SIMILAR_NODESIM_USER_VIA_APP_WEIGHTED',
    writeProperty: 'score'
  }
)

This creates SIMILAR_NODESIM_USER_VIA_APP_WEIGHTED relationships connecting users with similar gaming preferences.
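A quick way to sanity-check the result is to list one user's closest neighbours; a minimal sketch (the steamid value is a placeholder):

from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

neighbours = gds.run_cypher("""
MATCH (u:USER {steamid: $user_id})-[s:SIMILAR_NODESIM_USER_VIA_APP_WEIGHTED]-(v:USER)
RETURN v.steamid AS similar_user, s.score AS score
ORDER BY score DESC
LIMIT 5
""", params={"user_id": "76561198000000000"})
print(neighbours)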
Recommendation Scoring Algorithms
Graph-based similarities enable sophisticated recommendation scoring through graph traversals.
Item-Based Recommendation Scoring
Find games similar to the games the user has played, weighted by the user's engagement:
graph TD
    A[User: Alice] -->|PLAYED<br/>100h| B[APP: Cyberpunk]
    A -->|PLAYED<br/>80h| C[APP: Witcher 3]
    B -.->|SIMILAR<br/>0.85| D[APP: Elden Ring]
    C -.->|SIMILAR<br/>0.73| D
    D -->|"Weighted Score<br/>(100×0.85 + 80×0.73)<br/>÷ (100 + 80)<br/>= 0.80"| E[Recommendation]
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#f3e5f5
    style D fill:#fff3e0
    style E fill:#4caf50
Implementation:
// Item-based recommendation scoring
MATCH (user:USER {steamid: $user_id})-[p:PLAYED]->(owned_app:APP)
MATCH (owned_app)-[sim:SIMILAR_NODESIM_APP_VIA_USER]-(recommended_app:APP)
WHERE NOT EXISTS((user)-[:PLAYED]-(recommended_app))
  AND p.playtime_forever > 0
WITH recommended_app,
     sum(p.playtime_forever * sim.score) / sum(p.playtime_forever) as weighted_score
ORDER BY weighted_score DESC
LIMIT 10
RETURN recommended_app.title, weighted_score

This scoring function:
- Finds owned games with positive playtime
- Traverses similarities to find similar games not yet played
- Weights by engagement: Games played longer have stronger influence
- Aggregates scores: Multiple pathways reinforce recommendations
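Wrapped as a parameterized query from Python, the same scoring might look like this (the function name and steamid are illustrative, and the connection details are placeholders):

from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

def item_based_recommendations(user_id: str, limit: int = 10):
    # Returns a DataFrame of (title, weighted_score) rows for one user
    return gds.run_cypher("""
        MATCH (user:USER {steamid: $user_id})-[p:PLAYED]->(owned_app:APP)
        MATCH (owned_app)-[sim:SIMILAR_NODESIM_APP_VIA_USER]-(rec:APP)
        WHERE NOT EXISTS((user)-[:PLAYED]-(rec))
          AND p.playtime_forever > 0
        WITH rec, sum(p.playtime_forever * sim.score) / sum(p.playtime_forever) AS weighted_score
        ORDER BY weighted_score DESC
        LIMIT $limit
        RETURN rec.title AS title, weighted_score
    """, params={"user_id": user_id, "limit": limit})

print(item_based_recommendations("76561198000000000"))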
User-Based Recommendation Scoring
Find games played by similar users, weighted by user similarity:
graph TD
    A[User: Alice] -.->|SIMILAR<br/>0.82| B[User: Bob]
    A -.->|SIMILAR<br/>0.75| C[User: Carol]
    B -->|PLAYED<br/>150h| D[APP: Elden Ring]
    C -->|PLAYED<br/>90h| D
    D -->|"Weighted Score<br/>(150×0.82 + 90×0.75)<br/>÷ (150 + 90)<br/>= 0.79"| E[Recommendation]
    style A fill:#e1f5fe
    style B fill:#e1f5fe
    style C fill:#e1f5fe
    style D fill:#fff3e0
    style E fill:#4caf50
Implementation:
// User-based recommendation scoring
MATCH (user:USER {steamid: $user_id})-[sim:SIMILAR_NODESIM_USER_VIA_APP]->(similar_user:USER)
MATCH (similar_user)-[p:PLAYED]->(recommended_app:APP)
WHERE NOT EXISTS((user)-[:PLAYED]-(recommended_app))
  AND p.playtime_forever > 0
WITH recommended_app,
     sum(p.playtime_forever * sim.score) / sum(p.playtime_forever) as weighted_score
ORDER BY weighted_score DESC
LIMIT 10
RETURN recommended_app.title, weighted_score

Matrix Factorization: Hybrid Graph-Matrix Approach
Pure graph algorithms excel at interpretability but miss latent factors captured by matrix factorization. Our system implements a hybrid approach using the implicit library for matrix factorization while leveraging Neo4j for data management.
Data Pipeline for Matrix Factorization
graph LR
    A[Neo4j Graph] -->|Extract| B[User-Item<br/>Interactions]
    B -->|Transform| C[Sparse CSR<br/>Matrix]
    C -->|Train| D[ALS/BPR/LMF<br/>Models]
    D -->|Generate| E[User-Item<br/>Predictions]
    E -->|Write Back| F[Neo4j Similarity<br/>Relationships]
    style A fill:#4caf50
    style C fill:#ffeb3b
    style D fill:#ff9800
    style F fill:#4caf50
Sparse Matrix Construction
Convert graph relationships to sparse matrices for efficient factorization:
from scipy.sparse import csr_matrix
import pandas as pd

def build_interaction_matrix(df: pd.DataFrame) -> csr_matrix:
    # Create user and item ID mappings
    user_ids = df['steamid'].unique()
    app_ids = df['appid'].unique()
    user_id_to_idx = {uid: idx for idx, uid in enumerate(user_ids)}
    app_id_to_idx = {aid: idx for idx, aid in enumerate(app_ids)}

    # Build sparse matrix (users x apps)
    data = df['score_'].values  # playtime or binary
    row_indices = df['steamid'].map(user_id_to_idx).values
    col_indices = df['appid'].map(app_id_to_idx).values
    return csr_matrix((data, (row_indices, col_indices)),
                      shape=(len(user_ids), len(app_ids)))

Multiple Matrix Factorization Algorithms
Our system implements seven different factorization approaches (for more detail see here):
Alternating Least Squares (ALS):
ImplicitModel(AlternatingLeastSquares(factors=64), "SIMILAR_MF_ALS")
ImplicitModel(AlternatingLeastSquares(factors=64, alpha=10), "SIMILAR_MF_WALS")

Bayesian Personalized Ranking (BPR):
ImplicitModel(BayesianPersonalizedRanking(factors=64), "SIMILAR_MF_BPR")

Logistic Matrix Factorization (LMF):
ImplicitModel(LogisticMatrixFactorization(factors=64), "SIMILAR_MF_LMF")

Item-Item Similarity Models:
ImplicitModel(TFIDFRecommender(), "SIMILAR_ItemItem_TFIDF")
ImplicitModel(CosineRecommender(), "SIMILAR_ItemItem_Cosine")
ImplicitModel(BM25Recommender(B=0.2), "SIMILAR_ItemItem_BM25")

Each algorithm captures different aspects of user-item interactions, enabling ensemble approaches.
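A hedged sketch of what training one of these models and reading back item-item neighbours might look like when using the implicit library directly (the ImplicitModel wrapper is the system's own; interactions_df is assumed to be the DataFrame fed to build_interaction_matrix, and the return shape of similar_items varies slightly between implicit versions):

from implicit.als import AlternatingLeastSquares

# user x item CSR matrix built from the interactions extracted out of Neo4j
user_item = build_interaction_matrix(interactions_df)

model = AlternatingLeastSquares(factors=64)
model.fit(user_item)

# Top-40 most similar items for one internal item index
item_idx = 0
ids, scores = model.similar_items(item_idx, N=40)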
Memory Management and Graph Projections
Production collaborative filtering requires careful memory management. Neo4j GDS projections provide controlled memory usage with automatic cleanup.
Projection Lifecycle Management
class NodeSim(Model):
    def run(self):
        self._pre_clean()                   # Delete existing similarities
        projection = self._project()        # Create in-memory projection
        self._write_sim_to_db(projection)   # Compute and write similarities
        self._post_clean()                  # Delete projection, free memory

Memory-Efficient Projections
// Projection with selective properties for memory efficiency
CALL gds.graph.project(
  'user_app_memory_optimized',
  ['USER', 'APP'],
  {
    PLAYED: {
      properties: ['playtime_forever']   // Only needed properties
    }
  }
)

Memory Usage Patterns:
- Full graph: 15M relationships = ~2GB storage
- In-memory projection: Selective properties = ~800MB RAM
- Similarity computation: Parallel processing = ~1.2GB peak
- Post-cleanup: Memory freed automatically
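The _post_clean() step in the NodeSim lifecycle above corresponds to dropping the named in-memory projection once the similarity relationships are persisted; a minimal sketch (connection details are placeholders):

from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

# Drop the projection and free its heap; false = don't fail if it is already gone
gds.run_cypher("""
CALL gds.graph.drop('user_app_memory_optimized', false)
YIELD graphName
RETURN graphName
""")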
Performance Optimisation and Scalability
Our collaborative filtering system delivers production performance through multiple optimization layers.
Graph Algorithm Performance
graph LR
    A[15M Relationships] --> B[Node Similarity<br/>Computation<br/>12 minutes]
    B --> C[Top-40 Similarity<br/>Relationships per Node]
    C --> D[Recommendation<br/>Query<br/>< 30ms]
    style A fill:#4caf50
    style B fill:#ff9800
    style C fill:#ffeb3b
    style D fill:#2196f3
Detailed Performance Metrics:
- Projection creation: 2-3 minutes for 15M relationships
- Similarity computation: 8-15 minutes depending on algorithm variant
- Result writing: 1-2 minutes to create similarity relationships
- Query performance: 20-100ms for top-10 recommendations
Similarity Cutoff Optimisation
// Optimize storage with similarity cutoffs
CALL gds.nodeSimilarity.write(
  'app_user_projection',
  {
    topK: 40,                 // Limit results per node
    similarityCutoff: 0.01,   // Filter weak similarities
    writeRelationshipType: 'SIMILAR_NODESIM_APP_VIA_USER',
    writeProperty: 'score'
  }
)

Impact of cutoff thresholds:
- similarityCutoff: 0.0: 2.1M similarity relationships
- similarityCutoff: 0.01: 890K similarity relationships (58% reduction)
- similarityCutoff: 0.05: 340K similarity relationships (84% reduction)
Higher cutoffs dramatically reduce storage while maintaining recommendation quality.
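Relationship counts like the ones above are easy to verify after each write run; a minimal sketch (connection details are placeholders):

from graphdatascience import GraphDataScience

gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholders

count = gds.run_cypher("""
MATCH ()-[r:SIMILAR_NODESIM_APP_VIA_USER]->()
RETURN count(r) AS similarity_relationships
""")
print(count)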
Index Strategy for Fast Traversals
// Optimize traversal performance with targeted indexes
CREATE INDEX user_steamid_index IF NOT EXISTS FOR (u:USER) ON (u.steamid);
CREATE INDEX app_appid_index IF NOT EXISTS FOR (a:APP) ON (a.appid);
CREATE INDEX similar_score_index IF NOT EXISTS FOR ()-[r:SIMILAR_NODESIM_APP_VIA_USER]-() ON (r.score);

These indexes enable sub-30ms recommendation queries across millions of relationships.
Handling Implicit Feedback and Cold Start
Real-world collaborative filtering must handle missing ratings and new users gracefully.
Implicit Feedback Strategies
Steam data provides playtime rather than explicit ratings. We implement multiple strategies for implicit feedback interpretation:
Binary Conversion: Transform playtime to binary signals
# Convert any playtime > 0 to positive feedback
ratings_matrix.data = np.ones(len(ratings_matrix.data))

Log Transformation: Reduce impact of extreme playtime values
# Log-scale transformation for playtime
ratings_matrix.data = np.log1p(ratings_matrix.data)

Confidence Weighting: Higher playtime = higher confidence
# Alpha parameter controls confidence scaling: confidence = 1 + α * playtime
# With scipy sparse matrices, apply the scaling to the stored non-zero entries
confidence_matrix = ratings_matrix.copy()
confidence_matrix.data = 1 + alpha * confidence_matrix.data

Cold Start Solutions
New User Cold Start: Users without interaction history
// Bootstrap new users with popular games in their preferred genres
MATCH (new_user:USER {steamid: $new_user_id})
MATCH (popular_app:APP)-[:HAS_GENRE]->(genre:GENRE)
WHERE genre.genre IN $user_preferred_genres
MATCH (:USER)-[:PLAYED]->(popular_app)
WITH popular_app, count(*) as popularity
ORDER BY popularity DESC
LIMIT 10
RETURN popular_app.title

New Item Cold Start: Games without user interactions
// Recommend new games to users who like similar genres/developers
MATCH (new_app:APP)-[:HAS_GENRE|DEVELOPED_BY]->(feature)
MATCH (user:USER)-[:PLAYED]->(owned_app:APP)-[:HAS_GENRE|DEVELOPED_BY]->(feature)
WITH user, new_app, count(feature) as feature_overlap
WHERE feature_overlap >= 2
RETURN user.steamid, new_app.title

Integration with Hybrid Recommendation Systems
Collaborative filtering rarely operates in isolation. Our system integrates seamlessly with content-based and deep learning approaches:
Score Fusion Strategies
// Weighted combination of collaborative and content-based scores
// (both similarity types connect APP nodes, so we traverse from the user's owned apps)
MATCH (user:USER {steamid: $user_id})-[:PLAYED]->(owned_app:APP)
MATCH (owned_app)-[collab:SIMILAR_NODESIM_APP_VIA_USER]-(app1:APP)
MATCH (owned_app)-[content:SIMILAR_SPARSE_FEATURES]-(app1)
WHERE NOT EXISTS((user)-[:PLAYED]-(app1))
WITH app1,
     0.7 * avg(collab.score) + 0.3 * avg(content.score) as hybrid_score
ORDER BY hybrid_score DESC
LIMIT 10
RETURN app1.title, hybrid_score

Feature Engineering for Deep Learning
# Export collaborative similarities as features for neural networks
# (assumes a graphdatascience client, e.g.
#  from graphdatascience import GraphDataScience
#  gds = GraphDataScience("bolt://localhost:7687", auth=("neo4j", "password")))
def export_collaborative_features():
    query = """
    MATCH (user:USER)-[sim:SIMILAR_NODESIM_USER_VIA_APP]-(similar_user:USER)
    RETURN user.steamid, collect(similar_user.steamid) as similar_users,
           collect(sim.score) as similarity_scores
    """
    return gds.run_cypher(query)

Collaborative similarities become input features for two-tower and transformer-based models.
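One hedged way to turn those exported rows into model-ready inputs is to pad each user's top similarity scores to a fixed width (the width and the choice to use scores alone are assumptions, not the article's exact feature scheme):

import numpy as np

def to_feature_matrix(df, width=40):
    # One fixed-length row of similarity scores per user, zero-padded
    features = np.zeros((len(df), width), dtype=np.float32)
    for i, scores in enumerate(df["similarity_scores"]):
        top = sorted(scores, reverse=True)[:width]
        features[i, :len(top)] = top
    return features

X = to_feature_matrix(export_collaborative_features())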
Conclusion: Graph-Native Collaborative Filtering at Scale
Graph-based collaborative filtering transforms the traditional matrix-memory bottleneck into scalable relationship computation. Our Neo4j implementation demonstrates that sophisticated collaborative filtering can operate directly on graph structure without sacrificing performance or interpretability.
Key architectural advantages:
- Memory efficiency: Graph projections use 60% less memory than equivalent matrices
- Real-time updates: Add users and items without matrix reconstruction
- Algorithm diversity: Multiple similarity measures within unified graph framework
- Hybrid integration: Seamless combination with content-based and deep learning approaches
Performance at scale:
- 15M relationships processed in under 15 minutes
- Sub-100ms recommendation queries across millions of similarities
- 2.1M similarity relationships with automatic memory management
The strategic insight extends beyond recommendation systems. Graph-native collaborative filtering becomes the foundation for social recommendations, cross-domain suggestions, and multi-stakeholder platforms where relationships define value.
Your collaborative filtering system isn’t just finding similar users and items—it’s discovering the relationship patterns that drive engagement, revealing the community structures that amplify recommendations, and building the graph foundation for next-generation commerce platforms.
Graph algorithms don’t just scale collaborative filtering—they transform it into a relationship discovery engine that powers entire digital ecosystems.