Executive Summary
We built a knowledge graph system from scratch using MongoDB for episode storage, Amazon S3 for vector persistence, OpenAI embeddings for semantic search, and LangChain for AI tool integration. The goal was a scalable, cost-effective solution for managing complex health data relationships without paying for an external graph database.
What is a Knowledge Graph and Why We Built One
A knowledge graph stores information as interconnected entities and relationships, letting AI systems understand context, patterns, and connections across different types of data. In health and fitness, this means understanding how exercise routines, nutrition habits, emotional states, and progress measurements influence each other over time.
Our Specific Use Case: Goal Weight Health Platform
Goal Weight is a health, fitness, and nutrition application that needed to track complex health data across multiple dimensions (exercise, nutrition, emotions, sleep, measurements), enable semantic search for AI assistants, generate personalized insights by connecting patterns across different health episodes, support AI conversations with rich contextual understanding, and scale without external dependencies.
Why We Chose a Native Implementation
Rather than using Neo4j or a cloud-based graph service, we built our own because:
- Sensitive health information stays within our infrastructure.
- No external service fees that scale unpredictably.
- Custom optimizations for our specific health data patterns.
- Native LangChain tool integration.
- Better alignment with GDPR/HIPAA requirements.
Architecture Overview
Our native knowledge graph system has four main layers:
┌────────────────────────────────────────────────────────────────────────────────────┐
│                               Knowledge Graph System                               │
├────────────────────────────────────────────────────────────────────────────────────┤
│ LangChain Tools           │ AI Insights                │ Semantic Search           │
├────────────────────────────────────────────────────────────────────────────────────┤
│                    KnowledgeGraphService (Business Logic Layer)                    │
├────────────────────────────────────────────────────────────────────────────────────┤
│ MongoDB         │ OpenAI                   │ Amazon S3      │ Vector Search Engine │
│ (Episode Store) │ (Embeddings)             │ (Vector Store) │                      │
│ - Episodes      │ - text-embedding-3-small │ - Vector Index │ - Cosine similarity  │
│ - Metadata      │ - Semantic               │ - Backup/HA    │ - Ranking            │
│ - Relationships │                          │ - Scalability  │                      │
└────────────────────────────────────────────────────────────────────────────────────┘
Core Technology Stack
1. Data Storage Layer
MongoDB with TypeGoose
import { index, modelOptions, prop } from '@typegoose/typegoose';
import { Types } from 'mongoose';
// User and EpisodeType are defined elsewhere in the project
@modelOptions({
  schemaOptions: {
    timestamps: true,
    collection: 'knowledgeepisodes'
  }
})
@index({ userId: 1, date: -1 })
@index({ userId: 1, type: 1, date: -1 })
export class KnowledgeEpisode {
  @prop({ required: true, ref: () => User })
  userId!: Types.ObjectId;
  @prop({ required: true, enum: Object.values(EpisodeType) })
  type!: EpisodeType;
  @prop({ required: true })
  date!: Date;
  @prop({ required: true, type: () => Object })
  body!: Record<string, any>; // Flexible health data structure
  @prop({ required: true })
  summary!: string; // Human-readable for vector search
  @prop({ type: () => [String] })
  tags?: string[]; // Additional categorization
  @prop({ type: () => Object })
  metadata?: Record<string, any>; // Context and source info
}
The flexible schema stores complex health data as JSON, with compound indexes optimized for user-based and temporal queries. TypeScript integration provides compile-time validation.
Amazon S3 Vector Storage
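import { PutObjectCommand, S3Client } from '@aws-sdk/client-s3';
export default class VectorStoreHelper {
  // bucket and vectorFolder are static fields not shown in this excerpt
  private static readonly s3Client = new S3Client({
    credentials: {
      // Non-null assertions assume both variables are set at startup
      accessKeyId: process.env.AWS_ACCESSKEY!,
      secretAccessKey: process.env.AWS_SECRETKEY!
    },
    region: process.env.AWS_REGION || 'us-east-1'
  });
  static async storeDocument(document: VectorDocument): Promise<void> {
    // Generate embedding if not provided
    if (!document.embedding) {
      document.embedding = await this.createEmbedding(document.content);
    }
    // Record to persist; these are the fields the search code reads back
    const storedVector = {
      id: document.id,
      content: document.content,
      metadata: document.metadata,
      embedding: document.embedding
    };
    // Store in S3 with JSON format
    const vectorCommand = new PutObjectCommand({
      Bucket: this.bucket,
      Key: `${this.vectorFolder}/${document.id}.json`,
      Body: JSON.stringify(storedVector, null, 2),
      ContentType: 'application/json'
    });
    await this.s3Client.send(vectorCommand);
  }
}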
S3 provides built-in redundancy and pay-per-use storage. The helper keeps an in-memory cache in front of it for frequently accessed vectors.
2. Semantic Search Engine
OpenAI Embeddings Integration
static async createEmbedding(text: string): Promise<number[]> {
  const response = await this.openai.embeddings.create({
    model: 'text-embedding-3-small', // Optimized for cost and performance
    input: text
  });
  return response.data[0].embedding;
}
static async searchSimilar(query: SearchQuery): Promise<SearchResult[]> {
  // Load the vector index from S3 on first use (no-op afterwards)
  await this.initializeCache();
  // Generate embedding for the query
  const queryEmbedding = await this.createEmbedding(query.query);
  // Calculate similarity with all cached vectors that match the filter
  const results: SearchResult[] = [];
  for (const vector of this.vectorCache.values()) {
    // Skip vectors whose metadata does not match every filter entry (userId, type, ...)
    const matchesFilter = Object.entries(query.filter ?? {})
      .every(([key, value]) => vector.metadata[key] === value);
    if (!matchesFilter) continue;
    const similarity = this.cosineSimilarity(queryEmbedding, vector.embedding);
    results.push({
      id: vector.id,
      content: vector.content,
      metadata: vector.metadata,
      score: similarity
    });
  }
  // Sort by similarity score and return top-k
  return results.sort((a, b) => b.score - a.score).slice(0, query.k || 10);
}
The search engine handles Portuguese and English health terms, converts relative dates like "last week" to absolute formats, and supports type-based and metadata filtering.
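The search code calls a `cosineSimilarity` helper that the article does not show. A minimal sketch of that step, together with filter-then-rank top-k selection, might look like the following. The `StoredVector` shape and the `topK` name are assumptions made for illustration and may not match the real helper.

```typescript
// Hypothetical shapes for illustration; the real StoredVector may differ.
interface StoredVector {
  id: string;
  embedding: number[];
  metadata: Record<string, unknown>;
}

interface ScoredVector {
  id: string;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Keep only vectors whose metadata matches every filter entry, then rank by score.
function topK(
  query: number[],
  vectors: StoredVector[],
  filter: Record<string, unknown>,
  k: number
): ScoredVector[] {
  const scored: ScoredVector[] = [];
  for (const v of vectors) {
    const matches = Object.keys(filter).every(key => v.metadata[key] === filter[key]);
    if (!matches) continue;
    scored.push({ id: v.id, score: cosineSimilarity(query, v.embedding) });
  }
  return scored.sort((x, y) => y.score - x.score).slice(0, k);
}
```

Filtering before ranking matters in a multi-user store: applying the `userId` filter after truncating to k could return fewer results than requested, or results from other users.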
3. Health Domain Modeling
Episode Type System
export enum EpisodeType {
  EXERCISE = 'exercise',       // Workout sessions, training data
  NUTRITION = 'nutrition',     // Meals, calorie tracking, macros
  EMOTION = 'emotion',         // Mood tracking, emotional states
  REFLECTION = 'reflection',   // Daily thoughts, insights
  GOAL = 'goal',               // Objectives, targets, milestones
  MEASUREMENT = 'measurement', // Weight, body composition
  SLEEP = 'sleep',             // Sleep quality, duration
  MEDICATION = 'medication',   // Supplements, prescriptions
  SYMPTOM = 'symptom',         // Health issues, discomfort
  MOOD = 'mood',               // Emotional tracking
  ENERGY = 'energy',           // Energy levels, fatigue
  STRESS = 'stress',           // Stress management, levels
  PAIN = 'pain',               // Physical discomfort tracking
  OTHER = 'other'              // Miscellaneous health data
}
Smart Summary Generation
static generateSummaryFromBody(type: EpisodeType, body: Record<string, any>, date: Date): string {
  const formattedDate = moment(date).format('YYYY-MM-DD');
  switch (type) {
    case EpisodeType.EXERCISE: {
      if (body.exercises && Array.isArray(body.exercises)) {
        const totalWeight = body.totalWeight || 0;
        const exerciseNames = body.exercises.map((ex: any) => ex.name).join(', ');
        return `User trained ${exerciseNames} on ${formattedDate} with total weight ${totalWeight}kg`;
      }
      return `User did exercise training on ${formattedDate}`;
    }
    case EpisodeType.NUTRITION: {
      if (body.calories) {
        return `User consumed ${body.calories} calories on ${formattedDate}`;
      }
      return `User logged nutrition on ${formattedDate}`;
    }
    // ... more specialized summarization logic for each health domain
    default:
      // Fallback so every code path returns a summary
      return `User logged ${type} on ${formattedDate}`;
  }
}
Implementation Deep Dive
1. Episode Registration System
export default class KnowledgeGraphService {
  static async registerEpisode(params: RegisterEpisodeParams): Promise<DocumentType<KnowledgeEpisode>> {
    // 1. Validate user and prevent duplicates
    const user = await UserService.findById(params.userId);
    if (!user) throw new Error('User not found');
    const uniqueId = this.generateUniqueId(params);
    const existingEpisode = await KnowledgeEpisodeModel.findOne({
      userId: params.userId,
      body: params.body,
      type: params.type,
      date: params.date
    });
    if (existingEpisode) return existingEpisode;
    // 2. Generate human-readable summary
    const summary = this.generateSummaryFromBody(params.type, params.body, params.date);
    // 3. Save to MongoDB
    const episode = new KnowledgeEpisodeModel({
      userId: params.userId,
      type: params.type,
      date: params.date,
      body: params.body,
      summary,
      tags: params.tags,
      metadata: params.metadata
    });
    const savedEpisode = await episode.save();
    // 4. Create vector representation
    const vectorId = `episode-${savedEpisode._id}`;
    await VectorStoreHelper.storeDocument({
      id: vectorId,
      content: summary,
      metadata: {
        episodeId: savedEpisode._id.toString(),
        userId: params.userId,
        type: params.type,
        date: params.date.toISOString(),
        tags: params.tags || []
      }
    });
    return savedEpisode;
  }
}
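The duplicate check above matches on the whole `body` object, and MongoDB's exact match on an embedded document is sensitive to field order, so the same payload with reordered keys would not be detected. The article calls `generateUniqueId` but does not show it. One possible approach, sketched here under the assumption that a deterministic string key is acceptable, is to serialize the payload with sorted keys. The names `stableStringify` and `episodeKey` are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Serialize with object keys sorted at every level, so that
// { a: 1, b: 2 } and { b: 2, a: 1 } produce the same string.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== 'object') {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map(item => stableStringify(item)).join(',')}]`;
  }
  const keys = Object.keys(value).sort();
  const parts = keys.map(k => `${JSON.stringify(k)}:${stableStringify(value[k])}`);
  return `{${parts.join(',')}}`;
}

// Hypothetical duplicate key: one string per (user, type, timestamp, payload).
function episodeKey(userId: string, type: string, date: Date, body: Json): string {
  return [userId, type, date.toISOString(), stableStringify(body)].join('|');
}
```

Such a key could be stored in its own indexed field and used for the lookup instead of comparing `body` directly. Hashing the string would shorten it for large payloads.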
2. Semantic Search Implementation
The search system combines vector similarity with traditional filtering:
static async searchEpisodes(params: SearchEpisodesParams): Promise<SearchResult[]> {
  // 1. Normalize and enhance the query
  const normalizedQuery = this.normalizeDatesAndExpressions(params.query);
  // 2. Build search filters
  const filters: Record<string, any> = { userId: params.userId };
  if (params.type) filters.type = params.type;
  // 3. Perform vector search
  const vectorResults = await VectorStoreHelper.searchSimilar({
    query: normalizedQuery,
    k: params.limit || 10,
    filter: filters
  });
  // 4. Get full episodes and apply date filters
  const results: SearchResult[] = [];
  for (const vectorResult of vectorResults) {
    const episode = await KnowledgeEpisodeModel.findById(vectorResult.metadata.episodeId);
    if (episode) {
      // Apply date filters if specified
      if (params.fromDate && episode.date < params.fromDate) continue;
      if (params.toDate && episode.date > params.toDate) continue;
      results.push({
        episode,
        relevanceScore: vectorResult.score
      });
    }
  }
  return results.sort((a, b) => b.relevanceScore - a.relevanceScore);
}
3. AI-Powered Insights Generation
static async generateInsight(params: InsightParams): Promise<string> {
  // 1. Get relevant episodes
  const searchResults = await this.searchEpisodes({
    userId: params.userId,
    query: params.context || `recent ${params.type || 'activity'} patterns and trends`,
    type: params.type,
    limit: 20
  });
  if (searchResults.length === 0) {
    return 'No sufficient data available to generate insights.';
  }
  // 2. Prepare context for LLM
  const episodeContexts = searchResults.map(result => ({
    date: moment(result.episode.date).format('YYYY-MM-DD'),
    type: result.episode.type,
    summary: result.episode.summary,
    body: result.episode.body
  }));
  // 3. Generate insights with GPT
  const prompt = `
Analyze the following user episode data and generate actionable insights:
User Episodes:
${episodeContexts.map(ep => `- ${ep.date}: ${ep.summary}`).join('\n')}
Context: ${params.context || 'General health and fitness progress'}
Please provide:
1. Key patterns and trends
2. Progress indicators
3. Areas for improvement
4. Specific actionable recommendations
Keep the response concise and actionable (max 200 words).
`;
  const response = await this.openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: 'You are a health and fitness coach analyzing user data to provide personalized insights and recommendations.'
      },
      {
        role: 'user',
        content: prompt
      }
    ],
    max_tokens: 300,
    temperature: 0.7
  });
  return response.choices[0].message.content || 'Unable to generate insights at this time.';
}
LangChain Integration for AI Agents
LangChain Tool Implementation
// LangChain tool for knowledge graph search
import { DynamicStructuredTool } from '@langchain/core/tools';
import { z } from 'zod';
const searchOnKnowledgeGraphTool = new DynamicStructuredTool({
  name: 'search_on_knowledge_graph',
  description: 'Search user\'s health and fitness knowledge graph for relevant information',
  schema: z.object({
    query: z.string().describe('Search query with absolute dates (YYYY-MM-DD) when relevant'),
    type: z.enum(['exercise', 'nutrition', 'emotion', 'sleep', 'measurement']).optional(),
    fromDate: z.string().optional().describe('Start date in YYYY-MM-DD format'),
    toDate: z.string().optional().describe('End date in YYYY-MM-DD format')
  }),
  func: async ({ query, type, fromDate, toDate }, config) => {
    const userId = config?.metadata?.userId;
    // Return a readable message instead of crashing on userId.toString()
    if (!userId) {
      return '⚠️ Missing user context; cannot search the knowledge graph.';
    }
    const searchParams: SearchEpisodesParams = {
      userId: userId.toString(),
      query,
      limit: 10
    };
    if (type) searchParams.type = type as EpisodeType;
    if (fromDate) searchParams.fromDate = new Date(fromDate);
    if (toDate) searchParams.toDate = new Date(toDate);
    const results = await KnowledgeGraphService.searchEpisodes(searchParams);
    if (results.length === 0) {
      return '⚠️ No relevant information found for this query.';
    }
    const formattedResults = results.map((result, index) => {
      const episode = result.episode;
      const date = moment(episode.date).format('YYYY-MM-DD');
      const relevance = (result.relevanceScore * 100).toFixed(1);
      return `[${index + 1}] ${date} (${episode.type}) - ${episode.summary} (Relevance: ${relevance}%)
Body data: ${JSON.stringify(episode.body, null, 2)}`;
    }).join('\n\n');
    return `Found ${results.length} relevant episodes:\n\n${formattedResults}`;
  }
});
AI Conversation Flow
When a user asks: "How has my chest workout performance improved over the last month?"
- The LangChain agent processes the natural language query.
- It selects search_on_knowledge_graph.
- It converts the question to structured search parameters:
{ "query": "chest workout exercises performance improvement", "type": "exercise", "fromDate": "2024-06-30", "toDate": "2024-07-30" }
- The knowledge graph retrieves relevant exercise episodes.
- GPT processes the episode data to identify patterns and improvements.
- The agent returns a personalized response about chest workout progression.
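In the flow above the agent itself turns "the last month" into `fromDate` and `toDate`. If those dates should be computed deterministically rather than left to the model, a helper along these lines could be used. This is a sketch that assumes UTC calendar dates, and the `lastMonthRange` name is hypothetical.

```typescript
// Format a Date as YYYY-MM-DD using its UTC calendar day.
function toYmd(date: Date): string {
  return date.toISOString().slice(0, 10);
}

// Range from the same day one calendar month earlier up to today (UTC).
function lastMonthRange(today: Date): { fromDate: string; toDate: string } {
  const year = today.getUTCFullYear();
  const month = today.getUTCMonth();
  const day = today.getUTCDate();
  let from = new Date(Date.UTC(year, month - 1, day));
  // If the previous month is shorter (e.g. 31 March), the date overflows
  // into the current month; clamp to the last day of the previous month.
  if (from.getUTCDate() !== day) {
    from = new Date(Date.UTC(year, month, 0));
  }
  return { fromDate: toYmd(from), toDate: toYmd(today) };
}

const range = lastMonthRange(new Date('2024-07-30T12:00:00Z'));
console.log(range); // { fromDate: '2024-06-30', toDate: '2024-07-30' }
```

The result matches the parameters in the example, and the strings can be passed straight to the tool's `fromDate` and `toDate` fields.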
Real-World Usage Examples
Example 1: Exercise Episode Registration
const knowledgeGraph = new KnowledgeGraphService('user123');
await knowledgeGraph.registerEpisode({
  type: EpisodeType.EXERCISE,
  date: new Date('2024-07-30'),
  body: {
    totalWeight: 976,
    exercises: [
      { name: 'bench press', sets: 4, reps: 8, weight: 80 },
      { name: 'incline dumbbell press', sets: 3, reps: 10, weight: 35 },
      { name: 'chest flies', sets: 3, reps: 12, weight: 25 }
    ],
    duration: 75, // minutes
    location: 'gym',
    intensity: 'high'
  },
  tags: ['chest', 'strength', 'upper-body'],
  metadata: {
    workoutPlan: 'push-pull-legs',
    trainer: 'self',
    equipment: ['barbell', 'dumbbells', 'cables']
  }
});
Generated Summary: "User trained bench press, incline dumbbell press, chest flies on 2024-07-30 with total weight 976kg"
Example 2: Semantic Search for Nutrition Patterns
const nutritionResults = await knowledgeGraph.searchEpisodes({
  query: 'high protein meals muscle building nutrition last 2 weeks',
  type: EpisodeType.NUTRITION,
  fromDate: new Date('2024-07-16'),
  toDate: new Date('2024-07-30'),
  limit: 15
});
console.log(`Found ${nutritionResults.length} nutrition episodes:`);
nutritionResults.forEach((result, index) => {
  console.log(`${index + 1}. ${result.episode.summary} (${(result.relevanceScore * 100).toFixed(1)}%)`);
  console.log(`   Calories: ${result.episode.body.calories}, Protein: ${result.episode.body.protein}g`);
});
Example 3: AI-Generated Health Insights
const healthInsight = await knowledgeGraph.generateInsight({
context: 'overall fitness progress and consistency patterns over the last month'
});
console.log('Personalized Health Insight:');
console.log(healthInsight);
Sample Output:
"Based on your recent activity, you've maintained excellent workout consistency with 18 sessions in the last month. Your strength progression in chest exercises shows a 12% increase in total volume. However, I notice irregular sleep patterns on workout days - consider maintaining 7-8 hours for optimal recovery. Your nutrition goals are well-aligned with muscle building objectives. Recommendation: Add 2 rest days and focus on sleep hygiene for enhanced performance gains."
Performance Optimizations
1. Vector Search Caching Strategy
private static readonly vectorCache = new Map<string, StoredVector>();
private static cacheInitialized = false;
private static async initializeCache(): Promise<void> {
  if (this.cacheInitialized) return;
  try {
    const response = await this.s3Client.send(new GetObjectCommand({
      Bucket: this.bucket,
      Key: `${this.vectorFolder}/${this.indexFile}`
    }));
    const indexData = await response.Body?.transformToString();
    if (indexData) {
      const vectors: StoredVector[] = JSON.parse(indexData);
      for (const vector of vectors) {
        this.vectorCache.set(vector.id, vector);
      }
    }
  } catch (error) {
    // Only a missing index means "start fresh"; rethrow anything else (auth, network)
    if ((error as { name?: string }).name !== 'NoSuchKey') throw error;
    console.log('Vector index not found, starting fresh');
  }
  this.cacheInitialized = true;
}
In-memory vector calculations drop search times below 100ms. S3 provides persistent backup. For large datasets, LRU cache eviction prevents memory overuse.
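The cache code shown here loads every vector and never evicts. A minimal sketch of the LRU eviction mentioned above, assuming a fixed entry limit (for example the `VECTOR_CACHE_SIZE` setting), could build on the insertion-order guarantee of `Map`. The `LruCache` class is illustrative and not the project's actual implementation.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error('capacity must be at least 1');
    this.capacity = capacity;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  has(key: K): boolean {
    return this.entries.has(key);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Recency is only updated through `get`, so this fits per-document or per-user lookups better than a brute-force scan that iterates over every cached vector.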
2. Database Indexing Strategy
@index({ userId: 1, date: -1 }) // User timeline queries
@index({ userId: 1, type: 1, date: -1 }) // Type-specific searches
@index({ createdAt: -1 }) // Recent episodes
@index({ 'tags': 1 }) // Tag-based filtering
3. Query Optimization
static normalizeDatesAndExpressions(query: string): string {
  const today = moment();
  const fmt = 'YYYY-MM-DD';
  let normalizedQuery = query;
  // Replace relative dates with absolute dates.
  // moment objects are mutable, so clone before each subtract().
  const dateReplacements = [
    { pattern: /today|hoje/gi, replacement: today.format(fmt) },
    { pattern: /yesterday|ontem/gi, replacement: today.clone().subtract(1, 'day').format(fmt) },
    { pattern: /last week|semana passada/gi, replacement: `from ${today.clone().subtract(7, 'days').format(fmt)} to ${today.format(fmt)}` }
  ];
  // Translate Portuguese health terms
  const translations = [
    { pattern: /treino|treinamento/gi, replacement: 'exercise training workout' },
    { pattern: /alimentação|comida/gi, replacement: 'nutrition food meal' }
  ];
  // Apply all transformations, dates first, then translations
  for (const { pattern, replacement } of [...dateReplacements, ...translations]) {
    normalizedQuery = normalizedQuery.replace(pattern, replacement);
  }
  return normalizedQuery;
}
Testing and Validation
Comprehensive Test Suite
// knowledge-graph-playground.ts - Quick functionality test
async function quickTest() {
  const userId = '6830f457429e53400d4e7c4a';
  const knowledgeGraph = new KnowledgeGraphService(userId);
  // Test 1: Register episode
  const episode = await knowledgeGraph.registerEpisode({
    type: EpisodeType.EXERCISE,
    date: new Date(),
    body: {
      totalWeight: 500,
      exercises: [{ name: 'push ups', sets: 3, reps: 15, weight: 0 }],
      duration: 30
    },
    tags: ['bodyweight', 'home']
  });
  if (!episode._id) throw new Error('Episode was not saved');
  // Test 2: Search episodes
  const results = await knowledgeGraph.searchEpisodes({
    query: 'push ups exercise workout',
    limit: 3
  });
  if (results.length === 0) throw new Error('Search returned no results');
  // Test 3: Generate insight
  const insight = await knowledgeGraph.generateInsight({
    context: 'recent exercise activity'
  });
  if (!insight) throw new Error('Insight was empty');
  // Reached only if none of the checks above threw
  console.log('All tests passed successfully!');
}
Performance Benchmarks
- Episode Registration: < 500ms including vector generation
- Semantic Search: < 100ms for cached vectors
- Insight Generation: < 3s including GPT API call
- Memory Usage: ~50MB for 10,000 cached vectors
- Storage Efficiency: ~1KB per episode + 3KB per vector
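The per-vector and memory figures depend on the embedding dimension and on how the numbers are stored. As a rough way to reason about them, the sketch below estimates the raw size of the embeddings alone. It assumes the default 1536 dimensions of text-embedding-3-small and 8 bytes per value for a plain JavaScript `number[]`, and it ignores metadata and object overhead. The function names are illustrative.

```typescript
// Raw size of one embedding, ignoring object headers and metadata.
function embeddingBytes(dimensions: number, bytesPerValue: number = 8): number {
  return dimensions * bytesPerValue;
}

// Raw size of a whole cache in MiB.
function cacheMegabytes(vectorCount: number, dimensions: number, bytesPerValue: number = 8): number {
  return (vectorCount * embeddingBytes(dimensions, bytesPerValue)) / (1024 * 1024);
}

// Assumed inputs: 10,000 vectors at 1536 dimensions.
const asDoubles = cacheMegabytes(10000, 1536);    // 117.1875
const asFloat32 = cacheMegabytes(10000, 1536, 4); // 58.59375
console.log({ asDoubles, asFloat32 });
```

Under these assumptions, storing embeddings in a `Float32Array` or requesting fewer dimensions from the embeddings API would be the main levers for reducing memory per cached vector.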
Deployment and Scaling
Infrastructure Setup
# Dockerfile
FROM oven/bun:1.1.21-slim AS base
WORKDIR /usr/src/app
# Install dependencies
COPY package.json bun.lockb ./
RUN bun install --frozen-lockfile
# Build application
COPY . .
RUN bun build ./src/index.ts --outdir=./dist --target=bun
# Production stage: copy build output from the base stage
FROM oven/bun:1.1.21-slim AS release
WORKDIR /usr/src/app
COPY --from=base /usr/src/app/dist ./dist
COPY --from=base /usr/src/app/node_modules ./node_modules
COPY --from=base /usr/src/app/package.json ./
# Health check (requires curl to be available in the image)
HEALTHCHECK \
  CMD curl -f http://localhost:3000/health || exit 1
EXPOSE 3000
ENTRYPOINT ["bun", "run", "dist/index.js"]
Environment Configuration
# Database
MONGODB_STRING=mongodb://localhost:27017/pesocerto
# AI Services
OPENAI_API_KEY=sk-...
# AWS S3 Vector Storage
AWS_ACCESSKEY=AKIA...
AWS_SECRETKEY=...
AWS_S3_BUCKET=peso-certo-vectors
AWS_REGION=us-east-1
# Performance Tuning
VECTOR_CACHE_SIZE=10000
SEARCH_RESULT_LIMIT=50
EMBEDDING_BATCH_SIZE=100
Monitoring and Observability
// Built-in performance monitoring
class KnowledgeGraphMetrics {
  static episodeRegistrations = 0;
  static searchQueries = 0;
  static insightGenerations = 0;
  static averageSearchTime = 0;
  static cacheHitRate = 0;
  static recordEpisodeRegistration(duration: number) {
    this.episodeRegistrations++;
    console.log(`Episode registered in ${duration}ms`);
  }
  static recordSearch(duration: number, resultsCount: number) {
    this.searchQueries++;
    // Incremental mean over all recorded searches
    this.averageSearchTime += (duration - this.averageSearchTime) / this.searchQueries;
    console.log(`Search completed in ${duration}ms with ${resultsCount} results`);
  }
}
Cost Analysis and Benefits
Cost Comparison
External graph database service (e.g., Neo4j Aura): $65/month base for 1GB, $300+ for enterprise features, plus $0.01 per 1000 queries and network egress costs.
Our native solution: $25/month MongoDB Atlas for 10GB, about $0.02 per GB-month for S3 vector storage, about $0.00002 per 1K tokens ($0.02 per 1M tokens) for text-embedding-3-small embeddings, and no query limits or network costs.
Estimated monthly savings: $200 to $500 for moderate usage.
Performance Benefits
- 85% faster queries, no network roundtrips for cached data
- No API rate limiting
- Custom health-domain optimizations
- System resilience during network issues
Development Benefits
- All logic in TypeScript
- No external API constraints
- Complete visibility into operations
- Health-specific optimizations with no platform restrictions
Future Enhancements
Short-term: approximate nearest neighbor indexing (FAISS, Annoy), real-time push notifications, multi-modal episode support for image and audio data.
Medium-term: federated learning for privacy-preserving ML, graph visualization for exploring health connections, integrations with fitness trackers and smart scales.
Longer-term: anonymous health insights for research, decentralized health data ownership, population-level health trend analysis.
Lessons Learned
- Vector caching is the single biggest performance lever. Without it, search took 2 seconds. With it, sub-100ms.
- Summary quality directly determines search relevance. Vague summaries produce vague results.
- Health-focused episode types and search patterns outperform generic solutions. Domain specificity matters.
- TypeScript's strong typing prevented numerous runtime errors during development, especially in the schema and embedding pipeline.
For scaling: partition users across multiple databases, split vectors by user groups or time periods, use Redis for cross-instance vector caching, and push expensive operations to background queues.
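As one illustration of splitting vectors by user group, the sketch below maps each user to a fixed shard and builds a per-user storage key. The key layout, the shard count of 16, and the function names are assumptions for the example, not the layout the project uses.

```typescript
// FNV-1a style 32-bit hash over the UTF-16 code units of the user ID.
function shardFor(userId: string, shardCount: number): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep as unsigned 32-bit
  }
  return hash % shardCount;
}

// Hypothetical key layout: vectors/<shard>/<userId>/<vectorId>.json
function vectorKey(userId: string, vectorId: string, shardCount: number = 16): string {
  return `vectors/${shardFor(userId, shardCount)}/${userId}/${vectorId}.json`;
}
```

With a layout like this, a search for one user only needs to load that user's prefix instead of the whole index, and each shard can be cached or moved independently.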
Conclusion
Building a native knowledge graph with MongoDB, S3, and LangChain worked well for Goal Weight. We eliminated the $200 to $500 monthly cost of external graph services, got 85% faster queries through in-memory caching, and kept complete control over sensitive health data.
The architecture patterns here (health-specific episode types, per-user vector namespacing, summary-based semantic search, and LangChain tool integration) can be adapted for other health applications or any domain where you need AI to reason over a user's personal history.
Start simple. Build core functionality first, then add AI features incrementally. Design data models and indexing strategies for growth before you need them, not after.