Mario Brusarosco

transaction schema unification

In the ground since Fri Dec 20 2024

Last watered in Fri Dec 20 2024

Transaction Schema Unification Refactor

Overview

This document explains a critical refactoring that unified the transaction schemas across domains, fixed a bug that dropped movement_type between AI extraction and validation, and improved the overall architecture of our financial document parsing system.

The Problem We Faced

Initial Issue

Users were getting validation errors when uploading bank statements:

Field required [type=missing, input_value={'date': '2025-04-07', 'd...202,06', 'category': ''}, input_type=dict]
raw_statement.transactions.0.movement_type

Root Cause Analysis

Through systematic debugging, we discovered the issue wasn't with AI extraction (OpenAI was correctly extracting movement_type), but with our data flow architecture.

Architecture Problems

1. Duplicate Transaction Schemas

We had two different transaction schemas serving the same purpose:

# AI Layer (app/core/ai/models/responses.py)
from pydantic import BaseModel

class TransactionData(BaseModel):
    date: str
    description: str
    amount: str
    category: str = ""
    # Missing: movement_type

# Statements Domain (app/domains/statements/schemas.py)
class StatementTransaction(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str  # Present here!
    category: str = ""

2. Data Loss During Conversion

The statements service was manually converting between transaction formats:

# Statements Service (BEFORE FIX)
"transactions": [
    {
        "date": tx.date,
        "description": tx.description,
        "amount": tx.amount,
        "category": tx.category
        # Missing: movement_type! 🚨
    }
    for tx in financial_data.transactions
]

Result: Even though OpenAI extracted movement_type correctly, it was dropped during conversion.
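The failure can be reproduced in isolation. The sketch below assumes Pydantic v2 (the `[type=missing, ...]` error format suggests it) and a stripped-down `RawBankStatement` with a single field; the description and amount values are invented for illustration.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class StatementTransaction(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str  # required by the statements domain
    category: str = ""


class RawBankStatement(BaseModel):
    # Assumption: the real model probably has more fields than this.
    transactions: List[StatementTransaction]


# A hand-built dict like the one in the service: movement_type was never copied.
converted = {
    "date": "2025-04-07",
    "description": "Example payment",  # invented sample value
    "amount": "10,00",  # invented sample value
    "category": "",
}

error_locations = []
error_types = []
try:
    RawBankStatement.model_validate({"transactions": [converted]})
except ValidationError as exc:
    # Each error carries a "loc" tuple pointing at the offending field.
    error_locations = [err["loc"] for err in exc.errors()]
    error_types = [err["type"] for err in exc.errors()]
```

The location points at the transaction index and the field name, which matches the shape of the path in the error our users saw.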

3. Wrong Domain Ownership

❌ BEFORE: AI Layer defines TransactionData
- AI concerns mixed with business logic
- Transaction schema owned by infrastructure layer
- Domains import from infrastructure

✅ AFTER: Transactions Domain defines TransactionData
- Business logic owns business schemas
- AI layer imports from domain
- Proper dependency direction

The Solution: Domain-Driven Schema Unification

Step 1: Move Schema to Correct Domain

We moved the transaction schema to its rightful owner, the transactions domain:

# app/domains/transactions/schemas.py
from pydantic import BaseModel, Field


class TransactionData(BaseModel):
    """
    Simplified transaction data for AI parsing and document processing.

    Used by:
    - AI providers (OpenAI, Ollama) for structured output
    - Statement and invoice processing
    - Document parsing workflows
    """
    date: str = Field(description="Transaction date in ISO format")
    description: str = Field(description="Complete transaction description")
    amount: str = Field(description="Transaction amount (without sign, preserve precision)")
    movement_type: str = Field(description="Movement type: 'income' | 'expense' | 'transfer' | 'investment' | 'other'")
    category: str = Field(default="", description="Transaction category (empty if not explicit)")
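To see what a consumer of this schema receives, it can help to inspect the JSON schema Pydantic generates from it. This is a minimal sketch assuming Pydantic v2; how the schema is actually handed to OpenAI or Ollama is not shown here and may differ in our codebase.

```python
from pydantic import BaseModel, Field


class TransactionData(BaseModel):
    date: str = Field(description="Transaction date in ISO format")
    description: str = Field(description="Complete transaction description")
    amount: str = Field(description="Transaction amount (without sign, preserve precision)")
    movement_type: str = Field(description="Movement type: 'income' | 'expense' | 'transfer' | 'investment' | 'other'")
    category: str = Field(default="", description="Transaction category (empty if not explicit)")


schema = TransactionData.model_json_schema()

# Fields without a default are listed as required in the generated schema.
required_fields = schema["required"]

# Field descriptions travel with the schema, so they can double as extraction hints.
movement_type_hint = schema["properties"]["movement_type"]["description"]
```

Because movement_type has no default, it shows up in the required list, while category does not.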

Step 2: Update Dependencies

The AI layer now imports from the transactions domain:

# app/core/ai/models/responses.py
from app.domains.transactions.schemas import TransactionData

# app/core/ai/models/__init__.py
from app.domains.transactions.schemas import TransactionData

The statements domain uses the same schema:

# app/domains/statements/schemas.py
from typing import List

from pydantic import BaseModel

from app.domains.transactions.schemas import TransactionData


class RawBankStatement(BaseModel):
    transactions: List[TransactionData]  # Same schema!

Step 3: Eliminate Redundant Conversion

BEFORE (Manual Conversion):

"transactions": [
    {
        "date": tx.date,
        "description": tx.description,
        "amount": tx.amount,
        "movement_type": tx.movement_type,  # Easy to forget!
        "category": tx.category
    }
    for tx in financial_data.transactions
]

AFTER (Direct Usage):

"transactions": financial_data.transactions  # No conversion needed!
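A small self-contained sketch of the direct hand-off, assuming Pydantic v2. `FinancialData` is a hypothetical stand-in for the AI layer's parsed response, and the sample values are invented.

```python
from typing import List

from pydantic import BaseModel


class TransactionData(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""


class RawBankStatement(BaseModel):
    transactions: List[TransactionData]


class FinancialData(BaseModel):
    # Hypothetical stand-in for the AI layer's parsed response.
    transactions: List[TransactionData]


financial_data = FinancialData(
    transactions=[
        TransactionData(
            date="2025-04-07",
            description="Example payment",  # invented sample value
            amount="10,00",  # invented sample value
            movement_type="expense",
        )
    ]
)

# No per-field copying: the parsed objects are handed over as they are.
statement = RawBankStatement(transactions=financial_data.transactions)
```

Since both sides share one class, there is no field list to keep in sync, and every field on the parsed transaction arrives intact.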

Technical Implementation

Data Flow (Fixed)

1. PDF Upload → Extract Text
2. Text → OpenAI API with TransactionData schema
3. OpenAI extracts movement_type correctly ✅
4. Returns List[TransactionData] with all fields ✅
5. Direct assignment to RawBankStatement ✅
6. Validation succeeds ✅

Key Changes Made

Educational Insights

Why This Refactor Was Necessary

Domain-Driven Design Principles:

  • Business concepts belong in business domains
  • Infrastructure should depend on domain, not vice versa
  • Avoid schema duplication across layers

Schema Evolution Problems:

  • Manual conversion is error-prone - easy to forget fields
  • Schema drift happens when definitions are duplicated
  • Maintenance overhead increases with multiple definitions

How We Debugged This

Systematic Approach:

  1. Traced data flow from AI response to validation
  2. Checked each transformation step for data loss
  3. Identified the exact line where movement_type was dropped
  4. Root cause analysis revealed architectural issue

The error was NOT in AI extraction (which worked perfectly), but in our data transformation logic.
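The tracing approach can be sketched as a helper that walks snapshots of one record taken at each stage and reports where a field first goes missing. The function name, stage names, and sample values below are hypothetical; this is not the code we used, only an illustration of the idea.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def find_first_drop(stages: List[Tuple[str, Record]], field: str) -> Optional[str]:
    """Return the name of the first stage whose snapshot lacks `field`
    after an earlier stage had it. Returns None if it was never dropped."""
    seen = False
    for name, record in stages:
        if field in record:
            seen = True
        elif seen:
            return name
    return None


# Snapshots of one transaction at each step (stage names are invented).
stages = [
    ("ai_response", {"date": "2025-04-07", "amount": "10,00", "movement_type": "expense"}),
    ("service_conversion", {"date": "2025-04-07", "amount": "10,00"}),
    ("validation_input", {"date": "2025-04-07", "amount": "10,00"}),
]

dropped_at = find_first_drop(stages, "movement_type")
date_dropped_at = find_first_drop(stages, "date")
```

Here the helper points at the conversion stage for movement_type and reports nothing for date, which mirrors what we found by hand.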

Lessons Learned

Benefits Achieved

Immediate Fixes

  • movement_type is preserved end to end - no more validation errors
  • Cleaner codebase - eliminated redundant conversion logic
  • Reduced complexity - fewer lines of code, fewer bugs

Long-term Improvements

  • Easier maintenance - single place to update transaction schema
  • Automatic compatibility - changes flow through automatically
  • Better architecture - proper domain boundaries
  • Future-proof - easier to add new transaction fields

Performance Gains

  • No unnecessary object creation during conversion
  • Direct object usage reduces memory allocations
  • Simpler code paths are easier to read and skip the per-field copying work

Best Practices Established

Schema Design

# ✅ Good: Single authoritative schema
app/domains/transactions/schemas.py:
    - TransactionData (for AI parsing)
    - TransactionBase (for domain operations)

# ❌ Avoid: Duplicate schemas in different layers
app/core/ai/models/responses.py: TransactionData
app/domains/statements/schemas.py: StatementTransaction

Dependency Management

# ✅ Good: Infrastructure depends on domain
from app.domains.transactions.schemas import TransactionData

# ❌ Avoid: Domain depends on infrastructure
from app.core.ai.models.responses import TransactionData
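One way to keep the dependency direction from regressing is a check that scans domain modules for imports from the infrastructure layer. The sketch below uses only the standard library `ast` module; the helper name is hypothetical, and wiring it into a test suite or CI job is left open.

```python
import ast
from typing import List


def find_forbidden_imports(source: str, forbidden_prefix: str) -> List[str]:
    """Return imported module names in `source` that start with `forbidden_prefix`."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        found.extend(name for name in names if name.startswith(forbidden_prefix))
    return found


# Source text of a domain module that reaches into infrastructure (the pattern to avoid).
domain_source = "from app.core.ai.models.responses import TransactionData\n"
# Source text of an infrastructure module importing from the domain (the intended direction).
infra_source = "from app.domains.transactions.schemas import TransactionData\n"

domain_violations = find_forbidden_imports(domain_source, "app.core.ai")
infra_violations = find_forbidden_imports(infra_source, "app.core.ai")
```

In practice the source strings would come from reading the files under app/domains, with the forbidden prefix set to the infrastructure package.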

Data Transformation

# ✅ Good: Direct usage of shared schemas
"transactions": financial_data.transactions

# ❌ Avoid: Manual conversion between identical structures
"transactions": [{"field": tx.field} for tx in items]

Future Considerations

Schema Evolution

  • Add new fields only to transactions domain
  • Changes automatically propagate to all consumers
  • Version migration can be handled in one place

Testing Strategy

  • Test schema consistency across domains
  • Validate data flow from AI to database
  • Monitor for schema drift in CI/CD
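A schema consistency test could compare the fields one model provides against the fields another requires. This sketch assumes Pydantic v2 (`model_fields` and `FieldInfo.is_required()`); the helper and the `OldAiTransaction` class name are hypothetical, with the latter recreating the pre-refactor AI-layer schema.

```python
from typing import Set, Type

from pydantic import BaseModel


def missing_required_fields(source: Type[BaseModel], target: Type[BaseModel]) -> Set[str]:
    """Fields that `target` requires but `source` does not define."""
    required = {name for name, info in target.model_fields.items() if info.is_required()}
    return required - set(source.model_fields)


class OldAiTransaction(BaseModel):
    # Pre-refactor AI-layer schema: no movement_type.
    date: str
    description: str
    amount: str
    category: str = ""


class StatementTransaction(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""


drift = missing_required_fields(OldAiTransaction, StatementTransaction)
no_drift = missing_required_fields(StatementTransaction, StatementTransaction)
```

Run against the old pair of schemas, a check like this would have flagged movement_type before any statement was uploaded. With a single shared schema it becomes a guard against someone reintroducing a duplicate.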

Monitoring

  • Log schema field counts to detect missing fields
  • Track AI extraction success rates for new fields
  • Alert on validation failures during parsing

Conclusion

This refactor demonstrates the importance of proper domain-driven design and schema management. By moving transaction schemas to their rightful domain and eliminating redundant conversions, we:

  1. Fixed immediate bugs (movement_type dropped during conversion)
  2. Improved system architecture (proper domain boundaries)
  3. Reduced future maintenance (single source of truth)
  4. Enhanced debuggability (cleaner data flow)

The lesson: Architecture problems often manifest as data transformation bugs. When debugging, look beyond the immediate error to understand the underlying structural issues.