In an earlier BBI article on optimizing data readiness for AI modeling in financial services, we focused on what it takes to create “AI-ready” data: high-quality, well-governed, well-labeled datasets for supervised models and analytics. This piece assumes that foundational work has started. Here, we go a level deeper into what it means to become GenAI-ready – especially for copilots and assistants that rely on messy, unstructured content and RAG-style architectures.
Boards are asking for a GenAI strategy. Vendors are showing glossy demos of copilots for relationship managers, underwriters, and advisors. Competitors are announcing pilots every quarter.
On the surface, financial services looks like it’s racing into the GenAI future.
Scratch that surface and the story is very different. Many banks, digital lenders, and wealth platforms are stuck in the same pattern: a handful of impressive pilots, a lot of PowerPoint, and almost nothing that has scaled into the messy, regulated reality of day-to-day operations.
The reason is not a lack of models.
It’s the data basement.
Everyone wants a beautiful AI “living room experience” – sleek chat interfaces for customers, smart assistants for employees, agents that can “do work” in the background. But underneath that living room is a basement that’s cluttered, leaking, and poorly documented. The plumbing is old, nobody quite knows what’s stored where, and any attempt to plug GenAI into it reliably triggers surprises.
Before you invite GenAI into your organization, you need to clean up the basement.
Most GenAI conversations in financial services still start with models: which LLM, which vendor, which cloud. That's the wrong starting point. Once again, it is not about technology; it is about business. GenAI is far more demanding on data than traditional AI. It doesn't just need labeled datasets for training; it needs live access to your unstructured content – policies, credit memos, emails, KYC documents – through document stores, vector databases, and RAG pipelines.
In a regulated environment, GenAI only becomes useful when it’s deeply grounded in your own data and context. The model is the engine; the data is the fuel, the road and the GPS.
Practically, GenAI needs four things from your data: a coherent view of context across systems, discipline around unstructured content, clear governance, and durable retrieval plumbing.
A GenAI assistant for underwriters, for example, cannot live on a single system. It needs to pull together bureau data, internal behavior scores, collateral, collections history, income proofs, previous decisions, and policy constraints into one coherent view.
A relationship manager copilot needs to combine customer holdings, transaction patterns, preferences, product catalog, pricing rules, and recent interactions.
If this context lives in ten different systems with inconsistent keys and overlapping identifiers, GenAI will amplify the fragmentation, not fix it.
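To make that concrete, here is a deliberately simplified sketch (in Python, with entirely illustrative system names and fields) of what stitching an underwriting context together can look like when you stop hoping the keys line up and maintain an explicit ID crosswalk instead:

```python
from dataclasses import dataclass, field

# Hypothetical ID crosswalk: each source system knows the customer by a different key.
ID_CROSSWALK = {
    ("core_banking", "CB-00912"): "CUST-001",
    ("bureau",       "BR-77154"): "CUST-001",
    ("collections",  "CL-4410"):  "CUST-001",
}

@dataclass
class UnderwritingContext:
    """One coherent view the assistant can ground its answers in."""
    customer_id: str
    bureau_score: int | None = None
    behaviour_score: float | None = None
    collections_flags: list[str] = field(default_factory=list)
    policy_constraints: list[str] = field(default_factory=list)

def resolve(system: str, local_id: str) -> str | None:
    """Map a system-specific key onto the canonical customer ID."""
    return ID_CROSSWALK.get((system, local_id))

def build_context(canonical_id: str, records: list[dict]) -> UnderwritingContext:
    """Fold records from several systems into one context object."""
    ctx = UnderwritingContext(customer_id=canonical_id)
    for rec in records:
        if resolve(rec["system"], rec["local_id"]) != canonical_id:
            continue  # unmapped or wrong customer - skip rather than guess
        if rec["system"] == "bureau":
            ctx.bureau_score = rec["score"]
        elif rec["system"] == "core_banking":
            ctx.behaviour_score = rec["behaviour_score"]
        elif rec["system"] == "collections":
            ctx.collections_flags.extend(rec["flags"])
    return ctx

records = [
    {"system": "bureau", "local_id": "BR-77154", "score": 742},
    {"system": "core_banking", "local_id": "CB-00912", "behaviour_score": 0.81},
    {"system": "collections", "local_id": "CL-4410", "flags": ["late_2023_q4"]},
]
print(build_context("CUST-001", records))
```

None of this is sophisticated; the point is that the mapping has to exist somewhere you can maintain and audit it, rather than being improvised inside each pilot.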
The industry is used to thinking about structured data: balances, transactions, delinquencies, limits, exposures. Data warehouses, marts, and cubes exist to serve that world.
GenAI changes the game because its biggest promise sits in unstructured content: policies, credit memos, emails, KYC documents. This is where judgment, nuance, and institutional memory live. This is also where chaos reigns: shared drives, endless folders, random naming conventions, and no metadata.
If you don’t bring structure and discipline to this unstructured content, your GenAI pilots will spend more time tripping over bad inputs than delivering value.
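"Structure and discipline" sounds abstract, but in practice it often starts with something mundane: a minimal, agreed metadata record per document before anything gets embedded or indexed. The sketch below shows one hypothetical shape for that record – the field names are illustrative, not a standard:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DocumentRecord:
    """Minimal metadata a document needs before it goes anywhere near a GenAI pipeline."""
    doc_id: str
    title: str
    doc_type: str          # e.g. "credit_policy", "kyc_document", "credit_memo"
    source_system: str     # where the authoritative copy lives
    owner: str             # accountable team or role, not an individual's inbox
    classification: str    # e.g. "internal", "confidential", "restricted"
    effective_date: date
    expiry_date: date | None  # None = evergreen, reviewed on a schedule

def indexable(doc: DocumentRecord, today: date) -> bool:
    """Only documents with a known owner, type, source, and valid dates get indexed."""
    if not (doc.owner and doc.doc_type and doc.source_system):
        return False
    if doc.expiry_date is not None and doc.expiry_date < today:
        return False
    return True

doc = DocumentRecord(
    doc_id="POL-2021-014",
    title="SME unsecured lending policy",
    doc_type="credit_policy",
    source_system="policy_portal",
    owner="credit_risk_policy_team",
    classification="internal",
    effective_date=date(2021, 6, 1),
    expiry_date=date(2024, 6, 1),
)
print(indexable(doc, today=date(2025, 1, 15)))  # False - this policy expired in 2024
```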
In financial services, “move fast and break things” isn’t a strategy; it’s a regulatory incident waiting to happen, one that can land an organization in serious fines and reputational damage.
Data used by GenAI needs clear ownership, provenance, and access rules.
Without that, risk and compliance leaders will (correctly) refuse to allow GenAI to touch anything important. And even if you somehow push it through, you will not be able to explain or defend decisions when they are challenged.
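What “clear” looks like mechanically: if every chunk of content carries its source, owner, and classification, you can keep the assistant inside a user’s entitlements and trace every answer back to where it came from. A toy illustration follows; the entitlement model and all the names in it are assumptions, not a recommendation:

```python
# Hypothetical retrieved chunks, each carrying provenance and classification.
chunks = [
    {"text": "Unsecured SME exposure capped at ...", "source": "POL-2021-014",
     "owner": "credit_risk_policy_team", "classification": "internal"},
    {"text": "Customer-level pricing exceptions ...", "source": "MEMO-2024-332",
     "owner": "pricing_committee", "classification": "restricted"},
]

# Hypothetical entitlement map: which classifications each role may see.
ENTITLEMENTS = {
    "underwriter": {"internal"},
    "credit_committee": {"internal", "restricted"},
}

def permitted_chunks(role: str, retrieved: list[dict]) -> list[dict]:
    """Drop anything the user is not entitled to before it reaches the model."""
    allowed = ENTITLEMENTS.get(role, set())
    return [c for c in retrieved if c["classification"] in allowed]

def cite(selected: list[dict]) -> str:
    """Build the citation trail that makes an answer defensible later."""
    return "; ".join(sorted({c["source"] for c in selected}))

visible = permitted_chunks("underwriter", chunks)
print(len(visible), "chunk(s) usable; sources:", cite(visible))
```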
Modern enterprise GenAI is increasingly built on Retrieval-Augmented Generation (RAG): the model answers questions by “calling out” to trusted knowledge sources and data stores, not just relying on what it was trained on.
To do that well, you need real plumbing: document pipelines that keep content current, indexes the model can search, and APIs that serve the right context with the right permissions.
This is not “nice to have”. Without durable plumbing, you end up manually exporting CSVs and PDFs into short-lived pilots that never make it into production.
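For readers who want to see the shape of that plumbing, here is a deliberately toy version of the RAG loop: retrieve the most relevant, permitted content first, then hand it to the model along with its sources. Naive keyword overlap stands in for a real embedding index, and `call_llm` is a placeholder rather than any specific vendor’s API:

```python
# A toy version of the RAG loop: retrieve grounded context, then generate.
KNOWLEDGE = [
    {"source": "POL-2021-014", "text": "unsecured sme lending requires two years of financials"},
    {"source": "KYC-GUIDE-07", "text": "kyc refresh is due every twelve months for high risk customers"},
    {"source": "MEMO-2024-332", "text": "pricing exceptions need credit committee approval"},
]

def retrieve(question: str, k: int = 2) -> list[dict]:
    """Rank chunks by naive keyword overlap with the question (a stand-in for a vector index)."""
    q_terms = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE,
        key=lambda c: len(q_terms & set(c["text"].split())),
        reverse=True,
    )
    return scored[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; echoes the prompt so the sketch runs."""
    return f"[model answer grounded in]\n{prompt}"

def answer(question: str) -> str:
    context = retrieve(question)
    sources = ", ".join(c["source"] for c in context)
    prompt = (
        "Answer using only the context below and cite the sources.\n"
        + "\n".join(f"- ({c['source']}) {c['text']}" for c in context)
        + f"\nQuestion: {question}\nSources: {sources}"
    )
    return call_llm(prompt)

print(answer("what financials are required for unsecured sme lending"))
```

Production versions replace the keyword scoring with an embedding index, add the permission filtering shown earlier, and keep the knowledge base synchronized with its source systems – which is exactly the plumbing most pilots skip.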
Most leaders in financial services know their data isn’t perfect. What they often underestimate is how directly that state blocks GenAI from ever leaving the lab.
Here’s what the basement usually looks like when you turn the lights on.
Core banking, cards, SME lending, retail lending, wealth, treasury – each with its own stack, data store, and team. Within each, historical mergers and product launches leave multiple “systems of record” behind.
The same customer looks different across these systems: different IDs, different statuses, inconsistent KYC. Stitching them together is an art form.
GenAI sitting on top of this doesn’t magically create a single customer view. It just amplifies the inconsistencies.
Decades of activity live in shared drives, aging document management systems, email archives, and SharePoint sites.
Nobody is quite sure which folder is “the source of truth”. There is no consistent taxonomy, no labels, no expiry. For a model, this is worse than no data at all.
Ask three teams to define “active customer” and you might get three different answers. The same goes for NPA, exposure, churn, or even “approved”.
When you plug GenAI into this world without reconciling meanings, you’re effectively saying: “Summarize conflicting realities for me, quickly.” The outputs may sound fluent, but they are built on sand.
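One low-tech antidote is to pin each contested term to a single, versioned definition that both humans and the assistant read from – something as simple as the hypothetical glossary below:

```python
from datetime import date, timedelta

# Hypothetical shared glossary: one agreed, versioned definition per contested term.
GLOSSARY = {
    "active_customer": {
        "version": "2025-01",
        "definition": "At least one customer-initiated transaction in the last 90 days.",
        "owner": "retail_banking_data_office",
    },
}

def is_active_customer(last_txn_date: date, as_of: date) -> bool:
    """The executable form of the agreed definition - the only one anyone uses."""
    return (as_of - last_txn_date) <= timedelta(days=90)

print(GLOSSARY["active_customer"]["definition"])
print(is_active_customer(date(2024, 12, 1), as_of=date(2025, 1, 15)))  # True
```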
In too many organizations, basic questions like “Who owns this field?” or “What’s the authoritative source for this metric?” require days of chasing.
A patchy or outdated data catalog (if it exists at all) reinforces this. GenAI needs to know where to look and which source to trust. If humans don’t know, the model certainly won’t.
There are often no clear, approved policies on what data GenAI is allowed to touch, or under what controls.
In this vacuum, the safest thing for risk and compliance to say is “No” or “Only in the lab”. That’s rational behaviour, not resistance to innovation.
Traditional data warehouses and BI stacks were designed for overnight or T+1 reporting. They serve dashboards, not real-time assistants and agents.
GenAI-powered experiences need fresher data, tighter loops, and programmatic access through APIs and events, not just scheduled batch loads. That shift hasn’t happened yet in most environments.
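The shift is less about any one tool and more about the contract: the assistant asks at question time and gets an answer that reflects events up to now, not up to last night’s load. A toy, in-memory sketch of that contract (the event shape and function names are assumptions):

```python
from datetime import datetime, timezone

# Toy in-memory view, kept current by events instead of an overnight batch load.
positions: dict[str, dict] = {}

def on_transaction_event(event: dict) -> None:
    """Apply each transaction event as it arrives, so the view is always current."""
    cust = positions.setdefault(event["customer_id"], {"balance": 0.0, "as_of": None})
    cust["balance"] += event["amount"]
    cust["as_of"] = event["timestamp"]

def get_position(customer_id: str) -> dict:
    """What an assistant would call at question time - fresh, not T+1."""
    return positions.get(customer_id, {"balance": 0.0, "as_of": None})

on_transaction_event({
    "customer_id": "CUST-001",
    "amount": -1250.00,
    "timestamp": datetime.now(timezone.utc).isoformat(),
})
print(get_position("CUST-001"))
```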
At BBI, we sit squarely in the data basement.
We work with lenders, banks, wealth and asset managers, and GCCs to modernize their data platforms and get them GenAI-ready – from domain modeling and data products to document pipelines, RAG architectures, and the governance that keeps regulators and boards comfortable.
If you’re under pressure to “do something with GenAI” but have a nagging feeling that your data basement will hold you back, that’s exactly the conversation we like to have. If you’re still earlier on the curve and need to get your core data estate AI-ready before you even touch GenAI, start with our article on optimizing data readiness for AI modeling in financial services – and then use this piece as your next step.
If you’d like a pragmatic view of where you are on the data readiness ladder – and what a 90-day path forward could look like – reach out. We’re happy to share the questions we use and the patterns we’ve seen, so you can invite GenAI in only when the basement is ready.