Data readiness is the ability to prove the fitness of data for generative AI use cases. Jean-Luc Chatelain told us how it affects enterprise GenAI adoption.
Generative AI capabilities have become a must for digital transformation. While the percentage of companies embarking on generative AI projects is rising, one key challenge continues to stand in their way: data readiness.
As the not-so-old adage goes, bad data leads to bad AI. Poor data quality and untapped value trapped in data silos continue to pose production and scalability challenges for generative AI projects.
In a recent webinar, we spoke to Jean-Luc Chatelain, founder and managing director at Verax Capital Advisors and former CTO of AI at Accenture, about data challenges that stifle generative AI adoption, and how emerging technologies such as Retrieval Augmented Generation (RAG) can help.
Simon McVeigh, who leads K2view’s sales engineering team and hosted the webinar, cited a Gartner study that found only 20% of AI projects, including generative AI, are being deployed at scale. Another 30% are abandoned before reaching production. In both cases, lack of data readiness is one of the top issues derailing generative AI projects.
Why are enterprises still facing data readiness issues when they’ve been prioritizing and investing in data organization for years? Jean-Luc outlined 3 main reasons:
Poorly curated data
At least 50% of the value of enterprise data, both structured and unstructured, is still trapped due to poor data management. Generative AI models can’t retrieve data that isn’t curated or maintained correctly. What’s left is incomplete data, which often leads to an irrelevant answer or an AI hallucination.
Rising volumes of transactional data
Transactional data – such as service requests, purchases, payments, and invoices – is the lifeblood of the enterprise. The volume of transactional data is constantly increasing, which makes data curation and cleansing an ever-growing challenge.
Manual data preparation
Data scientists spend 60-80% of their time preparing data manually and checking its quality. “We’re using very expensive people to serve as data janitors,” said Jean-Luc. This is one of the main reasons most enterprises cannot go beyond POCs and scale AI in production. The cost of having a human in the loop is far too high.
Organizing your data and assuring its quality are 2 important aspects of data readiness. Both have an impact on your RAG architecture, which augments your Large Language Model (LLM) with trusted business entity data. Injecting real-time customer data from your company’s own sources into your LLM, for example, results in better AI personalization and more meaningful interactions.
LLMs typically use a pre-trained model based on publicly available information from the Internet, explained Jean-Luc. But that doesn’t give you the whole picture. None of these models knows the specifics of your business. By accessing data from your own private data stores, your LLM now has the means to produce more relevant answers.
Other methods, like fine-tuning, may also contribute to data readiness, but when retrieval-augmented generation is compared with fine-tuning, RAG proves to be more reliable.
Retrieval-augmented generation is a generative AI framework that enhances your LLM by infusing fresh, reliable data from authoritative internal knowledge bases (unstructured data) and enterprise systems (structured data) to generate more informed, context-specific, and dependable responses.
The RAG model retrieves, selects, and prioritizes the most pertinent information from the appropriate sources based on the user's query. It then transforms this data into an enriched, contextual prompt to provide an accurate and coherent response to the user.
In short, RAG turns your generic LLM into one that knows your business and your customers.
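The retrieve, prioritize, and prompt-building steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular RAG product works: the `Document` class, the keyword-overlap scoring, and the sample knowledge-base entries are all assumptions made for the example. A production system would typically use embeddings and a vector index instead of word overlap.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Document:
    source: str  # hypothetical label for where the text came from
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w}


def score(query: str, doc: Document) -> int:
    """Toy relevance score: number of words shared by query and document."""
    return len(tokenize(query) & tokenize(doc.text))


def retrieve(query: str, docs: list[Document], top_k: int = 2) -> list[Document]:
    """Rank documents by score and keep the top_k that share any word."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:top_k] if score(query, d) > 0]


def build_prompt(query: str, docs: list[Document]) -> str:
    """Turn the retrieved documents into an enriched, contextual prompt."""
    context = "\n".join(f"- [{d.source}] {d.text}" for d in docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Made-up knowledge-base entries, for illustration only.
DOCS = [
    Document("kb/outages", "Internet outage reported in the north region until 6 pm."),
    Document("kb/billing", "Invoices are issued on the first day of each month."),
    Document("kb/router", "Restart the router if the internet connection drops."),
]
QUERY = "Is there an internet outage in my region?"

if __name__ == "__main__":
    print(build_prompt(QUERY, retrieve(QUERY, DOCS)))
```

The resulting prompt is what gets sent to the LLM, so the quality of the answer depends directly on the quality of the documents that can be retrieved.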
Simon gave a simple example of how RAG achieves ROI for generative AI projects and improves the customer experience. Imagine a customer of an Internet Service Provider (ISP) whose Internet is down interacting with the ISP’s chatbot. He tells the bot that his internet isn’t working. The bot, trained on generic data, provides a comprehensive response about how the customer can troubleshoot the issue, such as by restarting his router.
This response isn’t technically wrong. It’s one possible solution, after all. The problem is that it’s far too generic because it lacks context about this specific customer’s situation. Perhaps the customer’s whole region is experiencing an Internet outage. Or maybe he’s late paying his bill.
An advanced RAG chatbot would enable the LLM to combine information specific to the enterprise and to the customer to provide a more accurate, up-to-date, and contextually relevant response.
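To make the ISP example concrete, here is a rough sketch of how customer-specific facts might be added to the prompt before it reaches the LLM. The `Customer` fields, the set of regions with outages, and the wording of the facts are all hypothetical. In practice this data would come from billing and network-monitoring systems.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical structured record assembled from enterprise systems.
    name: str
    region: str
    days_overdue: int


def customer_context(customer: Customer, regions_with_outage: set[str]) -> list[str]:
    """Collect facts about this customer that the generic bot would miss."""
    facts = []
    if customer.region in regions_with_outage:
        facts.append(f"There is a known outage in the {customer.region} region.")
    if customer.days_overdue > 0:
        facts.append(f"The customer's bill is {customer.days_overdue} days overdue.")
    if not facts:
        facts.append("No outage or billing issue found; suggest router troubleshooting.")
    return facts


def enrich_prompt(message: str, customer: Customer, regions_with_outage: set[str]) -> str:
    """Prepend the customer-specific facts to the customer's message."""
    facts = "\n".join(f"- {f}" for f in customer_context(customer, regions_with_outage))
    return f"Customer facts:\n{facts}\n\nCustomer message: {message}"


if __name__ == "__main__":
    alex = Customer(name="Alex", region="north", days_overdue=0)
    print(enrich_prompt("My internet isn't working.", alex, {"north"}))
```

With these facts in the prompt, the bot can tell the customer about the regional outage instead of asking him to restart his router.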
Here’s Jean-Luc’s take on the top 5 data issues to address for RAG conversational AI:
Dealing with unstructured and structured data
In the context of RAG, the focus is usually on unstructured data. However, unstructured data alone cannot provide a complete picture of the enterprise, and it doesn’t account for individual customer situations. According to Jean-Luc, unstructured data provides only half of the answer. Providing the most relevant and valuable responses requires both unstructured and structured (transactional) data. However, ensuring proper data preparation and structured data quality is difficult.
Conversing in real-time
People expect their conversations with chatbots to feel as natural as talking to a person. Pausing mid-sentence leads to a frustrating customer experience. Providing accurate responses in a dynamic environment requires the ability to access source systems in the RAG pipeline and process transactional data in real time. These constraints often mean companies must choose between speed and accuracy.
Tearing down silos of information
Information silos created by systems of record like ERP, HRM, CRM, and others lead to data fragmentation, which makes enterprise-wide data retrieval difficult. For RAG to work, information stored in enterprise systems must be accessible, of high quality, and include accurate metadata.
Ensuring data security and privacy
Personally Identifiable Information (PII) and other sensitive data are vulnerable to data leaks or malicious attacks. It’s your responsibility to ensure your data is protected throughout the entire process by masking it properly and putting the right access controls in place.
Balancing cost and scalability
One reason so many generative AI projects never make it to deployment is that they become too expensive to implement at scale. Often, it ends up being cheaper to continue doing certain processes manually than to deploy generative AI. For companies with millions of customers, the cost of supporting a high volume of inbound requests and inferences eliminates any prospect of ROI.
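For the data security and privacy issue above, masking can be illustrated with a small Python sketch that replaces email addresses and phone numbers before text is placed in a prompt. The regular expressions and the `[EMAIL]` and `[PHONE]` placeholders are simplifying assumptions chosen for this example. Real PII detection covers many more data types and formats, and masking does not replace access controls.

```python
import re

# Simplified patterns, assumed for illustration; they will miss many real formats.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone numbers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(mask_pii("Contact jane.doe@example.com or 555-123-4567."))
```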
The key to successfully deploying GenAI projects and gaining the most possible value from RAG is maximizing generative AI data readiness. Clean, high-quality, and unified data with the correct metadata plays a major role in avoiding generic responses and providing a satisfactory customer experience. In addition, the data must be secured and available for real-time conversations at scale.
According to Jean-Luc, there are 3 key focus areas for improving data readiness:
Significantly improve data quality automation
Ensure data scientists aren’t spending the bulk of their time manually cleaning and preparing data for LLM use. Develop a system for keeping up with data preparation requirements as the volume of transactional data rises.
Focus on trustworthiness
Data security and privacy must be top-of-mind when planning generative AI projects. As RAG GenAI expands, so do the types and severity of security threats. All sensitive data, from production to usage, must be protected.
Break down data silos
Ensure data stored in enterprise systems is accessible and unified, with the correct metadata, so it can be accessed and used to provide context-based responses in real time.
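As an illustration of the first focus area, automating data quality, the sketch below runs a few rule-based checks over transactional records so that people only review the flagged rows. The field names (`id`, `customer_id`, `amount`) and the three rules are assumptions for the example, not a recommended rule set.

```python
from __future__ import annotations


def check_quality(
    records: list[dict],
    required: tuple[str, ...] = ("id", "customer_id", "amount"),
) -> list[tuple[int, str]]:
    """Return (row index, issue) pairs for records that fail basic checks."""
    issues = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # Rule 1: required fields must be present and non-empty.
        missing = [f for f in required if rec.get(f) in (None, "")]
        if missing:
            issues.append((i, f"missing: {', '.join(missing)}"))
        # Rule 2: amounts must not be negative.
        amount = rec.get("amount")
        if isinstance(amount, (int, float)) and amount < 0:
            issues.append((i, "negative amount"))
        # Rule 3: record ids must be unique.
        rec_id = rec.get("id")
        if rec_id is not None:
            if rec_id in seen_ids:
                issues.append((i, "duplicate id"))
            seen_ids.add(rec_id)
    return issues


if __name__ == "__main__":
    sample = [
        {"id": 1, "customer_id": "c1", "amount": 20.0},
        {"id": 2, "customer_id": "", "amount": 15.0},
        {"id": 1, "customer_id": "c3", "amount": -5.0},
    ]
    for row, issue in check_quality(sample):
        print(row, issue)
```

Checks like these can run every time new transactional data arrives, which is one way to keep up with data preparation as volumes rise.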
Learn more about K2view GenAI Data Fusion, the RAG tool that makes your data GenAI-ready.