How to Prepare Your Enterprise Data

Karen Pfeifer has over 25 years of extensive experience in the media industry. With a proven track record in leadership roles at renowned companies such as Disney, Hulu, Comcast/NBCU, and DirecTV/AT&T, Karen is a seasoned expert in data strategy, product development, governance, data science, and analytics. A proud graduate of UCLA, where she earned her B.A. in Music, Karen further honed her analytical skills by obtaining an M.S. in Analytics and Data Science from St. Joseph’s University. Her unique blend of creativity and technical expertise positions her at the forefront of transforming how data drives decision-making in the media landscape.

As an executive leader in data, I am constantly asked, “Can AI do that?” Yes, Generative AI (GenAI) is poised to revolutionize numerous industries, from content creation and medical advances to technology automation and financial modeling. However, my experience (the “school of hard knocks”) has taught me that for enterprises to harness the true potential of GenAI, they must first ensure their data is ready. Where traditional AI applications typically rely on labeled datasets, GenAI depends on massive amounts of high-quality, clean data, which is why accurate, well-structured data matters so much when training GenAI models.

In other words, there is a lot (and I mean A LOT) of hard work to do before one can harness the power of GenAI.

The Data Readiness Imperative

Enterprise data, by its very nature, is often complex. It can encompass a vast array of information from multiple sources with varying degrees of structure and consistency. In my experience, enterprise data is a mishmash of legacy databases, third-party sources, and many “snowflakes” from different departments. (You might be surprised at the number of ways the advertiser Coca-Cola could be listed in a database.) This inherent complexity can create significant roadblocks for GenAI adoption. Imagine trying to train a large language model to generate realistic dialogue if your data consists of siloed spreadsheets, poorly documented legacy systems, and inconsistently formatted records. The results would carry a high likelihood of AI hallucinations and confabulations.
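To make the Coca-Cola point concrete, here is a minimal sketch of how differently spelled advertiser names might be collapsed into one comparison key. The suffix list, the function name normalize_name, and the sample variants are all illustrative assumptions, not a description of any particular company's process; a real entity-resolution effort would need steward-maintained reference data and human review.

```python
import re
import unicodedata

# Hypothetical words to ignore when comparing names; a real list would
# be owned and maintained by data stewards.
IGNORED_WORDS = {"inc", "co", "company", "corp", "corporation", "llc", "ltd", "the"}

def normalize_name(raw: str) -> str:
    """Reduce a free-text advertiser name to a simple comparison key."""
    # Fold accents and case so "COCA-COLA" and "Coca-Cola" compare equal.
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Turn punctuation into spaces, then split into words.
    words = re.sub(r"[^a-z0-9]+", " ", text).split()
    # Drop the ignored words and join what is left into one key.
    return "".join(w for w in words if w not in IGNORED_WORDS)

# Made-up examples of how one advertiser might appear across systems.
variants = [
    "Coca-Cola",
    "COCA COLA CO.",
    "The Coca-Cola Company",
    "coca cola, inc.",
    "CocaCola",
]
keys = {normalize_name(v) for v in variants}
print(keys)  # all five variants collapse to the single key 'cocacola'
```

A key like this is only a starting point: it groups obvious variants but will not catch abbreviations or misspellings, which is where governed reference data comes in.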

Key Components of Data Readiness for GenAI

At this early stage of GenAI experimentation and adoption, I strongly advocate focusing not on the “new and shiny tech” but on investing in data readiness. (I might suggest this is the “measure twice, cut once” approach for enterprises.) Here are some critical aspects to consider when preparing your enterprise data for GenAI:

Data Quality and Integration: Investing in data-cleansing techniques and establishing data quality standards can significantly improve GenAI performance. GenAI is highly sensitive to data quality issues like errors, inconsistencies, and biases. Inaccurate data can lead to skewed or nonsensical outputs from your GenAI models. Additionally, integrating data from disparate sources into a unified and accessible enterprise data warehouse, with data that is governable through upstream normalization techniques, is essential for GenAI to function effectively.
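As a rough illustration of what an automated data quality standard could look like, the sketch below counts a few common problems in a small set of records. The field names (id, advertiser, spend_usd), the sample rows, and the three rules are assumptions chosen for the example; actual standards would be defined with the business owners of the data.

```python
from collections import Counter

# Hypothetical ad-sales records; field names are illustrative only.
records = [
    {"id": 1, "advertiser": "Coca-Cola", "spend_usd": 1200.0},
    {"id": 2, "advertiser": "", "spend_usd": 300.0},
    {"id": 3, "advertiser": "Pepsi", "spend_usd": -50.0},
    {"id": 3, "advertiser": "Pepsi", "spend_usd": -50.0},
]

def quality_report(rows):
    """Count rows that break three simple, assumed quality rules."""
    id_counts = Counter(r["id"] for r in rows)
    return {
        # Rule 1: every record should name an advertiser.
        "missing_advertiser": sum(1 for r in rows if not r["advertiser"].strip()),
        # Rule 2: spend should never be negative.
        "negative_spend": sum(1 for r in rows if r["spend_usd"] < 0),
        # Rule 3: ids should be unique.
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
    }

report = quality_report(records)
print(report)
```

Running checks like these upstream, before data lands in the warehouse, is one way to keep bad records from ever reaching a GenAI workload.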

Modern Data Architecture: Modern data architectures that leverage technologies like vector databases and LLM orchestration frameworks can provide the scalability and flexibility required for GenAI success. Traditional data architectures may not be equipped to handle the demands of GenAI workloads, which often involve complex algorithms, unstructured data sources, and real-time processing.
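For readers unfamiliar with vector databases, the core idea is storing documents as numeric vectors (embeddings) and retrieving the ones closest to a query. The toy sketch below shows that idea in memory with cosine similarity. It is not the API of any vector database product, and the three-number vectors and document titles are made up; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up 3-dimensional "embeddings" keyed by hypothetical document titles.
index = {
    "ad sales contract": [0.9, 0.1, 0.0],
    "viewer churn report": [0.1, 0.9, 0.2],
    "advertiser invoice": [0.8, 0.2, 0.1],
}

def top_k(query_vec, k=2):
    """Return the titles of the k documents most similar to the query."""
    scored = sorted(index.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [title for title, _ in scored[:k]]

results = top_k([1.0, 0.0, 0.0])
print(results)  # ['ad sales contract', 'advertiser invoice']
```

A production system would replace the dictionary and the full sort with an indexed store built for this kind of lookup at scale.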

Data Governance and Security: Robust data governance practices are essential to mitigate risks and ensure compliance with regulations when deploying GenAI. This includes establishing clear ownership of data assets, implementing access controls, and creating a mature data security and privacy framework. Data governance success relies on crucial input from subject matter experts across the company and buy-in from leadership to align on one source of truth for definitions, catalogs, dictionaries, lineage, and other data governance artifacts.

Challenges and Considerations

Assuming that you get the data in tip-top shape, there are challenges to consider:

Unforeseen Biases and Inaccuracies: GenAI models can inadvertently amplify biases in their training data, leading to discriminatory or inaccurate outputs. It’s crucial to carefully evaluate the training data for potential biases and implement techniques to mitigate them. Imagine that you are training on a massive dataset of streaming video, and the dataset inadvertently contains a disproportionate number of videos featuring left-handed juggling clowns. A model trained on that data will inherit its biases and inaccuracies, such as a cultural bias against clowns or transcriptions that mishandle clown accents and dialects. This could lead to errors and misunderstandings in your transcriptions or stereotypes in the content. (Not that I’m suggesting I’ve had to manage clown bias!)
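One simple first check for this kind of skew is to measure how much of the training set each category represents and flag anything above a chosen share. The sketch below assumes each video already carries a category label and uses an arbitrary 50% threshold; both are assumptions for illustration, and a real bias review would look at far more than label counts.

```python
from collections import Counter

def overrepresented(labels, max_share=0.5):
    """Return categories whose share of the dataset exceeds max_share."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items() if n / total > max_share}

# Made-up labels for ten training videos.
labels = ["clown"] * 7 + ["news"] * 2 + ["sports"] * 1
flagged = overrepresented(labels)
print(flagged)  # {'clown': 0.7}
```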

Hallucinations and Confabulations: GenAI models can generate highly realistic but entirely fabricated information. This can be particularly risky in applications where factual accuracy is paramount. Organizations should establish clear guidelines for how GenAI outputs are validated and used.

Privacy and Security Concerns: Handling sensitive data for GenAI applications raises concerns about data breaches, unauthorized access, and potential misuse. Implementing robust data security measures and adhering to data privacy regulations are essential and cannot be achieved with high confidence without strong enterprise data management and governance practices. This is one area that requires partnership with your CISO or security teams.
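As one small example of a security measure, sensitive values can be masked before text is ever passed to a GenAI model. The sketch below redacts email addresses and US-style phone numbers with regular expressions. The patterns, the placeholder tokens, and the sample sentence are assumptions for illustration; they will miss many formats and are no substitute for controls agreed with your security team.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

sample = "Contact jane.doe@example.com or 555-123-4567 about the renewal."
clean = redact(sample)
print(clean)  # Contact [EMAIL] or [PHONE] about the renewal.
```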

Integration Complexities: Most companies don’t have the luxury of starting from scratch with clean data and modern systems. Rather, GenAI needs to be integrated into existing enterprise systems and workflows, which can be technically challenging and disruptive. A well-defined strategy that leans into automated quality controls, normalized data, denormalized value-driven data products, role-based federated access, and a phased approach can help minimize disruption, smooth the integration process, and deliver trusted outputs to the receiving business systems.

Skill and Talent Gaps: Given how “new” GenAI is, many enterprises may lack the in-house expertise to manage and optimize GenAI deployments. Investing in training programs or partnering with AI specialists can help bridge these gaps. It is important to drive not just data and AI literacy, such as understanding the terms, but also data fluency, where everyone understands how to use data and GenAI to drive new insights and discovery.

Conclusion

It’s no surprise that most companies want to harness GenAI’s potential power and productivity gains. My experience, though, is that major gains only come with having your company’s data in order. By taking a proactive approach to data readiness, enterprises can unlock the transformative potential of Generative AI. Investing in data quality, modern data architecture, and robust data governance practices will lay the foundation for successful GenAI adoption. Remember, GenAI is a powerful tool, but it’s only as effective as the data it’s trained on. By prioritizing data readiness, enterprises can ensure their GenAI initiatives deliver real-world value and a significant competitive advantage.