Data strategy is more than a passing trend for development teams: it’s the foundational layer for getting the most out of AI and automation technologies. Your data strategy defines how data quality, governance, and accessibility support your business goals. Good data can be a goldmine, and your organization’s efficiency depends on it. But 80 to 90% of the world’s data is unstructured: messy, inconsistent, and hard to process with traditional databases and algorithms. AI offers ways to organize and make sense of unstructured data, which can open up new products or commercial opportunities.
But rushing into AI projects without a solid data strategy often leads to disappointing results: “garbage in, garbage out,” as data developers often say. Get your data foundations right for a smooth, or at least less bumpy, AI project rollout.
In the first episode of Stack Overflow’s Leaders of Code podcast, Don Woodlock, Head of Global Healthcare Solutions at InterSystems, and Stack Overflow CEO Prashanth Chandrasekar discuss data strategy’s critical role in AI development with host Ben Popper.
Woodlock believes that failing to fine-tune your data is like going to a party and picking up the guitar in the living room only to find it’s wildly out of tune. Even Jimi Hendrix would struggle to impress the guests. Extending the analogy, he says, “Step one is to get it tuned, then you can layer great playing on top of that. That’s the way I think of data.”
Woodlock highlights the importance of a clean data strategy before starting AI projects. He recommends getting the foundations right before moving to the technical implementation, like building a RAG (retrieval-augmented generation) system or choosing an AI platform. The plan should include a five-to-ten-year vision of how your data and systems will integrate.
He notes that a lot of healthcare data is unstructured, and it can get messy. In medical records, patient data from multiple sources may carry different IDs or name variations like “Don” versus “Donald,” or a patient’s new address versus their old one. Without a patient-matching algorithm, the data isn’t properly integrated. Data normalization improves the accuracy of AI models and analysis for better patient outcomes.
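To see why that matters, here’s a minimal, hypothetical sketch of record normalization and matching in Python. The fields, weights, and threshold are illustrative only; this is not InterSystems’ algorithm, and production patient-matching systems rely on far more rigorous probabilistic or ML-based methods.

```python
from difflib import SequenceMatcher

# Hypothetical records for the same patient pulled from two source systems.
record_a = {"id": "MRN-1001", "name": "Donald Smith", "dob": "1961-04-02", "zip": "02139"}
record_b = {"id": "EHR-778", "name": "Don Smith", "dob": "1961-04-02", "zip": "02139"}

def normalize(record):
    # Trim and lowercase free-text fields so superficial differences don't block a match.
    return {k: v.strip().lower() if isinstance(v, str) else v for k, v in record.items()}

def match_score(a, b):
    # Crude weighted similarity: exact date of birth and ZIP carry most of the
    # weight, while names are fuzzy-matched to handle "Don" vs. "Donald".
    a, b = normalize(a), normalize(b)
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    dob_match = 1.0 if a["dob"] == b["dob"] else 0.0
    zip_match = 1.0 if a["zip"] == b["zip"] else 0.0
    return 0.4 * name_sim + 0.4 * dob_match + 0.2 * zip_match

if match_score(record_a, record_b) > 0.8:
    print("Likely the same patient: link both records under one master ID")
```

Even a toy example like this shows the payoff: once records resolve to a single master ID, downstream models and analytics see one coherent patient history instead of fragments.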
For complex data and AI integration projects, it helps to be realistic about your starting point. Clarifai’s Matthew Zeiler, speaking previously with us, observed that many enterprises overestimate the quality of their data. When they dig into it, they discover “there’s not that much of it, or they don’t even know where it is internally.”
Woodlock and Chandrasekar emphasize that data quality is just as important as the AI model in producing high-quality output. A clean, centralized knowledge base supports AI model training improvements that yield better results for internal and customer-facing AI initiatives. Organizing and codifying your team’s knowledge is a virtuous circle for future model training or RAG methods and indexing.
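For readers newer to the term, retrieval-augmented generation fetches relevant passages from your own knowledge base and hands them to the model as context. Below is a minimal, hypothetical sketch of the retrieval side in Python; the toy hashing-trick embedding and sample documents are placeholders for a real embedding model and your actual curated content.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy bag-of-words embedding using the hashing trick; in practice you would
    # swap in a real embedding model here.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def build_index(documents: list[str]) -> list[tuple[str, np.ndarray]]:
    # Index the curated knowledge base once, up front.
    return [(doc, embed(doc)) for doc in documents]

def retrieve(question: str, index: list[tuple[str, np.ndarray]], top_k: int = 3) -> list[str]:
    # Rank documents by cosine similarity to the question and keep the best few.
    q = embed(question)
    ranked = sorted(index, key=lambda item: -float(q @ item[1]))
    return [doc for doc, _ in ranked[:top_k]]

# The retrieved passages are prepended to the prompt sent to the LLM, so the
# model's answer is grounded in your own curated knowledge rather than guesswork.
docs = [
    "Patient records from different systems are linked by a master patient index.",
    "Discharge summaries must be reviewed by a clinician before they are filed.",
]
index = build_index(docs)
print(retrieve("Who reviews discharge summaries?", index, top_k=1))
```

The quality of what this retrieval step returns is bounded by the quality of the knowledge base behind it, which is exactly why the curation work comes first.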
Having a human in the loop to review AI system output is also vital, and the stakes are especially high in regulated industries like healthcare, where data collection is subject to legal guidelines for privacy and security.
Woodlock gives the example of a clinician writing up a patient’s medical notes. Automated notetakers are well-established, and AI tools speed up this process further. Clinicians need to be aware of the high potential for inaccuracies and review all AI-generated output for errors that could cause harm. Research by Microsoft and Carnegie Mellon University shows that although AI tools can improve productivity, over-reliance on them can inhibit critical engagement with work.
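One way to enforce that review step in software is a simple approval gate: the AI draft is held until a named clinician signs off. The sketch below is a hypothetical illustration in Python, not InterSystems’ implementation; the record fields and workflow are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DraftNote:
    patient_id: str
    text: str                         # draft produced by the AI notetaker
    approved: bool = False
    reviewer: str | None = None
    reviewed_at: datetime | None = None

def approve(note: DraftNote, reviewer: str, edited_text: str | None = None) -> DraftNote:
    # Record that a named clinician has read (and optionally corrected) the draft.
    if edited_text is not None:
        note.text = edited_text
    note.approved = True
    note.reviewer = reviewer
    note.reviewed_at = datetime.now(timezone.utc)
    return note

def file_to_record(note: DraftNote) -> None:
    # The gate: unreviewed AI output never reaches the medical record.
    if not note.approved:
        raise PermissionError("AI-generated note requires clinician sign-off before filing")
    print(f"Filed note for {note.patient_id}, signed off by {note.reviewer}")

draft = DraftNote(patient_id="MRN-1001", text="Patient reports improved mobility...")
file_to_record(approve(draft, reviewer="Dr. Lee"))
```

Keeping the sign-off explicit in the workflow also creates an audit trail, which matters in regulated settings where accountability for the final note stays with the clinician.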
Chandrasekar believes that bringing humans and GenAI together helps Stack Overflow’s customers deliver an outstanding user experience by better integrating AI into system workflows. He emphasizes the need for high-quality, curated data built from your team’s knowledge to prevent “LLM brain drain”: when models stagnate due to a lack of new insights and human-generated information.
InterSystems has embedded GenAI into its software to improve the clinician user experience, aiming to fix the frustration clinicians have historically encountered with clunky, unreliable software. The goal is to make tech feel more human. Narrow AI (nAI) can ask a conversational flow of questions about the patient, review available medical knowledge, and automatically write documents like discharge or surgical summaries.
Other healthcare tech providers have seen similar efficiencies from AI. HiLabs’ Amit Garg proposes that GenAI and ML (machine learning) can mimic healthcare subject matter experts to standardize, enrich, and clean data. This approach solves persistent data challenges, like maintaining the accuracy of health plan provider directories. It’s important to note that this technology doesn’t replace people; instead, it allows teams to engage in deeper thinking tasks.
In the podcast, Woodlock says many companies find it challenging to scale a successful GenAI pilot: although pilots may show double-digit productivity gains, replicating those results across the organization can be tough.
This is often because of the human element involved. Rather than blithely assuming that the tech alone will deliver productivity gains, organizations need to marry new technology with new ways of working. Processes and governance that work in a smaller pilot project may not run as smoothly across a large, matrixed organization. Clear guidelines are necessary to support adoption.
The rollout phase is also about building trust with stakeholders. In a medical setting, of course, there are major, understandable concerns about inaccuracies that could negatively impact care and violations of patient privacy. Healthcare organizations that want to incorporate these tools into their workflows should focus on building trust by running pilot programs and sharing the results.
This skepticism about AI output is mirrored in our annual Developer Survey. Enthusiasm for GenAI developer tools is increasing each year, with over 3 in 4 respondents (76%) using or planning to use them. However, trust in the output of AI tools is not assured: 31% of developers are skeptical, and only 42% of professional developers trust its accuracy. Respondents express similar concerns about hallucinations and about deploying AI-generated code directly into critical production environments.
Good data management and governance shouldn’t necessarily slow down processes. On the contrary, they can help you move faster. Woodlock quotes F1 driver Mario Andretti: “A lot of people think that the brakes are to slow you down. If you have good brakes, you can drive faster.”
Similarly, Woodlock says that once organizations figure out their governance style, they can speed up their AI journey.
In a previous conversation on the Stack Overflow podcast, Coalesce’s Satish Jayanthi observed that a successful data strategy needs the right people, processes, and technology to come together. The people are the trickiest part: the right stakeholders need to be at the table to oversee data governance.
As AI adoption grows, the plethora of models and approaches to data management and governance creates opportunities, but also adds complexity.
In the last year, the industry has shifted from a handful of good general-purpose LLMs (large language models) to several reliable open-source and nAI models supporting specific business requirements. Throw in agentic AI, and there’s a multitude of offerings to choose from.
Woodlock’s priority is to focus on accuracy, such as measuring the reliability of an AI-generated summary of a patient-clinician conversation. Upskilling your team on AI trends is also crucial: his Code to Care video series explains AI-related topics like RAG and agentic AI.
Chandrasekar observes that the data used to train models has been largely exhausted. There’s a need to develop mechanisms for new knowledge and data creation: “With more pressure on our customers to do more with less, there’s a temptation to believe AI will rapidly drive productivity gains.” He notes that “It’s important to recognize that AI is not a panacea for all things yet” and cautions that many are overestimating AI’s impact in the short term and underestimating its long-term transformational impact.
To summarize the conversation: First, you need to lay the foundations, like establishing clean data sets and your knowledge base. Start now, because getting this right can take longer than you think. Then, you’ll be set up to make the most of the opportunities AI offers.