A Data Engineering Perspective of LLMs

Data engineering is a field I would categorize as a subspecialty of software engineering. It shares the same concerns as software engineering—scalability, maintainability, and other “-ilities”—but its primary focus is on data. It’s a unique discipline because data is inherently messy, and as a result, no standard enterprise framework has emerged to dominate the space—and perhaps it never will. 

The complexity of a data project can often be measured by the variety and nature of its data sources. For example, a project involving high-volume Internet of Things (IoT) data is vastly different from one dealing with a structured database modeled with Slowly Changing Dimensions (SCD) Type 1. 
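For readers less familiar with dimensional modeling, SCD Type 1 simply means that updates overwrite the existing row and no history is kept. A minimal sketch with hypothetical data in pandas:

```python
import pandas as pd

# Hypothetical customer dimension table, modeled as SCD Type 1 (no history kept)
dim_customer = pd.DataFrame(
    {"customer_id": [1, 2], "city": ["Vancouver", "Calgary"]}
).set_index("customer_id")

# Incoming change from the source system: customer 2 moved to Toronto
updates = pd.DataFrame(
    {"customer_id": [2], "city": ["Toronto"]}
).set_index("customer_id")

# Type 1 handling: overwrite the old value in place; the previous city is lost
dim_customer.update(updates)
print(dim_customer.reset_index())
```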

Data projects generate value by transforming data. Sometimes, simply joining two datasets can uncover new insights. In other cases, cleansing data helps provide clearer metrics for the C-suite to make decisions. In nearly all cases, we are fully aware of the data sources being ingested and monitor them closely for changes, as modifications at the source can ripple into downstream pipelines. 
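As a toy illustration of the joining point, assume two hypothetical tables, orders and customers, living in separate systems; a simple join plus an aggregation surfaces a metric neither table holds on its own:

```python
import pandas as pd

# Hypothetical tables from two separate source systems
orders = pd.DataFrame(
    {"customer_id": [1, 1, 2, 3], "amount": [120.0, 80.0, 45.0, 300.0]}
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "region": ["West", "West", "East"]}
)

# Joining the two datasets uncovers an insight neither holds alone: revenue by region
revenue_by_region = (
    orders.merge(customers, on="customer_id", how="left")
    .groupby("region")["amount"]
    .sum()
)
print(revenue_by_region)  # East: 300.0, West: 245.0
```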

That’s where Large Language Models (LLMs) start to feel a bit strange. Unlike traditional data projects where the sources are known and controlled, with LLMs—especially frontier models like OpenAI’s ChatGPT or Anthropic’s Claude—we don’t have true visibility into the training data. If someone asked me to build a data project without transparency into the underlying data sources, I’d be very cautious. 

The Washington Post attempted to extrapolate what’s in some frontier models by analyzing Google’s C4 dataset: 

https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning

We can probably assume that most LLMs have read everything publicly visible on the internet—blogs, Reddit, YouTube transcripts, Wikipedia, and so on. The bigger question is whether they’ve read copyrighted books—and, if so, how can we even know? 

A while back, I was quite obsessed with making tofu from scratch. The process involved making my own soy milk, then coagulating it with nigari (a magnesium chloride solution made from seawater). I sourced nigari locally from Vancouver Island, and the manufacturer shared some tips on making tofu with it. After many attempts, and after rewatching plenty of tofu videos on YouTube, I finally got a recipe down. 

Recently, I had a craving to make Tau Foo Fah, a silky tofu dessert popular across Asia. Unlike Japanese tofu, this version uses a different coagulant—gypsum and cornstarch. The most traditional method I’ve seen involves dissolving the coagulant in water, pouring hot soy milk over it from a height, and letting it set. My results were always inconsistent, so I turned to ChatGPT: 

https://chatgpt.com/share/67ef2c2c-2728-8011-bd4b-b07973cb87e4

In the ChatGPT iOS app, I must have accidentally triggered the “deep research” feature—so instead of querying the regular model, I used a deep research credit. The model suggested the following method. 

  • Heat soy milk to ~185°F (85°C). 
  • Dissolve gypsum + cornstarch in a bit of water. 
  • Pour hot soy milk into the coagulant mix in a mason jar (don’t stir too much). 
  • Place jar on trivet in Instant Pot with water (pot-in-pot method). 
  • Steam on Low for 10 mins → natural release 10–15 mins. 
  • Let it rest undisturbed before opening. 

For the sake of experimentation, I tried two approaches in the Instant Pot: 

1. I mixed the coagulants into cold soy milk (1% cornstarch and 0.74% gypsum, by weight of soy milk; the gram amounts are sketched after this list), capped the mason jar, and steamed it on high for 10 minutes in the Instant Pot. 

2. Did what ChatGPT suggested, heating the soy milk to 185°F (85°C) first. 
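For reference, those percentages are relative to the weight of the soy milk. A quick sketch of the arithmetic, using an arbitrary 700 g batch purely for illustration:

```python
# Coagulant amounts expressed as a percentage of soy milk weight
# (the 700 g batch size is an arbitrary assumption, not from either method)
soy_milk_g = 700
cornstarch_g = soy_milk_g * 0.0100  # 1% cornstarch -> 7.0 g
gypsum_g = soy_milk_g * 0.0074      # 0.74% gypsum  -> 5.18 g

print(f"cornstarch: {cornstarch_g:.1f} g, gypsum: {gypsum_g:.2f} g")
```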

The results were vastly different. The ChatGPT method didn’t set properly. Looking closer at the deep research citations, I realized most were blogs, and the model had likely built its answer on those foundations. It also seemed to confuse traditional tofu-making techniques with those specific to Tau Foo Fah. 

This experience got me thinking more about the importance of data sources and data lineage when using LLMs. Here are a few questions that take a data-centric approach to thinking about them. 

1. Do you know what data sources are being referenced by the LLM? 

    It sounds basic, but it’s critical. LLMs are black boxes in many respects. Just as we wouldn’t integrate unknown data into a pipeline without thorough vetting, we shouldn’t treat LLM outputs as reliable without understanding what they’re based on. 

2. Does the subject of your prompt have high-quality representation in the model’s training data? 

    It helps to step back and consider the quality and coverage of data within the problem space. For coding-related queries, I assume the quality is high—thanks to well-defined language grammars and extensive examples on Stack Overflow and Reddit. 

    Other well-represented domains likely include medicine and law, given the large corpus of reliable reference material available. 

    But in areas like food, I’ve learned not to rely on ChatGPT as an authoritative source. Many great recipes are locked behind non-digitized books or restricted datasets that LLMs can’t legally access. 

3. Are you able to see data source citations in your answer? 

    When using a general model like GPT-4o, answers are often returned without citations. That leaves the burden of validation on you—the user. 

    For low-stakes questions like “What kind of plant is this?”, a wrong answer is annoying, but not disastrous. But if you ask, “How much revenue did we generate last month?” and use the response to make business decisions without verifying the source, the risk increases significantly. 

    Medical questions are even riskier. Making a health decision based on uncited LLM output is problematic. 

    Google’s NotebookLM stands out here. You can upload your own PDFs (e.g., data sources), and the model provides footnoted citations linking back to the original documents—much more traceable and reliable. 
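If you’re working through an API rather than NotebookLM, you can approximate that traceability by supplying your own vetted sources and asking the model to cite them. Below is a minimal sketch using the OpenAI Python SDK; the documents, IDs, and prompt are placeholders, and the answer is still only as good as the sources you pass in.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder documents standing in for sources you have actually vetted
sources = {
    "march_close": "March revenue was $1.2M, per the finance close report.",
    "feb_close": "February revenue was $0.9M, per the finance close report.",
}
context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in sources.items())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Answer using ONLY the sources below. Cite the source id in "
                "brackets after every claim. If the sources do not contain "
                "the answer, say so.\n\n" + context
            ),
        },
        {"role": "user", "content": "How much revenue did we generate in March?"},
    ],
)
print(response.choices[0].message.content)  # e.g. "March revenue was $1.2M [march_close]."
```

This doesn’t make the model authoritative, but it does make the answer checkable, which is the same property that makes NotebookLM appealing.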

4. Are you an expert in the domain you’re querying? 

    A senior software engineer might use ChatGPT to refine a coding solution, while a junior engineer may dangerously implement answers without understanding them. 

    In general, we need to be cautious about accepting LLM responses at face value. It’s tempting to delegate too much thinking to the machine, but we must not bypass our own analytical skills as we inevitably integrate these tools into our workflows.