A Data Engineering Perspective of LLMs

Data engineering is a field I would categorize as a subspecialty of software engineering. It shares the same concerns as software engineering—scalability, maintainability, and other “-ilities”—but its primary focus is on data. It’s a unique discipline because data is inherently messy, and as a result, no standard enterprise framework has emerged to dominate the space—and perhaps it never will. 

The complexity of a data project can often be measured by the variety and nature of its data sources. For example, a project involving high-volume Internet of Things (IoT) data is vastly different from one dealing with a structured database modeled with Slowly Changing Dimensions (SCD) Type 1. 

Data projects generate value by transforming data. Sometimes, simply joining two datasets can uncover new insights. In other cases, cleansing data helps provide clearer metrics for the C-suite to make decisions. In nearly all cases, we are fully aware of the data sources being ingested and monitor them closely for changes, as modifications can impact upstream pipelines. 

That’s where Large Language Models (LLMs) start to feel a bit strange. Unlike traditional data projects where the sources are known and controlled, with LLMs, especially frontier models like OpenAI’s ChatGPT or Anthropic’s Claude, we don’t have true visibility into the training data. If someone asked me to build a data project without transparency into the underlying data sources, I’d be very cautious. 

The Washington Post attempted to extrapolate what’s in some frontier models by analyzing Google’s C4 dataset: 

https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning

We can probably assume that most LLMs have read everything publicly visible on the internet—blogs, Reddit, YouTube transcripts, Wikipedia, and so on. The bigger question is whether they’ve read copyrighted books—and, if so, how can we even know? 

A while back, I was quite obsessed with making tofu from scratch.  The process involved making my own soy milk, then coagulating it with Nigari (a magnesium chloride solution made from seawater).  I sourced Nigari locally from Vancouver Island, where the manufacturer gave some tips on how to make it.  After many attempts and watching tofu videos on YouTube many times, I finally got a recipe down.   

Recently, I had a craving to make Tau Foo Fah, a silky tofu dessert popular across Asia. Unlike Japanese tofu, this version uses different coagulants: gypsum and cornstarch. The most traditional method I’ve seen involves dissolving the coagulant in water, pouring hot soy milk over it from a height, and letting it set. My results were always inconsistent, so I turned to ChatGPT. 

https://chatgpt.com/share/67ef2c2c-2728-8011-bd4b-b07973cb87e4

In the ChatGPT iOS app, I must have accidentally triggered the “deep research” feature—so instead of querying the regular model, I used a deep research credit. The model suggested the following method. 

  • Heat soy milk to ~185°F (85°C). 
  • Dissolve gypsum + cornstarch in a bit of water. 
  • Pour hot soy milk into the coagulant mix in a mason jar (don’t stir too much). 
  • Place jar on trivet in Instant Pot with water (pot-in-pot method). 
  • Steam on Low for 10 mins → natural release 10–15 mins. 
  • Let it rest undisturbed before opening. 

For the sake of experimentation, I tried two approaches in the Instant Pot: 

1. I mixed the coagulants into cold soy milk (1% cornstarch and 0.74% gypsum by weight of soy milk), capped a mason jar, and steamed it on high for 10 minutes in an Instant Pot. 

2. Followed ChatGPT’s suggestion, heating the soy milk to 185°F first. 

The results were vastly different. The ChatGPT method didn’t set properly. Looking closer at the deep research citations, I realized most were blogs, and the model had likely built its answer on those foundations. It also seemed to confuse traditional tofu-making techniques with those specific to Tau Foo Fah. 

This experience got me thinking more about the importance of data sources and data lineage when using LLMs. Here are a few questions that take a data-centric approach to thinking about them. 

1. Do you know what data sources are being referenced in the LLM? 
 

    It sounds basic, but it’s critical. LLMs are black boxes in many respects. Just as we wouldn’t integrate unknown data into a pipeline without thorough vetting, we shouldn’t treat LLM outputs as reliable without understanding what they’re based on. 

2. Does the question you’re asking have high-quality representation in the model’s training data? 

It helps to step back and consider the quality and coverage of data within the problem space. For coding-related queries, I assume the quality is high—thanks to well-defined language grammars and extensive examples on Stack Overflow and Reddit. 

    Other well-represented domains likely include medicine and law, given the large corpus of reliable reference material available. 

    But in areas like food, I’ve learned not to rely on ChatGPT as an authoritative source. Many great recipes are locked behind non-digitized books or restricted datasets that LLMs can’t legally access. 

    3. Are you able to see data source citations in your answer? 

    When using a general model like GPT-4o, answers are often returned without citations. That leaves the burden of validation on you—the user. 

    For low-stakes questions like “What kind of plant is this?”, a wrong answer is annoying, but not disastrous. But if you ask, “How much revenue did we generate last month?” and use the response to make business decisions without verifying the source, the risk increases significantly. 

    Medical questions are even riskier. Making a health decision based on uncited LLM output is problematic. 

    Google’s NotebookLM stands out here. You can upload your own PDFs (e.g., data sources), and the model provides footnoted citations linking back to the original documents—much more traceable and reliable. 

    4. Are you an expert in the domain you’re querying? 

    A senior software engineer might use ChatGPT to refine a coding solution, while a junior engineer may dangerously implement answers without understanding them. 

    In general, we need to be cautious about accepting LLM responses at face value. It’s tempting to delegate too much thinking to the machine, but we must not bypass our own analytical skills as we inevitably integrate these tools into our workflows. 

    State of Data Engineering 2025 Q1

    AI Updates

There is a lot of chatter about 2025 being the year of agentic frameworks. To me, this means a system in which AI models can take independent actions based on their environment, typically by interacting with external APIs or interfaces. 

    The terminology around this concept is still evolving, and definitions may shift in the coming months—similar to the shifting discussions around Open Table Formats. Below are some key takeaways from recent discussions on agentic systems: 

• Agents can take actions allowed by a given environment 
• Agents can be potentially dangerous if given the ability to perform write actions 
• Agent workflows can become very expensive if not constrained (e.g., excessive token usage) 
• Evaluating success of agent workflows is tricky 

1. Chip Huyen recently wrote a new book on AI Engineering and created a standalone blog post about agents: https://huyenchip.com//2025/01/07/agents.html 
2. Anthropic has explored agentic systems in their article: https://www.anthropic.com/research/building-effective-agents 

    When writing traditional software, it’s helpful to think in terms of determinism. In probability theory: 

• P(A) = 1 represents an outcome A that is guaranteed to occur (100%). 
• P(A) = 0 represents an outcome that will never occur (0%). 

    Similarly, code is deterministic—given the same input, it will always produce the same output. Business rules drive execution, either directly through the code or abstracted into configuration files. 

    However, agents break this determinism by introducing an element of decision-making and uncertainty. Instead of executing predefined logic, agents operate within a framework that allows for flexible responses based on real-time data. This can be powerful, as it enables systems to handle edge cases without explicitly coding for them. But it also raises evaluation challenges—how do we determine if an agent is performing well when its behavior isn’t fully predictable? 
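
To make the contrast concrete, here is a minimal Python sketch, not any particular framework’s API: `shipping_fee` is deterministic business logic, while `run_agent` lets a model (behind a hypothetical `call_llm` helper you would wire to your own client) choose its next action from a read-only tool allow-list under a hard step budget, which is one way to address the cost and write-action concerns listed above.

```python
from typing import Callable

# Deterministic business rule: the same input always yields the same output.
def shipping_fee(order_total: float) -> float:
    return 0.0 if order_total >= 100 else 9.99

# --- Agent sketch (illustrative only) ---------------------------------------
# `call_llm` is a hypothetical stand-in for whatever model client you use; it
# should return the name of ONE tool from the allow-list plus its argument,
# or ("finish", answer) when it is done.
def call_llm(prompt: str) -> tuple[str, str]:
    raise NotImplementedError("plug in your LLM client here")

# Read-only tool registry: the agent cannot perform write actions.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"status of {order_id}: shipped",
    "lookup_refund_policy": lambda _: "refunds allowed within 30 days",
}

def run_agent(question: str, max_steps: int = 3) -> str:
    """Let the model pick tools, but constrain what and how much it can do."""
    context = question
    for _ in range(max_steps):            # hard cap keeps token spend bounded
        tool_name, arg = call_llm(context)
        if tool_name == "finish":          # model signals it has an answer
            return arg
        if tool_name not in TOOLS:         # refuse anything outside the allow-list
            return f"Refused disallowed action: {tool_name}"
        context += "\n" + TOOLS[tool_name](arg)
    return "Step budget exhausted; no answer."
```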

Taking a data engineering perspective on this, I think the usefulness of agents will be directly proportional to the quality of the data sources available to them.

    • ChatGPT and similar LLMs have been useful primarily because they have likely ingested the entire Internet. Whether this has been done in full compliance with copyright laws is a separate issue.
    • More structured domains, such as coding, have thrived due to the constraints imposed by programming languages, as these languages are inherently built upon specific grammatical rules.

When customers add data sources to their agentic frameworks, I can only imagine that a lot of data cleansing and structuring will have to be done to make these systems successful.


    re:Invent Recap

Every year there are tons of announcements, and you can catch a recap of all the new services in this 126-slide PDF from AWS, courtesy of AWS Community Days.

    https://assets.community.aws/a/reCap-Summary-2024.pdf

    Data Highlights: 
     

• Glue 5.0 – be sure to use this latest version 

     

SageMaker Unified Studio

Another major launch during re:Invent was SageMaker Unified Studio. Capabilities include: 
     

    • Amazon SageMaker Unified Studio (public preview): use all your data and tools for analytics and AI in a single development environment. 
    • Amazon SageMaker Lakehouse: unify data access across Amazon S3 data lakes, Amazon Redshift, and federated data sources. 
    • Data and AI Governance: discover, govern, and collaborate on data and AI securely with Amazon SageMaker Catalog, built on Amazon DataZone. 
    • Model Development: build, train, and deploy ML and foundation models, with fully managed infrastructure, tools, and workflows with Amazon SageMaker AI. 
    • Generative AI app development: Build and scale generative AI applications with Amazon Bedrock. 
    • SQL Analytics: Gain insights with the most price-performant SQL engine with Amazon Redshift. 
    • Data Processing: Analyze, prepare, and integrate data for analytics and AI using open-source frameworks on Amazon Athena, Amazon EMR, AWS Glue, and Amazon Managed Workflows for Apache Airflow (MWAA). 

    https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai

    https://aws.amazon.com/blogs/big-data/simplify-data-access-for-your-enterprise-using-amazon-sagemaker-lakehouse

In the data field, there are two options: build vs. buy. AWS traditionally follows the build approach, offering tools for developers to create their own platforms. However, operationalizing data platforms is complex, and competitors like Databricks and Snowflake offer faster onboarding workflows with vertically integrated components. 

SageMaker Lakehouse shows potential as it attempts to bridge the gap between build and buy solutions. It offers robust management features within its UI, including CI/CD workflows similar to GitHub. However, given its early stage, I would approach this large platform by starting with your primary use cases and running a proof of concept before committing any production workloads.

However, my main complaint is that this service should never have been called SageMaker Unified Studio in the first place. Historically, SageMaker has been associated with the AI side of AWS, so it is easy to miss that the new portfolio now includes data engineering. If I had to do it all over, I would have called it:

1. AWS Data Unified Studio, which would include:
  • Data Engineering Workflows
  • SageMaker AI Workflows
  • Catalog Workflows

    Open Table Format Wars

The industry has been talking about the Open Table Format (OTF) wars for a while, and really the problem has been more political than technological. 

As a reminder, there are three major players in the space: Apache Hudi, Apache Iceberg, and Databricks Delta Lake. In recent months, AWS, Databricks, and Snowflake have begun to coalesce around allowing catalogs based on Apache Iceberg to query and write to each other. 

    At its core, Apache Iceberg is a set of specifications. In this case, both the AWS Glue Data Catalog and Databricks have implemented the Apache Iceberg REST contracts, enabling them to query each other. AWS facilitates this by exposing an external endpoint that other vendors can access. 
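
As a rough illustration of what implementing the REST contract buys you, here is a minimal sketch using the PyIceberg client; the endpoint URI, warehouse, and table names are placeholders, and the authentication properties will vary by vendor.

```python
from pyiceberg.catalog import load_catalog

# Any catalog that implements the Iceberg REST spec can be loaded the same way;
# only the connection properties change. URI and warehouse below are placeholders.
catalog = load_catalog(
    "remote_rest",
    **{
        "type": "rest",
        "uri": "https://example-catalog-endpoint/iceberg",  # vendor-exposed REST endpoint
        "warehouse": "analytics_warehouse",                  # placeholder
        # auth properties (OAuth tokens, SigV4, etc.) vary by vendor
    },
)

# From here, table discovery and reads look identical regardless of which
# engine or cloud actually hosts the data.
table = catalog.load_table("sales.orders")
print(table.schema())
```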

    This marks a significant shift in how we approach cross-platform and multi-cloud data querying. What’s puzzling is why there now seems to be broad industry alignment on this paradigm—but regardless, it’s a promising step forward for data engineering. Looking ahead, I wonder if we’ll eventually see true separation of compute and storage across clouds. 
     

    This use case is particularly relevant for large enterprises operating across multiple clouds and platforms. For smaller companies, however, it may not be as critical. 

What about Apache Hudi for cross-query federation? Hudi remains a significant platform, but I’m pessimistic about Apache XTable catching on. Still, if your use case involves federated querying across a broader ecosystem, consider this design pattern. 

    Keep in mind that cross-platform read and write capabilities are being rolled out gradually. Be sure to conduct extensive proof-of-concept (POC) testing to ensure everything functions as expected. 

Guidelines for data engineering technology selection

History is often a useful guide, and some are saying the Iceberg format will replace plain Parquet files. As an example, a nifty feature is hidden partitioning, where you can change your partitioning scheme without rewriting your physical files in storage. The tradeoff is that Iceberg is not the simplest of platforms; you still need to manage metadata and snapshots. 
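
As a concrete example, changing the partition layout in Iceberg is a metadata-only operation. A rough PySpark sketch follows; the catalog, warehouse, table, and column names are placeholders, and it assumes the table was previously partitioned by days(event_ts).

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar is on the classpath; all names are placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "glue")   # or hive / rest, per environment
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Switch from daily to hourly partitioning on event_ts.
# Existing data files are NOT rewritten; only new writes use the new layout,
# and readers keep pruning correctly because the partitioning is hidden.
spark.sql("ALTER TABLE my_catalog.db.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE my_catalog.db.events ADD PARTITION FIELD hours(event_ts)")
```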

    Here are some tips to guide your projects through a technology selection. 

Do you actually have big data? 

Big data is a hard term to define, but think of organizations with terabytes or petabytes of data. If you do qualify, you will most likely choose one of the three OTFs. 

    What ecosystem are you working in? 
A fair amount of consideration should be given to the ecosystem and technologies your team is already working in. 
     

    • What programming languages does your staff know? 
• Does your team have experience with any of the OTFs? 
       

    What are the dimensions you need to balance? 

There is a weird blending of data lakes and data warehouses now, where the lines aren’t clear because features exist in both. We aren’t going to get clear lines soon, so consider these items. 

• Cost at Scale 
  • After factoring in your project’s expected growth patterns, which solution will scale best on cost? For example, holding petabytes of data is probably better suited to a data lake than a data warehouse.  
    • Managed or Unmanaged AWS Service? 
      • It’s important to recognize that managed services, such as AWS Glue or MWAA, are not a one-size-fits-all solution. While they simplify operations, they also abstract away fine-tuning capabilities. Managed services often come at a higher cost but reduce operational complexity. When evaluating them, consider both infrastructure costs and the time and effort required for ongoing management. 
  • If your team has the engineering resources and expertise to manage an unmanaged platform, it may be a cost-effective choice at scale. If your project consists of a small team with limited bandwidth, opting for a managed AWS service can help streamline operations and reduce overhead. 

    Some members of the Hudi team argue that Open Table Formats aren’t truly “open.” While they raise a valid point, vendor lock-in hasn’t been a major concern for most projects. After all, choosing AWS or Databricks also involves a level of vendor lock-in. As a result, the arguments around openness may resonate more with a niche audience rather than the broader industry. 

    https://thenewstack.io/your-data-stack-is-outdated-heres-how-to-future-proof-it

    Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality 

This is a really intriguing pattern for data quality. It enables you to test your data in a staging area and then commit it. The pattern is not necessarily new, but Iceberg’s branching feature offers an easier way of doing it; a rough sketch follows the links below.

    https://aws.amazon.com/blogs/big-data/build-write-audit-publish-pattern-with-apache-iceberg-branching-and-aws-glue-data-quality

    https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/
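
Here is a rough PySpark sketch of the write-audit-publish flow on an Iceberg branch. It assumes an Iceberg-enabled SparkSession (as configured in the earlier sketch) and an `incoming_df` batch to land; catalog, table, and branch names are placeholders, and the audit step, which the AWS post implements with Glue Data Quality, is stubbed out here as a simple null-key check.

```python
CATALOG = "my_catalog"   # placeholder
TABLE = "db.orders"      # placeholder
BRANCH = "audit_branch"

# 1. Write: land the new data on a branch instead of main.
spark.sql(f"ALTER TABLE {CATALOG}.{TABLE} SET TBLPROPERTIES ('write.wap.enabled'='true')")
spark.sql(f"ALTER TABLE {CATALOG}.{TABLE} CREATE BRANCH IF NOT EXISTS {BRANCH}")
spark.conf.set("spark.wap.branch", BRANCH)          # subsequent writes go to the branch
incoming_df.writeTo(f"{CATALOG}.{TABLE}").append()  # incoming_df: the batch you just ingested

# 2. Audit: run quality checks against the branch only.
bad_rows = spark.sql(
    f"SELECT COUNT(*) AS c FROM {CATALOG}.{TABLE} VERSION AS OF '{BRANCH}' "
    "WHERE order_id IS NULL"
).first()["c"]
assert bad_rows == 0, "audit failed: null keys on branch"

# 3. Publish: fast-forward main to the audited branch so consumers see the data.
spark.sql(f"CALL {CATALOG}.system.fast_forward('{TABLE}', 'main', '{BRANCH}')")
spark.conf.unset("spark.wap.branch")
```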

    Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg 

    https://aws.amazon.com/blogs/big-data/implement-historical-record-lookup-and-slowly-changing-dimensions-type-2-using-apache-iceberg

Building a transactional data lake using Amazon Athena, Apache Iceberg, and dbt 
     

A case study on how the UK Ministry of Justice saved quite a bit of money by switching over to Iceberg using dbt. 
     
    https://ministryofjustice.github.io/data-and-analytics-engineering/blog/posts/building-a-transaction-data-lake-using-amazon-athena-apache-iceberg-and-dbt/ 

     
    Expand data access through Apache Iceberg using Delta Lake UniForm on AWS 

    Kind of a strange article, but there are still some customers using UniForm.  With it, you can read Delta Lake tables as Apache Iceberg tables. 

    https://aws.amazon.com/blogs/big-data/expand-data-access-through-apache-iceberg-using-delta-lake-uniform-on-aws

    Amazon S3 Tables 

    One of the most significant announcements in the data space was Amazon S3 Tables. A common challenge with data lakes is that querying data in S3 can be slow. To mitigate this, techniques like compaction and partitioning are used to reduce the amount of data scanned. 

    Amazon S3 Tables address this issue by allowing files to be stored in S3 using the native Apache Iceberg format, enabling significantly faster query performance. 

    https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads

    https://aws.amazon.com/blogs/storage/how-amazon-s3-tables-use-compaction-to-improve-query-performance-by-up-to-3-times

    https://community.sap.com/t5/technology-blogs-by-members/new-amazon-s3-iceberg-tables-made-us-drop-our-parquet-library-in-abap/ba-p/13971752

However, there is a fair amount of criticism from competitors that this service carries a substantial cost markup, so be sure to read the S3 Tables pricing page. 

    https://aws.amazon.com/s3/pricing

    Redshift

Redshift History Mode – Redshift has released a new feature: if you have a zero-ETL integration enabled, you can turn on Slowly Changing Dimensions (SCD) Type 2 out of the box. 

Historically, our pattern was: 

    • DMS –> S3 (SCD Type 2) –> Spark job to reconcile SCD Type 1 or 2 

    Now we can do 

    • Zero ETL Redshift Replication –> Access SCD Type 2 tables 

This is a relatively new feature, so please do comprehensive testing; a hypothetical point-in-time query against an SCD Type 2 history table is sketched below. 

    https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-history-mode.html
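
Whichever pattern produces the history table, consumption is a point-in-time lookup. The sketch below is generic: the `valid_from` / `valid_to` column names are hypothetical, and Redshift history mode exposes its own system columns, so map them to whatever your integration actually emits.

```python
# Point-in-time lookup against a generic SCD Type 2 history table.
# Column and table names are hypothetical placeholders.
AS_OF = "2025-01-31 23:59:59"

point_in_time_sql = f"""
SELECT customer_id, plan, valid_from, valid_to
FROM analytics.customers_history
WHERE valid_from <= TIMESTAMP '{AS_OF}'
  AND (valid_to > TIMESTAMP '{AS_OF}' OR valid_to IS NULL)  -- NULL marks the current row
"""

# Run with your Redshift client of choice (e.g., redshift_connector or psycopg).
print(point_in_time_sql)
```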

    Amazon Redshift now supports incremental refresh on Materialized Views (MVs) for data lake tables 
    https://aws.amazon.com/about-aws/whats-new/2024/10/amazon-redshift-incremental-refresh-mvs-data-lake-tables/ 

    Apache Hudi Updates 

Apache Hudi reached a major milestone with its 1.0.0 release in December, which includes several new features. 

A notable new feature allows users to create indexes on columns other than the Hudi record key, enabling faster lookups on non-record-key columns within the table (a rough sketch follows the links below). 

    https://hudi.apache.org/blog/2024/12/16/announcing-hudi-1-0-0/#secondary-indexing-for-faster-lookups

    https://www.onehouse.ai/blog/accelerating-lakehouse-table-performance-the-complete-guide
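
Per the 1.0.0 release announcement, the secondary index is created through Spark SQL. A rough sketch follows; table and column names are placeholders, it assumes a Hudi-enabled SparkSession (hudi-spark bundle plus the Hudi SQL extensions), and the exact syntax may shift between releases, so confirm against the Hudi docs.

```python
# Table and column names are placeholders.
spark.sql("""
    CREATE INDEX city_idx ON hudi_orders
    USING secondary_index(city)
""")

# Lookups filtered on the indexed, non-record-key column can now prune
# file groups instead of scanning the whole table.
spark.sql("SELECT * FROM hudi_orders WHERE city = 'Vancouver'").show()
```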

    Glue

    Glue 5.0 

A newer version is out with faster runtimes, so be sure to use it in your projects. 

    https://aws.amazon.com/about-aws/whats-new/2024/12/aws-glue-5-0

    New Glue Connectors 

There are now 16 new Glue connectors: Adobe Analytics, Asana, Datadog, Facebook Page Insights, Freshdesk, Freshsales, Google Search Console, LinkedIn, Mixpanel, PayPal Checkout, QuickBooks, SendGrid, Smartsheet, Twilio, WooCommerce, and Zoom Meetings. 

    https://aws.amazon.com/about-aws/whats-new/2024/12/aws-glue-16-native-connectors-applications

Glue Compaction with MoR Tables 

    https://aws.amazon.com/blogs/big-data/accelerate-queries-on-apache-iceberg-tables-through-aws-glue-auto-compaction

Monitor and Manage Data Quality with Glue (YouTube video) 


    State of Data Engineering Q3 2024

Here is this quarter’s state of data engineering newsletter. There is only a little chat about AI this time; the focus is on Open Table Formats, the Apache Iceberg REST spec, and new updates in the Amazon data engineering ecosystem.

    Prompt Engineering – Meta Analysis Whitepaper


    One of my favorite AI podcasts, Latent Space, recently featured Sander Schulhoff, one of the authors of a comprehensive research paper on prompt engineering. This meta-study reviews over 1,600 published papers, with co-authors from OpenAI, Microsoft, and Stanford. 
     
    [podcast] 
    https://www.latent.space/p/learn-prompting 

     
    [whitepaper] 
    https://arxiv.org/abs/2406.06608 

     
The whitepaper is an interesting academic deep dive into prompting: how to increase quality through exemplars (examples provided to the LLM) without providing too many (more than 20 hurts quality), plus stranger findings, such as using Minecraft agents as a testbed for understanding how this ecosystem works. 
     
Other practical tips are given as well: asking for output in JSON or XML is generally more accurate, and results are better when the requested format matches what appears in the LLM’s training data. That does lead to a problem if you don’t know what the training data is based on. 
     
It also covers a wide range of techniques beyond the two we use most often: chain of thought, which prompts the model to reason through intermediate steps, and retrieval augmented generation (RAG). 
     
We can expect that in the next couple of years LLMs themselves will integrate these workflows so we no longer have to think about them, but if you want to squeeze out better performance today, this paper is worth reading. 
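
As a small illustration of two of those tips, a handful of exemplars plus a request for structured output, here is a sketch using the OpenAI Python client; the model name and the classification task are placeholders of my own, not something taken from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few exemplars (few-shot examples) showing the exact JSON shape we want back.
messages = [
    {"role": "system", "content": "Classify support tickets. Reply with JSON only."},
    {"role": "user", "content": "Ticket: 'My invoice is wrong'"},
    {"role": "assistant", "content": '{"category": "billing", "urgent": false}'},
    {"role": "user", "content": "Ticket: 'Production database is down!'"},
    {"role": "assistant", "content": '{"category": "outage", "urgent": true}'},
    # The real question comes last; the exemplars anchor the format and labels.
    {"role": "user", "content": "Ticket: 'I cannot reset my password'"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",                        # placeholder model name
    messages=messages,
    response_format={"type": "json_object"},    # nudge the model toward valid JSON
)
print(response.choices[0].message.content)
```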

    Open Table Format Wars – Continued

As a quick refresher, the history of data engineering goes roughly like this in 30 seconds:

1. 1980s – big data warehouses exist; SQL is the lingua franca
2. 2000s – the Apache Hadoop ecosystem emerges to address the limitations of data warehouses in coping with data size and processing
3. 2010s – data lakes emerge, where data lives in cloud storage (e.g., Amazon S3)
4. 2020ish – data lakehouses, or transactional data lakes, emerge to address data lakes’ lack of ACID guarantees
5. 2023 – consensus emerges around the term Open Table Format (OTF), with three contenders:
  • Apache Hudi
  • Databricks Delta Lake
  • Apache Iceberg
6. Mid-2024:
  • June 3, 2024 – Snowflake announces the Polaris catalog with support for Apache Iceberg
  • June 4, 2024 – Databricks buys Tabular (thereby bringing in the founders of Apache Iceberg)

    Historically, we see a major shift in technology about every 20 years, with older systems being overhauled to meet new paradigms. Consider the companies that fully embraced Apache Hadoop in the 2000s—they’re now in the process of rebuilding their systems. Right now we are in the middle of the maturing of open table formats.

Data has always been challenging to deal with because it is inherently messy, and moving data from one system to another seems simple but is quite a bit of work; as we know, most ETL is rarely straightforward once you take into account SLAs, schema changes, data volumes, etc.

OTFs really matter when we deal with big data, and especially extremely large data (think Uber- or Netflix-sized data). Databases can usually handle the smaller end of that spectrum without a problem, but break down at the largest scales.

    When working with your data platform, these are key questions you should be asking to help in refining your technology stack.

• How much data is being processed (hundreds of gigabytes, terabytes, or petabytes)?
• What SLA does the data need to be queried under?
• What is the existing data footprint in your organization (are you using a lot of MySQL, Microsoft SQL Server, etc.)?
• Does the organization have the capability to own the engineering effort of an OTF platform?
• Do any of the customer’s data sources work with Zero-ETL (like Salesforce, Aurora MySQL/PostgreSQL, or RDS)?
• Is the customer already using Databricks, Hudi, Snowflake, Iceberg, Redshift, or BigQuery?

    The Future: Interoperability via the Apache Iceberg Catalog API

Apache Iceberg, which emerged from Netflix, has been making a lot of news lately. From the acquisition of Tabular (essentially the team that founded Iceberg), to Snowflake open sourcing the Polaris catalog, to Databricks support in private preview, many signs point to a more cross-compatible future if certain conditions are met.

    In this article

    https://www.snowflake.com/en/blog/introducing-polaris-catalog/

There is a pretty important diagram showing cross-compatibility across AWS, Azure, and Google Cloud. We aren’t there yet, but if all three vendors move toward implementing the Apache Iceberg REST Catalog API spec, cross-federated querying will become possible.

I’m hopeful, because ETL’ing data from one place to another has always been a huge hassle. This type of future opens up interesting workloads where compute can be separated even from your cloud.

Everything is a little strange to me, because moving toward this future isn’t really a technology problem but more of a political one: each cloud has to choose to move in that direction. We are getting signs, but I would say by this time next year we will know the intentions of all the players. Meanwhile, stay tuned.


    New emerging technology: DuckDB

DuckDB was created in 2018 and is a fast, in-process analytical database. There is a hosted, serverless version called MotherDuck. DuckDB takes a different approach in that you can run analysis on a large dataset either via the CLI or your favorite programming language, with the compute running close to your application itself.


    Article: Running Iceberg and Serverless DuckDB in AWS

    https://www.definite.app/blog/cloud-iceberg-duckdb-aws

The article shows DuckDB querying Iceberg tables stored in S3, and as an alternative describes deploying DuckDB in a serverless environment on ECS with custom containers invoked via HTTP requests.

I expect AWS to take more notice and integrate DuckDB into its ecosystem in the next couple of years. 

    ChatGPT even has a DuckDB analyst ready

    https://chatgpt.com/g/g-xRmMntE3W-duckdb-data-analyst

    Use Cases:

• Say you have a lot of log data on EC2. Typically, you would load it into S3 and query it via Athena. Instead, you could keep the data on EC2 and run a DuckDB instance there, where you can explore it without penalty (a minimal sketch follows this list).
• Preprocessing and pre-cleaning of user-generated data for machine learning training
• Any type of system that previously used SQLite
• Exploration of any dataset that fits on your laptop – this one is a no-brainer.
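
For the log-exploration use case above, a minimal sketch; the file paths and column names are placeholders.

```python
import duckdb

# Query raw log files in place -- no cluster, no load step.
# File paths are placeholders; DuckDB reads CSV/JSON/Parquet directly.
con = duckdb.connect()   # in-memory database

top_errors = con.sql("""
    SELECT status, COUNT(*) AS hits
    FROM read_csv_auto('/var/log/app/access-*.csv')
    WHERE status >= 500
    GROUP BY status
    ORDER BY hits DESC
""").df()                 # hand the result straight to pandas

print(top_errors)
```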

Iceberg Updates



    [Article]: The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables


TLDR: The AWS Glue Data Catalog can now automatically expire Iceberg snapshots and remove orphan files
    https://aws.amazon.com/blogs/big-data/the-aws-glue-data-catalog-now-supports-storage-optimization-of-apache-iceberg-tables/

    Amazon previously tackled the issue of small file accumulation in Apache Iceberg tables by introducing automatic compaction. This feature consolidated small files into larger ones, improving query performance and reducing metadata overhead, ultimately optimizing storage and enhancing analytics workloads.

    Building on this, Amazon has now released a new feature in AWS Glue Data Catalog that automatically deletes expired snapshots and removes orphan files. This addition helps control storage costs and maintain compliance with data retention policies by cleaning up unnecessary data, offering a more complete solution for managing Iceberg tables efficiently.

    [Feature]: Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog

TLDR: If you want faster SLAs on Iceberg tables, run the table statistics feature for a potential 24% to 83% improvement in query time

    https://aws.amazon.com/blogs/big-data/accelerate-query-performance-with-apache-iceberg-statistics-on-the-aws-glue-data-catalog

    Column-level statistics are a method for enhancing the query performance of Iceberg tables in Amazon Redshift Spectrum and Athena. These statistics are based on the Puffin file format and allow query engines to optimize SQL operations more effectively. You can enable this feature via the AWS Console or by running an AWS Glue job. According to the performance data in the blog post, improvements range from 24% to 83%, simply by running a job to store metadata.

    Summary:

    • Use this if you have large datasets and need consistent query performance. Small datasets may not benefit enough to justify the effort.
    • Be aware of the overhead involved in running and maintaining statistics jobs.
    • Since data will likely change over time, you should set up automated jobs to periodically regenerate the statistics to maintain performance gains. While manual effort is required now, this feature could be more integrated into the platform in the future.

    [Article]: Petabyte-Scale Row-Level Operations in Data Lakehouses
    Authors: Apache Foundation, Apple Employees, Founder of Apache Iceberg

    TLDR: If you need to do petabyte scale row level changes, read this paper.
    https://www.vldb.org/pvldb/vol17/p4159-okolnychyi.pdf

We have rarely run into the scale that requires petabyte-level row changes, but the paper details strategies built on the following techniques (a small configuration sketch follows the list):


        
• Eager Materialization – Rewrites entire data files when rows are modified; suitable for bulk updates. (Hudi equivalent: Copy-on-Write (COW); Databricks equivalent: Data File Replacement)
• Lazy Materialization – Captures changes in delete files, applying them at read time; more efficient for sparse updates. (Hudi equivalent: Merge-on-Read (MOR); Databricks equivalent: Delete Vectors)
• Position Deletes – Tracks rows for deletion based on their position within data files. (Databricks equivalent: Delete Vectors)
• Equality Deletes – Deletes rows based on specific column values, e.g., row ID or timestamp. (Databricks equivalent: Delete Vectors)
• Storage-Partitioned Joins – Eliminates shuffle costs by ensuring data is pre-partitioned based on join keys. (Databricks equivalent: Low Shuffle MERGE)
• Runtime Filtering – Dynamically filters out unnecessary data during query execution to improve performance. (Databricks equivalent: Runtime Optimized Filtering)
• Executor Cache – Caches delete files in Spark executors to avoid redundant reads and improve performance.
• Adaptive Writes – Dynamically adjusts file sizes and data distribution at runtime to optimize storage and prevent skew.
• Minor Compaction – Merges delete files without rewriting the base data to maintain read performance. (Hudi equivalent: Compaction in MOR)
• Hybrid Materialization – Combines both eager and lazy materialization strategies to optimize different types of updates.
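
In Iceberg terms, the eager-versus-lazy choice above is largely a table-property decision. A rough Spark SQL sketch follows; the table name is a placeholder and it assumes an Iceberg-enabled SparkSession as configured in the earlier sketch.

```python
# Lazy materialization: deletes/updates are written as delete files and applied
# at read time (merge-on-read), which suits sparse row-level changes.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# Eager materialization: rewrite whole data files on change (copy-on-write),
# which is usually better for bulk updates and read-heavy tables.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'copy-on-write',
        'write.update.mode' = 'copy-on-write',
        'write.merge.mode'  = 'copy-on-write'
    )
""")
```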

    The paper also half reads as a marketing paper for Iceberg, but the interesting aspect is that half of the authors are from Apple.  One of the authors of that paper also made this video on how Apache Iceberg is used at Apple.

    Video:
    https://www.youtube.com/watch?v=PKrkB8NGwdY

    [Article]: Faster EMR 7.1 workloads for Iceberg

TLDR: EMR 7.1 runs faster on its customized Spark runtime on EC2

    https://aws.amazon.com/blogs/big-data/amazon-emr-7-1-runtime-for-apache-spark-and-iceberg-can-run-spark-workloads-2-7-times-faster-than-apache-spark-3-5-1-and-iceberg-1-5-2

    This article essentially serves as marketing for Amazon EMR, but it also demonstrates the product team’s commitment to enhancing performance with Apache Iceberg. It’s a slightly curious comparison, as most users on AWS would likely already be using EMR rather than managing open-source Spark on EC2. Nevertheless, the article emphasizes that EMR’s custom Spark runtime optimizations are significantly faster than running open-source Spark (OSS) on EC2.

    1. Optimizations for DataSource V2 (dynamic filtering, partial hash aggregates).
    2. Iceberg-specific enhancements (data prefetching, file size-based estimation).
    3. Better query planning and optimized physical operators for faster execution.
    4. Integration with Amazon S3 for reduced I/O and data scanning.
    5. Java runtime improvements for better memory and garbage collection management.
    6. Optimized joins and aggregations, reducing shuffle and join overhead.
    7. Increased parallelism and efficient task scheduling for better cluster utilization.
    8. Improved resource management and autoscaling for cost and performance optimization.

    [Article]: Using Amazon Data Firehose to populate Iceberg Tables
    TLDR: Use this technique if you might need Iceberg tables from the raw zone for streaming data and you need ACID guarantees

     https://www.tind.au/blog/firehose-iceberg/

Recently, a sharp-eyed developer spotted an exciting new feature in a GitHub changelog: Amazon Data Firehose now has the ability to write directly to Iceberg tables. This feature could be hugely beneficial for anyone working with streaming data who needs ACID guarantees in their data lake architecture.

    Warning: This feature isn’t production-ready yet, but it’s promising enough that we should dive into how it works and how it simplifies the data pipeline.

An Interesting Future: Example of Iceberg being queried from Snowflake and Databricks

Randy Pitcher from Databricks shows an example of how an Iceberg table created in Databricks is queried from Snowflake. As mentioned earlier, the chatter is that not all vendors have fully implemented the Catalog API spec (yet), but once this matures around 2026, expect querying data across clouds to be possible.

    https://www.linkedin.com/posts/randypitcherii_snowflake-is-killing-it-with-their-iceberg-ugcPost-7239751397779419136-z1ue

    Redshift Updates

    Major updates for Zero ETL

    All Other AWS Updates:

    Other