Rest in Peace Dad

Around US Thanksgiving, my mom let me know that my dad had been diagnosed with stage 4 colon cancer with a life expectancy of 1-2 years. In mid December my dad was hospitalized and one of the doctors suggested all family members should come urgently. When I arrived, the days consisted of going back and forth to the hospital and I commandeered a corner of the cafeteria as my remote office.

I think we all have different ways of dealing with stress, and my routine was prayer, hitting the gym, and doing yoga at night like clockwork. The first days were overwhelming with uncertainty, but this routine helped me stay focused on the support tasks that needed to be done each day.

Towards the end of the hospital stay, the doctors stated that there wasn’t much more they could do. They revised his life expectancy to weeks to months and suggested that the best course of action was hospice care.

My mom and I had a meeting with the hospice staff, who explained what hospice care entails. Compared to regular medical care, which aims to save a life, hospice care aims to provide comfort and prepare for a patient’s end of life within a couple of months. Since my dad had been diagnosed with a terminal condition and a prognosis of less than 6 months, he was eligible for the care via insurance.

As I’m sure no one is surprised, my dad hated being in the hospital. The hardest thing to see was him exclaiming in English and Vietnamese, “Ba muốn đi về (I want to go home).” There were some serious medical complications preventing him from coming home, but fortunately, one of the doctors managed to put together a treatment plan that enabled him to improve just enough to leave the hospital. One of his last prayers and wishes was to be at home. Thankfully, throughout the ordeal he didn’t have any pain in the hospital.

When he was discharged, we got him set up at home successfully with a patient bed and an oxygen machine provided by hospice care. However, it was then that we realized the magnitude of care needed. Now we would need to take care of my dad 24/7, as the cancer had robbed him of his independence. By a fortunate turn of events, God in his good graces lined up a caretaker, a contact from my mom’s old home church, to help watch him at night. Without that caregiver, everyone would have been exhausted to the point of feeling like zombies.

All kinds of questions began to arise, requiring us to adapt quickly. How would he communicate? How would we monitor him? One of the most low tech, but successful things we got was one of those bells that you ding when your order is ready at the diner. Another was a baby monitor where we could see him when we weren’t in the room.

The first couple days were okay, where the new sounds of the house consisted of the whir of an oxygen machine to support his lungs, and an occasional ring from my dad requesting some type of service. It was kind of cute in the beginning, like a customer asking for some food or water. Things kind of seemed normal, where he would read the news on the iPad and even have short conversations with us.

Meanwhile, my family was having discussions about finances: the implications of paying for a night caretaker if this lasted weeks or months, and what financial thresholds a family can bear.

I think as a society we don’t talk enough about end of life and what is a good way to die? When a parent isn’t able to take care of themselves, what do we do? How much do we pay? Who is going to take care of the person? What kind of hardships would be spread amongst family? Do you want to be there to witness last moments? It seems cruel to equate finances in context of one’s life, but it is an important topic to broach.

When hospice care is at home, there is an unfair burden placed on the caregiver, as they are expected to manage medication and balance a person’s comfort against their lucidity. Each day felt like a set of impossible choices. Administering medication for comfort often results in sedation, while withholding it can lead to suffering.

I give my mom a lot of credit for having numerous conversations with my dad about advance directives and his end-of-life wishes, ensuring that the family had clear expectations about the path to be taken.

As the days progressed, one of the nurses noticed his breathing and said he was struggling. We had a frank conversation about what it looks like when a person is about to die. She warned us that a common pattern is for people to have a day when they seem completely normal, with a burst of energy, and then crash quickly.

In the first week of January he passed away in the evening, peaceful and comfortable.

Sometimes, we reflect on this situation and ask where God was in these moments, why he wasn’t healed, and why a life expectancy of 1-2 years shortened to just weeks.

My approach to prayer is to ask for a specific outcome, such as healing, but if it doesn’t occur, I trust in God’s grand plan regarding life and death.

Throughout this ordeal, there have been many small blessings. First off, his wish and desire to go home were fulfilled, and the last medical treatment plan enabled him to improve enough to leave the hospital.

The second blessing was having a caregiver to cover the nights, starting from the first night after his discharge, allowing my mom to get some sleep. We were panicking when he got home because I knew my mom was not in a condition to stay up all night.

I’ve heard that losing a parent is one of the hardest experiences a person can go through. I’m still processing the loss, but surprisingly, I don’t feel a sense of guilt. By this, I mean that while he was healthy, we, including my mom and partner, spent a lot of time traveling together and had a good relationship. However, it doesn’t mean that there isn’t pain in my heart as I wish there were 10-15 more years to enjoy with him.

In 2014, my partner and I went to Mexico with my parents, and this was the first international trip I took with them as an adult. We had tons of adventures where, uh, I literally got the car stranded in the middle of nowhere driving to a snorkeling spot in Mexico, and I was surprised at his grace because my dad was super chill about it. There was another trip to Mexico where, uhh, the car overheated when we went to a mountain town (buyer beware: if you ever travel with me, expect some shenanigans). And most recently, my parents had always wanted to go to Europe, so we went to Italy just this past June. The trip was successful in my eyes because a) they didn’t get lost and b) they didn’t get robbed. My dad was usually quiet, but as we took some private tours through Rome, I was surprised at how inquisitive he was about our surroundings, Roman culture, and life there.

Since 2014 I have been intentional about traveling with my parents as much as possible, as I knew that at some point they would not be physically able to travel due to mobility issues. But never in my wildest dreams did I expect this adventuring to be cut short by a fast-moving cancer.

The loss of a parent is strange, and grief comes in waves. It’s not like the world stops, but there are certain intense memories when I reflect that cause tears to fall. It is a balance of a completely normal day, then a realization you no longer can say “my parents.”

The days after a loved one passes involve immediate grief to be processed and the need to support family members, but also the huge logistical task of planning a funeral. Oddly enough, it is no different than planning a party.

I do have to give credit to my mom, as she had an inkling of the seriousness of my dad’s condition, so she already bought a funeral package near her house. This significantly alleviated our stress, as the costs were covered and the arrangements mostly preselected.

We went to the funeral home the day after my dad’s passing, encountering a surreal experience akin to buying a car. There was the base package that was already taken care of, but if you want, you can pay more to upgrade to a fancier casket, or pay more for a fancier box for the ashes. Fortunately, the staff had the sensitivity to inquire about upgraded packages, but not to push anything.

The funeral staff also had an odd warning for us: we might get calls from scammers posing as a funeral home to verify information. A couple of days later, there were actually a couple of voicemails on my dad’s cell phone from fake funeral homes asking to verify information. A part of me wanted to call them back just to see where the scam would go, but I dropped the issue. Googling around, this is actually quite a big issue:

https://www.ftc.gov/business-guidance/blog/2023/06/scammers-impersonate-funeral-home-staff-prey-mourning-families-can-it-get-any-lower

It makes me wonder: when someone dies, how exactly do these people get my dad’s cell phone number? I wonder if there is a black market for death records somewhere on the dark web. It is quite sad that people would prey on others at their most vulnerable.

The reality is that when someone passes away, financial considerations come into play. One option presented was to keep the ashes at the funeral home, which would cost an additional $4,000. My mom thought about this briefly, but since my dad had expressed his wish for his ashes to be returned to Vietnam, we realized that amount of money could essentially cover a trip there.

I was responsible for creating the memorial photo slideshow during the service and wanted to give a slideshow of dad throughout the decades. Fortunately, both my mom’s and my Google Photos were active, allowing me to gather photos via image auto-tagging. I then wrote a Python script to rename the files by date for chronological organization and added date-time stamps on the bottom right of each image.
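For the technically curious, here is a minimal sketch of that kind of renaming step. It is not the original script: the function names `dated_name` and `plan_renames` are hypothetical, the capture dates are assumed to be already known (a real script would read them from photo metadata), and the date-stamp overlay is left out because it needs an imaging library.

```python
from datetime import datetime
from pathlib import Path


def dated_name(original: str, taken: datetime, index: int) -> str:
    """Build a sortable filename such as 2009-12-25_0001.jpg."""
    suffix = Path(original).suffix.lower()
    return f"{taken:%Y-%m-%d}_{index:04d}{suffix}"


def plan_renames(photos: dict[str, datetime]) -> list[tuple[str, str]]:
    """Return (old_name, new_name) pairs ordered from oldest to newest."""
    ordered = sorted(photos.items(), key=lambda item: item[1])
    return [
        (name, dated_name(name, taken, i))
        for i, (name, taken) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical capture dates, used only for illustration.
    sample = {
        "IMG_0420.JPG": datetime(2014, 3, 2),
        "scan_07.png": datetime(1998, 7, 15),
    }
    for old, new in plan_renames(sample):
        print(old, "->", new)
```

Planning the renames first and applying them in a second step makes it easy to eyeball the order before any file is touched.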

At home, my dad, and as I later learned his brother in Vietnam, were both significant packrats. One theory suggests that growing up in conditions of scarcity may lead to hoarding as a protective mechanism.

Over several days, I sifted through my dad’s stuff, finding a collection of old cables, cell phones, old laptops, and random trinkets, until I discovered an old MiniDV camcorder and about 30 tapes. After locating the correct power adapter, I played the tapes back and found that they spanned 2007-2010. During that time, my dad had just set up the camcorder and recorded special occasions with the camera sitting there on a tripod.

While creating the slideshow, I felt a pang of sadness at having many photos, but few videos of him. However, this discovery filled that gap with raw footage of him interacting and talking with family – precisely the memories I longed for.

In today’s society there is a strong craving for the perfect ‘Instagram’ photo, a trend that I have fallen for too. However, I’ve come to realize the most important media are the ones that capture the raw authenticity of a person without filters or edits. The videos of dad just walking around and doing mundane stuff really have brought me the most joy. Maybe it is because I am afraid that as the days and years go by I might forget what he was like, his speech, and his mannerisms.

The next puzzle was how to digitize such an ancient format, as the only connection on the camcorder was FireWire. After a bit of googling, I bought a PCIe FireWire card for an old Windows desktop at home and managed to digitize all the videos after a lot of fiddling. I captured them first as .avi, then converted them to H.265, which is a newer video codec.

There was a 1.5-hour video that my dad recorded at Christmas 2009. The video was just of us eating and opening gifts. Maybe because of smartphones, the whole idea of setting up a tripod and recording for hours during an event isn’t too popular, but maybe this is a tradition worth reviving.

Dealing with the grief has been tricky as we don’t have many playbooks in life to learn about this. However, there are two things that have stood out to me which were helpful.

When I saw a friend after the passing of my father he asked me, “Do you want a normal day, or do you want to talk about it?” I never really thought about it, but as the person dealing with grief, you do want to control the narrative of how your day goes. Some days I want to talk about it, some days I don’t.

A friend sent me a text message and said, “As much as you are there supporting family, don’t forget to take time to grieve for yourself.”

Another unexpected blessing, and something to consider with elderly parents, is their online accounts and access to their e-mail. I had fortunately set myself up as the two-factor authentication backup, so I could log in to my dad’s e-mail to get access to important documents. Having all his phone PIN codes so I could log in was also helpful, as some apps were sending SMS messages to log in.

The pamphlet “Gone From My Sight – A Dying Experience”, given to us when my dad entered hospice care, was in retrospect eerily accurate about the months, days, and hours leading up to the end of life. https://www.amazon.com/Gone-My-Sight-Dying-Experience/dp/B00072HSCY

When I encounter people I don’t often see, I briefly mention the major news, talking about it for a minute or two, before shifting the topic. I feel it’s important for them to be aware of this change in my life, but at the same time, I’m conscious of not letting it dominate our entire conversation.

When he was diagnosed with prostate cancer the first time around and beat it, he was talking to me about this verse and how he enjoyed it.

Ecclesiastes 3

A season for everything

3 There’s a season for everything
    and a time for every matter under the heavens:
²    a time for giving birth and a time for dying,
    a time for planting and a time for uprooting what was planted,
³    a time for killing and a time for healing,
    a time for tearing down and a time for building up,
⁴    a time for crying and a time for laughing,
    a time for mourning and a time for dancing,
⁵    a time for throwing stones and a time for gathering stones,
    a time for embracing and a time for avoiding embraces,
⁶    a time for searching and a time for losing,
    a time for keeping and a time for throwing away,
⁷    a time for tearing and a time for repairing,
    a time for keeping silent and a time for speaking,
⁸    a time for loving and a time for hating,
    a time for war and a time for peace.

It’s kind of an interesting choice because this is not really one of those traditional Bible verses used for comfort. But this choice shows his character, because at the end of his life he openly and bravely accepted his mortality. He told us, “Don’t worry about me, I’m ready to go.”

As tough as this was to hear, this was his last gift to us: accepting God’s will and being at peace, and thereby bestowing that peace on us when he passed away.

Rest in peace dad.

State of Data Engineering 2024 Q1

The current state of data engineering offers a plethora of options in the market, which can make selecting the right tool challenging. We are approaching a period where the traditional boundaries between databases, datalakes, and data warehouses are overlapping. As always, it is important to think about the business case first, then do a technology selection afterwards.

This diagram is simple, but merits some discussion.

Most companies in the small and medium data fields can get away with simpler architectures, with a standard database powering their business applications. It is only when you get into big data and extremely large data that you want to start looking at more advanced platforms.

The Open Source Table Format Wars Revisited

A growing agreement is forming around the terminology used for Open Table Formats (OTF), also known as Open Source Table Formats (OSTF). These formats are particularly beneficial in scenarios involving big data or extremely large datasets, similar to those managed by companies like Uber and Netflix. Currently, there are three major contenders in the OTF space.

Platform	Link	Paid Provider
Apache Hudi	https://hudi.apache	https://onehouse.ai/
Apache Iceberg	https://iceberg.apache.org/	https://tabular.io/
Databricks	https://docs.databricks.com/en/delta/index.html	Via hyperscaler

Several recent announcements from AWS lead me to believe that more support for Apache Iceberg is coming to the AWS ecosystem.

AWS Glue Data Catalog now supports automatic compaction of Apache Iceberg tables

https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/

Every datalake eventually suffers from a small file problem. What this means is that if you have too many small files in a given S3 partition (aka file path), performance degrades substantially. To alleviate this, compaction jobs are run to merge small files into bigger ones. In managed paid platforms, this is done automatically for you, but in the open source platforms, developers are on the hook for doing this themselves.
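To make the idea concrete, here is a minimal sketch of how a compaction job might decide which small files to merge together. It is a toy greedy planner, not how Iceberg, Hudi, or Glue actually implement compaction; the function name `plan_compaction` and the 128 MB target are assumptions chosen for illustration.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Group small files into batches of roughly target_mb each.

    file_sizes_mb is a list of (file_name, size_in_mb) tuples.
    Greedy single pass: walk the files in order and start a new batch
    whenever adding the next file would exceed the target size.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes_mb:
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical file listing for one partition.
    files = [("a.parquet", 60), ("b.parquet", 60), ("c.parquet", 60), ("d.parquet", 10)]
    for batch in plan_compaction(files):
        print("merge:", batch)
```

Each batch would then be rewritten as one larger file, which is the part the managed platforms handle for you.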

I was surprised to read that now, if you use Apache Iceberg tables with the AWS Glue Data Catalog, developers no longer have to deal with compaction jobs. Now to the second announcement:

Amazon Redshift announces general availability of support for Apache Iceberg

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-redshift-support-apache-iceberg/

If you are using Amazon Redshift, you can do federated queries without needing to go through the hassle of manually mounting data catalogs.

In this video, you can watch Amazon talk about Iceberg explicitly in their AWS storage session from re:Invent.

This generally leads me to believe that Apache Iceberg probably will be more integrated into the Amazon ecosystem in the near future.

Apache Hudi

Apache Hudi recently released version 0.14.0 which has some major changes such as Record Level Indexing

https://hudi.apache.org/releases/release-0.14.0/

https://aws.amazon.com/blogs/big-data/simplify-operational-data-processing-in-data-lakes-using-aws-glue-and-apache-hudi/

One Table

Another kind of weird development, announced right before re:Invent, was OneTable:

https://onetable.dev/

Microsoft, the Hudi team, and the Databricks team got together to create a new standard that serves as an abstraction layer on top of an OTF. This is odd to me, because not too many organizations have these data stacks concurrently deployed.

However, as abstraction layers get created on top of OTFs over the next couple of years, this will probably be something to watch.

Amazon S3 Express One Zone Storage Class

Probably one of the most important, but also most buried, pieces of news from re:Invent was the announcement of Amazon S3 Express One Zone.

https://aws.amazon.com/s3/storage-classes/express-one-zone/

With this, we can now have single-digit millisecond access to data in S3, which raises a weird question of whether datalakes will encroach on database territory now that they can meet higher SLAs. However, there are some caveats: region availability is limited, and the data lives in a single Availability Zone, so think about your disaster recovery requirements. This is one feature I would definitely watch.

Zero ETL Trends

Zero ETL is behind-the-scenes replication from Aurora, RDS, and DynamoDB to Redshift. If you have a use case where Slowly Changing Dimensions (SCD) Type 1 is acceptable, these are all worth taking a look at. From my understanding, when replication occurs, there is no connection penalty to your Redshift cluster.

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-aurora-postgresql-zero-etl-integration-redshift-public-preview/

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-rds-mysql-zero-etl-integration-amazon-redshift-public-preview/

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-dynamodb-zero-etl-integration-redshift/

Amazon OpenSearch Service zero-ETL integration with Amazon S3 preview now available

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-opensearch-zero-etl-integration-s3-preview/

AWS announces Amazon DynamoDB zero-ETL integration with Amazon OpenSearch Service

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-dynamodb-zero-etl-integration-amazon-opensearch-service/

AWS CloudTrail Lake data now available for zero-ETL analysis in Amazon Athena

https://aws.amazon.com/about-aws/whats-new/2023/11/aws-cloudtrail-lake-zero-etl-anlysis-athena/

Spark/Glue/EMR Announcements

Glue Serverless Spark UI

Now it is way easier to debug Glue jobs, as the Spark UI no longer has to be provisioned manually.

https://aws.amazon.com/blogs/big-data/introducing-aws-glue-studio-serverless-spark-ui-for-better-monitoring-and-troubleshooting/

Glue native connectors: Teradata, SAP HANA, Azure SQL, Azure Cosmos DB, Vertica, and MongoDB

https://aws.amazon.com/about-aws/whats-new/2023/11/aws-glue-launches-native-connectivity-6-databases/

AWS Glue announces entity-level actions to manage sensitive data
https://aws.amazon.com/about-aws/whats-new/2023/11/aws-glue-entity-level-actions-sensitive-data/

Glue now supports Gitlab and Bitbucket

https://aws.amazon.com/about-aws/whats-new/2023/10/aws-glue-gitlab-bitbucket-git-integration-feature/

Trusted identity propagation

Propagate OAuth 2.0 credentials to EMR

https://docs.aws.amazon.com/singlesignon/latest/userguide/trustedidentitypropagation-overview.html

Databases

Announcing Amazon Aurora Limitless Database

https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-aurora-limitless-database/

Conclusion

It is exciting to see the OTF ecosystem evolve. Apache Hudi is still a great and mature option, with Apache Iceberg now being more integrated with the AWS ecosystem.

Zero ETL has the potential to save your organization a ton of time if your data sources are supported by it.

Something to consider is that major shifts in data engineering occur every couple of months, so keep an eye on new developments, as they can have profound impacts on enterprise data strategies and operations.

State of Data Engineering 2023 Q3

As we roll towards the end of the year, data engineering as expected does have some changes, but now everyone wants to see how Generative AI intersects with everything. The fit is not completely natural, as Generative AI systems like ChatGPT are more NLP-type systems, but there are a few interesting cases to keep an eye on. Also, Apache Iceberg is one to watch now that there is more first-class Amazon integration.

Retrieval Augmented Generation (RAG) Pattern

One of the major use cases for data engineers to understand for Generative AI is the retrieval augmented generation (RAG) pattern.

There are quite a few articles on the web articulating this.

What is important to realize is that Generative AI is only providing a lightweight wrapper interface to your system. The RAG paradigm was created to help address context limitations by vectorizing your document repository, using some type of nearest neighbors algorithm to find the relevant data, and passing it to a foundation model. Perhaps LLMs with newer and larger context windows (like 100k context) may address these problems.
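The retrieval half of that pattern can be sketched in a few lines. This is a toy illustration under loud assumptions: the two-dimensional vectors stand in for real embeddings, a brute-force cosine similarity scan stands in for a proper nearest neighbors index, and the names `cosine`, `top_k`, and `build_prompt` are hypothetical rather than from any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are most similar to the query.

    index is a list of (chunk_text, vector) pairs.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Assemble the retrieved chunks into the context sent to the model."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Toy vectors; a real system would get these from an embedding model.
    index = [
        ("refund policy", [1.0, 0.0]),
        ("shipping times", [0.0, 1.0]),
        ("returns process", [0.9, 0.1]),
    ]
    chunks = top_k([1.0, 0.0], index, k=2)
    print(build_prompt("How do refunds work?", chunks))
```

The foundation model only ever sees the prompt that comes out the other end, which is why the quality of the retrieval step matters so much.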

At the end of the day, data engineers will be tasked more with chunking and vectorizing back-end systems, and debates will probably emerge in your organization over whether to roll out your own solution or just use a SaaS to do it quickly.
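Chunking itself is not complicated. A minimal sketch, assuming fixed-size character windows with overlap (real pipelines often split on sentences or tokens instead, and `chunk_text` with its default sizes is a hypothetical helper):

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("0123456789", size=4, overlap=2):
        print(chunk)
```

The overlap keeps a sentence that straddles a boundary from being cut off from its context in both chunks.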

Generative AI for Data Engineering?

One of the core problems with generative AI is that eventually it will start hallucinating. I played around with asking ChatGPT to convert CSV to JSON, and it worked for about the first 5 prompts, but by the 6th prompt, it started to make up JSON fields which never existed.

Something I envision in the future is the ability to use LLMs to stitch together the parts of data pipelines concerned with data mapping and processing. But at the moment, that is not possible because of hallucination.
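One cheap guardrail is to compare whatever the model returns against the source header. A minimal sketch using only the standard library: `csv_to_records` is the boring deterministic conversion, and `unexpected_fields` (a hypothetical helper) flags any JSON keys that never existed as CSV columns. It assumes the model returns a JSON array of objects.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Deterministic CSV -> list of dicts using the standard library."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def unexpected_fields(csv_text, llm_json):
    """Return field names in the model's JSON that are not CSV columns."""
    header = set(next(csv.reader(io.StringIO(csv_text))))
    records = json.loads(llm_json)  # assumed to be a JSON array of objects
    seen = set()
    for record in records:
        seen.update(record.keys())
    return sorted(seen - header)


if __name__ == "__main__":
    source = "id,name\n1,Ann\n"
    # Hypothetical model output with a made-up "email" field.
    model_output = '[{"id": "1", "name": "Ann", "email": "ann@example.com"}]'
    print(csv_to_records(source))
    print(unexpected_fields(source, model_output))
```

A check like this does not stop hallucination, but it does turn a silent data quality problem into a loud failure.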

There is some interesting research occurring where a team has paired a finite state machine (FSM) with LLMs to create deterministic JSON output. I know that might not seem like a big deal, but if we can get deterministic outcomes from data generation, it might be interesting to look at:

https://github.com/normal-computing/outlines

So far use cases we see day to day are

1. Engineers using LLMs to help create SQL or Spark code scaffolds

2. Creation of synthetic data – basically pass in a schema and ask an LLM to generate a data set for you to test

3. Conversion of one schema to another schema-ish. This kind of works, but buyer beware
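For the synthetic data case, it is worth knowing what the non-LLM baseline looks like, since it is deterministic and free. A minimal sketch, assuming a schema expressed as a column-to-type mapping; the `GENERATORS` table and `synthetic_rows` are hypothetical names, and the value ranges are arbitrary.

```python
import random
import string

# One generator per supported column type; each takes a random.Random instance.
GENERATORS = {
    "int": lambda rng: rng.randint(0, 1000),
    "float": lambda rng: round(rng.uniform(0, 100), 2),
    "str": lambda rng: "".join(rng.choices(string.ascii_lowercase, k=8)),
    "bool": lambda rng: rng.random() < 0.5,
}


def synthetic_rows(schema, n, seed=0):
    """Generate n rows for a schema like {"id": "int", "name": "str"}.

    A fixed seed makes the output reproducible between test runs.
    """
    rng = random.Random(seed)
    return [
        {column: GENERATORS[kind](rng) for column, kind in schema.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    for row in synthetic_rows({"trip_id": "int", "rider": "str", "fare": "float"}, 3):
        print(row)
```

Where an LLM earns its keep is in making the values look realistic; for plain shape and type testing, something like this is usually enough.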

Apache Iceberg

Last year our organization did a proof of concept with Apache Iceberg, but one of the core problems was that Athena and Glue didn’t have any native support, so it was difficult to do anything.

However on July 19, 2023 AWS quietly released an integration with Apache Iceberg & Athena into production

Since then, AWS has finally started to treat Iceberg as a first class product with their documentation and resources

Something to keep track of is that the team which founded Apache Iceberg also founded a company called tabular.io, which provides hosted compute for Apache Iceberg workloads. Their model is pretty interesting: you give Tabular access to your S3 buckets, and they deal with ingestion, processing, and file compaction for you. They can even point to DMS CDC logs, create SCD Type 1 tables, and query SCD Type 2 via time travel in a couple of clicks, which is pretty fancy to me.
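For anyone fuzzy on the SCD terminology, here is a minimal in-memory sketch of what folding a CDC log into Type 1 (latest state only) and Type 2 (full history, queryable as of a point in time) means. It is not Tabular's or Iceberg's implementation; the event keys (`op`, `key`, `ts`, `data`) and the function names are hypothetical, with the I/U/D operation codes loosely modeled on what a CDC feed emits.

```python
def apply_cdc(events):
    """Fold CDC events into SCD Type 1 (current state) and Type 2 (history).

    Each event is a dict with hypothetical keys: op ("I", "U" or "D"),
    key, ts (a sortable timestamp) and data (the row payload).
    """
    current = {}        # Type 1: key -> latest row
    history = []        # Type 2: one entry per version with a validity window
    open_version = {}   # key -> index in history of the still-open version

    for event in sorted(events, key=lambda e: e["ts"]):
        key = event["key"]
        if key in open_version:
            # Close the previous version at the moment the change happened.
            history[open_version.pop(key)]["valid_to"] = event["ts"]
        if event["op"] == "D":
            current.pop(key, None)
            continue
        current[key] = event["data"]
        history.append({"key": key, "data": event["data"],
                        "valid_from": event["ts"], "valid_to": None})
        open_version[key] = len(history) - 1
    return current, history


def as_of(history, key, ts):
    """Time travel: return the row for key as it looked at timestamp ts."""
    for version in history:
        if (version["key"] == key and version["valid_from"] <= ts
                and (version["valid_to"] is None or ts < version["valid_to"])):
            return version["data"]
    return None


if __name__ == "__main__":
    log = [
        {"op": "I", "key": 1, "ts": 1, "data": {"status": "new"}},
        {"op": "U", "key": 1, "ts": 5, "data": {"status": "shipped"}},
    ]
    current, history = apply_cdc(log)
    print(current[1], as_of(history, 1, 3))
```

Type 1 answers "what does the row look like now", while Type 2 answers "what did it look like on a given date", which is the question time travel makes cheap.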

However if you choose to roll things out yourself, expect to handle engineering efforts similar to this

https://tabular.io/blog/cdc-merge-pattern/

The Open Source Table Format Wars Continue

One of the core criticisms of traditional datalakes is the difficulty of performing updates or deletes against them. With that, we have 3 major players in the market for transactional datalakes.

Platform	Link	Paid Provider
Databricks	https://docs.databricks.com/en/delta/index.html	Via hyperscaler
Apache Hudi	https://hudi.apache	https://onehouse.ai/
Apache Iceberg	https://iceberg.apache.org/	https://tabular.io/

What’s the difference between these 3 you say? Well, 70% of the major features are similar, but there are some divergent features

https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison

Also don’t even consider AWS Governed Tables and focus on the top 3 if you have these use cases.

Redshift Serverless Updates

There has been another major silent update: Redshift Serverless now requires only 8 RPUs to provision a cluster. Before it was 32 RPUs, which was a ridiculously high number.

8 RPUs x 12 hours x 0.36 USD x 30.5 days in a month = 1,054.08 USD

Redshift Serverless cost (monthly): 1,054.08 USD

Provisioned ra3.xlplus, 1 node (monthly): 792.78 USD

So as you can see provisioned is still cheaper, but look into Serverless if

· You know your cluster will be idle most of the time (at the prices above, serverless only comes out ahead below roughly 9 active hours a day)

· You don’t want to deal with the management headaches

· You don’t need a public endpoint
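The comparison above is easy to rerun for your own workload. A minimal sketch that uses only the figures quoted in this post (0.36 USD per RPU-hour, 8 RPUs, 30.5 days, 792.78 USD for one ra3.xlplus node); prices vary by region and change over time, so treat the constants as assumptions and the function names as hypothetical.

```python
RPU_HOUR_USD = 0.36                # per-RPU-hour price quoted above
BASE_RPUS = 8                      # minimum serverless capacity
DAYS_PER_MONTH = 30.5
PROVISIONED_MONTHLY_USD = 792.78   # single ra3.xlplus node, quoted above


def serverless_monthly_cost(active_hours_per_day, rpus=BASE_RPUS):
    """Monthly serverless cost for a given number of active hours per day."""
    return rpus * active_hours_per_day * RPU_HOUR_USD * DAYS_PER_MONTH


def break_even_hours_per_day(rpus=BASE_RPUS):
    """Active hours per day at which serverless costs the same as provisioned."""
    return PROVISIONED_MONTHLY_USD / (rpus * RPU_HOUR_USD * DAYS_PER_MONTH)


if __name__ == "__main__":
    print(f"12 h/day serverless: {serverless_monthly_cost(12):.2f} USD")
    print(f"break-even: {break_even_hours_per_day():.2f} active hours/day")
```

With these numbers the break-even lands at a little over 9 active hours a day, so a cluster that is busy 12 hours a day is still cheaper provisioned.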

DBT

Data Build Tool (dbt) has really been gaining a lot of popularity at the moment. It is kind of weird to see this pendulum swinging back and forth, as many years ago we had these super big SQL scripts running on data warehouses. That went out of fashion, but now here we are.

A super interesting thing that got released is a dbt-glue adapter.

https://pypi.org/project/dbt-glue/

https://aws.amazon.com/blogs/big-data/build-your-data-pipeline-in-your-aws-modern-data-platform-using-aws-lake-formation-aws-glue-and-dbt-core/

That means you can now run dbt SQL processing on AWS Glue.

For those new to dbt feel free to check this out

https://dbt.picturatechnica.com/

https://corpinfollc.atlassian.net/wiki/spaces/APS/pages/119138643968195/DBT+ETL+getdbt.com

Glue Docker Image

This is kind of a weird thing, but I recently saw the ability to launch Glue as a local Docker image. I haven’t personally tried this, but it is interesting.

https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/

https://github.com/juanAmayaRamirez/docker_compose_glue4.0

https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/

Zero ETL Intrigue

This is kind of an old feature, but Amazon rolled out in preview a Zero ETL integration from Aurora MySQL 8.x to Redshift.

https://aws.amazon.com/about-aws/whats-new/2023/06/amazon-aurora-mysql-zero-etl-integration-redshift-public-preview/

This is pretty intriguing, as it means SCD Type 1 views should be replicated without doing any work of putting data through a datalake. However, it is still in preview, so I can’t recommend it until it goes into general release.

State of Data Engineering 2023 Q2

When looking at data engineering for your projects, it is important to think about market segmentation. In particular, you might be able to think about it in four segments

Small Data
Medium Data
Big Data
Lots and Lots of Data

Small Data – This refers to scenarios where companies have data problems (organization, modeling, normalization, etc), but don’t necessarily generate a ton of data. When you don’t have a lot of data, different tool sets are in use ranging from low code tools to simpler storage mechanisms like SQL databases.

Low Code Tools

The market is saturated with low code tools, with an estimated 80-100 products available. Whether low code tools work for you depends on your use case. If your teams lack a strong engineering capacity, it makes sense to use a tool to help accomplish ETL tasks.

However, problems arise when customers need to do something outside the scope of the tool.

Medium Data – This refers to customers who have more data, making it sensible to leverage more powerful tools like Spark. There are several ways to solve the problem with data lakes, data warehouses, ETL, or reverse ETL.

Big Data – This is similar to medium data, but introduces the concepts of incremental ETL (aka transactional data lakes or lake houses). Customers in this space tend to have data in the hundreds of gigabytes to terabytes.

Transactional data lakes are essential because incremental ETL is challenging. For example, consider an Uber ride to the airport that costs $30. Later, you give a $5 tip, and now your trip costs $35. In a traditional database, you can run some ETL to update the record. However, Uber has tons of transactions worldwide, and they need a different way of dealing with the problem.
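The operation at the heart of this is an upsert: merge a small batch of changed rows into a large table by key without rewriting everything. A minimal in-memory sketch of the trip example follows; the column names and the `upsert` helper are hypothetical, and real transactional data lakes do this against files rather than a Python dict.

```python
def upsert(table, changes, key="trip_id"):
    """Merge changed rows into table, a dict keyed by the record's key.

    Existing rows are updated field by field; unseen keys are inserted.
    Only the rows named in `changes` are touched.
    """
    for row in changes:
        table[row[key]] = {**table.get(row[key], {}), **row}
    return table


if __name__ == "__main__":
    trips = {"t1": {"trip_id": "t1", "fare": 30.0, "tip": 0.0, "total": 30.0}}
    # The tip arrives later as an incremental change to the same trip.
    upsert(trips, [{"trip_id": "t1", "tip": 5.0, "total": 35.0}])
    print(trips["t1"])
```

On a data lake the hard part is doing the same thing when the "dict" is millions of immutable files, which is exactly what the table formats exist to manage.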

Introducing transactional data lakes requires more operational overhead, which should be taken into consideration.

Lots and Lots of Data – Customers in this space generate terabytes or petabytes of data a day. For example, Walmart creates 10 PB of data (!) a day.

https://medium.com/walmartglobaltech/lakehouse-at-fortune-1-scale-480bcb10391b

When customers are in this space, transactional data lakes with Apache Hudi, Apache Iceberg, and Databricks Delta Lake are the main tools used.

Conclusion

The data space is large and crowded. With the small and lots of data sizes, the market segment is clear. However, the mid-market data space will probably take some time for winners to emerge.