A Data Engineering Perspective of LLMs

Data engineering is a field I would categorize as a subspecialty of software engineering. It shares the same concerns as software engineering—scalability, maintainability, and other “-ilities”—but its primary focus is on data. It’s a unique discipline because data is inherently messy, and as a result, no standard enterprise framework has emerged to dominate the space—and perhaps it never will. 

The complexity of a data project can often be measured by the variety and nature of its data sources. For example, a project involving high-volume Internet of Things (IoT) data is vastly different from one dealing with a structured database modeled with Slowly Changing Dimensions (SCD) Type 1. 

Data projects generate value by transforming data. Sometimes, simply joining two datasets can uncover new insights. In other cases, cleansing data helps provide clearer metrics for the C-suite to make decisions. In nearly all cases, we are fully aware of the data sources being ingested and monitor them closely for changes, as modifications can impact upstream pipelines. 

That’s where Large Language Models (LLMs) start to feel a bit strange. Unlike traditional data projects where the sources are known and controlled, with LLMs, especially frontier models like OpenAI’s ChatGPT or Anthropic’s Claude, we don’t have true visibility into the training data. If someone asked me to build a data project without transparency into the underlying data sources, I’d be very cautious. 

The Washington Post attempted to extrapolate what’s in some frontier models by analyzing Google’s C4 dataset: 

https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning

We can probably assume that most LLMs have read everything publicly visible on the internet—blogs, Reddit, YouTube transcripts, Wikipedia, and so on. The bigger question is whether they’ve read copyrighted books—and, if so, how can we even know? 

A while back, I was quite obsessed with making tofu from scratch.  The process involved making my own soy milk, then coagulating it with Nigari (a magnesium chloride solution made from seawater).  I sourced Nigari locally from Vancouver Island, where the manufacturer gave some tips on how to make it.  After many attempts and watching tofu videos on YouTube many times, I finally got a recipe down.   

Recently, I had a craving to make Tau Foo Fah, a silky tofu dessert popular across Asia. Unlike Japanese tofu, this version uses different coagulants: gypsum and cornstarch. The most traditional method I’ve seen involves dissolving the coagulant in water, pouring hot soy milk over it from a height, and letting it set. My results were always inconsistent, so I turned to ChatGPT. 

https://chatgpt.com/share/67ef2c2c-2728-8011-bd4b-b07973cb87e4

In the ChatGPT iOS app, I must have accidentally triggered the “deep research” feature—so instead of querying the regular model, I used a deep research credit. The model suggested the following method. 

  • Heat soy milk to ~185°F (85°C). 
  • Dissolve gypsum + cornstarch in a bit of water. 
  • Pour hot soy milk into the coagulant mix in a mason jar (don’t stir too much). 
  • Place jar on trivet in Instant Pot with water (pot-in-pot method). 
  • Steam on Low for 10 mins → natural release 10–15 mins. 
  • Let it rest undisturbed before opening. 

For the sake of experimentation, I tried two approaches in the Instant Pot: 

1. I mixed the coagulants into cold soy milk (1% cornstarch and 0.74% gypsum by weight of soy milk), capped a mason jar, and steamed it on high for 10 minutes in an Instant Pot. 

2. Followed ChatGPT’s suggestion, heating the soy milk to 185°F first. 

The results were vastly different. The ChatGPT method didn’t set properly. Looking closer at the deep research citations, I realized most were blogs, and the model had likely built its answer on those foundations. It also seemed to confuse traditional tofu-making techniques with those specific to Tau Foo Fah. 

This experience got me thinking more about the importance of data sources and data lineage when using LLMs. Here are a few questions that take a data-centric approach to thinking about them. 

1. Do you know what data sources are being referenced in the LLM? 
 

    It sounds basic, but it’s critical. LLMs are black boxes in many respects. Just as we wouldn’t integrate unknown data into a pipeline without thorough vetting, we shouldn’t treat LLM outputs as reliable without understanding what they’re based on. 

2. Does the question you’re asking have high-quality representation in the model’s training data? 

It helps to step back and consider the quality and coverage of data within the problem space. For coding-related queries, I assume the quality is high—thanks to well-defined language grammars and extensive examples on Stack Overflow and Reddit. 

    Other well-represented domains likely include medicine and law, given the large corpus of reliable reference material available. 

    But in areas like food, I’ve learned not to rely on ChatGPT as an authoritative source. Many great recipes are locked behind non-digitized books or restricted datasets that LLMs can’t legally access. 

    3. Are you able to see data source citations in your answer? 

    When using a general model like GPT-4o, answers are often returned without citations. That leaves the burden of validation on you—the user. 

    For low-stakes questions like “What kind of plant is this?”, a wrong answer is annoying, but not disastrous. But if you ask, “How much revenue did we generate last month?” and use the response to make business decisions without verifying the source, the risk increases significantly. 

    Medical questions are even riskier. Making a health decision based on uncited LLM output is problematic. 

    Google’s NotebookLM stands out here. You can upload your own PDFs (e.g., data sources), and the model provides footnoted citations linking back to the original documents—much more traceable and reliable. 

    4. Are you an expert in the domain you’re querying? 

    A senior software engineer might use ChatGPT to refine a coding solution, while a junior engineer may dangerously implement answers without understanding them. 

    In general, we need to be cautious about accepting LLM responses at face value. It’s tempting to delegate too much thinking to the machine, but we must not bypass our own analytical skills as we inevitably integrate these tools into our workflows. 

    State of Data Engineering 2025 Q1

    AI Updates

There is a lot of chatter about 2025 being the year of agentic frameworks. To me, this means a system in which AI models can take independent actions based on their environment, typically by interacting with external APIs or interfaces. 

    The terminology around this concept is still evolving, and definitions may shift in the coming months—similar to the shifting discussions around Open Table Formats. Below are some key takeaways from recent discussions on agentic systems: 

• Agents can take actions allowed by a given environment 
• Agents can be potentially dangerous if given the ability to perform write actions 
• Agent workflows can become very expensive if not constrained (e.g., excessive token usage) 
• Evaluating success of agent workflows is tricky 

1. Chip Huyen recently wrote a new book on AI Engineering and created a standalone blog post about agents: https://huyenchip.com//2025/01/07/agents.html 
2. Anthropic has explored agentic systems in their article: https://www.anthropic.com/research/building-effective-agents 

    When writing traditional software, it’s helpful to think in terms of determinism. In probability theory: 

• P(A) = 1 represents an outcome A that is guaranteed to occur (100%). 
• P(A) = 0 represents an outcome that will never occur (0%). 

    Similarly, code is deterministic—given the same input, it will always produce the same output. Business rules drive execution, either directly through the code or abstracted into configuration files. 

    However, agents break this determinism by introducing an element of decision-making and uncertainty. Instead of executing predefined logic, agents operate within a framework that allows for flexible responses based on real-time data. This can be powerful, as it enables systems to handle edge cases without explicitly coding for them. But it also raises evaluation challenges—how do we determine if an agent is performing well when its behavior isn’t fully predictable? 
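
To make the contrast concrete, here is a minimal Python sketch, not any particular framework’s API: `shipping_fee` is deterministic business logic, while `run_agent` lets a model (behind a hypothetical `call_llm` helper you would wire to your own client) choose its next action from a read-only tool allow-list under a hard step budget, which is one way to address the cost and write-action concerns listed above.

```python
from typing import Callable

# Deterministic business rule: the same input always yields the same output.
def shipping_fee(order_total: float) -> float:
    return 0.0 if order_total >= 100 else 9.99

# --- Agent sketch (illustrative only) ---------------------------------------
# `call_llm` is a hypothetical stand-in for whatever model client you use; it
# should return the name of ONE tool from the allow-list plus its argument,
# or ("finish", answer) when it is done.
def call_llm(prompt: str) -> tuple[str, str]:
    raise NotImplementedError("plug in your LLM client here")

# Read-only tool registry: the agent cannot perform write actions.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_order": lambda order_id: f"status of {order_id}: shipped",
    "lookup_refund_policy": lambda _: "refunds allowed within 30 days",
}

def run_agent(question: str, max_steps: int = 3) -> str:
    """Let the model pick tools, but constrain what and how much it can do."""
    context = question
    for _ in range(max_steps):            # hard cap keeps token spend bounded
        tool_name, arg = call_llm(context)
        if tool_name == "finish":          # model signals it has an answer
            return arg
        if tool_name not in TOOLS:         # refuse anything outside the allow-list
            return f"Refused disallowed action: {tool_name}"
        context += "\n" + TOOLS[tool_name](arg)
    return "Step budget exhausted; no answer."
```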

Taking a data engineering perspective on this, I think the usefulness of agents will be directly proportional to the quality of the data sources available to them.

    • ChatGPT and similar LLMs have been useful primarily because they have likely ingested the entire Internet. Whether this has been done in full compliance with copyright laws is a separate issue.
    • More structured domains, such as coding, have thrived due to the constraints imposed by programming languages, as these languages are inherently built upon specific grammatical rules.

When customers add data sources to their agentic frameworks, I can only imagine that a lot of data cleansing and structuring will have to be done to make these systems successful.


    re:Invent Recap

Every year there are tons of announcements, and you can catch a recap of all the new services in this 126-slide PDF from AWS, courtesy of AWS Community Days.

    https://assets.community.aws/a/reCap-Summary-2024.pdf

    Data Highlights: 
     

• Glue 5.0 – be sure to use this latest version 

     

SageMaker Unified Studio

Another major launch during re:Invent was SageMaker Unified Studio. Capabilities include: 
     

    • Amazon SageMaker Unified Studio (public preview): use all your data and tools for analytics and AI in a single development environment. 
    • Amazon SageMaker Lakehouse: unify data access across Amazon S3 data lakes, Amazon Redshift, and federated data sources. 
    • Data and AI Governance: discover, govern, and collaborate on data and AI securely with Amazon SageMaker Catalog, built on Amazon DataZone. 
    • Model Development: build, train, and deploy ML and foundation models, with fully managed infrastructure, tools, and workflows with Amazon SageMaker AI. 
    • Generative AI app development: Build and scale generative AI applications with Amazon Bedrock. 
    • SQL Analytics: Gain insights with the most price-performant SQL engine with Amazon Redshift. 
    • Data Processing: Analyze, prepare, and integrate data for analytics and AI using open-source frameworks on Amazon Athena, Amazon EMR, AWS Glue, and Amazon Managed Workflows for Apache Airflow (MWAA). 

    https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-sagemaker-the-center-for-all-your-data-analytics-and-ai

    https://aws.amazon.com/blogs/big-data/simplify-data-access-for-your-enterprise-using-amazon-sagemaker-lakehouse

In the data field, there are two options: build vs. buy. AWS traditionally follows the build approach, offering tools for developers to create their own platforms. However, operationalizing data platforms is complex, and competitors like Databricks and Snowflake offer faster onboarding workflows with vertically integrated components. 

SageMaker Lakehouse shows potential as it attempts to bridge the gap between build and buy solutions. It offers robust management features within its UI, including CI/CD workflows similar to GitHub. However, given its early stage, I would approach this large platform by starting with your primary use cases and running a proof of concept before committing any production workloads.

However, my main complaint is that this service should never have been called SageMaker Unified Studio in the first place. Historically, SageMaker has been associated with the AI side of AWS, so it is easy to miss that the new portfolio now includes data engineering. If I had to do it all over, I would have called it:

1. AWS Data Unified Studio, which would include:
  • Data Engineering Workflows
  • SageMaker AI Workflows
  • Catalog Workflows

    Open Table Format Wars

The industry has been talking about the Open Table Format (OTF) wars for a while, and really the problem has been more political than technological. 

As a reminder, there are three major players in the space: Apache Hudi, Apache Iceberg, and Databricks Delta Lake. In recent months, AWS, Databricks, and Snowflake have begun to coalesce around allowing catalogs based on Apache Iceberg to query and write to each other. 

    At its core, Apache Iceberg is a set of specifications. In this case, both the AWS Glue Data Catalog and Databricks have implemented the Apache Iceberg REST contracts, enabling them to query each other. AWS facilitates this by exposing an external endpoint that other vendors can access. 
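
As a rough illustration of what implementing the REST contract buys you, here is a minimal sketch using the PyIceberg client; the endpoint URI, warehouse, and table names are placeholders, and the authentication properties will vary by vendor.

```python
from pyiceberg.catalog import load_catalog

# Any catalog that implements the Iceberg REST spec can be loaded the same way;
# only the connection properties change. URI and warehouse below are placeholders.
catalog = load_catalog(
    "remote_rest",
    **{
        "type": "rest",
        "uri": "https://example-catalog-endpoint/iceberg",  # vendor-exposed REST endpoint
        "warehouse": "analytics_warehouse",                  # placeholder
        # auth properties (OAuth tokens, SigV4, etc.) vary by vendor
    },
)

# From here, table discovery and reads look identical regardless of which
# engine or cloud actually hosts the data.
table = catalog.load_table("sales.orders")
print(table.schema())
```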

    This marks a significant shift in how we approach cross-platform and multi-cloud data querying. What’s puzzling is why there now seems to be broad industry alignment on this paradigm—but regardless, it’s a promising step forward for data engineering. Looking ahead, I wonder if we’ll eventually see true separation of compute and storage across clouds. 
     

    This use case is particularly relevant for large enterprises operating across multiple clouds and platforms. For smaller companies, however, it may not be as critical. 

What about Apache Hudi for cross-query federation? Hudi remains a significant platform, but I’m pessimistic about Apache XTable catching on. Still, if your use case involves federated querying across a broader ecosystem, consider this design pattern. 

    Keep in mind that cross-platform read and write capabilities are being rolled out gradually. Be sure to conduct extensive proof-of-concept (POC) testing to ensure everything functions as expected. 

Guidelines for data engineering technology selection

History is often a useful guide, and some are saying the Iceberg format will replace plain Parquet files. As an example, a nifty feature is hidden partitioning, where you can change your partitioning scheme without rewriting your physical files in storage. The tradeoff is that Iceberg is not the simplest of platforms; you still need to manage metadata and snapshots. 
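
As a concrete example, changing the partition layout in Iceberg is a metadata-only operation. A rough PySpark sketch follows; the catalog, warehouse, table, and column names are placeholders, and it assumes the table was previously partitioned by days(event_ts).

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg runtime jar is on the classpath; all names are placeholders.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "glue")   # or hive / rest, per environment
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Switch from daily to hourly partitioning on event_ts.
# Existing data files are NOT rewritten; only new writes use the new layout,
# and readers keep pruning correctly because the partitioning is hidden.
spark.sql("ALTER TABLE my_catalog.db.events DROP PARTITION FIELD days(event_ts)")
spark.sql("ALTER TABLE my_catalog.db.events ADD PARTITION FIELD hours(event_ts)")
```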

    Here are some tips to guide your projects through a technology selection. 

Do you actually have big data? 

Big data is a hard term to define, but think of organizations with terabytes or petabytes of data. If you do qualify, you will most likely choose one of the three OTFs. 

    What ecosystem are you working in? 
A fair amount of consideration should be given to the ecosystem and technologies your team is already working in. 
     

    • What programming languages does your staff know? 
• Does your team have experience with any of the OTFs? 
       

    What are the dimensions you need to balance? 

There is a weird blending of data lakes and data warehouses now, where the lines aren’t clear because features exist in both. We aren’t going to get clear lines soon, so consider these items. 

• Cost at Scale 
  • After factoring in your project’s expected growth patterns, which solution will scale best on cost? For example, holding petabytes of data is probably better suited to a data lake than a data warehouse.  
    • Managed or Unmanaged AWS Service? 
      • It’s important to recognize that managed services, such as AWS Glue or MWAA, are not a one-size-fits-all solution. While they simplify operations, they also abstract away fine-tuning capabilities. Managed services often come at a higher cost but reduce operational complexity. When evaluating them, consider both infrastructure costs and the time and effort required for ongoing management. 
  • If your team has the engineering resources and expertise to manage an unmanaged platform, it may be a cost-effective choice at scale. If your project consists of a small team with limited bandwidth, opting for a managed AWS service can help streamline operations and reduce overhead. 

    Some members of the Hudi team argue that Open Table Formats aren’t truly “open.” While they raise a valid point, vendor lock-in hasn’t been a major concern for most projects. After all, choosing AWS or Databricks also involves a level of vendor lock-in. As a result, the arguments around openness may resonate more with a niche audience rather than the broader industry. 

    https://thenewstack.io/your-data-stack-is-outdated-heres-how-to-future-proof-it

    Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality 

This is a really intriguing pattern for data quality. It enables you to test your data in a staging area and then commit it. The pattern is not necessarily new, but Iceberg’s branching feature offers an easier way of doing it; a rough sketch follows the links below.

    https://aws.amazon.com/blogs/big-data/build-write-audit-publish-pattern-with-apache-iceberg-branching-and-aws-glue-data-quality

    https://lakefs.io/blog/data-engineering-patterns-write-audit-publish/
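
Here is a rough PySpark sketch of the write-audit-publish flow on an Iceberg branch. It assumes an Iceberg-enabled SparkSession (as configured in the earlier sketch) and an `incoming_df` batch to land; catalog, table, and branch names are placeholders, and the audit step, which the AWS post implements with Glue Data Quality, is stubbed out here as a simple null-key check.

```python
CATALOG = "my_catalog"   # placeholder
TABLE = "db.orders"      # placeholder
BRANCH = "audit_branch"

# 1. Write: land the new data on a branch instead of main.
spark.sql(f"ALTER TABLE {CATALOG}.{TABLE} SET TBLPROPERTIES ('write.wap.enabled'='true')")
spark.sql(f"ALTER TABLE {CATALOG}.{TABLE} CREATE BRANCH IF NOT EXISTS {BRANCH}")
spark.conf.set("spark.wap.branch", BRANCH)          # subsequent writes go to the branch
incoming_df.writeTo(f"{CATALOG}.{TABLE}").append()  # incoming_df: the batch you just ingested

# 2. Audit: run quality checks against the branch only.
bad_rows = spark.sql(
    f"SELECT COUNT(*) AS c FROM {CATALOG}.{TABLE} VERSION AS OF '{BRANCH}' "
    "WHERE order_id IS NULL"
).first()["c"]
assert bad_rows == 0, "audit failed: null keys on branch"

# 3. Publish: fast-forward main to the audited branch so consumers see the data.
spark.sql(f"CALL {CATALOG}.system.fast_forward('{TABLE}', 'main', '{BRANCH}')")
spark.conf.unset("spark.wap.branch")
```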

    Implement historical record lookup and Slowly Changing Dimensions Type-2 using Apache Iceberg 

    https://aws.amazon.com/blogs/big-data/implement-historical-record-lookup-and-slowly-changing-dimensions-type-2-using-apache-iceberg

Building a transactional data lake using Amazon Athena, Apache Iceberg, and dbt 
     

A case study on how the UK Ministry of Justice saved quite a bit of money by switching over to Iceberg using dbt. 
     
    https://ministryofjustice.github.io/data-and-analytics-engineering/blog/posts/building-a-transaction-data-lake-using-amazon-athena-apache-iceberg-and-dbt/ 

     
    Expand data access through Apache Iceberg using Delta Lake UniForm on AWS 

    Kind of a strange article, but there are still some customers using UniForm.  With it, you can read Delta Lake tables as Apache Iceberg tables. 

    https://aws.amazon.com/blogs/big-data/expand-data-access-through-apache-iceberg-using-delta-lake-uniform-on-aws

    Amazon S3 Tables 

    One of the most significant announcements in the data space was Amazon S3 Tables. A common challenge with data lakes is that querying data in S3 can be slow. To mitigate this, techniques like compaction and partitioning are used to reduce the amount of data scanned. 

    Amazon S3 Tables address this issue by allowing files to be stored in S3 using the native Apache Iceberg format, enabling significantly faster query performance. 

    https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads

    https://aws.amazon.com/blogs/storage/how-amazon-s3-tables-use-compaction-to-improve-query-performance-by-up-to-3-times

    https://community.sap.com/t5/technology-blogs-by-members/new-amazon-s3-iceberg-tables-made-us-drop-our-parquet-library-in-abap/ba-p/13971752

However, there is a fair amount of criticism from competitors that this service carries a substantial cost markup, so be sure to read the S3 Tables pricing page. 

    https://aws.amazon.com/s3/pricing

    Redshift

Redshift History Mode – Redshift has released a new feature: if you have a zero-ETL integration enabled, you can turn on Slowly Changing Dimensions (SCD) Type 2 out of the box. 

Historically, our pattern was: 

    • DMS –> S3 (SCD Type 2) –> Spark job to reconcile SCD Type 1 or 2 

    Now we can do 

    • Zero ETL Redshift Replication –> Access SCD Type 2 tables 

This is a relatively new feature, so please do comprehensive testing; a hypothetical point-in-time query against an SCD Type 2 history table is sketched below. 

    https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-history-mode.html
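
Whichever pattern produces the history table, consumption is a point-in-time lookup. The sketch below is generic: the `valid_from` / `valid_to` column names are hypothetical, and Redshift history mode exposes its own system columns, so map them to whatever your integration actually emits.

```python
# Point-in-time lookup against a generic SCD Type 2 history table.
# Column and table names are hypothetical placeholders.
AS_OF = "2025-01-31 23:59:59"

point_in_time_sql = f"""
SELECT customer_id, plan, valid_from, valid_to
FROM analytics.customers_history
WHERE valid_from <= TIMESTAMP '{AS_OF}'
  AND (valid_to > TIMESTAMP '{AS_OF}' OR valid_to IS NULL)  -- NULL marks the current row
"""

# Run with your Redshift client of choice (e.g., redshift_connector or psycopg).
print(point_in_time_sql)
```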

    Amazon Redshift now supports incremental refresh on Materialized Views (MVs) for data lake tables 
    https://aws.amazon.com/about-aws/whats-new/2024/10/amazon-redshift-incremental-refresh-mvs-data-lake-tables/ 

    Apache Hudi Updates 

Apache Hudi reached a major milestone with its 1.0.0 release in December, which includes several new features. 

A notable new feature allows users to create indexes on columns other than the Hudi record key, enabling faster lookups on non-record-key columns within the table (a rough sketch follows the links below). 

    https://hudi.apache.org/blog/2024/12/16/announcing-hudi-1-0-0/#secondary-indexing-for-faster-lookups

    https://www.onehouse.ai/blog/accelerating-lakehouse-table-performance-the-complete-guide
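
Per the 1.0.0 release announcement, the secondary index is created through Spark SQL. A rough sketch follows; table and column names are placeholders, it assumes a Hudi-enabled SparkSession (hudi-spark bundle plus the Hudi SQL extensions), and the exact syntax may shift between releases, so confirm against the Hudi docs.

```python
# Table and column names are placeholders.
spark.sql("""
    CREATE INDEX city_idx ON hudi_orders
    USING secondary_index(city)
""")

# Lookups filtered on the indexed, non-record-key column can now prune
# file groups instead of scanning the whole table.
spark.sql("SELECT * FROM hudi_orders WHERE city = 'Vancouver'").show()
```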

    Glue

    Glue 5.0 

A newer version is out with faster runtimes, so be sure to use it in your projects. 

    https://aws.amazon.com/about-aws/whats-new/2024/12/aws-glue-5-0

    New Glue Connectors 

There are now 16 new Glue connectors: Adobe Analytics, Asana, Datadog, Facebook Page Insights, Freshdesk, Freshsales, Google Search Console, LinkedIn, Mixpanel, PayPal Checkout, QuickBooks, SendGrid, Smartsheet, Twilio, WooCommerce, and Zoom Meetings. 

    https://aws.amazon.com/about-aws/whats-new/2024/12/aws-glue-16-native-connectors-applications

Glue Compaction with MoR Tables 

    https://aws.amazon.com/blogs/big-data/accelerate-queries-on-apache-iceberg-tables-through-aws-glue-auto-compaction

Monitor and Manage Data Quality with Glue (YouTube video) 


    State of Data Engineering Q3 2024

Here is this quarter’s state of data engineering newsletter. There is only a little chat about AI this time; the focus is on Open Table Formats, the Apache Iceberg REST spec, and new updates in the Amazon data engineering ecosystem.

    Prompt Engineering – Meta Analysis Whitepaper


    One of my favorite AI podcasts, Latent Space, recently featured Sander Schulhoff, one of the authors of a comprehensive research paper on prompt engineering. This meta-study reviews over 1,600 published papers, with co-authors from OpenAI, Microsoft, and Stanford. 
     
    [podcast] 
    https://www.latent.space/p/learn-prompting 

     
    [whitepaper] 
    https://arxiv.org/abs/2406.06608 

     
The whitepaper is an interesting academic deep dive into prompting: how to increase quality through exemplars (examples provided to the LLM) without providing too many (more than 20 hurts quality), plus stranger findings, such as using Minecraft agents as a testbed for understanding how this ecosystem works. 
     
Other practical tips are given as well: asking for output in JSON or XML is generally more accurate, and results are better when the requested format matches what appears in the LLM’s training data. That does lead to a problem if you don’t know what the training data is based on. 
     
It also covers a wide range of techniques beyond the two we use most often: chain of thought, which prompts the model to reason through intermediate steps, and retrieval augmented generation (RAG). 
     
We can expect that in the next couple of years LLMs themselves will integrate these workflows so we no longer have to think about them, but if you want to squeeze out better performance today, this paper is worth reading. 
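
As a small illustration of two of those tips, a handful of exemplars plus a request for structured output, here is a sketch using the OpenAI Python client; the model name and the classification task are placeholders of my own, not something taken from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A few exemplars (few-shot examples) showing the exact JSON shape we want back.
messages = [
    {"role": "system", "content": "Classify support tickets. Reply with JSON only."},
    {"role": "user", "content": "Ticket: 'My invoice is wrong'"},
    {"role": "assistant", "content": '{"category": "billing", "urgent": false}'},
    {"role": "user", "content": "Ticket: 'Production database is down!'"},
    {"role": "assistant", "content": '{"category": "outage", "urgent": true}'},
    # The real question comes last; the exemplars anchor the format and labels.
    {"role": "user", "content": "Ticket: 'I cannot reset my password'"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",                        # placeholder model name
    messages=messages,
    response_format={"type": "json_object"},    # nudge the model toward valid JSON
)
print(response.choices[0].message.content)
```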

    Open Table Format Wars – Continued

As a quick refresher, the history of data engineering goes roughly like this in 30 seconds:

1. 1980s – big data warehouses exist; SQL is the lingua franca
2. 2000s – the Apache Hadoop ecosystem emerges to address the limitations of data warehouses in coping with data size and processing
3. 2010s – data lakes emerge, where data lives in cloud storage (e.g., Amazon S3)
4. 2020ish – data lakehouses, or transactional data lakes, emerge to address data lakes’ lack of ACID guarantees
5. 2023 – consensus emerges around the term Open Table Format (OTF), with three contenders:
  • Apache Hudi
  • Databricks Delta Lake
  • Apache Iceberg
6. Mid-2024:
  • June 3, 2024 – Snowflake announces the Polaris catalog with support for Apache Iceberg
  • June 4, 2024 – Databricks buys Tabular (thereby bringing in the founders of Apache Iceberg)

    Historically, we see a major shift in technology about every 20 years, with older systems being overhauled to meet new paradigms. Consider the companies that fully embraced Apache Hadoop in the 2000s—they’re now in the process of rebuilding their systems. Right now we are in the middle of the maturing of open table formats.

Data has always been challenging to deal with because it is inherently messy, and moving data from one system to another seems simple but is quite a bit of work; as we know, most ETL is rarely straightforward once you take into account SLAs, schema changes, data volumes, etc.

OTFs really matter when we deal with big data, and especially extremely large data (think Uber- or Netflix-sized data). Databases can usually handle the smaller end of that spectrum without a problem, but break down at the largest scales.

    When working with your data platform, these are key questions you should be asking to help in refining your technology stack.

• How much data is being processed (hundreds of gigabytes, terabytes, or petabytes)?
• What SLA does the data need to be queried under?
• What is the existing data footprint in your organization (are you using a lot of MySQL, Microsoft SQL Server, etc.)?
• Does the organization have the capability to own the engineering effort of an OTF platform?
• Do any of the customer’s data sources work with Zero-ETL (like Salesforce, Aurora MySQL/PostgreSQL, or RDS)?
• Is the customer already using Databricks, Hudi, Snowflake, Iceberg, Redshift, or BigQuery?

    The Future: Interoperability via the Apache Iceberg Catalog API

Apache Iceberg, which emerged from Netflix, has been making a lot of news lately. From the acquisition of Tabular (essentially the team that founded Iceberg), to Snowflake open sourcing the Polaris catalog, to Databricks support in private preview, many signs point to a more cross-compatible future if certain conditions are met.

    In this article

    https://www.snowflake.com/en/blog/introducing-polaris-catalog/

There is a pretty important diagram showing cross-compatibility across AWS, Azure, and Google Cloud. We aren’t there yet, but if all three vendors move toward implementing the Apache Iceberg REST Catalog API spec, cross-federated querying will become possible.

I’m hopeful, because ETL’ing data from one place to another has always been a huge hassle. This type of future opens up interesting workloads where compute can be separated even from your cloud.

Everything is a little strange to me, because moving toward this future isn’t really a technology problem but more of a political one: each cloud has to choose to move in that direction. We are getting signs, but I would say by this time next year we will know the intentions of all the players. Meanwhile, stay tuned.


    New emerging technology: DuckDB

DuckDB was created in 2018 and is a fast, in-process analytical database. There is a hosted, serverless version called MotherDuck. DuckDB takes a different approach in that you can run analysis on a large dataset either via the CLI or your favorite programming language, with the compute running close to your application itself.


    Article: Running Iceberg and Serverless DuckDB in AWS

    https://www.definite.app/blog/cloud-iceberg-duckdb-aws

The article shows DuckDB querying Iceberg tables stored in S3, and as an alternative describes deploying DuckDB in a serverless environment on ECS with custom containers invoked via HTTP requests.

I expect AWS to take more notice and integrate DuckDB into its ecosystem in the next couple of years. 

    ChatGPT even has a DuckDB analyst ready

    https://chatgpt.com/g/g-xRmMntE3W-duckdb-data-analyst

    Use Cases:

• Say you have a lot of log data on EC2. Typically, you would load it into S3 and query it via Athena. Instead, you could keep the data on EC2 and run a DuckDB instance there, where you can explore it without penalty (a minimal sketch follows this list).
• Preprocessing and pre-cleaning of user-generated data for machine learning training
• Any type of system that previously used SQLite
• Exploration of any dataset that fits on your laptop – this one is a no-brainer.
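
For the log-exploration use case above, a minimal sketch; the file paths and column names are placeholders.

```python
import duckdb

# Query raw log files in place -- no cluster, no load step.
# File paths are placeholders; DuckDB reads CSV/JSON/Parquet directly.
con = duckdb.connect()   # in-memory database

top_errors = con.sql("""
    SELECT status, COUNT(*) AS hits
    FROM read_csv_auto('/var/log/app/access-*.csv')
    WHERE status >= 500
    GROUP BY status
    ORDER BY hits DESC
""").df()                 # hand the result straight to pandas

print(top_errors)
```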

Iceberg Updates



    [Article]: The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables


TLDR: The AWS Glue Data Catalog can now automatically expire Iceberg snapshots and remove orphan files
    https://aws.amazon.com/blogs/big-data/the-aws-glue-data-catalog-now-supports-storage-optimization-of-apache-iceberg-tables/

    Amazon previously tackled the issue of small file accumulation in Apache Iceberg tables by introducing automatic compaction. This feature consolidated small files into larger ones, improving query performance and reducing metadata overhead, ultimately optimizing storage and enhancing analytics workloads.

    Building on this, Amazon has now released a new feature in AWS Glue Data Catalog that automatically deletes expired snapshots and removes orphan files. This addition helps control storage costs and maintain compliance with data retention policies by cleaning up unnecessary data, offering a more complete solution for managing Iceberg tables efficiently.

    [Feature]: Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog

TLDR: If you want faster SLAs on Iceberg tables, run the table statistics feature for a potential 24% to 83% improvement in query time

    https://aws.amazon.com/blogs/big-data/accelerate-query-performance-with-apache-iceberg-statistics-on-the-aws-glue-data-catalog

    Column-level statistics are a method for enhancing the query performance of Iceberg tables in Amazon Redshift Spectrum and Athena. These statistics are based on the Puffin file format and allow query engines to optimize SQL operations more effectively. You can enable this feature via the AWS Console or by running an AWS Glue job. According to the performance data in the blog post, improvements range from 24% to 83%, simply by running a job to store metadata.

    Summary:

    • Use this if you have large datasets and need consistent query performance. Small datasets may not benefit enough to justify the effort.
    • Be aware of the overhead involved in running and maintaining statistics jobs.
    • Since data will likely change over time, you should set up automated jobs to periodically regenerate the statistics to maintain performance gains. While manual effort is required now, this feature could be more integrated into the platform in the future.

    [Article]: Petabyte-Scale Row-Level Operations in Data Lakehouses
    Authors: Apache Foundation, Apple Employees, Founder of Apache Iceberg

    TLDR: If you need to do petabyte scale row level changes, read this paper.
    https://www.vldb.org/pvldb/vol17/p4159-okolnychyi.pdf

We have rarely run into the scale that requires petabyte-level row changes, but the paper details strategies built on the following techniques (a small configuration sketch follows the list):


        
• Eager Materialization – Rewrites entire data files when rows are modified; suitable for bulk updates. (Hudi equivalent: Copy-on-Write (COW); Databricks equivalent: Data File Replacement)
• Lazy Materialization – Captures changes in delete files, applying them at read time; more efficient for sparse updates. (Hudi equivalent: Merge-on-Read (MOR); Databricks equivalent: Delete Vectors)
• Position Deletes – Tracks rows for deletion based on their position within data files. (Databricks equivalent: Delete Vectors)
• Equality Deletes – Deletes rows based on specific column values, e.g., row ID or timestamp. (Databricks equivalent: Delete Vectors)
• Storage-Partitioned Joins – Eliminates shuffle costs by ensuring data is pre-partitioned based on join keys. (Databricks equivalent: Low Shuffle MERGE)
• Runtime Filtering – Dynamically filters out unnecessary data during query execution to improve performance. (Databricks equivalent: Runtime Optimized Filtering)
• Executor Cache – Caches delete files in Spark executors to avoid redundant reads and improve performance.
• Adaptive Writes – Dynamically adjusts file sizes and data distribution at runtime to optimize storage and prevent skew.
• Minor Compaction – Merges delete files without rewriting the base data to maintain read performance. (Hudi equivalent: Compaction in MOR)
• Hybrid Materialization – Combines both eager and lazy materialization strategies to optimize different types of updates.
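
In Iceberg terms, the eager-versus-lazy choice above is largely a table-property decision. A rough Spark SQL sketch follows; the table name is a placeholder and it assumes an Iceberg-enabled SparkSession as configured in the earlier sketch.

```python
# Lazy materialization: deletes/updates are written as delete files and applied
# at read time (merge-on-read), which suits sparse row-level changes.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")

# Eager materialization: rewrite whole data files on change (copy-on-write),
# which is usually better for bulk updates and read-heavy tables.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'copy-on-write',
        'write.update.mode' = 'copy-on-write',
        'write.merge.mode'  = 'copy-on-write'
    )
""")
```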

    The paper also half reads as a marketing paper for Iceberg, but the interesting aspect is that half of the authors are from Apple.  One of the authors of that paper also made this video on how Apache Iceberg is used at Apple.

    Video:
    https://www.youtube.com/watch?v=PKrkB8NGwdY

    [Article]: Faster EMR 7.1 workloads for Iceberg

TLDR: EMR 7.1 runs faster on its customized Spark runtime on EC2

    https://aws.amazon.com/blogs/big-data/amazon-emr-7-1-runtime-for-apache-spark-and-iceberg-can-run-spark-workloads-2-7-times-faster-than-apache-spark-3-5-1-and-iceberg-1-5-2

    This article essentially serves as marketing for Amazon EMR, but it also demonstrates the product team’s commitment to enhancing performance with Apache Iceberg. It’s a slightly curious comparison, as most users on AWS would likely already be using EMR rather than managing open-source Spark on EC2. Nevertheless, the article emphasizes that EMR’s custom Spark runtime optimizations are significantly faster than running open-source Spark (OSS) on EC2.

    1. Optimizations for DataSource V2 (dynamic filtering, partial hash aggregates).
    2. Iceberg-specific enhancements (data prefetching, file size-based estimation).
    3. Better query planning and optimized physical operators for faster execution.
    4. Integration with Amazon S3 for reduced I/O and data scanning.
    5. Java runtime improvements for better memory and garbage collection management.
    6. Optimized joins and aggregations, reducing shuffle and join overhead.
    7. Increased parallelism and efficient task scheduling for better cluster utilization.
    8. Improved resource management and autoscaling for cost and performance optimization.

    [Article]: Using Amazon Data Firehose to populate Iceberg Tables
    TLDR: Use this technique if you might need Iceberg tables from the raw zone for streaming data and you need ACID guarantees

     https://www.tind.au/blog/firehose-iceberg/

Recently, a sharp-eyed developer spotted an exciting new feature in a GitHub changelog: Amazon Data Firehose now has the ability to write directly to Iceberg tables. This feature could be hugely beneficial for anyone working with streaming data who needs ACID guarantees in their data lake architecture.

    Warning: This feature isn’t production-ready yet, but it’s promising enough that we should dive into how it works and how it simplifies the data pipeline.

An Interesting Future: Example of Iceberg being queried from Snowflake and Databricks

Randy Pitcher from Databricks shows an example of how an Iceberg table created in Databricks is queried from Snowflake. As mentioned earlier, the chatter is that not all vendors have fully implemented the Catalog API spec (yet), but once this matures around 2026, expect querying data across clouds to be possible.

    https://www.linkedin.com/posts/randypitcherii_snowflake-is-killing-it-with-their-iceberg-ugcPost-7239751397779419136-z1ue

    Redshift Updates

    Major updates for Zero ETL

    All Other AWS Updates:

    Other