The Titanic dataset has a twist

2025-08-18

The Titanic dataset can be seen as the "hello world" of data science. The goal is to predict who survived the Titanic shipwreck based on a few known passenger properties. The Kaggle competition for it has over 1.3M entrants, and as a result the dataset shows up in just about every introductory course out there.

And yet, it has a twist that no book talks about.

Model

Typically, when a tutorial builds a model for the Titanic dataset, you'll see features like the number of siblings, the age, and the passenger class go into a model. The tutorial might try a few models and (if the authors bothered to do cross-validation the right way and included a dummy model) you'd get a table that looks like this:

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Histogram Gradient Boosting | 0.815 ± 0.022 | 0.797 ± 0.039 | 0.734 ± 0.049 | 0.763 ± 0.029 |
| Logistic Regression | 0.803 ± 0.018 | 0.783 ± 0.043 | 0.717 ± 0.059 | 0.746 ± 0.026 |
| K-Nearest Neighbors | 0.801 ± 0.024 | 0.776 ± 0.055 | 0.728 ± 0.065 | 0.748 ± 0.031 |
| Dummy (Most Frequent) | 0.594 ± 0.001 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 |
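
For reference, here's a minimal sketch of the kind of cross-validated comparison that could produce a table like this. The column names (Age, Fare, SibSp, Parch, Pclass, Sex, Embarked) assume the standard Kaggle train.csv, and the preprocessing choices are mine, not necessarily what produced the exact numbers above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("train.csv")  # the standard Kaggle Titanic file (assumed path)
X, y = df.drop(columns=["Survived"]), df["Survived"]

# Impute and scale the numeric columns, impute and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(), StandardScaler()),
     ["Age", "Fare", "SibSp", "Parch"]),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     ["Pclass", "Sex", "Embarked"]),
], sparse_threshold=0)  # force dense output so the boosting model accepts it

models = {
    "Histogram Gradient Boosting": HistGradientBoostingClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Dummy (Most Frequent)": DummyClassifier(strategy="most_frequent"),
}

for name, model in models.items():
    scores = cross_validate(
        make_pipeline(preprocess, model), X, y, cv=5,
        scoring=["accuracy", "precision", "recall", "f1"],
    )
    print(name, {k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```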

You might look at this table and conclude that the boosted model performs best, albeit not by a wide margin over the other models.

But let's now consider a fun twist.

That name

Typically the models focus on features like age, passenger class, gender, and the amount that was paid for the ticket. But the really interesting thing is that you can get a model that's pretty much in the same ballpark as far as accuracy goes by zooming in on one and only one column.

Can you guess what column it is?

There is one feature that people forget about ...

It's the name!

If you make a pipeline with a CountVectorizer that one-hot encodes every single token in the name column, then you get a model in the same ballpark. You could also repeat the same exercise with an embedding model from sentence-transformers.
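
Here's a minimal sketch of that name-only pipeline, again assuming the standard Kaggle train.csv with its Name and Survived columns (the binary=True flag and the hyperparameters are my own choices):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

df = pd.read_csv("train.csv")  # assumed path

# "Braund, Mr. Owen Harris" becomes the tokens braund / mr / owen / harris,
# each turned into a 0/1 feature by the binary CountVectorizer.
name_model = make_pipeline(
    CountVectorizer(binary=True),
    LogisticRegression(max_iter=1000),
)

scores = cross_validate(
    name_model, df["Name"], df["Survived"], cv=5,
    scoring=["accuracy", "precision", "recall", "f1"],
)
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```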

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Histogram Gradient Boosting | 0.815 ± 0.022 | 0.797 ± 0.039 | 0.734 ± 0.049 | 0.763 ± 0.029 |
| Logistic Regression | 0.803 ± 0.018 | 0.783 ± 0.043 | 0.717 ± 0.059 | 0.746 ± 0.026 |
| K-Nearest Neighbors | 0.801 ± 0.024 | 0.776 ± 0.055 | 0.728 ± 0.065 | 0.748 ± 0.031 |
| Names (CountVectorizer + LogReg) | 0.796 ± 0.023 | 0.762 ± 0.055 | 0.734 ± 0.072 | 0.744 ± 0.033 |
| Names (Embeddings + LogReg) | 0.779 ± 0.024 | 0.773 ± 0.057 | 0.655 ± 0.077 | 0.704 ± 0.040 |
| Dummy (Most Frequent) | 0.594 ± 0.001 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 |
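
For the embeddings row, a rough sketch of what that could look like, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (my pick for illustration, not necessarily what produced the numbers above):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

df = pd.read_csv("train.csv")  # assumed path

# Encode every name into a dense vector; the encoder is pretrained and never
# sees the labels, so encoding outside the cross-validation loop is fine.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
X = encoder.encode(df["Name"].tolist())

scores = cross_validate(
    LogisticRegression(max_iter=1000), X, df["Survived"], cv=5,
    scoring=["accuracy", "precision", "recall", "f1"],
)
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})
```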

The histogram boosting model still performs better, but the fact that the name model even gets close ... that is friggin' interesting! It only uses one single feature! It turns out the name carries ample information: it contains a title for each person, which gives away the gender, and maybe the last name is a proxy for wealth.

I never really understood why books skip this super interesting fact. What if the CountVectorizer approach were more accurate? You could easily simulate such a situation by removing a few columns and pretending they were never available. Would that make the model better? Or would we prefer a model that goes beyond the name?

These kinds of questions seem way(!) more interesting for a classroom activity than merely looking at accuracy numbers. People need to understand that you can get creative, which means that you should sometimes try something counter-intuitive.

Final thoughts

Honestly, it's kind of bonkers that this dataset became so popular as a tutorial. There's no practical utility for the dataset whatsoever and it will never be applied to anything. Given the ample business cases for these kinds of models, it makes a lot more sense to focus on those datasets.

That said, there is one final comment worth making with regard to vibe coding and these kinds of conclusions. I imagine that there are many other datasets with a fun twist, and my hope is that vibe coding can actually help us document them. LLMs are good enough to help you write scikit-learn pipelines for these sorts of small blog posts, so I hope more folks will join me in sharing these data science anecdotes. It feels like they're missing from the internet.

If that sounds interesting, you might enjoy this vibe-coding exercise I did for marimo. We're working on better LLM integration, so we're exploring the space while also recording some vibe-coding exercises.