Overtype markdown

2025-08-21

Overtype is a neat markdown editor that's as easy as jQuery when it comes to "just adding it to a site".

The "magic trick" is that this is really just clever CSS on top of a textarea! You just define a <div class="editor> somewhere and then a new Overtype(".editor") will take care of the rest from JS. It also means that you can do stuff like:

// Overtype sits on top of a plain textarea inside the editor element, so you can listen to it directly
let elem = document.querySelector(".editor textarea")
elem.addEventListener("input", e => console.log(elem.value))

Everything that is typed into the editor will now also appear in the console. You can check it for yourself!

The Titanic dataset has a twist

2025-08-18

The Titanic dataset can be seen as the "hello world" of data science. The goal of the dataset is to predict who survived the Titanic shipwreck based on a few known passenger properties. The Kaggle competition for it has over 1.3M entrants, and as a result the dataset shows up in just about every introductory course out there.

And yet, it has a twist that no book talks about.

Model

Typically, when tutorials build a model for the Titanic dataset, you'll see features like the number of siblings, the age, and the passenger class go into a model. A tutorial might try a few models and (if it bothered to do cross-validation the right way and added a Dummy model) you'd get a table that looks like this:

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Histogram Gradient Boosting | 0.815 ± 0.022 | 0.797 ± 0.039 | 0.734 ± 0.049 | 0.763 ± 0.029 |
| Logistic Regression | 0.803 ± 0.018 | 0.783 ± 0.043 | 0.717 ± 0.059 | 0.746 ± 0.026 |
| K-Nearest Neighbors | 0.801 ± 0.024 | 0.776 ± 0.055 | 0.728 ± 0.065 | 0.748 ± 0.031 |
| Dummy (Most Frequent) | 0.594 ± 0.001 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 |

You might look at this table and conclude that the boosted model performs best, albeit without a lot of margin over the other models.
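
For reference, the evaluation behind a table like this is little more than cross-validating a handful of pipelines. Below is a minimal sketch of what that could look like; it assumes the standard Kaggle train.csv column names and reasonable default settings, not the exact code behind the numbers above.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumes the standard Kaggle train.csv file with its usual column names
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["Survived"]), df["Survived"]

# Impute and scale the numeric columns, one-hot encode the categorical ones;
# sparse_threshold=0 forces a dense matrix so every model below can consume it
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(), StandardScaler()),
     ["Age", "Fare", "SibSp", "Parch"]),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     ["Pclass", "Sex", "Embarked"]),
], sparse_threshold=0)

models = {
    "Histogram Gradient Boosting": make_pipeline(preprocess, HistGradientBoostingClassifier()),
    "Logistic Regression": make_pipeline(preprocess, LogisticRegression(max_iter=1000)),
    "K-Nearest Neighbors": make_pipeline(preprocess, KNeighborsClassifier()),
    "Dummy (Most Frequent)": make_pipeline(preprocess, DummyClassifier(strategy="most_frequent")),
}

# Report mean ± standard deviation over the cross-validation folds
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=10, scoring=["accuracy", "precision", "recall", "f1"])
    print(name, {k: f"{v.mean():.3f} ± {v.std():.3f}" for k, v in scores.items() if k.startswith("test_")})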

But let's now consider a fun twist.

That name

Typically the models would focus on features like age, passenger class, gender, and the amount that was paid for the ticket. But the really interesting thing is that you can get a model that's pretty much in the same ballpark as far as accuracy goes by zooming in on one and only one column.

Can you guess what column it is?

There is one feature that people forget about ...

It's the name!

If you make a pipeline with a CountVectorizer that one-hot encodes every single token in the name column, then you get a model in the same ballpark. You could also repeat the same exercise with an embedding model from sentence-transformers.
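
Concretely, that names-only model is only a handful of lines of scikit-learn. A minimal sketch, again assuming the Kaggle train.csv column names (the exact vectorizer settings behind the table below are my assumption):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

df = pd.read_csv("train.csv")

# The only input is the raw "Name" column, tokenized into a bag of words
name_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

scores = cross_validate(name_model, df["Name"], df["Survived"], cv=10,
                        scoring=["accuracy", "precision", "recall", "f1"])
print({k: f"{v.mean():.3f} ± {v.std():.3f}" for k, v in scores.items() if k.startswith("test_")})

Adding both name-only variants to the earlier comparison gives a table like this: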

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Histogram Gradient Boosting | 0.815 ± 0.022 | 0.797 ± 0.039 | 0.734 ± 0.049 | 0.763 ± 0.029 |
| Logistic Regression | 0.803 ± 0.018 | 0.783 ± 0.043 | 0.717 ± 0.059 | 0.746 ± 0.026 |
| K-Nearest Neighbors | 0.801 ± 0.024 | 0.776 ± 0.055 | 0.728 ± 0.065 | 0.748 ± 0.031 |
| Names (CountVectorizer + LogReg) | 0.796 ± 0.023 | 0.762 ± 0.055 | 0.734 ± 0.072 | 0.744 ± 0.033 |
| Names (Embeddings + LogReg) | 0.779 ± 0.024 | 0.773 ± 0.057 | 0.655 ± 0.077 | 0.704 ± 0.040 |
| Dummy (Most Frequent) | 0.594 ± 0.001 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 |

The histogram boosting model still performs better, but the fact that the name model even gets close ... that is friggin' interesting! It only uses one single feature! Turns out the name carries ample information: it contains a title for the person as well as the gender, and maybe the last name is a proxy for wealth.

I never really understood why books skip this super interesting fact. What if the CountVectorizer approach were more accurate? You could easily simulate such a situation by removing a few columns and pretending that those were not available. Would that make the name model the better choice? Or would we still prefer a model that goes beyond the name?

These kinds of questions seem way(!) more interesting for a classroom activity than merely looking at accuracy numbers. People need to understand that you can get creative, which means that you should sometimes try to do something that is counter-intuitive.

Final thoughts

Honestly, it's kind of bonkers that this dataset became so popular as a tutorial. There's no practical utility for the dataset whatsoever and it will never be applied to anything. Given the ample business cases for these kinds of models, it makes a lot more sense to focus on those datasets.

That said, there is one final comment worth making with regards to vibe coding and these kinds of conclusions. I imagine that there are many other datasets with a fun twist, and my hope is that vibe coding can actually help us document them. LLMs are good enough to help you write scikit-learn pipelines for these sorts of small blog posts, so I hope more folks will join me in sharing these data science anecdotes. It feels like they're missing from the internet.

If that sounds interesting, you might enjoy this vibe-coding exercise I did for marimo. We're working on better LLM integration, so we're investigating the space while also recording some vibe-coding exercises.

The Sock Drawer Paradox

2025-07-29

I prompted Claude for a math puzzle and was surprised to see one that I did not know about. Figured it might be worth sharing here.

The Sock Drawer Paradox

You have a drawer with some red socks and some blue socks. You know that if you pull out 2 socks randomly, the probability that they're both red is $\frac{1}{2}$.

Given this information, what's the probability that the first sock you pull is red?

The Surprising Answer

Most people think "well, if P(both red) = 1/2, then P(first red) should be pinned down too", but this problem has multiple solutions!

Let $r$ = number of red socks and $b$ = number of blue socks.

The probability of drawing two red socks is: $$P(\text{both red}) = \frac{r}{r+b} \times \frac{r-1}{r+b-1} = \frac{1}{2}$$

Cross-multiplying gives $2r(r-1) = (r+b)(r+b-1)$, which simplifies to $r^{2} - (2b+1)r + (b - b^{2}) = 0$. The quadratic formula then gives: $$r = b + \frac{1}{2} \pm \frac{1}{2}\sqrt{8 b^{2} + 1}$$

This can be verified with sympy.

from sympy import symbols, Eq, solve

# Let r be the number of red socks and b be the number of blue socks
r, b = symbols('r b')

# The probability of pulling 2 red socks
probability_both_red = (r / (r + b)) * ((r - 1) / (r + b - 1))

# Given that this probability is 1/2
equation = Eq(probability_both_red, 1/2)

# Solve for r in terms of b
solution = solve(equation, r)

Integers

Not every value of $b$ leads to an integer solution for $r$ though, but you can find the ones that do via:

from sympy import lambdify

# Turn the second symbolic solution for r into a plain numeric function of b
r_given_b = lambdify("b", solution[1])

# Keep the values of b (skipping the degenerate b=0 drawer) for which r comes out as a whole number
[{"b": _, "r": int(r_given_b(_))}
 for _ in range(1, 1_000_000)
 if (r_given_b(_) * 1000 % 1000) == 0]

There are 8 solutions in the first one million values for $b$.

| $b$ | $r$ | $p(r)$ | $p(rr)$ |
| --- | --- | --- | --- |
| 1 | 3 | 0.75 | 0.5 |
| 6 | 15 | 0.714286 | 0.5 |
| 35 | 85 | 0.708333 | 0.5 |
| 204 | 493 | 0.707317 | 0.5 |
| 1189 | 2871 | 0.707143 | 0.5 |
| 6930 | 16731 | 0.707113 | 0.5 |
| 40391 | 97513 | 0.707108 | 0.5 |
| 235416 | 568345 | 0.707107 | 0.5 |

What's interesting here is that the probability of grabbing a red sock converges (to $1/\sqrt{2} \approx 0.7071$) as $b$ grows, but it is not constant. There is still a degree of freedom when you only know the probability of grabbing two red socks!
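
As a quick sanity check (a small script of my own, not part of the original puzzle), you can plug a few rows of the table back into the probabilities and watch the first-sock probability approach $1/\sqrt{2}$:

from math import sqrt

# Plug a few (b, r) pairs from the table back into the probabilities
for b, r in [(1, 3), (6, 15), (235416, 568345)]:
    p_first_red = r / (r + b)
    p_both_red = p_first_red * (r - 1) / (r + b - 1)
    print(b, r, round(p_first_red, 6), round(p_both_red, 6))

# The first-sock probability tends to 1 / sqrt(2) as b grows
print(1 / sqrt(2))  # roughly 0.707107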