Blog of a data person

Revisiting my arxiv frontpage

2025-04-07

A bit over a year ago I built myself my very own arxiv-frontpage. The work was part of my PyData Amsterdam 2023 keynote, which you can watch here. The project is a GitHub scraper that pulls in new arxiv articles daily and tries to detect topics that I might be interested in. The results are then published on GitHub Pages.

The topic that I was mostly interested in was "articles about new datasets". It takes so much effort to collect a high-quality dataset that a researcher typically has a very good reason to do so. Often it's about a niche application, but sometimes it's also just because the authors are very dedicated to something. In short: these can be fun papers to read!

This led me to build some classifiers to detect topics. A topic could be "a novel dataset" but I also had some about active learning and other things that interest me. It's all pretty lightweight to run because a scikit-learn pipeline with a sentence-transformer did the trick. The classifier works at the sentence level, which also comes with the benefit of being able to visualise the belief of the system.

[Screenshot: visualisation of the classifier's sentence-level predictions]
It looks pretty and helps me debug
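
To make the sentence-level idea concrete, here is a minimal sketch of how per-sentence scores could be turned into a per-abstract score while keeping the per-sentence values around for a visualisation. It is not the code from the project: `split_sentences`, `score_abstract` and `keyword_scorer` are hypothetical names, the keyword scorer is only a stand-in for the sentence-transformer plus scikit-learn pipeline, and taking the maximum over sentences is an assumption about how the scores get combined.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(abstract: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [p for p in parts if p]


def score_abstract(
    abstract: str, score_sentence: Callable[[str], float]
) -> Tuple[float, List[Tuple[str, float]]]:
    # Score every sentence on its own, then use the highest score for the
    # abstract (assumption: one strong sentence is enough to flag a paper).
    scored = [(s, score_sentence(s)) for s in split_sentences(abstract)]
    best = max((p for _, p in scored), default=0.0)
    return best, scored


def keyword_scorer(sentence: str) -> float:
    # Placeholder for a trained classifier. In a real setup this would be
    # something like the positive-class probability of a fitted model.
    cues = ("new dataset", "novel dataset", "we release", "we collect")
    return 0.9 if any(cue in sentence.lower() for cue in cues) else 0.1


abstract = (
    "Active learning is hard. "
    "We release a novel dataset of bird songs. "
    "Results are promising."
)
best, per_sentence = score_abstract(abstract, keyword_scorer)
for sentence, p in per_sentence:
    # These per-sentence scores are what a highlight view could be drawn from.
    print(f"{p:.2f}  {sentence}")
```

Because the scorer is just a callable, swapping the keyword stand-in for a real model would not change the rest of the sketch.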

It did not last

I was really happy with the setup and it also led me to find some interesting papers. After a month or two though ... the whole setup crumbled.

The reason? Most of the articles about new datasets started to be synthetic datasets generated with LLMs. On paper, these texts fit the topic that I defined, but they aren't what I am interested in. Real life data collection usually leads to an interesting read, but a synthetic approach simply doesn't do it for me.

After two months I contemplated adding a "synthetic" classifier so that I could build a rule-based system to help me filter, but by that time the number of articles about real life data collection had halved. Since then, it has felt like they kept evaporating.
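
For illustration, a rule like that could be as small as the sketch below. It assumes every article already carries a score from the "new dataset" classifier plus a score from a hypothetical "synthetic" classifier. The `ScoredArticle` fields, the 0.5 thresholds and the example articles are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ScoredArticle:
    title: str
    dataset_score: float    # belief that the article introduces a new dataset
    synthetic_score: float  # belief that the dataset is LLM-generated


def keep_for_frontpage(
    article: ScoredArticle,
    dataset_min: float = 0.5,
    synthetic_max: float = 0.5,
) -> bool:
    # Rule: it has to look like a dataset paper AND not look synthetic.
    return (
        article.dataset_score >= dataset_min
        and article.synthetic_score < synthetic_max
    )


# Invented examples, only here to exercise the rule.
articles = [
    ScoredArticle("Hand-collected dataset", 0.92, 0.08),
    ScoredArticle("LLM-generated dataset", 0.88, 0.95),
    ScoredArticle("Not a dataset paper", 0.12, 0.05),
]
kept = [a.title for a in articles if keep_for_frontpage(a)]
print(kept)
```

A rule like this only filters what the classifiers find, so it does nothing about the shrinking number of real data collection papers.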

Luckily, some of the other topics fared much better. I also have a "developer research" topic that tries to detect articles about how developers could become more or less productive, and there are a bunch of fun titles that pop up there. Just to pick two at random today:

I'm still hoping to see some fun dataset articles in the future, but given the current phase of AI I can also see it taking a while. In the meantime I can also recommend this approach to anyone. Building your own frontpage of the internet is a very rewarding exercise!