SciNCL

I stumbled upon a huggingface project that refers to this paper. The main result is a sentence-transformer embedding model that seems to have some interesting properties.

When embedding models are trained these days, they usually take a BERT encoder under the hood and finetune some layers on top of it for a similarity task. My impression is that you'd either use specific datasets with specific annotation tasks for this, or you'd take sentences from the same paragraph and contrast them with sentences that are further away. This paper tries to do something different by zooming in on just academic papers.
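To make that generic recipe concrete, here is a minimal sketch of such a contrastive setup in sentence-transformers. The sentence pairs and the base model are placeholders of my own making, and this shows the generic approach, not what the SciNCL authors did.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Toy positive pairs: two sentences from the same paragraph count as similar
train_examples = [
    InputExample(texts=[
        "Transformers process tokens in parallel.",
        "This parallelism makes them fast to train.",
    ]),
    InputExample(texts=[
        "Tulips are planted in the autumn.",
        "The bulbs need a cold period before they flower.",
    ]),
]

# Plain BERT as the base encoder; mean pooling gets added automatically
model = SentenceTransformer("bert-base-uncased")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# With this loss, the other pairs in the batch act as the "further away" negatives
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)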

The thing with academic papers is that they also carry a citation graph under the hood. That graph is a strong source of similarity signal, and it seems to be the main signal that the authors trained their embedding model on: as I understand it, papers that sit close together in the citation graph become positives in a contrastive setup, while distant papers become negatives. The paper reports a bunch of metrics that indicate the merit of the approach, but my favourite part is the fact that the embedding model is freely available on huggingface.
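From what I can tell, SciNCL actually samples its neighbours via learned embeddings of the citation graph, but a much cruder sketch gets the idea across: treat linked papers as positives and unlinked ones as negatives. The graph and paper IDs below are made up.

import random
import networkx as nx

# Toy citation graph: an edge means one paper cites the other
graph = nx.Graph()
graph.add_edges_from([
    ("scincl", "bert"),
    ("scincl", "specter"),
    ("bert", "attention-is-all-you-need"),
])

def sample_triplet(g, anchor):
    """Pick a linked paper as positive and an unlinked one as negative."""
    positive = random.choice(list(g.neighbors(anchor)))
    candidates = [n for n in g.nodes if n != anchor and not g.has_edge(anchor, n)]
    negative = random.choice(candidates)
    return anchor, positive, negative

print(sample_triplet(graph, "scincl"))
# e.g. ('scincl', 'specter', 'attention-is-all-you-need')

Using the released model requires none of this bookkeeping though, it's a regular sentence-transformers call: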

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("malteos/scincl")

# Concatenate the title and abstract with the [SEP] token
papers = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]
# Inference
embeddings = model.encode(papers)

# Compute the (cosine) similarity between embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
print(similarity.item())
# => 0.8440517783164978

A small bonus: notice how this model doesn't require you to trust remote code from the model repository. A lot of trained models expect you to add a 'trust_remote_code' flag to your code, which has always felt unsafe to me.
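For comparison, models that ship their own modelling code make you opt in like this (the model name is a made-up placeholder), which runs Python pulled straight from the hub on your machine:

from transformers import AutoModel

# The flag allows arbitrary code from the model repository to execute locally
model = AutoModel.from_pretrained("some-org/custom-model", trust_remote_code=True)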

I may give this one a spin myself in my own experiments with the arXiv abstracts that I have locally. Might be fun to see how well these embeddings perform when it comes to retrieving interesting papers with embedding-based search.
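If I do, the retrieval loop would look roughly like this; the corpus below is a stand-in for my local abstracts and the query is arbitrary.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("malteos/scincl")

# Stand-in for a local collection of "title [SEP] abstract" strings
corpus = [
    "BERT [SEP] We introduce a new language representation model called BERT",
    "Attention is all you need [SEP] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode("pretrained language models", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits[0])  # ranked list of {"corpus_id": ..., "score": ...} dicts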