CRD3 stands for Critical Role Dungeons and Dragons Dataset. It turns out that there is a show called Critical Role. It's an unscripted, live-streamed show where a fixed group of people play Dungeons and Dragons. This dataset contains transcribed text from 159 episodes, consisting of 398,682 turns of dialogue.
It's an interesting dataset. To quote the abstract:
The dataset is linguistically unique in that the narratives are generated entirely through player collaboration and spoken interaction. For each dialogue, there are a large number of turns, multiple abstractive summaries with varying levels of detail, and semantic ties to the previous dialogues.
After learning about this I checked the references, and it turns out it's not even the first dialogue dataset from the realm of Dungeons and Dragons.
To quote the introduction of that earlier paper:
Imagine a giant, a dwarf, and a fairy in a combat situation. We would expect them to act differently, and conversely, if we are told of even a few actions taken by a character in a story, we naturally start to draw inferences about that character’s personality. Communicating narrative is a fundamental task of natural language, and understanding narrative requires modelling the interaction between events and characters.
These datasets got me thinking about use cases. The Hugging Face website lists the first dataset under text summarisation, but that isn't what came to mind when I first saw it.
After all, these are some of the few large datasets out there that are genuinely conversational. Wikipedia is a huge corpus, sure, but its text does not represent two people in a conversation. This data, on the other hand, might be about fantasy, but its style is very much informal and conversational. I wonder if it might make sense to train embeddings on it, just to see how much of the conversational aspect gets captured.
Then again ... we might just get something that overfits on elves, halflings and orcs.
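To make the embedding idea a bit more concrete, here is a minimal sketch of what "training embeddings on dialogue turns" could look like. It is a deliberately tiny stand-in: it uses plain co-occurrence counts instead of a real embedding model such as word2vec, and the turns are hypothetical lines written for illustration, not taken from CRD3. The function names are made up for this example too.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-written dialogue turns. These are NOT taken from CRD3.
turns = [
    "i roll to attack the orc",
    "you roll to attack the goblin",
    "the orc swings at you",
    "the goblin swings at you",
    "okay i guess we rest here",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(turns)

# Words used in the same conversational slots end up with similar vectors.
sim_orc_goblin = cosine(vectors["orc"], vectors["goblin"])
sim_orc_rest = cosine(vectors["orc"], vectors["rest"])
print(f"orc vs goblin: {sim_orc_goblin:.2f}")
print(f"orc vs rest:   {sim_orc_rest:.2f}")
```

In this toy corpus "orc" and "goblin" appear in identical contexts, so they come out as near-duplicates, while "orc" and "rest" share no context at all. That is the overfitting worry in miniature: the vectors reflect whatever the corpus talks about most.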