just saying it twice
This was an interesting, and delightfully short, read.
It seems that for a lot of tasks you can just put the input in the prompt twice and get better performance out of an LLM. Granted, this is shown for pretty old models, but still!
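To make the trick concrete, here is a minimal sketch of what "saying it twice" could look like when building a prompt. The function name `repeat_prompt` and the blank-line separator are my own assumptions for illustration, not necessarily the exact format used in the read mentioned above.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`.

    Hypothetical helper: the separator is an assumption, the source
    only says the input is repeated.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    return separator.join([prompt] * times)


if __name__ == "__main__":
    question = "Which of these numbers is prime: 21, 29, 33?"
    # The resulting string is what you would send to the model
    # instead of the single question.
    print(repeat_prompt(question))
```

The only cost is a longer input, so it is cheap to try on your own task and compare against the single-prompt baseline.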
There are a bunch of reasons why people are excited about evolutionary strategies, but an often overlooked one is that backprop is pretty darn heavy.
It's not just the time it takes, but also the memory! Especially when you use Adam. Suddenly you're not just storing the weights and their gradients, but also two extra values per weight: Adam's running averages of the gradient and of the squared gradient.
A quick and dirty benchmark can be found here.
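Separately from that benchmark, the memory argument can be sketched as back-of-the-envelope arithmetic. This assumes everything is stored as 32-bit floats and ignores activations, which backprop also has to keep around; the function name and the optimizer labels are made up for illustration.

```python
def training_memory_bytes(n_params: int, optimizer: str = "adam",
                          bytes_per_value: int = 4) -> dict:
    """Rough per-parameter memory for training, excluding activations.

    Assumes weights, gradients and optimizer state all use the same
    precision (fp32 by default). Hypothetical helper, not a profiler.
    """
    # Number of extra values the optimizer keeps per weight.
    state_values_per_param = {
        "none": 0,          # gradient-free, e.g. an evolutionary strategy
        "sgd": 0,           # plain SGD keeps no state
        "sgd_momentum": 1,  # one velocity per weight
        "adam": 2,          # first and second moment estimates
    }
    if optimizer not in state_values_per_param:
        raise ValueError(f"unknown optimizer: {optimizer}")

    weights = n_params * bytes_per_value
    # A gradient-free method never materialises gradients.
    grads = 0 if optimizer == "none" else n_params * bytes_per_value
    state = state_values_per_param[optimizer] * n_params * bytes_per_value
    return {"weights": weights, "grads": grads, "state": state,
            "total": weights + grads + state}


if __name__ == "__main__":
    for name in ("none", "sgd", "sgd_momentum", "adam"):
        total_gb = training_memory_bytes(1_000_000_000, name)["total"] / 1e9
        print(f"{name:>13}: {total_gb:.0f} GB")
```

Under these assumptions Adam needs about four times the memory of the weights alone. A real evolutionary strategy also has to hold perturbations or a population somewhere, so treat the "none" row as a lower bound rather than a measurement.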
"Sharding" is a popular database technique where you split a database across multiple "shards" to spread the load. The term sounds like it simply comes from the ordinary word "shard", a piece of a larger whole, but it turns out the actual origin is more related to fantasy lore.
The full details are explained in this interview from Ars Technica, where Richard Garriott explains how Ultima Online (or at least the version from the 90s) was created with an ecology bug. In the same interview he also mentions a bit of lore related to the game. As the story goes, when you defeated a final boss who carried "the gem of immortality", the gem broke into many different shards. This bit of lore was used to justify the existence of many different servers, each with their own copy of the game.
That's where the term seems to have come from. Some other sources online seem to corroborate it.
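As for the database technique itself, here is a minimal sketch of the idea, assuming the simplest possible scheme: hash the key and take it modulo the number of shards. The names `shard_for` and `ShardedStore` are hypothetical, and the in-memory dicts stand in for what would be separate database servers.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a hashlib digest is used to keep the mapping reproducible.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that spreads keys over several dicts."""

    def __init__(self, num_shards: int):
        if num_shards < 1:
            raise ValueError("num_shards must be at least 1")
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


if __name__ == "__main__":
    store = ShardedStore(num_shards=4)
    for i in range(1000):
        store.put(f"player-{i}", {"score": i})
    # Each shard ends up holding only a slice of the keys.
    print([len(shard) for shard in store.shards])
    print(store.get("player-42"))
```

Plain modulo hashing reshuffles most keys whenever the number of shards changes, which is why real systems often reach for something like consistent hashing instead. The sketch is only meant to show the "split to spread the load" part.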