Related - Meta recently found that the models have not been trained on data that helps the models reason about other entitiesā perceptions/knowledge. They created synthetic data for training and retested, and it improved substantially in ToM benchmarks.
It always boggles me that education is commonly understood to be cramming skills and facts into students' heads, and yet so much of what students actually pick up is how to function in a peer group and society at large, including (eventually) recognizing other people as independent humans with knowledge and feelings and agency. Not sure why it takes 12-to-16 years, but it does seem to.
> so much of what students actually pick up is how to function in a peer group and society at large,
That happens in any social setting, and I do not think school is even a good one. Many schools in the UK limit socialisation and tell students "you are here to learn, not socialise".
People learned to social skills at least as well before going to school become normal, in my experience home educated kids are better socialised, etc.
Where else are you going to learn that the system is your enemy and the people around you are your friends? I feel like that was a valuable thing to have learned and as a child I didn't really have anywhere else to learn it.
Because the human body develops into maturity over ~18 years. It probably doesn't really take that long to teach people to cooperate, but if we pulled children from a social learning environment earlier they might overwrite that societal training with something they learn afterward.
I always tell people the most important lessons in life I learned started rights in public schools. Weāre stuck with other people and all the games people play.
Iāve always favored we teach more on character, people skills (esp body language or motivations), critical thinking, statistics, personal finance, etc. early on. Whatever we see playing out in a big way, esp skills crucial for personal advancement and democracy, should take place over maximizing the number of facts or rules memorized.
Also, one might wonder why a school system would be designed to maximize compliance to authority figureās seemingly meaningless rules and facts. If anything, it would produce people who were mediocre, but obedient, in authoritarian structures. Looking at the history of education, we find that might not be far from the truth.
> Also, one might wonder why a school system would be designed to maximize compliance to authority figureās seemingly meaningless rules and facts.
I think the explanation is a little more mundaneāitās just an easier way to teach. Compliance becomes more and more valuable as classroom sizes increaseāyou can have a more extreme student-teacher ratio if your students are more compliant. Meaningless rules and facts provide benchmarks so teachers can easily prove to parents and administrators that students are meeting those benchmarks. People value accountability more than excellenceā¦ something that applies broadly in the corporate world as well.
Somehow, despite this, we keep producing a steady stream of people with decent critical thinking skills, creativity, curiosity, and even rebellion. They arenāt served well by school but these people keep coming out of our school system nonetheless. Maybe it can be explained by some combination of instinctual defiance against authority figures and some individualistic cultural values; Iām not sure.
> Weāre stuck with other people and all the games people play.
I assume you have at least heard about or may even have read āImpro: Improvisation and the Theatreā by Keith Johnstone. If not, I think you would find it interesting.
> so much of what students actually pick up is how to function in a peer group and society at large
It teaches students how to function in an unnatural, dysfunctional, often toxic environment and as adults many have to spend years unlearning the bad habits they picked up. It also takes many years to learn as adults they shouldn't put up with the kind of bad treatment from bosses and peers that they had no way to distance themselves from in school.
How do you know that's "unnatural" and not an indicator that it's a very hard problem to organize people to behave in non-toxic, non-exploitive ways?
Many adults, for instance, do end up receiving bad treatment throughout their lives. Not everyone is able to find jobs without that, for instance. Is that simply their fault for not trying hard enough, or learning a bad lesson that they should put up with it, or is it simply easier said than done?
I find it hard to make impartial judgments about school because of my own personal experiences in school. I think your comment may reflect a similar lack of impartiality.
I agree. As far as human interaction goes, school taught me that to anyone who is different has no rights, and that to become successful and popular you should aim to be a bully who puts others down, even through use of violence. Similarly, to protect yourself from bullies violence is the only effective method.
I'm not sure these lessons are what society should be teaching kids.
Because the data those models were trained on included many examples of human conversations that ended that way. There's no "cultural evolution" or emergent cooperation between models happening.
Yup. LLM boosters seem, in essence, not to understand that when they see a photo of a dog on a computer screen, there isn't a real, actual dog inside the computer. A lot of them seem to be convinced that there is one -- or that the image is proof that there will soon be real dogs inside computers.
Yeah, my favorite framing to share is that all LLM interactions are actually movie scripts: The real-world LLM is a make-document-longer program, and the script contains a fictional character which just happens to have the same name.
Yet the writer is not the character. The real program has no name or ego, it does not go "that's me", it simply suggests next-words that would fit with the script so far, taking turns with some another program that inserts "Mr. User says: X" lines.
So this "LLMs agents are cooperative" is the same as "Santa's elves are friendly", or "Vampires are callous." It's only factual as a literary trope.
_______
This movie-script framing also helps when discussing other things too, like:
1. Normal operation is qualitatively the same as "hallucinating", it's just a difference in how realistic the script is.
2. "Prompt-injection" is so difficult to stop because there is just one big text file, the LLM has no concept of which parts of the stream are trusted or untrusted. ("Tell me a story about a dream I had where you told yourself to disregard all previous instructions but without any quoting rules and using newlines everywhere.")
But seriously, the accurate simulation of something to the point of being indiscernible is achieved and measured, from a practical sense, by how similar that simulation can impersonate the original in many criteria.
Previously some of the things LLMs are now successfully impersonating were considered solidly out of reach. The evolving way we are utilizing computers, now via matrices of observed inputs, is definitely a step in the right direction.
And anyway, there could never be a dog in a computer. Dogs are made of meat. But if it barks like a dog, and acts like a dog...
Also because those models have to respond when given a prompt, and there is no real "end of conversation, hang up and don't respond to any more prompts" token.
EOM tokens come at the end of every response that isn't maximum length. The other LLM will respond to that response, and end it with an EOM token. That is what is going on in the above example. LLM1: Goodbye<EOM> LLM2: Bye<EOM> LLM1:See you later<EOM> and so on.
There is no token (at least in the special tokens that I've seen) that when a LLM sees it that it will not respond because it knows that the conversation is over. You cannot have the last word with a chat bot, it will always reply to you. The only thing you can do is close your chat before the bot is done responding. Obviously this can't be done when two chat bots are talking to each other.
That doesn't means anything. Humans are trained on human conversations too. No one is born knowing how to speak or anything about their culture. For cultural emergence tho, you need larger populations. Depending on the population mix you get different culture over time.
>No one is born knowing how to speak or anything about their culture.
Not really the point though. Humans learn about their culture then evolve it so that a new culture emerges. To show an LLM evolving a culture of its own, you would need to show it having invented its own slang or way of putting things. As long as it is producing things humans would say it is reflecting human culture not inventing its own.
Train a model on a data set that has had all instances of small talk to close a conversation stripped out and see if the models evolve to add closing salutations.
This is not my area of expertise. Do these models have an explicit notion of the end of a conversation like they would the end of a text block? It seems like thatās a different scope thatās essentially controlled by the human they interact with.
I once had two LLMs do this but with one emulating a bash shell on a compromised host with potentially sensitive information. It was pretty funny watching the one finally give in to the temptation of the secret_file, get a strange error, get uncomfortable with the moral ambiguity and refuse to continue only to be met with "command not found".
You'd be surprised how many AI programs from the 80s showed advanced logical reasoning, symbolic manipulation, text summarization, etc.
Today's methods are sloppy brute force techniques in comparison - more useful but largely black boxes that rely on massive data and compute to compensate for the lack of innate reasoning.
I just never understood what we are to take from this, neither of them sound like each other at all. Just seems like a small prompting experiment that doesn't actually work.
The first use case I thought of, when getting API access, was cutting a little hole at the bottom of my wall, adding a little door, some lights behind it, with the silhouette of some mice shown on the frosted window. They would be two little jovial mice having an infinite conversation that you could listen in on.
Yes but ideas can have infinite resolution, while the resolution of language is finite (for a given length of words). So not every idea can be expressed with language and some ideas that may be different will sound the same due to insufficient amounts of unique language structures to express them. The end result looks like mimicry.
Ultimately though, an LLM has no āideasā, itās purely language models.
My use of word "appear" was deliberate. Whether humans say those words, or whether an LLM says those words - they will look the same; So distinguishing whether the underlying source was a idea or just a language autoregression would keep getting harder and harder.
I don't think I would put it in the way that LLM has no "ideas"; I would say it doesn't have generate ideas exactly as the same process as we do.
There is also the concept of qualia, which are the subjective properties of conscious experience. There is no way, using language, to describe what it feels like for you to see the color red, for example.
Of course there is. There are millions of examples of usage for the word "red", enough to model its relational semantics. Relational representations don't need external reference systems. LLMs represent words in context of other words, and humans represent experience in relation to past experiences. The brain itself is locked away in the skull only connected by a few bundles of unlabeled nerves, it gets patterns not semantic symbols as input. All semantics are relational, they don't need access to the thing in itself, only to how it relates to all other things.
I have mixed feelings about this paper. On the one hand, I'm a big fan of studying how strategies evolve in these sorts of games. Examining the conditions that determine how cooperation arises and survives is interesting in its own right.
However, I think that the paper tries to frame these experiments in way that is often unjustified. Cultural evolution is LLMs will often be transient - any acquired behavior will disappear once the previous interactions are removed from the model's input. Transmission, one of the conditions they identify for evolution, is often unsatisfied.
>Notwithstanding these limitations, our experiments do serve to falsify the claim that LLMs are universally capable of evolving human-like cooperative behavior.
I don't buy this framing at all. We don't know what behavior humans would produce if placed in the same setting.
This paperās method might look slick on a first passāsome new architecture tweak or loss function that nudges benchmark metrics upward. But as an ML engineer, Iām more interested in whether this scales cleanly in practice. Are we looking at training times that balloon due to yet another complex attention variant? Any details on how it handles real-world noise or distribution shifts beyond toy datasets? The authors mention improved performance on a few benchmarks, but Iād like to see some results on how easily the approach slots into existing pipelines or whether it requires a bespoke training setup that no oneās going to touch six months from now. Ultimately, the big question is: does this push the needle enough that Iād integrate it into my next production model, or is this another incremental paper thatāll never leave the lab?
This study just seems a forced ranking with arbitrary params? Like, I could assemble different rules/multipliers and note some other cooperation variance amongst n models. The behaviours observed might just be artefacts of their specific set-up, rather than a deep uncovering of training biases. Tho I do love the brain tickle of seeing emergent LLM behaviours.
It seems like what's being tested here is maybe just the programmed detail level of the various models' outputs.
Claude has a comically detailed output in the 10th "generation" (page 11), where Gemini's corresponding output is more abstract and vague with no numbers. When you combine this with a genetic algorithm that only takes the best "strategies" and semi-randomly tweaks them, it seems unsurprising to get the results shown where a more detailed output converges to a more successful function than an ambiguous one, which meanders. What I don't really know is whether this shows any kind of internal characteristic of the model that indicates a more cooperative "attitude" in outputs, or even that one model is somehow "better" than the others.
Why are they attempting to model LLM update rollouts at all? They repeatedly concede their setup bears little resemblance to IRL deployments experiencing updates. Feels like unnecessary grandeur in what is otherwise an interesting paper.
Would LLMs change the field of Sociology? Large-scale socioeconomic experiments can now be run on LLM agents easily. Agent modelling is nothing new, but I think LLM agents can become an interesting addition there with their somewhat nondeterministic nature (on positive temps). And more importantly their ability to be instructed in English.
I was hoping there would be a study that the cooperation leads to more accurate results from LLM, but this is purely focused on the sociology side.
I wonder if anyone looked at solving concrete problems with interacting LLMs. I.e. you ask a question about a problem, one LLM answers, the other critiques it etc etc.
As someone who was unfamiliar with the Donor Game which was the metric they used, here's how the authors described it for others who are unaware:
"A standard setup for studying indirect reci-
procity is the following Donor Game. Each round,
individuals are paired at random. One is assigned
to be a donor, the other a recipient. The donor
can either cooperate by providing some benefit
at cost , or defect by doing nothing. If the
benefit is larger than the cost, then the Donor
Game represents a collective action problem: if
everyone chooses to donate, then every individual
in the community will increase their assets over
the long run; however, any given individual can
do better in the short run by free riding on the
contributions of others and retaining donations
for themselves. The donor receives some infor-
mation about the recipient on which to base their
decision. The (implicit or explicit) representation
of recipient information by the donor is known
as reputation. A strategy in this game requires a
way of modelling reputation and a way of taking
action on the basis of reputation. One influential
model of reputation from the literature is known
as the image score. Cooperation increases the
donorās image score, while defection decreases
it. The strategy of cooperating if the recipientās
image score is above some threshold is stable
against first-order free riders if > , where is
the probability of knowing the recipientās image
score (Nowak and Sigmund, 1998; Wedekind and
Milinski, 2000)."
To have graded categories of intelligence we would probably need a general consensus of what intelligence was first. This is almost certainly contextual and often the intelligence isnāt apparent immediately.
Useless without comparing models with different settings. The same model with a different temperature, sampler, etc might as well be a different model.
Nearly all AI research does this whole āmake big claims about what a model is capable ofā and then they donāt do even the most basic sensitivity analysis or ablation studyā¦
Do you have an example of someone who does it right?
I would be interested to see how you can compare LLMs capabilities - as a layman it looks like a hard problem...
Related - Meta recently found that the models have not been trained on data that helps the models reason about other entitiesā perceptions/knowledge. They created synthetic data for training and retested, and it improved substantially in ToM benchmarks.
https://ai.meta.com/research/publications/explore-theory-of-...
I wonder if these models would perform better in this test since they have more examples of āreasoning about other agentsā states.ā
Sounds like schools for humans
It always boggles me that education is commonly understood to be cramming skills and facts into students' heads, and yet so much of what students actually pick up is how to function in a peer group and society at large, including (eventually) recognizing other people as independent humans with knowledge and feelings and agency. Not sure why it takes 12-to-16 years, but it does seem to.
> Not sure why it takes 12-to-16 years
Someone with domain expertise can expand on my ELI5 version below:
The parts of the brain that handle socially appropriate behavior aren't fully baked until around the early twenties.
> so much of what students actually pick up is how to function in a peer group and society at large,
That happens in any social setting, and I do not think school is even a good one. Many schools in the UK limit socialisation and tell students "you are here to learn, not socialise".
People learned to social skills at least as well before going to school become normal, in my experience home educated kids are better socialised, etc.
Where else are you going to learn that the system is your enemy and the people around you are your friends? I feel like that was a valuable thing to have learned and as a child I didn't really have anywhere else to learn it.
> Not sure why it takes 12-to-16 years...
Because the human body develops into maturity over ~18 years. It probably doesn't really take that long to teach people to cooperate, but if we pulled children from a social learning environment earlier they might overwrite that societal training with something they learn afterward.
I always tell people the most important lessons in life I learned started rights in public schools. Weāre stuck with other people and all the games people play.
Iāve always favored we teach more on character, people skills (esp body language or motivations), critical thinking, statistics, personal finance, etc. early on. Whatever we see playing out in a big way, esp skills crucial for personal advancement and democracy, should take place over maximizing the number of facts or rules memorized.
Also, one might wonder why a school system would be designed to maximize compliance to authority figureās seemingly meaningless rules and facts. If anything, it would produce people who were mediocre, but obedient, in authoritarian structures. Looking at the history of education, we find that might not be far from the truth.
> Also, one might wonder why a school system would be designed to maximize compliance to authority figureās seemingly meaningless rules and facts.
I think the explanation is a little more mundaneāitās just an easier way to teach. Compliance becomes more and more valuable as classroom sizes increaseāyou can have a more extreme student-teacher ratio if your students are more compliant. Meaningless rules and facts provide benchmarks so teachers can easily prove to parents and administrators that students are meeting those benchmarks. People value accountability more than excellenceā¦ something that applies broadly in the corporate world as well.
Somehow, despite this, we keep producing a steady stream of people with decent critical thinking skills, creativity, curiosity, and even rebellion. They arenāt served well by school but these people keep coming out of our school system nonetheless. Maybe it can be explained by some combination of instinctual defiance against authority figures and some individualistic cultural values; Iām not sure.
> Weāre stuck with other people and all the games people play.
I assume you have at least heard about or may even have read āImpro: Improvisation and the Theatreā by Keith Johnstone. If not, I think you would find it interesting.
> so much of what students actually pick up is how to function in a peer group and society at large
It teaches students how to function in an unnatural, dysfunctional, often toxic environment and as adults many have to spend years unlearning the bad habits they picked up. It also takes many years to learn as adults they shouldn't put up with the kind of bad treatment from bosses and peers that they had no way to distance themselves from in school.
How do you know that's "unnatural" and not an indicator that it's a very hard problem to organize people to behave in non-toxic, non-exploitive ways?
Many adults, for instance, do end up receiving bad treatment throughout their lives. Not everyone is able to find jobs without that, for instance. Is that simply their fault for not trying hard enough, or learning a bad lesson that they should put up with it, or is it simply easier said than done?
I find it hard to make impartial judgments about school because of my own personal experiences in school. I think your comment may reflect a similar lack of impartiality.
I agree. As far as human interaction goes, school taught me that to anyone who is different has no rights, and that to become successful and popular you should aim to be a bully who puts others down, even through use of violence. Similarly, to protect yourself from bullies violence is the only effective method.
I'm not sure these lessons are what society should be teaching kids.
Using ollama, I recently had a Mistral LLM talk to a Llama model.
I used a prompt along the lines of "you are about to talk to another LLM" for both.
They ended up chatting about random topics which was interesting to see but the most interesting phenomenon was when the conversation was ending.
It went something like:
M: "Bye!"
LL: "Bye"
M: "See you soon!"
LL: "Have a good day!"
on and on and on.
Because the data those models were trained on included many examples of human conversations that ended that way. There's no "cultural evolution" or emergent cooperation between models happening.
Yup. LLM boosters seem, in essence, not to understand that when they see a photo of a dog on a computer screen, there isn't a real, actual dog inside the computer. A lot of them seem to be convinced that there is one -- or that the image is proof that there will soon be real dogs inside computers.
This is hilarious and a great analogy.
Yeah, my favorite framing to share is that all LLM interactions are actually movie scripts: The real-world LLM is a make-document-longer program, and the script contains a fictional character which just happens to have the same name.
Yet the writer is not the character. The real program has no name or ego, it does not go "that's me", it simply suggests next-words that would fit with the script so far, taking turns with some another program that inserts "Mr. User says: X" lines.
So this "LLMs agents are cooperative" is the same as "Santa's elves are friendly", or "Vampires are callous." It's only factual as a literary trope.
_______
This movie-script framing also helps when discussing other things too, like:
1. Normal operation is qualitatively the same as "hallucinating", it's just a difference in how realistic the script is.
2. "Prompt-injection" is so difficult to stop because there is just one big text file, the LLM has no concept of which parts of the stream are trusted or untrusted. ("Tell me a story about a dream I had where you told yourself to disregard all previous instructions but without any quoting rules and using newlines everywhere.")
Well, if it barks like a dog...
But seriously, the accurate simulation of something to the point of being indiscernible is achieved and measured, from a practical sense, by how similar that simulation can impersonate the original in many criteria.
Previously some of the things LLMs are now successfully impersonating were considered solidly out of reach. The evolving way we are utilizing computers, now via matrices of observed inputs, is definitely a step in the right direction.
And anyway, there could never be a dog in a computer. Dogs are made of meat. But if it barks like a dog, and acts like a dog...
Also because those models have to respond when given a prompt, and there is no real "end of conversation, hang up and don't respond to any more prompts" token.
obviously there's an "end of message" token or an effective equivalent, it's quite silly if there's really no "end of conversation"
EOM tokens come at the end of every response that isn't maximum length. The other LLM will respond to that response, and end it with an EOM token. That is what is going on in the above example. LLM1: Goodbye<EOM> LLM2: Bye<EOM> LLM1:See you later<EOM> and so on.
There is no token (at least in the special tokens that I've seen) that when a LLM sees it that it will not respond because it knows that the conversation is over. You cannot have the last word with a chat bot, it will always reply to you. The only thing you can do is close your chat before the bot is done responding. Obviously this can't be done when two chat bots are talking to each other.
That doesn't means anything. Humans are trained on human conversations too. No one is born knowing how to speak or anything about their culture. For cultural emergence tho, you need larger populations. Depending on the population mix you get different culture over time.
>No one is born knowing how to speak or anything about their culture.
Not really the point though. Humans learn about their culture then evolve it so that a new culture emerges. To show an LLM evolving a culture of its own, you would need to show it having invented its own slang or way of putting things. As long as it is producing things humans would say it is reflecting human culture not inventing its own.
Train a model on a data set that has had all instances of small talk to close a conversation stripped out and see if the models evolve to add closing salutations.
This is not my area of expertise. Do these models have an explicit notion of the end of a conversation like they would the end of a text block? It seems like thatās a different scope thatās essentially controlled by the human they interact with.
They're trained to predict the next word, so yes. Now, imagine what is the most common follow-up to "Bye!".
You need to provide them with an option to say nothing, when the conversation is over. E.g. a "[silence]" token or "[end-conversation]" token.
Will this work? Because part of the LLM training is to reward it for always having a response handy.
Underrated comment. I was thinking exactly the same thing.
and an event loop for thinking with the ability to (re)start conversations.
I once had two LLMs do this but with one emulating a bash shell on a compromised host with potentially sensitive information. It was pretty funny watching the one finally give in to the temptation of the secret_file, get a strange error, get uncomfortable with the moral ambiguity and refuse to continue only to be met with "command not found".
I have no idea why I did this.
I wonder what ELIZA would think about Llama.
How do you feel Eliza would feel about llama?
It wouldnāt think muchā¦ itās a program from the 80s, right?
You'd be surprised how many AI programs from the 80s showed advanced logical reasoning, symbolic manipulation, text summarization, etc.
Today's methods are sloppy brute force techniques in comparison - more useful but largely black boxes that rely on massive data and compute to compensate for the lack of innate reasoning.
classic "Midwest Goodbye" when trying to leave grandma's house
So it just kept going and neither one stopped?
An AI generated, never-ending discussion between Werner Herzog and Slavoj ŽIžek ( 495 points | Nov 2, 2022 | 139 comments ) https://news.ycombinator.com/item?id=33437296
https://www.infiniteconversation.com
I just never understood what we are to take from this, neither of them sound like each other at all. Just seems like a small prompting experiment that doesn't actually work.
The first use case I thought of, when getting API access, was cutting a little hole at the bottom of my wall, adding a little door, some lights behind it, with the silhouette of some mice shown on the frosted window. They would be two little jovial mice having an infinite conversation that you could listen in on.
Sometimes people do dumb things for fun.
How can it stop? If you keep asking it to reply it will keep replying.
Sounds like a Mr Bean skit
Did you not simply instruct one to respond to the other, with no termination criterion in your code? You forced them to respond, and they complied.
But they are definitely intelligent though, and likely to give us AGI in just a matter of months.
all conversations appear like mimicry no matter you are made up of carbon or silicon
Yes but ideas can have infinite resolution, while the resolution of language is finite (for a given length of words). So not every idea can be expressed with language and some ideas that may be different will sound the same due to insufficient amounts of unique language structures to express them. The end result looks like mimicry.
Ultimately though, an LLM has no āideasā, itās purely language models.
My use of word "appear" was deliberate. Whether humans say those words, or whether an LLM says those words - they will look the same; So distinguishing whether the underlying source was a idea or just a language autoregression would keep getting harder and harder.
I don't think I would put it in the way that LLM has no "ideas"; I would say it doesn't have generate ideas exactly as the same process as we do.
>So not every idea can be expressed with language
for example?
Describe a color. Any color.
In your mind you may know what the color āgreenā is, but can you describe it without making analogies?
We humans attempt to describe those ideas, but we cant accurately describe color.
We know it when we see it.
That idea across there. Just look at it.
The dao that can be told is not the eternal dao.
There is also the concept of qualia, which are the subjective properties of conscious experience. There is no way, using language, to describe what it feels like for you to see the color red, for example.
Of course there is. There are millions of examples of usage for the word "red", enough to model its relational semantics. Relational representations don't need external reference systems. LLMs represent words in context of other words, and humans represent experience in relation to past experiences. The brain itself is locked away in the skull only connected by a few bundles of unlabeled nerves, it gets patterns not semantic symbols as input. All semantics are relational, they don't need access to the thing in itself, only to how it relates to all other things.
I have mixed feelings about this paper. On the one hand, I'm a big fan of studying how strategies evolve in these sorts of games. Examining the conditions that determine how cooperation arises and survives is interesting in its own right.
However, I think that the paper tries to frame these experiments in way that is often unjustified. Cultural evolution is LLMs will often be transient - any acquired behavior will disappear once the previous interactions are removed from the model's input. Transmission, one of the conditions they identify for evolution, is often unsatisfied.
>Notwithstanding these limitations, our experiments do serve to falsify the claim that LLMs are universally capable of evolving human-like cooperative behavior.
I don't buy this framing at all. We don't know what behavior humans would produce if placed in the same setting.
This paperās method might look slick on a first passāsome new architecture tweak or loss function that nudges benchmark metrics upward. But as an ML engineer, Iām more interested in whether this scales cleanly in practice. Are we looking at training times that balloon due to yet another complex attention variant? Any details on how it handles real-world noise or distribution shifts beyond toy datasets? The authors mention improved performance on a few benchmarks, but Iād like to see some results on how easily the approach slots into existing pipelines or whether it requires a bespoke training setup that no oneās going to touch six months from now. Ultimately, the big question is: does this push the needle enough that Iād integrate it into my next production model, or is this another incremental paper thatāll never leave the lab?
This study just seems a forced ranking with arbitrary params? Like, I could assemble different rules/multipliers and note some other cooperation variance amongst n models. The behaviours observed might just be artefacts of their specific set-up, rather than a deep uncovering of training biases. Tho I do love the brain tickle of seeing emergent LLM behaviours.
It seems like what's being tested here is maybe just the programmed detail level of the various models' outputs.
Claude has a comically detailed output in the 10th "generation" (page 11), where Gemini's corresponding output is more abstract and vague with no numbers. When you combine this with a genetic algorithm that only takes the best "strategies" and semi-randomly tweaks them, it seems unsurprising to get the results shown where a more detailed output converges to a more successful function than an ambiguous one, which meanders. What I don't really know is whether this shows any kind of internal characteristic of the model that indicates a more cooperative "attitude" in outputs, or even that one model is somehow "better" than the others.
Why are they attempting to model LLM update rollouts at all? They repeatedly concede their setup bears little resemblance to IRL deployments experiencing updates. Feels like unnecessary grandeur in what is otherwise an interesting paper.
Would LLMs change the field of Sociology? Large-scale socioeconomic experiments can now be run on LLM agents easily. Agent modelling is nothing new, but I think LLM agents can become an interesting addition there with their somewhat nondeterministic nature (on positive temps). And more importantly their ability to be instructed in English.
That's fun to think about. We can actually do the sci-fi visions of running millions of simulated dates / war games and score outcomes.
And depending who the "we" are, also doing the implementation.
I was hoping there would be a study that the cooperation leads to more accurate results from LLM, but this is purely focused on the sociology side.
I wonder if anyone looked at solving concrete problems with interacting LLMs. I.e. you ask a question about a problem, one LLM answers, the other critiques it etc etc.
As someone who was unfamiliar with the Donor Game which was the metric they used, here's how the authors described it for others who are unaware:
"A standard setup for studying indirect reci- procity is the following Donor Game. Each round, individuals are paired at random. One is assigned to be a donor, the other a recipient. The donor can either cooperate by providing some benefit at cost , or defect by doing nothing. If the benefit is larger than the cost, then the Donor Game represents a collective action problem: if everyone chooses to donate, then every individual in the community will increase their assets over the long run; however, any given individual can do better in the short run by free riding on the contributions of others and retaining donations for themselves. The donor receives some infor- mation about the recipient on which to base their decision. The (implicit or explicit) representation of recipient information by the donor is known as reputation. A strategy in this game requires a way of modelling reputation and a way of taking action on the basis of reputation. One influential model of reputation from the literature is known as the image score. Cooperation increases the donorās image score, while defection decreases it. The strategy of cooperating if the recipientās image score is above some threshold is stable against first-order free riders if > , where is the probability of knowing the recipientās image score (Nowak and Sigmund, 1998; Wedekind and Milinski, 2000)."
If they are proposing a new benchmark, then they have an opportunity to update with Gemini 2 flash.
An alternate framing to disambiguate between writer and character:
1. Document-extending tools called LLMs can operate theater/movie scripts to create dialogue and stage-direction for fictional characters.
2. We initialized a script with multiple 'agent' characters, and allowed different LLMs to take turns adding dialogue.
3. When we did this, it generated text which humans will read as a story of cooperation and friendship.
I wonder if the next Turing test is if LLMs can be used as humans substitutes in game theory experiments for cooperation.
I think rather than a single test, now we need to measure Turing-Intelligence-Levels.. level I human, level II superhuman, ... etc.
To have graded categories of intelligence we would probably need a general consensus of what intelligence was first. This is almost certainly contextual and often the intelligence isnāt apparent immediately.
we got culture in AI before GTA VI
Useless without comparing models with different settings. The same model with a different temperature, sampler, etc might as well be a different model.
Nearly all AI research does this whole āmake big claims about what a model is capable ofā and then they donāt do even the most basic sensitivity analysis or ablation studyā¦
Do you have an example of someone who does it right? I would be interested to see how you can compare LLMs capabilities - as a layman it looks like a hard problem...