ACL 2023 Notes
There was a lot happening at ACL 2023, so to help me remember everything and as a resource for people who didn’t attend, I wrote up some notes.
Big takeaways:
- There is a split between people who think that LLMs learn world models and those who don’t. Geoffrey Hinton is in the world-models camp, as are many of the people I met. It seems like more people are moving in this direction as the empirical evidence becomes more conclusive. Some of this empirical evidence was presented at the conference and is discussed below.
- I believe most people are excited about LLMs because they make previously impossible things possible. LLMs are not the end of NLP, though, since they will be one layer of a larger stack. Dan Klein related this to the internet stack: a few people work on improving transport-layer protocols, and many more build things on top of those layers.
- Reasoning is a big issue, but there are many different directions trying to build reasoning on top of the advances from LLMs. Chain-of-Thought prompting is currently the best and simplest method for eliciting reasoning from an LLM, but others are working on explicitly extracting the world model and reasoning over it with existing techniques from GOFAI.
- AI2 in particular is trying to do this using Natural Logic, and others are using formal reasoning systems like first-order logic (FOL) or programming languages like Python.
- There was a lot of work on retrieval augmented language models and multilingual language models. This is again reflective of other limitations of current LLMs.
- Data cleaning and curation is an increasingly important problem as concerns like multilinguality, bias, and sensitive information get more attention. There is also evidence that small, clean datasets can result in better performance than large, noisy ones.
- I did not see anyone from OpenAI attending, which was odd since a couple of talks and panels seemed directly targeted at them. In general, it seemed like people were advocating for open-sourcing models, moving away from press releases and back to peer review, and regulating what data can be used for training.
Multilingual LLMs Tutorial
Now that there are reasonably good English LLMs, the natural question is whether this extends beyond English. It turns out that multilingual LLMs often perform much worse in other languages than they do in English because most of the data they are trained on is English. This tutorial focused on the current best methods for creating and evaluating multilingual LLMs.
- Data preprocessing: collection -> dedup -> filter
- For dedup, you can use MinHash, exact match, … (see the MinHash sketch after this list).
- Filtering can be based on other pretrained models. For instance, you can filter out hate speech using a hate speech classifier.
- For multilingual sources, it is well known that under-represented languages form subgroups on which the model performs poorly.
- Evaluating an LM by averaging across languages is misleading because a model can cater to the most prevalent language and still get high average performance.
- Probing is used to determine what the language model has learned:
- Structural probing: find the syntax tree such that word depth in the syntax tree matches the distance between words in the embedding space.
- Intrinsic probing: find the subset of neurons most informative of a linguistic property.
- Causal probing: use interventions to check if some detected behavior is actually used by the model.
- Removing a bias in one language may not transfer to removing the bias in another language.
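To make the dedup step concrete, here is a minimal from-scratch sketch of MinHash near-duplicate detection (my own illustration, not the tutorial’s pipeline); the shingle size, number of hashes, and 0.6 similarity threshold are arbitrary choices for the toy example.

```python
import hashlib

def shingles(text, n=5):
    """Character n-grams (lowercased, punctuation stripped) as the document's feature set."""
    text = " ".join("".join(c for c in text.lower() if c.isalnum() or c.isspace()).split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """Summarize a document by the minimum hash value of its shingles under each seed."""
    grams = shingles(text)
    return [
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumped over the lazy dog!",  # near duplicate
    "Multilingual language models need better data curation.",
]

# Greedy dedup: keep a document only if it is not too similar to anything already kept.
kept = []  # list of (doc, signature) pairs
for doc in docs:
    sig = minhash_signature(doc)
    if all(estimated_jaccard(sig, kept_sig) < 0.6 for _, kept_sig in kept):
        kept.append((doc, sig))

print([doc for doc, _ in kept])  # the near duplicate should be dropped
```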
Generating From an LM Tutorial
From the name of this tutorial, I expected it to be useless: I assumed generating from an LM was obvious, just argmax (greedy) decoding or temperature sampling. It turns out the sampling method can do more than improve generation quality; for instance, it can be used to remove harmful biases and improve truthfulness. This tutorial was an exposition of many different methods for generating from an LM and how to easily design new ones.
- A language model is just a probability distribution over token sequences. Given a sequence of tokens, the language model produces a distribution over tokens, so how does one generate text?
- There are methods like RLHF which tune the distribution to match a certain objective, but this tutorial is focused on methods which don’t require changing the model parameters.
- In high entropy scenarios, which means probability mass is spread among many tokens, the argmax is not very representative.
- Problems with generating from LLMs:
- LLMs don’t assign highest probability to truthful text.
- Model parameters are not interpretable, so they don’t help in determining what the LM has learned.
- All generation strategies can be decomposed into two parts (see the sketch after this list):
- Scoring function: how do we score tokens?
- Algorithm: how do we choose the next token?
- It is known that the tail of the token probabilities is unreliable, so truncation-based scoring removes the tail.
- Mirostat is a scoring function which dynamically truncates to the top k tokens to keep a stable perplexity of the generated text. Perplexity is a measure of how surprising/unusual the text seems.
- Checking the perplexity/surprisal of generated text is a way of diagnosing problems with generation.
- Prompting can be viewed as a type of scoring function. Providing more context lowers the entropy.
- The simplest algorithm is to take the argmax of the scored tokens, but there are more complex options like sampling from the score distribution. There are even optimization-based approaches built on energy-based models (EBMs): COLD Decoding.
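To make the scoring-function/algorithm decomposition concrete, here is a minimal sketch (mine, not the tutorial’s code): temperature and top-k truncation rewrite the next-token scores, while greedy and sampling are the selection algorithms. The `logits` array stands in for one decoding step of a real LM.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Scoring functions: rewrite the next-token scores ---
def temperature(logits, t=0.8):
    """Sharpen (t < 1) or flatten (t > 1) the distribution."""
    return logits / t

def top_k(logits, k=50):
    """Truncation-based scoring: keep only the k highest-scoring tokens (drop the unreliable tail)."""
    cutoff = np.sort(logits)[-k]
    return np.where(logits >= cutoff, logits, -np.inf)

# --- Algorithms: choose a token from the (rescored) distribution ---
def greedy(logits):
    return int(np.argmax(logits))

def sample(logits):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate_step(logits, scorers, algorithm):
    """Any generation strategy = a stack of scoring functions + one selection algorithm."""
    for scorer in scorers:
        logits = scorer(logits)
    return algorithm(logits)

# Fake next-token logits for a 1000-token vocabulary.
logits = rng.normal(size=1000)
print(generate_step(logits, [temperature, top_k], sample))  # truncated temperature sampling
print(generate_step(logits, [], greedy))                    # plain argmax decoding
```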
Talks:
Ellie Pavlick (Brown professor, PhD advised by CCB!):
- Her lab is working on finding structure in neural nets (simple feed forward networks as well as transformers).
- The central argument of her talk is that neural networks, and LLMs in particular, may not be the inscrutable black boxes most people assume. Some example papers from her lab in this direction are: Language Models Implement Simple Word2Vec-style Vector Arithmetic, Break It Down: Evidence for Structural Compositionality in Neural Networks, and Symbols and grounding in large language models.
- In Break It Down, they find subnetworks that encode functions that solve the training task compositionally. They can intervene on these subnetworks through ablations which show the effect of the subnetworks.
- These observations surprisingly do not seem to hold for vision models. She is not sure why this is the case.
- She sees language models as effectively “neurosymbolic” since they take symbols as input and output symbols. The symbols in vision are less well defined which could be what is causing the different training dynamics. They can also find symbols in the embedding space of a language model.
Noah Goodman (Stanford professor):
- Disagrees with the common system 1 / system 2 framing of most reasoning papers, since he says there are more than just two systems of thought. He still thinks the binary framing is a useful way of thinking about cognitive processes.
- He has a paper on constraining LLM output using a formal FOL proof system: https://arxiv.org/abs/2306.04031.
- Their approach differs from letting an LLM use tools or from building a symbolic system on top of the LLM’s output: they use the symbolic system to constrain the space of outputs while the LLM still generates all the text.
- He compares LLMs to airplanes and birds. Airplanes have jet engines and stationary wings while birds have feathers, but both use the airfoil which we discovered is essential for flight. He thinks LLMs may help us discover what is essential to reasoning.
- He also mentioned that his lab is now working on mechanistic interpretability, like Ellie Pavlick.
Liam Dugan (Advised by CCB):
- Exploring the Curious Case of Code Prompts: https://arxiv.org/abs/2304.13250
- Translating prompts to code results in better performance for LLMs pretrained on code (toy illustration below).
- The paper doesn’t give a reason why, it just shows this empirical result.
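As a toy illustration of the idea (my own paraphrase, not the paper’s exact prompt format), a natural-language question might be rewritten as a code-style prompt before being sent to a code-pretrained LLM:

```python
# Natural-language prompt:
#   "Sarah has 5 apples. She gives 2 to Tom. How many apples does she have left?"

# The same task as a code-style prompt; the model is asked to complete the last line.
sarah_apples = 5
apples_given_to_tom = 2
apples_left = sarah_apples - apples_given_to_tom  # a code-pretrained LLM fills this in
```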
Maarten Sap (CMU professor):
- NLP systems need to account for social context. For instance, something that is said to your friend may be inappropriate to say to your boss.
- In NLPositionality: Characterizing Design Biases of Datasets and Models, they look at whether the labels from a dataset/model are aligned with those of different subgroups.
- COBRA Frames: Contextual Reasoning about Effects and Harms of Offensive Statements is used to explicitly add social context to a model in the form of a prompt. They finetune a language model on these context-output pairs and see that the model is able to perform well on new social contexts.
- He says they still have many examples where their models fail. These look like: the model is given a social context and some statement, and it incorrectly predicts whether the statement is appropriate.
- I think it may be better to supervise the model to learn to use social context, similar to Language Modeling with Latent Situations (described below), instead of providing the social context as input, which may not generalize.
Peter Clark (CEO of AI2):
- What is a world model?
- A world model has entities, relationships, and an inference procedure to update the entities and relationships.
- Isn’t this just a relational database?
- He claims that modern neural systems form internal world models. We need interpretable versions of these world models for communication, verification, interaction, and augmentation.
- What the model generates does not equal what the model believes. This is shown by the model generating many contradictory statements. This was also the takeaway from the generation tutorial.
- REFLEX is proposed in Language Models with Rationality to try to fix the issue of the model generating false statements.
- They create a belief graph by prompting an LLM and then use constraint solving to fix contradictions among the beliefs (see the sketch after this section).
- This results in a basic world model, which can be made more sophisticated in follow-up work.
- He finds that just asking the LLM whether it believes what it generated is true can reduce the number of false statements. But if the LLM doesn’t generate text it believes, how can you trust its answer when you ask whether it believes what it just output?
- He wants to produce world models to explain the reasoning behind multiple questions. He calls this world model a “micro theory” since the world model can be used to answer a subset of questions instead of a single question.
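Here is a rough sketch of the belief-graph idea (my own drastic simplification, not AI2’s REFLEX implementation): each belief gets a model-assigned confidence, contradiction and implication constraints connect the beliefs, and the solver picks the consistent assignment that stays closest to the model’s confidences.

```python
from itertools import product

# Beliefs elicited from an (imagined) LLM, with the model's confidence in each.
beliefs = {
    "sparrows are birds": 0.95,
    "birds can fly": 0.90,
    "sparrows can fly": 0.40,      # the raw model is unsure here...
    "sparrows cannot fly": 0.60,   # ...and also leans toward the opposite
}

# Constraints over the belief graph.
contradictions = [("sparrows can fly", "sparrows cannot fly")]
implications = [(("sparrows are birds", "birds can fly"), "sparrows can fly")]

def consistent(assign):
    """No contradiction pair is jointly believed, and every implication is respected."""
    for a, b in contradictions:
        if assign[a] and assign[b]:
            return False
    for premises, conclusion in implications:
        if all(assign[p] for p in premises) and not assign[conclusion]:
            return False
    return True

def score(assign):
    """Agreement with the model's confidences: prefer keeping what the model believed."""
    return sum(c if assign[b] else 1 - c for b, c in beliefs.items())

# Brute-force constraint solving (fine for a toy graph): best consistent assignment.
names = list(beliefs)
best = max(
    (dict(zip(names, values)) for values in product([True, False], repeat=len(names))),
    key=lambda a: score(a) if consistent(a) else float("-inf"),
)
print(best)
# The solver keeps "sparrows can fly" (forced by the implication) and drops
# "sparrows cannot fly", overriding the model's raw, contradictory leanings.
```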
Denny Zhou (lead of the Reasoning Team in Google DeepMind):
- His stated goal is to solve AGI by getting LLMs to reason… He was the only person I heard mention the word AGI at the conference.
- Reasoning is missing from current machine learning models. He shows how he developed Chain-of-Thought (Chain of Thought Prompting Elicits Reasoning in Large Language Models) and self-consistency (Self-Consistency Improves Chain of Thought Reasoning in Language Models) to greatly improve the reasoning ability of LLMs (see the sketch after this section).
- LLMs can also self-debug the code they output: Teaching Large Language Models to Self-Debug.
- The takeaway from the talk was that LLMs initially could not reason, but he has worked extensively on prompting them and sampling from them in better ways that allow the models to reason.
- Can this approach be classified as neurosymbolic since it takes the LLM which does basic “system 1” type of thinking and adds “system 2” type of capabilities?
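A minimal sketch of chain-of-thought plus self-consistency (the `complete` function below is a canned stand-in for a real sampled LLM call, and the answer-extraction heuristic is a toy): sample several reasoning paths at nonzero temperature and take a majority vote over the final answers.

```python
import random
import re
from collections import Counter

def complete(prompt: str) -> str:
    """Stand-in for one sampled LLM completion; replace with a real model call."""
    return random.choice([
        "There are 5 apples and 2 are given away, so 5 - 2 = 3. The answer is 3",
        "5 apples minus 2 apples leaves 3 apples. The answer is 3",
        "Giving away 2 of 5 apples leaves 4. The answer is 4",  # a faulty reasoning path
    ])

def extract_answer(completion: str) -> str | None:
    """Toy heuristic: grab the number after 'The answer is', if present."""
    match = re.search(r"The answer is\s*(-?\d+)", completion)
    return match.group(1) if match else None

def self_consistency(question: str, num_samples: int = 10) -> str | None:
    # Chain-of-thought prompt: ask the model to reason step by step.
    prompt = f"Q: {question}\nA: Let's think step by step."
    answers = [extract_answer(complete(prompt)) for _ in range(num_samples)]
    answers = [a for a in answers if a is not None]
    # Majority vote: individual reasoning paths may go wrong, but correct ones tend to agree.
    return Counter(answers).most_common(1)[0][0] if answers else None

print(self_consistency("Sarah has 5 apples and gives 2 to Tom. How many are left?"))  # usually "3"
```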
Sarah Wiegreffe (postdoc at AI2, PhD advised by Mark Riedl):
- She talks about the two approaches to interpretability: prompting + querying and mechanistic interpretability.
- Prompting + Querying: Provide natural language inputs to the LLM and observe the natural language outputs.
- Decomposing queries can help people understand and predict model behavior.
- A problem with this approach comes from surface form competition (SFC) which is when probability mass is spread among many synonyms. This makes the most likely next token not correspond to the most probable concept.
- The big issue is defining the groups of synonyms.
- People try to use the prompt to constrain the output space, but this is shown to not work well. For instance, you can say “Answer with A, B, C, or D” which constrains the output to 4 choices, but this tends to destroy performance.
- Mechanistic Interpretability: Map model parameters to human interpretable functions.
- The traditional method is probing, which means learning a simple function (usually linear) that maps the model’s latent space to the property you are looking for (a minimal probe sketch follows this section).
- The big issue is that correlation is not causation, so finding a projection of the latent space which explains some property may not mean that the model actually uses or has learned that property.
- Model editing (Locating and Editing Factual Associations in GPT) can serve as a testbed for confirming mechanistic hypotheses.
- In Editing Commonsense Knowledge in GPT they fix commonsense errors of a GPT model by editing.
- She sees model editing as a way to test hypotheses, not as a way to actually fix model mistakes or update a model since she isn’t sure if model editing introduces downstream problems.
- Cons of mechanistic interpretability:
- Current work targets only very low-level behavior.
- Negative results are not helpful.
- There is a lack of unified goals.
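Here is a minimal probing sketch (my own illustration; the random vectors below stand in for real frozen hidden states): fit a linear classifier from representations to a linguistic property and check held-out accuracy. As noted above, high probe accuracy only shows the property is linearly decodable, not that the model actually uses it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for frozen hidden states (one 768-dim vector per token) and a binary
# linguistic property for each token (e.g., "is this token a verb?").
hidden_states = rng.normal(size=(2000, 768))
labels = rng.integers(0, 2, size=2000)
# Plant a weak signal along one direction so the probe has something to find.
hidden_states[:, 0] += 3.0 * labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)

# The probe: a simple linear classifier on top of the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```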
Interesting Papers:
- Multilingual Language Models are not Multicultural: A Case Study in Emotion (Advised by Lyle and Eric)
- They test if multilingual LLMs have learned some aspects of the culture of the languages which they are trained on. For instance, asking GPT-4 in Japanese how it feels about confronting someone in their home results in a response that it would be an “exciting” and good experience to do so. This does not actually match the culture of Japan.
- They find that prompting in different languages does not result in an understanding of culture, but explicitly prompting for culture does. This means the LLM has learned to represent culture to some extent, but that current prompts do not extract this knowledge.
- Received the best paper award at WASSA!
- Multilingual Conceptual Coverage in Text-to-Image Models
- Looks at how concepts are encoded multilingually in text-to-image models. The prompt “dog” produces an image of a golden retriever, but what about “dog” in other languages?
- They show that there are some bad biases encoded in some text-to-image models and how they can fix some problems by using multilingual encoders.
- How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases
- Using a custom dataset where the training data has a spurious correlation, they show what model size and dataset size are required for the model to learn to parse the input as a tree. This is the desired behavior because it allows generalization, while the spurious correlation leads to wrong results on the test set.
- The takeaway is that if they can get the LLM to learn to parse first, then they can get it to learn more complex things. What is the ideal curriculum to teach an LLM complex concepts?
- Backpack Language Models
- Instead of trying to understand what an LLM has learned, they modify the architecture to try to aid in interpretability.
- For each word, they learn k word sense vectors. A sentence embedding is a weighted sum of each word’s sense vectors. They show that the sense vectors are interpretable since for a word like science they have a sense that is for the profession, a sense for fiction, and a sense for religion.
- Grokking of Hierarchical Structure in Vanilla Transformers
- This paper shows how training a transformer way past the point where it gets 100% training set accuracy allows it to use hierarchical structure in the input and generalize to structurally novel inputs.
- This is not very intuitive since usually people think that training this long results in overfitting, but this shows that training this long is actually necessary for generalization.
- The Whole Truth and Nothing But the Truth
- An interesting approach to combining a rule-based system with an LLM. They use the LLM to generate a computation graph and then handwritten rules constrain how the LLM can describe the computation it performed.
- This is important for a dialogue system that uses some API behind the scenes because it must describe precisely what actions it has performed and why. For instance, if you say “Schedule a meeting with Sarah at 5pm”, then the reply “Scheduled the meeting” is not good since you have no guarantee it scheduled it with the right person at the right time.
- To fix this, they turn the statement “Schedule a meeting with Sarah at 5pm” to a computation graph and then convert the computation graph back to a natural language description with the aid of handwritten rules.
- Language Modeling with Latent Situations
- LLMs can produce text about a subject not previously introduced, or randomly switch topics. These issues are problems with semantic coherence.
- To fix this, they make the LLM explicitly model the situation, which they show improves semantic coherence.
- Detecting Personal Information in Training Corpora
- Using simple regexes, they find massive amounts of personal information, including credit card numbers, social security numbers, and addresses, in common LLM pretraining datasets (see the regex sketch below).
- Their biggest problem is in writing the regex to find things like credit card numbers or bank numbers.
- A related problem is test-set data leaking into the training set. Exact match and even fuzzy match search are not enough: for instance, if an article is in the test set, how do you find whether a summary of that article is in the training set?
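As a toy version of the regex-based approach (my own patterns, far cruder than what the paper needs): scan text for strings that look like credit card or US social security numbers. As the authors note, getting good precision and recall out of such patterns is the hard part; these will both miss formats and fire on false positives.

```python
import re

# Naive patterns: 13-16 digit runs (optionally space/hyphen separated) for credit cards,
# and the common XXX-XX-XXXX shape for US social security numbers.
CREDIT_CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def luhn_ok(number: str) -> bool:
    """Luhn checksum, used to cut down on false-positive credit card matches."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def find_pii(text: str) -> dict:
    return {
        "credit_cards": [m for m in CREDIT_CARD.findall(text) if luhn_ok(m)],
        "ssns": SSN.findall(text),
    }

sample = "Contact me anytime. Card: 4111 1111 1111 1111, SSN 123-45-6789."
print(find_pii(sample))
```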