The Laboratorium (3d ser.)

A blog by James Grimmelmann

Be regular and ordinary in your life, so that
you may be violent and original in your works.

GenLaw DC Workshop

I’m at the GenLaw workshop on Evaluating Generative AI Systems: the Good, the Bad, and the Hype today, and I will be liveblogging the presentations.

Introduction

Hoda Heidari: Welcome! Today’s event is sponsored by the K&L Gates Endowment at CMU, and presented by a team from GenLaw, CDT, and the Georgetown ITLP.

Katherine Lee: It feels like we’re at an inflection point. There are lots of models, and they’re being evaluated against each other. There’s also a major policy push. There’s the Biden executive order, privacy legislation, the generative-AI disclosure bill, etc.

All of these require the ability to balance capabilities and risks. The buzzword today is evaluations. Today’s event is about what evaluations are: ways of measuring generative-AI systems. Evaluations are proxies for things we care about, like bias and fairness. These proxies are limited, and we need many of them. And there are things like justice that we can’t even hope to measure. Today’s four specific topics will explore the tools we have and their limits.

A. Feder Cooper: Here is a concrete example of the challenges. One popular benchmark is MMLU. It’s advertised as testing whether models “possess extensive world knowledge and problem solving ability.” It includes multiple-choice questions from tests found online, standardized tests of mathematics, history, computer science, and more.

But evaluations are surprisingly brittle, and standardized tests are contested measures even for humans: CS programs don’t always rely on the GRE. In addition, it’s not clear what the benchmark measures. In the last week, MMLU has come under scrutiny. It turns out that if you reorder the questions as you give them to a language model, you get wide variations in overall scores.
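[An aside from me, not from the talk: here is a toy sketch, in Python, of how a position bias can distort a multiple-choice score. The “model” below is entirely made up; nothing here is the real MMLU harness.]

    # Toy illustration: a "model" that knows the right answer but also has a
    # spurious preference for whatever choice is labeled "A". Shuffling the
    # choice order swings its measured accuracy.
    import random

    QUESTION = {"stem": "2 + 2 = ?", "choices": ["3", "4", "5", "22"], "answer": "4"}

    def toy_model_score(letter, choice):
        """Pretend log-probability: genuine knowledge plus a position bias."""
        score = 1.0 if choice == "4" else 0.0   # knows the right content
        if letter == "A":
            score += 1.2                        # spurious preference for slot A
        return score

    def accuracy_over_orderings(question, trials=1000, seed=0):
        rng = random.Random(seed)
        correct = 0
        for _ in range(trials):
            choices = question["choices"][:]
            rng.shuffle(choices)
            scored = [(toy_model_score(letter, c), c) for letter, c in zip("ABCD", choices)]
            correct += max(scored)[1] == question["answer"]
        return correct / trials

    # Prints roughly 0.25: the toy model only looks right when the correct
    # answer happens to land in slot A.
    print(accuracy_over_orderings(QUESTION))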

This gets at another set of questions about generative-AI systems. MMLU benchmarks a model, but systems are much more than just models. Most people interact with a deployed system that wraps the model in a program with a user interface, filters, etc. There are numerous levels of indirection between the user’s engagement and the model itself.

And even the system itself is embedded in a much larger supply chain, from data through training to alignment to generation. There are numerous stages, each of which may be carried out by different actors doing different portions. We have just started to reason about all of these different actors and how they interact with each other. These policy questions are not just about technology, they’re about the actors and interactions.

Alexandra Givens: CDT’s work involves making sure that policy interventions are grounded in a solid understanding of how technologies work. Today’s focus is on how we can evaluate systems, in four concrete areas:

  1. Training-data attribution
  2. Privacy
  3. Trust and safety
  4. Data provenance and watermarks

We also have a number of representatives from government giving presentations on their ongoing work.

Our goals for today are to provide insights on cross-disciplinary, cross-community engagement. In addition, we want to pose concrete questions for research and policy, and help people find future collaborators.

Paul Ohm: Some of you may remember the first GenLaw workshop; we want to bring that same energy today. Here at Georgetown Law, we take seriously the idea that we’re down the street from the Capitol and want to be engaged. We have a motto, “Justice is the end, law is but the means.” I encourage you to bring that spirit to today’s workshop. This is about evaluations in service of justice and the other values we care about.

Zack Lipton

Zack Lipton: “Why Evaluations are Hard”

One goal of evaluations is simply quality: is this system fit for purpose? One question we can ask is, what is different about evaluations in the generative-AI era? And an important distinction is whether a system does everything for everyone or has a more pinned-down use case with a more laser-targeted notion of quality.

Classic discriminative learning involves a prediction or recognition problem (or a problem that you can twist into one). For example, I want to give doctors guidance on whether to discharge a patient, so I predict mortality.

Generically, I have some input and I want to classify it. I collect a large dataset of input-output pairs and learn a model from them (the learned pattern). Then I can test how well the model works on some data we didn’t train on. The paradigm of machine learning that came to dominate is that I hold out a test set, and I measure how well the model works on that test set.

So when we evaluate a discriminative model, there are only a few kinds of errors. For a yes-no classifier, those are false positives and false negatives. For a regression problem, they are over- and under-estimates. We might look into how well the model performs on different strata, either to explore how it works or to check for disparities across salient demographic groups in the population. And then we are concerned with whether the model is valid at all outside the distribution it was trained on–e.g., at a different hospital, or in the wild.
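[My gloss, not Zack’s slides: the classic recipe he is describing fits in a few lines. The labels, predictions, and groups below are invented purely for illustration.]

    # Hold out a test set, count false positives and false negatives, and
    # break the error rates down by stratum.
    from collections import defaultdict

    def error_rates_by_group(y_true, y_pred, groups):
        """Return (false_positive_rate, false_negative_rate) per group."""
        counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
        for t, p, g in zip(y_true, y_pred, groups):
            c = counts[g]
            if t == 1:
                c["pos"] += 1
                c["fn"] += p == 0
            else:
                c["neg"] += 1
                c["fp"] += p == 1
        return {g: (c["fp"] / max(c["neg"], 1), c["fn"] / max(c["pos"], 1))
                for g, c in counts.items()}

    # Toy held-out data: true label, model prediction, demographic stratum.
    y_true = [1, 0, 1, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
    groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
    print(error_rates_by_group(y_true, y_pred, groups))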

[editor crash, lost some text]

Now we have general-purpose systems like ChatGPT which are provided without a specific task. They’re also being provided to others as a platform for building their own tools. Now language models are not just language models. Their job is not just to accurately predict the next word but to perform other tasks.

We have some ways of assessing quality, but there is no ground truth we can evaluate against, and no single dataset that represents the domain we care about. Evaluation shifts to sweeping benchmarks; the tasks we built in NLP before now function as a giant battery of tests we can administer to probe the “general” capabilities of ML models. And the discourse shifts towards trying to predict catastrophic outcomes.

At the same time, these generic capabilities provide a toolkit for building stronger domain-specific technologies. Now people in the marketplace are shipping products without any task-specific data; the generic models give them a floor level of performance with no data at all. Generative AI has opened up new domains, but with huge evaluation challenges.

Right now, for example, health-care systems are underwater, and the clerical burden of documenting everything is immense: two hours of form-filling for every one hour of patient care. At Abridge, we’re building a generative-AI system for physicians to document clinical notes. So how do we evaluate it? There’s no gold standard; we can’t use simple tricks. The problem isn’t red-teaming; it’s about producing consistently high-quality documentation at a statistical level. The possible errors are completely open-ended, and we don’t have a complete account of our goals.

Finally, evaluation takes place at numerous times. Before deployment, we can look at automated metrics–but at the end of the day, no evaluation will capture everything we care about. A lot of the innovation happens when we have humans in the loop to give feedback on notes. We use human spot checks, we have relevant experts judging notes, across specialties and populations, and also tagging errors with particular categories. We do testing during rollout, using staged releases and A/B tests. There are also dynamic feedback channels from clinician feedback (star ratings, free-form text, edits to notes, etc.). There are lots of new challenges–the domain doesn’t stand still either. And finally, there are regulatory challenges.

Emily Lanza

Emily Lanza: “Update from the Copyright Office”

The Copyright Office is part of the legislative branch, providing advice to Congress and the courts. It also administers the copyright registration system.

As far back as 1965, the Copyright Office has weighed in on the copyrightability of computer-generated works. Recently, these issues have become far less theoretical. We have asked applicants to disclaim copyright in more-than-de-minimis AI-generated portions of their works. In August, we published a notice of inquiry and received more than 10,000 comments. And a human has read every single one of those comments.

Three main topics:

First, AI outputs that imitate human artists. These are issues like the Drake/Weeknd deepfake. Copyright law doesn’t cover these, but some state rights do. We have asked whether there should be a federal AI law.

Second, copyrightability of outputs. We have developed standards for examination. The first application involving AI-generated material came five years ago, filed entirely as a test case. We refused registration on the ground that human authorship is required; the D.C. District Court agreed and the case is on appeal. Other cases present less clear-cut facts. Our examiners have issued over 200 registrations with appropriate disclaimers, but we have also refused registration in three high-profile cases.

The central question in these more complex scenarios is when and how a human can exert control over the creativity developed by the AI system. We continue to draw these lines on a case-by-case basis, and at some point the courts will weigh in as well.

Third, the use of human works to train AIs. There are 20 lawsuits in U.S. courts. The fair use analysis is complex, including precedents such as Google Books and Warhol v. Goldsmith. We have asked follow-up questions about consent and compensation. Can it be done through licensing, or through collective licensing, or would a new form of compulsory licensing be desirable? Can copyright owners opt in or out? How would it work?

Relatedly, the study will consider how to allocate liability between developers, operators, and users. Our goal is balance. We want to promote the development of this exciting technology, while continuing to allow human creators to thrive.

We also need to be aware of developments elsewhere. Our study asks whether approaches in any other countries should be adopted or avoided in the United States.

We are not the only ones evaluating this. Congress has been busy, too, holding hearings as recently as last week. The Biden Administration issued an executive order in October. Other agencies are involved, including the FTC (prohibition on impersonation through AI-enabled deepfakes), and FEC (AI in political ads).

We plan to issue a report. The first section will focus on digital replicas and will be published this spring. The second section will be published this summer and will deal with the copyrightability of outputs. Later sections will deal with training and more. We aim to publish all of it by the end of the fiscal year, September 30. We will also revise the Compendium, and there will be a study by economists about copyright and generative AI.

Sejal Amin

Sejal Amin (CTO at Shutterstock): “Ensuring TRUST: Programs for Royalties in the Age of AI”

Shutterstock was founded in 2003 and has since become an immense marketplace for images, video, music, 3D, design tools, etc. It has been investing in AI capabilities as well. Showing images generated by Shutterstock’s AI tools. Not confined to any genre or style. Shutterstock’s framework is TRUST. I’m going to focus today on the R, Royalties.

Today’s AI economy is not really contributing to the creators who enable it. Unregulated crawling helps a single beneficiary. In 2023, Shutterstock launched a contributor fund that provides ongoing royalties tied to licensing for newly generated assets.

The current model gives contributors an equal share per image for contributions that are used in training Shutterstock’s models. There is also compensation by similarity, or by popularity. These models have problems. Popularity is not a proxy for quality; it leads to a rich-get-richer phenomenon. And similarity is also flawed without a comprehensive understanding of the world.

For us, quality is a high priority. High-quality content is an essential input into the training process. How could we measure that? Of course, the word quality is nebulous. I’m going to focus on:

  • Aesthetic excellence
  • Safety
  • Diverse representation

A shared success model will need to understand customer demand.

Aesthetic excellence depends on technical proficiency (lighting, color balance) and visual appeal. Shutterstock screens materials for safety through both automated and human review. We have techniques to prevent generation of unsafe concepts. Diversity is important to all of us. We balance and improve representations of different genders, ethnicities, and orientations. Our fund attempts to support historically excluded creator groups. Our goal is shared success.

David Bau

David Bau: “Unlearning from Generative AI Models”

Unlearning asks: “Can I make my neural network forget something it learned?”

In training, a dataset with billions of inputs is run through the training process, and the resulting model can then generate a potentially infinite range of outputs. The network’s job is to generalize, not memorize. If you prompt Stable Diffusion for “astronaut riding a horse on the moon,” there is no such image in the training set; it will generalize to create one.

SD is trained on about 100 TB of data, but the SD model is only about 4GB of network weights. We intentionally make these nets too small to memorize everything. That’s why they must generalize.

But still, sometimes a network does memorize. Carlini et al. showed that there is substantial memorization in some LLMs, and the New York Times found out that there is memorization in ChatGPT.

In a search engine, takedowns are easy to implement because you know “where” the information is. In a neural network, however, it’s very hard to localize where the information is.

There are two kinds of things you might want to unlearn. First, verbatim regurgitation, second, unwanted generalized knowledge (artist’s style, undesired concepts like nudity or hate, or dangerous knowledge like hacking techniques).

Three approaches to unlearning:

  1. Remove from the training data and retrain. But this is extremely expensive.
  2. Modify the model. Fine-tuning is hard because we don’t know where to erase. There are some “undo” ideas or targeting specific concepts.
  3. Filter outputs. Add a ContentID-like system to remove some outputs. This is a practical system for copyright compliance, but it’s hard to filter generalized knowledge and the filter is removable from open-source models.

Fundamentally, unlearning is tricky and will require combining approaches. The big challenge is how to improve the transparency of a system not directly designed by people.

Alicia Solow-Niederman

Alicia Solow-Niederman: “Privacy, Transformed? Lessons from GenAI”

GenAI exposes underlying weak spots. One kind of weak spot is weaknesses in a discipline’s understanding (e.g., U.S. privacy law’s individualistic focus). Another is weaknesses in cross-disciplinary conversations (technologists and lawyers talking about privacy).

Privacy in GenAI: cases where private data goes into the system. It can arise directly: I prompt a system with my medical data, or a law-firm associate uses a chatbot to generate a contract with confidential client data. It can arise indirectly when a non-AI company licenses sensitive data for training. For example, 404 reported that Automattic was negotiating to license Tumblr data. Automattic offered an opt-out, a solution that embraces the individual-control model. This is truly a privacy question, not a copyright one. And we can’t think about it without thinking about what privacy should be as a social value.

Privacy out of GenAI: When does private data leak out of a GenAI system? We’ve already talked about memorization followed by a prompt that exposes it. (E.g., the poem poem poem attack.) Another problem is out-of-context disclosures. E.g., ChatGPT 3.5 “leaked a random dude’s photo”–a working theory is that this photo was uploaded in 2016 and ChatGPT created a random URL as part of its response. Policy question: how much can technical intervention mitigate this kind of risk?

Privacy through GenAI: ways in which the use of the technology itself violates privacy. E.g., GenAI tools used to infer personal information: chatbots can discern age and geography from datasets like Reddit. The very use of a GenAI tool might lead to violations of existing protections. The earlier example of GenAI for health-care documentation is a good illustration of this kind of setting.

Technical patches risk distracting us from more important policy questions.

Niloofar Mireshghallah

Niloofar Mireshghallah: “What is differential privacy? And what is it not?”

A big part of the success of generative AI is the role of training data. Most of the data is web-scraped, but this might not have been intended to be public.

But the privacy issues are not entirely new. The census collects data on name, age, sex, race, etc. This is used for purposes like redistricting. But this data could also be used to make inferences, e.g., where are there mixed-race couples? The obvious approach is to withhold some fields, such as name, but often the data can be reconstructed.

Differential privacy is a way of formalizing the idea that nothing can be learned about a participant in a database: is the database with the record distinguishable from the database without it? The key concept here is a privacy budget, which quantifies how much privacy can be lost through queries of a (partially obscured) database. Common patterns are still visible, but uncommon patterns are not.
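[A gloss from me, not the speaker’s slide: formally, a randomized mechanism M is ε-differentially private if the inequality below holds for every pair of databases D and D′ differing in a single record and every set of outputs S; ε is the privacy budget, and smaller ε means the with-the-record and without-the-record worlds are harder to tell apart.]

    \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]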

But privacy under DP comes at the cost of data utility. The more privacy you want, the more noise you need to add, and hence the less useful the data. And it has a disproportionate impact on the tails of the distribution, e.g., more inaccuracy in the census measurements of the Hispanic population.

Back to GenAI. Suppose I want to release a medical dataset with textual data about patients. Three patients have covid and a cough; one patient has a lumbar puncture. It’s very hard to apply differential privacy to text rather than tabular data. It’s not easy to draw clear boundaries between records in text. There are also ownership issues, e.g., “Bob, did you hear about Alice’s divorce?” applies to both Bob and Alice.

If we try applying DP with each patient’s data as a record, we get a many-to-many version. The three covid patients get converted into similar covid patients; we can still see the covid/cough relationship. But it does not detect and obfuscate “sensitive” information while keeping “necessary” information intact. We’ll still see “the CT machine at the hospital is broken.” This is repeated, but in context it could be identifying and shouldn’t be revealed. That is, repeated information might be sensitive! The fact that a lumbar puncture requires local anesthesia might appear only once, but that’s still a fact that ought to be learned; it’s not sensitive. DP is not good at capturing these nuances or these needle-in-a-haystack situations. There are similarly messy issues with images. Do we even care about celebrity photos? There are lots of contextual nuances.

Panel Discussion

[Panel omitted because I’m on it]

Andreas Terzis

Andreas Terzis: “Privacy Review”

Language models learn from their training data a probability distribution over token sequences: the probability of each token given the previous tokens. Can they memorize rare or unique training-data sequences? Yes, yes, yes. So we ask: do actual LLMs memorize their training data?

Approach: use the LLM to generate a lot of data, and then predict whether an example was a member of the training data. If it has a high likelihood of being generated, then it’s memorized; if not, then no. In 2021, they showed that memorization happens in actual models, and since then we have learned that scale exacerbates the issue. Larger models memorize more.
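[A sketch from me of the likelihood test being described, assuming the Hugging Face transformers library; “gpt2” and the candidate string are stand-ins, and real extraction studies compare against baselines such as a reference model or compression-based scores.]

    # Score how strongly a model "expects" a candidate string; unusually high
    # likelihood relative to comparable text is treated as evidence of memorization.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def avg_log_likelihood(text):
        """Average per-token log-likelihood the model assigns to `text`."""
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        return -out.loss.item()  # loss is the mean negative log-likelihood

    candidate = "We the People of the United States, in Order to form a more perfect Union"
    print(avg_log_likelihood(candidate))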

Alignment seems to hide memorization, but not to prevent it. An aligned model might not return training data, but it can be prompted (e.g., “poem poem poem”) in ways that elicit it. And memorization happens with multimodal models too.

Privacy testing approaches: first, “secret sharer” involves inserting controlled canaries into the training data. This requires access to the model and can also pollute it. “Data extraction” only requires access to the interface but may underestimate the actual amount of memorization.

There are tools to remove what might be sensitive data from training datasets. But they may not find all sensitive data (“andreas at google dot com”), and on the flipside, LLMs might benefit from knowing what sensitive data looks like.

There are also safety-filter tools. They stop an LLM from generating outputs that violate the provider’s policies. This is helpful in preventing verbatim regurgitation but can potentially be circumvented.

Differential privacy: use training-time noise to provide reduced sensitivity to specific rarer examples. This introduces a privacy-utility tradeoff. (And as we saw in the morning, it can be hard to adapt DP to some types of data and models.)

Deduplication can reduce memorization, because the more often an example is trained on, the more likely it is to be memorized. The model itself is also likely to be better (faster to train, with fewer resources spent memorizing duplicates).

Privacy-preserving LLMs train on data intended to be public, and then fine-tune locally on user-contributed data. This and the techniques on the previous slides can be combined to provide a layered defense.

Dave Willner

Dave Willner: “How to Build Trust & Safety for and with AI”

Top-down take from a risk management perspective. We are in a world where a closed system is a very different thing to build trust and safety for than an open system, and I will address them both.

Dealing with AI isn’t a new problem. Generative AI is a new way of producing content. But we have 15-20 years of experience in moderating content. There is good reason to think that generative-AI systems will make us better at moderating content; they may be able to substitute for human moderators. And the models offer us new sites of intervention, in the models themselves.

First, do product-specific risk assessment. (Standard T&S approach: actors, behaviors, and content.) Think about genre (text, image, multimodal, etc.). Ask how frequent some of this content is. And how is this system specifically useful to people who want to generate content you don’t want them to?

Next, take a defense-in-depth approach. You have your central model and a series of layers around it, so the only viable approach is to stack as many layers of mitigations as possible.

  • Control access to the system; you don’t need to let people trying to abuse the system have infinite chances.
  • Monitor your outputs to see what the model is producing.
  • You need to have people investigating anomalies and seeing what’s happening. That can drive recalibration and adjustment.

In addition, invest in learning to use AI to augment all of the things I just talked about. All of these techniques rely on human classification. This is error-prone work that humans are not good at and that takes a big toll on them. We should expect generative-AI systems to play a significant role here; early experiments are promising.

In an open-source world, that removes centralized gatekeepers … which means removing centralized gatekeepers. I do worry we’re facing a tragedy of the commons. Pollution from externalities from models is a thing to keep in mind, especially with the more severe risks. We are already seeing significant CSAM.

There may not be good solutions here with no downsides. Openness versus safety may involve hard tradeoffs.

Nicholas Carlini

Nicholas Carlini: “What watermarking can and can not do”

A watermark is a mark placed on top of a piece of media to identify it as machine generated.

For example: an image with a bunch of text put on top of it, or a disclaimer at the start of a text passage. Yes, we can watermark, but these are not good watermarks; they obscure the content.

Better question: can we usefully watermark? The image has a subtle watermark that is present in the pixels. And the text example was watermarked, too, based on some of the bigram probabilities.
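[My sketch, not Nicholas’s: a toy version of the kind of bigram-based text watermark he is alluding to, in the spirit of “green list” schemes. The vocabulary split and the 0.5 baseline are illustrative, not a production design.]

    # The previous token seeds a pseudorandom split of the vocabulary into
    # "green" and "red" halves. A watermarking generator biases sampling toward
    # green tokens; the detector counts how often each token lands in its
    # predecessor's green half.
    import hashlib

    def is_green(prev_token, token):
        digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
        return digest[0] % 2 == 0

    def green_fraction(tokens):
        if len(tokens) < 2:
            return 0.0
        hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
        return hits / (len(tokens) - 1)

    # Ordinary text should score near 0.5; text generated with a green-list
    # bias should score noticeably higher, and that gap is the watermark signal.
    sample = "the cat sat on the mat and looked out at the rain".split()
    print(green_fraction(sample))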

But even this isn’t enough. The question is what are your requirements? We want watermarks to be useful for some end task. For example, people want undetectable watermarks. But most undetectable watermarks are easy to remove–e.g., flip it left-to-right, or JPEG compress it. Other people want unremovable watermarks. By whom? An 8-year-old or a CS professional? Unremovable watermarks are also often detectable. Some people want unforgeable watermarks, so they can verify the authenticity of photos.

Some examples of watermarks designed to be unremovable.

Here’s a watermarked image of a tabby cat. An ML image-recognition model recognizes it as a tabby cat with 88% confidence. An adversarial perturbation can change the image in ways indistinguishable to us humans, yet it is classified as “guacamole” with 99% confidence. Almost all ML classifiers are vulnerable to this. Synthetic fake images can likewise be tweaked to look like real ones with trivial variations, such as texture in the pattern of hair.
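[Again a sketch from me, assuming PyTorch and torchvision; the random input image, the epsilon, and the target class index are placeholders, and a single gradient step will not always flip the label.]

    # Minimal targeted FGSM-style perturbation. A real attack would start from
    # an actual (normalized) photo and usually iterate.
    import torch
    import torch.nn.functional as F
    import torchvision.models as models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for the cat photo
    target = torch.tensor([924])  # index commonly listed for "guacamole" in ImageNet

    loss = F.cross_entropy(model(image), target)
    loss.backward()

    # Step *against* the gradient of the target-class loss: the pixels barely
    # change, but the classifier can be pushed toward the target label.
    epsilon = 0.01
    adversarial = (image - epsilon * image.grad.sign()).clamp(0, 1).detach()
    print(model(adversarial).argmax(dim=1))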

Should we watermark? It comes down to whether we’re watermarking in a setting where we can achieve our goals. What are you using it for? How robustly? Who is the adversary? Is there even an adversary?

Here are five goals of watermarking:

  1. Don’t train on your own outputs to avoid model collapse. Hope that most people who copy the text leave the watermark in.
  2. Provide information to users so they know whether the pope actually wore a puffy coat. There are some malicious users, but mostly not many.
  3. Detect spam. Maybe the spammers will be able to remove the watermark, maybe not.
  4. Detect disinformation. Harder.
  5. Detect targeted abuse. Harder still. As soon as there is a human in the loop, it’s hard to make a watermark stick. Reword the text, take a picture of the image. Can we? Should we? Maybe.

Raquel Vazquez Llorente

Raquel Vazquez Llorente: “Provenance, Authenticity and Transparency in Synthetic Media”

Talking about indirect disclosure mechanisms, but I consider detection to be a close cousin. We just tested detection tools and broke them all.

Witness helps people use media and tech to protect their rights. We are moving fast to a world where human and AI content don’t just coexist but intermingle. Think of inpainting and outpainting. Think of how phones include options to enhance image quality or allow in-camera editing.

It’s hard to address AI content in isolation from this. We’ve also seen that deception is as much about context as it is about content. Watermarks, fingerprints, and metadata all provide important information, but they don’t establish the truth of the content.

Finally, legal authentication. There is a lot of work in open-source investigations. The justice system plays an essential role in protecting rights and democracy. People in power have dismissed content as “fake” or “manipulated” when they want to avoid its impact.

Three challenges:

  1. Identity. Witness has insisted that identity doesn’t need to be a condition of authentication. A system should avoid collecting personal information by default, because it can open up the door to government overreach.
  2. Data storage, ownership, and access. Collecting PII that connects to activists means they could be targeted.
  3. Access to tools and mandatory usage. Who is included and excluded is a crucial issue. Provenance can verify content, but actual analysis is important.

Erie Meyer

Erie Meyer: “Algorithmic Disgorgement in the Age of Generative AI”

CFPB sues companies to protect consumers from unfair, deceptive, or abusive practices: medical debt, credit reports, repeat-offender firms, etc. We investigate, litigate, and supervise.

Every one of the top-ten commercial banks uses chatbots. CFPB found that people were being harmed by poorly deployed chatbots that sent users into doom loops. You get stuck with a robot that doesn’t make sense.

Existing federal financial laws say that impeding customers from solving problems can be a violation of law: for example, if the technology fails to recognize that consumers are invoking their federal rights, or fails to protect their private information. Firms also have an obligation to respond to consumer disputes and competently interact with customers. It’s not radical to say that technology should make things better, not worse. CFPB knew it needed to do this report because it publishes its complaints online. Searching those complaints for the word “human” pulled up a huge number of them.

Last year, CFPB put out a statement that “AI is Not an Excuse for Breaking the Law.” Bright-line rules benefit small companies by giving them clear guidance without needing a giant team to study the law. They also make it clear when a company is compliant or not, and when an investigation is needed.

An example: black-box credit models. Existing credit laws require firms making credit decisions to tell consumers why they made a decision. FCRA has use limitations, accuracy and explainability requirements, and a private right of action. E.g., targeted advertising is not on the list of allowed uses. CFPB has a forthcoming FCRA rulemaking.

When things go wrong, I think about competition, repeat offenders, and relationships. A firm shouldn’t get an edge over its competitors from using ill-gotten data. Repeat offenders are an indication that enforcement hasn’t shifted the firm’s incentives. Relationships: does someone answer the phone, can you get straight answers, do you know that Erica isn’t a human?

The audiences for our work are individuals, corporations, and the industry as a whole. For people: What does a person do when their data is misused? What makes me whole? How do I get my data out? For corporations, some companies violate federal laws repeatedly. And for the industry, what do others in the industry learn from enforcement actions?

Finally, disgorgement: I’ll cite an FTC case against Google. The reason not to let Google settle was that while the “data” was deleted, the data enhancements were used to target others.

What keeps me up at night is that it’s hard to get great legislation on the books.

Elham Tabassi

Elham Tabassi: “Update from US AI Safety Institute”

NIST is a non-regulatory agency under the Department of Commerce. We cultivate trust in technology and promote innovation. We promote measurement science and technologically valid standards. We work through a multi-stakeholder process. We try to identify which measurement techniques are valuable and effective.

Since 2023, we have:

  • Released an AI Risk Management Framework, along with a companion roadmap.
  • Built a Trustworthy AI Resource Center and launched a Generative AI Public Working Group.
  • Begun work on the long list of AI guidelines that EO 14110 asks NIST to develop; we have been busy on those drafts, issued a request for information, and will be doing listening sessions.
  • Started on a report on synthetic content authentication, which will be followed by guidance.
  • Taken on a lot of other tasks, most of which have deadlines at the end of July, but the work will continue after that.
  • Stood up a consortium with working groups to implement the different EO components.

Nitarshan Rajkumar

Nitarshan Rajkumar: “Update from UK AI Safety Institute”

Our focus is to equip governments with an empirical understanding of the safety of frontier AI systems. It’s built as a startup within government, with seed funding of £100 million, and extensive talent, partnerships, and access to models and compute.

UK government has tried to mobilize international coordination, starting with an AI safety summit at Bletchley Park. We’re doing consensus-building at a scientific level, trying to do for AI safety what IPCC has done for climate change.

We have four domains of testing work:

  • Misuse: do advanced AI systems meaningfully lower barriers for bad actors seeking to cause harm?
  • Societal impacts: how are AI systems actually used in the real world, and with what effects on individuals and society?
  • Autonomous systems: this includes reproduction, persuasion, and creating more capable AI models.
  • Safeguards: evaluating the effectiveness of advanced AI safety systems.

We have four approaches to evaluations:

  • Automated benchmarking (e.g. Q&A sets). Currently broad but shallow baselines.
  • Red-teaming: deploying domain experts to manually interact with the model
  • Human-uplift studies: how does AI change the capabilities of novices?
  • Agents and tools: it’s possible that agents will become a more common way of interacting with AI systems

Panel Discussion

Katherine: What kinds of legislation and rules do we need?

Amba: The lobbying landscape complicates efforts. Industry has pushed for auditing mandates to undercut bright-line rules. E.g., facial-recognition auditing was used to undercut pushes to ban facial-recognition. Maybe we’re not talking enough about incentives.

Raquel: When we’re talking about generative-AI, we’re also talking about the broader information landscape. Content moderation is incredibly thorny. Dave knows so much, but the incentives are so bad. If companies are incentivized by optimizing advertising, data collection, and attention, then content moderation is connected to enforcing a misaligned system. We have a chance to shape these rules right now.

Dave: I think incentive problems affect product design rather than content moderation. The ugly reality of content moderation is that we’re not very good at it. There are huge technique gaps, humans don’t scale.

Katherine: What’s the difference between product design and content moderation?

Dave: ChatGPT is a single-player experience, so some forms of abuse are mechanically impossible. That kind of choice has much more of an impact on abuse as a whole.

Katherine: We’ve talked about standards. What about when standards fail? What are the remedies? Who’s responsible?

Amba: Regulatory proposals and regimes (e.g. DSA) that focus on auditing and evaluation have two weaknesses. First, they’re weakest on consequences: what if harm is discovered? Second, internal auditing is most effective (that’s where the expertise and resources are) but it’s not a substitute for external auditing. (“Companies shouldn’t be grading their own homework.”) Too many companies are on the AI-auditing gravy train, and they haven’t done enough to show that their auditing is at the level of effectiveness it needs to be. Scrutinize the business dynamics.

Nicholas: In computer security, there are two types of audits. Compliance audits check boxes to sell products, and actual audits where someone is telling you what you’re doing wrong. There are two different kinds of companies. I’m worried about the same thing happening here.

Elham: Another exacerbating factor is that we don’t know how to do this well. From our point of view, we’re trying to untangle these two, and come up with objective methods for passing and failing.

Question: Do folks have any reflection on approaches more aligned with transparency?

Nicholas: Happy to talk when I’m not on the panel.

Raquel: A few years ago, I was working on developing an authentication product. We got a lot of backlash from the human-rights community. We hired different sets of penetration testers to audit the technology, and then we’d spend resources on patching. We equate open source with security, but even though we offered people the code many times, there wasn’t a huge amount of technical expertise available to review it.

Hoda: Right now, we don’t even have the right incentives to create standards except for companies’ bottom line. How do your agencies try to balance industry expertise with impacted communities?

Elham: Technologies change fast, so expertise is very important. We don’t know enough; the operative word is “we,” and collaboration is important.

Nitarshan: Key word is “iterative.” Do the work, make mistakes, learn from them, improve software, platform, and tooling.

Elham: We talk about policies we can put in place afterwards to check for safety and security. But these should also be part of the discussion of design. We want technologies that make it easy to do the right thing, hard to do the wrong thing, and easy to recover. Think of three-prong power outlets. We are not a standards-development organization; industry can lead standards development. The government’s job is to support these efforts as a neutral third party.

Question: What are the differences in how various institutions understand AI safety? E.g., protecting the company versus addressing threats to democracy and human rights?

Nitarshan: People had an incorrect perception that we were focused on existential risk, but we prominently platformed societal and other risks. We think of the risks as quite broad.

Katherine: Today, we’ve been zooming in and out. Safety is really interesting because we have tools that are the same across all of these topics, even though the same techniques don’t necessarily work for privacy and copyright. Alignment, filters, etc. are a toolkit that is not necessarily specified to any one problem. It’s about models that don’t do what we want them to do.

Let’s talk about trust and safety. Some people think there’s a tradeoff between safe and private systems.

Dave: That is true especially early on in the development of a technology when we don’t understand it. But maybe not in the long run. For now, observation for learning purposes is important.

Andreas: Why would the system need to know more about individuals to protect them?

Dave: It depends on what privacy means. If privacy means “personal data,” then no; but if privacy means “scrutiny of your usage,” then yes.

Katherine: Maybe I’m generating a picture of a Mormon holding a cup of coffee. Depending on what we consider a violation, we’d need to know more about them, or to know what they care about. Or to know the age and context of a child.

Andreas: People have control over what they want to disclose about themselves; that can also be used in responding.

Question: How do you think about whether models are fine to use only with certain controls, or should we avoid models that are brittle?

Dave: I’m very skeptical of brittle controls (terms of service, some refusals). Solving the brittleness of model-level mitigations is an important technical problem if you want to see open-source flourish. The right level to work at is the level you can make stick in the face of someone who is not trying to be cooperative. Miscalibration is different than adversarial misuse. Right now, nothing is robust if someone can download the model and run it themselves.

Erie: What advice do you have for federal regulators who want to develop relationships with technical communities? How do you encourage whistleblowers?

Amba: Researchers are still telling us that problems with existing models remain unsolved; the genie is already out of the bottle. We’re not talking about risks out on the horizon: privacy, security, and bias harms are here right now.

Nicholas: I would be fine raising problems if I noticed them; I say things that get me in trouble in many circumstances. There are cases where it’s not worth getting in trouble–when I don’t have anything technically useful to add to the conversation.

Dave: People who work in these parts of companies are not doing it because they love glory and are feeling relaxed. They’re doing it because they genuinely care. That sentiment is fairly widespread.

Andreas: We’re here and we publish. There is a fairly vibrant community of open-source evaluations; in many ways they’re the most trustworthy. Maybe it’s starting to happen for security as well.

Katherine: Are proposed requirements for watermarking misguided?

Nicholas: As a technical problem, I want to know whether it works. In adversarial settings, not yet. In non-adversarial settings, it can work fine.

Katherine: People also mention homomorphic encryption–

Nicholas: That has nothing to do with watermarking.

Katherine: –blockchain–

Nicholas: That’s dumb.

Raquel: There’s been too much emphasis on watermarking from a regulatory perspective. If we don’t embed media literacy, I’m worried about people looking at a content credential and misunderstanding what it covers.

Question: Is there value in safeguards that are easy to remove but hard to remove by accident?

Dave: It depends on the problem you’re trying to solve.

Nicholas: This is the reason why depositions exist.

Raquel: This veers into UX, and the design of the interface the user engages with.

Question: What makes a good scientific underpinning for an evaluation? Compare the standards for cryptographic hashes versus the standards for penetration testing? Is it about math versus process?

Nitarshan: These two aren’t in tension. It’s just that right now ML evaluation is more alchemy than science. We can work on developing better methods.


And that’s it, wrapping up a nearly nine-hour day!

How Licenses Learn

I have posted a new draft essay, How Licenses Learn. It is a joint work with Madiha Zahrah Choksi, a Ph.D. student in Information Science at Cornell Tech, and the paper itself is an extended and enriched version of her seminar paper from my Law of Software course from last spring. We presented it at the Data in Business and Society symposium at Lewis and Clark Law School in September, and the essay is forthcoming in the Lewis and Clark Law Review later this year.

Here is the abstract:

Open-source licenses are infrastructure that collaborative communities inhabit. These licenses don’t just define the legal terms under which members (and outsiders) can use and build on the contributions of others. They also reflect a community’s consensus on the reciprocal obligations that define it as a community. A license is a statement of values, in legally executable form, adapted for daily use. As such, a license must be designed, much as the software and hardware that open-source developers create. Sometimes an existing license is fit to purpose and can be adopted without extensive discussion. However, often the technical and social needs of a community do not precisely map onto existing licenses, or the community itself is divided about the norms a license should enforce. In these cases of breakdown, the community itself must debate and design its license, using the same social processes it uses to debate and design the other infrastructure it relies on, and the final goods it creates.

In this Article, we analyze four case studies of controversy over license design in open-source software and hardware ecosystems. We draw on Stewart Brand’s How Buildings Learn, a study of how physical buildings change over time as they are adapted and repurposed to deal with new circumstances by successive generations of users. Similarly, we describe how open-source licenses are adapted and repurposed by different communities confronting challenges. Debates over license drafting and interpretation are a key mechanism of achieving the necessary consensus for successful collaboration. The resulting licenses are the visible traces of the constant political work that sustains open-source collaboration. Successful licenses, like successful buildings, require ongoing maintenance, and the record of license changes over the years is a history of the communities that have inhabited them.

The paper has been a great pleasure to work on for many reasons. First and foremost is the joy of collaborative work. Madiha has done extensive research on how open-source communities handle both cooperation and conflict, and the stories in How Licenses Learn are just a small fraction of the ones she has studied. Like the communities she studies, Madiha engages with legal issues from the perspective of a well-informed non-lawyer, which helps a lot in understanding what is really going on in arguments over licensing.

Second, this paper was a chance for me to revisit some of the ideas in architectural theory that I have been chewing on since I wrote Regulation by Software in law school nearly 20 years ago. Larry Lessig famously connected software to architecture, and we found the metaphor illuminating in thinking about the “architecture” of software licensing. As always, I hope you enjoy reading this as much as I—as we—enjoyed writing it.

Scholars' Amicus Brief in the NetChoice Cases

Yesterday, along with twenty colleagues — in particular Gautam Hans, who served as counsel of record — I filed an amicus brief in the Supreme Court’s cases on Florida and Texas’s anti-content-moderation social-media laws, Moody v. NetChoice and NetChoice v. Paxton. The cases involve First Amendment challenges to laws that would prohibit platforms from wide swaths of content moderation. Florida’s prohibits removing or downranking any content posted by journalistic enterprises or by or about candidates for public office; Texas’s prohibits any viewpoint-based moderation of any content at all.

Our brief argues that these laws are unconstitutional restrictions on the rights of social-media users to find and receive the speech that they want to listen to. By prohibiting most content moderation, they force platforms to show users floods of content those users find repugnant, or are simply not interested in. This, we claim, is a form of compelled listening in violation of the First Amendment.

Here is the summary of our argument:

This case raises complex questions about social-media platforms’ First Amendment rights. But Florida Senate Bill 7072 (SB 7072) and Texas House Bill 20 (HB 20) also severely restrict platform users’ First Amendment rights to select the speech they listen to. That question is straightforward: such intrusions on listeners’ rights are flagrantly unconstitutional.

SB 7072 and HB 20 are the most radical experiments in compelled listening in United States history. These laws would force millions of Internet users to read billions of posts they have no interest in or affirmatively wish to avoid. This is compulsory, indiscriminate listening on a mass scale, and it is flagrantly unconstitutional.

Users rely on platforms’ content moderation to cope with the overwhelming volume of speech on the Internet. When platforms prevent unwanted posts from showing up in users’ feeds, they are not engaged in censorship. Quite the contrary. They are protecting users from a neverending torrent of harassment, spam, fraud, pornography, and other abuse — as well as material that is perfectly innocuous but simply not of interest to particular users. Indeed, if platforms did not engage in these forms of moderation against unwanted speech, the Internet would be completely unusable, because users would be unable to locate and listen to the speech they do want to receive.

Although these laws purport to impose neutrality among speakers, their true effect is to systematically favor speakers over listeners. SB 7072 and HB 20 prevent platforms from routing speech to users who want it and away from users who do not. They convert speakers’ undisputed First Amendment right to speak without government interference into something much stronger and far more dangerous: an absolute right for speakers to have their speech successfully thrust upon users, despite those users’ best efforts to avoid it.

In the entire history of the First Amendment, listeners have always had the freedom to seek out the speech of their choice. The content-moderation restrictions of SB 7072 and HB 20 take away that freedom. On that basis alone, they can and should be held unconstitutional.

This brief brings together nearly two decades of my thinking about Internet platforms, and while I’m sorry that it has been necessary to get involved in this litigation, I’m heartened at the breadth and depth of scholars who have joined together to make sure that users are heard. On a day when it felt like everyone was criticizing universities over their positions on free speech, it was good to be able to use my position at a university to take a public stand on behalf of free speech against one of its biggest threats: censorious state governments.

Just Shorthand for Young Women

It’s allowed a bunch of conservatives to rant in all kinds of insane ways about the degeneracy of “Gen Z,” which is just shorthand for “young women,” the same way the word “woke” is just shorthand for “minorities”.

Ryan Broderick

Thermal Observations on the Preparation of Hot Leaf Juice

I like tea, and I like the British way of making tea, as described by such experts as George Orwell, Douglas Adams, and my spouse. An important part of this process is steeping the tea in a pre-warmed mug or pot. When I can, I fill a mug with boiling water, wait for it to heat up, pour off the water, promptly fill it again with a fresh pour of boiling water, and only then add the teabag.

Using two pours of hot water rather than one improves the flavor of the resulting tea immensely. If two is better than one, then perhaps three might be better than two. Should I preheat the mug twice? This is a question that can be answered with science.

To a very rough approximation, a ceramic mug and the water in it weigh about the same (about 300 grams). The weight of the teabag (about 3 grams) is small enough to neglect. The specific heat of water (4,184 J/kg·˚C) is about four times the specific heat of the ceramic in the mug (~900 J/kg·˚C). Thus, they will reach equilibrium at about four fifths of the way from the initial temperature of the mug to the initial temperature of the water. If the mug starts at room temperature (20˚C) and the water is boiling (100˚C), this means the tea will brew at about 84˚C.

Pre-heating the mug once with a second pour of water means that the mug starts at 84˚C before the brewing water and tea are added. The water still starts at 100˚C and the specific heats are unchanged, so now the equilibrium is about 97˚C. That’s a big difference!

Preheating the mug a second time with a third pour of water repeats the process again. Now the mug starts at 97˚C, and the third pour brings it up to about 99˚C. That’s a small difference!

I conclude that on theoretical grounds, preheating your mug by using a second pour of water raises the brewing temperature by about 13˚C, enough to result in a substantial improvement in flavor. Preheating the mug again by using a third pour of water raises the brewing temperature by only about another 2˚C, which is not enough to make a noticeable difference unless you have a more refined palate than mine. Preheat once if you can, but you can stop there.
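If you want to check the arithmetic or plug in your own mug, the whole calculation fits in a few lines of Python (same assumptions as above: a 300-gram mug at about 900 J/kg·˚C, 300 grams of water at 4,184 J/kg·˚C, the mug starting at 20˚C, and every pour at 100˚C):

    # Equilibrium temperature each time boiling water meets the mug. Exact
    # ratios, so the numbers land a degree or two above the rounded
    # four-fifths estimates in the text.
    MUG_HEAT = 0.300 * 900      # heat capacity of the mug, J per degree C
    WATER_HEAT = 0.300 * 4184   # heat capacity of the water, J per degree C

    def equilibrium(mug_temp, water_temp=100.0):
        total = MUG_HEAT + WATER_HEAT
        return (MUG_HEAT * mug_temp + WATER_HEAT * water_temp) / total

    no_preheat = equilibrium(20.0)            # ~86 C
    one_preheat = equilibrium(no_preheat)     # ~97 C
    two_preheats = equilibrium(one_preheat)   # ~99.5 C
    print(round(no_preheat, 1), round(one_preheat, 1), round(two_preheats, 1))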

This result has been confirmed experimentally.