Evaluating LLMs is a minefield

Arvind Narayanan & Sayash Kapoor

Princeton University

Oct 4, 2023

Authors of the AI Snake Oil book and newsletter

Is ChatGPT getting worse over time?

Chen, Zaharia, Zou. How Is ChatGPT’s Behavior Changing over Time? arXiv 2023.

What we found

Narayanan & Kapoor. Is GPT-4 getting worse over time? AI Snake Oil (Substack) 2023.

No evidence of capability degradation.

 

But behavior changed in response to certain prompts.

 

Slightly different prompts needed to elicit capability.
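
To make this concrete: a minimal sketch of the kind of comparison involved, assuming the pre-1.0 openai Python client. The snapshot names are real, but the prompts and tasks here are illustrative stand-ins, not our actual harness.

```python
import openai  # pip install "openai<1.0"; expects OPENAI_API_KEY in the environment

SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # March vs. June 2023 GPT-4 snapshots
PROMPTS = {
    "original": "Is {n} a prime number? Think step by step and then answer yes or no.",
    "reworded": "Reason step by step, then give a final answer of yes or no: is {n} prime?",
}
TASKS = [(17077, "yes"), (17078, "no")]  # toy items, not a real test set

def ask(model, prompt):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run noise
    )
    return resp["choices"][0]["message"]["content"].lower()

for model in SNAPSHOTS:
    for name, template in PROMPTS.items():
        # naive scoring: look for the expected word near the end of the reply
        correct = sum(
            ans in ask(model, template.format(n=n)).rstrip(" .!").split()[-3:]
            for n, ans in TASKS
        )
        print(f"{model} / {name}: {correct}/{len(TASKS)} correct")
```

If accuracy moves with the rewording more than with the snapshot, the "degradation" is a prompt artifact, not a capability change.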

Three hard problems in LLM evaluation

Narayanan & Kapoor. Is GPT-4 getting worse over time? AI Snake Oil (Substack) 2023.

1. Prompt sensitivity

      Are you measuring something intrinsic to the model, or an artifact of your prompt? (One way to check is sketched after this list.)

2. Construct validity

 

3. Contamination
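
One way to quantify the first problem, sketched here with a hypothetical `score` helper rather than a tool from the talk: evaluate the same items under several paraphrases and report the spread.

```python
from statistics import mean, pstdev

# Paraphrases of the same instruction; extend with more rewordings as needed.
PARAPHRASES = [
    "Answer the question: {q}",
    "Q: {q}\nA:",
    "Please respond to the following question.\n{q}",
]

def prompt_sensitivity(score, dataset):
    """score(template, dataset) -> accuracy; a hypothetical evaluation helper."""
    accs = [score(template, dataset) for template in PARAPHRASES]
    return {"mean": mean(accs), "stdev": pstdev(accs), "range": max(accs) - min(accs)}
```

Reporting the mean with its spread across paraphrases, rather than a single prompt's accuracy, separates the model's capability from the prompt artifact.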

Does ChatGPT have a liberal bias?

Motoki, Neto, Rodrigues. More human than human: measuring ChatGPT political bias. Public Choice 2023.

What we found

Narayanan & Kapoor. Does ChatGPT have a liberal bias? AI Snake Oil (Substack) 2023.

We used the paper’s questions, posed directly. Asked this way, ChatGPT usually declined to opine.

 

Example opinion:

  “The freer the market,
    the freer the people.”

 

What went wrong in the paper?

Narayanan & Kapoor. Does ChatGPT have a liberal bias? AI Snake Oil (Substack) 2023.

1. Multiple-choice questions.

2. A further trick that forces the model to opine (contrasted with plain prompting in the sketch below).

3. ...
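
A hedged sketch of that contrast: our illustration, not the paper's code, again assuming the pre-1.0 openai client; the refusal check is deliberately crude.

```python
import openai

STATEMENT = "The freer the market, the freer the people."

FORCED = (
    f'Do you agree or disagree with the following statement? "{STATEMENT}"\n'
    "You must answer with exactly one of: strongly agree, agree, disagree, strongly disagree."
)
PLAIN = f'What do you think about this statement? "{STATEMENT}"'

def reply(prompt):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

for name, prompt in [("forced choice", FORCED), ("plain question", PLAIN)]:
    text = reply(prompt).lower()
    # crude heuristic: did the model actually pick a side?
    opined = any(word in text for word in ("strongly agree", "agree", "disagree"))
    print(f"{name}: {'opines' if opined else 'declines or hedges'}")
```

The forced-choice framing tends to elicit an opinion that the plain question does not, which is why the paper's numbers don't describe how the chatbot behaves for real users.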

Three hard problems in LLM evaluation

1. Prompt sensitivity

2. Construct validity

3. Contamination

No way to study political bias

and many other questions

Narayanan & Kapoor. Generative AI companies must publish transparency reports. AI Snake Oil (Substack) 2023.

Hypothesis: chatbots’ political bias is not a construct that exists independently of a population of users.


Naturalistic observation is necessary. How?

1. Generative AI companies must publish transparency reports.

2. Researchers could create corpora of real-world use.

Did GPT-4 pass the bar and USMLE?

Narayanan & Kapoor. GPT-4 and professional benchmarks: the wrong answer to the wrong question. AI Snake Oil (Substack) 2023.

Or did it simply memorize the answers?

 

Evidence of contamination:

      Perfect results on coding problems published before September 5, 2021, and zero on problems published afterwards.

      But for the legal and medical benchmarks, we can't be sure.
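
A minimal sketch of this cutoff test, with a hypothetical `solved` predicate standing in for a real grading harness:

```python
from datetime import date

CUTOFF = date(2021, 9, 5)  # training-data cutoff referenced above

def contamination_check(problems, solved):
    """problems: iterable of (problem, published: date); solved(problem) -> bool."""
    before = [p for p, published in problems if published < CUTOFF]
    after = [p for p, published in problems if published >= CUTOFF]
    rate = lambda items: sum(map(solved, items)) / max(len(items), 1)
    return {"solve_rate_before": rate(before), "solve_rate_after": rate(after)}
```

A cliff at the cutoff (near-perfect before, near-zero after) is the signature of memorization rather than capability.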

 

Did GPT-4 pass the bar and USMLE?

Construct validity:

     Exams designed for humans measure underlying abilities that generalize to real-world situations. When applied to LLMs, they tell us almost nothing.

Narayanan & Kapoor. GPT-4 and professional benchmarks: the wrong answer to the wrong question. AI Snake Oil (Substack) 2023.

The reproducibility crisis
in ML-based science

Systematic reviews in over a dozen fields have found that large fractions of ML-based studies are faulty.

 

Kapoor & Narayanan. Leakage and the Reproducibility Crisis in ML-based Science. Patterns 2023.

 

Kapoor et al. REFORMS: Reporting Standards for ML-based Science. Manuscript, 2023.
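
One concrete leakage pattern from the Patterns paper's taxonomy, sketched on synthetic data: fitting a preprocessing step on the full dataset before the train/test split.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=5, random_state=0)

# LEAKY: feature selection sees the test labels, inflating test accuracy.
X_sel = SelectKBest(k=10).fit_transform(X, y)
Xtr, Xte, ytr, yte = train_test_split(X_sel, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# CORRECT: split first, then fit the selector on the training fold only.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
selector = SelectKBest(k=10).fit(Xtr, ytr)
honest = LogisticRegression(max_iter=1000).fit(
    selector.transform(Xtr), ytr).score(selector.transform(Xte), yte)

print(f"leaky estimate: {leaky:.2f}   honest estimate: {honest:.2f}")
```

The numbers are illustrative; the point is that the same model and data yield a materially inflated estimate when preprocessing precedes the split.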

Harms arising from inadequate understanding of limitations

Should LLMs be used to evaluate grant proposals?

 

No! They focus on the style of the text rather than its scientific content.

Takeaways

Evaluating LLMs is hard:

    prompt sensitivity, construct validity, contamination.

Faulty methods in research on LLMs and research using LLMs.

Harms arising from inadequate understanding of limitations.

Closed LLMs: further reproducibility hurdles

Kapoor & Narayanan. OpenAI’s policies hinder reproducible research on language models. AI Snake Oil (Substack) 2023.

The future of open source AI hangs in the balance

Kapoor & Narayanan. Licensing is neither feasible nor effective for addressing AI risks. AI Snake Oil (Substack) 2023.

AI fears have led to dubious policy proposals to require licenses to build AI.

Strengthening open approaches to AI

Princeton Language & Intelligence:

 

A research initiative committed to keeping AI expertise and know-how in the public sphere.

We'll continue to cover this topic on the AI Snake Oil newsletter.