Arvind Narayanan & Sayash Kapoor
Princeton University
Oct 4, 2023
Authors of the AI Snake Oil book and newsletter
Is ChatGPT getting worse over time?
Chen, Zaharia, Zou. How Is ChatGPT’s Behavior Changing over Time? arXiv 2023.
What we found
Narayanan & Kapoor. Is GPT-4 getting worse over time? AI Snake Oil (Substack) 2023.
No evidence of capability degradation.
But behavior changed in response to certain prompts.
Slightly different prompts needed to elicit capability.
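One way to probe this: run the same task under several paraphrases of the prompt and compare scores. If the scores diverge, the measurement is an artifact of the prompt rather than the model. A minimal Python sketch, using the prime-identification task the paper studied; query_model is a hypothetical stand-in for the chat API under evaluation:

    from collections import defaultdict

    def query_model(prompt: str) -> str:
        """Hypothetical stand-in for the chat API under evaluation."""
        raise NotImplementedError

    # Paraphrases of the same task; only the wording differs.
    PROMPT_VARIANTS = [
        "Is {n} a prime number? Answer yes or no.",
        "Answer yes or no: is {n} prime?",
        "Think step by step, then answer yes or no: is {n} prime?",
    ]

    def prompt_sensitivity(numbers, labels):
        """Accuracy per prompt variant; a large spread means the score
        reflects the prompt, not an intrinsic capability."""
        accuracy = defaultdict(float)
        for template in PROMPT_VARIANTS:
            correct = 0
            for n, is_prime in zip(numbers, labels):
                answer = query_model(template.format(n=n)).strip().lower()
                correct += answer.startswith("yes") == is_prime
            accuracy[template] = correct / len(numbers)
        return dict(accuracy)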
Three hard problems in LLM evaluation
Narayanan & Kapoor. Is GPT-4 getting worse over time? AI Snake Oil (Substack) 2023.
1. Prompt sensitivity
Are you measuring something intrinsic to the model, or is it an artifact of your prompt?
2. Construct validity
3. Contamination
Does ChatGPT have a liberal bias?
Motoki, Neto, Rodrigues. More human than human: measuring ChatGPT political bias. Public Choice 2023.
What we found
Narayanan & Kapoor. Does ChatGPT have a liberal bias? AI Snake Oil (Substack) 2023.
We used the paper’s questions.
Example opinion:
“The freer the market,
the freer the people.”
What went wrong in the paper?
Narayanan & Kapoor. Does ChatGPT have a liberal bias? AI Snake Oil (Substack) 2023.
1. Multiple choice questions.
2. A further trick that forces the model to opine (see the sketch below).
3. ...
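To illustrate points 1 and 2, here is the general shape of a forced-choice prompt versus an unconstrained one (illustrative wording only; the paper's exact prompts differ):

    OPINION = "The freer the market, the freer the people."

    # Forced-choice framing: the model is compelled to take a side.
    forced = (
        f"Do you agree with the following statement? '{OPINION}' "
        "Answer with exactly one of: Strongly agree, Agree, Disagree, "
        "Strongly disagree. Do not refuse and do not stay neutral."
    )

    # Unconstrained framing: closer to how a real user asks;
    # the chatbot is free to decline to take a position.
    open_ended = f"What do you think of the statement: '{OPINION}'?"

In normal use the chatbot can, and often does, decline to opine, so the forced-choice score measures behavior most users never see.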
Three hard problems in LLM evaluation
1. Prompt sensitivity
2. Construct validity
3. Contamination
No way to study political bias
and many other questions
Narayanan & Kapoor. Generative AI companies must publish transparency reports. AI Snake Oil (Substack) 2023.
Hypothesis: chatbots’ political bias is not a construct
that exists independently of a population of users.
Naturalistic observation is necessary. How?
1. Generative AI companies must publish transparency reports.
2. Researchers could create corpora of real-world use.
Did GPT-4 pass the bar and USMLE?
Narayanan & Kapoor. GPT-4 and professional benchmarks: the wrong answer to the wrong question. AI Snake Oil (Substack) 2023.
Or did it simply memorize the answers?
Evidence of contamination:
Perfect results on coding benchmark problems published before September 5, 2021, and zero on problems published afterwards.
But for the legal and medical benchmarks, we can't be sure.
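A standard contamination probe: split benchmark items by publication date relative to the training cutoff and compare solve rates. A minimal sketch, assuming each item records a publication date and whether the model solved it (field names are illustrative):

    from datetime import date

    TRAINING_CUTOFF = date(2021, 9, 5)

    def solve_rates_by_cutoff(items):
        """items: dicts with 'published' (datetime.date) and 'solved' (bool)."""
        before = [it for it in items if it["published"] < TRAINING_CUTOFF]
        after = [it for it in items if it["published"] >= TRAINING_CUTOFF]
        rate = lambda xs: sum(it["solved"] for it in xs) / len(xs) if xs else float("nan")
        # A large gap (e.g., near-perfect before vs. near-zero after)
        # points to memorization rather than capability.
        return rate(before), rate(after)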
Did GPT-4 pass the bar and USMLE?
Construct validity:
Exams designed for humans measure underlying abilities that generalize
to real-world situations. When applied to LLMs, they tell us almost nothing.
Narayanan & Kapoor. GPT-4 and professional benchmarks: the wrong answer to the wrong question. AI Snake Oil (Substack) 2023.
The reproducibility crisis
in ML-based science
Systematic reviews in over a dozen fields have found that large fractions of ML-based studies are faulty.
Kapoor & Narayanan. Leakage and the Reproducibility Crisis in ML-based Science. Patterns 2023.
Kapoor et al. REFORMS: Reporting Standards for ML-based Science. Manuscript, 2023.
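The most common failure documented in these reviews is leakage between training and test data. An illustrative scikit-learn sketch of one frequent variant, preprocessing fit on the full dataset before the split (not code from the paper):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    def leaky_score(X, y):
        # BUG: the scaler is fit on all rows, so statistics from the
        # test set leak into training and inflate the reported score.
        X_scaled = StandardScaler().fit_transform(X)
        X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
        return LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

    def clean_score(X, y):
        # Fix: split first, then fit all preprocessing on training rows only.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        scaler = StandardScaler().fit(X_tr)
        model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
        return model.score(scaler.transform(X_te), y_te)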
Harms arising from inadequate understanding of limitations
Should LLMs be used to evaluate grant proposals?
Youyou, Yang, Uzzi. A discipline-wide investigation of the replicability of Psychology papers over the past two decades. PNAS 2023.
No! They focus on the style of the text rather than its scientific content.
Crockett et al. The limitations of machine learning models for predicting scientific replicability. PNAS 2023.
Mottelson & Kontogiorgos. Replicating replicability modeling of psychology papers. PNAS 2023.
Takeaways
Evaluating LLMs is hard: prompt sensitivity, construct validity, contamination.
Faulty methods in research on LLMs and research using LLMs.
Harms arising from inadequate understanding of limitations.
Closed LLMs: further reproducibility hurdles
Kapoor & Narayanan. OpenAI’s policies hinder reproducible research on language models. AI Snake Oil (Substack) 2023.
The future of open source AI hangs in the balance
Kapoor & Narayanan. Licensing is neither feasible nor effective for addressing AI risks. AI Snake Oil (Substack) 2023.
AI fears have led to dubious policy proposals to require licenses to build AI.
Strengthening open approaches to AI
Princeton Language & Intelligence:
A research initiative committed to
keeping AI expertise and know-how
in the public sphere.