Research
I am broadly interested in natural language processing, specifically how to encourage factuality in language models.
Thanks to my amazing mentors and collaborators! ☺
* denotes equal contribution
|
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Akari Asai, Jacqueline He*, Rulin Shao*, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, Luca Soldaini, Sergey Feldman, Mike D'Arcy, David Wadden, Matt Latzke, Minyang Tian, Pan Ji, Shengyan Liu, Hao Tong, Bohao Wu, Yanyu Xiong, Luke Zettlemoyer, Graham Neubig, Dan Weld, Doug Downey, Wen-tau Yih, Pang Wei Koh, Hannaneh Hajishirzi
Preprint 2024
abstract /
paper /
code /
blog /
demo
Scientific progress depends on researchers' ability to synthesize the growing body of literature. Can large language models (LMs) assist scientists in this task? We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. To evaluate OpenScholar, we develop ScholarQABench, the first large-scale multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers across computer science, physics, neuroscience, and biomedicine. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7% in correctness, despite being a smaller, open model. While GPT-4o hallucinates citations 78 to 90% of the time, OpenScholar achieves citation accuracy on par with human experts. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs: for instance, OpenScholar-GPT4o improves GPT-4o's correctness by 12%. In human evaluations, experts preferred OpenScholar-8B and OpenScholar-GPT4o responses over expert-written ones 51% and 70% of the time, respectively, compared to GPT-4o's 32%. We open-source all of our code, models, datastore, and data, along with a public demo.
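To make the pipeline concrete, here is a minimal, hypothetical sketch of a retrieval-augmented answering loop with self-feedback in the spirit of the abstract; retrieve, generate, and critique are stand-in functions invented for illustration, not OpenScholar's actual API.

```python
# A schematic sketch of retrieval-augmented answering with a self-feedback loop.
# All three functions are hypothetical stand-ins, not OpenScholar's implementation.
def retrieve(query, k=5):
    # Stand-in: the real system searches a passage index over open-access papers.
    return [f"[passage about: {query}]"] * k

def generate(query, passages, feedback=None):
    # Stand-in for an LM call that writes a citation-backed answer from the passages.
    note = f" (revised per feedback: {feedback})" if feedback else ""
    return f"Answer to '{query}' citing {len(passages)} passages{note}"

def critique(answer):
    # Stand-in for the self-feedback step: the LM inspects its own draft and returns
    # a suggestion, or None when it is satisfied with the draft.
    return None if "revised" in answer else "support every claim with a citation"

def answer_with_feedback(query, max_rounds=3):
    passages, feedback = retrieve(query), None
    for _ in range(max_rounds):
        draft = generate(query, passages, feedback)
        feedback = critique(draft)
        if feedback is None:
            return draft
        passages += retrieve(feedback)   # optionally gather more evidence for the fix
    return draft

print(answer_with_feedback("How do retrieval-augmented LMs reduce hallucination?"))
```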
|
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh
NeurIPS 2024
abstract /
paper /
code
Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token datastore named MassiveDS, which is the largest and most diverse open-sourced datastore for retrieval-based LMs to date, and designing an efficient pipeline for studying datastore scaling in a computationally accessible manner. Finally, we analyze the effect of improving the retriever, datastore quality filtering, and other design choices on our observed scaling trends. Overall, our results show that datastore size should be considered an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at https://github.com/RulinShao/retrieval-scaling.
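As an illustration of the kind of sweep the abstract describes, the toy sketch below subsamples a datastore at increasing sizes and evaluates the same retrieval metric on each; the corpus, index, and recall metric are simplified stand-ins, and the real pipeline lives in the linked repository.

```python
# Toy datastore-scaling sweep: build an index over progressively larger subsamples
# and evaluate the same queries on each. Everything here is an illustrative stand-in.
import random

def build_index(passages):
    return list(passages)                      # stand-in for a dense or sparse index

def evaluate(index, queries):
    # Stand-in metric: fraction of queries with any matching passage in the index.
    hits = sum(any(q in p for p in index) for q in queries)
    return hits / len(queries)

datastore = [f"document {i} about topic {i % 50}" for i in range(10_000)]
queries = [f"topic {i}" for i in range(50)]

for frac in (0.01, 0.1, 0.5, 1.0):             # sweep datastore sizes
    subsample = random.sample(datastore, int(frac * len(datastore)))
    score = evaluate(build_index(subsample), queries)
    print(f"{len(subsample):>6} passages -> recall {score:.2f}")
```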
|
Challenges in Context-Aware Neural Machine Translation
Linghao Jin*, Jacqueline He*, Jonathan May, Xuezhe Ma
EMNLP 2023
abstract /
paper /
code
Context-aware neural machine translation, which leverages information beyond the sentence-level context to resolve inter-sentential discourse dependencies and improve document-level translation quality, has given rise to a number of recent techniques. However, despite well-reasoned intuitions, most context-aware translation models show only modest improvements over sentence-level systems. In this work, we investigate several challenges that impede progress in this field, relating to discourse phenomena, context usage, model architectures, and document-level evaluation. To address these problems, we propose a more realistic setting for document-level translation, called paragraph-to-paragraph (para2para) translation, and collect a new dataset of Chinese-English novels to promote future research.
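As a toy illustration of the para2para setting, the sketch below contrasts translating each sentence in isolation with translating a whole paragraph in one pass; the translate function is a hypothetical stand-in for an NMT model, not the systems studied in the paper.

```python
# Toy contrast between sentence-level and paragraph-to-paragraph (para2para) translation.
# `translate` is a hypothetical stand-in for an NMT model: in the sentence-level setting
# it never sees surrounding context, so inter-sentential dependencies (pronouns, ellipsis,
# lexical cohesion) cannot be resolved.
def translate(text, context=""):
    # Stand-in MT call; a real system would condition on `context` or the full paragraph.
    return f"<translation of: {text!r} | context tokens: {len(context.split())}>"

paragraph = ["他打开了那本书。", "它已经很旧了。"]   # "He opened the book." / "It was already very old."

# Sentence-level: each sentence is translated independently ("它"/"it" loses its antecedent).
sent_level = [translate(s) for s in paragraph]

# para2para: the whole source paragraph is translated as a single unit.
para_level = translate(" ".join(paragraph))

print(sent_level)
print(para_level)
```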
|
MABEL: Attenuating Gender Bias using Textual Entailment Data
Jacqueline He, Mengzhou Xia, Christiane Fellbaum, Danqi Chen
EMNLP 2022
abstract /
paper /
code
Pre-trained language models encode undesirable social biases, which are further exacerbated in downstream use. To address this, we propose MABEL (a Method for Attenuating Gender Bias using Entailment Labels), an intermediate pre-training approach for mitigating gender bias in contextualized representations. Key to our approach is the use of a contrastive learning objective on counterfactually augmented, gender-balanced entailment pairs from natural language inference (NLI) datasets. We also introduce an alignment regularizer that pulls identical entailment pairs along opposite gender directions closer together. We extensively evaluate our approach on intrinsic and extrinsic metrics, and show that MABEL outperforms previous task-agnostic debiasing approaches in terms of fairness. It also preserves task performance after fine-tuning on downstream tasks. Together, these findings demonstrate that NLI data is an effective means of bias mitigation, as opposed to the unlabeled sentences used in prior work. Finally, we find that existing approaches often use evaluation settings that are insufficient or inconsistent. We therefore reproduce and compare previous methods, and call for unified evaluation settings across gender debiasing methods to enable better future comparisons.
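Since the abstract describes the training objective only at a high level, here is a minimal sketch, under assumed simplifications, of what a contrastive loss over entailment pairs plus an alignment regularizer could look like; the random tensors, hyperparameters, and exact loss form are illustrative stand-ins, not MABEL's released code.

```python
# A minimal sketch (not the authors' implementation) of the two losses described in the
# abstract: an in-batch contrastive objective over entailment pairs, plus an alignment
# term between an example and its gender-swapped counterfactual. The encoder is omitted;
# random tensors stand in for sentence embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(premise_emb, hypothesis_emb, temperature=0.05):
    """InfoNCE over the batch: each premise should be closest to its own entailed hypothesis."""
    p = F.normalize(premise_emb, dim=-1)
    h = F.normalize(hypothesis_emb, dim=-1)
    logits = p @ h.T / temperature              # (B, B) cosine similarities
    labels = torch.arange(p.size(0))            # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

def alignment_loss(orig_emb, counterfactual_emb):
    """Pull each embedding toward its gender-swapped counterpart."""
    diff = F.normalize(orig_emb, dim=-1) - F.normalize(counterfactual_emb, dim=-1)
    return diff.pow(2).sum(dim=-1).mean()

# Toy batch: random vectors stand in for encoder outputs of gender-balanced NLI pairs.
B, d = 8, 768
premises, hypotheses = torch.randn(B, d), torch.randn(B, d)
swapped_premises = premises + 0.01 * torch.randn(B, d)   # stand-in for counterfactual encodings
total = contrastive_loss(premises, hypotheses) + alignment_loss(premises, swapped_premises)
print(f"toy training loss: {total.item():.3f}")
```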
|
Can Rationalization Improve Robustness?
Howard Chen, Jacqueline He, Karthik Narasimhan, Danqi Chen
NAACL 2022
abstract /
paper /
code
A growing line of work has investigated the development of neural NLP models that can produce rationales, subsets of the input that explain their predictions. In this paper, we ask whether such rationale models can also provide robustness to adversarial attacks in addition to their interpretable nature. Since these models need to first generate rationales ("rationalizer") before making predictions ("predictor"), they have the potential to ignore noise or adversarially added text by simply masking it out of the generated rationale. To this end, we systematically generate various types of 'AddText' attacks for both token- and sentence-level rationalization tasks, and perform an extensive empirical evaluation of state-of-the-art rationale models across five different tasks. Our experiments reveal that rationale models show promise in improving robustness, but struggle in certain scenarios, such as when the rationalizer is sensitive to positional bias or to the lexical choices of the attack text. Further, leveraging human rationales as supervision does not always translate to better performance. Our study is a first step towards exploring the interplay between interpretability and robustness in the rationalize-then-predict framework.
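To make the rationalize-then-predict setup and an AddText-style attack concrete, here is a toy sketch; the word-overlap rationalizer and string-matching predictor are invented stand-ins, not the neural models evaluated in the paper.

```python
# Toy rationalize-then-predict pipeline under an "AddText"-style attack (appending a
# distractor sentence). Both components below are illustrative stand-ins only.
def rationalizer(question, sentences, k=1):
    """Select the k sentences with the highest word overlap with the question."""
    q = set(question.lower().split())
    ranked = sorted(sentences, key=lambda s: len(q & set(s.lower().split())), reverse=True)
    return ranked[:k]

def predictor(question, rationale):
    """Toy predictor: answer 'yes' if the rationale mentions the question's final word."""
    target = question.lower().rstrip("?").split()[-1]
    return "yes" if target in " ".join(rationale).lower() else "no"

question = "Is the Eiffel Tower in Paris?"
context = ["The Eiffel Tower is in Paris.", "It was completed in 1889."]
attacked = context + ["The Statue of Liberty is in New York."]   # adversarially added text

for name, sents in (("clean", context), ("attacked", attacked)):
    rationale = rationalizer(question, sents)
    print(name, rationale, "->", predictor(question, rationale))
# If the rationalizer masks out the added sentence, the prediction is unchanged under attack.
```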