This project investigates biases in large language models by generating questions from articles drawn from different contexts and evaluating the model's long-form responses against the source articles (ground truth). This README provides instructions on how to use the question-answering pipeline and how to replicate the results.
Recreate the environment using:
$ conda env create -f environment.yml
For more information: https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file
Alternatively, install the dependencies with pip:
$ pip install -r requirements.txt
Note that installing with pip may lead to package conflicts.
The data used in and generated by this project can be found under the ./data directory. This includes data scraped from the web, from the S. Rajaratnam School of International Studies (RSIS), The Straits Times (ST) and the New York Times (NYT), representing the Southeast Asian context, the Singaporean context and the New York Times context respectively. For more information, please refer to ./data.
Web scraping scripts for RSIS, The Straits Times and NYT are also provided under ./web-scrapping for further data gathering.
Before executing the QA generation scripts, be sure to run the FastChat API server if you are not using an OpenAI model. To clone the FastChat repo, please refer to: https://github.com/lm-sys/FastChat/tree/main#install
The FastChat server can be hosted with these commands:
$ cd FastChat
$ python3 -m fastchat.serve.controller --port [controller_port_number]
$ python3 -m fastchat.serve.model_worker --model-name [model_name] --model-path [model_path] --gpus 0,3 --num-gpus 2 --max-gpu-memory 80GiB --port [worker_port_number] --worker-address http://localhost:[worker_port_number] --controller-address http://localhost:[controller_port_number]
$ python3 -m fastchat.serve.openai_api_server --host localhost --port [restful_port_number] --controller-address http://localhost:[controller_port_number]
For more information about the FastChat restful API server, please refer to the FastChat repo: https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md
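Once all three processes are running, you can smoke-test the endpoint with the OpenAI Python client, as in FastChat's docs. A minimal sketch, assuming the API server listens on port 8000 (a hypothetical value for [restful_port_number]) and the worker serves vicuna-13b-v1.3, using the openai v0.x client:

```python
import openai

# Point the OpenAI client at the local FastChat server; no real key is
# needed for a local server, so "EMPTY" suffices.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"  # hypothetical port

completion = openai.ChatCompletion.create(
    model="vicuna-13b-v1.3",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)
```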
This is the pipeline for question generation, answer generation and evaluation using cosine similarity and perplexity. More information on the file structure can be found in ./QA-generation.
Please configure the model parameters in the QA configs before beginning. An example configuration:
vicuna-13b-v1.3:
  openai_localhost: http://localhost:[restful_port_number]/v1
  openai_api_key: EMPTY
  openai_organization: [openai-organisation]
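Note that [restful_port_number] should match the port passed to fastchat.serve.openai_api_server above. For a local FastChat server, openai_api_key can remain EMPTY; when using an actual OpenAI model, supply your real API key and organisation instead.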
Definitions for the model can be fed through the ./configs/definitions_config.json file. Default definitions are already provided for:
- Question Generation
- Answer Generation
- Summarise to points
- Text summarisation
- Answer Generation with Context
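The exact schema of definitions_config.json is set by the repo's configs; purely as an illustrative sketch, assuming each task name above simply maps to a prompt definition string, the file could be read like this:

```python
import json

# Load the task definitions. The flat name-to-prompt structure assumed
# here is an illustration, not the repo's documented schema.
with open("./configs/definitions_config.json") as f:
    definitions = json.load(f)

question_gen_prompt = definitions["Question Generation"]
print(question_gen_prompt)
```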
Use this command for question generation:
$ python3 question-generation.py \
--context_name rsis \
--num_of_generations 1 \
--model_name vicuna-13b-v1.3 \
--qa_config_path ./configs/QA_config.yaml
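As the commands below suggest, the generated questions are saved under ../data/generations/[context_name]/ (e.g. questions_rsis_vicuna-13b-v1.3.json), which is then supplied as --questions_path for answer generation.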
Use this command for answer generation without context (closed-book QA). Evaluations will be calculated automatically upon answer generation:
$ python3 close-book-generation.py \
--context_name rsis \
--questions_path ../data/generations/rsis/questions_rsis_vicuna-13b-v1.3.json \
--num_of_generations 1 \
--model_name vicuna-13b-v1.3 \
--qa_config ../configs/QA_config.yaml
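As a reference for the automatic evaluation, cosine similarity is computed between embeddings of the generated answer and the ground truth. A minimal sketch, assuming sentence-transformers embeddings (the embedding model the repo actually uses may differ):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Embed the generated answer and the ground-truth source, then compare.
# all-MiniLM-L6-v2 is an illustrative model choice, not the repo's.
model = SentenceTransformer("all-MiniLM-L6-v2")
answer = "Generated answer text..."
ground_truth = "Source article text..."
emb = model.encode([answer, ground_truth])
score = cosine_similarity(emb[0:1], emb[1:2])[0, 0]
print(f"cosine similarity: {score:.3f}")
```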
For answer generation with context (open-book QA), use this command:
$ python3 open-book-generation.py \
--context_name rsis \
--questions_path ../data/generations/rsis/questions_rsis_vicuna-13b-v1.3.json \
--num_of_generations 1 \
--model_name vicuna-13b-v1.3 \
--qa_config ./configs/QA_config.yaml \
--starting_dataset_path ../data/generations/rsis/open_book_answers_rsis_vicuna-13b-v1.3.json
Calculating perplexity is a separate process; ensure that perplexity_filepath.yml is configured before running perplexity.py.
$ python3 perplexity.py --perplexity_config ../configs/perplexity_filepath.yml
As we use the Hugging Face API for calculating perplexity, the CUDA device map can be configured in ./configs/device_map.json. Modify the code in ./QA-generation/perplexity.py to use the configured device map.
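For intuition, perplexity with a Hugging Face causal LM is the exponential of the mean token-level cross-entropy. A minimal sketch (the model path is hypothetical, and loading the device map this way is an assumption about how perplexity.py could use it):

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the configured device map (see ./configs/device_map.json).
with open("./configs/device_map.json") as f:
    device_map = json.load(f)

model_path = "lmsys/vicuna-13b-v1.3"  # hypothetical model path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device_map)

text = "Generated answer text..."
enc = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Passing the inputs as labels makes the model return the mean
    # cross-entropy loss over the sequence.
    out = model(**enc, labels=enc["input_ids"])
perplexity = torch.exp(out.loss).item()
print(f"perplexity: {perplexity:.2f}")
```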
Data cleaning scripts can be found under ./data-cleaning.
Scripts and notebooks to calculate evaluations can be found under ./more-eval.
Below are the cosine similarity calculations between the generated answers and the source articles (ground truth). We also conducted an unpaired two-tailed t-test to verify the differences in means, and calculated perplexity as shown below.
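The t-test can be reproduced with SciPy; a minimal sketch with illustrative score lists (not actual results):

```python
from scipy import stats

# Cosine-similarity scores for two contexts (illustrative values only).
rsis_scores = [0.71, 0.68, 0.75, 0.62, 0.70]
nyt_scores = [0.64, 0.66, 0.59, 0.61, 0.67]

# Unpaired (independent-samples) t-test; SciPy's default is two-tailed.
t_stat, p_value = stats.ttest_ind(rsis_scores, nyt_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```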
The architecture for this project was inspired by the Self-Instruct framework used in Stanford's Alpaca.
@misc{wang2023selfinstruct,
  title={Self-Instruct: Aligning Language Models with Self-Generated Instructions},
  author={Yizhong Wang and Yeganeh Kordi and Swaroop Mishra and Alisa Liu and Noah A. Smith and Daniel Khashabi and Hannaneh Hajishirzi},
  year={2023},
  eprint={2212.10560},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
This repo also uses FastChat extensively; FastChat was mainly developed with reference to the paper below:
@misc{zheng2023judging,
  title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
  author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2306.05685},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}