BM25 IR Benchmarking for Manticore Search vs Elasticsearch

This repo contains scripts and instructions for evaluating Manticore Search (MS) on example Information Retrieval (IR) datasets.

We evaluate how MS compares with Elasticsearch (ES) for retrieval using BM25.

We try to mimic ES settings for BM25 search as described here.

The evaluation compares various IR benchmarking metrics as implemented in BEIR, a Python package for benchmarking models and algorithms on IR tasks.
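
For reference, here is a minimal sketch of how the metrics are computed with BEIR (standard BEIR API; the empty results dict is just a placeholder for whatever scores the engine under test returns):

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

# load a BEIR dataset (paths as per the data setup below)
corpus, queries, qrels = GenericDataLoader("data/nfcorpus").load(split="test")

# results: query_id -> {doc_id: score}, filled in by the engine under test (MS or ES)
results = {qid: {} for qid in queries}

retriever = EvaluateRetrieval()
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, [1, 3, 5, 10])
print(ndcg["NDCG@10"])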

All Results:

Look at all the results here.

Look at all the updated results here. Thanks to the Manticore team for addressing the concerns we raised!

Setup for data:

We evaluate on the datasets below.

  1. TREC-COVID https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
  2. NF-CORPUS https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip

Run the commands below from the directory where you cloned the repo:

cd data
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/trec-covid.zip
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
unzip nfcorpus.zip
unzip trec-covid.zip
cd ..

NOTES:

Meta information about the datasets:

  1. Each dataset has two fields whose contents need to be indexed: a title field and a text field (see the snippet after this list).
  2. The trec-covid dataset has 171332 documents, while nfcorpus has 3633 documents.
  3. For IR evaluation, each dataset has a fixed set of queries and the corresponding relevant documents.
  4. The trec-covid dataset has 50 queries, while nfcorpus has 323 queries.
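
The snippet below peeks at that layout; the field names follow BEIR's standard corpus/qrels format and the paths assume the download step above:

import json

# one corpus record: BEIR documents carry _id, title and text (plus metadata)
with open("data/nfcorpus/corpus.jsonl") as f:
    first_doc = json.loads(next(f))
print(first_doc["title"], first_doc["text"][:80])

# the test split's queries are the unique query ids in qrels/test.tsv
with open("data/nfcorpus/qrels/test.tsv") as f:
    next(f)  # header: query-id, corpus-id, score
    test_query_ids = {line.split("\t")[0] for line in f}
print(len(test_query_ids))  # the 323 test queries mentioned above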

Setup for code:

  1. Create and activate conda env:
conda create --name ir-bm25-benchmark python=3.10
conda activate ir-bm25-benchmark
  2. Install dependencies:
pip install -r requirements.txt
pip install --no-deps -r requirements_no_deps.txt

The updated results required a manticoresearch-python client version that was not available via pip as of this writing. To install the latest, use:

pip install git+https://github.com/manticoresoftware/manticoresearch-python.git@master

Confirm using pip list that the installed version is >= 2.0.0:

pip list | grep manticoresearch
manticoresearch   2.0.0

Evaluating Manticore Search:

  1. Pull the docker image and start the container:
docker pull manticoresearch/manticore
docker run -p 9306:9306 -p 9308:9308 manticoresearch/manticore
  • For the Manticore dev version (MS dev):
docker pull manticoresearch/manticore:dev
docker run -p 9306:9306 -p 9308:9308 manticoresearch/manticore:dev
  2. Create and populate indices:

a. Create indices with default settings:

python -m benchmark.manticore.prepare data/trec-covid/corpus.jsonl trec_covid
python -m benchmark.manticore.prepare data/nfcorpus/corpus.jsonl nfcorpus

b. Create indices with settings to mimic ES-like BM25 behavior for search:

python -m benchmark.manticore.prepare data/trec-covid/corpus.jsonl trec_covid_es_like --index-es-like
python -m benchmark.manticore.prepare data/nfcorpus/corpus.jsonl nfcorpus_es_like --index-es-like
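
To confirm the tables were created, you can hit Manticore's HTTP SQL endpoint directly (a quick check using requests and the port mapped in the docker command above; the repo's own scripts go through the manticoresearch client):

import requests

# /sql?mode=raw accepts arbitrary SQL and returns the result set as JSON
resp = requests.post("http://localhost:9308/sql?mode=raw", data={"query": "SHOW TABLES"})
print(resp.json())  # should list trec_covid, nfcorpus and/or the *_es_like tables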

NOTE 1:

The following options are set on the indices for the ES-like BM25 behaviour:

stopwords='en'
stopwords_unstemmed='1'
morphology='stem_en'
html_strip = '1'
index_exact_words = '1'
index_field_lengths = '1'

These options apply to the two text fields of the document collections. More details about indexing can be found in this function.
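
For illustration, this is roughly what those options look like in a CREATE TABLE statement sent over the same HTTP SQL endpoint (the table and field names here are examples only; the repo's prepare script is the source of truth):

import requests

create_sql = (
    "CREATE TABLE nfcorpus_es_like (doc_id string, title text, content text) "
    "stopwords='en' stopwords_unstemmed='1' morphology='stem_en' "
    "html_strip='1' index_exact_words='1' index_field_lengths='1'"
)
requests.post("http://localhost:9308/sql?mode=raw", data={"query": create_sql})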

NOTE 2: The following MS ranking options are set for the evaluation of the ES-like BM25 behaviour:

ranker=expr('sum(10000 * bm25f(1.2,0.75,{title=1,content=1}))'), idf='plain,tfidf_unnormalized'
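
At query time these options go into the OPTION clause of the full-text SELECT. A hedged sketch of how a single BEIR query could be run against the ES-like index and folded into a BEIR results dict (the doc_id column and table name are assumptions):

import requests

def search_ms(query_text, table="nfcorpus_es_like", limit=100):
    escaped = query_text.replace("'", "\\'")  # naive escaping, for illustration only
    sql = (
        f"SELECT doc_id, WEIGHT() AS score FROM {table} "
        f"WHERE MATCH('{escaped}') LIMIT {limit} "
        "OPTION ranker=expr('sum(10000 * bm25f(1.2,0.75,{title=1,content=1}))'), "
        "idf='plain,tfidf_unnormalized'"
    )
    resp = requests.post("http://localhost:9308/sql?mode=raw", data={"query": sql})
    rows = resp.json()[0]["data"]  # mode=raw returns a list of result sets
    return {row["doc_id"]: row["score"] for row in rows}

# results[query_id] = search_ms(query_text) for each test query, then evaluate with BEIR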

NOTE 3:

Manticore's default English stop word list is much longer than Elasticsearch's. For the *_es_like indices you can use the same stop words as Elasticsearch, but we've noticed that our evaluation performance is poor when we limit ourselves to the ES stop words only. To do this, copy the file data/elasticsearch_en_stop_words to your Manticore docker container, say to /var/lib/manticore/data/. You can then change the index preparation commands to:

python -m benchmark.manticore.prepare data/trec-covid/corpus.jsonl trec_covid_es_like --index-es-like --stop-words /var/lib/manticore/data/elasticsearch_en_stop_words
python -m benchmark.manticore.prepare data/nfcorpus/corpus.jsonl nfcorpus_es_like --index-es-like --stop-words /var/lib/manticore/data/elasticsearch_en_stop_words

NOTE 4:

Elasticsearch indices are built according to this function in BEIR where the two text fields are indexed with the ES English analyzer. The resulting indices are then queried with a multi-match query over these two fields, as detailed in this function.
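
For comparison, a hedged sketch of the kind of multi-match query BEIR issues against ES (the exact parameters live in BEIR's lexical search code, which stores the body text under the field name txt; the index name is assumed to match the evaluation commands below):

import requests

es_query = {
    "size": 100,
    "query": {
        "multi_match": {
            "query": "what is the origin of COVID-19",  # example query text
            "type": "best_fields",
            "fields": ["title", "txt"],
            "tie_breaker": 0.5,
        }
    },
}
resp = requests.post("http://localhost:9200/trec_covid/_search", json=es_query)
results_for_query = {h["_id"]: h["_score"] for h in resp.json()["hits"]["hits"]}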

Evaluate:

a. Evaluate retrieval for MS default settings:

python -m benchmark.manticore.evaluate data/nfcorpus test nfcorpus
python -m benchmark.manticore.evaluate data/trec-covid test trec_covid

b. Evaluate retrieval for MS with ES-like settings:

python -m benchmark.manticore.evaluate data/nfcorpus test nfcorpus_es_like
python -m benchmark.manticore.evaluate data/trec-covid test trec_covid_es_like

Evaluating Elasticsearch:

  1. Run Elasticsearch in a docker container:
docker pull elasticsearch:7.17.0
docker-compose up

Wait for a couple of minutes for the docker container to be ready.

  2. Evaluate (this re-creates the index each time you evaluate):
python -m benchmark.es.evaluate_bm25 data/trec-covid test trec_covid
python -m benchmark.es.evaluate_bm25 data/nfcorpus test nfcorpus

Note: There is a sleep of 10 seconds between the creation of the index and the evaluation in the above script. This allows ES to finish the indexing before we run the evaluations.
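
Under the hood this follows BEIR's standard lexical BM25 flow, roughly (a sketch using BEIR's documented API; hostname and index name are assumed to match the commands above):

from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25

corpus, queries, qrels = GenericDataLoader("data/trec-covid").load(split="test")

# initialize=True (re)creates the ES index and pushes the corpus into it
model = BM25(index_name="trec_covid", hostname="localhost", initialize=True)
retriever = EvaluateRetrieval(model)

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)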

Findings:

We compare all the different indexing and search strategies using the metric NDCG@10. This is the metric reported in the BEIR paper, and the numbers for these two datasets (and others) can be accessed here. The other metrics printed by the evaluation scripts are simply sanity checks.
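
As a quick reminder of the metric, NDCG@10 compares the discounted gain of the top 10 results to that of the ideal ordering. Below is a tiny illustration using one common formulation (linear gain, log2 discount); the official numbers in this repo come from BEIR, which delegates to pytrec_eval:

import math

def dcg(rels, k=10):
    # rels: graded relevance of the returned documents, in ranked order
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

def ndcg_at_10(rels):
    # ideal DCG here is computed from the same list; pytrec_eval uses all judged relevant docs
    ideal_dcg = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_10([0, 2, 1, 0, 0, 0, 0, 0, 0, 0]))  # ~0.67: relevant docs ranked too low
print(ndcg_at_10([2, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # 1.0: ideal ordering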

Comments (in the context of Manticore 4.2.0; the concerns we raised were fixed in 4.2.1, see the updated results here):

  1. Comparing the NDCG@10 results achieved with MS using ES-like settings:
    1. For the trec-covid dataset: NDCG@10 jumps to 0.59764, but we still fall short of the best of 0.68803 reported with ES.
    2. For the nfcorpus dataset: NDCG@10 jumps to 0.31715, but we still fall short of the best of 0.34281 reported with ES.
  2. Comparing the NDCG@10 results achieved with MS (default settings) to those achieved with ES:
    1. MS performs very poorly for the trec-covid dataset: 0.29494 compared to 0.68803 for ES.
    2. MS performs slightly worse for the nfcorpus dataset: 0.28791 compared to 0.34281 for ES.
  3. Comparing our ES results to the NDCG@10 numbers reported by BEIR:
    1. These numbers should match exactly, but ours are actually better in reality.
    2. The reported benchmark had a bug concerning reproducibility. More details here.

Results for trec-covid:

dataset     settings                 NDCG@10
trec-covid  MS (default)             0.29494
trec-covid  MS (es-like)             0.59764
trec-covid  MS dev (es-like)         0.71211
trec-covid  ES                       0.68803
trec-covid  ES (reported in BEIR)    0.616

Results for nfcorpus:

dataset   settings                 NDCG@10
nfcorpus  MS (default)             0.28791
nfcorpus  MS (es-like)             0.31715
nfcorpus  MS dev (es-like)         0.34537
nfcorpus  ES                       0.34281
nfcorpus  ES (reported in BEIR)    0.297

Versions:

Elasticsearch version: 7.17.0

Run using this docker image.

Manticore Search version: Manticore 4.2.0 15e927b28@211223 release

Run using this docker image.

Manticore Search version (with fixes): Manticore 4.2.1 d039fba84@220407 release

Run using this docker image.
