GitHub - vector-engineering/covidcg: A COVID-19 CoV Genetics (CG) browser to inform therapeutics development

COVID-19 CG (CoV Genetics)

Article now up at eLife: https://doi.org/10.7554/eLife.63409

Table of Contents

COVID-19 CG (CoV Genetics)
Data enabling COVID CG
Installation
- Dependency changes
- Database refresh
Per-service installation
Analysis Pipeline
About the project
Citing COVID CG
- License
- Contributing

Data enabling COVID CG

We are extremely grateful to the GISAID Initiative and all its data contributors, i.e. the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.

Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33-46. DOI:10.1002/gch2.1018 PMID: 31565258

Note: When using results from these analyses in your manuscript, ensure that you acknowledge the contributors of data, i.e. We gratefully acknowledge all the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.

Installation

The COVID-19 CG website comprises three services (PostgreSQL database, Flask server, React frontend). These can be run separately (see the detailed instructions under Per-service installation), but we recommend using Docker to manage them.

The analysis pipeline for processing raw SARS-CoV-2 genomes is a separate install and is described below in Analysis Pipeline.

Install Docker
Clone this repository: git clone https://github.com/vector-engineering/covidcg.git

$ cd covidcg
$ docker-compose build # Build containers
                       # (Re-builds only necessary if packages or
                       # dependencies have changed)
$ docker-compose up -d # Run all services
$ docker-compose down # Shut down all services when finished

The default deployment (docker-compose.yml) will run all 3 sites at the same time (sars2, rsv, and flu). For virus-specific sites, see docker-compose.sars2.yml, etc. Run a specific deployment with:

docker compose -f docker-compose.sars2.yml build
docker compose -f docker-compose.sars2.yml up -d
...

NOTE: When starting from a fresh database, the server will automatically seed the database with data from the example_data_genbank folder. Data provided with the repository is in raw/gzipped form and needs to be unarchived and processed before the data can be loaded into the database. Please see the Analysis Pipeline section for instructions on processing this data.

Dependency changes

If the dependencies for the JS change (i.e., a change in package.json), then you can rebuild the cg-frontend container with:

$ docker-compose down
$ docker-compose build --no-cache cg-frontend
$ docker-compose up

A rebuild is also needed if the toolchain changes (webpack*.js or anything in tools/).

For files outside of src, i.e., in config/ or in static_data/, the container will need to be restarted but not rebuilt.

For dependency changes for the server (i.e., changes in requirements.txt), rebuild the cg-server container with:

$ docker-compose down
$ docker-compose build --no-cache cg-server
$ docker-compose up
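The rebuild and restart rules above can be summarized as a small lookup. This is only an illustrative sketch: the function name `action_for_change` is hypothetical and not part of the repository, and the fallback value "none" assumes that anything not named above (for example files under src/) needs no container action.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def action_for_change(path: str) -> str:
    """Map a changed, repo-relative path to the Docker action described above.

    Hypothetical helper for illustration; not part of the COVID-19 CG codebase.
    """
    p = PurePosixPath(path)
    top = p.parts[0] if p.parts else ""

    # Server dependencies: rebuild the cg-server container.
    if p.name == "requirements.txt":
        return "rebuild cg-server"

    # JS dependencies or toolchain (webpack*.js, tools/): rebuild cg-frontend.
    if p.name == "package.json" or fnmatch(p.name, "webpack*.js") or top == "tools":
        return "rebuild cg-frontend"

    # Files in config/ or static_data/: restart only, no rebuild.
    if top in ("config", "static_data"):
        return "restart"

    # Assumption: everything else needs no container action.
    return "none"


if __name__ == "__main__":
    for changed in ("package.json", "config/config_sars2_genbank.yaml"):
        print(changed, "->", action_for_change(changed))
```

The returned strings are labels only; the actual commands remain the docker-compose invocations shown above.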

Database refresh

To erase the local development database, delete the postgres docker volume with:

$ docker-compose down -v # -v will delete the volume
$ docker-compose up

Per-service installation

We recommend developing with Docker and docker-compose. More details on the installation for each service can be found in their respective Dockerfiles in the services/ folder, and in the docker-compose.yml file. Running each service separately is not recommended and not tested on our end. Since we are not actively testing per-service installations, please submit a GitHub issue if you run into any problems during installation or running.

First, clone this repository: git clone https://github.com/vector-engineering/covidcg.git

Javascript

Requirements:

curl
node.js >= 8.0.0
npm

This app was built from the react-slingshot example app.

Install Node 8.0.0 or greater

Need to run multiple versions of Node? Use nvm.
Install Git.
Disable safe write in your editor to ensure hot reloading works properly.
Complete the steps below for your operating system:

macOS
- Install watchman via brew install watchman. This avoids an error that occurs when macOS has no appropriate file-watching service installed.
Linux
- Run this to increase the limit on the number of files Linux will watch:
  
  echo fs.inotify.max_user_watches=524288 | sudo tee -a /etc/sysctl.conf && sudo sysctl -p
Install NPM packages

npm install
Run the app

CONFIGFILE=config/config_genbank.yaml npm start -s

This will run the automated build process, start up a webserver, and open the application in your default browser. When doing development with this kit, this command will continue watching all your files. Every time you hit save the code is rebuilt, linting runs, and tests run automatically. Note: The -s flag is optional. It enables silent mode which suppresses unnecessary messages during the build.

PostgreSQL

This development environment was tested with PostgreSQL 12.

Please provide DB connection information to the Flask server with the following environment variables:

POSTGRES_USER
POSTGRES_PASSWORD
POSTGRES_DB
POSTGRES_HOST
POSTGRES_PORT
POSTGRES_MAX_CONN (the maximum number of connections for the Postgres connection pool)
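As a rough illustration of how these variables might be consumed, the sketch below reads them into a settings dictionary and fails early if any is missing. The function name `postgres_settings` and the dictionary keys are hypothetical, and treating all six variables as required is an assumption; the actual Flask server may handle them differently.

```python
import os

# Variable names are taken from the list above.
REQUIRED_VARS = (
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
    "POSTGRES_DB",
    "POSTGRES_HOST",
    "POSTGRES_PORT",
    "POSTGRES_MAX_CONN",
)


def postgres_settings(env=None):
    """Collect DB connection settings from environment variables.

    Hypothetical helper; assumes every variable in REQUIRED_VARS must be set.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "user": env["POSTGRES_USER"],
        "password": env["POSTGRES_PASSWORD"],
        "dbname": env["POSTGRES_DB"],
        "host": env["POSTGRES_HOST"],
        "port": int(env["POSTGRES_PORT"]),
        # Upper bound for the connection pool size.
        "max_conn": int(env["POSTGRES_MAX_CONN"]),
    }
```

Passing a plain dictionary as `env` makes it easy to check a configuration without touching the real environment.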

Flask Server

Requirements:

Python3 (Python >= 3.8) with virtual environments. We recommend conda via miniconda3, but python3 with virtualenv or any other virtual environment provider should also work fine.

Install dependencies:

$ cd services/server
$ pip install -r requirements.txt

Run server:

$ cd services/server
$ CONFIGFILE=../../config/config_genbank.yaml ./serve.sh # Run Flask server in development mode, with GenBank settings
                                                   # Optionally, edit the serve.sh script to set the config file

Analysis Pipeline

Data analysis is run with Snakemake, Python scripts, and bioinformatics tools such as bowtie2. Please ensure that the conda environment is configured correctly (See Pipeline Installation).

Data analysis is broken up into two snakemake pipelines: 1) ingestion and 2) main. The ingestion pipeline downloads, chunks, and prepares metadata for the main analysis, and the main pipeline analyzes sequences, extracts mutations, and compiles data for display in the web application.

Configuration of the pipeline is defined in the config/config_[workflow].yaml files.

Pipeline Installation

Clone this repository: git clone https://github.com/vector-engineering/covidcg.git
Install miniconda3
Create conda environment:

$ conda config --add channels bioconda # Add package download locations
$ conda config --add channels conda-forge
$ conda env create -f environment.yml

For Macs with M1 (ARM64) chips, use the alternative environment file environment_osx-arm64.yml. Some additional source compilation steps are required, as not all ARM64 binaries are available on conda.

Ingestion

Currently available ingest workflows are:

SARS2:

workflow_sars2_gisaid_ingest
workflow_sars2_genbank_ingest
workflow_sars2_custom_ingest

RSV:

workflow_rsv_genbank_ingest
workflow_rsv_custom_ingest

Flu:

workflow_flu_gisaid_ingest
workflow_flu_genbank_ingest

NOTE: While the GISAID ingestion pipelines are provided as open source, they are intended only for internal use.

GenBank ingest pipelines are designed to automatically download and process data from their respective data sources.

"Custom" ingest pipelines can be used for analyzing and visualizing in-house data. More details are available in README files within each ingestion pipeline's folder. Each ingestion workflow is parametrized by its own config file, e.g., config/config_sars2_genbank.yaml for the SARS-CoV-2 GenBank workflow.

For example, you can run the SARS-CoV-2 GenBank ingestion pipeline with:

$ cd workflow_sars2_genbank_ingest
$ snakemake --use-conda # Conda required specifically for SARS2 GenBank ingest in order to run Pangolin lineage assignments
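If you prefer to drive the two stages from a script, the following sketch chains the ingestion command above with the main workflow command given in the Main Analysis section. The names `pipeline_steps`, `run_pipeline`, and `dry_run` are hypothetical and not part of the repository; the snakemake arguments are copied from this README, and the sketch only prints the commands unless `dry_run=False` is passed.

```python
import subprocess
from pathlib import Path


def pipeline_steps(ingest_dir, main_config):
    """Return (working directory, argv) pairs: ingestion first, then main."""
    return [
        (Path(ingest_dir), ["snakemake", "--use-conda"]),
        (Path("workflow_main"), ["snakemake", "--configfile", main_config]),
    ]


def run_pipeline(repo_root, steps, dry_run=True):
    """Print each command and, unless dry_run is set, run it in its directory."""
    for workdir, argv in steps:
        cwd = Path(repo_root) / workdir
        print(f"[{cwd}] {' '.join(argv)}")
        if not dry_run:
            # check=True stops the chain if a stage fails.
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    steps = pipeline_steps(
        "workflow_sars2_genbank_ingest",
        "../config/config_sars2_genbank_dev.yaml",
    )
    run_pipeline(".", steps)  # dry run: only prints the two commands
```

Running the stages in order matters because the main workflow reads the data folder that the ingestion workflow produces.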

Main Analysis

The main data analysis pipeline is located in workflow_main. It requires data, in a data folder, from the ingestion pipeline. The data folder is defined in the config/config_[workflow].yaml file. The path to the config file is required for the main workflow, as it needs to know what kind of data to expect (as described in the config files).

For example, if you ingested data from GenBank, run the main analysis pipeline with:

cd workflow_main
snakemake --configfile ../config/config_sars2_genbank_dev.yaml

This pipeline will align sequences to the reference sequence with minimap2, extract mutations on both the NT and AA level, and combine all metadata and mutation information data. The output data can be uploaded to a PostgreSQL database with workflow_main/scripts/push_to_database.py. Or, you can use the output files directly for your own analyses.
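To make "mutations on both the NT and AA level" concrete, here is a toy sketch that compares an already-aligned query against a reference of the same length. It is not the pipeline's implementation: the function names are hypothetical, the sequences are invented, and it handles only substitutions in uppercase A/C/G/T, in a single reading frame starting at position 1 (no insertions, deletions, or ambiguous bases).

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def nt_mutations(ref, query):
    """List nucleotide substitutions as REF + 1-based position + ALT."""
    if len(ref) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{r}{pos}{q}"
        for pos, (r, q) in enumerate(zip(ref, query), start=1)
        if r != q
    ]


def aa_mutations(ref, query):
    """List amino acid substitutions, translating codon by codon."""
    if len(ref) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    mutations = []
    usable = len(ref) - len(ref) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        ref_aa = CODON_TABLE[ref[start:start + 3]]
        query_aa = CODON_TABLE[query[start:start + 3]]
        if ref_aa != query_aa:
            mutations.append(f"{ref_aa}{start // 3 + 1}{query_aa}")
    return mutations


if __name__ == "__main__":
    reference = "ATGGATTTT"  # invented sequence: codons ATG GAT TTT
    sample = "ATGGGTTTT"     # single substitution at position 5
    print(nt_mutations(reference, sample))  # ['A5G']
    print(aa_mutations(reference, sample))  # ['D2G']
```

A single nucleotide change (A5G) alters the second codon from GAT to GGT, which is why the same difference shows up as D2G at the amino acid level.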

Example data

Example data from GenBank is provided for all viruses, and is located in gzipped tarballs inside the example_data_genbank folder. Data for some viruses is truncated by submission date in order to lighten data load and speed up development on smaller machines.

To extract the data:

$ cd example_data_genbank
$ tar -xzf sars2.tar.gz
$ tar -xzf rsv.tar.gz
$ tar -xzf flu.tar.gz

These tarballs contain only raw sequences and metadata, and mimic the output from their respective ingest pipelines. Once the files are extracted, run the main analysis workflow described above.
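The same extraction can be done from Python with the standard library, which may be convenient on systems without tar. This is a minimal sketch: the function name `extract_example_data` is hypothetical, and the archive names are the three listed above. Archives that are not present are skipped.

```python
import tarfile
from pathlib import Path


def extract_example_data(data_dir="example_data_genbank", names=("sars2", "rsv", "flu")):
    """Extract <name>.tar.gz archives in place and return the names extracted."""
    data_dir = Path(data_dir)
    extracted = []
    for name in names:
        archive = data_dir / f"{name}.tar.gz"
        if not archive.exists():
            continue  # skip archives that are not present
        with tarfile.open(archive, "r:gz") as tar:
            # Only extract archives you trust, such as those shipped with the repo.
            tar.extractall(path=data_dir)
        extracted.append(name)
    return extracted


if __name__ == "__main__":
    print(extract_example_data())
```

After extraction, continue with the main analysis workflow as described above.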

About the project

This project is developed by the Vector Engineering Lab:

Albert Tian Chen (Broad Institute)
Kevin Altschuler
Shing Hei Zhan, PhD (University of British Columbia)
Alina Yujia Chan, PhD (Broad Institute)
Ben Deverman, PhD (Broad Institute)

Contact the authors by email: covidcg@broadinstitute.org

Python/snakemake scripts were run and tested on macOS 10.15.4 (8 threads, 16 GB RAM), Google Cloud Debian 10 (buster) (64 threads, 412 GB RAM), and Windows 10/Ubuntu 20.04 via WSL2 (48 threads, 128 GB RAM).

Citing COVID CG

Users are encouraged to share, download, and further analyze data from this site. Plots can be downloaded as PNG or SVG files, and the data powering the plots and tables can be downloaded as well. Please attribute any data/images to covidcg.org, or cite our manuscript:

Chen AT, Altschuler K, Zhan SH, Chan YA, Deverman BE. COVID-19 CG enables SARS-CoV-2 mutation and lineage tracking by locations and dates of interest. eLife (2021), doi: https://doi.org/10.7554/eLife.63409

Note: When using results from these analyses in your manuscript, ensure that you acknowledge the contributors of data, i.e. We gratefully acknowledge all the Authors from the Originating laboratories responsible for obtaining the specimens and the Submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based.

and cite the following reference(s):

Shu, Y., McCauley, J. (2017) GISAID: Global initiative on sharing all influenza data – from vision to reality. EuroSurveillance, 22(13) DOI:10.2807/1560-7917.ES.2017.22.13.30494 PMCID: PMC5388101

License

COVID-19 CG is distributed under the MIT license.

Contributing

Please feel free to contribute to this project by opening an issue or pull request in the GitHub repository.
