Releases: ArchiveBox/ArchiveBox
v0.8.0-rc: New REST API ✨, Django 5.0, S3/B2/SMB/NFS remote storage support, VNC viewer, and more
WIP pre-release for the upcoming ArchiveBox v0.8.0 release.
Warning
This is an unfinished pre-release. We're promoting it a little earlier than usual because it contains ✨ lots of big new features ✨ and we want brave early adopters to help us test it! If that sounds like you, make sure to back up your archive first, then let us know if you find bugs by opening a new issue!
Try this release early using `docker` or `pip`:
# with docker (pre-built)
docker pull archivebox/archivebox:dev
# with docker (built from source)
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
# with pip (built from source)
pip install 'git+https://github.com/ArchiveBox/ArchiveBox@dev'
Highlights
- New REST API built with `django-ninja` (thanks @Brandl!)
- New ability to send outgoing webhooks triggered by archiving events
- new support for S3/B2/Google Drive/etc. remote storage using Docker + `rclone`
- new ability to manage ArchiveBox config in Admin UI (read-only for now, ability to edit coming soon...)
- new noVNC remote viewing support for ArchiveBox browser (grab the updated `docker-compose.yml` first!)
- upgraded to Django 5.0 internally (thanks @jimwins!)
- add new `*_EXTRA_ARGS` options (thanks @benmuth!) and new unified `USER_AGENT` option
- add new `generic_jsonl` parser (thanks @jimwins!)
- switch to `feedparser` for RSS parsing (thanks @jimwins!)
- remember `Snapshot` detail page header expanded/collapsed state
- add gitea and other domains to default GIT_DOMAINS list to run git archiving on
- check `/`, `/data`, and `/data/archive` in Docker and warn if running low on disk space
- Add COOKIES_FILE support for singlefile extractor by @naoph in #1372
- Use `COOKIES_FILE` to fetch page titles by @benmuth in #1364
- Fallback to not `chown`'ing `./data/archive` dir if it's a network mount that prevents ownership changes by @gnattu in #1312
- Show the upgrade notification only in specific views by @benmuth in #1314
- ability to populate is_staff and is_superuser flags at LDAP authentication by @vladimirdulov in #1335
- Make it a little easier to run specific tests by @jimwins in #1371
- disable chrome automatic self-updating when running headless
- Add ability to populate `is_staff` and `is_superuser` flags during LDAP first auth
- allow more restrictive NFS permission coercion on `./data/archive`
- bump `yt-dlp`, `singlefile`, `wget`, `curl`, and `chrome` versions
- fix `RESOLUTION` being ignored when using Chrome headless in Docker
- fix sorting by Size / Files in the Admin Snapshots list page UI
- fix spinner icon showing on some Snapshots instead of favicon when only a few extractors are enabled
- fix yt-dlp sometimes failing to archive media due to filenames being too long or containing special characters
- fix wget extractor not finding output when `:80` or `:443` port is present in the original URL
- fix `/var/spool/cron/crontabs` permissions when mounting it via Docker
- fix `/browsers` chown on Docker `armv7` entrypoint failing
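The low-disk-space check on `/`, `/data`, and `/data/archive` listed above can be approximated in a few lines of Python. This is only an illustrative sketch, not ArchiveBox's actual implementation: the function name `low_space_warnings` and the 1 GiB threshold are assumptions made for the example.

```python
import shutil
from pathlib import Path

# Hypothetical threshold for this sketch; the real check may use a different value.
MIN_FREE_BYTES = 1 * 1024**3  # 1 GiB


def low_space_warnings(paths, min_free_bytes=MIN_FREE_BYTES):
    """Return one warning string per existing path with less free space than min_free_bytes."""
    warnings = []
    for path in paths:
        if not Path(path).exists():
            continue  # skip mount points that are not present on this system
        free = shutil.disk_usage(path).free
        if free < min_free_bytes:
            warnings.append(f"Low disk space on {path}: {free // 1024**2} MiB free")
    return warnings


if __name__ == "__main__":
    for message in low_space_warnings(["/", "/data", "/data/archive"]):
        print(message)
```

Paths that do not exist are skipped rather than reported, so the same sketch can run both inside and outside a container.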
COMING SOON: new `sci-dl` scientific paper downloader being worked on by @benmuth
New Contributors
- @Brandl made their first contribution in #1397
- @tqobqbq made their first contribution in #1396
- @gnattu made their first contribution in #1312
- @speerer made their first contribution in #1323
- @neel-suthar made their first contribution in #1330
- @jimwins made their first contribution in #1365
- @naoph made their first contribution in #1372
- @rdela made their first contribution in #1374
- @n-hebert made their first contribution in #1382
Full Changelog: v0.7.2...v0.8.0-rc
v0.7.2: Make scheduled imports taggable, fix admin buttons, readability, Docker permissions
Get this release via `pip`, `docker`, `brew`, or `dpkg` (`apt` & `brew` releases are delayed).
# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.2'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.2
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
pip install --upgrade 'archivebox==0.7.2'
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb
# then run pip install after
pip install --upgrade 'archivebox==0.7.2'
Note: this is not packaged using "proper" debian techniques like 0.6.2 was, instead it's just a wrapper for executing `pip install archivebox` w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.
(Launchpad `apt` `ppa` & `brew` updates coming eventually, packaging all the vendored binaries that archivebox depends on has gotten harder lately)
# Then run this to upgrade an existing collection data dir to 0.7.2
cd ~/path/to/data/dir
archivebox init
What's Changed
- add `--tag=tag1,tag2,tag3` support to `archivebox schedule` command
- allow `PGID=0` root-group ownership of data dir (but PUID=0 is still not allowed)
- improve error messages, hints, and logging about permissions issues in Docker
- notify users when new ArchiveBox version is available on GitHub (thanks @benmuth!)
- bump dependency versions (yt-dlp, chrome, readability, node, python)
- warn when Docker `/` or `/data` volume mounts don't have any space available
- limit compatible Python versions to >= 3.8 and <= 3.11
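A script that wants to confirm it is running on a Python version inside the 3.8 to 3.11 range noted above, before attempting an install, could check it as follows. This is a minimal sketch: the helper name `is_supported_python` is hypothetical and is not part of ArchiveBox.

```python
import sys

# Supported range taken from the release note above.
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 11)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) version falls within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


if __name__ == "__main__":
    if not is_supported_python():
        print(f"Python {sys.version_info.major}.{sys.version_info.minor} is outside the 3.8 to 3.11 range")
```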
Bug Fixes
- fix action buttons in Snapshot admin page not showing up correctly
- tag links immediately in first stage of `archivebox add` instead of at the end (so that imports that are paused or interrupted still get tagged correctly)
- fix config variables in `CHROME_USER_AGENT` format string not getting interpolated properly
- switch readability to prefer Chrome DOM dumps for article text instead of singlefile (because singlefile output is often huge and crashes readability/times out)
- make Docker image smaller by removing unneeded docs files
- better current version detection and remove annoying `+editable` string and also add BUILD_TIME
- fix `/browsers/*` does not exist warning on startup
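The `CHROME_USER_AGENT` fix above concerns config values that contain `{PLACEHOLDER}` variables referring to other config values. The sketch below shows that kind of interpolation with Python's `str.format`. The helper name, the config keys, and the values are made up for illustration and are not ArchiveBox's real defaults.

```python
def interpolate_config(template: str, config: dict) -> str:
    """Fill {KEY} placeholders in a config string using other config values."""
    return template.format(**config)


# Hypothetical values for illustration only.
config = {"VERSION": "0.7.2", "CHROME_VERSION": "118.0"}
template = "Mozilla/5.0 (ArchiveBox/{VERSION}) Chrome/{CHROME_VERSION}"

print(interpolate_config(template, config))
```

If interpolation is skipped, the literal `{VERSION}` text would be sent as part of the header, which is the kind of symptom the fix describes.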
v0.7.1: Minor new features, bugfixes, and new dependency versions
Get this release via `pip`, `docker`, `brew`, or `dpkg` (`apt` `ppa` update delayed).
# Get it with Pip on any OS (`amd64`, `arm64`, `arm/v7`)
pip install --upgrade 'archivebox==0.7.1'
# Get it with Docker on any OS (`amd64`, `arm64`, `arm/v7`)
docker pull archivebox/archivebox:0.7.1
# Get it with brew on macOS (`amd64`, `arm64`)
brew tap archivebox/archivebox
brew install archivebox
# Get it with apt on Ubuntu/Debian based systems (`any`)
wget 'https://github.com/ArchiveBox/debian-archivebox/raw/main/archivebox-0.7.1.deb'
apt install ./archivebox-0.7.1.deb
# OR
dpkg -i ./archivebox-0.7.1.deb
Note: this is not packaged using "proper" debian techniques like 0.6.2 was, instead it's just a wrapper for executing `pip install archivebox` w/ a few extras. This is because ArchiveBox relies on some binary and dynamic dependencies (node, chrome, playwright, ffmpeg, yt-dlp, etc.) which aren't allowed in Debian packages.
(Launchpad `apt` `ppa` update coming eventually, packaging for `apt` has gotten harder lately)
# Then run this to upgrade an existing collection data dir to 0.7.1
cd ~/path/to/data/dir
archivebox init
What's Changed
Lots of bugfixes, speedups, and small convenience features.
- fix bookmarklet script by @dryrain39 in #708
- point to master image, not latest by @FiddlyRumpus in #739
- Docs: Improve spelling on readme by @Namdrib in #766
- Exempt /add route from CSRF by @tjhorner in #777
- Bump ws from 5.2.2 to 5.2.3 by @dependabot in #784
- Discard Referer header from iframe and link to original URL by @Inndy in #799
- Update setup.sh in #804
- Fix Pinboard RSS parsing valid links as `None` by @overhacked in #822
- healthcheck endpoint by @ajgon in #873
- Update README.md by @adamwolf in #884
- Fixes Add button behavior on Safari by @adamwolf in #886
- Tweak JS so Safari can choose admin actions by @adamwolf in #885
- Avoid KeyError on Pocket API parser by @bltavares in #843
- (#847) Decode error output hints to string if needed by @TheCakeIsNaOH in #904
- Change logfile open to write mode only by @tuupola in #906
- Fix #725 - correctly parse tags on json import by @hannah98 in #908
- Bump ansi-regex from 5.0.0 to 5.0.1 by @dependabot in #910
- Bump jszip from 3.6.0 to 3.7.1 by @dependabot in #909
- Added TAG_SEPARATOR_PATTERN option for splitting tags by @hannah98 in #911
- Fix typo: volumes section in docker-compose.yml should use array notation by @akhilleusuggo in #918
- Fix broken URI fragment in README.md by @xfq in #942
- Fix typo in README.md by @hyfen in #932
- Fix bin_version: set LANG=C when calling executables to avoid parsing localized output by @pellaeon in #936
- Fix arch installation command by @CrazyPython in #923
- Update pywb entrypoint by @kusold in #961
- Fix missing input redirection in a hint text by @rossvor in #967
- improve title extractor by @prnake in #924
- Bump node-fetch from 2.6.1 to 2.6.7 by @dependabot in #969
- Add PikaPods as commercial hosting option by @m3nu in #974
- Attempted to warn on #984 and #1014 by @turian in #1020
- Method typo? by @EsEnZeT in #1048
- Added standalone dockerfile instructions by @turian in #1023
- Add missing migration 0021 by @turian in #1027
- get setup.sh to run on FreeBSD again (13.x) by @mwestza in #1068
- Warn on broken steps, use yt-dlp to avoid youtube-dl errors, and don't crash on bad UTF-8 by @turian in #1026
- Add SINGLEFILE_ARGS to control single-file arguments by @notevenaperson in #1021
- Support for Reverse Proxy authentication backends (like authelia) by @ajgon in #866
- Bump moment from 2.29.3 to 2.29.4 by @dependabot in #1081
- Install the CodeSee workflow. by @codesee-maps in #1103
- Revert "Install the CodeSee workflow." by @pirate in #1104
- add systemd config by @fa0311 in #1115
- add CHROME_TIMEOUT args by @fa0311 in #1120
- add explicitly specify --headless=new by @fa0311 in #1123
- Add missing closing quote to style attribute by @tejr in #1128
- Fix for Issue #1008 by @dcalano in #1131
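The `TAG_SEPARATOR_PATTERN` option added in #911 controls how a raw tag string is split into individual tags. A minimal sketch of regex-based tag splitting follows. The pattern shown is an assumed example value and the function `split_tags` is hypothetical; the option's real default and the code that consumes it may differ.

```python
import re

# Assumed example pattern: split on commas or semicolons. The option's real default may differ.
TAG_SEPARATOR_PATTERN = r"[,;]"


def split_tags(raw: str, pattern: str = TAG_SEPARATOR_PATTERN) -> list:
    """Split a raw tag string on the separator pattern, dropping empty entries."""
    return [tag.strip() for tag in re.split(pattern, raw) if tag.strip()]


print(split_tags("news, tech;python"))
```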
New Contributors
- @dryrain39 made their first contribution in #708
- @FiddlyRumpus made their first contribution in #739
- @Namdrib made their first contribution in #766
- @tjhorner made their first contribution in #777
- @Inndy made their first contribution in #799
- @ajgon made their first contribution in #873
- @TheCakeIsNaOH made their first contribution in #904
- @tuupola made their first contribution in #906
- @akhilleusuggo made their first contribution in #918
- @xfq made their first contribution in #942
- @hyfen made their first contribution in #932
- @pellaeon made their first contribution in #936
- @CrazyPython made their first contribution in #923
- @kusold made their first contribution in #961
- @rossvor made their first contribution in #967
- @prnake made their first contribution in #924
- @m3nu made their first contribution in #974
- @turian made their first contribution in #1020
- @EsEnZeT made their first contribution in #1048
- @mwestza made their first contribution in #1068
- @notevenaperson made their first contribution in #1021
- @codesee-maps made their first contribution in #1103
- @fa0311 made their first contribution in #1115
- @tejr made their first contribution in #1128
- @dcalano made their first contribution in #1131
Full Changelog: v0.6.2...v0.7.1
v0.6.2: >10x performance gain, new Admin UI & CLI features, and more
New features
- new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
- ability to save multiple snapshots of the same URL over time using new `Re-snapshot` button
- add `init --quick` and `server --quick-init` options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
- add new `archivebox setup` command and `archivebox init --setup` flag to aid in automatically installing dependencies and creating a superuser during initial setup
- new `SNAPSHOTS_PER_PAGE=40` and `MEDIA_MAX_SIZE=750m` config options
- allow hotlinking directly to specific extractor output on the snapshot detail page using URL `#hash` e.g. `/archive/<timestamp>/index.html#git`
- add ability to view the snapshot matching a given URL by visiting `/archive/https://example.com/some/url` -> redirects to -> `/archive/<timestamp>/index.html` (also works without scheme `/archive/example.com`)
- #660 add ability to tag URLs while adding them via the web UI and via the CLI using `archivebox add --tag=tag1,tag2,tag3 ...`
- #659 add back ability to override visual styling with custom HTML and CSS using new config option `CUSTOM_TEMPLATES_DIR`
- ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown
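The URL-based snapshot lookup described above (`/archive/<url>` redirecting to `/archive/<timestamp>/index.html`, with or without a scheme) can be sketched as a small resolver. This is an illustration under stated assumptions, not the project's view code: the in-memory `SNAPSHOTS` mapping, the timestamps, and both function names are hypothetical.

```python
from typing import Optional

# Hypothetical in-memory index mapping snapshot URLs to timestamps.
SNAPSHOTS = {
    "https://example.com/some/url": "1611111111",
    "https://example.com": "1622222222",
}


def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so URLs compare scheme-insensitively."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def resolve_archive_path(path: str, snapshots=SNAPSHOTS) -> Optional[str]:
    """Map /archive/<url> to /archive/<timestamp>/index.html, or None if no snapshot matches."""
    prefix = "/archive/"
    if not path.startswith(prefix):
        return None
    wanted = strip_scheme(path[len(prefix):]).rstrip("/")
    for url, timestamp in snapshots.items():
        if strip_scheme(url).rstrip("/") == wanted:
            return f"/archive/{timestamp}/index.html"
    return None
```

Comparing both sides with the scheme removed is what lets `/archive/example.com` and `/archive/https://example.com` resolve to the same snapshot in this sketch.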
Enhancements
- lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
- full text search now works on the public snapshot list
- dates and times are now localized to your browser's timezone instead of showing in UTC
- integrity and correctness improvements to readability, mercury, warc, and other extractors
- video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
- log all errors with full tracebacks to new `data/logs/errors.log` file (so users no longer have to run in --debug mode to see error details)
- better `archivebox schedule` logging and changed logfile location to `./logs/schedule.log`
- better docker-compose setup experience with sonic config example in `docker-compose.yml`
- add Django Debug Toolbar + `djdt_flamegraph` for developers to profile UI performance
- add `--overwrite` flag support to `archivebox schedule`, archived urls get added similarly to `add --overwrite`
- #644 remove bootstrap and jquery network requests to CDNs by inlining them instead
- #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
- #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
- 3276434 add new `SEARCH_BACKEND_TIMEOUT` config option to tune amount of time search backend can take before it gives up
- more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
- make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
- better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
- added `Cache-Control` headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
- new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io
Bugfixes
- #673 fix searching by URL substring in Snapshot admin list
- #658 fix Snapshot admin action buttons not working in Safari and some other browsers
- #678 fix `AssertionError` error when archivebox would attempt to archive with `CHROME_BINARY=None` when Chrome was not found on host system
- #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
- #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful y'all)
- #433 fix deleted items sometimes reappearing on next import/update
- #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
- fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose
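The Windows encoding problems in #674 are typical of code that relies on the platform's default text encoding, which on Windows is usually not UTF-8. A common remedy, shown here as a general sketch rather than the exact change made in ArchiveBox, is to pass `encoding="utf-8"` explicitly whenever files are opened in text mode. The helper names below are hypothetical.

```python
import tempfile
from pathlib import Path


def write_text_utf8(path: Path, text: str) -> None:
    """Write text as UTF-8 regardless of the platform's default locale encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text_utf8(path: Path) -> str:
    """Read text as UTF-8 regardless of the platform's default locale encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "title.txt"
        write_text_utf8(target, "Árvíztűrő tükörfúrógép ✨")
        print(read_text_utf8(target))
```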
v0.5.6: Bugfixes and packaging improvements
- add ARMv7 and ARMv8 CPU support for `apt`/`deb` distribution on Launchpad PPA
- fix nodesource apt repo not supported on i386 b90afc8
- fix handling of skipped ArchiveResult entries with null output 0aea5ed
- catch exception on import of old index.json into ArchiveResult 171bbeb
- move debsign to release not build 66fb5b2
- skip tests during debian build a32eac3
- fix emptystrings in cmd_version causing exception a49884a
- automate deb dist better and bump version 0e6ac39
- fix assertion 6705354
- change wording of db not found error 683a087
v0.5.4: New Snapshot detail UI, lots of bugfixes, speed improvements, and limit media downloads to 750mb by default
Thank you contributors who helped with the 181 commits in this release!
@cdvv7788, @jdcaballerov, @thedanbob, @aggroskater, @mAAdhaTTah, @mario-campos, @mikaelf
- fix migration failing due to null cmd_versions in older archives a3008c8
- Publish minor & major versions to DockerHub and set up CodeQL codeql-analysis.yml c5b7d9f, bbb6cc8
- fix DATABASE_NAME posixpath, and dependencies dict bug 02bdb3b, 5c7842f
- use relative imports for `.util` to fix windows import clash 72e2c7b
- fix `COOKIES_FILE` config param breaking in wget ef7711f
- Refactor `should_save_extractor` methods to accept `overwrite` parameter 5420903
- Fix issue #617 by using mark_safe in combination with format_html … 1989275
- make permission chowning on docker start less fancy, respect PUID/PGID #635
- add createsuperuser flag to server command 39ec77e
- fix files icons styling and use the db exclusively for rendering them, instead of filesystem f004058, 7d8fe66, 5c54bcc, 534ead2
- limit youtubedl download size to 750m and stop splitting out audio files 3227f54
- also search url, timestamp, tags on public index 8a4edb4
- fix trailing slash problems and wget not detecting download path 9764a8e
- add response status code to headers.json c089501
- fix singlefile path used for sonic 24e2493
- cleanup template layout in filesystem, new snapshot detail page UI
v0.5.3: New grid UI, full-text search, oneshot subcommand, Pocket API and Wallabag importers, bugfixes, and packaging improvements
- ArchiveResult moved to SQLite3 DB for performance @cdvv7788
- lots of assorted bugfixes and improvements courtesy of @cdvv7788 and @jdcaballerov
- new full-text search support with ripgrep and sonic courtesy of @jdcaballerov
- new `archivebox oneshot` command for downloading a single site without starting a whole collection
- new Pocket API importer courtesy of @mAAdhaTTah
- new Wallabag importer courtesy of @ehainry
- new extractor options on Add page courtesy of @BlipRanger
- new apt/deb/homebrew/pip packaging setup into separate repos under new Github Org https://github.com/ArchiveBox
- new official PPA and Docker Hub accounts https://hub.docker.com/r/archivebox/archivebox (with automatic armv7 builds courtesy of @chrismeller)
- new Snapshot grid view courtesy of @jdcaballerov
v0.4.24: Packaging improvements, UI improvements, and bugfixes
Last stable version for the v0.4 branch, contains numerous final fixes and improvements to v0.4 before the leap to v0.5.
v0.4.21: Better Node dependency version checking and sdist PATH fixes
v0.4.17: Bugfixes and CLI experience improvements
- Fix bugs with parsing long URLs as paths
- html-encoded URLs
- new generic HTML parser
- new `--init` and `--overwrite` flags on `add`
- improve stdout and hints
- fix Pull title button
- other small bugfixes