Feature/gcp tpu #609

Open
wants to merge 81 commits into base: master

Conversation

jmikedupont2

Here is my branch of hivemind that works on the GCP TPU.

mryab and others added 30 commits June 20, 2022 16:40
- fix an edge case where expert requests with 3.99-4 MB payloads would fail due to the max message size (because of serialization overhead)
- recover from errors in the Runtime and propagate them to the corresponding tasks
   - previously, a failing function would terminate the entire server, which was a major pain for me personally :)
   - a failure to process a request now triggers P2PHandlerError instead of P2PDaemonError (because it does not kill the daemon)
- allow optional metadata in ExpertRequest / ExpertResponse for extensibility [todo: validate it vs. @mryab ]

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
Co-authored-by: Pavel Samygin <samygin@phystech.edu>
(cherry picked from commit ef0b842)
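The payload-size edge case above can be illustrated with a minimal sketch (the constant names and the 64 KiB margin are assumptions for illustration, not hivemind's actual values): a tensor payload just under 4 MB can still exceed the transport's maximum message size once serialization overhead is added, so headroom must be reserved.

```python
MAX_MESSAGE_SIZE = 4 * 1024 * 1024   # typical gRPC-style default: 4 MiB
SERIALIZATION_OVERHEAD = 64 * 1024   # hypothetical margin for framing/metadata

def fits_in_one_message(payload: bytes) -> bool:
    # Reserve headroom so a ~3.99 MB payload plus serialization overhead
    # does not silently exceed the maximum message size.
    return len(payload) + SERIALIZATION_OVERHEAD <= MAX_MESSAGE_SIZE
```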
The type of the metadata field in ExpertRequest / ExpertResponse was changed to the more native type `bytes`, and some compatibility fixes were made to the tests to support different `torch` versions.

(cherry picked from commit fe7a4ef)
It is not immediately clear from the documentation that this example cannot run on multiple machines. This PR clarifies this.

(cherry picked from commit ee75b91)
* make DHT ignore SIGINT
* update p2pd version

Co-authored-by: @borzunov
(cherry picked from commit 61e5e8c)
…#494)

* Update README with latest projects and publications

* Reformat the BibTeX entries

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
(cherry picked from commit d42c703)
I think some people are interested in the "Example Use Cases" section because they'd like to know what has already been built with hivemind, while others would like to look at the code if they've already started using hivemind and want some examples.

Currently, the sahajBERT link leads to the sahajBERT repo, which doesn't describe much about the project itself. Conversely, it's hard to find the repos with the code behind the CALM and "Training Transformers Together" links.

This PR adds more useful links to each of the projects.

(cherry picked from commit 7a7c93a)
Co-authored-by: Alex <alexandershulga.sh@gmail.com>
(cherry picked from commit bb3aed6)
…earning-at-home#503)

This PR fixes a potential deadlock in hivemind.utils.enter_asynchronously.
This deadlock occurs when many coroutines enter nested locks and exhaust all workers in ThreadPoolExecutor.
In this PR, we mitigate it by creating a dedicated executor for entering locks with no limit to the number of workers.

Co-authored-by: Aleksandr Borzunov <borzunov.alexander@gmail.com>
(cherry picked from commit b02bdad)
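The mitigation can be sketched as follows (names and the worker cap are illustrative assumptions, not hivemind's actual API): lock acquisition is offloaded to a dedicated executor, so nested acquisitions cannot starve a shared thread pool and deadlock each other.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import asynccontextmanager

# Dedicated executor for entering locks (hypothetical name); a generous
# worker cap means nested lock acquisitions never exhaust the pool that
# other coroutines need to release their locks.
_LOCK_EXECUTOR = ThreadPoolExecutor(max_workers=1024)

@asynccontextmanager
async def enter_asynchronously_sketch(lock: threading.Lock):
    # Acquire the blocking lock in the dedicated executor, keeping the
    # event loop responsive while we wait.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(_LOCK_EXECUTOR, lock.acquire)
    try:
        yield
    finally:
        lock.release()
```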
…home#506)

The TaskPoolBase interface currently requires iterate_minibatches to be implemented. However, this method is not called by anything except TaskPool (internally). Runtime actually calls load_batch_to_runtime. This PR changes the interface to reflect that.

While we're at it, I've also changed the prefetch generator so that it actually does not prefetch batches when prefetch_batches = 0. Previously, 0 would silently mean "unlimited".

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
(cherry picked from commit 41587e4)
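The prefetch semantics can be sketched with a plain generator (a simplified stand-in for hivemind's prefetch generator, not its actual code): with `prefetch == 0` the iterator stays fully lazy, while a positive value bounds how many items a background thread may produce ahead of the consumer.

```python
import queue
import threading

def prefetch_iter(source, prefetch: int):
    """Yield items from source; prefetch == 0 means no prefetching at all,
    prefetch > 0 lets a background thread run at most that far ahead."""
    if prefetch == 0:
        yield from source  # fully lazy: no background thread, no lookahead
        return
    q = queue.Queue(maxsize=prefetch)  # bounded queue caps the lookahead
    _END = object()

    def producer():
        for item in source:
            q.put(item)  # blocks once the queue holds `prefetch` items
        q.put(_END)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not _END:
        yield item
```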
…iority (learning-at-home#505)

Currently, the priority is set to the timestamp of the earliest undispatched task.
Choosing the earliest tasks reduces the maximum waiting time when the queue is nonempty.

Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
Co-authored-by: Pavel Samygin <44449246+greenfatguy@users.noreply.github.com>
(cherry picked from commit 6395e89)
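A minimal sketch of this policy (class and method names are hypothetical, not hivemind's actual API): the pool's priority is the submission timestamp of its earliest undispatched task, so the longest-waiting task is dispatched first.

```python
import heapq
import itertools
import time

class TaskQueueSketch:
    """Pool priority = timestamp of the earliest undispatched task."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps

    def submit(self, task, timestamp=None):
        timestamp = time.monotonic() if timestamp is None else timestamp
        heapq.heappush(self._heap, (timestamp, next(self._counter), task))

    def priority(self):
        # The earliest undispatched task's timestamp, or None if empty.
        return self._heap[0][0] if self._heap else None

    def dispatch(self):
        # Pop the longest-waiting task, bounding the maximum waiting time.
        return heapq.heappop(self._heap)[2]
```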
* Add support for quantization with bitsandbytes

* Extend the compression benchmark

* Add a test for blockwise compression

* Add a note to README about bitsandbytes

* Install bitsandbytes in tests as well

* Verify outputs consistently in test_moe.py
(to make the test less flaky)

* Pass device="cpu" in test_background_server_identity_path
This ensures that the server can actually launch in a GPU-enabled environment: otherwise, initializing the CUDA context in the parent process prevents it from starting.

* Filter bitsandbytes warnings

(cherry picked from commit 131f82c)
forbid protobuf 4.x for now

(cherry picked from commit e9f35b5)
While using scripts built with hivemind, users often run two peers with the same identity by accident (e.g., if they forget to change the CLI command or copy the same identity file to another host via `scp`). Currently, this leads to undefined behavior in libp2p.

This PR makes `hivemind.P2P` check if the identity is already taken, thus solving this issue in all applications at once.

(cherry picked from commit 64a6c30)
…xes (learning-at-home#513)

- In `hivemind.Server`, use the graceful shutdown for `ConnectionHandler`
- In `hivemind.P2P`, if we are the first peer, skip checking if the provided identity is free

(cherry picked from commit 13cdd13)
* Update bitsandbytes, relax its version constraint

(cherry picked from commit 44d9569)
Fixed the broken link in the tutorial.

(cherry picked from commit 3e817a5)
…home#517)

Currently, one may sometimes get the "unable to open shared memory" error while using `hivemind.MPFuture`. Interestingly, the smaller `HIVEMIND_SHM_BUFFER_SIZE` is, the more often the error occurs (e.g., in Petals, it occurs right after starting the server if `HIVEMIND_SHM_BUFFER_SIZE=2`).

It turns out this happens when the origin process garbage-collects all MPFuture instances using the same shmem buffer: the underlying buffer is then freed, and target processes can't reconnect to it anymore when unpickling their MPFuture instances.

This PR fixes this important issue.

(cherry picked from commit 94c985d)
This is necessary for learning-at-home#521 to work. The minimal version where `torch.inference_mode()` works is 1.9.0.

(cherry picked from commit 1242cfb)
Before this PR, the P2P daemon was often killed after `idle_timeout` even if a persistent connection was open, due to a concurrency bug in go-libp2p-daemon that was just fixed: learning-at-home/go-libp2p-daemon#21

(cherry picked from commit 8d51b97)
This PR implements bfloat16 support for `CompressionType.NONE` and `CompressionType.BLOCKWISE_8BIT`.

This is important for the Petals client, see bigscience-workshop/petals#79

(cherry picked from commit 1e4af43)
…e#525)

Before this PR, hivemind-dht-based initial peers collected lots of stale PeerIDs and other peers could not actually make DHT queries anymore.

(cherry picked from commit be88b42)
This version contains relevant changes that improve the operation of libp2p relays, see learning-at-home/go-libp2p-daemon#22.

Co-authored-by: Pavel Samygin <44449246+greenfatguy@users.noreply.github.com>
(cherry picked from commit 4c167fa)
justheuristic and others added 30 commits March 31, 2023 16:55
- Fix LRSchedulerBase
- Handle None after .zero_grad() in torch 2.0.0
- Use set_to_none=True by default in torch>=2.0
- Add set_to_none param to TrainingStateAverager.step()

Co-authored-by: Aleksandr Borzunov <hxrussia@gmail.com>
(cherry picked from commit 98531ce)
…e#561)

Previously, `RemoteExpertWorker` ran one coroutine at a time, so hivemind.moe/Petals clients were very slow for concurrent calls.

(cherry picked from commit 589cb2c)
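The speedup pattern can be sketched as follows (the class name is hypothetical; this is a simplified stand-in, not `RemoteExpertWorker` itself): coroutines submitted from any thread are scheduled on one background event loop and run concurrently, instead of being awaited one at a time.

```python
import asyncio
import threading

class CoroutineWorkerSketch:
    """Run submitted coroutines concurrently on one background event loop."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        # The loop runs forever in a daemon thread; all submitted
        # coroutines share it and interleave concurrently.
        threading.Thread(target=self._loop.run_forever, daemon=True).start()

    def submit(self, coro):
        # Thread-safe handoff; returns a concurrent.futures.Future.
        return asyncio.run_coroutine_threadsafe(coro, self._loop)
```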
…#565)

This PR:

1. Fixes warnings in hivemind.p2p destructors.

2. Makes bfloat16 serialization in hivemind.compression forward- and backward-compatible. The code before this PR (a) didn't work in torch < 1.13.0 (hivemind requires torch >= 1.9.0) and (b) led to warnings on torch >= 2.0. The new code works without warnings in all versions of PyTorch.

(cherry picked from commit 0d2614d)
Pydantic 2.0 was released yesterday and is not compatible with the current code.

(cherry picked from commit b7cbd97)
…-home#587)

* allow overriding args/kwargs in Runtime
* switch stats time to time.perf_counter

---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
(cherry picked from commit 33a9a41)
This doesn't change anything on Linux but helps macOS users. Specifically, it helps to:

- Avoid [this error](bigscience-workshop/petals#405 (comment)) for people who don't use `if __name__ == "__main__"` in simple scripts on macOS (which uses spawn for processes by default).
- Make DHT consistent with other code that inherits from `mp.context.ForkProcess` directly.

(cherry picked from commit 1eb5d18)
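The pattern can be sketched like this (the worker class is hypothetical and assumes a POSIX platform where fork is available): inheriting from the fork context's `Process` class makes the child fork even where the platform default start method is spawn, so no `if __name__ == "__main__"` guard is required.

```python
import multiprocessing as mp

# Inherit from the fork context's Process class directly, so this worker
# is forked regardless of the platform's default start method.
class ForkedWorker(mp.context.ForkProcess):
    def __init__(self, conn):
        super().__init__(daemon=True)
        self.conn = conn

    def run(self):
        # Runs in the forked child: the pipe end was inherited via fork,
        # no pickling of the parent's state is needed.
        self.conn.send("ready")

parent_conn, child_conn = mp.Pipe()
worker = ForkedWorker(child_conn)
worker.start()
msg = parent_conn.recv()
worker.join()
```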
…g-at-home#588)

This PR makes hivemind use a separate p2pd binary for each `(os, platform)` pair, so:

- Now we download a 2x smaller binary for the specific macOS arch, instead of downloading the large universal binary
- Now we also provide `p2pd-linux-arm64` binary (maybe someone wants to run a DHT node on Raspberry Pi?)

(cherry picked from commit 27318f9)
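The per-platform selection can be sketched as a simple mapping (the function name, alias table, and naming scheme are assumptions for illustration; only `p2pd-linux-arm64` is named in the text above):

```python
import platform

# Normalize machine names reported by platform.machine() to Go-style arches.
_ARCH_ALIASES = {"x86_64": "amd64", "amd64": "amd64",
                 "aarch64": "arm64", "arm64": "arm64"}

def p2pd_binary_name(system: str = None, machine: str = None) -> str:
    # Defaults to the current host; each (os, arch) pair gets its own binary
    # instead of one large universal build.
    system = (system or platform.system()).lower()     # 'linux', 'darwin', ...
    arch = _ARCH_ALIASES.get((machine or platform.machine()).lower())
    if arch is None:
        raise ValueError("unsupported architecture")
    return f"p2pd-{system}-{arch}"
```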
* serialize with requires_grad
* ensure that all compression methods return tensor of the original dtype
* test that all compression methods preserve dtype and requires_grad


---------

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
…g-at-home#595)

* Install setuptools+wheel in develop mode during CI

* Fix deprecations and update dependencies for examples/albert
* Bump p2pd version
* Bump multiaddr
* Remove pymultihash