Optimize internal clip_values function #1104

Open
wants to merge 1 commit into
base: master
from

Conversation

gogetron
Contributor

@gogetron gogetron commented Apr 20, 2024

Summary

This PR partially addresses #862

🎯 Purpose: Improve performance of clip_values in internal/util.py file.

Most of the speedup (about 43x in the benchmark below, from 24.8 s to 571 ms per call) comes from calling np.sum and np.clip directly on the arrays. Peak memory also drops from roughly 6365 MiB to roughly 2537 MiB.
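
To illustrate the idea, here is a minimal sketch of a vectorized clip-and-rescale. It is not the code from this PR: the name clip_values_sketch, the low/high/new_sum parameters, and the assumption that the function rescales the clipped values back to the original sum are all hypothetical and chosen only for illustration.

```python
import numpy as np


def clip_values_sketch(x, low=0.0, high=1.0, new_sum=None):
    """Hypothetical stand-in for clip_values; not the code from this PR."""
    x = np.asarray(x, dtype=float)
    if x.ndim != 1:
        # Assumed guard: reject non-1D input with a TypeError.
        raise TypeError("expected a 1D array")
    # Sum to restore after clipping (or a caller-supplied target sum).
    target_sum = np.sum(x) if new_sum is None else new_sum
    # One vectorized pass over the array instead of a Python-level loop.
    clipped = np.clip(x, low, high)
    return clipped * target_sum / np.sum(clipped)


values = np.array([0.1, 0.5, 1.5, -0.2])
rescaled = clip_values_sketch(values)
```

Both np.clip and np.sum run in compiled code over the whole array, which is where a gain of this kind would come from compared with the built-in sum or a per-element Python function.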

I measured memory with the memory-profiler library. The benchmarking code is copied below. I also sorted the imports in the modified files.

Code Setup

import numpy as np

from cleanlab.internal.util import clip_values

np.random.seed(0)

N = 100_000_000
x = np.random.random(N)

Current version

%%timeit
%memit clip_values(x)
# peak memory: 6365.15 MiB, increment: 5323.42 MiB
# peak memory: 6365.48 MiB, increment: 5350.71 MiB
# peak memory: 6365.58 MiB, increment: 5349.82 MiB
# peak memory: 6365.70 MiB, increment: 5348.95 MiB
# peak memory: 6365.76 MiB, increment: 5347.84 MiB
# peak memory: 6365.70 MiB, increment: 5346.80 MiB
# peak memory: 6365.73 MiB, increment: 5345.84 MiB
# peak memory: 6365.76 MiB, increment: 5344.89 MiB
# 24.8 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR

%%timeit
%memit clip_values(x)
# peak memory: 2500.48 MiB, increment: 1460.37 MiB
# peak memory: 2537.27 MiB, increment: 1525.85 MiB
# peak memory: 2505.12 MiB, increment: 1493.64 MiB
# peak memory: 2537.51 MiB, increment: 1526.02 MiB
# peak memory: 2537.58 MiB, increment: 1525.84 MiB
# peak memory: 2537.61 MiB, increment: 1525.87 MiB
# peak memory: 2537.62 MiB, increment: 1525.88 MiB
# peak memory: 2537.51 MiB, increment: 1525.77 MiB
# 571 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing

🔍 Testing Done: Existing test suite.

I ran into an issue when I initially refactored the code. The original float(sum(x)) call raises a TypeError if the array is not 1D. To preserve that behavior, which the tests assert, I added a conditional check to the clip_values function.
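
The difference can be reproduced with a small example. This is only a sketch using a hypothetical (2, 3) array, not the project's test: the built-in sum adds the rows of a 2D array and returns a length-3 array, which float() cannot convert, while np.sum reduces over every axis and returns a scalar.

```python
import numpy as np

x2d = np.ones((2, 3))  # hypothetical non-1D input

# Built-in sum iterates over rows, so the result is an array of shape (3,).
row_sum = sum(x2d)

try:
    float(row_sum)  # converting a multi-element array to float fails
    builtin_raised = False
except TypeError:
    builtin_raised = True

# np.sum reduces over all axes, so no error is raised here.
total = float(np.sum(x2d))
```

This is why a refactor that switches to np.sum needs an explicit check if the TypeError is to be kept.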

References

Reviewer Notes


There are a few more functions that could be refactored in this file but I will open those in a different PR because of the TypeError unit test.

The reason for the conditional check that raises the TypeError is explained above: it keeps the test that expects a TypeError valid. Alternatively, we could remove that test and accept other array shapes in this function, such as (1, N) or (N, 1) arrays.

@gogetron gogetron changed the title from "Perf: clip_values" to "Optimize internal clip_values function" on Apr 20, 2024