Optimize internal clip_values function #1104

gogetron · 2024-04-20T10:54:45Z

Summary

This PR partially addresses #862

🎯 Purpose: Improve the performance of clip_values in internal/util.py.

Most of the improvement comes from calling np.sum and np.clip directly on the arrays.
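As a rough illustration of that idea, here is a minimal sketch. It is not the actual cleanlab code: the name `clip_values_sketch`, its parameters, and the assumption that the function clips to a range and then rescales to keep the original sum are all hypothetical.

```python
import numpy as np


def clip_values_sketch(x, low=0.0, high=1.0, new_sum=None):
    """Hypothetical stand-in for clip_values, NOT the cleanlab implementation.

    Assumes the function clips values to [low, high] and then rescales
    the result so that it sums to new_sum (or to the original sum).
    """
    x = np.asarray(x, dtype=float)
    # np.sum reduces in compiled code instead of a Python-level loop.
    target_sum = np.sum(x) if new_sum is None else new_sum
    # np.clip works on the whole array at once, with no per-element calls.
    clipped = np.clip(x, low, high)
    return clipped * target_sum / np.sum(clipped)


values = np.array([0.2, 1.5, -0.3, 0.6])
result = clip_values_sketch(values)
```

Both calls operate on the full array, which is the likely source of the speedup reported in the benchmark.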

I measured memory with the memory-profiler library; the benchmarking code is copied below. I also sorted the imports in the modified files.

Code Setup

```python
import numpy as np

from cleanlab.internal.util import clip_values

np.random.seed(0)

N = 100_000_000
x = np.random.random(N)
```

Current version

```python
%%timeit
%memit clip_values(x)
# peak memory: 6365.15 MiB, increment: 5323.42 MiB
# peak memory: 6365.48 MiB, increment: 5350.71 MiB
# peak memory: 6365.58 MiB, increment: 5349.82 MiB
# peak memory: 6365.70 MiB, increment: 5348.95 MiB
# peak memory: 6365.76 MiB, increment: 5347.84 MiB
# peak memory: 6365.70 MiB, increment: 5346.80 MiB
# peak memory: 6365.73 MiB, increment: 5345.84 MiB
# peak memory: 6365.76 MiB, increment: 5344.89 MiB
# 24.8 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

This PR

```python
%%timeit
%memit clip_values(x)
# peak memory: 2500.48 MiB, increment: 1460.37 MiB
# peak memory: 2537.27 MiB, increment: 1525.85 MiB
# peak memory: 2505.12 MiB, increment: 1493.64 MiB
# peak memory: 2537.51 MiB, increment: 1526.02 MiB
# peak memory: 2537.58 MiB, increment: 1525.84 MiB
# peak memory: 2537.61 MiB, increment: 1525.87 MiB
# peak memory: 2537.62 MiB, increment: 1525.88 MiB
# peak memory: 2537.51 MiB, increment: 1525.77 MiB
# 571 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Testing

🔍 Testing Done: Existing test suite.

I ran into an issue when I initially refactored the code. Calling float(sum(x)) raises a TypeError if the array is not 1D. To preserve that behavior, which the tests assert, I added a conditional check to the clip_values function.
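The difference can be reproduced in isolation. The sketch below uses a small 2D array; `require_1d` is a hypothetical guard written for illustration, and the exact check and error message in the PR may differ.

```python
import numpy as np

x2d = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

# The builtin sum adds the rows together and returns a 1D array of
# column sums; float() cannot convert that array to a scalar.
try:
    float(sum(x2d))
    builtin_raised = False
except TypeError:
    builtin_raised = True

# np.sum reduces over every axis and returns a scalar, so no error
# is raised and the old behavior would silently change.
total = np.sum(x2d)


def require_1d(x):
    """Hypothetical guard that restores the TypeError for non-1D input."""
    if np.ndim(x) != 1:
        raise TypeError(f"expected a 1D array, got {np.ndim(x)} dimensions")
```

An explicit check along these lines is what keeps the refactored function consistent with the existing test.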

References

Reviewer Notes


There are a few more functions that could be refactored in this file but I will open those in a different PR because of the TypeError unit test.

The reason for the conditional check that raises the TypeError is explained above: it keeps the test that expects a TypeError valid. Alternatively, we could remove that test and let this function accept other array shapes, such as (1, N) or (N, 1) arrays.
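If that alternative were pursued, one possible shape for it is sketched below. This is only an illustration: `clip_values_any_vector` is a hypothetical name, and the clip-then-rescale behavior is an assumption about what the function does, not a copy of the cleanlab code.

```python
import numpy as np


def clip_values_any_vector(x, low=0.0, high=1.0):
    """Hypothetical variant that accepts 1D, (1, N), or (N, 1) input.

    Assumes clip-then-rescale semantics; NOT the cleanlab implementation.
    """
    x = np.asarray(x, dtype=float)
    # "Vector-like" means at most one axis has length greater than 1.
    if sum(dim > 1 for dim in x.shape) > 1:
        raise TypeError(f"expected a vector-like array, got shape {x.shape}")
    flat = x.ravel()
    target_sum = flat.sum()
    clipped = np.clip(flat, low, high)
    # Restore the caller's original shape before returning.
    return (clipped * target_sum / clipped.sum()).reshape(x.shape)


column = np.array([[0.2], [1.5], [0.3]])
rescaled = clip_values_any_vector(column)
```

True matrices such as a (2, 2) array would still raise a TypeError, so only the test cases covering (1, N) or (N, 1) input would need to change.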

Perf: clip_values

8618f94

gogetron changed the title ~~Perf: clip_values~~ Optimize internal clip_values function Apr 20, 2024
