
hybrid between old and LUT implicit triangulation #971

Draft · wants to merge 1 commit into base: dev

Conversation

@JonasLukasczyk (Contributor)

Hi Julien,
this PR is the unholy union between the old and the LUT-based approach for implicit triangulations. For now this seems to result in a 4x speedup. We will work in the background on getting the 8x speedup of the pure LUT-based approach into TTK, but as you said, grabbing this speedup essentially for free right now makes sense. I tested this, but since touching the triangulation is tricky work, I recommend that we test it as much as possible.

Best
Jonas

@JonasLukasczyk (Contributor, Author)

Hi Julien, I ran clang-format, so I don't know why this is complaining.

@julien-tierny (Collaborator)

you may want to use a version of clang-format in line with the CI's.
fyi, I'm using clang-format 16.0.6 here.

@julien-tierny (Collaborator)

I've just pushed the diff with clang-format 16. let's see :)

@julien-tierny (Collaborator)

it looks like the CI is using clang-format 14.0-55.
you may want to use that one.

@julien-tierny (Collaborator)

I've just tried with clang-format 14.
Let's see if it's any better.

@JonasLukasczyk (Contributor, Author)

If this runs through I will squash the commits.

@julien-tierny (Collaborator) left a review

hi jonas,
thanks a lot!
that looks awesome.

I haven't run it yet (I'll double check the perfs over the weekend).
Could you address the two tiny remarks I mentioned here?

Also, could you please port this query over to the PeriodicImplicitTriangulation as well?
(in principle, there are fewer cases).

Thanks a lot!

core/base/implicitTriangulation/ImplicitTriangulation.h (outdated review thread, resolved)
core/base/implicitTriangulation/ImplicitTriangulation.h (outdated review thread, resolved)
@julien-tierny (Collaborator) left a review

here's another batch of small-scale remarks.
thanks!

core/base/implicitTriangulation/ImplicitTriangulation.h (outdated review thread, resolved)
core/base/implicitTriangulation/ImplicitTriangulation.cpp (outdated review thread, resolved)
core/base/implicitTriangulation/ImplicitTriangulation.h (outdated review thread, resolved)
core/base/implicitTriangulation/ImplicitTriangulation.h (outdated review thread, resolved)
@julien-tierny (Collaborator) commented Sep 16, 2023

perf notes (to be formatted later)

ScalarFieldSmoother(ctBones)

vger (laptop)

ttk:dev

  • 6 cores, 100 it: 36s
  • 1 core, 25 it: 34s

LUT

  • 6 cores, 100 it: 99s
  • 1 core, 25 it: 97s

hal9000 (desktop)

ttk:dev

  • 24 cores, 100 it: 13.782s
  • 1 core, 25 it: 27.455s

LUT

  • 24 cores, 100 it: 18.048s
  • 1 core, 25 it: 69.135s

@julien-tierny (Collaborator)

Hi Jonas,

I've been trying to benchmark a little bit the LUT implementation by simply applying the scalar field smoother (which only uses getVertexNeighbor) on ctBones.vti. I have built the LUT implementation with the same options as the current dev branch of TTK.

For some reason, I can't reproduce speedups on my systems (see the above quick benchmark).

Could you try to run this little benchmark on your end and report your numbers?
Here is the corresponding Python script:

```python
# state file generated using paraview version 5.11.1
import paraview
paraview.compatibility.major = 5
paraview.compatibility.minor = 11

#### import the simple module from the paraview
from paraview.simple import *
#### disable automatic camera reset on 'Show'
paraview.simple._DisableFirstRenderCameraReset()

# ----------------------------------------------------------------
# setup the data processing pipelines
# ----------------------------------------------------------------

# create a new 'XML Image Data Reader'
ctBonesvti = XMLImageDataReader(registrationName='ctBones.vti', FileName=['ctBones.vti'])
ctBonesvti.PointArrayStatus = ['Scalars_']
ctBonesvti.TimeArray = 'None'

# create a new 'TTK ScalarFieldSmoother'
tTKScalarFieldSmoother1 = TTKScalarFieldSmoother(registrationName='TTKScalarFieldSmoother1', Input=ctBonesvti)
tTKScalarFieldSmoother1.ScalarField = ['POINTS', 'Scalars_']
tTKScalarFieldSmoother1.IterationNumber = 25
tTKScalarFieldSmoother1.MaskField = ['POINTS', 'Scalars_']
tTKScalarFieldSmoother1.UseAllCores = 0

UpdatePipeline()

# create a new 'TTK ScalarFieldSmoother'
tTKScalarFieldSmoother2 = TTKScalarFieldSmoother(registrationName='TTKScalarFieldSmoother2', Input=ctBonesvti)
tTKScalarFieldSmoother2.ScalarField = ['POINTS', 'Scalars_']
tTKScalarFieldSmoother2.IterationNumber = 100
tTKScalarFieldSmoother2.MaskField = ['POINTS', 'Scalars_']

UpdatePipeline()
```

@JonasLukasczyk (Contributor, Author)

Hi Julien,
I cannot reproduce your numbers. Can you run HelloWorld on a 1024^3 Wavelet? That dataset is much bigger, and here I get a 3x speedup:
```
DEV
[HelloWorld] Computing Averages .............................. [6.587s|32T|100%]

LUT
[HelloWorld] Computing Averages .............................. [2.228s|32T|100%]
```

I suspect that your compiler decided not to inline the methods, and as far as I know there is no way to force this. What are your thoughts @petersteneteg?

Another thought: is it possible that your code uses the implicit triangulation with preconditioning?

@julien-tierny (Collaborator)

hi jonas,
thanks for your feedback.
I'll run the test you suggested.
In the meantime, can you please run the python script I suggested and report the numbers?
Thanks.
julien

@julien-tierny (Collaborator) commented Sep 20, 2023

perf notes (to be formatted later)

HelloWorldCmd -i ctBones512.vti (512 is sufficient to trigger the full implicit triangulation)

hal9000 (desktop)

ttk:dev

  • 1 core: 13.938s
  • 24 cores: 0.934s

ttk:dev with inline fully implicit getVertexNeighbors (https://github.com/julien-tierny/ttk/tree/implicitInlining)

  • 1 core: 7.926s
  • 24 cores: 0.522s
  • 24 cores: 0.436s (2209754)

lut

  • 1 core: 10.193s
  • 24 cores: 0.557s

vger (laptop)

ttk:dev

  • 1 core:
  • 6 cores:

lut

  • 1 core: 22.847s
  • 6 cores: 6.405s

perf smoother

ttkScalarFieldSmootherCmd -i ctBones512.vti -I

hal9000 (desktop)

ttk:dev

  • 1 core (5 it): 90.695s
  • 24 cores (50 it): 56.186s

ttk:dev with inline fully implicit getVertexNeighbors (https://github.com/julien-tierny/ttk/tree/implicitInlining)

  • 1 core (5 it): 39.800s
  • 24 cores (50 it): 40.502s

lut

  • 1 core (5 it): 57.788s
  • 24 cores (50 it): 43.719s

perf persistence diagram

ttkPersistenceDiagramCmd -i ctBones512.vti

hal9000 (desktop)

ttk:dev

  • 24 cores: 90.434s (44.357s), 91.534s (44.000s), 91.618s (43.766s), 94.023s (44.365s), 91.203s (44.647s) AVG: 91.7624 (44.227)

ttk:dev with inline fully implicit getVertexNeighbors (https://github.com/julien-tierny/ttk/tree/implicitInlining)

  • 24 cores: 82.889s (36.938s), 83.455s (37.102s), 86.247s (38.791s), 85.258s (37.642s), 83.753s (37.097s) (53e2e58) AVG: 84.3204 (37.514)
  • 24 cores: 86.703s (a53067d)
  • 24 cores: 87.456s (0e4eee9)
  • 24 cores: 84.082s (39.638s), 87.711s (39.942s), 86.249s (39.713s), 88.103s (41.261s), 85.699s (39.606s) (57d7181) AVG: 86.3688 (40.032)

lut

  • 24 cores: 93.594s

@julien-tierny (Collaborator)

alright, so I believe I understand what's going on here.

in short:

  • preconditioned implicit is ~3x faster than LUT, which is ~2x faster than fully implicit (especially in parallel)

so then, I believe this PR should be modified: the LUT implementation should replace the fully implicit one, not the preconditioned one (I have the impression you replaced both)

@julien-tierny (Collaborator)

so in short, I believe you should leave the old implementation (i.e. `template <typename Derived> int ImplicitTriangulationCRTP<Derived>::TTK_TRIANGULATION_INTERNAL(getVertexNeighbor)(const SimplexId &vertexId, const int &localNeighborId, SimplexId &neighborId) const`) as is in ImplicitTriangulation.cpp, but you should re-implement the function for the fully implicit triangulation.

for that, you can follow the example of the function getVertexPosition, which has one (specialized) fully implicit implementation (ImplicitTriangulation.cpp) and a preconditioned (default) one (ImplicitTriangulation.h).

we can discuss this on Friday.
cheers.

@JonasLukasczyk (Contributor, Author)

Hi everyone,
I'm sorry, but I cannot spend any more time on this. @pierre-guillou, can you add this fix?

Best
Jonas

@julien-tierny (Collaborator)

hi jonas,
I have updated the perf numbers above, taking just the in-lining speedup (https://github.com/julien-tierny/ttk/tree/implicitInlining). I'm currently populating these numbers for my laptop.
Something to talk about on Friday.

@JonasLukasczyk (Contributor, Author)

I don't understand why you don't see any speedup from the hybrid over the inlined dev version, but I think we have already reached the point where investing more time into the hybrid defeats the purpose. For now it seems much easier to just inline all methods of the fully implicit triangulation and call it a day. Afterwards we can see if a purely LUT-based approach brings additional speedup.

@julien-tierny (Collaborator) commented Sep 23, 2023

alright everyone.
so I crunched some numbers, and it turns out that in-lining (final) all the functions of the implicit triangulation (https://github.com/julien-tierny/ttk/tree/implicitInlining) does improve performance overall.
for the discrete gradient computation (a sub-part of the persistence diagram computation, and a fairly good benchmark since it uses most of the triangulation's queries), I observe an average speedup of about 10% over the current dev branch of ttk (see the performance numbers above; the numbers in parentheses correspond to the discrete gradient). that's good.

what bugs me is that, if we don't in-line all functions but only getVertexNeighbor and getVertexNeighborNumber (only these two functions, julien-tierny@53e2e58), the gain is actually better (15%).
I don't have a clear explanation here. I thought that in-lining could only increase performance (or, at worst, leave it unchanged); I was not expecting performance to degrade as more functions were in-lined.

do you guys have any explanations of what is going on?
@petersteneteg @pierre-guillou

thanks a lot for your feedback.
best,

@petersteneteg (Contributor)

@julien-tierny

To a first approximation, inlining more will make the code faster, since it avoids function-call overhead and often enables a bunch of other optimizations.

The problem arises when a lot of code gets inlined: the code size increases, since the code essentially gets copy-pasted many times. And when the code size increases, that puts more pressure on the CPU's instruction cache; the processor ends up starved for instructions rather than data.

So ideally you need to balance the amount of inlining you are doing. The compiler has a lot of built-in heuristics to strike a good balance between inlining and code size, but it will not always find the optimal one.

One can use tools like cachegrind to look at the number of instruction-cache misses versus data-cache misses and guide the optimization.

@julien-tierny (Collaborator)

hi @petersteneteg
thanks a lot for your explanations. all of this is making sense.

I think I'll go ahead and PR the commit which optimizes the performance, then (with only the getVertexNeighbor queries inlined). Thanks a lot!

@JonasLukasczyk (Contributor, Author)

I think just inlining the neighbor methods is sufficient for now; those are the most relevant queries anyway. Maybe in the future the LUT-based procedures will also be much smaller and therefore fit into the instruction cache. We will have to see.
