Timeline for Performance of n-section root finding algorithm
Current License: CC BY-SA 4.0
27 events
when | what | by | license | comment | |
---|---|---|---|---|---|
Sep 23, 2019 at 15:07 | comment | added | Jacajack | @YvesDaoust It makes sense, and I was thinking about doing something about it, but I also felt somewhat uncomfortable with having the zero test all the way at the end. | |
Sep 23, 2019 at 15:02 | vote | accept | Jacajack | ||
Sep 23, 2019 at 15:00 | comment | added | user1196549 | @Jacajack: ouch, the assignment embedded in the test makes it so unreadable! Personally, I avoid the tests for "early" zeroes because they just add overhead in 99.999% of the cases. | |
Sep 23, 2019 at 14:49 | comment | added | Jacajack | @LutzL I didn't realize how huge a difference that can make. I optimized the code so that it reuses f(x) values from previous iterations, and N=2 is now comparable to all the others. Thank you for pointing that out! | |
Sep 23, 2019 at 14:45 | comment | added | Jacajack | @YvesDaoust It checks if f(x) is 0. It's very unlikely, but it can happen and would result in the root being missed. | |
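A minimal sketch of the split form this exchange converges on: the assignment hoisted out of the condition. The names `f`, `x`, and `fx` follow the question's code; the example function and the surrounding scaffolding are assumptions for illustration.

```c
#include <stdio.h>

/* Example function with a root at t = 2; not the question's f. */
static double f(double t) { return t * t - 4.0; }

/* Equivalent to  if ( ( fx[i + 1] = f( x[i + 1] ) ) == 0 )  but with
   the assignment on its own line. Returns the index of an exact zero
   among the points 1..n-1, or -1 if none hits one. */
static int scan_for_exact_zero(const double x[], double fx[], int n)
{
    for (int i = 0; i + 1 < n; ++i) {
        fx[i + 1] = f(x[i + 1]);   /* assignment, stated separately */
        if (fx[i + 1] == 0.0)      /* then the zero test */
            return i + 1;
    }
    return -1;
}

int main(void)
{
    double x[] = { 0.0, 1.0, 2.0, 3.0 };
    double fx[4];
    printf("exact zero at index %d\n", scan_for_exact_zero(x, fx, 4));
    return 0;
}
```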
Sep 23, 2019 at 10:29 | comment | added | Lutz Lehmann | In the inner loop you are recomputing `fa == f(x[0])` and possibly `fb == f(x[NSECT])`. This has a massive influence on the runtime in the bisection and trisection cases, not so much in the higher divisions. | |
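A sketch of the reuse Jacajack describes above (one call to `f` per step, previously computed values kept), shown for the bisection case N = 2 with an example function of my own; the question's code generalizes this to a table `fx[0..NSECT]`.

```c
#include <stdio.h>

/* Example function with a root at sqrt(2); not the question's f. */
static double f(double t) { return t * t - 2.0; }

/* Bisection that never re-evaluates an endpoint: fa is computed once
   before the loop, and the midpoint value fm is reused whenever the
   midpoint becomes the new left endpoint. */
static double bisect(double a, double b, double eps)
{
    double fa = f(a);
    while (b - a > eps) {
        double m  = a + 0.5 * (b - a);
        double fm = f(m);               /* the only evaluation per step */
        if (fm == 0.0)
            return m;                   /* exact hit on the root */
        if ((fa < 0.0) != (fm < 0.0))
            b = m;                      /* root in [a, m]; fa still valid */
        else {
            a = m;
            fa = fm;                    /* reuse fm instead of calling f(a) */
        }
    }
    return a + 0.5 * (b - a);
}

int main(void)
{
    printf("%.9f\n", bisect(0.0, 2.0, 1e-9));  /* about 1.414213562 */
    return 0;
}
```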
Sep 23, 2019 at 9:56 | comment | added | user1196549 | Why this: `if ( ( fx[i + 1] = f( x[i + 1] ) ) == 0 )`? | |
Sep 23, 2019 at 9:46 | answer | added | user1196549 | timeline score: 3 | |
Sep 22, 2019 at 13:54 | comment | added | conditionalMethod | There is nothing else to explain. Run all your tests with a function that has its only root close enough to the right end of the interval. That is where those algorithms will do worst. For that case we have already given you the counting of the number of iterations. Everything else is just bad methodology, and poor understanding. | |
Sep 21, 2019 at 20:46 | comment | added | Jacajack | I ran the test once again, this time with `-ffast-math` and `-march=native`: imgur.com/a/othkCXk. The times are much better now, especially thanks to `-ffast-math`. The shape of the curve, however, remains the same. I'd even say that the gap between N=2 and N=3 is more visible now. The `(N-1)/log N` and `N/log N` functions don't seem to explain that, because they don't change that abruptly between 2 and 3. Do you see any mathematical or technical reasons for this behavior? | |
Sep 21, 2019 at 13:31 | comment | added | conditionalMethod | The result is rather trivial, actually, in terms of complexity analysis. It is just the product of the lengths of the two nested loops. The rest is just understanding what writing a number in base N is. The `while` runs for as many iterations as there are base-N digits that the tolerance `eps_x` allows to be resolved, and the `for` takes (proportionally) N iterations in the worst case. | |
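Spelling that product out (a sketch using this comment's names; b - a denotes the initial interval length, which the thread leaves implicit):

```latex
% Each pass of the while shrinks the bracket by a factor of N, so
k = \left\lceil \frac{\ln\bigl((b-a)/\varepsilon_x\bigr)}{\ln N} \right\rceil
\quad \text{passes reach the tolerance } \varepsilon_x,
% and each pass costs at most N inner iterations, giving
\text{total work} \propto k \cdot N \approx \frac{N}{\ln N}\,\ln\frac{b-a}{\varepsilon_x}.
```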
Sep 21, 2019 at 13:00 | comment | added | conditionalMethod | ... digits ln(n)/ln(N) times N. The n is fixed. Now minimize N/ln(N). | |
Sep 21, 2019 at 12:59 | comment | added | conditionalMethod | @PeterCordes The time that this algorithm will take to locate a random number to a given precision is proportional to the number of iterations of the `while` times the average time that each `for` loop takes to locate the subinterval with the sign change. The precision determines a certain maximum number n of digits in base N that are representable. Each pass of the `for` determines one of those digits. At worst, each pass takes a number of iterations proportional to N to locate the digit in base N. Therefore, the algorithm takes a time proportional to the number of ... | |
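Carrying out the minimization of N/ln N over the integers (a worked step, not stated explicitly in the thread):

```latex
% Over real N > 1:
\frac{d}{dN}\left(\frac{N}{\ln N}\right) = \frac{\ln N - 1}{(\ln N)^2} = 0
\;\Longrightarrow\; N = e \approx 2.718.
% Checking the neighbouring integers:
\frac{2}{\ln 2} \approx 2.885, \qquad \frac{3}{\ln 3} \approx 2.731, \qquad \frac{4}{\ln 4} \approx 2.885,
% so the integer minimum is attained at N = 3.
```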
Sep 21, 2019 at 12:55 | comment | added | Lutz Lehmann | ... The exception is the N=3 case, where you have to compare `2/log(3)=1.820478` to the bisections `1/log(2)=2/log(4)=1.442695` (in contrast to the `N/log(N)` formula, where trisection `3/log(3)=6/log(9)` is smaller than bisection `2/log(2)=6/log(8)`). | |
Sep 21, 2019 at 12:53 | comment | added | Lutz Lehmann | The overall results still seem strange. If you do compute the full function table (which you do not), then (N-1) evaluations result in a reduction of the interval by a factor of N, so that getting to a given final interval length requires `C*(N-1)/log(N)` function evaluations. However, if you evaluate the function table in the manner of a binary search, only computing the function values you actually need, you should regain asymptotically the characteristics of the bisection method. ... | |
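The two cost models in this pair of comments, written out per digit of accuracy gained (a sketch of the arithmetic, not new material):

```latex
% Full table: N - 1 evaluations buy a factor-N reduction,
\frac{N-1}{\ln N}:\qquad \frac{1}{\ln 2} \approx 1.443,\quad \frac{2}{\ln 3} \approx 1.820,\quad \frac{3}{\ln 4} \approx 2.164.
% Binary search over the subintervals: about \lceil \log_2 N \rceil evaluations
% per factor-N reduction, i.e. 1/\ln 2 whenever N is a power of two
% (bisection again), while N = 3 still costs 2/\ln 3 \approx 1.820.
```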
Sep 21, 2019 at 8:29 | comment | added | Peter Cordes | `clang -O3` does enable auto-vectorization. But yeah, `-ffast-math` may help. Also worth using `-march=native` to let it use AVX, FMA, and everything else your CPU supports. FMA is great for Horner's rule. | |
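A minimal illustration of the Horner's-rule point (an example polynomial of my own, not the question's); with `-march=native` or `-mfma`, clang can contract each multiply-add below into a single FMA instruction:

```c
#include <stdio.h>

/* Horner's rule: c[0] + c[1]*x + ... + c[n]*x^n with one multiply-add
   per coefficient, exactly the a*b + c shape that fuses into an FMA. */
static double horner(const double c[], int degree, double x)
{
    double acc = c[degree];
    for (int i = degree - 1; i >= 0; --i)
        acc = acc * x + c[i];      /* candidate for FMA contraction */
    return acc;
}

int main(void)
{
    /* p(x) = x^3 - x - 2 */
    const double c[] = { -2.0, -1.0, 0.0, 1.0 };
    printf("p(1.5) = %f\n", horner(c, 3, 1.5));  /* prints -0.125 */
    return 0;
}
```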
Sep 21, 2019 at 8:03 | history | edited | Jacajack | CC BY-SA 4.0 | make compilation command more visible
Sep 21, 2019 at 7:54 | comment | added | Jacajack | @PeterCordes I specified the build command in the comment at the top of the source code. It's not really visible there - sorry for that. I used the clang compiler and `-O3` optimization. I didn't, however, try using `-ffast-math` or enabling vectorization. My CPU is an i7-8700K. | |
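For concreteness, a plausible build line combining the flags discussed in this thread; the file name nsect.c is hypothetical, and the OP's actual command appears only in their source comment:

```
clang -O3 -ffast-math -march=native nsect.c -o nsect -lm
```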
Sep 21, 2019 at 7:36 | comment | added | Peter Cordes | If compiling for x86 or other ISAs with SIMD, using NSECT=5 would allow easy vectorization, evaluating the polynomial for 4 elements in parallel, but IDK how easily a compiler would spot that from this source code. Probably not, because of the conditional branch inside the loop. | |
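One restructuring that sidesteps that branch, sketched with the thread's names (`NSECT`, `x[]`, `fx[]`); this split into two loops is my illustration, not the OP's code, and it assumes the endpoint values `fx[0]` and `fx[NSECT]` are already filled in:

```c
#define NSECT 5   /* 4 interior points: a convenient width for SIMD */

/* Example function; the question's polynomial is not shown here. */
static double f(double t) { return t * t * t - t - 2.0; }

/* The evaluation loop is branch-free and auto-vectorizable; the
   conditional scan for the sign change runs separately afterwards. */
int find_bracket(const double x[NSECT + 1], double fx[NSECT + 1])
{
    for (int i = 1; i < NSECT; ++i)     /* vectorizable evaluations */
        fx[i] = f(x[i]);

    for (int i = 0; i < NSECT; ++i)     /* scalar scan with the branch */
        if ((fx[i] < 0.0) != (fx[i + 1] < 0.0))
            return i;                   /* sign change in [x[i], x[i+1]] */
    return -1;                          /* no sign change found */
}
```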
Sep 21, 2019 at 7:36 | comment | added | Peter Cordes | And here we can increase the instruction-level parallelism by using a larger N, although we'd possibly bottleneck on division throughput (not fully pipelined) for non-power-of-2 NSECT, because the code uses `... / NSECT` instead of `... * (1.0f / NSECT)`. The OP doesn't mention using `-ffast-math`, which would allow the compiler to use a reciprocal even if it's not exactly representable. In fact they don't mention any optimization options at all... Or what compiler + hardware they tested on. | |
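A sketch of the two loop shapes being contrasted (my example loop, not the OP's; the point is the `/ NSECT` versus `* (1.0f / NSECT)` forms):

```c
#define NSECT 5

/* Division inside the loop: the divider is not fully pipelined on
   most x86 cores, so throughput can bottleneck here. */
void fill_points_div(float x[NSECT + 1], float a, float b)
{
    for (int i = 0; i <= NSECT; ++i)
        x[i] = a + (b - a) * i / NSECT;
}

/* Multiplying by a precomputed reciprocal stays on the fully
   pipelined multiply units. The compiler only makes this substitution
   itself under -ffast-math when 1.0f/NSECT is inexact, e.g. NSECT = 5. */
void fill_points_mul(float x[NSECT + 1], float a, float b)
{
    const float inv_nsect = 1.0f / NSECT;   /* hoisted, computed once */
    for (int i = 0; i <= NSECT; ++i)
        x[i] = a + (b - a) * i * inv_nsect;
}
```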
Sep 21, 2019 at 7:29 | comment | added | Peter Cordes | @conditionalMethod: That "economy" metric (radix * length of digit-string needed) seems made up in a way that just happens to match the time complexity of this task. Refining the accuracy of a root involves calculations with a serial dependency, but the analogy to div / rem to produce digits breaks down, because integer division doesn't take longer with higher divisors; it takes near-constant time with any non-power-of-2 divisor, or much less with any power of 2. (On real HW, anyway.) | |
Sep 20, 2019 at 19:30 | comment | added | Jacajack | @LutzL I'm aware that better methods exist. I was just curious about the influence of the N value on the performance and was just playing around until I couldn't really explain the N=2 case. Then I decided to ask a question here :) | |
Sep 20, 2019 at 19:22 | comment | added | Lutz Lehmann | A bisection algorithm is structurally very simple, just like bubblesort among the sort algorithms. For a trisection implementation you need at least two branchings to select the next interval. If you are that concerned about the performance of bisection, a more striking improvement is obtained by selecting the dividing point from the function values, e.g. as the secant root, as in regula falsi and its variants, then Dekker's method and the widely used Brent's method. | |
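A minimal sketch of that secant-based point selection (the interpolation step of regula falsi only; Dekker's and Brent's methods add safeguards around it):

```c
/* Root of the secant line through (a, fa) and (b, fb), used in place
   of the plain midpoint. Assumes fa and fb have opposite signs, so
   the denominator cannot be zero. */
double secant_point(double a, double fa, double b, double fb)
{
    return (a * fb - b * fa) / (fb - fa);
}
```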
Sep 20, 2019 at 19:15 | comment | added | Jacajack | @conditionalMethod That's actually a pretty interesting fact. Thank you | |
Sep 20, 2019 at 19:11 | comment | added | conditionalMethod | These root solvers are essentially, after averaging over many random functions and scaling the initial interval to length 1, listing the digits, in base N, of the number to which they are converging (the root). The most efficient numeral system in this sense is base 3. | |
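The efficiency notion behind that claim is radix economy; a worked instance with numbers of my own choosing:

```latex
% Representing numbers up to M in base N costs (digits) x (symbols per digit):
E(N, M) = N \cdot \lceil \log_N M \rceil.
% For M = 10^6:
E(2) = 2 \cdot 20 = 40, \qquad E(3) = 3 \cdot 13 = 39, \qquad E(4) = 4 \cdot 10 = 40,
% so base 3 is narrowly the most economical integer base, tracking N/\ln N.
```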
Sep 20, 2019 at 19:06 | history | edited | Jacajack | CC BY-SA 4.0 | fix double word
Sep 20, 2019 at 18:59 | history | asked | Jacajack | CC BY-SA 4.0 |