Timeline for Performance of n-section root finding algorithm
Current License: CC BY-SA 4.0
27 events
when | what | by | license | comment | |
---|---|---|---|---|---|
Sep 23, 2019 at 15:07 | comment | added | Jacajack | @YvesDaoust It makes sense, and I was thinking about doing something about it, but I also felt somewhat uncomfortable with having the zero test all the way at the end. | |
Sep 23, 2019 at 15:02 | vote | accept | Jacajack | ||
Sep 23, 2019 at 15:00 | comment | added | user1196549 | @Jacajack: ouch, the assignment embedded in the test makes it so unreadable! Personally, I avoid the tests for "early" zeroes because they just add overhead in 99.999% of the cases. | |
Sep 23, 2019 at 14:49 | comment | added | Jacajack | @LutzL I didn't realize how huge a difference that can make. I optimized the code so that it reuses f(x) values from previous iterations, and N=2 is now comparable to all the others. Thank you for pointing that out! | |
Sep 23, 2019 at 14:45 | comment | added | Jacajack | @YvesDaoust It checks if f(x) is 0. It's very unlikely, but it can happen and would result in the root being missed. | |
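A minimal sketch of the split form this exchange converges on: the assignment hoisted out of the condition. The names `f`, `x`, and `fx` follow the question's code; the example function and the surrounding scaffolding are assumptions for illustration.

```c
#include <stdio.h>

/* Example function with a root at t = 2; not the question's f. */
static double f(double t) { return t * t - 4.0; }

/* Equivalent to  if ( ( fx[i + 1] = f( x[i + 1] ) ) == 0 )  but with
   the assignment on its own line. Returns the index of an exact zero
   among the points 1..n-1, or -1 if none hits one. */
static int scan_for_exact_zero(const double x[], double fx[], int n)
{
    for (int i = 0; i + 1 < n; ++i) {
        fx[i + 1] = f(x[i + 1]);   /* assignment, stated separately */
        if (fx[i + 1] == 0.0)      /* then the zero test */
            return i + 1;
    }
    return -1;
}

int main(void)
{
    double x[] = { 0.0, 1.0, 2.0, 3.0 };
    double fx[4];
    printf("exact zero at index %d\n", scan_for_exact_zero(x, fx, 4));
    return 0;
}
```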
Sep 23, 2019 at 10:29 | comment | added | Lutz Lehmann | In the inner loop you are recomputing `fa == f(x[0])` and possibly `fb == f(x[NSECT])`. This has a massive influence on the runtime in the bisection and trisection cases, not so much in the higher divisions. | |
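A sketch of the reuse Jacajack describes above (one call to `f` per step, previously computed values kept), shown for the bisection case N = 2 with an example function of my own; the question's code generalizes this to a table `fx[0..NSECT]`.

```c
#include <stdio.h>

/* Example function with a root at sqrt(2); not the question's f. */
static double f(double t) { return t * t - 2.0; }

/* Bisection that never re-evaluates an endpoint: fa is computed once
   before the loop, and the midpoint value fm is reused whenever the
   midpoint becomes the new left endpoint. */
static double bisect(double a, double b, double eps)
{
    double fa = f(a);
    while (b - a > eps) {
        double m  = a + 0.5 * (b - a);
        double fm = f(m);               /* the only evaluation per step */
        if (fm == 0.0)
            return m;                   /* exact hit on the root */
        if ((fa < 0.0) != (fm < 0.0))
            b = m;                      /* root in [a, m]; fa still valid */
        else {
            a = m;
            fa = fm;                    /* reuse fm instead of calling f(a) */
        }
    }
    return a + 0.5 * (b - a);
}

int main(void)
{
    printf("%.9f\n", bisect(0.0, 2.0, 1e-9));  /* about 1.414213562 */
    return 0;
}
```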
Sep 23, 2019 at 9:56 | comment | added | user1196549 | Why this: `if ( ( fx[i + 1] = f( x[i + 1] ) ) == 0 )`? | |
Sep 23, 2019 at 9:46 | answer | added | user1196549 | timeline score: 3 | |
Sep 22, 2019 at 13:54 | comment | added | conditionalMethod | There is nothing else to explain. Run all your tests with a function that has its only root close enough to the right end of the interval. That is where those algorithms will do worst. For that case we have already given you the counting of the number of iterations. Everything else is just bad methodology, and poor understanding. | |
Sep 21, 2019 at 20:46 | comment | added | Jacajack | I ran the test once again, this time with `-ffast-math` and `-march=native`: imgur.com/a/othkCXk. The times are much better now, especially thanks to `-ffast-math`. The shape of the curve, however, remains the same. I'd even say that the gap between N=2 and N=3 is more visible now. The `(N-1)/log N` and `N/log N` functions don't seem to explain that, because they don't change that abruptly between 2 and 3. Do you see any mathematical or technical reasons for this behavior? | |
Sep 21, 2019 at 13:31 | comment | added | conditionalMethod | The result is rather trivial, actually, in terms of complexity analysis. It is just the product of the lengths of the two nested loops. The rest is just understanding what writing a number in base N is. The `while` runs for as many iterations as there are base-N digits that the tolerance `eps_x` allows to be resolved, and the `for` takes (proportionally) N iterations in the worst case. | |
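Spelling that product out (a sketch using this comment's names; b - a denotes the initial interval length, which the thread leaves implicit):

```latex
% Each pass of the while shrinks the bracket by a factor of N, so
k = \left\lceil \frac{\ln\bigl((b-a)/\varepsilon_x\bigr)}{\ln N} \right\rceil
\quad \text{passes reach the tolerance } \varepsilon_x,
% and each pass costs at most N inner iterations, giving
\text{total work} \propto k \cdot N \approx \frac{N}{\ln N}\,\ln\frac{b-a}{\varepsilon_x}.
```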
Sep 21, 2019 at 13:00 | comment | added | conditionalMethod | ... digits ln(n)/ln(N) times N. The n is fixed. Now minimize N/ln(N). | |
Sep 21, 2019 at 12:59 | comment | added | conditionalMethod | @PeterCordes The time that this algorithm will take to locate a random number to a given precision is proportional to the number of iterations of the `while` times the average time that each `for` loop takes to locate the subinterval with the sign change. The precision determines a certain maximum number n of digits in base N that are representable. Each pass of the `for` determines one of those digits. At worst, each pass takes a number of iterations proportional to N to locate the digit in base N. Therefore, the algorithm takes a time proportional to the number of ... | |
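Carrying out the minimization of N/ln N over the integers (a worked step, not stated explicitly in the thread):

```latex
% Over real N > 1:
\frac{d}{dN}\left(\frac{N}{\ln N}\right) = \frac{\ln N - 1}{(\ln N)^2} = 0
\;\Longrightarrow\; N = e \approx 2.718.
% Checking the neighbouring integers:
\frac{2}{\ln 2} \approx 2.885, \qquad \frac{3}{\ln 3} \approx 2.731, \qquad \frac{4}{\ln 4} \approx 2.885,
% so the integer minimum is attained at N = 3.
```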
Sep 21, 2019 at 12:55 | comment | added | Lutz Lehmann | ... The exception is the N=3 case, where you have to compare `2/log(3)=1.820478` to the bisections `1/log(2)=2/log(4)=1.442695` (in contrast to the `N/log(N)` formula, where trisection `3/log(3)=6/log(9)` is smaller than bisection `2/log(2)=6/log(8)`). | |
Sep 21, 2019 at 12:53 | comment | added | Lutz Lehmann | The overall results still seem strange. If you do compute the full function table (which you do not), then (N-1) evaluations result in a reduction of the interval by a factor of N, so that getting to a given final interval length requires `C*(N-1)/log(N)` function evaluations. However, if you evaluate the function table in the manner of a binary search, only computing the function values you actually need, you should regain asymptotically the characteristics of the bisection method. ... | |
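The two cost models in this pair of comments, written out per digit of accuracy gained (a sketch of the arithmetic, not new material):

```latex
% Full table: N - 1 evaluations buy a factor-N reduction,
\frac{N-1}{\ln N}:\qquad \frac{1}{\ln 2} \approx 1.443,\quad \frac{2}{\ln 3} \approx 1.820,\quad \frac{3}{\ln 4} \approx 2.164.
% Binary search over the subintervals: about \lceil \log_2 N \rceil evaluations
% per factor-N reduction, i.e. 1/\ln 2 whenever N is a power of two
% (bisection again), while N = 3 still costs 2/\ln 3 \approx 1.820.
```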
Sep 21, 2019 at 8:29 | comment | added | Peter Cordes | `clang -O3` does enable auto-vectorization. But yeah, `-ffast-math` may help. Also worth using `-march=native` to let it use AVX, FMA, and everything else your CPU supports. FMA is great for Horner's rule. | |
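A minimal illustration of the Horner's-rule point (an example polynomial of my own, not the question's); with `-march=native` or `-mfma`, clang can contract each multiply-add below into a single FMA instruction:

```c
#include <stdio.h>

/* Horner's rule: c[0] + c[1]*x + ... + c[n]*x^n with one multiply-add
   per coefficient, exactly the a*b + c shape that fuses into an FMA. */
static double horner(const double c[], int degree, double x)
{
    double acc = c[degree];
    for (int i = degree - 1; i >= 0; --i)
        acc = acc * x + c[i];      /* candidate for FMA contraction */
    return acc;
}

int main(void)
{
    /* p(x) = x^3 - x - 2 */
    const double c[] = { -2.0, -1.0, 0.0, 1.0 };
    printf("p(1.5) = %f\n", horner(c, 3, 1.5));  /* prints -0.125 */
    return 0;
}
```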
Sep 21, 2019 at 8:03 | history | edited | Jacajack | CC BY-SA 4.0 | make compilation command more visible
Sep 21, 2019 at 7:54 | comment | added | Jacajack | @PeterCordes I specified the build command in the comment at the top of the source code. It's not really visible there - sorry for that. I used the clang compiler and `-O3` optimization. I didn't, however, try using `-ffast-math` or enabling vectorization. My CPU is an i7-8700K. | |
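For concreteness, a plausible build line combining the flags discussed in this thread; the file name nsect.c is hypothetical, and the OP's actual command appears only in their source comment:

```
clang -O3 -ffast-math -march=native nsect.c -o nsect -lm
```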
Sep 21, 2019 at 7:36 | comment | added | Peter Cordes | If compiling for x86 or other ISAs with SIMD, using NSECT=5 would allow easy vectorization, evaluating the polynomial for 4 elements in parallel, but IDK how easily a compiler would spot that from this source code. Probably not, because of the conditional branch inside the loop. | |
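One restructuring that sidesteps that branch, sketched with the thread's names (`NSECT`, `x[]`, `fx[]`); this split into two loops is my illustration, not the OP's code, and it assumes the endpoint values `fx[0]` and `fx[NSECT]` are already filled in:

```c
#define NSECT 5   /* 4 interior points: a convenient width for SIMD */

/* Example function; the question's polynomial is not shown here. */
static double f(double t) { return t * t * t - t - 2.0; }

/* The evaluation loop is branch-free and auto-vectorizable; the
   conditional scan for the sign change runs separately afterwards. */
int find_bracket(const double x[NSECT + 1], double fx[NSECT + 1])
{
    for (int i = 1; i < NSECT; ++i)     /* vectorizable evaluations */
        fx[i] = f(x[i]);

    for (int i = 0; i < NSECT; ++i)     /* scalar scan with the branch */
        if ((fx[i] < 0.0) != (fx[i + 1] < 0.0))
            return i;                   /* sign change in [x[i], x[i+1]] */
    return -1;                          /* no sign change found */
}
```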
Sep 21, 2019 at 7:36 | comment | added | Peter Cordes | And here we can increase the instruction-level parallelism by using a larger N, although we'd possibly bottleneck on division throughput (not fully pipelined) for non-power-of-2 NSECT, because the code uses `... / NSECT` instead of `... * (1.0f / NSECT)`. The OP doesn't mention using `-ffast-math`, which would allow the compiler to use a reciprocal even if it's not exactly representable. In fact they don't mention any optimization options at all... Or what compiler + hardware they tested on. | |
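A sketch of the two loop shapes being contrasted (my example loop, not the OP's; the point is the `/ NSECT` versus `* (1.0f / NSECT)` forms):

```c
#define NSECT 5

/* Division inside the loop: the divider is not fully pipelined on
   most x86 cores, so throughput can bottleneck here. */
void fill_points_div(float x[NSECT + 1], float a, float b)
{
    for (int i = 0; i <= NSECT; ++i)
        x[i] = a + (b - a) * i / NSECT;
}

/* Multiplying by a precomputed reciprocal stays on the fully
   pipelined multiply units. The compiler only makes this substitution
   itself under -ffast-math when 1.0f/NSECT is inexact, e.g. NSECT = 5. */
void fill_points_mul(float x[NSECT + 1], float a, float b)
{
    const float inv_nsect = 1.0f / NSECT;   /* hoisted, computed once */
    for (int i = 0; i <= NSECT; ++i)
        x[i] = a + (b - a) * i * inv_nsect;
}
```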
Sep 21, 2019 at 7:29 | comment | added | Peter Cordes | @conditionalMethod: That "economy" metric (radix * length of digit-string needed) seems made up in a way that just happens to match the time complexity of this task. Refining the accuracy of a root involves calculations with a serial dependency, but the analogy to div / rem to produce digits breaks down, because integer division doesn't take longer with higher divisors; it takes near-constant time with any non-power-of-2 divisor, or much less with any power of 2. (On real HW, anyway.) | |
Sep 20, 2019 at 19:30 | comment | added | Jacajack | @LutzL I'm aware that better methods exist. I was just curious about the influence of the N value on the performance and was just playing around until I couldn't really explain the N=2 case. Then I decided to ask a question here :) | |
Sep 20, 2019 at 19:22 | comment | added | Lutz Lehmann | A bisection algorithm is structurally very simple, just like bubblesort among the sort algorithms. For a trisection implementation you need at least two branchings to select the next interval. If you are that concerned about the performance of bisection, a more striking improvement is obtained by selecting the dividing point from the function values, e.g. as the secant root, as in regula falsi and its variants, then Dekker's method and the widely used Brent's method. | |
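A minimal sketch of that secant-based point selection (the interpolation step of regula falsi only; Dekker's and Brent's methods add safeguards around it):

```c
/* Root of the secant line through (a, fa) and (b, fb), used in place
   of the plain midpoint. Assumes fa and fb have opposite signs, so
   the denominator cannot be zero. */
double secant_point(double a, double fa, double b, double fb)
{
    return (a * fb - b * fa) / (fb - fa);
}
```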
Sep 20, 2019 at 19:15 | comment | added | Jacajack | @conditionalMethod That's actually a pretty interesting fact. Thank you | |
Sep 20, 2019 at 19:11 | comment | added | conditionalMethod | These root solvers are essentially, after averaging over many random functions and scaling the initial interval to length 1, listing the digits, in base N, of the number to which they are converging (the root). The most efficient numeral system in this sense is base 3. | |
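The efficiency notion behind that claim is radix economy; a worked instance with numbers of my own choosing:

```latex
% Representing numbers up to M in base N costs (digits) x (symbols per digit):
E(N, M) = N \cdot \lceil \log_N M \rceil.
% For M = 10^6:
E(2) = 2 \cdot 20 = 40, \qquad E(3) = 3 \cdot 13 = 39, \qquad E(4) = 4 \cdot 10 = 40,
% so base 3 is narrowly the most economical integer base, tracking N/\ln N.
```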
Sep 20, 2019 at 19:06 | history | edited | Jacajack | CC BY-SA 4.0 | fix double word
Sep 20, 2019 at 18:59 | history | asked | Jacajack | CC BY-SA 4.0 |