Achieving a `CPE` (cycles per element) of less than 1.00 for scalar inner product procedures on an `Intel Core i7 4790` (Haswell) processor is challenging due to fundamental architectural constraints. In scalar execution, each instruction processes only one data element, and the running sum creates a loop-carried dependency through the accumulator: each multiply-add must wait for the previous one to finish, so the floating-point operation latency puts a hard lower bound on the CPE. Loop unrolling alone reduces loop overhead, but it does not break that dependency chain in scalar code.
`Haswell` supports `SIMD` (`AVX2`), allowing multiple elements to be processed per instruction. To achieve lower `CPE`, you need to leverage vectorization. Compiler flags (`-O3 -march=native`) can help, but for more control, manual use of `SIMD` intrinsics may be necessary. Additionally, optimizing memory access and reducing `cache` misses can further improve performance.
In short, achieving a `CPE` below 1.00 requires moving beyond scalar code to vectorized code that fully utilizes the processor's `SIMD` units.
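To make the intrinsics route concrete, here is a minimal sketch of an `AVX2`/FMA inner product. The function name `dot_avx2` and the four-accumulator unroll factor are my own choices for illustration, not from any library; the point of the independent accumulators is to break the loop-carried FMA dependency so several fused multiply-adds can be in flight at once:

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch of an AVX2+FMA dot product. Four independent accumulators
 * hide the FMA latency on Haswell by overlapping dependency chains.
 * The target attribute is GCC/Clang-specific; with other compilers,
 * build with -mavx2 -mfma instead. */
__attribute__((target("avx2,fma")))
float dot_avx2(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    size_t i = 0;
    for (; i + 32 <= n; i += 32) {   /* 4 accumulators x 8 floats */
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i),      acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16),
                               _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24),
                               _mm256_loadu_ps(b + i + 24), acc3);
    }

    /* Combine the four partial sums, then reduce horizontally. */
    acc0 = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                         _mm256_add_ps(acc2, acc3));
    __m128 lo = _mm256_castps256_ps128(acc0);
    __m128 hi = _mm256_extractf128_ps(acc0, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    float sum = _mm_cvtss_f32(lo);

    for (; i < n; i++)               /* scalar tail */
        sum += a[i] * b[i];
    return sum;
}
```

With 32 elements consumed per iteration and multiple FMA chains in flight, throughput rather than latency becomes the limit, which is how CPE drops well below 1.00.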
Yes, I agree that scalar code has inherent limitations, and vectorization seems to be the clear path forward for lowering the CPE. I have already tried the -O3 -march=native flags, and while they help, I have not explored manual AVX2 intrinsics in depth yet. I will dive into that next.
On the memory side, I am curious: do you think *cache alignment* or *prefetching techniques* could significantly impact the CPE, even with SIMD?
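To make the alignment part of my question concrete, this is the kind of thing I mean (`alloc_vec` is just a hypothetical helper; 64 bytes is one Haswell cache line):

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical helper: allocate a float buffer on a 64-byte boundary
 * (one Haswell cache line), so each 32-byte AVX load touches a single
 * line instead of splitting across two. aligned_alloc is C11 and
 * requires the size to be a multiple of the alignment, so round up. */
static float *alloc_vec(size_t n)
{
    size_t bytes = n * sizeof(float);
    bytes = (bytes + 63) & ~(size_t)63;   /* round up to a 64B multiple */
    return aligned_alloc(64, bytes);
}
```

The question then is whether switching aligned loads in for unaligned ones actually moves the CPE on this microarchitecture, or whether the effect is lost in the noise.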
Is there even a way to track how well my cache usage is optimized on the Haswell architecture, perhaps using perf or another tool?
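For context, something along these lines is what I had in mind. The workload here is a stand-in I made up; the real target would be the dot-product benchmark binary, and the exact event names available depend on the kernel and CPU (`perf list` shows them):

```shell
# Sketch only: count cache events for a workload with perf.
command -v perf >/dev/null 2>&1 || { echo "perf not installed"; exit 0; }

# Stand-in workload; substitute the actual benchmark binary here.
cat > /tmp/workload.c <<'EOF'
int main(void) { volatile long s = 0; for (long i = 0; i < 5000000; i++) s += i; return 0; }
EOF
gcc -O2 -o /tmp/workload /tmp/workload.c

# Generic cache-event aliases; may need perf_event access rights to run.
perf stat -e cycles,instructions,cache-references,cache-misses /tmp/workload \
    || echo "perf stat needs perf_event access on this system"
```

What I am unsure about is how far these generic aliases go on Haswell, versus needing the raw L1/LLC events to see where the misses actually land.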