How can I optimize matrix multiplication performance and reduce L3 cache misses in my C++ library?

Posted by Marvee Amasi
4:20 am
21/11/2024

I started a C++ library for efficient matrix operations, with a primary focus on matrix multiplication. The target application is scientific computing, so performance is critical. I implemented a basic matrix class and a matrix multiplication function, and used SSE instructions for optimization. My setup is an Intel Core i7-12700K with 32 GB of DDR4-3200 RAM, and I work in Visual Studio Code with the clang-format extension.
https://github.com/Marveeamasi/image-processing-matrix-multiplier
Even after using SSE instructions, the current matrix multiplication implementation shows significant performance bottlenecks, especially with large matrices. Profiling results indicate high L3 cache miss rates as the primary culprit.

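Matrix Matrix::operator*(const Matrix& other) const {
    // The inner dimensions must agree: (rows_ x cols_) * (other.rows_ x other.cols_).
    if (cols_ != other.rows_) {
        // Report the error to the caller instead of terminating the process
        // (requires <stdexcept>).
        throw std::invalid_argument("Matrix dimensions do not match for multiplication");
    }

    Matrix result(rows_, other.cols_);

    for (int i = 0; i < rows_; ++i) {
        for (int j = 0; j < other.cols_; ++j) {
            double sum = 0.0;
            // The innermost loop walks along row i of *this and down column j of other.
            for (int k = 0; k < cols_; ++k) {
                sum += (*this)(i, k) * other(k, j);
            }
            result(i, j) = sum;
        }
    }

    return result;
}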

I tried to optimize the memory access patterns and loop structure, but the performance gains are still limited. I need help with strategies to improve cache locality, reduce cache misses, and further improve the overall efficiency of the matrix multiplication.
I'm eager to learn about different approaches and best practices for high-performance matrix computations.
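
For readers following along, one approach commonly suggested for this kind of cache problem is loop tiling (blocking) combined with an i-k-j loop order. The sketch below is only an illustration under stated assumptions, not the code from the linked repository: `DenseMatrix`, `multiply_tiled`, and the default `tile` of 64 are hypothetical names and values, and the storage is assumed to be row-major in a single contiguous `std::vector<double>`. With a tile of 64, the three tiles being worked on occupy about 96 KiB of doubles in total; the best value depends on the machine and would need to be measured.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the Matrix class in the post:
// a dense matrix of doubles stored row-major in one contiguous buffer.
struct DenseMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;  // element (r, c) lives at data[r * cols + c]

    DenseMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), data(r * c, 0.0) {}

    double& operator()(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    double operator()(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Tiled (blocked) multiplication. The three outer loops step over tiles,
// the three inner loops multiply one tile of `a` by one tile of `b`.
// `tile` is an assumed tuning parameter, not a measured optimum.
DenseMatrix multiply_tiled(const DenseMatrix& a, const DenseMatrix& b,
                           std::size_t tile = 64) {
    if (a.cols != b.rows) {
        throw std::invalid_argument("dimension mismatch");
    }
    if (tile == 0) {
        throw std::invalid_argument("tile size must be positive");
    }

    DenseMatrix c(a.rows, b.cols);  // zero-initialized accumulator

    for (std::size_t ii = 0; ii < a.rows; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, a.rows);
        for (std::size_t kk = 0; kk < a.cols; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, a.cols);
            for (std::size_t jj = 0; jj < b.cols; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, b.cols);

                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a_ik = a(i, k);  // reused across the whole j loop
                        // Innermost loop reads b and writes c along rows,
                        // i.e. through contiguous memory in row-major storage.
                        for (std::size_t j = jj; j < j_end; ++j) {
                            c(i, j) += a_ik * b(k, j);
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

Calling `multiply_tiled(a, b)` uses the default tile size, and `multiply_tiled(a, b, 32)` tries a smaller one. The i-k-j order inside each tile keeps the innermost accesses contiguous, which also gives the compiler a better chance to vectorize that loop. Whether this actually lowers the L3 miss rate on a given machine is something only profiling can confirm.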