
Some HPC notes collected from the web

Published: 2015-12-08 09:00:28

Source: https://computing.llnl.gov

 

Factors that determine a large-scale program's performance

  * Application-related factors:
      * Algorithms
      * Dataset size
      * Memory usage pattern
      * Use of I/O
      * Communication patterns
      * Task granularity
      * Load balancing
      * Amdahl's Law

  * Hardware factors:
      * Processor architecture
      * Memory hierarchy
      * I/O configuration
      * Network

  * Software factors:
      * OS
      * Compiler
      * Preprocessor
      * Communication protocols
      * Libraries
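
Amdahl's Law in the list above puts a ceiling on speedup: if a fraction p of the run time can be parallelized over N processors, the speedup is at most 1 / ((1 - p) + p / N). A minimal sketch follows; the function name `amdahl_speedup` and the example fraction 0.95 are illustrative assumptions, not values from the source.

```python
def amdahl_speedup(parallel_fraction: float, n_procs: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the run time scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; only the parallel part is divided by n_procs.
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the run time is parallelizable.
    for n in (8, 64, 1024):
        print(f"{n:5d} processors: speedup <= {amdahl_speedup(0.95, n):.2f}")
```

With a 5% serial part the bound can never reach 20x, however many processors are added, which is why the serial fraction sits next to load balancing and granularity among the application factors.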

 

Performance analysis tools:

  Timers, profilers, system statistics, memory tools
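
The simplest of these tools is a wall-clock timer wrapped around a region of code. The sketch below uses only the Python standard library; the helper `timed` and the toy workload `sum_of_squares` are hypothetical names introduced for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Accumulate the wall-clock time of the enclosed block into results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[label] = results.get(label, 0.0) + elapsed


def sum_of_squares(n):
    """Toy workload standing in for a real compute kernel."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    timings = {}
    with timed("sum_of_squares", timings):
        sum_of_squares(1_000_000)
    for label, seconds in timings.items():
        print(f"{label}: {seconds:.4f} s")
```

Manual timers like this only show how long a region took; a profiler (for example Python's built-in `cProfile`) is the usual next step for finding where the time goes inside it.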

 

Some notes on hardware architecture:

Intel Xeon 5500/5600

  4-core / 6-core
  2.4 / 2.8 GHz
  Cache
    L1 data: 32 KB, private
    L1 instruction: 32 KB, private
    L2: 256 KB, private
    L3: 8 MB / 12 MB, shared
  CPU-memory bandwidth: 32 GB/s

 

Intel Xeon E5-2670

  8-core, 2.6 GHz
  Cache
    L1 data: 32 KB, private
    L1 instruction: 32 KB, private
    L2: 256 KB, private
    L3: 20 MB, shared
  CPU-memory bandwidth: 51.2 GB/s

 

AMD processors

  2.2 GHz
  Cache
    L1 data: 64 KB (2-way)
    L1 instruction: 64 KB (2-way)
    L2: 512 KB, private
    L3: 2 MB, shared

  Direct Connect Architecture
    CPU-memory bandwidth: 10.7 GB/s per socket (Socket F)
    Socket-to-socket link bandwidth: 8 GB/s (2-way)

 

  4x InfiniBand interconnect
    * SDR: 1.25 GB/s
    * DDR: 2.5 GB/s
    * QDR: 5 GB/s
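
To get a feel for these numbers, it helps to turn them into transfer times. The sketch below assumes the figures listed above are peak rates in gigabytes per second (1 GB = 10^9 bytes) and ignores latency and protocol overhead, so the results are lower bounds rather than measurements. The function name `ideal_transfer_seconds` and the 8 GB payload are illustrative assumptions.

```python
def ideal_transfer_seconds(n_bytes: float, gb_per_s: float) -> float:
    """Lower bound on the time to move n_bytes at a peak rate of gb_per_s GB/s."""
    if gb_per_s <= 0:
        raise ValueError("gb_per_s must be positive")
    return n_bytes / (gb_per_s * 1e9)


if __name__ == "__main__":
    payload = 8e9  # hypothetical 8 GB dataset
    # Peak rates taken from the notes above, read as GB/s (an assumption).
    links = {
        "Xeon E5-2670 CPU-memory": 51.2,
        "InfiniBand 4x QDR": 5.0,
        "InfiniBand 4x DDR": 2.5,
        "InfiniBand 4x SDR": 1.25,
    }
    for name, rate in links.items():
        seconds = ideal_transfer_seconds(payload, rate)
        print(f"{name:26s} {seconds:6.2f} s")
```

Even at peak rates, moving data across the interconnect is roughly an order of magnitude slower than streaming it from local memory, which is why communication patterns appear among the application factors.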

 

Some notes on NUMA

  - Physical layout: each node has several (2-4) sockets, and each socket has several (4-8) CPU cores. Cores on the same socket share the L3 cache; socket-to-socket communication goes through the CPU-memory bus and is roughly 2x to 5x slower.

  - Design considerations: CPU affinity (numactl --cpunodebind), a local memory policy, and other compiler/runtime options (mpirun --bind-to-socket -bynode).
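
CPU affinity can also be set from inside a program. The sketch below uses `os.sched_getaffinity` and `os.sched_setaffinity` from the Python standard library, which are available on Linux only. The helper `pin_to_cpus` is a hypothetical name, and treating the first half of the visible CPUs as one socket is purely an assumption for illustration; the real core-to-socket mapping should be read from the system (for example with `numactl --hardware`).

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to `cpus`; return the resulting affinity set."""
    allowed = os.sched_getaffinity(0)  # CPUs this process may currently use
    target = set(cpus) & allowed
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(0, target)    # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    available = sorted(os.sched_getaffinity(0))
    # Assumption: pretend the first half of the visible CPUs sit on one socket.
    one_socket = available[: max(1, len(available) // 2)]
    print("pinned to:", sorted(pin_to_cpus(one_socket)))
```

This covers only the CPU side. It does not set a memory policy, so placing memory next to the pinned cores is still left to tools such as numactl.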

 

Finally, and most importantly: start from a good algorithm.

 

