Heterogeneous computing: Difference between revisions


Latest revision as of 19:54, 16 September 2024

Heterogeneous computing refers to systems that use more than one kind of processor or core. These systems gain performance or energy efficiency not just by adding the same type of processors, but by adding dissimilar coprocessors, usually incorporating specialized processing capabilities to handle particular tasks.^[1]

Heterogeneity

In computing, heterogeneity usually refers to different instruction-set architectures (ISAs): the main processor uses one ISA and the other processors use another, usually very different, architecture (possibly more than one), rather than merely a different microarchitecture. Floating-point number processing is a special case of this but is not usually referred to as heterogeneous.

In the past, heterogeneous computing meant that different ISAs had to be handled differently. In a modern example, Heterogeneous System Architecture (HSA) systems^[2] eliminate this difference for the user while using multiple processor types (typically CPUs and GPUs), usually on the same integrated circuit, to provide the best of both worlds: the GPU provides general-purpose processing (besides its well-known 3D graphics rendering capabilities, it can perform mathematically intensive computations on very large data sets), while the CPU runs the operating system and performs traditional serial tasks.

The level of heterogeneity in modern computing systems is gradually increasing as further scaling of fabrication technologies allows formerly discrete components to become integrated parts of a system-on-chip (SoC).^{[citation needed]} For example, many processors now include built-in logic for interfacing with other devices (SATA, PCI, Ethernet, USB, RFID, radios, UARTs, and memory controllers), as well as programmable functional units and hardware accelerators (GPUs, cryptography co-processors, programmable network processors, A/V encoders/decoders, etc.).

A 2014 study found that a heterogeneous-ISA chip multiprocessor that exploits the diversity offered by multiple ISAs can outperform the best same-ISA homogeneous architecture by as much as 21%, with 23% energy savings and a 32% reduction in Energy Delay Product (EDP).^[3] AMD's 2014 announcement of its pin-compatible ARM and x86 SoCs, codenamed Project Skybridge,^[4] suggested that a heterogeneous-ISA (ARM+x86) chip multiprocessor was in the making.^{[citation needed]}
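
Energy Delay Product is the energy a workload consumes multiplied by its execution time, so it rewards designs that are both fast and frugal. The minimal sketch below computes the relative EDP change between a baseline and an alternative design; the function names and the measurements are hypothetical illustrations, not data or methodology from the cited study.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy Delay Product: energy multiplied by execution time (J*s)."""
    return energy_joules * delay_seconds


def relative_edp_reduction(base_energy: float, base_delay: float,
                           new_energy: float, new_delay: float) -> float:
    """Fractional EDP reduction of a new design relative to a baseline."""
    return 1.0 - edp(new_energy, new_delay) / edp(base_energy, base_delay)


# Hypothetical measurements for one workload (not from the cited study):
# the baseline uses 100 J over 10 s; the alternative uses 80 J over 9 s.
reduction = relative_edp_reduction(100.0, 10.0, 80.0, 9.0)
print(f"EDP reduction: {reduction:.0%}")  # prints "EDP reduction: 28%"
```

Because the two factors multiply, modest gains compound: cutting both energy and delay by 10% lowers EDP by 19%.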

Heterogeneous CPU topology

A system with heterogeneous CPU topology is one in which all cores use the same ISA but differ in speed.^[5] The setup is closer to a symmetric multiprocessor: although such systems are technically asymmetric multiprocessors, the cores do not differ in roles or device access. There are typically two types of core: a higher-performance core, usually known as a "big" core or P-core, and a more power-efficient core, usually known as a "small" core or E-core. The terms P-core and E-core are usually used for Intel's implementation of heterogeneous computing, while the terms big and little cores are usually used for the ARM architecture. Some processors have three categories of core, namely prime, performance, and efficiency cores, with prime cores offering higher performance than performance cores; in this scheme a prime core is known as "big", a performance core as "medium", and an efficiency core as "small".^[6]
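
On Linux, this asymmetry can be visible from user space. The sketch below assumes a system (such as many arm64 devices) where each CPU directory under /sys/devices/system/cpu contains a cpu_capacity file, and groups logical CPUs by the capacity value they report, so that big and small cores appear as separate groups. The function name is hypothetical; on platforms that do not provide the file it simply returns an empty result.

```python
from collections import defaultdict
from pathlib import Path


def group_cpus_by_capacity(sysfs_cpu_dir: str = "/sys/devices/system/cpu") -> dict[int, list[int]]:
    """Group logical CPUs by the capacity value the kernel reports for them.

    Assumes the per-CPU file cpuN/cpu_capacity exists; CPUs without it are
    skipped. Returns {capacity: [cpu ids]}, highest capacity first.
    """
    groups: dict[int, list[int]] = defaultdict(list)
    # "cpu[0-9]*" matches cpu0, cpu12, ... but not cpufreq or cpuidle.
    for cap_file in Path(sysfs_cpu_dir).glob("cpu[0-9]*/cpu_capacity"):
        cpu_id = int(cap_file.parent.name[3:])  # "cpu12" -> 12
        capacity = int(cap_file.read_text().strip())
        groups[capacity].append(cpu_id)
    return {cap: sorted(cpus) for cap, cpus in sorted(groups.items(), reverse=True)}


if __name__ == "__main__":
    for capacity, cpus in group_cpus_by_capacity().items():
        print(f"capacity {capacity}: CPUs {cpus}")
```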

A common use of such topology is to provide better power efficiency, especially in mobile SoCs.

- ARM big.LITTLE (succeeded by DynamIQ) is the prototypical case, in which faster, high-power cores are combined with slower, low-power cores.^[7]
- Apple has produced Apple silicon SoCs with a similar organization.
- Intel has also produced hybrid x86-64 chips codenamed Lakefield, although with major limitations in instruction set support. The newer Alder Lake reduces this trade-off by giving the "small" cores broader instruction set support.
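
The power-efficiency argument can be illustrated with a toy placement policy in the spirit of energy-aware scheduling: run a task on the cheapest core type that still has enough capacity for it, with some headroom. This is a simplified sketch, not any operating system's actual scheduler; the CoreType class, the capacity and energy figures, and the 1.25 headroom factor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreType:
    name: str
    capacity: int           # max work per scheduling period (arbitrary units)
    energy_per_unit: float  # energy cost per unit of work (arbitrary units)


def place_task(demand: int, cores: list[CoreType], margin: float = 1.25) -> CoreType:
    """Pick the cheapest core type whose capacity covers demand * margin.

    Falls back to the highest-capacity core type if none fits.
    """
    fitting = [c for c in cores if c.capacity >= demand * margin]
    if not fitting:
        return max(cores, key=lambda c: c.capacity)
    return min(fitting, key=lambda c: demand * c.energy_per_unit)


# Hypothetical two-cluster system; the numbers are illustrative only.
CORES = [
    CoreType("big", capacity=1024, energy_per_unit=3.0),
    CoreType("little", capacity=400, energy_per_unit=1.0),
]

for demand in (100, 300, 700):
    print(demand, "->", place_task(demand, CORES).name)
# prints "100 -> little", "300 -> little", "700 -> big"
```

With these numbers, light tasks land on the little cores and only the heaviest needs a big core; a task too large for any core falls back to the highest-capacity type.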

Challenges

Heterogeneous computing systems present new challenges not found in typical homogeneous systems.^[8] The presence of multiple processing elements raises all of the issues involved with homogeneous parallel processing systems, while the level of heterogeneity in the system can introduce non-uniformity in system development, programming practices, and overall system capability. Areas of heterogeneity can include:^[9]

ISA or instruction-set architecture: Compute elements may have different instruction set architectures, leading to binary incompatibility.
ABI or application binary interface: Compute elements may interpret memory in different ways.^[10] This may involve endianness, calling conventions, and memory layout, and depends on both the architecture and the compiler being used.
API or application programming interface: Library and OS services may not be uniformly available to all compute elements.^[11]
Low-Level Implementation of Language Features: Language features such as functions and threads are often implemented using function pointers, a mechanism that requires additional translation or abstraction when used in heterogeneous environments.
Memory Interface and Hierarchy: Compute elements may have different cache structures and cache coherency protocols, and memory access may be uniform or non-uniform (NUMA). Differences can also be found in the ability to read arbitrary data lengths, as some processors or units can only perform byte-, word-, or burst accesses.^[12]
Interconnect: Compute elements may have differing types of interconnect aside from basic memory/bus interfaces, such as dedicated network interfaces, direct memory access (DMA) devices, mailboxes, FIFOs, and scratchpad memories. Furthermore, some portions of a heterogeneous system may be cache-coherent, whereas others may require explicit software involvement to maintain consistency and coherency.
Performance: A heterogeneous system may have CPUs that are identical in terms of architecture, but have underlying micro-architectural differences that lead to various levels of performance and power consumption. Asymmetries in capabilities paired with opaque programming models and operating system abstractions can sometimes lead to performance predictability problems, especially with mixed workloads.
Development tools: Different types of processors typically require different tools (editors, compilers, etc.), which adds complexity when partitioning an application across them.^[13]
Data Partitioning: While partitioning data on homogeneous platforms is often trivial, the problem has been shown to be NP-complete in the general heterogeneous case.^[14] For small numbers of partitions, optimal partitionings that perfectly balance load and minimize communication volume have been shown to exist.^[15]
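
Two of the challenges above can be made concrete in a few lines. The first function shows an ABI mismatch: the same four bytes decode to different integers depending on the assumed byte order. The second splits a one-dimensional workload across devices in proportion to assumed relative speeds; this is the easy special case, whereas the cited NP-completeness result concerns the harder problem of partitioning a square into rectangles. The function names, devices, and speed ratios are hypothetical.

```python
import struct


def decode_u32(raw: bytes, little_endian: bool) -> int:
    """Interpret 4 bytes as an unsigned 32-bit integer in the given byte order."""
    return struct.unpack("<I" if little_endian else ">I", raw)[0]


def partition_rows(total_rows: int, speeds: list[float]) -> list[int]:
    """Split rows among devices in proportion to their relative speeds.

    Uses largest-remainder rounding so the shares sum exactly to total_rows.
    This is the simple 1D case, not the NP-complete 2D problem.
    """
    total_speed = sum(speeds)
    exact = [total_rows * s / total_speed for s in speeds]
    shares = [int(x) for x in exact]
    leftover = total_rows - sum(shares)
    # Hand the remaining rows to the devices with the largest fractional parts.
    by_remainder = sorted(range(len(speeds)),
                          key=lambda i: exact[i] - shares[i], reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# A value written by a little-endian CPU, read back by a big-endian unit.
raw = struct.pack("<I", 1)
print(decode_u32(raw, little_endian=True), decode_u32(raw, little_endian=False))
# prints "1 16777216"

# Hypothetical speeds: CPU = 1.0, a GPU 6x and a DSP 1.5x as fast.
print(partition_rows(1000, [1.0, 6.0, 1.5]))  # prints "[118, 706, 176]"
```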

Example hardware

Heterogeneous computing hardware can be found in every domain of computing—from high-end servers and high-performance computing machines all the way down to low-power embedded devices including mobile phones and tablets.

High Performance Computing
- Cydra-5 (Numeric coprocessor)
- Cray XD1 (FPGA)
- SRC Computers SRC-6 and SRC-7 (FPGA)
Embedded Systems (DSP and Mobile Platforms)
- Texas Instruments OMAP (Media coprocessor)
- Analog Devices Blackfin (DSP and media coprocessors)
- Qualcomm Snapdragon (GPU, DSP, image, sometimes AI coprocessor; Modem, Sensors)
- Nvidia Tegra (GPU; Modem, Sensors)
- Samsung Exynos (GPU; Modem, Sensors)
- Apple "A" series (CPU, GPU; Modem)
- Movidius Myriad vision processing units, which include several symmetric processors, complemented by fixed-function units and a pair of SPARC-based controllers.
- HiSilicon Kirin SoCs (GPU; Modem, Sensors)
- MediaTek SoCs (GPU; Modem, Sensors)
- Cadence Design Systems Tensilica DSPs
Reconfigurable Computing
- Xilinx field-programmable gate arrays (FPGAs; e.g., Virtex-II Pro, Virtex 4 FX, Virtex 5 FXT) and Zynq and Versal platforms
- Intel "Stellarton" (Atom + Altera FPGA)
Networking
- Intel IXP Network Processors
- Netronome NFP Network Processors
General Purpose Computing, Gaming, and Entertainment Devices
- Intel Sandy Bridge, Ivy Bridge, and Haswell CPUs (Integrated GPU, OpenCL-capable since Ivy Bridge)
- AMD Excavator and Ryzen APUs (Integrated GPU, OpenCL-capable)
- IBM Cell, found in the PlayStation 3 (Vector coprocessor)^[16]
  - SpursEngine, a variant of the IBM Cell processor
- Emotion Engine, found in the PlayStation 2 (Vector and media coprocessors)
- ARM big.LITTLE/DynamIQ CPU architecture (heterogeneous topology)
  - Nearly all ARM vendors offer heterogeneous solutions, including ARM, Qualcomm, Nvidia, Apple, Samsung, HiSilicon, and MediaTek.

References

^ Shan, Amar (2006). Heterogeneous Processing: a Strategy for Augmenting Moore's Law. Linux Journal.
^ "Heterogeneous System Architecture (HSA) Foundation". Archived from the original on 2014-04-23. Retrieved 2014-11-01.
^ Venkat, Ashish; Tullsen, Dean M. (2014). Harnessing ISA Diversity: Design of a Heterogeneous-ISA Chip Multiprocessor. Proceedings of the 41st Annual International Symposium on Computer Architecture.
^ Anand Lal Shimpi (2014-05-05). "AMD Announces Project SkyBridge: Pin-Compatible ARM and x86 SoCs in 2015, Android Support". AnandTech. Retrieved 2017-06-11. "Next year, AMD will release a low-power 20nm Cortex A57 based SoC with integrated Graphics Core Next GPU."
^ "Energy Aware Scheduling". The Linux Kernel documentation.
^ Amadeo, Ron (2023-10-24). "Qualcomm's Snapdragon 8 Gen 3 promises 30 percent faster CPU". Ars Technica.
^ Mittal, Sparsh (February 2015). "A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors". ACM Computing Surveys. 48 (3): 1–38. doi:10.1145/2856125.
^ Kunzman, D.M. (2011). Programming Heterogeneous Systems. International Symposium on Parallel and Distributed Processing Workshops. doi:10.1109/IPDPS.2011.377.
^ Flachs, Brian (2009). Bringing Heterogeneous Processors Into The Mainstream (PDF). Symposium on Application Accelerators in High-Performance Computing (SAAHPC).
^ K. Gai; L. Qiu; H. Zhao; M. Qiu (October–December 2020). "Cost-Aware Multimedia Data Allocation for Heterogeneous Memory Using Genetic Algorithm in Cloud Computing". IEEE Transactions on Cloud Computing. 8 (4): 1212–1222. doi:10.1109/TCC.2016.2594172.
^ Agron, Jason; Andrews, David (2009). Hardware Microkernels for Heterogeneous Manycore Systems. Parallel Processing Workshops, 2009. International Conference on Parallel Processing (ICPPW). doi:10.1109/ICPPW.2009.21.
^ Lang, Johannes (2020). Heterogenes Rechnen mit ARM und DSP Multiprozessor-Ein-Chip-Systemen (MSc.). Fachhochschule Vorarlberg. doi:10.25924/opus-4525.
^ Wong, William G. (30 September 2002). "Tools Matter In Mixed-Processor Software Development". www.electronicdesign.com. Retrieved 2023-08-09.
^ Beaumont, Olivier; Boudet, Vincent; Rastello, Fabrice; Robert, Yves (August 2002). "Partitioning a square into rectangles: NP-completeness and approximation algorithms" (PDF). Algorithmica. 34 (3): 217–239. CiteSeerX 10.1.1.3.4967. doi:10.1007/s00453-002-0962-9. S2CID 9729067.
^ Beaumont, Olivier; Becker, Brett; DeFlumere, Ashley; Eyraud-Dubois, Lionel; Lastovetsky, Alexey (July 2018). "Recent Advances in Matrix Partitioning for Parallel Computing on Heterogeneous Platforms" (PDF). IEEE Transactions on Parallel and Distributed Computing.
^ Gschwind, Michael (2005). A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor (PDF). Hot Chips: A Symposium on High Performance Chips. Archived from the original (PDF) on 2020-06-18. Retrieved 2014-10-28.


@@ Line 1: / Line 1: @@
 {{Short description|Computer architecture that utilizes multiple, different processing methods}}
 {{refimprove|date=October 2014}}
-'''Heterogeneous computing''' refers to systems that use more than one kind of processor or [[Multi-core processor|cores]]. These systems gain performance or [[Electrical efficiency|energy efficiency]] not just by adding the same type of processors, but by adding dissimilar [[coprocessors]], usually incorporating specialized processing capabilities to handle particular tasks.<ref name=linux_journal>{{cite conference|last1=Shan|first1=Amar|title=Heterogeneous Processing: a Strategy for Augmenting Moore's Law | year=2006|conference= Linux Journal|url=http://www.linuxjournal.com/article/8368}}</ref>
+'''Heterogeneous computing''' refers to systems that use more than one kind of processor or [[Multi-core processor|core]]. These systems gain performance or [[Electrical efficiency|energy efficiency]] not just by adding the same type of processors, but by adding dissimilar [[coprocessors]], usually incorporating specialized processing capabilities to handle particular tasks.<ref name=linux_journal>{{cite conference|last1=Shan|first1=Amar|title=Heterogeneous Processing: a Strategy for Augmenting Moore's Law | year=2006|conference= Linux Journal|url=http://www.linuxjournal.com/article/8368}}</ref>
 ==Heterogeneity==
-Usually heterogeneity in the context of computing referred{{when?|date=June 2017}} to different [[instruction set architecture | instruction-set architecture]]s (ISA), where the main processor has one and other processors have another - usually a very different - architecture (maybe more than one), not just a different [[microarchitecture]] ([[floating point]] number processing is a special case of this - not usually referred to as heterogeneous).
+Usually heterogeneity in the context of computing refers to different [[instruction set architecture | instruction-set architecture]]s (ISA), where the main processor has one and other processors have another - usually a very different - architecture (maybe more than one), not just a different [[microarchitecture]] ([[floating point]] number processing is a special case of this - not usually referred to as heterogeneous).
 In the past heterogeneous computing meant different ISAs had to be handled differently, while in a modern example, [[Heterogeneous System Architecture]] (HSA) systems<ref name=hsa_foundation>{{cite news|title= Hetergeneous System Architecture (HSA) Foundation|url= http://www.hsafoundation.com/|access-date= 2014-11-01|archive-url= https://web.archive.org/web/20140423141300/http://www.hsafoundation.com/|archive-date= 2014-04-23|url-status= dead}}</ref> eliminate the difference (for the user) while using multiple processor types (typically [[CPU]]s and [[GPU]]s), usually on the same [[integrated circuit]], to provide the best of both worlds: general GPU processing (apart from the GPU's well-known 3D graphics rendering capabilities, it can also perform mathematically intensive computations on very large data-sets), while CPUs can run the operating system and perform traditional serial tasks.
@@ Line 24: / Line 24: @@
 === Heterogeneous CPU topology ===
-A system with '''heterogenous CPU topology''' is a system where the same ISA is used, but the cores themselves are different in speed.<ref>{{cite web |title=Energy Aware Scheduling |url=https://www.kernel.org/doc/html/latest/scheduler/sched-energy.html |website=The Linux Kernel documentation}}</ref> The setup is more similar to a [[symmetric multiprocessor]]. (Although such systems are technically [[asymmetric multiprocessing|asymmetric multiprocessors]], the cores do not differ in roles or device access.)
+A system with '''heterogeneous CPU topology''' is a system where the same ISA is used, but the cores themselves are different in speed.<ref>{{cite web |title=Energy Aware Scheduling |url=https://www.kernel.org/doc/html/latest/scheduler/sched-energy.html |website=The Linux Kernel documentation}}</ref> The setup is more similar to a [[symmetric multiprocessor]]. (Although such systems are technically [[asymmetric multiprocessing|asymmetric multiprocessors]], the cores do not differ in roles or device access.) There are typically two types of cores: a higher performance core usually known as a "big" or P-core and a more power efficient core usually known as a "small" or E-core. The terms P- and E-cores are usually used in relation to Intel's implementation of hetereogeneous computing, while the terms big and little cores are usually used in relation to the ARM architecture. Some processors have three categories of core, prime, performance and efficiency cores, with prime cores having higher performance than performance cores; a prime core is known as "big", a performance core is known as "medium", and an efficiency core is known as "small".<ref>{{cite web
+|url=https://arstechnica.com/gadgets/2023/10/qualcomms-snapdragon-8-gen-3-promises-30-percent-faster-cpu/ |title=Qualcomm's Snapdragon 8 Gen 3 promises 30 percent faster CPU |first=Ron |last=Amadeo |date=2023-10-24 |website=[[Ars Technica]]}}</ref>
+A common use of such topology is to provide better power efficiency, especially in mobile SoCs.
-A common use of such topology is to provide better power efficiency in mobile SoCs. [[ARM big.LITTLE]] is the prototypical case, where faster high-power cores are combined with slower low-power cores.<ref>[https://www.researchgate.net/publication/283733254_A_Survey_Of_Techniques_for_Architecting_and_Managing_Asymmetric_Multicore_Processors A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors], ACM Computing Surveys, 2015.</ref> Apple has produced [[Apple silicon]] ARM cores with similar organization. Intel has also produced hybrid x86 cores codenamed Lakefield, although not without major limitations in instruction set support.
+* [[ARM big.LITTLE]] (succeeded by DynamIQ) is the prototypical case, where faster high-power cores are combined with slower low-power cores.<ref>{{cite journal |url=https://www.researchgate.net/publication/283733254 |title=A Survey Of Techniques for Architecting and Managing Asymmetric Multicore Processors |first=Sparsh |last=Mittal |journal=ACM Computing Surveys |volume=48 |issue=3 |date=February 2015 |pages=1–38 |doi=10.1145/2856125}}</ref>
+* Apple has produced [[Apple silicon]] SoCs with similar organization.
-[[Alder Lake (microprocessor)|Alder Lake]] uses small and big cores.
+* Intel has also produced hybrid x86-64 chips codenamed [[Intel Lakefield|Lakefield]], although not without major limitations in instruction set support. The newer [[Alder Lake (microprocessor)|Alder Lake]] reduces the sacrifice by adding more instruction set support to the "small" core.
 == Challenges ==
 Heterogeneous computing systems present new challenges not found in typical homogeneous systems.<ref name=prog_hetero>{{cite conference|last1=Kunzman|first1=D.M.|title=Programming Heterogeneous Systems | year=2011|conference=International Symposium on Parallel and Distributed Processing Workshops |doi=10.1109/IPDPS.2011.377}}</ref> The presence of multiple processing elements raises all of the issues involved with homogeneous parallel processing systems, while the level of heterogeneity in the system can introduce non-uniformity in system development, programming practices, and overall system capability. Areas of heterogeneity can include:<ref name=mainstream_hetero>{{cite conference|last1=Flachs|first1=Brian|title=Bringing Heterogeneous Processors Into The Mainstream | year=2009|conference= Symposium on Application Accelerators in High-Performance Computing (SAAHPC)|url=http://saahpc.ncsa.illinois.edu/09/sessions/day1/session2/Flachs_presentation.pdf}}</ref>
-* ISA or [[Instruction set|instruction-set architecture]]
+; ISA or [[Instruction set|instruction-set architecture]]
-** Compute elements may have different instruction set architectures, leading to binary incompatibility.
+: Compute elements may have different instruction set architectures, leading to binary incompatibility.
-* ABI or [[application binary interface]]
+; ABI or [[application binary interface]]
-** Compute elements may interpret memory in different ways.<ref>{{cite journal| title= Cost-aware multimedia data allocation for heterogeneous memory using genetic algorithm in cloud computing |year=2016|url= http://webpage.pace.edu/kg71231w/docs/tcc1.pdf |publisher= IEEE }}</ref> This may include both [[endianness]], [[calling convention]], and memory layout, and depends on both the architecture and [[compiler]] being used.
+: Compute elements may interpret memory in different ways.<ref>{{cite journal |author1=K. Gai |author2=L. Qiu |author3=H. Zhao |author4=M. Qiu |title=Cost-Aware Multimedia Data Allocation for Heterogeneous Memory Using Genetic Algorithm in Cloud Computing |journal=IEEE Transactions on Cloud Computing |volume=8 |issue=4 |pages=1212–1222 |date=October–December 2020 |doi=10.1109/TCC.2016.2594172}}</ref> This may include both [[endianness]], [[calling convention]], and memory layout, and depends on both the architecture and [[compiler]] being used.
-* [[API]] or [[application programming interface]]
+; [[API]] or [[application programming interface]]
-** Library and OS services may not be uniformly available to all compute elements.<ref name=hetero_manycore>{{cite conference|last1=Agron|first1=Jason|last2=Andrews|first2=David|title=Hardware Microkernels for Heterogeneous Manycore Systems | year=2009|conference= Parallel Processing Workshops, 2009. International Conference on Parallel Processing (ICPPW)|doi=10.1109/ICPPW.2009.21}}</ref>
+: Library and OS services may not be uniformly available to all compute elements.<ref name=hetero_manycore>{{cite conference|last1=Agron|first1=Jason|last2=Andrews|first2=David|title=Hardware Microkernels for Heterogeneous Manycore Systems | year=2009|conference= Parallel Processing Workshops, 2009. International Conference on Parallel Processing (ICPPW)|doi=10.1109/ICPPW.2009.21}}</ref>
-* Low-Level Implementation of Language Features
+; Low-Level Implementation of Language Features
-** Language features such as functions and threads are often implemented using [[function pointer]]s, a mechanism which requires additional translation or abstraction when used in heterogeneous environments.
+: Language features such as functions and threads are often implemented using [[function pointer]]s, a mechanism which requires additional translation or abstraction when used in heterogeneous environments.
-* Memory Interface and [[Memory hierarchy|Hierarchy]]
+; Memory Interface and [[Memory hierarchy|Hierarchy]]
-** Compute elements may have different [[Cache (computing)|cache]] structures, [[cache coherency]] protocols, and memory access may be uniform or non-uniform memory access ([[Non-uniform memory access|NUMA]]). Differences can also be found in the ability to read arbitrary data lengths as some processors/units can only perform byte-, word-, or burst accesses.
+: Compute elements may have different [[Cache (computing)|cache]] structures, [[cache coherency]] protocols, and memory access may be uniform or non-uniform memory access ([[Non-uniform memory access|NUMA]]). Differences can also be found in the ability to read arbitrary data lengths as some processors/units can only perform byte-, word-, or burst accesses.<ref>{{cite thesis |last= Lang |first=Johannes |date=2020 |title=Heterogenes Rechnen mit ARM und DSP Multiprozessor-Ein-Chip-Systemen|type=MSc. |publisher=Fachhochschule Vorarlberg |doi=10.25924/opus-4525 |url=https://doi.org/10.25924/opus-4525}}</ref>
-* Interconnect
+; Interconnect
-** Compute elements may have differing types of interconnect aside from basic memory/bus interfaces. This may include dedicated network interfaces, Direct memory access ([[Direct memory access|DMA]]) devices, mailboxes, [[FIFO (computing and electronics)|FIFO]]s, and [[Scratchpad memory|scratchpad memories]], etc.  Furthermore, certain portions of a heterogeneous system may be cache-coherent, whereas others may require explicit software-involvement for maintaining consistency and coherency.
+: Compute elements may have differing types of interconnect aside from basic memory/bus interfaces. This may include dedicated network interfaces, Direct memory access ([[Direct memory access|DMA]]) devices, mailboxes, [[FIFO (computing and electronics)|FIFO]]s, and [[Scratchpad memory|scratchpad memories]], etc.  Furthermore, certain portions of a heterogeneous system may be cache-coherent, whereas others may require explicit software-involvement for maintaining consistency and coherency.
-* Performance
+; Performance
-** A heterogeneous system may have CPUs that are identical in terms of architecture, but have underlying micro-architectural differences that lead to various levels of performance and power consumption.  Asymmetries in capabilities paired with opaque programming models and operating system abstractions can sometimes lead to performance predictability problems, especially with mixed workloads.
+: A heterogeneous system may have CPUs that are identical in terms of architecture, but have underlying micro-architectural differences that lead to various levels of performance and power consumption.  Asymmetries in capabilities paired with opaque programming models and operating system abstractions can sometimes lead to performance predictability problems, especially with mixed workloads.
+;Development tools
-*Data Partitioning
+: Different types of processors would typically require different tools (editors, compilers, ...) for software developers, which introduces complexity when partitioning the application across those.<ref>{{Cite web |last=Wong |first=William G. |date=30 September 2002 |title=Tools Matter In Mixed-Processor Software Development |url=https://www.electronicdesign.com/technologies/embedded/digital-ics/processors/dsp/article/21756193/tools-matter-in-mixedprocessor-software-development |access-date=2023-08-09 |website=www.electronicdesign.com}}</ref>
-**While partitioning data on homogeneous platforms is often trivial, it has been shown that for the general heterogeneous case, the problem is NP-Complete.<ref>{{Cite journal|last=Beaumont|first=Olivier|last2=Boudet|first2=Vincent|last3=Rastello|first3=Fabrice|last4=Robert|first4=Yves|date=August 2002|title=Partitioning a square into rectangles: NP-completeness and approximation algorithms|url=http://lara.inist.fr/bitstream/handle/2332/487/RR2000-10.pdf?sequence=1|journal=Algorithmica|volume=34|issue=3|pages=217–239|doi=10.1007/s00453-002-0962-9|citeseerx=10.1.1.3.4967}}</ref> For small numbers of partitions, optimal partitionings that perfectly balance load and minimize communication volume have been shown to exist. <ref>{{Cite journal|last=Beaumont|first=Olivier|last2=Becker|first2=Brett|last3=DeFlumere|first3=Ashley|last4=Eyraud-Dubois|first4=Lionel|last5=Lastovetsky|first5=Alexey|date=July 2018|title=Recent Advances in Matrix Partitioning for Parallel Computing on Heterogeneous Platforms.|url=https://www.brettbecker.com/wp-content/uploads/2018/07/beaumont2018recent_pre.pdf|journal=IEEE Transactions on Parallel and Distributed Computing}}</ref>
+;Data Partitioning
+: While partitioning data on homogeneous platforms is often trivial, it has been shown that for the general heterogeneous case, the problem is NP-Complete.<ref>{{Cite journal|last1=Beaumont|first1=Olivier|last2=Boudet|first2=Vincent|last3=Rastello|first3=Fabrice|last4=Robert|first4=Yves|date=August 2002|title=Partitioning a square into rectangles: NP-completeness and approximation algorithms|url=http://lara.inist.fr/bitstream/handle/2332/487/RR2000-10.pdf?sequence=1|journal=Algorithmica|volume=34|issue=3|pages=217–239|doi=10.1007/s00453-002-0962-9|citeseerx=10.1.1.3.4967|s2cid=9729067 }}</ref> For small numbers of partitions, optimal partitionings that perfectly balance load and minimize communication volume have been shown to exist. <ref>{{Cite journal|last1=Beaumont|first1=Olivier|last2=Becker|first2=Brett|last3=DeFlumere|first3=Ashley|last4=Eyraud-Dubois|first4=Lionel|last5=Lastovetsky|first5=Alexey|date=July 2018|title=Recent Advances in Matrix Partitioning for Parallel Computing on Heterogeneous Platforms.|url=https://www.brettbecker.com/wp-content/uploads/2018/07/beaumont2018recent_pre.pdf|journal=IEEE Transactions on Parallel and Distributed Computing}}</ref>
 == Example hardware ==
+{{cleanup section|reason=Some groupings don't make sense when "what's added compared to a bare CPU" is considered. Maybe it's time to rethink the taxonomy.|date=September 2021}}
 Heterogeneous computing hardware can be found in every domain of computing—from high-end servers and high-performance computing machines all the way down to low-power embedded devices including mobile phones and tablets.
 * High Performance Computing
+** [[Cydra-5]] (Numeric coprocessor)
-** [[Cray XD1]]
-** [[SRC Computers]] SRC-6 and SRC-7
+** [[Cray XD1]] (FPGA)
+** [[SRC Computers]] SRC-6 and SRC-7 (FPGA)
 * Embedded Systems (DSP and Mobile Platforms)
-**[[Texas Instruments]] [[OMAP]]
+**[[Texas Instruments]] [[OMAP]] (Media coprocessor)
-** [[Analog Devices]] [[Blackfin]]
+** [[Analog Devices]] [[Blackfin]] (DSP and media coprocessors)
-**[[Qualcomm]] [[Qualcomm Snapdragon|Snapdragon]]
+**[[Qualcomm]] [[Qualcomm Snapdragon|Snapdragon]] (GPU, DSP, image, sometimes AI coprocessor; Modem, Sensors)<!-- currently using semicolon to show not very compute-y things -->
-**[[Nvidia]] [[Tegra]]
+**[[Nvidia]] [[Tegra]] (GPU; Modem, Sensors)
-**[[Samsung]] [[Exynos]]
+**[[Samsung]] [[Exynos]] (GPU; Modem, Sensors)
-**[[Apple Inc.|Apple]] [[Apple silicon#A series|"A" series]]
+**[[Apple Inc.|Apple]] [[Apple silicon#A series|"A" series]] (CPU, GPU; Modem)
 **[[Movidius Myriad 2|Movidius Myriad]] [[Vision processing unit|Vision processing units]], which includes several symmetric processors, complemented by [[fixed function units]], and a pair of [[SPARC]] based controllers.
-**[[HiSilicon]] Kirin SoCs
+**[[HiSilicon]] Kirin SoCs (GPU; Modem, Sensors)
-**[[MediaTek]] SoCs
+**[[MediaTek]] SoCs (GPU; Modem, Sensors)
 **[[Cadence Design Systems]] Tensilica DSPs
 * Reconfigurable Computing
@@ Line 71: / Line 77: @@
 ** [[Intel]] "Stellarton" (Atom + [[Altera]] [[FPGA]])
 * Networking
-** Intel IXP Network Processors
+** Intel [[XScale#IXP_network_processor|IXP]] Network Processors
 ** [[Netronome]] NFP Network Processors
 * General Purpose Computing, Gaming, and Entertainment Devices
-**[[Intel]] Sandy Bridge, Ivy Bridge, and Haswell CPUs
+**[[Intel]] Sandy Bridge, Ivy Bridge, and Haswell CPUs (Integrated GPU, OpenCL-capable since Ivy Bridge)
-** [[AMD]] [[Excavator (microarchitecture)|Excavator]] and [[Ryzen]] APUs
+** [[AMD]] [[Excavator (microarchitecture)|Excavator]] and [[Ryzen]] APUs (Integrated GPU, OpenCL-capable)
-** [[IBM]] [[Cell (microprocessor)|Cell]], found in the [[PlayStation]] 3<ref name=hotchips_cell>{{cite conference|last1=Gschwind|first1=Michael|title=A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor | year=2005|conference= Hot Chips: A Symposium on High Performance Chips|url=http://www.hotchips.org/wp-content/uploads/hc_archives/hc17/2_Mon/HC17.S1/HC17.S1T1.pdf}}</ref>
+** [[IBM]] [[Cell (microprocessor)|Cell]], found in the [[PlayStation]] 3 (Vector coprocessor)<ref name=hotchips_cell>{{cite conference|last1=Gschwind|first1=Michael|title=A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor|year=2005|conference=Hot Chips: A Symposium on High Performance Chips|url=http://www.hotchips.org/wp-content/uploads/hc_archives/hc17/2_Mon/HC17.S1/HC17.S1T1.pdf|access-date=2014-10-28|archive-date=2020-06-18|archive-url=https://web.archive.org/web/20200618004509/http://www.hotchips.org/wp-content/uploads/hc_archives/hc17/2_Mon/HC17.S1/HC17.S1T1.pdf|url-status=dead}}</ref>
 *** [[SpursEngine]], a variant of the IBM Cell processor
-** [[Emotion Engine]], found in the [[PlayStation 2]]
+** [[Emotion Engine]], found in the [[PlayStation 2]] (Vector and media coprocessors)
-**[[ARM architecture|ARM]] [[ARM big.LITTLE|big.LITTLE/DynamIQ]]  CPU architecture
+**[[ARM architecture|ARM]] [[ARM big.LITTLE|big.LITTLE/DynamIQ]]  CPU architecture (heterogeneous topology)
 *** Nearly all ARM vendors offer heterogeneous solutions; ARM, Qualcomm, Nvidia, Apple, Samsung, HiSilicon, MediaTek, etc.
@@ Line 86: / Line 92: @@
 * [[MPSoC]]
 * [[ARM big.LITTLE|big.LITTLE/DynamIQ]]
+* [[Simultaneous and heterogeneous multithreading]]
 ==References==