If so, how does this work with AMD’s upcoming high-end Summit Ridge CPUs? It is widely accepted that HSA does not deliver on its promise (at least under the current architecture) without tight coupling of the CPU and GPU through a shared memory allocation, and that it also suffers from latency over PCIe. Hence, I can buy a £300 laptop which supports HSA, but I cannot build a PC that leverages the power of my £600 graphics card and £300 CPU. Either they sort out this limitation of the shared memory pool and the latency over PCIe, or they stick shaders on Summit Ridge, or HSA has no part to play in their high-end offerings.
At least that is what I used to think, but perhaps I have been looking through the wrong end of the telescope.
If HSA is the future then we [must] presume it will one day function on their high-end/high-margin products, and not just the $300 Best-Buy boxes from which they scrape a few miserable dollars in profit.
If HSA [is] still the future of AMD then its high-end/high-margin products have two possible solutions to the current problem that I can see:
1. Summit Ridge comes with some shaders. Not many, not a significant proportion of total die space, but a useful number to enable HSA functionality. 256 high-density shaders would be very achievable alongside 8c/16t at 14nm, including L3 cache.
2. They find a way to extend shared memory allocation across PCIe, and look at technical solutions to reduce latency (3.0 was supposed to be better than 2.0; will 4.0 be better still?) and software solutions to mitigate the impact of that latency.
The HSA team at AMD analyzed the performance of Haar Face Detect, a commonly used multi-stage video analysis algorithm used to identify faces in a video stream. The team compared a CPU/GPU implementation in OpenCL™ against an HSA implementation. The HSA version seamlessly shares data between CPU and GPU, without memory copies or cache flushes because it assigns each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using only multicore CPU, only GPU, or even combined CPU and GPU with today’s driver model. Just as important, it is done using simple extensions to C++, not a totally different programming model.
3. Polaris/Vega comes with four high-performance ARM CPU cores embedded in the new “Command Processor” section of the GPU. The idea is that HSA tasks, which are architecture independent anyway, are passed across the PCIe bus from the CPU to the GPU, whereupon the GPU (with its onboard CPUs) takes over the task in its entirety, with the benefit of 8GB of ‘unified’ memory and 1000GB/s of bandwidth to supply the many thousands of GCN1.4 shaders.
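To put a rough shape on the PCIe argument, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the function name `offload_time_us`, the 10 µs link latency, the 16 GB/s link bandwidth and the 50 µs kernel time are round numbers I picked, not measurements of any AMD part. It models a task that either copies its data across the link in both directions (a discrete card today) or uses the data in place (shared memory, as on an APU).

```python
def offload_time_us(payload_bytes, kernel_us, link_latency_us, link_gb_per_s, copies):
    """Rough time for one offloaded task, in microseconds (toy model)."""
    # 1 GB/s moves 1e3 bytes per microsecond, so bytes / (GB/s * 1e3) gives microseconds.
    transfer_us = payload_bytes / (link_gb_per_s * 1e3)
    # Each copy pays the fixed link latency plus the transfer time.
    return copies * (link_latency_us + transfer_us) + kernel_us


# Illustrative assumptions only: 1 MB payload, 50 us kernel, 10 us latency, 16 GB/s link.
discrete = offload_time_us(1_000_000, 50.0, 10.0, 16.0, copies=2)  # copy in, copy out
shared = offload_time_us(1_000_000, 50.0, 10.0, 16.0, copies=0)    # data used in place

print(f"discrete card with copies: {discrete:.1f} us")
print(f"shared memory, no copies:  {shared:.1f} us")
```

With these made-up numbers the two copies cost almost three times as much as the kernel itself, which is the kind of overhead any of the three options above would have to remove or hide.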