ASRock's E350M1: AMD's Brazos Platform Hits The Desktop First |
We had the opportunity to preview the Zacate APU late last year at AMD’s headquarters in Austin, Texas. Now we have the first retail motherboard based on the Brazos platform in ASRock’s E350M1. Today we’re asking: what can the Fusion initiative really do?

So, I’m sitting here, benchmarking away on ASRock’s E350M1 motherboard, which is armed with AMD's E-350 APU, thinking to myself, “Man, this is a way better experience than any Atom-based configuration I’ve ever tested in Windows 7, but it's not what I was hoping for from the ‘Fusion, Fusion, Fusion’ that AMD keeps chanting.”

What are you, anyway? Are you…an Atom-killer? Are you…able to compete with a desktop architecture? Are you…designed to go heads-up against mobile parts? Maybe the Brazos platform is just a little bit of each of those. As such, I'm going to cast a pretty wide net for benchmarking against this compact platform with hard-to-define aspirations.

Unquestionably, the most telling specification that helps us classify the AMD E-350 accelerated processing unit (I’ll call it an APU from here on out) on ASRock’s board is its 18 W TDP. That’s decidedly mobile territory, and indeed, AMD told our own Andrew Ku at its preview event last year to expect the E-350 and E-240 APUs to end up in nettops (low-powered desktops) and ultraportable notebooks priced below $500.

Finessed into a Mini-ITX form factor, ASRock’s E350M1 most definitely qualifies for nettop duty. But so do a lot of other platforms. You can find everything from anemic dual-core 8 W Atom processors to enthusiast-class Core i5s crammed into nettops. The Brazos platform fits somewhere in between.

And so you have to ask yourself a few questions. First, how much performance do you need from a nettop, a system that, by design, is only really intended to do a little Web browsing, a little word processing, and if you’re lucky, a bit of mainstream gaming? How much are you willing to spend?
And how important are power consumption and cooling to you?

Fusion’s First Outing

Sitting front and center (quite literally) on the E350M1 motherboard is AMD’s E-350 APU. The 75 mm² die hosts two processor cores based on the new Bobcat architecture, a single-channel DDR3 memory controller, two SIMD engines, UVD 3 fixed-function decode logic, a pair of independent display outputs, four second-gen PCI Express lanes, and the Unified Media Interface that connects the E-350 to its complementary chipset. The APU is manufactured by TSMC using 40 nm lithography, and it carries the same 18 W TDP we saw in AMD’s suite at IDF 2010 and then previewed at the company’s headquarters last year.

Each of its two Bobcat cores comes armed with 32 KB of data cache and 32 KB of instruction cache, plus a 512 KB L2 cache. They run at 1.6 GHz and offer 64-bit support, plus SSE, SSE2, and SSE3. We already know that the Bobcat architecture offers out-of-order execution, which should make it faster than Intel’s Atom. What we don’t know is how much more power-hungry Bobcat will be as a result.

And what about the APU’s graphics capabilities? Each of the two on-die SIMD engines includes 40 stream processors, totaling 80 stream processors per APU. I’m 99% sure this design centers on AMD’s VLIW5 architecture, meaning each SIMD features eight thread processors with five ALUs each. For a contemporary in AMD’s discrete lineup, you’d have to look to the Radeon HD 5450, which features 80 SPs as well, but runs at 650 MHz. In comparison, Zacate’s graphics core, branded as Radeon HD 6310, operates at 500 MHz. Yes, this means it supports DirectX 11, though you’ll find the viability of DirectX 11-class gaming is going to be limited on such a power-constrained piece of hardware. The discrete 5450 boasts eight texture units and four ROPs. I’m still waiting to hear back from AMD on Zacate’s specific configuration, but I wouldn’t be surprised to see something similar.
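The shader arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the SIMD, thread processor, and ALU counts and both clock speeds come from the text, while the peak-throughput figure assumes each ALU retires one multiply-add (two floating-point operations) per clock, which is an assumption of this example rather than anything AMD stated here. The function names are made up for illustration.

```python
# Shader arithmetic for the E-350's graphics core, using counts from the text.
SIMD_ENGINES = 2
THREAD_PROCESSORS_PER_SIMD = 8   # per the VLIW5 layout described in the text
ALUS_PER_THREAD_PROCESSOR = 5    # the "5" in VLIW5


def stream_processors(simds, tps_per_simd, alus_per_tp):
    """Total stream processors (ALUs) across all SIMD engines."""
    return simds * tps_per_simd * alus_per_tp


def peak_gflops(sps, clock_mhz, flops_per_alu_per_clock=2):
    """Theoretical single-precision peak in GFLOPS.

    Assumption: one multiply-add (2 FLOPs) per ALU per clock.
    """
    return sps * flops_per_alu_per_clock * clock_mhz / 1000.0


sps = stream_processors(SIMD_ENGINES,
                        THREAD_PROCESSORS_PER_SIMD,
                        ALUS_PER_THREAD_PROCESSOR)
hd6310 = peak_gflops(sps, 500)   # Zacate's graphics core at 500 MHz
hd5450 = peak_gflops(sps, 650)   # discrete Radeon HD 5450 at 650 MHz

print(f"stream processors: {sps}")
print(f"HD 6310 peak: {hd6310:.0f} GFLOPS, HD 5450 peak: {hd5450:.0f} GFLOPS")
print(f"HD 6310 / HD 5450: {hd6310 / hd5450:.3f}")
```

With identical shader counts, the clock difference alone puts the integrated core at roughly three quarters of the discrete card's theoretical shader throughput, before memory bandwidth enters the picture.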
Update (1/14/2011): AMD just confirmed for us that Zacate employs a Cedar core, equivalent to what you'd find in the aforementioned Radeon HD 5450, based on the company's VLIW5 architecture. Indeed, that gives you access to eight texture units and four ROPs.

More useful is the integration of UVD 3, AMD’s third-generation fixed-function video decode block capable of accelerating H.264, VC-1, MPEG-2, and now DivX and Xvid through MPEG-4 Part 2. The one capability that doesn’t carry over from the UVD 3 implementation found on other Radeon HD 6000-series GPUs is MVC acceleration, needed for playing back Blu-ray 3D content. That’s going to limit the utility of an E-350-based nettop in certain home theater environments.

The Zacate APU isn’t designed to be the fastest low-power device available; it only needs to be fast enough to serve AMD’s intended market, delivering as much battery life as possible in the process. Of course, power consumption isn’t as big of a deal in the nettop space. There, performance compromises are more impactful. E-350’s single 64-bit memory channel is a good example. Intel’s Sandy Bridge architecture gets plenty of bandwidth via two memory channels able to support DDR3-1333, needed to feed up to four execution cores and a graphics engine. AMD’s APU similarly needs to feed data to processor cores and graphics, but is limited to one channel at up to DDR3-1066.

Update (1/16/2011): After our original story went up, many of you asked for results with AMD's E-350 overclocked. Bad news. According to ASRock, the APU gets its clock from the A50M, just as Intel's Sandy Bridge-based processors get theirs from P67/H67. As a result, overclocking is completely locked out. Even with an external clock generator, the chip's scalability is limited to around 3%, ASRock's engineers say. Don't get your hopes up on this one.
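To put the memory comparison above into numbers, here is a minimal sketch of the peak-bandwidth arithmetic. The channel counts and DDR3 speed grades are the ones named in the text; the 64-bit width for each Sandy Bridge channel is standard for DDR3 but is not stated in the article, and the function name is hypothetical. These are theoretical ceilings, not measured results.

```python
def peak_bandwidth_gb_s(transfers_mt_s, channels, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes).

    transfers_mt_s: nominal data rate in mega-transfers per second
    channels:       number of independent memory channels
    bus_width_bits: width of each channel (64 bits for a DDR3 channel)
    """
    bytes_per_transfer = bus_width_bits // 8
    return transfers_mt_s * bytes_per_transfer * channels / 1000


# E-350: one channel at up to DDR3-1066
e350_gbs = peak_bandwidth_gb_s(1066, channels=1)
# Sandy Bridge: two channels at DDR3-1333
sandy_bridge_gbs = peak_bandwidth_gb_s(1333, channels=2)

print(f"E-350:        {e350_gbs:.2f} GB/s")
print(f"Sandy Bridge: {sandy_bridge_gbs:.2f} GB/s")
print(f"ratio:        {sandy_bridge_gbs / e350_gbs:.2f}x")
```

On paper the desktop platform has about two and a half times the bandwidth, and the APU has to share its smaller pool between two CPU cores and the graphics engine.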
AMD’s A50M Chipset

As with Intel’s Sandy Bridge design, the Zacate APU incorporates a significant amount of functionality that would have, in the past, been found in the platform’s chipset. That goes a long way in explaining why the A50M (also known as Hudson) Fusion Controller Hub attached to AMD’s E-350 APU is so danged small. There’s a four-lane, first-gen PCI Express link connecting the APU and FCH. The controller hub itself offers up to six SATA 6Gb/s ports, four second-gen PCI Express lanes, HD Audio, and up to 14 USB 2.0 ports. As you’ll see next, ASRock takes advantage of much of this two-chip platform’s functionality. |
ASRock E350M1: Enter Brazos |
ASRock’s E350M1, its first Brazos platform, looks a lot like the first Intel Atom-based motherboard I ever picked up, but more modern. It has one passive heatsink and one lower-profile heatsink cooled by a fan. In the early Atom days, that active sink would have covered the chipset. Here, it sits atop the 18 W Zacate APU, while the FCH gets away with passive cooling.

Right above the processor and chipset are two DDR3 memory slots. The E350M1 will take up to 16 GB of memory, but remember that both slots feed a single 64-bit channel. At 1066 MT/s, you’re looking at up to 8.53 GB/s maximum, regardless of whether you drop in one module or two.

The Mini-ITX board boasts a single PCI Express x16 slot for upgrades. Don’t expect miracles from it: there are only four second-gen lanes coming from Zacate to feed that slot. Then again, you probably wouldn’t want to buy anything beyond a mid-range discrete card, given the probability of a processor bottleneck.

Four internal SATA 6Gb/s ports and one back-panel eSATA connector take advantage of most of Hudson’s storage connectivity. The I/O panel also plays host to six USB 2.0 ports (another four are accessible through onboard headers). Gigabit Ethernet is enabled through a Realtek RTL8111E controller tied into one of the FCH’s four PCIe links. And the integrated graphics engine drives two display outputs simultaneously, letting you pick between VGA, single-link DVI, and HDMI. Anyone not getting audio output over HDMI can tap into analog 7.1-channel output or digital output via TOSLINK.

As with its latest P67-based platforms, ASRock arms the E350M1 with UEFI firmware. The setup looks identical to what Thomas covered in his recent P67 Motherboard Roundup, with settings naturally altered to match the Brazos platform’s unique capabilities. |
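To see why the x16 slot comes with a caveat, the sketch below estimates per-direction link bandwidth from lane count and PCI Express generation. The lane counts are from the text. The line rates (2.5 GT/s for first-gen, 5 GT/s for second-gen) and the 8b/10b encoding overhead are general PCI Express background that the article does not spell out, and the function names are invented for this example.

```python
# Raw line rate per lane in mega-transfers per second, by PCIe generation.
# These are background figures, not taken from the article.
LINE_RATE_MT_S = {1: 2500, 2: 5000}


def lane_mb_per_s(gen):
    """Usable bandwidth of one lane, one direction, in MB/s."""
    payload_mbit = LINE_RATE_MT_S[gen] * 8 // 10  # 8b/10b: 8 data bits per 10 sent
    return payload_mbit // 8                      # bits to bytes


def link_gb_per_s(gen, lanes):
    """Usable bandwidth of a whole link, one direction, in GB/s."""
    return lane_mb_per_s(gen) * lanes / 1000


slot_as_wired = link_gb_per_s(gen=2, lanes=4)    # what Zacate feeds the slot
slot_full_x16 = link_gb_per_s(gen=2, lanes=16)   # a fully wired Gen2 x16 slot
apu_to_fch = link_gb_per_s(gen=1, lanes=4)       # four-lane first-gen APU-to-FCH link

print(f"x16 slot as wired (Gen2 x4): {slot_as_wired} GB/s")
print(f"fully wired Gen2 x16:        {slot_full_x16} GB/s")
print(f"APU-to-FCH link (Gen1 x4):   {apu_to_fch} GB/s")
```

Physically the slot accepts an x16 card, but electrically it offers a quarter of the bandwidth such a card would see on a full desktop platform.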
The First Inklings Of Fusion: On-Die Video Decoding Via UVD3 |
Equipped with a modest 80 stream processors, graphics performance will never be Zacate’s flagship feature. However, the addition of AMD’s UVD 3 fixed-function decoding block makes it possible to watch high-bitrate content without taxing the lightweight Bobcat cores.

As mentioned previously, the UVD 3 we have here isn’t exactly like what you’d find on a Radeon HD 6000-series discrete board, though. On the downside, Multiview Video Coding is not accelerated, meaning you don’t get Blu-ray 3D support. That won’t be a deal-breaker for most, since Blu-ray 3D has been very slow to catch on in light of the expensive, battery-powered, heavy glasses required for each viewer. But it might become more of a detractor if the technology gains momentum in 2011.

However, the Zacate APU also introduces MPEG-2 hardware decode at the VLD (variable-length decoding) level. MPEG-2 is a fairly easy codec to decode, so nobody gave much thought to the fact that AMD wasn’t accelerating the entire pipeline (only iDCT and motion compensation). And from a performance standpoint, the Brazos platform wouldn’t have had any problems. But by building in a more complete MPEG-2 decode solution, more of the workload is handled by efficient fixed-function logic, keeping the Bobcat cores as idle as possible, and extending battery life. Naturally, this is more of a consideration for mobile machines that will center on Brazos.

Of course, we wanted to double-check that AMD’s fixed-function logic was doing its share of the heavy lifting, so we fired up Quantum of Solace (AVC) and The Book of Eli (VC-1) using a recently optimized copy of CyberLink’s PowerDVD 10: |
[Figures: H.264 playback results for the Athlon II X2, Celeron SU2300, and E-350] |
Though the Athlon II and Celeron see the lowest utilization numbers, remember that the Athlon II-based machine centers on AMD’s 880G chipset, which includes UVD 2 and is driven by a 2.8 GHz desktop processor, while the Celeron benefits from Nvidia’s Ion, armed with third-generation PureVideo. The simple fact that AMD’s E-350 hovers around 20% utilization in H.264-based video and 30% utilization decoding VC-1 is pretty impressive. Also, these numbers include audio decoding, which happens on the CPU. If you were to bitstream Dolby TrueHD or DTS-HD Master Audio over HDMI, the two Bobcat cores would have even less work to do.

I spent some time talking about my findings with Louis Chen, CyberLink’s director of business development, who shared the performance numbers his technical team had recorded. Although we don’t have access to the single-core version of Zacate or either Ontario SKU, CyberLink says even the C-50 (Ontario, 1 GHz, dual-core) should be capable of Blu-ray playback using any of the three accelerated codecs. You’ll see utilization in excess of 70% in H.264 titles, but the requisite performance is there. |
More Inklings: Video Transcoding |
The way Fast Copy works is simple. Previously, transcoding apps used CPU instructions to copy decoded video data from a graphics card on one end of the PCI Express bus to the processor, where post-processing and encoding took place. This interaction between dissimilar memory spaces burned CPU cycles. On a modern desktop processor, that probably wasn’t a debilitating bottleneck. But in a more mobile implementation, burnt cycles not only hold back performance more noticeably, but also have an adverse impact on power consumption. Fast Copy facilitates DMA to copy the same data without using CPU cycles, freeing the two Bobcat cores to work on the encode.

Wait, encode is happening on the processor? We have 80 stream processors in a pair of SIMD engines. Why not offload to those in much the same way that Intel involved its EUs in encode acceleration on Sandy Bridge? AMD does have encode acceleration available on its discrete graphics products. But the two SIMD engines on Zacate simply aren’t powerful enough to demonstrate an appreciable benefit. This functionality will be available through the Sabine platform’s Llano APU, so we’ll have to wait until later this year to see how well it works.

In the meantime, one of CyberLink’s competitors, ArcSoft, is working on its own OpenCL-based encoder that may or may not change the Brazos performance story in the near term. CyberLink is going the OpenCL route later this year as well. But again, both companies are more likely focused on Llano, which has the GPU muscle to make encoding worth offloading to graphics. |
Transcode Performance: The APU, CUDA, Stream, And Software |
Limited to decode acceleration, our expectations of what AMD’s Zacate APU will be able to do in transcode-oriented workloads have to be tempered. We’re testing each solution with a 10.5 Mb/s trailer of the movie Death Race from Apple’s Web site, encoded using H.264 video and AAC multi-channel audio.

Because the E-350 doesn’t include hardware-accelerated encode, we’re only able to test it in software-only mode and with decode acceleration enabled. Using an optimized copy of CyberLink MediaEspresso 6.5, and converting to the canned iPad profile (Smart Fit, H.264, AAC), we cut what would have been a nearly 11-minute transcode down to just over nine minutes, an almost 20% speed-up.

AMD’s 880G chipset, armed with Radeon HD 4250 graphics (40 stream processors), is similarly too anemic to handle encode acceleration. That task rests on the low-power Athlon II X2 240e running at 2.8 GHz. As a result, we’re able to compare software transcoding to AMD’s chipset with decode acceleration enabled. Because the desktop-class CPU is so much faster than Zacate, even the software-based test blazes by in less than four minutes. Decode acceleration helps get the job done 17% faster, though.

In software-only mode, Intel’s 10 W Celeron SU2300 takes more than eight minutes to complete its transcode task. Flipping the switch on Nvidia’s Ion chipset, enabling PureVideo support, helps cut that number to just over seven minutes. Adding hardware-accelerated encoding makes an even more dramatic impact, cutting the transcode to less than four minutes. All told, you get a 212% performance boost by turning on hardware-accelerated encode and decode.

Now, the most interesting result surfaces when you switch off decode acceleration, leaving accelerated encoding enabled. The job drops to 3:45, 14 seconds less than decode/encode enabled. Why is this?
As it turns out, if you ask Ion to handle both encode and decode, the decode side of that equation slows down, causing the fully-accelerated configuration to take longer than if you let the CPU handle decoding on its own. If you’re using a mobile system, you’ll still want to leave both hardware options turned on, though, because you’ll benefit from reduced host processor loading with full acceleration enabled, and consequently, longer battery life. Despite its Ion chipset, the Atom-based platform does not show CUDA acceleration as an option in MediaEspresso. As such, we were only able to select PureVideo decode acceleration. But as the results demonstrate, that doesn't help our transcode job at all. In fact, the data transfer from graphics to processor is enough to slow down the workload versus a software-only implementation. |
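The percentages quoted in this section depend on whether you measure time saved or throughput gained, so a small sketch of both calculations may help. Only the 3:45 result and the 14-second gap are exact figures from the text; the 11:00 and 9:00 inputs are round stand-ins for "nearly 11 minutes" and "just over nine minutes", and the helper names are made up for this example.

```python
def to_seconds(mmss):
    """Convert an 'M:SS' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def time_saved_pct(before_s, after_s):
    """Percentage reduction in wall-clock time."""
    return (before_s - after_s) / before_s * 100


def speedup_pct(before_s, after_s):
    """Percentage increase in throughput (work per unit time)."""
    return (before_s / after_s - 1) * 100


# Ion with the Celeron SU2300: exact figures from the text.
encode_only = to_seconds("3:45")
encode_and_decode = encode_only + 14          # "14 seconds less", so 3:59
decode_penalty = speedup_pct(encode_and_decode, encode_only)

# E-350: approximate stand-ins for the rounded times in the text.
e350_software = to_seconds("11:00")
e350_decode_accel = to_seconds("9:00")

print(f"Ion full acceleration: {encode_and_decode} s, encode-only: {encode_only} s")
print(f"encode-only is {decode_penalty:.1f}% faster than full acceleration")
print(f"E-350 time saved:      {time_saved_pct(e350_software, e350_decode_accel):.1f}%")
print(f"E-350 throughput gain: {speedup_pct(e350_software, e350_decode_accel):.1f}%")
```

The same pair of E-350 times reads as roughly 18% less time or 22% more throughput, which brackets the "almost 20%" in the text. As a side note, a ratio of about 2.12x against the 3:59 fully accelerated run would imply a software-only time near 8:27, which fits the "more than eight minutes" above; that starting time is inferred here, not reported.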
Is Performance The Only Variable In Play? |
An Intel employee once told me, “video transcoded using CUDA looks like shit.” Honestly, I shrugged him off at the time. Companies say stuff like that on an almost daily basis; you tend to see things with a blue/red/green tint after a while, depending on the organization you’re representing. And as an advocate for the Tom's Hardware audience, I've trained myself to take everything I hear with a dash of skepticism. But I had a number of readers ask for quality comparisons in the comments section of my Sandy Bridge review, so I dusted off a couple of high-def clips and started saving the outputs from CyberLink’s Fusion-optimized version of MediaEspresso to see if there was any credence to those claims.

The comparison here is really very, very basic. I have four test beds: the E-350-powered E350M1, the Athlon II X2 240e-driven 880GITX-A-E, the IONITX-P-E with Intel’s Celeron SU2300, and the Atom-powered IONITX-L-E. The two boards with Nvidia’s Ion chipset are going to deliver the same output as soon as you start involving hardware-accelerated encode and decode. The other two aren’t powerful enough to even enable hardware-based encode support. So, although we’re technically reviewing ASRock’s E350M1 motherboard here, our quality comparison becomes CUDA versus software versus two AMD platforms that apply hardware-accelerated decode support.
If you download the full-sized (720p) versions of all three of these software-based images and tab through them, you’ll see the sort of quality variation that’d require you to diff each shot, and that’s in a still frame. For all intents and purposes, they’re the same.

The same goes for all three boards with hardware-accelerated decoding applied. We can clearly see that what comes out of the decoder and is then operated on by the CPU during the encode stage is largely identical. Alternatively, you can tab between the software-only shot and the corresponding hardware decode version shown above to see that the decoded content is the same, whether the process happens on the CPU or GPU.

And then we let Ion’s CUDA cores handle the encode stage and things quickly get ugly. The examples I’m using here aren’t even the best shots from our 2:30-long trailer. But if you download the screen shot and compare it to any of the six above, the difference is oh-so-evident.

An even better way to tell would be to download the actual video clips themselves. I threw three examples up on MediaFire: CUDA-based encoding, the Ion platform with hardware-accelerated encode and decode turned off, and AMD’s E-350 with hardware-accelerated decoding enabled. Watch them back to back and see for yourself. You'll find that the issues are most noticeable in scenes with lots of motion. The best way to describe the issue would be latent blocking or pixelation that distorts the output quality.

Verdict

As far as I’m concerned, sacrificing quality for speed is never alright. There’s a completely different story for another day here, since we can now turn around and run performance/quality tests on Sandy Bridge, AMD’s discrete graphics, and Nvidia’s add-in cards, all of which offer accelerated encoding. And since there are multiple software apps optimized for all three paths, we can really dig deeper in the days to come.
In the context of media-oriented nettops, though, I’d rather not use Ion’s CUDA-based encode acceleration and get a better picture. That takes away much of the platform’s advantage over AMD’s E-350. Hopefully, the OpenCL-based encoders that emerge later this year, utilizing Llano, are written with our quality-oriented concerns in mind. |
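For readers who would rather put a number on the still-frame comparisons than tab between screenshots, one common option is peak signal-to-noise ratio (PSNR). To be clear, this is not how the comparisons in this article were made; they were done by eye. The sketch below is a plain-Python illustration that assumes two frames have already been decoded into equal-length lists of 8-bit sample values, and the function name and sample data are hypothetical.

```python
import math


def psnr(frame_a, frame_b, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists.

    Higher is better; identical frames return infinity.
    """
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same length")
    # Mean squared error across all samples.
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Tiny made-up sample rows standing in for real decoded frames.
reference = [10, 20, 30, 40]
near_identical = [11, 19, 30, 40]   # off by one in two samples
blocky = [40, 40, 40, 40]           # detail flattened, as blocking tends to do

print(f"near-identical: {psnr(reference, near_identical):.1f} dB")
print(f"blocky:         {psnr(reference, blocky):.1f} dB")
```

A metric like this only complements a visual check. Blocking in high-motion scenes, the problem described above, can be more objectionable to the eye than a single averaged score suggests.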