PCI Express Gen2/Gen3 update: An interview with Jasmin Ajanovic and Kevin W. Bross, Intel

I/O virtualization and higher bandwidth are among the features the PCI-SIG is developing for PCI Express.

Two individuals closely tied to PCI Express Gen2/Gen3 and embedded/communications development, Jasmin Ajanovic and Kevin W. Bross, spoke with me recently. Jasmin drives architecture and protocol specification within the PCI-SIG as chair of the PCIe Protocol Workgroup and PCIe Bridge Workgroup. He is a Senior Principal Engineer for the Digital Enterprise Group at Intel Corporation. Kevin is a Modular Systems Architect with Intel’s Embedded and Communications Group. Since 2002 Kevin has been working with AdvancedTCA spec development and with other PICMG standards efforts.

Joe - What’s the capsule history of PCI Express?

Jasmin - PCI Express (PCIe) is a member of the PCI architectural family. It’s a technology that has evolved across two generations, and a third generation of PCIe is currently in development.

In 2003, the PCIe Gen1 serial interconnect was introduced at 2.5 Gigatransfers per second (GTps). Gen2 doubles the speed of Gen1. Everything is backwards compatible, but Gen2 brings important enhancements to the table. Intel is shipping Gen2-compatible products now.

Current work within PCI-SIG is focused on third-generation PCIe development. In parallel, the PCI-SIG is developing architectural extensions to support emerging I/O and device sharing applications. The PCI-SIG is developing these specs to extend the capabilities of PCIe.

Graphics accelerators and a number of other interesting applications are driving Gen3. We have a number of extensions on the table geared toward supporting those applications.

The work in progress on PCIe (Gen3) includes doubling the bandwidth (although not necessarily doubling the operational frequency due to a new encoding method).

Joe - How do you double the bandwidth without doubling the clock rate?

Jasmin - We use a different encoding scheme - not 8b/10b - that will maintain the embedded clock and will provide encoding and scrambling at close to 0 percent overhead, compared to 8b/10b’s 20 percent. Now we can be more efficient from a bandwidth point of view. Intuitively one would think that if the current generation is 5 GTps, then we need to go to 10 GTps. However, we found that 10 GTps really pushed the limits of existing systems - connectors, cards, and backplanes - and could break compatibility with them, and it would significantly reduce the length of the channel, preventing us from meeting the needs of existing applications. So the decision was made to go to 8 GTps with a different encoding scheme. Table 1 (courtesy Intel) depicts theoretical raw maximum bandwidths for x16-wide PCIe Gen1, Gen2, and target Gen3. Note that the bandwidths actually achieved on real products depend on a number of factors (implementation of data buffering and flow control, packet data payload size, and application workload, to name a few), so the numbers in Table 1 should be used only as reference points.
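The arithmetic behind those figures can be sketched from the numbers stated here plus the encoding overheads: Gen1/Gen2 lose 20 percent of the raw rate to 8b/10b, while the leaner Gen3 scheme (ultimately standardized as 128b/130b) loses only about 1.5 percent. A minimal calculation of theoretical per-direction bandwidth for a x16 link:

```python
# Theoretical raw bandwidth of a x16 PCIe link, per direction.
# Gen1/Gen2 use 8b/10b encoding (20% overhead); Gen3's encoding
# was ultimately standardized as 128b/130b (~1.5% overhead).

def raw_bandwidth_GBps(gt_per_s, lanes, enc_num, enc_den):
    """Effective payload bits per second, expressed in GB/s."""
    bits_per_s = gt_per_s * 1e9 * lanes * (enc_num / enc_den)
    return bits_per_s / 8 / 1e9

GENS = {
    "Gen1": (2.5, 8, 10),
    "Gen2": (5.0, 8, 10),
    "Gen3": (8.0, 128, 130),
}

for name, (rate, num, den) in GENS.items():
    gbps = raw_bandwidth_GBps(rate, 16, num, den)
    print(f"{name} x16: {gbps:.2f} GB/s per direction")
```

This is why 8 GTps with the new encoding still roughly doubles Gen2's effective bandwidth: the encoding gain makes up most of the gap to 10 GTps.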

Joe - Walk us through some of the Gen2 extensions you believe will be of particular interest in the embedded space.

Jasmin - One is that Gen2’s 5 GTps speed increase is paired with capabilities to manage and configure that speed: link speed management and bandwidth notification. Embedded developers now also have Function Level Reset, which provides an architected mechanism for better control over resetting the functionality of a PCIe-attached device. Function Level Reset should resolve deficiencies of the inherited legacy reset mechanism.

Access Control Services allow management of peer-to-peer communications in systems with multiple agents connected either to root ports or to a switch.

Completion timeout control is interesting for the embedded space because it is much better architected in Gen2. It specifies a set of timeout choices. Depending on your topology, the completer of a transaction may be connected over a long path with multiple links in a complex hierarchy. In that case, the completion may take a significant amount of time to traverse the topology, making a variable timeout capability helpful.

Joe - What is the larger context in which PCI-SIG began work on Gen3?

Jasmin - When the PCI-SIG started looking at the new types of applications emerging on PCIe, including accelerators and the deployment of PCIe in telecom and embedded systems, we identified a number of desired enhancements and extensions across the stack - from signaling speed, to power management (important for any type of system these days), to improved protocol efficiencies, to an enhanced mechanism for synchronization and control/status exchange.

One example is latency reduction, a mechanism for decreasing the average access time of PCIe peripherals to host memory by optimizing data allocation and retention within the system cache hierarchy. In addition to improving latency, this could significantly reduce power consumption, at least based on the workloads that we looked at for server and client applications.

Another improvement targeting distributed processing environments such as those of the embedded telecom space is atomic read-modify-write transactions, which will come in very handy for synchronizing operations. This is a new capability that will be introduced in Gen3, but it is not tied only to Gen3 speeds. It can also be supported when the latest generation of devices operates at Gen1 or Gen2 speeds.
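The atomic operations that eventually appeared in the PCIe 3.0 spec are FetchAdd, unconditional Swap, and Compare-and-Swap (CAS), each returning the location’s original value in the completion. A toy Python model of those semantics follows; the `Completer` class, addresses, and values are illustrative only, not from the spec:

```python
# Hypothetical software model of the three PCIe atomic read-modify-write
# request types (FetchAdd, Swap, CAS). Real AtomicOps travel as memory
# transactions and execute at the completer; this just shows the
# semantics a requester would observe in the completion.

class Completer:
    """Models a completer's memory as a dict of address -> value."""

    def __init__(self):
        self.mem = {}

    def fetch_add(self, addr, operand):
        old = self.mem.get(addr, 0)
        self.mem[addr] = old + operand
        return old                      # original value returned in completion

    def swap(self, addr, operand):
        old = self.mem.get(addr, 0)
        self.mem[addr] = operand
        return old

    def cas(self, addr, compare, swap_val):
        old = self.mem.get(addr, 0)
        if old == compare:              # write happens only on a match
            self.mem[addr] = swap_val
        return old

c = Completer()
c.fetch_add(0x1000, 5)      # shared-counter update without a read/write race
print(c.swap(0x1000, 0))    # prints 5: drain the counter, learning its prior value
```

The point of the single round trip is that two agents incrementing the same counter can never interleave a stale read with a write, which is what makes these handy for synchronization in distributed designs.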

Kevin - When you scale back to a slower link speed, it goes back to 8b/10b encoding, so you maintain backwards compatibility, right?

Jasmin - Yes. From day one we knew that PCIe would go through successive speed increases, and we conceived a link speed management mechanism. This mechanism really came into its own in the PCIe 2.0 spec, where we are going from Gen1 to Gen2. We are also providing the plumbing to go beyond that in terms of managing the speed. One example is management of the link width for interfaces that are more than x1 wide.

So the assumption has been that there is a requirement for backwards compatibility. PCIe components that are capable of using Gen2 speed must be able to work at Gen1 speed as well when initially configured. In addition, components configured to operate at Gen2 speed must be able to downshift to Gen1 in the case of increased transmission errors that could be attributed to marginal designs or environmental impact. Downshifting is not necessarily just for reliability reasons - it might also take place between two devices that are capable of speaking at Gen2 speed, but decide to speak at Gen1 speed due to power reduction requirements.

With the existing PCIe software configuration mechanism as implemented in mainstream operating systems, there are limits to how much of a peripheral’s local memory can be mapped into system memory. That limitation is being addressed with the BAR renegotiation mechanism. This mechanism, like most of the other new capabilities, is agnostic to the operational speed of the interface - it can be used at Gen1, Gen2, or Gen3 speeds.

Another enhancement in this area that relates to embedded computing needs is the capability for the switch (as well as other components) to report internal errors as an extension to the existing Advanced Error Reporting mechanism.

In addition to the device-level power state management that we have today, we are trying to add a capability for high-powered devices - those that require substantial allocation of the power distribution and thermal budget - to be managed dynamically by the platform.

PCI Express cards can consume from 25 W up to 225 W, and there is a proposal for 300 W. In a system with several high-power adapters, typically not all adapters consume the maximum power all the time. However, system vendors have no choice but to provision the power/thermal solution for the worst case. A new Gen3 extension for dynamic power resource management will allow an optimized, cost-reduced power/thermal solution in which platform software manages the allocation of power based on the needs of the adapters. The system manages the power allocation to individual cards so that they do not draw more than their dynamically allocated budget, even though the maximum combined draw of all the cards could be much higher.
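As a rough illustration of that provisioning problem, consider adapters whose maximum draws sum well past the provisioned budget, with platform software granting each card a dynamic allocation instead. The function name and allocation policy below are purely hypothetical, not from any spec:

```python
# Hypothetical sketch of platform-managed dynamic power budgeting: the
# platform holds a fixed power/thermal budget smaller than the sum of
# the adapters' maximum draws and grants each card an allocation it
# must not exceed. Policy here (smallest requests first) is illustrative.

def allocate_power(requests, platform_budget_w):
    """Grant each adapter min(its request, a fair share of what's left)."""
    grants = {}
    remaining = platform_budget_w
    # Serve the smallest requests first so light loads are met in full;
    # large consumers split whatever budget remains.
    for card, want in sorted(requests.items(), key=lambda kv: kv[1]):
        share = remaining / (len(requests) - len(grants))
        grants[card] = min(want, share)
        remaining -= grants[card]
    return grants

# Two 225 W-capable adapters plus a 60 W NIC, but only 400 W provisioned.
print(allocate_power({"gpu0": 225, "gpu1": 225, "nic0": 60}, 400))
```

The worst-case provisioning alternative would require 510 W here; dynamic budgeting lets the platform ship with a 400 W solution and still satisfy typical loads.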

Kevin - Jasmin, you have identified a number of features that are in addition to the faster speed. Can you talk briefly about how those features are available even at lower speeds?

Jasmin - If you have a device that is compliant to the PCIe 2.0 spec, that device can operate at the 5 GTps and 2.5 GTps speeds. The assumption is that if the device is compliant to the 2.0 spec, it was designed at the time when the additional capabilities were available, so even if that device is operating at 2.5 GTps (Gen1 speeds), it will support these capabilities.

Extrapolating this to the next generation, Gen 3, all of these capabilities will be available at 2.5, 5, and 8 GTps.

Joe - Gen2 is released. Are there going to be any more enhancements to Gen2?

Jasmin - The PCI-SIG is currently defining additional capabilities for I/O virtualization and device sharing for which specifications will be complete before PCIe 3.0 is delivered. These capabilities are geared primarily toward the server blade market but can find significant use in other markets such as embedded.

Joe - One of the limitations of PCIe has been the inability to have multiple root complexes talking to a peripheral. Does PCIe Gen2 overcome this limitation?

Jasmin - Yes. There is something called the Multi-Root Sharing specification that is part of PCI-SIG I/O virtualization development.

It is possible to connect multiple root complexes to a multi-root-aware switch that connects to the peripherals beneath it; peripherals that support the sharing capability can then be assigned in a flexible manner to particular root complexes.

Joe - That is important, because it enables redundant, highly available systems, which have been difficult to build with PCIe. In a pure Gen2 system, I presume that because the speed is higher, the trace lengths must be shorter and the skew and crosstalk requirements tighter? Is this true? Can you use the same trace lengths, motherboard sizes, and I/O slots in Gen2 as in Gen1, or do things get tightened up?

Jasmin - It requires a tighter reference clock, and some of the budgets that allowed sloppiness in Gen1 have disappeared. As far as channel length, my understanding is that Gen2 does not require compromise on the channel length of the motherboard, but it does require more disciplined design of the motherboard. Trade-offs can be made. I can use cheaper materials or fewer layers, but that will shorten my interconnect channel.

Kevin - Some backplanes - most of them, in fact - use higher-grade substrates and are not just a typical motherboard.

Jasmin - The PCI-SIG PCIe technology was developed primarily for mainstream applications, where you need to support a “50-cent” motherboard connector on a traditional PC, but for those systems that use better quality materials for connectors and backplanes, it affords the opportunity for longer channel lengths at the same speeds. However, the PCI-SIG did not extend the development effort to these other specialized form factors, and that would be the value-add the telecom developer could provide.

Kevin - In the AMC.1 specification, currently undergoing revision, we had members who did some signal integrity analysis, looking at both Gen1 and Gen2, to understand the impact of both signaling rates.

Joe - There are not a lot of AdvancedTCA systems out there using PICMG 3.4 (PCIe) native backplane communication. Do you think Gen2 would work on an existing AdvancedTCA backplane if you wanted it to? Would the connector handle it, and would the backplane handle it?

Kevin - In general, I would say a qualified “yes.” It depends on the particular backplane geometry. Designers have been doing things like putting the hub slots in the middle so the routing distance is as short as possible.

During the testing we have done, at PICMG Interoperability Workshops and internally, we have seen a number of backplanes that are good well beyond 5 Gbps. The Gen2 speeds are certainly achievable.

Joe - Most of what AdvancedTCA provides is Ethernet across the backplane and PCIe as a local bus.

Kevin - The one difference I would note is that some clients out there are using PCIe across the Update Channel as a board-to-board kind of I/O expansion. This is similar to PICMG 3.4.

Joe - How does the demise of Advanced Switching affect potential applications for PCIe?

Kevin - The mainstream and server applications for PCIe have done some things that were similar to what ASI was originally targeting. For example, the multi-root capability and I/O virtualization begin to provide some of the same basic capabilities with the mainstream PCIe feature set without requiring a totally different protocol.

Joe - Some of the Gen2 enhancements do some of the things many were hoping Advanced Switching would do?

Jasmin - I would agree that the I/O virtualization and I/O sharing specs will replace to a large degree what could potentially have been done with Advanced Switching, and in a backwards-compatible manner because Advanced Switching branched off from the mainstream PCIe development.

In addition, Gen3 extensions include new protocol mechanisms, such as Multicast, that are specifically geared toward addressing the needs of embedded/telecom applications. The Multicast capability provides support in the PCIe fabric (Root Complex and Switches) for delivering a single transaction/payload to multiple recipients in a manner optimized for performance and power.
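A toy model of the idea: posted writes landing in a designated multicast address window are fanned out by the switch to every port in the group, so one TLP from the sender reaches all members. The class, port names, and address window below are illustrative only:

```python
# Hypothetical model of PCIe Multicast routing in a switch: writes whose
# address falls inside a configured multicast window are forwarded to
# every egress port subscribed to the group, instead of to one target.

class McSwitch:
    def __init__(self, mc_base, mc_size, group_ports):
        self.mc_base, self.mc_size = mc_base, mc_size
        self.group_ports = group_ports      # ports subscribed to the group
        self.delivered = []                 # (port, addr, data) delivery log

    def posted_write(self, addr, data, unicast_port):
        if self.mc_base <= addr < self.mc_base + self.mc_size:
            targets = self.group_ports      # one TLP fans out to all members
        else:
            targets = [unicast_port]        # ordinary address-routed write
        for port in targets:
            self.delivered.append((port, addr, data))

sw = McSwitch(0xC000_0000, 0x1000, ["port1", "port2", "port3"])
sw.posted_write(0xC000_0010, b"frame", unicast_port="port1")
print(len(sw.delivered))   # prints 3: one write reached all group members
```

The power/performance win is that the sender emits the payload once; the fabric replicates it, rather than the host issuing N separate writes.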

Joe - That is significant as this recovers some of the functionality that was planned for AS.

Kevin - There are also a lot of things in terms of latency control that have long been important in the communications space which are now becoming important in more of the mainstream PCI space as well.

Jasmin - With regard to economy of scale, we should note that In-Stat predicted 440 million PCIe-based devices by 2010. When you consider this in light of PCI-SIG’s 800 members, many of them very active, it’s a large ecosystem.

Not all of the products will be applicable, but it gives you the ability to cherry-pick the products or cherry-pick partners to develop more specialized products for embedded markets.

Joe - I noticed there is a revision 2 mechanical spec for Gen2. What is the significant change in the mechanics?

Jasmin - There is no change in the mechanical aspect; the connector is the same. There are stricter guidelines for routing the traces on the add-in card to the component itself. The Gen1 and Gen2 connectors are the same mechanically - that is, they use the same pinout - and with regard to electrical properties it is the same connector as well.

Kevin - Quite a few elements were part of the Gen1 definition but were included in the Card Electromechanical (CEM) specification. A number of those items were moved into the Gen2 base spec for Gen2 signaling rates and the like, so they are no longer found in the CEM spec.

Jasmin - I would like to point out the product-level flexibility that PCIe’s increased signaling rates provide. Let’s say that to meet your goals with a Gen1 product you needed a x2 or x4 PCIe interface. When you go to Gen2 or Gen3, you can reduce the footprint of the interface while preserving the same bandwidth. The assumption here is that you do not necessarily need more performance; instead, you reduce the cost of the device and the power it consumes. Link power is directly proportional to the number of lanes driven from the device, so the whole infrastructure becomes more optimized, costing less and consuming less power.
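The tradeoff can be checked with per-lane effective bandwidth, assuming 8b/10b encoding for Gen1/Gen2 and 128b/130b for Gen3 (the scheme PCIe 3.0 ultimately adopted): a x4 Gen1 design delivering 1 GB/s per direction fits in x2 at Gen2, and a single Gen3 lane comes within about 2 percent of it.

```python
# Per-lane effective bandwidth across PCIe generations, to show how a
# wide, slow link can shrink to a narrow, fast one at equal bandwidth.
# (rate in GTps, encoding efficiency as payload-bits / wire-bits)

ENC = {"Gen1": (2.5, 8 / 10), "Gen2": (5.0, 8 / 10), "Gen3": (8.0, 128 / 130)}

def per_lane_GBps(gen):
    rate, eff = ENC[gen]
    return rate * eff / 8   # 8 payload bits per byte

# x4 Gen1 = 4 * 0.25 = 1.0 GB/s per direction; x2 Gen2 matches it
# exactly, and x1 Gen3 lands just under it.
for gen in ENC:
    print(f"{gen}: {per_lane_GBps(gen):.3f} GB/s per lane")
```

Fewer lanes means fewer pins driven, which is where the cost and power savings Jasmin describes come from.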

One can opt for better cost, lower power, or increased performance, and that will depend on the application.

Kevin - As noted earlier, if you are doing high-speed interconnects, a number of techniques can be used to improve signal integrity. Some designers are looking at rotating the traces on the substrate by 12 degrees so that signals are not running directly parallel to the weave of the fiberglass but are cutting across it, for better signal integrity. There are many other techniques like that if you are looking at higher-speed requirements.

Joe - What is the main thing to take away from this discussion we’ve had about PCIe?

Jasmin - PCIe is becoming one of the most broadly deployed industry interconnect standards. With a very strong membership, the PCI-SIG is working on delivering architectural extensions and enhancements that will extend the life of this technology and make it more applicable to other areas such as blade servers and embedded/telecom.

Jasmin Ajanovic is a Senior Principal Engineer for the Digital Enterprise Group at Intel Corporation. Jasmin has 26 years of experience in architecture development and system design in the communications and computer industry. After joining Intel in 1991, he spent a number of years working on PC architecture, enabling technologies, and product development. He was chief architect of a number of successful products, including several PCI chipset families. For the last eight years, Jasmin has been responsible for the development of Intel's I/O architecture and interconnect technologies. This includes the development of PCI Express (PCIe), an industry standard; he drives its architecture and protocol specification within the PCI-SIG as chair of the PCIe Protocol Workgroup and PCIe Bridge Workgroup. Jasmin holds approximately 50 industry patents.

Kevin W. Bross is a Modular Systems Architect with Intel’s Embedded and Communications Group. Since 2002 he has been working with AdvancedTCA spec development, including its various revisions, and with other PICMG standards efforts. Kevin has also been involved with groups such as Telcordia, ATIS, the SCOPE Alliance, and groups within Intel in defining revisions to NEBS specs, the definition of Intel AdvancedTCA products, and the design of Intel’s new data centers.
