Changes in data flow 'pipeline' needed for SoCs, new data types
By Scott Bowden, Director of Applications, Xyron Semiconductor Corp., Vancouver, Wash., EE Times
May 9, 2003 (2:22 p.m. EST)
URL: http://www.eetimes.com/story/OEG20030509S0038
Embedded developers of consumer electronics applications often face problems that involve moving large amounts of data around a system-on-chip (SoC) design. These problems would be easier to solve if a dataflow model were used rather than the traditional sequential control methodology.
The design blocks used to create SoCs form a fairly simple list. At the heart of the chip are one or more processors. Next are the support systems that processors require, such as MMUs, caches, tightly coupled memory and memory buses. Each chip also needs to communicate with the external world through various I/O blocks such as serial ports (USB, I2C, FireWire, Ethernet), mixed-signal blocks (A/D, D/A, video, RF), general-purpose I/O and parallel buses (ATA, printer ports). In many cases internal logic blocks are required for hardware assist, such as MPEG, DES, baseband processing or video processing. Finally, one or more internal buses allow the blocks to ship data around the chip.
This list is relatively short, but assembling these blocks into an SoC that meets specifications is very challenging. For example, many embedded products powered by SoCs must react to external events. A DVD player or a digital set-top box must convert an MPEG video stream into an uncompressed output. The product must respond to incoming data and start performing conversions without dropping data. The information must then be output at a specific rate to drive the video display.
MPEG decompression is a multistep process. In many IC implementations a combination of software algorithms and hardware assist blocks is used to decode the incoming data stream. Usually a processor is tightly coupled to the logic, and this collection forms a subsystem on the SoC that operates as a dataflow mechanism. Starting the design from dataflow concepts yields a system that is quicker to assemble and verify.
Specifically, using an internal bus to transfer data as packets and operating asynchronously without global control can simplify adding or removing the design objects that communicate over the bus. Multiple clock domains are possible, and an interface that is latency tolerant simplifies layout and timing issues. Hardware assist blocks built from a library of pipeline processing elements simplify state machine design and allow for Lego-style assembly.
Improving the efficiency of processor context switching by placing task scheduling and switching into logic rather than software can result in significant throughput improvement at a given clock rate. With no overhead for context switching, difficult code can be decomposed into multiple modules as small as a few lines. The resulting code is faster to develop and easier to maintain. Tasks become activated by data, and the system follows a dataflow model, which can simplify simulation and development. Design modifications can be implemented more quickly, allowing a family of SoCs to evolve rapidly from a single IC.
The classic dataflow model is that data objects "flow" to the processing elements they need. For example, if the task is to add two numbers together and output a sum, the operation using dataflow techniques is accomplished by letting the two numbers flow to an adder functional block, which then outputs the sum.
The key concept to note is that the addition operation is timed by the availability of the data and of the adder functional block.
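The adder example can be sketched in a few lines of Python. This is only an illustration of the firing rule, not the author's implementation; the class name AdderNode and the input slot names "a" and "b" are assumptions made for the sketch.

```python
class AdderNode:
    """A dataflow adder: it fires only when both operands are present."""

    def __init__(self):
        self.operands = {}   # input slot name -> value waiting at the node
        self.outputs = []    # sums produced so far

    def receive(self, slot, value):
        """Deliver one data token to input slot 'a' or 'b' (hypothetical names)."""
        self.operands[slot] = value
        self._try_fire()

    def _try_fire(self):
        # The operation is timed by data availability, not by a program counter.
        if "a" in self.operands and "b" in self.operands:
            a = self.operands.pop("a")
            b = self.operands.pop("b")
            self.outputs.append(a + b)


adder = AdderNode()
adder.receive("a", 2)            # one operand present: nothing fires
pending = list(adder.outputs)
adder.receive("b", 3)            # second operand arrives: the adder fires
result = list(adder.outputs)
```

No scheduler decides when the addition runs; the arrival of the second operand is what triggers it.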
The difficulty in implementing a dataflow machine is coming up with a scheme to move and operate on data efficiently. Programming a microprocessor to manage these operations only adds an additional layer of complexity onto the software design, and has been the stumbling block to implementing dataflow systems.
Decomposing a task into subtasks and linking the pieces together into a pipeline can often make a problem easier to solve. The idea of pipeline processing is not new, and has found great utility in ICs. For example, all modern processors pass each instruction through multiple pipeline stages before finally retiring it.
Setting the stage
Several issues must be addressed when dataflow processing elements are implemented as a series of pipeline stages. At its simplest, a pipeline can be considered a sequence of register stages. Each stage may have logic that operates on the data, and the width of the datapath may vary from stage to stage. Each register stage can delete or add data, and can stop accepting upstream data. Each stage is bounded by an input and output bus.
Communication between stages can take many forms. Consider, for example, a two-wire interface in which the input bus of each stage has a control wire running upstream called Accept and the output bus has a control wire running downstream called Valid.
Each stage indicates readiness to receive data by asserting Accept. The upstream stage indicates the presence of data to be transferred by asserting Valid. Only when both signals are high will a transfer occur.
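The transfer rule can be written down directly. The short Python sketch below takes a hypothetical five-clock trace of the two wires (the trace values are invented for illustration) and reports the clocks on which a word actually moves.

```python
def transfer_cycles(valid, accept):
    """Return the clock indices on which a word moves between two stages."""
    return [clk for clk, (v, a) in enumerate(zip(valid, accept)) if v and a]


# Hypothetical five-clock trace of the two control wires between two stages.
valid = [1, 1, 0, 1, 1]    # driven downstream by the sending stage
accept = [0, 1, 1, 1, 0]   # driven upstream by the receiving stage
cycles = transfer_cycles(valid, accept)   # a transfer occurs on clocks 1 and 3
```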
With this implementation, bubbles of non-data can be removed from the pipeline. A stage that holds a bubble keeps the Valid signal at its output bus low, which prevents the bubble from propagating downstream, and asserts the Accept signal at its input bus. When a data transfer from upstream occurs, the bubble is overwritten.
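A small behavioral model shows bubbles being squeezed out while the sink is stalled. It is a sketch under simplifying assumptions: Accept is evaluated combinationally along the whole pipe within one clock, None stands for a bubble, and the function name step is hypothetical.

```python
def step(stages, source, sink_accept):
    """Advance the pipeline one clock.

    stages[0] is the most upstream register; None marks a bubble.
    Returns (new_stages, word_delivered_to_sink, source_was_taken).
    """
    n = len(stages)
    accept = [False] * n
    downstream_accept = sink_accept
    for i in reversed(range(n)):
        # A bubble always accepts; a full stage accepts only if it can unload.
        accept[i] = stages[i] is None or downstream_accept
        downstream_accept = accept[i]

    delivered = stages[-1] if sink_accept else None
    new_stages = []
    for i in range(n):
        upstream = stages[i - 1] if i > 0 else source
        # Accept high: load whatever is upstream (data overwrites the bubble).
        new_stages.append(upstream if accept[i] else stages[i])
    source_taken = accept[0] and source is not None
    return new_stages, delivered, source_taken


pipe = ["d0", None, "d1", None]           # two data words and two bubbles
pipe, _, _ = step(pipe, None, False)      # sink stalled, nothing new arriving
after_one = list(pipe)                    # [None, "d0", None, "d1"]
pipe, _, _ = step(pipe, None, False)
after_two = list(pipe)                    # [None, None, "d0", "d1"]
```

After two clocks the data words sit back to back at the downstream end, even though the sink never accepted anything.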
An immediate concern with this design is how to handle a downstream stall condition. One of the basic problems facing SoC designers is the routing of global signals. Propagation delays require that signals be buffered and retransmitted. A stall signal will take multiple clock cycles to propagate upstream in a long pipe. The simple approach is to add side registers to hold incoming data. In our case, the Accept signal is buffered at each stage and propagates upstream. If a stall occurs at the end of a pipeline of m stages, it will take m clocks to propagate the signal to the input of the pipeline. An additional m units of data could have entered in that time, requiring the pipe to have storage for 2m data pieces. Each stage is designed with a set of side registers to hold extra data.
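The 2m figure can be checked with a small simulation. The sketch below assumes one side register per stage, a pipe whose m main registers are already full, a source that always has data, and a sink that stalls permanently; each stage's Accept simply reflects whether its side register was empty at the start of the clock. The function name simulate_stall is hypothetical.

```python
def simulate_stall(m):
    """Stall the sink of a full m-stage pipe; return (clocks, accepted, held)."""
    main = [f"old{i}" for i in range(m)]   # every stage starts with one word
    side = [None] * m                      # side registers start empty
    clocks = accepted = 0
    while side[0] is None:                 # stage 0 still asserts Accept
        # Registered Accept: each stage advertises the state its side
        # register had at the start of this clock.
        accept = [s is None for s in side]
        new_main, new_side = list(main), list(side)
        for i in range(m):
            incoming = main[i - 1] if i > 0 else f"new{accepted}"
            takes = accept[i] and incoming is not None
            # The sink is stalled, so the last stage can never send.
            sends = i < m - 1 and accept[i + 1] and main[i] is not None
            if sends or main[i] is None:   # main register is free this clock
                if side[i] is not None:
                    new_main[i], new_side[i] = side[i], None
                else:
                    new_main[i] = incoming if takes else None
            elif takes:                    # main is stuck: park the word
                new_side[i] = incoming
        main, side = new_main, new_side
        accepted += 1                      # stage 0 took one word this clock
        clocks += 1
    held = sum(word is not None for word in main + side)
    return clocks, accepted, held


clocks, accepted, held = simulate_stall(4)
```

For m = 4 the stall reaches the input after 4 clocks, 4 extra words have entered by then, and the pipe holds 8 words in total, which matches the 2m storage requirement.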
At first glance the overhead cost seems to be prohibitive. In today's semiconductor processes though, the addition of these latches is not a problem. The number of transistors grows, but the regular nature of the layout yields higher silicon utilization, and does not add significantly to die size.
One of the greatest challenges in SoC design is correctly implementing and verifying the state machines that control the various design blocks. The very nature of dataflow processing implemented with the 2-wire interface requires that a top-level state machine be decomposed into many smaller local machines, yielding a number of benefits.
First, each state machine has a reduced number of inputs which simplifies design and verification. Design and verification team throughput will improve. Second, the local machines are bounded by the input and output busses and only solve a piece of the problem, but in a more general way because they must be designed to be independent of external data dynamics.
In addition, a reusable Lego-block approach to design fits well, because all the blocks conform to the same interface and each part of the problem has been distilled until it is simple and well bounded. There is never any need to understand, or to build in, the specific dynamics of the external dataflow. The design therefore becomes context free and can be applied to other blocks or to other designs.
Building a ring bus
In its simplest form, the pipeline described above can form the backbone of a simple and flexible data bus. By design, the bus doesn't require global control and is latency tolerant, so it is easily scaled. Timing problems caused by routing delays can be solved by merely adding register stages to buffer the signals.
Routing data on and off the bus to I/O or to dataflow processing pipes can be accomplished with a design element called a Basic Ring Node. This Ring Node can be further decomposed into a library of standard elements that are assembled Lego-style to meet the needs of each specific device.
Additional blocks that control the routing of data give the bus a great deal of flexibility. Finally, tying the ends of the bus together to form a ring topology gives a structure that allows point-to-point data transfers from one block to any other block on the ring.
With such a design element, six devices (A through F) can be connected to the ring via the Basic Ring Node interface. A special block called a Routing Fork can route data down one of its two output buses. A shortcut path has been added that bypasses nodes C and D for improved performance. In addition, "alien" or direct buses can be added between devices that demand low latency. This is shown as a direct path from device E to device B. Finally, a maintenance block is added to perform housekeeping operations for the bus.
Datapath width is set by the implementation requirements, and will typically be in the range of 32 to 128 bits. Pipeline elements can convert bus segments to different widths as required by the design. For example, a series of low-speed I/O ports could be chained together with a narrow bus to conserve area, and then joined to the main bus after expansion. An example could be the blocks labeled C and D above.
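Width conversion between a narrow segment and the main bus amounts to packing and unpacking words. The sketch below is illustrative only: the function names widen and narrow are hypothetical, and placing the first narrow word in the low-order bits is an assumption, since the article does not specify an ordering.

```python
def widen(words, in_bits, out_bits):
    """Pack consecutive narrow words into wide words (first word in low bits)."""
    if out_bits % in_bits:
        raise ValueError("output width must be a multiple of input width")
    ratio = out_bits // in_bits
    mask = (1 << in_bits) - 1
    wide_words = []
    for start in range(0, len(words), ratio):
        wide = 0
        for k, word in enumerate(words[start:start + ratio]):
            wide |= (word & mask) << (k * in_bits)
        wide_words.append(wide)
    return wide_words


def narrow(words, in_bits, out_bits):
    """Split each wide word into narrow words, low-order bits first."""
    if in_bits % out_bits:
        raise ValueError("input width must be a multiple of output width")
    ratio = in_bits // out_bits
    mask = (1 << out_bits) - 1
    return [(word >> (k * out_bits)) & mask
            for word in words for k in range(ratio)]


packed = widen([0x11, 0x22, 0x33, 0x44], 8, 32)   # four 8-bit words, one 32-bit word
unpacked = narrow(packed, 32, 8)                  # back to the original four words
```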
The maintenance block ensures that packets are not allowed to traverse the ring forever. This is done by defining an extension to the command field to include a "dirty" bit. This bit is set in the header by the maintenance block when the packet passes through. If the dirty bit comes in already set, then the packet is deleted.
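The dirty-bit rule is simple enough to sketch. In the Python below a packet is modeled as a list of words whose first word is the header; the bit position chosen for the dirty bit and the function name maintenance_block are assumptions for illustration, not details given in the article.

```python
DIRTY_BIT = 1 << 31   # hypothetical position of the dirty bit in the header word


def maintenance_block(packet):
    """Return the packet to forward, or None if the packet is deleted."""
    header, *body = packet
    if header & DIRTY_BIT:
        return None                        # second visit: remove from the ring
    return [header | DIRTY_BIT, *body]     # first visit: mark and forward


first_pass = maintenance_block([0x0000_0007, 0xCAFE])   # marked and forwarded
second_pass = maintenance_block(first_pass)             # deleted on the next lap
```

A packet that nobody claims therefore makes at most one further trip around the ring after it first passes the maintenance block.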
The last element required to efficiently implement a dataflow architecture is a method of routing data without global control. This is achieved by treating the data as packets. A packet can consist of one or more words and will contain commands and data. Packets can be bound together to ensure they always arrive in the correct order, are atomic in nature and cannot be split.
Commands and data are not defined as a function of word size and can be packed into each packet as needed. The first word of each packet is denoted as a header and is examined by each logic block to determine whether the packet is of interest to that particular logic block. The header contains necessary routing information to assure the data moves correctly around the bus.
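Header-based routing on the ring can be sketched as follows. The header layout is an assumption made for the example (the low four bits carry a destination node ID), as are the function names ring_node and send; the article does not define the actual field layout.

```python
DEST_MASK = 0xF   # hypothetical: low 4 bits of the header hold the destination ID


def ring_node(node_id, packet):
    """Examine the header (first word) and report whether this node wants the packet."""
    header = packet[0]
    return (header & DEST_MASK) == node_id


def send(ring_ids, src_index, packet):
    """Pass the packet node to node; return the IDs visited, or None if unclaimed."""
    hops = []
    n = len(ring_ids)
    for offset in range(1, n + 1):
        node_id = ring_ids[(src_index + offset) % n]
        hops.append(node_id)
        if ring_node(node_id, packet):
            return hops          # the destination takes the packet off the ring
    return None                  # no node claimed it after one full lap


ring = [0, 1, 2, 3, 4, 5]               # devices A through F as node IDs 0 to 5
packet = [0x1, 0xDEAD, 0xBEEF]          # header addressed to node 1 (device B)
path = send(ring, 4, packet)            # sent from device E (index 4)
```

Each node makes its decision locally from the header alone, so no global controller is involved in delivering the packet.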
Developing the software that manipulates data as collections of small tasks following the dataflow model gives the programmer the same benefits as were realized by the hardware designer. The code will be quicker to develop and easier to maintain. If the tasks can remain context free, then they become implementation independent, and can be reused in other designs. The productivity of the software team will improve.