Distribution: An approach for Virtual Platform scalability
Stephane Farrouch (STMicroelectronics), Hmayak Arzumanyan (ProximusDA Cjsc)
Abstract:
Systems-on-Chip (SoCs) are becoming more and more complex. Set-top-box SoCs, for example, embed hundreds of IPs and tens of cores, run 10 giga-instructions for OS boot and 200 giga-instructions for driver initialization, and play several 4Kp60 (UltraHD, 60 Hz) video streams in parallel with high-bandwidth networking and graphics.
Reducing time-to-market therefore requires starting tests and software development long before boards and silicon are available.
This early start relies heavily on Virtual Platforms and co-emulation/co-prototyping solutions. In particular, Virtual Platforms are increasingly used by hundreds of users with different focuses: SoC verification tests, system validation tests, OS kernel, drivers, and applications.
In this paper we focus on TLM-LT (Loosely Timed, Programmers’ View) platforms, which are usually the first ones needed and requested, and on how to address one of their main challenges: increasing their execution speed.
I. INTRODUCTION
The main challenges for Virtual Platform (VP) integrators are:
- IP model procurement: getting the necessary models in time from internal and external providers. This is addressed by project/program management that takes the importance of VPs into account.
- Production efficiency: producing, in time, VPs that fit user needs. This can be addressed by using an IP-Xact-based flow, plus a close, continuous loop with user representatives.
- Execution speed: running complex use cases fast enough for the software debug loop and for system non-regression tests. In [2], among other lessons learned, the authors showed that VP execution speed needs to be improved.
This paper addresses execution speed by showing the factors that affect it, reviewing the classical approaches, and presenting a solution that we have experimented with.
II. THE VP SPEED BOTTLENECK
A. 1st factor: increasing VP complexity & size
A huge number of IP models coexist in a VP; a typical set-top-box VP is made of:
- >300 TLM modules
- ~900 source files
- >130k lines of platform code (effective code)
B. 2nd factor: the way the SystemC kernel models HW/SW parallelism
SystemC provides an event-driven scheduler that models parallel hardware resources using multiple sc_threads running inside a single Linux process on a single CPU core. This means:
- only one sc_thread is scheduled to run at a time; the others are meanwhile pending, waiting for their turn
- parallelism is modeled by sequential execution of code (also known as “concurrency”, or cooperative multitasking on a single CPU)
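To make the single-core behaviour concrete, here is a minimal sketch of cooperative multitasking written with Python generators. It is not the SystemC kernel's actual algorithm (which is event-driven); the names `ip_model` and `run_cooperative` are hypothetical, and the round-robin policy is an assumption chosen for simplicity. It only illustrates that "parallel" models are in fact interleaved sequentially.

```python
from collections import deque

def ip_model(name, steps, log):
    """Toy 'thread': does one unit of work, then yields control (like a wait)."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # hand control back to the scheduler

def run_cooperative(threads):
    """Round-robin scheduler (assumption): exactly one thread runs at any moment."""
    ready = deque(threads)
    while ready:
        t = ready.popleft()
        try:
            next(t)          # run until the thread's next yield
            ready.append(t)  # still alive: goes to the back of the queue
        except StopIteration:
            pass             # thread finished, drop it

log = []
run_cooperative([ip_model("cpu", 2, log), ip_model("dma", 3, log)])
print(log)  # ['cpu:0', 'dma:0', 'cpu:1', 'dma:1', 'dma:2']
```

However many models are added, the total work is still executed one step at a time on one host core, which is why platform size translates directly into simulation time.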
C. 3rd factor: difficulty of getting optimized models in time
Given the complexity of SoCs, IP models are developed by teams with different skills, whose priority is to deliver a working IP. This leads to heterogeneity in:
- the level of optimization of the source code
- the type of TLM model (none/algorithmic/structural)
- the priority given to IP model development (not always perceived to be as important as it is at program level)
D. 4th factor: huge amount of embedded software, not always optimized
The main use of a Virtual Platform is to start embedded software development before boards are available:
- with a large amount of firmware, OSes, drivers and applications running on the different simulated cores;
- often not yet optimized in terms of performance.
Moreover, since one big advantage of a Virtual Platform is the debug possibilities it offers, this software is kept:
- compiled without optimization;
- with debug options for quite a while, often until boards are available;
- and sometimes with a huge amount of traces.
As a result, software execution takes a significant part of the VP execution time, a part on which no real optimization can be done during the first steps of development.
E. Conclusion:
Running the entire non-regression test suite on a modern SystemC/TLM-based VP typically takes many tens of hours.
Fig. #1: Some figures achieved on set-top-box VP
III. OPTIMIZATION TRACKS: CLASSICAL APPROACHES, THEIR LIMITATIONS AND EXPECTED GAIN
A. Overview of the approaches
Besides the classical general program optimization approaches described in [3], some VP-specific approaches are also commonly used.
Some are TLM-LT modeling guidelines (cf. [5] section 4 and [6]) at IP/subsystem level:
- Code optimization: rework the IP model source code to eliminate hotspots;
- Algorithmic model vs. structural model: limit context switches by using pure algorithmic models (wrapping a C model) that do not reflect the IP microarchitecture;
- Traces reduced to the minimum: activate traces on demand at run time;
- Data granularity: identify, understand and limit frequent transactions that carry only a small amount of data;
- Backdoor accesses/Direct Memory Interface: access another model’s memory directly, bypassing regular transactions;
- On-the-fly switching to a more accurate mode: at run time, swap from a fast low-accuracy model to a slower, more accurate one (e.g. from an algorithmic model to a complete subsystem model running firmware).
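The data-granularity and backdoor-access guidelines can be illustrated with a toy memory model that counts the transactions it serves. This is a sketch under stated assumptions: `MemoryModel`, `transport` and `get_direct_pointer` are hypothetical names, not the TLM 2.0 API, and the 1024-byte frame and 4-byte word size are arbitrary illustration values.

```python
class MemoryModel:
    """Hypothetical target model; counts how many transactions it serves."""
    def __init__(self, size):
        self.data = bytearray(size)
        self.transactions = 0

    def transport(self, addr, payload):
        """Regular path: one call (and its overhead) per transaction."""
        self.transactions += 1
        self.data[addr:addr + len(payload)] = payload

    def get_direct_pointer(self):
        """Backdoor path: hand out a writable view of the storage."""
        return memoryview(self.data)

frame = bytes(range(256)) * 4          # 1024 bytes to copy (arbitrary)

# (a) Many small transactions: one per 4-byte word.
mem_a = MemoryModel(1024)
for off in range(0, len(frame), 4):
    mem_a.transport(off, frame[off:off + 4])

# (b) One coarse-grained transaction carrying the whole frame.
mem_b = MemoryModel(1024)
mem_b.transport(0, frame)

# (c) Backdoor access: no transaction at all.
mem_c = MemoryModel(1024)
view = mem_c.get_direct_pointer()
view[0:len(frame)] = frame

print(mem_a.transactions, mem_b.transactions, mem_c.transactions)  # 256 1 0
```

All three variants leave the same data in memory; only the number of calls through the transaction path differs, and that count is what these guidelines aim to reduce.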
Others are approaches at system level:
- Compilation optimization: use the speed optimizations of the most recent compilers;
- Use-case-adapted VSoC: reduce the number of models to strictly what is necessary to exercise a test case;
- Optimized embedded SW: encourage software developers to optimize their code;
- Check-pointing: save and restore the execution context.
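As an illustration of the check-pointing idea, the sketch below saves the state of a toy platform after a slow phase and restores it twice to start two independent tests. `ToyPlatform` and its state layout are hypothetical, and JSON serialization is an assumption made for brevity; check-pointing a real SystemC simulation also has to capture thread stacks and kernel state, which this sketch does not attempt.

```python
import json

class ToyPlatform:
    """Hypothetical stand-in for a VP whose whole state is plain data."""
    def __init__(self, state=None):
        self.state = state if state is not None else {"instructions": 0, "pc": 0}

    def run(self, n):
        """Pretend to execute n instructions of 4 bytes each."""
        self.state["instructions"] += n
        self.state["pc"] += 4 * n

    def save(self):
        """Serialize the execution context to a string."""
        return json.dumps(self.state)

    @classmethod
    def restore(cls, blob):
        """Rebuild a platform from a previously saved context."""
        return cls(json.loads(blob))

vp = ToyPlatform()
vp.run(1000)                      # e.g. the slow boot phase, executed once
boot_done = vp.save()

test_a = ToyPlatform.restore(boot_done)
test_a.run(10)                    # first test starts right after boot
test_b = ToyPlatform.restore(boot_done)
test_b.run(20)                    # second test reuses the same checkpoint

print(test_a.state["instructions"], test_b.state["instructions"])  # 1010 1020
```

The gain comes from paying for the common prefix (boot, driver initialization) once instead of once per test.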
Fig. #2: Limitations and usual gain of classical approaches
B. Conclusion:
To obtain a significant gain, the classical approaches should be applied as much as possible; but experience shows that you always have to convince your providers and compete with their higher priorities. Moreover, the performance gain that is needed (several orders of magnitude) is much higher than what these approaches can deliver.
IV. THE CHOSEN ALTERNATIVE APPROACH: HAVING PARALLELIZED PLATFORMS
A. Main principle:
The approach is based on a simple idea: one way to approach real time when modeling a parallelized SoC ... is to parallelize the VP execution!
Wein [1] explained how TLM helps accelerate the execution of a design description on parallel computer systems, thanks to a scheduling scheme that is tightly coupled to its explicit communication scheme. Applied to VPs, this principle helps speed up execution.
Instead of having a single SystemC kernel simulate the entire platform on a single CPU, the platform is partitioned and several SystemC kernels run in parallel, each simulating one part of the platform on a different core of the host station. This allows data-processing parallelism as well as efficient modeling of pipelining. And since each partition is still based on the SystemC kernel, exactly the same models are used.
The overall automated flow (import and re-assembly) is based on the IP-Xact description, allowing fast iteration during partitioning trials.
The main steps of the parallelization loop are as follows:
- import the VP into the tool, based on its IP-Xact description (assembly, interfaces);
- graphically partition the VP across the host station CPU cores (each core runs a SystemC kernel), cf. Fig. #3;
- automatically re-assemble the platform, with automatic insertion of bridges between the partitions for intercommunication and synchronization.
Fig. #3: Graphical partitioning of the VP on 3 cores - screenshot
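The structure of a partitioned platform, with a bridge carrying traffic between two partitions, can be sketched as follows. This is only a structural illustration and not the tool's bridge implementation: the partition functions and the bounded queue used as a bridge are hypothetical, and Python threads stand in for the separate SystemC kernels (CPython threads do not execute bytecode in parallel, so this sketch shows the pipelining structure, not the speedup).

```python
import queue
import threading

def producer_partition(bridge_out, frames):
    """Partition 1: e.g. a decoder model pushing frames through the bridge."""
    for f in frames:
        bridge_out.put(f * 2)      # stand-in for real processing
    bridge_out.put(None)           # end-of-stream marker

def consumer_partition(bridge_in, results):
    """Partition 2: e.g. a display model pulling frames from the bridge."""
    while True:
        item = bridge_in.get()     # blocks until the other partition sends
        if item is None:
            break
        results.append(item + 1)   # stand-in for real processing

# Bounded queue: carries data and also synchronizes the two partitions.
bridge = queue.Queue(maxsize=4)
results = []
partitions = [
    threading.Thread(target=producer_partition, args=(bridge, range(5))),
    threading.Thread(target=consumer_partition, args=(bridge, results)),
]
for p in partitions:
    p.start()
for p in partitions:
    p.join()

print(results)  # [1, 3, 5, 7, 9]
```

Because the models only see the bridge as another communication endpoint, the same model code can run unchanged whether the platform is partitioned or not, which matches the point made above about reusing exactly the same models.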
B. Parallelization alternatives
In [4], the authors describe two techniques for accelerating VPs by taking advantage of symmetric multiprocessor (SMP) workstations:
- The first is a general modeling strategy for shared-memory MPSoCs, TLM-DT (transaction level modeling with distributed time).
- The second is a truly parallel simulation engine, SystemC-SMP, itself based on TLM-DT.
We did not pursue this approach (even though it is quite similar to the one we took), since our choice is to have untimed, purely event-driven VPs; a solution that aims to ensure time consistency across partitions therefore made no sense for our needs.
C. Rationale of this choice:
We chose this approach because:
- We had already applied classical approaches to several models, but combining all of them is considered difficult;
- We already have an IP-Xact-based flow, which eases import;
- The approach is scalable, so it will help address the upcoming increase in VP complexity.
D. Obtained performances: example on a set-top-box VP
The data below show several experiments in which the VP was split into 2, 3, 5, and 6 partitions. The acceleration factor reflects the achieved speedup, which is sometimes over-linear, for example in experiment #4 with 2 partitions. The over-linear acceleration is due to a higher number of CPU cache hits, thanks to good partitioning and to more CPUs being used for the simulation.
Fig. #4: Acceleration factor vs partitions
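For reference, the acceleration factor and the notion of over-linear speedup can be written down as a small calculation. The definition used here (single-kernel run time divided by partitioned run time) is the usual one and is assumed rather than quoted from the experiments; the timings are hypothetical values chosen for illustration, not measurements from this paper.

```python
def acceleration_factor(t_single, t_partitioned):
    """Speedup of the partitioned run relative to the single-kernel run."""
    return t_single / t_partitioned

def is_over_linear(t_single, t_partitioned, partitions):
    """True when the speedup exceeds the number of partitions used."""
    return acceleration_factor(t_single, t_partitioned) > partitions

# Hypothetical timings in seconds (NOT measurements from the paper).
t_single = 600.0
t_two_partitions = 250.0

print(acceleration_factor(t_single, t_two_partitions))   # 2.4
print(is_over_linear(t_single, t_two_partitions, 2))     # True
print(is_over_linear(t_single, t_two_partitions, 3))     # False
```

A factor above the partition count cannot come from parallelism alone, which is why the text attributes it to a secondary effect such as better cache behaviour per partition.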
E. Perspectives:
With such an approach, several perspectives open up, among others:
- Cover more complex system test cases requiring more partitioning
- Benefit from the modularity trend in SoC design (which eases partitioning)
- Use existing computer farms (instead of dedicated stations)
- Also address mixed platforms (coemulation, coprototyping, …)
- Approach real time with bigger machines (for big system-level non-regression suites, or use cases such as ear-tested audio codec quality)
- Provide external customers with Virtual Platforms before boards and silicon, so that they can start their own application development early
- Use parallelization for architecture exploration/performance analysis (cf. [1])
- Strengthen validation of race-condition avoidance thanks to actual concurrency
V. CONCLUSION
Classical optimizations are far from enough to increase the execution speed of Virtual Platforms, and they require huge efforts on a subject that is usually outside the priorities of IP model providers. The approach we experimented with offers a high performance gain (a speedup of several times), flexibility and scalability, with lower effort.
ACKNOWLEDGMENT
We want to thank Philippe Metsu and Laurent Ducousso for their long-standing conviction of the benefit of this approach, and for their efforts to have it tested on a real case. We thank the Proximus team for their support and proactivity while we were using their tool.
REFERENCES
[1] Enno Wein, “HW/SW Co-Design of Parallel Systems”, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2010
[2] Sungpack Hong et al., “Creation and Utilization of a Virtual Platform for Embedded Software Optimization: An Industrial Case Study”, International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2006
[3] Wikipedia, “Program optimization”, 2014
[4] Aline Vieira De Mello, Isaac Maia, François Pécheux, Alain Greiner, “Parallel simulation of SystemC TLM 2.0 compliant MPSoCs”, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010
[5] Marcelo Montoreano, “Transaction Level Modeling using OSCI TLM 2.0”, 2007
[6] STARC - Semiconductor Technology Academic Research Center, “Transaction Level Modeling Guide”, 2008
Glossary:
IP – Intellectual Property – HW or SW implementation of a set of features contributing to the overall system
TLM – Transaction Level Modeling – now part of IEEE 1666™ "SYSTEMC LANGUAGE"
VP – Virtual Platform – assumed to be written in SystemC/TLM hereafter
SoC – System-on-Chip – complete piece of silicon containing all the hardware resources (IPs, cores) necessary for the targeted use cases