A New Era for Edge AI: Codasip’s Custom Vector Processor Drives the SYCLOPS Mission
The demand for Artificial Intelligence applications is not just growing – it’s skyrocketing, reshaping industries from autonomous transportation and advanced medical diagnostics to intelligent manufacturing and personalized education. As these AI models become increasingly sophisticated, the hunger for more efficient, powerful, and accessible hardware solutions intensifies. This critical juncture, where the ubiquity of AI meets the limitations of traditional hardware paradigms, is precisely where the SYCLOPS project emerges as a vital initiative.
The SYCLOPS project is at the forefront of a movement to democratize AI acceleration. By championing open-source hardware and software solutions, the project, supported by the European Union’s Horizon Europe research and innovation program, is committed to advancing Europe’s capabilities in next-generation AI hardware. Central to this endeavor is the open-source RISC-V instruction set and its Vector extension, a transformative technology poised to unlock significant performance gains for a new wave of AI algorithms. This is particularly crucial for the burgeoning field of edge AI, where intelligence is deployed directly on resource-constrained devices – think real-time object detection in smart city surveillance, anomaly detection in industrial IoT sensors, or low-power keyword spotting in voice-activated assistants. The SYCLOPS project aims to make such sophisticated AI capabilities more attainable.

Understanding Core Concepts: SIMD and the RISC-V Vector Extension
Before diving into the SYCLOPS project’s specific contributions, it’s essential to grasp two foundational concepts: SIMD and the unique advantages of the RISC-V Vector extension.
SIMD (Single Instruction, Multiple Data) is a cornerstone of modern processor design: a technique enabling a single instruction to perform the same operation on multiple data points simultaneously. This parallel processing capability dramatically increases computational efficiency compared to traditional scalar processing, which handles only one data element per instruction.
Consider the apt analogy of a chef: with a standard knife, they chop one vegetable at a time. Equip them with a specialized multi-blade tool, however, and they can process several pieces with a single motion, significantly boosting productivity. Similarly, SIMD extensions in processors often come with enhancements like dedicated register files or improved memory pathways to feed these parallel operations. The number of data elements processed in one go typically depends on the specific SIMD extension being used.
The RISC-V Vector Extension: A Paradigm Shift in Flexibility
The RISC-V Vector extension builds upon the SIMD principle but introduces a revolutionary level of flexibility. Unlike traditional SIMD architectures that operate on fixed-width data paths (e.g., always 128 bits or 256 bits), the RISC-V Vector extension is length-agnostic. This means it can dynamically adapt to various data sizes and underlying hardware capabilities without requiring code modifications.
This “length-agnostic” nature is more than a technical detail; it’s a strategic game-changer. It means a single, compiled piece of software can run efficiently across a diverse spectrum of RISC-V hardware – from compact, low-power microcontrollers to high-performance application processors. Hardware designers gain the freedom to scale their vector processing units (VPUs) to meet different performance and power targets, while software developers benefit from unparalleled code portability and reduced development overhead. This decoupling fosters a virtuous cycle of innovation, accelerating the growth of a rich and varied ecosystem of compatible software and scalable hardware.
The benefits for AI acceleration are profound:
- Dramatically Improved Performance: By processing multiple data elements concurrently, the Vector extension can deliver substantial speedups for AI computations. For typical AI inferencing kernels, such as those in image recognition or natural language processing, leveraging the RISC-V Vector extension could yield speedups in the range of 4x to 8x compared to equivalent scalar C code, depending on data types and hardware.
- Reduced Power Consumption: Performing more work with fewer, more powerful instructions translates directly to lower energy use – a critical factor for battery-powered embedded systems and mobile devices.
- Enhanced Scalability: The length-agnostic design allows solutions to scale seamlessly from small, efficient cores to large, powerful ones, enabling developers to precisely tailor hardware to their application’s demands.
- Fostering Innovation through its Open-Source Nature: Built upon the open-source RISC-V ISA, the Vector extension encourages widespread collaboration, transparency, and innovation, breaking down proprietary barriers in the AI hardware landscape.
From Theory to Reality: The Scalar Product Example
The RISC-V Vector extension’s potential for accelerating AI algorithms is best understood through a practical example. Let’s consider a common operation: the scalar product (or dot product) of two vectors. If we have two large vectors, A and B, of the same length, their scalar product c is calculated as: c = Σ (A[i] * B[i]) for all elements i.
A straightforward C implementation would be:
float dot_vec(float* vect1, float* vect2, int vec_len) {
    float result = 0;
    for (int idx = 0; idx < vec_len; idx++) {
        result += vect1[idx] * vect2[idx];
    }
    return result;
}
When compiled for a standard RISC-V processor without vector support, the assembly code for the main loop might look like this (simplified):
dot_vec:
    ble     a2, zero, .L4    // If vec_len <= 0, jump to end
    mv      a5, a0           // vect1 element pointer
    sh2add  a2, a2, a0       // End pointer: vect1 + 4 * vec_len
    fmv.s.x fa0, zero        // Initialize result = 0
.L3:                         // Start of the loop
    flw     fa5, 0(a5)       // Load element from vect1
    flw     fa4, 0(a1)       // Load element from vect2
    fmul.s  fa5, fa5, fa4    // Multiply elements
    fadd.s  fa0, fa0, fa5    // Add to result
    addi    a5, a5, 4        // Increment vect1 pointer
    addi    a1, a1, 4        // Increment vect2 pointer
    bne     a5, a2, .L3      // Loop if not at end
    ret
.L4:
    fmv.s.x fa0, zero        // Result is 0 if original length was <= 0
    ret
In this scalar version, the main loop (label .L3) executes 7 instructions for each individual vector element.
Now, let’s see the RISC-V Vector extension in action with a vectorized assembly version:
dot_vec_asmVS:
    fmv.s.x      fa0, zero                // Clear the output register (scalar float for final sum)
    vsetvli      t0, a2, e32, m1, ta, ma  // Configure the vector unit before vfmv.s.f
                                          // e32: 32-bit elements, m1: vector register grouping
                                          // ta (agnostic tail), ma (agnostic mask)
    vfmv.s.f     v24, fa0                 // Initialize vector register v24 with 0.0f for summation
.V1S:                                     // Start of the vector loop
    vsetvli      t0, a2, e32, m1, ta, ma  // Reconfigure VL for this iteration; t0 gets actual VL
    vle32.v      v0, (a0)                 // Load VL elements from vect1 into vector register v0
    vle32.v      v8, (a1)                 // Load VL elements from vect2 into vector register v8
    vfmul.vv     v0, v0, v8               // Vector multiply v0 * v8, result in v0
    vfredosum.vs v24, v0, v24             // Ordered reduction: v24[0] += sum of v0 elements
    slli         t1, t0, 2                // Calculate bytes processed (VL * 4 bytes/element)
    add          a0, a0, t1               // Advance vect1 pointer
    add          a1, a1, t1               // Advance vect2 pointer
    sub          a2, a2, t0               // Decrement remaining element count
    bne          a2, zero, .V1S           // Loop if elements remain
    vfmv.f.s     fa0, v24                 // Move final sum from vector register v24 to scalar fa0
    ret
This vector implementation executes approximately 10 instructions per loop iteration. At first glance, 10 instructions might seem more than the scalar loop’s 7. However, the crucial difference is that each iteration of the vector loop processes multiple vector elements simultaneously – specifically, VL elements, where VL is the vector length determined by the vsetvli instruction based on hardware capability and remaining data.
So, if VL is, for example, 8 (meaning the hardware can process 8 floating-point numbers in parallel), the effective number of instructions per element becomes approximately 10 / 8 = 1.25. This is a substantial reduction from the 7 instructions per element in the scalar case, illustrating the significant performance gain.
The vsetvli instruction is key: it dynamically determines how many elements (t0) will be processed in the current iteration based on the requested vector length (a2) and the hardware’s capacity. This dynamic approach yields two powerful advantages:
- No Manual Vector Tail Handling: The hardware automatically manages iterations where the number of remaining elements is less than the maximum VL. Programmers don’t need to write separate code for these “tail” elements, simplifying development.
- Dynamic Element Processing & Code Portability: The same compiled code runs efficiently on different RISC-V implementations with varying vector processing capabilities. Hardware engineers can scale performance by widening vector units, and software engineers don’t need to re-optimize for each specific hardware configuration.
Navigating Complexity: The Crucial Role of Simulators in Performance Analysis
Analyzing the performance of modern programs, especially those leveraging advanced features like vector extensions, can be cumbersome. Loop lengths and execution paths often depend on the input data, making static analysis challenging. To truly understand and optimize performance, developers rely on a hierarchy of simulation tools, each offering a different level of insight:
- Functional Correctness: At the initial stage, tools like QEMU are invaluable. They allow for rapid emulation to ensure the code works correctly from a functional standpoint. However, QEMU often optimizes instruction sequences for speed and doesn’t typically produce detailed instruction traces, making it unsuitable for precise performance estimation.
- Algorithmic Performance Insights: Once functionality is confirmed, Instruction-Accurate (IA) simulators, such as SPIKE for RISC-V, become essential. These simulators provide exact counts of executed instructions for given workloads, helping identify algorithmic inefficiencies or “hotspots” that consume the most execution resources. For instance, using an IA simulator, we can confirm the number of instructions executed by our scalar and vector dot product examples.
- Microarchitectural Tuning: For the deepest level of optimization, microarchitecture-aware simulators are required. These are often based on detailed SystemVerilog models of the processor and provide cycle-accurate (CA) simulation. They reveal why certain instruction sequences perform better by modeling pipeline effects, resource contention, memory access patterns, and the cycle-level behavior of complex instructions. For example, the RISC-V Vector extension offers different reduction instructions. vfredosum.vs performs an ordered sum, crucial for minimizing numerical errors in some algorithms by guaranteeing the summation order. An alternative, vfredusum.vs, performs an unordered sum, which might be significantly faster on some microarchitectures if the strict order is not a concern. An IA simulator like SPIKE won’t reveal the performance difference between these two; only a CA simulator can provide that level of insight.
This hierarchical approach – from functional validation to algorithmic analysis and finally to microarchitectural tuning – allows developers to systematically refine their code, ensuring both correctness and optimal performance on the target hardware.

Empowering Innovation: Tools, Customization, and Codasip’s Role
Developing software for any processor, especially a new or customized one, requires a comprehensive suite of tools: compilers, assemblers, debuggers, profilers, and various simulators. While standard RISC-V implementations benefit from a growing ecosystem of such tools, one of RISC-V’s most powerful features is its customizability – the ability to extend the Instruction Set Architecture with new, domain-specific instructions.
However, this customization traditionally presents a significant hurdle: modifying the entire software toolchain to support these new instructions demands specialized engineering expertise and considerable effort. This is where Codasip Studio emerges as a transformative solution. It streamlines this complex process by automatically generating a complete set of development tools – including a C/C++ compiler, simulation models, debugger, and profiler – directly from a high-level processor description written in the CodAL language.
This automation is a catalyst for RISC-V customization. By drastically reducing the time and expertise needed to create and support custom ISAs, Codasip Studio democratizes the process of hardware design itself. It empowers a broader range of innovators – from Small and Medium-sized Enterprises (SMEs) and agile startups to academic research groups – to develop bespoke RISC-V processors tailored for demanding applications like AI, without the prohibitive overhead traditionally associated with such endeavors.
Within the SYCLOPS project, Codasip is leveraging this capability to develop a proof-of-concept implementation of the customizable RISC-V Vector processor using CodAL. This work not only contributes directly to the SYCLOPS objectives but also shows how advanced tools can accelerate the adoption and evolution of critical open-source hardware technologies.
A recent and crucial milestone in the SYCLOPS project is the release of the Edge Microdatacenter (EMDC) v2.0 with RVV accelerator. This deliverable documents the successful deployment of a comprehensive hardware testbed, EMDC v2.0, designed to evaluate the project’s use cases. The platform is a collaborative effort featuring three key components: first, a proprietary RISC-V platform from Codasip with vector extension, which has already demonstrated significant speedups, meeting the project’s Key Performance Indicators for vector operations; second, the SYCLARA open-source RISC-V platform developed by EURECOM, which validates the use of the SYCL compiler toolchain on RISC-V Vector hardware; and finally, a CXL-enabled EMDC testbed developed by HIRO MicroDataCenters to ensure a future-proof, composable infrastructure backbone. This collective effort validates the entire hardware-software stack, moving SYCLOPS closer to delivering open AI acceleration.
Final Thoughts: Paving the Way for a New Era of Accessible AI
The SYCLOPS project, with its strategic focus on the RISC-V Vector extension, is not merely an academic exercise; it is actively paving the way for a new era of AI acceleration. By championing open-source, scalable, and customizable hardware and software solutions, SYCLOPS is at the heart of the movement to truly democratize access to cutting-edge AI technology.
The combination of the inherently flexible RISC-V architecture, the power of its Vector extension, and the streamlined design methodologies enabled by tools like Codasip Studio, collectively works to lower the traditionally high barriers to entry in AI hardware development. This empowers a more diverse ecosystem of innovators – SMEs, startups, and academic institutions – to not only consume advanced AI acceleration but to actively contribute to its evolution. The result will be a richer landscape of AI applications and a broader talent pool driving future breakthroughs.
Codasip is proud to be a key contributor to the SYCLOPS project, lending its expertise in processor design automation to advance AI technology and support this important European initiative. The journey towards ubiquitous, efficient, and open AI hardware is complex, but with collaborative efforts like SYCLOPS, the future of AI looks brighter and more accessible than ever.
This project has received funding from the European Union’s Horizon Europe (HE) research and innovation programme under grant agreement No 101092877.