RCT FPiGA Audio Hat and Zynthian Interest

Hello everyone!

My team has recently been working on a very neat audio DSP hat for Raspberry Pi platforms that mates a Raspberry Pi with an FPGA and an audio codec for accelerated audio processing and performance.

As a brief overview -

The RCT FPiGA Audio Hat seats an FPGA between the Raspberry Pi's I2S lines and an Analog Devices SSM2603 audio codec to allow for various audio configurations.

To maximize capabilities and performance, the Pi is set up as an I2S slave at a 384 kHz frame rate in and out of the FPGA. The SSM2603 typically runs at 48 ksamples/s and derives all of its clocks for that rate from an on-board crystal. The FPGA also receives this crystal's signal and generates the 384 kHz I2S stream for the Pi.

This way, the link between the FPGA and the Pi can carry 16 mono channels at 48 ksamples/s, letting both processors be leveraged before the FPGA sends the mixed 48 kHz stereo signal to the codec for output. On top of this, a stereo input and a mono mic input are made accessible, so these can all be patched into the 16-channel audio loop as a designer desires. Furthermore, there is hardware MIDI input and out/thru via 3.5 mm jacks and USB.
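To make the rate bookkeeping concrete, here's a minimal sketch (treating each channel as a 32-bit slot, which matches the 4-bytes-per-channel figure I use for bandwidth later; it's illustrative, not the exact register layout):

```c
/* Rate bookkeeping for the soft-TDM link. The 32-bit slot width is an
 * assumption taken from the "4 bytes per channel" figure quoted below. */
#define AUDIO_RATE_HZ      48000    /* per-channel sample rate               */
#define NUM_MONO_CHANNELS  16       /* mono channels carried over the link   */
#define LINK_FRAME_RATE_HZ 384000   /* I2S frame (LRCLK) rate seen by the Pi */

/* 16 mono channels at 48 kHz fit exactly into 384 k stereo frames per second
 * (each frame carries a left and a right slot).                             */
_Static_assert(NUM_MONO_CHANNELS * AUDIO_RATE_HZ == 2 * LINK_FRAME_RATE_HZ,
               "link frame rate must carry all 16 channels");
```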

The FPGA is pretty valuable as it allows for single-sample-latency DSP in most applications. It's also controllable via I2C from the Pi and can be reprogrammed via the Pi's GPIO, so different FPGA designs can be loaded depending on a designer's needs.

As an audio enthusiast and hardware engineering professional, I love the Zynthian project.
I'm wondering whether there would be any interest in integrating this with the Zynthian ecosystem as an added platform option. Not that a Pi is anything to sneeze at processing-wise, but I can see this augmenting capabilities to a level that would place it well above $1k+ synthesizers and audio devices.

On the note of integration - I currently have an ALSA-compatible kernel module, a device tree overlay, and a userspace application for managing data pipelining and communication between the Pi and the FPGA. I typically reserve a CPU core just for this purpose using isolcpus. Communication between this and other applications is handled by a very rudimentary local server, similar in spirit to how JACK works, though simplified. If folks here found it interesting, I'd think the easiest way to integrate would be as an alternative to JACK/PulseAudio/etc.
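For anyone curious, the core reservation is nothing exotic - conceptually it's just pinning the pipelining thread onto the isolated core and giving it a real-time priority. A rough sketch (the core index and priority are placeholders, not our shipped configuration):

```c
/* Sketch only: pin the Pi<->FPGA pipelining thread to a core reserved with
 * isolcpus= on the kernel command line. Core index and priority are
 * placeholders for illustration.                                           */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static void pin_to_isolated_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);                       /* e.g. core 3 with isolcpus=3 */

    int err = pthread_setaffinity_np(thread, sizeof(set), &set);
    if (err)
        fprintf(stderr, "affinity: %s\n", strerror(err));

    struct sched_param sp = { .sched_priority = 80 };   /* placeholder prio   */
    err = pthread_setschedparam(thread, SCHED_FIFO, &sp);
    if (err)
        fprintf(stderr, "scheduler: %s\n", strerror(err));
}
```

The pipelining thread would call something like pin_to_isolated_core(pthread_self(), 3) once at startup, with the core index matching whatever was passed to isolcpus.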

I just got the latest spin of these boards in and am working on bring-up, as well as getting some good videos made to showcase the system. We're looking to sell them for ~$200 a hat. That's not particularly cheap, but compared with other audio hats that don't offer as much flexibility - and given the ability to treat hardware in a similar way to software VSTs thanks to the reconfigurable FPGA - I think it's actually a fairly inexpensive addition. Making something like a Zynthian-targeted FPiGA kit is something I'd be open to working on with the official team here, if there were ever interest. It would be an interesting pairing to see how much latency improvement could be gained by having these parallel compute platforms join hands. Latency aside, I think there are things that could be done with a system like this that just couldn't be done with a traditional Pi.

More details will be available here. Sorry in advance because it’s not quite mobile friendly yet -
https://radical-computer-technologies.github.io/Radical-Computer-Technologies/f-pi-ga-audio-hat-v1.html

Just polling for thoughts and feedback! I’ll post videos up as soon as we get some of the hard bringup work done.

Hope all is well with everyone!


5 Likes

Hi @RadCompTech, what an amazing project you are building here! I cannot speak for the Zynthian devs, but my imagination runs wild at the prospect of what could be done with the audio computing capabilities and elevated clock speed of an onboard FPGA. Firstly, I can foresee huge improvements in latency at high sample rates, and secondly the opportunity to build a synthesis platform with dense polyphony (maybe in the 128-256 voice range), beyond what can normally be achieved with a RasPi 5 processor. I for one look forward to seeing your project possibly integrated into our hw/sw system.

Kudos, and all best luck :slight_smile:

Hi @RadCompTech
Very nice project! But… The most prominent thing I do not understand is: why reserve an entire CPU core (stealing its computational power from the Zynthian) just for communicating with the FPGA? Wouldn't that be configured once in a while, for booting and parameter changes, and then left working on its own until the next parameter change has to happen? Just for MIDI to an instrument or processor in the FPGA this would be over the top as well, as there is not much to do when sending a few thousand bytes a second over I²C.
With such an expensive setup, wouldn't there be a better choice for the codec? The SSM2603 is from 2008 and there are many better (and some even more economical) codecs on the market.
The FPGA could expose more I²S interfaces (for AES/EBU, word clock, …) and combine them into a mux for the Pi side. The I²S block of the RP1 (the I/O chip of the RPi 5) can handle 8 output and 8 input channels in a mux with separate selector pins, in hardware.
Why force the Pi to send out 16 channels of data, only to be mixed down into stereo?
The Zynthian internal mixer can handle this with far less processing overhead than providing 16 individual streams to the I²S
And what does the 384k sample rate have to do with it, if the output is 48k anyway?
It would force the audio sources in the host to provide eight times more data and do eight times more calculations (or simply do upsampling, still stealing CPU time) just to have the output crushed down to 48k later. To me it makes very little sense, except for number bragging in advertising. To me this looks a lot like Audiophile Farkle. Sorry.

(btw, I've been in electronics for 36 years now)

Hi @fussl

I’ll address this more in depth when I have more time if need be, but I’ll cover some main points here.

There's no need to reserve an entire core if you don't want to - it's a development board, so do what you like. Calling the link to the FPGA a 384k sample rate is a little misleading: it's time-division multiplexed, so it's really 16 channels at 48k, with stereo out at 48k. The point is that the FPGA can apply individual effects to each channel before the final mix; otherwise, you won't necessarily see the gains from the FPGA.

Alternatively, you could use this scheme for a bunch of different cases. Consider 16 audio channels on both input and output. Output channels 1 & 2 go direct to the stereo output; channels 3-16 can be used as individual processing stages or as modular stages for a digital-modular approach. Input channels 1-16 could be assigned generated sound, live input, an output stage from part of the FPGA, or something else. It's all up to you as a developer or user to decide. I don't see how this could not be understood as advantageous in some fashion.

Sure, the Pi 5 can handle multiple I2S outputs, but one of the points of this board is to provide more processing capability than an RPi 5 could manage alone, without needing as expensive a unit - a Pi 3A+ should be more than sufficient. Furthermore, I tried my best with this design not to use too much of the Pi's I/O. That was intentional, to allow a user to stack another hat on top that does other things: displays, encoders, ADCs for analog controls, etc.

On top of this, the separate core allows for the lowest possible latency in the compute portion of the I/O. It's not kilobytes of data per second I'm sending and receiving: since it's I2S input and output, it's about 3 MB/s (16 channels * 4 bytes per channel * 48000), times 2 for input and output, so ~6 MB/s. I think you maybe missed that part of the original post, though to be fair it perhaps wasn't clear.
Low-latency timing for that is imperative. I get that it's more overhead, but if you have an FPGA sitting alongside doing faster and more work than the Pi can on a cycle-by-cycle basis, then the extra compute the Pi would traditionally need should be more than compensated for.

Furthermore, combining the Pi's resources with the FPGA helps overcome some caveats of FPGA DSP design (e.g. low internal memory in the FPGA, so use the Pi's DDR to compensate). That allows wavetable generation, delay-based effects, and reverbs to be handled by the Pi, while high-order filtering, modulation, FFTs, compression, and other particularly DSP-heavy work is handled on the FPGA. You could even do parts of an operation on the FPGA and then use the Pi for the parts it excels at - whatever is clever for a design.

My thinking was: how do I move as much relevant data as possible between the Pi and the FPGA without taking up too much I/O? Otherwise I would have jumped to using the SMI interface for audio + control entirely (though I'd have to write some sort of alternative design schema for the Pi 5, as you're suggesting).
Make sense? We're also developing an SDR board that targets the SMI interface on the Pi Zero 2, which is why I bring it up.

If an application isn't served by the 16-channel 48k interfacing, the driver and FPGA core can also support stereo 48k I/O, or higher sample rates with simple stereo - you just won't leverage the FPGA acceleration as much, or in the same way. I haven't been doing electronics for 36 years, but I have been doing ASIC/FPGA DSP acceleration design for video and RF, as well as hardware design, OS design, and firmware/software, for around 10 years now. So while something I'm suggesting may not seem advantageous at first glance, I can guarantee it's not just "Audiophile Farkle". :wink:

None of this is to say I'm not considering leveraging the multi-I2S capabilities of a Pi CM5/5 (or the Radxa Rock Pis) on a more intense build. I've been toying with the idea of a mini-ITX audio dev station board using a compute module, but if there really isn't a need for multiple I2S outputs, I don't see a reason to go there. As I suggested before, a clever SMI implementation on a Pi 3/4/Zero 2 W plus a more powerful FPGA would either match or beat the throughput gains of multiple I2S outputs in that case (assuming a 48k sample rate per channel, which is pretty typical). I've reliably achieved ~55 MB/s on a Pi Zero 2 W using SMI + DMA, for instance. I don't think it's reasonable to expect a Pi 5 on its own to perform as well as or better than a Pi Zero 2 W/3A+/CM3/CM4 using SMI with an FPGA for audio processing - probably not even against my multiplexed I2S implementation in most cases, to be honest. The idea is to share resources and give the FPGA quicker access to DDR. Just my honest thoughts on the matter. I like using the latest and greatest where it makes sense, but in this application it just might not be all that beneficial. We built this as a simple first design to start building up a framework; more advanced, complicated, and capable designs will likely follow.

The codec choice comes down to the fact that we have hundreds of units of back stock of that codec, plus a lot of design resources for it from older projects. Once it's used up and/or we create another version, we'd look at another part - the SSM2603 isn't a bad codec, just somewhat older, as you said. Since there will be more revisions: if you could pick a better modern codec, what would it be? We have been considering the AD1938/9, which, though similarly old, offers more I/O and better quality, as an example. We have also looked at the ADAU1777 - the added DSP there is a bonus, though somewhat unnecessary considering what's feeding it. The MAX9890 also looks promising for a fairly direct swap with similar features.

Hope this answers some of your questions; if something is still unclear, I'm happy to discuss more.

1 Like

This is interesting, but I am still not 100% clear on the implementation. Are you using an RPi-side driver (kernel module / userspace driver?) to multiplex 48000 fps audio into a single 384000 fps I2S stream? (This would of course produce an invalid I2S stream.) If so, then presenting it as a 16-channel ALSA device would allow native integration with most Linux software, including JACK, on which zynthian depends. (Moving from JACK is a massive undertaking that we are unlikely to consider.) If this assumption / interpretation is valid, then it could possibly be upscaled to 16 x 4 = 64 channels using all of the RPi5 (RP1) I2S streams. We have already used all 4 streams to provide full-duplex, 8-channel input and output.

The DSP loaded into the FPGA would need to be well understood to allow zynthian to integrate. I am not keen to lock into a particular external DSP processor, and zynthian's strength comes from its modularity, e.g. the ability to create chains of processors. I can see similarity with the work I did recently with the Tascam US-16x08, which has onboard DSP providing EQ & compressor per channel. Zynthian supports controlling this DSP but it does not become part of a chain; it sits in the input hardware device control, i.e. before any chain.

I would suggest that the ability to present all 16 outputs (and inputs???) as individual analogue ports would be beneficial, rather than the need to mix down to stereo. Zynthian benefits from multiple inputs and outputs and presents them to users, allowing routing to/from each port. Users will want to benefit from multiple inputs (e.g. from many microphones, instruments, etc.) and outputs (e.g. to monitor mixes, external effects, etc.).

How do you envisage programming the DSP/FPGA? Would you create a toolkit for developers to create plugins / DSP code? Would you provide a library or pool of effects?

I only noticed you mention outputs. Are you considering inputs too? DSP on inputs, like compression, EQ, reverb, etc., can be a real boon. (The aforementioned Tascam provides me with a really good audio mixer with DSP offloaded from the RPi, which can then do its magic with synths, mixing, extra effects, etc.)

1 Like

@riban

You've got it pretty much right: yes, it's an invalid I2S stream, but that's what allows this to functionally work. It's a "soft" TDM rather than actual TDM, unfortunately, as (from my understanding) Pis don't support hardware TDM.

The current idea is indeed to present 8 stereo (16 mono) input channels and 8 stereo (16 mono) output channels. There is indeed input DSP capability.

I think there is still a bit of confusion here regarding what those channels are, though - they are not analog ports. There is a single stereo DAC and a single stereo ADC in the codec, so the multiplexed channels I'm referring to are meant to feed and receive data from the FPGA for processing. As I mentioned before, it's akin to a virtual modular patch bay, where these channels can be routed to separate portions of the DSP design, or used as parallel channels each receiving its own processing before being mixed to stereo. I'll offer a diagram and documentation soon to clarify that.

The current plan for the FPGA design is to offer a basic, fleshed-out DSP core that users can program and interact with, but since it's an FPGA, there isn't anything stopping a user from creating their own chip design - it's a dev board, after all. If they do, then keeping at least the interconnect structure / I2S signaling from the core design, between the Pi and FPGA as well as the Pi to the codec, would be the way to remain compliant.

FPGAs don't have a traditional "library" structure, since it's not software. The concept is to provide various IP in the form of the aforementioned example design, which users can modify to their needs using the vendor tools. I am providing a visual logic analyzer for debug, though, so there are development tools included in the example design.

Further, there shouldn't be a need to "lock into" a particular DSP here. If anything, exposing the control via an LV2 plugin might be helpful eventually.

Thanks for the further clarification. I did understand that you plan to use the link as a TDM for multichannel transport and mixdown to stereo but I suggested also considering multiple inputs and outputs.

2 Likes

Noted. If/when we design a larger unit with more outputs, we'll consider this. Hardware improvements would probably move from this unit → an 8-16 output & 8-16 input unit → 64 in and out.

64 analog inputs and 64 analog outputs actually sounds quite mouthwatering. While the price of something like that would likely jump a bit, with a Pi CM5 and a console box that would be quite a capable and powerful unit for modular synthesis, mixing/mastering, and production. I only question whether a Pi 5 with Linux overhead could handle that much data at low latency - it may make reserving a core or two for real-time processing even more necessary. 8-16 inputs and outputs should be easily doable and would likely still be more than enough for most applications.

2 Likes

Just returning to let you know that I've been able to configure ALSA (using aloop) to send 8 stereo channels at 48k to my userspace app, which then packs the 16 data words into a 384k frame to pass off to the FPGA. It seems this will also be compatible with JACK, and the latency is still very low. It requires the app to set an I2C register in the FPGA and send synchronization packets, but once the FPGA has synced (within the order of a few packets), it stays verifiably synced indefinitely from what I can see. I ran an overnight test and everything was still running smoothly when I came back in the morning. Of course, this requires free-running I2S. I haven't implemented the input yet, but it should be the same process in reverse. It's all looking quite promising; it shouldn't be all too hard to get this working with Zynthian from here. We'll just need to ensure nothing else sends data on the I2S lines, which I'm pretty sure won't happen unless somebody configures the ALSA mixer in a self-defeating way.
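For the curious, the packing step really is just interleaving. A rough sketch of one 48 kHz period (names and the 32-bit word assumption are illustrative, not the actual driver code):

```c
#include <stdint.h>

/* Illustrative "soft TDM" pack: 8 stereo 48 kHz channel pairs (16 mono words)
 * interleaved into the 384 kHz stereo stream that the FPGA expects. Word
 * width and slot ordering here are assumptions for clarity.                 */
enum { STEREO_PAIRS = 8, SLOTS_PER_PERIOD = 2 * STEREO_PAIRS };  /* 16 slots */

/* in[pair][0] / in[pair][1] hold the left / right sample of one pair for a
 * single 48 kHz period; out[] is the corresponding run of 8 L/R frames in
 * the 384 kHz stream.                                                       */
static void pack_period(const int32_t in[STEREO_PAIRS][2],
                        int32_t out[SLOTS_PER_PERIOD])
{
    for (int pair = 0; pair < STEREO_PAIRS; pair++) {
        out[2 * pair + 0] = in[pair][0];    /* left slot of frame 'pair'     */
        out[2 * pair + 1] = in[pair][1];    /* right slot of frame 'pair'    */
    }
}
```

The unpack on the input side would be the same loop in reverse, as mentioned above.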

As far as a first release of the FPGA design goes:

Outputs - mono channels 1 and 2 will be used for direct output to the headphone/line output, and channels 3-16 will be send lines for user-configurable FPGA-driven processing.

Inputs - similarly, mono channels 1 and 2 will be used for direct input from the line/microphone inputs, and channels 3-16 will be return channels for user-configurable FPGA-driven processing.
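Expressed as a channel map, that's roughly the following (the identifier names are hypothetical; only the numbering is what I described above):

```c
/* Sketch of the planned channel allocation; identifier names are made up.   */
enum fpiga_out_channel {
    FPIGA_OUT_DIRECT_L    = 1,   /* straight to headphone/line output        */
    FPIGA_OUT_DIRECT_R    = 2,
    FPIGA_OUT_SEND_FIRST  = 3,   /* user-configurable FPGA processing sends  */
    FPIGA_OUT_SEND_LAST   = 16,
};

enum fpiga_in_channel {
    FPIGA_IN_DIRECT_L     = 1,   /* line in / microphone input               */
    FPIGA_IN_DIRECT_R     = 2,
    FPIGA_IN_RETURN_FIRST = 3,   /* returns from FPGA-driven processing      */
    FPIGA_IN_RETURN_LAST  = 16,
};
```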

The DSP is pretty much 1-sample latency from send to receive, so this should close timing gaps pretty efficiently from a compute perspective. If sounds are being generated internally in the FPGA, I could see interesting use cases for using some of those channels in a way similar to CVs in a modular synth setup, too.

Cheers!

hmm… yes, with just plain arithmetic (add, subtract, multiply, divide…). And how is it with filters?
:film_projector: :yum::popcorn:

The Ring :scream:

Is that supposed to be antagonistic? I don't know why you would find what I'm suggesting funny, or why you would make an assumption about what I mean by single-sample latency. Very presumptuous.

To answer your question -
With the biquads I have, their response is only limited by the number of delays needed for the single/cascaded filter. You should understand that with decades in the industry, though you haven't shared your particular background, so I don't know whether you've worked as an FPGA or VLSI engineer in any capacity that would let you make a sound assessment. Somebody with a decent FPGA background wouldn't throw around division as a simple operation, just saying :wink: .

That being said they are always running in the FPGA and can be switched on as you want via I2C. I’m building towards a rudimentary instruction set to allow for customizable routing.

The calculations themselves are still done within a single sample cycle, so there is an output every sample period - granted, the filter won't respond fully until three or so samples have run through it, which is the limit with any biquad filter. It's the best case for latency you could hope for while still being able to reschedule coefficients and registers to act on other incoming samples in the same sample period. I think that's just a limit of filters in general, though there are use cases where a biquad will respond within a single sample period. And depending on what you're using the filter for, delay may not be the issue to worry about. So yes, with a serial biquad filter (or even 100 biquads - dare I say hundreds?) implemented, the latency to an output sample is less than a sample period. Welcome to pipelined and parallelized hardware.
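To make the "one output per sample period" point concrete, here's the software equivalent of what a single biquad section does each sample tick (a generic transposed direct form II sketch, not the FPGA code):

```c
/* Generic transposed direct form II biquad: every call consumes one input
 * sample and produces one output sample. The hardware section does the same
 * amount of work once per sample period.                                    */
typedef struct {
    float b0, b1, b2, a1, a2;   /* coefficients, with a0 normalised to 1     */
    float z1, z2;               /* the two delay registers                   */
} biquad_t;

static float biquad_tick(biquad_t *f, float x)
{
    float y = f->b0 * x + f->z1;
    f->z1 = f->b1 * x - f->a1 * y + f->z2;
    f->z2 = f->b2 * x - f->a2 * y;
    return y;                   /* an output exists after every input sample */
}
```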

For example, I have a wavetable plus 2 cascaded biquads which I use for anti-aliasing the wavetable internal to the FPGA. Within a single sample period I'm able to pull 16 samples (for anti-aliasing - note that I could pull more, say 64-256), run them through the biquads, and decimate down to 48k, and still have a few thousand clock cycles left before the next sample needs to be generated. That same cycle could be repeated to generate a few hundred notes of polyphony using only a single multiplier within that period, the BRAM resources for the wavetable, and the registers needed for the delays. With that table, the I2C register bank/handshaking logic, the clock-domain-crossing logic, and the I2S handling, I'm only sitting at around 5% FPGA logic utilization and 1/28 of the DSP utilization so far. There's plenty of room to do more, and it could be improved further with some tricky pipelining.

No.
A single sample of latency is achievable only in special cases, like adding, etc.
As filter processing needs a certain amount of information along and out of the time domain, a very short delay - smaller than a cycle of the filter (which can be a lot of samples) - can only happen with filter algorithms/structures that by nature have a serious response problem and generate lots of ringing.
I wonder how you want to mitigate this.
Blowing up the sample rate with some sort of interpolation will not render any real information anyway, but simply add artefacts determined by the technique used. A filter on this product will have to have its cycle adjusted to the upsampled rate in order to act in the intended way.
It is a general misconception that a filter renders "faster" results within a single sample cycle on a signal if the signal is upsampled high enough that the filter cycles fit within an original sample period.

You should know that already.

And then there is data memory for the processors in the FPGA. Do you plan to use the RP1 DMA/port functionality to transfer data to and from RPi 5 RAM? Via I²S? No PCIe? :thinking: Via I²C? :rofl: sorry…

Yeah, I addressed that case. If you're feeding 48k audio in and you want to apply a filter, you don't interpolate; you expect that the biquad won't respond fully until a certain number of samples have run through it. I also addressed the case where I have accomplished single-sample latency with filtering, i.e. the anti-aliasing biquad filters on the wavetables internal to the FPGA. Nowhere did I suggest "blowing up the sample rate with some sort of interpolation". Those filters are activated when higher frequencies are played than the table can reliably handle without aliasing. In that case we grab many samples of data at equally spaced intervals (i.e. oversampling), low-pass filter to anti-alias, decimate, and output. It's common practice in wavetable design (honestly, common in DDS in general - I've designed a fair number of signal generators in my tenure). Furthermore, there can be scheduled gains that take effect depending on which octave you're working with, so different initial sample rates can be accommodated per range. The single biquad hardware core can act as many biquads; you just need to keep track of the delay registers and gains and schedule them appropriately (sketched in software terms below).
"A filter on this product will have to have its cycle adjusted to the upsampled rate in order to act in the intended way." - which answers this.
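To illustrate the scheduling idea in software terms (the real thing is fabric, not C; the channel count and names are illustrative):

```c
/* Conceptual model of one physical biquad core time-shared across channels:
 * per-channel coefficients and delay registers are swapped in for each slot
 * of the sample period. Channel count and names are illustrative.           */
#define NUM_SLOTS 16

typedef struct { float b0, b1, b2, a1, a2, z1, z2; } bq_ctx_t;

static bq_ctx_t slot_ctx[NUM_SLOTS];          /* per-channel state + coeffs  */

static void biquad_sample_period(const float in[NUM_SLOTS],
                                 float out[NUM_SLOTS])
{
    for (int s = 0; s < NUM_SLOTS; s++) {     /* one pass of the shared core */
        bq_ctx_t *c = &slot_ctx[s];
        float y = c->b0 * in[s] + c->z1;
        c->z1 = c->b1 * in[s] - c->a1 * y + c->z2;
        c->z2 = c->b2 * in[s] - c->a2 * y;
        out[s] = y;
    }
}
```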

It's a clean alternative to mip-mapping when it comes to preserving table memory, but I think you ignored that completely and misunderstood. A 2nd-order biquad needs a minimum of 4 samples to output a value; cascading adds 2 more samples per section. The response will of course vary, as is simply the case with a biquad, but in the case I've described it's well behaved. FIRs are also possible, but they come at a pretty large delay and memory cost. They also don't respond like analog filters the way biquads do, so that needs to be understood as far as application goes.

I've already addressed how I handle data transfer to and from the Pi above, with the soft-TDM I2S - which, ironically, was something you misinterpreted in your first reply. I2C is solely for control, not high-speed data transfer; 400k is faster than typical MIDI baud, ay? Again, I don't think you're following along with enough diligence here. There's also internal memory in the FPGA which I'm using - a few Mbits of BRAM, not including DRAM and register slices in the LUTs on this particular module - so it's actually a hybrid. You're speaking very generally and still being presumptuous, it seems.

PCIe is something I've considered for another module, but I'm not aiming for this to be compatible solely with an RPi 5. Again, you're not following very well; you seem quite convinced that the Pi 5 is the only target for some reason. :wink:

Zynthian runs on the Pi5.
I've read about big plans to have this board on other SBCs as well.
There are either too many defects in the design, due to obvious misunderstanding of the matter, or, less likely, there is misunderstanding about the design due to nebulous discourse.
Who shall be following what with diligence? :angry:
Know that I’m not going to follow this bubble of yours.
Good luck.

1 Like

Well, I have 10 units on hand and nothing has jumped out as defective yet, so that's not the case. Maybe there are aspects of the FPGA DSP design that somebody wouldn't like, but that's the beauty of reprogrammable hardware: you can change the design if you'd like.

It's fine if you don't want to follow along - you seem to have been here only to naysay from the beginning anyhow.
Cheers.

Please tone down the rhetoric. I am enjoying this thread for its technical content. The challenges being asserted and responded to help to clarify and add context. I worry that the language being used may provoke undue and undesirable friction. Remember that we converse here in English, which may not be everyone's native language.

I have little experience with FPGA development so am very interested in the technical discussions on that. I do understand linear DSP that is typically used in zynthian (and other) software so am happy to have my knowledge validated. Just because two people have different experience does not mean they can’t learn from each other - the opposite should be true.

Is this necessary? Can the cascading be overlapped to remove extra cycles?

Zynthian uses JACK on ALSA so, like almost all computer audio systems, it works on periods of samples, with the resulting minimum latency. It uses COTS audio hardware which also uses buffers to reduce xruns, which is necessary on a generic system. If you build a dedicated hardware platform you can minimise this and approach sample-level latency, but we don't do that. We do, however, expect low latency in the hardware, so an FPGA-based DSP I/O should indeed be aiming for very low latency.

Filters are a good subject for this discussion. As noted, biquad filters are relatively simple to approximate in software (or firmware, etc.) but the maths breaks down fairly easily, with (often) undesirable resulting effects. Unstable oscillation is not unique to DSP (Moog used it to good effect in their ladder filters) but can give quite unexpected results, and aliasing is a big challenge, including at frequencies outside the audio band that then fold back into or impact the audible range.

But this topic isn't specifically about how the DSP is implemented. That is the domain of the wizards who provide the algorithms, etc. I believe this topic is mostly about the hardware platform that can facilitate such DSP magic. So, if you can get audio into and out of some DSP hardware quickly and consistently, then you have a feature that may be beneficial beyond the linear DSP of software plugins (which sit within the audio graph). I am still sceptical as to how effectively this could be incorporated into zynthian, mostly due to the limited available effort and our priorities.

Be kind to each other. We all usually have something useful to be said and heard.

1 Like

@riban

My apologies for the tone and rhetoric. I agree that the focus here should be on addressing questions and providing information to help garner a better understanding on the topic(s).

Regarding the cascaded biquad question: there's a lot to be said about serial vs parallel biquads, but yes, you can do either. I think it depends on the use case and/or desired result. Typically, cascaded biquads give a cleaner result when approximating a higher-order filter for something like anti-aliasing, but parallel filters have their strengths too - one being that, as you said, you can achieve a high-order response with less latency. In practice I think it can sometimes be beneficial to use both topologies in tandem.

For example, a multi-band EQ can be implemented using parallel biquads, and any bands that would really benefit from a higher order can be done as cascaded biquads (or even parallel + cascaded biquads).
You're right, though, that the math does tend to break down if the filter coefficients aren't chosen carefully or an implementation is inappropriate. That's why FIRs are generally preferred for applications requiring guaranteed stability, although they can sound harsh and come with a definite penalty in both memory requirements and unavoidable added latency.
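As a rough software illustration of the two topologies (not the FPGA implementation, just the signal flow):

```c
/* Signal-flow sketch of cascaded vs parallel biquads; coefficients and state
 * per section live in bq_t, as in a generic transposed direct form II.      */
typedef struct { float b0, b1, b2, a1, a2, z1, z2; } bq_t;

static float bq_tick(bq_t *f, float x)
{
    float y = f->b0 * x + f->z1;
    f->z1 = f->b1 * x - f->a1 * y + f->z2;
    f->z2 = f->b2 * x - f->a2 * y;
    return y;
}

/* Cascade: sections in series approximate one higher-order filter.          */
static float cascade_tick(bq_t *sections, int n, float x)
{
    for (int i = 0; i < n; i++)
        x = bq_tick(&sections[i], x);
    return x;
}

/* Parallel: e.g. a multi-band EQ sums independently filtered, gained bands. */
static float parallel_tick(bq_t *bands, const float *gains, int n, float x)
{
    float y = 0.0f;
    for (int i = 0; i < n; i++)
        y += gains[i] * bq_tick(&bands[i], x);
    return y;
}
```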

It does make me happy to hear that you see value in the hardware approach I'm taking. I agree there are aspects that are still unclear, but that's what I'm working hard on resolving, bit by bit. It's a first foray into making a product like this, and off the top of my head I don't know of another audio-focused platform with an FPGA on it.

I know there are rather expensive FPGA-based synthesizers from Novation and Waldorf, and likewise that Korg has used Pi compute modules on a few synths. It seems to me there should be a nice place for a hybridized approach.