AM Modulator on Legacy FPGA Silicon

Architecture and Feasibility Study of a High-Density 40-Channel AM Modulator on Legacy FPGA Silicon with Real-Time MCU Network Integration

1. System Architecture and Register-Level Parallel Bus

The overall architecture is partitioned into two fundamental domains: the asynchronous network and control domain (ESP32) and the synchronous, deterministic DSP modulation domain (EP2C5).

The Bus Interface: Byte-Wide Register Driving

The transfer of modulation data occurs over an 8-bit wide data bus, accompanied by a dedicated enable strobe (PIN_DATA_EN). Rather than shifting the bits out serially (bit-banging), the MCU writes directly to the low-level GPIO registers of the ESP32 SoC. A write to GPIO.out_w1ts sets every pin whose mask bit is 1, and a write to GPIO.out_w1tc clears every pin whose mask bit is 1. The entire data byte is therefore updated in parallel with two register writes (one set, one clear) and without any read-modify-write sequence. Between those two writes the bus briefly carries an intermediate value, which is one reason the data is qualified by the strobe.

The protocol enforces a strict temporal sequence to allow transient states on the physical bus lines to settle before being sampled by the FPGA:

Data Setup: The MCU calculates the logic bitmask for the byte to be transmitted and writes it simultaneously to GPIOs 0 through 7.
Strobe Activation: Following a defined nanosecond-scale hardware setup time, the MCU asserts PIN_DATA_EN high.
Strobe Hold & Data Hold Time: The signal remains stable until the FPGA safely captures the data, followed by the atomic clearing of the enable pin.
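
The three steps above can be traced on a host machine with a small model. This is a minimal sketch, not the project's firmware: fake_gpio_t, w1ts(), w1tc() and bus_write_byte() are hypothetical names, the strobe pin position is a placeholder, and capture on the rising edge of PIN_DATA_EN is an assumption. On real hardware the two helper functions would be plain writes to GPIO.out_w1ts and GPIO.out_w1tc, with delays tuned to the FPGA's setup and hold requirements.

```c
#include <assert.h>
#include <stdint.h>

#define DATA_MASK        0x000000FFu   /* D0..D7 on GPIO 0..7, as in the text */
#define PIN_DATA_EN_MASK (1u << 15)    /* hypothetical strobe pin position */

/* Host-side stand-in for the GPIO block so the sequence can be traced. */
typedef struct {
    uint32_t out;      /* current pin levels */
    uint32_t latched;  /* byte the modelled FPGA sampled on the strobe edge */
} fake_gpio_t;

/* Models a write to GPIO.out_w1ts: every 1 bit in mask drives its pin high. */
static void w1ts(fake_gpio_t *g, uint32_t mask)
{
    int strobe_rises = (mask & PIN_DATA_EN_MASK) && !(g->out & PIN_DATA_EN_MASK);
    g->out |= mask;
    if (strobe_rises)                      /* assumed: capture on rising edge */
        g->latched = g->out & DATA_MASK;
}

/* Models a write to GPIO.out_w1tc: every 1 bit in mask drives its pin low. */
static void w1tc(fake_gpio_t *g, uint32_t mask)
{
    g->out &= ~mask;
}

void bus_write_byte(fake_gpio_t *g, uint8_t value)
{
    assert(g != 0);
    uint32_t set_mask = value;                  /* data bits that must be 1 */
    uint32_t clr_mask = DATA_MASK & ~set_mask;  /* data bits that must be 0 */

    /* 1. Data setup: two writes update all eight data lines. */
    w1ts(g, set_mask);
    w1tc(g, clr_mask);
    /* (setup delay would go here on real hardware) */

    /* 2. Strobe activation. */
    w1ts(g, PIN_DATA_EN_MASK);
    /* (hold delay would go here on real hardware) */

    /* 3. Strobe release. */
    w1tc(g, PIN_DATA_EN_MASK);
}
```

Starting from a bus that holds 0xFF, calling bus_write_byte(&g, 0xA5) leaves 0xA5 on the data lines, records 0xA5 as the sampled byte, and returns the strobe to low.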

2. The lwIP RAW Real-Time Network Stack and Port Multiplexing

The central challenge for the MCU firmware lies in the simultaneous management of up to 40 independent UDP ports (one port per transmission channel) at a constant audio sampling rate of 25 kHz per channel.

The Bottleneck of the BSD Socket Model

Deploying the standard POSIX/BSD Socket API (socket(), bind(), recvfrom()) is impractical for this application for the following reasons:

Context Switch Overhead: Sockets require blocking calls or multiplexing via select(). With 40 active channels, every datagram is handed from the lwIP tcpip_thread to an application task, causing continuous context switches that load the FreeRTOS scheduler.
Resource Exhaustion: Each BSD socket instantiates its own control structures (in lwIP, a netconn with a receive mailbox and synchronization objects). Scaling this to 40 ports would consume a large share of the ESP32's SRAM.
Data Copy Cycles: Payloads pass from the network driver into lwIP buffers, are queued for the socket layer, and are finally copied into the application buffer. This cascade undermines the deterministic timing required for the 25 kHz audio task.

The Solution: lwIP RAW API

The lwIP RAW API circumvents the OS abstraction layers entirely. It operates on a purely event-driven, callback-based paradigm executing directly within the context of the core IP network thread.

For each audio channel, a Protocol Control Block (struct udp_pcb) is registered and bound to a global callback function via udp_recv(). When a UDP packet arrives at the Wi-Fi MAC, the lwIP core parses the header and immediately executes the registered callback.

This approach yields maximum efficiency:

Zero-Copy Approximation: The callback receives a direct pointer to the linked buffer list (struct pbuf). Audio data is iterated and pushed directly into the channel-specific ring buffer without intermediate OS allocations.
Latency Minimization: With no blocking queues between the stack and the application, the delay and jitter between packet arrival and buffer update are kept to a minimum.
Centralized Port Management: All ports are handled seamlessly through a single linked list of lightweight control blocks.
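
The buffer walk described above can be sketched as follows. This is a host-runnable illustration under stated assumptions, not the repository's code: chunk_t is a stand-in that mirrors only the next, payload and len fields of lwIP's struct pbuf, and ring_t, RING_SIZE, ring_push_chain() and ring_pop() are hypothetical names. In the firmware, a function of this kind would be called from the callback registered with udp_recv(), after which the pbuf is released with pbuf_free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 1024u   /* hypothetical depth; must be a power of two */

typedef struct {
    uint8_t data[RING_SIZE];
    volatile uint32_t head;   /* advanced only by the network callback */
    volatile uint32_t tail;   /* advanced only by the audio task */
} ring_t;

/* Stand-in for the fields of lwIP's struct pbuf that the walk needs. */
typedef struct chunk {
    struct chunk *next;    /* next buffer in the chain, or NULL */
    const void   *payload; /* start of this buffer's data */
    uint16_t      len;     /* bytes in this buffer only */
} chunk_t;

/* Copy every byte of a buffer chain into the ring.
 * Returns the number of bytes stored; the rest is dropped when full. */
size_t ring_push_chain(ring_t *r, const chunk_t *c)
{
    assert(r != NULL);
    size_t stored = 0;
    for (; c != NULL; c = c->next) {
        const uint8_t *src = (const uint8_t *)c->payload;
        for (uint16_t i = 0; i < c->len; i++) {
            if (r->head - r->tail == RING_SIZE)
                return stored;                    /* ring is full */
            r->data[r->head % RING_SIZE] = src[i];
            r->head++;
            stored++;
        }
    }
    return stored;
}

/* Fetch one sample for the audio task. Returns 1 on success, 0 if empty. */
int ring_pop(ring_t *r, uint8_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return 0;
    *out = r->data[r->tail % RING_SIZE];
    r->tail++;
    return 1;
}
```

The single-producer, single-consumer layout avoids locks because each index has exactly one writer. On a dual-core ESP32, a real implementation would additionally need to consider memory ordering between the two cores.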

https://github.com/radiolab81/FPGA_AMWaveSynth/tree/main/mcu_examples/esp32/udp_rx_mcu_tx

https://github.com/radiolab81/FPGA_AMWaveSynth/tree/main/TDM/40ch/udp_rx_mcu_tx

3. FPGA Pipeline: 4-Core TDM and Hybrid Memory Hierarchy

On the FPGA side, the design must strictly adhere to the aggressively constrained resources of the Altera Cyclone II EP2C5 (comprising merely 4,608 Logic Elements and 26 M4K blocks). The solution is a semi-parallel Time-Division Multiplexing (TDM) architecture, distributed across four parallel Arithmetic Logic Unit (ALU) cores.

The TDM Phase Schedule

The system operates on a 100 MHz system clock, multiplied from a 50 MHz external oscillator via an internal PLL. A central 4-bit modulo-10 counter generates the slot signal (ranging from 0 to 9). Each of the four computing cores (tdm_nco_core) processes exactly one AM transmission channel per slot. Consequently, each core serves 10 channels; four cores operating in parallel yield the required 40 channels. Each channel occupies the mathematical units for exactly one clock cycle (10 ns) every 100 ns, equating to an effective processing rate of 10 MHz per channel.
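
The schedule arithmetic can be checked with a few lines of C. This is only a sketch: the text states that each core serves 10 channels, but the specific numbering used here (channel = core x 10 + slot) is an assumption, and tdm_locate(), slot_next() and per_channel_rate_hz() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES       4u
#define SLOTS_PER_CORE  10u
#define SYS_CLK_HZ      100000000u   /* 100 MHz system clock from the PLL */

typedef struct {
    unsigned core;   /* which tdm_nco_core instance, 0..3 */
    unsigned slot;   /* which TDM slot inside that core, 0..9 */
} tdm_pos;

/* Assumed numbering: channels 0..9 on core 0, 10..19 on core 1, and so on. */
tdm_pos tdm_locate(unsigned channel)
{
    assert(channel < NUM_CORES * SLOTS_PER_CORE);
    tdm_pos p = { channel / SLOTS_PER_CORE, channel % SLOTS_PER_CORE };
    return p;
}

/* Behaviour of the modulo-10 slot counter on each clock edge. */
unsigned slot_next(unsigned slot)
{
    return (slot == SLOTS_PER_CORE - 1u) ? 0u : slot + 1u;
}

/* Each channel is served once per full slot cycle. */
uint32_t per_channel_rate_hz(void)
{
    return SYS_CLK_HZ / SLOTS_PER_CORE;
}
```

Under this numbering, channel 27 lands on core 2 in slot 7, and every channel is updated at 100 MHz / 10 = 10 MHz.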

Hybrid Memory Architecture and RAM Forwarding (core_mem.v)

Because the memory depth per TDM core is only 10 rows while the word widths vary significantly, synthesis requires a strict bifurcation to prevent exceeding the M4K block limit:

LUT-Based RAM: Narrow parameters (8-bit audio, 16-bit gain) are explicitly forced into distributed Look-Up Tables and flip-flops via the (* ramstyle = "logic" *) attribute. This preserves the scarce M4K blocks.
Block RAM: Wide state data, such as the 32-bit Frequency Tuning Words (FTW) and the 32-bit Phase Accumulators, are synthesized within the physical M4K blocks ((* ramstyle = "M4K" *)).

Mitigating Read-Write Hazards (Data Forwarding)

Because the MCU asynchronously writes data via the parallel bus into the same memory space that the TDM ALU cyclically reads, a read can collide with a write to the same row and return stale or corrupted data. The core_mem.v module resolves this with data forwarding:

If the asynchronous MCU write address (mcu_waddr) matches the currently active TDM read slot (slot) during a clock cycle, the internal RAM output is bypassed. A multiplexer transparently routes the incoming MCU data directly to the pipeline output registers (core_audio, core_frq, or core_gain).
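
A behavioural model in C makes the bypass rule concrete. It is a sketch of the idea only, not a translation of core_mem.v: mcu_waddr and slot come from the text, while core_mem_t, core_mem_cycle(), mcu_we and mcu_wdata are hypothetical names, and one call stands for one clock cycle.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 10u   /* one row per TDM slot */

typedef struct {
    uint32_t mem[ROWS];
} core_mem_t;

/* One clock cycle: read the row for the active slot and, if the MCU writes
 * in the same cycle, store its data. When both hit the same row, the write
 * data is forwarded to the output instead of the stored (older) value. */
uint32_t core_mem_cycle(core_mem_t *m, unsigned slot,
                        int mcu_we, unsigned mcu_waddr, uint32_t mcu_wdata)
{
    assert(m != 0 && slot < ROWS && mcu_waddr < ROWS);

    uint32_t q = m->mem[slot];          /* normal RAM read path */

    if (mcu_we) {
        m->mem[mcu_waddr] = mcu_wdata;  /* write lands in the array */
        if (mcu_waddr == slot)
            q = mcu_wdata;              /* bypass multiplexer selects MCU data */
    }
    return q;
}
```

When the addresses differ, the read returns the stored row unchanged; when they match, the pipeline sees the new value in the same cycle rather than one slot cycle later.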

4. Mathematical Modulation and AGC Hardening in the Arithmetic Core

The tdm_nco_core.v module implements a fully pipelined, stateless arithmetic unit responsible for phase accumulation, envelope management, and the final AM synthesis.

Pipeline Stages 1 & 2: Phase Accumulation and Peak Tracking

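In the first stage, the Numerically Controlled Oscillator (NCO) advances its 32-bit phase accumulator by the channel's Frequency Tuning Word, wrapping modulo 2^32:

$$\phi[n+1] = \bigl(\phi[n] + \mathrm{FTW}\bigr) \bmod 2^{32}$$

The upper 12 bits of the phase accumulator form the address for the Sine ROM.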
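
The relationship between a carrier frequency in Hz and its tuning word can be sketched as below. It assumes the accumulator is updated at the 10 MHz per-channel slot rate given earlier, so that FTW = round(f x 2^32 / 10 MHz). Whether the conversion runs on the MCU or elsewhere is not stated in the text, and ftw_for(), nco_step() and sine_addr() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NCO_UPDATE_HZ 10000000ull   /* assumed: one update per 100 ns slot cycle */

/* Tuning word for a carrier frequency, rounded to the nearest integer. */
uint32_t ftw_for(uint32_t f_hz)
{
    assert(f_hz < NCO_UPDATE_HZ / 2);   /* stay below the Nyquist limit */
    uint64_t scaled = ((uint64_t)f_hz << 32) + NCO_UPDATE_HZ / 2;
    return (uint32_t)(scaled / NCO_UPDATE_HZ);
}

/* One accumulator update; unsigned overflow gives the modulo 2^32 wrap. */
uint32_t nco_step(uint32_t phase, uint32_t ftw)
{
    return phase + ftw;
}

/* The upper 12 bits of the phase address the sine lookup. */
uint16_t sine_addr(uint32_t phase)
{
    return (uint16_t)(phase >> 20);
}
```

For the 549 kHz example used later in this document, ftw_for(549000) evaluates to 235793705, which corresponds to a frequency resolution of 10 MHz / 2^32, or roughly 0.0023 Hz per step.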

Stage 2 executes the mathematical peak tracking for the Automatic Gain Control (AGC). The absolute amplitude of the current audio sample is compared against a historical peak register. If the new value is greater, an instantaneous "attack" captures the new peak. If it falls, an exponential decay curve takes over, governed by a dedicated 16-bit decay counter (tuned to ~5 ms at the 10 MHz slot frequency).

The Inverse Noise Gate

A critical safety mechanism is embedded within the AGC inversion logic. Should the tracked audio peak fall below a predefined silence threshold, the system overrides the normal gain characteristic and forces the channel's audio gain to zero. This prevents the AGC from amplifying analog pre-amp noise floors or network dithering. The result is a clean, unmodulated carrier during periods of silence.
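
The peak tracker and the noise gate can be illustrated together with a short behavioural model. Every constant here is an assumption chosen for illustration: a decay step every 50,000 updates (5 ms at 10 MHz, which fits a 16-bit counter), a decay of 1/16 of the peak per step, a silence threshold of 4, and a target peak of 127. The division stands in for whatever inversion logic the hardware actually uses, and agc_state and agc_step() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DECAY_TICKS       50000u  /* assumption: 5 ms at the 10 MHz slot rate */
#define SILENCE_THRESHOLD 4u      /* hypothetical gate level */
#define TARGET_PEAK       127u    /* hypothetical normalised peak */

typedef struct {
    uint16_t peak;       /* tracked peak magnitude */
    uint16_t decay_ctr;  /* counts updates since the last attack or decay */
} agc_state;

/* Process one unsigned 8-bit sample; returns the audio gain in Q8.8
 * (256 means 1.0), or 0 while the noise gate is closed. */
uint16_t agc_step(agc_state *s, uint8_t sample_u8)
{
    assert(s != NULL);
    uint16_t mag = (uint16_t)abs((int)sample_u8 - 128);  /* distance from midscale */

    if (mag > s->peak) {
        s->peak = mag;                   /* instantaneous attack */
        s->decay_ctr = 0;
    } else if (++s->decay_ctr >= DECAY_TICKS) {
        s->decay_ctr = 0;
        uint16_t dec = s->peak >> 4;     /* proportional step: exponential decay */
        if (dec == 0 && s->peak > 0)
            dec = 1;                     /* keep small peaks decaying */
        s->peak -= dec;
    }

    if (s->peak < SILENCE_THRESHOLD)
        return 0;                        /* noise gate: force gain to zero */
    return (uint16_t)((TARGET_PEAK << 8) / s->peak);
}
```

Without the gate, a peak of 1 would yield a gain of 127, which is exactly the amplification of the noise floor that the text describes the gate as preventing.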

Pipeline Stages 3 & 4: Shared Sine ROM and AM Synthesis

To conserve memory, the shared_sine_rom.v module houses only a quarter-wave sine lookup table (1024 entries in M4K block RAM). Full-wave reconstruction is achieved via hardware-based address mirroring and sign inversion:

Address mirroring (bit 10 of the 12-bit ROM address): Set in quadrants 2 and 4, where the table is read backwards (addr[10] ? ~addr[9:0] : addr[9:0]).
Sign inversion (bit 11 of the 12-bit ROM address): Set in quadrants 3 and 4, where it triggers the two's complement inversion of the ROM output.
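
The reconstruction rule can be written as a small C function operating on the 12-bit address. This is a behavioural sketch with hypothetical names (sine_lookup(), rom); the table contents are supplied by the caller, since the values stored in shared_sine_rom.v are not given in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUARTER_ENTRIES 1024u

/* Full-wave lookup from a quarter-wave table.
 * addr12 bit 11: sign (quadrants 3 and 4)
 * addr12 bit 10: mirror (quadrants 2 and 4)
 * addr12 bits 9..0: index into the quarter-wave table */
int16_t sine_lookup(const int16_t *rom, uint16_t addr12)
{
    assert(rom != NULL);
    uint16_t idx = addr12 & 0x3FFu;

    if (addr12 & 0x400u)
        idx = (uint16_t)(~idx & 0x3FFu);   /* read the table backwards */

    int16_t value = rom[idx];

    if (addr12 & 0x800u)
        value = (int16_t)-value;           /* negative half-wave */

    return value;
}
```

Filling the table with a simple ramp (rom[i] = i) is enough to see the pattern: the output rises over addresses 0x000 to 0x3FF, falls over 0x400 to 0x7FF, and repeats with inverted sign over 0x800 to 0xFFF. Because ~idx maps index 0 to 1023, the mirrored quadrants are exactly symmetric only if the stored samples are offset by half a step, which is a common choice but an assumption here.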

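In the final stage, the AM synthesis equation is executed. Expressed with the quantities used in the steps below, where x[n] is the audio sample, k_AGC the AGC factor, A_c the carrier base, g the channel gain, and \phi[n] the 32-bit NCO phase, it reads:

$$s[n] = g \,\bigl(A_c + k_{\mathrm{AGC}}\, x[n]\bigr)\, \sin\!\left(\frac{2\pi\,\phi[n]}{2^{32}}\right)$$

Within the FPGA, this is mapped using high-speed fixed-point arithmetic: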

The audio signal is scaled by the calculated AGC factor.
The scaled value is added to a hardcoded DC offset (CARRIER_BASE = 18'sd16384) to generate the modulation envelope.
This envelope is multiplied by the external channel attenuation factor (gain).
Finally, the envelope is multiplied by the sign-corrected output of the Sine ROM, producing the 16-bit RF sub-signal (rf_out) for that specific channel.
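
The four steps can be followed numerically with the sketch below. Only CARRIER_BASE = 16384 is taken from the text. The number formats (Q8.8 AGC factor, Q0.16 gain, Q1.15 sine), the scaling divisors, and the clamp against over-modulation are assumptions made so that the example is self-consistent, and am_sample() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define CARRIER_BASE 16384   /* DC offset of the envelope, from the text */

/* One AM output sample.
 * audio:    signed audio sample, -128..127
 * agc_q8:   AGC factor, Q8.8 (256 = 1.0)           (assumed format)
 * gain_q16: channel attenuation, Q0.16 (65535 ~ 1) (assumed format)
 * sine_q15: sign-corrected sine ROM output, Q1.15  (assumed format) */
int16_t am_sample(int8_t audio, uint16_t agc_q8, uint16_t gain_q16,
                  int16_t sine_q15)
{
    /* Step 1: scale the audio by the AGC factor (halved to leave headroom). */
    int64_t scaled = ((int64_t)audio * agc_q8) / 2;

    /* Step 2: add the carrier base to form the envelope. */
    int64_t env = CARRIER_BASE + scaled;
    if (env < 0)     env = 0;       /* assumed guard against over-modulation */
    if (env > 32767) env = 32767;

    /* Step 3: apply the per-channel attenuation. */
    env = (env * gain_q16) / 65536;

    /* Step 4: multiply by the carrier sample. */
    int64_t rf = (env * sine_q15) / 32768;
    assert(rf >= INT16_MIN && rf <= INT16_MAX);
    return (int16_t)rf;
}
```

With silent audio, the envelope equals the carrier base, so the output is simply the carrier at roughly half of full scale. With the scaling assumed here, a full-scale audio sample at unity AGC swings the envelope by just under 16384, close to 100 percent modulation depth.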

5. Synchronous Accumulator and Network Node Operations

At the terminus of the pipeline, the four independent RF data streams must be merged. A 4-way adder tree continuously sums the signed 16-bit signals (c0_rf through c3_rf) into a transient sum within the same clock cycle.

To rule out digital clipping even if all 40 channels were to superimpose perfectly in phase, the final accumulator is widened to 22 bits: 40 full-scale signed 16-bit samples sum to a magnitude of at most 40 × 32,768 = 1,310,720, which is below the 2^21 = 2,097,152 limit of a signed 22-bit word. Only during the final write to the DAC output register (dac_out) at slot 9 is the signal reduced to the physical bit width of the Digital-to-Analog Converter (e.g., 8 or 12 bits) by keeping the most significant bits.
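
The headroom argument and the final width reduction can be checked with a short sketch. It assumes the DAC accepts offset-binary codes (midscale for zero), which the text does not specify, and worst_case_sum() and dac_code() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_BITS     22
#define NUM_CHANNELS 40

/* Largest magnitude the sum can reach if every channel hits full scale. */
int32_t worst_case_sum(void)
{
    return NUM_CHANNELS * 32768;
}

/* Reduce a signed 22-bit accumulator value to an unsigned DAC code by
 * keeping the most significant bits (assumed offset-binary DAC input). */
uint32_t dac_code(int32_t acc, unsigned dac_bits)
{
    assert(dac_bits >= 1 && dac_bits <= ACC_BITS);
    assert(acc >= -(1 << (ACC_BITS - 1)) && acc < (1 << (ACC_BITS - 1)));

    /* Shift the signed range up to 0 .. 2^22 - 1, then drop the low bits. */
    uint32_t offset = (uint32_t)(acc + (1 << (ACC_BITS - 1)));
    return offset >> (ACC_BITS - dac_bits);
}
```

The worst-case sum of 1,310,720 needs 21 magnitude bits plus a sign bit, so 22 bits is the smallest width that can never overflow. For an 8-bit DAC, a zero accumulator maps to code 128 and the worst-case positive sum maps to code 208.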

Practical Operation via Standard Linux Toolchains

Because the MCU ingestion architecture is designed around raw data streams, the transmitter array can be controlled directly from the command line without requiring bespoke client software.

Frequency and Channel Configuration via netcat

Control port 5000 ingests ASCII-based mapping tables. The required format is [UDP-Port]:[Frequency in Hz]. For example, to configure network port 1234 to broadcast at 549 kHz (former DLF transmitter Nordkirchen) and port 1235 to 1422 kHz (former transmitter Heusweiler), the following command is issued:

echo -n "1234:549000 1235:1422000" | nc -u -w0 <ESP32_IP_ADDRESS> 5000
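
On the receiving side, a message in this format could be parsed along the following lines. This is a sketch and not the firmware's actual parser: map_entry and parse_mapping() are hypothetical names, entries are assumed to be separated by whitespace as in the command above, and the input must be a NUL-terminated copy of the UDP payload.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint16_t port;     /* UDP port that carries the channel's audio */
    uint32_t freq_hz;  /* carrier frequency in Hz */
} map_entry;

/* Parse "port:freq port:freq ..." into out[].
 * Returns the number of valid entries; parsing stops at the first bad one. */
size_t parse_mapping(const char *msg, map_entry *out, size_t max_entries)
{
    assert(msg != NULL && out != NULL);
    size_t n = 0;
    const char *p = msg;

    while (n < max_entries) {
        while (isspace((unsigned char)*p))
            p++;                               /* skip separators */
        if (*p == '\0')
            break;

        char *end;
        unsigned long port = strtoul(p, &end, 10);
        if (end == p || *end != ':' || port == 0 || port > 65535)
            break;                             /* malformed port field */

        p = end + 1;
        unsigned long freq = strtoul(p, &end, 10);
        if (end == p)
            break;                             /* malformed frequency field */

        out[n].port = (uint16_t)port;
        out[n].freq_hz = (uint32_t)freq;
        n++;
        p = end;
    }
    return n;
}
```

Applied to the string sent by the netcat command above, the function yields two entries: port 1234 at 549000 Hz and port 1235 at 1422000 Hz.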

Audio Streaming via ffmpeg

The MCU interface expects raw PCM data formatted as unsigned 8-bit values (u8) at a rigid 25 kHz sample rate. Consequently, source material (MP3, FLAC, or HTTP live streams) must be transcoded on the fly. ffmpeg seamlessly handles the resampling and UDP pipelining:

ffmpeg -re -i "radio_show.mp3" -ac 1 -ar 25000 -f u8 -acodec pcm_u8 udp://<ESP32_IP_ADDRESS>:1234

Parameter Breakdown:

-re: Forces reading the input at native real-time speed (preventing ring-buffer floods on the MCU).
-ac 1: Downmixes the source to a single mono channel.
-ar 25000: Resamples the audio to match the rigid 25 kHz hardware cycle of the FPGA.
-f u8 -acodec pcm_u8: Transcodes the amplitudes into the required uncompressed 8-bit format.

Conclusion

This feasibility study demonstrates that, by time-multiplexing a small number of ALU cores instead of instantiating memory-intensive parallel hardware for every channel, a 40-channel DSP array can be deployed on legacy, cost-effective silicon. Direct GPIO register programming on the ESP32, paired with the event-driven lwIP RAW API, forms the bridge that couples the asynchronous, high-latency domain of Wi-Fi networks to the deterministic, synchronous hardware pipeline of the FPGA.

https://github.com/radiolab81/FPGA_AMWaveSynth/tree/main/TDM

The attached video shows EP2C5 multichannel AM modulation on medium wave with a different gain per station, along with simulated slight co-channel interference and fading on some stations using AMWaveSynthPropagationSimulator.