I'm new to FPGAs and HDL, but I'm trying to learn and can't figure this out. How can I calculate or estimate the propagation delay through several levels of combinational logic? Can I only determine this empirically, or can I figure it out at design time? In this situation I'm using an FPGA to implement a parity setting and checking circuit. The circuit would look like a tree network of XOR gates like the example pictures, except I intend to XOR 16 registers, so there will be more levels of XOR operations. I would like to be able to calculate the propagation delay through each "level" of XOR logic so I can determine how many fractions of a clock cycle, or how many nanoseconds, the entire parity checking and setting operation will take. Hope I'm making sense.
Thanks a lot for the help.
You need "The Knowledge" as I explain here in "The Art of High Performance FPGA Design". http://www.fpgacpu.org/log/aug02.html#art "You have to ... crank up your tools and design some test circuits, and then open up the timing analyzer and the FPGA editor and pour over what came out, what the latencies (logic and routing) tend to be, etc."
After you do that for a while, you will look at this kind of question, and just know (or have a pretty good idea).
In this case, for example, I know that in an FPGA a 16-input XOR will be built out of a tree of 4- or 6-input lookup tables (4-LUTs or 6-LUTs) two deep. Since 16 inputs are more than a single LUT can take, it cannot be implemented in a circuit only one LUT deep. Therefore the minimum delay for such a circuit in a pipelined implementation is going to be (in Xilinx timing nomenclature):
tCKO -- clock-to-output delay of any of the 16 flip-flops
tILO -- delay through the first-level LUTs
tAS -- delay through the second-level LUT plus flip-flop setup time, assuming both are implemented in the same slice
and for a Virtex-6 at speed grade -1 I would expect this to be ~1.5 ns.
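As a rough sanity check on the "two deep" claim, here is a minimal Python sketch that counts how many LUT levels a balanced reduction tree needs. The function name `lut_levels` is made up for illustration. It assumes each LUT simply combines up to k signals, which is fine for an associative function like XOR, but real synthesis tools may pack logic differently.

```python
def lut_levels(n_inputs: int, lut_size: int) -> int:
    """Levels of k-input LUTs needed to reduce n_inputs signals to one.

    Assumes a balanced tree where every LUT combines up to lut_size
    signals (valid for an associative function such as XOR).
    """
    if n_inputs < 1 or lut_size < 2:
        raise ValueError("need n_inputs >= 1 and lut_size >= 2")
    levels = 0
    signals = n_inputs
    while signals > 1:
        # Each level groups the remaining signals into LUTs (ceiling division).
        signals = -(-signals // lut_size)
        levels += 1
    return levels


if __name__ == "__main__":
    for k in (4, 6):
        print(f"16-input XOR with {k}-LUTs: {lut_levels(16, k)} level(s)")
```

Either LUT size gives two levels for 16 inputs, so the register-to-register path has one Tilo stage plus the final LUT that is folded into Tas. Routing delay between the levels is not captured by this count.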
As others have said, the component switching delay data is in the data sheets for your device in question, but the net routing delays are not. Indeed, in time, you may even start to remember the key delays and develop a sense for how many FPGA primitives like LUTs you can use and still make a particular clock period / clock frequency target.
Anyway I just tried this with some throwaway Verilog I coded up:
module t(clk, i, o);
  input clk;
  input [15:0] i;
  output reg o;
  reg [15:0] d;
  always @(posedge clk) begin
    d <= i;
    o <= ^d;
  end
endmodule
and a simple UCF file:
net clk period = 1.5 ns;
and the total delay in my device was about 1.4 ns. Try it for yourself and see!
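To connect this back to the "how many clock cycles" part of the question, here is a small cycle-level Python model of the module above. It is only a sketch: the class and method names are invented for illustration, and the registers are assumed to start at 0, which the post does not state.

```python
def parity16(value: int) -> int:
    """XOR of the low 16 bits of value, like Verilog's ^d on reg [15:0] d."""
    return bin(value & 0xFFFF).count("1") & 1


class ParityPipeline:
    """Cycle-level model of module t: d <= i; o <= ^d; on each posedge clk."""

    def __init__(self) -> None:
        self.d = 0  # assumption: registers start at 0 (not stated in the post)
        self.o = 0

    def posedge(self, i: int) -> int:
        # Nonblocking assignments: both right-hand sides use the OLD values.
        next_d = i & 0xFFFF
        next_o = parity16(self.d)
        self.d, self.o = next_d, next_o
        return self.o


if __name__ == "__main__":
    p = ParityPipeline()
    for word in (0x0001, 0x0003, 0x0000):
        print(f"i={word:#06x} -> o={p.posedge(word)}")
```

In this model the parity of a word shows up on `o` at the clock edge after the one that captured it into `d`. The whole XOR tree therefore has to settle within one clock period, which is exactly what the period constraint asks the tools to check.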
Here is one path from the static timing analyzer output:
Paths for end point o (SLICE_X3Y68.A5), 6 paths
--------------------------------------------------------------------------------
Slack (setup path): 0.198ns (requirement - (data path - clock path skew + uncertainty))
Source: d_13 (FF)
Destination: o (FF)
Requirement: 1.500ns
Data Path Delay: 1.248ns (Levels of Logic = 2)
Clock Path Skew: -0.019ns (0.089 - 0.108)
Source Clock: clk_BUFGP rising at 0.000ns
Destination Clock: clk_BUFGP rising at 1.500ns
Clock Uncertainty: 0.035ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.070ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Maximum Data Path at Slow Process Corner: d_13 to o
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    SLICE_X3Y67.BQ       Tcko                  0.337   d<15>
                                                       d_13
    SLICE_X2Y68.A2       net (fanout=1)        0.590   d<13>
    SLICE_X2Y68.A        Tilo                  0.068   d<11>
                                                       d[15]_reduce_xor_21_xo<0>1
    SLICE_X3Y68.A5       net (fanout=1)        0.180   d[15]_reduce_xor_21_xo<0>
    SLICE_X3Y68.CLK      Tas                   0.073   d<10>
                                                       d[15]_reduce_xor_21_xo<0>3
                                                       o
    -------------------------------------------------  ---------------------------
    Total                                      1.248ns (0.478ns logic, 0.770ns route)
                                                       (38.3% logic, 61.7% route)
As you can see, the logic delays from the datasheets are only about 480 ps, whereas the net routing delays are 770 ps, and clock skew and uncertainty add a bit more, for a total of about 1.3 ns. This is actually faster than a component switching limit / Fmax on the global clock tree of 700 MHz / 1.43 ns...
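If you want to check the arithmetic in the report yourself, the following Python sketch recomputes the totals from the individual entries, using the slack formula printed in the report. Treating "requirement minus slack" as the minimum period is my reading of the numbers rather than something the tool printed here.

```python
# Numbers copied from the timing report above (all in ns).
requirement = 1.500
logic_delays = [0.337, 0.068, 0.073]   # Tcko, Tilo, Tas
route_delays = [0.590, 0.180]          # the two nets on the path
clock_skew = -0.019
clock_uncertainty = 0.035

logic = sum(logic_delays)
route = sum(route_delays)
data_path = logic + route

# slack = requirement - (data path - clock path skew + uncertainty)
slack = requirement - (data_path - clock_skew + clock_uncertainty)

# Assumption: the tightest period this path could meet is requirement - slack.
min_period = requirement - slack

print(f"data path  = {data_path:.3f} ns "
      f"({100 * logic / data_path:.1f}% logic, "
      f"{100 * route / data_path:.1f}% route)")
print(f"slack      = {slack:.3f} ns")
print(f"min period = {min_period:.3f} ns")
```

This reproduces the 1.248 ns data path, the 38.3% / 61.7% logic and route split, and the 0.198 ns slack from the report. Note that the routing entries, which dominate the path, are the ones you cannot look up in a datasheet.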
So in summary, as you try some test circuits and try tuning them, you will gain experience that helps you estimate how fast your circuit will run when implemented in FPGA primitives like LUTs.
And if it really matters, there is no substitute for implementing the design through synthesis, place-and-route, and static timing analysis. Don't forget to add timing constraints to give the tools something to target, and then experiment with lowering the minimum clock period iteratively until you converge on a minimum period.
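One way to organize that iteration is a simple bisection over the period constraint. The Python sketch below only illustrates the idea: `meets_timing` is a hypothetical callback standing in for a full synthesis, place-and-route, and timing analysis run, and `fake_flow` is a made-up stand-in so the example can run without any tools. Real place-and-route results are not perfectly repeatable or monotonic, so treat the answer as approximate.

```python
from typing import Callable


def find_min_period(meets_timing: Callable[[float], bool],
                    passing_ns: float,
                    failing_ns: float = 0.0,
                    tolerance_ns: float = 0.01) -> float:
    """Bisect between a failing and a passing period constraint.

    meets_timing(period_ns) is a hypothetical callback: it should rerun
    synthesis, place-and-route, and static timing analysis with that period
    constraint and return True if there is no negative slack.
    """
    if not meets_timing(passing_ns):
        raise ValueError("passing_ns must be a period that meets timing")
    while passing_ns - failing_ns > tolerance_ns:
        trial = (passing_ns + failing_ns) / 2
        if meets_timing(trial):
            passing_ns = trial   # constraint met: try a tighter one
        else:
            failing_ns = trial   # constraint missed: back off
    return passing_ns


if __name__ == "__main__":
    # Stand-in for the real tool flow: pretend the design closes at 1.302 ns.
    def fake_flow(period_ns: float) -> bool:
        return period_ns >= 1.302

    best = find_min_period(fake_flow, passing_ns=1.5)
    print(f"converged near {best:.3f} ns")
```

Each call to the callback is a full tool run in practice, so a coarse tolerance keeps the number of iterations small.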
Happy hacking!