How can I calculate propagation delay through series of combinational circuits using Verilog and FPGA?

Frank Dejay picture Frank Dejay · Jan 16, 2012 · Viewed 13.9k times · Source

I'm new to FPGA and HDL but I'm trying to learn and cant figure this out. How can I calculate or estimate propagation delay though several levels of combination logic. Can I only determine this empirically or can I figure it out at design time. In this situation I'm using and FPGA to implement a parity setting and checking circuit. The circuit would look like a tree network of xor gates like the example pictures, except I intent xor 16 registers so there will be more levels or xor operations. I would like to be able to calculate the propagation delay though each "level" xor logic so I can determine how many fractions of clock cycles or how many nanoseconds the entire parity checking and setting operations will take. Hope I'm making sense.

parity setting network

parity checking network

Thanks a lot for the help.

Answer

Jan Gray picture Jan Gray · Jan 19, 2012

You need "The Knowledge" as I explain here in "The Art of High Performance FPGA Design". http://www.fpgacpu.org/log/aug02.html#art "You have to ... crank up your tools and design some test circuits, and then open up the timing analyzer and the FPGA editor and pour over what came out, what the latencies (logic and routing) tend to be, etc."

After you do that for a while, you will look at this kind of question, and just know (or have a pretty good idea).

In this case, for example, I know in an FPGA, a 16-input XOR will be built out of a tree of 4- or 6-input lookup tables (4-LUTs or 6-LUTs) two deep, and it cannot be implemented in circuit only one LUT deep. Therefore the minimum delay for such a circuit in a pipelined implementation is going to be (in Xilinx timing nomenclature):

  • tCKO -- clock to output delay of any of the 16-flip-flops

  • tILO -- delay through the first level LUTs

  • tAS -- delay through 2nd level of LUTS + flip-flop setup time assuming implemented in the same slice

  • plus net routing delays

and for Virtex-6 speed -1 I would expect this to be ~1.5 ns.

As others have said, the component switching delay data is in the data sheets for your device in question, but the net routing delays are not. Indeed, in time, you may even start to remember the key delays and develop a sense for how many FPGA primitives like LUTs you can use and still make a particular clock period / clock frequency target.

Anyway I just tried this with some throwaway Verilog I coded up:

module t(clk, i, o);
  input clk;
  input [15:0] i;
  output reg o;

  reg [15:0] d;
  always @(posedge clk) begin
    d <= i;
    o <= ^d;
  end
endmodule

and a simple UCF file:

net clk period = 1.5 ns;

and the total delay in my device was about 1.4 ns. Try it for yourself and see!

Here is one path from the static timing analyzer output:

Paths for end point o (SLICE_X3Y68.A5), 6 paths
--------------------------------------------------------------------------------
Slack (setup path):     0.198ns (requirement - (data path - clock path skew + uncertainty))
  Source:               d_13 (FF)
  Destination:          o (FF)
  Requirement:          1.500ns
  Data Path Delay:      1.248ns (Levels of Logic = 2)
  Clock Path Skew:      -0.019ns (0.089 - 0.108)
  Source Clock:         clk_BUFGP rising at 0.000ns
  Destination Clock:    clk_BUFGP rising at 1.500ns
  Clock Uncertainty:    0.035ns

  Clock Uncertainty:          0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter (TSJ):  0.070ns
    Total Input Jitter (TIJ):   0.000ns
    Discrete Jitter (DJ):       0.000ns
    Phase Error (PE):           0.000ns

  Maximum Data Path at Slow Process Corner: d_13 to o
    Location             Delay type         Delay(ns)  Physical Resource
                                                       Logical Resource(s)
    -------------------------------------------------  -------------------
    SLICE_X3Y67.BQ       Tcko                  0.337   d<15>
                                                       d_13
    SLICE_X2Y68.A2       net (fanout=1)        0.590   d<13>
    SLICE_X2Y68.A        Tilo                  0.068   d<11>
                                                       d[15]_reduce_xor_21_xo<0>1
    SLICE_X3Y68.A5       net (fanout=1)        0.180   d[15]_reduce_xor_21_xo<0>
    SLICE_X3Y68.CLK      Tas                   0.073   d<10>
                                                       d[15]_reduce_xor_21_xo<0>3
                                                       o
    -------------------------------------------------  ---------------------------
    Total                                      1.248ns (0.478ns logic, 0.770ns route)
                                                       (38.3% logic, 61.7% route)

As you can see, the logic delays from the datasheets are only about 480 ps whereas the net routing delays are 770 ns and clock skew etc. is a bit more, total under 1.3 ns. This is actually faster than a component switching limit / Fmax on the global clock tree of 700 MHz / 1.43 ns...

So in summary, as you try some test circuits, and trying tuning them, you will get experience that helps you estimate how fast your circuit will run when implemented in FPGA primitives like LUTs.

And if it really matters, there is no substite for implementing the design through synthesis, place-and-route, and static timing analysis. Don't forget to add timing constraints to give the tools something to target, and then experiment lowering the min clock period iteratively until you converge on a min period.

Happy hacking!