978-1-7281-1363-0/19/$31.00 ©2019 European Union
Pipelining on FPGAs: A Tutorial
Eduardo Boemo
Universidad Autónoma de Madrid
Madrid, Spain
eduardo.boemo@uam.es

Abstract— This tutorial reviews historical milestones and
main concepts regarding the pipelining of electronic circuits.
Although the technique emerged in the 1960s, it remains a
direct way to simultaneously increase throughput and reduce
power in FPGA-based systems. However, the efficacy of
pipelining is limited by the dominance of register and routing
delays. This work focuses on bit-level pipelining. It analyses,
through examples, key aspects such as construction hints, pipeline
metrics, effects of registering, preferential pipeline directions,
and synchronization failures. The text condenses the first
section of the invited tutorial lecture at the 2019 Southern
Conference on Programmable Logic (SPL). Whenever
possible, numeric examples are particularized to FPGA
technology, but in some cases cell-based ASIC data are
deemed more convenient. The ideas should be useful for
students of an advanced course on digital electronics, or PhD
candidates interested in the details of the design of integrated
circuits.
Keywords—Pipeline, FPGAs, High-speed digital design,
Henry Ford, Clock skew, Wave pipeline, Power consumption.
I. INTRODUCTION
The main concepts of pipelining originated on the assembly
line of the T-Model in 1908. In that year, production in the
Ford plant was reorganized by dividing the construction of
the cars between groups of workers who specialized in only
one part of the process. Each team was placed along lines,
repeating the assigned task – always the same – on
successive pieces situated on a conveyor belt (Fig.1). In this
scheme, the maximum manufacturing speed is limited by the
slowest task. As a consequence, these tasks must be planned
to require the same time. The flow of parts is continuous,
except for the inevitable time needed to fill or empty the
conveyor belts. Henry Ford eliminated this inconvenience by
introducing an 8-hour working day, with 3 shifts a day.
Therefore, the steady-state was virtually infinite and all the
workers operated in parallel making simultaneously the same
part of the assembly, but on different cars. This type of
process is now called temporal parallelism. Using all these
concepts, Ford sped up chassis assembly from 12.5 to 1.5
hours making nearly 15 million cars between 1908 and 1927
[1]. The success of the process converted the T-model into a
milestone of mass production history, as well as an agent of
urban and social revolution.
Henry Ford's ideas to improve production are also valid
for digital circuits. Large combinational blocks are unable to
process data at high rates. Normally, new data cannot
be input until the previous result has been output. If this rule
is violated, the fastest bits of the next datum catch up with the
slowest bits of the previous one. Each line of gates only works during
a short fraction of the processing period [2]. As in
unorganized car production, most of the time the gates are
waiting for the arrival of new data.
Nevertheless, the main difference between car production
and digital processing is the nature of the parts. Mechanical
objects must be artificially moved while bits must be
artificially stopped. Otherwise, they travel along the
circuit until the electric potential equilibrium is reached.
Thus, digital pipelines require the insertion of edge-triggered
D-type flip-flops (FFs) in the datapath in order to align and
stop the intermediate results. These FFs make it possible to
synchronize the operations, but they do not contribute to any
transformation of the data, just as conveyor belts do not
assemble the cars. As in car production, in digital
pipelines the slowest stage fixes the system period. To
maximize pipeline speed, the circuit partition must be
balanced. A direct option is to make the processing
elements identical. In the 1980s, this type of circuit was renamed a
“systolic array” [3]. However, once again the beneficial
effect of uniformity had already been discovered by Ford.
The T-model was originally built in black, red, green or grey.
But in the following years, the colour choice was reduced to
only one. "Any colour so long as it's black", said Ford [4].
Fortunately, in electronics there is a myriad of digital circuits
based on identical blocks: they were designed to be
extensible and cascadable. Even today, these features
facilitate circuit description, construction, width extension,
and debugging.

Fig.1: Workers on a T-model moving assembly line with magnetos and
flywheels in 1913. Reproduced from Wikicommons [5].

Probably the first published study about pipelining in
digital circuits is the work of Leonard Cotten [6]. But even in
the 1960s, the origin of pipelining was uncertain. Cotten
writes: “…The term pipeline has been used for over 5 years
by designers to describe maximal rate processing ….
However, the author has so far been unsuccessful in
determining the origin of the term as used in this context…”.
A second paper by Cotten [7] studied the effect of clock skew
(Section VII of this tutorial), and the wave pipeline
alternative. The IBM 360 FPU was an early materialization
of the technique [8]. At that time, the term was quite novel as
they used “pipelined”, in quotation marks [8],[9]. Pipeline
adders and multipliers were explored in [10]. After that,
notable pipeline implementations were [11]-[14].
This work has been supported by Comunidad Autónoma de Madrid
(Spain) DIFRAGEOS Project S2013/ICE-3004.
II. A SIMPLE EXAMPLE: PIPELINING A RIPPLE-CARRY ADDER
The extremely well-known ripple-carry adder (RCA) is
adequate for illustrating the effects of pipelining. This circuit
suffers from a large serial delay caused by the propagation of
the carries. Even so, nowadays it is one of the fastest options
for addition in FPGA technology. The cause is the low
fanout of its nets.
Fig.2 shows an 8-bit RCA composed of identical full-
adders (FAs). Unusually for digital design, it is drawn with
the input data traveling from bottom to top, and from right to
left. This reflects the way in which the calculation is done
by hand.

Fig.2: 8-bit ripple-carry adder (RCA).
A fine-grain pipeline version of an RCA is obtained by
adding 117 FFs (Fig.3). As is traditional in pipeline
schematics, an FF (and its associated clock and reset lines) is
represented as a small square or dot [15], [16]. At the input, a
triangular arrangement of FFs is used to delay each
datum to synchronize it with the corresponding carry bits. At
the output, another triangular arrangement of FFs delays each
result to synchronize it with the slower S8 and S7 bits. These
triangles are named skewing and deskewing registers in [12].

Fig.3: 8-bit RCA pipeline version.
Fig.3 illustrates an obvious property of pipelines: the
number of FFs along any I/O path is a constant number (9 in
this example, including the I/O registering). A path with
more (or fewer) FFs cannot exist, as this would lead to the
mixing of bits of different data.
The flow of bits after each clock edge is shown in Fig.4.
Only the first clock cycles are shown. The second index
indicates the position of successive data (A02 is the bit A0
of the second datum, and so on). Initially, the pipeline outputs
spurious additions. For example, the first bit S01 (of the first
addition) is obtained after the 2nd clock cycle, but the
corresponding last bit S81 of the same result is output
after the 9th cycle. So, it is necessary to delay the least
significant bits to maintain the parallel format. But unlike
the combinational version, after filling the pipeline all the
gates operate completely in parallel.
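The behaviour described above can be reproduced with a small cycle-level model. The following Python sketch is only an illustration under simplifying assumptions: the class name PipelinedRCA and the helper run are hypothetical, and each register line stores the whole operand words instead of only the bits that are still needed, which is what the skewing and deskewing triangles do in the real circuit. It shows that the first result leaves the pipeline on the 9th clock edge and that one result is produced per edge afterwards.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


class PipelinedRCA:
    """Cycle-level model of an n-bit RCA with one FF line after every FA."""

    def __init__(self, n=8):
        self.n = n
        # n + 1 register lines: the input line, n - 1 internal lines
        # and the output line (9 lines for n = 8).
        self.lines = [None] * (n + 1)

    def clock(self, operands=None):
        """Apply one clock edge; operands is (a, b) or None (no new datum)."""
        n = self.n
        new = [None] * (n + 1)
        if operands is not None:
            a, b = operands
            new[0] = (a, b, 0, 0)  # (a, b, carry, partial sum)
        # Register line k captures the output of full adder k - 1,
        # which is fed by register line k - 1.
        for k in range(1, n + 1):
            prev = self.lines[k - 1]
            if prev is not None:
                a, b, carry, partial = prev
                bit = k - 1
                s, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
                new[k] = (a, b, carry, partial | (s << bit))
        self.lines = new
        out = self.lines[n]
        if out is None:
            return None  # pipeline still filling (or empty)
        _, _, carry, partial = out
        return partial | (carry << n)  # the carry out is the extra sum bit


def run(pairs, n=8):
    """Feed one operand pair per edge, then flush; return results and
    the index of the first edge that delivers a valid result."""
    rca = PipelinedRCA(n)
    results, first_valid_edge = [], None
    stream = list(pairs) + [None] * (n + 1)
    for edge, item in enumerate(stream, start=1):
        out = rca.clock(item)
        if out is not None:
            if first_valid_edge is None:
                first_valid_edge = edge
            results.append(out)
    return results, first_valid_edge


results, first_valid_edge = run([(1, 2), (255, 255), (100, 27)])
print(results)           # [3, 510, 127]
print(first_valid_edge)  # 9
```

The latency of 9 edges matches the constant number of FFs along any I/O path of the 8-bit example, while the three additions are delivered on three consecutive edges.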
III. EVIDENCING DEPENDENCIES
In order to pipeline a circuit it is better to redraw it to
evidence the dependencies between processors (B depends
on A, if B needs the results of A to begin its calculations).

Fig.4: Details of the computation at each RCA pipeline stage.
The procedure for evidencing dependencies is simple
(Fig.5): the FA that first starts to operate is positioned at the
bottom of the circuit, then the second, and so on.

Fig.5: RCA topology evidencing dependencies.
Naturally, the resulting structure is identical to the one
shown in Fig.2, but now pipelining is
straightforward. The intersections between the horizontal lines
and the wires indicate the places where the synchronization
FFs must be inserted (supposing that all the FA blocks and
wiring have the same delay).
Fig.5 confirms the first drawback of pipelining: the area
overhead. For example, in standard cell technology [17] both
a full-adder and a D-type FF (with reset) have a minimum of 28
transistors each. So, when the 8-bit RCA of Fig.4 is pipelined,
the total number of transistors grows from 224 to
3500. Moreover, as the number of FFs grows as NFF(n) = 1.5 n²
+ 2.5 n + 1 (for an input data width n), a hypothetical
1024-bit fine-grain pipeline RCA would need more than 1.4 times
the maximum 1,095,200 available FFs in a state-of-the-art
Virtex chip [18]. These numbers explain why FPGA
architects reuse the FFs of the configuration path to get
additional chains of registers, called SRL16, 32, etc. [19]. In
any case, such a huge pipeline would require tens of amperes
to raise the clock edge along the circuit in a few nanoseconds.
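The area figures quoted above follow directly from the FF-count expression. The short Python sketch below only evaluates that expression and the 28-transistor cell size given in the text; the function names are hypothetical and chosen for illustration.

```python
TRANSISTORS_PER_CELL = 28        # full-adder or D-type FF, as quoted in the text
MAX_FFS_ON_CHIP = 1_095_200      # largest FF count quoted in the text [18]


def ff_count(n):
    """NFF(n) = 1.5 n^2 + 2.5 n + 1, evaluated with integer arithmetic."""
    return (3 * n * n + 5 * n + 2) // 2


def transistor_count(n, pipelined):
    """Transistors of an n-bit RCA, with or without fine-grain pipelining."""
    cells = n + (ff_count(n) if pipelined else 0)
    return cells * TRANSISTORS_PER_CELL


print(ff_count(8))                                          # 117
print(transistor_count(8, False), transistor_count(8, True))  # 224 3500
print(ff_count(1024))                                       # 1575425
print(ff_count(1024) > MAX_FFS_ON_CHIP)                     # True
```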
Finally, Fig.6 shows another example of pipelining.
Drawing the “cubic” processors (above) according to their data
dependencies (below) makes the solution direct: a 36-register,
five-stage pipeline. Processing elements number 2, 3 and
4 do not exchange data between each other; so, they can
operate in parallel. The same situation occurs with
processing elements 6 and 7.

Fig.6: Redrawing a circuit (above) to evidence its data dependencies
(below).
The visualization of the dependencies also facilitates the
trimming of the pipeline. In those cases where the delays of
the processing elements are different, some lines can be
eliminated, without affecting the speed of the circuit. For
example, if the delay of PE1 is 200 ns while the other PEs
have a delay of 100 ns, the c and e lines can be eliminated.

IV. PIPELINING IN NUMBERS
The result of pipelining can be described by several
numbers: throughput, latency, speed-up, area penalty,
pipeline granularity, and logic depth among others.
Pipelining does not reduce the time required to obtain an
individual result; but increases the number of obtained
results per second. R. F. Lyon describes a pipeline as a
circuit “which has an operation period less than its
operation delay” [20]. This phrase condenses the two main
numbers of a pipeline: throughput and latency.
The Oxford dictionary defines throughput as “the amount
of material or items passing through a system or process”.
In electronic pipelines, it can be adapted as the number of
results per second [21], or simply the processing rate. Other
key terms are bandwidth or production, as well as its inverse
magnitude, the pipeline period. The annual production of the
T-model reached in 1914 the number of 260,720 units [22].
That is, a throughput of nearly a car every 2 minutes.
The latency is the time necessary to process a single
piece of data. It is also called the delay or response time.
Latency is masterfully defined by Peter Cappello as “the
amount of time between the first-bit-of-the-first-data
entrance and the last-bit-of-the-last-data output, for a single
(just one) computation” [23]. However, sometimes the
latency of a block is specified as the time between first input
bit and first output bit of a single piece of data.
The Boeing Company is a useful illustration of the above
concepts, even though it applies both spatial parallelism
(assembly lines in parallel) and temporal parallelism
(pipelines). Nowadays, Boeing produces a 777-model plane
every 3 days (throughput = 0.33 planes/day). Naturally, a
single plane cannot be assembled in 3 days. It is composed of
more than 3 million parts, and has approximately 60,000
rivets [24]. To determine the latency of the process it would
only be necessary to time any of those rivets from when it
enters the factory to when it leaves as part of the plane at the
other end of the factory.
The effectiveness of pipelining in terms of time is
measured using the speedup figure. This is the throughput of
the pipelined version divided by the throughput of the
original circuit. In return, this extra speed increases the
circuit cost. The area penalty is the pipeline area divided by
the original circuit area.
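These definitions translate into a few lines of arithmetic. The Python sketch below is an idealized illustration, not a measurement: the Design record and its field names are hypothetical, time is counted in FA delays, FF and wiring delays are ignored (the next section shows why this is optimistic), and area is counted in transistors using the RCA figures of Section III.

```python
from dataclasses import dataclass


@dataclass
class Design:
    period: float   # time between successive results
    latency: float  # time needed to process a single datum
    area: float     # any consistent area unit (transistors here)

    @property
    def throughput(self):
        """Results per unit of time: the inverse of the pipeline period."""
        return 1.0 / self.period


def speedup(pipelined, original):
    """Throughput of the pipelined version over that of the original."""
    return pipelined.throughput / original.throughput


def area_penalty(pipelined, original):
    """Pipeline area divided by the original circuit area."""
    return pipelined.area / original.area


# Idealized 8-bit RCA: 8 FA delays in series versus one FA per stage.
combinational = Design(period=8, latency=8, area=224)
fine_grain = Design(period=1, latency=9, area=3500)

print(speedup(fine_grain, combinational))        # 8.0
print(area_penalty(fine_grain, combinational))   # 15.625
print(fine_grain.latency >= combinational.latency)  # True

# Ford in 1914: 260,720 cars per year expressed as minutes per car.
minutes_per_year = 365 * 24 * 60
print(f"{minutes_per_year / 260_720:.2f}")       # 2.02
```

The example makes the trade explicit: the ideal speedup of 8 costs an area penalty above 15, and the latency of a single addition does not improve.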
A more useful concept is the pipeline granularity β [25].
This is the maximum number of processors operating in
series between successive lines of FFs. In a regular pipeline,
granularity is the key to tuning the result, trading speed for
extra FFs. For example, if lines 2, 4, 6, and 8 are removed in
the circuit of Fig.4, the granularity is β=2, the FF count
passes from 117 to 65 but the minimum clock period is
greater than the delay of 2 FAs. Now it is necessary to wait
for the results of two FAs. Another area-time pair is obtained
using β=4, by removing lines 2, 3, 4, 6, 7, and 8 in Fig.4. Now,
the number of FFs is 38 and the minimum clock period must
be greater than the delay of 4 FAs. The idea is illustrated in
Fig.7.
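The three area-time pairs quoted above can be tabulated as follows. This Python sketch only rearranges numbers already given in the text: the FF counts for each granularity and the 28 transistors per FA or FF of Section III. The period bound counts FA delays alone, so FF and wiring delays are ignored, and the variable names are hypothetical.

```python
FA_COUNT = 8                  # full adders in the 8-bit RCA
TRANSISTORS_PER_CELL = 28     # per FA or FF, as quoted in Section III
FFS_BY_GRANULARITY = {1: 117, 2: 65, 4: 38}  # FF counts quoted in the text

table = []
for beta, ffs in FFS_BY_GRANULARITY.items():
    transistors = (FA_COUNT + ffs) * TRANSISTORS_PER_CELL
    # The clock period must exceed the delay of beta FAs in series.
    table.append((beta, ffs, transistors, beta))

for beta, ffs, transistors, fa_delays in table:
    print(f"beta={beta}: {ffs} FFs, {transistors} transistors, "
          f"period > {fa_delays} FA delay(s)")
```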
Granularity is a topological parameter. At silicon level,
its equivalence (or consequence) is the logic depth. In FPGA
technology, logic depth can be considered as the maximum
number of LUTs in series between successive lines of FFs. A
fine-grain pipelined RCA has a granularity β=1 but the logic
depth can be 1, 2 or more LUTs, depending on diverse
parameters such as the LUT size, dedicated XOR gates and
carry-chain lines, routing congestion criteria, or the ability of
the synthesis tool. In his remarkable book, H. B. Bakoglu
includes the gates of the input FF in the calculation of the logic
depth, for masked integrated circuits [26].

Fig.7: RCA pipelined with β=2 (above) and β=4 (below).

V. THE EFFECT OF REGISTERING AND EXTRA WIRING
The followers of Henry Ford in the field of electronics
shared his frustration about pipelining. The N-fold gain in
speed is a myth; it is only possible if the wiring and FF
delays are insignificant in comparison with logic delays. In
other words, fine-grain pipelines have a relatively slow
processing speed limit.
The organization of a pipelined production requires
several blocks receiving each datum at the right time. Even
though the nature of bits and car parts is different, the
consequences of introducing mechanisms to synchronize
them lead to the same result: loss of time. Ford divided the
motor assembly into 48 operations (N=48). This arrangement
should lead, in the abstract, to a speedup of nearly 48.
However, he merely obtained a speedup of 3. In the same
way, the construction of the magneto was split into 29 parts,
allowing the time to be reduced only from 20’ to 13’10”
(speedup=1.5). Finally, the overall car assembly evolved
from 12’30 hours to only 1’33 (speedup=8.3) [27].
In current integrated circuits, the effect of transporting
(wiring) and synchronizing (FFs) the parts (bits) is expensive
in terms of time. FF and wiring delays are larger than
combinational logic delays. This fact destroys the magic
effect of pipelining: dividing the task into N concurrent blocks
never produces the theoretical speedup of N. For example,
Fig.8 shows a circuit with a total combinational delay of
value ∆COMB. The propagation delay and setup time of the FFs are
labelled ∆CK-OUT and ∆SETUP respectively. Pipelining
the block into N stages effectively divides the combinational
delay of each stage by N, but the delays associated with the FFs
remain constant. Additionally, a wiring delay ∆W must
also be included. Thus, the pipeline period is:
T ≥ ∆CK-OUT + ∆COMB / N + ∆SETUP + ∆W + SKEW (1)
The effect of the clock skew in Eq.1 is analysed in
Section VII. In any case, for a large number N of stages, the
clock period will remain dominated by FF and wiring delays.
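A numeric reading of Eq.1 makes the saturation visible. In the Python sketch below all delay values are hypothetical and chosen only for illustration; the speedup is measured against the same block registered as a single stage (N = 1), which is an assumption of this example rather than a definition taken from the text.

```python
def min_period(d_comb, n, d_ck_out, d_setup, d_wire=0.0, skew=0.0):
    """Lower bound of the pipeline period according to Eq.1."""
    return d_ck_out + d_comb / n + d_setup + d_wire + skew


def speedup(d_comb, n, **overheads):
    """Period of the one-stage circuit divided by the N-stage period."""
    return min_period(d_comb, 1, **overheads) / min_period(d_comb, n, **overheads)


# Hypothetical delays in nanoseconds (illustrative only).
OVERHEADS = dict(d_ck_out=0.5, d_setup=0.25, d_wire=0.5, skew=0.25)
D_COMB = 16.0

for n in (1, 2, 4, 8, 16):
    period = min_period(D_COMB, n, **OVERHEADS)
    useful = (D_COMB / n) / period  # fraction of the period spent in logic
    print(f"N={n:2d}  T>={period:5.2f} ns  "
          f"speedup={speedup(D_COMB, n, **OVERHEADS):4.2f}  "
          f"logic share={useful:.0%}")
```

With these assumed values, 16 stages give a speedup of 7 instead of 16, because the 1.5 ns of FF, wiring, and skew overhead is paid in every stage regardless of N.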

Fig.8: The expansion of the stage delay.
Fig.9 shows the relationship between FF delays (∆CK-OUT
+ ∆SETUP) and minimum ∆COMB for ten different technologies
[28]-[38] ranging from discrete logic to FPGAs. The TILO
parameter of Xilinx datasheets was taken as the minimum
combinational delay for LUT-based circuits, while the two-
input NAND was selected for discrete logic. In any case, the
period of a maximally fine-grain pipeline would be
practically fixed by FFs and wires. Even in high-
performance computers, the logic depth cannot be just one
gate or LUT. For example, the CRAY models 1, 2 and 3 had
a logic depth of 8, 4 and 6 levels respectively [26].

Fig.9: Ratio between the total FF delay (propagation + setup) and
minimum combinational delay for different technologies.
As a result, pipelining leads to a paradox. It is applied to
avoid inactive gates while waiting for new data (as
happens in a combinational circuit). But for fine-grain
pipelining, only 20-30 % of the period is involved in the data
processing. The rest of the time is necessary to synchronize
and transport the bits. After all, the gates still remain inactive
most of the time. But at least, they work in parallel.
While the delays of an FF are well-characterized
numbers, the nature of wiring delay is more complicated.
The wiring distribution depends on the number of stages N.
Pipelining expands the number of wires, changes the fanout
of the nets, and modifies the wiring distribution delay itself.
Some aspects that pipeline designers must take into account
about ∆W(N) are:
• Some pipeline directions change the nature of input
data wiring, passing from heavily loaded global or
broadcast lines to lines with a minimum fanout.
This point is explained in the next section.
• Pipelining usually improves the circuit routability
and reduces wiring congestion in FPGAs. In a
pipelined circuit, an FF that drives another FF is the
most common structure. Thus, of
all the pins associated with an FPGA logic element (a
k-LUT plus the associated FF), only two are utilized:
the input and output pins of the FF.
• In any well-routed circuit, the wiring histogram
follows a Pareto-Levi distribution [39]. That is, there
are lots of wires with low delays and few with the
highest delay values. In a combinational circuit, the
worst wire is not always part of the worst (critical)
path. But in a balanced pipeline, all paths are equally
critical for the minimum clock period. Normally, the
worst wire lies within the worst stage.

VI. GLOBAL LINES AND DIRECTIONS OF PIPELINING
The word pipeline evokes water and tubes. If friction is
neglected, the speed of the water is independent of the pipe
length [15]. However, in digital pipelines the circuit size
imposes a speed limit. Clock skew and heavily loaded global
lines grow with the pipeline size. In this section, the second
effect is illustrated.
In [40], Jump and Ahuja analysed the different directions
of pipelining for array multipliers. The idea is illustrated in
Figs.10 and 11. The array of Fig.10 has a typical structure of
communication. The horizontal data enter just one
processor. These are local communications, exhibiting a low
fanout that is almost independent of the array size. In
contrast, the vertical data are broadcasted: each one requires
a global line to reach a complete column of processors. The
fanout of the vertical global wires is a function of the array
size. This type of mixed communication is common in binary
multipliers where a column (or a row) of AND gates
concurrently calculates, for example, the partial products
A0B0, A0B1, A0B2 …, A0Bn. The signal A0 requires a global
line to reach all the ANDs, while any Bi is local [41].
Fig.11 shows another valid direction of pipelining (there are
other possible “angles” but they do not maximize speed).
From a topological point of view, neglecting wiring delays,
both pipeline directions have a granularity of one processor.
So, in the abstract they should reach the same speed. The
“vertical” pipeline option sounds better because it exhibits
lower latency and a smaller number of FFs. But the situation is
different in actual integrated circuits, where wiring delay is
dominant: for large array sizes, these global lines will
exhibit high delays.

Fig.10: Generic pipeline with global lines inside each stage.

Fig.11: Pipeline direction disrupting global lines.

The vertical line pipeline confines each of these global
lines inside a pipeline stage. So, the total delays of these
heavily loaded interconnections are a part of the clock
period. In contrast, the “45 degrees” pipeline breaks the
global lines, transforming them into almost local wires. The
effect is detailed in Fig.12, using the Y0 signal as an example.

Fig.12: Elimination of global lines by pipelining (detail).

Work [42] carries out a case study of the transformation
of global lines by the direction of pipelining. The target
technologies were 1-µm CMOS standard cells from the former
ES2 foundry [43], and Xilinx FPGAs. Two 16-bit array
multipliers were compared: the Hatamian-Cash [12] and the
McCanny-McWhirter [44]. Both pipelined circuits share the
same topology, but the first maintains global interconnections
inside each stage, while the second transforms these
interconnections into a set of local wires with fanout equal to
2.
Fig.13 shows the histogram of wiring capacitance for
each version: Hatamian-Cash (above) and McCanny-
McWhirter (below). In both graphs there are two similar
groups of wires. On the left, there is a large number of local
interconnections loaded with small capacitances. On the
right side, there is a set of heavily loaded lines corresponding to the
clock and global reset signals. The delay of these clock
branches does not directly affect the period. Only their maximum
difference (skew) increases the pipeline period (Section
VII).

Fig.13: Histograms of wiring capacitance.

The Hatamian-Cash pipeline exhibits fewer datapath nodes
(3115 versus 3986). But in the middle of its histogram there is a
group of global data lines that affect the clock period. These lines
do not exist in the fully local-line pipeline of McCanny-
McWhirter. As a result, the throughput of the latter for typical
delays is higher (154 versus 117 MOPS).
VII. SYNCHRONIZATION FAILURES IN PIPELINES
In the previous sections, it was shown that fine grain
pipelining easily increases the total number of FFs from a
few units to tens of thousands. A large number of FFs makes
a clock distribution tree indispensable to drive them
synchronously. And clock trees generate latency and skew.
The first effect is mitigated in FPGA technology by adding
digital PLLs to align external and internal clock edges. More
problematic is the clock skew: almost all the pipeline's paths
are vulnerable to it.
The clock skew is the maximum difference in time
between the same clock edge at two points of the die. The
factors that contribute to clock skew even in balanced trees
are differences in parameters like wiring length, distributed
RC, local temperature and voltage, FF trigger thresholds, and
finally buffer and FF propagation delays. The mixture of all
these components makes skew inevitable.
Fig.14 clarifies the effect of clock skew. In a single-phase
clock scheme, each clock edge initiates a race between the
data in D1 and D2. To work properly, the data in D2 must be
captured by the FF2 before being replaced by the data that
travels from D1. Moreover, after the arrival of the edge to
FF2, the previous data in D2 must still remain stable during
the hold time of the FF.

Fig.14: FFs vulnerable to clock skew in single-phase clocking.

As a consequence, for the worst case (wiring delay zero),
the maximum admissible clock skew is:
SKEW = ΔCLK2- ΔCLK1 < ΔCK-OUT - HOLD (2)
As a rule-of-thumb, chip designers know that clock skew
must always be lower than the FF propagation delay. This
problem was called double-clocking by Fishburn [45]
because one bit passes through two FFs on one clock edge. It
is also known as a short-path fault. That is, the longest path
is not the only one that generates problems in digital systems. An
important fact is that the clock period is not present in (2). If
a circuit violates (2), it will not work at
any frequency. Another aspect to consider is the sign of the
skew. If it has the opposite sign, the clock edge arrives first at
FF2, and the risk of double-clocking is reduced. Eq. 2 is
completely applicable to pipeline circuits where FF chains
are very common. For example, in the 8-bit fine-grain
pipeline RCA of Fig.4, 77 % of the FFs only drive
another FF.
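The short-path condition can be checked with a few lines of code. The Python sketch below is a minimal illustration with hypothetical delay values. It also keeps an optional wiring delay between the two FFs, which is an assumption of this example: the new datum reaches D2 at ΔCLK1 + ΔCK-OUT + ΔW, and it must arrive after ΔCLK2 + HOLD. Setting the wiring delay to zero gives the worst case of Eq.2.

```python
def max_safe_skew(d_ck_out, hold, d_wire=0.0):
    """Largest admissible skew; Eq.2 is the case d_wire = 0."""
    return d_ck_out + d_wire - hold


def double_clocking(clk1, clk2, d_ck_out, hold, d_wire=0.0):
    """True if the FF1 -> FF2 path suffers a short-path fault.

    clk1 and clk2 are the arrival times of the same clock edge at FF1
    and FF2. The clock period does not appear anywhere in the check.
    """
    skew = clk2 - clk1
    return skew >= max_safe_skew(d_ck_out, hold, d_wire)


# Hypothetical values in nanoseconds: clock-to-output 0.5, hold 0.25.
print(max_safe_skew(0.5, 0.25))                  # 0.25
print(double_clocking(1.0, 1.5, 0.5, 0.25))      # True: skew of 0.5 is too large
print(double_clocking(1.0, 1.125, 0.5, 0.25))    # False: skew of 0.125 is safe
print(double_clocking(1.0, 0.5, 0.5, 0.25))      # False: the edge reaches FF2 first
```

Because the function takes no period argument, slowing the clock cannot repair a failing path; only the skew, the FF delays, or the wiring between the two FFs can.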
The second source of synchronization failure is more
evident. It is called a long-path fault and states that the
minimum clock period must be larger than the delay of the worst
pipeline stage. In [46], it was proposed to calculate the clock
period using a circular pipeline, in order to include the I/O
pin delays in the computation of the period. The condition to
avoid a long-path fault is indicated in (1). The worst effect of
the skew occurs if the slower clock line triggers the input FF
and the faster clock line triggers the output FF. In such a
particular combination, the value of the skew is added to the
clock period.

VIII. CONCLUSIONS
This tutorial reviewed the main aspects of the pipelining
of digital circuits. Students interested in integrated circuits
can discover several points of research interest. However,
some important related issues like wave pipelining, the
relationship between pipeline and power consumption, and
the self-timed synchronization exceed the length available
for this work.
If pipelining is a masterpiece of the classical period of
digital electronics, wave pipelining (WP) is an example of
the baroque period. Both techniques were contemporary. WP
was summarized in [8] in 1967: “…If a section of
combinatorial logic, such as the logic to execute an add,
could be designed with equal delay in all parallel paths
through the logic, the rate at which new inputs could enter
this section of logic would be independent of the total delay
through the logic…”. Leonard Cotten describes the same
concept with different words in 1969: “…It is possible for
max-rate pipeline machines to operate at high rates
determined by path differences, rather than the conventional
maximum delay...” [7]. The WP technique speeds up a
circuit without using intermediate FFs. In a WP all the paths
are equalized; therefore, several “waves” of data can
propagate along the circuit without interference between the
fastest bit of a new data and the slowest bit of the previous
one. The equalization must be immune to temperature and
voltage variations. WP allows the designer to obtain a unique
combination of fine-grain pipeline speed and the latency of
the original combinational circuit. The technique was studied
in detail in [47], [48]. An example of wave pipeline in an
LUT-based FPGA is described in [49], [50].
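The idea in the two quotations above can be put into numbers with a deliberately simplified model. In the Python sketch below the rate is bounded by the difference between the slowest and fastest paths plus a lumped margin; the margin parameter (standing for register overhead and safety against temperature and voltage variations), the function names, and all delay values are assumptions made only for illustration.

```python
def conventional_period(d_max, margin):
    """A new datum waits for the slowest path of the previous one."""
    return d_max + margin


def wave_period(d_max, d_min, margin):
    """Simplified wave-pipeline bound: set by the path difference."""
    return (d_max - d_min) + margin


# Hypothetical delays in nanoseconds.
D_MAX, D_MIN, MARGIN = 16.0, 12.0, 1.0

t_conv = conventional_period(D_MAX, MARGIN)
t_wave = wave_period(D_MAX, D_MIN, MARGIN)
print(t_conv, t_wave)      # 17.0 5.0
print(D_MAX / t_wave)      # 3.2 waves travelling through the logic at once

# With perfectly equalized paths the bound no longer depends on the
# total delay through the logic.
print(wave_period(16.0, 16.0, MARGIN), wave_period(64.0, 64.0, MARGIN))  # 1.0 1.0
```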
Another important aspect of pipelining is its hidden
relationship with power consumption. The fact that
pipelining can reduce power defies common sense. This is
especially true in FPGA technology, but negligible in
standard cell devices [51]. In a pipeline, the intermediate
lines of FFs prevent the propagation of glitches that,
otherwise, would produce a snowball effect in the activity of
large combinational circuits. If the synchronization power
overhead caused by the extra FFs is less than the datapath
glitch power reduction, pipelining saves power. The pipeline-
power rule was first reported in [52], [53]. An early
experimental verification for FPGA was performed in [54].
Since then, the effect has been verified in more than 34
experiments of 12 research groups in 8 countries using chips
that cover 17 years of FPGA technology [55].
Another variation of classical pipelining is the self-
timed technique. In this case, the clock tree is replaced by a
local handshake between processing elements. There is no
global clock line; only low-fanout wires of request-ack
signals. The origin of the technique is [56] but a solid line of
research was led by Steve Furber [57], [58]. An interesting
feature of self-timed circuits is their smooth requirement of
power supply current.

ACKNOWLEDGMENT
The author would like to thank all those who have
engineered electronic pipelines over the last 50 years.

REFERENCES
[1] “Company Timeline”, https://corporate.ford.com/history.html. Ford
Motor Company. Retrieved: 5/12/2018.
[2] J. Deverell, "Pipeline Iterative Arithmetic Arrays". IEEE Trans. on
Computers, pp.317-322. March 1975.
[3] H.T. Kung, “Why Systolic Architectures”, Computer, pp.37-46, Jan.
1982.
[4] N. Sherrin (Editor), “The Oxford Dictionary of Humorous
Quotations”, Oxford University Press, 1995.
[5] https://commons.wikimedia.org, File: Ford_assembly_line_-
_1913.jpg. Retrieved: 3/1/2019.
[6] L. Cotten, "Circuit Implementation of High-Speed Pipeline Systems",
Proc. Fall Joint Computer Conference, pp. 489-504, 1965.
[7] L. Cotten, "Maximum-rate pipeline systems", Proc. Spring Joint
Computer Conference, pp. 581-586, 1969.
[8] S. Anderson, J. Earle, R. Goldschmidt, and D. Powers, "The IBM
system/360 model 91 floating point execution unit", IBM Journal Res.
Development, Vol.11, pp. 34-53, Jan 1967.
[9] M. Flynn, “Very High-speed Computing Systems”, Proceedings of
the IEEE, Vol. 54, No. 12, December, 1966.
[10] T. Hallin and M. Flynn, "Pipeline of Arithmetic Functions". IEEE
Trans. on Computer, pp.880-886. August 1972.
[11] D. Henlin, M. Fertsch, M. Mazin and E. Lewis. "A 16 bit x 16 bit
Pipelined Multiplier Macrocell". IEEE Journal of Solid-State Circuits,
Vol.SC-20, No.2, pp.542-547. Apr. 1985.
[12] M. Hatamian and G.L.Cash. "A 70-MHz 8-bit x 8 bit Parallel
Pipelined Multiplier in 2.5-um CMOS". IEEE Journal of Solid-State
Circuits. August 1986.
[13] T. Noll, D. Schmitt-Landsiedel, H. Klar and G. Enders, "A Pipeline
330-MHz Multiplier", IEEE Journal of Solid-State Circuits, Vol. SC-
21, pp. 411-416, Jun. 1986.
[14] M. Santoro and M. Horowitz, "A Pipelined 64x64-bit Iterative
Multiplier", IEEE J.of Solid-State Circuits, VOL. 24, n2, pp.487-
493, Apr. 1989.
[15] H. V. Jagadish, R.G. Mathews, T. Kailath and J.A. Newkirk. "A
Study of Pipelining in Computing Arrays". IEEE Transactions on
Computers, vol. C35, No5 . May 1986.
[16] F. Lu, H. Samueli, J. Yuan and S. Svensson, "A 700-MHz 24-bit
pipelined accumulator in 1.2 µm CMOS for Application on
Numerically Controlled Oscillators", IEEE Journal of Solid-State
Circuits, Vol.28, N.8, pp.878-885, August 1993.
[17] Atmel Corp, "SClib ATMEL ATC18", Datasheet Version: 1.5.5-
1.0.0, Jan 2002.
[18] Xilinx Inc., “All Programmable 7 Series Product Selection Guide”,
https://www.xilinx.com/support/documentation/selection-guides/7-
series-product-selection-guide.pdf.
Retrieved:12-01-2018.
[19] Xilinx Inc., “Using Look-Up Tables as Shift Registers (SRL16) in
Spartan-3 Generation FPGAs”, XAPP465 (v1.1), May 2005.
[20] R. Lyon, "Two's Complement Pipeline Multipliers", IEEE
Transactions on Communications, pp. 418 - 425, Vol. 24 , Issue: 4 ,
Apr 1976.
[21] C. V. Ramamoorthy, “Pipeline Architecture", Computing Surveys,
Vol.9, No.1, March 1977.
[22] D. Gross, “Greatest business histories of all times”, John Wiley &
Sons, Inc. 1996.
[23] P. Cappello and K. Steiglitz. "A VLSI Layout for Pipelined Dadda
Multiplier". ACM Trans. on Computer Systems, Vol.1, No.2, May
1983.
[24] Boeing,http://www.boeing.com/resources/boeingdotcom/history/pdf/
Boeing_Chronology.pdf
[25] C. Hauck, C. Bamji and J. Allen, "The Systematic Exploration of
Pipelined Array Multiplier Performance", Proc. ICASSP 85,
pp.1461-1464. New York: IEEE Press, 1985.
[26] H. Bakoglu, "Circuits, Interconnections, and Packaging for VLSI",
Reading, Massachusetts: Addison-Wesley Publishing Co. 1992.
[27] Burrel G. (Editor), "Crónica de la Técnica", Barcelona: Plaza & Janes
Publishers, 1989, pp. 524-525.
[28] Texas Instruments, “SN74S74 Dual D-type positive-edge-triggered
flip-flops with preset and clear”, SDLS119 – December 1983 –
revised March 1988.
[29] Texas Instruments, “SN74S00, Quadruple 2-input NAND” ,
December 1983.
[30] Xilinx Inc., “Virtex-5 FPGA Data Sheet: DC and Switching
Characteristics”, DS202 (v5.5) June 17, 2016.
[31] Xilinx Inc., “XC8100 FPGA family”, Version 1.0, June 1, 1996.
[32] Xilinx Inc., “Spartan and Spartan-XL FPGA Families Data Sheet”,
DS060 (v2.0) March 1, 2013.
[33] Phillips, “Fast TTL Logic Series”, Holland, 1999.
[34] Xilinx Inc. "XC6200 Field Programmable Gate Arrays", (Version
1.8) January 9, 1997.
[35] Xilinx Inc., “XC4000 Series Field Programmable Gate Arrays”,
Version 1.02, June 1, 1996.
[36] Motorola, “CMOS Logic Data”, 1990.
[37] Xilinx Inc., “Virtex-6 FPGA Data Sheet: DC and Switching
Characteristics, DS152 (v2.4)”, May 11, 2010.
[38] Xilinx Inc, “XC3000 Series Field Programmable Gate Arrays
(XC3000A/L, XC3100A/L)”, Nov. 9, 1998.
[39] W. Donath, "Wire Length Distribution for Placement of Computer
Logic", IBM J. of Res. Development, vol.25, nº3, May 1981.
[40] R. Jump and S. Ahura. "Effective Pipeline of Digital Systems", IEEE
Trans. on Computers, Vol. C-27, Nº9, pp.855-865, Sept. 1978.
[41] P. Song and G. de Micheli, “Circuit and Architecture Trade-offs for
High-Speed Multiplication”, IEEE Journal of Solid-State Circuits,
Vol.26, No.9, Sept 1991.
[42] E. Boemo, S. Lopez-Buedo, N. Acosta, and E. Todorovich, ”Local
versus Global Interconnections in Pipelined Arrays: An Example of
the Interaction between Architecture and Technology", Proc. XIV
DCIS Conference, pp.181-186, November 1999.
[43] European Silicon Structures, "ES2 ECPD10 Library Databook", Doc.
E01A09, 1993.
[44] J. McCanny and J. McWhirter, "Completely iterative, pipelined
multiplier array suitable for VLSI", IEE Proc. pp.40-46. Vol.129, Part
G, Nº2. April 1982.
[45] J. Fishburn, "Clock Skew Optimization", IEEE Trans. on
Computers, Vol.39, Nº7, pp.945-951, July 1990.
[46] K. Sakallah, T. Mudge, T. Burks and E. Davidson, "Optimal Clocking
of Circular Pipelines", Proceeding ICCD'91, pp.642-646. IEEE Press
1992.
[47] D. Wong, "Techniques for Designing High-Performance Digital
Circuits Using Wave Pipelining", Tech.Rep. nº CLS-TR-92-508,
Stanford University: Feb. 1992.
[48] C. Gray, W. Liu and R. Cavin, "Wave Pipelining: Theory and
Implementation", Norwell, MA: Kluwer Academic Publishers. 1992.
[49] E. Boemo, S. López-Buedo and J. Meneses, "The Wave Pipeline
Effect on LUT-Based FPGA Architectures", Proc. ACM FPGA 1996,
Monterey, Feb. 1996.
[50] E. Boemo, S. Lopez-Buedo, and J. Meneses, "Some Experiments
about Wave Pipelining on FPGAs", IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, Vol.6, No.2, June 1998.
[51] E. Boemo, S. Lopez-Buedo, C. Santos, J. Jauregui and J. Meneses:
“Logic Depth and Power Consumption: A Comparative Study
between Standard Cells and FPGAs”, Proc. Design of Circuit and
Integrated Systems Conference (1998).
[52] J. Leijten, “Analysis of Transition Activity and Power Dissipation in
Synchronous Logic Circuits”. Nat. Lab. Technical Note, no. 339/93,
Philips Electronics N.V. (1993).
[53] J. Leijten, J. van Meerbergen and J. Jess, “Analysis and reduction of
glitches in synchronous networks”, Proc. European Design and Test
Conference (1995).
[54] E. Boemo, G. Gonzalez de Rivera, S. Lopez-Buedo and J. Meneses:
“Some Notes on Power Management on FPGAs”. In: Field-
Programmable Logic and Applications FPL’95, LNCS, vol. 975,
pp.149-157, Springer-Verlag 1995.
[55] E. Boemo, J.P. Oliver, and G. Caffarena, "Tracking the Pipelining-
Power Rule along the FPGA Technical Literature", Proc. ACM 2013
FPGA World, Stockholm, Sweden. ACM, Sept. 2013.
[56] I. Sutherland, "Micropipelines", Communication of the ACM, vol.22,
nº6, pp.720-734. Jun. 1989.
[57] S. Furber, “Computing without clocks: Micropipelining the ARM
processor”, in Asynchronous Digital Circuit Design, pp.211-262,
Springer, London, 1995.
[58] J. Woods, P. Day, S. Furber, J.D. Garside, N. Paver, and S. Temple,
“AMULET1: An Asynchronous ARM Microprocessor”, IEEE
Transactions on Computers, Vol. 46, No. 4, April 1997.

DIE/Teoría/1. Intro FPGA y Zynq.pdf
Introduction to Xilinx Technology through Lab 1
Part 1: Introduction
Eduardo Boemo
Universidad Autónoma de Madrid
eduardo.boemo@uam.es
Which of these is the Lab 2021 chip?
• XC7Z010-1CLG400C
• XC = Xilinx Commercial
• 7Z = Zynq family
• 010 = "size" (amount of the various internal blocks). Also referred to as:
• "capacity" (not the pF kind)
• "equivalent gates"
• CLG400 = package and number of pins
• -1 = speed grade
• C = "Commercial"
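The field breakdown above can be sketched as a small decoder. This is only an illustration: the regular expression is an assumption that covers names shaped like XC7Z010-1CLG400C (the ordering printed on the later "Lab UAM" slide), and the function name `decode_part` is hypothetical, not part of any Xilinx tool.

```python
import re

# Assumed pattern: prefix, family, size, "-", speed grade, package, grade letter.
# It is a teaching sketch, not a complete Xilinx part-number grammar.
PART_RE = re.compile(
    r"^(?P<prefix>XC)(?P<family>7Z)(?P<size>\d{3})-(?P<speed>\d)"
    r"(?P<package>[A-Z]{3}\d{3})(?P<grade>[A-Z])$"
)


def decode_part(name):
    """Split a Zynq-7000 style part number into the fields named on the slide."""
    match = PART_RE.match(name)
    if match is None:
        raise ValueError("unrecognized part number: " + name)
    fields = match.groupdict()
    # The digits at the end of the package code give the number of pins.
    fields["pins"] = int(fields["package"][3:])
    return fields


for key, value in decode_part("XC7Z010-1CLG400C").items():
    print(key, "=", value)
```

Keeping the pin count as a separate integer field makes it easy to compare packages when discussing migration between devices with the same footprint.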
Xilinx nomenclature
What is Tj? Will it still work the day Madrid reaches 85° in July? Carnot?
400 pins and only 130 I/O? What about the rest?
Clock? System?
2 ARMs? Wasn't this an FPGA?
Not much memory?
XC7Z010-1CLG400C (UAM Lab)
CLG400 package
Migration to add logic while keeping the I/O (constant pin count)
• Example:
• A design processes 16 input bits and produces 16 output bits.
• The PCB has already been manufactured and suddenly more processing is required.
• What can be done?
Migration to add logic while keeping the I/O (constant pin count). Concept of pad-limited / core-limited circuits
• Packages are compatible across sub-ranges of FPGAs to ease the design of "core-limited" systems.
[Figure: Spartan-3 example. FT256 package, from 200K system gates to 1 million system gates (5X density range); VCC & GND pins and user I/O pins.]
Introduction to Xilinx Technology through Lab 1
Part 2: Lab 1
First FPGA design
• Vcc?
• SW?
• G15, P15, M14...?
• LD?
EPS 2021 development board
Development board:
• Low cost.
• Prototype to ease testing.
• Reference designs.
• Schematics included.
• Digilent: originally aimed at the university market.
XDC = Xilinx Design Constraints (file)
Introduction to Xilinx Technology through Lab 1
Part 3: EDA Tools
Masked-ASICs & FPGA Design Flows
Global
Placement
Detail Placement
Clock Tree Synthesis
and Routing
Global Routing
Detail Routing
Power/Ground
Stripes, Rings Routing
Extraction and
Delay Calc.
Timing
Verification
IO Pad Placement
Source: CSE241 VLSI Digital Circuits, Winter 2003, Lecture 08: Placement. Kahng & Cichy, UCSD
FPGA ≠ Masked-ASIC
FPGAs = platform chips.
That is, chips in which many blocks come predesigned, such as:
• VCC and GND (core and pins).
• Clock tree.
• Memories.
• Drivers.
Xilinx Vivado
https://reference.digilentinc.com/vivado/getting_started/start
The Navigator is broken into seven sections:
• Project Manager
• Allows for quick access to project settings, adding sources, language templates, and the IP
catalog
• IP Integrator
• Tools for creating Block Designs
• Simulation
• Allows a developer to verify the output of their design prior to programming their device
• RTL Analysis
• Lets the developer see how the tools are interpreting their code
• Synthesis
• Gives access to Synthesis settings and post-synthesis reports
• Implementation
• Gives access to Implementation settings and post-implementation reports
• Program and Debug
• Access to settings for bitstream generation and the Hardware Manager
Synthesis: VHDL (Verilog) text → HW primitives (transistors, gates, LUTs, etc.)
Source: ECE 645, Computer Arithmetic, George Mason U
→ Xilinx includes synthesis and simulation
Synthesis steps
Source: https://forums.xilinx.com/t5/Synthesis/what-exactly-is-elaborating-a-design/td-p/682043
• Elaboration: “Reading in your RTL file (which is text) and recognizing bits of
code that represent real hardware structures. Once recognized, these are
converted (in Vivado synthesis case) into "generic technology cells" -
abstract things like registers, adders, comparators, multiplexers, arbitrarily
wide gates, etc...
• Apply constraints to the design: “This step is necessary since the next steps
(high and low level optimizations) are timing driven, and hence need
constraints. But, constraints cannot be applied to your RTL (which is text) -
they need to be applied to a netlist. So elaboration creates the netlist of
generic technology cells.”
• Do high level optimizations of the design.
Synthesis steps
1. Load technology library into database
2. Analyze design
Load HDL models into database, check for
synthesizable models
3. Elaborate design
Technology-independent circuit (random &
structured logic)
4. Specify design constraints (timing, area)
5. Compile/optimize design
Optimize for the loaded technology library
Repeat as necessary to meet constraints
6. Generate technology-specific netlist(s)
7. Generate simulation timing data (SDF file)
8. Generate reports (cells, area, timing)
Source: http://www.eng.auburn.edu/~nelson/
Synthesis options
Example design flow on an educational FPGA (whiteboard)
library ieee;
use ieee.std_logic_1164.all;

-- lab1: purely combinational logic from four switches to four LEDs
entity lab1 is
  port (
    swt : in  std_logic_vector (3 downto 0);  -- switches
    led : out std_logic_vector (3 downto 0)   -- LEDs
  );
end lab1;

architecture rtl of lab1 is
  signal ledSig : std_logic_vector (3 downto 0);  -- internal copy of the LED values
begin
  ledSig(0) <= not swt(0);
  ledSig(1) <= swt(1) and not swt(2);
  ledSig(2) <= (swt(1) and not swt(2)) or (swt(2) and swt(3));
  ledSig(3) <= swt(2) and swt(3);
  led <= ledSig;
end architecture;
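Before running the design through synthesis, the expected LED pattern for every switch setting can be tabulated by hand or with a short script. The Python sketch below mirrors the four VHDL assignments bit by bit; the function name `lab1` and the tuple ordering (index 0 first) are choices made here for illustration only.

```python
from itertools import product


def lab1(swt):
    """Software mirror of the lab1 architecture.

    swt is a tuple (swt0, swt1, swt2, swt3) of 0/1 values;
    the result is the tuple (led0, led1, led2, led3).
    """
    s0, s1, s2, s3 = swt
    led0 = 1 - s0                          # not swt(0)
    led1 = s1 & (1 - s2)                   # swt(1) and not swt(2)
    led2 = (s1 & (1 - s2)) | (s2 & s3)     # OR of the two product terms
    led3 = s2 & s3                         # swt(2) and swt(3)
    return (led0, led1, led2, led3)


# Print the full 16-row truth table to compare against simulation or the board.
for swt in product((0, 1), repeat=4):
    print(swt, "->", lab1(swt))
```

Such a table is a convenient reference when checking the behavioural simulation or when flipping the switches on the board.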
Introduction to Xilinx Technology through Lab 1
Part 4: Zynq CLB and IOB
PL (programmable logic = FPGA) in Zynq: basic components
Zynq Configurable Logic Block (CLB)
• Repetitive, hierarchical structure.
• Similar to Virtex-7 (and to Virtex-6).
• 6-input LUTs (64-input MUX).
LUT → LE = Logic Element (LUT + FF) → SLICE (set of LEs) → CLB (set of slices)
DIE/Teoría/3. Timing 3 - Arbol de Reloj.pdf

Dispositivos Integrados Especializados
Escuela Politécnica Superior, Universidad Autónoma de Madrid

Topic: Synchronization in Integrated Circuits
Subtopic: Clock tree
Objectives: Understand the complexity of the logic and resources associated with the clock in FPGAs.
Bibliography: Class notes. Atmel Standard Cells data sheet.
© eduardo.boemo@uam.es

Proposed Problems

1. The attached table indicates that a Xilinx XC6VLX760 FPGA has 948480 flip-flops. Assume that the fan-in of
the clock input of each FF is 0.005 pF and that a logic 1 is 1.8 V. Calculate the current (that is,
assuming the capacitances are charged with a constant current) needed to take the clock of
all the FFs from logic 0 to logic 1 (a rising edge) in 1 ns.
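As a hedged sketch of the arithmetic behind problem 1: with constant-current charging, V = (i/C)·t, so i = C·ΔV/Δt, and the clock-input capacitances add because they are in parallel. The function name below is hypothetical, and track and buffer capacitances are ignored, as the problem statement implies.

```python
def clock_edge_current(n_ff, c_in_farads, v_high, t_edge_seconds):
    """Constant current needed to charge n_ff parallel clock inputs to v_high.

    Assumes an ideal constant-current source and no wire capacitance.
    """
    c_total = n_ff * c_in_farads              # capacitances in parallel add up
    return c_total * v_high / t_edge_seconds  # i = C * dV / dt


current = clock_edge_current(948480, 0.005e-12, 1.8, 1e-9)
print(f"Total clock capacitance: {948480 * 0.005:.1f} pF")
print(f"Required current: {current:.3f} A")
```

With the numbers from the statement this comes to roughly 8.5 A, which is the point of the exercise: no single gate can drive such a load, hence the need for a buffered clock tree.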

2. What is a driver? What is its design principle? Explain the structure of the driver in the figure, detailed at
transistor level.

3. How many D FFs of type dfnrq1 can an Atmel clk2d2 inverting buffer drive? Consider the (unrealistic) case in
which the track capacitance is negligible. Use the data sheets at the end of this guide.

4. Design a clock tree made of clk2d2 buffers to feed 44096 FFs of type DFNRB1 (44096 is
exactly the number of FFs in a Virtex XC2VP100 FPGA). Bear in mind that the number of buffers
inserted in series in the clock tree must be even (why?). Track capacitance is negligible.

5. For the previous tree, calculate the clock latency, that is, the time that elapses from the moment the clock enters
through the metal pin until it reaches each FF.

6. What is clock skew? Imagine that, due to a redesign, one of the branches of the previous tree drives only 15
FFs. What skew would this imbalance introduce?

7. What power do the FFs of the clock tree of problem 5 dissipate if they run at 300 MHz?

8. What would be the maximum admissible imbalance (expressed in capacitance and in number of FFs) between two branches of a
clock tree built with D FFs of type dfnrq1 and CLK2D2 buffers, if the project specifies that the skew must not
exceed 200 ps? As a first approximation, take the track capacitance to be zero.

9. A pulse that is at "1" for 100 ns passes through 25 CLK2D2 buffers connected in cascade. The signal
enters through the clk terminal and leaves through the c terminal (the non-inverting output of the buffer). Calculate the
pulse width (the time it stays at "1") at the output OUT. The last buffer also drives a total load equal to the
fan-in of one buffer.

10. Explain the meaning of each sentence of the European Space Agency document entitled "ASIC Design
and Manufacturing Requirements" (Issue 2, Oct'94), which specifies the following:

Synchronous Design: Wherever possible the design shall be synchronous according to the following
definition:

• Every latch and flip-flop within a clock region shall be connected to the same clock source
with only buffers inserted for the purpose of increasing the driving strength or using another
clock edge (no logic functions or memory elements are allowed).

• The clock tree for each clock region should be optimally balanced.

• The device function shall not be dependent on internal signal delays.

11. Why does the Atmel note "Clock Buffer Cell User Guide" indicate that the clock distribution in the figure on the
right is better?

12. The attached data sheet excerpt gives the duty-cycle distortion and the skew of a Xilinx clock tree.
Explain why each parameter arises.
DIE/Teoría/2. Timing 1 - Fundamentos.pdf
Eduardo I. Boemo
Escuela Politécnica Superior
Universidad Autónoma de Madrid
Ctra. de Colmenar Km.15. 28049 Madrid, España.
http://www.ii.uam.es/~ivan
e-mail: eduardo.boemo@uam.es
Timing in FPGAs and Standard Cells
Part 1: Fundamentals
Introduction
Problem: FPGA EDA tools hide several electronic concepts from the user who designs at a high level.
Consequences:
Hiding the electronic details does not hide their consequences on the final circuit.
The tools have advanced design options that require in-depth knowledge of some topics.
DIE strategy: use Standard Cell integrated circuits (no similar data are available from Xilinx) to help understand
the results of an FPGA implementation and improve the training of UAM designers.
Capacitance
Two conductors separated by an insulator can store charge.
Q → electric field → potential difference V.
C is defined as the ratio between the stored charge and the potential V.
C = Q/V [Coulomb/Volt = Farad].
C = εA/d (parallel-plate capacitor)
A = plate area
d = distance between plates
C = Q/V
If they are in parallel → V is the same for all of them:
C_TOTAL = C1 + C2 + … + Cn
Capacitances in parallel add up.
Charging and discharging a capacitor
Charging at constant voltage:
$V_{cap} = V_{cc}\,(1 - e^{-t/RC})$
Charging at constant current:
$V_{cap} = (i/C)\,t$
Energy stored in a capacitor C:
$E_{cap} = \tfrac{1}{2}\,C\,V^{2}$
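The three expressions above can be evaluated numerically to get a feel for the magnitudes. The sketch below is only illustrative: the function names are hypothetical and the component values (1 kΩ, 1 pF, 1.8 V, 1 mA) are assumptions chosen to be of the same order as the figures used elsewhere in these notes.

```python
import math


def v_constant_voltage(t, vcc, r, c):
    """Capacitor voltage at time t when charged from Vcc through R."""
    return vcc * (1.0 - math.exp(-t / (r * c)))


def v_constant_current(t, i, c):
    """Capacitor voltage at time t when charged with a constant current i."""
    return (i / c) * t


def stored_energy(c, v):
    """Energy stored in a capacitor C charged to voltage V."""
    return 0.5 * c * v * v


# Assumed example values: R = 1 kOhm, C = 1 pF, Vcc = 1.8 V, i = 1 mA, t = 1 ns.
print(v_constant_voltage(1e-9, 1.8, 1e3, 1e-12))  # after one time constant RC
print(v_constant_current(1e-9, 1e-3, 1e-12))
print(stored_energy(1e-12, 1.8))
```

Note that with a constant voltage source the capacitor only reaches about 63 % of Vcc after one time constant RC, whereas constant-current charging is linear in time.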
Source: EHU
Capacitance
The tracks of an integrated circuit are almost "textbook" capacitors.
RC as distributed parameters
Layer     Capacitance to ground (aF/µm)   Coupling capacitance (aF/µm)   Resistance/length (Ω/µm)
Metal 3   18                              9                              0.2
Metal 2   47                              24                             0.3
Metal 1   76                              36                             0.3
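To see what the per-micron figures in the table mean for a real track, the sketch below scales them by a wire length and estimates the delay of the wire treated as a distributed RC line. The 0.5·R·C factor is the usual Elmore estimate for a distributed line and is an assumption added here, not a figure from the slides; only the capacitance to ground is used, and the function name `wire_rc` is hypothetical.

```python
# Per-unit-length values copied from the table (aF/um and ohm/um).
LAYERS = {
    "Metal 3": {"c_gnd_aF": 18, "c_cpl_aF": 9, "r_ohm": 0.2},
    "Metal 2": {"c_gnd_aF": 47, "c_cpl_aF": 24, "r_ohm": 0.3},
    "Metal 1": {"c_gnd_aF": 76, "c_cpl_aF": 36, "r_ohm": 0.3},
}


def wire_rc(layer, length_um):
    """Return (R in ohm, C in farad, estimated delay in s) for a wire."""
    params = LAYERS[layer]
    c = params["c_gnd_aF"] * 1e-18 * length_um  # capacitance to ground only
    r = params["r_ohm"] * length_um
    delay = 0.5 * r * c  # Elmore estimate for a distributed RC line (assumption)
    return r, c, delay


r, c, delay = wire_rc("Metal 1", 1000.0)  # a 1 mm track on Metal 1
print(f"R = {r:.0f} ohm, C = {c * 1e15:.0f} fF, delay = {delay * 1e12:.1f} ps")
```

Because both R and C grow with length, the wire delay grows with the square of the length, which is why long tracks are so costly.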
Femto (f) = 10^-15
Atto (a) = 10^-18
Zepto (z) = 10^-21
Yocto (y) = 10^-24
µm = one millionth of a metre = one thousandth of a mm.
Diameter of a human hair: 15 (very fine) to 170 µm.
Capacitances of an FPGA: XC3020
Die area: 39,600 mil²
Matrix height (Y): 480 µm
Matrix width (X): 370 µm
Matrix transistor resistance: 0.5–1 kΩ
Matrix transistor parasitic capacitance: 0.01–0.02 pF
PIP transistor resistance: 0.5–1 kΩ
PIP transistor parasitic capacitance: 0.01–0.02 pF
Single-length line (X, Y): 370 µm, 480 µm
Single-length line capacitance: 0.075 to 0.1 pF
Horizontal Longline (8X): 8 cols. = 2960 µm
Horizontal Longline metal capacitance: 0.6 pF
Source: Smith
One of the few published sets of internal data for an FPGA
Timing in FPGAs and Standard Cells
Part 2: Concrete numbers
Intrinsic and extrinsic delays
1995: more than 25 years ago!
The delay of a gate is split into:
Intrinsic: the internal switching delay of the gate. It is a fixed value.
Extrinsic: the delay due to charging the capacitance that the gate output must drive. It is not fixed; it
depends on the load.
Capacitance and delay
Neil Weste, Kamran Eshraghian
Fan-in, fan-out and delays
Fanout (in pF):
• Is the total capacitance that a signal will have to drive; this includes
gate capacitance as well as interconnect capacitance.
• The fanout figure is based on output fall and rise time staying within
reasonable limits.
• At Xilinx, fanout simply means the number of inputs
connected to an output.
Propagation time:
• tpdhl: propagation delay, high-to-low.
• tpdlh: propagation delay, low-to-high.
• dtpdhl: differential (load-dependent) propagation delay.
• dtpdlh: differential (load-dependent) propagation delay.
Source: Atmel Corp. 2001
Intrinsic and extrinsic delay in Xilinx
Zynq-7000 SoC (Z-7030, Z-7035, Z-7045, and Z-7100): DC and AC Switching
Characteristics
How fast is an FPGA?
Interconnect delay in an FPGA?
Source: Xilinx Inc.
Pin delay in an FPGA?
I/O capacitances in an FPGA? Virtex-6 example
What happens to the timing of an IOB if you load a pin with more than
1 pF (e.g. an FPGA pin with 8 pF)?
Miscellaneous: how does Xilinx present its data?
Timing in FPGAs and Standard Cells
Part 3: How to reduce T?
Reducing track delay (capacitance) in high-level FPGA design
To "visualize" the total capacitance driven by a given gate, add up:
The fanout driven by each output.
The distributed capacitance of each wire.
Minimizing delay requires:
Reducing fan-out (here meaning the number of inputs driven by an output): C↓
Reducing wire length: C↓
Increasing drive strength: i↑
Lowering the temperature or raising Vcc
How to minimize capacitance when designing at a high level?
High delay or capacitance implies long tracks or a large fanout.
Minimize capacitance → minimize track length
Attributes: time_spec.
Minimize circuit size → reduce the number of bits
Manual placement
Minimize capacitance → limit fanout
Replicate HW
max_fanout attribute (Xilinx)
"Tweak" the default synthesis options
Synthesis options (Vivado 2020.1)
Fanout limiting in synthesis (ISE 14.7)
How to limit the fanout of a circuit?
[Figure: 4×4 array multiplier with inputs A3..A0 and B3..B0, partial products BiAj and outputs P7..P0]
Global (broadcast) signals are very common in digital circuits. How can they be avoided or limited?
How to limit the fanout of a circuit? Xilinx recommendation
Fanout limiting: VHDL and XCF.
See also:
EQUIVALENT_REGISTER_REMOVAL is a synthesis constraint. It enables or disables
flip-flop optimization related only to the flip-flops described at RTL level (instantiated
flip-flops are not removed).
Area (size) and delays: a case study in Xilinx
Family of multipliers from 2x2 to 128x128
Virtex-5, ISE 12.1
Automatic PPR.
LUT-based.
Delay = f(area)
Global tracks
Distance
Source: O. Lifschitz, UNS (Arg.)
Area (size) and delays: a case study in Xilinx
[Figure: histograms of net delay for the 16-bit, 32-bit, 64-bit and 128-bit multipliers; x axis: Delay [ns], y axis: # Events. Source: O. Lifschitz, UNS (Arg.)]
The concept of "deration": example
Tj and Vcc modify the nominal delays of the chip.
Vcc also has a quadratic influence on power.
Tj depends on the input data (activity).
Tj depends on the power-management strategy.
Merriam-Webster's Collegiate Dictionary: de.rate vt (1947): to lower the rated capability of
(as electrical or mechanical apparatus) because of deterioration or inadequacy.
ps and temperature
Source: Kryotech, Inc.,
1999
Freon cooling in the Cray-2
EPFL, © EIB
Lowering the temperature to increase speed is a well-known technique. Unfortunately, the reverse
is also true: speed decreases by 0.3 % per ºC of temperature increase.
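The 0.3 % per ºC rule of thumb can be turned into a quick derating estimate. The sketch below assumes a simple linear model around a reference temperature; the function name and the 100 MHz / 25 ºC / 85 ºC figures are hypothetical values chosen only to illustrate the arithmetic.

```python
def derated_frequency(f_ref_mhz, t_ref_c, t_c, loss_per_degree=0.003):
    """Linear derating: speed drops by loss_per_degree for every degree above t_ref_c.

    Assumption: the 0.3 %/degC figure is applied linearly over the whole range.
    """
    return f_ref_mhz * (1.0 - loss_per_degree * (t_c - t_ref_c))


# Hypothetical example: a design that runs at 100 MHz at 25 degC, evaluated at 85 degC.
print(f"{derated_frequency(100.0, 25.0, 85.0):.1f} MHz")
```

Under this linear assumption a 60 ºC rise costs 18 % of the speed, which is why timing must be checked at the worst-case junction temperature rather than at room temperature.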
DIE/Teoría/5. Fallos Sincronizacion.pdf

Dispositivos Integrados Especializados (Telecom) / Desarrollo de Procesadores y Sistemas Específicos (Informática)
Escuela Politécnica Superior, Universidad Autónoma de Madrid

Topic: Synchronization of Integrated Circuits
Subtopic: Double Capture and Null Capture
Objectives: Understand synchronization failures in integrated circuits
© eduardo.boemo@uam.es

Proposed Problems

1. Explain the timing parameters of a FF.
2. Explain the two synchronization failures of a circuit with registered I/O: long-path failure and
short-path failure (null capture and double capture).

3. In the attached FPGA timing-analysis file, explain why the skew and the setup are added to the path delay
to obtain the minimum period.

4. In the system of the figure, the FFs can have between 1 and 3 ns of propagation delay, while the clock
tracks (from ck to the clock input of the FF) vary between 0 and 2 ns. State the maximum
combinational delay the circuit can have if the minimum period of the system, for the worst combination
of delays, must be 200 ns. Assume a FF setup value s = 3.

=========================================================================
Timing constraint: Default period analysis for net $Net00003_
4434 items analyzed, 0 timing errors detected.
Minimum period is 38.749ns.
-------------------------------------------------------------------------
Delay: 38.749ns U2/so_6_1 to U2/co_7_7_r
38.743ns Total path delay (36.943ns delay plus 1.800ns setup)
0.006ns clock skew

Path U2/so_6_1 to U2/co_7_7_r contains 8 levels of logic:
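The figures in the report above can be cross-checked with a one-line sum: the tool reports the minimum period as path delay plus setup plus clock skew. The sketch below simply reproduces that addition; the function name is hypothetical and the sign convention for skew (added, as in the report) is taken from the listing itself.

```python
def minimum_period(path_delay_ns, setup_ns, skew_ns):
    """Minimum clock period as the report composes it: delay + setup + skew."""
    return path_delay_ns + setup_ns + skew_ns


t_min = minimum_period(36.943, 1.800, 0.006)
print(f"Minimum period: {t_min:.3f} ns")
print(f"Maximum frequency: about {1000.0 / t_min:.1f} MHz")
```

The sum matches the 38.749 ns that the report gives, which is the starting point for answering problem 3.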

5. Given the characteristics of the typical FF, what are the maximum and minimum values of skew that the
circuit of the figure can tolerate? Assume setup = 2 ns, hold = 1 ns and a propagation delay that varies in the interval
[3 ns, 7 ns] (MIN-MAX).

6. What will be the maximum (best case) and minimum (worst case) operating period for the circuit of the previous
problem, assuming a skew between 0.1 ns and 0.5 ns between A and B (the clock reaches point A with a delay between
0.1 ns and 0.5 ns with respect to B)?

7. In the following figure, calculate:

A) The maximum operating frequency.
B) The minimum delay the inverter must have for the circuit to work with complete safety. The attached table
summarizes the delays of the rest of the circuit. Q starts at Q=0.

The attached figure was taken from an application note by the FPGA manufacturer Actel (Microsemi). It
suggests that, to eliminate skew problems, a buffer (labelled BUFD) be inserted in the clock line, as
shown in the schematic.

Calculate the minimum delay this buffer should have in order to avoid double capture.
Use the following simplified circuit parameters. Assume identical FFs and the worst case from the designer's
point of view.

8. Suppose that in the previous problem the delay of BUFD is exactly 4 (max = min = 4). Calculate the minimum
clock period for which the circuit works, considering the worst-case combination of delays.

Element              Min delay [ns]   Max delay [ns]
FFs                  2                6
Track (and node) A   1                3
Track B              1                1.5
Track C              0                0.5
BUFF                 0.2              0.4
IPAD                 0                0
OPAD                 0                0
Setup                4
Hold                 5
[Figure labels: A, B, BUF, BUF, IPAD, IPAD, OPAD]
9. The circuit in the figure shows the following variation in its timing parameters after fabrication. There is no
correlation whatsoever between them; that is, one parameter may be at its maximum while another is at its minimum. Indicate:

a) Can double capture occur?
b) What delay must a buffer in B have so that double capture never occurs?
c) The maximum operating frequency for the least favourable combination of delays (worst case,
i.e. pessimistic) with the buffer from b).
d) The maximum operating frequency for the most favourable combination of delays (i.e. the optimistic, and
improbable, case in which the parameters take exactly the combination of values that maximizes the maximum
frequency). Keep the buffer in B.

Name    Type                Min value [ns]   Max value [ns]
A       Track delay         0                0
B       Track delay         1                2
C       Track delay         0                0
D       Clock track delay   1.1              1.2
E       Clock track delay   1.3              1.4
F       Clock track delay   1.5              1.6
G       Clock track delay   1.7              1.8
H       Propagation delay   0.5              0.6
V       Propagation delay   1.9              2.0
W       Propagation delay   1.9              2.0
S1=S2   Setup               3                -
H1=H2   Hold                0.3              -

10. At what maximum frequency will the circuit in the figure run for the most unfavourable (conservative)
combination of delays? And for the most favourable (optimistic) combination of delays?

11. In the system of the figure, the numbers in square brackets indicate the minimum and maximum delays of each
element. For example, the FFs can have between 1 and 3 (ns, for example) of propagation delay. You are asked to:

a) Give the maximum frequency of the system, for the worst combination of delays, for a FF setup value
s = 3.

b) Indicate, according to the values in the figure, for which sub-range of delay of the block called "combinational
circuit" the system will not work or will have hold problems, if the hold time is H = 0.5. In this
part, consider the worst case for the hold problem, that is: FF delay = 1; clock delay to the first FF = 1; clock
delay to the second FF = 5.

12. What do the delays TCKO, TDYCK, TDXCK and TILO mean in the table below?

DIE/Problemas/2020 TEMA 2 PARCIAL 1 - hojas min.pdf
DIE Midterm 1: December 2020 (4 points)

Surname and given names (in that order):

Last 3 digits of DNI + letter:

e-mail: DIE/EPS/UAM (© eduardo.boemo@uam.es)

P1 (1.4 points) P2 (1 point) P3 (0.8 points) P4 (0.8 points)
0.4

0.4 0.4 0.2

NOTE: For the exercise to be graded, attach all the auxiliary calculations that led you to the solution. Do not hand in
extra sheets.

Problem 1: The circuit in the figure shows a critical portion of a design mapped onto an FPGA. The LUT delays are
50 ps, while the FFs have 260 ps of propagation delay, 10 ps of setup and 12 ps of hold. The circuit is driven by
a clock tree with a maximum skew of 315 ps.

Calculate:

a) The minimum additional delay Δ2 that must be added to the original track delay W1+W2 running from the Q of FF3 to the D of
FF5, so as to ensure that the circuit tolerates the specified skew in the worst (pessimistic) case.
Each track has the delay (in ps) given in Table 1.

b) Budget the maximum value that track Δ1 can have so that the minimum operating period reaches 1220 ps
for the worst (pessimistic) case of skew. Each track has the delay (in ps) given in Table 1.
Take Δ2 = 55 ps.

c) The minimum period for the worst (pessimistic) case of skew, if the track delays are now those
given in Table 2.

W1    W2   W3    W4    W5   Δ1    Δ2
200   25   100   200   13   300   55
Table 2

d) Repeat part b using Table 1 again, but now being optimistic, that is, assuming the skew works in your
favour.

W1   W2   W3   W4   W5
10   15   40   5    8
Table 1

Problem 2: The clock-tree portion in the figure drives 11 FFs. All the FFs are initially set, that is, with their
Q outputs at logic "1". At t = 0 ns the input D1 goes from 1 to 0, and at t = 400 ps the CLK signal goes from 0 to 1. Taking t = 0 as the origin,
calculate the instant at which output Z goes from 1 to 0. Note that Q1 has a load C = 0.005 pF, and take the
track capacitance to be negligible. Use the attached delay table.

Problem 3: Using a combination of inverters and CMOS switches, build a 2-to-1 MUX. The multiplexer has a
control input Z, two data inputs B and A, and an output S. When Z=0, A → S, and when Z=1, B → S. Make a
precise and careful drawing of the circuit at transistor level, indicating where A, B, S and Z go.

INV    CMOS switch

Problem 4: Design and carefully draw a clock tree that drives exactly 107 FDs, using the INV component
as the clock buffer. Your main goal is to minimize skew and the secondary one is to reduce area. That is, once you have
minimized the skew (which will not be exactly zero because 107 is a prime number), you must minimize the number of INVs used.
Use the cells from problem 2. Take the track capacitances to be zero. The FDs must trigger when there is
a rising edge at the clock input.

DIE/Problemas/2021_DIE_DEySE_Celulas_CMOS.pdf
DIE/DEySE - Escuela Politécnica Superior - Universidad Autónoma de Madrid

Basic CMOS Gates
eduardo.boemo@uam.es

The exercises below can be checked with the CMOS app from the DSLab UAM.

Available at: https://play.google.com/store/apps/details?id=com.MOSCircuits

1. Using the table in the figure (where Pi means left P transistor, etc.), indicate the state of each transistor
(conducting, cut off, etc.) and deduce the logic level of output O. Which gate does it correspond to?

2. Repeat the previous exercise for the circuits in the figure.

3. Indicate which logic gate each of the following ATMEL standard cells corresponds to. What logic
function does the transistor pair P22 and N20 perform?

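[Figure: CMOS gate with two P and two N transistors between Vdd and GND, inputs I1 and I2, output O]

I2   I1   Pi   Pd   Ni   Nd   O
0    0
0    1
1    0
1    1

[Figures: two CMOS circuits with P and N transistors between Vdd and GND; the first with input I and output O, the second with inputs I1 and I2 and output O]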

4. The circuits in the figure show the internal schematics of two logic gates in CMOS technology. You are
asked to:

a. Deduce the logic function of each one.
b. Redraw each cell so that it performs the same function, but with 3 inputs.

5. Deduce the logic function performed by the following cell. Redesign it at transistor level starting from
AND, OR and INV gates.

6. The following figure shows 2 buffers. Indicate the logic function they perform and deduce why the
output transistors of the second buffer are duplicated.

7. The following figure shows a buffer with third-state (tri-state) control. Explain how it works.

8. Find the logic function of the following standard cell. Splitting the circuit into blocks is suggested.

9. Complete the truth table for the following standard cell:

C B A Z
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

10. Find the logic function of the following standard cell. Redesign it at transistor level using AND, OR,
and INV gates.

11. Design a 3-input NAND gate and a 3-input NOR gate.

12. Complete the truth table of the following circuit. State which well-known combinational block it corresponds to.

C B A Z
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

13. By connecting the necessary wires, map the following logic functions onto each transistor array, where /
denotes negation.

a. F = /(ABC+D)
b. F = /[(A+B)(C+D)]
c. F = /(AB+CD)
d. F = /[(A+B+C).D]
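
A mapping for Problem 13 can be sanity-checked by modeling each transistor network as a Boolean conduction function. The sketch below assumes the usual static CMOS arrangement (an NMOS pull-down network that conducts when the un-negated expression is true, and a PMOS pull-up network that is its series/parallel dual). The function names are hypothetical and only case (a), F = /(ABC+D), is shown.

```python
from itertools import product

def pull_down_conducts(a, b, c, d):
    # NMOS network for F = /(ABC+D): three series transistors (A, B, C)
    # in parallel with one transistor (D). An NMOS conducts when its gate is 1.
    return (a and b and c) or d

def pull_up_conducts(a, b, c, d):
    # Dual PMOS network: three parallel transistors (A, B, C) in series
    # with one transistor (D). A PMOS conducts when its gate is 0.
    return ((not a) or (not b) or (not c)) and (not d)

def check_gate():
    rows = []
    for a, b, c, d in product([False, True], repeat=4):
        down = pull_down_conducts(a, b, c, d)
        up = pull_up_conducts(a, b, c, d)
        # Exactly one network must conduct: no Vdd-GND short, no floating output.
        assert up != down
        rows.append(((a, b, c, d), up))  # the output is 1 when the pull-up conducts
    return rows

table = check_gate()
```

The same check applies to cases (b) to (d) by swapping in the corresponding series/parallel expressions.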

14. Analyze the following standard cell. State its logic function, what it can be used for, and its drive strength.

15. Complete the truth table of the following circuit, finding outputs K and Z as functions of C, B, and A. Verify that
it is a full adder in which K is the carry out and Z is the sum bit.
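
The table derived by hand for Problem 15 can be compared against a behavioral reference. This is a minimal sketch that assumes the standard full-adder equations (sum as a three-input XOR, carry as a majority function); it does not model the transistor-level circuit of the figure.

```python
from itertools import product

def full_adder(c, b, a):
    z = a ^ b ^ c                      # sum bit
    k = (a & b) | (a & c) | (b & c)    # carry out (majority of the three inputs)
    return k, z

def truth_table():
    # Rows in the same C B A order used by the tables in this problem set.
    return [(c, b, a) + full_adder(c, b, a) for c, b, a in product((0, 1), repeat=3)]

for c, b, a, k, z in truth_table():
    # A full adder must satisfy 2*K + Z = A + B + C on every row.
    assert 2 * k + z == a + b + c
    print(c, b, a, k, z)
```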

DIE/Problemas/Pipeline (Problemas).pdf
Dispositivos Integrados Especializados (Telecom) - Desarrollo de Procesadores y Sistemas Específicos (Informática)
Escuela Politécnica Superior, Universidad Autónoma de Madrid.

Topic: Pipelining
Objectives: Understanding the basic principles of high-speed digital design
Bibliography: E. Boemo, "Pipelining on FPGAs: A Tutorial," 2019 X Southern Conference on Programmable Logic
(SPL), 2019, pp. 53-60, https://ieeexplore.ieee.org/document/8714285

1. Explain the difference between spatial parallelism and pipelining (temporal parallelism).

2. When designing an ASIC or FPGA, speed can be increased at different levels of the design hierarchy:
topological, architectural, technological, and layout (physical design). Explain each of them.

3. Define the following pipelining terms: throughput, speedup, latency,
logic depth, granularity (β), and area penalty.
4.
5. A FA (full adder) is mapped with: a) Zynq LUTs, b) logic gates, and c) MOS transistors. Find the
logic depth in each case.

6. The figure shows an 8-bit RCA (ripple-carry adder). Assume that the delay of each FA (full adder) is 65 ps
and that the FF timing parameters are 260 ps, 40 ps, and 140 ps (propagation, setup, hold). These values
correspond approximately to the Zynq used in DIE in 2020. Assume that the modulus of the skew is 100 ps.
For simplicity, neglect track delays in your calculations.
Calculate the throughput, latency, speedup, and number of FFs for the following pipelined versions:

a) 1 stage (combinational with registered I/O on lines 1 and 9).
b) 2 stages (registers on lines 1-5-9).
c) 4 stages (registers on lines 1-3-5-7-9).
d) 8 stages (registers on all lines).
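
The timing part of Problem 6 can be explored numerically. The sketch below is not an official solution: it assumes the usual setup-limited model T_min = t_prop + t_logic + t_setup + |skew|, an even split of the 8 FAs among the stages, and a latency of one clock period per stage. The FF count is left out because it depends on the figure.

```python
T_FA_PS = 65       # delay of one full adder (problem statement)
T_PROP_PS = 260    # FF propagation delay (problem statement)
T_SETUP_PS = 40    # FF setup time (problem statement)
SKEW_PS = 100      # modulus of the clock skew (problem statement)
N_BITS = 8

def min_period_ps(stages):
    # Assumed model: the worst-case skew adds to the register overhead.
    fas_per_stage = N_BITS // stages
    return T_PROP_PS + fas_per_stage * T_FA_PS + T_SETUP_PS + SKEW_PS

def metrics(stages):
    period = min_period_ps(stages)
    return {
        "period_ps": period,
        "throughput_GHz": 1000 / period,       # one result per clock cycle
        "latency_ps": stages * period,         # assumed: one period per stage
        "speedup": min_period_ps(1) / period,  # relative to the 1-stage version
    }

for stages in (1, 2, 4, 8):
    print(stages, metrics(stages))
```

Under these assumptions the 8-stage version reaches a speedup of only about 2, because the 400 ps of register and skew overhead dominates the 65 ps of logic left in each stage.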

7. Derive the general expression that predicts the number of FFs in an RCA that operates on n-bit numbers and is
pipelined for maximum speed, as a function of n.

8. Pipeline the "3D" circuit on the right in two different versions: β=1 and β=2. Redraw it to make the
dependencies between PEs evident.

9. Verify the pipeline in the figure.

10. The delay of the PEs (elementary processors) in each of the following arrays is 50 ns. Neglecting the
delay of tracks and registers, calculate the latency and the final number of registers of each circuit
when it is pipelined in the most efficient way (that is, with the lowest area cost) to obtain: a) a bandwidth
of 20 MHz, and b) a bandwidth of 10 MHz.
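
For Problem 10, the target bandwidth fixes how many PEs can sit between two register lines. The following sketch assumes that register and track delays are zero (as the problem states) and that latency is counted as stages times clock period. The parameter `pes_on_longest_path` is hypothetical, since the arrays are only given in the figure.

```python
PE_DELAY_NS = 50  # delay of one PE (problem statement)

def pes_per_stage(bandwidth_mhz):
    # Clock period that delivers the requested bandwidth.
    period_ns = 1000 / bandwidth_mhz
    # Largest number of cascaded PEs that still fits in one period.
    return int(period_ns // PE_DELAY_NS)

def stages_needed(pes_on_longest_path, bandwidth_mhz):
    per_stage = pes_per_stage(bandwidth_mhz)
    return -(-pes_on_longest_path // per_stage)  # ceiling division

def latency_ns(pes_on_longest_path, bandwidth_mhz):
    # Assumed definition: number of stages times the clock period.
    return stages_needed(pes_on_longest_path, bandwidth_mhz) * 1000 / bandwidth_mhz

print(pes_per_stage(20), pes_per_stage(10))
```

Fewer register lines are needed at 10 MHz than at 20 MHz because two PEs fit in each 100 ns period instead of one.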

11. In the circuit of the figure, the delay of each PE is 500 ps, and the number beside each track gives its
delay in ps. Assume that the propagation delay of the registers is 260 ps, the setup time 40 ps, and the hold time
140 ps. Calculate the latency and the maximum operating frequency of each of the possible balanced
pipelinings of the circuit (three cases, corresponding to FF lines at a-b-c-d-e, a-c-e, and a-e). What is the
maximum speedup relative to the combinational circuit with registered I/O? Does the myth hold that
pipelining into N stages makes the circuit N times faster?

[Figure, Problem 11: PE array with register lines a, b, c, d, and e; the number beside each track gives its delay in ps]

12. Repeat the previous problem assuming that all tracks are "equalized"; that is, they have an identical
delay. Assume this delay is 4 ps, a number close to the average of the track values in the
circuit.

13. In the circuit of the figure, blocks 1, 2, and 4 have a delay of 300 ms;
block 5 has a delay of 100 ms; and blocks 3, 6, and 7 have a delay of 400 ms.
Assume that the delays associated with the FFs and the tracks can be neglected.

a) Transform the circuit into a maximum-speed pipeline. Draw it and give the
total number of FFs and the minimum period at which it can operate. The
pipeline must have at least registered input/output.

b) Transform the circuit into a pipeline that can operate with an 800 ms period and
has the minimum number of FFs. Draw it and give the resulting number of
FFs. The pipeline must have at least registered input/output.

14. A digital circuit consists of 10 blocks whose delays, in ns, are given in the attached table. Design a
pipeline by inserting FFs at the input/output (lines 1 and 5) and at each of the intermediate outputs (lines 2, 3, and 4).
The timing parameters of the FFs and the clock tree are negligible relative to the delays of the
blocks. Produce 3 versions of the pipeline so that it can operate with: a) Tmin = 2100 ns, b) Tmin = 3700 ns,
c) Tmin = 4300 ns.

15. Each block of the circuit in the figure has a delay of 2.5 ns. Pipeline it using the minimum number of FFs
so that the final circuit can produce one output datum every: a) 3 ns, b) 6 ns, c) 9 ns, and d) 12 ns. Assume
the FF delay is 0.260 ns and the setup time 0.01 ns, and neglect both the skew and the track delays.
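
For Problem 15, the first step is to find how many 2.5 ns blocks fit between two register lines once the FF overhead is subtracted from the period. This is a hedged sketch: it assumes the constraint T >= t_FF + k * t_block + t_setup, and it reads the setup time as 0.01 ns (the unit is incomplete in the statement). Exact fractions avoid floating-point rounding at the boundaries.

```python
from fractions import Fraction

BLOCK_NS = Fraction("2.5")      # delay of one block (problem statement)
FF_PROP_NS = Fraction("0.260")  # FF propagation delay (problem statement)
FF_SETUP_NS = Fraction("0.01")  # setup time; the unit "ns" is an assumption

def blocks_per_stage(period_ns):
    # Time left for logic after paying the register overhead.
    usable = Fraction(period_ns) - FF_PROP_NS - FF_SETUP_NS
    # Whole number of cascaded blocks that fits in that time.
    return int(usable // BLOCK_NS)

for period in (3, 6, 9, 12):
    print(period, blocks_per_stage(period))
```

The minimum FF count then follows from placing register lines no more than that many blocks apart in the circuit of the figure.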

DIE/Problemas/2020 TEMA 1 PARCIAL 1 - hojas min.pdf
DIE Midterm Exam 1, December 2020 (4 points)

Surname and given names (in that order):

Last 3 digits of DNI + letter:

e-mail: DIE/EPS/UAM (© eduardo.boemo@uam.es)

P1 (1.4 points: 0.4, 0.4, 0.4, 0.2) P2 (1 point) P3 (0.8 points) P4 (0.8 points)

NOTE: For the exercise to be graded, attach all the auxiliary calculations that led you to the solution. Do not hand in
extra sheets.

Calculate:

a) The minimum additional delay Δ2 that must be added to the track with original delay W4+W5, which runs from Q of FF3 to D of
FF5, to ensure that the circuit tolerates the specified skew in the worst (pessimistic) case.
Assume each track has the delay (in ps) given in Table 1.

b) The maximum value of the delay Δ1 for which the minimum operating period reaches 820 ps in the worst
(pessimistic) skew case. Assume each track has the delay (in ps) given in Table 1.

c) The minimum period in the worst (pessimistic) skew case, if the track delays are now those
given in Table 2.

d) Repeat part b, again using Table 1, but now being optimistic; that is, assuming that the skew works in
your favor.

Table 1 (track delays in ps)
W1 W2 W3 W4 W5
30 20 40 5 8

Table 2 (track delays in ps)
W1 W2 W3 W4 W5 Δ1 Δ2
20 25 90 5 8 0 80

Problem 2: The portion of clock tree in the figure drives 9 FFs. All are initialized with their Q outputs at zero. At t=0 ns
input D1 goes from 0 to 1, and at t=500 ps the CLK signal goes from 0 to 1.

Taking t=0 as the origin, calculate the instant at which output Z goes from 0 to 1. Note that Q1 has a load C = 0.003 pF, and
neglect track capacitance. Use the attached delay table.

Problem 3: Using a combination of inverters and CMOS switches, build a 2-to-1 MUX. The multiplexer has 2 data
inputs X and Y, a control input S, and an output C. When S=0, X → C; when S=1, Y → C. Make a precise, careful
transistor-level drawing of the circuit, indicating where X, Y, C, and S go.

[Figure: INV and CMOS switch symbols]

Problem 4: Design and carefully draw a clock tree that drives exactly 137 FDs, using the INV component
as the clock buffer. The primary goal is to minimize the skew and the secondary goal is to reduce area. That is, once you
manage to minimize the skew (which will not be exactly zero, since 137 is a prime number), you must minimize the number of INVs
used. Use the cells from the previous problem. Assume the track capacitances are zero. The FDs must trigger when there is
a rising edge at the clock input pin.
DIE/Problemas/Problemas_DIE_retardos/2021 - 8 DIE Timing 2 - Retardos SC.pdf

Dispositivos Integrados Especializados (Telecom)
Desarrollo de Procesadores y Sistemas Específicos (Informática y Inf. Matemática)

Escuela Politécnica Superior, Universidad Autónoma de Madrid

Topic: Delays in Integrated Circuits: Case Analysis Using Standard Cells
Objectives: Understanding an FPGA technical manual
Bibliography: Class notes. Atmel Standard Cells data sheet
© eduardo.boemo@uam.es

Recommended problems

Note: In all problems where the drive strength of the cell is not given, choose the one with the lowest value. Use
only 3 decimal places in calculations and results.

1. Why is the delay of a cell specified in two parts, one expressed in ns and the other in ns/pF?
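
The two-part specification in Problem 1 corresponds to a linear delay model: a fixed intrinsic term in ns plus a load-dependent term in ns/pF multiplied by the load capacitance in pF. The sketch below illustrates that model only. The coefficients are hypothetical placeholders, NOT values from the Atmel data sheet.

```python
def cell_delay_ns(intrinsic_ns, slope_ns_per_pf, load_pf):
    # Linear model: fixed internal delay plus a term proportional to the load.
    return round(intrinsic_ns + slope_ns_per_pf * load_pf, 3)

# Hypothetical coefficients for a low-drive and a high-drive cell.
low_drive = cell_delay_ns(0.10, 4.0, 0.300)
high_drive = cell_delay_ns(0.15, 1.0, 0.300)

# With a heavy load, the high-drive cell wins despite its larger intrinsic delay.
print(low_drive, high_drive, round(low_drive / high_drive, 3))
```

Results are rounded to 3 decimals, following the note at the top of this problem sheet.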

2. Calculate the maximum delay of the lowest-fanout XOR. Consider all cases, and assume it is loaded with 0.003
pF.

3. An AND gate must drive a load of 0.300 pF. The designer is considering an and02d0 or an
and02d4. Find how much faster one gate is than the other in the worst case. Express the result as
a factor (for example, it is 1.3 times faster).

4. An inexperienced designer wants to delay a signal by as close to 2 ns as possible (above this value) by passing it
through inverters. How many inverters are needed if the last one drives a load equal to the fanin of an inv?
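
The counting step in Problem 4 can be sketched as follows. The code assumes a single linear stage delay (intrinsic term plus ns/pF slope times the load), identical for rising and falling edges, with every inverter loaded by one inverter input. Rounding up to an even count to keep the signal polarity is an extra assumption, and the numeric values are hypothetical, not taken from the data sheet.

```python
import math

def inverters_needed(target_ns, intrinsic_ns, slope_ns_per_pf, fanin_pf,
                     keep_polarity=True):
    # Every stage sees the same load (one inverter input), so all delays match.
    stage_ns = intrinsic_ns + slope_ns_per_pf * fanin_pf
    # Smallest chain whose total delay is at or above the target.
    n = math.ceil(target_ns / stage_ns)
    # Assumption: an even count is needed if the signal must not be inverted.
    if keep_polarity and n % 2:
        n += 1
    return n, round(n * stage_ns, 3)

# Hypothetical cell data: 0.05 ns intrinsic, 5 ns/pF slope, 0.004 pF fanin.
print(inverters_needed(2.0, 0.05, 5.0, 0.004))
```

A real data sheet gives different rise and fall delays, so a hand calculation would alternate the two values along the chain.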

5. What is a glitch? Calculate the width of the glitch at the output of the circuit on the right if the number of inverters
is 180. Use the data for the lowest-fanout cells.

6. Calculate the maximum delay between X and Z. Assume that inputs A1 and A2, which are shown unconnected in the drawing,
are tied to "1" and that output Z drives a load of 0.008 pF.

7. Calculate the maximum delay of the circuit in the figure, assuming that the output of the last gate drives a
load of 0.005 pF. Neglect track capacitances. Assume that input a1 of the upper AND
is always at 1.

[Figure, Problem 7: circuit with AND inputs a1 and a2; labeled loads of 0.005 pF and 0.003 pF]


8. At t=0 a signal goes from 0 to 1 in an AND connected to an INV. Draw the waveform at the output of the AND,
giving the numerical value, rounded to 3 decimals, of the instants at which that output changes. Note:
use the lowest-fanout cells.

9. One technique for increasing the speed of a circuit is to duplicate a logic function (in this case an
AND) and halve its load. For example, the circuit in the left-hand diagram has 128 ANDs connected to
one AND, whereas
