#### ABSTRACT OF THESIS

## HDL IMPLEMENTATION AND ANALYSIS OF A RESIDUAL REGISTER FOR A FLOATING-POINT ARITHMETIC UNIT

Processors used in lower-end scientific applications, such as graphics cards and video game consoles, have IEEE single-precision floating-point hardware [23]. Double precision offers higher precision at a higher implementation cost and lower performance. The need for high-precision computation in these applications is not enough to justify the use of double-precision hardware and the extra hardware complexity it requires [23]. Native-pair arithmetic offers an interesting and feasible solution to this problem. This technique, invented by T. J. Dekker, uses single-length floating-point numbers to represent higher-precision floating-point numbers [3]. Native-pair arithmetic has been proposed by Dr. William R. Dieter and Dr. Henry G. Dietz to achieve better accuracy using standard IEEE single-precision floating-point hardware [1]. Native-pair arithmetic yields better accuracy; however, it slows addition and multiplication by factors of 11 and 17, respectively [2]. The proposed implementation uses a *residual register* to store the error residual term [2]. This addition is not only cost efficient but also results in acceptable accuracy with 10 times the performance of 64-bit hardware. This thesis demonstrates the implementation of a 32-bit floating-point unit with a residual register and estimates its hardware cost and performance.

Keywords: Native-pair, floating-point unit, residual, VHDL

## HDL IMPLEMENTATION AND ANALYSIS OF A RESIDUAL REGISTER FOR A FLOATING-POINT ARITHMETIC UNIT

BY

Akil Kaveti

Dr. William R. Dieter

**Director of Thesis** 

Dr. YuMing Zhang

Director of Graduate Studies

March 25, 2008

#### Rules for the use of theses

Unpublished theses submitted for the Master's degree and deposited in the University of Kentucky Library are as a rule open for inspection, but are to be used only with due regard to rights of the authors. Bibliographical references may be noted, but quotations or summaries of parts may be published only with the permission of the author, and with the usual scholarly acknowledgements.

Extensive copying or publication of the thesis in whole or in part also requires the consent of the Dean of the Graduate School of the University of Kentucky.

A Library that borrows this thesis for use by its patrons is expected to secure the signature of each user.

Name

Date

THESIS

Akil Kaveti

The Graduate School University Of Kentucky 2008

### HDL IMPLEMENTATION AND ANALYSIS OF A RESIDUAL REGISTER FOR A FLOATING-POINT ARITHMETIC UNIT

#### MASTERS THESIS

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering at the University of Kentucky

By

Akil Kaveti

Lexington, Kentucky

Director of Thesis: Dr. William Dieter

Electrical Engineering, University of Kentucky,

Lexington, Kentucky

2008

Copyright © Akil Kaveti 2008

#### ACKNOWLEDGEMENTS

Foremost I would like to thank my advisor, Dr. William R. Dieter for providing me the opportunity to do this thesis. I am grateful to him for his constant guidance and support. I would also like to thank Dr. Hank Dietz for his suggestions which have helped in improving this thesis.

I would like to thank my Candy for being such a wonderful person and for motivating me time and again.

Most importantly, I would like to thank my parents for their love and support. I dedicate my thesis to them.

## **Table of Contents**

- ACKNOWLEDGEMENTS
- Table of Contents
- List of Figures
- List of Tables
- List of Files
- Chapter 1. Introduction
  - 1.1. Computer Representation of Real Numbers
  - 1.2. Hardware Assistance of Native-Pair
  - 1.3. Thesis Organization
- Chapter 2. Background
  - 2.1. IEEE 754 Floating-point standard
  - 2.2. IEEE 754 Floating-point Arithmetic
  - 2.3. History of Double-Double Arithmetic
  - 2.4. The Residual Register
  - 2.5. Native-pair Addition and Subtraction
  - 2.6. Native-pair Multiplication
- Chapter 3. Native-pair Floating-point Unit
  - 3.1. Native Floating-Point Addition/Subtraction
  - 3.2. Native Floating-Point Multiplication
  - 3.3. Native-pair Floating-point Addition/subtraction
  - 3.4. Native-pair Floating-point Multiplication
  - 3.5. Debugging FPU Unit
  - 3.6. Examples
- Chapter 4. Testing and Results
- Chapter 5. Estimation of Hardware Cost and Performance
  - 5.1. Adder Implementation
  - 5.2. Multiplier Implementation
- Conclusion
- Appendix A
- Post-route simulations
- Appendix B
- High-level Schematics
- VHDL Source Code
- References
- Vita

## List of Figures

- Figure 1. Number line showing the ranges of single-precision denormalized and normalized floating-point numbers in binary system
- Figure 2. Ranges of overflow and underflow for single-precision floating-point numbers
- Figure 3. Basic floating-point addition algorithm
- Figure 4. Basic floating-point multiplication algorithm
- Figure 5. Residual register
- Figure 6. Native-pair addition data flow, conventional and residual algorithms
- Figure 7. Native-pair multiplication data flow, conventional and residual algorithms
- Figure 8. Prenormalization unit for Floating-point addition
- Figure 9. Addition unit for Floating-point addition
- Figure 10. Postnormalization unit for Floating-point addition
- Figure 11. Prenormalization unit in floating-point multiplication
- Figure 12. Multiplication unit for Floating-point multiplication
- Figure 13. Postnormalization unit for Floating-point multiplication
- Figure 14. Prenormalization unit for Native-pair Addition using Residual register
- Figure 15. Postnormalization unit for Native-pair addition with Residual register
- Figure 16. Postnormalization unit for Native-pair multiplication with Residual register
- Figure 17. Floating point arithmetic unit pipeline
- Figure 18. High-level schematic of FPU Adder
- Figure 19. High-level schematic of Prenormalization unit used in Floating-point addition
- Figure 20. High-level schematic of Addition unit used in Floating-point addition
- Figure 21. High-level schematic of Postnormalization Unit used in Floating-point addition
- Figure 22. High-level schematic of Residual register used in prenormalization and postnormalization
- Figure 23. High-level schematic of FPU multiplier
- Figure 24. High-level schematic of Prenormalization unit for Multiplier
- Figure 25. High-level schematic of Multiplier unit
- Figure 26. High-level schematic of Postnormalization for Multiplier

## List of Tables

- Table 1. Layouts for single and double precision numbers in IEEE 754 format
- Table 2. Representation of single-precision binary floating-point numbers
- Table 3. Different cases for sign and complement flag of residual register
- Table 4. Cases for complement flags and signs of residual register
- Table 5. Addition or subtraction cases based on opcode and signs of the operands
- Table 6. Comparison of Implementation cost and delay for Adders
- Table 7. Comparison of device utilization reports of Prenormalization unit for 32-bit FPU adder with and without residual register hardware
- Table 8. Comparison of device utilization reports of Postnormalization unit for 32-bit FPU adder with and without residual register hardware
- Table 9. Comparison of Implementation cost and delay of Multipliers
- Table 10. Comparison of device utilization reports of Postnormalization unit for 32-bit FPU multiplier with and without residual register hardware

## List of Files

| Name of figure                                                                          | Type | Size (KB) |
|-----------------------------------------------------------------------------------------|------|-----------|
| Figure 6. Native-pair addition data flow, conventional and residual algorithms          | .vsd | 50        |
| Figure 7. Native-pair multiplication data flow, conventional and residual algorithms    | .vsd | 60        |
| Figure 8. Prenormalization unit for Floating-point addition                             | .vsd | 114       |
| Figure 9. Addition unit for Floating-point addition                                     | .vsd | 96        |
| Figure 10. Postnormalization unit for Floating-point addition                           | .vsd | 107       |
| Figure 11. Prenormalization unit in floating-point multiplication                       | .vsd | 85        |
| Figure 12. Multiplication unit for Floating-point multiplication                        | .vsd | 50        |
| Figure 13. Postnormalization unit for Floating-point multiplication                     | .vsd | 113       |
| Figure 14. Prenormalization unit for Native-pair Addition using Residual register       | .vsd | 122       |
| Figure 15. Postnormalization unit for Native-pair addition with Residual register       | .vsd | 152       |
| Figure 16. Postnormalization unit for Native-pair multiplication with Residual register | .vsd | 113       |
| Figure 17. Floating point arithmetic unit pipeline                                      | .vsd | 64        |
| Addition testbenchwave                                                                  | .pdf | 130       |
| Additionwave1                                                                           | .pdf | 71        |
| Additionwave2                                                                           | .pdf | 69        |
| Fpumultfinal1                                                                           | .pdf | 32        |
| Fpumultfinal2                                                                           | .pdf | 44        |

#### **Chapter 1. Introduction**

This chapter briefly introduces the topics that are described in the later parts of the thesis. It starts by giving the reason for using floating-point numbers in computation. It then discusses floating-point arithmetic, the cost involved in implementing higher precision than the existing floating-point hardware provides, native-pair arithmetic and its use for better precision and accuracy, the performance-cost factor in native-pair arithmetic, and the extra hardware support that improves the performance-cost factor. The chapter ends with the author's motivation for working on a native-pair floating-point arithmetic unit and the organization of the thesis.

#### **1.1. Computer Representation of Real Numbers**

Real numbers can represent quantities with infinite precision and are used to measure continuous quantities. Almost all computations in physics, chemistry, mathematics and other scientific fields involve operations on real numbers. Computers can only approximate real numbers, most commonly as fixed-point or floating-point numbers. In a fixed-point representation, a real number is represented by a fixed number of digits before and after the radix point. Since the radix point is fixed, the range of fixed-point numbers is limited. Because of this fixed window of representation, a fixed-point format cannot represent both very small and very large numbers accurately; it is accurate only within the available range. A better way of representing real numbers is floating-point representation. Floating-point numbers represent real numbers in scientific notation. They employ a sliding window of precision, or number of digits, suited to the scale of a particular number, and hence can represent a much wider range of values accurately. Floating-point representation has a complex encoding scheme with three basic components: mantissa, exponent and sign. The use of binary numeration and powers of 2 resulted in floating-point numbers being represented as single-precision (32-bit) and double-precision (64-bit) floating-point numbers. Both single and double-precision numbers are defined by the IEEE 754 standard. According to the standard, a single-precision number has one sign bit, 8 exponent bits and 23 mantissa bits, whereas a double-precision number comprises one sign bit, 11 exponent bits and 52 mantissa bits.

Most processors designed for consumer applications, such as Graphics Processing Units (GPUs) and CELL processors, promise and deliver outstanding floating-point performance for scientific applications while using single-precision floating-point arithmetic hardware [23][6]. Video games rarely require higher accuracy in floating-point operations, so the high cost of the extra hardware needed to implement it is not justified. The hardware cost of higher-precision arithmetic is much greater than that of single-precision arithmetic. For example, one double-precision or 64-bit floating-point pipeline has approximately the same cost as two to four 32-bit floating-point pipelines [1]. Most applications use 64-bit floating point to avoid losing precision in a long sequence of operations, even though the final result may not need to be accurate to more than 32-bit precision. The extra precision is used so that the application developer does not have to worry about having enough precision. Native-pair arithmetic presents an opportunity to increase the accuracy of single-precision or 32-bit floating-point arithmetic without incurring the high expense of a double-precision or 64-bit floating-point implementation. Native-pair arithmetic uses two native floating-point numbers to represent the base result and the error residual term that would have been discarded in a native floating-point unit [23]. One higher-precision value is thus represented using two native floating-point numbers. This approach has been adapted from an earlier technique known as double-double arithmetic. Double-double arithmetic is the special case of native-pair arithmetic that uses two 64-bit double-precision floating-point numbers to represent one variable, the first floating-point number representing the leading digits and the second the trailing digits [17]. Similarly, in native-pair arithmetic, two 32-bit floating-point numbers are used to represent high and low terms, where the low component encodes the residual error of the high component's representation. Though this implementation results in higher accuracy without extra hardware, it also degrades performance [2].
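The high and low terms described above can be illustrated with a minimal Python sketch. It is only an illustration of the representation, not the thesis implementation: the helper names `to_f32` and `split_native_pair` are hypothetical, and Python's 64-bit floats stand in for the exact value being approximated.

```python
import struct

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def split_native_pair(x):
    """Represent x as a (high, low) pair of binary32 values.

    high holds the leading digits of x; low holds the residual
    error left over after rounding x to high.
    """
    high = to_f32(x)
    low = to_f32(x - high)
    return high, low

x = 1.0 / 3.0
high, low = split_native_pair(x)

err_single = abs(x - high)          # error of one 32-bit number
err_pair = abs(x - (high + low))    # error of the (high, low) pair
print(high, low, err_single, err_pair)
```

The pair recovers many more digits of `x` than the single 32-bit value, which is the accuracy benefit that native-pair arithmetic aims for.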

#### 1.2. Hardware Assistance of Native-Pair

To obtain acceptable accuracy with less performance loss, the addition of simple micro-architectural hardware is needed. Dieter and Dietz proposed a residual register to hold the information discarded after each floating-point computation [2]. This feature not only reduces the performance cost of native-pair arithmetic but also provides lower latency and better instruction-level parallelism. A residual register has one sign bit, 8 exponent bits and 25 mantissa bits [23]. The usage of the residual register depends on which operation is being performed and at which stage or stages bits are discarded.

The most widely used floating-point standard is the IEEE 754 standard. The IEEE 754 standard prescribes a particular format for representing floating-point numbers in the binary system, special floating-point numbers, rounding modes, and exceptions and how to handle them. Floating-point operations such as addition, multiplication, division and square root have three stages: prenormalization, arithmetic unit and postnormalization. In the case of addition and subtraction, prenormalization increases or decreases the exponent part to align the mantissa parts and calculates the sign bit of the final result. The arithmetic unit performs the basic arithmetic on the mantissa bits. The result may not be in the appropriate format, so it is sent to the postnormalization unit. In postnormalization, the result from the previous stage is aligned to the IEEE 754 format and rounded according to the rounding mode, and the number with its sign, exponent and mantissa bits is given as the final result.
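The three stages can be sketched in Python for a deliberately restricted case. This is a simplified illustration under stated assumptions, not the VHDL design of this thesis: both operands are positive normalized single-precision numbers, the result is truncated rather than rounded, overflow and special values are ignored, and the function name `add_positive_normalized` is hypothetical. The sketch also returns the value of the bits that truncation throws away, which is the kind of information a residual register is meant to hold.

```python
import math
import struct

def f32_bits(x):
    """Bit pattern of x as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Value of a 32-bit pattern interpreted as binary32."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def add_positive_normalized(a_bits, b_bits):
    """Return (sum_bits, discarded_value) for two positive normalized inputs."""
    # Prenormalization: extract fields, restore the hidden 1, align mantissas.
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000
    mb = (b_bits & 0x7FFFFF) | 0x800000
    if ea < eb:
        ea, eb, ma, mb = eb, ea, mb, ma
    shift = ea - eb
    # Alignment is done by scaling the larger operand up, so no bits are lost yet.

    # Arithmetic unit: integer addition of the aligned mantissas.
    total = (ma << shift) + mb          # scaled by 2**(eb - 127 - 23)

    # Postnormalization: keep the top 24 bits, truncate the rest.
    extra = total.bit_length() - 24
    kept = total >> extra
    lost = total & ((1 << extra) - 1)
    exp = eb + extra
    assert exp < 255, "overflow is not handled in this sketch"
    sum_bits = (exp << 23) | (kept & 0x7FFFFF)
    discarded = math.ldexp(lost, eb - 150)
    return sum_bits, discarded

s_bits, lost_value = add_positive_normalized(f32_bits(1.0), f32_bits(2.0 ** -24))
print(bits_f32(s_bits), lost_value)
```

In the usage line, the smaller operand lies entirely below the 24-bit window of the larger one, so the truncated sum equals the larger operand and the whole smaller operand appears as the discarded value.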

This thesis aims to show that residual register hardware, with a minimal increase in hardware cost, results in accuracy close to double precision and hence is a more economically feasible solution for higher-precision arithmetic than double-precision hardware. Native-pair arithmetic presents an opportunity for more accurate and precise floating-point processing, but it also results in a decrease in performance and an increase in implementation cost when compared with single-precision or 32-bit floating-point hardware [23]. Using a residual register as the extra hardware for storing the error residual term in native-pair arithmetic gives an implementation with a slight increase in hardware cost and performance close to that of single-precision hardware [23]. A floating-point arithmetic unit with a residual register is implemented, and its hardware utilization and maximum operating frequency are compared with those of the 32-bit and 64-bit floating-point arithmetic unit implementations. The main idea is to find the extra hardware cost and the performance drop that result from using the residual register: moving the discarded bits into it, updating it if bits are discarded more than once, and setting its sign and exponent. The implemented floating-point unit uses the residual register with addition, subtraction and multiplication. The extra hardware needed amounted to an increase of 18% in the adder and 12% in the multiplier. The minimum period also increased by 19% for the adder and 12% for the multiplier because of the extra hardware added in the critical path. The divide and square root portions of the floating-point unit are left unchanged.

A floating-point unit coded in VHDL was adopted as the basis for developing the native-pair floating-point unit [19]. The initial part of this thesis was to debug the code and make it fully pipelined so that it generates outputs on consecutive clock cycles. Signals were added to carry the input operands and the intermediate outputs through the pipeline to wherever they are needed. Signals that were present in the final stages and required the input operands in order to be set were moved to the starting stage, eliminating the need to carry the input operands. The native-pair floating-point unit is implemented by adding the residual register hardware to the debugged native floating-point unit. The debugged code is a single-precision or 32-bit floating-point unit and was scaled to serve as a 64-bit floating-point unit. The synthesis reports for the three implementations (the 32-bit version, the native-pair version or 32-bit with residual register, and the 64-bit version) were obtained using the Xilinx ISE 9.1 tool, and their resource utilizations and minimum periods are compared.

#### 1.3. Thesis Organization

Chapter 2 forms the background of this thesis. It discusses in detail the IEEE 754 floating-point arithmetic standard, IEEE 754 floating-point addition, subtraction and multiplication, native-pair arithmetic, and the native-pair arithmetic algorithms. Chapter 3 describes the working of the 32-bit floating-point unit and the native-pair floating-point unit with residual register. The different components of a floating-point unit are discussed in this chapter. Also covered in this chapter are where the residual register is added, how it is set or updated, when it is complemented, how its sign is set, and the usage of the MOVRR instruction. Chapter 4 describes how the native-pair floating-point unit is tested. This chapter covers the test benches used to test the implementation. Chapter 5 consists of the post map and route simulation reports and the synthesis reports of the native-pair floating-point unit, the 32-bit floating-point unit and the 64-bit floating-point unit.

Chapter 6 compares the synthesis reports and provides a more detailed analysis of the implementation. Chapter 7 concludes the thesis and discusses avenues for future research.

#### Chapter 2. Background

Hardware supporting different floating-point precisions and various formats has been adopted over the years. Among the earliest programmable and fully automatic computing machines, the Z3 was built of relays and performed calculations using 22-bit word lengths in binary floating-point arithmetic [21]. The first commercial computer supporting floating point, the Z4, had floating-point hardware that supported a 32-bit word length comprising a 7-bit exponent, 1 sign bit and 24 mantissa bits [22]. The second was the IBM 704 in 1954, whose floating-point hardware supported a format consisting of 1 sign bit, an 8-bit exponent and a 29-bit magnitude. IBM considered the 704 format to be single precision, and double precision, with a sign bit, a 17-bit exponent and a 54-bit magnitude, was introduced later in the IBM 7094 [20]. Digital Equipment Corporation's (DEC) PDP 11/45 had an optional floating-point processor. This processor is considered a predecessor to the IEEE 754 standard, as it had a similar single-precision format. The NorthStar FPB-A was an S100 bus, microprogram-controlled floating-point processor built from medium- and small-scale TTL parts and PROM memories to perform high-speed decimal floating-point arithmetic operations. It supported 2, 4, 6, 8, 10, 12 and 14 digit precision and a 7-bit base-10 exponent [25] [23]. The MASPAR MP1 supercomputer performed floating-point operations using 4-bit slice operations on the mantissa with special normalization hardware and supported the 32-bit and 64-bit IEEE 754 formats.

The CELL processor and most DSPs and GPUs support the IEEE 32-bit format. The Intel x87 floating-point mechanism allows 32-bit, 64-bit and 80-bit operands but processes these operands using an 80-bit pipeline [23] [6] [7]. The adoption of the IEEE 754 floating-point standard in 1985 greatly improved the portability of floating-point programs. This standard has been widely accepted and is used by most processors built since 1985.

#### 2.1. IEEE 754 Floating-point standard

The IEEE 754 floating-point standard is the most widely used standard for floating-point computations and is followed in most CPU and FPU (floating-point unit) implementations. The standard defines a format for floating-point numbers, special numbers such as infinities and NaNs, a set of floating-point operations, the rounding modes and five exceptions. IEEE 754 specifies four formats of representation: single precision (32-bit), double precision (64-bit), single extended ($\geq 43$ bits) and double extended ($\geq 79$ bits).

Under this standard, floating-point numbers have three components: a sign, an exponent and a mantissa. The mantissa has an implicit (hidden) leading bit, and the rest are fraction bits. The most used formats described by this standard are the single-precision and double-precision floating-point number formats, which are shown in Table 1. In each cell, the first number indicates the number of bits used to represent the component, and the numbers in square brackets specify the bit positions reserved for that component in the single-precision and double-precision numbers.

| Format           | Sign   | Exponent     | Fraction / Mantissa | Bias |
|------------------|--------|--------------|---------------------|------|
| Single-precision | 1 [31] | 8 [30 – 23]  | 23 [22 – 0]         | 127  |
| Double-precision | 1 [63] | 11 [62 – 52] | 52 [51 – 0]         | 1023 |

 Table 1. Layouts for single and double precision numbers in IEEE 754 format.

The Sign bit: A sign bit value of 0 is used to represent positive numbers and 1 is used to represent negative numbers.

The Exponent: The exponent field has 8 bits in single precision and 11 bits in double precision. The value is stored in unsigned format, and a *bias* is added to the actual exponent to get the stored exponent. For single precision the bias value is 127, and for double precision it is 1023. The actual exponent equals the stored exponent – 127 for single precision and the stored exponent – 1023 for double precision. Denormalized numbers and zero have all zeroes in the exponent field. Infinity and Not-a-Number (NaN) values have all ones in the exponent field. The range of the actual exponent for normalized numbers is from -126 to +127 for single precision and from -1022 to +1023 for double precision.

The Mantissa: Apart from the sign and the exponent, a floating-point number also has a magnitude part, which is represented by the mantissa field. For single precision the number of mantissa bits is 23, and for double precision it is 52. Each mantissa has a hidden bit, which is not shown when the floating-point number is represented in the IEEE format. All normalized floating-point numbers are adjusted to have this hidden bit equal to 1, so its value is understood and is not stored explicitly. Denormalized numbers have the hidden bit set to zero.

In general, floating-point numbers are stored in normalized form. This puts the radix point after the first non-zero digit. In normalized form, six is represented as $+6.0 \times 10^{0}$. In binary floating-point number representation, the radix point is placed after a leading 1. In this form six is represented as $+1.10 \times 2^{2}$. In general, a normalized floating-point number is represented as $\pm 1.f \times 2^{e}$. There is an implicit leading hidden 1 before the radix point and 23 visible bits after the radix point. The value of an IEEE 754 32-bit floating-point number can be computed from the sign bit (s), the 8-bit biased exponent field (e) and the 23-bit fraction field (f) by arranging them as follows: Value = $(-1)^{s} \times 2^{e-127} \times 1.f$
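As a worked check of the value formula for normalized numbers, the following Python sketch extracts s, e and f from a 32-bit pattern and evaluates $(-1)^{s} \times 2^{e-127} \times 1.f$. The function name `decode_normalized` is a hypothetical helper introduced only for this illustration.

```python
import struct

def decode_normalized(bits):
    """Evaluate (-1)**s * 2**(e - 127) * 1.f for a normalized binary32 pattern."""
    s = bits >> 31                 # sign bit
    e = (bits >> 23) & 0xFF        # 8-bit biased exponent field
    f = bits & 0x7FFFFF            # 23-bit fraction field
    assert 1 <= e <= 254, "pattern is not a normalized number"
    significand = 1 + f / 2 ** 23  # hidden 1 followed by the fraction bits
    return (-1) ** s * 2.0 ** (e - 127) * significand

# Bit pattern that the machine itself produces for the value six.
six_bits = struct.unpack("<I", struct.pack("<f", 6.0))[0]
print(hex(six_bits), decode_normalized(six_bits))
```

For six, the stored exponent is 129 (actual exponent 2) and the fraction encodes the bits after the leading 1 of $1.10_2$.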

When a nonzero number is being normalized, the mantissa is shifted left or right. Each time a left shift is performed, the exponent is decremented. If the minimum exponent is reached but further reduction is still required, the biased exponent is set to 0 and the number is called a denormalized number. Hence a number having all zeroes in its exponent field and at least a single 1 in its mantissa part is said to be a denormalized number. The IEEE 754 standard represents the denormalized number as follows: Value = $(-1)^{s} \times 2^{-126} \times 0.f$
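The denormalized formula can be checked in the same way. The sketch below, with the hypothetical helper name `decode_denormalized`, evaluates $(-1)^{s} \times 2^{-126} \times 0.f$ and compares the smallest pattern against the value obtained by reinterpreting the same bits as a single-precision number.

```python
import struct

def decode_denormalized(bits):
    """Evaluate (-1)**s * 2**-126 * 0.f for a denormalized binary32 pattern."""
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    assert e == 0 and f != 0, "pattern is not a denormalized number"
    return (-1) ** s * 2.0 ** -126 * (f / 2 ** 23)  # hidden bit is 0

smallest = decode_denormalized(0x00000001)   # only the last fraction bit set
largest = decode_denormalized(0x007FFFFF)    # all fraction bits set

# Cross-check against the hardware interpretation of the same bits.
hw_smallest = struct.unpack("<f", struct.pack("<I", 0x00000001))[0]
print(smallest, largest, hw_smallest)
```

The largest denormalized value stays just below $2^{-126}$, the smallest positive normalized number, so the two ranges meet without overlapping.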



Figure 1. Number line showing the ranges of single-precision denormalized and normalized floating-point numbers in binary system.

| Sign | Exponent             | Mantissa                                           | Value                                         |
|------|----------------------|----------------------------------------------------|-----------------------------------------------|
| 0    | 00000000             | 00000000000000000000000                            | $+0$                                          |
| 1    | 00000000             | 00000000000000000000000                            | $-0$                                          |
| 0    | 11111111             | 00000000000000000000000                            | $+\infty$                                     |
| 1    | 11111111             | 00000000000000000000000                            | $-\infty$                                     |
| 0    | 00000000             | 00000000000000000000001 to 11111111111111111111111 | Positive denormalized floating-point numbers  |
| 1    | 00000000             | 00000000000000000000001 to 11111111111111111111111 | Negative denormalized floating-point numbers  |
| 0    | 00000001 to 11111110 | any                                                | Positive normalized floating-point numbers    |
| 1    | 00000001 to 11111110 | any                                                | Negative normalized floating-point numbers    |
| 0/1  | 11111111             | 10000000000000000000000 to 11111111111111111111111 | QNaN (Quiet Not a Number)                     |
| 0/1  | 11111111             | 00000000000000000000001 to 01111111111111111111111 | SNaN (Signaling Not a Number)                 |

 Table 2. Representation of single-precision binary floating-point numbers
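The field boundaries in Table 2 can be turned into a small classifier. The following is a minimal sketch in C, assuming the standard 1/8/23-bit single-precision layout; the names `fp_class`, `classify_bits` and `sign_bit` are illustrative and do not come from the thesis implementation.

```c
#include <stdint.h>

/* Classes of Table 2 (enum names are illustrative). */
typedef enum {
    CLS_ZERO, CLS_DENORMAL, CLS_NORMAL, CLS_INFINITY, CLS_QNAN, CLS_SNAN
} fp_class;

/* Classify a raw 32-bit single-precision pattern by its fields. */
static fp_class classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;   /* 8 stored exponent bits */
    uint32_t fraction = bits & 0x7FFFFFu;       /* 23 fraction bits */

    if (exponent == 0)
        return fraction == 0 ? CLS_ZERO : CLS_DENORMAL;
    if (exponent == 0xFFu) {
        if (fraction == 0)
            return CLS_INFINITY;
        /* Most significant fraction bit set: quiet NaN, else signaling NaN. */
        return (fraction & 0x400000u) ? CLS_QNAN : CLS_SNAN;
    }
    return CLS_NORMAL;
}

/* Sign field: 0 for positive, 1 for negative. */
static int sign_bit(uint32_t bits)
{
    return (int)(bits >> 31);
}
```

For instance, the pattern `0x3F800000` (the value 1.0) falls in the normalized range, while `0x7F800001` has an all-ones exponent and a leading fraction bit of zero, which places it in the SNaN range.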

#### Exceptions

IEEE 754 floating-point standard defines five exceptions that are generally signaled using a separate flag. They are as follows:

- 1. Invalid Operation: Some operations, such as zero divided by zero, the square root of a negative number, or the subtraction of two infinities of the same sign, are invalid. The result of such an invalid operation is NaN (Not a Number). NaNs are of two types: QNaNs, or quiet NaNs, and SNaNs, or signaling NaNs. Their formats are shown in Table 2. Division of a nonzero finite number by zero is not an invalid operation; the standard signals it separately as the division-by-zero exception, which is the fifth of the five exceptions.

An invalid operation returns a QNaN and signals a QNaN or SNaN exception. An SNaN can never be the result of an operation; only its exception can be signaled, and this happens whenever one of the operands of a floating-point operation is an SNaN. The SNaN exception can therefore be used to flag operations on uninitialized operands, provided the uninitialized operands are set to SNaN. The usage of SNaN is not subject to the IEEE 754 standard.

- 2. Inexact: This exception is signaled when the result of an arithmetic operation cannot be represented exactly due to the restricted exponent range or mantissa precision.
- 3. Underflow: Two events cause the underflow exception to be signaled: tininess and loss of accuracy. Tininess is detected, either before or after rounding, when a nonzero result lies strictly between $-2^{-126}$ and $+2^{-126}$. Loss of accuracy is detected either when the result is simply inexact or only when a denormalization loss occurs.
- 4. Overflow: The overflow exception is signaled whenever the result exceeds the maximum value that can be represented due to the restricted exponent range. It is not signaled when one of the operands is infinity, because infinity arithmetic is always exact.



Figure 2. Ranges of overflow and underflow for single-precision floating-point numbers

#### Rounding modes

Precision is not infinite and sometimes rounding a result is necessary. To increase the precision of the result and to enable the round-to-nearest-even rounding mode, three bits are added internally and temporarily to the actual fraction: the *guard*, *round*, and *sticky* bits. While the guard and round bits are normal storage holders, the sticky bit is set to '1' whenever a '1' is shifted out of range.

As an example we take a 5-bit binary number: 1.1001. If we right-shift the number four positions, the number will be 0.0001; no rounding is possible and the result will not be accurate. Now, let's say we add the three extra bits. After right-shifting the number four positions, the number will be 0.0001 101 (remember, the last bit is '1' because a '1' was shifted out). If we round it back to 5 bits it will yield 0.0010, giving a more accurate result.
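The shift in this example can be sketched in C as follows. This is only an illustration under stated assumptions: the mantissa is held as a plain integer, and the names `shifted`, `shift_right_grs` and `round_nearest_even` are hypothetical helpers, not part of the thesis design.

```c
#include <stdint.h>

/* Kept mantissa bits plus the three extra bits. */
typedef struct {
    uint32_t mant;   /* bits that stay inside the mantissa width */
    unsigned grs;    /* guard (bit 2), round (bit 1), sticky (bit 0) */
} shifted;

/* Shift mant right by n positions, collecting guard, round and sticky. */
static shifted shift_right_grs(uint32_t mant, unsigned n)
{
    shifted r;
    uint64_t wide = (uint64_t)mant << 3;     /* room for guard/round/sticky */
    uint64_t lost;

    if (n > 34) {                            /* every bit is shifted out */
        r.mant = 0;
        r.grs = (mant != 0) ? 1u : 0u;       /* only the sticky bit survives */
        return r;
    }
    lost = wide & ((1ull << n) - 1ull);      /* bits that fall off the end */
    wide >>= n;
    r.mant = (uint32_t)(wide >> 3);
    r.grs = (unsigned)(wide & 7u) | (lost != 0 ? 1u : 0u);
    return r;
}

/* Round to nearest even using the three extra bits. */
static uint32_t round_nearest_even(shifted s)
{
    const unsigned half = 4u;                /* binary 100: exactly halfway */
    if (s.grs > half || (s.grs == half && (s.mant & 1u)))
        return s.mant + 1u;
    return s.mant;
}
```

Calling `shift_right_grs(0x19, 4)` on the pattern 11001 leaves the mantissa 00001 with extra bits 101, and `round_nearest_even` then returns 00010, matching the example above.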

The four specified rounding modes are:

- 1. Round to nearest even: This is the default rounding mode. The value is rounded to the nearest representable number. If the value is exactly halfway between two representable numbers, it is rounded to the one whose least significant digit is even. For example, in one digit base-10 floating-point arithmetic, 3.4 will be rounded to 3, 5.6 to 6, 3.5 to 4 and 2.5 to 2.
- 2. Round to zero: In this mode, the excess bits will simply get truncated. For example, in two digit base-10 floating-point arithmetic, 3.47 will be truncated to 3.4, and -3.47 will be rounded to -3.4.
- 3. Round up: In round up mode, a number will be rounded towards  $+\infty$ . For example, 3.2 will be rounded to 4, while -3.2 to -3.
- 4. Round down: The opposite of round-up, a number will be rounded towards  $-\infty$ . For example, 3.2 will be rounded to 3 while -3.2 to -4.
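The four modes differ only in when the truncated magnitude is incremented by one unit in the last place. A minimal sketch of that decision in C is given below, assuming the dropped bits are summarized as a 3-bit guard/round/sticky value; `round_mode` and `round_increment` are illustrative names.

```c
typedef enum { RND_NEAREST_EVEN, RND_TO_ZERO, RND_UP, RND_DOWN } round_mode;

/*
 * Returns 1 when the truncated magnitude must be incremented.
 * negative: sign of the value (1 = negative)
 * lsb:      lowest bit that is kept
 * grs:      guard/round/sticky summary of the dropped bits (0..7)
 */
static int round_increment(round_mode mode, int negative, unsigned lsb, unsigned grs)
{
    int inexact = (grs != 0);

    switch (mode) {
    case RND_NEAREST_EVEN:
        return grs > 4u || (grs == 4u && (lsb & 1u));
    case RND_TO_ZERO:
        return 0;                       /* excess bits are simply truncated */
    case RND_UP:
        return inexact && !negative;    /* towards +infinity */
    case RND_DOWN:
        return inexact && negative;     /* towards -infinity */
    }
    return 0;
}
```

In round up mode a positive inexact value has its magnitude incremented (3.2 becomes 4) while a negative one is truncated (-3.2 becomes -3); round down mirrors this behaviour.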

#### 2.2. IEEE 754 Floating-point Arithmetic

Apart from specifying the representation format, the rounding modes and the exceptions, the IEEE 754 standard also defines the basic operations that can be performed on floating-point numbers.

The Floating-point addition requires the following steps:

- 1. Aligning the mantissas to make the exponents of the two operands equal and calculating the sign based on the two operands. This exponent becomes the output exponent unless it is changed in Step 3.
- 2. The mantissa bits are added or subtracted depending on the signs of the operands.
- 3. The result from the addition has to be rounded and normalized in order to represent it correctly within the IEEE 754 floating-point format.

These three steps are implemented in the floating-point unit as three pipeline stages labeled prenormalization, addition unit and postnormalization. The three stages are explained in detail in Chapter 3. Subtraction is the same as addition except that the sign of the subtrahend is inverted before adding the two operands.

Floating-point multiplication also involves three steps:

- 1. Prenormalization: Multiplication does not require alignment of the mantissas in order to make the exponents of the operands equal. In multiplication, the exponents are added and the mantissa bits are transferred to the multiplication stage. The sign of the product is also calculated in this stage.
- 2. Multiplication: In this stage, the mantissa bits are multiplied using a multiplication algorithm. The product has twice as many mantissa bits as the multiplicands.
- 3. Postnormalization: The result from the multiplication is rounded and normalized to represent in the given precision format while updating the output exponent when required.



Figure 3 shows the basic algorithm for addition or subtraction of two floating-point numbers.

Figure 3. Basic floating-point addition algorithm

Consider a simple example for addition of two floating-point numbers:

Let's say we want to add two binary FP numbers with 5-bit mantissas:

A = 0|00000100|1001

$sign_a = 0; e_a = 00000100; frac_a = 1001$

B = 0|00000010|0010

$sign_b = 0; e_b = 00000010; frac_b = 0010$

1. Take the larger exponent and subtract the smaller exponent from it.

 $e_L = 4, e_S = 2$ , so diff = 4 - 2 = 2.

2. Shift the fraction with the smaller exponent *diff* positions to the right. We can now leave out the exponent since they are both equal.

This gives us the following:  $1.1001\ 000 + 0.0100\ 100$ 

- 3. Add both fractions: $1.1001\ 000 + 0.0100\ 100 = 1.1101\ 100$
- 4. Round to nearest even gives us 1.1110.
- 5. Result = 0|00000100|1110.
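The steps of this example can be reproduced with a small C sketch. It assumes a toy format with a hidden bit and four fraction bits, positive normalized operands only, and round to nearest even; `toy_fp` and `toy_add` are hypothetical names used only for illustration.

```c
/* Toy positive normalized number: value = 1.frac * 2^exp, frac has 4 bits. */
typedef struct { unsigned exp; unsigned frac; } toy_fp;

static toy_fp toy_add(toy_fp a, toy_fp b)
{
    toy_fp big = (a.exp >= b.exp) ? a : b;
    toy_fp small = (a.exp >= b.exp) ? b : a;
    unsigned diff = big.exp - small.exp;

    /* Hidden bit plus fraction, followed by guard, round and sticky bits. */
    unsigned mbig = (0x10u | big.frac) << 3;
    unsigned msmall = (0x10u | small.frac) << 3;
    unsigned sticky, sum, exp, kept, grs;
    toy_fp r;

    /* Step 2: align the smaller operand, remembering lost bits in sticky. */
    if (diff >= 8) {
        sticky = 1;
        msmall = 0;
    } else {
        sticky = (msmall & ((1u << diff) - 1u)) != 0;
        msmall >>= diff;
    }
    msmall |= sticky;

    /* Step 3: add the aligned fractions. */
    sum = mbig + msmall;
    exp = big.exp;
    if (sum & 0x100u) {                 /* carry out: shift right once */
        sum = (sum >> 1) | (sum & 1u);
        exp++;
    }

    /* Step 4: round to nearest even on the three extra bits. */
    kept = sum >> 3;
    grs = sum & 7u;
    if (grs > 4u || (grs == 4u && (kept & 1u)))
        kept++;
    if (kept & 0x20u) {                 /* rounding overflowed to 10.0000 */
        kept >>= 1;
        exp++;
    }

    r.exp = exp;
    r.frac = kept & 0xFu;
    return r;
}
```

With A (exponent 4, fraction 1001) and B (exponent 2, fraction 0010), the aligned sum is 1.1101 100, which is a tie and rounds to the even neighbour 1.1110, so the function returns exponent 4 and fraction 1110.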


The basic algorithm for floating-point multiplication is shown in Figure 4.



Figure 4. Basic Floating-point multiplication algorithm

Multiplication Example:

1. A = 0|01100100|1001

$sign_a = 0; e_a = 01100100; frac_a = 1001$

B = 0|01101110|0010

$sign_b = 0; e_b = 01101110; frac_b = 0010$

2. 100 and 110 are the stored exponents in decimal; logical exponents are obtained by subtracting the bias of 127 from them.

That is, the logical exponents in this case are 100-127 and 110-127.

3. Multiply the fractions and calculate the output exponent:

$1.1001 \times 1.0010 = 1.11000010$, so $frac_o = 1.11000010$.

Output exponent: the stored exponents add up to 100+110 = 210, which contains the bias twice, so the stored output exponent is 100+110-127 = 83.

$e_o = 83$

- 4. Round the fraction to nearest-even: $frac_o = 1.1100$
- 5. Result: 0|01010011|1100
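A comparable C sketch for the multiplication example is given below. It assumes the same toy format (hidden bit plus four fraction bits), positive operands, stored exponents with a bias of 127, and round to nearest even; `toy_fp` and `toy_mul` are hypothetical names.

```c
/* Toy positive normalized number with a stored (biased) exponent. */
typedef struct { unsigned exp; unsigned frac; } toy_fp;

static toy_fp toy_mul(toy_fp a, toy_fp b)
{
    unsigned ma = 0x10u | a.frac;          /* 1.xxxx as a 5-bit integer */
    unsigned mb = 0x10u | b.frac;
    unsigned prod = ma * mb;               /* up to 10 bits, 8 fraction bits */
    unsigned exp = a.exp + b.exp - 127u;   /* the sum holds the bias twice */
    unsigned sticky = 0, kept, rest;
    toy_fp r;

    if (prod & 0x200u) {                   /* product in [2,4): shift once */
        sticky = prod & 1u;
        prod >>= 1;
        exp++;
    }

    kept = prod >> 4;                      /* 1.xxxx */
    rest = prod & 0xFu;                    /* four dropped fraction bits */
    if (rest > 8u || (rest == 8u && (sticky || (kept & 1u))))
        kept++;                            /* round to nearest even */
    if (kept & 0x20u) {                    /* rounding overflowed */
        kept >>= 1;
        exp++;
    }

    r.exp = exp;
    r.frac = kept & 0xFu;
    return r;
}
```

For stored exponents 100 and 110 with fractions 1001 and 0010, the integer product is 111000010 (1.11000010), the dropped bits 0010 round down, and the function returns stored exponent 83 with fraction 1100.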

#### 2.3. History of Double-Double Arithmetic

Using single-length floating-point arithmetic to represent multi-length floating-point arithmetic was discussed by T. J. Dekker, whose research report describes algorithms based on this approach [3]. The report represents a double-length floating-point number as the sum of two single-length floating-point numbers, one of them being negligible in single-length precision. It also gives algorithms for basic operations like addition, subtraction and multiplication in the ALGOL 60 language. The Fortran-90 double-double precision system developed by D. H. Bailey uses two 64-bit IEEE arithmetic values to represent quad-precision values in the Fortran 90 programming language [4]. "Implementation of float-float operators on graphics hardware" discusses methods for improving the precision of floating-point arithmetic on GPUs. The paper discusses different algorithms by Dekker, Knuth and Sterbenz, and the results, performance, and accuracy of these methods [7]. It describes a framework for software emulation of float-float operators with 44 bits of accuracy and proves that these high-precision operators are fast enough to be used in real-time multi-pass algorithms [7]. The residual register algorithms discussed by Dieter and Dietz [23] and in this thesis can be used with these or other precision-extending algorithms.

Native-pair arithmetic is a more general term for double-double, encompassing precisions other than double. As with double-double, it uses an extra floating-point number to represent the error residual term resulting from a floating-point operation. A native-pair value does not have exactly double the precision of a single native value, due to the occurrence of zeroes between the two mantissas. These zeroes make the precision equal to the number of bits in the two mantissas plus the number of zeroes between the mantissas [23]. In this approach, a higher-accuracy value is spread across the mantissas of two native floating-point numbers, and the exponent of the lower component is used to align the mantissas [23]. The high component, called hi, takes the most significant bits, and those that are left, also referred to as the residual, are represented using the low component, called lo. The exponent of lo is less than the exponent of hi by at least N<sub>m</sub>, where N<sub>m</sub> is the number of mantissa bits in the native floating-point number. This means that if a higher precision value is spread over multiple native floating-point values, the exponents of consecutive lo components keep decreasing by N<sub>m</sub> [1].

When considering a pair of native floating-point numbers and a 32-bit native mantissa being spread across them, the pair will have twice the precision of the mantissa being spread only if the exponent of hi is at least $N_m$ greater than the native bottom of the exponent range [1]. That is, the dynamic range of the exponent is reduced by $N_m$ steps, or 10 percent. In a single-precision or 32-bit floating-point system, the precision is limited by the exponent range to less than 11 float values [1]. Also, when there are zeros at the top of the lower half of the higher precision mantissa, the exponent of the lo part is further reduced by the number of zeros and the zeros are absorbed [2]. If there are *K* zeros at the top of the lower half, the exponent of the lo part is reduced by *K*. This has the following implications:

- Some values requiring up to *K* bits more precision than twice the native mantissa can be represented precisely, as the *K* zeros that come between the top half and the lower-half are absorbed [1].
- If the adopted native floating-point does not represent denormalized numbers, the Low component may fall out of range sometimes. For example, if the High exponent was 24 above the minimum value and number of zeros K = 1, then the result has 25 bits only and not 48 bits as the stored exponent of Low would have to be -1, which is not representable in IEEE format [1].
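The hi/lo representation can be illustrated in C by splitting a double-precision value into a pair of single-precision values. This is a sketch only: it uses `float` as the native type and `double` as the higher-precision source, and `pair_from_double` and `pair_to_double` are hypothetical helper names.

```c
/* A native pair built from two single-precision values. */
typedef struct { float hi; float lo; } nativepair;

/* hi takes the leading mantissa bits, lo holds the residual. */
static nativepair pair_from_double(double x)
{
    nativepair r;
    r.hi = (float)x;                     /* rounded to the native precision */
    r.lo = (float)(x - (double)r.hi);    /* what hi could not represent */
    return r;
}

/* Recombine the pair; exact here because double is wide enough. */
static double pair_to_double(nativepair p)
{
    return (double)p.hi + (double)p.lo;
}
```

For the value $1 + 2^{-30}$, hi becomes 1.0 and lo becomes $2^{-30}$. The zeros between the two set bits are absorbed by the lower exponent of lo, so the pair represents a value that spans 31 bits even though lo uses only one mantissa bit.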

#### 2.4. The Residual Register

Native-pair arithmetic involves computing the error residual term of a floating-point operation and using it to perform further computations. This error residual computation is the major overhead in native-pair arithmetic. Dieter and Dietz proposed adding a residual register to save this leftover information [23]. The residual register stores only the mantissa bits, the exponent bits, the sign bit, and a complement flag. The value stored in the register need not be normalized immediately and has $N_m + 2$ mantissa bits with an implicit leading 1 bit. The same normalization hardware used for floating-point operations normalizes the residual value only when it is being moved into an architectural register. The complement flag indicates whether the residual value must be complemented before it is moved into the architectural register. The residual register is normalized by issuing a MOVRR instruction, which copies the residual register value into an architectural register after normalizing it into IEEE 754 format. In addition, each operation updates the residual register with a new error residual value.

| Sign  | Complement Flag | Exponent   | Mantissa    |
|-------|-----------------|------------|-------------|
| 1 bit | 1 bit           | ← 8 bits → | ← 25 bits → |

#### Figure 5. Residual register

Consider two floating-point numbers x and y, and let $\circ$ be an operation such as +, -, or ×. Let sign(x), exp(x) and mant(x) denote the sign, exponent and mantissa of x, respectively. $Fl(x \circ y)$ denotes the primary result of a floating-point operation and $Res(x \circ y)$ the residual of that operation. For the operations discussed here, namely addition, subtraction and multiplication, the primary result and the residual are related as $x \circ y = Fl(x \circ y) + Res(x \circ y)$. This property holds true only for the round to nearest mode when IEEE 754 format is used. Depending on which rounding mode is used, the sign of the residual register value is set accordingly [23]. The residual logic only needs to know whether the primary result was rounded up or down. Depending on this information, the sign and the complement flag of the residual register are set as follows:

- When $Fl(x \circ y) = x \circ y$, the primary result is correct and the residual value is zero.
- When $Fl(x \circ y) < x \circ y$, the primary result p has been rounded down to the floating-point value with the next lower magnitude. The residual r then takes the same sign as p to make $x \circ y = Fl(x \circ y) + Res(x \circ y)$.
- When $Fl(x \circ y) > x \circ y$, the primary result $Fl(x \circ y)$ has been rounded up to the next larger magnitude value. The residual r then takes the opposite sign to $Fl(x \circ y)$, so that $x \circ y = Fl(x \circ y) + Res(x \circ y)$ continues to hold.
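The three cases can be observed in software by computing the residual of a single-precision addition in double precision. The sketch below is a software stand-in, not the proposed hardware: it assumes round to nearest, and it is exact only when the two operands are close enough in exponent for their sum to fit in a double. The name `add_residual` is illustrative.

```c
/* Residual Res(x + y) = (x + y) - Fl(x + y), evaluated in double. */
static double add_residual(float x, float y)
{
    /* volatile forces the sum to be rounded to single precision */
    volatile float fl = x + y;
    return ((double)x + (double)y) - (double)fl;
}
```

With x = 1 and the unit in the last place of x equal to $2^{-23}$, adding 0.5 is exact and leaves a zero residual, adding $1.25 \times 2^{-23}$ rounds the primary result down and leaves the positive residual $2^{-25}$, and adding $1.75 \times 2^{-23}$ rounds the primary result up and leaves the negative residual $-2^{-25}$.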

#### 2.5. Native-pair Addition and Subtraction

Addition or subtraction of two floating-point numbers 'a' and 'b', with 'b' being the smaller of the two, involves shifting the smaller number to align its radix point with that of the larger number. When the signs of the two numbers are the same, the numbers are added, whereas in the case of opposite signs, the numbers are subtracted. The mantissa bits in the smaller number with significance less than  $2^{\exp(a) - (N_m + 1)}$  are stored in the residual register with the least significant bit in the rightmost position, and the exponent is set to exp(b) when  $exp(a) - exp(b) \ge N_m + 1$  and the complement flag is not set. When  $exp(a) - exp(b) < N_m + 1$  or the complement flag is set, the residual register gets the bits in b with significance ranging from  $exp(a) - N_m + 1$  down to  $exp(a) - 2(N_m + 1)$ . That is, the residual register value is just below the primary output value. In this case, the exponent is set to  $exp(a) - 2(N_m + 1)$  with the radix point assumed to be to the right of the least significant residual register bit. The sign and complement flag are set depending on the signs of 'a' and 'b', and on whether the result p is rounded up or down. The four cases that arise, depending on the signs of 'a' and 'b' and on whether the primary result is rounded up or down, are shown in Table 3 below:

| Case   | Sign of a | Sign of b | Rounded up / down | Complement flag | Sign of residual register: Sign(rr) |
|--------|-----------|-----------|-------------------|-----------------|-------------------------------------|
| Case 1 | Sign(a)   | Sign(a)   | Down              | Cleared         | Sign(a)                             |
| Case 2 | Sign(a)   | Sign(a)   | Up                | Set             | Opposite of Sign(a)                 |
| Case 3 | Sign(a)   | -Sign(a)  | Down              | Set             | Sign(a)                             |
| Case 4 | Sign(a)   | -Sign(a)  | Up                | Cleared         | Opposite of Sign(a)                 |

Table 3. Different cases for sign and complement flag of residual register
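Table 3 can be restated compactly as two exclusive-or expressions. The following C sketch is one possible reading of the table, with signs and flags encoded as 0 or 1; the XOR formulation and the names `rr_flags` and `rr_add_flags` are assumptions made for illustration, not the logic equations of the thesis hardware.

```c
/* Complement flag and sign of the residual register. */
typedef struct { int complement; int sign; } rr_flags;

/*
 * sign_a, sign_b: operand signs (0 = positive, 1 = negative)
 * rounded_up:     1 if the primary result was rounded up, 0 if down
 */
static rr_flags rr_add_flags(int sign_a, int sign_b, int rounded_up)
{
    rr_flags f;
    int signs_differ = (sign_a != sign_b);
    int up = (rounded_up != 0);

    f.complement = signs_differ ^ up;   /* set in Case 2 and Case 3 */
    f.sign = sign_a ^ up;               /* opposite of Sign(a) when rounded up */
    return f;
}
```

Each row of the table corresponds to one combination of `signs_differ` and `up`, so the four cases exhaust the inputs of this function for a given sign of a.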

#### Native-pair Arithmetic Addition Algorithms

The algorithms discussed here are native-pair arithmetic algorithms for normalizing and adding two native-pair numbers. Each algorithm can be implemented with and without using the residual register. Algorithm 1 shows the nativepair\_normalize function, which adds two native floating-point numbers to produce a native-pair result. Given an unnormalized high and low pair of native numbers, this function computes the normalized native-pair. In general, the normalized native-pair is created without using the residual register.

#### Algorithm 1. Native-pair normalization algorithm without using the residual register:

```
nativepair nativepair_normalize(native hi, native lo)
{
    nativepair r;
    native hierr;
    r.hi = hi + lo;
    hierr = hi - r.hi;
    r.lo = hierr + lo;
    return (r);
}
```

Algorithm 2 shows the use of the residual register in the nativepair\_normalize function. The hierr variable denotes the error residual computed from the hi component. The getrr () function is assumed to be an inline function that returns the residual register value using a single MOVRR instruction. Compared to Algorithm 1, Algorithm 2 does not need to compute hierr and, as a result, the number of instructions is reduced by one. Every basic operation ends by normalizing the result, so this reduction decreases the instruction count for every native-pair operation.

#### Algorithm 2. Native-pair normalization algorithm using the residual register:

```
nativepair nativepair_normalize (native hi, native lo)
{
    nativepair r;
    r.hi = hi + lo;
    r.lo = getrr ();
    return (r);
}
```
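Since the residual register is proposed hardware, Algorithm 2 cannot be run directly on a conventional processor. The sketch below emulates it in C under explicit assumptions: `float` stands in for the native type, the residual register is modeled by a `double` variable `rr_value`, and `rr_add` is a hypothetical wrapper that performs the addition and records its rounding error as the hardware would. The emulation is exact only while the double-precision sum of the operands is exact.

```c
typedef float native;
typedef struct { native hi; native lo; } nativepair;

/* Software stand-in for the residual register (assumption). */
static double rr_value;

/* Add two native values and record the residual of the addition. */
static native rr_add(native a, native b)
{
    volatile native s = a + b;     /* primary result in native precision */
    rr_value = ((double)a + (double)b) - (double)s;
    return s;
}

/* Stand-in for the MOVRR instruction. */
static native getrr(void)
{
    return (native)rr_value;
}

/* Algorithm 1: normalization without the residual register. */
static nativepair normalize_plain(native hi, native lo)
{
    nativepair r;
    volatile native sum = hi + lo;
    volatile native hierr = hi - sum;
    r.hi = sum;
    r.lo = hierr + lo;
    return r;
}

/* Algorithm 2: normalization using the emulated residual register. */
static nativepair normalize_rr(native hi, native lo)
{
    nativepair r;
    r.hi = rr_add(hi, lo);
    r.lo = getrr();
    return r;
}
```

For the inputs hi = $2^{24}$ and lo = 1, the native sum cannot hold the trailing 1, so both versions return hi = $2^{24}$ and lo = 1; the residual register version obtains lo without the extra subtraction and addition.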

Algorithm 3 describes the addition of b (native floating point number) to a (native-pair number). The algorithm adds b to hi component of a, computing the residual result and adding the residual result to lo component. It then normalizes the final result.

#### Algorithm 3. Addition of Native-pair number and a native number without residual register hardware.

```
nativepair nativepair_native_add (nativepair a, native b)
{
    native hi = a.hi + b ;
    native bhi = hi - a.hi;
    native ahi = hi - bhi;
    native bhierr = b - bhi;
    native ahierr = a.hi - ahi;
    native hierr = bhierr + ahierr;
    native lo = a.lo + hierr;
    return (nativepair_normalize(hi,lo));
}
```

Algorithm 4 describes the same addition of a native-pair number and a native number using the residual register. This version obtains the hierr component from the getrr () inline function and so eliminates the instructions that compute ahi, bhi, ahierr and bhierr. As a result, the number of instructions is reduced by four with respect to Algorithm 3, which does not use the residual register.

#### Algorithm 4. Addition of Native-pair number and a native number with residual register hardware.

```
nativepair nativepair_native_add (nativepair a, native b)
{
    native hi = a.hi + b;
    native hierr = getrr();
    native lo = a.lo + hierr;
    return (nativepair_normalize(hi,lo));
}
```

Algorithm 5 shows the addition of two native-pair numbers without using the residual register, and Algorithm 6 adds two native-pair numbers using the residual register. In Algorithm 5, the residual from adding the two high components is stored in ahierr or bhierr depending on the values of a and b. When a > b, bhierr contains the residual and ahierr is zero, and when b > a, ahierr contains the residual and bhierr is zero. Computing both terms in this way is faster than using a condition to decide which one to compute. The addition algorithm with the residual register takes 6 instructions, compared to 11 instructions for Algorithm 5.

#### Algorithm 5. Addition of two Native-pair numbers without residual register hardware.

```
nativepair nativepair_add (nativepair a, nativepair b)
{
    native hi = a.hi + b.hi;
    native lo = a.lo + b.lo;
    native bhi = hi - a.hi;
    native ahi = hi - bhi;
    native bhierr = b.hi - bhi;
    native ahierr = a.hi - ahi;
    native hierr = bhierr + ahierr;
    lo = lo + hierr;
    return (nativepair_normalize(hi,lo));
}
```

#### Algorithm 6. Addition of two Native-pair numbers with residual register hardware.

```
nativepair nativepair_add (nativepair a, nativepair b)
{
    native hi = a.hi + b.hi;
    native hierr = getrr();
    native lo = a.lo + b.lo;
    lo = lo + hierr;
    return (nativepair_normalize(hi,lo));
}
```
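Algorithm 6 can be exercised on ordinary hardware with a software model of the residual register. The sketch below is such a model and makes the following assumptions: `float` is the native type, `rr_value` and `rr_add` are hypothetical stand-ins for the residual register and for an ADD instruction that updates it, and the double-precision sum of the operands is exact.

```c
typedef float native;
typedef struct { native hi; native lo; } nativepair;

/* Emulated residual register (assumption, not the hardware itself). */
static double rr_value;

/* Native addition that also updates the emulated residual register. */
static native rr_add(native a, native b)
{
    volatile native s = a + b;
    rr_value = ((double)a + (double)b) - (double)s;
    return s;
}

/* Stand-in for the MOVRR instruction. */
static native getrr(void)
{
    return (native)rr_value;
}

/* Algorithm 2: normalization with the residual register. */
static nativepair nativepair_normalize(native hi, native lo)
{
    nativepair r;
    r.hi = rr_add(hi, lo);
    r.lo = getrr();
    return r;
}

/* Algorithm 6: addition of two native-pair numbers. */
static nativepair nativepair_add(nativepair a, nativepair b)
{
    native hi = rr_add(a.hi, b.hi);
    native hierr = getrr();          /* read before the next add overwrites it */
    native lo = rr_add(a.lo, b.lo);
    lo = rr_add(lo, hierr);
    return nativepair_normalize(hi, lo);
}
```

The residual must be read immediately after the addition of the high components, because every later operation replaces the register contents. Adding the pairs ($2^{24}$, 0) and (1, 0) gives hi = $2^{24}$ and lo = 1, so the value 16777217, which a single native value cannot hold, is preserved in the pair.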

Figure 6 shows the dataflow of the native-pair addition algorithm with and without the residual register. Each ADD or SUB instruction would typically have a latency of 4 clock cycles. The MOVRR instruction is assumed to have a latency of 2 clock cycles as a worst case. Native-pair addition without the residual register has 9 instructions in its critical path, giving a latency of $9 \times 4 = 36$ clock cycles. Addition with the residual register requires 3 ADD/SUB instructions and 2 MOVRR instructions, yielding a total latency of $3 \times 4 + 2 \times 2 = 16$ clock cycles. This latency can be decreased without changing the critical path by delaying the lo portions of the inputs to the algorithm in the dataflow. Doing so reduces the latency to 36 - 8 = 28 cycles for native-pair addition without the residual register and to 16 - 2 = 14 cycles for native-pair addition with the residual register, which is exactly a $2 \times$ speedup over the algorithm not using the residual register [23].



Figure 6. Native-pair addition data flow, conventional and residual algorithms.

#### 2.6. Native-pair Multiplication

In multiplication of two floating-point numbers as opposed to addition, there is no shifting of the mantissa bits in order to make the exponents of the two numbers equal. Multiplication of two n-bit mantissa numbers produces a 2n-bit result and the exponents of the two numbers are simply added. The lower n-bits of the 2n-bit result are put into the residual register and its exponent is set to  $exp(a) - (N_m+1)$ . When the result is rounded down, the sign of the residual register is same as that of the result and the complement flag is cleared. On the other hand when the result is rounded up, the sign is set opposite to the sign of the result and the complement flag is set.

| Case   | Sign of product | Rounded up / down | Complement flag | Sign of residual register: Sign(rr) |
|--------|-----------------|-------------------|-----------------|-------------------------------------|
| Case 1 | Sign(p)         | Down              | Cleared         | Sign(p)                             |
| Case 2 | Sign(p)         | Up                | Set             | Opposite of Sign(p)                 |

Table 4. Cases for complement flags and signs of residual register

#### Multiplication algorithms for native-pair multiplication

Algorithm 7 shows the multiplication of two native-pair numbers a and b without residual register hardware. The algorithm uses a native\_mul function to multiply the two high components of the native-pair numbers. The high and low components are also multiplied, which yields three low components: the low component of the native\_mul result, a.hi × b.lo and b.hi × a.lo. The fourth term, a.lo × b.lo, is too small to have an influence on the result. The three low components are added to produce the final low component of the result. The native\_mul function implementation is simplified if the processor has a fused multiply-subtract instruction that preserves the full precision of the product before the subtraction. In such a case the residual value can be obtained by subtracting the rounded product from the full precision product. When such a provision is unavailable, the native\_mul function requires the entire component-wise multiplication of the high and low components.

#### Algorithm 7. Native-pair multiplication without residual register hardware

```
nativepair nativepair_mul (nativepair a, nativepair b)
{
    nativepair tops = native_mul (a.hi, b.hi);
    native hiloa = a.hi * b.lo;
    native hilob = b.hi * a.lo;
    native hilo = hiloa + hilob;
    tops.lo = tops.lo + hilo;
    return (nativepair_normalize (tops.hi, tops.lo));
}
```

#### Algorithm 7.1. native\_mul function

```
#define NATIVEBITS 24
/* Splitting constant 2^(NATIVEBITS/2) + 1, used to cut a mantissa in half */
#define NATIVESPLIT ((1<<(NATIVEBITS - (NATIVEBITS/2))) + 1.0)

nativepair native_mul (native a, native b)
{
    nativepair c;
#ifdef HAS_FUSED_MULSUB
    c.hi = a * b;
    /* assumed to map to one fused multiply-subtract, so a * b is not rounded */
    c.lo = a * b - c.hi;
#else
    /* split a and b into top and bottom halves */
    native asplit = a * NATIVESPLIT;
    native bsplit = b * NATIVESPLIT;
    native as = a - asplit;
    native bs = b - bsplit;
    native atop = as + asplit;
    native btop = bs + bsplit;
    native abot = a - atop;
    native bbot = b - btop;
    /* component-wise partial products */
    native top = atop * btop;
    native mida = atop * bbot;
    native midb = btop * abot;
    native mid = mida + midb;
    native bot = abot * bbot;
    c = nativepair_normalize (top, mid);
    c.lo = c.lo + bot;
#endif
    return (c);
}
```
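The splitting step in the non-fused branch can be checked in isolation. The sketch below assumes `float` as the native type with a 24-bit significand, round to nearest, and no overflow; `halves`, `split_native` and `product_from_halves` are hypothetical names, and `volatile` is used only to keep the compiler from fusing or widening the intermediate operations.

```c
typedef float native;

#define NATIVEBITS 24
/* 2^12 + 1 for a 24-bit significand */
#define NATIVESPLIT ((native)((1 << (NATIVEBITS - (NATIVEBITS / 2))) + 1))

typedef struct { native top; native bot; } halves;

/* Split a into a short top half and a short bottom half, a = top + bot. */
static halves split_native(native a)
{
    halves h;
    volatile native asplit = a * NATIVESPLIT;
    volatile native as = a - asplit;
    volatile native atop = as + asplit;
    h.top = atop;
    h.bot = a - atop;
    return h;
}

/* Sum of the four partial products, accumulated in double. */
static double product_from_halves(native a, native b)
{
    halves x = split_native(a);
    halves y = split_native(b);
    volatile native top = x.top * y.top;
    volatile native mida = x.top * y.bot;
    volatile native midb = y.top * x.bot;
    volatile native bot = x.bot * y.bot;
    return (double)top + (double)mida + (double)midb + (double)bot;
}
```

Because each half has at most about half of the significand bits, every partial product fits in a native value without rounding, which is why the component-wise multiplication can recover the full product. For a = $1 + 2^{-23}$ the split yields top = 1 and bot = $2^{-23}$.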

When fused multiply-add is not available, the residual register hardware simplifies the native\_mul function from 17 instructions to two instructions. Algorithm 8, shown below, takes 8 instructions to perform the multiplication. Though the instruction count is the same as in the fused multiply-add implementation, the residual register implementation removes the need for a wider adder.

#### Algorithm 8. Native-pair multiplication using residual register hardware

```
nativepair nativepair_mul (nativepair a, nativepair b)
{
    native tophi = a.hi * b.hi;
    native toplo = getrr ( );
    native hiloa = a.hi * b.lo;
    native hilob = b.hi * a.lo;
    native hilo = hiloa + hilob;
    toplo = toplo + hilo;
    return (nativepair_normalize (tophi, toplo));
}
```
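As with addition, Algorithm 8 can be tried out with a software model of the residual register. The sketch below assumes `float` as the native type; `rr_value`, `rr_add` and `rr_mul` are hypothetical stand-ins for the register and for ADD and MUL instructions that update it. The product of two single-precision values is exact in double precision, so the emulated multiplication residual is exact barring overflow and underflow.

```c
typedef float native;
typedef struct { native hi; native lo; } nativepair;

/* Emulated residual register (assumption, not the hardware itself). */
static double rr_value;

static native rr_add(native a, native b)
{
    volatile native s = a + b;
    rr_value = ((double)a + (double)b) - (double)s;
    return s;
}

static native rr_mul(native a, native b)
{
    volatile native p = a * b;
    /* the 48-bit product of two 24-bit significands fits in a double */
    rr_value = (double)a * (double)b - (double)p;
    return p;
}

/* Stand-in for the MOVRR instruction. */
static native getrr(void)
{
    return (native)rr_value;
}

static nativepair nativepair_normalize(native hi, native lo)
{
    nativepair r;
    r.hi = rr_add(hi, lo);
    r.lo = getrr();
    return r;
}

/* Algorithm 8: native-pair multiplication with the residual register. */
static nativepair nativepair_mul(nativepair a, nativepair b)
{
    native tophi = rr_mul(a.hi, b.hi);
    native toplo = getrr();          /* low half of a.hi * b.hi */
    native hiloa = rr_mul(a.hi, b.lo);
    native hilob = rr_mul(b.hi, a.lo);
    native hilo = rr_add(hiloa, hilob);
    toplo = rr_add(toplo, hilo);
    return nativepair_normalize(tophi, toplo);
}
```

Squaring the pair ($1 + 2^{-23}$, 0) gives hi = $1 + 2^{-22}$ and lo = $2^{-46}$, which together equal the exact product $1 + 2^{-22} + 2^{-46}$.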

Native-pair multiplication has three data flow graphs, for the conventional, fused multiply-add and residual register implementations, which are shown in Figure 7. Depending on the latency of the add and subtract operations in the critical path, the speedup resulting from the fused multiply-add implementation is 2.3 and that resulting from the residual register implementation is 3 [23]. The residual register implementation has the added advantage that, with careful instruction scheduling, the critical path can be implemented with only a 2-stage pipeline. Such an improvement is not possible in the conventional and fused multiply-add implementations, as they need a longer pipeline [23].



Figure 7. Native-pair multiplication data flow, conventional and residual algorithms

#### Chapter 3. Native-pair Floating-point Unit

This chapter describes the working of the native floating-point unit addition/subtraction and multiplication units, followed by the construction of the native-pair floating-point unit and usage of the residual register hardware.

#### 3.1. Native Floating-Point Addition/Subtraction

The native floating-point addition/subtraction is subdivided into three steps: prenormalization, addition or subtraction and postnormalization.

#### 3.1.1. Prenormalization

The input operands to the floating-point unit first go to the prenormalization unit. This unit finds the difference between the exponents of the two operands, shifts the mantissa with the lower exponent to make the two exponents equal, and sends the mantissa bits and the exponent to the addition stage.

Initially the two operands A and B are divided into sign, exponent and mantissa fields. After this step the following fields or signals are obtained:

- Exp (A)
- Exp (B)
- Mant(A)
- Mant(B)

Exp(A) and exp(B) of the input operands are checked for zero values to see if the operands are denormalized. If an operand is denormalized, its exponent is incremented by 1 so that the exponent equals -126 after unbiasing. If exp(A) and exp(B) are non-zero, the corresponding operands are considered normalized. The fraction values are concatenated with 5 more bits: the carry, hidden, guard, round and sticky bits. The carry and hidden bits are added as the most significant bits. Initially the carry bit is 0, and the hidden bit is 0 if the operand is denormalized; otherwise the hidden bit is 1. The guard, round and sticky bits are appended at the end of the fraction bits and are initially all zeroes. After this step the fractions take the form of

- New Mant(A) = carry, hidden, mant(A), guard, round, sticky.
- New Mant(B) = carry, hidden, mant(B), guard, round, sticky.

A comparator COMP1 is used to check which exponent is greater, and a multiplexer MUX1 assigns the greater exponent to the output exponent based on the comparator output signal. Multiplexer MUX2 is used to give the difference of the two exponents. If exp(A) > exp(B), MUX2 gives the difference exp(A) - exp(B); otherwise it gives the difference exp(B) - exp(A). The mantissa of the operand with the lower exponent is shifted right by as many bits as this exponent difference. The sticky bit for the shifted mantissa is computed and updated. The two updated mantissas with the output exponent and other signals are sent to the next stage. Figure 8 shows the prenormalization process.



Figure 8. Prenormalization unit for Floating-point addition
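The behaviour of the prenormalization stage can be summarized in a C sketch. It is a behavioural model only, assuming 8-bit stored exponents and 23-bit fractions, and it does not reflect the structure of the VHDL; `prenorm_out`, `extend_mantissa`, `shift_sticky` and `prenormalize` are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    unsigned exp_out;   /* greater exponent, passed on as output exponent */
    uint32_t mant_a;    /* 28 bits: carry, hidden, 23 fraction, G, R, S */
    uint32_t mant_b;
} prenorm_out;

/* Build carry, hidden, fraction, guard, round, sticky from the raw fields. */
static uint32_t extend_mantissa(unsigned exp, uint32_t frac)
{
    uint32_t hidden = (exp != 0) ? 1u : 0u;   /* 0 for denormalized operands */
    return ((hidden << 23) | frac) << 3;      /* carry and G, R, S start at 0 */
}

/* Shift right by n bits and fold every lost bit into the sticky bit. */
static uint32_t shift_sticky(uint32_t m, unsigned n)
{
    uint32_t lost;
    if (n == 0)
        return m;
    if (n >= 28)
        return (m != 0) ? 1u : 0u;
    lost = m & ((1u << n) - 1u);
    return (m >> n) | (lost != 0 ? 1u : 0u);
}

static prenorm_out prenormalize(unsigned exp_a, uint32_t frac_a,
                                unsigned exp_b, uint32_t frac_b)
{
    prenorm_out r;
    uint32_t ma = extend_mantissa(exp_a, frac_a);
    uint32_t mb = extend_mantissa(exp_b, frac_b);

    /* Denormalized operands use exponent 1 (-126 after unbiasing). */
    if (exp_a == 0) exp_a = 1;
    if (exp_b == 0) exp_b = 1;

    if (exp_a > exp_b) {                      /* COMP1 / MUX1 / MUX2 */
        r.exp_out = exp_a;
        mb = shift_sticky(mb, exp_a - exp_b);
    } else {
        r.exp_out = exp_b;
        ma = shift_sticky(ma, exp_b - exp_a);
    }
    r.mant_a = ma;
    r.mant_b = mb;
    return r;
}
```

For example, with exp(A) = 131 and exp(B) = 127 the output exponent is 131 and the mantissa of B is shifted right by four positions, with any bit shifted out recorded in its sticky bit.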

#### 3.1.2. Addition/Subtraction Stage

This stage computes the sign of the output and performs the addition or subtraction of the mantissas based on the signs of the operands. The two mantissas are compared, and the operation to be performed, i.e., addition or subtraction, is determined from which mantissa is greater, the signs of the two operands A and B, and the opcode. Table 5 shows the different cases that arise. When A and B have the same sign, an opcode of 0 means that addition is performed and an opcode of 1 means that subtraction is performed. When A and B have opposite signs, subtraction is performed for an opcode of 0 and addition for an opcode of 1.



Figure 9. Addition unit for Floating-point addition

The sign of the output is computed based on the signs of the operands A and B, which mantissa is greater and the operation being performed. If the operation is an addition then the two mantissas are added and if it is a subtraction then the lower mantissa is subtracted from the higher mantissa. The output sign and mantissa are sent to the postnormalization stage along with the input operands passed by the prenormalization stage. The input operands are required in postnormalization for the generation of exceptions.

| Opcode | Sign of A | Sign of B | Operation   |
|--------|-----------|-----------|-------------|
| 0      | sign      | sign      | Addition    |
| 0      | sign      | ~sign     | Subtraction |
| 1      | sign      | sign      | Subtraction |
| 1      | sign      | ~sign     | Addition    |

Table 5. Addition or subtraction cases based on opcode and signs of the operands
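Table 5 reduces to a single exclusive-or of the opcode and the two sign bits. The C sketch below shows this, together with one common way of choosing the output sign; the sign rule is an assumption based on standard sign-magnitude arithmetic, since the text does not spell out the exact logic, and `mant_op`, `effective_operation` and `output_sign` are illustrative names.

```c
typedef enum { OP_ADD, OP_SUB } mant_op;

/* opcode: 0 = add, 1 = subtract; signs: 0 = positive, 1 = negative */
static mant_op effective_operation(int opcode, int sign_a, int sign_b)
{
    return ((opcode ^ sign_a ^ sign_b) & 1) ? OP_SUB : OP_ADD;
}

/*
 * Output sign (assumed rule): a subtraction opcode flips the sign of B.
 * When magnitudes are added the result keeps the sign of A; when they are
 * subtracted it takes the sign of the operand with the greater mantissa.
 */
static int output_sign(int opcode, int sign_a, int sign_b, int a_is_greater)
{
    int eff_sign_b = (sign_b ^ opcode) & 1;
    if (effective_operation(opcode, sign_a, sign_b) == OP_ADD)
        return sign_a & 1;
    return a_is_greater ? (sign_a & 1) : eff_sign_b;
}
```

For instance, computing 3 - 5 uses opcode 1 with two positive operands, so the mantissas are subtracted and, because B has the greater mantissa, the result takes the flipped sign of B and is negative.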

#### 3.1.3. Postnormalization

Postnormalization is the final stage of any floating-point operation. The inputs to this stage are the addition/subtraction unit output, the output exponent, the output sign and the rounding mode.

The postnormalization unit checks the result of the addition/subtraction stage for a carry. If the carry bit in the result is set, the result is shifted right once and the output exponent is increased by one. If the hidden bit of the result is zero, the result must be shifted left until the hidden bit is one. To determine how far to shift the mantissa, the number of zeros starting from the most significant bit is counted. After the shift is performed, the exponent is decreased by the same number. The sticky bit is then checked again to find out whether any bits were lost. Depending on the rounding mode and the sticky bits at different stages in the postnormalization, the result is rounded up or rounded down. The carry bit is checked again to see whether rounding produced a carry; if it did, the result is shifted right once and the exponent is incremented by one. Finally, the result is checked for exceptions such as NaN, infinity, overflow and inexact result, and depending on these, the final result along with the exception flags is sent to the output. The postnormalization unit is shown in Figure 10.



Figure 10. Postnormalization unit for Floating-point addition.

#### 3.2. Native Floating-Point Multiplication

The native floating-point multiplication unit is also subdivided into three steps: prenormalization, multiplication and postnormalization.

#### 3.2.1. Prenormalization

The input operands to the multiplication unit first go through the prenormalization unit. As compared to prenormalization in addition, the prenormalization in multiplication has less functionality. This unit checks if the operands A and B are denormalized, adds the exponents of A and B and transfers the mantissas to the multiplication stage.



Figure 11. Prenormalization unit in floating-point multiplication.

Initially the two operands A and B are divided into sign, exponent and mantissa fields. After this step the following fields or signals are obtained:

- Exp(A)
- Exp(B)
- Mant(A)
- Mant(B)

The exponents exp(A) and exp(B) of the incoming operands are checked for zero values to see whether the operands are denormalized. If an operand is denormalized, its exponent is incremented by 1 so that the exponent equals -126 after unbiasing. Each fraction is extended by just one bit, the hidden bit, which becomes the most significant bit. The hidden bit is 0 if the operand is denormalized; otherwise it is 1. After this step the mantissas take the form:

- New Mant(A) = hidden, mant(A)
- New Mant(B) = hidden, mant(B)

The exponents are then added. Since both exponents are already biased, the sum contains the bias twice, so 127 is subtracted from it. The two updated mantissas, the output exponent and the other signals are sent to the next stage.

#### 3.2.2. Multiplication Stage

This stage is used to multiply the two mantissas obtained from the previous stage. The multiplier used is a Booth's parallel multiplier model. An exclusive-or gate is used to obtain the sign of the product. The output sign s\_sign\_o as well as the product s\_fract\_o are transferred along with the other signals to the postnormalization stage.
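The prenormalization and multiplication steps above can be sketched behaviourally in Python. The names unpack and multiply\_core are hypothetical, a plain integer product stands in for the Booth multiplier, and exceptions are ignored.

```python
BIAS = 127  # single-precision exponent bias


def unpack(bits: int):
    """Split a 32-bit word into sign, biased exponent and 24-bit mantissa."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    hidden = 0 if exp == 0 else 1        # hidden bit is 0 for denormals
    if exp == 0:
        exp = 1                          # exponent becomes -126 after unbiasing
    return sign, exp, (hidden << 23) | frac


def multiply_core(a_bits: int, b_bits: int):
    """Return sign, biased exponent and 48-bit mantissa product (not normalized)."""
    sign_a, exp_a, mant_a = unpack(a_bits)
    sign_b, exp_b, mant_b = unpack(b_bits)
    sign = sign_a ^ sign_b               # exclusive-or gives the product sign
    exp = exp_a + exp_b - BIAS           # remove the bias that was counted twice
    product = mant_a * mant_b            # 24 x 24 bits -> up to 48 bits
    return sign, exp, product


# 2.0 * 3.0: exponents 128 + 128 - 127 = 129, mantissa product 1.0 * 1.5
print(multiply_core(0x40000000, 0x40400000) == (0, 129, 3 << 45))
```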



Figure 12. Multiplication unit for Floating-point multiplication

#### 3.2.3. Postnormalization

The inputs to this stage are the multiplication unit output, the prenormalization exponent output, the multiplication sign output and the rounding mode. The postnormalization stage checks the multiplication output for a carry. If a carry has occurred, the multiplication output is shifted right once to normalize it. If the hidden bit of the result is zero, the result must be shifted left until the hidden bit is one. For this, the number of zeros starting from the most significant bit is counted. After the shift is performed, the exponent is decreased by the same number. The sticky bit is then checked again to find whether any bits were lost. Depending on the rounding mode and the sticky bits at the different stages of postnormalization, the result is rounded up or rounded down. The carry bit is checked again; if a carry has occurred, the result is shifted right once and the exponent is incremented by one. Finally, the result is checked for exceptions such as NaN, infinity, overflow and inexact result, and depending on these values the final result is sent to the output along with the exception flags.



Figure 13. Postnormalization unit for Floating-point multiplication

#### 3.3. Native-pair Floating-point Addition/subtraction

Native floating-point addition/subtraction has been discussed in Section 3.1. This section discusses the 32-bit native-pair floating-point addition/subtraction and the extra hardware added to the native floating-point addition/subtraction unit to make it work as a native-pair floating-point addition/subtraction unit. Native-pair addition/subtraction is also subdivided into three steps: prenormalization, addition or subtraction, and postnormalization. Each step and its additional hardware are discussed in detail below.

#### 3.3.1. Prenormalization

The native-pair prenormalization unit, apart from performing the normal operation of making the exponents equal and aligning the mantissas, also includes the first of the residual register operations.





For clarity, the steps of the native prenormalization are repeated here. Initially the two operands A and B are divided into sign, exponent and mantissa fields. After this step the following fields or signals are obtained:

- Exp(A)
- Exp(B)
- Mant(A)
- Mant(B)

The exponents $\exp(A)$ and $\exp(B)$ of the input operands are checked for zero values to see whether the operands are denormalized. If an operand is denormalized, its exponent is incremented by 1 so that the exponent equals -126 after unbiasing. The fraction values are concatenated with 5 more bits: the carry, hidden, guard, round and sticky bits. The carry and hidden bits are added as the most significant bits. Initially the carry bit is 0, and the hidden bit is 0 if the operand is denormalized and 1 otherwise. The guard, round and sticky bits are appended at the end of the fraction bits and are initially all zeroes. After this step the mantissas take the form:

- New Mant(A) = carry, hidden, mant(A), guard, round, sticky.
- New Mant(B) = carry, hidden, mant(B), guard, round, sticky.

A comparator COMP1 checks which exponent is greater, and a multiplexer MUX1 assigns the greater exponent to the output exponent based on the comparator output signal. Multiplexer MUX2 gives the difference of the two exponents: if exp(A) > exp(B), MUX2 gives exp(A) - exp(B); otherwise it gives exp(B) - exp(A). An 'and signal' is generated, which is a 25-bit signal with ones in the bit positions up to (s\_exp\_diff - 1) and zeroes elsewhere; for example, if the exponent difference is 4, the and signal is "0000000000000000000001111". The mantissa of the operand with the lower exponent, s\_fract\_small, is shifted right by as many bits as the exponent difference. A bit-wise AND between the smaller mantissa s\_fract\_small and the and signal extracts the bits that are being shifted out; these bits form the initial mantissa part of the residual register. The exponent of the residual register is set to the exponent of the lower mantissa. The sticky bit for the shifted mantissa is computed and updated. One other signal generated here is exp\_greater\_24, which indicates whether the exponent difference is greater than 24. The two updated mantissas, together with the output exponent, the residual register exponent and mantissa, and the other signals, are sent to the next stage.
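The masking step can be illustrated with a short Python sketch. It is a behavioural model under simplifying assumptions, not the VHDL of the unit: the function name split\_smaller\_mantissa is hypothetical, and the mantissa is treated as the 25-bit field carry, hidden, fraction (the guard, round and sticky bits are omitted).

```python
WIDTH = 25  # carry + hidden + 23 fraction bits


def split_smaller_mantissa(fract_small: int, exp_diff: int):
    """Align the smaller mantissa and capture the bits that are shifted out.

    Returns (shifted mantissa, residual bits, and-signal, exp_greater_24).
    """
    shift = min(exp_diff, WIDTH)              # cannot shift out more than WIDTH bits
    and_signal = (1 << shift) - 1             # ones in positions 0 .. shift-1
    residual = fract_small & and_signal       # bits that leave the mantissa
    shifted = fract_small >> shift            # aligned smaller mantissa
    exp_greater_24 = exp_diff > 24
    return shifted, residual, and_signal, exp_greater_24


# An exponent difference of 4 gives the and-signal quoted in the text.
print(format(split_smaller_mantissa(0, 4)[2], "025b"))

# Mantissa 01|10000010100011110101110 shifted by 3: the bits 110 are shifted out.
shifted, residual, _, flag = split_smaller_mantissa(0xC147AE, 3)
print(format(residual, "03b"), flag)
```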

#### 3.3.2. Addition/subtraction Stage

There is no change in the functionality of the addition/subtraction unit. It takes the mantissas and the operand signs as inputs. After determining which operation has to be performed, it performs that operation, i.e., addition or subtraction. The output sign is generated based on the operation, the signs of the operands and which operand is greater. The operation output, the output sign and other inputs, such as the residual register values from the prenormalization stage, are all passed to the postnormalization stage.

#### 3.3.3. Postnormalization

The postnormalization stage contains most of the residual register hardware. The inputs to this stage are the addition/subtraction unit output, the prenormalization output exponent, the addition unit output sign, the rounding mode and the residual register values from the prenormalization stage. The following are the steps involved in the postnormalization stage:

Check the result of the addition/subtraction stage for a carry. If the carry bit in the result is set, shift the result right once and increase the output exponent by one. If the hidden bit of the result is zero, the result must be shifted left until the hidden bit is one. For this, the number of zeros starting from the most significant bit is counted. After the shift is performed, the exponent is decreased by the same number.

As the result is shifted right, the bit just before the guard bit is lost, and this bit has to be prepended to the residual register, ahead of the bits that were inserted in the prenormalization stage. For this purpose a decoder is used whose output D1 has a '1' in the position that corresponds to the exponent difference. D1 is logically ANDed with the output of the addition/subtraction and then ORed with the mantissa from the prenormalization stage to get the new updated mantissa.

s\_mant\_rr2\_br <= ('0' & mant\_i\_rr2) or (d1 and s\_fract\_28\_i (27 downto 3));
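The effect of this step on the residual mantissa can be sketched in Python. This is a simplified behavioural view rather than a model of the decoder and AND/OR network itself: the function name prepend\_discarded\_bit is hypothetical, and the insert position is assumed to equal the number of bits already held in the residual register.

```python
def prepend_discarded_bit(residual: int, discarded_bit: int, position: int) -> int:
    """Place one discarded result bit above the bits already in the residual.

    position is the index of the first free bit, assumed here to be the
    number of bits stored so far (the exponent difference after
    prenormalization, one more after each later right shift).
    """
    one_hot = 1 << position                  # behaves like the decoder output
    return residual | (one_hot if discarded_bit & 1 else 0)


# Prenormalization stored 110 (exponent difference 3); the right shift
# before rounding discards a 1, giving 1110.
rr = prepend_discarded_bit(0b110, 1, 3)
print(format(rr, "04b"))
```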



Figure 15. Postnormalization unit for Native-pair addition with Residual register

The sticky bit is then checked again to find whether any bits were lost. Depending on the rounding mode and the sticky bits at the different stages of postnormalization, the result is rounded up or rounded down. The carry bit is checked again; if a carry has occurred, the result is shifted right once and the exponent is incremented by one.

As the result is shifted right again, the bit just before the guard bit is lost, and this bit has to be added to the residual register. It has to be placed before the bit that was added after the right shift performed before rounding. For this purpose another decoder is used, whose output D2 has a '1' in the position next to the '1' in D1. D2 is logically ANDed with the rounded result after it has been shifted right and then ORed with the mantissa s\_mant\_rr2\_br to get the new updated mantissa.

s\_mant\_rr2\_ar <= ('0' & s\_mant\_rr2\_br) or (d2 and s\_fract\_rnd (27 downto 3));

Suppose that the exponent difference in the prenormalization was greater than 24, so that all the bits of the smaller mantissa were shifted into the residual register. If the result is then shifted right twice in the postnormalization stage, once before rounding and once after rounding, two more bits must be prepended to the residual register. In total, the residual register mantissa can therefore temporarily hold 27 bits, of which the 25 most significant bits are stored as the final residual value.

The sign of the residual register and the complement flag are also generated in this stage. If the complement flag is set, the residual value is complemented before it is stored in an architectural register. The signal exp\_greater\_24, which was generated in the prenormalization stage to check whether the exponent difference is greater than the number of mantissa bits + 1, is used here.

If the signal is set, the exponent of the residual is set to the higher exponent minus $2(N_m+1)$; otherwise it is set to the lower exponent. Here $N_m$ is the number of mantissa bits.

Finally, the result is checked for exceptions such as NaN, infinity, overflow and inexact result, and depending on these values the final result is sent to the output along with the exception flags.

The next instruction is to normalize the residual register value which happens with the MOVRR signal going high. During this stage, residual register value is concatenated with the guard, round and the sticky bits in the end to make it 28 bits and this value is directly sent into the postnormalization input for normalization. This normalized residual register value is later used in computation related to native-pair algorithms.

#### 3.4. Native-pair Floating-point Multiplication

This section discusses the 32-bit native-pair floating-point multiplication and the extra hardware added to the native floating-point multiplication unit to make it work as a native-pair floating-point multiplication unit. The residual register hardware for multiplication is less complex than that for addition. Since there is no shifting of mantissas in the prenormalization of a multiplication, there is no residual register functionality in the prenormalization. The entire residual register operation therefore takes place in the postnormalization stage. Hence only the changes to the postnormalization stage and the steps involved in it are discussed here.

#### 3.4.1. Postnormalization

The inputs to this stage are the multiplication unit output, the prenormalization output exponent, the multiplication sign output and the rounding mode. The following are the steps involved in the native-pair multiplication postnormalization stage:



Figure 16. Postnormalization unit for Native-pair multiplication with Residual register

Check the result of the multiplication stage for a carry. If the carry bit in the result is set, shift the result right once and increase the output exponent by one. In postnormalization, the 25 most significant bits are taken into consideration for the final output; hence the residual register initially consists of the 23 least significant bits.
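The initial split of the mantissa product can be sketched in Python as follows. This is an illustrative model only; the function name split\_product is hypothetical, and the 48-bit product is assumed to come from two 24-bit mantissas (hidden bit plus 23 fraction bits each).

```python
def split_product(product48: int):
    """Split a 48-bit mantissa product into the 25 most significant bits
    (used for the result) and the 23 least significant bits (the initial
    residual register mantissa)."""
    result25 = product48 >> 23
    residual23 = product48 & ((1 << 23) - 1)
    return result25, residual23


# Product of two 24-bit mantissas; the two parts recombine to the product.
product = 0xF1999A * 0xC147AE
hi, lo = split_product(product)
print(((hi << 23) | lo) == product, hi.bit_length() <= 25)
```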

If a carry had occurred then, the result would be shifted right once, and the bit that comes out is stored into the residual register. Compared to the addition, multiplication does not involve shifting of mantissa in the prenormalization stage and so no decoder is required here to append the discarded bit into the residual register. If the result has the hidden bit equal to zero then, the result must be left shifted until the hidden bit is one. For this, the number of zeros starting from the most significant bit is counted. After the shift is performed, the exponent is decreased by the same number.

The sticky bit is checked to find whether any bits were lost. Depending on the rounding mode and the sticky bits at the different stages of postnormalization, the result is rounded up or rounded down. The carry bit is checked again; if a carry has occurred, the result is shifted right once and the exponent is incremented by one. When a carry occurs a second time and the result is shifted again, the discarded bit is again prepended as the most significant bit of the residual register; this becomes the 25<sup>th</sup> bit.

The complement flag and the sign flag are generated. When the complement flag is set, the final residual value being stored into the residual register is complemented. The exponent of the residual is set to the higher exponent minus $(N_m+1)$ to align the residual register mantissa with the result, where $N_m$ again denotes the number of mantissa bits.

Finally, the result is checked for exceptions such as NaN, infinity, overflow and inexact result, and depending on these values the final result is sent to the output along with the exception flags. The MOVRR signal can be used to re-route the residual register values into the postnormalization stage to obtain the normalized value of the residual.

#### 3.5. Debugging FPU Unit

The adopted floating-point unit (FPU) suffered from some architectural errors related to the routing of signals through the pipeline, in particular the carrying of input signals to the various stages of the pipeline. This section discusses the changes made to the original floating-point unit in order to enable its proper functioning in pipelined fashion. The floating-point unit pipeline consists of four stages: prenormalization stage, addition/multiplication stage, postnormalization stage and final output stage. All of these are instantiated in the FPU module. The clock, the input operands, the movrr signal, the rounding mode and the opcode are the inputs to the FPU module. These inputs are sent to the various stages depending on their usage. Apart from these primary inputs, intermediate outputs such as the exponents, mantissas, signs and operation results are generated at each stage and need to be carried to the later stages. The following paragraphs discuss the changes made to the original FPU architecture.

In pipelined operation, the inputs change every clock cycle. In addition, a different instruction is executed in each stage of the pipeline in every clock cycle. Hence, an operation performed in the third clock cycle might need an input given in an earlier clock cycle. For this purpose, the needed inputs must be propagated through each stage using registers until they are used.

The input operands are required in the prenormalization stage to generate the sign, exponent and mantissa bits. The input operands are also required in the postnormalization stage to generate the NaN (Not a Number) signals. Similarly, the FPU operation signal fpu\_op\_i and the rounding mode signal s\_rmode\_i are required in the addition/multiplication stage and the postnormalization stage. All these signals have to be propagated from prenormalization through the addition/multiplication stage to postnormalization.



Figure 17. Floating point arithmetic unit pipeline

Figure 17 shows the four stages of the floating-point pipeline: prenormalization, arithmetic core, postnormalization and formatting output. In the output formatting stage, the output is modified with respect to exceptions [19]. Signals must be propagated through the pipeline stages, not along the paths marked X in Figure 17. Suppose the inputs to the FPU are opa\_i, opb\_i, fpu\_op\_i and rmode\_i, where fpu\_op\_i is used in the addition/multiplication stage and opa\_i is used in the postnormalization stage. The operation is performed in the second clock cycle and the postnormalization in the third clock cycle. When the operation is performed, the opcode can be taken either from the pipelined signal fpu\_op or directly from the input fpu\_op\_i. In the second clock cycle the value of fpu\_op\_i may already have changed, so using it directly can cause the wrong operation to be performed. The signal fpu\_op, on the other hand, was assigned the original value of fpu\_op\_i at the end of the preceding clock cycle, so its value has not changed and it is the opcode for the correct operation. Similarly, opa\_i changes before the instruction reaches postnormalization, so it is propagated through the pipeline stages via opa\_out and opa\_addout. All the signals that were added or modified to fix this problem are marked with the comment "propagated input through pipeline register" in the modified FPU VHDL code given in Appendix B.

When a value or a signal generated in one pipeline stage is required in another stage, that signal also has to be propagated through the pipeline stages. For example, the output exponent (the larger exponent) produced by the prenormalization stage is required in the postnormalization stage, and hence it has to be taken through the addition/multiplication unit to the postnormalization stage. This is done by using the prenorm\_addsub\_exp signal to take the exponent from the prenormalization to the addition stage and then the exp\_o\_addsubpost signal to take it from the addition to the postnormalization stage. These modified signals are commented as "intermediate outputs through pipeline register" in the modified code.
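The effect of the missing pipeline registers can be reproduced with a small cycle-by-cycle Python model. This is only an illustration of the hazard described above; the function run\_pipeline and its arguments are hypothetical, and a single register stands in for the path from fpu\_op\_i to the arithmetic stage.

```python
def run_pipeline(opcodes, use_pipeline_register):
    """Return the opcode seen by the arithmetic stage for each instruction.

    opcodes: one opcode per clock cycle, as applied to the FPU input.
    use_pipeline_register: if True the arithmetic stage reads the registered
    copy of the opcode; if False it reads the live input, which by then
    already belongs to the next instruction.
    """
    seen = []
    fpu_op_reg = None                       # register between stage 1 and stage 2
    for cycle, fpu_op_i in enumerate(list(opcodes) + [None]):
        if cycle >= 1:                      # stage 2 works on the previous input
            seen.append(fpu_op_reg if use_pipeline_register else fpu_op_i)
        fpu_op_reg = fpu_op_i               # clock edge: capture the current input
    return seen


ops = ["add", "mul", "add"]
print(run_pipeline(ops, use_pipeline_register=True))   # each instruction sees its own opcode
print(run_pipeline(ops, use_pipeline_register=False))  # each instruction sees its successor's opcode
```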

All the pipeline stages should consume the same number of clock cycles to produce their outputs. If different pipeline stages consume different numbers of clock cycles, then all the stages must wait until every stage has produced its output to ensure the correct functioning of the pipelined system. This decreases the throughput of the system, as outputs are produced at a lower rate than the clock frequency. A new output is produced every n clock cycles, where n is the number of clock cycles consumed by the slowest pipeline stage.

In the original adopted FPU [19], prenormalization takes two clock cycles, the arithmetic core takes one clock cycle, postnormalization takes three clock cycles and formatting output takes one clock cycle. All the pipeline stages have been modified such that each stage consumes only one clock cycle. For example, two sequential process blocks used in postnormalization caused two extra clock cycles to be needed to get the output of that stage. The signal computed in the first sequential process block is needed to compute the signals in the second sequential process block, and hence two clock cycles are needed to get the output of that stage. These two sequential process blocks are replaced by combinational logic, as explained below, to reduce the number of clock cycles required by the postnormalization pipeline stage to one.
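Using the stage latencies quoted above, the output interval before and after the modification can be worked out with a few lines of Python. The function name cycles\_per\_result is hypothetical; the model simply takes the slowest stage as the limit, as stated in the text.

```python
def cycles_per_result(stage_latencies):
    """Clock cycles between successive outputs when every stage waits for
    the slowest stage to finish."""
    return max(stage_latencies)


# Latencies in clock cycles: prenormalization, arithmetic core,
# postnormalization, formatting output.
original_fpu = [2, 1, 3, 1]
modified_fpu = [1, 1, 1, 1]

print(cycles_per_result(original_fpu))  # 3: one result every three cycles
print(cycles_per_result(modified_fpu))  # 1: one result every cycle
```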

When the hidden bit and carry bit of the arithmetic result are zeros, then the mantissa has to be left-shifted to normalize it. The following sequential process block is used in postnormalization unit of FPU [19] to compute the number of positions by which the mantissa has to be shifted left.

Listing 1. Process to count zeroes from the left.

```
process(clk_i)
begin
    if rising_edge(clk_i) then
        -- count the leading zeros of fraction, needed for left-shift
        s_zeros <= count_l_zeros(s_fract_28_i(26 downto 0));
    end if;
end process;
```

The above process block is replaced by the following line of combinational logic.

s\_zeros <= count\_l\_zeros(s\_fract\_28\_i(26 downto 0));
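For reference, the behaviour expected from a leading-zero counter on the 27-bit slice can be modelled in Python as below. This is a sketch of the intended behaviour, not the body of the VHDL function count\_l\_zeros, which is not reproduced here; in particular, the result for an all-zero input (27) is an assumption.

```python
def count_l_zeros(value: int, width: int = 27) -> int:
    """Count the zero bits above the most significant 1 in a width-bit field.

    For the slice s_fract_28_i(26 downto 0) the width is 27, so a set
    hidden bit (bit 26) gives 0. An all-zero field returns width (assumed).
    """
    value &= (1 << width) - 1           # keep only the bits of the field
    return width - value.bit_length()


print(count_l_zeros(1 << 26))  # hidden bit set: no shift needed
print(count_l_zeros(1 << 20))  # six leading zeros: shift left by six
```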

This change reduced the number of clock cycles required by the postnormalization unit to two. After the above mentioned sequential process block, combinational logic is used in the FPU to compute the left shifted mantissa (s\_fract\_shl) and corresponding decremented exponent (s\_exp\_shl) using the count of leading zeros of fraction (s\_zeros) computed in above process. After the combinational logic, the following sequential process blocks are used in FPU to compute the normalized fraction and corresponding exponent using left shifted mantissa (s\_fract\_shl) and decremented exponent (s\_exp\_shl) computed using combinational logic.

Listing 2. Process to compute normalized fraction

```
process (clk_i) -- process to compute normalized fraction
begin
    if rising_edge(clk_i) then
        if s_shr1='1' then -- if carry bit is set, then right shift
            s_fract_1 <= s_fract_shr1; -- assign right shifted fraction
        elsif s_shl='1' then -- if carry bit and hidden bits are zeros, then left shift
            s_fract_1 <= s_fract_shl; -- assign left shifted fraction
        else
            s_fract_1 <= s_fract_28_i; -- assign already normalized fraction
        end if;
    end if;
end process;

process (clk_i) -- process to compute normalized exponent
begin
    if rising_edge(clk_i) then
        if s_shr1='1' or s_shr1e='1' then -- if carry bit is set, then right shift
            s_exp_1 <= s_exp_shr1; -- assign incremented exponent
        elsif s_shl='1' then -- if carry bit and hidden bits are zeros, then left shift
            s_exp_1 <= s_exp_shl; -- assign decremented exponent
        else
            s_exp_1 <= s_exp_i; -- assign already normalized exponent
        end if;
    end if;
end process;
```

The above two process blocks are replaced with equivalent combinational logic.

Similar changes have been made throughout the floating-point unit to enable its proper functioning. This conversion of sequential logic to combinational logic increases the clock frequency. Postnormalization for multiplication is subdivided internally into more pipeline stages, and care is taken to see that no branch prediction hazards occur. The size of the pipeline influences the hardware cost: the greater the size, the higher the cost. In the FPU, the 32-bit operands given to prenormalization are also taken as inputs to postnormalization to find whether the inputs are infinities (two 1-bit signals) or NaNs (two 1-bit signals). Changes have been made to check the operands for infinity and SNaN in the prenormalization unit itself. If the operands are checked for infinity and SNaN in prenormalization, then 6 bits (four 1-bit signals indicating whether the inputs are infinities or not and two 1-bit signals indicating whether the inputs are NaNs or not) can be carried across the pipeline registers instead of 64 bits (two 32-bit operands). This checking of the inputs for infinity and NaN does not increase the length of the critical path in prenormalization.

With all the changes made, the modified FPU was thoroughly tested in fully pipelined operation with Gaussian-distributed synthetic test data applied through a VHDL test bench. This FPU was later used for the construction of the native-pair FPU.

#### 3.6. Examples

Based on the steps and the circuitry described for performing floating-point arithmetic in Chapter 2, the following examples are worked out step by step. Each subsection covers two examples. The same test cases are used for operation without the residual register and for operation with the residual register. For example, the operands used in addition are used again in addition with the residual register to clearly differentiate the functioning of the two approaches.

#### 3.6.1. IEEE 754 Floating-point addition examples

#### Example 1:

A = 0x (4171999A) = 01000001011100011001100110011010

B = 0x (3FC147AE) = 00111111110000010100011110101110

#### **Step 1: Prenormalization:**

Sign (A) = 0; Exp (A) = 10000010 = 130 – 127 = 3; Mantissa (A) = 11100011001100110011010;

Append mantissa (A) with carry, hidden, guard, round and sticky bit. Hidden = 1 if Exp $\neq$ "00000000" that is number is not a denormalized number.

Mantissa (A) = 01|11100011001100110011010|000

Sign (B) = 0; Exp (B) = 01111111 = 127 – 127 = 0; Mantissa (B) = 10000010100011110101110;

Append mantissa (B) with carry, hidden, guard, round and sticky bit.

Mantissa (B) = 01|10000010100011110101110|000

Exp difference = Exp (A) – Exp (B) = 3 - 0 = 3

 $\operatorname{Exp}(A) > \operatorname{Exp}(B) = 1$ 

Smaller mantissa = Mantissa (B) = 01|10000010100011110101110|000

Larger mantissa = Mantissa (A) = 01|11100011001100110011010|000

Shift smaller mantissa right by Exp difference.

RS (Right shifted) Smaller Mantissa = 00|00110000010100011110101|110 000

Exponent to Postnormalization = 3

#### **Step 2: Addition**

Output sign = sign (A) if Exp(A) > Exp(B) = 1 else Sign (B) = '0'

Larger mantissa = 01|11100011001100110011010|000

RS Smaller Mantissa = 00|00110000010100011110101|110

Sum = 10|00010011100001010001111|110

#### **Step 3: Postnormalization**

Sum = 10|00010011100001010001111|110

Exponent to Postnormalization = 3

Carry =1 => right shift sum once

Right shifted Sum = 01|00001001110000101000111|111 0

Exponent = 3+1 = 4

Sum is rounded up assuming round-to-nearest mode.

Rounded Sum = Right shifted Sum + 1 = 01|00001001110000101001000|111

Carry bit =0 =>No right shift.

Concatenating: Rounded Sum (excluding carry, hidden, guard, round and sticky bits) with Sign and Exponent

 $0 \mid 10000011 \mid 00001001110000101001000 = 0x (4184E148)$ 

#### Example 2:

A = 0x (501502F9) = 01010000000101010000001011111001

B = 0x (219392EF) = 00100001100100111001001011101111

#### **Step 1: Prenormalization**

Sign (A) = 0; Exp (A) = 10100000 = 160 - 127 = 33; Mantissa (A) = 00101010000001011111001;

Append mantissa (A) with carry, hidden, guard, round and sticky bit. Hidden = 1 if  $Exp \neq$  "00000000" that is number is not a denormalized number.

Mantissa (A) = 01|00101010000001011111001|000

Sign (B) = 0; Exp (B) = 01000011 = 67 – 127 = -60; Mantissa (B) = 00100111001001011101111;

Append mantissa (B) with carry, hidden, guard, round and sticky bit.

Mantissa (B) = 01|00100111001001011101111|000

Exp difference = Exp (A) - Exp (B) = 33 - (-60) = 93

 $\operatorname{Exp}(A) > \operatorname{Exp}(B) = 1$ 

Smaller mantissa = Mantissa (B) = 01|00100111001001011101111|000

Larger mantissa = Mantissa (A) = 01|00101010000001011111001|000

Shift smaller mantissa right by Exp difference.

Discarded bits - 0100100111001001011101111

Exponent to Postnormalization = 33

#### **Step 2: Addition**

Output sign = sign (A) if Exp(A) > Exp(B) = 1 else Sign (B) = '0'

Larger mantissa = 01|00101010000001011111001|000

Sum = 01|00101010000001011111001|001

#### **Step 3: Postnormalization**

Sum = 01|00101010000001011111001|001

Exponent to Postnormalization = 33

Carry  $=0 \Rightarrow$  no right shift sum.

Sum = 01|00101010000001011111001|001

Exponent = 33

Sum is rounded down assuming round-to-nearest mode.

Rounded Sum = Sum = 01|00101010000001011111001|001

Carry bit =0 =>No right shift.

Concatenate: Rounded Sum (excluding carry, hidden, guard, round and sticky bits) with Sign and Exponent

 $0 \mid 10100000 \mid 00101010000001011111001 = 0x (501502F9)$ 
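The two worked examples can be checked with a bit-level Python sketch of the three stages. This is a simplified model under stated assumptions, not the thesis VHDL: the function names unpack28 and add\_same\_sign are hypothetical, only the addition of two normalized operands of equal sign in round-to-nearest mode is covered, and exception handling is omitted.

```python
def unpack28(bits: int):
    """Split a single-precision word into sign, biased exponent and a 28-bit
    mantissa laid out as carry | hidden | 23 fraction bits | guard round sticky."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    hidden = 0 if exp == 0 else 1
    if exp == 0:
        exp = 1                               # denormal: exponent -126 after unbiasing
    mant = ((hidden << 23) | (bits & 0x7FFFFF)) << 3
    return sign, exp, mant


def add_same_sign(a_bits: int, b_bits: int) -> int:
    """Add two normalized operands of equal sign (round to nearest even)."""
    sign, exp_a, mant_a = unpack28(a_bits)
    sign_b, exp_b, mant_b = unpack28(b_bits)
    assert sign == sign_b, "sketch covers the addition case only"
    if exp_a < exp_b:                         # make A the larger-exponent operand
        exp_a, exp_b, mant_a, mant_b = exp_b, exp_a, mant_b, mant_a

    # Prenormalization: align the smaller mantissa, fold lost bits into sticky.
    shift = min(exp_a - exp_b, 28)
    lost = mant_b & ((1 << shift) - 1)
    mant_b = (mant_b >> shift) | (1 if lost else 0)

    # Addition stage.
    total = mant_a + mant_b
    exp = exp_a

    # Postnormalization: a carry out means one right shift, keeping sticky.
    if total >> 27:
        total = (total >> 1) | (total & 1)
        exp += 1
    grs = total & 0b111                       # guard, round, sticky
    mant = total >> 3                         # carry | hidden | fraction
    if grs > 0b100 or (grs == 0b100 and mant & 1):
        mant += 1                             # round up
    if mant >> 24:                            # rounding produced a carry
        mant >>= 1
        exp += 1
    return (sign << 31) | (exp << 23) | (mant & 0x7FFFFF)


print(hex(add_same_sign(0x4171999A, 0x3FC147AE)))  # Example 1
print(hex(add_same_sign(0x501502F9, 0x219392EF)))  # Example 2
```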

#### 3.6.2. Addition with Residual Register examples

#### Example 1:

A = 0x (4171999A) = 01000001011100011001100110011010

B = 0x (3FC147AE) = 00111111110000010100011110101110

#### **Step 1: Prenormalization**

Sign (A) = 0; Exp (A) = 10000010 = 130 – 127 = 3; Mantissa (A) = 11100011001100110011010;

Append mantissa (A) with carry, hidden, guard, round and sticky bit. Hidden = 1 if  $Exp \neq$  "00000000" that is number is not a denormalized number.

Mantissa (A) = 01|11100011001100110011010|000

Sign (B) = 0; Exp (B) = 01111111 = 127 – 127 = 0; Mantissa (B) = 10000010100011110101110;

Append mantissa (B) with carry, hidden, guard, round and sticky bit.

Mantissa (B) = 01|10000010100011110101110|000

Exp difference = Exp (A) – Exp (B) = 3 - 0 = 3

 $\operatorname{Exp}(A) > \operatorname{Exp}(B) = 1$ 

Smaller mantissa = Mantissa (B) = 01|10000010100011110101110|000

Larger mantissa = Mantissa (A) = 01|11100011001100110011010|000

Shift smaller mantissa right by Exp difference.

RS (Right shifted) Smaller Mantissa = 00|00110000010100011110101|110

Discarded bits - 110 go into residual register

Exponent to Postnormalization = 3

#### **Step 2: Addition**

Output sign = sign (A) if Exp(A) > Exp(B) = 1 else Sign (B) = '0'

Larger mantissa = 01|11100011001100110011010|000

RS Smaller Mantissa = 00|00110000010100011110101|110

Sum = 10|00010011100001010001111|110

#### **Step 3: Postnormalization**

Sum = 10|00010011100001010001111|110

Exponent to Postnormalization = 3

Carry =1 => right shift sum once

Right shifted Sum = 01|00001001110000101000111|111

Discarded bit = 1; Goes into residual register.

Exponent = 3+1 = 4

Sum is rounded up assuming round-to-nearest mode.

Rounded Sum = Right shifted Sum + 1 = 01|00001001110000101001000|111

Carry bit =0 =>No right shift.

Mantissa (RR) after rounding = 0000000000000000000001110

Concatenate: Rounded Sum (excluding carry, hidden, guard, round and sticky bits) with Output Sign and Output Exponent.

 $0 \mid 10000011 \mid 00001001110000101001000 = 0x (4184E148)$ 

Complement flag for residual register = sign (a) XOR sign (b) XOR Roundup = 1

Residual register sign = Output Sign XOR Roundup = 1

2's complement of 1110 = 0010

Exponent (RR) = Exp (A) –  $2(N_m + 1)$  when (exponent difference >  $N_m + 1$ ) else Exp (B).

Where  $N_m$  is number of mantissa bits in the native-precision floating-point number. For Single-precision  $N_m = 23$ .

Accordingly, Exponent (RR) = Exp(B) = 0

Final un-normalized residual value = Sign (RR) | Exponent (RR) | Final residual mantissa

Final outputs:

Output = 0 | 10000011 |00001001110000101001000 = 0x (4184E148)

Sign (RR) = 1

Exponent (RR) = "01111111"

MOVRR = 1

#### **Step 4: Normalization of residual value**

Exponent = Exponent (RR) = 01111111

Sign = Sign (RR) = 1

Hidden bit = 0 $\Rightarrow$ left shift till hidden bit = 1

22 left shifts. Exponent = Exponent (RR) - 22 = 127 - 22 = 105 = 01101001

Normalized Residual register value = 1|01101001|00000000000000000000000 = 0x (B4800000)

#### Example 2:

A = 0x (501502F9) = 01010000000101010000001011111001

B = 0x (219392EF) = 00100001100100111001001011101111

#### **Step 1: Prenormalization**

Sign (A) = 0; Exp (A) = 10100000 = 160 - 127 = 33;

Mantissa (A) = 00101010000001011111001

Append mantissa (A) with carry, hidden, guard, round and sticky bit. Hidden = 1 if  $Exp \neq$  "00000000" that is number is not a denormalized number.

Mantissa (A) = 01|00101010000001011111001|000

Sign (B) = 0; Exp (B) = 01000011 = 67 - 127 = -60;

Mantissa (B) = 00100111001001011101111

Append mantissa (B) with carry, hidden, guard, round and sticky bit.

Mantissa (B) = 01|00100111001001011101111|000

Exp difference = Exp (A) - Exp (B) = 33 - (-60) = 93

 $\operatorname{Exp}(A) > \operatorname{Exp}(B) = 1$ 

Smaller mantissa = Mantissa (B) = 01|00100111001001011101111|000

Larger mantissa = Mantissa (A) = 01|00101010000001011111001|000

Shift smaller mantissa right by Exp difference.

Discarded bits - 0100100111001001011101111 go into residual register

Mantissa (RR) = **0100100111001001011101111** 

Exponent to Postnormalization = 33

#### **Step 2: Addition**

Output sign = sign (A) if Exp(A) > Exp(B) = 1 else Sign (B) = '0'

Larger mantissa = 01|00101010000001011111001|000

Sum = 01|00101010000001011111001|001

#### **Step 3: Postnormalization**

Sum = 01|00101010000001011111001|001

Exponent to Postnormalization = 33

Carry  $=0 \Rightarrow$  no right shift sum.

Sum = 01|00101010000001011111001|001

Exponent = 33

Mantissa (RR) before rounding = Mantissa (RR) = 0100100111001001011101111

Sum is rounded down assuming round-to-nearest mode.

Rounded Sum = Sum = 01|00101010000001011111001|001

Carry bit =0 =>No right shift.

Mantissa (RR) after rounding = Mantissa (RR) before rounding = 0100100111001001011101111

Concatenate: Rounded Sum (excluding carry, hidden, guard, round and sticky bits) with Sign and Exponent

 $0 \mid 10100000 \mid 00101010000001011111001 = 0x (501502F9)$ 

Complement flag for residual register = sign (a) XOR sign (b) XOR Roundup = 0

Residual register sign = Output Sign XOR Roundup = 0

Final residual mantissa = 0100100111001001011101111

Exponent (RR) = Exp (A) –  $2(N_m + 1)$  when (exponent difference >  $N_m + 1$ ) else Exp (B).

Where  $N_m$  is number of mantissa bits in the native-precision floating-point number. For Single-precision  $N_m = 23$ .

Accordingly, Exponent (RR) =  $Exp(A) - 2(N_m + 1) = 65 = 01000011$ 

Final un-normalized residual value = Sign (RR) | Exponent (RR) | Final residual mantissa

Final outputs:

Output = 0 | 10100000 |00101010000001011111001 = 0x (501502F9)

Sign (RR) = 0

Exponent (RR) = "01000011"

Mantissa (RR) = "0100100111001001011101111"

MOVRR = 1

#### **Step 4: Normalization of residual value**

Sum = Mantissa (RR) & 000 = 0100100111001001011101111|000

Hidden bit = 1, so no shifting is needed; Exponent = Exponent (RR) = 01000011

Normalized Residual register value = 0|01000011|00100111001001011101111

= 0x (219392EF)
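The prenormalization step of Example 2 can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the thesis hardware: the helper names (`fields`, `extended_mantissa`, `align`) are hypothetical, and the handling of shifts smaller than the 28-bit extended width is an assumption, since the example only exercises the case where the whole mantissa is shifted out.

```python
WIDTH = 28  # carry | hidden | 23 mantissa bits | guard, round, sticky


def fields(word):
    """Split a 32-bit word into (sign, biased exponent, 23-bit mantissa)."""
    return (word >> 31) & 1, (word >> 23) & 0xFF, word & 0x7FFFFF


def extended_mantissa(word):
    """Prepend carry = 0 and the hidden bit, append guard/round/sticky = 000."""
    _, exponent, mantissa = fields(word)
    hidden = 1 if exponent != 0 else 0  # 0 only for denormalized numbers
    return ((hidden << 23) | mantissa) << 3


def align(extended, shift):
    """Right-shift by `shift`; return (aligned mantissa with sticky, lost bits)."""
    if shift >= WIDTH:
        kept, lost = 0, extended
    else:
        kept, lost = extended >> shift, extended & ((1 << shift) - 1)
    sticky = 1 if lost else 0
    return kept | sticky, lost


a, b = 0x501502F9, 0x219392EF
exp_diff = fields(a)[1] - fields(b)[1]          # 160 - 67
aligned_b, lost = align(extended_mantissa(b), exp_diff)
total = extended_mantissa(a) + aligned_b        # Step 2: addition
residual_bits = lost >> 3                       # drop the appended 000

print(exp_diff)
print(format(total, "028b"))
print(format(residual_bits, "025b"))
```

With the Example 2 operands, the exponent difference is 93, the sum keeps the mantissa of A with only the sticky bit set, and the 25 bits captured for the residual register are the carry, hidden and mantissa bits of B.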

# **3.6.3. IEEE 754 Floating-point Multiplication Examples:**

#### Example 1:

A = 0x (4171999A) = 01000001011100011001100110011010

B = 0x (40000011) = 01000000000000000000000000010001

#### **Step 1: Prenormalization**

Sign (A) = 0; Exp (A) = 10000010 = 130 - 127 = 3; Mantissa (A) = 11100011001100110011010;

Prepend mantissa (A) with a hidden bit. Hidden = 1 if  $Exp \neq$  "00000000" that is number is not a denormalized number.

Mantissa (A) = 1|11100011001100110011010 - 0x (F1999A)

Sign (B) = 0; Exp (B) = 10000000 = 128 - 127 = 1; Mantissa (B) = 00000000000000000010001;

Prepend mantissa (B) with a hidden bit.

Exp (O), the exponent passed to postnormalization = 130 + 128 - 127 = 131 (with bias) = 4 (without bias)

#### **Step 2: Multiplication**

Output sign = sign (A) XOR Sign (B) = '0' XOR '0' = '0'

Mantissa (A) = 1|11100011001100110011010 – 0x (F1999A)

Mantissa (B) = 1|00000000000000000010001 – 0x (800011)

#### **Step 3: Postnormalization**

Exponent to Postnormalization = Exp(O) = 4

Carry =0 => No right shifting product[47:0]

Product 2 [47:0] = **01** | 11100011001100110111010 | **00** | 010110011001100111010

01 – carry and hidden bits.

00 – guard and round bits.

Sticky = OR (Product 2 [20:0]) = 1

Roundup = guard and ((round or sticky) or Product 2 (23)) = 0

Based on the rounding logic, the product is rounded down.

Rounded product = Product 2 [47:23] = 01 | 11100011001100110111010

Lower 23 bits discarded - 00 | 010110011001100111010

Carry bit = 0 => No right shift.

Concatenate: Rounded product (excluding carry and hidden bits) with Sign and Exponent

0 | 10000011 | 11100011001100110111010 = 0x (41F199BA)
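The multiplication steps of Example 1 can be replayed with plain integer arithmetic. The sketch below is an illustration under stated assumptions rather than the VHDL implementation: the function name `multiply` and its return layout are hypothetical, and the rounding expression follows the Roundup formula quoted in Step 3 for round-to-nearest mode.

```python
def multiply(man_a, exp_a, man_b, exp_b, sign_a=0, sign_b=0):
    """Multiply two 24-bit mantissas (hidden bit included, biased exponents).

    Returns (result word, roundup, discarded low 23 bits, 48-bit product).
    Inf, NaN, denormalized inputs and exponent overflow are not handled.
    """
    product = man_a * man_b                 # 24 x 24 -> 48 bits
    exponent = exp_a + exp_b - 127
    dropped = 0
    if (product >> 47) & 1:                 # carry set: shift right once
        dropped = product & 1
        product >>= 1
        exponent += 1
    guard = (product >> 22) & 1
    rnd = (product >> 21) & 1
    sticky = 1 if (product & 0x1FFFFF) or dropped else 0
    lsb = (product >> 23) & 1               # Product 2 (23)
    roundup = guard & (rnd | sticky | lsb)
    rounded = (product >> 23) + roundup
    if rounded >> 24:                       # rounding produced a carry
        rounded >>= 1
        exponent += 1
    discarded = product & 0x7FFFFF
    word = ((sign_a ^ sign_b) << 31) | (exponent << 23) | (rounded & 0x7FFFFF)
    return word, roundup, discarded, product


word, roundup, discarded, product = multiply(0xF1999A, 130, 0x800011, 128)
print(f"{product:012X} {word:08X} {roundup} {discarded:06X}")
```

For the mantissas 0x (F1999A) and 0x (800011) this reproduces the output word 0x (41F199BA) with no round-up, and the discarded low bits are the ones that the residual register version of this example stores.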

#### Example 2:

A = 0x (501502F9) = 01010000000101010000001011111001

B = 0x (41A77700) = 01000001101001110111011100000000

#### **Step 1: Prenormalization**

Sign (A) = 0; Exp (A) = 10100000 = 160 - 127 = 33;

Mantissa (A) = 00101010000001011111001;

Prepend mantissa(A) with a hidden bit. Hidden = 1 if  $Exp \neq$  "00000000" that is number is not a denormalized number.

Mantissa (A) = 1|00101010000001011111001

Sign (B) = 0; Exp (B) = 10000011 = 131 - 127 = 4;

Mantissa (B) = 01001110111011100000000;

Prepend mantissa(B) with a hidden bit.

Mantissa (B) = 1|01001110111011100000000

 $\operatorname{Exp}(A) > \operatorname{Exp}(B) = 1$ 

Mantissa (B) = 1|01001110111011100000000

Mantissa (A) = 1|00101010000001011111001

Exponent to Postnormalization = Exp(O) = 160 + 131 - 127 = 164 (with bias) = 37 (without bias) = 10100100

#### **Step 2: Multiplication**

Output sign = sign (A) XOR Sign (B) = '0' XOR '0' = '0'

Mantissa (A) = 1|00101010000001011111001

Mantissa (B) = 1|01001110111011100000000

------

#### **Step 3: Postnormalization**

Exponent to Postnormalization = Exp(O) = 37

Carry =0=> no right shift product[47:0]; Exp (O) = 37

Product 2 [47:0] = **01** | 10000101111010001101001 | **10** | 100001011111100000000

01 – carry and hidden bits.

10 – guard and round bits.

Sticky = OR (Product 2 [20:0]) = 1

Roundup = guard and ((round or sticky) or Product 2 (23)) = 1

Product is rounded up assuming round-to-nearest mode.

Rounded Product = Product 2 [47:23] + 1 = 01|10000101111010001101001 + 1

= 01|10000101111010001101010

Discarded bits - Product 2 [22:0] - 10 | 100001011111100000000

Carry bit = 0 => No right shift.

Exp(O) = 37 (without bias) = 164 (with bias) = 10100100

Concatenate: Rounded product (excluding carry and hidden bits) with Sign and Exponent

 $0 \mid 10100100 \mid 10000101111010001101010 = 0x (5242F46A)$ 

# **3.6.3. Multiplication with Residual Register Examples**

#### Example 1:

A = 0x (4171999A) = 01000001011100011001100110011010

B = 0x (40000011) = 01000000000000000000000000010001

#### **Step 1: Prenormalization**

Sign (A) = 0; Exp (A) = 10000010 = 130 – 127 = 3; Mantissa (A) = 11100011001100110011010;

Prepend mantissa (A) with a hidden bit. Hidden = 1 if  $Exp \neq$  "00000000" that is number is not a denormalized number.

Mantissa (A) = 1|11100011001100110011010 - 0x (F1999A)

Sign (B) = 0; Exp (B) = 10000000 = 128 - 127 = 1; Mantissa (B) = 00000000000000000010001;

Prepend mantissa (B) with a hidden bit.

Exp (O), the exponent passed to postnormalization = 130 + 128 - 127 = 131 (with bias) = 4 (without bias)

#### **Step 2: Multiplication**

Output sign = sign (A) XOR Sign (B) = '0' XOR '0' = '0'

Mantissa (A) = 1|11100011001100110011010 - 0x (F1999A)

Mantissa (B) = 1|00000000000000000010001 - 0x (800011)

#### **Step 3: Postnormalization**

Exponent to Postnormalization = 4

Carry = 0 => No right shift product [47:0]

Product 2 [47:0] = **01** | 11100011001100110111010 | **00** | 010110011001100111010

01 – carry and hidden bits.

00 – guard and round bits.

Sticky = OR (Product 2 [20:0]) = 1

Roundup = guard and ((round or sticky) or Product 2 (23)) = 0

Based on the rounding logic, the product is rounded down.

Rounded product = Product 2 [47:23] = 01 | 11100011001100110111010

Lower 23 bits discarded go into the residual register - 00| 010110011001100111010

Residual register mantissa = Mantissa (RR) = 00| 010110011001100111010

Carry bit =0 =>No right shift. Nothing goes into residual register.

Mantissa (RR) = 000| 010110011001100111010

Exponent (RR) = Output Exponent -24 = 131-24 = 107 = 01101011

Complement (RR) = roundup ='0'

Sign (RR) = sign (O) XOR roundup = '0'

Concatenate: Rounded product (excluding carry and hidden bits) with Sign and Exponent

0 | 10000011 | 11100011001100110111010 = 0x (41F199BA)

Outputs:

Output = 0 | 10000011 | 11100011001100110111010 = 0x (41F199BA)

Mantissa (RR) = 000010110011001100111010; Exponent (RR) = 01101011; Sign (RR) = '0';

-----

MOVRR =1

#### Step 4: Normalization of Residual Register Value

Exponent to postnormalization = 107 = 01101011

Sign (RR) = '0'

Hidden bit =  $0 \Rightarrow$  count zeros from left starting from hidden bit or Product[46] = 2

Shift left product 2 times.

Shifted product = Product 2 [46] =

Exponent (RR) = 107 - 5 = 102

01 – carry and hidden bits.

00 – guard and round bits.

Sticky = OR (Product 2 [20:0]) = 0

Roundup = guard and ((round or sticky) or Product 2 (23)) = 0

Shifted product is rounded down based on the rounding logic.

Rounded product = 00010110011001100111010

Carry bit =  $0 \Rightarrow$  no right shift.

Output = Sign (RR) | Exponent (RR)| Rounded Product (excluding carry and hidden bits)

Normalized Residual value = 0|01100110|00010110011001100111010 = 0x (330B333A)

#### **Example 2:**

A = 0x (501502F9) = 01010000000101010000001011111001

B = 0x (41A77700) = 01000001101001110111011100000000

#### **Step 1: Prenormalization**

Sign (A) = 0; Exp (A) = 10100000 = 160 - 127 = 33;

Mantissa (A) = 00101010000001011111001;

Prepend mantissa (A) with a hidden bit. Hidden = 1 if  $Exp \neq$  "00000000" that is number is not a denormalized number.

Mantissa (A) = 1|00101010000001011111001

Sign (B) = 0; Exp (B) = 10000011 = 131 - 127 = 4;

Mantissa (B) = 01001110111011100000000;

Prepend mantissa (B) with a hidden bit.

Mantissa (B) = 1|01001110111011100000000

 $\operatorname{Exp}(A) > \operatorname{Exp}(B) = 1$ 

Mantissa (B) = 1|01001110111011100000000

Mantissa (A) = 1|00101010000001011111001

Exponent to Postnormalization = Exp(O) = 160 + 131 - 127 = 164 (with bias) = 37 (without bias) = 10100100

#### **Step 2: Multiplication**

Output sign = sign (A) XOR Sign (B) = '0' XOR '0' = '0'

Mantissa (A) = 1|00101010000001011111001

Mantissa (B) = 1|01001110111011100000000

\_\_\_\_\_

#### **Step 3: Postnormalization**

Exponent to Postnormalization = Exp(O) = 37

Carry =0=> no right shift product [47:0]; Exponent = Exp(O) = 37

Product 2 [47:0] = **01** | 10000101111010001101001 | **10** | 100001011111100000000

01 – carry and hidden bits.

10 – guard and round bits.

Sticky = OR (Product 2 [20:0]) = 1

Roundup = guard and ((round or sticky) or Product 2 (23)) = 1

Product is rounded up assuming round-to-nearest mode.

Rounded Product = Product 2 [47:23] + 1 = 01|10000101111010001101001 + 1

= 01|10000101111010001101010

Discarded bits – Product 2 [22:0] - 10 | 100001011111100000000 go into the residual register

Residual Register Mantissa = 10100001011111100000000

Carry bit =0 =>No right shift. So nothing is added into the residual register.

Concatenate: Rounded product (excluding hidden bit) with Sign and Exponent

Product output = 0 | 10100100 |10000101111010001101010 = 0x (5242F46A)

Mantissa (RR) = 00|10100001011111100000000

Exponent (RR) = Exp (O) - 24 = 164 - 24 = 140 (with bias) = 10001100

Sign (RR) = Sign (O) XOR roundup = 0 XOR 1 = 1

Complement (RR) = roundup = '1' => bits added in Mantissa (RR) are complemented.

Mantissa (RR) =  $00 \& (\sim \text{Mantissa (RR)})$ 

= 00 | 01011110100000011111111

Outputs:

Product output = 01010010010000101111010001101010 = 0x (5242F46A)

Mantissa (RR) = 00 | 01011110100000011111111

Exponent (RR) = 10001100; Sign (RR) = '1'

\_\_\_\_\_

MOVRR = 1

#### Step 4: Normalization of Residual register value

Exponent to postnormalization = 140 = 10001100

Sign (RR) = '1'

Hidden bit =  $0 \Rightarrow$  count zeros from left starting from hidden bit or Product[46] = 2

Shift left product 3 times.

Exponent (RR) = 140 - 3 = 137

01 – carry and hidden bits.

00 – guard and round bits.

Sticky = OR (Product 2 [20:0]) = 0

Roundup = guard and ((round or sticky) or Product 2 (23)) = 0

Shifted product is rounded down based on the rounding logic.

Rounded product = 01|01111010000001111111100

Carry bit =  $0 \Rightarrow$  no right shift.

Output = Sign (RR) | Exponent (RR)| Rounded Product (excluding carry and hidden bits)

Normalized Residual value = 1|10001001|01011110100000011111111 = 0x (C4AF40FF)
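The sign and complement rule used for the multiplier residual in the two examples above can be sketched as below. The reasoning is that when the product is rounded up, the stored result is larger than the exact product, so the residual must carry the opposite sign. The function name `residual_sign_and_mantissa` is hypothetical, and the bitwise (one's) complement follows the `~` written in Example 2; the adder examples describe a 2's complement instead.

```python
def residual_sign_and_mantissa(sign_out, roundup, discarded, width=23):
    """Apply Sign (RR) = Sign (O) XOR roundup and the complement flag.

    `discarded` holds the `width` low product bits dropped by rounding.
    When roundup is 1 the bits are complemented, as in Example 2.
    """
    mask = (1 << width) - 1
    sign_rr = sign_out ^ roundup
    mantissa_rr = (~discarded & mask) if roundup else (discarded & mask)
    return sign_rr, mantissa_rr


# Example 1: rounded down, residual keeps the sign and the discarded bits.
print(residual_sign_and_mantissa(0, 0, 0b00010110011001100111010))

# Example 2: rounded up, sign flips and the discarded bits are complemented.
sign_rr, mantissa_rr = residual_sign_and_mantissa(0, 1, 0b10100001011111100000000)
print(sign_rr, format(mantissa_rr, "023b"))
```

The second call yields sign 1 and the mantissa bits listed under Outputs in Example 2.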

#### **Chapter 4. Testing and Results**

The Native-pair FPU adder and multiplier have been verified using 6,300 test cases. Both behavioral and post-route simulations were run using ModelSim for this purpose. A set of Gaussian distributed data was used to test both the adder and the multiplier thoroughly. This Gaussian sequence was a sequence of 6300 million randomized numbers with a mean of zero and a variance of 1. The sequence values were stored in a text file in IEEE 754 format. A VHDL test bench was written to read the data from this text file and generate results. The same test cases were used to test both the adder and the multiplier, and the results obtained were compared with the results produced by the software program. The VHDL test benches and the residual.cpp code are given in Appendix B.

Appendix A describes in detail the place and route simulation output waveforms.
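For illustration, test vectors of the kind described in this chapter could be produced along the following lines. This is only a sketch under assumptions and is not the residual.cpp program from Appendix B: the function names, the seed, and the choice of one 8-digit hexadecimal word per value are all hypothetical.

```python
import random
import struct


def gaussian_test_vectors(count, seed=0):
    """Return `count` Gaussian samples (mean 0, variance 1) as IEEE 754
    single-precision words, each formatted as 8 hexadecimal digits."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        value = rng.gauss(0.0, 1.0)
        # Round the sample to single precision and reinterpret it as an integer.
        (word,) = struct.unpack(">I", struct.pack(">f", value))
        vectors.append(f"{word:08X}")
    return vectors


def decode_vector(text):
    """Convert an 8-digit hexadecimal word back to a Python float."""
    return struct.unpack(">f", bytes.fromhex(text))[0]


vectors = gaussian_test_vectors(6300, seed=1)
mean = sum(decode_vector(v) for v in vectors) / len(vectors)
print(len(vectors), vectors[0], round(mean, 3))
```

Each string can be written on its own line of a text file for a test bench to read; the count of 6300 is taken from the number of test cases quoted above.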

#### Chapter 5. Estimation of Hardware Cost and Performance

The floating-point unit with residual register needed to perform Native-pair floating-point arithmetic has been implemented in VHDL. The floating-point adder and multiplier with residual register have been synthesized to evaluate the hardware complexity and speed of the design. Implementation costs of these designs are compared with those of a 32-bit FPU and a 64-bit FPU without the residual register hardware. The Xilinx ISE 9.1 tool is used to generate post-route synthesis reports and ModelSim is used to generate the place and route simulation waveforms. The three designs (32-bit FPU with residual register, 32-bit FPU without residual register, and 64-bit FPU without residual register) are targeted to the Xilinx Virtex 4 FPGA xc4vlx25 device. The individual stages of the floating-point adder and multiplier, i.e., the prenormalization unit, addition unit, multiplication unit and postnormalization unit, are also synthesized in order to analyze the residual register hardware needed to support Native-pair floating-point arithmetic in a native 32-bit FPU.

### 5.1. Adder Implementation

| ADDER | Implementation cost: Slices | Implementation cost: % Increase | Minimum period (ns) | Minimum period: % Increase |
|--------------------------------------------|------|------|--------|-------|
| 32-bit FPU adder without residual register | 1437 | 0.0  | 20.924 | 0.0   |
| 32-bit FPU adder with residual register    | 1674 | 16.5 | 24.971 | 19.34 |
| 64-bit FPU adder without residual register | 2272 | 56.4 | 62.1   | 197.8 |

 Table 6. Comparison of Implementation cost and delay for Adders

Table 6 above gives the number of slices used and the minimum period of the critical path obtained from the post-route synthesis reports of the floating-point adders. The Implementation cost column is divided into the absolute utilization and relative utilization. The absolute utilization gives the exact number of slices needed for each design and the relative utilization indicates the percentage increase in number of slices for each design relative to the 32-bit FPU adder without residual register. The Minimum period column gives the delay in the critical path of the three adders and their relative increases, taking the 32-bit FPU adder without residual register as the baseline. The observations drawn from Table 6 are:

- The percentage increase in hardware from using the 32-bit FPU adder with residual register instead of the 32-bit FPU adder without residual register is 16.5%, and the percentage increase jumps to 56.4% when the 64-bit FPU adder without residual register is used.
- The minimum period increases by 4.05 ns for the 32-bit FPU adder with residual register, compared to an increase of 41 ns for the 64-bit FPU adder hardware. The relative increase in the minimum period changes from 19.34% to 197.8%.
- Comparing the 32-bit FPU adder with residual register against the 64-bit FPU adder, the relative increase in hardware cost is larger by a factor of 3.4 and the relative increase in minimum period by a factor of 10.

The hardware cost of the 64-bit FPU adder is due to the increase in the use of resources and the increase in the size of the pipeline. This rise in hardware cost involves the addition of no new logic, only an increase in the size of all the combinational and sequential logic within the 32-bit FPU adder hardware. In contrast, the increase in the hardware cost of the 32-bit FPU adder with residual register is due to the addition of new logic needed for the residual value computation. Table 7 shows the comparison of 32-bit FPU adders with and without residual register hardware.

Table 7. Comparison of device utilization reports of Prenormalization unit for 32-bit FPU adder with and without residual register hardware

| ADDER - Prenormalization   | 32-bit FPU adder without residual register | 32-bit FPU adder with residual register |
|----------------------------|--------------------------------------------|-----------------------------------------|
| Number of Slices           | 498                                        | 492                                     |
| Number of Slice Flip Flops | 16                                         | 64                                      |
| Number of 4 input LUTs     | 889                                        | 880                                     |
| Minimum period             | 1.10 ns                                    | 1.648 ns                                |

Table 7 shows the device utilization summary of the prenormalization unit for the 32-bit FPU adder with and without the residual register. The overall hardware needed for the adder with residual register increases due to the addition of the residual register and the logic that stores the discarded bits in it after a right shift is performed in prenormalization. This extra hardware, shown in bold in Figure 14, adds sequential logic in the critical data path, which causes the delay and the increase in the minimum period.

 

Table 8. Comparison of device utilization reports of Postnormalization unit for 32-bit FPU adder with and without residual register hardware

| ADDER - Postnormalization | 32-bit FPU adder with residual register | 32-bit FPU adder without residual register |
|---------------------------|-----------------------------------------|--------------------------------------------|
| Number of Slices          | 522                                     | 639                                        |
| Number of 4 input LUTs    | 932                                     | 1130                                       |

Table 8 shows the extra hardware in the Postnormalization unit of the FPU adder with residual register. The extra hardware shown in Figure 15 is due to appending the discarded bits to the residual register value that comes from the prenormalization unit, computing the sign, the complement flag and the exponent value, computing the 2's complement of the residual value based on the complement flag, and storing all of this in a residual register. There is no added delay in the critical datapath because the extra hardware does not involve any sequential logic in the datapath.

### 5.2. Multiplier Implementation

Table 9 shows the results of the post-route synthesis reports. The increase in the number of slices is 330 for the 32-bit FPU multiplier with residual register and 4497 for the 64-bit FPU multiplier. The minimum period increases by 11.8% and 61.7% respectively for the 32-bit FPU multiplier with residual register hardware and the 64-bit FPU multiplier.

| MULTIPLIER | Implementation cost: Slices | Implementation cost: % Increase | Minimum period (ns) | Minimum period: % Increase |
|-------------------------------------------------|------|-------|--------|-------|
| 32-bit FPU multiplier without residual register | 2703 | 0.0   | 34.754 | 0.0   |
| 32-bit FPU multiplier with residual register    | 3033 | 12.2  | 38.875 | 11.8  |
| 64-bit FPU multiplier without residual register | 7200 | 166.3 | 55.7   | 61.74 |

 Table 9. Comparison of Implementation cost and delay of Multipliers

The inferences that can be drawn from Table 9:

- The hardware needed increases by 12.2% for the 32-bit FPU multiplier with residual register and by 166% for the 64-bit FPU multiplier.
- The minimum period increases by 4 ns for the 32-bit FPU multiplier with residual register hardware and by 21 ns for the 64-bit FPU multiplier.
- Hardware cost increase (64-bit FPU multiplier) = 13.6 × hardware cost increase (32-bit FPU multiplier with residual register).
- Minimum period increase (64-bit FPU multiplier) = 5.23 × minimum period increase (32-bit FPU multiplier with residual register).
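The slice figures quoted above can be reproduced from the table with a few lines of arithmetic. The sketch below uses only the slice counts reported for the three multipliers; the dictionary keys and helper name are hypothetical labels chosen for this illustration.

```python
# Slice counts for the three multiplier designs, as reported in the table.
slices = {"fpu32": 2703, "fpu32_rr": 3033, "fpu64": 7200}


def percent_increase(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


rr_increase = slices["fpu32_rr"] - slices["fpu32"]
fpu64_increase = slices["fpu64"] - slices["fpu32"]
ratio = fpu64_increase / rr_increase

print(rr_increase, fpu64_increase)
print(round(percent_increase(slices["fpu32_rr"], slices["fpu32"]), 1))
print(round(ratio, 1))
```

The increases come out as 330 and 4497 slices, the residual register version costs about 12.2% more than the baseline, and the 64-bit increase is about 13.6 times the residual register increase.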

Similar to the adders, the hardware increase in the 64-bit multiplier is primarily due to the increase in the size of the resources and the pipeline. From Table 9 it can also be observed that the increase in hardware when the residual register is used for the multiplier is less than the increase for the adder with residual register. This is because floating-point multiplication does not have any shifting or discarding of bits in prenormalization, whereas floating-point addition involves shifting of the mantissa in the prenormalization unit. The residual register hardware and the related logic are present only in the postnormalization unit for an FPU multiplier.

Table 10. Comparison of device utilization reports of Postnormalization unit for 32-bit FPU multiplier with and without residual register hardware

| Multiplier - Postnormalization | 32-bit FPU multiplier with residual register | 32-bit FPU multiplier without residual register |
|--------------------------------|-----------------------------------------|--------------------------------------------|
| Number of Slices               | 1817                                    | 1807                                       |
| Number of Slice flip flops     | 112                                     | 146                                        |
| Number of 4 input LUTs         | 3322                                    | 3311                                       |
| Minimum period                 | 8.332 ns                                | 7.637 ns                                   |

Table 10 shows the device utilization of the postnormalization unit in the FPU multiplier with residual register and the FPU multiplier without residual register. All the extra logic resulting in extra hardware is due to setting the residual mantissa before and after rounding, computing the residual register exponent, the sign and the complement flag, and complementing the residual register value if the complement flag is set. The delay in the critical path increases by 0.7 ns due to this extra hardware.

## Conclusion

Most processors in video game consoles and graphics hardware widely use 32-bit or single precision floating-point hardware which is available at low cost. To harness this hardware for scientific computing, the intermediate results in these processors require precision higher than 32-bits. Usage of double precision or 64-bit floating-point arithmetic is not justifiable for this purpose because the scientific computing market is too small to justify the added expense. Using 32-bit Native-pair arithmetic increases the accuracy of these applications close to that offered when double precision arithmetic is used but at a fraction of the cost of 64-bit floating point.

The FPU [19] used for this thesis was debugged extensively before being used to implement the residual register hardware needed for the adder and multiplier. The signals between modules were routed correctly to enable pipelined operation. The input signals were carried through the various stages of the pipeline, up to wherever they were needed. The pipelines were balanced to attain the maximum operable frequency. Signals that are present in the last stage and that required wider operands were computed in an earlier stage of the pipeline. Thereby the need to route wider operands through the pipeline was removed and the width of the pipeline was reduced. The debugged FPU gave addition outputs every 3 clock cycles and multiplication outputs every 5 clock cycles. The residual register hardware was added to the FPU adder and FPU multiplier and its proper functioning was verified. Both the FPU adder and the FPU multiplier have been thoroughly tested by performing post-route simulations and also by test-bench analysis using the synthetic data generated by the test code. The synthesis reports after placement and routing have been obtained and a detailed analysis has been shown in Chapter 5.

As can be seen from the synthesis results in Tables 6 and 9, the increase in hardware cost due to the residual register hardware is 16.5% for adders and 12.2% for multipliers. The increase in hardware for 64-bit floating-point hardware is 56.4% for adders and 166% for multipliers. When comparing just the 64-bit floating-point hardware and the 32-bit residual register hardware, the cost increase differs by a factor of 3.4 for adders and 13.6 for multipliers. The minimum period comparison obtained from the same tables shows that the period increases by 19.34% for the adder with residual register, whereas for the multiplier with residual register the increase is just 11.8%. In comparison, the minimum period of the 64-bit adder is 197.8% longer and that of the 64-bit multiplier is 61.7% longer.

These results show that, with a minimal increase in hardware cost and a moderate slowdown in performance, native-pair arithmetic can be used to increase the accuracy of floating-point computations instead of resorting to high-cost double precision hardware. The performance of the residual register arithmetic unit can be enhanced greatly through the use of speculation. Using the native-pair hardware only when the speculation software detects loss of information above a certain limit will result in floating-point arithmetic with higher precision, better accuracy and improved performance [31].

## Appendix A

#### **Post-route simulations**

The simulation reports for the addition with residual register are presented in the next two pages. The next page shows the post-route simulation in which 8 pairs of operand inputs were given in consecutive clock cycles. The 1<sup>st</sup> signal is the clock input; the 2<sup>nd</sup> signal is the MOVRR input; the 3<sup>rd</sup> and 4<sup>th</sup> signals are the operands A and B, represented as opa_i and opb_i; the 5<sup>th</sup> signal is the opcode (000 for addition, 001 for subtraction); the 6<sup>th</sup> signal is the rounding mode; the 7<sup>th</sup> signal is the output of the FPU, which is either the result or the normalized residual value depending on the MOVRR signal; the 8<sup>th</sup> and 16<sup>th</sup> signals are the output of the postnormalization unit and the input to the postnormalization unit. These signals were used to check that the right residual value was going into the postnormalization unit when MOVRR goes high. The 17<sup>th</sup> signal is the final sign of the residual register value from the postnormalization unit; the 20<sup>th</sup> signal is the complement flag output of the residual register in the postnormalization unit and is used to keep track of when the residual value is being complemented; signals 24, 25 and 26 are the exponent values of the residual registers in the prenormalization unit, addition/subtraction unit and postnormalization unit. Because the final exponent is set only in the postnormalization unit, only the 26<sup>th</sup> signal is significant. The 27<sup>th</sup>, 28<sup>th</sup> and 29<sup>th</sup> signals are the residual register mantissa outputs from the prenormalization unit, addition/subtraction unit and postnormalization unit. Since the mantissa is set in both prenormalization and postnormalization, signals 27 and 29 are the significant ones. Apart from these, signals 9 to 15 give the inexact, overflow and exception outputs. The 17<sup>th</sup> signal is the post_in signal; it is used to see whether the residual register mantissa mant_rr2, exponent exp_rr2 and sign_rr2 obtained in the previous clock cycle are sent as input to the postnormalization unit when the MOVRR signal is one. The 23<sup>rd</sup> signal is the ready signal; valid postnormalization unit outputs are sent to the FPU output only after this signal goes high. The FPU addition takes 3 clock cycles, one clock cycle each for prenormalization, addition and postnormalization.

In the waveform shown on the next page, signal 3 indicates operand A (opa_i); it is 4171999A in the 1<sup>st</sup> clock cycle, C171999A in the 2<sup>nd</sup>, 4171999A in the 3<sup>rd</sup>, C171999A in the 4<sup>th</sup>, 501502F9 in the 5<sup>th</sup> and 7<sup>th</sup>, D01502F9 in the 6<sup>th</sup> and 8<sup>th</sup> and 3F893773 in the 9<sup>th</sup> clock cycle. The 4<sup>th</sup> signal is operand B (opb_i), which takes the value 3FC147AE in the 1<sup>st</sup> and 2<sup>nd</sup> clock cycles, BFC147AE in the 3<sup>rd</sup> and 4<sup>th</sup> clock cycles, 219392EF in the 5<sup>th</sup> and 6<sup>th</sup> clock cycles, A19392EF in the 7<sup>th</sup> and 8<sup>th</sup> clock cycles and 00000000 in the 9<sup>th</sup> clock cycle. The output for the inputs in the 1<sup>st</sup> clock cycle, that is opa_i = 4171999A and opb_i = 3FC147AE, comes in the 4<sup>th</sup> clock cycle with the falling edge of the clock, and its value is shown by the 7<sup>th</sup> signal, output_o = 4184E148. The outputs for the other inputs follow in the subsequent clock cycles. The MOVRR signal goes high at the end of the 3<sup>rd</sup> clock cycle; at this time post_out = 4184E148 and post_in becomes the residual register mantissa value obtained in the previous clock cycle. The normalized residual value to be stored in the architectural register, B4800000, is obtained in the 5<sup>th</sup> clock cycle with the falling edge of the clock. Similarly, MOVRR again goes high in the 8<sup>th</sup> clock cycle to give the normalized residual register value 219392EF as output in the 9<sup>th</sup> clock cycle.

The second waveform was run to check whether all 8 input vectors give the correct output and residual register values. As can be seen in the waveform, the MOVRR signal is made to go high after every 3 clock cycles for this purpose.

[Figure: simulation waveform of `/fpuaddition`, time axis 0 to 4000000. Signals shown: clk_i, movrr, opa_i, opb_i, fpu_op_i, rmode_i, output_o, post_out, ine_o, overflow_o, underflow_o, inf_o, zero_o, qnan_o, snan_o, post_in, sign_rr0 to sign_rr2, cmpl_rr0 to cmpl_rr2, ready_o, exp_rr0 to exp_rr2, mant_rr0 to mant_rr2.]

[Figure: simulation waveform of `/fpuaddition`, time axis 5000000 to 9000000, showing the same signals from clk_i through mant_rr2. Successive values readable on output_o: 4184E148, B4800000, 415970A4, C184E148, 501502F9, 219392EF, 501502F9, D01502F9, 3F893773.]



|                                |                  |                                               |                       |               |                                   |                      |                     |                  |                   | -        |
|--------------------------------|------------------|-----------------------------------------------|-----------------------|---------------|-----------------------------------|----------------------|---------------------|------------------|-------------------|----------|
| /fpu_add_mar14tone/clk_i       |                  |                                               |                       |               |                                   |                      |                     |                  |                   |          |
| /fpu_add_mar14tone/movrr       |                  |                                               | -                     |               |                                   |                      | -                   |                  | -                 | -        |
| /fpu_add_mar14tone/opa_i       |                  |                                               | 501502F9              | +             | 3F893773                          | +                    | BF591417            | +                | BFB363ED          | -        |
| /fpu_add_mar14tone/opb_i       |                  | ι χ                                           | 219392EF              | )             | 0000000                           | )<br>t               | BF02C7D7            | )                | BE3F7FFE          | -        |
| /fpu_add_mar14tone/fpu_op_i    | 0                |                                               | -                     | -             | -                                 | +                    | -                   | +                | -                 | -        |
| /fpu_add_mar14tone/rmode_i     | 0                |                                               | -                     | -             | -                                 | -                    | -                   | -                | -                 | -        |
| /fpu_add_mar14tone/output_o    | 0000000          | 4184                                          | E148 B4800000 X4184   | E148 (501502F | 9 (219392EF) (5015)               | 02F9 (3F89377        | 3 (000000 (3F89     | 3773 BFADED      | 7 (B3800000 (BFAD | EDF7     |
| /fpu_add_mar14tone/post_out    | 57C00000 8000000 | 7FC00000 X4184E148 B480                       | ‱ <b>∦</b> 4184E148   | 501502F9      | <sup>92EF</sup> <b>≬</b> 501502F9 | (3F893773 <b>)</b> ∞ | ‱ <b>(</b> 3F893773 | BFADEDF7         | BFADEDF7          | BFCBS3ED |
| /fpu_add_mar14tone/ine_o       |                  |                                               |                       |               |                                   |                      | -                   | -                | -                 | <u> </u> |
| /fpu_add_mar14tone/overflow_o  |                  |                                               | -                     | -             | -                                 | -                    | -                   | -                | -                 | <u> </u> |
| /fpu_add_mar14tone/underflow_o |                  |                                               | -                     | -             | -                                 |                      | -                   | +                | -                 | <u> </u> |
| /fpu_add_mar14tone/inf_o       |                  |                                               | -                     |               |                                   |                      |                     | -                |                   | <u> </u> |
| /fpu_add_mar14tone/zero_o      |                  |                                               | -                     | -             | -                                 |                      |                     | -                | -                 | <u> </u> |
| /fpu_add_mar14tone/qnan_o      |                  |                                               |                       |               |                                   |                      |                     | -                |                   | <u> </u> |
| /fpu_add_mar14tone/snan_o      |                  |                                               | -                     | -             | -                                 |                      | -                   | -                | -                 |          |
| /fpu_add_mar14tone/post_in     | 0000000          | 84E147E (************************************ | 84E147E <b>\4A</b> 81 | 17C9 49C9778  | 4A817C9 <b>(449E</b>              | 3B98 (0000000)       | 449BB98 <b>XADE</b> | DF70 <b>≬</b> )- | ADEDF70 65A       | 9F66     |
| /fpu_add_mar14tone/sign_rr0    | L                |                                               | -                     | -             | -                                 |                      | -                   | +                | -                 | <u> </u> |
| /fpu_add_mar14tone/sign_rr1    | L                |                                               | -                     |               |                                   | -                    | -                   | -                | -                 |          |
| /fpu_add_mar14tone/sign_rr2    |                  |                                               | -                     | l             | -                                 | -                    | -                   | ļ                |                   | <u> </u> |
| /fpu_add_mar14tone/cmpl_rr0    | L                |                                               |                       | -             |                                   | -                    | -                   | +                | -                 |          |
| /fpu_add_mar14tone/cmpl_rr1    | L                |                                               | -                     | -             | -                                 | -                    | -                   | -                | -                 |          |
| /fpu_add_mar14tone/cmpl_rr2    |                  |                                               |                       | l             | -                                 | -                    | -                   | -                | -                 | ļ –      |
| /fpu_add_mar14tone/ready_o     | L                |                                               | -                     |               |                                   | _                    |                     | _                |                   |          |
| /fpu_add_mar14tone/exp_rr0     | 00 (7F           |                                               | (43                   | -             | <u>X</u> 00                       | -                    | (7E                 | -                | (7C               |          |
| /fpu_add_mar14tone/exp_rr1     | 00               | (7F                                           | 43                    | -             | χoo                               | -                    | 【7E                 | -                | (7C               |          |
| /fpu_add_mar14tone/exp_rr2     | 00               | (7F                                           | -                     | 43            |                                   | 00                   |                     | (7E              | -                 | 7C       |
| /fpu_add_mar14tone/mant_rr0    | 000000 00000     | 006                                           | (09392EF              | -             | X0000000                          | -                    | -                   | -                | (0000006          | -        |
| /fpu_add_mar14tone/mant_rr1    |                  | 0000006                                       | <b>(</b> 0939         | ł             | (0000)                            | ł                    | -                   | -                | χοοος             | 006      |
| /fpu_add_mar14tone/mant_rr2    | (0000x0 )0000    | 000 (0000002 (000                             | 0006 (0000002         | 09392EF       |                                   | 0000000              |                     | (000001 )        | 0000001           | 0000002  |
|                                | 0                | 100                                           |                       | 200           |                                   | 300                  |                     | 400              |                   |          |


[Waveform figure: simulation of /fpu\_add\_mar14tone. Signals shown, in order: clk\_i, movrr, opa\_i, opb\_i, fpu\_op\_i, rmode\_i, output\_o, post\_out, ine\_o, overflow\_o, underflow\_o, inf\_o, zero\_o, qnan\_o, snan\_o, post\_in, sign\_rr0 to sign\_rr2, cmpl\_rr0 to cmpl\_rr2, ready\_o, exp\_rr0 to exp\_rr2, mant\_rr0 to mant\_rr2. Legible bus values include output\_o = 501502F9 and 219392EF; exp\_rr = 7F, 43, 00; mant\_rr = 0000002, 0000006, 09392EF, 0000000. The remaining values and the time axis are not legible in this copy.]



The waveform on the next page is from the test bench used to test the FPU adder with the synthetic data. The MOVRR signal is driven high periodically, one clock cycle before the addition output appears. Operands read from the text file are applied periodically to the inputs, and the outputs from the text file are read on the rising and falling edges of the temp\_mrr signal. This signal is generated so that the outputs of the module under test (MUT) and the outputs from the text file line up, i.e., they correspond to the same operands.

The 1<sup>st</sup> signal is the clock input. The 2<sup>nd</sup> and 3<sup>rd</sup> signals are the operands A and B, labeled opa\_i and opb\_i. The 4<sup>th</sup> signal is the opcode (000 for addition, 001 for subtraction), and the 5<sup>th</sup> signal is the rounding mode. The 6<sup>th</sup> signal is the MOVRR input, and the 7<sup>th</sup> signal is the input to the postnormalization stage. The 9<sup>th</sup> signal is the output of the FPU, which is either the result or the normalized residual value depending on the MOVRR signal. The 10<sup>th</sup> and 11<sup>th</sup> signals are the output and the residual register values read from the test data file. The 26<sup>th</sup>, 27<sup>th</sup> and 28<sup>th</sup> signals are the outputs of the MUT and the output values read from the test data file.
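The expected result and residual values in the test data can be cross-checked in software. The following is a minimal Python sketch, not the generator actually used for the test data file. It assumes round-to-nearest-even (taken here to be what rmode\_i = 0 selects) and assumes the residual is the exact sum minus the rounded single precision result. The function names are hypothetical. The operand and expected values in the usage note are read from the test bench waveform (opa\_i, opb\_i, result\_in, rr\_in).

```python
import struct
from fractions import Fraction


def bits_to_float(word):
    """Interpret a 32-bit integer as an IEEE 754 single precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]


def float_to_bits(value):
    """Round a Python float to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def add_with_residual(opa, opb):
    """Return (result, residual) bit patterns for opa + opb.

    Assumption: round to nearest even, no overflow, finite operands.
    """
    a = bits_to_float(opa)
    b = bits_to_float(opb)
    # The sum is rounded to double and then to single. For the addition of
    # two single precision operands this gives the same value as rounding
    # the exact sum directly to single precision.
    result = float_to_bits(a + b)
    # Exact rational arithmetic gives the rounding error with no further
    # rounding; under round to nearest this error fits in single precision.
    exact = Fraction(a) + Fraction(b)
    residual = exact - Fraction(bits_to_float(result))
    return result, float_to_bits(float(residual))


if __name__ == "__main__":
    result, residual = add_with_residual(0x4171999A, 0x3FC147AE)
    print(f"{result:08X} {residual:08X}")  # prints "4184E148 B4800000"
```

For the first vector the exact sum has the aligned significand 84E147E, which is the value visible on post\_in. Rounding it to 24 bits gives 4184E148 and leaves a residual of -2<sup>-22</sup>, encoded as B4800000. For the vector 501502F9 + 219392EF the smaller operand is lost entirely in the rounded sum, so the sketch returns it unchanged as the residual.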

[Waveform figure: test bench simulation of /fpu\_add\_test\_vhd. Signals shown, in order: opa\_i, opb\_i, fpu\_op\_i, rmode\_i, movrr, post\_in, ready\_o, output\_o, result\_in, rr\_in, temp\_mrr, sign\_rr0 to sign\_rr2, cmpl\_rr0 to cmpl\_rr2, exp\_rr0 to exp\_rr2, mant\_rr0 to mant\_rr2. Legible values: opa\_i = 4171999A, C171999A, 4171999A, C171999A, 501502F9, D01502F9, 501502F9, D01502F9; opb\_i = 3FC147AE, BFC147AE, 219392EF, A19392EF; fpu\_op\_i = 0; rmode\_i = 0; post\_in includes 84E147E, 6CB8522, 4A817C9 and 4A817C7; result\_in = 4184E148, C15970A4, 415970A4, C184E148, 501502F9, D01502F9, 501502F9, D01502F9; rr\_in = B4800000, 34800000, 219392EF, A19392EF; exp\_rr = 00, 7F, 43; mant\_rr = 0000006, 0000002, 09392EF.]
| /fpu_add_test_vhd/cnt       | 0 1 2 3 4          | 5 (0 (1 (2 (3  | 4 )5 )0 (1 )2 (          | 3 \4 \5 \0 \1 2      | : <u>(3 (4 )5 (0</u> | 1 (2 (3 (4 )5             | 0 (1 (2 (3 (4   | 5 (0 (1 (2 (3     | 4 (5 (0 (1 (2       | <u>)</u> 3 (4 |
| /fpu_add_test_vhd/err_op    |                    |                |                          |                      |                      |                           |                 |                   | -                   | <u> </u>      |
| /fpu_add_test_vhd/err_rr    |                    |                |                          |                      |                      |                           |                 |                   |                     | <u> </u>      |
| /fpu_add_test_vhd/err       |                    |                |                          |                      |                      |                           |                 |                   |                     | <u> </u>      |
| /fpu_add_test_vhd/ine_o     |                    |                |                          |                      |                      |                           |                 |                   |                     |               |
| fpu_add_test_vhd/overflow_o |                    |                |                          |                      |                      |                           |                 |                   |                     | <u> </u>      |
| u_add_test_vhd/underflow_o  |                    |                |                          |                      |                      |                           |                 | <u> </u>          |                     |               |
| /fpu_add_test_vhd/inf_o     |                    |                |                          |                      |                      |                           |                 | <u> </u>          |                     |               |
| /fpu_add_test_vhd/zero_o    |                    |                |                          |                      |                      |                           |                 |                   |                     | <u> </u>      |
| /fpu_add_test_vhd/qnan_o    |                    |                |                          |                      |                      |                           |                 |                   |                     | <u> </u>      |
| /fpu_add_test_vhd/snan_o    |                    |                |                          |                      |                      |                           |                 |                   |                     |               |





[Waveform figure, test bench fpu\_add\_test\_vhd. Signals, top to bottom: clk\_i, opa\_i, opb\_i, fpu\_op\_i, rmode\_i, movrr, post\_in, ready\_o, output\_o, result\_in, rr\_in, temp\_mrr, sign\_rr0 to sign\_rr2, cmpl\_rr0 to cmpl\_rr2, exp\_rr0 to exp\_rr2, mant\_rr0 to mant\_rr2, cnt, err\_op, err\_rr, err, ine\_o, overflow\_o, underflow\_o, inf\_o, zero\_o, qnan\_o, snan\_o. Only the values legible in the extracted figure are listed below.]

| Signal                    | Legible values (hex)   |
|---------------------------|------------------------|
| post_in                   | 7B63376                |
| output_o                  | BFF6C66F               |
| rr_in                     | 33000000, 33800000     |
| exp_rr0, exp_rr1, exp_rr2 | 7D                     |
| mant_rr0, mant_rr1        | 0000003                |
| mant_rr2                  | 0000001                |
| cnt                       | 3, 4, 5, 0, 1, 2, 3, 4 |

The simulation reports for the Native-pair multiplication are presented on the next two pages. The next page shows the post-route simulation in which 8 pairs of operands were applied in consecutive clock cycles. The 1<sup>st</sup> signal is the clock input; the 2<sup>nd</sup> signal is the MOVRR input; the 3<sup>rd</sup> and 4<sup>th</sup> signals are the operands A and B, shown as opa\_i and opb\_i; the 5<sup>th</sup> signal is the opcode (010 for multiplication); the 6<sup>th</sup> signal is the rounding mode; the 7<sup>th</sup> signal is the output of the FPU, which is either the result or the normalized residual value depending on the MOVRR signal; the 8<sup>th</sup> and 9<sup>th</sup> signals are the sign and exponent outputs of the residual register. The 10<sup>th</sup> signal is the complement flag output, used to check the proper functioning of the residual register hardware. The 11<sup>th</sup> signal is the residual register mantissa output. The 12<sup>th</sup> signal is the ready signal, which indicates a valid output of the FPU multiplier. The 13<sup>th</sup> signal is post\_in, used to check that the right residual value goes into the postnormalization unit when MOVRR goes high. The 14<sup>th</sup> to 20<sup>th</sup> signals give the inexact, overflow and other exception outputs. The FPU multiplication takes 5 clock cycles: one clock cycle each for prenormalization and multiplication, two clock cycles for postnormalization and one clock cycle for formatting the output.
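
The cycle budget described above can be tallied with a short sketch. This is only an illustrative model, not part of the VHDL design: the names `PIPELINE_STAGES`, `total_latency` and `output_cycle` are hypothetical, and the sketch assumes that clock cycles are numbered from 1 and that a new operand pair can enter the multiplier on every cycle.

```python
# Stage latencies in clock cycles, as listed in the text.
# The names used here are illustrative and do not come from the VHDL source.
PIPELINE_STAGES = [
    ("prenormalization", 1),
    ("multiplication", 1),
    ("postnormalization", 2),
    ("output formatting", 1),
]


def total_latency(stages=PIPELINE_STAGES) -> int:
    """Sum the per-stage cycle counts to get the multiplier latency."""
    return sum(cycles for _, cycles in stages)


def output_cycle(input_cycle: int) -> int:
    """Cycle in which the product appears for operands applied in input_cycle.

    Assumes cycles are numbered from 1 and the pipeline accepts a new
    operand pair every cycle (an assumption of this sketch).
    """
    return input_cycle + total_latency()


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"operands applied in cycle {n} -> product in cycle {output_cycle(n)}")
```

Under these assumptions, operands applied in the 1<sup>st</sup> clock cycle produce their result in the 6<sup>th</sup>, and each later operand pair follows one cycle behind.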

In the waveform shown on the next page, signal 3 is operand A, opa\_i; it is 501502F9 in the 1<sup>st</sup> clock cycle and 4171999A in the 2<sup>nd</sup>, and this pattern repeats periodically. The 4<sup>th</sup> signal is operand B, opb\_i, which takes the value 41A77700 in the 1<sup>st</sup> clock cycle and 40000011 in the 2<sup>nd</sup>, and thereafter alternates between 41A77700 and 40000011. This has been done in order to show the residual register value resulting from multiplication of these operands. The outputs are obtained from the 6<sup>th</sup> clock cycle onwards. Consider the inputs in the 2<sup>nd</sup> clock cycle, opa\_i = 4171999A and opb\_i = 40000011; their product is shown by the 7<sup>th</sup> signal, output\_o = 41F199BA, obtained in the 7<sup>th</sup> clock cycle at the rising edge of the clock. The outputs for the other inputs follow in the subsequent clock cycles. The MOVRR signal goes high at the end of the 7<sup>th</sup> clock cycle; at this time post\_in becomes the residual register mantissa value obtained in the previous clock cycle. The normalized residual value to be stored in the architectural register, C4AF40FF, is obtained in the 10<sup>th</sup> clock cycle at the rising edge of the clock. This is the residual register value for the inputs opa\_i = 501502F9 and opb\_i = 41A77700. Similarly, MOVRR again goes high in the 12<sup>th</sup> clock cycle to give the normalized residual register value 330B333A as output in the 15<sup>th</sup> clock cycle, which is the residual value for opa\_i = 4171999A and opb\_i = 40000011. The output for opa\_i = 501502F9 and opb\_i = 41A77700 can be observed in the 8<sup>th</sup> clock cycle, with the input being given in the 3<sup>rd</sup> clock cycle.
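
The products quoted above can be cross-checked in software. The following is a minimal sketch, not the thesis test bench: the helper names `bits_to_f32`, `f32_to_bits` and `mul_with_residual` are hypothetical, and round-to-nearest-even is assumed for the rounding mode. It relies on the fact that the product of two 24-bit significands has at most 48 bits and is therefore exact in a 64-bit double, so the rounding error of the single precision product can be obtained by subtraction.

```python
import struct


def bits_to_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE single precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def f32_to_bits(x: float) -> int:
    """Round x to single precision (nearest-even) and return its bit pattern."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def mul_with_residual(a_bits: int, b_bits: int):
    """Return (rounded product bits, exact rounding error) for a * b.

    The double precision product is exact because two 24-bit significands
    give at most a 48-bit result, which fits in the 53-bit double significand.
    """
    a = bits_to_f32(a_bits)
    b = bits_to_f32(b_bits)
    exact = a * b                                # exact in double precision
    rounded_bits = f32_to_bits(exact)            # single precision result
    residual = exact - bits_to_f32(rounded_bits)  # exact error term
    return rounded_bits, residual


if __name__ == "__main__":
    for a_bits, b_bits in ((0x501502F9, 0x41A77700), (0x4171999A, 0x40000011)):
        prod_bits, residual = mul_with_residual(a_bits, b_bits)
        print(f"{a_bits:08X} * {b_bits:08X} -> {prod_bits:08X}, residual {residual!r}")
```

For the two operand pairs in the waveform this sketch gives the products 5242F46A and 41F199BA, which agree with the output\_o values. The exact rounding errors it computes are -6048.5 for the first pair and B333A (hex) scaled by 2<sup>-42</sup> for the second; the latter significand coincides with the residual register mantissa 00B333A seen in the waveform. The sketch does not model the hardware's residual register encoding or its postnormalization, so it is not expected to reproduce the normalized register values bit for bit.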

The second waveform was run to check the proper functioning of the FPU multiplier unit.

[Waveform figure, test bench fpumulttest. Signals, top to bottom: clk\_i, movrr, opa\_i, opb\_i, fpu\_op\_i, rmode\_i, output\_o, sign\_rr\_out, exp\_rr\_out, cmpl\_out, mant\_rr\_out, ready\_o, post\_in, ine\_o, overflow\_o, underflow\_o, inf\_o, zero\_o, qnan\_o, snan\_o. Only the values legible in the extracted figure are listed below, in order of appearance.]

| Signal      | Legible values (hex)                                                                     |
|-------------|------------------------------------------------------------------------------------------|
| opa_i       | 40400000, 40800000, (illegible), 41000000, 41100000, 41200000, 41300000, 41400000, 41500000, 41600000 |
| opb_i       | (illegible)                                                                              |
| output_o    | 41400000, 41000000, 41700000, 41400000, 41A80000, 41400000, 41900000, 41F00000, (illegible) |
| exp_rr_out  | 6B, 6A, (illegible), 6B                                                                  |
| mant_rr_out | 0000000                                                                                  |
| post_in     | (illegible)                                                                              |

[Waveform figure, test bench fpumult\_testexamples. Signals, top to bottom: clk\_i, movrr, opa\_i, opb\_i, fpu\_op\_i, rmode\_i, output\_o, sign\_rr\_out, exp\_rr\_out, cmpl\_out, mant\_rr\_out, ready\_o, post\_in, ine\_o, overflow\_o, underflow\_o, inf\_o, zero\_o, qnan\_o, snan\_o. Only the values legible in the extracted figure are listed below, in order of appearance.]

| Signal      | Legible values (hex)                                                                     |
|-------------|------------------------------------------------------------------------------------------|
| opa_i       | alternates between 501502F9 and 4171999A                                                 |
| opb_i       | alternates between 41A77700 and 40000011                                                 |
| fpu_op_i    | 0, then 2                                                                                |
| rmode_i     | 0                                                                                        |
| output_o    | 00000000, 51E17A35, 41F199BA, 5242F46A, 41F199BA, 5242F46A, 41F199BA, C4AF40FF, 42633374, 5242F46A, 41F199BA, 5242F46A, 330B333A, 51AF469A, 41F199BA, 5242F46A |
| exp_rr_out  | 00, 6B, 8C, 6B, 8C, 6B, 71, (illegible), 8C, 6B, 8C, 4E, (illegible), 6B, 8C             |
| mant_rr_out | 0000000, 00B333A, 02F40FF, 00B333A, 02F40FF, 00B333A, 0000000, (illegible), 02F40FF, 00B333A, 02F40FF, 0000000, 00BF000, 00B333A, 02F40FF |
| post_in     | (illegible)                                                                              |
|                                   |                      |                            |                         |                        |                             |                           |                                  |                        |             |       |
|                                   |                      |                            |                         |                        |                             |                           |                                  |                        |             |       |
|                                   |                      |                            |                         |                        |                             |                           |                                  |                        |             |       |

0 1000000 2000000 3000000 400000



## Appendix B

## **High-level Schematics**



Figure 18. High-level schematic of the FPU adder



Figure 19. High-level schematic of the Prenormalization unit used in Floating-point addition



Figure 20. High-level schematic of Addition unit used in Floating-point addition



Figure 21. High-level schematic of Postnormalization Unit used in Floating-point addition



Figure 22. High-level schematic of Residual register used in prenormalization and postnormalization



Figure 23. High-level schematic of FPU multiplier



Figure 24. High-level schematic of Prenormalization unit for Multiplier



Figure 25. High-level schematic of Multiplier unit



Figure 26. High-level schematic of Postnormalization for Multiplier

# **VHDL Source Code**

### **FPU Adder**

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_misc.all;
use ieee.std_logic_ARITH.all;
library work;
use work.fpupack.all;

entity fpu_add is
  port (
    clk_i, movrr : in std_logic;
    -- Input Operands A & B
    opa_i        : in std_logic_vector(FP_WIDTH-1 downto 0); -- Default: FP_WIDTH=32
    opb_i        : in std_logic_vector(FP_WIDTH-1 downto 0);
    -- fpu operations (fpu_op_i):
    -- ==
    -- 000 = add,
    -- 001 = subtract,
    fpu_op_i     : in std_logic_vector(2 downto 0);
    -- Rounding Mode:
    rmode_i      : in std_logic_vector(1 downto 0);
    -- Output port
    output_o, post_out : out std_logic_vector(FP_WIDTH-1 downto 0);
    -- Exceptions
    ine_o        : out std_logic; -- inexact
    overflow_o   : out std_logic; -- overflow
    underflow_o  : out std_logic; -- underflow
    inf_o        : out std_logic; -- infinity
    zero_o       : out std_logic; -- zero
    qnan_o       : out std_logic; -- quiet Not-a-Number
    snan_o       : out std_logic; -- signaling Not-a-Number
    post_in      : out std_logic_vector(27 downto 0);
    -- residuals
    sign_rr0, sign_rr1, sign_rr2, cmpl_rr0, cmpl_rr1, cmpl_rr2, ready_o : out std_logic;
    exp_rr0, exp_rr1, exp_rr2    : out std_logic_vector(EXP_WIDTH-1 downto 0);
    mant_rr0, mant_rr1, mant_rr2 : out std_logic_vector(FRAC_WIDTH + 1 downto 0)
  );
end fpu_add;

architecture rtl of fpu_add is

  -- Input/output registers
  signal s_opa_i, s_opb_i : std_logic_vector(FP_WIDTH-1 downto 0);
  signal s_fpu_op_i : std_logic_vector(2 downto 0);
  signal s_rmode_i  : std_logic_vector(1 downto 0);
  signal s_output_o : std_logic_vector(FP_WIDTH-1 downto 0);
  signal s_ine_o, s_overflow_o, s_underflow_o, s_inf_o, s_zero_o, s_qnan_o, s_snan_o : std_logic;

  signal s_output1 : std_logic_vector(FP_WIDTH-1 downto 0);

  -- ***Add/Subtract units signals***
  signal s_mant_rr2         : std_logic_vector(FRAC_WIDTH + 1 downto 0);
  signal post_norm_fract_in : std_logic_vector(FRAC_WIDTH + 4 downto 0);
  signal post_norm_exp_in   : std_logic_vector(EXP_WIDTH-1 downto 0);
  signal post_norm_sign_in  : std_logic;

  ------- pipelining signals ------
  signal fpu_op_addsub               : std_logic;
  signal rmode_pretoaddsub           : std_logic_vector(1 downto 0);
  signal prenorm_addsub_fracta_28_o  : std_logic_vector(FRAC_WIDTH+4 downto 0);
  signal prenorm_addsub_fractb_28_o  : std_logic_vector(FRAC_WIDTH+4 downto 0);
  signal prenorm_addsub_exp          : std_logic_vector(EXP_WIDTH-1 downto 0);
  signal test_exp_gr8r_24_addin      : std_logic;
  signal s_sign_rrpretoadd           : std_logic;
  signal s_cmpl_rrpretoadd           : std_logic;
  signal s_exp_rrpretoadd            : std_logic_vector(EXP_WIDTH-1 downto 0);
  signal s_mant_rrpretoadd           : std_logic_vector(FRAC_WIDTH + 1 downto 0);
  signal addsub_fract_o              : std_logic_vector(FRAC_WIDTH+4 downto 0);
  signal addsub_sign_o               : std_logic;
  signal rmode_addsubpost            : std_logic_vector(1 downto 0);
  signal exp_o_addsubpost            : std_logic_vector(EXP_WIDTH-1 downto 0);
  signal test_exp_gr8r_24_addsubpost : std_logic;
  signal postnorm_addsub_output_o    : std_logic_vector(31 downto 0);
  signal postnorm_addsub_ine_o       : std_logic;
  signal fpu_op_addsubpost           : std_logic;
  signal s_sign_rraddtopost, s_cmpl_rraddtopost : std_logic;
```

```
  signal s_exp_rraddtopost  : std_logic_vector(EXP_WIDTH-1 downto 0);
  signal s_mant_rraddtopost : std_logic_vector(FRAC_WIDTH + 1 downto 0);
  signal s_sign_rr2, s_cmpl_rr2 : std_logic;
  signal s_exp_rr2 : std_logic_vector(EXP_WIDTH-1 downto 0);

  component prenorm_new is
    port(
      clk_i                   : in std_logic;
      opa_i                   : in std_logic_vector(FP_WIDTH-1 downto 0);
      opb_i                   : in std_logic_vector(FP_WIDTH-1 downto 0);
      fpu_op_pretoaddsub_in   : in std_logic;
      rmode_pretoaddsub_in    : in std_logic_vector(1 downto 0);
      fpu_op_pretoaddsub_out  : out std_logic;
      rmode_pretoaddsub_out   : out std_logic_vector(1 downto 0);
      -- carry(1) & hidden(1) & fraction(23) & guard(1) & round(1) & sticky(1)
      fracta_28_o             : out std_logic_vector(FRAC_WIDTH+4 downto 0);
      fractb_28_o             : out std_logic_vector(FRAC_WIDTH+4 downto 0);
      exp_o_pretoaddsub_out   : out std_logic_vector(EXP_WIDTH-1 downto 0);
      test_exp_gr8r_24_preout : out std_logic;
      sign_o_rr0, cmpl_out0   : out std_logic;
      exp_o_rr0               : out std_logic_vector(EXP_WIDTH-1 downto 0);
      mant_o_rr0              : out std_logic_vector(FRAC_WIDTH + 1 downto 0);
      expdiff_out             : out std_logic_vector(EXP_WIDTH-1 downto 0);
      infa, infb, signa, signb, nan_a, nan_b, nan_in, nan_op : out std_logic
    );
  end component;

  component addsub_28 is
    port(
      clk_i                   : in std_logic;
      -- carry(1) & hidden(1) & fraction(23) & guard(1) & round(1) & sticky(1)
      fracta_i                : in std_logic_vector(FRAC_WIDTH+4 downto 0);
      fractb_i                : in std_logic_vector(FRAC_WIDTH+4 downto 0);
      fpu_op_addsub_in        : in std_logic;
      rmode_addsub_in         : in std_logic_vector(1 downto 0);
      exp_o_addsub_in         : in std_logic_vector(EXP_WIDTH-1 downto 0);
      test_exp_gr8r_24_addin  : in std_logic;
      exp_i_rr1               : in std_logic_vector(EXP_WIDTH-1 downto 0);
      mant_i_rr1              : in std_logic_vector(FRAC_WIDTH + 1 downto 0);
      sign_i_rr1, cmpl_in1    : in std_logic;
      expdiff_in              : in std_logic_vector(EXP_WIDTH-1 downto 0);
      infa, infb, signa, signb, nan_a, nan_b, nan_in, nan_op : in std_logic;
      fract_o                 : out std_logic_vector(FRAC_WIDTH+4 downto 0);
      sign_o                  : out std_logic;
      rmode_addsub_out        : out std_logic_vector(1 downto 0);
      exp_o_addsub_out        : out std_logic_vector(EXP_WIDTH-1 downto 0);
      test_exp_gr8r_24_addout : out std_logic;
      fpu_op_addsub_out       : out std_logic;
      sign_o_rr1, cmpl_out1   : out std_logic;
      exp_o_rr1               : out std_logic_vector(EXP_WIDTH-1 downto 0);
      mant_o_rr1              : out std_logic_vector(FRAC_WIDTH + 1 downto 0);
      expdiff_out             : out std_logic_vector(EXP_WIDTH-1 downto 0);
      infa_postin, infb_postin, signa_postin, signb_postin : out std_logic;
      nan_a_postin, nan_b_postin, nan_in_postin, nan_op_postin : out std_logic
    );
  end component;

  component postnorm_june20 is
    port(
      clk_i                 : in std_logic;
      -- carry(1) & hidden(1) & fraction(23) & guard(1) & round(1) & sticky(1)
      fract_28_i            : in std_logic_vector(FRAC_WIDTH+4 downto 0);
      exp_i                 : in std_logic_vector(EXP_WIDTH-1 downto 0);
      sign_i                : in std_logic;
      postnorm_exprr_set_in : in std_logic;
      exp_i_rr2             : in std_logic_vector(EXP_WIDTH-1 downto 0);
      mant_i_rr2            : in std_logic_vector(FRAC_WIDTH + 1 downto 0);
      sign_i_rr2, cmpl_in2  : in std_logic;
      fpu_op_i              : in std_logic;
      rmode_i               : in std_logic_vector(1 downto 0);
      expdiff_postin        : in std_logic_vector(EXP_WIDTH-1 downto 0);
      infa_postin, infb_postin, signa_postin, signb_postin : in std_logic;
      nan_a_postin, nan_b_postin, nan_in_postin, nan_op_postin : in std_logic;
      output_o              : out std_logic_vector(FP_WIDTH-1 downto 0);
      infa_postout, infb_postout, signa_postout, signb_postout : out std_logic;
      exp_o_rr2             : out std_logic_vector(EXP_WIDTH-1 downto 0);
      mant_o_rr2            : out std_logic_vector(FRAC_WIDTH + 1 downto 0);
      ine_o, sign_o_rr2, cmpl_out2 : out std_logic
    );
  end component;

  signal ready : std_logic;
  signal cnt : integer := 0;
  signal expdiffpre_add, expdiffadd_post : std_logic_vector(EXP_WIDTH-1 downto 0) := (others => '0');
  signal s_infa, s_infb, s_signa, s_signb, s_nan_a, s_nan_b, s_nan_in, s_nan_op : std_logic;
  signal s_infa_postin, s_infb_postin, s_signa_postin, s_signb_postin,
         s_nan_a_postin, s_nan_b_postin, s_nan_in_postin, s_nan_op_postin : std_logic;
  signal s_infa_postout, s_infb_postout, s_signa_postout, s_signb_postout : std_logic;
```

```
begin

  i_prenorm_addsub : prenorm_new
    port map (
      clk_i => clk_i,
      opa_i => s_opa_i,
      opb_i => s_opb_i,
      fpu_op_pretoaddsub_in => s_fpu_op_i(0),
      rmode_pretoaddsub_in => s_rmode_i,
      fpu_op_pretoaddsub_out => fpu_op_addsub,
      rmode_pretoaddsub_out => rmode_pretoaddsub,
      fracta_28_o => prenorm_addsub_fracta_28_o,
      fractb_28_o => prenorm_addsub_fractb_28_o,
      exp_o_pretoaddsub_out => prenorm_addsub_exp,
      test_exp_gr8r_24_preout => test_exp_gr8r_24_addin,
      sign_o_rr0 => s_sign_rrpretoadd,
      cmpl_out0 => s_cmpl_rrpretoadd,
      exp_o_rr0 => s_exp_rrpretoadd,
      mant_o_rr0 => s_mant_rrpretoadd,
      expdiff_out => expdiffpre_add,
      infa => s_infa, infb => s_infb,
      signa => s_signa, signb => s_signb,
      nan_a => s_nan_a, nan_b => s_nan_b,
      nan_in => s_nan_in, nan_op => s_nan_op);

  i_addsub : addsub_28
    port map (
      clk_i => clk_i,
      fracta_i => prenorm_addsub_fracta_28_o,
      fractb_i => prenorm_addsub_fractb_28_o,
      fpu_op_addsub_in => fpu_op_addsub,
      rmode_addsub_in => rmode_pretoaddsub,
      exp_o_addsub_in => prenorm_addsub_exp,
      test_exp_gr8r_24_addin => test_exp_gr8r_24_addin,
      exp_i_rr1 => s_exp_rrpretoadd,
      mant_i_rr1 => s_mant_rrpretoadd,
      sign_i_rr1 => s_sign_rrpretoadd,
      cmpl_in1 => s_cmpl_rrpretoadd,
      expdiff_in => expdiffpre_add,
      infa => s_infa, infb => s_infb,
      signa => s_signa, signb => s_signb,
      nan_a => s_nan_a, nan_b => s_nan_b,
      nan_in => s_nan_in, nan_op => s_nan_op,
      fract_o => addsub_fract_o,
      sign_o => addsub_sign_o,
      rmode_addsub_out => rmode_addsubpost,
      exp_o_addsub_out => exp_o_addsubpost,
      test_exp_gr8r_24_addout => test_exp_gr8r_24_addsubpost,
      fpu_op_addsub_out => fpu_op_addsubpost,
      sign_o_rr1 => s_sign_rraddtopost,
      cmpl_out1 => s_cmpl_rraddtopost,
      exp_o_rr1 => s_exp_rraddtopost,
      mant_o_rr1 => s_mant_rraddtopost,
      expdiff_out => expdiffadd_post,
      infa_postin => s_infa_postin, infb_postin => s_infb_postin,
      signa_postin => s_signa_postin, signb_postin => s_signb_postin,
      nan_a_postin => s_nan_a_postin, nan_b_postin => s_nan_b_postin,
      nan_in_postin => s_nan_in_postin, nan_op_postin => s_nan_op_postin);

  i_postnorm_addsub : postnorm_june20
    port map (
      clk_i => clk_i,
      fract_28_i => post_norm_fract_in,
      exp_i => post_norm_exp_in,
      sign_i => post_norm_sign_in,
      postnorm_exprr_set_in => test_exp_gr8r_24_addsubpost,
      exp_i_rr2 => s_exp_rraddtopost,
      mant_i_rr2 => s_mant_rraddtopost,
      sign_i_rr2 => s_sign_rraddtopost,
      cmpl_in2 => s_cmpl_rraddtopost,
      fpu_op_i => fpu_op_addsubpost,
      rmode_i => rmode_addsubpost,
      expdiff_postin => expdiffadd_post,
      infa_postin => s_infa_postin, infb_postin => s_infb_postin,
      signa_postin => s_signa_postin, signb_postin => s_signb_postin,
      nan_a_postin => s_nan_a_postin, nan_b_postin => s_nan_b_postin,
      nan_in_postin => s_nan_in_postin, nan_op_postin => s_nan_op_postin,
      output_o => postnorm_addsub_output_o,
      infa_postout => s_infa_postout, infb_postout => s_infb_postout,
      signa_postout => s_signa_postout, signb_postout => s_signb_postout,
      exp_o_rr2 => s_exp_rr2,
      mant_o_rr2 => s_mant_rr2,
      ine_o => postnorm_addsub_ine_o,
      sign_o_rr2 => s_sign_rr2,
      cmpl_out2 => s_cmpl_rr2);
```

```
  -- Multiplexer for supplying either the add/sub output or the residual
  -- register value to the post normalization unit
  post_norm_fract_in <= (s_mant_rr2 & "000") when (movrr='1') else addsub_fract_o;
  post_norm_exp_in   <= s_exp_rr2 when (movrr='1') else exp_o_addsubpost;
  post_norm_sign_in  <= s_sign_rr2 when (movrr='1') else addsub_sign_o;

  post_in  <= post_norm_fract_in;
  post_out <= postnorm_addsub_output_o;

  -----------------------------------------------------------------

  -- Input Register
  s_opa_i    <= opa_i;
  s_opb_i    <= opb_i;
  s_fpu_op_i <= fpu_op_i;
  s_rmode_i  <= rmode_i;
```

```
  -- Output Register
  process(clk_i)
  begin
    if falling_edge(clk_i) then
      if (ready = '1') then
        output_o    <= s_output_o;
        ine_o       <= s_ine_o;
        overflow_o  <= s_overflow_o;
        underflow_o <= s_underflow_o;
        inf_o       <= s_inf_o;
        zero_o      <= s_zero_o;
        qnan_o      <= s_qnan_o;
        snan_o      <= s_snan_o;
      end if;
    end if;
  end process;

  -- Residual register outputs of the three pipeline stages
  sign_rr0 <= s_sign_rrpretoadd;
  cmpl_rr0 <= s_cmpl_rrpretoadd;
  exp_rr0  <= s_exp_rrpretoadd;
  mant_rr0 <= s_mant_rrpretoadd;
  sign_rr1 <= s_sign_rraddtopost;
  cmpl_rr1 <= s_cmpl_rraddtopost;
  exp_rr1  <= s_exp_rraddtopost;
  mant_rr1 <= s_mant_rraddtopost;
  mant_rr2 <= s_mant_rr2;
  exp_rr2  <= s_exp_rr2;
  sign_rr2 <= s_sign_rr2;
  cmpl_rr2 <= s_cmpl_rr2;

  -- Output Multiplexer
  process(clk_i)
  begin
    if rising_edge(clk_i) then
      if fpu_op_i="000" or fpu_op_i="001" then
        s_output1 <= postnorm_addsub_output_o;
        s_ine_o   <= postnorm_addsub_ine_o;
      else
        s_output1 <= (others => '0');
        s_ine_o   <= '0';
      end if;
    end if;
  end process;

  -- In round-down: the subtraction of two equal numbers other than zero is always -0
  process(s_output1, s_rmode_i, s_infa_postout, s_infb_postout, s_qnan_o, s_snan_o,
          s_zero_o, s_fpu_op_i, s_signa_postout, s_signb_postout)
  begin
    if s_rmode_i="00" or ((s_infa or s_infb) or s_qnan_o or s_snan_o)='1' then -- round-to-nearest-even
      s_output_o <= s_output1;
    elsif s_rmode_i="01" and s_output1(30 downto 23)="11111111" then
      -- In round-to-zero: the sum of two non-infinity operands is never infinity, even if an overflow occurs
      s_output_o <= s_output1(31) & "1111111011111111111111111111111";
    elsif s_rmode_i="10" and s_output1(31 downto 23)="111111111" then
      -- In round-up: the sum of two non-infinity operands is never negative infinity, even if an overflow occurs
      -- (clamp to the largest-magnitude negative finite value)
      s_output_o <= "11111111011111111111111111111111";
    elsif s_rmode_i="11" then
      -- In round-down: a-a = -0
      if (s_fpu_op_i="000" or s_fpu_op_i="001") and s_zero_o='1' and
         (s_opa_i(31) or (s_fpu_op_i(0) xor s_opb_i(31)))='1' then
        s_output_o <= "1" & s_output1(30 downto 0);
      -- In round-down: the sum of two non-infinity operands is never positive infinity, even if an overflow occurs
      elsif s_output1(31 downto 23)="011111111" then
        s_output_o <= "01111111011111111111111111111111";
      else
        s_output_o <= s_output1;
      end if;
    else
      s_output_o <= s_output1;
    end if;
  end process;

  -- Generate Exceptions
  s_underflow_o <= '1' when s_output1(30 downto 23)="00000000" and s_ine_o='1' else '0';
  s_overflow_o  <= '1' when s_output1(30 downto 23)="11111111" and s_ine_o='1' else '0';
  s_inf_o  <= '1' when s_output1(30 downto 23)="11111111" and (s_qnan_o or s_snan_o)='0' else '0';
  s_zero_o <= '1' when or_reduce(s_output1(30 downto 0))='0' else '0';
  s_qnan_o <= '1' when s_output1(30 downto 0)=QNAN else '0';
  s_snan_o <= '1' when s_opa_i(30 downto 0)=SNAN or s_opb_i(30 downto 0)=SNAN else '0';
```

----Ready signal to indicate start of valid outputs -- process(clk\_i)

```
begin
       if(falling edge(clk i))then
       if(cnt/=2)then
       \operatorname{cnt} \leq \operatorname{cnt} + 1;
       else
       cnt \leq cnt;
       end if;
       if(cnt=2)then
       ready<='1';
       else
       ready<='0';
       end if:
       end if;
       end process;
 ready o<=ready;
 end rtl;
---prenormalization unit---
library ieee ;
use ieee.std logic 1164.all;
use ieee.std logic unsigned.all;
use ieee.std logic misc.all;
use ieee.std logic ARITH.all;
library work;
use work.fpupack.all;
entity prenorm new is
       port(
                      clk i
                                             : in std logic;
                                             : in std logic vector(FP WIDTH-1 downto
                      opa i
0);
                                             : in std logic vector(FP WIDTH-1 downto
                      opb i
0);
                      fpu op pretoaddsub in:
                                                     in std logic;--
                      rmode pretoaddsub in:
                                                     in std logic vector(1 downto 0);--
                      fpu op pretoaddsub out:
                                                     out std logic;--
                      rmode pretoaddsub out:
                                                     out std logic vector(1 downto 0);--
                      fracta 28 o
                                             : out std logic vector(FRAC WIDTH+4
downto 0);
               -- carry(1) & hidden(1) & fraction(23) & guard(1) & round(1) & sticky(1)
                      fractb 28_o
                                             : out std logic vector(FRAC WIDTH+4
downto 0);
```

```
        exp_o_pretoaddsub_out : out std_logic_vector(EXP_WIDTH-1 downto 0);
        test_exp_gr8r_24_preout : out std_logic;
        sign_o_rr0, cmpl_out0 : out std_logic;
        exp_o_rr0 : out std_logic_vector(EXP_WIDTH-1 downto 0);
        mant_o_rr0 : out std_logic_vector(FRAC_WIDTH+1 downto 0);
        expdiff_out : out std_logic_vector(EXP_WIDTH-1 downto 0);
        infa, infb, signa, signb, nan_a, nan_b, nan_in, nan_op : out std_logic
    );
end prenorm_new;

architecture rtl of prenorm_new is

    signal s_exp_o : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_fracta_28_o, s_fractb_28_o : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_expa, s_expb : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_fracta, s_fractb : std_logic_vector(FRAC_WIDTH-1 downto 0);
    signal s_fracta_28, s_fractb_28, s_fract_sm_28 : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_exp_diff, s_exp_sm : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_rzeros : std_logic_vector(5 downto 0);
    signal s_expa_lt_expb : std_logic;
    signal s_expa_eq_expb : std_logic;
    signal s_fracta_1 : std_logic;
    signal s_fractb_1 : std_logic;
    signal s_op_dn : std_logic;
    signal s_opa_dn, s_opb_dn : std_logic;
    signal s_mux_diff : std_logic_vector(1 downto 0);
    signal s_mux_exp, exp_gr8r_24 : std_logic;
    signal s_sticky : std_logic;
    signal s_rr_mant : std_logic_vector(FRAC_WIDTH+1 downto 0);
    signal s_expdiff_int : integer := 0;
    signal s_fract_shr_28 : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal andsig : std_logic_vector(FRAC_WIDTH+1 downto 0);
    signal rr_rev : std_logic_vector(FRAC_WIDTH+1 downto 0);
    signal s_sign_o_rr0, s_cmpl_out0 : std_logic := 'Z';
    signal s_exp_o_rr0 : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_mant_o_rr0 : std_logic_vector(FRAC_WIDTH+1 downto 0);
    signal s_infa, s_infb, s_nan_a, s_nan_b, s_nan_in, s_nan_op : std_logic;

    component residualreg is
        port(
            sign_rr : in std_logic;
            exp_rr : in std_logic_vector(EXP_WIDTH - 1 downto 0);
            cmpl_in : in std_logic;
            mant_rr : in std_logic_vector(FRAC_WIDTH+1 downto 0);
            sign_rr_out : out std_logic;
            exp_rr_out : out std_logic_vector(EXP_WIDTH - 1 downto 0);
            cmpl_out : out std_logic;
            mant_rr_out : out std_logic_vector(FRAC_WIDTH+1 downto 0)
        );
    end component residualreg;

begin

    -- Input Register
    s_expa <= opa_i(30 downto 23);
    s_expb <= opb_i(30 downto 23);
    s_fracta <= opa_i(22 downto 0);
    s_fractb <= opb_i(22 downto 0);

    -- Output Register
    process(clk_i)
    begin
        if falling_edge(clk_i) then
            exp_o_pretoaddsub_out <= s_exp_o;
            fracta_28_o <= s_fracta_28_o;
            fractb_28_o <= s_fractb_28_o;
            fpu_op_pretoaddsub_out <= fpu_op_pretoaddsub_in;
            rmode_pretoaddsub_out <= rmode_pretoaddsub_in;
            test_exp_gr8r_24_preout <= exp_gr8r_24;
            expdiff_out <= s_exp_diff;
            sign_o_rr0 <= s_sign_o_rr0;
            exp_o_rr0 <= s_exp_o_rr0;
            cmpl_out0 <= s_cmpl_out0;
            signa <= opa_i(31);
            signb <= opb_i(31);
            infa <= s_infa;
            infb <= s_infb;
            nan_a <= s_nan_a;
            nan_b <= s_nan_b;
```

```
            nan_in <= s_nan_in;
            nan_op <= s_nan_op;
        end if;
    end process;

    mant_o_rr0 <= s_mant_o_rr0;

    s_expa_eq_expb <= '1' when s_expa = s_expb else '0';
    -- note: despite its name, this flag is '1' when expa is greater than expb
    s_expa_lt_expb <= '1' when s_expa > s_expb else '0';

    -- '1' if fraction is not zero
    s_fracta_1 <= or_reduce(s_fracta);
    s_fractb_1 <= or_reduce(s_fractb);

    -- opa or Opb is denormalized
    s_op_dn <= s_opa_dn or s_opb_dn;
    s_opa_dn <= not or_reduce(s_expa);
    s_opb_dn <= not or_reduce(s_expb);

    -- output the larger exponent
    s_mux_exp <= s_expa_lt_expb;
    process(clk_i)
    begin
        if rising_edge(clk_i) then
            case s_mux_exp is
                when '0' => s_exp_o <= s_expb;
                when '1' => s_exp_o <= s_expa;
                when others => s_exp_o <= "11111111";
            end case;
        end if;
    end process;

    -- convert to an easy to handle floating-point format
    s_fracta_28 <= "01" & s_fracta & "000" when s_opa_dn='0' else "00" & s_fracta & "000";
    s_fractb_28 <= "01" & s_fractb & "000" when s_opb_dn='0' else "00" & s_fractb & "000";

    -- (expa > expb) & (exactly one operand is denormalized)
    s_mux_diff <= s_expa_lt_expb & (s_opa_dn xor s_opb_dn);

    s_exp_diff <= s_expb - s_expa when (s_mux_diff="00") else
                  s_expb - (s_expa+"00000001") when (s_mux_diff="01") else
                  s_expa - s_expb when (s_mux_diff="10") else
                  s_expa - (s_expb+"00000001") when (s_mux_diff="11") else
                  "ZZZZZZZZ";

    s_expdiff_int <= conv_integer(s_exp_diff);

    process(clk_i)
    begin
        if rising_edge(clk_i) then
            andsig <= "0000000000000000000000000";
            if (s_expdiff_int<25) then
                andsig(s_expdiff_int-1 downto 0) <= (others=>'1');
            else
                andsig <= (others=>'1');
            end if;
        end if;
    end process;

    process(clk_i)
    begin
        if (falling_edge(clk_i)) then
            s_rr_mant <= rr_rev;
        end if;
    end process;

    s_fract_sm_28 <= s_fracta_28 when s_expa_lt_expb='0' else s_fractb_28;
    s_exp_sm <= s_expb when s_expa_lt_expb='1' else s_expa;
    s_fract_shr_28 <= shr(s_fract_sm_28, s_exp_diff);
    rr_rev <= s_fract_sm_28(FRAC_WIDTH+4 downto 3) and andsig;

    -- count the zeros from right to check if result is inexact
    s_rzeros <= count_r_zeros(s_fract_sm_28);
    s_sticky <= '1' when s_exp_diff > s_rzeros and or_reduce(s_fract_sm_28)='1' else '0';

    s_fracta_28_o <= s_fracta_28 when s_expa_lt_expb='1' else s_fract_shr_28(27 downto 1) & (s_sticky or s_fract_shr_28(0));
    s_fractb_28_o <= s_fractb_28 when s_expa_lt_expb='0' else s_fract_shr_28(27 downto 1) & (s_sticky or s_fract_shr_28(0));

    rr0: residualreg port map('0', s_exp_sm, '0', s_rr_mant, s_sign_o_rr0, s_exp_o_rr0, s_cmpl_out0, s_mant_o_rr0);

    exp_gr8r_24 <= '1' when (s_expdiff_int > 23) else '0';

    s_infa <= '1' when opa_i(30 downto 23)="11111111" else '0';
    s_infb <= '1' when opb_i(30 downto 23)="11111111" else '0';
    s_nan_a <= '1' when (s_infa='1' and or_reduce(opa_i(22 downto 0))='1') else '0';
    s_nan_b <= '1' when (s_infb='1' and or_reduce(opb_i(22 downto 0))='1') else '0';
    s_nan_in <= '1' when s_nan_a='1' or s_nan_b='1' else '0';
    s_nan_op <= '1' when (s_infa and s_infb)='1' and (opa_i(31) xor (fpu_op_pretoaddsub_in xor opb_i(31)))='1' else '0'; -- inf-inf=Nan

end rtl;
```
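
The alignment step of the prenormalization unit shifts the significand of the smaller operand right by the exponent difference and folds any discarded `1` bits into the sticky position. The following is a minimal self-checking sketch of that behaviour, assuming `ieee.numeric_std` rather than the `std_logic_unsigned` package used in the listing. The entity `align_sketch_tb`, the functions `trailing_zeros` and `align`, and the constants are hypothetical names introduced only for illustration; they are not part of the thesis design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical self-checking sketch; not part of the thesis design.
entity align_sketch_tb is
end entity align_sketch_tb;

architecture sim of align_sketch_tb is

    -- Number of zero bits below the lowest set bit (28 if the vector is zero).
    function trailing_zeros(v : unsigned(27 downto 0)) return natural is
        variable n : natural := 0;
    begin
        for i in 0 to 27 loop
            exit when v(i) = '1';
            n := n + 1;
        end loop;
        return n;
    end function trailing_zeros;

    -- Right-shift by dist and set bit 0 if any '1' bit was shifted out,
    -- mirroring the "exp_diff > rzeros" sticky test of the listing.
    function align(frac : unsigned(27 downto 0); dist : natural) return unsigned is
        variable shifted : unsigned(27 downto 0);
    begin
        shifted := shift_right(frac, dist);
        if frac /= 0 and dist > trailing_zeros(frac) then
            shifted(0) := '1';
        end if;
        return shifted;
    end function align;

    -- carry = 0, hidden = 1, fraction = 0...01, guard/round/sticky = 000
    constant SMALL  : unsigned(27 downto 0) := "0100000000000000000000001000";
    -- Expected values after a shift by 3 (no bit lost) and by 4 (one bit lost)
    constant EXACT  : unsigned(27 downto 0) := "0000100000000000000000000001";
    constant STICKY : unsigned(27 downto 0) := "0000010000000000000000000001";

begin

    check : process
    begin
        assert align(SMALL, 3) = EXACT
            report "shift by 3 should lose no information" severity failure;
        assert align(SMALL, 4) = STICKY
            report "shift by 4 should set the sticky bit" severity failure;
        report "alignment sketch passed" severity note;
        wait;
    end process check;

end architecture sim;
```

Running the testbench in any VHDL-93 simulator should end with the note message; a failing assertion would indicate that the sticky rule differs from the one sketched here.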

### Adder/subtractor

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.std_logic_misc.all;
use IEEE.std_logic_arith.all;

library work;
use work.fpupack.all;

entity addsub_28 is
    port(
        clk_i : in std_logic;
        fracta_i : in std_logic_vector(FRAC_WIDTH+4 downto 0); -- carry(1) & hidden(1) & fraction(23) & guard(1) & round(1) & sticky(1)
        fractb_i : in std_logic_vector(FRAC_WIDTH+4 downto 0);
        fpu_op_addsub_in : in std_logic;
        rmode_addsub_in : in std_logic_vector(1 downto 0);
        exp_o_addsub_in : in std_logic_vector(EXP_WIDTH-1 downto 0);
        test_exp_gr8r_24_addin : in std_logic;
        exp_i_rr1 : in std_logic_vector(EXP_WIDTH-1 downto 0);
        mant_i_rr1 : in std_logic_vector(FRAC_WIDTH + 1 downto 0);
        sign_i_rr1, cmpl_in1 : in std_logic;
        expdiff_in : in std_logic_vector(EXP_WIDTH-1 downto 0);
        infa, infb, signa, signb, nan_a, nan_b, nan_in, nan_op : in std_logic;
        fract_o : out std_logic_vector(FRAC_WIDTH+4 downto 0);
        sign_o : out std_logic;
        rmode_addsub_out : out std_logic_vector(1 downto 0);
        exp_o_addsub_out : out std_logic_vector(EXP_WIDTH-1 downto 0);
        test_exp_gr8r_24_addout : out std_logic;
        fpu_op_addsub_out : out std_logic;
        sign_o_rr1, cmpl_out1 : out std_logic;
        exp_o_rr1 : out std_logic_vector(EXP_WIDTH-1 downto 0);
        mant_o_rr1 : out std_logic_vector(FRAC_WIDTH + 1 downto 0);
        expdiff_out : out std_logic_vector(EXP_WIDTH-1 downto 0);
        infa_postin, infb_postin, signa_postin, signb_postin : out std_logic;
        nan_a_postin, nan_b_postin, nan_in_postin, nan_op_postin : out std_logic
    );
end addsub_28;

architecture rtl of addsub_28 is

    signal s_fracta_i, s_fractb_i : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_fract_o : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_signa_i, s_signb_i, s_sign_o : std_logic;
    signal s_fpu_op_i : std_logic;
    signal fracta_lt_fractb : std_logic;
    signal s_addop : std_logic;
```

```
begin

    -- Input Register
    s_fracta_i <= fracta_i;
    s_fractb_i <= fractb_i;
    s_signa_i <= signa;
    s_signb_i <= signb;
    s_fpu_op_i <= fpu_op_addsub_in;

    -- Output Register
    process(clk_i)
    begin
        if falling_edge(clk_i) then
            fract_o <= s_fract_o;
            sign_o <= s_sign_o;
            rmode_addsub_out <= rmode_addsub_in;
            exp_o_addsub_out <= exp_o_addsub_in;
            test_exp_gr8r_24_addout <= test_exp_gr8r_24_addin;
            infa_postin <= infa;
            infb_postin <= infb;
            signa_postin <= signa;
            signb_postin <= signb;
            nan_a_postin <= nan_a;
            nan_b_postin <= nan_b;
            nan_in_postin <= nan_in;
            nan_op_postin <= nan_op;
            fpu_op_addsub_out <= fpu_op_addsub_in;
            sign_o_rr1 <= sign_i_rr1;
            exp_o_rr1 <= exp_i_rr1;
            cmpl_out1 <= cmpl_in1;
            mant_o_rr1 <= mant_i_rr1;
            expdiff_out <= expdiff_in;
        end if;
    end process;

    -- note: despite its name, this flag is '1' when fracta is greater than fractb
    fracta_lt_fractb <= '1' when s_fracta_i > s_fractb_i else '0';

    -- check if its a subtraction or an addition operation
    s_addop <= ((s_signa_i xor s_signb_i) and not (s_fpu_op_i)) or ((s_signa_i xnor s_signb_i) and (s_fpu_op_i));

    -- sign of result
    s_sign_o <= '0' when s_fract_o = conv_std_logic_vector(0,28) and (s_signa_i and s_signb_i)='0' else
                ((not s_signa_i) and ((not fracta_lt_fractb) and (s_fpu_op_i xor s_signb_i))) or
                ((s_signa_i) and (fracta_lt_fractb or (s_fpu_op_i xor s_signb_i)));

    -- add/subtract
    process(s_fracta_i, s_fractb_i, s_addop, fracta_lt_fractb)
    begin
        if s_addop='0' then
            s_fract_o <= s_fracta_i + s_fractb_i;
        else
            if fracta_lt_fractb = '1' then
                s_fract_o <= s_fracta_i - s_fractb_i;
            else
                s_fract_o <= s_fractb_i - s_fracta_i;
            end if;
        end if;
    end process;

end rtl;
```

### Postnormalization

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.std_logic_misc.all;
use ieee.std_logic_arith.all;
library work;
use work.fpupack.all;

entity postnorm_june20 is
    port(
        clk_i : in std_logic;
        fract_28_i : in std_logic_vector(FRAC_WIDTH+4 downto 0); -- carry(1) & hidden(1) & fraction(23) & guard(1) & round(1) & sticky(1)
        exp_i : in std_logic_vector(EXP_WIDTH-1 downto 0);
        sign_i : in std_logic;
        postnorm_exprr_set_in : in std_logic;
        exp_i_rr2 : in std_logic_vector(EXP_WIDTH-1 downto 0);
        mant_i_rr2 : in std_logic_vector(FRAC_WIDTH + 1 downto 0);
        sign_i_rr2, cmpl_in2 : in std_logic;
        fpu_op_i : in std_logic;
        rmode_i : in std_logic_vector(1 downto 0);
        expdiff_postin : in std_logic_vector(EXP_WIDTH-1 downto 0);
        infa_postin, infb_postin, signa_postin, signb_postin : in std_logic;
        nan_a_postin, nan_b_postin, nan_in_postin, nan_op_postin : in std_logic;
        output_o : out std_logic_vector(FP_WIDTH-1 downto 0);
        infa_postout, infb_postout : out std_logic;
        signa_postout, signb_postout : out std_logic;
        exp_o_rr2 : out std_logic_vector(EXP_WIDTH-1 downto 0);
        mant_o_rr2 : out std_logic_vector(FRAC_WIDTH + 1 downto 0);
        ine_o, sign_o_rr2, cmpl_out2 : out std_logic
    );
end postnorm_june20;

architecture rtl of postnorm_june20 is

    signal s_fract_28_i : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_exp_i : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_sign_i, signa, signb : std_logic;
    signal s_fpu_op_i : std_logic;
    signal s_rmode_i : std_logic_vector(1 downto 0);
    signal s_output_o : std_logic_vector(FP_WIDTH-1 downto 0);
    signal s_ine_o : std_logic;
    signal s_overflow : std_logic;

    signal s_shr1, s_shr2, s_shl, s_shr1e : std_logic;

    signal s_expr1_9, s_expr2_9 : std_logic_vector(EXP_WIDTH downto 0);
    signal s_exp_shr1, s_exp_shr2, s_exp_shl : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_fract_shr1, s_fract_shr2 : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_fract_shl : std_logic_vector(FRAC_WIDTH + 4 downto 0);
    signal s_zeros : std_logic_vector(5 downto 0);
    signal shl_pos : std_logic_vector(5 downto 0);

    signal s_fract_1, s_fract_2 : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_exp_1, s_exp_2 : std_logic_vector(EXP_WIDTH-1 downto 0);

    signal s_fract_rnd : std_logic_vector(FRAC_WIDTH+4 downto 0);
    signal s_roundup : std_logic;
    signal s_sticky : std_logic;

    signal s_zero_fract : std_logic;
    signal s_lost : std_logic;
    signal s_infa, s_infb : std_logic;
    signal s_nan_in, s_nan_op, s_nan_a, s_nan_b, s_nan_sign : std_logic;
    signal cmpl_2_br, cmpl_2_ar, rr_sign2_br, rr_sign2_ar : std_logic;
    signal s_exp_rr2_br, s_exp_rr2_ar, s_exptemprr_ar : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_mant_rr2_final : std_logic_vector(FRAC_WIDTH+2 downto 0);
    signal mantrr2_cmpl : std_logic_vector(FRAC_WIDTH+1 downto 0) := (others=>'0');
    signal s_mant_rr2_br : std_logic_vector(FRAC_WIDTH+2 downto 0);
    signal s_mant_rr2_ar : std_logic_vector(FRAC_WIDTH+3 downto 0);
    signal s_mant_rr2_ar_trunc : std_logic_vector(FRAC_WIDTH+1 downto 0);

    signal s_exp_o_rr2 : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_mant_o_rr2 : std_logic_vector(FRAC_WIDTH + 1 downto 0);
    signal s_sign_o_rr2, s_cmpl_out2 : std_logic;

    signal d1 : std_logic_vector(FRAC_WIDTH + 2 downto 0) := (others=>'0');
    signal d2 : std_logic_vector(FRAC_WIDTH + 3 downto 0) := (others=>'0');

    component residualreg is
        port(
            sign_rr : in std_logic;
            exp_rr : in std_logic_vector(EXP_WIDTH - 1 downto 0);
            cmpl_in : in std_logic;
            mant_rr : in std_logic_vector(FRAC_WIDTH+1 downto 0);
            sign_rr_out : out std_logic;
            exp_rr_out : out std_logic_vector(EXP_WIDTH - 1 downto 0);
            cmpl_out : out std_logic;
            mant_rr_out : out std_logic_vector(FRAC_WIDTH+1 downto 0)
        );
    end component residualreg;

    component dec_br is
        port(
            sel : in std_logic_vector(7 downto 0);
            en : in std_logic;
            d : out std_logic_vector(25 downto 0)
        );
    end component dec_br;

    component dec_ar is
        port(
            sel : in std_logic_vector(7 downto 0);
            en1, en2 : in std_logic;
            d : out std_logic_vector(26 downto 0)
        );
    end component dec_ar;

    signal a : std_logic; -- to see if expdiff>24
    signal sum : std_logic_vector(8 downto 0);
    signal a1 : std_logic;
    signal mask : std_logic_vector(24 downto 0) := (others=>'0');

begin

    -- Input Register
    s_fract_28_i <= fract_28_i;
    s_exp_i <= exp_i;
    s_sign_i <= sign_i;
    s_fpu_op_i <= fpu_op_i;
    s_rmode_i <= rmode_i;
    cmpl_2_br <= cmpl_in2;
    rr_sign2_br <= sign_i_rr2;
    s_infa <= infa_postin;
    s_nan_a <= nan_a_postin;
    s_nan_b <= nan_b_postin;
    s_nan_in <= nan_in_postin;
    s_nan_op <= nan_op_postin;
    signa <= signa_postin;
    signb <= signb_postin;

    a <= expdiff_postin(7) or expdiff_postin(6) or expdiff_postin(5) or
         (expdiff_postin(4) and expdiff_postin(3) and
          (expdiff_postin(2) or expdiff_postin(1) or expdiff_postin(0)));
```

```
    -- Output Register
    process(clk_i)
    begin
        if falling_edge(clk_i) then
            output_o <= s_output_o;
            infa_postout <= infa_postin;
            infb_postout <= infb_postin;
            signa_postout <= signa_postin;
            signb_postout <= signb_postin;
            ine_o <= s_ine_o;
            exp_o_rr2 <= s_exp_o_rr2;
            mant_o_rr2 <= s_mant_o_rr2;
            sign_o_rr2 <= s_sign_o_rr2;
            cmpl_out2 <= s_cmpl_out2;
        end if;
    end process;

    -- check if shifting is needed
    -- stage 1a: right-shift (when necessary)
    s_shr1 <= s_fract_28_i(27);
    s_shr1e <= '1' when s_fract_28_i(26)='1' and or_reduce(s_exp_i)='0' else '0'; -- if exp is zero, and hidden bit=1, then exp=exp+1 (no need to check s_fract_28_i(27)!)
    s_expr1_9 <= "0"&s_exp_i + "00000001";
    s_fract_shr1 <= shr(s_fract_28_i, "1");
    s_exp_shr1 <= s_expr1_9(7 downto 0);

    -- stage 1b: left-shift (when necessary)
    s_shl <= '1' when s_fract_28_i(27 downto 26)="00" and s_exp_i /= "00000000" else '0';
    -- count the leading zeros of fraction, needed for left-shift
    s_zeros <= count_l_zeros(s_fract_28_i(26 downto 0));
    --s_exp1_9 <= ("0"&s_exp_i) - ("000"&s_zeros);
    shl_pos <= "000000" when s_exp_i="00000001" else s_zeros;
    s_fract_shl <= shl(s_fract_28_i, shl_pos);
    s_exp_shl <= "00000000" when s_exp_i="00000001" else s_exp_i - ("00"&shl_pos);

    s_fract_1 <= s_fract_shr1 when (s_shr1='1') else
                 s_fract_shl when (s_shl='1') else
                 s_fract_28_i;

    s_exp_1 <= s_exp_shr1 when (s_shr1='1') else
               s_exp_shl when (s_shl='1') else
               s_exp_i;

    dec1: dec_br port map(expdiff_postin, s_shr1, d1);
    s_mant_rr2_br <= ('0' & mant_i_rr2) or d1;

    s_exp_rr2_br <= exp_i_rr2;

    -- round
    s_sticky <= '1' when s_fract_1(0)='1' or (s_fract_28_i(0) and s_fract_28_i(27))='1' else '0'; -- check last bit, before and after right-shift

    -- stage 2: right-shift after rounding (when necessary)
    s_shr2 <= s_fract_rnd(27);
    s_expr2_9 <= ("0"&s_exp_1) + "000000001";
    s_fract_shr2 <= shr(s_fract_rnd, "1");
    s_exp_shr2 <= s_expr2_9(7 downto 0);

    s_fract_2 <= s_fract_shr2 when s_shr2='1' else s_fract_rnd;
    s_exp_2 <= s_exp_shr2 when s_shr2='1' else s_exp_1;

    dec2: dec_ar port map(expdiff_postin, s_shr1, s_shr2, d2);
    s_mant_rr2_ar <= ('0' & s_mant_rr2_br) or d2;

    s_exptemprr_ar <= conv_std_logic_vector(conv_integer(s_exp_i) - 2*(FRAC_WIDTH+1), 8);
    s_exp_rr2_ar <= s_exptemprr_ar when ((postnorm_exprr_set_in='1') and (cmpl_2_ar='1')) else s_exp_rr2_br;

    s_mant_rr2_ar_trunc <= s_mant_rr2_ar(FRAC_WIDTH+1 downto 0) when ((s_shr1 or s_shr2)='0') else
                           s_mant_rr2_ar(FRAC_WIDTH+1 downto 0) when (((s_shr1 xor s_shr2)='1') and (a='0')) else
                           s_mant_rr2_ar(FRAC_WIDTH+2 downto 1) when (((s_shr1 xor s_shr2)='1') and (a='1')) else
                           s_mant_rr2_ar(FRAC_WIDTH+1 downto 0) when (((s_shr1 and s_shr2)='1') and (a='0')) else
                           s_mant_rr2_ar(FRAC_WIDTH+3 downto 2) when (((s_shr1 and s_shr2)='1') and (a='1')) else
                           s_mant_rr2_ar(FRAC_WIDTH+1 downto 0);

    ------added
    cmpl_2_ar <= signa xor signb xor s_roundup;
    rr_sign2_ar <= s_sign_i xor s_roundup;
    mantrr2_cmpl <= (not s_mant_rr2_ar_trunc);
```

```
--  process(clk_i)
--  begin
--      if rising_edge(clk_i) then
--          if (cmpl_2_ar='0') then
--              s_mant_rr2_final <= s_mant_rr2_ar_trunc;
--          elsif ((s_expdiff_int<25) and (s_shr1='0') and (s_shr2='0') and (cmpl_2_ar='1')) then
--              s_mant_rr2_final <= s_mant_rr2_ar_trunc(FRAC_WIDTH + 1 downto s_expdiff_int) & mantrr2_cmpl(s_expdiff_int-1 downto 0);
--          elsif ((s_expdiff_int<25) and ((s_shr1 xor s_shr2)='1') and (cmpl_2_ar='1')) then
--              s_mant_rr2_final <= s_mant_rr2_ar_trunc(FRAC_WIDTH + 1 downto s_expdiff_int+1) & mantrr2_cmpl(s_expdiff_int downto 0);
--          elsif ((s_expdiff_int<25) and ((s_shr1 and s_shr2)='1') and (cmpl_2_ar='1')) then
--              s_mant_rr2_final <= s_mant_rr2_ar_trunc(FRAC_WIDTH + 1 downto s_expdiff_int+2) & mantrr2_cmpl(s_expdiff_int+1 downto 0);
--          elsif ((s_expdiff_int>25) and (cmpl_2_ar='1')) then
--              s_mant_rr2_final <= mantrr2_cmpl;
--          end if;
--      end if;
--  end process;
```

```
    sum <= ('0' & expdiff_postin) + ("00000000"&s_shr1) + ("00000000"&s_shr2);
    a1 <= (sum(4) and sum(3) and (sum(0) or sum(1) or sum(2))) or sum(8) or sum(7) or sum(6) or sum(5);

    -- mask holds "sum" low-order ones; all ones when sum is 25 or more (a1='1')
    mask <= "0000000000000000000000000" when ((sum = "000000000")and(a1='0')) else
            "0000000000000000000000001" when ((sum = "000000001")and(a1='0')) else
            "0000000000000000000000011" when ((sum = "000000010")and(a1='0')) else
            "0000000000000000000000111" when ((sum = "000000011")and(a1='0')) else
            "0000000000000000000001111" when ((sum = "000000100")and(a1='0')) else
            "0000000000000000000011111" when ((sum = "000000101")and(a1='0')) else
            "0000000000000000000111111" when ((sum = "000000110")and(a1='0')) else
            "0000000000000000001111111" when ((sum = "000000111")and(a1='0')) else
            "0000000000000000011111111" when ((sum = "000001000")and(a1='0')) else
            "0000000000000000111111111" when ((sum = "000001001")and(a1='0')) else
            "0000000000000001111111111" when ((sum = "000001010")and(a1='0')) else
            "0000000000000011111111111" when ((sum = "000001011")and(a1='0')) else
            "0000000000000111111111111" when ((sum = "000001100")and(a1='0')) else
            "0000000000001111111111111" when ((sum = "000001101")and(a1='0')) else
            "0000000000011111111111111" when ((sum = "000001110")and(a1='0')) else
            "0000000000111111111111111" when ((sum = "000001111")and(a1='0')) else
            "0000000001111111111111111" when ((sum = "000010000")and(a1='0')) else
            "0000000011111111111111111" when ((sum = "000010001")and(a1='0')) else
            "0000000111111111111111111" when ((sum = "000010010")and(a1='0')) else
            "0000001111111111111111111" when ((sum = "000010011")and(a1='0')) else
            "0000011111111111111111111" when ((sum = "000010100")and(a1='0')) else
            "0000111111111111111111111" when ((sum = "000010101")and(a1='0')) else
            "0001111111111111111111111" when ((sum = "000010110")and(a1='0')) else
            "0011111111111111111111111" when ((sum = "000010111")and(a1='0')) else
            "0111111111111111111111111" when ((sum = "000011000")and(a1='0')) else
            "1111111111111111111111111" when (a1='1') else
            "1111111111111111111111111";

    rr2: residualreg port map(rr_sign2_ar, s_exp_rr2_ar, cmpl_2_ar, s_mant_rr2_final(24 downto 0), s_sign_o_rr2, s_exp_o_rr2, s_cmpl_out2, s_mant_o_rr2);

    -- signa <= s_opa_i(31);
    -- signb <= s_opb_i(31);
    -- s_infa <= '1' when s_opa_i(30 downto 23)="11111111" else '0';
    -- s_infb <= '1' when s_opb_i(30 downto 23)="11111111" else '0';
    -- s_nan_a <= '1' when (s_infa='1' and or_reduce(s_opa_i(22 downto 0))='1') else '0';
    -- s_nan_b <= '1' when (s_infb='1' and or_reduce(s_opb_i(22 downto 0))='1') else '0';
    -- s_nan_in <= '1' when s_nan_a='1' or s_nan_b='1' else '0';
    -- s_nan_op <= '1' when (s_infa and s_infb)='1' and (s_opa_i(31) xor (s_fpu_op_i xor s_opb_i(31)))='1' else '0'; -- inf-inf=Nan

    s_nan_sign <= s_sign_i when (s_nan_a and s_nan_b)='1' else
                  signa when s_nan_a='1' else
                  signb;

    -- check if result is inexact;
    s_lost <= or_reduce(s_fract_28_i(2 downto 0)) or or_reduce(s_fract_1(2 downto 0)) or or_reduce(s_fract_2(2 downto 0));
    s_ine_o <= '1' when (s_lost or s_overflow)='1' and (s_infa or s_infb)='0' else '0';

    s_overflow <= '1' when (s_expr1_9(8) or s_expr2_9(8))='1' and (s_infa or s_infb)='0' else '0';
    s_zero_fract <= '1' when s_zeros=27 and s_fract_28_i(27)='0' else '0'; -- '1' if fraction result is zero

    process(s_sign_i, s_exp_2, s_fract_2, s_nan_in, s_nan_op, s_nan_sign, s_infa, s_infb, s_overflow, s_zero_fract)
    begin
        if (s_nan_in or s_nan_op)='1' then
            s_output_o <= s_nan_sign & QNAN;
        elsif (s_infa or s_infb)='1' or s_overflow='1' then
            s_output_o <= s_sign_i & INF;
        elsif s_zero_fract='1' then
            s_output_o <= s_sign_i & ZERO_VECTOR;
        else
            s_output_o <= s_sign_i & s_exp_2 & s_fract_2(25 downto 3);
        end if;
    end process;

end rtl;
```
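
The mask lookup in the postnormalization stage maps the total shift count `sum` to a 25-bit vector with `sum` low-order ones, and saturates to all ones once `a1` flags a count of 25 or more. The same mapping can be written as a loop, sketched below as a small self-checking testbench. The names `mask_sketch_tb` and `low_ones` and the expected-value constants are hypothetical, and the loop form is an alternative formulation rather than the thesis code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking sketch; not part of the thesis design.
entity mask_sketch_tb is
end entity mask_sketch_tb;

architecture sim of mask_sketch_tb is

    -- Return a 25-bit vector whose n lowest bits are '1'.
    -- For n >= 25 every bit is set, which matches the a1='1' case.
    function low_ones(n : natural) return std_logic_vector is
        variable m : std_logic_vector(24 downto 0) := (others => '0');
    begin
        for i in 0 to 24 loop
            if i < n then
                m(i) := '1';
            end if;
        end loop;
        return m;
    end function low_ones;

    constant NO_ONES     : std_logic_vector(24 downto 0) := (others => '0');
    constant THREE_ONES  : std_logic_vector(24 downto 0) := "0000000000000000000000111";
    constant TWENTY_FOUR : std_logic_vector(24 downto 0) := "0111111111111111111111111";
    constant ALL_ONES    : std_logic_vector(24 downto 0) := (others => '1');

begin

    check : process
    begin
        assert low_ones(0)  = NO_ONES     report "count 0 failed"  severity failure;
        assert low_ones(3)  = THREE_ONES  report "count 3 failed"  severity failure;
        assert low_ones(24) = TWENTY_FOUR report "count 24 failed" severity failure;
        assert low_ones(25) = ALL_ONES    report "count 25 failed" severity failure;
        assert low_ones(40) = ALL_ONES    report "saturation failed" severity failure;
        report "mask sketch passed" severity note;
        wait;
    end process check;

end architecture sim;
```

Whether a synthesis tool produces the same hardware for the loop as for the explicit table is not examined here; the sketch only documents the intended input-to-output mapping.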

#### Package – FPU pack

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

package fpupack is

    -- Data width of floating-point number. Default: 32
    constant FP_WIDTH : integer := 32;

    -- Data width of fraction. Default: 23
    constant FRAC_WIDTH : integer := 23;

    -- Data width of exponent. Default: 8
    constant EXP_WIDTH : integer := 8;

    -- SNaN (Signaling Not a Number) FP format (without sign bit)
    constant SNAN : std_logic_vector(30 downto 0) := "1111111100000000000000000000001";

    -- count the zeros starting from left
    function count_l_zeros (signal s_vector: std_logic_vector) return std_logic_vector;

    -- count the zeros starting from right
    function count_r_zeros (signal s_vector: std_logic_vector) return std_logic_vector;

end fpupack;

package body fpupack is

    -- count the zeros starting from left
    function count_l_zeros (signal s_vector: std_logic_vector) return std_logic_vector is
        variable v_count : std_logic_vector(5 downto 0);
    begin
        v_count := "000000";
        for i in s_vector'range loop
            case s_vector(i) is
                when '0' => v_count := v_count + "000001";
                when others => exit;
            end case;
        end loop;
        return v_count;
    end count_l_zeros;

    -- count the zeros starting from right
    function count_r_zeros (signal s_vector: std_logic_vector) return std_logic_vector is
        variable v_count : std_logic_vector(5 downto 0);
    begin
        v_count := "000000";
        for i in 0 to s_vector'length-1 loop
            case s_vector(i) is
                when '0' => v_count := v_count + "000001";
                when others => exit;
            end case;
        end loop;
        return v_count;
    end count_r_zeros;

end fpupack;
```

### Testbench for Adder with residual register

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_misc.all;
use ieee.std_logic_ARITH.all;
use ieee.std_logic_textio.all;
use std.textio.all;

ENTITY fpu_add_test_vhd IS
    --port(clk_out: out std_logic);
    --Type Text is file of String;
    --Type Line is access String;
END fpu_add_test_vhd;

ARCHITECTURE behavior OF fpu_add_test_vhd IS

    -- Component Declaration for the Unit Under Test (UUT)
    COMPONENT fpu_add
    PORT(
        clk_i : IN std_logic;
        movrr : IN std_logic;
        opa_i : IN std_logic_vector(31 downto 0);
        opb_i : IN std_logic_vector(31 downto 0);
        fpu_op_i : IN std_logic_vector(2 downto 0);
        rmode_i : IN std_logic_vector(1 downto 0);
        output_o : OUT std_logic_vector(31 downto 0);
        ine_o : OUT std_logic;
        overflow_o : OUT std_logic;
        underflow_o : OUT std_logic;
        inf_o : OUT std_logic;
        zero_o : OUT std_logic;
        qnan_o : OUT std_logic;
        snan_o : OUT std_logic;
        post_in : OUT std_logic_vector(27 downto 0);
        sign_rr0 : OUT std_logic;
        sign_rr1 : OUT std_logic;
        sign_rr2 : OUT std_logic;
        cmpl_rr0 : OUT std_logic;
        cmpl_rr1 : OUT std_logic;
        cmpl_rr2 : OUT std_logic;
        ready_o : OUT std_logic;
        exp_rr0 : OUT std_logic_vector(7 downto 0);
        exp_rr1 : OUT std_logic_vector(7 downto 0);
        exp_rr2 : OUT std_logic_vector(7 downto 0);
        mant_rr0 : OUT std_logic_vector(24 downto 0);
        mant_rr1 : OUT std_logic_vector(24 downto 0);
        mant_rr2 : OUT std_logic_vector(24 downto 0)
    );
    END COMPONENT;

    --Inputs
    SIGNAL clk_i : std_logic := '0';
    SIGNAL movrr : std_logic := '0';
    SIGNAL opa_i : std_logic_vector(31 downto 0) := (others=>'0');
    SIGNAL opb_i : std_logic_vector(31 downto 0) := (others=>'0');
    SIGNAL fpu_op_i : std_logic_vector(2 downto 0) := (others=>'0');
    SIGNAL rmode_i : std_logic_vector(1 downto 0) := (others=>'0');

    --Outputs
    SIGNAL output_o : std_logic_vector(31 downto 0);
    SIGNAL ine_o : std_logic;
    SIGNAL overflow_o : std_logic;
    SIGNAL underflow_o : std_logic;
    SIGNAL div_zero_o : std_logic;
    SIGNAL inf_o : std_logic;
    SIGNAL zero_o : std_logic;
    SIGNAL qnan_o : std_logic;
    SIGNAL snan_o : std_logic;
    SIGNAL sign_rr0 : std_logic;
    SIGNAL sign_rr1 : std_logic;
    SIGNAL sign_rr2 : std_logic;
    SIGNAL cmpl_rr0 : std_logic;
    SIGNAL cmpl_rr1 : std_logic;
    SIGNAL cmpl_rr2 : std_logic;
    SIGNAL exp_rr0 : std_logic_vector(7 downto 0);
    SIGNAL exp_rr1 : std_logic_vector(7 downto 0);
    SIGNAL exp_rr2 : std_logic_vector(7 downto 0);
    SIGNAL mant_rr0 : std_logic_vector(24 downto 0);
    SIGNAL mant_rr1 : std_logic_vector(24 downto 0);
    SIGNAL mant_rr2 : std_logic_vector(24 downto 0);
    signal sig, temp_mrr : std_logic := '0';
    signal cnt : integer := 0;
    signal ready_o : std_logic;
    signal post_in : std_logic_vector(27 downto 0);
    signal result_in : std_logic_vector(31 downto 0);
    signal rr_in : std_logic_vector(31 downto 0);
    signal rr_out : std_logic_vector(31 downto 0);
    signal err_op, err_rr, err : std_logic := '0';
```

```
BEGIN

    -- Instantiate the Unit Under Test (UUT)
    uut: fpu_add PORT MAP(
        clk_i => clk_i,
        movrr => movrr,
        opa_i => opa_i,
        opb_i => opb_i,
        fpu_op_i => fpu_op_i,
        rmode_i => rmode_i,
        output_o => output_o,
        ine_o => ine_o,
        overflow_o => overflow_o,
        underflow_o => underflow_o,
        inf_o => inf_o,
        zero_o => zero_o,
        qnan_o => qnan_o,
        snan_o => snan_o,
        post_in => post_in,
        sign_rr0 => sign_rr0,
        sign_rr1 => sign_rr1,
        sign_rr2 => sign_rr2,
        cmpl_rr0 => cmpl_rr0,
        cmpl_rr1 => cmpl_rr1,
        cmpl_rr2 => cmpl_rr2,
        ready_o => ready_o,
        exp_rr0 => exp_rr0,
        exp_rr1 => exp_rr1,
        exp_rr2 => exp_rr2,
        mant_rr0 => mant_rr0,
        mant_rr1 => mant_rr1,
        mant_rr2 => mant_rr2);

    fpu_op_i <= "000";
    rmode_i <= "00";

    -- 100 ns clock period
    clk_i <= not (clk_i) after 50 ns;

    -- cnt cycles 0..5; movrr is asserted when cnt=4, temp_mrr when cnt=5
    movrr_gen: process(clk_i)
    begin
        if (falling_edge(clk_i)) then
            if (cnt/=5) then
                cnt <= cnt + 1;
            else
                cnt <= 0;
            end if;
        end if;
        if (rising_edge(clk_i)) then
            if (cnt=4) then
                movrr <= '1';
            else
                movrr <= '0';
            end if;
            if (cnt=5) then
                temp_mrr <= '1';
            else
                temp_mrr <= '0';
            end if;
        end if;
    end process movrr_gen;
```

```
    -- Each line of testdata.txt is one 128-bit hex word:
    -- opa(127..96) & opb(95..64) & expected sum(63..32) & expected residual(31..0)
    read_proc_ab : process is
        file infile : TEXT open read_mode is "testdata.txt";
        variable opa_in : std_logic_vector(31 downto 0);
        variable opb_in : std_logic_vector(31 downto 0);
        variable val : std_logic_vector(127 downto 0);
        variable buf_temp : line;
    BEGIN
        while not endfile(infile) loop
            READLINE(infile, buf_temp);
            hread(buf_temp, val);
            opa_i <= val(127 downto 96);
            opb_i <= val(95 downto 64);
            result_in <= val(63 downto 32);
            wait for 600 ns;
        end loop;
        wait for 600 ns;
    end process read_proc_ab;

    read_proc_oprr : process is
        file infile : TEXT open read_mode is "testdata.txt";
        variable val : std_logic_vector(127 downto 0);
        variable buf_temp : line;
    BEGIN
        while not endfile(infile) loop
            READLINE(infile, buf_temp);
            hread(buf_temp, val);
            rr_in <= val(31 downto 0);
            wait for 700 ns;
        end loop;
        wait for 700 ns;
    end process read_proc_oprr;

    -- On the rising edge of temp_mrr, log output_o and compare it with the expected sum
    write_proc : process(temp_mrr) is
        file outfile : text open write_mode is "sum_out.txt";
        variable buf_temp : line;
    begin
        if (rising_edge(temp_mrr)) then
            hwrite(buf_temp, output_o);
            writeline(outfile, buf_temp);
            if (output_o/=result_in) then
                err_op <= '1';
            else
                err_op <= '0';
            end if;
        end if;
    END PROCESS write_proc;

    -- On the falling edge of temp_mrr, log output_o and compare it with the expected residual
    write_rr : process(temp_mrr) is
        file outfile : text open write_mode is "rr_out.txt";
        variable buf1_temp, buf2_temp : line;
    begin
        if (falling_edge(temp_mrr)) then
            hwrite(buf1_temp, output_o);
            writeline(outfile, buf1_temp);
            if (rr_in/=output_o) then
                err_rr <= '1';
            else
                err_rr <= '0';
            end if;
        end if;
    END PROCESS write_rr;

    err <= err_op or err_rr;

END;
```

## **FPU – Multiplier**

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_misc.all;
```

```
        -- 001 = subtract,
        -- 010 = multiply,
        fpu_op_i    : in std_logic_vector(2 downto 0);
        -- Rounding Mode:
        -- ==============
        -- 00 = round to nearest even (default),
        -- 01 = round to zero,
        -- 10 = round up,
        -- 11 = round down
        rmode_i     : in std_logic_vector(1 downto 0);
        -- Output port
        output_o    : out std_logic_vector(FP_WIDTH-1 downto 0);
        sign_rr_out : out std_logic;
        exp_rr_out  : out std_logic_vector(EXP_WIDTH-1 downto 0);
        cmpl_out    : out std_logic;
        mant_rr_out : out std_logic_vector(FRAC_WIDTH+1 downto 0);
        ready_o     : out std_logic;
        post_in     : out std_logic_vector(47 downto 0);
        -- Exceptions
        ine_o       : out std_logic; -- inexact
        overflow_o  : out std_logic; -- overflow
        underflow_o : out std_logic; -- underflow
        inf_o       : out std_logic; -- infinity
        zero_o      : out std_logic; -- zero
        qnan_o      : out std_logic; -- quiet Not-a-Number
        snan_o      : out std_logic  -- signaling Not-a-Number
    );
end fpu_mult;

architecture rtl of fpu_mult is

    -- Input/output registers
    signal s_opa_i, s_opb_i : std_logic_vector(FP_WIDTH-1 downto 0);
    signal s_fpu_op_i : std_logic_vector(2 downto 0);
    signal s_rmode_i : std_logic_vector(1 downto 0);
    signal s_output_o : std_logic_vector(FP_WIDTH-1 downto 0);
    signal s_ine_o, s_overflow_o, s_underflow_o, s_inf_o, s_zero_o, s_qnan_o, s_snan_o : std_logic;

    signal cnt : integer;
    signal ready : std_logic;
    signal s_output1 : std_logic_vector(FP_WIDTH-1 downto 0);
    signal s_infa, s_infb : std_logic;

    -- ***Multiply units signals***
    signal pre_norm_mul_exp_10 : std_logic_vector(9 downto 0);
    signal pre_norm_mul_fracta_24 : std_logic_vector(23 downto 0);
    signal pre_norm_mul_fractb_24 : std_logic_vector(23 downto 0);
    signal mul_fract_48, s_postnorm_fract_in : std_logic_vector(47 downto 0);
    signal mul_sign, s_postnorm_sign_in : std_logic;
    signal post_norm_mul_output : std_logic_vector(31 downto 0);
    signal post_norm_mul_ine : std_logic;
    signal s_expa_pretomultin, s_expb_pretomultin, s_expa_pretomultout, s_expb_pretomultout : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_exp_10_pretomultout, s_postnorm_exp_in : std_logic_vector(EXP_WIDTH+1 downto 0);
    signal s_sign_pretomultin, s_op_0_pretomultin, s_fracta0_pretomultin, s_fractb0_pretomultin : std_logic;
    signal s_op_0_multopostin, s_fracta0_multopostin, s_fractb0_multopostin : std_logic;

    -- ***components***
    signal s_sign_rr_out, s_cmpl_out : std_logic;
    signal s_exp_rr_out : std_logic_vector(EXP_WIDTH-1 downto 0);
    signal s_mant_rr_out : std_logic_vector(FRAC_WIDTH+1 downto 0);

    component pre_norm_mul is
        port(
            clk_i       : in std_logic;
            opa_i       : in std_logic_vector(FP_WIDTH-1 downto 0);
            opb_i       : in std_logic_vector(FP_WIDTH-1 downto 0);
            exp_10_o    : out std_logic_vector(EXP_WIDTH+1 downto 0);
            fracta_24_o : out std_logic_vector(FRAC_WIDTH downto 0); -- hidden(1) & fraction(23)
            fractb_24_o : out std_logic_vector(FRAC_WIDTH downto 0);
            expa_o, expb_o : out std_logic_vector(EXP_WIDTH-1 downto 0);
            sign_o, op_0, fracta0, fractb0 : out std_logic
        );
    end component;

    component mul_24 is
        port(
            clk_i    : in std_logic;
            fracta_i : in std_logic_vector(FRAC_WIDTH downto 0); -- hidden(1) & fraction(23)
            fractb_i : in std_logic_vector(FRAC_WIDTH downto 0);
            expa_pretomultin   : in std_logic_vector(EXP_WIDTH-1 downto 0);
            expb_pretomultin   : in std_logic_vector(EXP_WIDTH-1 downto 0);
            exp_10_pretomultin : in std_logic_vector(EXP_WIDTH+1 downto 0);
            sign_pretomultin, op_0_pretomultin, fracta0_pretomultin, fractb0_pretomultin : in std_logic;
            fract_o : out std_logic_vector(2*FRAC_WIDTH+1 downto 0);
            sign_pretomultout : out std_logic;
            expa_pretomultout : out std_logic_vector(EXP_WIDTH-1 downto 0);
            expb_pretomultout : out std_logic_vector(EXP_WIDTH-1 downto 0);
            op_0_pretomultout, fracta0_pretomultout, fractb0_pretomultout : out std_logic;
            exp_10_pretomultout : out std_logic_vector(EXP_WIDTH+1 downto 0)
        );
    end component;

    component post_norm_mul is
        port(
            clk_i            : in std_logic;
            expa_multopostin : in std_logic_vector(EXP_WIDTH-1 downto 0);
            expb_multopostin : in std_logic_vector(EXP_WIDTH-1 downto 0);
            exp_10_i         : in std_logic_vector(EXP_WIDTH+1 downto 0);
            fract_48_i       : in std_logic_vector(2*FRAC_WIDTH+1 downto 0); -- hidden(1) & fraction(23)
            sign_i           : in std_logic;
            rmode_i          : in std_logic_vector(1 downto 0);
            op_0_multopostin, fracta0_multopostin, fractb0_multopostin : in std_logic;
            output_o         : out std_logic_vector(FP_WIDTH-1 downto 0);
            ine_o            : out std_logic;
            sign_rr_out, cmpl_out : out std_logic;
            exp_rr_out       : out std_logic_vector(EXP_WIDTH-1 downto 0);
            mant_rr_out      : out std_logic_vector(FRAC_WIDTH+1 downto 0)
        );
    end component;

begin

-- ***Multiply units***

i_pre_norm_mul : pre_norm_mul
port map(
        clk_i => clk_i,
        opa_i => s_opa_i,
        opb_i => s_opb_i,
        exp_10_o => pre_norm_mul_exp_10,
        fracta_24_o => pre_norm_mul_fracta_24,
        fractb_24_o => pre_norm_mul_fractb_24,
        expa_o => s_expa_pretomultin,
        expb_o => s_expb_pretomultin,
        sign_o => s_sign_pretomultin,
        op_0 => s_op_0_pretomultin,
        fracta0 => s_fracta0_pretomultin,
        fractb0 => s_fractb0_pretomultin
        );

i_mul_24 : mul_24
port map(
        clk_i => clk_i,
        fracta_i => pre_norm_mul_fracta_24,
        fractb_i => pre_norm_mul_fractb_24,
        expa_pretomultin => s_expa_pretomultin,
        expb_pretomultin => s_expb_pretomultin,
        exp_10_pretomultin => pre_norm_mul_exp_10,
        sign_pretomultin => s_sign_pretomultin,
        op_0_pretomultin => s_op_0_pretomultin,
        fracta0_pretomultin => s_fracta0_pretomultin,
        fractb0_pretomultin => s_fractb0_pretomultin,
        fract_o => mul_fract_48,
        sign_pretomultout => mul_sign,
        expa_pretomultout => s_expa_pretomultout,
        expb_pretomultout => s_expb_pretomultout,
        op_0_pretomultout => s_op_0_multopostin,
        fracta0_pretomultout => s_fracta0_multopostin,
        fractb0_pretomultout => s_fractb0_multopostin,
        exp_10_pretomultout => s_exp_10_pretomultout
        );

i_post_norm_mul : post_norm_mul
port map(
        clk_i => clk_i,
        expa_multopostin => s_expa_pretomultout,
        expb_multopostin => s_expb_pretomultout,
        --exp_10_i => s_exp_10_pretomultout,
        exp_10_i => s_postnorm_exp_in,
        --fract_48_i => mul_fract_48,
        fract_48_i => s_postnorm_fract_in,
        --sign_i => mul_sign,
        sign_i => s_postnorm_sign_in,
        rmode_i => s_rmode_i,
        op_0_multopostin => s_op_0_multopostin,
        fracta0_multopostin => s_fracta0_multopostin,
        fractb0_multopostin => s_fractb0_multopostin,
        output_o => post_norm_mul_output,
        ine_o => post_norm_mul_ine,
        sign_rr_out => s_sign_rr_out,
        exp_rr_out => s_exp_rr_out,
        cmpl_out => s_cmpl_out,
        mant_rr_out => s_mant_rr_out
        );

s_postnorm_fract_in <= ... when (movrr='1') else mul_fract_48;
s_postnorm_exp_in <= ("00" & s_exp_rr_out) when (movrr='1') else s_exp_10_pretomultout;
s_postnorm_sign_in <= s_sign_rr_out when (movrr='1') else mul_sign;
-- s_postnorm_fract_in <= mul_fract_48;
-- s_postnorm_exp_in <= s_exp_10_pretomultout;
-- s_postnorm_sign_in <= mul_sign;

-- Input Register
s_opa_i <= opa_i;
s_opb_i <= opb_i;
s_fpu_op_i <= fpu_op_i;
s_rmode_i <= rmode_i;

-- Output Register
process(clk_i)
begin
    if rising_edge(clk_i) then
        if (ready = '1') then
            output_o <= s_output_o;
            ine_o <= s_ine_o;
            overflow_o <= s_overflow_o;
            underflow_o <= s_underflow_o;
            inf_o <= s_inf_o;
            zero_o <= s_zero_o;
            qnan_o <= s_qnan_o;
            snan_o <= s_snan_o;
            sign_rr_out <= s_sign_rr_out;
            cmpl_out <= s_cmpl_out;
            exp_rr_out <= s_exp_rr_out;
            mant_rr_out <= s_mant_rr_out;
            post_in <= s_postnorm_fract_in;
        end if;
    end if;
end process;

-- Output Multiplexer
process(clk_i)
begin
    if rising_edge(clk_i) then
        if fpu_op_i = "010" then
            s_output1 <= post_norm_mul_output;
            s_ine_o <= post_norm_mul_ine;
        else
            s_output1 <= (others => '0');
            s_ine_o <= '0';
        end if;
    end if;
end process;

s_infa <= '1' when s_opa_i(30 downto 23)="11111111" else '0';
s_infb <= '1' when s_opb_i(30 downto 23)="11111111" else '0';

--In round down: the subtraction of two equal numbers other than zero is always -0!!!
process(s_output1, s_rmode_i, s_infa, s_infb, s_qnan_o, s_snan_o, s_zero_o, s_fpu_op_i, s_opa_i, s_opb_i)
begin
    if s_rmode_i="00" or ((s_infa or s_infb) or s_qnan_o or s_snan_o)='1' then --round-to-nearest-even
        s_output_o <= s_output1;
    elsif s_rmode_i="01" and s_output1(30 downto 23)="11111111" then
        --In round-to-zero: the sum of two non-infinity operands is never infinity, even if an overflow occurs
        s_output_o <= s_output1(31) & "1111111011111111111111111111111";
    elsif s_rmode_i="10" and s_output1(31 downto 23)="111111111" then
        --In round-up: the sum of two non-infinity operands is never negative infinity, even if an overflow occurs
        s_output_o <= "11111111011111111111111111111111";
    elsif s_rmode_i="11" then
        --In round-down: a-a=-0
        if (s_fpu_op_i="000" or s_fpu_op_i="001") and s_zero_o='1' and (s_opa_i(31) or (s_fpu_op_i(0) xor s_opb_i(31)))='1' then
            s_output_o <= "1" & s_output1(30 downto 0);
        --In round-down: the sum of two non-infinity operands is never positive infinity, even if an overflow occurs
        elsif s_output1(31 downto 23)="011111111" then
            s_output_o <= "01111111011111111111111111111111";
        else
            s_output_o <= s_output1;
        end if;
    else
        s_output_o <= s_output1;
    end if;
end process;

-- Generate Exceptions
s_underflow_o <= '1' when s_output1(30 downto 23)="00000000" and s_ine_o='1' else '0';
s_overflow_o <= '1' when s_output1(30 downto 23)="11111111" and s_ine_o='1' else '0';
s_inf_o <= '1' when s_output1(30 downto 23)="11111111" and (s_qnan_o or s_snan_o)='0' else '0';
s_zero_o <= '1' when or_reduce(s_output1(30 downto 0))='0' else '0';
s_qnan_o <= '1' when s_output1(30 downto 0)=QNAN else '0';
s_snan_o <= '1' when s_opa_i(30 downto 0)=SNAN or s_opb_i(30 downto 0)=SNAN else '0';
```

```
-- Ready signal to indicate start of valid outputs
process(clk_i)
begin
    if (rising_edge(clk_i)) then
        if (cnt /= 4) then
            cnt <= cnt + 1;
        else
            cnt <= cnt;
        end if;
        if (cnt = 4) then
            ready <= '1';
        else
            ready <= '0';
        end if;
    end if;
end process;

ready_o <= ready;

end rtl;
```

### Prenormalization

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.std_logic_misc.all;

library work;
use work.fpupack.all;

entity pre_norm_mul is
    port(
        clk_i       : in std_logic;
        opa_i       : in std_logic_vector(FP_WIDTH-1 downto 0);
        opb_i       : in std_logic_vector(FP_WIDTH-1 downto 0);
        exp_10_o    : out std_logic_vector(EXP_WIDTH+1 downto 0);
        fracta_24_o : out std_logic_vector(FRAC_WIDTH downto 0); -- hidden(1) & fraction(23)
        fractb_24_o : out std_logic_vector(FRAC_WIDTH downto 0);
        expa_o, expb_o : out std_logic_vector(EXP_WIDTH-1 downto 0);
        sign_o, op_0, fracta0, fractb0 : out std_logic
    );
end pre_norm_mul;

architecture rtl of pre_norm_mul is

signal s_expa, s_expb : std_logic_vector(EXP_WIDTH-1 downto 0);
signal s_fracta, s_fractb : std_logic_vector(FRAC_WIDTH-1 downto 0);
signal s_exp_10_o, s_expa_in, s_expb_in : std_logic_vector(EXP_WIDTH+1 downto 0);
signal s_opa_dn, s_opb_dn : std_logic;

begin
```

\_\_\_

### Multiplier

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

library work;
use work.fpupack.all;

entity mul_24 is
    port(
        clk_i    : in std_logic;
        fracta_i : in std_logic_vector(FRAC_WIDTH downto 0); -- hidden(1) & fraction(23)
        fractb_i : in std_logic_vector(FRAC_WIDTH downto 0);
        expa_pretomultin   : in std_logic_vector(EXP_WIDTH-1 downto 0);
        expb_pretomultin   : in std_logic_vector(EXP_WIDTH-1 downto 0);
        exp_10_pretomultin : in std_logic_vector(EXP_WIDTH+1 downto 0);
        sign_pretomultin, op_0_pretomultin, fracta0_pretomultin, fractb0_pretomultin : in std_logic;
        fract_o : out std_logic_vector(2*FRAC_WIDTH+1 downto 0);
        sign_pretomultout : out std_logic;
        expa_pretomultout : out std_logic_vector(EXP_WIDTH-1 downto 0);
        expb_pretomultout : out std_logic_vector(EXP_WIDTH-1 downto 0);
        op_0_pretomultout, fracta0_pretomultout, fractb0_pretomultout : out std_logic;
        exp_10_pretomultout : out std_logic_vector(EXP_WIDTH+1 downto 0)
    );
end mul_24;

architecture rtl of mul_24 is

signal s_fracta_i, s_fractb_i : std_logic_vector(FRAC_WIDTH downto 0);
signal s_fract_o : std_logic_vector(2*FRAC_WIDTH+1 downto 0);

signal a_h, a_l, b_h, b_l : std_logic_vector(11 downto 0);
signal a_h_h, a_h_l, b_h_h, b_h_l, a_l_h, a_l_l, b_l_h, b_l_l : std_logic_vector(5 downto 0);

type op_6 is array (7 downto 0) of std_logic_vector(5 downto 0);
type prod_6 is array (3 downto 0) of op_6;

type prod_48 is array (4 downto 0) of std_logic_vector(47 downto 0);
type sum_24 is array (3 downto 0) of std_logic_vector(23 downto 0);

type a is array (3 downto 0) of std_logic_vector(23 downto 0);
type prod_24 is array (3 downto 0) of a;

signal prod : prod_6;
signal sum : sum_24;
signal prod_a_b : prod_48;
signal prod2 : prod 24;
begin
-- Input Register
              s fracta i <= fracta i;
              s fractb i \leq fractb i;
-- Output Register
process(clk i)
begin
       if rising edge(clk i) then
              fract o \le s fract o;
              sign pretomultout <= sign pretomultin;
              expa pretomultout <= expa pretomultin;
              expb pretomultout <= expb pretomultin;
              op 0 pretomultout<=op 0 pretomultin;
              fracta0 pretomultout<=fracta0 pretomultin;
              fractb0 pretomultout<=fractb0 pretomultin;
              exp 10 pretomultout<=exp 10 pretomultin;
       end if:
```

```
end process;
```

```
--"000000000000"
-- A = A_h x 2^N + A_l, B = B_h x 2^N + B_l
-- A x B = A_h x B_h x 2^2N + (A_h x B_l + A_l x B_h) x 2^N + A_l x B_l
a_h <= s_fracta_i(23 downto 12);
a_l <= s_fracta_i(11 downto 0);
b_h <= s_fractb_i(23 downto 12);
b_l <= s_fractb_i(11 downto 0);

a_h_h <= a_h(11 downto 6);
a_h_l <= a_h(5 downto 0);
b_h_h <= b_h(11 downto 6);
b_h_l <= b_h(5 downto 0);

a_l_h <= a_l(11 downto 6);
a_l_l <= a_l(5 downto 0);
b_l_h <= b_l(11 downto 6);
b_l_l <= b_l(5 downto 0);

prod(0)(0) <= a_h_h; prod(0)(1) <= b_h_h;
prod(0)(2) <= a_h_h; prod(0)(3) <= b_h_l;
prod(0)(4) <= a_h_l; prod(0)(5) <= b_h_h;
prod(0)(6) <= a_h_l; prod(0)(7) <= b_h_l;

prod(1)(0) <= a_h_h; prod(1)(1) <= b_l_h;
prod(1)(2) <= a_h_h; prod(1)(3) <= b_l_l;
prod(1)(4) <= a_h_l; prod(1)(5) <= b_l_h;
prod(1)(6) <= a_h_l; prod(1)(7) <= b_l_l;

prod(2)(0) <= a_l_h; prod(2)(1) <= b_h_h;
prod(2)(2) <= a_l_h; prod(2)(3) <= b_h_l;
prod(2)(4) <= a_l_l; prod(2)(5) <= b_h_h;
prod(2)(6) <= a_l_l; prod(2)(7) <= b_h_l;

prod(3)(0) <= a_l_h; prod(3)(1) <= b_l_h;
prod(3)(2) <= a_l_h; prod(3)(3) <= b_l_l;
prod(3)(4) <= a_l_l; prod(3)(5) <= b_l_h;
prod(3)(6) <= a_l_l; prod(3)(7) <= b_l_l;

prod2(0)(0) <= (prod(0)(0)*prod(0)(1))&"000000000000";
prod2(0)(1) <= "000000"&(prod(0)(2)*prod(0)(3))&"000000";
prod2(0)(2) <= "000000"&(prod(0)(4)*prod(0)(5))&"000000";
prod2(0)(3) <= "000000000000"&(prod(0)(6)*prod(0)(7));

prod2(1)(0) <= (prod(1)(0)*prod(1)(1))&"000000000000";
prod2(1)(1) <= "000000"&(prod(1)(2)*prod(1)(3))&"000000";
prod2(1)(2) <= "000000"&(prod(1)(4)*prod(1)(5))&"000000";
prod2(1)(3) <= "000000000000"&(prod(1)(6)*prod(1)(7));

prod2(2)(0) <= (prod(2)(0)*prod(2)(1))&"000000000000";
prod2(2)(1) <= "000000"&(prod(2)(2)*prod(2)(3))&"000000";
prod2(2)(2) <= "000000"&(prod(2)(4)*prod(2)(5))&"000000";
prod2(2)(3) <= "000000000000"&(prod(2)(6)*prod(2)(7));

prod2(3)(0) <= (prod(3)(0)*prod(3)(1))&"000000000000";
prod2(3)(1) <= "000000"&(prod(3)(2)*prod(3)(3))&"000000";
prod2(3)(2) <= "000000"&(prod(3)(4)*prod(3)(5))&"000000";
prod2(3)(3) <= "000000000000"&(prod(3)(6)*prod(3)(7));

sum(0) <= prod2(0)(0) + prod2(0)(1) + prod2(0)(2) + prod2(0)(3);
sum(1) <= prod2(1)(0) + prod2(1)(1) + prod2(1)(2) + prod2(1)(3);
sum(2) <= prod2(2)(0) + prod2(2)(1) + prod2(2)(2) + prod2(2)(3);
sum(3) <= prod2(3)(0) + prod2(3)(1) + prod2(3)(2) + prod2(3)(3);

-- Last stage
```
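The multiplier listing above splits each 24-bit significand into 12-bit halves and sums shifted partial products. The following is a minimal self-checking testbench sketch of that identity, not part of the thesis sources: the entity name `split_mul_tb`, the two test operands, and the use of `ieee.numeric_std` (instead of the `std_logic_unsigned` package used in the listings) are assumptions made for illustration only.

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench: checks A x B against the split form
-- A_h*B_h*2^24 + (A_h*B_l + A_l*B_h)*2^12 + A_l*B_l  (N = 12).
entity split_mul_tb is
end entity split_mul_tb;

architecture sim of split_mul_tb is
begin
    check : process
        variable a, b           : unsigned(23 downto 0);
        variable a_h, a_l       : unsigned(11 downto 0);
        variable b_h, b_l       : unsigned(11 downto 0);
        variable hh, hl, lh, ll : unsigned(47 downto 0);
        variable total          : unsigned(47 downto 0);
    begin
        -- Arbitrary 24-bit test significands (assumed values).
        a := x"C90FDB";
        b := x"ADF854";

        -- Split each operand into high and low 12-bit halves.
        a_h := a(23 downto 12);
        a_l := a(11 downto 0);
        b_h := b(23 downto 12);
        b_l := b(11 downto 0);

        -- Each 12x12 product is 24 bits wide; widen to 48 bits, then shift.
        hh := shift_left(resize(a_h * b_h, 48), 24);
        hl := shift_left(resize(a_h * b_l, 48), 12);
        lh := shift_left(resize(a_l * b_h, 48), 12);
        ll := resize(a_l * b_l, 48);
        total := hh + hl + lh + ll;

        -- The recombined sum must equal the direct 24x24 product.
        assert total = a * b
            report "split product does not match direct product"
            severity failure;
        report "split product matches direct product";
        wait;
    end process check;
end architecture sim;
```

The listing applies the same split a second time, breaking each 12-bit half into 6-bit pieces, so that only 6x6 multiplications are needed; the testbench stops at the 12-bit level to keep the check short.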

### Postnormalization

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.std_logic_misc.all;
use ieee.std_logic_arith.all;

library work;
use work.fpupack.all;

entity post_norm_mul is
    port(
        clk_i : in std_logic;
        expa_multopostin : in std_logic_vector(EXP_WIDTH-1 downto 0);
        expb_multopostin : in std_logic_vector(EXP_WIDTH-1 downto 0);
        exp_10_i : in std_logic_vector(EXP_WIDTH+1 downto 0);
        fract_48_i : in std_logic_vector(2*FRAC_WIDTH+1 downto 0); -- hidden(1) & fraction(23)
        sign_i : in std_logic;
        rmode_i : in std_logic_vector(1 downto 0);
        op_0_multopostin, fracta0_multopostin, fractb0_multopostin : in std_logic;
        output_o : out std_logic_vector(FP_WIDTH-1 downto 0);
        ine_o : out std_logic;
        sign_rr_out, cmpl_out : out std_logic;
        exp_rr_out : out std_logic_vector(EXP_WIDTH - 1 downto 0);
        mant_rr_out : out std_logic_vector(FRAC_WIDTH+1 downto 0)
    );
end post_norm_mul;

architecture rtl of post_norm_mul is

signal s_expa1, s_expb1 : std_logic_vector(EXP_WIDTH-1 downto 0);
signal s_exp_10_i : std_logic_vector(EXP_WIDTH+1 downto 0);
signal s_sign_1 : std_logic;
signal s_output_o : std_logic_vector(FP_WIDTH-1 downto 0) := X"00000000";
signal s_ine_o, s_overflow : std_logic;
signal s_rmode_1 : std_logic_vector(1 downto 0);
signal s_zeros : std_logic_vector(5 downto 0);
signal s_carry : std_logic;
signal s_shr2, s_shl2 : std_logic_vector(5 downto 0) := "000000";
signal s_expo1 : std_logic_vector(8 downto 0);
signal s_exp_10a, s_exp_10b : std_logic_vector(9 downto 0);
signal s_frac2a : std_logic_vector(47 downto 0);
signal s_sticky, s_guard, s_round : std_logic;
signal s_roundup : std_logic;
signal s_frac_rnd, s_frac3 : std_logic_vector(24 downto 0);
signal s_shr3 : std_logic;
signal s_r_zeros1 : std_logic_vector(5 downto 0);
signal s_lost : std_logic;
signal s_op_0 : std_logic;
signal s_expo3, s_expo2b : std_logic_vector(8 downto 0);
signal s_infa, s_infb : std_logic;
signal s_nan_in, s_nan_op, s_nan_a, s_nan_b : std_logic;

----pipeline signals----
signal s_fract_48_1 : std_logic_vector(2*FRAC_WIDTH+1 downto 0);
signal s_or_a, s_or_b : std_logic;

----residual register component----
component residualreg is
    port(
        sign_rr : in std_logic;
        exp_rr : in std_logic_vector(EXP_WIDTH - 1 downto 0);
        cmpl_in : in std_logic;
        mant_rr : in std_logic_vector(FRAC_WIDTH+1 downto 0);
        sign_rr_out : out std_logic;
        exp_rr_out : out std_logic_vector(EXP_WIDTH - 1 downto 0);
        cmpl_out : out std_logic;
        mant_rr_out : out std_logic_vector(FRAC_WIDTH+1 downto 0)
    );
end component;

----residual register signals----
signal s_sign_rr, s_sign_rr_out, s_cmpl_in, s_cmpl_out : std_logic;
signal s_exp_rr, s_exp_rr_out : std_logic_vector(EXP_WIDTH - 1 downto 0);
signal s_mant_rr_br : std_logic_vector(FRAC_WIDTH-1 downto 0);
signal s_mant_rr_ar : std_logic_vector(FRAC_WIDTH downto 0);
signal s_mant_rr_final, s_mant_rr_out : std_logic_vector(FRAC_WIDTH+1 downto 0);

signal s_r_zeros2 : std_logic_vector(5 downto 0);
signal s_sign_2, s_op_0_2, s_or_a2, s_or_b2 : std_logic;

begin

-- Input Register
s_expa1 <= expa_multopostin;
s_expb1 <= expb_multopostin;
s_exp_10_i <= exp_10_i;
s_fract_48_1 <= fract_48_i;
s_sign_1 <= sign_i;
s_rmode_1 <= rmode_i;
s_op_0 <= op_0_multopostin;
s_or_a <= fracta0_multopostin;
s_or_b <= fractb0_multopostin;

-- Output Register
process(clk_i)
begin
    if rising_edge(clk_i) then
        output_o <= s_output_o;
        ine_o <= s_ine_o;
        sign_rr_out <= s_sign_rr_out;
        cmpl_out <= s_cmpl_out;
        exp_rr_out <= s_exp_rr_out;
        mant_rr_out <= s_mant_rr_out;
    end if;
end process;
```

```
s_zeros <= count_l_zeros(s_fract_48_1(46 downto 1)) when (s_fract_48_1(47)='0') else "000000";
s_r_zeros1 <= count_r_zeros(s_fract_48_1);
s_exp_10a <= s_exp_10_i + ("000000000" & s_fract_48_1(47));
s_exp_10b <= s_exp_10_i + ("000000000" & s_fract_48_1(47)) - ("0000" & s_zeros);

s_carry <= s_fract_48_1(47);

process(clk_i)
    variable v_shr1, v_shl1 : std_logic_vector(9 downto 0);
begin
    if rising_edge(clk_i) then
        if s_exp_10a(9)='1' or s_exp_10a="0000000000" then
            v_shr1 := "0000000001" - s_exp_10a + ("000000000" & s_carry);
            v_shl1 := (others => '0');
            s_expo1 <= "000000001";
        else
            if s_exp_10b(9)='1' or s_exp_10b="0000000000" then
                v_shr1 := (others => '0');
                v_shl1 := ("0000" & s_zeros) - s_exp_10a;
                s_expo1 <= "000000001";
            elsif s_exp_10b(8)='1' then
                v_shr1 := (others => '0');
                v_shl1 := (others => '0');
                s_expo1 <= "011111111";
            else
                v_shr1 := ("000000000" & s_carry);
                v_shl1 := ("0000" & s_zeros);
                s_expo1 <= s_exp_10b(8 downto 0);
            end if;
        end if;
        if v_shr1(6)='1' then --"110000" = 48; maximal shift-right positions
            s_shr2 <= "111111";
        else
            s_shr2 <= v_shr1(5 downto 0);
        end if;
        s_shl2 <= v_shl1(5 downto 0);
    end if;
end process;

-- *** Stage 2 ***
process(clk_i)
begin
    if(rising_edge(clk_i))then
        if(s_shr2 /= "000000")then
            s_frac2a <= shr(s_fract_48_1, s_shr2);
        elsif(s_shl2 /= "000000")then
            s_frac2a <= shl(s_fract_48_1, s_shl2);
        else
            s_frac2a <= s_fract_48_1;
        end if;
    end if;
end process;
```

```
-- signals if precision was lost during the right-shift above
s_lost <= '1' when (s_shr2+("00000"&s_shr3)) > s_r_zeros1 else '0';

-- ***Stage 3***
-- Rounding
-- guard bit: s_frac2a(23) (LSB of output)
-- round bit: s_frac2a(22)
s_guard <= s_frac2a(22);
s_round <= s_frac2a(21);
s_sticky <= or_reduce(s_frac2a(20 downto 0)) or s_lost;

s_roundup <= s_guard and ((s_round or s_sticky) or s_frac2a(23)) when s_rmode_1="00" else -- round to nearest even
             (s_guard or s_round or s_sticky) and (not s_sign_1) when s_rmode_1="10" else -- round up
             (s_guard or s_round or s_sticky) and (s_sign_1) when s_rmode_1="11" else -- round down
             '0'; -- round to zero (truncate = no rounding)

s_mant_rr_br <= s_frac2a(22 downto 0); -- before rounding

process(clk_i)
begin
    if(rising_edge(clk_i))then
        s_r_zeros2 <= s_r_zeros1;
        s_sign_2 <= s_sign_1;
        s_op_0_2 <= s_op_0;
        s_or_a2 <= s_or_a;
        s_or_b2 <= s_or_b;
    end if;
end process;

s_frac_rnd <= (s_frac2a(47 downto 23)) + "1" when (s_roundup='1') else s_frac2a(47 downto 23);

s_expo2b <= s_expo1 - "000000001" when s_frac2a(46)='0' else s_expo1;

s_shr3 <= s_frac_rnd(24);

s_frac3 <= ("0"&s_frac_rnd(24 downto 1)) when (s_shr3='1' and s_expo2b /= "011111111") else s_frac_rnd;

s_expo3 <= s_expo2b + '1' when (s_shr3='1' and s_expo2b /= "011111111") else s_expo2b;

s_mant_rr_ar <= (s_frac_rnd(0) & s_mant_rr_br) when (s_shr3='1' and s_expo2b /= "011111111") else ('0' & s_mant_rr_br);

---***Stage 4****
-- Output

s_infa <= '1' when s_expa1="11111111" else '0';
s_infb <= '1' when s_expb1="11111111" else '0';
s_nan_a <= '1' when (s_infa='1' and s_or_a2='1') else '0';
s_nan_b <= '1' when (s_infb='1' and s_or_b2='1') else '0';
s_nan_in <= '1' when s_nan_a='1' or s_nan_b='1' else '0';
s_nan_op <= '1' when (s_infa or s_infb)='1' and s_op_0_2='1' else '0'; -- 0 * inf = nan

s_overflow <= '1' when s_expo3 = "011111111" and (s_infa or s_infb)='0' else '0';
s_ine_o <= '1' when s_op_0_2='0' and (s_lost or s_or_a2 or s_overflow)='1' else '0';

process(s_sign_2, s_expo3, s_frac3, s_nan_in, s_nan_op, s_infa, s_infb,
        s_overflow, s_r_zeros2)
begin
    if (s_nan_in or s_nan_op)='1' then
        s_output_o <= s_sign_2 & QNAN;
    elsif (s_infa or s_infb)='1' or s_overflow='1' then
        s_output_o <= s_sign_2 & INF;
    elsif s_r_zeros1=48 then
        s_output_o <= s_sign_2 & ZERO_VECTOR;
    else
        s_output_o <= s_sign_2 & s_expo3(7 downto 0) & s_frac3(22 downto 0);
    end if;
end process;

s_sign_rr <= s_sign_1 xor s_roundup;
s_cmpl_in <= s_roundup;

s_exp_rr <= conv_std_logic_vector(conv_integer(s_expo3(7 downto 0) - (FRAC_WIDTH+1)),8);

--residual register added----
rreg : residualreg port map(s_sign_rr, s_exp_rr, s_cmpl_in, s_mant_rr_final,
                            s_sign_rr_out, s_exp_rr_out, s_cmpl_out, s_mant_rr_out);

end rtl;
```

### References

- 1. H. G. Dietz, W. R. Dieter, R. Fisher, and K. Chang, "Floating-point computation with just enough accuracy," *Lecture Notes in Computer Science*, vol. 3991, pp. 226-233, April 2006.
- 2. W. R. Dieter, H. G. Dietz, "Low-Cost Microarchitectural Support for Improved Floating-point Accuracy," *UK ECE Technical Report #ECE–2006-10-14*, October 2006.
- 3. T. J. Dekker, "A Floating-point Technique for extending the available precision," *Numerische Mathematik*, vol. 18, no. 3, June 1971.
- 4. D. H. Bailey, High-precision Software Directory. http://crd.lbl.gov/~dhbailey/mpdist/
- 5. "Basic requirements for a future floating-point arithmetic standard," *GAMM Fachausschuss on Computer Arithmetic and Scientific Computing*. http://www.math.uni-wuppertal.de/~xsc/gamm-fa/BasicRequ.pdf
- 6. Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick, "The Potential of the Cell processor for scientific computing," *Proceedings of the 3rd conference on Computing frontiers*, pp. 9-20, 2006.
- 7. Guillaume Da Graça, David Defour, "Implementation of float-float operators on graphics hardware," http://hal.archives-ouvertes.fr/ccsd-00021443/
- 8. D. Goldberg. "What every computer scientist should know about floating-point arithmetic". *ACM Computing Surveys*, vol. 23, no. 1, Mar 1991.
- 9. D. H. Bailey, H. Simon, J. Barton, and M. Fouts. "Floating-point arithmetic in future supercomputers". *Int. J. Supercomput. Appl. High Perform. Eng.* vol. 3, no. 3, pp. 86-90, 1989.
- 10. D. H. Bailey, "High-precision floating-point arithmetic in scientific computation". http://crd.lbl.gov/~dhbailey/dhbpapers/high-prec-arith.pdf
- 11. Bruce Greer, John Harrison, Greg Henry, Wei Li and Peter Tang, "Scientific computing on the Itanium processor."
- 12. Julie Langou, Piotr Luszczek, Alfredo Buttari, Julien Langou, Jakub Kurzak and Jack Dongarra, "Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy". http://icl.cs.utk.edu/projectsfiles/iter-ref/files/iter-ref.pdf
- 13. W. Kahan, "On the cost of floating-point computation without extra-precise arithmetic," http://www.cs.berkeley.edu/~wkahan/Qdrtcs.pdf, Nov 2004.

- 14. Tech report, W. R. Dieter and H. G. Dietz, "Horseshoes & Hand Grenades". http://aggregate.org/WHITE/sc06accprec.pdf, November 2006.
- 15. Poster Presentation, Andrew Thall, "Extended-precision floating-point numbers for GPU computation," ACM SIGGRAPH, 2006. http://delivery.acm.org/10.1145/1180000/1179682/p52thall.pdf?key1=1179682&key2=1986636021&coll=GUIDE&dl=GUIDE&CFID= 60732742&CFTOKEN=20419339
- 16. W. Kahan, "Why do we need a floating-point arithmetic standard?" February, 1981. http://www.cs.berkeley.edu/~wkahan/ieee754status/why-ieee.pdf
- 17. Nicholas J. Higham, "Accuracy and stability of numerical algorithms", Society for Industrial and Applied Mathematics, 2002.
- 18. IEEE 754 Standard for floating-point arithmetic, ANSI/IEEE Std 754-1985 Vol, Issue, 12 Aug 1985.
- 19. Jidan Al-Eryani, Floating point unit. http://www.opencores.org.uk/projects.cgi/web/fpu100/, August 2006.
- 20. IBM 700/7000 series. Wikipedia Document. http://en.wikipedia.org/wiki/IBM\_705#Data\_formats
- 21. Konrad Zuse, Z3 Computer, http://en.wikipedia.org/wiki/Z3
- 22. Konrad Zuse, Z4 Computer, http://irb.cs.tu-berlin.de/~zuse/Konrad_Zuse/en/Rechner_Z4.html.
- 23. William R. Dieter, Akil Kaveti, Henry G. Dietz, "Low-Cost Microarchitectural Support for Improved Floating-Point Accuracy," *IEEE Computer Architecture Letters*, Vol. 6, No. 1, 2007.
- 24. Floating-point formats, http://www.quadibloc.com/comp/cp0201.htm.
- 25. North Star Computers Inc., NorthStar Hardware Floating point board FPB-A manual, 25015B, 1977.
- 26. Yozo Hida, Xiaoye S. Li and David H. Bailey, "Algorithms for Quad-Double Precision Floating Point Arithmetic", *ARITH-15*, Oct. 2000.
- 27. D. H. Bailey. "A Fortran-90 based multiprecision system," ACM Transactions on Mathematical Software, vol. 21, no. 4, pp. 379-387, 1995.
- 28. K. Briggs. The doubledouble library, 1998. http://keithbriggs.info/software.html
- 29. Jonathan R. Shewchuk. "Adaptive precision floating-point arithmetic and fast robust geometric predicates". *In Discrete and Computational Geometry*, vol. 18, pp. 305-363, 1997.

# Vita

- Date of Birth: 6th Feb, 1984.
- Place of Birth: Hyderabad, Andhra Pradesh, India.
- Bachelor of Engineering, M.V.S.R Engineering College (Affiliated to Osmania University), Nadergul, R.R. District, A.P, India.
- Publications:
  - William R. Dieter, Akil Kaveti, Henry G. Dietz, Low-Cost Microarchitectural Support for Improved Floating-Point Accuracy, IEEE Computer Architecture Letters, Vol. 6, No. 1, 2007.
  - Akil Kaveti, N. Lakshmi, G. K. Sandeep, Implementation of Rijndael-AES Cryptoprocessor in FPGA, IETE Journal, 2005.
- Name: Akil Kaveti.