next up previous
Next: Conclusion Up: Analysis of Bernstein's Factorization Previous: Operation count, equipment cost,


Hardware for the matrix step for 1024-bit moduli

In this section we extrapolate current factoring knowledge to come up with reasonable estimates for the sizes of the matrix $ A$ that would have to be processed for the factorization of a 1024-bit composite when using ordinary relation collection (cf. 2.6), and using slower relation collection according to matrix exponent $ 5/2$ as used in circuit-NFS. For the latter (smaller sized) matrix we consider how expensive it would be to build the mesh-sorting-based matrix-by-vector multiplication circuit proposed in [1] using custom-built hardware and we estimate how much time the matrix step would take on the resulting device. We then propose an alternative mesh-based matrix-by-vector multiplication circuit and estimate its performance for both matrices, for custom-built and off-the-shelf hardware. Throughout this section we are interested mainly in assessing feasibility, for the purpose of evaluating the security implications. Our assumptions will be somewhat optimistic, but we believe that the designs are fundamentally sound and give realistic indications of feasibility using technology that is available in the present or in the near future.

5.1. Matrix sizes.

For the factorization of RSA-512 the matrix had about 6.7 million columns and average column density about 63 [3]. There is no doubt that this matrix is considerably smaller than a matrix that would have resulted from ordinary relation collection as defined in 2.6, cf. Remark 3.7. Nevertheless, we make the optimistic assumption that this is the size that would result from ordinary relation collection. Combining this figure with the $ L(2/3^{2/3})$ matrix size growth rate (cf. 2.6) we find

$\displaystyle 6 700 000\cdot\frac{L_{2^{1024}}[1/3,2/3^{2/3}]}{L_{2^{512}}[1/3,2/3^{2/3}]}\approx1.8\cdot10^{10}$

(cf. 2.1). Including the effect of the $ o(1)$ it is estimated that an optimal 1024-bit matrix would contain about $ 10^{10}$ columns. We optimistically assume an average column density of about $ 100$. We refer to this matrix as the ``large'' matrix. Correcting this matrix size for the $ L((5/3)^{1/3}(2/3))$ matrix size growth rate for matrix exponent $ 5/2$ (cf. 2.6) we find

$\displaystyle 1.8\cdot10^{10}\cdot\frac{L_{2^{1024}}[1/3,(5/3)^{1/3}(2/3)]}{L_{2^{1024}}[1/3,2/3^{2/3}]}\approx8.7\cdot10^7.$

We arrive at an estimate of about $ 4\cdot10^7$ columns for the circuit-NFS 1024-bit matrix. We again, optimistically, assume that the average column density is about $ 100$. We refer to this matrix as the ``small'' matrix.
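These extrapolations are easy to check numerically. The following sketch (ours, not part of the original analysis) evaluates the $L$-ratios with the $o(1)$ term omitted; the RSA-512 figure of 6.7 million columns is taken from the text:

```python
import math

def L(bits, c):
    """L_x[1/3, c] = exp(c * (ln x)^(1/3) * (ln ln x)^(2/3)) for x = 2^bits,
    with the o(1) term omitted."""
    lnx = bits * math.log(2)
    return math.exp(c * lnx ** (1 / 3) * math.log(lnx) ** (2 / 3))

c_ordinary = 2 / 3 ** (2 / 3)              # matrix-size exponent, ordinary NFS
c_circuit = (5 / 3) ** (1 / 3) * (2 / 3)   # exponent for matrix exponent 5/2

# Scale the 6.7M-column RSA-512 matrix up to 1024 bits ("large" matrix).
large = 6_700_000 * L(1024, c_ordinary) / L(512, c_ordinary)

# Correct for the slower matrix-size growth of circuit-NFS ("small" matrix).
small = large * L(1024, c_circuit) / L(1024, c_ordinary)
```

The computed values agree with the $1.8\cdot10^{10}$ and $8.7\cdot10^7$ figures above to within rounding.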

5.2. Estimated relation collection cost.

Relation collection for RSA-512 could have been done in about 8 years on a 1GHz PC [3]. Since

$\displaystyle 8\cdot\frac{L_{2^{1024}}[1/3,4/3^{2/3}]}{L_{2^{512}}[1/3,4/3^{2/3}]}\approx6\cdot10^7$

we estimate that generating the large matrix would require about a year on about 30 million 1GHz PCs with large memories (or more PC-time but less memory when using alternative smoothness tests; keep in mind, though, that it may be possible to achieve the same operation count using different hardware, as rightly noted in [1] and speculated in [12, 2.4.7]). With

$\displaystyle \frac{L_{2^{1024}}[1/3,(5/3)^{4/3}]}{L_{2^{1024}}[1/3,4/3^{2/3}]}\approx5$

it follows that generating the smaller matrix would require about 5 times the above effort. Neither computation is infeasible. But, it can be argued that 1024-bit RSA moduli provide a reasonable level of security just based on the operation count of the relation collection step.
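Both ratios can be checked the same way (a sketch with the $o(1)$ omitted; the 8 PC-years figure for RSA-512 is taken from the text):

```python
import math

def L(bits, c):
    # L_x[1/3, c] for x = 2^bits, o(1) omitted throughout.
    lnx = bits * math.log(2)
    return math.exp(c * lnx ** (1 / 3) * math.log(lnx) ** (2 / 3))

# Ordinary relation collection: scale RSA-512's ~8 years on a 1GHz PC.
pc_years = 8 * L(1024, 4 / 3 ** (2 / 3)) / L(512, 4 / 3 ** (2 / 3))

# Relation collection tuned for matrix exponent 5/2 costs ~5 times more.
slowdown = L(1024, (5 / 3) ** (4 / 3)) / L(1024, 4 / 3 ** (2 / 3))
```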

5.3. Processing the ``small'' matrix using Bernstein's circuits.

We estimate the size of the circuit required to implement the mesh circuit of [1] when the NFS parameters are optimized for the throughput cost function and 1024-bit composites. We then derive a rough prediction of the associated costs when the mesh is implemented by custom hardware using current VLSI technology. In this subsection we use the circuit exactly as described in [1]; the next subsections will make several improvements, including those listed as future plans in [1]. In [1], the algorithm used for finding dependencies among the columns of $ A$ is Wiedemann's original algorithm [18], which is a special case of block Wiedemann with blocking factor $ K$=1 (cf. 2.5). In the first stage (inner product computation), we are given the sparse $ D \times D$ matrix $ A$ and some pair of vectors $ \vec{u},\vec{v}$ and wish to calculate $ \vec{u}A^k \vec{v}$ for $ k=1,\ldots,2D$. The polynomial evaluation stage is slightly different, but the designs given below can be easily adapted so we will not discuss it explicitly. The mesh consists of $ m\times m$ nodes, where $ m^{2}>w(A)+2D$ (cf. 3.1). By assumption, $ w(A)\approx 4\cdot 10^{9}$ and $ D\approx 4\cdot 10^{7}$ so we may choose $ m=63256$. To execute the sorting-based algorithm, each node consists mainly of 3 registers of $ \left\lceil \log _{2}(4\cdot 10^{7})\right\rceil =26$ bits each, a 26-bit compare-exchange element (in at least half of the nodes), and some logic for tracking the current stage of the algorithm. Input, namely the nonzero elements of $ A$ and the initial vector $ \vec{v}$, is loaded just once so this can be done serially. The mesh computes the vectors $ A^k\vec{v}$ by repeated matrix-by-vector multiplication, and following each such multiplication it calculates the inner product $ \vec{u}(A^k\vec{v})$ and outputs this single bit. In standard CMOS VLSI design, a single-bit register (i.e., a D-type edge-triggered flip-flop) requires about 8 transistors, which amounts to 624 transistors per node. 
To account for the logic and additional overheads such as a clock distribution network, we shall assume an average of 2000 transistors per node for a total of $ 8.0\cdot 10^{12}$ transistors in the mesh. As a representative of current technology available on large scale we consider Intel's latest Pentium processor, the Pentium 4 ``Northwood'' (0.13 $ \mu \mathrm{m}$ feature size process). A single Northwood chip (inclusive of its on-board L2 cache) contains $ 5.5\cdot 10^{7}$ transistors, and can be manufactured in dies of size $ 131\,\mathrm{mm}^{2}$ on wafers of diameter 300mm, i.e., about 530 chips per wafer when disregarding defects. The 1.6GHz variant is currently sold at $140 in retail channels. By transistor count, the complete mesh would require about $ \left(8.0\cdot 10^{12}\right)/\left(5.5\cdot 10^{7}\right)\approx 145 500$ Northwood-sized dies, or about $ 273$ wafers. Using the above per-chip price figure naively, the construction cost is about $20M. Alternatively, assuming a wafer cost of about $5,000 we get a construction cost of roughly $1.4M, and the initial costs (e.g., mask creation) are under $1M. The matter of inter-chip communication is problematic. The mesh as a whole needs very few external lines (serial input, 1-bit output, clock, and power). However, a chip consisting of $ s\times s$ nodes has $ 4s-4$ nodes on its edges, and each of these needs two 26-bit bidirectional links with its neighbor on an adjacent chip, for a total of about $ 2 \cdot 2\cdot 26\cdot 4s=416s$ connections. Moreover, such connections typically do not support the full 1GHz clock rate, so to achieve the necessary bandwidth we will need about 4 times as many connections: $ 1664s$. While standard wiring technology cannot provide such enormous density, the following scheme seems plausible. Emerging ``flip-chip'' technologies allow direct connections between chips that are placed face-to-face, at a density of 277 connections per mm$ ^2$ (i.e., a 60 $ \mu\mathrm{m}$ array pitch). We cut each wafer into the shape of a cross, and arrange the wafers in a two-dimensional grid with the arms of adjacent wafers in full overlap. The central square of each cross-shaped wafer contains mesh nodes, and the arms are dedicated to inter-wafer connections. A simple calculation shows that with the above connection density, if 40% of the (uncut) wafer area is used for mesh nodes then there is sufficient room left for the connection pads and associated circuitry. This disregards the issues of delays (mesh edges that cross wafer boundaries are realized by longer wires and are thus slower than the rest), and of the defects which are bound to occur. To address these, adaptation of the algorithm is needed.
Assuming the algorithmic issues are surmountable, the inter-wafer communication entails a cost increase by a factor of about 3, to $4.1M. According to [1, Section 4], a matrix-by-vector multiplication consists of, essentially, three sort operations on the $ m\times m$ mesh. Each sort operation takes $ 8m$ steps, where each step consists of a compare-exchange operation between 26-bit registers of adjacent nodes. Thus, multiplication requires $ 3\cdot 8m\approx 1.52\cdot 10^{6}$ steps. Assuming that each step takes a single clock cycle at a 1GHz clock rate, we get a throughput of $ 659$ multiplications per second. Basically, Wiedemann's algorithm requires $ 3D$ multiplications. Alas, the use of blocking factor $ K=1$ entails some additional costs. First, the number of multiplications roughly doubles due to the possibility of failure (cf. 2.5). Moreover, the algorithm will yield a single vector from the kernel of $ A$, whereas the Number Field Sieve requires several linearly independent kernel elements: half of these yield a trivial congruence (cf. 2.2), and moreover certain NFS optimizations necessitate discarding most of the vectors. In RSA-512, a total of about 10 kernel vectors were needed. Fortunately, getting additional vectors is likely to be cheaper than getting the first one (this is implicit in [18, Algorithm 1]). Overall, we expect the number of multiplications to be roughly $ 2 \cdot \frac{10}{3} \cdot 3D = 20D$. Thus, the expected total running time is roughly $ 20 \cdot 4\cdot 10^{7}/659 \approx 1 210 000$ seconds, or 14 days. The throughput cost is thus $ 5.10 \cdot 10^{12}\;\mathrm{\$}\times\mathrm{sec}$. If we increase the blocking factor from 1 to over 32 and handle the multiplication chains sequentially on a single mesh, then only $ 3D$ multiplications are needed ([1] considers this but claims that it will not change the cost of computation; that is true only up to constant factors). In this case the time decreases to 50 hours, and the throughput cost decreases to $ 7.4 \cdot 10^{11}\;\mathrm{\$}\times\mathrm{sec}$. Heat dissipation (i.e., power consumption) may limit the node density and clock rate of the device, and needs to be analysed. Note however that this limitation is technological rather than theoretical, since in principle the mesh sorting algorithm can be efficiently implemented using reversible gates and arbitrarily low heat dissipation.
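The running-time and throughput-cost arithmetic can be reproduced in a few lines (a sketch; $m$ and the $4.1M construction cost are the values used above, and all figures are approximate):

```python
D = 4 * 10**7          # columns of the "small" matrix (weight ~100 per column)
m = 63256              # mesh side, as chosen in the text
clock = 10**9          # 1 GHz, one compare-exchange step per cycle

steps_per_mult = 3 * 8 * m              # three 8m-step sort operations
mults_per_sec = clock / steps_per_mult  # ~659 multiplications per second
total_mults = 20 * D                    # K=1 overhead: 2 * (10/3) * 3D
runtime = total_mults / mults_per_sec   # seconds; about 14 days
cost = 4.1e6                            # $ incl. inter-wafer communication
throughput_cost = cost * runtime        # ~5e12 $ x sec
```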

5.4. A routing-based circuit.

The above analysis refers to the mesh circuit described in [1], which relies on the novel use of mesh sorting for matrix-by-vector multiplication. We now present an alternative design, based on mesh routing. This design performs a single routing operation per multiplication, compared to three sorting operations (where even a single sorting operation is slower than routing). The resulting design has a reduced cost, improved fault tolerance and very simple local control. Moreover, its inherent flexibility allows further improvements, as discussed in the next section. The basic design is as follows. For simplicity assume that each of the $ D$ columns of the matrix has weight exactly $ h$ (here $ h=100$), and that the nonzero elements of $ A$ are uniformly distributed (both assumptions can be easily relaxed). Let $ m=\sqrt{D\cdot h}$. We divide the $ m\times m$ mesh into $ D$ blocks of size $ \sqrt{h}\times \sqrt{h}$. Let $ S_{i}$ denote the $ i$-th block in row-major order ( $ i\in \{1,\ldots,D\}$), and let $ t_{i}$ denote the node in the upper left corner of $ S_{i}$. We say that $ t_{i}$ is the target of the value $ i$. Each node holds two $ \log _{2}D$-bit values, $ Q[i]$ and $ R[i]$. Each target node $ t_{i}$ also contains a single-bit value $ P[i]$. For repeated multiplication of $ A$ and $ \vec{v}$, the mesh is initialized as follows: the $ i$-th entry of $ \vec{v}$ is loaded into $ P[i]$, and the row indices of the nonzero elements in column $ i\in \{1,\ldots,D\}$ of $ A$ are stored (in arbitrary order) in the $ Q[\cdot ]$ of the nodes in $ S_{i}$. Each multiplication is performed thus:
  1. For all $ i$, broadcast the value of $ P[i]$ from $ t_{i}$ to the rest of the nodes in $ S_{i}$ (this can be accomplished in $ 2\sqrt{h}-2$ steps).
  2. For all $ i$ and every node $ j$ in $ S_{i}$: if $ P[i]=1$ then $ R[j]\leftarrow Q[j]$, else $ R[j]\leftarrow \mathrm{nil}$ (where $ \mathrm{nil}$ is some distinguished value outside $ \{1,\ldots ,D\}$).
  3. $ P[i]\leftarrow 0$ for all $ i$.
  4. Invoke a mesh-based packet routing algorithm on the $ R[\cdot ]$, such that each non- $ \mathrm{nil}$ value $ R[j]$ is routed to its target node $ t_{R[j]}$. Each time a value $ i$ arrives at its target $ t_{i}$, discard it and flip $ P[i]$.
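To see that these four steps indeed compute a matrix-by-vector product over GF(2), one can simulate them with the mesh routing abstracted away: step 4 is modeled by delivering every non-nil $R$ value directly to its target, which is what the routing network accomplishes. Dimensions here are toy values of our choosing:

```python
import random

random.seed(0)
D, h = 16, 4  # toy dimensions: D columns, exactly h nonzero entries per column

# Column i of A is stored as the h row indices held by block S_i (the Q values).
A_cols = [random.sample(range(D), h) for _ in range(D)]
v = [random.randrange(2) for _ in range(D)]

P = list(v)               # initialization: v loaded into the P registers
# Steps 1-2: where P[i] = 1, the nodes of S_i load their Q values into R.
R = [r for i in range(D) if P[i] == 1 for r in A_cols[i]]
P = [0] * D               # step 3
for r in R:               # step 4: each arrival at target t_r flips P[r]
    P[r] ^= 1

# Reference: the product A*v over GF(2), computed directly.
expected = [sum(v[c] for c in range(D) if r in A_cols[c]) % 2 for r in range(D)]
```

Each bit $P[r]$ is flipped once per nonzero $A_{r,c}$ with $v_c=1$, i.e., it accumulates exactly the GF(2) inner product of row $r$ with $\vec{v}$.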
After these steps, $ P[\cdot]$ contain the result of the multiplication, and the mesh is ready for the next multiplication. As before, in the inner product computation stage of the Wiedemann algorithm, we need only compute $ \vec{u}A^k \vec{v}$ for some vector $ \vec{u}$, so we load the $ i$-th coordinate of $ \vec{u}$ into node $ t_i$ during initialization, and compute the single-bit result $ \vec{u}A^k \vec{v}$ inside the mesh during the next multiplication. There remains the choice of a routing algorithm. Many candidates exist (see [7] for a survey). To minimize hardware cost, we restrict our attention to algorithms for the ``one packet'' model, in which at each step every node holds at most one packet (and consequently each node can send at most one packet and receive at most one packet per step). Note that this rules out most known algorithms, including those for the well-studied ``hot-potato'' routing model which provides a register for every edge. Since we do binary multiplication, the routing problem has the following unusual property: pairwise packet annihilation is allowed. That is, pairs of packets with identical values may be ``cancelled out'' without affecting the result of the computation. This relaxation can greatly reduce the congestion caused by multiple packets converging to a common destination. Indeed, this seems to render commonly-cited lower bounds inapplicable, and we are not aware of any discussion of this variant in the literature. While known routing and sorting algorithms can be adapted to our task, we suggest a new routing algorithm that seems optimal, based on our empirical tests. The algorithm, which we call clockwise transposition routing, has an exceptionally simple control structure which consists of repeating 4 steps.
Each step involves compare-exchange operations on pairs of neighboring nodes, such that the exchange is performed iff it reduces the distance-to-target of the non- $ \mathrm{nil}$ value (out of at most 2) that is farthest from its target along the relevant direction. This boils down to comparison of the target row indices (for vertically adjacent nodes) or target column indices (for horizontally adjacent nodes). For instance, for horizontally adjacent nodes $ i,i+1$ such that $ t_{R[i]}$ resides on column $ c_{i}$ and $ t_{R[i+1]}$ resides on column $ c_{i+1}$, an exchange of $ i$ and $ i+1$ will be done iff $ c_{i}>c_{i+1}$. To this we add annihilation: if $ R[i]=R[i+1]$ then both are replaced by $ \mathrm{nil}$. The first step of clockwise transposition routing consists of compare-exchange between each node residing on an odd row with the node above it (if any). The second step consists of compare-exchange between each node residing on an odd column with the node to its right (if any). The third and fourth steps are similar to the first and second respectively, except that they involve the neighbors in the opposite direction. It is easily seen that each node simply performs compare-exchanges with its four neighbors in either clockwise or counterclockwise order. We do not yet have a theoretical analysis of this algorithm. However, we have simulated it on numerous inputs of sizes up to $ 13 000\times 13 000$ with random inputs drawn from a distribution mimicking that of the above mesh, as well as the simple distribution that puts a random value in every node. In all runs (except for very small meshes), we have not observed even a single case where the running time exceeded $ 2m$ steps. This is just two steps from the trivial lower bound $ 2m-2$. Our algorithm is a generalization of odd-even transposition sort, with a schedule that is identical to the ``2D-bubblesort'' algorithm of [8] but with different compare-exchange elements. 
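For concreteness, a small software model of clockwise transposition routing is sketched below. The exact parity and indexing conventions are our interpretation of the description above; the nil-handling rule follows the stated criterion of reducing the distance-to-target along the relevant direction:

```python
import random

def route(m, packets, max_rounds=100):
    """Clockwise transposition routing on an m x m mesh (one-packet model).
    packets maps (row, col) -> target (row, col), at most one per node.
    Returns the number of 4-step rounds until all packets are delivered,
    or None if max_rounds is exceeded."""
    grid = [[packets.get((r, c)) for c in range(m)] for r in range(m)]

    def deliver():
        for r in range(m):
            for c in range(m):
                if grid[r][c] == (r, c):   # packet reached its target node:
                    grid[r][c] = None      # discard it (the mesh would flip P)

    def exchange(a, b, axis, lo):
        # a sits at coordinate lo, b at lo + 1 along `axis` (0=row, 1=col).
        # Exchange iff it reduces the distance-to-target along this axis.
        if a is None and b is None:
            return a, b
        if a == b:                         # pairwise packet annihilation
            return None, None
        if a is None:
            return (b, a) if b[axis] <= lo else (a, b)
        if b is None:
            return (b, a) if a[axis] >= lo + 1 else (a, b)
        return (b, a) if a[axis] > b[axis] else (a, b)

    deliver()
    for rnd in range(max_rounds + 1):
        if all(x is None for row in grid for x in row):
            return rnd
        for phase in range(4):             # the repeating 4-step schedule
            vertical, off = (phase % 2 == 0), phase // 2
            for i in range(off, m - 1, 2):
                for j in range(m):
                    if vertical:
                        grid[i][j], grid[i + 1][j] = exchange(
                            grid[i][j], grid[i + 1][j], 0, i)
                    else:
                        grid[j][i], grid[j][i + 1] = exchange(
                            grid[j][i], grid[j][i + 1], 1, i)
            deliver()
    return None

# Route a full random permutation (every node holds one packet).
m = 10
random.seed(1)
cells = [(r, c) for r in range(m) for c in range(m)]
dests = list(cells)
random.shuffle(dests)
rounds = route(m, dict(zip(cells, dests)))
```

The generous `max_rounds` merely guards the simulation; the behavior reported above suggests completion in roughly $2m$ steps, i.e., very few rounds.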
The change from sorting to routing is indeed quite beneficial, as [8] shows that 2D-bubblesort is considerably slower than the observed performance of our clockwise transposition routing. The new algorithm appears to be much faster than the $ 8m$ sorting algorithm (due to Schimmler) used in [1], and its local control is very simple compared to the complicated recursive algorithms that achieve the $ 3m$-step lower bound on mesh sorting (cf. [16]). A physical realization of the mesh will contain many local faults (especially for devices that are wafer-scale or larger, as discussed below). In the routing-based mesh, we can handle local defects by algorithmic means as follows. Each node shall contain 4 additional state bits, indicating whether each of its 4 neighbors is ``disabled''. These bits are loaded during device initialization, after mapping out the defects. The compare-exchange logic is augmented such that if node $ i$ has a ``disabled'' neighbor in direction $ \Delta$ then $ i$ never performs an exchange in that direction, but always performs the exchange in the two directions orthogonal to $ \Delta$. This allows us to ``close off'' arbitrary rectangular regions of the mesh, such that values that reach a ``closed-off'' region from outside are routed along its perimeter. We add a few spare nodes to the mesh, and manipulate the mesh inputs such that the spares effectively replace the nodes in the closed-off regions. We conjecture that the local disturbance caused by a few small closed-off regions will not have a significant effect on the routing performance. Going back to the cost evaluation, we see that replacing the sorting-based mesh with a routing-based mesh reduces time by a factor of $ 3\cdot 8/2=12$. Also, note that the $ Q[\cdot ]$ values are used just once per multiplication, and can thus be stored in slower DRAM cells in the vicinity of the node.
DRAM cells are much smaller than edge-triggered flip-flops, since they require only one transistor and one capacitor per bit. Moreover, the regular structure of DRAM banks allows for very dense packing. Using large banks of embedded DRAM (which are shared by many nodes in their vicinity), the amortized chip area per DRAM bit is about $ 0.7 \mu \mathrm{m}^2$. Our Northwood-based estimates lead to $ 2.38 \mu \mathrm{m}^2$ per transistor, so we surmise that for our purposes a DRAM bit costs $ 1/3.4$ as much as a logic transistor, or about $ 1/27$ as much as a flip-flop. For simplicity, we ignore the circuitry needed to retrieve the values from DRAM -- this can be done cheaply by temporarily wiring chains of adjacent $ R[\cdot ]$ into shift registers. In terms of circuit size, we effectively eliminate two of the three large registers per node, and some associated logic, so the routing-based mesh is about 3 times cheaper to manufacture. Overall, we gain a reduction of a factor $ 3 \cdot 12=36$ in the throughput cost.

5.5. An improved routing-based circuit.

We now tweak the routing-based circuit design to gain additional cost reductions. Compared to the sorting-based design (cf. 5.3), these will yield a (constant-factor) improvement by several orders of magnitude. While asymptotically insignificant, this suggests a very practical device for the NFS matrix step of 1024-bit moduli. Moreover, it shows that already for 1024-bit moduli, the cost of parallelization can be negligible compared to the cost of the RAM needed to store the input, and thus the speed advantage is gained essentially for free. The first improvement follows from increasing the density of targets. Let $ \rho $ denote the average number of $ P[\cdot]$ registers per node. In the above scheme, $ \rho =h^{-1}\approx 1/100$. The total number of $ P[\cdot]$ registers is fixed at $ D$, so if we increase $ \rho $ the number of mesh nodes decreases by a factor of $ h\rho$. However, we no longer have enough mesh nodes to route all the $ hD$ nonzero entries of $ A$ simultaneously. We address this by partially serializing the routing process, as follows. Instead of storing one matrix entry $ Q[\cdot ]$ per node, we store $ h\rho$ such values per node: for $ \rho \geq 1$, each node $ j$ is ``in charge'' of a set of $ \rho $ matrix columns $ C_j=\{c_{j,1},\ldots,c_{j,\rho}\}$, in the sense that node $ j$ contains the registers $ P[c_{j,1}],\ldots,P[c_{j,\rho}]$, and the nonzero elements of $ A$ in columns $ c_{j,1},\ldots,c_{j,\rho}$. To carry out a multiplication we perform $ h\rho$ iterations, where each iteration consists of retrieving the next such nonzero element (or skipping it, depending on the result of the previous multiplication) and then performing clockwise transposition routing as before. The second improvement follows from using block Wiedemann with a blocking factor $ K>1$ (cf. 2.5). Besides reducing the number of multiplications by a factor of roughly $ \frac{20}{3}$ (cf. 5.3), this produces an opportunity for reducing the cost of multiplication, as follows.
Recall that in block Wiedemann, we need to perform $ K$ multiplication chains of the form $ A^k\vec{v}_i$, for $ i=1,\ldots,K$ and $ k=1,\ldots,2D/K$, and later again, for $ k=1,\ldots,D/K$. The idea is to perform several chains in parallel on a single mesh, reusing most resources (in particular, the storage taken by $ A$). For simplicity, we will consider handling all $ K$ chains on one mesh. In the routing-based circuits described so far, each node emitted at most one message per routing operation -- a matrix row index, which implies the address of the target cell. The information content of this message (or its absence) is a single bit. Consider enlarging this message: $ \log_2(D)$ bits for the row index, plus $ K$ bits of ``payload'', one bit per multiplication chain. Combining the two generalizations gives the following algorithm, for $ \rho\geq1$ and integer $ K\geq1$. The case $ 0<\rho<1$ requires distributing the entries of each matrix column among several mesh nodes, as in 5.4, but its cost is similar. Let $ \{C_j\}_{j \in \{1,\ldots,D/\rho\}}$ be a partition of $ \{1,\ldots ,D\}$, $ C_j=\{c : (j-1)\rho \leq c-1 < j\rho \}$. Each node $ j \in \{1,\ldots,D/\rho\}$ contains single-bit registers $ P_i[c]$ and $ P'_i[c]$ for all $ i=1,\ldots,K$ and $ c \in C_j$, and a register $ R_j$ of size $ \log_2(D)+K$. Node $ j$ also contains a list $ Q_j=\{(r,c) \mid A_{r,c}=1, c \in C_j\}$ of the nonzero matrix entries in the columns $ C_j$ of $ A$, and an index $ I_j$ into $ Q_j$. Initially, load the vectors $ \vec{v}_i$ into the $ P_i[\cdot]$ registers. Each multiplication is then performed thus:
  1. For all $ i$ and $ c$, $ P'_i[c]\leftarrow0$. For all $ j$, $ I_j\leftarrow1$.
  2. Repeat $ h\rho$ times:
    1. For all $ j$: $ (r,c)\leftarrow Q_j[I_j]$, $ I_j\leftarrow I_j+1$, $ R[j] \leftarrow \big< r, P_1[c],\ldots,P_K[c]\big>$.
    2. Invoke the clockwise transposition routing algorithm on the $ R[\cdot ]$, such that each value $ R[j]=\left<r,\ldots\right>$ is routed to the node $ j'$ for which $ r \in C_{j'}$.
      During routing, whenever a node $ j$ receives a message $ \left<r, p_1,\ldots,p_K\right>$ such that $ r \in C_j$, it sets $ P'_i[r] \leftarrow P'_i[r] \oplus p_i$ for $ i=1,\ldots,K$ and discards the message. Moreover, whenever packets $ \left<r, p_1,\ldots,p_K\right>$ and $ \left<r, p'_1,\ldots,p'_K\right>$ in adjacent nodes are compared, they are combined: one is annihilated and the other is replaced by $ \left<r, p_1 \oplus p'_1,\ldots,p_K \oplus p'_K\right>$.
  3. $ P_i[c] \leftarrow P'_i[c]$ for all $ i$ and $ c$.
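The role of the XOR-combined payloads can be checked with a routing-free model: messages addressed to the same row index are combined exactly as in the routing step above (one annihilated, the other carrying the XOR of the payloads), and the accumulated payloads must equal the $K$ products $A\vec{v}_i$ over GF(2). Dimensions are toy values of our choosing:

```python
import random

random.seed(2)
D, h, K = 16, 4, 8   # toy sizes: D columns, h nonzeros per column, K chains

A_cols = [random.sample(range(D), h) for _ in range(D)]  # row indices, col c
V = [[random.randrange(2) for _ in range(D)] for _ in range(K)]  # v_1..v_K

# Every nonzero entry (r, c) emits <r, P_1[c], ..., P_K[c]>; en route,
# messages with equal row index r are XOR-combined, so only the XOR of
# all payloads destined for r reaches its target node.
acc = {}
for c in range(D):
    for r in A_cols[c]:
        payload = [V[i][c] for i in range(K)]
        acc[r] = [x ^ y for x, y in zip(acc.get(r, [0] * K), payload)]

result = [[acc.get(r, [0] * K)[i] for r in range(D)] for i in range(K)]

# Reference: the K products A*v_i over GF(2), computed directly.
expected = [[sum(V[i][c] for c in range(D) if r in A_cols[c]) % 2
             for r in range(D)] for i in range(K)]
```

Since XOR is associative and commutative, it does not matter in which order (or in which combinations) the messages for a given row index merge on the mesh; the final $P'_i[r]$ are the same.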
After these steps, $ P_i[\cdot]$ contain the bits of $ A^k\vec{v}_i$ and the mesh is ready for the next multiplication. We need to compute and output the inner products $ \vec{u}_j(A^k \vec{v}_i)$ for some vectors $ \vec{u}_1,\ldots,\vec{u}_K$, and this computation should be completed before the next multiplication is done. In general, this seems to require $ \Theta(K^2)$ additional wires between neighboring mesh nodes and additional registers. However, usually the $ \vec{u}_j$ are chosen to have weight 1 or 2, so the cost of computing these inner products can be kept very low. Also, note that the number of routed messages is now doubled, because previously only half the nodes sent non- $ \mathrm{nil}$ messages. However, empirically it appears that the clockwise transposition routing algorithm handles the full load without any slowdown. It remains to determine the optimal values of $ K$ and $ \rho $. This involves implementation details and technological quirks, and obtaining precise figures appears rather hard. We thus derive expressions for the various cost measures, based on parameters which can characterize a wide range of implementations. We then substitute values that reasonably represent today's technology, and optimize for these. We consider three concrete implementations: custom-produced ``logic'' wafers (as used in 5.3, with which we maintain consistency), custom-produced ``DRAM'' wafers (which reduce the size of DRAM cells at the expense of size and speed of logic transistors), and an FPGA-based design using off-the-shelf parts (cf. Appendix A). Rough estimates of the respective parameters are given in Table 2.

Table 2: Implementation hardware parameters
[Table body not reproduced: the tabular environment is garbled in this extraction. It lists, for each of the three implementations, hardware parameters such as area figures (in $\mu\mathrm{m}^2$) and clock period (e.g., $1.43\cdot10^{-9}$ sec).]

$ \emptyset$ marks values that are inapplicable, and taken to be zero.

The cost of the matrix step is derived from these parameters with some additional approximations. Table 3 lists the cost of the improved routing-based circuit for several choices of $ \rho $ and $ K$, according to the above. It also lists the cost of the sorting-based circuits (cf. 5.3) and the PC implementation of Appendix B. The lines marked by ``(opt)'' give the parameter choices that minimize the throughput cost for each type of hardware. The second line describes a routing-based design whose throughput cost is roughly $ 45 000$ times lower than that of the original sorting-based circuit (or $ 6 700$ times lower than sorting with $ K \gg 1$). Notably, this is a single-wafer device, which completely solves the technological problem of connecting multiple wafers with millions of parallel wires, as necessary in the original design of [1]. The third line shows that significant parallelism can be gained essentially for free: here, 88% of the wafer area is occupied simply by the DRAM banks needed to store the input matrix, so further reduction in construction cost seems impossible.

Table 3: Cost of the matrix step for the ``small'' matrix
[Table body not reproduced: the tabular environment is garbled in this extraction. One recoverable row lists a running time of $2 290 000$ seconds (27 days) at a throughput cost of $5.52\cdot 10^{10}$ $\mathrm{\$}\times\mathrm{sec}$.]

5.6. An improved circuit for the ``large'' matrix.

The large matrix resulting from ordinary relation collection contains $ 250$ times more columns: $ D\approx10^{10}$. We assume that the average column density remains $ h=100$. It is no longer possible to fit the device on a single wafer, so the feasibility of the mesh design now depends critically on the ability to make high bandwidth inter-wafer connections (cf. 5.3). Using the formulas given in the previous section, we obtain the costs in Table 4 for the custom and FPGA implementations, for various parameter choices. The third line shows that here too, significant parallelism can be attained at very little cost (88% of the wafer area is occupied by DRAM storing the input). As can be seen, the improved mesh is quite feasible also for the large matrix, and its cost is a small fraction of the cost of the alternatives, and of relation collection.

Table 4: Cost of the matrix step for the ``large'' matrix
[Table body not reproduced: the tabular environment is garbled in this extraction. One recoverable row lists a running time of $3.17\cdot10^8$ seconds (10 years) at a throughput cost of $4.84\cdot 10^{16}$ $\mathrm{\$}\times\mathrm{sec}$.]

5.7. Summary of hardware findings.

The improved design of 5.5 and 5.6, when implemented using custom hardware, appears feasible for both matrix sizes. Moreover, it is very attractive when compared to the traditional serial implementations (though appropriate parallelization techniques partially close this gap; see Appendix B). However, these conclusions are based on numerous assumptions, some quite optimistic. Much more research, and possibly actual relation collection experiments, would have to be carried out to get a clearer grasp of the actual cost (time and money) of both the relation collection and matrix steps for 1024-bit moduli. In light of the above, one may try to improve the overall performance of NFS by re-balancing the relation collection step and the matrix step, i.e., by increasing the smoothness bounds (the opposite of the approach sketched in Remark 3.7). For ordinary NFS, asymptotically this is impossible since the parameters used for ordinary relation collection (i.e., the ``large'' matrix) already minimize the cost of relation collection (cf. 2.6). For improved NFS that is applied to a single factorization (cf. 2.7), if we disregard the cost of the matrix step and optimize just for relation collection then we can expect a cost reduction of about $ L_{2^{1024}}[1/3,1.9018836\cdots] / L_{2^{1024}}[1/3,1.8689328\cdots] \approx 2.8$. If many integers in a large range must be factored -- a reasonable assumption given our interpretation of the throughput cost (cf. 3.2) -- a much faster method exists (cf. [4]). It remains to be studied whether these asymptotic properties indeed hold for 1024-bit moduli, and what the practical implications of the methods from [4] are.
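The factor of about 2.8 follows directly from the two exponents quoted above (a one-line check, with the $o(1)$ omitted):

```python
import math

lnx = 1024 * math.log(2)                           # ln x for x = 2^1024
core = lnx ** (1 / 3) * math.log(lnx) ** (2 / 3)   # (ln x)^(1/3) (ln ln x)^(2/3)

# Ratio of the two L-values; exponents as quoted in the text.
ratio = math.exp((1.9018836 - 1.8689328) * core)
```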