next up previous
Next: Operation count, equipment cost, Up: Analysis of Bernstein's Factorization Previous: Background on the number


The circuits for integer factorization from [1]

3.1. Matrix-by-vector multiplication using mesh sorting.

In [1] an interesting new mesh-sorting-based method is described to compute a matrix-by-vector product. Let $ A$ be the bit matrix from 2.5 with $ D=L(\beta)$ columns and weight $ w(A)=L(\beta)$, and let $ m$ be the least power of $ 2$ such that $ m^2>w(A)+2D$. Thus $ m=L(\beta/2)$. We assume, without loss of generality, that $ A$ is square. A mesh of $ m\times m$ processors, each with $ O(\log D)=L(0)$ memory, initially stores the matrix $ A$ and a not necessarily sparse $ D$-dimensional bit vector $ \vec{v}$. An elegant method is given that computes the product $ A\vec{v}$ using repeated sorting in $ O(m)$ steps, where each step involves a small constant number of simultaneous operations on all $ m\times m$ mesh processors. At the end of the computation $ A\vec{v}$ can easily be extracted from the mesh. Furthermore, the mesh is immediately, without further changes to its state, ready for the computation of the product of $ A$ and the vector $ A\vec{v}$. We use ``circuit-NFS'' to refer to NFS that uses the mesh-sorting-based matrix step.
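The essence of the sorting-based product can be illustrated with a short sequential sketch (ours, not the mesh circuit itself: Python's sort stands in for the $ O(m)$ mesh routing schedule, and the function name and packet representation are assumptions for illustration). Instead of addressing the vector at random per row, one packet per nonzero entry is routed to its destination row by sorting, then XOR-accumulated:

```python
def matvec_gf2_by_sorting(num_rows, entries, v):
    """Compute A*v over GF(2) in the style of the mesh algorithm.

    `entries` lists the nonzero positions (i, j) of the bit matrix A,
    and `v` is a (not necessarily sparse) bit vector.  For every
    nonzero (i, j) emit a packet (destination row i, bit v[j]), sort
    the packets by destination, and XOR-accumulate each run.  On the
    mesh the sorting is what routes the bits; here it is just a
    sequential analogue of that idea.
    """
    packets = [(i, v[j]) for (i, j) in entries]
    packets.sort()                       # route every bit to its destination row
    result = [0] * num_rows
    for i, bit in packets:
        result[i] ^= bit                 # GF(2) accumulation of the sorted run
    return result

# A 3x3 example: nonzero positions of A, and a dense bit vector v.
entries = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 1), (2, 2)]
v = [1, 1, 0]
print(matvec_gf2_by_sorting(3, entries, v))
```

As in the mesh computation, the output vector is in a form that can immediately serve as input to the next multiplication by $ A$.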

3.2. The throughput cost function from [1].

Judging by operation counts, the mesh-based algorithm is not competitive with the traditional way of computing $ A\vec{v}$: as indicated in 2.5, the latter can be done in $ O(w(A))=L(\beta)$ operations. The mesh-based computation takes $ O(m)$ steps on all $ m\times m$ mesh processors simultaneously, resulting in an operation count per matrix-by-vector multiplication of $ O(m^3)=L(3\beta/2)$. Iterating the matrix-by-vector multiplications $ L(\beta)$ times results in a mesh-sorting-based matrix step that requires $ L(5\beta/2)=L(\beta)^{5/2}$ operations, as opposed to just $ L(2\beta)$ for the traditional approach. This explains the non-ordinary relation collection parameters used in [1], which correspond to the analysis given in 2.6 for $ 2\epsilon=5/2$; we comment on this below in 3.3. However, the standard comparison of operation counts overlooks the following fact. The traditional approach requires memory $ O(w(A)+D)=L(\beta)$ for storage of $ A$ and the vector; given that amount of memory, it takes time $ L(2\beta)$. But given the $ m\times m$ mesh, with $ m\times m=L(\beta)$, the mesh-based approach takes time just $ L(3\beta/2)$, because during each unit of time $ L(\beta)$ operations are carried out simultaneously on the mesh. To capture the advantage of ``active small processors'' (as in the mesh) compared to ``inactive memory'' (as in the traditional approach), and the fact that their prices are comparable, it is stipulated in [1] that the cost of factorization is ``the product of the time and the cost of the machine.'' We refer to this cost function as throughput cost, since it can be interpreted as measuring the equipment cost per unit of problem-solving throughput. It is frequently used in VLSI design (where it is known as ``AT cost,'' for Area$ \times$Time), but apparently it was not previously used in the context of computational number theory. 
It appears that throughput cost is indeed appropriate when a large number of problems must be solved during some long period of time while minimizing total expenses. This does not imply that throughput cost is always appropriate for assessing security, as illustrated by the following example. Suppose Carrol wishes to assess the risk of her encryption key being broken by each of two adversaries, Alice and Bob. Carrol knows that Alice has plans for a device that costs $1M and takes 50 years to break a key, and that Bob's device costs $50M and takes 1 year to break a key. In one scenario, each adversary has a $1M budget -- clearly Alice is dangerous and Bob is not. In another scenario, each adversary has a $50M budget. This time both are dangerous, but Bob apparently forms a greater menace because he can break Carrol's key within one year, while Alice still needs 50 years. Thus, the two devices have the same throughput cost, yet either can be more ``dangerous'' than the other, depending on external settings. The key point is that if Alice and Bob have many keys to break within 50 years then indeed their cost-per-key figures are identical, but the time it will take Bob to break Carrol's key depends on her priority in his list of victims, and arguably Carrol should make the paranoid assumption that she is first. In Section 4 we comment further on performance measurement for the NFS.
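The point of the example is that the two devices are indistinguishable under the throughput cost even though their threat profiles differ; a one-line computation (in units of $1M and years) confirms the equality:

```python
# Alice: $1M device, 50 years per key.  Bob: $50M device, 1 year per key.
alice_cost, alice_time = 1, 50
bob_cost, bob_time = 50, 1

# Throughput cost = equipment cost * time; both come out to 50 ($1M * years).
print(alice_cost * alice_time, bob_cost * bob_time)
```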

3.3. Application of the throughput cost.

The time required for all matrix-by-vector multiplications on the mesh is $ L(3\beta/2)$. The equipment cost of the mesh is the cost of $ m^2$ small processors with $ L(0)$ memory per processor, and is thus $ L(\beta)$. The throughput cost, the product of the time and the cost of the equipment, is therefore $ L(5\beta/2)$. The matrix step of standard-NFS requires time $ L(2\beta)$ and equipment cost $ L(\beta)$ for the memory, resulting in a throughput cost of $ L(3\beta)$. Thus, the throughput cost advantage of the mesh-based approach is a factor $ L(\beta/2)$ if the two methods would use the same $ \beta$ (cf. Remark 3.4). The same observation applies if the standard-NFS matrix step is $ K$-fold parallelized, for reasonable $ K$ (cf. 2.5): the time drops by a factor $ K$ which is cancelled (in the throughput cost) by a $ K$ times higher equipment cost because each participating processor needs the same memory $ L(\beta)$. In circuit-NFS (i.e., the mesh) a parallelization factor $ m^2$ is used: the time drops by a factor only $ m$ (not $ m^2$), but the equipment cost stays the same because memory $ L(0)$ suffices for each of the $ m^2$ participating processors. Thus, with the throughput cost circuit-NFS achieves an advantage of $ m=L(\beta/2)$. The mesh itself can of course be $ K$-fold parallelized but the resulting $ K$-fold increase in equipment cost and $ K$-fold drop in time cancel each other in the throughput cost [1, Section 4].
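The exponent bookkeeping of this subsection can be tabulated directly, using the rule $ L(a)\cdot L(b)=L(a+b)$ and disregarding the $ o(1)$'s (the variable names below are ours, and exponents are tracked as multiples of $ \beta$):

```python
from fractions import Fraction as F

beta = F(1)  # track L(.) exponents as exact multiples of beta

# Circuit-NFS matrix step: time L(3*beta/2) on an m x m mesh of cost L(beta).
circuit_time, circuit_equipment = F(3, 2) * beta, beta
# Standard-NFS matrix step: time L(2*beta) using memory of cost L(beta).
standard_time, standard_equipment = 2 * beta, beta

# Throughput cost = time * equipment cost, so the exponents add.
circuit_throughput = circuit_time + circuit_equipment      # 5/2: L(5*beta/2)
standard_throughput = standard_time + standard_equipment   # 3:   L(3*beta)

print(circuit_throughput, standard_throughput)
print(standard_throughput - circuit_throughput)            # advantage L(beta/2)
```

$ K$-fold parallelization leaves both totals unchanged: it subtracts the same amount from the time exponent that it adds to the equipment-cost exponent.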

Remark 3.4

It can be argued that before evaluating an existing algorithm based on a new cost function, the algorithm first should be tuned to the new cost function. This is further commented upon below in 3.5.

3.5. Implication of the throughput cost.

We consider the implication of the matrix step throughput cost of $ L(5\beta/2)$ for circuit-NFS, compared to $ L(3\beta)$ for standard-NFS. In [1] the well-known fact is used that the throughput cost of relation collection is $ L(2\alpha)$ (cf. 2.4): an operation count of $ L(2\alpha)$ on a single processor with $ L(0)$ memory results in time $ L(2\alpha)$, equipment cost $ L(0)$, and throughput cost $ L(2\alpha)$. This can be time-sliced in any way that is convenient, i.e., for any $ K$ use $ K$ processors of $ L(0)$ memory each and spend time $ L(2\alpha)/K$ on all $ K$ processors simultaneously, resulting in the same throughput cost $ L(2\alpha)$. Thus, for relation collection the throughput cost is proportional to the operation count. The analysis of 2.6 applies with $ 2\epsilon=5/2$ and leads to an optimal overall circuit-NFS throughput cost of $ L(1.9760518\cdots)$. As mentioned above and in 3.2, the throughput cost and the operation count are equivalent for both relation collection and the matrix step of circuit-NFS. Thus, as calculated in 2.6, circuit-NFS is, from an operation count point of view, less powerful than standard-NFS, already losing 40 bits in the 500-bit range (disregarding the $ o(1)$'s) when compared to standard-NFS with ordinary parameter choices. This conclusion applies to any NFS implementation, such as many existing ones, where memory requirements are not multiplicatively included in the cost function. But operation count is not the point of view taken in [1]. There, standard-NFS is compared to circuit-NFS in the following way. The parameters for standard-NFS are chosen under the assumption that the throughput cost of relation collection is $ L(3\alpha)$: operation count $ L(2\alpha)$ and memory cost $ L(\alpha)$ for the sieving result in time $ L(2\alpha)/K$ and equipment cost $ K\cdot L(\alpha)$ (for any $ K$-fold parallelization), and thus throughput cost $ L(3\alpha)$. 
This disregards the fact that long before [1] appeared it was known that the use of $ L(\alpha)$ memory per processor may be convenient, in practice and for relatively small numbers, but is by no means required (cf. 2.4). In any case, combined with $ L(3\beta)$ for the throughput cost of the matrix step this leads to $ \alpha=\beta$ (up to the usual $ o(1)$'s), implying that the analysis from 2.6 with $ 2\epsilon=2$ applies, but that the resulting operation count must be raised to the power $ 3/2$. In [1] the improvement from [4] mentioned in 2.7 is used, leading to a throughput cost for standard-NFS of $ L(2.8528254\cdots)$ (where $ 2.8528254\cdots$ is $ 1.5$ times the $ 1.9018836\cdots$ referred to in 2.7). Since $ (2.8528254\cdots/1.9760518\cdots)^3=3.0090581\cdots$, it is suggested in [1] that the number of digits of factorable composites grows by a factor $ 3$ if circuit-NFS is used instead of standard-NFS.
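The constants quoted above are easy to reproduce numerically (digits as quoted in the text, $ o(1)$'s disregarded; the variable names are ours):

```python
from math import isclose

circuit_exp = 1.9760518    # circuit-NFS throughput cost exponent (2.6 with 2*eps = 5/2)
improved_exp = 1.9018836   # operation-count exponent of improved standard-NFS (2.7)

# Raising the operation count to the power 3/2 multiplies the exponent by 1.5.
standard_throughput_exp = 1.5 * improved_exp
print(standard_throughput_exp)                        # 2.8528254

# Cube of the exponent ratio = suggested growth factor in the number of digits.
print((standard_throughput_exp / circuit_exp) ** 3)   # approximately 3.009
```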

3.6. Alternative interpretation.

How does the comparison between circuit-NFS and standard-NFS with respect to their throughput costs turn out if standard-NFS is first properly tuned (Remark 3.4) to the throughput cost function, given the state of the art in, say, 1990 (cf. [10, 4.15]; also the year that [4] originally appeared)? With throughput cost $ L(2\alpha)$ for relation collection (cf. above and 2.4), the analysis from 2.6 with $ 2\epsilon=3$ applies, resulting in a throughput cost of just $ L(2.0800838\cdots)$ for standard-NFS. Since $ (2.0800838\cdots/1.9760518\cdots)^3<1.17$, this would suggest that $ 1.17D$-digit composites can be factored using circuit-NFS for the throughput cost of $ D$-digit integers using standard-NFS. The significance of this comparison depends on whether or not the throughput cost is an acceptable way of measuring the cost of standard-NFS. If not, then the conclusion based on the operation count (as mentioned above) would be that circuit-NFS is slower than standard-NFS; but see Section 4 for a more complete picture. Other examples where it is recognized that the memory cost of relation collection is asymptotically not a concern can be found in [12] and [9], and are implied by [14].
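The comparison of this subsection can likewise be checked numerically (again with the $ o(1)$'s disregarded):

```python
tuned_standard_exp = 2.0800838   # standard-NFS tuned to throughput cost (2.6 with 2*eps = 3)
circuit_exp = 1.9760518          # circuit-NFS throughput cost exponent

# Cube of the exponent ratio = relative growth in the number of digits.
ratio_cubed = (tuned_standard_exp / circuit_exp) ** 3
print(ratio_cubed)               # about 1.166, hence the factor 1.17 in the text
```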

Remark 3.7

It can be argued that the approach in 3.6 of replacing the ordinary standard-NFS parameters by smaller smoothness bounds in order to make the matrix step easier corresponds to what happens in many actual NFS factorizations. There it is done not only to make the matrix step less cumbersome at the cost of somewhat more sieving, but also to make do with available PC memories. Each contributing PC uses the largest smoothness bounds and sieving range that fit conveniently and that cause minimal interference with the PC-owner's real work. Thus, parameters may vary from machine to machine. This is combined with other memory saving methods such as ``special-$ q$'s.'' In any case, if insufficient memory is available for sieving with optimal ordinary parameters, one does not run out to buy more memory but settles for slight suboptimality, with the added benefit of an easier matrix step. See also 4.1.

Remark 3.8

In [19], Wiener outlines a three-dimensional circuit for the matrix step, with structure that is optimal in a certain sense (when considering the cost of internal wiring). This design leads to a matrix step exponent of $ 2\epsilon=7/3$, compared to $ 5/2$ in the designs of [1] and this paper. However, adaptation of that design to two dimensions yields a matrix step exponent that is asymptotically identical to ours, and vice versa. Thus the approach of [19] is asymptotically equivalent to ours, while its practical cost remains to be evaluated. We note that in either approach, there are sound technological reasons to prefer the 2D variant. Interestingly, $ 2\epsilon=7/3$ is the point where improved and standard NFS become the same (cf. 2.7).