Transcription of Rounding Errors in Complex Floating-Point …
Simon Fraser University

Rounding Errors in Complex Floating-Point Multiplication

Floating-point values are represented by a sign, significand, and exponent,

$$x = (-1)^s \beta^{e-B} m,$$

where $\{s, e, m\} \subset \mathbb{N}$, $0 < e < E$, $\beta^{t-1} \le m < \beta^t$, and $\beta, t, B, E$ are parameters of the floating-point system. As a special case, $e = 0$ and $m = \beta^{t-1}$ represents zero. In IEEE 754 double precision arithmetic, $\beta = 2$, $t = 53$, $B = 1075$ and $E = 2047$. IEEE 754 also has denormals, infinities, and NaNs, but in numerical code they […].

We write $x \oplus y$, $x \ominus y$, and $x \otimes y$ for the results of rounded floating-point addition, subtraction, and multiplication, and define the unit in the last place $\mathrm{ulp}(x)$ for $x \ne 0$ as the (unique) power of $\beta$ such that $\beta^{t-1} \le |x| / \mathrm{ulp}(x) < \beta^t$, and $\mathrm{ulp}(0) = 0$. Then $\mathrm{ulp}((-1)^s \beta^{e-B} m) = \beta^{e-B}$ and $\mathrm{ulp}(x) \le |x| \beta^{1-t}$. We set

$$\epsilon = \tfrac12 \mathrm{ulp}(1) = \tfrac12 \beta^{1-t}.$$

We assume that rounded results are the same as if the values were computed to infinite precision and then rounded to a nearest representable value, as in round-to-nearest mode on IEEE 754 systems. Then

$$|(x + y) - (x \oplus y)| \le \tfrac12 \mathrm{ulp}(x + y) < \epsilon |x + y|$$
$$|(x - y) - (x \ominus y)| \le \tfrac12 \mathrm{ulp}(x - y) < \epsilon |x - y|$$
$$|(xy) - (x \otimes y)| \le \tfrac12 \mathrm{ulp}(xy) < \epsilon |xy|$$
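The representation and the ulp definition can be made concrete for IEEE 754 double precision. The following is a minimal sketch, assuming Python 3.9 or later and normal (non-zero, non-denormal, finite) inputs; the function names `decompose` and `ulp` are illustrative and not taken from the slides. It uses the parameters $\beta = 2$, $t = 53$, $B = 1075$, and relies on $E = 2047$ being the value for which $0 < e < E$ covers exactly the normal double-precision exponents.

```python
import math
from fractions import Fraction

# IEEE 754 double precision parameters (beta, t, B, E) as used in the text.
BETA, T, BIAS, E_MAX = 2, 53, 1075, 2047
# eps = ulp(1) / 2 = beta**(1 - t) / 2 = 2**-53
EPS = Fraction(1, 2) * Fraction(BETA) ** (1 - T)


def decompose(x: float) -> tuple[int, int, int]:
    """Return (s, e, m) with x == (-1)**s * 2**(e - BIAS) * m for a normal double."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only non-zero finite values are handled in this sketch")
    frac, exp = math.frexp(abs(x))   # abs(x) == frac * 2**exp with 0.5 <= frac < 1
    m = int(frac * 2 ** T)           # exact: scaling by a power of two loses no bits
    e = exp - T + BIAS
    if not 0 < e < E_MAX:
        raise ValueError("denormal values are outside the 0 < e < E range")
    s = 1 if x < 0 else 0
    return s, e, m


def ulp(x: float) -> Fraction:
    """Unit in the last place: the power of beta equal to beta**(e - B), with ulp(0) = 0."""
    if x == 0.0:
        return Fraction(0)
    _, e, _ = decompose(x)
    return Fraction(BETA) ** (e - BIAS)


if __name__ == "__main__":
    for value in (1.0, 1.5, -6.5):
        print(value, decompose(value), ulp(value), ulp(value) == Fraction(math.ulp(value)))
```

For normal inputs the result agrees with the standard library's `math.ulp`, which gives an independent check on the exponent bookkeeping.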
Given $z_0 = a_0 + i b_0$ and $z_1 = a_1 + i b_1$, if we compute $z_2 = a_2 + i b_2 = (a_0 \oplus a_1) + i (b_0 \oplus b_1)$, then

$$|(z_0 + z_1) - z_2| = \sqrt{((a_0 + a_1) - a_2)^2 + ((b_0 + b_1) - b_2)^2} < \sqrt{(\epsilon |a_0 + a_1|)^2 + (\epsilon |b_0 + b_1|)^2} = \epsilon |z_0 + z_1|.$$

Problem: If we compute

$$x_2 = (a_0 \otimes a_1) \ominus (b_0 \otimes b_1)$$
$$y_2 = (a_0 \otimes b_1) \oplus (b_0 \otimes a_1)$$

and set $z_2 = x_2 + i y_2$, what is the smallest $\delta$ such that $|z_0 z_1 - z_2| < \delta |z_0 z_1|$?
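The bound for complex addition can be checked exactly on concrete inputs. This is a minimal sketch, assuming IEEE 754 double precision (so the unit roundoff is $2^{-53}$), a non-zero sum, and no overflow; the helper name `complex_add_within_bound` is hypothetical. Squared magnitudes are compared with exact rational arithmetic so that the check itself introduces no rounding.

```python
from fractions import Fraction

EPS = Fraction(1, 2 ** 53)  # unit roundoff for IEEE 754 double precision


def complex_add_within_bound(z0: complex, z1: complex) -> bool:
    """Check |(z0 + z1) - z2| < eps * |z0 + z1| exactly, comparing squared magnitudes."""
    # Componentwise rounded addition, as in z2 = (a0 (+) a1) + i (b0 (+) b1).
    a2 = z0.real + z1.real
    b2 = z0.imag + z1.imag
    # Exact sums of the (exactly representable) inputs.
    exact_re = Fraction(z0.real) + Fraction(z1.real)
    exact_im = Fraction(z0.imag) + Fraction(z1.imag)
    err_sq = (exact_re - Fraction(a2)) ** 2 + (exact_im - Fraction(b2)) ** 2
    norm_sq = exact_re ** 2 + exact_im ** 2
    return err_sq < EPS ** 2 * norm_sq


if __name__ == "__main__":
    print(complex_add_within_bound(complex(0.1, 0.2), complex(0.3, 0.4)))
```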
Rounding error in complex floating-point multiplication makes large polynomial (and integer) multiplication slow [Percival, 2002]: The FFT allows accurate computation of the cyclic convolution $z = x * y$ of two vectors of length $N = 2^n$ of Gaussian integers if

$$|x| \cdot |y| \cdot \left((1 + \epsilon)^{3n} (1 + \delta)^{3n+1} (1 + \eta)^{3n} - 1\right) < \tfrac12,$$

where $\delta$ is the maximum relative error of complex multiplication, and $\eta$ is the maximum error in the precomputed complex roots of unity.
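To see how the complex multiplication error limits the usable input size, the FFT condition can be solved for the largest admissible product of norms. The sketch below is illustrative only: the function name `max_norm_product` and its parameter names are hypothetical, and the root-of-unity error used in the example (equal to the unit roundoff) is an assumed value, not one given in the slides. `log1p` and `expm1` are used because the growth factor differs from 1 by far less than double precision can resolve directly.

```python
import math

EPS = 2.0 ** -53  # unit roundoff for IEEE 754 double precision


def max_norm_product(n: int, mul_err: float, root_err: float) -> float:
    """Exclusive upper limit on |x|*|y| so that the FFT error bound stays below 1/2.

    n        -- log2 of the transform length
    mul_err  -- maximum relative error of one complex multiplication
    root_err -- maximum error in the precomputed roots of unity (assumed by the caller)
    """
    # log of (1 + eps)^(3n) * (1 + mul_err)^(3n + 1) * (1 + root_err)^(3n)
    log_growth = (
        3 * n * math.log1p(EPS)
        + (3 * n + 1) * math.log1p(mul_err)
        + 3 * n * math.log1p(root_err)
    )
    # growth - 1, computed without cancellation
    return 0.5 / math.expm1(log_growth)


if __name__ == "__main__":
    for label, constant in (("sqrt(8)", math.sqrt(8)), ("sqrt(5)", math.sqrt(5))):
        limit = max_norm_product(20, EPS * constant, root_err=EPS)  # root_err is an assumption
        print(f"mul_err = eps*{label}: |x||y| must stay below {limit:.3e}")
```

A smaller multiplication error constant raises the limit, which is why tightening the bound matters for FFT-based multiplication.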
We can take $\delta = \epsilon \sqrt{8}$ [Higham, Accuracy and Stability of Numerical Algorithms].

We can take $\delta = \epsilon \sqrt{16/3}$ [Olver, 1986].

We can take $\delta = \epsilon \sqrt{5}$ [Percival, 2002]. Conjectured based on comparing the results of single-precision and double-precision complex multiplication of several […] took five years before anyone noticed!
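The three constants can be compared against observed errors with a small random search. This is a sketch under stated assumptions, not the experiment described above: instead of comparing single-precision against double-precision results, it measures the error of double-precision multiplication against exact rational arithmetic. The function names, the sampling range $[0.5, 1)$, and the trial count are all illustrative choices. A random search only gives a lower estimate of the worst case; it cannot establish a bound.

```python
import math
import random
from fractions import Fraction

EPS = Fraction(1, 2 ** 53)  # unit roundoff for IEEE 754 double precision

# Error constants quoted in the text, in units of eps.
CONSTANTS = {
    "Higham": math.sqrt(8),
    "Olver 1986": math.sqrt(16 / 3),
    "Percival 2002": math.sqrt(5),
}


def mul_error_in_eps(a0: float, b0: float, a1: float, b1: float) -> float:
    """Relative error of the textbook complex product, divided by eps."""
    # Rounded computation: each operator below is one correctly rounded operation.
    x2 = a0 * a1 - b0 * b1
    y2 = a0 * b1 + b0 * a1
    # Exact product of the same inputs.
    fa0, fb0, fa1, fb1 = (Fraction(v) for v in (a0, b0, a1, b1))
    re = fa0 * fa1 - fb0 * fb1
    im = fa0 * fb1 + fb0 * fa1
    err_sq = (re - Fraction(x2)) ** 2 + (im - Fraction(y2)) ** 2
    norm_sq = re ** 2 + im ** 2
    return math.sqrt(err_sq / norm_sq) / float(EPS)


def worst_observed(trials: int = 2000, seed: int = 0) -> float:
    """Largest error (in units of eps) seen over random inputs in [0.5, 1)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        a0, b0, a1, b1 = (rng.uniform(0.5, 1.0) for _ in range(4))
        worst = max(worst, mul_error_in_eps(a0, b0, a1, b1))
    return worst


if __name__ == "__main__":
    print(f"worst observed: {worst_observed():.4f} eps")
    for name, constant in CONSTANTS.items():
        print(f"{name}: {constant:.4f} eps")
```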
Theorem [Brent, Percival, Zimmermann, 2006]: Let $z_0 = a_0 + b_0 i$ and $z_1 = a_1 + b_1 i$, with $a_0, b_0, a_1, b_1$ floating-point values with $t$-digit base-$\beta$ significands, and

$$z_2 = ((a_0 \otimes a_1) \ominus (b_0 \otimes b_1)) + ((a_0 \otimes b_1) \oplus (b_0 \otimes a_1)) i.$$

Providing that no overflow or underflow occur, no denormal values are produced, arithmetic results are correctly rounded to a nearest representable value, $z_0 z_1 \ne 0$, and $\beta^t \ge 2^5$,

$$|z_0 z_1 - z_2| < \tfrac12 \beta^{1-t} \sqrt{5}\, |z_0 z_1| = \epsilon \sqrt{5}\, |z_0 z_1|.$$

Without loss of generality, we can assume the greatest possible relative error occurs when $0 \le a_0, b_0, a_1, b_1$, by multiplying by powers of $i$; $b_0 b_1 \le a_0 a_1$, by taking complex conjugates and multiplying $z_0$, $z_1$ by $i$; $b_0 a_1$ and $a_0 b_1$ are in a fixed order, by swapping $z_0$ and $z_1$; $\tfrac12 \le a_0 < 1$, by multiplying $z_0$ by powers of 2; and $\tfrac12 \le a_0 a_1 < 1$, by multiplying $z_1$ by powers of 2, none of which affect the resulting relative error.

To bound the imaginary error $|\Im(z_0 z_1 - z_2)|$, we consider two cases:

Case I1: $\mathrm{ulp}(a_0 b_1 + b_0 a_1) < \mathrm{ulp}(a_0 \otimes b_1 + b_0 \otimes a_1)$

Case I2: $\mathrm{ulp}(a_0 \otimes b_1 + b_0 \otimes a_1) \le \mathrm{ulp}(a_0 b_1 + b_0 a_1)$

In each case, we find that

$$|(a_0 \otimes b_1 + b_0 \otimes a_1) - ((a_0 \otimes b_1) \oplus (b_0 \otimes a_1))| < \epsilon (a_0 b_1 + b_0 a_1)$$

and thus

$$|\Im(z_0 z_1 - z_2)| < \epsilon (2 a_0 b_1 + 2 b_0 a_1).$$
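The two-case split for the imaginary part can be traced on a concrete input. The following is a minimal sketch, assuming IEEE 754 double precision ($\beta = 2$, $t = 53$) and non-negative inputs as in the normalization above; the names `ulp` and `imaginary_part_check` are hypothetical. It classifies the input as case I1 or I2 and evaluates both inequalities with exact rational arithmetic. It checks one input at a time and is not a proof.

```python
from fractions import Fraction

T = 53                       # significand digits for IEEE 754 double precision
EPS = Fraction(1, 2 ** T)    # eps = beta**(1 - t) / 2 with beta = 2


def ulp(x: Fraction) -> Fraction:
    """ulp of an exact real x: the power of 2 with 2**(T-1) <= |x| / ulp(x) < 2**T."""
    if x == 0:
        return Fraction(0)
    ax = abs(x)
    # Estimate floor(log2(ax)) from bit lengths, then correct a possible off-by-one.
    k = ax.numerator.bit_length() - ax.denominator.bit_length()
    if Fraction(2) ** k > ax:
        k -= 1
    return Fraction(2) ** (k - (T - 1))


def imaginary_part_check(a0: float, b0: float, a1: float, b1: float) -> dict:
    """Classify the input as case I1 or I2 and test both imaginary-part inequalities."""
    p, q = a0 * b1, b0 * a1          # rounded products a0 (x) b1 and b0 (x) a1
    y2 = p + q                       # rounded sum of the rounded products
    exact = Fraction(a0) * Fraction(b1) + Fraction(b0) * Fraction(a1)
    rounded_products = Fraction(p) + Fraction(q)   # exact sum of the rounded products
    case = "I1" if ulp(exact) < ulp(rounded_products) else "I2"
    addition_error = abs(rounded_products - Fraction(y2))
    total_error = abs(exact - Fraction(y2))
    return {
        "case": case,
        "addition_ok": addition_error < EPS * exact,
        "total_ok": total_error < 2 * EPS * exact,
    }


if __name__ == "__main__":
    print(imaginary_part_check(0.75, 0.6, 0.7, 0.9))
```

Case I1 can only occur when rounding the two products pushes their sum across a power of 2, so most inputs fall under case I2.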