Message ID: AM4PR0701MB216257873FAD30D4E8975FF0E49E0@AM4PR0701MB2162.eurprd07.prod.outlook.com
State: New
Bernd Edlinger wrote:
> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>
> In the new test case the stack is reduced to about 270 bytes, except
> for neon and iwmmxt, where this does not change anything.

This looks odd:

-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);

Given this is an SI mode scratch, do we need the else part at all? It seems
wrong to ask for the low part of an SI mode operand...

Other than that it looks good to me, but I can't approve.

As a result of your patches a few patterns are unused now. All the Thumb-2
iordi_notdi* patterns cannot be used anymore. Also I think arm_cmpdi_zero
never gets used - a DI mode compare with zero is always split into ORR
during expand.

Wilco
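For readers following along, here is a minimal annotated sketch of the two
code paths under discussion. The helpers (can_create_pseudo_p, gen_reg_rtx,
gen_lowpart) are the real GCC internals from the patch; the comments are one
reading of why the else arm is questionable, not quotes from the thread:

    /* Scratch selection in the cmpdi split, which now runs both before
       and after reload.  */
    if (can_create_pseudo_p ())
      /* At split1, before reload, new pseudos may still be created,
	 so take a fresh SImode register for the scratch.  */
      operands[2] = gen_reg_rtx (SImode);
    else
      /* After reload, operands[2] is already the SImode register that
	 reload assigned to the match_scratch; gen_lowpart of an SImode
	 value in SImode simply returns it, so this arm is effectively
	 a no-op, which is the point of the review comment above.  */
      operands[2] = gen_lowpart (SImode, operands[2]);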
On 12/20/16 16:09, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
>> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>>
>> In the new test case the stack is reduced to about 270 bytes, except
>> for neon and iwmmxt, where this does not change anything.
>
> This looks odd:
>
> -    operands[2] = gen_lowpart (SImode, operands[2]);
> +    if (can_create_pseudo_p ())
> +      operands[2] = gen_reg_rtx (SImode);
> +    else
> +      operands[2] = gen_lowpart (SImode, operands[2]);
>
> Given this is an SI mode scratch, do we need the else part at all? It seems
> wrong to ask for the low part of an SI mode operand...

Yes, I think that is correct.

> Other than that it looks good to me, but I can't approve.
>
> As a result of your patches a few patterns are unused now. All the Thumb-2
> iordi_notdi* patterns cannot be used anymore. Also I think arm_cmpdi_zero
> never gets used - a DI mode compare with zero is always split into ORR
> during expand.

I did not change anything for -mthumb -mfpu=neon for instance.
Do you think that iordi_notdi* is also never used for that
configuration?

And if arm_cmpdi_zero is never expanded, isn't it already
unused before my patch?

Bernd.
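A hypothetical C reproducer for the iordi_notdi* question, assuming a
Thumb-2 target with optimization enabled (the function name orn64 is
illustrative, not from the thread):

    /* A DImode "or with complement".  On Thumb-2 this is the kind of
       expression that could match the iordi_notdi* patterns (or
       orndi3_neon with -mfpu=neon) and end up as two orn instructions.  */
    unsigned long long
    orn64 (unsigned long long a, unsigned long long b)
    {
      return a | ~b;
    }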
Bernd Edlinger wrote:
> On 12/20/16 16:09, Wilco Dijkstra wrote:
>> As a result of your patches a few patterns are unused now. All the Thumb-2
>> iordi_notdi* patterns cannot be used anymore. Also I think arm_cmpdi_zero
>> never gets used - a DI mode compare with zero is always split into ORR
>> during expand.
>
> I did not change anything for -mthumb -mfpu=neon for instance.
> Do you think that iordi_notdi* is also never used for that
> configuration?

With -mfpu=vfp or -msoft-float these patterns cannot be used, as logical
operations are expanded before combine. Interestingly, with -mfpu=neon ARM
uses the orndi3_neon patterns (which are inefficient for ARM and probably
should be disabled), but Thumb-2 uses the iordi_notdi patterns... So removing
these reduces the number of patterns while we still generate orn for Thumb-2.

> And if arm_cmpdi_zero is never expanded, isn't it already
> unused before my patch?

It appears to be, so we don't need to fix it now. However, when improving the
expansion of comparisons it does trigger. For example, x == 3 currently
expands into 3 instructions:

	cmp	r1, #0
	it	eq
	cmpeq	r0, #3

Tweaking arm_select_cc_mode uses arm_cmpdi_zero, and when expanded early we
generate this:

	eor	r0, r0, #3
	orrs	r0, r0, r1

Using sub rather than eor would be even better, of course.

Wilco
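A hypothetical reproducer for the expansion Wilco describes, assuming an
ARM/Thumb-2 target with optimization enabled (the function name eq3 is
illustrative, not from the thread):

    /* A DImode compare against a small constant.  Today this becomes
       cmp/it/cmpeq; with arm_cmpdi_zero triggered by early expansion it
       could become eor of the low word followed by orrs with the high
       word (or subs instead of eor, as suggested above).  */
    int
    eq3 (unsigned long long x)
    {
      return x == 3;
    }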
Ping...

I attached the latest version of my patch.


Thanks
Bernd.

On 12/18/16 14:14, Bernd Edlinger wrote:
> Hi,
>
> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>
> In the new test case the stack is reduced to about 270 bytes, except
> for neon and iwmmxt, where this does not change anything.
>
> This patch depends on [1] and [2] before it can be applied.
>
> Bootstrapped and reg-tested on arm-linux-gnueabihf.
> Is it OK for trunk?
>
>
> Thanks
> Bernd.
>
>
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html

2016-12-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn,
	*arm_cmpdi_unsigned): Split early except for TARGET_NEON
	and TARGET_IWMMXT.

testsuite:
2016-12-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>

	PR target/77308
	* gcc.target/arm/pr77308-2.c: New test.

Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md	(revision 243782)
+++ gcc/config/arm/arm.md	(working copy)
@@ -4689,7 +4689,7 @@
   "TARGET_32BIT"
   "#"	; rsbs %Q0, %Q1, #0; rsc %R0, %R1, #0	       (ARM)
 	; negs %Q0, %Q1	   ; sbc %R0, %R1, %R1, lsl #1 (Thumb-2)
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
		    (compare:CC (const_int 0) (match_dup 1)))
	       (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -7359,7 +7359,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
	 (compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)
@@ -7383,7 +7383,8 @@
	operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
       }
     operands[1] = gen_lowpart (SImode, operands[1]);
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
   }
   [(set_attr "conds" "set")
    (set_attr "length" "8")
@@ -7397,7 +7398,7 @@
   "TARGET_32BIT"
   "#"   ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
	 (compare:CC (match_dup 2) (match_dup 3)))
    (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
Index: gcc/testsuite/gcc.target/arm/pr77308-2.c
===================================================================
--- gcc/testsuite/gcc.target/arm/pr77308-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/arm/pr77308-2.c	(working copy)
@@ -0,0 +1,169 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks.
+   It improves the test coverage of cmpdi and negdi2 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C) C##ULL
+
+#define SHA_LBLOCK 16
+#define SHA512_CBLOCK (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j) (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s) (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x) (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x))
+#define Sigma1(x) (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x))
+#define sigma0(x) (ROTR((x),1) ^ ROTR((x),8) ^ (((x)>>7) > (x)) ? -(x) : (x))
+#define sigma1(x) (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? -(x) : (x))
+#define Ch(x,y,z) (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h) do { \
+    T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i]; \
+    h = Sigma0(a) + Maj(a,b,c); \
+    d += T1; h += T1; } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X) do { \
+    s0 = X[(j+1)&0x0f]; s0 = sigma0(s0); \
+    s1 = X[(j+14)&0x0f]; s1 = sigma1(s1); \
+    T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f]; \
+    ROUND_00_15(i+j,a,b,c,d,e,f,g,h); } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                             unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}
Hi Bernd,

On 29/04/17 18:52, Bernd Edlinger wrote:
> Ping...
>
> I attached the latest version of my patch.
>
>
> Thanks
> Bernd.
>
> On 12/18/16 14:14, Bernd Edlinger wrote:
>> Hi,
>>
>> this splits the *arm_negdi2, *arm_cmpdi_insn and *arm_cmpdi_unsigned
>> also at split1 except for TARGET_NEON and TARGET_IWMMXT.
>>
>> In the new test case the stack is reduced to about 270 bytes, except
>> for neon and iwmmxt, where this does not change anything.
>>
>> This patch depends on [1] and [2] before it can be applied.
>>
>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>> Is it OK for trunk?
>>
>>
>> Thanks
>> Bernd.
>>
>>
>>
>> [1] https://gcc.gnu.org/ml/gcc-patches/2016-11/msg02796.html
>> [2] https://gcc.gnu.org/ml/gcc-patches/2016-12/msg01562.html
>
> 2016-12-18  Bernd Edlinger  <bernd.edlinger@hotmail.de>
>
>	PR target/77308
>	* config/arm/arm.md (*arm_negdi2, *arm_cmpdi_insn,
>	*arm_cmpdi_unsigned): Split early except for TARGET_NEON
>	and TARGET_IWMMXT.

You're changing negdi2_insn rather than *arm_negdi2.

Ok with the fixed ChangeLog once the prerequisite is committed.

Thanks,
Kyrill