[john-users] sha512crypt & Drupal 7+ password cracking on FPGA

Discussion:

Solar Designer

2018-07-23 15:27:51 UTC

Hi,

As many of you are aware, we support descrypt and bcrypt password hash
cracking on the old ZTEX 1.15y quad-FPGA boards. Threads:

http://www.openwall.com/lists/john-users/2016/11/06/1
http://www.openwall.com/lists/john-users/2017/06/25/1

Now Denis has also added support for sha512crypt and Drupal 7+ SHA-512
based password hashes on those same old boards.

We had achieved energy-efficiency improvement over current high-end GPUs
at descrypt and bcrypt, and in the case of bcrypt also decent speed
improvement per board and per rig (see further messages in the above
threads). However, for sha512crypt and Drupal 7+ hashes we're merely on
par with current high-end GPUs in terms of energy-efficiency and our
speeds per-board are lower (it takes four or so boards to match one
high-end GPU). Thus, for practical purposes this is useful to those who
have those boards anyway or would acquire such boards primarily for
bcrypt and descrypt, so that the boards can also be put to more uses.

This is also valuable as being, to the best of my knowledge, the very
first implementation of these two hash types on FPGA. And it is also
our first attempt to use specialized soft CPU cores(*) along with
cryptographic cores in an FPGA design to combine some limited
flexibility (in this case, used to implement two higher-level hash types
in one bitstream) with resource savings (no need to waste logic on
sha512crypt's higher-level algorithm specifics) and efficient
cryptographic cores (in this case, SHA-512). Application of a similar
approach to newer and much larger FPGAs (such as those available on AWS
F1) will result in improvement over current GPUs at least in
energy-efficiency (and for the largest FPGAs probably also in
performance).

(*) Denis' bcrypt design uses microcode to save on logic, but it's a
closer match to historical CPUs' wide microcode than to a CPU program.
Maybe it'll help us implement bcrypt-pbkdf at some point, though.

Denis wrote a good description of the design with some ASCII diagrams,
currently found here:

https://github.com/magnumripper/JohnTheRipper/tree/bleeding-jumbo/src/ztex/fpga-sha512crypt

Each soft CPU core is 16-way SMT (runs 16 hardware threads with their
separate register files) and it controls four SHA-512 cores with each of
those capable of up to four in-flight hash computations (most of the
time only two are being computed, but there's some overlap between
finishing processing on one pair of hashes and starting on the next).

One soft CPU core (plus its memory and glue logic) and four SHA-512
cores form a unit. The SHA-512 cores occupy 80% of the unit's area,
so in those terms the overhead of using soft CPUs is at most 25% (but
they actually help save on algorithm-specific logic).

10 units fit in one Spartan-6 LX150 FPGA. This means 10 soft CPU cores,
160 hardware threads, 40 SHA-512 cores, up to 160 in-flight SHA-512 per
FPGA. Four times that per board.
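
As a quick cross-check of the counts above, here is a minimal Python sketch. The constants are the figures stated in this message; the names are illustrative only and do not come from Denis' design files.

```python
# Figures quoted in the message; identifiers are illustrative, not from the design.
UNITS_PER_FPGA = 10
THREADS_PER_SOFT_CPU = 16      # 16-way SMT
SHA512_CORES_PER_UNIT = 4
MAX_IN_FLIGHT_PER_CORE = 4
FPGAS_PER_BOARD = 4
SHA512_AREA_SHARE = 0.80       # share of a unit's area taken by SHA-512 cores


def per_fpga():
    """Resource totals for one Spartan-6 LX150 with 10 units."""
    cores = UNITS_PER_FPGA * SHA512_CORES_PER_UNIT
    return {
        "soft_cpus": UNITS_PER_FPGA,
        "hw_threads": UNITS_PER_FPGA * THREADS_PER_SOFT_CPU,
        "sha512_cores": cores,
        "max_in_flight": cores * MAX_IN_FLIGHT_PER_CORE,
    }


def per_board():
    """Same totals for a quad-FPGA board."""
    return {name: count * FPGAS_PER_BOARD for name, count in per_fpga().items()}


def soft_cpu_overhead(sha512_share=SHA512_AREA_SHARE):
    """Non-SHA-512 area expressed relative to the SHA-512 area."""
    return (1.0 - sha512_share) / sha512_share


if __name__ == "__main__":
    print(per_fpga())
    print(per_board())
    print(f"overhead relative to SHA-512 area: {soft_cpu_overhead():.0%}")
```

The "at most 25%" figure is the remaining 20% of the unit divided by the 80% taken by the SHA-512 cores, that is, overhead measured relative to the useful area rather than to the whole unit.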

Also included are an on-device candidate password generator (for mask
mode, including in hybrid modes along with a wordlist coming from the
host, etc.) and a hash comparator (capable of up to 512 loaded hashes
per salt; no limit on total loaded hashes as that's handled on the
host). This is similar to what Denis' designs for descrypt and bcrypt
also have.

sha512crypt and Drupal 7+ hashes are two entry points into the program
memory. (The Drupal 7+ program is much simpler than sha512crypt's.
It could also be more efficient on a more specialized design since it
does not need unaligned access to the buffers, which we support for
sha512crypt. Yet it's good to have it along with sha512crypt
essentially for free.)

Per Xilinx tools, this design was supposed to work at 225 MHz.
Unfortunately, in our testing it only works at this frequency with very
few units built into the bitstream. We don't know exactly why (maybe
it's the power draw). With 10 units, the design works reliably for us
at 135 MHz on many boards tested, so that's what we set as the current
default. It also sometimes works at higher frequencies such as 160 MHz,
but other times not. This is configurable in john.conf.

Here's a test run against 512 same-salt sha512crypt hashes (good for
quick reliability testing as all 512 are supposed to be cracked) on one
board (4 FPGAs) at 135 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
327g 0:00:00:42 62.00% (ETA: 15:55:22) 7.746g/s 47003p/s 47003c/s 16282KC/s 40447..40137
512g 0:00:01:05 DONE (2018-07-23 15:55) 7.825g/s 46950p/s 46950c/s 12179KC/s 40500..40190
Session completed
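
For orientation, the size of the mask keyspace in the run above can be estimated with a small Python sketch. The helper names are hypothetical; the charset, mask length and candidate rate are copied from the command and log above.

```python
def mask_keyspace(custom_charset, positions):
    """Number of candidates for a mask that repeats one custom charset."""
    return len(set(custom_charset)) ** positions


def seconds_to_exhaust(keyspace, candidates_per_second):
    """Time to try every candidate at a constant rate."""
    return keyspace / candidates_per_second


if __name__ == "__main__":
    # -2='1A2B3C4D5E6F7G8H9I0J' with --mask='?2?2?2?2?2'
    keyspace = mask_keyspace("1A2B3C4D5E6F7G8H9I0J", 5)
    print(keyspace)
    # 46950 p/s is the final rate reported for one board at 135 MHz
    print(round(seconds_to_exhaust(keyspace, 46950), 1))
```

The full keyspace would take slightly longer than the 1:05 reported, presumably because the session ends as soon as the last of the 512 hashes is cracked rather than when the mask is exhausted.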

Four boards (16 FPGAs), 135 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
378g 0:00:00:12 72.00% (ETA: 15:53:55) 30.45g/s 185656p/s 185656c/s 62318KC/s 40348..1AF58
512g 0:00:00:16 DONE (2018-07-23 15:53) 30.89g/s 185395p/s 185395c/s 51138KC/s 40000..40140
Session completed

Scaling efficiency 185395/46950/4 = 98.7%.
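
The same calculation as a minimal Python sketch (the function name is illustrative), using the final c/s figures from the two runs above:

```python
def scaling_efficiency(multi_board_rate, single_board_rate, boards):
    """Measured multi-board rate as a fraction of ideal linear scaling."""
    return multi_board_rate / (single_board_rate * boards)


if __name__ == "__main__":
    # sha512crypt at 135 MHz: four boards vs. one board
    eff = scaling_efficiency(185395, 46950, 4)
    print(f"{eff:.1%}")
```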

Four boards (16 FPGAs), 160 MHz:

$ ./john -2='1A2B3C4D5E6F7G8H9I0J' --mask='?2?2?2?2?2' --format=sha512crypt-ztex --verbosity=1 pw-sha512crypt
[...]
Loaded 512 password hashes with no different salts (sha512crypt-ztex, crypt(3) $6$ [sha512crypt ZTEX])
Press 'q' or Ctrl-C to abort, almost any other key for status
174g 0:00:00:04 32.00% (ETA: 15:57:33) 36.78g/s 216490p/s 216490c/s 94714KC/s 40044..1AF54
512g 0:00:00:14 DONE (2018-07-23 15:57) 36.44g/s 218647p/s 218647c/s 60310KC/s 40000..40340
Session completed
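
A short Python sketch comparing the two four-board runs suggests that throughput tracks the clock almost linearly. The function name is hypothetical; the rates are the final c/s figures from the logs above.

```python
def clock_scaling(rate_low, rate_high, mhz_low, mhz_high):
    """Return (clock ratio, throughput ratio, throughput gain / clock gain)."""
    clock_ratio = mhz_high / mhz_low
    rate_ratio = rate_high / rate_low
    return clock_ratio, rate_ratio, rate_ratio / clock_ratio


if __name__ == "__main__":
    clock, rate, fraction = clock_scaling(185395, 218647, 135, 160)
    print(f"clock x{clock:.3f}, throughput x{rate:.3f}, "
          f"{fraction:.1%} of linear")
```

Both runs are short, so the comparison is only approximate.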

This is similar speed to what Jeremi Gosney reported for hashcat on one
GTX 1080 Ti at stock clocks:

https://gist.github.com/epixoip/973da7352f4cc005746c627527e4d073

Hashtype: sha512crypt, SHA512(Unix)

Speed.Dev.#1.....: 216.0 kH/s (53.53ms)

Somehow a newer benchmark of 8x GTX 1080 Ti shows slightly higher speed
per GPU:

https://gist.github.com/epixoip/ace60d09981be09544fdd35005051505

Hashtype: sha512crypt $6$, SHA512 (Unix)

Speed.Dev.#1.....: 235.9 kH/s (96.29ms)
Speed.Dev.#2.....: 228.3 kH/s (50.67ms)
Speed.Dev.#3.....: 230.4 kH/s (50.22ms)
Speed.Dev.#4.....: 230.5 kH/s (50.18ms)
Speed.Dev.#5.....: 230.6 kH/s (50.16ms)
Speed.Dev.#6.....: 230.1 kH/s (50.27ms)
Speed.Dev.#7.....: 232.0 kH/s (49.85ms)
Speed.Dev.#8.....: 231.3 kH/s (50.01ms)
Speed.Dev.#*.....: 1849.1 kH/s

We're probably consuming around 160W for the boards (Denis measured 3.4A
at 12V per board at 160 MHz, which translates to ~40W/board) or 180W at
the wall at ~90% PSU efficiency.
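
The power estimate can be reproduced with a few lines of Python. This is a sketch under the stated assumptions (3.4A at 12V per board, four boards, ~90% PSU efficiency); the function names are illustrative.

```python
def board_power_w(current_a=3.4, voltage_v=12.0):
    """DC power drawn by one board, from Denis' measurement at 160 MHz."""
    return current_a * voltage_v


def rig_power_w(boards=4):
    """DC power for the whole rig."""
    return boards * board_power_w()


def wall_power_w(dc_power_w, psu_efficiency=0.90):
    """AC power at the wall for an assumed PSU efficiency."""
    return dc_power_w / psu_efficiency


if __name__ == "__main__":
    dc = rig_power_w()
    print(f"{board_power_w():.1f} W per board, {dc:.1f} W for four boards, "
          f"{wall_power_w(dc):.0f} W at the wall")
    # Energy efficiency at 160 MHz, using the 218647 c/s figure above
    print(f"{218647 / dc:.0f} H/s per W (DC)")
```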

I guess a GTX 1080 Ti might consume a little bit more at this benchmark
(it's a 250W TDP card). Jeremi (or someone else who has one of those
cards) can probably check via nvidia-smi while running hashcat.

Drupal 7+ hash, one board (4 FPGAs) at 135 MHz:

$ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
[...]
Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
Cost 1 (iteration count) is 16384 for all loaded hashes
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 2.49% (ETA: 16:08:54) 0g/s 14250p/s 14250c/s 14250C/s prdowaap..oooarsap
0g 0:00:02:03 30.91% (ETA: 16:08:49) 0g/s 14421p/s 14421c/s 14421C/s awoppaas..rssoasas
0g 0:00:03:31 52.93% (ETA: 16:08:50) 0g/s 14427p/s 14427c/s 14427C/s wdwdwdow..pdawrprw
0g 0:00:06:20 95.21% (ETA: 16:08:51) 0g/s 14430p/s 14430c/s 14430C/s wpddwood..ppowrrod
password (?)
1g 0:00:06:28 DONE (2018-07-23 16:08) 0.002571g/s 14428p/s 14428c/s 14428C/s password..orpadord
Use the "--show" option to display all of the cracked passwords reliably
Session completed

Four boards (16 FPGAs), 135 MHz:

$ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
[...]
Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
Cost 1 (iteration count) is 16384 for all loaded hashes
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:10 10.23% (ETA: 16:01:23) 0g/s 56120p/s 56120c/s 56120C/s oaoopprp..rooddwrp
0g 0:00:00:35 35.24% (ETA: 16:01:26) 0g/s 56590p/s 56590c/s 56590C/s dwpadaws..ppawrrws
0g 0:00:01:01 60.25% (ETA: 16:01:27) 0g/s 56662p/s 56662c/s 56662C/s adwoowao..ssodwpso
password (?)
1g 0:00:01:39 DONE (2018-07-23 16:01) 0.01005g/s 56678p/s 56678c/s 56678C/s password..wsrssdrd
Use the "--show" option to display all of the cracked passwords reliably
Session completed

Scaling efficiency 56678/14428/4 = 98.2% despite the complaint about
too small a mask (too few different characters for the mask positions
handled on device).

Four boards (16 FPGAs), 160 MHz:

$ ./john -2='pasword' --mask='?2?2?2?2?2?2?2?2' --format=drupal7-ztex pw-drupal7
[...]
Loaded 1 password hash (Drupal7-ztex, $S$ [SHA512 ZTEX])
Cost 1 (iteration count) is 16384 for all loaded hashes
Warning: Slow communication channel to the device. Increase mask or expect performance degradation.
Press 'q' or Ctrl-C to abort, almost any other key for status
0g 0:00:00:12 14.78% (ETA: 16:11:22) 0g/s 65890p/s 65890c/s 65890C/s rpdroapa..dwdporpa
0g 0:00:00:31 36.38% (ETA: 16:11:25) 0g/s 66386p/s 66386c/s 66386C/s apawrrws..swarosos
0g 0:00:01:16 88.67% (ETA: 16:11:26) 0g/s 66586p/s 66586c/s 66586C/s soapawad..wpssppsd
password (?)
1g 0:00:01:24 DONE (2018-07-23 16:11) 0.01180g/s 66541p/s 66541c/s 66541C/s password..wsrssdrd
Use the "--show" option to display all of the cracked passwords reliably
Session completed

We'd appreciate more testing, such as on Royce' larger cluster of these
boards maybe. Please post your results as follow-ups to this message.

Alexander

Jens Steube

2018-07-23 18:40:48 UTC

It's nice to see sha512crypt available for ztex boards, this is great work!

I'd like to step in here, as you did the comparison with the GPU based
on the benchmark tables from Jeremi, where hashcat is tuned to run at
maximum performance. From a power efficiency perspective, though, I'd
recommend the GTX 1080, limited to 90W. You can do that with
"nvidia-smi -pl 90". Here's a sheet where you can see the great effect
that limiting the power consumption has on the Performance/Watt ratio:
https://docs.google.com/spreadsheets/d/1yyefbpYOq7UIBeBmi5SDUNTXnkIEsdz_gRC5mAeN1x8/edit?usp=sharing

To make it short: we can limit the GPU to consume only half the power
while losing not half of the performance, but just 25%. Limiting the
power consumption has other advantages. For example, it's much easier
to cool the GPUs. On my system they stay around 70C even on longer
runs. I'm using them as they are, without any modifications or external
cooling solutions. The fans (air) are auto-controlled by the driver and
stay far below 50%.

I have a system with four GTX 1080s for development. While running
hashcat and monitoring the power consumption in a second shell (in
parallel, using nvidia-smi), I can see that the power consumption
sometimes peaks at up to 92W, but most of the time it goes down to 75W
and sometimes even 70W. I don't know the technical details here, but my
gut feeling tells me it's lower than 90W on average. OTOH, we have to
keep the host system running, which isn't needed with a mature FPGA
system that can run fully on its own. Therefore I think it's a good
trade-off.

For sha512crypt I'm getting around 377 kH/s on all four GPUs combined.
That translates to ~94300 H/s per 90W.
For Drupal7 I'm getting around 156 kH/s on all four GPUs combined. That
translates to ~39000 H/s per 90W.

This is a weird result at first glance. If I understand your
measurements correctly, a single quad-FPGA board is doing 54600 H/s at
40W on sha512crypt and 16600 H/s at 40W on Drupal7. If you scale this up
to 90W, it's 122850 H/s for sha512crypt and 37350 H/s for Drupal7. That
means that from a power consumption perspective it's 30% faster than the
GPU for sha512crypt, but at the same time it's slower for Drupal7? The
reason is the branches in the loop function in sha512crypt, which is a
special case. GPUs really don't like them. IOW, the GPU implementation
for all *crypt algorithms is a bit below its theoretical maximum. In
Drupal7 (and PBKDF2 and most other KDFs) there are no such branches in
the loop, thus the GPU can perform at full speed on all compute units.
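
The normalization to a 90W budget can be written out as a small Python sketch. The function name is illustrative; the rates and wattages are the ones quoted in this message, and the linear scaling with power is the same simplification used in the text.

```python
def rate_at_budget(rate_hs, measured_watts, budget_watts=90.0):
    """Scale a measured hash rate linearly to a common power budget."""
    return rate_hs * budget_watts / measured_watts


if __name__ == "__main__":
    # One quad-FPGA board at ~40 W vs. one GTX 1080 capped at 90 W
    fpga_sha512crypt = rate_at_budget(54600, 40)
    fpga_drupal7 = rate_at_budget(16600, 40)
    gpu_sha512crypt = rate_at_budget(377000 / 4, 90)
    gpu_drupal7 = rate_at_budget(156000 / 4, 90)

    print(f"sha512crypt: FPGA/GPU = {fpga_sha512crypt / gpu_sha512crypt:.2f}")
    print(f"Drupal7:     FPGA/GPU = {fpga_drupal7 / gpu_drupal7:.2f}")
```

A ratio above 1 means the FPGA board does more work per watt, below 1 means the GPU does.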

As you can see here, today's GPUs are pretty close to an FPGA board
when it comes to power consumption. I know that the ZTEX boards are old
now and that there are better solutions, but the same is true of GPUs;
just look at the V100. I'm happy with the results.

- Jens


Solar Designer

2018-07-23 19:41:35 UTC

Post by Jens Steube
For sha512crypt I'm getting around 377kH/s on all four GPU. That
translates to ~94300 per 90W.
For Drupal7 I'm getting around 156kH/s on all four GPU. That translates
to ~39000 per 90W.
This is a weird result on the first look.

Why, it looks reasonable to me. Thanks for sharing it.

Post by Jens Steube
If I understand your
measurements correctly a single quad FPGA board is doing 54600H/s at 40W
on sha512crypt and 16600H/s at 40W on Drupal7. If you scale this up to
90W, it's 122850H/s per sha512crypt and 37350H/s per Drupal7. That means
from power consumption perspective it's 30% faster than the GPU for
sha512crypt, but at the same time it's slower for Drupal7?

Right. Our sha512crypt and Drupal7 on FPGA are basically same speed in
terms of their underlying SHA-512 hashes computed per second. Like I
mentioned, our Drupal7 could have been more optimal in a specialized
design without support for unaligned access and maybe without the soft
CPUs at all (we could have freed up that logic to have up to 25% more
SHA-512 cores maybe), but we got it almost for free here (on top of the
sha512crypt design), so we're happy. On GPU, you actually take
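
The "same speed in terms of underlying SHA-512" observation can be checked roughly in Python. This is only a sketch under loud assumptions: the sha512crypt test hashes are assumed to use the default 5000 rounds (the logs above do not state the round count), each round or iteration is assumed to cost about one SHA-512 block for such short passwords, and setup work outside the main loop is ignored. The Drupal7 iteration count of 16384 is taken from the log.

```python
SHA512CRYPT_ROUNDS = 5000      # assumption: default rounds, not stated in the logs
DRUPAL7_ITERATIONS = 16384     # "Cost 1 (iteration count) is 16384" in the log


def sha512_blocks_per_second(hashes_per_second, iterations):
    """Approximate SHA-512 block rate, assuming one block per iteration."""
    return hashes_per_second * iterations


if __name__ == "__main__":
    # Four boards at 160 MHz, final c/s figures from the runs above
    sha512crypt = sha512_blocks_per_second(218647, SHA512CRYPT_ROUNDS)
    drupal7 = sha512_blocks_per_second(66541, DRUPAL7_ITERATIONS)
    print(f"sha512crypt: ~{sha512crypt / 1e9:.2f} G blocks/s")
    print(f"Drupal7:     ~{drupal7 / 1e9:.2f} G blocks/s")
    print(f"ratio: {sha512crypt / drupal7:.3f}")
```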

Post by Jens Steube
The reason
here is the branches in the loop function in sha512crypt which is a
special case. GPU's really don't like them.

Actually, when all passwords loaded on the GPU at once are of the same
length I guess the branches don't hurt much. What hurts is the need to
support unaligned accesses - and I guess you avoid this overhead in your
Drupal7 kernel.

Post by Jens Steube
IOW, the GPU implementation
for all *crypt algorithms is a bit below its theoretical maximum. In
Drupal7 (and PBKDF2 and most other KDF) there's no such branches in the
loop thus the GPU can perform at full speed on all compute units.
As you can see here the GPU of today are pretty close when it comes to
power consumption to a FPGA board. I know that ztex boards are old now
and that there's better solutions, but the same as with newer GPU, see
alone the V100. I'm happy with the results.

Right. Spartan-6 was introduced in 2009(?) on a 45nm process, and as a
budget series (Virtex-6 devices were larger and faster). NVIDIA Pascal
was introduced in 2016 on a 16nm process. So there's bigger potential
for improvement by switching from Spartan-6 to current UltraScale+ FPGAs
(2016, 16nm) than from Pascal to Volta (2017-2018, 12nm).

V100 is about twice as large as GTX 1080. VU9P as offered on AWS F1 is
~16x larger than our Spartan-6 LX150 (so ~4x larger than our boards) and
also faster (we'll have a higher clock rate - e.g., I saw mentions of it
running Keccak at 700+ MHz as a power consumption stress-test that
altcoin miners now use). And this isn't even the largest FPGA (but
apparently larger ones are unrealistic to cool at full utilization).
The drawback is price. Thousands of those boards tweaked for
cryptocurrency mining (lower core voltage, etc.) were recently offered
and quickly sold out to altcoin miners for $3600 each. The originals are
called VCU1525, the tweaked ones BCU1525 - you might want to Google them
and the reported altcoin mining speeds vs. GPUs. I didn't look into this
closely yet, but if people are buying these then there must be a
significant advantage.

Thanks again,

Alexander

Solar Designer

2018-10-13 13:29:08 UTC

Post by Solar Designer
Per Xilinx tools, this design was supposed to work at 225 MHz.
Unfortunately, in our testing it only works at this frequency with very
few units built into the bitstream. We don't know exactly why (maybe
it's the power draw). With 10 units, the design works reliably for us
at 135 MHz on many boards tested, so that's what we set as the current
default. It also sometimes works at higher frequencies such as 160 MHz,
but other times not. This is configurable in john.conf.

Denis has now backported some improvements from his md5crypt+phpass
design into this older sha512crypt+Drupal7 design, which let us increase
the default frequency for sha512crypt and Drupal7 from 135 to 160 MHz.
Some lucky boards also work at 170 MHz now.

Alexander