Discussion: [john-users] OMP vs. OpenCL performance
Scott I. Remick
2017-09-29 03:29:10 UTC
Hello! I am taking my first stab at playing around with JtR and it's
certainly been enlightening. I have it working (I think), so I don't have
any problems, but rather a question:

My system has an Intel Xeon E3-1276v3 (quad-core 3.6GHz) and an nVidia
GeForce GTX 750. When I ran a pre-built binary (John the Ripper
1.8.0-jumbo-1-5901-gbda8f8e+ OMP [linux-gnu 64-bit AVX2-ac]) and
launched it, I saw it spawn 8 processes (hyperthreading) and was getting
a measly 13-14 p/s on PBKDF2-HMAC-SHA512. But then I compiled a newer
build w/ OpenCL (John the Ripper 1.8.0-jumbo-1-5908-g004c382 OMP
[linux-gnu 64-bit AVX2-ac]), confirmed OpenCL and then forced it with a
suitable --format option. That instance (even running simultaneously with
the OMP instance) is currently getting 760 p/s (and rising; it was 590
when I started) running on just the single GPU. This seems ridiculously
fast...? Is the speed boost really that extreme? I don't even have a
particularly powerful GPU... I figured the four 3.6GHz CPU cores
cumulatively would beat it. Or did I mess up something with my CPU/OMP
build?

(This is all on Ubuntu 16.04, if it matters)

This is making me re-think my passively-cooled GPU card! :D For general
usage it's fine (thanks to the very good case cooling), but if I'm
going to start using this as a compute card I might want some sort of
on-card cooling that can respond to GPU temps. 79°C currently, but it's
been rising from the 54°C it was when I started 30 mins ago.
magnum
2017-09-29 07:10:34 UTC
Post by Scott I. Remick
My system has an Intel Xeon E3-1276v3 (quad-core 3.6GHz) and an nVidia
GeForce GTX 750. When I ran a pre-built binary (John the Ripper
1.8.0-jumbo-1-5901-gbda8f8e+ OMP [linux-gnu 64-bit AVX2-ac]) and
launched it, I saw it spawn 8 processes (hyperthreading) and was getting
a measly 13-14 p/s on PBKDF2-HMAC-SHA512.
I presume your hashes are in the order of 500,000 iterations. If not,
that's too slow.
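
(Rough sanity check: PBKDF2 cost is linear in the iteration count, and a
quad-core CPU in this class might do somewhere around 700 c/s at 10,000
iterations, so 13-14 p/s only adds up at several hundred thousand
iterations: 700 c/s * 10000 / 500000 ≈ 14 c/s.)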
Post by Scott I. Remick
But then I compiled a newer
build w/ OpenCL (John the Ripper 1.8.0-jumbo-1-5908-g004c382 OMP
[linux-gnu 64-bit AVX2-ac]), confirmed OpenCL and then forced it with a
suitable --format option. That instance (even running simultaneously with
the OMP instance) is currently getting 760 p/s (and rising; it was 590
when I started) running on just the single GPU. This seems ridiculously
fast...? Is the speed boost really that extreme? I don't even have a
particularly powerful GPU...
It's plausible. My (REALLY weak) GPU, a GT650M, outperforms my i7-3615QM
2.30GHz (also 4 cores/8 HT) by 20% or so. Heck, even my ridiculous "Intel HD
Graphics 4000" can do half the speed of my CPU!
Post by Scott I. Remick
This is making me re-think my passively-cooled GPU card! :D For general
usage it's fine (thanks to the very good case cooling), but if I'm
going to start using this as a compute card I might want some sort of
on-card cooling that can respond to GPU temps. 79°C currently, but it's
been rising from the 54°C it was when I started 30 mins ago.
Ensure you buy an nvidia with Maxwell (9xx) or Pascal (10xx) chipset. As
you've seen, even a low budget one will be way faster than your CPU.

magnum
Scott I. Remick
2017-09-29 13:34:55 UTC
Post by magnum
I presume your hashes are in the order of 500,000 iterations. If not,
that's too slow.
I did some research and couldn't find a definitive answer to that, but I
wouldn't put it beyond Apple to use that many in MacOS in 2017. Perhaps
someone on this list knows.

But is that "too slow" opinion based just upon the OMP p/s, or the
OpenCL figure as well? Since I'm already a convert to OpenCL and going
to run with that, as long as the OpenCL numbers look good I won't stress
out about it. But if those figures also seem too small for the presumed
iterations then I'll definitely be looking for assistance to diagnose
that.
Post by magnum
Ensure you buy an nvidia with Maxwell (9xx) or Pascal (10xx) chipset. As
you've seen, even a low budget one will be way faster than your CPU.
Definitely. I'm leaning towards a GTX 1070; having built several Oculus
Rift and HTC Vive VR workstations at this point, it seems to be the
sweet spot at the moment. And one with a blower-exhaust design (vs. an
open-air "churn" design), like the Asus card I've had great success with.
However, this is a low priority and not happening anytime soon, so for
the time being I'm looking to squeeze out as much performance from my
existing hardware as I can.
Frank Dittrich
2017-09-29 13:44:55 UTC
Post by Scott I. Remick
Post by magnum
I presume your hashes are in the order of 500,000 iterations. If not,
that's too slow.
I did some research and couldn't find a definitive answer to that, but I
wouldn't put it beyond Apple to use that many in MacOS in 2017. Perhaps
someone on this list knows.
Actually, john will provide that information:


Loaded 1 password hash (PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+
[PBKDF2-SHA512 128/128 AVX 2x])
Cost 1 (iteration count) is 24213 for all loaded hashes

or

Loaded 11 password hashes with 10 different salts (PBKDF2-HMAC-SHA512,
GRUB2 / OS X 10.8+ [PBKDF2-SHA512 128/128 AVX 2x])
Loaded hashes with cost 1 (iteration count) varying from 1000 to 37313
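
A quick way to check, without committing to a full cracking run, is to
just load the file and abort once that line has been printed (a sketch;
hash.txt is whatever file you feed to john):

./john --format=PBKDF2-HMAC-SHA512 hash.txt
(watch for the "Cost 1 (iteration count) is ..." line, then abort with
'q' or Ctrl-C)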

Frank
Scott I. Remick
2017-09-30 16:12:46 UTC
OK, here's what I got:

~/JohnTheRipper/run$ ./john --session=opencl
--format=PBKDF2-HMAC-SHA512-opencl hash.txt
Device 0: GeForce GTX 750
Using default input encoding: UTF-8
Loaded 1 password hash (PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+
[PBKDF2-SHA512 OpenCL])
Cost 1 (iteration count) is 48543 for all loaded hashes

So if the "48543" is what you thought would need to be over 500K to
account for the speed, then I suppose maybe there is indeed a
problem...? Currently been running 1 day, 13h, on phase 3/3 and 777p/s
magnum
2017-09-30 22:38:17 UTC
Post by Scott I. Remick
~/JohnTheRipper/run$ ./john --session=opencl
--format=PBKDF2-HMAC-SHA512-opencl hash.txt
Device 0: GeForce GTX 750
Using default input encoding: UTF-8
Loaded 1 password hash (PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+
[PBKDF2-SHA512 OpenCL])
Cost 1 (iteration count) is 48543 for all loaded hashes
So if the "48543" is what you thought would need to be over 500K to
account for the speed, then I suppose maybe there is indeed a
problem...? Currently been running 1 day, 13h, on phase 3/3 and 777p/s
One (or some) of the format's test vectors have an iteration count of
10000. You can benchmark it like this:

$ ../run/john -test -form:PBKDF2-HMAC-SHA512 -cost:10000
Will run 8 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+ [PBKDF2-SHA512
128/128 AVX 2x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 10000
Raw: 721 c/s real, 96.2 c/s virtual

The figure above is from a 5-year-old laptop w/ 4 cores, 8 threads, clocked
at a relaxed 2.3 GHz. Unless I'm totally senile right now, that should
mean a figure of about 148 c/s for 48543 iterations, and you only
get a tenth of that? I have no idea why (unless your gear is also
occupied with computing other things).
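
(That estimate is just the 10,000-iteration benchmark scaled by the
iteration counts: 721 c/s * 10000 / 48543 ≈ 148 c/s.)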

Try that exact benchmark and report your outcome. The system should be
idle when benchmarking, of course.
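
For the GPU side, the corresponding benchmark would be something like
this (a sketch, using the OpenCL format name from your output; its
reported iteration count may differ from 10000, so scale accordingly):

$ ../run/john -test -form:PBKDF2-HMAC-SHA512-opencl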

magnum
magnum
2017-09-30 22:45:18 UTC
Post by magnum
Post by Scott I. Remick
~/JohnTheRipper/run$ ./john --session=opencl
--format=PBKDF2-HMAC-SHA512-opencl hash.txt
Device 0: GeForce GTX 750
Using default input encoding: UTF-8
Loaded 1 password hash (PBKDF2-HMAC-SHA512-opencl, GRUB2 / OS X 10.8+
[PBKDF2-SHA512 OpenCL])
Cost 1 (iteration count) is 48543 for all loaded hashes
So if the "48543" is what you thought would need to be over 500K to
account for the speed, then I suppose maybe there is indeed a
problem...? Currently been running 1 day, 13h, on phase 3/3 and 777p/s
BTW I hope you mean 777 p/s on GPU now. If you get that speed on CPU,
all bets are off - it's now too fast (LOL). The answer below was in the
context of your CPU figure of about 13-14 p/s.

magnum
Post by magnum
One (or some) of the format's test vectors have an iteration count of
10000. You can benchmark it like this:
$ ../run/john -test -form:PBKDF2-HMAC-SHA512 -cost:10000
Will run 8 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+ [PBKDF2-SHA512
128/128 AVX 2x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 10000
Raw:    721 c/s real, 96.2 c/s virtual
The figure above is from a 5-year-old laptop w/ 4 cores, 8 threads, clocked
at a relaxed 2.3 GHz. Unless I'm totally senile right now, that should
mean a figure of about 148 c/s for 48543 iterations, and you only
get a tenth of that? I have no idea why (unless your gear is also
occupied with computing other things).
Try that exact benchmark and report your outcome. The system should be
idle when benchmarking, of course.
Scott I. Remick
2017-10-01 17:47:18 UTC
Post by magnum
BTW I hope you mean 777 p/s on GPU now. If you get that speed on CPU,
all bets are off - it's now too fast (LOL). The answer below was in the
context of your CPU figure of about 13-14 p/s.
Yes! Sorry if that wasn't clear. Since it seems the CPU wouldn't have beaten
this even if OMP were working properly, I'm inclined not to worry
about OMP for the time being and focus on OpenCL (especially since it's
working now... it took me a while to get it functional). Unless I can
somehow split the number-crunching between the two.

Does 777 p/s seem reasonable for a GeForce GTX 750?

Solar Designer
2017-09-30 23:05:54 UTC
Post by magnum
One (or some) of the format's test vectors have an iteration count of
10000. You can benchmark it like this:
$ ../run/john -test -form:PBKDF2-HMAC-SHA512 -cost:10000
Will run 8 OpenMP threads
Benchmarking: PBKDF2-HMAC-SHA512, GRUB2 / OS X 10.8+ [PBKDF2-SHA512
128/128 AVX 2x]... (8xOMP) DONE
Speed for cost 1 (iteration count) of 10000
Raw: 721 c/s real, 96.2 c/s virtual
The figure above is from a 5-year-old laptop w/ 4 cores, 8 threads, clocked
at a relaxed 2.3 GHz. Unless I'm totally senile right now, that should
mean a figure of about 148 c/s for 48543 iterations, and you only
get a tenth of that? I have no idea why (unless your gear is also
occupied with computing other things).
Try that exact benchmark and report your outcome. The system should be
idle when benchmarking, of course.
FWIW, with OpenMP even slight other load can have a great impact on JtR's
overall performance, because the threads are sync'ed often, so all
threads wait for the slowest one. And the slowest one might become e.g.
twice as slow if there's some heavy JavaScript running in a web browser
tab, competing with JtR for one logical CPU. Ten times slower is
surprising, but not entirely unrealistic.

Our OpenMP support should be used on otherwise completely idle systems,
or else the number of threads should be reduced e.g. with:

OMP_NUM_THREADS=7 ./john ...

There are other OpenMP settings that could be tuned for better (or
rather not as bad) behavior under load, such as setting
GOMP_SPINCOUNT=10000 (to reduce busy waits, which hurt other logical
CPUs in the same cores; the default is usually higher than 10000) or
enabling dynamic scheduling.
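
For example, a single invocation combining those settings might look
like this (just a sketch; hash.txt and the format name are taken from
earlier in the thread):

OMP_NUM_THREADS=7 GOMP_SPINCOUNT=10000 ./john --format=PBKDF2-HMAC-SHA512 hash.txt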

Or indeed "--fork=8" may be used - it is somewhat cumbersome, but it's
not impacted by other system load excessively, unlike OpenMP.
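
A minimal sketch of that (again with the hash file and format name from
earlier in the thread; setting OMP_NUM_THREADS=1 keeps each forked
process from also spawning a full set of OpenMP threads):

OMP_NUM_THREADS=1 ./john --fork=8 --format=PBKDF2-HMAC-SHA512 hash.txt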

Alexander