Discussion:
[john-users] hashcat CPU vs. JtR
Solar Designer
2015-12-06 14:40:44 UTC
Hi,

Most value of hashcat is in oclHashcat, and I greatly appreciate atom's
generosity in making it open source along with the CPU hashcat. There's
more for us to learn from it. However, this one posting is about the
CPU hashcat.

What are some reasons why someone may prefer to use hashcat over JtR,
both on CPU? Is it some cracking modes we don't have equivalents for in
JtR? What are those?

hashcat appears to support a subset of hash types that we have in jumbo,
and in my testing today is typically 2 to 3 times slower than JtR, with
few exceptions. (This is consistent with what I heard from others
before. I just didn't test this myself until now.)

The most notable exception, where hashcat is much faster than JtR, is
with its multi-threading support for fast hashes. When using JtR on
fast hashes, currently --fork should be used instead of multiple threads,
and it can be cumbersome (multiple status lines instead of one, the
child processes terminating not exactly at the same time, etc.)
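For reference, a typical --fork run looks something like this (a sketch;
the wordlist and hash file names are placeholders):

```sh
# Split a fast-hash attack across 32 processes; each child prints its
# own status line, and the children finish at slightly different times.
../run/john --fork=32 --wordlist=wordlist.txt --rules hashes.txt
```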

Another exception is bcrypt, where hashcat roughly matches the best
speed we can get out of JtR, and in fact beats a default build of JtR
on our 2x E5-2670 machine (which I am testing this on):

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 3200
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: bcrypt, Blowfish(OpenBSD)
Speed/sec: 16.82k words

JtR is slightly slower by default (built with the same gcc 4.9.1 as
hashcat above):

[***@super src]$ ../run/john -test -form=bcrypt
Will run 32 OpenMP threads
Benchmarking: bcrypt ("$2a$05", 32 iterations) [Blowfish 32/64 X2]... (32xOMP) DONE
Speed for cost 1 (iteration count) of 32
Raw: 16128 c/s real, 506 c/s virtual

Its performance on this machine can be improved to 16900 c/s (same as
hashcat) by forcing BF_X2 = 3 in arch.h, but the current logic in jumbo
is to use that setting only on HT-less Intel CPUs (and these Xeons are
HT-capable), since the default appears to work slightly better on many
other CPUs (just not on this particular machine).
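For reference, the override amounts to a one-line change (a sketch; the
exact placement of the macro in arch.h varies between builds):

```c
/* arch.h: interleave 3 Blowfish instances per thread instead of the
 * default.  Helps on this HT-capable E5-2670, hurts on many others. */
#undef BF_X2
#define BF_X2 3
```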

Another exception I noticed is scrypt, where hashcat is only moderately
slower than JtR:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 8900
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: scrypt
Speed/sec: 639 words

[***@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=scrypt
Will run 32 OpenMP threads
Benchmarking: scrypt (16384, 8, 1) [Salsa20/8 128/128 AVX]... (32xOMP) DONE
Speed for cost 1 (N) of 16384, cost 2 (r) of 8, cost 3 (p) of 1
Raw: 878 c/s real, 27.6 c/s virtual

(BTW, I think this used to be ~960 c/s. Looks like we got a performance
regression we need to look into, or just get the latest yescrypt code in
first and then see.)

hashcat is at 639/878 = 73% of JtR's speed at scrypt here

Yet another exception is SunMD5, where I am puzzled about what hashcat
is actually benchmarking:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 3300
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: MD5(Sun)
Speed/sec: 223.64M words

[***@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=sunmd5
Will run 32 OpenMP threads
Benchmarking: SunMD5 [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw: 10593 c/s real, 332 c/s virtual

223.64M vs. 10.6K?! This can't be right. SunMD5 with typical settings
is known to be slow.

For most other hash types I checked, JtR is a lot faster, e.g.:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 500
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: md5crypt, MD5(Unix), FreeBSD MD5, Cisco-IOS MD5
Speed/sec: 269.21k words

[***@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=md5crypt
Will run 32 OpenMP threads
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw: 729600 c/s real, 22750 c/s virtual

729600/269210 = 2.71 times faster
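The relative-speed figures quoted throughout this thread are plain
ratios of the two benchmark rates, converted to the same unit; e.g.:

```sh
# JtR md5crypt c/s divided by hashcat's (269.21k words = 269210 c/s).
awk 'BEGIN { printf "%.2f\n", 729600 / 269210 }'
```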

sha512crypt:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 1800
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: sha512crypt, SHA512(Unix)
Speed/sec: 5.35k words

[***@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=sha512crypt
Will run 32 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 AVX 2x]... (32xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw: 11299 c/s real, 354 c/s virtual

11299/5350 = 2.11 times faster

Raw MD5:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 0
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: MD5
Speed/sec: 268.55M words

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 0 -n 1
Initializing hashcat v2.00 with 1 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 1

Hash type: MD5
Speed/sec: 12.71M words

Good multi-threaded efficiency (unlike JtR's at fast hashes like this),
but poor per-thread speed. JtR's is:

[***@super src]$ ../run/john -test -form=raw-md5
Benchmarking: Raw-MD5 [MD5 128/128 AVX 4x3]... DONE
Raw: 38898K c/s real, 38898K c/s virtual

OpenMP is compile-time disabled for fast hashes (which is the current
default in bleeding-jumbo), so this is for 1 thread (and --fork should
be used - yes, with its drawbacks).

38898/12710 = 3.06 times faster

Raw SHA-1:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 100 -n 1
Initializing hashcat v2.00 with 1 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 1

Hash type: SHA1
Speed/sec: 10.12M words

[***@super src]$ ../run/john -test -form=raw-sha1
Benchmarking: Raw-SHA1 [SHA1 128/128 AVX 4x]... DONE
Raw: 19075K c/s real, 19075K c/s virtual

19075/10120 = 1.88 times faster

Not that bad. I guess hashcat has optimizations here that we don't
have, but lacks interleaving. Still, I wouldn't use hashcat over
john --fork.

NTLM:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 1000 -n 1
Initializing hashcat v2.00 with 1 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 1

Hash type: NTLM
Speed/sec: 14.21M words

[***@super src]$ ../run/john -test -form=nt
Benchmarking: NT [MD4 128/128 AVX 4x3]... DONE
Raw: 44687K c/s real, 44687K c/s virtual

44687/14210 = 3.14 times faster

Raw SHA-256:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 1400 -n 1
Initializing hashcat v2.00 with 1 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 1

Hash type: SHA256
Speed/sec: 5.10M words

[***@super src]$ OMP_NUM_THREADS=1 ../run/john -test -form=raw-sha256
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: Raw-SHA256 [SHA256 128/128 AVX 4x]... DONE
Raw: 9068K c/s real, 9068K c/s virtual

9068/5100 = 1.78 times faster

We also have OpenMP support enabled by default for raw SHA-256, but it
doesn't scale well for 32 threads:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 1400
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: SHA256
Speed/sec: 80.85M words

[***@super src]$ ../run/john -test -form=raw-sha256
Will run 32 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 128/128 AVX 4x]... (32xOMP) DONE
Raw: 39976K c/s real, 3774K c/s virtual

[***@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=raw-sha256
Will run 32 OpenMP threads
Benchmarking: Raw-SHA256 [SHA256 128/128 AVX 4x]... (32xOMP) DONE
Raw: 40370K c/s real, 3731K c/s virtual

hashcat is 2 times faster with multi-threading, but JtR --fork would be
faster yet.
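The poor scaling is easy to quantify from the single-thread figure
further above (9068K c/s) against the 32-thread one (40370K c/s):

```sh
# Speedup and parallel efficiency of JtR's OpenMP raw-SHA256 run.
awk 'BEGIN { s = 40370 / 9068;
             printf "%.2fx speedup, %.0f%% efficiency\n", s, 100 * s / 32 }'
```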

Raw SHA-512:

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 1700 -n 1
Initializing hashcat v2.00 with 1 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 1

Hash type: SHA512
Speed/sec: 1.32M words

[***@super src]$ OMP_NUM_THREADS=1 ../run/john -test -form=raw-sha512
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: Raw-SHA512 [SHA512 128/128 AVX 2x]... DONE
Raw: 3856K c/s real, 3856K c/s virtual

3856/1320 = 2.92 times faster

[***@super hashcat-build]$ ./hashcat-cli64.bin -b -m 1700
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: x86_64
Number of threads: 32

Hash type: SHA512
Speed/sec: 26.80M words

[***@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=raw-sha512
Will run 32 OpenMP threads
Benchmarking: Raw-SHA512 [SHA512 128/128 AVX 2x]... (32xOMP) DONE
Raw: 23330K c/s real, 1577K c/s virtual

SHA-512 is slow enough that JtR's (poor) multi-threading support is
almost on par with hashcat's even at 32 threads. Yet --fork would be
2 to 3 times faster than hashcat.

My JtR benchmarks are with yesterday's bleeding-jumbo. It could be
better to (also) use actual cracking runs to compare the tools - maybe
someone else will.

Alexander
Solar Designer
2015-12-06 14:53:12 UTC
Post by Solar Designer
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...
Instruction set..: x86_64
Oh, I just realized I should have explicitly built for AVX. I just did,
with "make posixAVX". This produced hashcat-cliAVX.bin. Somehow it's a
bit slower at bcrypt:

[***@super hashcat-build]$ ./hashcat-cliAVX.bin -b -m 3200
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: AVX
Number of threads: 32

Hash type: bcrypt, Blowfish(OpenBSD)
Speed/sec: 16.46k words

but anyway these SIMD extensions (until we get faster gather loads) are
not (very) helpful for bcrypt (if at all). JtR's 16k to 16.9k at bcrypt
on this machine is without SIMD.
Post by Solar Designer
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...
Instruction set..: x86_64
Number of threads: 32
Hash type: scrypt
Speed/sec: 639 words
[***@super hashcat-build]$ ./hashcat-cliAVX.bin -b -m 8900
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: AVX
Number of threads: 32

Hash type: scrypt
Speed/sec: 839 words
Post by Solar Designer
Will run 32 OpenMP threads
Benchmarking: scrypt (16384, 8, 1) [Salsa20/8 128/128 AVX]... (32xOMP) DONE
Speed for cost 1 (N) of 16384, cost 2 (r) of 8, cost 3 (p) of 1
Raw: 878 c/s real, 27.6 c/s virtual
(BTW, I think this used to be ~960 c/s. Looks like we got a performance
regression we need to look into, or just get the latest yescrypt code in
first and then see.)
hashcat is at 639/878 = 73% of JtR's speed at scrypt here
839/878 = 95.6% of JtR's speed
Post by Solar Designer
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...
Instruction set..: x86_64
Number of threads: 32
Hash type: md5crypt, MD5(Unix), FreeBSD MD5, Cisco-IOS MD5
Speed/sec: 269.21k words
[***@super hashcat-build]$ ./hashcat-cliAVX.bin -b -m 500
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: AVX
Number of threads: 32

Hash type: md5crypt, MD5(Unix), FreeBSD MD5, Cisco-IOS MD5
Speed/sec: 274.32k words
Post by Solar Designer
Will run 32 OpenMP threads
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 AVX 4x3]... (32xOMP) DONE
Raw: 729600 c/s real, 22750 c/s virtual
729600/269210 = 2.71 times faster
729600/274320 = 2.66
Post by Solar Designer
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...
Instruction set..: x86_64
Number of threads: 32
Hash type: sha512crypt, SHA512(Unix)
Speed/sec: 5.35k words
[***@super hashcat-build]$ ./hashcat-cliAVX.bin -b -m 1800
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...

Device...........: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Instruction set..: AVX
Number of threads: 32

Hash type: sha512crypt, SHA512(Unix)
Speed/sec: 6.04k words
Post by Solar Designer
Will run 32 OpenMP threads
Benchmarking: sha512crypt, crypt(3) $6$ (rounds=5000) [SHA512 128/128 AVX 2x]... (32xOMP) DONE
Speed for cost 1 (iteration count) of 5000
Raw: 11299 c/s real, 354 c/s virtual
11299/5350 = 2.11 times faster
11299/6040 = 1.87

So clearly AVX does improve speeds at some of these, but overall it's
the same picture as before.

Alexander
magnum
2015-12-06 21:27:53 UTC
Post by Solar Designer
Post by Solar Designer
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...
Instruction set..: x86_64
Oh, I just realized I should have explicitly built for AVX. I just did,
with "make posixAVX". This produced hashcat-cliAVX.bin. Somehow it's a
(...)
So clearly AVX does improve speeds at some of these, but overall it's
the same picture as before.
Another aspect is that HC does not take advantage of AVX2 yet, and Atom
had no plans to fix that (but now someone else could). So with an AVX2
build, JtR runs circles around HC. And we also support AVX-512 already...

magnum
atom
2015-12-07 05:37:13 UTC
Back when I started with Hashcat, it was already clear that CPU-based
cracking was about to be made obsolete by GPUs. But since I was new to
both GPGPU and crypto, I had to learn a lot at once. That's simply
easier when writing code for CPU, which I did with Hashcat CPU. That
was a time when there was only SSE2, no AVX and no XOP. Back then,
hashcat was by far faster than any other CPU cracker. Once I felt I
understood the basic hashing algorithms, I immediately started to write
GPGPU code and stopped working on hashcat CPU completely, with the
result that what you see today as "hashcat" is almost 4-year-old code.
The only things I've added since were the XOP code and some algorithms.

For me it's oclHashcat that counts and which I've refactored completely
about 5 times till I found the architecture of today. In theory, a pure CPU
integration into oclHashcat would be easy. Just add a new Platform, write
the basic macros and for each hash-mode write a function that uses the same
parameters as the OpenCL kernels do and simply copy/paste the OpenCL kernel
code. I did that once, just to find out if it would work, to end up with an
NTLM BF speed of over 900 MH/s on my i7-4770K. Such a speed on CPU is
far beyond what you see in JtR or mdxfind, and it would work for any
other algorithm supported by oclHashcat. The question for me was whether
I really want to put in that effort to support a cracker that runs 25
times slower than a 250€ GPU. Unless you have an explicitly anti-GPU
algorithm, everything will run much faster on GPU. So this question
really ends up as a CPU vs. GPU discussion, which I do not want to
start here.

I just wanted to point out that all the code in oclHashcat - the
bitmaps, the rule engine, the VLIW (or SIMD) support, the optimizations
inside the algorithms and whatnot - is, compared to hashcat CPU, a
difference of 4 years and 4 refactorings. I'm really wondering why you
even started to
compare. Maybe you don't know, but many people mean oclHashcat when they
talk about hashcat, it's just an alias. That's why my long-term plan is to
add CPU support to oclHashcat, then obsolete hashcat and then rename
oclHashcat into hashcat.
Post by magnum
Post by Solar Designer
Post by Solar Designer
Initializing hashcat v2.00 with 32 threads and 32mb segment-size...
Instruction set..: x86_64
Oh, I just realized I should have explicitly built for AVX. I just did,
with "make posixAVX". This produced hashcat-cliAVX.bin. Somehow it's a
(...)
So clearly AVX does improve speeds at some of these, but overall it's
the same picture as before.
Another aspect is that HC does not take advantage of AVX2 yet, and Atom
had no plans to fix that (but now someone else could). So with an AVX2
build, JtR runs circles around HC. And we also support AVX-512 already...
magnum
--
atom
Solar Designer
2015-12-07 12:32:55 UTC
Hi atom,

Thank you for your reply.
Post by atom
For me it's oclHashcat that counts
Sure, and I agree. I started my message with "Most value of hashcat is
in oclHashcat", and I started my tweet on the topic with "oclHashcat is
the real thing".

However, I felt there had to be some value in hashcat CPU as well -
in fact, I still think there may be cracking modes that we're missing.

My comparison wasn't intended to diminish any of your results. Rather,
it was to see what you have that we don't (e.g., efficient
multi-threading for fast hashes is still one of those things) and to
provide some info (and request more from the community) on when to use
one tool or the other if running on a CPU. And yes, except for
GPU-unfriendly hash types it is obvious that using a GPU is going to be
faster, if one does have a GPU.

Also, I intended to, but forgot to mention one possible reason why one
may prefer to use hashcat CPU over JtR: if the person normally uses
oclHashcat, they may be more used to hashcat's command-line options and
exact feature set, so if they have to run on a CPU on some occasion,
they may prefer to use hashcat for that reason. (For me, this currently
works the other way around: since I normally use JtR, I may prefer to
use JtR even if oclHashcat is faster for a given hash type. But now
that you've open-sourced it, I think I need to find time and learn to
use oclHashcat for what it does better.)
Post by atom
and which I've refactored completely
about 5 times till I found the architecture of today.
Impressive. JtR didn't go through that many refactorings - there was
just one major refactoring, for version 1.5 in 1998. I certainly felt
the need to refactor it again (some years ago), but we never got around
to it, proceeding with evolutionary changes only.
Post by atom
In theory, a pure CPU
integration into oclHashcat would be easy. Just add a new Platform, write
the basic macros and for each hash-mode write a function that uses the same
parameters as the OpenCL kernels do and simply copy/paste the OpenCL kernel
code. I did that once, just to find out if it would work, to end up with an
NTLM BF speed of over 900 MH/s on my i7-4770K.
That's impressive speed, indeed. Testing JtR's current AVX2 code on a
i7-4770K (stock clocks), I get on one core:

[***@well run]$ ./john -test -form=nt
Benchmarking: NT [MD4 256/256 AVX2 8x3]... DONE
Raw: 88371K c/s real, 88371K c/s virtual

Running "-form=nt -mask='?a?a?a?a?a?a?a?a' -fork=8" against one NTLM
hash gives 28.9M per process, so 231M total. Yours is almost 4 times
higher. I attribute this difference to you doing NTLM step reversal,
which we still mostly don't, and to you having a dumb brute-force mode,
which we don't (mask is as close as we have). Sounds right, or do you
think there's more that we're missing?
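The arithmetic behind "231M total" and "almost 4 times higher" - eight
fork children at 28.9M c/s each, against the reported 900M:

```sh
# Aggregate rate of 8 --fork processes and the resulting gap to 900M.
awk 'BEGIN { total = 28.9 * 8;
             printf "%.1fM total, %.1fx gap\n", total, 900 / total }'
```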

Alain Espinosa reports comparable speeds to yours for Hash Suite (~500M
on i5-4670; no results on i7-4770K to compare directly), but it's
non-free (although its core's source code is open source under GPLv2+
via Hash Suite for Android).

Both your results and Alain's remind us that we still have plenty of
room for improvement for JtR on CPU. You might not care (since you
rightly point out that GPUs are a lot faster anyway), but it is still a
valid direction for JtR project - possibly even more so if your project
is not interested in pursuing this direction. This would result in
greater coverage of possible machines and use cases by our two projects
combined (now that users can use hashcat and JtR interchangeably to a
greater extent than before, due to hashcat becoming open source).
Post by atom
I'm really wondering why you even started to compare.
I hope I've addressed this above.
Post by atom
Maybe you don't know, but many people mean oclHashcat when they
talk about hashcat, it's just an alias. That's why my long-term plan is to
add CPU support to oclHashcat, then obsolete hashcat and then rename
oclHashcat into hashcat.
Sounds cool.

I actually did know that "oclHashcat is the real thing", but I think
that you're getting quite some new users now due to the announcement,
some of whom are not aware. Some of them may genuinely expect hashcat
CPU to run faster than JtR, without even checking whether this is in
fact the case. And it actually is the case if they crack fast hashes
with multiple threads, rather than slow hashes and/or with --fork, as
my message also pointed out.

Alexander
Frank Dittrich
2015-12-07 13:37:35 UTC
Post by Solar Designer
And yes, except for
GPU-unfriendly hash types it is obvious that using a GPU is going to be
faster, if one does have a GPU.
Not sure about oclHashcat, but with John the Ripper, GPU formats usually
require a much larger number of password candidates to achieve their
maximum speed.
If you have to test several small attacks (word lists / patterns)
against a huge number of password hashes, or with john's --single mode,
the CPU format may often be the better choice, even if the GPU format
reports a much better c/s rate.

Frank

magnum
2015-12-06 23:58:32 UTC
Post by Solar Designer
Will run 32 OpenMP threads
Benchmarking: scrypt (16384, 8, 1) [Salsa20/8 128/128 AVX]... (32xOMP) DONE
Speed for cost 1 (N) of 16384, cost 2 (r) of 8, cost 3 (p) of 1
Raw: 878 c/s real, 27.6 c/s virtual
(BTW, I think this used to be ~960 c/s. Looks like we got a performance
regression we need to look into, or just get the latest yescrypt code in
first and then see.)
When was that? I see no regression comparing to Jumbo-1.

magnum
Solar Designer
2015-12-07 12:54:49 UTC
Post by magnum
Post by Solar Designer
Will run 32 OpenMP threads
Benchmarking: scrypt (16384, 8, 1) [Salsa20/8 128/128 AVX]... (32xOMP) DONE
Speed for cost 1 (N) of 16384, cost 2 (r) of 8, cost 3 (p) of 1
Raw: 878 c/s real, 27.6 c/s virtual
(BTW, I think this used to be ~960 c/s. Looks like we got a performance
regression we need to look into, or just get the latest yescrypt code in
first and then see.)
When was that? I see no regression comparing to Jumbo-1.
I probably recall incorrectly. I guess we never integrated the faster
code into jumbo - it still uses the pretty ancient escrypt-lite. We
should update to the latest yescrypt code (although I have yet to
finalize the string encoding for yescrypt native hashes).

I've just tested yescrypt-0.7.1 and yescrypt-0.8.1 by editing their
userom.c to use "#define YESCRYPT_FLAGS 0" (requests classic scrypt) and
running "GOMP_CPU_AFFINITY=0-31 ./userom 0 16". Both reported 926 c/s
with gcc 4.9.1. Going back to RHEL6's default gcc 4.4.7 gives 945 c/s.

Availability of huge pages may also make a difference
(yescrypt-platform.c currently has HUGEPAGE_THRESHOLD at 12 MiB), but
I've just tried allocating them with sysctl and it didn't change the
numbers above on this machine.
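For reference, the allocation I tried amounts to something like this
(requires root; the page count is a guess at what a 32-thread run would
need, given that each scrypt instance at N=16384, r=8 uses 16+ MiB,
above the 12 MiB HUGEPAGE_THRESHOLD):

```sh
# Reserve 2 MiB huge pages system-wide (run as root).
sysctl -w vm.nr_hugepages=1024
```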

I think I saw 960 c/s for some other revision, but I can't find it now.

Alexander