Discussion:
Splitting workload on multiple hosts
Lasse Knudsen
2013-12-11 20:51:45 UTC
Permalink
Greetings,

I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?

eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?

It looks like Incremental:ALLN almost does this, but it splits on the
individual characters rather than on the amount of work in each job
slice.

I guess it could be done by generating the wordlists (aaaaaaaaaa-bbbbbbbb)
and then feeding individual wordlists to the jobs, but I believe that would
take a lot of time to generate, and they would consume a lot of space.

I'm not sure my question makes sense, but hey, I tried :)
-lk
Rich Rumble
2013-12-11 22:25:47 UTC
Permalink
On Wed, Dec 11, 2013 at 3:51 PM, Lasse Knudsen
Post by Lasse Knudsen
I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?
Others asked very similar questions (like me:)
http://www.openwall.com/lists/john-users/2011/06/28/1
Post by Lasse Knudsen
eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?
It looks like Incremental:ALLN almost does this, but it splits on the
individual characters rather than on the amount of work in each job
slice.
http://www.openwall.com/lists/john-users/2011/06/28/1
Incremental isn't as "dumb" as aaa, aab, aac, aad, etc. It chooses more
likely candidates than that. There is an External mode called "dumbforce",
and I asked about that here:
http://www.openwall.com/lists/john-users/2012/08/06/3
It does iterate in a more traditional brute-force way.
Post by Lasse Knudsen
I guess it could be done by generating the wordlists (aaaaaaaaaa-bbbbbbbb)
and then feeding individual wordlists to the jobs, but I believe that would
take a lot of time to generate, and they would consume a lot of space.
There are other places you may want to look at first before posting
questions, but certainly follow up or ask further questions if you are
not satisfied with what you find.
http://openwall.info/wiki/john
http://openwall.info/wiki/john/mailing-list-excerpts
The wiki is a great place for information, and the mailing list archives
are good too. I know nothing of MPI myself, but I know it is a way you can
split the work up, as are the Fork and Node options.
http://openwall.info/wiki/john/parallelization
-rich
magnum
2013-12-11 23:59:13 UTC
Permalink
Post by Lasse Knudsen
Greetings,
I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?
eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?
From version 1.8 you can say "I'm node 427 out of 10000" using the
option "--node=427/10000". If some of your nodes are much stronger than
others you can tell them to do more work, eg. "--node=1-20/10000" will
make this node do the first 20 splices.

magnum
Lasse Knudsen
2013-12-12 18:35:57 UTC
Permalink
Hi Magnum,

this sounds like exactly what I need,

thank you
-lk
Post by Lasse Knudsen
Greetings,
I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?
eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
magnum
Rich Rumble
2014-08-04 03:59:09 UTC
Permalink
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
So for an MPI host, would you also use "--node=1-8/16" on one host, and
"--node=9-16/16" on the other? Assuming they are nearly identical and have
8 cores to use.
-rich
magnum
2014-08-04 09:50:54 UTC
Permalink
Post by Rich Rumble
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
So for an MPI host, would you also use "--node=1-8/16" on one host, and
"--node=9-16/16" on the other? Assuming they are nearly identical and have
8 cores to use.
You would normally use MPI options and no --node option. Eg. "mpirun
-host=alpha,bravo -np 16 ./john (...)" for splitting the job in 16
processes over two hosts (so 8 on each).

However, if you want an MPI job to be part of a larger job (as in the
original example) you'd do something like "mpirun -host=alpha,bravo -np
16 ./john -node=1-16/10000 (...)".

Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:

./john -fork=16 -node=1-16/10000 (...)

mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)

magnum
Rich Rumble
2014-08-04 14:44:59 UTC
Permalink
Post by magnum
Post by Rich Rumble
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
So for an MPI host, would you also use "--node=1-8/16" on one host, and
"--node=9-16/16" on the other? Assuming they are nearly identical and have
8 cores to use.
You would normally use MPI options and no --node option. Eg. "mpirun
-host=alpha,bravo -np 16 ./john (...)" for splitting the job in 16
processes over two hosts (so 8 on each).
However, if you want an MPI job to be part of a larger job (as in the
original example) you'd do something like "mpirun -host=alpha,bravo -np 16
./john -node=1-16/10000 (...)".
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
-rich
magnum
2014-08-04 23:31:04 UTC
Permalink
Post by Rich Rumble
Post by magnum
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
The key space distribution is 100% identical whether you use fork, MPI
or just manual instances of --node.

magnum
Rich Rumble
2018-04-09 17:37:30 UTC
Permalink
Post by Rich Rumble
Post by magnum
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
The key space distribution is 100% identical whether you use fork, MPI or
just manual instances of --node.
magnum
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
Solar Designer
2018-04-09 19:03:00 UTC
Permalink
Post by Rich Rumble
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
How much RAM do you have? It sounds like some of the child processes
are simply getting killed on out-of-memory. Unfortunately, when JtR
cracks a password the child processes deviate from each other in their
memory contents, and their combined memory usage grows. This is not
ideal, but that's how it currently is with "--fork".

You'll want to get most of those HaveIBeenPwnd v2 passwords cracked
while running fewer processes (e.g., initially just one or two so that
you can possibly load all of the hashes at once) before you proceed to
attempt using all 24 that you have in your machine.

Helpful john.conf settings:

NoLoaderDupeCheck = Y

This is the default anyway, but maybe worth double-checking:

ReloadAtCrack = N

These are not obviously an improvement (with these at "N", the pot file
may grow larger from more duplicate entries, but cracking will be faster
and the memory usage increase from copy-on-write across --fork'ed
processes should be less, so more of them may be run):

ReloadAtDone = N
ReloadAtSave = N
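Put together, that part of john.conf would look something like this (a
sketch; in jumbo these should go in the [Options] section, but double-check
against the defaults shipped with your build):

[Options]
NoLoaderDupeCheck = Y
ReloadAtCrack = N
ReloadAtDone = N
ReloadAtSave = N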

Helpful command-line options:

-verb=1 -nolog -save-mem=1

"-save-mem=1" should actually speed things up by not wasting memory on
pointers to (non-existent) login names, which also improves the locality
of reference. "-save-mem=2" has performance impact and is probably not
worth it in this case.
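A full command line along these lines might be (the fork count and the
file name are placeholders to adapt to your memory situation):

./john -verb=1 -nolog -save-mem=1 -fork=2 -format=raw-sha1 -incremental hibp-slice-01.txt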

You may also want to increase PASSWORD_HASH_SIZE_FOR_LDR in params.h by
one (from 4 to 5) to speed up loading of large hash files like this, and
rebuild. (The same change slows down loading of small files, which is
why it's not the default.)
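That's a one-line edit in params.h, roughly (the exact whitespace and
surrounding context may differ between versions):

-#define PASSWORD_HASH_SIZE_FOR_LDR 4
+#define PASSWORD_HASH_SIZE_FOR_LDR 5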

FWIW, I previously experimented with HaveIBeenPwnd v1, which was 320M
hashes. I loaded those all at once (without splitting) and was able to
run a few forks at first (4 or so) and all 40 forks eventually on a
machine with 128 GB RAM with 40 logical CPUs.

You really need to watch your RAM usage when you do things like this.
If you see less than a half of RAM free, chances are it will be eaten up
and some children will die as they crack more passwords. So try to keep
your fork count such that you leave a half of RAM free when cracking
just starts.
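An easy way to keep an eye on it (standard Linux tooling, nothing
JtR-specific) is to leave something like

watch -n 60 free -h

running in another terminal while cracking ramps up.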

Alexander
Rich Rumble
2018-04-09 20:13:55 UTC
Permalink
Post by Solar Designer
Post by Rich Rumble
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
How much RAM do you have? It sounds like some of the child processes
are simply getting killed on out-of-memory. Unfortunately, when JtR
cracks a password the child processes deviate from each other in their
memory contents, and their combined memory usage grows. This is not
ideal, but that's how it currently is with "--fork".
You'll want to get most of those HaveIBeenPwnd v2 passwords cracked
while running fewer processes (e.g., initially just one or two so that
you can possibly load all of the hashes at once) before you proceed to
attempt using all 24 that you have in your machine.
NoLoaderDupeCheck = Y
ReloadAtCrack = N
These are not obviously an improvement (with these at "N", the pot file
may grow larger from more duplicate entries, but cracking will be faster
and the memory usage increase from copy-on-write across --fork'ed
ReloadAtDone = N
ReloadAtSave = N
-verb=1 -nolog -save-mem=1
"-save-mem=1" should actually speed things up by not wasting memory on
pointers to (non-existent) login names, which also improves the locality
of reference. "-save-mem=2" has performance impact and is probably not
worth it in this case.
You may also want to increase PASSWORD_HASH_SIZE_FOR_LDR in params.h by
one (from 4 to 5) to speed up loading of large hash files like this, and
rebuild. (The same change slows down loading of small files, which is
why it's not the default.)
FWIW, I previously experimented with HaveIBeenPwnd v1, which was 320M
hashes. I loaded those all at once (without splitting) and was able to
run a few forks at first (4 or so) and all 40 forks eventually on a
machine with 128 GB RAM with 40 logical CPUs.
You really need to watch your RAM usage when you do things like this.
If you see less than a half of RAM free, chances are it will be eaten up
and some children will die as they crack more passwords. So try to keep
your fork count such that you leave a half of RAM free when cracking
just starts.
Alexander
I have 32 GB of RAM, and I never tried without fork this whole time :) I'm
past 300M cracked (out of 501M), and I know hashes.org is already at 95% or
more; I was just trying to do it myself. Until the last few days I was
always using fork=24, and unless I used save-memory=2 it was going over and
into swap for the 1 GB slices. I was also using min-length=6, so at most I
would have thought only a few processes would have died off/finished in the
short runs I'm doing. Wordlist, Prince and hybrid mask modes also ate up
tons of RAM on large wordlists (300 MB+); incremental has been the lesser
user so far. After these runs go a few more days I may just start the whole
process over again with your suggestions and see how things go. Thanks
Solar Designer
2018-04-09 20:44:22 UTC
Permalink
Post by Rich Rumble
I used save-memory=2 it was going over and into swap for the 1G slices.
You'd likely achieve better speed by using --save-memory=1 and running
fewer forks. The performance difference between 12 and 24 forks is
probably small (those are just second logical CPUs in the same cores).
The performance difference between --save-memory=1 and --save-memory=2
for large hash counts when things do fit in RAM with =1 can be large (a
few times, as it can be a 16x difference in bitmap and hash table size
and thus up to as much difference in lookup speed). You could very well
prefer, say, 6 forks and lower memory saving over 24 forks and larger
memory saving per each. Larger chunks and fewer forks, too. These are
unsalted hashes, and there's little point in recomputing the same hashes
(of the same candidate passwords) for each chunk when you can avoid that
(even if by using fewer CPU cores at a time).
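Concretely (file names are placeholders), that could mean merging several
of your 1 GB slices and running for example

cat slice01.txt slice02.txt slice03.txt > chunk-a.txt
./john -fork=6 -save-mem=1 -format=raw-sha1 -incremental chunk-a.txt

rather than more forks at -save-mem=2 against each small slice.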
Post by Stephen John Smoogen
Are you able to use taskset to push each one to a CPU? I found that
sometimes the kernel would shove multiple processes to the same CPU.
This was done by the kernel rather than the process itself, so taskset
or similar tools were needed to get the forks off onto their own
CPUs.
This shouldn't be much of a problem with recent kernels, except for
latency sensitive tasks which password cracking isn't, and anyway it
would be the least of Rich's worries given what he's doing.

Alexander

Stephen John Smoogen
2018-04-09 20:22:17 UTC
Permalink
Post by Rich Rumble
Post by Rich Rumble
Post by magnum
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
The key space distribution is 100% identical whether you use fork, MPI or
just manual instances of --node.
magnum
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
-rich
Are you able to use taskset to push each one to a CPU? I found that
sometimes the kernel would shove multiple processes to the same CPU.
This was done by the kernel rather than the process itself, so taskset
or similar tools were needed to get the forks off onto their own
CPUs. [If this is not possible for obvious reasons, I am sorry.. saw
this while debugging a different problem with taskset ]
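For example (the core numbers and file names are just placeholders), one
instance per terminal, something like

taskset -c 0-5 ./john -fork=6 -format=raw-sha1 -incremental slice-a.txt
taskset -c 6-11 ./john -fork=6 -format=raw-sha1 -incremental slice-b.txt

should keep each instance and its forked children on a disjoint set of
cores, since the affinity mask is inherited across fork.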
--
Stephen J Smoogen.