Discussion:
Splitting workload on multiple hosts
Lasse Knudsen
2013-12-11 20:51:45 UTC
Permalink
Greetings,

I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?

eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?

It looks like Incremental:ALLN almost does this, but it splits on the
individual characters rather than on the amount of work in each job
slice.

I guess it could be done by generating the wordlists (aaaaaaaaaa-bbbbbbbb)
and then feeding individual wordlists to the jobs, but I believe that would
take a lot of time to generate, and they would consume a lot of space.

I'm not sure my question makes sense, but hey, I tried :)
-lk
Rich Rumble
2013-12-11 22:25:47 UTC
Permalink
On Wed, Dec 11, 2013 at 3:51 PM, Lasse Knudsen
Post by Lasse Knudsen
I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?
Others asked very similar questions (like me:)
http://www.openwall.com/lists/john-users/2011/06/28/1
Post by Lasse Knudsen
eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?
It looks like Incremental:ALLN almost does this, but it splits on the
individual characters rather than on the amount of work in each job
slice.
http://www.openwall.com/lists/john-users/2011/06/28/1
Incremental isn't as "dumb" as aaa, aab, aac, aad, etc. It chooses more
likely candidates than that. There is an External mode called "dumbforce",
and I asked about that here:
http://www.openwall.com/lists/john-users/2012/08/06/3
It does iterate in a more traditional brute-force way.
Post by Lasse Knudsen
I guess it could be done by generating the wordlists (aaaaaaaaaa-bbbbbbbb)
and then feeding individual wordlists to the jobs, but I believe that would
take a lot of time to generate, and they would consume a lot of space.
There are other places you may want to look at first before posting
questions, but certainly follow up or ask further questions if you are
not satisfied with what you find.
http://openwall.info/wiki/john
http://openwall.info/wiki/john/mailing-list-excerpts
The wiki is a great place for information, and the mailing list archives
are good too. I know nothing of MPI myself, but I know it is a way you can
split the work up, as are the Fork and Node options.
http://openwall.info/wiki/john/parallelization
-rich
magnum
2013-12-11 23:59:13 UTC
Permalink
Post by Lasse Knudsen
Greetings,
I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?
eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?
From version 1.8 you can say "I'm node 427 out of 10000" using the
option "--node=427/10000". If some of your nodes are much stronger than
others you can tell them to do more work, eg. "--node=1-20/10000" will
make this node do the first 20 splices.

magnum
Lasse Knudsen
2013-12-12 18:35:57 UTC
Permalink
Hi Magnum,

this sounds like exactly what I need,

thank you
-lk
Post by Lasse Knudsen
Greetings,
I was wondering if it is possible to split a large john job up into many
smaller jobs, let's say 10000, where these jobs can then individually be
sent off to a compute grid?
eg.
aaaaaaaa-ZZZZZZZZZ gets split up into 10000 individual jobs that each take
a fraction of the job?
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
magnum
Rich Rumble
2014-08-04 03:59:09 UTC
Permalink
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
So for an MPI host, would you also use "--node=1-8/16" on one host, and
"--node=9-16/16" on the other? Assuming they are nearly identical and have
8 cores to use.
-rich
magnum
2014-08-04 09:50:54 UTC
Permalink
Post by Rich Rumble
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
So for an MPI host, would you also use "--node=1-8/16" on one host, and
"--node=9-16/16" on the other? Assuming they are nearly identical and have
8 cores to use.
You would normally use MPI options and no --node option. Eg. "mpirun
-host=alpha,bravo -np 16 ./john (...)" for splitting the job in 16
processes over two hosts (so 8 on each).

However, if you want an MPI job to be part of a larger job (as in the
original example) you'd do something like "mpirun -host=alpha,bravo -np
16 ./john -node=1-16/10000 (...)".

Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:

./john -fork=16 -node=1-16/10000 (...)

mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)

magnum
Rich Rumble
2014-08-04 14:44:59 UTC
Permalink
Post by magnum
Post by Rich Rumble
From version 1.8 you can say "I'm node 427 out of 10000" using the option
"--node=427/10000". If some of your nodes are much stronger than others you
can tell them to do more work, eg. "--node=1-20/10000" will make this node
do the first 20 splices.
So for an MPI host, would you also use "--node=1-8/16" on one host, and
"--node=9-16/16" on the other? Assuming they are nearly identical and have
8 cores to use.
You would normally use MPI options and no --node option. Eg. "mpirun
-host=alpha,bravo -np 16 ./john (...)" for splitting the job in 16
processes over two hosts (so 8 on each).
However, if you want an MPI job to be part of a larger job (as in the
original example) you'd do something like "mpirun -host=alpha,bravo -np 16
./john -node=1-16/10000 (...)".
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
-rich
magnum
2014-08-04 23:31:04 UTC
Permalink
Post by Rich Rumble
Post by magnum
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
The key space distribution is 100% identical whether you use fork, MPI
or just manual instances of --node.

magnum
Rich Rumble
2018-04-09 17:37:30 UTC
Permalink
Post by Rich Rumble
Post by magnum
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
The key space distribution is 100% identical whether you use fork, MPI or
just manual instances of --node.
magnum
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
Solar Designer
2018-04-09 19:03:00 UTC
Permalink
Post by Rich Rumble
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
How much RAM do you have? It sounds like some of the child processes
are simply getting killed on out-of-memory. Unfortunately, when JtR
cracks a password the child processes deviate from each other in their
memory contents, and their combined memory usage grows. This is not
ideal, but that's how it currently is with "--fork".

You'll want to get most of those HaveIBeenPwnd v2 passwords cracked
while running fewer processes (e.g., initially just one or two so that
you can possibly load all of the hashes at once) before you proceed to
attempt using all 24 that you have in your machine.

Helpful john.conf settings:

NoLoaderDupeCheck = Y

This is the default anyway, but maybe worth double-checking:

ReloadAtCrack = N

These are not obviously an improvement (with these at "N", the pot file
may grow larger from more duplicate entries, but cracking will be faster
and the memory usage increase from copy-on-write across --fork'ed
processes should be less, so more of them may be run):

ReloadAtDone = N
ReloadAtSave = N
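Put together, that part of john.conf would look something like this (a
sketch; in jumbo these should go in the [Options] section, but double-check
against the defaults shipped with your build):

[Options]
NoLoaderDupeCheck = Y
ReloadAtCrack = N
ReloadAtDone = N
ReloadAtSave = N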

Helpful command-line options:

-verb=1 -nolog -save-mem=1

"-save-mem=1" should actually speed things up by not wasting memory on
pointers to (non-existent) login names, which also improves the locality
of reference. "-save-mem=2" has performance impact and is probably not
worth it in this case.
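A full command line along these lines might be (the fork count and the
file name are placeholders to adapt to your memory situation):

./john -verb=1 -nolog -save-mem=1 -fork=2 -format=raw-sha1 -incremental hibp-slice-01.txt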

You may also want to increase PASSWORD_HASH_SIZE_FOR_LDR in params.h by
one (from 4 to 5) to speed up loading of large hash files like this, and
rebuild. (The same change slows down loading of small files, which is
why it's not the default.)
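That's a one-line edit in params.h, roughly (the exact whitespace and
surrounding context may differ between versions):

-#define PASSWORD_HASH_SIZE_FOR_LDR 4
+#define PASSWORD_HASH_SIZE_FOR_LDR 5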

FWIW, I previously experimented with HaveIBeenPwnd v1, which was 320M
hashes. I loaded those all at once (without splitting) and was able to
run a few forks at first (4 or so) and all 40 forks eventually on a
machine with 128 GB RAM with 40 logical CPUs.

You really need to watch your RAM usage when you do things like this.
If you see less than a half of RAM free, chances are it will be eaten up
and some children will die as they crack more passwords. So try to keep
your fork count such that you leave a half of RAM free when cracking
just starts.
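An easy way to keep an eye on it (standard Linux tooling, nothing
JtR-specific) is to leave something like

watch -n 60 free -h

running in another terminal while cracking ramps up.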

Alexander
Rich Rumble
2018-04-09 20:13:55 UTC
Permalink
Post by Solar Designer
Post by Rich Rumble
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
How much RAM do you have? It sounds like some of the child processes
are simply getting killed on out-of-memory. Unfortunately, when JtR
cracks a password the child processes deviate from each other in their
memory contents, and their combined memory usage grows. This is not
ideal, but that's how it currently is with "--fork".
You'll want to get most of those HaveIBeenPwnd v2 passwords cracked
while running fewer processes (e.g., initially just one or two so that
you can possibly load all of the hashes at once) before you proceed to
attempt using all 24 that you have in your machine.
NoLoaderDupeCheck = Y
ReloadAtCrack = N
These are not obviously an improvement (with these at "N", the pot file
may grow larger from more duplicate entries, but cracking will be faster
and the memory usage increase from copy-on-write across --fork'ed
ReloadAtDone = N
ReloadAtSave = N
-verb=1 -nolog -save-mem=1
"-save-mem=1" should actually speed things up by not wasting memory on
pointers to (non-existent) login names, which also improves the locality
of reference. "-save-mem=2" has performance impact and is probably not
worth it in this case.
You may also want to increase PASSWORD_HASH_SIZE_FOR_LDR in params.h by
one (from 4 to 5) to speed up loading of large hash files like this, and
rebuild. (The same change slows down loading of small files, which is
why it's not the default.)
FWIW, I previously experimented with HaveIBeenPwnd v1, which was 320M
hashes. I loaded those all at once (without splitting) and was able to
run a few forks at first (4 or so) and all 40 forks eventually on a
machine with 128 GB RAM with 40 logical CPUs.
You really need to watch your RAM usage when you do things like this.
If you see less than a half of RAM free, chances are it will be eaten up
and some children will die as they crack more passwords. So try to keep
your fork count such that you leave a half of RAM free when cracking
just starts.
Alexander
I have 32 GB of RAM, and I never tried without fork this whole time :) I'm
past 300M cracked (out of 501M), and I know hashes.org is already at 95% or
more; I was just trying to do it myself. Until the last few days I was
always using fork=24, and unless I used save-memory=2 it was going over and
into swap for the 1 GB slices. I was also using min-length=6, so at most I
would have thought only a few processes would have died off/finished in the
short runs I'm doing. Wordlist, Prince and hybrid mask modes also ate up
tons of RAM on large wordlists (300 MB+); incremental has been the lesser
user so far. After these runs go a few more days I may just start the whole
process over again with your suggestions and see how things go. Thanks
Solar Designer
2018-04-09 20:44:22 UTC
Permalink
Post by Rich Rumble
I used save-memory=2 it was going over and into swap for the 1G slices.
You'd likely achieve better speed by using --save-memory=1 and running
fewer forks. The performance difference between 12 and 24 forks is
probably small (those are just second logical CPUs in the same cores).
The performance difference between --save-memory=1 and --save-memory=2
for large hash counts when things do fit in RAM with =1 can be large (a
few times, as it can be a 16x difference in bitmap and hash table size
and thus up to as much difference in lookup speed). You could very well
prefer, say, 6 forks and lower memory saving over 24 forks and larger
memory saving per each. Larger chunks and fewer forks, too. These are
unsalted hashes, and there's little point in recomputing the same hashes
(of the same candidate passwords) for each chunk when you can avoid that
(even if by using fewer CPU cores at a time).
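Concretely (file names are placeholders), that could mean merging several
of your 1 GB slices and running for example

cat slice01.txt slice02.txt slice03.txt > chunk-a.txt
./john -fork=6 -save-mem=1 -format=raw-sha1 -incremental chunk-a.txt

rather than more forks at -save-mem=2 against each small slice.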
Post by Stephen John Smoogen
Are you able to use taskset to push each one to a CPU? I found that
sometimes the kernel would shove multiple processes to the same CPU.
This was done by the kernel rather than the process itself, so taskset
or similar tools were needed to get the forks off onto their own
CPUs.
This shouldn't be much of a problem with recent kernels, except for
latency sensitive tasks which password cracking isn't, and anyway it
would be the least of Rich's worries given what he's doing.

Alexander

Stephen John Smoogen
2018-04-09 20:22:17 UTC
Permalink
Post by Rich Rumble
Post by Rich Rumble
Post by magnum
Basically the syntax for MPI with --node is the same as for --fork with
--node. So these two examples are equivalent in terms of work space:
./john -fork=16 -node=1-16/10000 (...)
mpirun -host=alpha,bravo -np 16 ./john -node=1-16/10000 (...)
Fork is no good for Windows, and I currently want to dumbforce
something. I don't have network connectivity to most of the hosts I'm
pooling, so I may have to go and use live CDs and use Fork after all, but
in Linux. I am using node like the example, but I'm not sure it will work
with external modes as well as fork would.
The key space distribution is 100% identical whether you use fork, MPI or
just manual instances of --node.
magnum
Sorry to dredge this subject back up, but I'm not convinced Fork is fully
using all 24 CPUs in my single machine to the best of its ability on an
"incremental" run I'm doing. Will some modes work better with fork than
others? I know certain algorithms do, and mine is one of them (raw-sha1). I
have a few (other) issues, one being that the hashes I'm going after are
enormous and I can't fit them all in RAM at once (HaveIBeenPwnd v2), so I've
split them up into 20 1 GB slices. Perhaps a new thread is needed for the
incremental issue, I'm not sure, but using -fork=24 seems to only show 6-8
threads at 100% utilization, and status updates are also between 6-8 when
pressing a key. So I have found I can load four 1 GB slices in RAM
(save-mem=2) and run fork=6 on those. In doing that I appear to have some
overlap, in that some threads are being used twice for work, but I'm not
100% sure. But if I stop one of the four runs, as soon as it's stopped, one
or two of the remaining three start churning out passwords like crazy. I do
not think this is a problem fork/node are there to solve, but I was curious
if there was a way to make sure work on CPUs/threads 1-6 is only done by
this john instance, and work for the other john instance's forks 1-6 is
only done by CPUs/threads 7-12. Since I'm doing different work, I didn't
think node would be the answer for that; I figured the potential for
overlap would be the same even if I specified node=0-5 for each instance.
-rich
Are you able to use taskset to push each one to a CPU? I found that
sometimes the kernel would shove multiple processes to the same CPU.
This was done by the kernel rather than the process itself, so taskset
or similar tools were needed to get the forks off onto their own
CPUs. [If this is not possible for obvious reasons, I am sorry.. saw
this while debugging a different problem with taskset ]
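For example (the core numbers and file names are just placeholders), one
instance per terminal, something like

taskset -c 0-5 ./john -fork=6 -format=raw-sha1 -incremental slice-a.txt
taskset -c 6-11 ./john -fork=6 -format=raw-sha1 -incremental slice-b.txt

should keep each instance and its forked children on a disjoint set of
cores, since the affinity mask is inherited across fork.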
--
Stephen J Smoogen.