Discussion:
[john-users] OpenMPI and .rec files?
Jeroen
2017-11-14 10:04:22 UTC
Hi,

If I run john with --fork on an N-core machine, I'll see N .rec files for
resuming the task. So I guess it's one .rec file per process.

I'm now playing with an OpenMPI environment, using e.g. 20 systems with 32
processes each. When john starts, 640 processes phone home:

..
502 0g 0:00:00:00 ..
614 0g 0:00:00:00 ..
640 0g 0:00:00:00 ..
..

In total 100 .rec files are generated, where I would expect 640 or perhaps
20.

I get the same result for OpenMPI jobs with more or fewer than 640 tasks, as
long as there are more than 100 subtasks.

Is all the resume data kept in 100 recovery files no matter the number of
tasks, or is something going wrong?


Thanks,

Jeroen
magnum
2017-11-22 18:28:56 UTC
Post by Jeroen
Hi,
If I run john with --fork on an N-core machine, I'll see N .rec files for
resuming the task. So I guess it's one .rec file per process.
That is correct. And to a large extent, the MPI code paths are exactly
the same: It's even supported (by us/me) to resume a --fork session
using MPI instead, and vice versa.
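For example, something along these lines should work (just a sketch; the
session name, process count and hash file here are placeholders, not taken
from your setup):

---
# start a 4-process --fork session under an explicit session name
./john --fork=4 --session=demo --format=raw-md4 hashes
# interrupt it, then look at the per-process recovery files
ls demo*.rec
# resume the very same session, but under MPI this time
mpirun -np 4 ./john --restore=demo
---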
Post by Jeroen
I'm now playing with an OpenMPI environment, using e.g. 20 systems with 32
processes each. When john starts, 640 processes phone home:
..
502 0g 0:00:00:00 ..
614 0g 0:00:00:00 ..
640 0g 0:00:00:00 ..
..
In total 100 .rec files are generated, where I would expect 640 or perhaps
20.
This sounds like either a bug or PEBCAK but it may well be a bug - I'm
pretty sure I have never tested that many nodes at once.
Post by Jeroen
I get the same result for OpenMPI jobs with more or fewer than 640 tasks, as
long as there are more than 100 subtasks.
Is all the resume data kept in 100 recovery files no matter the number of
tasks, or is something going wrong?
You should get one session file per node. What "exact" command line did
you use to start the job? Are all nodes running in the same $JOHN
directory, e.g. using NFS?

What happens if you try to resume such a session? It should fail and
complain about missing files unless the bug is deeper than I can imagine.

magnum
Jeroen
2017-11-23 00:05:58 UTC
magnum wrote:
<SNAP>
Post by magnum
This sounds like either a bug or PEBCAK but it may well be a bug - I'm pretty
sure I have never tested that many nodes at once.
Post by Jeroen
I get the same result for OpenMPI jobs with more or fewer than 640 tasks,
as long as there are more than 100 subtasks.
Is all the resume data kept in 100 recovery files no matter the number
of tasks, or is something going wrong?
You should get one session file per node. What "exact" command line did you
use to start the job?
For example, when submitted with prun (a control framework for cluster job management):

prun -np 18 -32 -t 5:00 -script openmpi-config /home/john/run/test hashes

where -np 18 is the number of hosts, -32 means 32 processes per host, and openmpi-config is a basic bash script that loads OpenMPI (gcc, 64-bit) on the workers.

The number of jobs started is - as mentioned before - OK, and the benchmark (--test) also works fine (... (640xMPI) DONE). The number of .rec files never exceeds 100.
Post by magnum
Are all nodes running in the same $JOHN directory, eg.
using NFS?
Yes.
Post by magnum
What happens if you try to resume such a session? It should fail and complain
about missing files unless the bug is deeper than I can imagine.
Resuming works like any other normal job; no complaints as far as I can see.

Please let me know if you need specific debug info.


Thanks,

Jeroen
Jeroen
2017-11-23 10:15:22 UTC
A clean check on a local system with a very basic setup shows that it is not a generic issue:

---
***@cruncher:/opt/JohnTheRipper/run$ rm *rec; mpirun -np 99 ./john --format=raw-md4 /tmp/hashes
--------------------------------------------------------------------------
[[62017,1],34]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: cruncher

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Using default input encoding: UTF-8
Loaded 1200 password hashes with no different salts (Raw-MD4 [MD4 128/128 SSE4.1 4x3])
Node numbers 1-99 of 99 (MPI)
Send SIGUSR1 to mpirun for status
[cruncher:54746] 98 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[cruncher:54746] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
^C^CAbort is in progress...hit ctrl-c again within 5 seconds to forcibly terminate
<SNAP>
***@cruncher:/opt/JohnTheRipper/run$ ls *rec|wc -l
99

***@cruncher:/opt/JohnTheRipper/run$ rm *rec; mpirun -np 120 ./john --format=raw-md4 /tmp/hashes
--------------------------------------------------------------------------
[[61661,1],114]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: cruncher

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Using default input encoding: UTF-8
Loaded 1200 password hashes with no different salts (Raw-MD4 [MD4 128/128 SSE4.1 4x3])
Node numbers 1-120 of 120 (MPI)
[cruncher:55110] 119 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[cruncher:55110] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Send SIGUSR1 to mpirun for status
^C^C
***@cruncher:/opt/JohnTheRipper/run$ ls *rec|wc -l
120
---

Same outcome using --fork.

I'll check cluster-specific settings and behavior. If you've got suggestions, please let me know.

Thanks,

Jeroen
magnum
2017-11-23 16:58:50 UTC
(...)
> I'll check cluster-specific settings and behavior. If you've got
> suggestions, please let me know.

Yeah I did some testing too and the only problems I saw (somewhere
beyond 120 nodes) were just ulimits or MPI-settings stuff.

If you find out anything useful for others, please report here!

Oh BTW it's very strange you can resume such a session with (seemingly) no
problems. If you didn't already, you should look very closely in the log
file and see if there are any clues there. Perhaps there is some error
seen there that we should handle (or report) better.

Good luck,
magnum
Jeroen
2017-11-23 18:09:17 UTC
magnum wrote:
<SNAP>
Post by magnum
If you find out anything useful for others, please report here!
First finding that narrows things down:

- On the cluster, jobs seem to work fine with normal tasks (#.rec files = #jobs). It caps at 100 when --wordlist=<dict> is used.
- I cannot reproduce this on the very basic lab setup (#.rec files = #jobs). However, the output is slightly different when a wordlist is used:

---
***@cruncher:/opt/JohnTheRipper/run$ rm *rec; mpirun -np 120 ./john --format=raw-md4 /tmp/hashes
<SNAP>
Loaded 1200 password hashes with no different salts (Raw-MD4 [MD4 128/128 SSE4.1 4x3])
Using default input encoding: UTF-8
Node numbers 1-120 of 120 (MPI)
Send SIGUSR1 to mpirun for status
---

If the job uses a wordlist, "Send SIGUSR1 to mpirun for status" is missing.
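(For reference, when that line does appear, status can be requested roughly
like this; the pgrep pattern is just an example from my setup:)

---
# ask the running MPI session for a status line
kill -USR1 $(pgrep -x mpirun)
---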
Post by magnum
Oh BTW it's very strange you can resume such session with (seemingly) no
problems. If you didn't already, you should look very closely in the log
file and see if there are any clues there. Perhaps there is some error seen
there that we should handle (or report) better.
On the console everything looks normal:

Remaining X password hashes with X different salts
Cost 1 (iteration count) is X for all loaded hashes
MPI in use, disabling OMP (see doc/README.mpi)
Node numbers 1-640 of 640 (MPI)
<session status messages>

However, this message pops up in john.log for a number of workers: "Terminating on error, recovery.c:165".


Cheers,

Jeroen
magnum
2017-11-23 18:42:59 UTC
Post by Jeroen
Post by magnum
If you find out anything useful for others, please report here!
- On the cluster, jobs seem to work fine with normal tasks (#.rec files = #jobs). It caps at 100 when --wordlist=<dict> is used.
---
<SNAP>
Loaded 1200 password hashes with no different salts (Raw-MD4 [MD4 128/128 SSE4.1 4x3])
Using default input encoding: UTF-8
Node numbers 1-120 of 120 (MPI)
Send SIGUSR1 to mpirun for status
---
If the job uses a wordlist, "Send SIGUSR1 to mpirun for status" is missing.
Post by magnum
Oh BTW it's very strange you can resume such session with (seemingly) no
problems. If you didn't already, you should look very closely in the log
file and see if there are any clues there. Perhaps there is some error seen
there that we should handle (or report) better.
Remaining X password hashes with X different salts
Cost 1 (iteration count) is X for all loaded hashes
MPI in use, disabling OMP (see doc/README.mpi)
Node numbers 1-640 of 640 (MPI)
<session status messages>
However, this message pops up in john.log for a number of workers: "Terminating on error, recovery.c:165".
In that case we do print an error to stderr also:

    fprintf(stderr, "Node %d@%s: Crash recovery file is"
            " locked: %s\n", mpi_id + 1, mpi_name,
            path_expand(rec_name));
    error();


Line 165 is the "error()".

So now we have two questions:
1) Why was it already locked? Some half-dead process still running?
2) Why do you not see the error printed to stderr? Something with your
OpenMPI wrapper script?
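If you want to check whether something is still holding those files, a check
like this on the affected nodes might help (assuming fuser is available there):

---
# list any processes that still have recovery files open
fuser -v *.rec
# and any leftover john processes
ps aux | grep '[j]ohn'
---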

magnum
Jeroen
2017-11-23 19:22:31 UTC
magnum wrote:
<SNAP>
Post by magnum
1) Why was it already locked? Some half-dead process still running?
2) Why do you not see the error printed to stderr? Something with your
OpenMPI wrapper script?
1) Messages are sent to a logfile per process. The messages were all generated by a few boxes; it looked like hanging / non-terminated processes, so I killed them.
2) Logs can be saved on a per-worker basis or sent to the console with a verbose flag.

New data 1:
- New john.log.
- Test with > 100 tasks and incremental mode
- Everything works as expected: N .rec files where N > 100.
- Stopped and resumed the task with --restore: ok.
- No weird log entries.

New data 2:
- New john.log.
- Test with > 100 tasks and --wordlist=<dict>.
- 100 .rec files.
- Stopped and resumed the task with --restore: relevant messages:
- "Continuing an interrupted session"
- "No crash recovery file, terminating": count of this entry is total N - 100.

Cheers,

Jeroen
magnum
2017-11-23 19:34:47 UTC
Post by Jeroen
<SNAP>
Post by magnum
1) Why was it already locked? Some half-dead process still running?
2) Why do you not see the error printed to stderr? Something with your
OpenMPI wrapper script?
1) Messages are sent to a logfile per process. The messages were all generated by a few boxes; it looked like hanging / non-terminated processes, so I killed them.
2) Logs can be saved on a per-worker basis or sent to the console with a verbose flag.
- New john.log.
- Test with > 100 tasks and incremental mode
- Everything works as expected: N .rec files where N > 100.
- Stopped and resumed the task with --restore: ok.
- No weird log entries.
- New john.log.
- Test with > 100 tasks and --wordlist=<dict>.
- 100 .rec files.
- "Continuing an interrupted session"
- "No crash recovery file, terminating": count of this entry is total N - 100.
So you only see the error when using a wordlist. What are the effective
ulimits? Perhaps try "ulimit -a" and see what happens?

Or, taking the opposite approach: What if you run incremental and N is
larger than, say, 200?
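(Something like this, for instance - hash file and format taken from your
earlier lab test, the node count is just an example:)

---
mpirun -np 200 ./john --incremental --format=raw-md4 /tmp/hashes
---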

Perhaps also look for errors/warnings in the NFS server log...

magnum
magnum
2017-11-23 19:40:50 UTC
Post by magnum
Perhaps try "ulimit -a" and see what happens?
Sorry, that was misworded. I meant: try bumping all (relevant) ulimits to
the max, and also perhaps look for OpenMPI and/or NFS limits that may be
involved.
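Something like this in the wrapper script, before mpirun is started, is what
I had in mind (the exact values are only examples):

---
ulimit -a          # show the effective limits on the worker
ulimit -n 65536    # max open file descriptors
ulimit -u 4096     # max user processes
---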

magnum
