[john-users] Markov Sampling

Post by Matlink
For speed reasons, I would need to sample the output of JtR --markov
mode with --stdout. For example, I would like to add a --sample=10
option to get only 1 of 10 candidates generated by this mode.

Are you doing some research? If so, please also include incremental
mode in it.

Post by Matlink
Obviously I could do it by myself in a post-process, but the UNIX pipe
slows a lot the candidates gathering. So I found it more efficient to
skip some candidates.
Do you know where in the code I can add such an option?

The pre-defined --external=Parallel mode will do what you ask for.
You'll just need to customize the "node" and "total" numbers in its
init() in john.conf.

However, note that "every 10th" doesn't necessarily produce a
representative sample: the underlying cracking mode (in this case,
Markov) might happen to have some periodicity in its output, and one of
its period lengths might just happen to be a multiple of 10 or whatever.
So ideally you'd want to randomize the order (if the order somehow
doesn't matter for your research) over a larger number of candidate
passwords - say, pass a million of them through GNU coreutils' shuf(1) -
and then take every 10th out of that randomized list.

Alexander

Matlink

2018-02-13 15:44:03 UTC

Post by Solar Designer
The pre-defined --external=Parallel mode will do what you ask for.
You'll just need to customize the "node" and "total" numbers in its
init() in john.conf.

Well, I guess it's only 'not printing' generated candidates? Does it
really speed up the process, since generating a password candidate is
more costly than printing it?
Concretely, is --markov --stdout --external=Parallel with node 1/100,
100 times faster than with node 1/1?

Post by Solar Designer
However, note that "every 10th" doesn't necessarily produce a
representative sample: the underlying cracking mode (in this case,
Markov) might happen to have some periodicity in its output, and one of
its period lengths might just happen to be a multiple of 10 or whatever.
So ideally you'd want to randomize the order (if the order somehow
doesn't matter for your research) over a larger number of candidate
passwords - say, pass a million of them through GNU coreutils' shuf(1) -
and then take every 10th out of that randomized list.

My issue is that I can't get the whole output because it is too costly
for me to gather them due to UNIX pipe. I would like to my

john --stdout --markov --sample=100 | my_sublime_post-process

be somewhat 100 times faster than

john --stdout --markov --sample=1 | my_sublime_post-process

Your solution requires to get the whole output of john and then
post-process it, but I can't find a satisfiable way to get its whole
output (since john is really fast to generate candidates).

--
Matlink - Sysadmin matlink.fr
Sortez couverts, chiffrez vos mails : https://cafÃ©-vie-privÃ©e.fr/
XMPP/Jabber : ***@matlink.fr
ClÃ© publique PGP : 0x186BB3CA
Empreinte Off-the-record : 572174BF 6983EA74 91417CA7 705ED899 DE9D05B2

Solar Designer

2018-02-13 16:47:02 UTC

Post by Solar Designer
The pre-defined --external=Parallel mode will do what you ask for.
You'll just need to customize the "node" and "total" numbers in its
init() in john.conf.

Well, I guess it's only 'not printing' generated candidates? Does it
really speed up the process, since generating a password candidate is
more costly than printing it?

It doesn't speed up the processing inside JtR; it actually adds extra
processing.

Post by Matlink
Concretely, is --markov --stdout --external=Parallel with node 1/100,
100 times faster than with node 1/1?

No. It's probably roughly same speed: the external mode adds overhead
internally to JtR, but then those skipped candidates don't need to be
printed to the Unix pipe.

You could use the built-in --node=1/100 feature, which probably will
speed things up a lot, but then it almost certainly doesn't result in a
representative sample - it's just a way to split the work between
multiple nodes, without regard as to whether each node would get a
representative sample and be expected to crack a similar percentage of
real-world passwords that other nodes crack or not (so this probably
won't be the case, making this approach unsuitable for use in research).

The same applies to incremental mode.

Post by Matlink
Your solution requires to get the whole output of john and then
post-process it, but I can't find a satisfiable way to get its whole
output (since john is really fast to generate candidates).

A question is whether you actually need to get this many candidates (or
a sample from this many), or whether fewer would suffice. That depends
on what your ultimate goal is.

Alexander

Matt Weir

2018-02-13 18:16:19 UTC

So my initial reaction was that I'm doubtful printing out 1/10 of all
the password guesses will give you the end results you are looking
for. Here are a couple of other options:

1) You can always work backwards if your target plaintexts are known.
Aka calculate the level required to generate them, (that's easy to do.
Sum up all of the transitions based on your character file). Once you
have that, calculate how many guesses are required to get to that
level which the included tools in JtR will do for you. If you need
additional help on this approach I can write up something more
detailed. This is *super* fast and gives you a pretty good
approximation.

2) You can rely on JtR's own logging to see when passwords would be
cracked so that way you are not slowed down by piping the output to
stdout or using a 3rd party tool to figure out how many passwords you
cracked. If you want to go this route I'm sure the people on this list
can help out.

There's other options as well I guess. You could always modify the
Markov code directly in JtR. If you want other examples of the Markov
code, I re-implemented it in Python for my PCFG cracker. You can see
the guess generation code at:

https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_manager/markov_cracker.py

Of course, my code is slow so that almost certainly is not the way to
go, but it may be useful as a reference. Long story short, if you let
us know more what you are trying to do I'm sure we can brainstorm some
better options.

Matt

Post by Solar Designer
The pre-defined --external=Parallel mode will do what you ask for.
You'll just need to customize the "node" and "total" numbers in its
init() in john.conf.

Well, I guess it's only 'not printing' generated candidates? Does it
really speed up the process, since generating a password candidate is
more costly than printing it?

It doesn't speed up the processing inside JtR; it actually adds extra
processing.

Post by Matlink
Concretely, is --markov --stdout --external=Parallel with node 1/100,
100 times faster than with node 1/1?

No. It's probably roughly same speed: the external mode adds overhead
internally to JtR, but then those skipped candidates don't need to be
printed to the Unix pipe.

A question is whether you actually need to get this many candidates (or
a sample from this many), or whether fewer would suffice. That depends
on what your ultimate goal is.
Alexander

Matlink

2018-02-15 14:10:28 UTC

I am interested by knowing how you calculate the level required to
generate plaintext passwords (as you explain in 1)).

Post by Matt Weir
So my initial reaction was that I'm doubtful printing out 1/10 of all
the password guesses will give you the end results you are looking
1) You can always work backwards if your target plaintexts are known.
Aka calculate the level required to generate them, (that's easy to do.
Sum up all of the transitions based on your character file). Once you
have that, calculate how many guesses are required to get to that
level which the included tools in JtR will do for you. If you need
additional help on this approach I can write up something more
detailed. This is *super* fast and gives you a pretty good
approximation.
2) You can rely on JtR's own logging to see when passwords would be
cracked so that way you are not slowed down by piping the output to
stdout or using a 3rd party tool to figure out how many passwords you
cracked. If you want to go this route I'm sure the people on this list
can help out.
There's other options as well I guess. You could always modify the
Markov code directly in JtR. If you want other examples of the Markov
code, I re-implemented it in Python for my PCFG cracker. You can see
https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_manager/markov_cracker.py
Of course, my code is slow so that almost certainly is not the way to
go, but it may be useful as a reference. Long story short, if you let
us know more what you are trying to do I'm sure we can brainstorm some
better options.
Matt

Post by Solar Designer
The pre-defined --external=Parallel mode will do what you ask for.
You'll just need to customize the "node" and "total" numbers in its
init() in john.conf.

Well, I guess it's only 'not printing' generated candidates? Does it
really speed up the process, since generating a password candidate is
more costly than printing it?

It doesn't speed up the processing inside JtR; it actually adds extra
processing.

Post by Matlink
Concretely, is --markov --stdout --external=Parallel with node 1/100,
100 times faster than with node 1/1?

No. It's probably roughly same speed: the external mode adds overhead
internally to JtR, but then those skipped candidates don't need to be
printed to the Unix pipe.

A question is whether you actually need to get this many candidates (or
a sample from this many), or whether fewer would suffice. That depends
on what your ultimate goal is.
Alexander

Matt Weir

2018-02-15 15:00:28 UTC

This is for JtR's Markov mode. You can do similar options with more
advanced Markov methods like OMEN but I'm going to skip that for now.

1) Make sure you have your stats file trained already. The following
instructions will calculate the level based upon that.
2) Open up the stats file. You'll see a lot of lines like:

97=proba1[32]
51=proba2[32*256+35]
44=proba2[32*256+38]
51=proba2[32*256+39]
40=proba2[32*256+48]
35=proba2[32*256+49]
37=proba2[32*256+50]
....

3) The format is a bit wonky if you have never dug into it before. The
main thing to keep in mind is the level is the leftmost value.
'proba1' means the probability for the first letter. 'proba2' means a
conditional probability for a letter following the previous letter.

These examples taken from the code comments in my pcfg implementation.
Note when I'm using probability/level interchangeably, (not ideal I
know) : https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_trainer/markov.py

# Format:
# Probability=proba1[ORD_REP1]
# Probability=proba2[ORD_REP1*256+ORD_REP2]
#
# Example:
# 27=proba1[97] //'a' has probability 27
# 85=proba2[97*256+114] //'r' given 'a' has a probability of 85
# 83=proba2[97*256+100] //'d' given 'a' has a probability of 83

4) This is all based on the ASCII representation so 'a' = '97'. You
can ignore the *256. That's a holdover from the JtR implementation. I
suspect at one point there was a plan to support non-ASCII characters.

5) Now take the password you want to find the Markov level/probability
for. Let's assume it is 'cats'. You start out by looking up the proba1
for [c]. That's your starting level. Then look up the prob2['c' +
'a']. Add that level to your running total. Next look up proba2['a' +
t']. Add that to your running total. Finally look up proba2['t' + s'].
Add that to your running total and the sum of all of that is your
Markov level.

6) If you have a transition that isn't listed, (for example it uses a
non-ascii character) that guess will never be generated.

7) Use ./genmkvpwd to calculate how many password guesses would be
made for that level of the target password. While it won't tell you
where in that level the password would be guessed, you can probably
say it'll be (keysize for level)/2 and have it average out. With
Markov it isn't weighted to the front which is why cracking sessions
look a lot like a straight line with a few bumps in it.

./genmkvpwd statfile max_lvl [max_len] [start] [stop]

Example (if you are looking up for level 120, max length 120)
./genmkvpwd statfile 120 12 1 120

8) Yeah you have to specify a maximum length for guesses and that does
impact the keysize. Also I can't remember if the results of genmkvpwd
are total for 0-level or just that level and you have to add all the
previous levels to get the guesses for it so you might need to play
with that. I think they are additive, (aka take the last value for the
total keysize), but I could be wrong as it's been a while since I
played with that.

Good luck, and if this is confusing or didn't make sense let me know.

Matt

Post by Matlink
I am interested by knowing how you calculate the level required to
generate plaintext passwords (as you explain in 1)).

Post by Solar Designer
The pre-defined --external=Parallel mode will do what you ask for.
You'll just need to customize the "node" and "total" numbers in its
init() in john.conf.

Well, I guess it's only 'not printing' generated candidates? Does it
really speed up the process, since generating a password candidate is
more costly than printing it?

It doesn't speed up the processing inside JtR; it actually adds extra
processing.

Post by Matlink
Concretely, is --markov --stdout --external=Parallel with node 1/100,
100 times faster than with node 1/1?

No. It's probably roughly same speed: the external mode adds overhead
internally to JtR, but then those skipped candidates don't need to be
printed to the Unix pipe.

A question is whether you actually need to get this many candidates (or
a sample from this many), or whether fewer would suffice. That depends
on what your ultimate goal is.
Alexander

--
Matlink - Sysadmin matlink.fr
Sortez couverts, chiffrez vos mails : https://café-vie-privée.fr/
Clé publique PGP : 0x186BB3CA
Empreinte Off-the-record : 572174BF 6983EA74 91417CA7 705ED899 DE9D05B2

Matt Weir

2018-02-15 15:17:58 UTC

Realized I had a typo in my genmkvpwd description. That would be for a
max length 12, not 120. Should have picked my example level a bit
differently ;p

Matt

Post by Matt Weir
This is for JtR's Markov mode. You can do similar options with more
advanced Markov methods like OMEN but I'm going to skip that for now.
1) Make sure you have your stats file trained already. The following
instructions will calculate the level based upon that.
97=proba1[32]
51=proba2[32*256+35]
44=proba2[32*256+38]
51=proba2[32*256+39]
40=proba2[32*256+48]
35=proba2[32*256+49]
37=proba2[32*256+50]
....
3) The format is a bit wonky if you have never dug into it before. The
main thing to keep in mind is the level is the leftmost value.
'proba1' means the probability for the first letter. 'proba2' means a
conditional probability for a letter following the previous letter.
These examples taken from the code comments in my pcfg implementation.
Note when I'm using probability/level interchangeably, (not ideal I
know) : https://github.com/lakiw/pcfg_cracker/blob/master/python_pcfg_cracker/pcfg_trainer/markov.py
# Probability=proba1[ORD_REP1]
# Probability=proba2[ORD_REP1*256+ORD_REP2]
#
# 27=proba1[97] //'a' has probability 27
# 85=proba2[97*256+114] //'r' given 'a' has a probability of 85
# 83=proba2[97*256+100] //'d' given 'a' has a probability of 83
4) This is all based on the ASCII representation so 'a' = '97'. You
can ignore the *256. That's a holdover from the JtR implementation. I
suspect at one point there was a plan to support non-ASCII characters.
5) Now take the password you want to find the Markov level/probability
for. Let's assume it is 'cats'. You start out by looking up the proba1
for [c]. That's your starting level. Then look up the prob2['c' +
'a']. Add that level to your running total. Next look up proba2['a' +
t']. Add that to your running total. Finally look up proba2['t' + s'].
Add that to your running total and the sum of all of that is your
Markov level.
6) If you have a transition that isn't listed, (for example it uses a
non-ascii character) that guess will never be generated.
7) Use ./genmkvpwd to calculate how many password guesses would be
made for that level of the target password. While it won't tell you
where in that level the password would be guessed, you can probably
say it'll be (keysize for level)/2 and have it average out. With
Markov it isn't weighted to the front which is why cracking sessions
look a lot like a straight line with a few bumps in it.
./genmkvpwd statfile max_lvl [max_len] [start] [stop]
Example (if you are looking up for level 120, max length 120)
./genmkvpwd statfile 120 12 1 120
8) Yeah you have to specify a maximum length for guesses and that does
impact the keysize. Also I can't remember if the results of genmkvpwd
are total for 0-level or just that level and you have to add all the
previous levels to get the guesses for it so you might need to play
with that. I think they are additive, (aka take the last value for the
total keysize), but I could be wrong as it's been a while since I
played with that.
Good luck, and if this is confusing or didn't make sense let me know.
Matt

Post by Matlink
I am interested by knowing how you calculate the level required to
generate plaintext passwords (as you explain in 1)).

Post by Solar Designer
The pre-defined --external=Parallel mode will do what you ask for.
You'll just need to customize the "node" and "total" numbers in its
init() in john.conf.

Well, I guess it's only 'not printing' generated candidates? Does it
really speed up the process, since generating a password candidate is
more costly than printing it?

It doesn't speed up the processing inside JtR; it actually adds extra
processing.

Post by Matlink
Concretely, is --markov --stdout --external=Parallel with node 1/100,
100 times faster than with node 1/1?

No. It's probably roughly same speed: the external mode adds overhead
internally to JtR, but then those skipped candidates don't need to be
printed to the Unix pipe.