Discussion: Pre-Mangling (Wordlist cleanup)
SL
2010-02-03 13:04:25 UTC
I would like to use ./john --rules=Pre-Mangle --stdout | ./unique to
clean up arbitrary (large) "dirty" wordlists.

In other words: I have target-specific generated wordlists (of about
2GB size), which still contain a lot of "unusable junk" like raw MD5
hashes, punctuation, Base64 fragments, QP-encoded fragments, falsely
decoded UTF-8 etc.

My intention is to put together a number of word mangling rules that
help to reduce this chaos and only let through "reasonable" candidates
for future processing with ./john --rules and ./john --rules=Single.
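A sketch of the full pipeline I have in mind (file names are just
placeholders):

./john --wordlist=dirty.lst --rules=Pre-Mangle --stdout | ./unique clean.lst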

(My currently running "--rules=Single" session on that 2GB list has
got an ETA of mid-November 2010 (salted raw MD5 hashes, JimF's patch).)

Does such a collection of rules already exist? I couldn't find one,
and I must admit that the complexity of http://www.openwall.com/john/doc/RULES.shtml
is a bit too much for me to start from scratch.

What it should accomplish:
* obviously no no-op (:)
* include "dictionary-like" words up to à certain length (haven't seen
any password longer than 18 chars in my samples, so lenght 22 should
probably be sufficient)
* shorter alphanumeric "words" might be included as-is, maybe up to 8
or 10 chars
* punctuation should probably be purged (or truncated?)
* words with false transcodings (lots of /(.[ÂÃ])+/) should get
rejected

Could anybody please point me to a reasonable start? I shall follow up
with a patch to john.conf, if this idea proves successful.
Minga Minga
2010-02-03 19:16:08 UTC
This is not REALLY what you are looking for, but when I've had
cases like yours, I've just used command-line tools to 'clean up' my
.dic files (wordlists). All regexps below are lame - and can be
rewritten to be better/smarter/faster. Notice 'sort -u' will sort
the lists and drop duplicates in one pass. These are just
examples - they are not all logical - but it's a start to get
you going.

Also: run 'strings' on your wordlists. It will get rid of SOME high-ascii.
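For example (just a sketch - note that by default 'strings' drops words
shorter than 4 chars, and lines containing high bytes get split into
their printable-ASCII fragments rather than dropped outright):

strings custom.dic | sort -u > custom_ascii.dic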

Assuming your input file is custom.dic :

# The following command will extract all 4 and 5 character words that
# are alphanumeric only:
egrep '^[a-zA-Z0-9]{4,5}$' custom.dic | sort -u > custom_45.dic

# or for 8 characters
egrep '^[a-zA-Z0-9]{8}$' custom.dic | sort -u > custom_8.dic

# This is a LAME regexp - that needs to be re-written
# but it will make a .dic file that is only letters, numbers and SOME specials
# with a max length of 8 chars.
egrep '^[a-zA-Z0-9!@#!$?()%^&{}*/.,<>|`_;:]{1,8}$' custom.dic | sort -u > custom_8special.dic
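One possible rewrite of that one, as a sketch ([[:punct:]] admits a few
more specials than the list above, so it is not an exact match, and the
output name is made up):

# letters, digits and POSIX punctuation, max length 8 chars
egrep '^[[:alnum:][:punct:]]{1,8}$' custom.dic | sort -u > custom_8posix.dic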

# up to 22 chars - alphabetic only
egrep '^[a-zA-Z]{1,22}$' custom.dic | sort -u > custom_lets_22.dic
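One more in the same spirit, since raw MD5 hashes were mentioned as junk
(treating any line of exactly 32 hex chars as "probably a hash" - just a
guess, and the output name is made up):

# Drop hash-looking lines, then keep printable-ASCII words up to 22 chars
egrep -v '^[0-9a-fA-F]{32}$' custom.dic | egrep '^[[:print:]]{1,22}$' \
  | LC_ALL=C sort -u > custom_nohash.dic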

You get the idea. It's at least a START until you can get john.conf
rules to do what you want. But in general, I usually just clean up my
.dic files and don't mess with john.conf rules to do so for me.

---------

-Minga
KoreLogic
Solar Designer
2010-02-05 07:01:36 UTC
Post by SL
I would like to use ./john --rules=Pre-Mangle --stdout | ./unique to
clean up arbitrary (large) "dirty" wordlists.
In other words: I have target-specific generated wordlists (of about
2GB size), which still contain a lot of "unusable junk" like raw MD5
hashes, punctuation, Base64 fragments, QP-encoded fragments, falsely
decoded UTF-8 etc.
My intention is to put together a number of word mangling rules that
help to reduce this chaos and only let through "reasonable" candidates
for future processing with ./john --rules and ./john --rules=Single.
That's a curious idea. So far, people have been using tools other than
JtR itself to pre-process "dirty" wordlists like this. I do see some
value in having a ruleset like this for JtR itself.
Post by SL
Does such a collection of rules already exist? I couldn't find one,
I think not.
Post by SL
and I must admit that the complexity of
http://www.openwall.com/john/doc/RULES.shtml is a bit too much for me to
start from scratch.
I suggest that you start by reading the existing john.conf. Many of the
rules in the default rulesets start by rejecting some "words". You can
learn from those rejection commands and build your ruleset upon them.
Post by SL
* obviously no no-op (:)
* include "dictionary-like" words up to ?? certain length (haven't seen
any password longer than 18 chars in my samples, so lenght 22 should
probably be sufficient)
# Permit pure alphabetic words of up to 22 characters long
<N !?A
Post by SL
* shorter alphanumeric "words" might be included as-is, maybe up to 8
or 10 chars
So instead of the above, we have to write:

# Permit alphanumeric "words" of up to 10 characters long
<B !?X
# Permit pure alphabetic words of up to 22 characters long
<N !?A
Post by SL
* punctuation should probably be purged (or truncated?)
...and what should be done with whatever remains?

# Purge punctuation and special symbols, then apply the usual requirements
@?p @?s Q <B !?X
@?p @?s Q >A <N !?A
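Put together as a named ruleset section in john.conf, these might look
like this (the section name is just an example):

[List.Rules:Pre-Mangle]
# Permit alphanumeric "words" of up to 10 characters long
<B !?X
# Permit pure alphabetic words of up to 22 characters long
<N !?A
# Purge punctuation and special symbols, then apply the usual requirements
@?p @?s Q <B !?X
@?p @?s Q >A <N !?A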
Post by SL
* words with false transcodings (lots of /(.[????])+/) should get
rejected
You haven't fully specified this (your regexp looks wrong) and it'd be
tricky to implement with the rules anyway.
Post by SL
Could anybody please point me to a reasonable start? I shall follow up
with a patch to john.conf, if this idea proves successful.
I've provided some examples above. Please do post whatever ruleset you
might come up with.

Thanks,

Alexander
SL
2010-02-05 12:32:13 UTC
Post by Solar Designer
Post by SL
* words with false transcodings (lots of /(.[????])+/) should get
rejected
You haven't fully specified this (your regexp looks wrong) and it'd be
tricky to implement with the rules anyway.
We just had an (albeit different) example of false/failed transcoding
here. In place of those question marks, I had written two two-byte
characters: "capital A circumflex" and "capital A tilde". Either way
not very useful in candidate wordlists.
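For that case I will probably just pre-filter outside of john, along
the lines of the following (assuming the list and my locale are both
UTF-8; the file names are made up):

grep -v '[ÂÃ]' custom.dic > custom_noxcode.dic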

Thank you for your helpful examples, I will follow up when done.
