[john-users] good program for sorting large wordlists
JohnyKrekan
2018-09-11 15:19:18 UTC
Hello, I would like to ask whether someone has experience with a good
tool for sorting large text files, with capabilities like those of GNU
sort. I am using it to sort wordlists, but when I tried to sort an 11 GB
wordlist, it crashed while writing the final output file after writing
around 7 GB of data and did not delete some temp files. When I was
sorting a smaller (2 GB) wordlist it took only about 15 minutes, while
this 11 GB one took 4.5 hours (Intel Core i7 2.6 GHz, 12 GB RAM, SSD
drives).
Could you recommend a better sorting program with abilities similar to
those GNU sort offers?
Thanx
Johny Krekan
Solar Designer
2018-09-11 15:42:46 UTC
Hi,
Post by JohnyKrekan
Hello, I would like to ask whether someone has experience with a good
tool for sorting large text files, with capabilities like those of GNU
sort. I am using it to sort wordlists, but when I tried to sort an 11 GB
wordlist, it crashed while writing the final output file after writing
around 7 GB of data and did not delete some temp files. When I was
sorting a smaller (2 GB) wordlist it took only about 15 minutes, while
this 11 GB one took 4.5 hours (Intel Core i7 2.6 GHz, 12 GB RAM, SSD
drives).
Most importantly, usually you do not need to "sort" - you just need to
eliminate duplicates. In fact, in many cases you'd prefer to eliminate
duplicates without sorting, in case your input list is sorted roughly
for non-increasing estimated probability of hitting a real password -
e.g., if it's produced by concatenating common/leaked password lists
first with other general wordlists next, and/or by pre-applying wordlist
rules (which their authors generally order such that better performing
rules come first).

You can eliminate duplicates without sorting using JtR's bundled
"unique" program. In jumbo and running on a 64-bit platform, it will by
default use a memory buffer of 2 GB (the maximum it can use). It does
not use any temporary files (instead, it reads back the output file
multiple times if needed). You can use it e.g. like this:

./unique output.lst < input.lst

or:

cat ~/wordlists/* | ./unique output.lst

or:

cat ~/wordlists/common/* ~/wordlists/uncommon/* | ./unique output.lst

or:

./john -w=password.lst --rules=jumbo --stdout | ./unique output.lst

As to sorting, recent GNU sort from the coreutils package works well.
You'll want to use the "-S" option to let it use more RAM and fewer
temporary files, e.g. "-S 5G". You can also use e.g. "--parallel=8".

As to it running out of space for the temporary files, perhaps you have
your /tmp on tmpfs, so in RAM+swap, and this might be too limiting. If
so, you may use the "-T" option, e.g. "-T /home/user/tmp", to let it use
your SSDs instead. Combine this with e.g. "-S 5G" to also use your RAM.
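
For instance, a combined invocation could look something like this (the
memory size, thread count, and paths are just placeholders to adapt to
your machine):

sort -S 5G --parallel=8 -T /home/user/tmp -o sorted.lst input.lst

Adding "-u" would also drop duplicate lines while sorting, and "-o"
writes the result straight to a file instead of relying on shell
redirection.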

As to "it crashed while writing final output file after writing around 7
gb of data", did you possibly put the output file in /tmp as well? Just
don't do that.

I hope this helps.

Alexander
JohnyKrekan
2018-09-12 11:14:22 UTC
Thanx for the info. After I raised the memory size and the space for
temp files, the sort went well. I was sorting it to find out how many
duplicates (when ignoring character case) there are in the SuperWPA
wordlist. The original file size was approx. 10.7 GB; after sorting it
was 7.05 GB, so roughly 3.7 GB was taken up by the same words with
modified character case.
If you had to decide: would you rather use this smaller wordlist and
set case-changing rules in the program used to test those hashes, or
use the original wordlist, which contains lots of the same words with
modified casing.
Johny Krekan

Solar Designer
2018-09-12 11:49:51 UTC
Post by JohnyKrekan
Thanx for the info. After I raised the memory size and the space for
temp files, the sort went well. I was sorting it to find out how many
duplicates (when ignoring character case) there are in the SuperWPA
wordlist. The original file size was approx. 10.7 GB; after sorting it
was 7.05 GB, so roughly 3.7 GB was taken up by the same words with
modified character case.
It's a case where you don't need to sort. You could use:

./unique -v output.lst < input.lst

or e.g.:

tr 'A-Z' 'a-z' < input.lst | ./unique -v output.lst

Testing this on JtR's bundled password.lst:

$ tr 'A-Z' 'a-z' < password.lst | ./unique output.lst
Total lines read 3559 Unique lines written 3422

If you're interested in sizes in bytes as well, use "ls -l" or "wc -c"
on the two files.
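For example, with the two files from the run above:

$ wc -c password.lst output.lst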

For tiny wordlists like password.lst, "sort -u" is more convenient in
that it can output to a pipe, so you can do:

$ tr 'A-Z' 'a-z' < password.lst | sort -u | wc -l
3422

But for large wordlists "sort" may be slower, even with the "-S" and
"--parallel" options.

Alexander
JohnyKrekan
2018-09-12 12:07:04 UTC
My question now is not about sorting but about which wordlist you would
use for hash testing, now that both are already saved on disk: the
smaller one (with all words lowercase) or the bigger one (mixed case).
Is it better to let the program, for example EWSA, make the case
modifications, or to use the bigger one and disable all the
case-modifying rules.
Johny Krekan

Solar Designer
2018-09-12 12:45:29 UTC
Post by JohnyKrekan
My question now is not about sorting but about which wordlist you would
use for hash testing, now that both are already saved on disk: the
smaller one (with all words lowercase) or the bigger one (mixed case).
Is it better to let the program, for example EWSA, make the case
modifications, or to use the bigger one and disable all the
case-modifying rules.
Oh, you didn't include a question mark there, so I assumed it wasn't a
question but rather you stating the dilemma.

Ideally, you'd combine the advantages of both approaches using e.g.:

./john -w=input.lst --rules=best64 --min-length=8 --stdout | ./unique output.lst

assuming that the first rule is to keep words as-is. Then you'd use
output.lst either with JtR itself or with another tool like EWSA (why?)

Here, input.lst is your original wordlist with the mixed-case lines
still intact.
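
For example, a later run against your hashes with JtR itself could then
be as simple as (the hash file name here is only a placeholder):

./john -w=output.lst hashes.txt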

If the first rule isn't to keep words as-is (not a colon, ":"), then you
can revise the command e.g. to:

(./john -w=input.lst --min-length=8 --stdout && ./john -w=input.lst --rules=someother --min-length=8 --stdout) | ./unique output.lst

or you can indeed add a colon to the start of the "someother" ruleset,
then use the simpler command.
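
For illustration only, the top of such a ruleset section in john.conf
would then look like this ("someother" is just the placeholder name from
above, and ":" is the no-op rule that passes each word through
unchanged):

[List.Rules:someother]
:
# ... the rest of the original rules follow here ...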

Alexander
Solar Designer
2018-09-16 15:51:31 UTC
I realized the advice I provided above is a poor fit for wordlist files
this large, where the output file would be too large and would take too
long to produce (as "unique" would have to read it back too many times).

I commonly use this approach for more focused and thus smaller input
wordlist files.

Alexander
jeff
2018-09-12 12:12:00 UTC
I am using JohnTheRipper-v1.8.0.12-jumbo-1-bleeding-e6214ceab--2018-02-07--Win-x64.

When I start it up with the command line
john.exe --format=descrypt-opencl pass-file
I get the following:

Warning: '/dev/shm' does not exists or is not a directory.
POSIX shared memory objects require the existance of this directory.
Create the directory '/dev/shm' and set the permissions to 01777.
For instance on the command line: mkdir -m 01777 /dev/shm
No OpenCL devices found

When I don't specify --format=descrypt-opencl, it works using my CPU.
My password file is in DES format.
I have an NVIDIA 1060 3 GB card and a recent driver which works fine
with hashcat.
How do I get JtR to use my graphics card on Windows 10?
I also have Cygwin installed, and when I look I have a /dev/shm, so I
don't know why I am getting that message...

thanks,
jeff
Matlink
2018-09-12 11:37:31 UTC
As Alexander said, you should use the "--parallel" option for such big
files. And yes, you'll need temporary files, and therefore a folder that
can handle huge files. I usually sort files of dozens of gigabytes, and
it takes time, but rarely more than an hour.
--
Matlink - Sysadmin matlink.fr
Stay protected, encrypt your mail: https://café-vie-privée.fr/
XMPP/Jabber : ***@matlink.fr
PGP public key: 0x186BB3CA
Off-the-record fingerprint: 572174BF 6983EA74 91417CA7 705ED899 DE9D05B2
Albert Veli
2018-09-16 19:28:42 UTC
Hi!
If you don't succeed with other methods, one thing that has worked for me
is splitting the wordlist into smaller parts and sorting each one
individually. Then you can merge the sorted lists together using for
instance mli2 from hashcat-utils. That will put the big list in sorted
order. But the parts must be sorted first, before merging.

This is only necessary for very large wordlists, like in your case.

PS: I think there is a tool in hashcat-utils for splitting too; I don't
remember the name. Maybe gate.
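
For what it's worth, a rough sketch of this split-sort-merge approach
using only standard tools (the split size and file names are arbitrary,
and GNU sort's merge mode "-m" stands in for mli2 here):

split -l 100000000 big.lst part_
for f in part_*; do LC_ALL=C sort -S 2G -o "$f.sorted" "$f"; done
LC_ALL=C sort -m part_*.sorted > big-sorted.lst

Using LC_ALL=C keeps everything in plain byte order, which is both
consistent across the steps and faster than locale-aware sorting.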
jeff
2018-09-16 19:51:54 UTC
For whatever reason, I have a collection of large, sorted wordlists.
Nine of them are over 10 GB, and the biggest one is 123 GB.

As mentioned above, I use split to break the files into manageable
pieces. I typically say something like 'split -l 100000000' or so.

I then use GNU sort on each piece. GNU sort will sort files that are
quite large, using temp files if needed. It is still a good idea to
have a reasonable amount of physical memory; my machine has 32 GB.

Then I take the sorted files and merge them using a program I wrote
called multi-merge, which merges one or more sorted files.

Then I use uniq on the sorted file to remove duplicates.

This process can take a while, but you will end up with a sorted, unique
wordlist.
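
If you don't have a merge tool handy, GNU sort can also do the merge
and the uniq step in one pass on already-sorted pieces, e.g. (file
names are placeholders):

LC_ALL=C sort -m -u part_*.sorted > final.lst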

I also wrote a bunch of other programs to manipulate wordlists. In my
experience, large wordlists often contain quite a bit of junk, such as
really long lines, sometimes 10k to over 100k bytes. I have a program
to truncate long lines, sample lines of big files, remove non-ASCII
lines, etc. I also use emacs to look at the contents of files; it can
edit multi-gigabyte files, though it is slow. Sometimes long lines are
many passwords separated by ',' or ';' or some other separator.
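
As a rough sketch of that kind of cleanup with standard tools (the
64-character cutoff and file names are arbitrary):

LC_ALL=C grep -av '[^ -~]' big.lst | awk 'length($0) <= 64' > cleaned.lst

Here grep -v '[^ -~]' keeps only lines consisting of printable ASCII,
and the awk filter drops the overly long lines.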

jeff
