dupes.pl - a de-duping proggie (free) - Dana Booth, dana@oz.net 1999

* Requires perl *

This little proggie is written in perl, so perl must be installed on your
computer. Perl is very easy to install, and is a great scripting language.
I ran dupes.pl with Perl 5 on UNIX (OpenBSD), Active Perl for Windows 95/98,
and GNU Perl for DOS. If you're running UNIX, you most likely have perl
installed already. To obtain Active Perl, go to www.perl.com, where there's
a link to download it. If you want a DOS version of perl, let me know and
I'll send one to you, since I'm not sure they're available anymore.

* What it does *

dupes.pl simply reads an input text file and writes an output text file
that contains only unique lines, i.e., any duplicate lines in the input
file have been removed. The input file is only read, never modified, so
your data is safe.

* Here's the catch *

dupes.pl is extremely fast. However, in order to obtain speed, the order of
the lines in the output textfile will have been re-arranged from the order
in which they appear in the input textfile.
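
This README doesn't describe how dupes.pl works internally. As a rough
sketch only, one common perl idiom that gives both the speed and the
re-ordering is to use each line as a hash key: duplicates collapse onto the
same key, and the keys come back in the hash's own order rather than the
input order. The sub name unique_lines and the file handling below are
illustrative assumptions, not code taken from dupes.pl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from dupes.pl): returns the distinct lines
# of a list, in whatever order the hash happens to yield them.
sub unique_lines {
    my %seen;
    $seen{$_} = 1 for @_;    # duplicate lines collapse onto one key
    return keys %seen;       # hash order, not input order
}

# Only touch files when both names are given on the command line.
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;

    open(my $ifh, '<', $in) or die "Cannot read $in: $!";
    my @lines = <$ifh>;      # input file is read, never written
    close($ifh);

    open(my $ofh, '>', $out) or die "Cannot write $out: $!";
    print $ofh unique_lines(@lines);
    close($ofh) or die "Cannot close $out: $!";
}
```

Saved under a name of your choosing, it would be run the same way as shown
in the command line section: perl scriptname inputfilename outputfilename.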

* Command line *

DOS and Windows:

perl dupes.pl inputfilename outputfilename


UNIX users: set the executable bit on the script (chmod +x dupes.pl), change
the pound/bang path to perl if necessary, and run:

./dupes.pl inputfilename outputfilename

from the current directory, or put it in your path and drop the dot/slash.

* History *

I get a textfile each day which grows by about 15 or 20 lines per day. It's
currently at about 58,000 lines. Trouble is, it's reproduced each day at a
remote site, and it just came to my attention that it contains duplicate
lines here and there. Since the whole file is reproduced each day, the
duplicates come back each day too. I found a DOS proggie on the internet
that removed dupe lines, but it was very slow on a text file this large.
Anyway, I decided to do it myself, and dupes.pl works very fast. Keep in
mind that I'm not a programmer, but I do use perl from time to time for
simple things.

dupes.pl de-duped my 58,000 line textfile in about 15 seconds on my Windows
computer at work, a Cyrix 233 with 64MB RAM. I decided to have a little fun,
and de-duped the same textfile on our OpenBSD fileserver, a PC which runs a
K6-III 450 and 256MB RAM and an UW, and it took less than 2 seconds. :)

Feel free to modify the script to your heart's content.

Send any gripes or praise to: dana@oz.net

