This post is some Linux geekery inspired by a problem a coworker solved today.
The problem was that he had a file full of stuff (one item per line) and another file full of partially overlapping stuff, and he wanted a list of the stuff that appeared in the first file but not the second, which is essentially set difference. This was to be done in bash by preference, as it was part of a longer script and perl/ruby one-liners look ugly in scripts. You may want to take a few seconds to try to figure this one out before you look at the explanation below.
cat foo bar bar | sort | uniq -u
Starting at the end, uniq -u outputs only lines that appear exactly once, dropping every line that has consecutive duplicates in the input. Since uniq only compares adjacent lines, the input needs to be sorted first. The cat portion is the trick. We cat bar twice to make sure that it will never contribute a line to the output. Combined with foo, this will give us one copy of any line that appears only in foo, two copies of any line that appears only in bar and three copies of any line that appears in both. Only the single-copy lines survive uniq -u, and those are exactly the lines in foo but not in bar. There is only one real requirement, and that is that foo contains no duplicates to begin with (otherwise a line repeated in foo would be dropped as well). This is fairly trivial to arrange and is left as an exercise for the reader.
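For the impatient, here is a minimal sketch of one way to handle the no-duplicates requirement: run foo through sort -u before it joins the pipeline. The function name set_difference and the sample file contents are made up for illustration and are not part of the original one-liner.

```shell
#!/usr/bin/env bash
# set_difference FILE_A FILE_B: print lines that are in FILE_A but not in FILE_B.
# (set_difference is a hypothetical helper name, chosen for this example.)
set_difference() {
    # sort -u strips duplicates from the first file, so uniq -u cannot drop
    # a line merely because it was repeated within that file.
    # The second file is passed to cat twice, so none of its lines can
    # ever show up exactly once.
    sort -u "$1" | cat - "$2" "$2" | sort | uniq -u
}

# Sample data (assumed for the example); note the duplicate "apple" in foo.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry apple > "$tmpdir/foo"
printf '%s\n' banana date > "$tmpdir/bar"

difference_output=$(set_difference "$tmpdir/foo" "$tmpdir/bar")
printf '%s\n' "$difference_output"
# prints:
# apple
# cherry

rm -rf "$tmpdir"
```

Without the sort -u step, the duplicated apple would have been thrown away by uniq -u even though it never appears in bar.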
This is not the only set operation possible, however. You can also do the following:
- Intersection (requires no duplicates in either file): cat foo bar | sort | uniq -d
(Have a look at man uniq for details on the flags it takes.)
- Union (no input restrictions): cat foo bar | sort | uniq
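(If duplicates in the output are acceptable, a plain cat foo bar already gives you every line from both files; the sort and uniq remove the duplicates and keep the output format the same as the other operations. Also, the sort | uniq can be replaced with sort -u for a small efficiency gain.)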
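The two operations above can be wrapped up the same way. This is only a sketch: the function names set_intersection and set_union and the sample data are invented for the example, and each input is deduplicated separately so the no-duplicates restriction on intersection goes away.

```shell
#!/usr/bin/env bash
# Hypothetical helper names, chosen for this example.

# set_intersection FILE_A FILE_B: print lines present in both files.
set_intersection() {
    # Deduplicate each file on its own first, so a line can only appear
    # twice in the combined stream if it is in both files.
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d
}

# set_union FILE_A FILE_B: print every distinct line from either file.
set_union() {
    # sort accepts several input files; -u keeps one copy of each line.
    sort -u "$1" "$2"
}

# Sample data (assumed for the example), with duplicates in both files.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry apple > "$tmpdir/foo"
printf '%s\n' banana date banana > "$tmpdir/bar"

intersection_output=$(set_intersection "$tmpdir/foo" "$tmpdir/bar")
union_output=$(set_union "$tmpdir/foo" "$tmpdir/bar")

printf 'intersection:\n%s\n' "$intersection_output"
# prints "intersection:" followed by: banana
printf 'union:\n%s\n' "$union_output"
# prints "union:" followed by: apple, banana, cherry, date (one per line)

rm -rf "$tmpdir"
```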
Complements don't really make much sense since you can use difference to filter out the set you don't want from pretty much everything. Shell scripts seldom need to deal with infinite sets and they'd probably take too long to run anyway...
This post brought to you courtesy of caffeine, day-job problems and mithrandi.