Seek and you will find

May 29, 2008 10:50

This post is a mea culpa, an admission of failure: if any of you have read my stuff thinking that I know what I'm talking about when it comes to programming, this is your chance to recalibrate. I'm posting in the hopes that by writing this particular ultra-basic technique down in public, I'll never ever forget about it.

Yesterday, on another forum, my friend Mat was asking for some help. He had a collection of (up to 4.4GB) raw h264 HD video streams, which he wanted to play on his PS3. The raw streams were produced during the penultimate stage of remuxing, and he needed a hand with the final stage. The problem was that the PS3 doesn't think it can play video encoded at anything higher than level 4.1, and almost all of Mat's videos were encoded at a higher level. Transcoding the videos took twice the stream's running time, and clogged all his cores into the bargain. But there's a workaround: it seems that the PS3 can in fact play videos with higher encoding levels, providing you can get it to accept them. Mat had been manually patching the header to change the stated encoding level. This involves opening up the file in a hex editor, looking for the sequence "64 00 xx" in the header (where xx is usually 33), and replacing it with "64 00 29". Doing this manually was tedious, and he wanted to automate the process, preferably using standard tools. The program was to run on Unix of some sort.
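For concreteness, the manual hex-editor patch amounts to something like the following Python 3 sketch, applied to a header already held in memory. The function name `patch_level` and the sample bytes are invented for illustration. As far as I understand the H.264 header, 0x64 marks the High profile and the third byte is the level times ten (0x33 = 51 for level 5.1, 0x29 = 41 for level 4.1), but treat that reading as an assumption: this is a byte search, not a parser.

```python
def patch_level(header: bytes, new_level: int = 0x29) -> bytes:
    """Return a copy of header with the first "64 00 xx" changed to "64 00 <new_level>"."""
    marker = b"\x64\x00"  # the two bytes that precede the level byte
    pos = header.find(marker)
    # Leave the data alone if the marker is missing or has no byte after it.
    if pos == -1 or pos + len(marker) >= len(header):
        return header
    level_at = pos + len(marker)
    return header[:level_at] + bytes([new_level]) + header[level_at + 1:]


# Hypothetical header fragment: some leading bytes, then 64 00 33, then filler.
sample = b"\x00\x00\x00\x01\x67\x64\x00\x33\xac\x34"
patched = patch_level(sample)
print(sample.hex())
print(patched.hex())
```

Only one byte changes and the length stays the same, which is why the patch can be done in place at all.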

My first thought, obviously, was "can't you use sed?". The sequence to be matched involves non-printing characters, but it's easily describable as a regex. As usual with these things, I spent about five minutes trying to get sed to work before giving up and writing the following Perl script:
#!/usr/bin/perl -wpi

BEGIN {
    $regex = chr(0x64).chr(0).".";
    $replacement = chr(0x64).chr(0).chr(0x29);
    $/ = undef; # slurp file
}
s/$regex/$replacement/;
Note the shebang line: -w turns on warnings (which I've never found terribly helpful, but what the hell); -p means "act like sed", i.e. loop over the lines of every file specified on the command line (or stdin), reading each line into $_ and printing $_ at the end of each pass through the loop; and -i is the "modify in-place" switch, which redirects the output into whatever file's currently being read (having moved the original file out of the way first, so reads are not interfered with). The BEGIN block is evaluated at compile time, so before the implicit while (<>) { } loop starts; it's there to set up the search-and-replace, and to set the input record separator $/ to undef, so the whole file is read in as one "line" (we only want to change the first occurrence of 64 00 xx, remember). Had I not used the command-line switches, the program would have looked like
#!/usr/bin/perl
use File::Temp qw/tmpnam/;
$regex = chr(0x64).chr(0).".";
$replacement = chr(0x64).chr(0).chr(0x29);
$/ = undef; # slurp file
LINE: while (<>) {
    if ($ARGV ne $oldargv) {
        unlink $oldversion if $oldversion;
        $oldversion = tmpnam(); # fresh name to move the original to
        rename($ARGV, $oldversion);
        open(ARGVOUT, ">$ARGV");
        select(ARGVOUT);
        $oldargv = $ARGV;
    }
    s/$regex/$replacement/;
}
continue {
    print; # this prints to original filename
}
unlink $oldversion if $oldversion;
select(STDOUT);
(untested). A reasonable alternative would have been to use -i and not -p, in which case it would have looked like
#!/usr/bin/perl -wi

$regex = chr(0x64).chr(0).".";
$replacement = chr(0x64).chr(0).chr(0x29);
$/ = undef; # slurp file

while (<>) {
    s/$regex/$replacement/;
    print;
}
(also untested).

Now, my program works, in that it implements the spec correctly; but it's horribly inefficient. It makes a copy of the file under the hood - worse, it reads the whole thing into memory, when it only needs to look at the first kilobyte or so. Remember, these are multi-gigabyte files we're talking about. I mentioned these concerns to Mat, who said that even if it took ten minutes or so to copy the file (he has fast disks, apparently), that's still an improvement on two hours for brute-force transcoding. I still wasn't happy, but couldn't see a better way to do it that was within my capabilities - I'm used to dealing with the OS through thick layers of abstraction, and while I would, in principle, be capable of writing C code to mess around with inode numbers and so on, it would take me rather longer than I wanted to devote to the problem.

You're all laughing at me, I can tell. Go on, get it out of your systems.

There was, of course, a much better solution, and a few minutes later Afternoon showed it to us. I'd forgotten about the seek command (if I even knew about it in the first place).
  • Open the file in r+ mode (or better, rb+ mode, on the off chance you try running it on an OS that has binary and text file modes, i.e. Winders).
  • Slurp n bytes, where n is the smallest distance from byte 0 that will definitely have your 3-byte search string in.
  • Do the replace.
  • Seek back to byte 0.
  • Write the modified string.
In Python:
#!/usr/bin/python
import re, sys
n = 1024 # guess
bigfile = file(sys.argv[1], "rb+")
oldheader = bigfile.read(n)
newheader = re.sub("\x64\x00[\x00-\x99]", "\x64\x00\x29", oldheader, count=1) # first match only
bigfile.seek(0)
bigfile.write(newheader)
On real video streams, that completes in around 0.01 seconds. I could have done the same thing in Perl:
#!/usr/bin/perl -w

my $header_length = 1024; # guess
foreach my $bigfile (@ARGV) {
    open BIGFILE, "+<", $bigfile or die "Couldn't open $bigfile: $!";
    binmode(BIGFILE);
    read(BIGFILE, $header, $header_length)
        or die "Couldn't read from $bigfile: $!";
    $header =~ s/\x64\x00[\x00-\x99]/\x64\x00\x29/;
    seek BIGFILE, 0, 0;
    print BIGFILE $header;
}
Not quite so neat: Python's treatment of seek, read, etc. as methods on file objects is nice.
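The Python listing above is Python 2: the `file()` builtin is gone in Python 3, and patterns applied to binary data have to be bytes. A rough Python 3 equivalent might look like the sketch below. The function name `patch_file`, the 1024-byte default (the same guess as `n` above) and the throwaway demo file are all placeholders of mine; `count=1` keeps the substitution to the first match.

```python
import os
import re
import tempfile

# 64 00 followed by any single byte; DOTALL lets "." match 0x0A as well.
LEVEL_PATTERN = re.compile(rb"\x64\x00.", re.DOTALL)


def patch_file(path, header_length=1024):
    """Rewrite the first "64 00 xx" in the first header_length bytes to "64 00 29".

    Returns True if a match was found and patched, False otherwise.
    """
    with open(path, "rb+") as bigfile:
        old_header = bigfile.read(header_length)
        new_header, hits = LEVEL_PATTERN.subn(b"\x64\x00\x29", old_header, count=1)
        if hits:
            # The replacement is the same length as the match, so writing the
            # header back over the start of the file cannot shift the rest.
            bigfile.seek(0)
            bigfile.write(new_header)
        return bool(hits)


# Demo on a throwaway file: a made-up header followed by padding.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x00\x00\x01\x67\x64\x00\x33" + b"\xff" * 4096)
demo_path = tmp.name
print(patch_file(demo_path))
with open(demo_path, "rb") as check:
    print(check.read(8).hex())
os.remove(demo_path)
```

Only the first `header_length` bytes are ever read or written, so the running time should not depend on how big the rest of the file is.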

Arguably, we shouldn't be using regexes for this, and should instead parse the header information into a structure and update it properly. But the denizens of Doom9 assure us that this is safe, and they apparently know their video codecs.

The bigger lesson here, I suppose, is that while a particular tool (here, line-based IO) may be extremely useful and adequate to many tasks, it's probably not the only game in town, and one should be aware of the alternatives and always keep them in mind. When all you have is a hammer, and so on.

But anyway, gah. I can't believe I didn't think of seek. Eejit.

Anyone like to submit a Haskell version?

[The full, working version of PS3 Remuxatron (by Mat Brown, GPLed) is here.]

computers, programming, beware the geek, python, perl
