Sort movies by existence in list of movie titles

Alec Jacobson

March 24, 2014

Using the bash script from my previous post, I can generate a list of all movie titles on tehconnection.eu and store them as lines in a text file called tehconnection.txt. I wanted this list so I could sift through a giant collection of movies on an old external hard drive. This Ruby script (save it in sort_by_tehconnection.rb) checks my directory's list of movies against the list from tehconnection and moves the files that appear in that list into a separate directory (which I'll then double-check and delete).
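Before any comparison, both the file names and the listed titles are reduced to a common form. The sketch below illustrates that step on a made-up file name. The helper names `normalize_file_title` and `normalize_list_title` are hypothetical; the regular expressions are copied from the script that follows, minus the special case for the Marx Brothers prefix.

```ruby
# Hypothetical helpers; only the regular expressions come from the script.
def normalize_file_title(name)
  title = name.sub(/ *\([^\(]*\)/, "") # drop the first parenthetical, e.g. " (1988)"
  title = title.downcase
  title = title.sub(/\.[^\.]*$/, "")   # drop the file extension
  title = title.sub(/^the /, "")       # drop a leading "the "
  title.gsub("\"", "")                 # drop double quotes
end

def normalize_list_title(line)
  # strip punctuation, lowercase, drop a leading "the ", remove the newline
  line.gsub(/[.!?:';"]/, "").downcase.sub(/^the /, "").chomp
end

# Made-up example inputs: both normalize to "thin blue line"
puts normalize_file_title("The Thin Blue Line (1988).avi")
puts normalize_list_title("The Thin Blue Line\n")
```

Because both sides end up as the same string, an exact match has edit distance 0, and small spelling differences show up as a distance of 1 or 2.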

#!/opt/local/bin/ruby

# Expects levenshtein_distance.rb next to this script, defining levenshtein_distance(a, b)
require_relative "levenshtein_distance"
require 'fileutils'

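look_dir='/Volumes/Senorita/DOCUMENTARIES/Feature Documentaries/'
found_dir='/Volumes/Senorita/DOCUMENTARIES/Feature Documentaries/tehconnection/'
# Make sure the destination exists before anything is moved into it
FileUtils.mkdir_p(found_dir)

#files_array=File.readlines('files.txt');
# found_dir lives inside look_dir, so skip it along with the usual dot entries
skip_entries = [".", "..", ".DS_Store", File.basename(found_dir)]
files_array=Dir.entries(look_dir).reject{|entry| skip_entries.include?(entry)};
# Normalize each listed title: strip punctuation, lowercase, drop a leading "the "
teh_array=File.readlines('tehconnection.txt').collect {|l| l.gsub(/[.!?:';"]/,"").downcase.sub(/^the /,"").chomp};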
# Split entries of the form "title (aka other title)" into two separate titles
teh_array = teh_array.inject([]) do |res,e|
  if e =~/([^\(]*) \(aka ([^\)]*)\)/
    res.concat([$1,$2]);
  else
    res.concat([e]);
  end
end
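# Bucket titles by length: teh_map[n] holds every title that is n characters long,
# so a lookup by title.length lands on titles of exactly that length
teh_map = [];
teh_array.each do |e|
  i = e.length;
  if teh_map[i].nil?
    teh_map[i] = [];
  end
  teh_map[i] << e
end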
# A file matches when its edit distance to a listed title is below this value
acceptable_dist=2
files_array.each do |orig_title|
  title=orig_title.sub(/ *\([^\(]*\)/,"")
  title.downcase!
  title.sub!(/\.[^\.]*$/,"")
  title.sub!(/^the /,"")
  title.gsub!("\"","")
  title.sub!(/^marx brothers - /,"")
  len = title.length;
  max_len = [len+acceptable_dist,teh_map.length-1].min;
  min_len = [len-acceptable_dist,0].max;
  puts "#{title}..."
  (min_len..max_len).each do |l|
    l_array = teh_map[l];
    found = false;
    if not l_array.nil?
      l_array.each do |teh_title|
        dist=levenshtein_distance(title,teh_title)
        if dist<acceptable_dist
          # found
          puts "  FOUND (#{orig_title}): #{teh_title} --> #{title}, #{dist}"
          FileUtils.mv(File.join(look_dir,orig_title),File.join(found_dir,orig_title));
          found = true;
          break;
        end
      end
    end
    if found
      break;
    end
  end
end
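
The script depends on levenshtein_distance.rb, which isn't shown here. As a stand-in, here is a minimal sketch of an edit-distance function, assuming the file only needs to define `levenshtein_distance(a, b)` and return the number of single-character insertions, deletions, and substitutions needed to turn `a` into `b`. It uses the standard row-by-row dynamic program and is not necessarily the implementation I actually used.

```ruby
# Sketch of an edit-distance function with the signature the script calls.
def levenshtein_distance(a, b)
  # prev[j] = distance between the processed prefix of a and the first j chars of b
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i] # turning i chars of a into the empty string takes i deletions
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,           # delete ca
               curr[j - 1] + 1,       # insert cb
               prev[j - 1] + cost].min # substitute (or keep) the character
    end
    prev = curr
  end
  prev.last
end

puts levenshtein_distance("kitten", "sitting") # prints 3
```

The distance between two strings is never smaller than the difference in their lengths, which is why the script only compares a file name against listed titles of similar length.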