because i'm a nerd, that's why: coriolinus

Apr 12, 2008 07:53

I've been dutifully putting song ratings into iTunes for years now, rating each song individually according to its merit. iTunes actually died a while ago and forced me to start the entire rating process over again, but I still hope that one day I will have a fully rated music library.

While I can set up smart playlists within iTunes to get a good mix of music, it's more interesting to have data that I can visualise. Naturally, I wrote a program to gather and interpret that data for me. Here are the (somewhat voluminous) results:

Parsing XML... 7114 track items parsed
Building model of track/album/artist relationships... Done!
7114 total tracks
162 genres
1575 artists
2230 albums
70 orphan tracks

Pruning library with a threshold of 5...
Unrated tracks eliminated...
albums with too few ratings eliminated...
artists with too few ratings eliminated...
genres with too few ratings eliminated...

Final cleanup of pruned library... Done!
434 pruned tracks
25 genres
22 artists
12 albums

Average tracks per artist: 13.3511111111
Artists with the most tracks:
Red Hot Chili Peppers: 186
KMFDM: 115
Star Ocean The Second Story OST: 86
Spoon: 74
311: 74
Insane Clown Posse: 64
Fatboy Slim: 62
Powerman 5000: 61
Nine Inch Nails: 59
Trans-Siberian Orchestra: 59
The Kleptones: 56
Pitchshifter: 54
Rage Against the Machine: 51
Cake: 50
The Beatles: 47

76% of tracks have genres noted
Average tracks per genre: 9.54320987654
Genres with the most tracks:
Rock: 828
Other: 765
Pop: 344
Alternative: 323
Soundtrack: 263
Electronic: 259
Metal: 147
Sound Clip: 132
Blues: 114
Game: 109
Techno: 92
Punk: 89
Industrial: 83
Hard Rock: 80
Holiday: 78

Average albums per artist: 1.73714285714
Artists with the most albums:
The Beatles: 27
NOFX: 22
Queen: 19
Red Hot Chili Peppers: 16
Marilyn Manson: 16
U2: 14
311: 14
Dream Theater: 13
Aerosmith: 13
KMFDM: 13
Eminem: 12
Dave Matthews Band: 12
Cake: 11
Sublime: 11
Jars of Clay: 10

9% of tracks have ratings noted
Artists with the best average rating
Ben Folds - Ben Folds: 100.0
analoq: 100.0
Splashdown: 100.0
JET: 100.0
Stretch & Vern Present "Maddog": 100.0
川井憲次: 80.0
The Beta Band: 80.0
Tsuneo Imahori: 80.0
Mylo Vs. Miami Sound Machine: 80.0
The Music: 80.0
India.Arie: 80.0
team9: 80.0
Lemon Pipers: 80.0
Milla: 80.0
L M Montgomery: 80.0

Genres with the best average rating
Ambient Alternative: 80.0
Salsa: 60.0
BritPop: 60.0
Noise: 60.0
Film Soundtrack: 60.0
New Wave: 40.0
Jazz Funk: 40.0
Alt-Folk: 35.0
default: 33.3333333333
Data: 30.0
General Alternative: 30.0
Grunge: 26.6666666667
Children's: 25.0
RnB: 23.0769230769
Rap & Hip Hop: 21.25

Considering only categories with at least five samples to compare between:
Artists with the best average rating
Gorillaz: 88.0
Daft Punk: 80.0
Sara Bareilles: 80.0
Roisin Murphy: 80.0
Mylo: 80.0
Smash Mouth: 76.0
Vitalic: 74.0
Rihanna: 72.0
The Moog Cookbook: 70.0
Red Hot Chili Peppers: 67.6923076923
Fatboy Slim: 64.4444444444
Rage Against The Machine: 64.0
Spoon: 63.3333333333
KMFDM: 62.2222222222
311: 60.0

Albums with the best average rating
Destroy Rock & Roll: 80.0
Ruby Blue: 80.0
Little Voice: 80.0
V Live: 76.0
Stadium Arcadium: 76.0
OK Cowboy: 72.0
24 Hours: 65.4545454545
Best of Bootie 2007: 60.0
Freaky Styley: 60.0
Original Motion Picture Soundtrack (Disc Two): 56.6666666667
A Night At The Hip-Hopera: 50.0
Forgotten Freshness 4: 47.1428571429

Genres with the best average rating
Electronica/Dance: 80.0
Alternative Rock: 76.0
Mix CD: 74.2857142857
Electronic: 73.5714285714
Electronica: 72.3076923077
Dance: 72.0
Techno: 70.0
Acapella: 68.8888888889
Alternative: 68.0
Rock: 67.3684210526
Sound Clip: 65.7142857143
Other: 65.5555555556
Pop: 65.1162790698
Metal: 64.0
Industrial: 64.0
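
The two sets of rating tables differ so much because of how the averages are taken. In the source below, `TrackCollection.average` skips unrated tracks when summing but still divides by the full track count, so in the unpruned tables every unrated track effectively counts as a zero. The pruned library only contains rated tracks, and only groups with at least five of them. Here is a minimal Python 3 sketch of that difference. The sample data and the names `average_all`, `average_rated`, and `top` are made up for illustration and are not part of iTunesStats.py.

```python
# Illustrative data only: artist -> list of ratings (None means unrated).
# Ratings use the 0-100 scale seen in the output above.
ratings = {
    "One Hit Wonder": [100],
    "Mostly Unrated": [80, None, None, None],
    "Well Sampled": [80, 60, 100, 80, 60, None],
}

def average_all(values):
    """Mean over every track, counting unrated tracks as zero."""
    return sum(v for v in values if v is not None) / len(values)

def average_rated(values):
    """Mean over rated tracks only."""
    rated = [v for v in values if v is not None]
    return sum(rated) / len(rated)

def top(groups, score, min_rated=0, limit=15):
    """Rank groups by score, keeping those with at least min_rated ratings."""
    kept = {
        name: values
        for name, values in groups.items()
        if sum(v is not None for v in values) >= min_rated
    }
    ranked = sorted(kept, key=lambda name: score(kept[name]), reverse=True)
    return [(name, score(kept[name])) for name in ranked[:limit]]

# Unpruned view: a single five-star track is enough to top the list.
unpruned = top(ratings, average_all)
# Pruned view: only groups with five or more rated tracks are compared.
pruned = top(ratings, average_rated, min_rated=5)
print(unpruned)
print(pruned)
```

In the unpruned view the one-track artist wins outright, which is the same effect that puts so many 100.0 and 80.0 entries at the top of the first artist table.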

Because I am a good person and like you, here is the source:
iTunesStats.py

#!/usr/bin/env python
"""
A set of utilities for working with iTunes XML files and generating interesting statistics therefrom.

Dependencies:
PListReader (http://www.shearersoftware.com/software/developers/plist/)
XMLFilter (http://www.shearersoftware.com/software/developers/xmlfilter/)
path (http://www.jorendorff.com/articles/python/path)
"""

from __future__ import division

import sys
from PListReader import PListReader
from XMLFilter import XMLFilter
from path import path
from copy import copy

alphabet = set(list('abcdefghijklmnopqrstuvwxyz'))

def load(iml=None):
    if iml is None:
        #yup, i'm assuming Windows here
        iml = path('~/My Documents/My Music/iTunes/iTunes Music Library.xml').expand().abspath()
        if not iml.exists():
            #i do take into account the possibility of mac/unix users
            iml = path('~/Music/iTunes/iTunes Music Library.xml').expand().abspath()
            if not iml.exists():
                raise IOError('Could not automatically find "iTunes Music Library.xml"')
    else:
        iml = path(iml).expand().abspath()

    reader = PListReader()
    XMLFilter.parseFilePath(iml, reader, features = reader.getRecommendedFeatures())
    return reader.getResult()

class Track(object):
    class Lib(object):
        def __init__(self, track):
            self.track = track
            self.artist = None
            self.album = None
            self.genre = None

        def __str__(self):
            return u''.join([u'Track: ', unicode(self.track), u'\nArtist: ', unicode(self.artist), u'\n',
                             u'Album: ', unicode(self.album), u'\nGenre: ', unicode(self.genre)])

    def __init__(self, tdict, library = None):
        for key, val in tdict.iteritems():
            self.toAttr(key, val)
        keys = set(tdict.keys())
        if u'Name' not in keys:
            self.name = path(self.location).name
            if '.' in self.name:
                self.name = self.name.rpartition('.')[0]
            self.name = self.name.replace('%20', ' ')
            self.name = self.name.replace('_', ' ')
        if u'Artist' not in keys or self.artist == 'Various':
            self.artist = None
        if u'Album' not in keys:
            self.album = None
        if u'Genre' not in keys or self.genre == 'Unknown':
            self.genre = None
        if u'Rating' not in keys:
            self.rating = None

        #these will be initialized from the outside to point to
        #the object representations
        self.lib = Track.Lib(self)

        self.library = library
        if self.library is not None:
            self.setLibrary(self.library)

    def __cmp__(self, other):
        return cmp(self.trackID, other.trackID)

    def __str__(self):
        return unicode(self.name)

    def __repr__(self):
        #assumed format, modelled on TrackCollection.__repr__
        return u'<Track: %s - %s>' % (unicode(self.artist), unicode(self.name))

    def toAttr(self, keyname, val):
        #turn a plist key such as 'Track ID' into an attribute name such as 'trackID'
        kn = []
        first = True
        for i in xrange(len(keyname)):
            ki = keyname[i]
            if ki.lower() in alphabet:
                if first:
                    kn.append(ki.lower())
                    first = False
                else:
                    kn.append(ki)
        setattr(self, ''.join(kn), val)

    def setLibrary(self, library):
        self.library = library

        self.library.tracks.add(self)

        if self.album is not None:
            self.library.albums.setdefault(self.album.lower(), TrackCollection(self.album)).add(self)
            self.lib.album = self.library.albums[self.album.lower()]

        if self.artist is not None:
            self.library.artists.setdefault(self.artist.lower(), TrackCollection(self.artist)).add(self)
            self.lib.artist = self.library.artists[self.artist.lower()]
        else:
            self.library.orphans.add(self)

        if self.genre is not None:
            self.library.genres.setdefault(self.genre.lower(), TrackCollection(self.genre)).add(self)
            self.lib.genre = self.library.genres[self.genre.lower()]

class TrackCollection(set):
    def __init__(self, name):
        self.name = name

    def __cmp__(self, other):
        return cmp(self.name.lower(), other.name.lower())

    def __repr__(self):
        return '<%s: %i Tracks>' % (self.name, len(self))

    def __str__(self):
        return self.name

    def average(self, key=lambda track: track):
        return self.sum((key(track) for track in self)) / len(self)

    def sum(self, iterable, key=lambda track: track):
        t = 0
        for i in iterable:
            try:
                t += i
            except TypeError:
                pass
        return t

class Library(object):
    def __init__(self, iml=None, messages=sys.stdout, suppressAutoIML=False):
        self.tracks = set()
        self.albums = {}
        self.artists = {}
        self.genres = {}
        self.orphans = set()

        self.messages = messages

        if not suppressAutoIML:
            self.initFromIML(iml)

    def initFromIML(self, iml):
        """
        Initialize the library from an iTunes Music Library
        """
        self.pr("Parsing XML... ", newline=False)
        lib = load(iml)
        self.pr("%i track items parsed" % len(lib[u'Tracks']))

        self.pr("Building model of track/album/artist relationships... ", newline=False)
        for tid, track in lib[u'Tracks'].iteritems():
            Track(track, self)
        self.pr("Done!")
        self.pr(" %i total tracks" % len(self.tracks))
        self.pr(" %i genres" % len(self.genres))
        self.pr(" %i artists" % len(self.artists))
        self.pr(" %i albums" % len(self.albums))
        self.pr(" %i orphan tracks" % len(self.orphans))

    def pr(self, msg='', newline=True):
        self.messages.write(unicode(msg).encode("utf-8"))
        if newline:
            self.messages.write('\n')

    def most(self, collection, collectionOperation=lambda col: len(col), viewTop=15, show=False):
        """
        See the most populous members of a collection.

        collection is one of "albums", "artists", "genres"
        collectionOperation is a function which is performed on each collection. Defaults to lambda col: len(col),
            which causes this function to return the most populous members of the collection. Other examples:
            lambda col: col.average(lambda track: track.rating) causes this to return the collections with the
            best average rating.
        viewTop restricts the number displayed. If 0, displays all.
        show prints the results instead of returning them.
        """
        if show:
            for col, size in self.most(collection, collectionOperation, viewTop, False):
                self.pr(unicode(col) + u': ' + unicode(size))
        else:
            col = [(collectionOperation(c), c) for c in getattr(self, collection).values()]
            col.sort()
            col.reverse()
            return [(c, cl) for cl, c in col] if viewTop == 0 else [(c, cl) for cl, c in col][:viewTop]

    def prune(self, threshold=5):
        """
        Generates a copy of the library with weak members pruned out.

        All unrated tracks are pruned. Then, for each collection type, each member with
        fewer than threshold tracks is pruned.
        """

        self.pr("Pruning library with a threshold of %i..." % threshold)

        l2 = Library(messages=self.messages, suppressAutoIML=True)
        for track in self.tracks:
            if track.rating is not None and track.rating > 0:
                t2 = copy(track)
                t2.lib = Track.Lib(t2)
                t2.setLibrary(l2)
        self.pr(" Unrated tracks eliminated...")

        for collection in ['albums', 'artists', 'genres']:
            toremove = set()
            for key, member in getattr(l2, collection).iteritems():
                if len(member) < threshold:
                    toremove.add(key)
                elif len([i for i in member if i.rating is not None and i.rating > 0]) < threshold:
                    toremove.add(key)
            coll = getattr(l2, collection)
            for key in toremove:
                del coll[key]
            setattr(l2, collection, coll)
            self.pr(" %s with too few ratings eliminated..." % collection)

        self.pr()
        self.pr("Final cleanup of pruned library... ", False)
        newtracks = set()
        for collection in ['albums', 'artists', 'genres']:
            for member in getattr(l2, collection).values():
                for track in member:
                    newtracks.add(track)
        l2.tracks = newtracks
        self.pr("Done!")
        self.pr(" %i pruned tracks" % len(l2.tracks))
        self.pr(" %i genres" % len(l2.genres))
        self.pr(" %i artists" % len(l2.artists))
        self.pr(" %i albums" % len(l2.albums))

        return l2

def main(argv=None):
    if argv is None:
        argv = sys.argv
    iTunesLib = None
    if len(argv) > 1:
        iTunesLib = argv[1]
    lib = Library(iTunesLib)

    lib.pr()

    l2 = lib.prune()

    lib.pr()

    #now we just run through some standard stats
    lib.pr("Average tracks per artist: ", False)
    #iterate over the collections; iterating the dict itself yields the
    #artist names, so len() would measure name length instead of track count
    spa = [len(a) for a in lib.artists.values()]
    lib.pr(sum(spa) / len(spa))
    lib.pr("Artists with the most tracks:")
    lib.most('artists', show=True)

    lib.pr()

    lib.pr("%i%% of tracks have genres noted" % int(100*(len([i for i in lib.tracks if i.genre is not None])/len(lib.tracks))))
    lib.pr("Average tracks per genre: ", False)
    #same as above: count tracks per genre, not characters in the genre name
    spg = [len(g) for g in lib.genres.values()]
    lib.pr(sum(spg) / len(spg))
    lib.pr("Genres with the most tracks:")
    lib.most('genres', show=True)

    lib.pr()

    lib.pr("Average albums per artist: ", False)
    lib.pr(sum((len(set((track.album for track in artist))) for artist in lib.artists.values())) / len(lib.artists))
    lib.pr("Artists with the most albums:")
    lib.most('artists', lambda artist: len(set((track.album for track in artist))), show=True)

    lib.pr()

    #number of tracks that carry a rating
    rated = len([i for i in lib.tracks if i.rating is not None])
    lib.pr("%i%% of tracks have ratings noted" % int(100*(rated/len(lib.tracks))))
    if rated > 0:
        lib.pr("Artists with the best average rating")
        lib.most('artists', lambda col: col.average(lambda track: track.rating), show=True)

        lib.pr()

        lib.pr("Genres with the best average rating")
        lib.most('genres', lambda col: col.average(lambda track: track.rating), show=True)

    if len(l2.tracks) > 0:
        lib.pr()

        lib.pr("Considering only categories with at least five samples to compare between:")
        lib.pr("Artists with the best average rating")
        l2.most('artists', lambda col: col.average(lambda track: track.rating), show=True)

        lib.pr()

        lib.pr("Albums with the best average rating")
        l2.most('albums', lambda col: col.average(lambda track: track.rating), show=True)

        lib.pr()

        lib.pr("Genres with the best average rating")
        l2.most('genres', lambda col: col.average(lambda track: track.rating), show=True)

if __name__ == '__main__':
    sys.exit(main())
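
The script above targets Python 2 and leans on third-party parsers. For anyone on Python 3, the `load` step and the case-insensitive artist grouping could be sketched with the standard library's `plistlib` instead. This is a rough sketch under assumptions, not a port: the tiny in-memory library stands in for a real "iTunes Music Library.xml", and `load_tracks` and `group_by_artist` are hypothetical names that do not appear in iTunesStats.py.

```python
import plistlib
from collections import defaultdict

def load_tracks(xml_bytes):
    """Parse an iTunes-style property list and return its track dicts."""
    library = plistlib.loads(xml_bytes)
    return list(library["Tracks"].values())

def group_by_artist(tracks):
    """Group tracks by lower-cased artist; tracks with no artist are orphans."""
    artists, orphans = defaultdict(list), []
    for track in tracks:
        artist = track.get("Artist")
        # Mirrors the script: a missing or 'Various' artist counts as none.
        if artist is None or artist == "Various":
            orphans.append(track)
        else:
            artists[artist.lower()].append(track)
    return artists, orphans

# Stand-in data, serialised to XML so the example runs without a real library.
sample = plistlib.dumps({
    "Tracks": {
        "1": {"Track ID": 1, "Name": "Song A", "Artist": "Cake", "Rating": 80},
        "2": {"Track ID": 2, "Name": "Song B", "Artist": "CAKE"},
        "3": {"Track ID": 3, "Name": "Song C"},
    }
})

tracks = load_tracks(sample)
artists, orphans = group_by_artist(tracks)
print(len(tracks), "tracks,", len(artists), "artists,", len(orphans), "orphans")
```

To run it against a real library, open the XML file in binary mode and pass the file object to `plistlib.load` rather than building the sample with `plistlib.dumps`.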

I encourage you to post your own results.

meme, geekspeak