C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Choochoo Training Diary

Last 100 entries

Reflections on Playlist Generation (UYKFD)

From: andrew cooke <andrew@...>

Date: Sun, 6 Dec 2009 14:16:45 -0300

This is going to be, I hope, a long rambling brain dump that records
my experiences with the UYKFD project -
http://code.google.com/p/uykfd/


First, some basic choices I made at the start of the project.

I used Last FM's "artist tags" rather than, say, "related artists", as
my raw material.  I can't remember now if this was because "related
artists" didn't provide what I needed, or if I wanted to see how I
could duplicate that data, or whether I simply overlooked it.

I also decided to work mainly with artists rather than tracks or
albums.  The advantage of doing this is that there are more data for
artists.  The disadvantages are that the data are less specific (if an
artist has produced a range of work) and that it can be difficult to
decide how to choose tracks based on artist relationships (for
example, should you choose an artist more often if you have many
tracks by them?).

And my aim has always been to build a tool to help explore a broad
collection of music.  I think many people these days have more music
than they actually know.  At the same time, playing "random shuffle"
can be frustrating; either because the changes between tracks are too
jarring, or because you are in a mood for a certain kind of music.


Next, some of the more conceptual technical challenges.

Perhaps the "biggest problem" is how to reduce detailed, inhomogenous,
noisy data (particularly Last FM tags, but MP3 ID errors were also an
issue) into something that is sufficiently uniform that it applies to
all artists equally (in the sense of avoiding biases) while still
retaining enough information to discriminate.

It might not be clear why this is a problem unless I describe in more
detail what happens.

The Last FM API provides weighted tags for artists.  These are strings
associated with integer values that appear to be between 0 and 100 (a
larger number meaning that more people have used that string to label
that artist, I assume).  So I read this and (skipping some details I
will describe later), associate these with the artists in a database.

Next, I infer the "degree of connection" between artists by looking
for common strings and combining the weights.  So if two artists both
have the label "rock" with high scores then they will be connected by
a high score; if they have the same label, but a low score then they
will still be connected, but with a low score.

Some artists have many more tags than others, and some tags are more
more general than others.  Worse, the underlying artist selection is
already "clumpy" (people like certain artists in certain styles).  So
the end result is that artists tend to be joined into groups, and some
artists appear to be "central" within those groups (this is probably
related to Last FM playing U2 for almost everyone when I used it some
time ago).

The problem I mentioned above is: given this clumpy data, how do I
generate a playlist that really does explore the music collection,
rather than staying in a single clump (and playing the "central"
artists again and again)?

I didn't have a good answer to this at the start - most of my work has
been stumbling around trying to better understand what is happening
(and the work is incomplete - there is a lot of interesting work to be
done on the clumps, which I have been treating as problems rather than
as something to understand in themselves, for example).


My current solution combines two related ideas.

The first idea is to construct a "non parameteric" graph of related
artists.  I am not sure if "non parametric" helps here - I am alluding
to non parameteric statistics, which rely on simple properties like
ordering, rather than numerical values.

To construct this graph I take each artist in turn and look at the top
N neighbours (by weight), where N is a small number (less than 10).  I
add edges to my graph only for those neighbours.  This discards an
awful lot of information.  First, I am throwing away any information
about more distant neighbours.  Second, even for the neighbours I
keep, I treat them all as equals (no weights).  But the resulting
graph is, I hope, a lot more uniform.

This graph forms the skeleton on which I build the playlist - I jump
from node to node, along the connected edges, choosing a track from
each artist in turn.  Because I have forced the graph to be much more
uniform than the original, clumpy, data, the playlist does quite a
good job of exploring the available music.

Unfortunately, it does rather too good a job, and wanders too far, too
quickly, from the starting artist.  It turns out that I can get from
one track to *any* other in just a handful of hops.


So this is where the second idea comes in - I bias the choice of
artist based on some concept of "distance".

The trick here is how to define distance so that it is useful, without
re-introducing the clumpiness problems from earlier.  I am not sure I
have found the best solution yet (it is certainly not very efficient),
but it seems to work.

To achieve a robust distance I iterate the graph approach above, using
progressively more neighbours.  So I start with just "top" neighbours
and take the usual graph distance (in this case there may be many
unconnected groups - that doesn't really matter).  Next, I take the
"top two" neighbours and calculate a new distance.  I *add* this
distance to the original value (if there is no original distance,
because the artists were completely isolated in the previous round, I
use the largest distance from the previous round - aka the perimeter
distance).

This is expensive, but three iterations gives a good distribution of
distances, and all this work is done just once, then stored in a
database.


The graph and distances solve the "big picture" problem.  But there
are also some interesting lower-level problems.

MP3 ID tags can vary quite a bit.  They can contain strange "foreign"
characters, or "artist" can be a list of names, for example.  So I
take care to normalize text to lower case ASCII letters plus space, to
separate on various tokens (comma, "and", etc), and to try dropping
various values (anything after "and", for example).  This means that
one "artist" can abe associated with quite a few different strings - I
pass all the different strings to Last FM and combine the returned
tags.

Another issue is directional graphs.  The initial weights that relate
artists are, by construction, symmetric, but there's no guarantee that
the "top N neighbours" is commutative (B may be A's top neighbour, but
B's top neighbour could be C).  So my code uses directed graphs.

However, when you look a little more closely, it turns out that a
directed graph is fine, because it is equivalent to adding "backlinks"
(linking B to A if A links to B), which itself seems quite reasonable.

Other issues involve weighting.  To select a new track all
neighbouring artists are selected, initially weighted by the number of
tracks they have which have not been recently played.  Then these
weights are modified according to distance from an earlier artist.  An
artist is then selected based on the weights (ie an artist with weight
2X is twice as likely to be picked as one with weight X).  The same
process is then repeated to select from an artist's tracks, with those
tracks that have already been played giving a lower weight.


To get this process to give "pleasant" results still requires a fair
amount of tweaking of parameters.

The two most important parameters are the artist queue length, and the
distance exponent.  The artist queue length fixes the number of
artists that are stored in the "already played" list.  In itself, this
is not very important, but the this list also provides the reference
(ie oldest) artist for distances.  So specifying a long list keeps the
original artist as "oldest" and so measures all distances from the
same point.  A shorter list allows the selection to "wander" as
distances are measured from the a point that itself moves over time.

The distance exponent controls the strength of the weighting by
distance and can be negative (conservative, keeping the selection
centred) or positive (liberal, driving the selection to new music).

The best approach I have found is to keep the distance exponent
"strong" (-7 to -10 in my current implementation) and then reduce the
queue size to allow exploration "to taste".


A third parameter (of middling importance) is the number of neighbours
per artist used to construct the "non parametric" graph.  Too few and
it's possible for a sequence of artists to get "trapped" with no way
to return the "original artist" no matter how strongly distances are
weighted.  Too high and the clustering in the raw data starts to come
through.


In conclusion, then, UYKFD's success (ie that I enjoy the playlists
generated - noone else is using it, as far as I know) is largely based
on the "non parametric" graph construction.  This gives the basic
connections, which restrict successive tracks from being too
different, and also underlies the distance calculations, which support
weighting that keeps a "central theme".

Andrew

Correct Exponents

From: andrew cooke <andrew@...>

Date: Sun, 13 Dec 2009 21:06:06 -0300

A note that describes the latest version (1.4).

I fixed some bugs that mean that an exponent like 1 or 2 is all that
is required to control divergence (a value of 7 is given above).  This
can be seen in the following scripts:

http://code.google.com/p/uykfd/source/browse/random.sh
http://code.google.com/p/uykfd/source/browse/all.sh

Andrew

Comment on this post