Andrew Cooke | Contents | Latest | RSS | Twitter | Previous | Next


Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Lepl parser for Python.

Colorless Green.

Photography around Santiago.

SVG experiment.

Professional Portfolio

Calibration of seismometers.

Data access via web services.

Cache rewrite.

Extending OpenSSH.

C-ORM: docs, API.

Last 100 entries

Cat Soft LLC - Sell your invoices for immediate revenue that grows with Factoring Quotes's turnover!; Update; Credit Card Processing for Cat Soft LLC; Copier Quotes for Cat Soft LLC; [Book] Galileo's Middle Finger; VOIP quote for Cat Soft LLC; [Bike] Chinese Carbon Rims; Collection Agencies for Cat Soft LLC; Get Coffee Quotes for Cat Soft LLC; [Bike] Servicing Shimano XT Front Hub HB-M8010; [Bike] Aliexpress Cycling Tops; Now Is Cat Soft LLC's Chance To Save Up To 32% On Mail; Call Center Services for Cat Soft LLC; [Computing] Change to ssh handling of multiple identities?; [Bike] Endura Hummvee Lite II; [Computing] Marble Based Logic; [Link, Politics] Sanity Check For Nuclear Launch; [Link, Science] Entropy and Life; [Link, Bike] Cheap Cycling Jerseys; [Link, Music] Music To Steal 2017; [Link, Future] Simulated Brain Drives Robot; [Link, Computing] Learned Index Structures; Solo Air Equalization; Update: Higher Pressures; Psychology; [Bike] Exercise And Fuel; Continental Race King 2.2; Removing Lowers; Mnesiacs; [Maths, Link] Dividing By Zero; [Book, Review] Ray Monk - Ludwig Wittgenstein: The Duty Of Genius; [Link, Bike, Computing] Evolving Lacing Patterns; [Jam] Strawberry and Orange Jam; [Chile, Privacy] Biometric Check During Mail Delivery; [Link, Chile, Spanish] Article on the Chilean Drought; [Bike] Extended Gear Ratios, Shimano XT M8000 (24/36 Chainring); [Link, Politics, USA] The Future Of American Democracy; Mass Hysteria; [Review, Books, Links] Kazuo Ishiguro - Never Let Me Go; [Link, Books] David Mitchell's Favourite Japanese Fiction; [Link, Bike] Rear Suspension Geometry; [Link, Cycling, Art] Strava Artwork; [Link, Computing] Useful gcc flags; [Link] Voynich Manuscript Decoded; [Bike] Notes on Servicing Suspension Forks; [Links, Computing] Snap, Flatpack, Appimage; [Link, Computing] Oracle is leaving Java (to die); [Link, Politics] Cubans + Ultrasonics; [Book, Link] Laurent Binet; VirtualBox; [Book, Link] No One's Ways; [Link] The Biggest Problem For Cyclists Is Bad Driving; [Computing] Doxygen, Sphinx, Breathe; [Admin] Brokw Recent Permalinks; [Bike, Chile] Buying Bearings in Santiago; [Computing, Opensuse] Upgrading to 42.3; [Link, Physics] First Support for a Physics Theory of Life; [Link, Bike] Peruvian Frame Maker; [Link] Awesome Game Theory Tit-For-Tat Thing; [Food, Review] La Fabbrica - Good Italian Food In Santiago; [Link, Programming] MySQL UTF8 Broken; [Link, Books] Latin American Authors; [Link, Computing] Optimizatin Puzzle; [Link, Books, Politics] Orwell Prize; [Link] What the Hell Is Happening With Qatar?; [Link] Deep Learning + Virtual Tensor Machines; [Link] Scaled Composites: Largest Wingspan Ever; [Link] SCP Foundation; [Bike] Lessons From 2 Leading 2 Trailing; [Link] Veg Restaurants in Santiago; [Link] List of Contemporary Latin American Authors; [Bike] FTHR; [Link] Whoa - NSA Reduces Collection (of US Residents); [Link] Red Bull's Breitbart; [Link] Linux Threads; [Link] Punycode; [Link] Bull / Girl Statues on Wall Street; [Link] Beautiful Chair Video; Update: Lower Pressures; [Link] Neat Python Exceptions; [Link] Fix for Windows 10 to Avoid Ads; [Link] Attacks on ZRTP; [Link] UK Jazz Invasion; [Review] Cuba; [Link] Aricle on Gender Reversal of US Presidential Debate; {OpenSuse] Fix for Network Offline in Updater Applet; [Link] Parkinson's Related to Gut Flora; Farellones Bike Park; [Meta] Tags; Update: Second Ride; Schwalbe Thunder Burt 2.1 v Continental X-King 2.4; Mountain Biking in Santiago; Books on Ethics; Security Fail from Command Driven Interface; Everything Old is New Again; Interesting Take on Trump's Lies; Chutney v6; References on Entropy; Amusing "Alexa.." broadcast; The Shame of Chile's Education System; Playing mp4 gifs in Firefox on Opensuses Leap 42.2

© 2006-2017 Andrew Cooke (site) / post authors (content).

Reflections on Playlist Generation (UYKFD)

From: andrew cooke <andrew@...>

Date: Sun, 6 Dec 2009 14:16:45 -0300

This is going to be, I hope, a long rambling brain dump that records
my experiences with the UYKFD project -

First, some basic choices I made at the start of the project.

I used Last FM's "artist tags" rather than, say, "related artists", as
my raw material.  I can't remember now if this was because "related
artists" didn't provide what I needed, or if I wanted to see how I
could duplicate that data, or whether I simply overlooked it.

I also decided to work mainly with artists rather than tracks or
albums.  The advantage of doing this is that there are more data for
artists.  The disadvantages are that the data are less specific (if an
artist has produced a range of work) and that it can be difficult to
decide how to choose tracks based on artist relationships (for
example, should you choose an artist more often if you have many
tracks by them?).

And my aim has always been to build a tool to help explore a broad
collection of music.  I think many people these days have more music
than they actually know.  At the same time, playing "random shuffle"
can be frustrating; either because the changes between tracks are too
jarring, or because you are in a mood for a certain kind of music.

Next, some of the more conceptual technical challenges.

Perhaps the "biggest problem" is how to reduce detailed, inhomogenous,
noisy data (particularly Last FM tags, but MP3 ID errors were also an
issue) into something that is sufficiently uniform that it applies to
all artists equally (in the sense of avoiding biases) while still
retaining enough information to discriminate.

It might not be clear why this is a problem unless I describe in more
detail what happens.

The Last FM API provides weighted tags for artists.  These are strings
associated with integer values that appear to be between 0 and 100 (a
larger number meaning that more people have used that string to label
that artist, I assume).  So I read this and (skipping some details I
will describe later), associate these with the artists in a database.

Next, I infer the "degree of connection" between artists by looking
for common strings and combining the weights.  So if two artists both
have the label "rock" with high scores then they will be connected by
a high score; if they have the same label, but a low score then they
will still be connected, but with a low score.

Some artists have many more tags than others, and some tags are more
more general than others.  Worse, the underlying artist selection is
already "clumpy" (people like certain artists in certain styles).  So
the end result is that artists tend to be joined into groups, and some
artists appear to be "central" within those groups (this is probably
related to Last FM playing U2 for almost everyone when I used it some
time ago).

The problem I mentioned above is: given this clumpy data, how do I
generate a playlist that really does explore the music collection,
rather than staying in a single clump (and playing the "central"
artists again and again)?

I didn't have a good answer to this at the start - most of my work has
been stumbling around trying to better understand what is happening
(and the work is incomplete - there is a lot of interesting work to be
done on the clumps, which I have been treating as problems rather than
as something to understand in themselves, for example).

My current solution combines two related ideas.

The first idea is to construct a "non parameteric" graph of related
artists.  I am not sure if "non parametric" helps here - I am alluding
to non parameteric statistics, which rely on simple properties like
ordering, rather than numerical values.

To construct this graph I take each artist in turn and look at the top
N neighbours (by weight), where N is a small number (less than 10).  I
add edges to my graph only for those neighbours.  This discards an
awful lot of information.  First, I am throwing away any information
about more distant neighbours.  Second, even for the neighbours I
keep, I treat them all as equals (no weights).  But the resulting
graph is, I hope, a lot more uniform.

This graph forms the skeleton on which I build the playlist - I jump
from node to node, along the connected edges, choosing a track from
each artist in turn.  Because I have forced the graph to be much more
uniform than the original, clumpy, data, the playlist does quite a
good job of exploring the available music.

Unfortunately, it does rather too good a job, and wanders too far, too
quickly, from the starting artist.  It turns out that I can get from
one track to *any* other in just a handful of hops.

So this is where the second idea comes in - I bias the choice of
artist based on some concept of "distance".

The trick here is how to define distance so that it is useful, without
re-introducing the clumpiness problems from earlier.  I am not sure I
have found the best solution yet (it is certainly not very efficient),
but it seems to work.

To achieve a robust distance I iterate the graph approach above, using
progressively more neighbours.  So I start with just "top" neighbours
and take the usual graph distance (in this case there may be many
unconnected groups - that doesn't really matter).  Next, I take the
"top two" neighbours and calculate a new distance.  I *add* this
distance to the original value (if there is no original distance,
because the artists were completely isolated in the previous round, I
use the largest distance from the previous round - aka the perimeter

This is expensive, but three iterations gives a good distribution of
distances, and all this work is done just once, then stored in a

The graph and distances solve the "big picture" problem.  But there
are also some interesting lower-level problems.

MP3 ID tags can vary quite a bit.  They can contain strange "foreign"
characters, or "artist" can be a list of names, for example.  So I
take care to normalize text to lower case ASCII letters plus space, to
separate on various tokens (comma, "and", etc), and to try dropping
various values (anything after "and", for example).  This means that
one "artist" can abe associated with quite a few different strings - I
pass all the different strings to Last FM and combine the returned

Another issue is directional graphs.  The initial weights that relate
artists are, by construction, symmetric, but there's no guarantee that
the "top N neighbours" is commutative (B may be A's top neighbour, but
B's top neighbour could be C).  So my code uses directed graphs.

However, when you look a little more closely, it turns out that a
directed graph is fine, because it is equivalent to adding "backlinks"
(linking B to A if A links to B), which itself seems quite reasonable.

Other issues involve weighting.  To select a new track all
neighbouring artists are selected, initially weighted by the number of
tracks they have which have not been recently played.  Then these
weights are modified according to distance from an earlier artist.  An
artist is then selected based on the weights (ie an artist with weight
2X is twice as likely to be picked as one with weight X).  The same
process is then repeated to select from an artist's tracks, with those
tracks that have already been played giving a lower weight.

To get this process to give "pleasant" results still requires a fair
amount of tweaking of parameters.

The two most important parameters are the artist queue length, and the
distance exponent.  The artist queue length fixes the number of
artists that are stored in the "already played" list.  In itself, this
is not very important, but the this list also provides the reference
(ie oldest) artist for distances.  So specifying a long list keeps the
original artist as "oldest" and so measures all distances from the
same point.  A shorter list allows the selection to "wander" as
distances are measured from the a point that itself moves over time.

The distance exponent controls the strength of the weighting by
distance and can be negative (conservative, keeping the selection
centred) or positive (liberal, driving the selection to new music).

The best approach I have found is to keep the distance exponent
"strong" (-7 to -10 in my current implementation) and then reduce the
queue size to allow exploration "to taste".

A third parameter (of middling importance) is the number of neighbours
per artist used to construct the "non parametric" graph.  Too few and
it's possible for a sequence of artists to get "trapped" with no way
to return the "original artist" no matter how strongly distances are
weighted.  Too high and the clustering in the raw data starts to come

In conclusion, then, UYKFD's success (ie that I enjoy the playlists
generated - noone else is using it, as far as I know) is largely based
on the "non parametric" graph construction.  This gives the basic
connections, which restrict successive tracks from being too
different, and also underlies the distance calculations, which support
weighting that keeps a "central theme".


Correct Exponents

From: andrew cooke <andrew@...>

Date: Sun, 13 Dec 2009 21:06:06 -0300

A note that describes the latest version (1.4).

I fixed some bugs that mean that an exponent like 1 or 2 is all that
is required to control divergence (a value of 7 is given above).  This
can be seen in the following scripts:


Comment on this post