From: "andrew cooke" <andrew@...>
Date: Sun, 7 Sep 2008 21:56:24 -0400 (CLT)
Might be useful one day - http://arxiv.org/abs/0806.3301v1 Andrew
Permalink | Comment on this post
Previous Entries
For comments, see relevant pages (permalinks).
MySQL and Graphs
From: "andrew cooke" <andrew@...>
Date: Sun, 7 Sep 2008 21:53:45 -0400 (CLT)
Goes into deep MySQL specific detail - http://www.artfulsoftware.com/mysqlbook/sampler/mysqled1ch20.html Part of a book, contents here - http://www.artfulsoftware.com/ - but only selected sections available for free. Andrew
Identifying Related DNA Sequences
From: "andrew cooke" <andrew@...>
Date: Sat, 6 Sep 2008 20:50:08 -0400 (CLT)
Last weekend I spent some time working on a "puzzle" I found on the net.
I made some notes here but, since it was a recruitment question and I am
about to explain a possible answer, I've deleted those posts (I've also
changed the details slightly below so that I use a different vocabulary).
1. The Problem
The challenge was to define the relationships between members of a
population, given their "DNA". To keep things simple we are told that:
- Reproduction is asexual: a single parent "gives birth" to a
single offspring.
- The population started with a single individual
- All individuals are immortal
- The population grows in "steps". At each step, an existing
individual is chosen at random (ie uniform probability) to
reproduce.
- The "DNA" is a sequence containing just two types of "bases",
of a fixed (long) length.
- On reproduction, each base has a 20% chance of being "flipped".
So we can represent the DNA as a binary number (the original puzzle was
more explicit about this, but as I said I am purposefully avoiding their
vocabulary).
The data contained ten thousand individuals, each with a DNA containing
ten thousand bases.
2. Initial Observations
These are fairly obvious points, but they are critical to the solution I
implemented:
a - We can define a "distance" between any two DNA samples that is
simply the number of different bases.
b - The population is so large that a direct search is going to be
too slow (and boring).
c - For long DNA (eg many thousands of bases) the available "space"
is *huge* (consider each base as choosing a direction along a
distinct, orthogonal axis). This means that we do not need
to worry about accidental confusion - if two DNA are close
(ie. differ by about 20%) then they come from parent and child.
d - Most of the work in generating a "family tree" is constructing
a minimal spanning tree across the population (edges
representing parent-child relations). The tree then "falls
out" if we select the node with largest degree as the initial
"grandparent".
Note that 2c wasn't true for an additional sample that was given (with
solution) for testing, which used shorter DNA sequences. I didn't use
that data until I had a solution that worked with the full data set, so
this wasn't an issue (except that I had to add a final cleaning step,
described below).
120 ++------+-------+-------A-------+-------+-------+------++
+ + + A+ + 'distances'+ A +
| A AA |
100 ++ AAA ++
| AAAAAAAAA |
| AAAA A A |
| AAA AAA |
80 ++ A AA AAA ++
| AAA A AAA |
| AAA AA |
60 ++ AA AA ++
| A AAA |
| AAA AA |
40 ++ AAAA AA ++
| AAAA AAAA |
| AAAA AA |
| AAAA AAA |
20 ++ AA AAAAA ++
| AAA AAAAA |
+ AAAAAA + + + AAAAAA + +
0 +AAA-AAAAA------+-------+-------+-------AAAAAAA-AA-A---++
1850 1900 1950 2000 2050 2100 2150 2200
Figure: Distribution of distances for the final solution with DNA
sequences of size 10000. The cut-off used in searching was 3000, but no
values were found > 2169.
3. Basic Approach
The above points imply that a solution should be fairly easy, provided I
can find an efficient heuristic for generating "candidate neighbours".
Given that heuristic I can simply:
- Generate neighbour candidates
- Select "verified neighbours" (2a+c)
- Accumulate verified neighbours in a graph
- Repeat until I have the complete tree
And the obvious heuristic for finding neighbour candidates was to look for
common sections of DNA.
Before implementing this, however, I did a quick literature search (google
for things like "multidimentional neighbours"). It turns out that the
naive approach I was thinking of is pretty close to a known technique
using "Locality Preserving Hashing" (the DNA sections function as the
hashes).
So all I needed to do was to write the code.
4. Implementation
Whenever I start a project I am torn between using the tools I know or
starting from scratch with something more suited to the job. In this case
I thought Mercury (an efficient, typed Prolog kind-of-thing) (or perhaps
Oz/Mozart, which also has Prolog influences) would be a good fit.
But I was hoping to get things done that weekend, and I doubted I could do
so with a (relatively) new language. So I went with Python and SQL. I'm
not sure if that sounds like an odd combination or not - the idea was to
use SQL to do the heavy lifting (the wonderful SQLAlchemy library makes
calling SQL from Python a doddle). My main concern was whether the search
could be decoupled from anything related to graphs (because basic SQL
can't handle the recursion necessary to handle graphs nicely) - a little
thought suggested there would be no problem.
Some quick calculations showed that I wouldn't need the entire DNA to find
candidates. A chunk of just 100 bases should be enough (matching
sequences of size 6 bases) (these numbers come from basic probability
arguments given the numbers in section 1).
So my program does the following:
- Read 100 base sections of the population's DNA into a SQL table,
chopped into sections of 6 bases each (ie hashes of 6 bit
integers).
- Search for entries in that table (SQL select) that have a number
(typically 3) of chunks in common.
- For each candidate pair found the real separation (using the
entire DNA) is then calculated and verified neighbours added
to the graph.
This is repeated for several independent chunks until few new neighbours
are found. The remaining work cleans the graph:
- A direct search is made for neighbours to any missing nodes
(ie for any individual with no parent or child)
- If the graph is not connected, a direct search is made for
bridges to join the separate sub-graphs (for efficiency this
search starts with the nodes in the sub-graphs with largest
degree, since those are more likely to be "old" and so
connected to the main component).
- If the graph contains any cycles, the edges are ordered by
distance and removed one-by-one (so the most distant neighbours
are removed first) (only edges in a cycle are removed, and
this is not necessary for long DNA sequences, as discussed
above)
That leaves the graph as a (close approximation to the) minimal spanning
tree, as required.
5. Results
I have no way of knowing whether the final result is correct, but it is
consistent (no loops were found). The most likely source of error is the
choice of "grandparent", since there is no statistical guarantee that the
node with largest degree is indeed the root (and I can think of no better
criterion).
6. Performance
For the large data set, loading and querying the DNA sections requires
around 7 minutes processing by Postgres. I use 5 sets of sections and the
entire process takes just over 2.5 hours.
Most of the time not spent in Postgres is spent calculating differences
between DNA sequences (counting bits). This uses a lookup table of all 24
bit integers and their bit counts.
The Python profiler was useful in identifying hotspots - significant gains
came from doing things smarter (caching) or simpler (avoiding work).
Remaining sources of improvement might include tuning the various
parameters (size of sequences/hashes, etc), implementing critical sections
of code in C, and splitting tasks to run in parallel. I haven't explored
any optimisation in Postgres (except for specifying keys).
One odd detail I do not understand - while a query for 3 or 4 matches
works fine when searching for neighbours to a given individual, the query
for 4 matches over all individuals causes Postgres to hang (CPU time is
consumed, but no data are returned; this continues for at least 8 hours
and requires an engine reset to stop). The query for 3 matches (over all
individuals) does not give any problems. I should try looking at
"explain" output, but the query is large (constructed programmatically in
SQLAlchemy; the same code is used to construct all the queries described
so it seems unlikely to be a simple error in the SQL), and the resulting
plan complex...
Andrew
Appendix: SQL Query
This is the query that locks Postgres - it works with one less pair of
"chunk tables" (the code is generated via SQLAlchemy). I've added
comments to help explain what is happening:
SELECT DISTINCT src0.vector, dst0.vector
-- the chunk tables contain the hashes
FROM chunks_100_6_0 AS src0
JOIN chunks_100_6_0 AS dst0
-- a hash "lookup" is a match between two chunk table aliases
-- in which the "chunk" indexes the hash position along the
-- dna
ON src0.chunk = dst0.chunk
-- and the value matches
AND src0.value = dst0.value
-- and this constraint ties elsewhere to a constant that
-- avoids any "final bits" if the dna length is not an
-- exact multiple of the hash size
AND src0.bits = dst0.bits
-- this is the "standard sql trick" for indicating that we
-- want the corresponding values to be missing from another
-- table - do an outer join and assert the values null
-- (here we are excluding already-known pairs)
LEFT OUTER JOIN neighbours_100_6_3000 AS neighbours_100_6_3000_1
ON src0.vector = neighbours_100_6_3000_1.source
AND dst0.vector = neighbours_100_6_3000_1.destn
LEFT OUTER JOIN neighbours_100_6_3000 AS neighbours_100_6_3000_2
ON src0.vector = neighbours_100_6_3000_2.destn
AND dst0.vector = neighbours_100_6_3000_2.source,
-- and here are the rest of the hashes, one pair for each
chunks_100_6_0 AS src1
JOIN chunks_100_6_0 AS dst1
ON src1.chunk = dst1.chunk
AND src1.value = dst1.value
AND src1.bits = dst1.bits,
chunks_100_6_0 AS src3
JOIN chunks_100_6_0 AS dst3
ON src3.chunk = dst3.chunk
AND src3.value = dst3.value
AND src3.bits = dst3.bits,
chunks_100_6_0 AS src2
JOIN chunks_100_6_0 AS dst2
ON src2.chunk = dst2.chunk
AND src2.value = dst2.value
AND src2.bits = dst2.bits
-- the literal hash size specified here
-- (see comments above about final hash possibly being smaller)
WHERE src0.bits = %(bits_1)s
-- avoid (a,b) = (b,a) duplication
AND src0.vector < dst0.vector
-- the assertion described above - these are not already known
AND (neighbours_100_6_3000_1.source IS NULL
OR neighbours_100_6_3000_1.destn IS NULL)
AND (neighbours_100_6_3000_2.source IS NULL
OR neighbours_100_6_3000_2.destn IS NULL)
-- make sure all hashes are for same dna pair
AND src1.vector = src0.vector
AND dst1.vector = dst0.vector
AND src1.bits = src0.bits
-- and avoid permutations of the same selection of hashes
AND src1.chunk < src0.chunk
-- and repeat for each hash
AND src2.vector = src0.vector
AND dst2.vector = dst0.vector
AND src2.bits = src0.bits
AND src2.chunk < src1.chunk
AND src3.vector = src0.vector
AND dst3.vector = dst0.vector
AND src3.bits = src0.bits
AND src3.chunk < src2.chunk
ORDER BY src0.vector, dst0.vector
Loma Largo Quinteto - Fruity, Light and Chilean!
From: "andrew cooke" <andrew@...>
Date: Sat, 6 Sep 2008 13:56:22 -0400 (CLT)
When my parents were here last summer (Jan/Feb 2008) we went to a local wine shop to pick up a few bottles for my (belated) birthday (or maybe it was Christmas?). This was a step up from my normal supermarket wine shopping so I looked for unusual varieties and/or blends. The best of the bunch (by miles) has been the 2006 Loma Larga Quienteto, which is nothing like a typical Chilean wine - way more fruit, and much lighter that the usual Cab Sav. I opened it yesterday lunchtime (yes, still working, but it was a Friday...) and (not exactly regretfully) had to finish it this Saturday lunchtime (so that I can start a course of antibiotics). But what a way to start an alcohol-free period... It's a blend of Syrah, Cabernet Franc, Malbec, Merlot, and Cabernet Sauvignon. My only criticism is the price - 7000 pesos. Why can't Chile produce wine like this for at the same price as, say, a Don Matias (Cosineau Macul, about 3500)? We really need to diversify away from CS and Merlot..... http://www.lomalarga.com Andrew
Sun's VirtualBox v2
From: "andrew cooke" <andrew@...>
Date: Fri, 5 Sep 2008 20:17:38 -0400 (CLT)
Seems like this might be an easier way to run Solaris on opensuse (had a lot of frustrating pain with Xen a while back). http://www.sun.com/software/products/virtualbox/index.jsp http://dlc.sun.com/virtualbox/vboxdownload.html Andrew
Good Summary of Recent Spring Config Options
From: "andrew cooke" <andrew@...>
Date: Fri, 5 Sep 2008 12:09:45 -0400 (CLT)
http://www.habuma.com/spring/springcleaning.pdf Great set of slides that shows the different options available (I was going crazy trying to remember the "p" stuff...) Andrew
Launchpad - Open Source Projects Support/Hosting
From: "andrew cooke" <andrew@...>
Date: Fri, 5 Sep 2008 07:42:53 -0400 (CLT)
Not noticed this before. https://launchpad.net/ https://launchpad.net/+tour/index They host Drizzle - https://launchpad.net/drizzle (which I found while reading up on the exit of MySQL founder from Sun - not sure what that's all about) Oh, it's from Canonical (Ubuntu). http://www.canonical.com/ Supports Bazaar (with SVN and CVS continuous import) and exports APT (not RPM?). Andrew
Secure Remote Password Protocol (+ Python TLS)
From: "andrew cooke" <andrew@...>
Date: Fri, 5 Sep 2008 07:33:12 -0400 (CLT)
SRP is a protocol for securely verifying passwords without needing encryption (it uses hashes and temporary secrets). There's an implementation at http://srp.stanford.edu/ Documentation - http://srp.stanford.edu/doc.html White paper - http://srp.stanford.edu/ndss.html Download - http://srp.stanford.edu/download.html It doesn't seem to say anywhere, but the download appears to be a C library. It also contains some Java code that appears to implement a Telnet client with SRP support. Related, there's a pure Python library for TLS available here - http://trevp.net/ http://trevp.net/tlslite/ - along with extensions to use SRP within TLS - http://trevp.net/tls_srp/ Andrew
Good Analysis of Georgia Issues
From: "andrew cooke" <andrew@...>
Date: Thu, 4 Sep 2008 02:25:42 -0400 (CLT)
http://www.informationclearinghouse.info/article20673.htm Andrew
iBatis Error with Recursive Generics
From: "andrew cooke" <andrew@...>
Date: Wed, 3 Sep 2008 13:10:01 -0400 (CLT)
I have a base class for ORM that looks something like this:
public class Keyed<KeyType extends Comparable<KeyType>>
implements Comparable<Keyed<KeyType>>
{
private KeyType key;
protected final void setKey(final KeyType key) {...}
...
}
For some reason, iBatis does not like this, even thought it was fine (as
far as I can tell) with an earlier version that was also generic (but did
not include the recursive Comparable declaration).
The error when you try to map (sublasses of) this is given below (in full
for Google).
I tried tracking what was causing this, and it seems that the type handler
is not inferred correctly while reading the configuration. Rather than
digging further I found a simple workaround - specify the type handler by
hand.
So my maps now look like:
<sqlMap namespace="entity">
<resultMap id="entity-result" class="...Entity">
<result property="key" column="entity_id"
typeHandler="com.ibatis.sqlmap.engine.type.StringTypeHandler"/>
<result property="title" column="title"/>
<result property="description" column="description"/>
...
</resultMap>
...
</sqlMap>
Note the explicit typeHandler.
Andrew
03-Sep-2008 13:06:32 com.isti.kpidb.KpiDb getEntities
SEVERE: Failed to read Entities
com.ibatis.common.jdbc.exception.NestedSQLException:
--- The error occurred in maps/entity.xml.
--- The error occurred while applying a result map.
--- Check the entity.entity-result.
--- Check the result mapping for the 'key' property.
--- Cause: com.ibatis.sqlmap.client.SqlMapException: No type handler could
be found to map the property 'key' to the column 'entity_id'. One or both
of the types, or the combination of types is not supported.
at
com.ibatis.sqlmap.engine.mapping.statement.MappedStatement.executeQueryWithCallback(MappedStatement.java:204)
at
com.ibatis.sqlmap.engine.mapping.statement.MappedStatement.executeQueryForList(MappedStatement.java:139)
at
com.ibatis.sqlmap.engine.mapping.statement.CachingStatement.executeQueryForList(CachingStatement.java:97)
at
com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.queryForList(SqlMapExecutorDelegate.java:567)
at
com.ibatis.sqlmap.engine.impl.SqlMapExecutorDelegate.queryForList(SqlMapExecutorDelegate.java:541)
at
com.ibatis.sqlmap.engine.impl.SqlMapSessionImpl.queryForList(SqlMapSessionImpl.java:118)
at
com.ibatis.sqlmap.engine.impl.SqlMapClientImpl.queryForList(SqlMapClientImpl.java:94)
at com.isti.kpidb.KpiDb.queryForList(KpiDb.java:331)
at com.isti.kpidb.KpiDb.queryForList(KpiDb.java:311)
at com.isti.kpidb.KpiDb.getEntities(KpiDb.java:102)
at com.isti.kpidb.demo.ListEntities.execute(ListEntities.java:23)
at com.isti.kpidb.demo.Demo.processInput(Demo.java:125)
at com.isti.kpidb.demo.Demo.mainLoop(Demo.java:84)
at com.isti.kpidb.demo.Demo.<init>(Demo.java:58)
at com.isti.kpidb.demo.Demo.main(Demo.java:183)
Caused by: com.ibatis.sqlmap.client.SqlMapException: No type handler could
be found to map the property 'key' to the column 'entity_id'. One or both
of the types, or the combination of types is not supported.
at
com.ibatis.sqlmap.engine.mapping.result.ResultMap.getPrimitiveResultMappingValue(ResultMap.java:613)
at
com.ibatis.sqlmap.engine.mapping.result.ResultMap.getResults(ResultMap.java:345)
at
com.ibatis.sqlmap.engine.execution.SqlExecutor.handleResults(SqlExecutor.java:384)
at
com.ibatis.sqlmap.engine.execution.SqlExecutor.handleMultipleResults(SqlExecutor.java:300)
at
com.ibatis.sqlmap.engine.execution.SqlExecutor.executeQuery(SqlExecutor.java:189)
at
com.ibatis.sqlmap.engine.mapping.statement.MappedStatement.sqlExecuteQuery(MappedStatement.java:221)
at
com.ibatis.sqlmap.engine.mapping.statement.MappedStatement.executeQueryWithCallback(MappedStatement.java:189)
... 14 more
Google's Web Browser - Chrome
From: "andrew cooke" <andrew@...>
Date: Mon, 1 Sep 2008 16:45:15 -0400 (CLT)
http://blogoscoped.com.nyud.net/google-chrome http://www.google.com/chrome http://blogoscoped.com/archive/2008-09-01-n47.html Andrew
Plotting Data from Postgres
From: "andrew cooke" <andrew@...>
Date: Sun, 31 Aug 2008 13:16:04 -0400 (CLT)
If PL/R looks too complicated - http://www.varlena.com/GeneralBits/Tidbits/bernier/art_66/graphingWithR.html - then from google hits it looks like gnuplot is the best (ie most popular) option for plotting data in Postgres. In postgres: \t (do this only once - it toggles) \o plot.txt select count(*), distance from nearest_6_100 group by distance order by distance; \o In gnuplot: set term dumb 70 24 plot 'plot.txt' 120 ++--------A---------+----------+---------+---------+--------++ + + + + + 'plot.txt' A + | AA | 100 ++ AAAAA ++ | AAA AA | | AAA A A | | A AAA | 80 ++ AA AAAA ++ | A A AAA | | AAA AA | 60 ++ AAA A ++ | AAAA A | | AAA A A | 40 ++ AA AAA ++ | A AA | | AA AAA | | AA AA | 20 ++ AAA AAA ++ | AAA AAA A | +AAA + AAAAAA AA + A AAAAAAAAAA+A + 0 AAA-------+---------+-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA-----++ 0 100 200 300 400 500 600 Andrew