
Identifying Related DNA Sequences

From: "andrew cooke" <andrew@...>

Date: Sat, 6 Sep 2008 20:50:08 -0400 (CLT)

Last weekend I spent some time working on a "puzzle" I found on the net. 
I made some notes here but, since it was a recruitment question and I am
about to explain a possible answer, I've deleted those posts (I've also
changed the details slightly below so that I use a different vocabulary).


1. The Problem

The challenge was to define the relationships between members of a
population, given their "DNA".  To keep things simple we are told that:

- Reproduction is asexual: a single parent "gives birth" to a
  single offspring.
- The population started with a single individual.
- All individuals are immortal.
- The population grows in "steps".  At each step, an existing
  individual is chosen at random (ie uniform probability) to
  reproduce.
- The "DNA" is a sequence containing just two types of "bases",
  of a fixed (long) length.
- On reproduction, each base has a 20% chance of being "flipped".

So we can represent the DNA as a binary number (the original puzzle was
more explicit about this, but as I said I am purposefully avoiding their
vocabulary).

The data contained ten thousand individuals, each with a DNA containing
ten thousand bases.
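
These rules are enough to generate test data of the same shape.  A minimal
sketch in Python (the representation and names below are mine, not the
puzzle's):

import random

DNA_LENGTH = 10000    # bases per individual
POPULATION = 10000    # final number of individuals
FLIP = 0.2            # chance of each base flipping on reproduction

def reproduce(parent):
    # each base of the child has a 20% chance of being flipped
    return [b ^ (random.random() < FLIP) for b in parent]

def grow():
    population = [[random.randint(0, 1) for _ in range(DNA_LENGTH)]]
    while len(population) < POPULATION:
        # an existing individual is chosen uniformly at random to reproduce
        population.append(reproduce(random.choice(population)))
    return population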


2. Initial Observations

These are fairly obvious points, but they are critical to the solution I
implemented:

a - We can define a "distance" between any two DNA samples that is
    simply the number of different bases.
b - The population is so large that a direct search is going to be
    too slow (and boring).
c - For long DNA (eg many thousands of bases) the available "space"
    is *huge* (consider each base as choosing a direction along a
    distinct, orthogonal axis).  This means that we do not need
    to worry about accidental confusion - if two DNA sequences are
    close (ie differ by about 20%) then they are parent and child.
d - Most of the work in generating a "family tree" is constructing
    a minimal spanning tree across the population (edges
    representing parent-child relations).  The tree then "falls
    out" if we select the node with largest degree as the initial
    "grandparent".

Note that 2c wasn't true for an additional sample that was given (with
solution) for testing, which used shorter DNA sequences.  I didn't use
that data until I had a solution that worked with the full data set, so
this wasn't an issue (except that I had to add a final cleaning step,
described below).
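
To make observations 2a and 2c concrete: with DNA held as (long) Python
integers, the distance is just the popcount of the XOR, and a rough
calculation (my numbers) shows why accidental confusion is not a concern:

def distance(a, b):
    # observation 2a: the number of differing bases, ie the popcount of a XOR b
    return bin(a ^ b).count('1')

# observation 2c, roughly: a parent and child differ in about 20% of bases
# (~2000 of 10000); even a grandparent and grandchild differ in about
# 2 * 0.2 * 0.8 = 32% (~3200), and more distant pairs approach 50%.  So a
# cut-off of 3000 cleanly separates parent-child pairs from everything else.
CUTOFF = 3000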


120 ++------+-------+-------A-------+-------+-------+------++
    +       +       +      A+       +    'distances'+  A    +
    |                    A AA                               |
100 ++                      AAA                            ++
    |                   AAAAAAAAA                           |
    |                   AAAA A  A                           |
    |                   AAA    AAA                          |
 80 ++               A AA     AAA                          ++
    |                 AAA     A AAA                         |
    |                AAA         AA                         |
 60 ++              AA            AA                       ++
    |                 A           AAA                       |
    |              AAA              AA                      |
 40 ++            AAAA              AA                     ++
    |            AAAA               AAAA                    |
    |            AAAA                AA                     |
    |          AAAA                  AAA                    |
 20 ++         AA                     AAAAA                ++
    |         AAA                       AAAAA               |
    +    AAAAAA     +       +       +    AAAAAA     +       +
  0 +AAA-AAAAA------+-------+-------+-------AAAAAAA-AA-A---++
   1850    1900    1950    2000    2050    2100    2150    2200

Figure: Distribution of distances for the final solution with DNA
sequences of size 10000.  The cut-off used in searching was 3000, but no
distances greater than 2169 were found.


3. Basic Approach

The above points imply that a solution should be fairly easy, provided I
can find an efficient heuristic for generating "candidate neighbours". 
Given that heuristic I can simply:

- Generate neighbour candidates
- Select "verified neighbours" (2a+c)
- Accumulate verified neighbours in a graph
- Repeat until I have the complete tree

And the obvious heuristic for finding neighbour candidates was to look for
common sections of DNA.

Before implementing this, however, I did a quick literature search (google
for things like "multidimentional neighbours").  It turns out that the
naive approach I was thinking of is pretty close to a known technique
using "Locality Preserving Hashing" (the DNA sections function as the
hashes).

So all I needed to do was to write the code.
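
In outline, the heuristic amounts to bucketing short, position-aligned
chunks of DNA and treating any pair of individuals that shares several
buckets as a candidate.  A rough in-memory sketch (the names and thresholds
are illustrative; the real implementation, described next, pushes this work
into SQL):

from collections import defaultdict
from itertools import combinations

CHUNK = 6      # bases per hash
MATCHES = 3    # shared hashes needed for a pair to become a candidate

def candidates(section):
    # section maps individual id -> a short slice (eg 100 bases) of its DNA
    buckets = defaultdict(list)      # (chunk position, chunk value) -> ids
    for ident, bases in section.items():
        for pos in range(0, len(bases) - CHUNK + 1, CHUNK):
            buckets[(pos, tuple(bases[pos:pos + CHUNK]))].append(ident)
    shared = defaultdict(int)        # (id a, id b) -> number of shared hashes
    for ids in buckets.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1
    # candidates are then verified against the full DNA (observation 2a)
    return [pair for pair, n in shared.items() if n >= MATCHES]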


4. Implementation

Whenever I start a project I am torn between using the tools I know or
starting from scratch with something more suited to the job.  In this case
I thought Mercury (an efficient, typed Prolog kind-of-thing) (or perhaps
Oz/Mozart, which also has Prolog influences) would be a good fit.

But I was hoping to get things done that weekend, and I doubted I could do
so with a (relatively) new language.  So I went with Python and SQL.  I'm
not sure if that sounds like an odd combination or not - the idea was to
use SQL to do the heavy lifting (the wonderful SQLAlchemy library makes
calling SQL from Python a doddle).  My main concern was whether the search
could be decoupled from anything related to graphs (because basic SQL
can't express the recursion needed to traverse graphs nicely) - a little
thought suggested there would be no problem.

Some quick calculations showed that I wouldn't need the entire DNA to find
candidates.  A section of just 100 bases, matched via sequences of 6
bases, should be enough (these numbers follow from basic probability
arguments given the figures in section 1).
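
A back-of-the-envelope version of that argument (my numbers): a 6 base
chunk survives reproduction unchanged with probability 0.8^6, so a 100 base
section gives a true parent-child pair about four position-aligned matching
chunks on average, while an unrelated pair rarely manages three:

from math import comb

FLIP, SECTION, CHUNK, NEEDED = 0.2, 100, 6, 3
N = SECTION // CHUNK                      # 16 complete chunks per section

def p_at_least(k, n, p):
    # probability of k or more successes in n independent trials
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_related = (1 - FLIP) ** CHUNK           # chunk unchanged between parent and child
p_random = 0.5 ** CHUNK                   # chunk equal by chance in distant pairs

print(p_at_least(NEEDED, N, p_related))   # ~0.83: most true pairs become candidates
print(p_at_least(NEEDED, N, p_random))    # ~0.002: chance candidates are rare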

So my program does the following:

- Read 100 base sections of the population's DNA into a SQL table,
  chopped into chunks of 6 bases each (ie hashes stored as 6 bit
  integers) - see the sketch after this list.
- Search that table (SQL select) for pairs of individuals that have
  a number (typically 3) of chunks in common.
- For each candidate pair found, calculate the real separation
  (using the entire DNA) and add verified neighbours to the graph.
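
The column names below follow the query in the appendix; the loader itself
is only a sketch of what the SQLAlchemy side might look like (the
connection URL and helper function are illustrative):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer

engine = create_engine('postgresql:///dna')          # illustrative URL
metadata = MetaData()

# one row per 6 base chunk of a 100 base section of an individual's DNA
chunks = Table('chunks_100_6_0', metadata,
               Column('vector', Integer),            # individual id
               Column('chunk', Integer),             # position of the hash
               Column('value', Integer),             # the 6 bases as an integer
               Column('bits', Integer))              # 6, or less for a final chunk
metadata.create_all(engine)

def load_section(conn, vector, bases, size=6):
    rows = []
    for pos in range(0, len(bases), size):
        piece = bases[pos:pos + size]
        rows.append({'vector': vector,
                     'chunk': pos // size,
                     'value': int(''.join(map(str, piece)), 2),
                     'bits': len(piece)})
    conn.execute(chunks.insert(), rows)

The candidate search then joins this table to itself on chunk and value, as
in the query given in the appendix.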

The search is repeated for several independent 100 base sections until few
new neighbours are found.  The remaining work cleans the graph:

- A direct search is made for neighbours to any missing nodes
  (ie for any individual with no parent or child)
- If the graph is not connected, a direct search is made for
  bridges to join the separate sub-graphs (for efficiency this
  search starts with the nodes in the sub-graphs with largest
  degree, since those are more likely to be "old" and so
  connected to the main component).
- If the graph contains any cycles, the edges are ordered by
  distance and removed one-by-one, so the most distant neighbours
  are removed first (only edges that lie on a cycle are removed;
  as discussed above, this step is not needed for long DNA
  sequences - see the sketch below).

That leaves the graph as a (close approximation to the) minimal spanning
tree, as required.
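
The cycle-removal step is essentially a "reverse delete": a sketch using
networkx to show the idea:

import networkx as nx

def remove_cycles(graph):
    # consider the most distant neighbours first and drop an edge only if it
    # lies on a cycle, ie its endpoints remain connected without it
    for a, b, d in sorted(graph.edges(data='distance'),
                          key=lambda edge: edge[2], reverse=True):
        graph.remove_edge(a, b)
        if not nx.has_path(graph, a, b):
            graph.add_edge(a, b, distance=d)     # it was a bridge: keep it
    return graph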


5. Results

I have no way of knowing whether the final result is correct, but it is
consistent (no loops were found).  The most likely source of error is the
choice of "grandparent", since there is no statistical guarantee that the
node with largest degree is indeed the root (and I can think of no better
criterion).
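
Reading the family tree off the spanning tree is then just a traversal from
that node; a sketch, again using networkx:

import networkx as nx

def family_tree(tree):
    # observation 2d: take the node of largest degree as the "grandparent";
    # bfs_edges then yields every edge oriented parent -> child
    root = max(tree.nodes, key=tree.degree)
    return root, list(nx.bfs_edges(tree, root))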


6. Performance

For the large data set, loading and querying the DNA sections requires
around 7 minutes processing by Postgres.  I use 5 sets of sections and the
entire process takes just over 2.5 hours.

Most of the time not spent in Postgres is spent calculating differences
between DNA sequences (counting bits).  This uses a lookup table of all 24
bit integers and their bit counts.
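
Roughly what that looks like - a sketch of the table-based version of the
distance function (the table of all 24 bit counts is large, but built only
once):

# the number of set bits in every 24 bit integer, computed once
BIT_COUNTS = [bin(i).count('1') for i in range(1 << 24)]

def distance(a, b):
    # difference between two DNA values held as Python integers: XOR, then
    # sum the popcounts of successive 24 bit slices via the lookup table
    x, total = a ^ b, 0
    while x:
        total += BIT_COUNTS[x & 0xffffff]
        x >>= 24
    return total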

The Python profiler was useful in identifying hotspots - significant gains
came from doing things smarter (caching) or simpler (avoiding work). 
Remaining sources of improvement might include tuning the various
parameters (size of sequences/hashes, etc), implementing critical sections
of code in C, and splitting tasks to run in parallel.  I haven't explored
any optimisation in Postgres (except for specifying keys).

One odd detail I do not understand - while a query for 3 or 4 matches
works fine when searching for neighbours to a given individual, the query
for 4 matches over all individuals causes Postgres to hang (CPU time is
consumed, but no data are returned; this continues for at least 8 hours
and the database engine must be restarted to stop it).  The query for 3
matches (over all individuals) does not give any problems.  I should try
looking at "explain" output, but the query is large (it is constructed
programmatically in SQLAlchemy, and the same code constructs all the
queries described here, so a simple error in the SQL seems unlikely) and
the resulting plan is complex...

Andrew


Appendix: SQL Query

This is the query that hangs Postgres - it works with one fewer pair of
"chunk tables" (the code is generated via SQLAlchemy).  I've added
comments to help explain what is happening:

SELECT DISTINCT src0.vector, dst0.vector
    -- the chunk tables contain the hashes
  FROM chunks_100_6_0 AS src0
  JOIN chunks_100_6_0 AS dst0
    -- a hash "lookup" is a match between two chunk table aliases
    -- in which the "chunk" indexes the hash position along the
    -- dna
   ON src0.chunk = dst0.chunk
    -- and the value matches
   AND src0.value = dst0.value
    -- and this constraint, together with the constant in the
    -- WHERE clause below, avoids any "final bits" if the dna
    -- length is not an exact multiple of the hash size
   AND src0.bits = dst0.bits
    -- this is the "standard sql trick" for indicating that we
    -- want the corresponding values to be missing from another
    -- table - do an outer join and assert the values null
    -- (here we are excluding already-known pairs)
  LEFT OUTER JOIN neighbours_100_6_3000 AS neighbours_100_6_3000_1
    ON src0.vector = neighbours_100_6_3000_1.source
   AND dst0.vector = neighbours_100_6_3000_1.destn
  LEFT OUTER JOIN neighbours_100_6_3000 AS neighbours_100_6_3000_2
    ON src0.vector = neighbours_100_6_3000_2.destn
   AND dst0.vector = neighbours_100_6_3000_2.source,
    -- and here are the rest of the hashes, one pair of aliases
    -- for each additional chunk that must match
       chunks_100_6_0 AS src1
  JOIN chunks_100_6_0 AS dst1
    ON src1.chunk = dst1.chunk
   AND src1.value = dst1.value
   AND src1.bits = dst1.bits,
       chunks_100_6_0 AS src3
  JOIN chunks_100_6_0 AS dst3
    ON src3.chunk = dst3.chunk
   AND src3.value = dst3.value
   AND src3.bits = dst3.bits,
       chunks_100_6_0 AS src2
  JOIN chunks_100_6_0 AS dst2
    ON src2.chunk = dst2.chunk
   AND src2.value = dst2.value
   AND src2.bits = dst2.bits
    -- the literal hash size specified here
    -- (see comments above about final hash possibly being smaller)
 WHERE src0.bits = %(bits_1)s
    -- avoid (a,b) = (b,a) duplication
   AND src0.vector < dst0.vector
    -- the assertion described above - these are not already known
   AND (neighbours_100_6_3000_1.source IS NULL
        OR neighbours_100_6_3000_1.destn IS NULL)
   AND (neighbours_100_6_3000_2.source IS NULL
        OR neighbours_100_6_3000_2.destn IS NULL)
    -- make sure all hashes are for same dna pair
   AND src1.vector = src0.vector
   AND dst1.vector = dst0.vector
   AND src1.bits = src0.bits
    -- and avoid permutations of the same selection of hashes
   AND src1.chunk < src0.chunk
    -- and repeat for each hash
   AND src2.vector = src0.vector
   AND dst2.vector = dst0.vector
   AND src2.bits = src0.bits
   AND src2.chunk < src1.chunk
   AND src3.vector = src0.vector
   AND dst3.vector = dst0.vector
   AND src3.bits = src0.bits
   AND src3.chunk < src2.chunk
 ORDER BY src0.vector, dst0.vector

Updated Timing

From: "andrew cooke" <andrew@...>

Date: Sun, 7 Sep 2008 09:15:55 -0400 (CLT)

The timing above is wrong (I'm not sure where it came from - the first
run took around 6 hours, I believe).  Anyway, the current time required
for a full solution is 5652s (1h34m) on my server (a year-old Core2Duo).

More Efficient Search Parameters: 30min

From: "andrew cooke" <andrew@...>

Date: Sun, 7 Sep 2008 16:13:56 -0400 (CLT)

A little playing around with parameters has reduced the running time to
30mins.  There's a trade-off between the expense (combinatorial
complexity) of the search and the number of candidate neighbours
considered.  Increasing the size of DNA fragments allows for more matches
(but increases SQL search times); increasing the number of bases in a hash
gives more accurate matches (reduces the number of candidates).

By moving from 100 to 200 base sections of DNA, and using 8 base (bit)
hashes rather than 6, I can find a solution using 4 iterations (4 different
200 base sections) that initially finds one valid neighbour in every 2
candidates (things get worse when looking for the final pairs, obviously).
That compares with 5 iterations, and 1 valid neighbour in every 100
candidates, for the initial configuration.
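
The same back-of-the-envelope arithmetic as in section 4 shows why this
helps (my numbers): both configurations give a true parent-child pair about
four matching chunks on average, but the longer hashes sharply reduce the
matches that occur by chance:

for section, chunk in ((100, 6), (200, 8)):
    n = section // chunk                  # complete chunks per section
    related = n * 0.8 ** chunk            # expected matches, parent and child
    chance = n * 0.5 ** chunk             # expected matches, unrelated pair
    print(section, chunk, round(related, 2), round(chance, 2))

# 100/6: about 4.2 true matching chunks on average vs 0.25 by chance
# 200/8: about 4.2 true matching chunks on average vs 0.10 by chance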

Andrew

Update

From: "andrew cooke" <andrew@...>

Date: Sun, 21 Sep 2008 20:06:28 -0400 (CLT)

I implemented this in Java and it runs in under 10 seconds (same results).
See more info at http://www.acooke.org/cute/MatchingDN0.html

Andrew
