Andrew Cooke | Contents | RSS | Twitter | Previous
From: andrew cooke <andrew@...>
Date: Wed, 28 Jul 2010 10:16:53 -0400
There's a nice algorithm for auto-scaling axes, called the "nice number algorithm", written by Paul Heckbert and published in "Graphics Gems" - http://books.google.com/books?id=fvA7zLEFWZgC&pg=PA61&lpg=PA61&dq=nice+numbers+graphics+gems&source=bl&ots=7LdCq3nI-j&sig=L8qoZ8l_a95KAtHmMjagJ8cC0U0&hl=en&ei=KDhQTKLwGcT48AbTsvnEAQ&sa=X&oi=book_result&ct=result&resnum=2&ved=0CBYQ6AEwAQ#v=onepage&q&f=false The routines below implement this, but are parameterised over the number base used, so can also be used for axes based on units that repeat over multiples of 12, 60, or any other value. from calendar import timegm from math import floor, log, log10, ceil from time import gmtime # These allow the use with base 10, 12 and 60: LIM10 = (10, [(1.5, 1), (3, 2), (7, 5)], [1, 2, 5]) LIM12 = (12, [(1.5, 1), (3, 2), (8, 6)], [1, 2, 6]) LIM60 = (60, [(1.5, 1), (20, 15), (40, 30)], [1, 15, 40]) def heckbert_d(lo, hi, ntick=5, limits=None): ''' Calculate the step size. ''' if limits is None: limits = LIM10 (base, rfs, fs) = limits def nicenum(x, round): step = base ** floor(log(x)/log(base)) f = float(x) / step nf = base if round: for (a, b) in rfs: if f < a: nf = b break else: for a in fs: if f <= a: nf = a break return nf * step delta = nicenum(hi-lo, False) return nicenum(delta / (ntick-1), True) def heckbert(lo, hi, ntick=5, limits=None): ''' Calculate the axes lables. ''' def _heckbert(): d = heckbert_d(lo, hi, ntick=ntick, limits=limits) graphlo = floor(lo / d) * d graphhi = ceil(hi / d) * d fmt = '%' + '.%df' % max(-floor(log10(d)), 0) value = graphlo while value < graphhi + 0.5*d: yield fmt % value value += d return list(_heckbert()) This can then be used with a range of seconds as follows: def autoscale_time(start, end): ''' Yields a sequence of epochs that are nicely spaced. start and end are Unix epochs. ''' time_chunks = [('days', 3 * 24 * 60 * 60, 24 * 60 * 60, 2, None), ('hours', 3 * 60 * 60, 60 * 60, 3, LIM12), ('minutes', 3 * 60, 60, 4, LIM60), ('seconds', 0, 1, 5, LIM60)] for (name, limit, secs, sindex, limits) in self.time_chunks: if (end - start) > limit: break d = heckbert_d(start / secs, end / secs, limits=limits) # zero out lower steps, so that we get a starting date that's an # integral number of units stime = list(gmtime(start)) for i in range(sindex, 9): stime[i] = 0 # generate a sequence of epochs (cannot use the usual heckbert routine # because formatting will be different) value = timegm(stime) while value <= end: if value >= start: yield value value += d * secs This could be extended further by: - having different formats in the time_chunks parameter, so that different intervals are formatted differently - adding months etc. This would require changing the "secs" increment to be a timedelta and working with datetime instances rather than epochs (because months are not all equally sized). NOTE: The code above is cut + pasted from some working code and is not tested in its existing form; I may have introduced a bug somewhere, but hopefully this illustrates the idea. Andrew
Permalink | Comment on this post
For comments, see relevant pages (permalinks).
From: andrew cooke <andrew@...>
Date: Tue, 27 Jul 2010 11:59:16 -0400
Subversion drives me crazy in how it sets file permissions (keeping whatever
permission teh file originally had, and overwriting local changes on update).
Turns out that for the executable bit, you can set a special property that
will get the correct behaviour. For example:
find . -name "*.sh" -exec svn propset svn:executable ON \{} \;
Peace.... :o)
Andrew
From: andrew cooke <andrew@...>
Date: Mon, 26 Jul 2010 00:40:07 -0400
YUI 3 is amazing. It looks terrifyingly complex, but once you get into it,
you can do complex things trivially. I use jQuery at work, and in comparison,
YUI 3 feels like it was written by software engineers rather than people
hacking web pages.
For example, here's how to make a menu that shows as a tab on the right-hand
side of the page that "zooms in" (to the left) when you mouse-over.
First, place the menu in a div in the HTML:
<div class="menu" id="menu-1">
<img src="menu-icon-small.png"/>
<ul>
<li>Foo</li>
<li>Bar</li>
<li>Foo</li>
<li>Bar</li>
</ul>
</div>
Next, add some CSS (probably ina separate stylsheet) so that the icon is on
the left, and will be visible, but the rest of the div is off the page to the
right:
ul {
margin: 10px;
margin-left: 40px;
}
li {
list-style-type: none;
list-style-position: outside;
margin-top: 4px;
margin-bottom: 4px;
}
div.menu {
position: fixed;
right: -278px;
width: 300px;
border: 1px solid #c0c0c0;
border-right: none;
-moz-border-radius: 4px;
background: white;
}
div#menu-1 {
top: 50px;
}
div.menu img {
margin: 1px;
float: left;
}
Import the YUI magic (in the HTML header).
<script type="text/javascript"
src="http://yui.yahooapis.com/combo?3.1.1/build/yui/yui-min.js"></script>
and define the animation:
<script type="text/javascript">
YUI().use('event-mouseenter', 'console', 'anim', function(Y) {
/* new Y.Console().render(); */
var open = new Y.Anim({node: '#menu-1', to: {right: 0},
easing: Y.Easing.easeOut});
var close = new Y.Anim({node: '#menu-1', to: {right: -278},
easing: Y.Easing.easeIn});
Y.on("mouseenter", function (e) {open.run();}, "#menu-1");
Y.on("mouseleave", function (e) {close.run();}, "#menu-1");
});
</script>
and that's it!
The script above sets up the events and defines animations for the div so that
it moves into and out of sight. The "Easing" part even makes it move in a
"natural" manner (it slows down as it extends fully).
Andrew
From: andrew cooke <andrew@...>
Date: Fri, 23 Jul 2010 10:07:27 -0400
There's not much love for Python 2.4 and SVG (which, I admit is something of
an odd combination - who would be so conservative they would use Python 2.4
and then require SVG?), particularly if you need a non-GPL solution. So I
ended up wiring a simple wrapper around DOM that helps generate the XML.
Here's a summary of the approach (there's nothing really hard here, just
fiddly DOM details):
class SvgBase(object):
'''
Support class. Contains a reference to the DOM document and
the element corresponding to this node. Note that we allow any
node to be extended with any child node; we also allow any
attribute to be added.
'''
SVG = 'http://www.w3.org/2000/svg'
def __init__(self, doc, element, **attrs):
self._doc = doc
self._element = element
for name in attrs:
value = attrs[name]
if value is not None:
self._element.setAttributeNS(self.SVG, name, str(value))
def line(self, (x1, y1), (x2, y2), **attrs):
'''
Add a child line
'''
return Line(self, (x1, y1), (x2, y2), **attrs)
# more child metods here.....
class Svg(SvgBase):
'''
The root svg element. This creates the document, allows a
stylesheet to be used, etc.
'''
def __init__(self, version='1.1', width=None, height=None):
implementation = getDOMImplementation('')
doctype = implementation.createDocumentType('svg',
'-//W3C//DTD SVG 1.1//EN',
'http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd')
document = implementation.createDocument(self.SVG, 'svg', doctype)
for node in document.childNodes:
if node.nodeType == Node.ELEMENT_NODE:
element = node
break
# xlink lets "a" element work correctly (use xlink:href)
ns = {'xmlns': self.SVG,
'xmlns:xlink': 'http://www.w3.org/1999/xlink'}
super(Svg, self).__init__(document, element,
version=version, width=width, height=height,
**ns)
def add_stylesheet(self, url):
style = self._doc.createProcessingInstruction('xml-stylesheet',
'href="%s" type="text/css"' % url)
for node in self._doc.childNodes:
if node.nodeType == Node.DOCUMENT_TYPE_NODE:
self._doc.insertBefore(style, node)
break
def __str__(self):
'''
The XML of the entire document.
'''
return self._doc.toxml()
class Line(SvgBase):
'''
An example child node.
'''
def __init__(self, parent, (x1, y1), (x2, y2), **attrs):
element = parent._doc.createElementNS(self.SVG, 'line')
super(Line, self).__init__(parent._doc, element,
x1=self._1dp(x1), y1=self._1dp(y1),
x2=self._1dp(x2), y2=self._1dp(y2),
**attrs)
parent._element.appendChild(self._element)
With that, it's pretty easy to do things like:
>>> svg = Svg()
>>> svg.line((1,2), (3,4), stroke='blue')
>>> svg.add_stylesheet('http://example.com/foo')
>>> str(svg)
'''<?xml version="1.0" ?>
<?xml-stylesheet href="http://example.com/foo" type="text/css"?>
<!DOCTYPE svg PUBLIC '-//W3C//DTD SVG 1.1//EN'
'http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd'>
<svg version="1.1" xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink">
<line x1="1.0" x2="3.0" y1="2.0" y2="4.0" stroke="blue"/>
</svg>'''
Andrew
From: andrew cooke <andrew@...>
Date: Sat, 17 Jul 2010 16:48:30 -0400
Much of this is just duplicating the re2 work, and the rest is RXPY-specific,
but even so, these are quite interesting (for me, at least).
Note that the times are relative to the Python re package (log10 in brackets).
Also, the plots are both linear ### and logarithmic [ ].
First, some basic matching. This is just showing how slow my code is
(hundreds of times slower than the re package).
Match a(.)c against abc
Backtracking [########## ] 95 [2.0]
Parallel wide [####################### ] 206 [2.3]
Parallel wide, hashed [#################################### ] 315 [2.5]
Parallel seq [####################### ] 205 [2.3]
Parallel seq, hashed [################################### ] 304 [2.5]
Parallel beam [############################ ] 245 [2.4]
Parallel beam, hashed [######################################] 339 [2.5]
Match (a)b(?<=(?(1)b|x))(c) against abc
Backtracking [############ ] 210 [2.3]
Parallel wide [######################### ] 398 [2.6]
Parallel wide, hashed [################################# ] 531 [2.7]
Parallel seq [######################### ] 400 [2.6]
Parallel seq, hashed [################################## ] 546 [2.7]
Parallel beam [############################## ] 480 [2.7]
Parallel beam, hashed [######################################] 622 [2.8]
Match a*b against a^100b
Backtracking [#################################### ] 1902 [3.3]
Parallel wide [################################# ] 1737 [3.2]
Parallel wide, hashed [#################################### ] 1919 [3.3]
Parallel seq [################################ ] 1733 [3.2]
Parallel seq, hashed [#################################### ] 1919 [3.3]
Parallel beam [################################### ] 1884 [3.3]
Parallel beam, hashed [######################################] 2073 [3.3]
Match .*b against a^100b
Backtracking [################# ] 1902 [3.3]
Parallel wide [############### ] 1769 [3.2]
Parallel wide, hashed [################# ] 1927 [3.3]
Parallel seq [############### ] 1757 [3.2]
Parallel seq, hashed [################## ] 2015 [3.3]
Parallel beam [################################### ] 3825 [3.6]
Parallel beam, hashed [######################################] 4300 [3.6]
Above is quite interesting because the beam search is suddenly twice as slow,
just by changing from "a*b" to ".*b". That's because the second expression
has a failure ("." matchs the final "b") and so needs to backtrack. Because
the initial beam width is 1 that means that the entire search must be repeated
with doubled with.
The next test is a search.
Search .*b against a^100b
Backtracking [ ] 1172 [3.1]
Parallel wide [######################################] 41841 [4.6]
Parallel wide, hashed [ ] 1459 [3.2]
Parallel seq [ ] 1212 [3.1]
Parallel seq, hashed [ ] 1338 [3.1]
Parallel beam [### ] 3764 [3.6]
Parallel beam, hashed [### ] 3688 [3.6]
Above shows the problem with a naive wide parallel approach - there are an
awful lot of different states to store. Of course, most are duplicates so the
hashing removes the problem (as does sequential or beam search).
The next tests show the exponential explosion problem in Python's matcher.
This is a little more difficult to trigger than in the re2 paper because
Python has a "shortcut" (hack!) that avoids the problem in very simple cases
(but, as we will see below, does nothing to address the underlying issues).
Match (?:a|b)?^2n(?:ab)^n against (ab)^n for n=4
Backtracking [############# ] 573 [2.8]
Parallel wide [######### ] 392 [2.6]
Parallel wide, hashed [## ] 125 [2.1]
Parallel seq [######### ] 398 [2.6]
Parallel seq, hashed [## ] 125 [2.1]
Parallel beam [######################################] 1606 [3.2]
Parallel beam, hashed [######## ] 360 [2.6]
Match (?:a|b)?^2n(?:ab)^n against (ab)^n for n=6
Backtracking [######################################] 637 [2.8]
Parallel wide [############################# ] 476 [2.7]
Parallel wide, hashed [# ] 27 [1.4]
Parallel seq [############################# ] 475 [2.7]
Parallel seq, hashed [# ] 27 [1.4]
Parallel beam, hashed [## ] 51 [1.7]
Match (?:a|b)?^2n(?:ab)^n against (ab)^n for n=8
Parallel wide, hashed [############## ] 4 [0.6]
Parallel seq, hashed [############## ] 4 [0.6]
Parallel beam, hashed [######################################] 10 [1.0]
As the number of terms increase, only the hashed approaches are efficient (due
to the explosion in the number of states). This is as expected. What is nice
to see is that by n=8 my Python code is oly 4x slower than the re package!
This is because the re backtracker is "exploding", while my code is linear
(but slow).
This is only possible, though, when we can discard duplicate state. Which
isn't so easy when matching groups:
Match (a|b)?^2n(?:ab)^n against (ab)^n for n=4
Backtracking [####### ] 481 [2.7]
Parallel wide [##### ] 378 [2.6]
Parallel wide, hashed [######### ] 627 [2.8]
Parallel seq [###### ] 414 [2.6]
Parallel seq, hashed [########## ] 667 [2.8]
Parallel beam [########################## ] 1612 [3.2]
Parallel beam, hashed [######################################] 2415 [3.4]
Match (a|b)?^2n(?:ab)^n against (ab)^n for n=6
Backtracking [########################### ] 512 [2.7]
Parallel wide [###################### ] 415 [2.6]
Parallel wide, hashed [######################################] 723 [2.9]
Parallel seq [###################### ] 418 [2.6]
Parallel seq, hashed [######################################] 714 [2.9]
And here we do no better than normal backtracking.
The main conclusions, then, are:
1 - Python's re package does suffer from the exponential "explosion" issue,
while my code (with duplicate state elimination) does not.
2 - My code can also handle groups, but the degrades in performance as
expected.
So RXPY is both general and, when possible, efficient. Even if it is terribly
slow.
A secondary conclusion is that the beam search approach does not seem to be
worth the trouble.
Andrew
From: andrew cooke <andrew@...>
Date: Fri, 16 Jul 2010 14:07:59 -0400
A short update on the pure-Python RXPY regular expression engine. I just got the beam-width engine working. This uses the "re2" approach, in a very general sense, but differs in that: - all state is tracked (so nothing is excluded from regular expressions) - the maximum number of states is restricted using a "beam search" approach The aim has been to investigate different high-level, full featured approaches, before worrying about low-level optimisations. And the latest results are pretty cool - all my stress tests are handled nicely by the beam search approach. I'm sure I could construct problem cases, but this is already better than a backtracking approach. Next step is to define a set of benchmarks and formally compare the different approaches. After that, with uncertain priority, I need to (a) start discarding features for speed and (b) look at pypy. I was going to end there, but realised I haven't explained what I mean by "beam search". The idea is that "re2" accumulates different states, effectively doing NFA->DFA conversion "on the fly", but that the number of states is restricted (to 1,2,4,8... states). This reduces the time spent searching for (possibly exponential?) alternatives in most normal cases. Andrew
From: andrew cooke <andrew@...>
Date: Thu, 15 Jul 2010 07:22:27 -0400
The frequency variation of the mains supply can be used to validate and date audio recordings: http://www.theregister.co.uk/2010/06/01/enf_met_police/ Andrew
From: andrew cooke <andrew@...>
Date: Wed, 14 Jul 2010 20:06:45 -0400
"Among the most damning documents are a series of interrogation reports from MI5 officers that betray their disregard for the suffering of a British resident whom they were questioning at a US airbase in Afghanistan. The documents also show that the officers were content to see the mistreatment continue." http://www.guardian.co.uk/law/2010/jul/14/torture-classified-documents-disclosed Andrew
From: andrew@...
Date: Fri, 9 Jul 2010 18:13:57 -0400
This is a short summary of Cloud Computing and various related ("trendy")
technologies.
The driving force for much of these ideas seems to have come from Amazon and
Google as they have dealt with scaling their businesses (it's strange how much
Amazon is involved in this, rather than Microsoft, for example). The simple
idea behind Cloud Computing is that computers, and the basic functionality
they provide, are becoming "comoditized". The usual comparison is with an
Electric Company, that supplies electricity. Just as you don't care exactly
what generator is used, so you shouldn't need to care what hardware is used to
run your software.
There are two main approaches to this, one taken by Amazon, and the other by
Google.
Amazon's approach is "Infrastructure as a Service" (Amazon Web Services,
AWS). By that, they mean that they provide basic infrastructure as the
"commodity". So you can configure a "machine" and then run N instances, where
N varies depending on the current load. Similarly, you can access "databases"
and "permanent stores". So this is very like "normal" computing, except that
you have services that are "elastic" - they scale as required, typically by
adding instances.
Google's approach is "Platform as a Service". By that, they mean that they
offer a set of APIs that you can use to write a webservice, and then they will
run that service for you, scaling as necessary. So you work at a higher level
of abstraction: you write code that handles a single thread; Google will make
sure that as many threads as necessary run. For data storage, you save data to
a kind-of database (more like a hash table) and you don't care about the lower
level details.
So Google's "App Engine" approach is easier to use (although you need to learn
a whole new API in either Python or Java), but more limiting. In very simple
terms, Amazon's approach is like owning an expandable set of virtual machines
while Google's approach is more like having an infinitely powerful single
machine. One cool detail about Google's approach is that, for small services,
it's free.
And of course there are many more providers, generally doing things like
Amazon (partly because no-one apart from Google is big enough that it's worth
learning a new API just to use their service).
Now, tangentially related to that, are the spin-off technologies. Two are
particularly important, and are related: MapReduce and NoSQL.
To understand both, and how they are related, it's probably easiest to start
with Google. As you can imagine, Google's web search is a huge database spread
across many machines. And to run a search they need to run tasks on each
machine and then collect the results. This process is called MapReduce: "map"
is related to dividing the work out, doing the same thing on various machines;
"reduce" is the process of collecting the results together.
Now MapReduce doesn't make much sense without data. So the flip side of
MapReduce is the distributed data. And that implies some kind of distributed
database. It turns out that databases can do 2 out of 3 things well, where the
3 things are: availability; consistency; partitions. SQL databases are "AC",
but aren't so good at P. Spreading data across various machines requires
partitioning (P), so new technologies have appeared that are either "AP"
(don't always give consistent results) or "CP" (are not guaranteed
available). These tend not to use SQL and are collectively called "NoSQL".
Curiously, Google, who invented MapReduce, don't actually let the public use
it. However, Amazon do support it as a service, and there are many 3rd party
solutions that you can also run yourself (most famously, CouchDB, MongoDB,
HBase and Cassandra). All of these support MapReduce implementations. And
Google's App Engine (its PaaS - see above) does have support for something
similar, but it's not the same MapReduce that Google use internally.
Andrew
From: andrew cooke <andrew@...>
Date: Fri, 9 Jul 2010 09:56:09 -0400
http://www.hoopoe-cloud.com/ http://ycumulus.ca/ (but these people never replied to my email) Andrew
From: andrew cooke <andrew@...>
Date: Thu, 8 Jul 2010 11:00:16 -0400
http://hunch.com/nosql-data-stores/ From Suman Karthik on Quora http://www.quora.com/What-are-the-various-methods-to-overcome-the-transactions-debits-credits-bottleneck-in-a-DB-centric-game-system Andrew
From: andrew cooke <andrew@...>
Date: Thu, 8 Jul 2010 09:04:14 -0400
Extended regular expressions (particularly) have empty transitions that can occur in loops. For example, (?(1)a) only matches "a" if group 1 exists, so (?(1)a)* could be a repated matching od the empty string. To some extent this is already avoided at compile time, by refusing to parse things like a**, but there are many possible cases. The problems for an implementation are then: - Whether to warn or reject such cases - If not rejected, whether to try avoid infinite loops during evaluation I am currently working on this with RXPY. In general, I want to (i) provide a safe system as a default, but (ii) allow the user complete control. So it seeems that two flags are necessary: one to disable compile time errors and one to disable run time safety. Implementation must also consider efficiency and ease of maintenance / impementation. It seems to me that many (but not all) cases could be automatically rewritten to a safer version, but I don't currently have good graph rewriting support (I talk about graphs here because the "opcodes" in RXPY are nodes on a graph; the regular expression is "compiled" to a graph of these nodes that is then "evaluated" against the input). Since I do not have rewriting, and because that is not a complete solution anyway, I need some other runtime scheme. The best I have found so far is to add additional nodes that "break" any dangerous loops. A machine can then verify that input has beenconsumed between each encounter with such a node. This keeps almost all the cost to those expressions that need such a feature. Also, the logic to generate these nodes can also be used to generate compile time errors. The graph API includes "consumer(lenient)" where lenient is a boolean. If lenient is true then consumer returns True except when a node (or sequence of nodes) *cannot* consume input. Repeating such nodes (or sequences) gives a compile time error. If lenient is false then consumer returns True only when a node (or sequence) *guarantees* consumption (in all cases). This can be used to detect when to add the runtime check node (ie when False). This is not completely implemented, but an initial attempt is looking very positive - "spinning" in empty loops is avoided at very little cost, and the integration of logic for compile and runtime checks reduces the code / detail required. Andrew
This is my blog. It used to be a mailing list called C[omp]ute. It is still generated by email. You can reply to comments via the appropriate link. Edit the mail address to remove the anti-spam measure. However, given the very low volume of replies, and the high rate of spam, it can be months before I moderate a post. Sorry. © 2006-2009 Andrew Cooke (site) / post authors (content).
I am always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Java web/enterprise; functional languages; Python client GUI/web/database; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.
Setting File Permissions in Subversion
Easy Slide-in Menus using YUI 3
Forensics Using Frequency Variation of Mains Supply
Empty Loops in Regular Expressions
Compiling Python Numerics to GPU wuth Theano
Anybots - Physical Presence for Telecommuting
Efficient List Slices in Python
Closures and Anon Functions in Java 7
Debugging A Hung (Spinning) Python Process
Re: A Practical Introduction to OpenCL
OProfile - An Alternative for Profiling Java (and C)