| Andrew Cooke | Contents | Latest | RSS | Twitter | Previous | Next


Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Lepl parser for Python.

Colorless Green.

Photography around Santiago.

SVG experiment.

Professional Portfolio

Calibration of seismometers.

Data access via web services.

Cache rewrite.

Extending OpenSSH.

C-ORM: docs, API.

Last 100 entries

Small Success With Go!; Re: Quick message - This link is broken; Adding Reverb To The Echo Chamber; Sox Audio Tools; Would This Have Been OK?; Honesty only important economically before institutions develop; Stegangraphy via PS4; OpenCL Mess; More Book Recommendations; Good Explanation of Difference Between Majority + Minority; Musical Chairs - Who's The Privileged White Guy; I can see straight men watching this conversation and laffing; When it's Actually a Source of Indignation and Disgust; Meta Thread Defending POC Causes POC To Close Account; Indigenous People Of Chile; Curry Recipe; Interesting Link On Marginality; A Nuclear Launch Ordered, 1962; More Book Recs (Better Person); It's Nuanced, And I Tried, So Back Off; Marx; The Negative Of Positive; Jenny Holzer Rocks; Huge Article on Cultural Evolution and More; "Ignoring language theory"; Negative Finger Counting; Week 12; Communication Via Telecomm Bids; Finding Suspects Via Relatives' DNA From Non-Crime Databases; Statistics and Information Theory; Ice OK in USA; On The Other Hand; (Current Understanding Of) Chilean Taxes / Contributions; M John Harrison; Playing Games on a Cloud GPU; China Gamifies Real Life; Can't Help Thinking It's Thoughtcrime; Mefi Quotes; Spray Painting Bike Frame; Weeks 10 + 11; Change: No Longer Possible To Merge Metadata; Books on Old Age; Health Tree Maps; MRA - Men's Rights Activists; Writing Good C++14; Risk Assessment - Fukushima; The Future of Advertising and Surveillance; Travelling With Betaferon; I think I know what I dislike so much about Metafilter; Weeks 8 + 9; More; Pastamore - Bad Italian in Vitacura; History Books; Iraq + The (UK) Governing Elite; Answering Some Hard Questions; Pinochet: The Dictator's Shadow; An Outsider's Guide To Julia Packages; Nobody gives a shit; Lepton Decay Irregularity; An Easier Way; Julia's BinDeps (aka How To Install Cairo); Good Example Of Good Police Work (And Anonymity Being Hard); Best Santiago Burgers; Also; Michael Emmerich (Vibrator Translator) Interview (Japanese Books); Clarice Lispector (Brazillian Writer); Books On Evolution; Looks like Ara (Modular Phone) is dead; Index - Translations From Chile; More Emotion in Chilean Wines; Week 7; Aeon Magazine (Science-ish); QM, Deutsch, Constructor Theory; Interesting Talk Transcripts; Interesting Suggestion Of Election Fraud; "Hard" Books; Articles or Papers on depolarizing the US; Textbook for "QM as complex probabilities"; SFO Get Libor Trader (14 years); Why Are There Still So Many Jobs?; Navier Stokes Incomplete; More on Benford; FBI Claimed Vandalism; Architectural Tessellation; Also: Go, Blake's 7; Delusions of Gender (book); Crypto AG DID work with NSA / GCHQ; UNUMS (Universal Number Format); MOOCs (Massive Open Online Courses); Interesting Looking Game; Euler's Theorem for Polynomials; Weeks 3-6; Reddit Comment; Differential Cryptanalysis For Dummies; Japanese Graphic Design; Books To Be Re-Read; And Today I Learned Bugs Need Clear Examples; Factoring a 67 bit prime in your head; Islamic Geometric Art; Useful Julia Backtraces from Tasks; Nothing, however, is lost with less discomfort than that which, when lost, cannot be missed

© 2006-2015 Andrew Cooke (site) / post authors (content).

Workshop on Web Information Retrieval

From: "andrew cooke" <andrew@...>

Date: Fri, 12 Aug 2005 15:06:09 -0400 (CLT)

[en castellano mas abajo]

A short summary of the morning talks at the Workshop on Web Information
Retrieval - http://www.cwr.cl/events/ir-workshop.html - hosted by the
Centre for Web Research - http://www.cwr.cl/ (U Chile).

Efficient and Expressively Complete XML Query Languages; XML Data
Exchange: Consistency and Query Answering:
Bleagh.  Both talks way over my head.  As far as I could work out (though
I don't think anyone said this) XPATH and XQUERY were designed by some
rather pragmatic (possible read: ignorant of the theory) people.  As a
consequence, they have the usual problems that come with "pragmatic"
solutions - they're difficult to study analytically and behave very poorly
in certain cases.  Seems like a bit more effort could have been taken to
build on previous knowledge and design something that not only had a
friendly syntax, but was easy to map onto known systems (first order
logic, modal second order logic, whatever those are) and where the bits
that imply more expensive processing are added in such a way that a more
efficient subset of the language could be defined.

Temporal RDF:
RDF is a way of putting semantics on the web.  You define relations: "this
is XXX wrt YYY".  For example, "fred is the son of bob", or "P is a
subclass of Q" or "this relation is of type Z" (they can refer to
themselves).  So you have a bunch of triples (subject, object, relation)
and get a graph out the end.
Turns out that you can define a normal form for these graphs, due to
recent work.
Temporal RDF, then, is a way of extensing this to include time.  Which
adds more relations.  The problem is not doing this - there are an
infinite number of approaches - but finding the most useful approach.
Incidentally, I suggested using RDF to place the NSA metatdata on the web.
 This work would allow timeseries to be expressed in a natural manner.

Query Languages for Graph Databases:
Very good talk.  Most data can be nicely expressed as graphs.  That's why
pointers (and graph theory!) are so importnt in programming.  Yet
relational database are horribly inefficient ways of manging such
structures (as anyone who has had to encode a tree and doesn't know
Celko's hack can testify!).
Anyway, XML data are trees (and references give DCGs).  And RDF gives
DCGs.  So these things are coming back into fashion.  Trouble is, again,
people are ignoring past results.  Turns out that none of the suggested
RDF database systems (and there's a good half dozen) answer common
questions as well as generic "graph databases" from research in the 90s.

Interactive Cross Language Retrieval:
Very entertaining talk on searching documents which are written in a
language you don't understand (including the obvious point: why?!).
Key point: machine translation (at least currently) needs to be focussed
on a particular task.  How you do translation for one task (eg searching
documents) is different from another (eg presenting documents to the user
for them to assess their relevance, or doing a translation for "use"
Anyway, for cross language search they can now do as well as searching in
a single language!  Impressive result.  Done via machine learning on large
bi-lingual corpuses (corpi?).  Take the two translations of the same text
and see how words match up.  Typically you get word X in language A
matching to a set of workds (P, Q, R, S) in language B.  For search, use
all those with their relative weighting.
Interactive tools that allow you to refine the (cross-language) search (in
various cool ways) are only worth using if you have more than 10mins to
spend fiddling.
Search tip: When you think you have the answer to a question, do a search
including the answer.  Large number of hits indicates success.

Precision Recall with User Modelling Applied to XML Retrieval:
How to rate different XML searches in a fair, standard way.  A bit
technical and focussed for me, but the technique he suggested sounded good
(the user model includes the idea that you give a user a node and if the
answer is in a neighbouring node, they'll probably see it).

Efficient Searchable Natural Language Adaptive Compression
Another interesting talk.  There are two kinds of compression.  You either
decide what you're going to do before hand (typically by studying your
data) or optimize your encoding on the fly.  Huffman coding is the
traditional example of the first approach, lzip the second.
Typically, adaptive coding (the second approach) is best, but it makes
searching difficult, since the encoding keeps changing.
Key point: In natural languages, word frequency is much more skewed than
letter frequency, so encode whole words.
Anyway, the speaker presented a really cool hybrid solution.  If you use
bytes to encode words then you don't care about the ordering of words
except near n/n+1 byte boundaries.  So you get almost static ordering plus
an occasional "swap encoding" when a word bumps up over a boundary.  The
number of swaps stabilizes after a Mb or two of text, so you get very
efficient encoding, easy decoding, and the possibility of searching (as
long as you pay attention to the swaps).

[traduccion en castellano]
bueno, cuando comence, no sabia que ib a escribir tanto.  lo siento, pero
tienen que practicar su ingles si quieren enetenderlo...  una cosa -
exequiel estaba preguntando sobre la diferencia entre "database" y
"relational database".  parece que Codd definio una base de datos en un
paper escrito en 1980.  tienen tres componentes - un metodo de estructurar
los datos, un sistema para buscarlos, y una manera en que se puede
verificar (enforcar?) que estan bien.
http://portal.acm.org/citation.cfm?id=806891 (no se puede leer el paper
entero, desafortunademente).


` __ _ __ ___  ___| |_____   personal web site: ttp://www.acooke.org/andrew
 / _` / _/ _ \/ _ \ / / -_)  list: ttp://www.acooke.org/andrew/compute.html
 \__,_\__\___/\___/_\_\___|  aim: acookeorg; skype: andrew-cooke

compute mailing list

Comment on this post