## Workshop on Web Information Retrieval

From: "andrew cooke" <andrew@...>

Date: Fri, 12 Aug 2005 15:06:09 -0400 (CLT)

A short summary of the morning talks at the Workshop on Web Information
Retrieval - http://www.cwr.cl/events/ir-workshop.html - hosted by the
Centre for Web Research - http://www.cwr.cl/ (U Chile).

Efficient and Expressively Complete XML Query Languages; XML Data
Bleagh.  Both talks way over my head.  As far as I could work out (though
I don't think anyone said this) XPATH and XQUERY were designed by some
rather pragmatic (possibly read: ignorant of the theory) people.  As a
consequence, they have the usual problems that come with "pragmatic"
solutions - they're difficult to study analytically and behave very poorly
in certain cases.  Seems like a bit more effort could have been taken to
build on previous knowledge and design something that not only had a
friendly syntax, but was easy to map onto known systems (first order
logic, modal second order logic, whatever those are) and where the bits
that imply more expensive processing are added in such a way that a more
efficient subset of the language could be defined.

Temporal RDF:
RDF is a way of putting semantics on the web.  You define relations: "this
is XXX wrt YYY".  For example, "fred is the son of bob", or "P is a
subclass of Q" or "this relation is of type Z" (they can refer to
themselves).  So you have a bunch of triples (subject, relation, object)
and get a graph out the end.
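
A minimal sketch of the "triples give you a graph" idea, using the examples above.  The relation names (son_of, subclass_of, type) and the build_graph helper are made up for illustration; this is not any particular RDF library's API.

```python
from collections import defaultdict

# Triples stored here as (subject, relation, object).
triples = [
    ("fred", "son_of", "bob"),
    ("P", "subclass_of", "Q"),
    ("son_of", "type", "Z"),  # a relation can itself be the subject of a triple
]

def build_graph(triples):
    """Turn triples into a labelled directed graph: node -> [(relation, node)]."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph

graph = build_graph(triples)
```

Each subject becomes a node and each triple a labelled edge, so anything you can ask of a graph (reachability, paths) you can ask of the triples.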
Turns out that you can define a normal form for these graphs, due to
recent work.
Temporal RDF, then, is a way of extending this to include time.  Which
adds more relations.  The problem is not doing this - there are an
infinite number of approaches - but finding the most useful approach.
Incidentally, I suggested using RDF to place the NSA metadata on the web.
This work would allow time series to be expressed in a natural manner.
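
As one illustration only (the talk's actual approach isn't described here, and as noted there are many possible ones), you could attach a validity interval to each triple and recover an ordinary RDF graph for any moment in time.  The data, the field layout and the snapshot function below are all assumptions of mine.

```python
# (subject, relation, object, valid_from, valid_to); the years are made up.
temporal_triples = [
    ("fred", "works_at", "acme", 1998, 2003),
    ("fred", "works_at", "initech", 2003, 2005),
    ("fred", "son_of", "bob", 1970, 9999),
]

def snapshot(triples, t):
    """Return the plain triples that hold at time t (valid_from <= t < valid_to)."""
    return [(s, r, o) for s, r, o, start, end in triples if start <= t < end]
```

A time series then falls out naturally: the same subject and relation, with different objects over consecutive intervals.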

Query Languages for Graph Databases:
Very good talk.  Most data can be nicely expressed as graphs.  That's why
pointers (and graph theory!) are so important in programming.  Yet
relational databases are horribly inefficient ways of managing such
structures (as anyone who has had to encode a tree and doesn't know
Celko's hack can testify!).
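
Assuming "Celko's hack" means the nested set model (my reading, the talk didn't spell it out), here is a rough sketch.  Each node gets a (left, right) pair from a depth-first walk, and "all descendants of X" becomes a simple range test, which is the kind of thing a relational database can do without recursion.  The function names and the example tree are hypothetical.

```python
def nested_sets(tree, root):
    """tree maps node -> list of children.  Returns node -> (left, right)."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter              # number assigned on the way down
        for child in tree.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)  # number assigned on the way back up

    visit(root)
    return bounds

def descendants(bounds, node):
    """Nodes whose interval lies strictly inside the given node's interval."""
    left, right = bounds[node]
    return sorted(n for n, (l, r) in bounds.items() if left < l and r < right)

tree = {"root": ["a", "b"], "a": ["a1", "a2"]}
bounds = nested_sets(tree, "root")
```

In SQL the two numbers would be two columns and descendants() a BETWEEN-style query.  It works for trees, but it doesn't extend to general graphs, which is rather the point of the talk.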
Anyway, XML data are trees (and references give DCGs).  And RDF gives
DCGs.  So these things are coming back into fashion.  Trouble is, again,
people are ignoring past results.  Turns out that none of the suggested
RDF database systems (and there's a good half dozen) answer common
questions as well as generic "graph databases" from research in the 90s.

Interactive Cross Language Retrieval:
Very entertaining talk on searching documents which are written in a
language you don't understand (including the obvious point: why?!).
Key point: machine translation (at least currently) needs to be focussed
on a particular task.  How you do translation for one task (eg searching
documents) is different from another (eg presenting documents to the user
for them to assess their relevance, or doing a translation for "use"
(hard)).
Anyway, for cross language search they can now do as well as searching in
a single language!  Impressive result.  Done via machine learning on large
bi-lingual corpuses (corpi?).  Take the two translations of the same text
and see how words match up.  Typically you get word X in language A
matching to a set of words (P, Q, R, S) in language B.  For search, use
all those with their relative weighting.
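
A very crude sketch of the idea.  Real systems use proper statistical word alignment; here I just count which words co-occur in aligned sentence pairs, which is a stand-in I chose for brevity.  The function names and the two-sentence "corpus" are invented for the example.

```python
from collections import Counter, defaultdict

def translation_weights(aligned_pairs):
    """Estimate a weight for each (source word, target word) from co-occurrence."""
    counts = defaultdict(Counter)
    for source_sentence, target_sentence in aligned_pairs:
        for s in source_sentence.split():
            for t in target_sentence.split():
                counts[s][t] += 1
    weights = {}
    for s, c in counts.items():
        total = sum(c.values())
        weights[s] = {t: n / total for t, n in c.items()}
    return weights

def expand_query(query, weights):
    """Replace each query word by all its candidate translations, weighted."""
    expanded = Counter()
    for s in query.split():
        for t, w in weights.get(s, {}).items():
            expanded[t] += w
    return dict(expanded)

def score(document, expanded):
    """Sum the weights of the expanded query terms that appear in the document."""
    return sum(expanded.get(t, 0.0) for t in document.split())

pairs = [("casa blanca", "white house"), ("casa grande", "big house")]
weights = translation_weights(pairs)
query = expand_query("casa", weights)
```

The point is that you never pick a single "best" translation for the query; every candidate contributes in proportion to its weight.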
Interactive tools that allow you to refine the (cross-language) search (in
various cool ways) are only worth using if you have more than 10mins to
spend fiddling.
Search tip: When you think you have the answer to a question, do a search
including the answer.  Large number of hits indicates success.

Precision Recall with User Modelling Applied to XML Retrieval:
How to rate different XML searches in a fair, standard way.  A bit
technical and focussed for me, but the technique he suggested sounded good
(the user model includes the idea that you give a user a node and if the
answer is in a neighbouring node, they'll probably see it).
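
To make the neighbouring-node idea concrete, here is a toy version.  This is not the speaker's measure, just plain precision and recall with an optional map of neighbours, where a returned node counts as a hit if it or one of its neighbours is relevant.  All names and data are made up.

```python
def precision_recall(returned, relevant, neighbours=None):
    """Precision and recall; with neighbours, a near miss counts as a hit."""
    neighbours = neighbours or {}

    def near(node):
        return [node] + list(neighbours.get(node, ()))

    # Returned nodes that show the user at least one relevant node.
    hits = [n for n in returned if any(x in relevant for x in near(n))]
    # Relevant nodes the user would see via some returned node.
    found = {r for r in relevant if any(r in near(n) for n in returned)}
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall

returned = ["sec1", "sec3"]
relevant = {"sec2", "sec3"}
neighbours = {"sec1": ["sec2"], "sec3": ["sec2", "sec4"]}
```

With the neighbours included, the search that returned sec1 (right next to the answer in sec2) is no longer penalised as if it had returned something useless.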

Efficient Searchable Natural Language Adaptive Compression:
Another interesting talk.  There are two kinds of compression.  You either
decide what you're going to do beforehand (typically by studying your
data) or optimize your encoding on the fly.  Huffman coding is the
traditional example of the first approach, Lempel-Ziv (as in gzip) the second.
Typically, adaptive coding (the second approach) is best, but it makes
searching difficult, since the encoding keeps changing.
Key point: In natural languages, word frequency is much more skewed than
letter frequency, so encode whole words.
Anyway, the speaker presented a really cool hybrid solution.  If you use
bytes to encode words then you don't care about the ordering of words
except near n/n+1 byte boundaries.  So you get almost static ordering plus
an occasional "swap encoding" when a word bumps up over a boundary.  The
number of swaps stabilizes after a MB or two of text, so you get very
efficient encoding, easy decoding, and the possibility of searching (as
long as you pay attention to the swaps).
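
A rough sketch of why swaps are rare, under assumptions of my own: a dense byte code where the 128 most frequent words get one byte, the next 128^2 get two bytes, and so on (the summary above doesn't give the actual numbers), and a simple "swap with the top of your frequency group" update.  It only counts the swaps that cross a byte-length boundary; it doesn't produce the encoded output, and it ignores how new words are transmitted.

```python
from collections import defaultdict

def code_length(rank, base=128):
    """Bytes needed for the word at this frequency rank (0 = most frequent)."""
    length, capacity = 1, base
    while rank >= capacity:
        rank -= capacity
        capacity *= base
        length += 1
    return length

def count_boundary_swaps(words, base=128):
    """Count how often a word moves to a rank with a different code length."""
    order = []                 # words, most frequent first
    pos = {}                   # word -> index in order
    freq = defaultdict(int)
    swaps = 0
    for w in words:
        if w not in pos:
            pos[w] = len(order)
            order.append(w)
        freq[w] += 1
        p = q = pos[w]
        # Find the first word of the group that w has just overtaken.
        while q > 0 and freq[order[q - 1]] < freq[w]:
            q -= 1
        if q != p:
            other = order[q]
            order[p], order[q] = other, w
            pos[w], pos[other] = q, p
            # Only swaps across a byte boundary change anyone's code length.
            if code_length(p, base) != code_length(q, base):
                swaps += 1
    return swaps
```

For example, count_boundary_swaps("a b c c c".split(), base=2) gives 1 with a tiny two-word first byte, while the same text with the default base gives 0: reordering inside a byte-length group costs nothing.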

Well, when I started, I didn't know I was going to write so much.  Sorry,
but you'll have to practise your English if you want to understand it...
One thing - Exequiel was asking about the difference between "database"
and "relational database".  It seems that Codd defined a database in a
paper written in 1980.  It has three components - a way of structuring
the data, a system for querying it, and a way to verify (enforce?) that
it is correct.
http://portal.acm.org/citation.cfm?id=806891 (you can't read the paper
