
Workshop on Web Information Retrieval

From: "andrew cooke" <andrew@...>

Date: Fri, 12 Aug 2005 15:06:09 -0400 (CLT)

[Spanish note below]

A short summary of the morning talks at the Workshop on Web Information
Retrieval - http://www.cwr.cl/events/ir-workshop.html - hosted by the
Centre for Web Research - http://www.cwr.cl/ (U Chile).

Efficient and Expressively Complete XML Query Languages; XML Data
Exchange: Consistency and Query Answering:
Bleagh.  Both talks way over my head.  As far as I could work out (though
I don't think anyone said this) XPath and XQuery were designed by some
rather pragmatic (possibly read: ignorant of the theory) people.  As a
consequence, they have the usual problems that come with "pragmatic"
solutions - they're difficult to study analytically and behave very poorly
in certain cases.  It seems a bit more effort could have been taken to
build on previous knowledge and design something that not only had a
friendly syntax, but also mapped easily onto known systems (first order
logic, monadic second order logic, whatever those are), with the bits
that imply more expensive processing added in such a way that a more
efficient subset of the language could still be defined.

Temporal RDF:
RDF is a way of putting semantics on the web.  You define relations: "this
is XXX wrt YYY".  For example, "fred is the son of bob", or "P is a
subclass of Q", or "this relation is of type Z" (they can refer to
themselves).  So you have a bunch of triples (subject, relation, object)
and get a graph out the end.
Turns out that, thanks to recent work, you can define a normal form for
these graphs.
Temporal RDF, then, is a way of extending this to include time, which
adds more relations.  The problem is not doing this - there are an
infinite number of approaches - but finding the most useful approach.
Incidentally, I suggested using RDF to place the NSA metadata on the web.
This work would allow time series to be expressed in a natural manner.
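
To make the triples concrete, here is a tiny Python sketch (mine, not
from the talk) of RDF-style triples as a labelled graph, plus one
possible way of attaching a validity interval in the spirit of Temporal
RDF; the vocabulary and the (start, end) labelling are invented.

# Toy sketch: RDF-style triples as a labelled graph, with an optional
# validity interval to suggest the "Temporal RDF" idea.  The vocabulary
# (names, predicates) and the (start, end) labelling are made up here.

from collections import defaultdict

# (subject, relation, object) triples
triples = [
    ("fred", "sonOf",      "bob"),
    ("P",    "subclassOf", "Q"),
    ("sonOf", "type",      "familyRelation"),   # relations can describe relations
]

# temporal variant: the same shape, but each triple holds over an interval
temporal_triples = [
    ("fred", "worksFor", "acme",   (1999, 2003)),
    ("fred", "worksFor", "globex", (2003, 2010)),
]

def as_graph(triples):
    """Index triples as an adjacency structure: subject -> [(relation, object)]."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph

def valid_at(temporal_triples, year):
    """Return the plain triples that hold at a given instant."""
    return [(s, p, o) for s, p, o, (start, end) in temporal_triples
            if start <= year < end]

if __name__ == "__main__":
    print(dict(as_graph(triples)))
    print(valid_at(temporal_triples, 2005))   # -> [('fred', 'worksFor', 'globex')]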

Query Languages for Graph Databases:
Very good talk.  Most data can be nicely expressed as graphs.  That's why
pointers (and graph theory!) are so important in programming.  Yet
relational databases are horribly inefficient ways of managing such
structures (as anyone who has had to encode a tree and doesn't know
Celko's hack can testify!).
Anyway, XML data are trees (and references give DCGs).  And RDF gives
DCGs.  So these things are coming back into fashion.  Trouble is, again,
people are ignoring past results.  Turns out that none of the suggested
RDF database systems (and there are a good half dozen) answer common
questions as well as the generic "graph databases" from research in the
90s.
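
For reference, the "Celko hack" above is (assuming I remember right) the
nested-set encoding of trees: each node gets (left, right) bounds from a
depth-first walk, so "all descendants of X" becomes a simple range test
that a relational database can answer without recursion.  A minimal
Python sketch, with a made-up tree:

# Minimal sketch of the nested-set encoding of a tree (what I take
# "Celko's hack" to mean): each node is numbered with (left, right)
# bounds so that "descendants of X" is just a range comparison.

tree = {                     # made-up example tree: node -> children
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": [],
    "a1": [],
    "a2": [],
}

def nested_sets(tree, root="root"):
    """Assign (left, right) bounds to every node via a depth-first walk."""
    bounds = {}
    counter = 0
    def walk(node):
        nonlocal counter
        counter += 1
        left = counter
        for child in tree[node]:
            walk(child)
        counter += 1
        bounds[node] = (left, counter)
    walk(root)
    return bounds

def descendants(bounds, node):
    """Descendants of node are exactly the nodes strictly inside its interval."""
    left, right = bounds[node]
    return [n for n, (l, r) in bounds.items() if left < l and r < right]

if __name__ == "__main__":
    bounds = nested_sets(tree)
    print(bounds)
    print(descendants(bounds, "a"))   # -> ['a1', 'a2']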

Interactive Cross Language Retrieval:
Very entertaining talk on searching documents which are written in a
language you don't understand (including the obvious point: why?!).
Key point: machine translation (at least currently) needs to be focussed
on a particular task.  How you do translation for one task (eg searching
documents) is different from how you do it for another (eg presenting
documents to the user so they can assess their relevance, or producing a
translation for actual "use", which is hard).
Anyway, for cross-language search they can now do as well as searching in
a single language!  Impressive result.  Done via machine learning on large
bilingual corpora.  Take the two versions of the same text and see how
words match up.  Typically you get word X in language A matching a set of
words (P, Q, R, S) in language B.  For search, use all of those with
their relative weightings.
Interactive tools that allow you to refine the (cross-language) search (in
various cool ways) are only worth using if you have more than 10 minutes
to spend fiddling.
Search tip: when you think you have the answer to a question, do a search
that includes the answer.  A large number of hits indicates success.
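
As a rough illustration of the weighted-translation point, a toy
query-expansion sketch; the translation table, weights and scoring below
are invented, not from the talk:

# Toy sketch of cross-language query expansion: each source-language
# word maps to several target-language words with weights (as would be
# learned from an aligned bilingual corpus).  The table below is invented.

translation = {
    "dog":  [("perro", 0.8), ("can", 0.15), ("cachorro", 0.05)],
    "food": [("comida", 0.7), ("alimento", 0.3)],
}

def expand_query(words, table):
    """Replace each query word by all its translations, keeping weights."""
    expanded = {}
    for word in words:
        for target, weight in table.get(word, []):
            expanded[target] = expanded.get(target, 0.0) + weight
    return expanded

def score(document_words, expanded):
    """Crude relevance score: sum of weights of expanded terms present."""
    return sum(weight for term, weight in expanded.items()
               if term in document_words)

if __name__ == "__main__":
    query = expand_query(["dog", "food"], translation)
    doc = {"el", "perro", "come", "comida"}
    print(query)
    print(score(doc, query))   # 0.8 + 0.7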

Precision Recall with User Modelling Applied to XML Retrieval:
How to rate different XML searches in a fair, standard way.  A bit too
technical and narrowly focussed for me, but the technique he suggested
sounded good (the user model includes the idea that if you give a user a
node and the answer is in a neighbouring node, they'll probably see it).
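
I didn't follow the details, but the flavour of the user model might be
something like this toy calculation (entirely my own reconstruction; the
neighbourhood definition and the 0.5 probability are assumptions, not
the speaker's numbers):

# Toy illustration (my own, not the speaker's metric): recall where a
# returned XML node earns full credit for being relevant and partial
# credit when a relevant node is one of its neighbours, on the theory
# that the user will probably notice an answer sitting next door.

P_SEE_NEIGHBOUR = 0.5            # assumed probability of spotting a neighbour

def neighbours(node, parent, children):
    """Neighbours of a node: its parent plus its children."""
    result = set(children.get(node, []))
    if node in parent:
        result.add(parent[node])
    return result

def expected_recall(returned, relevant, parent, children):
    credit = 0.0
    for target in relevant:
        if target in returned:
            credit += 1.0
        elif any(target in neighbours(node, parent, children) for node in returned):
            credit += P_SEE_NEIGHBOUR
    return credit / len(relevant) if relevant else 0.0

if __name__ == "__main__":
    # tiny XML-ish tree:  article -> (title, body),  body -> (p1, p2)
    parent = {"title": "article", "body": "article", "p1": "body", "p2": "body"}
    children = {"article": ["title", "body"], "body": ["p1", "p2"]}
    print(expected_recall(returned={"body"}, relevant={"p1"},
                          parent=parent, children=children))   # -> 0.5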

Efficient Searchable Natural Language Adaptive Compression:
Another interesting talk.  There are two kinds of compression.  You either
decide what you're going to do beforehand (typically by studying your
data) or optimize your encoding on the fly.  Huffman coding is the
traditional example of the first approach, Lempel-Ziv of the second.
Typically, adaptive coding (the second approach) is best, but it makes
searching difficult, since the encoding keeps changing.
Key point: in natural languages, word frequency is much more skewed than
letter frequency, so encode whole words.
Anyway, the speaker presented a really cool hybrid solution.  If you use
bytes to encode words then you don't care about the ordering of words
except near the n/n+1 byte boundaries.  So you get an almost static
ordering plus an occasional "swap" in the encoding when a word bumps up
over a boundary.  The number of swaps stabilizes after a megabyte or two
of text, so you get very efficient encoding, easy decoding, and the
possibility of searching (as long as you pay attention to the swaps).
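
Here's roughly how I understood the swap idea, as a much simplified
Python sketch (my reconstruction, not the speaker's algorithm; the tiny
boundary and the use of array positions in place of real byte codes are
just to keep the demo short):

# Much simplified sketch of the "swap" idea: words are coded by their
# position in a rank array; positions below BOUNDARY stand for short
# (1-byte) codes, the rest for longer codes.  Ordering inside each zone
# doesn't matter, so the coder only records a swap when a word's rising
# frequency carries it across the boundary.

BOUNDARY = 2                       # tiny for the demo; a real coder would use 128 or 256

class SwapCoder:
    def __init__(self):
        self.words = []            # rank array: index stands in for the byte code
        self.position = {}         # word -> index in the rank array
        self.count = {}            # word -> frequency seen so far
        self.swaps = []            # log of (step, promoted, demoted) for the decoder
        self.step = 0

    def encode(self, word):
        self.step += 1
        if word not in self.position:
            self.position[word] = len(self.words)
            self.words.append(word)
            self.count[word] = 0
        self.count[word] += 1
        pos = self.position[word]
        # if a long-code word is now more frequent than the least frequent
        # short-code word, swap the two (and remember the swap)
        if pos >= BOUNDARY and len(self.words) > BOUNDARY:
            border = min(self.words[:BOUNDARY], key=lambda w: self.count[w])
            if self.count[word] > self.count[border]:
                self._swap(word, border)
                pos = self.position[word]
        return pos                 # stand-in for the real (1 or 2 byte) code

    def _swap(self, promoted, demoted):
        i, j = self.position[promoted], self.position[demoted]
        self.words[i], self.words[j] = demoted, promoted
        self.position[promoted], self.position[demoted] = j, i
        self.swaps.append((self.step, promoted, demoted))

if __name__ == "__main__":
    coder = SwapCoder()
    codes = [coder.encode(w) for w in "the cat sat sat sat".split()]
    print(codes)         # -> [0, 1, 2, 0, 0]
    print(coder.swaps)   # -> [(4, 'sat', 'the')]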

[Spanish note, translated]
Well, when I started I didn't know I was going to write so much.  Sorry,
but you'll have to practise your English if you want to understand it...
One thing - Exequiel was asking about the difference between a "database"
and a "relational database".  It seems Codd defined a database in a paper
written in 1980.  It has three components - a method for structuring the
data, a system for querying it, and a way of verifying (enforcing?) that
the data are correct.
http://portal.acm.org/citation.cfm?id=806891 (unfortunately you can't
read the whole paper).

Andrew

