# C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

© 2006-2017 Andrew Cooke (site) / post authors (content).

## Web as Database

From: "andrew cooke" <andrew@...>

Date: Sat, 4 Nov 2006 19:23:24 -0300 (CLST)

Notes on the "Web as Database" Workshop, Santiago 2006.
http://grupoweb.upf.es/web_as_database/home.html

Arrived late to start (poor bus journey down from La Serena the night
before).

General comment: was surprised how specific the work was.  Very
closely related to Yahoo's core business.  Not like the impression you
get of Google and Microsoft research.

Enrico Franconi: Semantic Web as Incomplete DB

I arrived during this talk, which seemed to be a fairly theoretical
approach to the RDF "triple" (subject, predicate, object).

Only conclusions I came away with were:
- RDF isn't theoretically sound (there are some small errors /
inconsistencies in some of the underlying theory).
- If you want to interact with other first order logics then you have
some problems with "bnodes" (free variables?).  If a predicate is a
variable then it's kind-of hard to match it with a first order logic.
But with the restriction of no bnode predicates, things work.
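
To make the bnode issue concrete for myself, here's a toy sketch (my own illustration, not from the talk): triples as plain tuples with "?"-prefixed variables. Matching is easy mechanically, but when the *predicate* slot is a variable, the query quantifies over relations, which is what makes a direct first-order translation awkward.

```python
# Toy RDF-style triple store: triples are (subject, predicate, object)
# tuples; pattern slots starting with "?" are variables.

def match(pattern, triples):
    """Return one bindings dict per triple that matches the pattern."""
    results = []
    for triple in triples:
        bindings = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if bindings.get(p, t) != t:   # same var must bind same value
                    break
                bindings[p] = t
            elif p != t:                      # constant must match exactly
                break
        else:
            results.append(bindings)
    return results

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "likes", "carol"),
]

# Ordinary first-order query: the predicate is fixed.
print(match(("alice", "knows", "?x"), triples))   # [{'?x': 'bob'}]

# Variable in the *predicate* position: mechanically identical here,
# but logically it quantifies over relations, not individuals.
print(match(("alice", "?p", "carol"), triples))   # [{'?p': 'likes'}]
```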

Questions from my notes:
- Who is Michael Kifer / HiLog?
1st order logic for higher order logic languages(?)
http://portal.acm.org/citation.cfm?id=94911
http://citeseer.ist.psu.edu/chen89hilog.html
- What is F-logic?
OO equivalent of predicate calculus
- What is SPARQL?
Query language for RDF
http://www.w3.org/TR/rdf-sparql-query/

Andrei Broder: From Queries to Context

Some big cheese from Yahoo.  Entertaining talk with lots of juicy
facts on the economics of the business (which I didn't write down -
sorry, but generally made notes after talks).

One interesting point: the recommendation system gives a throw-away
justification ("we recommended this book
because you bought...").  They have a whole pile of very clever people
working on recommendations, based on a huge amount of information
related to your previous preferences, purchases, and clicking habits.
It's a lot more complex than showing you related books.  But by giving
a simple reason they appear less secretive.

The amount of data Yahoo has on its customers is huge.

At various points he made the point that adverts that work are
appreciated by users - they are useful information.  And that there is
no point in providing adverts to, say, people who never click
through.  But when was the last time you saw a search engine that let
you make the call yourself with a "no adverts" option?

It seems to me that the situation is dangerously close to Spam - that
there is no cost to the company for giving you adverts.

Obviously, the situation isn't identical, because they (1) have to
show some search results or people wouldn't use the service at all and
(2) have a permanent identity with associated reputation (unlike most
spam).  But that cost seems very low from my pov.

Anyway, the big emphasis was the shift to context driven adverts, and
the need for semantic information.  (Apparently - how would I know, I
use adblock - people no longer place ads on news pages with negative
news.)

The problem, then, is how to get the semantics.  It's hard.  Later, I
realised this was similar to old AI's frame problem.  And many of the
talks on this subject make more sense when you realise that they are
different ways of reducing the decision space, constraining possible
semantics, and so making the problem tractable.

Ricardo Baeza-Yates: Mining Queries

One use (common to many talks) was as a way of finding semantically
related words / clusters.

Benjamin Piwowarski: Mining Query Logs

Query logs (was this the same data set that caused a scandal with the
amount of personal data?) provide a wealth of data.

The main result here was that they managed to separate "result
quality" and "page bias".  Page bias is the bias within search result
pages, and between pages - people tend to click on the results at the
top of the first page.

Obviously this is strongly correlated with how good the link was
(under the assumption that the user clicks on the correct link, so for
repeated queries you can get a popularity measure and compare it to
Yahoo's own recommendation ranking).  But by exploiting incorrectly
ranked pages and different functional forms for different queries
(assuming page bias is constant across queries) the two can be
separated.
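
A toy sketch of how I understood the separation (my reconstruction, not the talk's actual model): assume the observed click rate factors into per-query result quality times a position bias that is the same for every query. Then the same "good" result appearing at different positions for different queries lets you solve for the bias. All numbers below are invented.

```python
# Assumed model: clicks[q][i] ~= quality[q][i] * bias[i],
# with bias constant across queries.

bias = [1.0, 0.6, 0.3]          # position bias (the unknown we want back)
quality = {
    "q1": [0.9, 0.2, 0.1],      # q1's best result ranked first
    "q2": [0.1, 0.9, 0.2],      # q2's best result mis-ranked second
}

# Simulated observed click rates under the model.
clicks = {q: [qual * b for qual, b in zip(quals, bias)]
          for q, quals in quality.items()}

# The same-quality (0.9) result sits at position 0 for q1 and position 1
# for q2, so the ratio of observed clicks recovers the relative bias.
est = clicks["q2"][1] / clicks["q1"][0]
print(est)  # ~0.6 (= bias[1] / bias[0], since bias[0] = 1.0)
```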

Someone else said that the bias is an exponential decline within a
page, although if I read the plots right, the decline found here was
significantly less steep than that.

Hugo Zaragoza: Learning Ranking Functions

Semantics is hard.  So difficult to automate.  So how do you make a
system that uses semantics to rate results in a way that lets you play
around, experiment, tweak, etc?

Previous attempts failed for two reasons:
- Quality measures depended on query text.  Huge domain.
- Ranking is a global function.

Fix these by using quality measures that don't depend directly on text
(eg number of words in query that also appear in page) and comparing
pairs of queries to incrementally construct a global ranking.

Result is a general machine learning ranker that can be made to work
on a range of "pluggable" approaches to semantics.
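
A rough sketch of the pairwise idea as I understood it (my illustration, nothing here is from the talk): learn from pairs "page A should outrank page B for this query", using a feature that doesn't depend on the raw query text, such as the fraction of query words the page contains. The queries, pages, and the single-weight perceptron-style update are all invented for illustration.

```python
# Text-independent feature: fraction of query words appearing in the page.
def feature(query, page):
    qwords = set(query.split())
    return len(qwords & set(page.split())) / len(qwords)

# Training pairs: (query, better_page, worse_page).
pairs = [
    ("cheap flights santiago", "cheap flights to santiago chile", "hotel deals"),
    ("rdf sparql tutorial", "sparql rdf query tutorial", "cooking recipes"),
]

# One-feature "ranker": a single weight, nudged whenever a pair is
# ordered wrongly (perceptron-style), so better pages score higher.
w = 0.0
for _ in range(10):
    for query, good, bad in pairs:
        if w * feature(query, good) <= w * feature(query, bad):
            w += feature(query, good) - feature(query, bad)

def score(query, page):
    return w * feature(query, page)

assert score("cheap flights santiago", "cheap flights to santiago chile") > \
       score("cheap flights santiago", "hotel deals")
```

The point of the exercise: because the feature ignores the query text itself, the learned ranker is global rather than per-query, which matches the two fixes listed above.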

Related to this (in retrospect, I can't remember how) was a point
about synonyms.  Listing a bunch of synonyms tends to lose searches
because the page you are interested in probably contains only one of
the synonyms, while simple scoring gives you things like thesauruses
where all synonyms appear.

Mauricio Marin: Efficient Parallelization of Crawling

I didn't understand this, but it was very interesting!

Have a problem that's hard to parallelize.  Best approach is to run in
parallel for a while, then send cross-cutting messages, work out what
was broken, roll those things back, and start from there with a new
iteration.  Most of the talk was the details of how to do this.

Seemed to me that "time" had to somehow be related to pagerank
weights, but, as I said, I was very confused.

Esteban Fuerstein: Slot Assignment

How do you decide what ads to show?  Not just page context, but also
who has paid what.  At the same time, want variety.

Rather disturbing talk, used loaded terms like "fraud" for things that
seemed to be more like technical problems with Yahoo's system that
advantage certain customers (ie without the moral implications
normally associated with "fraud", as far as I could see).

For example, with bidding, to penalise people making several small
bids instead of one big one, they arrange things so that an ad that
costs some price is shown more than the sum of all ads that cost less
than that price.

Which is extremely non-linear, right?  And must reduce variety, since
the expensive ad must be shown a lot more than slightly less expensive
ones.  Conclusions seemed to have been tailored to avoid showing
this...

Carlos Heuser: Measuring Quality of Similarity Functions

If you have a ranking (scoring), can it be used as a discriminant?
Depends on whether the worst score for "good" data is consistently
better than the best score for "bad" data.

Random thought:
- Worst score for "good data" / best score for "bad data" sounds
vaguely like a Galois Connection.
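
The criterion, as I understood it, is simple enough to state in a few lines (my own toy version; the scores below are invented):

```python
# A similarity score works as a discriminant when the worst score among
# known matches still beats the best score among known non-matches --
# i.e. some threshold cleanly separates the two populations.

good_scores = [0.92, 0.85, 0.88]   # scores for known-duplicate pairs
bad_scores = [0.40, 0.61, 0.55]    # scores for known-distinct pairs

def is_discriminant(good, bad):
    return min(good) > max(bad)

print(is_discriminant(good_scores, bad_scores))  # True: any threshold
                                                 # in (0.61, 0.85) separates them
```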

Gonzalo Navarro: Succinct Data Structures

Seen some of this before, I think, at an earlier UdeC conference.
Anyway, smart guy doing cool things.

The idea is to find data structures (and associated algorithms) for
performing certain operations in an optimal way.  For example, by
increasing cache hits, or avoiding disk writes.

New (to me) data structure: wavelet tree.  Didn't seem to be related
to wavelets.  If I understood correctly, some operations always take
same path through tree, so can vertically stack the data in-order.
However, if that's specific to a certain operation, is this equivalent
to just working out the answers?  I guess it must not be - guess I
didn't understand something.
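
For my own benefit, a toy version of the structure (my reconstruction from general descriptions, not from the talk; real implementations use compressed bitvectors rather than Python lists). The alphabet is split in half at each level, one bit per character says which half, and a rank query walks a single root-to-leaf path - which I think is the "same path through tree" property mentioned above.

```python
class WaveletTree:
    """Toy wavelet tree: rank(symbol, i) = count of symbol in text[:i]."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        if len(self.alphabet) <= 1:
            self.bits = None                  # leaf: all symbols identical
            return
        mid = len(self.alphabet) // 2
        self.left_alpha = set(self.alphabet[:mid])
        # One bit per character: 0 = goes to left subtree, 1 = right.
        self.bits = [0 if c in self.left_alpha else 1 for c in text]
        self.left = WaveletTree([c for c in text if c in self.left_alpha],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in text if c not in self.left_alpha],
                                 self.alphabet[mid:])

    def rank(self, symbol, i):
        if self.bits is None:
            return i                          # leaf: every position matches
        bit = 0 if symbol in self.left_alpha else 1
        count = self.bits[:i].count(bit)      # positions surviving into child
        child = self.left if bit == 0 else self.right
        return child.rank(symbol, count)

wt = WaveletTree("abracadabra")
print(wt.rank("a", 11))  # 5 occurrences of 'a' in the whole string
```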

Tried to encode graphs as text, hoping to exploit known patterns in
graphs of web pages.  Doesn't work out as well as dedicated work, so
far.

Random thought:
- Are data and code dual in some way?  What about compact programs?

Raghu Ramakrishnan: Community Systems

Another big cheese?  Appeared to be head of "community" group.
Focussed on semantics rather than reputations (for example).

Example product (DBLife) is portal for a vertical slice - reduce
problem of semantics by restricting frame to, in this case,
researchers in DB systems.

Groups structure info and attract clicks.

Incidentally, the guy worked on Quiq, which was an answer system.

Listed divisions of Yahoo Research at some point.  Micro-economics was
one of them.  Get the impression that the whole division is not that
big.

Best way to calculate mean/min/max of leaf nodes in a tree structure
if you can only sample a few?  What about when doing multiple queries?

So: take a few samples, then use that info to guide subsequent samples.

Looked like simple minimisation of expected error but:
- that assumes independence.  What if books (say) appear in more than
one grouping?
- there was a quick mention of an apparently ad-hoc change to the
process that (I assume) had no theoretical justification.  Seemed odd.
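
To pin down the flavour of "sample a few, then guide later samples", here is an invented two-phase sketch (everything - the grouping, the spread heuristic, the budget split - is my guess, not the talk's actual method):

```python
import random
random.seed(0)

groups = {                      # leaf values, grouped somehow (e.g. by category)
    "a": [10, 11, 9, 10, 10],
    "b": [1, 50, 2, 49, 3],     # high variance: deserves more samples
}

def sample_mean(values, k):
    picks = [random.choice(values) for _ in range(k)]
    return sum(picks) / k

# Phase 1: a couple of pilot samples per group, to gauge spread.
pilot = {g: [random.choice(v) for _ in range(2)] for g, v in groups.items()}
spread = {g: max(s) - min(s) for g, s in pilot.items()}

# Phase 2: spend the remaining budget on the apparently noisiest group.
budget = {g: 2 + (4 if g == max(spread, key=spread.get) else 0)
          for g in groups}
estimates = {g: sample_mean(groups[g], budget[g]) for g in groups}
```

The objection in the notes still applies: this treats groups as independent, which breaks if a leaf belongs to more than one group.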

Question from audience raised similarity with collating information
from distributed network of sensors.

Mariano Consens: Exploiting Structural Summaries

A structural summary groups information in an XML document by an XPath
expression (a "structural ID" or SID).  So, for example, all titles of
papers.

Helps rewriting data.  More practical / specific than a DTD / Schema.

This confused me a bit, and also someone else in the audience, I
think.  SIDs didn't seem to be well defined / constrained (there's an
awful lot of XPath expressions for a document, especially if the DTD /
Schema is very loose (eg HTML)).

Random thought:
This seems to be related to dependent types.  DTD / Schema defines
types for a document.  But it's unclear how deep recursion nests, for
example.  In contrast, an SID includes this information (via the
presence / absence of certain paths).  Since this depends on the data,
it's dependent types.  But this argument is rather vague - especially
(worryingly) I realised that the relationship between grammar and type
wasn't clear (to me).

Georges Dupret: Hierarchy from Principal Components

I screwed up here - thought I understood where the talk was going,
stopped concentrating, and then realised I was wrong.  Got lost.

Somewhere in the practical details of constructing a set of principal
components you reject possible axes in favour of others.  It turns out
that when you're working with text, this has some semantical
implications!  The phrases you reject are related to the ones you use
instead in some kind of hierarchy!  It makes sense I guess...

So the question is, what is this emergent structure?  What are its
properties?  Seems like it is transitive in at least some cases.

Apparently a very efficient way of generating hierarchies.

http://webhosting.vse.cz/svatek/KDO05/paper3.pdf

Josep Lluis Larriba-Pey: Same Entities in Co-Authorship Graphs

Dirty information may contain false duplicates.  For example,
mis-spelt names.  There are algorithms that can check for this, but
they are going to be n^2 (I guess), so cannot be applied naively to
large data sets.  Instead, you need to divide the data into "blocks".

Syntactic approaches include blocks defined by edit distance, lexical
search, etc.

Semantic approach here uses graph connections (co-author relations).
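
A small sketch of the syntactic variant of blocking (my illustration with made-up names; real blocking keys are smarter than "first letter"): the expensive O(n^2) comparison only runs inside each block.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

names = ["navarro", "navaro", "baeza", "baezza", "zaragoza"]

# Cheap blocking key: first letter (stand-in for a smarter key).
blocks = {}
for name in names:
    blocks.setdefault(name[0], []).append(name)

# Expensive comparison only within blocks; flag likely duplicates.
dupes = [(a, b)
         for block in blocks.values()
         for i, a in enumerate(block)
         for b in block[i + 1:]
         if edit_distance(a, b) <= 1]
print(dupes)  # [('navarro', 'navaro'), ('baeza', 'baezza')]
```

The semantic approach in the talk replaces the syntactic key with graph structure (shared co-authors), but the divide-then-compare shape is the same.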

People:
- Ended up chatting to Patricio Burgos, I think.
- Other interesting companies attending:
http://www.amable.info/
http://www.pragmaconsultores.com/
http://www.americaeconomia.com