[PP-main] Technical/implementation matters

Fri Mar 3 00:43:57 CET 2000

Part three:

> Technical/implementation matters:
> 
> * What format should news items be in?
>   I've given it some thought, and I think this should definitely be a rather
>   formal and verbose XML datatype, or something giving similar semantic
>   structure. It'll allow us to do a lot of things that aren't possible with
>   pure text exchange, HTML, etc. I have some particular ideas about meta data
>   and how that could be put to good use. In addition to the normal meta data
>   news items would obviously have, like author name, keywords, etc., we
>   should have things to aid quality control/fact checking. For instance, a
>   list of URLs that back up the story, phone numbers you can call to confirm
>   it, etc. I'd love to get more feedback on this, and I think we should get
>   the spec down soon, and then get some of the XML knowledgeable people around
>   here to work on making a DTD proposal.

XML XML XML! Definitely it should be XML. Here are the advantages:

* Can be backwards-compatible with existing syndication schemes
(my.netscape/my.userland)
* Libraries for creating and dealing with XML exist in many languages
* It's just ASCII text, so there shouldn't be cross-platform issues
* Easy to transfer between master/child/client servers. Just write some
CGIs and use HTTP for all the inter-server communication
* Flexible to implement

Basically, I think XML would be ideal for this. I think our first real
job is to define a preliminary DTD from which to work. I'm not an XML
guru by any means, so I'd much rather have someone who knows what
they're talking about involved with this. But I will help work up a list
of things the spec ultimately will need to include.
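
To give the DTD discussion something concrete to poke at, here is a rough
sketch of what one news item might look like when built with Python's
standard xml.etree.ElementTree module. Every element name in it (newsitem,
keywords, verification, source, phone) is made up for illustration and is
not a proposed spec; the only things taken from the discussion are the
kinds of metadata involved (author, keywords, backing URLs, phone numbers).

```python
import xml.etree.ElementTree as ET


def build_news_item(title, author, keywords, source_urls, contact_phones, body):
    """Build a hypothetical <newsitem> element.

    All tag and attribute names here are assumptions for illustration,
    not part of any agreed format.
    """
    item = ET.Element("newsitem")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "author").text = author

    # Ordinary descriptive metadata.
    keyword_list = ET.SubElement(item, "keywords")
    for word in keywords:
        ET.SubElement(keyword_list, "keyword").text = word

    # Fact-checking metadata: URLs that back up the story and
    # phone numbers that can be called to confirm it.
    verification = ET.SubElement(item, "verification")
    for url in source_urls:
        ET.SubElement(verification, "source", {"href": url})
    for phone in contact_phones:
        ET.SubElement(verification, "phone").text = phone

    ET.SubElement(item, "body").text = body
    return item


example = build_news_item(
    title="Example story",
    author="A. Reporter",
    keywords=["games", "software"],
    source_urls=["http://example.org/press-release"],
    contact_phones=["+1-555-0100"],
    body="Story text goes here.",
)
xml_text = ET.tostring(example, encoding="unicode")
```

A real format would still need the DTD itself; this only shows the sort of
structure the list of required fields would have to cover.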

> 
> * How do we exchange them?
>   There are two or three basic ways to do it. First of all, should it be
>   centralized or not? That is, should there be a single point all the data
>   passes through? I think so. This is dependent on timely delivery, so a 100%
>   distributed system would be too slow, and would also consume a lot of
>   excess bandwidth. The other question is, do you poll for new items once in
>   a while, or do you open a connection and get all the stuff pushed to you as
>   it happens? Or maybe you do a combination. I don't really have an opinion
>   about this, input would be appreciated, as usual.

I think that ultimately, it needs to be centralized to some extent. This
doesn't mean there can't be failover and redundancy built into the
architecture, and IMO, there should be. 

My recommendation would be:
* Have a smallish group of "central servers" which communicate with each
other, and with their ring of "child servers"
* Child servers mediate between client sites and central servers.
* The central servers store user auth information, and are central
repositories for syndicated data. They all mirror each other.
* The children relieve some of the load from the central servers, and
actually communicate with the client sites, accepting content (which
they forward on to the central servers) and getting content out to the
clients.

As for how/when the data gets retrieved, I'd say clients should poll for
news periodically. What could happen is, a client system signs up for an
account on PeerPress, and gets an identifier. Then, from time to time,
they contact their local child server (probably through HTTP or HTTPS)
and ask for the latest news that fits the criteria they registered with
their account. This would require us to develop some kind of a
"content-tree" system. So a client running a gaming site could sign up
for feeds of "News->Technology->Software->Games" and
"Opinions->Technology->Software->Games", for example. Whenever a client
connected to the child server and authenticated itself as that client,
it would get the latest articles in those categories.
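
As a minimal sketch of how a child server might answer one of those polls,
assuming categories are written as "A->B->C" paths: the function names, the
dictionary keys ("category", "posted"), and the rule that a subscription
also covers everything beneath its node are all assumptions of mine, not
anything decided here.

```python
def parse_path(path):
    """Split 'News->Technology->Software->Games' into a tuple of segments."""
    return tuple(part.strip() for part in path.split("->"))


def matches(subscription, category):
    """True if `category` is the subscribed node or lies beneath it.

    Treating a subscription as covering its whole subtree is an assumption.
    """
    sub = parse_path(subscription)
    cat = parse_path(category)
    return cat[:len(sub)] == sub


def latest_for_client(subscriptions, articles, since):
    """Return articles posted after `since` that fall in any subscribed subtree.

    `articles` is a list of dicts with hypothetical keys 'category' and
    'posted' (any comparable timestamp).
    """
    return [
        article
        for article in articles
        if article["posted"] > since
        and any(matches(sub, article["category"]) for sub in subscriptions)
    ]
```

The client would remember the timestamp of its last successful poll and
pass it as `since`, so each poll only transfers what is new.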

Content submission would be the same, but in reverse. Clients would open
an HTTP connection, and submit an XML file representing some content.
The child server would parse the story for category info, and forward it
on to its local central server, where it would probably go into a
database. Having a child->master setup allows children to do these
things asynchronously; that is, accept content all the time, and queue
it for submission to the master server, so as to mitigate traffic
issues, problems with slow master servers, network outages, or whatever.
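
The queueing idea could look roughly like this. It is only a sketch: the
ChildQueue class and the `forward` callable (whatever actually sends a
story to the master) are hypothetical, and I am assuming that a failed send
shows up as an OSError, which is what Python raises for connection problems.

```python
from collections import deque


class ChildQueue:
    """Accept submissions at any time; forward them to the master when possible."""

    def __init__(self, forward):
        # `forward` is a hypothetical callable that sends one XML document
        # to the master server and raises OSError if the master is unreachable.
        self._forward = forward
        self._pending = deque()

    def accept(self, xml_text):
        """Queue a submission without contacting the master."""
        self._pending.append(xml_text)

    def flush(self):
        """Forward queued items in order; stop at the first failure.

        Returns the number of items delivered. Anything not delivered
        stays queued for the next attempt.
        """
        sent = 0
        while self._pending:
            try:
                self._forward(self._pending[0])
            except OSError:
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

An item is only removed from the queue after the send succeeds, so a slow
or unreachable master delays delivery instead of losing stories.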

> 
>   But I strongly believe the stuff should pass through a central server. To
>   make it scale, we can just add child servers that get their info from the
>   central one. One level of indirection is cheap, and you can scale *really*
>   high that way, especially given that the system won't really be used
>   directly by the general public.

Yes.

> 
>   Member sites should have a client system that lets them filter by keywords
>   and whatnot. Perhaps the filter should be possible to upload to the server
>   they use, so that they don't need to transfer the full feed first, and
>   filter on the local box. Should be possible.

Should filtering be done automatically, manually, or some combination of
both? Do we just trust clients to categorize things sanely, or do we
accept their categorizations and add our own with automated keyword
filtering?

> 
> * How does an origin site push a news item into the system?
>   This is also interesting. We will need some sort of interface system that
>   can plug in to the different systems people use to run their sites,
>   obviously. In addition, there should be a strict requirement on filling in
>   metadata and whatnot, so there needs to be an interface to do that, which
>   is also efficient. This needs some thought.

Basically, I think we need to define an XML standard that content must
comply with. After that, anyone can write their own client that fits in
with the rest of their site's software. We can also develop "sample
clients" but I think the important thing here is to have a public
standard, and have master servers that only accept submissions
conforming to that standard.
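
Until there is a real DTD, a server-side acceptance check might be sketched
like this. The root element name and the list of required fields are
placeholders I made up. Note that Python's xml.etree.ElementTree only
checks well-formedness and does not validate against a DTD, so the field
checks below are a hand-written stand-in for that step.

```python
import xml.etree.ElementTree as ET

# Hypothetical required fields; the real list would come from the spec.
REQUIRED_FIELDS = ("title", "author", "category", "body")


def check_submission(xml_text):
    """Return a list of problems; an empty list means the submission is accepted."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != "newsitem":  # assumed root element name
        problems.append(f"unexpected root element: {root.tag}")
    for field in REQUIRED_FIELDS:
        text = root.findtext(field)
        if text is None or not text.strip():
            problems.append(f"missing or empty <{field}>")
    return problems
```

A master server would reject anything for which this returns a non-empty
list, and could send the list back so the submitting client knows what to
fix.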

I think HTTP is the way to go for client<->server communication. It's
cheap, easy, and everyone running a website already has it implemented.

> 
> * How can we keep the quality level up?
>   Trust metrics are probably a good idea. That is, intra-site trust metrics.
>   The editor(s) of a given site are the ones that pick the items that go up
>   on their site, and also which ones are sent out over the wire. Thus,
>   editors of other sites can use trust metrics to certify them, and how good
>   they think the editorial policy of that site is. That will let you sort
>   things out quickly. Trust threshold should be one of the filter criteria,
>   obviously. Perhaps a combination of site and author metrics would be in
>   order, to get a more fine-grained mechanism (author first, if the author is
>   unknown or has a low rating, that can be rectified by the item coming from
>   a well-respected site).

Well, sometimes the readers of a site are the ones that pick the
stories. :-) Kuro5hin.org is nominally edited by me, but I almost never
accept or reject content on my own. It all gets voted up or down by the
readers.

But there should be some kind of trust metric, I agree. As a client, I
want to be able to say "Only send me stories from sites that have a
three-star rating" or whatever.
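
A threshold filter along those lines might be sketched as follows. The
rating scale, the dictionary layout, and the way author and site ratings
are combined are all assumptions: I am reading the "author first, site as
a fallback" suggestion as "take whichever of the two ratings is higher,
counting an unknown author or site as zero".

```python
def effective_rating(story, author_ratings, site_ratings):
    """Combine author and site ratings for one story.

    An unknown author or site counts as 0, so a story from a low-rated or
    unknown author can still be rescued by a well-respected site.
    This combination rule is an assumption, not an agreed metric.
    """
    author = author_ratings.get(story["author"], 0)
    site = site_ratings.get(story["site"], 0)
    return max(author, site)


def filter_stories(stories, threshold, author_ratings, site_ratings):
    """Keep only the stories whose effective rating meets the threshold."""
    return [
        story
        for story in stories
        if effective_rating(story, author_ratings, site_ratings) >= threshold
    ]
```

With a threshold of 3 this gives the "three-star rating" behaviour, with
the extra twist that a highly rated author also gets through from a site
nobody has rated yet.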

I hope others will tell me where I've gone wrong (or right, even!). In
case you all didn't realize, I'm really excited about this. This idea
was my ultimate goal for Scoop/kuro5hin, once I had a system running on
its own, so I'm glad to see there are others interested too!

--R

--