Drumbeat/MoJo/hackfest/berlin/projects/followthis

Project Name: FollowThis

   Project Lead(s): Matt Terenzio

Big Goal for MoJo Hackfest:

Ship some usable code. (achieved)
Learn how to manage an Open Source project.
Work with others on related projects. (achieved)
Drink heavily. (achieved)

Key steps toward goal:

1. Need to be able to extract RDFa and Microformats from pages. (working)
2. Need to be able to use NLP to extract entities if semantic metadata is not present. (adopt and contribute to the metameta project for this)
3. Need to be able to store and query the metadata. (am currently able to query the RDF triplestore but need to hone queries)
4. Need a solid UI for users to be able to interact with the service. (getting there)
5. A crawler for the news sources would be nice. (deferred to version 0.2)
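
Steps 1 and 2 imply a check for whether a page carries semantic markup at all before falling back to NLP. A minimal sketch of such a check, assuming that the presence of RDFa attributes (`vocab`, `typeof`, `property`) is a good enough signal. The names `RDFaDetector` and `has_rdfa` are hypothetical and not taken from the FollowThis code:

```python
from html.parser import HTMLParser

# Attributes whose presence suggests RDFa markup (an assumed heuristic,
# not a full RDFa parser).
RDFA_ATTRS = {"vocab", "typeof", "property"}


class RDFaDetector(HTMLParser):
    """Sets `found` to True once any tag carries an RDFa-style attribute."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names before passing them in.
        if any(name in RDFA_ATTRS for name, _ in attrs):
            self.found = True


def has_rdfa(html):
    """Return True if the HTML appears to contain RDFa attributes."""
    detector = RDFaDetector()
    detector.feed(html)
    return detector.found
```

A page for which `has_rdfa` returns False would be a candidate for the NLP entity extraction path in step 2.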

Pending needs:

Important: Need to make a button that is an embeddable widget for ease of deployment.
I have a working bookmarklet but it needs work; jQuery help wanted. (still need a session with a jQuery expert)
Totally clueless on entity extraction from pages that don't have semantic metadata. (solved somewhat)
Also need to figure out SPARQL and the best persistent data store for RDF. (Laurian gave me some good starting points)

Link for more info:

rNews has been brought up. I've installed the RDFa distiller from W3C and you can use it to distill RDFa from pages.

   Example which distills a page with rNews in it: 
   To use it, just call:
http://followth.is/cgi-bin/RDFa.py?uri=uri-of-web-page-you-want-to-distill
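
Because the page address is passed as a query parameter, it should be percent-encoded. A minimal sketch of a client for the call above, assuming the service is reachable and returns UTF-8 text. The helper names `distiller_url` and `distill` are hypothetical:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

DISTILLER = "http://followth.is/cgi-bin/RDFa.py"


def distiller_url(page_url):
    """Build the distiller request URL, percent-encoding the page address."""
    return DISTILLER + "?" + urlencode({"uri": page_url})


def distill(page_url, timeout=10):
    """Fetch the distiller output as text (assumes a UTF-8 response)."""
    with urlopen(distiller_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```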

To extract keywords from text, I set up a CGI script that returns them when you feed it text. (update: Matt has a better entity extractor than this using Stanford NLP; will use that)

example

It should accept POST requests to that URL as well as GET requests.

First pass at a readability-like way to extract the article text and headline from a web page:

http://followth.is/read/article/http%3A%2F%2Fwww.thehour.com%2Fstory%2F511535%2Ffrank-fay-way-we-were/
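
Here the article address is embedded in the URL path, so every reserved character (including `/` and `:`) has to be percent-encoded. A small sketch of how such a URL could be built; `article_url` is a hypothetical helper name:

```python
from urllib.parse import quote

READ_ENDPOINT = "http://followth.is/read/article/"


def article_url(page_url):
    """Embed a fully percent-encoded page address in the endpoint path."""
    # safe="" forces "/" and ":" to be encoded as well.
    return READ_ENDPOINT + quote(page_url, safe="") + "/"
```

Calling `article_url("http://www.thehour.com/story/511535/frank-fay-way-we-were")` reproduces the example URL above.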

Another endpoint that distills RDFa from a web page (this one in PHP):

http://followth.is/transform/?type=rdfa&url=http://www.thehour.com/story/511535/frank-fay-way-we-were

A SPARQL endpoint for the triplestore of rNews data

http://followth.is/transform/sparql/
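
A sketch of what a query against this endpoint might look like. It assumes the endpoint follows the standard SPARQL protocol (query text in a `query` GET parameter), that the data uses the rNews namespace `http://iptc.org/std/rNews/2011-10-07#`, and that articles carry an `rnews:headline` property. None of this is confirmed by the page itself, and `sparql_url` is a hypothetical helper:

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "http://followth.is/transform/sparql/"

# Assumed vocabulary: the rNews namespace and the headline property.
QUERY = """
PREFIX rnews: <http://iptc.org/std/rNews/2011-10-07#>
SELECT ?article ?headline
WHERE {
  ?article rnews:headline ?headline .
}
LIMIT 10
"""


def sparql_url(query):
    """Build a GET URL carrying the query in the `query` parameter."""
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query.strip()})
```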

Link for demo:

FollowThis demo

Link to source code:

FollowThis on GitHub

Where from here:

Though the code is in working form, a few parts need to be cleaned up and organized for better maintainability and extension
Continue to work on open alternatives to some of the portions that use third-party APIs
Documentation for both developers and users
Promote rNews adoption
Deploy to at least one news site by end of 2011

Project Status

Though the code is in working form, a few parts need to be cleaned up and organized for better maintainability and extension
The code still relies on third-party APIs; open source alternatives to those parts are needed

Collaborators

The following folks helped with this project:

Laurian/ How to model data for RDF storage and how to query that data using SPARQL
Raynor/ TF-IDF (term frequency–inverse document frequency)
Laurian/Raynor/ Cosine Similarity concepts for comparing documents
Matt (BBC)/ Entity extraction with Stanford NLP
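
To illustrate the TF-IDF and cosine similarity concepts listed above, here is a minimal pure-Python sketch for comparing documents. It is not the FollowThis implementation: the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Smoothed IDF keeps weights positive even for terms
            # that occur in every document.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two articles that share no terms score 0, and identical articles score 1; related stories fall somewhere in between.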

Also really needed these perspectives:

Jordan/ What constitutes a valuable difference between documents from a user or journalist perspective
Chris/ perspective of CMS usage from an editorial standpoint
Mark/ Open Source project strategy

Next steps

From here I would like to:

Continue to work on open alternatives to some of the portions that use third-party APIs, including actively contributing to the metameta project, which is solving some of the same problems
Documentation for both developers and users
Promote rNews adoption
Deploy to at least one news site by end of 2011

Places where this project might be tested include:

Since I'm still at TheHour.com, I plan on testing it there VERY soon.
Zeit, Guardian, Boston Globe... really, any news site with text articles is appropriate
