Extracting meta-data from pages
From MozillaWiki
Goal
Pancake currently only looks at page URLs and titles. Investigate what we can find out about pages by looking at their content. Examples are:
- Page structure: headings, article text, etc.
- Meta tags: icons, authors, etc.
- Embedded microformats such as recipes, contacts, geo-information, etc.
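As a rough illustration of the three kinds of information listed above, here is a minimal sketch using only Python's standard-library `html.parser`. The class name `PageMetadataParser`, the function `extract_metadata`, the sample HTML, and the short list of microformat class names are all assumptions made for this example; they are not part of Pancake, and a real investigation would need a more tolerant parser and a fuller list of formats.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collects the title, headings, meta tags, icon links, and microformat hints."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    # Hypothetical, incomplete list of microformat root class names.
    MICROFORMAT_ROOTS = {"vcard", "h-card", "hrecipe", "h-recipe", "geo", "h-geo"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []        # list of (tag, text) in document order
        self.meta = {}            # meta name/property -> content
        self.icons = []           # href values of <link rel="... icon ...">
        self.microformats = set() # microformat root classes seen on any element
        self._current = None      # tag whose text is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
        elif tag == "link":
            rels = (attrs.get("rel") or "").lower().split()
            if "icon" in rels and attrs.get("href"):
                self.icons.append(attrs["href"])
        # Any element may carry a microformat root class.
        classes = set((attrs.get("class") or "").split())
        self.microformats |= classes & self.MICROFORMAT_ROOTS

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace inside the collected text.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((tag, text))
            self._current = None


def extract_metadata(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = PageMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "meta": parser.meta,
        "icons": parser.icons,
        "microformats": sorted(parser.microformats),
    }


# Made-up sample page, for illustration only.
sample = """<html><head><title>Pancakes</title>
<meta name="author" content="Jane Doe">
<link rel="shortcut icon" href="/favicon.ico">
</head><body>
<h1>Fluffy <em>pancakes</em></h1>
<div class="hrecipe"><h2>Ingredients</h2></div>
</body></html>"""

info = extract_metadata(sample)
```

The sketch only detects that a microformat root class is present; pulling out the actual recipe or contact fields would be a separate step.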
We should find out how easy it is to find and extract this information, and whether enough pages carry useful information for us to do something with it.
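One possible way to answer the "enough pages" question is to run an extractor over a sample of pages and tally how often each field is non-empty. The sketch below assumes each page has already been reduced to a dict of extracted fields; the function name `coverage` and the four records are placeholders invented for the example, not measured results.

```python
from collections import Counter


def coverage(pages):
    """Return, per field, the fraction of pages with a non-empty value."""
    total = len(pages)
    if total == 0:
        return {}
    counts = Counter()
    for page in pages:
        for field, value in page.items():
            if value:  # empty lists, strings, and dicts count as missing
                counts[field] += 1
    fields = sorted({field for page in pages for field in page})
    return {field: counts[field] / total for field in fields}


# Placeholder records standing in for real extraction output.
pages = [
    {"headings": ["Pancakes"], "icons": ["/favicon.ico"], "microformats": ["hrecipe"]},
    {"headings": ["About"], "icons": [], "microformats": []},
    {"headings": [], "icons": ["/i.png"], "microformats": []},
    {"headings": ["News"], "icons": ["/f.ico"], "microformats": []},
]

stats = coverage(pages)
```

A field that turns up on only a small share of pages may still be worth supporting if it enables a domain-specific result, so the fractions are a starting point rather than a cutoff.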
We should also work out how to use the extracted information, both for generic results and for very domain-specific results such as "people", "recipes", or "locations".