Support:GSOC Project Scope and Timeline
Timeline
June 25: end of term for GSOC student
June 25: reconnect, follow-up meeting
June 26 - June 28: Install Sphinx and sumo on development server.
June 29 - July 4: Develop indexing engine
July 7 - July 10: Develop filtering and weighting engine
July 11 - July 15: Develop search component and search UI
July 16 - July 23: Develop fudge factor improvements
July 24 - July 31: Refinements
Aug 4 - Aug 20: Load testing, UI improvement, caching (these are considered not part of GSOC scope)
Components
Indexing Engine
Sphinx-based; triggered as a batch job, and accesses the Tiki db directly.
Filtering and Weighting Engine
Extended tables based on Sphinx, with a custom UI for admins to add and remove weights.
Weights stored with index for performance reasons.
Search component
Replaces Tiki lib/searchlib.php - searches index and returns results based on given parameters
Search UI
Replaces Tiki tiki-searchindex.php and tiki-searchindex.tpl to provide front-end UI to search
Scope
Source Data
Data will come from knowledge base and forums. The system will be extensible to other Tiki features but this project will only cover kb and forums.
Filtering
Search results will be filterable by:
- source (kb vs. forums)
- forum thread state (e.g. threads that are answered)
- article type (help vs. troubleshooting)
- category
- author of the article
- contributors to the forum thread
- freshness of data (last modified date for wiki pages, last post date for forum threads)
Filtering information will be part of the index, to improve performance.
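As a rough illustration of filtering on attributes stored with each index entry, here is a minimal Python sketch. The `IndexedDoc` record, its field names, and `filter_docs` are hypothetical and chosen only for this example; they are not the Tiki schema or the Sphinx API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class IndexedDoc:
    # Hypothetical index entry; field names are illustrative only.
    doc_id: int
    source: str            # "kb" or "forum"
    category: str
    author: str
    last_modified: date    # last modified (kb) or last post date (forum)
    answered: bool = False  # only meaningful for forum threads


def filter_docs(docs, source=None, category=None, author=None,
                modified_since=None, answered=None):
    """Return the docs that match every filter that is not None."""
    result = []
    for doc in docs:
        if source is not None and doc.source != source:
            continue
        if category is not None and doc.category != category:
            continue
        if author is not None and doc.author != author:
            continue
        if modified_since is not None and doc.last_modified < modified_since:
            continue
        if answered is not None and doc.answered != answered:
            continue
        result.append(doc)
    return result
```

Because every filter value lives on the index entry itself, no extra database lookup is needed at search time, which is the performance argument made above.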
Localization
This is just another type of filtering.
Locale information will be in the index as well.
Searching for "translations of search terms" is beyond the scope of this project.
Returning of search results that include translations based on user defined fallback is beyond the scope of this project.
Weighting
This will be done based on:
- source type (article vs. forum)
- individual fields of each source type (e.g. title vs. description)
- existence of search term in freetags
- poll results
Weighting info will be stored in the index for performance reasons.
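One way the weighting factors listed above could combine into a single score is sketched below in Python. All weight values, the `poll_score` field (assumed to be normalised to 0..1), and the formula itself are assumptions made for illustration; the actual weights would come from the admin UI.

```python
# Hypothetical weights; in the real system these would be admin-configurable.
SOURCE_WEIGHTS = {"article": 2.0, "forum": 1.0}
FIELD_WEIGHTS = {"title": 3.0, "description": 1.5, "body": 1.0}
FREETAG_BONUS = 2.0


def score(doc, term):
    """Score one document (a dict) for a single search term."""
    term = term.lower()
    # Weighted count of whole-word occurrences in each field.
    field_score = sum(
        weight * doc.get(field, "").lower().split().count(term)
        for field, weight in FIELD_WEIGHTS.items()
    )
    # Bonus when the term is one of the document's freetags.
    if term in (tag.lower() for tag in doc.get("freetags", [])):
        field_score += FREETAG_BONUS
    # Poll results boost the score; 0.0 means no poll data.
    poll = doc.get("poll_score", 0.0)
    return SOURCE_WEIGHTS.get(doc["source"], 1.0) * field_score * (1.0 + poll)
```

With these example weights, the same text scores twice as high as a kb article as it does as a forum thread.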
Indexing
This will be a batch job.
The last modified date of an article could be used to speed up indexing (by avoiding unnecessary reindexing).
Indexed text should not include Tiki syntax.
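The two indexing points above can be sketched as follows in Python. The function names and the document dict layout are hypothetical, and `strip_markup` handles only a handful of wiki markup patterns (headings, bold, italics, page links) for illustration; it is not a complete Tiki syntax parser.

```python
import re
from datetime import datetime


def strip_markup(text):
    """Remove a few common wiki markup patterns (illustrative subset only)."""
    # ((Page|label)) -> label, ((Page)) -> Page
    text = re.sub(r"\(\(([^|)]+)(?:\|([^)]*))?\)\)",
                  lambda m: m.group(2) or m.group(1), text)
    # Leading "!" heading markers
    text = re.sub(r"^!+\s*", "", text, flags=re.MULTILINE)
    # Bold (__) and italic ('') markers
    text = re.sub(r"__|''", "", text)
    return text


def docs_to_reindex(docs, last_run):
    """Keep only documents modified after the previous batch run."""
    return [doc for doc in docs if doc["last_modified"] > last_run]
```

A batch job would record the time of each run, pass it as `last_run` on the next run, and feed only the returned documents (with markup stripped) to the indexer.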
Searching
Need to support boolean logic when searching for search terms: OR, AND, NOT.
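To make the boolean requirement concrete, here is a minimal Python sketch over a toy inverted index, using set operations for AND, OR and NOT. It only illustrates the semantics; in the project itself the search engine would evaluate such queries, and `build_index` and `search` are hypothetical names.

```python
def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, all_ids, must=(), any_of=(), must_not=()):
    """AND over `must`, OR over `any_of`, NOT over `must_not`."""
    results = set(all_ids)
    for term in must:                      # AND: intersect
        results &= index.get(term.lower(), set())
    if any_of:                             # OR: union, then intersect
        either = set()
        for term in any_of:
            either |= index.get(term.lower(), set())
        results &= either
    for term in must_not:                  # NOT: subtract
        results -= index.get(term.lower(), set())
    return results
```

For example, `search(index, docs, must=["firefox"], must_not=["crashes"])` returns the documents that mention "firefox" but not "crashes".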
Caching of search results
Needs to be done, but is not part of the GSOC project; to be scheduled separately.
Fudge factor
Handle spelling errors ("did you mean...").
Support synonyms (searching for "favorites" also searches for "bookmarks").
Ignore locale-specific common words ("the", "a", "Firefox"). This will be limited to English for the scope of this project, but will be extensible.
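The three fudge-factor behaviours above can be sketched in Python as follows. The stopword and synonym tables reuse the examples from this page; the function names and the 0.8 similarity cutoff are assumptions, and the spelling suggestion uses the standard library's `difflib` rather than whatever the final implementation would use.

```python
import difflib

# Example tables for English only; a real system would load these per locale.
STOPWORDS = {"the", "a", "firefox"}
SYNONYMS = {"favorites": ["bookmarks"]}


def expand_query(query):
    """Drop common words and add synonyms for the remaining terms."""
    terms = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue
        terms.append(word)
        terms.extend(SYNONYMS.get(word, []))
    return terms


def did_you_mean(word, vocabulary):
    """Suggest the closest known word, or None if the word is known or nothing is close."""
    if word in vocabulary:
        return None
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None
```

Here `vocabulary` would be the list of words present in the index, so suggestions are always terms that can actually return results.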
Display of search results
Show the title of the page and the first paragraph (in practice, the description field). Showing the text surrounding the matched terms is not part of this project.
Display results as plain text without Tiki formatting (the description field will not contain Tiki formatting).
Data shown about the article, such as poll results, will be based only on information in the index, to improve performance.
“More like this” is a separate thing and should be considered out of scope of this project.