Firefox/Projects/Places DB Creation Scripts
Contents
Overview
Sprint lead: ddahl
Sprinters: adw
- Description
- Create python scripts to generate Places DBs with various characteristics such as "many visits within the same domain", "visits across many domains", "many tags", "many bookmarks", etc. Also, collect data from real-world users to inform the profiles of our generated DBs.
Status
Sprint's been on the back burner while we're getting Firefox 3.5b4 out the door.
Currently collecting stats from Mozilla community at https://places-stats.mozilla.com/. Been doing so since early March. Stats will inform our database generation script.
Database generation script (Python) being worked on. Patches are up on bug 480340. If you are feeling adventurous, please download the Python and try it out. I would like to document bootstrapping this better. Feel free to ping ddahl in #places.
Relevant links:
- Mozillazine forum posting about stats collection portion
- ddahl's blog post about database generation script
- adw's blog post about stats collection implementation and initial results
Goals / Use Cases
The chief goal is to be able to automate the generation of these sample sqlite databases for a continuous test to run on Places. We want to be able to reliably set some benchmarks and see what code changes either slow down or speed up queries in Places.
The sample data set should actually be quite huge (according to Beltzner and Shaver). We should collect stats from users so that our sample databases reflect real-world use.
Next step: take as input to the generation script the data we gather from the stats web page.
Non Goals
Creating a sample database for every little niche use case. If at some point it becomes important to test a little niche use case, fine, our generator script should be able to handle it, but we will not be doing so at the outset.
Going out of our way to collect data that would help other teams/sprinters at Mozilla. If we can share our results with others because it would help them, fantastic. But time is wasting, we need to get going, so we can't accommodate everyone. Maybe later.
Design
We should try to use the Django ORM to reverse-engineer the Places database schema into Django Models so creating rows will be easy and we can concentrate on url data collection.
Data collection:
Beltzner envisions a huge dataset made up of perhaps 10k unique urls in bookmarks and a similar data set in history, etc...
We need to brainstorm a method for getting this raw data. Spider/bot? There are many python libs for this.
What are the variables we need to keep in mind when creating this data sample for performance testing? ASK Dietrich and Shawn.
Potential exemplar datasets:
- "Grandma": Very few visits per month, mostly to the same sites. Very few bookmarks.
- "Nerd": Very many visits per month across a wide range of sites with a core of often visited sites. Tons o' bookmarks, maybe lots of tags, too.
- "Random Walk": Many visits to many different sites with no discernible most often visited sites.
- "News Hound": Many visits per month, mostly to the same sites.
Or a more general way to think about it, we have these dimensions:
- Number of places (unique URLs)
- Number of visits
- Nature of visits (visiting same URLs often to the exclusion of others, or visiting all places equally? Visiting same domains often? (Does that matter?) Type of transition?)
- Number of bookmarks
- Number of tags
- Nature of tags (each bookmark has tons of tags, few tags, or varied?)
- Keywords
We can come up with different data points in each dimension, take cartesian product across all dimensions to get a full suite of databases... User of our script should be able to specify a point in each dimension, and our script generates a database.
Implementation
Database generator
set up django:
http://www.djangoproject.com/download/1.0.2/tarball/
uncompress and run:
sudo python setup.py install
add django bin to your path
export PATH=$PATH:~/code/python/django/bin:~/code/python
cd ~/code/python
run this:
django-admin.py startproject places
django-admin.py startapp builddb
copy a places.sqlite file to ~/code/python/places
export PLACES_DB_PATH=~/code/python/places/places.sqlite
export DJANGO_SETTINGS_MODULE=places.settings
export PYTHONPATH=$PYTHONPATH:~/code/python
edit the places/settings.py:
import os
DATABASE_ENGINE = 'sqlite3'
DATABASE_NAME = os.environ['PLACES_DB_PATH']
reverse engineer the Django Models from the schema:
cd ~/code/python/places
python manage.py inspectdb >> builddb/models.py
Now, we need to clean up the foreign keys.
Stats collector
https://places-stats.mozilla.com/
The stats collector is a CGI script written in Ruby located at the above address. Visitors are presented with instructions on how to submit statistics related to their Places databases. They copy a small piece of JavaScript, located at https://places-stats.mozilla.com/places.js and embedded in the page, and paste it into Firefox's JavaScript console and evaluate it. The JavaScript computes numerous statistics from their Places database, presents them to the user, and allows him to submit them to the site. Once submitted, the stats are inserted into a MySQL database, from which they are presented to all visitors to the site.
We will publicize the site via blogs, forums, and wherever else to solicit submissions from the community.
Bugs
Misc notes for ddahl and adw
Awesomebar autocomplete
How should AutoComplete be stressed? Shawn says:
- http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp
- GetAutoCompleteBaseQuery() http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp#190
- see BOOK_TAG_SQL - having a lot of tags will slow stuff down, however that might not be representative of "normal users": http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp#109
- mDBAdaptiveQuery http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp#485
- mDBKeywordQuery http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp#507
- AutoCompleteProcessSearch() http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp#1000 - does post-processing
- This file (or these queries at least) are being rewritten in JS: see _processRow() in https://bug455555.bugzilla.mozilla.org/attachment.cgi?id=363641
Some notes on the above funcs and SQL:
- GetAutoCompleteBaseQuery() selects from table moz_places(_temp) x moz_favicons; where frecency != 0; orders by column 9 (guessing this is frecency column...)
- BOOK_TAG_SQL defined in terms of SQL_STR_FRAGMENT_GET_BOOK_TAG http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp#94 which itself selects from moz_bookmarks x moz_bookmarks where TYPE_BOOKMARK; sometimes orders by lastModified;
- mDBAdaptiveQuery uses BOOK_TAG_SQL, moz_inputhistory...
GetAutoCompleteBaseQuery() boils down to:
SELECT h.url, h.title, f.url, (SELECT b.parent FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ORDER BY b.lastModified DESC LIMIT 1 ) AS parent, (SELECT b.title FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ORDER BY b.lastModified DESC LIMIT 1 ) AS bookmark, (SELECT GROUP_CONCAT(t.title, ',') FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent = ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ) AS tags, h.visit_count, h.typed, h.frecency FROM moz_places_temp h LEFT OUTER JOIN moz_favicons f ON f.id = h.favicon_id WHERE h.frecency <> 0 {ADDITIONAL_CONDITIONS} UNION ALL SELECT h.url, h.title, f.url, (SELECT b.parent FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ORDER BY b.lastModified DESC LIMIT 1 ) AS parent, (SELECT b.title FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ORDER BY b.lastModified DESC LIMIT 1 ) AS bookmark, (SELECT GROUP_CONCAT(t.title, ',') FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent = ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ) AS tags, h.visit_count, h.typed, h.frecency FROM moz_places h LEFT OUTER JOIN moz_favicons f ON f.id = h.favicon_id WHERE h.id NOT IN (SELECT id FROM moz_places_temp) AND h.frecency <> 0 {ADDITIONAL_CONDITIONS} -- ORDER BY h.frecency, the 9th column ORDER BY 9 DESC LIMIT ?2 OFFSET ?3);
mDBAdaptiveQuery:
SELECT IFNULL(h_t.url, h.url), IFNULL(h_t.title, h.title), f.url, (SELECT b.parent FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ORDER BY b.lastModified DESC LIMIT 1 ) AS parent, (SELECT b.title FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent != ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ORDER BY b.lastModified DESC LIMIT 1 ) AS bookmark, (SELECT GROUP_CONCAT(t.title, ',') FROM moz_bookmarks b JOIN moz_bookmarks t ON t.id = b.parent AND t.parent = ?1 WHERE b.type = nsINavBookmarksService::TYPE_BOOKMARK AND b.fk = h.id ) AS tags, IFNULL(h_t.visit_count, h.visit_count), IFNULL(h_t.typed, h.typed), rank FROM (SELECT ROUND( MAX( ((i.input = ?2) + (SUBSTR(i.input, 1, LENGTH(?2)) = ?2)) * i.use_count ), 1 ) AS rank, place_id FROM moz_inputhistory i GROUP BY i.place_id HAVING rank > 0 ) AS i LEFT JOIN moz_places h ON h.id = i.place_id LEFT JOIN moz_places_temp h_t ON h_t.id = i.place_id LEFT JOIN moz_favicons f ON f.id = IFNULL(h_t.favicon_id, h.favicon_id) WHERE IFNULL(h_t.url, h.url) NOTNULL ORDER BY rank DESC, IFNULL(h_t.frecency, h.frecency) DESC
mDBKeywordQuery:
SELECT IFNULL( (SELECT REPLACE(url, '%s', ?2) FROM moz_places_temp WHERE id = b.fk), (SELECT REPLACE(url, '%s', ?2) FROM moz_places WHERE id = b.fk) ) AS search_url, IFNULL(h_t.title, h.title), COALESCE( f.url, (SELECT f.url FROM moz_places_temp JOIN moz_favicons f ON f.id = favicon_id WHERE rev_host = IFNULL( (SELECT rev_host FROM moz_places_temp WHERE id = b.fk), (SELECT rev_host FROM moz_places WHERE id = b.fk) ) ORDER BY frecency DESC LIMIT 1), (SELECT f.url FROM moz_places JOIN moz_favicons f ON f.id = favicon_id WHERE rev_host = IFNULL( (SELECT rev_host FROM moz_places_temp WHERE id = b.fk), (SELECT rev_host FROM moz_places WHERE id = b.fk) ) ORDER BY frecency DESC LIMIT 1) ), b.parent, b.title, NULL, IFNULL(h_t.visit_count, h.visit_count), IFNULL(h_t.typed, h.typed) FROM moz_keywords k JOIN moz_bookmarks b ON b.keyword_id = k.id LEFT JOIN moz_places AS h ON h.url = search_url LEFT JOIN moz_places_temp AS h_t ON h_t.url = search_url LEFT JOIN moz_favicons f ON f.id = IFNULL(h_t.favicon_id, h.favicon_id) WHERE LOWER(k.keyword) = LOWER(?1) ORDER BY IFNULL(h_t.frecency, h.frecency) DESC")
AutoComplete is definitely important, but we'd like our database construction scripts/methodology to be general enough to generate places databases for any kind of testing context.
Frecency
- Algorithm description, though sdwilsh says this may be out of date
- Actual frecency calculation at nsNavHistory::CalculateFrecencyInternal(), http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/nsNavHistory.cpp#7275
Stats we should have collected but did not
For each data point:
- Distribution of moz_historyvisits.visit_type. This value is one of the nsINavHistoryService.TRANSITION_* constants.
- Distribution of moz_places.typed
- Distribution of moz_places.frecency
- Nested folder stats (ddahl)