AUS:v3
Deprecated by work on Balrog.
Contents
- 1 Database Schema Goals
- 2 Proposed Solution
- 2.1 Collapsed
- 2.2 A Couple Problems
- 2.2.1 Mapping Files To Tables
- 2.2.2 table 1: updates
- 2.2.3 table 2: patches
- 2.2.4 table 3: updateParameters
- 2.2.5 table 4: updateParametersToUpdates
- 2.2.6 table 5: products
- 2.2.7 table 6: versions
- 2.2.8 table 7: buildIds
- 2.2.9 table 8: buildTargets
- 2.2.10 table 9: locales
- 2.2.11 table 10: channels
- 2.2.12 table 11: osVersions
- 2.2.13 table 12: distributions
- 2.2.14 table 13: distributionVersions
- 3 Previous Proposals
- 4 Metadata First Cut
Database Schema Goals
- Referential integrity in data source
- Add an RDBMS - MySQL likely - that uses a schema designed to store the data required for business rules and update metadata -- including data points we may not even track currently (these need to be defined).
- No duplication of data
- Basic normalization (without OVER normalizing) should quite easily take care of this.
Proposed Solution
Using this less than completely normalized solution is theorized to be faster. Rather than using joins, each table will have it's value looked up with an independent query. Caching the results of these queries will minimize actual hits on the database.
Collapsed
A Couple Problems
On beginning to translate the file system to the database I have growing concern that the database schema is not correct. The diagram shows the updateParameters table having a foreign key to the updates table. This couldn't be correct because the sample xml file shows more than one update being offered. If the diagram were correct, only one update could be offered for a given set of input parameters.
Reversing the relationship is not correct either. If the foreign key were to be in the updates table pointing to the updateParameters table, then there could be more than one update offered for a given set of input parameters. However, this would mean that an update is tied to a specific set of update parameters. I wager that is not what is intended.
I believe that we need to add a join table to go between the the updateParameters table and the updates table. This would enable a set of updates to server more than one set of input parameters.
However, I am concerned that an undesirable situation may develop. I generate rows in the updates and patches tables by reading the complete.txt and partial.txt files. I create tentative row entries for the tables from the files and then look into the tables to see if rows identical to my tentative rows exist. If an identical row exists, I throw away my tentative row and substitute the existing row.
Look at this scenario:
- .../somepath01/complete.txt translates to updates row U1 and patches row P1.
- .../somepath01/partial.txt translates to updates row U1 and patches row P2.
- U1 is in the database only once, with P1 and P2 with references to it.
- U1 is associated with a row in updateParameters for 'somepath01'
- .../somepath02/complete.txt translates to updates row U1 and patches row P3.
- .../somepath02/partial.txt translates to updates row U1 and patches row P4.
- Since updates row U1 was already in the database, a new one was not created.
- Patches rows P3 and P4 are given an association with U1.
- Since U1 is given an additional association with updateParameters row 'somepath02'
- All four patches rows P1, P2, P3 and P4 now have an association through updates row U1 with both
updateParameters rows 'somepath01' and 'somepath02'.
This was not expressed in the original file structure. Is this situation likely to happen? Is it undesirable?
Mapping Files To Tables
In mapping the pathname to database fields, each segment in the path leads to an entry in a database table:
incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> Product.name
incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> Version.name
incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> BuildTarget.name
incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> BuildId.name
incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> locale.name
incoming/2/Firefox/trunk/Linux_x86-gcc3/2007061204/en-US/complete.txt -> channel.name (2 - nightly; 3 - ???)
NULL -> os_version.name
NULL -> Distribution.name
NULL -> Distibution_version.name
The rest of the values to be mapped into the database come from the complete.txt and partial.txt files. An example mapping follows:
There are two file formats, one with keywords and one without
complete -> patches.type http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2007-06-12-07-trunk/firefox-3.0a6pre.en-US.linux-i686.complete.mar -> patches.URL sha1 -> patches.hashFunction 4fe7d4db03ef96eb8f34bce05d740d986fcb0bc0 -> patches.hashValue 10752925 -> patches.size 2007061207 -> patches.build 3.0a6pre -> updates.version 3.0a6pre -> updates.extensionVersion detailsUrl=http://www.mozilla.com/thunderbird/releases/1.5.0.14.html -> patches.detailsUrl (if present, otherwise NULL)
updates.type is another complication. It gets 'minor' if updates.type is 'nightly' otherwise it gets NULL
table 1: updates
column name | type | attribute | comment |
id | sequence | primary | generated |
updateType | integer | 1 - minor (default) 2 - major (file: updateType line) |
|
version | varchar | file: version line | |
appv | varchar | file: appv line | |
extv | varchar | file: extv line | |
detailsUrl | varchar | NULL file: detailsURL line (if present) |
|
licenseUrl | varchar | NULL file: licenseUrl line (if present) |
id | updateType | version | extensionVersion | detailsURL | licenseUrl |
0 | minor | 1.0.4 | 1.0 | http://www.foo.com/1.0.4/whatsnew.html | NULL |
1 | major | 1.1.2 | 1.1 | http://www.foo.com/1.1.2/whatsnew.html | NULL |
table 2: patches
column name | type | attribute | comment |
id | sequence | primary | generated |
type | integer | 1 - complete; 2 - partial (file: type line) |
|
url | varchar | file: url line | |
hashFunction | varchar | file: hashFunction line | |
hashValue | varchar | file: hashValue line | |
size | varchar | file: size line | |
build | varchar | file: build line | |
update_id | integer | FK(updates) |
id | type | URL | hashFunction | hashValue | size | updateId |
0 | partial | http://www.foo.com/1.0.4-partial.mar | 0 | |||
1 | complete | http://www.foo.com/1.0.4-complete.mar | 0 | |||
2 | complete | http://www.foo.com/1.1.2-complete.mar | 1 |
table 3: updateParameters
column name | type | attribute | comment |
id | sequence | primary | |
product_id | integer | FK(product) | |
version_id | integer | FK(version) | |
buildId_id | integer | FK(build_id) | |
buildTarget_id | integer | FK(build_target) | |
locale_id | integer | FK(locale) | |
channel_id | integer | FK(channel) | |
osVersion_id | integer | FK(os_version) | |
distribution_id | integer | FK(distribution) | |
distributionVersion_id | integer | FK(distribution_version) | |
update_id | integer | FK(updates) |
table 4: updateParametersToUpdates
column name | type | attribute | comment |
id | sequence | primary | |
updateParameters_id | integer | FK(updateParameters) | |
updates_id | integer | FK(updates) |
table 5: products
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 6: versions
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 7: buildIds
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 8: buildTargets
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 9: locales
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 10: channels
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 11: osVersions
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 12: distributions
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
table 13: distributionVersions
column name | type | attribute | comment |
id | sequence | primary | |
name | varchar |
Previous Proposals
Fully Normalized Hierarchy
Stats:
- number of tables: 17
- number of joins to get data: 16
Sample SQL:
select payload.* from product join pv on (product.name = %PRODUCT% and product.id = pv.productId) join version on (version.name = %VERSION% and version.id = pv.versionId) join pvb on (pv.id = pvb.pvId) join build_id on (build_id.name = %BUILD_ID% and build_id.id = pvb.build_idId) join pvbbt on (pvb.id = pvbbt.pvbId) join build_target on (build_target.name = %BUILD_TARGET% and build_target.id = pvbbt.build_targetId) join pvbbtl on (pvbbt.id = pcbbtl.pvbbtId) join locale on (locale.name = %LOCALE% and locale.id = pvbbtl.localId) join pvbbtlc on (pvbbtl.id = pvbbtlc.pvbbtlId) join channel on (channel.name = %CHANNEL% and channel.id = pvbbtlc.channelId) join pvbbtlcos on (pvbbtlc.id = pcvbbtlcos.pvbbtlcId) join os_version on (os_version.name = %OS_VERSION% and os_version.id = pvbbtlcos.os_versionId) join pvbbtlcosd on (pvbbtlcos.id = pvbbtlcosd.pcbbtlcosId) join distribution in (distribution.name = %DISTRIBUTION% and distribution.id = pvbbtlcosd.distributionId) join payload on (pvbbtlcosd.id = payload.pvbbtlcosdId) join distibution_version on (distibution_version.name = %DISTRIBUTION_VERSION% and distibution_version.id = payload.distibution_versionId)
Advantages:
- simulates the original hierarchical structure
- highly normalized
Disadvantages:
- very verbose SQL
- structure, while faithful to the previous system, doesn't necessarily accurately model the data relationships
Mostly Normalized Hierarchy with Rearrangement
Stats
- number of tables: 16
- number of joins: 15
Sample SQL
select payload.* from payload join (product join pv on (product.name = %PRODUCT% and product.id = pv.productId) join version on (version.name = %VERSION% and version.id = pv.versionId) join pvb on (pv.id = pvb.pvId) join build_id on (build_id.name = %BUILD_ID% and build_id.id = pvb.build_idId)) on payload.pvbId = pcbId.id join (btos join build_target on (build_target.name = %BUILD_TARGET% and build_target.id = btos.build_targetId) join os_version on (os_version.name = %OS_VERSION% and os_version.id = btos.os_versionId) join btosl on (btos.id = btosl.btosId) join locale on (locale.name = %LOCALE% and locale.id = btosl.localeId)) on payload.btoslId = btosl.id join (cd join channel on (channel.name = %CHANNEL% and channel.id = cd.channelId) join distribution on (distribution.name = %DISTRIBUTION% and distribution.id = cd.distributionId) join cddv on (cd.id = cddv.cdId) join distribution_version on (distribution_version.name = %DISTRIBUTION_VERSION% and distribution_version.id = cddv.distribution_versionId)) on payload.cddvId = cddv.id
Advantages
- normalized
- rearrangement more accurately reflect the data relationships
Disadvantages
- complicated SQL
Collapsed
Stats:
- number of tables: 10
- number of joins to get data: 9
Sample SQL:
select payload.* from payload join product on (product.name = %PRODUCT% and product.id = payload.productId) join version on (version.name = %VERSION% and version.id = payload.versionId) join build_id on (build_id.name = %BUILD_ID% and build_id.id = payload.build_idId) join build_target on (build_target.name = %BUILD_TARGET% and build_target.id = payload.build_targetId) join locale on (locale.name = %LOCALE% and locale.id = payload.localId) join channel on (channel.name = %CHANNEL% and channel.id = payload.channelId) join os_version on (os_version.name = %OS_VERSION% and os_version.id = payload.os_versionId) join distribution in (distribution.name = %DISTRIBUTION% and distribution.id = payload.distributionId) join distibution_version on (distibution_version.name = %DISTRIBUTION_VERSION% and distibution_version.id = payload.distibution_versionId)
Advantages
- fewer joins means faster queries
Disadvantages
- poor normalization
- no direct expression of the relationships between the keys
Degenerate
No Image
The information is stored in a single table keyed by the entire URL.
Stats:
- number of tables: 1
- number of joins to get data: 0
Sample SQL:
select payload.* from payload where url = %URL%
Advantages
- fastest of the options
- no complicated SQL to maintain
Disadvantages
- no relationships between keys stored in the database
Metadata First Cut
Converting an XML Schema into a relational database schema can be pretty straight forward. An element generally corresponds with a row in a table. The attributes inside the start tags correspond to the data in columns of the table. Nested elements are rows in another table that has a foreign key relationship with the table to which the parent element is associated. The tree structure of XML lends itself to orderly relational database schema.
While more complicated relationships can exist between the elements, the schema of this project seems very simple.
<?xml version="1.0"?> <updates> <update type="minor" version="1.0.4" extensionVersion="1.0" detailsURL="http://www.foo.com/1.0.4/whatsnew.html"> <patch type="partial" URL="http://www.foo.com/1.0.4-partial.mar" hashFunction="" hashValue="" size=""/> <patch type="complete" URL="http://www.foo.com/1.0.4-complete.mar" hashFunction="" hashValue="" size=""/> </update> .. <update type="major" version="1.1.2" extensionVersion="1.1" detailsURL="http://www.foo.com/1.1.2/whatsnew.html"> <patch type="complete" URL="http://www.foo.com/1.1.2-complete.mar" hashFunction="" hashValue="" size=""/> </update> </updates>
AUS Metadata Database Schema Naive Version 1
This proposal introduces just two tables: updates and patches. They correspond to the major elements of the aforementioned XML file.
table 1: updates
column name | type | attribute | comment |
id | sequence | primary | |
type | varchar | enum | integer | several options for implementation | |
version | varchar | likely a foreign key? | |
extensionVersion | varchar | likely a foreign key? | |
detailsURL | varchar | is this url completely unique? |
id | type | version | extensionVersion | detailsURL |
0 | minor | 1.0.4 | 1.0 | http://www.foo.com/1.0.4/whatsnew.html |
1 | major | 1.1.2 | 1.1 | http://www.foo.com/1.1.2/whatsnew.html |
table 2: patches
column name | type | attribute | comment |
id | sequence | primary | |
type | varchar | enum | integer | several options for implementation | |
URL | varchar | is this url completely unique? | |
hashFunction | varchar | ||
hashValue | varchar | ||
size | varchar | if it's never actually used as an integer |
|
updateId | integer | FK(updates) |
id | type | URL | hashFunction | hashValue | size | updateId |
0 | partial | http://www.foo.com/1.0.4-partial.mar | 0 | |||
1 | complete | http://www.foo.com/1.0.4-complete.mar | 0 | |||
2 | complete | http://www.foo.com/1.1.2-complete.mar | 1 |
Discussion
This schema is missing any indication of the product. Does this system need to take into account more than just one product? If is does, imagine another enclosing
How does this schema meet the goals at the top of the page?
- ...data required for business rules and update metadata...: What are the business rules? Can't test to see if this is sufficient for the business rules until I have business rules. If the "update metadata" is captured in its entirety by the XML example, the it is safe to judge that this database schema is equivalent.
- Basic normalization (without OVER normalizing)...: This simple schema is normalized if the values in the columns are unique. Each table has a type attribute that could be foreign keys to tables listing the allowed types. Since there are so few options, splitting them off into their own tables would be over normalization. The URL columns, if not unique within their tables, could be consolidated into their own tables leaving behind a foreign key.