EngineeringProductivity/Projects/ActiveData

From MozillaWiki

Overview

ActiveData is a collection of about 8 billion records (as of Feb 2016) covering unit tests, Buildbot jobs, performance data, and Mercurial. This collection is publicly available and can be queried directly, like any database.

ActiveData is built on top of ElasticSearch, a fast, distributed, redundant document store. ActiveData provides the benefits of familiar and succinct SQL by translating SQL-like queries to ElasticSearch queries.

Problem

In order to improve our testing infrastructure, we require data on how that infrastructure is performing. That information can be extracted from the raw logs, but that requires downloading samples, parsing data, and inserting it into a database (or worse, writing queries in an imperative language, like Python). When we are done with an analysis, we have effectively built an ETL pipeline that does not scale and is too specific to be reused elsewhere. The next project does this work all over again.

Solution

ActiveData will serve as a reusable ETL pipeline, annotating the test results with as much relevant data as possible. It also provides a query service to explore and aggregate the data, so minimal setup is required to access it.

Charts

ActiveData is fast enough to support dashboards.

Build times

  • End to End Times - Shows overall time from when a build is first requested to the time tests on that build are complete.
  • Build Times - Time series view of build times by platform and build type. Click on a bar to get a scatter plot view.
  • Detailed Build Times - Scatter plot of build times. Use the left navigation panel to choose a combination. Click on a data point to see the Buildbot step times and Mozharness step times.
  • Buildbot Simulator - An incomplete Buildbot scheduling simulator. It can be used to see past wait times, queue size, and inter-job delays.
  • Test Runtimes - Choose test suite and machine pool to get an average run time for each of the buildbot steps, and Mozharness steps.

Unit Test Visualization

With all unit test results in ActiveData, we can get accurate estimates of "failure rate" and focus on the most-failing tests.

  • Top Intermittent Failures - List of top 30 most-failing unit tests, and list of top 30 most-recent failing tests. Click on the link to get a scatterplot.
  • Find Test Results - Use the search bar to find a test. A list of matching tests, and platform combinations will show the unit test failures and durations.
  • Neglected Oranges - Cross reference OrangeFactor and Bugzilla to give a list of frequent intermittents that have no bug activity.

Design

ActiveData attempts to provide the public with the benefits of an openly available database, only larger and faster.

Goals

An active data instance distinguishes itself from a static resource, or database, or big data solution, by delivering a particular set of features:

  • A service, open to third party clients - Because the service is provided, clients don't need to set up their own datastore.
  • Fast filtering - Sub-second filtering over the contents of the whole datastore, independent of size, saves the application developer from declaring and managing indexes that do the same: There is sufficient information in the queries to determine which indexes should be built to deliver a quick response.
  • Fast aggregates - Sub-second calculation of statistics over the whole datastore saves the application developer from building and managing caches of those aggregates.
  • API is a query language (SQL?, MDX?) - Building upon the formalisms and familiarity of existing query languages reduces the learning curve. It also gives ActiveData implementations more insight into the intent of the client application, so they can optimize for its use cases.
  • Uniform, Cartesian space of values - Mozilla has a mandate of data-driven decision making. Data analysis tools, like spreadsheets, R, SciPy, NumPy, and Pandas, all require uniform data in multi-dimensional arrays, commonly known as "pivot tables" or "data frames". ActiveData's objective is to provide query results in these formats.
  • Metadata on dimensions and measures - ActiveData also provides context for the data it holds. It supports exploration and discovery by third parties by describing units of measure and how dimensions relate to one another, and by providing human-readable descriptions of the stored columns. This metadata is also invaluable in automating the orientation and formatting of dashboard charts: knowing the domain of an axis allows code to decide the best (default) chart form and to offer logically reasonable aggregate options.
  • Has a security model - Simpler applications can avoid the complications of a security model if it is baked into the ActiveData solution. If ActiveData is to become mainstream it is important that it can manage sensitive data and PII.
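To illustrate what a "uniform, Cartesian space of values" means in practice, the sketch below pivots a ragged list of records into a dense two-dimensional grid, with one cell for every combination of dimension values. The function name `pivot`, the record fields (`platform`, `suite`, `duration`), and the sample values are invented for illustration; this is not ActiveData's response format.

```python
def pivot(records, row_key, col_key, value_key):
    """Arrange records into a dense grid: one row per row_key value,
    one column per col_key value, None where no record exists.
    If two records share the same (row, column), the later one wins."""
    rows = sorted({r[row_key] for r in records})
    cols = sorted({r[col_key] for r in records})
    cells = {(r[row_key], r[col_key]): r[value_key] for r in records}
    grid = [[cells.get((row, col)) for col in cols] for row in rows]
    return rows, cols, grid


# Invented sample data; note there is no (win32, xpcshell) record.
records = [
    {"platform": "linux64", "suite": "mochitest", "duration": 120},
    {"platform": "linux64", "suite": "xpcshell", "duration": 45},
    {"platform": "win32", "suite": "mochitest", "duration": 150},
]
rows, cols, grid = pivot(records, "platform", "suite", "duration")
# rows -> ['linux64', 'win32']
# cols -> ['mochitest', 'xpcshell']
# grid -> [[120, 45], [150, None]]
```

Every row has the same number of columns, which is the shape that spreadsheet and data frame tools expect; the missing combination is represented explicitly rather than omitted.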

Limitations

The unittest data is limited to those test suites that generate structured logs. As of Feb 2016, the following do NOT have structured logs, and are NOT in ActiveData:

  • cppunittest
  • any of the JS-based Gaia suites (e.g. Gij)

Specifically, you can see if a structured log is being generated: In Treeherder, click a job. Under the "Job details" pane at the bottom, look for a line similar to:

artifact uploaded: <suite>_raw.log

If you see that, it is using structured logging.
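The same check can be scripted once the text of the "Job details" pane has been copied out. The sketch below assumes that text is available as a plain string and that the line has exactly the form shown above; the function name `find_structured_logs` and the sample text are hypothetical.

```python
import re

# Matches lines of the form "artifact uploaded: <suite>_raw.log".
RAW_LOG_LINE = re.compile(
    r"^\s*artifact uploaded:\s*(\S+)_raw\.log\s*$", re.MULTILINE
)


def find_structured_logs(job_details_text):
    """Return the suite names that have a <suite>_raw.log artifact."""
    return RAW_LOG_LINE.findall(job_details_text)


# Made-up sample of "Job details" text, for illustration only.
sample = """\
artifact uploaded: mochitest_raw.log
artifact uploaded: mochitest_errorsummary.log
"""
suites = find_structured_logs(sample)
uses_structured_logging = bool(suites)
```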

ActiveData makes specific tradeoffs to achieve its goals. It has the following limitations:

  • large memory requirements
  • low add/update/remove speeds
  • strict data model (snowflake schema, hierarchical relations only)
  • non-relational
  • ETL work required to de-normalize data
  • ETL work required to provide dimension metadata
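As a rough illustration of the de-normalization work mentioned above, the sketch below copies each test result's parent job and build into the result itself, so that a single document can be filtered without a join. The function name `denormalize`, the hierarchy (build, job, test result), and all field names are assumptions made for the example; they are not ActiveData's actual schema.

```python
def denormalize(test_results, jobs, builds):
    """Copy each result's parent job and build into the result itself,
    so every document can be filtered without a join."""
    docs = []
    for result in test_results:
        job = jobs[result["job_id"]]
        build = builds[job["build_id"]]
        docs.append({
            "test": result["test"],
            "ok": result["ok"],
            "job": {"id": result["job_id"], "suite": job["suite"]},
            "build": {"id": job["build_id"], "platform": build["platform"]},
        })
    return docs


# Invented sample data in normalized form (build -> job -> test result).
builds = {"b1": {"platform": "linux64"}}
jobs = {"j1": {"build_id": "b1", "suite": "mochitest"}}
test_results = [{"job_id": "j1", "test": "test_a.js", "ok": True}]

docs = denormalize(test_results, jobs, builds)
```

The cost of this approach is the one listed above: parent data is repeated in every child document, which increases storage and makes updates slow.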

Non Goals

ActiveData is not meant to replace an application database. Applications often track significantly more data related to good interface design, process sequences, complex relations, and object life cycles. ActiveData's simple model makes it difficult to track object life cycles and impossible to model many-to-many relations. Data is not live, and definitely does not track "pending jobs" like Treeherder or TaskCluster do. Test results may take a day, or more, to be indexed.

Dependencies / Who will use this

Dependencies

ActiveData's ETL pipeline ingests data from a variety of sources:

  • Buildbot
  • Mozharness
  • Structured Logs
  • TaskCluster (end of Q1 2016)
  • PerfHerder
  • hg.mozilla.org

Users

ActiveData's primary goal is to support dashboards that give Mozilla useful perspectives into the large amount of data:

  • Individual unit test results
  • Buildbot test times
  • Firefox compile times
  • Recently added, removed, and disabled tests
  • Buildbot wait times

Let's Use It!

The service listens at http://activedata.allizom.org/query and accepts queries in JSON Query Expression format.

   curl -XPOST -d "{\"from\":\"unittest\"}" http://activedata.allizom.org/query
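The same request can be made from a script. The following is a minimal Python sketch using only the standard library; it assumes the endpoint above is reachable and that the response body is JSON. The names `build_request` and `run_query` are illustrative and not part of ActiveData.

```python
import json
import urllib.request

# Endpoint taken from the text above.
ACTIVEDATA_URL = "http://activedata.allizom.org/query"


def build_request(query, url=ACTIVEDATA_URL):
    """Wrap a query (a plain dict) in an HTTP POST request.
    No Content-Type header is set explicitly, mirroring the curl call."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def run_query(query, url=ACTIVEDATA_URL, timeout=60):
    """Send the query and decode the response (assumed to be JSON)."""
    request = build_request(query, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same query as the curl example. Building the request needs no network;
# sending it does:
#     result = run_query({"from": "unittest"})
request = build_request({"from": "unittest"})
```

Only the `from` clause appears in the example above; any other clauses should be checked against the JSON Query Expression documentation before use.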

The Query Tool

The ActiveData service is intended for use by automated clients, not humans. The Query Tool is a minimal web page for humans to do some exploration, and to test phrasing queries.

Documentation

Code

Development is still in the early stages, setting up your own service

Tests

Bugs

Bugs are tracked in Bugzilla. The open issues are shown here:

ID Summary Priority Status
1030857 logging.handlers.RotatingFileHandler needs replacement -- NEW
1152971 API to obtain build and log URLs for a given commit -- NEW
1153953 Add saved parametric queries -- NEW
1156347 Do not return 4xx if ActiveData thinks it can fix the problem -- NEW
1158590 [SpotManager] Should consider the remaining hour is effectively free -- NEW
1160412 Add tests for all expressions, including "prefix" -- NEW
1161202 [QueryTool] Cancel current query if new one is submitted -- NEW
1161204 [Qb] Be able to take average of query the last N of given filter -- NEW
1161205 [Qb] the limit clause should specify the dimension -- NEW
1161232 Verify volume of data in ActiveData -- NEW
1161255 Prioritize re-ETL -- NEW
1162595 [SpotManager] Bidding for ES spot instances is suboptimal -- NEW
1162925 Show when service is unresponsive -- NEW
1163078 [SpotManager] May not exit if there is a setup() failure -- NEW
1163094 [SpotManager] add Availability Zone to settings -- NEW
1163313 [Query Tool] reloading a saved query will replace the textarea -- NEW
1163315 Set size=0 on all queries -- NEW
1164093 Add $value and $object properties for indexing to ES -- NEW
1164851 [Query Tool] Show list of previous queries -- NEW
1166972 [QueryTool] Let URL parameters fill query parameters -- NEW
1166979 [QueryTool] Some find JSON a nightmare to write -- NEW
1167561 [ETL] Verify index metadata conforms to what we expect -- NEW
1169299 [ETL] Ensure push_to_es.py does not hang on records for too long -- NEW
1169766 [ETL] Start pipeline with raw buildbot Pulse messages? -- NEW
1172289 [Qb] Non-conforming JSON confuses the formatter -- NEW
1172297 [Qb] Auto-convert strings to numbers, if that's how a field is used -- NEW
1173168 [Qb] Fix nested queries -- NEW
1173750 [ETL] Subtests are too big -- NEW
1174699 [Qb] Add median as possible aggregation -- NEW
1180681 [Qb] Add property aliases -- NEW
1191013 [Qb] Stream JSON -- NEW
1191021 [ETL] Import hg.mozilla.org -- NEW
1192520 [qb] Date strings are not flexible -- NEW
1193250 Daemon to backfill test results into ES -- NEW
1196305 [Qb] multiple groupby should string-join multiple columns, the aggregate? -- NEW
1196307 [Qb] Add `concat` function -- NEW
1196343 [Qb] Use new `missing` option on terms aggregation -- NEW
1196749 Balance Shards by Size, not Count -- NEW
1196752 Multiple ES nodes per machine -- NEW
1210599 Deep queries into Perf metrics is not working -- NEW
1213025 Make a Test Failure Dashboard -- NEW
1218184 Better ElasticSearch index creation? -- NEW
1218526 Storing "other.en_revision": [null]? -- NEW
1223057 Ingest TaskCluster Test Results -- NEW
1233499 [ETL] Add Abstraction: Permanent Queue -- NEW
1233500 [QueryTool] balance the quotes! -- NEW
1233507 [QueryTool] Add SQL interface to ActiveData -- NEW
1233511 Use `bool` in ES queries -- NEW
1253672 Determine what is the time spent on docker image downloads -- NEW
1261867 Derive Tests from SqlLogicTest -- NEW
1263185 Create an algorithm to get fastest and most frequently failing tests from ActiveData -- NEW
1270275 BuildbotBridge Job ETL is not working P3 NEW
1273544 Something smells off with IsUnstyledDocument, IsLoadedAsData and IsLoadedAsInteractiveData -- NEW
1282229 Query response should not include "put" and "pull" properties -- NEW
1282230 Use `treeherder.groupName` to determine `run.suite`? -- NEW
1293392 [SpotManager] d2* instance types do not have file limits set high enough -- NEW
1295586 Provide a list of common "recipes" -- NEW
1296637 [OrangeFactor v2] Add link to Treeherder job_id -- NEW
1296643 [OrangeFactor v2] Reconcile difference between TestFailures and OrangeFactor -- NEW
1296650 [BuildTimings] Improve links in build timing dashboards -- NEW
1296653 [CodeCoverage] add links to jobs -- NEW
1296671 Look up all tests run by job_id -- NEW
1296673 Lookup Mozharness step times by job id -- NEW
1296710 Perform adhoc queries on TH data -- NEW
1302699 [ETL] Jobs ETL generating nasty step names -- NEW
1304150 Capture AWS shutdown message for ES node shutdown -- NEW
1307894 [SpotManager] Sometimes does not complete -- NEW
1308644 Add more properties to unittests from taskcluster -- NEW
1310785 Add more CORS negotiation to /query endpoint -- NEW
1313139 [ETL] Add source.file.language = [js, c, etc] to coverage markup -- NEW
1315665 [SpotManager] Add bidding strategy -- NEW
1315666 [SpotManager] Overestimates cost exposure -- NEW
1329297 Handle new json request format from hg -- NEW
1341759 Make Redash Connector for ActiveData -- NEW
1352100 Store longer unittest timelines in ActiveData -- NEW
1361362 Treeherder index missing lots of TH data -- NEW
1372004 Import Github repos -- NEW
1374334 Dashboard to show unit test durations, by platform, longest running first -- NEW
1375840 Recent Firefox Test Engineering results not appearing in ActiveData -- NEW
1377472 Perform lcov rewriting in the coverage uploader task and generate artifact for ActiveData -- NEW
1378491 [TestGroup UI] Explore if ActiveData can be used to get pass/fail ratios on a per-test basis -- NEW

81 Total; 81 Open (100%); 0 Resolved (0%); 0 Verified (0%);


Contact

  • Kyle Lahnakoski
    • IRC: ekyle@irc.mozilla.org
    • Email: klahnakoski@mozilla.org
    • Bugzilla: :ekyle

More Context

Mostly rambling, optional reading.

Inspiration

This project is inspired by the data warehouse and data mart technology that is common inside large corporations. These warehouses are useful because they are "active" services: the data is not only available, but can also be explored interactively by a large audience using a query language.

General Problem

A significant portion of the work in any application goes into the backend database/datastore, which includes:

  • Managing resources and machines to support the datastore
  • Data migrations on schemas during application lifetime
  • Manually defining database indexes for responsive data retrieval
  • Coding caching logic to reduce application latency

The manual effort put toward these features becomes significant as the amount of data grows in size and complexity. More importantly, this effort is being spent over and over on a multitude of applications, each a trivial variation of the next.

General Solution

Abstractly, we desire to reduce this redundant workload by adding a layer of abstraction called ActiveData. Clients using ActiveData benefit from the features it provides and avoid the complexities of datastore management. Meanwhile, ActiveData implementers can focus on these common issues, working with a simpler data model and a simpler query language upon which to calculate optimizations.

Columnar datastores have solved many (but not all) problems with changing schema. Query-directed indexing has been around for decades in Oracle's query optimization algorithms, and is available for free in ElasticSearch. We now have the technology to build an ActiveData solution.

By defining an ActiveData standard, we can innovate on both sides of the ActiveData abstraction layer independently.

Client Architecture

Applications that leverage an active data warehouse can forgo significant server side development, if not all, and put the logic on the client side.