Testing ETL Processes

ETL (“extract, transform, load”) processes come in many shapes, sizes, and product types, and occur under many names: “data migration” projects, business intelligence software, analytics, reporting, scraping, database upgrades, and so on. I’ve collected some notes that attempt to classify these projects by their principal attributes, so that you can estimate the risks and plan the testing process for similar projects. If you have other additions to this list, please add comments below.

Types of ETL Processes

  • Data migrations from an unknown source to a known destination
  • Data migrations from unknown, unstructured data into “something useful”
    • Scraped data => Reports (see Ahrefs, MixRank)
    • Scanned data => Reports (see Westlaw)
  • Data migrations from known datasets to known destinations
    • Old version of existing product => New product (v1 => v2)
    • Loading data from {cassandra, postgres, solr, mongo, …} => {cassandra, postgres, solr, mongo, …}

Risks

  • Unknown source data – Highest Risk
    • Risks can be hard to articulate without seeing the source system
    • In theory risks can be mitigated with industry standards
    • Often better to do by hand for small datasets
    • Can be difficult or impossible to script
    • Source data is usually missing columns
    • Source data usually has values that don’t match what the destination expects
    • Source and destination systems handle gaps in the data differently
    • Great for people who bill by the hour
    • Finite data with a large number of edge cases
    • Performance will likely not degrade over time
    • May add junk data to an existing system and permanently degrade performance
  • Scraped/scanned data – Medium risk
    • Easier to articulate risks, because people understand the nature of scanning and scraping
    • May have no metadata
    • Some data is impossible to recover (missing pages, things stored in cabinets)
    • Wide variety of formats (scraped data will vary over time on a site, same for scanned forms)
    • Data volume is typically limited by what humans can generate. Can still be big (e.g. all court cases), but still easier than “all stars” or “all clicks”
    • The volume of support calls may not justify the effort at 95% accuracy, but that may be all that’s achievable
    • Infinite number of subtle edge cases (infinite = you will always find more)
    • Performance will degrade over time if the system is in use
  • Known-to-known systems – Medium/low risk
    • Near real time may be expected but difficult to deliver
    • Risk of data falling behind in the destination system (a freshness check like the sketch after this list can catch it)
    • If this is done because of data volume, certain features will not be available to users (e.g. medians or other computationally heavy operations)
    • You control the data, but there is communication risk among developers on a team
    • Subtle edge cases are very visible
    • Some edge cases will be known, but more will develop over time
    • Performance will degrade over time if the system is in use
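
As an illustration of the “data falling behind” risk, here is a minimal sketch of a freshness check. It assumes both source and destination are Postgres-compatible, expose an updated_at column, and can be reached with psycopg2; the DSNs, table name, and lag tolerance are hypothetical.

```python
# Minimal sketch of a freshness check for a known-to-known migration.
# Assumes both systems are Postgres-compatible and the table carries an
# updated_at timestamp; DSNs, table name, and tolerance are hypothetical.
import psycopg2

MAX_ALLOWED_LAG_SECONDS = 300  # tune to whatever "near real time" means for you

def latest_timestamp(dsn, table):
    """Return the most recent updated_at value in the given table."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # table comes from our own test configuration, not user input
            cur.execute("SELECT max(updated_at) FROM " + table)
            return cur.fetchone()[0]

def check_lag(source_dsn, dest_dsn, table):
    source_ts = latest_timestamp(source_dsn, table)
    dest_ts = latest_timestamp(dest_dsn, table)
    lag_seconds = (source_ts - dest_ts).total_seconds()
    assert lag_seconds <= MAX_ALLOWED_LAG_SECONDS, (
        f"{table}: destination is {lag_seconds:.0f}s behind the source"
    )

check_lag("dbname=source_db", "dbname=warehouse_db", "orders")
```

Run on a schedule, a check like this turns “falling behind” from a vague worry into an alert with a number attached.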

Testing ETL Processes

  • Completeness of migrated data
    • Can be fairly easy to test (SELECT count(*) FROM … in each system; see the count-comparison sketch at the end of this section)
    • Better to migrate slightly more data than needed into the warehouse; it helps avoid rework and makes it easier to tell that everything is there
    • Even if you populate everything into the warehouse, you can still cut it down when you push it to the system end users see (e.g. if it’s an OLAP system)
  • Performance of ETL process
    • Near real time?
    • Stop the world vs. incremental
    • Under load (e.g. populating the warehouse while it also serves external traffic, such as a marketing system)
    • Each of these generates extra test cases
    • Is a NoSQL solution worth it? The pain of additional server setup and newer, less tested software to spray work out and duplicate everything, vs. the control, centralization, and richer query features of a single database
  • Quality of transformation
    • This is never-ending
    • This is harder (i.e. impossible) if it is done once and never revisited
    • Requirements changes must propagate to every corner of the system
    • The system that ultimately receives the migrated data must handle every idiosyncrasy in the source data
    • Failing early risks stopping often; failing late risks not finding defects
    • Diff the before/after output of two implementations (see the diff sketch at the end of this section)
    • You need a thin layer of processing over the before/after output to coerce it into something sane (widen date ranges, handle nulls, add a guaranteed sort order, support subsetting of the data, and compare rows rather than counts)
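
To make the count comparison above concrete, here is a minimal sketch. It assumes Postgres-compatible source and destination databases reachable via psycopg2; the DSNs and table names are hypothetical, and in practice you would also compare counts per partition (per day, per customer) to localize gaps.

```python
# Minimal sketch of a completeness check: compare row counts per table
# between the source system and the warehouse. DSNs and table names are
# hypothetical; adapt the query to each system's dialect.
import psycopg2

TABLES = ["customers", "orders", "order_items"]  # tables expected to be copied in full

def row_count(dsn, table):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # table names come from our own list above, not user input
            cur.execute("SELECT count(*) FROM " + table)
            return cur.fetchone()[0]

def check_completeness(source_dsn, dest_dsn):
    for table in TABLES:
        source_rows = row_count(source_dsn, table)
        dest_rows = row_count(dest_dsn, table)
        assert source_rows == dest_rows, (
            f"{table}: {source_rows} rows in source vs {dest_rows} in destination"
        )

check_completeness("dbname=source_db", "dbname=warehouse_db")
```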
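And here is a minimal sketch of the before/after diff, with normalize() playing the role of the thin coercion layer described above. The CSV file names and column list are hypothetical; the idea is the same whatever format the two implementations emit.

```python
# Minimal sketch of a before/after diff between two ETL implementations.
# normalize() is the thin coercion layer: it handles nulls, supports
# subsetting, and the sorted output gives a stable, row-level comparison.
# File names and column names are hypothetical.
import csv

def load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def normalize(rows, columns, subset=None):
    """Coerce rows into a canonical, comparable set of tuples."""
    canonical = set()
    for row in rows:
        if subset and not subset(row):
            continue  # optional subsetting, e.g. restrict to one date range
        canonical.add(tuple((row.get(col) or "") for col in columns))  # nulls become ""
    return canonical

def diff_outputs(old_path, new_path, columns, subset=None):
    old = normalize(load_csv(old_path), columns, subset)
    new = normalize(load_csv(new_path), columns, subset)
    # Sorted output = guaranteed order, and rows rather than counts.
    return {"only_in_old": sorted(old - new), "only_in_new": sorted(new - old)}

result = diff_outputs("v1_output.csv", "v2_output.csv",
                      ["customer_id", "order_date", "total"])
print(len(result["only_in_old"]), "rows only in the old implementation")
print(len(result["only_in_new"]), "rows only in the new implementation")
```

Comparing rows rather than counts means that when the two implementations disagree, the offending rows are right there in the output instead of hiding behind a matching total.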