{"id":1992,"date":"2013-11-05T03:47:07","date_gmt":"2013-11-05T03:47:07","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=1992"},"modified":"2013-11-05T03:47:07","modified_gmt":"2013-11-05T03:47:07","slug":"testing-etl-processes","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/testing-etl-processes\/","title":{"rendered":"Testing ETL Processes"},"content":{"rendered":"<p>ETL (&#8220;extract, transform, load&#8221;) come in many shapes, sizes, and product types, and occur under many names &#8211; &#8220;data migration&#8221; projects, business intelligence software, analytics, reporting, scraping, database upgrades, and so on. I&#8217;ve collected some notes, attempting to classify these projects by their principal attributes, so that you can estimate the risks and plan the testing process for similar projects- if you have other additions to this list, please add comments below.<\/p>\n<p><b>Types of ETL Processes<\/b><\/p>\n<ul>\n<li><strong>Data migrations from an unknown source to a known source<\/strong>\n<ul>\n<li>Consulting projects<\/li>\n<li>Generic migration products (<a href=\"http:\/\/wiki.postgresql.org\/wiki\/Converting_from_other_Databases_to_PostgreSQL\">http:\/\/wiki.postgresql.org\/wiki\/Converting_from_other_Databases_to_PostgreSQL<\/a>)<\/li>\n<li>Data from acquisition =&gt; existing product<\/li>\n<li>Conversion from paper to electronic systems<\/li>\n<\/ul>\n<\/li>\n<li><strong>Data migrations from unknown, unstructured data into \u201csomething useful\u201d<\/strong>\n<ul>\n<li>Scraped data =&gt; Reports (see Ahrefs, MixRank)<\/li>\n<li>Scanned data =&gt; Reports (see Westlaw)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Data migrations from known datasets to known destinations<\/strong>\n<ul>\n<li>Old version of existing product =&gt; New product (v1 =&gt; v2)<\/li>\n<li>Loading data from {cassandra, postgres, solr, mongo \u2026} =&gt; {cassandra, postgres, solr, mongo, 
\u2026}<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><b>Risks<\/b><\/p>\n<ul>\n<li><strong>Unknown source data &#8211; Highest Risk<\/strong>\n<ul>\n<li>Risks can be hard to articulate without seeing the source system<\/li>\n<li>In theory, risks can be mitigated with industry standards<\/li>\n<li>Often better to do by hand for small data sets<\/li>\n<li>Can be difficult \/ impossible to script<\/li>\n<li>Source data is usually missing columns<\/li>\n<li>Source data usually has values that don&#8217;t match what the destination expects<\/li>\n<li>Source and destination data handle gaps differently<\/li>\n<li>Great for people who bill by the hour<\/li>\n<li>Finite data with a large number of edge cases<\/li>\n<li>Performance will likely not degrade over time<\/li>\n<li>May add junk data to an existing system and permanently degrade performance<\/li>\n<\/ul>\n<\/li>\n<li><strong>Scraped\/Scanned Data &#8211; Medium Risk<\/strong>\n<ul>\n<li>Easier to articulate risks, because people understand the nature of scanning, etc.<\/li>\n<li>May have no metadata<\/li>\n<li>Some data is impossible to recover (missing pages, things stored in cabinets)<\/li>\n<li>Wide variety of formats (scraped data will vary over time on a site; the same is true for scanned forms)<\/li>\n<li>Data volume is typically limited by what humans can generate. It can still be big (e.g. all court cases), but it is still easier than \u201call stars\u201d or \u201call clicks\u201d<\/li>\n<li>The volume of support calls may not justify effort beyond 95% accuracy &#8211; and that may be all that\u2019s achievable<\/li>\n<li>Infinite number of subtle edge cases (infinite = you will always find more)<\/li>\n<li>Performance will degrade over time if the system is in use<\/li>\n<\/ul>\n<\/li>\n<li><strong>Known to Known system &#8211; Medium\/Low Risk<\/strong>\n<ul>\n<li>Near real time may be expected, but is difficult to deliver<\/li>\n<li>Risk of data falling behind in the destination system<\/li>\n<li>If this is done because of data volume, certain features will not be available to users (e.g. medians, 
or other computationally heavy operations)<\/li>\n<li>You control the data, but there is communication risk among developers on a team<\/li>\n<li>Subtle edge cases are very visible<\/li>\n<li>Some edge cases will be known up front, but more will develop over time<\/li>\n<li>Performance will degrade over time if the system is in use<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><b>Testing ETL Processes<\/b><\/p>\n<ul>\n<li><strong>Completeness of migrated data<\/strong>\n<ul>\n<li>Can be fairly easy to test (SELECT count(*) FROM \u2026 in each system)<\/li>\n<li>Better to migrate slightly more data than needed into the warehouse &#8211; it helps avoid rework and makes it easier to tell that it\u2019s all there<\/li>\n<li>Even if you populate everything into a warehouse, you can still cut it down when you push it to the system end users see (e.g. for OLAP)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Performance of the ETL process<\/strong>\n<ul>\n<li>Near real time?<\/li>\n<li>Stop-the-world vs. incremental<\/li>\n<li>Under load (e.g. sharing population of the warehouse with some external volume, e.g. a marketing system)<\/li>\n<li>Each of these generates extra test cases<\/li>\n<li>Are NoSQL solutions worth it &#8211; the pain of additional server setup and newer, less-tested software to spread work out and duplicate everything, vs. the control, centralization, and richer query features of a single database<\/li>\n<\/ul>\n<\/li>\n<li><strong>Quality of transformation<\/strong>\n<ul>\n<li>This is never-ending<\/li>\n<li>This is harder (i.e. impossible) if it is done once and never revisited<\/li>\n<li>Requirements changes must propagate to every corner of the system<\/li>\n<li>The system that receives the migrated data must handle every idiosyncrasy in the source data<\/li>\n<li>Fail early and risk stopping often vs. 
failing late and risking not finding defects<\/li>\n<li>Diff before\/after output for two implementations<\/li>\n<li>You need a thin layer of processing over the before\/after output to coerce it into something comparable (widen the date range, handle nulls, add a guaranteed sort order, support subsetting the data, and compare rows rather than counts)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>ETL (&#8220;extract, transform, load&#8221;) processes come in many shapes, sizes, and product types, and occur under many names &#8211; &#8220;data migration&#8221; projects, business intelligence software, analytics, reporting, scraping, database upgrades, and so on. I&#8217;ve collected some notes, attempting to classify these projects by their principal attributes, so that you can estimate the risks and plan the &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/testing-etl-processes\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Testing ETL 
Processes&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[3,6,29],"tags":[85,146,147,152,160,204,511],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1992"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=1992"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/1992\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=1992"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=1992"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=1992"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
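The count-based completeness check and the before/after diff layer described in the post can be sketched as follows. This is a minimal illustration, not the author's implementation: it uses in-memory SQLite databases to stand in for the source and destination systems, and the `orders` table, its columns, and the queries are hypothetical.

```python
import sqlite3

def row_count(conn, table):
    # Completeness check: SELECT count(*) in each system, then compare.
    # (Interpolating the table name is acceptable in a test harness with
    # trusted inputs; never do this with untrusted strings.)
    return conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]

def normalized_rows(conn, query):
    # The "thin layer of processing" over before/after output: handle nulls
    # and guarantee a sort order so two implementations can be diffed
    # row-by-row rather than by counts alone.
    rows = [tuple("" if v is None else v for v in row) for row in conn.execute(query)]
    return sorted(rows)

def diff_implementations(conn_a, conn_b, query):
    # Diff before/after for two implementations: rows present on only one side.
    a = set(normalized_rows(conn_a, query))
    b = set(normalized_rows(conn_b, query))
    return sorted(a - b), sorted(b - a)

# Demo: in-memory databases standing in for source and destination.
src, dst = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, None), (3, 4.0)])
dst.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, None)])

missing, extra = diff_implementations(src, dst, "SELECT id, total FROM orders")
print(row_count(src, "orders"), row_count(dst, "orders"))  # 3 2
print(missing, extra)  # [(3, 4.0)] []
```

Row-level diffs like this surface mismatched values that count comparisons miss; in practice the normalization layer tends to grow to widen date ranges and subset large tables, as the post suggests.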