{"id":222,"date":"2012-07-01T20:18:28","date_gmt":"2012-07-01T20:18:28","guid":{"rendered":"http:\/\/garysieling.com\/blog\/?p=222"},"modified":"2012-07-01T20:18:28","modified_gmt":"2012-07-01T20:18:28","slug":"generating-arff-files-for-weka-from-postgres","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/generating-arff-files-for-weka-from-postgres\/","title":{"rendered":"Generating ARFF files for Weka from Postgres"},"content":{"rendered":"<p>Since all my <a href=\"http:\/\/garysieling.com\/blog\/tag\/javascript-scraper\">scraped data<\/a> is in <a href=\"http:\/\/garysieling.com\/blog\/tag\/postgres\">Postgres<\/a>, this is the easiest way to get it out &#8211; the fastest iteration possible. At some point I&#8217;ll probably switch to a Java library. It&#8217;s interesting to see, but probably the only lesson from this is that all ETL scripts are ugly.<\/p>\n<pre lang=\"sql\">with advertisers_ranked as (\n\tselect advertiser_id, replace(replace(lower(advertiser), ' ', '_'), '\/', '_') advertiser,\n\t6 + dense_rank() over (partition by 1 order by advertiser) advertiser_rank -- offset 6: the seven attributes declared before the advertisers take sparse indices 0-6, so advertiser indices start at 7\n\tfrom advertisers\n)\nselect '@RELATION flippa' line\nunion all\nselect '@ATTRIBUTE default numeric' line\nunion all\nselect '@ATTRIBUTE siteid string' line\nunion all\nselect '@ATTRIBUTE banned {0,1}' line\nunion all\nselect '@ATTRIBUTE length numeric' line\nunion all\nselect '@ATTRIBUTE h1 numeric' line\nunion all\nselect '@ATTRIBUTE h2 numeric' line\nunion all\nselect '@ATTRIBUTE h3 numeric' line\nunion all\n(select '@ATTRIBUTE ' || advertiser || ' {0, 1}' line\nfrom advertisers_ranked order by advertiser_rank)\nunion all\nselect '@DATA' line\nunion all\n-- there are N advertisers per row, this combines them into one\nselect '{' || siteid || ', ' || banned || ', ' || length || ', ' || h1 || ', ' || h2 || ', ' || h3 || ', ' || array_to_string(array_agg(advertiser ORDER BY advertiser_rank), 
', ') || '}' line\nfrom (\n\tselect distinct\n\t\t'1 ' || s.site_id siteid,\n\t\t'2 ' || (case when seller like '%banned%' then 1 else 0 end) as banned,\n\t\t'3 ' || char_length(description) length,\n\t\t'4 ' || (length(description) - length(regexp_replace(lower(description),'h1','','g'))) \/ length('h1') h1,\n\t\t'5 ' || (length(description) - length(regexp_replace(lower(description),'h2','','g'))) \/ length('h2') h2,\n\t\t'6 ' || (length(description) - length(regexp_replace(lower(description),'h3','','g'))) \/ length('h3') h3,\n\t\tadvertiser_rank || ' 1' advertiser,\n\t\tadvertiser_rank\n\tfrom sites s\n\tjoin sites_advertisers on s.site_id = sites_advertisers.site_id\n\tjoin advertisers_ranked a on a.advertiser_id = sites_advertisers.advertiser_id\n\tjoin auctions on auctions.site_id = s.site_id\n\t) a\ngroup by siteid, banned, length, h1, h2, h3<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Since all my scraped data is in Postgres, this is the easiest way to get it out &#8211; the fastest iteration possible. At some point I&#8217;ll probably switch to a Java library. It&#8217;s interesting to see, but probably the only lesson from this is that all ETL scripts are ugly. 
with advertisers_ranked as ( select &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/generating-arff-files-for-weka-from-postgres\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Generating ARFF files for Weka from Postgres&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[1],"tags":[147,152,204,205,235,437,523,595],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/222"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=222"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/222\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=222"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=222"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=222"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}