{"id":2024,"date":"2014-02-19T13:50:18","date_gmt":"2014-02-19T13:50:18","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=2024"},"modified":"2014-02-19T13:50:18","modified_gmt":"2014-02-19T13:50:18","slug":"hacking-customer-service-python","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/hacking-customer-service-python\/","title":{"rendered":"Hacking Customer Service with Python"},"content":{"rendered":"<p>Customer support questions come in many forms, some of which are difficult to answer:<\/p>\n<ul>\n<li>Why does this screen show different results than it did a week ago?<\/li>\n<li>Why does screen A show ten records when screen B shows five?<\/li>\n<li>And, a favorite, what don&#8217;t you know about this problem?<\/li>\n<\/ul>\n<p>Answering these questions in a satisfactory way is challenging, and a few large tech companies have avoided this challenge by not providing customer service. SQL was built to map human questions to a data model, and while it doesn&#8217;t make the above questions easy, it&#8217;s embedded in many existing systems, and it makes sense to work with what we have.<\/p>\n<p>These questions tend to arise when an application applies many rules to achieve it&#8217;s results &#8211; for instance, Google claims to apply over two hundred factors in it&#8217;s ranking algorithms.<\/p>\n<p>If your product uses a relational database, you can get a lot of mileage from extracting SQL queries from a log, finding what where clauses are included, then documenting the result &#8211; a process that might be executed by a helpdesk with heavy technical skills.<\/p>\n<p>Consider, however, that the technical community has made great advances in our ability to deal with data, so we ought to be able to automate some of this away.<\/p>\n<p>Consider the following hypothetical query, which searches for files visible to a particular user, with names matching a typed search. While there are better ways to build a system that does this, many do work this way:<\/p>\n<pre lang=\"sql\">\nSELECT *\nFROM files\nWHERE\n  file_name like '%test%' AND\n  parent_folder = 'test_folder' AND \n  ( owner_user = 'gsieling' OR\n    file_groups IN \n      ( SELECT name  \n        FROM groups \n        WHERE member_user = 'gsieling' ) )\n<\/pre>\n<p>In this example, files may be hidden because of any combination of three filters. These are straightforward, and should be pretty simple to explain to a customer (note that we&#8217;re not explaining <em>why<\/em>). Rather than filtering rows out, it is helpful to transform this query into an equivalent form, which returns each row and whether it should have been included:<\/p>\n<pre lang=\"sql\">\nSELECT \n   (case when file_name like '%test%' then 1 else 0 end) as name_condition,\n   (case when parent_folder = 'test_folder' then 1 else 0 end) as folder_condition,\n   (case when owner_user = 'gsieling' then 1 else 0 end) as owner_condition,\n   (case when file_groups IN \n      ( SELECT name  \n        FROM groups \n        WHERE member_user = 'gsieling' ) then 1 else 0 end) as group_condition\nFROM files\n<\/pre>\n<p>Often the first form of these queries are <a href=\"http:\/\/en.wikipedia.org\/wiki\/Object-relational_mapping\">generated by an application layer<\/a>, but the second form is something I&#8217;ve not seen used. <\/p>\n<p>SQL is a complex language, but fortunately there are libraries that handle parsing, which fits this problem well, especially if you&#8217;re using an ORM, since the queries will follow a pattern. <\/p>\n<p>A Python library called <a href=\"https:\/\/code.google.com\/p\/python-sqlparse\/\">sql-parse<\/a> can tokenize and tag queries without validating them, which is a perfect fit.<\/p>\n<p>You can parse a query simply:<\/p>\n<pre lang=\"python\">\nstmt = sqlparse.parse(sql)[0]\n<\/pre>\n<p>If we could use this result to find each section of the WHERE clause, we could trivially turn these into CASE statements:<\/p>\n<pre lang=\"python\">def generate_case(token, index):\n  return '(case when (' + token + ') then 1 else 0 end) as a' + str(index)<\/pre>\n<p>Sqlparse returns parts of the query as a parse tree, with each object marked with attributes identifying the role a token plays in the query. We need to split the query into the part that is in the WHERE clause, and the part that isn&#8217;t. Having done that, we&#8217;ll reformat the conditions and place them in the SELECT (those with SQL experience will likely object: we&#8217;re not handing nested queries, common table expressions, or conditions on JOINs, but consider that this is a thought experiment)<\/p>\n<pre lang=\"python\">\ndef remove_where(stmt):\n  return [x for x in stmt.flatten() \\\n     if not x.within(sql.Where)]\n\ndef get_where(stmt):\n  results = [x for x in stmt.tokens \\\n     if type(x) is sql.Where]\n\n  if len(results) &gt; 0:\n    return results[0]\n  else: \n    return None\n<\/pre>\n<p>The tricky part is extracting all the conditions &#8211; to do this we can recurse through the entire WHERE clause, pulling out each combination of conditions.<\/p>\n<pre lang=\"python\">\ndef get_clauses(tokens):\n  found = []\n  for token in tokens:    \n    if (token.is_group() and type(token) is sql.Parenthesis):\n      found.append(token.to_unicode())\n      found.extend(get_clauses(token.tokens))\n    elif (token.is_group()):\n      found.append(token.to_unicode())\n\n  return found\n<\/pre>\n<p>To put this back together, we need to find the SELECT within the query. This demonstrates the biggest risk of this type of process: there are many ways to write the same query, so the only way this works is if queries are generated consistently by our application.<\/p>\n<pre lang=\"python\">def replace_select(tokens, toSelect):\n  foundSelect = False\n  foundFrom = False\n  addedSelect = False\n\n  resultSql = ''\n  for token in tokens:\n    if token.value.lower() == 'from':\n      foundFrom = True\n\n    if not foundSelect or foundFrom:\n      resultSql = resultSql + token.to_unicode()\n    else:\n      if not addedSelect:\n        resultSql = resultSql + ' ' + toSelect + ' '\n        addedSelect = True\n\n    if token.value.lower() == 'select':\n      foundSelect = True\n\n  return resultSql<\/pre>\n<p>We now have all we need to re-assemble a new query &#8211; <a href=\"https:\/\/github.com\/garysieling\/sql-for-support\/blob\/master\/sql-mangler.py\">the full source and test cases are available on Github<\/a>. <\/p>\n<p>You might ask, what is the point of all this?<\/p>\n<p>In less than eighty lines of Python, we&#8217;ve turned entries in a production log into a powerful tool for answering a client&#8217;s tickets. <\/p>\n<p>Not only that, we&#8217;ve turned a tedious research task into an interesting engineering challenge, and increased the scope of what a helpdesk can achieve.  <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Customer support questions come in many forms, some of which are difficult to answer: Why does this screen show different results than it did a week ago? Why does screen A show ten records when screen B shows five? And, a favorite, what don&#8217;t you know about this problem? Answering these questions in a satisfactory &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/hacking-customer-service-python\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Hacking Customer Service with Python&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[29],"tags":[133,134,160,447,523],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2024"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=2024"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2024\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=2024"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=2024"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=2024"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}