{"id":3897,"date":"2016-04-25T12:27:01","date_gmt":"2016-04-25T12:27:01","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=3897"},"modified":"2016-04-25T12:27:01","modified_gmt":"2016-04-25T12:27:01","slug":"bulk-import-json-files-solr","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/bulk-import-json-files-solr\/","title":{"rendered":"Bulk import JSON files into Solr"},"content":{"rendered":"<p>Solr ships with a really nice utility for importing documents, so you don&#8217;t have to script anything.<\/p>\n<p>After you install Solr, the &#8220;bin&#8221; directory contains a script called post, which you can use to send files to Solr:<\/p>\n<pre lang=\"bash\">\n.\/bin\/post -c talks \\\n  \/d\/data\/talks\/1.json\n<\/pre>\n<p>In testing, I found that when I sent a whole folder I occasionally got an error:<\/p>\n<pre>\nSimplePostTool: WARNING: Response: {\"responseHeader\":{\"status\":400,\"QTime\":1},\"error\":{\"metadata\":[\"error-class\",\"org.apache.solr.common.SolrException\",\"root-error-class\",\"org.apache.solr.common.SolrException\"],\"msg\":\"Invalid Date String:'Monday, 9 November 2009, 12:00AM'\",\"code\":400}}\nSimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http:\/\/localhost:8983\/solr\/talks\/update\/json\/docs\n<\/pre>\n<p>This seems to be caused by extra files in the folder that aren&#8217;t valid JSON, and the best fix is to filter them out before posting. Note that the message itself reports a value Solr could not parse as a date (Solr date fields expect ISO 8601 values such as 2009-11-09T00:00:00Z), so a well-formed JSON file with a differently formatted date can trigger the same error.<\/p>\n<p>The script is pretty versatile: it can import many kinds of files, including Office documents and PDFs, and it can crawl a website:<\/p>\n<pre>\nUsage: post -c <collection> [OPTIONS] <files|directories|urls|-d [\"...\",...]>\n    or post -help\n\n   collection name defaults to DEFAULT_SOLR_COLLECTION if not specified\n\nOPTIONS\n=======\n  Solr options:\n    -url <base Solr update URL> (overrides collection, host, and port)\n    -host <host> (default: localhost)\n  
  -p or -port <port> (default: 8983)\n    -commit yes|no (default: yes)\n\n  Web crawl options:\n    -recursive <depth> (default: 1)\n    -delay <seconds> (default: 10)\n\n  Directory crawl options:\n    -delay <seconds> (default: 0)\n\n  stdin\/args options:\n    -type <content\/type> (default: application\/xml)\n\n  Other options:\n    -filetypes <type>[,<type>,...] (default: xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)\n    -params \"<key>=<value>[&<key>=<value>...]\" (values must be URL-encoded; these pass through to Solr update request)\n    -out yes|no (default: no; yes outputs Solr response to console)\n    -format solr (sends application\/json content as Solr commands to \/update instead of \/update\/json\/docs)\n\n\nExamples:\n\n* JSON file: .\/bin\/post -c wizbang events.json\n* XML files: .\/bin\/post -c records article*.xml\n* CSV file: .\/bin\/post -c signals LATEST-signals.csv\n* Directory of files: .\/bin\/post -c myfiles ~\/Documents\n* Web crawl: .\/bin\/post -c gettingstarted http:\/\/lucene.apache.org\/solr -recursive 1 -delay 1\n* Standard input (stdin): echo '{commit: {}}' | .\/bin\/post -c my_collection -type application\/json -out yes -d\n* Data as string: .\/bin\/post -c signals -type text\/csv -out yes -d $'id,value\\n1,0.47'\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Bulk import of JSON files in 
Solr<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[9],"tags":[204,517],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/3897"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=3897"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/3897\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=3897"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=3897"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=3897"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}