Bulk import JSON files into Solr

Solr ships with a handy utility for importing documents, so you don’t have to script anything yourself.

After you install Solr, the “bin” directory contains a script called post, which you can use to send documents to Solr:

./bin/post -c talks \
  /d/data/talks/1.json

While testing this, I found that when I sent a whole folder, I occasionally got an error:

SimplePostTool: WARNING: Response: {"responseHeader":{"status":400,"QTime":1},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"Invalid Date String:'Monday, 9 November 2009, 12:00AM'","code":400}}
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/talks/update/json/docs

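This seems to be caused by extra files in the folder that aren’t valid JSON for this collection; note that the message itself reports a date value Solr couldn’t parse. The best fix is to make sure those files are filtered out before you post the folder.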
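One way to do that filtering is sketched below. It checks each file with Python's built-in json.tool module and only posts the ones that parse. The function names, the POST_CMD variable, and the reliance on python3 being on the PATH are my assumptions, not part of Solr. It only catches files that fail to parse as JSON; a well-formed file with a badly formatted date value would still be rejected by Solr.

```shell
# Sketch: post only the files in a folder that parse as JSON.
# Assumptions (not from Solr itself): python3 is on the PATH, and
# POST_CMD points at Solr's bin/post script.
POST_CMD="${POST_CMD:-./bin/post}"

# Succeeds (exit status 0) if the file named by $1 parses as JSON.
is_valid_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}

# Posts every valid *.json file in directory $2 to collection $1.
# Files that do not parse are reported on stderr and skipped.
post_valid_json() {
  for file in "$2"/*.json; do
    [ -e "$file" ] || continue   # the glob matched nothing
    if is_valid_json "$file"; then
      "$POST_CMD" -c "$1" "$file"
    else
      echo "skipping invalid JSON: $file" >&2
    fi
  done
}
```

With the collection and path from earlier, this would be called as: post_valid_json talks /d/data/talks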

The script is pretty nice because it can import many kinds of content, including Office documents and PDFs, and it can even crawl a website:

Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["...",...]>
    or post -help

   collection name defaults to DEFAULT_SOLR_COLLECTION if not specified

OPTIONS
=======
  Solr options:
    -url <base Solr update URL> (overrides collection, host, and port)
    -host <host> (default: localhost)
    -p or -port <port> (default: 8983)
    -commit yes|no (default: yes)

  Web crawl options:
    -recursive <depth> (default: 1)
    -delay <seconds> (default: 10)

  Directory crawl options:
    -delay <seconds> (default: 0)

  stdin/args options:
    -type <content/type> (default: application/xml)

  Other options:
    -filetypes <type>[,<type>,...] (default: xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
    -params "<key>=<value>[&<key>=<value>...]" (values must be URL-encoded; these pass through to Solr update request)
    -out yes|no (default: no; yes outputs Solr response to console)
    -format solr (sends application/json content as Solr commands to /update instead of /update/json/docs)


Examples:

* JSON file: ./bin/post -c wizbang events.json
* XML files: ./bin/post -c records article*.xml
* CSV file: ./bin/post -c signals LATEST-signals.csv
* Directory of files: ./bin/post -c myfiles ~/Documents
* Web crawl: ./bin/post -c gettingstarted http://lucene.apache.org/solr -recursive 1 -delay 1
* Standard input (stdin): echo '{commit: {}}' | ./bin/post -c my_collection -type application/json -out yes -d
* Data as string: ./bin/post -c signals -type text/csv -out yes -d $'id,value\n1,0.47'
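The help text above says that -params values must be URL-encoded. A small helper for that could look like the sketch below; the urlencode function is a hypothetical name of mine, and it assumes python3 is on the PATH.

```shell
# Sketch: percent-encode a single value for use inside -params.
# urlencode is a hypothetical helper, not part of Solr; it relies on
# python3 being available on the PATH.
urlencode() {
  python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$1"
}
```

For example, to pass a semicolon as the CSV separator parameter, the value could be encoded inline: ./bin/post -c signals -params "separator=$(urlencode ';')" LATEST-signals.csv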
