{"id":2042,"date":"2014-02-21T02:47:30","date_gmt":"2014-02-21T02:47:30","guid":{"rendered":"http:\/\/www.garysieling.com\/blog\/?p=2042"},"modified":"2014-02-21T02:47:30","modified_gmt":"2014-02-21T02:47:30","slug":"indexing-filesystem-sqlalchemy-postgres","status":"publish","type":"post","link":"https:\/\/www.garysieling.com\/blog\/indexing-filesystem-sqlalchemy-postgres\/","title":{"rendered":"Indexing the filesystem with SqlAlchemy and Postgres"},"content":{"rendered":"<p>In attempt to understand how SqlAlchemy works, I&#8217;ve written up a sample indexing a large folder hierarchy.<\/p>\n<p>This loops over each file in a structure (there are several million files here &#8211; it&#8217;s a small subset of court cases from PACER).<\/p>\n<p>First, connect to the postgres database- before running this you want to install Anaconda, and the postgres drivers:<\/p>\n<pre>\npip install sqlalchemy\neasy_install pg8000\n<\/pre>\n<p>Then, you can set up the connection:<\/p>\n<pre lang=\"python\">\nfrom sqlalchemy import *\nengine = create_engine(\n                \"postgresql+pg8000:\/\/postgres:postgres@localhost\/pacer\",\n                isolation_level=\"READ UNCOMMITTED\"\n            )\nc = engine.connect()\n<\/pre>\n<p>Defining tables is pretty simple:<\/p>\n<pre lang=\"python\">\nfrom sqlalchemy.ext.declarative import declarative_base\nBase = declarative_base()\nfrom sqlalchemy import Column, Integer, String\n\nclass PacerFile(Base):\n     __tablename__ = 'pacer_files'\n\n     id = Column(Integer, primary_key=True)\n     file_name = Column(String)\n     file_path = Column(String)\n\n     def __repr__(self):\n        return \"<PacerFile(file_name='%s', file_path='%s')>\" % \\\n               (self.file_name, self.file_path)\n\nBase.metadata.create_all(engine) \n<\/pre>\n<p>Then, create an actual database session:<\/p>\n<pre lang=\"python\">\nfrom sqlalchemy.orm import sessionmaker\nSession = sessionmaker(bind=engine)\n\nSession = sessionmaker()\n\nSession.configure(bind=engine)\n\nsession = Session()\n<\/pre>\n<p>Finally, loop over each file in the hierarchy. I commit periodically, as it lets me see how things are going. Likely it would be faster committing just at the end.<\/p>\n<pre lang=\"python\">\nimport os\nfor root, subFolders, files in os.walk(\"Q:\\\\pacer2\\\\\"):\n  for f in files:\n    pacer_file = PacerFile(file_name=f, \\\n                           file_path=os.path.join(root))\n    session.add(pacer_file)\n\n  session.commit()\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In attempt to understand how SqlAlchemy works, I&#8217;ve written up a sample indexing a large folder hierarchy. This loops over each file in a structure (there are several million files here &#8211; it&#8217;s a small subset of court cases from PACER). First, connect to the postgres database- before running this you want to install Anaconda, &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.garysieling.com\/blog\/indexing-filesystem-sqlalchemy-postgres\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Indexing the filesystem with SqlAlchemy and Postgres&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[4],"tags":[412,447,525],"aioseo_notices":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2042"}],"collection":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/comments?post=2042"}],"version-history":[{"count":0,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/posts\/2042\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/media?parent=2042"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/categories?post=2042"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.garysieling.com\/blog\/wp-json\/wp\/v2\/tags?post=2042"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}