Indexing the filesystem with SQLAlchemy and Postgres

In an attempt to understand how SQLAlchemy works, I’ve written up a sample that indexes a large folder hierarchy.

The script loops over each file in a folder structure (there are several million files here – it’s a small subset of court cases from PACER).

First, connect to the Postgres database. Before running this, you’ll want to install Anaconda, plus SQLAlchemy and a Postgres driver:

pip install sqlalchemy
pip install pg8000

Then, you can set up the connection:

from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+pg8000://user:password@localhost/pacer",  # placeholder URL: substitute your own credentials
    isolation_level="READ UNCOMMITTED")
c = engine.connect()
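A quick sanity check that the connection works is to run a trivial SELECT. A minimal sketch – here an in-memory SQLite engine stands in for Postgres so the snippet runs without a server; with a real database you’d use the Postgres URL instead:

```python
from sqlalchemy import create_engine, text

# SQLite in-memory stands in for e.g. "postgresql+pg8000://user:password@host/dbname"
engine = create_engine("sqlite://")
c = engine.connect()

# A trivial query proves the connection is live; scalar() returns
# the first column of the first row.
result = c.execute(text("SELECT 1")).scalar()
print(result)  # 1
```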

Defining tables is pretty simple:

from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
from sqlalchemy import Column, Integer, String

class PacerFile(Base):
    __tablename__ = 'pacer_files'

    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

    def __repr__(self):
        return "<PacerFile(file_name='%s', file_path='%s')>" % \
               (self.file_name, self.file_path)


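One thing the snippet above doesn’t show: the pacer_files table has to exist in the database before any inserts will succeed. Base.metadata.create_all() emits the necessary CREATE TABLE statements. A self-contained sketch (again with an in-memory SQLite engine standing in for Postgres, so it runs anywhere):

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = 'pacer_files'
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

engine = create_engine("sqlite://")  # stand-in for the Postgres engine

# Emits CREATE TABLE for every model registered on this Base
Base.metadata.create_all(engine)

exists = inspect(engine).has_table("pacer_files")
print(exists)  # True
```

create_all() is idempotent – it skips tables that already exist – so it’s safe to call at startup.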
Then, create an actual database session:

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()

Finally, loop over each file in the hierarchy. I commit periodically, since it lets me see how things are going; it would likely be faster to commit just once at the end.

import os

for root, subFolders, files in os.walk("Q:\\pacer2\\"):
    for f in files:
        pacer_file = PacerFile(file_name=f,
                               file_path=root)
        session.add(pacer_file)
    session.commit()  # periodic commit, so progress is visible as it runs
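Once the index is built, querying it back is the payoff. A self-contained end-to-end sketch – SQLite in-memory stands in for Postgres, and two hand-written rows stand in for the os.walk() loop, so the file names here are made up for illustration:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class PacerFile(Base):
    __tablename__ = 'pacer_files'
    id = Column(Integer, primary_key=True)
    file_name = Column(String)
    file_path = Column(String)

engine = create_engine("sqlite://")  # stand-in for the Postgres engine
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Two fake rows stand in for the os.walk() indexing loop
session.add_all([
    PacerFile(file_name="complaint.pdf", file_path="case1"),
    PacerFile(file_name="docket.html", file_path="case2"),
])
session.commit()

# How many files did we index, and where does a given one live?
total = session.query(PacerFile).count()
match = session.query(PacerFile).filter_by(file_name="docket.html").one()
print(total, match.file_path)  # 2 case2
```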

