I set up an AWS Lambda function to update video metadata for about 75% of FindLectures.com. For version two, I want to offer an option to sort by confidence intervals within a topic (i.e., given the number of up/downvotes, how likely is this video to be good?).
The architecture of the update was as follows:
- For every video that could be updated, make a record in DynamoDB
- For each insert to DynamoDB, trigger one lambda
- For a given video, run youtube-dl to get just video metadata
- Save the video metadata to S3
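The steps above could be sketched roughly as follows. This is a minimal illustration, not the code that actually ran: the `url` attribute name, the bucket name, and the injectable `fetch`/`save` parameters are assumptions made for the example, and it presumes a `youtube-dl` binary is bundled with the function and that the DynamoDB stream includes new images.

```python
import json
import subprocess

BUCKET = "my-metadata-bucket"  # hypothetical bucket name


def urls_from_stream_event(event):
    """Collect the video URL from each INSERT record in a DynamoDB Streams event."""
    urls = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore MODIFY / REMOVE records
        new_image = record["dynamodb"]["NewImage"]
        urls.append(new_image["url"]["S"])  # "url" is an assumed attribute name
    return urls


def fetch_metadata(url):
    """Ask youtube-dl for metadata only; --dump-json prints JSON and downloads nothing."""
    result = subprocess.run(
        ["youtube-dl", "--dump-json", url],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def save_to_s3(bucket, key, metadata):
    """Store the metadata as a JSON object in S3."""
    import boto3  # imported lazily so the pure functions can be tested without it
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=json.dumps(metadata).encode("utf-8")
    )


def handler(event, context, fetch=fetch_metadata, save=save_to_s3):
    """Lambda entry point: one stream event in, one S3 object per inserted video."""
    processed = 0
    for url in urls_from_stream_event(event):
        metadata = fetch(url)
        save(BUCKET, f"{metadata['id']}.json", metadata)
        processed += 1
    return processed
```

Passing `fetch` and `save` as parameters is only there so the handler can be exercised locally with a sample event and fake functions, without touching AWS or YouTube.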
This did not work as well as I had hoped. I inserted 150,000 rows into DynamoDB. AWS ran at most 15 concurrent lambdas, and I only got 3,000 successful runs of the job. I found no evidence of where the concurrency bottleneck lies (the lambda and account concurrency limits are both 1,000, and I gave DynamoDB more read capacity, just in case).
On the other hand, the experiment was nearly free:

| Service | Rate / free tier | Usage | Cost |
|---|---|---|---|
| AWS Lambda Compute | Free Tier - 400,000 GB-Seconds - US East (Northern Virginia) | 42,760.113 seconds | $0.00 |
| AWS Lambda Request | Free Tier - 1,000,000 Requests - US East (Northern Virginia) | 17,651 Requests | $0.00 |
| Amazon Simple Storage Service Requests-Tier1 | $0.005 per 1,000 PUT, COPY, POST, or LIST requests | 16,945 Requests | $0.08 |
| Amazon Simple Storage Service Requests-Tier2 | $0.004 per 10,000 GET and all other requests | 777 Requests | $0.00 |
| Amazon Simple Storage Service TimedStorage-ByteHrs | $0.023 per GB - first 50 TB / month of storage used | 4.350 GB-Mo | $0.10 |

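The compute allowance on the bill is denominated in GB-seconds, i.e. run time multiplied by configured memory. A rough sketch of that arithmetic follows; the 512 MB memory setting and the 2.5-second average duration are illustrative assumptions, not measured values from this job.

```python
FREE_TIER_GB_SECONDS = 400_000   # monthly compute allowance shown on the bill
FREE_TIER_REQUESTS = 1_000_000   # monthly request allowance shown on the bill


def gb_seconds(invocations, avg_duration_s, memory_mb):
    """Compute usage in GB-seconds: invocations x duration x memory in GB."""
    return invocations * avg_duration_s * (memory_mb / 1024)


def max_free_invocations(avg_duration_s, memory_mb):
    """Invocations that fit in the free tier, whichever limit is reached first."""
    by_time = FREE_TIER_GB_SECONDS / (avg_duration_s * memory_mb / 1024)
    return int(min(by_time, FREE_TIER_REQUESTS))


# Hypothetical: all 150,000 rows processed at 2.5 s each with 512 MB of memory.
print(gb_seconds(150_000, 2.5, 512))
print(max_free_invocations(2.5, 512))
```

Under these assumed numbers the time allowance runs out well before the request allowance does, so the request count alone overstates how much work fits in the free tier.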
Some lessons from the experiment:
- If you naively use non-batch APIs (e.g. to upload to or retrieve from DynamoDB), you lose a ton of time, even inside the AWS datacenters
- You need to replicate the AWS environment as much as you can for your own testing. E.g. your first step should be to get a sample JSON payload from DynamoDB.
- There are other tools to help replicate AWS locally (Exodus bundler, for getting binaries for Amazon Linux, and Docker containers)
- The free tier sounds like a lot of requests, but note that it is also metered by time, which is easier to hit in an ETL process like this.
- Lambdas with DynamoDB are a really powerful form of stored procedures – any language, outside of a database transaction.
- If you choose to upload a zip of your lambda, you can run just about any binary, so long as it runs on Amazon Linux and doesn’t write to any filesystem other than /tmp
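To illustrate the point about batch APIs: DynamoDB's BatchWriteItem call accepts up to 25 put or delete requests at a time, so 150,000 rows need 6,000 calls instead of 150,000. A minimal sketch, assuming a boto3 DynamoDB client is passed in as `client` and items are already in the low-level attribute format (e.g. `{"url": {"S": "..."}}`); the function names are made up for this example.

```python
def chunked(items, size=25):
    """Yield successive slices of at most `size` items (25 is the BatchWriteItem limit)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_insert(client, table_name, items):
    """Write items in batches, resending anything DynamoDB reports as unprocessed.

    Returns the number of API calls made.
    """
    calls = 0
    for chunk in chunked(items):
        request = {table_name: [{"PutRequest": {"Item": item}} for item in chunk]}
        while request:
            response = client.batch_write_item(RequestItems=request)
            calls += 1
            # Throttled writes come back here and must be retried by the caller.
            request = response.get("UnprocessedItems") or {}
    return calls
```

A real job would likely sleep with exponential backoff before resending unprocessed items; that is left out here to keep the sketch short. Usage would look like `batch_insert(boto3.client("dynamodb"), "videos", items)`, where `"videos"` is a placeholder table name.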