Tom Vincent

Prometheus backfilling

2021-01-06T17:05:18+00:00

Backfill support for Prometheus has long been requested and, with the v2.24.0 release, it is finally here!

OpenMetrics primer

Prometheus’ backfilling currently only supports the OpenMetrics format, which is a simple text (or protobuf) representation for metrics.

For example:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",service="user"} 123 1609954636
http_requests_total{code="500",service="user"} 456 1609954730
# EOF

… where HELP and TYPE are MetricFamily metadata giving a brief description of the metric family (set) and its data type. The http_requests_total metric family contains two metrics; both with comma-separated labels, a value and a timestamp (Unix time).

Note, the file (“exposition”) must end with # EOF.
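
As a rough illustration of the format, the example above could be generated programmatically. This is a minimal sketch, not a Prometheus API: formatOpenMetrics and the shape of its input are names made up here, and label values are assumed not to need escaping.

```javascript
// Sketch: render samples as OpenMetrics text.
// `formatOpenMetrics` and its input shape are illustrative assumptions.
function formatOpenMetrics({ name, help, type, samples }) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} ${type}`]

  for (const { labels, value, timestamp } of samples) {
    // Labels are rendered as comma-separated key="value" pairs (no escaping here)
    const labelText = Object.entries(labels)
      .map(([key, labelValue]) => `${key}="${labelValue}"`)
      .join(',')
    lines.push(`${name}{${labelText}} ${value} ${timestamp}`)
  }

  // The exposition must be terminated by the EOF marker
  lines.push('# EOF')
  return lines.join('\n') + '\n'
}

const exposition = formatOpenMetrics({
  name: 'http_requests_total',
  help: 'The total number of HTTP requests.',
  type: 'counter',
  samples: [
    { labels: { code: '200', service: 'user' }, value: 123, timestamp: 1609954636 },
    { labels: { code: '500', service: 'user' }, value: 456, timestamp: 1609954730 },
  ],
})

process.stdout.write(exposition)
```

Redirecting the output to a file gives an input that should be suitable for the backfill command described next.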

Backfilling

The new backfilling support is implemented as promtool’s tsdb create-blocks-from openmetrics subcommand. Let’s give it a try.

First ensure you’re running v2.24.0 or later. Binary releases are conveniently provided if it has yet to land in your distribution.

If we launch prometheus with its default configuration, a data directory is created with the following contents:

❯ tree data
data
├── chunks_head
├── lock
├── queries.active
└── wal
    └── 00000000

2 directories, 3 files

Let’s run the backfill command against the example above, saved to a file named metrics:

❯ ./promtool tsdb create-blocks-from openmetrics metrics
BLOCK ULID                  MIN TIME       MAX TIME       DURATION     NUM SAMPLES  NUM CHUNKS   NUM SERIES   SIZE
01EVCJ6E3XKHCY35AEYYWQB61N  1609954636000  1609954730001  1m34.001s    2            2            2            805

The new block is created in the data directory (by default):

❯ tree data
data
├── 01EVCJ6E3XKHCY35AEYYWQB61N
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── chunks_head
├── lock
├── queries.active
└── wal
    └── 00000000

4 directories, 7 files

Restart prometheus, query on the http_requests_total metric name, switch to the graph view and there we have it; backfilled metrics.

Note, backfilled data is subject to the server’s retention configuration, both size and time. Set these to values that make sense for your data.

Use cases

Why’s backfilling useful? Some ideas:

Migrating historic data to Prometheus
Restoring metrics after system downtime
Generating fake metrics to be used as seed data, for example:

#!/usr/bin/env bash
set -euo pipefail

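# Start of the previous hour, e.g. 2021-01-06T16 (requires GNU date).
# Shell arithmetic on $(date +%H) breaks on 08/09 (parsed as octal) and at midnight.
dateHour="$(date -d '1 hour ago' +%Y-%m-%dT%H)"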

cat << EOF
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
EOF

for i in {0..59}; do
  for status in 200 500; do
    echo "http_requests_total{code=\"$status\",service=\"user\"} $RANDOM $(date -d "${dateHour}:$(printf %02g "$i"):00" +%s)"
  done
done

echo "# EOF"

Decorated Lambda handlers

2020-03-31T00:00:00+00:00

The main sell of AWS Lambda (and Functions as a Service in general) is the ability to shift developer attention away from infrastructure to the business logic. Nonetheless, there are a number of cross-cutting concerns that Lambdas need to handle. This post outlines some of these and how they can be addressed.

Note, this focuses on the Node.js runtime, but the same principles can be applied to others.

Structured logging

As every Lambda function is automatically set up with an AWS CloudWatch Logs log group, debugging can be as simple as adding a console.log. This can often be enough for simpler cases, but as projects grow, so does the need for logs. Perhaps your system is composed of multiple Lambdas and you need to search across them, or you need to run aggregations. Whilst this can be solved with regexes, writing logs in a machine-readable format such as JSON simplifies parsing and querying.

Pino is a lightweight structured logging library that works well with Lambda. Using its base option, we can decorate all log lines with the Lambda’s runtime context:

const pino = require('pino')

const logger = pino({
  base: {
    memorySize: process.env.AWS_LAMBDA_FUNCTION_MEMORY_SIZE,
    region: process.env.AWS_REGION,
    runtime: process.env.AWS_EXECUTION_ENV,
    version: process.env.AWS_LAMBDA_FUNCTION_VERSION,
  },
  name: process.env.AWS_LAMBDA_FUNCTION_NAME,
  level: process.env.LOG_LEVEL || 'info',
  useLevelLabels: true,
})

exports.handler = () => {
  logger.info({ uuid: 'foo' }, 'hello world')
}

Results in logs such as:

{
  "level": "info",
  "memorySize": "128",
  "msg": "hello world",
  "name": "my-lambda",
  "region": "eu-west-2",
  "runtime": "AWS_Lambda_nodejs12.x",
  "time": 1493426328206,
  "uuid": "foo",
  "v": 1,
  "version": "$LATEST"
}

CloudWatch Logs has first-party support for JSON filters. For example, to filter log lines containing the foo UUID, use { $.uuid = "foo" }.
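
To see why JSON lines simplify querying, here is a small local stand-in for that filter. It is only a sketch: filterLogLines is a hypothetical helper rather than anything from an AWS SDK, and the sample lines are invented.

```javascript
// Sketch: a local equivalent of the CloudWatch filter { $.uuid = "foo" }.
// `filterLogLines` is a hypothetical helper, not part of any AWS SDK.
function filterLogLines(lines, field, expected) {
  return lines.filter(line => {
    try {
      return JSON.parse(line)[field] === expected
    } catch (err) {
      // Skip anything that is not a JSON object, such as plain-text lines
      return false
    }
  })
}

const lines = [
  'a plain-text line that is not JSON',
  '{"level":"info","msg":"hello world","uuid":"foo"}',
  '{"level":"info","msg":"goodbye","uuid":"bar"}',
]

const matches = filterLogLines(lines, 'uuid', 'foo')
console.log(matches)
```

No regular expression is needed; the field is addressed by name, which is what the CloudWatch filter syntax does on the server side.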

Instrumentation

As a distributed system grows, debugging becomes harder. Microservice and serverless architectures are composed of many services interacting with each other. When there’s a problem, it can be difficult to identify which service in the mesh is at fault.

Yan Cui’s Capture and forward correlation IDs through different Lambda event sources outlines how correlation IDs can be used to alleviate this. In the same way as identifiers such as a uuid can be logged to provide context, other identifiers can be used to thread messages together as they flow through the system.

AWS Lambda includes awsRequestId in its context object, which is unique per invocation. When set up as an integration in API Gateway, this provides a way to trace a request back to its initial API call. However, this ID is not automatically forwarded to further downstream services, e.g. other AWS services or third-party APIs.

AWS X-Ray is a fully-featured tracing system that provides this functionality out of the box. In automatic mode (the default), all outgoing HTTP(S) requests can be instrumented using the captureHTTPsGlobal method:

const https = require('https')
const got = require('got')
const AWSXRay = require('aws-xray-sdk-core')

exports.handler = async () => {
  // Patch the core https module so outgoing requests are traced
  AWSXRay.captureHTTPsGlobal(https)
  await got('https://tlvince.com')
}

Note, this works by monkey patching the core Node.js http/https modules, which can be dangerous. Alternatively, X-Ray’s scope can be reduced to AWS calls using the captureAWS method.

For completeness, we can also add the X-Ray trace ID as well as the awsRequestId to the logs for easier cross-referencing. One gotcha to remember is that neither ID is set until the function is invoked, so both will be undefined if referenced in the function’s global context rather than inside its handler. To work around this, use a Pino child logger:

const https = require('https')
const got = require('got')
const pino = require('pino')
const AWSXRay = require('aws-xray-sdk-core')

const parentLogger = pino({
  base: {
    memorySize: process.env.AWS_LAMBDA_FUNCTION_MEMORY_SIZE,
    region: process.env.AWS_REGION,
    runtime: process.env.AWS_EXECUTION_ENV,
    version: process.env.AWS_LAMBDA_FUNCTION_VERSION,
  },
  name: process.env.AWS_LAMBDA_FUNCTION_NAME,
  level: process.env.LOG_LEVEL || 'info',
  useLevelLabels: true,
})

exports.handler = (event, context) => {
  AWSXRay.captureHTTPsGlobal(https)

  const logger = parentLogger.child({
    traceId: process.env._X_AMZN_TRACE_ID,
    awsRequestId: context.awsRequestId,
  })

  logger.info({ uuid: 'foo' }, 'hello world')
}

Event validation

Probably the most important technical concern for any externally-facing service is to validate its inputs. Doing this upfront helps guard against malformed (or malicious) events, helps simplify property references within the business logic and can also help reduce costs by short-circuiting the function early.

Depending on your needs, a JSON Schema validator such as ajv is typically the go-to option. validate is a lightweight alternative, which gains expressiveness at the expense of schema interoperability.

An example for SQS events:

const Schema = require('validate')

const schema = new Schema({
  Records: [
    {
      body: {
        type: String,
        required: true,
      },
    },
  ],
})

exports.handler = event => {
  const errors = schema.validate(event, { strip: false })
  if (errors.length) {
    throw new Error(errors)
  }
}

Note, { strip: false } is used to ensure validate does not mutate the event object.

Environment variable validation

In the same manner as input event validation, environment variables can be validated via a simple process.env check:

const requiredEnvs = ['FOO']
const missingEnvs = requiredEnvs.filter(requiredEnv => !process.env[requiredEnv])
if (missingEnvs.length) {
  throw new Error(`missing environment variables ${missingEnvs}`)
}
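
The same check can be wrapped in a function so it can be exercised without touching the real environment. A minimal sketch, where assertEnvs is a hypothetical helper name and the env argument exists only to make it testable:

```javascript
// Sketch: the check above as a reusable function.
// `assertEnvs` is a hypothetical name; `env` defaults to the real environment.
function assertEnvs(requiredEnvs, env = process.env) {
  const missingEnvs = requiredEnvs.filter(requiredEnv => !env[requiredEnv])
  if (missingEnvs.length) {
    throw new Error(`missing environment variables ${missingEnvs.join(', ')}`)
  }
}

// Passes: FOO is present
assertEnvs(['FOO'], { FOO: 'bar' })

// Throws: BAR and BAZ are absent
try {
  assertEnvs(['FOO', 'BAR', 'BAZ'], { FOO: 'bar' })
} catch (err) {
  console.log(err.message)
}
```

Note that an empty string counts as missing here, which is usually the desired behaviour for configuration values.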

Reusing HTTP connections

A neat performance optimisation I learnt from Matt Lavin’s Node Summit 2018 talk was that HTTP connections can be reused. By default, Node.js’s HTTP agent does not use keep-alive and therefore every request incurs the overheads of establishing a new TCP connection.

Since the majority of HTTP requests made by Lambdas are to other AWS services, it makes sense to scope this optimisation first and observe its effect:

const AWS = require('aws-sdk')
const https = require('https')

const agent = new https.Agent({
  keepAlive: true,
})

AWS.config.update({
  httpOptions: {
    agent,
  },
})

Since aws-sdk 2.463.0, this is further simplified by setting the AWS_NODEJS_CONNECTION_REUSE_ENABLED environment variable to 1. The configuration can therefore be removed from the handler and moved to your infrastructure as code tool of choice.

Decorator example

Each of these concerns can be combined together into a re-usable decorator function. For example:

const https = require('https')

const pino = require('pino')
const Schema = require('validate')
const AWSXRay = require('aws-xray-sdk-core')

const parentLogger = pino({
  base: {
    memorySize: process.env.AWS_LAMBDA_FUNCTION_MEMORY_SIZE,
    region: process.env.AWS_REGION,
    runtime: process.env.AWS_EXECUTION_ENV,
    version: process.env.AWS_LAMBDA_FUNCTION_VERSION,
  },
  name: process.env.AWS_LAMBDA_FUNCTION_NAME,
  level: process.env.LOG_LEVEL || 'info',
  useLevelLabels: true,
})

module.exports = ({ handler, requiredEnvs = [], eventSchema = {} }) => (
  event,
  context
) => {
  AWSXRay.captureHTTPsGlobal(https)

  const logger = parentLogger.child({
    traceId: process.env._X_AMZN_TRACE_ID,
    awsRequestId: context.awsRequestId,
  })

  const schema = new Schema(eventSchema)
  const errors = schema.validate(event, { strip: false })
  if (errors.length) {
    logger.debug({ errors }, 'event validation errors')
    throw new Error(errors)
  }

  const missingEnvs = requiredEnvs.filter(
    requiredEnv => !process.env[requiredEnv]
  )

  if (missingEnvs.length) {
    logger.debug({ missingEnvs }, 'missing environment variables')
    throw new Error(`missing environment variables ${missingEnvs}`)
  }

  return handler(event, context, { logger })
}

The Lambda handler body itself can then be simplified to focus on the business logic, besides a few lines of configuration:

const decoratedHandler = require('./handler-decorator')

const handler = async (event, context, { logger }) => {
  logger.debug('reached Lambda handler')
  return event.Records.map(record => record.body)
}

exports.handler = decoratedHandler({
  handler,
  requiredEnvs: ['FOO'],
  eventSchema: {
    Records: [
      {
        body: {
          type: String,
          required: true,
        },
      },
    ],
  },
})
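
One benefit of this shape is that a decorated handler can be invoked locally with a fake event and context. The sketch below reduces the pattern to its core so it runs without Pino, validate or X-Ray; decorate and the console-based logger are stand-ins invented for illustration, not the decorator shown above.

```javascript
// Sketch: the decorator pattern without third-party dependencies.
// `decorate` and its console-based logger are illustrative stand-ins.
const decorate = ({ handler, requiredEnvs = [] }) => (event, context) => {
  // Per-invocation logger carrying the request ID
  const logger = {
    info: (fields, msg) =>
      console.log(JSON.stringify({ awsRequestId: context.awsRequestId, ...fields, msg })),
  }

  const missingEnvs = requiredEnvs.filter(name => !process.env[name])
  if (missingEnvs.length) {
    throw new Error(`missing environment variables ${missingEnvs}`)
  }

  // Cross-cutting concerns done; hand over to the business logic
  return handler(event, context, { logger })
}

const handler = async (event, context, { logger }) => {
  logger.info({ records: event.Records.length }, 'reached Lambda handler')
  return event.Records.map(record => record.body)
}

const decorated = decorate({ handler })

// Invoke locally with a fake SQS-style event and context
const result = decorated(
  { Records: [{ body: 'a' }, { body: 'b' }] },
  { awsRequestId: 'local-test' }
)
result.then(bodies => console.log(bodies))
```

Because the logger is passed in as the third argument, the business logic never needs to know how it was constructed.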

Conclusion

By extracting noisy yet necessary boilerplate, Lambda handlers can be kept lean and focussed on their business logic. A number of cross-cutting concerns were discussed, with an approach to encapsulate them using a reusable function following the decorator pattern. Alternatives include middy, a more pluggable, middleware-based approach or lambda_decorators for the Python runtime.

Terraforming Lambdas

2020-02-07T00:00:00+00:00

When provisioning a Lambda function with Terraform, one gotcha to remember is that Terraform expects the deployment package to exist before it can create the function itself. Put another way, the infrastructure code depends on the application code.

One way of handling this is to manage both the function logic and its provisioning in Terraform using a local file deployment package.

This ensures Terraform can build out its dependency graph correctly and so can create the deployment package before the function.

However, there are a number of downsides to this approach. Firstly, as the docs mention, Terraform is not optimised for large file uploads: it supports neither multi-part nor resumable uploads.

Secondly, because source_code_hash is a computed property (its value isn’t known until terraform apply is run), Terraform is often overly cautious in deciding when the deployment package has changed. More often than not, this results in Terraform creating a new version (and therefore reuploading the deployment package) on every run.
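
For context, source_code_hash is expected to hold the base64-encoded SHA-256 digest of the package file. The sketch below computes that digest in Node.js so a local package can be compared by hand; sourceCodeHash is a hypothetical helper name, and the buffer stands in for a real zip file.

```javascript
const crypto = require('crypto')

// Sketch: base64-encoded SHA-256 digest, the encoding used for source_code_hash.
// `sourceCodeHash` is a hypothetical helper name.
const sourceCodeHash = buffer =>
  crypto.createHash('sha256').update(buffer).digest('base64')

// In practice the buffer would come from reading the deployment package,
// e.g. fs.readFileSync('/path/to/package.zip')
const hash = sourceCodeHash(Buffer.from('exports.handler = () => {}'))
console.log(hash)
```

Since zip archives embed file timestamps, rebuilding a package from unchanged sources can still yield a different digest, which is one plausible reason for the spurious changes described above.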

Decoupling application code from Terraform

Another approach is to decouple infrastructure from application code. In this approach, Terraform creates a placeholder deployment package to fulfil its dependency requirement and the deployment of the real application code is managed outside of Terraform, ideally in its own automation step.

An implementation of this (in Terraform 0.12.x) uses the archive_file data source along with the s3_bucket and s3_key attributes of the Lambda resource:

data "archive_file" "my_lambda_placeholder_zip" {
  type        = "zip"
  output_path = "${path.module}/lambda/my_lambda.zip"

  source {
    content  = "exports.handler = () => {}"
    filename = "index.js"
  }
}

# aws_s3_bucket.deployment is assumed to be defined elsewhere
resource "aws_s3_bucket_object" "my_lambda" {
  bucket = aws_s3_bucket.deployment.id
  key    = "lambda/my-lambda.zip"
  source = data.archive_file.my_lambda_placeholder_zip.output_path
}

resource "aws_lambda_function" "my_lambda" {
  function_name = "my-lambda"
  description   = "Decoupled Lambda deployment example"
  s3_bucket     = aws_s3_bucket.deployment.id
  s3_key        = aws_s3_bucket_object.my_lambda.key
  handler       = "index.handler"
  runtime       = "nodejs12.x"

  # Required by the resource; assumes an IAM role named my_lambda is defined elsewhere
  role = aws_iam_role.my_lambda.arn
}

The application deployment step is then a few lines of shell:

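#!/bin/sh
# Abort on the first failing command
set -e

cd /path/to/my-lambda
npm run build
cd dist
zip -9rX "my-lambda.zip" .
# Still inside dist, so reference the archive relative to it
aws lambda update-function-code \
  --function-name "my-lambda" \
  --zip-file "fileb://my-lambda.zip"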

Conclusion

By decoupling infrastructure from application provisioning in Terraform, we trade managing part of the stack outside of Terraform for the ability to optimise the deployment of application code. Issues surrounding change detection on often large deployment artefacts are resolved and uploads are more efficiently handled by the AWS CLI.

Typically, a function’s configuration and dependent infrastructure change less often than the application logic itself. By decoupling the two, the risk of failure between infrastructure changesets is reduced.

Lambdaless

2019-01-01T00:00:00+00:00

Let’s assume you need to expose a JSON file behind an API. Using a serverless approach with AWS, you might first reach for an architecture like the following:

… i.e. an API Gateway in front of a Lambda, which calls S3. Alternatively, did you know you could remove the Lambda and have API Gateway call S3 directly?

This is what I call “Lambdaless”. It leverages API Gateway’s AWS integration type, which allows you to expose any AWS service without any intermediate application logic. Mapping templates provide the glue to transform request/responses, using the Velocity templating language (VTL) and JSONPath expressions.

Walkthrough

Continuing with the S3 example above, create an API Gateway with a GET method and set up the integration request per the following:

choose the AWS service type, region and Simple Storage Service (S3)
select the GET HTTP method
select the “use path override” action type
enter the object’s <bucket>/<prefix> in the path override field

Create an IAM role with a policy that grants s3:GetObject permission on your <bucket>/<prefix> and a trust relationship that allows API Gateway to assume it. Then switch to the test view, click “test” and you should see the contents of your JSON object in the response body.

Examples

Mock integration

Taking the JSON example to its logical conclusion, we can go a step further and remove S3 from the equation altogether. Choose the MOCK integration type, add the required {"statusCode": 200} request mapping template and move the contents of your JSON object to the integration response mapping template.

This approach typically yields ~3ms response times (compared to ~65ms with the additional hop to S3) and is a good solution for static data.

DynamoDB

Simple CRUD APIs with DynamoDB are a great fit for Lambdaless. API Gateway’s $context variables include $context.requestId, which can be used as an entity’s UUID, along with $context.requestTimeEpoch for created/updated-at timestamps.

Request/response templates can be used to convert to/from DynamoDB’s data type descriptors, for example:

#set($inputRoot = $input.path('$'))
{
  "TableName": "my-table",
  "Key": {
    "uuid": {
      "S": "$context.requestId"
    }
  },
  "Item": {
    "uuid": {
      "S": "$context.requestId"
    },
    "name": {
      "S": "$inputRoot.name"
    },
    "items": {
      "L": [
        #foreach($item in $inputRoot.items)
        {
          "S": "$item"
        }#if($foreach.hasNext),#end
        #end
      ]
    },
    "createdAt": {
      "N": "$context.requestTimeEpoch"
    }
  }
}
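
For readers more at home in JavaScript than VTL, the Item portion of the template above corresponds roughly to the function below. This is only a sketch for comparison: toPutItemRequest is a hypothetical name, and requestId and requestTimeEpoch stand in for $context.requestId and $context.requestTimeEpoch.

```javascript
// Sketch: the Item mapping from the template, written as a plain function.
// `toPutItemRequest` is a hypothetical name, used here for comparison only.
function toPutItemRequest(input, { requestId, requestTimeEpoch }) {
  return {
    TableName: 'my-table',
    Item: {
      uuid: { S: requestId },
      name: { S: input.name },
      // Each list element carries its own type descriptor
      items: { L: input.items.map(item => ({ S: String(item) })) },
      // DynamoDB numbers are transmitted as strings
      createdAt: { N: String(requestTimeEpoch) },
    },
  }
}

const request = toPutItemRequest(
  { name: 'basket', items: ['apple', 'pear'] },
  { requestId: 'example-request-id', requestTimeEpoch: 1546300800000 }
)
console.log(JSON.stringify(request, null, 2))
```

The mapping template performs the same conversion inside API Gateway, which is what removes the need for a Lambda in between.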

Other ideas

use the HTTP_PROXY integration to bypass region-locked websites
pump events into an SQS queue
raise AWS Support tickets using your existing customer service solution

Advantages

A simple Lambda may seem innocuous at first, but each function comes with its own maintenance cost including:

maintaining the application code
maintaining dependencies
any CI/CD tooling around delivering that code
performing runtime upgrades
security scanning
configuring monitoring and alerts (e.g. CloudWatch)
configuring instrumentation (e.g. X-Ray)

Removing a Lambda means fewer resources to maintain, test and pay for. Latency is also reduced: there are fewer hops in the chain and the issue of cold starts disappears.

Disadvantages

There are, however, a number of drawbacks to consider with this Lambdaless method. Probably most apparent is the fact that you can only integrate with a single service at a time. This limits the approach to simple integrations and rules out complex logic, e.g. joins.

Velocity, whilst offering some level of control flow such as if/else and loops, as well as AWS’s own extensions such as util functions, is somewhat of a niche language and introduces its own complexity over using your Lambda runtime language of choice (e.g. JavaScript, Python).

This approach is also tightly coupled with API Gateway. The AWS integration type and request/response mapping template approach is unique to API Gateway and therefore is less portable than Lambda application logic (which is easier to abstract from the Lambda environment itself).

It also relies on “low-level” AWS APIs, which are less accessible and often sparsely documented compared to their corresponding SDK wrappers.

Pandoc on TravisCI

2017-08-19T00:00:00+00:00

A few approaches to running Pandoc on TravisCI.

1. sudo & apt-get

Using Travis’ standard infrastructure, you can simply use apt-get:

sudo: true
before_install:
  - sudo apt-get -qq update
  - sudo apt-get install -y pandoc

Depending on what Travis’ current Linux environment is (Ubuntu Trusty at the time of writing), this may be all you need. However, you may be limited to an old version of Pandoc (Trusty currently has v1.12.2).

2. Without sudo & APT addon

Using Travis’ container infrastructure (Docker), as pandoc is in the APT addon whitelist, you can do:

addons:
  apt:
    packages:
      - pandoc

However, as before, this limits you to the version of pandoc currently in the Ubuntu repos.

3. With sudo, without an APT repo

As pandoc helpfully ships .deb packages in its GitHub releases, you can download the .deb and install it manually.

sudo: true
before_install:
  - curl -L https://github.com/jgm/pandoc/releases/download/1.19.2.1/pandoc-1.19.2.1-1-amd64.deb > pandoc.deb
  - sudo dpkg -i pandoc.deb

The benefit here being you can choose any version of Pandoc, so long as they continue to ship a .deb for the right architecture.

4. Without sudo, without an APT repo

Taking the above further, we manually extract the .deb without sudo and thereby have faster job startup times (sudo/non-container based infrastructure jobs take ~20 secs to spin up).

before_install:
  - curl -L https://github.com/jgm/pandoc/releases/download/1.19.2.1/pandoc-1.19.2.1-1-amd64.deb > pandoc.deb
  - dpkg -x pandoc.deb .
  - export PATH="$PWD/usr/bin:$PATH"

Note, this only works as Pandoc is built statically and is liable to break. However, coupled with caching, this method produces the fastest builds with arbitrary Pandoc versions.

See tlvince/talks/.travis.yml for a version with caching.

Composable Yeoman Generators

2014-08-08T11:38:49+00:00

Yeoman generator v0.17.0 included a useful new feature dubbed composability. If you’ve ever wanted to reuse generators by calling one from another, this is the feature you’ve been waiting for. Here’s a quick overview of how you might use it.

Creating a generator

Let’s begin by creating a new generator. The Yeoman team have made it trivial to get started via generator-generator, so let’s fire it up:

npm install -g yo generator-generator
mkdir my-generator && cd my-generator
yo generator
cd generator-my-generator

Note, as of generator-generator v0.4.4, an older version of yeoman-generator without composability support is used. So first confirm that "yeoman-generator": "~0.17.0" is listed in package.json or update accordingly.

generator-generator produces commonly used templates such as .jshintrc and .editorconfig for us, but wouldn’t it be nice if these were maintained elsewhere? That’s where generator-common comes in.

Composability

Here we’ll use composeWith to programmatically call generator-common from our new generator. Let’s remove the pre-generated templates and methods:

rm -rf app/templates

app/index.js:

'use strict';

var yeoman = require('yeoman-generator');

var MyGeneratorGenerator = yeoman.generators.Base.extend({
 // Prototype methods
});

module.exports = MyGeneratorGenerator;

By default, Yeoman calls every method in the generator’s prototype in sequence, so let’s add a new method, templates, that calls generator-common:

var MyGeneratorGenerator = yeoman.generators.Base.extend({
  templates: function() {
    this.composeWith('common', {});
  }
});

Let’s give it a try:

npm link
yo my-generator

If you haven’t previously installed generator-common, you’ll likely be shown an error similar to:

You don’t seem to have a generator with the name common installed.

By default, composeWith hooks into npm’s peerDependencies to resolve a generator. (If you’re not familiar, a peer dependency is one that is installed as a sibling).

So let’s indicate generator-common is a peer by appending it to package.json’s peerDependencies block:

"peerDependencies": {
  "yo": ">=1.0.0",
  "generator-common": ">=0.2.0"
}

Note, I’ve followed Yeoman’s recommendation of using a greater-than-or-equal-to version range to prevent conflicts.

Let’s install generator-common and give our generator another spin:

npm install -g generator-common
yo my-generator

All being well, you’ll see Yeoman’s noble face and your generated templates.

     _-----_
    |       |    .--------------------------.
    |--(o)--|    |   Welcome to the Yeoman  |
   `---------´   |     Common generator!    |
    ( _´U`_ )    '--------------------------'
    /___A___\
     |  ~  |
   __'.___.'__
 ´   `  |° ´ Y `

Conclusion

We’ve barely scratched the surface of composeWith’s potential, but have covered just enough to get you started. See Yeoman’s composability documentation for further information and tlvince/generator-my-generator for this tutorial’s source.

AngularJS chained modules

2014-03-11T18:47:52+00:00

Using var mod = angular.module('MyModule', []) to declare a module? Don’t.

As this plunkr demonstrates, mod will be accessible on the global scope (i.e. window.mod).

Same goes for var ctrl = mod.controller('MyCtrl').

As you’ve no doubt heard, this is a bad idea as anything on window can be unwittingly overwritten. As a case in point, try uncommenting line 6, then line 35 in the aforementioned plunkr and opening up your browser’s console: window.angular is no more.

Unfortunately, Angular’s own documentation gives examples in this way, for example the module docs (correct as of 78165c224d).

Using a “chained” module definition alleviates this problem, such as:

angular.module('MyModule', []).controller('MyCtrl', function() {})

If your modules are starting to get large, use the “module retrieval” syntax (omit the dependency array argument) to get a reference to a previously declared module and continue the module definition in another file, e.g.:

angular.module('MyModule', [])

angular.module('MyModule').controller('MyCtrl', function() {})

Note: be careful not to pass the dependency array a second time as it will overwrite the previous module declaration!
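
The global leak described at the start of this post can be reproduced without Angular or a browser. The sketch below uses Node’s vm module to run two scripts, each against its own global object; the object literal assigned to mod is a stand-in for a real Angular module.

```javascript
const vm = require('vm')

// Each script runs against its own fresh global object, much like a browser page
const leaky = {}
vm.runInNewContext("var mod = { name: 'MyModule' }", leaky)

const contained = {}
vm.runInNewContext("(function () { var mod = { name: 'MyModule' } })()", contained)

console.log('mod' in leaky) // true: the top-level var became a global property
console.log('mod' in contained) // false: the function scope kept it private
```

Chaining achieves the same containment as the function wrapper, since no variable is declared in the first place.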

Startup programming

2013-11-04T23:30:00+00:00

Earlier in the year, I asked for advice on how to start a programming career in tech startups. Fast forward eight months; landing four job offers, numerous freelancing gigs and founding my own consultancy, here’s the advice I was given that has stuck with me:

The best way to get high quality attention these days is by maintaining a strong GitHub profile and communicating your skills via a blog. Contribute to any open source projects you love and make that fact public.

Learn Ruby (and by extension, Ruby on Rails), strengthen your HTML/CSS chops, know JavaScript inside out; keep your skills fresh, practice and read tech blogs.

Eat, sleep and breathe Test Driven and Behaviour Driven Development. It’s the way of the future for a long, long time.

Hang out at hacker events; go to conferences, hackathons and workshops. Make your presence known. People will be beating down your door to hire you.

Finally (and most importantly), don’t waste your time at companies who don’t practice pair programming in an agile environment. Pair programming is the fastest, most fun way to learn how to be a great programmer. It does wonders for your communication and teaching skills.

Everyone needs a programmer that can communicate well with others and collaborate to get things done.

Post-PhotoRec Strategies

2012-12-20T00:00:00+00:00

If you’ve ever been in the unfortunate situation where your hard disk fails beyond recognition (like mine did), then you’ve likely come across a low-level file recovery tool called PhotoRec.

PhotoRec does a fantastic job of recovering files by matching byte headers with signatures of known file formats. At the time of writing, it recognises over 440 file formats, which covers just about every format you’re likely to encounter day-to-day.

However, the challenge after using PhotoRec is what to do with its output; the unavoidable result of the data carving technique it uses is that the underlying directory tree and file names are lost. You are therefore left with a flat-level tree containing thousands of seemingly nonsensical files with file names such as f1191548088.txt… Not particularly useful.

This post looks at a few approaches you can use to organise the recovered files.

Sorting strategies

Let’s look at a few strategies to sort through the mess:

Sort by file extension
Hash audit
Remove corrupt files
Rename using metadata

Sort by file extension

PhotoRec’s After Using PhotoRec wiki page lists a few methods to sort files after using the tool. The mentioned Python script collates each file by its file extension. Whilst by no means fully solving the problem, this method can help in combination with other approaches. Although unlikely, this may also be of use if the file system in use has a maximum files per directory limit, such as FAT32.
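
The idea is simple enough to sketch in a few lines of Node.js. This is not the script from the wiki: sortByExtension is a hypothetical helper, it only handles a single flat directory, and the demonstration runs against a throwaway temporary directory rather than real recovered data.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Sketch: move every file in `dir` into a sub-directory named after its extension.
// `sortByExtension` is a hypothetical helper, not the script from the wiki.
function sortByExtension(dir) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (!entry.isFile()) continue
    // Normalise case so .JPG and .jpg end up together
    const ext = path.extname(entry.name).slice(1).toLowerCase() || 'unknown'
    const target = path.join(dir, ext)
    fs.mkdirSync(target, { recursive: true })
    fs.renameSync(path.join(dir, entry.name), path.join(target, entry.name))
  }
}

// Demonstrate against a throwaway directory
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'photorec-'))
for (const name of ['f1191548088.txt', 'f0000001.jpg', 'f0000002.JPG']) {
  fs.writeFileSync(path.join(demoDir, name), '')
}
sortByExtension(demoDir)
console.log(fs.readdirSync(demoDir).sort())
```

Try it on a copy of the recovered files first, as the moves are not reversible without a record of the original layout.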

Hash audit

hashdeep, a program that computes and matches hashsets, has an audit function that can compare file hashes against a known set. If you have a known-good backup, this can be an effective way to determine which files you already have and then prune them from PhotoRec’s set.
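
To illustrate the principle (not hashdeep’s implementation), the sketch below hashes the files in a backup directory and reports which recovered files have identical contents. findAlreadyBackedUp is a hypothetical helper; it is non-recursive and reads whole files into memory, so large files would want a streaming hash instead.

```javascript
const crypto = require('crypto')
const fs = require('fs')
const os = require('os')
const path = require('path')

const sha256 = file =>
  crypto.createHash('sha256').update(fs.readFileSync(file)).digest('hex')

// Sketch: list files in `recoveredDir` whose contents already exist in `backupDir`.
// `findAlreadyBackedUp` is a hypothetical helper and does not recurse.
function findAlreadyBackedUp(recoveredDir, backupDir) {
  const listFiles = dir =>
    fs.readdirSync(dir, { withFileTypes: true })
      .filter(entry => entry.isFile())
      .map(entry => path.join(dir, entry.name))

  const known = new Set(listFiles(backupDir).map(sha256))
  return listFiles(recoveredDir).filter(file => known.has(sha256(file)))
}

// Demonstrate against throwaway directories
const recovered = fs.mkdtempSync(path.join(os.tmpdir(), 'recovered-'))
const backup = fs.mkdtempSync(path.join(os.tmpdir(), 'backup-'))
fs.writeFileSync(path.join(backup, 'holiday.jpg'), 'same bytes')
fs.writeFileSync(path.join(recovered, 'f0000001.jpg'), 'same bytes')
fs.writeFileSync(path.join(recovered, 'f0000002.jpg'), 'only in recovery')

const duplicates = findAlreadyBackedUp(recovered, backup)
console.log(duplicates.map(file => path.basename(file)))
```

Matching on content rather than name is what makes this work, given PhotoRec discards the original file names.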

Rename using metadata

A fortunate side-effect of using binary formats is that metadata is often saved alongside its content. Depending on the format, a number of tools can be used to re-organise the recovered file without reliance on file names.

Photos

In the case of photos, we can use the excellent exiftool to rebuild a directory tree based on their timestamp:

exiftool -r '-FileName<CreateDate' -d %Y/%m/%Y%m%d_%H%M%S%%-c.%%e [files]

Music

Music can be handled elegantly using MusicBrainz Picard. For a given audio file, it will use acoustic fingerprinting techniques to generate a hash of said file and then query it against the MusicBrainz database to determine its contents.

Be sure to read through Picard’s how-to guide, particularly the clustering function, which greatly speeds up the querying process. Also, at the time of writing, the latest release of Picard (v1.2) contains a memory leak which causes it to hang when dealing with large datasets. Try running the development version (the issue is resolved in pull-requests #143 and #146) if you experience this.

Alternatively, many cloud-based music platforms such as Google Play Music or iTunes have a “scan and match” feature (using similar fingerprinting technologies as Picard), which will provide high bitrate, fully-tagged versions of recognised files available to stream or re-download.

Remove corrupt files

Unfortunately, there isn’t a universal way of determining whether a file is corrupt. However, depending on the importance of your recovered data, there are a few approaches worth trying:

Photos

The Python Imaging Library (PIL) contains a verify method (search for ‘verify’) that should catch obvious corruptions. After installing PIL, try running Denilson Sá’s jpeg_corrupt, which is a thin command-line-based wrapper around PIL’s verify method; given a glob of input paths, it prints the names of those verify determines as corrupt.

Music/Videos

Running ffmpeg without an output file parameter displays information about the given file. If ffmpeg is unable to parse the file, it’ll spit out a warning, which can be leveraged to filter and delete corrupt files, e.g.:

ffmpeg -i "$i" 2>&1 | grep -q 'Invalid data found when processing input' && rm "$i"

Persnickety design

2012-12-11T20:44:45+00:00

“Web Design is 95% Typography” they say… and I tend to agree. This post looks at how I improved my site’s typography using a Node.js module and closes with a remark on CSS hyphenation.

Typography

Like Steve Losh, the underlying goal of my site (in terms of design) is minimalism. I use little-to-no images, a large font and a narrow measure. This text-centric “text as a user interface” approach is intended to make my site a pleasure to read without tools like Readability.

Behind the curtains, all content of this site is written in Markdown and parsed as HTML using marked. Whilst marked is a fantastic parser, it (currently) does not support any typographical-enhancing extensions, such as those provided by SmartyPants. Enter typogr.js.

typogr.js is a small Node library with the aim to do one thing and to do it well: apply transformations on plain text to yield typographically-improved HTML. It can apply a raft of typographical filters besides those provided by SmartyPants. See its API for more details.

After a few patches, I use typogr.js throughout this site. Besides smart quotes and correct use of en- and em-dashes, ordinals are styled to match sup tags (such as those used on a post’s authored date), the imposition of block capitals (such as “API”) is reduced to match surrounding body text and widows (lines containing only a single word) are eliminated through careful placement of non-breaking spaces (&nbsp;).
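
The widow technique in particular is easy to picture in isolation. The following is a minimal sketch of the idea on plain text, not typogr.js’s implementation (which works on HTML and covers more cases); preventWidow is a hypothetical helper.

```javascript
// Sketch: join the last two words with a non-breaking space so the final
// word cannot wrap onto a line of its own.
// `preventWidow` is a hypothetical helper, not typogr.js's filter.
function preventWidow(text) {
  return text.replace(/\s+(\S+)\s*$/, '\u00a0$1')
}

const heading = preventWidow('Web Design is 95% Typography')
// Escape the result so the otherwise invisible character shows up when printed
console.log(JSON.stringify(heading))
```

Text containing a single word is returned unchanged, as there is nothing to join.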

A bleak aside on hyphenation

As with the last iteration of this site, I was keen to use hyphenation. Previously, I was using hyphenator, which, all-in-all, works rather well. However, since this iteration proudly uses zero JavaScript, I preferred a CSS approach.

Alas, although CSS3’s hyphenation works wonderfully in Firefox, WebKit has yet to catch up. I toyed with enabling it regardless, but as Divya Manian states, hyphens without justified text reduce readability.

Besides conditionally setting justified text via CSS browser hacks, native support for hyphenation and justified text is still impractical as of 2012. Let’s hope 2013 is the year of the hyphen.
