Increasing labeling efficiency with a tiled image annotation layout

Published by Dave on

When it’s time to train an image recognition model, you need to start with a library of labeled images that can be used as training data for your algorithm. To make gathering this data easier, Amazon provides the crowd-image-classifier element within their Crowd HTML Elements library. This library makes it easy to quickly set up a labeling task in both Amazon SageMaker Ground Truth and Amazon Mechanical Turk, where it’s one of the built-in templates. While this template is great, I’ve found it’s often too slow. Handling each image as its own task is great for improving accuracy, but it has a big impact on how fast workers can complete tasks, which ultimately increases your labeling costs.

In this post I’m going to show how, by moving to a tiled layout, I was able to reduce labeling time by 69% with a minimal (-140 bps) impact to accuracy. This meant I was able to lower the amount I spent on the MTurk public crowd workforce by 50% and reduce my per-object fees for SageMaker Ground Truth by 95%. The gains would have been even higher with a private or vendor workforce, since the real cost of that labor is almost always higher than using the MTurk public crowd.

Moving away from classifying a single image at a time reduced the time it took to label the images by 69%.

Of course, all of this comes with some trade-offs. Batching multiple images together into a single task and handling the results requires more data wrangling; the setup of a custom SageMaker Ground Truth task adds complexity; and you lose the ability to leverage automated data labeling. But the benefits are huge if you’re able to follow the steps I’ll lay out below.

Crowd Image Classifier

The crowd-image-classifier interface. The image is of my dog Larry (the small one) and his best friend Olivia.

The layout of the crowd-image-classifier is very similar to many of the other Crowd HTML Elements. The content to be labeled is in the center of the page, there are great controls for zooming and panning around the image, an instructions block is included, and the categories are easily selected using either mouse or keyboard. The net result is a very accurate labeling interface that any worker can use easily. In a test gathering 1000 annotations from ImageNet images, workers achieved 99.7% accuracy.

Unfortunately, that accuracy comes at a price in terms of time. For each image a worker needs to wait for the page to load, select a category, and then move their mouse down to click Submit. While each of these steps typically happens in a matter of seconds, that time quickly adds up across thousands of images. In my test gathering 1000 annotations, workers collectively spent 29 minutes waiting for tasks to load and 20 minutes between selecting a category and clicking Submit.

A breakdown of time spent by workers labeling each image and the resulting costs. Note that the AWS fees reflect the $0.08 per object fees charged after exhausting the monthly 1000 object free tier.

Tiled Annotation Layout

To address this limitation I’ve created a basic tiled layout that will let me display as many images as I wish in a single task. In this example I’m including 20 images in each task but it could easily be more or less depending on your needs.

Tiled task layout. Note that when this is populated each image will be unique.

Of course, this template lacks an instructions block and the ability to zoom into each image. For a simple task like this one, neither is necessary, but if we did need to add instructions or additional controls, that would be relatively straightforward.
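For example, the Crowd HTML Elements library also provides a standalone crowd-instructions element that can be dropped into a custom template. A minimal sketch (the wording below is illustrative, not from the original task):

```html
<!-- Sketch: an instructions block for a custom template.
     The copy here is illustrative placeholder text. -->
<crowd-instructions link-text="View instructions" link-type="button">
  <short-summary>
    <p>Classify each image as Cat, Dog, Neither, or Invalid.</p>
  </short-summary>
  <detailed-instructions>
    <p>Choose Invalid if the image fails to load or is corrupted.</p>
  </detailed-instructions>
</crowd-instructions>
```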

Consolidated manifest

We’ll start by taking an existing SageMaker Ground Truth manifest and creating a new version that consolidates the images into groups of 20. The Python code below takes the original manifest which has records like this {'source-ref': 's3://mybucket/img1.jpg'} and groups them into lists of 20 images like the following.

{   
    'source': [
        {'source-ref': 's3://mybucket/img1.jpg', 'index': 0},
        {'source-ref': 's3://mybucket/img2.jpg', 'index': 1},
        {'source-ref': 's3://mybucket/img3.jpg', 'index': 2},
        {'source-ref': 's3://mybucket/img4.jpg', 'index': 3}
        ...
    ]
}
import json

group_size = 20
original_manifest = []  # MY ORIGINAL MANIFEST

# Number of tasks needed to hold all records in groups of group_size
task_count = (len(original_manifest) // group_size + 
              (1 if len(original_manifest) % group_size > 0 else 0))

# Distribute the records round-robin across the tasks
consolidated_manifest = [None] * task_count
for idx, record in enumerate(original_manifest):
    group = consolidated_manifest[idx % task_count]
    if not group:
        group = {}
        consolidated_manifest[idx % task_count] = group
    refs = group.get('source', [])
    refs.append({
        'source-ref': record['source-ref'],
        'index': idx
    })
    group['source'] = refs

# because ground truth won't accept a JSON value for source or source-ref
for record in consolidated_manifest:
    record['source'] = json.dumps(record['source'])

Note that in the last step we flatten the JSON arrays into strings. This is necessary because SageMaker Ground Truth currently only supports string values for source or source-ref. We’ll unpack those in our pre-annotation Lambda.
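Ground Truth expects the manifest file in JSON Lines format, with one record per line. A minimal sketch of writing the consolidated manifest out before uploading it to S3 (the filename and the single stand-in record are placeholders):

```python
import json

# Stand-in for the consolidated_manifest list built above;
# the S3 path and filename are placeholders.
consolidated_manifest = [
    {'source': json.dumps([{'source-ref': 's3://mybucket/img1.jpg', 'index': 0}])},
]

# Write one JSON object per line, as Ground Truth expects.
with open('consolidated.manifest', 'w') as f:
    for record in consolidated_manifest:
        f.write(json.dumps(record) + '\n')
```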

Pre and Post Annotation Lambdas

Custom Ground Truth labeling tasks require that you have two Lambdas for pre- and post-processing of data as it flows through the annotation steps. The following is the pre-annotation Lambda we’ll want to create.

import json

def lambda_handler(event, context):
    print(event)
    source = event['dataObject'].get('source')

    if source is None:
        print("Missing source data object")
        return {}

    # Ground truth currently only allows string values for source or source-ref attributes
    # This allows the source to be passed as a string and loaded into an object
    if type(source) is str:
        source = json.loads(source)

    response = {
        "taskInput": {
            "taskObject": source
        }
    }
    print(response)
    return response

As you can see, I’ve kept this template as simple as possible so that I have the flexibility to use it for other, similar annotation tasks. I simply retrieve the existing source data from the event, unpack the string value into a JSON object, and pass it along as the object of this task. We’ll see how that JSON object is used when we look at the template.
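It’s easy to sanity-check this handler locally by calling it with a hand-built event. The sketch below reproduces the handler so it’s self-contained; the dataObject shape mirrors what Ground Truth sends, and the S3 paths are placeholders:

```python
import json

# Reproduced from the pre-annotation Lambda above so this
# sketch runs on its own.
def lambda_handler(event, context):
    source = event['dataObject'].get('source')
    if source is None:
        return {}
    # Unpack the stringified list back into a JSON object
    if type(source) is str:
        source = json.loads(source)
    return {"taskInput": {"taskObject": source}}

# A hand-built event in the shape Ground Truth sends
event = {
    'dataObject': {
        'source': json.dumps([
            {'source-ref': 's3://mybucket/img1.jpg', 'index': 0},
            {'source-ref': 's3://mybucket/img2.jpg', 'index': 1},
        ])
    }
}
result = lambda_handler(event, None)
print(result['taskInput']['taskObject'][0]['index'])  # → 0
```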

Next we’ll want to create a post-annotation Lambda. For this I’m going to use a generic handler I rely on for many of my annotation tasks, particularly in the development phase. It simply retrieves the annotation payload and passes along each annotation with the ID of the worker who completed it. I then perform any answer consolidation in my notebook. This requires additional data wrangling there, but avoids the need to bake complex consolidation logic into my Lambdas.

import json
import boto3
from urllib.parse import urlparse

def lambda_handler(event, context):
    print(json.dumps(event))

    payload = get_payload(event)
    print(json.dumps(payload))

    consolidated_response = []
    for dataset in payload:
        annotations = dataset['annotations']
        responses = []
        for annotation in annotations:
            response = json.loads(annotation['annotationData']['content'])
            if 'annotatedResult' in response:
                response = response['annotatedResult']

            responses.append({
                'workerId': annotation['workerId'],
                'annotation': response
            })

        consolidated_response.append({
            'datasetObjectId': dataset['datasetObjectId'],
            'consolidatedAnnotation' : {
                'content': {
                    event['labelAttributeName']: {
                        'responses': responses
                    }
                }
            }
        })

    print(json.dumps(consolidated_response))
    return consolidated_response


def get_payload(event):
    if 'payload' in event:
        parsed_url = urlparse(event['payload']['s3Uri'])
        s3 = boto3.client('s3')
        text_file = s3.get_object(Bucket=parsed_url.netloc, Key=parsed_url.path[1:])
        return json.loads(text_file['Body'].read())
    else:
        return event.get('test_payload',[])
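One nice side effect of the test_payload fallback in get_payload is that the handler can be exercised locally without touching S3. A sketch of such an event (the worker ID and annotation content below are made up for illustration):

```python
import json

# A hand-built event that exercises the 'test_payload' branch of
# get_payload in the post-annotation Lambda above. The worker ID
# and annotation content are illustrative placeholders.
test_event = {
    'labelAttributeName': 'label_name',
    'test_payload': [
        {
            'datasetObjectId': '0',
            'annotations': [
                {
                    'workerId': 'private.us-east-1.AAAAAA',
                    'annotationData': {
                        'content': json.dumps({'0': {'cat': True, 'dog': False}})
                    }
                }
            ]
        }
    ]
}
# Passing test_event to lambda_handler would return one consolidated
# record keyed by 'label_name', with no S3 access needed.
```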

Create the labeling task

Now I can go to SageMaker Ground Truth in the AWS console and set up a new custom labeling job. For the template I’ll use the HTML below.

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<script src="https://code.jquery.com/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css">

<crowd-form>
    <div class="container-fluid">
      <div class="row">

        {% for image in task.input.taskObject %}
        <div class="card text-center" style="width: 20rem;">
          <img class="card-img-top" src="{{ image.source-ref | grant_read_access }}" alt=""/>
          <div class="row no-gutters">
            <div class="col">
              <div class="btn-group-vertical btn-group-toggle" style="width: 50%" data-toggle="buttons">
                <label class="btn btn-outline-dark btn-sm">
                  <input type="radio" name="{{ image.index }}" value="cat" required> Cat
                </label>
                <label class="btn btn-outline-danger btn-sm">
                  <input type="radio" name="{{ image.index }}" value="dog" required> Dog
                </label>
                <label class="btn btn-outline-warning btn-sm">
                  <input type="radio" name="{{ image.index }}" value="neither" required> Neither
                </label>
                <label class="btn btn-outline-secondary btn-sm">
                  <input type="radio" name="{{ image.index }}" value="invalid image" required> Invalid
                </label>
              </div>
            </div>
          </div>
        </div>
        {% endfor %}

      </div>
    </div>
</crowd-form>

I’ve imported some Bootstrap libraries to make laying out the template easier and taken advantage of the Liquid templating language that Ground Truth supports. There are a few things to note:

  1. The template has a for-loop to iterate through the 20 images we provided in our taskObject. You can see this in the opening {% for image in task.input.taskObject %} and closing {% endfor %} surrounding the card div.
  2. Within the card div I retrieve the image by inserting {{ image.source-ref | grant_read_access }}.
  3. To ensure that the inputs on each card have a unique name, I’ve used the index that I attached to each record: {{ image.index }}
  4. The target size of each card is defined by the width: 20rem value in the style of the card div. This can be adjusted to alter the size of the images to have more or fewer in each row.

Now that we have this setup, we can submit it for annotation.

Handle the results

Note that the results returned include both our consolidated source data and the results of the task as a consolidated object. The source data will still be formatted as a string, but it has been unpacked below for readability.

{   
    "source": [
        {"source-ref": "s3://mybucket/img1.jpg", "index": 0},
        {"source-ref": "s3://mybucket/img2.jpg", "index": 1},
        {"source-ref": "s3://mybucket/img3.jpg", "index": 2},
        {"source-ref": "s3://mybucket/img4.jpg", "index": 3},
        ...
    ],
    "label_name": { 
        "responses": [
            {
                "workerId": "public.us-east-1.AAAAAA",
                "annotation": {
                    "0": {
                        "cat": true,
                        "dog": false,
                        "invalid image": false,
                        "neither": false},
                    "100": {
                        "cat": true,
                        "dog": false,
                        "invalid image": false,
                        "neither": false},
                    ...
                }
            },
            ...
        ]
    }
}

To unpack this into a simple manifest, we’ll iterate through the records in the source and use the index to find the associated label. Note that in the code below I simply take the first worker’s response. If I wanted to consolidate multiple answers I’d need to revise this to include consolidation logic that would handle multiple responses for each set of images.

def get_selected_key(response):
    '''
    Retrieves the dict key that has a True value.
    '''
    for key, value in response.items():
        if value:
            return key
        
label_name = 'label_name'  # the labelAttributeName used by the labeling job
result_manifest = cons_annotations  # records loaded from the output manifest
flattened_manifest = []
for record in result_manifest:
    # Unpack the source data
    source = json.loads(record['source'])
    
    # Get the first annotation
    annotation = record[label_name]['responses'][0]['annotation']

    # Use the index in the source as a key to retrieve 
    # the associated annotation
    for image in source:
        new_record = {'source-ref': image['source-ref']}
        new_record[label_name] = get_selected_key(annotation[str(image['index'])])
        flattened_manifest.append(new_record)

This will flatten the manifest down into a format you can more easily use for training.
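If you do collect multiple responses per task, a simple majority vote over each image’s selections is one way to consolidate them. A sketch, assuming each response follows the annotation shape shown above (the example responses are made up):

```python
from collections import Counter

def majority_vote(responses, image_index):
    """Pick the label most workers selected for one image.

    responses: a list of response dicts in the shape returned by the
    post-annotation Lambda; image_index: the string index key of the image.
    """
    votes = Counter()
    for response in responses:
        selections = response['annotation'][image_index]
        for label, selected in selections.items():
            if selected:
                votes[label] += 1
    label, _ = votes.most_common(1)[0]
    return label

# Illustrative responses from three workers for image "0"
responses = [
    {'annotation': {'0': {'cat': True, 'dog': False}}},
    {'annotation': {'0': {'cat': True, 'dog': False}}},
    {'annotation': {'0': {'cat': False, 'dog': True}}},
]
print(majority_vote(responses, '0'))  # → cat
```

Ties would need an explicit policy (for example, routing the image to an additional reviewer), which this sketch doesn’t handle.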

Testing results

As you can see in the data below, the new tiled layout was a big improvement over the one-by-one approach. While each individual task took more than 6x longer, workers were completing 20 images per task, so the resulting time per image was 2.5 seconds per image compared to the 8 seconds per image we saw earlier. The total time spent by workers was 135 minutes in my first test but that dropped 69% to 42 minutes by using a tiled layout.

Comparison of the initial test to the final results of the performance testing (pink box).

The biggest downside was the drop in accuracy we saw with this approach. Overall accuracy dropped 140 bps, from 99.7% to 98.3%. This isn’t surprising, since it’s easier to make mistakes when clicking through multiple images in a grid. All in, that level of accuracy is still very good relative to the gains in efficiency, and we could easily correct for it by adding an additional worker to review each image and still see a net improvement in time spent.

I hope you found this useful. Let me know in the comments if there are other approaches you think would further improve time and accuracy.
