Understanding Auto-Paginated Scan with DynamoDBMapper

The DynamoDBMapper framework is a simple way to get Java objects into Amazon DynamoDB and back out again. In a blog post a few months ago, we outlined a simple use case for saving an object to DynamoDB, loading it, and then deleting it. If you haven't used the DynamoDBMapper framework before, you should take a few moments to read the previous post, since the use case we're examining today is more advanced.

Reintroducing the User Class

For this example, we'll be working with the same simple User class as the last post. The class has been properly annotated with the DynamoDBMapper annotations so that it works with the framework. The only difference is that, this time, the class has a @DynamoDBRangeKey attribute.

 
    @DynamoDBTable(tableName = "users")
    public static class User {
      
        private Integer id;
        private Date joinDate;
        private Set<String> friends;
        private String status;
      
        @DynamoDBHashKey
        public Integer getId() { return id; }
        public void setId(Integer id) { this.id = id; }
      
        @DynamoDBRangeKey
        public Date getJoinDate() { return joinDate; }       
        public void setJoinDate(Date joinDate) { this.joinDate = joinDate; }

        @DynamoDBAttribute(attributeName = "allFriends")
        public Set<String> getFriends() { return friends; }
        public void setFriends(Set<String> friends) { this.friends = friends; }
      
        @DynamoDBAttribute
        public String getStatus() { return status; }
        public void setStatus(String status) { this.status = status; }        
    }

Let's say that we want to find all active users that are friends with someone named Jason. To do so, we can issue a scan request like so:

        DynamoDBMapper mapper = new DynamoDBMapper(dynamo);

        DynamoDBScanExpression scanExpression = new DynamoDBScanExpression();
        Map<String, Condition> filter = new HashMap<String, Condition>();
        filter.put("allFriends", new Condition().withComparisonOperator(ComparisonOperator.CONTAINS)
                .withAttributeValueList(new AttributeValue().withS("Jason")));
        filter.put(
                "status",
                new Condition().withComparisonOperator(ComparisonOperator.EQ).withAttributeValueList(
                        new AttributeValue().withS("active")));

        scanExpression.setScanFilter(filter);
        List<User> scanResult = mapper.scan(User.class, scanExpression);

Note the "allFriends" attribute name in the first filter.put() call. Even though the Java object property is called "friends," the @DynamoDBAttribute annotation overrides the name of the attribute to be "allFriends." Also notice that we're using the CONTAINS comparison operator, which checks whether a set-typed attribute contains a given value. The scan method on DynamoDBMapper immediately returns a list of results, which we can iterate over like so:

        int usersFound = 0;
        for ( User user : scanResult ) {
            System.out.println("Found user with id: " + user.getId());
            usersFound++;
        }
        System.out.println("Found " + usersFound + " users.");

So far, so good. But if we run this code on a large table, one with thousands or millions of items, we might notice some strange behavior. For one thing, our logging statements may not come at regular intervals: the program may seem to pause unpredictably between chunks of results. And if you have wire-level logging turned on, you might notice something even stranger.

    Found user with id: 5
    DEBUG com.amazonaws.request - Sending Request: POST https://dynamodb.us-east-1.amazonaws.com/ ...
    DEBUG com.amazonaws.request - Sending Request: POST https://dynamodb.us-east-1.amazonaws.com/ ...
    DEBUG com.amazonaws.request - Sending Request: POST https://dynamodb.us-east-1.amazonaws.com/ ...
    DEBUG com.amazonaws.request - Sending Request: POST https://dynamodb.us-east-1.amazonaws.com/ ...
    Found user with id: 6

Why does it take four service calls to iterate from user 5 to user 6? To answer this question, we need to understand how the scan operation works in DynamoDB, and what the scan operation in DynamoDBMapper is doing for us behind the scenes.

The Limit Parameter and Provisioned Throughput

In DynamoDB, the scan operation takes an optional limit parameter. Many new customers of the service get confused by this parameter, assuming that it limits the number of results the operation returns, as it effectively does for a query without a filter. This isn't the case at all. The limit for a scan doesn't apply to how many results are returned, but to how many table items are examined. Because scan works on arbitrary item attributes, not the indexed table keys like query does, DynamoDB has to scan through every item in the table to find the ones you want, and it can't predict ahead of time how many items it will have to examine to find a match. The limit parameter is there so that you can control how much of your table's provisioned throughput the scan consumes before it returns the results collected so far, which may be empty.

That's why it took four service calls to find user 6 after finding user 5: DynamoDB had to scan through three full pages of the table before it found another item that matched the filters we specified.

The List object returned by DynamoDBMapper.scan() hides this complexity from you and returns all the matching items in your table, no matter how many service calls it takes, so that you can concentrate on working with the domain objects in your search, rather than writing service calls in a loop. But it's still helpful to understand what's going on behind the scenes, so that you know how the scan operation can affect your table's available provisioned throughput.
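
To make the difference between "items examined" and "items returned" concrete, here is a minimal plain-Java simulation. It is not SDK code: ScanLimitSimulation, scanPage, and callsUntilNextMatch are hypothetical names, an in-memory list of ids stands in for the table, and an integer offset stands in for the key that a real scan response uses to mark where the next page starts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class ScanLimitSimulation {

    /** One simulated scan response: the matches on this page and where to resume. */
    static final class Page {
        final List<Integer> matches;
        final int nextStart; // -1 once the whole table has been examined

        Page(List<Integer> matches, int nextStart) {
            this.matches = matches;
            this.nextStart = nextStart;
        }
    }

    /** Examines at most `limit` items starting at `start`; returns only those that pass the filter. */
    static Page scanPage(List<Integer> table, int start, int limit, Predicate<Integer> filter) {
        int end = Math.min(start + limit, table.size());
        List<Integer> matches = new ArrayList<>();
        for (int i = start; i < end; i++) {
            if (filter.test(table.get(i))) {
                matches.add(table.get(i));
            }
        }
        return new Page(matches, end < table.size() ? end : -1);
    }

    /** Counts how many simulated service calls it takes to reach the next matching item. */
    static int callsUntilNextMatch(List<Integer> table, int start, int limit, Predicate<Integer> filter) {
        int calls = 0;
        int position = start;
        while (position != -1) {
            Page page = scanPage(table, position, limit, filter);
            calls++;
            if (!page.matches.isEmpty()) {
                return calls;
            }
            position = page.nextStart; // an empty page still cost a call
        }
        return calls;
    }

    public static void main(String[] args) {
        // Assumed data: 40 items with ids 0..39, of which only 5 and 36 match the filter.
        List<Integer> table = new ArrayList<>();
        for (int id = 0; id < 40; id++) {
            table.add(id);
        }
        Predicate<Integer> filter = id -> id == 5 || id == 36;

        // Resume just after item 5 with a limit of 10 items examined per call.
        System.out.println(callsUntilNextMatch(table, 6, 10, filter)); // prints 4
    }
}
```

With these made-up numbers, the first three calls each examine ten items and return nothing, and only the fourth call reaches a match. This mirrors the pattern in the log output above.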

Auto-Pagination to the Rescue

The scan method returns a PaginatedList, which lazily loads more results from DynamoDB as necessary. The list will make as many service calls as necessary to load the next item in the list. In the example above, it had to make four service calls to find the next matching user after user 5. Importantly, not all methods from the List interface can take advantage of lazy loading. For example, if you call get(), the list will load items up to the index you specified, if it hasn't loaded that many already. If you call the size() method, the list will load every single result in order to give you an accurate count. This can consume a lot of provisioned throughput without you intending it, so be careful. On a very large table, it could even exhaust all the memory in your JVM.
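
The lazy-loading behavior described above can be sketched in plain Java. This is an illustration of the idea only, not the SDK's PaginatedList implementation: LazyPagedResults and fetchCount are hypothetical names, and a fixed list of pre-built pages stands in for the service, with each page access counted as one service call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class LazyPagedResults<T> implements Iterable<T> {
    private final List<List<T>> pages;            // stand-in for the remote service
    private final List<T> loaded = new ArrayList<>();
    private int nextPage = 0;
    private int fetchCount = 0;                   // simulated service calls so far

    public LazyPagedResults(List<List<T>> pages) {
        this.pages = pages;
    }

    /** Fetches one more page if any remain; an empty page still counts as a call. */
    private boolean loadNextPage() {
        if (nextPage >= pages.size()) {
            return false;
        }
        loaded.addAll(pages.get(nextPage++));
        fetchCount++;
        return true;
    }

    /** Loads only as many pages as needed to reach the requested index. */
    public T get(int index) {
        while (index >= loaded.size() && loadNextPage()) { /* keep loading */ }
        return loaded.get(index);
    }

    /** Must load every remaining page before it can report an accurate count. */
    public int size() {
        while (loadNextPage()) { /* keep loading */ }
        return loaded.size();
    }

    public int fetchCount() {
        return fetchCount;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                while (pos >= loaded.size() && loadNextPage()) { /* keep loading */ }
                return pos < loaded.size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return loaded.get(pos++);
            }
        };
    }

    public static void main(String[] args) {
        // Assumed data: seven pages, most of them empty after filtering.
        List<List<Integer>> pages = Arrays.asList(
                Arrays.asList(5),
                Collections.<Integer>emptyList(),
                Collections.<Integer>emptyList(),
                Collections.<Integer>emptyList(),
                Arrays.asList(6),
                Collections.<Integer>emptyList(),
                Arrays.asList(7));
        LazyPagedResults<Integer> results = new LazyPagedResults<>(pages);

        Iterator<Integer> it = results.iterator();
        System.out.println(it.next() + " after " + results.fetchCount() + " call(s)"); // 5 after 1 call(s)
        System.out.println(it.next() + " after " + results.fetchCount() + " call(s)"); // 6 after 5 call(s)
        System.out.println(results.size() + " after " + results.fetchCount() + " call(s)"); // 3 after 7 call(s)
    }
}
```

In this sketch, iterating costs only the calls needed to reach the next item, while size() forces every remaining page to load and keeps all results in memory. That is the trade-off to keep in mind before calling size() on a large result set.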

We've had customer requests to provide manually paginated scan and query methods for DynamoDBMapper to enable more fine-tuned control of provisioned throughput consumption, and we're working on getting those out in a future release. In the meantime, tell us how you're using the auto-paginated scan and query functionality, and what you would like to see improved, in the comments!
