
Finding objects in AWS S3

The world is becoming more digital and connected, and it is producing tremendous quantities of diverse data.

Companies are looking for scalable, secure, versatile, and cost-effective storage for this information, and AWS S3 is one service that meets these requirements.

That said, an S3 bucket can accumulate a large amount of data over time, and we may need to find one specific piece of content (an ‘object’ in technical terms) in the bucket.

 

ListObjects API

The problem arises when a large amount of data is stored in an S3 bucket and a specific object must be located.

While AWS S3 has standard APIs for listing objects, each call returns at most 1000 objects, a cap that is easily reached in data lakes.

To find the location of an object (its “key” in technical terms), you can use the ListObjects API, which responds with a list of the objects in a bucket, but only up to 1000 objects per call, sorted by key (in UTF-8 binary order, which is roughly alphabetical). As a workaround, the API response also includes the “IsTruncated” and “NextMarker” parameters.

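The former is a flag indicating whether the response was truncated, that is, whether more results matching the request remain to be fetched. The latter is a string that can be passed as the marker in a subsequent call to retrieve the next set of objects. (S3 returns “NextMarker” only when a delimiter is specified; otherwise, the last key in the response serves the same purpose.) Next, we will look at how these parameters can be used to solve our problem.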
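The pagination loop described above can be sketched in Python. This is a minimal illustration, not the code of any particular package: the function name iter_keys is hypothetical, and client is assumed to be a boto3 S3 client (boto3 is the AWS SDK for Python), whose list_objects call returns “Contents”, “IsTruncated”, and, when a delimiter is used, “NextMarker”.

```python
from typing import Any, Dict, Iterator, Optional


def iter_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every key under `prefix`, following ListObjects pagination.

    `client` is assumed to behave like a boto3 S3 client.
    """
    marker: Optional[str] = None
    while True:
        kwargs: Dict[str, str] = {"Bucket": bucket, "Prefix": prefix}
        if marker is not None:
            # Resume listing after the last key seen so far.
            kwargs["Marker"] = marker
        page = client.list_objects(**kwargs)

        contents = page.get("Contents", [])
        for obj in contents:
            yield obj["Key"]

        # Stop when S3 reports that nothing more is left to fetch.
        if not page.get("IsTruncated") or not contents:
            break

        # NextMarker is only present when a delimiter was requested;
        # without one, the last key of the page plays the same role.
        marker = page.get("NextMarker") or contents[-1]["Key"]


# Example usage (assumes boto3 is installed and AWS credentials are
# configured; "my-bucket" is a placeholder name):
#   import boto3
#   for key in iter_keys(boto3.client("s3"), "my-bucket", prefix="logs/"):
#       print(key)
```

boto3 also provides built-in paginators (for example, client.get_paginator("list_objects_v2")) that handle this loop for you; the explicit version is shown here only to make the role of the two parameters visible.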

Python with py-aws-helper

Boto3 is the AWS SDK for Python and can be used to create, configure, and manage AWS services such as S3.

We have developed a custom Python package called “py-aws-helper” that uses boto3 to make authenticated API calls. It relies on the secret keys of an IAM user with access to S3, which are configured in the AWS CLI installed on the local system.

Prerequisites:

  • IAM user with read access to S3 and its secret keys
  • AWS CLI configured with the secret keys
  • Python, pip, and the py-aws-helper package

The s3objectfinder module in the py-aws-helper package makes recursive ListObjects API calls based on the arguments passed to it, namely:

  1. Bucket
  2. Prefix
  3. Delimiter
  4. File name to be found.

 

Below is a sample snippet using that module to find a file in a given S3 bucket.

from py_aws_helper import s3objectfinder

bucket = ''     # provide bucket name
file_name = ''  # provide file name

output = s3objectfinder.find_object(bucket=bucket, file_name=file_name)

print('Total objects fetched: ', output['total_objects_fetched'])
print('Total objects matched: ', output['total_objects_matched'])

for key in output['matched_keys']:
    print('Key: ', key)

The module returns a dictionary containing the number of retrieved objects, the number of matched objects, and the list of matched keys. The code runs more efficiently when the “prefix” and “delimiter” arguments are supplied, because objects are then retrieved only from that prefix and grouped by the delimiter. This module makes it a little easier to find content in an S3 bucket.
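To make the shape of that dictionary concrete, here is a rough sketch of how such a summary could be built from a stream of keys. It is not the package's actual implementation: the function name summarize_matches is hypothetical, and the matching rule (a key matches when its last “/”-separated segment equals the file name) is an assumption made for illustration. Only the three dictionary fields are taken from the snippet above.

```python
from typing import Dict, Iterable, List, Union


def summarize_matches(
    keys: Iterable[str], file_name: str
) -> Dict[str, Union[int, List[str]]]:
    """Count the listed keys and collect those whose file name matches."""
    fetched = 0
    matched: List[str] = []
    for key in keys:
        fetched += 1
        # Assumed rule: compare only the part after the final "/".
        if key.rsplit("/", 1)[-1] == file_name:
            matched.append(key)
    return {
        "total_objects_fetched": fetched,
        "total_objects_matched": len(matched),
        "matched_keys": matched,
    }
```

Because every listed key has to be examined, narrowing the listing with a prefix reduces the number of objects fetched, and therefore the number of API calls.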

PyPi project link 

Written by:

Sripranav P & Umashankar N
