pyspark Tutorial => Consuming Data From S3 using PySpark

Example

There are two ways to consume data from an AWS S3 bucket.

1. Using the sc.textFile (or sc.wholeTextFiles) API. This API also works with HDFS and the local file system.

aws_config = {}  # set your AWS credentials here (the dictionary key names are placeholders)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_config['aws.access.key.id'])
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_config['aws.secret.access.key'])
s3_keys = ['s3n://{bucket}/{key1}', 's3n://{bucket}/{key2}']
# wholeTextFiles takes a single path string; multiple paths are comma-separated
data_rdd = sc.wholeTextFiles(','.join(s3_keys))
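
Because sc.wholeTextFiles expects one path string rather than a Python list, several objects are usually passed as a single comma-separated string. The following is a minimal sketch of building that string; the helper names s3n_uri and join_paths, the bucket "my-bucket", and the object keys are all hypothetical and only serve as an illustration.

```python
def s3n_uri(bucket, key):
    """Build an s3n:// URI from a bucket name and an object key.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    return "s3n://{}/{}".format(bucket, key.lstrip("/"))


def join_paths(uris):
    """Join several URIs into the comma-separated form Spark accepts."""
    return ",".join(uris)


# Placeholder bucket and keys, assumed for the example only.
paths = join_paths([
    s3n_uri("my-bucket", "logs/a.txt"),
    s3n_uri("my-bucket", "/logs/b.txt"),
])
print(paths)
```

The resulting string can then be passed directly, for example as sc.wholeTextFiles(paths).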

2. Reading it through a custom API (say, a boto downloader):

def download_data_from_custom_api(key):
    # implement this function to suit your setup (if you're new to this, use the boto API)
    # don't worry about multi-threading, as each worker will have a single thread executing your job
    return ''

s3_keys = ['s3n://{bucket}/{key1}', 's3n://{bucket}/{key2}']
# numSlices is the number of partitions. Set it according to your cluster configuration and performance requirements.
key_rdd = sc.parallelize(s3_keys, numSlices=16)

data_rdd = key_rdd.map(lambda key: (key, download_data_from_custom_api(key)))
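
One possible way to fill in download_data_from_custom_api is sketched below. It is only an assumption about how such a downloader might look: it uses boto3 (the successor to boto, not named in the original), assumes each key is written as an s3n://bucket/key URI, assumes the objects are UTF-8 text, and assumes AWS credentials are already available on every worker. The helper split_s3_uri is hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3n://bucket/path/to/object' into (bucket, object key).

    Hypothetical helper; assumes the URI form shown above.
    """
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def download_data_from_custom_api(uri):
    """Download one S3 object and return its content as text."""
    # Imported inside the function so the import happens on the worker
    # that runs the task (assumes boto3 is installed there).
    import boto3

    bucket, key = split_s3_uri(uri)
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    # Assumption: the object holds UTF-8 encoded text.
    return body.read().decode("utf-8")
```

Creating the client inside the function keeps the lambda passed to map free of objects that cannot be serialized, at the cost of building a new client for every key.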

I recommend approach 2 because, with approach 1, the driver downloads all the data and the workers only process it. This has the following drawbacks:

You'll run out of memory as the data size increases.
Your workers will sit idle until the data has been downloaded.
