Spark uses lazy evaluation: it does no work unless it really has to. This avoids unnecessary memory usage and computation, which is what makes working with big data feasible.
A transformation is lazily evaluated; the actual work happens only when an action is called.
Example:
In [1]: lines = sc.textFile(file)  # runs instantly, regardless of the file's size
In [2]: errors = lines.filter(lambda line: line.startswith("error"))  # runs instantly
In [3]: errorCount = errors.count()  # an action occurred, let the party start!
Out[3]: 0  # no line starting with 'error' in this example
So, in [1] we told Spark to read a file into an RDD named lines. Spark heard us and answered "Yes, I will do it", but in fact it did not read the file yet.
In [2], we filter the lines of the file, assuming that its contents contain lines that are marked with the word error at their start. So we tell Spark to create a new RDD, called errors, which will hold the elements of the RDD lines that start with the word error.
Now in [3], we ask Spark to count the errors, i.e. to count the number of elements the RDD errors holds. count() is an action, which leaves Spark no choice but to actually perform the operation, so that it can produce the result of count(), which is an integer.
As a result, when [3] is reached, [1] and [2] are actually performed, i.e. when we reach [3], then and only then:
- the file is going to be read in textFile() (because of [1])
- lines will be filter()'ed (because of [2])
- count() will execute, because of [3]
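If you want to see for yourself that an RDD such as errors is just a recipe (a lineage of pending transformations) rather than materialized data, you can inspect it with toDebugString(). This is a standard RDD method; as a rough sketch (in recent PySpark versions it returns bytes, hence the decode() below):
In [4]: print(errors.toDebugString().decode())  # prints the lineage of pending transformations; no data is read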
Debug tip: Since Spark won't do any real work until [3] is reached, it is important to understand that if an error exists in [1] and/or [2], it won't appear until the action in [3] triggers Spark to do actual work. For example, if your data in the file do not support the startswith() I used, then [2] will be accepted by Spark without raising any error, but when [3] is submitted, and Spark actually evaluates both [1] and [2], then and only then will it understand that something is wrong with [2] and produce a descriptive error.
As a result, an error may be triggered when [3] is executed, but that doesn't mean the error must lie in the statement of [3]!
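To see this deferral in action, here is a minimal sketch (the variable name broken and the deliberately wrong argument to startswith() are mine): passing a non-string to startswith() is invalid, yet Spark accepts the transformation without complaint and only fails when the action runs.
In [5]: broken = lines.filter(lambda line: line.startswith(5))  # wrong argument type, but accepted instantly
In [6]: broken.count()  # only now is the lambda actually run, and the error surfaces (a TypeError buried in the task failure)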
Note that neither lines nor errors will be stored in memory after [3]. They continue to exist only as a set of processing instructions. If multiple actions are performed on either of these RDDs, Spark will read and filter the data multiple times. To avoid duplicating work when performing multiple actions on a single RDD, it is often useful to keep the data in memory using cache().
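For example, a minimal sketch using cache() (a standard RDD method) so that the file is read and filtered only once even though two actions follow:
In [7]: errors.cache()   # mark errors to be kept in memory once it is first computed
In [8]: errors.count()   # triggers read + filter; the result is then kept in memory
In [9]: errors.take(5)   # served from the cached data; the file is not read again
Note that cache() is itself lazy: nothing is actually stored until the first action materializes the RDD.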
You can see more transformations/actions in the Spark docs.