Class Details

Price: $2,295

3-Day Course Includes:

  • Class exercises in addition to training instruction
  • Courseware books, notepads, pens, highlighters and other materials
  • Free subscription to Cloudera's practice exam questions
  • Full breakfast with a variety of bagels, fruits, yogurt, doughnuts, and juice
  • Tea, coffee, and soda available all day
  • Freshly baked cookies every afternoon (only at participating locations)

For group training options, please call us at (240) 667-7757 or email 

Course Outline

Why Use Spark

  • Problems with Traditional Large-Scale Systems

  • An Intro to Spark

Apache Spark Intro

  • Apache Spark

  • Review of:

    • Hadoop MapReduce to Spark

    • HDFS

    • YARN

The Basics of Spark

  • Spark Shell

  • Resilient Distributed Datasets (RDDs)

  • Functional Programming


  • Creating and Understanding RDD Operations

Pair RDDs and Data Aggregation

  • Key-Value Pair RDDs

  • Pair RDD Operations and MapReduce

Understanding How to Write and Deploy Spark Applications

  • The Spark Shell and Spark Applications

  • Creating SparkContext

  • Constructing a Spark Application with Scala and Java

  • Starting and Running Spark Applications

  • Spark Application Web User Interface

  • Spark Properties Configuration

  • Logging

Parallel Processing

  • Spark on Clusters

  • RDD Partitions

  • File-based RDDs and Partitioning

  • Data Locality and HDFS

  • Parallel Operations Execution

  • Understanding Stages and Tasks

Spark RDD Persistence

  • RDD Lineage

  • Persistence

  • Distributed Persistence

Spark Streaming

  • Spark Streaming

  • Stream Request Count

  • DStreams

  • Creating Spark Streaming Applications

Advanced Methods for Spark Streaming

  • Multi-Batch Operations

  • State Operations

  • Sliding Window Operations

  • Advanced Data Sources

Spark Data Processing and Typical Patterns

  • Common Spark Use Cases

  • Iterative Algorithms

  • Graph Analysis and Processing

  • Machine Learning

DataFrames and Spark SQL

  • Spark SQL and the SQL Context

  • Constructing DataFrames

  • Querying DataFrames

  • How to Save DataFrames

  • RDDs and DataFrames

  • Understanding the Differences between Spark SQL, Impala, and Hive-on-Spark


By the end of the course, students should understand:

  • How to use the Spark shell for interactive data analysis
  • The features of Spark's Resilient Distributed Datasets
  • How Spark runs on a cluster
  • How Spark executes tasks in parallel
  • How to use Spark to process streaming data