Phoenix TS

Spark and Machine Learning at Scale Training

BONUS! Cyber Phoenix Subscription Included: All Phoenix TS students receive complimentary ninety (90) day access to the Cyber Phoenix learning platform, which hosts hundreds of expert asynchronous training courses in Cybersecurity, IT, Soft Skills, Management, and more!

Course Overview

In this four-day, instructor-led Spark course, offered in Washington, DC Metro; Tysons Corner, VA; Columbia, MD; or Live Online, participants learn how to build, deploy, and maintain powerful data-driven solutions using Spark and its associated technologies. The course begins with an introduction to Spark, its architecture, and how it fits into the Hadoop and cloud-based ecosystems. Participants learn to set up Spark environments using Databricks Cloud, AWS EMR clusters, and SageMaker Studio. In addition, students learn about Spark’s core functionalities, including RDDs, DataFrames, transformations, and actions.

At the completion of this course, students will be able to:

  • Work with Spark’s machine learning (ML) libraries, focusing on data preprocessing, feature engineering, model training, and evaluation
  • Perform stream processing and graph analysis with GraphX and GraphFrames
  • Deploy Spark ML artifacts
  • Understand machine learning at scale
  • Implement distributed training, hyperparameter tuning, model selection, and performance optimization for machine learning pipelines


Spark and Machine Learning at Scale Training

7/08/24 - 7/11/24 (4 days)
8/05/24 - 8/08/24 (4 days)
8/26/24 - 8/29/24 (4 days)
9/23/24 - 9/26/24 (4 days)

Program Level



This course is intended for data scientists, machine learning engineers, big data engineers, and other professionals with experience in data analysis who wish to leverage Spark for scalable machine learning solutions. It is also suitable for those who want to enhance their large-scale data processing and machine learning knowledge.
All learners are expected to have:

  • Basic understanding of Python programming
  • Familiarity with data processing and analysis concepts
  • Familiarity with Python Pandas
  • Familiarity with basic machine learning concepts and algorithms is recommended

Course Outline

Chapter 1: Introduction to Spark

  • Big Data and the Analytics Process
  • What is Big Data?
  • Volume
  • Velocity
  • Variety
  • Veracity
  • Too large to fit into memory
  • Big data and analytic process
  • Scaling and Distributed Computing
  • How to Actually Scale?
  • Bring the Data to the Compute
  • Bring the Compute to the Data
  • Introduction to the Spark Platform
  • History of Spark and Hadoop
  • Spark vs. Hadoop MapReduce
  • Supported Languages
  • Pandas API on Spark
  • Spark Architecture: Cluster Manager
  • Standalone cluster manager
  • Apache Hadoop YARN
  • Apache Mesos
  • Spark Architecture: Driver Process
  • Spark Architecture: Executor Process and Workers
  • Spark Building Blocks
  • Spark SQL and the Catalyst

Chapter 2: Introduction to Spark – Setting up a Spark Environment

  • Set Up On-Premises Spark Environment (Ubuntu 20.04, Docker)
  • Set Up Databricks Community Cloud and Compute Cluster
  • Set Up EMR Cluster and Attach Notebook

Chapter 3: Basic Spark Operations and Transformations

  • Spark Session and Context
  • Loading Data
  • Actions and Transformations
  • More on Actions in Spark
  • More on Transformations in Spark
  • Persistence and Caching

Chapter 4: Introduction to Spark SQL

  • What is Spark SQL?
  • Uniform Data Access with Spark SQL
  • Integration with cloud storage
  • Using JDBC Sources
  • Hive Integration
  • What is a DataFrame?
  • Creating a DataFrame in PySpark
  • Commonly Used DataFrame Methods and Properties in PySpark
  • Grouping and Aggregation in PySpark
  • The “DataFrame to RDD” Bridge in PySpark
  • The SQLContext Object
  • Examples of Spark SQL / DataFrame (PySpark Example)
  • Converting an RDD to a DataFrame Example
  • Example of Reading / Writing a JSON File
  • Performance, Scalability, and Fault-tolerance of Spark SQL

Chapter 5: Spark’s ML libraries – Lecture: Introduction to Spark’s ML libraries

  • Spark MLlib
  • Algorithms
  • Classification
  • Binary Classification
  • Multi-Class Classification
  • Multi-Label Classification
  • Imbalanced Classification
  • Regression
  • Linear Regression
  • Simple Linear Regression
  • Multiple Linear Regression
  • Polynomial Regression
  • Support Vector Regression
  • Decision Tree Regression
  • Random Forest Regression
  • Feature Engineering
  • TF-IDF – PySpark example
  • Word2Vec – PySpark example
  • Count Vectorizer – PySpark example
  • Feature Transformers of Spark MLlib
  • Tokenizer – PySpark example
  • Stopwords Remover
  • Stopwords Remover – PySpark example
  • N-gram – PySpark example
  • Binarizer – PySpark example
  • Principal Component Analysis
  • What is PCA used for?
  • Advantages and disadvantages of PCA
  • PCA – PySpark example
  • String Indexing – PySpark example
  • Why is One-Hot Encoding used for nominal data?
  • One-Hot Encoding – PySpark Example
  • Bucketizer – PySpark example
  • Standardization and Normalization
  • Difference between Standardization and Normalization
  • Standard Scaler
  • Robust Scaler
  • Min Max Scaler
  • Max Abs Scaler
  • Imputer
  • Feature Selectors in Spark MLlib
  • Vector Slicer – PySpark example
  • Chi-Squared selection – PySpark example
  • Univariate Feature Selector
  • Variance Threshold Selector
  • Locality Sensitive Hashing
  • Locality Sensitive Hashing in Spark MLlib
  • LSH Operations
  • Bucketed Random Projection for Euclidean Distance
  • MinHash for Jaccard Distance
  • Pipeline
  • Transformer
  • Estimator
  • Persistence
  • Introduction to Hyperparameter Tuning
  • Hyperparameter tuning methods
  • Random Search
  • Grid Search
  • Bayesian Optimization
  • Hyperparameter Tuning with Spark

Chapter 6: Streaming and Graphs

  • Stream Analytics
  • Tools for Stream Analytics: Kafka, Storm, Flink, Spark
  • Timestamps in stream analytics
  • Windowing Operations

Chapter 7: Deploying Spark ML Artifacts – Introduction to deploying Spark ML Artifacts

  • How the Spark system works
  • What is Deployment?
  • Spark Deployment Artifacts
  • Packaging Spark (ML) for Production
  • Deploy Spark ML to EMR
  • Deploy Spark (ML) with SageMaker
  • Serving and Updating Spark ML Models
  • Model Versioning with AWS Model Registry

Chapter 8: Machine Learning at Scale – Introduction to Machine Learning at Scale

  • Introduction to Scalability
  • Common Reasons for Scaling Up ML Systems
  • How to Avoid Scaling Infrastructure?
  • Benefits of ML at Scale
  • Challenges in ML Scalability
  • Data Complexities – Challenges
  • ML System Engineering – Challenges
  • Integration Risks – Challenges
  • Collaboration Issues – Challenges

Chapter 9: Machine Learning at Scale – Distributed Training of Machine Learning Models

  • Introduction to Distributed Training
  • Data Parallelism
  • Steps of Data Parallelism
  • Data Parallelism vs. Random Forest
  • Model Parallelism
  • Frameworks for Implementing Distributed ML
  • Introduction to Distributed Training vs. Distributed Inference
  • Introduction to Training
  • Introduction to Inference
  • Key components of Inference
  • Inference Challenges
  • Training vs. Inference
  • Introduction to GPUs
  • Inference – Hardware
  • AWS Inferentia Chip vs GPU

Chapter 10: Machine Learning at Scale – Hyperparameter Tuning and Model Selection at Scale

  • Hyperparameter Tuning at Scale
  • Hyperparameter Tuning Challenges
  • Distributed Hyperparameter Tuning
  • Bayesian Optimization
  • Spark Based Tools
  • TensorFlowOnSpark
  • Advantages of TensorFlowOnSpark
  • BigDL
  • Advantages of BigDL
  • Horovod
  • Advantages of Horovod
  • H2O Sparkling Water
  • Advantages of Sparkling Water over H2O



Phoenix TS is registered with the National Association of State Boards of Accountancy (NASBA) as a sponsor of continuing professional education on the National Registry of CPE Sponsors. State boards of accountancy have final authority on the acceptance of individual courses for CPE credit. Complaints regarding registered sponsors may be submitted to the National Registry of CPE Sponsors through its website: www.nasbaregistry.org
