Class Details

Price: $2,295

3-Day Course Includes:

  • Class exercises in addition to training instruction
  • Courseware books, notepads, pens, highlighters and other materials
  • Free subscription to Cloudera's practice exam questions
  • Full breakfast with a variety of bagels, fruits, yogurt, doughnuts and juice
  • Tea, coffee, and soda available all day
  • Freshly baked cookies every afternoon (only at participating locations)

For group training options, please call us at (240) 667-7757 or email promo@phoenixts.com. 

Course Outline

Basics of Hadoop

  • Origin and Overview of Hadoop
  • HDFS
  • MapReduce
  • Hadoop Ecosystem

Overview of Apache Pig

  • Pig Intro
  • Features and Use Cases for Pig
  • Pig Interaction

Pig for Basic Data Analysis

  • Pig Latin Syntax
  • Basics of Loading Data and Data Types
  • Field Definitions
  • Data Output
  • Schema
  • Data Sort and Filters
  • Typical Functions

Pig for Complex Data Analysis

  • Formats for Data Storage
  • Complex and Nested Data Types
  • Grouping
  • Built-In Functions
  • Iterating Grouped Data

Pig for Multi-Dataset Operations

  • Methods for Merging Data Sets
  • Set Operations
  • Methods for Splitting Data Sets

Extending Pig

  • Incorporating Flexibility with Parameters
  • Macros and Imports
  • UDFs
  • Contributed Functions
  • Processing Data with Other Scripting Languages

Pig for Optimizing and Troubleshooting

  • Troubleshooting with Apache Pig
  • Logging
  • Using the Hadoop Web UI
  • Sampling and Debugging Data
  • Overview of Performance
  • Execution Plan
  • Improve Pig Job Performance

Overview of Apache Hive

  • Intro to Hive
  • Schema and Data Storage
  • Hive vs. Traditional Databases
  • Hive vs. Pig
  • Use Cases for Hive and Interacting with Hive

Relational Data Analysis with Hive

  • Databases and Tables
  • HiveQL Syntax
  • Types of Data
  • Process for Joining Data Sets
  • Built-In Functions

Managing Hive Data

  • Data Formats in Hive
  • Database and Hive-Managed Table Creation
  • Data Loading into Hive
  • Alterations to Tables and Databases
  • Self-Managed Tables
  • Simplifying Queries with Views
  • Storing Results
  • Data Access Control

Hive for Processing Text

  • Text Processing Basics
  • String Functions
  • Regular Expressions
  • Sentiment Analysis and N-Grams

Optimizing Hive

  • Query Performance
  • Controls for Job Execution Plan
  • Partitioning and Bucketing
  • Data Indexing

Hive Extensions

  • SerDes
  • Transforming Data
  • Customized Scripts
  • User-Defined Functions
  • Parameterized Queries

Impala Intro

  • Overview of Impala
  • Impala vs. Hive and Pig
  • Impala vs. Relational Databases
  • Future Directions and Limitations
  • Using the Impala Shell

Impala for Data Analysis

  • Common Syntax
  • Types of Data
  • Filtering, Sorting and Limiting Results
  • Join and Group Data
  • Improvements to Impala Performance

Best Tools for the Job

  • MapReduce, Pig, Hive, Impala & Relational Databases
  • How to Decide?

Objectives

  • Hadoop Fundamentals
  • Apache Pig
  • Pig and Data Analysis
  • Pig and Complex Data Analysis
  • Pig for Multi-Dataset Operations
  • Pig for Optimizing and Troubleshooting
  • Apache Hive
  • Hive and Relational Data Analysis
  • Hive Data Management
  • Processing Text in Hive
  • Hive Optimization
  • Extensions for Hive
  • Impala and Data Analysis

Class Exam

The Cloudera Data Analyst training course is one step toward achieving the Cloudera Certified Professional: Data Scientist (CCP:DS) certification.

Data Science Essentials (DS-200) Exam

Details:

  • Number of Questions: 60 questions, including 6-10 unscored (beta) items
  • Question Types: multiple-choice, matching and reading passages
  • Time Limit: 90 minutes
  • Passing Score: 500 on a scale of 0-700
  • Delivery: Pearson VUE

Data Science Challenge

Candidates for certification must complete the Cloudera Data Science Challenge, which is offered twice a year. Candidates have three months from the start of their project to complete it.

Each project uses a real-world data science problem. Candidates are judged against each other’s work and against a benchmark set by the world’s top data scientists. Those who surpass the benchmark can earn the Data Science credential.

Register for Class

Date: 01/29/19 - 02/01/19, 4 days, 10:00AM – 6:00PM
Location: Online