Data Science with Hadoop and Spark

This course prepares students for data science on the Apache Hadoop platform. Hadoop has two primary distributions, from Hortonworks and Cloudera, and the course is structured to allow maximum customization for your needs.

You need to make three decisions. The first is to choose the Core Essentials course; for this offering, that is Data Scientist. The remaining two determine the content and length of the course. The Essentials class is 3 days.

Second, decide whether students need a deeper dive into the Ecosystem, with coverage of code, or simply a discussion of the Ecosystem components. The deeper dive adds 1 day to the course length.

Third, decide whether the students need Hortonworks, Cloudera, or both. This choice affects the labs but has no effect on course length.


Introduction to Hadoop
  • Hadoop
  • Traditional Large-scale Systems
  • What has happened to Hadoop?
  • The Hadoop Ecosystem

Hadoop Architecture and HDFS

  • Distributed Processing on a Cluster
  • HDFS Architecture
  • Using HDFS
  • YARN Architecture
  • Working with YARN

YARN and MapReduce

  • YARN Architecture and Components
  • MapReduce
  • Advanced HDFS
  • Advanced MapReduce
  • MapReduce Joins
  • Decommissioning a DataNode
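The MapReduce model covered above can be previewed without a cluster. The following is a minimal pure-Python sketch of the map, shuffle, and reduce phases for the classic word-count example; it is a stand-in for illustration, not the Hadoop Java API:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle/sort: group all values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(pairs))
# counts["the"] == 2; every other word appears once
```

In Hadoop, the mapper and reducer run as separate tasks across the cluster, and the shuffle is handled by the framework; the data flow, however, is the same.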

Data Science

  • Use cases
  • Using Python and NumPy to analyze data
  • Writing Python scripts
  • K-means with Python
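As a preview of the k-means topic, here is a minimal NumPy sketch of the standard assign-then-update iteration; the sample points, seed, and iteration count are illustrative assumptions:

```python
import numpy as np

def kmeans(points, k, iterations=10, seed=0):
    # Pick k initial centroids at random from the data
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        distances = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its cluster
        centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return centroids, labels

# Two well-separated blobs; k-means should put one centroid in each
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
centroids, labels = kmeans(points, k=2)
```

The course lab applies the same algorithm as a clustering tool on real datasets.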

Apache Spark

  • Introduction
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Spark RDDs

  • Creating RDDs
  • Other General RDD Operations
  • Creating Pair RDDs from Generic RDDs
  • Special Operations on Pair RDDs
  • MapReduce algorithms
  • Other Pair RDD Operations

Spark Applications

  • Spark Applications vs. Spark Shell
  • Spark Applications (Scala and Python)
  • Running a Spark Application
  • Spark MLlib
  • The Spark Application Web UI

Spark & Machine Learning

  • RDD Lineage
  • Caching Overview
  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
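Spark SQL, covered above, registers structured data and queries it with SQL. Since a Spark cluster is not assumed here, the minimal sketch below previews the same idea with Python's built-in sqlite3; the employees table, its columns, and the commented PySpark calls are illustrative assumptions, not part of the course materials:

```python
import sqlite3

# Stand-in for a DataFrame: a small table of (name, dept, salary) rows.
# The rough PySpark equivalent would be:
#   df = sqlContext.createDataFrame(rows, ["name", "dept", "salary"])
#   df.registerTempTable("employees")
#   sqlContext.sql("SELECT dept, AVG(salary) FROM employees GROUP BY dept")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
rows = [("Ada", "eng", 120.0), ("Bob", "eng", 100.0), ("Cy", "ops", 90.0)]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

result = conn.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
# result == [("eng", 110.0), ("ops", 90.0)]
```

The key difference is scale: Spark SQL runs the same style of query as distributed jobs over DataFrames backed by HDFS or Hive tables.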

Apache Pig

  • Pig Concepts
  • Installing the Pig engine
  • Pig Latin

Apache Hive

  • Introduction to Hive
  • Comparing Hive to Traditional Databases
  • Hive Use Cases
  • How to split datasets
  • Data Storage
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatLoader and HCatStorer
  • Metadata Caching

Data Formats and Compression

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression
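A central rule in Avro schema evolution is that a newly added field must carry a default, so that readers using the newer schema can still process records written with the older one. A hypothetical `User` schema illustrating this (the record and field names are invented):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "email", "type": "string"},
    {"name": "plan",  "type": "string", "default": "free"}
  ]
}
```

When a reader with this schema decodes an old record that lacks `plan`, Avro fills in `"free"`; without the default, reading old data would fail.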

Data Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive
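As a sketch of Hive partitioning, the hypothetical DDL below (table and column names are invented) creates a table partitioned by date; each partition maps to its own HDFS subdirectory, so queries that filter on `dt` read only the matching directories:

```sql
-- Hypothetical web-logs table, partitioned by date
CREATE TABLE logs (
  ip     STRING,
  url    STRING,
  status INT
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;

-- Each partition is a subdirectory, e.g. .../logs/dt=2016-01-15/
ALTER TABLE logs ADD PARTITION (dt = '2016-01-15');
```

Impala shares the Hive metastore, so the same partitioning scheme benefits Impala queries as well.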

LABS (3 to 4 days)

  • Create a DataFrame and perform analysis
  • Load, transform, and store data using Spark with Hive tables
  • Use Python as a data analysis platform
  • Apply k-means as a clustering tool
  • Use a Python natural language tool
  • Apply Spark MLlib machine learning algorithms
  • Use Hive to discover useful information in a dataset
  • Observe how Hive queries are executed as MapReduce jobs
  • Perform a join of two datasets with Hive

  • Load data into HDFS and Hadoop
  • Use HDFS commands to manage files and folders
  • Build the Spark “Hello World” word-count application
  • Use RDDs to perform sort, join, and other tasks
  • Explore partitioning and the Spark UI
  • Build and package a Spark application
  • Use a broadcast variable to efficiently join a small dataset to a massive dataset
  • Explore, transform, split, and join datasets using Pig
  • Use Pig to transform and export a dataset for use with Hive