Apache Spark, a significant component of the Hadoop ecosystem, is a cluster computing engine for big data. It can run on top of Hadoop YARN and HDFS, and for many in-memory computing tasks it offers order-of-magnitude faster processing than MapReduce.
Unlimited Duration
March 5, 2021
This “skills-centric” course is about 50% hands-on lab and 50% lecture. It is designed to train attendees in core big data and Spark development skills, pairing current, effective techniques with sound industry practices. In this course you will learn about:
· Spark Ecosystem
· Spark Shell
· Spark Data structures (RDD, DataFrame, Dataset)
· Spark SQL
· Modern data formats and Spark
· Spark API
· Spark & Hadoop & Hive
· Spark ML overview
· GraphX
· Time-permitting: Spark Streaming
· Time-permitting: Optional Capstone Workshop
Course Curriculum
- Big data, Hadoop, Spark
- Spark concepts and architecture
- Spark components overview
- Labs: installing and running Spark
- Spark shell
- Analyzing dataset – part 1
- Labs: Spark shell exploration
- Partitions
- Distributed execution
- Operations: transformations and actions
- Labs: Unstructured data analytics using RDDs
- DataFrames Intro
- Loading structured data (JSON, CSV) using DataFrames
- Using schema
- Specifying schema for DataFrames
- Labs: DataFrames, Datasets, Schema
- Hadoop Primer: HDFS, YARN
- Hadoop + Spark architecture
- Running Spark on Hadoop YARN
- Processing HDFS files using Spark
- Spark & Hive
- Machine Learning primer
- Machine Learning in Spark: MLlib / ML
- Spark ML overview (newer Spark 2 version)
- Algorithms overview: Clustering, Classifications, Recommendations
- Labs: Writing ML applications in Spark
- Spark Streaming
- Workshop