PySpark for Big Data

Handle Big Data with Python and Spark.

Is your data too large to process on a single computer? Spark is the leading technology for processing Big Data using distributed computing, and it offers unparalleled performance. This course covers how to program Spark with PySpark, using the Pandas-on-Spark API.
3-5 days
python
spark
pandas
jupyter
Professionals working with data in other technologies such as Excel, R, MATLAB, etc.
Learn how to analyse and process datasets on a cluster using PySpark, including exploring, querying, cleaning and transforming datasets, and performing statistical and mathematical operations.
In-classroom or virtual. The entire course is hands-on.

Outline

Below is an example of how this course might be delivered. Of course, this is fully customizable to fit your needs.

Labs/Exercises

There are exercises available for all topics covered. Participants work in a JupyterLab environment hosted by Code Sensei.

1: Core Python recap

We will adjust the time spent on core Python skills according to the experience level of the participants.

  • Course Introduction
  • Big Picture: Python, Spark and Big Data
  • Overview of Learning Environment
  • Group Introductions
  • Variables
  • Basic data types (int, str, float, bool)
  • Input, Output, Type Conversions
  • If statements
  • While loops
  • Functions
  • Lists
  • Dicts
  • Tuples
  • Sets
  • For Loops
  • Exceptions
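
To give a flavour of the level we recap at, here is a small illustrative snippet (all names and data are made up) touching functions, dicts, loops and exception handling:

    def average(scores):
        """Return the mean of a list of numbers, or None for an empty list."""
        if not scores:
            return None
        return sum(scores) / len(scores)

    grades = {"alice": [8, 9, 7], "bob": [6, 10], "carol": []}
    for name, scores in grades.items():
        try:
            print(f"{name}: {average(scores):.1f}")
        except TypeError:
            print(f"{name}: no scores recorded")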

2: Data Analysis Introduction

We will take our first steps exploring and analysing datasets using Pandas DataFrames.

  • Comprehensions
  • Lambdas, map, filter
  • Numpy introduction
  • Efficient computations using numpy
  • Pandas introduction
  • DataFrames and Series
  • Reading and writing data (incl. CSV, Parquet, SQL, etc.)
  • Exploring a dataset with Pandas
  • Columns, dtypes, info()
  • Selecting and indexing, .loc, .iloc
  • Updating selected values
  • Boolean indexing
  • Basic statistics
  • Sorting by value and index
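
As a sketch of the kind of exploration covered in this module (the file name and columns are hypothetical):

    import pandas as pd

    df = pd.read_csv("sales.csv")        # hypothetical input file
    df.info()                            # columns and dtypes
    print(df.describe())                 # basic statistics

    # Boolean indexing with .loc, then sorting by value
    recent = df.loc[df["year"] >= 2020, ["product", "revenue"]]
    print(recent.sort_values("revenue", ascending=False).head())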

3: Data Analysis, Continued

  • Cleaning a Dataset with Pandas
  • Detecting missing values
  • Handling null values: bfill/ffill, dropna, fillna, interpolate
  • Removing duplicates
  • Converting column types
  • Changing/fixing/resetting index
  • Transforming a DataSet with Pandas
  • Apply mathematical functions and statistics
  • Groupby
  • Changing data structure: pivot, melt, stack, unstack
  • Joining and concatenating datasets
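
A minimal cleaning-and-transforming sketch in the same spirit, using a small made-up dataset:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "city":  ["Oslo", "Oslo", "Bergen", "Bergen", "Bergen"],
        "month": ["jan", "feb", "jan", "feb", "feb"],
        "temp":  [-3.0, np.nan, 2.0, 3.5, 3.5],
    })

    df = df.drop_duplicates()                     # removing duplicates
    df["temp"] = df["temp"].ffill()               # handling null values
    print(df.groupby("city")["temp"].mean())      # groupby + statistics
    print(df.pivot(index="city", columns="month", values="temp"))  # reshaping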

4: PySpark and Pandas-on-Spark

By now we are ready to take the leap towards distributed computing on Spark. The focus is on understanding the Spark execution model.

We briefly cover Spark SQL, but focus mainly on the Pandas-on-Spark API.

  • Spark introduction
  • Overview of Spark libraries
  • Spark architecture incl. cluster manager, context, session
  • Creating a Session
  • Configuring logging
  • Spark SQL brief overview
  • The Spark Catalog
  • Querying Spark Tables
  • Spark IO: reading and writing various file formats incl. CSV, Parquet, ORC, Avro, etc.
  • Spark Schemas
  • Spark DataFrames: lazy and distributed operations
  • Spark DataFrames as SQL Views
  • Comparing Spark and Pandas DataFrames
  • The Spark execution plan
  • Transformations vs. Actions
  • Narrow and wide transformations
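
The sketch below illustrates the execution model in miniature: transformations build a lazy plan, and only an action triggers distributed work. The Parquet file and column names are hypothetical, and a local session stands in for a real cluster:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

    df = spark.read.parquet("sales.parquet")       # hypothetical input

    # Transformations are lazy: nothing has run yet
    per_region = df.filter(F.col("revenue") > 1000).groupBy("region").count()
    per_region.explain()                           # inspect the execution plan
    per_region.show()                              # action: triggers the computation

    # The same DataFrame queried through Spark SQL as a temporary view
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(revenue) AS total FROM sales GROUP BY region").show()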

5: Pandas-on-Spark

  • Pandas-on-Spark DataFrames
  • Spark-specific functions via the .spark accessor (e.g. df.spark.explain())
  • Best Practices
  • Overview of supported Pandas API
  • Converting between Pandas, Spark and Pandas-on-Spark
  • Exploring, cleaning and transforming datasets (applying the skills learned on days 2 and 3 using Spark)
  • Optionally (when time permits): visualization
  • Optionally (when time permits): overview of machine learning with Spark MLLib
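
A short pandas-on-Spark sketch tying the modules together (file and column names are again hypothetical):

    import pyspark.pandas as ps

    psdf = ps.read_csv("sales.csv")                 # hypothetical input
    print(psdf.groupby("region")["revenue"].sum())  # familiar Pandas-style API
    psdf.spark.explain()                            # Spark-specific accessor

    # Converting between the three DataFrame flavours
    pdf = psdf.to_pandas()       # pandas-on-Spark -> Pandas (collects to the driver)
    sdf = psdf.to_spark()        # pandas-on-Spark -> Spark DataFrame
    psdf2 = sdf.pandas_api()     # Spark DataFrame  -> pandas-on-Spark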