PySpark for Big Data

Handle Big Data with Python and Spark.

Is your data too large to process on a single computer? Spark is the leading technology for processing Big Data using distributed computing, and it offers unparalleled performance. This course covers how to program Spark with PySpark, using the Pandas-on-Spark API.
3-5 days
python
spark
pandas
jupyter
Professionals working with data in other technologies such as Excel, R, MATLAB, etc.
Learn how to analyse and process datasets on a cluster using PySpark, including exploring, querying, cleaning and transforming datasets, and performing statistical and mathematical operations.
In-classroom or virtual. The entire course is hands-on.

Outline

Below is an example of how this course might be delivered. Of course, this is fully customizable to fit your needs.

Labs/Exercises

There are exercises available for all topics covered. Participants work in a JupyterLab environment hosted by Code Sensei.

1: Core Python recap

We will adjust the time spent on core Python skills according to the experience level of the participants.

  • Course Introduction
  • Big Picture: Python, Spark and Big Data
  • Overview of Learning Environment
  • Group Introductions
  • Variables
  • Basic data types (int, str, float, bool)
  • Input, Output, Type Conversions
  • If statements
  • While loops
  • Functions
  • Lists
  • Dicts
  • Tuples
  • Sets
  • For Loops
  • Exceptions
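
To give a flavour of the level we recap at, here is a small illustrative snippet (all names and data are made up) touching functions, dicts, loops and exception handling:

    def average(scores):
        """Return the mean of a list of numbers, or None for an empty list."""
        if not scores:
            return None
        return sum(scores) / len(scores)

    grades = {"alice": [8, 9, 7], "bob": [6, 10], "carol": []}
    for name, scores in grades.items():
        try:
            print(f"{name}: {average(scores):.1f}")
        except TypeError:
            print(f"{name}: no scores recorded")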

2: Data Analysis Introduction

We will take our first steps exploring and analysing datasets using Pandas DataFrames.

  • Comprehensions
  • Lambdas, map, filter
  • Numpy introduction
  • Efficient computations using numpy
  • Pandas introduction
  • DataFrames and Series
  • Reading and writing data (incl. CSV, Parquet, SQL, etc.)
  • Exploring a dataset with Pandas
  • Columns, dtypes, info()
  • Selecting and indexing, .loc, .iloc
  • Updating selected values
  • Boolean indexing
  • Basic statistics
  • Sorting by value and index
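
As a sketch of the kind of exploration covered in this module (the file name and columns are hypothetical):

    import pandas as pd

    df = pd.read_csv("sales.csv")        # hypothetical input file
    df.info()                            # columns and dtypes
    print(df.describe())                 # basic statistics

    # Boolean indexing with .loc, then sorting by value
    recent = df.loc[df["year"] >= 2020, ["product", "revenue"]]
    print(recent.sort_values("revenue", ascending=False).head())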

3: Data Analysis, Continued

  • Cleaning a Dataset with Pandas
  • Detecting missing values
  • Handling null values: bfill/ffill, dropna, fillna, interpolate
  • Removing duplicates
  • Converting column types
  • Changing/fixing/resetting index
  • Transforming a DataSet with Pandas
  • Apply mathematical functions and statistics
  • Groupby
  • Changing data structure: pivot, melt, stack, unstack
  • Joining and concatenating datasets
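
A minimal cleaning-and-transforming sketch in the same spirit, using a small made-up dataset:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "city":  ["Oslo", "Oslo", "Bergen", "Bergen", "Bergen"],
        "month": ["jan", "feb", "jan", "feb", "feb"],
        "temp":  [-3.0, np.nan, 2.0, 3.5, 3.5],
    })

    df = df.drop_duplicates()                     # removing duplicates
    df["temp"] = df["temp"].ffill()               # handling null values
    print(df.groupby("city")["temp"].mean())      # groupby + statistics
    print(df.pivot(index="city", columns="month", values="temp"))  # reshaping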

4: PySpark and Pandas-on-Spark

By now we are ready to take the leap towards distributed computing on Spark. The focus is on understanding the Spark execution model.

We briefly cover Spark SQL, but focus mainly on the Pandas-on-Spark API.

  • Spark introduction
  • Overview of Spark libraries
  • Spark architecture incl. cluster manager, context, session
  • Creating a Session
  • Configuring logging
  • Spark SQL brief overview
  • The Spark Catalog
  • Querying Spark Tables
  • Spark IO: reading and writing various file formats incl. CSV, Parquet, ORC, Avro, etc.
  • Spark Schemas
  • Spark DataFrames: lazy and distributed operations
  • Spark DataFrames as SQL Views
  • Comparing Spark and Pandas DataFrames
  • The Spark execution plan
  • Transformations vs. Actions
  • Narrow and wide transformations
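
The sketch below illustrates the execution model in miniature: transformations build a lazy plan, and only an action triggers distributed work. The Parquet file and column names are hypothetical, and a local session stands in for a real cluster:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

    df = spark.read.parquet("sales.parquet")       # hypothetical input

    # Transformations are lazy: nothing has run yet
    per_region = df.filter(F.col("revenue") > 1000).groupBy("region").count()
    per_region.explain()                           # inspect the execution plan
    per_region.show()                              # action: triggers the computation

    # The same DataFrame queried through Spark SQL as a temporary view
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(revenue) AS total FROM sales GROUP BY region").show()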

5: Pandas-on-Spark

  • Pandas-on-Spark DataFrames
  • Spark-specific functions via the .spark accessor (e.g. df.spark.explain())
  • Best Practices
  • Overview of supported Pandas API
  • Converting between Pandas, Spark and Pandas-on-Spark
  • Exploring, cleaning and transforming datasets (applying the skills learned on days 2 and 3 using Spark)
  • Optionally (when time permits): visualization
  • Optionally (when time permits): overview of machine learning with Spark MLLib
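
A short pandas-on-Spark sketch tying the modules together (file and column names are again hypothetical):

    import pyspark.pandas as ps

    psdf = ps.read_csv("sales.csv")                 # hypothetical input
    print(psdf.groupby("region")["revenue"].sum())  # familiar Pandas-style API
    psdf.spark.explain()                            # Spark-specific accessor

    # Converting between the three DataFrame flavours
    pdf = psdf.to_pandas()       # pandas-on-Spark -> Pandas (collects to the driver)
    sdf = psdf.to_spark()        # pandas-on-Spark -> Spark DataFrame
    psdf2 = sdf.pandas_api()     # Spark DataFrame  -> pandas-on-Spark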