Aggregations & GroupBy — Sum, Count, Avg, Max & Min
Learn how to perform aggregations in PySpark using groupBy, sum, count, average, max, and min functions with practical Databricks examples.
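To make the pattern concrete before diving in, here is a minimal sketch of a grouped aggregation. The sales data, column names, and the resulting summary DataFrame are invented purely for illustration; the snippet assumes a standard PySpark environment (on Databricks the `spark` session already exists, and `getOrCreate()` simply reuses it).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse the existing session on Databricks, or create one locally.
spark = SparkSession.builder.appName("groupby-aggregations").getOrCreate()

# Hypothetical sales records used only for this illustration.
data = [
    ("Electronics", "Laptop", 1200.0, 2),
    ("Electronics", "Phone", 800.0, 5),
    ("Furniture", "Desk", 300.0, 1),
    ("Furniture", "Chair", 150.0, 4),
]
df = spark.createDataFrame(data, ["category", "product", "price", "quantity"])

# groupBy + agg: sum, count, avg, max, and min computed per category.
summary = df.groupBy("category").agg(
    F.sum("price").alias("total_price"),
    F.count("product").alias("num_products"),
    F.avg("price").alias("avg_price"),
    F.max("price").alias("max_price"),
    F.min("price").alias("min_price"),
)

summary.show()
```

Each aggregate function is given an explicit alias so the result columns stay readable, and a single agg() call can mix any of the built-in aggregation functions.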