What are Spark and PySpark?
Spark is an open-source, distributed computing framework designed for fast and general-purpose cluster computing.
Fast: Leverages in-memory caching to significantly speed up computations compared to traditional MapReduce.
Versatile: Supports a wide range of processing tasks, including data analysis, machine learning, stream processing, and graph processing.
Distributed: Can process massive datasets across a cluster of machines, enabling parallel processing for faster results.
Easy to use: Provides high-level APIs in several languages (Scala, Java, Python, R) for easy development.
PySpark is the Python API for Apache Spark. It provides a user-friendly interface for interacting with Spark from the Python programming language.
What are SparkSession and SparkConf?
SparkSession is the central object in Spark for interacting with Spark functionalities. It's the primary way to access and manipulate Spark DataFrames and Datasets.
Key Responsibilities:
Manages the Spark context and its configuration.
Provides methods to create DataFrames and Datasets.
Enables SQL queries over DataFrames registered as tables.
Handles caching and persistence of data.
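A minimal sketch of these responsibilities in action; the data, column names, and view name below are purely illustrative:
from pyspark.sql import SparkSession

# Entry point; also manages the underlying Spark context and configuration
spark = SparkSession.builder.appName("SessionDemo").getOrCreate()

# Create a DataFrame from illustrative in-memory data
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Register it as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 1").show()

# Cache the DataFrame so later actions reuse it
df.cache()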
SparkConf is an object used to set various configuration properties for a Spark application. These properties control how Spark executes tasks, allocates resources, and interacts with the underlying cluster.
Key properties:
master: Specifies the deployment mode (e.g., "local", "yarn", "spark://master:7077").
appName: Sets the name of the Spark application.
spark.executor.cores: Configures the number of cores per executor.
spark.executor.memory: Sets the memory allocated to each executor.
Note: SparkSession internally uses a SparkConf object to manage the configuration settings for the Spark application.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("YourAppName") \
.getOrCreate()
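For finer-grained control, the same properties can be set explicitly on a SparkConf and passed to the builder (or supplied via the builder's config method). A minimal sketch, assuming local mode and illustrative executor settings:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative settings: local master and example executor resources
conf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("YourAppName") \
    .set("spark.executor.cores", "2") \
    .set("spark.executor.memory", "2g")

# Pass the SparkConf to the builder; SparkSession manages it internally
spark = SparkSession.builder.config(conf=conf).getOrCreate()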
Spark DataFrames and RDDs
| Feature | RDD | SQL DataFrame |
| --- | --- | --- |
| Data Representation | Unstructured | Structured with schema |
| Immutability | Immutable | Immutable |
| Fault Tolerance | Resilient through lineage | Resilient through lineage and persistence |
| Parallelism | Inherent parallelism | Inherent parallelism |
| API | Lower-level API | Higher-level API with SQL support |
| Performance | Can be slower for complex operations | Generally faster due to optimizations |
Lineage and persistence are complementary mechanisms behind Spark's fault tolerance. Lineage records how each dataset was derived, so lost partitions can be recomputed automatically, while persistence (caching) keeps frequently used data in memory or on disk, speeding up reuse and reducing the cost of recovery. Used together, they let Spark applications survive node failures while processing large datasets efficiently.
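A small sketch of persistence in practice; the data size and storage level here are only examples:
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PersistenceDemo").getOrCreate()

# A DataFrame built from illustrative generated data; its lineage
# (how it was derived) lets Spark recompute lost partitions.
df = spark.range(1_000_000).withColumnRenamed("id", "value")

# Persist in memory (spilling to disk if needed) so repeated actions
# reuse the cached data instead of recomputing the full lineage.
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()      # first action materializes and caches the data
df.count()      # later actions read from the cache

df.unpersist()  # release the cached data when no longer needed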
Basic Functions
| Task | PySpark Code | Explanation |
| --- | --- | --- |
| List all tables | spark.catalog.listTables() | Returns a list of all tables registered in the Spark session. |
| Pandas DataFrame to Spark DataFrame | spark_df = spark.createDataFrame(pandas_df) | Creates a Spark DataFrame from an existing Pandas DataFrame. |
| Spark DataFrame to Pandas DataFrame | pandas_df = spark_df.toPandas() | Converts a Spark DataFrame to a Pandas DataFrame. Note: this can be expensive for large datasets because all the data is transferred to the driver. |
| Create temp view | spark_df.createOrReplaceTempView("table_name") | Creates or replaces a temporary view with the given name, allowing SQL queries to be executed against the Spark DataFrame. |
| Read data (e.g., from CSV) | spark_df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True) | Reads data from a CSV file into a Spark DataFrame. <br> - header=True: the first row contains column names. <br> - inferSchema=True: automatically infers each column's data type. |
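Putting a few of these together; the pandas data, view name, and file path below are placeholders:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BasicFunctionsDemo").getOrCreate()

# Pandas DataFrame -> Spark DataFrame (illustrative data)
pandas_df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.7, 0.9]})
spark_df = spark.createDataFrame(pandas_df)

# Register a temp view and confirm it appears in the catalog
spark_df.createOrReplaceTempView("scores")
print(spark.catalog.listTables())

# Query the view with SQL, then bring the (small) result back to pandas
result_pd = spark.sql("SELECT id, score FROM scores WHERE score > 0.6").toPandas()

# Reading a CSV works the same way (placeholder path):
# csv_df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)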
Stay tuned for the next in the series….