Ranking Duplicate Values of a Column in Incremental Order in PySpark
Last Updated: 11 Jul, 2024
In data processing, it is often necessary to rank or order the values within a column, especially when dealing with duplicate values. Ranking duplicate values in incremental order helps in various data analytics tasks, such as identifying the sequence of events, prioritizing tasks, or generating unique identifiers. PySpark, the Python API for Apache Spark, offers powerful tools for handling large-scale data processing and can efficiently perform ranking operations on large datasets. This article will guide you through the process of ranking duplicate values in a column using PySpark.
Understanding Window Functions in PySpark
PySpark provides several functions to rank or order data within DataFrames. Window functions in PySpark are used to perform operations on a set of rows that are related to the current row. These functions are particularly useful for tasks such as ranking, cumulative sums, moving averages, and more. The main window functions used for ranking are:
- rank(): Assigns ranks to rows within a window partition, leaving gaps in rank values if there are ties.
- dense_rank(): Similar to rank(), but does not leave gaps in rank values.
- row_number(): Assigns unique consecutive numbers to rows within a window partition.
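The sample data used later in this article has no ties within a partition, so all three functions will return identical values there. To see where they actually differ, here is a minimal, self-contained sketch (the app name, column names, and values are illustrative assumptions) that ranks rows on a tied score:
Python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank, dense_rank, row_number

spark = SparkSession.builder.appName("RankingTiesDemo").getOrCreate()

# Hypothetical data: two rows tie on score 90
scores = spark.createDataFrame([("a", 90), ("b", 90), ("c", 85)], ["id", "score"])

# A global window (no partitionBy) is acceptable for a tiny demo,
# although Spark warns about it on real data
w = Window.orderBy(col("score").desc())

scores.withColumn("rank", rank().over(w)) \
      .withColumn("dense_rank", dense_rank().over(w)) \
      .withColumn("row_number", row_number().over(w)) \
      .show()

# Expected results:
# rank:       1, 1, 3  (the tie leaves a gap; no rank 2)
# dense_rank: 1, 1, 2  (no gap after the tie)
# row_number: 1, 2, 3  (always unique; order within the tie is arbitrary)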
The PySpark Solution: A Step-by-Step Approach
Let’s dive into an example to illustrate how to rank duplicate values in a column incrementally using PySpark.
Set Up the PySpark Environment
First, ensure that you have PySpark installed. You can install it using pip if you haven’t already:
pip install pyspark
Next, import the necessary modules and create a Spark session:
Python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, dense_rank, row_number
# Create a Spark session
spark = SparkSession.builder.appName("RankDuplicates").getOrCreate()
Creating a Sample DataFrame
Let’s create a sample DataFrame to demonstrate the ranking of duplicate values:
Python
# Sample data
data = [
(1, 1101),
(2, 1102),
(3, 1101),
(4, 1102)
]
# Column names
columns = ["customer_id", "trigger_id"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-----------+----------+
|customer_id|trigger_id|
+-----------+----------+
| 1| 1101|
| 2| 1102|
| 3| 1101|
| 4| 1102|
+-----------+----------+
1. Ranking Duplicate Values Using rank()
To rank duplicate values in incremental order, we can use the rank() function along with a window specification. Here the window partitions the data by trigger_id (so duplicate values are grouped together) and orders each partition by customer_id:
Python
# Define the window specification
window_spec = Window.partitionBy("trigger_id").orderBy("customer_id")
# Apply the rank function
df_rank = df.withColumn("rank", rank().over(window_spec))
df_rank.show()
Output:
+-----------+----------+----+
|customer_id|trigger_id|rank|
+-----------+----------+----+
| 1| 1101| 1|
| 3| 1101| 2|
| 2| 1102| 1|
| 4| 1102| 2|
+-----------+----------+----+
2. Ranking Duplicate Values Using dense_rank()
The dense_rank() function can be used similarly to rank(), but it does not leave gaps in the ranking sequence:
Python
# Apply the dense_rank function
df_dense_rank = df.withColumn("dense_rank", dense_rank().over(window_spec))
df_dense_rank.show()
Output:
+-----------+----------+----------+
|customer_id|trigger_id|dense_rank|
+-----------+----------+----------+
| 1| 1101| 1|
| 3| 1101| 2|
| 2| 1102| 1|
| 4| 1102| 2|
+-----------+----------+----------+
Comparing rank(), dense_rank(), and row_number()
To understand the differences between these functions, let’s apply all three to the DataFrame. Note that because customer_id is unique within each trigger_id partition there are no ties here, so all three functions return the same values:
Python
# Apply rank, dense_rank, and row_number over the same window
df_all_ranks = df.withColumn("rank", rank().over(window_spec)) \
    .withColumn("dense_rank", dense_rank().over(window_spec)) \
    .withColumn("row_number", row_number().over(window_spec))
df_all_ranks.show()
Output:
+-----------+----------+----+----------+----------+
|customer_id|trigger_id|rank|dense_rank|row_number|
+-----------+----------+----+----------+----------+
| 1| 1101| 1| 1| 1|
| 3| 1101| 2| 2| 2|
| 2| 1102| 1| 1| 1|
| 4| 1102| 2| 2| 2|
+-----------+----------+----+----------+----------+
Splitting DataFrames Based on Rank
You may want to split the DataFrame into two separate DataFrames based on even and odd ranks. Here’s how you can do it:
Python
# Split DataFrame into even and odd ranks
df_even_rank = df_rank.filter(df_rank.rank % 2 == 0)
df_odd_rank = df_rank.filter(df_rank.rank % 2 != 0)
# Show even rank DataFrame
df_even_rank.show()
# Show odd rank DataFrame
df_odd_rank.show()
Even Rank Output:
+-----------+----------+----+
|customer_id|trigger_id|rank|
+-----------+----------+----+
| 3| 1101| 2|
| 4| 1102| 2|
+-----------+----------+----+
Odd Rank Output:
+-----------+----------+----+
|customer_id|trigger_id|rank|
+-----------+----------+----+
| 1| 1101| 1|
| 2| 1102| 1|
+-----------+----------+----+
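A common application of this ranking pattern is de-duplication: keeping only the first row in each group. Here is a minimal sketch, assuming we want to keep the row with the lowest customer_id for each trigger_id (the column name "rn" is an arbitrary choice):
Python
# Keep only the first row per trigger_id (row number 1), dropping later duplicates
df_first = df.withColumn("rn", row_number().over(window_spec)) \
    .filter("rn = 1") \
    .drop("rn")
df_first.show()

# Keeps customer_id 1 (trigger_id 1101) and customer_id 2 (trigger_id 1102)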
Conclusion
Ranking duplicate values in a column incrementally is a common task in data analysis, and PySpark provides efficient tools to accomplish this. By using window functions such as rank(), dense_rank(), and row_number(), you can easily rank and manipulate your data according to your needs.
In this article, we covered:
- Setting up PySpark and creating a sample DataFrame.
- Using window functions to rank duplicate values.
- Comparing different ranking functions.
- Splitting DataFrames based on rank.