URL: https://www.analyticsvidhya.com/blog/2022/12/how-to-encrypt-and-decrypt-the-data-in-pyspark/

How to Encrypt and Decrypt the Data in PySpark?

Kishan Yadav Last Updated : 10 Jan, 2023
6 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Sharing data has become very easy, and we can pass on our details with just a few clicks. To access services, we often need to share essential details such as email IDs, phone numbers, and social security numbers. These details can leak if the service provider doesn’t follow a robust data protection methodology. Many data breaches happen through negligent or accidental exposure, and they can affect users personally, professionally, or financially. Email IDs, phone numbers, and government-issued ID numbers are sensitive and confidential, so we must protect them from falling into the wrong hands.

In this article, we will work on two different methods to encrypt these data so that they can’t get into the hands of unauthorized users. We will see how we can encrypt and decrypt the sensitive data using PySpark.


Why is There a Need for Data Encryption?

Data encryption is essential in several contexts. Consider an organization that works with many clients and must share data to provide its services. Clients entrust the firm with confidential details such as their databases, customer information, and the products they sell or purchase.

All these details are sensitive and must be protected so they can’t get into the wrong hands. If unauthorized individuals access these data, it can lead to severe consequences such as financial loss, reputational damage, or even legal liabilities.

So data encryption helps us to protect sensitive and confidential information. It is a very crucial aspect of data security.


Data Frame Creation

To perform encryption and decryption, we need sample data with essential information like user email id, phone number, social security number, address, etc. Before we send these details to anyone, we need to encrypt them. So, we will create a sample dataframe that has this information. The dataframe has four columns named ‘customer_name’, ‘mail_id’, ‘mobile_num’, and ‘social_security_number’. The column descriptions are as follows:-

  • customer_name:- This column contains the customers’ names.
  • mail_id:- This column has the customers’ email addresses.
  • mobile_num:- This column has the customers’ mobile numbers.
  • social_security_number:- This column has the customers’ government-issued social security numbers.
# import necessary libs
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType
# create a SparkSession
spark = SparkSession.builder.appName("demo").getOrCreate()
# define the schema for the DataFrame
schema = StructType([
 StructField("customer_name", StringType(), True),
 StructField("mail_id", StringType(), True),
 StructField("mobile_num", LongType(), True),
 StructField("social_security_number", StringType(), True)
])
# create the sample data
data = [ ("Max", '[email protected]', 9789457864, '7548-8546-4512'),
("Michael", '[email protected]', 9089848243, '7845-8745-8756'),
("Alex", '[email protected]', 9589848643, '3245-6547-9854'),
("Hector", '[email protected]', 9189648245, '6547-7845-2150')
]
# create the DataFrame
df = spark.createDataFrame(data, schema)
df.show()

The output of the above dataframe will be:-

+-------------+-------------------+----------+----------------------+
|customer_name|            mail_id|mobile_num|social_security_number|
+-------------+-------------------+----------+----------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|
+-------------+-------------------+----------+----------------------+

The above data, like name, email, mobile number, and social security number, are the user’s personal information and can’t be shared directly with any other person or organization. To share these details, we must encrypt and send this data. On the other end, the receiver can decrypt this data with the key.

Using aes_encrypt and aes_decrypt Functions

We will work with Spark’s built-in functions (available from Spark 3.3) to encrypt the above dataframe. We will use the aes_encrypt function to encrypt the ‘mail_id’, ‘mobile_num’, and ‘social_security_number’ columns. Later we will use the aes_decrypt function to decrypt the encrypted data, and compare the decrypted values with the original values to confirm that decryption succeeded.

aes_encrypt() – This function encrypts the plain text. We pass the column to encrypt as the expr argument, then the key (the same key is needed later to decrypt the data), then the mode, and finally the padding. The output of this function is the encrypted value in binary form.

This function will take the following arguments as input:-

  • ‘expr’ – The binary value to encrypt.
  • ‘key’ – The passphrase used to encrypt the data.
  • ‘mode’ – The block cipher mode used to encrypt the messages. Valid modes are ECB and GCM.
  • ‘padding’ – How to pad messages whose length is not a multiple of the block size. Valid values are PKCS, NONE, and DEFAULT. The DEFAULT padding means PKCS for ECB and NONE for GCM.
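
To make the padding argument concrete, here is a minimal pure-Python sketch of PKCS-style padding for the 16-byte AES block size. The helper names `pkcs_pad` and `pkcs_unpad` are hypothetical and exist only for illustration; Spark applies the padding internally when ‘PKCS’ is selected, so this is not how you would call it in practice.

```python
BLOCK_SIZE = 16  # AES always works on 16-byte blocks

def pkcs_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad_len = block_size - (len(data) % block_size)
    return data + bytes([pad_len]) * pad_len

def pkcs_unpad(padded: bytes) -> bytes:
    """Read the last byte to learn how many padding bytes to strip."""
    pad_len = padded[-1]
    return padded[:-pad_len]

message = b"7548-8546-4512"           # 14 bytes, not a multiple of 16
padded = pkcs_pad(message)
print(len(padded))                    # 16
print(pkcs_unpad(padded) == message)  # True
```

A message that is already a multiple of 16 bytes still gets one full block of padding, so the receiver can always tell padding apart from data.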

Syntax of this function is aes_encrypt(expr, key[, mode[, padding]]). The output of this function is the encrypted value in binary form. The key must be 16, 24, or 32 bytes long (that is, 128, 192, or 256 bits). The default mode is GCM.
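
AES keys are 16, 24, or 32 bytes long (128, 192, or 256 bits), so it can help to check a key before using it in a Spark expression. The sketch below is a hypothetical helper, `check_aes_key`, that is not part of PySpark; it only assumes the key is given as a text passphrase, as in this article.

```python
VALID_KEY_LENGTHS = (16, 24, 32)  # bytes, i.e. AES-128, AES-192, AES-256

def check_aes_key(key: str) -> int:
    """Return the key size in bits, or raise if the length is unsupported."""
    key_bytes = key.encode("utf-8")
    if len(key_bytes) not in VALID_KEY_LENGTHS:
        raise ValueError(
            f"AES key must be 16, 24, or 32 bytes, got {len(key_bytes)}"
        )
    return len(key_bytes) * 8

print(check_aes_key("1234567890abcdef"))  # 128
```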

Now we will pass the column names in the expr function to encrypt the data values. The columns whose data we will encrypt are ‘mail_id’, ‘mobile_num’, and ‘social_security_number’. We are going to store the encrypted data in a new dataframe.

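from pyspark.sql.functions import expr
# wrap the chained calls in parentheses so they can span several lines
enc_df = (df.withColumn('encrypted_mail', expr("base64(aes_encrypt(mail_id, '1234567890abcdef', 'ECB', 'PKCS'))"))
    # mobile_num is a LongType column, so cast it to a string before encrypting
    .withColumn('encrypted_mobile_num', expr("base64(aes_encrypt(cast(mobile_num as string), '1234567890abcdgh', 'ECB', 'PKCS'))"))
    .withColumn('encrypted_ssn', expr("base64(aes_encrypt(social_security_number, '1234567890abcdij', 'ECB', 'PKCS'))")))
enc_df.show()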

Here we create the new columns with the ‘withColumn’ function and pass the SQL expression to the expr function. We used ‘1234567890abcdef’ as the key to encrypt the ‘mail_id’ data, with ECB as the mode and PKCS as the padding scheme. The other two columns work the same way; only the keys differ. We also wrap the result in ‘base64’ to convert the encrypted bytes into a text string.
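
The base64 step matters because encrypted output is arbitrary bytes that usually cannot be shown or stored as plain text. A small standard-library sketch of the same idea follows; the sample bytes are made up for illustration and are not real ciphertext.

```python
import base64

raw = bytes([0x00, 0xFF, 0x10])       # arbitrary bytes, not valid UTF-8 text
text = base64.b64encode(raw).decode("ascii")
print(text)                           # AP8Q
print(base64.b64decode(text) == raw)  # True
```

Base64 is only an encoding, not encryption: anyone can reverse it without a key, so the protection comes entirely from the AES step.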

Now we get the encrypted data which looks like this.

+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|      encrypted_mail|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|sk33JvRxTV9PU11qw...|4DF70TSV5/k2f7XDy...|kruqxwUhDD582Q4mf...|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|RzIRtA7ihZG7YlRj9...|eaMgFEdzEkqz7b6+Q...|QvfthH7TQqL6aJNp6...|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|ZahqBXBlprhgNfTyU...|msPEyWULCkIhbtel0...|1Majk18XVhQIJ10J5...|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|O3JpFSx0DGqs+XSIO...|647cANlvcGS4rwwVU...|cMH3zNTAgq8RmHL5R...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+

So, we have our encrypted data, and now we will see how to decrypt this data and get our original data back.

aes_decrypt() – We use this function to decrypt the data values. We pass the encrypted column that needs to be decoded, along with the same key, mode, and padding that were used for encryption. It returns the decrypted values in binary form, which we can cast back to a string.

This function will take the following arguments as input:-

  • ‘expr’ – The binary value to decrypt.
  • ‘key’ – The passphrase used to decrypt the data. It must match the key used for encryption.
  • ‘mode’ – The block cipher mode used to decrypt the messages. Valid modes: ECB, GCM.
  • ‘padding’ – The padding scheme that was applied during encryption. Valid values are PKCS, NONE, and DEFAULT. The DEFAULT padding means PKCS for ECB and NONE for GCM.

Syntax of this function is aes_decrypt(expr, key[, mode[, padding]]). The output of this function is the decrypted original data in binary form. The key must be 16, 24, or 32 bytes long (that is, 128, 192, or 256 bits).

Now we will pass the encrypted data columns in this function and compare the results with the original data.
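
The decryption step can be sketched as follows. This is one possible way to write it, not necessarily the exact code behind the output below: `decrypt_expr` and `decrypt_columns` are hypothetical helper names, the keys are the ones used during encryption, and the sketch assumes the encrypted columns hold base64 text and that the mobile number was encrypted from its string form.

```python
def decrypt_expr(column: str, key: str, mode: str = "ECB", padding: str = "PKCS") -> str:
    """Build a Spark SQL expression that reverses base64(aes_encrypt(...))."""
    return (
        f"cast(aes_decrypt(unbase64({column}), '{key}', '{mode}', '{padding}') "
        "as string)"
    )

def decrypt_columns(enc_df):
    """Add decrypted columns to enc_df; needs a running Spark session."""
    # imported here so that decrypt_expr above stays free of Spark dependencies
    from pyspark.sql.functions import expr

    return (
        enc_df
        .withColumn("decrypted_mail", expr(decrypt_expr("encrypted_mail", "1234567890abcdef")))
        .withColumn("decrypted_mobile_num", expr(decrypt_expr("encrypted_mobile_num", "1234567890abcdgh")))
        .withColumn("decrypted_ssn", expr(decrypt_expr("encrypted_ssn", "1234567890abcdij")))
        .select("customer_name", "decrypted_mail", "decrypted_mobile_num", "decrypted_ssn")
    )

# show the SQL expression generated for one column
print(decrypt_expr("encrypted_mail", "1234567890abcdef"))
```

With the encrypted dataframe from earlier, `decrypt_columns(enc_df).show()` would display the decrypted columns next to the customer names.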

# original data
+-------------+-------------------+----------+----------------------+
|customer_name|            mail_id|mobile_num|social_security_number|
+-------------+-------------------+----------+----------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|
+-------------+-------------------+----------+----------------------+
# encrypted data
+-------------+--------------------+--------------------+--------------------+
|customer_name|      encrypted_mail|encrypted_mobile_num|       encrypted_ssn|
+-------------+--------------------+--------------------+--------------------+
|          Max|sk33JvRxTV9PU11qw...|4DF70TSV5/k2f7XDy...|kruqxwUhDD582Q4mf...|
|      Michael|RzIRtA7ihZG7YlRj9...|eaMgFEdzEkqz7b6+Q...|QvfthH7TQqL6aJNp6...|
|         Alex|ZahqBXBlprhgNfTyU...|msPEyWULCkIhbtel0...|1Majk18XVhQIJ10J5...|
|       Hector|O3JpFSx0DGqs+XSIO...|647cANlvcGS4rwwVU...|cMH3zNTAgq8RmHL5R...|
+-------------+--------------------+--------------------+--------------------+
# decrypted data
+-------------+-------------------+--------------------+--------------+
|customer_name|     decrypted_mail|decrypted_mobile_num| decrypted_ssn|
+-------------+-------------------+--------------------+--------------+
|          Max|  [email protected]|          9789457864|7548-8546-4512|
|      Michael|  [email protected]|          9089848243|7845-8745-8756|
|         Alex|  [email protected]|          9589848643|3245-6547-9854|
|       Hector|  [email protected]|          9189648245|6547-7845-2150|
+-------------+-------------------+--------------------+--------------+

Using Cryptography Library

Now we will use the cryptography library to perform encryption and decryption. In this, we will create a user-defined function (udf) that will take data and complete the encryption and decryption.

Encrypting –

# import necessary libs
from pyspark.sql.functions import udf, lit, col
from pyspark.sql.types import StringType
from cryptography.fernet import Fernet
# encrypt func
def encrypt_data(plain_text, KEY):
    # the key arrives as a binary literal, so convert it back to bytes
    f = Fernet(bytes(KEY))
    # Fernet works on bytes; decode the token so the udf returns a string
    encrypted_text = f.encrypt(str(plain_text).encode()).decode()
    return encrypted_text
encrypt_udf = udf(encrypt_data, StringType())
# generate the encryption key
Key = Fernet.generate_key()
# encrypt the 'mail_id', 'mobile_num', and 'social_security_number' cols
# (the parentheses let the chained calls span several lines)
enc_df = (df.withColumn("encrypted_mail_id", encrypt_udf(col('mail_id'), lit(Key)))
    .withColumn("encrypted_mobile_num", encrypt_udf(col('mobile_num'), lit(Key)))
    .withColumn("encrypted_ssn", encrypt_udf(col('social_security_number'), lit(Key))))
enc_df.show()

Here we generate the encryption key with the cryptography library, then pass each column we want to encrypt to the udf along with the key. Now we will see the encrypted results.

+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|gAAAAABjpED66V3Xw...|gAAAAABjpED6oaixb...|gAAAAABjpED6TWeAg...|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|gAAAAABjpED7nVl6j...|gAAAAABjpED77xy8P...|gAAAAABjpED7D73yg...|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|gAAAAABjpED7Iuq5N...|gAAAAABjpED73BQYd...|gAAAAABjpED7OjE8W...|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|gAAAAABjpED7sT3Tz...|gAAAAABjpED7lH29J...|gAAAAABjpED7SXANT...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+

So, the ‘mail_id’, ‘mobile_num’, and ‘social_security_number’ columns get encrypted.

Now we will see how to decrypt these encrypted columns to get the original values back.

Decrypting –

def decrypt_data(encrypted_text, KEY):
    # the key arrives as a binary literal, so convert it back to bytes
    f = Fernet(bytes(KEY))
    # decrypt the token and decode the resulting bytes into a string
    decoded_val = f.decrypt(encrypted_text.encode()).decode()
    return decoded_val
decrypt_udf = udf(decrypt_data, StringType())
# decrypt the data
# (the parentheses let the chained calls span several lines)
dec_df = (enc_df.withColumn("decrypted_mail_id", decrypt_udf(col('encrypted_mail_id'), lit(Key)))
    .withColumn("decrypted_mobile_num", decrypt_udf(col('encrypted_mobile_num'), lit(Key)))
    .withColumn("decrypted_ssn", decrypt_udf(col('encrypted_ssn'), lit(Key)))
    .drop('mail_id', 'mobile_num', 'social_security_number'))
dec_df.show()

In this, we successfully decrypted the data and got back our original data. We can now see the result and compare it with actual data.

# original and encrypted data
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|customer_name|            mail_id|mobile_num|social_security_number|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
|          Max|  [email protected]|9789457864|        7548-8546-4512|gAAAAABjpED66V3Xw...|gAAAAABjpED6oaixb...|gAAAAABjpED6TWeAg...|
|      Michael|  [email protected]|9089848243|        7845-8745-8756|gAAAAABjpED7nVl6j...|gAAAAABjpED77xy8P...|gAAAAABjpED7D73yg...|
|         Alex|  [email protected]|9589848643|        3245-6547-9854|gAAAAABjpED7Iuq5N...|gAAAAABjpED73BQYd...|gAAAAABjpED7OjE8W...|
|       Hector|  [email protected]|9189648245|        6547-7845-2150|gAAAAABjpED7sT3Tz...|gAAAAABjpED7lH29J...|gAAAAABjpED7SXANT...|
+-------------+-------------------+----------+----------------------+--------------------+--------------------+--------------------+
# decrypted data
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+
|customer_name|   encrypted_mail_id|encrypted_mobile_num|       encrypted_ssn|  decrypted_mail_id|decrypted_mobile_num| decrypted_ssn|
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+
|          Max|gAAAAABjpEE9TcrVL...|gAAAAABjpEE907red...|gAAAAABjpEE92mIuZ...|  [email protected]|          9789457864|7548-8546-4512|
|      Michael|gAAAAABjpEE9UXJF6...|gAAAAABjpEE9OlqYJ...|gAAAAABjpEE9TV8rm...|  [email protected]|          9089848243|7845-8745-8756|
|         Alex|gAAAAABjpEE93b3z_...|gAAAAABjpEE9knvQ7...|gAAAAABjpEE9rXc4g...|  [email protected]|          9589848643|3245-6547-9854|
|       Hector|gAAAAABjpEE9bbV1Z...|gAAAAABjpEE9DfOWj...|gAAAAABjpEE9Lvw6g...|  [email protected]|          9189648245|6547-7845-2150|
+-------------+--------------------+--------------------+--------------------+-------------------+--------------------+--------------+

Note:- Encryption and hashing are different things. A hash is one-way: once computed, it cannot be reverted to the original data. Encrypted values, in contrast, can be decrypted later with the key to get the original data back.
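
To illustrate the hashing side of this note, here is a short standard-library sketch using SHA-256 (chosen only as a familiar example; the article itself does not use a hash function). There is no key and no inverse function: the same input always produces the same digest, which makes hashes useful for matching values but not for recovering them.

```python
import hashlib

ssn = "7548-8546-4512"
digest = hashlib.sha256(ssn.encode("utf-8")).hexdigest()

# a SHA-256 digest is always 64 hex characters, whatever the input length
print(len(digest))  # 64
# hashing the same value again gives the same digest
print(digest == hashlib.sha256(ssn.encode("utf-8")).hexdigest())  # True
```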

Conclusion

In this article, we have covered two methods to encrypt and decrypt data while sharing. By doing so, we can ensure that our data is kept secure and protected from unauthorized access. In PySpark, we can achieve this by following the above two methods and efficiently safeguarding our data.

Key takeaways from this article are:-

  1. We have defined the dataframe and used the ‘aes_encrypt’ and ‘aes_decrypt’ functions to protect our data.
  2. Then we compare the results after decrypting the data to ensure we get the same original data.
  3. Then we use the cryptography library to encrypt and decrypt our data.
  4. In this, we have written a user-defined function (udf) and then used this function to perform the data encryption.

This article helps you to perform encryption and decryption in PySpark. If you have any opinions or questions, then comment down below. Connect with me on LinkedIn for further discussion.

Keep Learning!!!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hello 👋,
I am a Data Engineer with a proven track record of working in the information technology and services industry. I am skilled in Apache Spark, Hive, SQL, Python, Hadoop, Databricks and Cloud.
