When working with massive datasets, attempting to load an entire file at once can overwhelm system memory and cause crashes. Pandas provides an efficient way to handle large files by processing them in smaller, memory-friendly chunks using the chunksize parameter.
chunksize parameter in read_csv()
For instance, suppose you have a CSV file that is too large to fit into memory. The file contains 1,000,000 (10 lakh) rows, so instead of loading it all at once you can read it in chunks of 10,000 rows. Pandas then processes the file in 100 chunks, each containing 10,000 rows.
The chunksize parameter of the read_csv function reads a large CSV file in chunks, rather than loading the entire file into memory at once.
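A minimal sketch of chunked reading follows. To keep it runnable anywhere, it first writes a small stand-in file (1,000 rows, read 100 at a time); the temporary path, the column names, and the sizes are assumptions for illustration, not the original dataset.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in for large_file.csv so the sketch is self-contained.
# (Hypothetical data: 1,000 rows instead of 1,000,000.)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "large_file.csv")
pd.DataFrame({"Customer_ID": range(1, 1001), "Age": [30] * 1000}).to_csv(
    csv_path, index=False
)

chunk_count = 0
total_rows = 0

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one DataFrame holding the whole file.
for chunk in pd.read_csv(csv_path, chunksize=100):
    chunk_count += 1
    total_rows += len(chunk)  # each chunk is an ordinary DataFrame

print(f"Processed {total_rows} rows in {chunk_count} chunks")
```

With the 1,000,000-row file and chunksize=10000 described above, the same loop body would run 100 times.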
The input file large_file.csv has 1,000,000 rows, so looping over it in chunks appends each chunk to chunk_file.csv until the entire file is saved.
Parameters:
- index=False: Excludes the index column from being written to the file.
- mode='a': Appends each chunk to the file instead of overwriting it.
- header=False: Skips writing the header (column names) for every chunk, assuming the header is written once in the destination file.
Example: If your dataset has 10 lakh rows and each row is 1 KB, the full dataset size is ~1 GB. On a system with 4 GB RAM (shared with the OS and other processes), chunking ensures:
- Only 10,000 rows (~10 MB) are loaded into memory at any given time, leaving ample memory for processing and other tasks.
Chunking is thus a practical solution to balance memory, performance, and scalability when dealing with massive datasets.
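The chunk-by-chunk copy described above can be sketched as follows. This is one possible arrangement, not necessarily the original code: it writes the header with the first chunk only and then appends with header=False, and it uses a small stand-in file with hypothetical columns and a temporary directory so it runs on its own.

```python
import os
import tempfile

import pandas as pd

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "large_file.csv")
dst_path = os.path.join(tmp_dir, "chunk_file.csv")

# Hypothetical stand-in data: 500 rows instead of 1,000,000.
pd.DataFrame(
    {"Customer_ID": range(1, 501), "Purchase_Amount": [9.99] * 500}
).to_csv(src_path, index=False)

for i, chunk in enumerate(pd.read_csv(src_path, chunksize=100)):
    chunk.to_csv(
        dst_path,
        index=False,                  # do not write the DataFrame index
        mode="w" if i == 0 else "a",  # create the file once, then append
        header=(i == 0),              # write column names only with the first chunk
    )

# Read the destination back to confirm every row was copied.
copied = pd.read_csv(dst_path)
print(len(copied))
```

Opening the destination in write mode for the first chunk means a rerun starts from a clean file instead of appending duplicate rows.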
First, let's load the dataset and check its columns to get more insight into the type of data and the number of rows in the dataset.
Output:
Index(['Customer_ID', 'Name', 'Age', 'Gender', 'Country', 'Purchase_Amount',
'Purchase_Date', 'Product_Category', 'Feedback_Score'],
dtype='object')
Parameters:
- nrows=0: Tells pd.read_csv to load no data rows but still read the header (column names).
- .columns: Retrieves the column names as an Index object.
Why this is efficient: only the header is parsed, so no data rows are loaded into a DataFrame.
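The header-only read described above can be sketched like this. The file name customers.csv and the single sample row are assumptions made so the snippet is self-contained; only the column names come from the output shown earlier.

```python
import os
import tempfile

import pandas as pd

columns_in_file = [
    "Customer_ID", "Name", "Age", "Gender", "Country",
    "Purchase_Amount", "Purchase_Date", "Product_Category", "Feedback_Score",
]

# Write a one-row stand-in file (hypothetical values) with the same columns.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "customers.csv")
pd.DataFrame(
    [[1, "A", 30, "F", "India", 120.5, "2024-01-01", "Books", 4]],
    columns=columns_in_file,
).to_csv(csv_path, index=False)

# nrows=0 parses the header line but returns no data rows.
header_only = pd.read_csv(csv_path, nrows=0)

print(header_only.columns)  # an Index of column names
print(len(header_only))     # number of data rows loaded
```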
Using generators allows you to process large datasets lazily without loading everything into memory at once, improving memory efficiency. Let's first understand what generators are and how they can help you work with large files in a more efficient way.
A generator in Python is a special type of iterator that allows you to iterate over data, but it doesn't store the entire dataset in memory at once. Instead, generators yield one item at a time, which makes them highly memory-efficient when working with large datasets or streams of data. Generators are defined using functions with the yield keyword. When a function contains a yield statement, it becomes a generator function. When you call this function, it returns a generator object that can be iterated over. The state of the generator is maintained between iterations, so each time you call next() on the generator, it yields the next value.
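As a rough illustration of both ideas, the sketch below defines a plain generator and a generator that yields DataFrame chunks. The names count_up_to and read_in_chunks are hypothetical helpers, and the 30-row file is stand-in data written to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def count_up_to(limit):
    """Plain generator: yields 1..limit, one value per request."""
    n = 1
    while n <= limit:
        yield n  # execution pauses here until the next value is requested
        n += 1


def read_in_chunks(path, chunksize):
    """Hypothetical helper: yields one DataFrame chunk at a time."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        yield chunk


counter = count_up_to(3)
first = next(counter)   # 1
second = next(counter)  # 2, because the generator remembered its state

# Stand-in data: 30 sequential customer IDs.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "large_file.csv")
pd.DataFrame({"Customer_ID": range(1, 31)}).to_csv(csv_path, index=False)

# Record the first and last Customer_ID seen in each chunk.
id_ranges = []
for chunk in read_in_chunks(csv_path, chunksize=10):
    ids = chunk["Customer_ID"]
    id_ranges.append((int(ids.min()), int(ids.max())))

print(id_ranges)  # [(1, 10), (11, 20), (21, 30)]
```

Only one chunk exists in memory at a time; the next one is read when the loop asks for it.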
Notice how the Customer_ID column reflects the number of rows in each chunk.