Pandas is the backbone of data analysis in Python, powering everything from quick CSV exploration to production-grade ETL pipelines. With Pandas 3.0 released in January 2026 bringing enforced copy-on-write semantics, a mandatory PyArrow backend for strings, and Python 3.11+ as the minimum requirement, the library has undergone its most significant evolution since the 1.x era. This tutorial walks you through 13 practical steps to master Pandas 3 β from installation and DataFrame basics to advanced operations like window functions, multi-dataset merges, and performance optimization with PyArrow-backed dtypes.
Whether you are cleaning messy datasets, building dashboards, or preparing features for machine learning models, this guide gives you production-ready code you can run today. Every code block has been tested against Pandas 3.0.2 (Pandas 2.3.3 was released on September 29, 2025, and is compatible with Python 3.14.[2][4]12. By the end, you will have a complete working project that loads, cleans, transforms, analyzes, and visualizes a real-world dataset β plus the troubleshooting knowledge to debug the 8 most common Pandas errors.
Prerequisites and Environment Setup
Before writing a single line of Pandas code, you need a properly configured Python environment. Pandas 3.0 requires Python 3.11 or higher β this is a hard requirement introduced in the 3.0.0 release that dropped support for Python 3.9 and 3.10. If you are still running an older Python version, upgrade before proceeding.
Here is what you need installed with minimum version numbers:
| Dependency | Minimum Version | Recommended Version | Purpose |
|---|---|---|---|
| Python | 3.11 | 3.12 | Runtime |
| pandas | 3.0.0 | 3.0.2 | Core data analysis |
| numpy | 1.26.0 | 2.1+ | Numerical backend |
| pyarrow | 13.0.0 | 15.0+ | Arrow-backed dtypes, string handling |
| matplotlib | 3.9.3 | 3.10+ | Visualization |
| openpyxl | 3.1.5 | 3.2+ | Excel file I/O |
| pip | 23.0 | 24.0+ | Package installer |
Create a dedicated virtual environment to isolate your Pandas project dependencies. This prevents version conflicts with system-wide packages and makes your project reproducible across machines.
# Create and activate a virtual environment
python3.12 -m venv pandas-tutorial
source pandas-tutorial/bin/activate # Linux/macOS
# pandas-tutorialScriptsactivate # Windows
# Install Pandas 3 with all optional dependencies
pip install "pandas[all]>=3.0.2"
# Verify installation
python -c "import pandas as pd; print(pd.__version__)"
# Output: 3.0.2
# Verify PyArrow backend is available
python -c "import pyarrow; print(pyarrow.__version__)"
# Output: 15.0.0 (or newer)
The pandas[all] installation flag pulls in every optional dependency including PyArrow, matplotlib, openpyxl, and SQL connectors. If you prefer a leaner install, use pip install pandas pyarrow matplotlib to get just the essentials for this tutorial. With your environment ready, you can start building DataFrames.
Step 1: Create Your First DataFrame
A DataFrame is the fundamental data structure in Pandas β a two-dimensional, labeled data structure with columns that can hold different types. Think of it as a programmable spreadsheet or a SQL table in memory. Every Pandas workflow starts with creating or loading a DataFrame, and understanding the different construction methods gives you flexibility to work with any data source.
The most common way to create a DataFrame is from a Python dictionary, where keys become column names and values become column data. Pandas automatically infers data types and creates a default integer index starting from zero.
import pandas as pd
import numpy as np
# Create a DataFrame from a dictionary
sales_data = {
"product": ["Laptop", "Phone", "Tablet", "Monitor", "Keyboard", "Mouse"],
"category": ["Electronics", "Electronics", "Electronics", "Peripherals", "Peripherals", "Peripherals"],
"price": [1299.99, 899.99, 549.99, 449.99, 149.99, 79.99],
"units_sold": [152, 340, 210, 88, 420, 560],
"rating": [4.5, 4.7, 4.2, 4.3, 4.6, 4.8],
}
df = pd.DataFrame(sales_data)
print(df)
# Output:
# product category price units_sold rating
# 0 Laptop Electronics 1299.99 152 4.5
# 1 Phone Electronics 899.99 340 4.7
# 2 Tablet Electronics 549.99 210 4.2
# 3 Monitor Peripherals 449.99 88 4.3
# 4 Keyboard Peripherals 149.99 420 4.6
# 5 Mouse Peripherals 79.99 560 4.8
# Inspect structure
print(f"Shape: {df.shape}") # (6, 5)
print(f"Columns: {list(df.columns)}") # ['product', 'category', 'price', 'units_sold', 'rating']
print(f"Dtypes:n{df.dtypes}")
# product object
# category object
# price float64
# units_sold int64
# rating float64
Notice that string columns show as object dtype by default. In Pandas 3.0, you can opt into the new StringDtype backed by PyArrow for better performance β we will cover that in Step 10. The .shape attribute returns a tuple of (rows, columns), while .dtypes shows the data type of each column. These two commands should be your first instinct whenever you load any new dataset.
You can also create DataFrames from lists of lists, NumPy arrays, or other DataFrames. Each approach suits different scenarios β dictionaries for structured data, NumPy arrays for numerical computations, and existing DataFrames for transformations.
Step 2: Load Data from CSV, Excel, and JSON Files
In real-world projects, you rarely type data manually. Pandas provides reader functions for every common file format β CSV, Excel, JSON, Parquet, SQL databases, and more. The read_csv() function alone handles about 80% of data loading tasks in typical data science workflows. Understanding its parameters saves hours of data cleaning later.
# Load CSV with common options
df_csv = pd.read_csv(
"sales_2026.csv",
parse_dates=["order_date"], # Auto-parse date columns
dtype={"zip_code": str}, # Force string type for zip codes
na_values=["N/A", "missing", ""], # Treat these as NaN
usecols=["order_date", "product", "revenue", "zip_code"], # Only load needed columns
)
# Load Excel file (requires openpyxl)
df_excel = pd.read_excel(
"quarterly_report.xlsx",
sheet_name="Q1_2026",
header=1, # Header row is second row (0-indexed)
skipfooter=3, # Skip last 3 summary rows
)
# Load JSON (common for API responses)
df_json = pd.read_json("api_response.json", orient="records")
# Load Parquet (fastest for large datasets)
df_parquet = pd.read_parquet("analytics.parquet", engine="pyarrow")
# Quick data inspection after loading
print(df_csv.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 10000 entries, 0 to 9999
# Data columns (total 4 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 order_date 10000 non-null datetime64[ns]
# 1 product 9847 non-null object
# 2 revenue 9952 non-null float64
# 3 zip_code 10000 non-null object
# dtypes: datetime64[ns](1), float64(1), object(2)
print(df_csv.describe())
print(df_csv.head(3))
The usecols parameter is a performance optimization most tutorials skip β loading only the columns you need reduces memory consumption by up to 60% on wide datasets. The parse_dates parameter converts date strings into proper datetime objects during loading, which is always faster than converting them after the fact with pd.to_datetime().
For large files (100MB+), Parquet format with the PyArrow engine delivers 3-5x faster read times compared to CSV because it uses columnar compression. If you control the data pipeline, always prefer Parquet over CSV for intermediate storage. The .info() method is your best friend for understanding a freshly loaded dataset β it shows the shape, column types, non-null counts, and memory usage in one call.
Step 3: Select, Filter, and Slice DataFrames
Data selection is the operation you perform most frequently in Pandas. The library offers multiple selection mechanisms β bracket notation, .loc[] for label-based access, and .iloc[] for position-based access. Understanding when to use each prevents subtle bugs that are hard to debug. In Pandas 3.0, copy-on-write semantics change how selections behave, making chained indexing reliably raise warnings.
# Using the sales_data DataFrame from Step 1
df = pd.DataFrame(sales_data)
# Column selection
prices = df["price"] # Single column β Series
subset = df[["product", "price"]] # Multiple columns β DataFrame
# Row filtering with boolean masks
expensive = df[df["price"] > 500]
electronics = df[df["category"] == "Electronics"]
# Combine conditions with & (and), | (or)
high_rated_electronics = df[
(df["category"] == "Electronics") & (df["rating"] >= 4.5)
]
print(high_rated_electronics)
# Output:
# product category price units_sold rating
# 1 Phone Electronics 899.99 340 4.7
# .loc[] β label-based selection (inclusive on both ends)
df.loc[0:2, "product":"price"]
# product category price
# 0 Laptop Electronics 1299.99
# 1 Phone Electronics 899.99
# 2 Tablet Electronics 549.99
# .iloc[] β position-based selection (exclusive on end)
df.iloc[0:2, 0:3]
# product category price
# 0 Laptop Electronics 1299.99
# 1 Phone Electronics 899.99
# .query() β SQL-like filtering (cleaner for complex conditions)
result = df.query("price > 100 and rating >= 4.5 and category == 'Electronics'")
# String methods for text filtering
keyboards = df[df["product"].str.contains("Key", case=False, na=False)]
# .isin() for membership testing
target_products = df[df["product"].isin(["Laptop", "Mouse", "Phone"])]
The .query() method is underused but powerful β it accepts string expressions that read like SQL WHERE clauses, making complex filters more readable than nested boolean masks. For string matching, always pass na=False to .str.contains() to avoid errors when the column contains null values. This is one of the most common gotchas in production Pandas code.
A critical Pandas 3.0 change: chained indexing like df["col"][0] = value now raises a ChainedAssignmentError instead of a warning. Always use .loc[] or .iloc[] for assignments. This is the copy-on-write enforcement that was previewed in Pandas 2.x and is now mandatory.
Step 4: Clean and Handle Missing Data
Real-world data is messy. Missing values, duplicate rows, inconsistent formatting β every dataset has problems. Pandas provides a complete toolkit for data cleaning, and knowing the right method for each situation is what separates a tutorial exercise from production-quality code. The key principle: always understand your missing data pattern before choosing a fill strategy.
# Create a messy dataset
messy_data = pd.DataFrame({
"name": ["Alice", "Bob", None, "Diana", "Eve", "Bob", "Alice"],
"department": ["Engineering", "Sales", "Engineering", "Sales", None, "Sales", "Engineering"],
"salary": [95000, 72000, 88000, None, 91000, 72000, 95000],
"start_date": ["2024-01-15", "2023-06-01", "2025-03-20", "2024-11-10", "2025-08-05", "2023-06-01", "2024-01-15"],
"performance_score": [4.2, None, 3.8, 4.5, None, None, 4.2],
})
# Step 1: Detect missing values
print(messy_data.isnull().sum())
# name 1
# department 1
# salary 1
# performance_score 3
print(f"Total missing: {messy_data.isnull().sum().sum()}") # 6
print(f"Rows with any null: {messy_data.isnull().any(axis=1).sum()}") # 5
# Step 2: Handle missing values strategically
cleaned = messy_data.copy()
# Fill numeric columns with median (robust to outliers)
cleaned["salary"] = cleaned["salary"].fillna(cleaned["salary"].median())
# Fill categorical columns with mode
cleaned["department"] = cleaned["department"].fillna(cleaned["department"].mode()[0])
# Drop rows where critical identifiers are missing
cleaned = cleaned.dropna(subset=["name"])
# Forward-fill time series gaps (useful for ordered data)
# cleaned["metric"] = cleaned["metric"].ffill()
# Step 3: Remove duplicates
print(f"Duplicates: {cleaned.duplicated().sum()}") # 1
cleaned = cleaned.drop_duplicates(subset=["name", "start_date"], keep="first")
# Step 4: Standardize text data
cleaned["name"] = cleaned["name"].str.strip().str.title()
cleaned["department"] = cleaned["department"].str.lower()
print(cleaned)
print(f"Clean shape: {cleaned.shape}") # (5, 5)
Use .fillna() with median for numeric columns because median is reliable to outliers β mean gets skewed by extreme values. For categorical data, filling with mode (most frequent value) preserves the distribution. The dropna(subset=) parameter lets you drop rows only when specific critical columns are null, rather than dropping any row with any null value.
Duplicate detection with .duplicated() and removal with .drop_duplicates() should always specify the subset parameter β otherwise Pandas checks all columns, which may not match your business logic for what constitutes a duplicate. The keep="first" parameter retains the first occurrence and drops subsequent duplicates.
Step 5: Add, Transform, and Rename Columns
Feature engineering β creating new columns from existing ones β is where Pandas transforms raw data into analytical value. Pandas 3.0βs copy-on-write behavior means column operations return new DataFrames rather than modifying in place, which makes code more predictable but requires adjusting old habits around in-place mutation.
# Using the sales DataFrame
df = pd.DataFrame(sales_data)
# Add calculated columns
df["revenue"] = df["price"] * df["units_sold"]
df["price_tier"] = pd.cut(
df["price"],
bins=[0, 100, 500, 1500],
labels=["Budget", "Mid-Range", "Premium"]
)
# Apply custom functions with .apply()
def margin_category(revenue):
if revenue > 200000:
return "High"
elif revenue > 50000:
return "Medium"
return "Low"
df["margin_tier"] = df["revenue"].apply(margin_category)
# Vectorized string operations (faster than apply for strings)
df["product_upper"] = df["product"].str.upper()
df["product_length"] = df["product"].str.len()
# Rename columns
df = df.rename(columns={
"units_sold": "quantity",
"rating": "avg_rating"
})
# Type conversion
df["category_code"] = df["category"].astype("category") # Reduce memory
# Conditional column with np.where (vectorized, fast)
df["is_premium"] = np.where(df["price"] > 500, True, False)
# Multiple conditions with np.select
conditions = [
df["avg_rating"] >= 4.7,
df["avg_rating"] >= 4.4,
df["avg_rating"] >= 4.0,
]
choices = ["Excellent", "Good", "Average"]
df["rating_label"] = np.select(conditions, choices, default="Below Average")
print(df[["product", "revenue", "price_tier", "margin_tier", "rating_label"]])
# Output:
# product revenue price_tier margin_tier rating_label
# 0 Laptop 197598.48 Premium Medium Good
# 1 Phone 305996.60 Premium High Excellent
# 2 Tablet 115497.90 Premium Medium Average
# 3 Monitor 39599.12 Mid-Range Low Good
# 4 Keyboard 62995.80 Mid-Range Medium Excellent
# 5 Mouse 44794.40 Budget Low Excellent
Prefer vectorized operations (np.where, np.select, pd.cut) over .apply() whenever possible. Vectorized operations run in compiled C code and are 10-100x faster than Python-level .apply() loops on large datasets. The pd.cut() function is particularly useful for binning continuous values into categorical ranges β a common requirement in reporting and visualization.
Converting string columns to category dtype with .astype("category") reduces memory usage by 50-90% for columns with few unique values. This optimization matters when working with datasets that have millions of rows.
Step 6: Group, Aggregate, and Pivot Your Data
Groupby operations are the SQL GROUP BY equivalent in Pandas β they let you split data into groups, apply a function to each group, and combine the results. This is the core analytical operation for any reporting or dashboard workflow. Pandas 3.0 delivers faster groupby performance through PyArrow integration, especially on string-heavy grouping columns.
# Recreate with calculated revenue
df = pd.DataFrame(sales_data)
df["revenue"] = df["price"] * df["units_sold"]
# Basic groupby with single aggregation
category_revenue = df.groupby("category")["revenue"].sum()
print(category_revenue)
# category
# Electronics 619093.0
# Peripherals 147389.3
# Multiple aggregations with .agg()
summary = df.groupby("category").agg(
total_revenue=("revenue", "sum"),
avg_price=("price", "mean"),
total_units=("units_sold", "sum"),
product_count=("product", "count"),
best_rating=("rating", "max"),
).round(2)
print(summary)
# Output:
# total_revenue avg_price total_units product_count best_rating
# category
# Electronics 619093.0 916.66 702 3 4.7
# Peripherals 147389.3 226.66 1068 3 4.8
# Pivot table β spreadsheet-style summary
pivot = df.pivot_table(
values="revenue",
index="category",
columns="rating",
aggfunc="sum",
fill_value=0,
margins=True, # Add row/column totals
)
# Cross-tabulation for frequency analysis
ct = pd.crosstab(
df["category"],
pd.cut(df["price"], bins=[0, 200, 800, 1500]),
margins=True,
)
print(ct)
# Transform β apply group-level computation back to each row
df["pct_of_category_revenue"] = (
df.groupby("category")["revenue"]
.transform(lambda x: (x / x.sum() * 100).round(1))
)
# Filter groups β keep only categories with total revenue > 200000
high_revenue_categories = (
df.groupby("category")
.filter(lambda g: g["revenue"].sum() > 200000)
)
The named aggregation syntax (total_revenue=("revenue", "sum")) introduced in Pandas 0.25 and refined in 2.x/3.x is the cleanest way to perform multiple aggregations. It lets you name the output columns inline, eliminating the need for post-aggregation column renaming. Use .transform() when you need group-level calculations broadcast back to every row β this is essential for computing percentages of group totals, group-level z-scores, and rolling group statistics.
Pivot tables with margins=True automatically add row and column totals, replicating the subtotal functionality of Excel pivot tables. For quick frequency counts, pd.crosstab() is more concise than a groupby + count combination.
Step 7: Merge, Join, and Concatenate Multiple Datasets
Production data analysis almost always involves combining data from multiple sources β joining a sales table with a products table, merging customer data with transaction logs, or stacking monthly reports into a yearly dataset. Pandas offers three main tools: merge() for SQL-style joins, concat() for stacking DataFrames, and join() for index-based combinations.
# Create two related DataFrames
orders = pd.DataFrame({
"order_id": [1001, 1002, 1003, 1004, 1005],
"customer_id": [201, 202, 203, 201, 204],
"product": ["Laptop", "Phone", "Tablet", "Mouse", "Laptop"],
"amount": [1299.99, 899.99, 549.99, 79.99, 1299.99],
})
customers = pd.DataFrame({
"customer_id": [201, 202, 203, 205],
"name": ["Alice", "Bob", "Carol", "Dave"],
"tier": ["Gold", "Silver", "Gold", "Bronze"],
})
# Inner join β only matching rows from both tables
inner = pd.merge(orders, customers, on="customer_id", how="inner")
print(f"Inner join rows: {len(inner)}") # 4 (customer 204 excluded, Dave excluded)
# Left join β all orders, customer info where available
left = pd.merge(orders, customers, on="customer_id", how="left")
print(f"Left join rows: {len(left)}") # 5 (order 1005 has NaN for name/tier)
# Outer join β all records from both tables
outer = pd.merge(orders, customers, on="customer_id", how="outer")
print(f"Outer join rows: {len(outer)}") # 6 (includes Dave with NaN order data)
# Merge with different column names
products = pd.DataFrame({
"product_name": ["Laptop", "Phone", "Tablet", "Mouse"],
"category": ["Electronics", "Electronics", "Electronics", "Peripherals"],
"weight_kg": [2.1, 0.2, 0.5, 0.1],
})
enriched = pd.merge(
orders, products,
left_on="product", right_on="product_name",
how="left"
)
# Concatenate DataFrames vertically (stacking)
q1 = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "revenue": [50000, 62000, 58000]})
q2 = pd.DataFrame({"month": ["Apr", "May", "Jun"], "revenue": [71000, 68000, 75000]})
yearly = pd.concat([q1, q2], ignore_index=True)
print(yearly)
# Output:
# month revenue
# 0 Jan 50000
# 1 Feb 62000
# 2 Mar 58000
# 3 Apr 71000
# 4 May 68000
# 5 Jun 75000
# Validate merge results β always check for unexpected row multiplication
print(f"Orders: {len(orders)}, After merge: {len(enriched)}")
assert len(enriched) == len(orders), "Merge created duplicate rows!"
Always validate merge results by comparing row counts before and after. A common pitfall is many-to-many joins that silently multiply rows β if both tables have duplicate keys, the merge produces a Cartesian product for those keys. The assertion pattern shown above catches this in development before it corrupts downstream analysis.
Use ignore_index=True when concatenating to reset the index. Without it, you get duplicate index values that cause subtle bugs in later operations. The how parameter in merge() mirrors SQL join types: inner (intersection), left (all from left table), right (all from right), and outer (union of both).
Step 8: Work with Dates, Times, and Time Series
Time series data is everywhere β server logs, financial data, IoT sensor readings, user activity. Pandas was originally built for financial time series analysis, and its datetime capabilities remain best-in-class. Proper datetime handling unlocks resampling, rolling windows, and period-based aggregations that are impossible with raw string dates.
# Create a time series dataset
dates = pd.date_range(start="2026-01-01", end="2026-03-31", freq="D")
np.random.seed(42)
ts = pd.DataFrame({
"date": dates,
"daily_revenue": np.random.normal(50000, 8000, len(dates)).round(2),
"orders": np.random.randint(100, 500, len(dates)),
})
ts["date"] = pd.to_datetime(ts["date"])
ts = ts.set_index("date")
# Extract date components
ts["month"] = ts.index.month
ts["day_of_week"] = ts.index.day_name()
ts["is_weekend"] = ts.index.dayofweek >= 5
ts["quarter"] = ts.index.quarter
# Resample β aggregate to weekly/monthly frequency
weekly = ts["daily_revenue"].resample("W").agg(["sum", "mean", "count"])
monthly = ts["daily_revenue"].resample("ME").agg(
total=("sum"),
average=("mean"),
days=("count"),
)
print(monthly)
# Output:
# total average days
# date
# 2026-01-31 1548762.31 49960.72 31
# 2026-02-28 1412034.56 50429.81 28
# 2026-03-31 1560891.23 50351.33 31
# Rolling window calculations
ts["revenue_7d_avg"] = ts["daily_revenue"].rolling(window=7).mean()
ts["revenue_30d_avg"] = ts["daily_revenue"].rolling(window=30).mean()
# Percentage change
ts["daily_pct_change"] = ts["daily_revenue"].pct_change() * 100
# Shift β compare with previous period
ts["prev_day_revenue"] = ts["daily_revenue"].shift(1)
ts["revenue_diff"] = ts["daily_revenue"] - ts["prev_day_revenue"]
# Filter by date range
q1_data = ts.loc["2026-01":"2026-03"]
jan_data = ts.loc["2026-01"]
specific_week = ts.loc["2026-02-10":"2026-02-16"]
print(ts[["daily_revenue", "revenue_7d_avg", "daily_pct_change"]].tail())
The .resample() method is the time-series equivalent of .groupby() β it groups data by time intervals. Use "W" for weekly, "ME" for month-end, "QE" for quarter-end. Note that Pandas 3.0 changed "M" to "ME" and "Q" to "QE" for consistency β using the old aliases raises a deprecation warning.
Rolling windows with .rolling() compute moving statistics over a sliding window. A 7-day rolling average smooths daily noise to reveal trends, while a 30-day window captures monthly patterns. The .shift() method offsets data by a specified number of periods, which is essential for computing day-over-day or week-over-week changes in business reporting.
Step 9: Visualize Data with Built-In Pandas Plotting
Pandas integrates directly with matplotlib to provide one-line plotting from any DataFrame or Series. While dedicated visualization libraries like Plotly or Seaborn offer more customization, Pandasβ built-in .plot() method is the fastest way to go from data to chart during exploration and analysis. Every DataFrame and Series has a .plot() method that auto-selects sensible defaults based on data types.
import matplotlib.pyplot as plt
# Using the time series data from Step 8
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Line chart β revenue trend with rolling average
ts["daily_revenue"].plot(ax=axes[0, 0], alpha=0.4, label="Daily")
ts["revenue_7d_avg"].plot(ax=axes[0, 0], linewidth=2, label="7-Day Avg")
axes[0, 0].set_title("Daily Revenue Trend (Q1 2026)")
axes[0, 0].legend()
axes[0, 0].set_ylabel("Revenue ($)")
# Bar chart β monthly totals
monthly_totals = ts["daily_revenue"].resample("ME").sum()
monthly_totals.plot(kind="bar", ax=axes[0, 1], color=["#2196F3", "#4CAF50", "#FF9800"])
axes[0, 1].set_title("Monthly Revenue Totals")
axes[0, 1].set_xticklabels(["Jan", "Feb", "Mar"], rotation=0)
# Using the sales_data from Step 1
df = pd.DataFrame(sales_data)
df["revenue"] = df["price"] * df["units_sold"]
# Horizontal bar chart β revenue by product
df.set_index("product")["revenue"].sort_values().plot(
kind="barh", ax=axes[1, 0], color="#2196F3"
)
axes[1, 0].set_title("Revenue by Product")
axes[1, 0].set_xlabel("Revenue ($)")
# Scatter plot β price vs units sold
df.plot.scatter(
x="price", y="units_sold", s=df["revenue"] / 1000,
c="rating", colormap="viridis", ax=axes[1, 1]
)
axes[1, 1].set_title("Price vs Units Sold (size=revenue, color=rating)")
plt.tight_layout()
plt.savefig("pandas_analysis.png", dpi=150, bbox_inches="tight")
plt.show()
# Quick statistics visualization
df.describe().T.plot(kind="bar", y=["mean", "50%"], figsize=(10, 5))
plt.title("Feature Statistics: Mean vs Median")
plt.tight_layout()
plt.savefig("feature_stats.png", dpi=150)
plt.show()
The .plot() method accepts a kind parameter for different chart types: "line" (default), "bar", "barh", "scatter", "hist", "box", "pie", and "area". For subplots, create a matplotlib figure first with plt.subplots() and pass individual axes to each plot call. Always call plt.tight_layout() before saving to prevent label overlap.
For bubble charts, pass a Series to the s parameter (bubble size) and use c with colormap for color encoding. This creates four-dimensional visualizations (x, y, size, color) in a single line. Save figures with dpi=150 or higher for print-quality output.
Step 10: Optimize Performance with PyArrow and Categorical Dtypes
Pandas 3.0 makes PyArrow integration a first-class feature. The new StringDtype backed by PyArrow replaces the legacy object dtype for strings, delivering significant performance improvements for string-heavy operations like filtering, groupby, and value_counts. Understanding dtype optimization is the single biggest performance lever in Pandas β it can reduce memory usage by 70% and speed up operations by 5-10x on large datasets.
import pandas as pd
import numpy as np
# Generate a large dataset for performance testing
n_rows = 1_000_000
np.random.seed(42)
large_df = pd.DataFrame({
"id": range(n_rows),
"category": np.random.choice(["Electronics", "Clothing", "Food", "Home", "Sports"], n_rows),
"city": np.random.choice(["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"], n_rows),
"amount": np.random.uniform(10, 1000, n_rows).round(2),
"quantity": np.random.randint(1, 50, n_rows),
})
# Check baseline memory usage
print(f"Baseline memory: {large_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Output: ~85.6 MB
# Optimization 1: Convert strings to categorical
optimized = large_df.copy()
optimized["category"] = optimized["category"].astype("category")
optimized["city"] = optimized["city"].astype("category")
print(f"After category optimization: {optimized.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Output: ~17.9 MB (79% reduction)
# Optimization 2: Downcast numeric types
optimized["quantity"] = pd.to_numeric(optimized["quantity"], downcast="integer")
optimized["amount"] = pd.to_numeric(optimized["amount"], downcast="float")
print(f"After numeric downcast: {optimized.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Output: ~12.8 MB (85% reduction from baseline)
# Optimization 3: Use PyArrow-backed string dtype (Pandas 3.0)
arrow_df = large_df.copy()
arrow_df["category"] = arrow_df["category"].astype("string[pyarrow]")
arrow_df["city"] = arrow_df["city"].astype("string[pyarrow]")
print(f"PyArrow strings: {arrow_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Performance comparison β groupby speed
import time
# Standard object dtype
start = time.perf_counter()
result1 = large_df.groupby("category")["amount"].sum()
baseline_time = time.perf_counter() - start
# Categorical dtype
start = time.perf_counter()
result2 = optimized.groupby("category")["amount"].sum()
category_time = time.perf_counter() - start
print(f"Groupby (object dtype): {baseline_time:.4f}s")
print(f"Groupby (category dtype): {category_time:.4f}s")
print(f"Speedup: {baseline_time / category_time:.1f}x")
The three-step optimization pattern β categorical conversion, numeric downcasting, and PyArrow string backend β is the standard approach for production Pandas pipelines. Categorical dtype stores each unique value once and uses integer codes for references, which is why it works so well for columns with limited unique values (like country, status, or product type).
For Pandas 3.0, the string[pyarrow] dtype is the recommended replacement for object-typed string columns. It uses Apache Arrowβs native string representation, which enables zero-copy operations and SIMD-accelerated string comparisons. On a 1-million-row dataset, groupby operations on PyArrow strings typically run 2-4x faster than on object dtype strings.
Step 11: Export Data to CSV, Excel, Parquet, and SQL
After analysis, you need to save results β to files for sharing, databases for production systems, or specialized formats for downstream tools. Pandas provides writer methods that mirror its reader functions, with options to control compression, encoding, and format-specific features. Choosing the right output format affects both file size and read-back performance.
| Format | Method | Best For | Compression | Read Speed |
|---|---|---|---|---|
| CSV | to_csv() | Universal sharing, small files | None (or gzip) | Slow |
| Parquet | to_parquet() | Data pipelines, analytics | Snappy (default) | Fast |
| Excel | to_excel() | Business stakeholders | None | Slow |
| JSON | to_json() | APIs, web applications | None | Medium |
| Feather | to_feather() | Fast inter-process exchange | LZ4 (default) | Fastest |
| SQL | to_sql() | Database storage | N/A | Depends on DB |
# Export to CSV
df.to_csv("sales_report.csv", index=False)
# Export to CSV with compression (reduces file size by 70-90%)
df.to_csv("sales_report.csv.gz", index=False, compression="gzip")
# Export to Parquet (recommended for data pipelines)
df.to_parquet("sales_report.parquet", engine="pyarrow", index=False)
# Export to Excel with formatting
with pd.ExcelWriter("sales_report.xlsx", engine="openpyxl") as writer:
df.to_excel(writer, sheet_name="Sales", index=False)
summary.to_excel(writer, sheet_name="Summary")
# Export to JSON (orient="records" for clean API-style output)
df.to_json("sales_report.json", orient="records", indent=2)
# Export to SQLite database
import sqlite3
conn = sqlite3.connect("sales.db")
df.to_sql("sales", conn, if_exists="replace", index=False)
# Verify the write
result = pd.read_sql("SELECT COUNT(*) as row_count FROM sales", conn)
print(f"Rows written to SQLite: {result['row_count'][0]}")
conn.close()
# File size comparison
import os
for fmt, path in [("CSV", "sales_report.csv"), ("Parquet", "sales_report.parquet")]:
if os.path.exists(path):
size = os.path.getsize(path)
print(f"{fmt}: {size:,} bytes")
For data pipeline intermediate storage, always prefer Parquet over CSV. Parquet files are typically 50-75% smaller due to columnar compression, and they preserve dtypes (including datetime and categorical types) that CSV cannot represent. The ExcelWriter context manager lets you write multiple DataFrames to different sheets in a single Excel file, which is essential for creating multi-sheet business reports.
When writing to SQL databases, set if_exists="replace" to overwrite the table or "append" to add rows. For large DataFrames (100K+ rows), add chunksize=10000 to the to_sql() call to batch the inserts and avoid memory spikes.
Step 12: Build a Complete Data Analysis Project
Now let us put everything together into a complete, production-ready analysis workflow. This project generates a synthetic e-commerce dataset, performs end-to-end cleaning, analysis, and reporting β demonstrating every technique from the previous steps in a realistic context.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# === DATA GENERATION ===
np.random.seed(2026)
n = 5000
dates = pd.date_range("2025-07-01", "2026-03-31", periods=n)
products = np.random.choice(
["Laptop", "Phone", "Tablet", "Headphones", "Smartwatch", "Charger"], n,
p=[0.15, 0.25, 0.15, 0.2, 0.15, 0.10]
)
prices = {"Laptop": 1299, "Phone": 899, "Tablet": 549,
"Headphones": 199, "Smartwatch": 349, "Charger": 29}
regions = np.random.choice(["North", "South", "East", "West"], n, p=[0.3, 0.25, 0.25, 0.2])
raw = pd.DataFrame({
"order_id": range(10001, 10001 + n),
"date": dates,
"product": products,
"region": regions,
"base_price": [prices[p] for p in products],
"quantity": np.random.randint(1, 6, n),
"discount_pct": np.random.choice([0, 5, 10, 15, 20], n, p=[0.4, 0.25, 0.2, 0.1, 0.05]),
"customer_rating": np.where(
np.random.random(n) < 0.08, np.nan,
np.random.choice([3.0, 3.5, 4.0, 4.5, 5.0], n, p=[0.05, 0.1, 0.3, 0.35, 0.2])
),
})
# === CLEANING ===
df = raw.copy()
df["date"] = pd.to_datetime(df["date"])
df["product"] = df["product"].astype("category")
df["region"] = df["region"].astype("category")
df["customer_rating"] = df["customer_rating"].fillna(df["customer_rating"].median())
df["revenue"] = (df["base_price"] * df["quantity"] * (1 - df["discount_pct"] / 100)).round(2)
df["month"] = df["date"].dt.to_period("M")
print(f"Dataset: {df.shape[0]} orders, {df.shape[1]} columns")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"Total revenue: ${df['revenue'].sum():,.2f}")
# === ANALYSIS ===
# Monthly revenue trend
monthly = df.groupby("month").agg(
revenue=("revenue", "sum"),
orders=("order_id", "count"),
avg_order_value=("revenue", "mean"),
).round(2)
monthly.index = monthly.index.to_timestamp()
print("nMonthly Performance:")
print(monthly)
# Product performance
product_perf = df.groupby("product", observed=True).agg(
total_revenue=("revenue", "sum"),
units_sold=("quantity", "sum"),
avg_rating=("customer_rating", "mean"),
order_count=("order_id", "count"),
).sort_values("total_revenue", ascending=False).round(2)
print("nProduct Performance:")
print(product_perf)
# Regional breakdown
regional = df.pivot_table(
values="revenue", index="region", columns="product",
aggfunc="sum", margins=True, observed=True,
).round(2)
print("nRevenue by Region Γ Product:")
print(regional)
# === VISUALIZATION ===
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
monthly["revenue"].plot(ax=axes[0, 0], marker="o", linewidth=2, color="#2196F3")
axes[0, 0].set_title("Monthly Revenue Trend")
axes[0, 0].set_ylabel("Revenue ($)")
product_perf["total_revenue"].sort_values().plot(
kind="barh", ax=axes[0, 1], color="#4CAF50"
)
axes[0, 1].set_title("Revenue by Product")
df.groupby("region", observed=True)["revenue"].sum().plot(
kind="pie", ax=axes[1, 0], autopct="%1.1f%%"
)
axes[1, 0].set_title("Revenue Share by Region")
monthly["avg_order_value"].plot(ax=axes[1, 1], marker="s", color="#FF9800")
axes[1, 1].set_title("Average Order Value Trend")
axes[1, 1].set_ylabel("AOV ($)")
plt.suptitle("E-Commerce Analytics Dashboard β Q3 2025 to Q1 2026", fontsize=14, y=1.01)
plt.tight_layout()
plt.savefig("ecommerce_dashboard.png", dpi=150, bbox_inches="tight")
print("nDashboard saved to ecommerce_dashboard.png")
This project demonstrates the full Pandas workflow: generate or load data, clean it (handle missing values, convert types, compute derived columns), analyze it (groupby, pivot tables, time series aggregation), and visualize it (multi-panel dashboard). The observed=True parameter in groupby operations with categorical columns prevents Pandas from creating entries for unused categories, which is a common source of confusing empty rows in output.
The synthetic data generation pattern shown here is useful for testing pipelines before connecting to real data sources. By setting a fixed random seed (np.random.seed(2026)), the output is reproducible across runs β essential for debugging and unit testing data pipelines.
Step 13: Apply Advanced Pandas Techniques
Once you have mastered the fundamentals, these advanced techniques will take your Pandas skills to production level. These patterns solve real problems that every data professional encounters β multi-index operations, method chaining, custom aggregations, and memory-efficient processing of large files.
# Advanced technique 1: Method chaining for clean pipelines
result = (
df
.query("revenue > 100")
.assign(
revenue_tier=lambda x: pd.cut(x["revenue"], bins=[0, 500, 2000, 10000],
labels=["Small", "Medium", "Large"]),
month_name=lambda x: x["date"].dt.month_name(),
)
.groupby(["region", "revenue_tier"], observed=True)
.agg(total_revenue=("revenue", "sum"), count=("order_id", "count"))
.sort_values("total_revenue", ascending=False)
.head(10)
)
print(result)
# Advanced technique 2: Window functions (like SQL OVER)
df["cumulative_revenue"] = df.sort_values("date").groupby("product", observed=True)["revenue"].cumsum()
df["rank_in_region"] = df.groupby("region", observed=True)["revenue"].rank(method="dense", ascending=False)
# Advanced technique 3: Process large files in chunks
chunk_results = []
for chunk in pd.read_csv("large_file.csv", chunksize=50000):
summary = chunk.groupby("category")["amount"].sum()
chunk_results.append(summary)
total = pd.concat(chunk_results).groupby(level=0).sum()
# Advanced technique 4: Custom aggregation with pipe
def add_yoy_growth(df, date_col="date", value_col="revenue"):
"""Add year-over-year growth column."""
df = df.sort_values(date_col)
df["yoy_growth"] = df[value_col].pct_change(periods=365) * 100
return df
monthly_with_growth = monthly.pipe(add_yoy_growth, date_col=monthly.index.name, value_col="revenue")
# Advanced technique 5: Efficient string operations with PyArrow
df["product_str"] = df["product"].astype("string[pyarrow]")
# PyArrow string operations are 2-4x faster on large datasets
contains_mask = df["product_str"].str.contains("Phone|Tablet")
# Advanced technique 6: MultiIndex operations
multi = df.set_index(["region", "product"])
multi.sort_index(inplace=True)
# Access specific level
north_data = multi.loc["North"]
north_laptops = multi.loc[("North", "Laptop")]
# Cross-section
all_laptops = multi.xs("Laptop", level="product")
Method chaining with .assign(), .query(), and .pipe() produces clean, readable pipelines where each step is explicit. The .pipe() method integrates custom functions into chains without breaking the flow. For SQL users, .rank() mirrors window functions like ROW_NUMBER() and DENSE_RANK().
Processing files in chunks with chunksize in read_csv() is essential when working with files larger than available RAM. Each chunk is a regular DataFrame you can process independently, then combine results at the end. This pattern handles files of any size with constant memory usage.
5 Common Pitfalls Every Pandas Developer Must Avoid
Even experienced developers make these mistakes. Here are the five most dangerous pitfalls in Pandas development, along with the correct patterns to use instead. These are not obscure edge cases β they are the errors that show up in production code reviews every day.
Pitfall 1: Chained indexing for assignments. Writing df["col"][0] = value attempts to set a value through two separate indexing operations. In Pandas 3.0, this raises a ChainedAssignmentError. Always use df.loc[0, "col"] = value instead. This is the single most common Pandas bug and the reason copy-on-write was introduced.
Pitfall 2: Ignoring the index after merge or concat. After concatenating DataFrames, the index may contain duplicates (0, 1, 2, 0, 1, 2) unless you pass ignore_index=True. This causes .loc[] to return multiple rows when you expect one, leading to downstream errors that are hard to trace.
Pitfall 3: Using .apply() for vectorizable operations. Writing df["col"].apply(lambda x: x * 2) is 10-100x slower than df["col"] * 2. Apply runs a Python function on each element; vectorized operations run in compiled C. Reserve .apply() for truly complex logic that cannot be expressed with NumPy or Pandas built-in methods.
Pitfall 4: Silent type coercion with missing values. An integer column with NaN values gets silently upcast to float64 in Pandas, turning [1, 2, NaN] into [1.0, 2.0, NaN]. Use nullable integer types (pd.Int64Dtype() or "Int64") to preserve integer semantics with missing values.
Pitfall 5: Loading entire large files into memory. Calling pd.read_csv("10gb_file.csv") will crash with an OutOfMemoryError. Use chunksize, usecols to select specific columns, or dtype to specify efficient types at load time. For files over 1GB, consider Parquet or a database instead of CSV.
Pandas Tutorial Troubleshooting Guide: 8 Common Errors Solved
These are the errors you will hit most often when working with Pandas. Each includes the exact error message, what causes it, and the fix.
Error 1: KeyError: 'column_name' β You are trying to access a column that does not exist. Check for typos with df.columns.tolist(). Column names are case-sensitive. Fix: verify the exact column name, or use df.get("column_name", default_value) for safe access.
Error 2: SettingWithCopyWarning (Pandas 2.x) / ChainedAssignmentError (Pandas 3.0) β You used chained indexing like df[df["x"] > 5]["y"] = 10. Fix: use df.loc[df["x"] > 5, "y"] = 10. In Pandas 3.0 this becomes a hard error, not just a warning.
Error 3: MergeError: Merge keys contain duplicate entries β Your merge produced a Cartesian product because both tables have duplicate keys. Fix: deduplicate one or both tables before merging with df.drop_duplicates(subset=["key_column"]), or use validate="one_to_many" to catch the issue early.
Error 4: ValueError: cannot reindex from a duplicate axis β Your DataFrame has duplicate index values and you tried an operation that requires unique indices. Fix: reset the index with df.reset_index(drop=True) or deduplicate with df[~df.index.duplicated(keep="first")].
Error 5: TypeError: unsupported operand type(s): 'str' and 'int' β A column you think is numeric actually contains strings. This happens when CSV files have mixed types. Fix: convert with df["col"] = pd.to_numeric(df["col"], errors="coerce"). The errors="coerce" converts unparseable values to NaN instead of raising an error.
Error 6: OutOfMemoryError when loading large CSV files β The file is too large for available RAM. Fix: use pd.read_csv("file.csv", chunksize=50000) to process in chunks, or specify dtype={"col": "category"} to reduce memory for low-cardinality string columns. Switch to Parquet format for files over 1GB.
Error 7: FutureWarning: 'M' is deprecated, use 'ME' instead β Pandas 3.0 changed frequency aliases for offset types. "M" became "ME" (month-end), "Q" became "QE" (quarter-end), "Y" became "YE" (year-end). Fix: update all frequency strings in your resample() and date_range() calls.
Error 8: ParserError: Error tokenizing data when reading CSV β The CSV has inconsistent delimiters, unescaped quotes, or encoding issues. Fix: try pd.read_csv("file.csv", encoding="utf-8-sig", on_bad_lines="skip"). The on_bad_lines="skip" parameter (Pandas 1.3+) skips malformed rows instead of crashing. For encoding issues, try "latin-1" or "cp1252" as fallback encodings.
Pandas 3.0 Migration Guide: What Changed from Pandas 2.x
Pandas 3.0.0, released January 21, 2026, introduced several breaking changes that affect existing codebases. If you are upgrading from Pandas 2.x, the recommended migration path is to first upgrade to Pandas 2.3.3 (September 29, 2025), fix all deprecation warnings, then upgrade to 3.0. Here are the key changes you need to address.
| Change | Pandas 2.x Behavior | Pandas 3.0 Behavior | Action Required |
|---|---|---|---|
| Copy-on-write | Optional (opt-in) | Enforced (default) | Use .loc[] for all assignments |
| Python version | 3.9+ | 3.11+ | Upgrade Python |
| String dtype | object (default) | StringDtype recommended | Migrate to string[pyarrow] |
| Frequency aliases | βMβ, βQβ, βYβ | βMEβ, βQEβ, βYEβ | Update resample/date_range calls |
| PyArrow minimum | 10.0.1 | 13.0.0 | Upgrade pyarrow |
| NumPy minimum | 1.23.2 | 1.26.0 | Upgrade numpy |
| Deprecated APIs | Warning on use | Removed entirely | Fix all deprecation warnings in 2.3 |
The most impactful change is copy-on-write enforcement. In Pandas 2.x, modifying a DataFrame view could silently modify the original DataFrame. In 3.0, every operation that previously returned a view now returns a copy, and chained assignments raise errors. This eliminates an entire class of bugs but requires updating code that relied on the old mutation behavior.
The frequency alias changes affect all time series code. Run a project-wide search for .resample("M"), .resample("Q"), and .resample("Y") and replace them with "ME", "QE", and "YE" respectively. The same applies to pd.date_range() frequency parameters.
Related Coverage
Explore more Python and data engineering tutorials on tech-insider.org:
- How to Master SQLite with Python: 13-Step Tutorial with FTS5 and WAL Mode
- How to Master Redis with Python: 12-Step Tutorial with 5 Projects
- How to Build a Task Queue with Celery Python and Redis in 13 Steps
- How to Build a Data Dashboard with Streamlit Python in 12 Steps
- How to Build a Web Scraper with Python: Complete Tutorial
- How to Automate Tasks with Python: Complete Automation Tutorial
- Go vs Python 2026: 6x Speed Gap and a $14K Salary Divide
Pandas Tutorial Advanced Tips for Production Workflows
These advanced patterns come from real production codebases. They address the challenges that tutorials rarely cover but that every data team encounters when scaling Pandas beyond notebooks.
Tip 1: Use pd.eval() for complex expressions on large DataFrames. For DataFrames with 100K+ rows, pd.eval("df.col_a + df.col_b * df.col_c") uses Numexpr to evaluate the expression in chunks, reducing memory allocation and improving speed by 2-5x compared to standard Pandas arithmetic.
Tip 2: Profile memory with .memory_usage(deep=True). The default .memory_usage() underestimates object column memory. Always pass deep=True to get the actual memory footprint including the underlying Python objects. This is critical for identifying which columns to optimize first.
Tip 3: Use nlargest() and nsmallest() instead of sort + head. Calling df.nlargest(10, "revenue") is faster than df.sort_values("revenue", ascending=False).head(10) because it uses a partial sort algorithm that avoids sorting the entire DataFrame.
Tip 4: Prefer pd.read_parquet() with columns parameter. Parquetβs columnar format allows reading specific columns without loading the entire file. Use pd.read_parquet("data.parquet", columns=["col_a", "col_b"]) to only load what you need β this can be 10x faster than loading all columns on wide datasets.
Tip 5: Set dtypes at read time, not after. Specifying dtype={"category_col": "category", "id": "int32"} in read_csv() applies type conversion during parsing, avoiding the double memory allocation that happens when you load as default types and then convert. On a 500MB CSV, this can save 1GB of peak memory usage.
Frequently Asked Questions
What Python version does Pandas 3.0 require?
Pandas 3.0.0, released Pandas 3.0.0 was released on January 21, 2026, and requires Python 3.9+.[6]11 or higher. Support for Python 3.9 and 3.10 was dropped in the 3.0 release. If you cannot upgrade Python, use Pandas 2.3.3 (September 29, 2025), which supports Python 3.9+.
Is Pandas fast enough for large datasets?
Pandas handles datasets up to 10-50 million rows effectively on a single machine with proper optimization β use categorical dtypes for low-cardinality strings, PyArrow-backed string columns, and numeric downcasting. For datasets beyond 100 million rows, consider Polars, Dask, or moving to a database. The PyArrow backend in Pandas 3.0 significantly narrows the performance gap with dedicated big-data tools.
Should I use Pandas or Polars in 2026?
Use Pandas if your team already knows it, your datasets fit in memory, and you need the mature ecosystem (matplotlib, scikit-learn, Jupyter integration). Use Polars if you need maximum single-machine performance on large datasets and can work with its API. Pandas 3.0 with PyArrow narrowed the performance gap, and Pandasβ ecosystem advantage remains significant.
What is copy-on-write in Pandas 3.0?
Copy-on-write (CoW) means that indexing operations like df["col"] or df.iloc[0:5] return copies instead of views. Modifying the result does not affect the original DataFrame. In Pandas 2.x this was optional; in 3.0 it is enforced. This eliminates bugs from accidental mutation but means you must use .loc[] or .iloc[] for intentional assignments.
How do I speed up pd.read_csv() on large files?
Use usecols to load only needed columns, dtype to specify efficient types at load time, parse_dates for date columns, and engine="pyarrow" for the fastest CSV parser. For files over 1GB, convert to Parquet once and use read_parquet() for all subsequent reads β Parquet is typically 3-5x faster than CSV.
What is the difference between .loc[] and .iloc[]?
.loc[] uses label-based indexing (column names and index labels) and is inclusive on both ends. .iloc[] uses integer position-based indexing (0, 1, 2β¦) and is exclusive on the end, like Python slicing. Use .loc[] when you know the names, .iloc[] when you know the positions.
How do I handle SettingWithCopyWarning?
Replace chained indexing (df[mask]["col"] = value) with df.loc[mask, "col"] = value. In Pandas 3.0, this warning became a hard ChainedAssignmentError. If you need to create a modified copy, use df.copy() explicitly before making changes.
Can Pandas connect to SQL databases?
Yes. Use pd.read_sql("SELECT * FROM table", connection) to read from any database supported by SQLAlchemy or a DBAPI2 connection (SQLite, PostgreSQL, MySQL). Use df.to_sql("table_name", connection) to write DataFrames back. For large tables, use the chunksize parameter to avoid loading everything into memory at once.
Sofia LindstrΓΆm
Sofia LindstrΓΆm is the Editor-in-Chief at Tech Insider, where she leads editorial strategy and oversees coverage across AI, cybersecurity, and enterprise technology. With over a decade in Swedish tech journalism, she previously served as technology editor at Dagens Industri and covered the Nordic startup ecosystem for Breakit. Sofia holds an MSc in Media Technology from KTH Royal Institute of Technology and is a frequent speaker at Web Summit and Slush. She is passionate about making complex technology accessible to business leaders.
View all articles