Pandas Masterclass
Introduction to Pandas
What is Pandas?
Pandas is the core library for data manipulation in Python. It's like "Excel on steroids" for programmers.
Setup
import pandas as pd
print(pd.__version__)
Pandas Series
A Series is like a column in a table: a one-dimensional, labeled array that can hold data of any type.
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar["y"]) # Returns 7
DataFrames
What is a DataFrame?
A DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
Reading Files
CSV & JSON
Data usually comes in files. Pandas makes loading them one-line magic.
df = pd.read_csv('data.csv')
df_json = pd.read_json('data.json')
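To try both readers without any files on disk, here is a minimal sketch that feeds them in-memory text through io.StringIO. The column names and values are made up for illustration; with real data you would pass a file path as in the lines above.

```python
import io
import pandas as pd

# Hypothetical file contents, kept in memory so the example needs no files.
csv_text = "Duration,Pulse,Calories\n60,110,409.1\n45,117,282.4\n"
json_text = '[{"Duration": 60, "Pulse": 110}, {"Duration": 45, "Pulse": 117}]'

# Both readers accept a path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```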
Inspecting Data
- df.head(10): First 10 rows.
- df.tail(): Last 5 rows.
- df.info(): Data types and memory usage.
- df.describe(): Statistical summary (mean, std, min, max).
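A short sketch of these four calls on a small, made-up workout table (the column names and numbers are assumptions, not data from this course):

```python
import pandas as pd

# Hypothetical sample data with seven rows.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45, 30, 60, 45],
    "Calories": [409.1, 479.0, 340.0, 282.4, 195.1, 300.0, 406.0],
})

first_three = df.head(3)   # first 3 rows
last_five = df.tail()      # last 5 rows (the default when no number is given)
summary = df.describe()    # count, mean, std, min, quartiles, max per numeric column

df.info()                  # prints column dtypes, non-null counts, memory usage
print(summary.loc["mean"])
```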
Selecting & Filtering
Pick columns with square brackets, rows with .loc, and filter rows by putting a condition inside the brackets.
# Select column
ages = df['Age']
# Select a row (by index label)
row = df.loc[0]
# Conditional Filtering
adults = df[df['Age'] > 18]
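The same ideas as a self-contained sketch, plus a filter that combines two conditions. The table and the "City" column are invented for illustration. Note that each condition needs its own parentheses and that pandas uses & and | rather than `and` and `or`.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cleo", "Dev"],
    "Age": [17, 25, 42, 18],
    "City": ["Lima", "Oslo", "Lima", "Pune"],
})

ages = df["Age"]                  # one column -> Series
subset = df[["Name", "Age"]]      # list of columns -> DataFrame
first_row = df.loc[0]             # row whose index label is 0
adults = df[df["Age"] > 18]       # strictly older than 18, so Dev (18) is excluded

# Combine conditions with & (and) or | (or); wrap each one in parentheses.
lima_adults = df[(df["Age"] > 18) & (df["City"] == "Lima")]

print(adults)
print(lima_adults)
```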
Cleaning Empty Cells
Bad data can ruin your analysis. Handle it!
# Remove rows with empty cells
new_df = df.dropna()
# Fill empty cells with a value
df.fillna(130, inplace = True)
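# Fill with Mean
x = df["Calories"].mean()
# Assign the result back: inplace=True on a selected column may not update df in recent pandas versions
df["Calories"] = df["Calories"].fillna(x)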
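Here is a minimal, runnable sketch of both approaches on a made-up table with one missing value. The data is an assumption for illustration. It works on a copy so the original stays untouched for comparison.

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN in a numeric column.
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [400.0, None, 200.0, 300.0],
})

# Option 1: drop every row that has at least one empty cell.
dropped = df.dropna()

# Option 2: fill the gaps with the column mean (NaN is skipped when averaging).
mean_calories = df["Calories"].mean()
filled = df.copy()
filled["Calories"] = filled["Calories"].fillna(mean_calories)

print(dropped)
print(filled)
```

Dropping is the simpler choice when only a few rows are affected; filling keeps the row count intact at the cost of inventing plausible values.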
Cleaning Wrong Data
Some cells are not empty but hold values that cannot be right. Correct them, for example by capping them at a sensible limit.
# Set a max limit for duration
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
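The row-by-row loop is easy to read, but the same cap can usually be applied to the whole column at once. A sketch of two vectorized alternatives, using made-up durations:

```python
import pandas as pd

# Hypothetical sample data; 450 and 130 exceed the 120 limit.
df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Alternative 1: a boolean mask selects the offending rows, then assigns to them.
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Alternative 2: clip() caps every value above the upper bound.
clipped = df["Duration"].clip(upper=120)

print(capped)
print(clipped)
```

Both avoid an explicit Python loop, which tends to matter on large tables.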
Removing Duplicates
# Check for duplicates
print(df.duplicated())
# Remove duplicates
df.drop_duplicates(inplace = True)
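A small sketch of what these two calls do, on an invented table where the third row repeats the first. It uses the non-inplace form, which returns a new DataFrame.

```python
import pandas as pd

# Hypothetical sample data; row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Pulse": [110, 117, 110, 109],
})

# True marks a row that repeats an earlier row; the first occurrence stays False.
flags = df.duplicated()

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(flags)
print(deduped)
```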
Grouping & Aggregation
The groupby() method helps group data into categories and apply functions.
# Average of each numeric column by workout type
numeric_cols = ["Duration", "Pulse", "Maxpulse", "Calories"]
print(df.groupby("Type")[numeric_cols].mean())
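To see grouping on data you can run, here is a sketch with a made-up workout table. It also shows agg() with named aggregations, one way to apply a different function to each column; the output column names are invented for this example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Type": ["Run", "Bike", "Run", "Bike"],
    "Duration": [30, 60, 50, 40],
    "Calories": [300.0, 500.0, 500.0, 300.0],
})

# Mean of the selected columns within each workout type.
means = df.groupby("Type")[["Duration", "Calories"]].mean()

# Named aggregation: new_column=(source_column, function).
summary = df.groupby("Type").agg(
    total_duration=("Duration", "sum"),
    avg_calories=("Calories", "mean"),
)

print(means)
print(summary)
```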
Merging & Joins
Combining multiple DataFrames.
merged = pd.merge(df1, df2, on='ID', how='inner')
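The how argument decides which IDs survive the merge. A sketch with two small, made-up tables standing in for df1 and df2:

```python
import pandas as pd

# Hypothetical tables that share an "ID" column; IDs 2 and 3 appear in both.
df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Ana", "Ben", "Cleo"]})
df2 = pd.DataFrame({"ID": [2, 3, 4], "Score": [88, 92, 75]})

inner = pd.merge(df1, df2, on="ID", how="inner")  # only IDs present in both
left = pd.merge(df1, df2, on="ID", how="left")    # every ID from df1; missing Score is NaN
outer = pd.merge(df1, df2, on="ID", how="outer")  # every ID from either table

print(inner)
print(left)
print(outer)
```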
Plotting Integration
Pandas hooks directly into Matplotlib.
import matplotlib.pyplot as plt
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
Real World Projects
Beginner: Dataset Explorer
Goal: Load a CSV, print stats, and fix missing values.
Intermediate: Sales Analysis
Goal: Group sales by Month and Product Category to find top performers.
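One possible starting point for this project, as a sketch rather than a full solution. The column names ("Date", "Category", "Revenue") and the figures are assumptions; your dataset will likely differ.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "Date": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-03", "2024-02-14", "2024-02-28",
    ]),
    "Category": ["Toys", "Books", "Toys", "Toys", "Books"],
    "Revenue": [120.0, 80.0, 200.0, 50.0, 90.0],
})

# Derive a monthly period from each date, then total revenue per month and category.
sales["Month"] = sales["Date"].dt.to_period("M")
totals = sales.groupby(["Month", "Category"])["Revenue"].sum().reset_index()

# The top performer is the month/category pair with the highest total.
top = totals.sort_values("Revenue", ascending=False).head(1)

print(totals)
print(top)
```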
Advanced: Stock Predictor (Prep)
Goal: Calculate moving averages and daily returns on historical finance data.
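A minimal sketch of both calculations using rolling() and pct_change(). The closing prices, the "Close" column name, and the 3-day window are invented for illustration; real data would come from a file or an API.

```python
import pandas as pd

# Hypothetical closing prices for five consecutive days.
prices = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0, 105.0, 110.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# 3-day moving average: the first two rows are NaN because the window is not yet full.
prices["MA3"] = prices["Close"].rolling(window=3).mean()

# Daily return: fractional change from the previous row; the first row is NaN.
prices["Return"] = prices["Close"].pct_change()

print(prices)
```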
Pandas Mini Task
Goal: Create your own DataFrame.
Requirements:
- Import pandas.
- Create a DataFrame with columns: "Fruit" and "Color".
- Add 3 rows (e.g., Apple-Red, Banana-Yellow).
- Print the whole table.
Data Science starts here!
Congratulations!
You've completed the Pandas module.