Develop a Spam Email Filter Using Machine Learning 📧🤖

Spam emails are annoying at best and dangerous at worst: many carry phishing links, malware, or fraud attempts. To protect users, email providers use machine learning spam filters that automatically detect and block unwanted messages.

How Does Spam Email Filtering Work? 🧐📩

A spam filter sorts every incoming message into one of two classes:

✅ Ham (Legitimate Email) – Important, useful messages.
❌ Spam (Unwanted Email) – Unsolicited promotional, phishing, or malicious emails.

🔹 Techniques Used in Spam Detection

📌 Keyword-Based Filtering – Flags messages containing words or phrases like “lottery”, “free money”, “urgent”.
📌 Machine Learning (ML) Models – Learn from labeled past emails to classify new ones.
📌 Bayesian Filtering – Calculates the probability that an email is spam, given the words it contains.
📌 Deep Learning (LSTMs, Transformers) – Neural models that also take word order and context into account.
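
To make the first and third techniques concrete, here is a minimal sketch in plain Python. The function names and all of the probabilities are hypothetical, chosen only for illustration; the keyword list reuses the examples from the list above.

```python
# Keywords taken from the examples in the list above.
SPAM_KEYWORDS = {"lottery", "free money", "urgent"}

def keyword_filter(text, keywords=SPAM_KEYWORDS):
    """Flag a message if it contains any listed keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

def spam_probability_given_word(p_word_given_spam, p_word_given_ham, p_spam):
    """Bayes' rule: P(spam | word) from how often the word occurs in each class."""
    p_ham = 1.0 - p_spam
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return (p_word_given_spam * p_spam) / evidence

flagged = keyword_filter("URGENT: claim your prize")   # contains "urgent"
not_flagged = keyword_filter("Lunch at noon?")          # no keyword present

# Hypothetical numbers: the word appears in 20% of spam and 1% of ham,
# and 13% of all messages are spam.
p_spam_given_word = spam_probability_given_word(0.20, 0.01, 0.13)
print(flagged, not_flagged, round(p_spam_given_word, 2))
```

Plain substring matching is brittle (it would also flag "insurgent"), which is one reason the rest of this tutorial uses a learned model instead.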

Install Required Libraries 📦

pip install numpy pandas scikit-learn nltk joblib

Import Libraries & Load Data 🗂️

import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

🔹 Load the Dataset

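This tutorial trains on an SMS spam collection as a stand-in for email, since the text classification steps are the same.

url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms-spam-collection.csv"
# Assumes a two-column file (label, message) with no header row.
# If the URL is unavailable, point it at your own copy; if that copy is tab-separated, add sep='\t'.
df = pd.read_csv(url, encoding='latin-1', names=['label', 'message'])
df['label'] = df['label'].map({'ham': 0, 'spam': 1})  # ham -> 0, spam -> 1
df.head()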
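
Before going further, it helps to confirm that every label was mapped and to see how imbalanced the classes are. The sketch below runs on a tiny in-memory, tab-separated sample (the three messages and the names `sample_tsv` and `demo_df` are made up for illustration); the same checks apply to the real `df`.

```python
import io

import pandas as pd

# Hypothetical three-row sample in "label<TAB>message" form.
sample_tsv = (
    "ham\tSee you at lunch\n"
    "spam\tWIN a FREE prize now\n"
    "ham\tCall me later\n"
)

demo_df = pd.read_csv(
    io.StringIO(sample_tsv), sep="\t", header=None, names=["label", "message"]
)
demo_df["label"] = demo_df["label"].map({"ham": 0, "spam": 1})

# Labels outside {'ham', 'spam'} become NaN after map(); count them.
unmapped_rows = int(demo_df["label"].isna().sum())

# With 0/1 labels, the mean is the fraction of spam messages.
spam_share = float(demo_df["label"].mean())

print(f"Unmapped rows: {unmapped_rows}, spam share: {spam_share:.2f}")
```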

Preprocess the Text Data 📝🔍

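nltk.download('stopwords')  # one-time download of NLTK's stop-word lists
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(text):
    text = text.lower()                  # normalize case
    text = re.sub(r'\W', ' ', text)      # replace punctuation and symbols with spaces
    words = text.split()                 # tokenize on whitespace
    # drop stop words, then reduce each remaining word to its stem
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['clean_message'] = df['message'].apply(preprocess_text)
df.head()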
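
To see what the cleaning steps do to a single message, here is a simplified, dependency-free sketch. It is not the function above: it skips stemming and uses a tiny made-up stop-word list (`DEMO_STOP_WORDS`) so that it runs without any NLTK downloads.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer).
DEMO_STOP_WORDS = {"you", "have", "a", "the", "to", "is", "now"}

def clean_without_stemming(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip non-word characters, split, and drop stop words."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)
    return [word for word in text.split() if word not in stop_words]

demo_tokens = clean_without_stemming("Congratulations! You have WON a $1000 gift-card.")
print(demo_tokens)
# → ['congratulations', 'won', '1000', 'gift', 'card']
```

The full `preprocess_text` would additionally stem each token, so different inflections of a word map to the same feature.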

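Split Data for Training & Testing 🎯

X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['clean_message'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

Convert Text Data into Numerical Features 🔢

# Fit TF-IDF on the training messages only, so that no vocabulary or
# document-frequency information from the test set leaks into the features.
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)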
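
To see what TF-IDF vectorization produces, here is a small sketch on a toy corpus (the four sentences and the `toy_*` names are invented for illustration). It also shows that a word the vectorizer never saw during fitting is simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["free money now", "meeting at noon", "free lottery ticket"]
toy_test = ["free pizza at noon"]

toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(toy_train)  # learn vocabulary + IDF weights
X_toy_test = toy_vectorizer.transform(toy_test)        # reuse them; nothing is re-learned

toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)
print(X_toy_train.shape, X_toy_test.shape)

# "pizza" was not in the fitted vocabulary, so only "free", "at" and "noon"
# get non-zero weights in the test row.
print(X_toy_test.nnz)
```

Each row is one message and each column is one vocabulary term; the result is a sparse matrix, which scikit-learn's `MultinomialNB` accepts directly.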

Train the Spam Classifier 🚀📩

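model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Most messages are ham, so accuracy alone can look good even when spam is missed;
# also report per-class precision and recall.
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))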
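
Accuracy can be misleading when one class dominates. The sketch below uses made-up labels (8 ham, 2 spam) and a hypothetical helper, `precision_recall`, to show that a "filter" which never flags anything still scores 80% accuracy while catching no spam at all.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical test set: 8 ham (0) and 2 spam (1).
y_true_demo = [0] * 8 + [1] * 2
always_ham = [0] * 10  # a baseline that never predicts spam

baseline_accuracy = sum(t == p for t, p in zip(y_true_demo, always_ham)) / len(y_true_demo)
baseline_precision, baseline_recall = precision_recall(y_true_demo, always_ham)

print(baseline_accuracy, baseline_precision, baseline_recall)
# → 0.8 0.0 0.0
```

For a spam filter, recall on the spam class (how much spam is caught) and precision (how often a flagged message really is spam) are usually more informative than accuracy.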

Test the Spam Filter on New Emails 📬

def predict_spam(email_text):
    processed_email = preprocess_text(email_text)
    email_vector = tfidf_vectorizer.transform([processed_email]).toarray()
    prediction = model.predict(email_vector)[0]
    return "Spam" if prediction == 1 else "Not Spam"

test_email = "Congratulations! You have won a $1000 Walmart gift card. Claim now!"
print(f"Email: {test_email} -> Prediction: {predict_spam(test_email)}")
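
A hard Spam / Not Spam label hides how confident the model is. As a sketch, the snippet below trains a throwaway classifier on six invented messages and uses `predict_proba` to return a spam probability, which can then be compared against an adjustable threshold. The helper names `spam_score` and `is_spam` and the toy data are assumptions, not part of the tutorial's model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data: 1 = spam, 0 = ham.
toy_texts = [
    "win free money now",
    "free lottery prize claim now",
    "claim your free prize",
    "are we meeting at noon",
    "see you at lunch tomorrow",
    "please review the attached report",
]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

def spam_score(text, vectorizer=toy_vectorizer, model=toy_model):
    """Return the model's estimated probability that `text` is spam."""
    features = vectorizer.transform([text])
    spam_column = list(model.classes_).index(1)  # column order follows model.classes_
    return float(model.predict_proba(features)[0, spam_column])

def is_spam(text, threshold=0.5):
    """Flag a message when its spam probability reaches the threshold."""
    return spam_score(text) >= threshold

spammy_score = spam_score("claim your free money")
hammy_score = spam_score("see you at the meeting")
print(f"{spammy_score:.2f} vs {hammy_score:.2f}")
```

Raising the threshold trades missed spam for fewer legitimate messages wrongly blocked, which is often the safer direction for a mail filter.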

Save & Load the Model for Future Use 💾

import joblib
joblib.dump(model, "spam_classifier.pkl")
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")

loaded_model = joblib.load("spam_classifier.pkl")
loaded_vectorizer = joblib.load("tfidf_vectorizer.pkl")
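
One alternative, sketched here under the assumption that you are free to restructure the code, is to bundle the vectorizer and classifier into a single scikit-learn pipeline so that only one file needs to be saved and the two can never get out of sync. The toy data, the file name `spam_pipeline_demo.pkl`, and the variable names are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = spam, 0 = ham.
demo_texts = [
    "win free money now",
    "claim your free prize",
    "are we meeting at noon",
    "see you at lunch tomorrow",
]
demo_labels = [1, 1, 0, 0]

# The pipeline accepts raw text: it vectorizes, then classifies.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(demo_texts, demo_labels)

# Round-trip through a temporary file so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "spam_pipeline_demo.pkl")
    joblib.dump(spam_pipeline, path)
    restored_pipeline = joblib.load(path)

same_predictions = (
    list(restored_pipeline.predict(demo_texts)) == list(spam_pipeline.predict(demo_texts))
)
print(same_predictions)
```

Whichever layout you choose, load pickled models only from sources you trust, and with the same scikit-learn version that saved them where possible.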

Improve the Spam Classifier 🚀

✅ Use Deep Learning (LSTMs, Transformers), which can capture word order and context when enough training data is available.
✅ Expand the Training Data with real-world spam emails.
✅ Apply Additional NLP Techniques, such as lemmatization instead of stemming.
✅ Combine Multiple Models (Ensemble Learning) to reduce the errors of any single classifier.
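
As an illustration of the ensemble idea, the sketch below combines Naïve Bayes with logistic regression using scikit-learn's `VotingClassifier` in soft-voting mode, which averages the two models' predicted probabilities. The choice of logistic regression as the second model and the six toy messages are assumptions made for the example.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = spam, 0 = ham.
ensemble_texts = [
    "win free money now",
    "free lottery prize claim now",
    "claim your free prize",
    "are we meeting at noon",
    "see you at lunch tomorrow",
    "please review the attached report",
]
ensemble_labels = [1, 1, 1, 0, 0, 0]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # average predicted probabilities across models
    ),
)
ensemble.fit(ensemble_texts, ensemble_labels)

ensemble_predictions = ensemble.predict(
    ["claim your free prize now", "see you at the meeting tomorrow"]
)
print(list(ensemble_predictions))
```

Whether an ensemble actually helps should be checked on held-out data; on a dataset this small it only demonstrates the wiring.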

Real-World Applications of Spam Detection 🌍

📧 Email Security – Protects users from phishing attacks.
📱 SMS Spam Filtering – Identifies spam texts on mobile phones.
🔒 Cybersecurity – Detects fraudulent messages and scam attempts.
🤖 Chatbot Moderation – Blocks inappropriate or harmful messages.

Conclusion 🎯🏆

We successfully built a machine learning-based spam filter using:

✅ Natural Language Processing (NLP) for text preprocessing.
✅ TF-IDF for feature extraction.
✅ Naïve Bayes classifier for spam detection.

This model automates spam filtering on the SMS dataset used here; trained on email-specific data, the same approach can help keep inboxes clean and users safe! 🚀

🔹 Next Step: Try deploying this model into a real-time email filtering system!
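
A first step toward that is turning a raw email into plain text a classifier can consume. The sketch below uses only Python's standard `email` package; the helper name `email_to_text` and the sample message are hypothetical, and a real system would also need to handle HTML-only mail, attachments, and malformed input.

```python
from email import message_from_string
from email.policy import default

def email_to_text(raw_email):
    """Return 'subject + plain-text body' for a raw RFC 5322 message string."""
    msg = message_from_string(raw_email, policy=default)
    subject = msg["Subject"] or ""
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return f"{subject} {body}".strip()

# Hypothetical raw message for illustration.
raw_demo = (
    "From: promo@example.com\n"
    "To: user@example.com\n"
    "Subject: You won!\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Claim your gift card now.\n"
)

demo_text = email_to_text(raw_demo)
print(demo_text)
```

The returned string can then be passed to a prediction function such as `predict_spam`.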