Develop a Spam Email Filter Using Machine Learning 📧🤖

Spam emails range from annoying to outright dangerous: many carry phishing links, malicious attachments, or fraudulent offers. To protect users, email providers run spam filters, often based on machine learning, that automatically detect and block unwanted emails.

How Does Spam Email Filtering Work? 🧐📩

A spam filter assigns every incoming email to one of two classes:

  • ✅ Ham (Legitimate Email) – Important, useful messages.
  • ❌ Spam (Unwanted Email) – Promotional, phishing, or malicious emails.

🔹 Techniques Used in Spam Detection

  • 📌 Keyword-Based Filtering – Flags emails containing words or phrases like “lottery”, “free money”, or “urgent”.
  • 📌 Machine Learning (ML) Models – Learn from previously labeled emails to classify new ones.
  • 📌 Bayesian Filtering – Estimates the probability that an email is spam from the words it contains.
  • 📌 Deep Learning (LSTMs, Transformers) – Neural network models that learn from word order and context.
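
To make the Bayesian idea concrete, here is a minimal sketch of Bayes' rule for a single word. The counts are made-up numbers for illustration, and the function name `spam_probability_given_word` is hypothetical, not part of any library.

```python
# Toy counts (made-up numbers, for illustration only).
spam_emails = 40        # emails labelled spam
ham_emails = 160        # emails labelled ham
spam_with_word = 30     # spam emails containing "lottery"
ham_with_word = 4       # ham emails containing "lottery"


def spam_probability_given_word(spam_with_word, spam_emails, ham_with_word, ham_emails):
    """Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)."""
    total = spam_emails + ham_emails
    p_spam = spam_emails / total
    p_ham = ham_emails / total
    p_word_given_spam = spam_with_word / spam_emails
    p_word_given_ham = ham_with_word / ham_emails
    # Total probability of seeing the word in any email.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


p = spam_probability_given_word(spam_with_word, spam_emails, ham_with_word, ham_emails)
print(f"P(spam | 'lottery') = {p:.3f}")  # prints 0.882 for these toy counts
```

A Naïve Bayes classifier combines this kind of evidence across all words in a message, treating the words as independent given the class.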

Install Required Libraries 📦

pip install numpy pandas scikit-learn nltk

Import Libraries & Load Data 🗂️

import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

🔹 Load the Dataset

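# SMS Spam Collection: labelled text messages, used here as a stand-in for email data
# Assumes a two-column file (label, message); pass sep='\t' to read_csv if the file is tab-separated
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms-spam-collection.csv"
df = pd.read_csv(url, encoding='latin-1', names=['label', 'message'])
# Encode the labels numerically: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()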
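
Before going further, it can help to confirm that every label mapped cleanly and to see how imbalanced the classes are. The sketch below uses a small made-up DataFrame (`toy_df`) in place of the real dataset, and `label_summary` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset (label: 0 = ham, 1 = spam).
toy_df = pd.DataFrame({
    "label": [0, 0, 0, 0, 1, 0, 0, 1],
    "message": ["placeholder text"] * 8,
})


def label_summary(frame):
    """Return the number of unmapped (NaN) labels and the share of each class."""
    missing = int(frame["label"].isna().sum())
    shares = frame["label"].value_counts(normalize=True).to_dict()
    return missing, shares


missing, shares = label_summary(toy_df)
print(f"Rows with unmapped labels: {missing}")
print(f"Class shares: {shares}")
```

A non-zero count of unmapped labels usually means the file was parsed with the wrong separator or contains a header row.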

Preprocess the Text Data 📝🔍

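nltk.download('stopwords')  # one-time download of the stop-word lists
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(text):
    """Lowercase, strip punctuation, drop stop words, and stem the remaining words."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)  # replace every non-word character with a space
    words = text.split()
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)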

df['clean_message'] = df['message'].apply(preprocess_text)
df.head()
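
To see what the preprocessing does to a single message, here is a small sketch of the same steps. As an assumption made only so the example runs without downloading the NLTK corpus, it uses a tiny hand-written stop-word set (`demo_stop_words`) rather than the full English list; `preprocess_demo` is a hypothetical name.

```python
import re

from nltk.stem.porter import PorterStemmer

# Assumption: a tiny hand-written stop-word set, not the full NLTK list.
demo_stop_words = {"you", "have", "a", "the", "to", "is"}
stemmer = PorterStemmer()


def preprocess_demo(text):
    """Lowercase, replace non-word characters, drop stop words, stem the rest."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)
    tokens = [stemmer.stem(word) for word in text.split() if word not in demo_stop_words]
    return " ".join(tokens)


raw = "You have WON a free prize!!! Claim the prizes now."
cleaned = preprocess_demo(raw)
print(f"Before: {raw}")
print(f"After:  {cleaned}")
```

Punctuation and stop words disappear, and inflected forms of the same word are reduced toward a common stem, which shrinks the vocabulary the model has to learn.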

Convert Text Data into Numerical Features 🔢

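# Keep the 5,000 most frequent terms and weight them by TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
# The result stays sparse, which saves memory; train_test_split and MultinomialNB accept sparse input
X = tfidf_vectorizer.fit_transform(df['clean_message'])
y = df['label']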

Split Data for Training & Testing 🎯

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
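
The `stratify=y` argument keeps the spam/ham ratio the same in both splits, which matters when spam is the minority class. A small sketch with made-up labels (80 ham, 20 spam) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 80 ham (0) and 20 spam (1), made up for illustration.
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print(f"Spam share overall:  {labels.mean():.2f}")
print(f"Spam share in train: {y_tr.mean():.2f}")
print(f"Spam share in test:  {y_te.mean():.2f}")
```

Without stratification, a random split of a small or imbalanced dataset can leave the test set with too few spam examples to evaluate on.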

Train the Spam Classifier 🚀📩

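model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Accuracy can look high when most messages are ham, so also report per-class precision and recall
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))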
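
The steps above fit the TF-IDF vocabulary on the full dataset before splitting, so the test messages influence the features. One way to avoid that, sketched here on a made-up toy corpus, is to wrap the vectorizer and classifier in a scikit-learn pipeline and evaluate it with cross-validation. The names `spam_pipeline`, `texts`, and `targets` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (1 = spam, 0 = ham), only to keep the sketch runnable.
texts = [
    "win free cash now", "free prize claim today",
    "urgent lottery winner", "claim your free reward",
    "lunch at noon tomorrow", "see you at the meeting",
    "project report attached", "call me when you land",
]
targets = [1, 1, 1, 1, 0, 0, 0, 0]

spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Each fold refits the vectorizer on that fold's training messages only.
scores = cross_val_score(spam_pipeline, texts, targets, cv=2, scoring="f1")
print(f"F1 per fold: {scores}")

# Fit on everything once evaluation is done, then classify raw text directly.
spam_pipeline.fit(texts, targets)
prediction = spam_pipeline.predict(["free cash prize"])
print(prediction)
```

The scores on eight messages are not meaningful in themselves; the point is the structure, which carries over unchanged to the real dataset.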

Test the Spam Filter on New Emails 📬

def predict_spam(email_text):
    processed_email = preprocess_text(email_text)
    email_vector = tfidf_vectorizer.transform([processed_email]).toarray()
    prediction = model.predict(email_vector)[0]
    return "Spam" if prediction == 1 else "Not Spam"

test_email = "Congratulations! You have won a $1000 Walmart gift card. Claim now!"
print(f"Email: {test_email} -> Prediction: {predict_spam(test_email)}")
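
Blocking a legitimate email is often worse than letting one spam message through, so it can be useful to look at the predicted probability instead of only the hard label. The sketch below trains a tiny model on made-up data and flags a message only when the spam probability reaches a threshold. `classify_with_threshold` and the 0.8 default are illustrative assumptions, and the text preprocessing step is omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data (1 = spam, 0 = ham), for illustration only.
train_texts = [
    "win free cash now", "free prize claim today",
    "urgent lottery winner", "claim your free reward",
    "lunch at noon tomorrow", "see you at the meeting",
    "project report attached", "call me when you land",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

demo_vectorizer = TfidfVectorizer()
demo_model = MultinomialNB().fit(demo_vectorizer.fit_transform(train_texts), train_labels)


def classify_with_threshold(text, threshold=0.8):
    """Flag as spam only when the estimated spam probability reaches the threshold."""
    vector = demo_vectorizer.transform([text])
    spam_index = list(demo_model.classes_).index(1)  # column holding P(spam)
    p_spam = demo_model.predict_proba(vector)[0][spam_index]
    label = "Spam" if p_spam >= threshold else "Not Spam"
    return label, p_spam


label, p_spam = classify_with_threshold("Claim your free prize now")
print(f"{label} (P(spam) = {p_spam:.2f})")
```

Raising the threshold trades missed spam for fewer false alarms; the right value depends on how costly each kind of mistake is.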

Save & Load the Model for Future Use 💾

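import joblib
# Save both objects: the model only works with the vocabulary of the vectorizer it was trained with
joblib.dump(model, "spam_classifier.pkl")
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")

# Later, or in another process, restore them before making predictions
loaded_model = joblib.load("spam_classifier.pkl")
loaded_vectorizer = joblib.load("tfidf_vectorizer.pkl")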
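
A quick round-trip check confirms that the restored objects behave like the originals. This sketch trains a tiny model on made-up data and writes to a temporary directory so it leaves no files behind; all names are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data (1 = spam, 0 = ham), for illustration only.
texts = ["win free cash now", "free prize claim today",
         "lunch at noon tomorrow", "project report attached"]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_classifier.pkl")
    vec_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(clf, model_path)
    joblib.dump(vec, vec_path)
    restored_clf = joblib.load(model_path)
    restored_vec = joblib.load(vec_path)

sample = ["free prize waiting for you"]
before = clf.predict(vec.transform(sample))
after = restored_clf.predict(restored_vec.transform(sample))
print("Predictions match:", bool((before == after).all()))
```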

Improve the Spam Classifier 🚀

Ways to improve the classifier include:

  • ✅ Using Deep Learning (LSTMs, Transformers), which can improve accuracy when enough training data is available.
  • ✅ Expanding the training data with real-world spam emails.
  • ✅ Applying additional NLP techniques, such as lemmatization instead of stemming.
  • ✅ Combining multiple models (ensemble learning) to improve classification.
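
As one possible take on the ensemble idea, the sketch below combines Naïve Bayes with logistic regression using scikit-learn's `VotingClassifier`. The choice of logistic regression as the second model, the soft-voting setting, and the toy data are all assumptions made for illustration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (1 = spam, 0 = ham), for illustration only.
texts = [
    "win free cash now", "free prize claim today",
    "urgent lottery winner", "claim your free reward",
    "lunch at noon tomorrow", "see you at the meeting",
    "project report attached", "call me when you land",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
)

ensemble.fit(texts, labels)
predictions = ensemble.predict(["free lottery prize", "meeting at noon"])
print(predictions)
```

Whether an ensemble actually beats a single well-tuned model depends on the data, so it is worth comparing both on a held-out set.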

Real-World Applications of Spam Detection 🌍

  • 📧 Email Security – Protects users from phishing attacks.
  • 📱 SMS Spam Filtering – Identifies spam texts on mobile phones.
  • 🔒 Cybersecurity – Detects fraudulent messages and scam attempts.
  • 🤖 Chatbot Moderation – Blocks inappropriate or harmful messages.

Conclusion 🎯🏆

We built a machine learning-based spam filter using:

  • ✅ Natural Language Processing (NLP) for text preprocessing.
  • ✅ TF-IDF for feature extraction.
  • ✅ A Naïve Bayes classifier for spam detection.

The trained model classifies new messages automatically, which helps keep inboxes clean and users safe! 🚀

🔹 Next Step: Try deploying this model into a real-time email filtering system!