Spam emails are annoying, and some are dangerous or outright fraudulent. To protect users, email providers use machine-learning spam filters that automatically detect and block unwanted messages.
How Spam Email Filtering Works
A spam filter sorts each incoming message into one of two classes:
- Ham (Legitimate Email): Important, useful messages.
- Spam (Unwanted Email): Promotional, phishing, or malicious emails.
Techniques Used in Spam Detection
- Keyword-Based Filtering: Detects words like "lottery", "free money", "urgent".
- Machine Learning (ML) Models: Learn from past emails to classify new ones.
- Bayesian Filtering: Calculates the probability of an email being spam.
- Deep Learning (LSTMs, Transformers): Advanced AI models for spam detection.
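To make the Bayesian filtering bullet concrete, here is a minimal sketch of Bayes' rule applied to a single word. The function name `spam_probability_given_word` and all of the counts are hypothetical, chosen only for illustration; a real filter combines evidence from many words rather than one.

```python
def spam_probability_given_word(spam_with_word, spam_total, ham_with_word, ham_total):
    """Return P(spam | word) via Bayes' rule, estimated from message counts."""
    total = spam_total + ham_total
    p_spam = spam_total / total                      # prior P(spam)
    p_ham = ham_total / total                        # prior P(ham)
    p_word_given_spam = spam_with_word / spam_total  # likelihood P(word | spam)
    p_word_given_ham = ham_with_word / ham_total     # likelihood P(word | ham)
    # Total probability of seeing the word in any message
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence

# Hypothetical counts: "lottery" appears in 40 of 200 spam and 5 of 800 ham messages.
p = spam_probability_given_word(40, 200, 5, 800)
print(f"P(spam | 'lottery') = {p:.3f}")  # P(spam | 'lottery') = 0.889
```

Of the 45 messages containing the word, 40 are spam, which is where the 0.889 comes from.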
Install Required Libraries
pip install numpy pandas scikit-learn nltk
Import Libraries & Load Data
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
Load the Dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms-spam-collection.csv"
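# names=[...] assumes the file has no header row and exactly two columns (label, message).
# If the file turns out to be tab-separated, add sep='\t'.
df = pd.read_csv(url, encoding='latin-1', names=['label', 'message'])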
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()
Preprocess the Text Data
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
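def preprocess_text(text):
    text = text.lower()                # normalize case
    text = re.sub(r'\W', ' ', text)    # replace non-word characters with spaces
    words = text.split()
    # Drop stop words, then reduce each remaining word to its stem
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)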
df['clean_message'] = df['message'].apply(preprocess_text)
df.head()
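To see what each cleaning step does without downloading NLTK data, here is a simplified, dependency-free sketch. It uses a tiny hand-picked stop-word set (`TOY_STOP_WORDS`, an assumption made for this example) and skips stemming, so its output will differ from what `preprocess_text` produces.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"you", "have", "a", "the", "to", "is"}

def clean_steps(text):
    """Return the intermediate result of each cleaning step."""
    lowered = text.lower()                    # 1. normalize case
    no_punct = re.sub(r"\W", " ", lowered)    # 2. non-word characters become spaces
    tokens = no_punct.split()                 # 3. split on whitespace
    kept = [t for t in tokens if t not in TOY_STOP_WORDS]  # 4. drop stop words
    return lowered, no_punct, tokens, kept

lowered, no_punct, tokens, kept = clean_steps("You have WON a prize!!! Claim now.")
print(tokens)  # ['you', 'have', 'won', 'a', 'prize', 'claim', 'now']
print(kept)    # ['won', 'prize', 'claim', 'now']
```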
Convert Text Data into Numerical Features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
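# Keep the result sparse: MultinomialNB and train_test_split both accept sparse matrices,
# and a dense (messages x 5000) array would use far more memory.
X = tfidf_vectorizer.fit_transform(df['clean_message'])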
y = df['label']
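As a rough illustration of what `TfidfVectorizer` computes, the sketch below reimplements its default weighting in plain Python on a toy corpus: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper name `tfidf_rows` and the three messages are made up for this example, and tokenization is a plain whitespace split rather than scikit-learn's tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) where each row holds L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["free prize claim", "meet lunch today", "claim free lunch"]
vocab, rows = tfidf_rows(docs)
print(vocab)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first message, "prize" gets a higher weight than "free" because it appears in only one document, which is the effect TF-IDF is designed to produce.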
Split Data for Training & Testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Train the Spam Classifier
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
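Accuracy alone can mislead when classes are imbalanced, and spam datasets usually contain far more ham than spam. The sketch below computes precision and recall for the spam class from two label lists; the function name `precision_recall` and the sample labels are hypothetical. With the tutorial's own variables, `classification_report(y_test, y_pred)` (imported earlier) reports the same metrics per class.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # spam caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 8 ham, 2 spam; the classifier misses one of the two spam messages.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
precision_demo, recall_demo = precision_recall(y_true_demo, y_pred_demo)
print(accuracy_demo, precision_demo, recall_demo)  # 0.9 1.0 0.5
```

Here accuracy is 90% even though half of the spam slipped through, which is why recall on the spam class is worth checking.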
Test the Spam Filter on New Emails
def predict_spam(email_text):
    # Apply the same cleaning and TF-IDF vocabulary used during training
    processed_email = preprocess_text(email_text)
    email_vector = tfidf_vectorizer.transform([processed_email]).toarray()
    prediction = model.predict(email_vector)[0]
    return "Spam" if prediction == 1 else "Not Spam"
test_email = "Congratulations! You have won a $1000 Walmart gift card. Claim now!"
print(f"Email: {test_email} -> Prediction: {predict_spam(test_email)}")
Save & Load the Model for Future Use
import joblib
joblib.dump(model, "spam_classifier.pkl")
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")
loaded_model = joblib.load("spam_classifier.pkl")
loaded_vectorizer = joblib.load("tfidf_vectorizer.pkl")
Improve the Spam Classifier
- Using Deep Learning (LSTMs, Transformers) for better accuracy.
- Expanding Training Data by using real-world spam emails.
- Applying Additional NLP Techniques like lemmatization instead of stemming.
- Combining Multiple Models (Ensemble Learning) to improve classification.
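As one illustration of the ensemble idea, the sketch below does hard (majority) voting over the 0/1 predictions of several classifiers. The function `majority_vote` and the three prediction lists are hypothetical stand-ins for real model outputs; scikit-learn's `VotingClassifier` offers a ready-made version of the same idea.

```python
def majority_vote(predictions):
    """Combine per-model 0/1 predictions (one list per model) by majority vote.

    A message is labeled spam (1) only if more than half of the models say so,
    so ties with an even number of models fall back to ham (0).
    """
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]

# Hypothetical predictions from three different models on four messages
nb_preds = [1, 0, 1, 0]
lr_preds = [1, 0, 0, 0]
svm_preds = [0, 0, 1, 1]

combined = majority_vote([nb_preds, lr_preds, svm_preds])
print(combined)  # [1, 0, 1, 0]
```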
Real-World Applications of Spam Detection
- Email Security: Protects users from phishing attacks.
- SMS Spam Filtering: Identifies spam texts on mobile phones.
- Cybersecurity: Detects fraudulent messages and scam attempts.
- Chatbot Moderation: Blocks inappropriate or harmful messages.
Conclusion
We built a machine learning-based spam filter using:
- Natural Language Processing (NLP) for text preprocessing.
- TF-IDF for feature extraction.
- Naïve Bayes classifier for spam detection.
This model automates spam filtering, helping keep inboxes clean and users safe.
Next Step: Try deploying this model into a real-time email filtering system!