2022 · NLP
Automated Plagiarism Detector
NLP · Text Analysis · Python
BITS Pilani | Sep 2022 – Dec 2022
Python · NLTK · TF-IDF · Cosine Similarity · Text Processing
Project Overview
Built an automated plagiarism detection system that identifies copied content across document corpora by combining several text similarity measures.
Key Contributions
- Multi-Algorithm Approach: Developed an ensemble plagiarism detection system that combines TF-IDF vectorization, cosine similarity, and the Jaccard index for robust text comparison
- Text Preprocessing: Implemented an NLP pipeline with tokenization, stemming, stopword removal, and n-gram extraction using NLTK
- High Accuracy: Achieved 91% plagiarism detection accuracy on a test corpus of 500+ academic documents, with configurable similarity thresholds
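
The preprocessing and n-gram comparison steps listed above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the project's code: the project used NLTK, whereas `STOPWORDS`, `crude_stem`, `preprocess`, `ngrams`, and `jaccard` are hypothetical names, the stopword list is a tiny sample, and the suffix stripper only stands in for a real stemmer such as NLTK's `PorterStemmer`.

```python
import re

# Tiny illustrative stopword list (assumption); NLTK ships a much larger one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "is", "are", "to"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on alphanumerics, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


doc_a = preprocess("The students copied the essays")
doc_b = preprocess("A student copied an essay")
overlap = jaccard(ngrams(doc_a), ngrams(doc_b))
print(f"bigram Jaccard overlap: {overlap:.2f}")
```

Comparing sets of n-grams rather than single words makes the score sensitive to word order, which is why n-gram extraction is usually paired with the Jaccard index.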
Technologies Used
- Languages: Python
- NLP Libraries: NLTK, Scikit-Learn
- Methods: TF-IDF, Cosine Similarity, Jaccard Index
- Text Processing: Tokenization, Stemming, N-grams
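
To make the TF-IDF and cosine similarity methods concrete, here is a small pure-Python sketch. It is an assumption-laden illustration rather than the project's implementation, which relied on Scikit-Learn: the function names `tfidf_vectors` and `cosine` are hypothetical, term frequency is a raw count, and the IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Smoothed IDF, so a term present in every document still gets weight 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


corpus = [
    ["neural", "network", "training", "loss"],
    ["neural", "network", "training", "loss"],
    ["medieval", "trade", "routes"],
]
vectors = tfidf_vectors(corpus)
print(cosine(vectors[0], vectors[1]))  # identical documents
print(cosine(vectors[0], vectors[2]))  # no shared terms
```

Because cosine similarity normalizes by vector length, a long document and a short excerpt copied from it can still score highly.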
Key Results
| Metric | Value |
|---|---|
| Detection Accuracy | 91% |
| Test Corpus Size | 500+ documents |
| Algorithms Used | 3 (TF-IDF, Cosine Similarity, Jaccard Index) |
| Similarity Threshold | Configurable |
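
One plausible way to relate a configurable threshold to a detection accuracy figure is sketched below. The scores and labels are made-up toy values, not the project's test corpus, and `classify` and `accuracy` are hypothetical helper names.

```python
def classify(scores, threshold=0.5):
    """Flag a document pair as plagiarized when its score reaches the threshold."""
    return [s >= threshold for s in scores]


def accuracy(predicted, actual):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy similarity scores and ground-truth labels (illustrative assumption only).
scores = [0.92, 0.15, 0.67, 0.40]
labels = [True, False, True, True]

for threshold in (0.5, 0.3):
    acc = accuracy(classify(scores, threshold), labels)
    print(f"threshold={threshold}: accuracy={acc:.2f}")
```

Lowering the threshold catches more borderline cases at the risk of more false positives, so a reported accuracy is only meaningful alongside the threshold that produced it.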
System Architecture
Document Input → Preprocessing Pipeline
→ Tokenization, Stemming, Stopword Removal
→ Feature Extraction (TF-IDF, N-grams)
→ Similarity Computation
→ Plagiarism Score → Report Generation
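
The final two stages of the flow above, scoring and reporting, might look like the following sketch. How the project actually combined its similarity measures is not stated, so the weighted average, the default weight, the report format, and the names `ensemble_score` and `build_report` are all assumptions made for illustration.

```python
def ensemble_score(cosine_sim, jaccard_sim, cosine_weight=0.5):
    """Weighted average of two similarity scores, each expected in [0, 1]."""
    if not 0.0 <= cosine_weight <= 1.0:
        raise ValueError("cosine_weight must be between 0 and 1")
    return cosine_weight * cosine_sim + (1.0 - cosine_weight) * jaccard_sim


def build_report(pair_scores, threshold=0.5):
    """pair_scores: {(doc_a, doc_b): score}. Returns lines, highest score first."""
    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    lines = []
    for (doc_a, doc_b), score in ranked:
        verdict = "FLAGGED" if score >= threshold else "ok"
        lines.append(f"{doc_a} vs {doc_b}: {score:.2f} {verdict}")
    return lines


# Hypothetical file names and similarity values.
pair_scores = {
    ("essay_1.txt", "essay_2.txt"): ensemble_score(0.95, 0.85),
    ("essay_1.txt", "essay_3.txt"): ensemble_score(0.30, 0.10),
}
for line in build_report(pair_scores, threshold=0.5):
    print(line)
```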
