2022 · NLP

Automated Plagiarism Detector

NLP · Text Analysis · Python

BITS Pilani | Sep 2022 – Dec 2022

Python · NLTK · TF-IDF · Cosine Similarity · Text Processing

Project Overview

Built an automated plagiarism detection system that identifies copied content across document corpora by combining several text similarity algorithms.

Key Contributions

  • Multi-Algorithm Approach: Developed an ensemble plagiarism detection system that combines TF-IDF vectorization, cosine similarity, and the Jaccard index for robust text comparison
  • Text Preprocessing: Implemented an NLP pipeline with tokenization, stemming, stopword removal, and n-gram extraction using NLTK
  • High Accuracy: Achieved 91% plagiarism detection accuracy on a test corpus of 500+ academic documents, with configurable similarity thresholds
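
The preprocessing and Jaccard steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses only the Python standard library so it runs without NLTK corpus downloads. The stopword list, the `crude_stem` suffix stripper (a stand-in for a real stemmer such as NLTK's Porter stemmer), and all function names are assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; NLTK ships a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on alphanumerics, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = ngrams(preprocess("The cats are sitting on the mat"))
    doc_b = ngrams(preprocess("A cat sits on the mat"))
    print(jaccard(doc_a, doc_b))
```

Comparing n-gram sets rather than single tokens makes the Jaccard score sensitive to word order, which helps separate copied passages from documents that merely share vocabulary.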

Technologies Used

  • Languages: Python
  • NLP Libraries: NLTK, Scikit-Learn
  • Methods: TF-IDF, Cosine Similarity, Jaccard Index
  • Text Processing: Tokenization, Stemming, N-grams

Key Results

Metric                 Value
Detection Accuracy     91%
Test Corpus Size       500+ documents
Algorithms Used        3 (TF-IDF, Cosine, Jaccard)
Similarity Threshold   Configurable
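
The write-up does not say how detection accuracy was measured. One common definition, assumed here, is the fraction of document pairs whose thresholded similarity score agrees with a ground-truth label. The function names and the toy scores below are hypothetical and are not the project's data.

```python
def detection_accuracy(scores, labels, threshold):
    """Fraction of pairs where (score >= threshold) matches the true label."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest accuracy (first one on ties)."""
    return max(candidates, key=lambda t: detection_accuracy(scores, labels, t))


if __name__ == "__main__":
    # Toy similarity scores and labels (1 = plagiarized pair), for illustration only.
    toy_scores = [0.95, 0.85, 0.40, 0.10]
    toy_labels = [1, 1, 0, 0]
    for t in (0.3, 0.5, 0.9):
        print(t, detection_accuracy(toy_scores, toy_labels, t))
```

A sweep like `best_threshold` is one way a configurable threshold could be tuned on held-out labeled pairs.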

System Architecture

Document Input → Preprocessing Pipeline
               → Tokenization, Stemming, Stopword Removal
               → Feature Extraction (TF-IDF, N-grams)
               → Similarity Computation
               → Plagiarism Score → Report Generation
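
The feature extraction, similarity, and scoring stages of this flow can be sketched end to end as below. It is a standard-library approximation written for illustration, not the project's implementation, which used scikit-learn and NLTK. The smoothed IDF formula follows scikit-learn's default (`smooth_idf=True`), while the function names and the 0.8 default threshold are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: count * idf[t] for t, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def plagiarism_report(docs, threshold=0.8):
    """List (i, j, score) for every document pair scoring at or above threshold."""
    vectors = tfidf_vectors(docs)
    flagged = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                flagged.append((i, j, round(score, 3)))
    return flagged


if __name__ == "__main__":
    corpus = [
        "the quick brown fox",
        "the quick brown fox",
        "completely different words here",
    ]
    print(plagiarism_report(corpus))
```

The all-pairs loop is quadratic in the number of documents, which is manageable for a corpus of a few hundred documents. Larger corpora would typically need a matrix-based or indexed approach.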