
Academic integrity is a cornerstone of any educational institution. For years, Global Tech University relied on proprietary, expensive plagiarism detection suites tightly coupled with our LMS. However, as our computer science and engineering programs expanded, we realized traditional text-matching engines were failing to detect code plagiarism and architectural-diagram similarity. Furthermore, the licensing costs for scanning tens of thousands of submissions per semester were astronomical.
We decided to build our own serverless ingestion and analysis pipeline. By leveraging AWS Lambda, S3, and customized Python analysis engines, we created an event-driven architecture that scales down to zero cost during the summer and easily handles the bursty traffic of final exam week. This post outlines the architecture of our bespoke detection pipeline.
The core philosophy of our system is asynchronous processing. When a student submits an assignment via Canvas or Moodle, we don't want the LMS to block while an analysis engine grinds through megabytes of data.
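To make the decoupling concrete, here is a hedged sketch of the webhook entry point: the LMS POST is acknowledged immediately and the real work is deferred to an SQS queue. The queue URL environment variable and the `build_message` helper are illustrative names, not lifted from our production code; `boto3` is imported lazily so the validation helper can be unit-tested without AWS credentials.

```python
import json
import os

QUEUE_URL = os.environ.get("SUBMISSION_QUEUE_URL", "")


def build_message(payload: dict) -> str:
    """Validate the webhook payload and serialize it for the queue."""
    required = ("file_url", "student_id", "assignment_id")
    missing = [k for k in required if k not in payload]
    if missing:
        raise ValueError(f"webhook payload missing fields: {missing}")
    return json.dumps({k: payload[k] for k in required})


def webhook_handler(event, context):
    import boto3  # imported lazily; only needed at runtime in Lambda

    payload = json.loads(event["body"])
    boto3.client("sqs").send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=build_message(payload),
    )
    # Return 202 immediately so the LMS never waits on analysis
    return {"statusCode": 202, "body": "queued"}
```

Because the handler only enqueues, a submission spike during finals week piles up safely in SQS rather than timing out student-facing requests.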
At a high level, the pipeline flows like this: LMS webhook → SQS queue → ingestion Lambda (normalization) → S3 intake bucket → analysis Lambdas (text similarity and MOSS code comparison) → DynamoDB results table for instructor review.
Normalizing student submissions is notoriously difficult. Students upload .docx, .pdf, .zip, and occasionally bizarre formats. Here is a simplified version of our Python ingestion Lambda that handles PDF text extraction using pdfplumber before passing the sanitized text to S3.
import io
import json
import os

import boto3
import pdfplumber
import requests

s3_client = boto3.client('s3')
BUCKET_NAME = os.environ['INTAKE_BUCKET']

def lambda_handler(event, context):
    for record in event['Records']:
        payload = json.loads(record['body'])
        submission_url = payload['file_url']
        student_id = payload['student_id']
        assignment_id = payload['assignment_id']

        # Download the file from the LMS
        response = requests.get(submission_url, stream=True)
        response.raise_for_status()

        extracted_text = ""
        # Handle PDF extraction
        if 'application/pdf' in response.headers.get('Content-Type', ''):
            with pdfplumber.open(io.BytesIO(response.content)) as pdf:
                for page in pdf.pages:
                    text = page.extract_text()
                    if text:
                        extracted_text += text + "\n"
        else:
            # Fallback for plain text
            extracted_text = response.text

        # Store normalized text in S3 to trigger analysis
        object_key = f"{assignment_id}/{student_id}.txt"
        s3_client.put_object(
            Bucket=BUCKET_NAME,
            Key=object_key,
            Body=extracted_text.encode('utf-8')
        )

    return {"statusCode": 200, "body": "Ingestion successful"}
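The .docx branch of our normalizer works on the same principle. Since the post doesn't show that handler, here is a stdlib-only sketch of how it could work: a .docx file is a ZIP archive of WordprocessingML, so the visible text can be pulled out of word/document.xml without shipping python-docx in the Lambda layer. The function name is illustrative.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace, per the OOXML spec
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return paragraph text from a .docx byte stream."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{W_NS}p"):
        # Each paragraph's text lives in one or more w:t run elements
        runs = [node.text or "" for node in para.iter(f"{W_NS}t")]
        if runs:
            paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

This loses tables and footnotes, which is acceptable for similarity scoring; a production handler would likely still prefer a dedicated library for edge cases.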
While NLP-based text similarity is relatively straightforward (using TF-IDF or BERT embeddings), code similarity requires Abstract Syntax Tree (AST) analysis. We utilize Stanford’s MOSS (Measure of Software Similarity) engine via a custom Lambda wrapper.
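To illustrate the text-similarity side, here is a minimal, stdlib-only TF-IDF cosine-similarity sketch. It is a toy for exposition, not our production scorer, which would use scikit-learn or sentence embeddings; the smoothing choice (idf = log(n/df) + 1) is one common convention.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    # Document frequency: in how many docs does each term appear?
    df = Counter(t for tokens in token_lists for t in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({
            t: (count / len(tokens)) * (math.log(n / df[t]) + 1)
            for t, count in tf.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Pairs of essays whose cosine score exceeds a tuned threshold get flagged for human review; the threshold matters, because TF-IDF happily scores two essays on the same prompt as moderately similar.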
When a .zip file containing source code lands in our S3 intake bucket, it triggers our MOSS Lambda. This function unzips the payload, packages it alongside previous historical submissions from the same assignment, and submits the batch to the MOSS server.
import os
import time

import boto3
import mosspy

def submit_to_moss(assignment_dir, language="python"):
    userid = os.environ.get("MOSS_USER_ID")
    m = mosspy.Moss(userid, language)

    # Add historical base files (boilerplate code provided by instructor)
    m.addBaseFile(f"{assignment_dir}/template.py")

    # Add all student submissions
    m.addFilesByWildcard(f"{assignment_dir}/submissions/*.py")

    url = m.send()

    # Store the result URL in DynamoDB for instructor review
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('PlagiarismResults')
    table.put_item(
        Item={
            'AssignmentID': os.path.basename(assignment_dir),
            'MossReportUrl': url,
            'Timestamp': int(time.time())
        }
    )
    return url
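The unzip-and-stage step that runs before `submit_to_moss` can be sketched as follows. This is a hedged illustration under a few assumptions: the handler and helper names are hypothetical, /tmp is Lambda's only writable path, and the helper guards against "zip-slip" entries that try to escape the staging directory.

```python
import io
import os
import zipfile


def unzip_submission(zip_bytes: bytes, dest_dir: str) -> list[str]:
    """Extract a submission archive, skipping unsafe (path-escaping) members."""
    extracted = []
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for member in zf.namelist():
            # Guard against zip-slip: reject paths resolving outside dest_dir
            target = os.path.realpath(os.path.join(dest_dir, member))
            if not target.startswith(dest_root + os.sep):
                continue
            zf.extract(member, dest_dir)
            extracted.append(member)
    return extracted


def moss_trigger_handler(event, context):
    import boto3  # imported lazily; illustrative S3-triggered handler

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()

    # Stage under /tmp, the only writable filesystem in Lambda
    work_dir = f"/tmp/{os.path.splitext(os.path.basename(key))[0]}"
    os.makedirs(work_dir, exist_ok=True)
    unzip_submission(body, work_dir)
    return work_dir
```

The path-traversal guard matters here because the archive contents are student-supplied and therefore untrusted input.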
The beauty of a serverless architecture in EdTech is the pricing model. University traffic is highly seasonal. During July, this pipeline costs us literally $0. During the week of final exams in December, when thousands of essays and code projects are submitted simultaneously, AWS provisions hundreds of concurrent Lambda executions to handle the queue.
However, there are architectural trade-offs: Lambda cold starts add latency to the first submissions after an idle period, the 15-minute execution ceiling caps how much analysis a single invocation can do, and debugging an event chain spread across SQS, S3, and multiple functions is harder than stepping through a monolith.
Building your own plagiarism detection pipeline isn’t trivial. It requires maintaining complex ingestion logic and tuning similarity algorithms to avoid false positives. However, the capability to analyze custom programmatic formats, maintain strict data privacy internally, and eliminate per-user vendor licensing has proven to be a massive operational win for our engineering team.
This pipeline is deliberately LMS-agnostic at the ingestion layer. Whether you are running a self-hosted Moodle instance or Canvas LMS, the webhook contract is identical: a JSON payload with a file_url, student_id, and assignment_id. For instructions on how to configure Moodle webhooks and integrate external event-driven pipelines, refer to our foundational guide on Self-Hosting Educational Tools with Docker and HomeLab.
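For reference, a representative webhook payload looks like the following (all field values are fabricated for illustration):

```json
{
  "file_url": "https://lms.example.edu/webdav/submissions/essay_final.pdf",
  "student_id": "s1234567",
  "assignment_id": "cs101-assignment-3"
}
```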
If your institution is also looking to capture what students learned (not just whether they submitted), consider pairing this pipeline with an xAPI event stream. Our article on Moving from an LMS to a Learning Record Store (LRS) with xAPI demonstrates how to route submission events alongside engagement signals into a unified data lake for longitudinal academic integrity analysis.