In this module, we will explore how tornet_scraper operates. You’ve already learned to extract data from both tornet and clearnet forums, but to streamline this process, we need an advanced web scraper equipped with features for large-scale data scraping.
This topic may seem challenging to some, but in the era of AI, where much of programming can be automated, and with this course leveraging AI for cyber defense, building a web app for data scraping is now more accessible.
Every web application relies on a tech stack, a collection of software tools used to develop different components of the application. For instance, websites like Facebook feature a user interface (front-end) and server-side logic (back-end) that makes pages accessible. In some cases, the front-end and back-end can use the same programming language. For example, JavaScript-based web apps might use Vue for the front-end and Nuxt for the back-end, simplifying development since both are written in JavaScript.
For our purposes, we need an API-friendly and scalable tech stack that you can customize independently. This module won’t delve deeply into programming concepts, as you can copy and paste code into your preferred AI tool for a tailored explanation that suits your learning style.
If you prefer to skip this module and start with a working project, you can do so, but you’ll miss learning how the project functions, and it will simply be another tool to you. The final project is available here:
https://github.com/CyberMounties/tornet_scraper
The topics of this section include the following:
- How templates work
- How main.py works
- How databases work
- How routes work
- How scrapers work
- How services work
How templates work
Modern web applications use templates, such as a default or base template, to define elements that appear across all pages, like a navigation menu or navbar. While you could copy and paste the navbar into every page, any changes would require updating each page individually.
To streamline this, we use a templating engine. In Python web applications, Jinja2 allows us to create a base template with content blocks for consistent and efficient design.
Open app/templates/base.html:
<!-- app/templates/base.html -->
<!DOCTYPE html>
<html lang="en" data-theme="nord">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>{% block title %}{% endblock %} - Tornet Scraper</title>
<link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/full.min.css" rel="stylesheet" type="text/css" />
<script src="https://cdn.tailwindcss.com"></script>
<script src="https://code.jquery.com/jquery-3.7.1.min.js"></script>
</head>
<body class="min-h-screen bg-base-200">
<!-- Navbar -->
<div class="navbar bg-base-100 shadow-xl">
<div class="navbar-start">
<a class="btn btn-ghost text-xl">Tornet Scraper</a>
</div>
<div class="navbar-center hidden lg:flex">
<ul class="menu menu-horizontal px-1">
<li><a href="/" class="btn btn-ghost">Dashboard</a></li>
<li><a href="/proxy-gen" class="btn btn-ghost">Proxy Gen</a></li>
<li><a href="/manage-api" class="btn btn-ghost">API Management</a></li>
<li><a href="/bot-profile" class="btn btn-ghost">Bot Profile</a></li>
<li><a href="/marketplace-scan" class="btn btn-ghost">Marketplace Scan</a></li>
<li><a href="/posts-scans" class="btn btn-ghost">Posts Scans</a></li>
<li><a href="/watchlist" class="btn btn-ghost">Watchlist</a></li>
</ul>
</div>
<div class="navbar-end">
<div class="dropdown dropdown-end lg:hidden">
<label tabindex="0" class="btn btn-ghost">
<svg xmlns="http://www.w3.org/2000/svg" class="h-5 w-5" fill="none" viewBox="0 0 24 24" stroke="currentColor">
<path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M4 6h16M4 12h16M4 18h16" />
</svg>
</label>
<ul tabindex="0" class="menu menu-sm dropdown-content mt-3 p-2 shadow bg-base-100 rounded-box w-52">
<li><a href="/">Dashboard</a></li>
<li><a href="/proxy-gen">Proxy Gen</a></li>
<li><a href="/manage-api">API Management</a></li>
<li><a href="/bot-profile">Bot Profile</a></li>
<li><a href="/marketplace-scan">Marketplace Scan</a></li>
<li><a href="/posts-scans" class="btn btn-ghost">Posts Scans</a></li>
<li><a href="/watchlist" class="btn btn-ghost">Watchlist</a></li>
</ul>
</div>
</div>
</div>
<!-- Main content -->
<main class="container mx-auto p-4 mt-6 bg-base-300 rounded-box">
<!-- Flash Messages Container -->
<div class="mb-4" id="flash-messages">
{% if messages %}
{% for message in messages %}
<div class="alert alert-{{ message.category | default('info') }} shadow-lg mb-4 flex justify-between items-center flash-message">
<div>
<span>{{ message.text }}</span>
</div>
<button class="btn btn-sm btn-circle btn-ghost" onclick="this.parentElement.remove()">✕</button>
</div>
{% endfor %}
{% endif %}
</div>
{% block content %}
{% endblock %}
</main>
<script>
// Automatically remove flash messages after 5 seconds
document.addEventListener('DOMContentLoaded', function() {
function removeFlashMessages() {
const flashMessages = document.querySelectorAll('.flash-message');
flashMessages.forEach(function(message) {
// Skip if already fading
if (message.style.opacity === '0') return;
setTimeout(function() {
message.style.transition = 'opacity 0.5s ease';
message.style.opacity = '0';
setTimeout(function() {
message.remove();
}, 500);
}, 5000);
});
}
removeFlashMessages();
const flashContainer = document.querySelector('#flash-messages');
if (flashContainer) {
const observer = new MutationObserver(function() {
removeFlashMessages();
});
observer.observe(flashContainer, { childList: true });
}
});
</script>
</body>
</html>
The base.html template is a Jinja2 base template for providing a consistent layout for all pages. It includes a navbar, flash messages, and placeholders for title and content that child templates can override. Below is a technical explanation of how it works and its extension mechanism.
- HTML Structure:
  - Defines a standard HTML5 document with a data-theme="nord" attribute for DaisyUI styling.
  - Includes external resources: DaisyUI CSS, Tailwind CSS, and jQuery for styling and interactivity.
- Title Block:
  - The <title> tag contains a Jinja2 block: {% block title %}{% endblock %} - Tornet Scraper.
  - Child templates override the title block to set a page-specific title, which is followed by " - Tornet Scraper".
  - Example: A child template with {% block title %}Dashboard{% endblock %} results in <title>Dashboard - Tornet Scraper</title>.
- Navbar:
  - A responsive navbar with a logo ("Tornet Scraper") and navigation links (Dashboard, Proxy Gen, etc.).
  - On large screens (lg:flex), links display horizontally; on smaller screens, a dropdown menu (triggered by a hamburger icon) shows links vertically.
- Flash Messages:
  - A <div id="flash-messages"> displays messages sent from the backend (e.g., success or error notifications).
  - Uses Jinja2: {% if messages %} guards a {% for message in messages %} loop over messages (a list of objects with text and category).
  - Each message is styled as a DaisyUI alert with a dynamic class (alert-{{ message.category }}, defaulting to info).
  - A close button (✕) removes the message on click.
  - JavaScript automatically fades out and removes flash messages after 5 seconds using a CSS transition (opacity).
  - A MutationObserver ensures new flash messages (added dynamically) are also auto-removed.
- Content Block:
  - The <main> section contains a {% block content %}{% endblock %} placeholder.
  - Child templates override this block to inject page-specific content into the main container, styled with Tailwind/DaisyUI classes (container, mx-auto, etc.).
- Extension in Child Templates:
  - Child templates use {% extends 'base.html' %} to inherit the base structure.
  - They override {% block title %} for the page title and {% block content %} for the main content.
  - For a child template that sets the title block to "Dashboard", this renders the full base.html layout with "Dashboard" in the title and the specified content in the <main> section.
This setup ensures a consistent UI, dynamic titles, reusable navigation, and user-friendly flash message handling across all pages.
dashboard.html
The file located at app/templates/dashboard.html is a clear example of a child template that extends base.html. While this file is not strictly necessary, I’ve developed a habit of including it. Below is the code contained within it:
<!-- app/templates/dashboard.html -->
{% extends "base.html" %}
{% block title %}Dashboard{% endblock %}
{% block content %}
<h1 class="text-3xl font-bold text-base-content mb-4">Dashboard</h1>
{% endblock %}
Notice how it extends base.html?
Content within {% block content %} is inserted into the main content container defined in base.html:
<!-- Main content -->
<main class="container mx-auto p-4 mt-6 bg-base-300 rounded-box">
{% block content %}
{% endblock %}
</main>
The title is also defined:
{% block title %}Dashboard{% endblock %}
In base.html, the title structure is:
<title>{% block title %}{% endblock %} - Tornet Scraper</title>
When viewing the page source code of the dashboard, the title appears as:
<title>Dashboard - Tornet Scraper</title>
Instead of rewriting the full title on every page, you define the shared part once in base.html and let each page supply only its own name.
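The inheritance mechanism can also be tried outside the web app. The following is a minimal sketch, assuming the jinja2 package is installed. It uses an in-memory DictLoader with trimmed-down stand-ins for base.html and dashboard.html, not the project's real files.

```python
from jinja2 import DictLoader, Environment

# Trimmed-down stand-ins for the project's templates (not the real files).
TEMPLATES = {
    "base.html": (
        "<title>{% block title %}{% endblock %} - Tornet Scraper</title>\n"
        "<main>{% block content %}{% endblock %}</main>"
    ),
    "dashboard.html": (
        '{% extends "base.html" %}'
        "{% block title %}Dashboard{% endblock %}"
        "{% block content %}<h1>Dashboard</h1>{% endblock %}"
    ),
}

# DictLoader serves templates from a dict instead of the filesystem.
env = Environment(loader=DictLoader(TEMPLATES), autoescape=True)

# Rendering the child pulls in the parent layout and fills both blocks.
rendered = env.get_template("dashboard.html").render()
print(rendered)
```

Rendering the child template is enough: Jinja2 resolves the parent named in the extends tag and substitutes the child's blocks into it.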
How main.py works
You can find main.py at app/main.py.
The main.py file is the core of a FastAPI web application, defining routes, templates, and database interactions for a web scraper interface. Below is a technical explanation of global variables, included routers, and route handler functions.
Global Variables
- app:
  - Purpose: The main FastAPI application instance.
  - Details: Initializes the FastAPI framework to handle HTTP requests and responses.
- BASE_DIR:
  - Purpose: Stores the absolute path of the project’s root directory.
  - Details: Derived using os.path to locate the app/templates directory for Jinja2 templates.
- TEMPLATES_DIR:
  - Purpose: Specifies the directory for Jinja2 templates.
  - Details: Set to app/templates within the project root for rendering HTML templates.
- templates:
  - Purpose: Jinja2 template engine instance.
  - Details: Configured to load templates from TEMPLATES_DIR for rendering dynamic HTML.
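As a rough illustration of how BASE_DIR and TEMPLATES_DIR can be derived with os.path, here is a minimal sketch. The helper name project_paths and the example path are hypothetical, and the sketch assumes main.py sits at <root>/app/main.py; the project's actual code may compute these differently.

```python
import os


def project_paths(main_file: str) -> tuple[str, str]:
    """Return (base_dir, templates_dir) for a main.py at <root>/app/main.py."""
    # Directory containing main.py, i.e. <root>/app
    app_dir = os.path.dirname(os.path.abspath(main_file))
    # One level up is the project root.
    base_dir = os.path.dirname(app_dir)
    templates_dir = os.path.join(base_dir, "app", "templates")
    return base_dir, templates_dir


# Inside main.py you would pass __file__; a literal path is used here so the
# sketch runs anywhere.
base_dir, templates_dir = project_paths("/srv/tornet_scraper/app/main.py")
print(base_dir)
print(templates_dir)
```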
Included Routers
- proxy_gen_router:
  - Purpose: Registers routes for proxy generation functionality.
  - Details: Imported from app.routes.proxy_gen, handles proxy-related endpoints.
- manage_api_router:
  - Purpose: Registers routes for API management.
  - Details: Imported from app.routes.manage_api, manages API keys and settings.
- bot_profile_router:
  - Purpose: Registers routes for bot profile management.
  - Details: Imported from app.routes.bot_profile, handles bot configuration.
- marketplace_api_router:
  - Purpose: Registers routes for marketplace scanning.
  - Details: Imported from app.routes.marketplace, manages marketplace data retrieval.
- posts_api_router:
  - Purpose: Registers routes for post scanning.
  - Details: Imported from app.routes.posts, handles post-related data operations.
- watchlist_api_router:
  - Purpose: Registers routes for watchlist management.
  - Details: Imported from app.routes.watchlist, manages watchlist items and scans.
Route Handler Functions
- dashboard:
  - Purpose: Renders the dashboard page.
  - Key Parameters:
    - request: FastAPI Request object for template context.
    - db: SQLAlchemy Session for database access (via Depends(get_db)).
  - Returns: TemplateResponse rendering dashboard.html with flashed messages from the session.
- proxy_gen:
  - Purpose: Displays proxy generation page with a list of proxies.
  - Key Parameters:
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for querying proxies.
  - Returns: TemplateResponse rendering proxy_gen.html with proxy data (container name, IP, Tor exit node, timestamp, running status) and flashed messages.
- manage_api:
  - Purpose: Shows API management page with API details.
  - Key Parameters:
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for querying APIs.
  - Returns: TemplateResponse rendering manage_api.html with API data (ID, name, provider, key, model, max tokens, prompt, timestamp, active status) and flashed messages.
- bot_profile:
  - Purpose: Displays bot profile management page.
  - Key Parameters:
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for querying bot profiles and latest onion URL.
  - Returns: TemplateResponse rendering bot_profile.html with bot profile data (ID, username, masked password, purpose, Tor proxy, timestamp), latest onion URL, and flashed messages.
- marketplace:
  - Purpose: Renders marketplace scan page with pagination and post scan data.
  - Key Parameters:
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for querying scans.
  - Returns: TemplateResponse rendering marketplace.html with pagination scans, post scans, and flashed messages.
- posts_scans:
  - Purpose: Displays the post scans overview page.
  - Key Parameters:
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for database access.
  - Returns: TemplateResponse rendering posts_scans.html with flashed messages.
- posts_scan_result:
  - Purpose: Shows results for a specific post scan.
  - Key Parameters:
    - scan_id: Integer ID of the scan.
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for database access.
    - name: Optional scan name (string, default empty).
  - Returns: TemplateResponse rendering posts_scan_result.html with scan ID, name, and flashed messages.
- watchlist:
  - Purpose: Renders the watchlist overview page.
  - Key Parameters:
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for database access.
  - Returns: TemplateResponse rendering watchlist.html with flashed messages.
- watchlist_profile:
  - Purpose: Displays a specific watchlist item’s profile and associated scans.
  - Key Parameters:
    - target_id: Integer ID of the watchlist item.
    - request: FastAPI Request object.
    - db: SQLAlchemy Session for querying watchlist item and scans.
  - Returns: TemplateResponse rendering watchlist_profile.html with watchlist item, scan data, and flashed messages; raises HTTPException (404 for missing item, 500 for server errors).
How databases work
In our FastAPI application, db.py manages database engine configurations, while models.py defines the database schema and tables. Using a separate db.py file enables easy switching between database types. Although tornet_scraper uses SQLite3 for prototyping and testing, it is not ideal for large-scale data scraping. For production environments, we recommend transitioning to a more robust database like PostgreSQL.
You can also implement logic in db.py to dynamically switch between SQLite3 and a database server based on conditions or the environment (e.g., development, testing, or production).
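One way to implement that switch is sketched below. The environment variable names APP_ENV and DATABASE_URL, the helper resolve_database_url, and the example PostgreSQL URL are all assumptions made for illustration; the project does not define them.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_database_url(base_dir: Path, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a SQLAlchemy connection string based on the environment."""
    env = os.environ if env is None else env
    # APP_ENV and DATABASE_URL are hypothetical variable names.
    app_env = env.get("APP_ENV", "development")
    if app_env == "production":
        # Fail loudly rather than silently falling back to SQLite in production.
        url = env.get("DATABASE_URL")
        if not url:
            raise RuntimeError("DATABASE_URL must be set when APP_ENV=production")
        return url
    # Development and testing keep using a local SQLite file.
    return f"sqlite:///{base_dir / 'tornet_scraper.db'}"


dev_url = resolve_database_url(Path("/srv/tornet_scraper"), env={})
prod_url = resolve_database_url(
    Path("/srv/tornet_scraper"),
    env={
        "APP_ENV": "production",
        "DATABASE_URL": "postgresql+psycopg2://user:secret@db:5432/tornet",
    },
)
print(dev_url)
print(prod_url)
```

The returned string would then be passed to create_engine in db.py. Note that the SQLite-specific check_same_thread argument should only be supplied when the URL actually points at SQLite.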
db.py
The db.py file configures the database connection and session management for a FastAPI application using SQLAlchemy. Below is a concise explanation of its components and functionality.
You can find db.py in app/database/db.py.
- Purpose:
  - Sets up a SQLite database connection, creates tables, and provides a dependency for database sessions in FastAPI routes.
- Global Variables:
  - BASE_DIR:
    - Purpose: Defines the project’s root directory.
    - Details: Uses Path(__file__).resolve().parent.parent.parent to compute the absolute path to the project root, ensuring the database file is located correctly.
  - DATABASE_URL:
    - Purpose: Specifies the database connection string.
    - Details: Configured for SQLite as sqlite:///{BASE_DIR}/tornet_scraper.db, pointing to a tornet_scraper.db file in the project root.
  - engine:
    - Purpose: SQLAlchemy engine for database interactions.
    - Details: Created with create_engine(DATABASE_URL, connect_args={"check_same_thread": False}) to handle SQLite connections. The check_same_thread argument allows SQLite to be used in a multi-threaded FastAPI environment.
  - SessionLocal:
    - Purpose: Factory for creating database sessions.
    - Details: Configured with sessionmaker(autocommit=False, autoflush=False, bind=engine) to create sessions bound to the engine, with manual commit and flush control.
- Functions:
  - get_db:
    - Purpose: Provides a database session for FastAPI routes.
    - Key Parameters: None (dependency function).
    - Returns: Yields a SQLAlchemy session (db) and ensures it closes after use.
    - Details: Used as a FastAPI dependency (via Depends(get_db)) to inject a session into route handlers. The try/finally block guarantees the session is closed, preventing resource leaks.
  - init_db:
    - Purpose: Initializes the database by creating tables.
    - Key Parameters: None.
    - Returns: None.
    - Details: Imports Base from models.py and calls Base.metadata.create_all(bind=engine) to create all defined tables (e.g., proxies, apis) in the SQLite database if they don’t exist.
- Usage:
  - Purpose: Enables database operations in the application.
  - Details: init_db is called in main.py to set up the database schema on startup. get_db is used in route handlers (e.g., /proxy-gen, /manage-api) to query or modify data in tables defined in models.py. The session ensures transactional integrity and proper resource cleanup.
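The session-per-request pattern described for get_db can be sketched as follows, assuming SQLAlchemy 1.4 or newer is installed. This is not the project's db.py: it uses an in-memory SQLite database so it runs on its own, and it drives the generator by hand, which FastAPI normally does through Depends(get_db).

```python
from typing import Iterator

from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session, sessionmaker

# In-memory SQLite keeps the sketch self-contained; the project uses a file.
engine = create_engine("sqlite:///:memory:", connect_args={"check_same_thread": False})
SessionLocal = sessionmaker(bind=engine, autoflush=False)


def get_db() -> Iterator[Session]:
    """Yield a session and always close it, even if the caller raises."""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


# FastAPI advances the generator itself; here we do it manually.
gen = get_db()
db = next(gen)                                   # runs up to the yield
result = db.execute(text("SELECT 1")).scalar()   # use the session
gen.close()                                      # triggers the finally block
print(result)
```

Because the cleanup lives in a finally block, the session is closed whether the route handler returns normally or raises.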
models.py
The models.py file defines the database schema for a FastAPI application using SQLAlchemy, specifying tables, columns, and relationships. Below is a concise explanation of its components and functionality.
You can find models.py at app/database/models.py.
- SQLAlchemy Overview:
  - Purpose: SQLAlchemy is an ORM (Object-Relational Mapping) library for Python, enabling interaction with a relational database using Python objects.
  - Details: Maps Python classes to database tables, allowing CRUD operations via object-oriented syntax. Uses declarative_base to create a base class (Base) for model definitions.
- Columns:
  - Purpose: Represent fields in a database table.
  - Details: Defined using SQLAlchemy’s Column class, specifying attributes like data type (e.g., Integer, String, DateTime), constraints (e.g., primary_key, unique, nullable), and defaults (e.g., datetime.utcnow). Columns store data for each record in a table.
- Tables:
  - Purpose: Represent database tables that store structured data.
  - Details: Each class (e.g., Proxy, APIs) inherits from Base and defines a table via __tablename__. Tables include columns and optional constraints (e.g., UniqueConstraint for unique combinations of fields). Examples include proxies, apis, bot_profiles, etc.
- Relationships:
  - Purpose: Define associations between tables for relational queries.
  - Details: Established using ForeignKey columns (e.g., MarketplacePost.scan_id references marketplace_post_scans.id). Relationships enable joining tables, like linking MarketplacePost to MarketplacePostScan via scan_id. SQLAlchemy handles these as object references in queries.
- Enums:
  - Purpose: Define controlled vocabularies for specific fields.
  - Details: Uses Python’s enum.Enum (e.g., BotPurpose, ScanStatus) to restrict column values to predefined options (e.g., SCRAPE_MARKETPLACE, RUNNING). Stored as strings in the database via Enum.
- Key Components:
  - Base: The SQLAlchemy base class for all models, used to generate table schemas.
  - Table Definitions: Classes like Proxy, APIs, BotProfile, etc., define tables for storing proxy details, API configurations, bot profiles, onion URLs, marketplace scans, posts, and watchlists.
  - Constraints: Unique constraints (e.g., UniqueConstraint in MarketplacePost) ensure data integrity by preventing duplicate records based on specific column combinations.
  - Timestamps: Many tables include a timestamp column (defaulting to datetime.utcnow) to track record creation/update times.
- Usage:
  - Purpose: Models are used by the FastAPI application to interact with the database.
  - Details: The main.py routes query these models (via SQLAlchemy sessions) to retrieve or store data, which is then passed to Jinja2 templates for rendering or processed in API logic. For example, Proxy data is queried in the /proxy-gen route to display proxy details.
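To make the ideas of columns, enums, foreign keys, and unique constraints concrete, here is a small sketch assuming SQLAlchemy 1.4 or newer. The classes DemoScan and DemoPost, their columns, and the enum DemoScanStatus are invented for illustration and do not reproduce the project's actual schema.

```python
import enum
from datetime import datetime

from sqlalchemy import (Column, DateTime, Enum, ForeignKey, Integer, String,
                        UniqueConstraint, create_engine)
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class DemoScanStatus(enum.Enum):  # hypothetical enum
    PENDING = "pending"
    RUNNING = "running"


class DemoScan(Base):  # hypothetical table
    __tablename__ = "demo_scans"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    status = Column(Enum(DemoScanStatus), default=DemoScanStatus.PENDING, nullable=False)
    timestamp = Column(DateTime, default=datetime.utcnow)


class DemoPost(Base):  # hypothetical table
    __tablename__ = "demo_posts"
    # The same URL may not be stored twice for one scan.
    __table_args__ = (UniqueConstraint("scan_id", "url", name="uq_demo_post_scan_url"),)
    id = Column(Integer, primary_key=True)
    scan_id = Column(Integer, ForeignKey("demo_scans.id"), nullable=False)
    url = Column(String, nullable=False)


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(bind=engine)  # same call init_db is described as making
session = sessionmaker(bind=engine)()

scan = DemoScan(name="example")
session.add(scan)
session.commit()

session.add(DemoPost(scan_id=scan.id, url="http://example.onion/post/1"))
session.commit()

# A second row with the same (scan_id, url) pair violates the constraint.
session.add(DemoPost(scan_id=scan.id, url="http://example.onion/post/1"))
try:
    session.commit()
    duplicate_rejected = False
except IntegrityError:
    session.rollback()
    duplicate_rejected = True

post_count = session.query(DemoPost).count()
```

The unique constraint is what lets a scraper re-run over the same pages without storing duplicate rows: the database rejects the second insert, and the code rolls back and moves on.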
How routes work
In API development, organizing API endpoints effectively is crucial, as routing all requests through a single endpoint is impractical.
For instance, the /api/posts/get-post format indicates that /api/posts/* is dedicated to post-related operations, such as viewing, deleting, or modifying posts. All related API calls are prefixed with /api/posts/.
Given that our application includes features like proxy generation, bot management, marketplace scraping, post scraping, threat monitoring, and more, proper organization of endpoints is essential.
All routes live in the app/routes/ directory, which contains six Python files. Below is an excerpt from bot_profile.py:
# app/routes/bot_profile.py
import logging
from fastapi import APIRouter, Depends, HTTPException, Request
from sqlalchemy.orm import Session
from pydantic import BaseModel
from app.database.models import BotProfile, OnionUrl, BotPurpose, APIs
from app.database.db import get_db
from typing import Optional
from app.services.tornet_forum_login import login_to_tor_website
from app.services.gen_random_ua import gen_desktop_ua

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

bot_profile_router = APIRouter(prefix="/api/bot-profile", tags=["API", "Bot Profile Management"])


# Pydantic models for validation
class BotProfileCreate(BaseModel):
    username: str
    password: str
    purpose: str
    tor_proxy: Optional[str] = None
    session: Optional[str] = None


class BotProfileUpdate(BaseModel):
    username: Optional[str] = None
    password: Optional[str] = None
    purpose: Optional[str] = None
    tor_proxy: Optional[str] = None
    user_agent: Optional[str] = None
    session: Optional[str] = None


class OnionUrlCreate(BaseModel):
    url: str


# Get all bot profiles
@bot_profile_router.get("/list")
async def get_bot_profiles(db: Session = Depends(get_db)):
    try:
        profiles = db.query(BotProfile).all()
        return [
            {
                "id": p.id,
                "username": p.username,
                "password": "********",
                "actual_password": p.password,
                "purpose": p.purpose.value,
                "tor_proxy": p.tor_proxy,
                "has_session": bool(p.session and len(p.session) > 0),
                "session": p.session,
                "user_agent": p.user_agent,
                "timestamp": p.timestamp.isoformat()
            } for p in profiles
        ]
    except Exception as e:
        logger.error(f"Error fetching bot profiles: {str(e)}")
        raise HTTPException(status_code=500, detail="Internal server error")
We utilize logging to track events within the application. Logging is critical for understanding what’s happening in your web app; without it, you’re operating blindly. While logs can be saved to a file, I typically avoid this during local development but recommend it for production environments.
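For the production case, file logging can be added with the standard library alone. The following is a sketch under stated assumptions: the helper build_logger, the logger name, the log file name, and the rotation limits are illustrative choices, not settings taken from the project.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_logger(name: str, log_path: str) -> logging.Logger:
    """Return a logger that writes to the console and to a rotating file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        return logger
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    # Rotate at roughly 1 MB and keep three old files (arbitrary limits).
    file_handler = RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=3)
    for handler in (logging.StreamHandler(), file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# A temporary directory keeps the sketch from writing into the project tree.
log_path = os.path.join(tempfile.mkdtemp(), "tornet_scraper.log")
logger = build_logger("bot_profile_demo", log_path)
logger.error("Error fetching bot profiles: %s", "example failure")
```

Rotation matters for a long-running scraper, since an unbounded log file will eventually fill the disk.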
The most critical aspect, however, is creating routers:
bot_profile_router = APIRouter(prefix="/api/bot-profile", tags=["API", "Bot Profile Management"])
This router establishes an endpoint prefix, /api/bot-profile, which all related endpoints will use.
For instance, the following function retrieves all bot profiles from the BotProfile table:
@bot_profile_router.get("/list")
async def get_bot_profiles(db: Session = Depends(get_db)):
When you use @bot_profile_router.get("/list"), the endpoint is prefixed with /api/bot-profile, resulting in /api/bot-profile/list.
This structure enables efficient organization of hundreds or thousands of API routes. The bot_profile_router is imported into main.py and registered.
How scrapers work
All scrapers are stored in app/scrapers/*.py.
These files are typically imported as modules in other parts of the codebase. For example, within marketplace_scraper.py, the create_pagination_batches function takes a web page URL as input, such as:
http://site.url/pagination?page=1
It creates a batch of pagination URLs like this:
http://site.url/pagination?page=1
http://site.url/pagination?page=2
http://site.url/pagination?page=3
--- snip ---
http://site.url/pagination?page=10
This approach is effective for task distribution, such as assigning each batch of 10 pagination pages to a single bot for scraping.
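The batching idea can be sketched with the standard library. This is not the project's create_pagination_batches, whose signature is not shown here; the helpers build_page_urls and chunk, the page query parameter name, and the last_page argument are assumptions made for the example.

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def build_page_urls(first_url: str, last_page: int, param: str = "page") -> List[str]:
    """Expand a first-page URL into one URL per page up to last_page."""
    parts = urlparse(first_url)
    query = parse_qs(parts.query)
    start = int(query.get(param, ["1"])[0])  # page number found in the input URL
    urls = []
    for page in range(start, last_page + 1):
        query[param] = [str(page)]
        urls.append(urlunparse(parts._replace(query=urlencode(query, doseq=True))))
    return urls


def chunk(urls: List[str], size: int) -> List[List[str]]:
    """Split the URL list into consecutive batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


urls = build_page_urls("http://site.url/pagination?page=1", last_page=25)
batches = chunk(urls, size=10)
for number, batch in enumerate(batches, start=1):
    print(number, batch[0], "->", batch[-1])
```

With 25 pages and a batch size of 10, this yields three batches (pages 1 to 10, 11 to 20, and 21 to 25), each of which could be handed to a different bot.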
How services work
All services are located in app/services/*.py.
Services are also used as modules but serve a distinct purpose, handling tasks such as:
- Managing Docker for proxies
- Generating proxies
- Performing logins
- Bypassing CAPTCHAs
As you’ll learn later, we use Docker containers to create Tor proxies locally. When a proxy is no longer needed, it must be deleted. To verify whether proxies are running, you can use a function like container_running from container_status.py to check Docker container status.
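A status check of this kind can be sketched with the Docker CLI and the standard library. This is an assumption-laden stand-in, not the code in container_status.py: the function name is_container_running, the injectable runner parameter (added so the logic can be exercised without Docker), and the container name are all hypothetical, and the project may use the Docker SDK instead.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], subprocess.CompletedProcess]


def _run(cmd: List[str]) -> subprocess.CompletedProcess:
    """Default runner: execute the command and capture its text output."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


def is_container_running(name: str, runner: Runner = _run) -> bool:
    """Return True if `docker inspect` reports the container as running."""
    try:
        result = runner(["docker", "inspect", "-f", "{{.State.Running}}", name])
    except FileNotFoundError:
        return False  # Docker CLI is not installed or not on PATH
    # A non-zero exit code means the container does not exist.
    return result.returncode == 0 and result.stdout.strip() == "true"


# Fake runners stand in for Docker so the sketch runs on any machine.
def fake_up(cmd: List[str]) -> subprocess.CompletedProcess:
    return subprocess.CompletedProcess(cmd, 0, stdout="true\n", stderr="")


def fake_missing(cmd: List[str]) -> subprocess.CompletedProcess:
    return subprocess.CompletedProcess(cmd, 1, stdout="", stderr="No such object")


print(is_container_running("tor-proxy-demo", runner=fake_up))
print(is_container_running("tor-proxy-demo", runner=fake_missing))
```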