In this module, we will explore how tornet_scraper operates. You’ve already learned to extract data from both tornet and clearnet forums, but to streamline this process, we need an advanced web scraper equipped with features for large-scale data scraping.

This topic may seem challenging to some, but in the era of AI, where much of programming can be automated, and with this course leveraging AI for cyber defense, building a web app for data scraping is now more accessible.

Every web application relies on a tech stack, a collection of software tools used to develop different components of the application. For instance, websites like Facebook feature a user interface (front-end) and server-side logic (back-end) that makes pages accessible. In some cases, the front-end and back-end can use the same programming language. For example, JavaScript-based web apps might use Vue for the front-end and Nuxt for the back-end, simplifying development since both are written in JavaScript.

For our purposes, we need an API-friendly and scalable tech stack that you can customize independently. This module won’t delve deeply into programming concepts, as you can copy and paste code into your preferred AI tool for a tailored explanation that suits your learning style.

If you prefer to skip this module and start with a working project, you can, but you won’t learn how the project functions, and it will remain just another tool. The final project is available here:

https://github.com/CyberMounties/tornet_scraper

The topics of this section include the following:

  1. How templates work
  2. How main.py works
  3. How databases work
  4. How routes work
  5. How scrapers work
  6. How services work

How templates work

Modern web applications use templates, such as a default or base template, to define elements that appear across all pages, like a navigation menu or navbar. While you could copy and paste the navbar into every page, any changes would require updating each page individually.

To streamline this, we use a templating engine. In Python web applications, Jinja2 allows us to create a base template with content blocks for consistent and efficient design.

Open app/templates/base.html:

<!-- app/templates/base.html -->
<!DOCTYPE html>
<html lang="en" data-theme="nord">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{% block title %}{% endblock %} - Tornet Scraper</title>
    <link href="https://cdn.jsdelivr.net/npm/[email protected]/dist/full.min.css" rel="stylesheet" type="text/css" />
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://code.jquery.com/jquery-3.7.1.min.js"></script>
</head>
<body class="min-h-screen bg-base-200">
    <!-- Navbar -->
    <div class="navbar bg-base-100 shadow-xl">
        <div class="navbar-start">
            <a class="btn btn-ghost text-xl">Tornet Scraper</a>
        </div>
        <div class="navbar-center hidden lg:flex">
            <ul class="menu menu-horizontal px-1">
                <li><a href="/" class="btn btn-ghost">Dashboard</a></li>
                <li><a href="/proxy-gen" class="btn btn-ghost">Proxy Gen</a></li>
                <li><a href="/manage-api" class="btn btn-ghost">API Management</a></li>
                <li><a href="/bot-profile" class="btn btn-ghost">Bot Profile</a></li>
                <li><a href="/marketplace-scan" class="btn btn-ghost">Marketplace Scan</a></li>
                <li><a href="/posts-scans" class="btn btn-ghost">Posts Scans</a></li>
                <li><a href="/watchlist" class="btn btn-ghost">Watchlist</a></li>
            </ul>
        </div>
        <div class="navbar-end">
            <div class="dropdown dropdown-end lg:hidden">
                <label tabindex="0" class="btn btn-ghost">
                    <svg xmlns="http://www.w3.org/2000/svg" class="h-5 w-5" fill="none" viewBox="0 0 24 24" stroke="currentColor">
                        <path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M4 6h16M4 12h16M4 18h16" />
                    </svg>
                </label>
                <ul tabindex="0" class="menu menu-sm dropdown-content mt-3 p-2 shadow bg-base-100 rounded-box w-52">
                    <li><a href="/">Dashboard</a></li>
                    <li><a href="/proxy-gen">Proxy Gen</a></li>
                    <li><a href="/manage-api">API Management</a></li>
                    <li><a href="/bot-profile">Bot Profile</a></li>
                    <li><a href="/marketplace-scan">Marketplace Scan</a></li>
                    <li><a href="/posts-scans" class="btn btn-ghost">Posts Scans</a></li>
                    <li><a href="/watchlist" class="btn btn-ghost">Watchlist</a></li>
                </ul>
            </div>
        </div>
    </div>

    <!-- Main content -->
    <main class="container mx-auto p-4 mt-6 bg-base-300 rounded-box">
        <!-- Flash Messages Container -->
        <div class="mb-4" id="flash-messages">
            {% if messages %}
                {% for message in messages %}
                    <div class="alert alert-{{ message.category | default('info') }} shadow-lg mb-4 flex justify-between items-center flash-message">
                        <div>
                            <span>{{ message.text }}</span>
                        </div>
                        <button class="btn btn-sm btn-circle btn-ghost" onclick="this.parentElement.remove()">✕</button>
                    </div>
                {% endfor %}
            {% endif %}
        </div>

        {% block content %}
        {% endblock %}
    </main>

    <script>
        // Automatically remove flash messages after 5 seconds
        document.addEventListener('DOMContentLoaded', function() {
            function removeFlashMessages() {
                const flashMessages = document.querySelectorAll('.flash-message');
                flashMessages.forEach(function(message) {
                    // Skip if already fading
                    if (message.style.opacity === '0') return;
                    setTimeout(function() {
                        message.style.transition = 'opacity 0.5s ease';
                        message.style.opacity = '0';
                        setTimeout(function() {
                            message.remove();
                        }, 500);
                    }, 5000);
                });
            }
            removeFlashMessages();
            const flashContainer = document.querySelector('#flash-messages');
            if (flashContainer) {
                const observer = new MutationObserver(function() {
                    removeFlashMessages();
                });
                observer.observe(flashContainer, { childList: true });
            }
        });
    </script>
</body>
</html>

The base.html template is a Jinja2 base template that provides a consistent layout for all pages. It includes a navbar, flash messages, and placeholders for the title and content that child templates can override. Below is a technical explanation of how it works and how child templates extend it.

  1. HTML Structure:

    • Defines a standard HTML5 document with a data-theme="nord" attribute for DaisyUI styling.
    • Includes external resources: DaisyUI CSS, Tailwind CSS, and jQuery for styling and interactivity.
  2. Title Block:

    • The <title> tag contains a Jinja2 block {% block title %}{% endblock %} - Tornet Scraper.
    • Child templates override the title block to set a page-specific title, which is appended with " - Tornet Scraper".
    • Example: A child template with {% block title %}Dashboard{% endblock %} results in <title>Dashboard - Tornet Scraper</title>.
  3. Navbar:

    • A responsive navbar with a logo ("Tornet Scraper") and navigation links (Dashboard, Proxy Gen, etc.).
    • On large screens (lg:flex), links display horizontally; on smaller screens, a dropdown menu (triggered by a hamburger icon) shows links vertically.
  4. Flash Messages:

    • A <div id="flash-messages"> displays messages sent from the backend (e.g., success or error notifications).
    • Uses Jinja2: {% if messages %} checks whether any messages exist, and {% for message in messages %} loops through them (a list of objects with text and category).
    • Each message is styled as a DaisyUI alert with a dynamic class (alert-{{ message.category }}, defaulting to info).
    • A close button (✕) removes the message on click.
    • JavaScript automatically fades out and removes flash messages after 5 seconds using a CSS transition (opacity).
    • A MutationObserver ensures new flash messages (added dynamically) are also auto-removed.
  5. Content Block:

    • The <main> section contains a {% block content %}{% endblock %} placeholder.
    • Child templates override this block to inject page-specific content into the main container, styled with Tailwind/DaisyUI classes (container, mx-auto, etc.).
  6. Extension in Child Templates:

    • Child templates use {% extends 'base.html' %} to inherit the base structure.

    • They override {% block title %} for the page title and {% block content %} for the main content.

      • For example, a child template that sets the title block to "Dashboard" renders the full base.html layout with "Dashboard" in the title and its own content in the <main> section.

This setup ensures a consistent UI, dynamic titles, reusable navigation, and user-friendly flash message handling across all pages.

dashboard.html

The file at app/templates/dashboard.html is a clear example of a child template that extends base.html. While this file is not strictly necessary, I’ve developed a habit of including it. Below is the code it contains:

<!-- app/templates/dashboard.html -->
{% extends "base.html" %}
{% block title %}Dashboard{% endblock %}

{% block content %}
<h1 class="text-3xl font-bold text-base-content mb-4">Dashboard</h1>
{% endblock %}

Notice how it extends base.html? Content within {% block content %} is inserted into the main content container defined in base.html:

    <!-- Main content -->
    <main class="container mx-auto p-4 mt-6 bg-base-300 rounded-box">
        {% block content %}
        {% endblock %}
    </main>

The title is also defined:

{% block title %}Dashboard{% endblock %}

In base.html, the title structure is:

<title>{% block title %}{% endblock %} - Tornet Scraper</title>

When viewing the page source code of the dashboard, the title appears as:

<title>Dashboard - Tornet Scraper</title>

Instead of rewriting the full title on every page, you define the shared suffix once in base.html and set only the page name in each child template.
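To see the inheritance mechanism outside the web app, the following is a minimal sketch that renders trimmed, in-memory stand-ins for base.html and dashboard.html with Jinja2's DictLoader. The shortened template bodies are assumptions made for illustration; the real files contain the full markup shown earlier.

```python
from jinja2 import DictLoader, Environment

# In-memory stand-ins for app/templates/base.html and dashboard.html,
# trimmed to the two blocks discussed above (illustrative only).
TEMPLATES = {
    "base.html": (
        "<title>{% block title %}{% endblock %} - Tornet Scraper</title>\n"
        "<main>{% block content %}{% endblock %}</main>"
    ),
    "dashboard.html": (
        '{% extends "base.html" %}'
        "{% block title %}Dashboard{% endblock %}"
        "{% block content %}<h1>Dashboard</h1>{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(TEMPLATES), autoescape=True)

# Rendering the child template pulls in the parent layout and fills its blocks.
html = env.get_template("dashboard.html").render()
print(html)
```

The child template contributes only the two blocks; everything else in the output comes from the parent.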


How main.py works

You can find main.py at app/main.py.

The main.py file is the core of a FastAPI web application, defining routes, templates, and database interactions for a web scraper interface. Below is a technical explanation of global variables, included routers, and route handler functions.

Global Variables

  1. app:

    • Purpose: The main FastAPI application instance.
    • Details: Initializes the FastAPI framework to handle HTTP requests and responses.
  2. BASE_DIR:

    • Purpose: Stores the absolute path of the project’s root directory.
    • Details: Derived using os.path to locate the app/templates directory for Jinja2 templates.
  3. TEMPLATES_DIR:

    • Purpose: Specifies the directory for Jinja2 templates.
    • Details: Set to app/templates within the project root for rendering HTML templates.
  4. templates:

    • Purpose: Jinja2 template engine instance.
    • Details: Configured to load templates from TEMPLATES_DIR for rendering dynamic HTML.
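
As a rough sketch of how BASE_DIR and TEMPLATES_DIR could be derived with os.path: the helper name project_paths and the example path are hypothetical, and a module such as main.py would typically pass its own __file__ instead.

```python
import os


def project_paths(main_file: str) -> tuple[str, str]:
    """Return (base_dir, templates_dir) for a main.py located at <root>/app/main.py."""
    # Two dirname() calls climb from <root>/app/main.py up to <root>.
    base_dir = os.path.dirname(os.path.dirname(os.path.abspath(main_file)))
    templates_dir = os.path.join(base_dir, "app", "templates")
    return base_dir, templates_dir


# Hypothetical location of the project on disk.
base_dir, templates_dir = project_paths("/srv/tornet_scraper/app/main.py")
print(base_dir)
print(templates_dir)
```

Resolving the paths from the file location rather than the current working directory keeps template loading independent of where the server is started.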

Included Routers

  1. proxy_gen_router:

    • Purpose: Registers routes for proxy generation functionality.
    • Details: Imported from app.routes.proxy_gen, handles proxy-related endpoints.
  2. manage_api_router:

    • Purpose: Registers routes for API management.
    • Details: Imported from app.routes.manage_api, manages API keys and settings.
  3. bot_profile_router:

    • Purpose: Registers routes for bot profile management.
    • Details: Imported from app.routes.bot_profile, handles bot configuration.
  4. marketplace_api_router:

    • Purpose: Registers routes for marketplace scanning.
    • Details: Imported from app.routes.marketplace, manages marketplace data retrieval.
  5. posts_api_router:

    • Purpose: Registers routes for post scanning.
    • Details: Imported from app.routes.posts, handles post-related data operations.
  6. watchlist_api_router:

    • Purpose: Registers routes for watchlist management.
    • Details: Imported from app.routes.watchlist, manages watchlist items and scans.

Route Handler Functions

  1. dashboard:

    • Purpose: Renders the dashboard page.
    • Key Parameters:
      • request: FastAPI Request object for template context.
      • db: SQLAlchemy Session for database access (via Depends(get_db)).
    • Returns: TemplateResponse rendering dashboard.html with flashed messages from the session.
  2. proxy_gen:

    • Purpose: Displays proxy generation page with a list of proxies.
    • Key Parameters:
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for querying proxies.
    • Returns: TemplateResponse rendering proxy_gen.html with proxy data (container name, IP, Tor exit node, timestamp, running status) and flashed messages.
  3. manage_api:

    • Purpose: Shows API management page with API details.
    • Key Parameters:
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for querying APIs.
    • Returns: TemplateResponse rendering manage_api.html with API data (ID, name, provider, key, model, max tokens, prompt, timestamp, active status) and flashed messages.
  4. bot_profile:

    • Purpose: Displays bot profile management page.
    • Key Parameters:
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for querying bot profiles and latest onion URL.
    • Returns: TemplateResponse rendering bot_profile.html with bot profile data (ID, username, masked password, purpose, Tor proxy, timestamp), latest onion URL, and flashed messages.
  5. marketplace:

    • Purpose: Renders marketplace scan page with pagination and post scan data.
    • Key Parameters:
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for querying scans.
    • Returns: TemplateResponse rendering marketplace.html with pagination scans, post scans, and flashed messages.
  6. posts_scans:

    • Purpose: Displays the post scans overview page.
    • Key Parameters:
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for database access.
    • Returns: TemplateResponse rendering posts_scans.html with flashed messages.
  7. posts_scan_result:

    • Purpose: Shows results for a specific post scan.
    • Key Parameters:
      • scan_id: Integer ID of the scan.
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for database access.
      • name: Optional scan name (string, default empty).
    • Returns: TemplateResponse rendering posts_scan_result.html with scan ID, name, and flashed messages.
  8. watchlist:

    • Purpose: Renders the watchlist overview page.
    • Key Parameters:
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for database access.
    • Returns: TemplateResponse rendering watchlist.html with flashed messages.
  9. watchlist_profile:

    • Purpose: Displays a specific watchlist item’s profile and associated scans.
    • Key Parameters:
      • target_id: Integer ID of the watchlist item.
      • request: FastAPI Request object.
      • db: SQLAlchemy Session for querying watchlist item and scans.
    • Returns: TemplateResponse rendering watchlist_profile.html with watchlist item, scan data, and flashed messages; raises HTTPException (404 for missing item, 500 for server errors).
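
Every handler above passes flashed messages from the session to its template. A minimal sketch of that pattern follows, assuming the session behaves like a dict. The helper names flash and get_flashed_messages are hypothetical and are not taken from the project.

```python
from typing import Any


def flash(session: dict[str, Any], text: str, category: str = "info") -> None:
    """Queue a message; category maps to the alert-<category> class in base.html."""
    session.setdefault("messages", []).append({"text": text, "category": category})


def get_flashed_messages(session: dict[str, Any]) -> list[dict[str, str]]:
    """Return queued messages and clear them so each one is shown only once."""
    return session.pop("messages", [])


# A plain dict stands in for the per-user session here.
session: dict[str, Any] = {}
flash(session, "Proxy created", "success")
messages = get_flashed_messages(session)
print(messages)
```

Because Jinja2 resolves message.text against dictionary keys as well as attributes, a list shaped like this can be passed to the template as the messages variable.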

How databases work

In our FastAPI application, db.py manages database engine configurations, while models.py defines the database schema and tables. Using a separate db.py file enables easy switching between database types. Although tornet_scraper uses SQLite3 for prototyping and testing, it is not ideal for large-scale data scraping. For production environments, we recommend transitioning to a more robust database like PostgreSQL.

You can also implement logic in db.py to dynamically switch between SQLite3 and a database server based on conditions or the environment (e.g., development, testing, or production).
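
One way such switching could look is sketched below. The environment variable names APP_ENV and DATABASE_URL, and the function name get_database_url, are assumptions for illustration; the project itself may not read either variable.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def get_database_url(base_dir: Path, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a connection string based on the environment (names are assumptions)."""
    env = os.environ if env is None else env
    if env.get("APP_ENV", "development") == "production":
        # e.g. a PostgreSQL URL supplied by the deployment environment
        return env["DATABASE_URL"]
    # Development and testing fall back to a local SQLite file.
    return f"sqlite:///{base_dir / 'tornet_scraper.db'}"


dev_url = get_database_url(Path("data"), {})
prod_url = get_database_url(
    Path("data"),
    {"APP_ENV": "production", "DATABASE_URL": "postgresql://user:secret@db/tornet"},
)
print(dev_url)
print(prod_url)
```

Keeping this decision in one function means the rest of the application only ever sees a connection string.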

db.py

The db.py file configures the database connection and session management for a FastAPI application using SQLAlchemy. Below is a concise explanation of its components and functionality.

You can find db.py in app/database/db.py.

  1. Purpose:

    • Sets up a SQLite database connection, creates tables, and provides a dependency for database sessions in FastAPI routes.
  2. Global Variables:

    • BASE_DIR:
      • Purpose: Defines the project’s root directory.
      • Details: Uses Path(__file__).resolve().parent.parent.parent to compute the absolute path to the project root, ensuring the database file is located correctly.
    • DATABASE_URL:
      • Purpose: Specifies the database connection string.
      • Details: Configured for SQLite as sqlite:///{BASE_DIR}/tornet_scraper.db, pointing to a tornet_scraper.db file in the project root.
    • engine:
      • Purpose: SQLAlchemy engine for database interactions.
      • Details: Created with create_engine(DATABASE_URL, connect_args={"check_same_thread": False}) to handle SQLite connections. The check_same_thread argument allows SQLite to be used in a multi-threaded FastAPI environment.
    • SessionLocal:
      • Purpose: Factory for creating database sessions.
      • Details: Configured with sessionmaker(autocommit=False, autoflush=False, bind=engine) to create sessions bound to the engine, with manual commit and flush control.
  3. Functions:

    • get_db:
      • Purpose: Provides a database session for FastAPI routes.
      • Key Parameters: None (dependency function).
      • Returns: Yields a SQLAlchemy session (db) and ensures it closes after use.
      • Details: Used as a FastAPI dependency (via Depends(get_db)) to inject a session into route handlers. The try/finally block guarantees the session is closed, preventing resource leaks.
    • init_db:
      • Purpose: Initializes the database by creating tables.
      • Key Parameters: None.
      • Returns: None.
      • Details: Imports Base from models.py and calls Base.metadata.create_all(bind=engine) to create all defined tables (e.g., proxies, apis) in the SQLite database if they don’t exist.
  4. Usage:

    • Purpose: Enables database operations in the application.
    • Details: init_db is called in main.py to set up the database schema on startup. get_db is used in route handlers (e.g., /proxy-gen, /manage-api) to query or modify data in tables defined in models.py. The session ensures transactional integrity and proper resource cleanup.
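
The pieces described above can be sketched together as follows. This is not the project's db.py: it uses an in-memory SQLite database (with StaticPool so every session shares one connection) and a trimmed Proxy model whose container_name column is an assumption.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.pool import StaticPool

# In-memory SQLite for the sketch; db.py points at a tornet_scraper.db file instead.
engine = create_engine(
    "sqlite://",
    connect_args={"check_same_thread": False},
    poolclass=StaticPool,
)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()


class Proxy(Base):
    """Trimmed stand-in for the real model; the column is illustrative."""
    __tablename__ = "proxies"
    id = Column(Integer, primary_key=True)
    container_name = Column(String, unique=True, nullable=False)


def get_db():
    """Yield a session and always close it, as a FastAPI dependency would."""
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()


def init_db():
    """Create any tables that do not exist yet."""
    Base.metadata.create_all(bind=engine)


init_db()
gen = get_db()
db = next(gen)                # what Depends(get_db) hands to a route
db.add(Proxy(container_name="tor-proxy-1"))
db.commit()
names = [p.container_name for p in db.query(Proxy).all()]
gen.close()                   # runs the finally block, closing the session
print(names)
```

In a route, FastAPI drives the generator itself, so the handler only declares db: Session = Depends(get_db).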

models.py

The models.py file defines the database schema for a FastAPI application using SQLAlchemy, specifying tables, columns, and relationships. Below is a concise explanation of its components and functionality.

You can find models.py at app/database/models.py.

  1. SQLAlchemy Overview:

    • Purpose: SQLAlchemy is an ORM (Object-Relational Mapping) library for Python, enabling interaction with a relational database using Python objects.
    • Details: Maps Python classes to database tables, allowing CRUD operations via object-oriented syntax. Uses declarative_base to create a base class (Base) for model definitions.
  2. Columns:

    • Purpose: Represent fields in a database table.
    • Details: Defined using SQLAlchemy’s Column class, specifying attributes like data type (e.g., Integer, String, DateTime), constraints (e.g., primary_key, unique, nullable), and defaults (e.g., datetime.utcnow). Columns store data for each record in a table.
  3. Tables:

    • Purpose: Represent database tables that store structured data.
    • Details: Each class (e.g., Proxy, APIs) inherits from Base and defines a table via __tablename__. Tables include columns and optional constraints (e.g., UniqueConstraint for unique combinations of fields). Examples include proxies, apis, bot_profiles, etc.
  4. Relationships:

    • Purpose: Define associations between tables for relational queries.
    • Details: Established using ForeignKey columns (e.g., MarketplacePost.scan_id references marketplace_post_scans.id). Relationships enable joining tables, like linking MarketplacePost to MarketplacePostScan via scan_id. SQLAlchemy handles these as object references in queries.
  5. Enums:

    • Purpose: Define controlled vocabularies for specific fields.
    • Details: Uses Python’s enum.Enum (e.g., BotPurpose, ScanStatus) to restrict column values to predefined options (e.g., SCRAPE_MARKETPLACE, RUNNING). Stored as strings in the database via Enum.
  6. Key Components:

    • Base: The SQLAlchemy base class for all models, used to generate table schemas.
    • Table Definitions: Classes like Proxy, APIs, BotProfile, etc., define tables for storing proxy details, API configurations, bot profiles, onion URLs, marketplace scans, posts, and watchlists.
    • Constraints: Unique constraints (e.g., UniqueConstraint in MarketplacePost) ensure data integrity by preventing duplicate records based on specific column combinations.
    • Timestamps: Many tables include a timestamp column (defaulting to datetime.utcnow) to track record creation/update times.
  7. Usage:

    • Purpose: Models are used by the FastAPI application to interact with the database.
    • Details: The main.py routes query these models (via SQLAlchemy sessions) to retrieve or store data, which is then passed to Jinja2 templates for rendering or processed in API logic. For example, Proxy data is queried in the /proxy-gen route to display proxy details.
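
To make the column, enum, foreign key, and unique-constraint ideas concrete, here is a small sketch with two related tables. The class and table names follow the text above, but the columns (name, status, link) and the enum members other than RUNNING are assumptions and may differ from the real models.py.

```python
import enum

from sqlalchemy import (Column, Enum, ForeignKey, Integer, String,
                        UniqueConstraint, create_engine)
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class ScanStatus(enum.Enum):
    PENDING = "pending"
    RUNNING = "running"


class MarketplacePostScan(Base):
    __tablename__ = "marketplace_post_scans"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    status = Column(Enum(ScanStatus), default=ScanStatus.PENDING, nullable=False)


class MarketplacePost(Base):
    __tablename__ = "marketplace_posts"
    # The same link may appear only once per scan.
    __table_args__ = (UniqueConstraint("scan_id", "link", name="uq_scan_link"),)
    id = Column(Integer, primary_key=True)
    scan_id = Column(Integer, ForeignKey("marketplace_post_scans.id"), nullable=False)
    link = Column(String, nullable=False)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as db:
    scan = MarketplacePostScan(name="demo", status=ScanStatus.RUNNING)
    db.add(scan)
    db.commit()
    db.add(MarketplacePost(scan_id=scan.id, link="http://site.url/post/1"))
    db.commit()
    # A second row with the same (scan_id, link) violates the constraint.
    db.add(MarketplacePost(scan_id=scan.id, link="http://site.url/post/1"))
    try:
        db.commit()
        duplicate_rejected = False
    except IntegrityError:
        db.rollback()
        duplicate_rejected = True
    post_count = db.query(MarketplacePost).filter_by(scan_id=scan.id).count()

print(duplicate_rejected, post_count)
```

The constraint lets the database itself reject duplicate posts, so a scraper can retry a page without first checking what it has already stored.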

How routes work

In API development, organizing API endpoints effectively is crucial, as routing all requests through a single endpoint is impractical.

For instance, the /api/posts/get-post format indicates that /api/posts/* is dedicated to post-related operations, such as viewing, deleting, or modifying posts. All related API calls are prefixed with /api/posts/.

Given that our application includes features like proxy generation, bot management, marketplace scraping, post scraping, threat monitoring, and more, proper organization of endpoints is essential.

You can find all routes in app/routes/, which contains six Python files. Below is an excerpt from bot_profile.py:

# app/routes/bot_profile.py
import logging
from fastapi import APIRouter, Depends, HTTPException, Request
from sqlalchemy.orm import Session
from pydantic import BaseModel
from app.database.models import BotProfile, OnionUrl, BotPurpose, APIs
from app.database.db import get_db
from typing import Optional
from app.services.tornet_forum_login import login_to_tor_website
from app.services.gen_random_ua import gen_desktop_ua


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


bot_profile_router = APIRouter(prefix="/api/bot-profile", tags=["API", "Bot Profile Management"])


# Pydantic models for validation
class BotProfileCreate(BaseModel):
    username: str
    password: str
    purpose: str
    tor_proxy: Optional[str] = None
    session: Optional[str] = None


class BotProfileUpdate(BaseModel):
    username: Optional[str] = None
    password: Optional[str] = None
    purpose: Optional[str] = None
    tor_proxy: Optional[str] = None
    user_agent: Optional[str] = None
    session: Optional[str] = None


class OnionUrlCreate(BaseModel):
    url: str


# Get all bot profiles
@bot_profile_router.get("/list")
async def get_bot_profiles(db: Session = Depends(get_db)):
    try:
        profiles = db.query(BotProfile).all()
        return [
            {
                "id": p.id,
                "username": p.username,
                "password": "********",
                "actual_password": p.password,
                "purpose": p.purpose.value,
                "tor_proxy": p.tor_proxy,
                "has_session": bool(p.session and len(p.session) > 0),
                "session": p.session,
                "user_agent": p.user_agent,
                "timestamp": p.timestamp.isoformat()
            } for p in profiles
        ]
    except Exception as e:
        logger.error(f"Error fetching bot profiles: {str(e)}")
        raise HTTPException(status_code=500, detail="Internal server error")

We utilize logging to track events within the application. Logging is critical for understanding what’s happening in your web app; without it, you’re operating blindly. While logs can be saved to a file, I typically avoid this during local development but recommend it for production environments.

The most critical aspect, however, is creating routers:

bot_profile_router = APIRouter(prefix="/api/bot-profile", tags=["API", "Bot Profile Management"])

This router establishes an endpoint prefix, /api/bot-profile, which all related endpoints will use.

For instance, the following function retrieves all bot profiles from the BotProfile table:

@bot_profile_router.get("/list")
async def get_bot_profiles(db: Session = Depends(get_db)):

When you use @bot_profile_router.get("/list"), the endpoint is prefixed with /api/bot-profile, resulting in /api/bot-profile/list.

This structure enables efficient organization of hundreds or thousands of API routes. The bot_profile_router is imported into main.py and registered.


How scrapers work

All scrapers are stored in app/scrapers/*.py.

These files are typically imported as modules in other parts of the codebase. For example, within marketplace_scraper.py, the create_pagination_batches function takes a web page URL as input, such as:

http://site.url/pagination?page=1

It creates a batch of paginations like this:

http://site.url/pagination?page=1
http://site.url/pagination?page=2
http://site.url/pagination?page=3
--- snip ---
http://site.url/pagination?page=10

This approach is effective for task distribution, such as assigning each batch of 10 pagination pages to a single bot for scraping.
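
The batching idea can be sketched as follows. This is not the project's create_pagination_batches: the function name build_pagination_batches and its parameters (max_page, batch_size, page_param) are assumptions chosen for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def build_pagination_batches(
    url: str, max_page: int, batch_size: int = 10, page_param: str = "page"
) -> list[list[str]]:
    """Expand a paginated URL up to max_page and group the URLs into batches."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Start from the page number already present in the URL (default 1).
    start = int(query.get(page_param, ["1"])[0])

    urls = []
    for page in range(start, max_page + 1):
        query[page_param] = [str(page)]
        urls.append(urlunsplit(parts._replace(query=urlencode(query, doseq=True))))

    # Slice the flat list into chunks of batch_size URLs each.
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


batches = build_pagination_batches("http://site.url/pagination?page=1", max_page=25)
for number, batch in enumerate(batches, start=1):
    print(number, batch[0], "...", batch[-1])
```

With 25 pages and a batch size of 10, the last batch is shorter than the others, so whatever assigns batches to bots should not assume a fixed length.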


How services work

All services are located in app/services/*.py.

Services are also used as modules but serve a distinct purpose, handling tasks such as:

  1. Managing Docker for proxies
  2. Generating proxies
  3. Performing logins
  4. Bypassing CAPTCHAs

As you’ll learn later, we use Docker containers to create Tor proxies locally. When a proxy is no longer needed, it must be deleted. To verify whether proxies are running, you can use a function like container_running from container_status.py to check Docker container status.
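
A status check of this kind might look like the sketch below, which shells out to the docker CLI. The project's container_running may work differently (for example through the Docker SDK); the function name is_container_running and the injectable run parameter are assumptions made so the logic can be exercised without Docker installed.

```python
import subprocess
from typing import Callable, Sequence


def _run_docker(args: Sequence[str]) -> subprocess.CompletedProcess:
    """Run the docker CLI and capture its output."""
    return subprocess.run(
        ["docker", *args], capture_output=True, text=True, timeout=10
    )


def is_container_running(
    name: str,
    run: Callable[[Sequence[str]], subprocess.CompletedProcess] = _run_docker,
) -> bool:
    """Return True only if docker reports the named container as running."""
    try:
        result = run(["inspect", "--format", "{{.State.Running}}", name])
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Docker is not installed or did not answer in time.
        return False
    # A missing container makes docker inspect exit with a non-zero code.
    return result.returncode == 0 and result.stdout.strip() == "true"


# Stand-in for the docker CLI so the example runs anywhere.
def fake_running(args: Sequence[str]) -> subprocess.CompletedProcess:
    return subprocess.CompletedProcess(args, 0, stdout="true\n", stderr="")


print(is_container_running("tor-proxy-1", run=fake_running))
```

Calling is_container_running("tor-proxy-1") without the run argument queries the local Docker installation instead of the stand-in.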