In this section, we will explore the components involved in building a successful threat intelligence platform. As noted earlier, this course will involve testing data scraping techniques on fictional sites that you can set up locally using Docker. Here, you will gain insight into the types of sites we will be working with.

The topics covered in this section include:

  1. Clearnet forum
  2. Tor forum
  3. Non-LLM approach to text analysis
  4. Handling anti-scraping technologies
  5. Intelligence watchlists

Clearnet Forum

Although most cybercrime websites do not rely heavily on JavaScript, it is still important to understand how to scrape data from clearnet websites. Clearnet sites often fetch data from the backend using JavaScript and then update the frontend dynamically, so when you view the page source you may find no usable data, as the content is loaded after the initial page load. A good example is the dummy clearnet forum used in this course, where nearly all content is dynamic and constantly changing.

In such cases, we use browser automation tools like Playwright to drive a headless browser, interact with the page, and extract data from the DOM (Document Object Model) after it has fully loaded and rendered.
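As a minimal sketch of that pattern (the URL and the .post selector below are placeholders, not the actual forum's markup), the idea is to wait for the dynamically loaded elements to appear and then read them from the rendered DOM:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("http://localhost:5000/forum")   # placeholder URL for a locally hosted forum
    page.wait_for_selector(".post")            # wait until JavaScript has rendered the posts
    for post in page.locator(".post").all_text_contents():
        print(post)
    browser.close()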

To practice this, you can download the dummy clearnet cybercrime forum created for this course from GitHub:

https://github.com/CyberMounties/clearnet_forum

The repository contains complete setup instructions for running the forum locally.


Tor Forum

Much of the cybercrime activity we care about occurs on Tor due to its privacy-centric architecture. Our simulated Tor forum is designed without JavaScript to mimic real-world scenarios while maintaining simplicity. Unlike the clearnet forum, which includes a shoutbox (general chat) feature, the Tor forum does not.
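Because the forum serves plain HTML, a headless browser is not needed; routing simple HTTP requests through the Tor SOCKS proxy is enough. Below is a minimal sketch, assuming the Tor daemon is running locally on its default port 9050 and using a placeholder .onion address (it requires requests with SOCKS support and BeautifulSoup, e.g. pip install requests[socks] beautifulsoup4):

import requests
from bs4 import BeautifulSoup

# Route traffic through the local Tor SOCKS proxy; socks5h makes Tor resolve
# .onion hostnames instead of the local resolver.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

resp = requests.get("http://exampleforumaddress.onion/", proxies=proxies, timeout=60)
soup = BeautifulSoup(resp.text, "html.parser")

for post in soup.select(".post"):   # placeholder selector
    print(post.get_text(strip=True))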

You can set up the dummy Tor forum for this course by downloading it from GitHub:

https://github.com/CyberMounties/tornet_forum

The repository includes detailed instructions for configuring and running the forum locally.


Non-LLM Approach to Text Analysis

Instead of relying on Large Language Models (LLMs) or AI, you can use traditional Natural Language Processing (NLP) libraries like spaCy to analyze data. These tools can help with tasks such as text classification, entity recognition, and pattern matching. However, they come with limitations: they typically require more manual customization, rule creation, and training to achieve results comparable to LLMs. While they can be effective for certain use cases, setting them up for complex tasks like identifying Initial Access Broker (IAB) activity may take significantly more time and effort.

Example with spaCy

Create a directory, set up a virtual environment, and install dependencies:

mkdir non_llm_cti && cd non_llm_cti
python3 -m venv venv
source venv/bin/activate
pip install spacy
python -m spacy download en_core_web_sm

Create a script file named main.py and paste the following code inside it:

import spacy
import re
from typing import Tuple

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

# Define keywords and patterns for initial access detection
ACCESS_KEYWORDS = {
    "access", "vpn", "rdp", "network", "server", "admin", "credentials", 
    "login", "remote", "domain", "shell", "backdoor"
}
SALE_KEYWORDS = {"sell", "selling", "offer", "offering", "sale", "available", "price", "btc", "escrow"}
NEGATIVE_KEYWORDS = {"hosting", "vps", "software", "malware", "loader", "botnet"}
OBFUSCATION_PATTERNS = [
    r"acc[e3]ss",  # Handles "acc3ss", "access"
    r"v[pP][nN]",  # Handles "VPN", "VpN", "vpn"
    r"rd[pP]",     # Handles "RDP", "rdp"
]

def preprocess_text(text: str) -> str:
    """Normalize text by converting to lowercase and handling basic obfuscation."""
    text = text.lower()
    for pattern in OBFUSCATION_PATTERNS:
        match = re.search(pattern, text)
        if match:
            # Replace obfuscated matches (e.g., "acc3ss") with their normalized form
            text = re.sub(pattern, match.group().replace("3", "e"), text)
    return text

def analyze_post(text: str) -> Tuple[str, float]:
    """
    Analyze a post to determine if it offers initial access.
    Returns: (classification, confidence_score)
    Classifications: 'positive' (initial access sale), 'neutral' (general ad), 'negative' (unrelated)
    """
    # Preprocess text
    text = preprocess_text(text)
    doc = nlp(text)

    # Initialize scores
    access_score = 0.0
    sale_score = 0.0
    negative_score = 0.0

    # Rule 1: Check for access-related keywords
    for token in doc:
        if token.lemma_ in ACCESS_KEYWORDS:
            access_score += 0.4
        if token.lemma_ in SALE_KEYWORDS:
            sale_score += 0.3
        if token.lemma_ in NEGATIVE_KEYWORDS:
            negative_score += 0.5

    # Rule 2: Dependency parsing for sales intent (e.g., "selling access")
    for token in doc:
        if token.lemma_ in SALE_KEYWORDS and token.head.lemma_ in ACCESS_KEYWORDS:
            access_score += 0.3
            sale_score += 0.3
        elif token.lemma_ in ACCESS_KEYWORDS and token.head.lemma_ in SALE_KEYWORDS:
            access_score += 0.3
            sale_score += 0.3

    # Rule 3: Named entities (e.g., "RBC Bank", "EU telecom") for context
    for ent in doc.ents:
        if ent.label_ in ["ORG", "GPE"]:
            access_score += 0.2

    # Rule 4: Regex for specific patterns (e.g., "access to [entity]")
    access_pattern = r"\b(access|vpn|rdp)\b.*\b(to|for|at)\b.*\b(bank|telecom|network|company|org)\b"
    if re.search(access_pattern, text, re.IGNORECASE):
        access_score += 0.3

    # Rule 5: Cryptocurrency or escrow mentions
    if any(word in text for word in ["btc", "bitcoin", "escrow", "monero"]):
        sale_score += 0.2

    # Calculate total score and determine classification
    total_score = access_score + sale_score - negative_score
    if access_score >= 0.4 and sale_score >= 0.3 and total_score > 0.5:
        classification = "positive"
        confidence = min(total_score, 1.0)
    elif negative_score > access_score and negative_score > sale_score:
        classification = "negative"
        confidence = min(negative_score, 1.0)
    else:
        classification = "neutral"
        confidence = min(max(access_score, sale_score, negative_score), 1.0)

    return classification, confidence

def main():
    # Test cases
    test_posts = [
        # Positive: Initial access sale with details
        """Back again, mods nuked my last thread. Selling VPN access to a big EU telecom, one of them household names in Germany (you figure it out). Full internal access, you can pivot to servers, grab call logs, or mess with billing systems. Been stable for a month, no kicks. Asking $4,000 BTC, negotiable for bulk buyers (got 2 more telecoms). I'll throw in logs or a test connection for proof, but only if you're serious. Escrow or MM accepted, I'm vouched here. Hit me up in PM, let's make this quick. No skids, please.""",
        # Positive: Direct initial access sale
        "Initial access to RBC Bank available, full admin creds, $5k BTC, DM for details.",
        # Neutral: General ad, not access-related
        "Selling malware loader, high quality, $200, contact me for demo.",
        # Negative: Unrelated service
        "Offering VPS hosting, 16GB RAM, $50/month, legit service."
    ]

    for i, post in enumerate(test_posts, 1):
        classification, confidence = analyze_post(post)
        print(f"\nPost {i}:")
        print(f"Text: {post[:100]}..." if len(post) > 100 else f"Text: {post}")
        print(f"Classification: {classification.capitalize()}")
        print(f"Confidence: {confidence:.2f}")

if __name__ == "__main__":
    main()

Run it and see the results for yourself:

$ python3 main.py

Post 1:
Text: Back again, mods nuked my last thread. Selling VPN access to a big EU telecom, one of them household...
Classification: Positive
Confidence: 1.00

Post 2:
Text: Initial access to RBC Bank available, full admin creds, $5k BTC, DM for details.
Classification: Positive
Confidence: 1.00

Post 3:
Text: Selling malware loader, high quality, $200, contact me for demo.
Classification: Negative
Confidence: 1.00

Post 4:
Text: Offering VPS hosting, 16GB RAM, $50/month, legit service.
Classification: Negative
Confidence: 1.00

As demonstrated, this approach is less reliable: it fails to classify the third post as neutral, labeling it negative instead. It demands extensive customization and fine-tuning, which in this instance outweighs the benefits, given the availability of simpler, more adaptable solutions such as existing AI APIs.
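For comparison, the sketch below shows roughly what the AI-API route can look like, using the OpenAI Python client as one example; the model name and prompt are placeholders rather than the course's actual setup:

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

post = "Selling VPN access to a big EU telecom, $4,000 BTC, escrow accepted."

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    messages=[
        {
            "role": "system",
            "content": "Classify the forum post as positive (initial access sale), "
                       "neutral (general ad), or negative (unrelated). Reply with one word.",
        },
        {"role": "user", "content": post},
    ],
)

print(response.choices[0].message.content)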


Anti-Scraping Technologies

Extracting data from websites is straightforward at a small scale, but it becomes complex when sites deploy anti-scraping technologies to block automated access. Many websites enforce terms of service that prohibit scraping, and developers implement protective measures to uphold these restrictions.

Common anti-scraping techniques include:

  • CAPTCHAs: Require users to solve puzzles or image-based challenges to verify they are not bots.
  • Rate-limiting: Limits the number of requests from a single IP address or account within a specific timeframe.
  • Account lockouts: Temporarily or permanently suspend accounts after detecting suspicious activity or excessive failed login attempts.
  • IP bans: Block traffic from IP addresses exhibiting automated patterns or abusive behavior.
  • Header and behavior checks: Detect requests that deviate from typical browser behavior, such as missing headers or unrealistic interaction patterns.
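On the scraper's side, rate limits and basic header checks are typically addressed by pacing requests and sending realistic browser headers. The sketch below illustrates this with the requests library; the URLs and headers are placeholders:

import random
import time
import requests

# Browser-like headers reduce the chance of tripping simple header checks
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

urls = [f"http://localhost:5000/forum/page/{i}" for i in range(1, 4)]  # placeholder pages

for url in urls:
    resp = requests.get(url, headers=HEADERS, timeout=30)
    print(url, resp.status_code)
    time.sleep(random.uniform(3, 8))  # randomized pause to stay under rate limits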

Tools like Playwright can help circumvent some anti-bot defenses by simulating a real browser, but that alone is often not enough; you may also want to use a stealth-oriented driver such as undetected-chromedriver.

Although this course does not address bypassing all advanced anti-scraping techniques, the example below illustrates how to use Playwright to simulate human-like typing speeds.

To get started, set up an environment and install dependencies:

mkdir play && cd play
touch play.py
python3 -m venv venv
source venv/bin/activate
sudo apt install libavif16
pip3 install playwright
playwright install

Open play.py and enter this code:

from playwright.sync_api import sync_playwright
import time
import os

def fill_input_slowly():
    text_to_type = "This is a test string"
    local_html = """
    <!DOCTYPE html>
    <html>
    <body>
        <form>
            <input type="text" id="fname" name="fname" value="Hamy">
        </form>
    </body>
    </html>
    """
    
    # Write the HTML to a temporary file
    html_file_path = "temp.html"
    with open(html_file_path, "w") as f:
        f.write(local_html)
    
    with sync_playwright() as p:
        # Set headless to True if you don't want to see the browser
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        
        try:
            page.goto(f"file://{os.path.abspath(html_file_path)}", timeout=60000)
            input_field = page.locator("input#fname[name='fname']")
            input_field.wait_for(timeout=10000) 
            input_field.clear()

            # Type the whole string with a per-keystroke delay (in ms) to mimic human typing
            input_field.type(text_to_type, delay=750)
            time.sleep(2)
            
        except Exception as e:
            print(f"Error occurred: {e}")
        
        finally:
            browser.close()
            if os.path.exists(html_file_path):
                os.remove(html_file_path)

if __name__ == "__main__":
    fill_input_slowly()

Intelligence Watchlists

One of the key features that sets this course apart from most CTI training is that you will learn how to create realistic intelligence watchlists. These watchlists function like an automated surveillance system, continuously monitoring a user’s activity over time. Instead of just collecting snapshots of data, you will build tools that can track changes and actions as they happen.

The main purpose of these watchlists is to demonstrate how to cross-reference a threat actor’s activities across multiple platforms, helping you build a complete picture of their behavior. This is essential for tracking prolific threat actors, correlating data from different sources, and understanding patterns that may signal malicious intent.

To create a watchlist profile, you will define:

  • Link: The identifier of the individual or entity you want to monitor.
  • Frequency: How often the system checks for new activity. Options include:
    • Low: Every 24 hours
    • Medium: Every 12 hours
    • High: Every 6 hours
    • Very High: Every 1 hour
    • Critical: Every 5 minutes
  • Priority: The depth of information gathered. Options include:
    • Everything: All available data (posts & comments)
    • Posts only: Just the content posted by the user
    • Comments only: Only comments made by the user

For example, a watchlist with Low frequency checks for activity every 24 hours and scrapes the data specified by its priority setting. Results can be exported in JSON format for easy downloading and cross-referencing with other platforms.
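As a rough illustration of how such a profile might be represented, the sketch below uses a plain Python dictionary; the field names, the example link, and the JSON layout are illustrative, not a prescribed schema:

import json

# Frequency labels mapped to check intervals, in hours
FREQUENCY_HOURS = {
    "low": 24,
    "medium": 12,
    "high": 6,
    "very_high": 1,
    "critical": 5 / 60,   # 5 minutes
}

# Illustrative watchlist profile; the link is a placeholder
profile = {
    "link": "http://localhost:5000/user/example_actor",
    "frequency": "low",
    "priority": "everything",   # or "posts_only" / "comments_only"
}

print(f"Check every {FREQUENCY_HOURS[profile['frequency']]} hours")
print(json.dumps(profile, indent=2))   # export for cross-referencing with other platforms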