In this section, we will preview a few topics that you will be introduced to in the upcoming sections. It is important that you read through everything here, because it will help you build a high-level picture of the course material.

The topics of this section include the following:

  1. Simulation sites on multiple nets
  2. Artificial Intelligence Utilization: prompting, RAG & fine-tuning
  3. Advanced web scraping
  4. Terminology

Simulation Sites on Multiple Nets

In this course, you will work with two demo cybercrime marketplaces built for training purposes. We cannot scrape or test real cybercrime sites like darkforums.st, which you were introduced to earlier, because doing so may create legal risks in many jurisdictions. To ensure a safe and lawful learning environment, we have created two simulation sites:

  • One hosted on the clearnet (also called the normal or public web)
  • One hosted on the Tor network

These simulation sites are designed to mimic real cybercrime forums, complete with human-like activity, posts, and comments. This gives you the closest possible experience to practicing cybercrime identification and data scraping in a controlled, legal setting.

Clearnet vs Tor: A brief technical explanation

  • Clearnet refers to the publicly accessible part of the internet that you use every day. Sites like google.com or wikipedia.org are clearnet sites. They can be accessed directly through standard browsers (Chrome, Firefox, Edge) using DNS and without any special configuration.
  • Tor network is an anonymity network that routes traffic through multiple nodes to hide the user’s location and identity. Tor sites (sometimes called “dark web sites”) use .onion domains and can only be accessed through a Tor-enabled browser or a Tor proxy. These sites are not indexed by traditional search engines and are often used to host hidden or privacy-focused services (see the sketch after this list for what a Tor proxy connection looks like).
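
To make the Tor side more concrete, here is a minimal sketch of fetching a page over Tor from Python. It assumes a local Tor service is listening on its default SOCKS port (9050) and that the requests library is installed with SOCKS support; the .onion address is a placeholder, not a real site.

```python
# Minimal sketch: fetching a page over Tor from Python.
# Assumes the Tor service is running locally on its default SOCKS port (9050)
# and that requests is installed with SOCKS support:
#   pip install requests[socks]
import requests

# Placeholder address -- a real .onion URL would go here.
ONION_URL = "http://exampleonionaddress.onion/"

# socks5h (note the "h") makes Tor resolve the hostname itself, which is
# required for .onion domains; plain socks5 resolves DNS locally and fails.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

response = requests.get(ONION_URL, proxies=proxies, timeout=60)
print(response.status_code)
```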

In this course, we will refer to normal websites as clearnet sites, and Tor-hosted sites as tornet sites.

This setup allows you to safely build and test your threat intelligence skills without interacting with real criminal infrastructure.

In Module One, you will learn all the key terminology needed to understand the topics, tools, and infrastructure we will work with throughout the course. In later modules, you will be introduced to the Tor and clearnet simulation sites and will set them up locally using Docker. If you are not yet familiar with Docker or with what a "local" website is, don’t worry; you will learn about those as part of the process.


Artificial Intelligence Utilization: Prompting, RAG & Fine-tuning

In this course, you will learn how artificial intelligence (AI) can assist us in cyber threat intelligence tasks such as identifying suspicious activity or detecting Initial Access Broker (IAB) posts. We will cover three main AI approaches: prompting, RAG (Retrieval-Augmented Generation), and fine-tuning.

Prompting

Prompting means using a pre-trained AI model by giving it carefully written instructions (prompts) to get the result you want. No extra data or changes to the model are needed. The quality of the prompt influences the quality of the result, and the outcome can also depend on the reasoning capabilities of the model you are prompting.

Example:
You give an AI model like OpenAI’s GPT this prompt:

Given this post, tell me if it offers initial access to a network: 

`Back again, mods nuked my last thread. Selling VPN access to a big EU telecom, one of them household names in Germany (you figure it out). Full internal access, you can pivot to servers, grab call logs, or mess with billing systems. Been stable for a month, no kicks. Asking $4,000 BTC, negotiable for bulk buyers (got 2 more telecoms). I’ll throw in logs or a test connection for proof, but only if you’re serious. Escrow or MM accepted, I’m vouched here. Hit me up in PM, let’s make this quick. No skids, please.`

The model analyzes the text and answers based on its existing knowledge.
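
As a rough sketch of what this looks like in code, the snippet below sends that prompt through the official OpenAI Python library. The model name and the exact wording of the instruction are illustrative assumptions, not a fixed part of the course setup.

```python
# Minimal sketch: classifying a forum post with a single prompt.
# Assumes the official OpenAI Python library (pip install openai) and an
# OPENAI_API_KEY environment variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

post = (
    "Back again, mods nuked my last thread. Selling VPN access to a big "
    "EU telecom... Asking $4,000 BTC, negotiable for bulk buyers."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {
            "role": "user",
            "content": (
                "Given this post, tell me if it offers initial access "
                f"to a network:\n\n{post}"
            ),
        }
    ],
)

print(response.choices[0].message.content)
```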

RAG (Retrieval-Augmented Generation) in Cyber Threat Intelligence

RAG is a technique that combines a language model with an external knowledge base. It works by retrieving relevant information from that knowledge base when you ask a question or provide a prompt. This retrieved information is then passed along with your prompt to the language model, allowing it to give a more informed and accurate response.

In cyber threat intelligence, RAG can be used to index and store labeled examples of cybercrime-related posts (positive, negative, and neutral posts, for instance) and make those examples available to the model at prompting time.

Example use in CTI

  • The system indexes labeled posts:

    • Positive posts: Direct sale of unauthorized access (e.g. Initial access to RBC Bank available).
    • Neutral posts: General ads not tied to access sales (e.g. Selling malware loader).
    • Negative posts: Unrelated or off-topic services (e.g. Offering VPS hosting).

When you input a suspected IAB post, the RAG system retrieves similar examples from the knowledge base. The language model then considers those examples while generating its answer. This helps the model provide better assessments of whether a post signals IAB activity.
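
To illustrate the mechanics, here is a minimal RAG sketch under stated assumptions: the three labeled posts are the toy examples above, similarity is plain cosine similarity over embeddings, and the embedding model name is an illustrative choice rather than a course requirement.

```python
# Minimal RAG sketch: retrieve the labeled posts most similar to a new
# post and include them in the prompt. Toy data; the embedding model
# name is an illustrative assumption.
import numpy as np
from openai import OpenAI

client = OpenAI()

labeled_posts = [
    ("positive", "Initial access to RBC Bank available"),
    ("neutral", "Selling malware loader"),
    ("negative", "Offering VPS hosting"),
]

def embed(text: str) -> np.ndarray:
    """Turn text into a vector using an embedding model."""
    result = client.embeddings.create(
        model="text-embedding-3-small",  # illustrative choice
        input=text,
    )
    return np.array(result.data[0].embedding)

# Index step: embed every labeled example once, up front.
index = [(label, post, embed(post)) for label, post in labeled_posts]

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k labeled posts closest to the query (cosine similarity)."""
    q = embed(query)
    scored = sorted(
        index,
        key=lambda item: np.dot(q, item[2])
        / (np.linalg.norm(q) * np.linalg.norm(item[2])),
        reverse=True,
    )
    return [(label, post) for label, post, _ in scored[:k]]

suspect = "Selling full VPN access to an EU telecom, $4,000 BTC"
examples = "\n".join(f"[{label}] {post}" for label, post in retrieve(suspect))
prompt = (
    f"Here are labeled examples of forum posts:\n{examples}\n\n"
    f"Classify this post as an IAB offer or not:\n{suspect}"
)
# `prompt` would then be sent to the language model as in the earlier sketch.
print(prompt)
```

In a real pipeline the index would live in a vector database and hold many more labeled posts, but the retrieve-then-prompt flow stays the same.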

Fine-tuning

While RAG retrieves external data at the time of prompting, fine-tuning is different. In fine-tuning, the model is trained on a custom dataset (such as labeled posts) so that the knowledge is built directly into the model’s parameters. The model "learns" the patterns in the data during the fine-tuning process and no longer needs to retrieve examples at prompt time; it simply applies what it learned during training.
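
For contrast, a fine-tuning workflow typically starts from a file of labeled examples rather than a retrieval index. The sketch below writes toy training data in the chat-style JSONL format used by OpenAI's fine-tuning service; the posts and labels are illustrative.

```python
# Minimal sketch: preparing labeled posts as fine-tuning data.
# Uses the chat-style JSONL format expected by OpenAI's fine-tuning
# service; the posts and labels are toy data.
import json

labeled_posts = [
    ("Initial access to RBC Bank available", "IAB"),
    ("Selling malware loader", "not IAB"),
    ("Offering VPS hosting", "not IAB"),
]

with open("train.jsonl", "w") as f:
    for post, label in labeled_posts:
        record = {
            "messages": [
                {"role": "user", "content": f"Is this an IAB post? {post}"},
                {"role": "assistant", "content": label},
            ]
        }
        f.write(json.dumps(record) + "\n")
```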

Both approaches help tailor a model to specific tasks, but:

  • RAG is typically used for indexing and retrieving external information to assist the model during inference (response generation).
  • Fine-tuning adjusts the model itself so it can apply the new knowledge without needing external retrieval.

In this course, we will only use prompting, because it is the easiest approach to get started with.


Advanced web scraping

Web scraping is the process of automatically extracting data from websites. Instead of manually copying and pasting information from web pages, scraping uses software (called scrapers) to collect and structure data efficiently and at scale.

Scraping can target many types of information, such as:

  • text from posts, comments, or articles
  • product listings or prices
  • images or links
  • metadata like timestamps or usernames

Web scraping is usually straightforward at a small scale. The challenge begins when you need to scrape at an industrial scale: for example, gathering all of a site's data, from the day it launched to the present moment.

Imagine a site that has been live since January 19th, 2021. How can you reliably scrape every post from that date until today without missing posts that are published while your scraper is running? And what about posts that appear after your scraper finishes its scan?

The scraping techniques we will cover in this course go far beyond what is typically shown in beginner tutorials or YouTube videos. The sites you will work with in our simulations have auto-populating features, continuously generating new posts. This creates challenges for traditional scraping methods, which you will learn how to overcome.

As you progress through the course, you will see that we use two types of bots. One group focuses entirely on collecting post titles and links, systematically moving between pages so that every piece of content, old or new, is captured. The other group is responsible for scraping the data from each individual post.
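
As a rough preview of that division of labor, the sketch below shows the two roles side by side. The base URL, pagination scheme, and CSS selectors are placeholders; the simulation sites' real structure is covered in later modules.

```python
# Rough preview of the two-bot split. The base URL, pagination scheme,
# and CSS selectors are placeholders -- the real simulation sites are
# covered in later modules.
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://localhost:8080"  # hypothetical local simulation site

def collect_links(max_pages: int = 3) -> list[str]:
    """Bot type 1: walk the listing pages and gather post links."""
    links = []
    for page in range(1, max_pages + 1):
        html = requests.get(f"{BASE_URL}/forum?page={page}", timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.select("a.post-title"):  # placeholder selector
            links.append(BASE_URL + a["href"])
    return links

def scrape_post(url: str) -> dict:
    """Bot type 2: fetch one post and extract its fields."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "title": soup.select_one("h1").get_text(strip=True),
        "body": soup.select_one("div.post-body").get_text(strip=True),
    }

for link in collect_links():
    print(scrape_post(link))
```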


Terminology

Throughout this course, you may come across terms that are new or unfamiliar. For this reason, we have included a dedicated section covering all the key terminology used in the course. It is important that you don’t skip this part, even if you consider yourself a subject matter expert.

There is always something new to learn, and there is no need to feel intimidated by terms like “industrial web scraping”; these concepts will become clear as you progress.

We will explain everything you need to know in Module 1 to prepare you for what’s ahead.