In this section, we will introduce and briefly explain the key terminologies that form the foundation of this course. Understanding these terms is important because they will appear throughout the modules, sections, and exercises. Make sure to review them carefully, as they will help you follow the material with clarity and confidence.
The topics of this section include the following:
- Regular website & normal browsers
- Tor site & Tor browser
- I2P site & I2P browser
- LLMs & AI
- Web scraping
- Headless browsers
- Proxies
- Programming concepts
- Rate-limiting
- Captcha
- Account lockout
- Docker
- IDE (VSCode)
- Linux
- Browser extension
- Databases
- APIs
Regular Website & Normal Browsers
A regular website refers to any site that is publicly accessible over the internet without requiring special software or network configurations. Examples include google.com, wikipedia.org, and news websites. These sites are hosted on the clearnet, the part of the web indexed by search engines. Normal browsers like Chrome, Firefox, Edge, and Safari are used to visit these sites. They support modern web standards (HTML, CSS, JavaScript) and provide graphical interfaces for browsing. No special setup is needed to access regular websites, making them the primary way most people interact with the web.
Tor Site & Tor Browser
A Tor site is a website hosted within the Tor network, usually identified by a .onion domain. These sites are not accessible using normal browsers; you need the Tor Browser, which routes your connection through multiple encrypted nodes to hide your identity and location. Tor sites are often called part of the "dark web" because they are hidden from search engines and require special access. Many Tor sites focus on privacy, while others may host illegal or controversial content. Tor is used for anonymity by both legitimate users and criminals.
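Outside the Tor Browser, a .onion site can also be fetched programmatically by routing traffic through a locally running Tor client. The sketch below is illustrative only: it assumes Tor is listening on its default SOCKS port (9050), that requests is installed with SOCKS support, and it uses a placeholder onion address.

```python
# Minimal sketch: fetching a .onion page through a local Tor SOCKS proxy.
# Assumes the Tor service is running on its default port 9050 and that
# requests has SOCKS support (pip install requests[socks]).
import requests

proxies = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h resolves hostnames inside Tor
    "https": "socks5h://127.0.0.1:9050",
}

# "exampleonionaddress.onion" is a placeholder, not a real hidden service.
response = requests.get("http://exampleonionaddress.onion", proxies=proxies, timeout=60)
print(response.status_code)
```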
I2P Site & I2P Browser
An I2P site is a service hosted on the Invisible Internet Project (I2P) network. Like Tor, I2P provides anonymity, but it works differently by creating peer-to-peer tunnels between users. I2P sites often have .i2p domains and can only be accessed through an I2P-enabled browser or proxy setup. I2P is mainly used for secure internal services rather than accessing the regular internet. In this course, we will not be working with I2P, but it is important to be aware that I2P exists as another privacy-focused network used by both legitimate users and criminals.
LLMs & AI
Large Language Models (LLMs) are advanced AI systems trained on massive text datasets. Examples include GPT, LLaMA, and Claude. These models generate human-like text, answer questions, translate languages, and perform reasoning tasks. LLMs are useful in cyber threat intelligence for analyzing large volumes of unstructured data, spotting patterns, or generating reports. Artificial intelligence (AI) refers more broadly to technologies that allow machines to simulate intelligent behavior. In this course, you will use AI for tasks like detecting Initial Access Broker (IAB) posts, analyzing forum data, and managing threat watchlists.
Web Scraping Libraries
python-requests
A simple Python library for sending HTTP requests. It lets you download web pages and API responses in a controlled way, useful for scraping data from websites that don’t rely heavily on JavaScript.
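A minimal sketch of fetching a page with python-requests (example.com is just a placeholder target):

```python
# Minimal sketch: downloading a page with python-requests.
import requests

response = requests.get("https://example.com", timeout=10)
print(response.status_code)   # HTTP status code, e.g. 200
print(response.text[:200])    # first 200 characters of the returned HTML
```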
BeautifulSoup
A Python library for parsing HTML or XML documents. It helps extract data by navigating the document’s structure, like finding specific tags, attributes, or text on a web page.
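A minimal sketch of parsing HTML with BeautifulSoup, using a small inline document so it runs without any network access:

```python
# Minimal sketch: extracting data from HTML with BeautifulSoup.
# Assumes beautifulsoup4 is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><a href='/post/1'>First post</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)             # "Hello"
print(soup.find("a")["href"])   # "/post/1"
```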
Headless Browsers
A headless browser is a web browser that runs without a visible graphical interface. It behaves like a normal browser by processing JavaScript, loading CSS, rendering pages, and handling cookies, but all of this happens in the background. Headless browsers are controlled through code, allowing developers to automate interactions with websites for purposes like scraping, testing, or monitoring. They are essential when working with JavaScript-heavy or dynamic websites where simple HTTP requests are not enough. Headless browsers can mimic real user behavior, which helps bypass some anti-bot protections.
Puppeteer
Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium in headless or full mode. It allows you to load pages, click buttons, fill forms, take screenshots, and extract content. Puppeteer is popular for its speed, stability, and deep integration with the browser. It can emulate device settings, network conditions, and viewport sizes, making it powerful for scraping modern websites or testing how pages behave under different scenarios.
Link: https://github.com/puppeteer/puppeteer
Playwright
Playwright is a modern automation tool that supports Chromium, Firefox, and WebKit browsers. Like Puppeteer, it can run in headless or headed mode. Playwright is known for handling complex tasks like interacting with multi-page flows, managing multiple browser contexts, and bypassing anti-bot protections. It offers advanced features such as intercepting network requests, simulating user input with realistic delays, and capturing console logs. Playwright is often favored for cross-browser testing and scraping in environments where different browsers need to be supported.
Link: https://github.com/microsoft/playwright
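Playwright also ships official Python bindings. A minimal headless sketch, assuming the package and a Chromium build are installed (pip install playwright, then playwright install chromium):

```python
# Minimal sketch: loading a page in headless Chromium with Playwright's Python API.
# example.com is a placeholder target.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())   # title as rendered by the real browser engine
    browser.close()
```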
Selenium
Selenium is one of the oldest and most widely used browser automation frameworks. It supports many programming languages, including Python, Java, and C#. Selenium can drive headless versions of Chrome, Firefox, and other browsers. It is commonly used in testing, but also works for scraping dynamic sites that require user interactions. Selenium’s flexibility allows integration with testing frameworks, but it can be slower compared to Puppeteer or Playwright for large-scale scraping tasks because of its older architecture.
Link: https://github.com/SeleniumHQ/selenium
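A minimal sketch of driving headless Chrome with Selenium's Python bindings, assuming Selenium 4+ and a compatible Chrome installation:

```python
# Minimal sketch: headless Chrome via Selenium. example.com is a placeholder target.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")   # headless mode flag for recent Chrome versions

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```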
Undetected ChromeDriver
Selenium and Playwright may trigger anti-bot systems because their automation leaves detectable fingerprints in the browser. For a solution that minimizes detection, consider using Undetected ChromeDriver, a modified version of Selenium's ChromeDriver designed to bypass anti-bot mechanisms such as Cloudflare, DataDome, Imperva, and Distil Networks. It does this by patching the ChromeDriver binary to obscure automation fingerprints, such as user agents and JavaScript variables, and by mimicking human-like browser behavior.
Link: https://github.com/UltrafunkAmsterdam/undetected-chromedriver
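A minimal sketch, assuming the undetected-chromedriver package is installed; whether it evades any particular anti-bot vendor is not guaranteed and varies by site:

```python
# Minimal sketch with undetected-chromedriver (pip install undetected-chromedriver).
# example.com is a placeholder target.
import undetected_chromedriver as uc

driver = uc.Chrome()   # patched ChromeDriver with obscured automation fingerprints
driver.get("https://example.com")
print(driver.title)
driver.quit()
```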
Proxies: HTTP, HTTPS, Tor, SOCKS5
Proxies route your internet traffic through another server to hide your real IP address or bypass restrictions.
- HTTP/HTTPS proxies are for standard web traffic.
- SOCKS5 proxies work at a lower network level, supporting any traffic type.
- Tor proxies route traffic through the Tor network for anonymity.
Proxies are essential in scraping to avoid bans, distribute load, or access region-locked content.
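A minimal sketch of routing requests through a proxy with python-requests; the proxy addresses below are placeholders, and the SOCKS5 variant additionally requires requests[socks]:

```python
# Minimal sketch: sending requests through different proxy types.
# The proxy servers and target URL are placeholders.
import requests

http_proxies = {
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
}
socks5_proxies = {
    "http": "socks5h://127.0.0.1:1080",   # SOCKS5; needs pip install requests[socks]
    "https": "socks5h://127.0.0.1:1080",
}

response = requests.get("https://example.com", proxies=http_proxies, timeout=10)
print(response.status_code)
```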
Programming Concepts
Python and JavaScript are the main languages used for scraping and automation.
- Libraries/packages: Reusable sets of code to perform specific tasks.
- Front-end frameworks: Tools like Bootstrap, DaisyUI, TailwindCSS, React, or Vue for building web interfaces.
- Backend frameworks: Tools like Django, Flask, Laravel, or Express for building server-side logic.
- Databases: Tools like PostgreSQL, MySQL, or Firebase for storing data.
- IDE (Integrated Development Environment): Software like VSCode, PyCharm, or Visual Studio, used for writing and managing code.
- Compilers: Programs that turn code into executable instructions.
- Virtual environment: A virtual environment (venv) provides an isolated environment where libraries/packages can be installed without performing a system-wide installation. You will be working with Python virtual environments a lot; a short example follows this list.
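As a small illustration, a virtual environment can even be created from Python itself with the standard-library venv module, although in practice you will usually run python -m venv from the command line:

```python
# Minimal sketch: creating a virtual environment with the standard library.
# "scraper-env" is an arbitrary directory name used for illustration.
import venv

venv.create("scraper-env", with_pip=True)   # creates ./scraper-env with its own pip
```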
Rate-limiting
Rate-limiting is a security measure that restricts how many requests you can send to a server in a set period. For example, a site might only allow 10 requests per minute from one IP address. This prevents abuse, spam, or overloading. Scraping tools must respect or work around rate limits by slowing down requests, using proxies, or distributing load across multiple accounts to avoid detection and bans.
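A minimal sketch of staying under a hypothetical limit of 10 requests per minute by pausing between requests (the URLs are placeholders):

```python
# Minimal sketch: throttling requests to respect a rate limit.
import time
import requests

urls = [f"https://example.com/page/{i}" for i in range(5)]   # placeholder pages

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(6)   # roughly 10 requests per minute from this IP
```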
Captcha
Captcha is a challenge-response system designed to verify that a user is human. It prevents automated bots from performing actions like creating accounts or scraping data. Common captchas involve selecting images, typing distorted text, or checking boxes. For scrapers, captchas are a significant obstacle, and solving them often requires advanced techniques like AI-based optical character recognition (OCR) or third-party solving services.
Account Lockout
Account lockout is a security feature that disables an account after a certain number of failed login attempts or suspicious activities. This protects against brute force attacks and unauthorized access. When scraping or interacting with sites that require logins, you must handle credentials carefully, spread actions across multiple accounts, or use session management to avoid triggering lockouts.
Docker
Docker is a tool that lets you package applications and their environments into isolated containers. This ensures software runs the same way everywhere, regardless of the system it’s on. In this course, Docker is used to set up simulated websites (both clearnet and Tor) so you can practice scraping and analysis without interacting with real-world criminal infrastructure. It makes the environment portable, reproducible, and easy to manage.
To install Docker, follow this guide: https://docs.docker.com/engine/install/
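Containers are usually managed from the command line or with Docker Compose, but as a small illustration the Docker SDK for Python can start one as well. The image, container name, and port mapping below are arbitrary choices for a lab setup, not the course's actual configuration:

```python
# Minimal sketch using the Docker SDK for Python (pip install docker).
# Starts an nginx container as a stand-in for a simulated lab website.
import docker

client = docker.from_env()
container = client.containers.run(
    "nginx:latest",
    name="lab-website",
    ports={"80/tcp": 8080},   # host port 8080 -> container port 80
    detach=True,
)
print(container.status)

container.stop()
container.remove()
```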
IDE: VSCode
An Integrated Development Environment (IDE) is software for writing, editing, and managing code. VSCode is a popular IDE that supports multiple programming languages, provides debugging tools, integrates with version control (like Git), and includes extensions for working with Python, JavaScript, Docker, and more. You will use VSCode to build, test, and manage your code in this course.
Linux
Linux is an open-source operating system widely used in cybersecurity, development, and server environments. It offers powerful command-line tools, high stability, and flexibility for automation and scripting. In this course, you will use Linux as part of your lab environment to run scrapers, manage Docker containers, and handle data processing tasks. Ubuntu is the preferred Linux distribution for this course.
You can download any of several Linux distributions, such as Ubuntu Desktop or Xubuntu.
Browser Extension
A browser extension is a small software module that adds features to a web browser. Extensions can modify page content, automate tasks, or interact with APIs. In scraping and automation, custom extensions can help with data extraction, bypassing restrictions, or managing sessions.
Databases
Databases store, organize, and manage data for easy access and analysis.
- MongoDB is a NoSQL database that stores data in flexible, JSON-like documents, ideal for unstructured or semi-structured data like scraped posts (see the sketch after this list).
- PostgreSQL is a relational database that stores data in structured tables with strict schemas, suitable for complex queries and relationships.
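A minimal sketch of storing a scraped post in MongoDB with pymongo; the connection string, database, and collection names are placeholders for a local lab instance:

```python
# Minimal sketch: inserting a scraped post into MongoDB (pip install pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # placeholder local instance
collection = client["threat_intel"]["forum_posts"]     # placeholder db/collection names

post = {"author": "example_user", "title": "Example post", "source": "lab-forum"}
result = collection.insert_one(post)
print(result.inserted_id)   # MongoDB-assigned document ID
```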
APIs
An Application Programming Interface (API) allows software to communicate with other software. In scraping and automation, APIs provide a structured way to request and receive data without parsing HTML. Many sites offer public or private APIs for data access, and understanding how to interact with them can make data gathering faster and more reliable.
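A minimal sketch of querying a JSON API with python-requests; the endpoint, parameters, and response structure are hypothetical, purely for illustration:

```python
# Minimal sketch: requesting structured data from a JSON API.
# The endpoint and its parameters are hypothetical.
import requests

response = requests.get(
    "https://api.example.com/v1/posts",                    # hypothetical endpoint
    params={"query": "initial access", "limit": 10},       # hypothetical parameters
    headers={"Accept": "application/json"},
    timeout=10,
)
response.raise_for_status()

data = response.json()                      # parsed JSON instead of raw HTML
print(len(data.get("results", [])))         # assumes a "results" list in the reply
```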