In this section, we will take the first steps toward mastering web scraping. What I love most about it is how engaging data collection can be, and overcoming challenges like rate limiting makes it even more exciting.

I’m confident you’ll breeze through this section, especially with your AI assistants ready to clarify anything I might not explain clearly enough.

The topics of this section include the following:

  1. Language identification & data translation
  2. Clearnet forum data extraction
  3. Tornet forum data extraction

Language identification & data translation

Our simulated cyber-crime sites feature posts in various languages, including Chinese, Russian, and Korean. Therefore, it’s essential to identify the language of a text and translate it into English if it’s not already in English.

To begin with data translation, you can use DeepL’s free API plan:

https://www.deepl.com/en/pro#developer

For scaling up and translating larger datasets, you will need to purchase one of their subscription plans.

Access your API key via the API keys page:

https://www.deepl.com/en/your-account/keys

Below is a Python program demonstrating how to identify the language of five posts and translate non-English posts into English. First, set up a Python development environment:

mkdir deepl_test && cd deepl_test
python3 -m venv venv
source venv/bin/activate
pip install deepl
touch deepl_en.py

Enter the following code in deepl_en.py:

import deepl, json


# Specify your API key
auth_key = "xxxxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxx:fx"
translator = deepl.Translator(auth_key)

# Sellers posts
sellers_posts = [
                {
                "title": "AspenWell 生物科技有限公司 访问",
                "description": "公司:AspenWell Biotech Ltd\n地点:荷兰\n收入:$1.3 亿美元\n访问类型:VPN > 内部 RDP > DA\n额外信息:SharePoint 暴露,部署了 RMM 代理\n安全:内部无防病毒软件,未触发日志\n价格:1.05 BTC,支持 escrow 支付(需与卖家协商)\n仅向一名买家出售。",
                "price": "1.05 BTC"
            },
            {
                "title": "Silverline Capital Partners Access",
                "description": "Company: Silverline Capital Partners (NY Private Equity firm)\nRevenue: approx $500M AUM\nAccess: RDP into domain w/ admin rights\nDetected: not yet, 3 days stable\nBonus: Access to doc mgmt system + Zoom internal creds\nSelling for 0.9 BTC – will include persistence script\nDM to deal",
                "price": "0.9 BTC"
            },
            {
                "title": "베리타스 헬스 파트너스 액세스",
                "description": "판매 중: Veritas Health Partners, 캐나다\n매출: 약 $430M\n산업: 의료\n접근 유형: VPN + 작동 중인 RDP\n권한: 로컬 관리자 권한 및 내부 공유 액세스\n탐지되지 않음 – 6일 연속 가동 중\n가격: 0.58 BTC 고정 가격",
                "price": "0.58 BTC"
            },
            {
                "title": "Portwell Maritime Group Access",
                "description": "Selling access to Portwell Maritime Group (Singapore)\nAccess: DA creds via exposed RDP svc acct\nRevenue: Est. $2B\nIndustry: Shipping / Transport\nExtras: payroll exports + crew schedules\nContact: DM if you’ve dealt before. Fast buyers only.",
                "price": "DM for price"
            },
            {
                "title": "Доступ к недвижимости Monarch",
                "description": "Компания: Monarch Real Estate Investments\nМестонахождение: США, Нью-Йорк\nВыручка: 380 млн долларов\nДоступ: Citrix > внутренний RDP > DA\nНабор постоянных файлов, нетронутый более 4 дней\nДоступны резервные копии документов + договоры аренды клиентов\nЦена: 0,7 BTC, только для серьезных людей",
                "price": "0.7 BTC"
            }]


results = []

for post in sellers_posts:
    # Detect language for title
    title_detection = translator.translate_text(
        post["title"], target_lang="EN-US", source_lang=None
    )
    title_lang = title_detection.detected_source_lang
    title_translated = title_lang != "EN"
    title_text = (
        title_detection.text if title_translated else post["title"]
    )

    # Detect language for description
    description_detection = translator.translate_text(
        post["description"], target_lang="EN-US", source_lang=None
    )
    description_lang = description_detection.detected_source_lang
    description_translated = description_lang != "EN"
    description_text = (
        description_detection.text if description_translated else post["description"]
    )

    # Build result dictionary for the post
    result = {
        "original_title": {
            "text": post["title"],
            "language": title_lang,
            "translated": title_translated,
            "translated_text": title_text if title_translated else None
        },
        "original_description": {
            "text": post["description"],
            "language": description_lang,
            "translated": description_translated,
            "translated_text": description_text if description_translated else None
        },
        "price": post["price"]
    }
    results.append(result)

# Print beautified JSON
print(json.dumps(results, indent=2, ensure_ascii=False))

Run the code, and the JSON output should look like the following:

[
  {
    "original_title": {
      "text": "AspenWell 生物科技有限公司 访问",
      "language": "ZH",
      "translated": true,
      "translated_text": "AspenWell Biotechnology Limited Visit"
    },
    "original_description": {
      "text": "公司:AspenWell Biotech Ltd\n地点:荷兰\n收入:$1.3 亿美元\n访问类型:VPN > 内部 RDP > DA\n额外信息:SharePoint 暴露,部署了 RMM 代理\n安全:内部无防病毒软件,未触发日志\n价格:1.05 BTC,支持 escrow 支付(需与卖家协商)\n仅向一名买家出售。",
      "language": "ZH",
      "translated": true,
      "translated_text": "Company: AspenWell Biotech Ltd\nLocation: Netherlands\nRevenue: $130 million\nAccess Type: VPN > Internal RDP > DA\nAdditional Information: SharePoint exposed, RMM proxy deployed\nSecurity: No internal anti-virus software, no logs triggered\nPrice: 1.05 BTC, escrow payments supported (subject to negotiation with seller)\nSold to one buyer only."
    },
    "price": "1.05 BTC"
  },
  {
    "original_title": {
      "text": "Silverline Capital Partners Access",
      "language": "EN",
      "translated": false,
      "translated_text": null
    },
    "original_description": {
      "text": "Company: Silverline Capital Partners (NY Private Equity firm)\nRevenue: approx $500M AUM\nAccess: RDP into domain w/ admin rights\nDetected: not yet, 3 days stable\nBonus: Access to doc mgmt system + Zoom internal creds\nSelling for 0.9 BTC – will include persistence script\nDM to deal",
      "language": "EN",
      "translated": false,
      "translated_text": null
    },
    "price": "0.9 BTC"
  },
  {
    "original_title": {
      "text": "베리타스 헬스 파트너스 액세스",
      "language": "KO",
      "translated": true,
      "translated_text": "Veritas Health Partners Access"
    },
    "original_description": {
      "text": "판매 중: Veritas Health Partners, 캐나다\n매출: 약 $430M\n산업: 의료\n접근 유형: VPN + 작동 중인 RDP\n권한: 로컬 관리자 권한 및 내부 공유 액세스\n탐지되지 않음 – 6일 연속 가동 중\n가격: 0.58 BTC 고정 가격",
      "language": "KO",
      "translated": true,
      "translated_text": "Sold by: Veritas Health Partners, Canada\nRevenue: Approximately $430M\nIndustry: Healthcare\nAccess type: VPN + working RDP\nPermissions: Local administrator privileges and access to internal shares\nUndetected - up and running for 6 days straight\nPrice: 0.58 BTC fixed price"
    },
    "price": "0.58 BTC"
  },
  {
    "original_title": {
      "text": "Portwell Maritime Group Access",
      "language": "EN",
      "translated": false,
      "translated_text": null
    },
    "original_description": {
      "text": "Selling access to Portwell Maritime Group (Singapore)\nAccess: DA creds via exposed RDP svc acct\nRevenue: Est. $2B\nIndustry: Shipping / Transport\nExtras: payroll exports + crew schedules\nContact: DM if you’ve dealt before. Fast buyers only.",
      "language": "EN",
      "translated": false,
      "translated_text": null
    },
    "price": "DM for price"
  },
  {
    "original_title": {
      "text": "Доступ к недвижимости Monarch",
      "language": "RU",
      "translated": true,
      "translated_text": "Access to Monarch real estate"
    },
    "original_description": {
      "text": "Компания: Monarch Real Estate Investments\nМестонахождение: США, Нью-Йорк\nВыручка: 380 млн долларов\nДоступ: Citrix > внутренний RDP > DA\nНабор постоянных файлов, нетронутый более 4 дней\nДоступны резервные копии документов + договоры аренды клиентов\nЦена: 0,7 BTC, только для серьезных людей",
      "language": "RU",
      "translated": true,
      "translated_text": "Company: Monarch Real Estate Investments\nLocation: USA, New York\nRevenue: $380 million\nAccess: Citrix > internal RDP > DA\nSet of permanent files, untouched for more than 4 days\nBackup copies of documents + client leases available\nPrice: 0.7 BTC, for serious people only"
    },
    "price": "0.7 BTC"
  }
]

Notice how we preserve the original text for English posts (indicated by "translated": false)? This is because when the detected language is English, translation is unnecessary. For non-English posts, we translate them into English.
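
As a small example of working with this structure, you could append a couple of lines to deepl_en.py to count how many descriptions actually needed translation (a sketch that assumes the results list from the script above):

# Keep only the posts whose description needed translation
translated_posts = [r for r in results if r["original_description"]["translated"]]
print(f"{len(translated_posts)} of {len(results)} descriptions were translated")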

In later modules, you will learn how to use Python’s langdetect module to identify a text’s language before sending it to DeepL. We use langdetect to avoid wasting API credits on detecting or translating English text.
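
As a preview of that approach, here is a minimal sketch, assuming langdetect is installed (pip install langdetect); the idea is to call DeepL only when the text is probably not English:

from langdetect import detect, DetectorFactory

# Seed the detector so results are reproducible between runs
DetectorFactory.seed = 0


def needs_translation(text: str) -> bool:
    """Return True when langdetect does not classify the text as English."""
    # Note: detection can be unreliable on very short strings
    return detect(text) != "en"


sample = "Доступ к недвижимости Monarch"
if needs_translation(sample):
    print("Send to DeepL for translation")
else:
    print("Keep the original English text")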

Translator module

The previous example focused on mass translation of hardcoded data. For a more flexible and modular development experience, I will provide a code snippet that accepts an API key and data as inputs for translation. This approach, encapsulated in a translator.py module, is easier to integrate and use within any program:

import deepl, json


def translate_to_en(auth_key: str, data_string: str) -> dict:
    """
    Detects the language of a data string and translates it to English if not already in English.
    
    Args:
        auth_key (str): DeepL API authentication key.
        data_string (str): The input string to process.
    
    Returns:
        dict: A dictionary containing:
            - text: Original input string
            - language: Detected source language
            - translated: Boolean indicating if translation was performed
            - translated_text: Translated text (if translated) or None (if English)
    
    Raises:
        ValueError: If auth_key or data_string is empty or invalid.
        deepl.exceptions.AuthorizationException: If the API key is invalid.
        deepl.exceptions.DeepLException: For other DeepL API errors.
    """
    # Input validation
    if not auth_key:
        raise ValueError("DeepL API key cannot be empty")
    if not data_string or not isinstance(data_string, str):
        raise ValueError("Data string must be a non-empty string")

    try:
        # Initialize DeepL translator
        translator = deepl.Translator(auth_key)
        
        # Detect language and translate if necessary
        detection = translator.translate_text(
            data_string, target_lang="EN-US", source_lang=None
        )
        detected_lang = detection.detected_source_lang
        translated = detected_lang != "EN"
        translated_text = detection.text if translated else None

        # Build result dictionary
        result = {
            "text": data_string,
            "language": detected_lang,
            "translated": translated,
            "translated_text": translated_text
        }
        
        return result

    except deepl.exceptions.AuthorizationException as e:
        raise deepl.exceptions.AuthorizationException(
            f"Authorization error: Check your DeepL API key. {str(e)}"
        )
    except deepl.exceptions.DeepLException as e:
        raise deepl.exceptions.DeepLException(
            f"DeepL API error: {str(e)}"
        )
    except Exception as e:
        raise Exception(f"Unexpected error: {str(e)}")


def format_result(result: dict) -> str:
    """
    Formats the translation result as beautified JSON.
    
    Args:
        result (dict): The result dictionary from translate_to_en.
    
    Returns:
        str: Beautified JSON string.
    """
    return json.dumps(result, indent=2, ensure_ascii=False)
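
For a quick sanity check of the module, you can call it on a single string; a minimal sketch, assuming translator.py sits in the current directory and your DeepL API key is filled in:

import translator

auth_key = "XXXXXXX-XXXXXXXXXXXXXXXX:fx"  # replace with your DeepL API key

result = translator.translate_to_en(auth_key, "베리타스 헬스 파트너스 액세스")
print(translator.format_result(result))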

Here is how to call translator.py from translate.py:

import translator
import json


# Load API key
auth_key = "XXXXXXX-XXXXXXXXXXXXXXXX:fx"

# Sellers posts
sellers_posts = [
                {
                "title": "AspenWell 生物科技有限公司 访问",
                "description": "公司:AspenWell Biotech Ltd\n地点:荷兰\n收入:$1.3 亿美元\n访问类型:VPN > 内部 RDP > DA\n额外信息:SharePoint 暴露,部署了 RMM 代理\n安全:内部无防病毒软件,未触发日志\n价格:1.05 BTC,支持 escrow 支付(需与卖家协商)\n仅向一名买家出售。",
                "price": "1.05 BTC"
            },
            {
                "title": "Silverline Capital Partners Access",
                "description": "Company: Silverline Capital Partners (NY Private Equity firm)\nRevenue: approx $500M AUM\nAccess: RDP into domain w/ admin rights\nDetected: not yet, 3 days stable\nBonus: Access to doc mgmt system + Zoom internal creds\nSelling for 0.9 BTC – will include persistence script\nDM to deal",
                "price": "0.9 BTC"
            },
            {
                "title": "베리타스 헬스 파트너스 액세스",
                "description": "판매 중: Veritas Health Partners, 캐나다\n매출: 약 $430M\n산업: 의료\n접근 유형: VPN + 작동 중인 RDP\n권한: 로컬 관리자 권한 및 내부 공유 액세스\n탐지되지 않음 – 6일 연속 가동 중\n가격: 0.58 BTC 고정 가격",
                "price": "0.58 BTC"
            },
            {
                "title": "Portwell Maritime Group Access",
                "description": "Selling access to Portwell Maritime Group (Singapore)\nAccess: DA creds via exposed RDP svc acct\nRevenue: Est. $2B\nIndustry: Shipping / Transport\nExtras: payroll exports + crew schedules\nContact: DM if you’ve dealt before. Fast buyers only.",
                "price": "DM for price"
            },
            {
                "title": "Доступ к недвижимости Monarch",
                "description": "Компания: Monarch Real Estate Investments\nМестонахождение: США, Нью-Йорк\nВыручка: 380 млн долларов\nДоступ: Citrix > внутренний RDP > DA\nНабор постоянных файлов, нетронутый более 4 дней\nДоступны резервные копии документов + договоры аренды клиентов\nЦена: 0,7 BTC, только для серьезных людей",
                "price": "0.7 BTC"
            }]


results = []

for post in sellers_posts:
    try:
        title_result = translator.translate_to_en(auth_key, post["title"])
        description_result = translator.translate_to_en(auth_key, post["description"])
        result = {
            "original_title": title_result,
            "original_description": description_result,
            "price": post["price"]
        }
        results.append(result)
    except Exception as e:
        print(f"Error processing post '{post['title']}': {e}")
        continue


print(json.dumps(results, indent=2, ensure_ascii=False))

In later chapters, when you build a web scraper, you will use something like translator.py precisely because it is modular: the translation API it wraps can be swapped out at any time without touching the rest of the program. This is what we call modular design: a component that can be modified, replaced, or removed later on.
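
To make that concrete, here is a hedged sketch of what a drop-in replacement backend could look like: a hypothetical translator_offline.py that keeps the same function signature and return format as translate_to_en, but only detects the language locally with langdetect and performs no translation. Because the contract is unchanged, translate.py could switch backends by changing a single import.

# translator_offline.py - hypothetical stand-in backend with the same contract
from langdetect import detect


def translate_to_en(auth_key: str, data_string: str) -> dict:
    """Detect the language locally and return the same dict shape as the DeepL version."""
    # auth_key is ignored here; it is kept only to match the interface
    if not data_string or not isinstance(data_string, str):
        raise ValueError("Data string must be a non-empty string")

    detected = detect(data_string).upper()  # e.g. "EN", "RU", "KO"
    return {
        "text": data_string,
        "language": detected,
        "translated": False,          # this backend does not translate anything
        "translated_text": None
    }

In translate.py, swapping import translator for import translator_offline as translator would then be the only change required.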


Clearnet forum data extraction

To begin, I will demonstrate how to use Playwright to extract a post's title and description and translate them if needed.

First, ensure the clearnet forum is running. Log in and navigate to a specific post, such as post number 17 in the sellers' marketplace:

http://127.0.0.1:5000/post/marketplace/17

When you access this page, the post title, description, username, and timestamp are clearly visible. However, if you inspect the page's source code:

view-source:http://127.0.0.1:5000/post/marketplace/17

You won’t find this data. This is because the page imports and uses main.js. You can view the source code of this file here:

view-source:http://127.0.0.1:5000/static/js/main.js

Within main.js, the loadPostDetails function is responsible for fetching data from backend APIs:

    function loadPostDetails() {
        if ($('#post-title').length) {
            console.log('Loading post details for:', postType, postId);
            $.get(`/api/post/${postType}/${postId}`, function(data) {
                console.log('API response:', data);
                const post = data.post;
                const comments = data.comments;
                const user = data.user;

                // Update post title
                $('#post-title').text(post.title || 'Untitled');

                // Update post content
                let contentHtml = '';
                if (postType === 'announcements') {
                    contentHtml = post.content || '';
                } else {
                    contentHtml = `${post.description || ''}<br><strong>Price:</strong> ${post.price || 'N/A'}`;
                }
                contentHtml += `<br><br><strong>Posted by:</strong> <a href="/profile/${post.username}" class="text-light">${post.username}</a>`;
                contentHtml += `<br><strong>Date:</strong> ${post.date}`;
                $('#post-content').html(contentHtml);

                // Update category
                $('#post-category').text(post.category || 'N/A');

                // Update comments
                $('#comments-section').empty();
                if (comments && comments.length > 0) {
                    comments.forEach(function(comment) {
                        $('#comments-section').append(
                            `<div class="mb-3">
                                <p class="text-light"><strong>${comment.username}</strong> (${comment.date}): ${comment.content}</p>
                            </div>`
                        );
                    });
                } else {
                    $('#comments-section').html('<p class="text-light">No comments yet.</p>');
                }

                // Update back link
                $('#back-link').attr('href', `/category/${postType}/${post.category}`).text(`Back to ${post.category}`);
            }).fail(function(jqXHR, textStatus, errorThrown) {
                console.error('Failed to load post details:', textStatus, errorThrown);
                $('#post-title').text('Error');
                $('#post-content').html('<p class="text-light">Failed to load post. Please try again later.</p>');
                $('#comments-section').html('<p class="text-light">Failed to load comments.</p>');
            });
        }
    }

Right-click on the post description and select Inspect Element in your browser's Developer Tools to view the post content clearly. This shows that a browser is necessary to extract the data. We use Playwright because it offers powerful tools to automate data extraction from web pages within a browser environment.
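
You can confirm this yourself with a quick check. The sketch below assumes the forum is running locally and that you paste in a valid session cookie; fetching the page with plain requests returns only the HTML shell, because main.js fills in the post data after the page loads in a browser:

import requests

cookies = {"session": "PASTE_YOUR_SESSION_COOKIE_HERE"}
html = requests.get("http://127.0.0.1:5000/post/marketplace/17", cookies=cookies).text

# The rendered title never appears in the static HTML; main.js fetches it
# from /api/post/marketplace/17 after the page loads
print("AspenWell" in html)   # expected: False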

How does the web page work?

Before diving into the code, it’s crucial to understand how the web page, and by extension the site, functions. Each piece of data is contained within an element, such as <h1 id="header">. To extract data from this header, we can target its ID, #header.

To explore this, open any post on the clearnet site -> right-click on the title -> select Inspect Element -> then right-click the highlighted code in the inspector and choose Edit as HTML.

Copy and paste the code into a note-taking editor:

<h2 class="text-light" id="post-title">AspenWell 生物科技有限公司 访问</h2>

Next, right-click on the description -> select Inspect Element -> right-click the highlighted code in the inspector -> choose Edit as HTML.

Copy and paste the code into your editor:

<p class="card-text text-light" id="post-content">公司:AspenWell Biotech Ltd
地点:荷兰
收入:$1.3 亿美元
访问类型:VPN &gt; 内部 RDP &gt; DA
额外信息:SharePoint 暴露,部署了 RMM 代理
安全:内部无防病毒软件,未触发日志
价格:1.05 BTC,支持 escrow 支付(需与卖家协商)
仅向一名买家出售。<br><strong>Price:</strong> 1.05 BTC<br><br><strong>Posted by:</strong> <a href="/profile/AnonX" class="text-light">AnonX</a><br><strong>Date:</strong> 2025-06-30 14:53:51</p>

The username and date are also here. We will be targeting these elements with Playwright.

Using cookies with Playwright

Since we are already logged into the website, we can view posts. While it’s possible to automate logging in and bypassing CAPTCHAs using AI, for now, a simpler approach is to log in manually and extract your session cookies from Developer Tools. You will learn how to automate CAPTCHA bypassing with AI later in the course.

To retrieve your session cookies:

  1. Press F12 to open Developer Tools.
  2. Navigate to the Storage tab.
  3. In the left panel, select Cookies.
  4. Click the website address and copy all the cookies.

To inspect how requests are sent to the website:

  1. Press F12 to open Developer Tools.
  2. Go to the Network tab.
  3. Locate the file for post 17 with type html and click it.
  4. In the right panel, scroll to the Request Headers section.
  5. Click Raw and copy all the contents.

Below is an example of how requests, including cookies, are structured:

GET /post/marketplace/17 HTTP/1.1
Host: 127.0.0.1:5000
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br, zstd
Referer: http://127.0.0.1:5000/category/marketplace/Sellers?page=1
Connection: keep-alive
Cookie: csrftoken=RxeTY7ifP4ooXCk95Q5ry9ZaFGMOMuZI; _ga_HSTF3XTLDW=GS2.1.s1750188904$o17$g1$t1750189624$j47$l0$h0; _ga_Y0Z4N92WQM=GS2.1.s1751918138$o16$g1$t1751919098$j60$l0$h0; sessionid=qqri9200q5ve9sehy7gp3yeax28bnrhm; session=.eJwlzjkOwjAQAMC_uKbwru098hmUvQRtQirE30FiXjDvdq8jz0fbXseVt3Z_RtsaLq_dS6iPEovRuyIyhhGoxMppFG5CJEAEy3cuZ9LivWN0nWwyLCZS6tI5h66oWdA9hBADWN14xyrJ6LkmFFj5yjKsMVjbL3Kdefw30D5fvlgvjw.aG6mEQ.B_zqhhmM1qXJrt8glWcY3eIzNQ8
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: same-origin
Sec-Fetch-User: ?1
Priority: u=0, i

Your request will look a little different from this; it is only an example. The only cookie we need is session.

Data extraction with Playwright

To get started, set up an environment and install the dependencies:

mkdir play && cd play
touch play.py
python3 -m venv venv
source venv/bin/activate
sudo apt install libavif16
pip3 install playwright
playwright install

Open play.py and paste the following code:

from playwright.sync_api import sync_playwright
from time import sleep

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()

    # Essential cookies
    cookies = [
        {
            "name": "session",
            "value": ".eJwlzjkOwjAQAMC_uKbwru098hmUvQRtQirE30FiXjDvdq8jz0fbXseVt3Z_RtsaLq_dS6iPEovRuyIyhhGoxMppFG5CJEAEy3cuZ9LivWN0nWwyLCZS6tI5h66oWdA9hBADWN14xyrJ6LkmFFj5yjKsMVjbL3Kdefw30D5fvlgvjw.aG6mEQ.B_zqhhmM1qXJrt8glWcY3eIzNQ8",
            "domain": "127.0.0.1",
            "path": "/",
            "expires": -1,
            "httpOnly": True,
            "secure": False,
            "sameSite": "Lax"
        }
    ]

    # Add cookies to the context
    context.add_cookies(cookies)

    # Open a page and navigate to the target URL
    page = context.new_page()
    page.goto("http://127.0.0.1:5000/post/marketplace/17")
    print(page.title())
    sleep(400000)
    browser.close()

This code uses the session cookie to access a web page. An invalid cookie will prevent access and force you to log in again. The sleep(400000) call keeps the browser open, giving you ample time to interact with Playwright's browser session.
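
If you want the script to confirm that the cookie actually authenticated you, one option is to check for the navbar's Logout link after navigating. Below is a hedged sketch; it assumes the post pages use the same base template (and therefore the same /logout link) as the category pages shown later in this section, and that you paste in your own session cookie:

from playwright.sync_api import sync_playwright

SESSION_COOKIE = "PASTE_YOUR_SESSION_COOKIE_HERE"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    context.add_cookies([{
        "name": "session",
        "value": SESSION_COOKIE,
        "domain": "127.0.0.1",
        "path": "/"
    }])
    page = context.new_page()
    page.goto("http://127.0.0.1:5000/post/marketplace/17")

    # Authenticated users see a Logout link in the navbar; its absence
    # suggests the cookie was rejected and we landed on the login page
    if page.locator('a[href="/logout"]').count() > 0:
        print("Cookie accepted - logged in")
    else:
        print("Cookie rejected - grab a fresh session cookie from Developer Tools")
    browser.close()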

With an understanding of where data is stored, we are now ready to extract it from the web page. Below is how data can be extracted:

from playwright.sync_api import sync_playwright
import json
from urllib.parse import urljoin


with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()

    # Essential cookies
    cookies = [
        {
            "name": "session",
            "value": ".eJwlzjkOwjAQAMC_uKbwru098hmUvQRtQirE30FiXjDvdq8jz0fbXseVt3Z_RtsaLq_dS6iPEovRuyIyhhGoxMppFG5CJEAEy3cuZ9LivWN0nWwyLCZS6tI5h66oWdA9hBADWN14xyrJ6LkmFFj5yjKsMVjbL3Kdefw30D5fvlgvjw.aG6mEQ.B_zqhhmM1qXJrt8glWcY3eIzNQ8",
            "domain": "127.0.0.1",
            "path": "/",
            "expires": -1,
            "httpOnly": True,
            "secure": False,
            "sameSite": "Lax"
        }
    ]

    # Add cookies to the context
    context.add_cookies(cookies)

    # Open a page and navigate to the target URL
    page = context.new_page()
    page.goto("http://127.0.0.1:5000/post/marketplace/17")

    # Extract data
    title = page.locator('h2#post-title').inner_text()
    content = page.locator('p#post-content').inner_text()
    username = page.locator('p#post-content a').inner_text()
    author_link = page.locator('p#post-content a').get_attribute('href')
    
    # Extract timestamp
    timestamp = page.locator('p#post-content').evaluate(
        """el => el.innerHTML.split('<strong>Date:</strong> ')[1].trim()"""
    )

    post_url = page.url
    full_author_link = urljoin(post_url, author_link)

    # Construct JSON output
    post_data = {
        "title": title,
        "content": content,
        "username": username,
        "timestamp": timestamp,
        "post_link": post_url,
        "author_link": full_author_link
    }
    print(json.dumps(post_data, indent=2, ensure_ascii=False))

    browser.close()

With slight adjustments, we can import the translate_to_en function from the translator.py module and utilize it. To make this work, place both play.py and translator.py in the same directory, create a Python virtual environment, and install all dependencies required by both files. Then, modify play.py to integrate with translator.py:

from playwright.sync_api import sync_playwright
import json
from urllib.parse import urljoin
from translator import translate_to_en


DEEPL_API_KEY = "XXXXXXXXXXXX-XXXXXXXXXXXXXX:fx"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()

    # Essential cookies
    cookies = [
        {
            "name": "session",
            "value": ".eJwlzjkOwjAQAMC_uKbwru098hmUvQRtQirE30FiXjDvdq8jz0fbXseVt3Z_RtsaLq_dS6iPEovRuyIyhhGoxMppFG5CJEAEy3cuZ9LivWN0nWwyLCZS6tI5h66oWdA9hBADWN14xyrJ6LkmFFj5yjKsMVjbL3Kdefw30D5fvlgvjw.aG6mEQ.B_zqhhmM1qXJrt8glWcY3eIzNQ8",
            "domain": "127.0.0.1",
            "path": "/",
            "expires": -1,
            "httpOnly": True,
            "secure": False,
            "sameSite": "Lax"
        }
    ]

    # Add cookies to the context
    context.add_cookies(cookies)
    page = context.new_page()
    page.goto("http://127.0.0.1:5000/post/marketplace/17")

    # Extract data
    title = page.locator('h2#post-title').inner_text()
    content = page.locator('p#post-content').inner_text()
    username = page.locator('p#post-content a').inner_text()
    author_link = page.locator('p#post-content a').get_attribute('href')
    
    # Extract timestamp
    timestamp = page.locator('p#post-content').evaluate(
        """el => el.innerHTML.split('<strong>Date:</strong> ')[1].trim()"""
    )

    # Get the current page URL dynamically
    post_url = page.url
    full_author_link = urljoin(post_url, author_link)

    # Translate title and content using DeepL
    try:
        title_translation = translate_to_en(DEEPL_API_KEY, title)
        content_translation = translate_to_en(DEEPL_API_KEY, content)
    except Exception as e:
        print(f"Translation error: {str(e)}")
        title_translation = {
            "text": title,
            "language": "UNKNOWN",
            "translated": False,
            "translated_text": None
        }
        content_translation = {
            "text": content,
            "language": "UNKNOWN",
            "translated": False,
            "translated_text": None
        }

    # Construct JSON output with translation details
    post_data = {
        "title": {
            "original": title_translation["text"],
            "language": title_translation["language"],
            "translated": title_translation["translated"],
            "translated_text": title_translation["translated_text"]
        },
        "content": {
            "original": content_translation["text"],
            "language": content_translation["language"],
            "translated": content_translation["translated"],
            "translated_text": content_translation["translated_text"]
        },
        "username": username,
        "timestamp": timestamp,
        "post_link": post_url,
        "author_link": full_author_link
    }

    print(json.dumps(post_data, indent=2, ensure_ascii=False))
    browser.close()

Here is how the code works:

  1. Imports: Uses playwright for browser automation, json for output, urljoin for URL handling, and translate_to_en for DeepL translations. Defines DeepL API key.
  2. Browser Setup: Launches headless Chromium browser with sync_playwright, creates context, and adds session cookie for authentication.
  3. Page Navigation: Opens page at http://127.0.0.1:5000/post/marketplace/17.
  4. Data Extraction: Uses Playwright locators to extract:
    • Title from <h2> with ID post-title.
    • Content from <p> with ID post-content.
    • Username from <a> in post-content.
    • Author link from <a>’s href.
    • Timestamp by parsing text after <strong>Date:</strong> via JavaScript.
    • Post URL from current page.
    • Full author link using urljoin.
  5. Translation: Translates title and content to English via DeepL. On error, returns original text with "UNKNOWN" language.
  6. Output: Creates JSON with title, content (original, language, translated), username, timestamp, post link, and author link. Prints with indentation.
  7. Cleanup: Closes browser.

Here is the JSON output in the terminal:

{
  "title": {
    "original": "AspenWell 生物科技有限公司 访问",
    "language": "ZH",
    "translated": true,
    "translated_text": "AspenWell Biotechnology Limited Visit"
  },
  "content": {
    "original": "公司:AspenWell Biotech Ltd 地点:荷兰 收入:$1.3 亿美元 访问类型:VPN > 内部 RDP > DA 额外信息:SharePoint 暴露,部署了 RMM 代理 安全:内部无防病毒软件,未触发日志 价格:1.05 BTC,支持 escrow 支付(需与卖家协商) 仅向一名买家出售。\nPrice: 1.05 BTC\n\nPosted by: AnonX\nDate: 2025-06-30 14:53:51",
    "language": "ZH",
    "translated": true,
    "translated_text": "Company: AspenWell Biotech Ltd Location: Netherlands Revenue: $130M Access Type: VPN > Internal RDP > DA Additional Information: SharePoint exposed, RMM proxy deployed Security: No antivirus software on-premise, no logs triggered Price: 1.05 BTC, escrow payment supported (subject to seller's negotiation) Sold to a single Sold to one buyer only.\nPrice: 1.05 BTC\n\nPosted by: AnonX\nDate: 2025-06-30 14:53:51"
  },
  "username": "AnonX",
  "timestamp": "2025-06-30 14:53:51",
  "post_link": "http://127.0.0.1:5000/post/marketplace/17",
  "author_link": "http://127.0.0.1:5000/profile/AnonX"
}

As shown, both the title and description have been translated. This serves as a straightforward example of how to scrape data using Playwright and perform translations.


Tornet forum data extraction

In this section, I will demonstrate how to extract all post titles, usernames, timestamps, and links from the Buyers section of the Tornet forum. Since the Tornet forum does not use JavaScript, data extraction is straightforward.

For this task, we will use Python’s requests and BeautifulSoup libraries.

Start by accessing the Buyers section:

http://127.0.0.1:5000/category/marketplace/Buyers

For me, this category contains 13 posts in total, with pagination to navigate to the next set of 10 or fewer posts. Clicking "Next" changes the URL to:

http://127.0.0.1:5000/category/marketplace/Buyers?page=2

This pagination system is convenient—simply adjust the page number to navigate forward or backward. However, if you enter an invalid page number, such as 56, the response will display No posts found. This serves as a clear signal for your script to stop.

Make a note of this in your note-taking editor.
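
You can verify this stop signal with a quick requests call before writing the full scraper; a small sketch, assuming requests is installed and your session cookie is pasted in:

import requests

cookies = {"session": "PASTE_YOUR_SESSION_COOKIE_HERE"}
resp = requests.get(
    "http://127.0.0.1:5000/category/marketplace/Buyers?page=56", cookies=cookies
)

# An out-of-range page contains the marker text the scraper will stop on
print("No posts found" in resp.text)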

By viewing the page source code for page 1, you can see all posts directly within the HTML:

<!-- templates/category.html -->
<!-- templates/base.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Buyers - Cyber Forum</title>
    <link href="/static/css/bootstrap.min.css" rel="stylesheet">
</head>
<body class="bg-dark text-light">
    <nav class="navbar navbar-dark bg-secondary">
        <div class="container-fluid">
            <a class="navbar-brand" href="/">Cyber Forum</a>
            <ul class="navbar-nav ms-auto d-flex flex-row">
                <li class="nav-item me-3"><a class="nav-link" href="/">Home</a></li>
                <li class="nav-item me-3"><a class="nav-link" href="/marketplace">Marketplace</a></li>
                <li class="nav-item me-3"><a class="nav-link" href="/services">Services</a></li>
                
                    <li class="nav-item me-3"><a class="nav-link" href="/search">Search</a></li>
                    <li class="nav-item me-3"><a class="nav-link" href="/profile/DarkHacker">Profile</a></li>
                    <li class="nav-item"><a class="nav-link" href="/logout">Logout</a></li>
            </ul>
        </div>
    </nav>
    <div class="container my-4">

    <h2 class="text-light">Buyers</h2>
    <div class="card bg-dark border-secondary">
        <div class="card-body">
            <table class="table table-dark table-hover">
                <thead class="table-dark">
                    <tr>
                        <th scope="col">Title</th>
                        <th scope="col">Posted By</th>
                        <th scope="col">Date</th>
                        <th scope="col">Comments</th>
                        <th scope="col">Action</th>
                    </tr>
                </thead>
                <tbody class="text-light">
                    
                        
                            <tr>
                                <td>Seeking CC dumps ASAP</td>
                                <td><a href="/profile/GhostRider" class="text-light">GhostRider</a></td>
                                <td>2025-07-06 17:32:21</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/7" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Seeking data leaks ASAP</td>
                                <td><a href="/profile/HackSavvy" class="text-light">HackSavvy</a></td>
                                <td>2025-07-06 15:05:00</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/2" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Buying Fresh PayPal accounts</td>
                                <td><a href="/profile/N3tRunn3r" class="text-light">N3tRunn3r</a></td>
                                <td>2025-07-06 03:29:42</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/12" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Need gift card codes, High Budget</td>
                                <td><a href="/profile/ShadowV" class="text-light">ShadowV</a></td>
                                <td>2025-07-04 00:58:36</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/4" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Looking for CC dumps</td>
                                <td><a href="/profile/Crypt0King" class="text-light">Crypt0King</a></td>
                                <td>2025-07-03 07:51:05</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/11" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Looking for PayPal accounts</td>
                                <td><a href="/profile/DarkHacker" class="text-light">DarkHacker</a></td>
                                <td>2025-07-02 03:44:01</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/8" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Need CC dumps, High Budget</td>
                                <td><a href="/profile/GhostRider" class="text-light">GhostRider</a></td>
                                <td>2025-06-30 16:23:41</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/1" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Looking for data leaks</td>
                                <td><a href="/profile/CyberGhost" class="text-light">CyberGhost</a></td>
                                <td>2025-06-26 23:42:20</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/5" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Seeking RDP credentials ASAP</td>
                                <td><a href="/profile/N3tRunn3r" class="text-light">N3tRunn3r</a></td>
                                <td>2025-06-26 18:50:33</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/13" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                            <tr>
                                <td>Seeking VPN logins ASAP</td>
                                <td><a href="/profile/GhostRider" class="text-light">GhostRider</a></td>
                                <td>2025-06-21 22:55:49</td>
                                <td>2</td>
                                <td><a href="/post/marketplace/3" class="btn btn-outline-secondary btn-sm">View</a></td>
                            </tr>
                        
                    
                </tbody>
            </table>
            <nav aria-label="Category pagination">
                <ul class="pagination justify-content-center">
                    <li class="page-item disabled">
                        <a class="page-link" href="#">Previous</a>
                    </li>
                    <li class="page-item"><span class="page-link">Page 1 of 2</span></li>
                    <li class="page-item ">
                        <a class="page-link" href="/category/marketplace/Buyers?page=2">Next</a>
                    </li>
                </ul>
            </nav>
        </div>
    </div>
    <a href="/" class="btn btn-outline-secondary mt-3">Back to Home</a>

    </div>
</body>
</html>

The key focus is understanding the table's structure. You need to programmatically parse the HTML source code using BeautifulSoup and extract the data from the table.

Coding the scraper

To get started, set up a Python virtual environment and install the dependencies:

mkdir tornet_scraper && cd tornet_scraper
python3 -m venv venv
source venv/bin/activate
pip install requests beautifulsoup4
touch tornet_scrape.py

Open tornet_scrape.py and paste the following code:

import requests
from bs4 import BeautifulSoup
import json

url = "http://127.0.0.1:5000/category/marketplace/Buyers?page=1"
cookies = {"session": ".eJwlzjkOwjAQAMC_uKbwru098hmUvQRtQirE30FiXjDvdq8jz0fbXseVt3Z_RtsaLq_dS6iPEovRuyIyhhGoxMppFG5CJEAEy3cuZ9LivWN0nWwyLCZS6tI5h66oWdA9hBADWN14xyrJ6LkmFFj5yjKsMVjbL3Kdefw30D5fvlgvjw.aG6mEQ.B_zqhhmM1qXJrt8glWcY3eIzNQ8"}

response = requests.get(url, cookies=cookies)
if response.status_code != 200:
    print(f"Failed to retrieve page: {response.status_code}")
    exit()

soup = BeautifulSoup(response.text, "html.parser")
posts = []
tbody = soup.find("tbody", class_="text-light")
base_url = "http://127.0.0.1:5000"

for row in tbody.find_all("tr"):
    title = row.find_all("td")[0].text.strip()
    author_cell = row.find_all("td")[1]
    author = author_cell.text.strip()
    author_link = base_url + author_cell.find("a")["href"]
    timestamp = row.find_all("td")[2].text.strip()
    post_link = base_url + row.find_all("td")[4].find("a")["href"]
    posts.append({
        "title": title,
        "post_link": post_link,
        "post_author": author,
        "author_link": author_link,
        "timestamp": timestamp
    })

json_output = json.dumps(posts, indent=4)
print(json_output)

The script fetches a webpage from a local server using requests with a session cookie for authentication. It parses the HTML with BeautifulSoup, targeting a table (<tbody> with class text-light). For each table row, it extracts the post title, author, author link, timestamp, and post link, storing them in a list of dictionaries. Finally, it converts the data to JSON format with json.dumps and prints it.

Scraping all of the data

You have learned how to scrape data from a single page; now you will learn how to move to the next page by incrementing the page number by 1.

Here are the changes you need to make in your code:

import requests
from bs4 import BeautifulSoup
import json

url_base = "http://127.0.0.1:5000/category/marketplace/Buyers?page={}"
cookies = {"session": ".eJwlzjkOwjAQAMC_uKbwru098hmUvQRtQirE30FiXjDvdq8jz0fbXseVt3Z_RtsaLq_dS6iPEovRuyIyhhGoxMppFG5CJEAEy3cuZ9LivWN0nWwyLCZS6tI5h66oWdA9hBADWN14xyrJ6LkmFFj5yjKsMVjbL3Kdefw30D5fvlgvjw.aG6mEQ.B_zqhhmM1qXJrt8glWcY3eIzNQ8"}

posts = []
page = 1
base_url = "http://127.0.0.1:5000"

while True:
    url = url_base.format(page)
    response = requests.get(url, cookies=cookies)
    
    if response.status_code != 200:
        print(f"Failed to retrieve page {page}: {response.status_code}")
        break
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    if "No posts found." in soup.text:
        break
    
    tbody = soup.find("tbody", class_="text-light")
    if not tbody:
        break
    
    for row in tbody.find_all("tr"):
        title = row.find_all("td")[0].text.strip()
        author_cell = row.find_all("td")[1]
        author = author_cell.text.strip()
        author_link = base_url + author_cell.find("a")["href"]
        timestamp = row.find_all("td")[2].text.strip()
        post_link = base_url + row.find_all("td")[4].find("a")["href"]
        posts.append({
            "title": title,
            "post_link": post_link,
            "post_author": author,
            "author_link": author_link,
            "timestamp": timestamp
        })
    
    page += 1

json_output = json.dumps(posts, indent=4)
print(json_output)

The output should now show 13 posts in total (the number might be different for you):

[
    {
        "title": "Seeking CC dumps ASAP",
        "post_link": "http://127.0.0.1:5000/post/marketplace/7",
        "post_author": "GhostRider",
        "author_link": "http://127.0.0.1:5000/profile/GhostRider",
        "timestamp": "2025-07-06 17:32:21"
    },
    {
        "title": "Seeking data leaks ASAP",
        "post_link": "http://127.0.0.1:5000/post/marketplace/2",
        "post_author": "HackSavvy",
        "author_link": "http://127.0.0.1:5000/profile/HackSavvy",
        "timestamp": "2025-07-06 15:05:00"
    },
    {
        "title": "Buying Fresh PayPal accounts",
        "post_link": "http://127.0.0.1:5000/post/marketplace/12",
        "post_author": "N3tRunn3r",
        "author_link": "http://127.0.0.1:5000/profile/N3tRunn3r",
        "timestamp": "2025-07-06 03:29:42"
    },
    {
        "title": "Need gift card codes, High Budget",
        "post_link": "http://127.0.0.1:5000/post/marketplace/4",
        "post_author": "ShadowV",
        "author_link": "http://127.0.0.1:5000/profile/ShadowV",
        "timestamp": "2025-07-04 00:58:36"
    },
    {
        "title": "Looking for CC dumps",
        "post_link": "http://127.0.0.1:5000/post/marketplace/11",
        "post_author": "Crypt0King",
        "author_link": "http://127.0.0.1:5000/profile/Crypt0King",
        "timestamp": "2025-07-03 07:51:05"
    },
    {
        "title": "Looking for PayPal accounts",
        "post_link": "http://127.0.0.1:5000/post/marketplace/8",
        "post_author": "DarkHacker",
        "author_link": "http://127.0.0.1:5000/profile/DarkHacker",
        "timestamp": "2025-07-02 03:44:01"
    },
    {
        "title": "Need CC dumps, High Budget",
        "post_link": "http://127.0.0.1:5000/post/marketplace/1",
        "post_author": "GhostRider",
        "author_link": "http://127.0.0.1:5000/profile/GhostRider",
        "timestamp": "2025-06-30 16:23:41"
    },
    {
        "title": "Looking for data leaks",
        "post_link": "http://127.0.0.1:5000/post/marketplace/5",
        "post_author": "CyberGhost",
        "author_link": "http://127.0.0.1:5000/profile/CyberGhost",
        "timestamp": "2025-06-26 23:42:20"
    },
    {
        "title": "Seeking RDP credentials ASAP",
        "post_link": "http://127.0.0.1:5000/post/marketplace/13",
        "post_author": "N3tRunn3r",
        "author_link": "http://127.0.0.1:5000/profile/N3tRunn3r",
        "timestamp": "2025-06-26 18:50:33"
    },
    {
        "title": "Seeking VPN logins ASAP",
        "post_link": "http://127.0.0.1:5000/post/marketplace/3",
        "post_author": "GhostRider",
        "author_link": "http://127.0.0.1:5000/profile/GhostRider",
        "timestamp": "2025-06-21 22:55:49"
    },
    {
        "title": "Seeking RDP credentials ASAP",
        "post_link": "http://127.0.0.1:5000/post/marketplace/10",
        "post_author": "AnonX",
        "author_link": "http://127.0.0.1:5000/profile/AnonX",
        "timestamp": "2025-06-20 11:57:41"
    },
    {
        "title": "Buying Fresh RDP credentials",
        "post_link": "http://127.0.0.1:5000/post/marketplace/9",
        "post_author": "DarkHacker",
        "author_link": "http://127.0.0.1:5000/profile/DarkHacker",
        "timestamp": "2025-06-20 06:34:27"
    },
    {
        "title": "Need VPN logins, High Budget",
        "post_link": "http://127.0.0.1:5000/post/marketplace/6",
        "post_author": "Crypt0King",
        "author_link": "http://127.0.0.1:5000/profile/Crypt0King",
        "timestamp": "2025-06-13 07:52:15"
    }
]

Try this on any other page and see what you get. Does it work or do you need to modify it to properly target the data?