Artificial Intelligence refers to the broad discipline of computer science devoted to creating systems capable of performing tasks that typically require human intelligence. Within this field, Large Language Models (LLMs) are a specialized class of statistical AI systems trained on massive text corpora to predict and generate coherent language. Notable examples include ChatGPT, xAI's Grok, Claude, and Gemini, which excel at translation, summarization, and conversational tasks, yet they operate through pattern matching rather than genuine understanding.
Critics argue that conflating AI with LLMs overlooks the diversity of AI research, which also encompasses areas such as computer vision, robotics, expert systems and autonomous decision making.
We are not here to argue the science of AI or LLMs; instead, we will focus on what AI can be used for in cyber defense.
The topics of this section include the following:
- Different types of AI utilization: Prompting, RAG & Fine-tuning
- Setting up an AI API
- AI Prompting
Different types of AI utilization: Prompting, RAG & Fine-tuning
Context and Goal
Our objective is to analyze posts on cyber-crime forums to identify which ones discuss initial access sales, i.e., the offerings of initial access brokers (IABs). Posts are classified as:
- Positive Posts: direct sale of unauthorized access to a company (for example, "Initial access to RBC Bank available").
- Neutral Posts: general offers for tools, exploits or malware without naming a specific target.
- Negative Posts: off-topic or unrelated services such as hosting, spam tools or generic VPS sales.
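This three-way taxonomy can be represented as a small labeled dataset; the sketch below (with invented post texts, not real forum data) shows the shape of the data the rest of this section works with.

```python
# Minimal labeled dataset illustrating the three categories.
# The post texts are invented examples, not real forum data.
labeled_posts = [
    ("Selling RDP access to Acme Corp, domain admin, 0.5 BTC", "positive"),
    ("FUD crypter for sale, works on most stealers", "neutral"),
    ("Bulletproof VPS hosting, offshore, from $10/month", "negative"),
]

# Filter out the posts we actually care about: direct access sales.
positive = [text for text, label in labeled_posts if label == "positive"]
print(positive)
```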
AI Prompting
Prompting uses pre-trained APIs (for example OpenAI, xAI, Anthropic) by sending crafted prompts and receiving unmodified responses.
Costs and Technical Overhead
- Third-party API calls range from low (GPT-3.5 at ~$0.002 per 1,000 tokens) to medium (GPT-4 at ~$0.03–0.06 per 1,000 tokens).
- Self-hosting incurs hardware costs (GPU servers at $2–3/hour) and license fees for open models.
- Minimal infrastructure beyond secure storage for prompt logs and API keys.
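To make these figures concrete, here is a back-of-the-envelope cost estimate for a daily classification workload. The per-token price and volumes used in the example are illustrative assumptions, not vendor quotes.

```python
def monthly_cost(posts_per_day: int, tokens_per_post: int,
                 price_per_1k_tokens: float, days: int = 30) -> float:
    """Rough monthly API cost: total tokens / 1,000 * price per 1,000 tokens."""
    total_tokens = posts_per_day * tokens_per_post * days
    return total_tokens / 1000 * price_per_1k_tokens

# Example: 500 posts/day, ~600 tokens each (prompt + response),
# at an assumed $0.003 per 1,000 tokens.
print(f"${monthly_cost(500, 600, 0.003):.2f}")  # → $27.00
```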
Data Storage
Only prompt templates, logs and post metadata; usually a few megabytes daily.
Privacy Considerations
Sending sensitive threat intelligence to external APIs can violate confidentiality. Local models improve privacy but raise infrastructure and training costs.
Ease of Adoption
Prompting is the fastest way to get started with CTI, requiring no model training and minimal setup.
Retrieval Augmented Generation (RAG)
RAG combines a document store of CTI data with an LLM. It retrieves relevant passages (for example from malware reports or leak archives) to augment prompt context.
Core Components and Costs
- Vector Store (for example FAISS, Pinecone)
- Embedding Model (OpenAI embeddings or open-source alternatives)
- LLM
- Leading cloud LLMs: GPT-4, Claude 2
- Self-hosted models: Llama 2-70B, Mistral Instruct
Cloud inference costs include embedding calls and generation calls (similar to prompting). Self-hosting requires GPUs (A100 or equivalent) with 40–80 GB of VRAM for large models.
Technical Overhead
Setting up a retrieval pipeline, indexing documents and maintaining vector databases requires technical skills.
Data Storage
Corpus storage (tens to hundreds of gigabytes) plus vector indexes (roughly 2-5x the corpus size).
Compute Requirements
- Embedding: few seconds per document on GPU
- Generation: depends on model size; GPT-4 or 70B-parameter models need high-end GPUs for low-latency inference.
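The retrieval half of a RAG pipeline reduces to "embed the query, find the nearest stored passages, prepend them to the prompt." The sketch below illustrates that flow with a toy bag-of-words embedding and cosine similarity; the corpus snippets are invented, and a real deployment would swap in a proper embedding model and a vector store such as FAISS.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' -- a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy CTI corpus; in practice these would be chunks of malware reports, leak archives, etc.
corpus = [
    "Report: LockBit affiliates buying RDP access from brokers",
    "Phishing kit analysis: fake bank login pages",
    "Initial access broker sold VPN credentials for a logistics firm",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query: str, k: int = 2) -> list:
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved passages are prepended to the LLM prompt as context.
context = "\n".join(retrieve("who is selling initial access?"))
prompt = f"Context:\n{context}\n\nQuestion: who is selling initial access?"
```

In production, the vector index replaces the linear scan in `retrieve`, but the overall shape of the pipeline stays the same.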
Fine-tuning
Fine-tuning adapts an LLM to CTI by training on labeled examples (for example post classifications or threat report summaries).
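Fine-tuning data is usually supplied as prompt/completion pairs. The sketch below converts labeled posts into a JSONL file in the chat-message style used by OpenAI's fine-tuning API; the post texts and system prompt are illustrative assumptions.

```python
import json

# Illustrative labeled posts (invented examples).
labeled = [
    ("Selling RDP access to Acme Corp, domain admin, 0.5 BTC", "Positive"),
    ("Private crypter, FUD for most stealers, 0.01 BTC", "Neutral"),
    ("Offshore VPS hosting, from $10/month", "Negative"),
]

system = "Classify the post as Positive, Neutral, or Negative for initial access sales."

# One JSON object per line: system prompt, user post, assistant label.
lines = []
for text, label in labeled:
    record = {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": text},
        {"role": "assistant", "content": label},
    ]}
    lines.append(json.dumps(record))

with open("train.jsonl", "w") as f:
    f.write("\n".join(lines))
```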
Best Reasoning LLMs and Compute Needs
- Cloud: GPT-3.5 and GPT-4 fine-tuning (requires minimal user-side compute).
- Self-hosted: Mistral 7B/13B, Llama 2 variants, Falcon series.
- Fine-tuning large models (30 B+ parameters) demands multiple A100 GPUs and can take hours to days.
Costs and Technical Overhead
- Cloud: token-based fine-tune fees (~$0.03–0.12 per 1,000 tokens) plus storage for the tuned model.
- On-premises: GPU cluster rentals ($3–10/hour per GPU), data preprocessing pipelines, and ongoing maintenance of training infrastructure.
Data Storage
Training datasets (up to several gigabytes) plus final model checkpoints (10–200 GB depending on size).
To get started with fine-tuning, here is an example dataset you can use:
https://github.com/0xHamy/minimal_scraper/tree/main/datasets
To generate more data like it, you can use paraphrasing APIs.
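One simple augmentation approach is to ask an LLM to paraphrase each labeled post while keeping the label. The sketch below only builds the paraphrase prompt (sending it through an API client is left out); the exact wording of the instruction is an assumption.

```python
def paraphrase_prompt(post: str, n: int = 3) -> str:
    """Build an LLM prompt asking for n label-preserving rewrites of a post."""
    return (
        f"Rewrite the following forum post {n} times, changing the wording "
        "but keeping the meaning, tone, and any named targets intact. "
        "Return one rewrite per line.\n\n"
        f"Post:\n{post}"
    )

# Each rewrite inherits the original post's label, multiplying the dataset.
print(paraphrase_prompt("Selling RDP access to Acme Corp, 0.5 BTC"))
```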
Comparative Summary
| Aspect | Prompting | RAG | Fine-tuning |
|----------------|----------------------------------|-----------------------------------------|---------------------------------------|
| Privacy | Low (public API) or high (local) | Medium (local corpus, public LLM) | High (local model) |
| Setup Time | Minutes | Hours to days | Days to weeks |
| Infrastructure | Minimal | Moderate (vector DB, retrieval server) | Extensive (GPU fleet, training stack) |
| Cost Profile | Low to medium | Medium to high | High |
| Data Storage | Logs and metadata | Documents + indexes | Training data + model checkpoints |
| Model Reasoning | Limited to prompt design | Enhanced by context retrieval | Best domain alignment |
| Learning Curve | Easiest | Intermediate | Steep |
Setting up an AI API
As noted earlier, prompting is the simplest way to begin. But how do we select the right models for our tasks?
The AI providers listed below offer a range of models beyond those mentioned. I encourage you to explore their offerings and understand their capabilities. Note that AI subscriptions (e.g., for ChatGPT or Claude) differ from AI APIs, which require purchasing API credits. These typically start at $5 and last for a considerable time.
Anthropic
Anthropic’s Claude 3.5 Sonnet is a model I frequently use. It excels in reasoning and has robust guardrails, making it ideal for identifying posts related to initial access sales, hate speech, or cyberbullying. Additionally, it performs well in Optical Character Recognition (OCR), particularly for solving CAPTCHAs by extracting text from images effectively.
Get started:
xAI
xAI’s Grok 3 is exceptional for writing and analyzing large codebases, handling up to 2,000 lines of code. While suitable for malware analysis, it is less effective in reasoning tasks, such as debates, where it struggles to generate unique arguments. Claude 3.5 Sonnet outperforms Grok 3 in general reasoning and debates, but Grok 3 leads in coding tasks.
Get started:
OpenAI
OpenAI provides various models, with o1 standing out for debugging code and resolving errors. This model has been invaluable for troubleshooting large codebases. As you develop extensive CTI platforms, you’ll encounter numerous errors, and o1 is a reliable tool for finding solutions.
Get started:
AI Prompting
The AI API providers mentioned earlier offer comprehensive documentation detailing how to interact with their APIs using programming languages like Python or JavaScript. In this section, I will dive deeper and explain how to leverage Anthropic’s AI for specific use cases, such as distinguishing posts discussing initial access sales from those that do not.
Identify an initial access sale with Claude
To identify initial access sales with Claude, we prompt the model through Anthropic's API. To get started, set up a Python virtual environment:
mkdir ai_prompting && cd ai_prompting
python3 -m venv venv
source venv/bin/activate
pip install anthropic
touch iab_identifier.py
Open iab_identifier.py and paste the following code:
from anthropic import Anthropic
import json


def claude_classify(
    api_key: str,
    model_name: str,
    post_content: str,
    max_tokens: int = 100,
    temperature: float = 0.1
) -> dict:
    """
    Classify a post using Claude (e.g., Sonnet) to determine if it discusses selling initial access,
    unrelated items, or warnings/complaints.

    Args:
        api_key (str): Anthropic API key.
        model_name (str): Claude model name (e.g., 'claude-3-5-sonnet-20241022').
        post_content (str): Post text to classify.
        max_tokens (int, optional): Max output tokens. Defaults to 100.
        temperature (float, optional): Sampling temperature. Defaults to 0.1.

    Returns:
        dict: JSON with classification, scores, and error (if any).
    """
    client = Anthropic(api_key=api_key)

    prompt_template = """
Does this post discuss selling initial access to a company (e.g., RDP, VPN, admin access), selling unrelated items (e.g., accounts, tools), or warnings/complaints? Classify it as:
- Positive Posts: direct sale of unauthorized access to a company; this usually includes the target's name.
- Neutral Posts: general offers for tools, exploits or malware without naming a specific target.
- Negative Posts: off-topic or unrelated services such as hosting, spam tools or generic VPS sales.

The content must be specifically about selling access to a company or business whose name is mentioned in the post.

Return **only** a JSON object with:
- `classification`: "Positive", "Neutral", or "Negative".
- `scores`: Probabilities for `positive`, `neutral`, `negative` (summing to 1).

Wrap the JSON in ```json
{
    ...
}
``` to ensure proper formatting. Do not include any reasoning or extra text.

Post:
```markdown
{{POST}}
```
"""
    prompt = prompt_template.replace("{{POST}}", post_content)

    try:
        message = client.messages.create(
            model=model_name,
            max_tokens=max_tokens,
            temperature=temperature,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )
        content = message.content[0].text
        # Extract the JSON between the ```json and ``` markers
        start = content.index("```json") + len("```json")
        end = content.index("```", start)
        result = json.loads(content[start:end])
        return result
    except Exception as e:
        return {"error": f"Failed to classify post: {str(e)}", "classification": None, "scores": None}


# Example usage
if __name__ == "__main__":
    API_KEY = ""
    MODEL_NAME = "claude-3-7-sonnet-20250219"

    sample_post = """Selling access to Horizon Logistics\nRevenue: $1.2B\nAccess: RDP with DA\nPrice: 0.8 BTC\nDM for details"""

    result = claude_classify(
        api_key=API_KEY,
        model_name=MODEL_NAME,
        post_content=sample_post,
        max_tokens=100,
        temperature=0.1
    )
    print("API response:")
    print(json.dumps(result, indent=4))
This is called a function docstring:
"""
Classify a post using Claude (e.g., Sonnet) to determine if it discusses selling initial access,
unrelated items, or warnings/complaints.
Args:
api_key (str): Anthropic API key.
model_name (str): Claude model name (e.g., 'claude-3-5-sonnet-20241022').
post_content (str): Post text to classify.
max_tokens (int, optional): Max output tokens. Defaults to 100.
temperature (float, optional): Sampling temperature. Defaults to 0.1.
Returns:
dict: JSON with classification, scores, and error (if any).
"""
Docstrings are used to document how code functions.
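Docstrings are also accessible at runtime, which is what `help()` displays. A quick illustration with a trivial example function:

```python
def greet(name: str) -> str:
    """Return a greeting for the given name."""
    return f"Hello, {name}!"

# The docstring is stored on the function object itself.
print(greet.__doc__)   # → Return a greeting for the given name.
help(greet)            # pretty-prints the signature and docstring
```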
This code can also be used as a module. For instance, you can import claude_classify from iab_identifier.py, pass the required arguments, and utilize it.
Here, we define a conditional statement, if __name__ == "__main__":, to check if the program is being run as a script. If so, the code within the conditional block is executed.
To use this, set your API key in the API_KEY variable and specify the model name in the MODEL_NAME variable. For sample_post, you can provide either a single post or a list of posts and iterate over them.
Below is an example using three posts, one positive, one neutral, and one negative, sourced from the following dataset:
https://github.com/0xHamy/minimal_scraper/blob/main/datasets/base.csv
# Example usage
if __name__ == "__main__":
    API_KEY = "sk-ant-apixxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    MODEL_NAME = "claude-3-5-sonnet-20241022"

    sample_posts = [
        "Selling access to Capital Freight Co. (Texas-based logistics) \nRevenue: ~$900M \nAccess: Full RDP, domain admin via svc_x backup creds \nExtras: Internal invoices, mapped drives, Outlook web creds \nEDR: minimal – Defender only \nPrice: 0.6 BTC firm, BTC only, DM me quick",
        "Private obfuscator for Python tools \nRandom junk + encryption \nFUD for most stealers \n0.012 BTC",
        "Selling 150k+ Netflix logins from Latin America \nMost accounts active, pulled from private stealer logs \nNo warranty – buy at your own risk \n0.02 BTC per 1k – discounts on bulk \nDM for sample"
    ]

    for idx, post in enumerate(sample_posts, start=1):
        result = claude_classify(
            api_key=API_KEY,
            model_name=MODEL_NAME,
            post_content=post,
            max_tokens=100,
            temperature=0.1
        )
        print(f"API response for post number {idx}:\n")
        print(json.dumps(result, indent=4))
        print(60 * '*')
Here is the output:
-> % python3 iab_identifier.py
API response for post number 1:

{
    "classification": "Positive",
    "scores": {
        "positive": 0.95,
        "neutral": 0.04,
        "negative": 0.01
    }
}
************************************************************
API response for post number 2:

{
    "classification": "Neutral",
    "scores": {
        "positive": 0.05,
        "neutral": 0.85,
        "negative": 0.1
    }
}
************************************************************
API response for post number 3:

{
    "classification": "Negative",
    "scores": {
        "positive": 0.05,
        "neutral": 0.15,
        "negative": 0.8
    }
}
************************************************************
This approach is effective and fast. While the model may occasionally confuse neutral and negative posts, it consistently identifies positive posts accurately, which is what matters most: general cyber-crime activity tends to fall into the neutral or negative categories, while initial access sales are reliably distinguished.
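With a labeled dataset such as base.csv above, this claim can be checked quantitatively. A minimal sketch, assuming the predicted classifications have already been collected from `claude_classify` (the example values below are illustrative, not real API output):

```python
def accuracy(predictions: list, labels: list) -> float:
    """Fraction of posts where the predicted class matches the ground-truth label."""
    assert len(predictions) == len(labels)
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# Illustrative values, not real API output:
preds = ["Positive", "Neutral", "Neutral"]
truth = ["Positive", "Neutral", "Negative"]
print(accuracy(preds, truth))  # 2 of 3 correct
```

Tracking accuracy per class (rather than overall) will show whether the neutral/negative confusion noted above actually affects your dataset.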