# Yuval's Tech Blog

> A technical blog covering AI, LLMs, software development, and best practices. This file provides content for LLMs to understand this website. See https://llmstxt.org/ for more information.

## Blog Posts

### Expert Trajectory RAG: Claude Code-Level Quality with Internal Models
URL: https://yuval.coffee/posts/expert-trajectory-rag/
Date: 2025-12-18

How to achieve Claude Code-level quality using internal coding models?

**Disclosure**: This approach fits specific scenarios and requires significant pre-computation (tokens).

## When to Use This

This is for you if:

- You have code you **cannot share** with public models; AND/OR
- You **must work locally**; AND
- You want SOTA results (assume SOTA = Claude Code)

## Solution TL;DR

Assume you want to install "Langfuse" locally, and you can't use Claude Code/ChatGPT/public models. The flow is basically RAG over thousands of possible scenarios:

1. **Ask Claude Code**: "How to install Langfuse locally?"
2. **Save** the markdown file
3. **Introduce change request**: "How to change ports of services?"
4. **Ask Claude Code** to fix and validate
5. **Save tutorial** with metadata:
   - Keywords: `ports`, `mapping`, `dockers`
   - Description: "changing ports of services"
6. **Introduce change request**: "How to use ELK with this setup?"
7. **Ask Claude Code** to relate, save tutorial:
   - Keywords: `logs`, `monitoring`, `devops`
   - Description: "Using ELK for logs"
8. **Rinse and repeat**
9. **Push everything** into RAG
10. **Let your internal model** search over the very detailed instructions

## What This Enables

- Map your existing technologies
- Create thousands of such tutorials
- Given a problem, search thousands of working, accurate, verified tutorials

## Tutorial Structure

Each tutorial should have:

- **Basic keywords** and a **short description** of the problem
- This enables efficient RAG + keyword-based hybrid search over problem solutions

## Q&A

**Q:** How to prevent Information Leakage?

**A:** Assume you have a system using Python 2.7 with a very specific library for some proprietary task. If you ask Claude: *"I'm using Python 2.7 with lib XYZ, please help me fix bug ABC"* - this is information leakage. The solution is to mass-generate thousands of solutions to virtually **all** possible combinations. Your actual query gets lost in the noise.

**Q:** Legal stuff? Is model distillation illegal?

**A:** Distilling models to create new models is prohibited under the terms of service of the major LLM providers. However, creating a RAG library ("Expert Trajectory RAG") is **not** prohibited, AFAIK.

**Q:** Why do we need keywords AND a description for each scenario?

**A:** Hybrid search often gives better results than vector search alone.

**Q:** Will this work in real life?

**A:** Yes. See this real-life example: [PostHog Issue #38494](https://github.com/PostHog/posthog/issues/38494#issuecomment-3656289544). I installed PostHog locally, and Claude Code fixed an issue. It wasn't 100% autonomous - it was partly instructed by me on how to progress. But with some simple agentic loops, I believe this can be fully autonomous.

---

### "Mini-Firewall" Over EC2 Machines, AKA use Lambda function to update SecGroup with your IP address
URL: https://yuval.coffee/posts/mini-firewall-ec2-lambda-ip/
Date: 2024-12-04

Tl;Dr — in some setups, it makes sense to allow connections to an EC2 machine only from your (home laptop) IP address. If you have a dynamic IP, you can use a Lambda function (+ API GW) to update the IP in the security group automatically.
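To make the idea concrete: a minimal sketch, using boto3, of what such a Lambda handler could look like. The security group ID and port are placeholder assumptions, and the event keys follow the API Gateway body-mapping template shown later in this post; the post's actual _change_secgroup_lambda.py_ may differ.

```python
import boto3

SG_ID = "sg-0123456789abcdef0"  # assumption: the security group guarding your EC2 machine
PORT = 22                       # assumption: we only manage the SSH rule

def update_security_group(ip_address):
    """Replace the allowed ingress IP on PORT with the given address."""
    ec2 = boto3.client("ec2")
    sg = ec2.describe_security_groups(GroupIds=[SG_ID])["SecurityGroups"][0]
    # Revoke the existing rules for this port, then allow only the new /32.
    for perm in sg["IpPermissions"]:
        if perm.get("FromPort") == PORT:
            ec2.revoke_security_group_ingress(GroupId=SG_ID, IpPermissions=[perm])
    ec2.authorize_security_group_ingress(
        GroupId=SG_ID,
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": PORT, "ToPort": PORT,
            "IpRanges": [{"CidrIp": f"{ip_address}/32"}],
        }],
    )

def lambda_handler(event, context):
    # Key names follow the API Gateway body-mapping template shown later in this post.
    action = event["body-json"]["action"]
    caller_ip = event["context"]["source-ip"]
    new_ip = caller_ip if action == "enable" else "8.8.8.8"  # "disable" parks the rule on Google DNS
    update_security_group(new_ip)
    return {"status": "ok", "allowed_ip": new_ip}
```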
See snippet: ![](https://cdn-images-1.medium.com/max/1024/1*RNbPE-t68m5PiP6oydi-Qg.png) **“Mini-Firewall”? Why?** So the idea is to get an extra layer of security and allow connection to EC2 machines only from a specific IP address. Create a new machine, create a new security group, and allow connection only from the specific IP address. So far, so good. However, what happens when the IP changes? When IP changes, you need to update the security group — login into AWS console, find the security group, update the IP address. This can be a bit annoying. So there’s another option — update the security group automatically, using Lambda function. **Lambda function** Steps: Create a Lambda function, which gets IP address and changes the security group. Create API Gateway, which calls the Lambda function. Add 2 links to your bookmarks — enable and disable access * Enable access — call the “Enable” Lambda function which changes the IP address to your IP address (eg caller-IP address) * Disable access — call the “Disable” Lambda function which changes the IP address to 8.8.8.8 (eg Google DNS…) See _change_secgroup_lambda.py_ for the code. It was written for Python 3.8 originally, but should work with newer versions.. We have _lambda_handler_ which is the function being called by the Lambda infrastructure; and we have _update_security_group_ which is the function which actually updates the security group. **Manual Installing** After installing Lambda function, go to Configuration -> Triggers. And then click on “Add Trigger” and add API gateway function for the trigger. Then go to API Gateway, and create a new API. Then create a new resource, and then create a new method. Then go to “Integration Request” and set “Body Mapping Templates” to “application/json” and add the following: { "action": "enable", "name": "$input.params('name')" } Then go to “Method Response” and add 200 response. Then go to “Integration Response” and add 200 response. Then go to “Integration Request” and add the following: { "body-json" : { "action" : "enable", "name" : "$input.params('name')" }, "context" : { "account-id" : "$context.identity.accountId", "api-id" : "$context.apiId", "api-key" : "$context.identity.apiKey", "authorizer-principal-id" : "$context.authorizer.principalId", "caller" : "$context.identity.caller", "cognito-authentication-provider" : "$context.identity.cognitoAuthenticationProvider", "cognito-authentication-type" : "$context.identity.cognitoAuthenticationType", "cognito-identity-id" : "$context.identity.cognitoIdentityId", "cognito-identity-pool-id" : "$context.identity.cognitoIdentityPoolId", "http-method" : "$context.httpMethod", "stage" : "$context.stage", "source-ip" : "$context.identity.sourceIp", "user" : "$context.identity.user", "user-agent" : "$context.identity.userAgent", "user-arn" : "$context.identity.userArn", "request-id" : "$context.requestId", "resource-id" : "$context.resourceId", "resource-path" : "$context.resourcePath" } } go to “Method Execution” and click on “Test” and then “Test” again. go to “Method Execution” and click on “Deploy API” and then “Deploy” again. go to “Stages” and click on “Prod” and then “Invoke URL” and copy the URL. go to “Resources” and click on “Actions” and then “Create Method” and select “GET” and click on the checkmark. go to “Method Execution” and click on “Test” and then “Test” again. **Automatic Installing, AKA productonize code using ChatGPT/Bard** So after we created the Lambda and API manually, we would like to “productionize” it, e.g. 
create code to create these things. We use the AWS CLI to inspect the existing Lambda and API, and then write the code that recreates them.

    # let's get the APIs in the region, then the resource for the relevant one (ID pg6hm5xzu0),
    # then the 'enable' method for that resource (ID jiv1tt)
    aws apigateway get-rest-apis --region us-west-2
    aws apigateway get-resources --rest-api-id pg6hm5xzu0 --region us-west-2
    aws apigateway get-method --rest-api-id pg6hm5xzu0 --resource-id jiv1tt --http-method GET --region us-west-2

Now, after we gave ChatGPT/Bard the JSONs to reproduce, we have commands to reproduce it:

    # create
    aws apigateway create-rest-api --name 'add_rule_to_sg-yuval-reprod' --description 'Created by AWS Lambda' --endpoint-configuration types=REGIONAL --api-key-source HEADER --region us-west-2

    # compare
    aws apigateway get-resources --rest-api-id pg6hm5xzu0 --region us-west-2
    aws apigateway get-resources --rest-api-id 3p06imz4sc --region us-west-2

And we can continue to the next steps: create the method, etc. We could also go further and use Terraform or CloudFormation to create the API and Lambda.

**FAQ**

**Q**: Why not use a VPN? E.g., the developer connects to an AWS VPN, and the policy allows connections only from the VPN?

**A**: This is a good solution for an SMB/enterprise, but for a solo developer/hobbyist just playing around, it's overkill.

**Q**: There is no need for authentication when calling the API (e.g. when visiting the URLs)?

**A**: Correct, this is by design; we exchange security for convenience. Worst case, we have another layer of security — the SSH certificate required for logging in — and this is the main layer of protection.

---

### Batch Processing in SQLite: A Deep Dive into Database Field Updates
URL: https://yuval.coffee/posts/batch-processing-sqlite-updates/
Date: 2024-12-04

### **Batch Processing in SQLite: How to update [efficiently] a table with millions of records**

![](https://cdn-images-1.medium.com/max/861/1*4k4P0TM9WAsKUcgTmFNYyw.png)

sample run, update in batches from last id

Tl;Dr — if you want to bulk-update a database table, you can do it in batched iterations. It takes more total time, but holds shorter locks on the database. This matters a lot in SQLite3, while other database engines have mechanisms that mitigate the issue (e.g. MVCC). In addition, to avoid scanning the table in each iteration, you should save the last id of the updated field. For full code see:

Let's say you have a SQL database, and you would like to update a field in bulk, e.g. reset it to null. What happens if you have thousands of records? Hundreds of thousands? Millions of records? You can end up locking the DB.

Some notes:

1. THIS MAINLY SPEAKS ABOUT SQLITE3; in PostgreSQL and other advanced DBs, we have some kind of versioning system, so we don't really lock the entire table, only some records. E.g. — **MVCC** in Postgres. In SQLite3, the locking issue is more present.
2. Between each run, we update random rankings.
3. Between each run, we **must purge the page cache**, in order to prevent consecutive runs from being faster due to caching (a minimal sketch of this purge is shown right after these notes).
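Note 3 in practice: a minimal sketch of purging the OS page cache between runs, assuming a Linux host where the benchmarking user can sudo. This sketch is mine, not code from the post.

```python
import subprocess

def drop_page_cache():
    """Flush dirty pages and drop the Linux page cache, so each run starts cold."""
    subprocess.run(["sync"], check=True)
    # Writing "3" to drop_caches frees the page cache plus dentries/inodes; needs root.
    subprocess.run(
        ["sudo", "tee", "/proc/sys/vm/drop_caches"],
        input=b"3\n",
        stdout=subprocess.DEVNULL,
        check=True,
    )

# call drop_page_cache() between benchmark runs
```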
**Creating and populating 20M records** Let’s create and populate table: CREATE TABLE IF NOT EXISTS domains ( id INTEGER PRIMARY KEY AUTOINCREMENT, domain_name TEXT NOT NULL, domain_ranking INTEGER def populate_data(self, num_records: int = 10_000_000): with sqlite3.connect(self.db_path) as conn: batch_size = 10000 for i in range(0, num_records, batch_size): batch = [ (f"domain_{j}", random.randint(1, 1000000)) for j in range(i, min(i + batch_size, num_records)) ] conn.executemany( "INSERT INTO domains (domain_name, domain_ranking) VALUES (?, ?)", batch ) **Anti-pattern: update row by row** We update row by row; Note we didn’t go full-anti pattern; We still commit every 1000 records and not after each recod; committing after each record would be the worst possible option to do. def reset_row_by_row(self): with sqlite3.connect(self.db_path) as conn: cursor = conn.execute("SELECT id FROM domains WHERE domain_ranking IS NOT NULL") for (id_,) in cursor: conn.execute( "UPDATE domains SET domain_ranking = NULL WHERE id = ?", (id_,) ) if total_reset % 1000 == 0: conn.commit() **First method: using 1 SQL statement** This approach is simple and also the quickest; however, it locks the table during execution; def reset_all(self): with sqlite3.connect(self.db_path) as conn: conn.execute("UPDATE domains SET domain_ranking = NULL") **Second method: Using loop + batch, eg reset in chunks** This approach looks better; while True, each iteration will reset 10k records; This will still lock the table, but for a shorter period of time. Total time for the whole table will be longer, but each lock time will be much shorter. However, **each iteration needs to scan the table** to find first relevant record. def reset_in_chunks(self, chunk_size: int = 10_000): with sqlite3.connect(self.db_path) as conn: while True: cursor = conn.execute(f""" UPDATE domains SET domain_ranking = NULL WHERE id IN ( SELECT id FROM domains WHERE domain_ranking IS NOT NULL LIMIT {chunk_size} ) """) if cursor.rowcount == 0: break conn.commit() **Improvement: Save last id and start from this id** In the last example, we would still have to scan the table during each iteration; Yet, we may have many pages in the cache; and we might as well have an index for the domain_ranking field; However, in cases where we don’t have index, we may save the index of the last updated record; and start from this index, in order to prevent scanning. def reset_with_tracking(self, chunk_size: int = 10000): last_id = 0 with sqlite3.connect(self.db_path) as conn: while True: cursor = conn.execute(f""" UPDATE domains SET domain_ranking = NULL WHERE id IN ( SELECT id FROM domains WHERE id > {last_id} AND domain_ranking IS NOT NULL ORDER BY id LIMIT {chunk_size} ) RETURNING id """) updated_ids = cursor.fetchall() if not updated_ids: break last_id = max(id[0] for id in updated_ids) conn.commit() For conclusion, when working with DBs, and performing heavy-duty tasks, always have in mind the locking mechanism and your DB access patterns. --- ### Fine-Tuning Bert using LoRA, hosting on Cloudflare using Cloudflare AI Workers URL: https://yuval.coffee/posts/fine-tuning-bert-lora-cloudflare/ Date: 2024-11-12 Let’s say you have a website and you would like to add content filtering to it. So yes, you could use OpenAI Content Moderation API, but what if you want your own solution? I did a quick POC with Cloudflare (“CF” now on) Workers + CF AI, to test if I can create and serve such model. 
CF offers AI workers, which are a closed set of models _*and fine-tunes*_ on these models. So, we can take a BERT model, fine-tune using LoRA, and serve via CF AI API. I guess the closed set of base models because of the serving optimizations, which allows to serve many in only fraction of the required resources. **Setup** There are [many content moderation datasets](https://github.com/fcakyon/content-moderation-deep-learning), but for this example let’s use well-known the Spam/No Spam dataset. **Installing basic libs and setting up.** Logging-in to Clouflare — we have 2 issues: #1 — The console asks for consent regarding telemetry — so we use `!yes` for that; Issue #2 — as part of the Oauth auth, the browser redirects to localhost, which is not good.. So we log-in using Cloudflare API Token, and not browser Oauth flow. # install some libs; use node.js 18.x (LTS), and verify !pip install transformers torch pandas peft datasets numpy scikit-learn !apt-get remove -y nodejs npm !curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash - !node --version !npm --version !npm install -g wrangler !which wrangler CLOUDFLARE_API_TOKEN = "...." os.environ['CLOUDFLARE_API_TOKEN'] = CLOUDFLARE_API_TOKEN !wrangler --version !yes | wrangler whoami **Basic Fine-Tuning** Here we have basic fine-tuning code; which will take _bert-base-uncased_ and fine-tune on the Spam/No-spam dataset. import pandas as pd import numpy as np from sklearn.model_selection import train_test_split import torch from transformers import ( AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding ) from peft import get_peft_model, LoraConfig, TaskType from datasets import Dataset import os class SpamDetectorTrainer: def __init__(self, model_name="bert-base-uncased"): self.model_name = model_name self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained( model_name, num_labels=2 ) def load_data(self, url="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"): """Load and prepare the SMS spam dataset""" os.system(f"wget {url} -O smsspamcollection.zip") os.system("unzip smsspamcollection.zip") df = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['label', 'message']) df['label'] = df['label'].map({'ham': 0, 'spam': 1}) train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42) self.train_dataset = Dataset.from_pandas(train_df) self.eval_dataset = Dataset.from_pandas(eval_df) return self.train_dataset, self.eval_dataset def preprocess_data(self): """Tokenize and prepare datasets""" def tokenize_function(examples): return self.tokenizer( examples['message'], truncation=True, padding=True, max_length=128 ) self.tokenized_train = self.train_dataset.map(tokenize_function, batched=True) self.tokenized_eval = self.eval_dataset.map(tokenize_function, batched=True) self.tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label']) self.tokenized_eval.set_format('torch', columns=['input_ids', 'attention_mask', 'label']) return self.tokenized_train, self.tokenized_eval def setup_lora(self, r=8, alpha=32, dropout=0.1): """Configure and apply LoRA""" lora_config = LoraConfig( task_type=TaskType.SEQ_CLS, inference_mode=False, r=r, lora_alpha=alpha, lora_dropout=dropout, target_modules=['query', 'value'] ) self.lora_model = get_peft_model(self.model, lora_config) return self.lora_model def train(self, output_dir="./results", epochs=3, batch_size=16): 
"""Train the model""" training_args = TrainingArguments( output_dir=output_dir, evaluation_strategy="steps", eval_steps=500, save_strategy="steps", save_steps=500, learning_rate=2e-5, per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, num_train_epochs=epochs, weight_decay=0.01, logging_dir='./logs', logging_steps=100, load_best_model_at_end=True, metric_for_best_model="loss", save_total_limit=3, ) data_collator = DataCollatorWithPadding(tokenizer=self.tokenizer) trainer = Trainer( model=self.lora_model, args=training_args, train_dataset=self.tokenized_train, eval_dataset=self.tokenized_eval, data_collator=data_collator, ) # Train and evaluate train_result = trainer.train() eval_result = trainer.evaluate() return train_result, eval_result def save_model(self, path="lora_spam_adapter"): """Save the LoRA adapter""" self.lora_model.save_pretrained(path) self.tokenizer.save_pretrained(path) trainer = SpamDetectorTrainer() train_dataset, eval_dataset = trainer.load_data() trainer.preprocess_data() trainer.setup_lora() train_result, eval_result = trainer.train() trainer.save_model() **Hyper-Parameters Search** There are several parameters worth checking out; I’m not an expert LoRA fine-tuning guy, but read some [blog](https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms) [posts](https://www.determined.ai/blog/lora-parameters) about it. We could do a grid search and let the machine churn a little bit to squeeze some more performance. eg, **Rank** , which is the rank of the inner matrix; more compute work, but more information caputred; and also learning-rate. results = [] learning_rates = [1e-5, 2e-5, 5e-5] lora_ranks = [4, 8, 16] for lr in learning_rates: for rank in lora_ranks: print(f"Training with learning_rate={lr}, lora_rank={rank}") # Reinitialize the model and LoRA setup trainer.model = AutoModelForSequenceClassification.from_pretrained( trainer.model_name, num_labels=2 ) trainer.setup_lora(r=rank) # Train the model train_result, eval_result = trainer.train( epochs=3, batch_size=16, learning_rate=lr ) train_loss = train_result.training_loss eval_loss = eval_result['eval_loss'] results.append({ 'learning_rate': lr, 'lora_rank': rank, 'train_loss': train_loss, 'eval_loss': eval_loss }) print(f"Results: train_loss={train_loss:.4f}, eval_loss={eval_loss:.4f}") **Hyper-Parameters Search — Results** ![Loss, Learning Rate and Rank](https://cdn-images-1.medium.com/max/871/1*I1K0gyqIASpcPaLQAfjvcA.png) **Push to Cloudflare AI** Here we actually understand we can **not** push to Cloudflare, because of wrong selection of base model; eg Cloudflare AI worker with LoRA Fine-tuning have very specific list of based models, and Bert is not one of them: @cf/meta-llama/llama-2–7b-chat-hf-lora @cf/mistral/mistral-7b-instruct-v0.2-lora @cf/google/gemma-2b-it-lora @cf/google/gemma-7b-it-lora We can do training again, this time using [AutoTrain](https://developers.cloudflare.com/workers-ai/tutorials/fine-tune-models-with-autotrain/) . And serve using [this code sample](https://developers.cloudflare.com/workers-ai/get-started/workers-wrangler/), with adding finetune id. export interface Env { // If you set another name in wrangler.toml as the value for 'binding', // replace "AI" with the variable name you defined. 
AI: Ai; } export default { async fetch(request, env): Promise { const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { prompt: "What is the origin of the phrase Hello, World", }); return new Response(JSON.stringify(response)); }, } satisfies ExportedHandler; **Conclusion** We saw how to **fine-tune Bert model** using LoRA. We [almost] saw wow to deploy (specific base models) into Cloudflare AI Workers, after fine-tuning. We tested several hyper-parameters for training, to get optimized model. --- ### BFS, DFS, PageRank, AKA — How to run embeddings only on important parts of a website URL: https://yuval.coffee/posts/bfs-dfs-pagerank-embeddings/ Date: 2024-09-26 ### BFS, DFS, PageRank, AKA — How to run embeddings only on important parts of a website Depth-First Search (DFS) and Breadth-First Search (BFS) are two common algorithms used in web scraping to traverse websites. Usually, you would want to use BFS; and this in order [no pun intended] to get to important pages first. In short, DFS will explore as far as possible along each branch before backtracking, while BFS will explore neighbor nodes at the present depth before moving to nodes at the next depth level. **BFS ** First we visit Homepage, then Products, Blog and About pages, and then children of each page, etc.. ![](https://cdn-images-1.medium.com/max/1024/1*vPGooJnoa0gMAODIbds9_w.png) **DFS ** First we visit Homepage, then Products, then children of Products, then Blog, **then we’re “stuck” in all blogposts [children of Blog page].** Instead of visiting About page, after the Blog page. **Creating this pre-coding-LLMs era, would take lots more time!** Using PageRank algorithm, we see that we visit: Homepage, Products, and then Category 1 and Category 2 pages. This is of course just for illustration — if the internal link structure “implies us” **[read: “implies Google”**] that these pages are more important, we would like to visit them first. ![](https://cdn-images-1.medium.com/max/1024/1*YhPs3PTyMKettPJqxZNdmQ.png) **Let’s use Claude/ChatGPT/Cursor; let’s use PageRank** Using Cursor, it took me about 10 minutes to add PageRank class. Created a class which: \- Accepts a page and links \- Adds to internal graph database [could use networkx, but for this case used a simple dict] \- Calculate PageRank \- Suggests next page to visit class WebsitePagerank { constructor(initialUrl) { // Note - for numerical and other reasons, the invariant that total PageRank of all pages is 1, does not hold here this.initialUrl = this.normalizeURL(initialUrl); this.pages = new Map(); this.links = new Map(); this.crawledPages = new Set(); this.pages.set(this.initialUrl, 10); } addPage(url) { url = this.normalizeURL(url); if (!this.pages.has(url)) { this.pages.set(url, url === this.initialUrl ? 
10 : 1); } } addLink(sourceUrl, destUrl) { sourceUrl = this.normalizeURL(sourceUrl); destUrl = this.normalizeURL(destUrl); this.addPage(sourceUrl); this.addPage(destUrl); if (!this.links.has(sourceUrl)) { this.links.set(sourceUrl, new Set()); } this.links.get(sourceUrl).add(destUrl); } markAsCrawled(url) { url = this.normalizeURL(url); console.log('markAsCrawled', url); this.crawledPages.add(url); } calculatePageRank(iterations = 10, dampingFactor = 0.85) { const numPages = this.pages.size; for (let i = 0; i < iterations; i++) { const newRanks = new Map(); for (const [page] of this.pages) { let sum = 0; for (const [sourcePage, destPages] of this.links) { if (destPages.has(page)) { sum += this.pages.get(sourcePage) / destPages.size; } } newRanks.set(page, (1 - dampingFactor) / numPages + dampingFactor * sum); } this.pages = newRanks; } } getNextUrl() { return Array.from(this.pages.entries()) .filter(([url]) => !this.crawledPages.has(url)) .sort((a, b) => b[1] - a[1])[0]?.[0]; } normalizeURL(url) { url = url.split('#')[0]; if (url.endsWith('/')) { url = url.slice(0, -1); } return url; } dumpPageRankInfo(filePath) { const data = Array.from(this.pages.entries()).map(([url, rank]) => ({ url, rank, outgoingLinks: Array.from(this.links.get(url) || []) })); fs.writeFileSync(filePath, JSON.stringify(data, null, 2)); } } ![](https://cdn-images-1.medium.com/max/1024/1*E9jB_DNezLJowRmC7l_8gw.png) **Conclusion** We saw toy examples for BFS, DFS, and PageRank algorithms. When website has 100+ pages, this becomes a true pain. **Q** : If we need to browse the pages to build the PageRank, we don’t save anything, right? **A** : Correct, if we look about bandwidth or crawled pages; **However, if we would like to embed** the pages afterward, we surely want to embed only the most important pages. --- ### There are more than 2 UUID types — UUIDv4, 7, ULID, etc… URL: https://yuval.coffee/posts/uuid-types-v4-7-ulid/ Date: 2024-02-17 ### There are more than 2 UUID types — UUIDv4, 7, ULID, etc… **Tl;dr** — UUIDv4, UUIDv7, ULID, Base64, Base58, Base85, HashIDs (hiding IDs on the frontend), libs compatibility between different SDKs. So, you’ve all heard about UUIDv4. It’s just a very random collection of bits, represented nicely. Let’s review some other UUIDs: ### ULID / UUIDv7 UUIDv4 has a major issue — it would give you some issues when you try to order by UUIDv4. Let’s say you have a database table, with id which is just a simple int, and AUTO_INCREMENT. first record will be id=1, second record will be id=2, etc.. Now, when you do something like SELECT * FROM my_table ORDER BY id, results will be sorted, but more importantly - the results will be close to each other. e.g., if you iterate in chunks of 100 for example, you would not jump all over the database, but results should be pretty close to each other. What happens with UUIDv4? There’s no really point in sorting, because you just sort a bunch of random numbers. In addition, you will “jump” all over the database when reading records. And if your DB is big enough, results you read will page out. ### So what about ULID/ UUIVv7? So, ULID / UUIDv7 are 2 protocol which offer to prefix the UUID with time signature; eg the IDs are always increasing. This way, for example, if you have table with ULID/UUIDv7 as an index, you can run SELECT * FROM my_table ORDER BY id and it would make sense. ### Problems with ULID/UUIv7? So, one thing is adoption; another problem which is more problematic, is information leakage — given an ID, we can know when it was created.. 
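To make the time-prefix idea, and the leakage it implies, concrete: a minimal, illustration-only sketch of a UUIDv7-style ID built from a 48-bit millisecond timestamp plus random bits. This is my sketch, not a spec-compliant implementation; it deliberately skips the version/variant bits of the real spec.

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    # 48-bit unix-ms timestamp prefix + 80 random bits; version/variant bits
    # of the real UUIDv7 spec are deliberately skipped -- illustration only.
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    return uuid.UUID(int=(ms << 80) | rand)

ids = []
for _ in range(3):
    ids.append(uuid7_like())
    time.sleep(0.002)  # ensure distinct millisecond prefixes for this small demo

assert ids == sorted(ids)  # time-prefixed IDs sort by creation order

# ...and the flip side, the information leakage mentioned above:
created_ms = ids[0].int >> 80
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(created_ms / 1000)))
```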
### “Nano IDs”

This is just a summary of this great article — [The UX of UUIDs](https://unkey.dev/blog/uuid-ux). Go read it now.

UUIDs are not easy to copy; the "-" characters in the UUID prevent copying the whole string in one go. Look at what Stripe is doing — the key is just a random string, without dashes; in addition, it is prefixed with a key description. For example:

    STRIPE_LIVE_PUBLIC_KEY="pk_live_xUBcwUhe....."
    STRIPE_LIVE_SECRET_KEY="sk_live_gpTjnUwB....."
    STRIPE_TEST_PUBLIC_KEY="pk_test_CcfLsSzE....."
    STRIPE_TEST_SECRET_KEY="sk_test_WFnNSjpB....."
    DJSTRIPE_WEBHOOK_SECRET="whsec_LqqRWEKkd....."

We can copy the key(s) with a double-click, and the key also carries lots more information. How come? The answer is that a UUID (v4, for example) is represented in hexadecimal; Stripe IDs are represented in a bigger base. The bigger the base, the shorter the string for the same amount of data.

### Base64 vs Base58

We all know about Base64 (see FAQ if not) — but what is Base58? Base58 is just like Base64, but with the confusing characters omitted: I and l, O and 0 are removed to avoid confusion, and so are + and /.

### Base85???

Yes, there is also Base85. Let's say:

1. You work for a software company which distributes signed .exe files to customers.
2. You want the filename to contain the URL the executable should connect to on first run.
3. You can't add this URL to the file content, since you would have to sign many different files (*).
4. The URL should be part of the filename.
5. The filename should be as short as possible.

So — use Base85 to encapsulate the URL; the bigger the base, the shorter the filename string. This way you get a short filename.

### Hash IDs

Let's say you have a SaaS, and you give each new user an ID, with a view of the format https://my-saas.com/users/123 (where 123 is the user_id). People can then [estimate](https://en.wikipedia.org/wiki/German_tank_problem) the number of users on your website by creating a new user and checking the id they got. So — how can you hide the real user_id from the user itself?

One option, of course, is to use a random id, but then we get all the issues of UUIDs (a UUID is just a special case of a random ID). Another option is to encrypt the ID using some key; and this is exactly what [Sqids (formerly Hashids)](https://sqids.org/python?hashids) does! Using a secret key (*), you can convert id->string and string->id. This way you can have something like https://my-saas.com/users/nVB, and convert nVB back to user_id 123 in your backend. Sample code:

    # Taken from: https://github.com/davidaurelio/hashids-python
    hashids = Hashids(salt='this is my salt 1')
    hashid = hashids.encode(123) # 'nVB'

    # and with a different salt:
    hashids = Hashids(salt='this is my salt 2')
    hashid = hashids.encode(123) # 'ojK'

### What can be the problem with HashIDs?

First, we should check which algorithm it uses, and make sure the encode/decode round trip returns the original id for all ids, i.e. that we can trust the lib (algorithm) to do the conversion without error. Another issue is that we might be bound to a specific implementation (and thus technology), unless we prove that results do not change when we switch libs.
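A minimal sketch of that round-trip sanity check, using the hashids-python lib from the sample above; the salt and the id range are arbitrary choices of mine.

```python
from hashids import Hashids

hashids = Hashids(salt="this is my salt 1")

# trust-but-verify: encode -> decode must round-trip for every id we care about
for user_id in range(100_000):
    encoded = hashids.encode(user_id)
    decoded = hashids.decode(encoded)  # returns a tuple, e.g. (123,)
    assert decoded == (user_id,), f"round-trip failed for {user_id}"
```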
Security review of the algorithm — the algorithm does some logic to [avoid generating the most common English curse words by never placing certain letters next to each other](https://github.com/davidaurelio/hashids-python?tab=readme-ov-file#curses-); this might spell trouble, entropy-wise..

#### Checking Cross-Language Consistency

What happens if the frontend uses HashIDs in Javascript but the backend uses Python/C++/Rust, for example?

### FAQ

**Q: Do we really need 128 bits as an ID? Isn't it too much? What are the odds of collisions?**

A: Some people claim this isn't really needed. Referring to the [Birthday Attack probability table](https://en.wikipedia.org/wiki/Birthday_problem), for 128 bits we need about 2.6×10¹⁸ keys in order to get a collision with 1% probability.

**Q: What is base64 for?**

A: Let's say you want to transfer information via text, e.g. you want to serialize info and send it to another program. Serializing means converting information to text, so you can send it from one program to another.

**Q: Why the "(*)" in "you would have to sign many different files"?**

A: There are some mechanisms to deal with it, e.g. signing the file except for a small part of metadata.

**Q: If we use a UUID with time as a prefix, what happens on daylight saving time?**

A: Nothing; the time is a unix epoch, which is always increasing.

**Q: Why do hash ID libs call the secret key "salt"? It's a secret, not a salt..**

A: The goal of hash IDs is to convert a number to a string and vice versa, i.e. supply a two-direction hash function. Thus, in order to change the hash result, we use a salt. In the algorithmic layer, this should indeed be called a "salt". In the Product/Marketing layer, it should be called a "secret".

**Q: In "Checking cross-language", why do we need a Dockerfile for the test?**

A: We really don't; we can do this one time to check the implementations we need and that's it. The Dockerfile is for demonstration purposes only.

### References

[UUIDs and poor index locality](https://buildkite.com/blog/goodbye-integers-hello-uuids)

[Benchmarking UUIDs and checking WAL](https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/)

_Originally published at_ [_https://dev.to_](https://dev.to/yuval1024/there-are-more-than-2-uuid-types-uuidv4-7-ulid-etc-1jg4) _on February 17, 2024._

---

### Python — PDB usage and reproducing program execution
URL: https://yuval.coffee/posts/python-pdb-usage-execution/
Date: 2022-11-11

### Python — PDB usage and reproducing program execution

So imagine you have a Python program, and you want to inspect some parameters during an error. There are [many](https://www.rookout.com/), [possible](https://sentry.io), [ways](https://docs.python.org/3/library/traceback.html) to do that; I'd like to talk about a basic one, which involves a debugger.

Just like GDB for C/C++, Python has PDB. PDB is a command-line debugger, which can be attached to a process or started from within the process. Just add the line import pdb; pdb.set_trace() and you will have a shell where you can communicate with the process. Needless to say, this is good only for CLI programs. Others, like servers, should use other solutions (Rookout etc., the PyCharm remote debugger, etc.).

Let's say we run a program which calls some_erroneous_function, and we want to know some value from this function:

main() -> foo() -> some_erroneous_function()

How can we know the value inside some_erroneous_function()?
Simple - add the following line: import pdb; pdb.set_trace()

Not being able to see the value of a, vs. being able to see it:

![](https://cdn-images-1.medium.com/max/859/0*c39IEUZKlIaVNXcW.png)Can't see value of a

![](https://cdn-images-1.medium.com/max/745/0*RlpAUWs3UMi47z3c.png)Do manage to see value of a

### What happens when program A runs program B?

When we have main() -> bar() -> cli_app_bar.py -> some_erroneous_function(), the import pdb; pdb.set_trace() trick simply doesn't work; we get a stuck process instead. This is because pdb opens in the child process, but the parent process is waiting for the child process to finish, so we're stuck. In this case, we should run the child process ourselves.

### Which parts are required to run a child process ourselves?

There are 2 parts which are required; one is obvious, the other is often forgotten!! The 2 parts are:

* Program name + command line arguments
* Environment variables!!
* (there's a 3rd part which is IPC messages, but it's very hard to mimic such behavior…)

Let's see how we capture this:

* Modify the program to save CLI arguments and env vars
* Run using the CLI args and env vars

### getting cmd + env vars

There are several methods; to get the env vars of a running process you can use cat /proc/46/environ | tr '\0' '\n' (replace 46 with the process id).

From within the Python process, we want to print env vars in a "ready to go" format, e.g. with the export prefix:

    with open('/tmp/params.txt', 'w') as fout:
        # print all env vars
        for k, v in os.environ.items():
            fout.write('export "%s"="%s"\n' % (k, v))

And then diff with the current env vars:

    echo "creating bar before"
    cat << EOF > create_before.py
    #!/usr/bin/python3
    import os
    with open('/tmp/params.before.txt', 'w') as fout:
        for k, v in os.environ.items():
            fout.write('export "%s"="%s"\n' % (k,v))
    EOF
    python create_before.py

    echo "print some stats"
    wc -l /tmp/params.txt /tmp/params.before.txt

    echo "get keys"
    cat /tmp/params.txt | awk -F '=' ' { print $1 } ' | sort > /tmp/params.keys.txt
    cat /tmp/params.before.txt | awk -F '=' ' { print $1 } ' | sort > /tmp/params.before.keys.txt
    wc -l /tmp/params.keys.txt /tmp/params.before.keys.txt
    diff /tmp/params.keys.txt /tmp/params.before.keys.txt

And we got the newly added env var key, EXTRA:

    YuvShell $ diff /tmp/params.keys.txt /tmp/params.before.keys.txt
    1d0
    < export "EXTRA"

### Questions

**Q: What is the "YuvShell"?**

A: It's just me editing the ~/.bashrc and changing the PS1 (Prompt String) var;

![](https://cdn-images-1.medium.com/max/880/0*g0b1WR_40NgBvl15.png)~/.bashrc to change shell prefix

**Q: What is the difference between** cat some_file.txt | wc -l **and** wc -l some_file.txt**?**

A: With cat + wc we use a pipe to transfer data from the cat output to the wc input; with wc only, we don't use the pipe.
Let's create a big file from urandom, and look at the time output of both options:

    cat /dev/urandom | base64 | head -c 1GB > /tmp/random_1GB_file.txt
    time cat /tmp/random_1GB_file.txt | wc -l
    time wc -l /tmp/random_1GB_file.txt

![](https://cdn-images-1.medium.com/max/564/0*xvSfSPw8Z3_6CZWW.png)

### Source Code

    #!/usr/bin/python3
    import argparse
    import subprocess
    import os
    from err_module import some_erroneous_function

    def foo():
        some_erroneous_function()

    def bar():
        my_env = os.environ.copy()
        my_env["EXTRA"] = 'True'
        cmd = "python cli_app.py --run"
        cmds = cmd.split(' ')
        stdout, stderr = subprocess.Popen(cmds, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=my_env).communicate()
        print("stdout [%s], stderr [%s]" % (stdout, stderr))

    def do_main():
        parser = argparse.ArgumentParser()
        parser.add_argument('--foo', dest='foo', default=False, action='store_true')
        parser.add_argument('--bar', dest='bar', default=False, action='store_true')
        args = parser.parse_args()
        if args.foo:
            foo()
        if args.bar:
            bar()

    if __name__ == '__main__':
        do_main()

    #!/usr/bin/python3
    import random

    def some_erroneous_function():
        print("some_erroneous_function:: enter")
        a = random.randint(1,100)
        import pdb; pdb.set_trace()  # adding PDB to open debugger
        raise Exception("some error")
        print("value of a: %d" % (a))

    #!/usr/bin/python3
    import sys, os
    from err_module import some_erroneous_function

    # code to save program execution parameters
    with open('/tmp/params.txt', 'w') as fout:
        # print all env vars
        for k, v in os.environ.items():
            fout.write('export "%s"="%s"\n' % (k, v))

    if __name__ == '__main__':
        if sys.argv[-1] == '--run' and os.environ.get('EXTRA') == 'True':
            some_erroneous_function()

_Originally published at_ [_https://dev.to_](https://dev.to/yuval1024/python-pdb-usage-and-reproducing-program-execution-mo6) _on November 11, 2022._

---

### Data Ingestion — Build Your Own “Map Reduce”?
URL: https://yuval.coffee/posts/data-ingestion-map-reduce/
Date: 2021-12-24

### Data Ingestion — Build Your Own “Map Reduce”?

Map-reduce, multi-process, Python profiling [using Py-Spy]

### Why map reduce

Let's say you work at Facebook; you have lots of data and probably need lots of map-reduce tasks. You will use mrjob/PySpark/Spark/Hadoop. You get the point — you need 1 framework to rule them all. You need a system: where will temp files be stored, API with the cloud, data security, multi-tenancy, etc. You need standards — standards among developers themselves, between developers and devops, etc.

Let's say, on the other hand, you're a solopreneur/small startup. Max a 3–4 developer team. You need things to work, and work fast. You don't have 10Ks of map-reduce jobs, but probably 1 or 2. You won't be using Hadoop, that's for sure. You might be using:

### Different approaches

### Linq

Not really map reduce per se, more like "SQL w/out an SQL engine". However, this adds the complexities of .NET to your environment; e.g. read the release notes and understand if you can run it on your [different OSes](https://docs.microsoft.com/en-us/dotnet/core/install/linux) (production, staging, developers' machines). Also — you need to learn C#: loading from files, different encodings, saving, iterators, etc. If you're not proficient with C#, this could be a one-time investment which may not be worth it.
### mrjob

Pros: Python-native lib; able to debug easily (using the [inline](https://mrjob.readthedocs.io/en/latest/runners-inline.html) runner); can run [locally](https://mrjob.readthedocs.io/en/latest/runners-local.html), e.g. multi-process on the local machine; can use Hadoop, Dataproc (it seems that [”Dataproc is a managed Spark and Hadoop service…”](https://cloud.google.com/dataproc/docs/concepts/overview)), etc. However, lots of moving parts and different configuration options.

### Custom made map-reduce

Let's go to the UCI Machine Learning website (2015 is on the phone..), choose some [dataset](http://archive.ics.uci.edu/ml/datasets/Bar+Crawl%3A+Detecting+Heavy+Drinking), and test.

Some notes: We don't need SHA-256, not even Base64; nothing bad will happen if the keys are not distributed perfectly evenly. We could take MMH3; googling "python murmurhash" gives 2 interesting results, and since both use [the](https://github.com/hajimes/mmh3/blob/master/MurmurHash3.cpp) [same](https://github.com/explosion/murmurhash/blob/master/murmurhash/MurmurHash3.cpp) cpp code, let's take the one with the [most stars](https://github.com/hajimes/mmh3). Other options would be to simply do (% NUM_SHARDS), or even shift right (but then the shard count must be a power of 2).

Mini setup script:

    # get data
    wget http://archive.ics.uci.edu/ml/machine-learning-databases/00516/os_scan/OS%20Scan_dataset.csv.gz
    gunzip 'OS Scan_dataset.csv.gz'
    wc -l all_accelerometer_data_pids_13.csv

    # create venv
    venv_path=/tmp/venv_mr/
    python3 -m venv ${venv_path}
    source ${venv_path}/bin/activate
    python3 -m pip install mmh3
    python3 -m pip install tqdm
    python3 -m pip install py-spy

And 2 Python test scripts:

    #!/usr/bin/env python3
    import mmh3
    import tqdm

    NUMBER_OF_SHARDS = 10

    id_to_file = {}
    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i] = open('/tmp/%s.shard.txt' % (i), 'w')

    def get_shard_index(id_, num_shards):
        return mmh3.hash(id_, signed=False) % num_shards

    with open(r'all_accelerometer_data_pids_13.csv', 'r') as fin:
        for line in tqdm.tqdm(fin, mininterval=1):
            id_ = line.split(',', 1)[0]
            id_to_file[get_shard_index(id_, NUMBER_OF_SHARDS)].write("%s\n" % (line))

    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i].close()

    #!/usr/bin/env python3
    import multiprocessing
    import mmh3
    import tqdm

    NUMBER_OF_PROCESSES = multiprocessing.cpu_count()-1
    NUMBER_OF_SHARDS = 10

    id_to_file = {}
    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i] = open('/tmp/%s.shard.txt' % (i), 'w')

    def _get_shard_index(id_, num_shards):
        return mmh3.hash(id_, signed=False) % num_shards

    def get_id_and_line(line):
        id_ = line.split(',', 1)[0]
        shard_id = _get_shard_index(id_, NUMBER_OF_SHARDS)
        return shard_id, line

    pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
    with open(r'all_accelerometer_data_pids_13.csv', 'r') as fin:
        iter_data = pool.imap(get_id_and_line, fin)
        for shard_id, line in tqdm.tqdm(iter_data, mininterval=1):
            id_to_file[shard_id].write("%s\n" % (line))

    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i].close()

Results: imap runs much slower; we can look at the it/sec from tqdm to see that:

    # test.py sample after 4 seconds:
    2801493it [00:04, 566075.99it/s]

    # test_imap.py sample after 4 seconds:
    73439it [00:04, 18754.44it/s]

We can see the non-imap version is ~30x faster!

### Q&A

Q: Why setup.sh and not a requirements.txt file?

A: This is not production code; it's aimed at quick reproducibility, not at pinning the exact same libs (e.g. for security etc.)

Q: Why MMH3 and not sha256?
A: This is not a security product, so we don't need a cryptographic hash; we just need a _nice distribution of keys_, and we want it to be **fast**.

Q: Why is imap slower than a single process?

A: Probably because the imap version has **lots of overhead due to IPC**; whatever we gain by offloading the (alleged) "heavy lifting" hash calculation to another process is erased by the IPC.

![](https://cdn-images-1.medium.com/max/880/0*z1odokAUxlqPKkMs.png)use [py-spy](https://github.com/benfred/py-spy) for Python profiling

Q: Why?

A: Using a process pool might be worth it if the task is more CPU bound; here, the **task is more IO bound**, and the overhead of the MMH hash doesn't justify it.

Q: Conclusion?

A: It depends. It also depends on the size of the file, and on the post-processing of each shard. Conclusion — test mrjob as well; it might have better IPC.

_Originally published at_ [_https://dev.to_](https://dev.to/yuval1024/data-ingestion-build-your-own-map-reduce-2j1h) _on December 24, 2021._

---
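A follow-up note on the imap result above: multiprocessing.Pool.imap accepts a chunksize argument that batches many lines per IPC message, which is the usual first knob to try before concluding that a process pool can't win. A minimal, unbenchmarked sketch against the same file (the chunksize value is an arbitrary guess of mine):

```python
import multiprocessing
import mmh3
import tqdm

NUMBER_OF_SHARDS = 10

def get_id_and_line(line):
    id_ = line.split(',', 1)[0]
    return mmh3.hash(id_, signed=False) % NUMBER_OF_SHARDS, line

if __name__ == '__main__':
    shard_files = {i: open('/tmp/%s.shard.txt' % i, 'w') for i in range(NUMBER_OF_SHARDS)}
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool, \
         open('all_accelerometer_data_pids_13.csv') as fin:
        # chunksize batches lines per worker message, amortizing the IPC cost of imap
        for shard_id, line in tqdm.tqdm(pool.imap(get_id_and_line, fin, chunksize=10_000), mininterval=1):
            shard_files[shard_id].write(line)
    for f in shard_files.values():
        f.close()
```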