# Yuval's Tech Blog

> A technical blog covering AI, LLMs, software development, and best practices. This file provides content for LLMs to understand this website. See https://llmstxt.org/ for more information.

## Blog Posts

### Expert Trajectory RAG: Claude Code-Level Quality with Internal Models
URL: https://yuval.coffee/posts/expert-trajectory-rag/
Date: 2025-12-18

How to achieve Claude Code-level quality using internal coding models?

**Disclosure**: This approach fits specific scenarios and requires significant pre-computation (tokens).

## When to Use This

This is for you if:

- You have code you **cannot share** with public models; AND/OR
- You **must work locally**; AND
- You want SOTA results (assume SOTA = Claude Code)

## Solution TL;DR

Assume you want to install "Langfuse" locally, and you can't use Claude Code/ChatGPT/public models. The flow is basically RAG over thousands of possible scenarios:

1. **Ask Claude Code**: "How to install Langfuse locally?"
2. **Save** the markdown file
3. **Introduce change request**: "How to change ports of services?"
4. **Ask Claude Code** to fix and validate
5. **Save tutorial** with metadata:
   - Keywords: `ports`, `mapping`, `dockers`
   - Description: "changing ports of services"
6. **Introduce change request**: "How to use ELK with this setup?"
7. **Ask Claude Code** to relate, save tutorial:
   - Keywords: `logs`, `monitoring`, `devops`
   - Description: "Using ELK for logs"
8. **Rinse and repeat**
9. **Push everything** into RAG
10. **Let your internal model** search over the very detailed instructions

## What This Enables

- Map your existing technologies
- Create thousands of such tutorials
- Given a problem, search thousands of working, accurate, verified tutorials

## Tutorial Structure

Each tutorial should have:

- **Basic keywords** and a **short description** of the problem
- This enables efficient RAG + keyword-based hybrid search over problem solutions

## Q&A

**Q:** How to prevent Information Leakage?

**A:** Assume you have a system using Python 2.7 with a very specific library for some proprietary task. If you ask Claude: *"I'm using Python 2.7 with lib XYZ, please help me fix bug ABC"* - this is information leakage. The solution is to mass-generate thousands of solutions to virtually **all** possible combinations. Your actual query gets lost in the noise.

**Q:** Legal stuff? Is model distillation illegal?

**A:** Distilling models to create new models is prohibited under the terms of service of the major LLM providers. However, creating a RAG library ("Expert Trajectory RAG") is **not** prohibited, AFAIK.

**Q:** Why do we need keywords AND a description for each scenario?

**A:** Hybrid search often gives better results than vector search alone.

**Q:** Will this work in real life?

**A:** Yes. See this real-life example: [PostHog Issue #38494](https://github.com/PostHog/posthog/issues/38494#issuecomment-3656289544). I installed PostHog locally, and Claude Code fixed an issue. It wasn't 100% autonomous - it was partly instructed by me on how to progress. But with some simple agentic loops, I believe this can be fully autonomous.

---

### "Mini-Firewall" Over EC2 Machines, AKA use Lambda function to update SecGroup with your IP address
URL: https://yuval.coffee/posts/mini-firewall-ec2-lambda-ip/
Date: 2024-12-04

Tl;Dr — in some setups, it makes sense to allow connections to an EC2 machine only from your (home laptop) IP address. If you have a dynamic IP, you can use a Lambda function (+ API GW) to update the IP in the security group automatically.
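To make the idea concrete: a minimal sketch, using boto3, of what such a Lambda handler could look like. The security group ID and port are placeholder assumptions, and the event keys follow the API Gateway body-mapping template shown later in this post; the post's actual _change_secgroup_lambda.py_ may differ.

```python
import boto3

SG_ID = "sg-0123456789abcdef0"  # assumption: the security group guarding your EC2 machine
PORT = 22                       # assumption: we only manage the SSH rule

def update_security_group(ip_address):
    """Replace the allowed ingress IP on PORT with the given address."""
    ec2 = boto3.client("ec2")
    sg = ec2.describe_security_groups(GroupIds=[SG_ID])["SecurityGroups"][0]
    # Revoke the existing rules for this port, then allow only the new /32.
    for perm in sg["IpPermissions"]:
        if perm.get("FromPort") == PORT:
            ec2.revoke_security_group_ingress(GroupId=SG_ID, IpPermissions=[perm])
    ec2.authorize_security_group_ingress(
        GroupId=SG_ID,
        IpPermissions=[{
            "IpProtocol": "tcp", "FromPort": PORT, "ToPort": PORT,
            "IpRanges": [{"CidrIp": f"{ip_address}/32"}],
        }],
    )

def lambda_handler(event, context):
    # Key names follow the API Gateway body-mapping template shown later in this post.
    action = event["body-json"]["action"]
    caller_ip = event["context"]["source-ip"]
    new_ip = caller_ip if action == "enable" else "8.8.8.8"  # "disable" parks the rule on Google DNS
    update_security_group(new_ip)
    return {"status": "ok", "allowed_ip": new_ip}
```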
See snippet: ![](https://cdn-images-1.medium.com/max/1024/1*RNbPE-t68m5PiP6oydi-Qg.png) **“Mini-Firewall”? Why?** So the idea is to get an extra layer of security and allow connection to EC2 machines only from a specific IP address. Create a new machine, create a new security group, and allow connection only from the specific IP address. So far, so good. However, what happens when the IP changes? When IP changes, you need to update the security group — login into AWS console, find the security group, update the IP address. This can be a bit annoying. So there’s another option — update the security group automatically, using Lambda function. **Lambda function** Steps: Create a Lambda function, which gets IP address and changes the security group. Create API Gateway, which calls the Lambda function. Add 2 links to your bookmarks — enable and disable access * Enable access — call the “Enable” Lambda function which changes the IP address to your IP address (eg caller-IP address) * Disable access — call the “Disable” Lambda function which changes the IP address to 8.8.8.8 (eg Google DNS…) See _change_secgroup_lambda.py_ for the code. It was written for Python 3.8 originally, but should work with newer versions.. We have _lambda_handler_ which is the function being called by the Lambda infrastructure; and we have _update_security_group_ which is the function which actually updates the security group. **Manual Installing** After installing Lambda function, go to Configuration -> Triggers. And then click on “Add Trigger” and add API gateway function for the trigger. Then go to API Gateway, and create a new API. Then create a new resource, and then create a new method. Then go to “Integration Request” and set “Body Mapping Templates” to “application/json” and add the following: { "action": "enable", "name": "$input.params('name')" } Then go to “Method Response” and add 200 response. Then go to “Integration Response” and add 200 response. Then go to “Integration Request” and add the following: { "body-json" : { "action" : "enable", "name" : "$input.params('name')" }, "context" : { "account-id" : "$context.identity.accountId", "api-id" : "$context.apiId", "api-key" : "$context.identity.apiKey", "authorizer-principal-id" : "$context.authorizer.principalId", "caller" : "$context.identity.caller", "cognito-authentication-provider" : "$context.identity.cognitoAuthenticationProvider", "cognito-authentication-type" : "$context.identity.cognitoAuthenticationType", "cognito-identity-id" : "$context.identity.cognitoIdentityId", "cognito-identity-pool-id" : "$context.identity.cognitoIdentityPoolId", "http-method" : "$context.httpMethod", "stage" : "$context.stage", "source-ip" : "$context.identity.sourceIp", "user" : "$context.identity.user", "user-agent" : "$context.identity.userAgent", "user-arn" : "$context.identity.userArn", "request-id" : "$context.requestId", "resource-id" : "$context.resourceId", "resource-path" : "$context.resourcePath" } } go to “Method Execution” and click on “Test” and then “Test” again. go to “Method Execution” and click on “Deploy API” and then “Deploy” again. go to “Stages” and click on “Prod” and then “Invoke URL” and copy the URL. go to “Resources” and click on “Actions” and then “Create Method” and select “GET” and click on the checkmark. go to “Method Execution” and click on “Test” and then “Test” again. **Automatic Installing, AKA productonize code using ChatGPT/Bard** So after we created the Lambda and API manually, we would like to “productionize” it, e.g. 
create code to create these things. We use the AWS CLI to inspect the existing Lambda and API, and then write the code that recreates them.

    # let's get the APIs in the region, then the resource for the relevant one (ID pg6hm5xzu0),
    # then the 'enable' method for that resource (ID jiv1tt)
    aws apigateway get-rest-apis --region us-west-2
    aws apigateway get-resources --rest-api-id pg6hm5xzu0 --region us-west-2
    aws apigateway get-method --rest-api-id pg6hm5xzu0 --resource-id jiv1tt --http-method GET --region us-west-2

Now, after we gave ChatGPT/Bard the JSONs to reproduce, we have commands to reproduce it:

    # create
    aws apigateway create-rest-api --name 'add_rule_to_sg-yuval-reprod' --description 'Created by AWS Lambda' --endpoint-configuration types=REGIONAL --api-key-source HEADER --region us-west-2

    # compare
    aws apigateway get-resources --rest-api-id pg6hm5xzu0 --region us-west-2
    aws apigateway get-resources --rest-api-id 3p06imz4sc --region us-west-2

And we can continue to the next steps: create the method, etc. We could also go further and use Terraform or CloudFormation to create the API and Lambda.

**FAQ**

**Q**: Why not use a VPN? E.g., the developer connects to an AWS VPN, and the policy allows connections only from the VPN?

**A**: This is a good solution for an SMB/enterprise, but for a solo developer/hobbyist just playing around, it's overkill.

**Q**: There is no need for authentication when calling the API (e.g. when visiting the URLs)?

**A**: Correct, this is by design; we exchange security for convenience. Worst case, we have another layer of security — the SSH certificate required for logging in — and this is the main layer of protection.

---

### Batch Processing in SQLite: A Deep Dive into Database Field Updates
URL: https://yuval.coffee/posts/batch-processing-sqlite-updates/
Date: 2024-12-04

### **Batch Processing in SQLite: How to update [efficiently] a table with millions of records**

![](https://cdn-images-1.medium.com/max/861/1*4k4P0TM9WAsKUcgTmFNYyw.png)

sample run, update in batches from last id

Tl;Dr — if you want to bulk-update a database table, you can do it in batched iterations. It takes more total time, but holds shorter locks on the database. This matters a lot in SQLite3, while other database engines have mechanisms that mitigate the issue (e.g. MVCC). In addition, to avoid scanning the table in each iteration, you should save the last id of the updated field. For full code see:

Let's say you have a SQL database, and you would like to update a field in bulk, e.g. reset it to null. What happens if you have thousands of records? Hundreds of thousands? Millions of records? You can end up locking the DB.

Some notes:

1. THIS MAINLY SPEAKS ABOUT SQLITE3; in PostgreSQL and other advanced DBs, we have some kind of versioning system, so we don't really lock the entire table, only some records. E.g. — **MVCC** in Postgres. In SQLite3, the locking issue is more present.
2. Between each run, we update random rankings.
3. Between each run, we **must purge the page cache**, in order to prevent consecutive runs from being faster due to caching (a minimal sketch of this purge is shown right after these notes).
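Note 3 in practice: a minimal sketch of purging the OS page cache between runs, assuming a Linux host where the benchmarking user can sudo. This sketch is mine, not code from the post.

```python
import subprocess

def drop_page_cache():
    """Flush dirty pages and drop the Linux page cache, so each run starts cold."""
    subprocess.run(["sync"], check=True)
    # Writing "3" to drop_caches frees the page cache plus dentries/inodes; needs root.
    subprocess.run(
        ["sudo", "tee", "/proc/sys/vm/drop_caches"],
        input=b"3\n",
        stdout=subprocess.DEVNULL,
        check=True,
    )

# call drop_page_cache() between benchmark runs
```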
**Creating and populating 20M records** Let’s create and populate table: CREATE TABLE IF NOT EXISTS domains ( id INTEGER PRIMARY KEY AUTOINCREMENT, domain_name TEXT NOT NULL, domain_ranking INTEGER def populate_data(self, num_records: int = 10_000_000): with sqlite3.connect(self.db_path) as conn: batch_size = 10000 for i in range(0, num_records, batch_size): batch = [ (f"domain_{j}", random.randint(1, 1000000)) for j in range(i, min(i + batch_size, num_records)) ] conn.executemany( "INSERT INTO domains (domain_name, domain_ranking) VALUES (?, ?)", batch ) **Anti-pattern: update row by row** We update row by row; Note we didn’t go full-anti pattern; We still commit every 1000 records and not after each recod; committing after each record would be the worst possible option to do. def reset_row_by_row(self): with sqlite3.connect(self.db_path) as conn: cursor = conn.execute("SELECT id FROM domains WHERE domain_ranking IS NOT NULL") for (id_,) in cursor: conn.execute( "UPDATE domains SET domain_ranking = NULL WHERE id = ?", (id_,) ) if total_reset % 1000 == 0: conn.commit() **First method: using 1 SQL statement** This approach is simple and also the quickest; however, it locks the table during execution; def reset_all(self): with sqlite3.connect(self.db_path) as conn: conn.execute("UPDATE domains SET domain_ranking = NULL") **Second method: Using loop + batch, eg reset in chunks** This approach looks better; while True, each iteration will reset 10k records; This will still lock the table, but for a shorter period of time. Total time for the whole table will be longer, but each lock time will be much shorter. However, **each iteration needs to scan the table** to find first relevant record. def reset_in_chunks(self, chunk_size: int = 10_000): with sqlite3.connect(self.db_path) as conn: while True: cursor = conn.execute(f""" UPDATE domains SET domain_ranking = NULL WHERE id IN ( SELECT id FROM domains WHERE domain_ranking IS NOT NULL LIMIT {chunk_size} ) """) if cursor.rowcount == 0: break conn.commit() **Improvement: Save last id and start from this id** In the last example, we would still have to scan the table during each iteration; Yet, we may have many pages in the cache; and we might as well have an index for the domain_ranking field; However, in cases where we don’t have index, we may save the index of the last updated record; and start from this index, in order to prevent scanning. def reset_with_tracking(self, chunk_size: int = 10000): last_id = 0 with sqlite3.connect(self.db_path) as conn: while True: cursor = conn.execute(f""" UPDATE domains SET domain_ranking = NULL WHERE id IN ( SELECT id FROM domains WHERE id > {last_id} AND domain_ranking IS NOT NULL ORDER BY id LIMIT {chunk_size} ) RETURNING id """) updated_ids = cursor.fetchall() if not updated_ids: break last_id = max(id[0] for id in updated_ids) conn.commit() For conclusion, when working with DBs, and performing heavy-duty tasks, always have in mind the locking mechanism and your DB access patterns. --- ### Fine-Tuning Bert using LoRA, hosting on Cloudflare using Cloudflare AI Workers URL: https://yuval.coffee/posts/fine-tuning-bert-lora-cloudflare/ Date: 2024-11-12 Let’s say you have a website and you would like to add content filtering to it. So yes, you could use OpenAI Content Moderation API, but what if you want your own solution? I did a quick POC with Cloudflare (“CF” now on) Workers + CF AI, to test if I can create and serve such model. 
CF offers AI workers, which are a closed set of models _*and fine-tunes*_ on these models. So, we can take a BERT model, fine-tune using LoRA, and serve via CF AI API. I guess the closed set of base models because of the serving optimizations, which allows to serve many in only fraction of the required resources. **Setup** There are [many content moderation datasets](https://github.com/fcakyon/content-moderation-deep-learning), but for this example let’s use well-known the Spam/No Spam dataset. **Installing basic libs and setting up.** Logging-in to Clouflare — we have 2 issues: #1 — The console asks for consent regarding telemetry — so we use `!yes` for that; Issue #2 — as part of the Oauth auth, the browser redirects to localhost, which is not good.. So we log-in using Cloudflare API Token, and not browser Oauth flow. # install some libs; use node.js 18.x (LTS), and verify !pip install transformers torch pandas peft datasets numpy scikit-learn !apt-get remove -y nodejs npm !curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash - !node --version !npm --version !npm install -g wrangler !which wrangler CLOUDFLARE_API_TOKEN = "...." os.environ['CLOUDFLARE_API_TOKEN'] = CLOUDFLARE_API_TOKEN !wrangler --version !yes | wrangler whoami **Basic Fine-Tuning** Here we have basic fine-tuning code; which will take _bert-base-uncased_ and fine-tune on the Spam/No-spam dataset. import pandas as pd import numpy as np from sklearn.model_selection import train_test_split import torch from transformers import ( AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding ) from peft import get_peft_model, LoraConfig, TaskType from datasets import Dataset import os class SpamDetectorTrainer: def __init__(self, model_name="bert-base-uncased"): self.model_name = model_name self.tokenizer = AutoTokenizer.from_pretrained(model_name) self.model = AutoModelForSequenceClassification.from_pretrained( model_name, num_labels=2 ) def load_data(self, url="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"): """Load and prepare the SMS spam dataset""" os.system(f"wget {url} -O smsspamcollection.zip") os.system("unzip smsspamcollection.zip") df = pd.read_csv("SMSSpamCollection", sep='\t', header=None, names=['label', 'message']) df['label'] = df['label'].map({'ham': 0, 'spam': 1}) train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42) self.train_dataset = Dataset.from_pandas(train_df) self.eval_dataset = Dataset.from_pandas(eval_df) return self.train_dataset, self.eval_dataset def preprocess_data(self): """Tokenize and prepare datasets""" def tokenize_function(examples): return self.tokenizer( examples['message'], truncation=True, padding=True, max_length=128 ) self.tokenized_train = self.train_dataset.map(tokenize_function, batched=True) self.tokenized_eval = self.eval_dataset.map(tokenize_function, batched=True) self.tokenized_train.set_format('torch', columns=['input_ids', 'attention_mask', 'label']) self.tokenized_eval.set_format('torch', columns=['input_ids', 'attention_mask', 'label']) return self.tokenized_train, self.tokenized_eval def setup_lora(self, r=8, alpha=32, dropout=0.1): """Configure and apply LoRA""" lora_config = LoraConfig( task_type=TaskType.SEQ_CLS, inference_mode=False, r=r, lora_alpha=alpha, lora_dropout=dropout, target_modules=['query', 'value'] ) self.lora_model = get_peft_model(self.model, lora_config) return self.lora_model def train(self, output_dir="./results", epochs=3, batch_size=16): 
"""Train the model""" training_args = TrainingArguments( output_dir=output_dir, evaluation_strategy="steps", eval_steps=500, save_strategy="steps", save_steps=500, learning_rate=2e-5, per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, num_train_epochs=epochs, weight_decay=0.01, logging_dir='./logs', logging_steps=100, load_best_model_at_end=True, metric_for_best_model="loss", save_total_limit=3, ) data_collator = DataCollatorWithPadding(tokenizer=self.tokenizer) trainer = Trainer( model=self.lora_model, args=training_args, train_dataset=self.tokenized_train, eval_dataset=self.tokenized_eval, data_collator=data_collator, ) # Train and evaluate train_result = trainer.train() eval_result = trainer.evaluate() return train_result, eval_result def save_model(self, path="lora_spam_adapter"): """Save the LoRA adapter""" self.lora_model.save_pretrained(path) self.tokenizer.save_pretrained(path) trainer = SpamDetectorTrainer() train_dataset, eval_dataset = trainer.load_data() trainer.preprocess_data() trainer.setup_lora() train_result, eval_result = trainer.train() trainer.save_model() **Hyper-Parameters Search** There are several parameters worth checking out; I’m not an expert LoRA fine-tuning guy, but read some [blog](https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms) [posts](https://www.determined.ai/blog/lora-parameters) about it. We could do a grid search and let the machine churn a little bit to squeeze some more performance. eg, **Rank** , which is the rank of the inner matrix; more compute work, but more information caputred; and also learning-rate. results = [] learning_rates = [1e-5, 2e-5, 5e-5] lora_ranks = [4, 8, 16] for lr in learning_rates: for rank in lora_ranks: print(f"Training with learning_rate={lr}, lora_rank={rank}") # Reinitialize the model and LoRA setup trainer.model = AutoModelForSequenceClassification.from_pretrained( trainer.model_name, num_labels=2 ) trainer.setup_lora(r=rank) # Train the model train_result, eval_result = trainer.train( epochs=3, batch_size=16, learning_rate=lr ) train_loss = train_result.training_loss eval_loss = eval_result['eval_loss'] results.append({ 'learning_rate': lr, 'lora_rank': rank, 'train_loss': train_loss, 'eval_loss': eval_loss }) print(f"Results: train_loss={train_loss:.4f}, eval_loss={eval_loss:.4f}") **Hyper-Parameters Search — Results** ![Loss, Learning Rate and Rank](https://cdn-images-1.medium.com/max/871/1*I1K0gyqIASpcPaLQAfjvcA.png) **Push to Cloudflare AI** Here we actually understand we can **not** push to Cloudflare, because of wrong selection of base model; eg Cloudflare AI worker with LoRA Fine-tuning have very specific list of based models, and Bert is not one of them: @cf/meta-llama/llama-2–7b-chat-hf-lora @cf/mistral/mistral-7b-instruct-v0.2-lora @cf/google/gemma-2b-it-lora @cf/google/gemma-7b-it-lora We can do training again, this time using [AutoTrain](https://developers.cloudflare.com/workers-ai/tutorials/fine-tune-models-with-autotrain/) . And serve using [this code sample](https://developers.cloudflare.com/workers-ai/get-started/workers-wrangler/), with adding finetune id. export interface Env { // If you set another name in wrangler.toml as the value for 'binding', // replace "AI" with the variable name you defined. 
AI: Ai; } export default { async fetch(request, env): Promise { const response = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", { prompt: "What is the origin of the phrase Hello, World", }); return new Response(JSON.stringify(response)); }, } satisfies ExportedHandler; **Conclusion** We saw how to **fine-tune Bert model** using LoRA. We [almost] saw wow to deploy (specific base models) into Cloudflare AI Workers, after fine-tuning. We tested several hyper-parameters for training, to get optimized model. --- ### BFS, DFS, PageRank, AKA — How to run embeddings only on important parts of a website URL: https://yuval.coffee/posts/bfs-dfs-pagerank-embeddings/ Date: 2024-09-26 ### BFS, DFS, PageRank, AKA — How to run embeddings only on important parts of a website Depth-First Search (DFS) and Breadth-First Search (BFS) are two common algorithms used in web scraping to traverse websites. Usually, you would want to use BFS; and this in order [no pun intended] to get to important pages first. In short, DFS will explore as far as possible along each branch before backtracking, while BFS will explore neighbor nodes at the present depth before moving to nodes at the next depth level. **BFS ** First we visit Homepage, then Products, Blog and About pages, and then children of each page, etc.. ![](https://cdn-images-1.medium.com/max/1024/1*vPGooJnoa0gMAODIbds9_w.png) **DFS ** First we visit Homepage, then Products, then children of Products, then Blog, **then we’re “stuck” in all blogposts [children of Blog page].** Instead of visiting About page, after the Blog page. **Creating this pre-coding-LLMs era, would take lots more time!** Using PageRank algorithm, we see that we visit: Homepage, Products, and then Category 1 and Category 2 pages. This is of course just for illustration — if the internal link structure “implies us” **[read: “implies Google”**] that these pages are more important, we would like to visit them first. ![](https://cdn-images-1.medium.com/max/1024/1*YhPs3PTyMKettPJqxZNdmQ.png) **Let’s use Claude/ChatGPT/Cursor; let’s use PageRank** Using Cursor, it took me about 10 minutes to add PageRank class. Created a class which: \- Accepts a page and links \- Adds to internal graph database [could use networkx, but for this case used a simple dict] \- Calculate PageRank \- Suggests next page to visit class WebsitePagerank { constructor(initialUrl) { // Note - for numerical and other reasons, the invariant that total PageRank of all pages is 1, does not hold here this.initialUrl = this.normalizeURL(initialUrl); this.pages = new Map(); this.links = new Map(); this.crawledPages = new Set(); this.pages.set(this.initialUrl, 10); } addPage(url) { url = this.normalizeURL(url); if (!this.pages.has(url)) { this.pages.set(url, url === this.initialUrl ? 
10 : 1); } } addLink(sourceUrl, destUrl) { sourceUrl = this.normalizeURL(sourceUrl); destUrl = this.normalizeURL(destUrl); this.addPage(sourceUrl); this.addPage(destUrl); if (!this.links.has(sourceUrl)) { this.links.set(sourceUrl, new Set()); } this.links.get(sourceUrl).add(destUrl); } markAsCrawled(url) { url = this.normalizeURL(url); console.log('markAsCrawled', url); this.crawledPages.add(url); } calculatePageRank(iterations = 10, dampingFactor = 0.85) { const numPages = this.pages.size; for (let i = 0; i < iterations; i++) { const newRanks = new Map(); for (const [page] of this.pages) { let sum = 0; for (const [sourcePage, destPages] of this.links) { if (destPages.has(page)) { sum += this.pages.get(sourcePage) / destPages.size; } } newRanks.set(page, (1 - dampingFactor) / numPages + dampingFactor * sum); } this.pages = newRanks; } } getNextUrl() { return Array.from(this.pages.entries()) .filter(([url]) => !this.crawledPages.has(url)) .sort((a, b) => b[1] - a[1])[0]?.[0]; } normalizeURL(url) { url = url.split('#')[0]; if (url.endsWith('/')) { url = url.slice(0, -1); } return url; } dumpPageRankInfo(filePath) { const data = Array.from(this.pages.entries()).map(([url, rank]) => ({ url, rank, outgoingLinks: Array.from(this.links.get(url) || []) })); fs.writeFileSync(filePath, JSON.stringify(data, null, 2)); } } ![](https://cdn-images-1.medium.com/max/1024/1*E9jB_DNezLJowRmC7l_8gw.png) **Conclusion** We saw toy examples for BFS, DFS, and PageRank algorithms. When website has 100+ pages, this becomes a true pain. **Q** : If we need to browse the pages to build the PageRank, we don’t save anything, right? **A** : Correct, if we look about bandwidth or crawled pages; **However, if we would like to embed** the pages afterward, we surely want to embed only the most important pages. --- ### There are more than 2 UUID types — UUIDv4, 7, ULID, etc… URL: https://yuval.coffee/posts/uuid-types-v4-7-ulid/ Date: 2024-02-17 ### There are more than 2 UUID types — UUIDv4, 7, ULID, etc… **Tl;dr** — UUIDv4, UUIDv7, ULID, Base64, Base58, Base85, HashIDs (hiding IDs on the frontend), libs compatibility between different SDKs. So, you’ve all heard about UUIDv4. It’s just a very random collection of bits, represented nicely. Let’s review some other UUIDs: ### ULID / UUIDv7 UUIDv4 has a major issue — it would give you some issues when you try to order by UUIDv4. Let’s say you have a database table, with id which is just a simple int, and AUTO_INCREMENT. first record will be id=1, second record will be id=2, etc.. Now, when you do something like SELECT * FROM my_table ORDER BY id, results will be sorted, but more importantly - the results will be close to each other. e.g., if you iterate in chunks of 100 for example, you would not jump all over the database, but results should be pretty close to each other. What happens with UUIDv4? There’s no really point in sorting, because you just sort a bunch of random numbers. In addition, you will “jump” all over the database when reading records. And if your DB is big enough, results you read will page out. ### So what about ULID/ UUIVv7? So, ULID / UUIDv7 are 2 protocol which offer to prefix the UUID with time signature; eg the IDs are always increasing. This way, for example, if you have table with ULID/UUIDv7 as an index, you can run SELECT * FROM my_table ORDER BY id and it would make sense. ### Problems with ULID/UUIv7? So, one thing is adoption; another problem which is more problematic, is information leakage — given an ID, we can know when it was created.. 
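To make the time-prefix idea, and the leakage it implies, concrete: a minimal, illustration-only sketch of a UUIDv7-style ID built from a 48-bit millisecond timestamp plus random bits. This is my sketch, not a spec-compliant implementation; it deliberately skips the version/variant bits of the real spec.

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    # 48-bit unix-ms timestamp prefix + 80 random bits; version/variant bits
    # of the real UUIDv7 spec are deliberately skipped -- illustration only.
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    return uuid.UUID(int=(ms << 80) | rand)

ids = []
for _ in range(3):
    ids.append(uuid7_like())
    time.sleep(0.002)  # ensure distinct millisecond prefixes for this small demo

assert ids == sorted(ids)  # time-prefixed IDs sort by creation order

# ...and the flip side, the information leakage mentioned above:
created_ms = ids[0].int >> 80
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(created_ms / 1000)))
```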
### “Nano IDs”

This is just a summary of this great article — [The UX of UUIDs](https://unkey.dev/blog/uuid-ux). Go read it now.

UUIDs are not easy to copy; the "-" characters in the UUID prevent copying the whole string in one go. Look at what Stripe is doing — the key is just a random string, without dashes; in addition, it is prefixed with a key description. For example:

    STRIPE_LIVE_PUBLIC_KEY="pk_live_xUBcwUhe....."
    STRIPE_LIVE_SECRET_KEY="sk_live_gpTjnUwB....."
    STRIPE_TEST_PUBLIC_KEY="pk_test_CcfLsSzE....."
    STRIPE_TEST_SECRET_KEY="sk_test_WFnNSjpB....."
    DJSTRIPE_WEBHOOK_SECRET="whsec_LqqRWEKkd....."

We can copy the key(s) with a double-click, and the key also carries lots more information. How come? The answer is that a UUID (v4, for example) is represented in hexadecimal; Stripe IDs are represented in a bigger base. The bigger the base, the shorter the string for the same amount of data.

### Base64 vs Base58

We all know about Base64 (see FAQ if not) — but what is Base58? Base58 is just like Base64, but with the confusing characters omitted: I and l, O and 0 are removed to avoid confusion, and so are + and /.

### Base85???

Yes, there is also Base85. Let's say:

1. You work for a software company which distributes signed .exe files to customers.
2. You want the filename to contain the URL the executable should connect to on first run.
3. You can't add this URL to the file content, since you would have to sign many different files (*).
4. The URL should be part of the filename.
5. The filename should be as short as possible.

So — use Base85 to encapsulate the URL; the bigger the base, the shorter the filename string. This way you get a short filename.

### Hash IDs

Let's say you have a SaaS, and you give each new user an ID, with a view of the format https://my-saas.com/users/123 (where 123 is the user_id). People can then [estimate](https://en.wikipedia.org/wiki/German_tank_problem) the number of users on your website by creating a new user and checking the id they got. So — how can you hide the real user_id from the user itself?

One option, of course, is to use a random id, but then we get all the issues of UUIDs (a UUID is just a special case of a random ID). Another option is to encrypt the ID using some key; and this is exactly what [Sqids (formerly Hashids)](https://sqids.org/python?hashids) does! Using a secret key (*), you can convert id->string and string->id. This way you can have something like https://my-saas.com/users/nVB, and convert nVB back to user_id 123 in your backend. Sample code:

    # Taken from: https://github.com/davidaurelio/hashids-python
    hashids = Hashids(salt='this is my salt 1')
    hashid = hashids.encode(123) # 'nVB'

    # and with a different salt:
    hashids = Hashids(salt='this is my salt 2')
    hashid = hashids.encode(123) # 'ojK'

### What can be the problem with HashIDs?

First, we should check which algorithm it uses, and make sure the encode/decode round trip returns the original id for all ids, i.e. that we can trust the lib (algorithm) to do the conversion without error. Another issue is that we might be bound to a specific implementation (and thus technology), unless we prove that results do not change when we switch libs.
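A minimal sketch of that round-trip sanity check, using the hashids-python lib from the sample above; the salt and the id range are arbitrary choices of mine.

```python
from hashids import Hashids

hashids = Hashids(salt="this is my salt 1")

# trust-but-verify: encode -> decode must round-trip for every id we care about
for user_id in range(100_000):
    encoded = hashids.encode(user_id)
    decoded = hashids.decode(encoded)  # returns a tuple, e.g. (123,)
    assert decoded == (user_id,), f"round-trip failed for {user_id}"
```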
Security review of the algorithm — the algorithm does some logic to [avoid generating the most common English curse words by never placing certain letters next to each other](https://github.com/davidaurelio/hashids-python?tab=readme-ov-file#curses-); this might spell trouble, entropy-wise..

#### Checking Cross-Language Consistency

What happens if the frontend uses HashIDs in Javascript but the backend uses Python/C++/Rust, for example?

### FAQ

**Q: Do we really need 128 bits as an ID? Isn't it too much? What are the odds of collisions?**

A: Some people claim this isn't really needed. Referring to the [Birthday Attack probability table](https://en.wikipedia.org/wiki/Birthday_problem), for 128 bits we need about 2.6×10¹⁸ keys in order to get a collision with 1% probability.

**Q: What is base64 for?**

A: Let's say you want to transfer information via text, e.g. you want to serialize info and send it to another program. Serializing means converting information to text, so you can send it from one program to another.

**Q: Why the "(*)" in "you would have to sign many different files"?**

A: There are some mechanisms to deal with it, e.g. signing the file except for a small part of metadata.

**Q: If we use a UUID with time as a prefix, what happens on daylight saving time?**

A: Nothing; the time is a unix epoch, which is always increasing.

**Q: Why do hash ID libs call the secret key "salt"? It's a secret, not a salt..**

A: The goal of hash IDs is to convert a number to a string and vice versa, i.e. supply a two-direction hash function. Thus, in order to change the hash result, we use a salt. In the algorithmic layer, this should indeed be called a "salt". In the Product/Marketing layer, it should be called a "secret".

**Q: In "Checking cross-language", why do we need a Dockerfile for the test?**

A: We really don't; we can do this one time to check the implementations we need and that's it. The Dockerfile is for demonstration purposes only.

### References

[UUIDs and poor index locality](https://buildkite.com/blog/goodbye-integers-hello-uuids)

[Benchmarking UUIDs and checking WAL](https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/)

_Originally published at_ [_https://dev.to_](https://dev.to/yuval1024/there-are-more-than-2-uuid-types-uuidv4-7-ulid-etc-1jg4) _on February 17, 2024._

---

### Python — PDB usage and reproducing program execution
URL: https://yuval.coffee/posts/python-pdb-usage-execution/
Date: 2022-11-11

### Python — PDB usage and reproducing program execution

So imagine you have a Python program, and you want to inspect some parameters during an error. There are [many](https://www.rookout.com/), [possible](https://sentry.io), [ways](https://docs.python.org/3/library/traceback.html) to do that; I'd like to talk about a basic one, which involves a debugger.

Just like GDB for C/C++, Python has PDB. PDB is a command-line debugger, which can be attached to a process or started from within the process. Just add the line import pdb; pdb.set_trace() and you will have a shell where you can communicate with the process. Needless to say, this is good only for CLI programs. Others, like servers, should use other solutions (Rookout etc., the PyCharm remote debugger, etc.).

Let's say we run a program which calls some_erroneous_function, and we want to know some value from this function:

main() -> foo() -> some_erroneous_function()

How can we know the value inside some_erroneous_function()?
Simple - add the following line: import pdb; pdb.set_trace()

Not being able to see the value of a, vs. being able to see it:

![](https://cdn-images-1.medium.com/max/859/0*c39IEUZKlIaVNXcW.png)Can't see value of a

![](https://cdn-images-1.medium.com/max/745/0*RlpAUWs3UMi47z3c.png)Do manage to see value of a

### What happens when program A runs program B?

When we have main() -> bar() -> cli_app_bar.py -> some_erroneous_function(), the import pdb; pdb.set_trace() trick simply doesn't work; we get a stuck process instead. This is because pdb opens in the child process, but the parent process is waiting for the child process to finish, so we're stuck. In this case, we should run the child process ourselves.

### Which parts are required to run a child process ourselves?

There are 2 parts which are required; one is obvious, the other is often forgotten!! The 2 parts are:

* Program name + command line arguments
* Environment variables!!
* (there's a 3rd part which is IPC messages, but it's very hard to mimic such behavior…)

Let's see how we capture this:

* Modify the program to save CLI arguments and env vars
* Run using the CLI args and env vars

### getting cmd + env vars

There are several methods; to get the env vars of a running process you can use cat /proc/46/environ | tr '\0' '\n' (replace 46 with the process id).

From within the Python process, we want to print env vars in a "ready to go" format, e.g. with the export prefix:

    with open('/tmp/params.txt', 'w') as fout:
        # print all env vars
        for k, v in os.environ.items():
            fout.write('export "%s"="%s"\n' % (k, v))

And then diff with the current env vars:

    echo "creating bar before"
    cat << EOF > create_before.py
    #!/usr/bin/python3
    import os
    with open('/tmp/params.before.txt', 'w') as fout:
        for k, v in os.environ.items():
            fout.write('export "%s"="%s"\n' % (k,v))
    EOF
    python create_before.py

    echo "print some stats"
    wc -l /tmp/params.txt /tmp/params.before.txt

    echo "get keys"
    cat /tmp/params.txt | awk -F '=' ' { print $1 } ' | sort > /tmp/params.keys.txt
    cat /tmp/params.before.txt | awk -F '=' ' { print $1 } ' | sort > /tmp/params.before.keys.txt
    wc -l /tmp/params.keys.txt /tmp/params.before.keys.txt
    diff /tmp/params.keys.txt /tmp/params.before.keys.txt

And we got the newly added env var key, EXTRA:

    YuvShell $ diff /tmp/params.keys.txt /tmp/params.before.keys.txt
    1d0
    < export "EXTRA"

### Questions

**Q: What is the "YuvShell"?**

A: It's just me editing the ~/.bashrc and changing the PS1 (Prompt String) var;

![](https://cdn-images-1.medium.com/max/880/0*g0b1WR_40NgBvl15.png)~/.bashrc to change shell prefix

**Q: What is the difference between** cat some_file.txt | wc -l **and** wc -l some_file.txt**?**

A: With cat + wc we use a pipe to transfer data from the cat output to the wc input; with wc only, we don't use the pipe.
Let's create a big file from urandom, and look at the time output of both options:

    cat /dev/urandom | base64 | head -c 1GB > /tmp/random_1GB_file.txt
    time cat /tmp/random_1GB_file.txt | wc -l
    time wc -l /tmp/random_1GB_file.txt

![](https://cdn-images-1.medium.com/max/564/0*xvSfSPw8Z3_6CZWW.png)

### Source Code

    #!/usr/bin/python3
    import argparse
    import subprocess
    import os
    from err_module import some_erroneous_function

    def foo():
        some_erroneous_function()

    def bar():
        my_env = os.environ.copy()
        my_env["EXTRA"] = 'True'
        cmd = "python cli_app.py --run"
        cmds = cmd.split(' ')
        stdout, stderr = subprocess.Popen(cmds, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, env=my_env).communicate()
        print("stdout [%s], stderr [%s]" % (stdout, stderr))

    def do_main():
        parser = argparse.ArgumentParser()
        parser.add_argument('--foo', dest='foo', default=False, action='store_true')
        parser.add_argument('--bar', dest='bar', default=False, action='store_true')
        args = parser.parse_args()
        if args.foo:
            foo()
        if args.bar:
            bar()

    if __name__ == '__main__':
        do_main()

    #!/usr/bin/python3
    import random

    def some_erroneous_function():
        print("some_erroneous_function:: enter")
        a = random.randint(1,100)
        import pdb; pdb.set_trace()  # adding PDB to open debugger
        raise Exception("some error")
        print("value of a: %d" % (a))

    #!/usr/bin/python3
    import sys, os
    from err_module import some_erroneous_function

    # code to save program execution parameters
    with open('/tmp/params.txt', 'w') as fout:
        # print all env vars
        for k, v in os.environ.items():
            fout.write('export "%s"="%s"\n' % (k, v))

    if __name__ == '__main__':
        if sys.argv[-1] == '--run' and os.environ.get('EXTRA') == 'True':
            some_erroneous_function()

_Originally published at_ [_https://dev.to_](https://dev.to/yuval1024/python-pdb-usage-and-reproducing-program-execution-mo6) _on November 11, 2022._

---

### Data Ingestion — Build Your Own “Map Reduce”?
URL: https://yuval.coffee/posts/data-ingestion-map-reduce/
Date: 2021-12-24

### Data Ingestion — Build Your Own “Map Reduce”?

Map-reduce, multi-process, Python profiling [using Py-Spy]

### Why map reduce

Let's say you work at Facebook; you have lots of data and probably need lots of map-reduce tasks. You will use mrjob/PySpark/Spark/Hadoop. You get the point — you need 1 framework to rule them all. You need a system: where will temp files be stored, API with the cloud, data security, multi-tenancy, etc. You need standards — standards among developers themselves, between developers and devops, etc.

Let's say, on the other hand, you're a solopreneur/small startup. Max a 3–4 developer team. You need things to work, and work fast. You don't have 10Ks of map-reduce jobs, but probably 1 or 2. You won't be using Hadoop, that's for sure. You might be using:

### Different approaches

### Linq

Not really map reduce per se, more like "SQL w/out an SQL engine". However, this adds the complexities of .NET to your environment; e.g. read the release notes and understand if you can run it on your [different OSes](https://docs.microsoft.com/en-us/dotnet/core/install/linux) (production, staging, developers' machines). Also — you need to learn C#: loading from files, different encodings, saving, iterators, etc. If you're not proficient with C#, this could be a one-time investment which may not be worth it.
### mrjob

Pros: Python-native lib; able to debug easily (using the [inline](https://mrjob.readthedocs.io/en/latest/runners-inline.html) runner); can run [locally](https://mrjob.readthedocs.io/en/latest/runners-local.html), e.g. multi-process on the local machine; can use Hadoop, Dataproc (it seems that [”Dataproc is a managed Spark and Hadoop service…”](https://cloud.google.com/dataproc/docs/concepts/overview)), etc. However, lots of moving parts and different configuration options.

### Custom made map-reduce

Let's go to the UCI Machine Learning website (2015 is on the phone..), choose some [dataset](http://archive.ics.uci.edu/ml/datasets/Bar+Crawl%3A+Detecting+Heavy+Drinking), and test.

Some notes: We don't need SHA-256, not even Base64; nothing bad will happen if the keys are not distributed perfectly evenly. We could take MMH3; googling "python murmurhash" gives 2 interesting results, and since both use [the](https://github.com/hajimes/mmh3/blob/master/MurmurHash3.cpp) [same](https://github.com/explosion/murmurhash/blob/master/murmurhash/MurmurHash3.cpp) cpp code, let's take the one with the [most stars](https://github.com/hajimes/mmh3). Other options would be to simply do (% NUM_SHARDS), or even shift right (but then the shard count must be a power of 2).

Mini setup script:

    # get data
    wget http://archive.ics.uci.edu/ml/machine-learning-databases/00516/os_scan/OS%20Scan_dataset.csv.gz
    gunzip 'OS Scan_dataset.csv.gz'
    wc -l all_accelerometer_data_pids_13.csv

    # create venv
    venv_path=/tmp/venv_mr/
    python3 -m venv ${venv_path}
    source ${venv_path}/bin/activate
    python3 -m pip install mmh3
    python3 -m pip install tqdm
    python3 -m pip install py-spy

And 2 Python test scripts:

    #!/usr/bin/env python3
    import mmh3
    import tqdm

    NUMBER_OF_SHARDS = 10

    id_to_file = {}
    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i] = open('/tmp/%s.shard.txt' % (i), 'w')

    def get_shard_index(id_, num_shards):
        return mmh3.hash(id_, signed=False) % num_shards

    with open(r'all_accelerometer_data_pids_13.csv', 'r') as fin:
        for line in tqdm.tqdm(fin, mininterval=1):
            id_ = line.split(',', 1)[0]
            id_to_file[get_shard_index(id_, NUMBER_OF_SHARDS)].write("%s\n" % (line))

    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i].close()

    #!/usr/bin/env python3
    import multiprocessing
    import mmh3
    import tqdm

    NUMBER_OF_PROCESSES = multiprocessing.cpu_count()-1
    NUMBER_OF_SHARDS = 10

    id_to_file = {}
    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i] = open('/tmp/%s.shard.txt' % (i), 'w')

    def _get_shard_index(id_, num_shards):
        return mmh3.hash(id_, signed=False) % num_shards

    def get_id_and_line(line):
        id_ = line.split(',', 1)[0]
        shard_id = _get_shard_index(id_, NUMBER_OF_SHARDS)
        return shard_id, line

    pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
    with open(r'all_accelerometer_data_pids_13.csv', 'r') as fin:
        iter_data = pool.imap(get_id_and_line, fin)
        for shard_id, line in tqdm.tqdm(iter_data, mininterval=1):
            id_to_file[shard_id].write("%s\n" % (line))

    for i in range(NUMBER_OF_SHARDS):
        id_to_file[i].close()

Results: imap runs much slower; we can look at the it/sec from tqdm to see that:

    # test.py sample after 4 seconds:
    2801493it [00:04, 566075.99it/s]

    # test_imap.py sample after 4 seconds:
    73439it [00:04, 18754.44it/s]

We can see the non-imap version is ~30x faster!

### Q&A

Q: Why setup.sh and not a requirements.txt file?

A: This is not production code; it's aimed at quick reproducibility, not at pinning the exact same libs (e.g. for security etc.)

Q: Why MMH3 and not sha256?
A: This is not a security product, so we don't need a cryptographic hash; we just need a _nice distribution of keys_, and we want it to be **fast**.

Q: Why is imap slower than a single process?

A: Probably because the imap version has **lots of overhead due to IPC**; whatever we gain by offloading the (alleged) "heavy lifting" hash calculation to another process is erased by the IPC.

![](https://cdn-images-1.medium.com/max/880/0*z1odokAUxlqPKkMs.png)use [py-spy](https://github.com/benfred/py-spy) for Python profiling

Q: Why?

A: Using a process pool might be worth it if the task is more CPU bound; here, the **task is more IO bound**, and the overhead of the MMH hash doesn't justify it.

Q: Conclusion?

A: It depends. It also depends on the size of the file, and on the post-processing of each shard. Conclusion — test mrjob as well; it might have better IPC.

_Originally published at_ [_https://dev.to_](https://dev.to/yuval1024/data-ingestion-build-your-own-map-reduce-2j1h) _on December 24, 2021._

---
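A follow-up note on the imap result above: multiprocessing.Pool.imap accepts a chunksize argument that batches many lines per IPC message, which is the usual first knob to try before concluding that a process pool can't win. A minimal, unbenchmarked sketch against the same file (the chunksize value is an arbitrary guess of mine):

```python
import multiprocessing
import mmh3
import tqdm

NUMBER_OF_SHARDS = 10

def get_id_and_line(line):
    id_ = line.split(',', 1)[0]
    return mmh3.hash(id_, signed=False) % NUMBER_OF_SHARDS, line

if __name__ == '__main__':
    shard_files = {i: open('/tmp/%s.shard.txt' % i, 'w') for i in range(NUMBER_OF_SHARDS)}
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1) as pool, \
         open('all_accelerometer_data_pids_13.csv') as fin:
        # chunksize batches lines per worker message, amortizing the IPC cost of imap
        for shard_id, line in tqdm.tqdm(pool.imap(get_id_and_line, fin, chunksize=10_000), mininterval=1):
            shard_files[shard_id].write(line)
    for f in shard_files.values():
        f.close()
```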