In the world of Data Engineering and Web Architecture, the quality of information comes before any analysis. Before we talk about conversion metrics or Business Intelligence, we need to face a silent enemy: data pollution. This is a real case of URL Sanitization in WordPress.
In a high-scale WordPress environment, years of content and campaigns spread thousands of Hotmart URLs throughout the database. The result was a trail of obsolete parameters (src, utm, sck) and even residues like \u0022, creating noise, inconsistency, and real risk to SEO, performance, and conversion.
The goal was not to “make links pretty”. It was to restore the integrity of the system with Data Sanitization, keeping only what supports the business, with a critical rule: preserve checkoutMode=10 when necessary.
Here I document the basis of the process and the decision logic that makes the cleaning safe. Next, I show the batch execution and the refinement of the most problematic cases in Execution of URL Sanitization in WordPress. Finally, I conclude with a final audit, post-operation checks (caches and CSS), and the direct impact on data and revenue in: “Audit and Impact of URL Sanitization in WordPress”.
Tools and Environment for URL Sanitization in WordPress
Many developers ask: why not use a simple SQL command directly on the database? The answer lies in the nature of the problem. The REPLACE command in SQL is a literal tool, not interpretative.
SQL works perfectly when we know exactly what to search for and what to replace. It swaps “text A” for “text B”. However, our challenge involved dynamic and unpredictable variables.
Each Hotmart URL had a unique tail of parameters. SQL does not have the native intelligence to say: “keep the product ID, check if the checkout parameter exists, and discard the rest”.
To manipulate complex patterns, the logic needs to be procedural. It was necessary to use a programming language capable of processing conditionals. That’s why we chose the integration of WP-CLI with PHP.
The WP-CLI (WordPress Command Line Interface) acts as a bridge. It allows us to interact with the WordPress infrastructure directly from the terminal, without the processing overhead of a web browser.
By injecting PHP scripts via the terminal, we gain access to the power of Regular Expressions (Regex). Regex does not search for words; it searches for patterns of characters, allowing us to identify and isolate the ID of each product.
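To make the contrast with a literal REPLACE concrete, here is a minimal sketch (in Python for illustration only; the production scripts were PHP, and the product IDs below are invented) of how a single pattern isolates the product ID regardless of which tail follows it:

```python
import re

# One pattern covers every variation of "tail": the ID is captured,
# everything after it is matched but discarded.
# (Hypothetical product IDs, for illustration only.)
pattern = re.compile(r'https?://[^/\s]*hotmart\.com/([a-zA-Z0-9]+)[^\s"\'<]*', re.I)

urls = [
    "https://pay.hotmart.com/A1B2C3?src=facebook&utm_source=blog",
    "https://go.hotmart.com/XYZ123?checkoutmode=10\\u0022",
    "https://pay.hotmart.com/A1B2C3",
]

ids = [pattern.search(u).group(1) for u in urls]
print(ids)  # three different tails, but only two distinct product IDs
```

A literal search-and-replace would need one rule per tail variation; the pattern needs only one rule total.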
With this technology, we built an “intelligent sanitization” algorithm. The code not only deletes data; it reads the URL, interprets its structure, makes decisions based on business rules, and reconstructs the clean data.
Finally, the security of the operation was ensured by strict protocols. In data engineering, we never operate in production without a safety net. A complete backup of the database was step zero.
In addition to the backup, we worked with the concept of “Dry Run” (Simulation). Our scripts were designed to initially run in read mode, generating reports of “what would happen if…”.
This validation step allowed us to audit the logic of the algorithm before any real changes. Only after the mathematical validation of the proposed substitutions was the write command executed.
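The dry-run pattern itself is simple and worth internalizing. Below is a minimal sketch of the idea (Python for illustration; function names and the cleanup rule are hypothetical, not taken from the actual scripts): the same routine that performs the substitutions can first run with writing disabled and return only a report of planned changes.

```python
import re

def sanitize_batch(records, dry_run=True):
    """Return (report, updated_records); mutate nothing when dry_run=True."""
    # Hypothetical cleanup rule: strip any query string after the product ID.
    rule = re.compile(r'(https?://[^\s]*hotmart\.com/[a-zA-Z0-9]+)\?[^\s]*')
    report, updated = [], []
    for rec in records:
        new = rule.sub(r'\1', rec)
        if new != rec:
            report.append((rec, new))
        updated.append(rec if dry_run else new)
    return report, updated

data = ["Buy here: https://go.hotmart.com/ABC123?src=facebook"]
report, rows = sanitize_batch(data, dry_run=True)
print(len(report))   # 1 planned change
print(rows == data)  # True — nothing was written
```

Only after the report is audited is the same routine re-run with `dry_run=False`.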
Diagnosis and Mapping
Before any surgical intervention on the database, it is mandatory to understand the extent of the problem. In data engineering, operating blindly is a recipe for disaster. Therefore, the first step was not corrective but investigative.
We needed a complete “X-Ray” of all the Hotmart URLs present in the site ecosystem, regardless of whether they were visible in a blog article or hidden in the settings of an SEO plugin.
Creating the Crawler via wp eval-file
A WordPress database holds thousands of rows spread across many tables. Manually searching this would be unfeasible. The solution was to create a custom PHP script to run via WP-CLI.
We opted for the wp eval-file command instead of wp eval (single line) to avoid syntax conflicts with quotes in the terminal and allow for more robust logic. The script acts as an internal crawler: it scans the content, extracts specific patterns using Regex (Regular Expressions), and cleans up escape formatting (backslashes common in databases).
Below is the code used to map the scenario. It ignores exact duplicates but lists all variations of parameters, giving us the real dimension of the pollution.
<?php
/**
 * Hotmart URL Audit and Mapping Script
 * Execution via terminal: wp eval-file rastreador.php
 */
global $wpdb;

echo "\n--------------------------------------------------\n";
echo "🔎 STARTING DATA TRACKER\n";
echo "--------------------------------------------------\n";

// Regex designed to capture http/https + domain + any tail.
// Matching stops at the first space, quote, or HTML tag delimiter.
$regex = '/https?:\/\/[^\s"\'<]*hotmart\.com[^\s"\'<]*/i';
$found_urls = [];

// 1. Scan the Posts Table (visible content)
echo "1. Analyzing wp_posts... ";
$posts = $wpdb->get_results("SELECT post_content FROM {$wpdb->posts} WHERE post_content LIKE '%hotmart.com%'");
foreach ($posts as $p) {
    if (preg_match_all($regex, $p->post_content, $matches)) {
        foreach ($matches[0] as $url) {
            // stripslashes removes escape slashes from JSON (ex: https:\/\/ becomes https://)
            $found_urls[] = stripslashes($url);
        }
    }
}
echo "OK (" . count($posts) . " records analyzed).\n";

// 2. Scan the Metadata Table (hidden settings)
echo "2. Analyzing wp_postmeta... ";
$metas = $wpdb->get_results("SELECT meta_value FROM {$wpdb->postmeta} WHERE meta_value LIKE '%hotmart.com%'");
foreach ($metas as $m) {
    if (preg_match_all($regex, $m->meta_value, $matches)) {
        foreach ($matches[0] as $url) {
            $found_urls[] = stripslashes($url);
        }
    }
}
echo "OK (" . count($metas) . " records analyzed).\n";

// 3. Data Consolidation
// array_unique ensures we see each URL variation only once
$unique_urls = array_unique($found_urls);
sort($unique_urls);

echo "\n--------------------------------------------------\n";
echo "DIAGNOSTIC REPORT (Total: " . count($unique_urls) . " unique URLs)\n";
echo "--------------------------------------------------\n\n";
foreach ($unique_urls as $url) {
    echo $url . "\n";
}
echo "\n";
Pattern Analysis: Identifying Variations of Clutter
When running the script above, the terminal output confirmed the hypothesis of severe data pollution. What should have been a list of unique products turned out to be a tangle of orphaned tracking parameters.
We identified three critical categories of “clutter” in the links:
- Source Parameters: Hundreds of URLs contained variations of ?src=facebook, ?src=email_marketing, ?utm_source=blog. This fragments data analysis, as the same product is counted as dozens of different URLs.
- Code Junk (Encoding): We found URLs ending in \u0022 or \u003e. These are residues of Unicode JSON encoding that were incorrectly saved along with the link, often breaking click functionality.
- Checkout Inconsistency: Some links had the vital parameter checkoutMode=10, others did not, and some had it duplicated or poorly formatted.
The Serialization Challenge: Detecting Links in wp_postmeta
The most complex part of the diagnosis was not in the article texts, but where the eyes do not see: the wp_postmeta table.
Modern plugins like Rank Math (SEO) and GenerateBlocks (Design) do not save data as plain text. They use serialized arrays or JSON objects.
Imagine that WordPress saves the settings of a button block not as “Link: http…”, but as a coded data package:
{"type":"button","link":"https://pay.hotmart.com/XYZ","color":"blue"}.
If we used a simple text replacement (SQL REPLACE), we would risk corrupting the structure of this JSON package, invalidating the entire block or the SEO settings of the page.
Our tracker was specifically programmed to read within these structures (using stripslashes and scanning meta_value), ensuring that even the “hidden” links within the complex settings of the plugins were mapped for future cleanup. Without this, sanitization would be superficial and incomplete.
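Why is blind replacement inside serialized data so dangerous? PHP's serializer prefixes every string with its explicit byte length. The sketch below (written in Python purely for illustration; this is not the project's code) reproduces that format and shows the failure mode: editing the text without updating the length prefix produces a record that PHP's unserialize() would reject.

```python
import re

def php_serialize_str(value: str) -> str:
    # PHP serializes strings as s:<byte length>:"<data>";
    return f's:{len(value.encode("utf-8"))}:"{value}";'

def is_valid_record(record: str) -> bool:
    # A record is only valid if the declared length matches the actual data.
    m = re.fullmatch(r's:(\d+):"(.*)";', record, re.DOTALL)
    return bool(m) and int(m.group(1)) == len(m.group(2).encode("utf-8"))

original = php_serialize_str("https://pay.hotmart.com/XYZ?src=facebook")
# A naive SQL-style REPLACE edits the data but not the length prefix:
broken = original.replace("?src=facebook", "")

print(is_valid_record(original))  # True
print(is_valid_record(broken))    # False — the length prefix no longer matches
```

This is exactly why the WP-CLI route matters: wp search-replace is serialization-aware and recomputes those length prefixes, while a raw SQL REPLACE does not.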
The Logic of the Cleaning Algorithm
Mapping the problem was just the first step. The real engineering began by defining how to handle each variation of URL found. It wasn’t enough to just wipe everything clean; we needed criteria.
Here comes the concept of “Decision Funnel”. Instead of trying to clean the existing URL by removing piece by piece, we adopted a reconstructive approach.
The algorithm does not “fix” the old link. It identifies the product ID (the main asset), checks for exception rules, and then discards the entire old URL to build a new one, absolutely clean, from scratch.
The “Decision Funnel”: What Stays vs. What Goes
The logic of the script operates by separating the “Signal” from the “Noise”. The Signal is the base URL that contains the product ID (e.g., go.hotmart.com/XYZ123). The Noise is everything that comes after: tracking parameters (src), affiliate codes, and formatting junk.
The standard funnel rule is strict: if it is not essential to the transaction, it is eliminated. This ensures that old and fragmented src metrics do not pollute the future database.
Exception Handling: The Golden Rule of checkoutMode=10
Every rule has its critical exception. In our case, the parameter checkoutMode=10 alters the link’s behavior, taking the user directly to payment instead of the sales page. Removing this would break the conversion strategy.
The algorithm needed to be smart enough to distinguish between “junk” (like src=facebook) and “functionality” (checkoutMode).
If the script detects the presence of this parameter in the “dirty” tail of the old URL, it activates the preservation flag. However, it does not preserve the original text (which could be contaminated with \u0022). It adds a new clean string to the end of the ID.
CamelCase Standardization: Converting checkoutmode to checkoutMode
Consistency is the best friend of data analysis. We found variations like checkoutmode (lowercase) and CheckOutMode mixed in the database. For a computer, ?a=1 and ?A=1 can be different things depending on the system.
We implemented forced normalization. Regardless of how the parameter was written in the past, the algorithm rewrites it following the official CamelCase pattern of the platform: checkoutMode=10.
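As a sketch of that normalization step (Python for illustration; the real logic was implemented in PHP), a case-insensitive match is rewritten to the single canonical form:

```python
import re

CANONICAL = "checkoutMode=10"

def normalize_checkout(url: str) -> str:
    # Match any casing of the parameter and rewrite it in CamelCase.
    return re.sub(r'checkoutmode=10', CANONICAL, url, flags=re.IGNORECASE)

print(normalize_checkout("https://pay.hotmart.com/H12345?checkoutmode=10"))
print(normalize_checkout("https://pay.hotmart.com/H12345?CheckOutMode=10"))
# Both print the URL ending in ?checkoutMode=10
```

After normalization, every direct-checkout link in the database shares one exact spelling, which makes later audits a simple exact-match count.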
Below is the snippet of PHP code that implements this decision-making and reconstruction logic:
// Decision algorithm snippet (Funnel)
// 1. Separates the BASE (Product ID) from the REST (Junk/Parameters)
// The Regex captures the ID in group 1 and the rest in group 2
if (preg_match('/^(https?:\/\/[^\/]*hotmart\.com\/[a-zA-Z0-9]+)(.*)$/i', $dirty_url, $parts)) {
    $base_url = $parts[1]; // Ex: https://pay.hotmart.com/H12345
    $tail     = $parts[2]; // Ex: ?src=fb&checkoutmode=10\u0022...

    // 2. The Golden Rule: Checks if checkoutMode exists (Case Insensitive)
    // stripos allows finding 'checkoutmode' even if it is lowercase
    if (stripos($tail, 'checkoutmode=10') !== false) {
        // SCENARIO A: It is a Direct Checkout link.
        // We reconstruct the URL with the STANDARDIZED parameter (CamelCase).
        // Note that we ignore the original $tail. The junk is discarded.
        $clean_url = $base_url . "?checkoutMode=10";
    } else {
        // SCENARIO B: Standard link.
        // We keep only the base. We remove src, utm, and any other noise.
        $clean_url = $base_url;
    }

    // 3. If the reconstructed URL is different from the original, we schedule the change.
    if ($dirty_url !== $clean_url) {
        $commands[] = "wp search-replace '$dirty_url' '$clean_url' --all-tables";
    }
}
With the problem mapped and the sanitization logic defined, the next step was to transform strategy into execution securely. In operations of this type, cleaning is not “replacing text”: it is ensuring that each change respects the integrity of WordPress and the business rule of checkoutMode=10.
To achieve this, I built a layered approach, with generator scripts, prior auditing, and execution via terminal, reducing risk and increasing control. It is at this stage that many people make mistakes out of haste and create a problem larger than the original.
The practical continuation, with scripts, batch execution, and correction of the “stubborn” cases, is in the next post: Execution of URL cleaning via WP-CLI.