How to Extract Product Data from an Online Store with n8n: Automated Web Scraping

Extracting the product data shown in an online store's grid is a common need for anyone who wants to monitor prices, update catalogs, compare competitors, or feed internal dashboards. With n8n, a source-available (fair-code) workflow automation tool, you can build a robust flow that captures this data directly from the page's HTML, then standardizes, validates, and stores it, and even triggers alerts, all without writing a system from scratch. In this article, using the theme “Extracting Product Data from the Online Store Grid with N8N” as a reference, we will walk through a complete step-by-step guide, best practices, and strategies for handling the different scenarios of modern e-commerce.

Video credits: Bruno Devx | BR Criativus

Why use n8n to extract product data

n8n allows you to create visual automations by connecting nodes that make HTTP requests, extract and transform information, make conditional decisions, log events, and send data to spreadsheets, databases, or APIs. For those who need to extract data from a product grid, this means:

No dependency on complex code: you build the flow by dragging and configuring nodes.
Scalability: easy to replicate the flow for multiple categories, stores, or brands.
Reliability: you schedule executions, handle errors, and send notifications when incidents occur.
Integration: from raw data to the final layer (Google Sheets, Airtable, MySQL, Notion, BigQuery, or APIs), all in one pipeline.

Understanding the product “grid”

Generally, category, search, or highlight pages display products in repeated cards. Each card usually contains:

Product name
Current price and sometimes previous price (promotion)
Main image
Link to the product page
SKU, code, or internal identifier (when visible)
Badge or tag (new, promotion, free shipping, Black Friday, etc.)
Availability (in stock, unavailable, variations)

This data is typically accessible in the HTML via CSS selectors (such as classes and attributes) or in data attributes (e.g., data-sku="…"). In some cases, the page embeds JSON in the source code (script tags of type application/ld+json or inline blocks). In stores that render content dynamically via JavaScript, it may be necessary to use a headless browser to “see” the final HTML.

Flow architecture in n8n

Before clicking and dragging nodes, plan the architecture. A typical flow includes:

Input of URLs: list of categories/pages where the grid is located.
HTML capture: via the HTTP Request node, or a headless browser when the page depends heavily on JavaScript.
Structured extraction: using a node that extracts values by CSS selector (in n8n, the HTML node).
Cleaning and standardization: normalization of text, currency, numbers, and links.
Pagination: proceed to the next page until there are no more results.
Deduplication: avoid creating duplicate records between executions.
Persistence: save to a spreadsheet, database, or send to an API.
Observability: logs, notifications, and error handling.
Scheduling: recurring execution with a schedule trigger (daily, at off-peak hours, etc.).

Step-by-step guide on how to extract product data with n8n

1) Define the objectives and fields

List exactly what you need to extract and where you intend to save it. For example:

Fields: title, price, previous_price, link, image, sku, availability, category, capture_date.
Destination: Google Sheets for quick analysis and a MySQL database for history.
Frequency: 2 times a day.

2) Preparing the input of URLs

You can feed the flow with a fixed list of URLs (e.g., category pages) using a static data node, a control spreadsheet, or a database. This allows you to scale the same flow for 10, 50, or 200 different pages without duplicating work.

3) HTML capture

On static pages, the HTTP Request node is usually sufficient. Configure:

GET method
Realistic User-Agent (some stores block generic agents)
Appropriate timeout and number of retries
Respect for robots.txt and terms of use

If the page loads products via JavaScript after the initial load, use a headless browser (in n8n this usually means a community node or an external rendering service) to render the page, wait for the product elements to appear, and then extract the rendered HTML.
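For readers who want to see what the capture settings above amount to outside the visual editor, here is a minimal sketch in plain Python, using only the standard library. The User-Agent string, the timeout, and the retry count are illustrative assumptions, not values recommended by any particular store, and the `opener` and `sleep` parameters exist only so the function can be exercised without touching the network.

```python
import time
import urllib.error
import urllib.request

# Hypothetical value for illustration; use an agent string appropriate to your project.
USER_AGENT = "Mozilla/5.0 (compatible; product-grid-collector/1.0)"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method="GET")


def fetch_html(url, timeout=15, retries=3, wait_seconds=2.0,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the page body as text, retrying a few times on network errors."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(build_request(url), timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error
            if attempt < retries:
                sleep(wait_seconds)  # short pause before the next attempt
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts") from last_error
```

In n8n itself, the equivalent settings live in the HTTP Request node's options; the sketch only makes the intent explicit.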

4) Data extraction from the grid

With the HTML in hand, use a node that extracts values by CSS selector (n8n's HTML node offers an extraction operation for this). The idea is to select the product “card” and, within it, look for:

Title selector (e.g., .product-card .title)
Price selector (e.g., .price .amount)
Previous price selector (e.g., .price .old)
Link selector (e.g., .product-card a)
Image selector (e.g., .product-card img)
SKU selector (when available) or data attributes
Availability flag (e.g., .badge.out-of-stock)
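To make the "card, then fields inside the card" idea concrete, here is a rough sketch in plain Python using only the standard library's HTML parser. It is not how n8n's extraction works internally; it simply mirrors the same logic. The class names (product-card, title, amount, old, out-of-stock) are the hypothetical examples from the list above and must be replaced with whatever you find when inspecting the real store.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link", "source"}


class CardParser(HTMLParser):
    """Collect one dict per element whose class list contains 'product-card'."""

    # Hypothetical class names mapped to output field names.
    FIELD_CLASSES = {"title": "title", "amount": "price", "old": "previous_price"}

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None       # card currently being filled, if any
        self._depth = 0         # nesting depth inside the current card
        self._field = None      # field whose text is currently being read
        self._field_depth = 0   # depth at which that field's element was opened

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._card is None:
            if "product-card" in classes:
                self._card = {"sku": attrs.get("data-sku"), "availability": "in_stock"}
                self._depth = 1
            return
        if tag == "a" and "link" not in self._card:
            self._card["link"] = attrs.get("href")
        if tag == "img" and "image" not in self._card:
            self._card["image"] = attrs.get("src")
        if "out-of-stock" in classes:
            self._card["availability"] = "unavailable"
        if self._field is None and tag not in VOID_TAGS:
            for css_class, field in self.FIELD_CLASSES.items():
                if css_class in classes:
                    self._field, self._field_depth = field, self._depth + 1
                    break
        if tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._card is None or tag in VOID_TAGS:
            return
        if self._field is not None and self._depth == self._field_depth:
            self._field = None  # the field's own element just closed
        self._depth -= 1
        if self._depth == 0:    # the card element itself just closed
            self.cards.append(self._card)
            self._card = None

    def handle_data(self, data):
        if self._card is not None and self._field is not None:
            text = self._card.get(self._field, "") + " " + data
            self._card[self._field] = " ".join(text.split())


def extract_cards(html):
    """Return a list of raw product dicts found in the given HTML."""
    parser = CardParser()
    parser.feed(html)
    parser.close()
    return parser.cards
```

The values come out as raw strings (for example, the price still carries its currency symbol); cleaning them is a separate step.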

Practical tips:

Start by inspecting the HTML in the browser to find stable selectors.
Prefer consistent attributes; avoid purely utility classes that change frequently.
If the store has JSON-LD with product data, it may be more reliable to parse it.
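As an illustration of the JSON-LD tip, the sketch below pulls Product entries out of application/ld+json script blocks. It assumes the store follows the usual schema.org vocabulary (name, sku, url, and an offers object with price and priceCurrency); real stores vary, so treat the field mapping as a starting point. The function names are hypothetical.

```python
import json
import re

# Matches the body of each <script type="application/ld+json"> block.
JSON_LD_PATTERN = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def iter_nodes(data):
    """Yield every dict in a JSON-LD payload, including nested lists and objects."""
    if isinstance(data, list):
        for item in data:
            yield from iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        for value in data.values():
            yield from iter_nodes(value)


def products_from_json_ld(html):
    """Return one dict per schema.org Product found in the page's JSON-LD."""
    products = []
    for raw in JSON_LD_PATTERN.findall(html):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        for node in iter_nodes(payload):
            if node.get("@type") != "Product":
                continue
            offers = node.get("offers") or {}
            if isinstance(offers, list):  # some stores publish a list of offers
                offers = offers[0] if offers else {}
            products.append({
                "title": node.get("name"),
                "sku": node.get("sku"),
                "link": node.get("url"),
                "price": offers.get("price"),
                "currency": offers.get("priceCurrency"),
            })
    return products
```

When this data is present, it tends to survive visual redesigns better than class names do, which is why it can be the more reliable source.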

5) Normalization and enrichment

In the transformation stage, standardize and enrich the data to facilitate analysis and integrations:

Price: remove currency symbols and thousand separators, convert the decimal comma to a point, and store the result as a number.
Title: trim surrounding whitespace, adjust capitalization if needed, and remove unnecessary special characters.
Links: transform relative URLs into absolute ones.
Category: include the name of the source category as metadata.
Date/Time: save an ISO timestamp (e.g., 2025-09-02T01:57:29Z).
Availability: normalize to “in_stock”/“unavailable”.

If desired, calculate derived fields, such as “discount_percentage” based on the current and previous prices.
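The normalization rules above can be sketched as follows. This assumes Brazilian-style prices such as "R$ 1.299,90" (comma as decimal separator); a value with no comma is read as-is, which would be wrong for a store that writes "1.299" to mean one thousand two hundred ninety-nine. The raw field names (including out_of_stock) are hypothetical.

```python
import re
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation
from urllib.parse import urljoin


def parse_price(text):
    """Convert a price string such as 'R$ 1.299,90' to Decimal, or None if absent."""
    match = re.search(r"\d[\d.,]*", text or "")
    if match is None:
        return None
    number = match.group().rstrip(".,")
    if "," in number:
        # Assumption: dots are thousand separators and the comma is the decimal mark.
        number = number.replace(".", "").replace(",", ".")
    try:
        return Decimal(number)
    except InvalidOperation:
        return None


def discount_percentage(price, previous_price):
    """Return the discount as a percentage with one decimal place, or None."""
    if price is None or not previous_price or previous_price <= price:
        return None
    return ((previous_price - price) / previous_price * 100).quantize(Decimal("0.1"))


def normalize_item(raw, page_url, category, now=None):
    """Clean one raw card dict and add the metadata fields."""
    now = now or datetime.now(timezone.utc)
    price = parse_price(raw.get("price"))
    previous = parse_price(raw.get("previous_price"))
    return {
        "title": " ".join((raw.get("title") or "").split()),
        "price": price,
        "previous_price": previous,
        "discount_percentage": discount_percentage(price, previous),
        "link": urljoin(page_url, raw.get("link") or ""),  # relative -> absolute
        "category": category,
        "availability": "unavailable" if raw.get("out_of_stock") else "in_stock",
        "capture_date": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

Decimal is used instead of float so that prices compare exactly between executions.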

6) Automatic pagination

Most grids use pagination. Two common approaches:

Follow the “Next page” link: after extracting the grid from the current page, look for the selector of the “Next” button/link and, if it exists, proceed.
Change the URL parameter: many stores use page=2, page=3… You can increment this parameter for as long as the pages keep returning products.

Stop control:

If there is no “Next” link, stop.
If the page returned zero products, finish.
Set a safety limit on pages to avoid infinite loops.
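A minimal sketch of the URL-parameter approach, with the stop controls listed above, might look like this. The parameter name page and the fetch_products callable (which would fetch one URL and return its list of products) are assumptions for illustration.

```python
def collect_all_pages(base_url, fetch_products, max_pages=50):
    """Walk ?page=1, 2, 3... until a page is empty or the safety limit is reached."""
    items = []
    separator = "&" if "?" in base_url else "?"
    for page in range(1, max_pages + 1):  # max_pages is the guard against infinite loops
        products = fetch_products(f"{base_url}{separator}page={page}")
        if not products:
            break  # zero products: we are past the last page
        items.extend(products)
    return items
```

For the "Next page" link approach, the loop is the same, except that the next URL comes from the extracted link and the loop stops when that link is missing.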

7) Deduplication

To avoid duplicating records with each execution:

Generate a unique identifier (e.g., hash of the product link + store).
Before saving, check if the ID already exists in the destination (spreadsheet/DB).
Maintain a repository of “seen” items (cache in Redis, control table, or status column in the spreadsheet).
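One way to sketch the identifier and the "seen" check is shown below. The choice of SHA-256 and the in-memory set are assumptions; in a real flow the set would be backed by the control table, Redis, or spreadsheet column mentioned above.

```python
import hashlib


def product_id(store, link):
    """Build a stable identifier from the store name and the product link."""
    key = f"{store.strip().lower()}|{link.strip()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def filter_new(items, seen_ids, store):
    """Return only items not seen before, and record their IDs in seen_ids."""
    new_items = []
    for item in items:
        item_id = product_id(store, item["link"])
        if item_id in seen_ids:
            continue  # already captured in this or a previous execution
        seen_ids.add(item_id)
        new_items.append({**item, "id": item_id})
    return new_items
```

If links carry tracking parameters that change between visits, strip them before hashing; otherwise the same product produces different IDs.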

8) Data persistence

Choose where the data will reside:

Google Sheets: great for quick visualization and manual validations.
Airtable: combines spreadsheet and database, with views and automations.
MySQL/PostgreSQL: recommended for history, advanced queries, and integration with BI.
Notion: useful for teams that prefer a documentation/data hub.
APIs: send to a custom endpoint, or integrate with ERPs and CRMs.
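For the relational option, the sketch below uses SQLite from Python's standard library as a stand-in for MySQL/PostgreSQL, so it runs anywhere. The table layout is a hypothetical minimum, and the upsert syntax shown is SQLite's (PostgreSQL's is similar; MySQL uses a different clause).

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    price REAL,
    link TEXT NOT NULL,
    capture_date TEXT NOT NULL
)
"""

# Insert a new product, or refresh the stored values when the id already exists.
UPSERT = """
INSERT INTO products (id, title, price, link, capture_date)
VALUES (:id, :title, :price, :link, :capture_date)
ON CONFLICT(id) DO UPDATE SET
    title = excluded.title,
    price = excluded.price,
    capture_date = excluded.capture_date
"""


def save_products(connection, items):
    """Write a batch of product dicts in a single transaction."""
    connection.execute(SCHEMA)
    with connection:  # commits on success, rolls back on error
        connection.executemany(UPSERT, items)
```

An upsert like this keeps only the latest state of each product. To keep a price history, append each capture to a separate table keyed by product id and capture date instead of overwriting.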

If the intention is to feed a catalog in WordPress/WooCommerce, you can:

Use the WooCommerce REST API to create/update products.
Maintain a staging layer (spreadsheet/DB) and only publish after review.

9) Observability and notifications

Good automation is observable automation. Consider:

Send a summary of the captures to Slack/Telegram (e.g., total products, new promotions, errors).
Log processed URLs, execution time, and item count.
Separate errors by type (timeout, broken selector, store blocking) for quick diagnostics.
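As a small illustration of separating errors by type and building the summary message, consider the sketch below. The category names and the rule that "HTTP 200 with zero items suggests a broken selector" are assumptions; a legitimately empty page would trigger the same label, so treat it as a hint for investigation.

```python
from collections import Counter


def classify_error(status_code=None, timed_out=False, items_found=None):
    """Map one page result to a coarse error category, or None when it looks fine."""
    if timed_out:
        return "timeout"
    if status_code in (403, 429):
        return "store_blocking"
    if status_code is not None and status_code >= 400:
        return "http_error"
    if items_found == 0:
        return "broken_selector"  # page loaded, but the selectors matched nothing
    return None


def build_summary(results):
    """Build a plain-text summary suitable for a Slack or Telegram message."""
    errors = Counter(r["error"] for r in results if r["error"])
    total = sum(r["items"] for r in results)
    lines = [f"Pages processed: {len(results)}", f"Products captured: {total}"]
    lines += [f"{kind}: {count}" for kind, count in sorted(errors.items())]
    return "\n".join(lines)
```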

10) Scheduling

With the scheduling node, define when the flow should run. Tips:

Avoid the store's peak traffic hours to reduce the chance of blocking and improve performance.
If necessary, perform lighter scans more frequently (incremental collection).

When to use a headless browser

Some showcases load products via JavaScript, without complete static HTML. In these cases, the HTTP Request node is not enough. Use a headless browser to:

Load the page like a real browser (with User-Agent, cookies, and JS execution).
Wait for the grid selector to appear (e.g., .product-grid .product-card).
Extract the already rendered HTML and then apply the data selectors.

Stability tips:

Set realistic timeouts and wait for specific elements, not just document loading.
Simulate scrolling when the grid uses infinite scroll, collecting a batch of products after each “scroll”.
Reduce the request rate and intersperse with small delays to avoid blocking.

Data quality and compliance

When dealing with various stores and different templates, data quality can vary. Adopt clear standards:

Validation of mandatory fields (title, price, link) with discarding or marking incomplete records.
Normalization of currency and numeric format (R$, thousand and decimal separators).
Standardization of availability into a fixed set of values (e.g., “in_stock”, “unavailable”, “on_order”).
Standardization of images: ensure absolute links in HTTPS and acceptable sizes.
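The mandatory-field rule can be sketched in a few lines. Here incomplete records are marked and set aside, not silently dropped, so they can be reviewed; the field list follows the example above, and the missing_fields key is a hypothetical name.

```python
REQUIRED_FIELDS = ("title", "price", "link")


def split_valid(items):
    """Separate complete records from those missing a mandatory field."""
    valid, incomplete = [], []
    for item in items:
        missing = [field for field in REQUIRED_FIELDS if item.get(field) in (None, "")]
        if missing:
            incomplete.append({**item, "missing_fields": missing})
        else:
            valid.append(item)
    return valid, incomplete
```

A sudden rise in incomplete records is often the first visible sign that the store changed its template.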

Scalability and performance

If you intend to capture hundreds of pages or thousands of products:

Implement a queue: process URLs in batches to control parallelism.
Cache: when the store's responses include ETag or Last-Modified headers, send conditional requests (If-None-Match/If-Modified-Since) so that unchanged pages are not downloaded again.
Exponential backoff: when detecting blocks or many 429 errors, increase the interval between requests.
Incremental storage: save in “streaming”, not everything at the end, to avoid losing data in failures.
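Two of the ideas above, batching and exponential backoff, are easy to show in isolation. The sketch below is a generic illustration; the base delay, factor, and cap are arbitrary assumptions to be tuned per store.

```python
import random


def backoff_delays(base=1.0, factor=2.0, max_delay=60.0, attempts=5,
                   jitter=0.0, rng=random.random):
    """Return the wait time before each retry: base, base*factor, ... capped at max_delay.

    With jitter > 0, each delay is increased by up to that fraction at random,
    so that parallel workers do not all retry at the same moment.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(base * factor ** attempt, max_delay)
        delays.append(delay + delay * jitter * rng())
    return delays


def batches(urls, size):
    """Yield the URL list in chunks of at most `size` items."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]
```

A typical use is to process one batch, save its results immediately, and only then move on, applying the backoff delays whenever the store answers with 429.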

Common errors and how to avoid them

Brittle selectors: selectors that depend on temporary classes break whenever the store's front end changes; prefer more semantic selectors or data attributes.
Ignoring pagination: extracting only the first page severely underestimates the catalog.
Not handling variations: price and availability may change due to color/size variation; understand what your analysis requires.
Lack of deduplication: generates “inflated” spreadsheets and complicates history.
Not versioning the flow: changes in the store require adjustments; keep versions of your workflow and comments.

Ethics and legality

Before extracting data, check the website's terms of use and the robots.txt file. In many cases, collecting publicly displayed information for analysis purposes is tolerated, as long as it does not overload the infrastructure or violate contractual restrictions. Best practices:

Respect request limits and implement rate limiting.
Identify yourself appropriately in the User-Agent when applicable.
Avoid circumventing protection mechanisms that indicate clear restrictions.
Protect collected data and comply with applicable legislation (LGPD when involving personal data).

From HTML to your WordPress

If the end goal is to feed a WordPress site (with or without WooCommerce), n8n can bridge the gap:

For catalogs: integrate with the WordPress REST API to create entries in a custom post type (CPT) with advanced fields.
For e-commerce: use the WooCommerce API to create/update products, prices, and inventory.
For editorial content: transform the collection into topics, lists, or price comparisons.

Practical recommendation: maintain a “hub” of data (spreadsheet or DB) between collection and WordPress. This creates a layer of security and review, avoiding the publication of incorrect or incomplete items.
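As a sketch of the last hop from the hub to WooCommerce, the function below converts one normalized item into a payload shaped for the WooCommerce REST API products endpoint (name, regular_price, sale_price, sku, images, stock_status, status). The input field names are the hypothetical ones used in this article, and the product is created as a draft so that nothing is published before review.

```python
def to_woocommerce_product(item):
    """Build a product payload from a normalized item; prices are sent as strings."""
    on_sale = item.get("previous_price") is not None
    payload = {
        "name": item["title"],
        "type": "simple",
        "sku": item.get("sku") or "",
        # When there is a previous price, it becomes the regular price
        # and the current price becomes the sale price.
        "regular_price": str(item["previous_price"] if on_sale else item["price"]),
        "stock_status": "instock" if item.get("availability") == "in_stock" else "outofstock",
        "status": "draft",  # keep unpublished until someone reviews it
    }
    if on_sale:
        payload["sale_price"] = str(item["price"])
    if item.get("image"):
        payload["images"] = [{"src": item["image"]}]
    return payload
```

In n8n, a payload like this would be sent by the WooCommerce node or an HTTP Request node authenticated against the store.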

Example of a summarized flow

Start: Daily Cron at 6 AM and 6 PM.
Input: list of category URLs (spreadsheet).
Loop: for each URL, capture the HTML (HTTP Request or Browser).
Extract: select the cards and extract necessary fields.
Transform: normalize prices, links, and availability.
Paginate: proceed while there is a next page.
Deduplicate: check in DB if the item already exists (unique ID by link+store).
Persist: save to Google Sheets and MySQL.
Notify: send summary to Slack/Telegram.
End: log execution metrics.

How to test and validate

Do not throw the flow directly into production. Test with a few pages and validate:

Whether the number of products extracted per page matches what the store displays.
Whether the prices match the showcase (pay attention to promotions and installment prices).
Whether the links open the correct product page.
Whether pagination covers the entire catalog, without skipping pages.

Create test cases: pages with promotions, with variations, out of stock, and even empty pages. Note the adjustments to selectors and transformation rules that arise during this stage.

Quick checklist to get started today

Map 3 to 5 categories from your store or target market.
Inspect the HTML and record selectors for title, price, link, and image.
Build a simple flow with capture, extraction, and sending to a spreadsheet.
Add pagination and data normalization.
Implement deduplication and a simple summary alert.
Schedule and monitor for a week, making incremental adjustments.

Conclusion

Extracting data from the product grid of an online store with n8n combines practicality and flexibility. You can transform public pages into structured data, ready to feed analyses, update catalogs, compare prices, and generate business insights, all with a visual, versionable, and scalable flow.

If you want to see in practice how I apply automations and turn ideas into results, also check out my real work on web and e-commerce projects. Visit my portfolio to see cases, layouts, and solutions that can elevate your site's level.

And you, have you ever tried to extract data from a product grid with any automation? What was the biggest challenge you encountered (unstable selector, pagination, blocking) and what would you like to see detailed in the next content?

Anderson Barbosa