smart-crawler · API v2 · 2026-05-25

v2 API · Firecrawl-compatible

Firecrawl-style endpoints (scrape/map/crawl/extract/batch) plus the full smart-crawler data contract (ProductData / DataSourceInfo). One Bearer key works across every calling style, so pick whichever suits you.

1 Overview

2 Authentication (Firecrawl-compatible)

API KEY · External Integration · 2026-05-25
sck_ikuBVCAjAKygdAxu8_DNSDc9iOJkgXMY7jBf5ceMmlw
# Method A · Bearer (Firecrawl-compatible)
curl -H "Authorization: Bearer sck_ikuBVCAjAKygdAxu8_DNSDc9iOJkgXMY7jBf5ceMmlw" \
     "https://smartcrawler.io/api/v2/sources"

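# Method B · X-API-Key (legacy style, also supported)
curl -H "X-API-Key: sck_..." "https://smartcrawler.io/api/v2/sources"
The Firecrawl SDK works as-is: set the Firecrawl client's base URL (the api_url argument in the SDK example in §7) to https://smartcrawler.io/api/v2 and swap in our sck_ API key. The scrape / map / crawl endpoints use exactly the same schema.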
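The two header styles above can be wrapped in one small helper. This is a minimal sketch: auth_headers and its style parameter are hypothetical names, and the sck_ prefix check is an assumption based on the key format shown here.

```python
def auth_headers(api_key: str, style: str = "bearer") -> dict:
    """Build request headers for either authentication style.

    `style` is a hypothetical switch: "bearer" (method A) or "x-api-key" (method B).
    """
    # Assumption: every smart-crawler key starts with "sck_", as in the examples.
    if not api_key.startswith("sck_"):
        raise ValueError("smart-crawler keys are expected to start with 'sck_'")
    if style == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    if style == "x-api-key":
        return {"X-API-Key": api_key}
    raise ValueError(f"unknown auth style: {style!r}")
```

Pass the returned dict as the headers argument of whichever HTTP client you use.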

3 Core endpoints

POST /api/v2/scrape
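Scrapes a single URL → returns structured data + markdown. If the URL is already in the DB, the stored record is returned directly; otherwise the URL is queued for crawling.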
# Request
{
  "url": "https://www.songmics.com/products/sol-3782",
  "formats": ["markdown", "structured"],
  "only_main_content": true,
  "timeout": 30000
}

# Response
{
  "success": true,
  "url": "https://www.songmics.com/products/sol-3782",
  "crawl_url": "https://www.songmics.com/products/sol-3782",
  "site": "songmics_us",
  "data": { // ProductData schema · see §4
    "sku": "SOL-3782",
    "title": "Electric Standing Desk 48-inch",
    "sale_price": 189.99,
    "currency": "USD",
    "ratings": 4.6,
    "review_count": 412,
    "image_urls": ["https://cdn..."],
    "product_url": "https://www.songmics.com/products/sol-3782",
    "site_url": "https://www.songmics.com/",
    "crawled_at": "2026-05-24T10:30:00",
    "confidence": 0.97
  },
  "markdown": "# Electric Standing Desk 48-inch\n\nPrice: 189.99 USD...",
  "scrape_id": "scr_5191fa07f09641bd",
  "credits_used": 1
}
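Because a URL that is not yet in the DB is queued rather than returned, client code should not assume data is present. The sketch below assumes a queued response simply omits the data object, which the documentation does not spell out; product_from_scrape and price_line are hypothetical helper names.

```python
from typing import Any, Optional


def product_from_scrape(resp: dict) -> Optional[dict]:
    """Return the ProductData dict from a /scrape response, or None if absent.

    Assumption: a queued (not yet crawled) URL yields a response without "data".
    """
    if not resp.get("success"):
        raise RuntimeError(f"scrape failed for {resp.get('url')}")
    return resp.get("data")


def price_line(product: dict) -> str:
    """Format title and price, tolerating the nullable price and currency fields."""
    price: Any = product.get("sale_price")
    currency = product.get("currency") or ""
    shown = "n/a" if price is None else f"{price:.2f} {currency}".strip()
    return f"{product['title']}: {shown}"
```

A caller would check for None first and retry the scrape later if the URL was only queued.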
POST /api/v2/map
Lists all known URLs for a site (from the DB), optionally filtered by search. The list is refreshed more promptly than the site's sitemap.
# Request
{
  "url": "https://www.songmics.com/",
  "limit": 1000,
  "search": "desk"  // optional
}

# Response
{
  "success": true,
  "url": "https://www.songmics.com/",
  "site": "songmics_us",
  "links": [
    "https://www.songmics.com/products/sol-3782",
    "https://www.songmics.com/products/sol-3783"
  ],
  "count": 100,
  "credits_used": 1
}
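A small sketch of building the /map request body and cleaning the returned links. map_payload and unique_links are hypothetical helpers; de-duplication is a defensive assumption, since the documentation does not say whether links can repeat.

```python
from typing import Optional


def map_payload(url: str, limit: int = 1000, search: Optional[str] = None) -> dict:
    """Build a /map request body; the optional `search` key is omitted when unused."""
    if limit < 1:
        raise ValueError("limit must be positive")
    payload = {"url": url, "limit": limit}
    if search:
        payload["search"] = search
    return payload


def unique_links(resp: dict) -> list:
    """Return links from a /map response, de-duplicated in first-seen order."""
    return list(dict.fromkeys(resp.get("links", [])))
```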
POST /api/v2/crawl
Triggers a full-site crawl (asynchronous). Returns a job_id; poll it with GET /api/v2/crawl/{job_id}.
# Request
{
  "url": "https://www.songmics.com/",
  "limit": 1000,
  "include_paths": ["^/products/"],
  "max_depth": 2
}

# Response · returned immediately
{
  "success": true,
  "job_id": 730,
  "status": "pending",
  "site": "songmics_us",
  "crawl_url": "https://www.songmics.com/",
  "poll_url": "/api/v2/crawl/730",
  "credits_used": 1000
}
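Before spending credits on a crawl, it can help to preview which URLs an include_paths list would keep. The sketch below assumes the server treats each entry as a regular expression searched against the URL path, as the "^/products/" example suggests; the actual matching rules are not documented here, and matches_include_paths is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def matches_include_paths(url: str, include_paths: list) -> bool:
    """True if the URL's path matches any pattern.

    Assumption: patterns are regexes applied to the path only (no host, no query).
    """
    path = urlparse(url).path
    return any(re.search(pattern, path) for pattern in include_paths)
```

Running it over the links returned by /map gives a rough estimate of how many pages a crawl would target.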
GET /api/v2/crawl/{job_id}
Polls the status of a crawl job. On completion, the response includes data[].
{
  "success": true,
  "job_id": 730,
  "status": "success",  // pending / running / success / failed
  "site": "songmics_us",
  "crawl_url": "https://www.songmics.com/",
  "total": 4202,
  "products_count": 4202,
  "duration_sec": 42.3,
  "data": [
    // ProductData × 100 (first batch)
  ]
}
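Polling is easier to get right with a timeout and a growing delay. The following is a sketch, not the service's recommended client: wait_for_job, its delay defaults, and the injected fetch_status callable are all hypothetical, and only the terminal statuses success and failed come from the response above.

```python
import time
from typing import Callable

TERMINAL = {"success", "failed"}  # terminal statuses listed in the response above


def wait_for_job(fetch_status: Callable[[], dict], timeout_s: float = 600.0,
                 first_delay_s: float = 2.0, max_delay_s: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reaches a terminal status or the timeout elapses.

    `fetch_status` is supplied by the caller; it performs the GET and returns
    the decoded JSON body.
    """
    waited, delay = 0.0, first_delay_s
    while True:
        body = fetch_status()
        if body["status"] in TERMINAL:
            return body
        if waited >= timeout_s:
            raise TimeoutError(f"job still {body['status']} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

With requests, fetch_status could be lambda: requests.get(poll_url, headers=headers).json(), where poll_url and headers are the caller's own values.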
POST /api/v2/batch/scrape
Asynchronous batch scrape (up to 100 URLs).
{
  "urls": ["https://a.com/p/1", "https://b.com/p/2"],
  "formats": ["structured"],
  "webhook": "https://yourapp.com/cb"  // optional
}
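Since one request accepts at most 100 URLs, longer lists need to be split client-side. A minimal sketch, with batch_payloads as a hypothetical helper name:

```python
from typing import Iterator, Optional

MAX_BATCH = 100  # URL cap stated for /batch/scrape


def batch_payloads(urls: list, formats: list,
                   webhook: Optional[str] = None) -> Iterator[dict]:
    """Yield one /batch/scrape request body per chunk of at most MAX_BATCH URLs."""
    for start in range(0, len(urls), MAX_BATCH):
        payload = {"urls": urls[start:start + MAX_BATCH], "formats": formats}
        if webhook:
            payload["webhook"] = webhook  # optional field, omitted when unset
        yield payload
```

Each yielded dict can be posted as the JSON body of a separate request.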
POST /api/v2/extract
LLM extraction against a custom schema (v2.1 uses claude-haiku-4-5).
{
  "urls": ["https://..."],
  "schema": {
    "price": { "type": "number" },
    "in_stock": { "type": "boolean" },
    "variant_count": { "type": "integer" }
  },
  "prompt": "Extract pricing and stock info"
}
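LLM output can drift from the requested types, so it is worth checking each extracted record against the schema you sent. The /extract response shape is not documented here; this sketch assumes each result is a flat dict keyed by the schema's field names, and schema_errors is a hypothetical helper.

```python
# Type names as used in the request schema above, mapped to Python checks.
# bool is excluded from the numeric checks because bool is a subclass of int.
_CHECKS = {
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def schema_errors(record: dict, schema: dict) -> list:
    """Return human-readable problems; an empty list means the record conforms."""
    errors = []
    for name, spec in schema.items():
        check = _CHECKS.get(spec.get("type"))
        if name not in record:
            errors.append(f"{name}: missing")
        elif check is not None and not check(record[name]):
            errors.append(f"{name}: expected {spec['type']}")
    return errors
```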
GET /api/v2/sources
Lists metadata for all 59 data sources (including the crawl_url field).
[
  {
    "site": "vidaxl_de",
    "crawl_url": "https://www.vidaxl.de/",
    "brand": "Vidaxl",
    "country": "DE",
    "platform": "vidaxl",
    "sku_count": 5000,
    "coverage_pct": 100.0,
    "status": "healthy",
    "last_crawled": "2026-05-24T06:05:43",
    "proxy_tier": "residential",
    "anti_bot_level": 4  // 1-5
  },
  ...
]
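The /sources list is convenient for a quick health check. A sketch, assuming the response is the bare JSON array shown above; needs_attention and its 90% coverage threshold are hypothetical choices, not values defined by the API.

```python
def needs_attention(sources: list, min_coverage: float = 90.0) -> list:
    """Return site codes whose status is not healthy or whose coverage is low."""
    return [
        s["site"]
        for s in sources
        if s.get("status") != "healthy" or s.get("coverage_pct", 0.0) < min_coverage
    ]
```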

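4 Data types · ProductData Schema

All product endpoints return the same unified record, made up of the 18 fields listed below. Records serialize to JSON, load into a DataFrame, or can be fed directly into an LLM context.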

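Field          | Type     | Meaning | Example
site           | string   | Internal site code | songmics_us
site_url       | url      | Site root URL (crawled domain) | https://www.songmics.com/
sku            | string   | Unique product identifier | SOL-3782
spu            | string?  | Parent product identifier (groups variants) | SOL-3782-G
title          | string   | Product name | Electric Standing Desk 48-inch
description    | string?  | Product description | Electric height-adjustable office desk, quiet motor...
image_urls     | string[] | List of product image URLs | ["https://cdn..."]
category_path  | string?  | Category path (/-separated) | Office Furniture/Desks
sale_price     | number?  | Current sale price | 189.99
original_price | number?  | Original (strikethrough) price | 209.99
currency       | string?  | 3-letter currency code | USD / EUR / GBP
status         | string?  | Status | on_sale / out_of_stock / discontinued
ratings        | number?  | Average rating | 4.6
review_count   | integer? | Number of reviews | 412
brand          | string?  | Brand | SONGMICS
product_url    | url      | Product detail page (PDP) URL, clickable | https://www.songmics.com/products/sol-3782
crawled_at     | iso8601? | Crawl timestamp | 2026-05-24T10:30:00
confidence     | number   | Data confidence, 0-1 | 0.97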
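For typed access, the record can be mirrored as a dataclass. This sketch assumes a trailing ? in the type column marks a nullable field; the class, from_dict, and the discount_pct convenience property are illustrative and not part of the API. In the /scrape example the site code sits at the top level of the response rather than inside data, so merge it in before calling from_dict.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ProductData:
    # Required fields (no "?" in the table)
    site: str
    site_url: str
    sku: str
    title: str
    product_url: str
    confidence: float
    image_urls: list = field(default_factory=list)
    # Nullable fields
    spu: Optional[str] = None
    description: Optional[str] = None
    category_path: Optional[str] = None
    sale_price: Optional[float] = None
    original_price: Optional[float] = None
    currency: Optional[str] = None
    status: Optional[str] = None
    ratings: Optional[float] = None
    review_count: Optional[int] = None
    brand: Optional[str] = None
    crawled_at: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "ProductData":
        """Build from a response dict, ignoring keys this sketch does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})

    @property
    def discount_pct(self) -> Optional[float]:
        """Percent off original_price, or None when either price is unavailable."""
        if self.sale_price is None or not self.original_price:
            return None
        return round((self.original_price - self.sale_price) / self.original_price * 100, 1)
```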

5 Data types · DataSourceInfo

Field          | Type     | Meaning
site           | string   | Internal code · songmics_us / vidaxl_de / wayfair_us / ...
crawl_url      | url      | URL of the site actually crawled (important: it tells you where the data came from)
brand          | string   | Brand
country        | string   | 2-letter country code
platform       | string   | shopify / vue_spa / nuxt / vidaxl / wayfair / bol / cdiscount / ikea / westelm / cratebarrel / overstock / idealo / otto / allegro / ebay / houzz / article / generic
sku_count      | integer  | Number of SKUs crawled so far
coverage_pct   | number   | Coverage (crawled / full catalog)
status         | string   | healthy / warning / critical / empty
last_crawled   | iso8601? | Time of the last crawl
proxy_tier     | string   | none / datacenter / residential
anti_bot_level | integer  | Anti-bot difficulty 1-5: 1 easy (Shopify) / 2-3 medium (Cloudflare) / 4 hard (PerimeterX) / 5 hardest (Akamai+DataDome)
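The anti_bot_level scale maps naturally onto readable labels for dashboards or logs. A small sketch; anti_bot_label is a hypothetical helper and the label strings are simply the tiers described above.

```python
def anti_bot_label(level: int) -> str:
    """Translate anti_bot_level (1-5) into the difficulty tier described above."""
    if level == 1:
        return "easy"
    if level in (2, 3):
        return "medium"
    if level == 4:
        return "hard"
    if level == 5:
        return "hardest"
    raise ValueError(f"anti_bot_level must be 1-5, got {level}")
```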

6 The 59 data sources · current crawl_url list

Crawl target URLs for every site (also available directly from /api/v2/sources).

Category | Sites | crawl_url pattern | Anti-bot
Home furnishing brands (own stores) | SONGMICS × 6 | songmics.com / songmics.de / .fr / .uk / .es / .it | L1
Home furnishing brands (own stores) | Costway × 9 | costway.com/.ca/.co.uk/.de/.fr/.it/.es/.nl/.pl | L2
Home furnishing brands (own stores) | Homary × 5 | homary.com / uk.homary.com / de./es./fr. | L2
Home furnishing brands (own stores) | Vidaxl × 12 | vidaxl.com/.co.uk/.ca/.ie/.de/.it/.es/.fr/.ro/.pt/.nl/.pl | L4
Home furnishing brands (own stores) | Flexispot × 9 | flexispot.com/.co.uk/.ca/.de/.it/.es/.fr/.nl/.pl | L2
Home furnishing marketplaces | Wayfair / Overstock / WestElm / Crate&Barrel / Article / IKEA | wayfair.com / overstock.com / westelm.com / crateandbarrel.com / article.com / ikea.com/us/en/ | L2-L5
European e-commerce | Otto / Bol / CDiscount / Idealo | otto.de / bol.com / cdiscount.com / idealo.de | L3-L4
Large marketplaces | eBay / Allegro | ebay.com / allegro.pl | L5
Other | BCP / Yaheetech / VonHaus / Woltu / Houzz | bestchoiceproducts.com / yaheetech.shop / vonhaus.com / woltu.eu / houzz.com | L1-L3

7 SDK examples

Python (requests)

import time

import requests

API = "https://smartcrawler.io/api/v2"
KEY = "sck_ikuBVCAjAKygdAxu8_DNSDc9iOJkgXMY7jBf5ceMmlw"
H = {"Authorization": f"Bearer {KEY}"}

# Scrape · fetch one product page
r = requests.post(f"{API}/scrape", headers=H, json={"url": "https://www.songmics.com/products/sol-3782"})
r.raise_for_status()  # surface 4xx/5xx instead of failing later on a missing key
data = r.json()["data"]
print(f"{data['title']} · {data['sale_price']} {data['currency']}")

# Map · list all known product URLs for a site
r = requests.post(f"{API}/map", headers=H, json={"url": "https://www.songmics.com/", "limit": 100})
r.raise_for_status()
urls = r.json()["links"]

# Crawl · start a full-site crawl
r = requests.post(f"{API}/crawl", headers=H, json={"url": "https://www.songmics.com/", "limit": 1000})
r.raise_for_status()
job = r.json()["job_id"]

# Poll · wait until the job finishes
while True:
    r = requests.get(f"{API}/crawl/{job}", headers=H)
    r.raise_for_status()
    s = r.json()
    if s["status"] in ("success", "failed"):
        break
    time.sleep(5)
# A failed job may carry no data[], so fall back to an empty list
print(f"Job {job} {s['status']}: got {len(s.get('data', []))} products")

Firecrawl SDK (directly compatible)

from firecrawl import FirecrawlApp

# Point api_url at our base URL
app = FirecrawlApp(
    api_key="sck_ikuBVCAjAKygdAxu8_DNSDc9iOJkgXMY7jBf5ceMmlw",
    api_url="https://smartcrawler.io/api/v2"
)
r = app.scrape_url("https://www.songmics.com/products/sol-3782")

cURL

curl -X POST -H "Authorization: Bearer sck_iku..." \
     -H "Content-Type: application/json" \
     -d '{"url":"https://www.songmics.com/"}' \
     https://smartcrawler.io/api/v2/scrape

8 Error codes

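Status | Meaning
200 | OK
400 | Request body is missing a required field
401 | Not logged in / invalid API key
404 | Resource not found (site not in the list of 59 / job does not exist)
429 | Rate limit exceeded
500 | Server error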
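A client usually wants to turn these statuses into an action. The sketch below treats 429 and 500 as retryable with capped exponential backoff; that policy, the action labels, and the delay values are assumptions, since no retry guidance or rate-limit headers are documented here.

```python
RETRYABLE = {429, 500}  # assumption: rate limits and server errors are transient


def classify_status(status_code: int) -> str:
    """Map an HTTP status from the table above to a (hypothetical) client action."""
    if status_code == 200:
        return "ok"
    if status_code in RETRYABLE:
        return "retry"
    if status_code == 401:
        return "check_api_key"
    if status_code in (400, 404):
        return "fix_request"
    return "unexpected"


def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 30.0) -> list:
    """Delays in seconds before each retry: base, 2x, 4x, ... capped at cap_s."""
    return [min(base_s * 2 ** i, cap_s) for i in range(attempts)]
```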

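9 Legacy v1 API (still supported)

v1 (/api/v1/*) continues to work as before. v1 leans toward reading data (mostly GET), while v2 leans toward crawling data (triggered by POST).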

→ Try all endpoints in Swagger UI