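Firecrawl is an API service. It takes a URL, crawls the page behind it, and converts the content into clean Markdown or structured data.
It can crawl every accessible subpage and return clean, well-formatted data for each one. No sitemap is required.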
Official service: https://www.firecrawl.dev/
Source code: https://github.com/mendableai/firecrawl
I. Why Choose Firecrawl as the Web Search Implementation?
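The tool list on the Dify platform includes many search-related tools.
The familiar Google and Bing tools are both backed by official APIs with fast response times, but both cost money.
SearXNG is a decent open-source web search option. In testing, however, it proved unstable on networks in mainland China, and its integration with Dify is currently rough, with quite a few bugs.
Firecrawl, by contrast, is a powerful structured crawling tool built on Playwright. By scraping the results page of a web search site (Baidu, for example), it can provide the capability of a Web Search plugin.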
II. Self-Hosting Firecrawl
Official deployment guide: https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md
This article deploys with Docker Compose.
1. Clone the repository
git clone https://github.com/mendableai/firecrawl.git
cd firecrawl
2. Configure the environment
Copy the official example configuration as-is.
cp apps/api/.env.example apps/api/.env
3. Start the containers
Run in the foreground:
docker compose up
Run in the background:
docker compose up -d
4. Test the API
curl -X GET http://localhost:3002/test
A successful request returns: Hello, world!
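Beyond the /test health check, the instance can also be exercised through its scrape endpoint. The following is a minimal sketch, assuming the self-hosted API serves Firecrawl's v1 route POST /v1/scrape on port 3002 and that authentication is not enforced, so any bearer token is accepted. The helper names buildScrapeRequest and scrapeOnce are illustrative and not part of Firecrawl.

```javascript
// Hypothetical helper: builds the endpoint and fetch options for one scrape call.
// Assumes the self-hosted instance exposes Firecrawl's v1 API at /v1/scrape.
function buildScrapeRequest(baseUrl, pageUrl, apiKey = "any-placeholder-key") {
  return {
    // Strip trailing slashes so the path is not doubled.
    endpoint: `${baseUrl.replace(/\/+$/, "")}/v1/scrape`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ url: pageUrl, formats: ["markdown"] }),
    },
  };
}

// Sends the request with the global fetch available in Node.js 18+.
async function scrapeOnce(baseUrl, pageUrl) {
  const { endpoint, init } = buildScrapeRequest(baseUrl, pageUrl);
  const response = await fetch(endpoint, init);
  if (!response.ok) {
    throw new Error(`Scrape failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

Calling scrapeOnce("http://localhost:3002", "https://example.com") against a running instance should resolve to the parsed JSON response. Keeping request construction separate from the network call makes the payload easy to inspect without a live server.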
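III. Connecting to the Dify Platform
Only the Firecrawl API address needs to be configured. The default port is 3002.
For a self-hosted deployment, the API key can be any value.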
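IV. Building a ChatFlow with Web Search Capability
1. Add a single-page scrape node
Scrape https://www.baidu.com/s?wd={user input}, keep only the elements whose class attribute contains result, and return the output in the links format.
This node yields the list of links on the Baidu search results page. Each link points to a result site.
The debug output has the following structure: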
{
  "text": "",
  "files": [],
  "json": [
    {
      "success": true,
      "data": {
        "metadata": {
          "theme-color": "#ffffff",
          "referrer": "always",
          "title": "全球票房最高的动画片是什么_百度搜索",
          "favicon": "https://www.baidu.com/favicon.ico",
          "scrapeId": "a6b1bed8-7ac7-4bb8-b37f-ddc90d936f5d",
          "sourceURL": "https://www.baidu.com/s?wd=全球票房最高的动画片是什么",
          "url": "https://www.baidu.com/s?wd=全球票房最高的动画片是什么",
          "statusCode": 200
        },
        "links": [
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKiIaBT5bldOg8WKw5Yl2kzyTAdhDjXg8fhOAUuSR0fvpwQ7kTPFB-00BOS0TaSwsO",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFNvD9E44qfYsc_CRAkAaDSLLL_OZbEFrFQiDs1vzQdsHKVoczUckq3tgOnP4TvzGfy",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFMLSKanxg5wetFwpRfsvMLFqPjubf0Q79hiCwwLk0XKJ2ZV3M64hHqIGPfscUPLZue",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKDBzUshbz_zll4SvSx-EID7IndCL1Xm3QNdL78B9KuvTNq9Az180HZ6F8vySsvmuS",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFLu1lPaS1L3aIN164b5IAB_0iDZV-XzWC9A5kqVBECNegvVSYKko9wD_0ZxQfl5c4S",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKDBzUshbz_zll4SvSx-EID-vOAlRTrK0bQRF_mm5qBOav-iqkxX7OlvWt5htLOP3y"
        ]
      }
    }
  ]
}
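The settings of this node map onto a small set of scrape options. The sketch below shows one way to assemble them outside Dify. The option names (formats, includeTags, onlyMainContent, timeout) mirror the tool configuration exported in the DSL, while the helper name buildBaiduSearchPayload is hypothetical. The article does not state whether Dify URL-encodes the user query, so this sketch encodes it explicitly.

```javascript
// Hypothetical helper: builds scrape options that mirror the node settings.
function buildBaiduSearchPayload(query) {
  return {
    // encodeURIComponent keeps spaces, "&", and non-ASCII text
    // from breaking the query string.
    url: `https://www.baidu.com/s?wd=${encodeURIComponent(query)}`,
    // Ask only for the links found on the page.
    formats: ["links"],
    // Keep only elements matching the .result class selector.
    includeTags: [".result"],
    onlyMainContent: true,
    // Timeout in milliseconds, as in the exported node configuration.
    timeout: 30000,
  };
}
```

For example, buildBaiduSearchPayload("a b&c").url is https://www.baidu.com/s?wd=a%20b%26c, so a query containing "&" cannot inject extra parameters into the search URL.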
2. Add a code execution node
This node runs a short JS script that extracts the links array so that later nodes can iterate over it.
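The exported DSL at the end of this article contains the script only in escaped form, so a readable version may help. The sketch below follows the same shape (a main function that receives arg1, the json output of the scrape node). The defensive checks and the de-duplication step are additions made here for illustration and are not part of the original script.

```javascript
// Code-node entry point. arg1 is the `json` output of the scrape node:
// an array whose first element wraps the page data.
function main({ arg1 }) {
  const first = Array.isArray(arg1) ? arg1[0] : undefined;
  // Fall back to an empty list when the scrape returned nothing usable.
  const links =
    first && first.data && Array.isArray(first.data.links)
      ? first.data.links
      : [];
  // Drop duplicate links while keeping the original order.
  return { result: [...new Set(links)] };
}
```

Returning an empty array instead of throwing lets the iteration node run zero times when Baidu returns no matching results. Whether that is preferable to failing loudly depends on how the workflow should behave on an empty search.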
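3. Add an iteration node
Iterate over the result pages and scrape the content of each one. The node is set to fetch only the main content and to strip some unneeded tags (for example: style, script, img, svg, a).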
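Outside Dify, the same loop can be sketched in a few lines. The option names below mirror the tool configuration of the inner scrape node in the exported DSL. The helper names buildPageScrapeOptions and scrapeAll are hypothetical, and scrape is an injected function assumed to resolve to the text of one page, so the loop can be tried without a live Firecrawl instance.

```javascript
// Hypothetical helper: scrape options mirroring the iteration node's settings.
function buildPageScrapeOptions(link) {
  return {
    url: link,
    includeTags: ["body"],
    // Tags that add noise for the LLM and carry no readable text.
    excludeTags: ["style", "script", "img", "svg", "a"],
    onlyMainContent: true,
    timeout: 30000,
  };
}

// Visits the links one at a time, like a non-parallel iteration.
// `scrape` is an injected async function (an assumption of this sketch)
// that takes the options object and resolves to the page text.
// An error from any page propagates and stops the loop.
async function scrapeAll(links, scrape) {
  const pages = [];
  for (const link of links) {
    pages.push(await scrape(buildPageScrapeOptions(link)));
  }
  return pages;
}
```

Passing the scraper in as a parameter keeps the loop independent of any particular HTTP client. Running the requests sequentially is slower than running them in parallel, but it puts less load on a small self-hosted instance.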
4. Add an LLM node
This node gathers the page content scraped in the previous step and instructs the LLM to summarize it in its reply.
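To make the prompt assembly concrete, the sketch below joins the scraped pages into a single context block inside an English rendering of the system prompt used by this node. The helper name buildSystemPrompt and the maxChars budget are assumptions introduced here. The workflow itself passes the iteration output to the prompt without truncation.

```javascript
// Hypothetical helper: joins scraped pages into one <context> block.
// maxChars is an assumed budget to keep the prompt within the model's
// context window; it is not a value taken from the workflow.
function buildSystemPrompt(pages, maxChars = 8000) {
  const context = pages
    // Skip pages that failed to produce any text.
    .filter((page) => typeof page === "string" && page.trim() !== "")
    .join("\n\n")
    .slice(0, maxChars);
  return [
    "You are a helpful assistant.",
    "Use the following context inside the <context></context> XML tags as your learned knowledge. It comes from a web search, not from the user.",
    "<context>",
    context,
    "</context>",
    "When answering, avoid mentioning that the information came from the context.",
    "Answer in the language of the user's question.",
  ].join("\n");
}
```

Truncating by character count is crude, since it can cut a page mid-sentence. It is only meant to show where a length limit would go if long result pages overflow the model's context window.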
5. Add a direct reply node
6. Complete DSL file
It can be imported by right-clicking in the Dify workflow editor.
app:
  description: ''
  icon: 🤖
  icon_background: '#FFEAD5'
  mode: advanced-chat
  name: Web Search Bot
  use_icon_as_answer_icon: false
kind: app
version: 0.1.5
workflow:
  conversation_variables: []
  environment_variables: []
  features:
    file_upload:
      allowed_file_extensions:
      - .JPG
      - .JPEG
      - .PNG
      - .GIF
      - .WEBP
      - .SVG
      allowed_file_types:
      - image
      allowed_file_upload_methods:
      - local_file
      - remote_url
      enabled: false
      fileUploadConfig:
        audio_file_size_limit: 50
        batch_count_limit: 5
        file_size_limit: 15
        image_file_size_limit: 10
        video_file_size_limit: 100
        workflow_file_upload_limit: 10
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
      number_limits: 3
    opening_statement: ''
    retriever_resource:
      enabled: true
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        isInIteration: false
        sourceType: start
        targetType: tool
      id: 1740103540241-source-1740121470685-target
      selected: false
      source: '1740103540241'
      sourceHandle: source
      target: '1740121470685'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: tool
        targetType: code
      id: 1740121470685-source-1740122039767-target
      selected: false
      source: '1740121470685'
      sourceHandle: source
      target: '1740122039767'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: code
        targetType: iteration
      id: 1740122039767-source-1740122637931-target
      source: '1740122039767'
      sourceHandle: source
      target: '1740122637931'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: true
        iteration_id: '1740122637931'
        sourceType: iteration-start
        targetType: tool
      id: 1740122637931start-source-1740122872165-target
      source: 1740122637931start
      sourceHandle: source
      target: '1740122872165'
      targetHandle: target
      type: custom
      zIndex: 1002
    - data:
        isInIteration: false
        sourceType: llm
        targetType: answer
      id: 1740122936994--1740122982132-target
      source: '1740122936994'
      sourceHandle: source
      target: '1740122982132'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: iteration
        targetType: llm
      id: 1740122637931-source-1740122936994-target
      source: '1740122637931'
      sourceHandle: source
      target: '1740122936994'
      targetHandle: target
      type: custom
      zIndex: 0
    nodes:
    - data:
        desc: ''
        selected: false
        title: 开始
        type: start
        variables: []
      height: 54
      id: '1740103540241'
      position:
        x: 30
        y: 406
      positionAbsolute:
        x: 30
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        provider_id: firecrawl
        provider_name: firecrawl
        provider_type: builtin
        selected: false
        title: 单页面抓取
        tool_configurations:
          excludeTags: null
          formats: links
          headers: null
          includeTags: .result
          onlyMainContent: 1
          prompt: null
          schema: null
          systemPrompt: null
          timeout: 30000
          waitFor: 0
        tool_label: 单页面抓取
        tool_name: scrape
        tool_parameters:
          url:
            type: mixed
            value: https://www.baidu.com/s?wd={{#sys.query#}}
        type: tool
      height: 324
      id: '1740121470685'
      position:
        x: 334
        y: 406
      positionAbsolute:
        x: 334
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        code: "\nfunction main({arg1}) {\n return {\n result: arg1[0].data.links\n\
          \ }\n}\n"
        code_language: javascript
        desc: ''
        outputs:
          result:
            children: null
            type: array[string]
        selected: false
        title: 代码执行
        type: code
        variables:
        - value_selector:
          - '1740121470685'
          - json
          variable: arg1
      height: 54
      id: '1740122039767'
      position:
        x: 638
        y: 406
      positionAbsolute:
        x: 638
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        error_handle_mode: terminated
        height: 412
        is_parallel: false
        iterator_selector:
        - '1740122039767'
        - result
        output_selector:
        - '1740122872165'
        - text
        output_type: array[string]
        parallel_nums: 10
        selected: false
        start_node_id: 1740122637931start
        title: 迭代
        type: iteration
        width: 692
      height: 412
      id: '1740122637931'
      position:
        x: 942
        y: 406
      positionAbsolute:
        x: 942
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 692
      zIndex: 1
    - data:
        desc: ''
        isInIteration: true
        selected: false
        title: ''
        type: iteration-start
      draggable: false
      height: 48
      id: 1740122637931start
      parentId: '1740122637931'
      position:
        x: 24
        y: 68
      positionAbsolute:
        x: 966
        y: 474
      selectable: false
      sourcePosition: right
      targetPosition: left
      type: custom-iteration-start
      width: 44
      zIndex: 1002
    - data:
        desc: ''
        isInIteration: true
        iteration_id: '1740122637931'
        provider_id: firecrawl
        provider_name: firecrawl
        provider_type: builtin
        selected: false
        title: 单页面抓取
        tool_configurations:
          excludeTags: style,script,img,svg,a
          formats: null
          headers: null
          includeTags: body
          onlyMainContent: 1
          prompt: null
          schema: null
          systemPrompt: null
          timeout: 30000
          waitFor: 0
        tool_label: 单页面抓取
        tool_name: scrape
        tool_parameters:
          url:
            type: mixed
            value: '{{#1740122637931.item#}}'
        type: tool
      height: 324
      id: '1740122872165'
      parentId: '1740122637931'
      position:
        x: 283.42857142857133
        y: 66.57142857142856
      positionAbsolute:
        x: 1225.4285714285713
        y: 472.57142857142856
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
      zIndex: 1002
    - data:
        context:
          enabled: false
          variable_selector: []
        desc: ''
        model:
          completion_params:
            temperature: 0.7
          mode: chat
          name: deepseek-r1:70b
          provider: ollama
        prompt_template:
        - edition_type: basic
          id: 10d89ba1-6562-486a-83c3-6cdb53359a7a
          jinja2_text: ''
          role: system
          text: '你是一个乐于助人的助手。
            在<context></context> XML标记中使用以下上下文作为您学到的知识。这些知识来源于网络搜索,不是用户提供给你的。
            <context>
            {{#1740122637931.output#}}
            </context>
            回答用户时,避免提到你是从上下文中获得信息的。
            并根据用户提问的语言进行回答。'
        selected: false
        title: LLM
        type: llm
        variables: []
        vision:
          enabled: false
      height: 98
      id: '1740122936994'
      position:
        x: 1694
        y: 406
      positionAbsolute:
        x: 1694
        y: 406
      selected: true
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        answer: '{{#1740122936994.text#}}'
        desc: ''
        selected: false
        title: 直接回复
        type: answer
        variables: []
      height: 103
      id: '1740122982132'
      position:
        x: 1998
        y: 406
      positionAbsolute:
        x: 1998
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: -629
      y: -20
      zoom: 0.7