Junki
Published on 2025-02-21 / 1,265 Visits

Unlock a Free and Powerful Web Search Solution: Deploying Firecrawl and Connecting It to Dify

firecrawl-logo-with-fire.png

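Firecrawl is an API service. Given a URL, it crawls the corresponding page and converts the content into clean Markdown or structured data.

It can crawl every accessible subpage and return clean, well-formatted data for each one. No sitemap is required.

Official hosted service: https://www.firecrawl.dev/

Open-source repository: https://github.com/mendableai/firecrawl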

I. Why choose Firecrawl as the Web Search implementation?

The Dify platform's tool list contains many search-related tools:

屏幕截图_21-2-2025_131026.jpeg

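The familiar ones, Google and Bing, both provide APIs with fast response times, but both charge a fee.

SearXNG is a decent open-source web search option, but in testing it proved unstable from networks in mainland China, and its current Dify integration is rough, with quite a few bugs.

Firecrawl is a powerful structured-crawling tool that uses Playwright under the hood. By scraping a search engine's results page (Baidu, for example), it can provide the capability of a Web Search plugin.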

II. Self-hosting Firecrawl

Official deployment guide: https://github.com/mendableai/firecrawl/blob/main/CONTRIBUTING.md

This article deploys with Docker Compose.

1. Clone the repository

git clone https://github.com/mendableai/firecrawl.git

cd firecrawl

2. Configure the environment

Simply copy the official example configuration.

cp apps/api/.env.example apps/api/.env

3. Start the containers

Run in the foreground:

docker compose up

Run in the background:

docker compose up -d

4. Test the endpoint

curl -X GET http://localhost:3002/test

On success it returns: Hello, world!

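III. Connecting to the Dify platform

You only need to configure the Firecrawl API address; the default port is 3002.

For a self-hosted deployment, any value works as the API key.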

屏幕截图_21-2-2025_131046.jpeg
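Before wiring things up in Dify, it can help to confirm that the self-hosted instance answers a scrape request with the same address and key. The following is a minimal sketch, not the way Dify itself calls Firecrawl: it assumes the instance listens on http://localhost:3002 and exposes POST /v1/scrape, and buildScrapeRequest is a hypothetical helper name introduced here for illustration.

```javascript
// Build the endpoint and fetch options for one scrape request.
// Assumptions: self-hosted Firecrawl on port 3002, POST /v1/scrape,
// and a Bearer token header (any value for a self-hosted deployment).
function buildScrapeRequest(baseUrl, apiKey, targetUrl) {
  return {
    // Strip trailing slashes so the path is not doubled
    endpoint: `${baseUrl.replace(/\/+$/, "")}/v1/scrape`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ url: targetUrl, formats: ["markdown"] }),
    },
  };
}

const request = buildScrapeRequest(
  "http://localhost:3002/",
  "any-key",
  "https://example.com"
);
console.log(request.endpoint);
// To actually send it (Node 18+): await fetch(request.endpoint, request.options)
```

If the request succeeds from the machine that runs Dify, the same base address should work in the Dify tool configuration.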

IV. Building a ChatFlow with web search capability

1. Add a single-page scrape node

1-qnhi.jpeg

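Scrape https://www.baidu.com/s?wd={user input}, keep only the elements whose class attribute contains result, and return the output in the links format.

This node yields the list of links on the Baidu results page; each link points to a result site's page.

The debug output has the following structure: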

{
  "text": "",
  "files": [],
  "json": [
    {
      "success": true,
      "data": {
        "metadata": {
          "theme-color": "#ffffff",
          "referrer": "always",
          "title": "全球票房最高的动画片是什么_百度搜索",
          "favicon": "https://www.baidu.com/favicon.ico",
          "scrapeId": "a6b1bed8-7ac7-4bb8-b37f-ddc90d936f5d",
          "sourceURL": "https://www.baidu.com/s?wd=全球票房最高的动画片是什么",
          "url": "https://www.baidu.com/s?wd=全球票房最高的动画片是什么",
          "statusCode": 200
        },
        "links": [
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKiIaBT5bldOg8WKw5Yl2kzyTAdhDjXg8fhOAUuSR0fvpwQ7kTPFB-00BOS0TaSwsO",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFNvD9E44qfYsc_CRAkAaDSLLL_OZbEFrFQiDs1vzQdsHKVoczUckq3tgOnP4TvzGfy",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFMLSKanxg5wetFwpRfsvMLFqPjubf0Q79hiCwwLk0XKJ2ZV3M64hHqIGPfscUPLZue",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKDBzUshbz_zll4SvSx-EID7IndCL1Xm3QNdL78B9KuvTNq9Az180HZ6F8vySsvmuS",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFLu1lPaS1L3aIN164b5IAB_0iDZV-XzWC9A5kqVBECNegvVSYKko9wD_0ZxQfl5c4S",
          "http://www.baidu.com/link?url=xzbQnC6i_MULtA48_qlGC1q1Jna2LebBf6dui64qwFKDBzUshbz_zll4SvSx-EID-vOAlRTrK0bQRF_mm5qBOav-iqkxX7OlvWt5htLOP3y"
        ]
      }
    }
  ]
}
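The node settings described above correspond roughly to a scrape request body like the one below. This is a sketch under assumptions: buildSearchScrapeBody is a hypothetical helper, the field names follow the tool_configurations keys in the DSL, and the URL encoding step is an addition (the DSL inserts the user query into the URL as-is).

```javascript
// Hypothetical helper mirroring the scrape node: links format, only .result elements.
function buildSearchScrapeBody(query) {
  return {
    // encodeURIComponent keeps spaces, "&" and non-ASCII text from breaking the URL
    url: `https://www.baidu.com/s?wd=${encodeURIComponent(query)}`,
    formats: ["links"],
    includeTags: [".result"],
    onlyMainContent: true,
    timeout: 30000,
    waitFor: 0,
  };
}

const searchBody = buildSearchScrapeBody("firecrawl & dify");
console.log(searchBody.url);
```

Encoding the query matters mostly for inputs containing characters such as & or #, which would otherwise be read as part of the URL structure.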

2. Add a code execution node

2-dols.jpeg

This node runs a simple JS script that extracts the links array so that later nodes can iterate over it.
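For readability, here is the script in unescaped form. The DSL version simply returns arg1[0].data.links; the guard that falls back to an empty array is an addition for illustration, and the sample link values are shortened placeholders rather than real results.

```javascript
// arg1 is the scrape node's `json` output: an array holding one response object.
function main({ arg1 }) {
  const first = Array.isArray(arg1) ? arg1[0] : undefined;
  const links = first && first.data ? first.data.links : undefined;
  // Added guard: return an empty list instead of throwing when nothing was scraped
  return { result: Array.isArray(links) ? links : [] };
}

// Sample input shaped like the debug output (hypothetical, shortened links)
const sample = [
  {
    success: true,
    data: {
      links: [
        "http://www.baidu.com/link?url=a",
        "http://www.baidu.com/link?url=b",
      ],
    },
  },
];
console.log(main({ arg1: sample }));
```

With the guard, an empty search result leads to an iteration over zero items rather than a failed code node.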

3. Add an iteration node

3-asad.jpeg

4-jrbu.jpeg

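Iterate over the links and scrape each result page. Here the node is set to scrape only the main content and to strip some unneeded tags (for example: style, script, img, svg, a).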
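Outside Dify, the same loop can be sketched as follows. The names buildPageScrapeBody and scrapeAll are hypothetical, the field names follow the tool_configurations keys in the DSL, and the scraper is passed in as a function so the loop can be tried without a running Firecrawl instance.

```javascript
// Tags removed from every result page, as configured in the iteration's scrape node
const EXCLUDED_TAGS = ["style", "script", "img", "svg", "a"];

// Request body for one result page (hypothetical helper)
function buildPageScrapeBody(link) {
  return {
    url: link,
    includeTags: ["body"],
    excludeTags: EXCLUDED_TAGS,
    onlyMainContent: true,
    timeout: 30000,
  };
}

// Sequential iteration, matching a non-parallel iteration node.
// A rejected scrape stops the loop, similar to the "terminated" error mode.
async function scrapeAll(links, scrapeOne) {
  const pages = [];
  for (const link of links) {
    pages.push(await scrapeOne(buildPageScrapeBody(link)));
  }
  return pages;
}

// Stand-in scraper for a dry run; swap in a real HTTP call to Firecrawl
const fakeScrape = async (body) => `content of ${body.url}`;
scrapeAll(["http://a.example", "http://b.example"], fakeScrape).then((pages) =>
  console.log(pages)
);
```

Removing a tags strips link markup from the scraped text, which keeps the content passed to the model shorter.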

4. Add an LLM node

5.jpeg

This node gathers the page content scraped in the previous step and instructs the LLM to summarize it in its reply.
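To make the prompt assembly concrete, the sketch below joins the scraped pages into a context block. The prompt lines are an English paraphrase of the system prompt in the DSL, buildSystemPrompt is a hypothetical helper, and the empty-page filter and per-page character cap are additions that the Dify node does not perform on its own.

```javascript
// Assemble a system prompt from the iteration output (an array of page texts).
function buildSystemPrompt(pages, maxCharsPerPage = 4000) {
  const context = pages
    // Added for illustration: drop empty pages and cap each page's length
    .filter((page) => typeof page === "string" && page.trim() !== "")
    .map((page) => page.trim().slice(0, maxCharsPerPage))
    .join("\n\n");
  return [
    "You are a helpful assistant.",
    "Use the following context inside the <context></context> XML tags as knowledge you have learned. It comes from a web search, not from the user.",
    "<context>",
    context,
    "</context>",
    "When answering, avoid mentioning that you got the information from the context.",
    "Answer in the language of the user's question.",
  ].join("\n");
}

const prompt = buildSystemPrompt(["Page one text ", "", "Page two text"]);
console.log(prompt);
```

A cap of some kind is worth considering in practice, since several full pages can exceed the context window of a locally hosted model.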

5. Add a direct reply node

6.jpeg

6. Complete DSL file

You can import it in the Dify workflow designer through the right-click menu.

app:
  description: ''
  icon: 🤖
  icon_background: '#FFEAD5'
  mode: advanced-chat
  name: Web Search Bot
  use_icon_as_answer_icon: false
kind: app
version: 0.1.5
workflow:
  conversation_variables: []
  environment_variables: []
  features:
    file_upload:
      allowed_file_extensions:
      - .JPG
      - .JPEG
      - .PNG
      - .GIF
      - .WEBP
      - .SVG
      allowed_file_types:
      - image
      allowed_file_upload_methods:
      - local_file
      - remote_url
      enabled: false
      fileUploadConfig:
        audio_file_size_limit: 50
        batch_count_limit: 5
        file_size_limit: 15
        image_file_size_limit: 10
        video_file_size_limit: 100
        workflow_file_upload_limit: 10
      image:
        enabled: false
        number_limits: 3
        transfer_methods:
        - local_file
        - remote_url
      number_limits: 3
    opening_statement: ''
    retriever_resource:
      enabled: true
    sensitive_word_avoidance:
      enabled: false
    speech_to_text:
      enabled: false
    suggested_questions: []
    suggested_questions_after_answer:
      enabled: false
    text_to_speech:
      enabled: false
      language: ''
      voice: ''
  graph:
    edges:
    - data:
        isInIteration: false
        sourceType: start
        targetType: tool
      id: 1740103540241-source-1740121470685-target
      selected: false
      source: '1740103540241'
      sourceHandle: source
      target: '1740121470685'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: tool
        targetType: code
      id: 1740121470685-source-1740122039767-target
      selected: false
      source: '1740121470685'
      sourceHandle: source
      target: '1740122039767'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: code
        targetType: iteration
      id: 1740122039767-source-1740122637931-target
      source: '1740122039767'
      sourceHandle: source
      target: '1740122637931'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: true
        iteration_id: '1740122637931'
        sourceType: iteration-start
        targetType: tool
      id: 1740122637931start-source-1740122872165-target
      source: 1740122637931start
      sourceHandle: source
      target: '1740122872165'
      targetHandle: target
      type: custom
      zIndex: 1002
    - data:
        isInIteration: false
        sourceType: llm
        targetType: answer
      id: 1740122936994--1740122982132-target
      source: '1740122936994'
      sourceHandle: source
      target: '1740122982132'
      targetHandle: target
      type: custom
      zIndex: 0
    - data:
        isInIteration: false
        sourceType: iteration
        targetType: llm
      id: 1740122637931-source-1740122936994-target
      source: '1740122637931'
      sourceHandle: source
      target: '1740122936994'
      targetHandle: target
      type: custom
      zIndex: 0
    nodes:
    - data:
        desc: ''
        selected: false
        title: 开始
        type: start
        variables: []
      height: 54
      id: '1740103540241'
      position:
        x: 30
        y: 406
      positionAbsolute:
        x: 30
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        provider_id: firecrawl
        provider_name: firecrawl
        provider_type: builtin
        selected: false
        title: 单页面抓取
        tool_configurations:
          excludeTags: null
          formats: links
          headers: null
          includeTags: .result
          onlyMainContent: 1
          prompt: null
          schema: null
          systemPrompt: null
          timeout: 30000
          waitFor: 0
        tool_label: 单页面抓取
        tool_name: scrape
        tool_parameters:
          url:
            type: mixed
            value: https://www.baidu.com/s?wd={{#sys.query#}}
        type: tool
      height: 324
      id: '1740121470685'
      position:
        x: 334
        y: 406
      positionAbsolute:
        x: 334
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        code: "\nfunction main({arg1}) {\n    return {\n        result: arg1[0].data.links\n\
          \    }\n}\n"
        code_language: javascript
        desc: ''
        outputs:
          result:
            children: null
            type: array[string]
        selected: false
        title: 代码执行
        type: code
        variables:
        - value_selector:
          - '1740121470685'
          - json
          variable: arg1
      height: 54
      id: '1740122039767'
      position:
        x: 638
        y: 406
      positionAbsolute:
        x: 638
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        desc: ''
        error_handle_mode: terminated
        height: 412
        is_parallel: false
        iterator_selector:
        - '1740122039767'
        - result
        output_selector:
        - '1740122872165'
        - text
        output_type: array[string]
        parallel_nums: 10
        selected: false
        start_node_id: 1740122637931start
        title: 迭代
        type: iteration
        width: 692
      height: 412
      id: '1740122637931'
      position:
        x: 942
        y: 406
      positionAbsolute:
        x: 942
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 692
      zIndex: 1
    - data:
        desc: ''
        isInIteration: true
        selected: false
        title: ''
        type: iteration-start
      draggable: false
      height: 48
      id: 1740122637931start
      parentId: '1740122637931'
      position:
        x: 24
        y: 68
      positionAbsolute:
        x: 966
        y: 474
      selectable: false
      sourcePosition: right
      targetPosition: left
      type: custom-iteration-start
      width: 44
      zIndex: 1002
    - data:
        desc: ''
        isInIteration: true
        iteration_id: '1740122637931'
        provider_id: firecrawl
        provider_name: firecrawl
        provider_type: builtin
        selected: false
        title: 单页面抓取
        tool_configurations:
          excludeTags: style,script,img,svg,a
          formats: null
          headers: null
          includeTags: body
          onlyMainContent: 1
          prompt: null
          schema: null
          systemPrompt: null
          timeout: 30000
          waitFor: 0
        tool_label: 单页面抓取
        tool_name: scrape
        tool_parameters:
          url:
            type: mixed
            value: '{{#1740122637931.item#}}'
        type: tool
      height: 324
      id: '1740122872165'
      parentId: '1740122637931'
      position:
        x: 283.42857142857133
        y: 66.57142857142856
      positionAbsolute:
        x: 1225.4285714285713
        y: 472.57142857142856
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
      zIndex: 1002
    - data:
        context:
          enabled: false
          variable_selector: []
        desc: ''
        model:
          completion_params:
            temperature: 0.7
          mode: chat
          name: deepseek-r1:70b
          provider: ollama
        prompt_template:
        - edition_type: basic
          id: 10d89ba1-6562-486a-83c3-6cdb53359a7a
          jinja2_text: ''
          role: system
          text: '你是一个乐于助人的助手。

            在<context></context> XML标记中使用以下上下文作为您学到的知识。这些知识来源于网络搜索,不是用户提供给你的。

            <context>

            {{#1740122637931.output#}}

            </context>

            回答用户时,避免提到你是从上下文中获得信息的。

            并根据用户提问的语言进行回答。'
        selected: false
        title: LLM
        type: llm
        variables: []
        vision:
          enabled: false
      height: 98
      id: '1740122936994'
      position:
        x: 1694
        y: 406
      positionAbsolute:
        x: 1694
        y: 406
      selected: true
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    - data:
        answer: '{{#1740122936994.text#}}'
        desc: ''
        selected: false
        title: 直接回复
        type: answer
        variables: []
      height: 103
      id: '1740122982132'
      position:
        x: 1998
        y: 406
      positionAbsolute:
        x: 1998
        y: 406
      selected: false
      sourcePosition: right
      targetPosition: left
      type: custom
      width: 244
    viewport:
      x: -629
      y: -20
      zoom: 0.7

V. Chat experience

7.png

