AI Crawler Tests

Note: some of the code and techniques in this post may be out of date. Last updated: 7 months ago

Frameworks tested: Firecrawl, crawlai, Scrapegraph-ai

In my own tests, Scrapegraph-ai gave the best results. It can use OpenAI, or call a local LLM through Ollama.

Single-page scraping

OpenAI

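import os
import time

import nest_asyncio
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

# Allow nested event loops (needed when running inside a notebook)
nest_asyncio.apply()

# Set the environment variables (the current version needs them set globally, otherwise it raises an error)
os.environ['OPENAI_API_KEY'] = '****'
os.environ['OPENAI_API_BASE'] = '***'

OPENAI_BASE_URL = "****"
OPENAI_API_KEY = "****"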


graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "gpt-3.5-turbo",
        "base_url": OPENAI_BASE_URL,
    },
    # "embeddings": {
    #     "model": "ollama/nomic-embed-text",
    #     "base_url": "****",  # set Ollama URL
    # },
    "headless": True,
}


start = time.time()

PROMPT = '''
Please provide the following information, which is typically found on the school's About page:
1. Founding date
2. School history
3. School philosophy
4. School motto
5. School vision
6. School mission
7. School values

Please visit the school's official website, navigate to the About or similar page, and extract the above information. If you cannot find all the details, please provide as much relevant information as possible. Thank you!

'''

smart_scraper_graph = SmartScraperGraph(
    prompt=PROMPT,
    # also accepts a string with the already downloaded HTML code
    source="https://www.nyu.edu",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)

end = time.time()

print(end - start)
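
The start/end timing pattern above is repeated in every example in this post. As a minimal sketch, it could be wrapped in a small context manager. The name `timed` and the `timings` dict are hypothetical helpers of mine, not part of Scrapegraph-ai, and the `time.sleep` call is only a stand-in for `smart_scraper_graph.run()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=None):
    """Time the enclosed block, print the elapsed seconds, optionally record them."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if sink is not None:
            sink[label] = elapsed  # keep the number for later comparison
        print(f"{label}: {elapsed:.2f}s")


timings = {}  # hypothetical store for comparing backends
with timed("demo", timings):
    time.sleep(0.05)  # stand-in for smart_scraper_graph.run()
```

Collecting the timings in one dict makes it easier to compare the OpenAI and Ollama runs side by side.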

Local LLM via Ollama

import time

from scrapegraphai.graphs import SmartScraperGraph

start = time.time()

graph_config = {
    "llm": {
        "model": "ollama/qwen2:latest",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "****",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "****",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt=PROMPT,
    # also accepts a string with the already downloaded HTML code
    source="https://www.nyu.edu",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
end = time.time()

print(end - start)
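
The OpenAI and Ollama configs differ mainly in the "llm" section, so switching between them can be reduced to one function call. The sketch below only builds plain dicts with the same keys used in the two examples above. `build_graph_config` is a hypothetical helper, and the localhost URL assumes Ollama is running on its default port and serves both the LLM and the embedding model.

```python
def build_graph_config(backend, model, base_url, api_key=None, embed_model=None):
    """Build a config dict shaped like the ones in this post (hypothetical helper)."""
    if backend == "openai":
        llm = {"api_key": api_key, "model": model, "base_url": base_url}
    elif backend == "ollama":
        llm = {
            "model": f"ollama/{model}",
            "temperature": 0,
            "format": "json",  # Ollama needs the format to be specified explicitly
            "base_url": base_url,
        }
    else:
        raise ValueError(f"unknown backend: {backend!r}")

    config = {"llm": llm, "headless": True}
    if embed_model:
        # Assumption: the embedding model is served by the same Ollama instance
        config["embeddings"] = {"model": f"ollama/{embed_model}", "base_url": base_url}
    return config


# Assumed URL: Ollama's default local address
local_config = build_graph_config(
    "ollama", "qwen2:latest", "http://localhost:11434", embed_model="nomic-embed-text"
)
print(local_config["llm"]["model"])
```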

Scrapegraph-ai also accepts HTML that has already been fetched. You can get the HTML with requests or with LangChain.

import requests
from langchain_community.document_loaders import AsyncChromiumLoader

url = "https://www.mit.edu/about/"

# Option 1: LangChain (loads the page in Chromium)
loader = AsyncChromiumLoader([url])
html = loader.load()[0].page_content

# Option 2: requests (this overwrites the html from option 1; use one or the other)
response = requests.get(url)
html = response.text

smart_scraper_graph = SmartScraperGraph(
    prompt=PROMPT,
    source=html,
    config=graph_config,  # the local Ollama config defined above
)

result = smart_scraper_graph.run()
print(result)
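
Neither fetch above sets a timeout or handles the page encoding explicitly. As a rough sketch of a third option, the standard library's `urllib` can do the fetch with both. This is my swap-in for illustration, not what the examples above use. `fetch_html` and the default User-Agent string are hypothetical.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10.0, user_agent="Mozilla/5.0"):
    """Fetch a page and return its body as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, defaulting to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string can then be passed as `source=` in the same way as `html` above, for example `html = fetch_html("https://www.mit.edu/about/")`.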

Multi-page scraping

from scrapegraphai.graphs import SmartScraperMultiGraph

start = time.time()
smart_scraper_graph = SmartScraperMultiGraph(
    prompt=PROMPT,
    source=['https://www.mit.edu', 'https://www.mit.edu/about/'],
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)

end = time.time()

print(end - start)

Search-based scraping

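Scrapegraph-ai also has built-in search. The search engines currently supported are Google and DuckDuckGo, with Bing support planned for later. The current version does not yet let you switch search engines, but you can change the source code yourself to switch.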

from scrapegraphai.graphs import SearchGraph
start = time.time()
# Define the configuration for the graph
graph_config = {
    "llm": {
        "model": "ollama/qwen2:latest",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "*****",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "*****",  # set Ollama URL
    },
    "headless": False,
    "max_results": 2,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="纽约大学的校训是什么",  # "What is New York University's motto?"
    config=graph_config,
)

# Run the graph
result = search_graph.run()
print(result)
end = time.time()

print(end - start)
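
With a Chinese prompt the answer may contain non-ASCII text, which a plain `print` of a dict still shows but `json.dumps` escapes by default. Below is a small sketch for pretty-printing the result, assuming `run()` hands back a JSON-serializable dict. `format_result` is a hypothetical helper, and the sample dict is a made-up stand-in, not real output.

```python
import json


def format_result(result):
    """Pretty-print a result dict, keeping non-ASCII characters readable."""
    # default=str is a fallback for values json cannot serialize directly
    return json.dumps(result, indent=2, ensure_ascii=False, default=str)


# Made-up stand-in for the value returned by search_graph.run()
sample = {"motto": "Perstare et praestare", "note": "示例"}
print(format_result(sample))
```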