Micro SaaS: Automated Information Gathering

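Building an AI micro SaaS often requires gathering information from the web. Below are some representative information-gathering (web scraping) methods that come up frequently in practice, with reference code from these projects: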

1, AI-Content-Ideas-Generator-Prototype

Opens a browser automatically, reads the main body of an article, and has the AI summarize it.


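    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium_stealth import stealth

    # Run Chrome headless and drop the usual automation switches
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option("useAutomationExtension", False)
    driver = webdriver.Chrome(options=chrome_options)

    # Patch browser properties that commonly reveal automation
    stealth(
        driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )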

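"Selenium Stealth" is a tool for the Selenium framework that makes automated browser sessions harder to spot, so that websites are less likely to detect and block them. In web scraping, automated testing, and other scenarios that simulate user actions, sites sometimes restrict access once they detect an automation tool, and this is the situation it is meant to avoid.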
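One way to see what the stealth patching changes is to ask the page itself whether it considers the browser automated. The sketch below is an illustration and not part of the referenced project: `looks_automated` is a hypothetical helper that reads the standard `navigator.webdriver` property through Selenium's `execute_script`, and `FakeDriver` is a stand-in so the example runs without a browser.

```python
class FakeDriver:
    """Stand-in for a Selenium WebDriver (hypothetical, for demonstration only)."""

    def __init__(self, webdriver_flag):
        self.webdriver_flag = webdriver_flag

    def execute_script(self, script):
        # A real driver would run the JavaScript in the page and return its result.
        return self.webdriver_flag


def looks_automated(driver):
    """Return True if the page reports navigator.webdriver as truthy."""
    return bool(driver.execute_script("return navigator.webdriver"))


# An unpatched automated browser typically reports True here;
# a patched session is expected to report a falsy value instead.
unpatched = looks_automated(FakeDriver(True))
patched = looks_automated(FakeDriver(None))
print(unpatched, patched)
```

With a real session, the same helper could be called as `looks_automated(driver)` after loading any page.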


    import newspaper

    def get_article_from_url(url):
        try:
            # Scrape the web page for content using newspaper,
            # with a request timeout of 10 seconds
            config = newspaper.Config()
            config.request_timeout = 10
            article = newspaper.Article(url, config=config)
            article.download()
            # Parse only if the download succeeded (2 is ArticleDownloadState.SUCCESS)
            if article.download_state == 2:
                article.parse()
                # Get the main text content of the article
                article_text = article.text
                return article_text
            else:
                print("Error: Unable to download article from URL:", url)
                return None
        except Exception as e:
            print("An error occurred while processing the URL:", url)
            print(str(e))
            return None

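newspaper3k is a popular Python library for extracting content from news sites and articles, such as the article text, title, authors, publication date, and images. It makes it easy to pull information from online news sources for tasks such as data analysis and natural language processing.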
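To make the idea of "extracting the main text" concrete, here is a deliberately simplified sketch using only the standard library. It is not how newspaper3k works internally; it only assumes that the article body lives in `<p>` tags and that scripts, styles, and navigation should be ignored. The names `ParagraphTextExtractor` and `extract_main_text` are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collect text inside <p> tags, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._skip_depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "p" and self._in_p:
            # Collapse runs of whitespace before storing the paragraph
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buffer.append(data)


def extract_main_text(html):
    """Return the paragraph text of an HTML document, one blank line between paragraphs."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


sample_html = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <nav>Home | About</nav>
  <p>First paragraph of the <b>article</b>.</p>
  <script>console.log("ignored");</script>
  <p>Second   paragraph.</p>
</body></html>
"""
main_text = extract_main_text(sample_html)
print(main_text)
```

Real pages are much messier than this sample, which is the reason to rely on a dedicated library rather than a hand-written parser.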

–

2, auto_jobfindchatgpt__rpa

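RPA opens the browser automatically, performs actions, and pushes the collected information to the GPT Assistants API, which then produces insights and follow-up actions automatically.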

github.com/Frrrrrrrran…

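    import time

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Excerpt: finding_jobs, label, job_index, assistant_id, vectorstore and the
    # helper functions called below are defined elsewhere in the project.
    driver = finding_jobs.get_driver()
    # Change the dropdown option
    finding_jobs.select_dropdown_option(driver, label)
    # Call the function in finding_jobs.py to get the description
    job_description = finding_jobs.get_job_description_by_index(job_index)
    if job_description:
        element = driver.find_element(By.CSS_SELECTOR, '.op-btn.op-btn-chat').text
        print(element)
        # '立即沟通' is the button label shown on the site ("Chat now")
        if element == '立即沟通':
            # Send the description to the chat model and print the response
            if should_use_langchain():
                response = generate_letter(vectorstore, job_description)
            else:
                response = chat(job_description, assistant_id)
            print(response)
            time.sleep(1)

            # Click the contact button
            contact_button = driver.find_element(By.XPATH, "//*[@id='wrap']/div[2]/div[2]/div/div/div[2]/div/div[1]/div[2]/a[2]")
            contact_button.click()
            # Wait for the reply box to appear
            xpath_locator_chat_box = "//*[@id='chat-input']"
            chat_box = WebDriverWait(driver, 50).until(
                EC.presence_of_element_located((By.XPATH, xpath_locator_chat_box))
            )
            # Call the function that sends the response
            send_response_and_go_back(driver, response)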

RPA drives the browser to collect information and perform actions.
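A recurring pattern in this kind of browser automation is the explicit wait: keep checking for an element until it appears or a timeout expires, as `WebDriverWait(...).until(...)` does. The sketch below shows the idea with the standard library only. It is a simplified illustration, not Selenium's implementation, and `wait_until` and `fake_find_chat_box` are names invented for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Demonstration: a fake "element" that only appears on the third check.
attempts = {"count": 0}


def fake_find_chat_box():
    attempts["count"] += 1
    return "chat-input" if attempts["count"] >= 3 else None


found = wait_until(fake_find_chat_box, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Polling with a deadline is usually more robust than a fixed `time.sleep`, because the script continues as soon as the page is ready and fails loudly if it never is.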

    from openai import OpenAI

    client = OpenAI()

    # Upload the cover letter so the assistant can use it as a knowledge base
    with open("my_cover.pdf", "rb") as cover_file:
        file = client.files.create(file=cover_file, purpose='assistants')

    assistant = client.beta.assistants.create(
        # Getting assistant prompt from "prompts.py" file, edit on left panel if you want to change the prompt
        instructions=assistant_instructions,
        model="gpt-3.5-turbo-1106",
        tools=[
            {
                "type": "retrieval"  # This adds the knowledge base as a tool
            },
        ],
        file_ids=[file.id])

This uses GPT's "retrieval" tool, which stores the uploaded content in a vector database and searches it through an index.
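Conceptually, retrieval over a vector store means turning text chunks and the query into vectors and returning the chunks closest to the query. The toy sketch below illustrates that idea only. It makes no claim about how OpenAI implements retrieval: the bag-of-words `embed` function stands in for a learned embedding model, and the sample chunks are invented.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real systems use learned embeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


# Invented sample chunks, standing in for pieces of an uploaded document
chunks = [
    "Five years of Python backend development with Django and FastAPI.",
    "Led a team of three engineers building data pipelines.",
    "Bachelor's degree in computer science.",
]
best = retrieve("Python backend developer position", chunks, top_k=1)
print(best[0])
```

The retrieved chunk is what would be handed to the model as context for answering the query.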

    # Add the user's message to the thread
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=user_input
    )
    # Start the Assistant Run
    run = client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id
    )
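Creating a run only starts the work; the reply is available once the run has finished, so the caller typically polls the run status. The following is a minimal sketch of such a loop, assuming the `client.beta.threads.runs.retrieve` call of the OpenAI Python SDK. `wait_for_run` is a hypothetical helper, and `FakeRuns` and `fake_client` are stand-ins so the example runs without network access or an API key.

```python
import time
from types import SimpleNamespace

# Statuses in which the run is still being processed
PENDING_STATUSES = ("queued", "in_progress", "cancelling")


def wait_for_run(client, thread_id, run_id, poll_interval=1.0, timeout=60.0):
    """Poll a run until it leaves the pending states, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status not in PENDING_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {run.status} after {timeout} seconds")
        time.sleep(poll_interval)


class FakeRuns:
    """Stand-in for client.beta.threads.runs that replays a list of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def retrieve(self, thread_id, run_id):
        return SimpleNamespace(id=run_id, status=next(self._statuses))


fake_client = SimpleNamespace(
    beta=SimpleNamespace(
        threads=SimpleNamespace(runs=FakeRuns(["queued", "in_progress", "completed"]))
    )
)
finished = wait_for_run(fake_client, "thread_demo", "run_demo", poll_interval=0.01)
print(finished.status)
```

With the real client this would be called as `wait_for_run(client, thread_id, run.id)`, after which `client.beta.threads.messages.list(thread_id=thread_id)` returns the messages in the thread, including the assistant's reply. A run can also end as `failed` or `expired`, so the returned status should be checked before reading the reply.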