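Overview
I run Offer Optimist, a website that compares sign-up bonuses. I maintain a free API for accessing the data. One of the main challenges is keeping that data up to date, which I have traditionally done through Doctor of Credit and user-submitted reports.
However, this can be difficult and time-consuming, depending on how accurate I want the data to be. I generally promise 80%+ accuracy, but I make it very clear that it shouldn't be used for any mission-critical use case. Keeping it current through any kind of active searching is simply too time-consuming for me.
Enter: Web Scraping
So I had the idea to try web scraping to automatically pull down pages and verify accuracy. After all, I have a URL for every card in my dataset. So I gave it a shot.
Note that everything below happens after applying the html-to-text package to the result to strip out all the HTML structure. I used ScrapingAnt as a web scraping proxy, and it has worked quite well for me.
First Attempt: Regex
The code ended up looking like this: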
const text = lines
  .map((line) => {
    // Remove matching substrings, not whole lines
    return line
      .replaceAll(",", "") // Remove commas, which can interfere with regex matching
      .replaceAll(/spend \$\d+/gi, "") // Remove things that are clearly a spend amount
      .replaceAll(/\d+\.\d+x?/gi, "") // Remove decimals first, so "1.5x" is not half-eaten by the multiplier rule
      .replaceAll(/\d+x/gi, "") // Remove anything that looks like a multiplier
      .replaceAll(/\d+%/g, "") // Remove anything that looks like a percentage (all digits, not just the last)
      .replaceAll(/annual.{0,4}\d+/gi, "") // Remove anything that looks like it's annual
      .replaceAll(/up to (a )?\$?\d+/gi, "") // Tends to indicate an offer with several breakpoints; we want to consider the smaller one
      .replaceAll(/[^a-zA-Z0-9$ ]+/g, ""); // Strip everything except letters, digits, "$", and spaces
  })
  .filter((line) => {
    line = line.replaceAll("\n", "").trim();
    // Remove entire lines that clearly don't contain a bonus
    if (!line.length) return false;
    if (line.length < 5 || line.length > 200) {
      logger.debug(
        `Removing line ${line} b/c it's too short or too long.`
      );
      return false;
    }
    // Note: the map step above already strips "-" and "/", so this check and
    // the date check below only fire if that final strip is moved after this filter.
    if (line.startsWith("--")) {
      logger.debug(`Removing line ${line} b/c it starts with --`);
      return false;
    } // Remove artifact introduced by scraping proxy
    if (!/\$?\d+/.test(line)) {
      logger.debug(
        `Removing line ${line} b/c it has no distinct numbers in it.`
      );
      return false;
    }
    if (/per|each|every/i.test(line)) {
      // Indicates something recurring; usually referrals or an earnings rate
      logger.debug(
        `Removing line ${line} b/c it has 'per', 'each', or 'every' in it.`
      );
      return false;
    }
    if (/\d+\/\d+\/\d+/.test(line)) {
      logger.debug(`Removing line ${line} b/c it has a date in it.`);
      return false;
    }
    return true;
  });
// Find first regex match; already sorted from highest to lowest specificity
const match = CARD_SUBSTRING_TO_REGEX.reduce<string | undefined>(
  (acc, search) => {
    if (acc) return acc; // Already found our "match"
    for (const line of text) {
      const result = search.f(card, line);
      if (result) return result;
    }
    return acc;
  },
  undefined
);
if (!match) {
  skips.push({
    card,
    reason: `No regex match. Text (${text.length}): ${JSON.stringify(
      text.map((t) => t.substring(0, 200)),
      null,
      2
    )}`,
  });
  return;
}
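As you can see, there are a lot of very finicky regex rules. This worked for maybe 70% of cards, but it tended to produce a lot of false positives, and ultimately it was just a bit too tedious and rule-based to be worth my time.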
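For readers wondering what the matcher list consumed by the reduce might look like, here is a minimal sketch. Only the name CARD_SUBSTRING_TO_REGEX and the f(card, line) call come from the code above; the Card and Matcher shapes, the two example patterns, and findFirstMatch are assumptions made up for illustration.

```typescript
// Hypothetical shapes: the real Card type and matcher entries are not shown in the post.
interface Card {
  name: string;
}

interface Matcher {
  // Returns the extracted bonus amount as a string, or undefined if the line does not match
  f: (card: Card, line: string) => string | undefined;
}

// Ordered from most specific to least specific, so the first hit wins
const CARD_SUBSTRING_TO_REGEX: Matcher[] = [
  {
    // e.g. "earn 60000 bonus points"
    f: (_card, line) => /earn (\d+) bonus (?:points|miles)/i.exec(line)?.[1],
  },
  {
    // e.g. "$100 bonus"
    f: (_card, line) => /\$(\d+) bonus/i.exec(line)?.[1],
  },
];

// Same search order as the reduce above: try each matcher against every line
function findFirstMatch(card: Card, lines: string[]): string | undefined {
  for (const matcher of CARD_SUBSTRING_TO_REGEX) {
    for (const line of lines) {
      const result = matcher.f(card, line);
      if (result) return result;
    }
  }
  return undefined;
}
```

Every new offer wording needs another entry in a list like this, which is roughly why the approach got tedious.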
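Second Attempt: GPT
I got a great suggestion to look into GPT. After all, it's much better at generalizing. My overall idea was to scrape the data, then apply whatever preprocessing (regex) I could to minimize the noise (and tokens, since tokens = cost) that GPT has to sift through. Some of the rules I applied:
- Minimize whitespace (mainly to save tokens)
- Only include lines that have a number in them
- Remove any mention of COVID-19
- Remove anything that looks like a duration or date
I also applied some site-specific rules. For example, Amex tends to use a lot of irrelevant blocks wrapped in [], among other things like that.
Here's the code: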
const cleaned = text
  .replaceAll(/\[.*]/g, "")
  .replaceAll(/[\n\r]/g, " ")
  .replaceAll(/\s{2,}/g, " | ")
  .split("|")
  .map((s) => s.trim())
  .filter((s) => /\d/.test(s)) // Must include a number
  .filter((s) => !/covid-19/gi.test(s)) // COVID-19
  .filter((s) => !/\d* seconds/gi.test(s)) // Time
  .filter((s) => !/ 101/gi.test(s)) // XYZ 101 is text that tends to show up in Amex
  .filter((s) => !/2023/gi.test(s)) // Number is the current year
  .filter((s) => !/\d*%/gi.test(s)) // Percentages
  .filter(
    // URLs
    (s) =>
      !/^https?:\/\/(?:www\.)?[\w#%+.:=@~-]{1,256}\.[\d()a-z]{1,6}\b[\w#%&()+./:=?@~-]*$/gi.test(
        s
      )
  )
  .join(" | ");
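Since tokens = cost, it can be handy to get a rough feel for how much the cleaning step saves. The sketch below uses the common rule of thumb of about 4 characters per token for English text. That ratio is only an approximation and not a real tokenizer, and estimateTokens and estimateSavings are hypothetical helper names.

```typescript
// Rough heuristic only: assumes ~4 characters per token for English text.
// An exact count requires the model's actual tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Compare the raw page text against the cleaned text
function estimateSavings(raw: string, cleaned: string) {
  const before = estimateTokens(raw);
  const after = estimateTokens(cleaned);
  return { before, after, saved: before - after };
}
```

Logging estimateSavings(text, cleaned) per card would show which sites still produce the most noise.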
Once I had the cleaned output, it was time to feed it into the actual prompt:
const prompt = `I am scraping credit card websites to check whether credit card data I have on file is accurate, especially sign up bonus amounts. You are a helpful assistant helping me verify whether my data is still accurate.
I have the following data on file for this card, which I am providing in JSON format.
${JSON.stringify({
  ...card,
  historicalOffers: undefined,
  imageUrl: undefined,
})}
I stripped the HTML from the page, so I now just have the raw text. Here it is, with each page "section" separated by " | ". The page will either be specific to this card, or have information on this card. Here's the text:
${cleaned}
If my data is still up to date, please reply ONLY with "Up To Date." If the text resembles some kind of error, please reply ONLY with "Error" and then a brief description of the error. If my data is not up to date, please reply ONLY with details of what is inaccurate. Also feel free to check for any other inaccuracies, such as an incorrect annual fee or an incorrect value for whether the bonus is waived first year.`;
I borrowed a few techniques commonly used in prompt engineering here: I gave the LLM a "role", and I was very explicit about what kind of output I wanted.
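The post doesn't show how the prompt is actually sent, so the following is only one possible sketch. It assumes the OpenAI chat completions HTTP endpoint with the gpt-3.5-turbo model and Node 18+ (for the global fetch). buildChatRequest and askGpt are hypothetical names, and temperature 0 is my own choice to keep replies as repeatable as possible.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature: number;
}

// Build the JSON body; the whole prompt (role included) goes in a single user message
function buildChatRequest(prompt: string): ChatRequest {
  return {
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: prompt }],
    temperature: 0, // Assumption: low randomness suits a verification task
  };
}

// Send the request and return the text of the first choice
async function askGpt(prompt: string, apiKey: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildChatRequest(prompt)),
  });
  if (!response.ok) {
    throw new Error(`OpenAI request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.choices[0].message.content;
}
```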
How well does it work?
Here's an example of the output:
[09:43:23.866] INFO (16328): Getting page text for card BARCLAYS Upromise...
[09:43:27.207] INFO (16328): Got page text for card BARCLAYS Upromise...
[09:43:27.219] INFO (16328): Cleaned text: Get up to $250 in cash back rewards per calendar year on eligible gift card | Earn $100 Bonus Cash Back Rewards. | $0 Fraud Liability Protection. | made within 45 days of account opening.
After that (and for balance transfers | $0 | $100 BONUS CASH BACK REWARDS | Earn $100 bonus cash back rewards after spending $500 on purchases in the first | 90 days2 | when linked to an eligible College Savings Plan2 | EARN UP T
O $250 IN CASH BACK REWARDS PER YEAR | Get up to $250 in cash back rewards per calendar year on eligible gift card | at MyGiftCardsPlus.com .3 | $0 | annual fee1 | * EARN $100 BONUS CASH BACK REWARDS | Earn $100 bonus cash back r
ewards after spending $500 on purchases in the | first 90 days. | * EARN UP TO $250 IN CASH BACK REWARDS PER YEAR | Get up to $250 in cash back rewards per calendar year on eligible gift card | at MyGiftCardsPlus.com .3 | based o
n the limit you set (from $1 to $500). The total Round Up Amount | is considered a purchase and converted to cash back rewards.2 | on international purchases.1 | * $0 FRAUD LIABILITY PROTECTION | that your score has changed.4 | t
R: EARN 60,000 BONUS POINTS | after qualifying account activity2 | 6X POINTS | on eligible JetBlue purchases2 | 2X POINTS | at restaurants and eligible grocery stores2 | $99 | annual fee1 | * EARN 60,000 BONUS POINTS | after spending $1,000 on purchases and paying the annual fee in full, both | within the first 90 days2 | * 6X POINTS | on eligible JetBlue purchases2 | * 2X POINTS AT RESTAURANTS AND ELIGIBLE GROCERY STORES | and 1X points on all other purchases2 | for you and up to 3 eligible travel companions on JetBlue-operated flights2,4 | Earn toward Mosaic with every purchase3 | on eligible inflight purchases on JetBlue-operated flights2,4 | That’s any seat, any time, on JetBlue-operated flights3 | when you redeem for and travel on a JetBlue-operated Award Flight2 | fare at the time of booking3 | * ANNUAL $100 STATEMENT CREDIT | after you purchase a JetBlue Vacations package of $100 or more with your | JetBlue Plus Card2 | Your points will be ready whenever you are3 | * $0 FRAUD LIABILITY PROTECTION | Earn & share points with family and friends3 | $1,000 annually2 | combination or dollars and TrueBlue points – starting with as few as 500 | points3 | on international purchases1 | * EARN 5,000 POINTS BONUS | each year after your JetBlue Plus Card account anniversary2 | transfer that posts to your account within 45 days of account opening. | After that (and for balance transfers that do not post within 45 days of account | $99 | 1. Offer subject to credit approval. This offer is available through this | days of account opening is applicable for the first 12 billing cycles that | 2. Conditions and limitations apply. Please refer to the Reward Rules within | 3. Refer to TrueBlue Terms and Conditions | 4. JetBlue-operated flights only. Codeshare flights and flights operated by a | Credit Card Customer Support: 877-523-0478...
[09:43:35.314] INFO (16328): GPT Response: Up To Date. The text matches the information you provided in the JSON format, including the sign-up bonus of 60,000 points after spending $1,000 on purchases and paying the annual fee in full, both within the first 90 days.
Yes, it's a bit messy, but GPT handled it like a champ. This was with ChatGPT / GPT-3.5, which I used because of its good price-to-performance ratio.
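Notice that the reply above starts with "Up To Date." but then keeps explaining, even though the prompt asked for ONLY that phrase. Any code that consumes the reply should therefore match on the prefix rather than on exact equality. Here is a minimal sketch of such a classifier; the Verdict type and classifyResponse are hypothetical names, not part of the original code.

```typescript
type Verdict =
  | { status: "up-to-date" }
  | { status: "error"; detail: string }
  | { status: "outdated"; detail: string };

// Map the free-text reply onto the three outcomes the prompt asks for
function classifyResponse(reply: string): Verdict {
  const text = reply.trim();
  // Prefix match: the model often appends an explanation after "Up To Date."
  if (/^up to date\b/i.test(text)) return { status: "up-to-date" };
  // "Error" followed by an optional separator and a description
  const error = /^error\b[:.\s-]*([\s\S]*)$/i.exec(text);
  if (error) return { status: "error", detail: (error[1] ?? "").trim() };
  // Anything else is treated as a description of what is inaccurate
  return { status: "outdated", detail: text };
}
```

Only the "outdated" (and possibly "error") verdicts need human attention; everything else can be skipped.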
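Other Stuff
I run this from a Next.js route handler and trigger it through Vercel's cron jobs. Because of Vercel's runtime limits, I may need to make some adjustments so that each card is handled in its own HTTP request rather than all of them in one go. Still, the hardest part is done!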
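One way the per-card split could look: the cron-triggered handler only builds one URL per card and fires a request at each, so no single invocation has to process the whole dataset. This is a sketch under assumptions; the /api/check-card path, the cardId query parameter, and buildPerCardUrls are all hypothetical.

```typescript
// Build one request URL per card so each card is checked in its own invocation
function buildPerCardUrls(baseUrl: string, cardIds: string[]): string[] {
  return cardIds.map((id) => {
    const url = new URL("/api/check-card", baseUrl); // Hypothetical route
    url.searchParams.set("cardId", id); // Hypothetical parameter name
    return url.toString();
  });
}
```

The cron handler could then call fetch on each URL without awaiting the full check, leaving the scraping and GPT work to the per-card route.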
Once I have the output, I use octokit to automatically create a GitHub Issue containing the outdated information.
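As a rough sketch of that last step, the helper below builds the issue payload. The owner, repo, label, and title format are placeholders I made up, and buildIssue is a hypothetical name. The fields themselves (owner, repo, title, body, labels) are the ones octokit's issues.create call accepts.

```typescript
interface IssuePayload {
  owner: string;
  repo: string;
  title: string;
  body: string;
  labels: string[];
}

// Turn GPT's description of the inaccuracy into a GitHub issue payload
function buildIssue(cardName: string, gptDetail: string): IssuePayload {
  return {
    owner: "your-github-user", // Placeholder
    repo: "your-repo", // Placeholder
    title: `Possibly outdated data: ${cardName}`,
    body: ["GPT flagged this card as possibly out of date.", "", gptDetail].join("\n"),
    labels: ["outdated-data"], // Placeholder label
  };
}

// With an authenticated Octokit client, the payload could be sent like this:
//   await octokit.rest.issues.create(buildIssue(card.name, detail));
```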