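As a personal challenge, I'm trying to build a tool that uses Puppeteer to scrape search results from a website (Alibaba, the shopping platform, for this experiment) and save the output to a JSON object that can later be used to create visualizations on the front end.
My first step is to visit the first page of search results and scrape the listings into an array: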
const puppeteer = require('puppeteer');
const fs = require('fs');

/* First page search URL */
const url = (keyword) => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`

/* keyword to search for */
const keyword = `future`;

(async () => {
    try {
        const browser = await puppeteer.launch({
            headless: true
        });
        const page = await browser.newPage();
        await page.goto(url(keyword), {
            waitUntil: 'networkidle2'
        });
        await page.waitForSelector('.m-gallery-product-item-v2');
        let urls = await page.evaluate(() => {
            let results = [];
            let items = document.querySelectorAll('.m-gallery-product-item-v2');
            // This console.log never gets printed to either the browser window or the terminal?
            console.log(items)
            items.forEach( item => {
                let CurrentTime = Date.now();
                let title = item.querySelector('h4.organic-gallery-title__outter').getAttribute("title");
                let link = item.querySelector('.organic-list-offer__img-section').getAttribute("href");
                let img = item.querySelector('.seb-img-switcher__imgs').getAttribute("data-image");
                results.push({
                    'scrapeTime': CurrentTime,
                    'title': title,
                    'link': `https:${link}`,
                    'img': `https:${img}`,
                })
            });
            return results;
        })
        console.log(urls)
        browser.close();
    } catch (e) {
        console.log(e);
        browser.close();
    }
})();
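When I run the file (test-2.js) with Node in the terminal, it sometimes returns the results array just fine, and at other times it throws an error. The terminal error, thrown about half the time, is: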
Error: Evaluation failed: TypeError: Cannot read property 'getAttribute' of null
at __puppeteer_evaluation_script__:11:82
at NodeList.forEach (<anonymous>)
at __puppeteer_evaluation_script__:8:19
at ExecutionContext._evaluateInternal (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:102:19)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
at async ExecutionContext.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:33:16)
at async /Users/dmnk/scraper/test-2.js:24:20
-- ASYNC --
at ExecutionContext.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
at DOMWorld.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/DOMWorld.js:89:24)
at processTicksAndRejections (internal/process/task_queues.js:97:5)
-- ASYNC --
at Frame.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
at Page.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/Page.js:612:14)
at Page.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:95:27)
at /Users/dmnk/scraper/test-2.js:24:31
at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: ReferenceError: browser is not defined
at /Users/dmnk/scraper/test-2.js:52:9
at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:53159) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
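I'm fairly new to learning asynchronous JavaScript.
I've been trying for days to understand why this error happens, without success. Any help with understanding the cause or troubleshooting it would be greatly appreciated.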
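Before calling getAttribute, you need to check that the title, link, and img elements actually exist: querySelector returns null when nothing matches, and calling getAttribute on null throws the TypeError you are seeing. For example, for me the link selector did not match anything, but the link was found this way: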
let link = item.querySelector('.organic-gallery-title').getAttribute('href');
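I don't know what causes the difference; maybe it's because you and I are in different countries. In any case, you can try this selector and see how the program behaves with it. Hope this helps.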
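The check described above could look roughly like the sketch below. It is not tested against the live site: the selectors are copied from the question and may not match what Alibaba serves to you, and the names collectItems and attrOrNull are made up for illustration. The callback takes the document as an optional parameter so it can also be exercised outside the browser.

```javascript
// Sketch of a null-safe extraction callback.
// Assumption: selectors are taken from the question and may be out of date.
function collectItems(root = document) {
  // Hypothetical helper: returns the attribute value, or null when the
  // element (or the attribute) is missing, instead of throwing.
  const attrOrNull = (parent, selector, attr) => {
    const el = parent.querySelector(selector);
    return el ? el.getAttribute(attr) : null;
  };

  const results = [];
  root.querySelectorAll('.m-gallery-product-item-v2').forEach((item) => {
    const title = attrOrNull(item, 'h4.organic-gallery-title__outter', 'title');
    const link = attrOrNull(item, '.organic-list-offer__img-section', 'href');
    const img = attrOrNull(item, '.seb-img-switcher__imgs', 'data-image');

    // Skip listings without a title or link rather than failing the whole run.
    if (title === null || link === null) return;

    results.push({
      scrapeTime: Date.now(),
      title,
      link: `https:${link}`,
      // A missing image is kept as null so the listing is not lost.
      img: img === null ? null : `https:${img}`,
    });
  });
  return results;
}
```

In the script from the question this would presumably replace the inline callback, e.g. `let urls = await page.evaluate(collectItems);`, since the function refers to nothing outside itself. Separately, the `ReferenceError: browser is not defined` in the log comes from `const browser` being declared inside the `try` block, so it is not in scope in `catch`; declaring `let browser;` before the `try` and closing it in a `finally` block (only if it was assigned) avoids that second error.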