Asked by: 小点点

Puppeteer: scrape sometimes works, sometimes fails with a TypeError


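As a personal challenge, I'm trying to build a tool that uses Puppeteer to scrape search results from a website (Alibaba, the shopping platform I'm using for this experiment) and save the output to a JSON object that can later be used to create visualizations on the front end.

My first step is to visit the first page of search results and scrape the listings from it into an array: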

const puppeteer = require('puppeteer');
const fs = require('fs');

/* First page search URL */
const url = (keyword) => `https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=${keyword}`

/* keyword to search for */
const keyword = `future`;

(async () => {
    try {
        const browser = await puppeteer.launch({
            headless: true
        });

        const page = await browser.newPage();

        await page.goto(url(keyword), {
            waitUntil: 'networkidle2'
        });

        await page.waitForSelector('.m-gallery-product-item-v2');

        let urls = await page.evaluate(() => {
            let results = [];
            let items = document.querySelectorAll('.m-gallery-product-item-v2');

            // This console.log never gets printed to either the browser window or the terminal?
            console.log(items)

            items.forEach( item => {
                let CurrentTime = Date.now();
                let title = item.querySelector('h4.organic-gallery-title__outter').getAttribute("title");
                let link = item.querySelector('.organic-list-offer__img-section').getAttribute("href");
                let img = item.querySelector('.seb-img-switcher__imgs').getAttribute("data-image");

                results.push({
                    'scrapeTime': CurrentTime,
                    'title': title,
                    'link': `https:${link}`,
                    'img': `https:${img}`,
                })
            });
            return results;
            
        })
        console.log(urls)
        browser.close();

    } catch (e) {
        console.log(e);
        browser.close();
    }
})();

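When I run the file (test-2.js) with Node in the terminal, it sometimes returns the results array just fine and at other times throws an error. The terminal error, thrown roughly half the time, is: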

Error: Evaluation failed: TypeError: Cannot read property 'getAttribute' of null
    at __puppeteer_evaluation_script__:11:82
    at NodeList.forEach (<anonymous>)
    at __puppeteer_evaluation_script__:8:19
    at ExecutionContext._evaluateInternal (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:102:19)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at async ExecutionContext.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/ExecutionContext.js:33:16)
    at async /Users/dmnk/scraper/test-2.js:24:20
  -- ASYNC --
    at ExecutionContext.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
    at DOMWorld.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/DOMWorld.js:89:24)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
  -- ASYNC --
    at Frame.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:94:19)
    at Page.evaluate (/Users/dmnk/scraper/node_modules/puppeteer/lib/Page.js:612:14)
    at Page.<anonymous> (/Users/dmnk/scraper/node_modules/puppeteer/lib/helper.js:95:27)
    at /Users/dmnk/scraper/test-2.js:24:31
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: ReferenceError: browser is not defined
    at /Users/dmnk/scraper/test-2.js:52:9
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
(node:53159) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:53159) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

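I'm fairly new to learning asynchronous JavaScript.

I've been trying for days to understand why this error happens, without success. Any help with understanding the cause or troubleshooting would be greatly appreciated.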


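1 answer

Anonymous user

Before calling getAttribute, you need to check that the title, link, and img elements exist. For example, for me the element for your link selector was not found, but it was found with this one: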

let link = item.querySelector('.organic-gallery-title').getAttribute('href');

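I don't know what causes the difference; perhaps it's because you and I are in different countries. In any case, you can try this selector and see how the program behaves with it. Hope this helps.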
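
To make the "check before getAttribute" idea concrete, here is a minimal sketch. The helper names safeAttr and extractItem are hypothetical (they are not part of Puppeteer or the question's code), the selectors are the ones quoted above, and the fallback order for the link is only an assumption about which markup variant the page might serve:

```javascript
// Hypothetical helper: read an attribute without throwing when the
// selector matches nothing. Returns null instead of a TypeError.
function safeAttr(root, selector, attribute) {
    const el = root.querySelector(selector);
    return el ? el.getAttribute(attribute) : null;
}

// Hypothetical per-item extractor using the selectors from this thread.
function extractItem(item) {
    const title = safeAttr(item, 'h4.organic-gallery-title__outter', 'title');
    // Assumption: try the question's link selector first, then fall back
    // to the selector suggested in the answer.
    const link = safeAttr(item, '.organic-list-offer__img-section', 'href')
        || safeAttr(item, '.organic-gallery-title', 'href');
    const img = safeAttr(item, '.seb-img-switcher__imgs', 'data-image');

    return {
        scrapeTime: Date.now(),
        title: title,
        // Only prefix the protocol when a value was actually found.
        link: link ? `https:${link}` : null,
        img: img ? `https:${img}` : null,
    };
}
```

Because page.evaluate serializes the function it is given and runs it in the page, these helpers would have to be declared inside the callback passed to page.evaluate, for example by pasting both functions at the top of that callback and replacing the forEach body with results.push(extractItem(item)). Items with missing fields then show up with null values instead of aborting the whole scrape.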