scraping-ninja-toolkit
All the goodies you'll ever need to scrape the web
Last updated a year ago by jimmylaurent .
MIT · Repository · Bugs · Original npm · Tarball · package.json
$ cnpm install scraping-ninja-toolkit 
SYNC missed versions from official npm registry.

scraping-ninja-toolkit

All the goodies you'll ever need to scrape the web

Documentation

In-browser Playground

You can try the library on codesandbox, it uses a cors proxy fetcher to let you grab contents from any website inside your browser.

Installation

yarn add scraping-ninja-toolkit
# or
npm i scraping-ninja-toolkit

Features

  • All in one package
  • Nodejs / Browsers compatibility
  • Blazingly fast
  • Extensible

Overview

The library is articulated around two main components:

  • the fetcher let you grab contents from any url,
  • the scraper let you extract data from webpages.

There is also some additional tools like an enhanced axios client.

Quick Example

const { fetcher } = require('scraping-ninja-toolkit');

// Fetch the given url and return a page scraper
const page = await fetcher.get('http://quotes.toscrape.com');

// Scrape an object
const quote = page.scrape('.quote', {
  author: '.author@text',
  text: '.text@text'
});
<!-- quote -->
{ 
  "author": "Albert Einstein", 
  "text": "“The world as we have created it is a process of our thinking.“"
}

Advanced real world example

const { fetcher } = require('scraping-ninja-toolkit');
const fs = require('fs');

(async () => {
  // Get categories urls
  const categories = await fetcher
    .get('https://coursehunters.net')
    .links('.menu-aside__a');

  // For each category
  // => frontend
  // => backend ...
  const results = await fetcher.getAll(categories).map(
    async (fetchNode, index) => {
      // Get all courses from the catagory in an flat array
      // https://coursehunters.net/frontend?page=1 => 10 courses
      // https://coursehunters.net/frontend?page=1 => 10 courses
      // ....
      //
      // allCourses => [{
      //   title: 'Modern HTML & CSS From The Beginning',
      //   url: 'https://coursehunters.net/course/sovremennyy-html-i-css-s-samogo-nachala'
      // }, ... ]
      const allCourses = await fetchNode
        .paginate('.pagination__a[rel="next"]')
        .flatMap(p =>
          p.scrapeAll('article', {
            title: '.standard-course-block__original@text',
            url: 'a[itemprop="mainEntityOfPage"]@href'
          })
        );

      // For each course scrape chapters
      // with a concurrency of 50 queries at the same time
      // and filter "undefined" values (courses without chapters)
      const courses = await fetcher
        .getAll(allCourses.map(c => c.url))
        .map(
          async p => {
            console.log(`Scraping url: ${p.location}`);

            const chapters = p.scrapeAll('.lessons-list__li', {
              name: 'span[itemprop="name"]@text',
              url: 'link[itemprop="url"]@href'
            });
            if (chapters && chapters.length && chapters[0].url) {
              const course = allCourses.find(c => c.url === p.location);
              course.chapters = chapters;
              return course;
            }
          },
          { concurrency: 50 }
        )
        .filter(c => c);

      return {
        category: categories[index].split('/').pop(),
        courses: courses
      };
    },
    { resolvePromise: false, concurrency: 6 }
  );

  fs.writeFileSync('courses.json', JSON.stringify(results, null, 2), 'utf8');
})();

Credits

• FB55: his work is the core of this library.

• Matt Mueller and cheerio contributors : A good portion of the code and concepts are copied/derived from the cheerio and x-ray scraper libraries.

License

MIT © 2019 Jimmy Laurent

Current Tags

  • 1.0.0-beta.2                                ...           latest (a year ago)

5 Versions

  • 1.0.0-beta.2                                ...           a year ago
  • 1.0.0-beta.1                                ...           a year ago
  • 1.0.0-alpha.3                                ...           a year ago
  • 1.0.0-alpha.2                                ...           a year ago
  • 1.0.0-alpha.1                                ...           a year ago
Maintainers (1)
Downloads
Today 0
This Week 0
This Month 0
Last Day 0
Last Week 0
Last Month 0
Dependencies (14)
Dev Dependencies (7)
Dependents (1)

Copyright 2014 - 2016 © taobao.org |