
A Web Scraping Approach in Node.js

Shikha Mahajan, Nikhit Kumar
Information Science and Engineering
R V College of Engineering
Bangalore, India

Abstract: Web scraping is the process of automatically collecting information from the World Wide Web. It is a field with active developments sharing a common goal with the semantic web vision, an initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interaction. Current web scraping solutions range from the ad hoc, requiring human effort, to fully automated systems that are able to convert entire web sites into structured information, with limitations. This paper describes a method for developing a web scraper in Node.js that locates files on a website, then decompresses and reads the files and stores their contents in a database. It mentions the modules used and the algorithm for automating the navigation of a website via links. It also describes a method of scanning the website at regular time intervals to locate newly added content with the aid of a cron job (scheduled task).

Keywords: web scraping, web mining, locating files in websites, navigating, DOM, cron job, JavaScript, Node.js, cheerio.js, decompressing files.

I. INTRODUCTION

A. DEFINITION
In its most basic form, web scraping enables a way to download web pages and then search for data in them. It often requires converting unstructured data in web pages to structured data and then storing it in a database. Web scraping can be used for indirect content searching on the internet.

B. USES OF WEB SCRAPING
The uses of web scraping for business and personal requirements are endless. Each business or individual has his or her own specific need for gathering data. Here are a few of the common usage scenarios:
- Gathering data from multiple sources for analysis: Using a web scraper you can extract data from multiple websites into a single spreadsheet (or database) so that it becomes easy for you to analyze (or even visualize) the data.
- For research: A web scraper will help you gather structured data from multiple sources on the Internet with ease.
- For marketing: A web scraper can be used to gather contact details of businesses or individuals from websites like yellowpages.com or linkedin.com. Details like email address, phone, website URL, etc. can be easily extracted using a web scraper.

C. COMMONLY USED WEB SCRAPING TECHNIQUES

- Human copy-and-paste: In some cases even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution when the websites being scraped explicitly set up barriers to prevent machine automation.
- Text grepping and regular expression matching: A simple yet powerful approach to extracting information from web pages is based on the regular expression matching or UNIX grep facilities of programming languages.
- HTTP programming: Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
- DOM parsing: By embedding a web browser, such as the Internet Explorer or Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages.
- Web-scraping software: There are many software tools available that can be used to customize web-scraping solutions.
The software may try to automatically recognize the data structure of a page, provide a recording interface that removes the need to manually write web-scraping code, offer scripting functions that can be used to extract and transform content, or supply database interfaces that can store the scraped data in local databases.

International Journal of Science, Engineering and Technology Research (IJSETR), Volume 4, Issue 4, April 2015. ISSN: 2278 – 7798. All Rights Reserved © 2015 IJSETR.

D. WHY NODE.JS FOR WEB SCRAPING
Node.js is a platform built on Chrome's JavaScript runtime. It uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, well suited to data-intensive real-time applications that run across distributed devices. JavaScript was born as a language to be embedded in web browsers, but with Node.js we can now write stand-alone scripts in JavaScript that run on a desktop computer or on a web server. Web scraping software has so far been written in Java, Ruby, and most popularly in Python. All modern languages provide functions to download web pages, or have extensions to do so. However, locating and isolating data in HTML pages is a challenging task: an HTML page has content, style, and layout elements all intermixed, so a non-trivial effort is required to parse the page and identify the interesting parts. JavaScript and libraries like jQuery can powerfully and easily manipulate the DOM inside a web browser. Writing web scraping scripts in Node.js is therefore advantageous, since we can reuse many DOM-manipulation techniques that we know from client-side code for the web browser. This paper describes a simple method to implement a web scraper in a Node.js application and demonstrates its use to locate and download the contents of files of a particular format from a website.
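The text-grepping technique listed above can be sketched in a few lines of Node.js. The function name findFileLinks and the sample HTML are illustrative only, not taken from the paper's implementation; a full DOM parser is more robust, but this shows the minimal regex approach.

```javascript
// Sketch of the "text grepping and regular expression matching" technique:
// extract all links to files of a given extension from raw HTML.
function findFileLinks(html, extension) {
  // Match href="..." attribute values ending in the given extension.
  const pattern = new RegExp(`href="([^"]+\\.${extension})"`, 'g');
  const links = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    links.push(match[1]); // capture group 1 holds the link target
  }
  return links;
}

const page = '<a href="/data/report.zip">report</a> <a href="/about.html">about</a>';
console.log(findFileLinks(page, 'zip')); // [ '/data/report.zip' ]
```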
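The HTTP-programming technique and the link-based navigation described in the abstract can be combined in a short sketch using only Node's built-in http module. A local test server stands in for a real website, and the name scrapePage is an assumption for illustration, not the paper's code.

```javascript
// Download a page over HTTP and collect its links; following those
// links is how a scraper navigates from page to page within a site.
const http = require('http');

function scrapePage(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        // Collect every href value on the page.
        const links = [...body.matchAll(/href="([^"]+)"/g)].map((m) => m[1]);
        resolve(links);
      });
    }).on('error', reject);
  });
}

// Minimal stand-in for a website being scraped.
const server = http.createServer((req, res) => {
  res.end('<a href="/files/a.gz">a</a> <a href="/files/b.gz">b</a>');
});

server.listen(0, async () => {
  const links = await scrapePage(`http://localhost:${server.address().port}/`);
  console.log(links); // [ '/files/a.gz', '/files/b.gz' ]
  server.close();
});
```

In a real scraper the href extraction would be done with a DOM library such as cheerio.js (named in the paper's keywords) rather than a regex.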
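The abstract's decompress-and-read step can be sketched with Node's built-in zlib module; the paper does not name a specific decompression library, so zlib (and the gzip format) is an assumption here, and the sample data is invented for illustration.

```javascript
// Sketch of the decompress-and-read step: decompress a downloaded
// .gz file in memory so its contents can be stored in a database.
const zlib = require('zlib');

// Stand in for a downloaded .gz file: compress some known text first.
const original = 'record1,record2,record3';
const downloaded = zlib.gzipSync(Buffer.from(original));

// Decompress and read the contents back as a string.
const contents = zlib.gunzipSync(downloaded).toString();
console.log(contents); // record1,record2,record3
```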
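The abstract schedules re-scans with a cron job; inside a long-running Node.js process the same idea can be approximated with setInterval, as sketched below. The function checkForNewFiles is a hypothetical stand-in for the site scan, not code from the paper.

```javascript
// Periodic re-scanning: remember files seen on earlier scans and
// report only the newly added ones.
const seen = new Set();

function checkForNewFiles(currentFiles) {
  const fresh = currentFiles.filter((f) => !seen.has(f));
  fresh.forEach((f) => seen.add(f));
  return fresh;
}

// A real deployment would use a cron job (or the node-cron package);
// an in-process analogue would be, e.g.:
// setInterval(() => checkForNewFiles(scanSite()), 60 * 60 * 1000);

console.log(checkForNewFiles(['a.gz', 'b.gz']));         // [ 'a.gz', 'b.gz' ]
console.log(checkForNewFiles(['a.gz', 'b.gz', 'c.gz'])); // [ 'c.gz' ]
```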