A Web Scraping Approach in Node.js

Shikha Mahajan, Nikhit Kumar
Information Science and Engineering
R V College of Engineering
Bangalore, India

Abstract: Web scraping is the process of automatically collecting information from the World Wide Web. It is a field with active developments, sharing a common goal with the semantic web vision, an initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interaction. Current web scraping solutions range from ad-hoc approaches requiring human effort to fully automated systems that can convert entire web sites into structured information, each with its own limitations.
This paper describes a method for developing a web scraper in Node.js that locates files on a website, decompresses and reads them, and stores their contents in a database. It also describes the modules used and an algorithm for automating the navigation of a website via its links.
INTRODUCTION

A. DEFINITION
In its most basic form, web scraping is a way to download web pages and then search for data in them. It often requires converting unstructured data in web pages into structured data and then storing it in a database. Web scraping can also be used for indirect content searching on the Internet.

B. USES OF WEB SCRAPING
The uses of web scraping for business and personal requirements are endless; each business or individual has a specific need for gathering data. A few of the common usage scenarios are:
• Gathering data from multiple sources for analysis: Using a web scraper, you can extract data from multiple websites into a single spreadsheet (or database), making the data easy to analyze (or even visualize).
• For research: A web scraper helps you gather structured data from multiple sources on the Internet with ease.
• For marketing: A web scraper can be used to gather the contact details of businesses or individuals from websites like yellowpages.com or linkedin.com. Details like email addresses, phone numbers, and website URLs can be easily extracted.

C. COMMONLY USED WEB SCRAPING TECHNIQUES
• Human copy-and-paste: In some cases even the best web-scraping technology cannot replace a human's manual examination and copy-and-paste, and sometimes this may be the only workable solution, for example when the websites being scraped explicitly set up barriers to prevent machine automation.
• Text grepping and regular expression matching: A simple yet powerful approach to extracting information from web pages is based on the regular expression matching facilities of programming languages or the UNIX grep command.
• HTTP programming: Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
• DOM parsing: By embedding a web browser control, such as Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, from which programs can retrieve parts of the pages.
• Web-scraping software: Many software tools are available for building customized web-scraping solutions. Such software may attempt to automatically recognize the data structure of a page, provide a recording interface that removes the need to write scraping code by hand, offer scripting functions for extracting and transforming content, or provide database interfaces for storing the scraped data locally.

D.
Writing web scraping scripts in Node.js is therefore advantageous, since we can reuse many techniques familiar from DOM manipulation in client-side browser code. This paper describes a simple method for implementing a web scraper in a Node.js application and demonstrates its use to locate and download the contents of files of a particular format from a website.