What are web scrapers?

Web scrapers are tools or scripts that extract data from websites. Many modern web applications provide APIs to serve the data their clients need. However, many websites are built only for browsing and do not come with APIs. In such cases, extracting data in an automated fashion requires parsing plain HTML content, which, unlike an API response, is not structured for data extraction, and pulling out the information we need. Traditionally, web scraping has been used to extract data such as stock prices, product details, company contacts and sports stats. In a security context, web scrapers are used to extract data such as email addresses, phone numbers, job roles within a target organization, private IP addresses and URLs found in the HTML source. This makes them well suited to gathering information about a specific target before conducting a targeted attack.
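As a rough illustration of parsing plain HTML for this kind of data, the following sketch uses only the Python standard library. The sample page, the function name extract_artifacts and the regular expressions are assumptions made for this example; the patterns are deliberately simple and would need tightening for real-world pages.

```python
import re
from html.parser import HTMLParser

# Deliberately simple patterns for illustration; real data needs stricter validation.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# RFC 1918 private ranges: 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
PRIVATE_IP_RE = re.compile(
    r"\b(?:10(?:\.\d{1,3}){3}"
    r"|172\.(?:1[6-9]|2\d|3[01])(?:\.\d{1,3}){2}"
    r"|192\.168(?:\.\d{1,3}){2})\b"
)


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_artifacts(html):
    """Return email addresses, private IPs and links found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return {
        "emails": sorted(set(EMAIL_RE.findall(html))),
        "private_ips": sorted(set(PRIVATE_IP_RE.findall(html))),
        "links": parser.links,
    }


# Hypothetical page content, used here instead of a live HTTP request.
SAMPLE_HTML = """
<html><body>
  <p>Contact: <a href="mailto:jane.doe@example.com">jane.doe@example.com</a></p>
  <!-- staging server at 10.0.12.7 -->
  <a href="/careers">Careers</a>
  <a href="https://example.com/about">About</a>
</body></html>
"""

result = extract_artifacts(SAMPLE_HTML)
print(result)
```

Note that the regular expressions run over the raw source, so they also pick up values left in HTML comments, which is where internal addresses sometimes leak.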

According to Wikipedia, “Open-source intelligence is a multi-methods methodology for collecting, analyzing and making decisions about data accessible in publicly available sources to be used in an intelligence context. In the intelligence community, the term “open” refers to overt, publicly available sources.” In simple words, OSINT is the gathering of intelligence about targets using publicly available information. Web scraping is one of the methods used in OSINT. As an example, let us assume that an attacker is planning to conduct a phishing exercise against a specific department of an organization. This requires gathering information about the employees of that department from publicly available resources such as social media profiles and job portals. Web scrapers come in handy for scraping details such as email addresses and job skills from websites like LinkedIn. Information about job skills and titles can then be used to infer which department each employee belongs to.
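The last step, inferring a department from a job title, can be sketched as a simple keyword lookup. The department names, keywords and the function guess_department below are hypothetical and exist only for this example; in practice the mapping would have to be built from the titles actually observed.

```python
# Hypothetical keyword-to-department mapping, invented for illustration.
DEPARTMENT_KEYWORDS = {
    "Finance": ("payroll", "accountant", "accounts payable"),
    "IT": ("network", "sysadmin", "helpdesk"),
    "HR": ("recruiter", "talent", "human resources"),
}


def guess_department(job_title):
    """Return the first department whose keyword appears in the title, else None."""
    title = job_title.lower()
    for department, keywords in DEPARTMENT_KEYWORDS.items():
        if any(keyword in title for keyword in keywords):
            return department
    return None


for scraped_title in ("Senior Payroll Analyst", "Network Engineer", "Executive Chef"):
    print(scraped_title, "->", guess_department(scraped_title))
```

A substring match like this is crude and will misclassify ambiguous titles, so the output is best treated as a starting point for manual review.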

What are the benefits of using web scrapers and OSINT?

There are several benefits to using web scrapers and OSINT, and they vary depending on the type of intelligence that we want to gather. The following are some of the key benefits:

Cost effective

Using web scrapers to collect Open Source Intelligence (OSINT) is less expensive than relying on other intelligence sources. This can add great value when there is little or no budget available.

Discover hidden risks

The primary goal of OSINT and web scrapers is to extract as much information as possible about the targets from publicly available sources. Because this information is available to anyone, exposing too many details about an organization or its employees can pose a security risk. Collecting OSINT helps discover these hidden risks.

Ease of accessibility

Open sources are available at any time, from anywhere, to anyone. This information can easily be accessed whenever it is needed.

Accessing information about other parties often raises legal issues. When collecting Open Source Intelligence, this is often not an issue because the sources are already publicly available.

Pros and cons of using existing web scrapers vs. building your own tool

When a web scraper is needed, one can either choose an existing tool or develop a custom one. Let us look at some of the pros and cons of each approach.

The primary benefit of using an existing web scraper is that it saves time. We do not need to reinvent the wheel, as someone else has already done the work. Developing a custom scraper, by contrast, often requires a detailed understanding of the information available in the web pages in order to process it. For example, we need to study the HTML pages being retrieved to identify the exact tag and location that contains the information we are looking for.

While existing scrapers save time, there are scenarios where a tool was built for an older version of the target website and no longer works with the latest version. This calls for a custom scraper. In addition, many web scrapers were written for Python 2 and never ported to Python 3. In such cases, one may either port the tool to Python 3 or develop a custom web scraper based on the logic used in the existing one. As a temporary workaround, the tool can be run inside a Docker container built from an image with Python 2 installed.
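To make the point about identifying the exact tag concrete, here is a minimal custom scraper that collects the text of one specific element type. The markup, the class name job-title and the TagTextExtractor class are assumptions for illustration; a real page would have to be inspected first to find the right tag and attributes.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects the text inside every <tag> element that carries css_class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self._depth:
            self._depth += 1  # nested element with the same tag name
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical markup standing in for a retrieved page.
SAMPLE_PAGE = """
<ul>
  <li><span class="name">A. Example</span> <span class="job-title">Payroll Analyst</span></li>
  <li><span class="name">B. Example</span> <span class="job-title highlight">Network Engineer</span></li>
</ul>
"""

extractor = TagTextExtractor("span", "job-title")
extractor.feed(SAMPLE_PAGE)
titles = extractor.results
print(titles)
```

Because the scraper is tied to one tag and class name, it returns nothing if the site renames that class, which is the same kind of breakage that affects existing tools written for an older version of a site.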

Sources

https://www.edocr.com/v/d4yjxo0a/osintsolutionseo/Advantages-and-Limitations-of-Open-Source-Intellig
https://itsec.group/blog-post-osint-guide-part-1.html
/certification/how-to-prevent-web-scraping/