reading-notes

Reading Questions

1. What are the key differences between scraping static and dynamic websites?

Scraping static websites:
- Fixed HTML content.
- Data readily available in the HTML source code.
Scraping dynamic websites:
- Content changes dynamically, often through JavaScript.
- Data may be loaded asynchronously or generated dynamically.

2. Explain at least three techniques or best practices that can be employed to avoid getting blocked while scraping websites.

Implementing delays:
- Adding pauses between requests to mimic human behavior.
Using proxies:
- Rotating IP addresses through proxy servers to distribute requests.
Employing user-agent headers:
- Setting the user-agent header to match a legitimate browser.

3. What is Playwright, and how does it assist in web scraping tasks? Provide an example of a use case where Playwright would be particularly beneficial.

Playwright is a Python library for automating browser interactions.
Benefits of Playwright in web scraping:
- Interacting with JavaScript-driven elements.
- Handling complex scenarios like login and dynamic content.
- Capturing network traffic and taking screenshots.
- Supports multiple browsers (Chrome, Firefox, Safari).
Example use case: Scraping a website that requires login. Playwright can automate the login process and access protected content for scraping.

4. Describe the purpose of using XPath in web scraping, and provide an example of an XPath expression to select a specific HTML element from a webpage.

Purpose of using XPath in web scraping:
- XPath is used to navigate and select elements in XML or HTML documents.
- It helps target specific elements based on their structure, attributes, or text content.
Example XPath expression:
- //div[@class="product-title"]/a
  - This expression selects the <a> element within a <div> with the attribute class equal to “product-title”.
  - It can be used to extract specific product titles from a webpage.