Is web scraping really illegal? Understanding the ethics and legality of data collection
In an era where information drives competition, the question of how data is collected has become central to discussions around digital ethics, compliance, and fair access.
Imagine your company runs an ecommerce platform for clothing and wants to conduct market research on competitors’ pricing, product assortment, and promotions.
You have two options:
- Manual data collection – visiting competitor websites one by one, copying information into spreadsheets, and structuring it manually. It’s painfully time-consuming, resource-heavy, and prone to human error.
- Automated data scraping – using tools or scripts to collect the same information programmatically. It’s faster, more scalable, and far more efficient — but also, as many argue, ethically gray and legally uncertain.
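As a toy illustration of the automated option, the sketch below uses only Python’s standard library to pull product names and prices out of listing markup. The `SAMPLE_HTML`, its class names, and the prices are invented for this example; real sites use different (and messier) markup:

```python
from html.parser import HTMLParser

# Hypothetical competitor page markup; real sites vary widely.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Linen Suit</span>
      <span class="price">$249.00</span></li>
  <li class="product"><span class="name">Wool Blazer</span>
      <span class="price">$189.50</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> / <span class="price"> tags."""
    def __init__(self):
        super().__init__()
        self._field = None    # which span we are currently inside, if any
        self._current = {}    # partially assembled row
        self.rows = []        # finished (name, price) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            cls = dict(attrs).get("class")
            if cls in ("name", "price"):
                self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 2:
                self.rows.append((self._current["name"], self._current["price"]))
                self._current = {}

parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)
# → [('Linen Suit', '$249.00'), ('Wool Blazer', '$189.50')]
```

A few dozen lines like these replace hours of manual copy-paste — which is exactly why the scale question discussed below matters.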
So, where exactly is the line between legitimate data collection and illegal web scraping?
1. When is data collection legal?
Let’s begin with a simple example.
If you are planning your wedding and create an Excel sheet comparing suit prices from various online stores, you are not breaking any laws. The data you use — prices, images, product names — is publicly available, meaning no login, paywall, or special access is required.
In legal terms, public data refers to information that anyone can view without authentication or bypassing security measures. The open nature of the Internet is built around this accessibility.
Ecommerce businesses intentionally make product data public to drive visibility and sales — so using that information for personal or analytical purposes is inherently lawful. However, the type of data and how it’s collected play a crucial role.
Public, private, and copyrighted data
When discussing web scraping, the first question should always be what type of data is being collected. Data on the Internet generally falls into three categories:
1) Public Data
What it is: Information that is freely accessible without authentication or special permissions.
Examples: Product listings, public company pages, publicly visible reviews and open social profiles.
Typical risk level: Lower (still depends on volume, usage, and ToS).
2) Private Data (High-Risk Category)
What it is: Information intentionally placed behind access controls such as logins, paywalls, or other protective mechanisms.
Examples: User account pages, order history, private dashboards, subscriber-only content, internal portals.
What’s special about it (and why it matters):
- Restricted by design: If access requires a login, payment, or validation, it signals the owner’s intention to limit access.
- Higher legal exposure: Scraping private data can be interpreted as unauthorized access, especially if it involves bypassing protections (e.g., paywalls, CAPTCHAs, IP blocks). This can trigger anti-hacking laws in some jurisdictions.
- Often tied to personal data: Private areas frequently contain PII (names, emails, addresses, payment details), which increases compliance obligations and privacy-law risk (e.g., GDPR in the EU).
- Contractual restrictions are stronger: Private sections commonly require explicit agreement to Terms of Service (often via clickwrap), making enforcement more likely.
Rule of thumb: If it requires authentication, treat it as private and assume elevated legal and ethical risk.
3) Copyrighted Data
What it is: Content protected by intellectual property law.
Examples: Articles, images, videos, product descriptions and proprietary databases.
Typical risk level: Medium to high depending on copying scope, reuse, and local law (fair use / exceptions vary).
Depending on the category, scraping may be legal, questionable, or clearly illegal. As a general example, scraping public ecommerce listings for market analysis is often permissible, while scraping private user data or reproducing copyrighted content without authorization may violate privacy, contract, or IP laws.
2. No universal rulebook: Regional differences in web scraping laws
There is no single global law that governs web scraping. Regulations differ significantly by region, often depending on how each jurisdiction defines “unauthorized access” and “public data.”
Below is an overview of how the U.S. and the European Union approach the issue.
Legality of web scraping in the United States
The U.S. legal framework for web scraping largely revolves around the Computer Fraud and Abuse Act (CFAA) — originally designed to prevent hacking and unauthorized system access.
Court decisions have clarified that:
- Scraping publicly available data (accessible without login or bypassing technical barriers) does not violate the CFAA.
- Accessing private or protected data (behind authentication or paywalls) can be considered unauthorized and therefore illegal.
U.S. courts distinguish between contract violations (breaching Terms of Service) and criminal actions. A ToS violation alone rarely constitutes a crime.
The landmark hiQ v. LinkedIn and Bright Data cases (detailed below) solidified the U.S. position that scraping public data is lawful, as long as it doesn’t involve hacking or data misuse.
Legality of web scraping in the European Union
The EU takes a stricter view due to its strong data protection and contract law framework.
Key factors:
- The General Data Protection Regulation (GDPR) regulates any use of personal data, even if it’s publicly visible. Scraping personal or identifying data without user consent can breach GDPR.
- Contract law gives force to website terms. If users must agree to Terms of Use before accessing content, those terms are enforceable.
- The Ryanair v. PR Aviation ruling (2015) confirmed that even publicly visible data can be legally protected through user agreements.
Thus, while the U.S. focuses on access rights, the EU emphasizes user consent and contractual compliance.
Browsewrap vs. Clickwrap agreements
Website Terms of Service are often presented in two formats:
- Browsewrap agreements: Terms are displayed passively; users are presumed to agree by browsing. Courts generally find these less enforceable, as passive browsing doesn’t imply consent.
- Clickwrap agreements: Require active user consent, such as clicking “I agree.” Courts consider these legally binding, as they reflect a deliberate act of acceptance.
This distinction matters — a scraper accessing data under browsewrap terms may not be bound by them, while scraping data behind clickwrap terms could constitute a contract breach.
3. The ethical dilemma of automation
If manually copying 10,000 product prices is acceptable, why does automating the process suddenly become “unethical”?
The difference lies in scale and impact. Automated scraping tools can overload servers, violate ToS, and extract massive datasets within seconds.
Major platforms like Booking.com, Trustpilot, and Amazon prohibit automated scraping in their terms. Yet the enforceability of these prohibitions varies — especially when applied to non-authenticated, public data.
This creates a persistent paradox: manual collection is tolerated, but automated collection of identical public data may be labeled unethical or even illegal.
4. Landmark legal cases defining web scraping legality
Below are the most influential court cases that have shaped the modern understanding of web scraping laws across jurisdictions.
1. hiQ Labs, Inc. v. LinkedIn Corp. (2017–2022, U.S.)
Court: U.S. Court of Appeals for the Ninth Circuit
Status: Settled in 2022
Overview:
hiQ Labs, a data analytics company, used automated tools to collect data from public LinkedIn profiles for workforce insights. LinkedIn attempted to block hiQ’s access and claimed that the scraping violated the Computer Fraud and Abuse Act (CFAA).
Court decision:
The Ninth Circuit ruled that scraping publicly available data does not violate the CFAA. Since the profiles were accessible without login, the access was not “unauthorized.”
Impact:
This case set a landmark precedent affirming that scraping public data is legal, while scraping private, restricted, or protected data remains unlawful. The court also warned that restricting access to public data could create information monopolies and harm public interest.
2. Meta (Facebook) v. Bright Data (2023–2024, U.S.)
Court: U.S. District Court, Northern District of California
Overview:
Meta sued Bright Data for scraping publicly available data from Facebook and Instagram, claiming it violated Terms of Service and copyright law.
Court decision:
In 2024, the court dismissed most of Meta’s claims. It ruled that:
- Scraping public data is lawful, as long as it does not bypass security measures.
- Meta’s ToS prohibitions didn’t apply to non-logged-in users who hadn’t agreed to those terms.
- Meta does not “own” the public user-generated data hosted on its platforms.
Impact:
The ruling reinforced hiQ v. LinkedIn and emphasized that platforms cannot claim exclusive rights over publicly available user content. It also clarified that ToS breaches are not inherently illegal unless technical barriers are violated.
3. X (Twitter) v. Bright Data (2024, U.S.)
Court: U.S. District Court, Northern District of California
Overview:
X (formerly Twitter) filed a lawsuit against Bright Data, accusing the company of scraping and selling publicly accessible Twitter data.
Court Decision:
A federal judge dismissed the case in May 2024, finding that:
- Scraping publicly available tweets does not violate copyright law or the CFAA.
- X could only refile claims related to improper server access, not the scraping itself.
Impact:
The decision aligned with hiQ and Meta rulings, reinforcing that public data scraping is permissible under U.S. law, as long as the scraping process does not breach authentication or cause system harm.
4. Ryanair DAC v. PR Aviation BV (2015, European Union)
Court: Court of Justice of the European Union (CJEU)
Overview:
PR Aviation scraped flight data from Ryanair’s website, which required users to agree to Terms of Use prohibiting scraping. PR Aviation argued that since the data was publicly accessible, the restriction was invalid.
Court Decision:
The CJEU ruled in favor of Ryanair, holding that when users must accept terms before accessing data, those terms are contractually binding. Even if the data is public, scraping it in violation of agreed terms constitutes a breach of contract.
Impact:
This case highlighted a key difference between the EU and the U.S.: EU courts uphold contractual prohibitions, even for public data, while U.S. courts focus on whether access itself was unauthorized.
5. eBay, Inc. v. Bidder’s Edge, Inc. (2000, U.S.)
Court: U.S. District Court, Northern District of California
Overview:
Bidder’s Edge used bots to aggregate auction listings from eBay without permission. eBay claimed this activity overloaded its servers and constituted trespass to chattels (unauthorized interference with property).
Court Decision:
The court granted an injunction in favor of eBay, finding that automated scraping caused measurable load on its servers and therefore violated property rights.
Impact:
Although predating modern web automation, this case established early legal recognition of server protection rights and influenced later rulings under the CFAA.
6. Craigslist, Inc. v. 3Taps, Inc. (2013, U.S.)
Court: U.S. District Court, Northern District of California
Overview:
3Taps scraped Craigslist’s listings to create a searchable interface. Craigslist sent cease-and-desist letters and blocked 3Taps’ IP addresses, but 3Taps continued scraping.
Court Decision:
The court ruled that continuing to access a site after explicit revocation of permission constitutes a violation of the CFAA.
Impact:
This decision clarified that scraping becomes illegal once explicit consent is withdrawn or technical barriers (like IP blocks) are bypassed, even if the data is public.
Summary: What these cases teach us
| Principle | Legal position (mostly U.S.) | Key cases |
| --- | --- | --- |
| Scraping public data | Legal if accessible without login or security bypass | hiQ v. LinkedIn, X v. Bright Data, Meta v. Bright Data |
| Scraping private/protected data | Illegal; violates CFAA or privacy laws | Craigslist v. 3Taps, eBay v. Bidder’s Edge |
| Ignoring Terms of Service | Usually civil, not criminal | Ryanair v. PR Aviation (EU), Meta v. Bright Data (U.S.) |
| Bypassing technical barriers | Considered unauthorized access | Craigslist v. 3Taps, eBay v. Bidder’s Edge |
| Monopolizing public data | Courts discourage data monopolies | hiQ v. LinkedIn, Meta v. Bright Data |
5. New developments: charging AI for crawling
In 2025, Cloudflare introduced “Pay per Crawl,” a feature allowing website owners to charge AI crawlers (e.g., those operated by OpenAI, Anthropic, or Google DeepMind) a small fee for each page request.
While it aims to protect publishers and create fairer compensation models, it raises questions about digital fairness and the future of open data.
Monetizing public access could fragment the web into “pay-to-index” ecosystems, limiting access for startups, researchers, and small developers.
In the long term, such restrictions contradict the open data ethos — the idea that public information should remain accessible to all.
6. Ethical best practices for businesses
To ensure ethical and compliant data collection, businesses should:
- Target only public data. Avoid content requiring authentication.
- Respect robots.txt. Follow a website’s crawling policies.
- Throttle scraping frequency. Prevent excessive requests.
- Avoid personal data. Don’t collect or store PII.
- Use data responsibly. Limit use to legitimate business analytics.
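The first three practices above can be wired together with Python’s standard library alone. The sketch below parses robots.txt rules (the rules, paths, and bot name here are invented for illustration; in practice you would fetch the live file from the target site), checks whether a path may be crawled, and reads the declared crawl delay to throttle requests:

```python
import time
from urllib import robotparser

# Hypothetical robots.txt; a real crawler would fetch it from
# https://<target-site>/robots.txt before making any other request.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

AGENT = "my-research-bot"  # invented user-agent name for this example

def fetch_allowed(path, agent=AGENT):
    """Return True only if robots.txt permits this agent to crawl the path."""
    return rp.can_fetch(agent, path)

# Public listing pages are allowed; the authenticated area is not.
print(fetch_allowed("/products/suits"))   # → True
print(fetch_allowed("/account/orders"))   # → False

# Honor the declared crawl delay between requests (fall back to 1s).
delay = rp.crawl_delay(AGENT) or 1
# In a real crawl loop: time.sleep(delay) between consecutive requests.
```

This keeps the crawler on the right side of the site’s stated policy and avoids the server-load harms at issue in cases like eBay v. Bidder’s Edge.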
7. Conclusion: The balance between freedom and control
Web scraping lies at the crossroads of innovation, regulation, and digital ethics.
While courts increasingly support the right to access public data, ethical considerations and regional laws add nuance.
The challenge isn’t whether scraping should exist, but how it can coexist responsibly with privacy, consent, and fair access.
The Internet was built on openness and transparency. Preserving those principles, while protecting users and infrastructure, will define the next generation of digital information ethics.