How to Scrape Email Addresses from a Website using Python?

If you want to scrape email addresses from a website, follow this simple process.

Step 1: Import modules

We import six modules for this project:

  1. re for regular expression matching operations
  2. requests for sending HTTP requests
  3. urlsplit (from urllib.parse) for breaking URLs down into component parts
  4. deque (from collections) for a list-like container with fast appends and pops on either end
  5. BeautifulSoup for pulling data out of HTML files of websites
  6. pandas for formatting emails into a DataFrame for further manipulation

Step 2: Initialize variables

Then we initialize a deque for unscraped URLs, a set for scraped URLs, and a set for the emails scraped successfully from the website.

Elements in a set are unique, so adding a duplicate URL or email has no effect.
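The initialization described above could look like the following minimal sketch. The starting URL and the email address are placeholders chosen for illustration; the variable names follow the complete code at the end of this article.

```python
from collections import deque

# Placeholder starting URL (an assumption for illustration).
original_url = "https://example.com/"

unscraped = deque([original_url])  # URLs waiting to be crawled
scraped = set()                    # URLs already crawled
emails = set()                     # unique email addresses found so far

# Adding the same address twice leaves a single entry in the set.
emails.add("info@example.com")
emails.add("info@example.com")
print(len(emails))
```

A deque is used for the queue because popleft() removes the oldest URL cheaply, which a plain list does not.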

Step 3: Start scraping

1. First, move a URL from unscraped to scraped.

2. Then we use urlsplit to extract different parts of the url.

urlsplit() returns a 5-tuple: (addressing scheme, network location, path, query, fragment identifier).

[Figure: sample input and output for urlsplit()]

In this way, we can get the base and path parts of the website URL.
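As an illustration of this step, the sketch below splits a hypothetical URL (not taken from the article) and derives the base and path the same way the complete code does.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
url = "https://www.example.com/blog/post.html?page=2#comments"
parts = urlsplit(url)

print(parts.scheme)    # addressing scheme
print(parts.netloc)    # network location
print(parts.path)      # path
print(parts.query)     # query
print(parts.fragment)  # fragment identifier

# Base: scheme plus network location.
base_url = "{0.scheme}://{0.netloc}".format(parts)

# Path: everything up to and including the last slash of the URL.
if "/" in parts.path:
    path = url[:url.rfind("/") + 1]
else:
    path = url

print(base_url)
print(path)
```

The base is later prepended to links that start with a slash, and the path to links that are relative to the current page.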

3. Send an HTTP GET request to the website.
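A minimal sketch of this request step follows. The helper name fetch_page and the 10-second timeout are assumptions added for illustration; the complete code calls requests.get directly inside the loop and sets no timeout.

```python
import requests


def fetch_page(url):
    """Return the body of `url` as text, or None if the request cannot be made."""
    try:
        # The timeout is an assumption; the article's code does not set one.
        response = requests.get(url, timeout=10)
    except (requests.exceptions.MissingSchema,
            requests.exceptions.ConnectionError):
        return None
    return response.text


# A URL without a scheme raises MissingSchema before any network traffic.
print(fetch_page("not-a-url"))
```

Catching these two exceptions lets the crawler skip malformed or unreachable links instead of stopping.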

4. Extract all email addresses from the response using a regular expression, and add them to the email set.

If you are not familiar with Python regular expressions, check Python RegEx for more information.
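The extraction can be tried on a small string first. The sketch below applies the pattern from the complete code to a made-up HTML snippet; note that this pattern only matches addresses ending in ".com".

```python
import re

# Pattern from the complete code; it only matches addresses ending in ".com".
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com"

# Made-up HTML snippet for illustration.
html = """
<p>Contact: Sales@Example.com or support@example.com</p>
<p>Again: support@example.com, and info@example.org</p>
"""

# re.I makes the match case-insensitive; the set removes the repeated address.
new_emails = set(re.findall(EMAIL_PATTERN, html, re.I))
print(sorted(new_emails))
```

The .org address is not captured, so the pattern would need to be broadened to collect addresses on other top-level domains.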

5. Find all linked URLs in the website.

To do so, we first need to create a BeautifulSoup object to parse the HTML document.

Then we can find all the linked URLs in the document by finding the <a href=""> tags, which indicate hyperlinks.
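As a rough sketch of these two steps, the example below parses a made-up HTML document and collects the href of every anchor. It uses the "html.parser" backend that ships with Python, whereas the complete code uses "lxml"; that swap is an assumption made to keep the example free of extra installs.

```python
from bs4 import BeautifulSoup

# Made-up HTML document for illustration.
html = """
<html><body>
  <a href="/about">About</a>
  <a href="contact.html">Contact</a>
  <a name="top">Anchor without href</a>
</body></html>
"""

# "html.parser" ships with Python; the complete code uses "lxml" instead.
soup = BeautifulSoup(html, "html.parser")

links = []
for anchor in soup.find_all("a"):
    # Anchors without an href attribute yield an empty string.
    link = anchor.attrs.get("href", "")
    links.append(link)

print(links)
```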

Add the new URL to the unscraped queue if it is in neither unscraped nor scraped yet.

We also need to exclude links that cannot be scraped; the complete code skips links ending in .gz.
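The queueing rules above can be sketched without any network access. The current page URL and the list of href values below are hypothetical; the resolution and filtering logic mirrors the complete code.

```python
from collections import deque
from urllib.parse import urlsplit

url = "https://www.example.com/blog/index.html"  # hypothetical current page
parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind("/") + 1] if "/" in parts.path else url

unscraped = deque()
scraped = {url}

# Hypothetical href values found on the page.
hrefs = ["/about", "post-1.html", "https://www.example.com/blog/index.html",
         "/about", "/files/archive.gz"]

for link in hrefs:
    if link.startswith("/"):
        link = base_url + link      # root-relative link
    elif not link.startswith("http"):
        link = path + link          # link relative to the current page
    # Skip .gz archives and anything already queued or crawled.
    if not link.endswith(".gz"):
        if link not in unscraped and link not in scraped:
            unscraped.append(link)

print(list(unscraped))
```

Only two of the five links are queued: the page itself is already in scraped, the second "/about" is a duplicate, and the .gz archive is excluded.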

Step 4: Export emails to a CSV file

After successfully scraping emails from the website, we can export the emails to a CSV file.
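The export can be previewed with a couple of made-up addresses. This sketch writes to an in-memory buffer so that no file is created; passing a file name such as "email.csv" to to_csv writes to disk instead. Sorting the set first is an addition here, made so that the row order is reproducible.

```python
import io

import pandas as pd

emails = {"sales@example.com", "support@example.com"}  # made-up addresses

# Sorting is an addition here: it makes the row order reproducible.
df = pd.DataFrame(sorted(emails), columns=["Email"])

# Write the CSV text into memory instead of a file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```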

If you are using Google Colaboratory, you can download the file to your local machine by calling files.download("email.csv") from the google.colab module.

[Figure: sample output CSV file]

Complete Code

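import re
from collections import deque
from urllib.parse import urlsplit

import pandas as pd
import requests
from bs4 import BeautifulSoup

try:
    from google.colab import files  # only available in Google Colaboratory
except ImportError:
    files = None

original_url = input("Enter the website url: ")

unscraped = deque([original_url])  # URLs still to be crawled
scraped = set()                    # URLs already crawled
emails = set()                     # unique emails found so far

while len(unscraped):
    # Move the next URL from unscraped to scraped.
    url = unscraped.popleft()
    scraped.add(url)

    # Work out the base URL and the path prefix used to resolve relative links.
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    if '/' in parts.path:
        path = url[:url.rfind('/')+1]
    else:
        path = url

    print("Crawling URL %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # Skip URLs that are malformed or unreachable.
        continue

    # Collect email addresses (this pattern only matches .com addresses).
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.com", response.text, re.I))
    emails.update(new_emails)

    # Queue every link on the page that has not been seen yet.
    soup = BeautifulSoup(response.text, 'lxml')
    for anchor in soup.find_all("a"):
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
        else:
            link = ''
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        if not link.endswith(".gz"):
            if link not in unscraped and link not in scraped:
                unscraped.append(link)

# Export the emails to a CSV file (sorted so the row order is stable).
df = pd.DataFrame(sorted(emails), columns=["Email"])
df.to_csv('email.csv', index=False)

# In Google Colaboratory, download the file to the local machine.
if files is not None:
    files.download("email.csv")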


[1] Crawling all emails from a website, The Startup (Medium)
