Email Id Extractor Project from sites in Scrapy Python

m3uZael

Member

Jan 28, 2024

Python:

# web scraping framework
import scrapy

# for regular expression
import re

# for selenium request
from scrapy_selenium import SeleniumRequest

# for link extraction
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class EmailtrackSpider(scrapy.Spider):
    # name of spider
    name = 'emailtrack'

    # to have unique email ids
    uniqueemail = set()

    
    # start_requests sends the first request through Selenium;
    # parse is called with the rendered response
    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.google.com",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        # get all links from the page source
        links = LxmlLinkExtractor(allow=()).extract_links(response)

        # Finallinks holds the url of every extracted link
        Finallinks = [str(link.url) for link in links]

        # links list for urls that may have email ids
        links = []

        # keep only the urls likely to list email ids:
        # "about us" and "contact us" pages (matched case-insensitively)
        for link in Finallinks:
            if 'contact' in link.lower() or 'about' in link.lower():
                links.append(link)

        # the current page url is added too, because a few sites have email ids on their main page
        links.append(str(response.url))

        # take the first url off the list and request it;
        # parse_link extracts the email ids
        next_url = links.pop(0)

        # meta carries the remaining links from parse to parse_link
        yield SeleniumRequest(
            url=next_url,
            wait_time=3,
            screenshot=True,
            callback=self.parse_link,
            dont_filter=True,
            meta={'links': links}
        )


    def parse_link(self, response):

        # response.meta['links'] holds the links list passed along with the request
        links = response.meta['links']

        # pages whose url contains any of these words are skipped
        bad_words = ['facebook', 'instagram', 'youtube', 'twitter', 'wiki', 'linkedin']

        # extract email ids only if no bad word appears in the current page url
        if not any(word in str(response.url) for word in bad_words):
            html_text = str(response.text)
            # regular expression for email ids (raw string, so \w and \. are not string escapes)
            email_list = re.findall(r'\w+@\w+\.\w+', html_text)
            # uniqueemail is a set, so update() drops duplicates on its own
            self.uniqueemail.update(email_list)

        # parse_link is called again while links remain;
        # once the list is empty, move on to parsed
        if links:
            next_url = links.pop(0)
            yield SeleniumRequest(
                url=next_url,
                callback=self.parse_link,
                dont_filter=True,
                meta={'links': links}
            )
        else:
            yield SeleniumRequest(
                url=response.url,
                callback=self.parsed,
                dont_filter=True
            )

    def parsed(self, response):
        # emails list of uniqueemail set
        emails = list(self.uniqueemail)
        finalemail = []

        for email in emails:
            # drop garbage matches: keep only addresses that contain
            # '.in', '.com', 'info' or 'org'
            if '.in' in email or '.com' in email or 'info' in email or 'org' in email:
                finalemail.append(email)

        # final unique email ids from the crawled site
        print('\n'*2)
        print("Emails scraped", finalemail)
        print('\n'*2)
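
The extraction step in parse_link can also be tried on its own, without Scrapy or Selenium. Below is a minimal sketch, not part of the spider above: the names EMAIL_RE, ASSET_SUFFIXES and extract_emails are made up for illustration, and the pattern is an assumption that is a bit stricter than \w+@\w+\.\w+ (it allows dots and dashes in the address and skips matches that look like image file names such as logo@2x.png).

```python
import re

# Assumed pattern (not from the original spider): local part, "@",
# one or more domain labels, then an alphabetic top-level domain.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

# Hypothetical list of file suffixes that often show up as false positives.
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")


def extract_emails(html_text):
    """Return the set of unique, lower-cased email-like strings in html_text."""
    found = set()
    for match in EMAIL_RE.findall(html_text):
        candidate = match.lower()
        # skip things like "logo@2x.png" that only look like addresses
        if candidate.endswith(ASSET_SUFFIXES):
            continue
        found.add(candidate)
    return found


if __name__ == "__main__":
    sample = '<a href="mailto:Info@Example.com">mail</a> <img src="logo@2x.png">'
    print(extract_emails(sample))
```

Inside the spider, something like self.uniqueemail.update(extract_emails(response.text)) could take the place of the re.findall call, if this stricter filtering is wanted.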


