Sometimes you need to get some data from your favorite website to use in your project. But what if this website doesn’t have an API? In this post, I’ll show you how to run Chrome and Selenium in Lambda to build a simple scraper.

In my current role as AWS Practice Lead for Nordcloud, I want to keep track of the AWS certifications our employees have achieved. The AWS partner portal gives me that overview, but I wanted to enrich this data with data from our HR system, for example to see how certifications are distributed across countries. Unfortunately, at this time, the portal does not provide an API and only allows me to download a CSV export containing the list of certifications. So I figured I could use a simple web scraping script to (1) log in to the portal and (2) download the CSV file. While I use this as an example, your data source could of course be anything else.

Fetching the data

Let’s start by looking at the different components I’m using in the automation script.

  • Selenium is an open-source solution that can automate web browsers. It is mainly used for testing web applications. The Selenium library is available in different languages, but for this project I’m using the Python version.
  • Chrome WebDriver: Selenium uses WebDrivers to support different browsers. In this project I’m using Chrome, so I need the Chrome WebDriver.
  • Chromium: for the implementation in Lambda I’m using Chromium as the web browser.

If you are following along, make sure to download a version of the web driver that is compatible with your browser version. Check the Chrome WebDriver downloads page to see which version you need.
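
Once you have a browser object (created in the next snippet), a quick way to confirm that the versions actually match is to print what Selenium reports; note that the exact capability keys can vary slightly between Selenium and driver versions:

# Quick sanity check of the browser/driver pairing (assumes the `browser` object created below)
caps = browser.capabilities
print("Browser version:   ", caps.get("browserVersion"))
print("ChromeDriver build:", caps.get("chrome", {}).get("chromedriverVersion"))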

While the next part depends entirely on your use case, I still wanted to show how I approached it. To get started, I first initiate a browser object. This is the browser I will use to log in to the website.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions

# Run Chrome in headless mode, since there is no display attached
chrome_options = ChromeOptions()
chrome_options.add_argument('--headless')

browser = webdriver.Chrome(
    executable_path='<path/to/chromedriver>',
    options=chrome_options,
)
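
For the Lambda version of this script, headless Chromium typically needs a few extra flags, because the Chrome sandbox and a full-sized /dev/shm are not available in the execution environment, and the browser binary location has to be set explicitly. A minimal sketch; the /opt paths are assumptions and should point to wherever your install script places Chromium and ChromeDriver:

chrome_options = ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')             # Lambda does not support the Chrome sandbox
chrome_options.add_argument('--disable-dev-shm-usage')  # /dev/shm is very small in Lambda
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--single-process')
chrome_options.binary_location = '/opt/chrome/chrome'   # assumed install location of Chromium

browser = webdriver.Chrome(
    executable_path='/opt/chromedriver',                # assumed install location of ChromeDriver
    options=chrome_options,
)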

Next, I define a simple function that logs in using the defined browser. Basically, the script below browses to the specified URL, waits for a certain webpage element to appear, then fills in the username and password fields and clicks the login button. It then captures and returns the webpage cookies. I will be using these cookies together with the Python requests library to download the report later. You will need to look at the source of your webpage to find the corresponding elements on your website and create something similar.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_login_cookies(browser, username, password):
    print('Getting login cookies...')
    browser.get("<replace-with-url>")
    WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.ID, '<replace-your-element-id>')))
    username_input = browser.find_element(By.ID, "<replace-with-username-field-id>")
    password_input = browser.find_element(By.ID, "<replace-with-password-field-id>")
    username_input.send_keys(username)
    password_input.send_keys(password)
    browser.find_element(By.ID, "<replace-with-login-button-id>").click()
    # Wait for an element that only appears after a successful login
    WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.ID, '<replace-with-post-login-element-id>')))
    cookies = {}
    for cookie in browser.get_cookies():
        cookies[cookie['name']] = cookie['value']
    browser.close()
    return cookies

With the collected cookies I can now download the file I need. As an example, I just return the file contents, but obviously you could write this to S3, or transform it and store it in DynamoDB.

import requests

def download_file(cookies):
    # Reuse the login cookies so the request is authenticated
    response = requests.get("<url-to-file>", cookies=cookies, allow_redirects=True)
    return response.text
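
If you want to keep the export around, here is a minimal sketch of writing it to S3 instead of returning it; the bucket name and object key are placeholders, not part of the original solution:

import boto3

def store_report(cookies):
    # Download the CSV and upload it to S3 (bucket and key are hypothetical)
    body = download_file(cookies)
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket='my-report-bucket',
        Key='reports/certifications.csv',
        Body=body.encode('utf-8'),
    )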

Running it in Lambda

Now that we have the foundations of our script defined, let’s see how we can run this in Lambda. I’ve created a Dockerfile to use with container image support in Lambda. Starting from the public.ecr.aws/lambda/python:3.8 base image, I install several dependencies that are required to run Chromium and ChromeDriver. After that, a simple bash script installs Chromium and ChromeDriver, and the Python dependencies (including Selenium) are installed via pip.

FROM public.ecr.aws/lambda/python:3.8

# Install chrome dependencies
RUN yum install unzip atk at-spi2-atk gtk3 cups-libs pango libdrm \
    libXcomposite libXcursor libXdamage libXext libXtst libXt \
    libXrandr libXScrnSaver alsa-lib -y

# Copy install scripts
COPY requirements.txt /tmp/
COPY install-chrome.sh /tmp/

# Install chromium, chrome-driver
RUN /usr/bin/bash /tmp/install-chrome.sh

# Install Python dependencies for function
RUN pip install --upgrade pip -q
RUN pip install -r /tmp/requirements.txt -q

# Remove unused packages
RUN yum remove unzip -y

COPY app.py /var/task/
CMD [ "app.lambda_handler" ] 

In my scraper script, I add a handler function that calls the earlier defined functions. And no, don’t worry, I will not store the username and password in my script. Once everything works as expected, I just have to replace this with a call to AWS Secrets Manager.

def lambda_handler(event, context):
    # login, get data
    username = "<username>"
    password = "<password>"
    cookies = get_login_cookies(browser, username, password)
    certs = download_file(cookies)
    return certs
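
For completeness, a minimal sketch of what that Secrets Manager call could look like, assuming the SECRET_NAME environment variable set by the CDK stack below and the Username/Password keys used when the secret is created:

import json
import os
import boto3

def get_credentials():
    # SECRET_NAME is set on the function by the CDK stack further down
    secret_name = os.environ['SECRET_NAME']
    client = boto3.client('secretsmanager')
    secret = json.loads(client.get_secret_value(SecretId=secret_name)['SecretString'])
    return secret['Username'], secret['Password']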

Next, I build the image locally.

$ docker build . -t scraper

Once the build is completed, I start the image locally.

$ docker run -p 9000:8080 scraper:latest

Since the official Lambda base images already contain the Lambda Runtime Interface Emulator (RIE), I can now invoke my function locally to see if it works.

$ curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

Deploying it all

I’m using the AWS CDK to deploy the Lambda function. First I define an AWS Secrets Manager secret to store the login credentials for the website. I then pass the name of the secret to the Lambda function using an environment variable and assign permissions to the Lambda execution role to fetch this secret.

from aws_cdk import (
    Duration,
    aws_iam,
    aws_lambda,
    aws_secretsmanager,
    ...
)

...

secret = aws_secretsmanager.Secret(
    scope=self,
    id='WebsiteLogin',
    generate_secret_string=aws_secretsmanager.SecretStringGenerator(
        secret_string_template=json.dumps(dict(Username='')),
        generate_string_key='Password',
        password_length=32,
    )
)

scraper_function = aws_lambda.DockerImageFunction(
    scope=self,
    id="ScraperFunction",
    code=aws_lambda.DockerImageCode.from_image_asset("lambda/scraper"),
    environment={
        "SECRET_NAME": secret.secret_name
    },
    memory_size=2048,
    timeout=Duration.minutes(5),
)

scraper_function.role.add_to_principal_policy(
    aws_iam.PolicyStatement(
        effect=aws_iam.Effect.ALLOW,
        actions=[
            "secretsmanager:GetSecretValue",
            "secretsmanager:DescribeSecret",
            "secretsmanager:ListSecretVersionIds"
        ],
        resources=[
            secret.secret_arn
        ]
    )
)

I’ve also added a simple event rule to trigger the Lambda function daily.

from aws_cdk.aws_events import Rule, Schedule
from aws_cdk.aws_events_targets import LambdaFunction

...

rule = Rule(
    scope=self, 
    id="ReportScheduleRule",
    schedule=Schedule.cron(minute="0", hour="20")
)   
rule.add_target(target=LambdaFunction(scraper_function))

Conclusion

While not the prettiest of solutions, sometimes you just need a workaround for a missing API. This is one of those workarounds. For reference, I’ve uploaded my solution to GitHub, in case you want to have a more detailed look at the code.

Photo by Lucian Alexe on Unsplash
