Getting Python's requests-html running on AWS Lambda

TL;DR

If you're deploying Python with Zappa to AWS Lambda and getting:

  File "multiprocessing/synchronize.py", line 59, in __init__
    unlink_now)
OSError: [Errno 38] Function not implemented

Make sure you have the latest tqdm (tqdm>=4.32.2) in your requirements.txt, and that it is installed AFTER Zappa in your deployment. Zappa defaults to tqdm>=4.19, which won't work on AWS Lambda.

The Details

A big part of our job at Agnoris is getting lots of different data from lots of different sources (some public, some less so) under one roof to give our customers what they need to know, fast. Sometimes the raw data isn't so pretty, so we need some great tools to get it and clean it up.

Python is our language of choice and AWS Lambda functions are our preferred infrastructure for these tasks. We mostly ship these specific functions with Zappa through a CI system to save deployment time.

Yesterday we broke stuff. We deployed a function that used requests-html through Zappa and got swamped with these kinds of errors:

    from requests_html import HTMLSession
  File "/var/task/requests_html.py", line 9, in <module>
    import pyppeteer
  File "/var/task/pyppeteer/__init__.py", line 30, in <module>
    from pyppeteer.launcher import connect, launch, executablePath  # noqa: E402
  File "/var/task/pyppeteer/launcher.py", line 24, in <module>
    from pyppeteer.browser import Browser
  File "/var/task/pyppeteer/browser.py", line 15, in <module>
    from pyppeteer.page import Page
  File "/var/task/pyppeteer/page.py", line 20, in <module>
    from pyppeteer.coverage import Coverage
  File "/var/task/pyppeteer/coverage.py", line 15, in <module>
    from pyppeteer.util import merge_dict
  File "/var/task/pyppeteer/util.py", line 10, in <module>
    from pyppeteer.chromium_downloader import check_chromium, chromium_executable
  File "/var/task/pyppeteer/chromium_downloader.py", line 15, in <module>
    from tqdm import tqdm
  File "/var/task/tqdm/__init__.py", line 1, in <module>
    from ._tqdm import tqdm
  File "/var/task/tqdm/_tqdm.py", line 53, in <module>
    mp_lock = mp.Lock()  # multiprocessing lock
  File "multiprocessing/context.py", line 67, in Lock
    return Lock(ctx=self.get_context())
  File "multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "multiprocessing/synchronize.py", line 59, in __init__
    unlink_now)
OSError: [Errno 38] Function not implemented

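Turns out that on the AWS Lambda Linux images, the semaphore-backed primitives in Python's multiprocessing module (such as Lock) are not available, which is why creating one fails with Errno 38. As the traceback shows, older versions of tqdm create such a lock at import time (mp_lock = mp.Lock()), and tqdm gets pulled in through pyppeteer, a dependency of requests-html. So simply importing requests-html breaks if tqdm is not up to date!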
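To check whether a given environment has this limitation, here is a minimal sketch that tries to create a lock and catches the failure. The helper name semaphores_available is ours for illustration and is not part of any library. We assume the failure shows up either as the OSError from the traceback above or as an ImportError on platforms where the semaphore implementation is missing entirely.

```python
import multiprocessing as mp


def semaphores_available() -> bool:
    """Return True if multiprocessing can create a Lock on this platform.

    Hypothetical helper for illustration. On AWS Lambda we would expect
    the OSError (Errno 38) shown in the traceback above.
    """
    try:
        mp.Lock()
    except (OSError, ImportError):
        # OSError: the OS refuses to create the underlying semaphore.
        # ImportError: the platform has no working semaphore implementation.
        return False
    return True


if __name__ == "__main__":
    print("multiprocessing locks available:", semaphores_available())
```

On a typical laptop this should report that locks are available; inside a Lambda function we would expect the opposite, which is the situation old tqdm versions could not handle.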

See Parallel Processing in Python with AWS Lambda

Fortunately this issue was fixed by some great folks in the tqdm repo (you can check out the PR).

We added tqdm>=4.32.2 to our requirements.txt and updated our CI to install Zappa BEFORE installing other requirements, so that Zappa doesn't uninstall the latest tqdm in favor of its default version that doesn't work on Lambda.
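A guard like the following could run in CI after all the installs, to catch the case where an older tqdm sneaks back in. This is only a sketch under a few assumptions: it uses importlib.metadata, which needs Python 3.8 or newer, and the names MIN_TQDM, parse_version and tqdm_is_lambda_safe are ours, not part of Zappa or tqdm. It reads the installed version from package metadata, so it does not import tqdm itself.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum tqdm version from this post (assumed to be the first safe one).
MIN_TQDM = (4, 32, 2)


def parse_version(text: str) -> tuple:
    """Turn '4.32.2' into (4, 32, 2); missing parts count as 0."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def tqdm_is_lambda_safe() -> bool:
    """Return True if the installed tqdm is at least MIN_TQDM."""
    try:
        installed = version("tqdm")
    except PackageNotFoundError:
        # tqdm is not installed at all, so there is nothing to deploy.
        return False
    return parse_version(installed) >= MIN_TQDM


if __name__ == "__main__":
    # A non-zero exit code fails the CI step.
    raise SystemExit(0 if tqdm_is_lambda_safe() else 1)
```

Saved as a script and run as the last step before the Zappa deploy, it would fail the build whenever the environment ends up with a tqdm older than 4.32.2.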

Problem solved - back to collecting!