Asyncio Web Crawler

Approximately 100 requests per second

I spent a full day working out problems with asyncio. The current setup might be a little overdone, but I am running the code in production on Google Cloud Functions and testing it in Jupyter notebooks. It works without errors across operating systems, platforms, and Python versions (as long as you are on 3.6 or 3.7).

What is Asyncio?

asyncio is a library to write concurrent code using the async/await syntax.

asyncio is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web-servers, database connection libraries, distributed task queues, etc.

asyncio is often a perfect fit for IO-bound and high-level structured network code.

Asyncio Flow

My pattern for using asyncio for web scraping has three parts.

  1. Make a function for a single request.

  2. Turn this function into a list of tasks (task is the nomenclature of the asyncio library).

  3. Execute the list of tasks.

Step 0. Imports

Nothing out of the ordinary except for nest_asyncio. I had a lot of trouble with the event_loop. The event_loop can throw a lot of errors while running in a Jupyter notebook. Using nest_asyncio solves those problems in the notebook, and it works in an app with no problems.
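The original import cell is not reproduced here, but a minimal sketch would look something like this (assuming requests is used for the HTTP calls):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import requests
import nest_asyncio
```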

Step 1. Define the function

Putting nest_asyncio.apply() at the top of the program, just under the imports, is how to use the module. That one line is the answer to many headaches while working with asyncio. request_song_info is our example function for demonstrating the pattern; use your own function if you are comfortable.

request_song_info takes three parameters (a sketch of the function follows the list).

  1. session: session is the requests.Session() object. The session speeds up requests by reusing cookies and TCP connections.

  2. song_num: an integer that is concatenated to the request url.

  3. song_urls: our data structure for these requests. song_urls is updated for every request. Each item is a tuple of the song_num and the information we get from the request.
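A hedged sketch of such a function; the URL and the data kept from each response are placeholders, not the original endpoint:

```python
nest_asyncio.apply()  # patch the event loop so asyncio plays nicely inside Jupyter


def request_song_info(session, song_num, song_urls):
    """Fetch one song and record (song_num, data) in song_urls."""
    # Placeholder URL: substitute the endpoint you are actually scraping.
    url = f"https://example.com/songs/{song_num}"
    response = session.get(url)
    # Here we just keep the raw response text; parse out whatever you need.
    song_urls.append((song_num, response.text))
```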

Step 2. Gather the tasks

The docstring in the function is a good high-level description of how it works. Executor is an abstract class that provides methods to execute calls asynchronously; ThreadPoolExecutor is an Executor subclass that uses a pool of threads. get_event_loop gets or creates the event_loop. The event_loop is the core of an asyncio application and runs the tasks. tasks is a list comprehension that creates a list of individual function calls. loop.run_in_executor arranges for the function to be called in the executor. It takes the executor, the function we want to run, and that function's parameters in the form *(param1, param2, param3). Since it uses a list comprehension, we need a list of parameters to iterate over; it can be a list of ints, strings, or other objects. asyncio.gather(*tasks) runs the tasks concurrently and gathers the results. Since the original function updates the data structure, we do nothing with the responses.

The parameters for get_data_asynchronous are the parameters used to form the list comprehension for the task creation. (song_num is not passed because it is iterated over in the range.)
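A sketch of what get_data_asynchronous might look like under these assumptions; the num_songs parameter name and the max_workers value are illustrative:

```python
async def get_data_asynchronous(num_songs, song_urls):
    """Create one task per song_num and run them all in a thread pool."""
    with requests.Session() as session:
        with ThreadPoolExecutor(max_workers=10) as executor:
            loop = asyncio.get_event_loop()
            tasks = [
                # run_in_executor takes the executor, the blocking function,
                # and that function's arguments unpacked from a tuple.
                loop.run_in_executor(
                    executor, request_song_info, *(session, song_num, song_urls)
                )
                for song_num in range(num_songs)
            ]
            # Run everything concurrently; we ignore the return values because
            # request_song_info updates song_urls in place.
            await asyncio.gather(*tasks)
```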

Step 3: Execute the tasks

create_task schedules the execution of the tasks (in Python 3.6, replace create_task with ensure_future). Get the event loop again, and loop.run_until_complete finally executes the tasks.
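Put together, the execution step could be wrapped like this; the exact signature of execute_async_event_loop and the count of 100 songs are assumptions:

```python
def execute_async_event_loop(num_songs, song_urls):
    loop = asyncio.get_event_loop()  # get (or create) the event loop again
    # On Python 3.6, replace loop.create_task with asyncio.ensure_future.
    task = loop.create_task(get_data_asynchronous(num_songs, song_urls))
    loop.run_until_complete(task)  # finally execute the tasks


song_urls = []
execute_async_event_loop(100, song_urls)
print(len(song_urls), "songs fetched")
```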

Tips and tricks

If your tasks come from multiple functions, you can add two lists together, i.e. tasks = [loop.run_in_executor(...) for x in y] + [loop.run_in_executor(...) for v in w].
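For example, a combined version of get_data_asynchronous might build its task list from two comprehensions; request_album_info and album_urls here are hypothetical second scraping targets:

```python
async def get_data_asynchronous(num_songs, num_albums, song_urls, album_urls):
    with requests.Session() as session, ThreadPoolExecutor(max_workers=10) as executor:
        loop = asyncio.get_event_loop()
        tasks = [
            loop.run_in_executor(executor, request_song_info, *(session, n, song_urls))
            for n in range(num_songs)
        ] + [
            # request_album_info is a hypothetical second scraping function.
            loop.run_in_executor(executor, request_album_info, *(session, n, album_urls))
            for n in range(num_albums)
        ]
        await asyncio.gather(*tasks)
```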

To reuse this code, only a few changes need to be made.

  1. Change your function at Step 1.

  2. Update the parameters of the get_data_asynchronous function. Change the function inside loop.run_in_executor(...).

  3. Update the parameters for execute_async_event_loop and asyncio.create_task(get_data_asynchronous(...)).
