Asyncio Web Crawler
Approximately 100 requests per second
I spent a full day working out problems with asyncio. The current setup might be a little overdone, but I run this code in production on Google Cloud Functions and test it in Jupyter notebooks. It works without errors across operating systems, platforms, and Python versions (as long as it is 3.6 or 3.7).
What is Asyncio?
asyncio is a library to write concurrent code using the async/await syntax.
asyncio is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web-servers, database connection libraries, distributed task queues, etc.
asyncio is often a perfect fit for IO-bound and high-level structured network code.
Asyncio Flow
My pattern for using asyncio for web scraping has three parts.
Make a function for a single request.
Turn this function into a list of tasks ("task" is the asyncio library's nomenclature).
Execute the list of tasks.
Step 0. Imports
Nothing out of the ordinary except for nest_asyncio. I had a lot of trouble with the event_loop. The event_loop can throw a lot of errors while running in a Jupyter notebook. Using nest_asyncio solves the problems in the notebook, and the code works in an app with no problems.
Step 1. Define the function
Putting nest_asyncio.apply() at the top of the program, just under the imports, is how to implement the module. This line of code is the answer to many headaches while working with asyncio.
This function is our example function for demonstrating the pattern; use your own function if you are comfortable.
request_song_info takes three parameters.
session: session is the requests.Session() object. The session speeds up requests by reusing cookies and TCP connections.
song_num: an integer that is concatenated to the request url.
song_urls: our data structure for these requests. song_urls is updated on every request. Each item is a tuple of the song_num and the information we get from the request.
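A sketch of what such a function could look like, given the three parameters above. The base_url and the use of response.text are placeholders, not the endpoint or fields from the original project:

```python
def request_song_info(session, song_num, song_urls):
    """Make one blocking request and record (song_num, result) in song_urls.

    base_url is a placeholder, not the real endpoint from the project.
    """
    base_url = "https://example.com/api/songs/"
    response = session.get(base_url + str(song_num))
    song_urls.append((song_num, response.text))
```

Because the function mutates song_urls in place, it does not need to return anything; the caller only has to wait for all tasks to finish.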
Step 2. Gather the tasks
The docstring in the function is a good high-level description of how the function works.

Executor is an abstract class that uses a pool of workers to execute calls asynchronously; ThreadPoolExecutor is a subclass of Executor. get_event_loop gets or creates the event_loop. The event_loop is the core of the asyncio application and runs the tasks. tasks is a list comprehension which creates a list of individual function calls. loop.run_in_executor arranges for the function to be called in the executor. It takes the executor, the function we want to run, and the function's parameters in the form of *(param1, param2, param3). Since it uses a list comprehension, we need a list of parameters to iterate over. It can be a list of ints, strings, or other objects.

asyncio.gather(*tasks) runs the tasks concurrently and gathers the results. Since the original function updates the data structure, we do nothing with the response.

The parameters for get_data_asynchronous are the parameters used to form the list comprehension for the task creation. (song_num is not passed because it is iterated over in the range.)
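Putting those pieces together, the gathering step might look like the sketch below. The stand-in request_song_info only records its arguments so the sketch runs without a network; in the real code it is the Step 1 function and session is a requests.Session():

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def request_song_info(session, song_num, song_urls):
    # Stand-in for the blocking Step 1 function so this sketch runs offline.
    song_urls.append((song_num, "info"))

async def get_data_asynchronous(song_urls, num_songs):
    """Create one task per song_num and run them all in a thread pool."""
    with ThreadPoolExecutor(max_workers=10) as executor:
        session = None  # in the real code: requests.Session()
        loop = asyncio.get_event_loop()
        tasks = [
            loop.run_in_executor(
                executor,
                request_song_info,                # the function to run
                *(session, song_num, song_urls),  # its parameters
            )
            for song_num in range(num_songs)      # the iterable that drives the tasks
        ]
        # gather runs the tasks concurrently; the responses are ignored
        # because request_song_info updates song_urls in place.
        await asyncio.gather(*tasks)
```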
Step 3: Execute the tasks
create_task schedules the execution of the tasks. (In Python 3.6, replace create_task with ensure_future.) Get the event loop again; loop.run_until_complete finally executes the tasks.
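A runnable sketch of this step, with a trivial coroutine standing in for the Step 2 function. One caveat worth knowing: the bare asyncio.create_task() only works while a loop is already running, as it is under Jupyter with nest_asyncio applied; in a plain script, loop.create_task or asyncio.ensure_future does the same job.

```python
import asyncio

async def get_data_asynchronous(song_urls, num_songs):
    # Trivial stand-in for the Step 2 coroutine.
    song_urls.extend((n, "info") for n in range(num_songs))

song_urls = []
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
# Python 3.6: task = asyncio.ensure_future(get_data_asynchronous(song_urls, 3))
task = loop.create_task(get_data_asynchronous(song_urls, 3))
loop.run_until_complete(task)  # runs the loop until the task is done
loop.close()
```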
Tips and tricks
If your tasks come from multiple functions, you can add two lists together, i.e. tasks = [loop.run_in_executor(...) for x in y] + [loop.run_in_executor(...) for v in w].
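A self-contained sketch of that trick, with two hypothetical blocking workers in place of real request functions:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

results = []

def fetch_song(n):      # hypothetical blocking worker
    results.append(("song", n))

def fetch_artist(n):    # hypothetical blocking worker
    results.append(("artist", n))

async def gather_mixed():
    with ThreadPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        # Two comprehensions, one combined task list, one gather.
        tasks = (
            [loop.run_in_executor(executor, fetch_song, n) for n in range(2)]
            + [loop.run_in_executor(executor, fetch_artist, n) for n in range(2)]
        )
        await asyncio.gather(*tasks)

asyncio.run(gather_mixed())
```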
To reuse this code, only a few changes need to be made.
Change your function at Step 1.
Update the parameters for the get_data_asynchronous function, and change the function inside loop.run_in_executor(...).
Update the parameters for execute_async_event_loop and asyncio.create_task(get_data_asynchronous(...)).