I spent a full day working out problems with asyncio. The current setup might be a little overdone, but I am running the code in production on Google Cloud Functions and testing it in Jupyter notebooks. It works without errors across different operating systems, platforms, and Python versions (as long as it is 3.6 or 3.7).
asyncio is a library to write concurrent code using the async/await syntax.
asyncio is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc.
asyncio is often a perfect fit for IO-bound and high-level structured network code.
My pattern for using asyncio for web scraping has three parts.
1. Make a function for a single request.
2. Turn this function into a list of tasks ("task" is the nomenclature of the asyncio library).
3. Execute the list of tasks.
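The three steps above can be sketched end to end as follows. Note that fetch_one is a hypothetical stand-in for a real request function, and the loop setup uses new_event_loop so the snippet runs as a plain script:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Step 1: a function for a single request. fetch_one is a hypothetical
# stand-in; a real version would issue an HTTP request here.
def fetch_one(item):
    return item * 2

async def run_all(items):
    executor = ThreadPoolExecutor(max_workers=4)
    # Inside a coroutine, get_event_loop returns the running loop.
    loop = asyncio.get_event_loop()
    # Step 2: turn the function into a list of tasks.
    tasks = [loop.run_in_executor(executor, fetch_one, item) for item in items]
    # Step 3: execute the list of tasks and gather the results.
    return await asyncio.gather(*tasks)

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
results = loop.run_until_complete(run_all([1, 2, 3]))
print(results)  # [2, 4, 6]
loop.close()
```

asyncio.gather preserves the order of the input tasks, so the results come back in the same order as the items, even though the work runs concurrently.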
Nothing out of the ordinary except for nest_asyncio. I had a lot of trouble with the event loop: it can throw a lot of errors while running in a Jupyter notebook. Using nest_asyncio solves the problems in the notebook, and the code still works in an app with no problems.
nest_asyncio.apply() at the top of the program, just under the imports, is how to implement the module. This line of code is the answer to many headaches while working with asyncio.
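A minimal sketch of that setup. The import guard is my addition (not in the original pattern) so the snippet also runs where nest_asyncio is not installed:

```python
import asyncio

try:
    import nest_asyncio
    # Patch asyncio so run_until_complete works even when a loop is
    # already running, as it is inside a Jupyter notebook.
    nest_asyncio.apply()
except ImportError:
    # In a plain script there is no pre-running loop, so the code
    # works without nest_asyncio.
    pass
```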
This function is our example function for demonstrating the pattern; substitute your own function if you are comfortable doing so.
request_song_info takes three parameters.
session: session is the requests.Session() object. The session speeds up requests by reusing cookies and TCP connections.
song_num: an integer that is concatenated to the request url.
song_urls: our data structure for these requests. song_urls is updated for every request. Each item is a tuple of the song_num and the information we get from the request.
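A sketch of what request_song_info might look like under those assumptions. The base URL and the use of response.text are hypothetical placeholders for the real endpoint and parsing:

```python
def request_song_info(session, song_num, song_urls):
    """Request info for one song and append (song_num, info) to song_urls.

    session is expected to behave like requests.Session(); the base URL
    and response.text below are placeholders for the real API.
    """
    base_url = "https://example.com/songs/"  # hypothetical endpoint
    response = session.get(base_url + str(song_num))
    # Append a (song_num, info) tuple, matching the data structure above.
    song_urls.append((song_num, response.text))
```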
The doc string in the function is a good high-level description of how the function works.
Executor is an abstract class whose subclasses use a pool of workers to execute calls asynchronously.
ThreadPoolExecutor is the subclass of Executor that uses a pool of threads.
get_event_loop gets (or, if necessary, creates) the event loop.
The event loop is the core of the asyncio application and runs the tasks.
tasks is a list comprehension that creates a list of individual function calls.
loop.run_in_executor arranges for the function to be called in the given executor. It takes the executor, the function we want to run, and the function's parameters unpacked in the form *(param1, param2, param3). Since the tasks are built with a list comprehension, we need an iterable of parameters to loop over; it can be a list of ints, strings, or other objects.
asyncio.gather(*tasks) runs the tasks concurrently and gathers the results. Since the original function updates the data structure, we do nothing with the responses.
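Putting the executor, the list comprehension, and gather together, a minimal sketch. Here record is a hypothetical stand-in for request_song_info, and the results list plays the role of song_urls:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def record(song_num, results):
    # Hypothetical stand-in for request_song_info: just record the number.
    results.append(song_num)

async def get_data_asynchronous(results):
    executor = ThreadPoolExecutor(max_workers=2)
    loop = asyncio.get_event_loop()
    # One executor call per song_num; *(...) unpacks the parameters.
    tasks = [
        loop.run_in_executor(executor, record, *(song_num, results))
        for song_num in range(3)
    ]
    # Run all calls concurrently; the function mutates results, so the
    # return value of gather is ignored here.
    await asyncio.gather(*tasks)

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
results = []
loop.run_until_complete(get_data_asynchronous(results))
loop.close()
```

Because the worker threads may finish in any order, results can arrive out of sequence; the (song_num, info) tuple in the real function is what lets you re-sort afterwards.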
The parameters for the get_data_asynchronous function are the parameters used to form the list comprehension for the task creation. (song_num is not passed because it is generated by the range.)
create_task schedules the execution of the tasks. (asyncio.create_task was added in Python 3.7; in Python 3.6, use asyncio.ensure_future instead.) Get the event loop again.
loop.run_until_complete finally executes the tasks.
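A minimal sketch of the scheduling step, using asyncio.ensure_future so it runs on both 3.6 and 3.7. The coroutine body here is a placeholder; the real one builds and gathers the executor tasks:

```python
import asyncio

async def get_data_asynchronous(results):
    # Placeholder body; the real coroutine builds and gathers the tasks.
    results.append("done")

results = []
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
# asyncio.ensure_future works on Python 3.6+; asyncio.create_task (3.7+)
# does the same but requires a loop that is already running.
future = asyncio.ensure_future(get_data_asynchronous(results))
loop.run_until_complete(future)
loop.close()
```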
If your tasks come from multiple functions, you can add the two lists together, e.g.
tasks = [ loop.run_in_executor(...) for x in y] + [loop.run_in_executor(...) for v in w ]
To reuse this code, only a few changes need to be made.
Change your function at Step 1.
Update the parameters for the get_data_asynchronous function, and change the function called inside loop.run_in_executor.
Update the parameters for the final loop.run_until_complete call.