python - Looping through a paginated API asynchronously
I'm ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). I have code that functions as follows:
while c <= limit:
    if not api_url:
        break
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    for item in response['documents']:
        # process each document here
        pass
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1
Downloading the data this way is slow, and I'm wondering if there is a way to loop through the pages asynchronously. I have been recommended to take a look at twisted, but I'm not entirely sure how to proceed.
What you have here is a case where you don't know up front what to read next unless you call the API. Think of it like this: what can you do in parallel?
I don't know exactly how much you can do in parallel or in how many tasks, but let's try...
Some assumptions:
- you can retrieve data from the API without penalties or limits
- the data processing of one page/batch can be done independently of any other
What is slow is the I/O, so you can split the code into two parallel running tasks: one reads the data, puts it in a queue, and continues reading unless it hits the limit or an empty response, or pauses if the queue is full.
The second task then takes data from the queue and does something with it.
So you can call one task from another.
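As a rough illustration of this first, queue-based split (using plain threads rather than the Celery setup described below; the starting URL, the limit, and the per-item processing step are placeholders):

import queue
import threading

import requests

API_URL = 'https://example.com/api/documents'  # hypothetical starting URL
LIMIT = 1000                                   # hypothetical page limit

def reader(q):
    # Task 1: read pages and put them in the queue until the limit or an empty response.
    url, count = API_URL, 0
    while url and count <= LIMIT:
        response = requests.get(url).json()
        q.put(response['documents'])    # blocks (pauses) if the queue is full
        url = response.get('more_url')  # missing 'more_url' ends the loop
        count += 1
    q.put(None)                         # sentinel: no more pages

def processor(q):
    # Task 2: take batches from the queue and process them.
    while True:
        documents = q.get()
        if documents is None:
            break
        for item in documents:
            pass  # process each document here

q = queue.Queue(maxsize=10)  # bounded queue so the reader pauses when it gets ahead
t = threading.Thread(target=reader, args=(q,))
t.start()
processor(q)
t.join()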
The other approach is to have one task call the other one right after the data is read, so the execution runs in parallel but slightly shifted.
How would I implement it? With Celery tasks, and yes, with requests.
For example, the second approach:
import requests
from celery import task  # old-style Celery task decorator

@task
def do_data_process(data):
    # do something with the data
    pass

@task
def parse_one_page(url):
    response = requests.get(url)
    data = response.json()
    if 'more_url' in data:
        parse_one_page.delay(data['more_url'])
    # and here do the data processing in this task
    do_data_process(data)
    # or call the worker and try to do it in another process
    # do_data_process.delay(data)
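To start the chain you would enqueue the first page (the URL below is just a placeholder):

parse_one_page.delay('https://example.com/api/documents?page=1')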
How many tasks run in parallel depends on the limits you add to the code; you can have workers on multiple machines and use separate queues for parse_one_page and do_data_process.
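A minimal sketch of what those separate queues could look like, assuming old-style Celery routing settings and hypothetical queue and module names:

# celeryconfig.py (hypothetical module name)
CELERY_ROUTES = {
    'tasks.parse_one_page': {'queue': 'pages'},
    'tasks.do_data_process': {'queue': 'processing'},
}

# Run a dedicated worker per queue, possibly on different machines, e.g.:
#   celery -A tasks worker -Q pages --concurrency=4
#   celery -A tasks worker -Q processing --concurrency=8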
Why this approach, and not twisted or async?
Because you have CPU-bound data processing (JSON parsing, working with the data), and for that it is better to have separate processes, which Celery is perfect for.