How To Webscrape All Shoes On Nike Page Using Python
I am trying to webscrape all the shoes on https://www.nike.com/w/mens-shoes-nik1zy7ok. How do I scrape every shoe, including the ones that load lazily as you scroll down the page?
Solution 1:
By examining the API calls made by the website you can find a cryptic URL starting with https://api.nike.com/. This URL is also stored in the INITIAL_REDUX_STATE
that you already used to get the first couple of products. So, I simply extend your approach:
import requests
import json
import re

# your product page
uri = 'https://www.nike.com/w/mens-shoes-nik1zy7ok'
base_url = 'https://api.nike.com'

session = requests.Session()

def get_lazy_products(stub, products):
    """Get the lazily loaded products by following the 'next' page stubs."""
    response = session.get(base_url + stub).json()
    next_products = response['pages']['next']
    products += response['objects']
    if next_products:
        get_lazy_products(next_products, products)
    return products

# find INITIAL_REDUX_STATE in the page source
html_data = session.get(uri).text
redux = json.loads(re.search(r'window\.INITIAL_REDUX_STATE=(\{.*?\});', html_data).group(1))

# find the initial products and the API entry point for the recursive loading of additional products
wall = redux['Wall']
initial_products = re.sub('anchor=[0-9]+', 'anchor=0', wall['pageData']['next'])

# fetch all the products
products = get_lazy_products(initial_products, [])

# Optional: filter by id to get a list of unique products
cloudProductIds = set()
unique_products = []
for product in products:
    try:
        if product['id'] not in cloudProductIds:
            cloudProductIds.add(product['id'])
            unique_products.append(product)
    except KeyError:
        print(product)
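One caveat: because the pagination depth is unknown up front, the recursive `get_lazy_products` could in principle hit Python's default recursion limit on a very long result list. An iterative rewrite avoids that. This is a sketch that assumes the same response shape (`pages.next` and `objects`); it takes the fetch step as a callable so the pagination logic can be shown with stubbed data instead of live requests (in the answer above you would pass `lambda s: session.get(base_url + s).json()`).

```python
def get_all_products(stub, fetch):
    """Iterative equivalent of get_lazy_products.

    `fetch` takes an API stub and returns the decoded JSON page.
    """
    products = []
    while stub:
        page = fetch(stub)
        products += page['objects']
        stub = page['pages']['next']  # falsy when there are no more pages
    return products

# Demonstration with a stubbed two-page response (hypothetical data):
pages = {
    '/page1': {'objects': [{'id': 'a'}, {'id': 'b'}], 'pages': {'next': '/page2'}},
    '/page2': {'objects': [{'id': 'c'}], 'pages': {'next': ''}},
}
result = get_all_products('/page1', pages.get)
# result -> [{'id': 'a'}, {'id': 'b'}, {'id': 'c'}]
```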
The API also returns the total number of products, though this number seems to vary and depends on the count parameter in the API's URL.
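If you want fewer round trips, you can try raising that count parameter in the stub before the first request, the same way the code above rewrites the anchor parameter. Whether the API honors a larger page size is an assumption you should verify against the live response; the stub path below is purely illustrative (in practice you would operate on `wall['pageData']['next']`).

```python
import re

# Hypothetical stub, shown only to illustrate the rewrite:
stub = '/product_feed/rollup_threads/v2?anchor=0&count=24'
bigger = re.sub(r'count=[0-9]+', 'count=100', stub)
# bigger -> '/product_feed/rollup_threads/v2?anchor=0&count=100'
```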
Do you need help parsing or aggregating the results?