Scraping Web Data Using BeautifulSoup

August 14, 2022

I am trying to scrape the rain chance and the temperature/wind speed for each baseball game from rotowire.com. Once I scrape the data, I will then convert it to three columns - rai

Solution 1:

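To locate all the data, loop over each game's .lineup__bottom block, read the team names from the preceding .lineup__teams element, and split the weather text into its parts: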

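import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://www.rotowire.com/baseball/daily-lineups.php"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

weather = []

# Each game card ends with a .lineup__bottom block that holds the weather line.
for tag in soup.select(".lineup__bottom"):
    # The team abbreviations sit in the preceding .lineup__teams element.
    header = tag.find_previous(class_="lineup__teams").get_text(
        strip=True, separator=" vs "
    )
    # The <b> tag holds the rain chance (e.g. "100% Rain") or "Dome".
    rain = tag.select_one(".lineup__weather-text > b")
    if rain is None:
        # Skip cards that have no weather line.
        continue
    # The text after the <b> tag looks like "66° Wind 8 mph In".
    # For dome games it reads "In Domed Stadium", so Temp and Wind
    # come out as "In" and "Stadium".
    forecast_info = rain.next_sibling.split()
    temp = forecast_info[0]
    wind = forecast_info[2]

    weather.append(
        {"Header": header, "Rain": rain.text.split()[0], "Temp": temp, "Wind": wind}
    )


df = pd.DataFrame(weather)
print(df)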

Output:

        Header  Rain Temp     Wind
0   PHI vs CIN  100%  66°        8
1   CWS vs CLE    0%  64°        4
2    SD vs CHC    0%  69°        7
3   NYM vs ARI  Dome   In  Stadium
4   MIN vs BAL    0%  75°        9
5    TB vs NYY    0%  68°        9
6   MIA vs TOR    0%  81°        6
7   WAS vs ATL    0%  81°        4
8   BOS vs HOU  Dome   In  Stadium
9   TEX vs COL    0%  76°        6
10  STL vs LAD    0%  73°        4
11  OAK vs SEA  Dome   In  Stadium
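In the output above, the dome games end up with "In" and "Stadium" in the Temp and Wind columns, because the code splits the text on whitespace and picks fixed positions. One way around that is to pull the numbers out with regular expressions instead. The following is only a minimal sketch: parse_forecast is a hypothetical helper, not part of the original answer, and the input formats ("100% Rain", "66° Wind 8 mph In", "In Domed Stadium") are assumed from the outputs shown on this page.

```python
import re


def parse_forecast(rain_text, detail_text):
    """Hypothetical helper: turn the two weather strings into clean values.

    rain_text   -- text of the <b> tag, assumed to look like "100% Rain" or "Dome"
    detail_text -- text after the <b> tag, assumed to look like
                   "66° Wind 8 mph In" or "In Domed Stadium"
    """
    rain_parts = rain_text.split()
    # Look for digits followed by a degree sign, and digits after "Wind".
    temp_match = re.search(r"(\d+)\s*°", detail_text)
    wind_match = re.search(r"Wind\s+(\d+)\s*mph", detail_text)
    return {
        "Rain": rain_parts[0] if rain_parts else None,
        # Dome games have no temperature or wind, so these stay None.
        "Temp": int(temp_match.group(1)) if temp_match else None,
        "Wind": int(wind_match.group(1)) if wind_match else None,
    }


print(parse_forecast("100% Rain", " 66° Wind 8 mph In"))
print(parse_forecast("Dome", "In Domed Stadium"))
```

Inside the loop, this could be called as parse_forecast(rain.text, rain.next_sibling) and the result merged with the header before appending. Missing values then show up as empty cells in the DataFrame rather than stray words.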

Solution 2:

You can locate the cards with the game info and find the weather data at the bottom (if it is present):

from bs4 import BeautifulSoup as soup
import requests, re, pandas as pd
# Note: the := (walrus) operator used below requires Python 3.8 or later.
d = soup(requests.get('https://www.rotowire.com/baseball/daily-lineups.php').text, 'html.parser')
r = [{'header':' vs '.join(k.get_text(strip=True) for k in i.select('div.lineup__teams div.lineup__team')),
      # j is the weather text element; contents[1] is the <b> tag with the rain chance
      'rain':(j:=i.select_one('.lineup__bottom .lineup__weather .lineup__weather-text')).contents[1].text,
      # contents[2] is the text after the <b> tag; raw strings keep the regex escapes valid
      'temperature':x[0] if (x:=re.findall(r'^\d+', j.contents[2].strip())) else 'In Domed Stadium',
      'wind':x[0] if (x:=re.findall(r'(?<=Wind\s)[\w\W]+', j.contents[2].strip())) else 'In Domed Stadium'
      }
     # skip the box that belongs to the tools card rather than to a game
     for i in d.select('div.lineup.is-mlb div.lineup__box') if 'is-tools' not in i.parent['class']]

df = pd.DataFrame(r)
print(df)

Output:

        header       rain       temperature              wind
0   PHI vs CIN  100% Rain                66          8 mph In
1   CWS vs CLE    0% Rain                64         4 mph L-R
2    SD vs CHC    0% Rain                69          7 mph In
3   NYM vs ARI       Dome  In Domed Stadium  In Domed Stadium
4   MIN vs BAL    0% Rain                75         9 mph Out
5    TB vs NYY    0% Rain                68         9 mph R-L
6   MIA vs TOR    0% Rain                81         6 mph L-R
7   WAS vs ATL    0% Rain                81         4 mph R-L
8   BOS vs HOU       Dome  In Domed Stadium  In Domed Stadium
9   TEX vs COL    0% Rain                76             6 mph
10  STL vs LAD    0% Rain                73         4 mph Out
11  OAK vs SEA       Dome  In Domed Stadium  In Domed Stadium
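The rain and wind columns above are still strings such as "100% Rain" and "8 mph In". If numeric columns are wanted, they can be split further. This is a rough sketch under the assumption that the values always follow the shapes seen in the output above; rain_percent and split_wind are hypothetical names introduced here for illustration.

```python
import re


def rain_percent(rain_text):
    """Return the rain chance as an int, or None for values like "Dome"."""
    match = re.match(r"\s*(\d+)%", rain_text)
    return int(match.group(1)) if match else None


def split_wind(wind_text):
    """Return (speed_mph, direction) from text such as "8 mph In".

    Both values are None when no speed is found (e.g. "In Domed Stadium").
    The direction is None when the text stops after "mph".
    """
    match = re.match(r"\s*(\d+)\s*mph\s*(.*)", wind_text)
    if not match:
        return None, None
    direction = match.group(2).strip() or None
    return int(match.group(1)), direction


print(rain_percent("100% Rain"), split_wind("4 mph L-R"))
print(rain_percent("Dome"), split_wind("In Domed Stadium"))
```

With pandas, these helpers could be applied to the existing columns, for example df['rain'].map(rain_percent).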
