In many cases, you need to access systems via a browser to perform different steps. Usually, you would open your browser with Foxtrot and click, read, and write just like you would manually. Most of the time this is also the right approach, but not always. If your process only needs to retrieve data from a website, for example all the data from a table such as currency rates, you can do this lightning fast without even opening a browser window. Instead, you send requests to retrieve the HTML from the website and parse (read) it to extract exactly the information you need. You can compare this to APIs, but APIs are predefined calls to specific endpoints, whereas this approach gives you exactly the same data you would see if you actually opened the browser.
This article builds on the guide below on how to automate Google Chrome (and other browsers) using Foxtrot and Python. So, in this article, we assume that you know the basics of how to work with browsers, find HTML elements using different approaches, and so forth:
To use the action, you have to install Python:
We also recommend that you install and set up PyCharm using the guide below if you have not already done so:
If you are looking for some general information and help to use the Python action in Foxtrot in general, you can read our guide:
With Python and PyCharm installed and set up, you are now equipped to get started!
Install Requests-HTML
You will use the Requests-HTML library. To get started, we recommend that you create a new project in PyCharm to learn and test Requests-HTML. Then, when you have it working and understand the concept, you can start to implement actions in Foxtrot. So, we will begin this guide in PyCharm by creating a new project called "Web_Scrape". Make sure to select the option "Existing interpreter", as we will install Requests-HTML in a moment and it needs to be installed in the global Python interpreter to make sure that Foxtrot can also make use of the framework.
Now, start by installing requests-html using pip as explained in the guide below.
Simply go to the terminal of PyCharm and write "pip install requests-html" and press enter. This should install the package.
NOTE: We also recommend that you install the library called Requests. It is made by the same author as Requests-HTML and is probably one of the most popular Python libraries of all. It can do many of the same things as Requests-HTML but is mostly used for API integrations. To install it, simply go to the terminal of PyCharm, write "pip install requests", and press enter. This should install the package.
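For reference, once Requests is installed, a plain request is as simple as this small sketch (the URL below is just a placeholder, not a real endpoint):
import requests

# Minimal sketch of a plain Requests call - the URL is only a placeholder
response = requests.get('https://www.someurl.com/api/data')
print(response.status_code)  # e.g. 200 if the request succeeded
print(response.text)         # the raw response body as a string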
Perfect, that means that Requests-HTML is now installed and registered with your Python interpreter. This also means that when we go to Foxtrot later and use Requests-HTML there, it will work as well.
Get started
It is actually quite easy to use Requests-HTML. If you are struggling with anything throughout this guide, make sure to read their documentation or contact us for support. Let us start by taking a look at a very simple example. First, we create a new Python file for testing. You can give it any name you would like.
The four lines of code you will always use are shown below. They form the basis of your request; of course, they might need to be adjusted for some requests, for example if you need authentication or other extra parameters in your request:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.someurl.com/'
r = session.get(url)
For this first test, let us have some fun extracting data from our support website.
Let us try to extract the main title "Welcome to our support" and print it to the console. So, our first part of the code would be:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://support.foxtrotalliance.com/hc/en-us'
r = session.get(url)
Now, append the following lines of code to print everything from the website (all the HTML) and then, at the end, make sure to close the request and the session.
print(r.text)
r.close()
session.close()
When you run this, you should see all the HTML printed to the console.
If you press "Ctrl+F" in the console, you can search the output. Let us see if we can find the element that we are looking for:
Of course, similar to how we do it in the Selenium article, you can also inspect the website via Google Chrome.
Now, instead of printing all the HTML, let us find this specific element using the CSS selector. In order to do so, we adjust our code to the following.
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://support.foxtrotalliance.com/hc/en-us'
r = session.get(url)
title = r.html.find('body > main > section > div > h1', first=True)
print(title.text)
r.close()
session.close()
Notice that we use "first=True" in the call to find the element. This is because the find call returns a list, as there could potentially be more than one element on the website fulfilling the criteria we use to find the element.
Alternatively, instead of using "first=True", if you know you always need the second match rather than the first one, you could use this:
title = r.html.find('body > main > section > div > h1')[1]
Remember, a list always starts at 0, so the above line of code will return the second match. Of course, in this case, it will give an error as there is only one match.
So, by adjusting it to "0" instead of "1", it works again, in the same way as using "first=True".
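So, the working adjusted line would simply be:
title = r.html.find('body > main > section > div > h1')[0]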
Now, let us step it up a notch. Let us see how we could print all the promoted articles.
So, the solution would be to find the element containing the list of promoted articles and then loop through them to print the individual articles. The code would look like this (where we use the CSS selector of the "<ul class="article-list promoted-articles">" element):
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://support.foxtrotalliance.com/hc/en-us'
r = session.get(url)
section = r.html.find('body > main > div:nth-child(2) > div > section.articles.promoted_blocks > ul', first=True)
articles = section.find('a')
for index, article in enumerate(articles):
    print(article.text)
r.close()
session.close()
NOTE: After finding the block of the articles, we find every article (link) using the 'a':
And this should be the output.
Awesome. Now, what if we would like to print it like this: "article_name (article_link)"? Let us have a look! Requests-HTML offers a very nice feature to retrieve all links or absolute links. So, we can simply adjust the print line inside the loop to be the following:
print(article.text + ' ' + str(article.absolute_links))
So, the full code will be:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://support.foxtrotalliance.com/hc/en-us'
r = session.get(url)
section = r.html.find('body > main > div:nth-child(2) > div > section.articles.promoted_blocks > ul', first=True)
articles = section.find('a')
for index, article in enumerate(articles):
    print(article.text + ' ' + str(article.absolute_links))
r.close()
session.close()
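Note that "absolute_links" returns a Python set, which is why it is wrapped in str() above. If you prefer to print the bare URL in parentheses instead of the set representation, a small hedged variation of the print line inside the loop (assuming each article element contains exactly one link) could be:
    # absolute_links is a set; pull out its single element to get the bare URL
    link = next(iter(article.absolute_links))
    print(article.text + ' (' + link + ')')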
Simple as that! That is it for this first very simple example. We will now move on to something a bit more exciting.
Extracting web table data
For this section, we will take a look at how to extract data from web tables. First, we will look at one website and showcase two different approaches. The first one loops through every cell in the table, while the second one utilizes the pandas library to parse the HTML.
Currencies web table (approach 1)
Please reference the uploaded Foxtrot project called "Requests-HTML, Currency Approach 1.fox", available at the end of the article, throughout this example. This is the web table we would like to extract. For the sake of the example, let us extract the data both to the console and to a CSV file.
Let us assume we only need the first four columns (as those are the only ones with relevant data). Then, this would be the full solution:
import csv
from requests_html import HTMLSession
session = HTMLSession()
url = 'http://www.nationalbanken.dk/valutakurser'
r = session.get(url)
tables = r.html.find('#currenciesTable')
with open(r'c:\currencies.csv', mode='w', newline='') as file:
    file_writer = csv.writer(file, delimiter=';', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for table_index, table in enumerate(tables):
        rows = table.xpath('//tr')
        for row_index, row in enumerate(rows):
            row_text = []
            xpath = '//th' if row_index == 0 else '//td'
            columns = row.xpath(xpath)
            for index_column, column in enumerate(columns):
                row_text.append(column.text)
                if index_column == 3:
                    break
            print(row_text)
            file_writer.writerow(row_text)
r.close()
session.close()
Let us break it down. First, we make sure to import the "csv" package. Then, we retrieve the website http://www.nationalbanken.dk/valutakurser. Hereafter, we locate the tables. Now, it is important to notice that we are actually extracting data from multiple tables, therefore we declare a variable called "tables". We find the tables using the ID - notice how both of the tables have that ID, so our variable "tables" will contain both tables and we can loop through them.
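If you want to confirm that the selector really matched both tables before looping, a quick hedged check could be:
# Sanity check: the ID selector should match two tables on this page
print(len(tables))  # expected to print 2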
Then we use the common method of writing data to CSV files. You can read more about this here:
We construct a loop that goes through each row ("//tr") and, for each row, extracts data from each column. Now, because the table headers and table rows use different tags, we need the logic with the if-statement "xpath = '//th' if row_index == 0 else '//td'".
The output of this code should be:
And a CSV file on the C: drive:
Now, keep in mind, if your system settings are different, you might need to adjust the delimiter in the csv.writer line. Also, of course, you would probably want to delete row number 9 of the output, as that is the second table's row of headers, before further processing.
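If you prefer to clean that up in code rather than by hand, a small hedged sketch (assuming the headers include an "ISO" column, as the later examples suggest) could post-process the CSV file with pandas:
import pandas as pd

# Hedged sketch: remove any repeated header rows that slipped into the data
df = pd.read_csv(r'c:\currencies.csv', sep=';')
df = df[df['ISO'] != 'ISO']
df.to_csv(r'c:\currencies.csv', sep=';', index=False)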
Currencies web table (approach 2)
Here is an alternative, arguably even more efficient, method using the pandas library. The output is exactly the same with one exception: we avoid extracting the headers of the second table, and we skip the step of printing the output to the console. Here is all the code (notice that it is actually shorter than the first approach):
import pandas as pd
from requests_html import HTMLSession
session = HTMLSession()
url = 'http://www.nationalbanken.dk/valutakurser'
r = session.get(url)
tables = r.html.find('#currenciesTable')
for index, table in enumerate(tables):
    df = pd.read_html(table.html, thousands='.', decimal=",")[0]
    df = df.drop(["Se graf", "RSS", "Download"], axis=1)
    if index == 0:
        df.to_csv(r'c:/currencies.csv', sep=";", decimal=",", index=False)
    else:
        df.to_csv(r'c:/currencies.csv', sep=";", decimal=",", index=False, mode='a', header=False)
    df = df.iloc[0:0]
r.close()
session.close()
Let us break down the code. First, instead of importing the csv package, we import pandas with the alias "pd". We use exactly the same approach to find the two tables. Now, instead of opening a file writer to a CSV file, we use pandas to write the data to the CSV file. We simply loop through the tables and use the function "read_html" to read the HTML of the current table; pandas is intelligent enough to convert the HTML to a dataframe. Then, we use the function "drop" to remove the columns we are not interested in. Finally, we write the data to the CSV file. In the first loop, we create a new file and include headers; in all other loops, we append the data without including headers. At the end, we use "iloc" to clear the dataframe object to make sure there is no memory leak.
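As a side note, if you would rather avoid the append logic entirely, a hedged alternative sketch is to collect both tables in one dataframe with "concat" and write the CSV file in a single call:
import pandas as pd
from requests_html import HTMLSession

# Hedged alternative: combine both tables into one dataframe and write once
session = HTMLSession()
r = session.get('http://www.nationalbanken.dk/valutakurser')
tables = r.html.find('#currenciesTable')
frames = [pd.read_html(table.html, thousands='.', decimal=",")[0].drop(["Se graf", "RSS", "Download"], axis=1)
          for table in tables]
pd.concat(frames, ignore_index=True).to_csv(r'c:/currencies.csv', sep=";", decimal=",", index=False)
r.close()
session.close()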
And, of course, as soon as your code is working properly, you can simply copy-paste it to Foxtrot and use it as part of your project.
Currencies web table (approach 2) EXPANDED
As a last example, the code below illustrates some expanded functionality that not only extracts the currency table from the website but also synchronously downloads the CSV file for each single currency code. The method used to download the CSV files is explained in this link. Like in the previous example, you should make sure that your settings for the "to_csv" function in terms of the separator and decimal character are appropriate. Also note that you need to pip install "pandas", "requests", and "requests-html" to be able to run the code.
from io import StringIO
import requests
from requests_html import HTMLSession
import pandas as pd

session = HTMLSession()
url = 'http://www.nationalbanken.dk/valutakurser'
r = session.get(url)
tables = r.html.find('#currenciesTable')
for table_index, table in enumerate(tables):
    df_master = pd.read_html(table.html, thousands='.', decimal=",")[0]
    df_master = df_master.drop(["Se graf", "RSS", "Download"], axis=1)
    if table_index == 0:
        df_master.to_csv(r'c:/AllCurrencies.csv', sep=";", decimal=",", index=False)
    else:
        df_master.to_csv(r'c:/AllCurrencies.csv', sep=";", decimal=",", index=False, mode='a', header=False)
    for row_index, row in df_master.iterrows():
        currency = row['ISO']
        data = requests.get('http://www.nationalbanken.dk/_vti_bin/DN/DataService.svc/CurrencyRateCSV?lang=da&iso=' + currency)
        df_currency = pd.read_csv(StringIO(data.text), skiprows=[0, 1, 2, 3], delimiter=";", decimal=",", index_col=2)
        df_currency.to_csv(r'c:/' + currency + '.csv', sep=";", decimal=",", index=False)
    df_master.iloc[0:0]
    df_currency.iloc[0:0]
session.close()
r.close()
data.close()
For an even more powerful (faster) but at the same time more advanced solution, here is the same functionality of downloading all the CSV files but using an asynchronous approach. The solution above is synchronous code (what you typically see, and the easiest to write): it runs a command and waits for it to complete before moving on. It is, however, possible to run asynchronous code, meaning that you are, in this case, essentially asking the code to download ALL the files simultaneously and then simply handle them as they come in. That is the solution below. It is much more complicated but also much more sophisticated. However, this is only appropriate if you do not care about the order of execution. If it is important that you download the files in a specific order, like A-->B-->C, then you need to do it synchronously. You can read more about this topic in this article.
from io import StringIO
import asyncio
from concurrent.futures import ThreadPoolExecutor
import requests
from requests_html import HTMLSession
import pandas as pd


def get_main_table():
    url = 'http://www.nationalbanken.dk/valutakurser'
    currencies_to_fetch = []
    with HTMLSession() as session:
        with session.get(url) as r:
            tables = r.html.find('#currenciesTable')
            for table_index, table in enumerate(tables):
                df_master = pd.read_html(table.html, thousands='.', decimal=",")[0]
                df_master = df_master.drop(["Se graf", "RSS", "Download"], axis=1)
                if table_index == 0:
                    df_master.to_csv(r'c:/AllCurrencies.csv', sep=";", decimal=",", index=False)
                else:
                    df_master.to_csv(r'c:/AllCurrencies.csv', sep=";", decimal=",", index=False, mode='a', header=False)
                for row_index, row in df_master.iterrows():
                    currencies_to_fetch.append(row['ISO'])
    return currencies_to_fetch


def get_currencies(session, currency):
    data = session.get(
        'http://www.nationalbanken.dk/_vti_bin/DN/DataService.svc/CurrencyRateCSV?lang=da&iso=' + currency)
    df_currency = pd.read_csv(StringIO(data.text), skiprows=[0, 1, 2, 3], delimiter=";", decimal=",", index_col=2)
    df_currency.to_csv(r'c:/' + currency + '.csv', sep=";", decimal=",", index=False)


async def get_currencies_asynchronous(currencies_to_fetch):
    with ThreadPoolExecutor() as executor:
        with requests.Session() as session:
            loop = asyncio.get_event_loop()
            tasks = [
                loop.run_in_executor(
                    executor,
                    get_currencies,
                    *(session, currency)
                )
                for currency in currencies_to_fetch
            ]
            for response in await asyncio.gather(*tasks):
                pass


def main():
    currencies_to_fetch = get_main_table()
    loop = asyncio.get_event_loop()
    future = asyncio.ensure_future(get_currencies_asynchronous(currencies_to_fetch))
    loop.run_until_complete(future)


main()