Tech and T-Accounts

Data Scraping for a Beach Vacation 

Friday, February 7, 2020 2:51:51 PM

With the proliferation of different ways to rent a beach house, finding the ideal property for a week at the beach has become an exhausting exercise: researching multiple local beach rental agencies, then checking the various owner-managed listings like VRBO and Airbnb. The filter options vary from site to site, so searching for properties in a particular area (a map-based search) sometimes isn't available, and the same is true for filtering on desirable features, be it a pool, beach proximity, bedding, a grill, etc.

The objective is to scrape the data from the various websites using free tools, consolidate the results into a spreadsheet or CSV file, then use Tableau to filter geographically as well as by amenities, price, and so on.

Criteria

  • Emerald Isle, NC
  • Week of June 13 – June 20 (Saturday check-in/check-out)
  • 4 bedrooms, with sleeping for 3 children under age 6 and 3 sets of adults

Websites to consider:

  • Airbnb, Booking.com, VRBO (all these seem to already be consolidated by www.vacationrenter.com)
  • www.hometogo.com (which seems to consolidate Airbnb, Booking.com, VRBO, HomeAway, VacationRentals.com, TripAdvisor, e-domizil)
  • EmeraldIsleRealty.com
  • SunSurfRealty.com
  • www.bluewaternc.com/crystal-coast-vacation-rentals/

WEB Scrapers

The free web scrapers seem straightforward to configure (and have similar interfaces), so learning one scraper gives some carryover experience for using the others. When the scraper chosen first didn't work for the next site, a different scraper was chosen. Hoping that one scraper would work on every site seems unattainable, but I also wasn't going to spend considerable time figuring out the peculiarities of why a chosen scraper didn't work (although I did spend some time going through tutorials to get comfortable that the approaches I was taking should have worked). I wanted to avoid writing code, so the scrapers I considered had to be suitable for non-programmers.

Summary of scraping efforts (once a site had useful results, other approaches were not attempted, so there are lots of blanks). The scrapers considered:

  • www.Import.io (have to use Chrome, Firefox or Safari)
  • www.Octoparse.com
  • www.parsehub.com
  • www.webscraper.io (Chrome extension)
  • Google Sheets (using the IMPORTHTML or IMPORTXML functions)
  • Colab (Python in the cloud) – see the sketch below
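
For the Colab/Python route, the general shape of a scrape is the same thing the point-and-click tools do: fetch a listing page, pick out the elements of interest, and write rows to a CSV. Here is a minimal sketch of that idea – the URL and the CSS selectors are made-up placeholders, since every rental site lays its pages out differently:

import csv
import requests
from bs4 import BeautifulSoup

# hypothetical listing page and selectors -- each site needs its own
url = "https://www.example-rentals.com/emerald-isle?page=1"
soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

with open("listings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price", "detail_url"])
    for card in soup.select("div.property-card"):          # one card per property (assumed class)
        writer.writerow([card.select_one("h3").get_text(strip=True),
                         card.select_one(".price").get_text(strip=True),
                         card.select_one("a")["href"]])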

Consolidating data

The data from the various scrapers, while similar, had column variations – some included extra columns of URL links, some placed all the amenities in one cell, sometimes the data value and the element name were in the same cell, etc. Getting the data into a consistent representation was done either in Excel or with Tableau Prep.
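
To give a feel for the kind of conforming involved, here is a rough pandas sketch; the file and column names are placeholders standing in for what each scraper actually produced:

import pandas as pd

sunsurf = pd.read_csv("sunsurf_scrape.csv")        # e.g. Name, WeeklyRate, Amenities
bluewater = pd.read_csv("bluewater_scrape.csv")    # e.g. property, price, amenity_list

# rename each site's columns to one common set
bluewater = bluewater.rename(columns={"property": "Name", "price": "WeeklyRate",
                                      "amenity_list": "Amenities"})

# some cells carried the element name along with the data, e.g. "Price: $2,450"
sunsurf["WeeklyRate"] = sunsurf["WeeklyRate"].str.replace(r"^\s*Price:\s*", "", regex=True)

# stack the sites into one table, keeping only the common columns
common = ["Name", "WeeklyRate", "Amenities"]
combined = pd.concat([sunsurf[common], bluewater[common]], ignore_index=True)
combined.to_csv("rental_properties_raw.csv", index=False)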

Adding missing Geocode data

The data from the websites had street addresses, and while the websites had map representations of the searched data, those maps didn't expose the latitude and longitude used for each property.
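
So the street addresses had to be run through a geocoder and the returned coordinates joined back to the property list. For illustration, here is a minimal sketch of that step in Python using geopy's Nominatim as a stand-in – the actual batch geocoding described below was done through the Texas A&M geocoding service, and the file/column names here are placeholders:

import pandas as pd
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

props = pd.read_csv("rental_properties_raw.csv")                 # assumes an Address column
geocode = RateLimiter(Nominatim(user_agent="beach-rental-demo").geocode,
                      min_delay_seconds=1)                       # be polite to the free service

locations = props["Address"].apply(geocode)
props["Latitude"] = locations.apply(lambda loc: loc.latitude if loc else None)
props["Longitude"] = locations.apply(lambda loc: loc.longitude if loc else None)
props.to_csv("rental_properties_geocoded.csv", index=False)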

Data Fields (as received)

The data model I've settled on requires splitting the data so that one table has a single row for each property, and the fields that hold multiple items (like amenities) become a separate table with two columns – the property and the item. This setup will showcase the new Tableau data model, forgoing the need for many LOD calculations.
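
A sketch of that split in pandas, assuming the scraped amenities arrive as one delimited cell per property (the file names, column names, and delimiter are assumptions):

import pandas as pd

props = pd.read_csv("rental_properties_geocoded.csv")   # one row per property; Amenities like "pool; grill; hot tub"

# table 1: a single row per property, without the multi-valued column
rental_properties = props.drop(columns=["Amenities"])

# table 2: two columns -- the property and one amenity per row
amenities = (props[["Name", "Amenities"]]
             .assign(Amenity=props["Amenities"].str.split(";"))
             .explode("Amenity")
             .drop(columns=["Amenities"]))
amenities["Amenity"] = amenities["Amenity"].str.strip()

rental_properties.to_csv("RentalProperties.csv", index=False)
amenities.to_csv("Amenities.csv", index=False)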

Experience

SunSurf – Parsehub parsed the data well, and Texas A&M was used to geocode. But after reviewing the results, the list had to be re-parsed with Webscraper.io to fill in missing data. Webscraper.io turned out to be the better option for this site.

BlueWater – Since Webscraper.io worked well for SunSurf, I started there, but it defied getting more than 7 rows of data (it would get stuck on the opening page, then when AJAX loaded the list of properties, either my selection process or the tool itself got lost). BlueWater exposes a list of property names with URLs to those properties, so if I could just get that list exposed via a URL, I could use another tool to extract the data from it. I created my own version of the exposed list by using the F12 developer tools to copy the HTML containing the list of property names, then loaded that HTML to a website so Parsehub could grab the property names and their associated URLs into an Excel sheet. When clicking a listed name, the URL in the browser changed to one that included the property name, so I used that pattern to build the list of URLs for all the properties. This URL list was then fed to Octoparse, which permits loading a series of up to 10,000 web URLs to scrape. However, the free tier of Octoparse is throttled, so the 555 lines took 5 hours and 17 minutes to parse. The result did provide street addresses – again not quite clean, as there was a carriage return (no space) before the city name, so parsing out the address elements required the SUBSTITUTE function in Excel to get street address, city, and state onto one line, ready for the further splitting needed for geocoding. Once parsed, there were many rows with duplicate addresses, as would be expected for condo complexes. The duplicates were removed to get a unique list of addresses to submit to Texas A&M Geocoding, which geocoded the 385 records in a few minutes.
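
For anyone curious what those steps look like outside of Excel, here is a rough Python equivalent of the URL-list build and the SUBSTITUTE/remove-duplicates cleanup – the URL pattern, file names, and column names are illustrative guesses, not what BlueWater actually uses:

import pandas as pd

names = pd.read_csv("bluewater_property_names.csv")      # property names pulled from the copied HTML

# build each property URL from the observed pattern (pattern shown is a guess for illustration)
names["url"] = ("https://www.bluewaternc.com/crystal-coast-vacation-rentals/"
                + names["PropertyName"].str.lower().str.replace(" ", "-"))
names["url"].to_csv("bluewater_urls.txt", index=False, header=False)   # paste this list into Octoparse

# after the Octoparse run: remove the carriage return that sits (with no space) before the city name
scraped = pd.read_csv("bluewater_scrape.csv")
scraped["Address"] = (scraped["Address"].str.replace("\r", ", ", regex=False)
                                        .str.replace("\n", ", ", regex=False))

# condo complexes repeat the same street address -- geocode each address only once
scraped[["Address"]].drop_duplicates().to_csv("bluewater_addresses_for_geocoding.csv", index=False)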

Emerald Isle Realty – Since Octoparse worked well for BlueWater, I figured I'd start with that one. Caution – I didn't spend a lot of time trying to figure out whether a tool would work; if it didn't work with limited instruction, I just moved on to the next tool. On this site the street address was not a unique element on the page, but was sometimes buried in the description, so I used Excel to split out the street address. Some descriptions lacked an address but did expose a mapped location on the website. While the site had a 'mapped' location for a property, the actual latitude/longitude pair used to place the marker on the map couldn't be found for scraping. So, for these properties, I manually opened the property URL alongside a second browser window open to Google Maps. By visually matching Google Maps to the property website map, I then used Google Maps to obtain the street address and lat/long coordinates. This process got quite tiresome, but fortunately there weren't many properties needing the missing address filled in.
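
The Excel work to split an address out of a description amounted to a loose pattern match; a Python stand-in for that step might look like this (the regular expression is a rough heuristic, not what Excel actually did):

import re

# loose heuristic: a house number, a street name, and a common street suffix
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+[A-Za-z0-9. ]+?\s(?:Dr|Drive|St|Street|Blvd|Boulevard|Ct|Court|Rd|Road|Ln|Lane|Ave|Avenue)\b")

def extract_address(description):
    """Return the first address-looking phrase in a property description, or None."""
    match = ADDRESS_RE.search(description or "")
    return match.group(0) if match else None

print(extract_address("Oceanfront cottage at 123 Ocean Dr with a private walkway..."))   # 123 Ocean Dr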

BlueWater Footnote – after scraping the Emerald Isle Realty data, I noticed I didn't get the image URL in my first BlueWater attempt. Import.io worked OK for the Emerald Isle properties and was the easiest to set up (it figures out what you are likely wanting and builds the scraping model for you, which you can then adjust to add or remove columns). When I used it the second time, I figured out that to 'train' the Import.io process, choosing similar URLs is absolutely necessary for it to scrape properly. Their auto process picked up a bunch of URLs very quickly and then presented a checkbox list of which groups of URLs were to be included in the scraping loop. There was a choice to pick up only the rental-property-related URLs, and when I selected only that subset, the training went exceptionally well, as did the results. Downside of Import.io – only 1,000 webpages can be scraped per login per month, so after running it once incorrectly against 600+ pages (I had two groups selected, so Import.io built a model based on one webpage that didn't work for almost all of the pages), that was all I could do that month using that login identity.

Final Cleanup

Whew! I got the data downloaded, and even though I was hoping it would be similarly structured, it wasn't consistent between sites. So the next step was to decide on a data table structure (data model) and load the scraped data into that model. This meant not including some data that was unique to one site when the other sites didn't have the same info (like customer or agency ratings and customer reviews). Some sites listed a few amenities in a scrapeable area, which was nice, except that they listed more amenities in their long description of the property. How amenities were identified meant the data had to be conformed – that is, 'ping-pong table' and 'ping pong table' would need to be adjusted to read one way or the other. Amenities by property are listed in their own tab. The idea is that some amenities may be considered a requirement (like pets allowed), while others are nice to have. Those amenities which are commonly considered a requirement have a separate flag counter field in the RentalProperties tab: pets, pool, grill, game room, elevator, pier/dock, community pool, hot tub, boats allowed.
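
A sketch of that conforming and flag-setting in pandas – the synonym list is just an example, and the file/column names are placeholders:

import pandas as pd

amenities = pd.read_csv("Amenities.csv")                 # columns: Name, Amenity
props = pd.read_csv("RentalProperties.csv")

# conform spelling variants so 'ping pong table' and 'ping-pong table' read one way
synonyms = {"ping pong table": "ping-pong table", "hottub": "hot tub", "bbq": "grill"}
amenities["Amenity"] = amenities["Amenity"].str.lower().str.strip().replace(synonyms)

# set a flag column on the property table for each 'requirement' amenity
flag_amenities = ["pets", "pool", "grill", "game room", "elevator",
                  "pier/dock", "community pool", "hot tub", "boats allowed"]
for flag in flag_amenities:
    has_it = amenities.loc[amenities["Amenity"] == flag, "Name"].unique()
    props[flag] = props["Name"].isin(has_it).astype(int)

props.to_csv("RentalProperties.csv", index=False)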

Prices – Some sites had prices for various rental options – weekly, two-week, or monthly rates, all different for different rental date ranges. Others had just weekly rates for different ranges. I was only interested in a particular week, so that week's rate is included as a column in the main RentalProperties tab. Even then, the scraping didn't pick up a price for some properties, and a manual verification of those properties revealed three conditions. First, the property was no longer listed by the rental agency (it was listed when the property scrape occurred, but no longer existed a few weeks later when the price verification was being done). Second, the property is listed, but prices aren't – instead 'Advance Hold' appears where the price was expected; I figure these are properties that were listed in the past but haven't signed the contract for this year yet. Third, the property is listed, but there is a note that no weekly rentals are available; this likely occurs when the property has been rented by the month, or because the contract hasn't been signed for the period. All of these non-priced properties are included, but no pricing is available.

After sending the data for geocoding, there were many properties with the same street address (condominiums, for instance). I wanted Tableau to map those properties as separate map points so that when hovering, a person could get individual property information in a tooltip rather than a * indicating multiple values. So, in the data, I modified the lat/long by 5/1,000,000 of a degree in order to nudge each mark enough to make it separately identifiable.
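
That nudge is easy to express in pandas – each additional property sharing an exact latitude/longitude pair gets offset by another 0.000005 of a degree (column names are placeholders):

import pandas as pd

props = pd.read_csv("RentalProperties.csv")

# number the properties that share an exact lat/long: 0 for the first, 1, 2, ... for the rest
dup_rank = props.groupby(["Latitude", "Longitude"]).cumcount()

# nudge each duplicate by 5/1,000,000 of a degree so Tableau draws separate marks
props["Latitude"] = props["Latitude"] + dup_rank * 0.000005
props.to_csv("RentalProperties.csv", index=False)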

Recap - Recommendations - Results shared

I started with what I thought would be easiest – the Webscraper.io extension for Google Chrome. While this was useful, it also revealed that some sites resist this type of scraper: it needs to build a list of URLs on the fly from the webscraping activity, and if that can't be accomplished, another tool or approach is needed. Parsehub and Octoparse are similar, but not the same. The interface on Parsehub took some getting used to, and quite frankly, I found myself backing out what I had clicked to try again. Once you figure out its very distinct clicking pattern for identifying page elements to scrape, the tool worked well. It too, however, wanted to build the list of URLs to scrape itself, so when you already have a long list of URLs, Octoparse was the preferred tool. Octoparse allows a long list of URLs to be copied and pasted in, then uses one of those URLs for you to set up the elements to scrape. I also tried Google Sheets, but it didn't seem to fit my particular need (again, I didn't spend much time there – only enough to figure out it didn't get what I wanted with what I knew about the website). You'll likely notice that Airbnb, VRBO, etc. aren't part of the results. It isn't that I forgot about them; rather, my limited knowledge of how these tools work and those sites' extensive use of AJAX foiled my initial attempts, and I ran out of patience/time. I even sent the request to another person who applied Python to the HomeToGo website, only to come away equally frustrated. I'll likely look into getting this data into the dataset eventually, but not for the first exposure of the results.

I shared the results with the Piedmont Triad Tableau User Group on February 6th, 2020, walking them through a PowerPoint presentation. The workbook has been loaded to Tableau Public for consumption.

Next Steps:

My son wants to know if renting early gets either a better price or a better selection.  I'll check again in a few months to compare the results (which ones are available and what the prices are at the time).  Also, I'd like to get the missing VRBO, Airbnb, etc. properties into the data.

SQL ETL Flip-Flop to deliver a no-data-outage experience 

Thursday, July 19, 2018 9:31:24 AM

I know that we were all taught to delete then replace as the standard practice for ETL. I personally don’t prefer that approach for these reasons:

  1. When an ETL run goes foul, the prior table's data has already been truncated or the table dropped. Reverting is possible if a backup exists, but it is a painful process most often avoided.
  2. The drop/truncate then select/insert-into approach typically takes many minutes to perform. During that time, queries will return incomplete information (either none at all, or just the rows that have been ETL'd so far when the query ran).
  3. During the ETL, while the end table is in flux, end users are without good data.

The approach I prefer involves swapping a table from one schema to another (I refer to it as a flip-flop) – this swap process takes under a second, so the benefits are:

  1. The end user can always rely on the table data – it is always a complete table from the ETL.
  2. IT can easily revert to the prior table if the ETL results are found to be wanting for whatever reason. The end user effect is that they can continue using the older data while the ETL result is reviewed/repaired.

To set up the schema swap process, I establish two additional schemas – new and old (but call them what you want). The new schema is where the table from the ETL is built and indexes are added. The old schema is where the current dbo table (or whatever schema it resides in) is moved, so a simple reversion can occur if needed.

Steps (entire flip-flop process takes under a second, no matter what the size of the table):

  1. Either create the table or select into the ETL target table in the 'new' schema. Create indexes against it as needed. These steps are basically the same as what is typically done in an ETL process, only done against a target table that isn't the currently active data.
  2. Prepare for what I refer to as the 'flip-flop' process.
    a. Get rid of the table if it exists in the 'old' schema
         if exists (select * from INFORMATION_SCHEMA.TABLES where TABLE_SCHEMA = 'old' and TABLE_NAME = 'ETL Target Table') Drop Table [old].[ETL Target Table];
    b. Place the current table into the 'old' schema
         Alter Schema old TRANSFER [dbo].[ETL Target Table];
    c. Move the 'new' ETL Target Table into the desired schema
         if exists (select * from INFORMATION_SCHEMA.TABLES where TABLE_SCHEMA = 'new' and TABLE_NAME = 'ETL Target Table') Alter Schema dbo TRANSFER [new].[ETL Target Table];

Obfuscate username and passwords in Powershell 

Wednesday, December 20, 2017 3:20:40 PM

There are a whole host of blogs on how to encrypt passwords using PowerShell.  For the best security, those approaches should be followed, not this one.  I wanted to keep my username and password out of the scripts themselves, yet also avoid the setup required for encryption to work (from what I gathered, that encryption is machine- and user-specific -- not setup tasks that I wanted to pursue).

The steps are similar -- create a file with the text encoded, then read that file and decode the text for use in the PowerShell script.

PowerShell to encode text to a file:

# this is a simple obfuscation method so the password doesn't appear as clear text in the PowerShell script
 
# change the three items below to satisfy your needs
$acc = "myaccount"
$pass = "mypassword"
$filelocation = "E:\folder_where_files_are_saved"
 
# write the Unicode byte values of each string to a text file (one value per line)
[System.Text.Encoding]::Unicode.GetBytes($acc) | Set-Content $filelocation'\am.txt' 
[System.Text.Encoding]::Unicode.GetBytes($pass) | Set-Content $filelocation'\amp.txt' 
 
 
Then, in the PowerShell script that uses the saved files, add these lines to retrieve the username and password:
$filelocation = "E:\folder_where_files_are_saved"
 
# Get username and password from files created by obfuscation method
$gu = Get-Content $filelocation'\am.txt'  -ReadCount 0
$username = [system.text.encoding]::Unicode.GetString($gu)
 
$gp = Get-Content $filelocation'\amp.txt'  -ReadCount 0
$password = [system.text.encoding]::Unicode.GetString($gp)