petitviolet blog

    Subscribe to a Web Page Using Google Apps Script

    2021-06-20

    GAS, JavaScript

    Some web pages offer RSS feeds, so we can get notified through Feedly, Slack, etc. when they are updated. However, others don't have such a feature. This post describes how to subscribe to web pages that don't offer RSS feeds or anything similar.

    HOW TO

    Use Google Apps Script (a.k.a. GAS) to check a web page periodically, and send a notification via email when there is an update.

    https://script.google.com

    This post uses https://blog.petitviolet.net as an example, even though it already provides an RSS feed at /rss.xml. As the image below shows, the blog displays a date for each post, so we're able to extract the latest published post's date.

    structure of blog.petitviolet.net

    "UpdatedAt" dates can be found within <small>...</small> HTML tags. Let's use this structure to get UpdatedAt. For this example, I'm going to use a regular expression to pick the date string out of the HTML. If you'd like to parse more complicated HTML and select elements, there is a library called Parser, which you can import with the script ID 1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw.

    Implementation

    In GAS, we can use UrlFetchApp#fetch and then HTTPResponse#getContentText to get the HTML.
    The blog shows dates in yyyy-MM-dd format inside <small> tags, so we can find them with the pattern /<small>(\d{4}-\d{2}-\d{2})<\/small>/. A code snippet for extracting the latest date from the blog is as follows:

    const URL = 'https://blog.petitviolet.net';
    
    // fetch the page and decode the body as UTF-8
    const html = UrlFetchApp.fetch(URL).getContentText('UTF-8');
    const pattern = /<small>(\d{4}-\d{2}-\d{2})<\/small>/;
    // the first match is the topmost date on the page
    const updatedAt = html.match(pattern)[1];
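
    Note that String.prototype.match returns null when nothing matches, so html.match(pattern)[1] throws if the page layout ever changes. A minimal sketch of a guarded version follows. The helper name extractLatestDate and the sample markup are made up for illustration, and it assumes the newest post appears first on the page.

```javascript
// Hypothetical helper: returns the first yyyy-MM-dd date wrapped in <small>,
// or null when the HTML contains no such element.
function extractLatestDate(html) {
  const pattern = /<small>(\d{4}-\d{2}-\d{2})<\/small>/;
  const matched = html.match(pattern);
  return matched === null ? null : matched[1];
}

// Sample markup invented for this example; the real page has more around it.
const sampleHtml =
  '<h2>Post B</h2><small>2021-06-20</small>' +
  '<h2>Post A</h2><small>2021-05-01</small>';

console.log(extractLatestDate(sampleHtml));             // 2021-06-20
console.log(extractLatestDate('<p>no dates here</p>')); // null
```

    With this shape, the caller can skip the run (or send an alert) when the result is null instead of crashing.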
    

    The next step is managing the state of this subscriber, that is, detecting whether the obtained date is a new one or not.
    It also needs to avoid notifying more than once for the same update.
    A GAS project can be bound to a Google Spreadsheet, so we can use the sheet as state storage.

    const CELL = 'A1';
    
    const sheet = SpreadsheetApp.getActiveSheet();
    // an empty cell is read back as an empty string
    const lastUpdatedAt = sheet.getRange(CELL).getValue();
    if (lastUpdatedAt == updatedAt) {
        return; // skip if this update has already been notified
    }
    // store the latest updatedAt in the sheet
    sheet.getRange(CELL).setValue(updatedAt);
    if (lastUpdatedAt === '') {
        return; // skip notifying on the first run
    }
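
    The decision itself can be separated from the Spreadsheet calls, which makes it easier to check by hand. The following is only a sketch: decide is a hypothetical helper, and it assumes that both values are yyyy-MM-dd strings and that an empty cell is read back as an empty string. If Sheets converts the stored text into a date, getValue() would return a Date instead and the comparison would need normalizing first.

```javascript
// Hypothetical helper: given the date stored in the sheet and the date just
// fetched from the page, decide whether to store it and whether to notify.
function decide(stored, latest) {
  if (stored === '' || stored === null) {
    // First run: remember the date, but do not notify.
    return { store: true, notify: false };
  }
  if (stored === latest) {
    // Already notified for this update.
    return { store: false, notify: false };
  }
  // A new date was observed.
  return { store: true, notify: true };
}

console.log(decide('', '2021-06-20'));
console.log(decide('2021-06-20', '2021-06-20'));
console.log(decide('2021-05-01', '2021-06-20'));
```

    Only the last case reports notify: true, so the same update is never announced twice.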
    

    The last step is to send a notification when an update is observed.
    How to notify depends on what you want, but in GAS, the easiest way is to send an email via MailApp.sendEmail.

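    // MAIL_TO is assumed to be defined elsewhere as the recipient address
    var subject = "Update from blog.petitviolet.net";
    var name = "GAS robot";
    // plain-text body for mail clients that cannot render HTML
    var body = `${URL} updated at ${updatedAt}`;
    var htmlBody = `<a href="${URL}">${URL}</a> updated at ${updatedAt}`;
    
    MailApp.sendEmail({
        to: MAIL_TO, 
        subject: subject,
        name: name, 
        htmlBody: htmlBody,
        body: body
    });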
    

    That's it! If you'd like to get notifications in Slack instead, the chat.postMessage API should work.

    https://api.slack.com/methods/chat.postMessage
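
    As a rough sketch of the Slack variant, the request can be built as a plain options object and passed to UrlFetchApp.fetch. The helper name buildSlackRequest, the token value, and the channel name are all placeholders for illustration. A real bot token with permission to post in the channel is assumed.

```javascript
// Hypothetical helper: builds the options object for UrlFetchApp.fetch.
// `token` and `channel` are placeholders you would supply yourself.
function buildSlackRequest(token, channel, text) {
  return {
    method: 'post',
    contentType: 'application/json; charset=utf-8',
    headers: { Authorization: 'Bearer ' + token },
    payload: JSON.stringify({ channel: channel, text: text }),
  };
}

// Inside GAS, the request would be sent like this (SLACK_TOKEN is assumed):
// UrlFetchApp.fetch('https://slack.com/api/chat.postMessage',
//                   buildSlackRequest(SLACK_TOKEN, '#general', 'blog updated'));

const request = buildSlackRequest('xoxb-placeholder', '#general', 'updated at 2021-06-20');
console.log(request.payload);
```

    Keeping the token in the script properties rather than in the source would be a sensible choice if the script is shared.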

    Then, you can set up a time-based trigger to call this code, so that you'll get notified when the page is updated.

    scheduler
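
    The trigger can also be created from code instead of the editor UI. This is a sketch under assumptions: checkUpdate stands for whatever function wraps the fetch, compare, and notify steps in your script, and installHourlyTrigger is a hypothetical name. It only runs inside GAS, where ScriptApp is available.

```javascript
// Hypothetical setup function: run it once from the GAS editor.
// 'checkUpdate' is an assumed name for the function that does the actual work.
function installHourlyTrigger() {
  return ScriptApp.newTrigger('checkUpdate')
    .timeBased()     // a clock trigger rather than a spreadsheet event
    .everyHours(1)   // fire roughly once per hour
    .create();
}
```

    Running it more than once creates duplicate triggers, so it is meant as a one-time setup step.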

    The whole code is available at https://gist.github.com/petitviolet/0316d0bf02d9e856c5e5b1151807574e

    Thoughts

    Google Apps Script is really useful in many cases since it offers the following:

    • A JavaScript runtime
    • A Spreadsheet that can serve as simple storage
    • A scheduler to call JavaScript functions periodically

    So, I'd say GAS can cover most personal use cases.