LLM CLI Notes and Twitter Headless Screenshot

To get going, from https://github.com/simonw/llm:

brew install llm
llm keys set openai
llm "count to three" # uses the default model; run `llm models default` to see which
llm chat -m 4o
llm models
llm install llm-claude-3
llm keys set claude
llm models
llm -m claude-3-5-sonnet-latest "count to three"
llm logs
llm --system 'respond in json' "How are you doing today?"

I liked how he demonstrates the need to strip tags in an HTML page you fetch from the web because it wastes tokens: https://simonwillison.net/2023/May/18/cli-tools-for-llms/
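
As a rough illustration of the savings, here is a minimal Python sketch that drops tags with the standard library's html.parser and compares character counts before and after. It is not how strip-tags itself works (that tool has its own implementation and options); the TextOnly class, the strip_html helper and the sample HTML are made up for this note.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_html(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up page fragment, just to have something to measure
sample = (
    '<html><head><style>p { color: red }</style></head>'
    '<body><div class="post"><p>Hello <b>world</b></p>'
    '<script>track()</script></div></body></html>'
)
text = strip_html(sample)
print(text)
print(len(sample), "characters of HTML ->", len(text), "characters of text")
```

Characters are only a stand-in for tokens here, but the direction is the same: markup, styles and scripts are paid for without adding anything the model needs.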


And then I got excited about his shot-scraper: https://shot-scraper.datasette.io/en/stable/

pip install shot-scraper
shot-scraper install
shot-scraper https://github.com/simonw/shot-scraper -h 900

And seeing how he addressed the auth problem: https://shot-scraper.datasette.io/en/stable/authentication.html

shot-scraper auth \
  https://x.com/i/flow/login \
  auth.json

… but unfortunately Twitter doesn’t like even an auth’d headless browser :-/

shot-scraper https://twitter.com/medialab/status/1854627927175901422 \
  -a auth.json -o authed.png

However, it’s nice to know it might work elsewhere.
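
Before reusing auth.json on another site, it can help to see what it captured. Assuming the file follows Playwright's storage state layout (a JSON object with "cookies" and "origins" lists), a short Python sketch can summarize it. The summarize_state helper and the sample data are hypothetical; check the layout against your own auth.json.

```python
import json
from collections import Counter


def summarize_state(state: dict) -> dict:
    """Count cookies per domain in a Playwright-style storage state dict."""
    domains = Counter(c.get("domain", "?") for c in state.get("cookies", []))
    return dict(domains)


# Made-up sample in the assumed layout; real use would be:
#   with open("auth.json", encoding="utf-8") as f:
#       state = json.load(f)
state = json.loads("""
{
  "cookies": [
    {"name": "session", "value": "abc", "domain": ".example.com", "path": "/"},
    {"name": "csrf", "value": "xyz", "domain": ".example.com", "path": "/"},
    {"name": "prefs", "value": "1", "domain": "cdn.example.com", "path": "/"}
  ],
  "origins": []
}
""")
print(summarize_state(state))
```

The cookies in that file are live credentials, so keep it out of version control.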


Update: a way to capture Twitter content, from reading https://jonathansoma.com/everything/scraping/scraping-twitter-playwright/

import asyncio
from playwright.async_api import async_playwright

async def main():
    playwright = await async_playwright().start()
    device = playwright.devices["Desktop Firefox"]
    browser = await playwright.firefox.launch()
    device['viewport'] = {
        'width': 1280,
        'height': 3000
    }
    context = await browser.new_context(**device)
    page = await context.new_page()

    # Visit the page
    await page.goto("https://twitter.com/medialab/status/1854627927175901422")
    await page.wait_for_selector("[aria-label=\"Reply\"]")

    # Hope everything loads (asyncio.sleep, because time.sleep would block the event loop)
    await asyncio.sleep(4)

    # Clean up the page, remove banners (?. skips an element that isn't on the page)
    await page.evaluate("""
    () => {
        document.querySelector('[data-testid="BottomBar"]')?.remove()
        document.querySelector('[aria-label="sheetDialog"]')?.parentNode.remove()
    }
    """)

    # Extract page content and save as HTML
    page_content = await page.content()  # Retrieve the HTML content of the page
    with open("page_content.html", "w", encoding="utf-8") as f:
        f.write(page_content)

    # Take the screenshot
    tweet = page.locator('[data-testid="tweet"]')
    await tweet.screenshot(path='screenshot.png')

    # Close the browser and playwright
    await browser.close()
    await playwright.stop()

# Run the async main function
asyncio.run(main())

Then you use strip-tags https://github.com/simonw/strip-tags

pipx install strip-tags

And you can do something like

cat page_content.html | strip-tags 

Which isn’t perfect, because the output still carries all the other text on the page, so use this approach instead and save just the tweet text

import asyncio
from playwright.async_api import async_playwright

async def main():
    playwright = await async_playwright().start()
    device = playwright.devices["Desktop Firefox"]
    browser = await playwright.firefox.launch()
    device['viewport'] = {
        'width': 1280,
        'height': 3000
    }
    context = await browser.new_context(**device)
    page = await context.new_page()

    # Visit the page
    await page.goto("https://twitter.com/medialab/status/1854627927175901422")
    await page.wait_for_selector("[aria-label=\"Reply\"]")

    # Wait for everything to load (asyncio.sleep, because time.sleep would block the event loop)
    await asyncio.sleep(4)

    # Clean up the page, remove banners (?. skips an element that isn't on the page)
    await page.evaluate("""
    () => {
        document.querySelector('[data-testid="BottomBar"]')?.remove()
        document.querySelector('[aria-label="sheetDialog"]')?.parentNode.remove()
    }
    """)

    # Extract the HTML content inside the tweet text div and save it
    tweet_text_html = await page.locator('[data-testid="tweetText"]').inner_html()
    with open("tweet_content_verbatim.html", "w", encoding="utf-8") as f:
        f.write(tweet_text_html)

    # Optionally, save the full page content if needed
    page_content = await page.content()
    with open("page_content.html", "w", encoding="utf-8") as f:
        f.write(page_content)

    # Take a screenshot of the tweet
    tweet = page.locator('[data-testid="tweet"]')
    await tweet.screenshot(path='screenshot.png')

    # Close the browser and playwright
    await browser.close()
    await playwright.stop()

# Run the async main function
asyncio.run(main())

And then

strip-tags -i tweet_content_verbatim.html 

CAUTION: Don’t do this too much or you’ll be identified as botty.


To count tokens the way OpenAI models do, install ttok

pipx install ttok

And you can get a sense of how many tokens you’ve reduced the tweet to

% strip-tags -i tweet_content_verbatim.html | ttok 
93

Versus stripping the tags from the whole page’s HTML

% strip-tags -i page_content.html | ttok
415
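
Put as a ratio, those two counts work out as below. The savings helper is just illustrative arithmetic on the numbers above, not part of ttok.

```python
def savings(full_tokens: int, trimmed_tokens: int) -> float:
    """Fraction of tokens saved by sending the trimmed text instead."""
    return 1 - trimmed_tokens / full_tokens


full, trimmed = 415, 93  # the two counts ttok reported above
print(f"{savings(full, trimmed):.1%} fewer tokens")  # prints "77.6% fewer tokens"
```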

You can also extract info as an image

cat screenshot.png | llm "describe this image" -a - 

And another cool thing is to bring the web in and apply a system message

curl -s 'https://maeda.pm/2019/12/08/whether-you-have-a-new-job-or-you-have-a-new-boss-its-all-about-dealing-with-change/' | \
  llm -s 'Suggest topics for this post as a JSON array'
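
The reply comes back as plain text, so a script consuming it still has to parse the array. Models sometimes wrap JSON in a Markdown code fence, so a small Python sketch that tolerates that might look like this. The parse_topics helper and the sample reply are hypothetical, not real output from the command above.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid writing a literal fence in these notes


def parse_topics(reply: str) -> list:
    """Parse a JSON array from model output, tolerating a Markdown code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with optional language tag) and the closing fence
        text = re.sub(r"^`{3}[a-zA-Z]*\s*\n", "", text)
        text = re.sub(r"\n`{3}\s*$", "", text)
    topics = json.loads(text)
    if not isinstance(topics, list):
        raise ValueError("expected a JSON array")
    return topics


# Made-up reply, fenced the way a model might return it
reply = FENCE + 'json\n["change management", "new jobs", "leadership"]\n' + FENCE
print(parse_topics(reply))
```

If the model returns something that is not valid JSON at all, json.loads raises an error here rather than guessing, which seems like the safer default for a pipeline.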

For Azure OpenAI https://github.com/fabge/llm-azure/

For Ollama + Vision https://simonwillison.net/2024/Nov/13/ollama-llama-vision/