To get going with llm, from https://github.com/simonw/llm:
brew install llm
llm keys set openai
llm "count to three" # uses default of 3.5-turbo
llm chat -m 4o
llm models
llm install llm-claude-3
llm keys set claude
llm models
llm -m claude-3-5-sonnet-latest "count to three"
llm logs
llm --system 'respond in json' "How are you doing today?"
I liked how he demonstrates the need to strip tags from an HTML page you fetch from the web, because the markup wastes tokens: https://simonwillison.net/2023/May/18/cli-tools-for-llms/
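The idea is easy to see with a stdlib-only stand-in (this is just an illustration, not Simon's strip-tags tool, which does more): parse the HTML and keep only the text nodes, since tags and attributes would all be billed as tokens.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only text nodes, discarding every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def strip_html(html: str) -> str:
    """Return just the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


html = '<div class="tweet" data-testid="tweetText"><p>Count to <b>three</b></p></div>'
print(strip_html(html))  # prints "Count to three"
```

All the `div`/`class`/`data-testid` noise disappears, and only the words you actually want the model to read are left to spend tokens on.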
And then I got excited about his shot-scraper: https://shot-scraper.datasette.io/en/stable/
pip install shot-scraper
shot-scraper install
shot-scraper https://github.com/simonw/shot-scraper -h 900
And seeing how he addressed the auth problem: https://shot-scraper.datasette.io/en/stable/authentication.html
shot-scraper auth \
https://x.com/i/flow/login \
auth.json
… but unfortunately Twitter doesn’t like even an auth’d headless browser :-/
shot-scraper https://twitter.com/medialab/status/1854627927175901422 \
-a auth.json -o authed.png
However, it’s nice to know it might work elsewhere.
Update for capturing Twitter content from reading: https://jonathansoma.com/everything/scraping/scraping-twitter-playwright/
import asyncio
from playwright.async_api import async_playwright

async def main():
    playwright = await async_playwright().start()
    device = playwright.devices["Desktop Firefox"]
    browser = await playwright.firefox.launch()
    device["viewport"] = {"width": 1280, "height": 3000}
    context = await browser.new_context(**device)
    page = await context.new_page()
    # Visit the page
    await page.goto("https://twitter.com/medialab/status/1854627927175901422")
    await page.wait_for_selector('[aria-label="Reply"]')
    # Hope everything loads (asyncio.sleep, not time.sleep, so the event loop keeps running)
    await asyncio.sleep(4)
    # Clean up the page, remove banners
    await page.evaluate("""
    () => {
        document.querySelector('[data-testid="BottomBar"]').remove()
        try {
            document.querySelector('[aria-label="sheetDialog"]').parentNode.remove()
        } catch(err) {
        }
    }
    """)
    # Extract page content and save as HTML
    page_content = await page.content()  # Retrieve the HTML content of the page
    with open("page_content.html", "w", encoding="utf-8") as f:
        f.write(page_content)
    # Take the screenshot
    tweet = page.locator('[data-testid="tweet"]')
    await tweet.screenshot(path="screenshot.png")
    # Close the browser and playwright
    await browser.close()
    await playwright.stop()

# Run the async main function
asyncio.run(main())
Then you use strip-tags: https://github.com/simonw/strip-tags
pipx install strip-tags
And can do something like
cat page_content.html | strip-tags
Which isn’t perfect, so use this approach instead, which grabs just the tweet text:
import asyncio
from playwright.async_api import async_playwright

async def main():
    playwright = await async_playwright().start()
    device = playwright.devices["Desktop Firefox"]
    browser = await playwright.firefox.launch()
    device["viewport"] = {"width": 1280, "height": 3000}
    context = await browser.new_context(**device)
    page = await context.new_page()
    # Visit the page
    await page.goto("https://twitter.com/medialab/status/1854627927175901422")
    await page.wait_for_selector('[aria-label="Reply"]')
    # Wait for everything to load (asyncio.sleep, not time.sleep, so the event loop keeps running)
    await asyncio.sleep(4)
    # Clean up the page, remove banners
    await page.evaluate("""
    () => {
        document.querySelector('[data-testid="BottomBar"]').remove()
        try {
            document.querySelector('[aria-label="sheetDialog"]').parentNode.remove()
        } catch(err) {
        }
    }
    """)
    # Extract the HTML content inside the tweet text div and save it
    tweet_text_html = await page.locator('[data-testid="tweetText"]').inner_html()
    with open("tweet_content_verbatim.html", "w", encoding="utf-8") as f:
        f.write(tweet_text_html)
    # Optionally, save the full page content if needed
    page_content = await page.content()
    with open("page_content.html", "w", encoding="utf-8") as f:
        f.write(page_content)
    # Take a screenshot of the tweet
    tweet = page.locator('[data-testid="tweet"]')
    await tweet.screenshot(path="screenshot.png")
    # Close the browser and playwright
    await browser.close()
    await playwright.stop()

# Run the async main function
asyncio.run(main())
And then
strip-tags -i tweet_content_verbatim.html
CAUTION: Don’t do this too much or you’ll be identified as botty.
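When you do need several captures, one common courtesy (my own sketch, not something from the post) is to space requests with randomized delays so the timing doesn’t look mechanical:

```python
import random
import time


def polite_delay(base_seconds: float = 8.0, jitter_seconds: float = 4.0) -> float:
    """Sleep for base_seconds plus a random jitter, so successive requests
    aren't evenly spaced. Returns the delay actually used. The default
    values are arbitrary, not tuned against any particular site."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay
```

Call it between page fetches; it won’t make a scraper undetectable, but uniform machine-gun timing is one of the easiest tells to avoid.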
To count tokens the way OpenAI does, install
pipx install ttok
And you can get a sense of how many tokens you’ve reduced the tweet to
% strip-tags -i tweet_content_verbatim.html | ttok
93
Versus if you’re extracting all the HTML for use
% strip-tags -i page_content.html| ttok
415
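ttok does this with OpenAI’s real tokenizer (tiktoken). If you just want a ballpark without installing anything, the common “one token ≈ four characters of English” heuristic (an approximation, not real BPE) is enough for rough budgeting:

```python
def rough_token_estimate(text: str) -> int:
    """Crude heuristic: OpenAI-style BPE averages roughly 4 characters
    per token for English text. An approximation, not a real tokenizer."""
    return max(1, round(len(text) / 4))


print(rough_token_estimate("a" * 400))  # prints 100
```

Useful for a quick sanity check of whether a stripped page will fit a context window; use ttok when the count actually matters.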
You can also extract info from an image
cat screenshot.png | llm "describe this image" -a -
And another cool thing is to bring the web in and apply a system message
curl -s 'https://maeda.pm/2019/12/08/whether-you-have-a-new-job-or-you-have-a-new-boss-its-all-about-dealing-with-change/' | \
llm -s 'Suggest topics for this post as a JSON array'
For Azure OpenAI https://github.com/fabge/llm-azure/
For Ollama + Vision https://simonwillison.net/2024/Nov/13/ollama-llama-vision/