Talkyard JSON dumps: Posts and post numbers

By KajMagnus @KajMagnus 2019-06-11 05:21:14.084Z

Talkyard lets one export one's site in JSON (and later, a Zip (?) archive that also includes uploaded files). (Not completely finished, is being tested out currently.)

The title post is always post nr 0 — but I'd like to change to -1 because 0 is a little bit bug prone. So 0 or -1 then. The orig post is always nr 1 (the 1st post on the page), and then the replies are post nr 2, 3, 4, 5. And parentNr tells which post replies to which post.
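To illustrate the numbering, here is a minimal sketch of rebuilding the reply tree of one page from `nr` and `parentNr`. The sample data is made up, and it assumes that the title and the orig post have a null (or missing) `parentNr`, which the thread does not confirm.

```python
from collections import defaultdict

# Hypothetical posts of a single page (post nrs are per page).
posts = [
    {"nr": 0, "parentNr": None},  # title post (might become -1 later)
    {"nr": 1, "parentNr": None},  # orig post
    {"nr": 2, "parentNr": 1},     # reply to the orig post
    {"nr": 3, "parentNr": 2},     # reply to post nr 2
    {"nr": 4, "parentNr": 1},     # another reply to the orig post
]


def is_title(post):
    """Accept both the current title nr (0) and the possible future one (-1)."""
    return post["nr"] in (0, -1)


def build_reply_tree(page_posts):
    """Map each post nr to the sorted nrs of the posts that reply to it."""
    children = defaultdict(list)
    for post in page_posts:
        parent_nr = post.get("parentNr")
        if parent_nr is not None:
            children[parent_nr].append(post["nr"])
    return {nr: sorted(kids) for nr, kids in children.items()}


tree = build_reply_tree(posts)
print(tree)
```

Checking for the title with `is_title` rather than comparing against 0 directly keeps the code working if the title nr changes to -1.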

This approach, to let the title, orig post, replies (and also custom CSS and Javascript in the admin area actually) all be "posts", makes edit history work for all those things, I mean, old revisions are remembered and one can run diffs and see what changes were made. There's a currRevNr field and an approvedRevNr field, and the latter tells which revision of the post has been approved by someone. Only the approved version of a post should be shown to others than staff. (Usually posts are auto approved by the System user, unless the post author is new or has said too weird things in the past.)
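A small sketch of how an importer might use these two fields. It assumes that a post with no approved revision has a null or missing `approvedRevNr`. That is a guess about the dump format, and the sample posts are made up.

```python
# Hypothetical posts; only the two revision fields matter here.
posts = [
    {"nr": 1, "currRevNr": 3, "approvedRevNr": 3},     # latest revision approved
    {"nr": 2, "currRevNr": 2, "approvedRevNr": 1},     # newer edit awaits approval
    {"nr": 3, "currRevNr": 1, "approvedRevNr": None},  # nothing approved yet (assumed null)
]


def visible_to_non_staff(post):
    """A post can be shown to others than staff only if some revision is approved."""
    return post.get("approvedRevNr") is not None


def has_unapproved_changes(post):
    """True if the current revision is not the approved one."""
    return post.get("currRevNr") != post.get("approvedRevNr")


public_nrs = [p["nr"] for p in posts if visible_to_non_staff(p)]
pending_nrs = [p["nr"] for p in posts if has_unapproved_changes(p)]
print(public_nrs, pending_nrs)
```

For post nr 2, what non-staff should see is the approved revision, not the latest one.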

@chrscheuer The text to index, when inserting Talkyard contents into an external search engine, is in the field approvedHtmlSanitized, or approvedSource if you want the CommonMark source. Maybe instead of indexing the exported JSON, you could send a search query to Talkyard? When people search via Soundflow, you do two searches, one in the Soundflow contents, and one via Talkyard for forum contents? Then the search would also find newly created forum topics right away (instead of maybe only after a day, if you do an export each day, say).
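As a rough illustration of indexing the exported text, here is a tiny in-memory inverted index built from `approvedSource`. The sample posts, the tokenizer and the AND-style search are all assumptions for the sake of the example, not how Talkyard or Soundflow actually do it.

```python
import re
from collections import defaultdict

# Hypothetical posts; a post with no approved text is assumed to have null here.
posts = [
    {"id": 101, "approvedSource": "How to install a **package**"},
    {"id": 102, "approvedSource": "Package install fails on macOS"},
    {"id": 103, "approvedSource": None},
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(all_posts, field="approvedSource"):
    """Map each word to the set of ids of the posts that contain it."""
    index = defaultdict(set)
    for post in all_posts:
        text = post.get(field)
        if not text:
            continue  # skip posts without approved text
        for word in tokenize(text):
            index[word].add(post["id"])
    return index


def search(index, query):
    """Return the ids of the posts that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


index = build_index(posts)
print(search(index, "install package"))
```

Passing `field="approvedHtmlSanitized"` would index the HTML instead, but then the tags would need to be stripped first.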

  • 5 replies
  1. Christian Scheuer @chrscheuer
      2019-06-11 05:56:14.247Z

      Hi @KajMagnus! Thank you for the explanation :)

      There are two reasons why we prefer using the json export over a search api from Talkyard.

      1. We are not only using this for realtime search but also for static generation of content. For example we'd like the "How to" category to be featured with links in our documentation menu on our site.

      2. Handling the search from multiple places in just one in-memory reverse index improves performance (slightly) since we skip another server roundtrip - but more importantly it means we can do custom indexing, ordering, filtering, grouping etc in a very simple way since we have full control over both indexing and searching.
        This simplicity means we actually already have this up & running locally :)

      1. Christian Scheuer @chrscheuer
          2019-06-11 05:58:13.330Z

          I expect we'll want to do even more extractions in the future, using the forum as a data source for other aggregations such as likes/popularity for individual packages and stuff like that.
          So even though the search could be fixed in an api, we need this aggregated data to live on our server or we would have too little control over development time, feature addition, response times, etc. etc.

          1. Christian Scheuer @chrscheuer
              2019-06-11 14:28:24.014Z

              @KajMagnus did you have any concerns about our approach that led you to suggest invoking a TY api endpoint for search instead? Or was it more about us using a non-stable json export format?

              FWIW, we're ok with the json export format changing slightly if you need to do that for it to make sense. It'd be super if you can give a heads up if the format changes of course, but since our solution would still work, just not do updates, it's not a big deal if we got a little downtime here and there while the format settles.
              To combat the issue of immediacy we might later end up either doing a webhook based approach to get immediate updating of our index (if TY supports that at the time) or also querying a TY search api to "fill up" our existing results with newer posts from the current day, if we deem it necessary. But for now, it's acceptable for us that our search results' completeness will be delayed by up to a day.

              The fields we're using right now are:

              posts: nr, parentNr, id, pageId, approvedSource, postType

              pagePaths: pageId, value
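A minimal sketch of how the fields above could be combined, joining each post to its page's path via `pageId`. It assumes that the dump has top-level `posts` and `pagePaths` lists, that `value` holds the page's URL path, and that each page has a single path. None of this is confirmed in the thread, and the sample data is made up.

```python
# Hypothetical dump with only the fields listed above.
dump = {
    "posts": [
        {"id": 101, "nr": 1, "parentNr": None, "pageId": "1",
         "approvedSource": "How do I install a package?", "postType": 1},
        {"id": 102, "nr": 2, "parentNr": 1, "pageId": "1",
         "approvedSource": "Open the package manager.", "postType": 1},
    ],
    "pagePaths": [
        {"pageId": "1", "value": "/-1/how-to-install-a-package"},
    ],
}


def to_index_docs(site_dump):
    """Build one search document per post, with the path of the page it is on."""
    # If a page had several paths, this would keep only the last one listed.
    path_by_page_id = {p["pageId"]: p["value"] for p in site_dump["pagePaths"]}
    return [
        {
            "postId": post["id"],
            "postNr": post["nr"],
            "pagePath": path_by_page_id.get(post["pageId"]),
            "text": post["approvedSource"],
        }
        for post in site_dump["posts"]
    ]


docs = to_index_docs(dump)
print(docs[1]["pagePath"], docs[1]["postNr"])
```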

              1. Thanks for the info on why you're using the json export, interesting to read. I think this'll work fine.

                The minor things that came to my mind were: 1) Non-stable JSON format, yes, however the fields you're using are unlikely to change. (The only thing I can think of, would be renaming nr to postNr and id to postId.) 2) That your search database lags a bit behind Talkyard, however now that I've read about how you're using the search results, it seems to me this won't matter. 3) If the site in the future grows and becomes "really large", maybe doing daily full JSON exports could be a little bit much bandwidth. However "long before that" I hope there'll be ways to do incremental syncs, or webhooks like you mentioned.

                1. Christian Scheuer @chrscheuer
                    2019-06-11 19:52:18.113Z

                    Thanks! This all sounds great. Yea I'm definitely going after something that's quick to set up now but that won't cause too much trouble in the long run. We're preparing everything in containers so the components can be switched out later, so for example we'll likely end up with Elasticsearch driving some of it eventually (like Talkyard does), but for now our internal C# full text algorithms are good enough for a quick first iteration.
                    And - like you say, incremental json exports and/or webhooks would be really cool, but are not needed just yet.
                    It's only about a 5-7 MB download right now so purely traffic wise we aren't in trouble yet.