Bulk import from Disqus

By Jason @detly 2018-04-22 11:33:46.705Z

I'm migrating over from Disqus to Talkyard, and I have a lot of old comments in Disqus. Fortunately Disqus allows you to export comments, so I'm wondering what the best way to import them into Talkyard would be.

Just to be clear on what I'm asking, I don't need a "does-everything-for-me" wizard that understands Disqus' export format. I'm happy to write some Python and parse the XML. It's the "getting it into Talkyard" step I'm not sure about. I'd be submitting a massive number of comments under other people's email addresses, so I'm thinking I might need to turn off the spam/flood/auth filters, unless there's a method for an admin to post on behalf of someone else?

But more generally, how would I programmatically populate these comments? Are there docs for this, or a particular source file I should look at? Disqus gives me:

  • Display name
  • Email
  • Date/time
  • IP address
  • Comment
  • Threading information
  • The post it was on (via another XML section)

Where should I start?

  • 4 replies
  1. KajMagnus @KajMagnus 2018-04-23 13:58:42.617Z

    would I programmatically populate these comments? Are there docs for this, or a particular source file I should look at?

    There is an HTTP endpoint to which one can POST a JSON file, with users, emails, topics, comments etcetera, in a Talkyard-specific JSON structure. But right now it's for end-to-end tests only (it says 401 Forbidden for anything that isn't an end-to-end test).

    I could look into enabling it for "real" usage, and see what more things it maybe needs to do / support, to be able to import Disqus comments.

    And then, if you write a Python script that converts from Disqus XML to Talkyard's JSON format, you could send the JSON file to the Talkyard server (when you're logged in as admin), and all comments would get imported.

    (Here's the source: https://github.com/debiki/talkyard/blob/master/app/controllers/ImportExportController.scala )

    Here's how the JSON looks. (This JSON creates an end-to-end test site. It's an excerpt: I deleted things that are off-topic for Disqus comments.)

    (This is just to give you an idea of roughly how it looks; you'll probably need more details to be able to write the Python script. Also, there are some fields below that you don't need to send to the server, since it could fill them in itself.)

    {
      "members": [
        {
          "id": 101,
          "username": "owen_owner",
          "fullName": "Owen Owner",
          "createdAtMs": 1449198824000,
          "emailAddress": "e2e-test--owen-owner@example.com",
          "emailVerifiedAtMs": 1449198824000,
          "passwordHash": "cleartext:publicOwen123",
          "password": "publicOwen123",
          "isOwner": true,
          "isAdmin": true,
          "trustLevel": 2
        },
        {
          "id": 102,
          "username": "mod_mons",
          "fullName": "Mod Mons",
          "createdAtMs": 1449198824000,
          "emailAddress": "e2e-test--mod-mons@example.com",
          "emailVerifiedAtMs": 1449198824000,
          "passwordHash": "cleartext:publicMons123",
          "password": "publicMons123",
          "isModerator": true,
          "trustLevel": 2
        }
        ...
      ],
      "identities": [],
      "guests": [
        {
          "id": -10,
          "fullName": "Guest Gunnar",
          "createdAtMs": 1449198824000,
          "emailAddress": "e2e-test--guest-gunnar@example.com",
          "isGuest": true
        }
        ...
      ],
      "pages": [
        {
          "id": "byMariaCategoryA",
          "role": 12,
          "categoryId": 2,
          "authorId": 106,
          "createdAtMs": 1449198824000,
          "updatedAtMs": 1449198824000,
          "version": 1
        },
        {
          "id": "byMariaCategoryA_2",
          "role": 12,
          "categoryId": 2,
          "authorId": 106,
          "createdAtMs": 1449198824000,
          "updatedAtMs": 1449198824000,
          "version": 1
        }
        ...
      ],
      "pagePaths": [
        {
          "folder": "/",
          "pageId": "byMariaCategoryA",
          "showId": false,
          "slug": "by-maria-category-a"
        },
        {
          "folder": "/",
          "pageId": "byMariaCategoryA_2",
          "showId": false,
          "slug": "by-maria-category-a-2"
        }
        ...
      ],
      "posts": [
        {
          "id": 114,
          "pageId": "byMariaCategoryA",
          "nr": 1,
          "createdAtMs": 1449198824000,
          "createdById": 106,
          "currRevStartedAtMs": 1449198824000,
          "currRevById": 106,
          "numDistinctEditors": 1,
          "approvedSource": "By Maria in CategoryA, text text text.",
          "approvedHtmlSanitized": "<p>By Maria in CategoryA, text text text.</p>",
          "approvedAtMs": 1449198824000,
          "approvedById": 1,
          "approvedRevNr": 1,
          "currRevNr": 1
        },
        {
          "id": 115,
          "pageId": "byMariaCategoryA_2",
          "nr": 0,
          "createdAtMs": 1449198824000,
          "createdById": 106,
          "currRevStartedAtMs": 1449198824000,
          "currRevById": 106,
          "numDistinctEditors": 1,
          "approvedSource": "By Maria in CategoryA nr 2 title",
          "approvedHtmlSanitized": "By Maria in CategoryA nr 2 title",
          "approvedAtMs": 1449198824000,
          "approvedById": 1,
          "approvedRevNr": 1,
          "currRevNr": 1
        }
        ...
      ]
    }
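
    To make the shape above concrete, here is a minimal Python sketch that builds the top-level structure and one guest entry with the same field names as the excerpt. The helper names (`to_ms`, `make_guest`, `empty_site`) are hypothetical, and the input timestamp format is an assumption about what an export might contain; the real importer may need more or fewer fields.

```python
import json
from datetime import datetime, timezone


def to_ms(iso_utc: str) -> int:
    """Convert a UTC timestamp like '2015-12-04T03:13:44Z' to epoch milliseconds.

    The input format is an assumption; adjust it to whatever the export contains.
    """
    dt = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000


def make_guest(guest_id: int, name: str, email: str, created_at: str) -> dict:
    """Build a guest object with the same fields as the guest in the excerpt."""
    return {
        "id": guest_id,
        "fullName": name,
        "createdAtMs": to_ms(created_at),
        "emailAddress": email,
        "isGuest": True,
    }


def empty_site() -> dict:
    """Top-level structure with the same keys as the excerpt, all lists empty."""
    keys = ("members", "identities", "guests", "pages", "pagePaths", "posts")
    return {key: [] for key in keys}


site = empty_site()
site["guests"].append(
    make_guest(-10, "Guest Gunnar", "e2e-test--guest-gunnar@example.com",
               "2015-12-04T03:13:44Z")
)
print(json.dumps(site, indent=2))
```

    The timestamp in the usage lines corresponds to the `createdAtMs` value 1449198824000 used throughout the excerpt.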
    
    1. Jason @detly 2018-04-28 10:42:04.690Z

      Thanks for this! I'll work on a script to create the JSON, and maybe by the time I've finished either you'll have an endpoint for it to be posted to or I'll have learnt Scala.

      A few questions:

      • Just overall, which source file should I dig into to understand the structure of this?
      • I notice that the guest ID is -10. Are all guest IDs negative?
      • How does threading work? Can I link a post to a parent post?
      • What's nr in the post data?
      • Can I skip the approvedSource since I already have my sanitised HTML via Disqus?
      1. KajMagnus @KajMagnus 2018-04-30 10:30:19.173Z

        Ok :- )

        which source file should I dig into

        The end-to-end test files I would suggest. They create the JSON structure a Disqus importer also would need to create. Look here:

        • A TypeScript definition of the JSON structure, interface SiteData in tests/e2e/test-types.ts

        • A function that constructs a discussion topic and adds to that JSON structure: addPage, here in tests/e2e/utils/site-builder.ts.
          The field role: PageRole should be set to PageRole.EmbeddedComments = 5 (an enum) for embedded comments topics.
          (here's that enum: client/app/model.ts )

        • How to create user JSON objects: functions like memberMaria and guestGunnar, in tests/e2e/utils/make.ts

        • Adding users to the JSON obj, in site-builder.ts
          e.g. site.members.push(forum.members.mallory);

          You could import the Disqus users either 1) into guest accounts (they don't need any password or username), or 2) into "real" accounts, i.e. with password and username. In the second case I suppose you'd generate random passwords, and if someone who has commented on your blog previously wants to continue using the same account, s/he would click "Forgot password" and get a password reset email.

        Are all guest IDs negative?

        Yes, <= -10 are for guests, and >= 100 are for members with real accounts. There are some magic ids too, from -9 up to +9, like +1 for the System user. And (in case you're curious) the default built-in groups (Everyone, New Members, ... Regular Members, Core Members) have ids 10, 11, 12, ....
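
        As a rough illustration of these id ranges, the Python sketch below classifies an id and hands out guest ids counting down from -10, one per distinct email address. The function names are hypothetical, the thread does not say where the group range ends, and ids allocated this way could still collide with guests that already exist in the database.

```python
def classify_id(user_id: int) -> str:
    """Classify a user id according to the ranges described in the thread."""
    if user_id <= -10:
        return "guest"
    if -9 <= user_id <= 9:
        return "reserved"  # "magic" ids, e.g. +1 is the System user
    if user_id >= 100:
        return "member"
    # 10..99: built-in groups start at 10; the upper end is not stated in the thread.
    return "group-or-unspecified"


def allocate_guest_ids(emails, first_id=-10):
    """Give each distinct email (case-insensitive) its own guest id: -10, -11, ..."""
    ids = {}
    next_id = first_id
    for email in emails:
        key = email.strip().lower()
        if key not in ids:
            ids[key] = next_id
            next_id -= 1
    return ids


guest_ids = allocate_guest_ids(["a@example.com", "b@example.com", "A@example.com"])
print(guest_ids)
```

        Treating addresses case-insensitively is a design choice made here, so that one commenter does not end up with two guest accounts.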

        How does threading work? Can I link a post to a parent post?
        What's nr in the post data?

        One links to the parent post, via the field parentNr. Each post has a field nr which is the order in which that post was added to the discussion.

        The page title has nr = 0, page body (a.k.a. the Original Post, for forum topics) has nr 1. The first comment has nr 2, and parentNr = 1. The 2nd comment has nr = 3, and parentNr is 1 or 2, depending on if it replies to the blog post = nr 1, or to the first comment = nr 2. And so on.
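
        The numbering rule above can be sketched in Python as follows. The input field names (`disqus_id`, `parent_disqus_id`, `created_at_ms`) are hypothetical stand-ins for whatever a Disqus parser produces, and attaching a reply to the page body when its parent cannot be found is an assumption, not something the thread specifies.

```python
def assign_nrs(comments):
    """Assign nr and parentNr to the comments of ONE discussion page.

    nr 0 is the title and nr 1 the page body, so comments start at nr 2,
    numbered in the order they were created.
    Returns a dict: disqus_id -> {"nr": ..., "parentNr": ...}.
    """
    ordered = sorted(comments, key=lambda c: c["created_at_ms"])
    numbered = {}
    next_nr = 2
    for comment in ordered:
        parent_id = comment.get("parent_disqus_id")
        if parent_id is not None and parent_id in numbered:
            parent_nr = numbered[parent_id]["nr"]
        else:
            # Top-level comment, or parent missing from the export:
            # reply to the page body (nr 1).
            parent_nr = 1
        numbered[comment["disqus_id"]] = {"nr": next_nr, "parentNr": parent_nr}
        next_nr += 1
    return numbered


thread = [
    {"disqus_id": "c", "parent_disqus_id": None, "created_at_ms": 3000},
    {"disqus_id": "a", "parent_disqus_id": None, "created_at_ms": 1000},
    {"disqus_id": "b", "parent_disqus_id": "a", "created_at_ms": 2000},
]
print(assign_nrs(thread))
```

        Here comment "a" gets nr 2 with parentNr 1, "b" replies to "a" and gets nr 3 with parentNr 2, and "c" gets nr 4 with parentNr 1.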

        Embedded discussion pages have auto-generated titles like "Comments for <the blog post url>".

        There's also an id field, which uniquely identifies a comment in the database. nr is unique within a certain discussion only. If an admin moves a comment from one discussion to another, it'll get a new nr, but keep the same id.

        Note to myself: I'll probably need to make the importer work, without any id fields. It's not really possible for you to know which ids to use, since there are some ids in the database already (and those should be avoided).

        Can I skip the approvedSource since I already have my sanitised HTML via Disqus?

        It's used for editing: If someone decides to edit a post (e.g. you, since admins can edit others' posts), the editor will display the source for that comment (which is the approvedSource field in the JSON to import).

        You can set approvedSource to the HTML exported from Disqus — that is, set both approvedSource and approvedHtmlSanitized to the post's HTML. Then, if someone wants to edit a comment imported from Disqus, s/he'll see & can edit the HTML from Disqus.
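
        A minimal Python sketch of a post object built this way, with both fields set to the Disqus HTML. The field names follow the JSON excerpt earlier in the thread plus parentNr; `make_post` itself is a hypothetical helper, and which of these fields the importer will end up requiring (the id field in particular) is still open.

```python
def make_post(post_id, page_id, nr, parent_nr, author_id, created_at_ms, html):
    """Build a post object for an imported comment that was never edited."""
    return {
        "id": post_id,
        "pageId": page_id,
        "nr": nr,
        "parentNr": parent_nr,
        "createdAtMs": created_at_ms,
        "createdById": author_id,
        "currRevStartedAtMs": created_at_ms,
        "currRevById": author_id,
        "numDistinctEditors": 1,
        # Same HTML in both fields, so the editor shows the Disqus HTML as source.
        "approvedSource": html,
        "approvedHtmlSanitized": html,
        "approvedAtMs": created_at_ms,
        "approvedById": 1,  # the System user, as in the excerpt
        "approvedRevNr": 1,
        "currRevNr": 1,
    }


post = make_post(
    post_id=116,
    page_id="byMariaCategoryA",
    nr=2,
    parent_nr=1,
    author_id=-10,
    created_at_ms=1449198824000,
    html="<p>First comment, replying to the blog post.</p>",
)
print(post["approvedSource"] == post["approvedHtmlSanitized"])
```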


        I hope this helps :- ) & I've started looking a bit at what I need to do server side.

        1. KajMagnus @KajMagnus 2018-04-30 15:51:13.933Z

          Mentioning @detly. So you'll get a notification email and see my comment above.

          (About a week ago I changed the email notification sent-from address, but forgot to verify the new sent-from address, so no emails got sent :- P )