Skip to content

The project in which I explore archival data using different R libraries.

Notifications You must be signed in to change notification settings

kowalskihubert/R_data_exploration

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

R_data_exploration

The project in which I explore archival data using different R libraries.

How to use

Download and unzip Posts.csv.gz, Comments.csv.gz, Users.csv.gz

About the project

In this project I used data from https://archive.org/details/stackexchange. It's goal was to explore this data set using different libraries in R and compare the speed of the functions.

Technologies used:

  • R
  • sqldf
  • dplyr
  • data.table
  • microbenchmark

SQL queries

Query 1

SELECT Location, SUM(UpVotes) as TotalUpVotes FROM Users
  WHERE Location != ''
  GROUP BY Location
  ORDER BY TotalUpVotes DESC LIMIT 10

Query 2

SELECT STRFTIME('%Y', CreationDate) AS Year, STRFTIME('%m', CreationDate) AS Month, 
    COUNT(*) AS PostsNumber, MAX(Score) AS MaxScore
    FROM Posts
    WHERE PostTypeId IN (1, 2) 
    GROUP BY Year, Month 
    HAVING PostsNumber > 1000

Query 3

 SELECT Id, DisplayName, TotalViews 
  FROM (
      SELECT OwnerUserId, SUM(ViewCount) as TotalViews FROM Posts
      WHERE PostTypeId = 1
      GROUP BY OwnerUserId
    ) AS Questions 
  JOIN Users
  ON Users.Id = Questions.OwnerUserId 
  ORDER BY TotalViews DESC
  LIMIT 10

Query 4

SELECT DisplayName, QuestionsNumber, AnswersNumber, Location, Reputation, UpVotes, DownVotes 
    FROM (
      SELECT * FROM (
          SELECT COUNT(*) as AnswersNumber, OwnerUserId 
          FROM Posts
          WHERE PostTypeId = 2
          GROUP BY OwnerUserId
        ) AS Answers 
      JOIN
      (
          SELECT COUNT(*) as QuestionsNumber, OwnerUserId 
          FROM Posts
          WHERE PostTypeId = 1
          GROUP BY OwnerUserId
      ) AS Questions
      ON Answers.OwnerUserId = Questions.OwnerUserId 
      WHERE AnswersNumber > QuestionsNumber
      ORDER BY AnswersNumber DESC
      LIMIT 5
  ) AS PostsCounts 
  JOIN Users
  ON PostsCounts.OwnerUserId = Users.Id

Query 5

SELECT Title, CommentCount, ViewCount, CommentsTotalScore, DisplayName, Reputation, Location 
    FROM (
        SELECT Posts.OwnerUserId, Posts.Title, Posts.CommentCount, Posts.ViewCount, CmtTotScr.CommentsTotalScore
        FROM (
              SELECT PostId, SUM(Score) AS CommentsTotalScore 
              FROM Comments
              GROUP BY PostId
            ) AS CmtTotScr
        JOIN Posts ON Posts.Id = CmtTotScr.PostId 
        WHERE Posts.PostTypeId=1
      ) AS PostsBestComments
    JOIN Users 
    ON PostsBestComments.OwnerUserId = Users.Id 
    ORDER BY CommentsTotalScore DESC
    LIMIT 10

About

The project in which I explore archival data using different R libraries.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages