Position Paper: Experiences on Clustering
High-Dimensional Data Using pbdR
Author/Presenters
Event Type
Workshop
Software Engineering
Time
Sunday, November 12th, 3:30pm - 3:45pm
Location
501
Description
Motivation: Software engineering for HPC environments in general, and for big data in particular, faces a set of unique challenges, including the high complexity of middleware and of computing environments. Tools that make it easier for scientists to use HPC are, therefore, of paramount importance. We provide an experience report on using one such highly effective middleware, pbdR, which allows scientists to use the R programming language without, at least nominally, having to master many layers of HPC infrastructure, such as OpenMPI and ScaLAPACK.
Objective: To evaluate the extent to which middleware helps improve scientist productivity, we use pbdR to solve a real problem that we, as scientists, are investigating. Our big data comes from the commits on GitHub and other project hosting sites, and we are trying to cluster developers based on the text of these commit messages.
Context: We need to be able to identify the developer for every commit and to identify all commits by a single developer. Developer identifiers in the commits, such as login, email, and name, are often spelled in multiple ways, since that information may come from different version control systems and may depend on which computer is used.
Method: We train a Doc2Vec model where existing credentials are used as a document identifier and then use the resulting 200-dimensional vectors for the 2.3M identifiers to cluster these identifiers so that each cluster represents a specific individual.
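The clustering step can be illustrated with a small sketch. The abstract does not say which clustering algorithm is used, so the example below swaps in a simple stand-in: identifiers whose vectors have cosine similarity above a threshold are merged into one group (single-linkage via union-find). The function names, the threshold of 0.9, and the three-dimensional toy vectors are all assumptions made for illustration; the real vectors are 200-dimensional and come from the trained Doc2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def cluster_identifiers(vectors, threshold=0.9):
    """Group identifiers whose vectors are similar (hypothetical helper).

    vectors: dict mapping an identifier string to its embedding.
    Returns a sorted list of clusters, each a sorted list of identifiers.
    """
    ids = sorted(vectors)
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every pair of identifiers above the similarity threshold.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    parent[root_b] = root_a

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy stand-ins for Doc2Vec output (assumed values, not real data).
toy_vectors = {
    "jsmith <jsmith@example.com>": [0.90, 0.10, 0.00],
    "John Smith <john.smith@example.org>": [0.88, 0.12, 0.02],
    "adoe <adoe@example.com>": [0.00, 0.20, 0.95],
}
clusters = cluster_identifiers(toy_vectors)
for members in clusters:
    print(members)
```

The all-pairs loop is quadratic in the number of identifiers, so it is only practical on toy inputs. At 2.3M identifiers the same idea needs a distributed or approximate approach, which is the kind of workload that motivates HPC middleware such as pbdR.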




